MTP: Multi-Hypothesis Tracking and Prediction for Reduced Error Propagation

by Xinshuo Weng, et al.
Stanford University

Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, less attention has been given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses cascading errors by focusing on the coupling between the tracking and prediction modules. First, using state-of-the-art tracking and prediction tools, we conduct a comprehensive experimental evaluation of how severely errors stemming from tracking can impact prediction performance. On the KITTI and nuScenes datasets, we find that predictions consuming tracked trajectories as inputs (the typical case in practice) can experience a significant (even order-of-magnitude) drop in performance in comparison to the idealized setting where ground-truth past trajectories are used as inputs. To address this issue, we propose a multi-hypothesis tracking and prediction framework. Rather than relying on a single set of tracking results for prediction, our framework simultaneously reasons about multiple sets of tracking results, thereby increasing the likelihood of including accurate tracking results as inputs to prediction. We show that this framework improves overall prediction performance over the standard single-hypothesis tracking-prediction pipeline by up to 34.2%, with even more significant improvements (up to 70%) when restricting the evaluation to challenging scenarios involving identity switches and fragments – all with an acceptable computation overhead.




I Introduction

Multi-object tracking and trajectory prediction are critical components in modern autonomy stacks. For example, in autonomous driving applications, the outputs of these components are used by the planning module to compute safe and efficient trajectories. Multi-object tracking (MOT) [1, 2, 3, 4, 5, 6, 7, 8, 9] and prediction [10, 11, 12, 13, 14, 15, 16, 17, 18] typically follow a cascaded pipeline, where tracking is performed first to produce past tracklets, followed by a prediction module in charge of predicting other agents’ future trajectories. Although such a modularization eases development and improves scalability and interpretability, it also gives rise to significant integration challenges, with cascading errors being a key concern; for example, a tracking error such as an identity switch can cause a substantial prediction error, as shown in Figure 1 (left).

Perhaps surprisingly, the severity of such cascading errors has been relatively under-explored. Indeed, most works on trajectory prediction consider the unrealistic setting whereby the prediction module consumes ground truth (GT) past trajectories as inputs, as opposed to tracklets produced by tracking. In this work, by applying state-of-the-art tracking and prediction methods on the nuScenes [19] and KITTI [20] datasets, we find that predictions consuming tracklets as inputs experience a significant performance drop in comparison to the idealized setting where GT past trajectories are used as inputs. Also, if we restrict the evaluation to challenging scenarios involving tracking errors (which are quite frequent, as we will show), prediction errors increase by up to 28.2× on KITTI and 17.6× on nuScenes. The reason for such a significant performance drop is that tracking errors such as identity switches typically induce velocity/orientation estimation errors persisting for a few frames, which can have a detrimental impact on prediction accuracy.

Fig. 1: (Left) Using past tracklets as inputs to prediction can cause substantial prediction errors. (Right) By simultaneously reasoning about multiple sets of tracklets via multi-hypothesis data association, one can account for tracking errors and significantly improve prediction performance.

To address the above issue, we propose a Multi-hypothesis Tracking and Prediction (MTP) framework that uses multi-hypothesis data association to output multiple sets of tracklets as tracking results. Then, these sets of tracklets are used as inputs to the prediction module. The key idea is simple: by simultaneously reasoning about multiple sets of tracklets, the likelihood of including accurate tracklets as inputs to prediction is increased (Figure 1 (right)). Note that this is different from the standard tracking-prediction pipeline in [21, 18, 1, 2, 3, 4, 5, 6, 7, 8, 9], where only a single set of tracklets is produced by tracking. In this case, if the past tracklet is off for an object, the prediction might be completely off.

Our MTP framework is inspired by the prediction-planning pipeline, where the prediction network typically predicts multiple sets of future trajectories, referred to as trajectory samples in VAE [13, 15, 16] or GAN-based [10, 11, 12] methods. By reasoning about multiple trajectory samples, the likelihood of considering an accurate prediction is higher, thereby enabling a higher level of safety in planning [22]. MTP exploits a similar idea, whereby multiple sets of tracklets are generated to improve downstream prediction performance. Through experiments on KITTI and nuScenes, we show that the MTP framework improves overall prediction performance (up to 34.2% on the nuScenes dataset), with even more significant improvements (up to 70%) when restricting the evaluation to challenging scenarios involving tracking errors. The MTP framework naturally incurs a computation overhead with respect to its single-hypothesis counterpart, but we show that, fortunately, this overhead is acceptable and still compatible with real-time applications.

The contributions of this paper are threefold: First, we provide a comprehensive experimental assessment of the impact of tracking errors on prediction performance. Second, we propose the MTP framework, which aims at reducing error propagation between MOT and prediction by simultaneously reasoning about multiple sets of tracking results. Third, we thoroughly evaluate the performance of MTP both in terms of prediction accuracy and runtime performance.

II Related Work

3D Multi-Object Tracking.

Recent online 3D MOT methods often follow a tracking-by-detection pipeline with two steps: (1) given trajectories associated up to the last frame and detections in the current frame, an affinity matrix is computed, where each entry represents the similarity between a past trajectory and a current detection; (2) given the affinity matrix, the Hungarian algorithm is used to obtain a locally optimal matching, which entails a hard assignment of each current detection to a past trajectory, so that trajectories can be updated to the current frame. Though significant progress has been made recently on the first step, for example by improving affinity matrix estimation using Graph Neural Networks [24, 9, 25] and multi-modal feature learning [2, 1], the second step has largely remained the same. In other words, modern 3D MOT methods typically generate a single set of trajectories via the Hungarian algorithm at inference time, which induces tracking errors that can be detrimental to prediction.
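To make the two-step recipe concrete, here is a minimal sketch of a single association step. For brevity it uses Euclidean center distance as the cost, standing in for the richer IoU-based or learned affinities used by the cited trackers; the box centers are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical last-frame centers of 3 past trajectories and 3 current detections.
track_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
det_centers = np.array([[0.2, 0.1], [5.1, -0.1], [0.1, 5.2]])

# Step 1: affinity/cost matrix. Each entry is the distance between a past
# trajectory's last position and a current detection's position.
cost = np.linalg.norm(track_centers[:, None, :] - det_centers[None, :, :], axis=-1)

# Step 2: Hungarian algorithm -> one hard, locally optimal assignment.
row, col = linear_sum_assignment(cost)
matches = list(zip(row.tolist(), col.tolist()))
print(matches)  # each past trajectory is paired with exactly one detection
```

The hard assignment in step 2 is exactly what MHDA (Section IV) relaxes by keeping alternative assignments alive.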

Multi-Hypothesis Data Association. To improve single-hypothesis MOT, a natural approach is to leverage multi-hypothesis data association (MHDA). The idea is to maintain multiple hypotheses and delay making hard assignments, so that ambiguity in data association can be resolved in later frames. MHDA was popular in the 1990s and was successfully applied to MOT [26, 27, 28] and SLAM [29, 30]. However, at the time when MHDA was being actively developed, the topic of trajectory prediction was still in its infancy. To the best of our knowledge, our work is the first to adopt MHDA to improve downstream prediction.

Trajectory Prediction. There has been significant progress on trajectory prediction recently, including [31, 32, 33, 10, 11, 13, 34, 35, 36]. Yet, almost invariably, such works study the prediction task separately from the 3D MOT task. Specifically, they consider GT past trajectories as inputs to prediction, with no direct accounting of tracking errors. Characterizing and mitigating the propagation of tracking errors to prediction is indeed the key motivation of this paper.

Tracking-Prediction Integration. A few works have attempted to better couple the MOT and prediction tasks. In end-to-end detection and prediction [21], the MOT and prediction networks are jointly optimized, which increases performance. Yet, it is still a cascaded, single-hypothesis pipeline, and thus prone to predictions being thrown off by tracking errors. In parallelized tracking and prediction [18], a two-branch tracking and prediction network is proposed. Although this method prevents error propagation in the current frame (tracking results in the current frame are not fed into the prediction branch), it cannot do so for the next window of prediction. This is because the method in [18] also uses the Hungarian algorithm to generate a single set of tracklets at the current frame, which can easily lead to tracking errors being propagated to the next window of prediction. In contrast, we replace the Hungarian algorithm with MHDA in the tracking assignment phase, thereby preventing a hard assignment from removing plausible alternative hypotheses. We will show that this idea is quite effective.

Finally, a concurrent and unpublished work [37] has also recognized the importance of understanding how tracking errors can impact prediction. Our paper provides a comprehensive quantitative analysis that corroborates the qualitative findings in [37], and we propose to leverage MHDA to account more robustly for tracking errors, while the solution method in [37] is still single-hypothesis-based.

Fig. 2: The most typical tracking errors are: identity switches (top left), wrongly-tracked fragments (bottom left), under-tracked fragments (center), and spurious tracks (right).

III How Do Tracking Errors Affect Prediction?

In this section, we experimentally study to what extent tracking errors can impact prediction performance. We start by reviewing typical tracking errors. We then outline our methodology, present qualitative and quantitative results, and finally characterize how frequently such errors can arise.

III-A Three key types of tracking errors

Identity Switches (IDS) happen if a GT trajectory is matched with two or more different tracklets. For example, as shown in Figure 2 (top left), the two black GT trajectories are erroneously matched with half of the green and half of the orange tracklets, with a switch in the middle. Such an IDS can happen when two GT trajectories are very close and/or cross each other. IDS can cause large prediction errors as they induce large linear/angular velocity estimation errors, usually persisting for a few frames after the IDS event.

Fragments (FRAG) refer to GT trajectories that do not closely match any tracklet, either because of a wrong association (wrongly-tracked FRAG, Figure 2 (bottom left)) or because the detector misses the object in later frames (under-tracked FRAG, Figure 2 (center)).

Spurious Tracks are tracklets that are complete false positives, that is, they do not correspond to any GT trajectory (Figure 2 (right)). Spurious tracklets do not affect the recall of the predictions, but they do lower the precision.

III-B Assessment methodology

In our evaluation, we apply state-of-the-art methods for 3D MOT and prediction, namely AB3DMOT [3] for MOT and PTP [18] for prediction (we use the prediction branch of PTP [18] while obtaining tracking results from [3], so as to replicate the standard tracking-prediction pipeline), on two standard autonomous driving datasets: KITTI [20] and nuScenes [19].

KITTI. We use the tracking validation set (see [38] for the split). We predict 10 future frames using 10 past frames, i.e., 1 second with an FPS of 10. We consider three main object classes labeled in KITTI, i.e., cars, pedestrians, and cyclists.

nuScenes. We follow the standard nuScenes prediction challenge guidelines [39]. Specifically, we use the prediction test set (see nuScenes code [40] for the split) for evaluation. We consider vehicle classes, namely car, truck, van, trailer, bus, and construction vehicle. We predict 12 future frames using 4 past frames, i.e., the past 2 seconds with an FPS of 2.

Evaluation. We use the standard best-of-k Average Displacement Error (ADE) and best-of-k Final Displacement Error (FDE) [10] to evaluate prediction, referred to as minADE and minFDE (the ADE is the mean distance between the predicted and GT trajectories; the FDE is the distance between the predicted and GT final positions at the end of the prediction horizon). While using k=20 is standard on KITTI in prior work [18], it is common to use a smaller value of k on nuScenes, such as 5 or 10 (see the nuScenes leaderboard [39]). Thus, we use k=10 on nuScenes and k=20 on KITTI.
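The best-of-k metrics can be sketched as follows. This is a minimal version assuming 2D positions and k candidate trajectories per object; the trajectories are hypothetical.

```python
import numpy as np

def min_ade_fde(preds, gt):
    """preds: (k, T, 2) candidate future trajectories; gt: (T, 2) ground truth.
    Returns best-of-k ADE (mean per-step distance) and FDE (final-step distance),
    each minimized independently over the k samples."""
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (k, T) per-step distances
    ade = dists.mean(axis=1)                           # per-sample ADE
    fde = dists[:, -1]                                 # per-sample FDE
    return ade.min(), fde.min()

gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
preds = np.array([
    [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]],  # offset by 1 m at every step
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.5]],  # accurate except the final step
])
min_ade, min_fde = min_ade_fde(preds, gt)
print(min_ade, min_fde)  # 1/6 ≈ 0.167 (sample 2) and 0.5
```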

To quantify the impact of tracking errors on prediction, a tracking evaluation is needed to find which objects at which frames experience IDS/FRAG errors. To that end, we use the standard 3D tracking evaluation code released in [3], which matches tracked objects with the GT at every frame to determine: (1) whether the identity of the GT trajectory that a tracked object is matched to changes (IDS); (2) whether there is a GT object not matched with any tracked object (FRAG); and (3) whether there is a tracklet not matched with any GT within a threshold (spurious track). For the matching threshold, we use a 3D Intersection over Union (IoU) of 0.5 on KITTI and a 2D center distance of 2 meters on nuScenes, both of which are standard choices [3, 19]. (The IoU of two boxes is the volume – or area, in 2D – of their intersection divided by that of their union; it ranges from 0 to 1 and equals 1 when the boxes exactly overlap.)
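As a small illustration of the matching criterion, below is the IoU computation for 2D axis-aligned boxes; the KITTI evaluation above uses its 3D counterpart, with volumes in place of areas. The boxes are hypothetical.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou_2d((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes -> 1.0
print(iou_2d((0, 0, 2, 2), (1, 0, 3, 2)))  # half-overlapping boxes -> 1/3
```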

Fig. 3: IDS causes large prediction errors. (Top): We show predictions on two frames when using GT past trajectories as inputs. Predictions for the yellow object (ID 63) are accurate in both frames. (Bottom): We show predictions on the same two frames, but now using past tracklets as inputs. Due to an IDS error, the object ID is switched from 4967 to 5002, resulting in a velocity estimation error that thwarts predictions.
Fig. 4: Wrongly-tracked FRAG causes prediction errors. (Top): When using GT past trajectories as inputs, predictions for the yellow object are accurate. (Bottom): We show predictions for the same object on the same two frames, but now using past tracklets as inputs. As the detected box of the purple object on the right figure is off by more than 2m from the GT, it causes a wrongly-tracked FRAG error that thwarts predictions.
Fig. 5: Under-tracked FRAG causes missing predictions. (Top): We show predictions on two frames when using GT past trajectories as inputs. As GT past trajectories are accurate and stable, predictions are accurate. (Bottom): We show predictions for the same two objects but now using past tracklets as inputs. As the two objects are missing in the past tracklets (one object is missed in the 2nd frame and the other object is missed in two frames), predictions for these two objects are also missing.

III-C Qualitative assessment

Leveraging the methodology outlined in Section III-B, we provide some qualitative insights on the impact of tracking errors on prediction performance.

IDS causes large prediction errors. As shown in Figure 3 (top), as long as we use GT past trajectories as inputs, predictions are accurate. However, when using past tracklets as inputs, particularly when there is an IDS as shown in Figure 3 (bottom), predictions are off due to the sudden (and erroneous) velocity estimation change in the past tracklet.
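To see mechanically why an IDS thwarts prediction, consider a minimal constant-velocity forecaster (a stand-in for the learned predictor; the positions are hypothetical): a single identity switch in the last observed frame corrupts the velocity estimate and therefore the entire forecast.

```python
import numpy as np

def constant_velocity_forecast(track, horizon):
    """Extrapolate the velocity between the last two observed positions."""
    v = track[-1] - track[-2]
    return np.array([track[-1] + (i + 1) * v for i in range(horizon)])

# GT past: an agent moving steadily along +x at 1 m/frame.
gt_past = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

# Tracklet with an IDS: the last frame jumps to a nearby agent at y = 4.
ids_past = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 4.0]])

clean = constant_velocity_forecast(gt_past, horizon=3)
broken = constant_velocity_forecast(ids_past, horizon=3)
print(clean[-1])   # forecast continues along +x: [6. 0.]
print(broken[-1])  # spurious 4 m/frame y-velocity persists: [ 6. 16.]
```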

Wrongly-tracked FRAG can also cause errors. Similar to the case of IDS, a wrongly-tracked FRAG also causes prediction errors. When comparing predictions in Figure 4 top (with GT past trajectories as inputs) and bottom (with past tracklets as inputs), one can see that, when the object’s past tracklet is slightly off, the corresponding predictions are also off due to the orientation change.

Under-tracked FRAG causes missing predictions. Different from the above two cases (which lead to inaccurate predictions), an under-tracked FRAG causes missing predictions, as there is no past tracklet used as inputs to prediction after the FRAG event. As shown in Figure 5, predictions can be missed for objects with under-tracked FRAG errors.

Spurious tracks cause false positives. In contrast to under-tracked FRAG, spurious tracks cause predictions that are not supposed to exist, that is, predictions for ghost objects.

Conclusion. We can see that all tracking errors (IDS, FRAG, and spurious tracks) can lead to prediction errors. In particular, IDS, wrongly-tracked FRAG, and spurious tracks can reduce the precision of the predictions, while under-tracked FRAG and wrongly-tracked FRAG can lead to a lower recall.

Datasets  Eval. Targets (# of obj)  Inputs to Prediction  minADE  minFDE
KITTI     Objects with IDS (33)     GT past trajectories   0.100   0.171
          Objects with IDS (33)     past tracklets         2.820   4.514
          Objects with FRAG (330)   GT past trajectories   0.177   0.306
          Objects with FRAG (330)   past tracklets         1.621   2.155
nuScenes  Objects with IDS (4160)   GT past trajectories   0.473   0.825
          Objects with IDS (4160)   past tracklets         8.345  13.892
          Objects with FRAG (3365)  GT past trajectories   0.621   1.108
          Objects with FRAG (3365)  past tracklets        14.520  21.815
TABLE I: Prediction performance (minADE / minFDE) for objects with tracking errors.

III-D Quantitative assessment

In this section, we quantify the impact of IDS/FRAG errors in terms of minADE and minFDE (we do not consider spurious tracks, as in such cases there is no corresponding GT that can be used to compute ADE/FDE). In particular, for those instances containing IDS/FRAG errors, we compare prediction performance stemming from using GT past trajectories as inputs with prediction performance stemming from using past tracklets as inputs. The results are shown in Table I. One can observe that when we replace GT past trajectories with past tracklets as inputs, there is a significant performance drop. In particular, on KITTI IDS instances, there is a 28× drop from 0.100 to 2.820 in minADE, and on nuScenes IDS instances, there is an 18× drop from 0.473 to 8.345 in minADE. Similar performance drops are observed on FRAG instances. Such performance drops are in agreement with our qualitative findings in Section III-C and make predictions arguably almost useless for these objects.

Fig. 6: (Left): IDS/FRAG frequency. On average, every object trajectory may experience a FRAG error on KITTI, and an IDS and/or FRAG error on nuScenes. (Right): We plot the distribution of distances of erroneously tracked objects with respect to the ego-vehicle on nuScenes. About 200 such objects are close to the ego-vehicle (within 5 m), that is, at distances to which planning is typically very sensitive.

III-E Frequency and spatial distribution assessment

Sections III-C and III-D characterize and quantify how tracking errors can negatively impact prediction performance. But how often do tracking errors happen? And how far are the objects affected by tracking errors from the ego vehicle (that is, do tracking errors also happen for objects very close to the ego vehicle, to which the planning module is most sensitive)? Accordingly, we characterize the frequency and spatial distribution of tracking errors below:

IDS/FRAG frequency. IDS/FRAG cases are indeed quite common. As shown in Figure 6 (left), on average every trajectory in both the KITTI and nuScenes datasets can yield a FRAG, IDS, or both! The frequency of tracking errors, coupled with their negative impact on prediction (Sections III-C and III-D), provides a strong motivation for developing systematic approaches that account for tracking errors for the purposes of robust prediction (and planning).

IDS/FRAG spatial distribution. Planning, in general, is most sensitive to nearby objects. To understand at a conceptual level whether tracking errors can induce erroneous predictions that in turn can thwart planning, we compute the distance from the ego-vehicle of objects experiencing erroneous tracklets. The results are reported in the histogram in Figure 6 (right). One can see that there is a non-negligible number of IDS/FRAG instances where the tracked object is very close to the ego vehicle. In particular, there are about 1000 IDS/FRAG instances for objects within 15 meters, 500 for objects within 10 meters, and 200 for objects within 5 meters. Note that these errors are computed on the nuScenes prediction test set, which only contains 0.83 hours of driving (150 sequences with 40 frames per sequence at an FPS of 2, for a total of 3000 seconds ≈ 0.83 hours). Thus, we argue that tracking errors can severely hinder safe planning (future research will assess this statement more formally, for example by using planning-aware prediction metrics [41]).

IV Multi-Hypothesis Tracking and Prediction

To account for the impact of tracking errors on prediction performance, we propose the MTP framework which is visualized in Figure 7. The tracking-prediction pipeline in MTP is relatively standard, in terms of its modularity and sequence of operations (namely, MOT followed by a prediction module). The two key modules introduced are the MHDA and trajectory sampling, as described below:

Multi-Hypothesis Data Association (MHDA). The key idea is to reason about multiple hypotheses simultaneously, with the goal of increasing the likelihood of including accurate tracklets that can be used as inputs to downstream prediction. That is, instead of relying on a hard assignment via the Hungarian algorithm, we use MHDA to enlarge the search space and generate H sets of plausible tracklets. Explicitly, we use Murty's H-best assignment [28], which maintains H sets of tracking results at every frame, where each set is referred to as a hypothesis. Typically, the 1st hypothesis is obtained using the Hungarian algorithm by considering the lowest cost, which results in a list of matches between detections and trajectories. To obtain the other hypotheses, we tweak the list of matches in the 1st hypothesis by toggling one match at a time in and out of the list, which results in slightly higher costs. After sorting the other hypotheses by cost, the 2nd hypothesis has the 2nd-lowest cost, and so on. In case the 1st hypothesis is erroneous, other hypotheses with slightly higher costs may correspond to a correct association – thus, by reasoning about multiple hypotheses, the likelihood of retaining accurate tracking results is increased. Each hypothesis (a set of tracklets) is then fed into the prediction module as per Figure 7.
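A minimal sketch of this H-best association is given below. It implements only a single partition level rather than Murty's full recursion, uses a large finite constant to stand in for a forbidden match, and takes a toy cost matrix as input.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def h_best_hypotheses(cost, H):
    """Return up to H assignments sorted by total cost. The 1st hypothesis is
    the Hungarian solution; alternatives are obtained by toggling one of its
    matches out (forbidding it) and re-solving. (Murty's algorithm [28]
    recurses on each partition; one level suffices to illustrate the idea.)"""
    BIG = 1e9  # large finite "forbidden" cost, solvable by any solver version
    row, col = linear_sum_assignment(cost)
    first = tuple(zip(row.tolist(), col.tolist()))
    hyps = {first: cost[row, col].sum()}
    for r, c in first:
        tweaked = cost.copy()
        tweaked[r, c] = BIG  # toggle this match out of the list
        rr, cc = linear_sum_assignment(tweaked)
        if tweaked[rr, cc].sum() < BIG:  # skip if the forbidden match was forced
            hyps.setdefault(tuple(zip(rr.tolist(), cc.tolist())),
                            cost[rr, cc].sum())
    # The 2nd hypothesis has the 2nd-lowest cost, and so on.
    return sorted(hyps.items(), key=lambda kv: kv[1])[:H]

# Toy 3x3 cost: rows are past trajectories, columns are current detections.
cost = np.array([[1.0, 4.0, 5.0],
                 [4.0, 1.0, 5.0],
                 [5.0, 5.0, 1.0]])
hyps = h_best_hypotheses(cost, H=3)
for assignment, total in hyps:
    print(assignment, total)
```

Each returned assignment would then be rolled into its own set of tracklets and passed to the prediction module.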

Trajectory Sampling. Once we obtain predictions by using each hypothesis as input, we sample a subset of the full set of predictions by computing cluster centers using K-Means++ [42], resulting in a diverse set of predictions. The trajectory sampling step is optional, but it is useful to limit an excessive number of prediction samples and lets us carry out a fair comparison with single-hypothesis prediction methods. For example, if we use H=10 hypotheses and generate 20 prediction samples for each hypothesis, there will be 200 samples for each object. In this case, we would sample only 20 of the 200 to carry out a fair comparison with single-hypothesis prediction methods using 20 samples.
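The sampling step can be sketched as follows, assuming pooled samples of shape (N, T, 2) and using SciPy's k-means with k-means++ seeding. Returning, per cluster, the actual sample nearest the centroid keeps the output a subset of real predictions rather than synthetic averages; the pooled samples below are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def sample_diverse_predictions(preds, n_keep):
    """preds: (N, T, 2) prediction samples pooled over all hypotheses.
    Cluster with k-means++ and keep, per cluster, the sample nearest its
    centroid, yielding a diverse subset of at most n_keep predictions."""
    flat = preds.reshape(len(preds), -1)
    centroids, labels = kmeans2(flat, n_keep, minit='++')
    keep = []
    for j in range(n_keep):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue  # rare empty cluster under k-means++ seeding
        dists = np.linalg.norm(flat[members] - centroids[j], axis=1)
        keep.append(members[dists.argmin()])
    return preds[np.sort(keep)]

# Hypothetical pooled samples: 40 short trajectories around 4 distinct motions.
rng = np.random.default_rng(0)
modes = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
preds = np.repeat(modes, 10, axis=0)[:, None, :] + rng.normal(0, 0.1, (40, 3, 2))
subset = sample_diverse_predictions(preds, n_keep=4)
print(subset.shape)  # at most 4 diverse trajectories of shape (3, 2)
```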

Conclusion. In summary, our MTP framework can be applied to any tracking-prediction pipeline that is based on single-hypothesis matching – the main modification is to replace the matching algorithm with MHDA.

Fig. 7: Proposed MTP framework. The two key modules introduced in MTP are highlighted in green and orange. We feed the affinity matrix to MHDA to obtain multiple sets of tracklets. Once predictions are performed on all sets of tracklets, we sample a subset of predictions as final results.

V Experiments

As MTP is designed to improve prediction, we follow the standard prediction evaluation as described in Section III-B. For additional implementation details and hyper-parameters, we refer the reader to our code. Here, we categorize our prediction experiments into: 1) targeted evaluation, which analyzes prediction performance for objects affected by tracking errors, and 2) global evaluation, which analyzes prediction performance across all tracked objects, whether or not they are affected by tracking errors. Also, we provide a runtime speed analysis. The key takeaway is that MTP improves both targeted and global prediction performance, with a relatively minor computation overhead.

Targeted evaluation. Results are provided in Table II in terms of minADE and minFDE. The first row of each block corresponds to the standard single-hypothesis tracking-prediction pipeline (AB3DMOT+PTP), which we refer to as STP. As shown in Section III-D, STP yields large prediction errors due to IDS/FRAG. Next, we see that MTP significantly improves prediction performance on both the KITTI and nuScenes datasets. Specifically, when using prediction samples from all hypotheses, we see a 4× minADE performance boost on KITTI IDS (i.e., from 2.820 to 0.707), a 19.5% minADE performance boost on KITTI FRAG, a 2.5× performance boost on nuScenes IDS, and a 2× performance boost on nuScenes FRAG. To compare MTP and STP under the same number of samples (and thus avoid giving MTP an unfair advantage with a larger number of samples), we apply trajectory sampling to MTP. Remarkably, one can see that prediction performance after sampling is only slightly lower (e.g., minADE rises from 0.707 to 0.747 on KITTI IDS), meaning that the proposed trajectory sampling scheme generally retains accurate tracklets. Also, as more hypotheses are used, better performance is achieved (compare the cases with H=5, H=10, and H=20). Importantly, even when using only 5 hypotheses, prediction performance is improved by 2.5× on KITTI IDS cases.

Finally, even though the minADE and minFDE metrics are not suitable to characterize how MTP improves prediction performance on spurious track instances, it is easy to argue why this is the case. Indeed, the likelihood of removing spurious tracks is increased under MTP, as different hypotheses have different matching results, and some hypotheses may not associate false positive detections to trajectories.

Global evaluation. Results are provided in Table III. Again, MTP largely improves performance over STP, e.g., minFDE from 0.278 to 0.238 (14.4% improvement) on KITTI, and minFDE from 3.819 to 2.512 (34.2% improvement) on nuScenes. The improvement on nuScenes is larger as there is a higher percentage of IDS/FRAG instances. In brief, MTP significantly improves both targeted and global prediction performance.

Note that although we follow the nuScenes evaluation protocol, the ADE/FDE numbers in Table III are not comparable to the numbers on the nuScenes leaderboard, as we consider past tracklets as inputs (as opposed to GT past trajectories, as is the case for the nuScenes leaderboard results).

Datasets  Targets  Methods                           minADE  minFDE
KITTI     IDS      STP, H=1, k=20                     2.820   4.514
                   MTP (Ours), H=5, k=100             1.099   1.768
                   MTP (Ours), H=10, k=200            0.844   1.332
                   MTP (Ours), H=20, k=400            0.707   1.093
                   MTP (Ours), H=5, k=20, sampling    1.118   1.802
                   MTP (Ours), H=10, k=20, sampling   0.876   1.390
                   MTP (Ours), H=20, k=20, sampling   0.747   1.173
KITTI     FRAG     STP, H=1, k=20                     1.621   2.155
                   MTP (Ours), H=5, k=100             1.436   1.862
                   MTP (Ours), H=10, k=200            1.385   1.765
                   MTP (Ours), H=20, k=400            1.305   1.627
                   MTP (Ours), H=5, k=20, sampling    1.448   1.888
                   MTP (Ours), H=10, k=20, sampling   1.404   1.801
                   MTP (Ours), H=20, k=20, sampling   1.335   1.688
nuScenes  IDS      STP, H=1, k=10                     8.345  13.892
                   MTP (Ours), H=10, k=100            4.143   6.464
                   MTP (Ours), H=20, k=200            3.321   5.052
                   MTP (Ours), H=10, k=10, sampling   4.573   7.303
                   MTP (Ours), H=20, k=10, sampling   3.923   6.210
nuScenes  FRAG     STP, H=1, k=10                    14.520  21.815
                   MTP (Ours), H=10, k=100            9.017  12.721
                   MTP (Ours), H=20, k=200            7.697  10.606
                   MTP (Ours), H=10, k=10, sampling   9.585  13.846
                   MTP (Ours), H=20, k=10, sampling   8.476  12.105
TABLE II: Prediction performance (minADE / minFDE) for objects with IDS/FRAG.

Tracking Error Statistics. To gain insights on why MTP improves prediction performance, we show an intermediate tracking error analysis in Figure 8 for IDS/FRAG instances on nuScenes and KITTI. Specifically, we plot the distribution over frames of IDS/FRAG instances for STP on the left, and the distribution over frames of IDS/FRAG instances present in all of the hypotheses for MTP on the right. One can notice that a large portion of tracking errors in STP does not exist in at least one of the hypotheses being considered by MTP. If we count the number of FRAG/IDS over all frames, the 33 IDS instances experienced by STP on KITTI are reduced to only 9 that are shared by all 20 hypotheses under MTP (72.7% reduction), and the 7083 IDS instances experienced by STP on nuScenes are reduced to only 2835 that are shared by all 20 hypotheses under MTP (60.0% reduction). FRAG errors are reduced by a similar amount. This provides strong justification for the inclusion of MHDA in a tracking-prediction pipeline.

Datasets  Methods                           minADE  minFDE
KITTI     STP, H=1, k=20                     0.185   0.278
          MTP (Ours), H=5, k=100             0.163   0.235
          MTP (Ours), H=10, k=200            0.152   0.215
          MTP (Ours), H=20, k=400            0.146   0.203
          MTP (Ours), H=5, k=20, sampling    0.170   0.252
          MTP (Ours), H=10, k=20, sampling   0.164   0.240
          MTP (Ours), H=20, k=20, sampling   0.162   0.238
nuScenes  STP, H=1, k=10                     2.320   3.819
          MTP (Ours), H=10, k=100            1.498   2.293
          MTP (Ours), H=20, k=200            1.325   1.979
          MTP (Ours), H=10, k=10, sampling   1.691   2.692
          MTP (Ours), H=20, k=10, sampling   1.585   2.512
TABLE III: Prediction performance (minADE / minFDE) for all objects being predicted.
Fig. 8: We plot the distribution of IDS/FRAG errors over frames for the STP (left) and MTP (right). It is clear that tracking errors are largely reduced when considering all 20 hypotheses in MTP on both KITTI and nuScenes.
            STP (H=1)  MTP (H=5)  MTP (H=10)  MTP (H=20)
3D MOT          207.4       65.2        24.6         8.1
Prediction        6.5        6.5         6.3         5.8
TABLE IV: Runtime speed for tracking and prediction on KITTI (FPS).

Runtime Speed. MHDA unavoidably introduces a computation overhead, which we characterize in terms of FPS on the KITTI dataset in Table IV. As expected, tracking takes longer as H increases. The good news is that, even with H=20, tracking runtime is still acceptable (8.1 FPS, near real-time) without requiring a GPU implementation; this can be attributed to the excellent speed of [3]. Interestingly, there is nearly no runtime degradation for prediction as H increases. This is because predictions for different hypotheses are completely independent, so one can easily run them in parallel, although more GPU memory is needed (1.8 GB per hypothesis is required by [18]). Runtime for K-Means++ sampling is negligible, so it is not included in the table.

Methods                     Parameters             minADE   minFDE
STP, [43] + [44]            H=1, k=1               1.024    1.335
Re-tracking, [37]           H=1, k=1               0.893    1.241
MTP (Ours)                  H=20, k=1, sampling    0.852    1.086
MTP (Ours) + Re-tracking    H=20, k=1, sampling    0.770    0.967
TABLE V: Global prediction evaluation on WILDTRACK.

Comparison with Concurrent Work [37]. As discussed in Section II, an unpublished work [37] proposed a single-hypothesis-based re-tracking solution to mitigate the impact of tracking errors on prediction. Although [37] has open-sourced its code, a direct comparison between MTP and [37] is not immediate, as [37] is implemented on a 2D (rather than 3D, as in this paper) detection, tracking, and prediction pipeline: it uses MaskRCNN [45] for detection in images, SORT [43] for 2D MOT, and Social-LSTM [44] for bird's-eye-view trajectory prediction. The method is evaluated on a multi-camera dataset, WILDTRACK [46]. To ensure a fair comparison, we add our MHDA and trajectory sampling modules to the SORT+Social-LSTM pipeline implemented by [37] and carry out the evaluation on WILDTRACK. Note that, as [37] had not released detection results by the time this paper was submitted, we use the Detectron2 implementation of MaskRCNN with an X101-FPN backbone [47] to generate detections. Prediction results are shown in Table V in terms of minADE and minFDE; since Social-LSTM is a deterministic prediction approach (k=1), these reduce to ADE and FDE. Both the MTP and re-tracking [37] approaches improve over STP when using past tracklets as inputs, with MTP showing a slightly larger improvement. Importantly, the single-hypothesis-based re-tracking approach and the MTP framework are complementary: combining the two (last row of Table V) further improves prediction performance.
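For reference, the minADE/minFDE metrics reported throughout can be sketched as follows; under MTP the minimum is taken over all candidate trajectories pooled across hypotheses (the array shapes are our assumption):

```python
import numpy as np

def min_ade_fde(pred, gt):
    """minADE / minFDE over a set of candidate future trajectories.

    pred: (S, T, 2) array of S candidate trajectories over T future
          timesteps (S = H * k under MTP, pooling all hypotheses'
          samples; S = k under STP; S = 1 for deterministic models,
          where these reduce to plain ADE/FDE).
    gt:   (T, 2) ground-truth future trajectory.
    """
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (S, T) per-step errors
    min_ade = dists.mean(axis=1).min()                # best time-averaged error
    min_fde = dists[:, -1].min()                      # best final-step error
    return min_ade, min_fde
```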

VI. Conclusions

In this paper, we studied how tracking errors can impact prediction performance via qualitative and quantitative analyses. These analyses led to the design of the MTP framework, which simultaneously reasons about multiple sets of tracking results in order to account for tracking errors. We demonstrated how MTP significantly improves prediction performance, particularly in those instances containing tracking errors – all for a relatively minor computation overhead.

This work opens up a number of future research directions. First, it is of interest to better understand how to optimally choose the number of hypotheses H and the number of trajectory samples k as a function of computational requirements and target operational design domains. Second, it is of interest to extend our analysis by considering additional methods for MOT and prediction, and planning-aware evaluation metrics (quantitatively assessing how tracking errors ultimately impact planning). Third, the MTP framework is quite general and can be augmented with other techniques aimed at mitigating the propagation of tracking errors. This provides an exciting opportunity to make predictions even more robust to tracking errors. Finally, we plan to study how errors and uncertainty propagate across other modules within the autonomy stack, with the ultimate goal of devising more robust autonomy stacks.


  • [1] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust Multi-Modality Multi-Object Tracking,” IEEE Int. Conf. on Computer Vision, 2019.
  • [2] A. Kim, A. Osep, and L. Leal-Taixé, “EagerMOT: 3D Multi-Object Tracking via Sensor Fusion,” Proc. IEEE Conf. on Robotics and Automation, 2021.
  • [3] X. Weng, J. Wang, D. Held, and K. Kitani, “3D Multi-Object Tracking: A Baseline and New Evaluation Metrics,” IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, 2020.
  • [4] S. Wang, Y. Sun, and M. Liu, “PointTrackNet: An End-to-End Network for 3D Object Detection and Tracking from Point Cloud,” Proc. IEEE Conf. on Robotics and Automation, 2020.
  • [5] N. Benbarka, J. Schroder, and A. Zell, “Score refinement for confidence-based 3D multi-object tracking,” IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, 2021.
  • [6] Y. Wang, K. Kitani, and X. Weng, “Joint Object Detection and Multi-Object Tracking with Graph Neural Networks,” Proc. IEEE Conf. on Robotics and Automation, 2021.
  • [7] J. Pöschmann, T. Pfeifer, and P. Protzel, “Factor Graph based 3D Multi-Object Tracking in Point Clouds,” IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, 2020.
  • [8] X. Guo and K. Huang, “3D Object Detection and Tracking on Streaming Data,” Proc. IEEE Conf. on Robotics and Automation, 2020.
  • [9] X. Weng, Y. Wang, Y. Man, and K. Kitani, “GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning,” IEEE Conf. on Computer Vision and Pattern Recognition, 2020.
  • [10] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
  • [11] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, S. H. Rezatofighi, and S. Savarese, “Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks,” Conference on Neural Information Processing Systems, 2019.
  • [12] S. Eiffert, K. Li, M. Shan, S. Worrall, S. Sukkarieh, and E. Nebot, “Probabilistic Crowd GAN: Multimodal Pedestrian Trajectory Prediction using a Graph Vehicle-Pedestrian Attention Network,” IEEE Robotics and Automation Letters, 2020.
  • [13] B. Ivanovic and M. Pavone, “The Trajectron: Probabilistic Multi-Agent Trajectory Modeling With Dynamic Spatiotemporal Graphs,” IEEE Int. Conf. on Computer Vision, 2019.
  • [14] X. Weng, J. Wang, S. Levine, K. Kitani, and N. Rhinehart, “Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting,” Conf. on Robot Learning, 2020.
  • [15] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Trajectron++: Dynamically-Feasible Trajectory Forecasting With Heterogeneous Data,” European Conf. on Computer Vision, 2020.
  • [16] Y. Yuan, X. Weng, Y. Ou, and K. Kitani, “AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting,” IEEE Int. Conf. on Computer Vision, 2021.
  • [17] D. Cao, J. Li, H. Ma, and M. Tomizuka, “Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting,” Proc. IEEE Conf. on Robotics and Automation, 2021.
  • [18] X. Weng, Y. Yuan, and K. Kitani, “PTP: Parallelized Tracking and Prediction with Graph Neural Networks and Diversity Sampling,” IEEE Robotics and Automation Letters, 2021.
  • [19] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, and Q. Xu, “nuScenes: A Multimodal Dataset for Autonomous Driving,” IEEE Conf. on Computer Vision and Pattern Recognition, 2020.
  • [20] A. Geiger, P. Lenz, and R. Urtasun, “Are We Ready for Autonomous Driving? the KITTI Vision Benchmark Suite,” IEEE Conf. on Computer Vision and Pattern Recognition, 2012.
  • [21] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “PnPNet: End-to-End Perception and Prediction with Tracking in the Loop,” IEEE Conf. on Computer Vision and Pattern Recognition, 2020.
  • [22] N. Rhinehart, J. He, C. Packer, M. A. Wright, R. McAllister, J. E. Gonzalez, and S. Levine, “Contingencies from Observations: Tractable Contingency Planning with Learned Behavior Models,” Proc. IEEE Conf. on Robotics and Automation, 2021.
  • [23] H. W Kuhn, “The Hungarian Method for the Assignment Problem,” Naval Research Logistics, 1955.
  • [24] C. Chen, L. Z. Fragonara, and A. Tsourdos, “Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation,” IEEE Sensors Journal, 2020.
  • [25] J. Li, X. Gao, and T. Jiang, “Graph Networks for Multiple Object Tracking,” IEEE Winter Conf. on Applications of Computer Vision, 2020.
  • [26] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple Hypothesis Tracking Revisited,” IEEE Int. Conf. on Computer Vision, 2015.
  • [27] D. Reid, “An Algorithm for Tracking Multiple Targets,” IEEE Transactions on Automatic Control, 1979.
  • [28] I. J. Cox and S. L. Hingorani, “An Efficient Implementation of Reid’s Multiple Hypothesis Tracking Algorithm and Its Evaluation for the Purpose of Visual Tracking,” IEEE Transactions on Pattern Analysis & Machine Intelligence, 1996.
  • [29] L. Bernreiter, A. Gawel, H. Sommer, J. Nieto, R. Siegwart, and C. C. Lerma, “Multiple Hypothesis Semantic Mapping for Robust Data Association,” IEEE Robotics and Automation Letters, 2019.
  • [30] J. Wang and B. Englot, “Robust Exploration with Multiple Hypothesis Data Association,” IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, 2018.
  • [31] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity Forecasting,” European Conf. on Computer Vision, 2012.
  • [32] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
  • [33] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning Social Etiquette: Human Trajectory Understanding In Crowded Scenes,” European Conf. on Computer Vision, 2016.
  • [34] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents,” IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
  • [35] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings,” IEEE Int. Conf. on Computer Vision, 2019.
  • [36] N. Rhinehart, M. Kris, and P. Vernaza, “R2P2: A ReparameteRized Pushforward Policy for Diverse, Precise Generative Path Forecasting,” European Conf. on Computer Vision, 2018.
  • [37] R. Yu and Z. Zhou, “Towards Robust Human Trajectory Prediction in Raw Videos,” arXiv:2108.08259, 2021.
  • [38] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, “Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering,” IEEE Intelligent Vehicles Symposium, 2018.
  • [39] “nuScenes Prediction Challenge Guidelines.”
  • [40] “nuScenes Prediction Data Split.”
  • [41] B. Ivanovic and M. Pavone, “Rethinking Trajectory Forecasting Evaluation,” arXiv:2107.10297, 2021.
  • [42] D. Arthur and S. Vassilvitskii, “K-means++: The Advantages of Careful Seeding,” Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
  • [43] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online and Realtime Tracking,” IEEE International Conference on Image Processing, 2016.
  • [44] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
  • [45] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Int. Conf. on Computer Vision, 2017.
  • [46] T. Chavdarova, P. Baqué, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, and L. Van Gool, “WILDTRACK: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection,” IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
  • [47] “Detectron2 X-101-FPN Backbone,” mask_rcnn_X_101_32x8d_FPN_3x.yaml.