Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses the problem of cascading errors by focusing on the coupling between the tracking and prediction modules. First, by using state-of-the-art tracking and prediction tools, we conduct a comprehensive experimental evaluation of how severely errors stemming from tracking can impact prediction performance. On the KITTI and nuScenes datasets, we find that predictions consuming tracked trajectories as inputs (the typical case in practice) can experience a significant (even order of magnitude) drop in performance in comparison to the idealized setting where ground truth past trajectories are used as inputs. To address this issue, we propose a multi-hypothesis tracking and prediction framework. Rather than relying on a single set of tracking results for prediction, our framework simultaneously reasons about multiple sets of tracking results, thereby increasing the likelihood of including accurate tracking results as inputs to prediction. We show that this framework improves overall prediction performance over the standard single-hypothesis tracking-prediction pipeline by up to 34.2%, with even more significant improvements (up to 70%) in challenging scenarios involving identity switches and fragments – all with an acceptable computation overhead.
Multi-object tracking and trajectory prediction are critical components in modern autonomy stacks. For example, in autonomous driving applications, the outputs of these components are used by the planning module to compute safe and efficient trajectories. Multi-object tracking (MOT) [1, 2, 3, 4, 5, 6, 7, 8, 9] and prediction [10, 11, 12, 13, 14, 15, 16, 17, 18] typically follow a cascaded pipeline, where tracking is performed first to produce past tracklets, followed by a prediction module in charge of predicting other agents’ future trajectories. Although such modularization simplifies the development cycle and improves scalability and interpretability, it also gives rise to significant integration challenges, with cascading errors being a key concern: e.g., a tracking error such as an identity switch can cause a substantial prediction error, as shown in Figure 1 (left).
Perhaps surprisingly, the severity of such cascading errors has been relatively under-explored. Indeed, most works on trajectory prediction consider the unrealistic setting whereby the prediction module consumes ground truth (GT) past trajectories as inputs, as opposed to tracklets produced by tracking. In this work, by applying state-of-the-art tracking and prediction methods on the nuScenes and KITTI datasets, we find that predictions consuming tracklets as inputs experience a significant performance drop in comparison to the idealized setting where GT past trajectories are used as inputs. Moreover, if we restrict the evaluation to challenging scenarios involving tracking errors (which are quite frequent, as we will show), prediction errors increase by up to 28.2× on KITTI and 17.6× on nuScenes. The reason for such a significant performance drop is that tracking errors such as identity switches typically induce velocity/orientation estimation errors that persist for a few frames, which can have a detrimental impact on prediction accuracy.
To address the above issue, we propose a Multi-hypothesis Tracking and Prediction (MTP) framework that uses multi-hypothesis data association to output multiple sets of tracklets as tracking results. Then, these sets of tracklets are used as inputs to the prediction module. The key idea is simple: by simultaneously reasoning about multiple sets of tracklets, the likelihood of including accurate tracklets as inputs to prediction is increased (Figure 1 (right)). Note that this is different from the standard tracking-prediction pipeline in [21, 18, 1, 2, 3, 4, 5, 6, 7, 8, 9], where only a single set of tracklets is produced by tracking. In this case, if the past tracklet is off for an object, the prediction might be completely off.
Our MTP framework is inspired by the prediction-planning pipeline, where the prediction network typically predicts multiple sets of future trajectories, referred to as trajectory samples in VAE-based [13, 15, 16] or GAN-based [10, 11, 12] methods. By reasoning about multiple trajectory samples, the likelihood of considering an accurate prediction is higher, thereby enabling a higher level of safety in planning. MTP exploits a similar idea, whereby multiple sets of tracklets are generated to improve downstream prediction performance. Through experiments on KITTI and nuScenes, we show that the MTP framework improves overall prediction performance (up to 34.2% on the nuScenes dataset), with even more significant improvements (up to 70%) when restricting the evaluation to challenging scenarios involving tracking errors. The MTP framework naturally incurs a computation overhead with respect to its single-hypothesis counterpart, but we show that, fortunately, this overhead is acceptable and still compatible with real-time applications.
The contributions of this paper are threefold: First, we provide a comprehensive experimental assessment of the impact of tracking errors on prediction performance. Second, we propose the MTP framework, which aims at reducing error propagation between MOT and prediction by simultaneously reasoning about multiple sets of tracking results. Third, we thoroughly evaluate the performance of MTP both in terms of prediction accuracy and runtime performance.
3D Multi-Object Tracking.
Recent online 3D MOT methods often follow a tracking-by-detection pipeline with two steps: (1) Given trajectories associated up to the last frame and detections in the current frame, an affinity matrix is computed, where each entry represents the similarity value between a past trajectory and a current detection; (2) Given the affinity matrix, the Hungarian algorithm
is used to obtain a locally optimal matching, which entails making a hard assignment of each current detection to a past trajectory, so that trajectories can be updated to the current frame. Though significant progress has been made recently on the first step, for example by improving affinity matrix estimation using Graph Neural Networks [24, 9, 25] and multi-modal feature learning [2, 1], the second step has largely remained the same. In other words, modern 3D MOT methods typically generate a single set of trajectories via the Hungarian algorithm at inference time, which induces tracking errors that can be detrimental to prediction.
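As a minimal sketch of the assignment step above, assuming SciPy is available (the affinity values, the `associate` helper, and the `min_affinity` gate below are hypothetical, not the exact gating used by any particular tracker):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, min_affinity=0.1):
    """Step (2): one-to-one matching of past trajectories (rows) to current
    detections (cols), maximizing total affinity with the Hungarian algorithm;
    matches below min_affinity are rejected."""
    rows, cols = linear_sum_assignment(affinity, maximize=True)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if affinity[r, c] >= min_affinity]
    unmatched_trks = set(range(affinity.shape[0])) - {r for r, _ in matches}
    unmatched_dets = set(range(affinity.shape[1])) - {c for _, c in matches}
    return matches, unmatched_trks, unmatched_dets

# Two past trajectories, three detections (toy affinities): the third
# detection matches nothing and would start a new tracklet.
aff = np.array([[0.2, 0.9, 0.00],
                [0.8, 0.3, 0.05]])
print(associate(aff))  # -> ([(0, 1), (1, 0)], set(), {2})
```

Note that the result is a single, hard set of matches; any ambiguity (e.g., two near-equal affinities in a row) is resolved irrevocably at this step.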
Multi-Hypothesis Data Association. To improve single-hypothesis MOT, a natural approach is to leverage multi-hypothesis data association (MHDA). The idea is to maintain multiple hypotheses and delay making assignments; as a result, ambiguity in data association can be better resolved in later frames. MHDA was popular in the 1990s and successfully applied to MOT [26, 27, 28] and SLAM [29, 30]. However, at the time when MHDA was being actively developed, the topic of trajectory prediction was still in its infancy. To the best of our knowledge, our work is the first to adopt MHDA to improve downstream prediction.
Trajectory Prediction. There has been significant progress on trajectory prediction recently, including [31, 32, 33, 10, 11, 13, 34, 35, 36]. Yet, almost invariably, such works study the prediction task separately from the 3D MOT task. Specifically, they consider GT past trajectories as inputs to prediction, with no direct accounting of tracking errors. Characterizing and mitigating the propagation of tracking errors to prediction is indeed the key motivation of this paper.
Tracking-Prediction Integration. A few works have attempted to better couple the MOT and prediction tasks. In end-to-end detection and prediction, the MOT and prediction networks are jointly optimized, which increases performance. Yet, it is still a cascaded, single-hypothesis pipeline, and thus prone to predictions being thrown off by tracking errors. In parallelized tracking and prediction, a two-branch tracking and prediction network is proposed. Although this method prevents error propagation in the current frame (tracking results in the current frame are not fed into the prediction branch), it cannot do so for the next window of prediction. This is because that method also uses the Hungarian algorithm to generate a single set of tracklets at the current frame, which can easily lead to tracking errors being propagated to the next window of prediction. In contrast, we replace the Hungarian algorithm with MHDA in the tracking assignment phase, thereby preventing a hard assignment from removing plausible alternative hypotheses. We will show that this idea is quite effective.
Finally, a concurrent and unpublished work has also recognized the importance of understanding how tracking errors can impact prediction. Our paper provides a comprehensive quantitative analysis that corroborates its qualitative findings, and we propose to leverage MHDA to more robustly account for tracking errors, whereas that work's solution method is still single-hypothesis-based.
In this section, we experimentally study to what extent tracking errors can impact prediction performance. We start by reviewing typical tracking errors. We then outline our methodology, present qualitative and quantitative results, and finally characterize how frequently such errors can arise.
Identity Switches (IDS) happen if a GT trajectory is matched with two or more different tracklets. For example, as shown in Figure 2 (top left), the two black GT trajectories are erroneously matched with half of the green and half of the orange tracklets, with a switch in the middle. Such an IDS can happen when two GT trajectories are very close and/or cross each other. IDS can cause large prediction errors as they induce large linear/angular velocity estimation errors, usually persisting for a few frames after the IDS event.
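To see why an IDS corrupts velocity estimates, consider a toy numeric sketch (the positions below are hypothetical): two objects move in parallel, and an identity switch at the last frame makes the tracklet jump laterally from one object to the other.

```python
import numpy as np

# Two objects moving at 1 m/frame along +x, 5 m apart laterally.
obj_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
obj_b = np.array([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
tracklet = np.vstack([obj_a[:3], obj_b[3:]])  # identity switch at frame 3

vel = np.diff(tracklet, axis=0)  # finite-difference velocity estimates
print(vel[-1])  # -> [1. 5.]: a spurious 5 m/frame lateral velocity
```

A predictor consuming this tracklet sees a sudden lateral velocity of 5 m/frame that neither object ever had, and extrapolates accordingly.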
Fragments (FRAG) refer to GT trajectories that do not closely match any tracklet, either because of a wrong association (wrongly-tracked FRAG, Figure 2 (bottom left)), or because the detector misses the detection of the object in later frames (under-tracked FRAG, Figure 2 (bottom center)).
Spurious Tracks are tracklets that are complete false positives, i.e., they do not correspond to any GT trajectory (Figure 2 (bottom right)). Spurious tracklets do not affect the recall of the predictions, but they do lower the precision.
In our evaluation, we apply state-of-the-art methods for 3D MOT and prediction, namely AB3DMOT for MOT and PTP for prediction (we use the prediction branch of PTP while obtaining tracking results from AB3DMOT, to replicate the standard tracking-prediction pipeline), on two standard autonomous driving datasets: KITTI and nuScenes.
KITTI. We use the tracking validation set (using the standard split). We predict 10 future frames using 10 past frames, i.e., 1 second with an FPS of 10. We consider three main object classes labeled in KITTI, i.e., cars, pedestrians, and cyclists.
nuScenes. We follow the standard nuScenes prediction challenge guidelines. Specifically, we use the prediction test set (see the nuScenes code for the split) for evaluation. We consider vehicle classes, namely car, truck, van, trailer, bus, and construction vehicle. We predict 12 future frames using 4 past frames, i.e., the past 2 seconds with an FPS of 2.
Evaluation. We use the standard best-of-k Average Displacement Error (ADE) and Final Displacement Error (FDE) to evaluate prediction, referred to as minADE and minFDE (the ADE is defined as the mean distance between the predicted and GT trajectories; the FDE is the distance between the predicted final position and the GT final position at the end of the prediction horizon). While using k=20 is standard on KITTI in prior work, it is common to use a smaller value of k on nuScenes, such as 5 or 10 (see the nuScenes leaderboard). Thus, we use k=10 on nuScenes and k=20 on KITTI.
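These best-of-k metrics can be written down compactly; here is a minimal NumPy sketch with toy trajectories (the `min_ade_fde` helper and the numbers are illustrative, not the official evaluation code):

```python
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (k, T, 2) predicted trajectory samples; gt: (T, 2) ground truth.
    Returns the best-of-k (minADE, minFDE)."""
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (k, T) per-step errors
    return float(dists.mean(axis=1).min()), float(dists[:, -1].min())

# Toy example: k=2 samples against a GT moving 1 m/step along +x.
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred = np.stack([gt + [0.0, 1.0],   # sample 1: constant 1 m lateral offset
                 gt + [0.0, 0.5]])  # sample 2: constant 0.5 m offset
print(min_ade_fde(pred, gt))  # -> (0.5, 0.5): the metrics keep the best sample
```

Because the minimum is taken over samples, a prediction method is not penalized for producing diverse samples as long as at least one is accurate, which is exactly the property MTP exploits.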
To quantify the impact of tracking errors on prediction, a tracking evaluation is needed to find which objects at which frames experience IDS/FRAG errors. To that end, we use the standard 3D tracking evaluation code, which matches tracked objects with GT at every frame to check: (1) whether there is an identity change of the GT a tracked object is matched to (IDS), (2) whether there is a GT that is not matched with any tracked object (FRAG), and (3) whether there is a tracklet not matched with any GT within a threshold (spurious track). For the matching threshold, we use a 3D Intersection over Union (IoU) of 0.5 on KITTI and a 2D center distance of 2 meters on nuScenes, both of which are standard choices [3, 19]. (Given two boxes, the IoU is defined as their intersection divided by their union; it ranges from 0 to 1 and measures how similar the two boxes are, equaling 1 when the boxes exactly overlap.)
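A much-simplified sketch of this per-frame matching and IDS counting, using the 2-meter center-distance gate of nuScenes (the helpers and toy centers below are hypothetical; the actual evaluation code handles occlusions, track birth/death, and many more cases):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frame(gt_centers, trk_centers, max_dist=2.0):
    """Match GT objects (rows) to tracked objects (cols, indexed by a
    persistent track ID) by 2D center distance, gated at max_dist meters."""
    d = np.linalg.norm(gt_centers[:, None] - trk_centers[None], axis=-1)
    rows, cols = linear_sum_assignment(d)
    return {int(g): int(t) for g, t in zip(rows, cols) if d[g, t] <= max_dist}

def count_ids(per_frame_matches):
    """Count identity switches: the track ID matched to a GT object changes
    between frames in which that GT object is matched."""
    last, ids = {}, 0
    for matches in per_frame_matches:
        for g, t in matches.items():
            if g in last and last[g] != t:
                ids += 1
            last[g] = t
    return ids

# One GT object; between the two frames its match flips from track 0 to track 1.
gt = np.array([[0.0, 0.0]])
frame1 = match_frame(gt, np.array([[0.5, 0.0], [10.0, 0.0]]))  # -> {0: 0}
frame2 = match_frame(gt, np.array([[10.0, 0.0], [0.5, 0.0]]))  # -> {0: 1}
print(count_ids([frame1, frame2]))  # -> 1
```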
Leveraging the methodology outlined in Section III-B, we provide some qualitative insights on the impact of tracking errors on prediction performance.
IDS causes large prediction errors. As shown in Figure 3 (top), as long as we use GT past trajectories as inputs, predictions are accurate. However, when using past tracklets as inputs, particularly when there is an IDS as shown in Figure 3 (bottom), predictions are off due to the sudden (and erroneous) velocity estimation change in the past tracklet.
Wrongly-tracked FRAG can also cause errors. Similar to the case of IDS, a wrongly-tracked FRAG also causes prediction errors. When comparing predictions in Figure 4 top (with GT past trajectories as inputs) and bottom (with past tracklets as inputs), one can see that, when the object’s past tracklet is slightly off, the corresponding predictions are also off due to the orientation change.
Under-tracked FRAG causes missing predictions. Different from the above two cases (which lead to inaccurate predictions), an under-tracked FRAG causes missing predictions, as there is no past tracklet used as inputs to prediction after the FRAG event. As shown in Figure 5, predictions can be missed for objects with under-tracked FRAG errors.
Spurious tracks cause false positives. In contrast to under-tracked FRAG, spurious tracks cause predictions that are not supposed to exist, that is, predictions for ghost objects.
Conclusion. We can see that all tracking errors (IDS, FRAG, and spurious tracks) can lead to prediction errors. In particular, IDS, wrongly-tracked FRAG, and spurious tracks can reduce the precision of the predictions, while under-tracked FRAG and wrongly-tracked FRAG can lead to a lower recall.
| Datasets | Eval. Targets (# of obj) | Inputs to Prediction | minADE | minFDE |
|---|---|---|---|---|
| KITTI | Objects with IDS (33) | GT past trajectories | 0.100 | 0.171 |
| KITTI | Objects with IDS (33) | Past tracklets | 2.820 | 4.514 |
| KITTI | Objects with FRAG (330) | GT past trajectories | 0.177 | 0.306 |
| KITTI | Objects with FRAG (330) | Past tracklets | 1.621 | 2.155 |
| nuScenes | Objects with IDS (4160) | GT past trajectories | 0.473 | 0.825 |
| nuScenes | Objects with IDS (4160) | Past tracklets | 8.345 | 13.892 |
| nuScenes | Objects with FRAG (3365) | GT past trajectories | 0.621 | 1.108 |
| nuScenes | Objects with FRAG (3365) | Past tracklets | 14.520 | 21.815 |
In this section, we quantify the impact of IDS/FRAG errors in terms of minADE and minFDE (we do not consider spurious tracks, as in such cases there is no corresponding GT that can be used to compute ADE/FDE). In particular, for those instances containing IDS/FRAG errors, we compare the prediction performance obtained using GT past trajectories as inputs with that obtained using past tracklets as inputs. The results are shown in the table above. One can observe that when we replace GT past trajectories with past tracklets as inputs, there is a significant performance drop. In particular, on KITTI IDS instances, minADE degrades 28.2× (from 0.100 to 2.820), and on nuScenes IDS instances, minADE degrades 17.6× (from 0.473 to 8.345). Similar performance drops are observed on FRAG instances. Such performance drops are in agreement with our qualitative findings in Section III-C and make predictions arguably almost useless for these objects.
Sections III-C and III-D characterize and quantify how tracking errors can negatively impact prediction performance. But how often do tracking errors happen? And how far from the ego vehicle are the objects affected by tracking errors (that is, do tracking errors also happen for objects very close to the ego vehicle, to which the planning module is most sensitive)? Accordingly, we characterize the frequency and spatial distribution of tracking errors below.
IDS/FRAG frequency. IDS/FRAG cases are indeed quite common. As shown in Figure 6 (left), on average every trajectory in both the KITTI and nuScenes datasets can yield a FRAG, an IDS, or both! The frequency of tracking errors, coupled with their negative impact on prediction (Sections III-C and III-D), provides a strong motivation for developing systematic approaches that account for tracking errors for the purposes of robust prediction (and planning).
IDS/FRAG spatial distribution. Planning is, in general, most sensitive to nearby objects. To understand at a conceptual level whether tracking errors can induce erroneous predictions that in turn can thwart planning, we compute the distance from the ego vehicle of objects experiencing erroneous tracklets. The results are reported in the histogram in Figure 6 (right). One can see that there is a non-negligible number of IDS/FRAG instances where the tracked object is very close to the ego vehicle: about 1000 IDS/FRAG instances for objects within 15 meters, 500 within 10 meters, and 200 within 5 meters. Note that these errors are computed on the nuScenes prediction test set, which contains only 0.83 hours of driving (150 sequences with 40 frames per sequence at an FPS of 2, for a total of 3000 seconds ≈ 0.83 hours). Thus, we argue that tracking errors can severely hinder safe planning (future research will assess this statement more formally, for example by using planning-aware prediction metrics).
To account for the impact of tracking errors on prediction performance, we propose the MTP framework which is visualized in Figure 7. The tracking-prediction pipeline in MTP is relatively standard, in terms of its modularity and sequence of operations (namely, MOT followed by a prediction module). The two key modules introduced are the MHDA and trajectory sampling, as described below:
Multi-Hypothesis Data Association (MHDA). The key idea is to reason about multiple hypotheses simultaneously, with the goal of increasing the likelihood of including accurate tracklets that can be used as inputs to downstream prediction. That is, instead of relying on a hard assignment via the Hungarian algorithm, we use MHDA to enlarge the search space and generate sets of plausible tracklets. Specifically, we use Murty's H-best assignment algorithm, which maintains H sets of tracking results at every frame, where each set is referred to as a hypothesis. The 1st hypothesis is obtained using the Hungarian algorithm by considering the lowest cost, which results in a list of matches between detections and trajectories. To obtain the other hypotheses, we tweak the list of matches in the 1st hypothesis by toggling one match at a time in and out of the list, which results in slightly higher costs. After sorting the other hypotheses by cost, the 2nd hypothesis has the 2nd-lowest cost, and so on. In the case that the 1st hypothesis is erroneous, other hypotheses with slightly higher costs may correspond to a correct association – thus, by reasoning about multiple hypotheses, the likelihood of retaining accurate tracking results is increased. Each hypothesis (a set of tracklets) is then fed into the prediction module as per Figure 7.
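To make the partitioning scheme above concrete, here is a minimal cost-based sketch of Murty's H-best assignment (affinities can be negated into costs; the 3×3 matrix, the `BIG` forbidding constant, and the helper names are hypothetical, and a production implementation would add rectangular matrices and incremental solving):

```python
import heapq
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e9  # cost used to forbid a trajectory-detection pairing

def best_assignment(cost):
    """Lowest-cost complete assignment, or (None, None) if infeasible."""
    rows, cols = linear_sum_assignment(cost)
    total = cost[rows, cols].sum()
    if total >= BIG:  # a forbidden pairing was forced, so no feasible solution
        return None, None
    return total, [(int(r), int(c)) for r, c in zip(rows, cols)]

def murty(cost, H):
    """Return the H lowest-cost assignments of a square cost matrix,
    each as (total_cost, [(row, col), ...])."""
    cost = np.asarray(cost, dtype=float)
    total, sol = best_assignment(cost)
    tie = itertools.count()  # tie-breaker so the heap never compares arrays
    heap = [(total, next(tie), sol, cost)]
    hypotheses = []
    while heap and len(hypotheses) < H:
        total, _, sol, sub = heapq.heappop(heap)
        hypotheses.append((total, sol))
        # Murty's partitioning: for the i-th match, forbid it while
        # enforcing all matches that precede it in the solution.
        base = sub.copy()
        for r, c in sol:
            child = base.copy()
            child[r, c] = BIG            # forbid this match
            t, s = best_assignment(child)
            if t is not None:
                heapq.heappush(heap, (t, next(tie), s, child))
            kept = base[r, c]            # now enforce (r, c) for later children
            base[r, :] = BIG
            base[:, c] = BIG
            base[r, c] = kept
    return hypotheses

# Toy 3x3 cost matrix: the three best hypotheses have costs 5, 6, 6.
costs = [[4.0, 1.0, 3.0],
         [2.0, 0.0, 5.0],
         [3.0, 2.0, 2.0]]
for total, matches in murty(costs, H=3):
    print(total, matches)
```

The partitioning makes the H solution subspaces disjoint, so no hypothesis is ever enumerated twice; the 1st popped hypothesis is exactly the Hungarian solution.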
Trajectory Sampling. Once we obtain predictions by using each hypothesis as input, we sample a subset of the full set of predictions by computing cluster centers with K-Means++, resulting in a diverse set of predictions. The trajectory sampling step is optional, but it is useful to limit an excessive number of prediction samples and allows a fair comparison with single-hypothesis prediction methods. For example, if we use H=20 hypotheses and generate k=10 prediction samples for each hypothesis, there will be 200 samples for each object. In this case, we would sample only 10 of the 200 samples to carry out a fair comparison with single-hypothesis prediction methods using k=10 samples.
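The sampling step can be sketched as follows; for determinism this uses a farthest-point variant of K-Means++ seeding over flattened trajectories (an assumption for illustration, not the paper's exact clustering code), with hypothetical trajectory data:

```python
import numpy as np

def sample_diverse(preds, k, seed=0):
    """preds: (N, T, 2) predictions pooled over all hypotheses. Greedily keep
    k mutually distant trajectories (a deterministic farthest-point variant
    of K-Means++ seeding) as a diverse subset."""
    flat = preds.reshape(len(preds), -1)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(flat)))]
    for _ in range(k - 1):
        # distance of every trajectory to its nearest already-chosen one
        d = np.linalg.norm(flat[:, None] - flat[chosen][None], axis=-1).min(axis=1)
        chosen.append(int(d.argmax()))  # pick the farthest trajectory
    return preds[chosen]

# 200 toy predictions (e.g., H=20 hypotheses x 10 samples each) around two
# motion modes; sampling k=2 keeps one representative per mode.
rng = np.random.default_rng(1)
mode_a = rng.normal([[1.0, 0.0]] * 5, 0.1, size=(100, 5, 2))  # heading +x
mode_b = rng.normal([[0.0, 1.0]] * 5, 0.1, size=(100, 5, 2))  # heading +y
preds = np.concatenate([mode_a, mode_b])
print(sample_diverse(preds, k=2).shape)  # -> (2, 5, 2)
```

The design goal is coverage rather than averaging: the retained subset spans the distinct motion modes contributed by different hypotheses instead of collapsing onto the majority mode.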
Conclusion. In summary, our MTP framework can be applied to any tracking-prediction pipeline that is based on single-hypothesis matching – the main modification is to replace the matching algorithm with MHDA.
As MTP is designed to improve prediction, we follow the standard prediction evaluation as described in Section III-B. For additional implementation details and hyper-parameters, we refer the reader to our code. Here, we categorize our prediction experiments into: 1) targeted evaluation, which analyzes prediction performance for objects affected by tracking errors, and 2) global evaluation, which analyzes prediction performance across all tracked objects, whether or not they are affected by tracking errors. Also, we provide a runtime speed analysis. The key takeaway is that MTP improves both targeted and global prediction performance, with a relatively minor computation overhead.
Targeted evaluation. Results are provided in the table below in terms of minADE and minFDE. The first row of each block corresponds to the standard single-hypothesis tracking-prediction pipeline (AB3DMOT+PTP), which we refer to as STP. As shown in Section III-D, STP yields large prediction errors due to IDS/FRAG. Next, we see that MTP significantly improves prediction performance on both the KITTI and nuScenes datasets. Specifically, when using prediction samples from all hypotheses, we see a 4× minADE improvement on KITTI IDS (i.e., from 2.820 to 0.707), a 19.5% minADE improvement on KITTI FRAG, a 2.5× improvement on nuScenes IDS, and a nearly 2× improvement on nuScenes FRAG. To compare MTP and STP under the same number of samples (and thus avoid giving MTP an unfair advantage with a larger number of samples), we apply trajectory sampling to MTP. Remarkably, one can see that prediction performance after sampling is only slightly lower (e.g., minADE rises from 0.707 to 0.747 on KITTI IDS), meaning that the proposed trajectory sampling scheme generally retains accurate predictions. Also, as more hypotheses are used, better performance is achieved (compare the cases with H=5, H=10, and H=20). Importantly, even when using only 5 hypotheses, prediction performance is improved by 2.5× on KITTI IDS cases.
Finally, even though the minADE and minFDE metrics are not suitable to characterize how MTP improves prediction performance on spurious track instances, it is easy to argue why this is the case. Indeed, the likelihood of removing spurious tracks is increased under MTP, as different hypotheses have different matching results, and some hypotheses may not associate false positive detections to trajectories.
Global evaluation. Results are provided in the second table below. Again, MTP largely improves performance over STP, e.g., minFDE improves from 0.278 to 0.238 (14.4%) on KITTI and from 3.819 to 2.512 (34.2%) on nuScenes. The improvement on nuScenes is larger, as it has a higher percentage of IDS/FRAG instances. In brief, MTP significantly improves both targeted and global prediction performance.
Note that although we follow the nuScenes evaluation protocol, our ADE/FDE numbers are not comparable to those on the nuScenes leaderboard, as we use past tracklets as inputs (as opposed to GT past trajectories, as is the case for the leaderboard results).
| Dataset | Errors | Method | minADE | minFDE |
|---|---|---|---|---|
| KITTI | IDS | STP, H=1, k=20 | 2.820 | 4.514 |
| KITTI | IDS | MTP (Ours), H=5, k=100 | 1.099 | 1.768 |
| KITTI | IDS | MTP (Ours), H=10, k=200 | 0.844 | 1.332 |
| KITTI | IDS | MTP (Ours), H=20, k=400 | 0.707 | 1.093 |
| KITTI | IDS | MTP (Ours), H=5, k=20, sampling | 1.118 | 1.802 |
| KITTI | IDS | MTP (Ours), H=10, k=20, sampling | 0.876 | 1.390 |
| KITTI | IDS | MTP (Ours), H=20, k=20, sampling | 0.747 | 1.173 |
| KITTI | FRAG | STP, H=1, k=20 | 1.621 | 2.155 |
| KITTI | FRAG | MTP (Ours), H=5, k=100 | 1.436 | 1.862 |
| KITTI | FRAG | MTP (Ours), H=10, k=200 | 1.385 | 1.765 |
| KITTI | FRAG | MTP (Ours), H=20, k=400 | 1.305 | 1.627 |
| KITTI | FRAG | MTP (Ours), H=5, k=20, sampling | 1.448 | 1.888 |
| KITTI | FRAG | MTP (Ours), H=10, k=20, sampling | 1.404 | 1.801 |
| KITTI | FRAG | MTP (Ours), H=20, k=20, sampling | 1.335 | 1.688 |
| nuScenes | IDS | STP, H=1, k=10 | 8.345 | 13.892 |
| nuScenes | IDS | MTP (Ours), H=10, k=100 | 4.143 | 6.464 |
| nuScenes | IDS | MTP (Ours), H=20, k=200 | 3.321 | 5.052 |
| nuScenes | IDS | MTP (Ours), H=10, k=10, sampling | 4.573 | 7.303 |
| nuScenes | IDS | MTP (Ours), H=20, k=10, sampling | 3.923 | 6.210 |
| nuScenes | FRAG | STP, H=1, k=10 | 14.520 | 21.815 |
| nuScenes | FRAG | MTP (Ours), H=10, k=100 | 9.017 | 12.721 |
| nuScenes | FRAG | MTP (Ours), H=20, k=200 | 7.697 | 10.606 |
| nuScenes | FRAG | MTP (Ours), H=10, k=10, sampling | 9.585 | 13.846 |
| nuScenes | FRAG | MTP (Ours), H=20, k=10, sampling | 8.476 | 12.105 |
Tracking Error Statistics. To gain insights on why MTP improves prediction performance, we show an intermediate tracking error analysis in Figure 8 for IDS/FRAG instances on nuScenes and KITTI. Specifically, we plot the distribution over frames of IDS/FRAG instances for STP on the left, and the distribution over frames of IDS/FRAG instances present in all of the hypotheses for MTP on the right. One can notice that a large portion of tracking errors in STP does not exist in at least one of the hypotheses being considered by MTP. If we count the number of FRAG/IDS over all frames, the 33 IDS instances experienced by STP on KITTI are reduced to only 9 that are shared by all 20 hypotheses under MTP (72.7% reduction), and the 7083 IDS instances experienced by STP on nuScenes are reduced to only 2835 that are shared by all 20 hypotheses under MTP (60.0% reduction). FRAG errors are reduced by a similar amount. This provides strong justification for the inclusion of MHDA in a tracking-prediction pipeline.
| Dataset | Method | minADE | minFDE |
|---|---|---|---|
| KITTI | STP, H=1, k=20 | 0.185 | 0.278 |
| KITTI | MTP (Ours), H=5, k=100 | 0.163 | 0.235 |
| KITTI | MTP (Ours), H=10, k=200 | 0.152 | 0.215 |
| KITTI | MTP (Ours), H=20, k=400 | 0.146 | 0.203 |
| KITTI | MTP (Ours), H=5, k=20, sampling | 0.170 | 0.252 |
| KITTI | MTP (Ours), H=10, k=20, sampling | 0.164 | 0.240 |
| KITTI | MTP (Ours), H=20, k=20, sampling | 0.162 | 0.238 |
| nuScenes | STP, H=1, k=10 | 2.320 | 3.819 |
| nuScenes | MTP (Ours), H=10, k=100 | 1.498 | 2.293 |
| nuScenes | MTP (Ours), H=20, k=200 | 1.325 | 1.979 |
| nuScenes | MTP (Ours), H=10, k=10, sampling | 1.691 | 2.692 |
| nuScenes | MTP (Ours), H=20, k=10, sampling | 1.585 | 2.512 |
| STP (H=1) | MTP (H=5) | MTP (H=10) | MTP (H=20) |
Runtime Speed. MHDA unavoidably introduces a computation overhead, which is characterized in terms of FPS on the KITTI dataset in the table above. As expected, tracking takes longer as H increases. The good news is that, even with H=20, the tracking runtime remains acceptable (near real-time) without requiring a GPU implementation, which can be attributed to the excellent speed of AB3DMOT. Interestingly, there is nearly no runtime degradation for prediction as H increases. This is because predictions for different hypotheses are completely independent, so one can easily run them in parallel, although more GPU memory is needed (1.8 GB per hypothesis for the prediction network). The runtime of K-Means++ sampling is negligible, so it is not included in the table.
| Method | Setting | minADE | minFDE |
|---|---|---|---|
| STP | H=1, k=1 | 1.024 | 1.335 |
| Re-tracking | H=1, k=1 | 0.893 | 1.241 |
| MTP (Ours) | H=20, k=1, sampling | 0.852 | 1.086 |
| MTP (Ours) + Re-tracking | H=20, k=1, sampling | 0.770 | 0.967 |
Comparison with Concurrent Work. As discussed in Section II, an unpublished concurrent work has proposed a single-hypothesis re-tracking solution to mitigate the impact of tracking errors on prediction. Though that work has open-sourced its code, a direct comparison with MTP is not immediate, as it is implemented on a 2D (rather than 3D, as in this paper) detection, tracking, and prediction pipeline, i.e., using MaskRCNN for detection in images, SORT for 2D MOT, and Social-LSTM for bird's-eye-view trajectory prediction, and it is evaluated on a multi-camera dataset, WILDTRACK. To ensure a fair comparison, we add our MHDA and trajectory sampling modules to the same SORT+Social-LSTM pipeline and carry out the evaluation on WILDTRACK. Note that, as detection results had not been released by the time this paper was submitted, we use the Detectron2 implementation of MaskRCNN with an X101-FPN backbone to generate detections. Prediction results are shown in the table above in terms of minADE and minFDE; here, k=1, as Social-LSTM is a deterministic prediction approach. Both the MTP and re-tracking approaches improve over STP when using past tracklets as inputs, with MTP showing a slightly larger improvement. Importantly, the re-tracking approach, which is single-hypothesis-based, and the MTP framework are complementary: combining the two (last row in the table above) further improves prediction performance!
In this paper, we studied how tracking errors can impact prediction performance via qualitative and quantitative analyses. These analyses led to the design of the MTP framework, which simultaneously reasons about multiple sets of tracking results in order to account for tracking errors. We demonstrated how MTP significantly improves prediction performance, particularly in those instances containing tracking errors – all for a relatively minor computation overhead.
This work opens up a number of future research directions. First, it is of interest to better understand how to optimally choose the number of hypotheses H and the number of prediction samples k as a function of computational requirements and target operational design domains. Second, it is of interest to extend our analysis by considering additional methods for MOT and prediction, along with planning-aware evaluation metrics (quantitatively assessing how tracking errors ultimately impact planning). Third, the MTP framework is quite general and can be augmented with other techniques aimed at mitigating the propagation of tracking errors. This provides an exciting opportunity to make predictions even more robust to tracking errors. Finally, we plan to study how errors and uncertainty propagate across other modules within the autonomy stack, with the ultimate goal of devising more robust autonomy stacks.
A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks,” IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, “Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering,” IEEE Intelligent Vehicles Symposium, 2018.