Drowned out by the noise: Evidence for Tracking-free Motion Prediction

by   Ameni Trabelsi, et al.
Colorado State University

Autonomous driving consists of a multitude of interacting modules, where each module must contend with errors from the others. Typically, the motion prediction module depends on a robust tracking system to capture each agent's past movement. In this work, we systematically explore the importance of the tracking module for the motion prediction task and ultimately conclude that the tracking module is detrimental to overall motion prediction performance when the module is imperfect (with as low as 1% error). We compare models that use tracking information to models that do not across multiple scenarios and conditions. We find that the tracking information only improves performance in noise-free conditions. A noise-free tracker is unlikely to remain noise-free in real-world scenarios, and the inevitable noise will subsequently negatively affect performance. We thus argue that future work should be mindful of noise when developing and testing motion/tracking modules, or that it should do away with the tracking component entirely.




1 Introduction

Autonomous driving depends upon a mixture of perception modules to achieve safe motion planning. Perception must unfold in highly uncertain, rapidly changing, and interactive environments shared with other dynamic agents. Planning focuses on the real-time, safe navigation of such an environment. In this work, we focus on the two perceptive tasks of agent tracking and motion prediction. Typically, these two tasks are cascaded; agent tracking output feeds into motion prediction. Such cascaded approaches are usually highly affected by errors propagating from noisy components. For instance, errors propagated from a noisy tracking module can hinder the performance of the motion prediction and planning modules. Such problems can result in catastrophic failures as the system fails to recover from errors accumulated through the pipeline.

Despite the complexities of cascaded interactions, most works on these topics do not examine how errors propagate and affect downstream modules. In this work, we ask a novel question: does the tracking system, a sub-component of motion prediction, contribute to overall accuracy improvements in real-world settings? We focus on the tracking module because noise is prevalent in real-world environments, a consequence of several common autonomous driving issues like heavy occlusion, crowded scenes, high inter-frame motion, and camera motion. Thus, we study the effectiveness of a motion prediction module that 1) removes the need for tracking, and 2) is still powerful and robust. We show that, unless we assume the tracking module is noise-free, tracking-free motion prediction methods can achieve better performance than the models that use tracking information.

We study the motion prediction module because it is an indispensable task for planning safe and comfortable maneuvers. Recent works [chai2019multipath, messaoudtrajectory] have highlighted two main factors that directly affect the agents’ future motion: the short-term history of the agents’ movements and their interactions, and the scene context, including road and crosswalk polygons, lane directions and boundaries, traffic lights, and other relevant map information. This task is specifically challenging due to the uncertainty of the future decisions of the agents, and it seems intuitive that predicting the future trajectories of said agents is important. However, as we show in this work, agent tracking overly complicates the motion prediction task, and actually decreases performance in the inevitable presence of real-world noise.

We describe three models that utilize a Bird’s Eye View (BEV) multi-channel input image representation that integrates both scene context, from a high definition map, and agents’ motion history, obtained from a working object pose estimation module. All three models produce both multi-agent trajectory predictions and spatial uncertainty estimations. Our baseline model is the tracking-free model. In order to evaluate the effect of the tracking module, we integrate the tracking information into the other two models. In one model, we integrate an LSTM embedding to represent the agent’s past states based on its tracking information. In the second model, we integrate the identity information obtained from a tracking module in the BEV input image using displacement fields (to the best of our knowledge, this input representation is novel in the task of motion prediction). We evaluate the performance of our models using a real-world tracker and conclude that, in real-world settings, the tracking module is a hindrance under most noise levels, or, at best, unneeded.

In summary, the main contributions of this paper are:

  • We study the effectiveness of a motion prediction module that removes the need for tracking and is both powerful and robust.

  • We train three deep networks to predict short-term traffic agents’ trajectories and their spatial uncertainties.

  • We propose a novel input representation that integrates the tracking information using spatio-temporal displacement fields.

  • We conduct various experiments to evaluate the performance of the three models under noisy tracking conditions, and variable actor density conditions.

2 Related Work

Motion prediction has a long and storied history [rasouli2020deep, yurtsever2020survey], but, due to space constraints, we will limit this section to approaches that contextualize their efforts in the self-driving systems domain. We first cover classical approaches and then discuss deep learning based approaches by distinguishing the methods that use agent identification information from those that do not use such information.

Most of the deployed systems in industry use well-established, engineered approaches for motion prediction. One common approach is the Kalman Filter (KF) [kalman1960new, wan2000unscented]. The KF estimates the agent’s state and propagates it into the future based on kinematic models and assumptions of an underlying physical system. While the KF is efficient for short-term predictions, its performance degrades with longer-term predictions because it is mainly agent-centric, i.e., it ignores external constraints (environment constraints, other agents, …) and only considers spatio-temporal agent information. Other classical approaches rely on machine learning models, such as Hidden Markov Models, Gaussian Mixture Models [yoo2016visual], Gaussian Processes [mogelmose2015trajectory] and other techniques to solve this task. However, the real-world performance of these approaches usually suffers from high computation time, generalization issues (when confronted with noisy detections), or difficulty modeling complex agent/environment interactions.

Various deep learning approaches have been proposed to model agents’ behaviors in the motion prediction task. Like other sequence prediction tasks, many motion prediction methods rely on recurrent architectures, such as LSTMs [fragkiadaki2015learning, lee2017desire, zhao2019multi], to model the agents’ dynamics. Social-LSTM [alahi2016social] uses LSTM embeddings to model each pedestrian’s motion and then aggregates these embeddings using a social pooling technique to model inter-pedestrian interactions, before predicting their future trajectories. Social-GAN [gupta2018social] extends Social-LSTM by proposing a generative model based on a Recurrent Neural Network (RNN). Giuliari et al. [giuliari2020transformer] leverage Transformers (TFs), a recent technique developed within the NLP field for word sequence modeling, to model pedestrian trajectories. However, these methods have generalization issues when confronted with noisy detections [rhinehart2019precog].

Several works model the interactions among agents using Graph Neural Networks. Social-BiGAT [kosaraju2019social] relies on graph attention networks to model the social interactions between pedestrians, where each node in the graph is a pedestrian represented using an LSTM embedding. Social-STGCNN [mohamed2020social] directly models the pedestrian trajectories as a graph and uses a Time-Extrapolator Convolution Neural Network to operate on the temporal dimension. Though these methods capture the interactions among the agents, they fail to capture the environment constraints. VectorNet [gao2020vectornet] suggests addressing this failing using polyline subgraphs to represent separate entities, including the agents and the environment constraints. These subgraphs then form a global interaction graph that captures interactions among all environment components. The polyline representation allows graph-based approaches to account for the agents’ interactions with other environment components; however, it is often hard to extract automatically from sensors, and requires human annotations.

In the self-driving domain, a bird’s eye view (BEV) raster representation is widely used as input, followed by Convolutional Neural Networks (CNNs) [chai2019multipath, cui2019multimodal, djuric2020uncertainty, salzmann2020trajectron++]. The raster encodes the agents’ history, context and other map information, which allows the network to extract useful appearance features of the agents and their environmental context in order to predict their future trajectories. Djuric et al. [djuric2020uncertainty] use CNNs to predict short-term vehicle trajectories from a BEV raster input encoding each individual agent’s surrounding context. This work was later applied to Vulnerable Road Users (VRUs) [chou2019predicting]. Chai et al. [chai2019multipath] leverage a fixed set of future state-sequence anchors that correspond to modes of the trajectory distribution. Other approaches [chandra2019traphic, djuric2020uncertainty, liang2020pnpnet] take advantage of both LSTMs and CNNs by proposing hybrid models that encode agents’ states using LSTMs and capture scene context and agents’ interactions using CNN features.

The existing methods we have discussed use engineered or learned techniques that capture the agent’s past movements in order to predict their future trajectory. Some methods also consider the agent’s interactions with other traffic actors and scene context to better forecast their future movements. A major assumption of these techniques is that the identity of the agents is known through time (i.e., agent tracking is perfect); however, in real-world applications, the performance of tracking will inevitably be imperfect, leading to identity switches [chiu2020probabilistic, chaabane2021deft] (due to heavy occlusion, high inter-frame motion, crowded scenes, or high sensor motion [caesar2020nuscenes], to name but a few issues). Some approaches circumvent this issue with end-to-end methods that jointly learn detection and motion prediction directly from sensor data [casas2018intentnet, djuric2020multinet, luo2018fast, sadat2020perceive, chaabane2020looking, trabelsi2021pose], but they do not consider real-world error either. In this work, we explicitly study the effect that noise in the tracking module has on motion prediction performance by comparing models that integrate tracking information with models that do not use such information.

3 Methods

In this work, we analyse the importance of the tracking module for the task of motion prediction. To this end, we compare methods that integrate the identity information of agents to methods that remove the use of such information. Here, we describe three methods (see Figure 1). The first is a fully convolutional model that outputs trajectory prediction of agents in the scene (§ 3.2). The second is a hybrid CNN-LSTM model that extends the first model by integrating LSTM encoding extracted from agents’ states (§ 3.3). The third is a novel fully convolutional model that integrates the identity information of agents in the BEV input image (further explained in § 3.4). All three models take multi-channel image input of a BEV of the scene and output multi-agent trajectory predictions.

Figure 1: Overview of the three described architectures. We show the input representation rasterized into a color-coded RGB image for visualization purposes. Each historical agent polygon is rasterized with the same color as the current polygon but with a reduced level of brightness, resulting in the fading effect. A. represents the Track-free CNN method that relies on the tracking-free input to predict the agents’ future trajectories. B. is the Track-based CNN method that integrates the tracking information in the input representation using displacement vector fields. C. shows the Hybrid method that extends A. by adding an LSTM encoding to represent each agent’s history. All agents in the A. and C. inputs are represented with the same green color to show that no identity information was used to differentiate among agents. In B., we represent each agent with a different color to indicate their identity information. In the actual input representation, each of the $T$ past frames contributes $C$ channels, where $C$ is equal to 1 or 2 depending on the architecture.

For all models we assume access to a high definition map of an operating area, comprising road and crosswalk polygons, lane directions and boundaries and other relevant map information. We assume models have a functioning pose estimation system ingesting the sensor data to detect and pose traffic actors. Lastly, unless specified differently, we assume a perfect tracking system is available for the tracking-based models, providing ground-truth tracking of the detected traffic actors.

3.1 Input representation:

We encode the static map elements from the high definition map in a bird’s eye view (BEV) image centered on the self-driving vehicle (SDV), where each element of the map, including driving lanes, crosswalks and traffic lights, is encoded as a binary mask in its own separate channel. These channels are then rasterized into an RGB image where each element is assigned a different color, as described in [djuric2020uncertainty]. Furthermore, we consider additional channels stacked with the map raster, where each channel represents the agents’ locations at one timestep of the history or the present. Each of these channels is a binary mask encoding the agents’ top-down positions in the same BEV frame as introduced above. The final input is then formed of channels representing map information and the agents’ history and present locations. We note here that no identity information is encoded, as all detections of agents at each timestep are treated identically.
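As a rough illustration of the agent channels described above, the following Python sketch rasterizes agent centers into one binary mask per timestep. It is a simplification under stated assumptions: the paper rasterizes agent polygons, whereas this sketch marks only the cell containing each center, and the function name `rasterize_agents`, the grid size, and the cell size are hypothetical choices rather than values from the paper.

```python
def rasterize_agents(frames, grid_size, cell_m):
    """Return one binary mask (a list of rows) per timestep.

    frames:    list over timesteps (oldest first, present last); each entry
               is a list of (x, y) agent centers in metres relative to the SDV.
    grid_size: number of cells per side (hypothetical, not from the paper).
    cell_m:    side length of one cell in metres (hypothetical).
    """
    half = grid_size * cell_m / 2.0  # the SDV sits at the centre of the grid
    channels = []
    for centers in frames:
        mask = [[0] * grid_size for _ in range(grid_size)]
        for x, y in centers:
            # Floor division keeps negative coordinates in the correct cell.
            col = int((x + half) // cell_m)
            row = int((y + half) // cell_m)
            if 0 <= row < grid_size and 0 <= col < grid_size:
                mask[row][col] = 1  # same value for every agent: no identity
        channels.append(mask)
    return channels


if __name__ == "__main__":
    history = [[(0.0, 0.0)], [(1.0, 0.0), (100.0, 100.0)]]
    for mask in rasterize_agents(history, grid_size=4, cell_m=1.0):
        print(mask)
```

Because every agent is written with the same value, the stacked masks carry motion history without any identity information, which is the property the tracking-free model relies on.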

3.2 Tracking free CNN model:

In this model, we follow [djuric2020multinet] and process the input multi-channel image using a sequence of 2-D convolutions to produce a dense feature representation for each grid cell of the input. We further add three convolutional layers to finally output a 3-D tensor representing the predicted future movements of the agents present in the scene, whose size is determined by the size of the grid and the number of future predictions. For each grid cell containing an agent center at the present timestep, we predict the 2-D center offsets at future time horizons. In addition to predicting the future trajectories, we also estimate the spatial uncertainties of our predictions. Note that this model uses neither prior identity information nor a tracking step.

3.3 Hybrid model:

The hybrid model is an extension of the first model (§ 3.2) which further integrates an LSTM sequence model [messaoudtrajectory]. The LSTM encodes each agent’s states across past and present timesteps into a single embedding per agent. In our experiments, an agent state comprises position displacements with respect to the present, relative position changes, and speed at each timestep, up to and including the present state. For each agent, the LSTM embedding is concatenated with the CNN features extracted (as in the first model) at the grid cell containing the agent’s center at the present timestep. The grid cells that do not contain agent centers are padded with zeros. The obtained feature block is then processed, as in the first model, with three convolutional layers that output a tensor representing the future trajectories and the corresponding uncertainties.

Note that the use of LSTM to encode an agent’s past trajectory relies on the assumption that the identity of the agent is well known through time.
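To make the dependence on identity concrete, here is a minimal Python sketch of how the per-timestep agent states listed above (displacement with respect to the present, relative position change, and speed) could be derived from one tracked agent. The function name `agent_states`, the tuple layout, and the default time step of 0.1 s (matching a 10 Hz sampling rate) are assumptions made for illustration, not the authors' implementation.

```python
import math


def agent_states(track, dt=0.1):
    """Build the state sequence for ONE tracked agent.

    track: list of (x, y) centers in metres, oldest first, present last.
    dt:    time between consecutive samples in seconds (assumed 10 Hz).
    Returns one tuple (dx, dy, step_x, step_y, speed) per timestep.
    """
    present_x, present_y = track[-1]
    prev_x, prev_y = track[0]
    states = []
    for x, y in track:
        dx, dy = x - present_x, y - present_y   # displacement w.r.t. the present
        step_x, step_y = x - prev_x, y - prev_y  # change since previous timestep
        speed = math.hypot(step_x, step_y) / dt
        states.append((dx, dy, step_x, step_y, speed))
        prev_x, prev_y = x, y
    return states


if __name__ == "__main__":
    for state in agent_states([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], dt=1.0):
        print(state)
```

Every feature here is computed along a single track, so if an identity switch splices another agent's position into `track`, the displacement and speed features jump accordingly and the embedding is built from corrupted inputs.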

3.4 Tracking based CNN model:

This model follows the same architecture as the first model (§ 3.2). The main difference resides in the input representation; in this model, we integrate the identity information in the input image. Specifically, instead of representing each timestep from the past with a binary mask to indicate the presence/absence of a detection at each pixel, we consider a spatio-temporal displacement vector field with two channels over the $W \times H$ input image at each timestep, where the 2-D vector at each pixel parallels the vector from the agent center at that timestep to the center of the agent at the present timestep. $W$ and $H$ are the width and height of the input image. At the present timestep, we use a simple binary mask, similar to the initial input representation. The final input image then stacks these per-timestep channels with the map channels.

Like the first model, this model relies on CNNs to operate on the spatial and the temporal dimensions simultaneously, and thus it is smaller in size compared to the hybrid model that uses both CNNs and LSTMs (§ 3.3). Though displacement vector fields are a common representation in the segmentation task [ahn2019weakly, neven2019instance], to the best of our knowledge, we are the first to apply this technique in the motion prediction task.
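The displacement-field input can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' code: it writes the vector only into the cell containing each agent center rather than into every pixel of the agent, it expresses displacements in metres, and the names `displacement_field`, `grid_size`, and `cell_m` are hypothetical.

```python
def displacement_field(past_centers, present_centers, grid_size, cell_m):
    """Two-channel displacement field for ONE past timestep.

    past_centers, present_centers: dicts mapping a track id to an (x, y)
        center in metres relative to the SDV.
    Returns (dx_channel, dy_channel), each a grid_size x grid_size grid.
    """
    half = grid_size * cell_m / 2.0
    dx_ch = [[0.0] * grid_size for _ in range(grid_size)]
    dy_ch = [[0.0] * grid_size for _ in range(grid_size)]
    for track_id, (x, y) in past_centers.items():
        if track_id not in present_centers:
            continue  # agent is not present at the current timestep
        col = int((x + half) // cell_m)
        row = int((y + half) // cell_m)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            cx, cy = present_centers[track_id]
            # Vector from the past center to the SAME agent's present center.
            dx_ch[row][col] = cx - x
            dy_ch[row][col] = cy - y
    return dx_ch, dy_ch


if __name__ == "__main__":
    dx, dy = displacement_field({7: (0.0, 0.0)}, {7: (1.5, -0.5)}, 4, 1.0)
    print(dx[2][2], dy[2][2])
```

The lookup `present_centers[track_id]` is where tracking enters: a wrong id makes the stored vector point at a different agent, which is one way to see why this representation is sensitive to identity switches.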

3.5 Loss Function:

In this paper, we train both trajectory prediction and uncertainty estimation jointly. We project the prediction errors on the along-track (AT) and cross-track (CT) directions using the ground-truth heading of the agent, and we assume that each projected error in one of the two directions is independent from the other and follows a Laplace distribution, with the PDF of a random Laplacian variable $x$ computed as:

$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),$$

where mean $\mu$ and diversity $b$ are the Laplace parameters. Ideally the AT and CT errors would follow ground-truth distributions of mean $0$ and diversities $b_{AT}$ and $b_{CT}$, respectively. Since we assume that the uncertainty increases with time, we define the diversity as a linearly increasing function:

$$b_{*}(t) = \alpha_{*} + \beta_{*} t,$$

where $\alpha_{*}$ and $\beta_{*}$ are model hyper-parameters defined separately for AT and CT. To train the model, we minimize the Kullback-Leibler (KL) divergence between the ground-truth distribution and the predicted distribution as in [djuric2020multinet], defined as:

$$\mathcal{L}_{*} = \mathrm{KL}\left(\mathrm{Laplace}(0, b_{*}(t)) \,\|\, \mathrm{Laplace}(\hat{\mu}_{*}, \hat{b}_{*})\right),$$

where $*$ is either AT or CT, and $\hat{\mu}_{*}$ and $\hat{b}_{*}$ denote the parameters of the predicted distribution.
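For readers who want to experiment with this loss, the KL divergence between two Laplace distributions has a well-known closed form, $\mathrm{KL}(\mathrm{Laplace}(\mu_1, b_1) \,\|\, \mathrm{Laplace}(\mu_2, b_2)) = \log\frac{b_2}{b_1} + \frac{|\mu_1 - \mu_2| + b_1 e^{-|\mu_1 - \mu_2| / b_1}}{b_2} - 1$. The Python sketch below evaluates it together with a linearly increasing diversity. The function names, the argument order (ground truth first), and the hyper-parameter values used in the demo are assumptions for illustration; the paper's own values are not reproduced here.

```python
import math


def diversity(t, alpha, beta):
    """Linearly increasing ground-truth diversity: alpha + beta * t."""
    return alpha + beta * t


def laplace_kl(mu_gt, b_gt, mu_pred, b_pred):
    """Closed-form KL( Laplace(mu_gt, b_gt) || Laplace(mu_pred, b_pred) )."""
    d = abs(mu_gt - mu_pred)
    return math.log(b_pred / b_gt) + (d + b_gt * math.exp(-d / b_gt)) / b_pred - 1.0


if __name__ == "__main__":
    # Hypothetical hyper-parameters for one direction (AT or CT).
    alpha, beta = 0.5, 0.25
    for t in (0.0, 2.0, 5.0):
        b_gt = diversity(t, alpha, beta)
        # A predicted distribution that is off in both mean and diversity.
        print(t, laplace_kl(0.0, b_gt, 0.3, 1.0))
```

The divergence is zero only when the predicted mean and diversity both match the ground-truth ones, so minimizing it trains the trajectory and the uncertainty estimate jointly.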

4 Experimental Details:

In this section, we describe the dataset we used to run the different experiments (§ 4.1). We also give details of the architectures and experimental settings (§ 4.2).

4.1 Dataset:

The goal of this work is to analyse different aspects of motion prediction models and not to compare the performance of our models with state-of-the-art methods. To this end, we use the Lyft Prediction Dataset [houston2020one] to run our experiments. The Lyft Prediction Dataset is the largest public self-driving dataset for motion prediction to date, with 1,118 hours of recorded self-driving perception data. It was collected by a fleet of 20 autonomous vehicles along a fixed route in Palo Alto, California over a four-month period. It consists of 170,000 scenes, 25 seconds long each capturing the positions and motions of the surrounding agents including vehicles, cyclists and pedestrians. The dataset also comprises a high-definition semantic map with 15,242 labelled elements and a high-definition aerial view over the area.

4.2 Experimental settings:

For the input representation, we use a BEV image rasterized on a regular grid. For temporal information, we consider a history of 1s (equivalent to 10 timestamps at 10Hz), so that the input stacks $C$ channels per timestamp over $T = 11$ timestamps (10 past frames and 1 present frame), where $C = 1$ for both the tracking-free CNN model and the hybrid model and $C = 2$ for the tracking-based CNN model. We chose to use 1s of history for real-time efficiency following [djuric2020multinet, liang2020pnpnet]. The output tensor is defined over the grid, with channels encoding the predicted trajectories and their uncertainties. For the backbone network, we use ResNet-50 [he2016deep] to extract deep features. The models were implemented in PyTorch and trained from scratch with a batch size of 4 for 2 epochs with the Adam optimizer [kingma2014adam], using an initial learning rate that was decreased by a fixed factor every thousand iterations. We ran our experiments on an Ubuntu server with a TITAN X GPU with 12 GB of memory. For our experiments, we report the along-track (AT) and cross-track (CT) error metrics [gong2004methodology], as well as the average displacement error (ADE) and the final displacement error (FDE) [alahi2016social]. All metrics are reported on the validation dataset as specified in [houston2020one] at a horizon of 5 seconds in the future.
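The four reported metrics can be computed from a predicted and a ground-truth trajectory as in the following Python sketch. It assumes that AT and CT are the absolute projections of the position error onto the ground-truth heading and its normal, averaged over the horizon; the function name `displacement_metrics` and the dictionary layout are hypothetical, and the exact averaging used in the paper may differ.

```python
import math


def displacement_metrics(pred, target, headings):
    """Compute AT, CT, ADE and FDE for one agent.

    pred, target: lists of (x, y) positions over the prediction horizon.
    headings:     ground-truth heading in radians at each future timestep.
    """
    at_sum = ct_sum = l2_sum = 0.0
    last_l2 = 0.0
    for (px, py), (tx, ty), h in zip(pred, target, headings):
        ex, ey = px - tx, py - ty
        # Project the error onto the heading (along-track) and its normal
        # (cross-track).
        at_sum += abs(ex * math.cos(h) + ey * math.sin(h))
        ct_sum += abs(-ex * math.sin(h) + ey * math.cos(h))
        last_l2 = math.hypot(ex, ey)
        l2_sum += last_l2
    n = len(pred)
    return {"AT": at_sum / n, "CT": ct_sum / n, "ADE": l2_sum / n, "FDE": last_l2}


if __name__ == "__main__":
    print(displacement_metrics(
        pred=[(1.0, 0.0), (2.0, 1.0)],
        target=[(0.0, 0.0), (2.0, 0.0)],
        headings=[0.0, 0.0],
    ))
```

Under this reading, a prediction that gets the direction right but the speed wrong shows up mostly in AT, while a wrong turn or lane shows up mostly in CT.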

5 Results

It is a common practice in the field of motion prediction to rely on the agent’s past sequence of tracks in order to predict their future trajectory. Such practice assumes the availability of a robust tracking system that provides little-to-no-noise identity information for the agents in the scene. Such an assumption does not always hold true, as the tracking system is always prone to noise due to several factors including occlusion, crowded scenes, and high inter-frame motion. In this section, we first (§ 5.1) evaluate the performance of the motion prediction methods described in § 3 and compare models that integrate the identity information of agents to models that do not use such information. We then (§ 5.2) break these results down by considering scenarios where knowledge of the agent identity may play a crucial role in the performance of the model, such as crowded scenes. We also evaluate the effect of tracking noise on the performance of the models by applying synthetic noise (§ 5.3) as well as realistic noise coming from a real-world tracker (§ 5.4).

5.1 Overall Performance Evaluation

Method            AT       CT       ADE      FDE
Track-free CNN    1.241    0.571    1.379    2.577
Track-based CNN   1.232    0.549*   1.328*   2.556
Hybrid            1.229*   0.567    1.345    2.552*
Table 1: Overall comparison of the described methods using four metrics in meters (best result per metric marked with *). Given a noise-free tracking system, the Hybrid model performs the best in AT and FDE metrics, while the Track-based model performs the best in CT and ADE metrics.

Results of the three models are summarized in Table 1, with the best prediction results marked with an asterisk. We compare the performance using 4 different metrics, AT, CT, ADE and FDE (as introduced in § 4.2), averaged over a prediction horizon of 5s. We note that even small metric improvements can make a significant difference in the performance and safety of the real-world system.

Using the tracking information in both the Track-based CNN and Hybrid models improves the performance by 3% and 2.5% respectively, compared to the Track-free CNN model. Comparing the Track-based CNN and the Hybrid, the latter obtains better prediction accuracy on the AT and FDE metrics. This is unsurprising, as LSTMs are efficient in learning long-term temporal dependencies and thus can better capture the agent dynamics such as velocity and acceleration. Conversely, the Track-based CNN model performs better than the Hybrid model in terms of CT and ADE. Thus, such a model would perform better in lane association or in passing scenarios. In Figure 2 we show qualitative examples of 2 success cases and 2 failure cases for each of the three models described in this work.

Figure 2: Example results of the described methods. We plot the target trajectories in magenta and the predicted trajectories in cyan. For clearer visualization, the scenes are zoomed in and only the trajectories of a subset of the agents are shown. Rows (1), (2), and (3) show examples using the Track-free CNN model, Hybrid model and Track-based CNN model respectively. Columns (A) and (B) show success cases. In (A), we also plot the uncertainties of the predicted trajectories in light cyan. Columns (C) and (D) show failure cases. Examples (1)-(C), (2)-(D), (3)-(C) and (3)-(D) show failures in the estimation of the future direction of the agents, where we see high error in the cross-track direction. Examples (1)-(C), (1)-(D), (2)-(C) and (3)-(C) show failures in the estimation of the velocity of the agents, where we see high error in the along-track direction.

5.2 Model Performance Depends on Agent Velocity and Traffic Density

Since the dataset contains a variety of scenarios with large amounts of behavioural observations and interactions, it is hard to isolate the effect of the tracking module by evaluating on the full testing data. Based on preliminary studies, we found that all three models perform equally well in the scenarios where agents are moving slowly or are stationary. We therefore limit our further comparisons to agents with velocities larger than 3 m/s ($v > 3$ m/s). Furthermore, we categorize these agents based on the density of their surrounding environment. We measure the density by calculating the radius $r$ between the agent and their nearest neighbour. We consider a scenario "dense" if the agent’s radius is less than 4 meters ($r < 4$ m) and a scenario "non dense" if the radius is larger than 10 meters ($r > 10$ m).
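The categorization above amounts to a speed filter followed by a nearest-neighbour distance test. The Python sketch below shows one way to express it, using the 3 m/s speed threshold given in the Figure 6 caption and the 4 m and 10 m radii from the text. The function name `categorize`, the dictionary keys, and the choice to count every agent (including slow ones) as a potential neighbour are assumptions, since the paper does not spell out these details.

```python
import math


def categorize(agents, v_min=3.0, dense_r=4.0, sparse_r=10.0):
    """Label each agent as 'dense', 'non-dense', 'other' or 'excluded'.

    agents: list of dicts with 'pos' = (x, y) in metres and 'speed' in m/s.
    """
    labels = []
    for i, agent in enumerate(agents):
        if agent["speed"] <= v_min:
            labels.append("excluded")  # slow or stationary agents are dropped
            continue
        distances = [
            math.dist(agent["pos"], other["pos"])
            for j, other in enumerate(agents)
            if j != i
        ]
        r = min(distances) if distances else math.inf
        if r < dense_r:
            labels.append("dense")
        elif r > sparse_r:
            labels.append("non-dense")
        else:
            labels.append("other")  # between the two thresholds
    return labels


if __name__ == "__main__":
    scene = [
        {"pos": (0.0, 0.0), "speed": 5.0},
        {"pos": (3.0, 0.0), "speed": 0.0},
        {"pos": (50.0, 0.0), "speed": 4.0},
    ]
    print(categorize(scene))
```

Agents whose nearest neighbour lies between the two radii fall into neither reported group, which keeps the dense and non-dense subsets clearly separated.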

The results are summarized in Figure 3. We show a bar plot with the performance of each model in the ADE metric for each scenario. For dense and non-dense scenarios we specify the performance change (in percentages) with respect to all agents with $v > 3$ m/s. First, as expected, performance decreased in all three models since the selected scenarios are relatively challenging due to the high inter-frame motion and density of the scenes. Second, the performance of the track-free model decreases significantly more than that of the tracking-based models. Compared to all moving agents ($v > 3$ m/s), the performance of the track-free CNN model decreased by 12.6% on dense scenarios ($r < 4$ m), as compared to a more attenuated decrease of 8.8% and 8.5% for the track-based CNN and hybrid models, respectively. On non-dense scenarios ($r > 10$ m), the three models perform more comparably. These findings show that the tracking information can be essential in challenging scenarios, such as a very crowded scene where the input representation of the agents can become less effective. However, for other scenarios, such as non-dense scenarios, the three models seem to perform comparably well. This is the noise-free condition; in § 5.3 and § 5.4, we reevaluate our three models in the context of tracking noise.

Figure 3: Performance evaluation of the described methods in ADE (m) on agents moving with a velocity larger than 3 m/s. We further consider the scene density factor. We calculate the radius $r$ between a given agent and their closest neighbor and select those with $r < 4$ m in one experiment (dense scenarios) and $r > 10$ m in a second experiment (non-dense scenarios). The performance of the three methods degrades in dense scenarios. The decrease is most pronounced with the Track-free model, which shows the importance of tracking information under these conditions.

5.3 Performance Evaluation with Noisy Tracker

Figure 4: Performance evaluation of the described methods in ADE (m). We apply synthetic noise to the tracking information. We experiment with an increasing chance of 1 identity switch per track. The performance of the tracking-based methods (track-based CNN and Hybrid) decreases with increasing tracking noise. The track-free model is not affected by tracking noise.
Figure 5: Performance evaluation of the three methods in ADE (m) with synthetic tracking noise of 1% chance. We experiment with an increasing number of identity switches per track. 1 id switch represents an identity switch at only 1 timestamp, 2 id switches represent an identity switch for 2 consecutive timestamps, and N id switches represent 1 identity switch that started at a random timestamp and continued until the end of the scene. The performance of the tracking-based methods (track-based CNN and Hybrid) decreases with increasing tracking noise. The track-free model is not affected by tracking noise.

In these experiments, we evaluate the effect of a noisy tracker on the performance of the three models. Being independent from the tracking system, the performance of the track-free CNN model remains constant in these experiments. We apply synthetic noise to the tracking system and evaluate its effect on the performance of the two tracking-based models. In the first set of experiments, shown in Figure 4, we perform random identity switches with varying chances per track. We vary the probability of an identity switch from 0% to 20% per track. The performance of both the track-based CNN and hybrid model decreases with increasing noise, degrading slightly around 1% chance and then drastically after 2% chance of identity switch per track. Comparing the tracking-based models to the track-free CNN model, the latter obtains better performance on all experiments with noise larger than 0.8%. The drastic decrease in the performance of both the track-based CNN and Hybrid model shows that they both rely heavily on the tracking information to capture the agents’ past movements, and thus the cascaded tracking noise has a large effect on the performance of these models. For experiments with noise chance larger than 2%, the Hybrid model is clearly more robust than the Track-based CNN model. Though the Hybrid model is highly dependent on the LSTM input carrying the tracking information, it also relies on the input image, which is independent from the tracking module; this explains its relative robustness to noise compared to the track-based CNN model.

The second set of experiments, shown in Figure 5, comprises common identity switch scenarios. We set the identity switch chance to 1% per track throughout the experiments and we consider three scenarios: an identity switch happening at a single random timestamp (1 id switch), an identity switch happening for two consecutive timestamps (2 id switches), and an identity switch happening at a random timestamp and continuing until the end of the scene (N id switches). Similarly to the first set of experiments, the performance of both the track-based CNN and hybrid model declines in all three scenarios and falls behind the performance of the track-free model. The Hybrid model, for instance, degrades from 1.345 to 1.485 in ADE when applying 1 id switch with a 1% chance. It then degrades by 35.5% and 59.1% when applying 2 id switches and N id switches respectively. Similar to the findings of the previous experiments, the Hybrid model is more robust to noise compared to the Track-based CNN model when dealing with 2 id switches and N id switches.
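The synthetic noise protocol can be approximated with the short Python sketch below. It is one possible reading of the setup rather than the authors' code: an identity switch is modelled as a track reporting another randomly chosen agent's position for the affected timestamps, and the names `inject_id_switches`, `p`, and `duration` are hypothetical.

```python
import random


def inject_id_switches(tracks, p, duration, rng):
    """Return a noisy copy of `tracks` with synthetic identity switches.

    tracks:   dict mapping track id -> list of (x, y), all the same length.
    p:        chance that a given track suffers a switch (e.g. 0.01 for 1%).
    duration: 1 or 2 timestamps, or None for "until the end of the scene".
    rng:      a random.Random instance, passed in for reproducibility.
    """
    ids = list(tracks)
    noisy = {tid: list(points) for tid, points in tracks.items()}
    if len(ids) < 2:
        return noisy  # nothing to switch with
    n_steps = len(tracks[ids[0]])
    for tid in ids:
        if rng.random() >= p:
            continue  # this track stays clean
        other = rng.choice([o for o in ids if o != tid])
        start = rng.randrange(n_steps)
        end = n_steps if duration is None else min(n_steps, start + duration)
        for t in range(start, end):
            # The track now reports the other agent's position.
            noisy[tid][t] = tracks[other][t]
    return noisy


if __name__ == "__main__":
    clean = {"a": [(0.0, 0.0)] * 5, "b": [(9.0, 9.0)] * 5}
    print(inject_id_switches(clean, p=1.0, duration=1, rng=random.Random(0)))
```

Only the tracking-based inputs (the LSTM states and the displacement fields) would be rebuilt from the noisy tracks; the binary occupancy masks of the track-free model depend on detections alone and are unchanged by this procedure.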

5.4 Performance Evaluation with Real-world Tracker

Method            AT      CT      ADE     (fourth metric)
Track-free CNN    1.241   0.571   1.379   2.577
Track-based CNN   1.268   0.607   1.478   3.013
Hybrid            1.263   0.611   1.485   2.987
Table 2: Overall comparison of the described methods on four metrics (in meters) using the StanfordIPRL-TRI tracker [chiu2020probabilistic]. Real-world trackers introduce noise into the tracking information, which affects the performance of the track-based methods (Track-based CNN and Hybrid).
Figure 6: Performance evaluation of the described methods in ADE (m) using the StanfordIPRL-TRI tracker [chiu2020probabilistic] for agents moving faster than 3 m/s under different scene densities. The performance decrease of the track-based methods is more pronounced with the real-world tracker due to the noise introduced into the tracking information.

In this section, we evaluate the performance of the tracking-based models using a real-world tracker. To this end, we replace the tracking information provided in the dataset with the output of a popular real-world tracker and reproduce, for the tracking-based models, the experiments conducted in § 5.1 and § 5.2. We use the StanfordIPRL-TRI tracker introduced by Chiu et al. [chiu2020probabilistic], running their publicly available code with the parameters suggested in their work. The StanfordIPRL-TRI tracker won the nuScenes challenge competition [caesar2020nuscenes] by achieving state-of-the-art results on the nuScenes dataset.

The results of the Track-based CNN and Hybrid models using the StanfordIPRL-TRI tracker are summarized in Table 2. The track-free CNN model does not depend on the tracking module, so its performance remains the same as in Table 1. The results show a drop in the performance of the two tracking-based models when using the StanfordIPRL-TRI tracker. Compared to the track-free CNN model, the track-based CNN model falls behind by 7.17% in ADE, 2.17% in AT and 6.3% in CT. Similarly, the Hybrid model falls behind by 7.68% in ADE, 1.77% in AT and 7% in CT with respect to the track-free CNN model. These findings suggest that even state-of-the-art trackers introduce enough noise to hinder motion prediction models considerably. They also show that, unless the tracker produces little to no noise, the tracking-free model performs better than the tracking-based models.
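The relative gaps quoted above can be recomputed from the Table 2 values as a quick sanity check. In this sketch, the assignment of the first three table columns to AT, CT and ADE is an inference from matching the quoted percentages against the numbers, not something the table states; the fourth column is left out because its metric name is not given here. The helper name `relative_gaps` is hypothetical.

```python
# Table 2 values in meters. Column order (AT, CT, ADE) is inferred by matching
# the percentages quoted in the text; treat it as an assumption.
TABLE_2 = {
    "Track-free CNN": {"AT": 1.241, "CT": 0.571, "ADE": 1.379},
    "Track-based CNN": {"AT": 1.268, "CT": 0.607, "ADE": 1.478},
    "Hybrid": {"AT": 1.263, "CT": 0.611, "ADE": 1.485},
}


def relative_gaps(table, baseline_name):
    """Percentage increase in error of each method relative to the baseline."""
    baseline = table[baseline_name]
    gaps = {}
    for method, metrics in table.items():
        if method == baseline_name:
            continue
        gaps[method] = {
            name: 100.0 * (value - baseline[name]) / baseline[name]
            for name, value in metrics.items()
        }
    return gaps


if __name__ == "__main__":
    for method, metrics in relative_gaps(TABLE_2, "Track-free CNN").items():
        summary = ", ".join(f"{k}: +{v:.2f}%" for k, v in metrics.items())
        print(f"{method}: {summary}")
```

The recomputed values agree with the quoted figures to within about 0.01 percentage points; the small differences are consistent with the reported percentages being truncated or derived from unrounded errors.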

Further experiments evaluate the models using the StanfordIPRL-TRI tracker on the challenging scenarios described in § 5.2, where we select the agents that moved at a speed higher than 3 m/s. We also consider dense and non-dense scenarios, in which the closest neighbor to the agent is located within a radius of less than 4 meters or more than 10 meters, respectively. The results of this experiment are outlined in Figure 6. We notice that the performance of the tracking-based models drops, as does that of the track-free CNN model. The performance of the Track-based CNN model and the Hybrid model drops by 4% and 3.6%, respectively, compared to the track-free CNN model. For the dense scenarios, the performance of the track-based CNN drops by 14.8% compared to the "all moving agents" scenario (§ 5.2), while that of the track-free model drops by 12.6%. These results suggest that, although performance decreases across all three models due to the complexity of the scenarios, the track-free model still performs best.
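The scenario selection described above amounts to two simple filters, sketched below. The 3 m/s speed threshold comes from the Figure 6 caption and the 4 m and 10 m radii from the text; the function names, the 2D tuple representation, and the treatment of an agent with no neighbors as non-dense are assumptions made for this illustration.

```python
import math

SPEED_THRESHOLD_MPS = 3.0   # from the Figure 6 caption
DENSE_RADIUS_M = 4.0        # nearest neighbor closer than this: dense
NON_DENSE_RADIUS_M = 10.0   # nearest neighbor farther than this: non-dense


def is_fast(velocity_xy):
    """True if the agent's speed exceeds the threshold."""
    return math.hypot(velocity_xy[0], velocity_xy[1]) > SPEED_THRESHOLD_MPS


def density_label(agent_xy, others_xy):
    """Classify a scene around one agent by distance to its nearest neighbor.

    Returns "dense", "non-dense", or None when the nearest neighbor lies
    between the two radii (such agents fall into neither subset).
    Assumption: an agent with no neighbors at all is treated as non-dense.
    """
    nearest = min((math.dist(agent_xy, other) for other in others_xy),
                  default=math.inf)
    if nearest < DENSE_RADIUS_M:
        return "dense"
    if nearest > NON_DENSE_RADIUS_M:
        return "non-dense"
    return None


if __name__ == "__main__":
    agent, velocity = (0.0, 0.0), (3.0, 4.0)
    neighbors = [(2.0, 1.0), (15.0, 0.0)]
    print(is_fast(velocity), density_label(agent, neighbors))
```

Agents whose nearest neighbor lies between 4 m and 10 m are excluded from both subsets, which keeps the dense and non-dense groups clearly separated.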

Figure 7: Qualitative evaluation in the case of a crowded scene where an identity switch happened. We show the performance of the track-based CNN model using ground-truth tracking (1), the track-free CNN model (2), and the track-based CNN model using StanfordIPRL-TRI tracker (3), in 2 examples (A) and (B).

In Figure 7, we highlight examples of challenging scenarios, namely crowded scenes where identity switches happened, and compare the performance of the Track-based CNN model using the real-world tracker [chiu2020probabilistic] with the Track-free CNN model and with the Track-based CNN model using the ground-truth tracking information (no identity switch for this model). Comparing the first and third rows, we see that, in the presence of identity switches, the track-based model using the real-world tracker (third row) degrades relative to the track-based model using the ground-truth tracking (first row). Comparing the second and third rows, neither model performs well in the two scenarios shown. However, the track-free model is more robust to crowded scenes.

6 Conclusion

We propose a comprehensive evaluation of three motion prediction models. The first, the Track-free CNN model, operates on a BEV input created from a high-definition map and agent detections, with no tracking information included. The second, the Hybrid model, extends the first model by integrating the tracking information in the form of LSTM embeddings of the agents' past states. The third, the Track-based CNN model, integrates the tracking information into the BEV input using spatio-temporal displacement fields. We experimentally compare these models across no-noise, synthetic-noise, and real-noise conditions. Our results show that, while the tracking-based models perform better than the track-free model in the noise-free condition, their performance degrades rapidly when the tracking system produces noise, at which point the tracking-free system performs better. From this, we conclude that practitioners should consider tracking-free options, which are more robust, when creating real-world applications.