On Exposing the Challenging Long Tail in Future Prediction of Traffic Actors

03/23/2021 · Osama Makansi, et al. · University of Freiburg

Predicting the states of dynamic traffic actors into the future is important for autonomous systems to operate safely and efficiently. Remarkably, the most critical scenarios are much less frequent and more complex than the uncritical ones. Therefore, uncritical cases dominate the prediction. In this paper, we address specifically the challenging scenarios at the long tail of the dataset distribution. Our analysis shows that the common losses tend to place challenging cases suboptimally in the embedding space. As a consequence, we propose to supplement the usual loss with a loss that places challenging cases closer to each other. This triggers sharing information among challenging cases and learning specific predictive features. We show on four public datasets that this leads to improved performance on the challenging scenarios while the overall performance stays stable. The approach is agnostic w.r.t. the used network architecture, input modality or viewpoint, and can be integrated into existing solutions easily.


1 Introduction

Figure 1: Histogram of the ETH-UCY dataset based on the difficulty of the samples, approximated by the displacement error of a Kalman filter [Kalman1960]. An easy scenario (from the blue head of the distribution) and a challenging scenario (from the red tail) are shown along with the predictions of the state of the art (Traj++ EWTA) and of our approach. Our approach targets the challenging scenarios from the tail and improves their prediction while maintaining good performance on the easy scenarios.

Future prediction in traffic scenarios aims to foresee the future locations of dynamic actors based on their current and previous locations and possibly other information about the environment. For an actor interacting with others, reasoning about possible future locations of the other actors is necessary for path planning and collision avoidance. Given enough data, some recent prediction methods [ewta, trajectron++, fln] predict not just a single future location of the actor but a multimodal distribution over possible future locations.

The average prediction errors of such methods look promising, but they hide that the training and test data is dominated by simple scenarios, where the trajectory can be smoothly propagated into the future. Such scenarios can be handled with a simple Kalman filter or other autoregressive models. However, the most safety-critical scenarios are those that involve close-by dynamic obstacles and require an evasive maneuver. Such scenarios are rare in both the training and the test data. The more complex and safety-critical they are, the less frequent they are. Fatal cases with a collision are not part of the dataset at all.

As an example, the ETH-UCY dataset is often used to benchmark methods for future trajectory prediction. It is considered a challenging dataset, as it includes interacting pedestrians in crowded scenes. Figure 1 shows the histogram of samples in this dataset based on their difficulty approximated by the prediction error of a Kalman filter. The large majority of scenarios can be well modeled by linear extrapolation, whereas scenarios that require more complex modeling are rare. The depicted challenging scenario showcases a pedestrian (red box), who will turn right in the future to avoid a collision with the stationary pedestrians (black boxes) in front of them.

In this paper, we explicitly address the long-tailed data distribution in future prediction and focus on the rare but important cases rather than the average case. Straightforward ideas to re-balance the dataset by undersampling the frequent scenarios [undersample1, undersample2] or by reweighting the loss for these samples [reweight4-inverse-effective] are not viable solutions, since they would reduce the (effective) dataset size dramatically. One can also oversample the challenging scenarios during training [oversample4, oversample6-m2m], yet this repetition of the same rare samples leads to overfitting and does not perform well, as we show in our experiments. Some works have tried to simulate rare cases [forking, simaug]. However, to date, even the most realistic simulations suffer from the domain gap between the simulated and the real world. An interesting direction for dealing with imbalanced data has been presented by Cao et al., who proposed a loss that ensures larger margins for the minority classes [ldam-loss].

We pick up this general idea and propose to reshape the feature embedding of the predictor. We show in a detailed analysis of the feature space that, with the usual loss, the challenging examples get placed next to many normal cases. Consequently, the relevant information of these samples gets smoothed out. As we push the challenging scenarios to be in proximity in the embedding, more of these samples that share a similar scenario build a small cluster and are no longer ignored. With this approach we can predict the future trajectory of interacting pedestrians better; see blue trajectory in Figure 1.

Our contributions can be briefly summarized as follows. (1) We analyze the problem of long-tailed data distributions in future prediction for the first time. (2) We propose a novel joint optimization of the regular regression loss for predicting the future location and a loss that reshapes the feature embedding in favor of the long-tail samples. (3) We show that multi-headed networks outperform cVAEs in addressing the multimodal nature of the future.

The proposed approach is easy to integrate into existing approaches, since it is agnostic to the network architecture, viewpoint, and input modalities. We demonstrate this by evaluating on four diverse public datasets. On each of them, the method improves the prediction quality of the challenging cases, while maintaining the quality on simple cases. Our source code will be released upon publication.

2 Related Work

Future prediction. Deep learning methods dominate future prediction. LSTMs [socialLSTM, CIDNN, srlstm, ContextAware, SceneLSTM, CarNet, csp] were mostly used to model the states of the agents over time, while graph-based approaches [RSBG] were used to model the interactions between agents. However, these methods cannot handle the multimodal nature of the future. Meanwhile, several works addressed the multimodality in future prediction by cVAEs [desire, pecnet], GANs [SocialGAN, Sophie, AgentTensor, socialBiGAT, reciprocal], nonparametric approaches [forking, Relation], or a sampling-fitting framework [ewta]. Recently, graph neural networks [stgcnn, trajectron++, evolvegraph, spagnn, lanegcn] and transformers [star] have become popular for modeling the agent interactions. All aforementioned works assume that the scene is static and is observed from a bird's-eye view. Among these, Trajectron++ [trajectron++] currently performs best.

In automotive settings, the observation is typically from an egocentric view (e.g., with cameras or LiDAR mounted on the vehicle). This introduces new challenges due to the large egomotion of the vehicle and the narrow field of view. Multiple works project the data to the bird's-eye view using expensive 3D sensors [Kinematic, Uber, Surround, Infer, TrafficPredict, Precog, Drogon]. Some recent approaches work directly on the egocentric view. Deterministic approaches [Sted, Dtp] modeled the motion of the scene via optical flow. Yao et al. [Ego] proposed to use the planned egomotion to improve the predictions. TraPHic [Traphic] exploited the interaction between nearby heterogeneous objects via LSTMs. Some works also tackled the multimodality in future prediction by using Bayesian RNNs to sample multiple futures with uncertainties [Bayesian, Nemo]. Titan [titan] modeled the future as a bi-variate Gaussian and conditioned the learning process on a set of labelled prior actions to further improve the prediction. Makansi et al. [fln] proposed a three-staged framework, FLN-RPN, which currently performs best in the egocentric view.

None of the above approaches addressed the long tail of the data distribution. We base our method on Trajectron++ [trajectron++] in the bird’s-eye setting and FLN-RPN [fln] for the egocentric setting, and specifically address the challenging cases in the long tail of the dataset distribution for the first time.

Learning on imbalanced datasets. Issues with the long tail of a dataset have been well studied for classification tasks. Many works tackled the issue from the data side. A common approach is oversampling of rare classes [oversample0, oversample7, oversample2]. Another option is undersampling of the most frequent classes [undersample1, undersample2]. Several works follow the idea of generating more samples of the minority classes by simulation, which can be considered a more sophisticated version of oversampling [smote, oversample3, oversample5, oversample6-m2m]. Instead of changing the number of samples, samples can also be reweighted in the loss [reweight1-inverse-freq, reweight4-inverse-effective, reweight-by-hardness_0, reweight-by-hardness_1]. Some works proposed to learn these weights [reweight3, reweight5-meta-learning]. Recently, Li et al. [softmax-balanced] grouped classes of similar sizes and learned group-wise classifiers.

Another idea is to design loss functions that affect the feature space by increasing the inter-class distance and reducing the intra-class distance [range-loss, reweight1-inverse-freq]. Enlarging the margins of the minority classes in this way leads to better generalization [crl-loss, ldam-loss, metric-learning-uncertainty, affinity-loss]. Similarly, contrastive learning has become very popular due to promising results on self-supervised feature learning with noise-contrastive objectives [nce, simclr, infonce]: augmented versions of a sample (positives) are forced to be separated from other samples (negatives) [exemplarCNN]. Recently, contrastive learning enabled learning stronger feature extractors for classifying long-tailed datasets [Yang2020RethinkingTV].

All these methods were applied to classification tasks, where there is an explicit distinction between frequent and rare classes. Our approach also augments the loss to reshape the data distribution in the embedding space, yet we do not rely on predetermined clusters, since we address a regression task. Given the flexibility of contrastive learning in defining losses via positive and negative samples, we adopt a novel way of embedding the samples based on their difficulty, as measured by the performance of a Kalman filter, and combine the reshaping of the embedding space with the regular regression loss. For the sake of a fair comparison, we also adapt previous methods tailored for classification and use them in conjunction with the regression loss, as detailed in Section 6.5.

Figure 2: Feature space for the UNIV scene from the ETH-UCY dataset visualized with t-SNE [tsne]. (a) Training only with the supervised objective for future prediction (e.g., EWTA). Rare challenging scenarios (large bright green circles) are scattered among the frequent easy scenarios (small dark blue circles). We zoom into two challenging scenarios (1,4) and two easy scenarios (2,3). (b) Joint learning with the supervised (EWTA) and the contrastive loss. The challenging scenarios form two sub-spaces where they can share relevant features. The two challenging examples (1,4) are close and benefit from each other, which improves their future predictions considerably (particularly for 1). (c) Only contrastive learning is used: all challenging scenarios are strictly mapped to the same location. This destroys the task-relevant cues and cannot provide any future prediction.

3 Future Prediction

Given the current and past observations $O = \{o_{t-h}, \dots, o_t\}$, where $h$ is the length of the history, future prediction aims to predict the true state $Y_\tau$ of the actor of interest at times $\tau = t+1, \dots, t+T$ in the future. An observation $o_\tau$ at a single time step can consist of the 2D location $p_\tau$, a map $M$ of the environment, a bounding box $b_\tau$ of the actor of interest, an RGB image $I_\tau$, a semantic segmentation $S_\tau$, or the future egomotion $E$. For future trajectory prediction, the state is defined as the future trajectory $\{p_{t+1}, \dots, p_{t+T}\}$, and for future localization prediction as the future bounding box $b_{t+T}$.

We address the long tail of the data distribution in both the bird's-eye view and the egocentric setting. As backbones for these two settings, we use Trajectron++ [trajectron++] and FLN-RPN [fln], respectively.

3.1 Bird’s-Eye View - Trajectron++

Trajectron++ [trajectron++] is the state-of-the-art method for future trajectory prediction in bird's-eye view. It takes the dynamic actors, the static environment, and heterogeneous input data into account. Given the past trajectories $\{p_{t-h}, \dots, p_t\}$ of the actors, and optionally a map $M$ of the scene, Trajectron++ builds a directed spatiotemporal graph for the scene based on its topology and predicts the future trajectories $\{p_{t+1}, \dots, p_{t+T}\}$. The nodes of the graph represent the actors, and the edges represent their interactions. The actors' histories are modeled by LSTMs, features of interacting actors are aggregated via point-wise summation, and GRUs are used to decode the future trajectories. The original architecture employs a cVAE to produce multiple future trajectories.

Since cVAEs require multiple runs of the decoder to obtain multiple predictions, we replace the cVAE by a multi-hypothesis network trained with EWTA (Evolving Winner-Takes-All) [ewta]. The EWTA loss for every sample in the batch is defined as:

$$\ell_{\text{EWTA}} = \sum_{k=1}^{K} w_k \sum_{\tau=t+1}^{t+T} \left\lVert \hat{Y}^k_\tau - Y_\tau \right\rVert, \qquad (1)$$
$$w_k = \mathbb{1}\!\left[\hat{Y}^k \in \operatorname{top-}M\big(\{\hat{Y}^j\}_{j=1}^{K}\big)\right], \qquad (2)$$

where $K$ is the number of estimated hypotheses, $\mathbb{1}[\cdot]$ is the indicator function that returns $1$ if the condition is true and $0$ otherwise, and $\hat{Y}^k_\tau$ and $Y_\tau$ denote the $k$-th predicted future state and the ground truth at future time step $\tau \in \{t+1, \dots, t+T\}$, respectively. The $\operatorname{top-}M$ operator returns the $M$ hypotheses closest to the ground truth, where $M$ gradually decreases from $K$ to $1$ during training. While all hypotheses are penalized at the beginning of training, only the best one is penalized at the end. Trajectron++ augmented with EWTA (Figure 3 (top)) produces multiple future trajectories in a single network pass and outperforms the standard Trajectron++, as we show in Section 6.5.
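For concreteness, the sketch below gives a minimal PyTorch implementation of this loss under our own assumptions about tensor shapes (hypotheses of shape [B, K, T, 2]); the function and variable names are ours, and the reduction details may differ from the authors' implementation.

```python
import torch

def ewta_loss(hyps, gt, m):
    """Evolving Winner-Takes-All loss (sketch).

    hyps: predicted hypotheses, shape [B, K, T, 2]
    gt:   ground-truth future trajectory, shape [B, T, 2]
    m:    number of winners penalized in the current training stage
          (decreased from K to 1 over the course of training)
    """
    # Per-timestep L2 displacement error of each hypothesis, summed over the horizon.
    dists = torch.norm(hyps - gt.unsqueeze(1), dim=-1).sum(dim=-1)  # [B, K]
    # Penalize only the m hypotheses closest to the ground truth.
    top_m, _ = torch.topk(dists, m, dim=1, largest=False)           # [B, m]
    return top_m.sum(dim=1).mean()
```

In the evolving schedule, `m` starts at the full number of hypotheses K and is reduced stage by stage until only the single best hypothesis is penalized.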

3.2 Egocentric View - FLN-RPN

FLN-RPN [fln] is the state-of-the-art method for future localization prediction in the egocentric setting. FLN-RPN predicts a multimodal distribution over the future localization of an actor in three steps. First, it predicts where an actor is most likely to be located in the current image (reachability prior). Second, it transfers the reachability prior from the current frame to the future frame using the future egomotion $E$. Finally, the past bounding boxes $\{b_{t-h}, \dots, b_t\}$ of the actor, the images and semantic segmentations at time steps $t-h, \dots, t$, the future egomotion $E$, and the predicted future reachability prior are given to the network to predict the future localization of the actor of interest. The prediction has the form of a set of bounding box hypotheses at the future time step. The two key components of FLN-RPN are the reachability prior, which helps overcome mode collapse, and the EWTA loss (Eq. 1, with the future state being a bounding box instead of a trajectory), which learns multiple diverse future states in a single forward pass. Figure 3 (bottom) illustrates the FLN-RPN framework.

3.3 Difficulty Ranking

Before we explore the effects of the distribution of the challenging scenarios in the feature space on the final prediction, we need to know how challenging each scenario is. Since manual labeling is not a viable option, we use a common and simple metric to measure the difficulty of cases: the displacement error made by the Kalman Filter [Ego, fln, Kalman1960] on this sample. Small errors indicate good approximation with linear extrapolation, whereas large errors indicate a challenging scenario that requires complex nonlinear prediction.
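As an illustration, the following sketch computes such a difficulty score with a constant-velocity Kalman filter in NumPy; the state parametrization and the noise settings are our assumptions, not necessarily those used by the authors.

```python
import numpy as np

def kalman_difficulty(history, future):
    """Difficulty score of one sample: displacement error of a
    constant-velocity Kalman filter extrapolated over the future horizon.

    history: observed 2D positions, shape [H, 2]
    future:  ground-truth future positions, shape [T, 2]
    """
    dt = 1.0
    F = np.array([[1, 0, dt, 0],          # constant-velocity transition
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Hm = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0]], dtype=float)   # observe positions only
    Q = np.eye(4) * 1e-2                          # process noise (assumed)
    R = np.eye(2) * 1e-1                          # measurement noise (assumed)
    x = np.array([history[0, 0], history[0, 1], 0.0, 0.0])
    P = np.eye(4)

    # Filter over the observed history.
    for z in history[1:]:
        x, P = F @ x, F @ P @ F.T + Q             # predict
        S = Hm @ P @ Hm.T + R
        K = P @ Hm.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ (z - Hm @ x)                  # update with the observation
        P = (np.eye(4) - K @ Hm) @ P

    # Extrapolate into the future and measure the displacement error.
    preds = []
    for _ in range(len(future)):
        x = F @ x
        preds.append(x[:2])
    preds = np.array(preds)
    return float(np.linalg.norm(preds - future, axis=1).mean())
```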

4 Why are Hard Cases Ignored by the Model?

To understand the cause of the problem with samples from the long tail of the data distribution, we visualized the feature embedding of the data from a network trained with a supervised future prediction objective and analyzed particular cases in detail. Figure 2 (left) shows the feature space for the UNIV scene in the ETH-UCY dataset projected to 2D with t-SNE [tsne]. Each dot is a sample from the scene mapped to the feature space by the network trained with EWTA loss (Eq. 1). Hard cases are sprinkled among the easy cases in the feature space without any structure. A closer look at a hard case (1) reveals that it shares some similarity with corresponding easy cases (2,3), which explains its position in the embedding, but the relevant cues, in which it is different from the easy cases, get ignored with normal training. The sample should rather be close to another challenging example (4) to capture the social interaction, where pedestrians walk in groups and follow other groups. We believe that challenging scenarios being alone in a manifold full of easy scenarios causes the network to ignore them and base its decisions on shortcuts learned from the dominant easy scenarios. The network does not get a chance to learn dedicated features to solve challenging cases by reusing some common cues among them (1,4), as long as they get mixed up with the easy cases. Indeed, the prediction for case (1) is quite wrong since it is similar to the prediction of cases (2,3), where social interaction is missing.

5 Reshaping the Embedding with Contrastive Learning

The analysis from the previous section triggers the idea to push hard samples away from the easy ones, such that the relevant cues shared by similar hard samples are no longer ignored during training. We implement this idea with an additional contrastive loss. Contrastive learning enforces certain training samples (positives) to be closer in the embedding to a sample (anchor) than others (negatives). There are multiple ways to express this in a loss. The most popular is

$$\ell_{\text{contrastive}} = \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{j=1}^{N} \mathbb{1}[j \neq i]\, \exp(z_i \cdot z_j / \tau)}, \qquad (3)$$

where $z_i$ is the learned feature vector at the bottleneck of the network (see Figure 3), $P(i)$ is the positive set of anchor $i$, $\mathbb{1}[\cdot]$ is the indicator function that returns $1$ if the condition is true and $0$ otherwise, $N$ is the total number of samples in the batch, $|P(i)|$ is the number of positive samples for the anchor $i$, and $\tau$ is the temperature parameter. Positive samples are often defined as augmented versions of the same image [simclr] or samples belonging to the same class [sup-con]. Negative samples, on the other hand, are other samples in the batch that do not satisfy the positive criterion by some safe margin. Since our goal is to distribute the features based on the difficulty, we define the positive set as the set of samples in the batch whose difficulty score $d_j$ satisfies $d_j \geq \delta_{pos}$, where $\delta_{pos}$ is a hyper-parameter defining the positivity threshold. Similarly, the negative samples are defined as all samples whose difficulty score satisfies $d_j \leq \delta_{neg}$. Note that we use different thresholds $\delta_{pos} > \delta_{neg}$, implying that many samples in the batch are neither positive nor negative. In order to minimize this loss, the network must maximize the numerator and minimize the denominator. In doing so, it learns to map the positive samples close together in the feature space and the negative ones apart. The result of training with such a loss alone is shown in Figure 2 (right).
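The snippet below sketches this difficulty-based contrastive loss in PyTorch, following the reconstruction of Eq. (3) above. Restricting the anchors to the hard samples, the threshold and argument names, and the default temperature are our assumptions and may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def difficulty_contrastive_loss(z, difficulty, pos_thresh, temp=0.1):
    """Difficulty-based contrastive loss (sketch of Eq. 3).

    z:          bottleneck feature vectors, shape [N, D]
    difficulty: Kalman-filter displacement errors, shape [N]
    pos_thresh: samples above this difficulty form the positive (hard) set
    """
    z = F.normalize(z, dim=1)
    sim = torch.exp(z @ z.t() / temp)                      # [N, N] pairwise similarities
    n = z.shape[0]
    not_self = ~torch.eye(n, dtype=torch.bool, device=z.device)
    is_hard = difficulty >= pos_thresh

    loss, n_anchors = 0.0, 0
    for i in range(n):
        pos = is_hard & not_self[i]                        # positives of anchor i
        if not is_hard[i] or pos.sum() == 0:
            continue                                       # assumed: only hard anchors with >= 1 positive
        denom = sim[i][not_self[i]].sum()                  # all other samples in the batch
        loss = loss + (-torch.log(sim[i][pos] / denom)).mean()
        n_anchors += 1
    return loss / max(n_anchors, 1)
```

Samples whose difficulty falls between the two thresholds simply never act as positives; how exactly they enter the denominator is a design choice this sketch does not pin down.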

While pushing the hard cases together is good for them to share relevant cues and to learn prediction models for less common scenarios, there is much diversity among hard cases, and not all of them should be pushed to the same place. In particular, we should not destroy cues shared with the easy examples, which are necessary for the network to solve the actual task; the contrastive loss alone cannot predict the future state. To this end, we jointly optimize the supervised future prediction loss and the self-supervised contrastive loss:

$$\ell = \ell_{\text{EWTA}} + \lambda\, \ell_{\text{contrastive}}, \qquad (4)$$

where $\lambda$ controls the importance of the contrastive loss, and hence the strength of the attraction that pulls hard cases together.
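A possible training step combining both terms is sketched below, reusing the loss sketches above; the interface of `model` (returning hypotheses and bottleneck features) and the batch keys are our assumptions for illustration only.

```python
def training_step(model, batch, optimizer, m, lam, pos_thresh):
    # Hypothetical interface: the network returns K hypotheses and the bottleneck features z.
    hyps, z = model(batch["inputs"])
    loss = ewta_loss(hyps, batch["future"], m) \
        + lam * difficulty_contrastive_loss(z, batch["difficulty"], pos_thresh)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```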

Figure 2 (middle) shows the effect of this combination. Cases (1) and (4) fall into the same sub-space resulting in a much better prediction for (1). Other hard cases rather stay with similar easy samples as they have no other hard cases to share information with.

Due to its simplicity, this difficulty-based contrastive learning can be added to any existing method as long as the difficulty can be defined explicitly on the training set.

Figure 3: Schematic that shows how we flexibly integrate the contrastive loss (red) in existing future prediction frameworks. (a) Bird’s-eye view (Trajectron++[trajectron++]). (b) Egocentric view (FLN-RPN[fln]). Independent of the contrastive loss, we modified Trajectron++ by replacing the cVAE with the EWTA [ewta] framework to better capture the multimodality of the predicted future and for faster inference time. The map encoder (dashed gray) is optional and only used for the nuScenes dataset.

6 Experiments

6.1 Datasets

The ETH-UCY dataset is the combination of the ETH [eth] and UCY [ucy] pedestrian datasets. Both include videos of pedestrians recorded from a bird's-eye view, with manually annotated trajectories. The challenges in these datasets are the frequent interactions between pedestrians, as the scenes are very crowded, and the lack of visual information due to the viewpoint, i.e., the actors are small and uninformative. We present 5-fold cross-validation results on the five scenes of the dataset.

nuScenes [nuscenes] is a large autonomous driving dataset with 1000 scenes, each 20 seconds long. It provides HD semantic maps with 11 different layers and accurate bounding box annotations over time. It contains scenarios from both bird's-eye view and egocentric view, and we experiment with both.

Waymo [waymo] is a recent large autonomous driving dataset with 1000 scenes, each 20 seconds long. We use the validation part of the dataset (202 scenes) to show zero-shot transfer of our approach in the egocentric view (i.e., without retraining the model).

6.2 Evaluation Metrics

min-ADE is the minimum average displacement error. It computes the mean distance between all predicted trajectories and the ground truth and reports the error of the closest one. This is sometimes also referred to as oracle (or best-of-many), since the selection of the minimum error depends on the ground truth.

min-FDE is the minimum final displacement error. It computes the distance between the final locations of the predicted trajectories and the ground truth at the end of the prediction horizon ($t+T$) and, like min-ADE, reports the minimum over the hypotheses.
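Both metrics can be computed as in the following sketch (our naming and shapes); it assumes K hypotheses per sample and takes the minimum over hypotheses independently for ADE and FDE.

```python
import torch

def min_ade_fde(hyps, gt):
    """min-ADE and min-FDE over K predicted trajectories (sketch).

    hyps: predicted trajectories, shape [B, K, T, 2]
    gt:   ground-truth trajectory, shape [B, T, 2]
    """
    dists = torch.norm(hyps - gt.unsqueeze(1), dim=-1)      # [B, K, T] per-step errors
    ade = dists.mean(dim=-1)                                 # average over the horizon
    fde = dists[..., -1]                                     # error at the final step t+T
    min_ade = ade.min(dim=1).values.mean()                   # best hypothesis per sample
    min_fde = fde.min(dim=1).values.mean()
    return min_ade.item(), min_fde.item()
```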

6.3 Training Details

In our experiments in bird's-eye view, we followed the original training schedule of Trajectron++ [trajectron++]. We trained Trajectron++ (EWTA) with batch size 256, for 100 epochs per EWTA stage on ETH-UCY and for 5 epochs per EWTA stage on nuScenes. For the experiments in egocentric view, we used ResNet34 [resnet] as the encoder of FLN-RPN [fln] and trained with a batch size of 32. Following [trajectron++, fln], we set the number of predicted future steps to 12, 6, and 1 with a step size of 0.4s, 0.5s, and 3.0s for ETH-UCY, nuScenes (bird's-eye view), and nuScenes/Waymo (egocentric view), respectively. The remaining design choices were kept as in the original papers [trajectron++, fln]. For our joint optimization, the contrastive weight $\lambda$ was chosen based on the validation set as 1, 50, and 150 for nuScenes (bird's-eye view), ETH-UCY, and nuScenes (egocentric view), respectively. For the temperature $\tau$ we used the value recommended in [simclr]. The thresholds $\delta_{pos}$ and $\delta_{neg}$ were set such that the ratios of positives and negatives over the batch size are 10% and 40% for Trajectron++ and 33% and 33% for FLN-RPN, respectively. The feature vector $z$ had dimension 232 for Trajectron++ and 256 for FLN-RPN. A study of the effect of the hyper-parameter $\lambda$ is presented in the supplemental material.

6.4 Baselines

Bird’s-eye view (ETH-UCY). We selected a set of recent methods addressing future trajectory prediction: the graph-based approaches RSBG [RSBG], S-STGCNN [stgcnn], and Trajectron++ [trajectron++] (referred to as Traj++); the transformer-based approach STAR [star]; and the multi-stage networks TPNet [tpnet] and PECNet [pecnet].

Bird’s-eye view (nuScenes). We compare against a set of baselines including deterministic LSTM-based approaches: S-LSTM [socialLSTM], CSP [csp], and CAR-Net [CarNet]; multimodal graph-based approaches: SpAGNN [spagnn] and Trajectron++ [trajectron++].

Egocentric view. We compare against the multimodal state-of-the-art FLN-RPN [fln].

Moreover, for all settings and datasets, we implemented the common approaches for imbalanced data: resampling [oversample0], reweighting using the inverse class frequency [reweight1-inverse-freq], and reweighting using the effective number of samples [reweight4-inverse-effective]. We also adapt sophisticated long-tail classification methods [ldam-loss, softmax-balanced] to the considered task by defining classes based on a discretization of the Kalman filter scores. The network is then jointly trained on the regression loss and the respective classification loss (more details are provided in the supplementary). Note that the recent methods cRT, τ-norm, and LWS introduced by Kang et al. [sota-class-long-tailed] cannot be adapted to regression tasks, since they do not affect the feature extractor and rely entirely on post-processing the classifier, which is not needed at test time in our scenario.

6.5 Results & Discussion

To show the validity of the proposed approach, we selected strong baselines and state-of-the-art methods for comparison. Tables 1, 2 and 3 summarize our results on the four different datasets. Since we are interested in improving the quality of the predictions of the rare cases, we report min-ADE and min-FDE for all samples, as well as the top 1-3% challenging cases.

EWTA vs cVAE. Tables 1 and 2 show that our base method, where we use the Trajectron++ as the backbone with the EWTA objective, clearly outperforms the previous state-of-the-art Trajectron++. This shows that EWTA-based sampling for possible future trajectories works better than cVAE-based sampling.

All Top 3% Top 2% Top 1%
RSBG [RSBG] 0.48/0.99 -/- -/- -/-
Reciprocal [reciprocal] 0.44/0.90 -/- -/- -/-
TPNet [tpnet] 0.42/0.90 -/- -/- -/-
S-STGCNN [stgcnn] 0.44/0.75 -/- -/- -/-
STAR [star] 0.26/0.53 -/- -/- -/-
PEC-NET [pecnet] 0.29/0.48 -/- -/- -/-
Traj++ [trajectron++] 0.21/0.41 0.65/1.42 0.71/1.51 0.58/1.23
Traj++ EWTA (ours) 0.16/0.32 0.47/1.07 0.51/1.13 0.42/0.87
+ LDAM [ldam-loss] 0.17/0.33 0.47/1.04 0.50/1.08 0.42/0.83
+ LDAM-DRW [ldam-loss] 0.17/0.33 0.47/1.04 0.51/1.08 0.43/0.83
+ BAGS [softmax-balanced] 0.17/0.32 0.48/1.08 0.51/1.10 0.42/0.85
+ contrastive (ours) 0.16/0.32 0.46/1.03 0.48/1.03 0.38/0.71
Table 1: Average error on the ETH-UCY benchmark over all test samples and over the 1-3% most challenging scenarios in the format of (min-ADE/min-FDE). Joint learning with the contrastive loss yields large improvements on the challenging scenarios while not harming the overall average accuracy.
All Top 3% Top 2% Top 1%
S-LSTM [socialLSTM] -/1.61 -/- -/- -/-
CSP [csp] -/1.50 -/- -/- -/-
CAR-Net [CarNet] -/1.35 -/- -/- -/-
SpAGNN [spagnn] -/1.23 -/- -/- -/-
Traj++ [trajectron++] 0.22/0.39 0.55/0.98 0.60/1.04 0.72/1.21
Traj++ EWTA (ours) 0.19/0.32 0.48/0.88 0.50/0.88 0.59/1.02
+ LDAM [ldam-loss] 0.18/0.32 0.48/0.88 0.51/0.93 0.60/1.10
+ LDAM-DRW [ldam-loss] 0.18/0.32 0.50/0.93 0.52/0.96 0.63/1.14
+ BAGS [softmax-balanced] 0.18/0.31 0.48/0.88 0.51/0.94 0.61/1.11
+ contrastive (ours) 0.18/0.30 0.44/0.73 0.46/0.72 0.54/0.85
Table 2: Average error on the nuScenes dataset (bird’s eye view) over all test samples and over the 1-3% most challenging scenarios in the format of (min-ADE/min-FDE). Joint learning with the contrastive loss yields large improvements on the challenging scenarios and even improves the overall average accuracy a little.
nuScenes Egocentric View Waymo Egocentric View
All Top 3% Top 2% Top 1% All Top 3% Top 2% Top 1%
FLN-RPN [fln] 7.10 29.98 31.13 36.16 6.39 24.87 25.49 27.32
+ LDAM [ldam-loss] 8.04 25.23 26.02 31.13 7.61 23.00 23.09 25.05
+ LDAM-DRW [ldam-loss] 8.01 26.63 27.85 34.58 8.05 25.23 25.98 29.32
+ BAGS [softmax-balanced] 7.28 29.54 30.38 35.74 6.67 24.45 24.88 26.66
+ contrastive (ours) 7.04 25.05 25.26 27.49 6.49 22.36 22.72 24.09
Table 3: Results on egocentric datasets (nuScenes and Waymo). We show the min-FDE over all scenarios and over the top 1-3% challenging scenarios. Our approach yields an improvement on the challenging scenarios while maintaining the performance on average.

Large improvements on the challenging cases. Results on all datasets show that our approach yields large improvements on the challenging cases (particularly on the top 1%) while maintaining the overall average error. On the most challenging cases (top 1%), our approach improves by 18%, 17%, 23%, and 12% on ETH-UCY, nuScenes (bird's-eye view), nuScenes (egocentric view), and the Waymo open dataset, respectively. The challenging training samples, as hypothesized, help each other when they are in proximity in the feature space. Notably, the studied datasets differ in their input modalities (additional semantic maps for nuScenes), viewpoint (bird's-eye vs. egocentric), and prediction output (2D points in bird's-eye view, bounding boxes in egocentric view). This indicates that the approach is agnostic to the input modalities and generalizes well.

Comparison to long-tail classification baselines. Tables 1, 2 and 3 show a comparison against recent methods addressing the long-tail problem in classification. Our method based on the contrastive loss outperforms all these techniques on all metrics.

Zero-shot transfer. Results on the Waymo dataset (Tab. 3) show promising zero-shot transfer to an unseen dataset: the models were trained on the nuScenes dataset and tested on the validation split of the Waymo dataset.

Avoids bias. In Table 4 we compare our method against the common approaches for imbalanced data: resampling and reweighting. We report, across all datasets, the performance over all samples and over the most challenging samples (top 1%). As expected, these baselines bias the model towards the challenging cases. Hence, their average performance drops significantly (by 66%, 16%, 44%, and 64% for ETH-UCY, nuScenes bird's-eye view, nuScenes egocentric view, and Waymo, respectively). Our method, on the other hand, maintains the average performance over all samples. Detailed results on all metrics and difficulties are provided in the supplementary.

ETH-UCY nuScenes-B nuScenes-E Waymo
All/Top 1% All/Top 1% All/Top 1% All/Top 1%
Baseline 0.32/0.87 0.32/1.02 07.10/36.16 06.39/27.32
+ resample [oversample0] 0.53/1.22 0.37/1.33 10.20/21.62 10.48/19.69
+ reweight [reweight1-inverse-freq] 0.56/0.76 0.58/1.67 14.47/16.20 14.00/16.44
+ reweight [reweight4-inverse-effective] 0.56/0.78 0.60/1.71 16.54/15.46 17.43/18.79
+ contrastive 0.32/0.71 0.30/0.85 07.04/27.49 06.49/24.09
Table 4: Comparison to the common resampling/reweighting techniques on the four datasets. For each method, we show the min-FDE over all samples and over top 1% challenging samples. Our method yields large improvements on the challenging ones while maintaining the average. This is in contrast to the reweighting/resampling baselines, which lead to much worse performance on average. Baseline indicates Traj++ EWTA for bird’s eye view and FLN-RPN for egocentric view.

6.6 Qualitative Results

In Figure 4 (a), we show three challenging examples from the ETH-UCY dataset. In all cases, the future trajectory of the pedestrian (red) is not trivial, and the network must model the interaction between pedestrians to generate a plausible future trajectory. Our approach (blue) generates trajectories that are much closer to the ground truth than Trajectron++ EWTA (cyan). In Figure 4 (b), we show three challenging examples for vehicles from the nuScenes dataset (bird's-eye view). In these examples, the vehicle changes direction, which requires interpretation of the map. Our approach succeeds on these examples, whereas Trajectron++ EWTA misses these cues and predicts a simple continuation of the trajectory.

Figure 5 shows four different examples from the egocentric setting. Figure 5 (a) shows a child crossing the street in front of the vehicle. Figure 5 (b) shows a vehicle that will turn right to go down the street, which is rarely encountered. Figure 5 (c) shows an example that is difficult because of the uncommon egomotion of the car moving to the opposite lane to overtake the bus. Figure 5 (d) shows a vehicle that turns right to exit the round-about. In all these examples, our approach makes predictions close to the ground truth (both in scale and location), whereas the baseline fails.

Limitations and failure cases. We also analyzed the limits of our approach to identify room for further improvements. We found that some challenging cases remain in a manifold of easy cases because they lack similarity to other hard cases (Figure 6 (b)), or easy cases are wrongly moved to a manifold of challenging cases (Figure 6 (c)). In these cases, our approach yields wrong predictions. We also found that our method, like other methods, cannot model unexpected behavior, such as suddenly stopping and turning in the opposite direction (Figure 6 (a)). We also provide the feature embeddings before and after application of our approach for all datasets in the supplementary.

7 Conclusions

We addressed the long-tailed data distributions by acting on the feature embedding. We showed that pulling the rare challenging samples together in the feature embedding via contrastive learning helps improve their final predictions while preserving the performance over the whole dataset. We validated our approach qualitatively and quantitatively on four different datasets, two different viewpoints and different combinations of input and output modalities. The proposed loss can be integrated easily into existing approaches to improve their performance on critical challenging cases. We hypothesize that the concept is generic and could be integrated into other regression tasks with an unbalanced sample distribution, as long as there is a way to identify the underrepresented samples during training.

Figure 4: Qualitative challenging examples for pedestrians from the ETH-UCY dataset (a) and vehicles from the nuScenes bird's-eye view dataset (b). Note how our approach outperforms the SOTA (Trajectron++ EWTA) by generating a future trajectory closer to the ground truth. We visualize the best hypothesis for each method. For the examples from nuScenes (b), we show the underlying map about which the method needs to reason.
Figure 5: Qualitative challenging examples from the Waymo open dataset (a-b) and the nuScenes egocentric view (c-d). For each example, we show both the last observed image (top) and the future image (bottom) along with the predictions (FLN-RPN [fln] and ours) and the ground truth. We visualize the best hypothesis for each method. The future egomotion is shown as an arrow indicating the motion of the ego-car.
Figure 6: Three examples of different categories of failures of our method. Each example is shown together with three other examples (to its left) from the manifold resulting from our approach. (a) A pedestrian from the ETH-UCY dataset who unexpectedly decided to turn back and go left; such unexpected future behavior is very hard to model. (b) A vehicle from the nuScenes (bird's-eye view) dataset that decided to turn right; our approach is unable to change its manifold. (c) A less challenging example from the nuScenes (bird's-eye view) dataset which our approach mistakenly moves to a challenging manifold.

8 Acknowledgments

Experiments were run on the Deep Learning Cluster funded by the German Research Foundation (INST 39/1108-1). Supported by the German Federal Ministry for Economic Affairs and Energy on the basis of a decision by the German Bundestag.

References

1 Visualization Plots

Figures 7 and 8 show the comparison between our method and different baselines, where each circle indicates the performance of one method. These figures better illustrate the improvements gained by our method (dashed arrows).

Figure 7: Average vs. Top 1% error comparison on the ETH-UCY dataset (left) and the nuScenes bird's-eye view (right). Our base method of integrating EWTA with the backbone of Trajectron++ (cyan) outperforms the previous state of the art (magenta). Joint learning with the contrastive loss (blue) yields large improvements on the challenging scenarios while not reducing the overall average accuracy. The improvements are indicated by dashed arrows. While the resampling/reweighting baselines also improve on the hard cases, they increase the average error a lot (overfitting). The model-based baselines for long-tailed data (LDAM and BAGS) yield only small improvements on ETH-UCY or worse performance on nuScenes bird's-eye view.
Figure 8: Average vs. Top 1% error comparison on the nuScenes egocentric view dataset (left) and the Waymo open dataset (right). Our approach utilizing the contrastive loss (blue) yields a significant improvement on the challenging scenarios while not reducing the overall average accuracy. The improvements are indicated by dashed arrows. While the resampling/reweighting baselines also improve on the hard cases, they increase the average error a lot (overfitting). The model-based baselines for long-tailed data (LDAM and BAGS) yield smaller improvements than our method.

2 Feature Space Visualization

Figure 9 shows the projection of the feature space using t-SNE [tsne] on three different datasets with different input modalities and views. For each dataset, we show the feature space embedding without our joint optimization (i.e., only the supervised loss) and with our joint optimization (i.e., additionally utilizing the contrastive loss). Note how our approach reshapes the feature space by pushing the challenging scenarios closer together so that they can benefit from each other, as also shown in our quantitative results.

Figure 9: Plot of the feature space using t-SNE [tsne] on three different datasets (a and b are different scenes from the ETH-UCY dataset). Top: training only with the supervised regression loss. Bottom: the resulting feature space when trained jointly with the contrastive loss. Large bright circles indicate the top 1% challenging scenarios; the darker the color of a sample, the easier it is.

3 Effect of the Strength of the Contrastive Loss

In Table 5 we study the weight of the contrastive loss ($\lambda$ in Eq. (4)) used in our approach. Using a small factor leads to small improvements on the challenging scenarios, as the force reshaping the feature space is rather weak. On the other hand, using a very large factor yields worse results, as the network focuses more on reshaping the feature space and ignores the important cues for the actual task, which are learned from the supervised loss. Note that this study is only meant to show the effect of the weight of the contrastive loss; in our main results, we use the validation set to select the best value for $\lambda$.

4 More Qualitative Results

We provide more qualitative results from our approach in Figure 10, Figure 11, and Figure 12 for the ETH-UCY, nuScenes (bird's-eye view), and nuScenes/Waymo (egocentric view) datasets, respectively.

ETH-UCY (AVG)
All Top 3% Top 2% Top 1%
Traj++ EWTA (ours) 0.16/0.32 0.47/1.07 0.51/1.13 0.42/0.87
+ contrastive (small λ) 0.17/0.33 0.47/1.04 0.50/1.07 0.43/0.84
+ contrastive (λ = 50) 0.16/0.32 0.46/1.03 0.48/1.03 0.38/0.71
+ contrastive (large λ) 0.17/0.32 0.48/1.04 0.52/1.10 0.50/0.97
Table 5: Study of the hyper-parameter $\lambda$ on the ETH-UCY dataset. While a small $\lambda$ yields a small improvement on the challenging scenarios, a large $\lambda$ yields larger errors on the challenging scenarios.
Figure 10: More results from our approach on the ETH-UCY dataset. For all these challenging scenarios, our approach reasons successfully about the social relations to other pedestrians and yields better prediction than the baseline.
Figure 11: More results from our approach on the nuScenes dataset (bird’s-eye view). For all these challenging scenarios, our approach reasons successfully about the semantic cues and predicts the correct trajectory.
Figure 12: More results from our approach on both egocentric view datasets: nuScenes (a-b) and Waymo (c-d). For each example, we show both the last observed image (top) and the future image (bottom) along with the predictions (FLN-RPN [fln] and ours) and the ground truth. We visualize the best hypothesis for each method. The future egomotion is shown as an arrow indicating the motion of the ego-car.

5 Detailed Quantitative Results

Table 6 shows a detailed comparison between our method and the resampling/reweighting baselines across all datasets, on all metrics and difficulty levels. It supports our finding that these baselines bias the model towards the challenging cases (overfitting), while our approach maintains the average performance and improves substantially on the challenging cases.

ETH-UCY nuScenes-Bird’s Eye View nuScenes Egocentric View Waymo Open Dataset
All Top 3% Top 2% Top 1% All Top 3% Top 2% Top 1% All Top 3% Top 2% Top 1% All Top 3% Top 2% Top 1%
Baseline 0.16/0.32 0.47/1.07 0.51/1.13 0.42/0.87 0.19/0.32 0.48/0.88 0.50/0.88 0.59/1.02 7.10 29.98 31.13 36.16 6.39 24.87 25.49 27.32
+ resample [oversample0] 0.25/0.53 0.56/1.16 0.61/1.24 0.61/1.22 0.21/0.37 0.55/0.98 0.61/1.07 0.78/1.33 10.20 18.90 19.37 21.62 10.48 19.46 18.91 19.69
+ reweight [reweight1-inverse-freq] 0.28/0.56 0.41/0.78 0.44/0.81 0.43/0.76 0.33/0.58 0.74/1.28 0.80/1.38 0.99/1.67 14.47 15.33 15.42 16.20 14.00 17.01 16.80 16.44
+ reweight [reweight4-inverse-effective] 0.28/0.56 0.43/0.83 0.45/0.86 0.44/0.78 0.34/0.60 0.75/1.33 0.80/1.42 0.99/1.71 16.54 15.29 15.34 15.46 17.43 20.34 19.40 18.79
+ contrastive 0.16/0.32 0.46/1.03 0.48/1.03 0.38/0.71 0.18/0.30 0.44/0.73 0.46/0.72 0.54/0.85 7.04 25.05 25.26 27.49 6.49 22.36 22.72 24.09
Table 6: Comparison to the common resampling/reweighting techniques on the four datasets. For each method, we show the min-ADE/min-FDE (min-FDE only for the egocentric datasets) over all samples and over the top 1-3% challenging samples. Our method yields large improvements on the challenging ones while maintaining the average. This is in contrast to the reweighting/resampling baselines, which lead to much worse performance on average (see the error increase in the 'All' columns). Baseline indicates Traj++ EWTA for bird's-eye view and FLN-RPN [fln] for egocentric view.

6 Baselines Implementation Details

In order to use state-of-the-art methods for long-tail classification, we map the regression task to a classification task by assigning classes to training samples based on the error of the Kalman filter. In particular, we group the errors into bins and assign the same class to all samples in a bin. To alleviate the issue of having classes with only one sample, we group all samples with a score greater than a specific threshold into the same bin. This yields 13, 36, and 331 classes for ETH-UCY, nuScenes bird's-eye view, and nuScenes egocentric view, respectively. For all baselines (including our method), we use the same joint training scheme where two heads (classification and regression) are trained on top of the feature embedding. For the LDAM baseline [ldam-loss], we experimented with different scaling factors and use the best-performing setting. Following BAGS [softmax-balanced], we split the classes into 4 homogeneous groups to ensure that all classes from the same group have roughly the same number of items and use a sampling ratio of 8 to ensure that all groups contribute to the mini-batch during training.
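The following sketch illustrates such a binning of the Kalman scores into difficulty classes; the bin width and cut-off arguments are placeholders for illustration, not the values used in the paper.

```python
import numpy as np

def difficulty_to_class(scores, bin_width, max_score):
    """Map continuous Kalman-filter errors to discrete difficulty classes (sketch).

    scores:    Kalman displacement errors of the training samples, shape [N]
    bin_width: width of each difficulty bin
    max_score: samples above this threshold all share the last class
    """
    clipped = np.minimum(scores, max_score)      # merge the sparsely populated tail into one bin
    return (clipped // bin_width).astype(int)    # equal-width binning of the remaining range
```

These class labels then serve as targets for the classification head that is trained jointly with the regression loss when adapting LDAM or BAGS to our setting.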