Dataset, code and model for the CVPR'20 paper "The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction". And for the ECCV'20 SimAug paper.
This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real world trajectory data, and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of the models to predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which just contains one possible future). We will release our data, models and code.
Forecasting future human behavior is a fundamental problem in video understanding. In particular, future path prediction, which aims at forecasting a pedestrian’s future trajectory in the next few seconds, has received a lot of attention in our community [19, 1, 14, 25]. This functionality is a key component in a variety of applications such as autonomous driving [3, 5], long-term object tracking [18, 46], safety monitoring, robotic planning [41, 40], etc.
Of course, the future is often very uncertain: Given the same historical trajectory, a person may take different paths, depending on their (latent) goals. Thus recent work has started focusing on multi-future trajectory prediction [51, 5, 25, 32, 52, 22].
Consider the example in Fig. 1. We see a person moving from the bottom left towards the top right of the image, and our task is to predict where he will go next. Since there are many possible future trajectories this person might follow, we are interested in learning a model that can generate multiple plausible futures. However, since the ground truth data only contains one trajectory, it is difficult to evaluate such probabilistic models.
To overcome the aforementioned challenges, our first contribution is the creation of a realistic synthetic dataset that allows us to compare models in a quantitative way in terms of their ability to predict multiple plausible futures, rather than just evaluating them against a single observed trajectory as in existing studies. We create this dataset using the 3D CARLA simulator, where the scenes are manually designed to be similar to those found in the challenging real-world benchmark VIRAT/ActEV [34, 2]. Once we have recreated the static scene, we automatically reconstruct trajectories by projecting real-world data to the 3D simulation world. See Figs. 1 and 3. We then semi-automatically select a set of plausible future destinations (corresponding to semantically meaningful locations in the scene), and ask human annotators to create multiple possible continuations of the real trajectories towards each such goal. In this way, our dataset is “anchored” in reality, and yet contains plausible variations in high-level human behavior, which is impossible to simulate automatically.
We call this dataset the “Forking Paths” dataset, a reference to the short story by Jorge Luis Borges (https://en.wikipedia.org/wiki/The_Garden_of_Forking_Paths). As shown in Fig. 1, different human annotations have created forkings of future trajectories for the identical historical past. So far, we have collected 750 sequences, with each covering about 15 seconds, from 10 annotators, controlling 127 agents in 7 different scenes. Each agent contains 5.9 future trajectories on average. We render each sequence from 4 different views, and automatically generate dense labels, as illustrated in Figs. 1 and 3. In total, this amounts to 3.2 hours of trajectory sequences, which is comparable to the largest person trajectory benchmark VIRAT/ActEV [2, 34] (4.5 hours), or 5 times bigger than the common ETH/UCY [23, 30] benchmark. We therefore believe this will serve as a useful benchmark for evaluating models that can predict multiple futures. We will release the data together with the code used to generate it.
Our second contribution is to propose a new probabilistic model, Multiverse, which can generate multiple plausible trajectories given the past history of locations and the scene. The model contains two novel design decisions. First, we use a multi-scale representation of locations. In the first scale, the coarse scale, we represent locations on a 2D grid, as shown in Fig. 1(1). This captures high-level uncertainty about possible destinations and leads to a better representation of multi-modal distributions. In the second, fine scale, we predict a real-valued offset for each grid cell, to get more precise localization. This two-stage approach is partially inspired by object detection methods. The second novelty of our model is to design convolutional RNNs over the spatial graph as a way of encoding inductive bias about the movement patterns of people.
In addition, we empirically validate our model on the challenging real-world benchmark VIRAT/ActEV [34, 2] for single-future trajectory prediction, in which our model achieves the best published result. On the proposed simulation data for multi-future prediction, experimental results show our model compares favorably against the state-of-the-art models across different settings. To summarize, the main contributions of this paper are as follows: (i) We introduce the first dataset and evaluation methodology that allows us to compare models in a quantitative way in terms of their ability to predict multiple plausible futures. (ii) We propose a new effective model for multi-future trajectory prediction. (iii) We establish a new state-of-the-art result on the challenging VIRAT/ActEV benchmark, and compare various methods on our multi-future prediction dataset.
There is a large recent literature on forecasting future trajectories. We briefly review some of these works below.
Single-future trajectory prediction. Recent works have tried to predict a single best trajectory for pedestrians or vehicles. Early works [1, 33, 57, 60] focused on modeling person motions by considering them as points in the scene. Social-LSTM is a popular method that uses social pooling to predict future trajectories. Other works [20, 58, 31, 28] have attempted to predict person paths by utilizing visual features. Kooij et al. looked at pedestrians’ faces to model their awareness for future prediction. Recently, Liang et al. proposed a joint future activity and trajectory prediction framework that utilized multiple visual features using focal attention. Many works [22, 48, 3, 17, 62] on vehicle trajectory prediction have also been proposed. CAR-Net proposed attention networks on top of a scene semantic CNN to predict vehicle trajectories. ChauffeurNet utilized imitation learning for trajectory prediction.
Multi-future trajectory prediction. Many works have tried to model the uncertainty of trajectory prediction. A number of works focused on learning the effects of the physical scene, e.g., people tend to walk on the sidewalk instead of grass. Various papers (e.g., [19, 40, 42]) use Inverse Reinforcement Learning (IRL) to forecast human trajectories. Other works [47, 14, 25], like Social-GAN, have utilized generative adversarial networks to generate diverse person trajectories. In vehicle trajectory prediction, DESIRE utilized variational auto-encoders (VAE) to predict future vehicle trajectories. Many recent works [52, 5, 51, 32] also proposed probabilistic frameworks for multi-future vehicle trajectory prediction. Different from these previous works, we present a flexible two-stage framework that combines multi-modal distribution modeling and precise location prediction.
Trajectory datasets. Many vehicle trajectory datasets have been proposed as a result of self-driving’s surging popularity. With the recent advancement in 3D computer vision research [61, 26, 49, 10, 43, 45, 15], many research works [37, 11, 9, 8, 55, 64, 50] have looked into 3D simulated environments for their flexibility and ability to generate enormous amounts of data. We are the first to propose a 3D simulation dataset that is reconstructed from real-world scenarios and complemented with a variety of human trajectory continuations for multi-future person trajectory prediction.
In this section, we describe our model for forecasting agent trajectories, which we call Multiverse. We focus on predicting the locations of a single agent for multiple steps into the future, $L_{h+1:T}$, given a sequence of past video frames, $V_{1:h}$, and agent locations, $L_{1:h}$, where $h$ is the history length and $T-h$ is the prediction length. Since there is inherent uncertainty in this task, our goal is to design a model that can effectively predict multiple plausible future trajectories, by computing the multimodal distribution $p(L_{h+1:T} \mid L_{1:h}, V_{1:h})$. See Fig. 2 for a high-level summary of the model, and the sections below for more details.
The encoder computes a representation of the scene from the history of past locations, $L_{1:h}$, and frames, $V_{1:h}$. We encode each ground truth location $L_t$ by an index $Y_t \in G$ representing the nearest cell in a 2D grid $G$ of size $H \times W$, indexed from $1$ to $HW$. Inspired by [21, 29], we encode location with two different grid scales ($36 \times 18$ and $18 \times 9$); we show the benefits of this multi-scale encoding in Section 5.4. For simplicity of presentation, we focus on a single grid.
To make the model more invariant to low-level visual details, and thus more robust to domain shift (e.g., between different scenes, different views of the same scene, or between real and synthetic images), we preprocess each video frame using a pre-trained semantic segmentation model, with $K$ possible class labels per pixel. We use the Deeplab model trained on the ADE20k dataset, and keep its weights frozen. Let $S_t$ be this semantic segmentation map, modeled as a tensor of size $h \times w \times K$.
We then pass these inputs to a convolutional RNN to compute a spatio-temporal feature history:
$$H_t^e = \mathrm{ConvRNN}\big(\text{one-hot}(Y_t) \odot (W * S_t),\; H_{t-1}^e\big) \tag{1}$$
where $\odot$ is the element-wise product, and $*$ represents 2D-convolution. The function $\text{one-hot}(\cdot)$ projects a cell index into a one-hot embedding of size $H \times W$ according to its spatial location. We use the final state of this encoder, $H_t^e \in \mathbb{R}^{H \times W \times d_{enc}}$, where $d_{enc}$ is the hidden size, to initialize the state of the decoders. We also use the temporal average of the semantic maps, $\bar{S} = \frac{1}{h}\sum_{t=1}^{h} S_t$, during each decoding step. The context is represented as $\mathcal{H} = [H_h^e, \bar{S}]$.
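The grid encoding of locations can be illustrated with a minimal sketch. It assumes a 1920 x 1080 frame, a grid of 36 columns by 18 rows, row-major cell indices starting at 0 (the text indexes from 1), and random stand-in scene features; the names `location_to_cell`, `one_hot_grid`, and `encoder_input` are hypothetical and not taken from the released code.

```python
import numpy as np

def location_to_cell(xy, frame_wh=(1920, 1080), grid_hw=(18, 36)):
    """Map a pixel location (x, y) to the flat index of the grid cell containing it."""
    x, y = xy
    frame_w, frame_h = frame_wh
    grid_h, grid_w = grid_hw
    # Clamp so that points on or outside the frame border fall in an edge cell.
    col = min(max(int(x / frame_w * grid_w), 0), grid_w - 1)
    row = min(max(int(y / frame_h * grid_h), 0), grid_h - 1)
    return row * grid_w + col

def one_hot_grid(index, grid_hw=(18, 36)):
    """Return an H x W mask that is 1 at the given cell and 0 elsewhere."""
    grid_h, grid_w = grid_hw
    mask = np.zeros((grid_h, grid_w), dtype=np.float32)
    mask[index // grid_w, index % grid_w] = 1.0
    return mask

def encoder_input(index, scene_features, grid_hw=(18, 36)):
    """Element-wise product of the one-hot mask with per-cell scene features."""
    return one_hot_grid(index, grid_hw)[..., None] * scene_features

# Assumed stand-in for convolved segmentation features of size H x W x 4.
features = np.random.default_rng(0).normal(size=(18, 36, 4)).astype(np.float32)
cell = location_to_cell((960, 540))
masked = encoder_input(cell, features)
```

Only the agent's cell keeps non-zero features after masking, which is what lets the recurrent encoder track where the agent has been on the grid.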
After getting the context $\mathcal{H}$, our goal is to forecast future locations. We initially focus on predicting locations at the level of grid cells, $Y_t \in G$. In Section 3.3, we discuss how to predict a continuous offset in $\mathbb{R}^2$, which specifies a “delta” from the center of each grid cell, to get a fine-grained location prediction.
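Let the coarse distribution over grid locations at time $t$ (known as the “belief state”) be denoted by $C_t(i) = p(Y_t = i \mid Y_{h:t-1}, \mathcal{H})$, for $i \in G$ and $t \in [h+1, T]$. For brevity, we use a single index $i$ to represent a cell in the 2D grid. The belief state is computed from the hidden state $H_t^C$ of a convolutional RNN decoder:
$$C_t = \mathrm{softmax}(W * H_t^C) \in \mathbb{R}^{HW} \tag{2}$$
Here we use 2D-convolution with one filter and flatten the spatial dimension before applying softmax. The hidden state is updated using:
$$H_t^C = \mathrm{ConvRNN}\big(\mathrm{GAT}(H_{t-1}^C),\; \mathrm{embed}(C_{t-1})\big) \tag{3}$$
where $\mathrm{embed}(C_{t-1})$ embeds $C_{t-1}$ into a 3D tensor of size $H \times W \times d_e$ and $d_e$ is the embedding size. $\mathrm{GAT}$ is a graph attention network, where the graph structure corresponds to the 2D grid in $G$. More precisely, let $h_i$ be the feature vector corresponding to the $i$-th grid cell in $H_{t-1}^C$, and let $\tilde{h}_i$ be the corresponding output in $\tilde{H}_{t-1}^C = \mathrm{GAT}(H_{t-1}^C) \in \mathbb{R}^{H \times W \times d_{dec}}$, where $d_{dec}$ is the size of the decoder hidden state. We compute these outputs of GAT using:
$$\tilde{h}_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} f_e([v_i, v_j]) + h_i \tag{4}$$
where $\mathcal{N}_i$ are the neighbors of node $v_i$ in $G$, with each node represented as $v_i = [h_i, \bar{S}_i]$, where $\bar{S}_i$ collects cell $i$’s feature in $\bar{S}$. $f_e$ is some edge function (implemented as an MLP in our experiments) that computes the attention weights.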
The graph-structured update function for the RNN ensures that the probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior knowledge that people do not suddenly jump between distant locations. This inductive bias is also encoded in the convolutional structure, but adding the graph attention network gives improved results, because the weights are input-dependent and not fixed.
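As a rough illustration of the graph update and belief-state computation described above, the sketch below averages an edge function over each cell's grid neighbors, adds a residual connection, and then applies a single-filter convolution followed by a softmax. The 8-connected neighborhood, the 1x1 kernel, and the one-layer tanh edge function are assumptions made for this example; the recurrent update itself is omitted.

```python
import numpy as np

def grid_neighbors(row, col, grid_h, grid_w):
    """Return the in-bounds 8-connected neighbors of a grid cell (assumed neighborhood)."""
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < grid_h and 0 <= c < grid_w:
                cells.append((r, c))
    return cells

def graph_update(hidden, scene, edge_fn):
    """hidden: (H, W, d), scene: (H, W, K). Returns updated features of shape (H, W, d)."""
    grid_h, grid_w, _ = hidden.shape
    nodes = np.concatenate([hidden, scene], axis=-1)  # node = [hidden feature, scene feature]
    out = np.empty_like(hidden)
    for r in range(grid_h):
        for c in range(grid_w):
            messages = [edge_fn(np.concatenate([nodes[r, c], nodes[nr, nc]]))
                        for nr, nc in grid_neighbors(r, c, grid_h, grid_w)]
            out[r, c] = np.mean(messages, axis=0) + hidden[r, c]  # neighbor mean + residual
    return out

def belief_state(hidden, conv_weight):
    """Single-filter 1x1 convolution, flatten, softmax. Returns a vector of length H*W."""
    logits = (hidden @ conv_weight).reshape(-1)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
H, W, d, K = 4, 6, 8, 3
hidden = rng.normal(size=(H, W, d))
scene = rng.normal(size=(H, W, K))
edge_weights = rng.normal(size=(2 * (d + K), d)) * 0.1  # hypothetical one-layer edge MLP
edge_fn = lambda pair: np.tanh(pair @ edge_weights)
updated = graph_update(hidden, scene, edge_fn)
belief = belief_state(updated, rng.normal(size=d))
```

Because each cell only receives messages from adjacent cells, information (and hence probability mass) can spread by at most one cell per update in this sketch.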
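The 2D heatmap is useful for capturing multimodal distributions, but does not give very precise location predictions. To overcome this, we train a second convolutional RNN decoder $H_t^O$ to compute an offset vector for each possible grid cell using a regression output, $O_t = \mathrm{MLP}(H_t^O) \in \mathbb{R}^{H \times W \times 2}$. This RNN is updated using
$$H_t^O = \mathrm{ConvRNN}\big(\mathrm{GAT}(H_{t-1}^O),\; O_{t-1}\big) \in \mathbb{R}^{H \times W \times d_{dec}} \tag{5}$$
To compute the final prediction location, we first flatten the spatial dimension of $O_t$ into $\tilde{O}_t \in \mathbb{R}^{HW \times 2}$. Then we use
$$L_t = Q_i + \tilde{O}_{ti} \tag{6}$$
where $i$ is the index of the selected grid cell, $Q_i \in \mathbb{R}^2$ is the center of that cell, and $\tilde{O}_{ti} \in \mathbb{R}^2$ is the predicted offset for that cell at time $t$. For single-future prediction, we use greedy search, namely $i = \arg\max C_t$ over the belief state. For multi-future prediction, we use beam search, described in Section 3.5.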
This idea of combining classification and regression is partially inspired by object detection methods. It is worth noting that a concurrent work also designed a two-stage model for trajectory forecasting. However, their classification targets are pre-defined anchor trajectories, whereas ours is not limited by predefined anchors.
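The coarse-to-fine decoding step can be sketched as follows: pick the most likely cell from the belief state, then add that cell's predicted offset to the cell center. The frame size, grid shape, and function names are assumptions for illustration only.

```python
import numpy as np

def cell_centers(frame_wh, grid_hw):
    """Return pixel centers of all grid cells as an (H*W, 2) array in row-major order."""
    frame_w, frame_h = frame_wh
    grid_h, grid_w = grid_hw
    cell_w, cell_h = frame_w / grid_w, frame_h / grid_h
    cols, rows = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    centers = np.stack([(cols + 0.5) * cell_w, (rows + 0.5) * cell_h], axis=-1)
    return centers.reshape(-1, 2)

def decode_location(belief, offsets, centers):
    """Greedy decoding: center of the most likely cell plus that cell's offset.

    belief: (H*W,), offsets: (H, W, 2), centers: (H*W, 2). Returns an (x, y) array.
    """
    flat_offsets = offsets.reshape(-1, 2)
    i = int(np.argmax(belief))
    return centers[i] + flat_offsets[i]

grid_hw = (18, 36)
centers = cell_centers((1920, 1080), grid_hw)
belief = np.zeros(grid_hw[0] * grid_hw[1])
belief[1 * 36 + 1] = 1.0                      # all mass on the cell at row 1, column 1
offsets = np.zeros(grid_hw + (2,))
offsets[1, 1] = (3.0, -4.0)                   # hypothetical predicted offset in pixels
location = decode_location(belief, offsets, centers)
```

The grid cell bounds the coarse error to the cell size, while the offset recovers sub-cell precision.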
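Our model trains on the observed trajectory from time $1$ to $h$ and predicts the future trajectories (in $xy$-coordinates) from time $h+1$ to $T$. We supervise this training by providing ground truth targets for both the heatmap (belief state), $C_t^*$, and regression offset map, $O_t^*$. In particular, for the coarse decoder, the cross-entropy loss is used:
$$\mathcal{L}_{cls} = -\frac{1}{T} \sum_{t=h+1}^{T} \sum_{i \in G} C_{ti}^* \log(C_{ti}) \tag{7}$$
For the fine decoder, we use the smoothed $L_1$ loss used in object detection:
$$\mathcal{L}_{reg} = \frac{1}{T} \sum_{t=h+1}^{T} \sum_{i \in G} \mathrm{smooth}_{L_1}(O_{ti}^*, O_{ti}) \tag{8}$$
where $O_{ti}^* = L_t^* - Q_i$ is the delta between the true location and the center of the grid cell at $i$, and $L_t^*$ is the ground truth for $L_t$ in Eq. (6). We impose this loss on every cell to improve the robustness.
The final loss is then calculated using
$$\mathcal{L}(\theta) = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{reg} + \lambda_2 \|\theta\|_2^2 \tag{9}$$
where $\lambda_2$ controls the $\ell_2$ regularization (weight decay), and $\lambda_1$ is used to balance the regression and classification losses.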
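A minimal per-step sketch of the two training losses and their combination is given below. It assumes a one-hot classification target at the cell nearest the true location, the standard smooth L1 form with a threshold of 1, and an illustrative balance weight of 0.5; the weight decay of 0.001 follows the implementation details, and all function names are hypothetical.

```python
import numpy as np

def cross_entropy(target_index, belief, eps=1e-12):
    """Cross-entropy against a one-hot target at `target_index` (assumed target form)."""
    return -np.log(belief[target_index] + eps)

def smooth_l1(diff):
    """Standard smooth L1: quadratic below 1, linear above."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)

def step_losses(target_xy, belief, offsets, centers):
    """belief: (HW,), offsets: (HW, 2), centers: (HW, 2). Returns (cls, reg) for one step."""
    target_index = int(np.argmin(np.linalg.norm(centers - target_xy, axis=1)))
    cls = cross_entropy(target_index, belief)
    target_offsets = target_xy - centers        # delta to every cell center, not just one
    reg = smooth_l1(target_offsets - offsets).sum()
    return cls, reg

def total_loss(cls_losses, reg_losses, params, balance=0.5, weight_decay=0.001):
    """Combine time-averaged losses with an L2 penalty on the parameters."""
    l2 = sum(np.sum(p ** 2) for p in params)
    return np.mean(cls_losses) + balance * np.mean(reg_losses) + weight_decay * l2

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
target_xy = np.array([9.0, 1.0])
belief = np.array([0.25, 0.75])
perfect_offsets = target_xy - centers
cls, reg = step_losses(target_xy, belief, perfect_offsets, centers)
loss = total_loss([cls], [reg], [np.ones(2)])
```

Supervising the offset at every cell, rather than only at the true cell, means a wrong coarse choice can still be partially corrected by its offset.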
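To generate multiple qualitatively distinct trajectories, we use the diverse beam search strategy. To define this precisely, let $B_{t-1}$ be the beam at time $t-1$; this set contains $K$ trajectories (history selections) $M_{t-1}^k = \{\hat{Y}_1^k, \ldots, \hat{Y}_{t-1}^k\}$, $k \in [1, K]$, where $\hat{Y}_t^k$ is an index in $G$, along with their accumulated log probabilities, $P_{t-1}^k$. Let $C_t^k = f(M_{t-1}^k) \in \mathbb{R}^{HW}$ be the coarse location output probability from Eq. (2) and (3) at time $t$ given inputs $M_{t-1}^k$.
The new beam is computed using
$$B_t = \mathrm{topK}\big(\{P_{t-1}^k + \log(C_t^k(i)) + \gamma(i) \mid i \in G,\; k \in [1, K]\}\big) \tag{10}$$
where $\gamma(i)$ is a diversity penalty term, and we take the top $K$ elements from the set produced by considering values with $k = 1, \ldots, K$. If $K = 1$, this reduces to greedy search.
Once we have computed the top $K$ future predictions, we add the corresponding offset vectors to obtain $K$ trajectories with locations $L_t^k \in \mathbb{R}^2$. This constitutes the final output of our model.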
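One step of beam search over grid cells with a diversity term can be sketched as below. The exact form of the diversity penalty is not given here, so this example assumes a simple one: each time a cell is selected, later candidates ending in the same cell are penalized by a fixed amount. The `step_fn` callback standing in for the coarse decoder, and the choice to accumulate unpenalized log probabilities, are also assumptions.

```python
import numpy as np

def diverse_beam_step(beams, step_fn, beam_size, penalty=1.0):
    """Extend each beam by one cell and keep `beam_size` candidates.

    beams: list of (cells, log_prob), where cells is a tuple of cell indices.
    step_fn: maps a history tuple to a probability vector over all cells.
    """
    scores = np.stack([log_prob + np.log(step_fn(cells) + 1e-12)
                       for cells, log_prob in beams])      # shape (num_beams, num_cells)
    gamma = np.zeros(scores.shape[1])                      # per-cell diversity term
    chosen, new_beams = [], []
    for _ in range(beam_size):
        adjusted = scores + gamma
        for k, i in chosen:
            adjusted[k, i] = -np.inf                       # never pick the same candidate twice
        k, i = np.unravel_index(np.argmax(adjusted), adjusted.shape)
        chosen.append((k, i))
        cells, _ = beams[k]
        new_beams.append((cells + (int(i),), float(scores[k, i])))
        gamma[i] -= penalty                                # discourage reusing this cell
    return new_beams

# Toy decoder (assumption): always prefers cell 0, regardless of history.
toy_step = lambda cells: np.array([0.9, 0.1])
start = [((0,), np.log(0.5)), ((1,), np.log(0.49))]
plain = diverse_beam_step(start, toy_step, beam_size=2, penalty=0.0)
diverse = diverse_beam_step(start, toy_step, beam_size=2, penalty=3.0)
```

In this toy case, both plain beams end in cell 0, while the penalized search keeps one candidate ending in cell 1. With `beam_size=1` the loop selects only the single best candidate, which is greedy search.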
In this section, we describe our human-annotated simulation dataset, called Forking Paths, for multi-future trajectory evaluation.
Existing datasets. There are several real-world datasets for trajectory evaluation, such as SDD, ETH/UCY [35, 23], KITTI, nuScenes and VIRAT/ActEV [2, 34]. However, they all share the fundamental problem that one can only observe one out of many possible future trajectories sampled from the underlying distribution. This is broadly acknowledged in prior works [32, 52, 5, 14, 42, 40] but has not yet been addressed.
The closest work to ours is the simulation used in [32, 52, 5]. However, these only contain artificial trajectories, not human generated ones. Also, they use a highly simplified 2D space, with pedestrians oversimplified as points and vehicles as blocks; no other scene semantics are provided.
Reconstructing reality in simulator. In this work, we use CARLA, a near-realistic open source simulator built on top of the Unreal Engine 4. Following prior simulation datasets [11, 45], we semi-automatically reconstruct static scenes and their dynamic elements from the real-world videos in ETH/UCY and VIRAT/ActEV. There are 4 scenes in ETH/UCY and 5 in VIRAT/ActEV. We exclude 2 cluttered scenes (UNIV & 0002) that we are not able to reconstruct in CARLA, leaving 7 static scenes in our dataset.
For the dynamic movement of vehicles and pedestrians, we first convert the ground truth trajectory annotations from the real-world videos to the ground plane using the provided homography matrices. We then match the real-world trajectories’ origin to the correct locations in the re-created scenes.
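Projecting image-plane annotations to the ground plane with a homography amounts to a matrix product in homogeneous coordinates followed by a perspective divide. The sketch below shows this step and a simple translation to a scene origin; the function names and the example matrix are hypothetical and do not come from the dataset tooling.

```python
import numpy as np

def image_to_ground(points_xy, homography):
    """Apply a 3x3 homography to (N, 2) image points and return (N, 2) ground-plane points."""
    pts = np.asarray(points_xy, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y) -> (x, y, 1)
    projected = homogeneous @ np.asarray(homography, dtype=float).T
    return projected[:, :2] / projected[:, 2:3]              # perspective divide

def align_to_scene(ground_xy, scene_origin):
    """Shift a trajectory so that its first point lands on the chosen scene origin."""
    ground_xy = np.asarray(ground_xy, dtype=float)
    return ground_xy - ground_xy[0] + np.asarray(scene_origin, dtype=float)

# Illustrative homography (assumption): scale, shift, and a uniform divide by 2.
H_example = np.array([[2.0, 0.0, 1.0],
                      [0.0, 2.0, -1.0],
                      [0.0, 0.0, 2.0]])
track = image_to_ground([[1.0, 1.0], [3.0, 1.0]], H_example)
placed = align_to_scene(track, scene_origin=(100.0, 50.0))
```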
Human generation of plausible futures. We manually select sequences with more than one pedestrian. We also require that at least one pedestrian could have multiple plausible alternative destinations. We then select one of the pedestrians to be the “controlled agent” (CA) for each sequence, and set meaningful destinations within reach, like a car or an entrance of a building. On average, each agent has about 3 destinations to move towards. In total, we have 127 CAs from 7 scenes. We call each CA and their corresponding scene a scenario.
For each scenario, on average 5.9 human annotators control the agent to the defined destinations. Specifically, they are asked to watch the first 5 seconds of video, from a first-person view (with the camera slightly behind the pedestrian) and/or an overhead view (to give more context). They are then asked to control the motion of the agent so that it moves towards the specified destination in a “natural” way, e.g., without colliding with other moving objects (whose motion is derived from the real videos, and is therefore unaware of the controlled agent). The annotation is considered successful if the agent reaches the destination without colliding within the time limit of 10.4 seconds.
Note that our videos are up to 15.2 seconds long. This is slightly longer than previous works (e.g. [1, 14, 28, 47, 25, 60, 62]) that use 3.2 seconds of observation and 4.8 seconds for prediction. (We use 10.4 seconds for the future to allow us to evaluate longer term forecasts.)
Generating the data. Once we have collected human-generated trajectories, 750 in total after data cleaning, we render each one in four camera views (three 45-degree and one top-down view). Each camera view has 127 scenarios in total and each scenario has on average 5.9 future trajectories. With CARLA, we can also simulate different weather conditions, although we did not do so in this work. In addition to agent location, we collect ground truth for pixel-precise scene semantic segmentation from 13 classes including sidewalk, road, vehicle, pedestrian, etc. See Fig. 3.
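The dataset statistics quoted in the text can be cross-checked with a short calculation. This is only a rough consistency check, under the assumptions that every sequence lasts the maximum 15.2 seconds and that the total duration counts each sequence once rather than once per camera view.

```latex
\begin{align*}
\text{sequences} &\approx 127 \text{ agents} \times 5.9 \text{ futures per agent} \approx 749 \approx 750, \\
\text{duration} &\approx 750 \times 15.2\,\text{s} = 11400\,\text{s} \approx 3.2\,\text{hours}.
\end{align*}
```

Both figures agree with the 750 sequences and 3.2 hours reported for the dataset.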
This section evaluates various methods, including our Multiverse model, for multi-future trajectory prediction on the proposed Forking Paths dataset. To allow comparison with previous works, we also evaluate our model on the challenging VIRAT/ActEV [2, 34] benchmark for single-future path prediction.
Single-Future Evaluation. In real-world videos, each trajectory only has one sample of the future, so models are evaluated on how well they predict that single trajectory. Following prior work [28, 1, 14, 47, 22, 17, 5, 42], we introduce two standard metrics for this setting.
Let $Y^i = Y^i_{t=(h+1)\cdots T}$ be the ground truth trajectory of the $i$-th sample, and $\hat{Y}^i$ be the corresponding prediction. We then employ two distance-based error metrics: i) Average Displacement Error (ADE): the average Euclidean distance between the ground truth coordinates and the prediction coordinates over all time instants:
$$\mathrm{ADE} = \frac{\sum_{i=1}^{N} \sum_{t=h+1}^{T} \|Y_t^i - \hat{Y}_t^i\|_2}{N \times (T-h)}$$
ii) Final Displacement Error (FDE): the Euclidean distance between the predicted points and the ground truth point at the final prediction time:
$$\mathrm{FDE} = \frac{\sum_{i=1}^{N} \|Y_T^i - \hat{Y}_T^i\|_2}{N}$$
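The two single-future metrics can be computed in a few lines. This sketch assumes all trajectories have the same prediction length and are stored as arrays of shape (N, T_pred, 2); the function name `ade_fde` is hypothetical.

```python
import numpy as np

def ade_fde(truth, pred):
    """Return (ADE, FDE) for arrays of shape (N, T_pred, 2)."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    dist = np.linalg.norm(truth - pred, axis=-1)   # per-step Euclidean error, shape (N, T_pred)
    return float(dist.mean()), float(dist[:, -1].mean())

truth = np.zeros((1, 2, 2))
pred = np.array([[[3.0, 4.0], [6.0, 8.0]]])
ade, fde = ade_fde(truth, pred)   # per-step errors are 5 and 10
```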
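Multi-Future Evaluation. Let $Y^{ij} = Y^{ij}_{t=(h+1)\cdots T}$ be the $j$-th true future trajectory for the $i$-th test sample, for $j \in [1, J]$, and let $\hat{Y}^{ik}$ be the $k$-th sample from the predicted distribution over trajectories, for $k \in [1, K]$. Since there is no agreed-upon evaluation metric for this setting, we simply extend the above metrics, as follows: i) Minimum Average Displacement Error Given K Predictions (minADE$_K$): similar to the metric described in [5, 40, 42, 14], for each true trajectory $j$ in test sample $i$, we select the closest overall prediction (from the $K$ model predictions), and then measure its average error:
$$\mathrm{minADE}_K = \frac{\sum_{i=1}^{N} \sum_{j=1}^{J} \min_{k=1}^{K} \sum_{t=h+1}^{T} \|Y_t^{ij} - \hat{Y}_t^{ik}\|_2}{N \times (T-h) \times J}$$
ii) Minimum Final Displacement Error Given K Predictions (minFDE$_K$): similar to minADE$_K$, but we only consider the predicted points and the ground truth point at the final prediction time instant:
$$\mathrm{minFDE}_K = \frac{\sum_{i=1}^{N} \sum_{j=1}^{J} \min_{k=1}^{K} \|Y_T^{ij} - \hat{Y}_T^{ik}\|_2}{N \times J}$$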
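A sketch of the multi-future metrics follows. Because true futures can differ in length, this example truncates each prediction to the length of the true trajectory it is compared against and averages per-trajectory errors over all (sample, future) pairs; that normalization and the function name `min_ade_fde` are assumptions for illustration.

```python
import numpy as np

def min_ade_fde(true_futures, predictions):
    """Compute (minADE_K, minFDE_K).

    true_futures: list over samples; each item is a list of (T_j, 2) arrays.
    predictions: list over samples; each item is a (K, T_max, 2) array.
    """
    ade_terms, fde_terms = [], []
    for futures, preds in zip(true_futures, predictions):
        preds = np.asarray(preds, dtype=float)
        for future in futures:
            future = np.asarray(future, dtype=float)
            steps = len(future)                                   # evaluate on the true length
            dist = np.linalg.norm(preds[:, :steps] - future[None], axis=-1)  # (K, steps)
            ade_terms.append(dist.mean(axis=1).min())             # best prediction on average
            fde_terms.append(dist[:, -1].min())                   # best prediction at the end
    return float(np.mean(ade_terms)), float(np.mean(fde_terms))

futures = [[np.array([[1.0, 0.0], [2.0, 0.0]]), np.array([[0.0, 1.0]])]]
right = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
up = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
both_modes = min_ade_fde(futures, [np.stack([right, up])])
one_mode = min_ade_fde(futures, [np.stack([right, right])])
```

A model that covers both futures scores zero here, while a single-output model duplicated K times is penalized for the future it misses.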
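Table 1: Multi-future results on the Forking Paths dataset, reported as mean ± standard deviation over 5 runs. Lower is better.
|Method||Input Types||minADE20 (45-degree)||minADE20 (top-down)||minFDE20 (45-degree)||minFDE20 (top-down)|
|LSTM||Traj.||201.0 ± 2.2||183.7 ± 2.1||381.5 ± 3.2||355.0 ± 3.6|
|Social-LSTM||Traj.||197.5 ± 2.5||180.4 ± 1.0||377.0 ± 3.6||350.3 ± 2.3|
|Social-GAN (PV)||Traj.||191.2 ± 5.4||176.5 ± 5.2||351.9 ± 11.4||335.0 ± 9.4|
|Social-GAN (V)||Traj.||187.1 ± 4.7||172.7 ± 3.9||342.1 ± 10.2||326.7 ± 7.7|
|Next||Traj.+Bbox+RGB+Seg.||186.6 ± 2.7||166.9 ± 2.2||360.0 ± 7.2||326.6 ± 5.0|
|Ours||Traj.+Seg.||168.9 ± 2.1||157.7 ± 2.5||333.8 ± 3.7||316.5 ± 3.4|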
Dataset & Setups. The proposed Forking Paths dataset in Section 4 is used for multi-future trajectory prediction evaluation. Following the setting in previous works [28, 1, 14, 47, 32], we downsample the videos to 2.5 fps and extract person trajectories using publicly released code, and let the models observe 3.2 seconds (8 frames) of the controlled agent before outputting trajectory coordinates in the pixel space. Since the ground truth future trajectories differ in length, each model needs to predict the maximum length at test time, but we evaluate the predictions using the actual length of each true trajectory.
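The frame arithmetic in this setup can be made concrete with a small sketch. The source frame rate of 30 fps is an assumption made for this example (only the 2.5 fps target and the 8 observed frames come from the text), and the function names are hypothetical.

```python
def downsample_indices(num_frames, source_fps=30.0, target_fps=2.5):
    """Return the frame indices kept when reducing source_fps to target_fps."""
    stride = int(round(source_fps / target_fps))   # keep every 12th frame at 30 fps
    return list(range(0, num_frames, stride))

def split_observation(track, obs_len=8):
    """Split a downsampled track into observed history and future to predict."""
    return track[:obs_len], track[obs_len:]

kept = downsample_indices(96)                      # 3.2 s of video at the assumed 30 fps
history, future = split_observation(list(range(20)))
```

At 2.5 fps, 3.2 seconds of observation corresponds to 3.2 x 2.5 = 8 frames, which is the `obs_len` used above.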
Baseline methods. We compare our method with two simple baselines, and three recent methods with released source code, including a recent model for multi-future prediction and the state-of-the-art model for single-future prediction. Linear is a single-layer model that predicts the next coordinates using a linear regressor based on the previous input point. LSTM is a simple LSTM encoder-decoder model with coordinates input only. Social-LSTM: we use the open source implementation from https://github.com/agrimgupta92/sgan/. Next is the state-of-the-art method for single-future trajectory prediction on the VIRAT/ActEV dataset. We train the Next model without the activity labels for fair comparison, using the code from https://github.com/google/next-prediction/. Social-GAN is a recent multi-future trajectory prediction model trained using Minimum over N (MoN) loss. We train the two model variants (called PV and V) detailed in the paper, using its released source code.
All models are trained on real videos (from VIRAT/ActEV – see Section 5.3 for details) and tested on our synthetic videos (with CARLA-generated pixels, and annotator-generated trajectories). Most models just use trajectory data as input, except for our model (which uses trajectory and semantic segmentation) and Next (which uses trajectory, bounding box, semantic segmentation, and RGB frames).
Implementation Details. We use the ConvLSTM cell for both the encoder and decoder. The embedding size is set to 32, and the hidden sizes for the encoder and decoder are both 256. The scene semantic segmentation features are extracted from the Deeplab model, pretrained on the ADE20k dataset. We use the Adadelta optimizer with an initial learning rate of 0.3 and weight decay of 0.001. Other hyper-parameters for the baselines are the same as those in [14, 28]. We evaluate the top $K = 20$ predictions for multi-future trajectories. For the models that only output a single trajectory, including Linear, LSTM, Social-LSTM, and Next, we duplicate the output $K$ times before evaluating. For Social-GAN, we use $K$ different random noise inputs to get the predictions. For our model, we use diversity beam search [24, 36] as described in Section 3.5.
Quantitative Results. Table 1 lists the multi-future evaluation results, where we divide the evaluation according to the viewing angle of the camera, 45-degree vs. top-down view. We repeat all experiments (except “linear”) 5 times with random initialization to produce the mean and standard deviation values. As we see, our model outperforms baselines in all metrics and it performs significantly better on the minADE metric, which suggests better prediction quality over all time instants. Notably, our model outperforms Social-GAN by a large margin of at least 8 points on all metrics.
Qualitative analysis. We visualize some outputs of the top 4 methods in Fig. 4. In each image, the yellow trajectories are the history trajectory of each controlled agent (derived from real video data) and the green trajectories are the ground truth future trajectories from human annotators. The predicted trajectories are shown in yellow-orange heatmaps for multi-future prediction methods, and in red lines for single-future prediction methods. As we see, our model generally puts probability mass where there is data, and does not “waste” probability mass where there is no data.
Error analysis. We show some typical errors our model makes in Fig. 5. The first image shows our model misses the correct direction, perhaps due to lack of diversity in our sampling procedure. The second image shows our model sometimes predicts the person will “go through” the car (diagonal red beam) instead of going around it. This may be addressed by adding more training examples of “going around” obstacles. The third image shows our model predicts the person will go to a moving car. This is due to the lack of modeling of the dynamics of other far-away agents in the scene. The fourth image shows a hard case where the person just exits the vehicle and there is no indication of where they will go next (so our model “backs off” to a sensible “stay nearby” prediction). We leave solutions to these problems to future work.
Dataset & Setups. NIST released VIRAT/ActEV for activity detection research in streaming videos in 2018. This dataset is a new version of the VIRAT dataset, with more videos and annotations. The length of videos with publicly available annotations is about 4.5 hours. Following prior work, we use the official training set for training and the official validation set for testing. Other setups are the same as in Section 5.2, except we use the single-future evaluation metric.
Quantitative Results. Table 2 (first column) shows the evaluation results. As we see, our model achieves state-of-the-art performance. The improvement is especially large on the Final Displacement Error (FDE) metric, which we attribute to the coarse location decoder that helps regulate the model’s long-term predictions. The gain shows that our model does well at both single-future prediction (on real data) and multi-future prediction (on our quasi-synthetic data).
Generalizing from simulation to real-world. As described in Section 4, we generate simulation data first by reconstructing from real-world videos. To verify the quality of the reconstructed data, and the efficacy of learning from simulation videos, we train all the models on the simulation videos derived from the real data. We then evaluate on the real test set of VIRAT/ActEV. As we see from the right column in Table 2, all models do worse in this scenario, due to the difference between synthetic and real data.
There are two sources of error. The synthetic trajectory data only contains about 60% of the real trajectory data, due to difficulties reconstructing all the real data in the simulator. In addition, the synthetic images are not photorealistic. Thus methods (such as Next) that rely on RGB input obviously suffer the most, since they have never been trained on “real pixels”. Our method, which uses trajectories plus high-level semantic segmentations (which transfer from synthetic to real more easily), suffers the least drop in performance, showing its robustness to “domain shift”. See Table 1 for input source comparison between methods.
Table 2: Single-future results (ADE / FDE) on the real VIRAT/ActEV test set.
|Method||Trained on Real (ADE / FDE)||Trained on Sim. (ADE / FDE)|
|Linear||32.19 / 60.92||48.65 / 90.84|
|LSTM||23.98 / 44.97||28.45 / 53.01|
|Social-LSTM||23.10 / 44.27||26.72 / 51.26|
|Social-GAN (V)||30.40 / 61.93||36.74 / 73.22|
|Social-GAN (PV)||30.42 / 60.70||36.48 / 72.72|
|Next||19.78 / 42.43||27.38 / 62.11|
|Ours||18.51 / 35.84||22.94 / 43.35|
We test various ablations of our model on both the single-future and multi-future trajectory prediction to substantiate our design decisions. Results are shown in Table 3, where the ADE/FDE metrics are shown in the “single-future” column and minADE20/minFDE20 metrics (averaged across all views) in the “multi-future” column. We verify three of our key designs by leaving the module out from the full model.
(1) Spatial Graph: Our model is built on top of a spatial 2D graph that uses graph attention to model the scene features. We train the model without the spatial graph. As we see, the performance drops on both tasks. (2) Fine location decoder: We test our model without the fine location decoder and only use the grid center as the coordinate output. As we see, the significant performance drops on both tasks verify the efficacy of this new module proposed in our study. (3) Multi-scale grid: As mentioned in Section 3, we utilize two different grid scales (36 × 18) and (18 × 9) in training. We see that performance is slightly worse if we only use the fine scale (36 × 18).
Table 3: Ablation results.
|Method||Single-Future (ADE / FDE)||Multi-Future (minADE20 / minFDE20)|
|Our full model||18.51 / 35.84||166.1 / 329.5|
|No spatial graph||28.68 / 49.87||184.5 / 363.2|
|No fine location decoder||53.62 / 83.57||232.1 / 468.6|
|No multi-scale grid||21.09 / 38.45||171.0 / 344.4|
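The multi-future numbers for the full model can be related to the per-view results in Table 1. Assuming the average across views weights each of the four rendered views equally (three 45-degree views and one top-down view), the per-view values for our model give:

```latex
\begin{align*}
\text{minADE}_{20} &\approx \frac{3 \times 168.9 + 157.7}{4} = \frac{664.4}{4} = 166.1, \\
\text{minFDE}_{20} &\approx \frac{3 \times 333.8 + 316.5}{4} = \frac{1317.9}{4} \approx 329.5.
\end{align*}
```

Both values match the “Our full model” row, which is consistent with this equal-weight reading of “averaged across all views”.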
In this paper, we have introduced the Forking Paths dataset, and the Multiverse model for multi-future forecasting. Our study is the first to provide a quantitative benchmark and evaluation methodology for multi-future trajectory prediction by using human annotators to create a variety of trajectory continuations under the identical past. Our model utilizes multi-scale location decoders with a graph attention model to predict multiple future locations. We have shown that our method achieves state-of-the-art performance on two challenging benchmarks: the large-scale real video dataset and our proposed multi-future trajectory dataset. We believe our dataset, together with our models, will facilitate future research and applications on multi-future prediction.
Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS.