The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction

12/13/2019 ∙ by Junwei Liang, et al. ∙ Carnegie Mellon University ∙ Google

This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real-world trajectory data and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of models that predict multi-future trajectories. The second contribution is a new model to generate multiple plausible future trajectories, which contains novel designs of using multi-scale location encodings and convolutional RNNs over graphs. We refer to our model as Multiverse. We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset (which contains only one possible future per example). We will release our data, models and code.


1 Introduction

Forecasting future human behavior is a fundamental problem in video understanding. In particular, future path prediction, which aims at forecasting a pedestrian’s future trajectory in the next few seconds, has received a lot of attention in our community [19, 1, 14, 25]. This functionality is a key component in a variety of applications such as autonomous driving [3, 5], long-term object tracking [18, 46], safety monitoring [28], robotic planning [41, 40], etc.

Of course, the future is often very uncertain: Given the same historical trajectory, a person may take different paths, depending on their (latent) goals. Thus recent work has started focusing on multi-future trajectory prediction [51, 5, 25, 32, 52, 22].

Consider the example in Fig. 1. We see a person moving from the bottom left towards the top right of the image, and our task is to predict where he will go next. Since there are many possible future trajectories this person might follow, we are interested in learning a model that can generate multiple plausible futures. However, since the ground truth data only contains one trajectory, it is difficult to evaluate such probabilistic models.

To overcome the aforementioned challenges, our first contribution is the creation of a realistic synthetic dataset that allows us to compare models in a quantitative way in terms of their ability to predict multiple plausible futures, rather than just evaluating them against a single observed trajectory as in existing studies. We create this dataset using the 3D CARLA [10] simulator, where the scenes are manually designed to be similar to those found in the challenging real-world benchmark VIRAT/ActEV [34, 2]. Once we have recreated the static scene, we automatically reconstruct trajectories by projecting real-world data to the 3D simulation world. See Fig. 1 and 3. We then semi-automatically select a set of plausible future destinations (corresponding to semantically meaningful locations in the scene), and ask human annotators to create multiple possible continuations of the real trajectories towards each such goal. In this way, our dataset is “anchored” in reality, and yet contains plausible variations in high-level human behavior, which is impossible to simulate automatically.

We call this dataset the “Forking Paths” dataset, a reference to the short story by Jorge Luis Borges (https://en.wikipedia.org/wiki/The_Garden_of_Forking_Paths). As shown in Fig. 1, different human annotators have created forking future trajectories from the identical historical past. So far, we have collected 750 sequences, each covering about 15 seconds, from 10 annotators controlling 127 agents in 7 different scenes. Each agent has 5.9 future trajectories on average. We render each sequence from 4 different views and automatically generate dense labels, as illustrated in Fig. 1 and 3. In total, this amounts to 3.2 hours of trajectory sequences, which is comparable to the largest person trajectory benchmark, VIRAT/ActEV [2, 34] (4.5 hours), and 5 times bigger than the common ETH/UCY [23, 30] benchmark. We therefore believe this will serve as a useful benchmark for evaluating models that can predict multiple futures. We will release the data together with the code used to generate it.

Our second contribution is a new probabilistic model, Multiverse, which can generate multiple plausible trajectories given the past history of locations and the scene. The model contains two novel design decisions. First, we use a multi-scale representation of locations. At the coarse scale, we represent locations on a 2D grid, as shown in Fig. 1(1). This captures high-level uncertainty about possible destinations and leads to a better representation of multi-modal distributions. At the fine scale, we predict a real-valued offset for each grid cell to get more precise localization. This two-stage approach is partially inspired by object detection methods [39]. The second novelty of our model is the design of convolutional RNNs [56] over a spatial graph as a way of encoding inductive bias about the movement patterns of people.

In addition, we empirically validate our model on the challenging real-world benchmark VIRAT/ActEV [34, 2] for single-future trajectory prediction, on which our model achieves the best published result. On the proposed simulation data for multi-future prediction, experimental results show that our model compares favorably against state-of-the-art models across different settings. To summarize, the main contributions of this paper are as follows: (i) We introduce the first dataset and evaluation methodology that allow us to compare models quantitatively in terms of their ability to predict multiple plausible futures. (ii) We propose a new, effective model for multi-future trajectory prediction. (iii) We establish a new state-of-the-art result on the challenging VIRAT/ActEV benchmark, and compare various methods on our multi-future prediction dataset.

2 Related Work

There is a large recent literature on forecasting future trajectories. We briefly review some of these works below.

Single-future trajectory prediction. Recent works have tried to predict a single best trajectory for pedestrians or vehicles. Early works [1, 33, 57, 60] focused on modeling person motion by considering people as points in the scene. Social-LSTM [1] is a popular method that uses social pooling to predict future trajectories. Other works [20, 58, 31, 28] have attempted to predict person paths by utilizing visual features. Kooij et al. [20] looked at pedestrians’ faces to model their awareness for future prediction. Liang et al. [28] proposed a joint future activity and trajectory prediction framework that utilizes multiple visual features using focal attention [27]. Many works [22, 48, 3, 17, 62] on vehicle trajectory prediction have also been proposed. CAR-Net [48] proposed attention networks on top of a scene-semantic CNN to predict vehicle trajectories. ChauffeurNet [3] utilized imitation learning for trajectory prediction.

Multi-future trajectory prediction. Many works have tried to model the uncertainty of trajectory prediction. A number of works focus on learning the effects of the physical scene, e.g., people tend to walk on the sidewalk instead of the grass. Various papers (e.g., [19, 40, 42]) use Inverse Reinforcement Learning (IRL) to forecast human trajectories. Other works [47, 14, 25], like Social-GAN [14], have utilized generative adversarial networks [13] to generate diverse person trajectories. In vehicle trajectory prediction, DESIRE [22] utilized variational auto-encoders (VAEs) to predict future vehicle trajectories. Many recent works [52, 5, 51, 32] also proposed probabilistic frameworks for multi-future vehicle trajectory prediction. Different from these previous works, we present a flexible two-stage framework that combines multi-modal distribution modeling and precise location prediction.

Trajectory Datasets. Many vehicle trajectory datasets [4, 6] have been proposed as a result of the surging popularity of self-driving. With recent advances in 3D computer vision research [61, 26, 49, 10, 43, 45, 15], many works [37, 11, 9, 8, 55, 64, 50] have looked into 3D simulated environments for their flexibility and ability to generate enormous amounts of data. We are the first to propose a 3D simulation dataset that is reconstructed from real-world scenarios and complemented with a variety of human trajectory continuations, for multi-future person trajectory prediction.

Figure 2: Overview of our model. The input to the model is the ground truth location history and a set of video frames, which are preprocessed by a semantic segmentation model. This is encoded by the “History Encoder” convolutional RNN. The output of the encoder is fed to the convolutional RNN decoders for location prediction. The coarse location decoder outputs a heatmap over the 2D grid of size $H \times W$. The fine location decoder outputs a vector offset within each grid cell. These are combined to generate a multimodal distribution over $\mathbb{R}^2$ for the predicted locations.

3 Methods

In this section, we describe our model for forecasting agent trajectories, which we call Multiverse. We focus on predicting the locations of a single agent for multiple steps into the future, $L_{h+1}, \dots, L_{h+T}$, given a sequence of past video frames, $V_{1:h}$, and agent locations, $L_{1:h}$, where $h$ is the history length and $T$ is the prediction length. Since there is inherent uncertainty in this task, our goal is to design a model that can effectively predict multiple plausible future trajectories by computing the multimodal distribution $P(L_{h+1:h+T} \mid V_{1:h}, L_{1:h})$. See Fig. 2 for a high-level summary of the model, and the sections below for more details.

3.1 History Encoder

The encoder computes a representation of the scene from the history of past locations, $L_{1:h}$, and frames, $V_{1:h}$. We encode each ground truth location $L_t$ by an index representing the nearest cell in a 2D grid of size $H \times W$, indexed from $1$ to $HW$. Inspired by [21, 29], we encode location with two different grid scales ($36 \times 18$ and $18 \times 9$); we show the benefits of this multi-scale encoding in Section 5.4. For simplicity of presentation, we focus on a single grid.

To make the model more invariant to low-level visual details, and thus more robust to domain shift (e.g., between different scenes, different views of the same scene, or between real and synthetic images), we preprocess each video frame using a pre-trained semantic segmentation model with $C$ possible class labels per pixel. We use the Deeplab model [7], trained on the ADE20k [63] dataset, and keep its weights frozen. Let $S_t$ be this semantic segmentation map, modeled as a tensor of size $H \times W \times C$.

We then pass these inputs to a convolutional RNN [56, 54] to compute a spatio-temporal feature history:

$\mathcal{H}_t = \mathrm{ConvRNN}\big(\mathrm{emb}(L_t) \odot (W_e * S_t),\ \mathcal{H}_{t-1}\big)$   (1)

where $\odot$ is the element-wise product, $*$ represents 2D-convolution, and $W_e$ is a learned filter. The function $\mathrm{emb}(\cdot)$ projects a cell index into a one-hot embedding of size $H \times W$ according to its spatial location. We use the final state of this encoder, $\mathcal{H}_h \in \mathbb{R}^{H \times W \times d_{enc}}$, where $d_{enc}$ is the hidden size, to initialize the state of the decoders. We also use the temporal average of the semantic maps, $\bar{S} = \frac{1}{h}\sum_{t=1}^{h} S_t$, during each decoding step. The context is represented as $\mathcal{Q} = [\mathcal{H}_h, \bar{S}]$.
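
To make the encoder concrete, here is a minimal PyTorch sketch of one pass over the history in the spirit of Eq. (1), using a hand-rolled ConvLSTM cell in place of the ConvLSTM of [56]. The module names, the (18, 36) grid shape, and the exact way the one-hot location map is fused with the convolved segmentation features are our illustration of the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (in the spirit of [56])."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class HistoryEncoder(nn.Module):
    """Encodes the observed (location, segmentation) history, as in Eq. (1)."""
    def __init__(self, num_classes, hid_ch=32, grid_hw=(18, 36)):
        super().__init__()
        self.grid_hw = grid_hw
        self.seg_conv = nn.Conv2d(num_classes, hid_ch, 3, padding=1)  # W_e * S_t
        self.cell = ConvLSTMCell(hid_ch, hid_ch)

    def forward(self, loc_idx, seg):
        # loc_idx: (B, T) grid-cell indices; seg: (B, T, C, H_img, W_img) scores.
        B, T = loc_idx.shape
        H, W = self.grid_hw
        h = seg.new_zeros(B, self.cell.hid_ch, H, W)
        state = (h, h.clone())
        for t in range(T):
            onehot = F.one_hot(loc_idx[:, t], H * W).float().view(B, 1, H, W)
            seg_t = F.interpolate(seg[:, t], size=(H, W), mode='bilinear',
                                  align_corners=False)
            # emb(L_t) ⊙ (W_e * S_t): broadcast the one-hot map over channels.
            state = self.cell(onehot * self.seg_conv(seg_t), state)
        return state  # final (h, c), used to initialise the decoders
```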

3.2 Coarse Location Decoder

After getting the context $\mathcal{Q}$, our goal is to forecast future locations. We initially focus on predicting locations at the level of grid cells, indexed by $g_t \in \{1, \dots, HW\}$. In Section 3.3, we discuss how to predict a continuous offset in $\mathbb{R}^2$, which specifies a “delta” from the center of each grid cell, to get a fine-grained location prediction.

Let the coarse distribution over grid locations at time $t$ (known as the “belief state”) be denoted by $C_t(i) \in [0, 1]$, for $i \in \{1, \dots, HW\}$ and $t \in \{h+1, \dots, h+T\}$. For brevity, we use a single index $i$ to represent a cell in the 2D grid. Rather than assuming a Markov model, we update this using a convolutional recurrent neural network, with hidden states $H^c_t$. We then compute the belief state by:

$C_t = \mathrm{softmax}\big(\mathrm{flatten}(W_c * H^c_t)\big)$   (2)

Here we use 2D-convolution with one filter and flatten the spatial dimension before applying softmax. The hidden state is updated using:

$H^c_t = \mathrm{ConvRNN}\big(\mathrm{GAT}\big([\,\mathrm{emb}(C_{t-1}),\ \bar{S}\,]\big),\ H^c_{t-1}\big)$   (3)

where $\mathrm{emb}(\cdot)$ embeds $C_{t-1}$ into a 3D tensor of size $H \times W \times d_e$ and $d_e$ is the embedding size. $\mathrm{GAT}(\cdot)$ is a graph attention network [53], where the graph structure $G$ corresponds to the 2D grid. More precisely, let $h_i$ be the feature vector corresponding to the $i$-th grid cell in the GAT input, and let $\hat{h}_i \in \mathbb{R}^{d_{dec}}$ be the corresponding output, where $d_{dec}$ is the size of the decoder hidden state. We compute these outputs of the GAT using:

$\hat{h}_i = h_i + \sum_{j \in \mathcal{N}(i)} f_e\big([h_i, h_j]\big)\, h_j$   (4)

where $\mathcal{N}(i)$ is the set of neighbors of node $i$ in $G$, each node being represented by a feature vector $h_j$ that collects cell $j$’s features in the GAT input, and $f_e$ is an edge function (implemented as an MLP in our experiments) that computes the attention weights.

The graph-structured update function for the RNN ensures that the probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior knowledge that people do not suddenly jump between distant locations. This inductive bias is also encoded in the convolutional structure, but adding the graph attention network gives improved results, because the weights are input-dependent and not fixed.
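
The sketch below illustrates one graph-attention pass over the grid (in the spirit of Eq. (4)) followed by the belief-state computation of Eq. (2). It assumes a 4-connected grid graph, a single-head MLP edge function standing in for $f_e$, and wrap-around neighbours at the border for brevity; it omits the recurrent state update of Eq. (3) and is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridGAT(nn.Module):
    """One round of graph attention over a 4-connected H x W grid (cf. Eq. (4))."""
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1))  # edge function f_e

    def forward(self, feat):                       # feat: (B, D, H, W)
        B, D, H, W = feat.shape
        nodes = feat.permute(0, 2, 3, 1).reshape(B, H * W, D)
        idx = torch.arange(H * W).view(H, W)
        # 4-neighbourhood via shifts (wraps around at the border for brevity).
        nbr_idx = torch.stack([torch.roll(idx, s, dims=(0, 1)).reshape(-1)
                               for s in [(-1, 0), (1, 0), (0, -1), (0, 1)]], dim=1)
        nbrs = nodes[:, nbr_idx]                   # (B, HW, 4, D)
        selfs = nodes.unsqueeze(2).expand_as(nbrs)
        att = F.softmax(self.edge_mlp(torch.cat([selfs, nbrs], dim=-1)), dim=2)
        out = nodes + (att * nbrs).sum(dim=2)      # aggregate neighbour features
        return out.reshape(B, H, W, D).permute(0, 3, 1, 2)

# Usage: attention over a decoder hidden state, then Eq. (2):
# a 2D convolution with one filter, flatten, and softmax over the H*W cells.
B, D, H, W = 2, 32, 18, 36
gat, out_conv = GridGAT(D), nn.Conv2d(D, 1, kernel_size=1)
hidden = torch.randn(B, D, H, W)
belief = F.softmax(out_conv(gat(hidden)).view(B, H * W), dim=1)   # C_t
```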

3.3 Fine Location Decoder

The 2D heatmap is useful for capturing multimodal distributions, but does not give very precise location predictions. To overcome this, we train a second convolutional RNN decoder to compute an offset vector for each possible grid cell using a regression output, $O_t \in \mathbb{R}^{H \times W \times 2}$. This RNN is updated using

$H^f_t = \mathrm{ConvRNN}\big(\mathrm{GAT}\big([\,\mathrm{emb}(C_{t-1}),\ \bar{S}\,]\big),\ H^f_{t-1}\big), \qquad O_t = W_f * H^f_t$   (5)

To compute the final predicted location, we first flatten the spatial dimensions of $O_t$ into a matrix of size $HW \times 2$. Then we use

$\hat{L}_t = q_{g_t} + O_t(g_t)$   (6)

where $g_t$ is the index of the selected grid cell, $q_{g_t} \in \mathbb{R}^2$ is the center of that cell, and $O_t(g_t)$ is the predicted offset for that cell at time $t$. For single-future prediction, we use greedy search, namely $g_t = \arg\max_i C_t(i)$ over the belief state. For multi-future prediction, we use the beam search described in Section 3.5.
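
Below is a small numpy sketch of Eq. (6) for a single time step: pick a cell from the belief state (greedy here) and add that cell's predicted offset to the cell centre. The (18, 36) grid, pixel-space offsets, and the frame size are assumptions for illustration only.

```python
import numpy as np

def decode_location(belief, offsets, grid_hw, frame_hw):
    """Combine the coarse cell choice and the fine offset into a pixel
    coordinate, as in Eq. (6).

    belief:  (H*W,)   probabilities over grid cells at one time step
    offsets: (H*W, 2) predicted (dx, dy) offsets, one per cell, in pixels
    """
    H, W = grid_hw
    cell_h, cell_w = frame_hw[0] / H, frame_hw[1] / W
    g = int(np.argmax(belief))                 # greedy choice (single-future)
    row, col = divmod(g, W)
    center = np.array([(col + 0.5) * cell_w, (row + 0.5) * cell_h])  # q_g
    return center + offsets[g]                 # q_g + O_t(g)

# Toy usage on an 18 x 36 grid over a 1080 x 1920 frame.
H, W = 18, 36
belief = np.full(H * W, 1.0 / (H * W)); belief[100] = 0.5
offsets = np.zeros((H * W, 2))
print(decode_location(belief, offsets, (H, W), (1080, 1920)))
```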

This idea of combining classification and regression is partially inspired by object detection methods (e.g., [39]). It is worth noting that concurrent work [5] also designed a two-stage model for trajectory forecasting. However, their classification targets are pre-defined anchor trajectories, whereas ours is not limited to predefined anchors.

3.4 Training

Our model is trained on the observed trajectory from time $1$ to $h$ and predicts the future trajectories (in $xy$-coordinates) from time $h+1$ to $h+T$. We supervise this training by providing ground truth targets for both the heatmap (belief state), $C^*_t$, and the regression offset map, $O^*_t$. In particular, for the coarse decoder, the cross-entropy loss is used:

$\mathcal{L}_{cls} = -\frac{1}{T} \sum_{t=h+1}^{h+T} \sum_{i=1}^{HW} C^*_t(i) \log C_t(i)$   (7)

For the fine decoder, we use the smoothed $L_1$ loss used in object detection [39]:

$\mathcal{L}_{reg} = \frac{1}{T \cdot HW} \sum_{t=h+1}^{h+T} \sum_{i=1}^{HW} \mathrm{smooth}_{L_1}\big(O_t(i),\ O^*_t(i)\big)$   (8)

where $O^*_t(i)$ is the delta between the true location and the center of grid cell $i$ at time $t$, and serves as the ground truth for $O_t(g_t)$ in Eq. (6). We impose this loss on every cell to improve robustness.

The final loss is then calculated using

$\mathcal{L}(\theta) = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{reg} + \lambda_2 \lVert \theta \rVert_2^2$   (9)

where $\lambda_2$ controls the regularization (weight decay), and $\lambda_1$ is used to balance the regression and classification losses.
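
A minimal PyTorch sketch of this objective (Eqs. (7)–(9)), with weight decay left to the optimizer; the tensor layouts and the parameter name `lambda_reg` (standing in for the balancing weight) are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def multiverse_loss(belief_logits, true_cell, offset_pred, offset_true,
                    lambda_reg=1.0):
    """Cross-entropy over grid cells (Eq. (7)) plus smooth-L1 offset
    regression over every cell (Eq. (8)), combined as in Eq. (9).

    belief_logits: (B, T, H*W)    pre-softmax scores from the coarse decoder
    true_cell:     (B, T)         ground-truth cell index per future step
    offset_pred:   (B, T, H*W, 2) predicted offsets for every cell
    offset_true:   (B, T, H*W, 2) deltas from each cell centre to the true location
    """
    B, T, HW = belief_logits.shape
    cls_loss = F.cross_entropy(belief_logits.reshape(B * T, HW),
                               true_cell.reshape(B * T))          # Eq. (7)
    reg_loss = F.smooth_l1_loss(offset_pred, offset_true)         # Eq. (8)
    return cls_loss + lambda_reg * reg_loss                       # Eq. (9)
```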

Note that during training, when updating the RNN, we feed in the predicted soft distribution over locations, $C_t$ (see Eq. (2)). An alternative would be to feed in the true values, $C^*_t$, i.e., use teacher forcing. However, this is known to cause problems such as exposure bias [38].

3.5 Inference

To generate multiple qualitatively distinct trajectories, we use the diverse beam search strategy from [24]. To define this precisely, let $B_t$ be the beam at time $t$; this set contains $K$ trajectories (histories of grid-cell selections) $M^k_{t-1} = [g^k_{h+1}, \dots, g^k_{t-1}]$, for $k \in \{1, \dots, K\}$, where each $g$ is an index in $\{1, \dots, HW\}$, along with their accumulated log probabilities, $P^k_{t-1}$. Let $C_t(i \mid M^k_{t-1})$ be the coarse location output probability from Eq. (2) and (3) at time $t$ given the inputs $M^k_{t-1}$.

The new beam is computed using

$B_t = \operatorname{topK}\Big\{ \big(M^k_{t-1} \oplus i,\ \ P^k_{t-1} + \log C_t(i \mid M^k_{t-1}) - \gamma\, r^k_i \big) \ :\ k \in \{1, \dots, K\},\ i \in \{1, \dots, HW\} \Big\}$   (10)

where $\oplus$ appends cell $i$ to the partial trajectory $M^k_{t-1}$, $r^k_i$ is the rank of cell $i$ among the expansions of beam $k$, and $\gamma$ is a diversity penalty term; we take the top $K$ elements from the set produced by considering all values with $k \in \{1, \dots, K\}$ and $i \in \{1, \dots, HW\}$. If $K = 1$, this reduces to greedy search.

Once we have computed the top $K$ future predictions, we add the corresponding offset vectors to get the final trajectories, $\hat{L}^k_t = q_{g^k_t} + O_t(g^k_t)$, following Eq. (6). This constitutes the final output of our model.
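
The following is a hedged numpy sketch of one step of this procedure (cf. Eq. (10)); the rank-based diversity penalty follows Li et al. [24], and the `step_fn` interface (returning the belief state for a given partial trajectory) is our assumption. With K = 1 the step reduces to the greedy search used for single-future prediction.

```python
import numpy as np

def diverse_beam_step(beams, scores, step_fn, K, gamma):
    """One step of diverse beam search over grid cells (cf. Eq. (10)).

    beams:   list of K partial trajectories (lists of cell indices)
    scores:  (K,) accumulated log-probabilities P^k
    step_fn: maps a partial trajectory to a (H*W,) probability vector,
             i.e. the coarse decoder's belief state for the next step
    gamma:   diversity penalty on the within-beam rank, following [24]
    """
    candidates = []
    for traj, score in zip(beams, scores):
        logp = np.log(step_fn(traj) + 1e-12)
        order = np.argsort(-logp)                  # rank siblings by probability
        for rank, cell in enumerate(order[:K]):    # expand top-K per beam
            candidates.append((score + logp[cell] - gamma * rank,
                               traj + [int(cell)]))
    candidates.sort(key=lambda c: -c[0])           # keep the best K overall
    top = candidates[:K]
    return [t for _, t in top], np.array([s for s, _ in top])
```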

Figure 3: Visualization of the Forking Paths dataset. The left column shows examples of the real videos and the second column shows the reconstructed scenes. The person in the blue bounding box is the controlled agent, and the multiple future trajectories annotated by humans are shown as overlaid person frames. On average, our dataset has 5.9 future trajectories per scenario. The red circles are the defined destinations. The green trajectories are the future trajectories of the reconstructed, uncontrolled agents. The scene semantic segmentation ground truth is shown in the third column, and the last column shows all four camera views, including the top-down view.

4 The Forking Paths Dataset

In this section, we describe our human-annotated simulation dataset, called Forking Paths, for multi-future trajectory evaluation.

Existing datasets. There are several real-world datasets for trajectory evaluation, such as SDD [44], ETH/UCY [35, 23], KITTI [12], nuScenes [4] and VIRAT/ActEV [2, 34]. However, they all share the fundamental problem that one can only observe one out of many possible future trajectories sampled from the underlying distribution. This is broadly acknowledged in prior works [32, 52, 5, 14, 42, 40] but has not yet been addressed.

The closest work to ours is the simulation used in [32, 52, 5]. However, these only contain artificial trajectories, not human generated ones. Also, they use a highly simplified 2D space, with pedestrians oversimplified as points and vehicles as blocks; no other scene semantics are provided.

Reconstructing reality in a simulator. In this work, we use CARLA [10], a near-realistic open-source simulator built on top of Unreal Engine 4. Following prior simulation datasets [11, 45], we semi-automatically reconstruct static scenes and their dynamic elements from the real-world videos in ETH/UCY and VIRAT/ActEV. There are 4 scenes in ETH/UCY and 5 in VIRAT/ActEV. We exclude 2 cluttered scenes (UNIV & 0002) that we are not able to reconstruct in CARLA, leaving 7 static scenes in our dataset.

For the dynamic movement of vehicles and pedestrians, we first convert the ground truth trajectory annotations from the real-world videos to the ground plane using the provided homography matrices. We then match the origins of the real-world trajectories to the correct locations in the re-created scenes.
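
A minimal numpy sketch of this projective mapping: image-plane trajectory points are lifted to homogeneous coordinates, multiplied by the 3x3 homography, and renormalized. The matrix below is a placeholder for illustration, not one of the homographies shipped with ETH/UCY or VIRAT.

```python
import numpy as np

def image_to_ground(points_xy, H):
    """Map image-plane points to the ground plane with a 3x3 homography H."""
    pts = np.asarray(points_xy, dtype=float)            # (N, 2) pixel coords
    homog = np.hstack([pts, np.ones((len(pts), 1))])    # to homogeneous coords
    mapped = homog @ H.T                                # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]               # back to Cartesian

# Placeholder homography, for illustration only.
H = np.array([[0.02, 0.00, -5.0],
              [0.00, 0.02, -3.0],
              [0.00, 0.00,  1.0]])
traj_px = [(512, 300), (520, 310), (530, 322)]
print(image_to_ground(traj_px, H))
```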

Human generation of plausible futures. We manually select sequences with more than one pedestrian. We also require that at least one pedestrian could have multiple plausible alternative destinations. We then select one of the pedestrians to be the “controlled agent” (CA) for each sequence, and set meaningful destinations within reach, like a car or an entrance of a building. On average, each agent has about 3 destinations to move towards. In total, we have 127 CAs from 7 scenes. We call each CA and their corresponding scene a scenario.

For each scenario, on average 5.9 human annotators control the agent towards the defined destinations. Specifically, they are asked to watch the first 5 seconds of video from a first-person view (with the camera slightly behind the pedestrian) and/or an overhead view (to give more context). They are then asked to control the motion of the agent so that it moves towards the specified destination in a “natural” way, e.g., without colliding with other moving objects (whose motion is derived from the real videos and is therefore unaware of the controlled agent). An annotation is considered successful if the agent reaches the destination without collision within the time limit of 10.4 seconds.

Note that our videos are up to 15.2 seconds long. This is slightly longer than previous works (e.g. [1, 14, 28, 47, 25, 60, 62]) that use 3.2 seconds of observation and 4.8 seconds for prediction. (We use 10.4 seconds for the future to allow us to evaluate longer term forecasts.)

Generating the data. Once we have collected human-generated trajectories, 750 in total after data cleaning, we render each one in four camera views (three 45-degree and one top-down view). Each camera view has 127 scenarios in total and each scenario has on average 5.9 future trajectories. With CARLA, we can also simulate different weather conditions, although we did not do so in this work. In addition to agent location, we collect ground truth for pixel-precise scene semantic segmentation from 13 classes including sidewalk, road, vehicle, pedestrian, etc. See Fig. 3.

5 Experimental results

This section evaluates various methods, including our Multiverse model, for multi-future trajectory prediction on the proposed Forking Paths dataset. To allow comparison with previous works, we also evaluate our model on the challenging VIRAT/ActEV [2, 34] benchmark for single-future path prediction.

5.1 Evaluation Metrics

Single-Future Evaluation. In real-world videos, each trajectory only has one sample of the future, so models are evaluated on how well they predict that single trajectory. Following prior work [28, 1, 14, 47, 22, 17, 5, 42], we introduce two standard metrics for this setting.

Let $Y^i = \{Y^i_t\}_{t=h+1}^{h+T}$ be the ground truth trajectory of the $i$-th test sample ($i = 1, \dots, N$), and $\hat{Y}^i$ be the corresponding prediction. We then employ two distance-based error metrics: i) Average Displacement Error (ADE): the average Euclidean distance between the ground truth coordinates and the predicted coordinates over all time instants:

$\mathrm{ADE} = \frac{1}{N \cdot T} \sum_{i=1}^{N} \sum_{t=h+1}^{h+T} \big\lVert \hat{Y}^i_t - Y^i_t \big\rVert_2$   (11)

ii) Final Displacement Error (FDE): the Euclidean distance between the predicted points and the ground truth point at the final prediction time:

$\mathrm{FDE} = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \hat{Y}^i_{h+T} - Y^i_{h+T} \big\rVert_2$   (12)

Multi-Future Evaluation. Let $Y^{ij}$ be the $j$-th true future trajectory for the $i$-th test sample, for $j \in \{1, \dots, J_i\}$, and let $\hat{Y}^{ik}$ be the $k$-th sample from the predicted distribution over trajectories, for $k \in \{1, \dots, K\}$. Since there is no agreed-upon evaluation metric for this setting, we simply extend the above metrics, as follows: i) Minimum Average Displacement Error Given K Predictions (minADE_K): similar to the metric described in [5, 40, 42, 14], for each true trajectory $j$ of test sample $i$, we select the closest overall prediction (out of the $K$ model predictions), and then measure its average error:

$\mathrm{minADE}_K = \frac{1}{T \sum_i J_i} \sum_{i=1}^{N} \sum_{j=1}^{J_i} \min_{k \in \{1,\dots,K\}} \sum_{t=h+1}^{h+T} \big\lVert \hat{Y}^{ik}_t - Y^{ij}_t \big\rVert_2$   (13)

ii) Minimum Final Displacement Error Given K Predictions (minFDE_K): similar to minADE_K, but we only consider the predicted points and the ground truth point at the final prediction time instant:

$\mathrm{minFDE}_K = \frac{1}{\sum_i J_i} \sum_{i=1}^{N} \sum_{j=1}^{J_i} \min_{k \in \{1,\dots,K\}} \big\lVert \hat{Y}^{ik}_{h+T} - Y^{ij}_{h+T} \big\rVert_2$   (14)
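
For reference, here is a small numpy sketch of the four metrics in Eqs. (11)–(14), evaluated for a single test sample; the array layouts (K predicted trajectories of length T, J true futures) are assumptions about how one might store the data, not a prescribed format.

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 error over all time steps (Eq. (11))."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final Displacement Error: L2 error at the final time step (Eq. (12))."""
    return np.linalg.norm(pred[-1] - gt[-1])

def min_ade_k(preds, gts):
    """minADE_K: error of the closest of K predictions, averaged over the
    true futures of this sample (Eq. (13)).

    preds: (K, T, 2) sampled trajectories; gts: (J, T, 2) true futures."""
    errs = np.linalg.norm(preds[None] - gts[:, None], axis=-1).mean(-1)  # (J, K)
    return errs.min(axis=1).mean()

def min_fde_k(preds, gts):
    """minFDE_K: like minADE_K but only at the final time step (Eq. (14))."""
    errs = np.linalg.norm(preds[None, :, -1] - gts[:, None, -1], axis=-1)  # (J, K)
    return errs.min(axis=1).mean()

# Toy usage for one sample: 20 predictions, 3 true futures, 26 future steps.
preds, gts = np.random.rand(20, 26, 2), np.random.rand(3, 26, 2)
print(min_ade_k(preds, gts), min_fde_k(preds, gts))
```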

5.2 Multi-Future Prediction on Forking Paths

Method                 Input Types            minADE20 (45-degree)   minADE20 (top-down)   minFDE20 (45-degree)   minFDE20 (top-down)
Linear                 Traj.                  213.2                  197.6                 403.2                  372.9
LSTM                   Traj.                  201.0 ± 2.2            183.7 ± 2.1           381.5 ± 3.2            355.0 ± 3.6
Social-LSTM [1]        Traj.                  197.5 ± 2.5            180.4 ± 1.0           377.0 ± 3.6            350.3 ± 2.3
Social-GAN (PV) [14]   Traj.                  191.2 ± 5.4            176.5 ± 5.2           351.9 ± 11.4           335.0 ± 9.4
Social-GAN (V) [14]    Traj.                  187.1 ± 4.7            172.7 ± 3.9           342.1 ± 10.2           326.7 ± 7.7
Next [28]              Traj.+Bbox+RGB+Seg.    186.6 ± 2.7            166.9 ± 2.2           360.0 ± 7.2            326.6 ± 5.0
Ours                   Traj.+Seg.             168.9 ± 2.1            157.7 ± 2.5           333.8 ± 3.7            316.5 ± 3.4
Table 1: Comparison of different methods on the Forking Paths dataset (mean ± standard deviation over 5 runs). Lower numbers are better. The numbers in the “45-degree” columns are averaged over 3 different 45-degree views. For the input types, “Traj.”, “RGB”, “Seg.” and “Bbox.” mean the inputs are coordinates, raw frames, semantic segmentations, and bounding boxes of all objects in the scene, respectively. All models are trained on real VIRAT/ActEV videos and tested on synthetic (CARLA-rendered) videos.

Dataset & Setups. The proposed Forking Paths dataset from Section 4 is used for multi-future trajectory prediction evaluation. Following the settings in previous works [28, 1, 14, 47, 32], we downsample the videos to 2.5 fps, extract person trajectories using the code released in [28], and let the models observe 3.2 seconds (8 frames) of the controlled agent before outputting trajectory coordinates in pixel space. Since the lengths of the ground truth future trajectories differ, each model predicts the maximum length at test time, but we evaluate each prediction using the actual length of the corresponding true trajectory.

Baseline methods. We compare our method with two simple baselines and three recent methods with released source code, including a recent model for multi-future prediction and the state-of-the-art model for single-future prediction. Linear is a single-layer model that predicts the next coordinates using a linear regressor based on the previous input point. LSTM is a simple LSTM [16] encoder-decoder model with coordinate inputs only. Social-LSTM [1]: we use the open-source implementation from https://github.com/agrimgupta92/sgan/. Next [28] is the state-of-the-art method for single-future trajectory prediction on the VIRAT/ActEV dataset; we train the Next model without the activity labels for a fair comparison, using the code from https://github.com/google/next-prediction/. Social-GAN [14] is a recent multi-future trajectory prediction model trained using a Minimum over N (MoN) loss; we train the two model variants (called PV and V) detailed in the paper using the code from [14].

All models are trained on real videos (from VIRAT/ActEV – see Section 5.3 for details) and tested on our synthetic videos (with CARLA-generated pixels, and annotator-generated trajectories). Most models just use trajectory data as input, except for our model (which uses trajectory and semantic segmentation) and Next (which uses trajectory, bounding box, semantic segmentation, and RGB frames).

Implementation Details. We use a ConvLSTM [56] cell for both the encoder and the decoders. The embedding size is set to 32, and the hidden sizes for the encoder and decoder are both 256. The scene semantic segmentation features are extracted from the Deeplab model [7], pretrained on the ADE20k [63] dataset. We use the Adadelta optimizer [59] with an initial learning rate of 0.3 and weight decay of 0.001. Other hyper-parameters for the baselines are the same as the ones in [14, 28]. We evaluate the top $K = 20$ predictions for multi-future trajectories. For the models that only output a single trajectory, including Linear, LSTM, Social-LSTM, and Next, we duplicate the output $K$ times before evaluating. For Social-GAN, we use $K$ different random noise inputs to get $K$ predictions. For our model, we use diversity beam search [24, 36] as described in Section 3.5.
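
A hedged sketch of the optimizer configuration stated above (Adadelta, learning rate 0.3, weight decay 0.001, embedding size 32, hidden size 256); the `model` below is a stand-in module, since the full Multiverse architecture is the one described in Section 3.

```python
import torch
import torch.nn as nn

# Hyper-parameters from the implementation details; the model below is a
# stand-in nn.Module, not the Multiverse architecture itself.
EMBED_SIZE, HIDDEN_SIZE = 32, 256
model = nn.Sequential(nn.Linear(EMBED_SIZE, HIDDEN_SIZE), nn.ReLU(),
                      nn.Linear(HIDDEN_SIZE, 2))
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.3, weight_decay=1e-3)
```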

Quantitative Results. Table 1 lists the multi-future evaluation results, where we divide the evaluation according to the camera viewing angle, 45-degree vs. top-down. We repeat all experiments (except “Linear”) 5 times with random initialization to produce the mean and standard deviation values. As we see, our model outperforms the baselines on all metrics, and it performs significantly better on the minADE metric, which suggests better prediction quality over all time instants. Notably, our model outperforms Social-GAN by a large margin of at least 8 points on all metrics.

Qualitative analysis. We visualize some outputs of the top 4 methods in Fig. 4. In each image, the yellow trajectories show the history of each controlled agent (derived from real video data) and the green trajectories are the ground truth future trajectories from human annotators. The predicted trajectories are shown as yellow-orange heatmaps for multi-future prediction methods, and as red lines for single-future prediction methods. As we see, our model generally puts probability mass where there is data, and does not “waste” probability mass where there is no data.

Figure 4: Qualitative analysis. See text for details.

Error analysis. We show some typical errors our model makes in Fig. 5. In the first image, our model misses the correct direction, perhaps due to a lack of diversity in our sampling procedure. In the second image, our model sometimes predicts that the person will “go through” the car (diagonal red beam) instead of going around it; this may be addressed by adding more training examples of “going around” obstacles. In the third image, our model predicts that the person will go to a moving car, due to the lack of modeling of the dynamics of other far-away agents in the scene. The fourth image shows a hard case where the person has just exited the vehicle and there is no indication of where they will go next (so our model “backs off” to a sensible “stay nearby” prediction). We leave solutions to these problems to future work.

Figure 5: Error analysis. See text for details.

5.3 Single-Future Prediction on VIRAT/ActEV

Dataset & Setups. NIST released VIRAT/ActEV [2] for activity detection research in streaming videos in 2018. This dataset is a new version of the VIRAT [34] dataset, with more videos and annotations. The length of videos with publicly available annotations is about 4.5 hours. Following [28], we use the official training set for training and the official validation set for testing. Other setups are the same as in Section 5.2, except we use the single-future evaluation metric.

Quantitative Results. Table 2 (first column) shows the evaluation results. As we see, our model achieves state-of-the-art performance. The improvement is especially large on the Final Displacement Error (FDE) metric, which we attribute to the coarse location decoder helping to regulate the model's long-term predictions. The gain shows that our model does well at both single-future prediction (on real data) and multi-future prediction (on our quasi-synthetic data).

Generalizing from simulation to real-world. As described in Section 4, we generate simulation data first by reconstructing from real-world videos. To verify the quality of the reconstructed data, and the efficacy of learning from simulation videos, we train all the models on the simulation videos derived from the real data. We then evaluate on the real test set of VIRAT/ActEV. As we see from the right column in Table 2, all models do worse in this scenario, due to the difference between synthetic and real data.

There are two sources of error. First, the synthetic trajectory data contains only about 60% of the real trajectory data, due to difficulties in reconstructing all of the real data in the simulator. Second, the synthetic images are not photo-realistic. Thus, methods (such as Next [28]) that rely on RGB input suffer the most, since they have never been trained on “real pixels”. Our method, which uses trajectories plus high-level semantic segmentations (which transfer from synthetic to real more easily), suffers the smallest drop in performance, showing its robustness to “domain shift”. See Table 1 for a comparison of the input sources used by each method.

Method                 Trained on Real (ADE / FDE)   Trained on Sim. (ADE / FDE)
Linear                 32.19 / 60.92                 48.65 / 90.84
LSTM                   23.98 / 44.97                 28.45 / 53.01
Social-LSTM [1]        23.10 / 44.27                 26.72 / 51.26
Social-GAN (V) [14]    30.40 / 61.93                 36.74 / 73.22
Social-GAN (PV) [14]   30.42 / 60.70                 36.48 / 72.72
Next [28]              19.78 / 42.43                 27.38 / 62.11
Ours                   18.51 / 35.84                 22.94 / 43.35
Table 2: Comparison of different methods on the VIRAT/ActEV dataset. We report ADE / FDE metrics; lower is better. The first column reports models trained on the real video training set; the second column reports models trained on the simulated version of this dataset.

5.4 Ablation Experiments

We test various ablations of our model on both single-future and multi-future trajectory prediction to substantiate our design decisions. Results are shown in Table 3, where the ADE/FDE metrics are shown in the “single-future” column and the minADE20/minFDE20 metrics (averaged across all views) in the “multi-future” column. We verify three of our key designs by leaving each module out of the full model.

(1) Spatial graph: Our model is built on top of a spatial 2D graph that uses graph attention to model the scene features. We train the model without the spatial graph; as we see, performance drops on both tasks. (2) Fine location decoder: We test our model without the fine location decoder, using only the grid center as the coordinate output. The significant performance drop on both tasks verifies the efficacy of this module. (3) Multi-scale grid: As mentioned in Section 3, we utilize two different grid scales (36 × 18 and 18 × 9) in training. We see that performance is slightly worse if we only use the fine scale (36 × 18).

Method                      Single-Future (ADE / FDE)   Multi-Future (minADE20 / minFDE20)
Our full model              18.51 / 35.84               166.1 / 329.5
No spatial graph            28.68 / 49.87               184.5 / 363.2
No fine location decoder    53.62 / 83.57               232.1 / 468.6
No multi-scale grid         21.09 / 38.45               171.0 / 344.4
Table 3: Performance on ablated versions of our model on single and multi-future trajectory prediction. Lower numbers are better.

6 Conclusion

In this paper, we have introduced the Forking Paths dataset and the Multiverse model for multi-future forecasting. Our study is the first to provide a quantitative benchmark and evaluation methodology for multi-future trajectory prediction, obtained by using human annotators to create a variety of trajectory continuations from the same past. Our model utilizes multi-scale location decoders with graph attention to predict multiple future locations. We have shown that our method achieves state-of-the-art performance on two challenging benchmarks: the large-scale real-video VIRAT/ActEV dataset and our proposed multi-future trajectory dataset. We believe our dataset, together with our models, will facilitate future research and applications in multi-future prediction.

References

  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In CVPR, Cited by: §1, §2, §4, §5.1, §5.2, §5.2, Table 1, Table 2.
  • [2] G. Awad, A. Butt, K. Curtis, J. Fiscus, A. Godil, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quénot, J. Magalhaes, D. Semedo, and S. Blasi (2018) TRECVID 2018: benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In TRECVID, Cited by: §1, §1, §1, §4, §5.3, §5.
  • [3] M. Bansal, A. Krizhevsky, and A. Ogale (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §1, §2.
  • [4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §2, §4.
  • [5] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449. Cited by: §1, §1, §2, §3.3, §4, §4, §5.1, §5.1.
  • [6] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3d tracking and forecasting with rich maps. In CVPR, Cited by: §2.
  • [7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §3.1, §5.2.
  • [8] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In CVPRW, Cited by: §2.
  • [9] C. R. de Souza, A. Gaidon, Y. Cabon, and A. M. López (2017) Procedural generation of videos to train deep action recognition networks. In CVPR, pp. 2594–2604. Cited by: §2.
  • [10] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: §1, §2, §4.
  • [11] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In CVPR, Cited by: §2, §4.
  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §4.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.
  • [14] A. Gupta, J. Johnson, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In CVPR, Cited by: §1, §2, §4, §4, §5.1, §5.1, §5.2, §5.2, §5.2, Table 1, Table 2.
  • [15] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller, et al. (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §2.
  • [16] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.2.
  • [17] J. Hong, B. Sapp, and J. Philbin (2019) Rules of the road: predicting driving behavior with a convolutional model of semantic interactions. In CVPR, Cited by: §2, §5.1.
  • [18] R. Kalman (1960) A new approach to linear filtering and prediction problems. Trans. ASME, D 82, pp. 35–44. Cited by: §1.
  • [19] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert (2012) Activity forecasting. In ECCV, Cited by: §1, §2.
  • [20] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila (2014) Context-based pedestrian path prediction. In ECCV, Cited by: §2.
  • [21] S. Lazebnik, C. Schmid, and J. Ponce (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR), Cited by: §3.1.
  • [22] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In CVPR, Cited by: §1, §2, §2, §5.1.
  • [23] A. Lerner, Y. Chrysanthou, and D. Lischinski (2007) Crowds by example. In Computer Graphics Forum, pp. 655–664. Cited by: §1, §4.
  • [24] J. Li, W. Monroe, and D. Jurafsky (2016) A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562. Cited by: §3.5, §5.2.
  • [25] Y. Li (2019) Which way are you going? imitative decision learning for path forecasting in dynamic scenes. In CVPR, Cited by: §1, §1, §2, §4.
  • [26] J. Liang, D. Fan, H. Lu, P. Huang, J. Chen, L. Jiang, and A. Hauptmann (2017) An event reconstruction tool for conflict monitoring using social media. In AAAI, Cited by: §2.
  • [27] J. Liang, L. Jiang, L. Cao, L. Li, and A. G. Hauptmann (2018) Focal visual-text attention for visual question answering. In CVPR, Cited by: §2.
  • [28] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei (2019) Peeking into the future: predicting future person activities and locations in videos. In CVPR, Cited by: §1, §2, §4, §5.1, §5.2, §5.2, §5.2, §5.3, §5.3, Table 1, Table 2.
  • [29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §3.1.
  • [30] M. Luber, J. A. Stork, G. D. Tipaldi, and K. O. Arras (2010) People tracking with human motion predictions from social forces. In ICRA, Cited by: §1.
  • [31] W. Ma, D. Huang, N. Lee, and K. M. Kitani (2017) Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, Cited by: §2.
  • [32] O. Makansi, E. Ilg, O. Cicek, and T. Brox (2019) Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In CVPR, Cited by: §1, §2, §4, §4, §5.2.
  • [33] H. Manh and G. Alaghband (2018) Scene-lstm: a model for human trajectory prediction. arXiv preprint arXiv:1808.04018. Cited by: §2.
  • [34] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis, et al. (2011) A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, Cited by: §1, §1, §1, §4, §5.3, §5.
  • [35] S. Pellegrini, A. Ess, and L. Van Gool (2012) Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, Cited by: §4.
  • [36] T. Plötz and S. Roth (2018) Neural nearest neighbors networks. In NeurIPS, Cited by: §5.2.
  • [37] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang (2017) Unrealcv: virtual worlds for computer vision. In ACM Multimedia, Cited by: §2.
  • [38] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §3.4.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1, §3.3, §3.4.
  • [40] N. Rhinehart, K. M. Kitani, and P. Vernaza (2018) R2p2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, Cited by: §1, §2, §4, §5.1.
  • [41] N. Rhinehart and K. M. Kitani (2017) First-person activity forecasting with online inverse reinforcement learning. In ICCV, Cited by: §1.
  • [42] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. arXiv preprint arXiv:1905.01296. Cited by: §2, §4, §5.1, §5.1.
  • [43] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, Cited by: §2.
  • [44] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In ECCV, Cited by: §4.
  • [45] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §2, §4.
  • [46] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In ICCV, Cited by: §1.
  • [47] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, and S. Savarese (2018) SoPhie: an attentive gan for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482. Cited by: §2, §4, §5.1, §5.2.
  • [48] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese (2018) Car-net: clairvoyant attentive recurrent network. In ECCV, Cited by: §2.
  • [49] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2018) Airsim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pp. 621–635. Cited by: §2.
  • [50] C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy (2019) Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641. Cited by: §2.
  • [51] Y. C. Tang and R. Salakhutdinov (2019) Multiple futures prediction. arXiv preprint arXiv:1911.00997. Cited by: §1, §2.
  • [52] L. A. Thiede and P. P. Brahma (2019) Analyzing the variety loss in the context of probabilistic trajectory prediction. arXiv preprint arXiv:1907.10178. Cited by: §1, §2, §4, §4.
  • [53] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3.2.
  • [54] Y. Wang, L. Jiang, M. Yang, L. Li, M. Long, and L. Fei-Fei (2019) Eidetic 3d lstm: a model for video prediction and beyond. In ICLR, Cited by: §3.1.
  • [55] Y. Wu, L. Jiang, and Y. Yang (2019) Revisiting embodiedqa: a simple baseline and beyond. arXiv preprint arXiv:1904.04166. Cited by: §2.
  • [56] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In NeurIPS, Cited by: §1, §3.1, §5.2.
  • [57] H. Xue, D. Q. Huynh, and M. Reynolds (2018) SS-lstm: a hierarchical lstm model for pedestrian trajectory prediction. In WACV, Cited by: §2.
  • [58] T. Yagi, K. Mangalam, R. Yonetani, and Y. Sato (2018) Future person localization in first-person videos. In CVPR, Cited by: §2.
  • [59] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §5.2.
  • [60] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng (2019) SR-lstm: state refinement for lstm towards pedestrian trajectory prediction. In CVPR, Cited by: §2, §4.
  • [61] Y. Zhang, G. M. Gibson, R. Hay, R. W. Bowman, M. J. Padgett, and M. P. Edgar (2015) A fast 3d reconstruction system with a low-cost camera accessory. Scientific reports 5, pp. 10909. Cited by: §2.
  • [62] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu (2019) Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, Cited by: §2, §4.
  • [63] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In CVPR, Cited by: §3.1, §5.2.
  • [64] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, pp. 3357–3364. Cited by: §2.