1 Introduction
Forecasting future human behavior is a fundamental problem in video understanding. In particular, future path prediction, which aims at forecasting a pedestrian’s future trajectory in the next few seconds, has received a lot of attention in our community [19, 1, 14, 25]. This functionality is a key component in a variety of applications such as autonomous driving [3, 5], long-term object tracking [18, 46], safety monitoring [28], robotic planning [41, 40], etc.
Of course, the future is often very uncertain: Given the same historical trajectory, a person may take different paths, depending on their (latent) goals. Thus recent work has started focusing on multi-future trajectory prediction [51, 5, 25, 32, 52, 22].
Consider the example in Fig. 1. We see a person moving from the bottom left towards the top right of the image, and our task is to predict where he will go next. Since there are many possible future trajectories this person might follow, we are interested in learning a model that can generate multiple plausible futures. However, since the ground truth data only contains one trajectory, it is difficult to evaluate such probabilistic models.
To overcome the aforementioned challenges, our first contribution is the creation of a realistic synthetic dataset that allows us to compare models in a quantitative way in terms of their ability to predict multiple plausible futures, rather than just evaluating them against a single observed trajectory as in existing studies. We create this dataset using the 3D CARLA [10] simulator, where the scenes are manually designed to be similar to those found in the challenging real-world benchmark VIRAT/ActEV [34, 2]. Once we have recreated the static scene, we automatically reconstruct trajectories by projecting real-world data to the 3D simulation world. See Figs. 1 and 3. We then semi-automatically select a set of plausible future destinations (corresponding to semantically meaningful locations in the scene), and ask human annotators to create multiple possible continuations of the real trajectories towards each such goal. In this way, our dataset is “anchored” in reality, and yet contains plausible variations in high-level human behavior, which is impossible to simulate automatically.
We call this dataset the “Forking Paths” dataset, a reference to the short story by Jorge Luis Borges (https://en.wikipedia.org/wiki/The_Garden_of_Forking_Paths). As shown in Fig. 1, different human annotators have created forking future trajectories for the identical historical past. So far, we have collected 750 sequences, each covering about 15 seconds, from 10 annotators controlling 127 agents in 7 different scenes. Each agent has 5.9 future trajectories on average. We render each sequence from 4 different views, and automatically generate dense labels, as illustrated in Figs. 1 and 3. In total, this amounts to 3.2 hours of trajectory sequences, which is comparable to the largest person trajectory benchmark VIRAT/ActEV [2, 34] (4.5 hours), and 5 times bigger than the common ETH/UCY [23, 30] benchmark. We therefore believe this will serve as a useful benchmark for evaluating models that can predict multiple futures. We will release the data together with the code used to generate it.
Our second contribution is to propose a new probabilistic model, Multiverse, which can generate multiple plausible trajectories given the past history of locations and the scene. The model contains two novel design decisions. First, we use a multi-scale representation of locations. In the first, coarse scale, we represent locations on a 2D grid, as shown in Fig. 1(1). This captures high-level uncertainty about possible destinations and leads to a better representation of multimodal distributions. In the second, fine scale, we predict a real-valued offset for each grid cell, to get more precise localization. This two-stage approach is partially inspired by object detection methods [39]. The second novelty of our model is to design convolutional RNNs [56] over the spatial graph as a way of encoding inductive bias about the movement patterns of people.
In addition, we empirically validate our model on the challenging real-world benchmark VIRAT/ActEV [34, 2] for single-future trajectory prediction, on which our model achieves the best published result. On the proposed simulation data for multi-future prediction, experimental results show that our model compares favorably against state-of-the-art models across different settings. To summarize, the main contributions of this paper are as follows: (i) We introduce the first dataset and evaluation methodology that allows us to compare models in a quantitative way in terms of their ability to predict multiple plausible futures. (ii) We propose a new, effective model for multi-future trajectory prediction. (iii) We establish a new state-of-the-art result on the challenging VIRAT/ActEV benchmark, and compare various methods on our multi-future prediction dataset.
2 Related Work
There is a large recent literature on forecasting future trajectories. We briefly review some of these works below.
Single-future trajectory prediction. Recent works have tried to predict a single best trajectory for pedestrians or vehicles. Early works [1, 33, 57, 60] focused on modeling person motions by considering them as points in the scene. Social-LSTM [1] is a popular method that uses social pooling to predict future trajectories. Other works [20, 58, 31, 28] have attempted to predict person paths by utilizing visual features. Kooij et al. [20] looked at pedestrians’ faces to model their awareness for future prediction. Recently, Liang et al. [28] proposed a joint future activity and trajectory prediction framework that utilizes multiple visual features with focal attention [27]. Many works [22, 48, 3, 17, 62] on vehicle trajectory prediction have been proposed. CAR-Net [48] proposed attention networks on top of a scene-semantic CNN to predict vehicle trajectories. ChauffeurNet [3] utilized imitation learning for trajectory prediction.
Multi-future trajectory prediction. Many works have tried to model the uncertainty of trajectory prediction. A number of works focused on learning the effects of the physical scene, e.g., people tend to walk on the sidewalk instead of the grass. Various papers (e.g., [19, 40, 42]) use Inverse Reinforcement Learning (IRL) to forecast human trajectories. Other works [47, 14, 25], like Social-GAN [14], have utilized generative adversarial networks [13] to generate diverse person trajectories. In vehicle trajectory prediction, DESIRE [22] utilized variational autoencoders (VAEs) to predict future vehicle trajectories. Many recent works [52, 5, 51, 32] also proposed probabilistic frameworks for multi-future vehicle trajectory prediction. Different from these previous works, we present a flexible two-stage framework that combines multimodal distribution modeling and precise location prediction.
Trajectory datasets. Many vehicle trajectory datasets [4, 6] have been proposed as a result of self-driving’s surging popularity. With recent advances in 3D computer vision research [61, 26, 49, 10, 43, 45, 15], many works [37, 11, 9, 8, 55, 64, 50] have turned to 3D simulated environments for their flexibility and ability to generate enormous amounts of data. We are the first to propose a 3D simulation dataset that is reconstructed from real-world scenarios and complemented with a variety of human trajectory continuations for multi-future person trajectory prediction.
3 Methods
In this section, we describe our model for forecasting agent trajectories, which we call Multiverse. We focus on predicting the locations of a single agent for multiple steps into the future, $Y = \{Y_{h+1}, \dots, Y_{h+T}\}$, given a sequence of past video frames, $V_{1:h}$, and agent locations, $X_{1:h}$, where $h$ is the history length and $T$ is the prediction length. Since there is inherent uncertainty in this task, our goal is to design a model that can effectively predict multiple plausible future trajectories, by computing the multimodal distribution $P(Y_{h+1:h+T} \mid X_{1:h}, V_{1:h})$. See Fig. 2 for a high-level summary of the model, and the sections below for more details.
3.1 History Encoder
The encoder computes a representation of the scene from the history of past locations, $X_{1:h}$, and frames, $V_{1:h}$. We encode each ground truth location by an index representing the nearest cell in a 2D grid of size $H \times W$, indexed from $1$ to $HW$. Inspired by [21, 29], we encode location with two different grid scales ($36 \times 18$ and $18 \times 9$); we show the benefits of this multi-scale encoding in Section 5.4. For simplicity of presentation, we focus on a single grid.
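To make the grid encoding concrete, here is a minimal NumPy sketch of mapping a pixel location to a flat cell index at the two scales. The function names, the frame size, and the row-major indexing convention are our assumptions for illustration, not from the paper:

```python
import numpy as np

def encode_location(x, y, img_w, img_h, grid_w, grid_h):
    """Map a pixel coordinate to the index of the nearest cell
    in a grid_w x grid_h grid laid over an img_w x img_h frame."""
    col = min(int(x / img_w * grid_w), grid_w - 1)
    row = min(int(y / img_h * grid_h), grid_h - 1)
    return row * grid_w + col  # single flat index in [0, grid_w * grid_h)

def one_hot(idx, grid_w, grid_h):
    """One-hot spatial embedding of a cell index, shape (grid_h, grid_w)."""
    emb = np.zeros((grid_h, grid_w), dtype=np.float32)
    emb[idx // grid_w, idx % grid_w] = 1.0
    return emb

# Encode the same location at the two grid scales used in the paper.
x, y = 960.0, 540.0                                 # a point in a 1920x1080 frame
coarse = encode_location(x, y, 1920, 1080, 18, 9)   # 18 x 9 grid
fine   = encode_location(x, y, 1920, 1080, 36, 18)  # 36 x 18 grid
```

The same pixel location thus yields one index per scale, and each index can be turned into the spatial one-hot embedding that the encoder consumes.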
To make the model more invariant to low-level visual details, and thus more robust to domain shift (e.g., between different scenes, different views of the same scene, or between real and synthetic images), we preprocess each video frame using a pretrained semantic segmentation model, with $C$ possible class labels per pixel. We use the Deeplab model [7] trained on the ADE20k [63] dataset, and keep its weights frozen. Let $S_t \in \mathbb{R}^{H \times W \times C}$ be this semantic segmentation map, modeled as a tensor of size $H \times W \times C$. We then pass these inputs to a convolutional RNN [56, 54] to compute a spatial-temporal feature history:
(1)  $H^e_t = \mathrm{ConvRNN}\big(\mathrm{onehot}(X_t) \odot (S_t \ast W_e),\ H^e_{t-1}\big)$
where $\odot$ is element-wise product, and $\ast$ represents 2D-convolution. The function $\mathrm{onehot}(\cdot)$ projects a cell index into a one-hot embedding of size $H \times W$ according to its spatial location. We use the final state of this encoder, $H^e_h$, where the channel dimension is the hidden size, to initialize the state of the decoders. We also use the temporal average of the semantic maps, $\bar{S} = \frac{1}{h} \sum_{t=1}^{h} S_t$, during each decoding step. The context is represented as $\mathcal{H} = (H^e_h, \bar{S})$.
3.2 Coarse Location Decoder
After getting the context $\mathcal{H}$, our goal is to forecast future locations. We initially focus on predicting locations at the level of grid cells, $Y_t \in \{1, \dots, HW\}$. In Section 3.3, we discuss how to predict a continuous offset in $\mathbb{R}^2$, which specifies a “delta” from the center of each grid cell, to get a fine-grained location prediction.
Let the coarse distribution over grid locations at time $t$ (known as the “belief state”) be denoted by $C_t(i) = P(Y_t = i \mid X_{1:h}, V_{1:h})$, for $i \in \{1, \dots, HW\}$ and $t \in \{h+1, \dots, h+T\}$. For brevity, we use a single index $i$ to represent a cell in the 2D grid. Rather than assuming a Markov model, we update this using a convolutional recurrent neural network, with hidden states $H^c_t$. We then compute the belief state by:
(2)  $C_{t+1} = \mathrm{softmax}\big(\mathrm{flatten}(H^c_{t+1} \ast W_c)\big)$
Here we use 2D-convolution with one filter and flatten the spatial dimension before applying softmax. The hidden state is updated using:
(3)  $H^c_{t+1} = \mathrm{ConvRNN}\big(\mathrm{GAT}\big(H^c_t \odot \mathrm{emb}(C_t)\big),\ H^c_t\big)$
where $\mathrm{emb}(\cdot)$ embeds $C_t$ into a 3D tensor of size $H \times W \times d_e$, and $d_e$ is the embedding size. $\mathrm{GAT}(\cdot)$ is a graph attention network [53], where the graph structure corresponds to the 2D grid in $H^c_t$. More precisely, let $h_i$ be the feature vector corresponding to the $i$-th grid cell in $H^c_t$, and let $h'_i \in \mathbb{R}^{d_c}$ be the corresponding output in $\mathrm{GAT}(H^c_t)$, where $d_c$ is the size of the decoder hidden state. We compute these outputs of GAT using:
(4)  $h'_i = h_i + \sum_{j \in \mathcal{N}(i)} f_e(h_i, h_j)\, h_j$
where $\mathcal{N}(i)$ are the neighbors of node $i$ in the 2D grid, with each node represented by $h_j$, which collects cell $j$’s feature in $H^c_t$. $f_e$ is some edge function (implemented as an MLP in our experiments) that computes the attention weights.
The graphstructured update function for the RNN ensures that the probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior knowledge that people do not suddenly jump between distant locations. This inductive bias is also encoded in the convolutional structure, but adding the graph attention network gives improved results, because the weights are inputdependent and not fixed.
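A minimal NumPy sketch of this graph-attention update over a 4-connected grid is given below. All names are ours; for brevity the edge function is a single linear scoring layer followed by a softmax over neighbors, whereas the paper uses an MLP:

```python
import numpy as np

def grid_neighbors(i, H, W):
    """4-connected neighbors of flat cell index i on an H x W grid."""
    r, c = divmod(i, W)
    nbrs = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        rr, cc = r + dr, c + dc
        if 0 <= rr < H and 0 <= cc < W:
            nbrs.append(rr * W + cc)
    return nbrs

def gat_step(h, H, W, w_edge):
    """One graph-attention update over the grid.
    h: (H*W, d) node features; w_edge: (2*d,) edge-scoring weights.
    Returns h'_i = h_i + sum_j softmax_j(e_ij) * h_j over neighbors j."""
    out = h.copy()
    for i in range(H * W):
        nbrs = grid_neighbors(i, H, W)
        # score each neighbor from the concatenated pair of node features
        scores = np.array([w_edge @ np.concatenate([h[i], h[j]]) for j in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                     # attention over neighbors
        out[i] = h[i] + (alpha[:, None] * h[nbrs]).sum(axis=0)
    return out
```

Because attention is restricted to adjacent cells, each update step only lets probability mass flow between neighboring locations, which is exactly the inductive bias described above.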
3.3 Fine Location Decoder
The 2D heatmap is useful for capturing multimodal distributions, but does not give very precise location predictions. To overcome this, we train a second convolutional RNN decoder, with hidden states $H^f_t$, to compute an offset vector for each possible grid cell using a regression output, $O_t \in \mathbb{R}^{H \times W \times 2}$. This RNN is updated using
(5)  $H^f_{t+1} = \mathrm{ConvRNN}\big(\mathrm{GAT}(H^f_t),\ H^f_t\big), \qquad O_{t+1} = H^f_{t+1} \ast W_f$
To compute the final prediction location, we first flatten the spatial dimension of $O_t$ into $\hat{O}_t \in \mathbb{R}^{HW \times 2}$. Then we use
(6)  $\hat{Y}_t = Q_{i_t} + \hat{O}_t(i_t)$
where $i_t$ is the index of the selected grid cell, $Q_{i_t} \in \mathbb{R}^2$ is the center of that cell, and $\hat{O}_t(i_t)$ is the predicted offset for that cell at time $t$. For single-future prediction, we use greedy search, namely $i_t = \arg\max_i C_t(i)$ over the belief state. For multi-future prediction, we use the beam search of Section 3.5.
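The greedy coarse-to-fine decoding of a single time step can be sketched as follows (a NumPy sketch; function names, the frame size, and row-major cell indexing are our assumptions):

```python
import numpy as np

def cell_center(idx, grid_w, grid_h, img_w, img_h):
    """Pixel coordinates of the center of grid cell `idx`."""
    row, col = divmod(idx, grid_w)
    return np.array([(col + 0.5) * img_w / grid_w,
                     (row + 0.5) * img_h / grid_h])

def decode_location(beliefs, offsets, grid_w, grid_h, img_w, img_h):
    """Greedy single-future decoding for one time step:
    pick the most probable cell, then add its regressed offset.
    beliefs: (grid_h*grid_w,) probabilities; offsets: (grid_h*grid_w, 2)."""
    i = int(np.argmax(beliefs))            # coarse stage: best grid cell
    return cell_center(i, grid_w, grid_h, img_w, img_h) + offsets[i]

# Toy example on an 18 x 9 grid over a 1920 x 1080 frame.
beliefs = np.zeros(18 * 9)
beliefs[81] = 1.0                          # all mass on one cell
offsets = np.zeros((18 * 9, 2))
offsets[81] = [5.0, -3.0]                  # fine stage: sub-cell correction
pred = decode_location(beliefs, offsets, 18, 9, 1920, 1080)
```

The coarse stage picks a cell, the fine stage shifts the prediction away from the cell center, giving sub-cell precision.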
This idea of combining classification and regression is partially inspired by object detection methods (e.g., [39]). It is worth noting that in concurrent work, [5] also designed a two-stage model for trajectory forecasting. However, their classification targets are predefined anchor trajectories; ours is not limited by predefined anchors.
3.4 Training
Our model trains on the observed trajectory from time $1$ to $h$ and predicts the future trajectories (in $(x, y)$ coordinates) from time $h+1$ to $h+T$. We supervise this training by providing ground truth targets for both the heatmap (belief state), $C^*_t$, and the regression offset map, $O^*_t$. In particular, for the coarse decoder, the cross-entropy loss is used:
(7)  $\mathcal{L}_{\mathrm{cls}} = -\frac{1}{T} \sum_{t=h+1}^{h+T} \sum_{i=1}^{HW} C^*_t(i) \log C_t(i)$
For the fine decoder, we use the smoothed $L_1$ loss used in object detection [39]:
(8)  $\mathcal{L}_{\mathrm{reg}} = \frac{1}{T} \sum_{t=h+1}^{h+T} \sum_{i=1}^{HW} \mathrm{smooth}_{L_1}\big(O_t(i),\, O^*_t(i)\big)$
where $O^*_t(i)$ is the delta between the true location and the center of grid cell $i$ at time $t$, and is the ground truth for $\hat{O}_t(i_t)$ in Eq. (6). We impose this loss on every cell to improve the robustness.
The final loss is then calculated using
(9)  $\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \mathcal{L}_{\mathrm{reg}} + \lambda_2 \lVert \theta \rVert_2^2$
where $\lambda_2$ controls the regularization (weight decay), and $\lambda_1$ is used to balance the regression and classification losses.
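The two supervised terms can be sketched as below (a NumPy sketch; the weight-decay term is omitted since it is typically handled by the optimizer, and the balancing weight value here is an arbitrary placeholder, not the paper's):

```python
import numpy as np

def smooth_l1(pred, target):
    """Smoothed L1 (Huber) loss, as used in object detection [39]."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def multiverse_loss(cell_logits, true_cells, pred_offsets, true_offsets,
                    balance=0.1):
    """Classification loss over grid cells + regression loss over offsets.
    cell_logits: (T, HW); true_cells: (T,) int cell indices;
    pred_offsets, true_offsets: (T, HW, 2).
    `balance` plays the role of the balancing weight (placeholder value)."""
    # cross-entropy over the belief heatmap (one-hot ground truth cells)
    logp = cell_logits - np.log(np.exp(cell_logits).sum(axis=1, keepdims=True))
    l_cls = -logp[np.arange(len(true_cells)), true_cells].sum()
    # smooth-L1 imposed on every cell's offset, for robustness
    l_reg = smooth_l1(pred_offsets, true_offsets)
    return l_cls + balance * l_reg
```

Imposing the regression loss on every cell, not just the ground-truth cell, matches the robustness argument in the text.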
3.5 Inference
To generate multiple qualitatively distinct trajectories, we use the diverse beam search strategy from [24]. To define this precisely, let $B_t = \{(Y^k_{h+1:t},\ p^k_t)\}_{k=1}^{K}$ be the beam at time $t$; this set contains $K$ trajectories (history selections) $Y^k_{h+1:t} = (y^k_{h+1}, \dots, y^k_t)$, where each $y^k$ is an index in $\{1, \dots, HW\}$, along with their accumulated log probabilities, $p^k_t$. Let $C_{t+1}(i \mid Y^k_{h+1:t})$ be the coarse location output probability from Eqs. (2) and (3) at time $t+1$ given inputs $Y^k_{h+1:t}$.
The new beam is computed using
(10)  $B_{t+1} = \operatorname*{topK}_{k \in \{1..K\},\ i \in \{1..HW\}} \Big\{ \big(Y^k_{h+1:t} \oplus i,\ \ p^k_t + \log C_{t+1}(i \mid Y^k_{h+1:t}) - \gamma(i, k)\big) \Big\}$
where $\gamma(i, k)$ is a diversity penalty term, and we take the top $K$ elements from the set produced by considering all values of $k$ and $i$. If $K = 1$, this reduces to greedy search.
Once we have computed the top $K$ future predictions, we add the corresponding offset vectors to get trajectories, by Eq. (6). This constitutes the final output of our model.
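A diversity-promoting beam search over grid cells might be sketched as follows. This is an illustrative sketch only: the exact penalty of [24] differs in detail; here, candidates expanded from the same parent beam are penalized by their within-beam rank, and all names are ours:

```python
import numpy as np

def diverse_beam_search(step_logprob, T, K, gamma=1.0):
    """Diversity-promoting beam search over grid cells (after [24]).
    step_logprob(history) -> (num_cells,) log-probabilities of the next cell.
    The r-th ranked sibling expanded from the same beam is penalized by
    gamma * r before the global top-K cut; true log-probs are kept intact."""
    beams = [([], 0.0)]                          # (history, accumulated logp)
    for _ in range(T):
        candidates = []
        for hist, logp in beams:
            logits = step_logprob(hist)
            order = np.argsort(-logits)          # rank this beam's siblings
            for rank, cell in enumerate(order[:K]):
                penalized = logp + logits[cell] - gamma * rank
                candidates.append((hist + [int(cell)],
                                   logp + logits[cell], penalized))
        candidates.sort(key=lambda c: -c[2])     # cut by penalized score
        beams = [(h, lp) for h, lp, _ in candidates[:K]]
    return beams
```

With $K = 1$ (and any penalty) this reduces to greedy decoding; the penalty only matters when several expansions of the same beam compete for slots.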
4 The Forking Paths Dataset
In this section, we describe our human-annotated simulation dataset, called Forking Paths, for multi-future trajectory evaluation.
Existing datasets. There are several real-world datasets for trajectory evaluation, such as SDD [44], ETH/UCY [35, 23], KITTI [12], nuScenes [4] and VIRAT/ActEV [2, 34]. However, they all share the fundamental problem that one can only observe one out of the many possible future trajectories sampled from the underlying distribution. This is broadly acknowledged in prior works [32, 52, 5, 14, 42, 40] but has not yet been addressed.
The closest work to ours is the simulation used in [32, 52, 5]. However, these only contain artificial trajectories, not human generated ones. Also, they use a highly simplified 2D space, with pedestrians oversimplified as points and vehicles as blocks; no other scene semantics are provided.
Reconstructing reality in a simulator. In this work, we use CARLA [10], a near-realistic open-source simulator built on top of Unreal Engine 4. Following prior simulation datasets [11, 45], we semi-automatically reconstruct static scenes and their dynamic elements from the real-world videos in ETH/UCY and VIRAT/ActEV. There are 4 scenes in ETH/UCY and 5 in VIRAT/ActEV. We exclude 2 cluttered scenes (UNIV & 0002) that we are not able to reconstruct in CARLA, leaving 7 static scenes in our dataset.
For the dynamic movement of vehicles and pedestrians, we first convert the ground truth trajectory annotations from the real-world videos to the ground plane using the provided homography matrices. We then match the real-world trajectories’ origins to the correct locations in the recreated scenes.
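The pixel-to-ground-plane conversion is a standard projective transform with the provided 3x3 homography; a minimal sketch (function names are ours):

```python
import numpy as np

def to_ground_plane(points_px, H):
    """Project pixel coordinates to the ground plane using a 3x3 homography,
    such as those provided with the ETH/UCY and VIRAT/ActEV annotations.
    points_px: (N, 2) pixel trajectory; returns (N, 2) world coordinates."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous
    world = pts @ H.T
    return world[:, :2] / world[:, 2:3]   # perspective divide

# With the identity homography, the trajectory is unchanged.
traj = np.array([[100.0, 200.0], [110.0, 205.0]])
same = to_ground_plane(traj, np.eye(3))
```

The resulting ground-plane coordinates can then be aligned to the recreated scene by matching the trajectory origins, as described above.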
Human generation of plausible futures. We manually select sequences with more than one pedestrian. We also require that at least one pedestrian could have multiple plausible alternative destinations. We then select one of the pedestrians to be the “controlled agent” (CA) for each sequence, and set meaningful destinations within reach, like a car or an entrance of a building. On average, each agent has about 3 destinations to move towards. In total, we have 127 CAs from 7 scenes. We call each CA and their corresponding scene a scenario.
For each scenario, there are on average 5.9 human annotators controlling the agent to the defined destinations. Specifically, they are asked to watch the first 5 seconds of video, from a first-person view (with the camera slightly behind the pedestrian) and/or an overhead view (to give more context). They are then asked to control the motion of the agent so that it moves towards the specified destination in a “natural” way, e.g., without colliding with other moving objects (whose motion is derived from the real videos, and is therefore unaware of the controlled agent). The annotation is considered successful if the agent reaches the destination without colliding within the time limit of 10.4 seconds.
Note that our videos are up to 15.2 seconds long. This is slightly longer than previous works (e.g. [1, 14, 28, 47, 25, 60, 62]) that use 3.2 seconds of observation and 4.8 seconds for prediction. (We use 10.4 seconds for the future to allow us to evaluate longer term forecasts.)
Generating the data. Once we have collected the human-generated trajectories, 750 in total after data cleaning, we render each one in four camera views (three 45-degree and one top-down view). Each camera view has 127 scenarios in total and each scenario has on average 5.9 future trajectories. With CARLA, we can also simulate different weather conditions, although we did not do so in this work. In addition to agent locations, we collect ground truth for pixel-precise scene semantic segmentation with 13 classes, including sidewalk, road, vehicle, pedestrian, etc. See Fig. 3.
5 Experimental results
This section evaluates various methods, including our Multiverse model, for multi-future trajectory prediction on the proposed Forking Paths dataset. To allow comparison with previous works, we also evaluate our model on the challenging VIRAT/ActEV [2, 34] benchmark for single-future path prediction.
5.1 Evaluation Metrics
Single-Future Evaluation. In real-world videos, each trajectory only has one sample of the future, so models are evaluated on how well they predict that single trajectory. Following prior work [28, 1, 14, 47, 22, 17, 5, 42], we introduce two standard metrics for this setting.
Let $Y^i = (Y^i_{h+1}, \dots, Y^i_{h+T})$ be the ground truth trajectory of the $i$-th sample, and $\hat{Y}^i$ be the corresponding prediction. We then employ two distance-based error metrics: i) Average Displacement Error (ADE): the average Euclidean distance between the ground truth coordinates and the prediction coordinates over all time instants:
(11)  $\mathrm{ADE} = \dfrac{\sum_{i=1}^{N} \sum_{t=h+1}^{h+T} \lVert \hat{Y}^i_t - Y^i_t \rVert_2}{N \times T}$
ii) Final Displacement Error (FDE): the Euclidean distance between the predicted point and the ground truth point at the final prediction time:
(12)  $\mathrm{FDE} = \dfrac{\sum_{i=1}^{N} \lVert \hat{Y}^i_{h+T} - Y^i_{h+T} \rVert_2}{N}$
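These two single-future metrics can be computed directly from arrays of coordinates (a NumPy sketch with our own function names):

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all samples
    and all time steps. pred, gt: (N, T, 2) coordinate arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    """Final Displacement Error: mean L2 distance at the last time step."""
    return np.linalg.norm(pred[:, -1] - gt[:, -1], axis=-1).mean()
```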
Multi-Future Evaluation. Let $Y^{ij}$ be the $j$-th true future trajectory for the $i$-th test sample, for $j \in \{1, \dots, J_i\}$, and let $\hat{Y}^{ik}$ be the $k$-th sample from the predicted distribution over trajectories, for $k \in \{1, \dots, K\}$. Since there is no agreed-upon evaluation metric for this setting, we simply extend the above metrics, as follows: i) Minimum Average Displacement Error Given K Predictions (minADE_{K}): similar to the metric described in [5, 40, 42, 14], for each true trajectory in test sample $i$, we select the closest overall prediction (from the $K$ model predictions), and then measure its average error:
(13)  $\mathrm{minADE}_K = \dfrac{\sum_{i=1}^{N} \sum_{j=1}^{J_i} \min_{k=1}^{K} \sum_{t=h+1}^{h+T} \lVert \hat{Y}^{ik}_t - Y^{ij}_t \rVert_2}{T \times \sum_{i=1}^{N} J_i}$
ii) Minimum Final Displacement Error Given K Predictions (minFDE_{K}): similar to minADE_{K}, but we only consider the predicted points and the ground truth point at the final prediction time instant:
(14)  $\mathrm{minFDE}_K = \dfrac{\sum_{i=1}^{N} \sum_{j=1}^{J_i} \min_{k=1}^{K} \lVert \hat{Y}^{ik}_{h+T} - Y^{ij}_{h+T} \rVert_2}{\sum_{i=1}^{N} J_i}$
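For a single test sample, the minimum-over-K metrics can be sketched as below (a NumPy sketch; names and the per-sample form are ours, with the average over test samples left to the caller):

```python
import numpy as np

def min_ade_k(preds, gts):
    """minADE_K for one test sample: for each true future trajectory,
    take the closest of the K predictions by average displacement.
    preds: (K, T, 2); gts: (J, T, 2). Returns the mean over the J futures."""
    # (J, K) matrix of average displacements between every (gt, pred) pair
    d = np.linalg.norm(preds[None] - gts[:, None], axis=-1).mean(axis=-1)
    return d.min(axis=1).mean()

def min_fde_k(preds, gts):
    """minFDE_K: same, but only the final time step is compared."""
    d = np.linalg.norm(preds[None, :, -1] - gts[:, None, -1], axis=-1)
    return d.min(axis=1).mean()
```

Note the min is taken over the K predictions separately for each annotated future, so a model is rewarded for covering all plausible futures rather than repeating one.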
5.2 Multi-Future Prediction on Forking Paths
Table 1: Multi-future evaluation on the Forking Paths dataset (minADE_20 / minFDE_20 in pixels; lower is better).

| Method | Input Types | minADE_20 (45-degree) | minADE_20 (top-down) | minFDE_20 (45-degree) | minFDE_20 (top-down) |
| --- | --- | --- | --- | --- | --- |
| Linear | Traj. | 213.2 | 197.6 | 403.2 | 372.9 |
| LSTM | Traj. | 201.0 ± 2.2 | 183.7 ± 2.1 | 381.5 ± 3.2 | 355.0 ± 3.6 |
| Social-LSTM [1] | Traj. | 197.5 ± 2.5 | 180.4 ± 1.0 | 377.0 ± 3.6 | 350.3 ± 2.3 |
| Social-GAN (PV) [14] | Traj. | 191.2 ± 5.4 | 176.5 ± 5.2 | 351.9 ± 11.4 | 335.0 ± 9.4 |
| Social-GAN (V) [14] | Traj. | 187.1 ± 4.7 | 172.7 ± 3.9 | 342.1 ± 10.2 | 326.7 ± 7.7 |
| Next [28] | Traj.+Bbox+RGB+Seg. | 186.6 ± 2.7 | 166.9 ± 2.2 | 360.0 ± 7.2 | 326.6 ± 5.0 |
| Ours | Traj.+Seg. | 168.9 ± 2.1 | 157.7 ± 2.5 | 333.8 ± 3.7 | 316.5 ± 3.4 |
Dataset & Setups. The proposed Forking Paths dataset from Section 4 is used for multi-future trajectory prediction evaluation. Following the setting in previous works [28, 1, 14, 47, 32], we downsample the videos to 2.5 fps, extract person trajectories using the code released in [28], and let the models observe 3.2 seconds (8 frames) of the controlled agent before outputting trajectory coordinates in pixel space. Since the lengths of the ground truth future trajectories differ, each model predicts the maximum length at test time, but we evaluate the predictions using the actual length of each true trajectory.
Baseline methods. We compare our method with two simple baselines and three recent methods with released source code, including a recent model for multi-future prediction and the state-of-the-art model for single-future prediction. Linear is a single-layer model that predicts the next coordinates using a linear regressor based on the previous input point. LSTM is a simple LSTM [16] encoder-decoder model with coordinate inputs only. Social-LSTM [1]: we use the open-source implementation from https://github.com/agrimgupta92/sgan/. Next [28] is the state-of-the-art method for single-future trajectory prediction on the VIRAT/ActEV dataset; we train the Next model without the activity labels for a fair comparison, using the code from https://github.com/google/next-prediction/. Social-GAN [14] is a recent multi-future trajectory prediction model trained using a Minimum-over-N (MoN) loss; we train the two model variants (called PV and V) detailed in the paper, using the code from [14].
All models are trained on real videos (from VIRAT/ActEV – see Section 5.3 for details) and tested on our synthetic videos (with CARLA-generated pixels and annotator-generated trajectories). Most models just use trajectory data as input, except for our model (which uses trajectory and semantic segmentation) and Next (which uses trajectory, bounding box, semantic segmentation, and RGB frames).
Implementation Details. We use a ConvLSTM [56] cell for both the encoder and decoder. The embedding size is set to 32, and the hidden sizes for the encoder and decoder are both 256. The scene semantic segmentation features are extracted from the Deeplab model [7], pretrained on the ADE20k [63] dataset. We use the Adadelta optimizer [59] with an initial learning rate of 0.3 and weight decay of 0.001. Other hyperparameters for the baselines are the same as the ones in [14, 28]. We evaluate the top $K = 20$ predictions for multi-future trajectories. For the models that only output a single trajectory, including Linear, LSTM, Social-LSTM, and Next, we duplicate the output $K$ times before evaluating. For Social-GAN, we use $K$ different random noise inputs to get $K$ predictions. For our model, we use diversity beam search [24, 36] as described in Section 3.5.
Quantitative Results. Table 1 lists the multi-future evaluation results, where we divide the evaluation according to the viewing angle of the camera: 45-degree vs. top-down view. We repeat all experiments (except “Linear”) 5 times with random initialization to produce the mean and standard deviation values. As we see, our model outperforms the baselines on all metrics, and it performs significantly better on the minADE metric, which suggests better prediction quality over all time instants. Notably, our model outperforms Social-GAN by a large margin of at least 8 points on all metrics.
Qualitative analysis. We visualize some outputs of the top 4 methods in Fig. 4. In each image, the yellow trajectories are the history of each controlled agent (derived from real video data) and the green trajectories are the ground truth future trajectories from human annotators. The predicted trajectories are shown as yellow-orange heatmaps for multi-future prediction methods, and as red lines for single-future prediction methods. As we see, our model generally puts probability mass where there is data, and does not “waste” probability mass where there is no data.
Error analysis. We show some typical errors our model makes in Fig. 5. The first image shows our model misses the correct direction, perhaps due to lack of diversity in our sampling procedure. The second image shows our model sometimes predicts the person will “go through” the car (diagonal red beam) instead of going around it. This may be addressed by adding more training examples of “going around” obstacles. The third image shows our model predicts the person will go to a moving car. This is due to the lack of modeling of the dynamics of other faraway agents in the scene. The fourth image shows a hard case where the person just exits the vehicle and there is no indication of where they will go next (so our model “backs off” to a sensible “stay nearby” prediction). We leave solutions to these problems to future work.
5.3 Single-Future Prediction on VIRAT/ActEV
Dataset & Setups. NIST released VIRAT/ActEV [2] for activity detection research in streaming videos in 2018. This dataset is a new version of the VIRAT [34] dataset, with more videos and annotations. The length of videos with publicly available annotations is about 4.5 hours. Following [28], we use the official training set for training and the official validation set for testing. Other setups are the same as in Section 5.2, except we use the single-future evaluation metrics.
Quantitative Results. Table 2 (first column) shows the evaluation results. As we see, our model achieves state-of-the-art performance. The improvement is especially large on the Final Displacement Error (FDE) metric, which we attribute to the coarse location decoder that helps regulate the model's long-term predictions. The gain shows that our model does well at both single-future prediction (on real data) and multi-future prediction (on our quasi-synthetic data).
Generalizing from simulation to the real world. As described in Section 4, we generate simulation data by reconstructing from real-world videos. To verify the quality of the reconstructed data, and the efficacy of learning from simulation videos, we train all the models on the simulation videos derived from the real data. We then evaluate on the real test set of VIRAT/ActEV. As we see from the right column in Table 2, all models do worse in this scenario, due to the difference between synthetic and real data.
There are two sources of error. First, the synthetic trajectory data contain only about 60% of the real trajectory data, due to difficulties reconstructing all of the real data in the simulator. Second, the synthetic images are not photo-realistic, so methods (such as Next [28]) that rely on RGB input suffer the most, since they have never been trained on “real pixels”. Our method, which uses trajectories plus high-level semantic segmentations (which transfer from synthetic to real more easily), suffers the smallest performance drop, showing its robustness to domain shift. See Table 1 for a comparison of input sources between methods.
Table 2: Single-future prediction on VIRAT/ActEV (ADE / FDE in pixels; lower is better), comparing models trained on real videos vs. on our simulation videos.

| Method | Trained on Real | Trained on Sim. |
| --- | --- | --- |
| Linear | 32.19 / 60.92 | 48.65 / 90.84 |
| LSTM | 23.98 / 44.97 | 28.45 / 53.01 |
| Social-LSTM [1] | 23.10 / 44.27 | 26.72 / 51.26 |
| Social-GAN (V) [14] | 30.40 / 61.93 | 36.74 / 73.22 |
| Social-GAN (PV) [14] | 30.42 / 60.70 | 36.48 / 72.72 |
| Next [28] | 19.78 / 42.43 | 27.38 / 62.11 |
| Ours | 18.51 / 35.84 | 22.94 / 43.35 |
5.4 Ablation Experiments
We test various ablations of our model on both the single-future and multi-future trajectory prediction tasks to substantiate our design decisions. Results are shown in Table 3, where the ADE/FDE metrics are shown in the “Single-Future” column and the minADE_{20}/minFDE_{20} metrics (averaged across all views) in the “Multi-Future” column. We verify three of our key designs by leaving each module out of the full model.
(1) Spatial graph: Our model is built on top of a spatial 2D graph that uses graph attention to model the scene features. We train the model without the spatial graph, and the performance drops on both tasks. (2) Fine location decoder: We test our model without the fine location decoder, using only the grid center as the coordinate output. The significant performance drops on both tasks verify the efficacy of this new module. (3) Multi-scale grid: As mentioned in Section 3, we utilize two grid scales, (36 × 18) and (18 × 9), in training. Performance is slightly worse if we only use the fine scale (36 × 18).
Table 3: Ablations (Single-Future: ADE / FDE; Multi-Future: minADE_20 / minFDE_20 averaged across views; lower is better).

| Method | Single-Future | Multi-Future |
| --- | --- | --- |
| Our full model | 18.51 / 35.84 | 166.1 / 329.5 |
| No spatial graph | 28.68 / 49.87 | 184.5 / 363.2 |
| No fine location decoder | 53.62 / 83.57 | 232.1 / 468.6 |
| No multi-scale grid | 21.09 / 38.45 | 171.0 / 344.4 |
6 Conclusion
In this paper, we have introduced the Forking Paths dataset, and the Multiverse model for multi-future forecasting. Our study is the first to provide a quantitative benchmark and evaluation methodology for multi-future trajectory prediction, using human annotators to create a variety of trajectory continuations for the same past. Our model utilizes multi-scale location decoders with a graph attention model to predict multiple future locations. We have shown that our method achieves state-of-the-art performance on two challenging benchmarks: the large-scale real video dataset VIRAT/ActEV and our proposed multi-future trajectory dataset. We believe our dataset, together with our models, will facilitate future research and applications in multi-future prediction.
References
 [1] (2016) Social lstm: human trajectory prediction in crowded spaces. In CVPR, Cited by: §1, §2, §4, §5.1, §5.2, §5.2, Table 1, Table 2.
 [2] (2018) TRECVID 2018: benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In TRECVID, Cited by: §1, §1, §1, §4, §5.3, §5.
 [3] (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §1, §2.
 [4] (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §2, §4.
 [5] (2019) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449. Cited by: §1, §1, §2, §3.3, §4, §4, §5.1, §5.1.
 [6] (2019) Argoverse: 3d tracking and forecasting with rich maps. In CVPR, Cited by: §2.
 [7] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §3.1, §5.2.
 [8] (2018) Embodied question answering. In CVPRW, Cited by: §2.
 [9] (2017) Procedural generation of videos to train deep action recognition networks. In CVPR, pp. 2594–2604. Cited by: §2.
 [10] (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: §1, §2, §4.
 [11] (2016) Virtual worlds as proxy for multiobject tracking analysis. In CVPR, Cited by: §2, §4.
 [12] (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §4.
 [13] (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.
 [14] (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In CVPR, Cited by: §1, §2, §4, §4, §5.1, §5.1, §5.2, §5.2, §5.2, Table 1, Table 2.
 [15] (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §2.
 [16] (1997) Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §5.2.
 [17] (2019) Rules of the road: predicting driving behavior with a convolutional model of semantic interactions. In CVPR, Cited by: §2, §5.1.
 [18] (1960) A new approach to linear filtering and prediction problems. Trans. ASME, D 82, pp. 35–44. Cited by: §1.
 [19] (2012) Activity forecasting. In ECCV, Cited by: §1, §2.
 [20] (2014) Context-based pedestrian path prediction. In ECCV, Cited by: §2.
 [21] (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, Cited by: §3.1.
 [22] (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In CVPR, Cited by: §1, §2, §2, §5.1.
 [23] (2007) Crowds by example. In Computer Graphics Forum, pp. 655–664. Cited by: §1, §4.
 [24] (2016) A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562. Cited by: §3.5, §5.2.
 [25] (2019) Which way are you going? imitative decision learning for path forecasting in dynamic scenes. In CVPR, Cited by: §1, §1, §2, §4.
 [26] (2017) An event reconstruction tool for conflict monitoring using social media. In AAAI, Cited by: §2.
 [27] (2018) Focal visual-text attention for visual question answering. In CVPR, Cited by: §2.
 [28] (2019) Peeking into the future: predicting future person activities and locations in videos. In CVPR, Cited by: §1, §2, §4, §5.1, §5.2, §5.2, §5.2, §5.3, §5.3, Table 1, Table 2.
 [29] (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §3.1.
 [30] (2010) People tracking with human motion predictions from social forces. In ICRA, Cited by: §1.
 [31] (2017) Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, Cited by: §2.
 [32] (2019) Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In CVPR, Cited by: §1, §2, §4, §4, §5.2.
 [33] (2018) Scene-LSTM: a model for human trajectory prediction. arXiv preprint arXiv:1808.04018. Cited by: §2.
 [34] (2011) A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, Cited by: §1, §1, §1, §4, §5.3, §5.
 [35] (2012) Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, Cited by: §4.
 [36] (2018) Neural nearest neighbors networks. In NeurIPS, Cited by: §5.2.
 [37] (2017) UnrealCV: virtual worlds for computer vision. In ACM Multimedia, Cited by: §2.
 [38] (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §3.4.
 [39] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1, §3.3, §3.4.
 [40] (2018) R2P2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, Cited by: §1, §2, §4, §5.1.
 [41] (2017) First-person activity forecasting with online inverse reinforcement learning. In ICCV, Cited by: §1.
 [42] (2019) PRECOG: prediction conditioned on goals in visual multiagent settings. arXiv preprint arXiv:1905.01296. Cited by: §2, §4, §5.1, §5.1.
 [43] (2016) Playing for data: ground truth from computer games. In ECCV, Cited by: §2.
 [44] (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In ECCV, Cited by: §4.
 [45] (2016) The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, Cited by: §2, §4.
 [46] (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In ICCV, Cited by: §1.
 [47] (2018) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482. Cited by: §2, §4, §5.1, §5.2.
 [48] (2018) CAR-Net: clairvoyant attentive recurrent network. In ECCV, Cited by: §2.
 [49] (2018) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635. Cited by: §2.
 [50] (2019) Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641. Cited by: §2.
 [51] (2019) Multiple futures prediction. arXiv preprint arXiv:1911.00997. Cited by: §1, §2.
 [52] (2019) Analyzing the variety loss in the context of probabilistic trajectory prediction. arXiv preprint arXiv:1907.10178. Cited by: §1, §2, §4, §4.
 [53] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3.2.
 [54] (2019) Eidetic 3D LSTM: a model for video prediction and beyond. In ICLR, Cited by: §3.1.
 [55] (2019) Revisiting EmbodiedQA: a simple baseline and beyond. arXiv preprint arXiv:1904.04166. Cited by: §2.
 [56] (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS, Cited by: §1, §3.1, §5.2.
 [57] (2018) SS-LSTM: a hierarchical LSTM model for pedestrian trajectory prediction. In WACV, Cited by: §2.
 [58] (2018) Future person localization in first-person videos. In CVPR, Cited by: §2.
 [59] (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §5.2.
 [60] (2019) SR-LSTM: state refinement for LSTM towards pedestrian trajectory prediction. In CVPR, Cited by: §2, §4.
 [61] (2015) A fast 3d reconstruction system with a low-cost camera accessory. Scientific Reports 5, pp. 10909. Cited by: §2.
 [62] (2019) Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, Cited by: §2, §4.
 [63] (2017) Scene parsing through ADE20K dataset. In CVPR, Cited by: §3.1, §5.2.
 [64] (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, pp. 3357–3364. Cited by: §2.