# Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View with a Reachability Prior

In this paper, we investigate the problem of anticipating future dynamics, particularly the future location of other vehicles and pedestrians, in the view of a moving vehicle. We approach two fundamental challenges: (1) the partial visibility due to the egocentric view with a single RGB camera and considerable field-of-view change due to the egomotion of the vehicle; (2) the multimodality of the distribution of future states. In contrast to many previous works, we do not assume structural knowledge from maps. Instead, we estimate a reachability prior for certain classes of objects from the semantic map of the present image and propagate it into the future using the planned egomotion. Experiments show that the reachability prior combined with multi-hypotheses learning improves multimodal prediction of the future location of tracked objects and, for the first time, the emergence of new objects. We also demonstrate promising zero-shot transfer to unseen datasets. Source code is available at https://github.com/lmb-freiburg/FLN-EPN-RPN.

## Code Repositories

### FLN-EPN-RPN

This repository contains the source code of the CVPR 2020 paper: "Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View with a Reachability Prior"

## 1 Introduction

Figure 1 shows the view of a driver approaching pedestrians who are crossing the street. To safely control the car, the driver must anticipate where these pedestrians will be in the next few seconds. Will the last pedestrian (in blue) have completely crossed the street when I arrive or must I slow down more? Will the pedestrian on the sidewalk (in orange) continue on the sidewalk or will they also cross the street?

This important task comes with many challenges. First of all, the future is not fully predictable. There are typically multiple possible outcomes, some of them being more likely than others. The controller of a car must be aware of these multiple possibilities and their likelihoods. If a car crashes into a pedestrian who predictably crosses the street, this will be considered a severe failure, whereas extremely unlikely behaviour, such as the pedestrian in purple turning around and crossing the street in the opposite direction, must be ignored to enable efficient control. The approach we propose predicts two likely modes for this pedestrian: continuing left or right on the sidewalk.

Ideally, this task can be accomplished directly in the sensor data without requiring privileged information such as a third-person view or a street map that marks all lanes, sidewalks, crossings, etc. Independence of such information helps the approach generalize to situations not covered by maps or extra sensors, e.g., due to changes not yet captured in the map or GPS failures. However, making predictions in egocentric views suffers from partial visibility: we only see the context of the environment in the present view, while other relevant parts of the environment are occluded and only become visible as the car moves. Figure 1 shows that the effect of the egomotion is substantial even in this example with relatively slow motion.

In this paper, we approach these two challenges in combination: multimodality of the future and egocentric vision. For the multimodality, we build upon the recent work by Makansi et al. [36], who proposed a technique to overcome mode collapse and stability issues of mixture density networks. However, the work of Makansi et al. assumes a static bird’s-eye view of the scene. In order to carry the technical concept over to the egocentric view, we introduce an intermediate prediction which improves the quality of the multimodal distribution: a reachability prior. The reachability prior is learned from a large set of egocentric views and tells where objects of a certain class are likely to be in the image based on the image’s semantic segmentation; see Figure 2 top. This prior focuses the attention of the prediction based on the environment. Even more importantly, using the egomotion of the vehicle, we can propagate this prior into the future much more easily than a whole image or a semantic map. The reachability prior is a condensation of the environment, which contains the semantic context most relevant to the task.

The proposed framework of estimating and propagating a multimodal reachability prior is not only beneficial for future localization of a particular object (Figure 2 left), but it also enables the task of emergence prediction (Figure 2 right). For safe operation, it is not sufficient to reason about the future location of the observed objects; potentially emerging objects in the scene must also be anticipated if the probability of their emergence exceeds a certain threshold. For example, passing by a school requires extra care since the probability that a child jumps onto the street is higher. Autonomous systems should behave differently near a school exit than on a highway. Predicting the emergence of new objects has not yet drawn much attention in the literature.

The three tasks in Figure 2 differ in their input conditions: the reachability prior is conditioned only on the semantic segmentation of the environment and the class of interest. It is independent of a particular object. Future localization includes the additional focus on an object of interest and its past trajectory. These conditions narrow down the space of solutions and make the output distribution much more peaked. Emergence prediction is a reduced case of the reachability prior, where new objects can only emerge from unobserved areas of the scene.

In this paper, (1) we propose a future localization framework in egocentric view by transferring the work by Makansi et al. [36] from bird’s-eye view to egocentric observations, where multimodality is even more difficult to capture. Thus, (2) we propose to compute a reachability prior as an intermediate result, which serves as attention to prevent forgetting rare modes, and which can be used to efficiently propagate scene priors into the future taking into account the egomotion. For the first time, (3) we formulate the problem of object emergence prediction for egocentric view with multimodality. (4) We evaluate our approach and the existing methods on the large public nuScenes dataset [9], where the proposed approach shows clear improvements over the state of the art. In contrast to most previous works, the proposed approach is not restricted to a single object category. (5) We include heterogeneous classes like pedestrians, cars, buses, trucks and tricycles. (6) The prediction horizon was tripled from 1 second to 3 seconds into the future compared to existing methods. Moreover, (7) we show that the approach allows zero-shot transfer to unseen and noisy datasets (Waymo [50] and FIT).

## 2 Related Work

Bird’s-Eye View Future Localization. Predicting the future locations or trajectories of objects is a well-studied problem. It includes techniques like the Kalman filter [26][39], and Gaussian processes [42, 57, 43, 56]. These techniques are limited to low-dimensional data, which excludes taking into account the semantic context provided by an image. Convolutional networks allow processing such inputs and using them for future localization. LSTMs have been very popular due to their ability to process time series. Initial works exploited LSTMs for trajectories to model the interaction between objects [2, 59, 65], for scenes to exploit the semantics [4, 38], and LSTMs with attention to focus on the relevant semantics [46].

Another line of work tackles the multimodal nature of the future by sampling through cVAEs [30], GANs [3, 21, 45, 66, 27], and latent decision distributions [32]. Choi et al. [12] model future locations as a nonparametric distribution, which can potentially result in multimodality but often collapses to a single mode. Given the instabilities of Mixture Density Networks (MDNs) in unrestricted environments, some works restrict the solution space to a set of predefined maneuvers or semantic areas [15, 24]. Makansi et al. [36] proposed a method to learn mixture densities in unrestricted environments. Their approach first predicts diverse samples and then fits a mixture model on these samples. All these methods have been applied to static scenes recorded from a bird’s-eye view, i.e., with full local observability and no egomotion. We build on the technique from Makansi et al. [36] to estimate multimodal distributions in egocentric views.

Egocentric Future Localization. The egocentric camera view is the typical way of observing the scene in autonomous driving. It introduces new challenges due to the egomotion and the narrow field of view. Multiple works have addressed these challenges by projecting the view into bird’s-eye view using 3D sensors [14, 17, 16, 47, 35, 44, 13]. This is a viable approach, but it suffers from sparse measurements in the case of LIDAR and erroneous measurements in the case of stereo sensors.

Alternative approaches try to work directly in the egocentric view. Yagi et al. [62] utilized the pose, locations, scales and past egomotion for predicting the future trajectory of a person. TraPHic [10] exploits the interaction between nearby heterogeneous objects. DTP [48] and STED [49] use encoder-decoder schemes using optical flow and past locations and scales of the objects. Yao et al. [63] added the planned egomotion to further improve the prediction. For autonomous driving, knowing the planned motion is a reasonable assumption [20], and we also make use of this assumption. All these models work with a deterministic model and fail to account for the multimodality and uncertainty of the future. The effect of this is demonstrated by our experiments.

The work most related to our approach, in the sense that it works on egocentric views and predicts multiple modes, is the Bayesian framework by Bhattacharyya et al. [6]. It uses Bayesian RNNs to sample multiple futures with uncertainties. Additionally, they learn the planned egomotion and fuse it into the main future prediction framework. NEMO [37] extends this approach by learning a multimodal distribution for the planned egomotion, leading to better accuracy. Both methods need multiple runs to sample different futures and suffer from mode collapse, i.e., they tend to predict only the most dominant mode, as demonstrated by our experiments.

Egocentric Emergence Prediction. To reinforce safety in autonomous driving, it is important to not only predict the future of the observed objects but also predict where new objects can emerge. Predicting the whereabouts of an emerging object entails predicting the future environment itself. Predicting the future environment was addressed by predicting future frames [55, 52, 51, 1, 28, 31, 61] and future semantic segmentation [34, 25, 54, 33, 8, 7]. These methods can only hallucinate new objects in the scene in a photorealistic way, but none of them explicitly predicts the structure where new objects can actually emerge. Vondrick et al. [53] consider a higher-level task and predict the probability of a new object appearing in an egocentric view. However, they only predict "what" object will appear but not "where". Fan et al. [19] suggested transferring current object detection features to the future. This way they anticipate both observed and new objects.

Reachability Prior Prediction. The environment poses constraints for objects during navigation. While some recent works use an LSTM to learn environment constraints from images [38, 60], others [4, 12] choose a more explicit approach by dividing the environment into meaningful grids to learn the grid-grid, object-object and object-grid interactions. Soft attention mechanisms are also commonly used to focus on relevant features of the environment [45, 46]. While these methods reason about static environment constraints within the proposed model, we propose to separate this task and learn a scene prior before the future localization in dynamic scenes. Lee et al. [29] proposed a similar module, where a GAN per object class generates multiple locations to place an object photorealistically.

## 3 Multimodal Egocentric Future Prediction

Figure 3 shows the pipeline of our framework for the future localization task, consisting of three main modules: (1) the reachability prior network (RPN), which learns a prior of where members of an object class could be located in a semantic map, (2) the reachability transfer network (RTN), which transfers the reachability prior from the current to a future time step taking into account the planned egomotion, and (3) the future localization network (FLN), which is conditioned on the past and current observations of an object and learns to predict a multimodal distribution of its future location based on the general solution from the RTN.

Emergence prediction shares the same first two modules and differs only in the third network where we drop the condition on the past object trajectory. We refer to it as emergence prediction network (EPN). The aim of EPN is to learn a multimodal distribution of where objects of a class emerge in the future.

### 3.1 Reachability Prior Network (RPN)

Given an observed scene from an egocentric view, the reachability prior network predicts where an object of a certain class can be at the same time step in the form of bounding box hypotheses. Let $b^{rpn}_{i,t} = (x_{i,t}, y_{i,t}, w_{i,t}, h_{i,t})$ for $i = 1, \dots, N$ be the set of bounding box hypotheses predicted by our RPN at time step $t$, where $(x_{i,t}, y_{i,t})$ represents the center coordinates and $(w_{i,t}, h_{i,t})$ the width and height.

Since the reachability prior network should learn the relation between a class of objects (e.g., vehicle) and the scene semantics (e.g., road, sidewalk, and so on), we remove all dynamic objects from the training samples. This is achieved by inpainting [64]. Because inpainting on the semantic map causes fewer artifacts than inpainting in the raw RGB image [5], the reachability prior is based on the semantic map. On the one hand, the semantic map does not show some of the useful details visible in the raw image (e.g., the type of traffic sign or building textures). On the other hand, it is important that the inpainting does not introduce strong artifacts. These would be picked up during training and would bias the result (similar to keeping the original objects in the image).

For each image at time $t$, we compute its semantic segmentation using DeepLabv3+ [11] and derive its static semantic segmentation by inpainting all dynamic objects. This yields the training data for the reachability prior network: the static semantic segmentation is the input to the network, and the removed objects of the class of interest are ground-truth samples for the reachability. The network yields multiple hypotheses as output and is trained using the EWTA scheme [36] with the loss

$$L_{RPN} = l\left(b^{rpn}_{i,t}, \hat{b}_t\right), \tag{1}$$

where $\hat{b}_t$ denotes a ground-truth bounding box of one instance of the class of interest (e.g., vehicle or pedestrian) in the image and $l(\cdot)$ denotes the norm. EWTA applies this loss to the hypotheses in a hierarchical way. It first penalizes all $N$ hypotheses. After convergence, it halves the number of penalized hypotheses and penalizes only the best ones. This halving is repeated until only the best hypothesis is penalized; see Makansi et al. [36] for details. A sample output of the reachability prior network for a car is shown in Figure 4 (top).
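The hierarchical penalty described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Euclidean distance stands in for the unspecified norm $l(\cdot)$, and boxes are assumed to be `(x, y, w, h)` tuples.

```python
import math


def box_distance(box_a, box_b):
    """Euclidean distance between two boxes (x, y, w, h); the choice of norm is an assumption."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(box_a, box_b)))


def ewta_loss(hypotheses, ground_truth, top_k):
    """Sum the distances of the top_k hypotheses closest to the ground truth."""
    if not 1 <= top_k <= len(hypotheses):
        raise ValueError("top_k must lie between 1 and the number of hypotheses")
    distances = sorted(box_distance(h, ground_truth) for h in hypotheses)
    return sum(distances[:top_k])


def ewta_schedule(num_hypotheses):
    """Number of penalized hypotheses per training stage: N, N/2, ..., 1."""
    stages = []
    k = num_hypotheses
    while k > 1:
        stages.append(k)
        k //= 2
    stages.append(1)
    return stages
```

In such a setup, training would run one stage per entry of `ewta_schedule(N)` and move to the next entry once the current stage has converged, so that the final stage penalizes only the single best hypothesis.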

### 3.2 Reachability Transfer Network (RTN)

When running the RPN on the semantic segmentation at time $t$, we obtain a solution for the same time step $t$. However, at test time, we require this prior in the unobserved future. Thus, we train a network to transfer the reachability at time $t$ to time $t+\Delta t$, where $\Delta t$ is the fixed prediction horizon. The transfer is conditioned on the relative pairwise transformation between the poses at times $t$ and $t+\Delta t$ (referred to as planned egomotion), which is represented as a transformation vector (a 3D translation vector and a rotation quaternion). This transfer network can be learned with a self-supervised loss from a time series:

$$L_{RTN} = \sum_{i=1}^{N} \left| b^{rtn}_{i,t+\Delta t} - b^{rpn}_{i,t+\Delta t} \right|, \tag{2}$$

where $b^{rtn}_{i,t+\Delta t}$ is the output of the RTN, computed from the image and the static semantic segmentation at time $t$, and $b^{rpn}_{i,t+\Delta t}$ is the RPN output at time $t+\Delta t$. Figure 4 (middle) shows the reachability prior (top) transferred to the future. Given that the egomotion is a forward motion (red arrow) and given the visual cues for an upcoming traffic light and a right turn, the RTN anticipates that more cars can emerge on the street and transforms some of the RPN hypotheses to cover these new locations.
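To make the two ingredients of this section concrete, the following sketch builds a 7-dimensional egomotion vector from two poses and evaluates the self-supervised loss of Eq. (2). Several conventions are assumptions not stated in the text: poses map the vehicle frame to the world frame, quaternions are unit quaternions ordered `(w, x, y, z)`, the relative motion is expressed in the vehicle frame at time $t$, and $|\cdot|$ is read as the L1 distance. All function names are hypothetical.

```python
def quat_conjugate(q):
    """Conjugate of a quaternion (w, x, y, z); the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_multiply(a, b):
    """Hamilton product of two quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q."""
    _, x, y, z = quat_multiply(quat_multiply(q, (0.0, *v)), quat_conjugate(q))
    return (x, y, z)


def planned_egomotion(pose_now, pose_future):
    """Relative transform as (tx, ty, tz, qw, qx, qy, qz) in the frame of pose_now."""
    (p0, q0), (p1, q1) = pose_now, pose_future
    delta_world = tuple(b - a for a, b in zip(p0, p1))
    translation = rotate(quat_conjugate(q0), delta_world)
    rotation = quat_multiply(quat_conjugate(q0), q1)
    return translation + rotation


def rtn_loss(rtn_boxes, rpn_future_boxes):
    """L1 distance between transferred boxes and RPN boxes at the future frame, summed over hypotheses."""
    if len(rtn_boxes) != len(rpn_future_boxes):
        raise ValueError("both sets must contain the same number of hypotheses")
    return sum(
        abs(p - q)
        for pred, target in zip(rtn_boxes, rpn_future_boxes)
        for p, q in zip(pred, target)
    )
```

The loss needs no manual labels: the target boxes come from running the RPN on the frame recorded $\Delta t$ later, which is what makes the training self-supervised.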

### 3.3 Future Localization Network (FLN)

Given an object that is observed for a set of frames over an observation period ending at time $t$, the FLN predicts the distribution of bounding boxes in the future frame $t+\Delta t$. Figure 3c shows the input to this network: the past images, the past semantic maps, the past masks of the object of interest, the planned egomotion, and the reachability prior in the future frame $t+\Delta t$. The object masks are provided as images, where pixels inside the object bounding box are set to the object class and all other pixels to a background value.
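The object mask input can be sketched as follows. This is an illustration under assumptions: the box is given as `(cx, cy, w, h)` in pixels, the class is encoded as an integer id, and the background value defaults to 0, which the text above does not confirm. The function name is hypothetical.

```python
import math


def object_mask(height, width, box, class_id, background=0):
    """Return a height x width grid that holds class_id inside the box and background elsewhere."""
    cx, cy, w, h = box
    # Clip the box to the image so partially visible objects are handled.
    x0 = max(0, math.floor(cx - w / 2))
    x1 = min(width, math.ceil(cx + w / 2))
    y0 = max(0, math.floor(cy - h / 2))
    y1 = min(height, math.ceil(cy + h / 2))
    mask = [[background] * width for _ in range(height)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = class_id
    return mask
```

One mask of this kind per observed frame would then be stacked with the corresponding image and semantic map.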

We use the sampling-fitting framework from Makansi et al. [36] to predict a Gaussian mixture for the future bounding box of the object of interest. The sampling network generates multiple hypotheses and is trained with EWTA, just like the RPN. The additional fitting network estimates the parameters $(\pi_k, \mu_k, \sigma_k)$ of a Gaussian mixture model with $k = 1, \dots, K$ from these hypotheses, similar to the expectation-maximization algorithm but via a network; see Makansi et al. [36] for details. An example of the FLN prediction is shown in Figure 4 (bottom). The fitting network is trained with the negative log-likelihood (NLL) loss

$$L_{nll} = -\log\left[\sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k, \sigma_k^2)\right]. \tag{3}$$
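As a worked illustration of Eq. (3), the sketch below evaluates the mixture NLL for one ground-truth box. It assumes axis-aligned (diagonal) Gaussians, strictly positive mixture weights, and variances passed directly as $\sigma_k^2$; the log-sum-exp form is a standard numerical safeguard rather than something taken from the paper, and the function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of an axis-aligned Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variance)
    )


def mixture_nll(target, weights, means, variances):
    """Negative log-likelihood of target under a Gaussian mixture."""
    log_terms = [
        math.log(w) + gaussian_log_pdf(target, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating to avoid underflow.
    peak = max(log_terms)
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))
```

A ground-truth box that falls close to the mean of any component yields a low NLL, which is why the measure rewards distributions that cover the correct mode without relying on an oracle selection.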

### 3.4 Emergence Prediction Network (EPN)

Rather than predicting the future of a seen object, the emergence prediction network predicts where an unseen object can emerge in the scene. The EPN is very similar to the FLN shown in Figure 3c. The only difference is that the object masks are missing in the input, since the task is not conditioned on a particular object but predicts the general distribution of objects emerging.

The network is trained on scenes where an object is visible in a later image (ground truth), but not in the current image. As for the future localization network, we train the sampling network with EWTA and the fitting network with NLL.
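The selection of training targets for emergence prediction can be sketched as a simple set difference. The representation is an assumption: each frame is taken to be a dictionary from a tracking id to a bounding box, and the function name is hypothetical.

```python
def emerging_objects(current_boxes, future_boxes):
    """Return the objects that are visible in the future frame but not in the current one."""
    return {
        track_id: box
        for track_id, box in future_boxes.items()
        if track_id not in current_boxes
    }
```

Each returned box would serve as one ground-truth sample for the current frame; frames for which the result is empty contribute no emergence targets.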

## 4 Experiments

### 4.1 Datasets

Mapillary Vistas [41]. We used the Mapillary Vistas dataset for training the inpainting method from [64] on semantic segmentation and for training our reachability prior network. This dataset contains images recorded in different cities across 6 continents, from different viewpoints, and in different weather conditions. For each image, pixelwise semantic and instance segmentation are provided. The images of this dataset are not temporally ordered, which prevents its usage for training the RTN, FLN, or EPN.

nuScenes [9]. nuScenes is a very large autonomous driving dataset consisting of 1000 scenes of 20 seconds each. We used it for training and evaluating the proposed framework. We did not re-train the reachability prior network on this dataset, in order to test the generalization of the reachability prior network across different datasets. The nuScenes dataset provides accurate bounding box tracking for different types of traffic objects and the egomotion of the observer vehicle. We used the standard training/validation split (700/150 scenes) of the dataset for training/evaluating all experiments.

Waymo Open Dataset [50]. Waymo is the most recent autonomous driving dataset and contains 1000 scenes of 20 seconds each. To show zero-shot transfer of our framework (i.e., without re-training the model), we used the standard 202 testing scenes.

FIT Dataset. We collected 18 scenes from different locations in Europe and relied on MaskRCNN [22] and deepsort [58] to detect and track objects, and DSO [18] to estimate the egomotion. This dataset allows testing the robustness to noisy inputs (without human annotation). We will make these sequences and the annotations publicly available.

### 4.2 Evaluation Metrics

FDE. For evaluating both future localization and emergence prediction, we report the common Final Displacement Error (FDE), which estimates the distance of the centers of two bounding boxes in pixels.

IOU. We report the Intersection Over Union (IOU) metric to evaluate how well two bounding boxes overlap.

The above metrics are designed for single outputs, not distributions. In case of multiple hypotheses, we applied the above metrics between the ground truth and the closest mode to the ground truth (known as Oracle [36, 30]).
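The two single-output metrics and the oracle selection can be sketched as follows. Boxes are assumed to be `(cx, cy, w, h)` in pixels, and "closest" is taken to mean the smallest center distance, which is one possible reading of the oracle selection. The function names are hypothetical.

```python
import math


def fde(box_a, box_b):
    """Distance between the centers of two boxes, in pixels."""
    return math.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1])


def iou(box_a, box_b):
    """Intersection over union of two center-format boxes."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


def oracle_metrics(hypotheses, ground_truth):
    """FDE and IOU of the hypothesis whose center is closest to the ground truth."""
    best = min(hypotheses, key=lambda h: fde(h, ground_truth))
    return fde(best, ground_truth), iou(best, ground_truth)
```

Because the oracle picks the best of several candidates, a method that outputs more diverse hypotheses can only gain under these metrics, which is the bias the NLL metric is meant to offset.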

NLL. To evaluate the accuracy of the multimodal distribution, we compute the negative log-likelihood of the ground-truth samples according to the estimated distribution.

### 4.3 Training Details

We used ResNet-50 [23] as sampling network in all parts of this work. The fitting network consisted of two fully connected layers (each with 500 nodes) with a dropout layer (rate = 0.2) in between. In the FLN, we observed second and predicted seconds into the future. For the EPN, we observed only one frame and predicted second into the future. We used for all sampling networks, and and as the number of mixture components for the FLN and the EPN, respectively. The emergence prediction task requires more modes compared to the future localization task since the distribution has typically more modes in this task.

### 4.4 Baselines

As there is only one other work so far on egocentric multimodal future prediction [6], we compare also to unimodal baselines, which are already more established.

Kalman Filter [26]. This linear filter is commonly used for estimating the future state of a dynamic process through a set of (low-dimensional) observations. It is not expected to be competitive, since it considers only the past trajectory and ignores all other information.

DTP [48]. DTP is a dynamic trajectory predictor for pedestrians based on motion features obtained from optical flow. We used their best performing framework, which predicts the difference to the constant velocity solution.

STED [49]. STED is a spatial-temporal encoder-decoder that models visual features by optical flow and temporal features by the past bounding boxes through GRU encoders. It later fuses the encoders into another GRU decoder to obtain the future bounding boxes.

RNN-ED-XOE [63]. RNN-ED-XOE is an RNN-based encoder-decoder framework which models both temporal and visual features similar to STED. RNN-ED-XOE additionally encodes the future egomotion before fusing all information into a GRU decoder for future bounding boxes.

FLN-Bayesian using [6]. The work by Bhattacharyya et al. [6] is the only multimodal future prediction work for the egocentric scenario in the literature. It uses Bayesian optimization to estimate multiple future hypotheses and their uncertainty. Since they use a different network architecture and data modalities, we port their Bayesian optimization into our framework for a fair comparison rather than comparing the methods directly. We re-trained our FLN with their objective to create samples by dropout during training and testing time as a replacement for the EWTA hypotheses. We used the same number of samples, $N$, as in our standard approach.

All these baselines predict the future trajectory of either pedestrians [48, 49, 6] or vehicles [63]. Thus, we re-trained them on nuScenes [9] to handle both pedestrian and vehicle classes. Moreover, some baselines utilize the future egomotion obtained from ORB-SLAM2 [40] or predicted by their framework, as in [6]. For a fair comparison, we used the egomotion from nuScenes dataset when re-training and testing their models, thus eliminating the effect of different egomotion estimation methods.

FLN w/o reachability. To measure the effect of the reachability prior, we ran this version of our framework without RPN and RTN.

FLN + reachability. Our full framework including all 3 networks: RPN, RTN, FLN.

Since there is so far no comparable work addressing the emergence prediction task, we conduct an ablation study on emergence prediction to analyze the effect of the proposed reachability prior on the accuracy of the prediction.

### 4.5 Egocentric Future Localization

Table 1 shows a quantitative evaluation of our proposed framework against all the baselines listed above. To distinguish test cases that can be solved with simple extrapolation from more difficult cases, we use the performance of the Kalman filter [26]; see also [63]. A test sample for which the Kalman filter [26] has a displacement error larger than the average is counted as challenging. An error of more than twice the average is marked as very challenging. In Table 1, we show the error only for the whole test set (all) and the very challenging subset (hard). More detailed results are in the supplemental material.
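The difficulty split described above can be sketched in a few lines. The label strings and the function name are hypothetical; the thresholds follow the text (above the average error, and above twice the average error).

```python
def split_by_difficulty(kalman_errors):
    """Label each test sample by comparing its Kalman filter error to the average error."""
    if not kalman_errors:
        return []
    average = sum(kalman_errors) / len(kalman_errors)
    labels = []
    for error in kalman_errors:
        if error > 2 * average:
            labels.append("very challenging")
        elif error > average:
            labels.append("challenging")
        else:
            labels.append("easy")
    return labels
```

Under this reading, the "hard" column of Table 1 corresponds to the samples labeled `"very challenging"`.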

As expected, deep learning methods outperform the extrapolation by a Kalman filter on all metrics. Both variants of our framework show a significant improvement over all baselines for the FDE and IOU metrics. When we use FDE or IOU, we use the oracle selection of the hypotheses (i.e., the closest bounding box to the ground truth). Hence, a multimodal method is favored over a unimodal one. Still, such a significant improvement indicates the need for multimodality. To evaluate without the bias introduced by the oracle selection, we also report the negative log-likelihood (NLL).

Both variants of the proposed framework outperform the Bayesian framework on all metrics including the NLL. In fact, the Bayesian baseline is very close to the best unimodal baseline. This indicates its tendency for mode collapse, which we also see qualitatively. The use of the reachability prior is advantageous on all metrics and for all difficulties.

As the networks (ours and all baselines) were trained on nuScenes, the results on Waymo and FIT include a zero-shot transfer to unseen datasets. We obtain the same ranking for unseen datasets as for the test set of nuScenes. This indicates that overfitting to a dataset is not an issue for this task. We recommend having cross-dataset experiments (as we show) also in future works to ensure that this stays true and future improvements in numbers are really due to better models and not just overfitting.

Figure 5 shows some qualitative examples in four challenging scenarios, where there are multiple options for the future location. (1) A pedestrian starts crossing the street and their future location is not deterministic due to different speed estimates. (2) A pedestrian enters the scene from the left and will either continue walking to cross the street or will stop at the traffic light. (3) A tricycle driving out of a parking area will either continue driving to cross the road or will stop to give way to our vehicle. (4) A car entering the scene from the left will either slow down to yield or speed up to pass.

For all scenarios, we observe that the reachability prior (shown as set of colored bounding boxes) defines the general relation between the object of interest and the static elements of the scene. Similar to the observation from our quantitative evaluation, the Bayesian baseline predicts a single future with some uncertainty (unimodal distribution). Our framework without exploiting the reachability prior (FLN w/o RPN) tends to predict more diverse futures but still lacks predicting many of the modes. The reachability prior helps the approach to cover more of the possible future locations.

We highly recommend watching the supplementary video at https://youtu.be/jLqymg0VQu4, which gives a much more detailed qualitative impression of the results, as it allows the observer to get a much better feeling for the situation than the static pictures in the paper.

### 4.6 Egocentric Emergence Prediction

Table 2 shows the ablation study on the importance of using the reachability prior for the task of predicting object emergence in a scene. Similar to future localization, exploiting the reachability prior yields a higher accuracy and captures more of the modes. Two qualitative examples for this task are shown in Figure 6. Examples include scenarios (1) where a vehicle could emerge in the scene from the left street, could pass by, or could be oncoming; (2) where a car could emerge from the left or from the right, could pass by, or could be oncoming. The EPN learns not only the location in the image, but also meaningful scales. For instance, the anticipated passing-by cars have a larger scale than the expected oncoming cars. The distributions for the two examples are different since more modes for emerging vehicles are expected in the second example (e.g., emerging from the right side). Notably, the reachability prior solution is different from the emergence solution: close-by cars in front of the ego vehicle are part of the reachability prior solution but are ruled out in the emergence solution, since a car cannot suddenly appear there. More results are provided in the supplemental material.

## 5 Conclusions

In this work, we introduced a method for predicting future locations of traffic objects in egocentric views without predefined assumptions on the scene and by taking into account the multimodality of the future. We showed that a reachability prior and multi-hypotheses learning help overcome mode collapse. We also introduced a new task relevant for autonomous driving: predicting locations of suddenly emerging objects. Overall, we obtained quite good results even in difficult scenarios, but careful qualitative inspection of many results still shows a lot of potential for improvement on future prediction.

## 6 Acknowledgments

This work was funded in part by IMRA Europe S.A.S. and the German Ministry for Research and Education (BMBF) via the project Deep-PTL.

## References

• [1] S. Aigner and M. Körner (2018) FutureGAN: anticipating the future frames of video sequences using spatio-temporal 3d convolutions in progressively growing gans. arXiv preprint arXiv:1810.01325.
• [2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social lstm: human trajectory prediction in crowded spaces. In CVPR.
• [3] J. Amirian, J. Hayet, and J. Pettre (2019) Social ways: learning multi-modal distributions of pedestrian trajectories with gans. In CVPR Workshops.
• [4] F. Bartoli, G. Lisanti, L. Ballan, and A. D. Bimbo (2017) Context-aware trajectory prediction. arXiv preprint arXiv:1705.02503.
• [5] L. Berlincioni, F. Becattini, L. Galteri, L. Seidenari, and A. D. Bimbo (2018) Semantic road layout understanding by generative adversarial inpainting. arXiv preprint arXiv:1805.11746.
• [6] A. Bhattacharyya, M. Fritz, and B. Schiele (2018) Long-term on-board prediction of people in traffic scenes under uncertainty. In CVPR.
• [7] A. Bhattacharyya, M. Fritz, and B. Schiele (2018) Bayesian prediction of future street scenes through importance sampling based optimization. arXiv preprint arXiv:1806.06939.
• [8] A. Bhattacharyya, M. Fritz, and B. Schiele (2019) Bayesian prediction of future street scenes using synthetic likelihoods. In ICLR.
• [9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
• [10] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha (2019) TraPHic: trajectory prediction in dense and heterogeneous traffic using weighted interactions. In CVPR.
• [11] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.
• [12] C. Choi and B. Dariush (2019) Looking to relations for future trajectory forecast. In ICCV.
• [13] C. Choi, A. Patil, and S. Malla (2019) DROGON: a causal reasoning framework for future trajectory forecast. arXiv preprint arXiv:1908.00024.
• [14] H. Cui, T. Nguyen, F. Chou, T. Lin, J. Schneider, D. Bradley, and N. Djuric (2019) Deep kinematic models for physically realistic prediction of vehicle trajectories. arXiv preprint arXiv:1908.00219.
• [15] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2018) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1809.10732. Cited by: §2.
• [16] N. Deo, A. Rangesh, and M. M. Trivedi (2018-06) How would surround vehicles move? a unified framework for maneuver classification and motion prediction. T-IV. External Links: Document, ISSN Cited by: §2.
• [17] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider (2018) Short-term motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819. Cited by: §2.
• [18] J. Engel, V. Koltun, and D. Cremers (2016) Direct sparse odometry. arXiv preprint arXiv:1607.02565. Cited by: §4.1.
• [19] C. Fan, J. Lee, and M. S. Ryoo (2018-09) Forecasting hands and objects in future frames. In ECCV, Cited by: §2.
• [20] D. González, J. Pérez, V. Milanés, and F. Nashashibi (2016-04) A review of motion planning techniques for automated vehicles. T-ITS. External Links: Document, ISSN Cited by: §2.
• [21] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In CVPR, Cited by: §2.
• [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017-10) Mask r-cnn. In ICCV, pp. 2980–2988. External Links: Document, ISSN Cited by: §4.1.
• [23] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In CVPR, Cited by: §4.3.
• [24] Y. Hu, W. Zhan, and M. Tomizuka (2018) Probabilistic prediction of vehicle semantic intention and motion. arXiv preprint arXiv:1804.03629. Cited by: §2.
• [25] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan (2017) Predicting scene parsing and motion dynamics in the future. In NIPS, pp. 6915–6924. Cited by: §2.
• [26] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering. Cited by: §2, §4.4, §4.5, Table 1, Table 3, Table 4, Table 5.
• [27] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, S. H. Rezatofighi, and S. Savarese (2019) Social-bigat: multimodal trajectory forecasting using bicycle-gan and graph attention networks. arXiv preprint arXiv:1907.03395. Cited by: §2.
• [28] Y. Kwon and M. Park (2019-06) Predicting future frames using retrospective cycle gan. In CVPR, Cited by: §2.
• [29] D. Lee, S. Liu, J. Gu, M. Liu, M. Yang, and J. Kautz (2018) Context-aware synthesis and placement of object instances. In NIPS, External Links: Link Cited by: §2.
• [30] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In CVPR, pp. 336–345. Cited by: §2, §4.2.
• [31] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2018) Flow-grounded spatial-temporal video prediction from still images. In ECCV, Cited by: §2.
• [32] Y. Li (2019-06) Which way are you going? imitative decision learning for path forecasting in dynamic scenes. In CVPR, Cited by: §2.
• [33] P. Luc, C. Couprie, Y. Lecun, and J. Verbeek (2018) Predicting future instance segmentations by forecasting convolutional features. arXiv preprint arXiv:1803.11496. Cited by: §2.
• [34] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun (2017) Predicting deeper into the future of semantic segmentation. In ICCV, Cited by: §2.
• [35] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha (2019) TrafficPredict: trajectory prediction for heterogeneous traffic-agents. In AAAI, pp. 6120–6127. Cited by: §2.
• [36] O. Makansi, E. Ilg, O. Cicek, and T. Brox (2019-06) Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In CVPR, Cited by: §1, §1, §2, §3.1, §3.3, §4.2, §4.
• [37] S. Malla and C. Choi (2019) NEMO: future object localization using noisy ego priors. arXiv preprint arXiv:1909.08150. Cited by: §2.
• [38] H. Manh and G. Alaghband (2018) Scene-lstm: a model for human trajectory prediction. arXiv preprint arXiv:1808.04018. Cited by: §2, §2.
• [39] P. McCullagh and J.A. Nelder (1989) Generalized linear models, second edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Taylor & Francis. External Links: ISBN 9780412317606, LCCN 99013896, Link Cited by: §2.
• [40] R. Mur-Artal and J. D. Tardós (2017) ORB-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics. External Links: Document Cited by: §4.4.
• [41] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder (2017) The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, External Links: Link Cited by: §4.1.
• [42] A. O’Hagan (1978) Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society: Series B (Methodological) 40 (1), pp. 1–24. Cited by: §2.
• [43] C. E. Rasmussen and C. K. I. Williams (2005) Gaussian processes for machine learning (adaptive computation and machine learning). The MIT Press. External Links: ISBN 026218253X Cited by: §2.
• [44] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. In ICCV, Cited by: §2.
• [45] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese (2019) SoPhie: an attentive gan for predicting paths compliant to social and physical constraints. In CVPR, Cited by: §2, §2.
• [46] A. Sadeghian, F. Legros, M. Voisin, R. Vesel, A. Alahi, and S. Savarese (2018) CAR-net: clairvoyant attentive recurrent network. In ECCV, Cited by: §2, §2.
• [47] S. Srikanth, J. A. Ansari, K. R. R, S. Sharma, K. M. J., and M. K. K (2019) INFER: intermediate representations for future prediction. arXiv preprint arXiv:1903.10641. Cited by: §2.
• [48] O. Styles, A. Ross, and V. Sanchez (2019-06) Forecasting pedestrian trajectory with machine-annotated training data. In IV, External Links: Document, ISSN Cited by: §2, §4.4, §4.4, Table 1, Table 3, Table 4, Table 5.
• [49] O. Styles, T. Guha, and V. Sanchez (2019) Multiple object forecasting: predicting future object locations in diverse environments. arXiv preprint arXiv:1909.11944. Cited by: §2, §4.4, §4.4, Table 1, Table 3, Table 4, Table 5.
• [50] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov (2019) Scalability in perception for autonomous driving: waymo open dataset. External Links: 1912.04838 Cited by: §1, §2, Figure 4, Figure 5, Figure 8, §4.1, Table 1, Table 4.
• [51] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. In ICLR, External Links: Link Cited by: §2.
• [52] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In ICML, External Links: Link Cited by: §2.
• [53] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations from unlabeled video. In CVPR, pp. 98–106. Cited by: §2.
• [54] S. Vora, R. Mahjourian, S. Pirk, and A. Angelova (2018) Future segmentation using 3d structure. arXiv preprint arXiv:1811.11358. Cited by: §2.
• [55] J. Walker, A. Gupta, and M. Hebert (2014) Patch to the future: unsupervised visual prediction. In CVPR, External Links: Document Cited by: §2.
• [56] J. M. Wang, D. J. Fleet, and A. Hertzmann (2008) Gaussian process dynamical models for human motion. TPAMI 30 (2), pp. 283–298. External Links: Document Cited by: §2.
• [57] C. K. I. Williams (1997) Prediction with gaussian processes: from linear regression to linear prediction and beyond. In Learning and Inference in Graphical Models, pp. 599–621. Cited by: §2.
• [58] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. arXiv preprint arXiv:1703.07402. Cited by: §4.1.
• [59] Y. Xu, Z. Piao, and S. Gao (2018) Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In CVPR, pp. 5275–5284. External Links: Document, ISSN Cited by: §2.
• [60] H. Xue, D. Q. Huynh, and M. Reynolds (2018) SS-lstm: a hierarchical lstm model for pedestrian trajectory prediction. In WACV, External Links: Document Cited by: §2.
• [61] T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NIPS, pp. 91–99. Cited by: §2.
• [62] T. Yagi, K. Mangalam, R. Yonetani, and Y. Sato (2018-06) Future person localization in first-person videos. In CVPR, External Links: Document, ISSN Cited by: §2.
• [63] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush (2019-05) Egocentric vision-based future vehicle localization for intelligent driving assistance systems. In ICRA, External Links: Document, ISSN Cited by: §2, §4.4, §4.4, §4.5, Table 1, Table 3, Table 4, Table 5.
• [64] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018-06) Generative image inpainting with contextual attention. In CVPR, Cited by: §3.1, §4.1.
• [65] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng (2019) SR-lstm: state refinement for lstm towards pedestrian trajectory prediction. In CVPR, Cited by: §2.
• [66] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu (2019-06) Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, Cited by: §2.

## 1 Video

We provide a supplemental video to better present our results. Since the task is inherently temporal, we refer the reader to our video, where the driving scenarios are presented as they happen. You can find it at https://youtu.be/jLqymg0VQu4

## 2 Egocentric Future Localization

For each dataset, we split the testing scenarios into challenging and very challenging categories based on their errors when a Kalman filter is used for future prediction (see the main paper for details). Table 3 shows the quantitative comparison of our future localization framework against all baselines on the nuScenes [9] testing dataset for all scenarios, only the challenging ones, and only the very challenging ones. Our framework outperforms all baselines at all difficulty levels, and the benefit gained from our method grows as the difficulty of the scenarios increases.
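The difficulty split described above can be illustrated with a minimal sketch. It assumes that each test scenario already has a scalar Kalman filter prediction error and that the categories are nested subsets defined by two error thresholds. The function name `split_by_difficulty`, the threshold values, and the error values are hypothetical and chosen only for illustration; the actual criteria are given in the main paper.

```python
from typing import Dict, List


def split_by_difficulty(
    kalman_errors: List[float],
    challenging_thresh: float,
    very_challenging_thresh: float,
) -> Dict[str, List[int]]:
    """Group scenario indices by how badly a Kalman filter baseline predicts them.

    Assumption: the subsets are nested, so every very challenging scenario
    is also counted as challenging, and every scenario belongs to "all".
    """
    if very_challenging_thresh < challenging_thresh:
        raise ValueError("very_challenging_thresh must be >= challenging_thresh")

    subsets: Dict[str, List[int]] = {
        "all": [],
        "challenging": [],
        "very_challenging": [],
    }
    for idx, err in enumerate(kalman_errors):
        subsets["all"].append(idx)
        if err > challenging_thresh:
            subsets["challenging"].append(idx)
        if err > very_challenging_thresh:
            subsets["very_challenging"].append(idx)
    return subsets


# Made-up Kalman filter errors (e.g., in pixels) for five test scenarios.
errors = [3.0, 12.5, 40.0, 7.2, 95.1]
# Hypothetical thresholds, not the values used in the paper.
subsets = split_by_difficulty(
    errors, challenging_thresh=10.0, very_challenging_thresh=50.0
)
for name, indices in subsets.items():
    print(name, len(indices), indices)
```

Under these assumptions the subsets shrink as the difficulty increases, which is consistent with the drop in the number of scenarios reported alongside the tables.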

To show zero-shot transfer to unseen datasets, we report the same evaluation on the testing split of the Waymo Open dataset [50] in Table 4. The ranking of the methods is the same as in the evaluation on the nuScenes dataset. This shows that our framework using the reachability prior generalizes well to unseen scenarios. Note that we also report the size of the testing dataset for each category; the number of scenarios drops significantly as the difficulty level increases.

To show robustness to datasets with noisy annotations, we report the same evaluation on our FIT dataset in Table 5. Here too, our framework outperforms all baselines at all difficulty levels. Note that this simulates real-world applications, where accurate annotations (e.g., object detection and tracking) are expensive to obtain.

## 3 Egocentric Emergence Prediction

We show two emergence prediction examples in Figure 7 for cars (1st row) and pedestrians (2nd row). In the first scenario, a car can emerge from the left street, from far away, or from the area occluded by the truck. In the second scenario, with non-straight egomotion, a pedestrian can emerge from different areas occluded by the moving car on the left, the parked cars on the left, or the truck on the right. Note how the reachability prior helps the emergence prediction framework cover more possible modes. Interestingly, the reachability prior prediction differs from the emergence prediction: close-by objects (cars and pedestrians) are part of the reachability prior only.

## 4 Failure Cases

Our method is mainly based on the sampling network from Makansi et al. [36] and thus inherits its failures. The sampling network is trained with the EWTA objective, which sometimes leads to a few bad hypotheses (outliers). Figure 8 shows a few examples of this phenomenon. One promising direction for future work is finding better sampling strategies to overcome this limitation.
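To make the source of such outliers more concrete, the following is a minimal NumPy sketch of an evolving winner-takes-all (EWTA) style loss for a single training sample. It is not the authors' implementation: the function names, the use of the mean Euclidean distance, and the halving schedule for `top_k` are assumptions made for illustration. The idea sketched is that only the `top_k` hypotheses closest to the ground truth are penalized, and `top_k` shrinks toward 1 over training.

```python
import numpy as np


def ewta_loss(hypotheses: np.ndarray, target: np.ndarray, top_k: int) -> float:
    """Mean Euclidean distance of the top_k hypotheses closest to the target.

    hypotheses: array of shape (K, D), one row per predicted hypothesis.
    target:     array of shape (D,), the ground-truth future location.
    """
    dists = np.linalg.norm(hypotheses - target, axis=1)
    closest = np.sort(dists)[:top_k]  # only these hypotheses are penalized
    return float(closest.mean())


def top_k_schedule(num_hypotheses: int, stage: int) -> int:
    """Assumed schedule: start with all hypotheses, halve per stage, stop at 1."""
    return max(1, num_hypotheses // (2 ** stage))


# Four made-up 2D hypotheses; the last one is far from the ground truth.
hyps = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [30.0, 40.0]])
gt = np.array([0.0, 0.0])

for stage in range(3):
    k = top_k_schedule(len(hyps), stage)
    print(f"stage={stage} top_k={k} loss={ewta_loss(hyps, gt, k):.2f}")
```

In this sketch, once `top_k` reaches 1 the distant hypothesis no longer contributes to the loss for this sample, so nothing pulls it toward the data. This is one plausible reading of why a few outlier hypotheses can remain after training.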