DROGON: A Causal Reasoning Framework for Future Trajectory Forecast

07/31/2019 ∙ by Chiho Choi, et al. ∙ 0

We propose DROGON (Deep RObust Goal-Oriented trajectory prediction Network) for accurate vehicle trajectory forecast by considering behavioral intention of vehicles in traffic scenes. Our main insight is that a causal relationship between intention and behavior of drivers can be reasoned from the observation of their relational interactions toward an environment. To succeed in causal reasoning, we build a conditional prediction model to forecast goal-oriented trajectories, which is trained with the following stages: (i) relational inference where we encode relational interactions of vehicles using the perceptual context; (ii) intention estimation to compute the probability distribution of intentional goals based on the inferred relations; and (iii) causal reasoning where we reason about the behavior of vehicles as future locations conditioned on the intention. To properly evaluate the performance of our approach, we present a new large-scale dataset collected at road intersections with diverse interactions of vehicles. The experiments demonstrate the efficacy of DROGON as it consistently outperforms state-of-the-art techniques.




1 Introduction

Forecasting the future trajectories of traffic participants has gained considerable attention over the past years. Extensive research has targeted robotic systems in indoor and outdoor environments to support safe navigation strategies. Considering interactions between humans, recent works alahi2014socially ; yi2015understanding ; alahi2016social ; soo2016egocentric ; ma2017forecasting ; gupta2018social ; Hasan_2018_CVPR ; xu2018encoding ; Yagi_2018_CVPR ; sadeghian2018sophie ; vemula2018social interpreted pedestrians' movements by learning the social behavior of humans in crowded environments. Recent breakthroughs in automated driving technologies have increased the demand for such research in the transportation domain. However, these works cannot be directly applied to the prediction of vehicle trajectories for the following reasons: (i) interactions have been investigated from local surroundings assuming slow movement of people, which may not be applicable to vehicles with faster speed; and (ii) road layouts, which can provide informative motion cues particularly in driving scenes, have rarely been considered in the literature. There are some research efforts in the transportation community. However, the focus has been on highway scenarios deo2018multi ; park2018sequence , the relative trajectory of vehicles with respect to ego-motion yao2018egocentric , or the prediction of only ego-vehicle trajectories huang2019uncertainty . Therefore, a robust solution still does not exist for predicting trajectories of vehicles driving in urban environments.

In this paper, we propose a vehicle trajectory forecast framework which aims to construct a causal relationship between intention and behavior of drivers from their relational interactions. In real driving scenarios, humans are capable of estimating the intention of others based on the perceived interactions, which corresponds to the potential destination in the near or far future. We then reason about their behavior, predicting intermediate paths with respect to the intention. In this view, automated driving or advanced driving assistance systems should be able to address the following questions: (i) Can they learn to perform intention estimation and react to interactions with other vehicles using sensory data? (ii) If so, how can the systems predict accurate trajectories under conditions of uncertain knowledge in a physically plausible manner?

Figure 1: The proposed approach forecasts future trajectories of vehicles. We first infer relational interactions of vehicles with each other and with an environment. The following module estimates the probability distribution of intentional goals (zones). Then, we conditionally reason about the goal-oriented behavior as multiple trajectories being sampled from the estimated distribution.

Our framework, DROGON, is designed to address these questions. We infer relational behavior of interactive vehicles from the perceptual observations of the given environment. On top of it, we build a conditional probabilistic prediction model to forecast their goal-oriented trajectories. Specifically, we first estimate a probability distribution of the intention (i.e., potential destination) of vehicles. Then, we perform multi-modal trajectory prediction conditioned on the probability of the formerly estimated intention categories (5 zones). An overview of the proposed approach is given in Figure 1.

A demonstration of DROGON is not achievable using the existing datasets colyar2007us101 ; colyar2007usi80 ; Geiger2012CVPR ; ma2019trafficpredict , mainly because their size and diversity are insufficient to discover functional interactions between vehicles and/or causal relationships between intention and behavior. Therefore, we created a large-scale vehicle trajectory forecast dataset that comprises highly interactive scenarios at four-way intersections in the San Francisco Bay Area. We extensively evaluate the proposed framework against the self-generated baselines as well as the state-of-the-art methods using this dataset.

The main contributions are summarized as follows:

  1. Propose a vehicle trajectory forecast framework to estimate the intention of vehicles by analyzing their relational behavior.

  2. Reason causality between the intentional destination of an agent and its intermediate configuration for more accurate prediction.

  3. Create a new dataset at four-way intersections with highly interactive scenarios in urban areas and residential areas.

2 Related Work

We review the most relevant works on deep learning based future trajectory forecast in the literature.

Social interaction modeling Following the pioneering work helbing1995social ; yamaguchi2011you , there has been an explosion of research that has applied social interaction models to data-driven systems. These models are basically trained using recurrent neural networks to make use of sequential attributes of human interactions. In alahi2016social , a social pooling layer is introduced to model interactions of neighboring individuals, and gupta2018social efficiently improves its performance. More recently in vemula2018social , the relative importance of each person is captured using the attention mechanism, considering interactions between all humans. It is extended in ma2019trafficpredict with an assumption that the same types of road users show similar motion patterns. Although their predictions are acceptable in many cases, these approaches may fail in complex scenes without the perceptual consideration of the surrounding environment such as road structures or layouts.

Scene context as an additional modality Scene context of an interacting environment has been presented in lee2017desire in addition to their social model. However, their restriction of the interaction boundary to local surroundings often causes failures toward far future prediction. xue2018ss subsequently extends local scene context through additional global scale image features. Also, choi2019looking analyzes local scene context from a global perspective and encodes relational behavior of all agents. Motivated by efficient relational inference from the perceptual observation in choi2019looking , we design a novel framework on top of relation-level behavior understanding. DROGON takes advantage of relational inference for reasoning about the causal relationship between intention and behavior of drivers.

Datasets for vehicle trajectory forecast The NGSIM colyar2007us101 ; colyar2007usi80 dataset has been released for vehicle trajectory forecast with different traffic congestion levels of highways. However, the motion of the vehicles and their interactions are mostly simple. The KITTI Geiger2012CVPR dataset has played an important role in detection, recognition, tracking, etc. However, it only provides a small number of tracklets that are available for trajectory forecast. Subsequently, Cityscapes cordts2016cityscapes and HDD Ramanishka_behavior_CVPR_2018 have been introduced for general tasks in autonomous driving. Although diverse interactions of vehicles are collected from different places, they do not provide 3D trajectories and LiDAR point clouds. More recently, the TrafficPredict ma2019trafficpredict dataset has been collected from urban driving scenarios. Although it is larger than the KITTI dataset, it only provides 3D trajectory information with no corresponding images or point clouds. As a result, it is insufficient to discover visual scene context using this dataset.

3 Preliminaries

3.1 Spatio-Temporal Interactions

Spatio-temporal interactions between road users have been considered as one of the most important features to understand their social behaviors. In vemula2018social ; ma2019trafficpredict ; haddad2019situation , spatio-temporal graph models are introduced with nodes to represent road users and edges to express their interactions with each other. To model spatio-temporal interactions, the spatial edges capture the relative motion of two nodes at each time step, and temporal edges capture the temporal motion of each node between adjacent frames as shown in Figure 2(a). Recently in choi2019looking , spatio-temporal features are visually computed using a convolutional kernel within a receptive field. In the spatio-temporal domain, these features not only contain interactions of road users with each other, but also incorporate their interactions with the environment. We use a similar approach and reformulate the problem with a graph model.

Figure 2: Illustration of different types of graph models to encode spatio-temporal interactions. (a) A node represents the state of each road user, whereas (b) it is a visual encoding of spatio-temporal interactions captured from each region of the discretized grid between adjacent frames.

3.2 Relational Graph

In the proposed approach, the traditional definition of a node is extended from an individual road user to a spatio-temporal feature representation obtained by exploiting spatial locality in input images. Thus, the edge captures relational behavior from spatio-temporal interactions of road users. We refer to this edge as a 'relational edge', as shown in Figure 2(b). In this view, we define an undirected and fully connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a finite set of $n$ nodes ($n = 25$ is used) and $\mathcal{E}$ is a set of relational edges connecting each pair of nodes. Given $\tau$ input images, we visually extract a node $v_i \in \mathcal{V}$, where $v_i$ is a $d$-dimensional vector representing spatio-temporal interactions within the $i$-th region of the discretized grid. The feature $r_{ij}$ of the relational edge between two nodes $(v_i, v_j)$ first determines whether the given interaction pair has meaningful relations from a spatio-temporal perspective through the function $\phi$, and then the function $\theta$ is used to identify how their relations $r_{ij}$ can affect the future motion of the target agent $k$ based on its past motion context $q^k$:

$$r_{ij} = \phi(v_{ij}; W^r), \qquad f^k_{ij} = \theta(r_{ij}, q^k; W^f),$$

where $v_{ij}$ is the concatenation of two nodes, $W^r$ denotes the weight parameters of $\phi$, $W^f$ is those of $\theta$, and $q^k$ is an $m$-dimensional feature representation extracted from the past trajectory of the $k$-th agent observed in the given perceptual information. We subsequently collect relational information $f^k_{ij}$ from all pairs and perform an element-wise sum to produce a unique relational representation $\mathcal{F}^k = \sum_{i,j} f^k_{ij}$ for the $k$-th agent.
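The pairwise aggregation described in this section can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the single ReLU layer `relu_layer`, the weight names in `params`, and all layer sizes are hypothetical stand-ins, since the actual architecture is deferred to the supplementary material.

```python
import numpy as np

def relu_layer(x, w, b):
    """One fully connected layer followed by ReLU (hypothetical stand-in)."""
    return np.maximum(x @ w + b, 0.0)

def relational_representation(nodes, q, params):
    """Aggregate pairwise relational features for one target agent.

    nodes:  (n, d) array, one spatio-temporal feature vector per grid region
    q:      (m,) array, past motion context of the target agent
    params: dict of assumed weights: w_r (2d, h), b_r (h,),
            w_f (h + m, out), b_f (out,)
    """
    n = nodes.shape[0]
    # The graph is undirected and fully connected: take each unordered pair once.
    i_idx, j_idx = np.triu_indices(n, k=1)
    pairs = np.concatenate([nodes[i_idx], nodes[j_idx]], axis=1)

    # First function: relation feature of each node pair.
    r = relu_layer(pairs, params["w_r"], params["b_r"])

    # Second function: condition each relation on the agent's past motion context.
    q_tiled = np.tile(q, (r.shape[0], 1))
    f = relu_layer(np.concatenate([r, q_tiled], axis=1),
                   params["w_f"], params["b_f"])

    # Element-wise sum over all pairs yields one vector for this agent.
    return f.sum(axis=0)
```

With 25 nodes the sketch evaluates 300 unordered pairs per agent, so the cost of the relational stage grows quadratically with the grid size.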

4 Methodology

We transfer knowledge of spatio-temporal relational inference to predict the probability of intentional goals as well as goal-oriented trajectories. To accomplish this, we assemble building blocks from (i) relational inference to encode relational interactions of vehicles using a relational graph, (ii) intention estimation to compute the probability distribution of intentional goals based on the inferred relations from the perceptual context, and (iii) causal reasoning to reason about the goal-oriented behavior of drivers as future locations conditioned on the intentional destinations.

4.1 Problem Definition

Given X, the proposed framework aims to predict $\delta$ likelihood heatmaps $\mathcal{H}^k$ for the $k$-th target vehicle observed in $\mathcal{I}$, where $\mathcal{I}$ is a sequence of $\tau$ past LiDAR images and $\mathcal{M}$ is a top-down LiDAR map in the same coordinate frame as $\mathcal{I}$. After that, we find the coordinate of the point with maximum likelihood in each heatmap, which corresponds to the future locations $\mathcal{Y}^k$.
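The last step, reading one location out of each heatmap, amounts to a per-frame argmax. A minimal sketch, assuming the heatmaps are stacked into a `(T, H, W)` array (the function name and array layout are illustrative, not taken from the paper):

```python
import numpy as np

def heatmaps_to_locations(heatmaps):
    """Return the (row, col) pixel of the maximum of each heatmap.

    heatmaps: (T, H, W) array of likelihoods, one map per future time step
    returns:  (T, 2) integer array of pixel coordinates
    """
    t, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(t, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    return np.stack([rows, cols], axis=1)
```

Pixel coordinates can then be converted to meters by multiplying with the grid resolution.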

4.2 Causal Reasoning for Trajectory Forecast

4.2.1 Conditional Trajectory Prediction

We use a conditional VAE (CVAE) framework to forecast multiple possible trajectories of each vehicle. For a given observation $c$, a latent variable $z$ is sampled from the prior distribution $P(z|c)$, and the output heatmaps $\mathcal{H}$ are generated from the distribution $P(\mathcal{H}|z, c)$. As a result, multiple $z$ drawn from the conditional distribution allow the system to model multiple outputs using the same observation $c$, where $c$ is the concatenation of the past motion context $q$ encoded from X and the estimated intention $g$. In general, the true posterior $P(z|\mathcal{H}, c)$ in maximum likelihood inference is intractable. Therefore, we consider an approximate posterior $Q(z|\mathcal{H}, c)$ with variational parameters predicted by a neural network. The variational lower bound of the model is thus written as follows:

$$\log P(\mathcal{H}|c) \geq -\mathrm{KL}\big(Q(z|\mathcal{H}, c)\,\|\,P(z|c)\big) + \mathbb{E}_{Q(z|\mathcal{H}, c)}\big[\log P(\mathcal{H}|z, c)\big],$$

and the objective with Gaussian latent variables becomes

$$\mathcal{L}_C = -\mathrm{KL}\big(Q(z|\mathcal{H}, c)\,\|\,P(z|c)\big) + \frac{1}{L}\sum_{l=1}^{L} \log P(\mathcal{H}|z_l, c),$$

where $z_l \sim Q(z_l|\mathcal{H}, c)$ is modeled as a Gaussian distribution.

We respectively build $Q(z|\mathcal{H}, c)$ and $P(\mathcal{H}|z, c)$ as a CVAE encoder and trajectory predictor, on top of convolutional neural networks. At training time, the observed condition $c$ is first concatenated with heatmaps $\mathcal{H}$, and we train the CVAE encoder to learn to approximate the prior distribution $P(z|c)$ by minimizing the Kullback-Leibler divergence. Once the model parameters are learned, the latent variable $z$ can be drawn from the same Gaussian distribution. At test time, the random sample $z$ is generated and masked with the relational features using the element-wise multiplication operator. The resulting variable is passed through the trajectory predictor and concatenated with the observation $c$ to generate $\delta$ heatmaps $\hat{\mathcal{H}}$. Details of the network architecture are described in the supplementary material.
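The Gaussian pieces of this procedure can be illustrated with a few lines of NumPy. The sketch assumes a standard normal prior and a diagonal Gaussian posterior parameterized by a mean and a log-variance, which is the common CVAE setup but is not confirmed by the text; the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Training-time sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def sample_masked_latent(relational_features, rng):
    """Test-time sample: z ~ N(0, I) masked by the relational features
    through element-wise multiplication."""
    z = rng.standard_normal(relational_features.shape)
    return z * relational_features
```

Calling `sample_masked_latent` repeatedly with the same relational features produces different latent codes, which is what lets the predictor emit several trajectories for one observation.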

4.2.2 Intentional Goal Estimation

We also train the intention estimator for goal-oriented future prediction, which employs prior knowledge about the intention of vehicles. Given the relational features extracted from vehicle interactions, we estimate the softmax probability $S_g$ for each intention category $g \in \{0, 1, 2, 3, 4\}$ (as illustrated in Figure 1) through a set of fully connected layers followed by a ReLU activation function. We compute the cross-entropy from the softmax probability:

$$\mathcal{L}_S = -\sum_{m=0}^{4} \mathbb{1}(m = g) \log S_m,$$

where $g$ is an estimated intention category and $\mathbb{1}$ is the indicator function, which equals 1 if $m$ equals $g$ or 0 otherwise. We use the estimated intention to condition the process of model prediction. The computed softmax probability is later used at test time to sample with respect to its distribution.
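For reference, the softmax and cross-entropy over the five intention zones can be written out directly. This is a generic sketch of the two operations, not the paper's network; the logits here stand in for the output of the fully connected layers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_zone):
    """Cross-entropy for a single sample whose intention label is target_zone.

    The indicator function selects exactly one term of the sum,
    so the loss reduces to -log of the probability of the labeled zone.
    """
    return -np.log(probs[target_zone])
```

With uniform logits over five zones the loss equals log 5, roughly 1.61, which is a convenient sanity value for an untrained estimator.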

4.3 Explicit Penalty Modeling

We introduce additional penalty terms specifically designed to constrain the model toward reliance on perceptual scene context and spatio-temporal priors.

Penetration penalty We encourage the model to forecast all future locations within the boundary of the drivable road in a given environment. To ensure that the predictions do not penetrate outside the road (i.e., sidewalks or buildings), we check the predicted heatmaps and penalize any points outside the drivable road using the following term:

$$\mathcal{L}_P = \frac{1}{\delta}\sum_{t}\sum_{j=1}^{J} \mathcal{D}_j \cdot B(\hat{\mathcal{H}}_{t,j}),$$

where the function $B$ is the binary transformation with a threshold $\epsilon_B$, $\mathcal{D}$ is the binary mask annotated as zero inside the drivable road, and $J$ is the number of pixels in each likelihood heatmap.
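A minimal NumPy sketch of this check follows. The threshold value, the averaging over time steps, and the function name are assumptions made for illustration; only the thresholding and the zero-inside-road mask come from the description above.

```python
import numpy as np

def penetration_penalty(heatmaps, non_drivable_mask, threshold=0.5):
    """Count thresholded heatmap pixels that fall outside the drivable road.

    heatmaps:          (T, H, W) predicted likelihood maps
    non_drivable_mask: (H, W) binary mask, 0 inside the drivable road, 1 outside
    threshold:         assumed value of the binarization threshold
    returns:           offending pixel count, averaged over the T time steps
    """
    binary = (heatmaps > threshold).astype(float)
    per_step = (binary * non_drivable_mask).sum(axis=(1, 2))
    return per_step.mean()
```

A hard threshold has zero gradient almost everywhere, so this sketch only evaluates the quantity; a trainable version would need a differentiable surrogate for the binarization.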

Inconsistency penalty In order to restrict our model from taking unrealistic velocity changes between adjacent frames, we encourage temporal consistency between frames as a way to smooth the predicted trajectories. We hypothesize that the current velocity at time $t$ should be near to the velocity of both the previous frame ($t-1$) and the next frame ($t+1$). The inconsistency penalty is defined as

$$\mathcal{L}_I = \frac{1}{\delta - 1}\sum_{t} E(v_{t-1}, v_t, v_{t+1}),$$

where $v_t$ denotes velocity at time $t$ and

$$E(a, x, b) = \max(0, \min(a, b) - x) + \max(x - \max(a, b), 0)$$

is the term to softly penalize the predictions outside of the velocity range.
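One way to realize such a soft range penalty is sketched below. It treats velocity as a scalar speed obtained by finite differences and penalizes the amount by which the middle speed leaves the interval spanned by its two neighbors; both choices, as well as the time step `dt`, are assumptions rather than details confirmed by the text.

```python
import numpy as np

def out_of_range(a, x, b):
    """How far x lies outside the closed interval between a and b (0 if inside)."""
    lo, hi = np.minimum(a, b), np.maximum(a, b)
    return np.maximum(lo - x, 0.0) + np.maximum(x - hi, 0.0)

def inconsistency_penalty(locations, dt=0.1):
    """Mean out-of-range speed over consecutive triples of time steps.

    locations: (T, 2) predicted positions in meters
    dt:        assumed spacing between frames in seconds (10 Hz here)
    """
    speeds = np.linalg.norm(np.diff(locations, axis=0), axis=1) / dt
    if speeds.size < 3:
        return 0.0
    return float(out_of_range(speeds[:-2], speeds[1:-1], speeds[2:]).mean())
```

A trajectory with constant or monotonically changing speed incurs no penalty, while a single-frame speed spike is penalized by the size of the spike.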

Dispersion penalty We further constrain the model to output more natural future trajectories, penalizing the cases where a large prediction error is observed. In order to discourage the dispersion of the actual distance error distribution of the model, we use the following penalty:

$$\mathcal{L}_D = \frac{1}{\delta}\sum_{t} (d_t - \bar{d})^2,$$

where $d_t$ is the Euclidean distance between the predicted location and the ground truth at time $t$ and $\bar{d}$ denotes the mean of $d = \{d_t\}$. We observe that the $\mathcal{L}_D$ penalty is particularly helpful to obtain accurate future locations with the concurrent use of the $\mathcal{L}_P$ term.
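Read as the variance of the per-step distance errors, the dispersion term is short to write down. The sketch below assumes exactly that reading (population variance over the predicted time steps); the function name is illustrative.

```python
import numpy as np

def dispersion_penalty(pred, gt):
    """Variance of the per-step Euclidean errors along one trajectory.

    pred, gt: (T, 2) predicted and ground-truth locations
    """
    d = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean((d - d.mean()) ** 2))
```

Note that a trajectory offset from the ground truth by a constant distance has zero dispersion, so this term only makes sense alongside losses that also reduce the error itself.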

4.4 Training

At training time, we minimize the total loss, a weighted sum of the terms introduced above. The first two terms are primarily used to optimize the CVAE modules, which aim to approximate the prior and generate actual likelihood predictions. The third term mainly leads the model's output to be in the drivable road, and the last two terms are involved in the generation of more realistic future locations. The loss weights were set to values that properly optimized the entire network structure.

                            KITTI Geiger2012CVPR    TrafficPredict ma2019trafficpredict    Ours
No. of scenarios            50                      103                                    213
No. of frames (×10^3)       13.1                    90                                     59.4
No. of object classes       8                       5                                      8
No. of intersections        -                       -                                      213
Sampling frequency (fps)    10                      2                                      10
Type of labels: 3D bounding boxes, ego-car odometry, LiDAR point cloud, 360° coverage, intentional goal, drivable area mask

Table 1: Comparison of our dataset with driving scene datasets Geiger2012CVPR ; ma2019trafficpredict for future trajectory forecast.

5 Intersection Dataset

A large dataset is collected in the San Francisco Bay Area (San Francisco, Mountain View, San Mateo, and Santa Cruz), focusing on highly interactive scenarios at four-way intersections. We chose 213 scenarios in both urban and residential areas, which contain interactions between road users toward an environment. Our intersection dataset consists of LiDAR-based point clouds (full 360° coverage), track-IDs of traffic participants, their 3D bounding boxes, object classes (8 categories including cars and pedestrians), odometry of the ego-car, heading angle, intentional goal (zone), and drivable area mask. A comparison of our dataset with other datasets is detailed in Table 1.

The point cloud data is acquired using a Velodyne HDL-64E S3 sensor, and distortion correction is performed using the high-frequency GPS data. Odometry of the ego-car is obtained via Normal Distributions Transform (NDT)-based point cloud registration. The labels are manually annotated at 2Hz and linearly interpolated to generate labels at 10Hz. For zones, we use the registered point cloud data and divide the intersection into 5 regions, which are labeled from 0 through 4 in a clockwise direction ('0' being the middle zone, as illustrated in Figure 1). Our new dataset will be released to the public upon the acceptance of this paper.
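The 2Hz to 10Hz label upsampling can be illustrated with `np.interp`. The sketch assumes labels are stored as one row per annotated timestamp with one column per quantity (for example box center coordinates); this layout and the function name are illustrative only.

```python
import numpy as np

def upsample_labels(times, values, rate_hz=10.0):
    """Linearly interpolate sparsely annotated labels onto a denser time grid.

    times:   (K,) increasing annotation timestamps in seconds (e.g. every 0.5 s)
    values:  (K, C) annotated quantities, one column per label dimension
    rate_hz: target label rate
    returns: (t_new, values_new) on the dense grid
    """
    n_new = int(round((times[-1] - times[0]) * rate_hz)) + 1
    t_new = np.linspace(times[0], times[-1], n_new)
    cols = [np.interp(t_new, times, values[:, c]) for c in range(values.shape[1])]
    return t_new, np.stack(cols, axis=1)
```

Angular quantities such as heading need care with this approach: interpolating across the wrap-around point gives wrong results unless the angles are unwrapped first, for instance with `np.unwrap`.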

6 Experiments

6.1 Preprocessing

For every window of past and future point clouds, we first transform the subset to the local coordinates at a reference time step using GPS/IMU position estimates in the world coordinate. Then, we project these transformed point clouds onto the top-down image space, which is discretized with a resolution of 0.5 m. Each cell in the projected top-down images has a three-channel representation of the height, intensity, and density. The height and intensity are obtained by a laser scanner, and we choose the maximum value of the points in the cell. The density simply shows how many points belong to the cell and is computed from the number of points in the cell. We further normalize each channel to a common range. From these projected top-down images, we create the coordinates of past and future trajectories in the same local coordinates.
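A rough sketch of such a top-down projection is given below. Several details are assumptions: the grid size, the centering of the grid on the local origin, and the density formula `min(1, log(N + 1) / log(64))`, which is a form commonly used for LiDAR bird's-eye-view encodings and is not confirmed by the text. Per-channel normalization is left out.

```python
import numpy as np

def project_top_down(points, grid_size=160, resolution=0.5):
    """Rasterize LiDAR points into a (grid_size, grid_size, 3) top-down image.

    points:     (N, 4) array of x, y, z, intensity in the local frame (meters)
    grid_size:  assumed number of cells per side
    resolution: cell size in meters
    channels:   max height, max intensity, log-scaled point density
    """
    half = grid_size * resolution / 2.0
    cols = np.floor((points[:, 0] + half) / resolution).astype(int)
    rows = np.floor((points[:, 1] + half) / resolution).astype(int)
    keep = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
    rows, cols, pts = rows[keep], cols[keep], points[keep]

    height = np.full((grid_size, grid_size), -np.inf)
    intensity = np.full((grid_size, grid_size), -np.inf)
    count = np.zeros((grid_size, grid_size))

    # Unbuffered in-place reductions so repeated cell indices accumulate correctly.
    np.maximum.at(height, (rows, cols), pts[:, 2])
    np.maximum.at(intensity, (rows, cols), pts[:, 3])
    np.add.at(count, (rows, cols), 1)

    empty = count == 0
    height[empty] = 0.0
    intensity[empty] = 0.0
    density = np.minimum(1.0, np.log(count + 1.0) / np.log(64.0))  # assumed formula
    return np.stack([height, intensity, density], axis=-1)
```

The `ufunc.at` calls matter here: plain fancy-index assignment would keep only one of several points that land in the same cell.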

In addition, we remove dynamically moving agents (vehicles and pedestrians) from the raw point clouds to leave only the static elements such as roads, sidewalks, buildings, and lanes, similar to lee2017desire . The resulting point clouds are registered in the world coordinate and cropped accordingly to build a map in the same local coordinates as the input images. We observed that the density is always high when the ego-vehicle stops at a red light, and that the height of a hilly road is not consistent when registered. Therefore, only the intensity values are used.

Single-modal prediction              1.0 s          2.0 s          3.0 s          4.0 s
State-of-the-art methods
     S-LSTM alahi2016social          1.66 / 2.18    2.57 / 4.03    3.59 / 6.19    4.61 / 8.45
     S-GAN gupta2018social           1.61 / 3.01    2.06 / 3.83    2.32 / 4.35    4.28 / 7.92
     S-ATTN vemula2018social         1.17 / 1.45    1.69 / 2.61    2.41 / 4.45    3.29 / 6.67
     Const-Vel scholler2019simpler   0.52 / 0.85    1.27 / 2.63    2.34 / 5.38    3.70 / 8.88
     Gated-RN choi2019looking        0.74 / 0.98    1.14 / 1.79    1.60 / 2.89    2.13 / 4.20
     DROGON                          0.52 / 0.71    0.86 / 1.46    1.31 / 2.60    1.86 / 4.02
Self-generated baselines
     w/o Intention                   0.79 / 1.04    1.20 / 1.85    1.65 / 2.90    2.18 / 4.25
     w/o Map                         0.65 / 0.86    1.01 / 1.62    1.46 / 2.77    2.02 / 4.23
     w/o Penalty                     0.60 / 0.81    0.97 / 1.58    1.41 / 2.71    1.98 / 4.20

Table 2: Quantitative comparison (ADE / FDE in meters) for single-modal prediction.

Multi-modal prediction               1.0 s          2.0 s          3.0 s          4.0 s
State-of-the-art methods
     S-LSTM-20 alahi2016social       1.06 / 1.37    1.68 / 2.79    2.46 / 4.55    3.36 / 6.73
     S-GAN-20 gupta2018social        1.50 / 2.84    1.94 / 3.52    1.99 / 3.75    3.43 / 6.47
     S-ATTN-20 vemula2018social      1.35 / 1.69    1.73 / 2.10    2.09 / 3.11    2.66 / 5.10
     Gated-RN-20 choi2019looking     0.60 / 0.80    0.93 / 1.49    1.33 / 2.48    1.82 / 3.74
     DROGON-Best-20                  0.39 / 0.53    0.65 / 1.14    1.03 / 2.11    1.48 / 3.29
     DROGON-Prob-20                  0.38 / 0.49    0.55 / 0.84    0.77 / 1.40    1.05 / 2.25

Table 3: Quantitative comparison (ADE / FDE in meters) for multi-modal prediction.

6.2 Comparison to Baselines

We conduct ablative tests using our intersection dataset to demonstrate the efficacy of the proposed DROGON framework by measuring average distance error (ADE) during a given time interval and final distance error (FDE) at a specific time frame in meters.
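Both metrics are simple to compute once predictions and ground truth are expressed in meters. A minimal sketch for a single trajectory (the function name and array layout are illustrative):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error of one predicted trajectory.

    pred, gt: (T, 2) predicted and ground-truth locations in meters
    returns:  (ADE, FDE) where ADE averages the Euclidean error over all
              T steps and FDE is the error at the last step
    """
    d = np.linalg.norm(pred - gt, axis=-1)
    return float(d.mean()), float(d[-1])
```

Evaluating at the 1.0 s, 2.0 s, 3.0 s, and 4.0 s horizons corresponds to truncating both arrays to the first 10, 20, 30, and 40 steps at the 10 fps sampling rate before calling the function.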

Prior knowledge of intention: In order to investigate the efficacy of causal reasoning, we design a baseline (w/o Intention) by dropping the intention estimator and CVAE encoder from DROGON. As a result, this baseline model is not generative, outputting a single set of deterministic locations. In Table 2, the reported error rates indicate that causal reasoning is essential to predict accurate trajectories under conditions of prior knowledge of intention (the mean average precision (mAP) of intention estimation is 71.1% for DROGON and 70.2% for w/o Map). This is because goal-oriented reasoning helps to condition the search space and guide the course of future motion.

Global scene context: We define another baseline model (w/o Map) which does not use global scene context for trajectory forecast. For implementation, we did not add features extracted from the map into the relational inference stage. In this way, the model is not guided to learn global road layouts, similar to relational inference in choi2019looking . As shown in Table 2, the prediction error of this baseline is higher than that of DROGON at every time step. The comparison indicates that discovering additional global context encourages the model to better understand the spatial environment.

Explicit penalty: We now remove the penalty terms in the total loss from the proposed DROGON framework at training time. The performance of this baseline model (w/o Penalty) is compared in Table 2. Although it performs better than the other baseline models, it shows a higher error rate than DROGON. This is expected, since the model is not explicitly guided by physical constraints of the real world. Thus, we conclude that these penalty terms contribute to forecasting accurate future trajectories.

Figure 3: Qualitative comparison of DROGON with the state-of-the-art algorithms. We visualize the top-1 prediction. Gray mask is shown for non-drivable region.
Figure 4: All 20 trajectories of DROGON-Prob-20 are plotted for multiple vehicles interactions. We change the intensity of colors for those 20 samples and use different colors for different vehicles. Gray mask is shown for non-drivable region.

6.3 Comparison with the State of the Art

We compare the performance of DROGON to the state-of-the-art approaches. Extensive evaluations are conducted on tasks for both single-modal and multi-modal prediction. As shown in Table 2 for single trajectory prediction, the ADE of S-GAN gupta2018social is consistently lower than that of S-LSTM alahi2016social across all time steps. S-ATTN vemula2018social shows further improvement of both ADE and FDE at most time steps by employing the relative importance of individual vehicles. Interestingly, however, their performance is worse than or comparable to the simple constant velocity (Const-Vel) model in scholler2019simpler . With additional environmental priors, the network model (Gated-RN in choi2019looking ) then performs better than the heuristic approach. DROGON also employs perceptual information of the physical environment. Additionally, we generate intentional goals and predict a trajectory by reasoning about goal-oriented behavior of humans. As a result, we achieve the best performance against the state-of-the-art counterparts.

For evaluation on multi-modal prediction in Table 3, we generate 20 samples and report the error of the prediction with minimum ADE (i.e., the best of the 20 samples), as proposed in lee2017desire ; gupta2018social . We design two variants of DROGON with a different sampling strategy: (i) DROGON-Best-20 generates trajectories only conditioned on the best intention estimate; and (ii) DROGON-Prob-20 conditions the model proportional to the softmax probability of each intention category. Similar to single-modal prediction, our models show a lower error rate than that of other approaches. It validates the effectiveness of our causal reasoning framework for goal-oriented future forecast. In Figure 3, we display their qualitative comparison in general driving scenarios 3(a), by considering the influence of environments (parked cars and road layouts) while making turns 3(b), and with an ability to socially avoid potential collisions 3(c). DROGON properly forecasts trajectories considering interactions with other vehicles and the environment. Moreover, we achieve the best performance with DROGON-Prob-20. By adaptively conditioning on potential goals, we can ease the impact of misclassification in intention estimation. In Figure 4, we visualize goal-oriented trajectories reasoned from DROGON-Prob-20. While approaching 4(a) and passing the intersection 4(b)-4(d), DROGON accordingly predicts goal-oriented trajectories based on the intentional destination (zone) of vehicles. Note that our framework is able to predict future dynamic motion of the static vehicles (blue and purple in 4(c)), which can eventually help to avoid potential collisions that might be caused by their unexpected motion.
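The best-of-20 evaluation and the two sampling strategies can be sketched as follows. The multinomial draw used for the probability-proportional variant is one plausible reading of "proportional to the softmax probability" and is an assumption, as are the function names.

```python
import numpy as np

def best_of_k(samples, gt):
    """Pick the sampled trajectory with minimum ADE and report its errors.

    samples: (K, T, 2) sampled trajectories, gt: (T, 2) ground truth
    returns: (index of best sample, its ADE, its FDE)
    """
    d = np.linalg.norm(samples - gt[None], axis=-1)  # (K, T) per-step errors
    ade = d.mean(axis=1)
    best = int(ade.argmin())
    return best, float(ade[best]), float(d[best, -1])

def allocate_samples(zone_probs, k=20, mode="prob", rng=None):
    """Decide how many of the k samples are conditioned on each intention zone.

    mode="best": all k samples use the most likely zone.
    mode="prob": zones are drawn in proportion to their softmax probability
                 (assumed here to be a multinomial draw).
    """
    if mode == "best":
        counts = np.zeros(len(zone_probs), dtype=int)
        counts[int(np.argmax(zone_probs))] = k
        return counts
    if rng is None:
        rng = np.random.default_rng()
    return rng.multinomial(k, zone_probs)
```

Under the probability-proportional allocation, a zone with modest probability still receives some samples, which is how a misclassified top-1 intention can be recovered in the best-of-20 score.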

7 Conclusion

We presented a Deep RObust Goal-Oriented trajectory prediction Network, DROGON, which aims to understand a causal relationship between intention and behavior of human drivers. Motivated by the real world scenarios, the proposed framework estimates the intention of drivers based on their relational behavior. Given prior knowledge of intention, our conditional probabilistic model reasons about the behavior of vehicles as intermediate paths. To this end, DROGON generates multiple possible trajectories of each vehicle considering physical constraints of the real world. For comprehensive evaluation, we collected a large-scale dataset with highly interactive scenarios at four-way intersections. The proposed DROGON framework achieved considerable improvement of prediction performance over the current state-of-the-art approaches.


  • [1] Alexandre Alahi, Vignesh Ramanathan, and Li Fei-Fei. Socially-aware large-scale crowd forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2203–2210, 2014.
  • [2] Shuai Yi, Hongsheng Li, and Xiaogang Wang. Understanding pedestrian behaviors from stationary crowd groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3488–3496, 2015.
  • [3] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
  • [4] Hyun Soo Park, Jyh-Jing Hwang, Yedong Niu, and Jianbo Shi. Egocentric future localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4697–4705, 2016.
  • [5] Wei-Chiu Ma, De-An Huang, Namhoon Lee, and Kris M Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4636–4644. IEEE, 2017.
  • [6] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [7] Irtiza Hasan, Francesco Setti, Theodore Tsesmelis, Alessio Del Bue, Fabio Galasso, and Marco Cristani. Mx-lstm: Mixing tracklets and vislets to jointly forecast trajectories and head poses. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [8] Yanyu Xu, Zhixin Piao, and Shenghua Gao. Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5275–5284, 2018.
  • [9] Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani, and Yoichi Sato. Future person localization in first-person videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [10] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, and Silvio Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482, 2018.
  • [11] Anirudh Vemula, Katharina Muelling, and Jean Oh. Social attention: Modeling attention in human crowds. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2018.
  • [12] Nachiket Deo and Mohan M Trivedi. Multi-modal trajectory prediction of surrounding vehicles with maneuver based lstms. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1179–1184. IEEE, 2018.
  • [13] Seong Hyeon Park, ByeongDo Kim, Chang Mook Kang, Chung Choo Chung, and Jun Won Choi. Sequence-to-sequence prediction of vehicle trajectory via lstm encoder-decoder architecture. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1672–1678. IEEE, 2018.
  • [14] Yu Yao, Mingze Xu, Chiho Choi, David J Crandall, Ella M Atkins, and Behzad Dariush. Egocentric vision-based future vehicle localization for intelligent driving assistance systems. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019.
  • [15] Xin Huang, Stephen McGill, Brian C Williams, Luke Fletcher, and Guy Rosman. Uncertainty-aware driver trajectory prediction at urban intersections. arXiv preprint arXiv:1901.05105, 2019.
  • [16] James Colyar and John Halkias. Us highway 101 dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030, 2007.
  • [17] James Colyar and John Halkias. Us highway i-80 dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030, 2007.
  • [18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [19] Yuexin Ma, Xinge Zhu, Sibo Zhang, Ruigang Yang, Wenping Wang, and Dinesh Manocha. Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. 2019.
  • [20] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
  • [21] Kota Yamaguchi, Alexander C Berg, Luis E Ortiz, and Tamara L Berg. Who are you with and where are you going? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1345–1352. IEEE, 2011.
  • [22] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
  • [23] Hao Xue, Du Q Huynh, and Mark Reynolds. Ss-lstm: A hierarchical lstm model for pedestrian trajectory prediction. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1186–1194. IEEE, 2018.
  • [24] Chiho Choi and Behzad Dariush. Looking to relations for future trajectory forecast. arXiv preprint arXiv:1905.08855, 2019.
  • [25] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • [26] Vasili Ramanishka, Yi-Ting Chen, Teruhisa Misu, and Kate Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [27] Sirin Haddad, Meiqing Wu, He Wei, and Siew Kei Lam. Situation-aware pedestrian trajectory prediction with spatio-temporal attention model. In Computer Vision Winter Workshop, 2019.
  • [28] Christoph Schöller, Vincent Aravantinos, Florian Lay, and Alois Knoll. The simpler the better: Constant velocity for pedestrian motion prediction. arXiv preprint arXiv:1903.07933, 2019.