In recent years, there has been growing interest in building fully autonomous vehicles. Such vehicles must accurately anticipate the motions of other traffic participants so that their planned motions are neither too aggressive nor too conservative. To achieve this goal, autonomous vehicles are expected to reason about the behavior and intentions of surrounding vehicles and subsequently predict the future trajectories of these vehicles.
Given an urban driving environment with complex latent factors such as lane geometries, traffic regulations, road constructions, and dynamic agents, the prediction problem is highly complex. Under such a scenario, there are two challenges to be addressed. First, given the complex environment, it is essential to consider the multimodal nature of the future trajectory. For example, at the intersection depicted in Fig. 1, there are two distinct choices, moving forward and turning left, which result in totally different future trajectories. Second, the prediction method must be highly flexible and able to easily adapt to complex contextual factors.
Many handcrafted prediction models, such as [2, 3, 4, 5], may lack flexibility and require refactoring when a new contextual factor is introduced. Meanwhile, other methods, especially the popular RNN-based models [6, 7], treat trajectory prediction as a pure regression problem despite the multimodal nature of the future trajectory. We are therefore motivated to develop a flexible trajectory prediction framework which can easily adapt to various complex urban environments while incorporating high-level intentions to enhance prediction accuracy.
In this paper, we propose an online two-level vehicle trajectory prediction framework. We develop a policy anticipation network based on a long short-term memory (LSTM) network to anticipate the high-level policies of vehicles (such as moving forward, yielding, turning, and lane changing) from sequential past observations. Given the high-level policy, we propose an optimization-based context reasoning process in which the complex contextual information is naturally encoded in a multi-layer cost map structure. A policy interpreter bridges the high-level and low-level reasoning by transforming the policy into an initial trajectory guess for the non-linear optimization. The policy anticipation network thus captures the intention and guides the trajectory prediction process, while the optimization-based context reasoning can easily adapt to different traffic configurations by transforming different factors into a unified notion of cost.
The motivation for modeling trajectory prediction as an optimization problem is that human drivers internally balance their maneuvers in terms of the “cost”. For example, driving through red lights or breaking speed limits would risk receiving penalties, and human drivers have an inborn ability to balance various kinds of costs during driving. The optimization-based reasoning process can be easily extended by adding another cost term to the unified cost map structure.
Optimization-based modeling of driving behavior has been studied extensively, especially in the field of imitating human driving behaviors using inverse reinforcement learning (IRL). However, from the prediction perspective, the multimodal nature of the future trajectory [1, 12] is not well modeled by the optimization process. For example, the non-linear optimization process may converge to either of the two possible intentions in Fig. 1. To this end, we propose the policy anticipation network, which guides the optimization process toward the anticipated high-level intention. Note that our optimization-based context reasoning can also incorporate the IRL technique for weight tuning, which we leave as important future work.
We summarize the contributions of this paper as follows:
An online two-level trajectory prediction framework which incorporates the multimodal nature of future trajectories.
A highly flexible optimization-based context reasoning process which incorporates a multi-layer cost map structure to encode various contextual factors.
Integration of the vehicle trajectory prediction framework and presentation of the results on accuracy, efficiency, and flexibility in various traffic configurations.
II Related Works
The problem of vehicle trajectory prediction has been actively studied in the literature. As concluded in the survey by Lefèvre et al., there are three levels of prediction models, namely, physics-based, maneuver-based, and interaction-aware motion models. Physics-based motion models use dynamic and kinematic vehicle models to propagate future states [14, 15]. However, the prediction results only hold for the very short term (less than one second). Maneuver-based motion models are more advanced in the sense that they may forecast relatively complex maneuvers, such as lane changes and turns at intersections, by revealing the maneuver pattern. Many works on this level present a probabilistic framework to account for the uncertainty and variation of the motion patterns, such as Gaussian processes (GPs) [16, 3], Monte Carlo sampling, and Gaussian mixture models (GMMs) [18]. However, they typically assume vehicles are independent entities and fail to model interactions with the driving context and with other agents.
Many interaction-aware motion models are based on dynamic Bayesian networks (DBNs). Though these methods are context-aware, they require refactoring when a new contextual factor is considered. Our method belongs to the interaction-aware level. Compared to DBN-based prediction methods, our method is more flexible and can be easily adapted to different traffic configurations.
It is notable that recurrent neural networks (RNNs) and their variants, such as LSTM networks, have recently been applied to predict or track moving targets, as in [6, 20]. Our policy anticipation network shares a similar structure with such maneuver-analysis networks, but the fundamental difference is that those networks are only used to analyze the maneuver pattern at an intersection and cannot actively predict future trajectories. Many learning-based end-to-end trajectory prediction models [6, 7, 12] lack the ability to encode contextual information. In DESIRE [1], Lee et al. suggest combining IRL with an environment feature map to learn the interaction with contextual factors. However, this requires a large amount of training data to generalize due to the high complexity of the model. It is also hard to learn the interaction in rare driving situations, such as red-light offences.
III System Overview
The overview of our vehicle trajectory prediction framework is shown in Fig. 2. During the high-level reasoning, the sequential state observations are fed to the policy anticipation network, which provides the future policy that a vehicle is likely to execute. Together with the map information, the policy can be properly interpreted in the driving context and a reference prediction is generated and fed to the optimization-based context reasoning process. The optimization process renders various environment observations and encodes them into the multi-layer cost map structure. A non-linear optimization process is then conducted to generate the predicted vehicle trajectory.
IV Policy Anticipation and Interpretation
IV-A Problem formulation
We assume that the vehicle is equipped with a detector that provides the pose estimation of a neighboring vehicle at different time-instants, consisting of the global coordinates, the vehicle orientation in the 2-D plane, and the body velocity. We accumulate observations from different time-instants inside a sliding window of fixed size, and the network predicts the vehicle's future policy over a subsequent look-ahead window. The annotated labels include forward, yield, turn left, turn right, lane change left, and lane change right, and the label set can be easily extended when considering complex lane geometries.
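To make the formulation concrete, here is a minimal Python sketch of the sliding-window observation buffer and the policy label set; the field names and the `ObservationBuffer` class are our own illustrative choices (the 40-frame window size follows Sec. VI-B).

```python
from collections import deque

# The six annotated policy labels from the problem formulation.
POLICY_LABELS = ["forward", "yield", "turn_left", "turn_right",
                 "lane_change_left", "lane_change_right"]

class ObservationBuffer:
    """Accumulates per-frame pose observations (x, y, theta, v) of one
    neighboring vehicle inside a fixed-size sliding window."""
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest frames drop out

    def push(self, x, y, theta, v):
        self.window.append((x, y, theta, v))

    def ready(self):
        # the network is queried only once the window is full
        return len(self.window) == self.window.maxlen

buf = ObservationBuffer(window_size=40)   # 40 frames, i.e., 4 s at 10 Hz
for t in range(40):
    buf.push(x=0.1 * t, y=0.0, theta=0.0, v=1.0)
```

Once `ready()` returns true, the buffered sequence would be fed to the policy anticipation network as one input sample.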
IV-B Network structure
The policy anticipation network is built on an LSTM classifier. Note that the output layer is modified to a softmax layer to provide the likelihood of each policy label. The resulting probability distribution is used in the interpretation of the policy in Sec. IV-C. We adopt the negative log-likelihood (NLL) loss for this classification problem.
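A minimal numerical sketch of the softmax output and NLL loss described above, using NumPy; the logits below are placeholders, not outputs of the actual LSTM.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll_loss(probs, true_label):
    # negative log-likelihood of the ground-truth policy label
    return -np.log(probs[true_label])

logits = np.array([2.0, 0.5, 1.0, -1.0, 0.0, -0.5])  # one score per policy
probs = softmax(logits)                  # likelihood over all policy labels
loss = nll_loss(probs, true_label=0)     # true policy: index 0
```

During training the loss is minimized over labeled windows; at inference time the full distribution `probs` is passed on to the policy interpreter.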
IV-C Policy interpretation
The policy interpretation module combines the policy anticipation results with a local map, so that the optimization-based context reasoning can start with a reasonable initial guess.
As shown in Fig. 3, with different initial guesses (turning left or moving forward in this case), the optimization searches a totally different local solution space. Specifically, we use the likelihood provided by the policy anticipation network as follows: 1) we prune the infeasible anticipations (turning right in this example); 2) we take the policy of maximum likelihood; and 3) we generate an initial trajectory prediction by extracting reference points corresponding to the selected policy. The initial guess is fed to the optimization-based context reasoning for further processing. In the future, instead of using deterministic reasoning based on one selected policy, we plan to use a probabilistic interpretation process.
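The three interpretation steps above can be sketched as follows; the function name `interpret_policy` and the data layout are hypothetical.

```python
def interpret_policy(likelihoods, feasible, reference_points):
    """likelihoods: dict policy -> probability from the anticipation network,
    feasible: set of policies allowed by the local map,
    reference_points: dict policy -> list of (x, y) reference points."""
    # 1) prune anticipations that are infeasible in the local map
    pruned = {p: l for p, l in likelihoods.items() if p in feasible}
    # 2) take the policy of maximum likelihood
    best = max(pruned, key=pruned.get)
    # 3) extract the reference points of the selected policy as the
    #    initial trajectory guess for the non-linear optimization
    return best, reference_points[best]

likelihoods = {"forward": 0.30, "turn_left": 0.55, "turn_right": 0.15}
feasible = {"forward", "turn_left"}       # turning right infeasible here
refs = {"forward": [(0, 0), (5, 0)], "turn_left": [(0, 0), (4, 2)]}
policy, init_guess = interpret_policy(likelihoods, feasible, refs)
```

A probabilistic interpreter would instead keep the pruned distribution and seed multiple optimizations, which matches the stated future direction.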
V Optimization-based Context Reasoning
V-A Cost map structure
In this section, we present the cost map structure, which encodes the whole driving context. For the sake of illustration, we separate different kinds of costs into different layers with distinct physical meanings. A toy example of the multi-layer cost map is given in Fig. 4. We adopt a four-layer design: the cost induced by the lane geometry and static obstacles goes into the static layer, the cost induced by moving objects (MO) into the MO layer, the cost induced by traffic regulations into the context layer, and the cost induced by the vehicles' nonholonomic constraints into the nonholonomic layer.
V-B Cost functions
We adopt a discrete notation of the vehicle trajectory, in which a continuous trajectory is represented by a series of rear-axle center points in a global coordinate system. Namely, the predicted trajectory is approximated by points sampled at equidistant times with a fixed sampling step width. The dynamics of the trajectory can be expressed as functions of its time derivatives, which are computed as finite differences of the sampling points; in particular, the orientation and curvature of the trajectory follow from the first and second derivatives. Following these notations, we introduce the cost functions.
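As a sketch of these finite-difference relations, the following NumPy snippet recovers speed, orientation, and curvature from sampled points of a straight constant-speed trajectory; the step width and point values are illustrative.

```python
import numpy as np

dt = 0.1                                            # sampling step width (s)
pts = np.array([[t * 0.5, 0.0] for t in range(10)]) # straight line, 5 m/s

vel = np.diff(pts, axis=0) / dt                     # first time derivative
acc = np.diff(vel, axis=0) / dt                     # second time derivative
theta = np.arctan2(vel[:, 1], vel[:, 0])            # orientation
speed = np.linalg.norm(vel, axis=1)
# curvature kappa = (x' y'' - y' x'') / |v|^3, on interior points
num = vel[:-1, 0] * acc[:, 1] - vel[:-1, 1] * acc[:, 0]
kappa = num / np.maximum(speed[:-1] ** 3, 1e-9)
```

For a straight trajectory at constant speed, the orientation and curvature are identically zero, which makes this a convenient sanity check for the discretization.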
Lane geometry: Ideally, a point that exceeds a solid-lane boundary should receive a repulsive force (cost) pointing back into the travelable lanes. For broken-lane boundaries, we pose a cost of the same structure, but with a much smaller magnitude to allow for lane changing. We present a bi-directional signed distance field (bi-SDF) to describe the corresponding cost characteristics.
The cost is determined by the signed distance to the nearest solid-lane boundary, a distance threshold, and a cost magnitude. A positive distance means the point is in the in-boundary area, while a negative distance means the point exceeds the boundary and needs to be pushed back into the travelable lanes. Different from the traditional SDF, which does not define the gradient outside the boundary, we slightly extend the definition so that a point outside the boundary receives a force pointing back into the lane. The benefit of this extension is that the optimization process is less likely to get stuck in the infeasible out-of-boundary area.
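A possible sketch of the bi-SDF cost under the stated design: a one-sided quadratic barrier whose value (and hence gradient) keeps growing for negative distances. The threshold and magnitude values are assumptions, not the paper's tuned parameters.

```python
def bi_sdf_cost(d, d_thr=1.0, magnitude=10.0):
    """d: signed distance to the nearest solid-lane boundary
    (positive inside the lane, negative outside)."""
    if d >= d_thr:
        return 0.0                       # far enough inside the lane
    # quadratic barrier; unlike a traditional SDF, the cost keeps growing
    # for d < 0, so out-of-boundary points are pushed back into the lane
    return magnitude * (d_thr - d) ** 2

inside = bi_sdf_cost(2.0)     # well inside the lane
near = bi_sdf_cost(0.5)       # close to a solid boundary
outside = bi_sdf_cost(-0.5)   # exceeds the boundary
```

Because the barrier is strictly increasing as the point moves outward, gradient-based optimization always sees a restoring direction, which is the stated benefit of the extension.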
Static obstacles/driveable area: The cost induced by static obstacles shares a similar form to the lane-geometry cost, and the two are categorized into the static layer. The distance measure to static obstacles is likewise extended to allow a negative distance.
Moving obstacles: To take interaction with other agents into account, we introduce a cost that is active whenever the position of the predicted vehicle is within a distance threshold of the predicted position of another interacting agent. The practical method of acquiring the other agents' predictions is introduced in Sec. VI-C. The MO cost at each time step is specified by the quadratic error between the distance to the moving agent and the distance threshold whenever the threshold is reached.
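A sketch of the MO cost under these definitions; the distance threshold and magnitude are illustrative assumptions.

```python
import numpy as np

def mo_cost(p_ego, p_agents, d_mo=3.0, magnitude=5.0):
    """Quadratic penalty on proximity to other agents' predicted
    positions, active only inside the distance threshold d_mo."""
    cost = 0.0
    for p in p_agents:
        dist = np.linalg.norm(np.asarray(p_ego) - np.asarray(p))
        if dist < d_mo:
            cost += magnitude * (d_mo - dist) ** 2
    return cost

far = mo_cost((0.0, 0.0), [(10.0, 0.0)])    # no interaction, zero cost
close = mo_cost((0.0, 0.0), [(1.0, 0.0)])   # inside the threshold
```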
Red lights: We argue that red lights should not be enforced as hard constraints since in real-world driving there exist red-light offences. To capture the real intention of other drivers under traffic control, we introduce a red-light repulsive force, which produces larger resistance for vehicles travelling at higher velocity. Notably, if a vehicle refuses to brake and tries to go through a red light, as shown in Fig. 5, this cost does not dominate the optimization process, and the abnormal behavior is still captured. The overall cost is expressed as the dot product of the velocity and the repulsive force, which is parameterized by a cost magnitude, a unit force direction, the distance to the red light, and a distance threshold below which the force takes effect.
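A sketch of the red-light cost consistent with this description: the cost is the product of a distance-dependent repulsive force and the speed toward the light. All parameter values are assumptions.

```python
def red_light_cost(v, d_red, d_thr=20.0, magnitude=2.0):
    """v: speed along the lane (m/s), d_red: distance to the red light."""
    if d_red >= d_thr:
        return 0.0                       # light still beyond the threshold
    # the unit force direction opposes travel, so the dot product with the
    # velocity grows with speed: a vehicle that refuses to brake pays more,
    # but the term stays finite, so a red-light offence remains representable
    force = magnitude * (d_thr - d_red)
    return force * max(v, 0.0)

cruising = red_light_cost(v=10.0, d_red=30.0)   # light still far away
braking = red_light_cost(v=2.0, d_red=5.0)      # slowing down near the light
running = red_light_cost(v=10.0, d_red=5.0)     # higher speed, higher cost
```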
Speed limits: Like red lights, speed limits should not be encoded as hard constraints when speed-limit offences are taken into account. We introduce a cost induced by the speed limit that should also allow the vehicle to stop in the case of a traffic jam. As a result, we model it as the quadratic error between the predicted velocity and a desired velocity. The magnitude of the desired velocity is the minimum of two factors, namely, the speed limit and the velocity trend. Specifically, the velocity trend is obtained by fitting the historical velocity observations in the sliding window; it captures the acceleration or deceleration trend of the predicted vehicle and is close to zero in the case of a traffic jam. The direction of the desired velocity conforms to the lane geometry.
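A sketch of the desired-velocity construction; we use a simple linear fit of past speeds as the velocity trend, which is our own simplification of the fitting step.

```python
import numpy as np

def desired_speed(past_speeds, dt, lookahead, speed_limit):
    """Desired speed magnitude = min(speed limit, extrapolated trend)."""
    t = np.arange(len(past_speeds)) * dt
    slope, intercept = np.polyfit(t, past_speeds, 1)       # velocity trend
    trend = max(slope * (t[-1] + lookahead) + intercept, 0.0)
    return min(speed_limit, trend)

def speed_cost(v_pred, v_des, magnitude=1.0):
    # quadratic error between predicted and desired velocity
    return magnitude * (v_pred - v_des) ** 2

# decelerating vehicle: the trend extrapolates to zero -> allows stopping
jam = desired_speed([2.0, 1.0, 0.5, 0.1], dt=0.1, lookahead=1.0,
                    speed_limit=13.9)
# accelerating vehicle: the trend is capped by the speed limit
free = desired_speed([10.0, 10.5, 11.0, 11.5], dt=0.1, lookahead=1.0,
                     speed_limit=13.9)
```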
Nonholonomic constraints: The predicted trajectory should obey the limits of the vehicle motion model. Due to the steering geometry of the vehicle, the curvature should be bounded by the maximum allowed curvature. However, to take abnormal operations, such as skidding, into account, the hard curvature constraint is instead modeled by a feasibility cost.
The curvature cost takes effect only when the curvature exceeds the limit and is scaled by a cost magnitude. Similarly, due to the friction limit of tires and the throttle limit of vehicles, the acceleration cannot exceed a maximum value, and we model an acceleration feasibility cost of the same form, parameterized by the acceleration limit and a cost magnitude.
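Both feasibility terms reduce to the same one-sided quadratic barrier, sketched here with assumed limits and magnitudes.

```python
def barrier_cost(value, limit, magnitude):
    """Zero inside the physical limit, quadratic beyond it, so abnormal
    operations such as skidding stay representable but expensive."""
    excess = abs(value) - limit
    return magnitude * excess ** 2 if excess > 0 else 0.0

kappa_cost = barrier_cost(0.30, limit=0.20, magnitude=100.0)  # too sharp
acc_cost = barrier_cost(2.0, limit=3.0, magnitude=10.0)       # feasible
```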
The motivation for using quadratic functions with barriers for the cost functions is that 1) they tolerate a mild deviation from the best driving practices, and 2) they penalize abnormal behaviors while still allowing their existence.
V-C Non-linear optimization
Based on the cost functions, we now introduce the optimization formulation. At the top level, the predicted trajectory is generated by minimizing the overall cost, which is the integral of the overall loss over time and is approximated by a finite summation in the discrete case.
The weights of different costs represent the tradeoff among different contextual factors. We tune the weights so that predicted trajectories match a human prior for different traffic configurations. As mentioned in Sec. I, the optimization process can incorporate IRL for automatic weight tuning, which is important future work.
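A sketch of the discrete weighted-sum objective; the layer functions and weights are illustrative placeholders, not the tuned cost map.

```python
def total_cost(traj, cost_layers, weights, dt):
    """Weighted sum of per-layer costs accumulated over trajectory points,
    approximating the time integral of the overall loss."""
    total = 0.0
    for name, layer_fn in cost_layers.items():
        total += weights[name] * sum(layer_fn(p) for p in traj) * dt
    return total

traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)]
layers = {"static": lambda p: p[1] ** 2,   # penalize lateral offset
          "speed": lambda p: 0.0}          # placeholder layer
weights = {"static": 10.0, "speed": 1.0}
J = total_cost(traj, layers, weights, dt=0.1)
```

In the real system each layer is evaluated by sampling the rendered cost map, and the weights trade the contextual factors off against one another.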
VI Implementation Details
VI-A Simulation environment
We adopt an open-source urban autonomous driving simulator named CARLA. In this section, we present our environment setup. In each scene, the agent vehicles are controlled by the autopilot module provided by CARLA, a player vehicle is controlled by a human player, and an observer vehicle closely follows the player vehicle, senses the environment, and predicts the trajectory of the player vehicle. We focus on predicting trajectories for the player vehicle since it reflects real human intentions. Another reason is that the agent vehicles do not have complex maneuver patterns due to the fixed handwritten logic of the autopilot module. Hence, when presenting the experimental results (Sec. VII), we focus on illustrating the prediction results for the player vehicle to give a clean and informative visualization.
VI-B Data collection and network training
We collect the training data for the policy anticipation network from CARLA by driving the player vehicle ourselves using a Logitech G29 racing wheel. During the driving, we follow the traffic rules most of the time and conduct different maneuver patterns, but we also commit intentional traffic-rule offences, as in Fig. 5, to examine how our prediction module responds. Moreover, we add virtual road construction sites, as in Fig. 4, and respond to them during driving using the feedback from our visualization system. The observation window and look-ahead window are both set to 40 frames (4 s). The policy labels can be determined in an unsupervised way by examining statistics on the steering angle and acceleration within the look-ahead window. One problem with the data collected from CARLA is that the current version (CARLA release 0.7.0 is used for all the experiments) only includes two-lane roads with traffic moving in opposite directions, which means that lane-change behavior cannot be effectively incorporated. In the future, we will collect data from more complex environments to enrich the dataset.
VI-C Non-linear optimization procedure
The non-linear optimization formulation (6) is implemented in Ceres since the objectives can be rewritten as non-linear least squares; if more complex objectives are involved, non-linear solvers such as NLOPT can be used. We cap the maximum number of solver iterations. Recall that the prediction for a certain vehicle can depend on the predictions of other vehicles due to the moving-obstacle cost term. In practice, we use the prediction results from the last prediction round to compute this term.
VII Experimental Results
VII-A Prediction accuracy
We adopt the root mean square error (RMSE) between the predicted coordinates and the true coordinates as the error metric. We are concerned with how the RMSE statistics change with respect to the look-ahead time, especially when the look-ahead time is large. To this end, we plot the mean and variance of the RMSE with respect to the look-ahead time, as shown in Fig. 6. We compare our method with the following two methods:
Naive fitting method. The future trajectory is generated using least-squares polynomial regression with an acceleration regularizer. This method can capture the motion trend but cannot incorporate the driving context.
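A sketch of such a naive fitting baseline; capping the polynomial degree at two stands in for the acceleration regularizer and is our own simplification.

```python
import numpy as np

def naive_fit_predict(ts, xs, future_ts, degree=2):
    """Least-squares polynomial fit of past positions, extrapolated
    into the future; degree 2 corresponds to constant acceleration."""
    coeffs = np.polyfit(ts, xs, degree)
    return np.polyval(coeffs, future_ts)

ts = np.arange(10) * 0.1
xs = 2.0 * ts + 0.5 * ts ** 2          # constant-acceleration ground truth
pred = naive_fit_predict(ts, xs, np.array([1.0, 1.5]))
```

On data generated by a constant-acceleration motion, the fit recovers the trajectory exactly, but it is blind to lanes, obstacles, and traffic rules.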
RNN encoder-decoder method. Since the source code of DESIRE is not officially available and related learning-based predictors are mainly tested on highway datasets, we adopt the RNN encoder-decoder part of DESIRE according to its publicly available implementation details. We conduct the experiments in the form of case studies to show that our proposed framework can easily adapt to various traffic configurations, as elaborated in Sec. VII-B.
VII-B Testing in different traffic configurations
To verify that our proposed method can automatically adapt to different traffic configurations and take various latent factors into account, we design five test cases: driving along a curved road, heading towards a pedestrian who is crossing the road, passing through an intersection with road construction, heading towards a red light with road construction, and committing a red-light offence. To give a clean visualization, we focus on the prediction for the vehicle being driven by us, namely, the player vehicle.
Curved road: This case verifies the capability of reasoning about lane geometries. As illustrated in Fig. 5(a), both baseline methods can capture the motion trend. However, because they are unaware of the lane geometries, they take a long time to conform to the shape of the road, while our proposed method produces a reasonable prediction immediately. Quantitatively, our method achieves a clear accuracy improvement for the ending frame. From the instantaneous error statistics, i.e., the average error over the whole predicted trajectory, we observe that the maximum instantaneous error is substantially reduced. This test case verifies the effectiveness of the optimization-based context reasoning.
Intersection with road construction: This case illustrates the importance of high-level reasoning, which the two baseline methods lack. As shown in Fig. 5(b), neither baseline method can effectively capture the turning-left intention, and both converge slowly. The benefit of incorporating high-level behavior anticipation is validated by an accuracy gain for the ending frame and a lower instantaneous error around the intersection entrance. The results verify that it is essential to incorporate the high-level intention.
Heading towards a pedestrian: This case illustrates the ability to reason about other moving agents. As shown in Fig. 5(d), the predicted vehicle is moving at high speed, but a pedestrian is crossing the road ahead, so sudden braking should be the reaction. The two baselines still output forward trajectories, while our method expects hard braking by modeling the interactions between agents. The instantaneous error shows that our method predicts the braking intention beforehand.
Red light offence: This case is used to show how the proposed method responds to abnormal driving behavior, and is elaborated in Fig. 5.
VII-C Run-time efficiency
In this section, we test the run-time efficiency. We collect multiple rounds of predictions and record the time consumption of the three parts of the system, namely, network inference, cost map rendering, and non-linear optimization. The experiments are conducted on a desktop computer equipped with an Intel i7-8700K CPU and an NVIDIA GTX 1080 Ti graphics card for network training and inference.
As shown in Tab. I, the network inference (on GPU) takes only a small fraction of the total time on average since the network structure is not complex, and the non-linear optimization is also efficient. Rendering a 4-layer 2-D cost map (CPU implementation) accounts for a large part of the overall time of one prediction round.
VIII Conclusion and Future Work
In this paper, we propose an online two-level vehicle trajectory prediction framework which utilizes a policy anticipation network for high-level policy reasoning and a non-linear optimization process for low-level context reasoning. We highlight the flexibility of the proposed framework and provide various test cases in urban environments, including normal operations and abnormal driving behavior. In the future, we will explore using IRL to acquire the cost weights from data. Modeling interaction in prediction is another direction we are actively exploring.
-  N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  G. Agamennoni, J. I. Nieto, and E. M. Nebot, “Estimation of multivehicle dynamics by considering contextual information,” IEEE Trans. Robot., vol. 28, no. 4, pp. 855–870, 2012.
-  C. Laugier, I. E. Paromtchik, M. Perrollaz, M. Yong, J.-D. Yoder, C. Tay, K. Mekhnacha, and A. Nègre, “Probabilistic analysis of dynamic scenes and collision risks assessment to improve driving safety,” IEEE Intell. Trans. Syst. Mag., vol. 3, no. 4, pp. 4–19, 2011.
-  S. Lefèvre, C. Laugier, and J. Ibañez-Guzmán, “Evaluating risk at road intersections by detecting conflicting intentions,” in Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst. IEEE, 2012, pp. 4841–4846.
-  F. Havlak and M. Campbell, “Discrete and continuous, probabilistic anticipation for autonomous robots in urban environments,” IEEE Trans. Robot., vol. 30, no. 2, pp. 461–474, 2014.
-  B. Kim, C. M. Kang, S. H. Lee, H. Chae, J. Kim, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” arXiv preprint arXiv:1704.07049, 2017.
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
-  M. T. Wolf and J. W. Burdick, “Artificial potential functions for highway driving with collision avoidance,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on. IEEE, 2008, pp. 3731–3736.
-  P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-first International Conference on Machine Learning. ACM, 2004, p. 1.
-  M. Bahram, C. Hubmann, A. Lawitzky, M. Aeberhard, and D. Wollherr, “A combined model-and learning-based framework for interaction-aware maneuver prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 6, pp. 1538–1550, 2016.
-  D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for autonomous cars that leverage effects on human actions.” in Robotics: Science and Systems, 2016.
-  N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” arXiv preprint arXiv:1805.06771, 2018.
-  S. Lefèvre, D. Vasquez, and C. Laugier, “A survey on motion prediction and risk assessment for intelligent vehicles,” Robomech Journal, vol. 1, no. 1, p. 1, 2014.
-  S. Ammoun and F. Nashashibi, “Real time trajectory prediction for collision risk estimation between vehicles,” in Proc. of the IEEE Intl. Conf. on Intell. Comp. Comm. and Processing. IEEE, 2009, pp. 417–422.
-  M. Brannstrom, E. Coelingh, and J. Sjoberg, “Model-based threat assessment for avoiding arbitrary vehicle collisions,” IEEE Trans. on Intell. Trans. Syst., vol. 11, no. 3, pp. 658–669, 2010.
-  Q. Tran and J. Firl, “Online maneuver recognition and multimodal trajectory prediction for intersection assistance using non-parametric regression,” in Proc. of the IEEE Intl. Veh. Sym. IEEE, 2014, pp. 918–923.
-  A. Eidehall and L. Petersson, “Statistical threat assessment for general road scenes using Monte Carlo sampling,” IEEE Trans. on Intell. Trans. Syst., vol. 9, no. 1, pp. 137–147, 2008.
-  G. S. Aoude, V. R. Desaraju, L. H. Stephens, and J. P. How, “Driver behavior classification at intersections and validation on large naturalistic data set,” IEEE Trans. on Intell. Trans. Syst., vol. 13, no. 2, pp. 724–736, 2012.
-  T. Gindele, S. Brechtel, and R. Dillmann, “Learning driver behavior models from traffic observations for decision making and planning,” IEEE Intell. Trans. Syst. Mag., vol. 7, no. 1, pp. 69–79, 2015.
-  A. Khosroshahi, E. Ohn-Bar, and M. M. Trivedi, “Surround vehicles trajectory analysis with recurrent neural networks,” in Proc. of the IEEE Intl. Conf. on Intell. Trans. Syst. IEEE, 2016, pp. 2267–2272.
-  P. Ondruska and I. Posner, “Deep tracking: Seeing beyond seeing using recurrent neural networks,” arXiv preprint arXiv:1602.00991, 2016.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  J. Ziegler, P. Bender, T. Dang, and C. Stiller, “Trajectory planning for bertha—a local, continuous method,” in Proc. of the IEEE Intl. Veh. Sym. IEEE, 2014, pp. 450–457.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
-  S. Agarwal, K. Mierle, and Others, “Ceres solver,” http://ceres-solver.org.
-  S. G. Johnson, The NLopt nonlinear-optimization package, 2011. [Online]. Available: http://ab-initio.mit.edu/nlopt
-  N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents, supplementary material.” http://www.robots.ox.ac.uk/~namhoon/doc/DESIRE-supp.pdf.
-  W. Ding, J. Chen, and S. Shen, “Predicting vehicle behaviors over an extended horizon using behavior interaction network,” in Proc. of the IEEE Intl. Conf. on Robot. and Autom. IEEE, 2019.