1 Introduction
A major challenge for autonomous driving is achieving a safe and comfortable driving experience under uncertainty. Several factors contribute to the uncertainty in the road world model (RWM) around the ego vehicle (EV), and they can be divided into uncertainty due to measurement noise and uncertainty associated with the environment. Examples of measurement noise include inherent sensor noise, errors in perception pipelines (e.g., object detection, road segmentation), limited sensor range due to weather, and latency in detecting/tracking stationary or slowly-moving objects or objects that suddenly appear from occluded areas. Examples of uncertainty from the environment are hidden RWM states due to occlusions from large obstacles or sharp turns, or parameters that cannot be physically sensed, such as the intentions of humans participating in the environment, including other drivers, cyclists, and pedestrians.
Different approaches have been proposed to solve the behavior planning problem. Hand-written rules using finite state machines were used in the DARPA Urban Challenge DARPA (2007), e.g., Urmson et al. (2008). Another approach is to formulate the problem as a partially observable Markov decision process (POMDP), where the uncertainty in the environment is captured by a belief about its true state. POMDP is a powerful framework for solving behavior planning, and several papers have been published using approximate POMDP solvers, e.g., Bandyopadhyay et al. (2013); Brechtel et al. (2011, 2014); Cunningham et al. (2015); Ulbrich and Maurer (2013).
This paper presents risk-averse QMDP (RA-QMDP), a scalable approach for decision making under uncertainty that combines QMDP Littman et al. (1995), the unscented transform Julier and Uhlmann (2004), and MCTS Kochenderfer (2015). QMDP is a method for solving a POMDP by sampling from the belief and leveraging the solution to the underlying MDP, while the unscented transform (also referred to as sigma-point sampling) selects samples and their corresponding weights from a given distribution in a way that preserves the original mean and covariance. Together they provide an efficient tool to sample from the belief, evaluate the mean and variance of the outcome for different maneuvers, and make a risk-averse decision. To solve the underlying MDP, we propose a modified MCTS approach which performs a comprehensive evaluation of all permissible maneuvers (such as lane keeping or lane change negotiation/execution with various configurable parameters) over a sufficiently long planning horizon.
The approach presented in this paper can be easily extended to different driving scenarios; here we show results for two important use cases in highway driving. The first use case is behavior planning under limited sensor range, where we show that the proposed RA-QMDP approach achieves a trade-off between the EV's average velocity and worst-case absolute jerk as a function of sensor range. The second use case is handling a merging car that is partially observed due to occlusion. In both cases, we show that the proposed algorithm achieves a trade-off between safety and assertiveness.
2 Related Work
A tree-search-based approach for behavior planning has been proposed in Paxton et al. (2017), where a deep reinforcement learning (DRL) agent is trained to optimize the tree search efficiency. The paper, however, does not address uncertainty and assumes a fully observed environment (MDP).
Wei et al. (2011) proposes to use QMDP for single-lane autonomous driving under uncertainty. The paper assumes a Gaussian model for sensor noise and uses normalized probability density values as weights. Since the actual noise is not Gaussian and the transformation from observation to decision is nonlinear, this model assumption may be prone to mismatch errors.
Lenz et al. (2016) frames highway planning as a cooperative perfect-information game, and proposes to use MCTS to minimize a global, shared cost function. Dong et al. (2017) addresses the ramp merging scenario using a probabilistic graphical model. Sunberg et al. (2017) proposes POMCPOW and PFT-DPW as extensions of MCTS to POMDP settings with continuous observation and action spaces. In particular, the paper applies these algorithms to lane changing, and uses a particle filter to track predictions for parameters used to model driver intentions (assuming the remaining observations are ideal). Luo et al. (2018) considers importance sampling within a tree search algorithm, where samples are drawn from an importance distribution rather than the belief itself.
Slade et al. considers the joint estimation and control problem for a robot using an unscented Kalman filter Wan and Merwe (2000) to estimate different parameter values, and a variation of QMDP tree search that assumes full observability of the environment after the first step for computational efficiency. Hubmann et al. (2018) proposes a POMDP planning framework to handle automated driving at an unsignalized intersection, where a particle filter is used to represent the belief about the routes other vehicles may take. They sample from the particles, evaluate various longitudinal accelerations along the EV's path, and select the action achieving the highest expected reward.
As we explain in more detail in the next sections, our main contribution relative to the prior work discussed here is a sample-efficient framework that enables risk-averse decision making by tracking the uncertainty (in addition to the mean) of the total reward that an action can achieve. Moreover, rare critical events can be examined more thoroughly in this framework, as each has its corresponding sigma-point and a dedicated tree search.
3 Approach
A modular stack is considered where a RWM is inferred based on sensor inputs and map information. Let $s_t$ represent the state of the EV as well as the other objects sharing the road with the EV at time $t$, where the $i$th object state is defined as

$$ s^{(i)} = \left[\, x^{(i)},\; y^{(i)},\; v_x^{(i)},\; v_y^{(i)},\; a_x^{(i)},\; a_y^{(i)} \,\right]^\top, \qquad (1) $$

whose elements represent the lateral and longitudinal location, velocity, and acceleration of the object, respectively (time indices are dropped for simplicity). The RWM module generates a mean $\mu$ and covariance matrix $\Sigma$ that are calculated by its internal tracking algorithm.
A hierarchical behavior planner is considered where the action space, $\mathcal{A}$, is defined as a composition of a high-level action set and a set of parameters, $\theta$, specifying how to execute the action (e.g., $\theta$ may include safe time headway, minimum distance from a lead vehicle, desired velocity, maximum acceleration/deceleration, level of politeness, and the direction and maximum time/distance of a lane change).
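As a concrete illustration, a hierarchical action of this kind can be represented as a high-level maneuver paired with an execution-parameter record. The sketch below is hypothetical: the class names, field names, and default values are our own illustrative choices, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionParams:
    # All field names and defaults below are illustrative assumptions.
    safe_time_headway_s: float = 1.5
    min_gap_m: float = 2.0
    desired_velocity_mps: float = 30.0
    max_accel_mps2: float = 1.0
    max_decel_mps2: float = 8.0
    politeness: float = 0.5          # 0 = selfish, 1 = fully polite
    lane_change_direction: str = ""  # "", "left", or "right"
    max_lane_change_time_s: float = 0.0

@dataclass(frozen=True)
class Action:
    high_level: str  # e.g., "LaneKeep" or "LaneChange"
    params: ActionParams

# The planner's action space is then a finite set of such pairs:
ACTIONS = [
    Action("LaneKeep", ActionParams(max_accel_mps2=1.0)),
    Action("LaneChange", ActionParams(lane_change_direction="left",
                                      max_lane_change_time_s=5.0)),
]
```

Freezing the dataclasses makes actions hashable, which is convenient when they are used as keys for per-action visit counts in a tree search.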
The behavior planner action is then passed to a motion planning and control module that generates the trajectory planning and throttle/steering control. Based on a time series of inputs from the RWM, the behavior planner selects an action according to a risk-averse metric:

$$ a^* = \arg\max_{a \in \mathcal{A}} \left[ \bar{Q}(b, a) - \lambda\, \sigma_Q(b, a) \right], \qquad (2) $$

where $b$ is the belief about the true state, represented by a mean $\mu$ and covariance matrix $\Sigma$; $\bar{Q}(b, a)$ is the expected value of taking action $a$ in belief state $b$; $\sigma_Q^2(b, a)$ represents the variance of this value; and $\lambda \geq 0$ is a hyperparameter. This is a POMDP problem and its exact solution is computationally intractable. Instead, we use a QMDP-based framework equipped with the unscented transform to generate sigma-points that sample the belief distribution. In particular, we first select $2L + 1$ sigma-points in a deterministic way as follows:

$$ \begin{aligned} x_0 &= \mu, & w_0 &= \frac{\kappa}{L + \kappa}, \\ x_i &= \mu + \left( \sqrt{(L + \kappa)\, \Sigma} \right)_i, & i &= 1, \ldots, L, \\ x_{L+i} &= \mu - \left( \sqrt{(L + \kappa)\, \Sigma} \right)_i, & i &= 1, \ldots, L, \end{aligned} \qquad (3) $$

where $L$ is the dimension of the vector $\mu$, $w_0$ is the weight corresponding to $x_0$ and can be positive or negative, and $\left( \sqrt{(L + \kappa)\, \Sigma} \right)_i$ is the $i$th row or column of the matrix square root of $(L + \kappa)\, \Sigma$. The weights corresponding to the other sigma-points are all equal and are given by $w_i = \frac{1}{2(L + \kappa)}$. In our implementation, any sigma-point that is too close to $x_0$ or results in a physically infeasible situation (e.g., a car is placed out of the road, overlaps with another object, or has acceleration/deceleration/velocity beyond physical limits) is excluded. Therefore, the final value of $L$ used in equation (3) could be smaller than the dimension of the vector $\mu$. The mean and variance of the Q-function are then calculated as follows:

$$ \bar{Q}(b, a) = \sum_i w_i\, Q(x_i, a), \qquad (4) $$

$$ \sigma_Q^2(b, a) = \sum_i w_i \left( Q(x_i, a) - \bar{Q}(b, a) \right)^2. \qquad (5) $$
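The sigma-point construction of equation (3) and the risk-averse selection of equations (2), (4), and (5) can be sketched in a few lines. This is a minimal illustration under two stated assumptions: the covariance is positive definite (so a Cholesky factor serves as the matrix square root), and the feasibility filtering of infeasible sigma-points described above is omitted; function and variable names are ours, not the paper's.

```python
import numpy as np

def sigma_points(mu, Sigma, kappa=1.0):
    """2L+1 sigma-points and weights preserving the mean/covariance (eq. 3)."""
    L = len(mu)
    S = np.linalg.cholesky((L + kappa) * Sigma)  # one valid matrix square root
    pts = [mu]
    wts = [kappa / (L + kappa)]
    for i in range(L):
        pts.append(mu + S[:, i])
        pts.append(mu - S[:, i])
        wts.extend([1.0 / (2 * (L + kappa))] * 2)
    return np.array(pts), np.array(wts)

def risk_averse_action(mu, Sigma, actions, q_fn, lam=1.0, kappa=1.0):
    """Pick argmax_a [Qbar(b,a) - lam * sigma_Q(b,a)]  (eqs. 2, 4, 5)."""
    pts, wts = sigma_points(mu, Sigma, kappa)
    best, best_score = None, -np.inf
    for a in actions:
        q = np.array([q_fn(x, a) for x in pts])
        q_bar = np.dot(wts, q)
        var = np.dot(wts, (q - q_bar) ** 2)
        # clamp: var can go slightly negative when w_0 < 0 (kappa < 0)
        score = q_bar - lam * np.sqrt(max(var, 0.0))
        if score > best_score:
            best, best_score = a, score
    return best
```

Here `q_fn(x, a)` stands in for the per-sigma-point MCTS evaluation $Q(x_i, a)$; in the paper each sigma-point gets its own dedicated tree search.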
To calculate $Q(x_i, a)$ for a given state we use an online planner, in particular Monte Carlo tree search (MCTS). In baseline MCTS, the tree is expanded using the UCT bound Kochenderfer (2015) as follows:

$$ a = \arg\max_{a \in \mathcal{A}} \left[ Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}} \right], \qquad (6) $$

where $N(s, a)$ is the number of times action $a$ was selected when state $s$ was visited, $N(s) = \sum_a N(s, a)$ is the total number of visits to state $s$, and $c$ is a hyperparameter that tunes the exploration/exploitation trade-off. It should be noted that the main objective of the MCTS approach is to select the best action $a^*$; hence, the ranking of the Q-values across different actions is sufficient, and the accuracy of the value function is not very important, especially for actions with low values.
This is troublesome for the QMDP estimates given by equations (4) and (5), since an action that leads to a low value for a given sigma-point can lead to a high value for a different sigma-point. We found that using the UCT bound can result in very few visits for actions with low values, yielding an unreliable estimate for the corresponding Q-value. To mitigate this problem, we incorporated a stronger exploration approach into MCTS by combining the UCT bound with $\epsilon$-greedy exploration. It is also possible to add a variance term to the UCT bound for more efficient exploration; in this paper, we implemented the $\epsilon$-greedy approach since it resulted in acceptable performance. The modified action selection criterion is then given by

$$ a = \begin{cases} \text{uniform sample from } \mathcal{A}, & \text{with probability } \epsilon, \\ \arg\max_{a \in \mathcal{A}} \left[ Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}} \right], & \text{with probability } 1 - \epsilon. \end{cases} \qquad (7) $$

For an efficient tree search, we limit the $\epsilon$-greedy rule to the selection of actions at the root of the tree, and use UCT for selecting actions at all other nodes.
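The baseline UCT rule and the root-level selection can be sketched as follows. The data structures (dicts of per-action Q-values and visit counts) are our own illustrative choice, not the paper's implementation.

```python
import math
import random

def uct_action(node_q, node_n, total_n, c=1.0):
    """UCT selection: argmax_a Q(s,a) + c*sqrt(log N(s) / N(s,a))."""
    def bound(a):
        n = node_n[a]
        if n == 0:
            return float("inf")  # always try unvisited actions first
        return node_q[a] + c * math.sqrt(math.log(total_n) / n)
    return max(node_q, key=bound)

def root_action(node_q, node_n, total_n, eps=0.3, c=1.0, rng=random):
    """Root-level eps-greedy: explore uniformly w.p. eps, else use UCT."""
    if rng.random() < eps:
        return rng.choice(list(node_q))
    return uct_action(node_q, node_n, total_n, c)
```

Applying `root_action` only at the root while interior nodes use plain `uct_action` mirrors the scheme described above: comprehensive exploration where the final decision is made, standard UCT elsewhere.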
Finally, we point out that our method also extends to cases where the belief state contains both continuous and discrete random variables: for example, continuous noisy values for lateral/longitudinal location, velocity, and acceleration, and discrete/categorical values for blinker status, object type (car, truck, van, etc.), and the absence/presence of an object at sensor range. For the discrete or categorical random variables we use all the nonzero-probability realizations along with the sigma-points selected for the continuous ones.
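For such a mixed continuous/discrete belief, one way to form the joint sample set is to pair every continuous sigma-point with every nonzero-probability discrete realization and multiply the weights. A small sketch (function and argument names are ours):

```python
import itertools

def mixed_samples(cont_points, cont_weights, discrete_probs):
    """Pair each sigma-point with each discrete realization; the joint
    weight is the sigma-point weight times the discrete probability."""
    samples = []
    for (x, w), (d, p) in itertools.product(
            zip(cont_points, cont_weights), discrete_probs.items()):
        if p > 0:  # only nonzero-probability realizations are kept
            samples.append(((x, d), w * p))
    return samples
```

Because the sigma-point weights sum to one and each discrete distribution sums to one, the joint weights also sum to one.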
4 Simulation Results
In this section, we investigate two driving scenarios with uncertainty: 1) a stationary object on the road beyond sensor range, and 2) a highway ramp with limited field of view. Before we proceed, we provide a few details which are common to both scenarios. It is assumed that all vehicles use the intelligent driver model (IDM) Treiber et al. (2008) to set their longitudinal acceleration:

$$ \dot{v} = a_{\max} \left[ 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{d^*}{d} \right)^2 \right], \qquad (8) $$

where $v$ and $d$ are the velocity of the vehicle and the distance to its lead vehicle, respectively, and the safe distance $d^*$ is set according to Lemma 2 in Shalev-Shwartz et al. (2017):

$$ d^* = \left[ d_0 + v T + \tfrac{1}{2} a_{\max} T^2 + \frac{\left( v + T a_{\max} \right)^2}{2 b} - \frac{v_{\ell}^2}{2 b_{\max}} \right]_+, \qquad (9) $$

where $T$ is the response time, $v_{\ell}$ is the velocity of the lead vehicle, and $[\cdot]_+ = \max(\cdot, 0)$. Table 1 provides further details about the model parameters.
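For reference, the classic IDM acceleration of equation (8) can be sketched as below. Note that this sketch uses the standard IDM desired gap for $d^*$ rather than the Lemma-2 safe distance the paper substitutes, and all parameter values are illustrative defaults rather than the ones in Table 1.

```python
import math

def idm_accel(v, v_lead, d, v0=30.0, a_max=1.5, b=2.0, d0=2.0, T=1.5, delta=4):
    """Classic IDM longitudinal acceleration (illustrative parameters).
    v, v_lead: own and lead velocity [m/s]; d: gap to lead [m]."""
    dv = v - v_lead  # approach rate toward the lead vehicle
    # standard IDM desired gap (the paper replaces this with the RSS distance)
    d_star = d0 + v * T + v * dv / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** delta - (d_star / max(d, 1e-6)) ** 2)
```

With a free road the model accelerates toward the desired velocity $v_0$; when closing fast on a slow lead, the quadratic gap term dominates and produces strong braking.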
Name  Description  Value
d_0  Minimum distance in jammed traffic  m
T  Response time  s
v_0  Desired velocity  km/h (m/s)
a_max  Maximum acceleration  m/s^2
b  Safe deceleration  m/s^2
b_max  Maximum deceleration  m/s^2
The behavioral planning (BP) module operates at 2 Hz. All algorithms studied in the next subsections perform a tree search of depth 15 (which implies a 7.5 s planning horizon). The number of tree queries is limited to 20K, which is uniformly distributed over the sample(s) from the (belief) state. The action space consists of the LaneKeep action with five configurations that put different lower and upper bounds on the longitudinal acceleration (in m/s^2). In particular, we have the following intervals for the acceleration: [-8, -2], [-2, -1], [-1, 0], [0, 1], [1, 2]. For the rollout policy during tree search we use an IDM with acceleration restricted to the interval [-8, 0]. This rollout policy allows the EV to continue at constant velocity and keep a safe distance to its lead vehicle; hence it returns a reasonable estimate for the Q-value update. State transitions during the tree search are modeled via a basic prediction algorithm that has perfect knowledge of the EV's dynamics while assuming other vehicles drive at constant velocity. The cost function is a weighted sum of three components which focus on safety, comfort, and performance. For safety, we consider a penalty for closeness to other vehicles and for crashes (as a function of the velocity of the involved vehicles); for comfort, separate cost functions for hard braking and jerk are defined; and finally, for performance, deviation from the desired velocity is penalized. After the tree search finishes, the best action is selected according to equation (2) and is sent to the motion planning (MP) module, which runs at 20 Hz. The MP module respects the upper bound specified by the BP action, but it may override the lower bound if it finds that a higher deceleration is necessary to ensure safety.

4.1 Stationary Object on the Road
The experiment presented in this section addresses the sensor range limitation in autonomous driving, which could be due to poor weather/light conditions, occlusion by large obstacles, difficulty in detecting stationary objects on the road, etc. In this experiment, we assume there exists a stationary object on the EV's path ahead of its current location. Let us define the sensor range as the distance at which the probability of detecting a stationary object is $p$. In this scenario, our proposed RA-QMDP approach with risk-averse coefficient $\lambda$ and $\epsilon$-greedy parameter $\epsilon$ takes two samples: 1) a stationary object is present at the sensor range, and 2) the road ahead is clear of any object; the corresponding weights for these two samples are $p$ and $1 - p$, respectively. For baselines we consider two MCTS-based schemes, denoted by MCTS-P0 and MCTS-P1, which assume there is a stationary object on the road at sensor range with probability 0 and 1, respectively. Figure 2 compares the average velocity and the maximum absolute jerk experienced by the candidate algorithms for this experiment. MCTS-P0 is aggressive and achieves the highest average velocity. However, when the stationary object appears in the sensor range, this algorithm experiences the worst absolute jerk, as it needs to decelerate immediately to avoid an accident.^1 MCTS-P1, on the other hand, is conservative and settles at a much lower velocity. As a result, this algorithm experiences the lowest absolute jerk because it has sufficient time to decelerate and stop behind the stationary object. RA-QMDP is able to achieve a trade-off between safety and performance.

^1 With the parameters given in Table 1, MCTS-P0 cannot avoid a crash for sufficiently short sensor ranges.
To illustrate the advantage of making a risk-averse decision as well as comprehensive exploration at the tree root, we compare variants of RA-QMDP with different values of the risk-averse coefficient $\lambda$ and the $\epsilon$-greedy parameter $\epsilon$. Table 2 shows the average velocity, its corresponding safe distance to a stationary object, and the worst-case absolute jerk achieved by each of these variants of RA-QMDP for a scenario with a fixed sensor range. Choosing $\lambda = 0$ implies selecting the action with the highest Q-value, which is the most common procedure that MCTS- and QMDP-based solvers apply. As shown in Table 2, this selection results in risky performance where the safe distance is larger than the sensor range. Selecting a very large value for $\lambda$ is also not ideal and impacts robustness, as the variance term dominates the mean in (2). Selecting $\epsilon$ close to 0 results in insufficient exploration of the available actions at the root of the tree and hence unstable performance across sensor ranges. For the specific example shown in Table 2, with $\epsilon$ close to 0 the EV constantly accelerates and decelerates, and hence it could experience a large jerk if the stationary object appears in the sensor range during an acceleration phase. Selecting $\epsilon$ close to 1 shows reliable performance in this scenario. In general, however, selecting $\epsilon$ close to 1 implies choosing actions almost at random at the tree root, which could be computationally inefficient for high-dimensional problems with a large action space. Using an adaptive exploration strategy, such as one that incorporates the computed variance to improve sample efficiency, is a subject for future work.
4.2 Highway Ramp with Limited Field of View
The second experiment addresses the problem of noisy sensor measurements. The amount of sensor noise could depend on the weather and light conditions, the relative distance and orientation with respect to the EV, the length of the time interval over which the object has been tracked, etc. Consider a highway ramp as depicted in Figure 3, where the EV experiences a limited field of view due to existing obstacles (e.g., trees) or a difference in elevation. As a result, the EV's estimate of the state of a merging vehicle (MV) is subject to sensor noise whose value is high initially and decreases gradually as the MV is tracked for a longer time. For simplicity of analysis, we assume the MV does not switch its lane before the merge point^2 and that there is noise only in the measurements of the MV's longitudinal velocity, while all other state elements are measured accurately. Let us denote the noise variance reported by the RWM at time $t$ by $\sigma_t^2$. To perform the unscented transform, we set $\kappa = 1$. According to (3) we will have three sigma-points at each decision-making time, where $x_0$, $x_1$, and $x_2$ are constructed by adding $0$, $+\sqrt{2}\,\sigma_t$, and $-\sqrt{2}\,\sigma_t$, respectively, to the longitudinal velocity estimate of the MV. The corresponding weight for $x_1$ and $x_2$ is 0.25.

^2 The experiment could easily be extended to the case where the behavior prediction module assigns a probability to the MV's intention of keeping or switching its lane. In that case, there would be multiple sets of sigma-points (similar to the ones derived in this section), one for each probable intention of the MV, and the original sigma-point weights would be adjusted using the corresponding probability of each intention category.
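These three sigma-points follow from (3) with a single noisy dimension: with $L = 1$ and the consistent choice $\kappa = 1$, the weights come out to 0.5, 0.25, 0.25 and the outer points sit $\pm\sqrt{2}$ standard deviations from the mean, matching the 0.25 outer weights stated above. A sketch (variable names are ours):

```python
import math

def scalar_sigma_points(v_hat, var, kappa=1.0):
    """Sigma-points for one noisy dimension (L = 1), per eq. (3)."""
    L = 1
    spread = math.sqrt((L + kappa) * var)  # sqrt(2)*sigma when kappa = 1
    pts = [v_hat, v_hat + spread, v_hat - spread]
    wts = [kappa / (L + kappa),
           1 / (2 * (L + kappa)),
           1 / (2 * (L + kappa))]
    return pts, wts
```

The weighted mean of the points recovers the velocity estimate exactly, and the weighted second moment recovers the reported variance.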
For baselines we consider two MCTS-based schemes: 1) MCTS-Genie, which receives noise-free observations, and 2) MCTS-Noisy, which makes decisions based on the noisy measurements. The EV and the MV start with the same initial velocity. Figure 5 compares the EV's velocity for the candidate algorithms, and Table 5 provides further information about the vehicles' status at the merge point for a realization of the noise under which the initial measurements of the MV's velocity are lower than its actual value. In the beginning, MCTS-Noisy assumes that the MV is going slower than the EV, and decides to accelerate to pass the MV before reaching the merge point. However, after a few seconds, it realizes that the initial measurements were off and that it is in fact unable to overtake the MV. Although it applies an aggressive deceleration with a large jerk at that moment, there is not enough time left to create a safe gap before the merge point (see the distance to lead and time headway reported in Table 5). MCTS-Genie observes the actual velocity of the MV and hence is able to make a correct decision from the beginning. For RA-QMDP, we again select positive values for the risk-averse coefficient $\lambda$ and the $\epsilon$-greedy parameter $\epsilon$. As explained above, RA-QMDP takes three sigma-points, one of which considers that the MV is going faster than the value that the RWM reports. It then makes a risk-averse decision which forces the EV to slow down and increase the gap with the MV. Note that RA-QMDP has some delay in making that decision, since its evaluation is also influenced by the two other sigma-points, which suggest that the MV is going slowly and there might be an opportunity to overtake it.

Table 5: Vehicles' status at the merge point.
                MCTS-Genie  MCTS-Noisy  RA-QMDP
Distance to MV
Time headway
Max abs jerk
EV velocity
MV velocity
5 Conclusion and Future Work
In this paper, we proposed a sample-efficient framework for behavior planning under uncertainty which incorporates QMDP, the unscented transform, and a modified MCTS. It was shown how input uncertainty can be propagated to estimate the uncertainty in the outcome of candidate maneuvers, hence enabling risk-averse decision making.
Future areas of research include: 1) testing this algorithm on an actual self-driving car; 2) using deep reinforcement learning to improve the efficiency of the tree search by pruning the action space and replacing the rollout procedure; and 3) applying an enhanced prediction module capable of assigning probabilities to various maneuvers for forward simulations in the tree.
References
 Bandyopadhyay et al. [2013] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus. Intention-aware motion planning. Algorithmic Foundations of Robotics X, pages 475–491, 2013.
 Brechtel et al. [2011] S. Brechtel, T. Gindele, and R. Dillmann. Probabilistic MDP-behavior planning for cars. 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1537–1542, Oct 2011.
 Brechtel et al. [2014] S. Brechtel, T. Gindele, and R. Dillmann. Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs. 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 392–399, Oct 2014.
 Cunningham et al. [2015] A. G. Cunningham, E. Galceran, R. M. Eustice, and E. Olson. MPDM: Multi-policy decision-making in dynamic, uncertain environments for autonomous driving. 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1670–1677, May 2015.
 DARPA [2007] DARPA. DARPA urban challenge. 2007. URL http://archive.darpa.mil/grandchallenge/.
 Dong et al. [2017] C. Dong, J. M. Dolan, and B. Litkouhi. Interactive ramp merging planning in autonomous driving: Multimerging leading PGM (MMLPGM). 2017.
 Hubmann et al. [2018] C. Hubmann, J. Schulz, M. Becker, and D. Althoff. Automated driving in uncertain environments: Planning with interaction and uncertain maneuver prediction. IEEE Transactions on Intelligent Vehicles, 3(1):5–17, 2018.
 Julier and Uhlmann [2004] S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3), Mar 2004.
 Kochenderfer [2015] M. J. Kochenderfer. Decision making under uncertainty: Theory and application. MIT Press, 2015.
 Lenz et al. [2016] D. Lenz, T. Kessler, and A. Knoll. Tactical cooperative planning for autonomous highway driving using Monte-Carlo tree search. IEEE Intelligent Vehicles Symposium, pages 1–7, 2016.
 Littman et al. [1995] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. pages 1–59, 1995.
 Luo et al. [2018] Y. Luo, H. Bai, D. Hsu, and W. S. Lee. Importance sampling for online planning under uncertainty. The International Journal of Robotics Research, 2018.

 Paxton et al. [2017] C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov. Combining neural networks and tree search for task and motion planning in challenging environments. 2017.
 Shalev-Shwartz et al. [2017] S. Shalev-Shwartz, S. Shammah, and A. Shashua. On a formal model of safe and scalable self-driving cars. 2017.
 [15] P. Slade, Z. N. Sunberg, and M. J. Kochenderfer. Estimation and control using sampling-based Bayesian reinforcement learning.
 Sunberg et al. [2017] Z. N. Sunberg, C. J. Ho, and M. J. Kochenderfer. The value of inferring the internal state of traffic participants for autonomous freeway driving. 2017 American Control Conference (ACC), pages 3004–3010, May 2017.
 Treiber et al. [2008] M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations. 2008.
 Ulbrich and Maurer [2013] S. Ulbrich and M. Maurer. Probabilistic online POMDP decision making for lane changes in fully automated driving. 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), pages 2063–2070, Oct 2013.
 Urmson et al. [2008] C. Urmson et al. Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25:425–466, 2008.
 Wan and Merwe [2000] E. A. Wan and R. V. D. Merwe. The unscented Kalman filter for nonlinear estimation. Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium, pages 153–158, Oct 2000.
 Wei et al. [2011] J. Wei, J. M. Dolan, J. M. Snider, and B. Litkouhi. A point-based MDP for robust single-lane autonomous driving behavior under uncertainties. IEEE International Conference on Robotics and Automation, pages 2586–2592, 2011.