1 Introduction
It is widely believed that the transportation systems of the future will be dominated by autonomous vehicles (AVs). With the rapid advances in this field in recent years, many have come to predict that this shift will occur within the next ten years. A major motivation for the interest in and push towards the development of AVs is the demand for safer transportation. It is generally assumed that replacing the intrinsic imperfections of human drivers with expert computational models may significantly reduce the number of accidents caused by driver error [1]. Yet, the development of reliable and robust AV technologies remains an ongoing challenge, and is actively pursued from various directions of research and development [2].
Of particular importance is the research on reliable motion planning and collision avoidance mechanisms. Over multiple decades, numerous approaches to this problem have been proposed [3], ranging from control-theoretic formalizations and optimal control methods to potential field and rule-based techniques. More recently, advances in machine learning have enabled new data-driven approaches to collision avoidance based on techniques such as imitation learning [4] and deep Reinforcement Learning (RL) [5]. However, with the growing complexity of their deployment settings and mechanisms, providing safety guarantees on these solutions is becoming increasingly difficult [2]. A notable instance is the Traffic Collision Avoidance System (TCAS), which had satisfied the rigorous safety requirements of the Federal Aviation Administration (FAA) before its wide deployment in NextGen commercial aircraft. Yet, it was recently shown [6] to be highly unreliable in modern high-density airspace, to the extent that it may give rise to Inevitable Collision States (ICS), in which a collision eventually occurs regardless of the future trajectories. Furthermore, recent research demonstrates that automatic sense-and-avoid mechanisms can be adversarially exploited to manipulate the motion trajectory of AVs [7].
In response, a growing number of mitigation techniques and novel approaches to safe motion planning have been proposed, but each comes with certain case-specific assumptions and ad hoc verification procedures. Consequently, quantitative comparison across such methods is extremely difficult. The current state of the art includes several attempts at benchmarking safe behavior in motion planning and collision avoidance (e.g., [8, 9]), but many of the current frameworks fail to meet the requirements of new adaptive techniques based on machine learning [2]. Also, current benchmarking frameworks do not provide comprehensive and robust mechanisms for exploration in the complex spaces of undesired states and trajectories. Prominent approaches in such frameworks are based on randomized or scenario-based generation of obstacles (e.g., [9]), which are highly prone to missing critical ICS or other undesired states specific to the mechanism under test. Another approach in these frameworks relies on computationally expensive techniques for reachability analysis of collision states, which also fail to provide concrete guarantees on critical boundaries of safe operation [10].
Aiming to fill the gap in safety-focused benchmarking, this paper proposes a novel framework based on machine learning for benchmarking the reliability of novel techniques under the worst-case scenario of interacting with an optimal adversarial agent. This framework adopts the powerful exploration and optimization performance of deep RL to train adversarial autonomous agents whose goal is to learn optimal navigation policies that drive the system into ICS and other unsafe states. Depending on the parameters and goals of the analysis, such objectives may include direct collision of the adversarial agent with the AV, or exploiting the collision avoidance mechanisms to manipulate the trajectory of the AV, either to alter and control its path or to indirectly induce collisions between the AV and other objects in the environment.
Building on this foundation, the main contributions of this work include:

Proposal of a computational framework and process flow for worst-case benchmarking of collision avoidance algorithms, independent of their complexity, stochasticity, and even adaptive dynamics;

Proposal of a deep RL process flow to seamlessly adapt to the system under test, and to overcome the shortcomings of fully random or scenario-based exploration mechanisms;

Proposal of novel metrics for standardized comparison of collision avoidance algorithms;

Demonstration of the practical application and efficacy of the proposed framework via a realistic case study of comparing the reliability of two collision avoidance mechanisms in response to intentional collision attempts.
The remainder of this paper is organized as follows: Section 2 provides the architectural details of our proposed framework, and introduces novel metrics for quantification and comparison of motion planning and collision avoidance algorithms. Section 3 demonstrates the application of this framework through a practical case study comparing the reliability of two collision avoidance mechanisms, viz., one based on a deep RL navigation policy, and another employing a control-theoretic mechanism. Finally, Section 4 concludes the paper with remarks on potential directions of further research.
2 Proposed Architecture
The high-level idea of this framework is to employ deep RL methods [5] for end-to-end training of an optimal adversarial policy (i.e., a mapping of states to actions) for autonomous motion planning. The difference between this adversarial policy and typical AV policies is that the latter aim to achieve optimal motion planning while avoiding collisions, whereas the former aims to learn optimal motion planning that causes intentional direct or indirect collisions. Through suitable choices of optimality criteria and the corresponding reward (i.e., objective) function, this approach enables the adoption of powerful state-space exploration and policy optimization techniques developed for deep RL applications. Furthermore, we argue that both the training and test-time procedures of adversarial policies provide quantitative measures of reliability, which can be used for benchmarking of behaviors in worst-case scenarios. Accordingly, our proposed framework comprises four main components: simulation environment, objectives, deep RL algorithm, and quantitative metrics, detailed as follows:
2.1 Simulation Environment
Computational testing of AV safety necessitates a simulation environment that takes into account the various aspects of deployment settings, including (but not limited to): the physics of the AV, the terrain and road conditions, the number and type of other mobile or static objects in the environment, the type of traffic, etc. Depending on the goals and criteria of the experiment, the simulation environment may be chosen from numerous open-source and commercial simulation platforms, such as TORCS [11] and CARLA [12].
Aiming for a general-purpose solution, the proposed framework is designed to be independent of the simulation platform and setup. To this end, it imposes only two minimum requirements on the simulation environment: first, the platform should provide an interface (i.e., an API) that enables seamless integration with the deep RL module for external monitoring of state variables and control of agents; and second, the platform should allow for external adjustment of simulation flow and speed. The latter requirement eliminates the effect of training and computational lags on the flow of simulation.
If these requirements are met, the proposed framework can be seamlessly implemented on any configuration and setup of the environment, such as the choice of two-agent vs. multi-agent settings, selection of terrains, models of vehicular physics, and any other parameter configured for the experiment.
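The two requirements above can be illustrated with a minimal Python sketch. The `SimulationAdapter` interface, its method names, and the `ToySimulator` stand-in are hypothetical and introduced here for illustration only; they are not the API of TORCS, CARLA, or any other platform. The sketch shows external observation and control of agents, plus stepping the simulation only on demand so that computation time between steps does not affect the simulated flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Protocol, Sequence


class SimulationAdapter(Protocol):
    """Hypothetical minimal interface a simulator would expose to the framework."""

    def observe(self, agent_id: str) -> Sequence[float]:
        """Return the state variables currently visible to one agent."""
        ...

    def act(self, agent_id: str, action: Sequence[float]) -> None:
        """Queue a control command (e.g. steering, throttle, brake) for one agent."""
        ...

    def tick(self) -> None:
        """Advance the simulation by exactly one step, only when called."""
        ...


@dataclass
class ToySimulator:
    """Stand-in 1-D world: each agent's position moves by its last command."""

    positions: Dict[str, float]
    pending: Dict[str, float] = field(default_factory=dict)
    steps_taken: int = 0

    def observe(self, agent_id: str) -> Sequence[float]:
        return [self.positions[agent_id]]

    def act(self, agent_id: str, action: Sequence[float]) -> None:
        self.pending[agent_id] = float(action[0])

    def tick(self) -> None:
        for agent_id, velocity in self.pending.items():
            self.positions[agent_id] += velocity
        self.pending.clear()
        self.steps_taken += 1


Policy = Callable[[Sequence[float]], Sequence[float]]


def run_lockstep(sim: SimulationAdapter, policies: Dict[str, Policy], n_steps: int) -> None:
    """Query every policy first, then advance the world by one step.

    Because the simulator only moves on tick(), slow policy evaluation or
    training updates between steps do not distort the simulated dynamics.
    """
    for _ in range(n_steps):
        for agent_id, policy in policies.items():
            sim.act(agent_id, policy(sim.observe(agent_id)))
        sim.tick()


sim = ToySimulator(positions={"av": 0.0, "adversary": 10.0})
policies = {"av": lambda obs: [1.0], "adversary": lambda obs: [-1.0]}
run_lockstep(sim, policies, n_steps=3)
print(sim.positions, sim.steps_taken)
```

Any real platform would need a thin wrapper that maps its own API onto such an interface; the wrapper, not the framework, would carry the platform-specific details.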
2.2 Objectives
Deep RL algorithms are based on optimization of action policies according to defined objectives, represented by a reward function , which produces numerical values corresponding to the instantaneous value of the action at state . In the case of an AV aiming at traversing a race track in minimum time, a common formulation of the reward function is [13]:
(1) 
where is the velocity of AV along its longitudinal axis, is the angle between direction of AV and the track axis, and is the distance between AV and the center of the track axis. This function rewards trajectories with high velocities in the direction of the track, but it does not account for the negative effect of collisions. A simple remedy is to add a negative reward term to when a collision is detected, for instance:
(2) 
where is a binary variable representing whether a collision has occurred or not, and is the negative reward whose magnitude determines how undesirable collisions are.
As depicted in Figure 1, experiments on an AV-under-test (henceforth referred to as ) can employ one of three choices as the high-level objective of the adversarial RL agent (designated by ), described below:

Direct collision with : This objective aims to find an optimal navigation policy that leads to direct collision of with . Optimality can be defined arbitrarily (e.g., minimum time to cause collision). An instance of corresponding adversarial reward functions is , where is an arbitrary distance metric, and represents the cost of direct collisions for .

Induced collision: Unlike the former case, this objective aims to find an optimal navigation policy that manipulates the trajectory of such that it collides with any vehicle or object other than . A simple example of suitable adversarial reward function is , where determines whether has collided with another object, is the value of being in a collision, and captures how undesired direct collisions are for .

Trajectory manipulation: The aim of this objective is to exploit collision avoidance maneuvers [7] of to shift its natural trajectory towards an arbitrary alternative. An instance of corresponding reward function is , where is the position of at time if it were following the desired adversarial trajectory .
Our proposed framework allows for arbitrary modification and definition of reward functions corresponding to any one of the aforementioned objectives, as long as the reward functions satisfy the fundamental requirements of rationality [14]. It is noteworthy that not all objectives are necessarily reachable, and this framework provides a best-effort solution which is guaranteed to converge if the objective is reachable.
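To make the role of these reward functions concrete, the following Python sketch gives one possible shape for each of them. It is an illustration under stated assumptions, not a reproduction of the formulas in this section: the function names, argument names, signs, and weights are all hypothetical, the Euclidean distance stands in for the arbitrary distance metric, and the progress term in `av_reward` is a form often used in community TORCS implementations that may differ from Eq. (1).

```python
import math


def av_reward(v_x, angle, d_track, collided, collision_penalty=100.0):
    """Progress along the track minus a penalty on collision (assumed form).

    v_x: longitudinal velocity, angle: heading relative to the track axis
    (radians), d_track: distance from the track center.
    """
    progress = v_x * math.cos(angle) - v_x * abs(math.sin(angle)) - v_x * abs(d_track)
    return progress - (collision_penalty if collided else 0.0)


def direct_collision_reward(adv_pos, av_pos, collided_with_av, collision_weight=100.0):
    """Adversary is rewarded for closing the distance to the AV under test.

    collision_weight is positive here because the collision is the goal;
    this sign is an assumption made for the sketch.
    """
    bonus = collision_weight if collided_with_av else 0.0
    return -math.dist(adv_pos, av_pos) + bonus


def induced_collision_reward(av_hit_other, adv_hit_av, induced_value=100.0, direct_cost=50.0):
    """Reward the AV colliding with a third object; penalize direct contact."""
    gain = induced_value if av_hit_other else 0.0
    loss = direct_cost if adv_hit_av else 0.0
    return gain - loss


def trajectory_manipulation_reward(av_pos, target_pos):
    """Reward is highest (zero) when the AV sits on the adversary's desired path."""
    return -math.dist(av_pos, target_pos)


print(av_reward(10.0, 0.0, 0.0, collided=False))
print(direct_collision_reward((0.0, 0.0), (3.0, 4.0), collided_with_av=False))
print(induced_collision_reward(av_hit_other=True, adv_hit_av=True))
print(trajectory_manipulation_reward((1.0, 1.0), (1.0, 1.0)))
```

Under these assumptions, switching the benchmarking objective amounts to swapping the reward function passed to the adversarial learner, while the environment and the learning algorithm stay unchanged.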
2.3 Deep RL Model
While there are numerous deep RL algorithms with proven success [5], not all are suitable for learning autonomous navigation policies. Driving falls within the domain of continuous control, meaning that the actuation parameters, such as acceleration, brake, and steering, can take any value from a bounded range of real numbers. Many of the algorithms proposed for deep RL, like Deep Q-Networks (DQNs) [5], are designed for problems with discrete and low-dimensional action spaces. It is of course possible to discretize sets of continuous values, but the number of discrete actions then grows exponentially with the number of actuators, rendering discrete approaches such as DQN infeasible. This limitation has motivated various efforts towards the development of continuous deep RL algorithms, resulting in the recent proposal of techniques such as Deep Deterministic Policy Gradient (DDPG), variants of Trust Region Policy Optimization, and variants of Asynchronous Advantage Actor-Critic (A3C) [5].
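The cost of discretization can be checked with a short calculation. The sketch below assumes the three actuators named above (acceleration, brake, steering) and a uniform number of bins per actuator; the bin counts are arbitrary illustrative choices.

```python
def discrete_action_count(bins_per_actuator: int, n_actuators: int) -> int:
    """Number of joint actions when every actuator is split into equal bins."""
    return bins_per_actuator ** n_actuators


# Three actuators: acceleration, brake, steering.
for bins in (3, 11, 101):
    print(f"{bins} bins per actuator -> {discrete_action_count(bins, 3)} joint actions")
```

With 101 bins per actuator, a value-based method with one output per joint action would need more than a million outputs, which illustrates why continuous-action algorithms are preferred for this setting.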
Since the basis of deep RL is function approximation with deep neural networks, implementation of such algorithms requires suitable choices (i.e., tuning) of the model architecture and its hyperparameters. A common approach to this problem is to begin with an architecture and parameters from successful implementations of similar problems. For the purposes of this work, some options include those presented in [13]. An important consideration in choosing or tuning architectures is the type of inputs. Conventionally, if the state observations of the RL agent are pixel values of the environment, convolutional architectures are considered. If states are numerical vectors, multilayer perceptrons may prove most feasible, and if states are sequential data (e.g., time-domain samples), a recurrent architecture such as Long Short-Term Memory (LSTM) provides a good starting point for model tuning.
Another important aspect of deep RL models is the choice of exploration mechanism. Multiple studies (e.g., [15]) report that the classical ε-greedy mechanism fails to provide efficient performance in continuous and multi-action scenarios. Alternative approaches include parameter-space noise methods such as NoisyNet [15], and temporally correlated action noise generated by the Ornstein-Uhlenbeck process [13].
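As an illustration of the second alternative, the sketch below implements a discretized Ornstein-Uhlenbeck process using only the Python standard library and adds its samples to a scalar action. The class and function names are hypothetical, and the default `theta` and `sigma` values are common choices rather than settings taken from this paper.

```python
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x += theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Restart the process at its mean, typically at the start of an episode."""
        self.x = self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


def noisy_action(action, noise, low=-1.0, high=1.0):
    """Perturb a deterministic action and clip it to the actuator range."""
    return max(low, min(high, action + noise.sample()))


noise = OrnsteinUhlenbeckNoise(seed=0)
steering = [noisy_action(0.0, noise) for _ in range(5)]
print(steering)
```

Because each sample depends on the previous one, consecutive perturbations tend to push the vehicle in a consistent direction for several steps, which is often argued to suit physical control tasks better than independent noise at every step.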
To preserve the generality and flexibility of our solution, this framework allows for arbitrary choice of model type, architecture, and parameters. However, due to the stochastic nature of deep RL and exploration processes, different models may yield different results. Hence, to obtain valid comparisons, it is essential to use the same model architecture and configuration across the domain of comparison.
2.4 Metrics
We propose three classes of metrics for quantitative benchmarking of experiments: those obtained from the training process, those measured at test time, and countable parameters of initial environment configurations. One instance of training-time metrics is the number of iterations (alternatively, steps or episodes) to reach the threshold of convergence. Lower values of this measurement, averaged over repeated experiments, indicate that finding a strategy to jeopardize the safety of the algorithm under test requires fewer explorations into multidimensional observations, and may signal a weaker resilience to the simulated type of hazard. Examples of test-time metrics include time or distance to incur damage, for which lower values indicate weaker robustness of the algorithm. Another test-time metric is the level of damage incurred by each AV: higher damage to indicates higher safety risks, while higher damage to demonstrates the higher economic cost of intentional attacks, which in turn indicates lower safety risks for . The initial environment configuration may include the number, density, and initial arrangement of AVs that may impact the reliable behavior of the system. For instance, similar to the case of TCAS, the collision avoidance algorithm employed by an may fail to retain its safety criteria when the number of other vehicles in the environment is larger than a certain threshold. In this instance, the maximum number of agents that can safely coexist with can be noted as a measure of capacitive threshold.
Also, due to the stochastic nature of deep RL and some simulation platforms, repeated measurements are necessary to obtain valid results.
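A training-time metric of this kind could be computed as in the following Python sketch. The convergence criterion used here (a moving average of episode returns reaching a threshold) is an assumption made for illustration, since the text does not fix a specific criterion, and the function names and the toy return values are hypothetical.

```python
from statistics import mean


def episodes_to_convergence(returns, threshold, window=10):
    """1-based index of the first episode at which the moving average of the
    last `window` returns reaches `threshold`; None if it never does."""
    for end in range(window, len(returns) + 1):
        if mean(returns[end - window:end]) >= threshold:
            return end
    return None


def summarize_runs(runs, threshold, window=10):
    """Average the metric over repeated training runs, ignoring failed runs."""
    counts = [episodes_to_convergence(r, threshold, window) for r in runs]
    converged = [c for c in counts if c is not None]
    return {
        "total_runs": len(runs),
        "converged_runs": len(converged),
        "mean_episodes": mean(converged) if converged else None,
    }


# Toy per-episode adversarial returns from three repeated training runs.
runs = [
    [0, 1, 2, 3, 4, 5],
    [0, 0, 5, 5, 5, 5],
    [0, 0, 0, 0],
]
summary = summarize_runs(runs, threshold=3, window=2)
print(summary)
```

Reporting the number of runs that failed to converge alongside the mean keeps unreachable objectives visible instead of silently dropping them from the average.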
3 Case Study
To demonstrate the application of our framework in practice, we have compared the resilience of two ground AVs to adversarial agents trained for direct collisions. One agent, , is an AV equipped with motion planning and collision avoidance policies obtained from the deep RL approach proposed in [13]. (The only notable difference between the model of [13] and this implementation is that the state vector of the latter incorporates a vector of distances to opponents, in addition to the state parameters considered in the former. This alteration is due to the need for accurate positioning of other agents in the collision avoidance policy.) The other agent, , is the Olethros driver bot [11], which adopts a model-predictive control approach towards navigation and collision avoidance. Table I describes the configuration of the main components for this experiment. We have chosen TORCS as the simulation platform of this experiment, as it satisfies the necessary requirements noted in Section 2. A further advantage of TORCS is the availability of an OpenAI Gym interface to its environment, which eases its integration with the TensorFlow deep learning platform. We configured the environment to include multiple AVs, which include and , as well as a group of arbitrarily selected autonomous bots from the defaults of TORCS.
Table I
Component              Configuration
Environment            TORCS, multi-agent
Objective              Direct collision
Reward function
Deep RL model          DDPG, same architecture as [13]
Exploration mechanism  Ornstein-Uhlenbeck
Training-time metrics  Number of episodes to convergence
Test-time metrics      Number of (simulated) seconds to collision
Table II presents the training progress of the adversarial agent for both and . It can be seen that allows faster convergence (473 vs. 889 episodes), and has a lower return value than . It is noteworthy that higher values of minimum time to collision indicate greater resilience, while higher values of optimal return represent weaker robustness. Accordingly, these observations indicate that is more predictable and less complex than , while is more resilient but less robust compared to . Similarly, the test-time results presented in Table II indicate that allows collisions at a much sooner time than . Thus, we can conclude that in the selected environment, the motion planning and collision avoidance algorithm of is more resilient to direct collisions than that of .
Table II
Metric
Number of episodes to convergence  470      890
Optimal return                     13900    15400
Time to collision                  22.44 s  51.31 s
4 Summary and Future Directions
We have proposed a process flow and framework that utilize adversarial deep reinforcement learning to measure the reliability of motion planning and collision avoidance mechanisms in autonomous vehicles. We have established the advantages of this framework over current benchmarking schemes, which include flexibility and generality, adaptive probing by training adversarial policies against the particular system-under-test, sample-efficient and customizable exploration mechanisms, and provision of baseline (i.e., worst-case) measurements for benchmarking and comparison among heterogeneous systems.
The straightforward architecture of the proposed framework presents a number of potential avenues for further research. An immediate next step is to apply this framework to prominent and recently published techniques for motion planning and collision avoidance, with the aim of creating reference benchmarks for use in relevant research projects. Another promising avenue is to check the applicability of recent publications (e.g., [16]) that claim training under adversarial perturbations can enhance the resilience and robustness of the policy. Hence, a potential mitigation and defense technique may emerge from investigating the joint training of the adversarial policy and reinforcement learning models of collision avoidance.
References
 [1] S. A. Bagloee, M. Tavana, M. Asadi, and T. Oliver, “Autonomous vehicles: challenges, opportunities, and future implications for transportation policies,” Journal of Modern Transportation, vol. 24, no. 4, pp. 284–303, 2016.
 [2] P. Koopman and M. Wagner, “Autonomous vehicle safety: An interdisciplinary challenge,” IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 90–96, 2017.
 [3] M. Mohanan and A. Salgoankar, “A survey of robotic motion planning in dynamic environments,” Robotics and Autonomous Systems, vol. 100, pp. 171–185, 2018.
 [4] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, p. 21, 2017.
 [5] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey of deep reinforcement learning,” arXiv preprint arXiv:1708.05866, 2017.
 [6] T. Jun, M. A. Piera, and S. Ruiz, “A causal model to explore the acas induced collisions,” Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering, vol. 228, no. 10, pp. 1735–1748, 2014.
 [7] P. Pierpaoli, M. Egerstedt, and A. Rahmani, “Altering uav flight path by threatening collision,” in Digital Avionics Systems Conference (DASC), 2015 IEEE/AIAA 34th. IEEE, 2015, pp. 4A4–1.
 [8] M. Moll, I. A. Sucan, and L. E. Kavraki, “Benchmarking motion planning algorithms: An extensible infrastructure for analysis and visualization,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 96–102, 2015.
 [9] D. Calisi, L. Iocchi, and D. Nardi, “A unified benchmark framework for autonomous mobile robots and vehicles motion algorithms (movema benchmarks),” in Workshop on experimental methodology and benchmarking in robotics research (RSS 2008), 2008.
 [10] Z. Zhou, J. Ding, H. Huang, R. Takei, and C. Tomlin, “Efficient path planning algorithms in reach-avoid problems,” Automatica, vol. 89, pp. 28–36, 2018.
 [11] D. Loiacono, L. Cardamone, and P. L. Lanzi, “Simulated car racing championship: Competition software manual,” arXiv preprint arXiv:1304.1672, 2013.
 [12] A. Dosovitskiy, G. Ros, F. Codevilla, A. López, and V. Koltun, “Carla: An open urban driving simulator,” arXiv preprint arXiv:1711.03938, 2017.
 [13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [14] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
 [15] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin et al., “Noisy networks for exploration,” arXiv preprint arXiv:1706.10295, 2017.
 [16] A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese, “Adversarially robust policy learning: Active construction of physically-plausible perturbations,” in IEEE International Conference on Intelligent Robots and Systems (to appear), 2017.