Training Adversarial Agents to Exploit Weaknesses in Deep Control Policies

02/27/2020 ∙ by Sampo Kuutti, et al. ∙ University of Surrey

Deep learning has become an increasingly common technique for various control problems, such as robotic arm manipulation, robot navigation, and autonomous vehicles. However, the downside of using deep neural networks to learn control policies is their opaque nature and the difficulties of validating their safety. As the networks used to obtain state-of-the-art results become increasingly deep and complex, the rules they have learned and how they operate become more challenging to understand. This presents an issue, since in safety-critical applications the safety of the control policy must be ensured to a high confidence level. In this paper, we propose an automated black box testing framework based on adversarial reinforcement learning. The technique uses an adversarial agent, whose goal is to degrade the performance of the target model under test. We test the approach on an autonomous vehicle problem, by training an adversarial reinforcement learning agent which aims to cause a deep neural network-driven autonomous vehicle to collide. Two neural networks trained for autonomous driving are evaluated, and the results from the testing are used to compare the robustness of their learned control policies. We show that the proposed framework is able to find weaknesses in both control policies that were not evident during online testing, and therefore demonstrates a significant benefit over manual testing methods.


I Introduction

The rise of deep learning has resulted in rapid progress in many fields, with state-of-the-art results obtained in areas such as image classification, sound recognition, and language processing [21, 14, 37]. The strong capability of Deep Neural Networks (DNNs) for modelling highly non-linear and complex functions has resulted in the adoption of DNNs in many control problems. Important results in control applications such as robotic arm manipulation, robot navigation, and autonomous vehicle control have been achieved through deep learning [26, 12, 25, 47, 22, 4, 8]. However, in safety-critical applications, the safety of the control policy must be fully guaranteed before it is commercially deployable. This presents a significant obstacle to the deployment of DNN-based control policies in safety-critical applications such as autonomous driving [43, 5]. As the operational environment of the system becomes increasingly complex, it becomes infeasible to test the control policy in all possible scenarios it may encounter [6, 17, 44, 10]. Therefore, methods for testing and understanding the safety of these opaque systems are necessary [18, 42, 7, 1]. Moreover, in tasks such as autonomous driving, testing the system in a naturalistic driving environment means that edge cases, where collisions are more likely to occur, are seen only rarely [9]. Therefore, by using an adversarial agent whose aim is to deliberately create these edge case scenarios, better insights into possible failure cases can be obtained with reduced training times.

The concept of utilising an adversarial agent to disturb a machine learning agent has been suggested previously, for example, by Morimoto & Doya [28], who used an actor-disturber-critic method, where the disturber aimed to find the worst disturbance to reduce the performance of a controller. This was used in the training loop of a reinforcement learning agent to improve the robustness of the control policy to disturbances, and was demonstrated in an inverted pendulum task. The framework was extended to use DNNs for estimating the control policy and disturbances in a deep reinforcement learning framework by Pinto et al. [30], and was demonstrated successfully in a robotic manipulation task. For autonomous vehicles, the idea of learning to automatically find failure cases was suggested as early as 1992, by Schultz et al. [35], who used genetic algorithms to find test cases that exposed weaknesses in autonomous aerial vehicle controllers. The results suggested this could be an effective alternative to manual testing of complex software controllers. In more recent work, Behzadan & Munir [2] demonstrated that a reinforcement learning agent could be trained to create collisions with other road vehicles, by training an adversarial agent against two target policies, a DNN and a rule-based system. The number of episodes to convergence and the minimum time-to-collision were then used to argue that the DNN was the safer control policy. However, with no constraints on the adversarial agent, it is likely to learn a behaviour unlike any human driver, which could limit insights into plausible collision cases that might happen if the DNN control policies were deployed in the real world. For instance, in the examples shown by Behzadan & Munir [2], the adversarial agent approached the target vehicle from the rear at high velocity, making collision avoidance extremely difficult. Moreover, this type of collision does not necessarily represent a vulnerability in the control policy under test, as the adversarial agent would be considered at fault in a real world collision [36]. Perhaps the closest work to our research is Adaptive Stress Testing (AST) by Koren et al. [19]. AST aims to find the most likely collision cases for an autonomous vehicle by manipulating the actions of pedestrians in the simulation environment and the noise in the observations of the control policy under test. However, this approach has several weaknesses which limit the insight it can offer into the vulnerabilities of the autonomous system under test. For example, in the majority of the collisions found, the blame for the collision would fall on the pedestrians controlled by AST. Furthermore, the AST framework was only evaluated on a simple rule-based vehicle following system. Instead, in our approach there are constraints on the behaviour of the adversarial agent to maintain plausible driving trajectories, and the focus is to find vulnerabilities which lead to collisions where the autonomous vehicle being tested is at fault. Moreover, the observations of the system under test are not manipulated in any way; therefore, all collision cases found by the proposed framework demonstrate a vulnerability in the learned deep control policy.

In this paper, we propose a technique for targeted black box testing, using a reinforcement learning algorithm to find the test scenarios which are most likely to cause the black box control policy to fail. The proposed system has no knowledge of the internal mechanisms of the control policy under test, but instead learns a behaviour which finds failure cases for the control policy. In this way, the powerful function approximation capabilities of DNNs are used to find the weaknesses in other DNNs, and the testing procedure can therefore be fully automated. The proposed framework is tested in an autonomous driving problem, where the Adversarial Reinforcement Learning (ARL) agent attempts to cause a vehicle following model to crash. Note that our approach is distinct from work on adversarial attacks [39, 11, 29], as we are not manipulating the inputs to the target DNN; instead, we place another agent in the same environment, which aims to deliberately cause the target control policy to fail. Similarly, our approach is distinct from research into adversarial robustness [46, 41, 33], as we do not aim to train the model to be robust to adversarial examples; instead, we aim to leverage the adversarial agent to find failure cases in the target models more reliably than manual testing methods can, and to understand the weaknesses present in the deep control policies.

The remainder of this paper is structured as follows. Section II presents the necessary background, methodology, and general framework behind ARL. The simulation results of the vehicle following use case are presented in Section III. Finally, concluding remarks are given in Section IV.

II Methodology

II-A Markov Decision Processes

Reinforcement learning allows an agent to learn through interaction with its environment. Reinforcement learning can be formally described by a Markov Decision Process (MDP), denoted by a tuple {S, A, P, R}, where S represents the state space, A represents the action space, P denotes the state transition probability model, and R is the reward function. At each time step t, the agent observes state s_t and takes an action a_t, according to its policy π, causing the environment to transition to the next state according to the transition dynamics given by the transition probability model P. The agent then receives a reward r_t, according to the reward function R, and observes the new state of the environment s_{t+1}. The network parameters are then updated, such that the expected future rewards are maximised. As the agent interacts with the environment, it learns through trial-and-error a state-action mapping for an optimal policy π*, which maximises the discounted sum of rewards over time given by the returns R_t:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}    (1)

where γ is the discount factor used to prioritise immediate rewards over future rewards. This exploration of the operational environment can therefore be leveraged to explore potential weaknesses in black box systems.
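As a concrete illustration of Eq. (1), the sketch below computes the discounted return for every step of a finite episode; the function name and the example reward sequence are illustrative only, not part of the paper.

```python
# Minimal sketch: discounted return R_t from Eq. (1), computed over a finite
# episode (the infinite sum truncates when the episode terminates).
def discounted_returns(rewards, gamma=0.99):
    """Return R_t = sum_k gamma^k * r_{t+k} for every time step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # accumulate from the end of the episode
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: a hypothetical three-step episode ending in a large terminal reward
print(discounted_returns([1.0, 0.0, 100.0], gamma=0.99))
```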

II-B Reinforcement Learning

In our framework, the algorithm used to train the adversarial agent is Advantage Actor Critic (A2C) [27], which uses an actor-critic network architecture, as shown in Fig. 1. The actor network estimates the optimal policy function π(a_t | s_t), which aims to maximise the expected rewards. Meanwhile, the critic network estimates the value of being in a given state, with the value function V(s_t). The weights of both networks are then updated based on the advantage function A(s_t, a_t):

A(s_t, a_t) = Q(s_t, a_t) - V(s_t)    (2)

V(s_t) = \mathbb{E}[R_t \mid s_t]    (3)

Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]    (4)

where \mathbb{E} denotes expectation, V(s_t) is the value function, and Q(s_t, a_t) is the quality function estimating the value of each action for a given state [3, 38].

Fig. 1: An actor-critic network architecture. The dashed lines represent network updates [23].
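Since A2C does not learn the quality function Q explicitly, the advantage in Eq. (2) is in practice approximated from the sampled return and the critic's value estimate. The sketch below shows one such approximation; the function name and the use of R_t − V(s_t) as the estimator are assumptions consistent with standard A2C practice, not the authors' code.

```python
import torch

def advantage_estimate(returns, values):
    """Approximate A(s_t, a_t) = Q(s_t, a_t) - V(s_t) with R_t - V(s_t),
    using the sampled return R_t as a one-sample estimate of Q(s_t, a_t)."""
    # detach() so the advantage is treated as a constant in the policy update
    return returns - values.detach()

# Example with dummy return and value tensors
adv = advantage_estimate(torch.tensor([4.0, 2.5]), torch.tensor([3.0, 3.0]))
print(adv)
```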

The network architectures for both networks are as follows. The actor network has 3 fully-connected layers, followed by a Long Short-Term Memory (LSTM) [15] layer, which is fully connected to the output layer. The actor network estimates the stochastic control policy with two outputs, the mean value μ and the estimated variance σ², which are used to generate a Gaussian distribution N(μ, σ²) from which the action is sampled, such that a_t ~ N(μ, σ²). Meanwhile, the critic network uses only 2 fully-connected layers followed by the output layer to estimate the value function V(s_t). All hidden neurons use a ReLU-6 activation [20], whilst the μ output uses a tanh activation, the σ² output uses a softplus activation, and the value estimate has a linear activation.
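A possible PyTorch realisation of the described actor and critic architectures is sketched below. It follows the layer sizes reported in Table I, but the class names, the 4-dimensional observation input, and the framework choice are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Sketch of the described actor: 3 fully-connected layers, an LSTM layer,
    and Gaussian policy heads (tanh mean, softplus variance)."""
    def __init__(self, obs_dim=4, hidden=50, lstm_units=16):
        super().__init__()
        self.fc = nn.ModuleList([nn.Linear(obs_dim, hidden),
                                 nn.Linear(hidden, hidden),
                                 nn.Linear(hidden, hidden)])
        self.lstm = nn.LSTM(hidden, lstm_units, batch_first=True)
        self.mu_head = nn.Linear(lstm_units, 1)
        self.sigma_head = nn.Linear(lstm_units, 1)

    def forward(self, obs_seq, hidden_state=None):
        x = obs_seq                                # shape: (batch, seq, obs_dim)
        for layer in self.fc:
            x = F.relu6(layer(x))                  # ReLU-6 hidden activations
        x, hidden_state = self.lstm(x, hidden_state)
        mu = torch.tanh(self.mu_head(x))           # mean of the Gaussian policy
        sigma_sq = F.softplus(self.sigma_head(x))  # estimated variance
        return mu, sigma_sq, hidden_state

class Critic(nn.Module):
    """Sketch of the described critic: 2 fully-connected layers and a linear
    value output."""
    def __init__(self, obs_dim=4, hidden=50):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        x = F.relu6(self.fc1(obs))
        x = F.relu6(self.fc2(x))
        return self.value_head(x)                  # linear value estimate
```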

A2C training is formulated as in [23], by updating the actor and critic networks in separate update steps, using policy loss and value loss functions, respectively, as given by:

L_p = -\log \pi(a_t \mid s_t) A(s_t, a_t) - \beta H_t    (5)

L_v = (R_t - V(s_t))^2    (6)

where β is the entropy coefficient and H_t is the entropy added to encourage exploration in the policy, calculated for the Gaussian policy as

H_t = \frac{1}{2}\left(\log(2\pi\sigma^2) + 1\right)    (7)
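The sketch below shows how the loss terms in Eqs. (5)-(7) could be computed in PyTorch for a one-dimensional Gaussian policy; the function and argument names are illustrative, and the use of R_t − V(s_t) as the advantage estimate is an assumption consistent with the formulation above.

```python
import math
import torch

def a2c_losses(mu, sigma_sq, actions, returns, values, beta=1e-4):
    """Sketch of Eqs. (5)-(7): policy loss with entropy bonus, and value loss.
    Assumes a 1-D Gaussian policy with mean mu and variance sigma_sq."""
    dist = torch.distributions.Normal(mu, torch.sqrt(sigma_sq))
    log_prob = dist.log_prob(actions)
    advantage = returns - values.detach()                    # A(s_t, a_t) ~= R_t - V(s_t)
    entropy = 0.5 * (torch.log(2 * math.pi * sigma_sq) + 1)  # Eq. (7)
    policy_loss = (-log_prob * advantage - beta * entropy).mean()  # Eq. (5)
    value_loss = (returns - values).pow(2).mean()                  # Eq. (6)
    return policy_loss, value_loss
```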

Both networks are updated using the RMSProp optimiser [40] during training, using their respective loss functions. The final hyperparameters of the network architecture are shown in Table I.

Parameter Value
No. hidden layers (actor) 3
No. neurons per hidden layer (actor) 50
No. of LSTM units (actor) 16
No. hidden layers (critic) 2
No. neurons per hidden layer (critic) 50
Learning rate (actor) 1×10⁻⁴
Learning rate (critic) 1×10⁻²
Discount factor, γ 0.99
Entropy coefficient, β 1×10⁻⁴
RMSProp epsilon 1×10⁻¹⁰
RMSProp decay 0.9
RMSProp momentum 0.0
TABLE I: Final network hyperparameters.
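A minimal sketch of the optimiser setup implied by Table I is given below: separate RMSProp optimisers for the actor and critic with their respective learning rates. The two nn.Linear modules are placeholders standing in for the actor and critic networks, and mapping the table's decay and 1×10⁻¹⁰ values to RMSProp's alpha and epsilon terms is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actor and critic networks
actor_params = nn.Linear(4, 2).parameters()
critic_params = nn.Linear(4, 1).parameters()

# Separate RMSProp optimisers with the Table I learning rates, decay (alpha),
# epsilon, and momentum values.
actor_opt = torch.optim.RMSprop(actor_params, lr=1e-4,
                                alpha=0.9, eps=1e-10, momentum=0.0)
critic_opt = torch.optim.RMSprop(critic_params, lr=1e-2,
                                 alpha=0.9, eps=1e-10, momentum=0.0)
```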

II-C Training Environment

The autonomous driving simulation was defined as a vehicle following scenario in highway driving. Two vehicles are driving at highway speeds on a straight road. The follower is a DNN trained to follow a leading vehicle at a safe distance, whilst the lead vehicle is the adversarial agent whose aim is to find weaknesses in the follower's control policy. In order to do this, the adversarial agent must create collisions, thus proving the follower's control policy is unsafe. For this scenario, the inputs to the ARL network are the follower vehicle velocity v_f, the follower vehicle acceleration a_f, the relative velocity to the follower v_rel, and the time headway between the two vehicles t_h, such that the observation is [v_f, a_f, v_rel, t_h]. The output of the network is the lead vehicle acceleration for the next time step. The simulation time steps are fixed at 40 ms.
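The sketch below illustrates the adversarial agent's interface with the simulation described above; the helper names are hypothetical, and the simple Euler velocity update stands in for the full CarMaker vehicle dynamics.

```python
import numpy as np

DT = 0.04  # 40 ms simulation time step

def build_observation(v_f, a_f, v_rel, t_h):
    """Assemble the ARL agent's observation [v_f, a_f, v_rel, t_h]."""
    return np.array([v_f, a_f, v_rel, t_h], dtype=np.float32)

def step_lead_vehicle(v_lead, a_lead_cmd):
    """Integrate the lead vehicle velocity for one 40 ms step, given the
    acceleration command output by the ARL network (Euler sketch only,
    not the CarMaker dynamics)."""
    return v_lead + a_lead_cmd * DT
```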

We demonstrate this framework by attacking two previously published DNN models trained for vehicle following using the IPG CarMaker simulator [16]: (1) a Reinforcement Learning (RL) model [23] and (2) an Imitation Learning (IL) model [24]. The RL model uses a feedforward network with an LSTM layer to control the longitudinal actions of the vehicle, whilst the IL model uses a simple feedforward network to control the longitudinal actions of the vehicle; both act on observations of the host and lead vehicle states. Both models aim to maintain a 2 s time headway from the lead vehicle. The time headway is a measure of intervehicular distance in time, given as follows:

t_h = \frac{x_{rel}}{v_f}    (8)

where x_rel is the relative distance between the two vehicles in m, and v_f is the velocity of the following vehicle in m/s.
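A direct implementation of Eq. (8) is shown below; the small epsilon guard against a stationary follower is an added safeguard not stated in the paper.

```python
def time_headway(x_rel, v_f, eps=1e-6):
    """Time headway t_h = x_rel / v_f (Eq. (8)), with a small epsilon guard
    against division by zero when the follower is stationary."""
    return x_rel / max(v_f, eps)

# Example: a 40 m gap at 25 m/s gives a 1.6 s headway
assert abs(time_headway(40.0, 25.0) - 1.6) < 1e-9
```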

The training was broken down into 5-minute episodes, where the episode ends after the 5 minutes have passed or a collision occurs. At the start of each episode, a road friction coefficient in {0.4, 0.425, … , 1.0} was randomly chosen. It should be noted that a collision may be easier to cause in low friction conditions, as the time available for the follower vehicle to react is reduced [31]; however, none of the agents can observe the road friction coefficient and should therefore learn a policy which generalises to different road conditions. The reward function for training the ARL agent was given based on the time headway:

r_t = \min\left(\frac{1}{t_h}, 100\right)    (9)

Thus, the reward function rewards low time headways, encouraging collisions to occur. The reward is capped at 100, as otherwise it would tend towards infinity as the time headway approaches zero.
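A sketch of this reward is given below; the reciprocal-headway form follows the reconstructed Eq. (9), i.e. a reward that grows as the headway shrinks and is capped at 100.

```python
def adversarial_reward(t_h, cap=100.0):
    """Reward the adversarial agent for low time headways, capped at 100 so the
    reward does not tend to infinity as t_h approaches zero (Eq. (9))."""
    return min(1.0 / t_h, cap) if t_h > 0 else cap
```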

The velocity and the acceleration of the lead vehicle were limited to ensure that the vehicle behaviour remains plausible and the velocity is in the highway driving range, as well as to obtain insights into the effect of the driving speeds on the robustness of the vehicle following models. The acceleration was always limited to [-6, 2] m/s², whilst four velocity ranges were tested: [17, 30], [12, 35], [12, 30], and [17, 35] m/s. For each velocity constraint and vehicle follower model combination, 5 training runs of 2,500 episodes were completed.
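The paper does not specify how these limits are enforced; the sketch below assumes the acceleration command is simply clipped to the acceleration range and then further adjusted so the resulting velocity stays within the chosen range.

```python
import numpy as np

def constrain_action(a_cmd, v_lead, v_range=(17.0, 30.0),
                     a_range=(-6.0, 2.0), dt=0.04):
    """Clip the adversarial acceleration command so the lead vehicle stays
    within plausible acceleration and velocity limits (assumed mechanism)."""
    a = float(np.clip(a_cmd, *a_range))
    v_next = v_lead + a * dt
    if v_next < v_range[0]:            # would drop below the minimum velocity
        a = (v_range[0] - v_lead) / dt
    elif v_next > v_range[1]:          # would exceed the maximum velocity
        a = (v_range[1] - v_lead) / dt
    return a
```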

III Simulation Results

III-A Results

The average number of collisions and episodes until first collision for each velocity range and vehicle follower model can be seen in Tables II and III, respectively. In initial testing, the lead vehicle was limited to [17, 30] m/s. Since the vehicle following models were trained in this velocity range, this tests their robustness within their training domain. The ARL agent was then trained for 2,500 episodes against both models, with the results shown in Fig. 2(a). The results demonstrate that the IL model is susceptible to an adversarial agent, and thus the ARL agent can cause collisions to occur. On the other hand, the RL model has zero collisions with the ARL agent, and as can be seen from Fig. 2(a), the minimum time headway in its episodes remains near the target headway of 2 s. This shows a significant benefit of the RL model over the IL one in terms of robustness to an adversarial agent. The second set of experiments, shown in Fig. 2(b), relaxed the velocity constraints on the lead vehicle to [12, 35] m/s, increasing the maximum velocity and decreasing the minimum velocity. This velocity range extends outside the distribution the vehicle following models experienced during training, and therefore also tests model generalisation capability. From the results, it can be seen that both models are more susceptible to an attack in this domain, but the RL model nevertheless still demonstrates significant safety benefits over the IL model. The last two velocity ranges tested were [12, 30] and [17, 35] m/s, relaxing only the minimum and only the maximum lead vehicle velocity constraint, respectively. The results can be seen in Fig. 2(c) and (d). Comparing these two sets of experiments, it can be seen that relaxing the minimum velocity and allowing the lead vehicle to drive at lower speeds enables it to find collision cases more easily. In both cases, collision cases against the IL model are found. However, the results in Tables II and III show that the ARL agent is able to exploit the IL model significantly more often and earlier in its training. On the other hand, the RL model only collides in the higher maximum velocity experiments, although this occurs relatively rarely and only at the very end of the ARL agent's training phase.

v_lead Imitation Learning Reinforcement Learning
[17, 30] m/s 486.6 0.0
[12, 35] m/s 644.0 2.0
[12, 30] m/s 799.6 0.0
[17, 35] m/s 315.2 1.2
TABLE II: Average number of collisions for different lead vehicle velocity constraints. Averaged over 5 training runs of 2,500 episodes each.
v_lead Imitation Learning Reinforcement Learning
[17, 30] m/s 563.2 0.0
[12, 35] m/s 579.2 922.3
[12, 30] m/s 245.3 0.0
[17, 35] m/s 1030.8 2451.0
TABLE III: Average number of episodes until first collision found for different lead vehicle velocity constraints. Averaged over 5 training runs of 2,500 episodes each.
Fig. 2: Comparison of the two vehicle following agents' minimum time headway t_h per episode over training runs, for lead vehicle velocity constraints of (a) [17, 30] m/s, (b) [12, 35] m/s, (c) [12, 30] m/s, and (d) [17, 35] m/s. Averaged over 5 runs, with standard deviation shown in shaded colour.

Further investigation into the type of behaviour the ARL agent adopted during training revealed that, within a single training run, the ARL tends to converge to a single type of behaviour that leads to collisions, and these behaviours can vary significantly between different training runs. While some differences in the converged behaviour of the agent can be expected due to the variance in reinforcement learning [13, 34, 32, 45], these results show significant differences between different trained agents. For instance, example collision scenarios are shown in Fig. 3, where 2 collisions from 1 training run are shown in the top subfigures, whilst 2 collisions from another training run are shown in the bottom subfigures. For consistency, both training runs are attacking the IL model, with the same velocity constraints. As can be seen in the first two plots, the ARL agent has adopted a strategy in which it continuously accelerates and decelerates between high and low velocities, until the follower vehicle comes close to it with a high acceleration rate, at which point the lead vehicle decelerates at maximum deceleration. Meanwhile, in plots (c) and (d), the ARL agent has adopted a strategy in which it first decelerates to a low velocity, and once both vehicles are at low velocities it accelerates back to the maximum velocity, then waits until the following vehicle is approaching it at high acceleration, when it finally decelerates and creates a collision. These results reveal a flaw in the IL model, where it continues to accelerate when the time headway is above the 2 s target, trying to reach the target of 2 s, even if the lead vehicle is decelerating and there is a large relative velocity difference between the vehicles. Finding different collision modes is beneficial, as it offers further insight into the different vulnerabilities present in the control policy. Therefore, by exploiting information from multiple training runs where the ARL agent uses different collision modes, valuable insight into the weaknesses of the DNN under test can be obtained.

Fig. 3: Comparison of collision scenarios between training runs; (a) and (b) are from training run 1, whilst (c) and (d) are from training run 2. Both training runs use velocity constraints of [17, 30] m/s and the IL model as the vehicle follower.

III-B Discussion

The overall testing completed accounts for a total of 100,000 episodes, or over 8,000 simulated hours of testing. This resulted in a total of 11,243 collision cases found, comprising 11,227 and 16 for the IL and RL models, respectively. This clearly demonstrates the significantly higher robustness of the RL model to the presence of an adversarial agent. Moreover, these results demonstrate that the proposed ARL framework is able to find failure cases for both control policies under test. Compared to the type of manual test case definition often used for vehicle safety testing, this can be highly beneficial for testing complex black box control systems. For instance, both control policies tested here were tested for 10 hours of simulated vehicle following in their original works, where the lead vehicle also drove at highway speeds. In this manual testing, the types of trajectories executed by the lead vehicle were manually defined (including both naturalistic driving and emergency manoeuvres), with the parameters (e.g. maximum velocity, acceleration, time to execute the manoeuvre, etc.) randomised during testing. The velocity, acceleration, and road friction constraints used in the manual test case definition represented similar driving conditions to those in the adversarial testing framework presented here. The results for these driving tests are shown in Table IV and show that during normal testing not a single collision was found. This demonstrates how effective our ARL framework is at finding weaknesses in DNN-based control policies. Indeed, the results from the manual testing would suggest the IL model to be the safer control policy. However, our testing framework exposes significant vulnerabilities in the IL model, demonstrating that the RL control policy is significantly more robust to the presence of an adversarial agent.

Parameter Imitation Learning Reinforcement Learning
min. x_rel 23.844 m 7.780 m
mean x_rel 57.37 m 58.01 m
max. v_rel 8.878 m/s 7.891 m/s
mean v_rel 0.0197 m/s 0.0289 m/s
min. t_h 1.738 s 1.114 s
mean t_h 1.990 s 2.007 s
collisions 0 0
TABLE IV: 10-hour driving test with manually defined lead vehicle trajectories.

IV Concluding Remarks

In this paper, an automated testing framework for deep neural networks was presented. The proposed framework is based on adversarial reinforcement learning, where an adversarial agent is placed in the same environment as the system under test. By training the adversarial agent through reinforcement learning, the agent learns behaviours which degrade the performance of the target system. This general concept could be used to analyse vulnerabilities in control policies used in multi-agent environments, such as robotic manipulation or unmanned aerial vehicles. In our work, the ARL approach was tested in an autonomous vehicle use case, where the aim of the ARL agent was to cause the vehicle behind it to collide into it. Two neural network models trained for vehicle following were tested, one trained with imitation learning and the other with reinforcement learning. Neither model had any collisions when manually tested in its original work. The ARL agent was shown to be able to learn a driving behaviour which can cause both target models to collide into the lead vehicle, which in itself demonstrates the significant benefit of this type of targeted adversarial black box testing. Furthermore, the results showed that the reinforcement learning model is significantly more robust to this kind of adversarial behaviour, demonstrating its safety benefit over the imitation learning model. This type of adversarial testing framework provides an important technique for testing black box control policies, and can be used to benchmark and compare deep control policies as well as to gain additional insights into the types of edge cases in which the policies are likely to fail.

Acknowledgment

This work was funded by the EPSRC under grant agreement EP/R512217/1 and the Innovate UK Autonomous Valet Parking Project (Grant No. 104273). We would also like to thank NVIDIA Corporation for their GPU grant.

References

  • [1] A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160. Cited by: §I.
  • [2] V. Behzadan and A. Munir (2018) Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles. arXiv preprint arXiv:1806.01368. Cited by: §I.
  • [3] S. Bhatnagar, M. Ghavamzadeh, M. Lee, and R. S. Sutton (2008) Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems (NIPS), pp. 105–112. Cited by: §II-B.
  • [4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I.
  • [5] M. Borg, C. Englund, K. Wnuk, B. Duran, C. Levandowski, S. Gao, Y. Tan, H. Kaijser, H. Lönn, and J. Törnqvist (2018) Safely entering the deep: a review of verification and validation for machine learning and a challenge elicitation in the automotive industry. arXiv preprint arXiv:1812.05389. Cited by: §I.
  • [6] S. Burton, L. Gauerhof, and C. Heinzemann (2017) Making the case for safety of machine learning in highly automated driving. In International Conference on Computer Safety, Reliability, and Security, pp. 5–16. Cited by: §I.
  • [7] D. Castelvecchi (2016) Can we open the black box of AI?. Nature News 538 (7623), pp. 20. Cited by: §I.
  • [8] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §I.
  • [9] F. Codevilla, E. Santana, A. M. López, and A. Gaidon (2019) Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9329–9338. Cited by: §I.
  • [10] E. Coelingh, J. Nilsson, and J. Buffum (2018) Driving tests for self-driving cars. IEEE Spectrum 55 (3), pp. 40–45. Cited by: §I.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §I.
  • [12] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. Cited by: §I.
  • [13] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §III-A.
  • [14] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §I.
  • [15] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-B.
  • [16] IPG Automotive GmbH (2017) CarMaker: virtual testing of automobiles and light-duty vehicles. External Links: Link Cited by: §II-C.
  • [17] N. Kalra and S. M. Paddock (2016) Driving to safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability?. Transportation Research Part A: Policy and Practice 94, pp. 182–193. Cited by: §I.
  • [18] P. Koopman and M. Wagner (2016) Challenges in autonomous vehicle testing and validation. SAE International Journal of Transportation Safety 4 (1), pp. 15–24. Cited by: §I.
  • [19] M. Koren, S. Alsaif, R. Lee, and M. J. Kochenderfer (2018) Adaptive stress testing for autonomous vehicles. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1–7. Cited by: §I.
  • [20] A. Krizhevsky and G. Hinton (2010) Convolutional deep belief networks on CIFAR-10. Unpublished manuscript 40 (7). Cited by: §II-B.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105. Cited by: §I.
  • [22] S. Kuutti, R. Bowden, Y. Jin, P. Barber, and S. Fallah (2020) A survey of deep learning applications to autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I.
  • [23] S. Kuutti, R. Bowden, H. Joshi, R. de Temple, and S. Fallah (2019) End-to-end reinforcement learning for autonomous longitudinal control using advantage actor critic with temporal context. In 2019 IEEE 22nd Intelligent Transportation Systems Conference (ITSC), pp. 2456–2462. Cited by: Fig. 1, §II-B, §II-C.
  • [24] S. Kuutti, R. Bowden, H. Joshi, R. de Temple, and S. Fallah (2019) Safe deep neural network-driven autonomous vehicles using software safety cages. In 2019 20th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pp. 150–160. Cited by: §II-C.
  • [25] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2019) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. Cited by: §I.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §I.
  • [27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1928–1937. Cited by: §II-B.
  • [28] J. Morimoto and K. Doya (2005) Robust reinforcement learning. Neural computation 17 (2), pp. 335–359. Cited by: §I.
  • [29] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §I.
  • [30] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2817–2826. Cited by: §I.
  • [31] K. Reif (2014) Brakes, brake control and driver assistance systems. Springer Vieweg, Wiesbaden, Germany. Cited by: §II-C.
  • [32] J. Romoff, P. Henderson, A. Piché, V. Francois-Lavet, and J. Pineau (2018) Reward estimation for variance reduction in deep reinforcement learning. arXiv preprint arXiv:1805.03359. Cited by: §III-A.
  • [33] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems (NIPS), pp. 5014–5026. Cited by: §I.
  • [34] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §III-A.
  • [35] A. C. Schultz, J. J. Grefenstette, and K. A. De Jong (1992) Adaptive testing of controllers for autonomous vehicles. In Proceedings of the 1992 Symposium on autonomous underwater vehicle technology, pp. 158–164. Cited by: §I.
  • [36] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2017) On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374. Cited by: §I.
  • [37] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112. Cited by: §I.
  • [38] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 135, MIT press Cambridge. Cited by: §II-B.
  • [39] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §I.
  • [40] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §II-B.
  • [41] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: §I.
  • [42] P. Van Wesel and A. E. Goodloe (2017) Challenges in the verification of reinforcement learning algorithms. Technical report, NASA. Cited by: §I.
  • [43] K. R. Varshney and H. Alemzadeh (2017) On the safety of machine learning: cyber-physical systems, decision sciences, and data products. Big data 5 (3), pp. 246–255. Cited by: §I.
  • [44] W. Wachenfeld and H. Winner (2017) The new role of road testing for the safety validation of automated vehicles. In Automated Driving, pp. 419–435. Cited by: §I.
  • [45] L. Weaver and N. Tao (2001) The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 538–545. Cited by: §III-A.
  • [46] K. Y. Xiao, V. Tjeng, N. M. Shafiullah, and A. Madry (2018) Training for faster adversarial robustness verification via inducing relu stability. arXiv preprint arXiv:1809.03008. Cited by: §I.
  • [47] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364. Cited by: §I.