1 Introduction
How can we ensure machine learning systems do not make catastrophic mistakes? While machine learning systems have shown impressive results across a variety of domains (Krizhevsky et al., 2012; Mnih et al., 2015; Silver et al., 2017), they may also fail badly on particular inputs, often in unexpected ways (Szegedy et al., 2013). As we start deploying these systems, it is important that we can reliably evaluate the risk of failure. This is particularly important for safety-critical domains like autonomous driving, where the negative consequences of a single mistake can overwhelm the positive benefits accrued during typical operation of the system.

Limitations of random testing. The key problem we highlight is that for standard statistical evaluation, attaining confidence that the failure rate of a policy is below $\epsilon$ requires at least $\Omega(1/\epsilon)$ episodes. We now informally summarize this point, which we discuss further in Appendix A. For concreteness, consider a self-driving car company that decides that the cost of a single accident where the car is at fault outweighs the benefits of 100 million miles of faultless operation. The standard approach in machine learning is to estimate the expected return via i.i.d. samples from the data distribution (frequently a test set). For tightly bounded returns, the sample estimate is guaranteed to quickly converge to the true expectation. However, with catastrophic failures, this may be prohibitively inefficient. In our current example, any policy with a failure probability greater than $10^{-8}$ per mile has negative expected return. In other words, it would be better not to deploy the car. However, to achieve reasonable confidence that the car crashes with probability below $10^{-8}$ per mile, the manufacturer would need to test-drive the car for at least $10^8$ (100 million) miles, which may be prohibitively expensive.
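To make the $1/\epsilon$ scaling concrete, the following minimal sketch (with toy numbers, not the driving statistics above) simulates naive random testing against a rare failure; the helper names are illustrative:

```python
import random

def episodes_until_failure(failure_prob, seed):
    """Naive random testing: run i.i.d. episodes until the first failure."""
    rng = random.Random(seed)
    n = 0
    while True:
        n += 1
        if rng.random() < failure_prob:  # one episode; fails w.p. failure_prob
            return n

# With failure probability eps, the first failure arrives after 1/eps
# episodes in expectation, so certifying a rate below eps costs O(1/eps).
eps = 1e-3
mean_episodes = sum(episodes_until_failure(eps, s) for s in range(200)) / 200
```

Averaged over 200 seeds, the mean is close to $1/\epsilon = 1000$ episodes, illustrating why small target failure rates make random testing expensive.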
Our Contributions. To overcome the above-mentioned problems, we develop a novel adversarial evaluation approach. The central motivation behind our algorithmic choices is the fact that real-world evaluation is typically dominated by the cost of running the agent in the real world and/or human supervision. In the self-driving car example, both issues are present: testing requires both operating a physical car, and a human test driver behind the wheel. The overarching idea is thus to screen out situations that are unlikely to be problematic, and focus evaluation on the most difficult situations. The difficulty arises in identifying these situations – since failures are rare, there is little signal to drive optimization. To address this problem, we introduce a continuation approach to learning a failure probability predictor (AVF), which estimates the probability the agent fails given some initial conditions. The idea is to leverage data from less robust agents, which fail more frequently, to provide a stronger learning signal. In our implementation, this also allows the algorithm to reuse data gathered for training the agent, saving time and resources during evaluation. We note that adversarial testing is a well-established idea (see Section 5), but typically requires either a dense optimization signal or expert domain knowledge. We avoid these stumbling blocks by relying on the learned AVF, which guides the adversarially acting evaluator.
We look at two settings where the AVF can be used. In the simplest setting, failure search, the problem is to efficiently find inputs (initial conditions) that cause failures (Section 2.1). This task has several uses. First, an adversary that solves this task efficiently allows one to identify and debug potentially unsafe policies. Second, as has been done previously in the supervised learning literature, efficient adversaries can be used for adversarial training, by folding the states causing failures back into the training algorithm (Madry et al., 2017). The second setting, risk estimation, is the problem of efficiently estimating the failure probability of an agent (Section 2.2), which also has a simple application to efficiently selecting the most reliable agent from a finite set (Section 4.3).

Empirically, we demonstrate dramatic improvements in efficiency through adversarial testing on two domains (simulated driving and humanoid locomotion). In summary, we present key contributions:

We empirically demonstrate the limitations of random testing. We observe that with random testing, the cost of reliably obtaining even a single adversarial input can exceed the entire cost of training. Further, the cost of reliably estimating risk can exceed training costs.

We describe a continuation approach for learning failure probability predictors even when failures are rare. We develop algorithms applying failure probability predictors to failure search, risk estimation, and model selection.

We extensively evaluate our method on simulated driving and humanoid locomotion domains. Using adversarial evaluation, we find failures with 198 and 3100 times fewer samples, respectively. On Humanoid, we bring the cost of reliable risk estimation down from more than the cost of training to a practical budget.
2 Problem Formulation
We first introduce our notation. Recall that we are interested in reliability assessment of a trained agent. We assume that the experimenter who performs the reliability assessment can run an experiment (equivalently, a rollout or episode) with the trained agent given some initial condition $x \in \mathcal{X}$, the outcome of which is a random failure indicator $c = f(x, w) \in \{0, 1\}$, where $w \sim P_W$ for some probability distribution $P_W$ over some set $\mathcal{W}$. In particular, $c$ is binary and $c = 1$ indicates a "catastrophic failure". The interpretation of $w$ is that it collects all sources of randomness due to the agent and the environment. Unlike the case of $x$, $w$ is neither observed, nor controllable by the experimenter. We are interested in the agent's performance on the environment distribution $P$ over initial conditions $x$, which we assume can be sampled quickly. (To minimize jargon, we omit standard conditions on the spaces and functions that permit the use of the language of probabilities.)

In the real world, evaluating $f$ requires running the agent in the real world and/or human supervision, which generally dominate the cost of evaluating neural networks. It is thus assumed that evaluating $f$ on the pair $(x, w)$ is costly. In Section 4.1, we show that even on relatively cheap simulated environments, the cost of simulating environments dominates the cost of evaluating neural networks.

2.1 Failure Search
The objective of failure search is to find a catastrophic failure of the agent. Algorithms that search for failures, which we will call adversaries, can specify initial conditions, observe the outcome of running the agent, and return as soon as a failure is observed. Thus, for each round $t = 1, 2, \dots$, the adversary can choose $x_t$, observe $c_t = f(x_t, w_t)$, and return $x_t$ if $c_t = 1$. Here, $w_t \sim P_W$, independently of the history $x_1, c_1, \dots, x_{t-1}, c_{t-1}$. The naive adversary evaluates the agent on samples $x_t \sim P$ until observing a failure. We assess adversaries by either the expected number of episodes, or the expected time elapsed, until returning a failure case.
To design a more efficient adversary, the experimenter is allowed to collect historical data of the form $(x_i, \theta_i, c_i)_{i=1}^{N}$ while they are training their agent. Here, the initial condition $x_i$ and $\theta_i$ are chosen by the training procedure, where $\theta_i$ encodes information about the agent used in round $i$ to generate the observation $c_i$.
2.2 Risk estimation
As before, let $P$ denote the distribution over initial states defined by the environment. The failure probability of the trained agent is
$p = \mathbb{P}(f(x, w) = 1)$,
where $x \sim P$ and $w \sim P_W$ independently. A typical goal is to estimate $p$ up to a fixed relative accuracy with high probability. Given some $\epsilon > 0$ and $\delta \in (0, 1)$, an algorithm is said to be $(\epsilon, \delta)$-correct if the estimate $\hat{p}$ produced belongs to the interval $[(1-\epsilon)p, (1+\epsilon)p]$ with probability at least $1 - \delta$. When $\hat{p}$ belongs to the said interval, we say that it is a $(1 \pm \epsilon)$-approximation of $p$.
As before, the algorithm has access to historical data and can quickly sample from $P$. The naive or vanilla Monte Carlo (VMC) estimator samples from $P$ and returns the average of the observed outcomes. We assess estimators by the total number of experiments required.
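As a concrete sketch of the VMC estimator (with a hypothetical toy environment; `sample_x` and `run_experiment` stand in for sampling $P$ and running a costly experiment):

```python
import random

def vmc_estimate(sample_x, run_experiment, n):
    """Vanilla Monte Carlo: average the binary failure indicators
    over n i.i.d. episodes with initial conditions drawn from P."""
    return sum(run_experiment(sample_x()) for _ in range(n)) / n

# Toy setup (assumed for illustration): x ~ Uniform[0, 1] and the agent
# fails with probability 0.2 * x, so the true risk is p = 0.1.
rng = random.Random(0)
sample_x = rng.random
run_experiment = lambda x: 1 if rng.random() < 0.2 * x else 0
p_hat = vmc_estimate(sample_x, run_experiment, n=100_000)
```

With $n = 10^5$ episodes the estimate concentrates near $p = 0.1$; for much rarer failures, the same relative accuracy would require correspondingly more episodes.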
3 Approach
In this section we describe our approach to the two problems outlined above. A common feature of our proposed solutions is that they estimate the failure probability predictor (AVF) $v^*$ that returns the probability of failure given an initial condition (equivalently, the value function of an adversary which receives reward $1$ for failures, without discounting):
$v^*(x) = \mathbb{E}[f(x, w)]$,
where $w \sim P_W$. Our continuation approach to learning an approximation $\hat{v} \approx v^*$ is described in Section 3.3. $\hat{v}$ will be chosen so that the cost of evaluating it is negligible compared to running an experiment. The idea is to use $\hat{v}$ to save on the cost of experimentation. Our solutions build on the certainty equivalence approach (Turnovsky, 1976): First, we describe how $v^*$ (if available) could be leveraged. Following certainty equivalence, we then substitute $\hat{v}$ for $v^*$. Finally, we introduce additional heuristics to reduce sensitivity to the mismatch between $\hat{v}$ and $v^*$.

3.1 Failure Search
When $v^*$ is known, the optimal adversary has a particularly simple form (proof left to the reader):
Proposition 3.1.
The adversary that minimizes the expected number of rounds until failure evaluates the agent repeatedly on an instance $x^* \in \arg\max_x v^*(x)$.
Having access to $\hat{v}$, a natural approach is to evaluate the agent on $\arg\max_x \hat{v}(x)$. There are two problems with this: (i) a maximizer may be hard to find and (ii) there is no guarantee that a maximizer of $\hat{v}$ will give a point where $v^*$ is large. Rather than always select the global maximizer of $\hat{v}$, one way to increase the robustness of this procedure is to sample diverse initial conditions. We adopt a simple procedure: sample $k$ initial conditions from $P$, pick the initial condition from this set where $\hat{v}$ is the largest, and run an experiment from the found initial condition. We repeat this process with $k$ new sampled initial conditions until we find a catastrophic failure. The pseudocode is included in Appendix C as Algorithm 2 (AVF Adversary).
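A sketch of this screening loop (our reading of Algorithm 2; the toy `avf`, `sample_x`, and `run_experiment` below are illustrative stand-ins, not the paper's implementation):

```python
import random

def avf_adversary(sample_x, avf, run_experiment, k, max_rounds):
    """Sample k candidate initial conditions, run an experiment only from
    the one the AVF scores as most failure-prone, and repeat until a
    catastrophic failure is observed (or the budget runs out)."""
    for _ in range(max_rounds):
        candidates = [sample_x() for _ in range(k)]
        x = max(candidates, key=avf)  # cheap screening; no episode needed
        if run_experiment(x) == 1:    # one costly episode per round
            return x
    return None

# Toy setup: x ~ Uniform[0, 1], failure probability 0.5 * x, and an AVF
# that (here, by assumption) ranks initial conditions correctly.
rng = random.Random(1)
failure_x = avf_adversary(
    sample_x=rng.random,
    avf=lambda x: x,
    run_experiment=lambda x: 1 if rng.random() < 0.5 * x else 0,
    k=10,
    max_rounds=1000,
)
```

The screening step only queries the cheap AVF, so the number of costly experiments per round stays at one regardless of $k$.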
3.2 Risk estimation using AVFs
The failure probability estimation method uses importance sampling (IS) (e.g., Section 4.2, Bucklew 2004), where the distribution $P$ is replaced by a proposal distribution $Q$. For $i = 1, \dots, n$, the proposal distribution is used to generate a random initial condition $x_i \sim Q$, then an experiment is performed to generate $c_i$. The failure probability is estimated using the sample mean
$\hat{p}_{IS} = \frac{1}{n} \sum_{i=1}^{n} W_i c_i$,
where $W_i = p(x_i)/q(x_i)$ is the importance weight of the $i$th sample. Here, $p$ denotes the density of $P$ and $q$ denotes the density of $Q$. (We implicitly assume that these measures have a density with respect to a common measure. As it turns out, this is not a limiting assumption, but this goes beyond the scope of the present article.) As is well known, under the condition that $q(x) > 0$ whenever $p(x) v^*(x) > 0$, we have that $\mathbb{E}[\hat{p}_{IS}] = p$, and hence the estimator is unbiased. Given that $\hat{p}_{IS}$ is unbiased for any such proposal distribution, one natural objective is to choose a proposal distribution which minimizes the variance of the estimator $\hat{p}_{IS}$.
Proposition 3.2.
For $v^*$ such that $p = \mathbb{E}_{x \sim P}[v^*(x)] > 0$, let $Q^*$ be defined as the distribution over $\mathcal{X}$ whose density is $q^*(x) \propto p(x) \sqrt{v^*(x)}$. Then, the variance-minimizing proposal distribution is $Q^*$.
The (standard) proof of this and the next proposition can be found in Appendix B. Note that from the definition of $q^*$ it follows that $q^*(x) > 0$ whenever $p(x) v^*(x) > 0$.
The above result motivates us to suggest using the distribution $Q^*$, with $v^*$ replaced by $\hat{v}$, as the proposal distribution of an IS estimator. Note that the choice of $\hat{v}$ (as long as it is bounded away from zero) only influences the efficiency of the method, but not its correctness. It remains to specify how to sample from the proposal and how to calculate the importance weights. For sampling, we propose to use the rejection sampling method (Section 1.2.2, Bucklew 2004) with the proposal distribution chosen to be $P$: First, $x \sim P$ is chosen, which is accepted with probability $\sqrt{\hat{v}(x)}$. This is repeated until a sample is accepted:
Proposition 3.3.
Rejection sampling as described produces a sample from the distribution with density proportional to $p(x)\sqrt{\hat{v}(x)}$.
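Combining the rejection sampler with the IS estimator above gives the following minimal sketch (a toy problem with an exactly known AVF; the exponent `alpha` on the AVF generalizes the acceptance probability, with `alpha = 0.5` matching the variance-minimizing proposal of Proposition 3.2):

```python
import random

rng = random.Random(0)

# Toy problem (assumed): x ~ Uniform[0, 1], and the AVF is known exactly,
# v(x) = x**9, so the true risk is p = E[x**9] = 0.1.
avf = lambda x: x ** 9
run_experiment = lambda x: 1 if rng.random() < avf(x) else 0

def sample_from_q(alpha):
    """Rejection sampling: draw x ~ P, accept with probability
    avf(x)**alpha; accepted samples have density q(x) ∝ p(x) avf(x)**alpha."""
    while True:
        x = rng.random()
        if rng.random() < avf(x) ** alpha:
            return x

def is_estimate(n, alpha=0.5):
    """IS risk estimate with weights p(x)/q(x) = Z / avf(x)**alpha, where
    the normalizer Z = E_{x~P}[avf(x)**alpha] is estimated from cheap
    samples of P (no costly experiments required)."""
    z = sum(rng.random() ** (9 * alpha) for _ in range(200_000)) / 200_000
    total = 0.0
    for _ in range(n):
        x = sample_from_q(alpha)
        total += (z / avf(x) ** alpha) * run_experiment(x)
    return total / n

p_hat = is_estimate(n=50_000)
```

Note that the rejected proposals only cost AVF evaluations and samples from $P$, both assumed cheap; an experiment is run only on accepted initial conditions.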
To increase the robustness of the sampling procedure against errors introduced by $\hat{v}$, we introduce a "hyperparameter" $\alpha \geq 0$, so that the proposal density is redefined to be proportional to $p(x) \hat{v}(x)^{\alpha}$. Note that $\alpha = 1/2$ is what our previous result suggests. However, if $v^*$ is overestimated by $\hat{v}$, a larger value of $\alpha$ may work better, while if $v^*$ is underestimated, a smaller value of $\alpha$ may work better. The pseudocode of the full procedure is given as Algorithm 1.

3.3 A Continuation Approach to Learning AVFs
Recall that we wish to learn an approximation $\hat{v}$ to $v^*$, where $v^*$ is the true failure probability predictor for the agent of interest. The classical approach is to first obtain a dataset by evaluating the agent of interest on initial states sampled from $P$. We could then fit a softmax classifier to this data with a standard cross-entropy loss.

The classical approach does not work well for our setup, because failures are rare and the failure signal is binary. For example, in the humanoid domain, the agent of interest fails once every 110k episodes, after it was trained for 300k episodes. To learn a classifier we would need to see many failure cases, so we would need significantly more than 300k episodes to learn a reasonable $\hat{v}$.
To solve this problem, we propose a continuation approach to learning AVFs, where we learn from a family of related agents that fail more often. The increased failure rates of these agents can provide an essential boost to the 'signal' in the learning process.
The particular problem we consider, agent evaluation, has additional structure that we can leverage. Typically, agents earlier on in training fail more often, but in related ways, to the final agent. So we propose learning from agents that were seen earlier on in training. This has an added computational benefit: we do not need to gather additional data at evaluation time. Concretely, we collect data during training of the form $(x_i, \theta_i, c_i)$, where $x_i$ is the initial condition in iteration $i$ of training, and $\theta_i$ characterizes essential features of the agent's policy used to collect the failure indicator $c_i$. We then train a network to predict $c$ at input $(x, \theta)$. In the simplest implementation, $\theta_i$ merely encodes the training iteration $i$. In many common off-policy RL methods, additional stochasticity is used in the policy at training time to encourage exploration. In these cases, we also use $\theta_i$ to encode the degree of stochasticity in the policy.
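A minimal sketch of this training step, using a tiny logistic model in place of the paper's network (the data-generating process and the scalar features `x` and `theta` are hypothetical, chosen only to mimic the structure of continuation data):

```python
import math
import random

def train_avf(data, steps=30_000, lr=0.5, seed=0):
    """Fit v_hat(x, theta) = sigmoid(a*x + b*theta + c) by SGD on the
    cross-entropy loss over (x, theta, c) triples gathered during agent
    training; earlier (weaker) agents supply most of the positive labels."""
    rng = random.Random(seed)
    a = b = c = 0.0
    for _ in range(steps):
        x, theta, label = data[rng.randrange(len(data))]
        pred = 1.0 / (1.0 + math.exp(-(a * x + b * theta + c)))
        grad = pred - label            # gradient of cross-entropy w.r.t. logit
        a -= lr * grad * x
        b -= lr * grad * theta
        c -= lr * grad
    return lambda x, theta: 1.0 / (1.0 + math.exp(-(a * x + b * theta + c)))

# Synthetic continuation data: theta in [0, 1] is the (normalized) training
# iteration; later agents (larger theta) fail less often.
rng = random.Random(1)
def true_v(x, theta):
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 4.0 * theta)))
data = []
for i in range(20_000):
    theta = i / 20_000
    x = rng.random()
    data.append((x, theta, 1 if rng.random() < true_v(x, theta) else 0))

avf = train_avf(data)
```

After training, the learned model ranks harder initial conditions above easier ones for the final agent ($\theta = 1$), even though most positive labels came from earlier agents.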
The continuation approach does make a key assumption – that agents we train on fail in related ways to the final agent. We provide a simple toy example to discuss in what ways we rely on this, and in what ways we do not.
Example. Suppose the true AVF factorizes as $v^*(x, \theta) = g(x) h(\theta)$. Suppose we have two agents described by $\theta_1, \theta_2$, such that the latter agent is significantly more robust, i.e., $h(\theta_2) \ll h(\theta_1)$. Then, if learning requires a fixed number of positive examples to reach a particular accuracy, the continuation approach learns $g$, the state-dependent component, $h(\theta_1)/h(\theta_2)$ times faster than the naive approach, since it receives $h(\theta_1)/h(\theta_2)$ times as many positive examples.

On the other hand, accurately learning $h$ is difficult, since positive examples on $\theta_2$ are rare. However, even when $h$ is learned inaccurately, this AVF is still useful for both failure search and risk estimation. Concretely, if two AVFs differ only in $h$, then both the adversaries and estimators induced by the two AVFs are equivalent.
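A tiny numeric check of this equivalence under the factorization assumption (the particular $g$ and the constants standing in for $h(\theta)$ are made up):

```python
# Two AVFs that share the state component g but differ only in the agent
# component h: v_i(x) = h_i * g(x).
g = lambda x: x ** 2
v1 = lambda x: 0.30 * g(x)   # weak agent
v2 = lambda x: 0.01 * g(x)   # robust agent

xs = [i / 100 for i in range(101)]

# Same adversary: both AVFs pick the same worst-case initial condition.
worst1, worst2 = max(xs, key=v1), max(xs, key=v2)

# Same estimator: the normalized proposals q(x) ∝ p(x) v(x)**alpha
# coincide, because the constant h cancels in the normalization.
alpha = 0.5
z1 = sum(v1(x) ** alpha for x in xs)
z2 = sum(v2(x) ** alpha for x in xs)
max_gap = max(abs(v1(x) ** alpha / z1 - v2(x) ** alpha / z2) for x in xs)
```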
Discussion. Of course, this strict factorization does not hold exactly in practice. However, we use this example to illustrate two points. First, for both failure search and risk estimation, it is the shape of the learned AVF which matters more than its magnitude. Second, it captures the intuition that the AVF may learn the underlying structure of difficult states, i.e. $g$, from policies where positive examples are less scarce. Further, using flexible parameterizations for $\hat{v}$, such as neural networks, may even allow it to represent interactions between $x$ and $\theta$, provided there is sufficient data to learn these regularities. Of course, if the failure modes of the test policy of interest look nothing like the failure modes observed in the related agents, the continuation approach is insufficient. However, we show that in the domains we study, this approach works quite well.
4 Experiments
We run experiments on two standard reinforcement learning domains, which we describe briefly here. Full details on the environments and agent training procedures are provided in Appendix E.
In the Driving domain, the agent controls a car in the TORCS simulator (Wymann et al., 2000) and is rewarded for forward progress without crashing. Initial conditions are defined by the shape of the track, which is sampled from a procedurally defined distribution $P$. We define a failure as any collision with a wall. We use an on-policy actor-critic agent as in Espeholt et al. (2018), using environment settings from Mnih et al. (2016) which reproduce the original results on TORCS.
In the Humanoid domain, the agent controls a humanoid body in the MuJoCo simulator (Todorov et al., 2012; Tassa et al., 2018), and is rewarded for standing without falling. The initial condition of an experiment is defined by a standing joint configuration, which is sampled from a fixed distribution. We define a catastrophic failure as any state where the humanoid has fallen, i.e. the head height is below a fixed threshold. We use an off-policy distributed distributional deterministic policy gradient (D4PG) agent, following hyperparameters from Barth-Maron et al. (2018) which reproduce the original results on the humanoid tasks.
4.1 Failure search
We first compare three adversaries by the number of episodes required to find an initial condition that results in a trajectory that ends with a failure. Our purpose is to illustrate two key contributions. First, a naive evaluation, even when using the same number of episodes as training the agent, can lead to a false sense of safety by failing to detect any catastrophic failures. Second, adversarial testing addresses this issue, by dramatically reducing the cost of finding failures. The adversaries evaluated are the naive (VMC) and the AVF adversaries introduced in Section 3.1, as well as an adversary that we call the prioritized replay (PR) adversary. This latter adversary runs experiments starting from all initial conditions which led to failures during training, most recent first. This provides a simple and natural alternative to the naive adversary. Additional details on all adversaries are provided in Appendix E. The results are summarized in Table 1.
Table 1: Number of episodes required to find a failure for the VMC, PR, and AVF adversaries on the Driving and Humanoid domains.
Discussion of AVF adversary. The AVF adversary is multiple orders of magnitude more efficient than the random adversary. In particular, we note that on Humanoid, for the VMC adversary to have a high chance of detecting even a single failure, it would require more episodes than were used to train the agent (the agent fails roughly once every 110k episodes and was trained for 300k episodes). In practice, evaluation is often run for much less time than training, and in these cases, the naive approach would very often lead to the mistaken impression that such failure modes do not exist.
We observe similar improvements in wall-clock time. On the Driving domain, in the median case, finding a failure takes hours in expectation for the random adversary, compared to minutes for the AVF adversary; even after including the AVF model training time, the AVF adversary remains far faster. Similarly, on the Humanoid domain, the random adversary requires days in expectation, compared to minutes for the AVF adversary, again a large speedup after including AVF training time. These numbers underscore the point that even in relatively cheap, simulated environments, the cost of environment interactions dominates the learning and inference costs of the AVF.
Discussion of Prioritized Replay adversary. The PR adversary often provides much better efficiency than random testing, with minimal implementation overhead. The main limitation is that in some cases, the PR adversary may never detect failures, even if they exist. In particular, an agent may learn to handle the particular problem instances on which it was trained, without eliminating all possible failures. In these cases (this occurred once on Humanoid), we fall back to the VMC adversary after trying all the problem instances which caused failures during training. On the Driving domain, because we use a form of adversarial training for the agent, very few of the training problem instances have support under $P$. Since none of these resulted in failures, the PR adversary would immediately fall back to the VMC adversary.
Comparison to classical approaches. One question is why we do not compare to more classical techniques from the rare events literature, such as the cross-entropy method or subset simulation. Because we optimize a binary failure signal, rather than a continuous score as in the classical setting, these approaches would not work well. For example, in Humanoid, the final agent fails once every 110k episodes, and was trained for 300k episodes. If we used the cross-entropy method on the final agent, we would need significantly more than 300k episodes of data to fit a good proposal distribution. The continuation approach is essential: obtaining enough failures to fit a useful proposal distribution requires learning from a family of weaker agents.
4.2 Risk estimation
We now compare approaches for risk estimation. Again, the purpose is to illustrate the claim that a naive approach may often be too inefficient to give rise to non-vacuous quantitative answers with reasonable resources, while better alternatives can deliver such answers. In particular, we compare the AVF estimator of Section 3.2 to the vanilla Monte Carlo estimator (VMC). For comparison purposes, we run each estimator multiple times and report the fraction of cases when the obtained estimates fail to fall in the interval $[(1-\epsilon)p, (1+\epsilon)p]$, where $p$ is the failure probability to be estimated. This provides a fairly accurate estimate of the probability of failing to be a $(1 \pm \epsilon)$-approximation of $p$. (Here, $p$ was measured separately by running the VMC estimator for a very large number of episodes on each domain, so that the standard errors lie within a small relative error.) Results are summarized in Fig. 1. In Section E.3 we include plots for other choices of $\epsilon$, to show that the results are not very sensitive to this choice.
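The evaluation protocol above can be sketched as follows (toy Bernoulli "environment"; the estimator and its budget are placeholders, not the paper's setup):

```python
import random

def miss_fraction(estimator, p_true, eps, trials):
    """Fraction of repeated runs whose estimate falls outside
    [(1 - eps) * p, (1 + eps) * p]; this approximates the probability
    delta of failing to be a (1 ± eps)-approximation of p."""
    misses = 0
    for t in range(trials):
        p_hat = estimator(seed=t)
        if not (1 - eps) * p_true <= p_hat <= (1 + eps) * p_true:
            misses += 1
    return misses / trials

# Placeholder estimator: VMC with a budget of 2000 episodes on a toy
# environment whose true failure probability is 0.1.
def vmc(seed, n=2000, p=0.1):
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

delta_hat = miss_fraction(vmc, p_true=0.1, eps=0.5, trials=50)
```

In this easy toy case the budget suffices and the miss fraction is near zero; for rare failures, the same protocol reveals how many more episodes VMC needs than the AVF estimator.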
Discussion. At a fixed confidence level $\delta$, on the Driving domain, the AVF estimator needs many times fewer experiments than the VMC estimator to achieve a $(1 \pm \epsilon)$-approximation, and thus requires correspondingly less environment interaction. We observe a similar multiple-fold improvement on the Humanoid domain.
We note that these numbers demonstrate that the AVF also has good coverage, i.e. it samples from all possible failures, since if any failure conditions were vastly undersampled, the variance of the AVF estimator would explode due to the importance weight term. Coverage and efficiency are complementary: while failure search merely requires that the adversary identify a single failure, efficient risk estimation requires the adversary to sample from all possible failures.
4.3 Application to identifying more reliable agents
The improved efficiency of the AVF estimator has many benefits. One application is to identify the most reliable agent from a fixed finite set.
For this illustration, we compare 50 Humanoid agents, spaced evenly over the course of training (excluding the beginning of training, when the agent has a high failure rate). We compare the choices made when the failure probability estimation is performed with the VMC estimator versus the AVF estimator. The selected policy at each point in time is the policy with the lowest estimated risk. In the event of a tie (e.g., among several policies with no failures), we report the expected failure probability when selecting from the tied agents randomly.
Fig. 2 summarizes the results. When using VMC, 36 out of 50 policies have never failed by the end of the evaluation process. Thus, VMC was not able to rank these 36 of the 50 policies, and the experimenter would be forced to choose one of these with no further information about their reliability. On the other hand, the AVF estimator quickly produces failures for the weak policies, and is able to select a policy which is over 3 times more robust (this estimate is conservative – see Appendix E.4). This can be viewed as an extremely naive form of adversarial training, where the model is selected from a discrete set of options rather than a high-dimensional parameter space. We use this example to highlight the recurring theme that random testing can be highly inefficient – failing to identify the best policies even after more episodes than were used for training – while adversarial testing greatly improves efficiency.
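The selection rule itself is simple; a sketch follows (the estimate values are made-up illustrations of the tie phenomenon, not measured numbers):

```python
def select_policies(estimated_risks):
    """Return the indices tied for the lowest estimated failure
    probability; the experimenter must then choose among the tie."""
    best = min(estimated_risks)
    return [i for i, r in enumerate(estimated_risks) if r == best]

# With a small random-testing budget, most policies show zero failures,
# so VMC leaves a large uninformative tie; the AVF estimator (hypothetical
# values) resolves the ranking.
vmc_estimates = [0.0, 0.0, 0.001, 0.0, 0.002]
avf_estimates = [3e-4, 5e-5, 1e-3, 2e-4, 2e-3]
vmc_choice = select_policies(vmc_estimates)   # three-way tie
avf_choice = select_policies(avf_estimates)   # unique choice
```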
4.4 Applicability and Robustness
We conclude this section with practical considerations regarding the use of this approach in settings beyond the domains considered here, including the real world.
Maintaining high coverage. A concern is that if the AVF underestimates the failure probability of certain initial conditions, the risk estimator will have high variance, and with limited samples, often underestimate the true failure probability. Our main point is that in many settings, VMC has little chance of revealing any failures with a practical budget, and so using any form of optimized adversary is a large improvement, but we also acknowledge other factors which mitigate this issue.
First, because the AVF is trained on weaker agents, it typically overestimates failure probabilities. Second, in Humanoid, where failures are particularly rare, we use a simplified Differentiable Neural Dictionary (DND) (Pritzel et al., 2017), described in Appendix E.1. A DND is a kNN classifier in feature space, but uses a learned pseudo-count, which causes it to output higher failure probabilities when the query point is far in feature space from training points.
Efficiency lower bounds. Further, we can ensure our method never does more than twice as badly as VMC. To do so, we run both the VMC and AVF estimators in parallel. If VMC finds at least several failures, then we take the VMC estimate, and otherwise take the AVF estimate. This incurs a 2x slowdown in the worst case, while remaining much more efficient in many safety-critical domains. Neufeld et al. (2014) give better guarantees when combining stochastic estimators.
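A sketch of this fallback rule (the threshold of 5 observed failures is an arbitrary choice for illustration; the paper does not specify one):

```python
def combined_estimate(vmc_outcomes, avf_estimate, min_failures=5):
    """Run VMC and the AVF estimator in parallel (at most a 2x cost):
    trust VMC when it observed enough failures to be reliable on its
    own, and otherwise fall back to the AVF-based estimate."""
    if sum(vmc_outcomes) >= min_failures:
        return sum(vmc_outcomes) / len(vmc_outcomes)
    return avf_estimate

# When VMC sees no failures its estimate of 0 is uninformative, so the
# AVF estimate is used; with enough observed failures, VMC is trusted.
rare = combined_estimate([0] * 1000, avf_estimate=1e-4)
common = combined_estimate([1] * 5 + [0] * 95, avf_estimate=1e-4)
```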
5 Background and Related Work
Adversarial examples. Our work on reliability is in part motivated by research on adversarial examples, which highlights the fact that machine learning systems that perform very well on average may nonetheless perform extremely poorly on particular adversarial inputs, often in surprising ways (Szegedy et al., 2013; Goodfellow et al., 2014). Unlike previous work on adversarial examples in reinforcement learning (Huang et al., 2017; Lin et al., 2017), we do not allow adversaries to generate inputs outside the distribution on which the agent is trained.
Most work on adversarial examples has focused on norm balls in the image domain. Many recent papers have questioned the practical value of norm ball robustness (Gilmer et al., 2018; Engstrom et al., 2017; Schott et al., 2018). However, moving beyond the norm ball specification has been difficult, because for unrestricted adversarial examples, human evaluation is necessary to determine the ground truth labels (see Brown et al. (2018); Song et al. (2018) for work in this direction).
In this context, we believe simulated reinforcement learning environments provide a valuable testbed for researchers interested in adversarial examples: the ground truth is provided by the simulator. We hope the domains and problem formulations presented here drive research on adversarial examples beyond norm balls, and towards training and testing models consistent with global specifications.
Adversarial approaches for evaluation. Our work bears similarities to Werpachowski et al. (2018), who use adversarial examples with an importance weighting correction to detect test set overfitting in the context of classification. Our aims share similarities with recent proposals for testing components of autonomous vehicle systems (Shalev-Shwartz et al., 2017; Dreossi et al., 2017; Tian et al., 2018; Pei et al., 2017). We believe these approaches are complementary and should be developed in parallel: these works focus on component-level specifications (e.g. controllers should maintain safe distances), while here, we focus on testing the entire agent end-to-end for system-level specifications.
Rare event estimation. We draw upon a rich literature on rare event probability estimation (e.g., Bucklew 2004; Rubino & Tuffin 2009; Rubinstein & Kroese 2017). Importance sampling (IS) is one of the workhorses of this field. The closest methods to our approach are the adaptive importance sampling methods, where data collected from an initial proposal distribution is iteratively used to adjust it to maximize sampling efficiency (e.g., Rubinstein 1997; Rubinstein & Kroese 2017; Li et al. 2013). These methods do not work well in our context, where the failure signal is binary. Instead, we adapt the proposal distribution using data from related, but less robust, agents. Our approach also differs from previous works in that, to better reflect the practicalities of RL tasks, we explicitly separate controllable randomness (i.e., initial conditions, simulator parameters) from randomness that is neither controllable, nor observable (environment and agent randomness). As we show, this has implications on the form of the minimum-variance proposal distribution.
In robotics, the assessment of the safety of motion control algorithms has been recognized as a critical aspect of their real-world deployment (Van Den Berg et al., 2011). In a recent paper, extending Janson et al. (2018), Schmerling & Pavone (2017) proposed to use an adaptive mixture importance sampling algorithm to quantify the collision probability for an LQR controller with EKF state estimation applied to nonlinear systems with a full rigid-body collision model. Unlike our method, this work needs an analytic model of the environment.
Learning highly reliable agents. While in this work we primarily focus on agent evaluation, rather than training, we note recent work on safe reinforcement learning (see García & Fernández, 2015, for a review). These approaches typically rely on strong knowledge of the transition dynamics (Moldovan & Abbeel, 2012), or assumptions about the accuracy of a learned model (Berkenkamp et al., 2017; Akametalu et al., 2014); we believe they are complementary to adversarial testing, which can be viewed as a mechanism for checking whether such assumptions in fact hold. We provide a longer discussion of this literature in Appendix F.
6 Conclusion and Future Work
In this work, we argued that standard approaches to evaluating RL agents are highly inefficient in detecting rare, catastrophic failures, which can create a false sense of safety. We believe the approach and results here strongly demonstrate that adversarial testing can play an important role in assessing and improving agents, but are only scratching the surface. We hope this work lays the groundwork for future research into evaluating and developing robust, deployable agents.
Acknowledgments
We are grateful to Alhussein Fawzi, Tengyu Ma, and Percy Liang for their helpful input on this paper, and to Dhruva Tirumala, Josh Merel, Abbas Abdolmaleki, Wojtek Czarnecki, and Heinrich Kuttler for insightful discussions.
References
 Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.
 Akametalu et al. (2014) Anayo K Akametalu, Shahab Kaynama, Jaime F Fisac, Melanie Nicole Zeilinger, Jeremy H Gillula, and Claire J Tomlin. Reachabilitybased safe learning with gaussian processes. In CDC, pp. 1424–1431. Citeseer, 2014.

 Altman (1999) Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
 Audibert et al. (2011) Jean-Yves Audibert, Olivier Catoni, et al. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
 Barth-Maron et al. (2018) Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
 Beattie et al. (2016) Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.
 Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–918, 2017.
 Brown et al. (2018) Tom B Brown, Nicholas Carlini, Chiyuan Zhang, Catherine Olsson, Paul Christiano, and Ian Goodfellow. Unrestricted adversarial examples. arXiv preprint arXiv:1809.08352, 2018.
 Brownlees et al. (2015) Christian Brownlees, Emilien Joly, Gábor Lugosi, et al. Empirical risk minimization for heavytailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
 Bucklew (2004) James Antonio Bucklew. Introduction to Rare Event Simulation. Springer New York, 2004.
 Chow et al. (2018) Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
 Dreossi et al. (2017) Tommaso Dreossi, Alexandre Donzé, and Sanjit A Seshia. Compositional falsification of cyber-physical systems with machine learning components. In NASA Formal Methods Symposium, pp. 357–372. Springer, 2017.
 Engstrom et al. (2017) Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
 Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
 Frank et al. (2008) Jordan Frank, Shie Mannor, and Doina Precup. Reinforcement learning in the presence of rare events. In ICML, pp. 336–343, 2008.
 García & Fernández (2015) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16:1437–1480, 2015. URL http://jmlr.org/papers/v16/garcia15a.html.
 Gilmer et al. (2018) Justin Gilmer, Ryan P Adams, Ian Goodfellow, David Andersen, and George E Dahl. Motivating the rules of the game for adversarial example research. arXiv preprint arXiv:1807.06732, 2018.
 Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel BarthMaron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
 Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
 Jaderberg et al. (2017) Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
 Janson et al. (2018) Lucas Janson, Edward Schmerling, and Marco Pavone. Monte carlo motion planning for robot trajectory optimization under uncertainty. In Robotics Research, pp. 343–361. Springer, 2018.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Li et al. (2013) Wentao Li, Zhiqiang Tan, and Rong Chen. Two-stage importance sampling with mixture proposals. Journal of the American Statistical Association, 108(504):1350–1365, 2013.
 Lin et al. (2017) Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
 Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 Mandlekar et al. (2017) Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In IROS, pp. 3932–3939. IEEE, 2017.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
 Moldovan & Abbeel (2012) Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810, 2012.
 Neufeld et al. (2014) James Neufeld, András György, Dale Schuurmans, and Csaba Szepesvári. Adaptive Monte Carlo via bandit allocation. In ICML, pp. 1944–1952, 2014.

 Pei et al. (2017) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Automated white-box testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 1–18. ACM, 2017.
 Peng et al. (2017) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537, 2017. URL http://arxiv.org/abs/1710.06537.
 Pinto et al. (2017) Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. CoRR, abs/1710.06542, 2017. URL http://arxiv.org/abs/1710.06542.
 Pritzel et al. (2017) Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.
 Rubino & Tuffin (2009) Gerardo Rubino and Bruno Tuffin (eds.). Rare Event Simulation using Monte Carlo Methods. John Wiley & Sons, 2009.
 Rubinstein (1997) Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.
 Rubinstein & Kroese (2017) Reuven Y. Rubinstein and Dirk P. Kroese. Simulation and the Monte Carlo method. Wiley, 3 edition, 2017.
 Schmerling & Pavone (2017) Edward Schmerling and Marco Pavone. Evaluating trajectory collision probability through adaptive importance sampling for safe motion planning. In Robotics: Science and Systems 2017, 2017.
 Schott et al. (2018) Lukas Schott, Jonas Rauber, Wieland Brendel, and Matthias Bethge. Robust perception through analysis by synthesis. arXiv preprint arXiv:1805.09190, 2018.
 Shalev-Shwartz et al. (2017) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a formal model of safe and scalable self-driving cars. arXiv preprint arXiv:1708.06374, 2017.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Song et al. (2018) Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Generative adversarial examples. arXiv preprint arXiv:1805.07894, 2018.
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Tian et al. (2018) Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pp. 303–314. ACM, 2018.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Turnovsky (1976) Stephen J. Turnovsky. Optimal stabilization policies for stochastic linear systems: The case of correlated multiplicative and additive disturbances. The Review of Economic Studies, 43(1):191–194, 1976. doi: 10.2307/2296614. URL http://dx.doi.org/10.2307/2296614.
 Valiant (1984) Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
 Van Den Berg et al. (2011) Jur Van Den Berg, Pieter Abbeel, and Ken Goldberg. Lqgmp: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research, 30(7):895–913, 2011.
 Vecerík et al. (2017) Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin A Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR, abs/1707.08817, 2017.
 Werpachowski et al. (2018) Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting using adversarial examples. arXiv, 2018.
 Wymann et al. (2000) Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. Torcs, the open racing car simulator. Software available at http://torcs.sourceforge.net, 4:6, 2000.
 Zhu et al. (2018) Yuke Zhu, Ziyu Wang, Josh Merel, Andrei A. Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement and imitation learning for diverse visuomotor skills. CoRR, abs/1802.09564, 2018.
Appendix A Risk Estimation with Heavytailed Losses
In this section, we expand on why empirical estimators of the expected risk based on uninformed random sampling have very high variance for problems with heavy-tailed losses. In statistical learning theory, the objective is to find a model $h$ which minimizes the expected risk $L(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h, x)]$, where $\ell(h, x)$ denotes the loss of the model $h$ on a data point $x$. For example, in classification, $\ell$ is typically the 0–1 loss, indicating whether $h$ makes the correct prediction on $x$. In reinforcement learning, $\ell(h, x)$ returns the negative discounted return on a rollout from state $x$. To assess a model $h$, since we typically can only sample from $\mathcal{D}$, the expected risk can be estimated on a test set of $n$ i.i.d. points $x_1, \ldots, x_n$ using the empirical risk $\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, x_i)$.
The standard justification for this approach is that with high probability, the empirical risk is close to the true expected risk (Valiant, 1984). If $\ell(h, x) \in [0, B]$ for all $x$, then Hoeffding's inequality yields
$$\Pr\big(|\hat{L}(h) - L(h)| \geq \epsilon\big) \leq 2\exp\left(-\frac{2n\epsilon^2}{B^2}\right).$$
Equivalently, the empirical test risk is within $\epsilon$ of the expected risk with probability $1 - \delta$, provided the number of data points in the test set is at least $n \geq \frac{B^2}{2\epsilon^2}\log\frac{2}{\delta}$.
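As a quick numerical illustration of this sample-size requirement, the sketch below evaluates the Hoeffding bound $n \geq \frac{B^2}{2\epsilon^2}\log\frac{2}{\delta}$ for a well-scaled loss and for a heavy-tailed one; the particular values of $B$, $\epsilon$, and $\delta$ are illustrative, not taken from the paper.

```python
import math

def hoeffding_sample_size(B, eps, delta):
    """Smallest n satisfying B^2/(2 eps^2) * log(2/delta) <= n, which guarantees
    the empirical risk of a loss bounded in [0, B] is within eps of the true
    expected risk with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(B**2 / (2 * eps**2) * math.log(2 / delta))

# Bounded, well-scaled losses: a modest test set suffices.
print(hoeffding_sample_size(B=1, eps=0.01, delta=0.05))    # -> 18445
# Heavy-tailed losses, e.g. a crash costing 10^4 times a normal episode:
# the same additive accuracy needs ~10^8 times as many samples.
print(hoeffding_sample_size(B=1e4, eps=0.01, delta=0.05))
```

Because the bound scales with $B^2$, a single rare outcome with a very large loss inflates the required test set quadratically.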
In many commonly studied settings, such as classification with the 0–1 loss, or reinforcement learning with bounded rewards and fixed episode lengths (Tassa et al., 2018; Beattie et al., 2016), these bounds guarantee fairly good efficiency. However, for very heavy-tailed losses, the bounds become very weak: ensuring a constant additive error requires a number of samples that grows quadratically with the scale $B$ of the worst-case loss. Using the self-driving car example from Section 1, consider a car going on many mile-long trips. Due to the extremely negative rewards associated with crashes, achieving a fixed error bound requires vastly more trips than would be necessary if the negative rewards for crashes were bounded to the same range as normal operation.
Intuitively, because very costly events may occur with a small probability, random sampling is unlikely to detect these failure modes unless we use a very large number of random samples. Note that for an agent with failure probability $p$, the probability of observing no failures in $n$ episodes is $(1-p)^n \approx e^{-np}$. Thus, for small $p$, even with $n = 1/p$ experiments, there is a roughly $e^{-1} \approx 37\%$ chance the empirical estimate will detect no failures at all, even though for a sufficiently catastrophic failure, this event could dominate the overall expected risk. While the exact bound on the error in the empirical estimate may depend on the particular estimator and concentration inequalities being used (Brownlees et al., 2015; Audibert et al., 2011), any approach relying solely on uninformed random sampling will face similar issues.
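This failure-detection argument is easy to verify by simulation. The sketch below is illustrative only: the failure probability, episode budget, and trial count are arbitrary choices, not values from our experiments.

```python
import random

def fraction_runs_with_no_failures(p, n_episodes, n_trials=2000, seed=0):
    """Estimate how often an evaluation run of n_episodes i.i.d. episodes
    observes zero failures, when each episode fails with probability p."""
    rng = random.Random(seed)
    misses = sum(
        all(rng.random() >= p for _ in range(n_episodes))  # no failure occurred
        for _ in range(n_trials)
    )
    return misses / n_trials

# With p = 1/1000 and a budget of n = 1/p = 1000 episodes, roughly
# e^-1 (about 37%) of evaluation runs observe no failure at all.
print(fraction_runs_with_no_failures(p=1e-3, n_episodes=1000))
```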
Appendix B Proofs for Section 3.2
We start with the proof of Proposition 3.2. Let $Z = C \cdot \frac{\mathrm{d}P}{\mathrm{d}Q}(X)$, where $X \sim Q$ and $C \in \{0, 1\}$ indicates whether the agent fails on instance $X$. Under the condition that $\frac{\mathrm{d}Q}{\mathrm{d}P}(x) > 0$ whenever $f(x) > 0$, we have $\mathbb{E}[Z] = \mathbb{E}_{X \sim P}[f(X)] = p$, hence the estimator is unbiased.
Since $Z_1, \ldots, Z_n$ are independent and identically distributed, $\operatorname{Var}(\hat{p}_n) = \operatorname{Var}(Z)/n$ for any $n$. Hence, the variance of $\hat{p}_n$ is minimized when the variance of $Z$ is minimized. Since $\operatorname{Var}(Z) = \mathbb{E}[Z^2] - p^2$, this variance is minimized when $\mathbb{E}[Z^2] = \mathbb{E}_{X \sim Q}\big[f(X)\big(\frac{\mathrm{d}P}{\mathrm{d}Q}(X)\big)^2\big]$ is minimized, where we used the definition of $f$ and that $C$ is binary-valued. By integral Cauchy–Schwarz, this is minimized by the proposal distribution whose density with respect to $P$ is proportional to $\sqrt{f}$, leading to the desired claim.
Notably this differs from the deterministic case, where the variance-minimizing proposal distribution has density proportional to $f$ with respect to $P$. Characterizing the variance-minimizing proposal distribution for the general stochastic case, when the cost is not binary-valued, is an interesting open question.
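To make the variance reduction concrete, the following toy sketch compares vanilla Monte Carlo against importance sampling with a proposal whose density is proportional to $\sqrt{f}$, on a hypothetical discrete instance distribution; the per-instance failure probabilities and sample sizes are invented for illustration.

```python
import math
import random

# Toy setup: 10 equally likely initial conditions, one far riskier than the rest.
f = [1e-4] * 9 + [0.05]            # true failure probability of each instance
p_true = sum(f) / len(f)           # overall failure probability under P

q = [math.sqrt(v) for v in f]      # proposal density ~ sqrt(f) w.r.t. uniform P
s = sum(q)
q = [v / s for v in q]

def estimate(proposal, n, rng):
    """One estimate of p_true from n episodes: the average of C * dP/dQ(X)
    over X ~ Q, where C is the stochastic failure indicator."""
    total = 0.0
    for _ in range(n):
        if proposal is None:                       # vanilla Monte Carlo: Q = P
            x, w = rng.randrange(len(f)), 1.0
        else:                                      # importance sampling
            x = rng.choices(range(len(f)), weights=proposal)[0]
            w = (1 / len(f)) / proposal[x]         # importance weight dP/dQ
        if rng.random() < f[x]:                    # episode fails with prob f[x]
            total += w
    return total / n

rng = random.Random(0)
variances = {}
for name, proposal in [("vanilla MC", None), ("sqrt(f) proposal", q)]:
    ests = [estimate(proposal, 200, rng) for _ in range(500)]
    variances[name] = sum((e - p_true) ** 2 for e in ests) / len(ests)
    print(name, variances[name])
```

Both estimators are unbiased; the proposal only changes the variance, and the $\sqrt{f}$ proposal is markedly lower-variance on this example.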
Next, we prove Proposition 3.3, which stated that the rejection sampling procedure that accepts a random instance $X \sim P$ with probability $g(X)$ produces a sample from the distribution whose density with respect to $P$ is proportional to $g$. Let $U$ be uniformly distributed on the interval $[0, 1]$ and independent of $X$. Clearly, the probability that a sample produced by rejection sampling falls into a set $A$ is $\Pr(X \in A \mid U \leq g(X))$. Now,
$$\Pr(X \in A \mid U \leq g(X)) = \frac{\Pr(X \in A,\; U \leq g(X))}{\Pr(U \leq g(X))} = \frac{\int_A g \,\mathrm{d}P}{\int g \,\mathrm{d}P}.$$
Since $A \mapsto \int_A g \,\mathrm{d}P / \int g \,\mathrm{d}P$ is a probability measure with density proportional to $g$ with respect to $P$, and the unconditional acceptance probability satisfies $\Pr(U \leq g(X)) = \int g \,\mathrm{d}P > 0$, this shows that the distribution of accepted points is as required.
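The acceptance argument above is the usual rejection sampling recipe; here is a minimal self-checking sketch on a three-point instance space (the acceptance probabilities are arbitrary illustrative values).

```python
import random

def rejection_sample(g, sample_p, rng):
    """Draw X ~ P repeatedly and accept with probability g(X) in [0, 1];
    accepted samples follow the distribution with density ~ g w.r.t. P."""
    while True:
        x = sample_p(rng)
        if rng.random() <= g(x):
            return x

# Toy check: P uniform on {0, 1, 2}; the acceptance probabilities g sum to 1
# here, so the accepted distribution should match g itself.
g_vals = [0.1, 0.2, 0.7]
rng = random.Random(0)
samples = [rejection_sample(lambda x: g_vals[x], lambda r: r.randrange(3), rng)
           for _ in range(20000)]
freqs = [samples.count(x) / len(samples) for x in range(3)]
print(freqs)  # approximately [0.1, 0.2, 0.7]
```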
Appendix C Pseudocode for AVF Adversary
Appendix D Agent Training Details
d.1 Environments
For the Driving domain, we use the TORCS 3D car racing game (Wymann et al., 2000) with settings corresponding to the "Fast car, no bots" setting in Mnih et al. (2016). At each step, the agent receives an observation vector summarizing its position, velocity, and the local geometry of the track. The agent receives a reward proportional to its velocity along the center of the track at its current position, while collisions with a wall terminate the episode and provide a large negative reward.
Each problem instance is a track shape, parameterized by a vector encoding the locations and curvatures of the waypoints specifying the track. Problem instances are sampled by randomly drawing a set of waypoints from a fixed distribution, and rejecting any tracks with self-intersections.
For the Humanoid domain, we use the Humanoid Stand task from Tassa et al. (2018). At each step, the agent receives an observation vector summarizing its joint angles and velocities, and the locations of various joints in Cartesian coordinates. The agent receives a reward proportional to its head height. If the head height drops below a fixed threshold, the episode is terminated and the agent receives a large negative reward.
Each problem instance is defined by an initial standing pose. To define this distribution, we sample random trajectories from a separately trained D4PG agent. We sample a random state from these trajectories, ignoring the initial portion of each trajectory, as well as any trajectories terminating in failure, to ensure all problem instances are feasible. Together, this yields a large pool of problem instances, so it is unlikely that any problem instance at train or test time has been previously seen.
d.2 Agents
We now describe the agents we used for evaluation on each task. The focus of this work is not on training agents, and hence we leave the agents fixed for all experiments. However, we did make some modifications to decrease agent failure probabilities, as we are most interested in whether we can design effective adversaries when failures are rare, and thus there is not much learning signal to guide the adversary.
On Driving, we use an asynchronous batched implementation of Advantage Actor-Critic, using a V-trace correction, following Espeholt et al. (2018). We use Population-Based Training to evolve the learning rate and entropy cost weights (Jaderberg et al., 2017). Each learner in the population is trained over a large number of actor steps, distributed over many CPU workers and a single GPU learner; since episodes are bounded in length, this equates to a correspondingly large number of episodes per learner, and many more for the entire population. At test time, we take the most likely predicted action, rather than sampling, as we found it decreased the failure probability by roughly half. Additionally, we used a hand-crafted form of adversarial training by training on a more difficult distribution of track shapes, with sharper turns than the original distribution, since this decreased the failure probability severalfold.
On Humanoid, we use a D4PG agent, using the same hyperparameters as Barth-Maron et al. (2018). The agent is trained over a large number of learner steps, distributed over many CPU workers and a single GPU learner; since episodes are bounded in length, this again equates to a large number of episodes. We use different exploration rates on the actors, as in Horgan et al. (2018), with noise drawn from normal distributions whose standard deviations are evenly spaced across the actors. We additionally use demonstrations from the agent described in the previous section, which was used for defining the initial pose distribution, following Vecerík et al. (2017). In particular, we collect a set of demonstration trajectories and use demonstration data for half of each batch to update both the critic and policy networks. This results in a substantial improvement in the failure rate.
Appendix E Experimental Details
e.1 AVF Details
When constructing the training datasets, we ignore the beginning of training, during which the agent fails very frequently; this amounts to using only the later episodes of training data on both Driving and Humanoid. To include information about the training iteration of the agent, we simply include a single real value: the current training iteration divided by the maximum number of training iterations. Similarly, for noise applied to the policy, we include the amount of noise divided by the maximum amount of noise. These features are concatenated before applying the MLP. We train both AVF models to minimize the cross-entropy loss with the Adam optimizer (Kingma & Ba, 2014), which requires on the order of minutes on a single GPU for each domain.
On Driving, the AVF architecture uses a small MLP with a single output, with a sigmoid activation to convert the output to a probability. On Humanoid, since failures of the most robust agents are very rare, we use a simplified Differentiable Neural Dictionary architecture (Pritzel et al., 2017) to more effectively leverage the limited number of positive training examples. In particular, to classify an input $x$, the model first retrieves the nearest neighbors $x_i$ and their corresponding labels $y_i$ from the training set. The final prediction is then a smoothed weighted average $\hat{f}(x) = \frac{\sum_i w_i y_i}{c + \sum_i w_i}$, where $c$ is a learned pseudocount, which allows the model to make uncertain predictions when all weights are close to $0$. To compute the weights, each point is embedded by an MLP, and the weight of each neighbor is computed as $w_i = k(e(x), e(x_i))$, where $k$ is a Gaussian kernel. We tried different hyperparameters for the number of MLP layers on Driving and the number of neighbors on Humanoid, and selected based on a held-out test set using data collected during training. In particular, to match real-world situations, we did not select hyperparameters based on either failure search or risk estimation experiments.
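A minimal sketch of this kind of smoothed nearest-neighbor predictor follows. The identity embedding, Gaussian-kernel bandwidth, pseudocount value, and toy memory below are illustrative assumptions, not the configuration used in our experiments.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dnd_predict(x, memory, embed, k=5, pseudocount=1.0, bandwidth=0.2):
    """Smoothed k-NN failure-probability prediction: a Gaussian-kernel-weighted
    average of neighbor labels, pulled toward 0 by a pseudocount whenever all
    kernel weights are small (i.e. the query is far from the training data)."""
    e_x = embed(x)
    neighbors = sorted(memory, key=lambda item: dist2(e_x, embed(item[0])))[:k]
    weights = [math.exp(-dist2(e_x, embed(xi)) / (2 * bandwidth ** 2))
               for xi, _ in neighbors]
    weighted_labels = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_labels / (pseudocount + sum(weights))

# Illustrative 1-D memory; label 1 marks a failure.
memory = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0), ((0.9,), 1), ((1.0,), 1)]
embed = lambda v: v
print(dnd_predict((0.95,), memory, embed))  # near the failure cluster: high
print(dnd_predict((5.0,), memory, embed))   # far from all data: near 0
```

The pseudocount in the denominator is what lets the predictor fall back toward zero probability for queries unlike anything in memory.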
e.2 Failure Search
We ran the VMC adversary for a fixed budget of experiments for each agent. The expected cost (the expected number of experiments until the first failure) is simply $1/\hat{p}$, where $\hat{p}$ is the fraction of experiments resulting in failures. For the AVF adversary, we ran fewer experiments for each agent: since failures are more common, fewer experiments are necessary to estimate the failure probabilities precisely. For the PR adversary, the adversary first selected problem instances which caused failures on the actor with the least noise, most recent first, followed by the actor with the second least noise, and so on. We also tried a version which did not account for the amount of noise and simply ran the most recent failures first, which was slightly worse.
For the failure search experiments on the Humanoid domain, we always evaluate the best version of each agent according to the AVF model selection procedure from Section 4.3. We chose this because we are most interested in whether we can design effective adversaries when failures are rare, and simply choosing the final agent is ineffective, as discussed in Appendix E.4.
We now discuss the sample size parameter $n$ in Algorithm 2. As discussed, using larger values of $n$ applies more selection pressure in determining which problem instances to run experiments on, while using smaller values of $n$ draws samples from a larger fraction of the instance distribution, and thus provides some robustness to inaccuracies in the learned AVF. Of course, other more sophisticated approaches may be much more effective in producing samples from diverse regions of the instance space, but we only try this very simple approach. For our experiments, we use fixed default values of $n$ for Driving and Humanoid. We did not perform any hyperparameter selection on $n$, in order to match the real-world setting where the experimenter has only a fixed budget of experiments and wishes to find a single failure. However, we include additional experiments to study the robustness of the AVF adversary to the choice of $n$. These are summarized in Table 2.
Table 2: Failure search performance of the AVF adversary on Driving and Humanoid across a range of sample sizes $n$.
We observe that on both domains, applying greater optimization pressure improves results over the default choices of $n$ we used. However, relative to the improvements over the VMC adversary, these differences are small, and the AVF adversary is dramatically faster than the VMC adversary for all choices of $n$ we tried.
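A minimal sketch of the screening step that this sample size controls, assuming the adversary simply proposes the highest-scoring of $n$ instances sampled from the training distribution (the uniform instance distribution and the stand-in AVF scorer below are hypothetical):

```python
import random

def avf_adversary_propose(sample_instance, avf_score, n, rng):
    """Sample n candidate problem instances and return the one the learned AVF
    predicts is most likely to cause a failure. Larger n applies more selection
    pressure; smaller n keeps more diversity, hedging against AVF errors."""
    candidates = [sample_instance(rng) for _ in range(n)]
    return max(candidates, key=avf_score)

# Stand-ins: 1-D instances in [0, 1]; the hypothetical AVF rates larger x riskier.
sample_instance = lambda r: r.random()
avf_score = lambda x: x
rng = random.Random(0)
for n in (2, 100):
    print(n, avf_adversary_propose(sample_instance, avf_score, n, rng))
```

With a larger candidate pool, the proposed instance concentrates near the highest-risk region of the distribution.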
e.3 Risk Estimation Details
In our paper, we showed that the AVF estimator can estimate the failure probability of an agent much faster than the VMC estimator. In particular, we showed in both the Driving and Humanoid domains that our estimator requires an order of magnitude fewer episodes to achieve a 3-approximation at any given confidence level. In general, we may be interested in an $a$-approximation; that is, we might want the estimates to fall in the range $[p/a, ap]$, where $p$ is the probability of failure that we wish to measure. One might ask whether our results are sensitive to the choice of $a$. In other words, did we select $a$ so that our results look good? To address this concern, we include results for lower and higher choices of $a$. We see that the AVF estimator performs significantly better than the VMC estimator across choices of $a$.
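The notion of an $a$-approximation can be made concrete with a small simulation: for vanilla Monte Carlo, the episode budget needed for the estimate $\hat{p}$ to land in $[p/a, ap]$ with fixed confidence grows quickly as $a$ shrinks. All numbers below are illustrative, not our experimental settings.

```python
import random

def episodes_for_approximation(p, a, confidence=0.9, n_trials=500, seed=0):
    """Smallest episode budget n (over a doubling schedule) at which the vanilla
    Monte Carlo estimate p_hat = failures / n falls in [p/a, a*p] with at least
    the given probability, estimated by repeated simulation."""
    rng = random.Random(seed)
    n = 10
    while True:
        hits = 0
        for _ in range(n_trials):
            failures = sum(rng.random() < p for _ in range(n))
            hits += p / a <= failures / n <= a * p
        if hits / n_trials >= confidence:
            return n
        n *= 2

n_loose = episodes_for_approximation(p=0.01, a=3)    # coarse 3-approximation
n_tight = episodes_for_approximation(p=0.01, a=1.5)  # tighter approximation
print(n_loose, n_tight)  # the tighter guarantee needs a much larger budget
```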
e.4 Model Selection
Measuring ground truth failure probabilities. In Figure 4, we show the "ground truth" failure probabilities used for computing model robustness in Figure 2. Each bar represents a version of the agent at a different point in training, taking a new version at regular intervals of learner steps. Failure probabilities are computed by taking the fraction of experiments resulting in failures, using a fixed number of experiments per agent. As in Figure 2, we show robustness (the reciprocal $1/p$ of the failure probability $p$) rather than failure probabilities, so that it is easier to see differences between the most robust models. Robustness should be interpreted as the expected number of experiments until seeing a failure. Thus, when we report robustness in Figure 2 for averages over multiple agents, we first average the failure probabilities of these agents before applying the reciprocal; this represents the expected number of experiments until failure if the agent is randomly selected from that set.
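This averaging convention (average the failure probabilities first, then take the reciprocal) can be sketched as follows; note that it generally differs from averaging per-agent robustness values directly. The probabilities are illustrative.

```python
def robustness(failure_probs):
    """Expected number of experiments until the first failure when the agent is
    drawn uniformly from the given set: the reciprocal of the mean probability."""
    mean_p = sum(failure_probs) / len(failure_probs)
    return 1.0 / mean_p

probs = [0.1, 0.001]        # two hypothetical agents
print(robustness(probs))                        # 1 / 0.0505, about 19.8
print(sum(1 / p for p in probs) / len(probs))   # naive average: 505.0
```

The gap between the two numbers shows why the order of averaging matters: a single fragile agent dominates the expected time to failure.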
We note that the failure probabilities we report after model selection in Figure 2 are conservative, for two reasons. First, for the two agents which did not fail in our ground truth measurements, we use a small nonzero failure probability rather than zero. Second, because our ground truth measurements are only noisy measurements of the true failure probability for each agent, it is possible that even an optimal model selection algorithm would not pick the optimal agent according to our ground truth measurements. We could run the ground truth measurements for longer, but this already required a very large number of experiments, corresponding to many CPU-days. Nonetheless, even with these conservative estimates, the AVF estimator showed significant improvements over the VMC estimator for model selection.
Non-monotonicity of robustness over training. Finally, we note that robustness is not monotonically increasing over training, and thus simply taking the last agent (or randomly choosing from the last few) would be a worse model selection strategy than uniform exploration with the AVF estimator. This differs from the standard setting in deep reinforcement learning, with tightly bounded rewards, where expected return usually (roughly) monotonically increases over the course of training. As discussed in Sections 1 and 5, when the agent is optimized using VMC, failures are observed very rarely, and thus the optimization process does not effectively decrease the failure probability, which is consistent with the non-monotonicity we observe here.
Appendix F Extended Related Work
Learning highly reliable agents. Our results also indicate that learning highly reliable agents will likely require a significant departure from some current practices. In particular, since failure events are extremely rare for highly reliable systems, any technique that uses vanilla Monte Carlo (that is, perhaps 99% of the current RL literature) is bound to fail to provide the necessary guarantees. This problem has been studied in the simulation optimization literature, but has so far received little attention in the RL community (e.g., Frank et al. 2008).
The training-time equivalent of our setting can be viewed as a special case of Constrained MDPs (Altman, 1999), where a large cost is incurred at the final state of a catastrophic trajectory. While recent work has adapted deep reinforcement learning techniques to handle constraints, these methods address situations where costs are observed frequently and the objective is to keep expected costs below some threshold (Achiam et al., 2017; Chow et al., 2018), as opposed to avoiding costs entirely, and they are not designed specifically to satisfy constraints when violations would be caused by rare events. This suggests a natural extension to the rare failures case using the techniques we develop here. Related work in safe reinforcement learning (see García & Fernández, 2015, for a review) tries to avoid catastrophic failures either during learning or deployment. Approaches for safe exploration typically rely on strong knowledge of the transition dynamics (Moldovan & Abbeel, 2012) or assumptions about the accuracy of a learned model (Berkenkamp et al., 2017; Akametalu et al., 2014), and we believe they are complementary to adversarial testing approaches, which make weak assumptions, but also do not provide guarantees. Finally, there has been significant recent interest in training controllers that are robust to system misspecification by exposing them to prespecified or adversarially constructed noise applied, e.g., to observations or system dynamics (Peng et al., 2017; Pinto et al., 2017; Zhu et al., 2018; Mandlekar et al., 2017).