Guaranteeing safety is a vital issue for many modern robotics systems, such as unmanned aerial vehicles (UAVs), autonomous cars, or domestic robots [1, 2, 3]. One approach is to attempt to specify all potential scenarios a robot may encounter a priori. However, this is usually impractical: such specifications are either computationally expensive to compute, or they fail to cover the uncertain and diverse environments robots must deal with today. Hence, we need to design algorithms that let robots safely and autonomously learn about the uncertain environments they live in, which can potentially address both problems [4, 5].
Reinforcement learning algorithms autonomously perform exploration, and they have shown promising results in many fields of artificial intelligence. Therefore, it is natural to explore the implications of reinforcement learning in robotics. Unlike most applications in artificial intelligence, where unsafe outcomes of learning can occur in simulation, in robotics we need to avoid unsafe scenarios at all costs. Therefore, safe learning, which is the process of applying a learning algorithm such as reinforcement learning while still satisfying a set of safety specifications, has attracted great interest in recent years. Safety usually has two main interpretations: one is related to stochasticity of the environment, where the goal is to guarantee staying within a given performance bound, as is commonly studied in robust control [6, 7, 8, 9, 10, 11, 12]. The second interpretation concerns the system falling into an undesirable physical state, which is the common interpretation used in robotics [13, 14, 15, 16, 17]. In this paper, we focus on safe learning and exploration as the avoidance of 'unsafe', i.e. physically undesirable, states. We refer to [18] for a survey on safe reinforcement learning.
To perform exploration in safety-critical systems, prior knowledge of the task is often incorporated into the exploration process. In [19] and [20], the authors proposed methods to safely explore a deterministic Markov Decision Process (MDP) using Gaussian processes. In their work, they assumed the transition model is known and that there exists a predefined safety function. Both of these assumptions can be quite restrictive when the system is going to operate in unknown environments. In our work, we address both of these challenges by considering unknown transition models and no access to a predefined safety function. Similarly, other work has combined reachability analysis with Gaussian processes to perform safe reinforcement learning [17, 21], and used a safety metric to improve the algorithm. However, it is not trivial to derive an appropriate safety metric in many robotics tasks. Other techniques utilize teacher demonstrations to avoid unsafe states [13, 14, 22]. However, teacher demonstrations are usually difficult to capture, as operating robots with high degrees of freedom can be challenging, especially if the system dynamics are unknown due to the existing uncertainty in the environment.
In our work, we propose an algorithm to safely and autonomously explore a deterministic MDP whose transition function is unknown. We take a natural definition of safety, similar to [23]: if we can recover from a state s, i.e. we can move from it to a state that is known to be safe, then s is also safe.
Instead of relying on Gaussian processes or other estimation procedures, our exploration algorithm directly leverages the underlying continuity assumptions, and so guarantees safety deterministically. We demonstrate our algorithm in simulation on two different navigation tasks.
II Problem Definition
Our goal in this project is to design an algorithm for a robot to safely and efficiently explore uncertain parts of the environment. We want an algorithm that deterministically ensures safety and expands the size of the known safe state set. The theoretical work in this section will build towards formalizing this goal in equation (1).
II-A Introductory Assumptions
First, we begin by formalizing our interactions with the environment.
We model the dynamical system living in an unknown environment as a deterministic MDP. Such an MDP is a tuple with a set of states S, a set of actions A, and an unknown deterministic transition model f: S × A → S. Let s_0 ∈ S be the initial state of the MDP.
For example, for a quadrotor, the states in S might be the quadrotor's pitch, yaw and roll, the quadrotor's angular and linear velocities, and its height from the ground, all concatenated together to form a vector. The actions in A could be the fixed rotation speeds of the rotors, and the transition function f would apply those speeds to the rotors over a fixed time interval.
The algorithm we will develop is applicable to finite state and action spaces. When S or A are fundamentally continuous, our algorithm can be applied to finite, fine-grained discretizations of those spaces. We show in Section IV how to handle this discretization.
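As an illustration of this discretization step, a minimal sketch for 1-D state and action intervals; the ranges and step sizes here are our own illustrative choices, not values from the paper:

```python
import numpy as np

# Hypothetical discretization of continuous 1-D state and action spaces into
# the finite sets S and A that the algorithm operates on. The intervals and
# step sizes below are illustrative only.
def discretize(lo, hi, step):
    """Uniformly sample [lo, hi] (inclusive) with the given step size."""
    return np.round(np.arange(lo, hi + step / 2, step), 10)

states = discretize(-3.0, 3.0, 0.1)   # finite state set S
actions = discretize(-1.0, 1.0, 0.2)  # finite action set A
```

A finer grid gives tighter safe-set boundaries at the cost of more computation, a trade-off the paper revisits in Section IV.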
We now make a few definitions that will help us reason about our knowledge of the environment and our ability to take a sequence of actions to efficiently explore the environment.
We denote the knowledge the actor has about the transition function as a set K of transition triplets, such that the following implication holds: (s, a, s') ∈ K implies f(s, a) = s'.
Let ā, with appropriate superscripts when necessary, denote a sequence of actions; a_i is the i-th action in the list and |ā| its cardinality. We overload the transition function such that f(s, ā) = s' if and only if taking the actions of ā sequentially from state s yields the final state s'. Lastly, Ā denotes the set of all possible ordered sequences of actions.
Without further assumptions about f, it is not possible to perform safe exploration from s_0, because we cannot take any action from the initial state with limited knowledge of the action's safety. Therefore, we assume we are given an initial safe set S_0 such that the initial state of the system is in this set, i.e. s_0 ∈ S_0. Furthermore, we assume as in [19] that for any s, s' ∈ S_0, we are given a list of actions ā such that f(s, ā) = s'.
Formally, we assume we are given an initial safe set S_0 such that s_0 ∈ S_0, an initial knowledge set K_0, and restate the above assumption as: for all s, s' ∈ S_0, there exists ā ∈ Ā such that f(s, ā) = s'.
We further make a set of assumptions on Lipschitz continuity of the transition function to enable safe exploration.
f is L_s-Lipschitz continuous over the states with some distance metric d_s: d_s(f(s, a), f(s', a)) ≤ L_s · d_s(s, s') for all s, s' ∈ S and a ∈ A.
Similarly, f is L_a-Lipschitz continuous over the actions with d_s and the additional distance metric d_a: d_s(f(s, a), f(s, a')) ≤ L_a · d_a(a, a') for all s ∈ S and a, a' ∈ A.
Practically, the Euclidean distance is often used for both d_s and d_a.
Note that these requirements are mild and naturally satisfied in most domains: if we take the same action from two similar states, we will end up in similar states; and if we take similar actions from the same state, we will again end up in similar states. For this algorithm, we assume that L_s and L_a have been estimated via some prior methodology, and we note that larger-than-optimal values of these constants, while leading to less efficient algorithms, still satisfy Assumption 3.
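Combining the two Lipschitz assumptions, a single known transition bounds where every nearby state-action pair can land. A minimal sketch of this bound for 1-D states and actions with Euclidean metrics; the Lipschitz constants and the function name are our own illustrative choices:

```python
# If f(s, a) = s_next is known, Lipschitz continuity confines f(s2, a2) to a
# ball around s_next of radius L_s*d_s(s, s2) + L_a*d_a(a, a2). The constants
# here are placeholders; the paper assumes they are estimated beforehand.
L_S, L_A = 1.0, 1.0  # assumed Lipschitz constants over states and actions

def outcome_radius(s, a, s2, a2, L_s=L_S, L_a=L_A):
    """Radius of the ball guaranteed to contain f(s2, a2), centered at the
    known outcome f(s, a), using Euclidean distance in 1-D."""
    return L_s * abs(s - s2) + L_a * abs(a - a2)
```

This bound is the basic building block behind the uncertain transition function defined in Section II-B.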
With this setup, we can now define our notion of safety.
We define a state s to be safe with respect to S_0 and knowledge set K if there exists a recovery algorithm that can use the information in K and the Lipschitz assumptions to confidently produce actions which will transition the MDP from s into S_0 after a finite number of steps. We define an action a to be safe at state s if all possible outcomes of a, with respect to K and the Lipschitz assumptions, are safe. When calling a set safe, the choice of K should be clear from context.
For instance, in our quadrotor example, S_0 might include various hovering states, and thus we consider any state to be safe if we can return to a hovering position from that state. We note that our notion of safety is similar to the safety definition of [23].
This definition of safety, while theoretically satisfying, is not computable in its current form, as we have not described what it means to “use information to confidently produce actions”. With a bit more work, we will do so in Definition 7 below.
We now state some important observations about our definition of safety.
Suppose S is a safe set with respect to some knowledge set K. Then S is contained in the true safe set, i.e. the set of states from which the system can actually return to S_0 under the true dynamics f. However, this upper bound on the safe set cannot normally be calculated, as f is unknown.
Suppose S is safe with respect to some K, and that K ⊆ K'. Then S is safe with respect to K'.
II-B Computing the Safe Set
We now utilize the knowledge set K to determine which states and actions are safe.
In order to handle unknown states, we define an uncertain transition function D, parameterized by knowledge K, that maps each state-action pair to all of its possible outcomes: D(s, a | K) is the intersection, over all (s_i, a_i, s'_i) ∈ K, of B(s'_i, L_s · d_s(s, s_i) + L_a · d_a(a, a_i)), where B(c, r) denotes the hypersphere centered at c with radius r over the distance metric d_s. A visualization of the D-function is shown in Fig. 1.
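A minimal sketch of this uncertain transition function for 1-D states over a sampled grid; the Lipschitz constants, argument names, and helper name are our own illustrative choices:

```python
L_S, L_A = 1.0, 1.0  # assumed Lipschitz constants

def possible_outcomes(s, a, knowledge, sampled_states, L_s=L_S, L_a=L_A):
    """Sketch of D(s, a | K): every triplet (s_i, a_i, s_next) in the
    knowledge set confines f(s, a) to a ball around s_next of radius
    L_s*|s - s_i| + L_a*|a - a_i|; return the sampled states lying in
    the intersection of all such balls."""
    return [c for c in sampled_states
            if all(abs(c - s_next) <= L_s * abs(s - s_i) + L_a * abs(a - a_i)
                   for (s_i, a_i, s_next) in knowledge)]
```

With an empty knowledge set the intersection is vacuous and every sampled state is a possible outcome; each new triplet can only shrink the set, which is the monotonicity property used later in Theorem 2.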
If S is a safe set with respect to knowledge K, then we define the expansion function R(S) as the union of S with all possible outcomes D(s, a | K) of state-action pairs (s, a), s ∈ S, for which every state in D(s, a | K) is itself safe with respect to S and K.
We also let R^1(S) = R(S), with R^n(S) = R(R^{n-1}(S)). Moreover, we denote R̄(S) to be the fixed point of the expansion function.
Using these definitions, we make the following crucial observation that is similar to Eq. (4):
If S is safe with respect to K, then so is R(S).
Let s' ∈ R(S). Then, at least one of the following is true: s' ∈ S, or s' ∈ D(s, a | K) for some s ∈ S and a ∈ A. If s' ∈ S, then since S is safe with respect to K, so is s'. Otherwise, since there exists a pair (s, a) such that s' ∈ D(s, a | K), even if f(s, a) is still unknown, we do know that s' is a possible outcome of a pair whose outcomes are all safe. From there we can use the recovery algorithm provided by Definition 3 for s' to return to S_0.
If we take action a_t from the state s_t at some time step t, then we recursively define our knowledge after taking step t to be K_{t+1} = K_t ∪ {(s_t, a_t, s_{t+1})}.
For timestep t, recursively define the safe set after step t to be S_{t+1} = R̄(S_t), computed with respect to K_{t+1}.
Using Theorem 1, we see that this definition is justified, i.e. S_t is safe with respect to K_t for all t. We now have a computable set which we can pair with our more theoretical definition of safety, Definition 3.
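The fixed-point computation of the safe set can be sketched as follows. This is a simplified toy version: recovery is modeled as a single action whose every possible outcome already lies in the current safe set, and `possible_outcomes` stands in for the uncertain transition function D; all names are ours:

```python
# Sketch of the fixed-point expansion R-bar(S0): a candidate state c is added
# when some action taken at c is guaranteed, under the Lipschitz bounds
# encoded in `possible_outcomes(c, a)`, to land in the current safe set, so
# recovery into S0 is always possible. Illustrative only.
def expand_safe_set(safe0, states, actions, possible_outcomes):
    safe = set(safe0)
    changed = True
    while changed:                      # iterate R until the fixed point
        changed = False
        for c in states:
            if c in safe:
                continue
            for a in actions:
                outs = possible_outcomes(c, a)
                if outs and all(o in safe for o in outs):
                    safe.add(c)         # every possible outcome is safe
                    changed = True
                    break
    return safe
```

The outer loop terminates because the safe set can only grow and the state set is finite, mirroring the fixed-point argument in the text.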
At step t, a state s is safe if and only if there exists an a ∈ A such that the action a is safe at s.
We are now ready to state our goal:
Our goal is to maximize the rate of safe exploration over the number of actions taken, given an initial safe set S_0 and state s_0. Formally, we seek to maximize |S_T| / T, (1)
where T is the total number of actions taken, and S_T is the set of states deterministically known to be safe by the algorithm after T actions, as defined in Eq. (4).
In order to present our algorithm that efficiently expands the safe set, we first introduce some notation and functions.
We define the path-knowledge function Π, parametrized with K, that carries the transition triplets: Π(s, s' | K) = 1 if and only if there exist triplets in K that let us move from s to s', and Π(s, s' | K) = 0 otherwise.
While performing safe exploration, it is both desirable and useful to learn the transition function. While K implicitly captures this, it is useful to denote it as a function.
We therefore denote the transitions that have been learnt with certainty (without ambiguity) as the function g: g(s, a) = s' if (s, a, s') ∈ K, and g(s, a) = ∅ otherwise, where ∅ is a placeholder that represents "an unknown state".
Since the MDP is deterministic, g is a properly defined function, i.e. each state-action pair has at most one certain outcome.
We also overload the function g for lists of actions, similar to f.
We define the path-planning function P: P(s, s' | K) gives the smallest list of actions that moves the system from s to s' if such a transition is known in K. In the case that there exist several such sequences, we assume P gives any one of them.
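A sketch of this path-planning function as breadth-first search over the certainly-known transitions (the BFS implementation the simulations in Section IV also use); the dictionary encoding of g and all names are our own choices:

```python
from collections import deque

# Sketch of P(s, s_target | K): BFS over transitions learned with certainty,
# returning a shortest action list, or None if no certain path is known.
# `known` maps (state, action) -> certain next state, mirroring g in the text.
def plan_path(s, target, known):
    queue = deque([(s, [])])
    visited = {s}
    while queue:
        cur, path = queue.popleft()
        if cur == target:
            return path                 # shortest by BFS ordering
        for (st, a), nxt in known.items():
            if st == cur and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [a]))
    return None
```

Because BFS explores paths in order of length, the first path reaching the target is a smallest one, matching the definition of P.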
Due to the assumption that we know how to move from one state to another inside the initial safe set, P(s, s' | K_t) is guaranteed to exist for all s, s' ∈ S_0 and all t.
III-B Efficient Exploration
In order to efficiently expand the safe set, we must optimize for the list of actions that will lead to the largest safe set expansion with the minimum number of actions. We will do this in two steps:
We first define a measure that corresponds to the possible safe set expansion from taking a new action.
We then greedily optimize to take actions which are considered efficient under that measure.
The most straightforward approach for defining a measure that corresponds to safe set expansion is to measure the amount of safe set expansion for each possible outcome of an action, and compute an expected value over all such possibilities. We call this the safe set expansion measure.
However, as we will practically demonstrate in our results, this approach can be highly suboptimal especially for continuous dynamics. Instead, we develop a second measure that quantifies the uncertainty reduction on the outcomes of all state-action pairs by taking an action. This prioritizes exploration towards the safe set boundary in addition to exploring actions which will expand the boundary.
We define the uncertainty reduction measure as the expected decrease in the total uncertainty, where the inner summation of the total uncertainty, Σ |D(s', a' | K)|, is over all state-action pairs, and the expectation is taken under p(s'' | s, a), the modeled probability that action a will move the system from state s to s''.
While the underlying MDP is deterministic, for some actions we only know that they will end up in a specific set of points, so it makes sense to model our uncertainty as a probability distribution over that set. While different probability models can be employed for p, we are going to use a uniform distribution over D(s, a | K) for simplicity.
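A minimal sketch of this measure under the uniform outcome model. Here `possible_outcomes(knowledge, s, a)` is assumed to implement the uncertain transition function D, and all function and argument names are our own:

```python
# Total uncertainty: sum of |D(s, a | K)| over a fixed set of state-action
# pairs. The reduction measure for a candidate (s, a) averages, uniformly
# over its possible outcomes o, the drop in total uncertainty that would
# follow from learning f(s, a) = o. Illustrative sketch only.
def total_uncertainty(knowledge, pairs, possible_outcomes):
    return sum(len(possible_outcomes(knowledge, s, a)) for s, a in pairs)

def uncertainty_reduction(knowledge, s, a, pairs, possible_outcomes):
    before = total_uncertainty(knowledge, pairs, possible_outcomes)
    outs = possible_outcomes(knowledge, s, a)
    if not outs:
        return 0.0
    # uniform model over the possible outcomes (scaling factor 1/|D(s,a|K)|)
    return sum(before - total_uncertainty(knowledge + [(s, a, o)], pairs,
                                          possible_outcomes)
               for o in outs) / len(outs)
```

Note that the measure rewards actions whose outcomes would shrink D for many other state-action pairs, not just the pair being tried.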
To show that our measure is well defined, we have the following theorem.
The uncertainty does not increase when more actions are taken. In fact, the updated uncertainty set is a subset of the previous one. Formally, D(s, a | K_{t+1}) ⊆ D(s, a | K_t), and thus |D(s, a | K_{t+1})| ≤ |D(s, a | K_t)|, for all s ∈ S and a ∈ A.
By Theorem 2, we can write:
Our goal is now to take steps that maximize the reduction in uncertainty. We realize that it is computationally difficult to maximize this reduction over sequences of multiple uncertain actions, so instead we will greedily optimize for the best immediate reduction in uncertainty. However, we still cannot directly maximize over all states s and actions a, because at any step we are at a fixed state s_t, and getting to another state s may require many intermediate actions. Thus we must optimize over paths of actions starting at s_t. A second realization is that though it may sometimes be more efficient to get near a state via paths containing uncertain actions, it is difficult to optimize over such paths; thus we only consider paths to s which consist of certain actions.
The following theorem will be useful for our algorithm:
Taking actions that are already in the collection of knowledge K does not lead to any safe set expansion.
We first note that the update rule depends only on the uncertainty function D, which depends on K, L_s and L_a. Suppose (s, a, s') ∈ K. When we take the action a from state s, we do not add a new element to K. Since K does not change, there will be no change in D for any state and action, so we will not be able to expand the safe set.
Due to Theorem 3, if we take a path of certain actions from s_t to some state s and then take an uncertain action at s, we know that the entire safe set expansion will come from the last, uncertain step. Thus, we wish to perform the following optimization:
Again due to Theorem 3, if two different action lists lead to the same state s from state s_t, then the one with smaller cardinality fares better in the optimization. This means we can just optimize over the shortest paths between state s_t and the optimization variable state s. The optimization can then be reformulated as follows:
This optimization is over finite variables for finite discrete MDPs.
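The resulting greedy step can be sketched as follows: among states reachable from the current state via certain actions, pick the uncertain final action whose expected uncertainty reduction per action taken (path length plus one) is largest. The helpers `plan_path` and `reduction` are assumed to implement the path planner P and the uncertainty-reduction measure; all names are ours:

```python
# Sketch of the greedy exploration step: optimize reduction-per-action over
# (certain path to s) + (one uncertain action at s). Illustrative only.
def choose_next(current, states, actions, known, plan_path, reduction):
    best, best_score = None, float("-inf")
    for target in states:
        path = plan_path(current, target, known)
        if path is None:
            continue                   # no certain path to this state
        for a in actions:
            if (target, a) in known:
                continue               # certain actions cannot expand the set
            score = reduction(target, a) / (len(path) + 1)
            if score > best_score:
                best, best_score = (path + [a]), score
    return best
```

Dividing by the path length captures the "per number of actions" part of the objective in equation (1).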
III-C Overall Algorithm
We first present the pseudocode for the algorithm blocks that we have formalized so far. Algorithm 1 is the pseudocode for computing R̄(S_t), which corresponds to the procedure that we use for expanding the safe set at iteration t.
In order to perform the optimization for efficient exploration, we compute the expected total uncertainty reduction as in Algorithm 2. Then, Algorithm 3 uses that procedure to optimize for uncertainty reduction. Note that in Algorithm 2, we use the scaling factor 1/|D(s, a | K)| in line 7, since we are using a uniform distribution over the possible outcomes of an action.
Lastly, we note that it is possible to have state-action pairs that are not in the current knowledge set, but have only one possible outcome. Adding these pairs to the knowledge in each iteration can possibly increase efficiency. We present this procedure in Algorithm 4. Note that we check whether |D(s, a | K)| = 1 for discrete MDPs. For continuous MDPs, where we sample the state and action spaces as we will describe in Section IV, we check whether the sampled D(s, a | K) is a singleton.
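This knowledge-augmentation step can be sketched as follows; `possible_outcomes(knowledge, s, a)` is again assumed to implement D, and the names are ours:

```python
# Sketch of Algorithm-4-style augmentation: any uncertain pair whose set of
# possible outcomes is a singleton can be promoted to certain knowledge
# without ever taking the action. Illustrative only.
def augment(knowledge, pairs, possible_outcomes):
    new = list(knowledge)
    for s, a in pairs:
        if not any(ks == s and ka == a for ks, ka, _ in new):
            outs = possible_outcomes(new, s, a)
            if len(outs) == 1:          # outcome is forced by the Lipschitz bounds
                new.append((s, a, outs[0]))
    return new
```

Because promoting one pair can shrink D for others, running this step each iteration can cascade, which is why it can increase efficiency.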
We present the complete algorithm as a pseudocode in Algorithm 5.
IV Simulations and Results
We call the algorithm developed above Safe Exploration Optimized For Uncertainty Reduction. We developed the following alternative methods as baselines to compare our algorithm against:
Random Exploration. In this method, we perform the safe set expansion as in our algorithm. However, each action taken is chosen randomly from the set of all possible actions, and is not necessarily safe.
Safe Exploration with No Optimization.
In this method, we again perform the safe set expansion as in our algorithm. However, each action taken is chosen randomly from the set of actions that are classified as safe at the current state.
Safe Exploration Optimized for Safe Set Expansion. This is similar to Safe Exploration Optimized for Uncertainty Reduction. However, instead of optimizing for the uncertainty reduction measure, we optimize for the safe set expansion measure described earlier. If the maximum expected safe set expansion amount is zero, we take the safe action that is expected to push the system toward its closest safety boundary at that time, so that it can possibly expand the safe set later.
We simulated two different environments with continuous state and action spaces, described below, to analyze the performance of our algorithm. In each environment, we used the Euclidean distance for both d_s and d_a. We used breadth-first search (BFS) for the P function.
For both environments, we began by uniformly sampling the state and action spaces. We used only those original samples when calculating the D-function, and not any new states we might have encountered since starting the simulation, so that the optimization would not be biased towards already visited states.
To quantitatively assess the performance of our algorithm in comparison with the baselines, we defined and used the following metrics:
Safe Set Size. We plot the size of the safe set, |S_t|, against the number of actions taken to evaluate the safe set expansion efficiency.
Total Uncertainty. We plot the total uncertainty, Σ |D(s, a | K_t)|, against the number of actions taken to analyze how fast the total uncertainty decreases. For consistency among the iterations, we sum only over states in the original sampling, i.e. we do not consider other states encountered since starting the simulation.
IV-A Muddy Jumper
We simulated a simple system with the transition model:
where s_t is the state, a_t is the action, and d(·) is the dampening factor. We simulated it with the dampening profile plotted in Fig. 2. This environment was inspired by the idea of a robot jumping on muddy ground. When the dampening factor is zero, the robot is not able to move anymore, so those states are unsafe.
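The exact dynamics constants are elided in the text above, so the following is only an illustrative sketch: we assume a generic damped-jump form s' = s + d(s)·a with a made-up damping profile d that vanishes away from the origin (states with zero damping are unsafe, since the robot can no longer move):

```python
def damping(s):
    """Hypothetical damping profile: 1 at the origin, 0 beyond |s| = 3.
    The real profile is the one plotted in Fig. 2, not this one."""
    return max(0.0, 1.0 - (abs(s) / 3.0) ** 2)

def step(s, a):
    """Assumed transition form s' = s + d(s) * a (illustrative only)."""
    return s + damping(s) * a
```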
With these, we can take and . For our simulations, we used , and . We sample as and as . Hence, the largest safe set is the interval between and . We set and . We present the results of our algorithm in Fig. 3.
It can be seen that random exploration leads to fast uncertainty reduction, because the allowed actions include very large jumps. More importantly, however, it terminates upon reaching unsafe states after only a few actions. The other exploration techniques avoid this problem by taking only safe actions. Still, our algorithm outperforms the baselines in terms of efficiency. It is better than safe exploration with no optimization because it optimizes the uncertainty reduction per action. It can be seen that optimizing for safe set expansion yields good results initially; however, it becomes highly suboptimal afterwards. In this case, the superiority of our algorithm can be attributed to the fact that reducing overall uncertainty leads to larger safe set expansion in future iterations, whereas a greedy optimization for the expansion cannot achieve this. As a side note, it is also interesting that safe exploration with no optimization outperformed safe exploration optimized for safe set expansion in later iterations. This might be because optimizing for expansion often gets stuck at states that are not among the original state samples, so the optimization is only over the immediate next action.
IV-B Hilly Jumper
We simulate another environment with the following transition model:
where s_t is the state, a_t is the action, and h is an environment-dependent function. We simulated it as:
Then, we have
This environment was inspired by a robot jumping on hilly ground, where h is the elevation function. Both the elevation function and its derivative are plotted in Fig. 4. In the outer regions, the robot is on very steep terrain, so it cannot return to the safe set, which is around the zero-state.
With these, we can take . While this problem does not have a single Lipschitz-continuity constant over all states, as the slope increases without bounds in each direction, it locally satisfies Lipschitz-continuity around the central region. For our simulations, we used , , and . We sample as and as . Hence, the largest safe set is the interval between and . We set and . We present the results of our algorithm in Fig. 5.
Unlike the Muddy Jumper environment, random exploration does not lead to very fast uncertainty reduction, because the action set contains only very short jumps. For the same reason, it does not crash: it is unlikely for the robot to leave the safe set with random small jumps. This is even harder because the outer regions slope toward the central region. Safe exploration with no optimization performs even worse than random exploration, because it enforces the safety constraint.
Similar to the Muddy Jumper experiments, the baseline which optimizes over safe set expansion initially gives good performance but later becomes suboptimal. In this case, the reason is the following: after the algorithm expands the safe set on the negative side, it cannot expand further due to the sampling of the state space. It also cannot expand the positive side further, because the algorithm gets stuck near the negative limit. This is because when the system is at a new state near the limit, all actions have some level of uncertainty, so the optimization is only over the immediate actions. And when the system is near the limit, it does not leave that region: if it finds some possibility of safe set expansion, it explores; if it cannot find a possible expansion, it still moves toward the boundary. In fact, when we ran this algorithm for many more actions, we observed that the system always remained near the negative limit. This baseline would require the following additional mechanisms to perform well: 1) detecting when it can be confident that there is no possibility of expansion, and 2) making sure the system moves to unexplored state regions once that confidence is obtained.
On the other hand, our algorithm, safe exploration optimized for uncertainty reduction, outperforms all the baseline methods in terms of efficiency. It can be noted that none of the algorithms reaches the maximum expandable safe set within the allotted number of actions. While the explored safe set might be expanded further with more iterations, it is also limited by the sampling of the state space: denser sampling could increase this limit, but it causes a significant computational burden.
For safe exploration tasks, computational cost is a problem in general. Our algorithm has polynomial complexity in the number of states and actions. While the use of Gaussian processes enables faster computation, directly using Lipschitz continuity makes the algorithm computationally heavier, though we do note that our algorithm is parallelizable. For reference, we initially sample 101 states and 121 possible actions in Muddy Jumper, and 139 states and 7 possible actions in Hilly Jumper. However, the number of states increases during algorithm execution as the system visits states that are not among the initial samples. Additionally, it can be a concern for low-memory systems that our framework requires storing the knowledge set K, which grows linearly with the number of uncertain actions taken.
Both our example environments had 1-dimensional state spaces. We note that in higher dimensional problems, the number of states necessarily grows exponentially in the dimension of state space. In particular, the number of states on the boundary of the safe set at any step is likewise exponential in the dimension of the state space. Since our algorithm does not extrapolate from data in order to produce its safety guarantees, it by design must explore this exponentially sized safe set boundary.
For some specific applications, our algorithm's requirement that we know how to move between any two states inside the initial safe set can be too restrictive. In such cases, our algorithm can still be readily applied provided that there exist some uncertain but safe actions for each state in the initial safe set. While this may hurt efficiency, it enables the use of our algorithm in broader configurations.
In this formulation, our algorithm is limited to deterministic environments. Further research could generalize it to stochastic MDPs and to systems with disturbances. Similarly, our framework requires prior knowledge of the Lipschitz continuity parameters L_s and L_a. In settings where it is impractical to provide estimates of these parameters prior to running this algorithm, the algorithm could be modified to learn them online. However, either of these generalizations would come at the expense of losing the algorithm's deterministic guarantees.
As long as Lipschitz-continuity assumptions can be made, our algorithm can be applied to both linear and nonlinear systems, as well as to systems where safe state set boundaries are very complex. We have demonstrated our algorithm on two simulated environments, and we are planning to design real robotics experiments to showcase our algorithm.
Lastly, in each iteration of our algorithm, we currently take only actions we are certain about before taking a final, uncertain action that we learn from (see the first constraint in (9)). The algorithm could potentially be improved by having it optimize over, and learn from, paths that include several uncertain actions in sequence rather than just one.
In this paper, we presented an algorithm to safely explore safety-critical deterministic MDPs that is efficient in terms of the number of actions it takes. Unlike some previous works, our algorithm does not require the transition function to be known a priori, beyond a small amount of prior knowledge.
Future work will demonstrate our algorithm's use in practice. In addition, future work can further improve the efficiency of our algorithm by allowing it to plan along sequences of multiple uncertain actions. We are also planning to relax the determinism requirement on the MDP and apply our algorithm to stochastic environments. Lastly, further exploration is needed into combining this Lipschitz-grounded approach with model-based approaches to handle higher-dimensional state and action spaces.
Finally, we will study different methods for transferring from a source (e.g., simulation) domain to a target (e.g., real-world) domain. In order for a robotic system to adapt to a new domain, the system must often explore the parameters of the new environment, but must also do so safely. In future work, we will leverage our work on safe exploration in MDPs and Delaunay-based optimization [25] to address this problem.
The authors thank Fred Y. Hadaegh, Adrian Stoica and Duligur Ibeling for the discussions and support. The authors also gratefully acknowledge funding from Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration in support of this work. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
-  D. Sadigh and A. Kapoor, “Safe control under uncertainty with probabilistic signal temporal logic,” in Proceedings of Robotics: Science and Systems (RSS), June 2016.
-  D. Dey, D. Sadigh, and A. Kapoor, “Fast safe mission plans for autonomous vehicles,” Proceedings of Robotics: Science and Systems Workshop, Tech. Rep., June 2016.
-  G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in International Conference on Computer Aided Verification. Springer, 2017, pp. 97–117.
-  B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
-  J. Kober and J. Peters, “Reinforcement learning in robotics: A survey,” in Reinforcement Learning. Springer, 2012, pp. 579–610.
-  S. P. Coraluppi and S. I. Marcus, “Risk-sensitive and minimax control of discrete-time, finite-state markov decision processes,” Automatica, vol. 35, no. 2, pp. 301–309, 1999.
-  M. Heger, “Consideration of risk in reinforcement learning,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 105–111.
-  M. Sato, H. Kimura, and S. Kobayashi, “TD algorithm for the variance of return and mean-variance reinforcement learning,” Transactions of the Japanese Society for Artificial Intelligence, vol. 16, no. 3, pp. 353–362, 2001.
-  V. S. Borkar, “Q-learning for risk-sensitive control,” Mathematics of operations research, vol. 27, no. 2, pp. 294–311, 2002.
-  C. Gaskett, “Reinforcement learning under circumstances beyond its control,” Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA2003), February 2003.
-  P. Geibel and F. Wysotzki, “Risk-sensitive reinforcement learning applied to control under constraints,” Journal of Artificial Intelligence Research, vol. 24, pp. 81–108, 2005.
-  A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
-  P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning in reinforcement learning,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 1–8.
-  P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” The International Journal of Robotics Research, vol. 29, no. 13, pp. 1608–1639, 2010.
-  F. Berkenkamp, A. Krause, and A. P. Schoellig, “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics,” arXiv preprint arXiv:1602.04450, 2016.
-  A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2017.
-  A. K. Akametalu, S. Kaynama, J. F. Fisac, M. N. Zeilinger, J. H. Gillula, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes.” in 53rd IEEE Conference on Decision and Control (CDC). Citeseer, 2014, pp. 1424–1431.
-  J. Garcıa and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
-  M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite markov decision processes with gaussian processes,” in Advances in Neural Information Processing Systems, 2016, pp. 4312–4320.
-  A. Wachi, Y. Sui, Y. Yue, and M. Ono, “Safe exploration and optimization of constrained mdps using gaussian processes,” in AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  J. H. Gillula and C. J. Tomlin, “Guaranteed safe online learning of a bounded system,” in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 2979–2984.
-  J. Garcia and F. Fernández, “Safe exploration of state and action spaces in reinforcement learning,” Journal of Artificial Intelligence Research, vol. 45, pp. 515–564, 2012.
-  T. M. Moldovan and P. Abbeel, “Safe exploration in markov decision processes,” arXiv preprint arXiv:1205.4810, 2012.
-  R. S. Sutton, A. G. Barto, et al., Reinforcement learning: An introduction. MIT press, 1998.
-  S. R. Alimo, P. Beyhaghi, and T. R. Bewley, “Optimization combining derivative-free global exploration with derivative-based local refinement,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 2531–2538.