Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

by   Matteo Turchetta, et al.
ETH Zurich

In classical reinforcement learning, when exploring an environment, agents accept arbitrary short term loss for long term gain. This is infeasible for safety critical applications, such as robotics, where even a single unsafe action may cause system failure. In this paper, we address the problem of safely exploring finite Markov decision processes (MDP). We define safety in terms of an, a priori unknown, safety constraint that depends on states and actions. We aim to explore the MDP under this constraint, assuming that the unknown function satisfies regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm for this task and prove that it is able to completely explore the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.



There are no comments yet.


page 1

page 2

page 3

page 4


Safe Reinforcement Learning in Constrained Markov Decision Processes

Safe reinforcement learning has been a promising approach for optimizing...

Safe Exploration for Interactive Machine Learning

In Interactive Machine Learning (IML), we iteratively make decisions and...

Safe Exploration in Markov Decision Processes with Time-Variant Safety using Spatio-Temporal Gaussian Process

In many real-world applications (e.g., planetary exploration, robot navi...

Safe Reinforcement Learning with Linear Function Approximation

Safety in reinforcement learning has become increasingly important in re...

Formal Language Constraints for Markov Decision Processes

In order to satisfy safety conditions, a reinforcement learned (RL) agen...

Provably Safe PAC-MDP Exploration Using Analogies

A key challenge in applying reinforcement learning to safety-critical do...

Learning to Act Safely with Limited Exposure and Almost Sure Certainty

This paper aims to put forward the concept that learning to take safe ac...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Today’s robots are required to operate in variable and often unknown environments. The traditional solution is to specify all potential scenarios that a robot may encounter during operation a priori. This is time consuming or even infeasible. As a consequence, robots need to be able to learn and adapt to unknown environments autonomously Kober et al. (2013); Argall et al. (2009). While exploration algorithms are known, safety is still an open problem in the development of such systems Schaal and Atkeson (2010). In fact, most learning algorithms allow robots to make unsafe decisions during exploration. This can damage the platform or its environment.

In this paper, we provide a solution to this problem and develop an algorithm that enables agents to safely and autonomously explore unknown environments. Specifically, we consider the problem of exploring a Markov decision process (MDP), where it is a priori unknown which state-action pairs are safe. Our algorithm cautiously explores this environment without taking actions that are unsafe or may render the exploring agent stuck.

Related Work. Safe exploration is an open problem in the reinforcement learning community and several definitions of safety have been proposed Pecka and Svoboda (2014). In risk-sensitive reinforcement learning, the goal is to maximize the expected return for the worst case scenario Coraluppi and Marcus (1999). However, these approaches only minimize risk and do not treat safety as a hard constraint. For example, Geibel and Wysotzki (2005)

define risk as the probability of driving the system to a previously known set of undesirable states. The main difference to our approach is that we do not assume the undesirable states to be known

a priori. Garcia and Fernández (2012) propose to ensure safety by means of a backup policy; that is, a policy that is known to be safe in advance. Our approach is different, since it does not require a backup policy but only a set of initially safe states from which the agent starts to explore. Another approach that makes use of a backup policy is shown by Hans et al. (2008), where safety is defined in terms of a minimum reward, which is learned from data.

Moldovan and Abbeel (2012)

provide probabilistic safety guarantees at every time step by optimizing over ergodic policies; that is, policies that let the agent recover from any visited state. This approach needs to solve a large linear program at every time step, which is computationally demanding even for small state spaces. Nevertheless, the idea of ergodicity also plays an important role in our method. In the control community, safety is mostly considered in terms of stability or constraint satisfaction of controlled systems.

Akametalu et al. (2014) use reachability analysis to ensure stability under the assumption of bounded disturbances. The work in Berkenkamp and Schoellig (2015) uses robust control techniques in order to ensure robust stability for model uncertainties, while the uncertain model is improved.

Another field that has recently considered safety is Bayesian optimization Mockus (1989). There, in order to find the global optimum of an a priori unknown function Srinivas et al. (2010), regularity assumptions in form of a Gaussian process (GP) Rasmussen and Williams (2006) prior are made. The corresponding GP posterior distribution over the unknown function is used to guide evaluations to informative locations. In this setting, safety centered approaches include the work of Sui et al. (2015) and Schreiter et al. (2015), where the goal is to find the safely reachable optimum without violating an a priori unknown safety constraint at any evaluation. To achieve this, the function is cautiously explored, starting from a set of points that is known to be safe initially. The method in Sui et al. (2015) was applied to the field of robotics to safely optimize the controller parameters of a quadrotor vehicle Berkenkamp et al. (2016). However, they considered a bandit setting, where at each iteration any arm can be played. In contrast, we consider exploring an MDP, which introduces restrictions in terms of reachability that have not been considered in Bayesian optimization before.

Contribution. We introduce SafeMDP, a novel algorithm for safe exploration in MDPs. We model safety via an a priori unknown constraint that depends on state-action pairs. Starting from an initial set of states and actions that are known to satisfy the safety constraint, the algorithm exploits the regularity assumptions on the constraint function in order to determine if nearby, unvisited states are safe. This leads to safe exploration, where only state-actions pairs that are known to fulfil the safety constraint are evaluated. The main contribution consists of extending the work on safe Bayesian optimization in Sui et al. (2015) from the bandit setting to deterministic, finite MDPs. In order to achieve this, we explicitly consider not only the safety constraint, but also the reachability properties induced by the MDP dynamics. We provide a full theoretical analysis of the algorithm. It provably enjoys similar safety guarantees in terms of ergodicity as discussed in Moldovan and Abbeel (2012), but at a reduced computational cost. The reason for this is that our method separates safety from the reachability properties of the MDP. Beyond this, we prove that SafeMDP is able to fully explore the safely reachable region of the MDP, without getting stuck or violating the safety constraint with high probability. To the best of our knokwledge, this is the first full exploration result in MDPs subject to a safety constraint. We validate our method on an exploration task, where a rover has to explore an a priori unknown map.

2 Problem Statement

In this section, we define our problem and assumptions. The unknown environment is modeled as a finite, deterministic MDP Sutton and Barto (1998). Such a MDP is a tuple  with a finite set of states , a set of state-dependent actions , a known, deterministic transition model , and reward function . In the typical reinforcement learning framework, the goal is to maximize the cumulative reward. In this paper, we consider the problem of safely exploring the MDP. Thus, instead of aiming to maximize the cumulative rewards, we define  as an a priori unknown safety feature. Although  is unknown, we make regularity assumptions about it to make the problem tractable. When traversing the MDP, at each discrete time step, , the agent has to decide which action and thereby state to visit next. We assume that the underlying system is safety-critical and that for any visited state-action pair, , the unknown, associated safety feature, , must be above a safety threshold, . While the assumption of deterministic dynamics does not hold for general MDPs, in our framework, uncertainty about the environment is captured by the safety feature. If requested, the agent can obtain noisy measurements of the safety feature, , by taking action  in state . The index  is used to index measurements, while  denotes movement steps. Typically .

It is hopeless to achieve the goal of safe exploration unless the agent starts in a safe location. Hence, we assume that the agent stays in an initial set of state action pairs, , that is known to be safe a priori. The goal is to identify the maximum safely reachable region starting from , without visiting any unsafe states. For clarity of exposition, we assume that safety depends on states only; that is, . We provide an extension to safety features that also depend on actions in Sec. 3.

Assumptions on the reward function Ensuring that all visited states are safe without any prior knowledge about the safety feature is an impossible task (e.g., if the safety feature is discontinuous). However, many practical safety features exhibit some regularity, where similar states will lead to similar values of .

In the following, we assume that  is endowed with a positive definite kernel function  and that the function has bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS) Schölkopf and Smola (2002). The norm induced by the inner product of the RKHS indicates the smoothness of functions with respect to the kernel. This assumption allows us to model  as a GP Srinivas et al. (2010)

. A GP is a probability distribution over functions that is fully specified by its mean function 

and its covariance function . The randomness expressed by this distribution captures our uncertainty about the environment. We assume  for all , without loss of generality. The posterior distribution over can be computed analytically, based on  measurements at states  with measurements, , that are corrupted by zero-mean Gaussian noise, . The posterior is a  distribution with mean 

, variance 

, and covariance , where  and is the positive definite kernel matrix, 

. The identity matrix is denoted by


We also assume -Lipschitz continuity of the safety function with respect to some metric  on . This is guaranteed by many commonly used kernels with high probability Srinivas et al. (2010); Ghosal and Roy (2006).

Figure 1: Illustration of the set operators with . The set  can be reached from  in one step and from  in two steps, while only the state  can be reached from . Visiting  is safe; that is, it is above the safety threshold, is reachable, and there exists a safe return path through .

Goal In this section, we define the goal of safe exploration. In particular, we ask what the best that any algorithm may hope to achieve is. Since we only observe noisy measurements, it is impossible to know the underlying safety function  exactly after a finite number of measurements. Instead, we consider algorithms that only have knowledge of  up to some statistical confidence . Based on this confidence within some safe set , states with small distance to 

can be classified to satisfy the safety constraint using the Lipschitz continuity of 

. The resulting set of safe states is


which contains states that can be classified as safe given the information about the states in . While creftype 1 considers the safety constraint, it does not consider any restrictions put in place by the structure of the MDP. In particular, we may not be able to visit every state in  without visiting an unsafe state first. As a result, the agent is further restricted to


the set of all states that can be reached starting from the safe set in one step. These states are called the one-step safely reachable states. However, even restricted to this set, the agent may still get stuck in a state without any safe actions. We define


as the set of states that are able to return to a set  through some other set of states, , in one step. In particular, we care about the ability to return to a certain set through a set of safe states . Therefore, these are called the one-step safely returnable states. In general, the return routes may require taking more than one action, see Fig. 1. The -step returnability operator with considers these longer return routes by repeatedly applying the return operator, in creftype 3, times. The limit contains all the states that can reach the set  through an arbitrarily long path in .

For safe exploration of MDPs, all of the above are requirements; that is, any state that we may want to visit needs to be safe (satisfy the safety constraint), reachable, and we must be able to return to safe states from this new state. Thus, any algorithm that aims to safely explore an MDP is only allowed to visit states in


which is the intersection of the three safety-relevant sets. Given a safe set  that fulfills the safety requirements,  is the set of states from which we can return to  by only visiting states that can be classified as above the safety threshold. By including it in the definition of , we avoid the agent getting stuck in a state without an action that leads to another safe state to take.

Given knowledge about the safety feature in  up to  accuracy thus allows us to expand the set of safe ergodic states to . Any algorithm that has the goal of exploring the state space should consequently explore these newly available safe states and gain new knowledge about the safety feature to potentially further enlargen the safe set. The safe set after  such expansions can be found by repeatedly applying the operator in creftype 4: with . Ultimately, the size of the safe set is bounded by surrounding unsafe states or the number of states in . As a result, the biggest set that any algorithm may classify as safe without visiting unsafe states is given by taking the limit,

Thus, given a tolerance level  and an initial safe seed set  is the set of states that any algorithm may hope to classify as safe. Let  denote the set of states that an algorithm determines to be safe at iteration . In the following, we will refer to complete, safe exploration whenever an algorithm fulfills ; that is, the algorithm classifies every safely reachable state up to  accuracy as safe, without misclassification or visiting unsafe states.

3 SafeMDP Algorithm

We start by giving a high level overview of the method. The SafeMDP algorithm relies on a GP model of to make predictions about the safety feature and uses the predictive uncertainty to guide the safe exploration. In order to guarantee safety, it maintains two sets. The first set, , contains all states that can be classified as satisfying the safety constraint using the GP posterior, while the second one, , additionally considers the ability to reach points in  and the ability to safely return to the previous safe set, . The algorithm ensures safety and ergodicity by only visiting states in . In order to expand the safe region, the algorithm visits states in , a set of candidate states that, if visited, could expand the safe set. Specifically, the algorithm selects the most uncertain state in , which is the safe state that we can gain the most information about. We move to this state via the shortest safe path, which is guaranteed to exist (Lemma 2). The algorithm is summarized in Algorithm 1.

  Inputs: states ,  actions   transition function kernel , Safety threshold , Lipschitz constant , Safe seed .
   for all
  for  do
     Safe Dijkstra in  from  to 
     Update GP with  and 
     if then  Break
Algorithm 1 Safe exploration in MDPs (SafeMDP)

Initialization. The algorithm relies on an initial safe set  as a starting point to explore the MDP. These states must be safe; that is, , for all . They must also fulfill the reachability and returnability requirements from Sec. 2. Consequently, for any two states, , there must exist a path in  that connects them: . While this may seem restrictive, the requirement is, for example, fulfilled by a single state with an action that leads back to the same state.

Classification. In order to safely explore the MDP, the algorithm must determine which states are safe without visiting them. The regularity assumptions introduced in Sec. 2 allow us to model the safety feature as a 

, so that we can use the uncertainty estimate of the GP model in order to determine a confidence interval within which the true safety function lies with high probability. For every state 

, this confidence interval has the form where  is a positive scalar that determines the amplitude of the interval. We discuss how to select  in Sec. 4.

Rather than defining high probability bounds on the values of  directly in terms of , we consider the intersection of the sets  up to iteration , with for safe states  and  otherwise. This choice ensures that set of states that we classify as safe does not shrink over iterations and is justified by the selection of  in Sec. 4. Based on these confidence intervals, we define a lower bound, , and upper bound, , on the values that the safety features  are likely to take based on the data obtained up to iteration . Based on these lower bounds, we define


as the set of states that fulfill the safety constraint on  with high probability by using the Lipschitz constant to generalize beyond the current safe set. Based on this classification, the set of ergodic safe states is the set of states that achieve the safety threshold and, additionally, fulfill the reachability and returnability properties discussed in Sec. 2:


Expanders. With the set of safe states defined, the task of the algorithm is to identify and explore states that might expand the set of states that can be classified as safe. We use the uncertainty estimate in the GP in order to define an optimistic set of expanders,


where The function  is positive whenever an optimistic measurement at , equal to the upper confidence bound, , would allow us to determine that a previously unsafe state indeed has value  above the safety threshold. Intuitively, sampling  might lead to the expansion of  and thereby . The set explicitly considers the expansion of the safe set as exploration goal, see Fig. (a)a for a graphical illustration of the set.

Sampling and shortest safe path. The remaining part of the algorithm is concerned with selecting safe states to evaluate and finding a safe path in the MDP that leads towards them. The goal is to visit states that allow the safe set to expand as quickly as possible, so that we do not waste resources when exploring the MDP. We use the GP posterior uncertainty about the states in  in order to make this choice. At each iteration , we select as next target sample the state with the highest variance in , where . This choice is justified, because while all points in  are safe and can potentially enlarge the safe set, based on one noisy sample we can gain the most information from the state that we are the most uncertain about. This design choice maximizes the knowledge acquired with every sample but can lead to long paths between measurements within the safe region. Given , we use Dijkstra’s algorithm within the set  in order to find the shortest safe path to the target from the current state, . Since we require reachability and returnability for all safe states, such a path is guaranteed to exist. We terminate the algorithm when we reach the desired accuracy; that is, .

(a) States are classified as safe (above the safety constraint, dashed line) according to the confidence intervals of the GP model (red bar). States in the green bar can expand the safe set if sampled, .
(b) Modified MDP model that is used to encode safety features that depend on actions. In this model, actions lead to abstract action-states , which only have one available action that leads to .

Action-dependent safety. So far, we have considered safety features that only depend on the states, . In general, safety can also depend on the actions, . In this section, we introduce a modified MDP that captures these dependencies without modifying the algorithm. The modified MDP is equivalent to the original one in terms of dynamics, . However, we introduce additional action-states  for each action in the original MDP. When we start in a state  and take action , we first transition to the corresponding action-state and from there transition to  deterministically. This model is illustrated in Fig. (b)b. Safety features that depend on action-states, , are equivalent to action-dependent safety features. The SafeMDP algorithm can be used on this modified MDP without modification. See the experiments in Sec. 5 for an example.

4 Theoretical Results

The safety and exploration aspects of the algorithm that we presented in the previous section rely on the correctness of the confidence intervals . In particular, they require that the true value of the safety feature, , lies within  with high probability for all and all iterations . Furthermore, these confidence intervals have to shrink sufficiently fast over time. The probability of  taking values within the confidence intervals depends on the scaling factor . This scaling factor trades off conservativeness in the exploration for the probability of unsafe states being visited. Appropriate selection of  has been studied by Srinivas et al. (2010) in the multi-armed bandit setting. Even though our framework is different, their setting can be applied to our case. We choose,


where  is the bound on the RKHS norm of the function , is the probability of visiting unsafe states, and is the maximum mutual information that can be gained about  from  noisy observations; that is, The information capacity  has a sublinear dependence on  for many commonly used kernels Srinivas et al. (2010). The choice of  in creftype 8 is justified by the following Lemma, which follows from (Srinivas et al., 2010, Theorem 6):

Lemma 1.

Assume that , and that the noise  is zero-mean conditioned on the history, as well as uniformly bounded by  for all . If  is chosen as in creftype 8, then, for all  and all , it holds with probability at least  that .

This Lemma states that, for  as in creftype 8, the safety function  takes values within the confidence intervals  with high probability. Now we show that the the safe shortest path problem has always a solution:

Lemma 2.

Assume that  and that for all states, , . Then, when using Algorithm 1 under the assumptions in Theorem 1, for all  and for all states, , .

This lemma states that, given an initial safe set that fulfills the initialization requirements, we can always find a policy that drives us from any state in  to any other state in  without leaving the set of safe states, . Lemmas 2 and 1 have a key role in ensuring safety during exploration and, thus, in our main theoretical result:

Theorem 1.

Assume that  is -Lipschitz continuous and that the assumptions of Lemma 1 hold. Also, assume that , for all , and that for any two states, , . Choose  as in creftype 8. Then, with probability at least , we have for any along any state trajectory induced by Algorithm 1 on an MDP with transition function . Moreover, let  be the smallest integer such that with . Then there exists a such that, with probability at least ,

Theorem 1 states that Algorithm 1 performs safe and complete exploration of the state space; that is, it explores the maximum reachable safe set without visiting unsafe states. Moreover, for any desired accuracy  and probability of failure , the safely reachable region can be found within a finite number of observations. This bound depends on the information capacity , which in turn depends on the kernel. If the safety feature is allowed to change rapidly across states, the information capacity will be larger than if the safety feature was smooth. Intuitively, the less prior knowledge the kernel encodes, the more careful we have to be when exploring the MDP, which requires more measurements.

5 Experiments

In this section, we demonstrate Algorithm 1 on an exploration task. We consider the setting in Moldovan and Abbeel (2012), the exploration of the surface of Mars with a rover. The code for the experiments is available at

For space exploration, communication delays between the rover and the operator on Earth can be prohibitive. Thus, it is important that the robot can act autonomously and explore the environment without risking unsafe behavior. For the experiment, we consider the Mars Science Laboratory (MSL) Lockwood (2006), a rover deployed on Mars. Due to communication delays, the MSL can travel 20 meters before it can obtain new instructions from an operator. It can climb a maximum slope of  (MSL, 2007, Sec. 2.1.3). In our experiments we use digital terrain models of the surface of Mars from the High Resolution Imaging Science Experiment (HiRISE), which have a resolution of one meter McEwen et al. (2007).

As opposed to the experiments considered in Moldovan and Abbeel (2012), we do not have to subsample or smoothen the data in order to achieve good exploration results. This is due to the flexibility of the GP framework that considers noisy measurements. Therefore, every state in the MDP represents a square area with , as opposed to  in Moldovan and Abbeel (2012).

At every state, the agent can take one of four actions: up, down, left, and right. If the rover attempts to climb a slope that is steeper than , it fails and may be damaged. Otherwise it moves deterministically to the desired neighboring state. In this setting, we define safety over state transitions by using the extension introduced in Sec. 3. The safety feature over the transition from  to  is defined in terms of height difference between the two states, . Given the maximum slope of that the rover can climb, the safety threshold is set at a conservative . This encodes that it is unsafe for the robot to climb hills that are too steep. In particular, while the MDP dynamics assume that Mars is flat and every state can be reached, the safety constraint depends on the a priori unknown heights. Therefore, under the prior belief, it is unknown which transitions are safe.

We model the height distribution, , as a GP with a Matérn kernel with

. Due to the limitation on the grid resolution, tuning of the hyperparameters is necessary to achieve both safety and satisfactory exploration results. With a finer resolution, more cautious hyperparameters would also be able to generalize to neighbouring states. The lengthscales are set to

and the prior standard deviation of heights is 

. We assume a noise standard deviation of 0.075 m. Since the safety feature of each state transition is a linear combination of heights, the GP model of the heights induces a GP model over the differences of heights, which we use to classify whether state transitions fulfill the safety constraint. In particular, the safety depends on the direction of travel, that is, going downhill is possible, while going uphill might be unsafe.

Following the recommendations in Sui et al. (2015), in our experiments we use the GP confidence intervals directly to determine the safe set . As a result, the Lipschitz constant is only used to determine expanders in . Guaranteeing safe exploration with high probability over multiple steps leads to conservative behavior, as every step beyond the set that is known to be safe decreases the ‘probability budget’ for failure. In order to demonstrate that safety can be achieved empirically using less conservative parameters than those suggested by Theorem 1, we fix to a constant value, . This choice aims to guarantee safety per iteration rather than jointly over all the iterations. The same assumption is used in Moldovan and Abbeel (2012).

We compare our algorithm to several baselines. The first one considers both the safety threshold and the ergodicity requirements but neglects the expanders. In this setting, the agent samples the most uncertain safe state transaction, which corresponds to the safe Bayesian optimization framework in Schreiter et al. (2015). We expect the exploration to be safe, but less efficient than our approach. The second baseline considers the safety threshold, but does not consider ergodicity requirements. In this setting, we expect the rover’s behavior to fulfill the safety constraint and to never attempt to climb steep slopes, but it may get stuck in states without safe actions. The third method uses the unconstrained Bayesian optimization framework in order to explore new states, without safety requirements. In this setting, the agent tries to obtain measurements from the most uncertain state transition over the entire space, rather than restricting itself to the safe set. In this case, the rover can easily get stuck and may also incur failures by attempting to climb steep slopes. Last, we consider a random exploration strategy, which is similar to the -greedy exploration strategies that are widely used in reinforcement learning.

(c) Non-ergodic
(d) Unsafe
(e) Random
(f) No Expanders
(g) SafeMDP
[%] at failure
SafeMDP -
No Expanders -
Non-ergodic 2
Unsafe 1
Random 219
(h) Performance metrics.
Figure 2: Comparison of different exploration schemes. The background color shows the real altitude of the terrain. All algorithms are run for 525 iterations, or until the first unsafe action is attempted. The saturated color indicates the region that each strategy is able to explore. The baselines get stuck in the crater in the bottom-right corner or fail to explore, while Algorithm 1 manages to safely explore the unknown environment. See the statistics in Fig. (h)h.

We compare these baselines over an 120 by 70 meters area at latitude and longitude. We set the accuracy . The resulting exploration behaviors can be seen in Fig. 2. The rover starts in the center-top part of the plot, a relatively planar area. In the top-right corner there is a hill that the rover cannot climb, while in the bottom-right corner there is a crater that, once entered, the rover cannot leave. The safe behavior that we expect is to explore the planar area, without moving into the crater or attempting to climb the hill. We run all algorithms for  iterations or until the first unsafe action is attempted. It can be seen in Fig. (g)g that our method explores the safe area that surrounds the crater, without attempting to move inside. While some state-action pairs closer to the crater are also safe, the GP model would require more data to classify them as safe with the necessary confidence. In contrast, the baselines perform significantly worse. The baseline that does not ensure the ability to return to the safe set (non-ergodic) can be seen in Fig. (c)c. It does not explore the area, because it quickly reaches a state without a safe path to the next target sample. Our approach avoids these situations explicitly. The unsafe exploration baseline in Fig. (d)d considers ergodicity, but concludes that every state is reachable according to the MDP model. Consequently, it follows a path that crosses the boundary of the crater and eventually evaluates an unsafe action. Overall, it is not enough to consider only ergodicity or only safety, in order to solve the safe exploration problem. The random exploration in Fig. (e)e attempts an unsafe action after some exploration. In contrast, Algorithm 1 manages to safely explore a large part of the unknown environment. Running the algorithm without considering expanders leads to the behavior in Fig. (f)f, which is safe, but only manages to explore a small subset of the safely reachable area within the same number of iterations in which Algorithm 1 explores over of it. The results are summarized in Table (h)h.

6 Conclusion

We presented SafeMDP, an algorithm to safely explore a priori unknown environments. We used a Gaussian process to model the safety constraints, which allows the algorithm to reason about the safety of state-action pairs before visiting them. An important aspect of the algorithm is that it considers the transition dynamics of the MDP in order to ensure that there is a safe return route before visiting states. We proved that the algorithm is capable of exploring the full safely reachable region with few measurements, and demonstrated its practicality and performance in experiments.

Acknowledgement. This research was partially supported by the Max Planck ETH Center for Learning Systems and SNSF grant 200020_159557.


  • Akametalu et al. [2014] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.
  • Argall et al. [2009] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • Berkenkamp and Schoellig [2015] Felix Berkenkamp and Angela P. Schoellig. Safe and robust learning control with Gaussian processes. In Proc. of the European Control Conference (ECC), pages 2501–2506, 2015.
  • Berkenkamp et al. [2016] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • Coraluppi and Marcus [1999] Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
  • Garcia and Fernández [2012] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning.

    Journal of Artificial Intelligence Research

    , pages 515–564, 2012.
  • Geibel and Wysotzki [2005] Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR), 24:81–108, 2005.
  • Ghosal and Roy [2006] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, 34(5):2413–2429, 2006.
  • Hans et al. [2008] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In

    Proc. of the European Symposium on Artificial Neural Networks (ESANN)

    , pages 143–148, 2008.
  • Kober et al. [2013] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: a survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Lockwood [2006] Mary Kae Lockwood. Introduction: Mars Science Laboratory: The Next Generation of Mars Landers. Journal of Spacecraft and Rockets, 43(2):257–257, 2006.
  • McEwen et al. [2007] Alfred S. McEwen, Eric M. Eliason, James W. Bergstrom, Nathan T. Bridges, Candice J. Hansen, W. Alan Delamere, John A. Grant, Virginia C. Gulick, Kenneth E. Herkenhoff, Laszlo Keszthelyi, Randolph L. Kirk, Michael T. Mellon, Steven W. Squyres, Nicolas Thomas, and Catherine M. Weitz. Mars Reconnaissance Orbiter’s High Resolution Imaging Science Experiment (HiRISE). Journal of Geophysical Research: Planets, 112(E5):E05S02, 2007.
  • Mockus [1989] Jonas Mockus. Bayesian Approach to Global Optimization, volume 37 of Mathematics and Its Applications. Springer Netherlands, 1989.
  • Moldovan and Abbeel [2012] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In

    Proc. of the International Conference on Machine Learning (ICML)

    , pages 1711–1718, 2012.
  • MSL [2007] MSL. MSL Landing Site Selection User’s Guide to Engineering Constraints, 2007. URL
  • Pecka and Svoboda [2014] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
  • Schaal and Atkeson [2010] Stefan Schaal and Christopher Atkeson. Learning Control in Robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, 2010.
  • Schölkopf and Smola [2002] Bernhard Schölkopf and Alexander J. Smola.

    Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

    MIT Press, 2002.
  • Schreiter et al. [2015] Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint.

    Safe exploration for active learning with Gaussian processes.

    In Proc. of the European Conference on Machine Learning (ECML), volume 9284, pages 133–149, 2015.
  • Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. of the International Conference on Machine Learning (ICML), 2010.
  • Sui et al. [2015] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.
  • Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 1998.

Appendix A Preliminary lemmas

Lemma 3.



This lemma follows directly from the definitions of , , and . ∎

Lemma 4.

, .


Proof by induction. Consider , then by definition. For the induction step, assume . Now consider . We know that

Therefore, since and , it follows that and the induction step is complete. ∎

Lemma 5.

, and , a sequence of actions, that induces starting at , such that and .


. means that either or . Therefore, we can reach a state in taking at most one action. Repeating this procedure times, the system reaches a state in with at most actions. In particular, if we choose , we prove the agent reaches with at most actions. Therefore there is a sequence of actions of length , with , inducing a state trajectory such that: , for every and .
. Consider . This means that . In case we have that and that . Therefore . For we know and . Similarly and . For any we can apply this reasoning times and prove that . ∎

Lemma 6.



This is a direct consequence of Lemma 5. In fact, Lemma 5 states that belongs to if and only if there is a path of length at most starting from contained in that drives the system to a state in . Since we are dealing with a finite MDP, there are different states. Therefore, if such a path exists it cannot be longer than . ∎

Lemma 7.

Given and , it holds that .


Let . It follows from Lemmas 6 and 5 that there exists a sequence of actions, , with , that induces a state trajectory, , starting at with and . Using the () direction of Lemma 5 and Lemma 6, we conclude that . ∎

Lemma 8.



Consider . Then either or , by definition. This implies that . ∎

Lemma 9.

For any , and


Proof by induction. Consider , by initialization. We known that

where the last inequality follows from Lemma 3. This implies that or, equivalently, that . Furthermore, we know by initialization that . Moreover, we can say that , since . We can conclude that . For the induction step assume that and . Let . Then,

Furthermore, it follows from Lemma 3 that . This implies that . Thus . Now consider . We known that

by Lemma 8

We also know that . Since we just proved that and we assumed for the induction step, Lemma 7 allows us to say that . All together this allows us to complete the induction step by saying . ∎

Lemma 10.



Consider , we can say that:


This means that

Lemma 11.

Given two sets such that , it holds that:  .


We have to prove that:


Let’s start by checking the reachability condition first:

by Lemma 8

Now let’s focus on the recovery condition. We use Lemmas 7 and 10 to say that implies that and this completes the proof. ∎

Lemma 12.

Given two sets such that , the following holds: .


The result follows by repeatedly applying Lemma 11. ∎

Lemma 13.

Assume that , and that the noise  is zero-mean conditioned on the history, as well as uniformly bounded by  for all . If  is chosen as in creftype 8, then, for all  and all , it holds with probability at least  that .


See Theorem 6 in Srinivas et al. [2010]. ∎

See 1


See Corollary 1 in Sui et al. [2015]. ∎

Appendix B Safety

Lemma 14.

For all and for all ,  such that .


We use a recursive argument to prove this lemma. Since , we know that . Because of Lemmas 6 and 5 we know , with , inducing such that , and . Similarly, we can build another sequence of actions that drives the system to some state in passing through starting from . By applying repeatedly this procedure we can build a finite sequence of actions that drives the system to a state passing through starting from . Because of Lemmas 6 and 5 this is equivalent to . ∎

Lemma 15.

For all and for all ,  such that .


The proof is analogous to the the one we gave for Lemma 14. The only difference is that here we need to use the reachability property of instead of the recovery property of . ∎

See 2


This lemma is a direct consequence of the properties of listed above (that are ensured by the initialization of the algorithm) and of Lemmas 15 and 14

Lemma 16.

For any , the following holds with probability at least : .


Let’s prove this result by induction. By initialization we know that for all . For the induction step assume that for all