Safely Bridging Offline and Online Reinforcement Learning

by Wanqiao Xu, et al.

A key challenge to deploying reinforcement learning in practice is exploring safely. We propose a natural safety property – uniformly outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We then design an algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to ensure safety with high probability. We experimentally validate our results on a sepsis treatment task, demonstrating that our algorithm can learn while ensuring good performance compared to the baseline policy for every patient.




1 Introduction

Reinforcement learning is a promising approach to learning policies for data-driven sequential decision-making. For instance, it can be used to help manage health conditions such as sepsis [10] and chronic illnesses [19], which require the clinician to make sequences of decisions regarding treatment. Other applications include adaptively sequencing educational material for students [14] or learning inventory control policies with uncertain demand [7, 9].

The core challenge in reinforcement learning is how to balance the exploration-exploitation tradeoff—i.e., how to balance taking exploratory actions (to estimate the transitions and rewards of the underlying system) and exploiting the knowledge acquired thus far (to make good decisions). However, in high-stakes settings, exploration can be costly or even unethical—for instance, taking exploratory actions on patients or students can lead to adverse outcomes that could have been avoided.

As a consequence, there has been great interest in safe reinforcement learning [6], where the algorithm explores in a way that satisfies some kind of safety property. For instance, one notion of safety is to restrict the system to stay in a safe region of the state space [13]. However, many desirable safety properties cannot be formulated this way—e.g., we may want to ensure that a patient does not suffer avoidable adverse outcomes, but some adverse outcomes may be unavoidable even under the optimal policy. Instead, we consider a safety property based on the following intuition:

With high probability, the algorithm should never take actions significantly worse than the ones known to be good based on the knowledge accumulated so far.

For instance, if the algorithm has discovered that a treatment achieves good outcomes for the current patient, then it is obligated to use either that treatment or an alternative that is only slightly worse.

To formalize this property, we need to devise notions of “actions known to be good” and “knowledge accumulated so far”. We take “knowledge accumulated so far” to simply be all observations that have been gathered so far. Formalizing “actions known to be good” is more challenging. For instance, we could consider estimating the value of all actions based on the current dataset, and take a “good action” to be one with high value. However, this definition does not account for estimation error. Offline (or batch) reinforcement learning algorithms [4, 12] are designed to provide conservative estimates of the values of different actions based on historical data [18, 11]. Building on these ideas, we formalize a “good action” to be one for which the conservative lower bound on its value is high. We refer to the policy that always takes the action with the best conservative value estimate as the baseline policy. Then, our safety property is that, with high probability, the algorithm never takes an action that is significantly worse than using the current baseline policy.

This safety property is challenging to satisfy for two reasons. First, it requires the learning algorithm to uniformly outperform the current baseline (up to a given per-episode tolerance ) with high probability. In particular, the algorithm cannot outperform the baseline for some time and then subsequently “spend” this improvement on exploratory actions; instead, we want to “spread out” our potentially harmful exploration across episodes. Second, the baseline is always improving as more data is collected, so the exploration becomes successively more constrained.

Our goal is to establish assumptions under which we can provably learn (i.e., achieve sublinear regret) while satisfying safety. We consider the classic setting of a Markov decision process (MDP) with finite state and action spaces; in this setting, we can straightforwardly convert existing algorithms for computing optimistic value estimates [8, 2] into ones for computing conservative value estimates. When unconstrained exploration is allowed, algorithms such as upper confidence bound (UCB) [8, 2] and posterior sampling [16] can learn the MDP parameters (i.e., the transitions and rewards) while achieving strong regret guarantees. We aim to adapt these algorithms to satisfy safety without significantly hurting learning.

In general, we need to make assumptions about the underlying system to guarantee safety; otherwise, taking any exploratory action could lead to a safety violation. First, we assume that the MDP is ergodic—i.e., every state is reached with some probability during an episode. This assumption is standard [5]; it is necessary since we need to be able to safely transition to each state using the baseline policy to ensure that we can learn the MDP parameters at every state. This assumption is reasonable in practice as long as there is sufficient randomness in the MDP transitions. Second, we assume that any single exploratory action cannot violate the safety constraint. This assumption is also necessary; otherwise, we would be unable to take any exploratory action since doing so would risk violating safety. Instead, safety can only be violated by multiple suboptimal actions. Intuitively, this assumption is reasonable in settings where the time steps are sufficiently small, since delaying an action by a small amount should not significantly affect outcomes. For instance, short medication delays for ICU patients often do not affect readmissions and mortality [15].

Then, to satisfy the safety constraint, our algorithm uses a shielding strategy [13, 1], where it uses a classical UCB policy if it does not risk exhausting its exploration budget (i.e., the allowable deficit in performance compared to the baseline policy); otherwise, it switches to using the baseline policy, which is guaranteed to satisfy the safety constraint. However, naïvely, this strategy can lead to suboptimal exploration (e.g., it only explores state-action pairs at the very beginning of an episode, and then exhausts its exploration budget), potentially failing to learn and causing linear regret. In contrast, our algorithm does not always explore starting at the beginning of an episode. Instead, it adaptively decides at which point to start exploring via UCB (using the baseline policy before this point), to enable construction of meta-episodes that “stitch” together consecutive exploration steps to obtain a sequence of state-action pairs that are effectively sampled from the same distribution as a full episode using UCB; note that we cannot use UCB for a full episode since it likely violates the safety constraint. Thus, this strategy explores in a way that is identical to UCB, thereby ensuring sufficient learning for strong regret guarantees.

We prove that our algorithm not only ensures safety, but also has regret guarantees similar to those of existing reinforcement learning algorithms for finite-state MDPs—i.e., for MDPs satisfying our assumptions, the cost of our safety constraint is only a constant factor.

Related literature. There has been a great deal of recent interest in safe reinforcement learning [6], although it has largely focused on guaranteeing safety rather than proving regret bounds (that guarantee convergence to an optimal policy). Furthermore, most of these approaches focus on safety constraints in the form of safe regions, where the goal is to stay inside the safe region. Such constraints are common in robotics, but less so for other applications of reinforcement learning such as healthcare [10], education [14], and operations research [7, 9].

The most closely related work to ours is on conservative learning [5, 17], where the goal is to maintain cumulative regret less than that of a baseline policy up to some exploration budget. There are two key differences. First, their baseline policy is provided exogenously by the user, and is static; in contrast, our baseline is updated as new information becomes available. Thus, our safety constraint grows stronger over time. Second, their guarantee is only valid on average up to the current time step; thus, it suffices to follow the baseline policy for enough time (accumulating similar average performance) and then “spend” this improvement by exploring arbitrarily (i.e., using UCB for an entire episode). For instance, they may perform safely on many patients and then explore excessively on a few subsequent patients, leading to particularly adverse outcomes for these patients. In contrast, we require safety uniformly on every time step of every episode, thereby spreading out exploration across the time horizon and ensuring that exploration does not significantly harm any single individual. Importantly, spreading exploration across different episodes can lead to biased exploration, which is not an issue for [5]; we address this issue by constructing meta-episodes that mimic UCB episodes.

2 Problem Formulation

Preliminaries. Consider a Markov decision process (MDP) , with finite states , finite actions , transitions , rewards , and time horizon . We consider policies with internal state , along with internal state transitions . Our safety property (described below) is a constraint on the reward accrued by our policy across multiple steps in the MDP; thus, our policy maintains an internal state to track this information and ensure that we satisfy safety.

Given , a rollout is a random sequence such that , , , and ; we assume is deterministic and is given. We denote the distribution over rollouts by . We define the and value functions by , , and

Regret. We let denote the (deterministic) optimal policy, and and its and value functions, respectively. We consider the episodic setting, where and are initially unknown, and over a sequence of episodes, we choose a policy along with an initial internal state based on the observations so far, and observe a rollout . Our goal is to choose and to minimize the cumulative regret

where the expectation is taken over the random sequence of rollouts .

Safety property. Intuitively, our safety property says we do not take any actions across an episode that achieve significantly worse rewards compared to the current baseline (for simplicity, we assume that this policy is only updated at the end of an episode). Thus, it ensures that we do not take unsafe sequences of actions that could have been avoided by .

The baseline policy is constructed using all data collected before the current episode. It should be constructed based on lower bounds on the value function that account for estimation error. The strength of the safety property depends on ; thus, these bounds should be as tight as possible to achieve the strongest safety property. We build on a UCB strategy called UCBVI [2], a state-of-the-art algorithm that achieves minimax regret guarantees. This algorithm constructs policies based on values that are optimistic compared to the true values; its minimax guarantees stem from the fact that its confidence intervals around its value estimates are very tight. We modify UCBVI to instead construct policies based on conservative values. We describe our approach in detail in Section 3.

Now, given , our safety property says that with probability at least (over the randomness of ), for every and , we have


We call the reward deficit, since it is the deficit in reward compared to , and the exploration budget, since it bounds how much exploration we can do. To understand (1), consider the alternative


where the equality follows by a telescoping sum argument (see, e.g., Lemma 2.1 in [3]). Intuitively, (2) says that our cumulative expected reward must be within of that of across the entire episode. In contrast, (1) is significantly stronger, since the maximum ensures that we cannot compensate for performing worse than in one part of an episode by performing better later.
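To make the contrast between the two conditions concrete, the following minimal sketch compares an episode-level check with a step-wise check on a list of per-step reward deficits. It is an illustration under assumptions, not the paper's exact definitions: the names `deficits`, `budget`, `satisfies_cumulative`, and `satisfies_uniform` are hypothetical, and the step-wise condition is read here as "the largest running deficit over all prefixes of the episode stays within the budget".

```python
from itertools import accumulate


def satisfies_cumulative(deficits, budget):
    """Episode-level check: the total deficit over the whole episode is within budget."""
    return sum(deficits) <= budget


def satisfies_uniform(deficits, budget):
    """Step-wise check: the maximum running deficit over all prefixes is within budget."""
    return max(accumulate(deficits)) <= budget


# Hypothetical per-step deficits: a large early loss that is recovered later.
deficits = [0.8, 0.6, -0.5, -0.5]
budget = 1.0

print(satisfies_cumulative(deficits, budget))  # True
print(satisfies_uniform(deficits, budget))     # False
```

The episode passes the cumulative check because later gains offset the early loss, but it fails the step-wise check because the running deficit exceeds the budget after the second step.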

Note that our algorithm can always use , which satisfies (1); the challenge is how to take exploratory actions in a way that minimizes regret while maintaining safety.

Assumptions. Ensuring safety is impossible without additional assumptions, since otherwise any exploration the agent undertakes could lead to a safety violation. We make two key assumptions. The first one is standard, saying that the MDP is ergodic:

Assumption 2.1.

Letting be all deterministic policies, is finite.

Here, is the worst-case diameter of the MDP—i.e., the worst-case time it takes for any policy to reach any state from any state . This assumption says that every state is visited by any policy ; for instance, if there is a state not visited by one of our baseline policies , then we would not be able to explore that state, potentially leading to linear regret. Our second assumption says that any single step of exploration in the MDP does not violate our safety property:

Assumption 2.2.

For any , and , we have .

That is, using an arbitrary action in state and then switching to (i.e., ), is not much worse than using (i.e., ). Note that at the very least, assuming is necessary; otherwise, any exploratory action could potentially violate safety. Our algorithm only requires this assumption for the baseline policies that occur during learning. The stricter tolerance enables us to continue to take exploratory steps if we have only accrued error so far. In particular, if the tolerance were , then if we take a single step such that , then at each subsequent step , we cannot take an exploratory action, since we run the risk that , which would violate safety.

3 Algorithm

The key challenge is how to take exploratory actions to minimize regret while ensuring that our safety property holds. We build on the upper confidence bound value iteration (UCBVI) algorithm by [2], which obtains near-optimal regret guarantees for finite-horizon MDPs. Like other UCB algorithms, it relies on optimism to minimize regret—i.e., it takes actions that optimize the cumulative reward under optimistic assumptions about its estimates of the MDP parameters. A natural strategy is to use the internal state to keep track of the reward deficit accrued so far; then, we can use the UCBVI policy from the beginning of each episode until we exhaust our exploration budget, after which we switch to the baseline policy.

The challenge is that the UCBVI regret guarantees depend crucially on using the UCBVI policy for the entire horizon, or at least for extended periods of time. The reason is that selectively using UCBVI at the beginning of each episode biases the portions of the state space where UCBVI is used; for instance, if there are some states that are only reached late in the episode, then we may never use UCBVI in these states, causing us to underexplore and accrue high regret.

To avoid this issue, our algorithm uses the UCBVI policy in portions of each episode in a sequence of episodes, such that we can “stitch” these portions together to form a single meta-episode that is mathematically equivalent to using the UCBVI policy for an entire episode. The cost is that we may require multiple episodes to obtain a single UCBVI episode, which would slow down exploration and increase regret. However, we can show that the number of episodes in a meta-episode is not too large with high probability, so the strategy actually achieves similar regret as UCBVI.

Overall algorithm. Our algorithm is summarized in Algorithm 1; indexes a single meta-episode, and indexes an episode of . To be precise, we use meta-episode to refer to an iteration of the outer loop of Algorithm 1, and episode to refer to an iteration of the inner loop; we alternatively index episodes by when referring to the sequence of all episodes. Then, we use rollout to refer to the sequence of observations during an episode, and a meta-rollout to refer to the rollout consisting of a subset of the observations in , where is the total number of episodes in meta-episode . In particular, consists of observations where the UCB policy was used; our algorithm uses in a way that ensures that is equivalent to a single rollout sampled from the MDP while exclusively using .

At a high level, at the beginning of each episode , our algorithm constructs the baseline policy using the current rollouts . Furthermore, at the beginning of each meta-episode , our algorithm constructs the UCBVI policy using the current meta-rollouts . Then, it obtains a sequence of rollouts using , which combines the current and in a way that ensures safety. It does so in a way that it can “stitch” together portions of the rollouts using into a single rollout whose distribution equals the distribution over rollouts induced by using . In other words, is equivalent to using for a single episode. Thus, each meta-rollout of our algorithm corresponds to a single UCBVI episode. As long as the number of episodes per meta-episode is not too large, we obtain similar regret as UCBVI. We describe our algorithm in more detail below.

Safety. First, we describe how our algorithm ensures safety. Our strategy is to use internal state to keep track of reward deficit. In particular, suppose we have satisfying and satisfying with high probability; then, we use internal state and

where the second equality follows since we always have . In particular, we have with high probability. Then, our algorithm switches to using as soon as (i.e., )—i.e., it uses the shield policy

where is the current UCBVI policy. Thus, we have

where the second inequality follows by Assumption 2.2. Since using does not increase the reward deficit, we have , so (1) holds—i.e., using ensures safety with high probability.
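The switching rule described above can be sketched as a small stateful wrapper. This is a simplified illustration, not the paper's implementation: the class name `Shield`, the callbacks `explore_policy`, `baseline_policy`, and `step_deficit_bound`, and the half-budget threshold are assumptions chosen to mirror the idea that one further exploratory step (assumed to cost at most half the budget) cannot push the tracked deficit past the full budget.

```python
from dataclasses import dataclass


@dataclass
class Shield:
    budget: float         # per-episode exploration budget (hypothetical name)
    deficit: float = 0.0  # upper bound on the reward deficit accrued so far

    def may_explore(self) -> bool:
        # Explore only while at most half the budget is spent, so that one more
        # exploratory step (assumed to cost at most budget / 2) stays within budget.
        return self.deficit <= self.budget / 2

    def act(self, s, explore_policy, baseline_policy, step_deficit_bound):
        """Return (action, explored) and update the tracked deficit."""
        if self.may_explore():
            a = explore_policy(s)
            self.deficit += step_deficit_bound(s, a)
            return a, True
        # Baseline steps are assumed not to increase the tracked deficit.
        return baseline_policy(s), False


# Toy usage: every exploratory step is assumed to cost at most 0.3.
shield = Shield(budget=1.0)
trace = [
    shield.act(0, lambda s: "explore", lambda s: "baseline", lambda s, a: 0.3)[1]
    for _ in range(4)
]
print(trace)  # [True, True, False, False]
```

In this toy run the wrapper explores twice, then falls back to the baseline for the rest of the episode, and the tracked deficit never exceeds the budget.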

Meta-episodes. As defined, implements the naïve strategy of using at the beginning of each episode, and switching to if it can no longer ensure safety. However, as discussed above, this strategy may explore in a biased way, causing it to accrue high regret. Instead, we modify to construct a single UCBVI episode (called a meta-episode) across multiple actual episodes, which ensures exploration equivalent to UCBVI. We denote such a meta-episode by and an episode in meta-episode by (i.e., there are episodes in , so we have total episodes); we index our episodes by instead of .

At a high level, in the first episode of a meta-episode (i.e., ), we use from the beginning. If uses for the entire episode, then this single episode is equivalent to a UCBVI episode, so we are done. Otherwise, we switch to using at some step (i.e., at state ). Then, in the subsequent episode, we initially use until some step such that ; at this point, we switch to until we have exhausted our exploration budget. If we do not encounter , then we try again in the next episode; since the MDP is ergodic, we are guaranteed to find after a few tries with high probability. We continue this process until we have used for steps (i.e., a full UCB episode).

Formally, we augment the internal state of our policy with the target state from which we want to continue using (or for the initial episode), so . In particular, we let


where is the target state for episode —i.e., the state at which we switched to for some , such that we did not encounter in episodes . Next, we have


That is, the internal state remains until encountering the target state ; at this point, it becomes and starts accruing reward deficit as before. Finally, we have shield policy


i.e., we use the UCBVI policy if we have reached the target state and do not risk exceeding our exploration budget; otherwise, we use the backup policy .

Finally, a meta-episode terminates once we have used at least times across the rollouts ; in this case, we have episodes in meta-episode . Then, our algorithm constructs the corresponding meta-rollout by concatenating the portions of that use . Note that in the very last episode , we may continue using even after we have obtained the necessary steps using ; we ignore the extra steps so is exactly steps long.

procedure SafeUCBVI()
     Initialize rollout history
     Initialize meta-rollout history
     for  do
         Compute using
         Initialize target state
         for  do
              Compute , , and using
              Obtain a rollout using , as in (4), and as in (5), and add it to
              Update to be the next target state, or break if done (and terminate if )
         end for
         Construct from and add it to
     end for
end procedure
Algorithm 1 Safe offline-to-online UCBVI.
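The stitching performed by the inner loop of Algorithm 1 can be illustrated on a toy tabular MDP. The sketch below makes several simplifying assumptions that are not in the paper: the exploration budget is replaced by a fixed cap on exploratory steps per episode (`cap`), the exploratory and baseline policies are fixed stand-ins rather than learned policies, and all names (`run_meta_episode`, `explore`, `baseline`, `max_episodes`) are hypothetical.

```python
import numpy as np


def run_meta_episode(P, H, explore, baseline, cap, rng, s0=0, max_episodes=10_000):
    """Stitch exploratory steps from several episodes into one H-step meta-rollout."""
    n_states = P.shape[0]
    meta_rollout = []  # (state, action, next_state) for exploratory steps only
    target = s0        # state from which exploration must resume
    episodes = 0
    while len(meta_rollout) < H and episodes < max_episodes:
        episodes += 1
        s, phase, used = s0, "waiting", 0
        for _ in range(H):
            # Follow the baseline until the target state is encountered.
            if phase == "waiting" and s == target:
                phase = "exploring"
            if phase == "exploring":
                a = explore(s, len(meta_rollout))
            else:
                a = baseline(s)
            s_next = int(rng.choice(n_states, p=P[s, a]))
            if phase == "exploring":
                meta_rollout.append((s, a, s_next))
                used += 1
                target = s_next  # the next exploratory step must start here
                if used >= cap or len(meta_rollout) >= H:
                    phase = "done"  # fall back to the baseline for the rest
            s = s_next
    return meta_rollout, episodes


rng = np.random.default_rng(0)
S, A, H = 4, 2, 8
# Random transitions mixed with a uniform kernel so every state is reachable.
P = 0.5 * rng.dirichlet(np.ones(S), size=(S, A)) + 0.5 / S
meta, n_eps = run_meta_episode(
    P, H, explore=lambda s, h: 1, baseline=lambda s: 0, cap=3, rng=rng
)
print(len(meta), n_eps)
```

Each exploratory segment begins in the state where the previous segment left off, so the concatenated transitions form one unbroken chain of length `H` under the exploratory policy, even though they were collected across several episodes.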

Policy construction. Finally, we describe how our algorithm constructs the quantities , , , and . The constructions are based on the UCBVI algorithm; in particular, note that on step , is equivalent to a set of UCBVI rollouts, so we can use it to construct a UCBVI policy for the th episode. [Footnote 1: By only using meta-episodes to construct , the meta-episodes exactly mimic the execution of UCBVI; in practice, we can use the entire dataset to construct .] In particular, we construct by estimating the transitions and rewards based on the data collected so far (i.e., the tuples collected on steps using the UCBVI policy, so , , and ), to obtain

where is the number of observations of state-action pair in the data collected so far. Then, we use value iteration to solve the Bellman equations

where , where is a bonus term, and where . Finally, we take .
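As a rough illustration of value iteration with a count-based bonus, the sketch below runs finite-horizon backward induction on an empirical model. It is not the paper's construction: the bonus formula is a generic Hoeffding-style placeholder, rewards are assumed to lie in [0, 1], and the names `bonus_value_iteration`, `sign`, `bonus_scale`, and `policy` are hypothetical. Adding the bonus (`sign=+1`) gives optimistic values, subtracting it (`sign=-1`) gives conservative ones, and passing `policy` evaluates a fixed policy instead of optimizing.

```python
import numpy as np


def bonus_value_iteration(P_hat, R_hat, counts, H, sign=+1.0,
                          delta=0.05, bonus_scale=1.0, policy=None):
    """Backward induction on an empirical model with a count-based bonus.

    P_hat: (S, A, S) estimated transitions; R_hat: (S, A) estimated rewards;
    counts: (S, A) visit counts. Returns Q (H, S, A), V (H + 1, S), pi (H, S).
    """
    S, A = R_hat.shape
    # Placeholder bonus that shrinks as 1 / sqrt(count); not the paper's bonus.
    bonus = bonus_scale * np.sqrt(np.log(S * A * H / delta) / np.maximum(counts, 1))
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q[h] = R_hat + sign * bonus + P_hat @ V[h + 1]
        Q[h] = np.clip(Q[h], 0.0, H - h)  # assumes rewards lie in [0, 1]
        pi[h] = Q[h].argmax(axis=1) if policy is None else policy[h]
        V[h] = Q[h][np.arange(S), pi[h]]
    return Q, V, pi


rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
R_hat = rng.uniform(size=(S, A))
counts = rng.integers(1, 50, size=(S, A))

_, V_opt, pi_opt = bonus_value_iteration(P_hat, R_hat, counts, H, sign=+1.0)
_, V_cons, pi_base = bonus_value_iteration(P_hat, R_hat, counts, H, sign=-1.0)
# Upper bound on the value of the conservative (baseline) policy.
_, V_base_upper, _ = bonus_value_iteration(P_hat, R_hat, counts, H,
                                           sign=+1.0, policy=pi_base)
```

With the same model and counts, the optimistic values dominate the upper bound for the baseline policy, which in turn dominates the conservative values.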

We construct and similarly. For , we use the above strategy except we subtract the bonus—i.e., letting , we have

Then, we take . Finally, for , we add the bonus, but use value iteration for policy evaluation instead of policy optimization—i.e.,

4 Theoretical Guarantees

All our results in this section are conditioned on a high-probability event which says that (i) our confidence sets around the estimated transitions and rewards hold, and (ii) a condition saying that we find the target state in a reasonable number of episodes (see Lemma 4.7). This event holds with probability at least ; we give details in Appendix A.1.

Safety guarantee. First, we prove that our algorithm satisfies our safety constraint.

Theorem 4.1.

On event , Algorithm 1 satisfies (1) for all


First, we show that for all . To this end, consider the following cases at step : (i) if uses , then , (ii) if switches to on step , then , and (iii) otherwise, remains the same, so the claim follows by induction. As a consequence, it suffices to show that on event . To this end, the following lemma says that the high probability upper and lower bounds and used to construct are correct.

Lemma 4.2.

On event , for all , , , and , we have (i) , and (ii) .

This result is based on standard arguments; we give a proof in Appendix A.2. Now, by Lemma 4.2,

on event , so the claim holds. ∎

Regret bound. Next, we prove that our algorithm has sublinear regret.

Theorem 4.3.

On event , the cumulative expected regret of Algorithm 1 is

where , and where the expectation is taken over the randomness during all of the rollouts taken. Furthermore, letting be the total number of time-steps by the end of meta-episode , the regret satisfies .


The main idea is to bound the regret by the regret of the meta-rollouts (which correspond to UCBVI rollouts), plus the regret of the shield policy on the remaining steps—i.e., , where

where we have used to denote that the th step of episode is not included in meta-rollout . By equivalence to UCBVI, is bounded by the UCBVI regret:

Lemma 4.4.

On event , we have

The proof is based directly on the UCBVI regret analysis; for completeness, we give a proof in Appendix A.3. Thus, we focus on bounding . First, we have the straightforward bound,


which follows since the maximum regret during a single episode is (since the rewards are bounded by ), and since we can also omit the steps for which , of which there are exactly .

As a consequence, the key challenge in bounding is proving that the number of episodes in a meta-episode becomes small—in particular, once , then the entire (single) rollout is part of the meta-rollout , so the second term in the regret is zero.

To prove that becomes small, we note that for any episode, one of the following conditions must hold: (i) the exploration budget is exhausted—i.e., , (ii) the algorithm explores using for at least time steps, or (iii) the episode does not reach the target state in the first time steps; in particular, if (iii) does not hold, then either the episode uses for the final steps of that episode (so (ii) holds) or the exploration budget is exhausted (so (i) holds). We let denote the number of episodes that satisfy the three respective cases in meta-episode ; note that either (i.e., always use the UCBVI policy) or .

We bound the three possibilities separately. Our first lemma shows that the number of episodes in case (i) is bounded by the UCBVI regret (i.e., the regret of the meta-episode), which is sublinear.

Lemma 4.5.

On event , we have

where , and is the total number of observations of the state-action pair prior to meta-episode .

Intuitively, this lemma follows since if our algorithm exhausts its exploration budget, then it explores sufficiently; thus, the number of times that the exploration budget is exhausted cannot be too large. We give a proof in Appendix A.4. The left-hand side of the bound is essentially (but not exactly) the UCBVI regret, and we can bound it using the same strategy. In particular, we have the following:

Lemma 4.6.

On event , we have

The proof is based on the same strategy as UCBVI, so we defer it to Appendix A.5. Note that we have summed over meta-episodes ; later, we use Lemma 4.6 to directly bound .

Next, is straightforward to bound—in particular, note that we can use for time steps in at most four episodes, since at the end of the fourth episode we would have a complete UCBVI episode (which has length ). Thus, we have . Next, we use the following result to bound .

Lemma 4.7.

On event , for any state , a rollout using will reach state within time steps after at most episodes.

This result follows by applying Markov’s inequality in conjunction with Assumption 2.1, which says the MDP is ergodic; thus, it visits with high probability early in the rollout. We give a proof in Appendix A.6. Finally, we have the following overall bound:
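The shape of this kind of argument can be sketched as follows. The symbols are hypothetical and the constants need not match those of Lemma 4.7: we assume the expected time for the baseline rollout to first reach the state $s$ is at most $D$ from any starting state, and that the horizon is at least $2D$.

```latex
% Hypothetical symbols: \tau_s is the first time the baseline rollout visits s,
% D bounds its expectation from any start state, m counts episodes.
\begin{align*}
\Pr[\tau_s > 2D] &\le \frac{\mathbb{E}[\tau_s]}{2D} \le \frac{1}{2}
  && \text{(Markov's inequality)} \\
\Pr[\text{$s$ not reached within $2D$ steps in any of $m$ episodes}] &\le 2^{-m}
  && \text{(each episode fails with probability at most $1/2$ given the past)}
\end{align*}
```

Under these assumptions, taking $m = \lceil \log_2(1/\delta') \rceil$ episodes makes the failure probability at most $\delta'$, so the target state is found after logarithmically many tries.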

Lemma 4.8.

We have under event .

This result follows by combining the previous lemmas; we give a proof in Appendix A.7. In summary, we have


where the first inequality follows by combining (6) with Lemma 4.8 (which implies ), and the second inequality follows from Lemmas 4.5 and 4.6. Finally, Theorem 4.3 follows immediately by combining (7) and Lemma 4.4. ∎








Figure 1: For inventory control, (a) regret of our approach (for various exploration budgets) vs. UCBVI, and (b) safety violations of our approach vs. UCBVI as a function of . For sepsis management, (c) regret and (d) safety violations.

5 Experiments

We compare the performance and safety violations of our algorithm and UCBVI on two tasks. We give details on the experimental setup in Appendix B.

Inventory control. First, we consider a single-product stochastic inventory control problem based on [5], but with a finite horizon. At the beginning of each month , the manager notes the current inventory of a single product, and then decides the number of items to order from a supplier before observing the random demand. They have to account for the tradeoff between the costs of keeping inventory and lost sales or penalties resulting from being unable to satisfy customer demand. The objective is to maximize profit during the entire decision-making process.

Figure 1 (a) shows that the regret of our algorithm initially increases linearly, since early on our algorithm must frequently switch to the baseline policy within each meta-episode to guarantee safety. As historical data accrues, our algorithm uses the UCBVI policy more frequently since its reward deficit decreases. At some point, our algorithm starts to converge at a similar rate as UCBVI. Note that UCBVI converges faster since it ignores the safety constraint and can explore arbitrarily, even if some actions result in poor values. Figure 1 (b) shows the number of times the safety constraint is violated. Our algorithm almost always satisfies the constraint for all shown values of , whereas UCBVI fails to do so in a significant number of episodes, especially when is small.

Sepsis management. To validate our results in a more realistic human-centric setting where safety is critical, we run simulations on the MIMIC-III data set to test the performance of our algorithm in sepsis treatment. Sepsis is the body’s acute response to infection that can lead to organ dysfunction, tissue damage, and death. It is the number one cost of hospitalization in the U.S. and the third leading cause of death worldwide. The management of intravenous fluids and vasopressors is crucial in the treatment of sepsis, but current clinical practice has been shown to be suboptimal. To develop a more efficient treatment strategy, we can model the problem as an MDP and apply reinforcement learning algorithms. The states are aggregated patient data, and rewards reflect the patient’s outcome after medication doses [10].

Figure 1 (c) shows similar behavior as in the inventory experiment. The regret of our algorithm converges at a similar rate as that of UCBVI, and our regret moves closer to that of UCBVI as the safety budget increases. Figure 1 (d) shows that our algorithm satisfies safety most of the time for all values of , while UCBVI violates safety more often, even for relatively large values of . Note that in this setting each episode represents a patient’s treatment cycle, so each safety violation indicates that a patient has received a failed treatment or experienced an adverse outcome. Thus, it is highly undesirable and even unethical to violate safety even a few dozen times.

6 Conclusion

We propose a novel reinforcement learning algorithm that ensures close performance compared to our current knowledge uniformly across every step of every episode with high probability. We derive assumptions on the MDP under which both safety and sublinear regret can be achieved. Our results show that the price of uniform safety on learning is negligible—i.e., a constant, -independent factor. One limitation of our approach is that we need to make several assumptions about the MDP; a key direction for future work is relaxing these assumptions. Furthermore, we make specific assumptions about the safety property; extending our techniques to additional properties is another direction for future work. Our work has ethical considerations insofar as we are proposing a way to improve safety of reinforcement learning in practice; before deploying our approach (or any machine learning algorithm) in any domain, it is critical to ensure the algorithm does not harm individuals.


  • [1] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1.
  • [2] M. G. Azar, I. Osband, and R. Munos (2017) Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. Cited by: §A.1, §A.3, §1, §2.
  • [3] O. Bastani, Y. Pu, and A. Solar-Lezama (2018) Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328. Cited by: §2.
  • [4] D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, pp. 503–556. Cited by: §1.
  • [5] E. Garcelon, M. Ghavamzadeh, A. Lazaric, and M. Pirotta (2020) Conservative exploration in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 1431–1441. Cited by: §1, §1, §5.
  • [6] J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480. Cited by: §1, §1.
  • [7] I. Giannoccaro and P. Pontrandolfo (2002) Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics 78 (2), pp. 153–161. Cited by: §1, §1.
  • [8] T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (4). Cited by: §1, §3.
  • [9] P. W. Keller, S. Mannor, and D. Precup (2006) Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 449–456. Cited by: §1, §1.
  • [10] M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018) The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine 24 (11), pp. 1716–1720. Cited by: Appendix B, §1, §1, §5.
  • [11] A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779. Cited by: §1.
  • [12] S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §1.
  • [13] S. Li and O. Bastani (2020) Robust model predictive shielding for safe reinforcement learning with stochastic dynamics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7166–7172. Cited by: §1, §1.
  • [14] T. Mandel, Y. Liu, S. Levine, E. Brunskill, and Z. Popovic (2014) Offline policy evaluation across representations with applications to educational games. In AAMAS, pp. 1077–1084. Cited by: §1, §1.
  • [15] L. Meng, K. Laudanski, A. Huffenberger, and C. Terwiesch (2020) The impact of medication delays on patient health in the ICU: estimating marginal effects under endogenous delays. Available at SSRN 3590744. Cited by: §1.
  • [16] I. Osband and B. Van Roy (2017) Why is posterior sampling better than optimism for reinforcement learning? In International Conference on Machine Learning, pp. 2701–2710. Cited by: §1.
  • [17] Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári (2016) Conservative bandits. In International Conference on Machine Learning, pp. 1254–1262. Cited by: §1.
  • [18] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma (2020) Mopo: model-based offline policy optimization. arXiv preprint arXiv:2005.13239. Cited by: §1.
  • [19] M. Zhou, Y. Mintz, Y. Fukuoka, K. Goldberg, E. Flowers, P. Kaminsky, A. Castillejo, and A. Aswani (2018) Personalizing mobile fitness apps using reinforcement learning. In CEUR workshop proceedings, Vol. 2068. Cited by: §1.

Appendix A Proofs for Section 4

A.1 High probability event

We first introduce the high probability event under which the concentration inequalities described in the policy construction and in UCBVI-CH hold. Let be the high probability event under which the UCBVI-CH regret analysis holds. This event is defined in the equation at the bottom of page 16 in the appendices of [2]. The claim that holds with probability at least is established in the subsequent Lemma 1. We then define


denote the set of all probability distributions on the states

, then construct the confidence sets for every and

Next, we define the random event

where is the number of observations of state-action pair up to episode . Finally, letting , we conclude that holds with probability at least . Indeed,

and the claim follows from a union bound.
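For readers less familiar with this step, the generic form of a union bound over two events can be written as below. The symbols $\mathcal{E}$, $\mathcal{F}$, and $\delta$, and the even split of the failure probability, are assumptions chosen for illustration; they are not necessarily the notation or constants used in this appendix.

```latex
% Generic union bound, assuming each event fails with probability at most \delta/2.
\begin{align*}
\Pr\big[(\mathcal{E} \cap \mathcal{F})^{c}\big]
  &= \Pr\big[\mathcal{E}^{c} \cup \mathcal{F}^{c}\big] \\
  &\le \Pr\big[\mathcal{E}^{c}\big] + \Pr\big[\mathcal{F}^{c}\big] \\
  &\le \tfrac{\delta}{2} + \tfrac{\delta}{2} = \delta ,
\end{align*}
% so the intersection \mathcal{E} \cap \mathcal{F} holds with probability at least 1 - \delta.
```

The same argument extends to any finite collection of events, for example one per state-action pair and episode, by dividing the total failure probability among them.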

A.2 Proof of Lemma 4.2


First, we prove claim (i). We show by induction that is indeed a lower bound on , the real Q functions. Define the sets

We want to show that the set of events hold under the event .

We proceed by induction. For , by definition, , so holds. Now, assuming holds, we want to show that also holds. To this end, note that

where we use the induction hypothesis in the last inequality. The event , by Hölder’s inequality, implies that