1 Introduction
There is increasing excitement around applications of machine learning (ML), but also growing awareness and concern. Recent research on FAT (fairness, accountability and transparency) ML aims to address these concerns, but most work focuses on supervised learning settings and only a few works address reinforcement learning or sequential decision making in general (Jabbari et al., 2016; Joseph et al., 2016; Kannan et al., 2017; Raghavan et al., 2018).
One challenge when applying reinforcement learning (RL) in practice is that, unlike in supervised learning, the performance of an RL algorithm typically does not increase monotonically with more data, because the trial-and-error nature of RL necessitates exploration. Even sharp drops in policy performance during learning are possible, for example, when the agent starts to explore a new part of the state space. Such unpredictable performance fluctuation has limited the use of RL in high-stakes applications like healthcare, and calls for more accountable algorithms that can quantify and reveal their performance during learning.
In this work, we propose that an RL algorithm output policy certificates, a form of confidence interval, in episodic reinforcement learning. Policy certificates are upper bounds on how far from optimal the return (expected sum of rewards) of an algorithm in the next episode can be. They allow one to monitor the policy’s performance and intervene if necessary, thus improving accountability of the algorithm. Formally, we propose a theoretical framework called IPOC that guarantees not only that certificates are valid performance bounds but also that both the algorithm’s policy and its certificates improve with more data.
There are two relevant lines of research on RL with guaranteed performance for episodic reinforcement learning. The first area is on frameworks for guaranteeing the performance of an RL algorithm across many episodes, as it learns. Such frameworks, like regret (Jaksch et al., 2010), PAC (probably approximately correct; Kakade, 2003; Strehl et al., 2009) and Uniform-PAC (Dann et al., 2017), all provide a priori bounds on the cumulative performance of the algorithm, such as bounding the total number of times an algorithm may execute a policy that is not near optimal. However, these frameworks do not provide bounds for any individual episode. In contrast, the second main related area for providing guarantees focuses on estimating and guaranteeing the performance of a particular RL policy, given some prior data (e.g., Thomas et al., 2015b; Jiang and Li, 2016; Thomas and Brunskill, 2016). Such work typically provides limited or no guarantees for algorithms that are learning and updating their policies across episodes. In this paper, we unite both lines of work by providing performance guarantees online for a reinforcement learning algorithm in individual episodes and across all episodes. In fact, we show that bounds in our new IPOC framework imply strong guarantees in existing regret and PAC frameworks.
We consider policy certificates in two settings, finite episodic Markov decision processes (MDPs) and, more generally, finite MDPs with episodic side information (context) (Abbasi-Yadkori and Neu, 2014; Hallak et al., 2015; Modi et al., 2018). The latter is of particular interest in practice. For example, in a drug treatment optimization task where each patient is one episode, context is the background information of the patient which influences the treatment outcome. While one expects the algorithm to learn a good policy quickly for frequent contexts, the performance for unusual patients may be significantly more variable due to the limited prior experience of the algorithm. Policy certificates allow humans to detect when the current policy is good for the current patient and to intervene if the certified performance is deemed inadequate. For example, in this health monitoring application, a human expert could intervene and directly specify the policy for that episode; in automated customer service, the service could instead be provided at reduced cost to the customer.
Existing algorithms based on the optimism-in-the-face-of-uncertainty (OFU) principle (e.g., Auer et al., 2009) are natural to extend to learning with policy certificates. We demonstrate this by extending the UBEV algorithm (Dann et al., 2017) for episodic MDPs with finite state and action spaces, and show that with high probability it outputs certificates greater than at most times for all . For problems with side information, we propose an algorithm that learns with policy certificates in episodic MDPs with adversarial linear side information (Abbasi-Yadkori and Neu, 2014; Modi et al., 2018) of dimension , and bound the rate at which the cumulative sum of certificates can grow up to log terms by .
2 Setting and Notation
In this work, we consider episodic RL problems where the agent interacts with the environment in episodes of a certain length. While the framework for policy certificates applies to a wide range of problems, we focus on finite Markov decision processes (MDPs) with linear side information (Modi et al., 2018; Hallak et al., 2015; Abbasi-Yadkori and Neu, 2014) for concreteness. This setting includes tabular MDPs as a special case but is more general and can model variations in the environment across episodes, e.g., because different episodes correspond to treating different patients in a healthcare application. Unlike in the tabular special case, function approximation is necessary for efficient learning.
Finite MDPs with linear side information.
The agent interacts in episode by observing a state , taking action and observing the next state as well as a scalar reward . This interaction loop continues for time steps , before a new episode starts. We assume that the state and action spaces are of finite sizes and , respectively, as in the widely considered tabular MDPs (Osband and Van Roy, 2014; Dann and Brunskill, 2015; Azar et al., 2017; Jin et al., 2018). But here, the agent essentially interacts with a family of infinitely many tabular MDPs that is parameterized by linear contexts. At the beginning of episode , two contexts, and , are observed and the agent interacts in this episode with a tabular MDP, whose dynamics and reward function depend on the contexts in a linear fashion. Specifically, it is assumed that the rewards are sampled from with means and transition probabilities are where and are unknown parameter vectors for each . As a regularity condition, we assume bounded parameters, i.e., and as well as bounded contexts and . We allow and to be different, and use to denote in the following. To further simplify notation, we assume w.l.o.g. that there is a fixed start state. Note that there is no assumption on the distribution of contexts; our framework and algorithms can handle adversarially chosen contexts.
Return and optimality gap.
The quality of a policy in any episode is evaluated by the total expected reward or return: , where this notation means that all actions in the episode are taken as prescribed by . We focus here on deterministic time-dependent policies and note that the optimal policy and return depend on the context of the episode. The difference between the achieved and the optimal return is called the optimality gap for each episode where is the algorithm’s policy in that episode.
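As a concrete illustration of return and optimality gap, the following minimal sketch evaluates a deterministic time-dependent policy by backward induction and compares it with the optimal return. It assumes a single fixed tabular MDP (no context) with time-independent dynamics and a fixed start state; the function names `policy_return` and `optimal_return` and the toy MDP are hypothetical and only serve the example.

```python
import numpy as np

def policy_return(P, r, policy, start=0):
    """Return of a deterministic time-dependent policy from state `start`.

    P: (S, A, S) transition probabilities, r: (S, A) mean rewards,
    policy: (H, S) integer array, policy[t, s] = action at time t in state s.
    """
    H, S = policy.shape
    idx = np.arange(S)
    V = np.zeros(S)                      # value after the last time step
    for t in reversed(range(H)):
        a = policy[t]                    # chosen action for every state
        V = r[idx, a] + P[idx, a] @ V    # one step of policy evaluation
    return float(V[start])

def optimal_return(P, r, H, start=0):
    """Optimal return over H steps, computed by finite-horizon value iteration."""
    V = np.zeros(r.shape[0])
    for _ in range(H):
        V = (r + P @ V).max(axis=1)      # greedy backup over actions
    return float(V[start])

# Toy MDP (assumed for illustration): action 1 always pays reward 1.
S, A, H = 2, 2, 2
P = np.full((S, A, S), 0.5)
r = np.array([[0.0, 1.0], [0.0, 1.0]])
always_zero = np.zeros((H, S), dtype=int)   # policy that always plays action 0

gap = optimal_return(P, r, H) - policy_return(P, r, always_zero)
print(gap)
```

In this toy problem the policy that always plays action 0 collects no reward, so its optimality gap equals the full optimal return. During learning this gap is not observable, which is what motivates certificates.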
Additional notation.
We denote by the largest possible optimality gap, and and are the Q-function and value function of in episode . Optimal versions are marked by superscript and subscripts are omitted when unambiguous. We often treat as a linear operator, that is, for any .
3 Existing Learning Frameworks
During execution, the optimality gaps are hidden; the algorithm only observes the sum of rewards, which is a sample of . This creates risk, as one does not know when the algorithm is playing a good policy and when it is playing a potentially bad one. One might hope that performance guarantees for algorithms mitigate this risk, but no existing theoretical framework gives guarantees for individual episodes during learning:

Mistake-style PAC bounds (Strehl et al., 2006, 2009; Szita and Szepesvári, 2010; Lattimore and Hutter, 2012; Dann and Brunskill, 2015) bound the number of mistakes, that is, the size of the superlevel set with high probability, but do not tell us when mistakes can happen. The same is true for the recently proposed stronger Uniform-PAC bounds (Dann et al., 2017) which hold for all jointly.

Supervised-learning-style PAC bounds (Kearns and Singh, 2002; Jiang et al., 2017; Dann et al., 2018) guarantee that the algorithm outputs an optimal policy for a given , that is, they ensure that for greater than the bound. They do, however, require knowing ahead of time and do not give any guarantee about during learning (when is smaller than the bound).
Not knowing during learning makes it difficult to stop an algorithm at some point and extract a good policy. For example, the common way to extract a good policy for algorithms with a regret bound is to pick one of the policies executed so far at random (Jin et al., 2018). This only yields a good policy that has with probability in general. As a result, one requires episodes for a good policy with probability at least which is much larger than the of algorithms with supervised-learning-style PAC bounds. Note that the KWIK framework (Li et al., 2008) does guarantee the quality of individual predictions but is for supervised learning; its use in RL leads to mistake-style PAC bounds (see Section 7).
4 The IPOC Framework
We introduce a new learning framework that mitigates the limitations of prior guarantees highlighted above. This framework forces the algorithm to output its current policy as well as a certificate before each episode . This certificate informs the user how suboptimal the policy can be for the current context, i.e., and allows one to intervene if needed. For example, in automated customer services, one might reduce the service price in episode if certificate is above a certain threshold, since the quality of the provided service cannot be guaranteed. When there is no context, a certificate upper bounds the suboptimality of the current policy in any episode which makes algorithms anytime interruptable (Zilberstein and Russell, 1996): one is guaranteed to always know a policy with improving performance. Our learning framework is formalized as follows:
[Individual Policy Certificates (IPOC) Bounds] An algorithm satisfies an individual policy certificate (IPOC) bound if for a given it outputs a certificate and the current policy before each episode (after observing the contexts) so that with probability at least

1. all certificates are upper bounds on the suboptimality of the policy played in episode , i.e., ; and either

2a. for all numbers of episodes the cumulative sum of certificates is bounded (Cumulative Version), or

2b. for any threshold , the number of times certificates can exceed the threshold is bounded as (Mistake Version).
Here, can be (known or unknown) properties of the environment. If conditions 1 and 2a hold, we say the algorithm has a cumulative IPOC bound and if conditions 1 and 2b hold, we say the algorithm has a mistake IPOC bound.
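To make the three conditions concrete, here is a small sketch of how one might audit a logged run in simulation, where the true optimality gaps are available (in a real deployment they are hidden, which is the point of certificates). The function name `certificate_summary` and the sample numbers are hypothetical; the sketch only computes the quantities that conditions 1, 2a and 2b constrain and does not check them against any particular bound.

```python
def certificate_summary(certificates, gaps, threshold):
    """Summarize a run for the three IPOC conditions.

    certificates: certificate output before each episode
    gaps: true optimality gap of the policy played in each episode
    threshold: level used for the mistake version
    """
    # Condition 1: every certificate upper-bounds the true gap.
    valid = all(c >= g for c, g in zip(certificates, gaps))
    # Condition 2a constrains this cumulative sum.
    cumulative = sum(certificates)
    # Condition 2b constrains how often certificates exceed the threshold.
    exceed_count = sum(c > threshold for c in certificates)
    return valid, cumulative, exceed_count

# Made-up numbers, for illustration only.
certs = [1.0, 0.5, 0.25]
true_gaps = [0.5, 0.5, 0.0]
valid, cumulative, exceed_count = certificate_summary(certs, true_gaps, 0.4)
print(valid, cumulative, exceed_count)
```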
Condition 1 alone would be trivial to satisfy with , but condition 2 prohibits this by controlling the size of . Condition 2a bounds the cumulative sum of certificates (similar to regret bounds), and condition 2b bounds the size of the superlevel sets of (similar to PAC bounds). We allow both alternatives as condition 2b is stronger but one can sometimes only prove condition 2a (see Sec. 5.2.1). An IPOC bound simultaneously controls the quality of certificates (how big is) as well as the optimality gaps themselves. Hence, an IPOC bound guarantees not only that the algorithm improves its policy but also that it becomes better at telling us how well the policy performs. As such it is stronger than existing frameworks. Besides this benefit, IPOC ensures that the algorithm is anytime interruptable, i.e., it can be used to find better and better policies that have small with high probability . That means IPOC bounds imply supervised-learning-style PAC bounds for all jointly. These claims are formalized in the following statements:
Assume an algorithm has a cumulative IPOC bound .

Then it has a regret bound of same order, i.e., with probability at least , for all the regret is bounded by .

If has the form for appropriate functions , then with probability at least for any , it outputs a certificate within
(1) episodes. For settings without context, this implies that the algorithm outputs an optimal policy within that number of episodes (supervised-learning-style PAC bound).
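The first part of this proposition follows from a one-line argument, sketched here. The symbols are assumptions of this sketch, since the inline notation is not visible in the text above: $\Delta_k$ denotes the optimality gap in episode $k$, $\epsilon_k$ the certificate, and $F(W, T, \delta)$ the cumulative IPOC bound after $T$ episodes.

```latex
% On the event of probability at least 1 - \delta on which the IPOC bound holds:
% condition 1 gives \Delta_k \le \epsilon_k for every episode k,
% condition 2a gives \sum_{k=1}^{T} \epsilon_k \le F(W, T, \delta) for all T.
\[
  \mathrm{Regret}(T)
  \;=\; \sum_{k=1}^{T} \Delta_k
  \;\le\; \sum_{k=1}^{T} \epsilon_k
  \;\le\; F(W, T, \delta)
  \qquad \text{for all } T .
\]
```

The same event also yields the second part: if the certificates sum to at most $F(W, T, \delta)$ over $T$ episodes, then the smallest certificate among them is at most the average $F(W, T, \delta)/T$, which tends to zero whenever the bound is sublinear in $T$.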
If an algorithm has a mistake IPOC bound , then

it has a uniform PAC bound , i.e., with probability at least , the number of episodes with is at most for all ;

with probability at least for all , it outputs a certificate within episodes. For settings without context, that means the algorithm outputs an optimal policy within that many episodes (supervised-learning-style PAC).

if has the form with it also has a cumulative IPOC bound of order
Note that the functional form in part 2 of Prop. 4 includes all common polynomial bounds like or with appropriate factors and similarly for part 3 of Prop. 4 which covers for example .
5 Algorithms with Policy Certificates
As shown above, IPOC is stricter than other learning frameworks. Existing algorithms based on the OFU principle (Auer et al., 2009) need extensions to satisfy IPOC bounds. OFU algorithms can be interpreted as maintaining a set of models defined by confidence sets of the individual components and picking the policy optimistically from that set of models. As a byproduct, this yields an upper confidence bound on the optimal value function and therefore optimal return . We augment this by computing a lower confidence bound on the value function of the optimistic policy recursively using the same confidence set of models. This yields a lower confidence bound on which is sufficient to compute . We demonstrate this approach by extending two similar OFU algorithms, one for tabular MDPs with no side information, and the other for the more general case with side information. While the algorithms have similar structure, we consider them separately because we can prove stronger IPOC guarantees for the first (see Section 5.2.1).
5.1 Policy Certificates in Tabular MDPs
We present an extension of the UBEV algorithm by Dann et al. (2017) called ORLC (optimistic reinforcement learning with certificates), shown in Algorithm 1. Algorithm 1 essentially combines the policy selection approach of UBEV with high-confidence model-based policy evaluation of the current policy. Before each episode, Algorithm 1 computes , an optimistic estimate of , as well as , a pessimistic estimate of , by dynamic programming on the empirical model and confidence intervals and for and , respectively. Note that the width of the lower confidence bounds is by a factor larger than , as the estimation target of is a random quantity due to the dependency on as opposed to (see discussion below).
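The structure of this computation can be sketched as follows. This is not Algorithm 1 itself: the confidence widths are passed in as plain arrays (`bonus_up`, `bonus_lo`) rather than derived from visit counts, rewards are assumed to lie in [0, 1] so that values can be clipped to the number of remaining steps, and the function name `optimistic_policy_with_certificate` is hypothetical. The sketch only shows how one backward pass yields an optimistic policy, an upper bound on the optimal value, a lower bound on the value of that policy, and their difference as the certificate.

```python
import numpy as np

def optimistic_policy_with_certificate(P_hat, r_hat, bonus_up, bonus_lo, H, start=0):
    """One planning pass before an episode (illustrative sketch).

    P_hat: (S, A, S) empirical transitions, r_hat: (S, A) empirical rewards,
    bonus_up / bonus_lo: (S, A) non-negative confidence widths (placeholders).
    Returns the greedy optimistic policy (H, S) and the certificate.
    """
    S, A = r_hat.shape
    idx = np.arange(S)
    V_up = np.zeros(S)   # upper bound on the optimal value at the next step
    V_lo = np.zeros(S)   # lower bound on the value of the chosen policy
    policy = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        remaining = H - t    # assumes rewards in [0, 1]
        Q_up = np.clip(r_hat + P_hat @ V_up + bonus_up, 0.0, remaining)
        Q_lo = np.clip(r_hat + P_hat @ V_lo - bonus_lo, 0.0, remaining)
        policy[t] = Q_up.argmax(axis=1)      # act optimistically
        V_up = Q_up[idx, policy[t]]
        V_lo = Q_lo[idx, policy[t]]          # evaluate the same policy pessimistically
    return policy, float(V_up[start] - V_lo[start])

# Toy empirical model (assumed for illustration).
S, A, H = 2, 2, 2
P_hat = np.full((S, A, S), 0.5)
r_hat = np.array([[0.0, 1.0], [0.0, 1.0]])

_, cert_no_bonus = optimistic_policy_with_certificate(
    P_hat, r_hat, np.zeros((S, A)), np.zeros((S, A)), H)
_, cert = optimistic_policy_with_certificate(
    P_hat, r_hat, np.full((S, A), 0.1), np.full((S, A), 0.1), H)
print(cert_no_bonus, cert)
```

With zero widths the upper and lower passes coincide and the certificate is zero; as the widths shrink with more data, the certificate shrinks with them.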
We show the following IPOC bound for this algorithm: [Mistake IPOC Bound of Algorithm 1] For any given , Algorithm 1 satisfies in any tabular MDP with states, actions and horizon , the following Mistake IPOC bound: For all , the number of episodes where the algorithm outputs a certificate is
(2) 
By Proposition 4, this implies a PAC bound of the same order as well as a regret bound. The PAC lower bound (Dann and Brunskill, 2015) for this setting is , which implies an IPOC mistake lower bound of the same order by Proposition 4. We conjecture that Algorithm 1 satisfies a Uniform-PAC bound of order , which is by a factor lower than the Uniform-PAC bound of UBEV due to our assumption of time-independent dynamics. Using techniques by Azar et al. (2017), this bound can be reduced to match the lower PAC bound. However, as we sketch below, existing techniques cannot be directly applied to the lower confidence bounds in Algorithm 1. It is therefore an open question whether our IPOC bound in Theorem 1 is improvable, or whether the IPOC lower bound is strictly larger than the PAC lower bound. Interestingly, in the related active learning setting, such a discrepancy between achieved and certifiable performance is known to exist (Balcan et al., 2010).
5.1.1 Proof Sketch of the IPOC Bound
To show Theorem 1, we need to verify conditions 1 and 2b of Definition 4. Condition 2b can be shown in a similar way to existing Uniform-PAC bounds (Dann et al., 2017) but with optimality gaps being replaced by certificates . For condition 1 it is sufficient to show that and hold in all episodes . We use additional subscripts to indicate the value of variables before sampling in episode . Proving optimism, , is standard in analyses of OFU algorithms. Hence, we focus here on showing . When there is no value clipping one can use the following common decomposition (Azar et al., 2017; Jin et al., 2018; Dann et al., 2017). All terms are functions of which we omit in the following for readability.
(3)  
(4)  
(5) 
Here, we expanded and used . We want to show that this value difference is nonnegative. Using standard martingale concentration, one can show . It remains to show that . We decompose the left-hand side as
Using an inductive assumption that , the first term cannot be positive. Note that when showing optimism () the second term is which is a martingale that can be bounded directly by . Unfortunately, the second term here is not a martingale as and both depend on the samples. For that reason, we have to resort to Hölder’s inequality to decompose
(6) 
and apply concentration bounds on the distance of empirical distributions to get the upper bound . This is why the lower confidence bound widths are by a factor larger than the upper confidence bound widths . Eventually, this yields an IPOC bound compared to the conjectured Uniform-PAC bound with dependency in the term. Similarly, the difference in dependency of our IPOC and conjectured Uniform-PAC bounds originates from leveraging Bernstein’s inequality for the upper confidence bound widths. That requires bounding how much larger the empirical variance estimate of the value of the next state can be compared to using target values. While this is possible by exploiting that is monotonically decreasing with (Azar et al., 2017, Equation 5), this technique cannot be applied to the lower confidence widths as is not monotone in .
5.2 Policy Certificates in MDPs With Linear Side Information
After considering the tabular MDP setting, we now present an algorithm for the more general setting with side information, which for example allows us to take background information about a customer into account and generalize across different customers.
Algorithm 2 gives an extension, called ORLC-SI, of the OFU algorithm by Abbasi-Yadkori and Neu (2014). Similar to the tabular Algorithm 1, Algorithm 2 computes as an upper bound on the optimal Q-function and as a lower bound on the Q-function of the current policy using dynamic programming with the empirical model as well as confidence bound widths and . Unlike in the tabular case, the empirical model is now computed as least-squares estimates of the model parameters evaluated at the current contexts. Specifically, the empirical transition probability is where is the least-squares estimate of model parameter . Since transition probabilities are normalized, this estimate is then clipped to . Note that this empirical model is estimated separately for each triple, but does generalize across different contexts. The confidence widths and are derived using ellipsoid confidence intervals on model parameters (Abbasi-Yadkori and Neu, 2014). We show the following IPOC bound:
Theorem 2 [Cumulative IPOC Bound for Alg. 2] For any and regularization parameter , Algorithm 2 satisfies the following cumulative IPOC bound in any MDP with states, actions, contexts with dimensions and as well as bounded parameters and . With probability at least all certificates are upper bounds on the optimality gaps and their total sum after episodes is bounded for all by
(7) 
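The least-squares step described above can be sketched for a single (state, action, next state) triple as follows. This is an illustration under stated assumptions rather than the estimator of Algorithm 2: it uses a standard ridge-regularized least-squares fit with regularization `lam`, regresses a 0/1 indicator of whether the next state was observed on the episode contexts, clips the prediction to [0, 1], and reports the ellipsoid norm of the new context without the scaling constants that a full confidence width would include. The function name `estimate_transition` is hypothetical.

```python
import numpy as np

def estimate_transition(contexts, indicators, new_context, lam=1.0):
    """Ridge least-squares estimate of one transition probability.

    contexts: (n, d) contexts of past visits to a fixed state-action pair
    indicators: (n,) 1.0 if the next state was the target state, else 0.0
    new_context: (d,) context of the upcoming episode
    """
    X = np.asarray(contexts, dtype=float)
    y = np.asarray(indicators, dtype=float)
    x = np.asarray(new_context, dtype=float)
    d = X.shape[1]
    N = lam * np.eye(d) + X.T @ X                 # regularized covariance matrix
    theta_hat = np.linalg.solve(N, X.T @ y)       # least-squares parameter estimate
    p_hat = float(np.clip(x @ theta_hat, 0.0, 1.0))   # clip to a valid probability
    # Ellipsoid norm of the context; confidence widths scale with this quantity.
    ellipsoid_norm = float(np.sqrt(x @ np.linalg.solve(N, x)))
    return p_hat, ellipsoid_norm

# Four past visits with identical context, target state reached twice.
p_hat, ellipsoid_norm = estimate_transition(
    contexts=[[1.0, 0.0]] * 4,
    indicators=[1.0, 1.0, 0.0, 0.0],
    new_context=[1.0, 0.0],
)
print(p_hat, ellipsoid_norm)
```

Using every visit to the state-action pair, with an indicator for the target next state, matches the description in the text that all occurrences are used no matter which next state was observed.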
By Proposition 4, this IPOC bound implies a regret bound of the same order which improves on the regret bound of Abbasi-Yadkori and Neu (2014) with by a factor of . While they make a different modelling assumption (generalized linear instead of linear), we believe at least our better dependency is due to using improved least-squares estimators for the transition dynamics [Footnote 1: They estimate only from samples where the transition was observed instead of all occurrences of (no matter whether was the next state).] and can likely be transferred to their setting. The mistake-type PAC bound by Modi et al. (2018) is not directly comparable because our cumulative IPOC does not imply a mistake-type PAC bound. [Footnote 2: Similar to regret and PAC bounds (Dann et al., 2017), an algorithm with a sublinear cumulative IPOC bound can still output a certificate larger than a given threshold infinitely often as long as it does so sufficiently less frequently (see Section 5.2.1).] Nonetheless, loosely translating our result to a PAC-like bound yields which is much smaller than their bound for sufficiently small .
The confidence bounds in Algorithm 2 are more general but looser compared to the confidence bounds specialized to the tabular case in Algorithm 1, in particular the upper confidence bounds. Instantiating the cumulative IPOC bound for Algorithm 2 from Theorem 2 for tabular MDPs (where for all ) yields which is worse than the cumulative IPOC bound for Algorithm 1 implied by Theorem 1.
5.2.1 Mistake IPOC Bound for Algorithm 2?
By Proposition 4, a mistake IPOC bound is stronger than the cumulative version we proved for Algorithm 2. One might wonder whether Algorithm 2 also satisfies this stronger bound, but this is not the case: For any , there is an MDP with linear side information such that Algorithm 2 outputs certificates infinitely often with probability .
Proof Sketch.
Consider a two-armed bandit where the two-dimensional context is identical to the deterministic reward for both actions. The context alternates between and . That means in odd-numbered episodes, the agent receives reward for action and reward for action (bandit A) and conversely in even-numbered episodes (bandit B). Let and be the current number of times action was played in each bandit and the covariance matrix. One can show that the optimistic Q-value of action in bandit A is lower bounded as
(8)
(9)
Assume now that the agent stops playing action 2 in bandit A and action 1 in bandit B at some point. Then the denominator in Eq. (9) stays constant but the numerator grows unboundedly as . That implies that but the optimistic Q-value for the other action approaches the true reward. Eventually and the agent will play the suboptimal action in bandit A again. Hence, Algorithm 2 has to output infinitely many . ∎
The construction in the proof illustrates that the nondecreasing nature of the ellipsoid confidence intervals causes this negative result (due to the term in in Line 2 of Alg. 2). This does not rule out alternative algorithms with a mistake IPOC bound for this setting, but they would likely require entirely different parameter estimators and confidence bounds.
6 Simulation Experiments
Certificates need to upper bound the optimality gap in each episode, even for the worst case up to a small failure probability, and Algorithms 1 and 2 are not optimized for empirical performance. As such, their certificates may be conservative, and potentially significantly overestimate the unobserved optimality gaps without further empirical tuning. Yet, one may wonder whether the certificates output by Algorithms 1 and 2 are simply a monotonically decreasing sequence, or whether they can indicate the actual performance variation during learning. In this section, we present the results of a small simulation study, which demonstrates that the certificates do inform us about when the algorithms execute a bad policy. For brevity, we focus on the more general Algorithm 2 in tasks with side information. Details are available in Appendix D.
We first apply Algorithm 2 to randomly generated contextual bandit problems () with dimensional context and actions. [Footnote 3: We actually use a slightly more complicated version of Algorithm 2 with better empirical performance but the same theoretical guarantees. All details are in Appendix D. For clarity, we presented the simpler Algorithm 2 in the main paper.] Certificates and optimality gaps have a correlation of which confirms that certificates are informative about the policy’s return. If one for example needs to intervene when the policy is more than from optimal (e.g., by reducing the price for that customer), then in more than