# Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov Decision Process (MDP) for which an upper bound c on the span of the optimal bias function is known. For an MDP with S states, A actions and Γ ≤ S possible next states, we prove a regret bound of O(c√(ΓSAT)), which significantly improves over existing algorithms (e.g., UCRL and PSRL), whose regret scales linearly with the MDP diameter D. In fact, the optimal bias span is finite and often much smaller than D (e.g., D = ∞ in non-communicating MDPs). A similar result was originally derived by Bartlett and Tewari (2009) for REGAL.C, for which no tractable algorithm is available. In this paper, we relax the optimization problem at the core of REGAL.C, we carefully analyze its properties, and we provide the first computationally efficient algorithm to solve it. Finally, we report numerical simulations supporting our theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span.


## 1 Introduction

While learning in an unknown environment, a reinforcement learning (RL) agent must trade off the exploration needed to collect information about the dynamics and reward, and the exploitation of the experience gathered so far to gain as much reward as possible. In this paper, we focus on the regret framework (Jaksch et al., 2010), which evaluates the exploration-exploitation performance by comparing the rewards accumulated by the agent and an optimal policy. A common approach to the exploration-exploitation dilemma is the optimism in the face of uncertainty (OFU) principle: the agent maintains optimistic estimates of the value function and, at each step, it executes the policy with highest optimistic value (e.g., Brafman and Tennenholtz, 2003; Jaksch et al., 2010; Bartlett and Tewari, 2009). An alternative approach is posterior sampling (Thompson, 1933), which maintains a Bayesian distribution over MDPs (i.e., dynamics and expected reward) and, at each step, samples an MDP and executes the corresponding optimal policy (e.g., Osband et al., 2013; Abbasi-Yadkori and Szepesvári, 2015; Osband and Roy, 2017; Ouyang et al., 2017; Agrawal and Jia, 2017).

Given a finite MDP with S states, A actions and diameter D (i.e., the time needed to connect any two states), Jaksch et al. (2010) proved that no algorithm can achieve regret smaller than Ω(√(DSAT)). While recent work successfully closed the gap between upper and lower bounds w.r.t. the dependency on the number of states (e.g., Agrawal and Jia, 2017; Azar et al., 2017), relatively little attention has been devoted to the dependency on D. While the diameter quantifies the number of steps needed to “recover” from a bad state in the worst case, the actual regret incurred while “recovering” is related to the difference in potential reward between “bad” and “good” states, which is accurately measured by the span (i.e., the range) of the optimal bias function h*. While the diameter is an upper bound on the bias span, it could be arbitrarily larger (e.g., weakly-communicating MDPs may have finite span and infinite diameter), thus suggesting that algorithms whose regret scales with the span may perform significantly better.¹ Building on the idea that the OFU principle should be mitigated by the bias span of the optimistic solution, Bartlett and Tewari (2009) proposed three different algorithms (referred to as Regal) achieving regret scaling with the bias span sp{h*} instead of D. The first algorithm defines a span-regularized problem, where the regularization constant needs to be carefully tuned depending on the state-action pairs visited in the future, which makes it unfeasible in practice. Alternatively, they propose a constrained variant, called Regal.C, where the regularized problem is replaced by a constraint on the span. Assuming that an upper bound c on the bias span of the optimal policy is known (i.e., sp{h*} ≤ c), Regal.C achieves a regret bound in which the diameter D is replaced by the span bound c. Unfortunately, they do not propose any computationally tractable algorithm solving the constrained optimization problem, which may even be ill-posed in some cases. Finally, Regal.D avoids the need of knowing the future visits by using a doubling trick, but still requires solving a regularized problem, for which no computationally tractable algorithm is known.

¹The proof of the lower bound relies on the construction of an MDP whose diameter actually coincides with the bias span (up to a multiplicative numerical constant), thus leaving open the question whether the “actual” lower bound depends on D or on the bias span. See (Osband and Van Roy, 2016) for a more thorough discussion.

In this paper, we build on Regal.C and propose a constrained optimization problem for which we derive a computationally efficient algorithm, called ScOpt. We identify conditions under which ScOpt converges to the optimal solution and propose a suitable stopping criterion to achieve an ε-optimal policy. Finally, we show that using a slightly modified optimistic argument, the convergence conditions are always satisfied and the learning algorithm obtained by integrating ScOpt into a UCRL-like scheme (resulting in SCAL) achieves regret scaling as Õ(c√(ΓSAT)) when an upper bound c on the optimal bias span is available, thus providing the first computationally tractable algorithm that can solve weakly-communicating MDPs.

## 2 Preliminaries

We consider a finite weakly-communicating Markov decision process M = (S, A, r, p) (Puterman, 1994, Sec. 8.3) with a set of states S and a set of actions A = ∪_{s∈S} A_s. Each state-action pair (s, a) is characterized by a reward distribution with mean r(s, a) and support in [0, r_max], as well as a transition probability distribution p(·|s, a) over next states. We denote by S = |S| and A = max_s |A_s| the number of states and actions, and by Γ = max_{s,a} ‖p(·|s, a)‖_0 the maximum support of all transition probabilities. A Markov randomized decision rule d: S → P(A) maps states to distributions over actions. The corresponding set is denoted by D^MR, while the subset of Markov deterministic decision rules is D^MD. A stationary policy π = (d, d, …) repeatedly applies the same decision rule d over time. The set of stationary policies defined by Markov randomized (resp. deterministic) decision rules is denoted by Π^SR (resp. Π^SD). The long-term average reward (or gain) of a policy π ∈ Π^SR starting from s ∈ S is

 g^π_M(s) := lim_{T→+∞} E[ (1/T) ∑_{t=1}^T r(s_t, a_t) ],

where a_t ∼ d(s_t). Any stationary policy π has an associated bias function defined as

 h^π_M(s) := C-lim_{T→+∞} E[ ∑_{t=1}^T ( r(s_t, a_t) − g^π_M(s_t) ) ],

that measures the expected total difference between the reward and the stationary reward in Cesaro-limit² (denoted C-lim). Accordingly, the difference h^π_M(s) − h^π_M(s′) quantifies the (dis-)advantage of starting in state s rather than s′. In the following, we drop the dependency on M whenever clear from the context and denote by sp{h^π} := max_s h^π(s) − min_s h^π(s) the span of the bias function. In weakly communicating MDPs, any optimal policy π* ∈ argmax_π g^π has constant gain, i.e., g^{π*}(s) = g* for all s ∈ S. Let P_d and r_d be the transition matrix and reward vector associated with decision rule d ∈ D^MR. We denote by L_d the Bellman operator associated with d and by L the optimal Bellman operator:

 ∀v ∈ ℝ^S,  L_d v := r_d + P_d v;  L v := max_{d ∈ D^MR} { r_d + P_d v }.

²For policies with an aperiodic chain, the standard limit exists.

For any policy π = (d)^∞ ∈ Π^SR, the gain g^π and bias h^π satisfy the following system of evaluation equations:

 g^π = P_d g^π;  h^π = L_d h^π − g^π. (1)

Moreover, there exists a policy π* for which the optimal gain g* and bias h* satisfy the optimality equation

 h* = L h* − g* e,  where e = (1, …, 1)^⊤. (2)

Finally, we denote by D := max_{s≠s′} τ_M(s → s′) the diameter of M, where τ_M(s → s′) is the minimal expected number of steps needed to reach s′ from s in M.

Learning problem. Let M* be the true unknown MDP. We consider the learning problem where S, A and r_max are known, while rewards r and transition probabilities p are unknown and need to be estimated on-line. We evaluate the performance of a learning algorithm after T time steps by its cumulative regret Δ(T) := T g* − ∑_{t=1}^T r_t(s_t, a_t).

## 3 Optimistic Exploration-Exploitation

Since our proposed algorithm SCAL (Sec. 6) is a tractable variant of Regal.C and thus a modification of UCRL, we first recall their common structure summarized in Fig. 1.

### 3.1 Upper-Confidence Reinforcement Learning

UCRL proceeds through episodes. At the beginning of each episode k, UCRL computes a set of plausible MDPs defined as M_k = { M = (S, A, r̃, p̃) : r̃(s, a) ∈ B_r(s, a), p̃(·|s, a) ∈ B_p(s, a) }, where B_r and B_p are high-probability confidence intervals on the rewards and transition probabilities of the true MDP M*, which guarantees that M* ∈ M_k w.h.p. We use confidence intervals constructed using the empirical Bernstein inequality (Audibert et al., 2007; Maurer and Pontil, 2009):

 β^{sa}_{r,k} := √( 14 σ̂²_{r,k}(s,a) b_{k,δ} / max{1, N_k(s,a)} ) + (49/3) r_max b_{k,δ} / max{1, N_k(s,a) − 1},

 β^{sas′}_{p,k} := √( 14 σ̂²_{p,k}(s′|s,a) b_{k,δ} / max{1, N_k(s,a)} ) + (49/3) b_{k,δ} / max{1, N_k(s,a) − 1},

where N_k(s, a) is the number of visits in (s, a) before episode k, σ̂²_{r,k}(s, a) and σ̂²_{p,k}(s′|s, a) are the empirical variances of r(s, a) and p(s′|s, a), and b_{k,δ} is a logarithmic confidence term in k, S, A and 1/δ. Given the empirical averages r̂_k(s, a) and p̂_k(s′|s, a) of rewards and transitions, we define B_r(s, a) := [r̂_k(s, a) − β^{sa}_{r,k}, r̂_k(s, a) + β^{sa}_{r,k}] ∩ [0, r_max] and B_p(s, a, s′) := [p̂_k(s′|s, a) − β^{sas′}_{p,k}, p̂_k(s′|s, a) + β^{sas′}_{p,k}] ∩ [0, 1].
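As an illustration, the confidence width β above can be computed as follows (a minimal sketch; the function name is ours, and `b_delta` stands for the logarithmic term b_{k,δ}):

```python
import math

def bernstein_width(var_hat, n, b_delta, range_max=1.0):
    """Empirical-Bernstein confidence width with the constants of the display above:
    sqrt(14 * var * b / max(1, n)) + (49/3) * range * b / max(1, n - 1)."""
    return (math.sqrt(14.0 * var_hat * b_delta / max(1, n))
            + (49.0 / 3.0) * range_max * b_delta / max(1, n - 1))

# The width shrinks with the number of visits n (roughly as 1/sqrt(n))
w_few = bernstein_width(var_hat=0.25, n=10, b_delta=2.0)
w_many = bernstein_width(var_hat=0.25, n=1000, b_delta=2.0)
assert w_many < w_few
```

The variance-dependent first term dominates for large n, which is what makes Bernstein-type intervals tighter than Hoeffding-type ones when the empirical variance is small.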

Once M_k has been computed, UCRL finds an approximate solution to the optimization problem

 (M̃*_k, π̃*_k) ∈ argmax_{M ∈ M_k, π ∈ Π^SD(M)} g^π_M. (3)

Since M* ∈ M_k w.h.p., it holds that g^{π̃*_k}_{M̃*_k} ≥ g* w.h.p. As noticed by Jaksch et al. (2010), problem (3) is equivalent to finding the optimal policy of the extended MDP M̃_k (sometimes called bounded-parameter MDP) implicitly defined by M_k. More precisely, in M̃_k the (finite) action space is “extended” to a compact action space by considering every possible value of the confidence intervals B_r and B_p as fictitious actions. The equivalence between the two problems comes from the fact that for each policy π̃ of M̃_k there exists a pair (M, π) with M ∈ M_k such that the policies π̃ and π induce the same Markov reward process on M̃_k and M respectively, and conversely. Consequently, (3) can be solved by running so-called extended value iteration (EVI): starting from an initial vector u_0 = 0, EVI recursively computes

 u_{n+1}(s) = max_{a, r̃, p̃} [ r̃(s, a) + p̃(·|s, a)^⊤ u_n ] = L̃ u_n(s), (4)

where L̃ is the optimistic optimal Bellman operator associated to M̃_k. If EVI is stopped when sp{u_{n+1} − u_n} ≤ ε, then the greedy policy π̃_k w.r.t. u_n is guaranteed to be ε-optimal in M̃_k. Therefore, the policy π̃_k associated to u_n is an optimistic ε-optimal policy, and UCRL executes π̃_k until the end of episode k.
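A minimal sketch of the EVI recursion (4) is given below (our own illustration; for simplicity, each state-action pair carries a single L1 confidence radius on its transition vector, in the style of Jaksch et al. (2010), instead of the per-next-state Bernstein intervals defined above):

```python
import numpy as np

def optimistic_p(p_hat, beta, u):
    """Inner maximization of p^T u over the L1 ball ||p - p_hat||_1 <= beta:
    move probability mass toward the state with the largest value u."""
    p = p_hat.copy()
    p[np.argmax(u)] = min(1.0, p_hat[np.argmax(u)] + beta / 2.0)
    for s in np.argsort(u):            # remove excess mass from the worst states
        if p.sum() <= 1.0 + 1e-12:
            break
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))
    return p

def evi(r_hat, beta_r, p_hat, beta_p, r_max, eps):
    """Extended value iteration: u_{n+1}(s) = max_{a, r, p} [r + p^T u_n]."""
    S, A = r_hat.shape
    u = np.zeros(S)
    while True:
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                r_opt = min(r_hat[s, a] + beta_r[s, a], r_max)   # optimistic reward
                p_opt = optimistic_p(p_hat[s, a], beta_p[s, a], u)
                q[s, a] = r_opt + p_opt @ u
        u_next = q.max(axis=1)
        d = u_next - u
        if d.max() - d.min() < eps:    # span stopping rule
            return u_next, q.argmax(axis=1)
        u = u_next - u_next.min()      # keep the iterates bounded
```

With zero confidence radii the recursion reduces to plain value iteration on the empirical MDP, and the span stopping rule still yields a near-optimal greedy policy.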

### 3.2 A first relaxation of Regal.C

Regal.C follows the same steps as UCRL but, instead of solving problem (3), it tries to find the best optimistic model having constrained optimal bias span, i.e.,

 (M̃*_RC, π̃*_RC) = argmax_{M ∈ M^RC_k, π ∈ Π^SD(M)} g^π_M, (5)

where M^RC_k := { M ∈ M_k : sp{h*_M} ≤ c } is the set of plausible MDPs with bias span of the optimal policy bounded by c. Under the assumption that sp{h*_{M*}} ≤ c, Regal.C discards any MDP whose optimal policy has a span larger than c (i.e., M ∉ M^RC_k) and otherwise looks for the MDP with highest optimal gain g*_M. Unfortunately, there is no guarantee that all MDPs in M_k are weakly communicating and thus have constant gain. As a result, we suspect this problem to be ill-posed (i.e., the maximum is most likely not well-defined). Moreover, even if it is well-posed, searching the space M^RC_k seems to be computationally intractable. Finally, for any M ∈ M_k, there may be several optimal policies with different bias spans, and some of them may not satisfy the optimality equation (2) and are thus difficult to compute.

In this paper, we slightly modify problem (5) as follows:

 (M̃*_c, π̃*_c) ∈ argmax_{M ∈ M_k, π ∈ Π_c(M)} g^π_M, (6)

where the search space of policies is defined as

 Π_c(M) := { π ∈ Π^SR : sp{h^π_M} ≤ c ∧ sp{g^π_M} = 0 },

and g̃*_c := g^{π̃*_c}_{M̃*_c} if the argmax exists. Similarly to (3), problem (6) is equivalent to solving the same problem on the extended MDP M̃_k. Unlike (5), for every MDP in M_k (not just those in M^RC_k), (6) considers all (stationary) policies with constant gain satisfying the span constraint (not just the deterministic optimal policies).

Since g^π_M and h^π_M are in general non-continuous functions of (π, M), the argmax in (5) and (6) may not exist. Nevertheless, by reasoning in terms of supremum value, we can show that (6) is always a relaxation of (5) (where we enforce the additional constraint of constant gain).

###### Proposition 1.

Define the restricted set of MDPs E_k := { M ∈ M_k : sp{h*_M} ≤ c ∧ sp{g*_M} = 0 }. Then

 sup_{M ∈ E_k, π ∈ Π^SD} g^π_M ≤ sup_{M ∈ M_k, π ∈ Π_c(M)} g^π_M.
###### Proof.

The result follows from the fact that E_k ⊆ M_k and that, for any M ∈ E_k, the optimal (deterministic) policies of M belong to Π_c(M). ∎

As a result, the optimism principle is preserved when moving from (5) to (6), and since the set of admissible MDPs is the same, any algorithm solving (6) enjoys the same regret guarantees as Regal.C. In the following, we further characterise problem (6), introduce a truncated value iteration algorithm to solve it, and finally integrate it into a UCRL-like scheme to recover the Regal.C regret guarantees.

## 4 The Optimization Problem

In this section we analyze some properties of the following optimization problem, of which (6) is an instance,

 sup_{π ∈ Π_c(M)} { g^π_M }, (7)

where M is any MDP (with discrete or compact action space) s.t. Π_c(M) ≠ ∅. Problem (7) aims at finding a policy that maximizes the gain within the set of randomized policies with constant gain (i.e., sp{g^π} = 0) and bias span smaller than c (i.e., sp{h^π} ≤ c). Since Π_c(M) ≠ ∅, the supremum always exists and we denote it by g*_c. The set of maximizers is denoted by Π*_c(M), with elements π*_c (if Π*_c(M) is non-empty).

In order to give some intuition about the solutions of problem (7), we introduce the following illustrative MDP.

###### Example 1.

Consider the two-state MDP depicted in Fig. 2. For a generic stationary policy with decision rule d parameterized by (x, y) ∈ [0, 1]² we have that

 d = [ x  1−x ; y  1−y ];  P_d = [ 1−x  x ; y  1−y ],  r_d = [ (1−x)/2 ; 1−y ].

We can compute the gain and the bias by solving the linear system (1). For any x > 0 or y > 0, we obtain

 g_1 = g_2 = 1/2 + x(1−3y)/(2(x+y));  h_2 − h_1 = 1/2 + (1−3y)/(2(x+y)),

while for x = 0, y = 0, we have g_1 = 1/2 and g_2 = 1, with h_1 = h_2 = 0. Note that sp{h} = |h_2 − h_1| for any policy. In the following, we will use this example, choosing particular values for x, y and c, to illustrate some important properties of optimization problem (7).
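These closed-form expressions can be verified numerically by solving the evaluation equations (1) (a small sanity-check script; the helper name is ours):

```python
import numpy as np

def gain_bias(P, r):
    """Gain and bias of a Markov reward process with a single recurrent class,
    via the stationary distribution and (I - P) h = r - g e with h[0] = 0."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    mu = np.linalg.lstsq(A, b, rcond=None)[0]    # stationary distribution
    g = float(mu @ r)
    M = np.eye(n) - P
    M[0] = 0.0; M[0, 0] = 1.0                    # pin h[0] = 0
    rhs = r - g; rhs[0] = 0.0
    return g, np.linalg.solve(M, rhs)

x, y = 0.3, 0.2                                  # any x > 0 or y > 0
P = np.array([[1 - x, x], [y, 1 - y]])
r = np.array([(1 - x) / 2, 1 - y])
g, h = gain_bias(P, r)
assert abs(g - (0.5 + x * (1 - 3 * y) / (2 * (x + y)))) < 1e-9
assert abs((h[1] - h[0]) - (0.5 + (1 - 3 * y) / (2 * (x + y)))) < 1e-9
```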

Randomized policies. The following lemma shows that, unlike in unconstrained gain maximization where there always exists an optimal deterministic policy, the solution of (7) may indeed be a randomized policy.

###### Lemma 2.

There exists an MDP M and a scalar c > 0 such that Π*_c(M) ≠ ∅ and Π*_c(M) ∩ Π^SD = ∅, i.e., all gain-maximizing policies satisfying the span constraint are randomized.

###### Proof.

Consider Ex. 1 with constraint c = 3/4. The deterministic policies with constant gain and bias span smaller than c are x = 1, y = 1 (with g = 0 and sp{h} = 0) and x = 0, y = 1 (with g = 1/2 and sp{h} = 1/2). On the other hand, a randomized policy can satisfy the constraint and achieve a higher gain by taking x = 1 and y = 1/7, which gives g = 3/4 and h_2 − h_1 = 3/4 = c; a direct computation shows that no feasible policy achieves a larger gain, thus proving the statement. ∎

Constant gain. The following lemma shows that if we consider non-constant gain policies, the supremum in (7) may not be well defined, as no dominating policy exists. A policy π† is dominating if for any policy π, g^{π†}(s) ≥ g^π(s) in all states s ∈ S.

###### Lemma 3.

There exists an MDP M and a scalar c > 0 such that there exists no dominating policy among the policies with constrained bias span (i.e., sp{h^π} ≤ c).

###### Proof.

Consider Ex. 1 with constraint c = 3/4. As shown in the proof of Lem. 2, the optimal stationary policy with constant gain has g_1 = g_2 = 3/4. On the other hand, the only policy with non-constant gain is x = 0, y = 0, which has g_1 = 1/2 and g_2 = 1, with h_1 = h_2 = 0 (so the span constraint is satisfied). Since no feasible policy attains both g_1 ≥ 3/4 and g_2 ≥ 1 at the same time, no dominating policy exists, thus proving the statement. ∎

On the other hand, when the search space is restricted to policies with constant gain, the optimization problem is well posed. Whether problem (7) always admits a maximizer is left as an open question. The main difficulty comes from the fact that, in general, π ↦ g^π is not a continuous map and Π_c(M) is not a closed set. For instance in Ex. 1, although the maximum is attained, the limit point x = 0, y = 0 does not belong to Π_c(M) (i.e., Π_c(M) is not closed) and π ↦ g^π is not continuous at this point. Notice that when the MDP M is unichain (Puterman, 1994, Sec. 8.3), Π_c(M) is compact, π ↦ g^π is continuous, and we can prove the following lemma (see App. A):

###### Lemma 4.

If M is unichain, then Π*_c(M) ≠ ∅.

We will later show that for the specific instances of (7) that are encountered by our algorithm SCAL, Lem. 4 holds.

## 5 Planning with ScOpt

In this section, we introduce ScOpt and derive sufficient conditions for its convergence to the solution of (7). In the next section, we will show that these assumptions always hold when ScOpt is carefully integrated into UCRL (while in App. B we show that they may not hold in general).

### 5.1 Span-constrained value and policy operators

ScOpt is a version of (relative) value iteration (Puterman, 1994; Bertsekas, 1995), where the optimal Bellman operator L is modified to return value functions with span bounded by c, and the stopping condition is tailored to return a constrained greedy policy with near-optimal gain. We first introduce a constrained version of the optimal Bellman operator L.

###### Definition 1.

Given v ∈ ℝ^S and c ≥ 0, we define the value operator T_c as

 T_c v(s) := { L v(s) if s ∈ S̄(c, v);  c + min_{s′}{ L v(s′) } if s ∈ S ∖ S̄(c, v), (8)

where S̄(c, v) := { s ∈ S : L v(s) ≤ c + min_{s′} L v(s′) }.

In other words, operator T_c applies a span truncation to the one-step application of L, that is, for any state s, T_c v(s) = min{ L v(s), c + min_{s′} L v(s′) }, which guarantees that sp{T_c v} ≤ c. Unlike L, operator T_c is not always associated with a decision rule d s.t. T_c v = L_d v (see App. B). We say that T_c is feasible at v ∈ ℝ^S and s ∈ S if there exists a distribution δ⁺_v(s, ·) over A_s such that

 T_c v(s) = ∑_{a ∈ A_s} δ⁺_v(s, a) [ r(s, a) + p(·|s, a)^⊤ v ]. (9)

When such a distribution exists in all states, we say that T_c is globally feasible at v, and δ⁺_v is its associated decision rule, i.e., T_c v = L_{δ⁺_v} v. In the following lemma, we identify sufficient and necessary conditions for (global) feasibility.

###### Lemma 5.

Operator T_c is feasible at v and s if and only if

 min_{a ∈ A_s} { r(s, a) + p(·|s, a)^⊤ v } ≤ min_{s′} { L v(s′) } + c. (10)

Furthermore, let

 D(c, v) := { d ∈ D^MR : sp{L_d v} ≤ c } (11)

be the set of randomized decision rules whose associated operator returns a span-constrained value function when applied to v. Then, T_c is globally feasible at v if and only if condition (10) holds in every state s ∈ S, in which case we have

 T_c v = max_{δ ∈ D(c, v)} L_δ v,  and  δ⁺_v ∈ argmax_{δ ∈ D(c, v)} L_δ v. (12)

The last part of this lemma shows that when T_c is globally feasible at v (i.e., δ⁺_v exists), T_c v is the componentwise maximal value function of the form L_δ v with decision rule δ satisfying sp{L_δ v} ≤ c. Surprisingly, even in the presence of a constraint on the one-step value span, such a componentwise maximum still exists (which is not as straightforward as in the case of the greedy operator L). Therefore, whenever T_c is globally feasible, optimization problem (12) can be seen as an LP problem (see App. A.2).

###### Definition 2.

Given v ∈ ℝ^S and c ≥ 0, let S̃(c, v) ⊆ S be the set of states where T_c is feasible (condition (10)) and δ⁺_v be the associated decision rule (Eq. 9). We define the operator G_c as³

 G_c v(s) := { δ⁺_v(s, ·) if s ∈ S̃(c, v);  argmin_{a ∈ A_s} { r(s, a) + p(·|s, a)^⊤ v } if s ∈ S ∖ S̃(c, v).

As a result, if T_c is globally feasible at v, then by definition T_c v = L_{G_c v} v. Note that computing G_c v is not significantly more difficult than computing a greedy policy (see App. C for an efficient implementation).

³When there are several distributions achieving T_c v(s) in state s, G_c v chooses an arbitrary one.

We are now ready to introduce ScOpt (Fig. 3). Given an initial vector v_0 ∈ ℝ^S and a reference state s̄ ∈ S, ScOpt implements relative value iteration where L is replaced by T_c, i.e.,

 v_{n+1} = T_c v_n − T_c v_n(s̄) e. (13)

Notice that the term T_c v_n(s̄) subtracted at any iteration n prevents v_n from increasing linearly with n and thus avoids numerical instability. However, the subtraction can be dropped without affecting the convergence properties of ScOpt. If the stopping condition sp{v_{n+1} − v_n} ≤ ε is met at iteration n, ScOpt returns the policy π_n := G_c v_n.
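For a concrete MDP given as arrays (rather than the extended MDP used later), iteration (13) can be sketched as follows. This is our own simplified illustration: it returns the plain greedy policy instead of the operator G_c, whose randomized form is discussed above (and in App. C):

```python
import numpy as np

def scopt(r, P, c, s_ref=0, eps=1e-6, max_iter=10_000):
    """Relative value iteration with the span-truncated operator
    T_c v = min(L v, c + min_s L v(s)), so that sp(T_c v) <= c always holds."""
    v = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Lv = (r + P @ v).max(axis=1)          # optimal Bellman operator L
        Tcv = np.minimum(Lv, Lv.min() + c)    # span truncation (Def. 1)
        d = Tcv - v
        if d.max() - d.min() < eps:           # stopping condition sp(v_{n+1} - v_n) <= eps
            gain = 0.5 * (d.max() + d.min())
            return gain, v, (r + P @ v).argmax(axis=1)
        v = Tcv - Tcv[s_ref]                  # relative update, Eq. (13)
    raise RuntimeError("no convergence")

# Two-state chain: with a loose constraint the gain of plain relative value
# iteration is recovered; a tight span constraint lowers the fixed-point gain.
r = np.array([[0.35], [0.8]])
P = np.array([[[0.7, 0.3]], [[0.2, 0.8]]])
g_free = scopt(r, P, c=10.0)[0]
g_tight = scopt(r, P, c=0.2)[0]
assert g_tight < g_free
```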

### 5.2 Convergence and Optimality Guarantees

In order to derive convergence and optimality guarantees for ScOpt we need to analyze the properties of operator . We start by proving that preserves the one-step span contraction properties of .

###### Assumption 6.

The optimal Bellman operator L is a 1-step γ-span-contraction, i.e., there exists a γ < 1 such that for any pair of vectors u, v ∈ ℝ^S, sp{L u − L v} ≤ γ sp{u − v}.⁴

⁴In the undiscounted setting, if the MDP is unichain, L is an m-stage span contraction for some finite m.

###### Lemma 7.

Under Asm. 6, T_c is a γ-span contraction.

The proof of Lemma 7 relies on the fact that the truncation of L in the definition of T_c is non-expansive in span semi-norm. Details are given in App. D, where it is also shown that T_c preserves other properties of L such as monotonicity and linearity. It then follows that T_c admits a fixed point (g⁺, h⁺) solution to an optimality equation (similar to Eq. 2), and thus ScOpt converges to the corresponding bias and gain, the latter being an upper bound on the optimal solution of (7). We formally state these results in Lem. 8.

###### Lemma 8.

Under Asm. 6, the following properties hold:

1. Optimality equation and uniqueness: There exists a solution (g⁺, h⁺) to the optimality equation

 T_c h⁺ = h⁺ + g⁺ e. (14)

If (g, h) is another solution of (14), then g = g⁺ and there exists λ ∈ ℝ s.t. h = h⁺ + λ e.

2. Convergence: For any initial vector v_0 ∈ ℝ^S, the sequence (v_n) generated by ScOpt converges to a solution vector h⁺ of the optimality equation (14), and

 lim_{n→+∞} T_c^{n+1} v_0 − T_c^n v_0 = g⁺ e.

3. Dominance: The gain g⁺ is an upper bound on the supremum of (7), i.e., g⁺ ≥ g*_c.

A direct consequence of point 2 of Lem. 8 (convergence) is that ScOpt always stops after a finite number of iterations. Nonetheless, T_c may not always be globally feasible at h⁺ (see App. B), and thus there may be no policy associated to optimality equation (14). Furthermore, even when there is one, Lem. 8 provides no guarantee on the performance of the policy returned by ScOpt after a finite number of iterations. To overcome these limitations, we introduce an additional assumption, which leads to stronger performance guarantees for ScOpt.

###### Assumption 9.

Operator T_c is globally feasible at any vector v ∈ ℝ^S such that sp{v} ≤ c.

###### Theorem 10.

Assume Asm. 6 and 9 hold and let γ denote the contractive factor of T_c (Asm. 6). For any initial vector v_0 ∈ ℝ^S such that sp{v_0} ≤ c and any accuracy ε > 0, the policy π_n output by ScOpt at the stopping condition is such that g^{π_n}(s) ≥ g⁺ − ε for all s ∈ S. Furthermore, if in addition the policy π⁺ := G_c h⁺ associated to the fixed point h⁺ is unichain, π⁺ is a solution to optimization problem (7), i.e., π⁺ ∈ Π_c(M) and g⁺ = g*_c.

The first part of the theorem shows that the stopping condition used in Fig. 3 ensures that ScOpt returns an ε-optimal policy π_n. Notice that while sp{v_n} ≤ c by definition of T_c, in general, when the policy π⁺ associated to h⁺ is not unichain, we might have g^{π⁺} ≠ g⁺. On the other hand, Corollary 8.2.7. of Puterman (1994) ensures that if π⁺ is unichain then g^{π⁺} = g⁺, hence the second part of the theorem. Notice also that even if π_n is unichain, we cannot guarantee that π_n satisfies the span constraint, i.e., sp{h^{π_n}} may be arbitrarily larger than c. Nonetheless, in the next section, we show that the definition of T_c and Thm. 10 are sufficient to derive regret bounds when ScOpt is integrated into UCRL.

## 6 Learning with SCAL

In this section we introduce SCAL, an optimistic online RL algorithm that employs ScOpt to compute policies that efficiently balance exploration and exploitation. We prove that the assumptions stated in Sec. 5.2 hold when ScOpt is integrated into the optimistic framework. Finally, we show that SCAL enjoys the same regret guarantees as Regal.C, while being the first implementable and efficient algorithm to solve bias-span constrained exploration-exploitation.

Based on Def. 1, we define T̃_c as the span truncation of the optimal Bellman operator L̃ of the bounded-parameter MDP M̃_k (see Sec. 3). Given the structure of problem (6), one might consider applying ScOpt (using T̃_c) to the extended MDP M̃_k. Unfortunately, in general L̃ does not satisfy Asm. 6 and 9, and thus T̃_c may not enjoy the properties of Lem. 8 and Thm. 10. To overcome this problem, we slightly modify M̃_k as described in Def. 3.

###### Definition 3.

Let M̃_k be a bounded-parameter (extended) MDP. Let η > 0 and s̄ ∈ S an arbitrary state. We define the “modified” extended MDP M̃‡_k associated to M̃_k by⁵

 B‡_r(s, a) = [0, max{B_r(s, a)}],

 B‡_p(s, a, s′) = { B_p(s, a, s′) if s′ ≠ s̄;  B_p(s, a, s̄) ∩ [η, 1] otherwise,

where we assume that η is small enough so that the modified confidence sets B‡_p(s, a, s̄) are non-empty and still contain valid probability values. We denote by L̃‡ the optimal Bellman operator of M̃‡_k (cf. Eq. 4) and by T̃‡_c the span truncation of L̃‡ (cf. Def. 1).

⁵For any closed interval B = [a, b] ⊆ ℝ, max{B} := b.

By slightly perturbing the confidence intervals of the transition probabilities, we enforce that the “attractive” state s̄ is reached with probability at least η from any state-action pair, implying that the ergodic coefficient of M̃‡_k,

 γ := 1 − min_{s,u ∈ S, a ∈ A_s, b ∈ A_u, p̃, q̃ ∈ B‡_p} { ∑_{j ∈ S} min{ p̃(j|s, a), q̃(j|u, b) } } ≤ 1 − η,

is smaller than 1 (since the term for j = s̄ is at least η), so that L̃‡ is γ-contractive (Puterman, 1994, Thm. 6.6.6), i.e., Asm. 6 holds. Moreover, for any policy, state s̄ necessarily belongs to all recurrent classes of the induced Markov chain, implying that every policy of M̃‡_k is unichain and so M̃‡_k is unichain. As is shown in Thm. 11, the η-perturbation of B_p introduces a small bias in the final gain.

By augmenting (without perturbing) the confidence intervals of the rewards, we ensure two nice properties. First of all, for any vector v ∈ ℝ^S, L̃‡ v ≥ L̃ v, and thus by definition T̃‡_c v ≥ T̃_c v, so that optimism is preserved. Secondly, since 0 ∈ B‡_r(s, a), there exists a decision rule δ₀ with zero reward such that L̃‡_{δ₀} v = P̃_{δ₀} v, meaning that sp{L̃‡_{δ₀} v} ≤ γ sp{v} (Puterman, 1994, Proposition 6.6.1). Thus if sp{v} ≤ c then sp{L̃‡_{δ₀} v} ≤ c, and so D(c, v) ≠ ∅, which by Lem. 5 implies that T̃‡_c is globally feasible at v. Therefore, Asm. 9 holds in M̃‡_k.

When combining both the perturbation of B_p and the augmentation of B_r, we obtain Thm. 11 (proof in App. E).

###### Theorem 11.

Let M̃_k be a bounded-parameter (extended) MDP and M̃‡_k its “modified” counterpart (see Def. 3). Then