# Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stationary policy that diversifies the outcomes for optimizing the objective. The procedure is compatible with a variety of OCO tools.


## 1 Introduction

Markov Decision Processes (MDPs) model sequential optimization problems in Markovian environments. At each time, an agent performs an action, contingent upon the state of the environment. Her action influences the environment through the resulting state transition. In many situations, an action at a state leads to a vector of different and correlated outcomes, and the agent desires to optimize a complex and global objective that involves all these outcomes across time. Motivated by these situations, we consider online MDPs with Global concave Reward functions (MDPwGR). In the MDPwGR problem, an agent seeks to optimize a concave reward function, which is generally non-linear, over the average vectorial outcome generated by a latent MDP.

For online optimization with global concave rewards and MDPwGR in particular, an agent is required to alternate among different actions in order to balance the vectorial outcomes. The setting of MDPwGR presents the following subtle challenges. To alternate between two actions, the agent has to travel from one state to another, which could require visiting sub-optimal states and compromise her objective. Thus, the alternations have to be carefully controlled, in order to balance the outcomes while maintaining near-optimality, on top of her simultaneous exploration and exploitation on the latent model.

We shed light on the mentioned trade-off by proposing Toc-UCRL2, a near-optimal online algorithm for MDPwGR. The algorithm is built upon a dual based approach using gradient updates, which facilitate the balancing of outcomes, as well as UCRL2, which solves MDPs with certain scalar rewards. In order to handle the mentioned trade-off in action alternations, we introduce a novel gradient threshold procedure that delays the gradient updates. The delay is finely tuned so that the balancing mechanism is still intact while the objective is not severely compromised, leading to a no-regret and non-stationary policy.

Related Literature. MDPwGR is a common generalization of Bandits with Global concave Rewards (BwGR) and online MDPs with Scalar Rewards (MDPwSR). BwGR is first studied by [Agrawal and Devanur, 2014], who establish important connections between online convex optimization and upper-confidence-bound (UCB) based algorithms for BwGR and its generalization. The work of [Agrawal and Devanur, 2014] focuses on stochastic multi-armed bandits. Subsequently, BwGR is studied in contextual multi-armed bandits [Agrawal et al., 2016]. [Busa-Fekete et al., 2017] consider multi-armed bandits with generalized fairness objectives, which require special care different from BwGR. [Berthet and Perchet, 2017] consider the combination of the Frank-Wolfe algorithm and UCB algorithms (which is also considered in [Agrawal and Devanur, 2014]), and demonstrate fast rate convergence in cases when the concave reward functions are not known but satisfy certain smoothness properties.

BwGR is closely related to Bandits with Knapsacks (BwK), which precedes BwGR. BwK is first studied under multi-armed bandits by [Badanidiyuru et al., 2013]. Subsequently, BwK is studied under multi-armed bandits with concave rewards and convex constraints [Agrawal and Devanur, 2014], contextual multi-armed bandits [Badanidiyuru et al., 2014, Agrawal et al., 2016] and linear bandits [Agrawal and Devanur, 2016]. The works on BwGR and BwK focus on stationary stochastic environments, and provide online algorithms with global rewards converging to the offline optimum as the number of time steps grows.

Recently, [Immorlica et al., 2018] study the adversarial BwK, and show that it is impossible to achieve a competitive ratio of less than compared to the offline optimum, where is the budget. They propose algorithms with competitive ratios of compared to a certain offline benchmark. Our positive results, which are on MDPwGR and MDPwK (see Appendix 6.3), walk a fine line between the negative results for adversarial BwK and positive results for stochastic BwGR and BwK. Finally, online optimization with global rewards is also studied in adversarial settings with full feedback [Even-Dar et al., 2009, Azar et al., 2014].

MDPwSR on communicating MDPs is studied by [Auer and Ortner, 2006, Jaksch et al., 2010]. Subsequently, [Agrawal and Jia, 2017] provide improved regret bounds by optimistic posterior sampling. [Ortner, 2018] derive regret bounds in terms of a mixing time parameter. [Bartlett and Tewari, 2009] consider a more general case of weakly communicating MDPs, and regret bounds are derived in [Fruit et al., 2018b]. [Fruit et al., 2018a] study an even more general case of non-communicating MDPs. These works focus on optimizing under scalar rewards, but do not consider optimizing under vectorial outcomes. For a review on MDPs, please consult [Puterman, 1994].

Reinforcement learning on multi-objective MDPs and MDPs with resource constraints is studied in discounted reward settings [Gábor et al., 1998, Barrett and Narayanan, 2008, Van Moffaert and Nowé, 2014]. [Natarajan and Tadepalli, 2005, Lizotte et al., 2012] design algorithms for average reward settings. [Mannor and Shimkin, 2004, Mannor et al., 2009] consider optimizing the average rewards in asymptotic settings, and demonstrate convergence of their algorithms. We study MDPwGR in a non-asymptotic average reward setting. Recently, [Hazan et al., 2018] study exploration problems on MDPs in offline settings, which can be modeled as MDPs with global rewards. Another recent work by [Tarbouriech and Lazaric, 2019] considers active exploration in Markov decision processes, which involves maximizing a certain concave function on the frequency of visiting each state-action pair. [Tarbouriech and Lazaric, 2019] assume that the underlying transition kernel is known to the agent, and they also make certain mixing time assumptions that hold for every stationary policy. Different from [Tarbouriech and Lazaric, 2019], our model allows the underlying transition kernel to be unknown to the agent. Moreover, we only assume the underlying MDP to be communicating (see Assumption 2.1 in Section 2), which is less restrictive than the mixing time assumption of [Tarbouriech and Lazaric, 2019]. [Jaksch et al., 2010] also provide a discussion on the relationship between mixing time assumptions and the communicating assumption. Constrained MDPs are reviewed in [Altman, 1999], and multi-objective reinforcement learning is surveyed in [Roijers et al., 2013, Liu et al., 2015].

Organization of the Paper. In Section 2, we provide the problem definition of MDPwGR, and define the offline benchmark for the regret analysis of MDPwGR. In Section 3, we discuss the challenges in MDPwGR, and explain why existing works on BwGR and MDPwSR fail to solve MDPwGR to near-optimality. Then we introduce our algorithm Toc-UCRL2, which solves MDPwGR to near-optimality. In Section 4, we analyze Toc-UCRL2 in the case of the Frank-Wolfe oracle, assuming that the reward function is $\beta$-smooth. In Section 6, we discuss in detail the applications of the problem model of MDPwGR, and demonstrate the near-optimality of Toc-UCRL2 in all these applications. Finally, we conclude in Section 7. Supplementary details to the discussions and more proofs are provided in the Appendix.

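Notation. For $p \in [1, \infty]$, $\|\cdot\|_p$ is the $\ell_p$ norm on $\mathbb{R}^K$, defined as $\|u\|_p = (\sum_{k=1}^K |u_k|^p)^{1/p}$ for $p < \infty$ and $\|u\|_\infty = \max_{k} |u_k|$. The vectors $\mathbf{1}_K, \mathbf{0}_K \in \mathbb{R}^K$ are the all-one and all-zero vectors, and $e_k$ is the $k$th standard basis vector in $\mathbb{R}^K$. All vectors are column vectors by default. The inner product between $u, w \in \mathbb{R}^K$ is $u^\top w$. Denote $[K] = \{1, \ldots, K\}$. For a norm $\|\cdot\|$ on $\mathbb{R}^K$, denote its dual norm as $\|\cdot\|_*$, where $\|\theta\|_* = \max_{\|u\| \le 1} \theta^\top u$. For a finite set $\mathcal{S}$, denote $\Delta^{\mathcal{S}}$ as the set of probability distributions over $\mathcal{S}$. For an event $E$, $\mathbf{1}(E) = 1$ if $E$ holds, and $\mathbf{1}(E) = 0$ otherwise. Finally, “w.r.t.” stands for “with respect to”.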

## 2 Problem Definition of MDPwGR

An instance of MDPwGR is specified by the tuple . The set is a finite state space, and is the starting state. The collection contains a finite set of actions for each state . We say that is a state-action pair if . The quantity is the transition kernel. For each , we denote as the probability distribution on the subsequent state when the agent takes action at state .

For each state-action pair $(s, a)$, the $K$-dimensional random variable $V(s, a)$ represents the stochastic outcomes. The mean is denoted as $v(s, a)$. The reward function $g$ is concave, and is to be maximized. The function $g$ is $L$-Lipschitz continuous on $[0,1]^K$ w.r.t. a norm $\|\cdot\|$, i.e. $|g(u) - g(w)| \le L \|u - w\|$ for all $u, w \in [0,1]^K$. (We also assume $g$ to be closed, which ensures that $g$ can be expressed through its Fenchel dual.) The function $g$ need not be monotonic in any of the dimensions.

Dynamics. An agent, who faces an MDPwGR instance , starts at state . At time , three events happen. First, the agent observes her current state . Second, she takes an action . Third, she stochastically transits to another state , and observes a vector of stochastic outcomes . In the second event, the choice of is based on a non-anticipatory policy. That is, the choice only depends on the current state and the previous observations . When only depends on , but not on , we say that the corresponding non-anticipatory policy is stationary.

At each time step , the subsequent state and the outcomes are generated in a Markovian manner. Conditional on , we suppose four properties on . First, are independent of . Second, the subsequent state is distributed according to , or in short . Third, the outcome is identically distributed as . Fourth, can be arbitrarily correlated.

Objective. The MDPwGR instance is latent. While the agent knows , she does not know . To state the objective, define . For any horizon not known a priori, the agent aims to maximize , by selecting actions with a non-anticipatory policy. The agent faces a dilemma between exploration and exploitation. She needs to learn while optimizing in a Markovian environment.

MDPwGR models a variety of online learning problems in Markovian environments, such as multi-objective optimization (MOO), maximum entropy exploration (MaxEnt), and MDPwSR with knapsack constraints in the large volume regime (MDPwK). We elaborate on these applications in Section 6. Finally, if is a linear function, we recover MDPwSR [Jaksch et al., 2010]; if we specialize , we recover BwGR [Agrawal and Devanur, 2014].

Reachability of . To ensure learnability, we suppose in Assumption 2.1 that the instance is communicating. For any and any stationary policy , the travel time from to under is equal to the random variable .

###### Assumption 2.1.

The latent MDPwGR instance is communicating, that is, the quantity is finite. We call the diameter of .

The same reachability assumption is made in [Jaksch et al., 2010]. Since the instance is latent, the corresponding diameter is also not known to the agent. Assumption 2.1 is weaker than the unichain assumption, where every stationary policy induces a single recurrent class on .

Offline Benchmark and Regret. To measure the effectiveness of a policy, we rephrase the agent’s objective as the minimization of regret: . The offline benchmark is the optimum of the convex optimization problem , which serves as a fluid relaxation [Puterman, 1994, Altman, 1999] to the MDPwGR problem.

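$$
\begin{aligned}
(\mathrm{P}_{\mathcal{M}}):\quad \max_{x}\quad & g\Big(\sum_{s\in\mathcal{S},\,a\in\mathcal{A}_s} v(s,a)\,x(s,a)\Big) \\
\text{s.t.}\quad & \sum_{a\in\mathcal{A}_s} x(s,a) = \sum_{s'\in\mathcal{S},\,a'\in\mathcal{A}_{s'}} p(s|s',a')\,x(s',a') && \forall s\in\mathcal{S} && \text{(2.1a)} \\
& \sum_{s\in\mathcal{S},\,a\in\mathcal{A}_s} x(s,a) = 1 && && \text{(2.1b)} \\
& x(s,a) \ge 0 && \forall s\in\mathcal{S},\ a\in\mathcal{A}_s && \text{(2.1c)}
\end{aligned}
$$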

In , the variables form a probability distribution over the state-action pairs. The set of constraints (2.1a) requires the rates of transiting into and out of each state to be equal.
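The constraints (2.1a) to (2.1c) can be checked mechanically for a candidate solution. The following is a minimal sketch in plain Python, not the paper's method: the dictionary encoding of state-action pairs, the helper names `is_feasible` and `objective`, the two-state toy instance, and the choice of `min` as a concave reward are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

SA = Tuple[int, str]  # hypothetical encoding of a state-action pair (s, a)


def is_feasible(x: Dict[SA, float], p: Dict[SA, Dict[int, float]],
                states: List[int], tol: float = 1e-9) -> bool:
    """Check constraints (2.1a)-(2.1c) for a candidate occupancy vector x."""
    if any(val < -tol for val in x.values()):            # (2.1c) non-negativity
        return False
    if abs(sum(x.values()) - 1.0) > tol:                 # (2.1b) total mass is 1
        return False
    for s in states:                                     # (2.1a) flow balance
        outflow = sum(val for (s0, _), val in x.items() if s0 == s)
        inflow = sum(p[sa].get(s, 0.0) * val for sa, val in x.items())
        if abs(outflow - inflow) > tol:
            return False
    return True


def objective(x: Dict[SA, float], v: Dict[SA, Sequence[float]],
              g: Callable[[Sequence[float]], float]) -> float:
    """Evaluate g at the x-weighted average of the mean outcomes v(s, a)."""
    dim = len(next(iter(v.values())))
    avg = [sum(v[sa][k] * x[sa] for sa in x) for k in range(dim)]
    return g(avg)


# Toy instance (assumed): two states, each with a self-loop "stay" and a move "go".
states = [0, 1]
p = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}
v = {(0, "stay"): [1.0, 0.0], (0, "go"): [0.0, 0.0],
     (1, "stay"): [0.0, 1.0], (1, "go"): [0.0, 0.0]}
g = min  # a concave reward: the smaller of the two average outcomes

x_balanced = {(0, "stay"): 0.5, (0, "go"): 0.0, (1, "stay"): 0.5, (1, "go"): 0.0}
x_leaky = {(0, "stay"): 0.6, (0, "go"): 0.4, (1, "stay"): 0.0, (1, "go"): 0.0}
```

In this toy instance, `x_balanced` satisfies all three constraints while placing its mass on two disjoint self-loops, which is the kind of multi-recurrent-class solution that Section 3 discusses; `x_leaky` violates the flow balance (2.1a) at state 0.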

To achieve near-optimality, we aim to design a non-anticipatory policy with an anytime regret bound for some . That is, for all , there exist constants (which only depend on ), so that the policy satisfies for all with probability at least . Our offline benchmark is justified as follows:

###### Theorem 2.2.

Consider an MDPwGR instance that satisfies Assumption 2.1 with diameter . For any non-anticipatory policy, it holds that

$$
\mathbb{E}\big[g(\bar{V}_{1:T})\big] \le \mathrm{opt}(\mathrm{P}_{\mathcal{M}}) + 2L\|\mathbf{1}_K\| D / T.
$$

Theorem 2.2 is proved in Appendix A.2. Interestingly, the proof requires inspecting a dual formulation of , and it appears hard to analyze directly. We could have when is small (see Appendix A.1), thus an additive term in the upper bound is necessary.

## 3 Challenges of MDPwGR, and Algorithm Toc-UCRL2

Challenges. While MDPwGR is a common generalization of BwGR and MDPwSR, we identify unique challenges in MDPwGR for alternating among different actions, which is crucial for balancing the outcomes and achieving near-optimality.

We showcase these challenges in Fig. 1. An arc from state to state represents an action , with . Instance , which can be seen as a BwGR instance, consists of a single state and actions . Instances , are respective instances for MDPwGR, MDPwSR. These instances share the same . The center node is a communicating MDP. Each peripheral node is a distinct state, disjoint from . Each has a self-loop ; there is an arc from to , as well as an arc back. Thus, are communicating.

Let’s focus on , both with . For , we set for each . For , we set for each , and set for all other . Now, . In the case of , the agent achieves a anytime regret by choosing at time , where .

In , an optimal policy has recurrent classes , and each action should be chosen with frequency for optimality. To alternate from to , the agent has to travel from state to , which forces her to visit and compromises her objective. This presents a more difficult case than , where she can freely alternate among .

The agent has to explore and seek shortest paths among s to alternate among . Importantly, her frequency of alternations has to be finely controlled. To elaborate, define , which is the number of alternations among in the first time steps (An alternation from requires visiting once). The agent’s anytime regret depends on delicately:

###### Claim 3.1.

Let be arbitrary, and . There is an such that: If , then . If , then either , or for some , where is the diameter of .

Claim 3.1, proved in Appendix A.4, holds even when the agent knows . In the claim, the first “if” case is when the agent alternates too often, and compromises the objective by visiting the sub-optimal too many times. The second “if” case is essentially when the agent alternates too infrequently by staying at a loop for too long, leading to an imbalance in the outcomes. The frequency of alternation becomes an even subtler issue when the agent has to maintain simultaneous exploration and exploitation on .

The trade-off in Claim 3.1 is absent in MDPwSR, where the agent follows a single stationary policy and alternates within a recurrent class of optimal states. Since the rewards are scalar, the agent does not need to balance the outcomes, unlike in MDPwGR. For example, in instance , state-action pairs , have scalar reward 1, while other state-action pairs have scalar reward 0. The agent achieves a anytime regret by traveling from a starting state to , and then alternating solely between by actions indefinitely.

Altogether, the trade-off in Claim 3.1 occurs when a policy can have multiple recurrent classes, which is possible in communicating MDPs, but not in unichain MDPs. In fact, stationary policies are in general sub-optimal for MDPwGR:

###### Claim 3.2.

There exists under which any stationary policy incurs an anytime regret.

The Claim is proved in Appendix A.4. Claim 3.2 is in stark contrast to the optimality of stationary policies in the unichain case [Altman, 1999], or the scalar reward case [Jaksch et al., 2010], or the discounted case [Altman, 1999]. How should the agent manage her exploration-exploitation trade-off, in the face of the trade-off in alternating among actions (cf. Claim 3.1), while avoiding converging to a stationary policy?

Algorithm. We propose Algorithm Toc-UCRL2, displayed in Algorithm 1, for MDPwGR. The algorithm runs in episodes, and it overcomes the discussed challenges by a novel gradient threshold procedure. During episode , which starts at time , it runs a certain stationary policy , until the end of the episode at time . The start times and policies are decided adaptively, as discussed later. We maintain confidence regions , on the latent , across episodes, by first defining

$$
N_m(s,a) = \sum_{t=1}^{\tau(m)-1} \mathbf{1}(s_t = s, a_t = a), \qquad N^+_m(s,a) = \max\{1, N_m(s,a)\}. \tag{3.3}
$$

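Define . The estimates and confidence regions for $v$ are:

$$
\begin{aligned}
\hat{v}_m(s,a) &:= \frac{1}{N^+_m(s,a)} \sum_{t=1}^{\tau(m)-1} V_t(s_t,a_t)\,\mathbf{1}(s_t = s, a_t = a), \\
\mathrm{rad}^v_{m,k}(s,a) &:= \sqrt{\frac{2\,\hat{v}_{m,k}(s,a)\cdot(\text{log-}v)_m}{N^+_m(s,a)}} + \frac{3\cdot(\text{log-}v)_m}{N^+_m(s,a)}, \\
H^v_m(s,a) &:= \big\{\bar{v} \in [0,1]^K : |\bar{v}_k - \hat{v}_{m,k}(s,a)| \le \mathrm{rad}^v_{m,k}(s,a)\ \ \forall k \in [K]\big\}.
\end{aligned}
\tag{3.4}
$$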

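Define . The estimates and confidence regions for $p$ are:

$$
\begin{aligned}
\hat{p}_m(s'|s,a) &:= \frac{1}{N^+_m(s,a)} \sum_{t=1}^{\tau(m)-1} \mathbf{1}(s_t = s, a_t = a, s_{t+1} = s'), \\
\mathrm{rad}^p_m(s'|s,a) &:= \sqrt{\frac{2\,\hat{p}_m(s'|s,a)\cdot(\text{log-}p)_m}{N^+_m(s,a)}} + \frac{3\cdot(\text{log-}p)_m}{N^+_m(s,a)}, \\
H^p_m(s,a) &:= \big\{\bar{p} \in \Delta^{\mathcal{S}} : |\bar{p}(s') - \hat{p}_m(s'|s,a)| \le \mathrm{rad}^p_m(s'|s,a)\ \ \forall s' \in \mathcal{S}\big\}.
\end{aligned}
\tag{3.5}
$$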
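The estimates and radii in (3.4) and (3.5) share one shape: an empirical mean plus a Bernstein-style radius. The sketch below computes both for a single coordinate. The function names are hypothetical, and `log_term` stands in for the logarithmic confidence quantity, whose exact definition is not reproduced here and is therefore passed in as an assumed input.

```python
import math
from typing import Sequence, Tuple


def empirical_mean_and_radius(samples: Sequence[float],
                              log_term: float) -> Tuple[float, float]:
    """Empirical mean and confidence radius for one coordinate.

    `samples` are the observations collected for a fixed state-action pair
    before the current episode; `log_term` is an assumed stand-in for the
    logarithmic confidence quantity appearing in (3.4) and (3.5).
    """
    n_plus = max(1, len(samples))          # N^+ = max{1, N}, as in (3.3)
    mean = sum(samples) / n_plus           # equals 0 when nothing was observed
    radius = math.sqrt(2.0 * mean * log_term / n_plus) + 3.0 * log_term / n_plus
    return mean, radius


def confidence_interval(samples: Sequence[float],
                        log_term: float) -> Tuple[float, float]:
    """Interval of plausible means, clipped to [0, 1] as in the region definitions."""
    mean, radius = empirical_mean_and_radius(samples, log_term)
    return max(0.0, mean - radius), min(1.0, mean + radius)
```

With few samples the second, variance-free term of the radius dominates and the interval covers all of [0, 1]; it only becomes informative once the visit count is large relative to `log_term`.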

OCO Oracle. We balance the contributions from each of the outcomes by an Online Convex Optimization (OCO) oracle OCO. The applications of OCO tools with UCB algorithms are first studied in bandit settings by [Agrawal and Devanur, 2014], and are subsequently studied in different settings by [Agrawal et al., 2016, Busa-Fekete et al., 2017, Berthet and Perchet, 2017]. An OCO oracle is typically based on a gradient descent algorithm. At the end of time , the oracle OCO computes a sub-gradient that depends on . For each , the scalar reward reflects how well balances the outcomes. To illustrate, we provide the definition of the Frank-Wolfe oracle based on [Frank and Wolfe, 1956], which is defined for $\beta$-smooth reward functions (see Definition 3.3 below). The initial gradient is . To prepare for time , at the end of time the oracle outputs gradient

$$
\theta_{t+1} = -\nabla g(\bar{V}_{1:t}).
$$

For an even more concrete example, consider instance with . The oracle outputs . The resulting scalar reward for is , confirming the intuition that the agent should choose those s with , but not those s with .
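The Frank-Wolfe oracle can be sketched as a small stateful object that tracks the running average outcome and returns the negated gradient. This is an illustrative sketch only: the class name, the quadratic reward used in the example, and the helper `scalar_reward` (which scalarizes an outcome as the inner product with the negated oracle output, matching the term that appears in the regret analysis of Section 4) are assumptions, and the oracle's initial gradient is deliberately left unspecified.

```python
from typing import Callable, List, Sequence


class FrankWolfeOracle:
    """Tracks the running average outcome and outputs theta = -grad g(average)."""

    def __init__(self, grad_g: Callable[[Sequence[float]], List[float]], dim: int):
        self.grad_g = grad_g
        self.sum_outcomes = [0.0] * dim
        self.t = 0

    def update(self, outcome: Sequence[float]) -> List[float]:
        """Feed the outcome observed at the current time; return the next theta."""
        self.t += 1
        self.sum_outcomes = [a + b for a, b in zip(self.sum_outcomes, outcome)]
        average = [a / self.t for a in self.sum_outcomes]
        return [-d for d in self.grad_g(average)]


def scalar_reward(theta: Sequence[float], v: Sequence[float]) -> float:
    """Scalarized reward (-theta)^T v used to rank state-action pairs."""
    return -sum(th * vk for th, vk in zip(theta, v))


# Assumed example reward: g(w) = -0.5 * ||w - target||^2, which is concave and 1-smooth.
target = [0.5, 0.5]


def grad_g(w: Sequence[float]) -> List[float]:
    return [tk - wk for wk, tk in zip(w, target)]


oracle = FrankWolfeOracle(grad_g, dim=2)
theta = oracle.update([1.0, 0.0])   # the first outcome favors coordinate 0
```

After a single outcome concentrated on the first coordinate, the scalarized reward of an action producing `[0.0, 1.0]` is positive while that of an action producing `[1.0, 0.0]` is negative, so the oracle steers the agent toward the under-represented coordinate.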

EVI Oracle. Despite the uncertainty on , we aim for an optimal policy for , the MDPwSR with scalar rewards and transition kernel . Problem is easier than the original MDPwGR, since is optimized by a stationary policy. Denote the optimal average reward as , where is the feasible region of that is defined by .

To learn and while optimizing , we follow the optimistic approach in UCRL2 [Jaksch et al., 2010], and employ an Extended Value Iteration (EVI) oracle EVI in (3.2). An EVI oracle computes a near-optimal and stationary policy for , where , are optimistic estimates of , , i.e. . The oracle also outputs , an optimistic estimate of , as well as , a certain bias associated with each state. These outputs are useful for the analysis. Finally, is a certain prescribed error parameter for . We extract an EVI oracle from [Jaksch et al., 2010], displayed in Appendix B.1.
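To make the role of the EVI oracle more tangible, the sketch below runs plain (non-optimistic) value iteration for an average-reward MDP with known scalar rewards and transitions, stopping when the span of the value increments is small. It is a simplification under stated assumptions: the paper's EVI additionally optimizes over the confidence regions, which is omitted here, and the function name, data layout, and toy instance are hypothetical.

```python
from typing import Dict, List, Tuple

SA = Tuple[int, str]  # hypothetical encoding of a state-action pair


def value_iteration(states: List[int], actions: Dict[int, List[str]],
                    p: Dict[SA, Dict[int, float]], r: Dict[SA, float],
                    eps: float = 1e-6, max_iter: int = 10_000):
    """Return (greedy policy, gain estimate, value vector) for a known MDP."""
    u = {s: 0.0 for s in states}
    policy, diff = {}, [0.0]
    for _ in range(max_iter):
        # One-step lookahead value of every action at every state.
        q = {s: {a: r[(s, a)] + sum(pr * u[s2] for s2, pr in p[(s, a)].items())
                 for a in actions[s]} for s in states}
        u_new = {s: max(q[s].values()) for s in states}
        policy = {s: max(q[s], key=q[s].get) for s in states}
        diff = [u_new[s] - u[s] for s in states]
        u = u_new
        if max(diff) - min(diff) < eps:    # span-based stopping rule
            break
    gain = 0.5 * (max(diff) + min(diff))   # estimate of the average reward
    return policy, gain, u


# Toy instance (assumed): staying at state 1 pays 1.0, staying at state 0 pays 0.2.
states = [0, 1]
actions = {0: ["stay", "go"], 1: ["stay", "go"]}
p = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}
r = {(0, "stay"): 0.2, (0, "go"): 0.0, (1, "stay"): 1.0, (1, "go"): 0.0}
policy, gain, bias = value_iteration(states, actions, p, r)
```

In this toy instance the self-loops make the chain aperiodic, so the iteration settles quickly; the `max_iter` guard is there because plain value iteration need not converge on periodic instances.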

Gradient Threshold. While the OCO and EVI oracles are vital for solving MDPwGR, they are not yet sufficient. Let’s revisit instance and Claim 3.1. An OCO oracle could potentially recommend alternating among for times in time steps, leading to the first “if” case of a large . UCRL2 recommends alternating among for only times in time steps, leading to the second “if” case of a small .

We introduce a novel gradient threshold procedure (starting from Line 11) to overcome the discussed challenges. The procedure maintains a distance measure on the sub-gradients generated during each episode, and starts the next episode if the measure exceeds a threshold $Q$. A small $Q$ makes the agent alternate among different stationary policies frequently and balance the outcomes, while a large $Q$ facilitates learning and avoids visiting sub-optimal states. It is interesting to note that Toc-UCRL2 does not converge to a stationary policy, except when we force . A properly tuned $Q$ paves the way to near-optimality for the MDPwGR problem. In the context of , the threshold $Q$ can be tuned to optimize for the regret bound, and to ensure that the agent alternates among sufficiently often.

While the procedure overcomes the challenges, it dilutes the balancing effect of the underlying OCO oracle by delaying gradient updates, and interferes with the learning of . This makes the analysis of Toc-UCRL2 challenging. Despite these apparent obstacles, we still show that Toc-UCRL2 achieves an anytime regret that diminishes with .
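The episode-splitting logic of the gradient threshold procedure can be sketched as follows. The exact distance measure is not spelled out in the text above, so this sketch assumes one plausible choice: the cumulative dual-norm deviation of each new gradient from the gradient held at the start of the episode. The function name and the scalar example are likewise hypothetical, and the sketch omits any other episode-termination rule that the full algorithm may use.

```python
from typing import Callable, List, Sequence


def episode_starts(thetas: Sequence[Sequence[float]], threshold: float,
                   dual_norm: Callable[[Sequence[float]], float]) -> List[int]:
    """Indices at which a new episode begins under the assumed distance measure."""
    starts = [0]
    reference = thetas[0]     # gradient frozen at the start of the current episode
    measure = 0.0             # accumulated deviation within the current episode
    for t in range(1, len(thetas)):
        deviation = [a - b for a, b in zip(thetas[t], reference)]
        measure += dual_norm(deviation)
        if measure > threshold:
            # The frozen gradient has drifted too far: refresh it and start anew.
            starts.append(t)
            reference = thetas[t]
            measure = 0.0
    return starts


def abs_norm(u: Sequence[float]) -> float:
    """Dual norm for the one-dimensional example below."""
    return abs(u[0])


# Assumed example: a gradient sequence that drifts by one unit per time step.
drifting = [[0.0], [1.0], [2.0], [3.0], [4.0]]
```

With this drifting sequence, a threshold of 2.5 starts episodes at indices 0, 2 and 4; a threshold of 0 starts a new episode at every step, while a very large threshold keeps a single episode, which mirrors the trade-off between balancing the outcomes and limiting costly policy switches.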

Main Results. We first focus on $\beta$-smooth $g$, then consider general $g$ in Section 5.

###### Definition 3.3 (β-smooth).

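For $\beta > 0$, a concave function $f$ is $\beta$-smooth w.r.t. norm $\|\cdot\|$, if $f$ is differentiable on its domain, and it holds for all $u, w$ in the domain that

$$
\|\nabla f(u) - \nabla f(w)\|_* \le \beta \|u - w\|. \tag{3.6}
$$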

We provide regret bounds for Toc-UCRL2 under . Denote , , so is the number of state-action pairs. Denote , which is the maximum number of states that a state-action pair can transit to. We employ the $\tilde{O}(\cdot)$ notation, which hides additive terms that scale with as well as multiplicative factors. The full bounds for the Theorems and the analyses are provided in the Appendix.

###### Theorem 3.4.

Consider Toc-UCRL2 with OCO oracle FW and gradient threshold $Q$, applied on a communicating MDPwGR instance with diameter $D$. Suppose $g$ is $L$-Lipschitz continuous and $\beta$-smooth w.r.t. the norm $\|\cdot\|$. With probability , we have the anytime regret bound

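$$
\mathrm{Reg}(T) = \tilde{O}\Big(\sqrt{\beta}\,\big[\sqrt{Q} + LD/\sqrt{Q}\big]\,\|\mathbf{1}_K\|^{3/2} / \sqrt{T}\Big) + \tilde{O}\Big(L\|\mathbf{1}_K\|\, D\sqrt{\Gamma S A} / \sqrt{T}\Big). \tag{3.7}
$$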

In particular, setting gives .

In the first regret term, the summand with represents the regret due to the delay in gradient updates, and the summand with represents the regret due to the interference of the gradient threshold procedure with the learning of , as well as the regret in switching stationary policies, which could require visiting sub-optimal states. The second regret term is the regret due to the simultaneous exploration-exploitation using an EVI oracle. The factor scales with the magnitude of contribution from the outcomes at each time to the global reward. The same factor appears in related bandit settings [Agrawal and Devanur, 2014, Agrawal et al., 2016].

Applying Theorem 3.4 on an MDPwSR instance, we recover the regret bound by [Jaksch et al., 2010]. Indeed, we recover UCRL2 (up to the difference in ) when we specialize Toc-UCRL2 with OCO oracle to linear . Nevertheless, when we specialize Toc-UCRL2 with to BwGR problems with smooth , we do not recover the Frank-Wolfe based algorithm (Algorithm 4 in [Agrawal and Devanur, 2014]) for BwGR, due to the gradient threshold procedure. The resulting regret bound is also different from [Agrawal and Devanur, 2014], see their Theorem 5.2. Nevertheless, the procedure is crucial for MDPwGR. A direct combination of Frank-Wolfe Algorithm and UCRL2, which is equivalent to using OCO oracle and setting , is insufficient for solving MDPwGR, see Appendix B.2.

## 4 Analysis of Toc-UCRL2, with Focus on Oracle FW

In this Section, we provide an analytic framework for analyzing Toc-UCRL2 under general OCO oracles. In particular, we prove Theorem 3.4 to demonstrate our framework. To start, we consider events , which quantify the accuracy in estimating :

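$$
\mathcal{E}^v := \big\{ v(s,a) \in H^v_m(s,a) \text{ for all } m \in \mathbb{N},\ s \in \mathcal{S},\ a \in \mathcal{A}_s \big\}, \tag{4.1}
$$

$$
\mathcal{E}^p := \big\{ p(\cdot|s,a) \in H^p_m(s,a) \text{ for all } m \in \mathbb{N},\ s \in \mathcal{S},\ a \in \mathcal{A}_s \big\}. \tag{4.2}
$$
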
###### Lemma 4.1.

Consider an execution of Toc-UCRL2 with a general OCO oracle. It holds that

Lemma 4.1 is proved in Appendix B.4. We analyze by tracing the sequence of stochastic outcomes and quantifying their contributions to the global reward. The tracing bears similarity to the analysis of the Frank-Wolfe algorithm (see Bubeck [2015]). Let’s define the shorthand , where is an optimal solution of .

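$$
\begin{aligned}
g(\bar{V}_{1:t}) &\ge g(\bar{V}_{1:t-1}) + \nabla g(\bar{V}_{1:t-1})^\top \big[\bar{V}_{1:t} - \bar{V}_{1:t-1}\big] - \frac{\beta}{2}\big\|\bar{V}_{1:t} - \bar{V}_{1:t-1}\big\|^2 && \text{(4.3)} \\
&= g(\bar{V}_{1:t-1}) + \frac{1}{t}\nabla g(\bar{V}_{1:t-1})^\top \big[V_t(s_t,a_t) - \bar{V}_{1:t-1}\big] - \frac{\beta}{2t^2}\big\|V_t(s_t,a_t) - \bar{V}_{1:t-1}\big\|^2 \\
&\ge g(\bar{V}_{1:t-1}) + \frac{1}{t}\nabla g(\bar{V}_{1:t-1})^\top \big[v^* - \bar{V}_{1:t-1}\big] + \frac{1}{t}\nabla g(\bar{V}_{1:t-1})^\top \big[V_t(s_t,a_t) - v^*\big] - \frac{\beta\|\mathbf{1}_K\|^2}{2t^2} \\
&\ge g(\bar{V}_{1:t-1}) + \frac{1}{t}\big[\mathrm{opt}(\mathrm{P}_{\mathcal{M}}) - g(\bar{V}_{1:t-1})\big] + \frac{1}{t}(-\theta_t)^\top \big[V_t(s_t,a_t) - v^*\big] - \frac{\beta\|\mathbf{1}_K\|^2}{2t^2}. && \text{(4.4)}
\end{aligned}
$$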

Step (4.3) uses the $\beta$-smoothness of $g$. Rearranging (4.4) gives

$$
t\cdot\mathrm{Reg}(t) \le (t-1)\cdot\mathrm{Reg}(t-1) + \frac{\beta\|\mathbf{1}_K\|^2}{2t} + (-\theta_t)^\top\big[v^* - V_t(s_t,a_t)\big]. \tag{4.5}
$$

Applying inequality (4.5) recursively for $t = T, \ldots, 1$, we obtain the following regret bound:

$$
\mathrm{Reg}(T) \le \frac{\beta\|\mathbf{1}_K\|^2 \log T}{T} + \frac{1}{T}\sum_{t=1}^{T}(-\theta_t)^\top\big[v^* - V_t(s_t,a_t)\big]. \tag{4.6}
$$
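For completeness, the recursion from (4.5) to (4.6) can be unrolled as the following worked step. The restriction $T \ge 3$ is an assumption made here so that the harmonic sum can be absorbed into the stated constant; it is not taken from the source.

$$
\begin{aligned}
T\cdot\mathrm{Reg}(T) &\le 0\cdot\mathrm{Reg}(0) + \sum_{t=1}^{T}\frac{\beta\|\mathbf{1}_K\|^2}{2t} + \sum_{t=1}^{T}(-\theta_t)^\top\big[v^* - V_t(s_t,a_t)\big] \\
&\le \frac{\beta\|\mathbf{1}_K\|^2}{2}\,(1 + \log T) + \sum_{t=1}^{T}(-\theta_t)^\top\big[v^* - V_t(s_t,a_t)\big] \\
&\le \beta\|\mathbf{1}_K\|^2 \log T + \sum_{t=1}^{T}(-\theta_t)^\top\big[v^* - V_t(s_t,a_t)\big] \qquad (T \ge 3).
\end{aligned}
$$

The first line telescopes the left-hand sides of (4.5), the second uses $\sum_{t=1}^{T} 1/t \le 1 + \log T$, and the third uses $1 \le \log T$ for $T \ge 3$. Dividing by $T$ gives the form of (4.6).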

To proceed, we provide the following novel analysis that allows us to compare the online output and the offline benchmark, and helps us analyze the effect of the gradient threshold procedure. For a time step , we denote random variable as the index of the episode that contains . When the underlying OCO oracle is specified, we decorate with the corresponding superscript, for example is the above-mentioned episode index under . We provide the following Proposition that helps us analyze the second term in (4.6):

###### Proposition 4.2.

Consider an execution of Toc-UCRL2 with a general OCO oracle, over a communicating MDPwGR instance M with diameter . For each , suppose that there is a deterministic constant s.t. . Conditioned on events , with probability at least we have

 T∑t=1(−θt)⊤[v∗−