# Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|^2√(|A|T)) regret with high probability, where L is the horizon, |X| is the number of states, |A| is the number of actions, and T is the number of episodes. To the best of our knowledge, our algorithm is the first one to ensure sub-linear regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an upper occupancy bound.

## 1 Introduction

Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time. The environment dynamics are usually modeled as a Markov Decision Process (MDP) with a fixed and unknown transition function. We consider a general setting where the interaction proceeds in episodes with a fixed horizon, and within each episode, the learner sequentially observes her current state, selects an action, suffers and observes the loss corresponding to this state-action pair, and transits to the next state according to the underlying transition function. The goal of the learner is to minimize her regret, which is the difference between her total loss and the total loss of the optimal fixed policy. (Similarly to previous work such as Rosenberg and Mansour (2019a), throughout the paper we use the term "losses" instead of "rewards" to be consistent with the adversarial online learning literature. One can translate between losses and rewards by simply taking negation.)

The majority of the literature in learning MDPs also assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution. To better capture applications with non-stationary or even adversarial losses, the works of (Even-Dar et al., 2009; Yu et al., 2009) are among the first to study the problem of learning adversarial MDPs where the losses can change arbitrarily between episodes. There are several follow-ups in this direction, such as (Yu et al., 2009; Neu et al., 2010, 2012; Zimin and Neu, 2013; Dekel and Hazan, 2013; Rosenberg and Mansour, 2019a). See Section 1.1 for more related works.

For an MDP with $|X|$ states, $|A|$ actions, $T$ episodes, and $L$ steps in each episode, the best existing result is by the recent work of (Rosenberg and Mansour, 2019a), which achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret, assuming a fixed and unknown transition function, adversarial losses, and importantly full-information feedback, that is, the loss for every state-action pair is revealed at the end of each episode. On the other hand, with the more natural and standard bandit feedback (that is, only the loss for each visited state-action pair is revealed), a later work by the same authors (Rosenberg and Mansour, 2019b) achieves regret $\tilde{O}(L^{3/2}|X||A|^{1/4}T^{3/4})$, which has a much worse dependence on the number of episodes compared to the full-information setting.

Our main contribution is to significantly improve on the latest result of (Rosenberg and Mansour, 2019b). Specifically, we propose an efficient algorithm which achieves $\tilde{O}(L|X|^2\sqrt{|A|T})$ regret in the same setting with bandit feedback, unknown transition function, and adversarial losses. Although our regret bound still exhibits a gap compared to the best existing lower bound $\Omega(L\sqrt{|X||A|T})$ (Jin et al., 2018), to the best of our knowledge, our result is the first one to achieve $\tilde{O}(\sqrt{T})$ regret for this challenging setting.

Our algorithm builds on top of the key ideas of the UC-O-REPS algorithm (Rosenberg and Mansour, 2019a, b). Specifically, we also construct the same confidence sets to handle the unknown transition function, and apply Online Mirror Descent over the space of occupancy measures (see Section 2.1) to handle adversarial losses. The key difference and challenge is that in the bandit feedback setting, to apply Online Mirror Descent one needs to construct good loss estimators since the loss function is not completely revealed. However, the most natural approach of building unbiased loss estimators via inverse probability requires the knowledge of the transition function, and is thus not feasible in our setting.

We resolve this key difficulty by proposing a novel biased and optimistic loss estimator. The idea is straightforward: instead of inversely weighting the observation by the probability of visiting the corresponding state-action pair (which is unknown), we use the maximum probability among all plausible transition functions specified by the confidence set, which we show can be computed efficiently via backward dynamic programming and solving some linear programs greedily. We call these maximum probabilities upper occupancy bounds. This idea resembles the optimistic principle of using upper confidence bounds for many other problems of learning with bandit feedback, such as stochastic multi-armed bandits (Auer et al., 2002a), stochastic linear bandits (Chu et al., 2011; Abbasi-Yadkori et al., 2011), and reinforcement learning with stochastic losses (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018). However, applying optimism in constructing loss estimators for an adversarial setting is new as far as we know.

### 1.1 Related Works

#### Stochastic losses.

Learning MDPs with stochastic losses and bandit feedback is relatively well-studied for the tabular case (that is, finite number of states and actions). For example, in the episodic setting, using our notation, the UCRL2 algorithm of (Jaksch et al., 2010) achieves $\tilde{O}(\sqrt{L^3|X|^2|A|T})$ regret, and the UCBVI algorithm of Azar et al. (2017) achieves the optimal bound $\tilde{O}(L\sqrt{|X||A|T})$, both of which are model-based algorithms and construct confidence sets for both the transition function and the loss function. The recent work by Jin et al. (2018) achieves a suboptimal bound $\tilde{O}(\sqrt{L^3|X||A|T})$ via an optimistic Q-learning algorithm that is model-free. (We warn the reader that in some of these cited papers, the notation $|X|$ or $T$ might be defined differently, often $L$ times smaller for $|X|$ and $L$ times larger for $T$. We have translated the bounds based on Table 1 of (Jin et al., 2018) using our notation defined in Section 2.) Besides the episodic setting, other setups such as discounted losses or the infinite-horizon average-loss setting have also been heavily studied; see for example (Ouyang et al., 2017; Fruit et al., 2018; Zhang and Ji, 2019; Wei et al., 2019; Dong et al., 2019) for some recent works.

#### Adversarial losses.

For adversarial losses, based on whether the transition function is known and whether the feedback is full-information or bandit, we discuss four categories separately.

Known transition and full-information feedback. Early works on adversarial MDPs assume a known transition function and full-information feedback. For example, the work of (Even-Dar et al., 2009) proposes the algorithm MDP-E and proves a regret bound of $\tilde{O}(\tau^2\sqrt{T\ln|A|})$ where $\tau$ is the mixing time of the MDP, and another work by (Yu et al., 2009) achieves $\tilde{O}(T^{2/3})$ regret, both of which consider a continuous setting (as opposed to the episodic setting considered in this work). The later work of (Zimin and Neu, 2013) considers the episodic setting and proposes the O-REPS algorithm which applies Online Mirror Descent over the occupancy measure space, a key component adopted by (Rosenberg and Mansour, 2019a) and our work. O-REPS achieves the optimal regret $\tilde{O}(L\sqrt{T\ln(|X||A|)})$ in this setting.

Known transition and bandit feedback. Several other works consider the harder bandit feedback model while still assuming a known transition function. The work of (Neu et al., 2010) achieves regret $\tilde{O}(L^2\sqrt{T|A|}/\alpha)$, assuming that all states are reachable with some probability $\alpha$ under all policies. Later, Neu et al. (2014) eliminates the dependence on $\alpha$ but only achieves $\tilde{O}(T^{2/3})$ regret. The O-REPS algorithm of (Zimin and Neu, 2013) again achieves the optimal regret $\tilde{O}(\sqrt{L|X||A|T})$. Another line of works (Arora et al., 2012; Dekel and Hazan, 2013) assumes deterministic transition for a continuous setting without some unichain structure, which is known to be harder and admits $\Theta(T^{2/3})$ regret (Dekel et al., 2014).

Unknown transition and full-information feedback. To deal with unknown transition, Neu et al. (2012) proposes the Follow the Perturbed Optimistic Policy algorithm and achieves $\tilde{O}(L|X||A|\sqrt{T})$ regret. Combining the idea of confidence sets and Online Mirror Descent, the UC-O-REPS algorithm of (Rosenberg and Mansour, 2019a) improves the regret to $\tilde{O}(L|X|\sqrt{|A|T})$. We note that this work also studies general convex performance criteria, which we do not consider here.

Unknown transition and bandit feedback. This is the setting considered in our work. The only previous work is (Rosenberg and Mansour, 2019b), which achieves a regret bound of $\tilde{O}(T^{3/4})$ as mentioned earlier, or $\tilde{O}(\sqrt{T}/\alpha)$ under the rather strong assumption that under any policy all states are reachable with some probability $\alpha$; the latter bound could be arbitrarily large in general since $\alpha$ can be arbitrarily small. Our algorithm achieves $\tilde{O}(\sqrt{T})$ regret without this assumption by using a different and optimistic loss estimator. We also note that the lower bound of $\Omega(L\sqrt{|X||A|T})$ (Jin et al., 2018) still applies.

There are also a few works that consider both time-varying transition functions and time-varying losses (Yu and Mannor, 2009; Cheung et al., 2019; Lykouris et al., 2019). The most recent one by Lykouris et al. (2019) considers a stochastic problem with $C$ episodes arbitrarily corrupted and obtains $\tilde{O}(C\sqrt{T}+C^2)$ regret (ignoring dependence on other parameters). Note that this bound is of order $\tilde{O}(\sqrt{T})$ only when $C$ is a constant, and is vacuous whenever $C=\Omega(\sqrt{T})$. On the other hand, our bound is always $\tilde{O}(\sqrt{T})$ no matter how much corruption there is for the losses, but our algorithm cannot deal with changing transition functions.

## 2 Problem Formulation

An adversarial Markov decision process is defined by a tuple $(X, A, P, \{\ell_t\}_{t=1}^T)$, where $X$ is the finite state space, $A$ is the finite action space, $P: X\times A\times X\to[0,1]$ is the transition function, with $P(x'|x,a)$ being the probability of transferring to state $x'$ when executing action $a$ in state $x$, and $\ell_t: X\times A\to[0,1]$ is the loss function for episode $t$.

In this work, we consider an episodic setting with finite horizon and assume that the MDP has a layered structure, satisfying the following conditions:

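• The state space $X$ consists of $L+1$ layers $X_0,\ldots,X_L$ such that $X=\bigcup_{k=0}^{L}X_k$ and $X_i\cap X_j=\emptyset$ for $i\neq j$.

• $X_0$ and $X_L$ are singletons, that is, $X_0=\{x_0\}$ and $X_L=\{x_L\}$.

• Transitions are possible only between consecutive layers. In other words, if $P(x'|x,a)>0$, then $x'\in X_{k+1}$ and $x\in X_k$ for some $k$.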

These assumptions were made in previous work (Neu et al., 2012; Zimin and Neu, 2013; Rosenberg and Mansour, 2019a) as well. They are not necessary but greatly simplify notation and analysis. Such a setup is sometimes referred to as the loop-free stochastic shortest path problem in the literature. It is clear that this is a strict generalization of the episodic setting studied in (Azar et al., 2017; Jin et al., 2018) for example, where the number of states is the same for each layer (except for the first and the last one). (In addition, some of these works, such as (Azar et al., 2017), also assume that the states have the same name for different layers, and the transition between the layers remains the same. Our setup does not make this assumption and is closer to that of (Jin et al., 2018). We also refer the reader to footnote 2 of (Jin et al., 2018) for how to translate regret bounds between settings with and without this extra assumption.)

The interaction between the learner and the environment is presented in Protocol 1. Ahead of time, the environment decides an MDP, and only the state space $X$ with its layer structure and the action space $A$ are known to the learner. In particular, the loss functions can be chosen adversarially with the knowledge of the learner's algorithm. The interaction proceeds in $T$ episodes. In episode $t$, the learner starts from state $x_0$ and decides a stochastic policy $\pi_t: X\times A\to[0,1]$, where $\pi_t(a|x)$ is the probability of taking action $a$ at a given state $x$, so that $\sum_{a\in A}\pi_t(a|x)=1$ for every state $x$. Then the learner executes this policy in the MDP, generating $L$ state-action pairs $(x_0,a_0),\ldots,(x_{L-1},a_{L-1})$. (Formally, the notation $(x_0,a_0),\ldots,(x_{L-1},a_{L-1})$ should have a $t$ dependence. Throughout the paper we omit this dependence for conciseness as it is clear from the context.) Specifically, for each $k=0,\ldots,L-1$, action $a_k$ is drawn from $\pi_t(\cdot|x_k)$ and the next state $x_{k+1}$ is drawn from $P(\cdot|x_k,a_k)$.

Importantly, instead of observing the loss function $\ell_t$ at the end of episode $t$ (Rosenberg and Mansour, 2019a), in our setting the learner only observes the loss for each visited state-action pair: $\ell_t(x_0,a_0),\ldots,\ell_t(x_{L-1},a_{L-1})$. That is, we consider the more challenging setting with bandit feedback.

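For any given policy $\pi$, we denote its expected loss in episode $t$ by

$$\mathbb{E}\left[\sum_{k=0}^{L-1}\ell_t(x_k,a_k)\,\middle|\,P,\pi\right],$$

where the notation $\mathbb{E}[\cdot\mid P,\pi]$ emphasizes that the state-action pairs $(x_0,a_0),\ldots,(x_{L-1},a_{L-1})$ are random variables generated according to the transition function $P$ and a stochastic policy $\pi$. The total loss over $T$ episodes for any fixed policy $\pi$ is thus

$$L_T(\pi)=\sum_{t=1}^{T}\mathbb{E}\left[\sum_{k=0}^{L-1}\ell_t(x_k,a_k)\,\middle|\,P,\pi\right],$$

while the total loss of the learner is

$$L_T=\sum_{t=1}^{T}\mathbb{E}\left[\sum_{k=0}^{L-1}\ell_t(x_k,a_k)\,\middle|\,P,\pi_t\right].$$

The goal of the learner is to minimize the regret, defined as

$$R_T=L_T-\min_{\pi}L_T(\pi),$$

where $\pi$ ranges over all stochastic policies.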

#### Notation.

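We use $k(x)$ to denote the index of the layer to which state $x$ belongs, and $\mathbb{I}\{\cdot\}$ to denote the indicator function whose value is $1$ if the input holds true and $0$ otherwise. Let $o_t=\{(x_k,a_k,\ell_t(x_k,a_k))\}_{k=0}^{L-1}$ be the observation of the learner in episode $t$, and $\mathcal{F}_t$ be the $\sigma$-algebra generated by $(o_1,\ldots,o_{t-1})$. Also let $\mathbb{E}_t[\cdot]$ be a shorthand of $\mathbb{E}[\cdot\mid\mathcal{F}_t]$.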

### 2.1 Occupancy Measures

Solving the problem with techniques from online learning requires introducing the concept of occupancy measures. Specifically, the occupancy measure $q^{P,\pi}: X\times A\times X\to[0,1]$ associated with a stochastic policy $\pi$ and a transition function $P$ is defined as follows:

$$q^{P,\pi}(x,a,x')=\Pr\left[x_k=x,\ a_k=a,\ x_{k+1}=x'\,\middle|\,P,\pi\right],$$

where $k=k(x)$ is the index of the layer to which $x$ belongs. In other words, $q^{P,\pi}(x,a,x')$ is the marginal probability of encountering the triple $(x,a,x')$ when executing policy $\pi$ in an MDP with transition function $P$.

Clearly, an occupancy measure $q$ satisfies the following two properties. First, due to the loop-free structure, each layer is visited exactly once and thus for every $k=0,\ldots,L-1$,

$$\sum_{x\in X_k}\sum_{a\in A}\sum_{x'\in X_{k+1}}q(x,a,x')=1. \tag{1}$$

Second, the probability of entering a state when coming from the previous layer is exactly the probability of leaving from that state to the next layer (except for $x_0$ and $x_L$). Therefore, for every $k=1,\ldots,L-1$ and every state $x\in X_k$, we have

$$\sum_{x'\in X_{k-1}}\sum_{a\in A}q(x',a,x)=\sum_{x'\in X_{k+1}}\sum_{a\in A}q(x,a,x'). \tag{2}$$

It turns out that these two properties are also sufficient for any function $q: X\times A\times X\to[0,1]$ to be an occupancy measure associated with some transition function and some policy.

###### Lemma 1 (Rosenberg and Mansour (2019a)).

If a function $q: X\times A\times X\to[0,1]$ satisfies conditions (1) and (2), then it is a valid occupancy measure associated with the following induced transition function $P^q$ and induced policy $\pi^q$:

$$P^q(x'|x,a)=\frac{q(x,a,x')}{\sum_{y\in X_{k(x)+1}}q(x,a,y)},\qquad \pi^q(a|x)=\frac{\sum_{x'\in X_{k(x)+1}}q(x,a,x')}{\sum_{b\in A}\sum_{x'\in X_{k(x)+1}}q(x,b,x')}.$$

We denote by $\Delta$ the set of valid occupancy measures, that is, the subset of $[0,1]^{X\times A\times X}$ satisfying conditions (1) and (2). For a fixed transition function $P$, we denote by $\Delta(P)$ the set of occupancy measures whose induced transition function $P^q$ is exactly $P$. Similarly, we denote by $\Delta(\mathcal{P})$ the set of occupancy measures whose induced transition function $P^q$ belongs to a set of transition functions $\mathcal{P}$.
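As an illustration of the definitions above, the following minimal sketch computes an occupancy measure by a forward pass over a tiny layered MDP and then recovers the induced policy. The toy MDP, the dictionary-based data layout, and the function names `occupancy_measure` and `induced_policy` are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def occupancy_measure(layers, P, pi):
    """Forward pass: q[(x, a, x')] = Pr[x_k = x, a_k = a, x_{k+1} = x'].

    layers: list of lists of state ids (first and last layers are singletons)
    P[(x, a)]: dict mapping next state -> transition probability
    pi[x]: array of action probabilities at state x
    """
    reach = {layers[0][0]: 1.0}  # probability of visiting each state
    q = {}
    for k in range(len(layers) - 1):
        for x in layers[k]:
            for a, p_a in enumerate(pi[x]):
                for x_next in layers[k + 1]:
                    val = reach.get(x, 0.0) * p_a * P[(x, a)].get(x_next, 0.0)
                    q[(x, a, x_next)] = val
                    reach[x_next] = reach.get(x_next, 0.0) + val
    return q

def induced_policy(q, layers, n_actions):
    """Induced policy: normalize q(x, a) = sum_{x'} q(x, a, x') over actions."""
    pi = {}
    for k in range(len(layers) - 1):
        for x in layers[k]:
            w = np.array([sum(q[(x, a, y)] for y in layers[k + 1])
                          for a in range(n_actions)])
            # Unvisited states get a uniform policy (an arbitrary choice here).
            pi[x] = w / w.sum() if w.sum() > 0 else np.full(n_actions, 1.0 / n_actions)
    return pi

# Hypothetical toy MDP: layers {0} -> {1, 2} -> {3}, two actions.
layers = [[0], [1, 2], [3]]
P = {(0, 0): {1: 0.7, 2: 0.3}, (0, 1): {1: 0.2, 2: 0.8},
     (1, 0): {3: 1.0}, (1, 1): {3: 1.0}, (2, 0): {3: 1.0}, (2, 1): {3: 1.0}}
pi = {0: np.array([0.5, 0.5]), 1: np.array([0.9, 0.1]), 2: np.array([0.4, 0.6])}

q = occupancy_measure(layers, P, pi)
pi_rec = induced_policy(q, layers, n_actions=2)
```

On this toy instance each layer's entries of `q` sum to one (condition (1)), the flow into state 1 equals the flow out of it (condition (2)), and `pi_rec` matches `pi` at every visited state, in line with Lemma 1.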

With the concept of occupancy measure, we can reduce the problem of learning a policy to the problem of learning an occupancy measure and apply online linear optimization techniques. Specifically, with slight abuse of notation, for an occupancy measure $q$ we define

$$q(x,a)=\sum_{x'\in X_{k(x)+1}}q(x,a,x')$$

for all $x\neq x_L$ and $a\in A$, which is the probability of visiting state-action pair $(x,a)$. Then the expected loss of following a policy $\pi$ for episode $t$ can be rewritten as

$$\mathbb{E}\left[\sum_{k=0}^{L-1}\ell_t(x_k,a_k)\,\middle|\,P,\pi\right]=\sum_{k=0}^{L-1}\sum_{x\in X_k}\sum_{a\in A}q^{P,\pi}(x,a)\,\ell_t(x,a)=\sum_{x\in X\setminus\{x_L\},\,a\in A}q^{P,\pi}(x,a)\,\ell_t(x,a)\triangleq\langle q^{P,\pi},\ell_t\rangle,$$

and accordingly the regret of the learner can also be rewritten as

$$R_T=L_T-\min_{\pi}L_T(\pi)=\sum_{t=1}^{T}\langle q^{P,\pi_t}-q^*,\ell_t\rangle, \tag{3}$$

where $q^*$ is the optimal occupancy measure in $\Delta(P)$.

On the other hand, assume for a moment that the set $\Delta(P)$ were known and the loss function $\ell_t$ was revealed at the end of episode $t$. Consider an online linear optimization problem (see (Hazan and others, 2016) for example) with decision set $\Delta(P)$ and linear loss parameterized by $\ell_t$ at time $t$. In other words, at each time $t$, the learner proposes $q_t\in\Delta(P)$ and suffers loss $\langle q_t,\ell_t\rangle$. The regret of this problem is

$$\sum_{t=1}^{T}\langle q_t-q^*,\ell_t\rangle. \tag{4}$$

Therefore, if in the original problem we set $\pi_t=\pi^{q_t}$, then the two regret measures Eq. (3) and Eq. (4) are exactly the same by Lemma 1 and we have thus reduced the problem to an instance of online linear optimization.

It remains to address the issues that $\Delta(P)$ is unknown and that we have only partial information on $\ell_t$. The first issue can be addressed by constructing a confidence set $\mathcal{P}$ based on observations and replacing $\Delta(P)$ with $\Delta(\mathcal{P})$, an idea introduced in (Rosenberg and Mansour, 2019a, b) already. Our main contribution is to propose a new loss estimator to address the second issue. Note that importantly, the above reduction does not reduce the problem to an instance of the well-studied bandit linear optimization (Abernethy et al., 2008) where the quantity $\langle q_t,\ell_t\rangle$ (or a sample with this mean) is observed. Indeed, roughly speaking, what we observe in our setting are samples with mean $\langle q^{P,\pi^{q_t}},\ell_t\rangle$. These two are different when we do not know $P$ and have to operate over the set $\Delta(\mathcal{P})$.

## 3 Algorithm

The complete pseudocode of our algorithm, UOB-REPS, is presented in Algorithm 2. The three key components of our algorithm are: 1) maintaining a confidence set of the transition function, 2) constructing loss estimators, and 3) using Online Mirror Descent to update the occupancy measure. The first and the third components are the same as in the UC-O-REPS algorithm (Rosenberg and Mansour, 2019a, b), which we briefly describe below. The second component on constructing loss estimators is novel and explained in detail in Section 3.1.

#### Confidence sets.

The idea of maintaining a confidence set of the transition function dates back to (Jaksch et al., 2010). Specifically, the algorithm maintains counters $N(x,a)$ and $M(x'|x,a)$ to record the number of visits of each state-action pair $(x,a)$ and each state-action-state triple $(x,a,x')$. A doubling epoch schedule is deployed, so that a new epoch starts whenever there exists a state-action pair whose counter is doubled compared to its initial value at the beginning of the epoch. For epoch $i\geq 1$, let $N_i(x,a)$ and $M_i(x'|x,a)$ be the initial values of the counters, that is, the total number of visits of pair $(x,a)$ and triple $(x,a,x')$ before epoch $i$. Then the empirical transition function for this epoch is defined as

$$\bar{P}_i(x'|x,a)=\frac{M_i(x'|x,a)}{\max\{1,N_i(x,a)\}},$$

and the confidence set is defined as

$$\mathcal{P}_i=\left\{\hat{P}:\left\|\hat{P}(\cdot|x,a)-\bar{P}_i(\cdot|x,a)\right\|_1\leq\epsilon_i(x,a),\ \forall(x,a)\right\}, \tag{5}$$

where the confidence width $\epsilon_i(x,a)$ is defined as

$$\epsilon_i(x,a)=\sqrt{\frac{2\left|X_{k(x)+1}\right|\ln\left(\frac{T|X||A|}{\delta}\right)}{\max\{1,N_i(x,a)\}}},$$

for some confidence parameter $\delta\in(0,1)$. For the first epoch ($i=1$), $\mathcal{P}_1$ is simply the set of all transition functions so that $\Delta(\mathcal{P}_1)=\Delta$. (To represent $\mathcal{P}_1$ in the form of Eq. (5), one can simply let $\bar{P}_1(\cdot|x,a)$ be any distribution and $\epsilon_1(x,a)=2$.) By standard concentration arguments, one can show the following:

###### Lemma 2 (Rosenberg and Mansour (2019a)).

With probability at least $1-4\delta$, we have $P\in\mathcal{P}_i$ for all $i$.
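The quantities that define the confidence set for a single state-action pair can be sketched as follows. This is only an illustration of the empirical transition function, the confidence width, and the doubling condition described above; the function names and the example counts are hypothetical.

```python
import numpy as np

def empirical_transition(M, N):
    """Empirical transition for one (x, a): M holds counts of each next state, N the visits."""
    return np.asarray(M, dtype=float) / max(1, N)

def confidence_width(n_next, N, T, n_states, n_actions, delta):
    """Width eps(x, a) = sqrt(2 |X_{k(x)+1}| ln(T |X| |A| / delta) / max(1, N))."""
    return np.sqrt(2.0 * n_next * np.log(T * n_states * n_actions / delta) / max(1, N))

def in_confidence_set(P_hat, P_bar, eps):
    """Check the L1 constraint of the confidence set for one (x, a)."""
    return bool(np.abs(np.asarray(P_hat) - np.asarray(P_bar)).sum() <= eps)

def epoch_should_end(N_now, N_epoch_start):
    """Doubling schedule: some counter has doubled since the epoch began."""
    return any(now >= max(1, 2 * start) for now, start in zip(N_now, N_epoch_start))

# Hypothetical counts: pair (x, a) visited 40 times, landing in two next states.
P_bar = empirical_transition([30, 10], 40)
eps_40 = confidence_width(n_next=2, N=40, T=1000, n_states=4, n_actions=2, delta=0.1)
eps_4000 = confidence_width(n_next=2, N=4000, T=1000, n_states=4, n_actions=2, delta=0.1)
```

The width shrinks at rate $1/\sqrt{N_i(x,a)}$, so frequently visited pairs constrain the plausible transition functions much more tightly than rarely visited ones.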

#### Online Mirror Descent (OMD).

As discussed in Section 2.1, our problem is closely related to an online linear optimization problem over some occupancy measure space. In particular, our algorithm maintains an occupancy measure $\hat{q}_t$ for episode $t$ and executes the induced policy $\pi_t=\pi^{\hat{q}_t}$. We apply Online Mirror Descent, a standard algorithmic framework to tackle online learning problems, to update the occupancy measure as

$$\hat{q}_{t+1}=\operatorname*{argmin}_{q\in\Delta(\mathcal{P}_i)}\ \eta\langle q,\hat{\ell}_t\rangle+D(q\,\|\,\hat{q}_t),$$

where $i$ is the index of the epoch to which episode $t+1$ belongs, $\eta>0$ is some learning rate, $\hat{\ell}_t$ is some loss estimator for $\ell_t$, and $D(\cdot\|\cdot)$ is a Bregman divergence. Following (Rosenberg and Mansour, 2019a, b), we use the unnormalized KL-divergence as the Bregman divergence:

$$D(q\,\|\,q')=\sum_{x,a,x'}q(x,a,x')\ln\frac{q(x,a,x')}{q'(x,a,x')}-\sum_{x,a,x'}\left(q(x,a,x')-q'(x,a,x')\right). \tag{6}$$

Note that as pointed out earlier, ideally one would use $\Delta(P)$ as the constraint set in the OMD update, but since $P$ is unknown, using $\Delta(\mathcal{P}_i)$ in place of it is a natural idea. Also note that the update can be implemented efficiently as shown by Rosenberg and Mansour (2019a).
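To make the role of the unnormalized KL-divergence concrete, the sketch below implements only the unconstrained part of the update: without the constraint $q\in\Delta(\mathcal{P}_i)$, the minimizer of $\eta\langle q,\hat{\ell}_t\rangle+D(q\|\hat{q}_t)$ is the multiplicative update $\hat{q}_t\exp(-\eta\hat{\ell}_t)$. The projection back onto $\Delta(\mathcal{P}_i)$, which the actual algorithm needs, is deliberately omitted, and the flat-array layout and function names are assumptions of this example.

```python
import numpy as np

def unnormalized_kl(q, q_prime):
    """Unnormalized KL-divergence between strictly positive arrays."""
    return float(np.sum(q * np.log(q / q_prime)) - np.sum(q - q_prime))

def omd_objective(q, q_hat, loss_hat, eta):
    """Objective eta * <q, loss_hat> + D(q || q_hat)."""
    return eta * float(q @ loss_hat) + unnormalized_kl(q, q_hat)

def unconstrained_omd_step(q_hat, loss_hat, eta):
    """Minimizer of the objective when the constraint set is ignored."""
    return q_hat * np.exp(-eta * loss_hat)

# Hypothetical flattened occupancy measure over (x, a, x') triples; the loss
# estimate for a pair (x, a) is repeated for every next state x'.
rng = np.random.default_rng(0)
q_hat = rng.uniform(0.05, 0.5, size=8)
loss_hat = rng.uniform(0.0, 3.0, size=8)
eta = 0.1

q_tilde = unconstrained_omd_step(q_hat, loss_hat, eta)
best = omd_objective(q_tilde, q_hat, loss_hat, eta)
perturbed = [omd_objective(q_tilde * (1.0 + rng.uniform(-0.2, 0.2, size=8)),
                           q_hat, loss_hat, eta) for _ in range(100)]
```

Because the objective is convex in $q$, every perturbed point has objective value at least `best`. Entries with larger estimated loss are shrunk more aggressively, which is the exponential-weights behavior familiar from adversarial bandits.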

### 3.1 Loss Estimators

A common technique to deal with partial information in adversarial online learning problems (such as adversarial multi-armed bandits (Auer et al., 2002b)) is to construct loss estimators based on observations. In particular, inverse importance-weighted estimators are widely applicable. For our problem, with a trajectory $x_0,a_0,\ldots,x_{L-1},a_{L-1}$ for episode $t$, a common importance-weighted estimator for $\ell_t(x,a)$ would be

$$\frac{\ell_t(x,a)}{q^{P,\pi_t}(x,a)}\,\mathbb{I}\{x_{k(x)}=x,\ a_{k(x)}=a\}.$$

Clearly this is an unbiased estimator for $\ell_t(x,a)$. Indeed, the conditional expectation $\mathbb{E}_t\left[\mathbb{I}\{x_{k(x)}=x,a_{k(x)}=a\}\right]$ is exactly $q^{P,\pi_t}(x,a)$ since the latter is exactly the probability of visiting $(x,a)$ when executing policy $\pi_t$ in an MDP with transition function $P$.

The issue with this standard estimator is that we cannot compute $q^{P,\pi_t}(x,a)$ since $P$ is unknown. To address this issue, Rosenberg and Mansour (2019b) directly use $\hat{q}_t(x,a)$ in place of $q^{P,\pi_t}(x,a)$, leading to an estimator that could be either an overestimate or an underestimate, and they can only show $\tilde{O}(T^{3/4})$ regret with this approach.

Instead, since we have a confidence set $\mathcal{P}_i$ that contains $P$ with high probability (where $i$ is the index of the epoch to which $t$ belongs), we propose to replace $q^{P,\pi_t}(x,a)$ with an upper occupancy bound defined as

$$u_t(x,a)=\max_{\hat{P}\in\mathcal{P}_i}q^{\hat{P},\pi_t}(x,a),$$

that is, the largest possible probability of visiting $(x,a)$ among all the plausible environments. In addition, we also adopt the idea of implicit exploration from (Neu, 2015) to further increase the denominator by some fixed amount $\gamma>0$. Our final estimator for $\ell_t(x,a)$ is

$$\hat{\ell}_t(x,a)=\frac{\ell_t(x,a)}{u_t(x,a)+\gamma}\,\mathbb{I}\{x_{k(x)}=x,\ a_{k(x)}=a\}.$$

The implicit exploration is important for several technical reasons such as obtaining a high probability regret bound, the key motivation of the work (Neu, 2015) for multi-armed bandits.

Clearly, $\hat{\ell}_t(x,a)$ is a biased estimator and in particular is underestimating $\ell_t(x,a)$ with high probability (since by definition $q^{P,\pi_t}(x,a)\leq u_t(x,a)$ if $P\in\mathcal{P}_i$). The idea of using underestimates for adversarial learning with bandit feedback can be seen as an optimism principle which encourages exploration, and appears in previous work such as (Allenberg et al., 2006; Neu, 2015) in different forms and for different purposes. A key part of our analysis is to show that the bias introduced by these estimators is reasonably small, which eventually leads to a better regret bound compared to (Rosenberg and Mansour, 2019b).
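A minimal sketch of such an estimator, assuming the upper occupancy bounds $u_t(x,a)$ are already available as an array, divides each observed loss by $u_t(x,a)+\gamma$ and leaves every unvisited pair at zero. The array layout, the function name, and the numbers below are hypothetical and serve only to illustrate the inverse weighting.

```python
import numpy as np

def optimistic_loss_estimate(trajectory, losses, u, gamma):
    """Loss estimate: observed loss / (u(x, a) + gamma) on visited pairs, 0 elsewhere.

    trajectory: list of visited (state, action) pairs, one per layer
    losses: observed losses, aligned with trajectory
    u: array of shape (n_states, n_actions) with upper occupancy bounds
    gamma: implicit-exploration parameter (> 0)
    """
    estimate = np.zeros_like(u, dtype=float)
    for (x, a), loss in zip(trajectory, losses):
        # In a layered MDP each pair is visited at most once per episode.
        estimate[x, a] = loss / (u[x, a] + gamma)
    return estimate

# Hypothetical episode: 3 non-terminal states, 2 actions, trajectory of length 2.
u = np.array([[0.5, 0.5], [0.495, 0.055], [0.36, 0.54]])
true_q = np.array([[0.5, 0.5], [0.405, 0.045], [0.22, 0.33]])  # unknown to the learner
trajectory = [(0, 1), (2, 0)]
losses = [0.8, 0.3]
gamma = 0.01

ell_hat = optimistic_loss_estimate(trajectory, losses, u, gamma)
unbiased = np.array(losses) / np.array([true_q[x, a] for x, a in trajectory])
```

Since `u` dominates `true_q` entrywise and `gamma` is positive, each nonzero entry of `ell_hat` is no larger than the corresponding unbiased importance-weighted value, which is the optimistic underestimation discussed above.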

#### Computing upper occupancy bound efficiently.

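It remains to discuss how to compute $u_t(x,a)$ efficiently. First note that

$$u_t(x,a)=\pi_t(a|x)\max_{\hat{P}\in\mathcal{P}_i}q^{\hat{P},\pi_t}(x), \tag{7}$$

where once again we slightly abuse the notation and define $q(x)=\sum_{a\in A}q(x,a)$ for any occupancy measure $q$, which is the marginal probability of visiting state $x$ under the associated policy and transition function. Further define

$$f(\tilde{x})=\max_{\hat{P}\in\mathcal{P}_i}\Pr\left[x_{k(x)}=x\,\middle|\,x_{k(\tilde{x})}=\tilde{x},\hat{P},\pi_t\right],$$

for any $\tilde{x}$ with $k(\tilde{x})\leq k(x)$, which is the maximum probability of visiting $x$ starting from state $\tilde{x}$, under policy $\pi_t$ and among all plausible transition functions in $\mathcal{P}_i$. Clearly one has $u_t(x,a)=\pi_t(a|x)f(x_0)$, and also $f(\tilde{x})=\mathbb{I}\{\tilde{x}=x\}$ for all $\tilde{x}$ in the same layer as $x$. Moreover, since the confidence set $\mathcal{P}_i$ imposes an independent constraint on $\hat{P}(\cdot|x,a)$ for each different pair $(x,a)$, we have the following recursive relation for every $\tilde{x}$ with $k(\tilde{x})<k(x)$:

$$f(\tilde{x})=\sum_{a\in A}\pi_t(a|\tilde{x})\left(\max_{\hat{P}(\cdot|\tilde{x},a)}\sum_{x'\in X_{k(\tilde{x})+1}}\hat{P}(x'|\tilde{x},a)f(x')\right), \tag{8}$$

where the maximization is over the constraint

$$\left\|\hat{P}(\cdot|\tilde{x},a)-\bar{P}_i(\cdot|\tilde{x},a)\right\|_1\leq\epsilon_i(\tilde{x},a)$$

and can be solved efficiently via a greedy approach after sorting the values of $f(x')$ for all $x'\in X_{k(\tilde{x})+1}$ (see Appendix A for details). This suggests computing $u_t(x,a)$ via backward dynamic programming from layer $k(x)$ down to layer $0$, detailed in Algorithm 3.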
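The backward recursion can be sketched as below. The inner maximization here uses the greedy rule familiar from UCRL2 (Jaksch et al., 2010): move up to half of the allowed L1 budget onto the next state with the largest value and remove the same mass from the states with the smallest values. This is one way to solve a linear program over an L1 ball intersected with the simplex and may differ in detail from the procedure in Appendix A. The sketch also assumes every empirical distribution sums to one (that is, every pair has been visited) and that the target state is not in the last layer; the toy instance and all names are hypothetical.

```python
import numpy as np

def max_over_l1_ball(p_bar, f, eps):
    """max_p sum_j p[j] * f[j] over distributions p with ||p - p_bar||_1 <= eps.

    Assumes p_bar is itself a probability distribution.
    """
    p = np.array(p_bar, dtype=float)
    order = np.argsort(f)                 # next states sorted by increasing f
    best = order[-1]
    p[best] = min(1.0, p[best] + eps / 2.0)
    for j in order:                       # strip surplus mass from low-f states first
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[j] = max(0.0, p[j] - excess)
    return float(p @ np.asarray(f, dtype=float))

def upper_occupancy_bound(layers, pi, p_bar, eps, x):
    """Return the array u(x, .) = pi(.|x) * f(x_0) via backward dynamic programming.

    layers: list of lists of state ids; pi[s]: action probabilities at s
    p_bar[(s, a)]: empirical distribution over layers[k(s) + 1], in that order
    eps[(s, a)]: confidence width for the pair (s, a)
    """
    k_x = next(k for k, layer in enumerate(layers) if x in layer)
    f = {s: 1.0 if s == x else 0.0 for s in layers[k_x]}
    for k in range(k_x - 1, -1, -1):
        f_next = np.array([f[s] for s in layers[k + 1]])
        for s in layers[k]:
            f[s] = sum(pi[s][a] * max_over_l1_ball(p_bar[(s, a)], f_next, eps[(s, a)])
                       for a in range(len(pi[s])))
    return np.asarray(pi[x]) * f[layers[0][0]]

# Hypothetical toy instance: layers {0} -> {1, 2} -> {3}, two actions.
layers = [[0], [1, 2], [3]]
pi = {0: np.array([0.5, 0.5]), 1: np.array([0.9, 0.1]), 2: np.array([0.4, 0.6])}
p_bar = {(0, 0): np.array([0.7, 0.3]), (0, 1): np.array([0.2, 0.8])}

u_tight = upper_occupancy_bound(layers, pi, p_bar, {(0, 0): 0.0, (0, 1): 0.0}, x=1)
u_wide = upper_occupancy_bound(layers, pi, p_bar, {(0, 0): 0.2, (0, 1): 0.2}, x=1)
```

With zero width the bound coincides with the occupancy under the empirical transition function, while a width of 0.2 lets each action shift 0.1 of probability mass toward state 1, raising the probability of reaching it from 0.45 to 0.55.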

## 4 Analysis

In this section, we analyze the regret of our algorithm and prove the following theorem.

###### Theorem 3.

With probability at least , UOB-REPS with ensures :

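The proof starts with decomposing the regret into four different terms. Specifically, by Eq. (3) the regret can be written as $R_T=\sum_{t=1}^{T}\langle q_t-q^*,\ell_t\rangle$, where we define $q_t=q^{P,\pi_t}$ and $q^*$ is the optimal occupancy measure in $\Delta(P)$. We then add and subtract three terms and decompose the regret as

$$R_T=\underbrace{\sum_{t=1}^{T}\langle q_t-\hat{q}_t,\ell_t\rangle}_{\textsc{Error}}+\underbrace{\sum_{t=1}^{T}\langle\hat{q}_t,\ell_t-\hat{\ell}_t\rangle}_{\textsc{Bias}_1}+\underbrace{\sum_{t=1}^{T}\langle\hat{q}_t-q^*,\hat{\ell}_t\rangle}_{\textsc{Reg}}+\underbrace{\sum_{t=1}^{T}\langle q^*,\hat{\ell}_t-\ell_t\rangle}_{\textsc{Bias}_2}.$$

Here, the first term Error measures the error of using $\hat{q}_t$ to approximate $q_t$; the third term Reg is the regret of the corresponding online linear optimization problem and is controlled by OMD; the second and the fourth terms $\textsc{Bias}_1$ and $\textsc{Bias}_2$ correspond to the bias of the loss estimators.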

In the following subsections we bound each of these four terms respectively. Combining these bounds (Lemmas 5, 6, 8, and 9), applying a union bound, and plugging in the (optimal) values of $\eta$ and $\gamma$ proves Theorem 3.

Throughout this section we use $i_t$ to denote the index of the epoch to which episode $t$ belongs. Note that $\mathcal{P}_{i_t}$ and $\hat{q}_t$ are both $\mathcal{F}_t$-measurable.

### 4.1 Bounding Error

The idea of bounding Error is the same as in previous work such as (Jaksch et al., 2010; Neu et al., 2012; Rosenberg and Mansour, 2019a). In particular, we make use of the following lemma essentially taken from (Rosenberg and Mansour, 2019a).

###### Lemma 4.

For any sequence of transition functions such that is -measurable and belongs to for all , we have with probability at least ,

$$\sum_{t=1}^{T}\sum_{x\in X,\,a\in A}\left|q^{P_t,\pi_t}(x,a)-q_t(x,a)\right|=O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).$$

###### Proof.

The proof is identical to those of Lemma B.2 and Lemma B.3 of (Rosenberg and Mansour, 2019a). Note that importantly, the concrete form of the policies (which is different for our algorithm versus theirs) does not matter for these proofs, as long as the data are collected by executing these policies. ∎

With this lemma, we immediately obtain the following bound on the term Error.

###### Lemma 5.

With probability at least , UOB-REPS ensures

###### Proof.

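Since all losses are in $[0,1]$, we have $\textsc{Error}\leq\sum_{t=1}^{T}\sum_{x,a}\left|\hat{q}_t(x,a)-q_t(x,a)\right|$. Applying Lemma 4 with $P_t=P^{\hat{q}_t}$ so that $\hat{q}_t=q^{P_t,\pi_t}$ (by the definition of $\pi_t$ and Lemma 1) finishes the proof. ∎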

### 4.2 Bounding $\textsc{Bias}_1$

To bound the term $\textsc{Bias}_1$, we need to show that $\hat{\ell}_t$ is not underestimating $\ell_t$ by too much, which, at a high level, is also ensured by the fact that the confidence sets become more and more accurate for the frequently visited state-action pairs.

###### Lemma 6.

With probability at least , UOB-REPS ensures

$$\textsc{Bias}_1=O\left(L|X|^2\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}+\gamma|X||A|T\right).$$

###### Proof.

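First note that $\langle\hat{q}_t,\hat{\ell}_t\rangle$ is in $[0,L]$ because $P^{\hat{q}_t}\in\mathcal{P}_{i_t}$ by the definition of $\hat{q}_t$ and thus $\hat{q}_t(x,a)\leq u_t(x,a)$ by the definition of $u_t$, which implies

$$\sum_{x,a}\hat{q}_t(x,a)\hat{\ell}_t(x,a)\leq\sum_{x,a}\mathbb{I}\{x_{k(x)}=x,\ a_{k(x)}=a\}=L.$$

Applying Azuma's inequality we thus have with probability at least $1-\delta$, $\sum_{t=1}^{T}\langle\hat{q}_t,\mathbb{E}_t[\hat{\ell}_t]-\hat{\ell}_t\rangle\leq L\sqrt{2T\ln(1/\delta)}$. Therefore, we can bound $\textsc{Bias}_1$ by $\sum_{t=1}^{T}\langle\hat{q}_t,\ell_t-\mathbb{E}_t[\hat{\ell}_t]\rangle+L\sqrt{2T\ln(1/\delta)}$ under this event. We then focus on the term $\langle\hat{q}_t,\ell_t-\mathbb{E}_t[\hat{\ell}_t]\rangle$ and rewrite it as (by the definition of $\hat{\ell}_t$)

$$\begin{aligned}
\sum_{x,a}\hat{q}_t(x,a)\ell_t(x,a)\left(1-\frac{\mathbb{E}_t\left[\mathbb{I}\{x_{k(x)}=x,a_{k(x)}=a\}\right]}{u_t(x,a)+\gamma}\right)
&=\sum_{x,a}\hat{q}_t(x,a)\ell_t(x,a)\left(1-\frac{q_t(x,a)}{u_t(x,a)+\gamma}\right)\\
&\leq\sum_{x,a}\frac{\hat{q}_t(x,a)}{u_t(x,a)+\gamma}\left(u_t(x,a)-q_t(x,a)+\gamma\right)\\
&\leq\sum_{x,a}\left|u_t(x,a)-q_t(x,a)\right|+\gamma|X||A|,
\end{aligned}$$

where the second step uses $\ell_t(x,a)\leq 1$ and the last step is again due to $\hat{q}_t(x,a)\leq u_t(x,a)$. It thus remains to show $\sum_{t=1}^{T}\sum_{a}\left|u_t(x,a)-q_t(x,a)\right|=O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right)$ for each fixed $x$. Indeed, by Eq. (7) one has $u_t(x,a)=q^{P_t^x,\pi_t}(x,a)$ for $P_t^x=\operatorname*{argmax}_{\hat{P}\in\mathcal{P}_{i_t}}q^{\hat{P},\pi_t}(x)$ (which is $\mathcal{F}_t$-measurable) and thus

$$\sum_{a}\left|u_t(x,a)-q_t(x,a)\right|\leq\sum_{x',a}\left|q^{P_t^x,\pi_t}(x',a)-q_t(x',a)\right|.$$

Applying Lemma 4 together with a union bound over all $x\in X$ then finishes the proof. ∎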

### 4.3 Bounding Reg

To bound Reg, note that under the event of Lemma 2, $q^*\in\Delta(P)\subseteq\Delta(\mathcal{P}_i)$ for all $i$, and thus Reg is controlled by the standard regret guarantee of OMD. We also make use of the following concentration lemma, which is a variant of (Neu, 2015, Lemma 1) and is the key for analyzing the implicit exploration effect introduced by $\gamma$ (see Appendix B for the proof).

###### Lemma 7.

For any sequence of functions such that is -measurable for all , we have with probability at least ,

$$\sum_{t=1}^{T}\sum_{x,a}\alpha_t(x,a)\left(\hat{\ell}_t(x,a)-\frac{q_t(x,a)}{u_t(x,a)}\ell_t(x,a)\right)\leq L\ln\frac{L}{\delta}.$$

We are now ready to bound Reg.

###### Lemma 8.

With probability at least , UOB-REPS ensures

###### Proof.

By standard analysis (see Lemma 10 in Appendix B), OMD with KL-divergence ensures for any