Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

10/12/2021 · by Weichao Mao, et al.

This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an ϵ-approximate CCE in at most O(H^6 S A/ϵ^2) episodes, where S is the number of states, A is the size of the largest individual action space, and H is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is fully decentralized, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.


1 Introduction

Reinforcement learning (RL) has recently shown the capability to solve many challenging sequential decision-making problems, ranging from the game of Go (silver2016mastering), Poker (brown2018superhuman), and real-time strategy games (vinyals2019grandmaster), to autonomous driving (shalev2016safe), and robotics (kober2013reinforcement). Many of the RL applications involve the interaction of multiple agents, which are modeled systematically within the framework of multi-agent reinforcement learning (MARL). These success stories have inspired a remarkable line of studies on the theoretical aspects of MARL.

Most of the theoretical efforts in MARL, however, have been devoted to Markov games with special reward structures, such as fully competitive or cooperative games. One prevalent setting is MARL in two-player zero-sum Markov games (wei2017online; xie2020learning), where the two agents have exactly opposite objectives. Such prevalence is mostly due to the fundamental computational difficulty in more general scenarios: Finding a Nash equilibrium (NE) is known to be PPAD-complete both for two-player general-sum games (chen2009settling) and zero-sum games with more than two players (daskalakis2009complexity). Given the daunting impossibility results, convergence to NE in generic games with no special structure seems hopeless in general. As a result, many important problems in the multi-player general-sum settings, which can model broader and more practical interactive behaviors of decision makers, have been left relatively open.

In this paper, we make an initial attempt toward understanding some of the theoretical aspects of MARL in decentralized general-sum Markov games. Given the inherent challenges for computing Nash equilibria, we need to target a slightly weaker solution concept than NE. One reasonable alternative is to find a coarse correlated equilibrium (CCE) (moulin1978strategically; aumann1987correlated) of the game. Unlike NE, CCE can always be found in polynomial time for general-sum games (papadimitriou2008computing), and due to its tractability, calculating CCE has also been commonly used as an important subroutine toward finding Nash equilibria in two-player zero-sum Markov games (xie2020learning; bai2020near).

Our interest in CCE is mostly motivated by the following folklore result for learning in normal-form games: When the agents independently run no-regret learning algorithms in general-sum normal-form games, their empirical frequency of plays converges to the set of CCE of the game (hart2000simple; cesa2006prediction). In no-regret learning, each agent independently adapts its policy to minimize the cumulative regret based on only its local information, irrespective of the actions or rewards of the other agents. Well-known examples of no-regret learning algorithms include multiplicative weights update (MWU) (auer1995gambling) and online gradient descent (zinkevich2003online). Such a folk result hence suggests that CCE is a natural outcome of the simple and uncoupled learning dynamics of the agents. A natural question to ask is whether a similar result also holds for Markov games. Specifically, in this paper, we ask the following questions: Can we find CCE in general-sum Markov games using decentralized/uncoupled learning dynamics? If so, can we achieve such a result efficiently, by showing an explicit sample complexity upper bound?
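To make this folk result concrete, here is a minimal, self-contained Python sketch (an illustration of the classical statement, not code from the paper or the cited works): two agents independently run Hedge (multiplicative weights) on their own rewards in a small general-sum normal-form game, and the empirical frequency of joint plays is then checked against the CCE condition. The payoff matrices are those of the Hawk-Dove game used later in Section 3.

    import numpy as np

    # Two agents independently run Hedge (multiplicative weights) on their own
    # rewards; the empirical joint frequency of play approximates a CCE.
    R1 = np.array([[4.0, 1.0], [5.0, 0.0]])   # row player's rewards (Hawk-Dove game)
    R2 = np.array([[4.0, 5.0], [1.0, 0.0]])   # column player's rewards
    T = 20000
    eta = np.sqrt(np.log(2) / T)              # Hedge step size for 2 actions
    S1, S2 = np.zeros(2), np.zeros(2)         # cumulative reward of each own action
    counts = np.zeros((2, 2))
    rng = np.random.default_rng(0)

    for _ in range(T):
        p1 = np.exp(eta * (S1 - S1.max())); p1 /= p1.sum()
        p2 = np.exp(eta * (S2 - S2.max())); p2 /= p2.sum()
        a1, a2 = rng.choice(2, p=p1), rng.choice(2, p=p2)
        counts[a1, a2] += 1
        S1 += R1[:, a2]                       # reward each own action would have earned
        S2 += R2[a1, :]

    emp = counts / T                          # empirical joint distribution of play
    value = (emp * R1).sum()                  # row player's average reward
    best_dev = max(emp.sum(axis=0) @ R1[a] for a in range(2))
    print(emp, best_dev - value)              # the deviation gain vanishes as T grows

The printed gap is exactly the row player's average regret, which the no-regret property drives to zero; the analogous check holds for the column player.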

Before answering these questions, we would like to remark that decentralized MARL in general-sum games can be highly challenging. Specifically, in decentralized learning¹, each agent only has access to its own local information (history of states, local rewards, and local actions), and can neither communicate with other agents nor be coordinated by any central controller during learning. In fact, we require that each agent is completely unaware of the underlying structure of the game, or of the presence of the other agents. In decentralized learning, since both the reward and the transition are affected by the other agents, the environment becomes non-stationary from each agent's own perspective, especially when the agents learn and update their policies simultaneously. Hence, an agent needs to efficiently explore the unknown environment while bearing in mind that the information it gathered a while ago might no longer be accurate. This makes many successful single-agent RL solutions, which assume that the agent is learning in a stationary Markovian environment, inapplicable. Furthermore, compared with RL in two-player zero-sum games, an additional challenge in general-sum games is equilibrium selection. In zero-sum games, all NE have the same value (filar2012competitive), and there is no ambiguity in defining the sub-optimality of a policy. In general-sum games, however, multiple equilibria can have different values. We hence need to first identify which equilibrium to compare with when we are trying to measure the performance of a policy. Finally, we remark that we consider in this paper the fully observable setup, where the agents have full access to the state information. This is in contrast to the more general team theory with partially observable information structures (ho1980team; yuksel2013stochastic), such as those modeled by decentralized partially observable Markov decision processes (Dec-POMDPs) (bernstein2002complexity; nayyar2013decentralized), where each agent has only a private partial observation of the state. Learning or even computing a NE in the latter case is much more challenging, and we do not target such a more general setup in this paper.

¹This setting has been studied under various names in the literature, including individual learning (leslie2005individual), decentralized learning (arslan2016decentralized), agnostic learning (tian2020provably; wei2021last), and independent learning (claus1998dynamics; daskalakis2020independent). It also belongs to a more general category of teams/games with decentralized information structure (ho1980team; nayyar2013common; nayyar2013decentralized).

Despite the challenges identified above, we answer both of the aforementioned questions affirmatively, by presenting an algorithm in which the agents can find a CCE in general-sum Markov games efficiently through decentralized learning. Our contributions in this paper are summarized as follows.

Contributions. 1) We study provably efficient exploration in decentralized general-sum Markov games. We propose an algorithm named Optimistic V-learning with Stabilized Online Mirror Descent (V-learning OMD), where V-learning (bai2020near) is a simple variant of Q-learning. In V-learning OMD, each agent independently runs an optimistic V-learning algorithm to explore the unknown environment, while using an online mirror descent procedure for policy updates. Following the learning process, the CCE can be extracted by simply letting the agents randomly repeat their previous strategies using a common random seed. 2) We show that if all agents in the game run the V-learning OMD algorithm, they can find an ϵ-approximate coarse correlated equilibrium in at most O(H^6 S A/ϵ^2) episodes, where S is the number of states, A is the size of the largest action space among the agents, and H is the length of an episode. Our result complements its counterpart in normal-form games that uncoupled no-regret learning dynamics lead to CCE. We further show that our sample complexity is nearly optimal, in the sense that it matches all the parameter dependences in the information-theoretic lower bound, except for the horizon length H. 3) As an important building block of our analysis, we conduct a novel investigation of a high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which might be of independent interest. We emphasize that due to the decentralization property, our algorithm readily generalizes to an arbitrary number of agents without suffering from the exponential dependence on the number of agents. Our work appears to be the first to provide non-asymptotic guarantees for MARL in generic general-sum Markov games with efficient exploration, with an additional appealing feature of being decentralized.

Outline. The rest of the paper is organized as follows: We start with a literature review in Section 2. In Section 3, we introduce the mathematical model of our problem and necessary preliminaries. In Section 4, we present our V-learning OMD algorithm for decentralized learning in general-sum Markov games. A sample complexity analysis of V-learning OMD is given in Section 5. In Section 6, we analyze a specific adversarial multi-armed bandit problem, which plays a central role in our analysis of the V-learning OMD algorithm. Finally, we conclude the paper in Section 7.

2 Related Work

Multi-agent reinforcement learning is generally modeled within the mathematical framework of stochastic games (shapley1953stochastic) (also referred to as Markov games). Given the PPAD completeness of finding a Nash equilibrium in generic games (daskalakis2009complexity; chen2009settling), convergence to NE has mostly been studied in games with special structures, such as two-player zero-sum games or cooperative games.

In two-player zero-sum Markov games, littman1994markov has proposed a Q-learning based algorithm named minimax-Q, whose asymptotic convergence guarantee has later been established in littman1996generalized. Recently, various sample efficient methods have been proposed in this setting (wei2017online; bai2020provable; sidford2020solving; xie2020learning; bai2020near; liu2021sharp; zhao2021provably). Most notably, several works (daskalakis2020independent; tian2020provably; wei2021last) have investigated two-player zero-sum games in a decentralized environment similar to ours, where an agent cannot observe the actions of the other agent.

Another line of research with convergence guarantees is RL in teams or cooperative games. wang2002reinforcement have proposed an optimal adaptive learning algorithm that converges to the optimal Nash equilibrium in Markov teams, where the agents share the same objective. More recently, arslan2016decentralized have shown that decentralized Q-learning can converge to NE in weakly acyclic games, which cover Markov teams and potential games as important special cases. Later, yongacoglu2019learning have further improved arslan2016decentralized and achieved convergence to the team-optimal equilibrium. Overall, the aforementioned works have mainly focused on two-player zero-sum games or cooperative games. These results do not carry over in any way to the general-sum games that we consider in this paper.

A few works have considered games beyond the zero-sum or cooperative settings: hu2003nash, littman2001friend, and zehfroosh2020pac have established convergence guarantees under the assumptions that either a saddle point equilibrium or a coordination equilibrium exists. greenwald2003correlated has bypassed the computation of NE in general-sum games by targeting correlated equilibria instead, but no theoretical convergence result has been given. Other approaches for finding NE in general-sum games include minimizing the Bellman-like residuals learned from offline/batch data (perolat2017learning), or using a two-timescale algorithm to learn the policy of each player from an optimization perspective (prasad2015two). Nevertheless, none of these works has considered sample-efficient exploration in a decentralized environment, a more challenging objective that we pursue in this paper. More recently, liu2021sharp have studied the non-asymptotic properties of learning CCE in general-sum Markov games, but their sample complexity bound scales exponentially in the number of agents as a consequence of using a centralized learning approach.

3 Preliminaries

An N-player (episodic) general-sum Markov game is defined by a tuple (N, H, S, {A_i}_{i ∈ N}, {r_i}_{i ∈ N}, P), where (1) N = {1, 2, …, N} is the set of agents; (2) H is the number of time steps in each episode; (3) S is the finite state space; (4) A_i is the finite action space for agent i; (5) r_{i,h}: S × (A_1 × ⋯ × A_N) → [0, 1] is the reward function for agent i at step h, where A_1 × ⋯ × A_N is the joint action space; and (6) P_h: S × (A_1 × ⋯ × A_N) → Δ(S) is the transition kernel at step h. We remark that both the reward function and the state transition function depend on the joint actions of all the agents. We assume for simplicity that the reward function is deterministic. Our results can be easily generalized to stochastic reward functions.

The agents interact with the unknown environment for K episodes, and we let T := KH be the total number of time steps. At each time step h ∈ [H] of an episode, the agents observe the state s_h, and take their actions simultaneously. We let a_h := (a_{1,h}, …, a_{N,h}) denote the joint action. The reward r_{i,h}(s_h, a_h) is then revealed to agent i privately, and the environment transitions to the next state s_{h+1} ∼ P_h(· | s_h, a_h). We assume for simplicity that the initial state s_1 of each episode is fixed.

As mentioned before, we focus on the decentralized setting: Each agent only observes the states and its own actions and rewards, but not the actions or rewards of the other agents. In fact, each agent is completely oblivious of the existence of other agents, and is also unaware of the number of agents in the game. The agents are also not allowed to communicate with each other. This decentralized information structure requires an agent to learn to make decisions based on only its local information.
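For illustration, the following minimal Python sketch summarizes this decentralized interaction protocol from the agents' local viewpoints. The interfaces (env.reset, env.step, agent.act, agent.observe) are hypothetical placeholders, not part of the paper.

    # One episode of decentralized interaction: every agent acts independently and
    # only ever observes (state, own action, own reward, next state).
    def run_episode(env, agents, H):
        s = env.reset()                                     # fixed initial state s_1
        for h in range(1, H + 1):
            actions = [agent.act(h, s) for agent in agents]  # chosen without coordination
            rewards, s_next = env.step(actions)              # joint transition, private rewards
            for agent, a, r in zip(agents, actions, rewards):
                agent.observe(h, s, a, r, s_next)            # local feedback only
            s = s_next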

For each agent i, a (Markov) policy π_i = {π_{i,h}: S → Δ(A_i)}_{h ∈ [H]} is a mapping from the time index and the state space to a distribution over its action space. We define the space of such policies for agent i by Π_i. The joint policy space is denoted by Π := Π_1 × ⋯ × Π_N. Each agent seeks to find a policy that maximizes its own cumulative reward. A joint policy π = (π_1, …, π_N) induces a probability measure on the sequence of states and joint actions. For a joint policy π, and for each time step h ∈ [H], state s, and joint action a = (a_1, …, a_N), we define the (state) value function and the state-action value function (Q-function) for agent i as follows:

    V_{i,h}^π(s) := E_π[ ∑_{h'=h}^H r_{i,h'}(s_{h'}, a_{h'}) | s_h = s ],    Q_{i,h}^π(s, a) := E_π[ ∑_{h'=h}^H r_{i,h'}(s_{h'}, a_{h'}) | s_h = s, a_h = a ].    (1)

For convenience of notation, we use the superscript −i to denote the set of agents excluding agent i, i.e., −i := N \ {i}. For example, we can rewrite a joint policy as π = (π_i, π^{−i}) using this convention.

Definition 1.

(Best response). A policy π_i^† is a best response to π^{−i} for agent i if V_{i,h}^{π_i^†, π^{−i}}(s) = sup_{π_i'} V_{i,h}^{π_i', π^{−i}}(s) for any step h ∈ [H] and any state s ∈ S.

The supremum is taken over all general policies² that are not necessarily Markov. When π^{−i} is Markov, the environment is stationary from the perspective of agent i, and there always exists a Markov best response.

²A general policy (bai2020near) of agent i is a set of maps, one for each step h ∈ [H], from a random variable ω ∈ Ω and a history of length h to a distribution of actions in A_i.

Definition 2.

(Nash equilibrium). A Markov joint policy π = (π_1, …, π_N) is a Nash equilibrium (NE) if π_i is a best response to π^{−i} for all i ∈ N.

Given the fundamental difficulty of calculating Nash equilibria in general-sum games (daskalakis2009complexity; chen2009settling), in this paper, we mainly target a weaker solution concept named coarse correlated equilibrium (moulin1978strategically; aumann1987correlated). CCE allows possible correlations among the agents' policies. Before we formally define CCE, we need to refine the above definitions of policies to reflect such correlations. We define a set of (non-Markov) correlated policies, where each correlated policy π = {π_h}_{h ∈ [H]} in this set maps, at each step h ∈ [H], from a random variable ω ∈ Ω and a history of length h to a distribution over the joint action space A_1 × ⋯ × A_N. We also assume that the agents can access a common source of randomness (e.g., a common random seed) to sample the random variable ω. One can see that our definition of correlated policies generalizes the usual concept of policies as we defined earlier, since the latter only allows mappings to each agent's individual action space simplex. Following the new notations, we can analogously define π^i or π^{−i} for each i ∈ N as the proper marginal distributions of π. The best responses, value functions V_{i,h}^π, and state-action value functions Q_{i,h}^π for agent i can be defined similarly as in Definition 1 and Equation (1). We are now in a position to define the CCE in an episodic Markov game, as follows:

Definition 3.

(Coarse correlated equilibrium). A correlated policy π is a coarse correlated equilibrium if for every agent i, sup_{π_i'} V_{i,1}^{π_i', π^{−i}}(s_1) ≤ V_{i,1}^{π}(s_1).

For illustrative purposes, let us compare the definitions of NE and CCE in a simple normal-form game named Hawk-Dove (with no state transitions). There are two players in this game. The row player has the action space {Dove, Hawk}, and the column player's action space is also {Dove, Hawk}. The reward matrix of the Hawk-Dove game is described in Table 1. There are three Nash equilibria in this game: (Hawk, Dove) and (Dove, Hawk) are two pure strategy NE, and the profile in which each player plays Dove and Hawk with probability 1/2 each is a NE in mixed strategies. Table 2 gives a CCE distribution of the Hawk-Dove game, which assigns equal probabilities to three action pairs: (Dove, Dove), (Dove, Hawk), and (Hawk, Dove). We can see that a NE defines for each player an independent probability distribution over that player's own action space; in contrast, a CCE is a probability distribution over the joint action space of the players. In this sense, CCE generalizes NE by allowing possible correlations among the strategies of the agents. In our proposed algorithm, such correlation is implicitly achieved by letting the players use a common random seed.

Table 1: The Hawk-Dove game.
              Dove      Hawk
    Dove      4, 4      1, 5
    Hawk      5, 1      0, 0

Table 2: A CCE in the Hawk-Dove game.
              Dove      Hawk
    Dove      1/3       1/3
    Hawk      1/3       0
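As a quick sanity check, the short Python script below (an illustration, not part of the paper) numerically verifies that the distribution in Table 2 satisfies the CCE condition of Definition 3 for the payoffs in Table 1: neither player can gain by deviating to a fixed action while the opponent's action is still drawn from the joint distribution.

    import numpy as np

    R1 = np.array([[4, 1], [5, 0]])            # row player's rewards (Dove, Hawk)
    R2 = np.array([[4, 5], [1, 0]])            # column player's rewards
    cce = np.array([[1/3, 1/3], [1/3, 0]])     # joint distribution from Table 2

    row_value = (cce * R1).sum()               # = 10/3
    col_value = (cce * R2).sum()               # = 10/3
    col_marginal = cce.sum(axis=0)             # opponent's marginal seen by the row player
    row_marginal = cce.sum(axis=1)             # opponent's marginal seen by the column player

    best_row_dev = max(col_marginal @ R1[a, :] for a in range(2))   # deviate to a fixed row
    best_col_dev = max(row_marginal @ R2[:, b] for b in range(2))   # deviate to a fixed column
    assert best_row_dev <= row_value + 1e-9 and best_col_dev <= col_value + 1e-9

Both best deviation values equal 10/3, matching the players' payoffs under the CCE, so the inequality in Definition 3 holds (with equality for the deviation to always playing Hawk).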

Unlike NE, it can be shown that a CCE can always be calculated in polynomial time for general-sum games (papadimitriou2008computing). If the inequality in Definition 3 only holds approximately, then we have an approximate notion of coarse correlated equilibrium.

Definition 4.

(ϵ-approximate CCE). For any ϵ > 0, a correlated policy π is an ϵ-approximate coarse correlated equilibrium if for every agent i, sup_{π_i'} V_{i,1}^{π_i', π^{−i}}(s_1) ≤ V_{i,1}^{π}(s_1) + ϵ.

For notational convenience, in most parts of the paper we illustrate our algorithms and results for the special case of two-player general-sum games, i.e., N = 2. It is straightforward to extend our results to the general N-player games as we defined above. With two players, we use A_1 and A_2 to denote the action spaces of players 1 and 2, respectively. Let A := |A_1|, and B := |A_2|. We assume that an upper bound on the sizes of the action spaces is common knowledge. We also rewrite the policies as (μ, ν) := (π_1, π_2).

4 The V-Learning OMD Algorithm

In this section, we introduce our algorithm Optimistic V-learning with Stabilized Online Mirror Descent (V-learning OMD) for decentralized multi-agent RL in general-sum Markov games.

V-learning OMD naturally integrates the idea of optimistic V-learning in single-agent RL (jin2018q) with Online Mirror Descent (OMD) (nemirovskij1983problem; zinkevich2003online) from online convex optimization. First, our algorithm uses optimistic V-learning to efficiently explore the unknown environment, as in single-agent RL. Second, each agent selects its actions following a no-regret OMD algorithm in order to achieve a CCE. The intuition behind using no-regret learning here is to defend against the unobserved behavior of the opponents, by presuming that the opponents' behavior may impair the reward sequence arbitrarily. Although this may seem conservative, we will show that it suffices for finding a CCE. The use of no-regret learning is also reminiscent of the well-known result in normal-form games that if all agents run a no-regret learning algorithm, the empirical frequency of their actions converges to a CCE (cesa2006prediction). These components also make our algorithm decentralized: it can be implemented by each agent individually, using only the local rewards received and the local actions executed, without any communication among the agents.

[Algorithm LABEL:alg:qomd: Optimistic V-learning with Stabilized Online Mirror Descent (V-learning OMD), run by agent 1]

The algorithm run by agent 1 (with action space A_1) is presented in Algorithm LABEL:alg:qomd. The algorithm for agent 2 (or for the other agents in the setting with more than two agents) is symmetric, obtained by simply replacing A_1 with the agent's own action space. We thus omit the index of the agent in the notations for clarity. We use θ_h(a | s) to denote the probability of taking action a at state s and step h, where θ_h(· | s) is a distribution over the agent's own action space. At each step h of an episode, the agent first takes an action a according to the policy θ_h(· | s) for the current state s, and observes the reward r and the next state s'. It also updates the count t = N_h(s) of the number of times that state s has been visited at step h, and constructs a bonus term β_t (defined in terms of some absolute constant c and a log factor ι to be specified later) that is used to upper bound the state value function. The agent then updates the optimistic state value functions by:

    V_h(s) ← (1 − α_t) V_h(s) + α_t [ r + V_{h+1}(s') + β_t ],    (2)

where the learning rate is chosen as α_t = (H + 1)/(H + t). This update rule essentially follows the optimistic Q-learning algorithm (jin2018q) in the single-agent scenario, except that instead of estimating the Q-functions, we maintain optimistic estimates of the state value functions. This is because the definition of the Q-function explicitly depends on the joint actions of all the agents, which cannot be observed in a decentralized environment. Such an argument is also consistent with the Optimistic Nash V-learning (bai2020near) and the V-OL (tian2020provably) algorithms for RL in two-player zero-sum games.
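The value update in (2) can be summarized in a few lines of Python. The sketch below is illustrative only: the step size α_t = (H + 1)/(H + t) follows the jin2018q-style choice discussed above, while the bonus function beta, the data structures V and N, and the clipping of the optimistic value are placeholders rather than the exact choices of Algorithm LABEL:alg:qomd.

    # Optimistic V-learning update (2) for one visited (h, s), from the agent's local view.
    def update_value(V, N, h, s, r, s_next, H, beta):
        N[h][s] += 1
        t = N[h][s]
        alpha = (H + 1) / (H + t)                    # Q-learning-style step size
        target = r + V[h + 1][s_next] + beta(t)      # optimistic one-step target
        V[h][s] = (1 - alpha) * V[h][s] + alpha * target
        V[h][s] = min(V[h][s], H)                    # keep the estimate within a trivial upper bound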

Unlike RL in the single-agent problem, where the agent takes an action with the largest optimistic Q-value, in a multi-agent environment the agent proceeds more conservatively by running an adversarial bandit algorithm to account for the unobserved effects of the other agents' policy changes. At each step h and each state s, we use a variant of online mirror descent with bandit feedback to compute a policy θ_h(· | s). OMD is an iterative process that computes the current policy by carrying out a simple gradient update in the dual space, where the dual space is defined by a mirror map (or a regularizer) Φ. In our algorithm, we use the standard unnormalized negentropy regularizer Φ(x) = ∑_a x(a) log x(a) − ∑_a x(a). Given a mirror map Φ, the Φ-induced Bregman divergence is defined as D_Φ(x, y) := Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩.

Given the (bandit-feedback) loss vector ℓ̂_h(s, ·) at step h and state s, the OMD update rule is given by (Line 13 of Algorithm LABEL:alg:qomd):

    θ_h(· | s) ← argmin_{x ∈ Δ} { η_t ⟨x, ℓ̂_h(s, ·)⟩ + D_Φ(x, θ_h(· | s)) },

where η_t is the learning rate and Δ denotes the probability simplex over the agent's own action space. We remark that OMD itself is a well-developed algorithmic framework with a rich literature. But in our case, to be consistent with the changing learning rate in the V-learning part and the high-probability nature of the sample complexity bounds, we additionally require an OMD algorithm to have (1) a dynamic learning rate and (2) a high-probability regret bound, with respect to (3) a weighted definition of regret. Such a result is absent in the literature as far as we know. Interestingly, incorporating OMD with a dynamic learning rate is an active and challenging sub-area per se: An impossibility result (orabona2018scale) has shown that standard OMD with an O(1/√t) learning rate can incur linear regret when the Bregman divergence is unbounded, which actually covers our choice of Φ. A stabilization technique (fang2020online) was later introduced to resolve this problem, by replacing the policy at each step with a convex combination of this policy and the initial policy. This stabilization technique is also helpful in our method (Line 14 in Algorithm LABEL:alg:qomd), although the design of the convex combination is a little more involved due to the weighted regret. We provide a more detailed description of the bandit subroutine and an analysis of our OMD algorithm in Section 6.
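To make the policy-update step concrete, here is a minimal Python sketch of one stabilized OMD step under the unnormalized negentropy mirror map, which admits a multiplicative-weights-style closed form on the simplex. The loss estimate, the learning rate eta_t, and the mixing weight gamma_t are illustrative placeholders, not the exact choices of Algorithm LABEL:alg:qomd; the stabilization step mixes the new iterate with the initial (uniform) policy, in the spirit of fang2020online.

    import numpy as np

    def stabilized_omd_step(theta, a_taken, loss, eta_t, gamma_t, theta_init):
        """One bandit-feedback OMD step followed by stabilization."""
        loss_hat = np.zeros_like(theta)
        loss_hat[a_taken] = loss / theta[a_taken]      # importance-weighted loss estimate
        new_theta = theta * np.exp(-eta_t * loss_hat)  # mirror step under negentropy
        new_theta /= new_theta.sum()                   # back onto the simplex
        return (1 - gamma_t) * new_theta + gamma_t * theta_init   # stabilization

    theta0 = np.ones(4) / 4                            # uniform initial policy, 4 actions
    theta = stabilized_omd_step(theta0, a_taken=2, loss=0.7,
                                eta_t=0.1, gamma_t=0.05, theta_init=theta0)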

5 Theoretical Analyses

In this section, we present our main results on the sample complexity upper bound of V-learning OMD, and characterize the fundamental limits of the problem by providing a lower bound.

We first introduce a few notations to facilitate the analysis. For a given step h of episode k, we denote by s_h^k the state that the agents observe at this step. Let μ_h^k(· | s) and ν_h^k(· | s) be the (interim) strategies at step h of episode k specified by the OMD policy θ in Algorithm LABEL:alg:qomd to agents 1 and 2, respectively. Let a_h^k and b_h^k be the actual actions taken by the two agents. Let V_h^k(s) and N_h^k(s), respectively, be the values of V_h(s) and N_h(s) in Algorithm LABEL:alg:qomd calculated by agent 1 at the beginning of the k-th episode. Symmetrically, define Ṽ_h^k(s) to be the value of V_h(s) calculated by agent 2, which does not necessarily take the same value as V_h^k(s). For notational convenience, we often suppress the sub/super-scripts h and k when there is no possibility of any ambiguity. When the state s is clear from the context, we also sometimes abbreviate N_h^k(s) as N_h^k or even simply as t. For a fixed state s and step h, let t = N_h^k(s), and suppose that s was visited at episodes k^1 < k^2 < ⋯ < k^t at the h-th step before the k-th episode. If we further define α_t^0 := ∏_{j=1}^t (1 − α_j) and α_t^i := α_i ∏_{j=i+1}^t (1 − α_j), one can show that the update rule in (2) can be equivalently expressed as

    V_h^k(s) = α_t^0 H + ∑_{i=1}^t α_t^i [ r_h(s, a_h^{k^i}, b_h^{k^i}) + V_{h+1}^{k^i}(s_{h+1}^{k^i}) + β_i ].    (3)

This update rule follows the standard optimistic Q-learning algorithm (jin2018q) in single-agent RL, and has also appeared in RL for two-player zero-sum games (bai2020near). In the following lemma, we recall several properties of the weights α_t^i that are useful in our analysis.
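Since (3) is simply the update (2) unrolled over the t visits to (s, h), it can be checked numerically. The Python snippet below (an illustrative check, with random stand-ins for the targets r + V_{h+1} + β_i and an optimistic initialization V = H) confirms that iterating (2) reproduces the weighted-sum form with the weights α_t^0 and α_t^i.

    import numpy as np

    H, t = 5, 7
    rng = np.random.default_rng(1)
    targets = rng.uniform(0, H, size=t)        # stand-ins for r + V_{h+1} + beta_i
    alpha = lambda i: (H + 1) / (H + i)

    V = H                                      # optimistic initialization
    for i in range(1, t + 1):                  # iterate the update rule (2)
        V = (1 - alpha(i)) * V + alpha(i) * targets[i - 1]

    alpha_t0 = np.prod([1 - alpha(j) for j in range(1, t + 1)])
    alpha_ti = [alpha(i) * np.prod([1 - alpha(j) for j in range(i + 1, t + 1)])
                for i in range(1, t + 1)]
    V_closed = alpha_t0 * H + sum(a * y for a, y in zip(alpha_ti, targets))  # form (3)
    assert abs(V - V_closed) < 1e-10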

Lemma 1.

(Properties of α_t^i; Lemma 4.1 in jin2018q).

  1. ∑_{i=1}^t α_t^i = 1 and α_t^0 = 0 for t ≥ 1; ∑_{i=1}^t α_t^i = 0 and α_t^0 = 1 for t = 0.

  2. 1/√t ≤ ∑_{i=1}^t α_t^i/√i ≤ 2/√t for every t ≥ 1.

  3. max_{i ∈ [t]} α_t^i ≤ 2H/t and ∑_{i=1}^t (α_t^i)^2 ≤ 2H/t for every t ≥ 1.

  4. ∑_{t=i}^∞ α_t^i = 1 + 1/H for every i ≥ 1.
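These properties are easy to spot-check numerically; the Python snippet below (an illustrative check for a small H) verifies them for the weights α_t^i defined above.

    import numpy as np

    H = 4
    alpha = lambda i: (H + 1) / (H + i)
    def alpha_ti(t, i):
        return alpha(i) * np.prod([1 - alpha(j) for j in range(i + 1, t + 1)])

    for t in range(1, 40):
        w = np.array([alpha_ti(t, i) for i in range(1, t + 1)])
        assert abs(w.sum() - 1.0) < 1e-12                                    # property 1
        ratio = (w / np.sqrt(np.arange(1, t + 1))).sum()
        assert 1 / np.sqrt(t) <= ratio + 1e-12 and ratio <= 2 / np.sqrt(t)   # property 2
        assert w.max() <= 2 * H / t and (w ** 2).sum() <= 2 * H / t          # property 3
    for i in range(1, 10):
        tail = sum(alpha_ti(t, i) for t in range(i, 300))
        assert tail <= 1 + 1 / H + 1e-9                                      # property 4 (truncated sum)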

[Algorithm LABEL:alg:correlated: Construction of the auxiliary correlated policies π̂_h^k]

Based on the strategy trajectories of the two agents specified by Algorithm LABEL:alg:qomd, we construct an auxiliary pair of correlated policies (μ̂_h^k, ν̂_h^k), whose joint execution we denote by π̂_h^k, for each h ∈ [H] and k ∈ [K]. The construction of such correlated policies, largely inspired by the construction of the “certified policies” in bai2020near, is formally defined in Algorithm LABEL:alg:correlated. Such auxiliary correlated policies will play a significant role throughout our analysis, and are closely related to the CCE correlated policy that we will construct later. In words, π̂_h^k proceeds as follows: It first observes the current state s_h, and lets t = N_h^k(s_h). Then, it randomly samples an episode index from {k^1, …, k^t}, the set of episodes in which the state s_h was previously visited at step h during the execution of Algorithm LABEL:alg:qomd before the k-th episode. Each index k^i has a probability of α_t^i to be selected. It is easy to verify that ∑_{i=1}^t α_t^i = 1, and hence we have specified a well-defined probability distribution over the episode index set. Finally, π̂_h^k executes the sampled strategy pair (μ_h^{k^i}, ν_h^{k^i}) at step h, and then repeats a similar procedure using π̂_{h+1}^{k^i} at step h + 1, and so on.

From the collection of such auxiliary correlated policies {π̂_1^k}_{k ∈ [K]}, we finally construct a correlated policy π̂, which we will show later is a CCE. A detailed description of the construction of π̂ is presented in Algorithm LABEL:alg:certify. By construction, π̂ first uniformly samples an index k from [K] using a common random seed, and then proceeds by following the auxiliary correlated policy π̂_1^k. One can see that the notations we have defined are related through the following equation: π̂ = (1/K) ∑_{k=1}^K π̂_1^k. We also remark that the common random seed used in Algorithms LABEL:alg:correlated and LABEL:alg:certify implicitly plays the role of the “trusted coordinator” typically used in the language of correlated equilibria.
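The execution of the output correlated policy can be summarized compactly. The Python sketch below is an illustration of the sampling procedure described above; the data structures mu and nu (the stored strategies of the two agents), visits (the episodes in which each (h, s) was visited), and the weight function alpha_ti are hypothetical stand-ins for the quantities recorded during Algorithm LABEL:alg:qomd.

    import numpy as np

    def run_output_policy(env, mu, nu, visits, alpha_ti, K, H, seed=0):
        rng = np.random.default_rng(seed)        # common random seed shared by all agents
        k = int(rng.integers(1, K + 1))          # uniformly sample an episode index from [K]
        s = env.reset()
        for h in range(1, H + 1):
            past = [j for j in visits[h][s] if j < k]   # episodes k^1 < ... < k^t before episode k
            t = len(past)                               # (assumes s was visited before, i.e., t >= 1)
            probs = [alpha_ti(t, i) for i in range(1, t + 1)]
            k = past[rng.choice(t, p=probs)]            # resample index k^i with probability alpha_t^i
            a = rng.choice(len(mu[k][h][s]), p=mu[k][h][s])   # both agents replay the strategies
            b = rng.choice(len(nu[k][h][s]), p=nu[k][h][s])   # they used in episode k^i at (h, s)
            _, s = env.step((a, b))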

For notational convenience, we further introduce the operator [P_h V](s, a, b) := E_{s' ∼ P_h(· | s, a, b)}[V(s')] for any value function V, and the operator [D_{μ×ν} Q](s) := E_{a ∼ μ(· | s), b ∼ ν(· | s)}[Q(s, a, b)] for any strategy pair (μ, ν) and any state-action value function Q. With these notations, for any h ∈ [H] and for any policy pair (μ, ν), the Bellman equations can be rewritten more succinctly as Q_h^{μ,ν}(s, a, b) = r_h(s, a, b) + [P_h V_{h+1}^{μ,ν}](s, a, b) and V_h^{μ,ν}(s) = [D_{μ_h×ν_h} Q_h^{μ,ν}](s). Using these notations, we define V_{H+1}^{†, ν̂_{H+1}^k}(s) := 0 for all s, and for each h ∈ [H], define

    V_h^{†, ν̂_h^k}(s) := max_μ ∑_{i=1}^t α_t^i [ D_{μ × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{†, ν̂_{h+1}^{k^i}} ) ](s),    where t = N_h^k(s),    (4)

which can be considered as the value of agent 1's best response at step h against the auxiliary correlated policy π̂_h^k. In what follows, we will simply write V_h^{†, ν̂_h^k} as V_h^{†, ν̂^k} for notational convenience, because the time step is always clear from the subscripts. We can also define V_h^{μ̂^k, †} analogously.

[Algorithm LABEL:alg:certify: Construction of the output correlated policy π̂]

Remark 1.

Our definition of the correlated policy π̂ is inspired by the “certified policies” (bai2020near) for learning in two-player zero-sum Markov games, but with additional challenges to address: In the zero-sum setting, the Nash equilibrium value is always unique, and the regret with respect to the equilibrium value can be easily defined a priori (by means of the “duality gap”). But in general-sum games, the equilibrium value is not necessarily unique. We hence need to first specify an equilibrium before we are able to define the regret. In our analysis, the equilibrium value we choose is the one associated with the correlated policy π̂. In addition, we also emphasize that the correlated policy π̂ is only used for analytical purposes; the actual strategies adopted by the agents during the execution of Algorithm LABEL:alg:qomd are still (μ_h^k, ν_h^k).

We start with an intermediate result, which states that the optimistic values V_h^k and Ṽ_h^k maintained by the two agents are indeed high-probability upper bounds of V_h^{†, ν̂^k} and V_h^{μ̂^k, †}, respectively. The proof relies on a delicate investigation of a high-probability regret bound for OMD with a dynamic learning rate, which we will elaborate on in the next section.

Lemma 2.

For any δ ∈ (0, 1), let ι := log(SABT/δ). It holds with probability at least 1 − δ that V_h^k(s) ≥ V_h^{†, ν̂^k}(s) and Ṽ_h^k(s) ≥ V_h^{μ̂^k, †}(s), for all (s, h, k) ∈ S × [H] × [K].

Proof.

In the following, we provide a proof for the first inequality. The second inequality can be shown using a similar argument.

Our proof relies on backward induction on h. First, the claim holds for step H + 1 by the definition of V_{H+1}^{†, ν̂^k}(s) = 0. Now, suppose V_{h+1}^k(s) ≥ V_{h+1}^{†, ν̂^k}(s) for all s and k. By the definition of V_h^{†, ν̂^k} in (4) and the induction hypothesis,

    V_h^{†, ν̂^k}(s) = max_μ ∑_{i=1}^t α_t^i [ D_{μ × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{†, ν̂^{k^i}} ) ](s) ≤ max_μ ∑_{i=1}^t α_t^i [ D_{μ × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{k^i} ) ](s).    (5)

Further, define

    R_t := max_μ ∑_{i=1}^t α_t^i [ D_{μ × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{k^i} ) ](s) − ∑_{i=1}^t α_t^i [ D_{μ_h^{k^i} × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{k^i} ) ](s).    (6)

One may observe that, from the perspective of player 1, R_t is the weighted sum of the differences between the actual value that player 1 collected over the first t times that state s is visited, and the value that could have been achieved using the best fixed policy in hindsight. R_t can hence be thought of as the weighted regret of an adversarial bandit problem, which we will formally present and analyze in Section 6. Specifically, the loss function of the bandit problem is defined as

    ℓ_i(a) := H − E_{b ∼ ν_h^{k^i}(· | s)} [ r_h(s, a, b) + (P_h V_{h+1}^{k^i})(s, a, b) ],

and the weight of the regret at round i is α_t^i. If we define w_i := α_t^i, then R_t can be equivalently rewritten as

    R_t = max_μ ∑_{i=1}^t w_i ⟨ μ_h^{k^i}(· | s) − μ, ℓ_i ⟩.

Later in Section 6, we will analyze an adversarial bandit problem in exactly this form. Applying the regret bound (which will be presented in Theorem 2 of Section 6) of this bandit problem, we obtain the following result with probability at least 1 − δ:

(7)

where in the first step we have used the fact that is increasing and , and the second step is due to Lemma 1.

Finally, let F_i be the σ-algebra generated by all the random variables before episode k^i. Then, we can see that the partial sums ∑_{i=1}^τ α_t^i ( [ D_{μ_h^{k^i} × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{k^i} ) ](s) − r_h(s, a_h^{k^i}, b_h^{k^i}) − V_{h+1}^{k^i}(s_{h+1}^{k^i}) ), for τ = 1, …, t, form a martingale with respect to {F_i}_{i ≥ 1}. From the Azuma-Hoeffding inequality, it holds with probability at least 1 − δ that

    | ∑_{i=1}^t α_t^i ( [ D_{μ_h^{k^i} × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{k^i} ) ](s) − r_h(s, a_h^{k^i}, b_h^{k^i}) − V_{h+1}^{k^i}(s_{h+1}^{k^i}) ) | ≤ Õ( √(H^3 ι / t) ),    (8)

where Õ(·) suppresses logarithmic terms. Finally, combining the results in (5), (6), (7), and (8), and applying a union bound, we obtain that

where the second-to-last step is by the definition of the bonus term β_i (for some large constant c), and Lemma 1. In the last step we used the formulation of V_h^k(s) in (3). This completes the proof of the induction step. ∎

By construction of the auxiliary correlated policies π̂_h^k, we know that for any (s, h, k), the corresponding value function can be written recursively as follows:

    V_h^{π̂^k}(s) = ∑_{i=1}^t α_t^i [ D_{μ_h^{k^i} × ν_h^{k^i}} ( r_h + P_h V_{h+1}^{π̂^{k^i}} ) ](s),

and V_{H+1}^{π̂^k}(s) = 0 for any s, where again notice that we have dropped the dependence on the time step h in the superscripts. The following result shows that, on average, the agents have no incentive to deviate from the correlated policies, up to a regret term of the order O(√(H^6 S max{A, B} ι / K)).

Theorem 1.

For any δ ∈ (0, 1), let ι := log(SABT/δ). With probability at least 1 − δ,

    (1/K) ∑_{k=1}^K ( V_1^{†, ν̂^k}(s_1) − V_1^{π̂^k}(s_1) ) ≤ O( √(H^6 S max{A, B} ι / K) ),  and
    (1/K) ∑_{k=1}^K ( V_1^{μ̂^k, †}(s_1) − V_1^{π̂^k}(s_1) ) ≤ O( √(H^6 S max{A, B} ι / K) ),

where the value functions in the first and second inequalities are measured with respect to the rewards of agents 1 and 2, respectively.

Proof.

We provide a proof for the first bound. The second one can be shown using a similar argument. From Lemma 2, it holds that

    ∑_{k=1}^K ( V_1^{†, ν̂^k}(s_1) − V_1^{π̂^k}(s_1) ) ≤ ∑_{k=1}^K ( V_1^k(s_1) − V_1^{π̂^k}(s_1) ),

and so we only need to find an upper bound for the RHS. Define δ_h^k := V_h^k(s_h^k) − V_h^{π̂^k}(s_h^k). The main idea of the proof is similar to optimistic Q-learning in the single-agent setting (jin2018q): We seek to upper bound ∑_{k=1}^K δ_h^k by the next-step quantity ∑_{k=1}^K δ_{h+1}^k, and then obtain a recursive formula over h.

Let F_k be the σ-algebra generated by all the random variables before the k-th episode. It can be seen that the cumulative sampling error of the next-state values forms a martingale with respect to the filtration {F_k}_{k ≥ 1}. From the Azuma-Hoeffding inequality, it holds with high probability that

for some constant c. Recall from the update rule (3) that

    V_h^k(s_h^k) = α_t^0 H + ∑_{i=1}^t α_t^i [ r_h(s_h^k, a_h^{k^i}, b_h^{k^i}) + V_{h+1}^{k^i}(s_{h+1}^{k^i}) + β_i ].

Taking the difference of the two leads to

for some constant c that could take different values from line to line, where the last step is due to Lemma 1. Summing over k ∈ [K], notice that

because there are at most SH pairs of (s, h) to be visited. Further,