# Learning Zero-sum Stochastic Games with Posterior Sampling

In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $\tilde{O}(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent cannot be controlled and can follow any time-adaptive, history-dependent strategy. This improves the best existing regret bound of $\tilde{O}(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $A$ and $T$.

## Introduction

Recent advances in playing the game of Go Silver et al. (2017) and Starcraft Vinyals et al. (2019) have demonstrated the capability of self-play to achieve super-human performance in competitive reinforcement learning (competitive RL) Crandall and Goodrich (2005), a special case of multi-agent RL where each player tries to maximize its own reward. These self-play algorithms learn by repeatedly playing against themselves and updating their policies based on the observed trajectories, in the absence of human supervision. Despite the empirical success, the theoretical understanding of these algorithms is limited, and their analysis is significantly more challenging than in single-agent RL due to the multi-agent nature of the problem.

Self-play can be considered a special category of offline competitive RL where the learning algorithm controls both the agent and the opponent during the learning process Bai and Jin (2020); Bai et al. (2020). In the more general and more demanding online learning case, the opponent can follow arbitrary history-dependent strategies and the agent has no control over the opponent during the learning process Wei et al. (2017); Xie et al. (2020); Tian et al. (2021).

In this paper, we consider the online learning case, where the agent learns against an arbitrary opponent who can follow a time-variant history-dependent policy and can switch its policy at any time. We consider infinite-horizon two-player zero-sum stochastic games (SGs) with the average-reward criterion. At each time step, both players choose their actions simultaneously upon observing the state of the environment. The reward and the probability distribution of the next state are then determined by the chosen actions and the current state. The players’ payoffs sum to zero, i.e., the reward of one player (agent) is exactly the loss of the other player (opponent). The agent’s goal is to maximize its cumulative reward while the opponent tries to minimize its total loss.

We propose the Posterior Sampling Reinforcement Learning algorithm for Zero-sum Stochastic Games (PSRL-ZSG), a learning algorithm that achieves a Bayesian regret bound of $\tilde{O}(HS\sqrt{AT})$. Here $H$ is an upper bound on the bias-span, $S$ is the number of states, $A$ is the number of possible action pairs for both players, $T$ is the horizon, and $\tilde{O}$ hides logarithmic factors. The best existing result in this setting is achieved by the UCSG algorithm Wei et al. (2017), which obtains a regret bound of $\tilde{O}(\sqrt[3]{DS^2AT^2})$ where $D$ is the diameter of the SG. As stochastic games generalize Markov Decision Processes (MDPs), our regret bound is optimal (except for logarithmic factors) in $A$ and $T$ due to the lower bound provided by Jaksch et al. (2010).

### Related Literature

SG was first formulated by Shapley (1953). A large body of work focuses on finding the Nash equilibria in SGs with known transition kernel Littman (2001); Hu and Wellman (2003); Hansen et al. (2013), or learning with a generative model Jia et al. (2019); Sidford et al. (2020); Zhang et al. (2020) to simulate the transition for an arbitrary state-action pair. In these cases no exploration is needed.

There is a long line of research on exploration and regret analysis in single-agent RL (see e.g. Jaksch et al. (2010); Osband et al. (2013); Gopalan and Mannor (2015); Azar et al. (2017); Ouyang et al. (2017); Jin et al. (2018); Zhang and Ji (2019); Zanette and Brunskill (2019); Wei et al. (2020, 2021); Chen et al. (2021a); Jafarnia-Jahromi et al. (2021b, a) and references therein). Extending these results to SGs is non-trivial since the actions of the opponent also affect the state transition and cannot be controlled by the agent. We review the literature on exploration in SGs and refer the interested reader to Zhang et al. (2021); Yang and Wang (2020) for an extensive literature review on multi-agent RL in various settings.

#### Stochastic Games.

A few recent works use self-play as a method to learn stochastic games Bai and Jin (2020); Bai et al. (2020); Liu et al. (2021); Chen et al. (2021b). However, self-play requires controlling both the agent and the opponent and cannot be applied in the online setting where the agent plays against an arbitrary opponent. All of these works consider the setting of finite-horizon SG where the interaction of the players and the environment terminates after a fixed number of steps.

In the online setting where the opponent is arbitrary, Xie et al. (2020); Jin et al. (2021) achieve a regret bound of $\tilde{O}(\sqrt{T})$ in finite-horizon SGs with linear and general function approximation, respectively. However, in applications where the interaction between the players and the environment never stops (e.g., stock trading), the infinite-horizon SG is more suitable. The lack of a fixed horizon makes the problem more challenging, because backward induction, a technique widely used in the finite-horizon setting, is not applicable in the infinite-horizon setting.

In the infinite-horizon setting, the early work of Brafman and Tennenholtz (2002), which proposes R-max, does not consider regret. A special case of online learning in general-sum games is studied by DiGiovanni and Tewari (2021), where the opponent is allowed to switch its stationary policy a limited number of times. They obtain, via posterior sampling, a regret bound that depends on the number of switches. Their result is not directly comparable to ours because their definition of regret is different. Moreover, they assume the transition kernel is known and the opponent adopts stationary policies. To the best of our knowledge, the only existing algorithm that considers online learning against an arbitrary opponent in the infinite-horizon average-reward SG is UCSG Wei et al. (2017).

#### Comparison with UCSG Wei et al. (2017).

Our work is closely related to UCSG; however, there are clear distinctions in the result, the algorithm, and the technical contribution:

• UCSG achieves a regret bound of $\tilde{O}(\sqrt[3]{DS^2AT^2})$ under the finite-diameter assumption (i.e., for any two states and every stationary randomized policy of the opponent, there exists a stationary randomized policy for the agent to move from one state to the other in finite expected time). Under the much stronger ergodicity assumption (i.e., for any two states and every stationary randomized policy of the agent and the opponent, it is possible to move from one state to the other in finite expected time), UCSG obtains a regret bound that grows as $\tilde{O}(\sqrt{T})$ in the horizon. Note that the ergodicity assumption greatly alleviates the challenge of exploration. Our algorithm significantly improves on the finite-diameter result and achieves a regret bound of $\tilde{O}(HS\sqrt{AT})$ under the finite-diameter assumption.

• UCSG is an optimism-based algorithm inspired by Jaksch et al. (2010) and requires the complicated maximin extended value iteration. Our algorithm, however, is the first posterior sampling-based algorithm in SGs, leveraging the ideas of Ouyang et al. (2017) in MDPs, and is much simpler both in the algorithm and the analysis.

• From the analysis perspective, under the finite-diameter assumption, UCSG uses a sequence of finite-horizon SGs to approximate the average-reward SG, which leads to the sub-optimal regret bound of $\tilde{O}(\sqrt[3]{DS^2AT^2})$. Our analysis avoids the finite-horizon approximation by directly using the Bellman equation of the infinite-horizon SG and achieves a near-optimal regret bound.

## Preliminaries

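Let $M = (\mathcal{S}, \mathcal{A}, r, \theta)$ be a stochastic zero-sum game where $\mathcal{S}$ is the state space, $\mathcal{A} = \mathcal{A}^1 \times \mathcal{A}^2$ is the joint action space, $r: \mathcal{S} \times \mathcal{A} \to [-1, 0]$ is the reward function and $\theta$ represents the transition kernel such that $\theta(s'|s, a^1, a^2) = \mathbb{P}(s_{t+1} = s' \,|\, s_t = s, a^1_t = a^1, a^2_t = a^2)$ where $s_t \in \mathcal{S}$, $a^1_t \in \mathcal{A}^1$, $a^2_t \in \mathcal{A}^2$ are the state, the agent's action and the opponent’s action at time $t = 1, 2, 3, \dots$, respectively. We assume that $\mathcal{S}, \mathcal{A}$ are finite sets with sizes $S, A$.

The game starts at some initial state $s_1$. At time $t$, the players observe state $s_t$ and take actions $a^1_t, a^2_t$. The agent (maximizer) receives reward $r(s_t, a^1_t, a^2_t)$ from the opponent (minimizer). Then, the state evolves to $s_{t+1}$ according to the probability distribution $\theta(\cdot|s_t, a^1_t, a^2_t)$. The goal of the agent is to maximize its cumulative reward while the opponent tries to minimize it. For ease of notation, we denote $a = (a^1, a^2)$ and $a_t = (a^1_t, a^2_t)$; accordingly, $r(s, a^1, a^2)$ and $\theta(s'|s, a^1, a^2)$ will be denoted by $r(s, a)$ and $\theta(s'|s, a)$, respectively.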

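The players’ actions are assumed to depend on the history. Namely, denote by $\pi^1_t$ (resp. $\pi^2_t$) the mappings from the history $h_t = (s_1, a_1, \dots, s_{t-1}, a_{t-1}, s_t)$ to the probability distributions over $\mathcal{A}^1$ (resp. $\mathcal{A}^2$). Let $\pi^1 = (\pi^1_1, \pi^1_2, \dots)$ (resp. $\pi^2 = (\pi^2_1, \pi^2_2, \dots)$) be the sequence of history-dependent randomized policies whose class is denoted by $\Pi^{HR}$. In the case that $\pi^1_t$ (resp. $\pi^2_t$) is independent of time (stationary randomized policies), we remove the subscript $t$ and with abuse of notation denote $\pi^1 = (\pi^1, \pi^1, \dots)$ (resp. $\pi^2 = (\pi^2, \pi^2, \dots)$). The class of stationary randomized policies is denoted by $\Pi^{SR}$.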

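For ease of presentation, we introduce some notation. Let $A^1 = |\mathcal{A}^1|$, $A^2 = |\mathcal{A}^2|$ denote the sizes of the action spaces. For an integer $n \ge 1$, denote by $\Delta_n$ the probability simplex of dimension $n$. Let $q^1 \in \Delta_{A^1}$ and $q^2 \in \Delta_{A^2}$. With abuse of notation, let $r(s, q^1, q^2) := \mathbb{E}_{a^1 \sim q^1, a^2 \sim q^2}[r(s, a^1, a^2)]$ and $\theta(s'|s, q^1, q^2) := \mathbb{E}_{a^1 \sim q^1, a^2 \sim q^2}[\theta(s'|s, a^1, a^2)]$.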

To obtain a low-regret algorithm, it is necessary to assume that all states are accessible by the agent under some policy. In the special case of MDPs, this is captured by the notion of “weakly communicating” (or “finite diameter” Jaksch et al. (2010)) and is known to be the minimal assumption to achieve sub-linear regret Bartlett and Tewari (2009). The following assumption generalizes this notion to stochastic games.

###### Assumption 1.

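(Finite Diameter) There exists $D \ge 0$ such that for any stationary randomized policy $\pi^2 \in \Pi^{SR}$ of the opponent and any $s, s' \in \mathcal{S}$, there exists a stationary randomized policy $\pi^1 \in \Pi^{SR}$ of the agent, such that the expected time of reaching $s'$ starting from $s$ under policy $\pi = (\pi^1, \pi^2)$ does not exceed $D$, i.e.,

$$
\max_{s, s'} \; \max_{\pi^2 \in \Pi^{SR}} \; \min_{\pi^1 \in \Pi^{SR}} T^{\pi}_{s \to s'} \le D,
$$

where $T^{\pi}_{s \to s'}$ is the expected time of reaching $s'$ starting from $s$ under policy $\pi = (\pi^1, \pi^2)$.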
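To make the hitting-time quantity in Assumption 1 concrete: once both players fix stationary policies, the game reduces to a Markov chain, and the expected time to reach a target state solves a linear fixed-point equation. The following is a minimal sketch of that computation for one fixed policy pair. The function name `expected_hitting_times` and the two-state chain are illustrative assumptions, not part of the paper.

```python
def expected_hitting_times(P, target, iters=2000):
    """Expected number of steps to first reach `target` from each state.

    P[s][s2] is the transition probability of the Markov chain induced by
    one fixed pair of stationary policies (an assumption of this sketch).
    Iterates h(s) = 1 + sum_{s2 != target} P[s][s2] * h(s2), h(target) = 0,
    which converges when the target is reachable from every state.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        new_h = []
        for s in range(n):
            if s == target:
                new_h.append(0.0)
            else:
                new_h.append(
                    1.0 + sum(P[s][s2] * h[s2] for s2 in range(n) if s2 != target)
                )
        h = new_h
    return h


# Toy two-state chain (hypothetical numbers).
P = [[0.5, 0.5],
     [0.2, 0.8]]
times_to_1 = expected_hitting_times(P, target=1)  # from state 0: 1 / 0.5 = 2
times_to_0 = expected_hitting_times(P, target=0)  # from state 1: 1 / 0.2 = 5
```

The diameter in Assumption 1 additionally takes a minimum over the agent's policies and a maximum over the opponent's policies and over state pairs; the sketch only evaluates the innermost quantity.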

This assumption was first introduced by Federgruen (1978) and is essential to achieve low regret algorithms in the adversarial setting Wei et al. (2017). To see this, suppose that the opponent has a way to lock the agent in a “bad” state. In the initial stages of the game where the agent has limited knowledge of the environment, it may not be possible to avoid such a state and linear regret is unavoidable. Thus, this assumption states that regardless of the strategy used by the opponent, the agent has a way to recover from bad states.

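For a general matrix game with matrix $G$ of size $m \times n$, the game value is denoted by $\operatorname{val} G = \max_{p \in \Delta_m} \min_{q \in \Delta_n} p^\top G q$. Moreover, the Nash equilibrium always exists Nash and others (1950). For SGs, under Assumption 1, Federgruen (1978); Wei et al. (2017) prove that there exist a unique $J(\theta)$ and a unique (up to an additive constant) function $v(\cdot, \theta)$ that satisfy the Bellman equation, i.e., for all $s \in \mathcal{S}$,

$$
J(\theta) + v(s, \theta) = \operatorname{val}\Big\{ r(s, \cdot, \cdot) + \sum_{s'} \theta(s'|s, \cdot, \cdot)\, v(s', \theta) \Big\}. \tag{1}
$$

In particular, the Nash equilibrium of the right hand side for each $s$ yields maximin stationary policies $\pi_* = (\pi^1_*, \pi^2_*)$ such that

$$
J(\theta) + v(s, \theta) = \max_{q^1 \in \Delta_{A^1}} \Big\{ r(s, q^1, \pi^2_*(\cdot|s)) + \sum_{s'} \theta(s'|s, q^1, \pi^2_*(\cdot|s))\, v(s', \theta) \Big\}, \tag{2}
$$

$$
J(\theta) + v(s, \theta) = \min_{q^2 \in \Delta_{A^2}} \Big\{ r(s, \pi^1_*(\cdot|s), q^2) + \sum_{s'} \theta(s'|s, \pi^1_*(\cdot|s), q^2)\, v(s', \theta) \Big\}. \tag{3}
$$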

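Moreover, $J(\theta)$ is the maximin average reward obtained by the agent and is independent of the initial state $s$, i.e.,

$$
J(\theta) = \sup_{\pi^1 \in \Pi^{HR}} \; \inf_{\pi^2 \in \Pi^{HR}} \; \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} r(s_t, a_t) \,\Big|\, s_1 = s \Big],
$$

where $a_t = (a^1_t, a^2_t)$ and $a^1_t \sim \pi^1_t(\cdot|h_t)$ and $a^2_t \sim \pi^2_t(\cdot|h_t)$. Note that $J(\theta) \in [-1, 0]$ because the range of the reward function is $[-1, 0]$. Define the span of the stochastic game with transition kernel $\theta$ as the span of the corresponding value function $v(\cdot, \theta)$, i.e., $sp(\theta) := \max_s v(s, \theta) - \min_s v(s, \theta)$. We restrict our attention to stochastic games whose transition kernel $\theta$ satisfies Assumption 1 and $sp(\theta) \le H$, where $H$ is a known scalar. Let $\Omega_*$ denote the set of all such $\theta$. Moreover, observe that if $v$ satisfies the Bellman equation, $v + c$ also satisfies the Bellman equation for any scalar $c$. Thus, without loss of generality, we can assume that $0 \le v(s, \theta) \le H$ for all $s \in \mathcal{S}$ and $\theta \in \Omega_*$.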
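As a small illustration of the matrix-game value that appears on the right hand side of the Bellman equation, the sketch below computes the value and the row player's equilibrium mixture for a 2x2 zero-sum game in closed form. This covers only the 2x2 case; larger stage games are usually solved with a linear program. The function name `value_2x2` and the example payoffs are assumptions made for illustration.

```python
def value_2x2(G):
    """Value of a 2x2 zero-sum matrix game for the row player (maximizer).

    G[i][j] is the payoff to the row player when row i meets column j.
    Returns (value, p), where p is the probability placed on row 0.
    """
    (a, b), (c, d) = G
    # Pure-strategy saddle point: maximin equals minimax.
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        p = 1.0 if min(a, b) >= min(c, d) else 0.0
        return float(maximin), p
    # No saddle point: the equilibrium is fully mixed, and the row player
    # makes the column player indifferent between the two columns.
    denom = a - b - c + d
    p = (d - c) / denom
    value = (a * d - b * c) / denom
    return value, p


# Payoffs in [-1, 0], matching the reward range used in this section.
mixed_value, mixed_p = value_2x2([[0.0, -1.0], [-1.0, 0.0]])
saddle_value, saddle_p = value_2x2([[-0.2, -0.5], [-0.9, -0.7]])
```

In the stochastic game, the matrix fed to such a routine at state $s$ would have entries $r(s, a^1, a^2) + \sum_{s'} \theta(s'|s, a^1, a^2) v(s', \theta)$, and the maximin policy at $s$ is the resulting row mixture.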

We consider the problem of an agent playing a stochastic game against an opponent who can follow time-adaptive policies. We assume that the opponent knows the history of states and actions and can play time-adaptive history-dependent policies. Recall that the set of such policies is denoted by $\Pi^{HR}$. $\mathcal{S}, \mathcal{A}$ and $r$ are completely known to the agent. However, the transition kernel $\theta_*$ is unknown. In the beginning of the game, $\theta_*$ is drawn from an initial distribution $\mu_1$ and is then fixed. We assume that the support of $\mu_1$ is a subset of $\Omega_*$, the set of transition kernels that satisfy Assumption 1 and have span at most $H$. The performance of the agent is then measured with the notion of regret defined as

$$
R_T := \sup_{\pi^2 \in \Pi^{HR}} \mathbb{E}\Big[ \sum_{t=1}^{T} \big( J(\theta_*) - r(s_t, a_t) \big) \Big], \tag{4}
$$

where $a_t = (a^1_t, a^2_t)$ with $a^1_t \sim \pi^1_t(\cdot|h_t)$ and $a^2_t \sim \pi^2_t(\cdot|h_t)$. Here the expectation is with respect to the prior distribution $\mu_1$, the randomized algorithm and the randomness in the state transition. Note that the regret guarantee is against an arbitrary opponent who can change its policy at each time step and has perfect knowledge of the history of states and actions. The only information hidden from the opponent is the realization of the agent’s current action (which is revealed after both players have chosen their actions). We note that self-play and the case when the opponent uses the same learning algorithm as the agent are two special cases of the scenario considered here.

## Posterior Sampling for Stochastic Games

In this section, we propose the Posterior Sampling algorithm for Zero-sum SGs (PSRL-ZSG). The agent maintains the posterior distribution $\mu_t$ on the parameter $\theta_*$. More precisely, the learning algorithm receives an initial distribution $\mu_1$ as the input and updates the posterior distribution upon observing the new state according to

$$
\mu_{t+1}(d\theta) \propto \theta(s_{t+1}|s_t, a_t)\, \mu_t(d\theta). \tag{5}
$$
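For intuition about the update (5), consider the conjugate case in which the prior is a product of independent Dirichlet distributions over the rows $\theta(\cdot|s,a)$. This choice is an assumption of the sketch below, not something the paper requires, and such a prior does not by itself satisfy the support assumption stated earlier. Under it, multiplying the density by $\theta(s_{t+1}|s_t,a_t)$ amounts to incrementing one count. The class name `DirichletPosterior` is hypothetical.

```python
import random


class DirichletPosterior:
    """Independent Dirichlet posterior over each row theta(.|s, a)."""

    def __init__(self, n_states, n_joint_actions, prior_count=1.0):
        # alpha[s][a][s2]: Dirichlet parameter for the transition (s, a) -> s2.
        self.alpha = [
            [[prior_count] * n_states for _ in range(n_joint_actions)]
            for _ in range(n_states)
        ]

    def update(self, s, a, s_next):
        # Conjugacy: weighting the density by theta(s_next|s, a), as in (5),
        # adds one to the matching Dirichlet parameter.
        self.alpha[s][a][s_next] += 1.0

    def sample(self, rng=random):
        # Normalised Gamma variates give a Dirichlet sample for each row.
        theta = []
        for rows_for_state in self.alpha:
            rows = []
            for alpha_row in rows_for_state:
                g = [rng.gammavariate(al, 1.0) for al in alpha_row]
                total = sum(g)
                rows.append([x / total for x in g])
            theta.append(rows)
        return theta


posterior = DirichletPosterior(n_states=2, n_joint_actions=2)
posterior.update(s=0, a=1, s_next=1)               # observe one transition
theta_sample = posterior.sample(random.Random(0))  # one sampled kernel
```

Each call to `sample` returns a full transition kernel indexed as `theta_sample[s][a][s_next]`, which is the kind of object the agent would draw at the start of an episode.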

PSRL-ZSG proceeds in episodes. Let $t_k, T_k$ denote the start time and the length of episode $k$, respectively. In the beginning of each episode, the agent draws a sample $\theta_k$ of the transition kernel from the posterior distribution $\mu_{t_k}$. The maximin strategy $\pi^1_k$ is then derived for the sampled transition kernel according to (1) and used by the agent during the episode. Let $N_t(s, a)$ be the number of visits to state-action pair $(s, a)$ before time $t$, i.e.,

$$
N_t(s, a) = \sum_{\tau=1}^{t-1} \mathbb{1}(s_\tau = s, a_\tau = a).
$$

As described in Algorithm 1, a new episode starts if $t > t_k + T_{k-1}$ or $N_t(s, a) > 2N_{t_k}(s, a)$ for some $(s, a)$. The first criterion, $t > t_k + T_{k-1}$, states that the length of the episode grows at most by 1 if the other criterion is not triggered. This ensures that $T_k \le T_{k-1} + 1$ for all $k$. The second criterion is triggered if the number of visits to a state-action pair is doubled. These stopping criteria balance the trade-off between exploration and exploitation. In the beginning of the game, the episodes are short to encourage exploration, since the agent is uncertain about the underlying environment. As the game proceeds, the episodes grow longer to exploit the information gathered about the environment. These stopping criteria are the same as those used in MDPs Ouyang et al. (2017).
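The episode schedule alone can be sketched as follows: a new episode starts at time $t$ when $t > t_k + T_{k-1}$ or when some visit count has more than doubled since the episode began, $N_t(s,a) > 2N_{t_k}(s,a)$. The sketch covers only the schedule (no posterior sampling or equilibrium computation). The function name `episode_start_times` and the initialisation $t_0 = 0$, which makes $T_0 = 1$, are assumptions of this illustration.

```python
def episode_start_times(visits, n_states, n_actions):
    """Return the start times t_1, t_2, ... of the episodes.

    `visits` is the realised sequence [(s_1, a_1), ..., (s_T, a_T)] with
    0-based state and joint-action indices; time runs from t = 1 to T.
    """
    counts = [[0] * n_actions for _ in range(n_states)]  # N_t(s, a)
    counts_at_start = [row[:] for row in counts]          # N_{t_k}(s, a)
    starts = []
    t_k, prev_len = 0, 0  # assumed initialisation; yields T_0 = 1 at t = 1
    for t, (s, a) in enumerate(visits, start=1):
        doubled = any(
            counts[x][y] > 2 * counts_at_start[x][y]
            for x in range(n_states)
            for y in range(n_actions)
        )
        if t > t_k + prev_len or doubled:
            prev_len = t - t_k  # length of the episode that just ended
            t_k = t
            counts_at_start = [row[:] for row in counts]
            starts.append(t)
        counts[s][a] += 1  # (s_t, a_t) counts towards N_{t+1}
    return starts


# A single state-action pair visited 12 times (toy trajectory).
starts = episode_start_times([(0, 0)] * 12, n_states=1, n_actions=1)
lengths = [b - a for a, b in zip(starts, starts[1:])]
```

On this toy trajectory the completed episode lengths grow by at most one from one episode to the next, which is the property $T_k \le T_{k-1} + 1$ used in the analysis.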

Algorithm 1 achieves a regret bound of $\tilde{O}(HS\sqrt{AT})$. This result improves upon the previous best known result of the UCSG algorithm, which achieves $\tilde{O}(\sqrt[3]{DS^2AT^2})$ under the same assumption Wei et al. (2017).

###### Theorem 1.

Under Assumption 1, Algorithm 1 achieves a regret bound of

$$
R_T \le (H+1)\sqrt{2SAT\log T} + H + H\big(SA + 2\sqrt{SAT}\big)\sqrt{224\,S\log(2AT)}. \tag{6}
$$
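To read off the rate from (6), it may help to expand the last term; the following is a short calculation with constants left unoptimised:

$$
\begin{aligned}
H\big(SA + 2\sqrt{SAT}\big)\sqrt{224\,S\log(2AT)}
&= H S^{3/2} A \sqrt{224\log(2AT)} + 2HS\sqrt{224\,AT\log(2AT)}.
\end{aligned}
$$

The second summand is of order $HS\sqrt{AT\log(AT)}$. The first summand is no larger than half of it whenever $T \ge SA$, since $S^{3/2}A \le S\sqrt{AT}$ is equivalent to $SA \le T$. The remaining terms of (6), $(H+1)\sqrt{2SAT\log T} + H$, are of lower order in $S$. Hence, for $T \ge SA$, the bound is $O\big(HS\sqrt{AT\log(AT)}\big)$, i.e., $\tilde{O}(HS\sqrt{AT})$.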

## Analysis

In this section, we provide the proof of Theorem 1. A central observation in our analysis is that in the beginning of each episode, $\theta_*$ and $\theta_k$ are identically distributed conditioned on the history. This key property of posterior sampling relates quantities that depend on the unknown $\theta_*$ to those of the sampled $\theta_k$, which is fully observed by the agent. Posterior sampling ensures that if $t_k$ is a stopping time, then for any measurable function $f$ and any $h_{t_k}$-measurable random variable $X$, $\mathbb{E}[f(\theta_*, X) \,|\, h_{t_k}] = \mathbb{E}[f(\theta_k, X) \,|\, h_{t_k}]$ Ouyang et al. (2017); Osband et al. (2013).

The key challenge in the analysis of stochastic games is that the opponent is also making decisions. If the opponent follows a fixed stationary policy, it can be considered as part of the environment and thus the SG reduces to an MDP. However, in the case that the opponent uses a dynamic history-dependent policy during the learning phase of the agent, this reduction is not possible. The key lemma in our analysis is Lemma 3 which overcomes this difficulty through the Bellman equation for the SG.

### Proof of Theorem 1

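Let $K_T$ be the number of episodes until time $T$ and define $t_{K_T+1} := T + 1$. Recall that $R_T = \sup_{\pi^2 \in \Pi^{HR}} R_T(\pi^2)$ where

$$
R_T(\pi^2) = \mathbb{E}\Big[ T J(\theta_*) - \sum_{t=1}^{T} r(s_t, a_t) \Big]. \tag{7}
$$

Let $\pi^2 \in \Pi^{HR}$ be an arbitrary history-dependent randomized strategy followed by the opponent. We start by decomposing the regret into two terms

$$
\begin{aligned}
R_T(\pi^2) &= \mathbb{E}\Big[ T J(\theta_*) - \sum_{t=1}^{T} r(s_t, a_t) \Big] \\
&= \mathbb{E}\Big[ T J(\theta_*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] + \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big].
\end{aligned} \tag{8}
$$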

Lemma 2 uses the property of posterior sampling to bound the first term. The second term is handled by combining the Bellman equation, concentration inequalities and the property of posterior sampling as detailed in Lemma 3. Finally, Lemma 4 bounds the number of episodes and completes the proof.

###### Lemma 2.

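The first term of (8) can be bounded by

$$
\mathbb{E}\Big[ T J(\theta_*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] \le \mathbb{E}[K_T].
$$

###### Proof.

$$
\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \ge \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)\,(T_{k-1} + 1)\, J(\theta_k), \tag{9}
$$

where the inequality is by the fact that $J(\theta_k) \le 0$ and $T_k \le T_{k-1} + 1$ due to the first stopping criterion. Now, note that $t_k$ is a stopping time and $\mathbb{1}(t_k \le T)$ and $T_{k-1}$ are $h_{t_k}$-measurable random variables. Thus, by the property of posterior sampling and the monotone convergence theorem,

$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)(T_{k-1}+1) J(\theta_k) \,\Big|\, h_{t_k} \Big]
&= \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(t_k \le T)(T_{k-1}+1) J(\theta_k) \,\big|\, h_{t_k} \big] \\
&= \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(t_k \le T)(T_{k-1}+1) J(\theta_*) \,\big|\, h_{t_k} \big] \\
&= \mathbb{E}\Big[ \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)(T_{k-1}+1) J(\theta_*) \,\Big|\, h_{t_k} \Big] \\
&= \mathbb{E}\Big[ \sum_{k=1}^{K_T} (T_{k-1}+1) J(\theta_*) \,\Big|\, h_{t_k} \Big].
\end{aligned}
$$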

Taking another expectation from both sides and using the tower property, we have

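$$
\mathbb{E}\Big[ \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)(T_{k-1}+1) J(\theta_k) \Big] = \mathbb{E}\Big[ \sum_{k=1}^{K_T} (T_{k-1}+1) J(\theta_*) \Big].
$$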

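Replacing this in (9) implies that

$$
\mathbb{E}\Big[ T J(\theta_*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] \le \mathbb{E}\Big[ \Big( T - \sum_{k=1}^{K_T} T_{k-1} \Big) J(\theta_*) \Big] - \mathbb{E}[K_T J(\theta_*)] \le \mathbb{E}[K_T].
$$

The last inequality is by the fact that $T - \sum_{k=1}^{K_T} T_{k-1} \ge 0$ and $-1 \le J(\theta_*) \le 0$. ∎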

###### Lemma 3.

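The second term of (8) can be bounded by

$$
\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big] \le H\,\mathbb{E}[K_T] + H + \sqrt{224\,S\log(2AT)}\,\big( HSA + 2H\sqrt{SAT} \big).
$$

###### Proof.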

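The policy $\pi^1_k$ used by the agent at episode $k$ is the solution of the Nash equilibrium in (1). Thus, for $\theta = \theta_k$ and any $s \in \mathcal{S}$, (3) implies that

$$
J(\theta_k) + v(s, \theta_k) \le r(s, \pi^1_k(\cdot|s), q^2) + \sum_{s'} \theta_k(s'|s, \pi^1_k(\cdot|s), q^2)\, v(s', \theta_k),
$$

for any distribution $q^2 \in \Delta_{A^2}$. Let $\pi^2 \in \Pi^{HR}$ be an arbitrary history-dependent randomized strategy for the opponent. Note that for any $t$, $\pi^2_t(\cdot|h_t)$ is $h_t$-measurable. Replacing $s$ by $s_t$ and $q^2$ by $\pi^2_t(\cdot|h_t)$ implies that

$$
J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \le \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_t, \theta_k).
$$

Adding and subtracting $v(s_{t+1}, \theta_k)$ on the right hand side and summing over the time steps within episode $k$ implies that

$$
\begin{aligned}
\sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \big)
&\le \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \\
&\quad + \sum_{t=t_k}^{t_{k+1}-1} \big( v(s_{t+1}, \theta_k) - v(s_t, \theta_k) \big).
\end{aligned} \tag{10}
$$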

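The second term on the right hand side of (10) telescopes and can be bounded as

$$
\sum_{t=t_k}^{t_{k+1}-1} \big( v(s_{t+1}, \theta_k) - v(s_t, \theta_k) \big) = v(s_{t_{k+1}}, \theta_k) - v(s_{t_k}, \theta_k) \le H, \tag{11}
$$

where the last inequality is by the fact that $\theta_k$ is chosen from the posterior distribution, whose support contains only transition kernels with span at most $H$. Substituting (11) in (10), summing over episodes, and taking expectation implies that

$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big]
&= \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \big) \Big] \\
&\le H\,\mathbb{E}[K_T] + \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big].
\end{aligned}
$$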

We proceed to bound the last term on the right hand side of the above inequality.

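$$
\begin{aligned}
&\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big] \\
&\quad = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, a^1_t, a^2_t)\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big] \\
&\quad = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big[ \theta_k(s'|s_t, a_t) - \theta_*(s'|s_t, a_t) \big]\, v(s', \theta_k) \Big] \\
&\quad \le H\,\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big| \theta_k(s'|s_t, a_t) - \theta_*(s'|s_t, a_t) \big| \Big].
\end{aligned} \tag{12}
$$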

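To bound the inner summation, similar to Ouyang et al. (2017); Jaksch et al. (2010), we define a confidence set around the empirical transition kernel $\hat{\theta}_k(s'|s, a) := \frac{N_{t_k}(s, a, s')}{\max\{1, N_{t_k}(s, a)\}}$. Here $N_{t_k}(s, a, s')$ is the number of visits to state-action pair $(s, a)$ whose next state is $s'$. The confidence set $C_k$ is defined as

$$
C_k := \Big\{ \theta : \sum_{s'} \big| \theta(s'|s, a) - \hat{\theta}_k(s'|s, a) \big| \le b_k(s, a) \;\; \forall (s, a) \Big\},
$$

where $b_k(s, a) := \sqrt{\frac{14\,S\log(2At_kT)}{\max\{1, N_{t_k}(s, a)\}}}$. Weissman et al. (2003) shows that the true transition kernel $\theta_*$ belongs to $C_k$ with high probability. We use this fact to show concentration of $\theta_*$ around $\hat{\theta}_k$. Concentration of $\theta_k$ around $\hat{\theta}_k$ then follows from the property of posterior sampling. More precisely, we can write

$$
\begin{aligned}
\sum_{s'} \big| \theta_k(s'|s_t, a_t) - \theta_*(s'|s_t, a_t) \big|
&\le \sum_{s'} \big| \theta_k(s'|s_t, a_t) - \hat{\theta}_k(s'|s_t, a_t) \big| + \sum_{s'} \big| \theta_*(s'|s_t, a_t) - \hat{\theta}_k(s'|s_t, a_t) \big| \\
&\le 2 b_k(s_t, a_t) + 2\big( \mathbb{1}(\theta_k \notin C_k) + \mathbb{1}(\theta_* \notin C_k) \big).
\end{aligned}
$$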
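The empirical kernel and the confidence radius used in this step can be computed directly from visit counts. The sketch below does so for a single state-action pair, using the radius $\sqrt{14 S \log(2 A t_k T) / \max\{1, N_{t_k}(s,a)\}}$ that appears in the bound on $b_k$; the function names and the example counts are illustrative assumptions.

```python
import math


def empirical_row_and_radius(next_counts, t_k, T, n_joint_actions):
    """Empirical transition row and L1 confidence radius for one pair (s, a).

    next_counts[s2] is the number of observed transitions (s, a) -> s2
    before the start time t_k of episode k.
    """
    n_states = len(next_counts)
    n_visits = sum(next_counts)  # N_{t_k}(s, a)
    denom = max(1, n_visits)
    theta_hat = [c / denom for c in next_counts]
    radius = math.sqrt(
        14 * n_states * math.log(2 * n_joint_actions * t_k * T) / denom
    )
    return theta_hat, radius


def l1_distance(p, q):
    """L1 distance between two probability vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(p, q))


# Hypothetical counts: 40 visits to (s, a), 30 of them followed by state 0.
theta_hat, radius = empirical_row_and_radius(
    [30, 10], t_k=41, T=1000, n_joint_actions=4
)
inside = l1_distance([0.7, 0.3], theta_hat) <= radius
```

Since the L1 distance between two probability vectors never exceeds 2, the radius only becomes informative once the visit count is large relative to $S \log(2 A t_k T)$; with the small counts above it is still larger than 2.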

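Substituting the inner sum of (12) with this upper bound implies

$$
\begin{aligned}
H\,\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big| \theta_k(s'|s_t, a_t) - \theta_*(s'|s_t, a_t) \big| \Big]
&\le 2H\,\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} b_k(s_t, a_t) \Big] \\
&\quad + 2H\,\mathbb{E}\Big[ \sum_{k=1}^{K_T} T_k \big\{ \mathbb{1}(\theta_k \notin C_k) + \mathbb{1}(\theta_* \notin C_k) \big\} \Big].
\end{aligned} \tag{13}
$$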

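The first term on the right hand side of (13) can be bounded as

$$
\begin{aligned}
\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} b_k(s_t, a_t)
&= \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{14\,S\log(2At_kT)}{\max\{1, N_{t_k}(s_t, a_t)\}}} \\
&\le \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{28\,S\log(2AT^2)}{\max\{1, N_t(s_t, a_t)\}}} \\
&= \sum_{t=1}^{T} \sqrt{\frac{28\,S\log(2AT^2)}{\max\{1, N_t(s_t, a_t)\}}} \\
&\le \sqrt{56\,S\log(2AT)}\,\big( SA + 2\sqrt{SAT} \big),
\end{aligned} \tag{14}
$$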

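where the first inequality is by the fact that $t_k \le T$ and $N_t(s, a) \le 2N_{t_k}(s, a)$ for all $(s, a)$ within episode $k$, and the second inequality is by $\log(2AT^2) \le 2\log(2AT)$ together with the following argument:

$$
\begin{aligned}
\sum_{t=1}^{T} \sqrt{\frac{1}{\max\{1, N_t(s_t, a_t)\}}}
&= \sum_{t=1}^{T} \sum_{s,a} \frac{\mathbb{1}(s_t = s, a_t = a)}{\sqrt{\max\{1, N_t(s, a)\}}}
= \sum_{s,a} \sum_{t=1}^{T} \frac{\mathbb{1}(s_t = s, a_t = a)}{\sqrt{\max\{1, N_t(s, a)\}}} \\
&\le \sum_{s,a} \Big( 1 + \sum_{j=1}^{N_{T+1}(s,a)-1} \frac{1}{\sqrt{j}} \Big)
\le \sum_{s,a} \Big( 1 + 2\sqrt{N_{T+1}(s, a)} \Big)
= SA + 2\sum_{s,a} \sqrt{N_{T+1}(s, a)} \\
&\le SA + 2\sqrt{SA \sum_{s,a} N_{T+1}(s, a)} = SA + 2\sqrt{SAT},
\end{aligned}
$$

where the last inequality is by Cauchy-Schwarz and the last equality is by the fact that $\sum_{s,a} N_{T+1}(s, a) = T$. To bound the second term on the right hand side of (13), we can write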

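$$
\begin{aligned}
\mathbb{E}\Big[ \sum_{k=1}^{K_T} T_k \big\{ \mathbb{1}(\theta_k \notin C_k) + \mathbb{1}(\theta_* \notin C_k) \big\} \Big]
&\le \mathbb{E}\Big[ \sum_{k=1}^{\infty} T \big\{ \mathbb{1}(\theta_k \notin C_k) + \mathbb{1}(\theta_* \notin C_k) \big\} \Big] \\
&= T \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(\theta_k \notin C_k) + \mathbb{1}(\theta_* \notin C_k) \big] \\
&= 2T \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(\theta_* \notin C_k) \big] = 2T \sum_{k=1}^{\infty} \mathbb{P}(\theta_* \notin C_k),
\end{aligned}
$$

where the second equality is by the property of posterior sampling since $C_k$ is $h_{t_k}$-measurable. Note that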