
Off-Policy Exploitability-Evaluation and Equilibrium-Learning in Two-Player Zero-Sum Markov Games
Off-policy evaluation (OPE) is the problem of evaluating new policies us...

On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis
Inspired by the success of AlphaGo Zero (AGZ) which utilizes Monte Carlo...

A Theoretical Analysis of Deep Q-Learning
Despite the great empirical success of deep reinforcement learning, its ...

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
Finding approximate Nash equilibria in zero-sum imperfect-information ga...

On Bellman's Optimality Principle for zsPOSGs
Many nontrivial sequential decision-making problems are efficiently sol...

On Tightness of the Tsaknakis-Spirakis Algorithm for Approximate Nash Equilibrium
Finding the minimum approximation ratio for Nash equilibrium of bimatrix ...

Analysis of Hannan Consistent Selection for Monte Carlo Tree Search in Simultaneous Move Games
Hannan consistency, or no external regret, is a key concept for learning...
On Reinforcement Learning for Turn-based Zero-sum Markov Games
We consider the problem of finding Nash equilibrium for two-player turn-based zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm, we develop a Reinforcement Learning-based approach. Specifically, we propose the Explore-Improve-Supervise (EIS) method, which combines "exploration", "policy improvement" and "supervised learning" to find the value function and policy associated with Nash equilibrium. We identify sufficient conditions for convergence and correctness of such an approach. For a concrete instance of EIS where a random policy is used for "exploration", Monte Carlo Tree Search is used for "policy improvement" and Nearest Neighbors is used for "supervised learning", we establish that this method finds an ε-approximate value function of Nash equilibrium in O(ε^(-(d+4))) steps when the underlying state space of the game is continuous and d-dimensional. This is nearly optimal, as we establish a lower bound of Ω(ε^(-(d+2))) for any policy.
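The Explore-Improve-Supervise loop described above can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: the game dynamics, reward, and all function names below are invented for the example, a one-step minimax lookahead stands in for Monte Carlo Tree Search, a plain 1-nearest-neighbor fit stands in for the supervised-learning step, and for brevity only the max-player's value is trained.

```python
import random

# Hypothetical turn-based zero-sum toy game on the state space [0, 1].
# Player 0 (max) prefers large states, player 1 (min) prefers small ones.
def step(s, a):
    s2 = min(max(s + (0.1 if a == 1 else -0.1), 0.0), 1.0)
    return s2, s2 - 0.5  # (next state, reward to the max player)

GAMMA = 0.9
ACTIONS = [0, 1]

def nn_value(s, data):
    """'Supervised learning' step: 1-nearest-neighbor value regression."""
    if not data:
        return 0.0
    _, v_near = min(data, key=lambda sv: abs(sv[0] - s))
    return v_near

def improve(s, player, data):
    """'Policy improvement' step: one-step minimax lookahead (a stand-in
    for the MCTS improvement used in the paper's concrete instance)."""
    vals = [r + GAMMA * nn_value(s2, data)
            for s2, r in (step(s, a) for a in ACTIONS)]
    return max(vals) if player == 0 else min(vals)

def eis(iterations=20, samples=50, seed=0):
    rng = random.Random(seed)
    data = []  # (state, value) training pairs for the current value estimate
    for _ in range(iterations):
        states = [rng.random() for _ in range(samples)]        # exploration
        targets = [(s, improve(s, 0, data)) for s in states]   # improvement
        data = targets                                         # supervision
    return data
```

With reward magnitudes bounded by 0.5, the learned values stay within the discounted bound 0.5 / (1 - GAMMA) = 5; the sketch is only meant to show how the three phases hand data to one another each round.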