
On Reinforcement Learning for Turn-based Zero-sum Markov Games
We consider the problem of finding Nash equilibrium for two-player turn...

Stable Reinforcement Learning with Unbounded State Space
We consider the problem of reinforcement learning (RL) with unbounded st...

POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis
Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), ...

On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function
We consider the problem of local planning in fixed-horizon Markov Decisi...

Q-learning with Nearest Neighbors
We consider the problem of model-free reinforcement learning for infinit...

On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning
A simple and natural algorithm for reinforcement learning is Monte Carlo...

Real-Time Optimal Guidance and Control for Interplanetary Transfers Using Deep Networks
We consider the Earth-Venus mass-optimal interplanetary transfer of a lo...
On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis
Inspired by the success of AlphaGo Zero (AGZ), which utilizes Monte Carlo Tree Search (MCTS) with supervised learning via a neural network to learn the optimal policy and value function, in this work we focus on formally establishing that such an approach indeed finds the optimal policy asymptotically, as well as establishing non-asymptotic guarantees along the way. We focus on the infinite-horizon discounted Markov Decision Process to establish the results. To start, we must establish the property claimed for MCTS in the literature: for any given query state, MCTS provides an approximate value function for that state given enough simulation steps of the MDP. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proof suggests that MCTS needs to be utilized with a polynomial, rather than logarithmic, "upper confidence bound" to establish its desired performance; interestingly enough, AGZ chooses exactly such a polynomial bound. Using this as a building block, combined with nearest-neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator: it has a natural "bootstrapping" property to iteratively improve the value function approximation for all states, despite evaluating at only finitely many states, thanks to the combination with supervised learning. In effect, we establish that to learn an ε-approximation of the value function in the ℓ_∞ norm, MCTS combined with nearest-neighbor supervised learning requires a number of samples scaling as O(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal in view of a minimax lower bound of Ω(ε^{-(d+2)}).
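The contrast between a polynomial and the classical logarithmic exploration bonus can be illustrated with a minimal bandit-style arm-selection rule. This is only an illustrative sketch, not the paper's algorithm: the specific bonus form `beta * t**alpha / count` and the parameter values are assumptions chosen to show a polynomially growing exploration term next to UCB1's logarithmic one.

```python
import math

def select_arm_poly_ucb(means, counts, t, beta=1.0, alpha=0.5):
    """Pick the arm maximizing empirical mean plus a polynomial bonus.

    The bonus beta * t**alpha / count is a hypothetical polynomial form;
    the paper's exact exponents differ. It grows polynomially in the
    total round count t, so under-sampled arms keep getting revisited.
    """
    scores = [m + beta * (t ** alpha) / max(n, 1)
              for m, n in zip(means, counts)]
    return max(range(len(means)), key=scores.__getitem__)

def select_arm_log_ucb(means, counts, t):
    """Classical UCB1 with a logarithmic bonus, shown for comparison."""
    scores = [m + math.sqrt(2 * math.log(max(t, 2)) / max(n, 1))
              for m, n in zip(means, counts)]
    return max(range(len(means)), key=scores.__getitem__)

# With equal counts the higher empirical mean wins under either rule;
# with a heavily under-sampled arm, the polynomial bonus dominates.
print(select_arm_poly_ucb([0.0, 1.0], [10, 10], 100))   # -> 1
print(select_arm_poly_ucb([0.0, 0.0], [1, 100], 100))   # -> 0
```

In the non-stationary bandit setup analyzed in the paper, the arm rewards themselves drift as the subtrees below each action are refined, which is why the logarithmic bonus of stationary UCB1 is no longer sufficient.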