MCTS Based on Simple Regret

07/23/2012
by David Tolpin, et al.

UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the multi-armed bandit (MAB) problem that minimizes the cumulative regret. However, search differs from MAB: in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, which can be addressed using value of information (VOI) techniques. Although a theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
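To make the two-stage idea concrete, below is a minimal Python sketch, not taken from the paper: a standard UCB1 rule (which targets cumulative regret) and an epsilon-greedy rule with epsilon = 1/2 as one example of a simple-regret oriented policy, combined so that the root uses the simple-regret policy and nodes below the root use UCB1, in the spirit of SR+CR. The Node class and function names are illustrative assumptions, not the authors' implementation.

import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    mean_reward: float = 0.0   # empirical mean of rollout rewards through this child
    visits: int = 0            # number of samples ("arm pulls") through this child
    children: list = field(default_factory=list)

def ucb1(node, c=math.sqrt(2)):
    # Cumulative-regret selection: empirical mean plus the UCB1 exploration bonus.
    total = sum(child.visits for child in node.children)
    def score(child):
        if child.visits == 0:
            return float("inf")   # always try an unvisited arm first
        return child.mean_reward + c * math.sqrt(math.log(total) / child.visits)
    return max(node.children, key=score)

def eps_greedy(node, eps=0.5):
    # Simple-regret oriented selection: with probability eps sample an arm uniformly,
    # otherwise sample the empirically best arm.
    if random.random() < eps:
        return random.choice(node.children)
    return max(node.children, key=lambda child: child.mean_reward)

def select_child(node, is_root):
    # Two-stage (SR+CR-style) selection: a simple-regret policy at the root,
    # UCB1 (cumulative regret) everywhere below it.
    return eps_greedy(node) if is_root else ucb1(node)

# Illustrative usage: three root "arms" with a few simulated statistics.
root = Node(children=[Node(0.4, 10), Node(0.6, 10), Node(0.5, 10)])
best = select_child(root, is_root=True)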


