Log In Sign Up

Doing Better Than UCT: Rational Monte Carlo Sampling in Trees

by   David Tolpin, et al.

UCT, a state-of-the art algorithm for Monte Carlo tree sampling (MCTS), is based on UCB, a sampling policy for the Multi-armed Bandit Problem (MAB) that minimizes the accumulated regret. However, MCTS differs from MAB in that only the final choice, rather than all arm pulls, brings a reward, that is, the simple regret, as opposite to the cumulative regret, must be minimized. This ongoing work aims at applying meta-reasoning techniques to MCTS, which is non-trivial. We begin by introducing policies for multi-armed bandits with lower simple regret than UCB, and an algorithm for MCTS which combines cumulative and simple regret minimization and outperforms UCT. We also develop a sampling scheme loosely based on a myopic version of perfect value of information. Finite-time and asymptotic analysis of the policies is provided, and the algorithms are compared empirically.


MCTS Based on Simple Regret

UCT, a state-of-the art algorithm for Monte Carlo tree search (MCTS) in ...

VOI-aware MCTS

UCT, a state-of-the art algorithm for Monte Carlo tree search (MCTS) in ...

Limited depth bandit-based strategy for Monte Carlo planning in continuous action spaces

This paper addresses the problem of optimal control using search trees. ...

Efficient Multivariate Bandit Algorithm with Path Planning

In this paper, we solve the arms exponential exploding issue in multivar...

On the Hardness of Inventory Management with Censored Demand Data

We consider a repeated newsvendor problem where the inventory manager ha...

On Thompson Sampling with Langevin Algorithms

Thompson sampling is a methodology for multi-armed bandit problems that ...

Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

I analyse the frequentist regret of the famous Gittins index strategy fo...