Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

05/14/2019
by Dennis J. N. J. Soemers, et al.

In recent years, state-of-the-art game-playing agents have often relied on policies trained through self-play, in which Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are likely to exhibit a similar degree of exploration. In this paper, we are interested in learning policies for a project whose future goals include the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that this degree of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates rather than MCTS visit counts. We empirically evaluate various properties of the resulting policies in a variety of board games.
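
The contrast between the two training signals can be illustrated with a small sketch. The abstract does not state the exact form of the proposed objective, so the code below assumes one plausible reading: a softmax policy in a single state, trained either by cross-entropy towards normalised visit counts or by gradient ascent on the expected value sum_a pi(a) Q(a). All function names, visit counts and value estimates are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, visit_counts):
    # Gradient w.r.t. the logits of the cross-entropy between the
    # normalised visit counts (target) and the softmax policy.
    pi = softmax(logits)
    n = sum(visit_counts)
    return [p - c / n for p, c in zip(pi, visit_counts)]

def expected_value_grad(logits, q_values):
    # Gradient w.r.t. the logits of J = sum_a pi(a) * Q(a).
    # ASSUMPTION: this objective is an illustrative stand-in, not
    # necessarily the objective function proposed in the paper.
    pi = softmax(logits)
    j = sum(p * q for p, q in zip(pi, q_values))
    return [p * (q - j) for p, q in zip(pi, q_values)]

def train(grad_fn, signal, sign, steps=5000, lr=1.0):
    # sign = -1.0 descends a loss, sign = +1.0 ascends an objective.
    logits = [0.0] * len(signal)
    for _ in range(steps):
        grad = grad_fn(logits, signal)
        logits = [z + sign * lr * g for z, g in zip(logits, grad)]
    return softmax(logits)

# Hypothetical MCTS statistics for one state with three legal moves.
visit_counts = [20, 50, 30]
q_values = [0.1, 0.5, 0.3]

pi_visits = train(cross_entropy_grad, visit_counts, sign=-1.0)
pi_values = train(expected_value_grad, q_values, sign=+1.0)
print("trained on visit counts:", [round(p, 3) for p in pi_visits])
print("trained on value estimates:", [round(p, 3) for p in pi_values])
```

Under these assumptions, the policy trained on visit counts settles on the visit distribution itself and therefore stays stochastic, while the policy trained on value estimates concentrates its probability mass on the move with the highest estimated value.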

research
05/30/2020

Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration

Expert Iteration (ExIt) is an effective framework for learning game-play...
research
04/07/2019

Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based searc...
research
05/31/2019

Multiple Policy Value Monte Carlo Tree Search

Many of the strongest game playing programs use a combination of Monte C...
research
02/23/2023

Targeted Search Control in AlphaZero for Effective Policy Improvement

AlphaZero is a self-play reinforcement learning algorithm that achieves ...
research
07/14/2020

Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting

Policy gradient methods have shown success in learning control policies ...
research
01/27/2023

Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning

Large-scale AI systems that combine search and learning have reached sup...
research
06/29/2021

Curious Explorer: a provable exploration strategy in Policy Learning

Having access to an exploring restart distribution (the so-called wide c...
