Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning

01/27/2023
by   Niko A. Grupen, et al.

Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment: for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Second, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly even though its policy plays an optimal strategy. From these insights we derive VISA-VIS, a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76% and value error by up to 55%.
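To make the notion of policy-value misalignment concrete, the sketch below shows one possible way to measure it: a state counts as misaligned when the move the policy ranks highest differs from the move whose resulting position the value function rates best. This is an illustration under stated assumptions, not the paper's exact metric. The function names and the input layout (a policy vector paired with per-move value estimates, both from the current player's perspective) are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: for each state, a policy over legal moves and the
# value predicted after playing each move, both from the current player's
# perspective. This layout is an assumption for illustration only.
State = Tuple[Sequence[float], Sequence[float]]


def policy_move(policy: Sequence[float]) -> int:
    """Index of the move the policy assigns the highest probability."""
    return max(range(len(policy)), key=lambda a: policy[a])


def value_move(move_values: Sequence[float]) -> int:
    """Index of the move whose resulting position has the highest value."""
    return max(range(len(move_values)), key=lambda a: move_values[a])


def misalignment_rate(states: List[State]) -> float:
    """Fraction of states where the policy and the value disagree on the move."""
    if not states:
        return 0.0
    disagreements = sum(
        1 for policy, move_values in states
        if policy_move(policy) != value_move(move_values)
    )
    return disagreements / len(states)


# Two toy states: the first is aligned, the second is not.
example_states: List[State] = [
    ([0.7, 0.2, 0.1], [0.9, -0.3, -0.5]),
    ([0.1, 0.8, 0.1], [0.2, -0.4, 0.6]),
]
rate = misalignment_rate(example_states)
```

In the toy data, the policy and the value agree on move 0 in the first state but pick moves 1 and 2 respectively in the second, so half of the states are misaligned.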


