Off-Belief Learning

03/06/2021
by   Hengyuan Hu, et al.

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and rely on multi-step counterfactual reasoning based on assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents. In contrast, no current methods can learn optimal policies that are fully grounded, i.e., that do not rely on counterfactual information gleaned from observing other agents' actions. To address this, we present off-belief learning (OBL): at each time step, OBL agents assume that all past actions were taken by a given, fixed policy (π_0), but that future actions will be taken by an optimal policy under these same assumptions. When π_0 is uniform random, OBL learns the optimal grounded policy. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, introducing counterfactual reasoning in a controlled manner. Unlike independent RL, which may converge to any equilibrium policy, OBL converges to a unique policy, making it more suitable for zero-shot coordination. OBL can be scaled to high-dimensional settings with a fictitious transition mechanism, and shows strong performance in both a simple toy setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
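The core idea of interpreting past actions under a fixed base policy π_0 can be illustrated with a minimal sketch (an assumption-laden toy, not the paper's implementation): a hidden state h ∈ {0, 1} and an observed partner action a. A self-play convention makes a deterministically encode h, while an OBL level-1 learner does the Bayes update as if a came from a uniform-random π_0, keeping its belief grounded.

```python
def posterior(prior, likelihood, action):
    """Bayes update P(h | a) given a prior P(h) and a policy model P(a | h)."""
    joint = {h: prior[h] * likelihood[h][action] for h in prior}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

prior = {0: 0.5, 1: 0.5}

# A self-play convention: the partner's action deterministically encodes h.
convention = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

# OBL's assumed base policy pi_0: uniform random, independent of h.
pi_0 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

a = 1  # observed partner action

belief_selfplay = posterior(prior, convention, a)  # collapses to h = 1
belief_obl = posterior(prior, pi_0, a)             # stays at the prior

print(belief_selfplay)  # {0: 0.0, 1: 1.0}
print(belief_obl)       # {0: 0.5, 1: 0.5}
```

Because the OBL belief ignores any convention-specific information in the partner's action, the resulting best response cannot depend on arbitrary conventions, which is what makes the level-0 policy "grounded".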

Related research

07/14/2022 — K-level Reasoning for Zero-Shot Coordination in Hanabi
The standard problem setting in cooperative multi-agent settings is self...

06/11/2021 — A New Formalism, Method and Open Issues for Zero-Shot Coordination
In many coordination problems, independently reasoning humans are able t...

03/14/2021 — Quasi-Equivalence Discovery for Zero-Shot Emergent Communication
Effective communication is an important skill for enabling information e...

01/16/2020 — Adversarially Guided Self-Play for Adopting Social Conventions
Robotic agents must adopt existing social conventions in order to be eff...

10/21/2022 — Equivariant Networks for Zero-Shot Coordination
Successful coordination in Dec-POMDPs requires agents to adopt robust st...

01/10/2018 — Reasoning about Unforeseen Possibilities During Policy Learning
Methods for learning optimal policies in autonomous agents often assume ...

05/08/2023 — Sense, Imagine, Act: Multimodal Perception Improves Model-Based Reinforcement Learning for Head-to-Head Autonomous Racing
Model-based reinforcement learning (MBRL) techniques have recently yield...
