The Advantage Regret-Matching Actor-Critic

08/27/2020
by   Audrūnas Gruslys, et al.

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies and replays through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the importance sampling commonly used in Monte Carlo counterfactual regret minimization (Monte Carlo CFR); hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
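
The regret-matching step mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic regret-matching rule (play each action in proportion to the positive part of its accumulated advantage, and fall back to uniform when none is positive), not ARMAC's actual implementation. The function name `regret_matching` and its list-based input are assumptions made for this illustration.

```python
def regret_matching(cumulative_advantages):
    """Generic regret matching over one decision point (illustrative only).

    `cumulative_advantages` is a hypothetical list with one accumulated
    advantage (regret) estimate per action.
    """
    # Keep only the positive part of each accumulated advantage.
    positive = [max(a, 0.0) for a in cumulative_advantages]
    total = sum(positive)
    n = len(positive)
    if total > 0.0:
        # Play each action in proportion to its positive advantage.
        return [p / total for p in positive]
    # No action has positive advantage: fall back to the uniform policy.
    return [1.0 / n] * n


policy = regret_matching([2.0, -1.0, 6.0])
print(policy)  # [0.25, 0.0, 0.75]
```

In this sketch the second action gets zero probability because its accumulated advantage is negative, while the remaining mass is split 2:6 between the other two actions.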

10/21/2018

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Optimization of parameterized policies for reinforcement learning (RL) i...
06/08/2022

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

Recent techniques for approximating Nash equilibria in very large games ...
01/08/2014

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games

We consider the problem of finding stationary Nash equilibria (NE) in a ...
02/10/2021

Risk-Averse Offline Reinforcement Learning

Training Reinforcement Learning (RL) agents in high-stakes applications ...
06/14/2018

Self-Imitation Learning

This paper proposes Self-Imitation Learning (SIL), a simple off-policy a...
03/04/2021

Conservative Optimistic Policy Optimization via Multiple Importance Sampling

Reinforcement Learning (RL) has been able to solve hard problems such as...
09/09/2018

Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Learning strategies for imperfect information games from samples of inte...