The Advantage Regret-Matching Actor-Critic

08/27/2020
by Audrūnas Gruslys, et al.

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the importance sampling commonly used in Monte Carlo counterfactual regret minimization (CFR); hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
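
The regret-matching step mentioned in the abstract follows the standard Hart and Mas-Colell rule: play each action in proportion to the positive part of its estimated regret, falling back to a uniform policy when no regret is positive. Below is a minimal, illustrative sketch of that step in Python, assuming per-action advantage estimates are already available from a critic; the function name and example values are hypothetical, not taken from the paper.

import numpy as np

def regret_matching_policy(advantages):
    # Treat the positive parts of the per-action advantage estimates as
    # proxies for cumulative regrets (an illustrative simplification of
    # ARMAC's learned conditional advantages, not the paper's exact method).
    positive = np.maximum(advantages, 0.0)
    total = positive.sum()
    if total > 0.0:
        # Play each action in proportion to its positive regret.
        return positive / total
    # No action has positive regret: fall back to the uniform policy.
    return np.ones_like(advantages) / len(advantages)

# Example with three actions: probability mass is assigned only to
# actions whose advantage estimate is positive.
adv = np.array([0.4, -0.1, 0.2])
print(regret_matching_policy(adv))  # -> [0.6667 0.     0.3333]

This mapping is what lets a no-regret learner ignore actions that have performed worse than its average behavior while still randomizing over the promising ones.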


Related Research

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments (10/21/2018)
Optimization of parameterized policies for reinforcement learning (RL) i...

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret (06/08/2022)
Recent techniques for approximating Nash equilibria in very large games ...

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games (01/08/2014)
We consider the problem of finding stationary Nash equilibria (NE) in a ...

Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations (10/14/2022)
Providing densely shaped reward functions for RL algorithms is often exc...

Self-Imitation Learning (06/14/2018)
This paper proposes Self-Imitation Learning (SIL), a simple off-policy a...

Conservative Optimistic Policy Optimization via Multiple Importance Sampling (03/04/2021)
Reinforcement Learning (RL) has been able to solve hard problems such as...

Efficient exploration via epistemic-risk-seeking policy optimization (02/18/2023)
Exploration remains a key challenge in deep reinforcement learning (RL)....
