Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

by Bram Grooten, et al.

In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research reveals a perhaps counter-intuitive finding: over multiple random seeds in a simplified environment of this multi-agent cooperative card game, Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient. In our analysis of this behavior we examine Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and of any game (89 turns). Our code can be found at:
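To make the comparison concrete, the "Vanilla" Policy Gradient referred to above is the classic REINFORCE-style update, which follows the gradient of expected return directly without PPO's clipped surrogate objective. The following is a minimal illustrative sketch of that update on a toy two-armed bandit; the environment, reward values, and variable names here are hypothetical stand-ins, not the paper's Hanabi setup or its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # logits of a 2-action softmax policy
true_rewards = np.array([0.2, 0.8])   # hypothetical mean reward per arm
lr = 0.1                              # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                # sample action from policy
    r = rng.normal(true_rewards[a], 0.1)      # stochastic reward signal
    # Vanilla policy gradient step: grad log pi(a | theta) * return
    grad_log = -probs
    grad_log[a] += 1.0
    theta += lr * r * grad_log

print(softmax(theta))  # probability mass concentrates on the better arm
```

PPO modifies this update by clipping the probability ratio between the new and old policies, which constrains step sizes; the paper's finding is that this conservatism does not always help in Hanabi's simplified setting.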
