Policy Gradient With Value Function Approximation For Collective Multiagent Planning

04/09/2018
by Duc Thien Nguyen, et al.

Decentralized (PO)MDPs provide an expressive framework for sequential decision making in multiagent systems. Given their computational complexity, recent research has focused on tractable yet practical subclasses of Dec-POMDPs. We address one such subclass, CDEC-POMDP, in which the collective behavior of a population of agents affects the joint reward and the environment dynamics. Our main contribution is an actor-critic (AC) reinforcement learning method for optimizing CDEC-POMDP policies. Vanilla AC converges slowly on larger problems. To address this, we show how a particular decomposition of the approximate action-value function over agents leads to effective updates, and we derive a new way to train the critic based on local reward signals. Comparisons on a synthetic benchmark and a real-world taxi fleet optimization problem show that our new AC approach provides higher-quality solutions than the previous best approaches.

research
10/18/2019

On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Reinforcement learning, mathematically described by Markov Decision Prob...
research
10/09/2020

Is Standard Deviation the New Standard? Revisiting the Critic in Deep Policy Gradients

Policy gradient algorithms have proven to be successful in diverse decis...
research
06/24/2022

Value Function Decomposition for Iterative Design of Reinforcement Learning Agents

Designing reinforcement learning (RL) agents is typically a difficult pr...
research
11/22/2021

Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

We study policy gradient (PG) for reinforcement learning in continuous t...
research
06/12/2014

Algorithms for CVaR Optimization in MDPs

In many sequential decision-making problems we may want to manage risk b...
research
02/14/2021

Sparse Attention Guided Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning

Training deep reinforcement learning agents on environments with multipl...
research
11/03/2020

Loss Bounds for Approximate Influence-Based Abstraction

Sequential decision making techniques hold great promise to improve the ...
