Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure

by Hsu Kao et al.

Multi-agent reinforcement learning (MARL) problems are challenging due to information asymmetry. To overcome this challenge, existing methods often require a high level of coordination or communication between the agents. We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications, which we exploit to propose simpler and more efficient algorithms that require no coordination or communication. In this structure, at each step the "leader" chooses her action first, and the "follower" then decides his action after observing the leader's action. The two agents observe the same reward (and, in the MDP setting, the same state transition), which depends on their joint action. For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of 𝒪(√(ABT)) and a near-optimal gap-dependent regret of 𝒪(log(T)), where A and B are the numbers of actions of the leader and the follower, respectively, and T is the number of steps. We further extend our results to the case of multiple followers and to the case of a deep hierarchy, obtaining near-optimal regret bounds in both. For the MDP setting, we obtain a regret of 𝒪(√(H^7 S^2 ABT)), where H is the number of steps per episode, S is the number of states, and T is the number of episodes. This matches the existing lower bound in terms of A, B, and T.
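To make the hierarchical information structure concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm: the leader runs a standard UCB1 learner over her A actions, while the follower keeps a separate UCB1 learner for each observed leader action over his B actions. All names (`UCB`, `run_hierarchical_bandit`) and the Bernoulli reward model are assumptions made for this example.

```python
import math
import random

class UCB:
    """Standard UCB1 learner over n arms (illustrative; not the paper's exact rule)."""
    def __init__(self, n):
        self.n = n
        self.counts = [0] * n
        self.means = [0.0] * n
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(self.n):          # play each arm once first
            if self.counts[a] == 0:
                return a
        # UCB1 index: empirical mean plus exploration bonus
        return max(range(self.n),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]

def run_hierarchical_bandit(mu, T, seed=0):
    """Decentralized two-agent bandit with the hierarchical structure above:
    the leader acts first; the follower observes her action and then acts;
    both observe the same reward. `mu[a][b]` is the mean Bernoulli reward of
    joint action (a, b). No coordination or communication is used beyond the
    observed action and the shared reward. Returns the average reward."""
    rng = random.Random(seed)
    A, B = len(mu), len(mu[0])
    leader = UCB(A)
    followers = [UCB(B) for _ in range(A)]   # one learner per leader action
    total = 0.0
    for _ in range(T):
        a = leader.select()                  # leader chooses first
        b = followers[a].select()            # follower reacts to observed a
        r = 1.0 if rng.random() < mu[a][b] else 0.0
        leader.update(a, r)                  # both update on the shared reward
        followers[a].update(b, r)
        total += r
    return total / T

avg = run_hierarchical_bandit([[0.2, 0.5], [0.3, 0.9]], T=5000)
```

With the example means, the best joint action is (1, 1) with mean 0.9, so the average reward should approach 0.9 as both learners concentrate on it. Note the leader's reward process is nonstationary while the follower is still learning; the paper's analysis handles this interaction, whereas this sketch only illustrates the information flow.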


