Decentralized Cooperative Reinforcement Learning with Hierarchical Information Structure
Multi-agent reinforcement learning (MARL) problems are challenging due to information asymmetry. To overcome this challenge, existing methods often require a high level of coordination or communication between the agents. We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure that arises in many applications, and we exploit this structure to propose simpler and more efficient algorithms that require no coordination or communication. Under this structure, in each step the "leader" chooses her action first, and the "follower" then decides his action after observing the leader's action. The two agents observe the same reward (and, in the MDP setting, the same state transition) that depends on their joint action. For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of 𝒪(√(ABT)) and a near-optimal gap-dependent regret of 𝒪(log(T)), where A and B are the numbers of actions of the leader and the follower, respectively, and T is the number of steps. We further extend our results to the case of multiple followers and to the case of a deep hierarchy, and obtain near-optimal regret bounds in both cases. For the MDP setting, we obtain a regret of 𝒪(√(H^7S^2ABT)), where H is the number of steps per episode, S is the number of states, and T is the number of episodes. This matches the existing lower bound in terms of A, B, and T.
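The sketch below illustrates the hierarchical interaction protocol described in the abstract: in each step the leader acts, the follower observes the leader's action and then acts, and both observe the same reward for the joint action. It is a minimal, assumed UCB-style instantiation for illustration only; the environment, confidence widths, and the leader's aggregated-statistics rule are hypothetical choices and not the paper's actual algorithm.

```python
# Minimal sketch of the two-agent hierarchical bandit interaction.
# Assumption: a simple UCB-style instantiation; not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

A, B, T = 4, 3, 10_000                        # leader arms, follower arms, steps
mean_reward = rng.uniform(0, 1, size=(A, B))  # hypothetical joint-action reward means

# Statistics for each joint action (a, b); the follower's estimates are
# conditioned on the leader's observed action a.
counts = np.zeros((A, B))
sums = np.zeros((A, B))


def ucb(values, n, t):
    """Upper confidence bound; unvisited arms get +inf to force exploration."""
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2 * np.log(max(t, 1)) / n)
    est = np.where(n > 0, values / np.maximum(n, 1), 0.0)
    return np.where(n > 0, est + bonus, np.inf)


total_reward = 0.0
for t in range(1, T + 1):
    # Leader picks her action first, here from statistics aggregated over
    # the follower's arms (a simplifying assumption for this sketch).
    a = int(np.argmax(ucb(sums.sum(axis=1), counts.sum(axis=1), t)))

    # Follower observes a, then picks his action from statistics conditioned on a.
    b = int(np.argmax(ucb(sums[a], counts[a], t)))

    # Both agents observe the same reward for the joint action (a, b).
    r = rng.binomial(1, mean_reward[a, b])
    counts[a, b] += 1
    sums[a, b] += r
    total_reward += r

regret = T * mean_reward.max() - total_reward
print(f"empirical regret after {T} steps: {regret:.1f}")
```

No coordination or communication is needed beyond the follower observing the leader's action, which is exactly the information structure the abstract exploits.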