Hierarchical Critics Assignment for Multi-agent Reinforcement Learning

02/08/2019, by Zehong Cao et al., University of Technology Sydney

In this paper, we investigate the use of global information to speed up the learning process and increase the cumulative rewards of multi-agent reinforcement learning (MARL) tasks. Within actor-critic MARL, we introduce multiple cooperative critics from two levels of a hierarchy and propose a hierarchical critic-based multi-agent reinforcement learning algorithm. In our approach, each agent receives information from both local and global critics in a competition task. The agent not only receives low-level details but also considers coordination from higher levels that receive global information, improving its operational skills. We define the multiple cooperative critics in a top-down hierarchy, called the Hierarchical Critics Assignment (HCA) framework. Our experiment, a two-player tennis competition task in the Unity environment, tested the HCA multi-agent framework based on the Asynchronous Advantage Actor-Critic (A3C) with Proximal Policy Optimization (PPO) algorithm. The results showed that the HCA framework outperforms the non-hierarchical critic baseline method on MARL tasks.


1 Introduction

The analysis of multi-agent systems is a topic of interest in the field of artificial intelligence (AI). Although multi-agent systems have been widely studied in robotic control, decision support systems, and data mining, only recently have they begun to attract interest in AI [González-Briones et al.2018]. A significant portion of research on multi-agent learning concerns reinforcement learning (RL) techniques [Busoniu et al.2010], which can provide learning policies for achieving target tasks by maximising the rewards provided by the environment. In the multi-agent reinforcement learning (MARL) framework, each agent learns by interacting with its dynamic environment to solve a cooperative or competitive task. At each time step, the agent perceives the state of the environment and takes an action, which causes the environment to transition into a new state.

In a competitive game of multiple players (for two agents, when $N = 2$), the minimax principle can traditionally be applied: maximise one's benefit under the worst-case assumption that the opponent will always endeavour to minimise that benefit. This principle suggests using opponent-independent algorithms. The minimax-Q algorithm [Littman2001] employs the minimax principle to compute strategies and values for the stage games, and a temporal-difference rule similar to Q-learning is used to propagate the values across state transitions. If considering policy gradient methods, each agent can use model-based policy optimisation to learn optimal policies via back-propagation, such as the Monte-Carlo policy gradient and the Deterministic Policy Gradient (DPG) [Silver et al.2014]. Unfortunately, traditional Q-learning and policy gradient methods are poorly suited to multi-agent environments. Thus, [Lowe et al.2017] presented an adaptation of actor-critic methods that considers the action policies of other agents and can successfully learn policies that require complex multi-agent coordination.

One hint at enabling MARL algorithms to overcome these challenges may lie in the way in which multiple agents are hierarchically structured [Mnih et al.2015]. Inspired by feudal reinforcement learning [Dayan1993], the DeepMind group proposed Feudal Networks (FuNs) [Vezhnevets et al.2017], which employ a manager module and a worker module for hierarchical reinforcement learning. The manager sets abstract goals, which are conveyed to and enacted by the worker, who generates primitive actions at every tick of the environment. Furthermore, the FuNs structure has been extended to cooperative reinforcement learning [Ahilan and Dayan2019], whereby the manager learns to communicate sub-goals to multiple workers. Indeed, these properties of extracting sub-goals from the manager allow FuN to dramatically outperform a strong baseline agent on tasks.

However, almost all of the above MARL methods ignore the critical fact that an agent might have access to multiple cooperative critics that could speed up the learning process and increase the rewards on competition tasks. In particular, it is frequently the case that high-level agents agree to be assigned different observations and co-work with low-level agents for the benefit of hierarchical cooperation. For example, military personnel typically have different roles and responsibilities. A commander is required to monitor multiple information sources, assess changing operational conditions and recommend courses of action to soldiers. Advanced hierarchical MARL technologies could evaluate the relative importance of new and changing data and make recommendations that both improve decision-making capabilities and empower commanders to make practical judgements as quickly as possible.

Our proposed framework differs from existing approaches in its use of global information to speed up learning and increase the cumulative rewards of MARL tasks. Within actor-critic MARL, we introduce multiple cooperative critics from two levels of a hierarchy and propose a hierarchical critic-based multi-agent reinforcement learning algorithm. The main contributions of our approach are the following: (1) The agent is allowed to receive information from local and global critics in a competition task. (2) The agent not only receives low-level details but also considers coordination from higher levels that receive global information, increasing operational performance. (3) We define multiple cooperative critics in a top-down hierarchy, called the Hierarchical Critics Assignment (HCA) framework. We assume that HCA is a generalised RL framework and thus broadly applicable to multi-agent learning. These benefits can potentially be obtained with any type of hierarchical MARL algorithm.

The remainder of this paper is organised as follows. In Section 2, we introduce the RL background for developing the multiple cooperative critic framework in multi-agent domains. Section 3 describes the baseline and proposes the HCA framework for hierarchical MARL. Section 4 presents an experimental design in a simple Unity tennis task with four types of settings. Section 5 demonstrates the training performance results of the baseline and proposed HCA framework. Finally, we summarise the paper and discuss some directions for future work in Section 6.

2 Background

2.1 Theory

In a standard RL framework [Kaelbling et al.1996], an agent interacts with the external environment over a number of time steps. Here, $S$ is the set of all possible states, and $A$ is the set of all possible actions. At each time step $t$, the agent in state $s_t \in S$, perceiving observation information from the environment, receives feedback from the reward source, say $r_t$, by taking action $a_t \in A$. Then, the agent moves to a new state $s_{t+1}$, and the reward associated with the transition $(s_t, a_t, s_{t+1})$ is determined. The agent can choose any action as a function of the history, and the goal of a reinforcement learning agent is to collect as much reward as possible with minimal delay.
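To make this interaction loop concrete, the following minimal sketch assumes a Gym-style env.reset/env.step interface and a generic policy callable, neither of which is prescribed by the paper:

```python
def run_episode(env, policy, max_steps=1000):
    """Minimal sketch of the RL loop: observe s_t, take a_t, receive r_t,
    move to s_{t+1}, and accumulate as much reward as possible."""
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                             # a_t chosen from the current state
        next_state, reward, done, info = env.step(action)  # transition to s_{t+1} with reward r_t
        total_reward += reward
        state = next_state
        if done:                                           # terminal state reached
            break
    return total_reward
```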

2.2 Asynchronous Advantage Actor-Critic (A3C)

The A3C structure [Mnih et al.2016] can master a variety of continuous motor control tasks as well as learn general strategies for exploring games purely from sensory and visual inputs. A3C maintains a policy $\pi(a_t|s_t;\theta)$ and an estimate of the value function $V(s_t;\theta_v)$. This variant of actor-critic operates in the forward view and uses the same mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\max}$ actions or when a terminal state is reached. The update performed by the algorithm can be written as

$$\nabla_{\theta'} \log \pi(a_t|s_t;\theta')\, A(s_t,a_t;\theta,\theta_v),$$

where $A(s_t,a_t;\theta,\theta_v)$ is an estimate of the advantage function.

The advantage function is given by

$$A(s_t,a_t;\theta,\theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k};\theta_v) - V(s_t;\theta_v),$$

where $k$ can vary from state to state and is upper bounded by $t_{\max}$.

As with value-based methods, this method relies on parallel actor-learners and accumulated updates to improve training stability. The parameters $\theta$ of the policy and $\theta_v$ of the value function are shared in practice, even though they are shown as separate for generality. For example, a convolutional neural network can have one softmax output for the policy $\pi(a_t|s_t;\theta)$ and one linear output for the value function $V(s_t;\theta_v)$, with all non-output layers shared.
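As an illustration of the n-step update above, the sketch below (function and variable names are ours) accumulates discounted rewards backwards over a rollout of at most $t_{\max}$ steps and subtracts the critic's baseline:

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Sketch: A_t = sum_{i<k} gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t),
    computed backwards over a rollout; k is upper bounded by the rollout length t_max."""
    R = bootstrap_value                       # V(s_{t+k}) from the critic (0 if terminal)
    advantages = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R            # discounted k-step return from step t
        advantages[t] = R - values[t]         # subtract the baseline V(s_t)
    return advantages
```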

2.3 Proximal Policy Optimization (PPO) Algorithm

PPO [Schulman et al.2017] is a recent family of policy gradient methods for reinforcement learning that alternate between sampling data through interactions with the environment and optimising a surrogate objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, the PPO objective function enables multiple epochs of minibatch updates, making the approach simpler to implement, more general, and better in empirical sample complexity.

PPO can be investigated within the A3C framework. Specifically, when using a neural network architecture that shares parameters between the policy and the value function, a loss function must be used that combines the policy surrogate and a value-function error term. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration. The "surrogate" objective, which is approximately maximised at each iteration, is

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right],$$

where $c_1$ and $c_2$ are coefficients, $S$ denotes an entropy bonus, $L_t^{CLIP}$ is the clipped surrogate objective, and $L_t^{VF} = \left(V_\theta(s_t) - V_t^{\mathrm{targ}}\right)^2$ is a squared-error loss.
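For reference, a minimal PyTorch sketch of this combined objective, written as a loss to be minimised (the helper signature and tensor names are ours, not part of the paper), is:

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Sketch of the combined PPO objective L^{CLIP+VF+S}, written as a loss to minimise:
    clipped policy surrogate + value-function error - entropy bonus."""
    ratio = torch.exp(new_logp - old_logp)                       # pi_theta(a|s) / pi_theta_old(a|s)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()                # -L^CLIP
    value_loss = (values - returns).pow(2).mean()                # L^VF, squared-error term
    return policy_loss + c1 * value_loss - c2 * entropy.mean()   # minimising this maximises the objective
```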

2.4 Hierarchies

Hierarchical reinforcement learning (HRL) is a promising approach to extending traditional RL methods to solve more complex tasks [Kulkarni et al.2016]. In its most straightforward setting, the hierarchy corresponds to a rooted directed tree, with the highest-level manager as the root node and each worker reporting to only a single manager. A popular scheme is meta-learning shared hierarchies [Frans et al.2017], which learn a hierarchical policy whereby a master policy switches between a set of sub-policies. The master selects an action every $N$ time steps, and a sub-policy executed for $N$ time steps constitutes a high-level action. Another scheme [Nachum et al.2018] is learning goal-directed behaviours in environments, where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers.
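As a toy illustration of this master/sub-policy scheme (function and argument names are ours, and env is assumed to follow a Gym-style interface):

```python
def run_shared_hierarchy(env, master_policy, sub_policies, total_steps, n=10):
    """Sketch: the master picks a sub-policy every n time steps (the high-level action);
    the selected sub-policy emits primitive actions in between."""
    state = env.reset()
    k = 0
    for t in range(total_steps):
        if t % n == 0:                               # high-level decision point
            k = master_policy(state)                 # index of the sub-policy to run
        action = sub_policies[k](state)              # low-level primitive action
        state, reward, done, info = env.step(action)
        if done:
            state = env.reset()
```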

3 Methods

Toward the propagation of the critics in the hierarchies, we propose HCA, a framework for MARL that considers multiple cooperative critics from two levels of the hierarchy. To speed up the learning process and increase the cumulative rewards, the agent is allowed to receive information from local and global critics in a competition task.

3.1 Baseline: A3C-PPO

A3C- and PPO-based RL algorithms have performed comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. In particular, PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance. Here, we use the A3C algorithm with PPO, called A3C-PPO, a state-of-the-art deep RL algorithm. It serves as the baseline to validate the experiment environments as well as a starting point for the development of novel algorithms.

3.2 Multiple-Critic Assignment

To apply existing RL methods to the problem of agents with variable attention to more than one critic, we consider a softmax approach for resolving the multiple-critic learning problem. In terms of advantage actor-critic methods, the actor is a policy function that controls how our agent acts, and the critic is a value function that measures how good those actions are. For multiple critics, the updated advantage produced by the softmax assignment can be written as

$$A(s_t,a_t) = \sum_{i=1}^{n} \frac{e^{A_i(s_t,a_t)}}{\sum_{j=1}^{n} e^{A_j(s_t,a_t)}}\, A_i(s_t,a_t),$$

where $n$ is the total number of critics and $A_i$ is the advantage estimated by the $i$-th critic.

The advantage function measures the extra reward obtained by taking a given action, which tells us the improvement over the average action taken at that state. In other words, a maximised $A(s_t,a_t)$ indicates the direction in which the gradient is pushed. Based on the A3C structure, the policy $\pi(a_t|s_t;\theta)$ is then updated with this combined advantage estimate.

Furthermore, we consider the time-step intervals of the multiple critics, and the updated advantage function can be written as

$$A(s_t,a_t) = \sum_{i=1}^{n} \frac{e^{A_i(s_{\tau_i},a_{\tau_i})}}{\sum_{j=1}^{n} e^{A_j(s_{\tau_j},a_{\tau_j})}}\, A_i(s_{\tau_i},a_{\tau_i}),$$

where $n$ is the total number of critics. If critic $i$ is a higher-level (global) critic, $\tau_i = t - (t \bmod T)$, where $T$ is a time period spanning a fixed number of time steps (five in our Type 4 setting); otherwise, $\tau_i = t$, so that global advantages are refreshed only at the start of each period.
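A minimal NumPy sketch of the softmax-weighted combination described above follows; the exact combination rule used in the paper may differ, and the time-interval variant would simply reuse the higher-level advantages computed at the start of the current period:

```python
import numpy as np

def hca_advantage(advantages):
    """Sketch of the multiple-critic softmax assignment: weight each critic's
    advantage A_i by softmax(A_1, ..., A_n) and return the combined advantage."""
    a = np.asarray(advantages, dtype=np.float64)   # [A_1(s_t, a_t), ..., A_n(s_t, a_t)]
    w = np.exp(a - a.max())                        # numerically stable softmax weights
    w /= w.sum()
    return float(np.dot(w, a))                     # advantage used in the policy update
```

For the two-level tennis setting, the input would simply be the pair of local and global advantages.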

3.3 HCA-A3C-PPO

For simplicity, the experiments generally used two-level hierarchies, such as a multi-agent hierarchy with one manager agent and two worker agents. To propagate the critics through the hierarchy, we are the first to develop an HCA framework allowing a worker to receive multiple critics computed locally and globally. The manager is responsible for collecting broader observations and estimating the corresponding global critic. As shown in Fig. 1, the HCA framework is constructed from a two-level hierarchy with one manager agent and two worker agents. The local and global critics are combined by the softmax assignment described in the 'Multiple-Critic Assignment' subsection.

Figure 1. The HCA framework. The multi-agent hierarchy with one manager agent and two worker agents. The worker receives multiple critics computed locally and globally, and the manager provides the global critic.

Here, we apply the HCA framework within A3C-PPO, calling the result HCA-A3C-PPO, or simply HCA. Successfully training the HCA-A3C-PPO model requires tuning the training hyperparameters, which benefits the output of the training process, namely the optimised policy.
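To make the two-level assignment concrete, the following sketch (class and method names are ours) shows how a worker could fuse its local critic with the manager's global critic before the A3C-PPO update:

```python
import numpy as np

class HCAWorker:
    """Sketch of an HCA worker in a two-level hierarchy: it owns a local critic
    over its own observation and also queries the manager's global critic."""

    def __init__(self, local_critic, manager_critic):
        self.local_critic = local_critic       # V_local(s)  -> float, worker's observation
        self.manager_critic = manager_critic   # V_global(s) -> float, broader observation

    def fused_advantage(self, local_obs, global_obs, n_step_return):
        a_local = n_step_return - self.local_critic(local_obs)
        a_global = n_step_return - self.manager_critic(global_obs)
        a = np.array([a_local, a_global])
        w = np.exp(a - a.max())                # softmax over the two critics
        w /= w.sum()
        return float(np.dot(w, a))             # advantage fed to the policy gradient
```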

Types    Agents           Observation Spaces                                Input Variables
Type 1   workers 1 & 2    ball, racket                                      8 variables
         manager          ball, racket, distance                            10 variables
Type 2   workers 1 & 2    ball, racket                                      8 variables
         manager          ball, racket                                      16 variables
Type 3   workers 1 & 2    ball, racket                                      8 variables
         manager          ball, racket, distance                            20 variables
Type 4   workers 1 & 2    ball, racket                                      8 variables
         manager          ball, racket, distance (5-time-step intervals)    10 variables

Table 1: Four types of multi-agent observations

4 Experiment

We apply our proposed HCA framework to scenarios in which two agents compete with each other. We empirically show the success of our framework compared to the existing method in competitive scenarios. We have released code for both the model and the environments on GitHub.

4.1 Unity Platform for MARL

Since many existing platforms, such as OpenAI Gym, lack the ability to flexibly configure the simulation, the simulation environment becomes a black box from the perspective of the learning system. The Unity platform, a new open-source toolkit, has been developed for creating and interacting with simulation environments. Specifically, the Unity Machine Learning Agents Toolkit (ML-Agents Toolkit) [Juliani et al.2018] is an open-source Unity plugin that enables games and simulations to serve as environments for training intelligent agents. The toolkit supports dynamic multi-agent interaction, and agents can be trained using RL through a simple-to-use Python API.
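As an illustration, a minimal interaction loop with a tennis scene is sketched below using the legacy (v0.x) ML-Agents Python API; the module and attribute names (UnityEnvironment, brain_names, vector_observations, and so on) have changed across toolkit releases, so this is an assumption about the API of that era rather than a definitive usage:

```python
import numpy as np
from mlagents.envs import UnityEnvironment   # legacy (v0.x) ML-Agents Python API

env = UnityEnvironment(file_name="Tennis")   # illustrative path to a built Unity scene
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)            # two racket workers in the tennis scene
action_size = 3                              # continuous action space of size 3 (Section 4.2)

for step in range(100):
    states = env_info.vector_observations                          # one observation row per agent
    actions = np.random.uniform(-1, 1, (num_agents, action_size))  # random placeholder policy
    env_info = env.step(actions)[brain_name]
    rewards, dones = env_info.rewards, env_info.local_done
    if np.any(dones):
        env_info = env.reset(train_mode=True)[brain_name]

env.close()
```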

4.2 Unity Scenario: Tennis Competition

We set up a tennis competition scenario in Unity: a two-player game in which agents control rackets to bounce a ball over a net. The goal of this task is for the agents to bounce the ball between one another without dropping it or sending it out of bounds. Furthermore, as shown in Fig. 2, we construct a new learning environment involving a two-level hierarchy by introducing a manager that observes broader observation spaces. The information that the low-level agents (racket workers 1 and 2) collect includes the position of the target and the position of the agent itself, as well as the velocity of the agent. The state observation of the manager contains additional variables, such as the distance between the ball and the racket and information about previous time steps. These observation state spaces are continuous and must be initialised. Here, we provide four types of observation spaces in Table 1 to test our proposed HCA framework and the baseline A3C-PPO.

Of note, the agent reward function is +0.1 for hitting the ball over the net and -0.1 for letting the ball hit the ground or hitting it out of bounds. The observation space includes 8 to 20 variables corresponding to the position and velocity of the ball and racket, as well as the distance between the ball and the racket over continuous time steps. The vector action space is continuous, with a size of 3, corresponding to movement toward the net, movement away from the net, and jumping.

Figure 2. Tennis competition in Unity.

The hyperparameters used for RL training are specified in Table 2, which provides the initialisation settings that we used in the tennis competition learning environment. In PPO, the batch size and buffer size represent the number of experiences in each iteration of gradient descent and the number of experiences to collect before updating the policy model, respectively. Beta controls the strength of entropy regularisation, and epsilon influences how rapidly the policy can evolve during training. Gamma is the reward discount rate, and lambda is the regularisation parameter used in the generalised advantage estimator (GAE).

Parameters        Values    Parameters       Values
batch size        1024      beta
buffer size       10240     epsilon          0.2
gamma             0.99      hidden units     128
lambda            0.95      learning rate
max steps                   memory size      256
normalise         true      num. epoch       3
num. layers       2         time horizon     64
sequence length   64        summary freq.    1000

Table 2: Parameters in the learning environment
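For reference, the settings in Table 2 can be gathered into a single configuration; in the sketch below the key names are assumed to mirror the ML-Agents v0.x PPO trainer convention, and the three values not given in the text are omitted:

```python
# Hyperparameters from Table 2 collected into a single trainer configuration.
# Key names follow the (assumed) ML-Agents v0.x PPO trainer convention; beta,
# learning_rate and max_steps are left out because their values are not given here.
tennis_ppo_config = {
    "batch_size": 1024,
    "buffer_size": 10240,
    "epsilon": 0.2,
    "gamma": 0.99,
    "hidden_units": 128,
    "lambd": 0.95,
    "memory_size": 256,
    "normalize": True,
    "num_epoch": 3,
    "num_layers": 2,
    "time_horizon": 64,
    "sequence_length": 64,
    "summary_freq": 1000,
}
```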

5 Results

We provide the training performance of the HCA framework (HCA-A3C-PPO) and the baseline algorithm (A3C-PPO). To assess whether the HCA framework is more efficient and more general than the baseline, we chose the two-player tennis competition as an example scenario. To study the training process in more detail, we used TensorBoard (smoothing = 0.7) to visualise the dynamic rewards, episodes, and policies under four types (Type 1, Type 2, Type 3 and Type 4) of observation spaces. In particular, we focus on two indices, cumulative reward and episode length, which represent the mean cumulative episode reward and the mean length of each episode in the environment over all agents, respectively.

5.1 Type 1: HCA vs. Baseline

Considering the 10 variable vectors of the manager's observation space, we compare the performance of our HCA framework (HCA-A3C-PPO, pink-red line) and the baseline (A3C-PPO, dark-red line). As shown in Fig. 3, the HCA framework achieved a higher cumulative reward and a longer episode length within fewer training steps. Both methods trained successfully, as both presented slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.

Figure 3. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.

5.2 Type 2: HCA vs. Baseline

Considering the 16 variable vectors of the manager's observation space, we compare the performance of our HCA framework (HCA-A3C-PPO, orange line) and the baseline (A3C-PPO, blue line). As shown in Fig. 4, the HCA framework achieved a higher cumulative reward and a longer episode length within fewer training steps. Both methods trained successfully, as both presented slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.

Figure 4. Graphs depicting mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.

5.3 Type 3: HCA vs. Baseline

Considering the 20 variable vectors of the manager's observation space, we compare the performance of our HCA framework (HCA-A3C-PPO, light-blue line) and the baseline (A3C-PPO, dark-blue line). As shown in Fig. 5, the HCA framework achieved a higher cumulative reward and a longer episode length within fewer training steps. Both methods trained successfully, as both presented slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.

Figure 5. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.

5.4 Type 4: HCA vs. Baseline

Considering the 10 variable vectors (with 5-time-step intervals) of the manager's observation space, we compare the performance of our HCA framework (HCA-A3C-PPO, green line) and the baseline (A3C-PPO, blue line). As shown in Fig. 6, the HCA framework achieved a higher cumulative reward and a longer episode length within fewer training steps. Both methods trained successfully, as both presented slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.

Figure 6. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.

6 Conclusion

In this study, we developed the HCA framework, which uses global information to speed up the learning process and increase cumulative rewards. Within this framework, the agent is allowed to receive information from local and global critics in a competition task. We tested the proposed framework on a two-player tennis competition task in the Unity environment by comparing it with a baseline algorithm, A3C-PPO. The results showed that the HCA framework outperforms the non-hierarchical critic baseline method on MARL tasks.

In future work, we will explore weighted approaches to fusing critics from different layers and consider optimising the temporal scaling of different layers. Furthermore, we will extend the number of agents and the number of layers, and even allow more than one manager at the highest level of the hierarchy. In more complex settings, we also expect to consider more general multi-agent reinforcement learning loops in which each agent can potentially achieve the maximum reward hierarchically.

Acknowledgments

This work was partially supported by grants from the Australian Research Council under Discovery Projects [DP180100670 and DP180100656], US Army Research Laboratory [W911NF-10-2-0022 and W911NF-10-D-0002/TO 0023], and Australia Defence Science Technology Group.

References