The analysis of multi-agent systems is a topic of interest in the field of artificial intelligence (AI). Although multi-agent systems have been widely studied in robotic control, decision support systems, and data mining, only recently have they begun to attract interest in AI [González-Briones et al.2018]. A significant portion of research on multi-agent learning concerns reinforcement learning (RL) techniques [Busoniu et al.2010], which can provide learning policies for achieving target tasks by maximising rewards provided by the environment. In the multi-agent reinforcement learning (MARL) framework, each agent learns by interacting with its dynamic environment to solve a cooperative or competitive task. At each time step, the agent perceives the state of the environment and takes an action, which causes the environment to transition into a new state.
In a competitive game of multiple players (for two agents, the zero-sum case in which one agent’s gain is the other’s loss), the minimax principle is traditionally applied: maximise one’s benefit under the worst-case assumption that the opponent will always endeavour to minimise that benefit. This principle suggests using opponent-independent algorithms. The minimax-Q algorithm [Littman2001] employs the minimax principle to compute strategies and values for the stage games, and a temporal-difference rule similar to Q-learning is used to propagate the values across state transitions. With policy gradient methods, each agent can use model-based policy optimisation to learn optimal policies via back-propagation, such as the Monte-Carlo policy gradient and Deterministic Policy Gradient (DPG) [Silver et al.2014]. Unfortunately, traditional Q-learning and policy gradient methods are poorly suited to multi-agent environments. Thus, [Lowe et al.2017] presented an adaptation of actor-critic methods that considers the action policies of other agents and can successfully learn policies that require complex multi-agent coordination.
One hint at enabling MARL algorithms to overcome these challenges may lie in the way in which multiple agents are hierarchically structured [Mnih et al.2015]. Inspired by feudal reinforcement learning [Dayan1993], the DeepMind group proposed Feudal Networks (FuNs) [Vezhnevets et al.2017], which employ a manager module and a worker module for hierarchical reinforcement learning. The manager sets abstract goals, which are conveyed to and enacted by the worker, who generates primitive actions at every tick of the environment. Furthermore, the FuNs structure has been extended to cooperative reinforcement learning [Ahilan and Dayan2019], whereby the manager learns to communicate sub-goals to multiple workers. Indeed, these properties of extracting sub-goals from the manager allow FuN to dramatically outperform a strong baseline agent on tasks.
However, almost all of the above MARL methods ignore the critical fact that an agent might have access to multiple cooperative critics to speed up the learning process and increase the rewards on competition tasks. In particular, it is frequently the case that high-level agents agree to be assigned different observations and co-work with low-level agents for the benefit of hierarchical cooperation. For example, military personnel typically have different roles and responsibilities. The commander is required to monitor multiple information sources, assess changing operational conditions and recommend courses of action to soldiers. Advanced hierarchical MARL technologies can evaluate the relative importance of new and changing data and make recommendations that will both improve decision-making capabilities and empower commanders to make practical judgements as quickly as possible.
The framework proposed in this paper differs from existing approaches in its use of global information to speed up learning and increase the cumulative rewards of MARL tasks. Within actor-critic MARL, we introduce multiple cooperative critics from two levels of the hierarchy and propose a hierarchical critic-based multi-agent reinforcement learning algorithm. The main contributions of our proposed approach are the following: (1) The agent is allowed to receive information from local and global critics in a competition task. (2) The agent not only receives low-level details but also considers coordination from high levels that receive global information to increase operational performance. (3) We define multiple cooperative critics in the top-bottom hierarchy, called the Hierarchical Critic Assignment (HCA) framework. We assume that HCA is a generalised RL framework and thus more applicable to multi-agent learning. These benefits can potentially be obtained when using any type of hierarchical MARL algorithm.
The remainder of this paper is organised as follows. In Section 2, we introduce the RL background for developing the multiple cooperative critic framework in multi-agent domains. Section 3 describes the baseline and proposes the HCA framework for hierarchical MARL. Section 4 presents an experimental design in a simple Unity tennis task with four types of settings. Section 5 demonstrates the training performance results of the baseline and proposed HCA framework. Finally, we summarise the paper and discuss some directions for future work in Section 6.
In a standard RL framework [Kaelbling et al.1996], an agent interacts with the external environment over a number of time steps. Here, $\mathcal{S}$ is the set of all possible states, and $\mathcal{A}$ is the set of all possible actions. At each time step $t$, the agent in state $s_t$, by perceiving the observation information from the environment, receives feedback from the reward source, say, $r_t$, by taking action $a_t$. Then, the agent moves to a new state $s_{t+1}$, and the reward associated with the transition is determined. The agent can choose any action as a function of the history, and the goal of a reinforcement learning agent is to collect as much reward as possible with minimal delay.
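As an illustration only, the interaction loop described above can be sketched in Python. The toy `ChainEnv` environment, the `run_episode` helper and the discount factor `gamma` are hypothetical names introduced here for the sketch; discounting is one common way to express a preference for reward with minimal delay, and it is not taken from the original framework.

```python
class ChainEnv:
    """Hypothetical toy environment: walk right along a short chain to reach a goal."""

    def __init__(self, n_states=4, max_steps=10):
        self.n_states = n_states
        self.max_steps = max_steps
        self.state = 0
        self.t = 0

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        # Action 1 moves one state to the right; any other action stays put.
        if action == 1:
            self.state += 1
        self.t += 1
        reached_goal = self.state == self.n_states - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.t >= self.max_steps
        return self.state, reward, done


def run_episode(env, policy, gamma=0.9):
    """Roll out one episode and return the discounted sum of rewards."""
    state = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                  # agent picks an action from the state
        state, reward, done = env.step(action)  # environment transitions and rewards
        discounted_return += discount * reward
        discount *= gamma                       # later rewards count for less
    return discounted_return
```

In this sketch, a policy that always moves right reaches the goal on its third step, so the single reward is discounted by `gamma` squared, whereas a policy that never moves collects no reward before the step limit.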
2.2 Asynchronous Advantage Actor-Critic (A3C)
The A3C structure [Mnih et al.2016] can master a variety of continuous motor control tasks and learn general strategies for exploring games purely from sensor and visual inputs. A3C maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. This variant of actor-critic operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{max}$ actions or when a terminal state is reached. The update performed by the algorithm can be written as
$$\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') \, A(s_t, a_t; \theta, \theta_v),$$
where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function.
The advantage function is given by
$$A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v),$$
where $k$ can vary from state to state and is upper bounded by $t_{max}$.
As with value-based methods, this method relies on parallel actor-learners and accumulated updates to improve training stability. The parameters $\theta$ of the policy and $\theta_v$ of the value function are shown as separate for generality, although some of them are shared in practice. For example, a convolutional neural network has one softmax output for the policy $\pi(a_t \mid s_t; \theta)$ and one linear output for the value function $V(s_t; \theta_v)$, with all non-output layers shared.
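The n-step advantage estimate used by A3C [Mnih et al.2016], that is, the discounted sum of up to k observed rewards plus a bootstrapped value of the state reached, minus the value of the starting state, can be sketched as follows. The function name `n_step_advantage` and its argument names are hypothetical, and the value estimates are passed in as plain numbers rather than produced by a network.

```python
def n_step_advantage(rewards, value_start, value_end, gamma):
    """Return sum_{i<k} gamma^i * r_{t+i} + gamma^k * V(s_{t+k}) - V(s_t).

    rewards     -- the k rewards r_t, ..., r_{t+k-1} observed from state s_t
    value_start -- critic estimate V(s_t)
    value_end   -- critic estimate V(s_{t+k}); use 0.0 if s_{t+k} is terminal
    gamma       -- discount factor
    """
    k = len(rewards)
    n_step_return = sum((gamma ** i) * r for i, r in enumerate(rewards))
    n_step_return += (gamma ** k) * value_end  # bootstrap from the critic
    return n_step_return - value_start         # subtract the baseline V(s_t)
```

A positive result means the actions taken did better than the critic expected from the starting state, so the policy gradient raises their probability; a negative result lowers it.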
2.3 Proximal Policy Optimization (PPO) Algorithm
PPO [Schulman et al.2017] is a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interactions with the environment and optimising a surrogate objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, the objective function enables multiple epochs of minibatch updates, which is simpler to implement, more general, and has better sample complexity.
PPO can be investigated in the A3C framework. Specifically, if using a neural network architecture that shares parameters between the policy and value function, a loss function must be used that combines the policy surrogate and a value function error term. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration. The “surrogate” objective function, which is approximately maximised at each iteration, is as follows:
$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right],$$
where $c_1$ and $c_2$ are coefficients, $S$ denotes an entropy bonus, $L_t^{CLIP}$ is a surrogate objective, and $L_t^{VF}$ is a squared-error loss.
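A minimal sketch of this combined objective is given below, assuming the clipped form of the policy surrogate from [Schulman et al.2017]. The function name `ppo_objective`, the default values of `epsilon`, `c1` and `c2`, and the use of plain Python lists in place of network outputs are illustrative assumptions, not settings reported in this paper.

```python
def ppo_objective(ratios, advantages, values, value_targets, entropies,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch average of: clipped surrogate - c1 * value loss + c2 * entropy bonus.

    ratios     -- probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages -- advantage estimates, one per sample
    """
    n = len(ratios)
    clip_terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        clip_terms.append(min(ratio * adv, clipped_ratio * adv))
    l_clip = sum(clip_terms) / n
    l_vf = sum((v - target) ** 2 for v, target in zip(values, value_targets)) / n
    entropy = sum(entropies) / n
    return l_clip - c1 * l_vf + c2 * entropy  # maximised by gradient ascent
```

Because the clipped term removes any incentive to move the probability ratio outside the interval from 1 - epsilon to 1 + epsilon, the same batch can be reused for several epochs of minibatch updates.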
Hierarchical reinforcement learning (HRL) is a promising approach to extending traditional RL methods to solve more complex tasks [Kulkarni et al.2016]. In its most straightforward setting, the hierarchy corresponds to a rooted directed tree, with the highest-level manager as the root node and each worker reporting to only a single manager. A popular scheme is meta-learning shared hierarchies [Frans et al.2017], which learn a hierarchical policy whereby a master policy switches between a set of sub-policies. The master selects an action every $N$ time steps, and a sub-policy executed for $N$ time steps constitutes a high-level action. Another scheme [Nachum et al.2018] is learning goal-directed behaviours in environments, where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers.
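The master and sub-policy switching scheme can be illustrated with a short sketch. The function `run_hierarchy` and its arguments are hypothetical; the policies are ordinary Python callables rather than learned networks, and only the control flow is shown.

```python
def run_hierarchy(master_policy, sub_policies, observations, n_steps):
    """Re-select the active sub-policy every n_steps observations.

    master_policy -- maps an observation to an index into sub_policies
    sub_policies  -- list of callables mapping an observation to a primitive action
    """
    actions, active = [], None
    for t, obs in enumerate(observations):
        if t % n_steps == 0:
            active = master_policy(obs)           # high-level decision
        actions.append(sub_policies[active](obs))  # low-level primitive action
    return actions
```

For example, with `n_steps=3` the master is consulted at steps 0, 3, 6 and so on, and the chosen sub-policy produces every primitive action in between.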
Toward the propagation of the critics in the hierarchies, we propose HCA, a framework for MARL that considers multiple cooperative critics from two levels of the hierarchy. To speed up the learning process and increase the cumulative rewards, the agent is allowed to receive information from local and global critics in a competition task.
3.1 Baseline: A3C-PPO
A3C and PPO-based RL algorithms have performed comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. In particular, PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance. Here, we provide the A3C algorithm with PPO, called A3C-PPO, which is a state-of-the-art deep RL algorithm. It can be used as the baseline to validate experiment environments as well as starting points for the development of novel algorithms.
3.2 Multiple-Critic Assignment
To apply existing RL methods to the problem of agents with variable attention to more than one critic, we consider a softmax approach for resolving the multiple-critic learning problem. In terms of advantage actor-critic methods, the actor is a policy function that controls how our agent acts, and the critic is a value function that measures how good these actions are. For multiple critics, the update advantage function performed by the softmax function can be written as
, where is the total number of critics.
The advantage function calculates the extra reward obtained by taking this action, which tells us the improvement compared to the average action taken at that state. In other words, the maximised advantage indicates that the gradient is pushed in that direction. Based on the A3C structure, the policy function would estimate .
Furthermore, we consider the time step intervals of multiple critics, and the update advantage function can be written as
, where is the total number of critics.
If , , where and is a time period with time steps; otherwise, .
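One possible reading of the softmax approach to multiple critics is sketched below: each critic supplies its own advantage estimate, and the estimates are blended with softmax weights so that the larger estimates dominate. This is an assumption made for illustration and is not necessarily the exact update used in this paper; the names `softmax` and `combine_critic_advantages` and the `temperature` parameter are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_critic_advantages(advantages, temperature=1.0):
    """Blend per-critic advantage estimates using softmax weights (assumed scheme).

    advantages -- one estimate per critic, e.g. [local, global]
    """
    weights = softmax([a / temperature for a in advantages])
    return sum(w * a for w, a in zip(weights, advantages))
```

Under this assumption the combined value always lies between the mean and the maximum of the individual estimates; a lower temperature moves it toward the maximum, and a higher temperature moves it toward the mean.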
For simplicity, the experiments generally used two-level hierarchies such as a multi-agent hierarchy with one manager agent and two worker agents. To propagate the critics in the hierarchies, we are the first to develop an HCA framework allowing a worker to receive multiple critics computed locally and globally. The manager is responsible for collecting broader observations and estimating the corresponding global critic. As shown in Fig. 1, the HCA framework is constructed by the two-level hierarchies with one manager agent and two worker agents. The local and global critics are implemented by the softmax function illustrated in the ‘multiple-critic assignment’ subsection.
Figure 1. The HCA framework. The multi-agent hierarchy with one manager agent and two worker agents. The worker receives multiple critics computed locally and globally, and the manager provides the global critic.
Here, we applied the HCA framework to A3C-PPO and call the result HCA-A3C-PPO, or simply HCA. Training the HCA-A3C-PPO model successfully requires tuning the training hyperparameters, which benefits the output of the training process, namely the optimised policy.
|Types||Agents||Observation Spaces||Input Variables|
|Type 1||worker 1&2||ball, racket||8 variables|
|Type 1||manager||ball, racket, distance||10 variables|
|Type 2||worker 1&2||ball, racket||8 variables|
|Type 2||manager||ball, racket||16 variables|
|Type 3||worker 1&2||ball, racket||8 variables|
|Type 3||manager||ball, racket, distance||20 variables|
|Type 4||worker 1&2||ball, racket||8 variables|
|Type 4||manager||ball, racket, distance||10 variables|
We apply our proposed HCA framework to scenarios in which two agents compete with each other. We empirically show the success of our framework compared to the existing method in competitive scenarios. We have released the code for both the model and the environments on GitHub.
4.1 Unity Platform for MARL
Since many existing platforms, such as OpenAI Gym, lack the ability to flexibly configure the simulation, the simulation environment becomes a black box from the perspective of the learning system. The Unity platform, a new open-source toolkit, has been developed for creating and interacting with simulation environments. Specifically, the Unity Machine Learning Agents Toolkit (ML-Agents Toolkit)[Juliani et al.2018] is an open-source Unity plugin that enables games and simulations to serve as environments for training intelligent agents. The toolkit supports dynamic multi-agent interaction, and agents can be trained using RL through a simple-to-use Python API.
4.2 Unity Scenario: Tennis Competition
We set up a tennis competition scenario in Unity consisting of a two-player game in which agents control rackets to bounce the ball over a net. The goal of this task is for the agents to bounce the ball between one another while not dropping it or sending it out of bounds. Furthermore, as shown in Fig. 2, we construct a new learning environment involving the two-layer hierarchy by introducing a manager to look at broader observation spaces. The information that the low-level agents (racket workers 1 and 2) collect includes the position of the target and the position of the agent itself, as well as the velocity of the agent. The state observation of the manager contains additional variables, such as the distance between the ball and the racket and information about the previous time steps. These observation state spaces are continuous, and we need them for initialisation. Here, we provide four types of observation spaces in Table 1 to test our proposed HCA framework and baseline A3C-PPO.
Of note, the agent reward function is +0.1 when hitting the ball over the net and -0.1 when letting the ball hit the ground or when the ball is hit out of bounds. The observation space includes 8-20 variable vectors corresponding to the position and velocity of the ball and racket, as well as the distance between the ball and the racket in continuous time steps. The vector action space is continuous, with a size of 3, corresponding to movement toward the net or away from the net, and jumping.
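The reward rule and the observation sizes above can be sketched as follows. The reward values come from the text; everything else is an assumption made for illustration: the function names, the zero reward on uneventful steps, and the layout of the observation as 2-D position and velocity vectors (chosen because it yields the 8 worker variables and 10 Type 1 manager variables listed in Table 1).

```python
def tennis_reward(hit_over_net, ball_grounded, ball_out_of_bounds):
    """+0.1 for hitting the ball over the net, -0.1 for a dropped or out-of-bounds ball."""
    if ball_grounded or ball_out_of_bounds:
        return -0.1
    if hit_over_net:
        return 0.1
    return 0.0  # assumption: no reward on other time steps


def worker_observation(ball_pos, ball_vel, racket_pos, racket_vel):
    """Assumed 8-variable worker observation built from four 2-D vectors."""
    return [*ball_pos, *ball_vel, *racket_pos, *racket_vel]


def manager_observation(ball_pos, ball_vel, racket_pos, racket_vel):
    """Assumed 10-variable manager observation: worker view plus ball-racket offset."""
    distance = [b - r for b, r in zip(ball_pos, racket_pos)]
    return worker_observation(ball_pos, ball_vel, racket_pos, racket_vel) + distance
```

Under these assumptions the manager sees everything a worker sees plus the per-axis offset between ball and racket, which is one simple way to give the higher level a broader observation space.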
Figure 2. Tennis competition in Unity.
The hyperparameters for the RL used for training are specified in Table 2, which provides the initialisation settings that we used in the tennis competition learning environment. In PPO, the batch size and buffer size represent the number of experiences in each iteration of gradient descent and the number of experiences to collect before updating the policy model, respectively. Beta controls the strength of entropy regularisation, and epsilon influences how rapidly the policy can evolve during training. Gamma and lambda indicate the reward discount rate for the generalised advantage estimator and the regularisation parameter, respectively.
|max steps||memory size||256|
|num. layers||2||time horizon||64|
|sequence length||64||summary freq.||1000|
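To make the roles of gamma and lambda concrete, the generalised advantage estimator mentioned above can be sketched as follows. The function name `gae_advantages` is hypothetical, and the default values of `gamma` and `lam` are placeholders chosen for illustration, since the corresponding entries of Table 2 are not reproduced here.

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one trajectory segment.

    rewards         -- rewards r_0, ..., r_{T-1}
    values          -- critic estimates V(s_0), ..., V(s_{T-1})
    bootstrap_value -- critic estimate V(s_T); use 0.0 if s_T is terminal
    """
    advantages = [0.0] * len(rewards)
    next_value, running = bootstrap_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam` to 0 reduces each estimate to the one-step temporal-difference error, while setting it to 1 recovers the full discounted return minus the value baseline.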
We provide the training performance of the HCA framework (HCA-A3C-PPO) and baseline algorithm (A3C-PPO). The HCA framework has been shown to be efficient and more general than the baseline algorithm; as such, we chose an example scenario for use with the two-player tennis competition. To study the training process in more detail, we used TensorBoard (smoothing =0.7) to demonstrate the dynamic rewards, episodes, and policies with four types (type 1, type 2, type 3 and type 4) of observation spaces. In particular, we focus on two indices, cumulative reward and episode length, which represent the mean cumulative episode reward and the mean length of each episode in the environment for all agents, respectively.
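The effect of the smoothing setting on the plotted curves can be approximated with a simple exponential moving average, sketched below. This is an approximation for illustration only: the function name `smooth` is hypothetical, and the exact computation performed by TensorBoard may differ, for example in how the start of the curve is handled.

```python
def smooth(values, weight=0.7):
    """Exponential moving average; a larger weight gives a smoother curve."""
    if not values:
        return []
    smoothed, last = [], values[0]
    for v in values:
        last = last * weight + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed
```

With a weight of 0.7, each plotted point keeps 70% of the previous smoothed value and takes 30% from the new measurement, so short-lived spikes in reward or episode length are damped.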
5.1 Type 1: HCA vs. Baseline
Considering the 10 variable vectors of the observation spaces of the manager, we compare the performance of our HCA framework (HCA-A3C-PPO, pink-red line) and the baseline (A3C-PPO, dark-red line). As shown in Fig. 3, the HCA framework achieved a higher cumulative reward and a longer episode length within a small number of training steps. Both methods trained successfully, as both showed slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.
Figure 3. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.
5.2 Type 2: HCA vs. Baseline
Considering the 16 variable vectors of the observation spaces of the manager, we compare the performance of our HCA framework (HCA-A3C-PPO, orange line) and the baseline (A3C-PPO, blue line). As shown in Fig. 4, the HCA framework achieved a higher cumulative reward and a longer episode length within a small number of training steps. Both methods trained successfully, as both showed slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.
Figure 4. Graphs depicting mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.
5.3 Type 3: HCA vs. Baseline
Considering the 20 variable vectors of the observation spaces of the manager, we compare the performance of our HCA framework (HCA-A3C-PPO, light-blue line) and the baseline (A3C-PPO, dark-blue line). As shown in Fig. 5, the HCA framework achieved a higher cumulative reward and a longer episode length within a small number of training steps. Both methods trained successfully, as both showed slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.
Figure 5. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.
5.4 Type 4: HCA vs. Baseline
Considering the 10 variable vectors (with 5-time-step intervals) of the observation spaces of the manager, we compare the performance of our HCA framework (HCA-A3C-PPO, green line) and the baseline (A3C-PPO, blue line). As shown in Fig. 6, the HCA framework achieved a higher cumulative reward and a longer episode length within a small number of training steps. Both methods trained successfully, as both showed slowly decreasing entropy and an ultimately decreasing magnitude of the policy loss.
Figure 6. Graphs depicting the mean cumulative episodic reward, mean episode length, mean entropy and policy loss (y-axis) with respect to the time steps of the simulation (in thousands, x-axis) during the training process.
In this study, we developed the HCA framework using global information to speed up the learning process and increase the cumulative rewards. Within this framework, the agent is allowed to receive information from local and global critics in a competition task. We tested the proposed framework in a two-player tennis competition task in the Unity environment by comparing with a baseline algorithm: A3C-PPO. The results showed that the HCA framework outperforms the non-hierarchical critic baseline method on MARL tasks.
In future work, we will explore weighted approaches to fuse critics from different layers and consider optimising the temporal scaling in different layers. Furthermore, we will extend the number of agents and the number of layers, and even allow for more than one manager at the highest level of the hierarchy. We expect the possibility, in more exotic circumstances, of considering more general multi-agent reinforcement loops in which each agent can potentially achieve the maximum reward hierarchically.
This work was partially supported by grants from the Australian Research Council under Discovery Projects [DP180100670 and DP180100656], US Army Research Laboratory [W911NF-10-2-0022 and W911NF-10-D-0002/TO 0023], and Australia Defence Science Technology Group.
- [Ahilan and Dayan2019] Sanjeevan Ahilan and Peter Dayan. Feudal multi-agent hierarchies for cooperative reinforcement learning. arXiv preprint arXiv:1901.08492, 2019.
- [Busoniu et al.2010] Lucian Busoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. Innovations in multi-agent systems and applications-1, 310:183–221, 2010.
- [Dayan1993] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
- [Frans et al.2017] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
- [González-Briones et al.2018] Alfonso González-Briones, Fernando De La Prieta, Mohd Mohamad, Sigeru Omatu, and Juan Corchado. Multi-agent systems applications in energy optimization problems: A state-of-the-art review. Energies, 11(8):1928, 2018.
- [Juliani et al.2018] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627, 2018.
- [Kaelbling et al.1996] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
- [Kulkarni et al.2016] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
- [Littman2001] Michael L Littman. Value-function reinforcement learning in markov games. Cognitive Systems Research, 2(1):55–66, 2001.
- [Lowe et al.2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
- [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [Mnih et al.2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- [Nachum et al.2018] Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. arXiv preprint arXiv:1805.08296, 2018.
- [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [Silver et al.2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
- [Vezhnevets et al.2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.