There is a growing interest for Multi-Agent Reinforcement Learning (MARL) as first steps towards building general intelligent agents that learn to make low and high level decisions in non-stationary complex environments in the presence of other agents [10, 6, 7, 9, 2]. Previous results point us towards increased conditions for coordination, efficiency/fairness (see for example  and [3, 4] in which loss averse agents are considered) and common-pool resource sharing [10, 4]. We further study coordination focusing on how individuals specialise in multitask environments where several rewarding tasks can be performed (as in ). Multitask environments are suited to study general intelligence : the ability of agents to perform as well as possible in a big variety of tasks. Agents learning in big multi-agent populations don’t necessarily need to perform well in all tasks, but under certain conditions may specialise. The number of individuals present in the learning may affect or be linked to the cognitive capabilities of each individual (see the relation of these two axis in ). An example of this can be seen in large social insect communities in which individuals specialize and surprisingly can also switch tasks dynamically whenever needed as in ant colonies .
We consider a multitask multi-agent environment in which tasks can be performed sequentially, one after the other (we only use two tasks but could be generalized to any number). The agents have available resources in an area of the environment that need to be carried to the next area and when this is done the resource is assigned the next target area or disappears if in the last area. Concretely, agents will be rewarded for bringing resources of type one from the first area to the second (Task 1) and from the second to the third (Task 2), as shown in Figure 1. Agents can perform 5 actions (turn left and right, move forward, take and drop resource) and have partial observability of the environment (a square of 7x7 centered in its position). This partial observability contains several channels which determine what the agents can perceive: the ids of the different areas together with the walls, the types of resources, the other agent positions. Agents get rewarded for the accomplishment of each task (when dropping a resource in the correct zone) with a reward value associated to the resource that decays over time. The resources, like the agents, spawn at random positions at the beginning of each episode and their value resets when the first task is accomplished. We shaped the reward function in order to boost the learning speed, discouraging with negative rewards useless actions or waiting behaviours.
We study factors that affect specialisation such as agent number, task throughput (limiting the number of times one task can be performed before another task: in our case this means limiting the number of resources that can be in area two, we also call it bottleneck) and reward degradation over time. Agents are independent learners, trained through two state of the art reinforcement learning algorithms: Dueling Double Deep Q-Network (DDDQN) [13, 14] with agent-independent prioritised experience replay buffers  and Asynchronous Advantage Actor-Critic (A2C) 
, both using Convolutional Neural Networks as function approximators. The former is a value-based method, that learns a value function and then acts greedily with respect to it. The policy is intrinsically given by the value function itself and the actions are selected according to a time-decreasing probability called epsilon (
-greedy) that controls whenever they will be sampled according to the current estimated value function or from a uniform random distribution (can be seen to decrease linearly and equally for all agents in Figure 2
). On the other hand, the algorithm A2C, is a mix of value based and policy gradient approaches, in which we have two networks, an actor who maps states to actions and a critic (value based) who learns how well the actor is performing and contributes to improve the policy. This is performed through the actor policy gradient, who aims to change the policy to maximize the value estimations of the critic. In this case the exploration is controlled by the entropy of the policy (how uncertain and similar to a uniform distribution it is), computed in the actor network loss function, without any external parameters.
The specialization of each agent is defined as the absolute difference of the two tasks counts (number of times a task has been done) and divided by their sum: . In this way we will consider an agent more specialized when it mainly performs one task independently of its respawn location.
We expect to observe a progressive specialisation in single tasks with respect to agent number. This result would indicate the advantage of sacrificing general cognitive abilities for the benefit of population efficiency as it happens in ant colonies .
The results we obtained using independent DDDQN learners suggest that, increasing the number of agents, increases their specialization (see Figure 3). Nevertheless, DDDQN is unstable and has difficulties to converge (we measured convergence by looking at the weight updates of the network). On the other hand, using A2C proved to be much more stable and all agent networks converged. We attribute the enhanced convergence of A2C with respect to DDDQN to the fact that the two algorithms have very different exploration strategies. In the case of DDDQN, all agents are synchronized by the external exploration factor
which is the probability of taking a random action. This causes an undesirable situation as all agents are forced to have deterministic policies at the same time and they cannot adapt to each other, thus creating the instabilities and the convergence problems. This is not the case of A2C in which exploration is determined by the action probability distributions of the actor network and vary across agents independently.
Another problem of DDDQN may be its experience replay. In fact it can lead to more non-stationarity since experiences related to the other agents’ behaviour collected in the past and stored in the buffers might be replayed to the network when they are outdated . We can have an idea of this effect looking at Figure 2; the two different plots show respectively the same simulation performed using the two DDDQN and A2C. The former, that uses -greedy as exploration strategy makes all the agents explore at the same time, leading to a drop of the total community reward due to competition when the policy starts to be more deterministic. The total reward will be recovered only when a new equilibrium will be reestablished among the agents. This reason is also probably responsible of the further instability, that is not noticeable in the A2C simulation plot, in which the exploration is connected to the entropy level of the policy, allowing the agents to explore not necessary all at the same time with the same rate. In addition, the positive correlation between number of agents and specialization is confirmed and clearer in the experiments performed with A2C.
The throughput (also called bottleneck) seems to affect fairness and the stability of the learning, probably because the environment will be more difficult to exploit. This may be another reason why the simulations with a stricter (lower) bottleneck had more problems to converge.
The next steps will be to employ A2C algorithm exclusively and perform a wider set of simulations with different environment sizes, number of agents and resources availability. A2C has no convergence problems and we expect to obtain a cleaner specialization trend. Then we could probably be able to link the throughput and the agent number with the specialization and study if there will be a positive correlation, as we expect.
Acknowledgments Research supported by INSOCO-DPI2016-80116-P.
-  (2017) The morphospace of consciousness. arXiv preprint arXiv:1705.11190. Cited by: Extended Abstract.
-  (2018) Modeling the formation of social conventions in multi-agent populations. arXiv preprint arXiv:1802.06108. Cited by: Extended Abstract.
-  (2018) Loss aversion fosters coordination in independent reinforcement learners. Artificial Intelligence Research and Development: Current Challenges, New Trends and Applications 308, pp. 307. Cited by: Extended Abstract.
-  (2018) Inequity aversion resolves intertemporal social dilemmas. arXiv preprint arXiv:1803.08884. Cited by: Extended Abstract.
-  (2007) Universal intelligence: a definition of machine intelligence. Minds and machines 17 (4), pp. 391–444. Cited by: Extended Abstract.
-  (2017) Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473. External Links: Cited by: Extended Abstract.
-  (2017) Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068. Cited by: Extended Abstract.
Asynchronous methods for deep reinforcement learning.
International conference on machine learning, pp. 1928–1937. Cited by: Extended Abstract.
-  (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2681–2690. Cited by: Extended Abstract, Extended Abstract.
-  (2017) A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pp. 3646–3655. Cited by: Extended Abstract.
-  (2019) Statistical physics of liquid brains. Philosophical Transactions of the Royal Society B 374 (1774), pp. 20180376. Cited by: Extended Abstract, Extended Abstract.
-  (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: Extended Abstract.
-  (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: Extended Abstract.
-  (2015) Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581. External Links: Cited by: Extended Abstract.