True human intelligence embraces social and collective wisdom, laying a foundation for general Artificial Intelligence (AI) [Goertzel and Pennachin 2007]. In essence, humans collectively solve problems that would be unthinkable for a single person. For instance, Condorcet's jury theorem [Landemore 2013] shows that, under certain assumptions, adding more voters to a group increases the probability that the majority chooses the correct answer.
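Under the theorem's assumptions (voters decide independently and each is correct with probability p > 1/2), the probability that a strict majority is correct can be computed directly from the binomial distribution; a minimal sketch:

```python
from math import comb

def majority_correct(n, p):
    """Probability that a strict majority of n independent voters,
    each correct with probability p, picks the right answer."""
    k_min = n // 2 + 1  # smallest strict majority (n assumed odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# With p = 0.6, larger juries are right more often,
# approaching certainty as n grows.
for n in (1, 11, 101):
    print(n, round(majority_correct(n, 0.6), 3))
```

This monotone improvement with group size is exactly the collective-wisdom effect motivating the study of large agent populations.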
In parallel, in the coming era of the algorithmic economy, many AI agents work together, creating their own artificial collective intelligence. With certain rudimentary abilities, Artificial Collective Intelligence (ACI) has started to emerge in multiple domains, including stock trading [Wang and others 2013], strategic games [Peng et al. 2017], and city transportation optimization [Zhang et al. 2017].
A key technology behind ACI is multi-agent reinforcement learning (RL), which in this setting typically requires a scale of hundreds to millions of agents [Yang et al. 2017]. However, existing experimentation platforms, including ALE [Bellemare et al. 2013], OpenAI Gym/Universe [Brockman et al. 2016], Malmo [Johnson et al. 2016], ELF [Tian et al. 2017], and SC2LE [Vinyals et al. 2017], fail to meet this demand. Although they have gradually begun to cover multi-agent scenarios, they are designed to host no more than dozens of agents. There is thus a significant need for a platform dedicated to large-population multi-agent reinforcement learning, which is critical for ACI. Moreover, most state-of-the-art multi-agent reinforcement learning algorithms [Lowe et al. 2017] are likewise limited to dozens of agents, so scaling up also remains a great challenge for the research community.
The MAgent Platform
The MAgent project (https://github.com/geek-ai/MAgent) aims to build a many-agent reinforcement learning platform for research on ACI. By sharing network parameters across agents and embedding each agent's ID, MAgent is highly scalable and can host up to one million agents on a single GPU server. Moreover, MAgent provides environment/agent configurations and a reward description language to enable flexible environment design. Finally, MAgent provides a simple yet visually effective renderer that interactively presents the state of the environment and agents. Users can also slide or zoom the viewing window and even manipulate agents in the game.
A large-scale gridworld serves as the fundamental environment for large populations of agents. Each agent has a rectangular body with a detailed local perspective and (optional) global information. Actions include moving, turning, and attacking. A C++ engine supports fast simulation, and heterogeneous agents can be registered in it. The state space, action space, and agents' attributes can be configured easily, so various environments can be developed swiftly.
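Registering heterogeneous agent types with their own perception and combat attributes might look roughly as follows. This is a self-contained illustrative sketch with hypothetical class and field names, not the actual MAgent configuration API:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for a gridworld configuration interface;
# names and attributes are illustrative, not the real MAgent API.
@dataclass
class AgentType:
    name: str
    view_range: int    # radius of the agent's local perspective
    attack_range: int  # 0 means the type cannot attack
    hp: float
    speed: int

@dataclass
class GridWorldConfig:
    width: int
    height: int
    types: dict = field(default_factory=dict)

    def register_agent_type(self, t: AgentType):
        # Heterogeneous types coexist in one environment.
        self.types[t.name] = t
        return t

cfg = GridWorldConfig(width=1000, height=1000)
cfg.register_agent_type(
    AgentType("predator", view_range=5, attack_range=2, hp=10.0, speed=2))
cfg.register_agent_type(
    AgentType("prey", view_range=4, attack_range=0, hp=5.0, speed=1))
```

The point of such a declarative configuration is that a new environment (e.g., faster prey, longer-range predators) is a few changed numbers rather than new engine code.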
Reward Description Language
In our Python interface, users can describe rewards, agent symbols, and events using a description language we developed. When the boolean expression of an event is true, rewards are assigned to the agents involved in that event. Logical operators such as 'and', 'or', and 'not' are also supported. The following is an example describing the rules of the pursuit game presented in the Live and Interactive Part section.
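The pursuit rule could be expressed in this event/reward style roughly as follows. This is a self-contained sketch with stand-in classes to illustrate the idea; the names (`AgentSymbol`, `Event`, `add_reward_rule`) and exact syntax are assumptions, not quoted from MAgent:

```python
# Minimal stand-in for an event-based reward description language.
class AgentSymbol:
    """A symbol binding a rule to agents of a group ('any' = every member)."""
    def __init__(self, group, index="any"):
        self.group, self.index = group, index

class Event:
    """A boolean event expression: subject performs predicate on object."""
    def __init__(self, subject, predicate, obj=None):
        self.subject, self.predicate, self.obj = subject, predicate, obj

rules = []

def add_reward_rule(event, receiver, value):
    # When the event expression is true, each receiver gets its reward.
    rules.append((event, receiver, value))

# Pursuit: a predator attacking a prey earns +1; the attacked prey gets -1.
predator = AgentSymbol("predator", index="any")
prey = AgentSymbol("prey", index="any")
add_reward_rule(Event(predator, "attack", prey),
                receiver=[predator, prey], value=[1, -1])
```

In the actual language, event expressions can additionally be combined with the logical operators mentioned above before rewards are assigned.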
We implement parameter-sharing DQN, DRQN, and A2C on our platform. We found that DQN performs best in our settings and mainly use it in the following experiments. To introduce diversity among agents, we also train an embedding of each agent's unique ID. Users can benchmark their own multi-agent algorithms against these baselines.
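Parameter sharing means all agents act through a single network, while a learned per-ID vector appended to each observation restores behavioral diversity; memory cost is one network plus only a small embedding table. A minimal sketch, with the shared network replaced by a stub (a real implementation would be a trained DQN):

```python
import random

EMBED_DIM = 4
NUM_AGENTS = 1000

# One small vector per agent ID, trained jointly with the shared network.
id_embedding = {i: [random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
                for i in range(NUM_AGENTS)}

def shared_q_network(features):
    # Stub for the single network whose parameters all agents share.
    return sum(features)

def act(agent_id, observation):
    # The ID embedding is the only per-agent input: same observation,
    # different ID -> potentially different behavior.
    features = list(observation) + id_embedding[agent_id]
    return shared_q_network(features)

q0 = act(0, [0.5, 0.1])
q1 = act(1, [0.5, 0.1])
```

The per-agent parameter count (EMBED_DIM values) is what makes scaling to hundreds of thousands of agents tractable, since the expensive network weights are paid for only once.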
Live and Interactive Part
In the demo, we will show some example games conducted on MAgent. Demo visitors can see what happens when deep reinforcement learning is applied to such a large number of agents. They can use our renderer to explore the gridworld and discover intelligent patterns and the diversity of RL agents. Three examples are currently available: pursuit, gathering, and battle. Illustrations are shown in Fig. 1.
Pursuit shows the emergence of local cooperation. Predators receive rewards for attacking prey, while prey are penalized when attacked. After training, predators learn to cooperate with nearby teammates to form several types of enclosures that trap the prey (see Fig. 1(b)), from which they collect reward at every subsequent step.
Gathering shows the emergence of competition in a resource-limited environment. Agents face a dilemma: obtain reward directly by eating food, or kill other agents to monopolize the food. After training, they learn to rush for the food at first; but when two agents come close, they try to kill each other.
Battle shows a hybrid of cooperation and competition. Two armies, each consisting of hundreds of agents, occupy the map. The goal is to collaborate with teammates to eliminate all opponent agents. Simple self-play training enables them to learn both global and local strategies, including encirclement attacks and guerrilla warfare.
In addition, we maintain a human-player interface, so demo visitors can place themselves in this large world by controlling several agents and gaining reward through cooperation or competition with the RL agents.
MAgent is a research platform designed to support many-agent RL. With MAgent, researchers can study a population of AI agents at both the individual and the societal level. In future work, we will support continuous environments and provide more algorithms in MAgent.
- [Bellemare et al.2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. H. 2013. The Arcade Learning Environment: An evaluation platform for general agents. IJCAI.
- [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv.
- [Goertzel and Pennachin2007] Goertzel, B., and Pennachin, C. 2007. Artificial General Intelligence. Springer.
- [Johnson et al.2016] Johnson, M.; Hofmann, K.; Hutton, T.; and Bignell, D. 2016. The Malmo platform for artificial intelligence experimentation. In IJCAI.
- [Landemore2013] Landemore, H. 2013. Democratic Reason: Politics, Collective Intelligence, and the Rule of the Many. Princeton University Press.
- [Lowe et al.2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. NIPS.
- [Peng et al.2017] Peng, P.; Yuan, Q.; Wen, Y.; Yang, Y.; Tang, Z.; Long, H.; and Wang, J. 2017. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv.
- [Tian et al.2017] Tian, Y.; Gong, Q.; Shang, W.; Wu, Y.; and Zitnick, L. 2017. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. arXiv.
- [Vinyals et al.2017] Vinyals, O.; Ewalds, T.; Bartunov, S.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv.
- [Wang and others2013] Wang, C., et al. 2013. Investigating the impact of trading frequencies of market makers: a multi-agent simulation approach. SICE JCMSI.
- [Yang et al.2017] Yang, Y.; Yu, L.; Bai, Y.; Wang, J.; Zhang, W.; Wen, Y.; and Yu, Y. 2017. An empirical study of AI population dynamics with million-agent reinforcement learning. arXiv.
- [Zhang et al.2017] Zhang, L.; Hu, T.; Min, Y.; Wu, G.; Zhang, J.; Feng, P.; Gong, P.; and Ye, J. 2017. A taxi order dispatch model based on combinatorial optimization. In KDD.