1 Introduction
Computer-simulated environments, and particularly games, have played a central role in advancing artificial intelligence (AI). From the early days of machines playing checkers to Deep Blue, and to the more recent accomplishments of Atari bots, AlphaGo, the OpenAI Dota 2 bots, and AlphaStar, artificial game agents have achieved superhuman-level performance even in the most complex games. This progress is mainly due to a combination of advancements in deep learning, tree search, and reinforcement learning (RL) techniques over the past decade.
(Samuel, 1959) used a form of heuristic search combined with RL ideas to solve checkers. IBM Deep Blue followed the tree-search path and was the first artificial game agent to beat the chess world champion, Garry Kasparov (Deep Blue, 1997). A decade later, Monte Carlo Tree Search (MCTS) (Coulom, 2006; Kocsis & Szepesvári, 2006) was a big leap in AI for training game agents. MCTS agents for playing Settlers of Catan were reported in (Szita et al., 2009; Chaslot et al., 2008) and shown to beat previous heuristics. (Heyden, 2009) compares multiple agents on the two-player variant of Carcassonne and discusses variations of MCTS and Minimax search for playing the game. MCTS has also been applied to the game of 7 Wonders (Robilliard et al., 2014) and Ticket to Ride (Huchler, 2015). (Tesauro, 1995), on the other hand, used TD-Lambda, a temporal-difference RL algorithm, to train Backgammon agents at a superhuman level. More recently, deep Q-networks (DQNs) have emerged as a general representation-learning framework that combines Q-learning with function approximation over the pixels in a frame buffer, without the need for task-specific feature engineering (Mnih et al., 2015). (While the original DQNs worked with pixels as the state space, the same idea can be applied to other state representations by changing the network structure appropriately.) The impressive recent progress of RL at solving video games partly owes to the recent abundance of processing power and AI computing technology; the amount of AI compute has been doubling every 3-4 months in the past few years (AI & Compute, 2018).
DeepMind researchers married the two approaches by demonstrating that neural networks and their generalization properties can significantly speed up and scale MCTS. This led to AI agents that play Go at a superhuman level (Silver et al., 2016), and solely via self-play (Silver et al., 2017b, a). Subsequently, OpenAI researchers showed that a policy optimization approach with function approximation, called Proximal Policy Optimization (PPO) (Schulman et al., 2017), can train agents to a superhuman level in Dota 2 (OpenAI Five, 2018). The most recent progress was reported by DeepMind on StarCraft II, where AlphaStar was unveiled to play the game at a superhuman level by combining a variety of techniques, including the use of attention networks (AlphaStar, 2019).
Despite the tremendous success stories of deep RL at solving games, we believe that winning isn’t everything. We consider the alternative problem of training human-like and believable agents that make the video game engaging and fun for human players. As video games have evolved, so have the game graphics and the gameplaying AI, also referred to as game AI. For games with a limited state-action space, such as Atari games, the human-likeness and believability of AI agents are non-issues. Today, we have reached a point where game worlds look very realistic, calling for more intelligent and realistic gameplaying agents.
While traditional game AI solutions already provide excellent experiences for players, it is becoming increasingly difficult to scale those handcrafted solutions up as game worlds become larger, the content becomes more dynamic, and the number of interacting agents increases. This calls for alternative approaches to training human-like and believable game AI. We build on a variety of planning methods and machine learning techniques (including state-of-the-art deep RL) and move away from the recent trend of training superhuman agents when addressing the game AI problem (Zhao et al., 2019).
In this paper, we describe a work-in-progress hierarchical solution to a team sports video game. At a low level, the agents need to take actions that are believable and human-like, whereas at a high level the agents should appear to be following a “game plan”. While imitation learning seems apt for solving the low-level problem, we propose to rely on reinforcement learning and planning to solve the high-level strategic gameplay problem.
The rest of the paper is organized as follows. In Section 2, we provide the basic problem setup. In Section 3, we describe the solution techniques used to solve the problem. In Section 4, we provide a more in-depth presentation of the reinforcement learning techniques used for achieving multi-agent strategic gameplay. Finally, concluding remarks are presented in Section 5.
2 Problem Setup & Related Work
In this paper, we study a team sports video game, where the designer’s goal is to train agents that exhibit strategic teamplay with a high skill level while the agents play like human players. Hence, the solution would entail a variety of techniques, which will be discussed in more detail in this section.
2.1 Multi-agent learning
Our problem naturally lends itself to the multi-agent learning (MAL) framework. In such a framework, iteratively optimizing for a policy could suffer from non-convergence due to the breakdown of the stationarity of the decision process and partial observability of the state space (Littman, 1994; Chang et al., 2004). This is because the environment for each of the agents would change whenever any other agent updates their policy, and hence independent reinforcement learning agents do not work well in practice (Matignon et al., 2012).
More recently, (Lowe et al., 2017) proposed an actor-critic algorithm with a critic that is centralized during training and actors that remain decentralized during both training and execution. (Gupta et al., 2017) compare policy gradient, temporal-difference error, and actor-critic methods for cooperative deep multi-agent reinforcement learning (MARL). See (Hernandez-Leal et al., 2017, 2018) for recent surveys on MAL and deep MARL advancements.
We emphasize that our problem is fundamentally simpler than the general MAL framework. In contrast to robotics problems, where each agent would need to execute its own decentralized policy, in a video game all agents can be trained centrally and executed centrally on a single CPU. However, in the centralized treatment of the problem, the action space grows exponentially with the number of agents on the field. Moreover, the chance of randomly executing a strategic play is very low, so an agent starting from random gameplay would require a huge number of state-action pairs to learn such strategies. We will discuss some of these challenges in Section 4.
2.2 Learning from demonstrations
To ensure human-like behavior, we use human demonstrations in the training loop. There are three general ways of using the demonstrations to train agents. Inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) would infer reward functions that promote the observed behavior in demonstrations, which can then be used in model-free RL. However, IRL is by nature an ill-posed inverse problem and tricky to solve, especially in a multi-agent framework. (Ho & Ermon, 2016) proposed a direct approach to distilling a policy from the demonstrations using adversarial training, which has recently been extended to the multi-agent case (Yu et al., 2019).
It is also possible to use demonstrations to guide RL. (Levine & Koltun, 2013) train off-policy RL using demonstrations. (Silver et al., 2016) use behavioral cloning of expert games to initialize the policy network used to solve Go, and (AlphaStar, 2019) builds on the same thought process. (Večerík et al., 2017; Harmer et al., 2018) use demonstrations in the replay buffer to guide the policy to a better local optimum. (Rajeswaran et al., 2017; Peng et al., 2018) shape the reward function to promote actions that mimic the demonstrator. (Kartal et al., 2019) use demonstrations to teach the policy to avoid catastrophic events in the game of Pommerman, where model-free RL fails.
2.3 Hierarchical learning
To manage the complexity of the posed problem (see Section 3), our solution involves a hierarchical approach. (Le et al., 2018) consider a hierarchical approach where the underlying low-level actions are learned via RL whereas the high-level goals are picked up via IL from human demonstrations. This is in contrast to the hierarchical approach that we consider in this paper, where we use IL at the low level to achieve human-like behavior. (Pang et al., 2018) reduce the complexity of the StarCraft learning environment (Vinyals et al., 2017) by decomposing the problem into a hierarchy of simpler learning tasks. (Bacon et al., 2017) apply a planning layer on top of RL, where they also infer the abstractions from the data. Finally, (Vezhnevets et al., 2017) consider a bi-level neural network architecture where, at the top level, the Manager sets goals at a low temporal resolution, and at the low level the Worker produces primitive actions conditioned on the high-level goals at a high temporal resolution. More recently, (Zhan et al., 2018) provide a hierarchical generative model for achieving human-like gameplay using weak supervision.
2.4 Human-Robot Interaction
The human-robot interaction problem shares many similarities with the problem at hand (Sheridan, 2016). However, training agents in video games is simpler in many ways. First, the agents can execute their policies centrally and there is no need for decentralized execution. Second, extracting semantic information from sensory signals such as processing images/videos and text-to-speech conversion is not needed as all of the semantic information is available from the game engine. On the other hand, many of the sample efficient learning techniques designed for training robots are applicable to training agents in team sports video games as well (Doering et al., 2019).
3 Solution Techniques
End-to-end model-free RL requires millions of state-action pairs, equivalent to many years of experience, for the agent to reach human-level performance. (For instance, AlphaStar was trained using the equivalent of 60,000 years of human experience.) Applying these same techniques to modern complex games for playtesting and game AI requires obtaining and processing hundreds of years of experience, which is only feasible using significant cloud infrastructure costing millions of dollars (AlphaStar, 2019; Vinyals et al., 2017). Hence, we move away from end-to-end solutions in favor of hierarchical ones, breaking the complex problem into a hierarchy of simpler learning tasks.
We assume multiple levels of problem abstraction in a team sports game. At the lowest level, the agents’ actions and movements should resemble those of actual human players. At the highest level, the agents should learn how to follow a (learned) high-level game plan. At the mid-level, the agents should learn to exhibit skill and to coordinate their movements with each other, e.g., to complete successful passes or to shoot toward the opponent’s goal when they have a good chance of scoring.
While making RL more sample efficient is an active area of research (e.g., via curiosity-driven exploration (Pathak et al., 2017)), applying RL to modern team sports video games, or to any part of the problem, requires shaping rewards that promote a certain style of human-like behavior given human demonstrations. Reward shaping in this setup is extremely challenging because the objectives are mathematically vague, and it must additionally capture human-like cooperation and conflict in multi-agent strategic gameplay. Hence, we rely on imitation learning and behavior cloning (such as DAGGER (Ross et al., 2011), learning from play (LFP) (Lynch et al., 2019), or GAIL (Ho & Ermon, 2016)) to achieve human-like low-level actions, while we rely on RL to achieve a high skill level at the top. In this paper, we leave out the details of the imitation learning used to train the low-level tactics and focus only on the mid-level strategic gameplay.
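As a concrete illustration of this division of labor, the following Python sketch shows one possible way to wire an imitation-learned low-level controller under an RL-trained high-level policy. The class and method names are illustrative assumptions, not our actual implementation.

```python
# Minimal sketch of the hierarchical control flow, assuming a high-level
# policy that picks strategic goals and a low-level imitation policy that
# produces human-like primitive actions. Names are illustrative only.

class HierarchicalAgent:
    def __init__(self, high_level_policy, low_level_policy, goal_horizon=8):
        self.high = high_level_policy      # RL-trained: state -> strategic goal
        self.low = low_level_policy        # imitation-learned: (state, goal) -> action
        self.goal_horizon = goal_horizon   # re-plan the goal every few ticks
        self._goal = None
        self._ticks_since_goal = 0

    def act(self, state):
        # Pick a new high-level goal at a low temporal resolution.
        if self._goal is None or self._ticks_since_goal >= self.goal_horizon:
            self._goal = self.high.select_goal(state)
            self._ticks_since_goal = 0
        self._ticks_since_goal += 1
        # The low-level policy produces a primitive, human-like action
        # conditioned on the current strategic goal.
        return self.low.select_action(state, self._goal)
```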
To achieve faster convergence, we rely on curriculum learning (Bengio et al., 2009) in our training. We start by exposing the agent to easy situations more often and then make the environment progressively more difficult. For example, the agent learns fairly quickly to shoot when the opponent’s net is undefended, whereas learning when to shoot while the opponent is actively defending the net is much harder. We also train the agent against simpler existing game AI agents first and only raise the AI difficulty once the agent has learned the basics. Similar approaches are reported in (Yang et al., 2018) to achieve cooperation in simulating self-driving cars, and in (Gao et al., 2019) to solve the game of Pommerman.
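The sketch below illustrates one way such a curriculum could be scheduled, promoting the agent to a harder stage once its evaluation score rate passes a threshold. The stage definitions and the promotion threshold are assumptions for illustration, not the exact values used in our experiments.

```python
# Illustrative curriculum: easy scenarios (open net, weak opponent AI) first,
# then harder ones. Stage names and the 60% promotion threshold are assumed.

CURRICULUM = [
    {"scenario": "open_net_shooting",     "opponent_ai_level": 0},
    {"scenario": "defended_net_shooting", "opponent_ai_level": 1},
    {"scenario": "full_1v1_game",         "opponent_ai_level": 2},
]

def run_curriculum(train_fn, evaluate_fn, promote_at=0.6, max_iters=100):
    """train_fn runs a block of RL updates; evaluate_fn returns a score rate in [0, 1]."""
    stage = 0
    for _ in range(max_iters):
        config = CURRICULUM[stage]
        train_fn(config)                  # e.g., a fixed number of DQN/PPO updates
        score_rate = evaluate_fn(config)  # fraction of evaluation episodes the agent scores
        if score_rate >= promote_at and stage < len(CURRICULUM) - 1:
            stage += 1                    # move on to the harder scenario
    return stage
```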
To deal with the MAL aspect of the problem, as a first step we train agents one at a time within the team and let them blend into the overall strategic gameplay. Of course, this process offers little control, and reward shaping becomes highly dependent on the behavior of the other agents in the environment. While we have not yet implemented centralized training of multiple agents, this is the immediate problem we are tackling now. We will also have to solve the credit assignment problem in MARL for each individual agent’s behavior (Devlin et al., 2014). We remind the reader that the goal is to provide a viable approach to solving the problem with a reasonable amount of computational resources.
Last but not least, we also move away from using the raw state space of screen pixels. Instead, we provide the agent with any additional form of information that could ease training and might otherwise be hard to infer from the screen pixels. Our ultimate goal is to train human-like agents with believable behavior. Thus, so long as the agents pass the Turing test, we are not alarmed by the unfair extra information at their disposal during training. Furthermore, in the game development stage, the game itself is a moving target: multiple parameters and attributes (particularly those related to graphics) may change between builds, so it is desirable to train agents on more stable features rather than screen pixels.
4 Strategic Gameplay via RL

Our training takes place on a mid-level simulator, which we call the simple team sports simulator (STS2) and intend to release as an open-source package. A screenshot of STS2 gameplay is shown in Fig. 1. The simulator embeds the rules of the game and the physics at a high level, abstracting away the low-level tactics. The simulator supports KvK matches for any positive integer K. The two teams are shown as red (home) and white (away). Each of the players can be controlled by a human, the traditional game AI, or any other learned policy. The traditional game AI consists of a handful of rules and constraints that govern the gameplay strategy of the agents. The STS2 state space consists of the coordinates of the players, their velocities, and an indicator for possession of the ball. The action space is discrete, consisting of left, right, forward, backward, pass, and shoot. Although a player could press two or more of these actions together, we do not consider that possibility, in order to keep the action space small for better scalability.
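To make the interface concrete, here is a minimal interaction-loop sketch over the state and action spaces just described. The gym-style reset/step API is an assumption for illustration; the released STS2 package may expose a different interface.

```python
# Sketch of an agent/environment loop over the STS2 action space described
# above. The reset()/step() API is assumed (gym-style), not the actual STS2 API.
import random

ACTIONS = ["left", "right", "forward", "backward", "pass", "shoot"]

def random_rollout(env, max_ticks=1000):
    """Run one episode with uniformly random discrete actions."""
    state = env.reset()   # expected: player (x, y) positions, velocities, possession flag
    total_reward = 0.0
    for _ in range(max_ticks):
        action = random.choice(ACTIONS)
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```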
We currently use this mid-level simulator to inform passing and shooting decisions in the low-level imitation learning. In the rest of this section, we report our progress toward applying deep RL in the STS2 environment to achieve multi-agent gameplay. Future work will entail a better integration between these levels of abstraction.
4.1 Single agent in a 1v1 game
As the simplest first experiment, we consider training an agent that learns to play against the traditional game AI in a 1v1 match. We start with a sparse reward function of +1 for scoring and -1 for being scored against. We used DQN (Mnih et al., 2015), Rainbow (Hessel et al., 2017), and PPO (Schulman et al., 2017) to train agents that replace the home team (player). DQN shows the best sign of learning useful policies after the equivalent of 5 years of human gameplay experience. The gameplay statistics of the DQN agent are reported in Table 1. As can be seen, the DQN agent loses roughly 1:4 to the traditional AI. Note that we randomize the orientation of the agents at the beginning of each episode, and hence the agent encounters several easy situations with an open net for scoring. On the other hand, the agent does not learn how to play defensively when the opponent is in possession of the ball. In fact, we believe that a successful defensive strategy is more difficult to learn than an offensive one.
Table 1: 1v1 gameplay statistics of the DQN agent (sparse scoring reward) against the traditional game AI.

| Statistics | DQN Agent | Trad. Game AI |
|---|---|---|
| Score rate | 22% | 78% |
| Possession | 36% | 64% |
Next, we shape the rewarding mechanism with the goal of training agents that also learn how to play defensively. In addition to the +/-1 scoring reward, we reward the agent with +0.8 for gaining possession of the ball and -0.8 for losing it. The statistics of the DQN agent are reported in Table 2. In this case, we observe that the DQN agent learns to play the game with an offensive style of chasing the opponent down, gaining the ball, and attempting to shoot. Its score rate against the traditional game AI is 4:1, and it dominates the game.
Table 2: 1v1 gameplay statistics of the DQN agent (scoring and possession rewards) against the traditional game AI.

| Statistics | DQN Agent | Trad. Game AI |
|---|---|---|
| Score rate | 80% | 20% |
| Possession | 65% | 35% |
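For concreteness, a minimal sketch of this shaped reward is given below, assuming the simulator exposes per-tick event flags for scoring and possession changes; the flag names are illustrative assumptions.

```python
# Sketch of the shaped 1v1 reward: +/-1 for scoring plus +/-0.8 for
# gaining/losing possession of the ball. Event-flag names are assumed.

def shaped_reward_1v1(event):
    reward = 0.0
    if event.get("agent_scored"):
        reward += 1.0
    if event.get("opponent_scored"):
        reward -= 1.0
    if event.get("agent_gained_possession"):
        reward += 0.8
    if event.get("agent_lost_possession"):
        reward -= 0.8
    return reward
```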
We repeated this experiment using PPO and Rainbow as well. We observe that the PPO agent’s policy quickly converges to a degenerate one: when in possession of the ball, it wanders around in its own half without attempting to cross the half-line or to shoot until the game times out. This happens because the traditional game AI is programmed not to chase the opponent in their own half while the opponent possesses the ball, and hence the game proceeds as described until timeout with no scoring on either side. PPO has clearly reached a local optimum in the space of policies, which is not unexpected as it optimizes the policy directly. Finally, the Rainbow agent does not learn a useful policy for either offense or defense.
As the last 1v1 experiment, we train a PPO agent against the aforementioned DQN agent with exactly the same reward function. The gameplay statistics are reported in Table 3. We observe that the PPO agent is no longer stuck in a local optimum policy, and it dominates the DQN agent with a score rate of 6:1. Notice that this is not a fair comparison: the DQN agent was only trained against the traditional game AI agent and had never played against the PPO agent, whereas the PPO agent is trained directly against the DQN agent. While the PPO agent dominates the score rate, we observe that the game is much more even in terms of possession of the ball.
Table 3: 1v1 gameplay statistics of the PPO agent trained against the fixed DQN agent.

| Statistics | PPO Agent | DQN Agent |
|---|---|---|
| Score rate | 86% | 14% |
| Possession | 55% | 45% |
Note that in this experiment the DQN agent is fixed, i.e., no longer learning, and its policy is deterministic, which makes it an easy opponent for the PPO agent to overfit to and exploit.
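A minimal sketch of this setup is shown below: the learner controls the home player while a frozen, previously trained policy controls the opponent. The function and dictionary keys are illustrative assumptions about the environment interface.

```python
# Sketch of training against a frozen opponent: the learner (e.g., PPO) keeps
# updating while the opponent policy (e.g., the earlier DQN agent) is fixed.
# The dict-keyed step() interface is an assumption for illustration.

def rollout_vs_frozen_opponent(env, learner, frozen_opponent, max_ticks=1000):
    state = env.reset()
    transitions = []
    for _ in range(max_ticks):
        home_action = learner.act(state)          # still learning
        away_action = frozen_opponent.act(state)  # fixed, never updated
        next_state, reward, done, info = env.step(
            {"home": home_action, "away": away_action})
        transitions.append((state, home_action, reward, next_state, done))
        state = next_state
        if done:
            break
    return transitions  # consumed by the learner's update step elsewhere
```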
4.2 Single agent in a 2v2 game
Having gained some confidence with single-agent training, we consider training a single agent in a 2v2 game as the simplest multi-agent experiment. We let the traditional game AI control the opponent players as well as the teammate player. The first experiment entails a +/-0.8 team reward for any player on the team gaining/losing the ball, in addition to the +/-1 reward for scoring. The agent does not learn a useful defensive or offensive policy, and the team loses overall.
In the second experiment, we change the rewarding mechanism to a +/-0.8 individual reward for the agent itself gaining/losing the ball. This turns the agent into an offensive player that chases the opponent down, gains the ball, and attempts to shoot. The team statistics for this agent are shown in Table 4. We observe that the agent has learned an offensive gameplay style where it scores most of the time.
Table 4: 2v2 gameplay statistics of a single DQN agent (individual possession reward) with a traditional game AI teammate against two traditional game AI opponents.

| Statistics | DQN Agent | Teammate | Opponent 1 | Opponent 2 |
|---|---|---|---|---|
| Score rate | 54% | 20% | 13% | 13% |
| Possession | 30% | 18% | 26% | 26% |
While the team is winning in the previous case, we observe that the teammate barely participates in the game, with even less possession of the ball than the opponent players. Next, we explore training an agent that can help the teammate score and possess the ball. We add another teammate reward of -0.8, incurred whenever the teammate loses the ball. The difference from the team reward (which resulted in an agent that learned neither defensive nor offensive policies) is that the agent receives no reward when the teammate gains the ball from the opponents. The gameplay statistics of this team are reported in Table 5. In terms of gameplay, we observe that the agent spends more time defending its own goal and, when it gains possession of the ball, passes it to the teammate to score.
Table 5: 2v2 gameplay statistics of a single DQN agent (individual possession reward plus teammate-loss penalty) with a traditional game AI teammate against two traditional game AI opponents.

| Statistics | DQN Agent | Teammate | Opponent 1 | Opponent 2 |
|---|---|---|---|---|
| Score rate | 20% | 46% | 17% | 17% |
| Possession | 36% | 22% | 21% | 21% |
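The three 2v2 reward variants discussed above can be summarized in the sketch below; only the last variant produced the assisting behavior reported in Table 5. The event-flag names are illustrative assumptions about what the simulator exposes.

```python
# Sketch of the 2v2 reward variants: team possession reward, individual
# possession reward, and individual reward plus a teammate-loss penalty.
# Event-flag names are assumed for illustration.

def reward_2v2(event, variant):
    # Scoring reward shared by all variants.
    r = 1.0 * bool(event.get("team_scored")) - 1.0 * bool(event.get("team_conceded"))
    if variant == "team_possession":
        # +/-0.8 whenever any player on the team gains/loses the ball.
        r += 0.8 * bool(event.get("team_gained_ball")) - 0.8 * bool(event.get("team_lost_ball"))
    elif variant == "individual_possession":
        # +/-0.8 only when the learning agent itself gains/loses the ball.
        r += 0.8 * bool(event.get("agent_gained_ball")) - 0.8 * bool(event.get("agent_lost_ball"))
    elif variant == "individual_plus_teammate_penalty":
        r += 0.8 * bool(event.get("agent_gained_ball")) - 0.8 * bool(event.get("agent_lost_ball"))
        # Penalize the teammate losing the ball, but give no bonus when the
        # teammate gains it from the opponents.
        r -= 0.8 * bool(event.get("teammate_lost_ball"))
    return r
```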
4.3 Two agents trained separately in a 2v2 game
After successfully training a single agent in a 2v2 game, we train a second agent for the home team while reusing one of the previously trained agents as its teammate. For this experiment, we choose the offensive DQN agent from the previous set of experiments as the fixed teammate and train the new agent with exactly the same reward function as that offensive DQN agent. The statistics of the two agents playing together against the traditional game AI agents are shown in Table 6. Although the second agent is trained with the same reward function as the first one, it is trained in a different environment: its teammate is now the offensive DQN agent from the previous experiment rather than the traditional game AI agent. As can be seen, the second agent becomes defensive and is more interested in protecting the net, regaining possession of the ball, and passing it to the offensive teammate.
Table 6: 2v2 gameplay statistics of two separately trained DQN agents against two traditional game AI opponents.

| Statistics | DQN 1 | DQN 2 | Opponent 1 | Opponent 2 |
|---|---|---|---|---|
| Score rate | 50% | 26% | 12% | 12% |
| Possession | 28% | 22% | 25% | 25% |
As the second 2v2 experiment, we train two PPO agents in exactly the same manner as we trained the DQN agents in the previous experiment. We observe a similar split of roles into an offensive and a defensive agent. We then let the PPO team play against the DQN team and observe that the PPO team defeats the DQN team by a slight edge, 55:45. While this experiment is a fair comparison between PPO and DQN, we emphasize that both teams were trained against the traditional game AI agents and are now playing in a new environment; in a sense, this measures how well the learned policies generalize to environments they have not experienced before. Training with DQN (Mnih et al., 2015) converged using the equivalent of 5 years of human experience, whereas PPO (Schulman et al., 2017) was an order of magnitude faster on all of the experiments, reaching convergence within the equivalent of 6 months of human experience.
We repeated all of these experiments using Rainbow (Hessel et al., 2017) agents as well, and they failed in all of them. We suspect that the default hyperparameters of distributional RL (Bellemare et al., 2017) or prioritized experience replay (Schaul et al., 2015) are not suited to this problem; however, we are still investigating which addition in Rainbow causes the algorithm to fail in the described team sports environment.
4.4 Two agents trained simultaneously in a 2v2 game
Finally, we consider centralized training of the two home agents, where a single policy controls both of them at the same time. We tried multiple reward functions, including rewarding the team with +1 for scoring, -1 for being scored against, +0.8 for gaining possession of the ball, and -0.8 for losing possession of the ball. We observed that neither algorithm learned a useful policy in this case. We believe that with a higher-level planner on top of the reinforcement learning we should be able to train the agents to exhibit teamplay, but that remains for future investigation. We are currently looking into centralized training of actor-critic methods in this environment.
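To illustrate what centralized control of the two home agents looks like in this setup, the sketch below has a single policy select a joint action from the Cartesian product of the per-player action sets (6 x 6 = 36 joint actions). The policy interface and the dict-keyed step() call are assumptions for illustration, not the actual implementation.

```python
# Sketch of centralized 2v2 control: one policy picks a joint action for both
# home players from the product of the individual action sets. Names assumed.
from itertools import product

ACTIONS = ["left", "right", "forward", "backward", "pass", "shoot"]
JOINT_ACTIONS = list(product(ACTIONS, repeat=2))  # 6 x 6 = 36 joint actions

def centralized_step(env, joint_policy, state):
    idx = joint_policy.act(state)          # single index into the joint action space
    action_p1, action_p2 = JOINT_ACTIONS[idx]
    return env.step({"home_1": action_p1, "home_2": action_p2})
```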
5 Concluding Remarks & Future Work
In this paper, we consider a team sports game where the goal is to train agents that play like humans, both in terms of tactics and strategy. We presented a hierarchical approach to solving the problem, where the low-level problem is solved via imitation learning and the high-level problem is addressed via reinforcement learning. We focus on strategy using a mid-level simulator, called the simple team sports simulator (STS2), which we intend to release as an open-source repository. Our main takeaways are summarized below:
- End-to-end model-free RL is unlikely to provide human-like and believable agent behavior, so we resort to a hierarchical approach that uses demonstrations to solve the problem.
- Sparse rewards for scoring do not provide a sufficient signal for training agents, even at a high level, which required us to apply more refined reward shaping.
- Using proper reward shaping, we trained agents with a variety of offensive and defensive styles. In particular, we trained an agent that can assist the teammate player to achieve better scoring and ball possession.
- Rainbow (Hessel et al., 2017) failed at training agents in this environment, and we are investigating why this happens.
In future work, we will work on better integrating the mid-level simulation results with the low-level imitation-learned model. We also plan to better understand and explore multi-agent credit assignment in this environment (Devlin et al., 2014), and to investigate transfer learning for translating the policies from this environment to the actual HD game (Andrychowicz et al., 2018). Finally, we plan to further explore centralized training of the multi-agent policy using QMIX (Rashid et al., 2018) and centralized actor-critic methods (Foerster et al., 2018).
Acknowledgements
The authors would like to thank Bilal Kartal (Borealis AI) and Jiachen Yang (Georgia Tech) for useful discussions and feedback. The authors are also thankful to the anonymous reviewers for their valuable feedback.
References
- Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004.
- AI & Compute (2018) AI & Compute, 2018. [Online, May 2018] https://blog.openai.com/ai-and-compute.
- AlphaStar (2019) AlphaStar, 2019. [Online, January 2019] https://tinyurl.com/yc2knerv.
- Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
- Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. JMLR. org, 2017.
- Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
- Chang et al. (2004) Chang, Y.-H., Ho, T., and Kaelbling, L. P. All learning is local: Multi-agent learning in global reward games. In Advances in neural information processing systems, pp. 807–814, 2004.
- Chaslot et al. (2008) Chaslot, G., Bakkes, S., Szita, I., and Spronck, P. Monte-carlo tree search: A new framework for game ai. In AIIDE, 2008.
- Coulom (2006) Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83. Springer, 2006.
- Deep Blue (1997) Deep Blue, 1997. [Online] http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue.
- Devlin et al. (2014) Devlin, S., Yliniemi, L., Kudenko, D., and Tumer, K. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pp. 165–172. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
- Doering et al. (2019) Doering, M., Glas, D. F., and Ishiguro, H. Modeling interaction structure for robot imitation learning of human social behavior. IEEE Transactions on Human-Machine Systems, 2019.
- Foerster et al. (2018) Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Gao et al. (2019) Gao, C., Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Skynet: A top deep rl agent in the inaugural pommerman team competition. In 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making, 2019.
- Gupta et al. (2017) Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Springer, 2017.
- Harmer et al. (2018) Harmer, J., Gisslen, L., del Val, J., Holst, H., Bergdahl, J., Olsson, T., Sjoo, K., and Nordin, M. Imitation learning with concurrent actions in 3d games, 2018.
- Hernandez-Leal et al. (2017) Hernandez-Leal, P., Kaisers, M., Baarslag, T., and de Cote, E. M. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
- Hernandez-Leal et al. (2018) Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Is multiagent deep reinforcement learning the answer or the question? a brief survey. arXiv preprint arXiv:1810.05587, 2018.
- Hessel et al. (2017) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
- Heyden (2009) Heyden, C. Implementing a computer player for Carcassonne. PhD thesis, Maastricht University, 2009.
- Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
- Huchler (2015) Huchler, C. An mcts agent for ticket to ride. Master’s thesis, Maastricht University, 2015.
- Kartal et al. (2019) Kartal, B., Hernandez-Leal, P., Gao, C., and Taylor, M. E. Safer deep RL with shallow MCTS: A case study in Pommerman. arXiv preprint arXiv:1904.05759, 2019.
- Kocsis & Szepesvári (2006) Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European conference on machine learning, pp. 282–293. Springer, 2006.
- Le et al. (2018) Le, H., Jiang, N., Agarwal, A., Dudik, M., Yue, Y., and Daumé, H. Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning, pp. 2923–2932, 2018.
- Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.
- Littman (1994) Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157–163. Elsevier, 1994.
- Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
- Lynch et al. (2019) Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
- Matignon et al. (2012) Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Ng & Russell (2000) Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In in Proc. 17th International Conf. on Machine Learning, 2000.
- OpenAI Five (2018) OpenAI Five, 2018. [Online, June 2018] https://openai.com/five.
- Pang et al. (2018) Pang, Z.-J., Liu, R.-Z., Meng, Z.-Y., Zhang, Y., Yu, Y., and Lu, T. On reinforcement learning for full-length game of starcraft. arXiv preprint arXiv:1809.09095, 2018.
- Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2778–2787. JMLR. org, 2017.
- Peng et al. (2018) Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):143, 2018.
- Rajeswaran et al. (2017) Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
- Rashid et al. (2018) Rashid, T., Samvelyan, M., Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4292–4301, 2018.
- Robilliard et al. (2014) Robilliard, D., Fonlupt, C., and Teytaud, F. Monte-carlo tree search for the game of “7 wonders”. In Computer Games, pp. 64–77. Springer, 2014.
- Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
- Samuel (1959) Samuel, A. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–29, July 1959.
- Schaul et al. (2015) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sheridan (2016) Sheridan, T. B. Human–robot interaction: status and challenges. Human factors, 58(4):525–532, 2016.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Silver et al. (2017a) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.
- Silver et al. (2017b) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017b.
- Szita et al. (2009) Szita, I., Chaslot, G., and Spronck, P. Monte-carlo tree search in settlers of catan. In Advances in Computer Games, pp. 21–32. Springer, 2009.
- Tesauro (1995) Tesauro, G. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–69, 1995.
- Večerík et al. (2017) Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
- Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3540–3549. JMLR. org, 2017.
- Vinyals et al. (2017) Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
- Yang et al. (2018) Yang, J., Nakhaei, A., Isele, D., Zha, H., and Fujimura, K. CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. arXiv preprint arXiv:1809.05188, 2018.
- Yu et al. (2019) Yu, L., Song, J., and Ermon, S. Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning, pp. 7194–7201, 2019.
- Zhan et al. (2018) Zhan, E., Zheng, S., Yue, Y., Sha, L., and Lucey, P. Generating multi-agent trajectories using programmatic weak supervision. arXiv preprint arXiv:1803.07612, 2018.
- Zhao et al. (2019) Zhao, Y., Borovikov, I., Beirami, A., Rupert, J., Somers, C., Harder, J., Silva, F. d. M., Kolen, J., Pinto, J., Pourabolghasem, R., et al. Winning Isn’t Everything: Training Human-Like Agents for Playtesting and Game AI. arXiv preprint arXiv:1903.10545, 2019.