1 Introduction
With the rapid development of deep reinforcement learning, AI has already beaten human experts in perfect-information games such as Go. However, researchers have not made comparable progress in imperfect-information games such as StarCraft or Dota. To guarantee the effectiveness of a trained model, training and results should be evaluated in a theoretical and quantitative way, but this evaluation is often neglected.
Game theory [14] is the cornerstone for modeling human behavior in real-world competition. It studies how agents can maximize their own interests through competition and cooperation, and it can measure the quality of decisions in a game. It has become an attractive research topic in computer science: the intersection field, called "algorithmic game theory", has been established [8] and interacts increasingly with the development of artificial intelligence [17, 1]. Its main motivation is to make real-world complex problems, such as transactions and traffic control, work in practice.
In game theory, a Nash Equilibrium [14] is an optimal solution of a game, i.e. no player can gain extra profit by unilaterally deviating from their policy. Fictitious play [2] is a traditional algorithm for finding Nash Equilibria in normal-form games: fictitious players repeatedly choose the best response to their opponents' average strategy, and the players' average strategies converge to a Nash Equilibrium. Heinrich et al. [5] proposed Extensive-Form Fictitious Play, extending the idea of fictitious play to extensive-form games. However, states are represented as a lookup table at each tree node, so generalization training (across similar states) is impractical, and updating the average policy requires traversing the whole game tree, which leads to the curse of dimensionality in large games. Fictitious Self-Play (FSP) [6] addresses these problems by introducing a sample-based machine learning approach: the approximation of the best response is learned by reinforcement learning, and the update of the average strategy is performed by sample-based supervised learning. However, because of the sampling control, the interaction between agents is managed by a central controller.
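As a concrete sketch of the fictitious play idea above (function names and iteration counts are ours, not from the cited works), each player best-responds to the opponent's empirical average strategy, and the averages approach equilibrium:

```python
import numpy as np

def fictitious_play(payoff, iterations=2000):
    """Two-player fictitious play on a zero-sum normal-form game.

    payoff[i][j] is the row player's payoff when row plays i and column plays j.
    Each player repeatedly best-responds to the opponent's empirical (average)
    strategy; the returned average strategies approximate a Nash Equilibrium.
    """
    n, m = payoff.shape
    row_counts = np.ones(n)   # action counts, initialised uniformly
    col_counts = np.ones(m)
    for _ in range(iterations):
        col_avg = col_counts / col_counts.sum()
        row_avg = row_counts / row_counts.sum()
        # best response to the opponent's average strategy
        row_counts[np.argmax(payoff @ col_avg)] += 1
        col_counts[np.argmin(row_avg @ payoff)] += 1  # column minimises row payoff
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching Pennies: the unique Nash equilibrium is (0.5, 0.5) for both players.
pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
row, col = fictitious_play(pennies)
```

On Matching Pennies the per-iteration play cycles, but the empirical averages drift toward the uniform equilibrium, which is the convergence property fictitious play relies on.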
Heinrich and Silver [6] introduced Neural Fictitious Self-Play (NFSP), which combines FSP with neural network function approximation. Each player consists of a Q-learning network and a supervised learning network. The algorithm computes an approximate best response by ε-greedy deep Q-learning, as well as an average strategy by supervised learning over the agent's past behavior. It solves the coordination problem by introducing anticipatory dynamics: players behave according to a mixture of their average policy and best response. It is the first end-to-end reinforcement learning method that learns an approximate Nash Equilibrium in imperfect-information games without any prior knowledge. However, NFSP performs poorly in games with a large-scale search space and search depth, because opponents' strategies are complex and DQN learns in an off-policy (offline) mode. In this paper, we propose Monte Carlo Neural Fictitious Self-Play (MC-NFSP), which combines NFSP with Monte Carlo Tree Search [4]. We evaluate our method on various two-player zero-sum games, and show experimentally that MC-NFSP converges to an approximate Nash Equilibrium in Othello while NFSP does not.
Another drawback is that in NFSP the calculation of the best response relies on deep Q-learning, which takes a long time to converge. In this paper, we propose Asynchronous Neural Fictitious Self-Play (ANFSP), which uses parallel actor-learners to stabilize and speed up training. Multiple players choose actions in parallel on multiple copies of the environment. Players share the Q-learning network and the supervised learning network, accumulate gradients over multiple steps for Q-learning, and compute minibatch gradients for supervised learning. This reduces the memory needed for data storage compared to NFSP. We evaluate our method on two-player zero-sum poker games, and show that ANFSP approaches an approximate Nash Equilibrium more stably and quickly than NFSP.
To demonstrate the advantages of MC-NFSP and ANFSP in a more complex game, we also evaluated their effectiveness in an FPS team combat game, in which an AI agent team fights a human team. Our system provided good tactical strategies and control policies to the agent team and helped it beat the humans.
2 Background
In this section we briefly introduce related game theory concepts, current AI systems for games, the relationship between reinforcement learning and Nash Equilibrium, and finally the Neural Fictitious Self-Play (NFSP) technique. For a fuller introduction we refer the reader to [21, 13, 6].
2.1 Related Game Theory Concepts
Game in Study. In this paper, we mainly study two-player imperfect-information zero-sum games. A zero-sum game is a game in which the sum of the players' payoffs is zero, and an imperfect-information game is a game in which each player only observes part of the game state; examples include Texas Hold'em, real-time strategy games and FPS games. Such a game is often represented in "normal form", a representation schema that lists the payoffs players receive as a function of their joint actions in a matrix. In the games we study, players take actions simultaneously, and the goal of each player is to maximize their own payoff. Let $\pi_i$ denote the action distribution of player $i$ given the information set they observe, $\pi$ the strategy profile of all players, $A_i$ the action set of player $i$, $\pi_{-i}$ the strategies in $\pi$ other than $\pi_i$, and $R_i(\pi)$ the expected payoff of player $i$ under profile $\pi$. The set of best responses of player $i$ to the opponents' strategies $\pi_{-i}$ is $b_i(\pi_{-i}) = \arg\max_{\pi_i} R_i(\pi_i, \pi_{-i})$; the set of $\epsilon$-best responses $b_i^{\epsilon}(\pi_{-i})$ contains all strategies whose payoff against $\pi_{-i}$ is suboptimal by no more than $\epsilon$.
Nash equilibrium. A Nash equilibrium is a strategy profile in which no player can obtain a higher payoff by changing their own strategy while the others keep theirs unchanged. Nash proved that if mixed strategies are allowed, every game with a finite number of rational players, each choosing from finitely many pure strategies, has at least one Nash equilibrium.
Exploitability evaluates the distance between a strategy profile and a Nash equilibrium, and can be measured from the best-response payoffs of both sides. For two-player zero-sum games, a profile $\pi$ is exploitable by $\epsilon$ if and only if
$$\epsilon = R_1(b_1(\pi_2), \pi_2) + R_2(\pi_1, b_2(\pi_1)),$$
where $R_i(b_i(\pi_{-i}), \pi_{-i})$ is the payoff (reward) of player $i$ when playing a best response to the opponent. An exploitability of $\epsilon$ yields at least an $\epsilon$-approximate Nash Equilibrium (distance to a Nash Equilibrium no larger than $\epsilon$), because exploitability measures how much the opponents can benefit from a player's failure to adopt a Nash Equilibrium strategy. A Nash Equilibrium is unexploitable, i.e. its exploitability is 0.
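The exploitability definition above can be computed directly for a small matrix game (this sketch and its function name are illustrative, not from the paper):

```python
import numpy as np

def exploitability(payoff, row_strategy, col_strategy):
    """Exploitability of a strategy profile in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff. We sum how much each player would
    gain by switching to a best response; at a Nash equilibrium the total is 0.
    """
    # best response value against the column player's strategy (row maximises)
    br_row = np.max(payoff @ col_strategy)
    # best response value for the column player (who minimises row payoff)
    br_col = np.min(row_strategy @ payoff)
    # value actually achieved by the profile
    value = row_strategy @ payoff @ col_strategy
    return (br_row - value) + (value - br_col)

pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
# The uniform profile is the Nash equilibrium of Matching Pennies: exploitability 0.
print(exploitability(pennies, uniform, uniform))                  # → 0.0
# A pure "heads" strategy is fully exploitable by a best-responding opponent.
print(exploitability(pennies, np.array([1.0, 0.0]), uniform))     # → 1.0
```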
2.2 Reinforcement learning and Nash Equilibrium
Reinforcement learning agents learn how to maximize their expected payoff through interaction with the environment. The interaction can be modelled as a Markov Decision Process (MDP). At time step $t$, the agent observes the current environment state $s_t$ and selects an action $a_t$ according to policy $\pi$, a mapping from states to actions. In return, the agent receives a reward $r_t$ and the next environment state $s_{t+1}$. The goal of the agent is to maximize the accumulated return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ for each state, with discount factor $\gamma$. The action-value function $Q(s, a)$ defines the expected gain of taking action $a$ in state $s$; it is the common objective function for most reinforcement learning methods. An agent learns on-policy if the policy it learns about is the one it currently follows; otherwise it learns off-policy. Deep Q-learning [11] is an off-policy method that updates the action-value function toward the one-step return. Monte Carlo Tree Search [4] is an on-policy method that chooses the best-response action by simulating the game according to the current policy and updates the action-value function when the episode ends. Asynchronous deep Q-learning [10] is a multi-threaded asynchronous variant of deep Q-learning: it uses multiple actor-learners running in parallel on multiple copies of the environment, with agents sharing the Q-learning network and applying gradient updates asynchronously.
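The one-step off-policy update described above can be sketched in tabular form (a minimal illustration with our own hyperparameter choices, not the paper's deep variant):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One off-policy Q-learning step: move Q(s,a) toward the one-step target
    r + gamma * max_a' Q(s', a') (just r at a terminal state)."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = defaultdict(float)
# a single transition: from state 0, action 1 earned reward 1.0, episode ended
q_learning_update(Q, s=0, a=1, r=1.0, s_next=None, actions=[0, 1], done=True)
print(Q[(0, 1)])  # → 0.1  (moved 10% of the way toward the target of 1.0)
```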
The relationship between reinforcement learning (RL) and game theory or Nash Equilibrium is twofold: 1) MDP/RL adopts a temporal-difference learning mechanism, which theoretically achieves Bellman optimality (or Markov perfect equilibrium, a refinement of Nash equilibrium), so it can learn the subgame-optimal substructure that includes a Nash Equilibrium; 2) however, in practice it is very difficult to measure how near a trained strategy is to a Nash Equilibrium in large-scale games, due to the cost of training.
2.3 Modern RL systems for games
In recent years, reinforcement learning has made great breakthroughs in increasingly complex games. The most significant are DeepMind's AlphaGo [18] and AlphaZero [19], which beat world champions at Go. AlphaGo is initialized with human-annotated training data; after reaching a certain level, it improves itself by RL and self-play. AlphaZero teaches itself the winning strategy by playing purely against itself using Monte Carlo Tree Search. DeepMind has shown the effectiveness of Monte Carlo techniques in games, but Go is a sequential perfect-information game.
Recently, more RL research has targeted imperfect-information games. StarCraft is a hot research topic: many works such as CommNet [20] and BiCNet [15] focus on small-map combat, while DeepMind's AlphaStar [22] can play a whole game and has defeated top human players. AlphaStar first uses supervised training on human data, then lets a group of agents (a league) play against each other in RL to improve to superhuman level, but its performance is still not stable enough. It is therefore valuable to consider whether tools from game theory, e.g. Nash Equilibrium, can measure and control the quality of trained strategies.
In the study of AI for poker, researchers often consider Nash Equilibrium. In 2014, Heinrich and Silver of University College London proposed the SmoothUCT algorithm [7], which combines Monte Carlo Tree Search, converges to an approximate Nash equilibrium, and won three silver medals in the Annual Computer Poker Competition (ACPC). In 2015, Lisỳ et al. developed an online counterfactual regret minimization algorithm [9], which can be used to approach a Nash Equilibrium in limit Texas Hold'em; Cepheus, the artificial intelligence based on this algorithm, is a near-perfect player: in the long run, a human can at best tie, or the computer wins. In 2016, Heinrich and Silver proposed the Neural Fictitious Self-Play algorithm, which approximates the Nash equilibrium of imperfect-information games without any prior knowledge. In 2017, Carnegie Mellon's artificial intelligence Libratus [3] defeated top Texas Hold'em players in one-on-one no-limit Hold'em; Libratus computes a balanced game to bring its strategy toward a Nash equilibrium. Also in 2017, Moravčík et al. proposed DeepStack [12], which also defeated professional human players and uses exploitability to measure its results.
2.4 Neural Fictitious Self Play
Neural Fictitious Self-Play [6] is a model for learning an approximate Nash Equilibrium in imperfect-information games. It combines fictitious play with deep learning. At each iteration, players determine the best response to the others' average strategy with DQN and update their average strategy by supervised learning. The state-action pair $(s, a)$ is stored in the supervised learning memory only when the player chose the action from its best response. The transition tuple $(s, a, r, s')$ is stored in the reinforcement learning memory regardless of which policy the player followed when taking the action. NFSP updates the average-strategy network with a cross-entropy loss and the best-response network with a mean-squared loss. NFSP uses the off-policy method DQN, so it has problems in on-policy games such as RTS, where we need to sample opponents' changing strategies while we play and enumerating opponents' strategies is too costly. As shown in Figure 1, we compared the training efficiency of FP and NFSP: subfigures a) and b) show that in Matching Pennies, FP converges within 200 iterations while NFSP needs more than 3,000; in Rock-Paper-Scissors, NFSP needs more than 10,000 iterations, as subfigures c) and d) show.
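The two NFSP memories and their storage rule can be sketched as follows (class and attribute names are ours; the reservoir scheme follows Vitter's Algorithm R, which NFSP cites for its supervised memory):

```python
import random
from collections import deque

class NFSPMemories:
    """The two NFSP memories: a circular replay buffer M_RL for (s, a, r, s')
    transitions, and a supervised memory M_SL that only records (s, a) pairs
    played while following the best response."""

    def __init__(self, rl_size, sl_size):
        self.m_rl = deque(maxlen=rl_size)   # circular buffer of transitions
        self.m_sl = []                       # reservoir of best-response pairs
        self.sl_size = sl_size
        self.seen = 0                        # best-response pairs seen so far

    def store(self, s, a, r, s_next, used_best_response):
        self.m_rl.append((s, a, r, s_next))           # always stored
        if used_best_response:                        # only best-response play
            self.seen += 1
            if len(self.m_sl) < self.sl_size:
                self.m_sl.append((s, a))
            else:                                     # reservoir sampling step
                j = random.randrange(self.seen)
                if j < self.sl_size:
                    self.m_sl[j] = (s, a)

mem = NFSPMemories(rl_size=2, sl_size=1)
mem.store('s0', 0, 0.0, 's1', used_best_response=True)
mem.store('s1', 1, 1.0, None, used_best_response=False)
mem.store('s2', 0, 0.0, 's3', used_best_response=False)
print(len(mem.m_rl), mem.m_sl)  # → 2 [('s0', 0)]
```

The `maxlen` deque drops the oldest transition automatically, while only the best-response pair reaches the supervised memory, matching the storage rule described above.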
3 Monte Carlo Neural Fictitious Self Play
3.1 Network Overview
The Monte Carlo Tree Search (MCTS) algorithm uses a policy network to generate action probabilities and a value network to evaluate the value of a state, so it does not suffer from DQN's complexity of scoring every action in every state; MCTS can therefore be used in high-dimensional and continuous problems. Moreover, MCTS directly uses the reward the player receives after each game to train its networks, avoiding DQN's inaccurate $Q(s, a)$ estimates in the early stages. We therefore combine MCTS and NFSP into a new algorithm better suited to larger imperfect-information games. Our algorithm, Monte Carlo Neural Fictitious Self-Play (MC-NFSP), learns a best response to the opponents' average strategy by Monte Carlo Tree Search and updates the average strategy by supervised learning on the collected best-response history. The training dataset is generated from self-play, and each agent plays a mixture of best response and average strategy, as in NFSP: most of the time it plays its average policy $\pi$, but with probability $\eta$ it plays a best response computed by MCTS.
The algorithm makes use of two neural networks: a policy-value network for Monte Carlo Tree Search (the best-response network), and a policy network for supervised learning (the average-policy network). The best-response network is shown in Figure 2. Its input is the board state, and it has two outputs: a policy $p$, a mapping from the current state to action probabilities, and a value $v \in [-1, 1]$, the predicted value of the given state. In our network, ReLU activations are used in the convolutional layers, dropout is used in the fully connected layers to reduce overfitting, and a softmax produces the policy probabilities. The average-policy network is almost the same as the best-response network except that it only outputs a policy $\pi$ (no value output), which represents the average policy of the player.

3.2 Algorithm Training
The networks are trained alongside self-play. The self-player adopts the mixed policy $\sigma = (1-\eta)\,\pi + \eta\,\beta$: at each action, the player uses the result of the best-response-network-based MCTS with probability $\eta$, and otherwise the average-policy network.
To calculate the best response in the current state, the algorithm plays a modified Monte Carlo Tree Search. In the search tree, a node is a state and an edge is an action; MCTS simulates future play by adding possible future actions after an action is executed. At each step, the agent chooses the action maximizing $Q(s, a) + U(s, a)$, where $U(s, a) = c_{puct}\, P(s, a)\, \sqrt{\sum_b N(s, b)}\,/\,(1 + N(s, a))$. Here $Q(s, a)$ is the expected payoff of taking action $a$ in state $s$, $P(s, a)$ is the probability of taking action $a$ from state $s$ according to the best-response network, $N(s, a)$ counts the number of times action $a$ has been chosen at state $s$ across simulations, $\sum_b N(s, b)$ is the number of visits to state $s$, and $c_{puct}$ is a hyperparameter that controls the degree of exploration. The agent takes action $a$ and reaches the next state $s'$. If $s'$ is a terminal state, the player's final score (win or loss) is used as the node value; otherwise the opponent takes its action. When a node that has not been visited before is added to the game tree, its value $v$ is computed by the best-response network and used as the initial node value. Node values are propagated along the path visited in the current simulation and used to update the corresponding $Q$ values.
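The in-tree selection rule can be sketched as a small function (the dictionary layout and default $c_{puct}$ are our choices for illustration):

```python
import math

def select_action(Q, P, N, state, actions, c_puct=1.0):
    """PUCT selection: pick the action maximising Q(s,a) + U(s,a), where
    U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    Q, P, N are dicts keyed by (state, action)."""
    total_visits = sum(N.get((state, b), 0) for b in actions)

    def score(a):
        u = c_puct * P[(state, a)] * math.sqrt(total_visits) / (1 + N.get((state, a), 0))
        return Q.get((state, a), 0.0) + u

    return max(actions, key=score)

# A heavily-visited favourite (action 0) loses out to an unexplored action:
P = {('s', 0): 0.9, ('s', 1): 0.1}
N = {('s', 0): 10, ('s', 1): 0}
print(select_action({}, P, N, 's', [0, 1]))  # → 1 (exploration term dominates)
```

The exploration bonus shrinks as an action's visit count grows, which is what pushes the search toward under-explored branches early on.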
After many simulations, the visit counts $N(s, a)$ become a better approximation of the policy. $N(s, a)$ is normalized into the improved policy $\pi'$, and the agent picks an action by sampling from $\pi'$. After an episode, tuples $(s, \pi', z)$ are stored in the reinforcement learning memory to train the best-response network, and pairs $(s, a)$ are stored in the supervised learning memory to train the average-strategy network. After a certain number of episodes, the best-response network is trained with loss $l = (z - v)^2 - \pi'^{\top} \log p$, where $s$ is the current game state, $p$ and $v$ are the network outputs, and $z$, the result of the game, is 1 or −1; the average-strategy network is trained with a cross-entropy loss against the output $\hat{\pi}$ of the average network.

Algorithm 1 MC-NFSP
3.3 Experiment
We compare MC-NFSP with NFSP in Othello. Our experiments investigate the convergence of MC-NFSP to a Nash equilibrium in Othello, using the exploitability of the learned strategy as the comparative standard. To reduce the time needed to compute exploitability, we use a reduced-size Othello board.
3.3.1 Othello
The neural network in MC-NFSP takes the board position as input and passes it through two convolutional layers and a flatten layer. The resulting 120-dimensional vector is passed through several fully connected layers. The best-response network outputs both a vector $p$, representing a probability distribution over moves, and a scalar value $v$, representing the probability of the current player winning in the given position. The architecture of the average-strategy network is the same as the best-response network's except for the output: it only outputs a vector $\pi$ representing a probability distribution over moves. We set the sizes of the memories $M_{RL}$ and $M_{SL}$ to 4M and 400K respectively. $M_{RL}$ is a circular buffer containing recent training experiences; $M_{SL}$ is updated with reservoir sampling [23] to maintain an even distribution of training experiences. The reinforcement learning and supervised learning rates were set to 0.01 and 0.005, both with the Adam optimizer. Players perform a gradient update every 100 episodes of play. MC-NFSP's anticipatory parameter $\eta$ was held fixed.

3.3.2 Comparing MC-NFSP with NFSP
The neural network in NFSP is the same as MC-NFSP's except for the output. The output of the best-response network in NFSP is a vector $Q(s, \cdot)$, representing the value of each action in state $s$; the output of the average-strategy network is a vector $\pi$, representing the probability distribution over moves. We set the sizes of the memories $M_{RL}$ and $M_{SL}$ to 4M and 400K respectively. The reinforcement learning and supervised learning rates were set to 0.01 and 0.005. Each player performed gradient updates with minibatch size 256 per network for every 256 moves. The target network of the best-response network was refitted every 300 training updates. NFSP's anticipatory parameter $\eta$ was held fixed. The ε-greedy policy's exploration rate started at 0.6 and decayed to 0, proportionally to the inverse square root of the number of iterations.
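The inverse-square-root exploration decay described above amounts to the following schedule (the function name and the convention for iteration 0 are ours):

```python
def epsilon(iteration, start=0.6):
    """Exploration rate decaying proportionally to the inverse square root
    of the iteration count, as used for the epsilon-greedy best response."""
    return start / (iteration ** 0.5) if iteration > 0 else start

print(round(epsilon(1), 3))    # → 0.6
print(round(epsilon(100), 3))  # → 0.06
```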
Figure 3.2 shows that NFSP cannot approach a Nash equilibrium in Othello: the exploitability of its strategy oscillates throughout training. The reason NFSP does not converge to an approximate Nash equilibrium in Othello is that its players rely on deep Q-learning to approximate the best response, which performs poorly in settings with a large-scale search space and depth.
4 Asynchronous Neural Fictitious Self Play
4.1 Algorithm Overview
Based on MC-NFSP, we further improve time efficiency with a multi-threaded learning mechanism called Asynchronous Neural Fictitious Self-Play (ANFSP), which runs multiple players asynchronously in parallel instances of the game environment. Players run different exploration policies in different threads, share the neural networks, and perform gradient updates asynchronously.
Inspired by the A3C algorithm, our algorithm also starts multiple threads of play, as Algorithm 2 shows. In each thread, a player plays a mixture of the average-policy and best-response networks. The state-action pair $(s, a)$ is stored in the supervised learning memory only when the player chose the action from its best response. Each thread computes the gradient of the best-response network from the transition tuple $(s, a, r, s')$ at every step and accumulates gradients over multiple time steps, up to a fixed count, before they are applied, similarly to minibatching; the gradient of the average-strategy network is computed on a minibatch from the supervised learning memory after the same kind of accumulation. Once a global counter reaches a certain count, the shared networks are updated. The loss of the best-response network is a mean-squared Q-learning loss, and the loss of the average-strategy network is a cross-entropy loss.
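The shared gradient-accumulation mechanism can be sketched with plain threads (class and parameter names are ours; real gradients are replaced by constant placeholders, and the network update is only marked by a counter):

```python
import threading

class SharedAccumulator:
    """Sketch of ANFSP-style accumulation: worker threads add local gradients
    into a shared buffer under a lock; every `apply_every` steps the buffer
    would be applied to the shared network and is then reset."""

    def __init__(self, n_params, apply_every=32):
        self.grads = [0.0] * n_params
        self.steps = 0
        self.apply_every = apply_every
        self.lock = threading.Lock()
        self.updates_applied = 0

    def add(self, local_grads):
        with self.lock:
            for i, g in enumerate(local_grads):
                self.grads[i] += g
            self.steps += 1
            if self.steps % self.apply_every == 0:
                # here the shared network parameters would be updated
                self.updates_applied += 1
                self.grads = [0.0] * len(self.grads)

acc = SharedAccumulator(n_params=2, apply_every=4)
threads = [threading.Thread(target=lambda: [acc.add([0.1, 0.2]) for _ in range(8)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acc.updates_applied)  # → 8  (4 threads × 8 steps / every 4 steps)
```

The lock serializes only the cheap accumulation step, so the expensive environment interaction in each thread still runs in parallel.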
Algorithm 2 Asynchronous Neural Fictitious Self-Play
4.2 Experiment
4.2.1 Leduc Hold’em
We compare ANFSP with NFSP in a modified two-player Leduc Hold'em. For simplification, we limit the maximum bet size in each round of Leduc Hold'em to 2. In the game, the bet history is represented as a tensor with 4 dimensions: player, round, bet, and action taken. Leduc Hold'em has 2 rounds, and players usually have three actions to choose from: fold, call, and raise. Since the game ends once a player folds, the betting history is represented as a $2 \times 2 \times 2 \times 2$ tensor, which we flatten to a vector of length 16. Leduc Hold'em has a deck of 6 cards. We represent each round's cards by a k-of-n encoding: LHE has a card vector of 6 entries, and we set the observed cards to 1 and the rest to 0. Concatenating with the card input, we encode the information state of LHE as a vector of length 22. We started 4 threads, each with a randomly chosen initial exploration rate. The exploration rate decayed to 0, proportionally to the inverse square root of the number of iterations. We set the memory size to 2M. We train the networks every 32 iterations of play, update the deep Q-learning network with the accumulated gradients, and update the average-strategy network with a minibatch size of 128. The reinforcement learning and supervised learning rates were set to 0.01 and 0.005, both with SGD. The target network of deep Q-learning was updated every 50,000 actions. ANFSP's anticipatory parameter $\eta$ was held fixed.
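The length-22 information-state encoding described above can be sketched as follows (the ordering of the history dimensions and the card-index convention are our assumptions; the paper only fixes the sizes):

```python
import numpy as np

def leduc_info_state(bet_history, card_indices):
    """Length-22 Leduc information state: a flattened 2x2x2x2 bet-history
    tensor (player, round, bet, action) concatenated with a 6-card k-of-n
    vector whose observed cards are set to 1."""
    history = np.zeros((2, 2, 2, 2), dtype=np.float32)
    for player, rnd, bet, action in bet_history:
        history[player, rnd, bet, action] = 1.0
    cards = np.zeros(6, dtype=np.float32)
    cards[list(card_indices)] = 1.0
    return np.concatenate([history.reshape(-1), cards])

# player 0 raised with the first bet of round 0, holding card 2; card 5 is public
state = leduc_info_state([(0, 0, 0, 1)], card_indices=[2, 5])
print(state.shape)  # → (22,)
```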
Figure 4.1 shows ANFSP approaching a Nash equilibrium in modified Leduc Hold'em. The exploitability declined continually and appeared to stabilize at around 0.64 after 1.4M episodes of play. Training took about 2 hours.
4.2.2 Comparison with NFSP
The architecture of the neural network in NFSP is the same as ANFSP's. We set the sizes of the two memories to 200K and 2M respectively. The reinforcement learning and supervised learning rates were set to 0.01 and 0.005. Players performed gradient updates with minibatch size 128 per network for every 128 actions. The target network of the best-response network was refitted every 300 training updates. NFSP's anticipatory parameter $\eta$ was held fixed. The ε-greedy policy's exploration rate started at 0.06 and decayed to 0, proportionally to the inverse square root of the number of iterations.
Figure 4.2 shows the learning performance of NFSP: the exploitability fluctuated and appeared to stabilize at around 0.75 after 700K episodes of play. Training also took about 2 hours. In the same training time (2 hours), ANFSP thus completes more episodes and achieves a better result (lower exploitability).
5 Evaluation in a First-Person Shooting Game
5.1 Experiment Setting
To evaluate the effectiveness of our algorithm in a complex imperfect-information game, we trained it in an FPS game and had it fight human beings. The FPS platform used in this experiment was designed by our research team. The game scene is an offensive and defensive confrontation between two teams (10 vs 10). In training, one side is MC-NFSP and the other is an agent trained by supervised learning on thousands of human plays (SL-Human). The experiment was performed on a fixed, closed 255 x 255 square map, divided into 12 x 12 areas of 20 x 20 squares each. The scene is shown in Figure 5: the green areas are passable regions, and the gray areas are obstacles that cannot be crossed (rock or fence). Figure 5 is marked with two points, A and B, which are the birth points of the two teams. In addition, the nine areas marked in red represent the destination areas that an agent can choose to attack or defend. The centers of the four walls correspond to four doors, each wide enough for only 2-3 people at a time. The team outside has the mission of breaking through the walls and killing everyone inside, and the inside team must defend.
5.2 Experiment
In training, each team is represented as a single player. The state is a dictionary mapping each location block $l$ in the game map to a pair: the number of fighters of the currently trained team at $l$ at time $t$, and the believed number of opposing fighters at $l$. The actions of a team are force assignments of fighters to locations, e.g. assigning $n$ fighters from location $l_1$ to location $l_2$. For the reward, each fighter has a health of 100, and the reward of a team is computed from the teams' remaining health. Unlike the previous experiments in this paper, the two networks are built and trained for both the outside team and the inside team. Figure 6 shows the training result of the outside team (the results of the inside team are similar). Training converges very fast (in no more than 150 episodes, each consisting of 5 games): the win rate of the outside team against SL-Human exceeds 80%, and the training loss approaches zero.
After training, we had the algorithm play the game against 10 college students. It played each human for 10 rounds (switching sides after five rounds), for 100 games in total, of which our trained algorithm won 75, a superhuman result.
6 Conclusion
In this paper, we extend the original NFSP algorithm, which learns approximate Nash equilibria in two-player imperfect-information games, and propose the MC-NFSP and ANFSP algorithms. MC-NFSP uses MCTS to free NFSP from off-policy DQN, achieving a large improvement in training efficiency and moving one step closer to approximate Nash equilibria in larger-scale imperfect-information games with wider and deeper game trees. Experiments on Othello show that with the improved algorithm, the players' strategies converge to an approximate Nash equilibrium after a certain number of rounds, whereas the original algorithm does not converge. ANFSP uses an asynchronous, parallel architecture to collect game experience efficiently, reducing the convergence time and memory cost of NFSP. Experiments on modified Leduc Hold'em show that ANFSP converges in a shorter time and more stably than NFSP. Finally, we tested the algorithm in an FPS game, where it achieved superhuman results in a short time, showing that the combination of Monte Carlo Tree Search and NFSP is a practical way to solve imperfect-information game problems.
References
 [1] Bosansky, B., Kiekintveld, C., Lisy, V., Pechoucek, M.: An exact doubleoracle algorithm for zerosum extensiveform games with imperfect information. Journal of Artificial Intelligence Research pp. 829–866 (2014)
 [2] Brown, G.W.: Iterative solution of games by fictitious play. Activity analysis of production and allocation (1951)
 [3] Brown, N., Sandholm, T.: Libratus: The superhuman ai for nolimit poker. In: IJCAI. pp. 5226–5228 (2017)
 [4] Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1), 1–43 (2012)
 [5] Heinrich, J., Lanctot, M., Silver, D.: Fictitious selfplay in extensiveform games. In: Proceedings of the 32nd International Conference on Machine Learning. (2015)
 [6] Heinrich, J., Silver, D.: Deep reinforcement learning from selfplay in imperfectinformation games. arXiv preprint arXiv:1603.01121 (2016)
 [7] Heinrich, J., Silver, D.: Smooth uct search in computer poker. In: TwentyFourth International Joint Conference on Artificial Intelligence (2015)
 [8] Lavi, R.: Algorithmic game theory. Computationallyefficient approximate mechanisms pp. 301–330 (2007)
 [9] Lisỳ, V., Lanctot, M., Bowling, M.: Online monte carlo counterfactual regret minimization for search in imperfect information games. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. pp. 27–36. International Foundation for Autonomous Agents and Multiagent Systems (2015)
 [10] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proceedings of Machine Learning Research (2016)
 [11] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Humanlevel control through deep reinforcement learning. Nature pp. 529–533 (2015)
 [12] Moravčík, M., Schmid, M., Burch, N., Lisỳ, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., Bowling, M.: Deepstack: Expertlevel artificial intelligence in headsup nolimit poker. Science 356(6337), 508–513 (2017)
 [13] Myerson, R.B.: Game Theory: Analysis of Conflict. Harvard University Press (1991)
 [14] Nash, J.: Noncooperative games. Annals of mathematics pp. 286–295 (1951)
 [15] Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., Wang, J.: Multiagent bidirectionallycoordinated nets: Emergence of humanlevel coordination in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069 (2017)
 [16] Robinson, J.: An iterative method of solving a game. Annals of Mathematics pp. 296–301 (1951)
 [17] Sandholm, T.: The state of solving large incomplete-information games, and application to poker. AI Magazine 31(4), 13–32 (2010)
 [18] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484 (2016)
 [19] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
 [20] Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems. pp. 2244–2252 (2016)
 [21] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, vol. 1 (1998)
 [22] Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W.M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., Silver, D.: AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (2019)
 [23] Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software (1985)