Incomplete information, large state and action spaces, and complex, stochastic but closed-world dynamics make real-time strategy (RTS) games such as StarCraft: Brood War or Dota 2 an interesting test-bed for search, planning and reinforcement learning algorithms [1, 2]. The “fog of war” is a fundamental aspect of RTS gameplay: players only observe the immediate surroundings of the units they control. Hence, it is crucial to make good estimates about the strategy and positioning of an opponent and to produce matching counter-strategies in order to win a game.
In this work, we consider the selection between fixed high-level strategies (we will use the term build orders; see also Appendix A) for a StarCraft: Brood War bot (we perform experiments with CherryPi, https://torchcraft.github.io/TorchCraftAI), dependent on the current observable game state. A build order consists of a rule set targeting specific unit compositions, decisions to expand to multiple bases, and a global decision on whether to initiate attacks against the opponent. The bot we use in our experiments has 25 build orders to choose from. Learning strategic decisions for StarCraft has previously been addressed in [3] with a focus on learning individual commands from human replays; in contrast, we select among predefined strategies with reinforcement learning and evaluate against high-level competitive bots.
For this selection task, it is key to infer the strategy and actions of the opponent player from limited observations. While it is possible to tackle hidden state estimation separately (e.g. [4] in the context of StarCraft) and to provide a model with these estimates, we instead opt to perform estimation as an auxiliary prediction task alongside the default training objective. Auxiliary losses (or multi-task learning) are well established in neural-network-based supervised learning [5, 6] and have recently found application in reinforcement learning tasks such as navigation ([7] employ auxiliary depth and loop-closure prediction) and FPS game playing ([8] predict future low-dimensional measurements; [9] predict symbolic game features). A common motivation in these works is to enable faster or more data-efficient learning of robust representations which facilitate mastering the actual control task at hand. While we share this inspiration, our auxiliary task concerns information that is present but hidden.
Every five seconds of game time, we provide our model with a global, non-spatial representation of the current observation. The features contain observed unit counts for both players (only partially observed for the opponent), our resources and technologies, as well as game time and static game information such as the opponent faction. We define two settings for featurization: visible only counts units currently visible, whereas memory uses hard-coded rules to keep track of enemy units that were seen before but are currently hidden, as commonly done in StarCraft bots. The model learns the value of switching to a given build order conditioned on the current observation. It uses an LSTM encoder with 2048 cells followed by a linear layer with as many sigmoid outputs as build orders, and is trained with the win/lose outcome of the game as target. The auxiliary task is handled by another branch of the network, taking the LSTM encoding as input. It consists of three fully connected layers of 256 hidden units with as many outputs as unit types. It predicts the (normalized) unit counts of the opponent by minimizing the Huber loss; the true opponent unit counts are available at training time (we activate BWAPI’s CompleteMapInformation cheat flag for training games). We train our models in two stages (see Appendix B and C for more details on the features and training):
Off-policy Initialization: Offline training data is collected by performing random build order switches during a game. Specifically, we start by selecting an initial build order and perform a random switch every 8, 10 or 13 minutes on average (interval randomly selected for each game). We produced a corpus of 2.8M games with 3.3M switches used as training data points for the Q-function.
On-policy Refinement: We play games as in evaluation mode (Appendix C), selecting the build order according to the trained Q-function. As before, we perform one random switch within a sampled average time interval and keep the selected build order for a fixed number of minutes. Afterwards, we fall back to following the current Q-function.
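The network described above can be sketched as follows. The 2048-cell LSTM, the per-build-order sigmoid value outputs, and the three 256-unit fully connected layers of the auxiliary branch follow the text; the use of PyTorch, the feature dimension, and the number of build orders and unit types in the example below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BuildOrderQNet(nn.Module):
    def __init__(self, n_features, n_build_orders, n_unit_types):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 2048, batch_first=True)
        # One sigmoid output per build order: the estimated value of
        # switching to that build order given the observation.
        self.value_head = nn.Linear(2048, n_build_orders)
        # Auxiliary branch: three fully connected layers of 256 hidden
        # units, predicting (normalized) opponent unit counts per type;
        # trained with the Huber loss against true counts.
        self.aux_head = nn.Sequential(
            nn.Linear(2048, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_unit_types),
        )

    def forward(self, obs_seq):
        # obs_seq: (batch, time, n_features), one step per 5s of game time
        h, _ = self.lstm(obs_seq)
        values = torch.sigmoid(self.value_head(h))
        unit_counts = self.aux_head(h)
        return values, unit_counts

model = BuildOrderQNet(n_features=512, n_build_orders=25, n_unit_types=118)
values, units = model(torch.randn(4, 16, 512))
```

The value outputs are trained against the binary win/lose outcome, while the auxiliary head shares only the LSTM encoding with the value pathway.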
| Model | Unit Obs. | Loss | Train Win Rate (std. dev) | Held-out 1 (std. dev) | Held-out 2 (std. dev) |
|---|---|---|---|---|---|
| Control | - | - | 0.793 (0.010) | 0.387 (0.036) | 0.556 (0.032) |
| LSTM | Visible | Value | 0.878 (0.004) | 0.635 (0.031) | 0.706 (0.048) |
| LSTM | Visible | Value+Aux. | 0.879 (0.001) | 0.579 (0.065) | 0.723 (0.063) |
| LSTM | Memory | Value | 0.886 (<0.001) | 0.587 (0.031) | 0.693 (0.077) |
| LSTM | Memory | Value+Aux. | 0.888 (0.004) | 0.600 (0.027) | 0.725 (0.039) |
The table above compares win rates for a control run (without build order switching) and trained models when playing against the training-set opponents and two held-out bot versions. The training-set win rate is the average of all per-bot win rates, to account for the varying number of build orders per opponent faction. Standard deviations are computed on averages for the first, second and third set of games per map iteration.
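The macro-averaging over per-bot win rates can be sketched as below; the bot names and result encoding (1 = win, 0 = loss) are illustrative placeholders.

```python
def training_win_rate(results):
    """Macro-average over per-bot win rates, so that bots with more
    eligible build orders (and hence more games) do not dominate."""
    # results: dict mapping bot name -> list of game outcomes (1 = win)
    per_bot = [sum(games) / len(games) for games in results.values()]
    return sum(per_bot) / len(per_bot)
```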
For all variants, we observe strong win rate improvements over the control run, for both training and held-out bots. With an auxiliary loss, we observe reductions in Q-value prediction error in the initialization training phase (Fig. 1(a)). However, final win rate improvements are within the standard deviation. Performance on the held-out bots is subject to high variance; remarkably, the gains for variants with memory observations on the training bots do not transfer to held-out bots. Fig. 2 illustrates unit prediction performance during two games against the two held-out bots and reveals sensible predictions.
- (1) Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in games, 5(4):293–311, 2013.
- (2) Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft II: A New Challenge for Reinforcement Learning. arXiv preprint arXiv:1708.04782, 2017.
- (3) Niels Justesen and Sebastian Risi. Learning Macromanagement in StarCraft from Replays using Deep Learning. In Computational Intelligence and Games (CIG), 2017 IEEE Conference on, pages 162–169. IEEE, 2017.
- (4) Gabriel Synnaeve, Zeming Lin, Jonas Gehring, Daniel Gant, Vegard Mella, Vasil Khalidov, Nicolas Carion, and Nicolas Usunier. Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger. In Advances in Neural Information Processing Systems, 2018.
- (5) Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
- (6) Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning, pages 612–621, 2016.
- (7) Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. In International Conference on Learning Representations, ICLR, 2016.
- (8) Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. In International Conference on Learning Representations, ICLR, 2017.
- (9) Guillaume Lample and Devendra Singh Chaplot. Playing FPS Games with Deep Reinforcement Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
- (10) Gabriel Synnaeve. Bayesian programming and learning for multi-player video games. PhD thesis, Université de Grenoble, 2012.
- (11) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. arXiv: 1412.6980.
Appendix A Action Space
The actions we consider consist of switching between (or continuing to play) fixed, rule-based build orders. A build order contains commands to build specific units at specific points in time, or depending on properties of the current observation, such as the unit counts per type of the player and its opponent, as well as buildings. From a reinforcement learning perspective, build orders can be regarded as hard-coded options that can play an entire game or be terminated at any moment.
A build order is specialized to implement a given strategy, which corresponds to different army compositions (different types of units to build), as well as different trade-offs between short-term and long-term army strength. For instance, some build orders are specialized to assemble large armies of weaker units in the short term, while others invest more heavily in buildings and upgrades to create stronger units in the long run. The winning probabilities of build orders against other build orders are not transitive, so the build order needs to be changed if it implements an ineffective strategy against the one chosen by the opponent.
The action space contains 25 different build orders, each of which is specialized to a specific match-up: the game of StarCraft has three “races” (Zerg, Protoss and Terran), and strategies are specialized depending on the race of the player (Zerg in our case) and its opponent. In total, we obtain 42 distinct actions. During a game, model outputs corresponding to build order specializations not relevant for the opponent race are ignored.
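The per-race masking of model outputs can be sketched as follows; the build order names and the mapping from build order to targeted opponent race are hypothetical examples.

```python
def valid_actions(q_values, specializations, opponent_race):
    """Keep only the build orders specialized against the opponent's race.

    q_values: dict mapping build order name -> predicted value
    specializations: dict mapping build order name -> opponent race it targets
    """
    return {b: v for b, v in q_values.items()
            if specializations[b] == opponent_race}

# Hypothetical example: two build orders, one per match-up.
q = {"zvt_macro": 0.8, "zvp_hydra": 0.6}
spec = {"zvt_macro": "Terran", "zvp_hydra": "Protoss"}
masked = valid_actions(q, spec, "Terran")
```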
Appendix B Model Input
Our models are provided with the following features that are extracted from the game state for each model evaluation (every 5 seconds of game time).
Unit counts are provided per type, in disjoint channels for allied and enemy units. We scale the counts by approximate unit type value [10] and a constant factor. We consider two variants for enemy units:
Visible: the currently visible enemy units.
Memory: all enemy units that have been observed since the start of the game, excluding units whose destruction has been observed.
Resources, i.e. minerals, gas, used supply and maximum supply, are each rescaled by a fixed transform.
Upgrades and technologies are marked as 1 if available and 0 otherwise.
Separately, upgrades and technologies that are currently being researched are marked as 1 and 0 otherwise.
Game time in minutes, rescaled by a fixed transform
Build order: index of the currently active build order
Race: index of the enemy race
Map: index of the map the game is played on
The LSTM input is a concatenation of all these features, with categorical features (race, map, build order) each represented by an 8-dimensional embedding and non-unit features (resources, upgrades and technologies) undergoing a linear projection with 8 units each.
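The bookkeeping behind the memory variant can be sketched as follows; the unit identifiers and type names are illustrative placeholders, and the actual bot performs this tracking with hard-coded rules on BWAPI observations.

```python
class EnemyMemory:
    """Track enemy units seen at any point since the start of the game,
    dropping those whose destruction has been observed."""

    def __init__(self):
        self.seen = {}  # unit id -> unit type

    def observe_visible(self, visible_units):
        # visible_units: iterable of (unit_id, unit_type) currently in view
        for uid, utype in visible_units:
            self.seen[uid] = utype

    def observe_destroyed(self, unit_id):
        # Only units observed being destroyed are removed from memory.
        self.seen.pop(unit_id, None)

    def counts(self):
        # Per-type counts feeding the enemy unit-count channels.
        out = {}
        for utype in self.seen.values():
            out[utype] = out.get(utype, 0) + 1
        return out

mem = EnemyMemory()
mem.observe_visible([(1, "zealot"), (2, "zealot"), (3, "dragoon")])
mem.observe_destroyed(2)
```

Under this scheme, units that merely leave the field of view remain counted, in contrast to the visible variant.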
Appendix C Training and Evaluation Details
All games are played on the AIIDE map pool (https://skatgame.net/mburo/sc2011/rules.html). During training, opponents are selected at random, and initial build orders are chosen according to a bandit algorithm. Players can adapt between matches in short series of 25 games each.
In evaluation mode, we execute the build order with the highest value, i.e. the argmax of the Q-function over build orders. To reduce unnecessary back-and-forth switching, we only switch to a new build order if its value exceeds that of the currently active one by at least 0.01. At the start of a game, the model receives no observations about the opponent besides its race. Hence, we only start following the Q-function after six minutes of game time (if an opponent rush or proxy attack is detected, we allow for earlier switching) to ensure that the selected initial build order has a sufficient effect.
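The switching rule with the 0.01 hysteresis margin can be sketched as below; the build order names are hypothetical.

```python
SWITCH_MARGIN = 0.01  # minimum advantage required to switch build orders

def select_build_order(q_values, current):
    """Return the build order to follow: switch to the argmax only if it
    beats the currently active build order by at least SWITCH_MARGIN."""
    # q_values: dict mapping build order name -> predicted value
    best = max(q_values, key=q_values.get)
    if best != current and q_values[best] >= q_values[current] + SWITCH_MARGIN:
        return best
    return current
```

This keeps the policy from oscillating between build orders whose predicted values are nearly indistinguishable.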
During off-policy initialization, models are optimized with Adam [11]. We batch 256 games per update and use a fixed learning rate. Gradients are back-propagated through time (BPTT) and truncated after 512 time steps.
Q-value heads are trained on build order switching points only, while we compute the Huber loss for the auxiliary prediction task on every sample. However, for each mini-batch, both losses are averaged independently, taking the total number of contributing samples into account. We apply a scaling factor of 10 to the loss obtained from the auxiliary task; we found no notable difference between factors of 5, 10 and 20, but worse performance (in terms of value head error rates) for 1 and 100. All models are trained for four epochs on the training corpus.
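The loss combination can be sketched as follows: the value loss is restricted to switch points via a mask and the Huber term is computed on every sample, each averaged over its own number of contributing samples before scaling. The use of PyTorch and of binary cross-entropy for the win/lose target are assumptions.

```python
import torch
import torch.nn.functional as F

AUX_SCALE = 10.0  # scaling factor on the auxiliary loss, as in the text

def combined_loss(values, outcomes, switch_mask, pred_units, true_units):
    # Value loss only at build order switching points; the masked mean
    # averages over the number of switch points in the mini-batch.
    value_loss = F.binary_cross_entropy(values[switch_mask],
                                        outcomes[switch_mask])
    # Huber (smooth L1) loss on every sample, averaged independently.
    aux_loss = F.smooth_l1_loss(pred_units, true_units)
    return value_loss + AUX_SCALE * aux_loss

# Hypothetical mini-batch of three samples, two of them switch points.
values = torch.tensor([0.7, 0.4, 0.6])
outcomes = torch.tensor([1.0, 0.0, 1.0])
mask = torch.tensor([True, False, True])
loss = combined_loss(values, outcomes, mask,
                     torch.zeros(3, 5), torch.ones(3, 5))
```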
The same optimization settings are used for on-policy refinement, with the exception of a smaller batch size of 64. The learning rate is reduced to improve stability during online learning. After the initialization step above, models receive 2500 additional updates in the on-policy setting.
For testing, our model is run in evaluation mode, and we play three games per map against each opponent for each starting build order; players are not allowed to adapt between games, to reduce the variance of the results.
Appendix D Auxiliary Task Performance
| Bot | Race | Source |
|---|---|---|
| LetaBot-BBS | Terran | from GitHub, https://git.io/fxSlk |
| LetaBot-SCVMarineRush | Terran | from GitHub, https://git.io/fxSlL |
| LetaBot-SCVRush | Terran | from GitHub, https://git.io/fxSlq |
| McRave-51e49b0 | Protoss | from GitHub, https://git.io/fx1ho |
| Tscmoo | Protoss | Provided by author |
| Tscmoo | Terran | Provided by author |
| Tscmoo | Zerg | Provided by author |