High-Level Strategy Selection under Partial Observability in StarCraft: Brood War

Jonas Gehring et al., Facebook. November 21, 2018.

We consider the problem of high-level strategy selection in the adversarial setting of real-time strategy games from a reinforcement learning perspective, where taking an action corresponds to switching to the respective strategy. Here, a good strategy successfully counters the opponent's current and possible future strategies which can only be estimated using partial observations. We investigate whether we can utilize the full game state information during training time (in the form of an auxiliary prediction task) to increase performance. Experiments carried out within a StarCraft: Brood War bot against strong community bots show substantial win rate improvements over a fixed-strategy baseline and encouraging results when learning with the auxiliary task.


1 Introduction

Incomplete information, large state and action spaces, and complex, stochastic but closed-world dynamics make real-time strategy (RTS) games such as StarCraft: Brood War or DoTA 2 an interesting test-bed for search, planning and reinforcement learning algorithms [ontanon2013survey; vinyals2017starcraft]. The “fog of war” is a fundamental aspect of RTS gameplay: players only observe the immediate surroundings of the units they control. Hence, it is crucial to make good estimates about the strategy and positioning of an opponent and to produce matching counter-strategies in order to win a game.

In this work, we consider the selection between fixed high-level strategies (we will use the term build orders; see also Appendix A) for a StarCraft: Brood War bot (we perform experiments with CherryPi, https://torchcraft.github.io/TorchCraftAI), dependent on the currently observable game state. A build order consists of a rule set targeting specific unit compositions, decisions to expand to multiple bases, and a global decision on whether or not to initiate attacks against the opponent. The bot we use in our experiments has 25 build orders to choose from. Learning strategic decisions for StarCraft has previously been addressed in [justesen2017learning] with a focus on learning individual commands from human replays, whereas we select among predefined strategies with reinforcement learning and evaluate against high-level competitive bots.

For this selection task, it is key to infer the strategy and actions of the opponent player from limited observations. While it is possible to tackle hidden state estimation separately (e.g. [lin2018forward] in the context of StarCraft) and to provide a model with these estimates, we instead opt to perform estimation as an auxiliary prediction task alongside the default training objective. Auxiliary losses (or multi-task learning) are well-known in neural-network-based supervised learning [ando2005framework; zhang2016augmenting] and have recently found application in reinforcement learning tasks such as navigation ([mirowski2016learning] employ auxiliary depth and loop closure prediction) and FPS game playing ([dosovitskiy2016learning] predict future low-dimensional measurements; [lample2017playing] predict symbolic game features). A common motivation in these works is to enable faster or more data-efficient learning of robust representations which facilitate mastering the actual control task at hand. While we share this inspiration, our auxiliary task concerns present but hidden information.

2 Approach

Every five seconds of game time, we provide our model with a global, non-spatial representation of the current observation. The features contain observed unit counts for both players (only partially observed for the opponent), our resources and technologies, as well as game time and static game information such as the opponent faction. We define two settings for featurization: visible only counts units that are currently visible, whereas memory uses hard-coded rules to keep track of enemy units that were seen before but are currently hidden, as commonly done in StarCraft bots. The model learns the value Q(o_t, b) of switching to build order b given the current observation o_t. It uses an LSTM encoder with 2048 cells followed by a linear layer with as many sigmoid outputs as build orders, and is trained with the win/lose outcome of the game as target. The auxiliary task is handled by a second branch of the network, taking the LSTM encoding as input. It consists of three fully connected layers of 256 hidden units with as many outputs as unit types, and predicts the (normalized) unit counts of the opponent by minimizing the Huber loss (the true opponent unit counts are available at training time; we activate BWAPI's CompleteMapInformation cheat flag for training games). We train our models in two stages (see Appendix B and C for more details on the features and training; a schematic sketch of the network is given after the two stages below):

Off-policy Initialization: Offline training data is collected by performing random build order switches during a game. Specifically, we start by selecting an initial build order and perform a random switch every 8, 10 or 13 minutes on average (interval randomly selected for each game). We produced a corpus of 2.8M games with 3.3M switches used as training data points for the Q-function.

On-policy Refinement: We play games as in evaluation mode (Appendix C), selecting the build order according to the trained Q-function. As before, we perform one random switch within a sampled average time interval and keep the selected build order for minutes. Afterwards, we fall back to following the current Q-function.
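To make the network described above concrete, the following is a minimal PyTorch-style sketch of the value model with its auxiliary unit-count head. The 2048 LSTM cells and the 256-unit auxiliary layers follow the description; the feature size, the number of actions and unit types, and the exact layer arrangement are hypothetical assumptions of this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BuildOrderValueNet(nn.Module):
    # Sketch only: dimensions other than the 2048 LSTM cells and the
    # 256-unit auxiliary layers are hypothetical placeholders.
    def __init__(self, num_features=512, num_actions=42, num_unit_types=118):
        super().__init__()
        self.encoder = nn.LSTM(num_features, 2048, batch_first=True)
        # Value head: one sigmoid output per build order, trained against
        # the binary win/lose outcome of the game.
        self.value_head = nn.Linear(2048, num_actions)
        # Auxiliary head: predicts normalized opponent unit counts from the
        # LSTM encoding through fully connected layers with 256 hidden units.
        self.aux_head = nn.Sequential(
            nn.Linear(2048, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_unit_types),
        )

    def forward(self, features):
        # features: (batch, time, num_features), one step every 5 game seconds
        encoding, _ = self.encoder(features)
        values = torch.sigmoid(self.value_head(encoding))
        unit_counts = self.aux_head(encoding)
        return values, unit_counts


# Example with random inputs: 4 games, 32 time steps each.
model = BuildOrderValueNet()
values, unit_counts = model(torch.randn(4, 32, 512))
print(values.shape, unit_counts.shape)  # (4, 32, 42) and (4, 32, 118)
```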

3 Results

Model     Unit Obs.   Loss         Win rate (std. dev.)
                                   Training bots      Locutus-20181007   McRave-51e49b0
Control   -           -            0.793 (0.010)      0.387 (0.036)      0.556 (0.032)
LSTM      Visible     Value        0.878 (0.004)      0.635 (0.031)      0.706 (0.048)
LSTM      Visible     Value+Aux.   0.879 (0.001)      0.579 (0.065)      0.723 (0.063)
LSTM      Memory      Value        0.886 (<0.001)     0.587 (0.031)      0.693 (0.077)
LSTM      Memory      Value+Aux.   0.888 (0.004)      0.600 (0.027)      0.725 (0.039)

Table 1: Win rates for different setups after on-policy refinement. Evaluations are done on the bots in the training set (see Table 2) and two bots that have not been seen during training.

Our evaluation protocol is described in Appendix C. Table 1 compares win rates for a control run (without build order switching) and trained models when playing against the training set opponents and two held-out bot versions. The training set win rate is the average of all per-bot win rates to account for the varying number of build orders per opponent faction. Standard deviations are computed on averages for the first, second and third set of games per map iteration.
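As an illustration of this aggregation, the snippet below computes a macro-averaged win rate and its spread across game sets from a hypothetical list of (bot, game set, won) records; the data format and names are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev

def macro_win_rate(games):
    # Average of per-bot win rates, so opponents with more games
    # (e.g. more starting build orders for their faction) do not dominate.
    per_bot = defaultdict(list)
    for bot, _, won in games:
        per_bot[bot].append(won)
    return mean(mean(results) for results in per_bot.values())

def win_rate_std(games):
    # Spread of the macro-averaged win rate over the game sets
    # (first, second and third set of games per map iteration).
    per_set = defaultdict(list)
    for record in games:
        per_set[record[1]].append(record)
    return pstdev(macro_win_rate(set_games) for set_games in per_set.values())

# Hypothetical records: (opponent bot, game set index, 1 if won else 0).
games = [("BotA", 0, 1), ("BotA", 1, 0), ("BotB", 0, 1), ("BotB", 1, 1)]
print(macro_win_rate(games), win_rate_std(games))
```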

For all variants, we observe strong win rate improvements over the control run, for both training and held-out bots. With an auxiliary loss, we observe reductions in Q-value prediction error during the initialization training phase (Fig. 1(a)). However, the final win rate improvements are within the standard deviation. Performance on the held-out bots is subject to high variance; remarkably, the gains of the memory-observation variants on the training bots do not carry over to the held-out bots. Fig. 2 illustrates unit prediction performance during two games against the two held-out bots and reveals sensible predictions.


Appendix A Action Space

The actions we consider consist of switching between (or continuing to play) fixed, rule-based build orders. A build order contains commands to build specific units at specific points in time, or depending on properties of the current observation, such as the per-type unit and building counts of the player and its opponent. From a reinforcement learning perspective, build orders can be regarded as hard-coded options that can play an entire game or be terminated at any moment.

A build order is specialized to implement a given strategy, which corresponds to different army compositions (different types of units to build), as well as different trade-offs between short-term and long-term army strength. For instance, some build orders are specialized to assemble large armies of weaker units in the short term, while others invest more heavily in buildings and upgrades to create stronger units in the long run. The winning probabilities of build orders against other build orders are not transitive, so the build order needs to be changed if it implements an ineffective strategy against the one chosen by the opponent.

The action space contains 25 different build orders, each of which is specialized to a specific match-up: the game of StarCraft has different “races” (Zerg, Protoss and Terran), and strategies are specialized depending on the race of the player (Zerg in our case) and its opponent. In total, we obtain 42 distinct actions. During a game, model outputs corresponding to build order specializations not relevant for the opponent race are ignored.
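The masking of outputs for irrelevant specializations could look like the sketch below; the mapping from opponent race to valid action indices is purely illustrative and not taken from the paper.

```python
import torch

# Hypothetical mapping from the opponent's race to the indices of build
# order specializations that apply to that match-up.
VALID_ACTIONS = {
    "Terran": [0, 1, 2, 5, 8],
    "Protoss": [0, 3, 4, 6, 9],
    "Zerg": [1, 2, 7, 10, 11],
}

def mask_values(values: torch.Tensor, opponent_race: str) -> torch.Tensor:
    """Ignore model outputs for build orders that do not apply to the
    current match-up by setting them to -inf before any argmax."""
    masked = torch.full_like(values, float("-inf"))
    idx = torch.tensor(VALID_ACTIONS[opponent_race])
    masked[..., idx] = values[..., idx]
    return masked

# Example: only the specializations valid against Protoss remain selectable.
print(mask_values(torch.rand(12), "Protoss"))
```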

Appendix B Model Input

Our models are provided with the following features that are extracted from the game state for each model evaluation (every 5 seconds of game time).

  • Unit counts are provided per-type, in disjoint channels for allied and enemy units. We scale the counts by approximate unit type value [synnaeve2012bayesian] and a factor of . We consider two variants for enemy units:

    • Visible: the currently visible enemy units.

    • Memory: all enemy units that have been observed since the start of the game, excluding units whose destruction has been observed.

  • Resources, i.e. minerals, gas, used supply and maximum supply, are each transformed by .

  • Upgrades and technologies are marked as 1 if available and 0 otherwise.

  • Separately, upgrades and technologies that are currently being researched are marked as 1 and 0 otherwise.

  • Game time in minutes is transformed by

  • Build order: index of the currently active build order

  • Race: index of the enemy race

  • Map: index of the map the game is played on

The LSTM input is a concatenation of all these features, with categorical features (race, map, build order) each represented by an 8-dimensional embedding and non-unit features (resources, upgrades and technologies) undergoing a linear projection with 8 units each.
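A possible way to assemble this input vector is sketched below. The 8-dimensional embeddings and 8-unit projections follow the description above, while the feature dimensions, the treatment of game time, and all layer names are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class InputEncoder(nn.Module):
    # Sketch only: feature dimensions are hypothetical placeholders.
    def __init__(self, num_unit_types=118, num_resources=4, num_upgrades=60,
                 num_build_orders=42, num_races=3, num_maps=10):
        super().__init__()
        # 8-dimensional embeddings for categorical features.
        self.build_order_emb = nn.Embedding(num_build_orders, 8)
        self.race_emb = nn.Embedding(num_races, 8)
        self.map_emb = nn.Embedding(num_maps, 8)
        # 8-unit linear projections for non-unit numerical features.
        self.resource_proj = nn.Linear(num_resources, 8)
        self.upgrade_proj = nn.Linear(2 * num_upgrades, 8)  # available + in progress

    def forward(self, allied_counts, enemy_counts, resources, upgrades,
                game_time, build_order, race, game_map):
        # All tensors share the same leading (batch, time) dimensions;
        # game_time has a trailing dimension of 1.
        return torch.cat([
            allied_counts,                      # scaled per-type unit counts
            enemy_counts,                       # "visible" or "memory" variant
            self.resource_proj(resources),
            self.upgrade_proj(upgrades),
            game_time,                          # transformed game time
            self.build_order_emb(build_order),
            self.race_emb(race),
            self.map_emb(game_map),
        ], dim=-1)


# Example: batch of 2 games, 16 time steps each.
enc = InputEncoder()
x = enc(torch.rand(2, 16, 118), torch.rand(2, 16, 118), torch.rand(2, 16, 4),
        torch.rand(2, 16, 120), torch.rand(2, 16, 1),
        torch.zeros(2, 16, dtype=torch.long), torch.zeros(2, 16, dtype=torch.long),
        torch.zeros(2, 16, dtype=torch.long))
print(x.shape)  # (2, 16, 277)
```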

Appendix C Training and Evaluation Details

All games are played on the AIIDE map pool (https://skatgame.net/mburo/sc2011/rules.html). During training, opponents are selected at random, and initial build orders are chosen according to a bandit algorithm. Players can adapt between matches in short series of 25 games each.

In evaluation mode, we execute the build order with the highest predicted value, i.e. argmax_b Q(o_t, b). To reduce unnecessary back-and-forth switching, we only switch to a new build order if its value has a minimum advantage of 0.01 over the currently active one. At the start of a game, the model receives no observations about the opponent besides its race. Hence, we only start following the Q-function after six minutes of game time (if an opponent rush or proxy attack is detected, we allow for earlier switching) to ensure that the selected initial build order has a sufficient effect.
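Put together, the evaluation-time switching rule amounts to something like the sketch below; the build order names and the rush-detection flag are placeholders introduced for illustration.

```python
SWITCH_MARGIN = 0.01       # minimum advantage required to switch
MIN_SWITCH_TIME = 6.0      # minutes before the Q-function is followed,
                           # unless a rush or proxy attack is detected

def select_build_order(q_values, current, game_time_minutes, rush_detected=False):
    # q_values: dict mapping build order name -> predicted value.
    if game_time_minutes < MIN_SWITCH_TIME and not rush_detected:
        return current
    best = max(q_values, key=q_values.get)
    if best != current and q_values[best] >= q_values[current] + SWITCH_MARGIN:
        return best
    return current

# Example: the marginally better option does not clear the 0.01 threshold.
print(select_build_order({"zvp_hydras": 0.610, "zvp_mutas": 0.615}, "zvp_hydras", 9.0))
```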

During off-policy initialization, models are optimized with Adam [kingma2014adam]. We use a learning rate of and batch 256 games per update. Gradients are back-propagated through time (BPTT) and truncated after 512 time steps.

Q-value heads are trained on build order switching points only, while we compute the Huber loss for the auxiliary prediction task on every sample. However, for each mini-batch, both losses are averaged independently, taking the total number of contributing samples into account. We apply a scaling factor of 10 to the loss obtained from the auxiliary task; we found no notable difference between factors of 5, 10 and 20, but worse performance (in terms of value head error rates) for 1 and 100. All models are trained for four epochs on the training corpus.
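A per-batch loss computation along these lines is sketched below. The use of a binary cross-entropy between the sigmoid value outputs and the win/lose target, as well as the tensor layout, are assumptions of this sketch; only the switch-point masking, the independent averaging and the factor of 10 come from the description above.

```python
import torch
import torch.nn.functional as F

AUX_SCALE = 10.0  # weight of the auxiliary unit-count loss

def combined_loss(values, outcomes, switch_mask, pred_counts, true_counts):
    # values, outcomes, switch_mask: (batch, time); counts: (batch, time, units).
    # Value loss is computed at build order switching points only and averaged
    # over those points; the auxiliary Huber loss covers every sample and is
    # averaged independently.
    value_loss = F.binary_cross_entropy(values[switch_mask], outcomes[switch_mask])
    aux_loss = F.huber_loss(pred_counts, true_counts)
    return value_loss + AUX_SCALE * aux_loss

# Example with random tensors; the switching points (every 8 steps) are hypothetical.
B, T, U = 4, 32, 118
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, ::8] = True
loss = combined_loss(torch.rand(B, T), torch.randint(0, 2, (B, T)).float(), mask,
                     torch.rand(B, T, U), torch.rand(B, T, U))
print(loss.item())
```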

The same optimization settings are used for on-policy refinement, with the exception of a smaller batch size of 64. The learning rate is reduced to improve stability during online learning. After the initialization step above, models receive 2500 additional updates in the on-policy setting.

For testing, our model is run in evaluation mode, and we run three games per map against each opponent for each starting build order; players are not allowed to adapt between games in order to reduce the variance of the results.

Appendix D Auxiliary Task Performance

(a) Q-value prediction errors
(b) Auxiliary task Huber loss
Figure 1: Performance during off-policy initialization on a held-out in-domain validation set, for Q-value prediction at build order switching points (errors) and for the auxiliary task (Huber loss). Data points are averaged over 3 runs each.
(a) vs. Locutus-20181007
(b) vs. McRave-51e49b0
Figure 2: Observed (visible only), predicted and ground truth opponent unit counts against two bots not seen during training. The y-axis represents a sum of the respective counts, individually normalized as described in Appendix B. In both cases, the unit count prediction closely follows the ground truth. Fig. 2(a) depicts a game that was lost, while the game in Fig. 2(b) was won.
Bot Name Race Version/Source
AILien Zerg SSCAIT/AIIDE2017
AIUR Protoss SSCAIT/AIIDE2017
Arrakhammer Zerg SSCAIT/AIIDE2017
BananaBrain Protoss SSCAIT*
Bereaver Protoss SSCAIT
BlackCrow Zerg SSCAIT
CUNYBot Zerg SSCAIT
HannesBredberg Terran SSCAIT
HaoPan Terran SSCAIT
ICEBot Terran AIIDE2017
Iron Terran SSCAIT
Iron Terran AIIDE2017
Juno Protoss SSCAIT/AIIDE2017
KillAll Zerg SSCAIT/AIIDE2017
Killerbot Zerg SSCAIT
LetaBot Terran SSCAIT
LetaBot Terran AIIDE2017
LetaBot-BBS Terran from Github https://git.io/fxSlk
LetaBot-SCVMarineRush Terran from Github https://git.io/fxSlL
LetaBot-SCVRush Terran from Github https://git.io/fxSlq
Locutus Protoss SSCAIT*
Locutus-20181007 Protoss SSCAIT
Matej_Istenik Terran SSCAIT
McRave Protoss SSCAIT*
McRave-51e49b0 Protoss from Github https://git.io/fx1ho
MegaBot Protoss SSCAIT
Microwave Zerg SSCAIT*
NLPRBot_CPAC Zerg SSCAIT/AIIDE2017
NeoEdmundZerg Zerg SSCAIT
NiteKatP Protoss SSCAIT
NiteKatT Terran SSCAIT
Overkill Zerg SSCAIT
Overkill-AIIDE2016 Zerg AIIDE2016
Overkill-AIIDE2017 Zerg AIIDE2017
Pineapple_Cactus Zerg SSCAIT
Prism_Cactus Protoss SSCAIT
Proxy Zerg SSCAIT*
Randomhammer Protoss SSCAIT*
Randomhammer Terran SSCAIT*
SkyFORKNet Protoss SSCAIT
Skynet Protoss SSCAIT
Steamhammer Zerg SSCAIT*
Stone Terran SSCAIT
Toothpick_Cactus Terran SSCAIT
Tscmoo Protoss Provided by author
Tscmoo Terran Provided by author
Tscmoo Zerg Provided by author
UAlbertaBot Protoss SSCAIT
UAlbertaBot Terran SSCAIT
UAlbertaBot Zerg SSCAIT
UITTest Protoss SSCAIT
WillyT Terran SSCAIT*
WuliBot Protoss SSCAIT
Xelnaga Protoss AIIDE2017
Ximp Protoss SSCAIT
ZZZKBot Zerg SSCAIT
Zia_bot Zerg AIIDE2017
Table 2: List of opponent bots considered. SSCAIT versions were obtained from the public SSCAIT ladder at https://sscaitournament.com/index.php?action=scores; AIIDE versions can be found at https://www.cs.mun.ca/~dchurchill/starcraftaicomp/results.shtml. Bots denoted as “SSCAIT*” have since been replaced by newer versions or removed online by their respective authors.