Applying supervised and reinforcement learning methods to create neural-network-based agents for playing StarCraft II

by Michal Opanowicz, et al.

Recently, multiple approaches for creating agents that play complex real-time computer games such as StarCraft II or Dota 2 have been proposed. However, they either embed a significant amount of expert knowledge into the agent or use an amount of computational resources that is prohibitive for most researchers. We propose a neural network architecture for playing the full two-player match of StarCraft II, trained with general-purpose supervised and reinforcement learning, that fits on a single consumer-grade PC with a single GPU. We also show that our implementation achieves non-trivial performance when compared to the in-game scripted bots. We make no simplifying assumptions about the game except for playing on a single chosen map, and we use very little expert knowledge. In principle, our approach can be applied to any RTS game with small modifications. While our results are far behind the state-of-the-art large-scale approaches in terms of final performance, we believe our work can serve as a solid baseline for other small-scale experiments.



1.1 AI in games

Playing popular games has been a benchmark for artificial intelligence since the 1970s, when computers were first used to play board games like chess or checkers. With the increase in computing power, search-based techniques became viable and humans were defeated in many board games, such as the aforementioned chess and checkers, as well as Reversi.

Playing complex real-time games such as StarCraft II or Dota 2, however, remained largely out of reach for a long time. Search-based approaches are often inapplicable to them due to the extremely large state and action spaces, so methods based purely on expert knowledge were usually used instead. Due to the complexity of those games, however, hand-crafting a policy capable of acting in every situation is effectively impossible, which means such agents are usually easily exploited by human players.

In recent years, with the development of neural networks and deep learning, especially deep reinforcement learning, new possibilities have opened. Using convolutional neural networks combined with Monte Carlo Tree Search, playing board games with relatively large action spaces, such as Go, became possible [18]. At the same time, superhuman performance was achieved in many simple real-time games with small action spaces through the development of approaches such as Deep Q-learning [12].

In 2018, OpenAI presented a large-scale RL approach capable of playing Dota 2, known as OpenAI Five, which repeatedly defeated professional players [2]. Not long after, in 2019, DeepMind presented a somewhat similar approach for playing StarCraft II [21], showing that reinforcement learning is capable of achieving very strong performance. Unfortunately, those approaches usually require hundreds or thousands of years of in-game experience to train successful agents - learning to play games at the speed of a human remains largely unachievable.

1.2 StarCraft II

StarCraft II (SC2) is a complex single- and multiplayer real-time strategy (RTS) game in which players need to manage resources and build and command multiple units. It poses significant challenges for machine learning: the game engine makes 22 steps per second, and games tend to last more than 10 minutes and sometimes over an hour, meaning that an agent needs to deal with long time horizons; the action space is also very large, as players command units by selecting points on the screen.

A brief description of the 1vs1 game in SC2:

  1. Each of the players chooses one of the three races they will play - Terran, Protoss, or Zerg. Each race has access to completely different units and some unique game mechanics, which has a significant impact on the strategy.

  2. The players start the game with one base each, usually located at the opposite ends of the map, and the same amount of workers that collect resources and build production buildings or more bases. Resources in each base are limited, so creating more bases in longer games is necessary.

  3. The players build production buildings and offensive units and attempt to attack each other. Players can give commands to the units by clicking specific units or groups of them and selecting points on the map where the units should move, or enemy units to attack. They can also use various abilities that the units might have, such as disabling the opponent's weapons, healing, or increasing the armor of allies within a certain radius.

  4. The game ends when one of the players loses all the buildings or leaves the game, in which case the other player wins. The game also ends in a draw when neither of the players builds or destroys any structure or collects resources for around 6 minutes.

It should also be noted that enemy units are invisible (covered by the so-called fog-of-war) unless within the vision range of allied units, which significantly complicates strategy and enables various mind games, such as faking commitment to some strategy or denying scouting information.

For use in machine learning, DeepMind, in cooperation with Blizzard Entertainment, developed the StarCraft II Learning Environment (SC2LE, with the main Python library called PySC2) [22], which allows computer programs to play the game using three different interfaces:

  1. A true 3D rendering that is very similar to what the human players see (known as the render interface),

  2. A 2D top-down view split into multiple feature layers that show various properties of the units and maps (known as the feature interface),

  3. A list of units containing all of the information about them that is visible for the player (known as the raw interface).

For sending commands, interfaces 1 and 2 allow the program to only send commands that would also be available to the human player, interface 3 however allows for much more powerful commands, such as selecting any desired subset of units located anywhere on the map.

The game also includes built-in scripted bots, ordered by increasing difficulty: Very Easy, Easy, Medium, Medium-hard, Hard, Very Hard, and Elite (names as listed in the PySC2 API). There are also Cheater 1, Cheater 2, and Cheater 3 bots, which are stronger than the Elite bot and get various advantages (such as double income or no fog-of-war) that would violate the rules of a normal game.

The 'Very Easy' bot avoids building large armies and can be defeated by a human without much experience in the game; the 'Elite' bot requires a significant amount of skill, and some players estimate that it plays at a Silver or Gold league level, i.e. within the top 60-70% of players who play regularly. The 'Medium' bot is fairly difficult for new players and usually requires several hours of experience in RTS games to defeat.

Figure 1.1: Visualisation generated by PySC2: rendering from the render interface (left) next to the feature layers (with colors for human readability, right) from the feature interface

1.3 Previous work

Since PySC2 was released in 2017 [22], few machine learning approaches attempting to play the full SC2 game using it have been published. Early approaches focused on solving 'minigames' - a set of custom maps, each focused on a specific objective: moving units around the map, fighting a group of enemies in a specific fixed situation, or building units with a limited action set. Playing the full multiplayer game of StarCraft II was mostly attempted using partially-scripted approaches, with large parts of the strategy built into the agent by experts. It should be noted that those agents can be quite good - for example, TStarBot [19] was able to defeat all built-in AIs. Hybrid approaches were also proposed, with the agent architecture consisting of both hard-coded and learning components, which were successful at defeating bots in restricted conditions [10] [13].

In 2019, DeepMind showed a powerful agent called AlphaStar [21], controlled by a neural network trained on human replays and through self-play with reinforcement learning, which played at a level comparable to human experts - clearly showing that, given enough computational resources, it is possible to play the game through pure machine learning. However, their implementation was not made public, and the computational resources used for this experiment would be prohibitively expensive for most amateur researchers and small-scale labs. Two approaches based on AlphaStar were later published by other researchers [23] [6]. They used fewer resources than the original AlphaStar, but limited their agents to playing only mirror matchups (Terran vs Terran or Zerg vs Zerg).

Recently, a small-scale approach based purely on imitation learning was published on GitHub [15]. The authors report that it took a week to train using 4 GPUs and that it is capable of producing an agent that can defeat some of the built-in AIs.

1.4 Our contribution

We combine recent developments in applying machine learning to the RTS games to produce an agent that achieves non-trivial performance in StarCraft II using low computational resources - specifically, around 10 days on a single PC with one GPU. We also show that with supervised pretraining, it is possible to obtain significant improvement using reinforcement learning even on our relatively small scale.

Our agent plays using a single recurrent neural network to process the information from the game and select an action. The network is trained to predict the probability distribution of the actions made by human players at every time step, an approach known as behavioral cloning. The dataset consists of around 18000 games that were played on a single chosen map and include at least one Protoss player. Training takes around 7 days on a single NVidia GTX 1080 Ti GPU, plus 2 days to preprocess the replays. We adapt many parts of the AlphaStar architecture for our network, but we also use parts of the architecture described in Relational Deep Reinforcement Learning [24].


Our agents achieve higher win percentages against the built-in AIs than the ones presented in [15]. However, they are not directly comparable - agents from [15] were trained only on Terran vs Terran matches, while our agents play as a Protoss against all three races, but on a single map.

Our implementation is written in Python, using PyTorch as the machine learning framework, and is available on GitHub.

2.1 Neural networks

Neural networks are a family of very powerful predictive models, used as classifiers or function approximators in various contexts. They usually take the form of a sequence or a directed graph of 'layers', where each layer is some parametrized linear transformation applied to a vector, followed by an element-wise nonlinearity applied to the result. They are most commonly trained with gradient descent (it is possible to efficiently compute the gradients using the chain rule; this computation is usually called backpropagation), which makes it possible to use many different loss functions.

In this section, we briefly describe the neural network components that we use in this work.

2.1.1 Multi-Layer Perceptron (MLP)

An MLP module is a chain of so-called fully-connected layers, where each fully-connected layer has the form:

y = f(Wx + b)

where W (called the weight matrix) is a matrix of shape (n_out, n_in), b (called the bias) is a vector of length n_out, x is the input vector of length n_in, y is the output vector of length n_out, and f is a (usually element-wise) function called the activation function - in modern neural networks often the so-called ReLU (Rectified Linear Unit): f(x) = max(0, x). W and b are the trainable parameters of the layer.

In this work, we use the ELU (Exponential Linear Unit) activation function, as proposed in [3]:

ELU(x) = x for x > 0, and ELU(x) = α(exp(x) − 1) for x ≤ 0

where α is a hyperparameter, typically set to 1. The authors of ELU show that it seems to allow for faster training and better generalization. In our case, it also appears to improve the stability of training when operating on half-precision (16-bit) floating point numbers.
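As an illustration, a fully-connected layer with an ELU activation can be sketched in a few lines of NumPy (a hypothetical minimal example, not our actual PyTorch implementation):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def fully_connected(x, W, b):
    # y = f(Wx + b), with ELU as the activation function f.
    return elu(W @ x + b)

# A tiny 2-layer MLP: 4 -> 8 -> 3.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

x = rng.normal(size=4)
h = fully_connected(x, W1, b1)
y = fully_connected(h, W2, b2)
print(y.shape)  # (3,)
```

Note that for large negative inputs ELU saturates at −α rather than at 0, which keeps gradients from dying the way they can with ReLU.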

2.1.2 Convolutional Neural Network (CNN)

CNNs are used to process inputs in which spatial relationships are important, such as images. They consist of convolutional layers which can be thought of as trainable filters - moving a small window over an image, and repeatedly applying a shared, small fully-connected layer to all pixels in the window, to produce a new, transformed ‘image‘ (called feature map), potentially with a different number of channels.

Figure 2.1: A convolutional layer with a given kernel shape and number of output channels being applied to a 3-dimensional input tensor, producing an output tensor whose width and height depend on the kernel shape and stride.

A convolutional layer has 4 important hyperparameters:

  • the kernel shape, which is the width and height of the window that sweeps over the input,

  • the number of input channels,

  • the number of output channels,

  • the stride, which describes how the window moves over the input - with stride=1, the window will be applied at every valid location, with stride=2, the window will only be applied at locations with coordinates divisible by 2, halving the width and height of the output image.

Often, CNNs consist of interleaved layers with stride 1 that keep the number of channels constant, and layers with stride 2 that double it, applied until the output is an 'image' with a small number of pixels but a very large number of channels.
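The effect of stride on the output size follows standard convolution arithmetic; a small helper function (illustrative only) makes the halving pattern in such a stack explicit:

```python
def conv_out_size(size, kernel, stride, padding):
    # Standard convolution output-size formula:
    # floor((size + 2 * padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# A stack of three stride-2 layers with 3x3 kernels and padding 1
# halves the spatial resolution at every step: 64 -> 32 -> 16 -> 8.
size = 64
for _ in range(3):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
print(size)  # 8
```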

2.1.3 Residual Neural Network (ResNet)

Residual neural networks were developed to allow training networks with a very large number of layers. Such networks are usually built from modules of the following form:

y = F(x) + x

where x is the input vector, y is the output vector, F is some neural network whose input and output have the same shape, and the '+ x' term is the so-called residual connection, which gives the deeper layers direct access to the input of the earlier layers. Residual neural networks with convolutional layers are currently considered state-of-the-art for image classification tasks.
In ResNets, a so-called Batch Normalization [9] layer is often added to the neural network. This layer normalizes its inputs so that the values have mean 0 and variance 1 along the batch dimension, and it was shown to improve the performance of deep ResNets. In our case, we replace Batch Normalization with Layer Normalization [1], which normalizes the values along the layer dimensions, as we found it to improve the learning speed compared to Batch Normalization.
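A residual block with Layer Normalization can be sketched as follows (a hypothetical NumPy illustration of y = F(x) + x, not our exact block):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize along the feature dimension to mean 0, variance 1.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, W, b):
    # y = F(x) + x, where F here is layer-norm -> linear -> ELU.
    h = layer_norm(x)
    h = W @ h + b
    h = np.where(h > 0, h, np.exp(h) - 1.0)  # ELU
    return x + h

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W, b = rng.normal(size=(16, 16)) * 0.1, np.zeros(16)
y = residual_block(x, W, b)
print(y.shape)  # (16,)
```

Because the input is added back unchanged, the gradient always has a direct path around F, which is what makes very deep stacks of such blocks trainable.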

2.1.4 Feature-wise Linear Modulation (FiLM)

FiLM [14] is a relatively simple upgrade to convolutional neural networks designed to 'inject' some non-spatial information into a spatial input - originally introduced to create networks that process a piece of text and an image simultaneously and answer questions from the text by selecting points on the image. It is also useful in our case: when the non-spatial component of the network chooses to build a building, information about this can be introduced into the spatial component to choose the proper location on the screen where the building should be built.

It works by applying a small MLP to the non-spatial input to compute γ_c and β_c coefficients for each channel c of the spatial input, and then using them to apply an affine transformation to the entire channel:

FiLM(x_c) = γ_c x_c + β_c

That way, the information gets 'evenly introduced' to the entire image.
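In NumPy terms, the per-channel modulation looks like this (an illustrative sketch; in the real module the gamma and beta coefficients come from a trainable MLP, stood in here by a fixed random projection):

```python
import numpy as np

def film(spatial, gamma, beta):
    # spatial: (channels, height, width); gamma, beta: (channels,)
    # Each channel is scaled and shifted by its own coefficients,
    # broadcasting the same affine transform over all its pixels.
    return gamma[:, None, None] * spatial + beta[:, None, None]

rng = np.random.default_rng(0)
spatial = rng.normal(size=(128, 8, 8))   # conv feature map
conditioning = rng.normal(size=(64,))    # non-spatial vector
# Stand-in for the MLP: a random projection to 2 * channels values.
W = rng.normal(size=(256, 64)) * 0.1
gamma, beta = np.split(W @ conditioning, 2)
out = film(spatial, gamma, beta)
print(out.shape)  # (128, 8, 8)
```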

2.1.5 Recurrent Neural Network (RNN)

Recurrent neural networks have an ‘internal state‘ - an output that is passed as a part of the input in the next iteration. Such networks are usually trained by ‘unrolling‘ them over several iterations, effectively creating a network consisting of a chain of modules with shared weights that can be trained as usual.

Recurrent networks were originally used for tasks such as natural language translation or sound processing, but they are also very useful for playing real-time games such as SC2 - most importantly, they allow the network to remember or infer information about the game state that is not visible from a single observation.

A simple RNN can look like this:

h_t = f(W x_t + U h_{t−1} + b)

which is similar to the fully-connected layer, with the following differences: U is a matrix of shape (n_out, n_out), x_t is the input vector for the current iteration, h_{t−1} is the output vector from the previous iteration, and h_t is the output of the current iteration.

However, such RNNs are difficult to train, as they suffer heavily from either vanishing or exploding gradients. When computing the gradients, the transposed U matrix is applied to the back-propagated gradient many times, and if any of its eigenvalues is greater than 1, the gradients start growing exponentially. To counteract that, one can use an activation function that limits the values, such as the sigmoid, but that in turn causes the gradients to vanish very quickly.

To counteract that, an architecture called Long Short-Term Memory (LSTM) was introduced, in which one part of the internal state is not multiplied by any matrix at all, and no nonlinearity is applied to it. The internal state can only be multiplied by factors smaller than 1, and a value smaller than 1 can be added to or subtracted from it. (The exact formulation is quite complicated, so we recommend referring to the original paper for details.) This effectively removes the vanishing and exploding gradient problems and allows for training on very long sequences. In this work, we use the LSTM as our RNN component.
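The unrolling of a simple RNN with shared weights, as described above, can be sketched as follows (a hypothetical NumPy illustration; the LSTM adds gating on top of this same pattern):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # h_t = tanh(W x_t + U h_{t-1} + b); the same W, U, b are
    # shared across every time step of the unrolled chain.
    return np.tanh(W @ x_t + U @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hidden, steps = 4, 8, 10
W = rng.normal(size=(n_hidden, n_in)) * 0.1
U = rng.normal(size=(n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)          # initial internal state
for t in range(steps):          # "unrolling" over the sequence
    x_t = rng.normal(size=n_in)
    h = rnn_step(x_t, h, W, U, b)
print(h.shape)  # (8,)
```

During training, gradients flow backwards through this whole chain, which is exactly where the repeated multiplication by U causes the vanishing/exploding behaviour discussed above.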

2.1.6 Transformer

Transformer [20] is an architecture introduced originally to process sentences in natural language (specifically for translation). It has been since then adapted to other tasks where a list of objects with some sort of relationship between them is present, often achieving state-of-the-art performance in such tasks. For example, Transformers have been used for image classification, achieving comparable results to ResNets [4].

The Transformer architecture is based on an attention mechanism, which intuitively allows a neural network to 'focus' on the part of some data that the network 'wants' to find at a given moment. This is done by computing a vector q called the 'query', and for every i-th sample in the data vectors k_i and v_i called the 'keys' and 'values'. Then, one can compute the sample weights

w_i = softmax_i(q · k_i / √d)

where d is the dimensionality of the keys, and compute the output value:

y = Σ_i w_i v_i

In the case of Transformers, this step is usually done for each part of the data, which means that, for example, every token in the text 'asks' about all the other tokens.

In our case, we use Transformers as proposed in Relational Deep Reinforcement Learning [24], where the Transformer module is applied to the 'flattened' output of the convolutional layers that look at the screen. The idea is that this might allow the network to apply reasoning that requires understanding relationships between objects, such as 'this is an idle worker, so let's search for mineral fields in its vicinity for it to mine', or 'this is a combat unit, so let's search for enemy units that it counters and command it to attack', etc.
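A single scaled dot-product attention step over a set of entities (such as a flattened conv feature map) can be sketched as follows (illustrative NumPy: one head, no trainable query/key/value projections):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention(q, K, V):
    # q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    # Weights w_i = softmax_i(q . k_i / sqrt(d)); output y = sum_i w_i v_i.
    d = q.shape[0]
    w = softmax(K @ q / np.sqrt(d))
    return w @ V, w

rng = np.random.default_rng(0)
entities = rng.normal(size=(64, 32))   # e.g. a flattened 8x8 feature map
q = rng.normal(size=32)
y, w = attention(q, entities, entities)
print(y.shape, w.sum())  # (32,) 1.0
```

In a full self-attention layer every entity produces its own query, so each of the 64 'pixels' attends to all the others in the same way.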

2.2 Reinforcement learning

Reinforcement learning (RL) is a family of problems where a learning agent interacting with an environment receives a scalar reward for its actions, and its goal is to maximize said reward. Typically, a problem in reinforcement learning is described as a Markov Decision Process (MDP), which is a tuple (S, A, P, R). S is a set of states, A is a set of actions the agent can make, P(s' | s, a) is the probability of transitioning from a state s to a state s' when the agent selects an action a, and R(s, a, s') is the reward that the agent gets for such a transition.

An agent usually chooses actions using a policy π(a | s), which is the probability of making an action a in a state s. An agent playing in the environment generates a chain of observation-action-reward tuples, called a trajectory: (s_0, a_0, r_0), (s_1, a_1, r_1), ..., usually ending in some terminal state.

To make optimization easier, a proxy objective is introduced - at any step t in a state s_t, we want the agent to maximize the expected discounted sum of rewards

V(s_t) = E[ Σ_{i≥0} γ^i r_{t+i} ]

where γ ∈ (0, 1) is the discount factor, usually close to 1. We will call V(s) the value of a state. This objective has the nice property of being bounded if the rewards in the MDP are bounded, and it resembles the natural intuition that events in the near future are more important than events in the far future.

The goal now becomes to choose π in such a way that it maximizes V(s) for any state. This is, however, quite a difficult task, as in most cases the state space is gigantic. In deep reinforcement learning, the policy (represented by a neural network) is optimized locally, using Monte Carlo estimates of the true V. In this work, we focus on methods based on the advantage of an action - the difference between the value of a state s under the assumption that a specific action a is chosen in it and the value V(s) itself. To compute the advantage estimate efficiently, another part of the agent is introduced - an estimator V̂ of the true value V, also represented by a neural network.

To compute the very noisy Monte Carlo estimate of V(s_t), one can use the equation

V(s_t) ≈ Σ_{i=0}^{T−t} γ^i r_{t+i}

However, computing this estimate requires acting in the environment until a terminal state is reached, which in the case of SC2 could mean thousands of steps. To allow for more frequent updates, a bootstrapped estimate is used instead - a short trajectory of length n is generated, and the estimates are computed as follows:

V_target(s_t) = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n V̂(s_{t+n})

where V̂(s_{t+n}) is the neural network value estimate of the last state in the trajectory, called the bootstrap value.

The advantages are then computed as follows: first, a trajectory is generated by running the current policy in the environment. Then, an advantage estimate can be computed:

Â(s_t, a_t) = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n V̂(s_{t+n}) − V̂(s_t)

Since we know which action a_t was chosen at step t, Â(s_t, a_t) is a bootstrapped estimate of the true advantage.
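The bootstrapped n-step advantage can be computed with a few lines of plain Python (an illustrative sketch; `rewards`, `values`, and `bootstrap` are assumed inputs standing in for a short rollout of the agent):

```python
def n_step_advantages(rewards, values, bootstrap, gamma=0.99):
    # rewards[t]: reward r_t along a short trajectory of length n
    # values[t]:  critic estimate V(s_t) for each state in it
    # bootstrap:  critic estimate V(s_{t+n}) of the state after the
    #             trajectory, used in place of the unknown true return
    advantages = []
    for t in range(len(rewards)):
        ret = bootstrap
        # Discounted sum of rewards from t onward plus the discounted
        # bootstrap value: sum_i gamma^i r_{t+i} + gamma^(n-t) * V.
        for r in reversed(rewards[t:]):
            ret = r + gamma * ret
        advantages.append(ret - values[t])
    return advantages

adv = n_step_advantages([1.0, 0.0, 1.0], [0.5, 0.5, 0.5], bootstrap=0.0)
print(adv)
```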

Having the advantage computed, it is possible to write an optimization objective for the policy. Several such objectives were introduced - for more information see papers on the Asynchronous Advantage Actor Critic [11], Trust Region Policy Optimization [16], Proximal Policy Optimization [17], IMPALA [5].

We use the Proximal Policy Optimization objective, as it is simple to implement and has been shown to be empirically quite stable in a variety of conditions. The optimization objective is defined as follows:

L(θ) = E_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ], where ρ_t(θ) = π_θ(a_t | s_t) / π_old(a_t | s_t)

where π_θ is the policy currently being optimized, and π_old is the policy that was used to compute the advantages. Decoupling the two allows applying the policy update several times on the same trajectory, as long as the current policy is not too different from the old one - the tolerance for this difference is controlled by the ε hyperparameter, usually set to 0.2. We use a more conservative value.

The estimator V̂ is usually optimized by minimizing the mean squared error between the neural network estimate and the bootstrapped Monte Carlo estimate:

L_V = ( V̂(s_t) − V_target(s_t) )²

Because all the estimates are very noisy, the data is usually processed in large batches, so that the gradient descent updates remain relatively stable. In our setting, we found that batches of size 512 were necessary for the training to remain stable.
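Putting the two losses together, the clipped PPO policy loss and the value loss can be sketched in NumPy (illustrative only; in practice this operates on log-probabilities produced by the network):

```python
import numpy as np

def ppo_losses(logp_new, logp_old, advantages, values, targets, eps=0.2):
    # Probability ratio rho_t = pi_new(a|s) / pi_old(a|s),
    # computed from log-probabilities for numerical stability.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO maximizes the clipped surrogate, so the loss is its negation.
    policy_loss = -np.mean(np.minimum(ratio * advantages,
                                      clipped * advantages))
    # Mean squared error between critic estimates and bootstrapped targets.
    value_loss = np.mean((values - targets) ** 2)
    return policy_loss, value_loss

pl, vl = ppo_losses(
    logp_new=np.array([-1.0, -0.5]), logp_old=np.array([-1.0, -1.0]),
    advantages=np.array([1.0, -1.0]),
    values=np.array([0.5, 0.2]), targets=np.array([1.0, 0.0]))
print(pl, vl)
```

The `min` with the clipped ratio is what removes the incentive to move the policy further than ε away from the old one on any single sample.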

3.1 Input

The network receives input as tensors of different shapes that represent what a human would see on a screen, in a more machine-readable format - the feature interface from the introduction. Those inputs provide information that is directly available to the human player; however, there are some notable differences - the game screen is shown to humans as a 3D RGB rendering, but to the agent as a 2D top-down view with 27 layers, with individual layers describing specific properties of the units in the field of view. This is done both to reduce the rendering time and to allow the agent to focus on understanding the game instead of computer vision.

This is the main difference in problem formulation between our work and AlphaStar. AlphaStar used the raw interface, which gives the agent superhuman knowledge, as the agent sees all of its units the entire time (the 'camera' in AlphaStar is merely a movable rectangle on the map within which certain commands can be issued and all properties of enemy units are visible). It also gives the agent potentially superhuman capabilities, such as precisely moving units anywhere on the map without moving the 'camera' (although AlphaStar did not seem to use those capabilities).

A description of all input sources for the network can be found in Table 3.1. We extended certain inputs beyond what was available in the environment by default, in places where we felt the information available to the network did not closely resemble the information available to human players.

Input name Input shape Description
feature_screen (27, 64, 64) The agent's view of the game screen. Extended with information about the point on the screen previously selected by the agent.
feature_minimap (11, 64, 64) The agent's view of the game minimap. Extended with information about the point on the minimap previously selected by the agent.
cargo (N, 7) Units in a currently selected transport vehicle.
control_groups (10, 2) State of the 10 control groups that the player has - number of units in a control group and a type of unit that was first added to the control group.
control_groups_hint (10, 2) State of the 10 control groups that the player we are trying to imitate had in the middle of the game - added by us.
multi_select (N, 7) Units selected when selecting multiple units.
player (11) Information about various resources and scores that the player has - minerals, vespene, supply, game_score, etc.
production_queue (N, 2) Units queued for production in a currently selected production building.
single_select (N, 7) Information about the unit that is currently selected, if a single unit is selected.
game_loop (1) Game time, normalized to reach 1 after 1 hour.
available_actions (573) Actions that are currently valid.
prev_action (1) Previous action the agent did, modified to show the last meaningful action (not a no-op).
build_order (20) First 20 units and structures that were constructed by the player we are trying to imitate, added by us.
mmr (6) The MMR (Match Making Rating, the skill estimate) of the player we are trying to imitate, divided by 1000, one-hot, added by us.
Table 3.1: All of the neural network's inputs. N in a shape denotes that the input has variable length; for such inputs, the first 32 entries are considered and the rest are ignored. Descriptions noting our extensions and additions denote our modifications of the base environment.

3.2 Spatial processing

Probably the most important and complex parts of the agent's input are the spatial processing columns that use feature_screen and feature_minimap and extract information from them for action selection and selecting points on the screen. The general design follows the map processing module used in AlphaStar; however, there are some important distinctions, most notably the use of attention for relational processing of the features produced by the convolutional network, as proposed in Relational Deep Reinforcement Learning [24].

The following steps are applied to process spatial observations:

  1. Screen and minimap features may be either categorical (unit type, whether the building is under construction, etc.) or numeric (health, energy, etc.). Binary categorical features are left as-is, non-binary ones are expanded using trainable embeddings (effectively equivalent to using a one-hot representation followed by an MLP), and numeric features are scaled into the 0-1 range based on their maximum possible value.

  2. After embedding and scaling, features are processed by 3 convolutional layers with stride 2, which change the size of the feature maps from 64x64x32 to 8x8x128. Between the application of the convolutional layers intermediate values are stored as screen_bypass and minimap_bypass, to preserve precise spatial information for selecting points on the screen.

  3. 8x8 features are then processed by several residual blocks with FiLM, conditioned on the scalar_input, described in the next section.

  4. Features from screen and minimap are then flattened along width and height and joined to form a single tensor in shape 128x128, and processed by a Transformer self-attention layer.

  5. A shared LSTM is then applied to each of the items from the Transformer outputs, storing information about each pixel of the feature map separately. This is done to hopefully allow the network to remember some facts about the relatively precise state of the game map, potentially already including some results from relational reasoning performed by the Transformer layer.

  6. LSTM output is then processed by two Transformer layers, to further apply relational reasoning to the entities on the map.

  7. Processed features from the previous step are then separated and reshaped back into 8x8x128 feature maps, to be used in transposed convolutional layers for selecting points on the screen or minimap.
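The tensor shapes flowing through the steps above can be traced with a small script (shape bookkeeping only; the intermediate channel counts 64 and 96 are hypothetical, as the text only specifies the endpoints 64x64x32 and 8x8x128):

```python
# Trace the spatial-processing shapes described above: three stride-2
# convolutions, then flattening screen and minimap maps into entities.
screen = (32, 64, 64)            # embedded screen features: C x H x W

def conv_stride2(shape, out_channels):
    c, h, w = shape
    return (out_channels, h // 2, w // 2)

x = conv_stride2(screen, 64)     # hypothetical intermediate widths
x = conv_stride2(x, 96)
x = conv_stride2(x, 128)         # -> (128, 8, 8)

# Flatten screen and minimap 8x8 maps into 64 entities each and join
# them into a single list of entities for the Transformer layer:
entities = 2 * x[1] * x[2]
channels = x[0]
print((entities, channels))      # (128, 128)
```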

3.3 Non-spatial processing

Similarly to spatial features, non-spatial ones can be categorical or numeric. We use the same technique to preprocess them as with the spatial features.

For further processing, we use two types of modules - for features of constant length we use an MLP, and for features of variable length (which are in essence lists of items) we first apply an MLP to each item and then aggregate the list using MaxPool along the list dimension, to produce an output the size of a single item - such an approach was used in OpenAI Five [2]. This can be thought of as skimming over the entries and only remembering the things that stand out.
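The variable-length aggregation can be sketched as follows (illustrative NumPy; the shared MLP is stood in by a single random linear layer):

```python
import numpy as np

def embed_and_pool(items, W, b):
    # items: (n, d_in) variable-length list of entities.
    # Apply the same layer to every item, then max-pool over the
    # list dimension to get a fixed-size output.
    h = np.maximum(items @ W.T + b, 0.0)   # shared layer + ReLU
    return h.max(axis=0)                   # (d_out,)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(16, 7)), np.zeros(16)
short_list = rng.normal(size=(3, 7))     # e.g. 3 selected units
long_list = rng.normal(size=(30, 7))     # e.g. 30 selected units
# Both produce the same output size regardless of list length:
print(embed_and_pool(short_list, W, b).shape,
      embed_and_pool(long_list, W, b).shape)  # (16,) (16,)
```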

The outputs of those layers are then concatenated into a single tensor called
scalar_input. We also concatenate selected outputs - multi_select, single_select, available_actions, build_order - to produce gating_input, which is later used during action selection. Intuitively, those are the inputs that should influence the selection of the action quite significantly, so adding easier access to them should make training faster.

One other input that gets special treatment is the control_groups input - we found that controlling the control groups is both very hard for the network to learn and very influential, so we tried to improve the performance on it by adding a bypass from the control groups input to the layers that make a selection using them.

We also found it important to normalize the output vectors to have mean 0 and variance 1 before concatenating - presumably because different inputs might have different magnitudes despite the earlier normalization.

After the spatial input processing finishes, scalar_input is concatenated with the flattened output of the Transformer layers from the spatial processing and passed through a 2-layer LSTM with output size 512 to produce lstm_output.

3.4 Selecting actions

The action space has a hierarchical structure - there are 573 actions a player can take; however, each of those actions also has additional parameters. For example, the action select_screen, which selects a unit at a specific point on the screen if there is one, requires coordinates on the screen and information about whether the clicked unit should replace the current selection, be added to it, or be removed from it.

To properly model probability distributions in this complex action space, we use an autoregressive model. We first sample the action, and then sample the required parameters one by one, from distributions conditioned on the selected action and previously sampled action parameters. A similar approach was also used in AlphaStar.

In more detail, it works as follows: first, one of the 573 possible game actions is sampled from the distribution produced by an MLP from lstm_output; the sampled action is then embedded into a vector the size of lstm_output and added to it. Next, the delay is sampled from the updated lstm_output, which is then updated again in the same fashion. The process repeats until all required parameters have been sampled.
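One step of this autoregressive scheme can be sketched as a small module that samples from its logits and folds the sample back into the state (names and sizes here are illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn

class AutoregressiveHead(nn.Module):
    """Sample from logits produced by a linear layer, embed the sample
    back to the state size, and add it so later heads are conditioned
    on everything sampled so far."""
    def __init__(self, state_dim: int, n_choices: int):
        super().__init__()
        self.logits = nn.Linear(state_dim, n_choices)
        self.embed = nn.Embedding(n_choices, state_dim)

    def forward(self, state: torch.Tensor):
        dist = torch.distributions.Categorical(logits=self.logits(state))
        sample = dist.sample()                  # (batch,)
        new_state = state + self.embed(sample)  # condition later heads
        return sample, new_state

# chaining, e.g. action -> delay -> queued:
# action, s = action_head(lstm_output)
# delay,  s = delay_head(s)
# queued, s = queued_head(s)
```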

It should be noted that we only use this sampling scheme for the action, delay and queued outputs, as they are involved in most of the actions; beyond these, actions usually use only one or two other parameters.

For sampling positions on the screen, the updated lstm_output is reshaped to 8x8x16 and concatenated with the output of the spatial input processing columns (of shape 8x8x128), then passed through 3 residual blocks with FiLM [14], also conditioned on lstm_output. The result is combined with the screen_bypass and passed through transposed convolutional layers to produce 3 maps of logits of shape 64x64x1, from which the screen1, screen2 and minimap outputs are sampled. Notably, the points selected on the screen and the minimap are added to the agent's observation in the next step as an additional one-hot-encoded feature.
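A residual block with FiLM conditioning, in the spirit of [14], can be sketched as follows (channel counts, kernel sizes and activation placement are our illustrative assumptions):

```python
import torch
import torch.nn as nn

class FiLMResBlock(nn.Module):
    """Residual conv block whose feature maps are modulated by a
    per-channel scale (gamma) and shift (beta) predicted from a
    conditioning vector such as lstm_output."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(cond_dim, 2 * channels)  # gamma and beta

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = torch.relu(self.conv1(x))
        h = self.conv2(h)
        # broadcast the FiLM parameters over the spatial dimensions
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return torch.relu(x + h)
```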

During training, the probability distributions are conditioned on the actions of the player whose game we are imitating; during play, they are conditioned on the agent's own previous actions.

Most of the non-spatial outputs use simple 2-layer MLPs to select the action parameters. The main action selection, however, is done by a 4-layer MLP with residual connections, whose output is multiplied by the output of an additional MLP with sigmoid activations on the final layer that takes gating_input. This mimics the corresponding part of AlphaStar and aims to improve the quality of the chosen actions by providing the network with as much information and processing capacity as possible. Intuitively, the 4-layer stack chooses the action the network wants to take, and the gating layer allows only the actions that make sense.
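The gated action head can be sketched as below. This is a simplified stand-in (a single-layer gate and illustrative sizes), not the exact architecture:

```python
import torch
import torch.nn as nn

class GatedActionHead(nn.Module):
    """Action logits from a residual MLP stack, multiplied elementwise
    by a sigmoid gate computed from gating_input."""
    def __init__(self, state_dim: int, gate_dim: int, n_actions: int):
        super().__init__()
        self.body = nn.ModuleList(
            [nn.Linear(state_dim, state_dim) for _ in range(4)])
        self.out = nn.Linear(state_dim, n_actions)
        self.gate = nn.Sequential(nn.Linear(gate_dim, n_actions),
                                  nn.Sigmoid())

    def forward(self, state, gating_input):
        h = state
        for layer in self.body:
            h = h + torch.relu(layer(h))  # residual connections
        # the gate suppresses logits of actions that make no sense
        return self.out(h) * self.gate(gating_input)
```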

The other output that is processed differently is the set of outputs for the control groups. The control group id to act on is selected by a small attention layer that takes the control groups bypass computed earlier and lstm_output as inputs, so that the selection is focused on the contents of the control groups and independent of their order.
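A minimal version of such an attention-based selector is sketched below, assuming simple dot-product attention and illustrative dimensions (the real layer may differ):

```python
import torch
import torch.nn as nn

class ControlGroupSelector(nn.Module):
    """Pick a control-group id with dot-product attention: the query
    comes from lstm_output, the keys from the per-group bypass features,
    so the choice depends on group contents rather than slot order."""
    def __init__(self, state_dim: int, group_dim: int, key_dim: int = 64):
        super().__init__()
        self.query = nn.Linear(state_dim, key_dim)
        self.key = nn.Linear(group_dim, key_dim)

    def forward(self, state, groups):
        # state: (batch, state_dim); groups: (batch, 10, group_dim)
        q = self.query(state).unsqueeze(1)             # (batch, 1, key_dim)
        k = self.key(groups)                           # (batch, 10, key_dim)
        logits = (q * k).sum(-1) / k.shape[-1] ** 0.5  # (batch, 10)
        return logits  # softmax over the 10 control-group slots
```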

Figure 3.1: A high-level diagram of the neural network. The boxes with sharp edges denote layers, the boxes with rounded edges denote values.

4.1 Replay selection

We use SC2 version 4.9.3, as this version has over 300,000 1v1 replays available through Blizzard's API. From those replays we select the ones that meet the following criteria:

  • At least one of the players is Protoss;

  • At least one of the players is above 2500 MMR (roughly top 60% of the leaderboard);

  • The map the game is played on is 'Acropolis';

  • The game is longer than 1 minute.

This gives us a dataset of around 18000 replays.

4.2 Replay preprocessing

SC2 replays aren't stored as a screen recording but as the full list of actions all players have made plus the random seed, from which the entire game can be reconstructed. This means the game needs to be replayed to extract the observations for training. To avoid unnecessary computation, we generate the observations once and store them for later.

While 'replaying' the replays, we remove no-ops from the observations and add a 'delay' target for the network, similarly to AlphaStar. This means that during evaluation the network decides how long it will 'sleep' before executing its next action. We find this approach necessary, as otherwise the sequences of observation-action pairs would mostly consist of very long chains of no-ops. We also collapse chains of move_camera actions and store only the last position. This circumvents a limitation of the game interface: human players can move the mouse to the edge of the screen to pan the camera at a certain rate, but bots playing through PySC2 must specify exact consecutive camera coordinates, so imitating this camera movement is quite difficult for the network.
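The two preprocessing rules can be sketched in plain Python as below. The frame and action representations here are our own simplified assumptions, not the PySC2 data format:

```python
def preprocess_actions(frames):
    """Collapse no-ops into a 'delay' target and chains of consecutive
    move_camera actions into their final position. Each frame is
    (observation, action), where action is a dict like
    {"name": "move_camera", ...} or None for a no-op."""
    result = []
    for obs, action in frames:
        if action is None:  # no-op: extend the previous action's delay
            if result:
                result[-1]["delay"] += 1
            continue
        if (action["name"] == "move_camera" and result
                and result[-1]["action"]["name"] == "move_camera"
                and result[-1]["delay"] == 0):
            result[-1]["action"] = action  # keep only the last position
            continue
        result.append({"obs": obs, "action": action, "delay": 0})
    return result
```

After this pass, each stored action carries the number of removed no-op steps as its delay target, which the network learns to predict.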

At this stage, we also extract the build_order - the first 20 units and buildings that the imitated player built - to condition the network on it. This step was also done for AlphaStar.

The preprocessing takes around 2 days on a 16-thread CPU (and is mostly CPU-bound), and the resulting dataset can be compressed to around 100 GB.

4.3 Training

4.3.1 Supervised learning

We train our neural network to minimize the cross-entropy loss between the actions taken by the player and the probability distribution predicted by the network. We use the Adam optimizer with a learning rate of 0.001.

Depending on memory constraints, the network can be trained with batches of size 32 or 16, and our implementation can accumulate gradients over many batches, so that large batch sizes can be emulated with smaller true batch sizes for improved training stability on machines with little GPU memory. We use a sequence length of 32 for training the recurrent parts of the network.
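Gradient accumulation amounts to averaging gradients over several small batches before a single optimizer update, roughly as follows (a generic sketch, not our training loop):

```python
import torch

def accumulation_step(model, optimizer, loss_fn, micro_batches):
    """Emulate a large batch: accumulate scaled gradients over several
    small batches, then perform one optimizer update."""
    optimizer.zero_grad()
    n = len(micro_batches)
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets)
        (loss / n).backward()  # gradients sum up across .backward() calls
    optimizer.step()
```

Dividing each loss by the number of micro-batches makes the accumulated gradient equal to the gradient of the average loss over the emulated large batch.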

During supervised learning, we process the entire dataset around 7 times. As the loss plot in Figure 4.1 shows, this is nowhere near overfitting the dataset, which suggests that with larger computational resources the network could be trained for much longer, for potentially significantly higher performance.

Figure 4.1: The average loss per sample during supervised training.

We have also discovered that fine-tuning the model with an effective batch size of 512 after training with batch size 32 improves performance significantly, especially when combined with selecting well-performing checkpoints by automatically running a small number of games against the built-in AIs.

4.3.2 Reinforcement learning

We collect trajectories of length 32 from 24 environments, which send their observations to a single thread that runs the network on batches of them. Each time 512 trajectories have been collected, an optimization step is performed. We emulate a batch size of 512 by accumulating gradients over several iterations with batch size 16 and applying a single optimizer step on the accumulated gradients.

As mentioned earlier, we use Proximal Policy Optimization [17], as it is easy to implement and known for its stability. For value estimation, we add a single MLP layer to the network that takes the lstm_output as an input and predicts the value before the action is chosen.

During reinforcement learning, we process around 500000 trajectories or 16M observation-action pairs. This is relatively low for reinforcement learning, especially with such a large action space, but due to computational limitations it is not possible to process many more iterations.

We train the agents against the built-in bots. The reward that the agent receives comes from several factors:

  • Game result: 1 for the win, 0 for the draw, -1 for the loss;

  • Resource cost of enemy units killed multiplied by 0.00003;

  • Resource cost of enemy buildings destroyed multiplied by 0.0001;

  • Collected resources - 0.00001 for each unit of minerals collected and 0.00003 for each unit of vespene collected.

The coefficients for the reward components were chosen so that the total reward from each source is of the same order of magnitude, while keeping the game result as the dominating factor. It should also be noted that we use a separate value estimator output for each reward component and sum their outputs to form the total value prediction.
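The per-component value estimation can be sketched as one small head per reward source whose predictions are summed (the coefficients below are the ones listed above; the class and head names are illustrative):

```python
import torch
import torch.nn as nn

# reward coefficients from the text, per unit of each quantity
COEFS = {"win": 1.0, "kill_units": 3e-5, "kill_buildings": 1e-4,
         "minerals": 1e-5, "vespene": 3e-5}

class MultiHeadValue(nn.Module):
    """One value head per reward component; the total value estimate
    is the sum of the per-component predictions."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(state_dim, 1) for name in COEFS})

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        parts = [head(state) for head in self.heads.values()]
        return torch.stack(parts, dim=0).sum(dim=0).squeeze(-1)
```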

To further stabilize the training, we experimented with 3 different approaches:

  1. Training the network using both reinforcement learning and supervised learning updates at the same time, so that it both improves through reinforcement learning and keeps correctly predicting player actions when observing a human replay.

  2. Applying a KL divergence penalty between the trained model and the original supervised model, with the supervised model's predictions computed on the same batch as the currently trained model, starting from the LSTM state stored at the beginning of the sequence. This means the supervised model experiences LSTM states generated by the new model, which may shift significantly during training.

  3. Applying a KL divergence penalty between the trained model and the original supervised model, with the supervised model's predictions computed during trajectory generation (running the supervised network and storing its output each time the RL network runs). That way, the supervised network's LSTM state is processed exactly as it would be if that network were actually playing the game.
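In all the KL-based variants, the penalty term itself has the same shape: the divergence from the frozen supervised policy to the trained policy, added to the PPO loss. A sketch (the direction of the KL and the coefficient handling are our assumptions):

```python
import torch
import torch.nn.functional as F

def kl_penalty(trained_logits, supervised_logits, coef):
    """KL(supervised || trained) penalty keeping the RL policy close
    to the frozen supervised policy."""
    p_log = F.log_softmax(supervised_logits, dim=-1)
    q_log = F.log_softmax(trained_logits, dim=-1)
    kl = (p_log.exp() * (p_log - q_log)).sum(dim=-1)
    return coef * kl.mean()
```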

Approach 1 used …, while approaches 2 and 3 used ….

We have discovered that approach 1 was the least stable, and it only trained properly against the Easy AIs. However, presumably because of the lack of a KL divergence penalty, it developed the most unusual strategy of the three.

Figure 4.2: Win rate in the last 100 finished games (against the Easy Zerg AI) during reinforcement learning in approach 1.

Approaches 2 and 3 were successfully trained against a mix of Easy and Medium AIs. Approach 3 appeared to be the most consistent and achieved the highest performance against the mix, with the least significant performance drops during training.

Figure 4.3: Win rate in the last 100 finished games (against the mix of Easy and Medium AIs) during reinforcement learning in approach 2.
Figure 4.4: Win rate in the last 100 finished games (against the mix of Easy and Medium AIs) during reinforcement learning in approach 3.

5.1 Performance

To measure the performance of the agent, we run multiple games against the built-in scripted bots. We run experiments against the Very Easy, Easy, Medium and Hard AIs playing all three races, with our agent always playing Protoss.

For computational reasons, games are limited to 60 in-game minutes; if a game does not finish in that time, we count it as a loss for our agent. We likewise count true draws as losses.

Bot Race \ Difficulty   Very Easy   Easy   Medium
Protoss                        49     25        3
Terran                         55     27        3
Zerg                           69     22        7
Table 5.1: Number of wins of the supervised agent (out of 100 games) against various in-game bots - the model saved after days of training.
Bot Race \ Difficulty   Very Easy   Easy   Medium   Hard
Protoss                        88     63       18      0
Terran                         91     61       17      0
Zerg                           95     66       25      5
Table 5.2: Number of wins of the supervised learning agent fine-tuned with an effective batch size of 512, with the best model selected automatically by running 20 games against the Easy AI and 20 against the Medium AI during training.

For the RL models, we have analyzed the performance of the 3 regularization approaches. All 3 approaches significantly improved the performance of the network trained with supervised learning.

For approach 1, the best result was obtained when the agent was trained solely against the Easy Zerg AI and discovered a very simple strategy: building several Gateways (basic Protoss infantry production buildings), producing around ten Zealots (basic Protoss offensive units) relatively quickly, and attacking the enemy with small waves of around 10 units each as fast as possible. Such strategies are usually called rushes.

The networks in approaches 2 and 3 were trained against a mix of Easy and Medium AIs, and their strategies appear quite similar to each other: both start the game with a large number of Gateways, but also build a significant number of workers and bases, as well as Stargates, which produce flying units. They also build quite a large number of defensive buildings in their bases, especially when attacked.

Bot Race \ Difficulty   Very Easy   Easy   Medium   Hard
Protoss                       100     91       21      3
Terran                        100     99       83     17
Zerg                          100     96       47     24
Table 5.3: Number of wins of the reinforcement learning agent from approach 1. The strategy it discovered appears to be least effective in mirror matchups and very effective specifically against Terrans - we suspect this is because Terrans have no melee units, and the AI is not capable of controlling its ranged units for maximum effectiveness.
Bot Race \ Difficulty   Very Easy   Easy   Medium   Hard
Protoss                        99     93       39      4
Terran                         99     98       68     13
Zerg                          100     95       61     11
Table 5.4: Number of wins of the reinforcement learning agent from approach 2. Mirror matchups still remain the hardest for this network, but the differences in effectiveness against the different races appear smaller.
Bot Race \ Difficulty   Very Easy   Easy   Medium   Hard
Protoss                       100     98       54      1
Terran                         98     98       72      6
Zerg                          100     98       66      5
Table 5.5: Number of wins of the reinforcement learning agent from approach 3. The performance against the Medium AI has improved, especially in mirror matchups, but the network did not perform as well against the Hard AI as the previous ones.

5.2 Conclusion

We have presented a neural network architecture and a training pipeline that can produce agents for playing StarCraft II using limited computational resources. Unlike other small-scale approaches for playing the full game, we use a single neural network for the entirety of decision-making.

By evaluating our agents in a large number of two-player games, we have shown that our approach is capable of training agents that can compete with some of the built-in bots. Notably, all 3 RL experiments resulted in significant improvement over the original supervised agents.

Some of our agents were also surprisingly effective against more difficult bots after training only against a fixed, easy opponent: an agent trained only against the Easy Zerg AI developed a strategy that was moderately effective even against the Hard AI.

5.3 Future works

Even with our relatively small computational requirements, our agents still take multiple days to train, which significantly limited the number of experiments we were able to perform. In the future, we hope to explore the following topics:

  • Removing the limitation of playing the game on a single map;

  • Exploring different neural network architectures for faster training and better performance;

  • Using newer reinforcement learning algorithms;

  • Introducing learning through self-play and a league of agents as proposed in AlphaStar.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer Normalization. External Links: 1607.06450 Cited by: §2.1.3.
  • [2] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang (2019) Dota 2 with Large Scale Deep Reinforcement Learning. CoRR abs/1912.06680. External Links: Link, 1912.06680 Cited by: §1.1, §3.3.
  • [3] D. Clevert, T. Unterthiner, and S. Hochreiter (2016) Fast and accurate deep network learning by exponential linear units (elus). External Links: 1511.07289 Cited by: §2.1.1.
  • [4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. External Links: 2010.11929 Cited by: §2.1.6.
  • [5] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018) IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. External Links: 1802.01561 Cited by: §2.2.
  • [6] L. Han, J. Xiong, P. Sun, X. Sun, M. Fang, Q. Guo, Q. Chen, T. Shi, H. Yu, X. Wu, and Z. Zhang (2021) TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game. External Links: 2011.13729 Cited by: §1.3.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §2.1.3.
  • [8] S. Hochreiter and J. Schmidhuber (1997-11) Long Short-Term Memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §2.1.5.
  • [9] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. External Links: 1502.03167 Cited by: §2.1.3.
  • [10] D. Lee, H. Tang, J. O. Zhang, H. Xu, T. Darrell, and P. Abbeel (2018) Modular Architecture for StarCraft II with Deep Reinforcement Learning. External Links: 1811.03555 Cited by: §1.3.
  • [11] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. External Links: 1602.01783 Cited by: §2.2.
  • [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with Deep Reinforcement Learning. External Links: 1312.5602 Cited by: §1.1.
  • [13] Z. Pang, R. Liu, Z. Meng, Y. Zhang, Y. Yu, and T. Lu (2019) On Reinforcement Learning for Full-length Game of StarCraft. External Links: 1809.09095 Cited by: §1.3.
  • [14] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville (2017) FiLM: Visual Reasoning with a General Conditioning Layer. CoRR abs/1709.07871. External Links: Link, 1709.07871 Cited by: §2.1.4, §3.4.
  • [15] C. Scheller (2021) StarCraft II Imitation Learning. GitHub repository. Cited by: §1.3, §1.4.
  • [16] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2017) Trust Region Policy Optimization. External Links: 1502.05477 Cited by: §2.2.
  • [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal Policy Optimization Algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §2.2, §4.3.2.
  • [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503. External Links: Link Cited by: §1.1.
  • [19] P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu, Y. Liu, H. Liu, and T. Zhang (2018) TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game. CoRR abs/1809.07193. External Links: Link, 1809.07193 Cited by: §1.3.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. External Links: 1706.03762 Cited by: §2.1.6.
  • [21] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019-11-01) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. External Links: ISSN 1476-4687, Document, Link Cited by: §1.1, §1.3.
  • [22] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft II: A New Challenge for Reinforcement Learning. External Links: 1708.04782 Cited by: §1.2, §1.3.
  • [23] X. Wang, J. Song, P. Qi, P. Peng, Z. Tang, W. Zhang, W. Li, X. Pi, J. He, C. Gao, H. Long, and Q. Yuan (2020) SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II. CoRR abs/2012.13169. External Links: Link, 2012.13169 Cited by: §1.3.
  • [24] V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert, T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. W. Battaglia (2018) Relational Deep Reinforcement Learning. CoRR abs/1806.01830. External Links: Link, 1806.01830 Cited by: §1.4, §2.1.6, §3.2.