MSC: A Dataset for Macro-Management in StarCraft II

by   Huikai Wu, et al.

Macro-management is an important problem in StarCraft that has been studied for a long time. Various datasets and assorted methods have been proposed in the last few years. However, these datasets have shortcomings that hold back academic and industrial research: 1) Some datasets have neither standard preprocessing, parsing and feature extraction procedures nor predefined training, validation and test sets. 2) Some datasets are specified only for certain tasks in macro-management. 3) Some datasets are either too small or do not have enough labeled data for modern machine learning algorithms such as deep neural networks. As a result, most previous methods are trained with different features and evaluated on different test sets from the same or different datasets, which makes them difficult to compare directly. To boost the research of macro-management in StarCraft, we release a new dataset, MSC, based on the platform SC2LE. MSC consists of well-designed feature vectors, pre-defined high-level actions and the final result of each match. We also split MSC into training, validation and test sets for the convenience of evaluation and comparison. Besides the dataset, we propose a baseline model and present initial baseline results for global state evaluation and build order prediction, which are two of the key tasks in macro-management. Various downstream tasks and analyses of the dataset are also described for the sake of research on macro-management in StarCraft II. Homepage:



1 Introduction

Figure 1: Framework Overview of MSC. Replays are first filtered according to pre-defined criteria and then parsed with PySC2. The states in parsed replays are sampled and turned into N-dimensional vectors. The final files, which contain feature-action pairs and the final results, are split into training, validation and test sets.

Deep learning has surpassed the previous state of the art in playing Atari games [Mnih et al.2015], the classic board game Go [Silver et al.2016] and the 3D first-person shooter game Doom [Lample and Chaplot2017]. But it remains a challenge to play real-time strategy (RTS) games like StarCraft II with deep learning algorithms [Vinyals et al.2017]. Such games usually have enormous state and action spaces compared to Atari games and Doom. Furthermore, the game state in RTS games is usually only partially observed, in contrast to Go.

A recent experiment has shown that it is difficult to train a deep neural network (DNN) end-to-end for playing StarCraft II. [Vinyals et al.2017] introduce a new platform, SC2LE, for StarCraft II and train a DNN with Asynchronous Advantage Actor Critic (A3C) [Mnih et al.2016]. Unsurprisingly, the agent trained with A3C could not win a single game even against the easiest built-in AI. Based on this experiment and the progress made in StarCraft I on tasks such as micro-management [Peng et al.2017], build order prediction [Justesen and Risi2017b] and global state evaluation [Erickson and Buro2014], we believe that treating StarCraft II as a hierarchical learning problem and breaking it down into micro-management and macro-management is a feasible way to boost the performance of current AI bots.

Micro-management includes all low-level tasks related to unit control, such as collecting mineral shards and fighting against enemy units, while macro-management refers to the higher-level game strategy the player is following, which involves tasks such as build order prediction and global state evaluation. Near-human performance in micro-management can be obtained easily with deep reinforcement learning algorithms such as A3C [Vinyals et al.2017], whereas macro-management remains hard to solve at present, though a lot of effort has been made by the StarCraft community [Churchill and Buro2011, Synnaeve, Bessiere, and others2011, Erickson and Buro2014, Justesen and Risi2017b]. One promising way for macro-management is to gain experience from professional human players with machine learning methods. [Erickson and Buro2014] learn to evaluate the global state from replays, while [Justesen and Risi2017b] utilize a DNN for build order prediction. Both methods learn from replays, which are official log files used to record the entire game status when playing StarCraft.

Many datasets have been released in StarCraft I for learning macro-management from replays [Weber and Mateas2009, Cho, Kim, and Cho2013, Erickson and Buro2014, Justesen and Risi2017b]. But these datasets are designed for specific tasks in macro-management and do not come with pre-divided training, validation and test sets. Besides, the datasets in [Cho, Kim, and Cho2013, Erickson and Buro2014] only contain about 500 replays, which is too small for modern machine learning algorithms. StarData [Lin et al.2017] is the largest dataset in StarCraft I, containing 65646 replays. But only a few of its replays contain the final results, which makes it unsuitable for many tasks in macro-management such as global state evaluation. SC2LE [Vinyals et al.2017] contains the largest dataset in StarCraft II, which has 800K replays and is suitable for various tasks in macro-management. However, it has neither a standard processing procedure nor pre-defined training, validation and test sets. Besides, it is designed for end-to-end human-like control of StarCraft II, so it is not easy to use for tasks in macro-management.

To take the research of learning macro-management from replays a step further, we build a new dataset, MSC, based on SC2LE. It is the biggest dataset dedicated to macro-management in StarCraft II and can be used for assorted tasks like build order prediction and global state evaluation. MSC is based on SC2LE for three reasons: 1) SC2LE contains the largest replay dataset. 2) SC2LE is supported officially and updated frequently. 3) The replays in SC2LE are of higher quality and in a more standard format. We define a standard procedure for processing replays from SC2LE, as shown in Figure 1. After processing, our dataset consists of well-designed feature vectors, a pre-defined action space and the final result of each match. All processed files are divided into training, validation and test sets. Based on MSC, we train baseline models and present the initial baseline results for global state evaluation and build order prediction, which are two of the key tasks in macro-management. For the sake of research on other tasks, we also show some statistics of MSC and list some downstream tasks suitable for it. Our main contributions are twofold and are summarized as follows:

  • We build a new dataset MSC for macro-management on StarCraft II, which contains standard preprocessing, parsing and feature extraction procedure. The dataset is divided into training, validation and test set for the convenience of evaluation and comparison between different methods.

  • We propose baseline models together with initial baseline results for two of the key tasks in macro-management i.e. global state evaluation and build order prediction.

2 Related Work

We briefly review the related works of macro-management in StarCraft. We also compare our dataset with several released datasets which are suitable for macro-management.

2.1 Macro-Management in StarCraft

We briefly introduce some background for StarCraft I and StarCraft II, and then review several related works focusing on various tasks in macro-management.


StarCraft I is an RTS game released by Blizzard in 1998. In the game, each player controls one of three races (Terran, Protoss or Zerg) to simulate strategic military combat. The goal is to gather resources, build buildings, train units, research techniques and, finally, destroy all enemy units and buildings. During play, the areas which are unoccupied by friendly units and buildings are unobservable due to the fog of war, which makes the game more challenging. The players must not only control each unit accurately and efficiently but also make strategic plans given the current situation and assumptions about enemies. StarCraft II is the next generation of StarCraft I, which is better designed and played by most StarCraft players. In both StarCraft I and StarCraft II, build refers to the union of units, buildings and techniques. Order and action are used interchangeably to mean the controls for the game. Replays are used to record the sequence of game states and actions during a match, which can be watched afterwards from the view of enemies, friendlies or both. There are usually two or more players in a match, but we focus on matches that have only two players, denoted as enemy and friendly.


In the StarCraft community, all tasks related to unit control are called micro-management, while macro-management refers to the high-level game strategy the player is following. Global state evaluation is one of the key tasks in macro-management, which focuses on predicting the probability of winning given the current state [Erickson and Buro2014, Stanescu et al.2016, Ravari, Bakkes, and Spronck2016, Sánchez-Ruiz and Miranda2017]. Build order prediction is used to predict what to train, build or research in the next step given the current state [Hsieh and Sun2008, Churchill and Buro2011, Synnaeve, Bessiere, and others2011, Justesen and Risi2017b]. [Churchill and Buro2011] applied tree search for build order planning with a goal-based approach. [Synnaeve, Bessiere, and others2011] learned a Bayesian model from replays, while [Justesen and Risi2017b] exploited a DNN. Opening strategy prediction is a subset of build order prediction, which aims at predicting the build order in the initial stage of a match [Köstler and Gmeiner2013, Blackford and Lamont2014, Justesen and Risi2017a]. [Dereszynski et al.2011] work on predicting the state of the enemy, while [Cho, Kim, and Cho2013] try to predict the enemy build order.

2.2 Datasets for Macro-Management in StarCraft

There are various datasets for macro-management, which can be subdivided into two groups. The datasets in the first group usually focus on specific tasks in macro-management, while the datasets in the second group can be applied generally to assorted tasks.

Task-Oriented Datasets

The dataset in [Weber and Mateas2009] is designed for opening strategy prediction. It has 5493 replays of matches between all races, while our dataset contains 5543 replays just for Terran versus Terran matches. [Cho, Kim, and Cho2013] learn to predict build order with a small dataset including 570 replays in total. [Erickson and Buro2014] designed a procedure for preprocessing and feature extraction among 400 replays. However, these two datasets are both too small and have not been released. [Justesen and Risi2017b] also focus on build order prediction and build a dataset containing 7649 replays, but without pre-defined training, validation and test sets. Compared to these datasets, our dataset is more general and much larger, in addition to providing a standard processing procedure and dataset division.

General-Purpose Datasets

The dataset proposed in [Synnaeve, Bessiere, and others2012] is widely used in various tasks of macro-management. It has 7649 replays in total, but few of them include the final result of a match. It also lacks a standard feature definition, compared to our dataset. StarData [Lin et al.2017] is the biggest dataset in StarCraft I, containing 65646 replays. However, it is not suitable for tasks that require the final result of a match, because few of its replays carry the result label. [Vinyals et al.2017] proposed a new and large dataset in StarCraft II containing 800K replays. We transform it into our dataset for macro-management with a standard processing procedure, well-designed feature vectors, a pre-defined high-level action space and a division into training, validation and test sets.

3 Dataset

Macro-management in StarCraft has been researched for a long time, but there is no standard dataset available for evaluating various algorithms. Current research on macro-management usually needs to collect replays first, and then parse them and extract hand-designed features, with the consequence that there are neither unified datasets nor consistent features. As a result, nearly all the algorithms in macro-management cannot be compared with each other directly.

We try to build a standard dataset, MSC, which is dedicated to macro-management in StarCraft II, with the hope that it can serve as the benchmark for evaluating assorted algorithms in macro-management. MSC is built upon SC2LE, which contains 800K replays in total [Vinyals et al.2017]. However, only 64396 replays in SC2LE have currently been released by Blizzard Entertainment. To build our dataset, we design a standard procedure for processing the 64396 replays, as shown in Figure 1. We first preprocess the replays to ensure their quality. We then parse the replays using PySC2. We subsequently sample and extract feature vectors from the parsed replays and then divide them into training, validation and test sets. In this section, we take Terran versus Terran matches as an example and introduce the details of these three steps together with some statistics and downstream tasks of MSC.

3.1 Preprocessing

There are more than 6K replays of Terran versus Terran matches in SC2LE. To ensure the quality of the replays in our dataset, we drop all replays that do not satisfy the following criteria:

  • Total frames of a match must be greater than 10000.

  • The APM (Actions Per Minute) of both players must be higher than 10.

  • The MMR (Matchmaking Rating) of both players must be higher than 1000.

A low APM means that the player is just standing around, while a low MMR indicates a corrupt replay or a weak player.
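As a minimal sketch of how the three criteria above could be applied, the following Python snippet filters a list of replay summaries. The `ReplayInfo` container and its field names are assumptions made for illustration; they are not part of PySC2 or of the released MSC code.

```python
from dataclasses import dataclass


@dataclass
class ReplayInfo:
    """Hypothetical per-replay summary; the field names are assumptions."""
    total_frames: int
    apm: tuple  # (APM of player 1, APM of player 2)
    mmr: tuple  # (MMR of player 1, MMR of player 2)


# Thresholds taken from the three criteria in the text.
MIN_FRAMES = 10000
MIN_APM = 10
MIN_MMR = 1000


def keep_replay(info: ReplayInfo) -> bool:
    """Return True only if the replay satisfies all three criteria."""
    return (
        info.total_frames > MIN_FRAMES
        and all(a > MIN_APM for a in info.apm)
        and all(m > MIN_MMR for m in info.mmr)
    )


replays = [
    ReplayInfo(25000, (120, 95), (4100, 3900)),  # satisfies every criterion
    ReplayInfo(8000, (120, 95), (4100, 3900)),   # match is too short
    ReplayInfo(25000, (5, 95), (4100, 3900)),    # one player is idle
    ReplayInfo(25000, (120, 95), (4100, 0)),     # MMR too low or missing
]
kept = [r for r in replays if keep_replay(r)]
print(len(kept))
```

All comparisons are strict because the criteria are stated as "greater than" and "higher than".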

After applying these criteria, we obtain 4897 high-quality replays. Figure 2 shows the densities of APM and MMR among all 4897 replays. Most players have an APM of around 100 and an MMR of roughly 4000. Interestingly, the densities of APM and MMR for winners and losers have similar distributions, which suggests that APM and MMR are not the key factors in winning a match.

Figure 2: Density plots of APM and MMR among all the preprocessed replays. For APM and MMR, we also plot the densities from both the winners' view and the losers' view. Surprisingly, there seems to be no strong connection between APM, MMR and winning. Best viewed in color.

3.2 Parsing Replays

Build Order Space

We define a high-level action space $\mathcal{A}$, which consists of four groups: Build a building, Train a unit, Research a technique and Morph (Update) a building. (Actions that Cancel, Halt and Stop certain actions from $\mathcal{A}$ are also included for completeness.) We also define an extra action $a_\emptyset$, which means doing nothing. Together, $\mathcal{A}$ and $a_\emptyset$ constitute the entire build order space.

Observation Definition

Each observation we extract includes (1) buildings, units and techniques owned by the player, (2) resources used and owned by the player and (3) enemy units and buildings which are observed by the player.

Parsing Process

The preprocessed replays are parsed using Algorithm 1 with PySC2, which is a Python API designed for reading replays in StarCraft II. When parsing replays, we extract an observation $o_t$ of the current state and an action set $A_t$ every $n$ frames, where $A_t$ contains all actions since $o_{t-1}$. The first action in $A_t$ that belongs to the high-level action space $\mathcal{A}$ is set to be the target build order for observation $o_{t-1}$. If there is no action belonging to $\mathcal{A}$, we take the do-nothing action $a_\emptyset$ as the target. When reaching the end of a replay, we save all (observation, action) pairs and the final result of the match into the corresponding local file. $n$ is set to 8 in our experiments, because in most cases there is at most one action belonging to $\mathcal{A}$ every 8 frames.
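The labelling rule described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the released parser: a replay is modelled as a list of `(observation, actions)` tuples, one per frame, rather than being read through PySC2, and the names `label_observations` and `NO_OP` are hypothetical. Actions issued on a sampled frame are treated as happening after that frame's observation, and the final result of the match is left out.

```python
NO_OP = "no_op"  # stands in for the "do nothing" action


def label_observations(frames, build_actions, skip=8):
    """Pair each sampled observation with its target build order.

    frames: list of (observation, actions_issued_on_that_frame) tuples.
    build_actions: set of high-level actions (build, train, research, morph).
    skip: sampling interval in frames (8 in the text).
    """
    states = []
    previous_obs = None
    pending = []  # actions conducted since previous_obs was taken
    last = len(frames) - 1
    for index, (obs, actions) in enumerate(frames):
        if index % skip == 0 or index == last:
            if previous_obs is not None:
                # First high-level action since the previous observation,
                # or the no-op action if there was none.
                target = next((a for a in pending if a in build_actions), NO_OP)
                states.append((previous_obs, target))
            previous_obs, pending = obs, []
        pending.extend(actions)
    return states


# Toy replay with 17 frames; observations are sampled at frames 0, 8 and 16.
frames = [(f"obs{i}", []) for i in range(17)]
frames[3] = ("obs3", ["move", "train_scv"])
frames[10] = ("obs10", ["attack"])
states = label_observations(frames, build_actions={"train_scv"})
print(states)
```

In the toy replay, the observation at frame 0 is labelled with the training action issued at frame 3, while the observation at frame 8 falls back to the no-op action because only a low-level action follows it.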

3.3 Sampling and Extracting Features

As shown in Figure 3, the number of do-nothing actions $a_\emptyset$ is much larger than the total number of high-level actions in $\mathcal{A}$. Thus, we sample the (observation, action) pairs in the parsed files to balance the number of these two kinds of actions, and then extract features from them, as shown in Algorithm 2. The sampling interval $N$ is set to 12, because it is a reasonable choice for balancing the two kinds of actions, as shown in Figure 3. The feature we extract is a vector with all values normalized into the interval $[0, 1]$. The entire feature vector consists of a few sub-vectors, described here in order:

  1. frame id.

  2. the resources collected and used by the player.

  3. the alerts received by the player.

  4. the upgrades applied by the player.

  5. the techniques researched by the player.

  6. the units and buildings owned by the player.

  7. the enemy units and buildings observed by the player.

Once features are extracted, we split our dataset into training, validation and test sets in the ratio 7:1:2. The ratio between winners and losers is kept at 1:1 in all three sets. The statistics for all replays are shown in Table 1.
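The sampling rule and the 7:1:2 split can be sketched as below. Several things here are assumptions for illustration: feature extraction is omitted, the split is done per replay with a seeded shuffle, and the function names are hypothetical. How the released dataset enforces the 1:1 winner-to-loser ratio is not shown.

```python
import random

NO_OP = "no_op"  # stands in for the "do nothing" action


def sample_pairs(pairs, n=12):
    """Keep every n-th pair plus every pair whose action is not the no-op."""
    return [
        (obs, act)
        for index, (obs, act) in enumerate(pairs)
        if index % n == 0 or act != NO_OP
    ]


def split_replays(replay_ids, ratios=(7, 1, 2), seed=0):
    """Shuffle replay ids and split them by the given ratios (7:1:2 here)."""
    ids = list(replay_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return (
        ids[:n_train],
        ids[n_train:n_train + n_val],
        ids[n_train + n_val:],
    )


# 24 parsed pairs, all no-ops except a single build action at index 5.
pairs = [(i, NO_OP) for i in range(24)]
pairs[5] = (5, "build_barracks")
sampled = sample_pairs(pairs)

train, val, test = split_replays(range(100))
print(len(sampled), len(train), len(val), len(test))
```

With 24 pairs and an interval of 12, only indices 0 and 12 survive among the no-ops, and the single build action is always kept.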

Matchup    TvT    TvP    TvZ    PvP    PvZ    ZvZ
#Replays   4897   7894   9996   4334   6509   2989
Table 1: The number of replays for each matchup after applying our pipeline.
Figure 3: Ratio between the number of a certain kind of build orders and the number of all actions in a parsed replay. One group of plots comes from the parsed replays in Section 3.2, while the other comes from Section 3.3 with the sampling interval equal to 12. Best viewed in color.
Algorithm 1 Replay Parser
Global: List states = []
Global: Observation previousObservation = None
while True do
    Observation currentObservation ← observation of current frame
    List actions ← actions conducted since previousObservation
    Action action = $a_\emptyset$    // default: do nothing
    for a in actions do
        if a ∈ $\mathcal{A}$ then    // a is a high-level action
            action = a
            break
        end if
    end for
    states.append((previousObservation, action))
    previousObservation ← currentObservation
    if reach the end of the replay then
        Result result ← result of this match (win or lose)
        return (result, states)
    end if
    Skip $n$ frames
end while

3.4 Downstream Tasks

Our dataset MSC is designed for macro-management in StarCraft II. We will list some tasks of macro-management that could benefit from our dataset in this section.

Game Statistics

One use of MSC is to analyze the behavior patterns of players when playing StarCraft, such as the statistics of winners' opening strategies. We collect all the builds that winners trained or built in the first 20 steps and show them in Figure 4. We can see that SCV is trained more often than any other build during the entire 20 steps, especially in the first 5 steps, while Marine is trained more and more often after the first 10 steps. Other possible analyses include the usage of gases and minerals, the relationship between winning and the usage of supply, etc.
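A statistic of this kind can be computed with a few lines of Python. The sketch below assumes each winner's opening is available as a list of build names with do-nothing steps already removed; this input format and the function name are assumptions, not the layout of the released files.

```python
from collections import Counter


def opening_frequencies(winner_build_orders, steps=20):
    """For each of the first `steps` positions, the share of each build."""
    counts = [Counter() for _ in range(steps)]
    for order in winner_build_orders:
        for step, build in enumerate(order[:steps]):
            counts[step][build] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append(
            {build: n / total for build, n in counter.items()} if total else {}
        )
    return freqs


# Two toy openings; real openings would contain up to 20 steps each.
orders = [
    ["SCV", "SCV", "Marine"],
    ["SCV", "SupplyDepot", "Marine"],
]
freqs = opening_frequencies(orders)
print(freqs[:3])
```

Plotting `freqs[step][build]` against `step` for a handful of builds would give curves of the kind described for Figure 4.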

Sequence Modeling

Each replay is a time sequence containing states (feature vectors), actions and the final result. One possible task for MSC is sequence modeling. As shown in Figure 5, the replays in MSC usually have 100-300 states, which could be used for testing sequence models like LSTM [Hochreiter and Schmidhuber1997] and NTM [Graves, Wayne, and Danihelka2014]. As for tasks in StarCraft II, MSC could be used for build order prediction [Justesen and Risi2017b], global state evaluation [Erickson and Buro2014] and forward model learning.

Uncertainty Modeling

Due to "the fog of war", the player in StarCraft II can only observe friendly builds and part of the enemy builds, which increases the uncertainty of making decisions. As shown in Figure 6, it is hard to observe enemy builds at the beginning of the game. Though the ratio of observed enemy builds increases as the game progresses, we still know nothing about more than half of the enemy builds. This makes our dataset suitable for evaluating generative models such as variational autoencoders [Kingma and Welling2013]. Some macro-management tasks in StarCraft, such as enemy future build prediction [Dereszynski et al.2011] or enemy state prediction, can also benefit from MSC.

Algorithm 2 Sample and Extract Features
Input: List observationsActions
Global: results = []
for (index, observation, action) in observationsActions do
    if MOD(index, $N$) is 0 or action is not $a_\emptyset$ then
        results.append((ExtractFeature(observation), action))
    end if
end for
return results

Learning from Unbalanced Dataset

Though we sample our dataset as described in Section 3.3, the do-nothing action $a_\emptyset$ still dominates the high-level actions in $\mathcal{A}$. As shown in Figure 3, $a_\emptyset$ accounts for more than 50% of all actions. One way to ease the problem is to sample the dataset further. However, this is not a practicable option, because if we decreased the number of $a_\emptyset$ actions to a comparable level, we could not learn an accurate model for deciding whether or not to train a build in the current state. Thus, learning how to pick out useful actions among an enormous number of useless ones is a challenge that urgently needs to be solved. Our dataset MSC is a good choice for testing such algorithms.

Reinforcement Learning

Sequences in our dataset MSC are usually more than 100 steps long, with only the final 0-1 result as the reward. It is useful to learn a reward function for every state through inverse reinforcement learning (IRL) [Abbeel and Ng2004], so that the AI bots can control the game more accurately. Besides IRL, MSC can also be used for learning to play StarCraft from the demonstrations of human players, since we have both the states and the actions that human players conducted. This task is called imitation learning [Argall et al.2009], which is one of the major tasks in reinforcement learning.

Planning and Tree Search

Games with long time horizons and sparse rewards usually benefit a lot from planning and tree search algorithms. The most successful application is AlphaGo [Silver et al.2016], which uses Monte Carlo tree search [Coulom2006, Kocsis and Szepesvári2006] to boost its performance. MSC is a high-level abstraction of StarCraft II, which can be viewed as a planning problem. Once a good forward model and an accurate global state evaluator are learned, MSC is the right dataset for testing various planning algorithms and tree search methods.

Figure 4: Opening Strategy of the Winners. The 6 lines show the probabilities of training a certain unit in the first 20 steps. Best viewed in color.
Figure 5: The number of states in each replay file after sampling and extracting features.
Figure 6: Density of Partially Observed Enemy Units. X-axis represents the progress of the game while Y-axis is the ratio between the number of partially observed enemy units and total enemy units. Best viewed in color.

4 Baselines for Global State Evaluation

MSC is a general-purpose dataset for macro-management in StarCraft II, which could be used for various high-level tasks as shown in Section 3.4. We present the baseline model and initial baseline results for global state evaluation in this paper, and leave baselines of other tasks as our future work. This section is organized as follows: We first define the task of global state evaluation formally, and then propose a baseline model for this task. Finally, we present the experiment results of our baseline model.

4.1 Definition

When human players play StarCraft II, they usually have a sense of whether they will win or lose in the current state. Such a sense is essential for deciding what to train or build in the following steps. For AI bots, it is also desirable to be able to predict the probability of winning in a certain state. Such an ability is called global state evaluation in the StarCraft community. Formally, global state evaluation is predicting the probability of winning given the current state at time step $t$, i.e. predicting the value of $P(R = \mathrm{win} \mid x_t)$. Here $x_t$ is the state at time step $t$, while $R$ is the final result. Usually, $x_t$ cannot be accessed directly; what we obtain is the observation of $x_t$, denoted as $o_t$. Thus, we use $o_1, o_2, \ldots, o_t$ to represent $x_t$ and try to learn a model for predicting $P(R = \mathrm{win} \mid o_1, o_2, \ldots, o_t)$ instead.

Phase      1/4     2/4     3/4     4/4     Average
Baseline   0.529   0.581   0.642   0.797   0.611
Table 2: Mean Accuracy for Global State Evaluation. We test our baseline model on the test set and list the mean accuracies in different game phases. Mean accuracy over the entire game is also reported.

Matchup       TvT    TvP    TvZ    PvP    PvZ    ZvZ
Baseline(%)   61.1   59.1   59.8   51.4   59.7   59.9
Table 3: Mean Accuracy for Global State Evaluation of all replays.

Matchup       TvT    TvP    TvZ    PvP    PvZ    ZvZ
Baseline(%)   74.1   74.8   73.5   76.3   75.1   76.1
Table 4: Mean Accuracy for Build Order Prediction of all replays.
Figure 7: Baseline Network Architecture. $o_t$ is the input feature vector. A, B and E are linear units with the number of units 1024, 2048 and 1, while C and D are GRUs with size 2048 and 512.
Figure 8: The Trend of Mean Accuracy with Time Steps for Global State Evaluation. The mean accuracy on test set increases as game progresses.

4.2 Baseline Network


We model global state evaluation as a sequential decision making problem and use Recurrent Neural Networks (RNNs) [Mikolov et al.2010] to learn from replays. Concretely, we use GRUs [Cho et al.2014] in the last two layers to model the time series $o_1, o_2, \ldots, o_t$. As shown in Figure 7, the feature vector $o_t$ flows through linear units A and B with size 1024 and 2048. Then two GRUs, C and D, with size 2048 and 512 are applied. The hidden state from D is fed into the linear unit E, followed by a Sigmoid function, to get the final result $r_t$. ReLUs are applied after both A and B.
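A minimal PyTorch sketch of the architecture described above is given here for orientation. It follows the stated layer sizes (1024, 2048, 2048, 512, 1) but is not the authors' code: the class name, the `feature_dim` placeholder of 64 and the batch-first layout are assumptions.

```python
import torch
from torch import nn


class BaselineNet(nn.Module):
    """Linear A and B, GRUs C and D, linear E with a sigmoid output."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.a = nn.Linear(feature_dim, 1024)
        self.b = nn.Linear(1024, 2048)
        self.c = nn.GRU(2048, 2048, batch_first=True)
        self.d = nn.GRU(2048, 512, batch_first=True)
        self.e = nn.Linear(512, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features has shape (batch, time, feature_dim)
        x = torch.relu(self.a(features))
        x = torch.relu(self.b(x))
        x, _ = self.c(x)  # hidden state of C at every time step
        x, _ = self.d(x)  # hidden state of D at every time step
        # One win probability per time step, shape (batch, time)
        return torch.sigmoid(self.e(x)).squeeze(-1)


model = BaselineNet(feature_dim=64)  # 64 is a placeholder feature size
with torch.no_grad():
    probs = model(torch.rand(2, 20, 64))  # 2 sequences of 20 time steps
print(tuple(probs.shape))
```

The model emits a probability at every time step, so one forward pass over a replay yields the whole sequence of win estimates.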

Objective Function

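Binary Cross Entropy Loss (BCE) serves as our objective function, which is defined as Equation 1,

\[ J(\Omega_t, R) = -R \log P(R = 1 \mid \Omega_t) - (1 - R) \log P(R = 0 \mid \Omega_t) \tag{1} \]

where $\Omega_t$ stands for $o_1, o_2, \ldots, o_t$ and $R$ is the final result of a match. We simply set $R$ to be 1 if the player wins at the end and set it to be 0 otherwise.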
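For a single time step, the binary cross-entropy objective can be sketched in pure Python as follows. The function name and the clipping constant `eps` are assumptions added for numerical safety; they are not taken from the paper.

```python
import math


def bce_loss(p_win: float, result: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction.

    p_win: predicted probability that the player wins.
    result: 1 if the player won the match, 0 otherwise.
    """
    p = min(max(p_win, eps), 1.0 - eps)  # avoid log(0)
    return -(result * math.log(p) + (1 - result) * math.log(1.0 - p))


# A confident correct prediction costs less than a confident wrong one.
print(bce_loss(0.9, 1), bce_loss(0.1, 1))
```

An uninformative prediction of 0.5 costs log 2 regardless of the result, which is a useful reference point when reading training curves.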

Implementation Details

Our algorithms are implemented using PyTorch. To train our baseline model, we use ADAM [Kingma and Ba2014] for optimization. At the end of every epoch, the learning rate is decreased by a factor of 2. The batch size is set to 256, while the number of time steps is set to 20 to guard against vanishing or exploding gradients.

4.3 Experiment Results

The baseline network is trained on our dataset using Terran versus Terran matches and evaluated with mean accuracy. The mean accuracy on the test set is around 0.61 after the model converges. We also show the mean accuracies in different phases in Figure 8. At the beginning of the game (0%-25%), it is hard to tell the probability of winning, as the mean accuracy of this curve is around 0.5 and does not change much as training progresses. After half of the game (50%-75%), the mean accuracy reaches 0.64, while it is around 0.80 at the end of the game (75%-100%). The exact results are listed in Table 2 and serve as the baseline results for global state evaluation in MSC. The results for all replays are shown in Table 3.
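The per-phase accuracy reported above can be computed along the following lines. This is a sketch under assumptions: a prediction counts as "win" when the probability is at least 0.5, each replay is cut into four equal phases by time step, and the function name is hypothetical.

```python
def phase_accuracies(predictions, result, num_phases=4, threshold=0.5):
    """Accuracy of one replay's predictions within each game phase.

    predictions: win probabilities, one per time step of the replay.
    result: 1 if the player won the match, 0 otherwise.
    """
    n = len(predictions)
    correct = [0] * num_phases
    total = [0] * num_phases
    for t, p in enumerate(predictions):
        phase = min(t * num_phases // n, num_phases - 1)
        total[phase] += 1
        correct[phase] += int((p >= threshold) == bool(result))
    return [c / k if k else float("nan") for c, k in zip(correct, total)]


# Eight time steps of a match that the player eventually won.
preds = [0.4, 0.6, 0.4, 0.6, 0.7, 0.7, 0.9, 0.9]
print(phase_accuracies(preds, result=1))
```

Averaging these per-replay values over the test set gives one number per phase, in the spirit of Table 2.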

5 Baselines for Build Order Prediction

Build order prediction is used to predict what to train, build or research in the next step given the current state. The procedure is similar to that in Section 4, except that the output is an N-way softmax. We use Top-1 accuracy as the metric and show the results in Table 4.

6 Conclusion

We released a new dataset, MSC, based on SC2LE, which focuses on macro-management in StarCraft II. Unlike previously released datasets for macro-management, we proposed a standard procedure for preprocessing, parsing and feature extraction. We also defined the specifics of the feature vector, the space of high-level actions and three subsets for training, validation and test. Our dataset preserves the high-level information directly parsed from replays as well as the final result (win or lose) of each match. These characteristics make MSC the right place to experiment with and evaluate various methods for assorted tasks in macro-management, such as build order prediction, global state evaluation and opening strategy clustering. Multiple tasks in macro-management are listed and the advantages of MSC for each task are analyzed. Among all these tasks, global state evaluation and build order prediction are two of the key tasks. Thus, we proposed a baseline model and presented initial baseline results for them. However, other tasks require baselines as well; we leave these as future work and encourage other researchers to evaluate various tasks on MSC and report their results as baselines.