On Reinforcement Learning for Full-length Game of StarCraft

by   Zhen-Jia Pang, et al.
Nanjing University

StarCraft II poses a grand challenge for reinforcement learning. Its main difficulties include huge state and action spaces and a long time horizon. In this paper, we investigate a hierarchical reinforcement learning approach for StarCraft II. The hierarchy involves two levels of abstraction. One is the macro-action, automatically extracted from experts' trajectories, which reduces the action space by an order of magnitude yet remains effective. The other is a two-layer hierarchical architecture which is modular and easy to scale, enabling a curriculum that transfers from simpler tasks to more complex ones. The reinforcement training algorithm for this architecture is also investigated. On a 64x64 map and using a restricted set of units, we achieve a winning rate of more than 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a winning rate of over 93% as Protoss against the most difficult non-cheating built-in AI (level-7) of Terran, training within two days on a single machine with only 48 CPU cores and 8 K40 GPUs. The agent also shows strong generalization when tested against never-seen opponents, including the cheating-level built-in AI and all levels of the Zerg and Protoss built-in AI. We hope this study can shed some light on future research in large-scale reinforcement learning.




1 Introduction

In recent years, reinforcement learning [13] (RL) has developed rapidly in many different domains. The game of Go has been considered conquered after AlphaGo [11] and AlphaGo Zero [12]. Most Atari games are nearly solved using DQN [6] and follow-up methods. Various mechanical control problems, such as robotic arms [4] and self-driving vehicles [10], have seen great progress. However, current reinforcement learning algorithms are still difficult to apply to large-scale problems. Agents cannot learn to solve problems as smartly and efficiently as humans. To improve the ability of reinforcement learning, complex strategic games like StarCraft have become the simulation environment of choice for many institutions such as DeepMind [18], FAIR [15], and Alibaba [8].

From the perspective of reinforcement learning, StarCraft is a very difficult problem. First, it is an imperfect information game. Players can only see a small area of the map through a local camera, and there is a fog of war in the game. Second, the state space and action space of StarCraft are huge. The input image features are much larger than those of Go. There are hundreds of units and buildings, each with its own unique operations, which makes the action space extremely large. Third, a full-length game of StarCraft commonly lasts from 30 minutes to more than one hour, and the player needs to make thousands of decisions to win. Finally, StarCraft is a multi-agent game. The combination of these issues makes StarCraft a great challenge for reinforcement learning.

Most previous agents in StarCraft are based on manual rules and scripts. Existing works related to reinforcement learning usually address micromanagement [16] or macromanagement [3]. These works solved specific problems such as local combat in StarCraft. However, few works address full-length games. In the SC2LE paper [18], the benchmark result given by DeepMind shows that the A3C algorithm [5] did not achieve a single victory against the easiest difficulty level-1, which confirms the difficulty of the full-length game in StarCraft II.

In this paper, we investigate a hierarchical reinforcement learning method for the full-length game of StarCraft II (SC2). First, we summarize the difficulties encountered in StarCraft II and introduce the simulation platform SC2LE (StarCraft II Learning Environment). Then, we present our hierarchical architecture, which uses several levels of abstraction to make otherwise intractable large-scale reinforcement learning problems easier to solve. An effective training algorithm tailored to the architecture is also investigated. Next, we report experiments on the full-length game on a 64x64 map of SC2LE. Finally, we discuss the impact of the architecture, the reward design, and the settings of curriculum learning. Experimental results on several difficulty levels of full-length games on SC2LE illustrate the effectiveness of the different methods. The main contributions of this paper are as follows:

  • We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to solve.

  • A simple yet effective training algorithm for this architecture is also presented.

  • We give a detailed study of the impact of training settings on our architecture on SC2LE.

  • Experimental results on SC2LE show that our method achieves state-of-the-art results.

2 Background

In this section, we first introduce the preliminaries of reinforcement learning. Then, we review related work on hierarchical reinforcement learning. Finally, we introduce the StarCraft II environment and discuss related work on it.

2.1 Reinforcement Learning

Consider a finite-horizon Markov Decision Process (MDP), which can be specified as a 6-tuple:

$$M = \langle S, A, P(\cdot), R(\cdot), \gamma, T \rangle$$

$S$ is the state space and $s \in S$ is a state. $A$ is the action space and $a \in A$ is an action which the agent can choose in state $s$. $P(\cdot) = \Pr(s' \mid s, a)$ represents the probability distribution of the next state $s'$ over $S$ when the agent chooses action $a$ in state $s$. $R(\cdot) = R(s, a)$ represents the instant reward gained from the environment when the agent chooses action $a$ in state $s$. $\gamma$ is the discount factor, which represents the influence of future rewards on the current choice. $T$ is the maximum length of the time horizon.

A policy $\pi$ is a mapping or distribution from $S$ to $A$. If $\pi(s)$ is a single action and the agent selects $a = \pi(s)$ when it is in state $s$, the policy is called a deterministic policy. In contrast, if $\pi(s)$ is a distribution and the agent samples an action $a \sim \pi(s)$, the policy is called a stochastic policy. Suppose an agent using policy $\pi$ starts from state $s_0$, chooses an action $a_0$, gains a reward $r_0 = R(s_0, a_0)$, then transitions to the next state $s_1$ according to the distribution $P(\cdot)$, and repeats this process. This generates the following sequence:

$$s_0, a_0, r_0, s_1, a_1, r_1, \ldots$$

There is a terminal state $s_{end}$ at which the agent's exploration stops. The process from $s_0$ to $s_{end}$ is called one episode. The sequence in the episode is called a trajectory of the agent. For a finite-horizon problem, the exploration also ends when the time step exceeds $T$. Typical RL algorithms need hundreds of episodes to learn a policy. In one episode, the discounted cumulative reward obtained by the agent is defined as:

$$G = \sum_{t=0}^{T} \gamma^{t} r_t$$

$G$ is called the return of the episode. Typical RL algorithms aim to find an optimal policy which maximizes the expected return.
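As a small illustration of the discounted return defined above, the following sketch computes it for a toy reward sequence. The function name and the example rewards are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one episode's reward sequence."""
    g = 0.0
    # Accumulate backwards so each step needs a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Toy episode (hypothetical): no reward until a win (+1) at the final step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
toy_return = discounted_return(episode_rewards, gamma=0.9)
print(toy_return)
```

With a sparse terminal reward like this one, the return seen from the first step shrinks geometrically with episode length, which is one way to see why long horizons make credit assignment hard.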


2.2 Hierarchical Reinforcement Learning

When the dimension of the state space in the environment is large, the space that needs to be explored grows exponentially, which is known as the curse of dimensionality in reinforcement learning. Hierarchical reinforcement learning addresses this problem by decomposing a complex problem into several sub-problems and solving each sub-problem in turn. The Option framework [14] is one of the traditional hierarchical reinforcement learning approaches. Although such algorithms can better handle the curse of dimensionality, the options need to be defined manually, which is time-consuming and laborious. Another advantage of hierarchical reinforcement learning is that the time resolution is reduced, so that credit assignment over a long time scale can be handled better.

In recent years, some novel hierarchical reinforcement learning algorithms have been proposed. Option-Critic [1] uses policy gradient theorems to learn the options and the policy over options simultaneously, thus reducing the effort of manually designing options. However, the automatically learned options do not perform as well as non-hierarchical algorithms on certain tasks. FeUdal Networks [17] use a hierarchical architecture that includes a Manager module and a Worker module. The Manager assigns sub-goals to the Worker, and the Worker is responsible for carrying out specific actions. The paper proposes a gradient transfer strategy to learn the parameters of the Manager and Worker in an end-to-end manner. However, due to its complexity, this architecture is hard to tune. MLSH [2] proposes a hierarchical learning approach based on meta-learning, which improves the ability to transfer to new tasks through sub-policies learned across multiple tasks. MLSH has achieved better results than the PPO [9] algorithm on some tasks, but because its setting is multi-task, it is more difficult to apply to our environment.

2.3 StarCraft II

Games are ideal environments for reinforcement learning research. RL problems in real-time strategy (RTS) games are far more difficult than those in Go due to the complexity of states, the diversity of actions, and the long time horizon. Traditionally, research on real-time strategy games has been based on search and planning approaches [7]. In recent years, more studies have used RL algorithms for RTS research, and one of the most famous RTS research environments is StarCraft. Previous works on StarCraft mostly focus on local battles or part-length games, and often obtain features directly from the game engine. [16] present a heuristic reinforcement learning algorithm combining exploration in the space of policies with backpropagation. [8] introduce BiCNet, based on multi-agent reinforcement learning combined with actor-critic. Although these works have achieved good results, they are only effective for part-length games. ELF [15] provides a framework for efficient learning and a platform, mini-RTS, for reinforcement learning research. ELF also gives a baseline of the A3C [5] algorithm in a full-length game of mini-RTS. However, because that problem is relatively simple, it is still far from the complexity of StarCraft.

SC2LE is a new research learning environment based on StarCraft II, the follow-up to StarCraft. In StarCraft, the location information of each unit is provided by the engine. In SC2LE, however, the location information of units and buildings needs to be perceived from image features. Therefore, the spatial complexity of the input state is much larger than in StarCraft I. At the same time, in order to simulate the real hand movements of humans, the action space in SC2LE is refined down to each mouse click event, which greatly increases the difficulty of learning and searching. The benchmark result given by DeepMind shows that the A3C algorithm [5] did not achieve a single victory against the easiest difficulty level-1, confirming the difficulty of the full-length game in StarCraft II. In addition to the full-length game, SC2LE also provides several mini-games for research. [20] proposed a relation-based reinforcement learning algorithm, which achieved good results on these mini-games. However, results on the full-length game have still not been reported.

3 Methodology

In this section, we first introduce our hierarchical architecture and the generation of macro-actions. Then the training algorithm of the architecture is given. Finally, we discuss the reward design and the curriculum learning setting used in our method.

Figure 1: Overall Architecture.

3.1 Hierarchical Architecture

Our hierarchical architecture is illustrated in Fig. 1. There are two types of policies running at different timescales. The controller chooses a sub-policy based on the current observation at every long time interval, and the chosen sub-policy picks a macro-action at every short time interval.

For further illustration, we use $\Pi$ to represent the controller. $S_c$ and $A_c$ are its state and action spaces. $R_c$ is the reward function of the controller. Similarly, assuming there are $n$ sub-policies in the pool, we use $\langle \pi_1, \pi_2, \ldots, \pi_n \rangle$ to represent them. The state and action spaces of the $i$th sub-policy are defined as $S_i$ and $A_i$. $R_i$ is its reward function. In addition, we have a time interval $K$: the controller chooses a sub-policy every $K$ time units, and the chosen sub-policy makes a decision at every time unit. We can now go through the whole process in detail.

At time $t_c$, the controller gets its own global observation $s^c_{t_c}$ and chooses a sub-policy $i$ based on its state, as follows:

$$a^c_{t_c} = \Pi(s^c_{t_c}), \quad s^c_{t_c} \in S_c$$

Now the controller waits for $K$ time units and the $i$th sub-policy begins to act. We assume its current time is $t_i$ and its local observation is $s^i_{t_i}$, so it obtains the macro-action $a^i_{t_i} = \pi_i(s^i_{t_i})$. After the $i$th sub-policy performs the macro-action $a^i_{t_i}$ in the game, it gets the reward $r^i_{t_i} = R_i(s^i_{t_i}, a^i_{t_i})$ and its next local observation $s^i_{t_i+1}$. The tuple $(s^i_{t_i}, a^i_{t_i}, r^i_{t_i}, s^i_{t_i+1})$ is stored in its local buffer $D_i$ for future training. After $K$ moves, control returns to the controller and the sub-policy waits for its next turn.

The high-level controller gets the return of the chosen sub-policy $\pi_i$ and computes the reward of its action $a^c_{t_c}$ as follows:

$$r^c_{t_c} = r^i_{t_i} + r^i_{t_i+1} + \cdots + r^i_{t_i+K-1}$$

The controller also gets its next global state $s^c_{t_c+1}$, and the tuple $(s^c_{t_c}, a^c_{t_c}, r^c_{t_c}, s^c_{t_c+1})$ is stored in its local buffer $D_c$. Now the time is $t_c+1$, and the controller makes its next choice based on its current global observation.
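The two-timescale interaction described above can be sketched as a rollout loop. This is a minimal sketch under stated assumptions: the callables `controller`, `sub_policies[i]` and `env_step` are hypothetical stand-ins for the networks and the game, observations are omitted, and the controller reward is taken to be the plain sum of the sub-policy rewards collected during its interval.

```python
def run_episode(controller, sub_policies, env_step, total_steps, K):
    """Roll out one episode with a controller that acts every K steps.

    controller(t)      -> index of the chosen sub-policy (hypothetical signature)
    sub_policies[i](t) -> macro-action of sub-policy i (hypothetical signature)
    env_step(i, a)     -> scalar reward for sub-policy i (hypothetical signature)
    """
    controller_buffer = []                     # one entry per controller decision
    sub_buffers = [[] for _ in sub_policies]   # one local buffer per sub-policy
    t = 0
    while t < total_steps:
        i = controller(t)                      # high-level choice of sub-policy
        segment_reward = 0.0
        # The chosen sub-policy acts for up to K consecutive time units.
        for _ in range(min(K, total_steps - t)):
            action = sub_policies[i](t)
            reward = env_step(i, action)
            sub_buffers[i].append((t, action, reward))
            segment_reward += reward
            t += 1
        # Assumption: controller reward = sum of sub-policy rewards in the segment.
        controller_buffer.append((i, segment_reward))
    return controller_buffer, sub_buffers


# Toy usage: two dummy sub-policies, constant reward of 1 per step.
K = 8
ctrl_buf, sub_bufs = run_episode(
    controller=lambda t: (t // K) % 2,
    sub_policies=[lambda t: "base_action", lambda t: "battle_action"],
    env_step=lambda i, a: 1.0,
    total_steps=16,
    K=K,
)
print(len(ctrl_buf), [len(b) for b in sub_bufs])
```

Note that the controller buffer grows by one entry per K environment steps, while the sub-policy buffers grow by one entry per step, which is the source of the shorter effective horizon for the controller.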

From the above, we can see several advantages of our hierarchical architecture. First, each sub-policy and the high-level controller have different state spaces. The controller only needs global information to make high-level decisions, and the global state space $S_c$ is a small part of the full state space $S$. Likewise, a sub-policy responsible for combat focuses on the local state space related to battle. Such a hierarchical structure can therefore split the original huge state space into several subspaces corresponding to different policy networks. Second, the hierarchical structure can also split the tremendous action space $A$. Sub-policies with different functions have their own action spaces $A_i$. Third, the hierarchical architecture can effectively reduce the number of execution steps of the strategy. Since the control network calls a sub-network once every fixed time interval $K$, the total number of execution steps of the high-level network becomes $T/K$. The number of execution steps of each sub-policy is also reduced. Last but not least, the hierarchical architecture makes the design of the reward function easier. Different sub-policies may have different targets, so each can learn more quickly with its own suitable reward function.

3.2 Generation of Macro-actions

In StarCraft, the original action space is tremendous, and a human player always needs to perform a sequence of raw actions to achieve one simple purpose. For example, to construct a building in the game, we have to select a worker, order it to construct the building at a specific position, and send it back to work after it finishes. For human players, the raw-action sequences for such simple purposes are essentially fixed sequences held in memory. We therefore generate a macro-action space $A^{\eta}$, obtained through data mining from the trajectories of experts. The original action space $A$ is then replaced by the macro-action space $A^{\eta}$. This improves learning efficiency and running speed. The generation process of macro-actions is as follows:

  • First, we collect expert trajectories, which are sequences of operations from game replays.

  • Second, we use the PrefixSpan [19] algorithm to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, constructing a set $A^{seq}$ defined as $A^{seq} = \{ a^{seq} = (a_0, a_1, \ldots, a_i) \mid a_j \in A,\ i \le C \}$.

  • Third, we sort this set by the frequency of each $a^{seq}$.

  • Fourth, we remove duplicated and meaningless sequences and keep only the top-ranked ones.

  • Finally, the reduced set is taken as the newly generated macro-action space $A^{\eta}$.
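The steps above can be sketched in a few lines. Note that this sketch does not implement PrefixSpan: it substitutes plain counting of contiguous subsequences, a much simpler stand-in that only conveys the collect, count, sort and truncate flow. The action names, `max_len` and `top_k` are hypothetical, and the manual removal of meaningless sequences is not modelled.

```python
from collections import Counter


def mine_macro_actions(trajectories, max_len, top_k, min_len=2):
    """Return the top_k most frequent contiguous action subsequences.

    Simplified stand-in for sequential pattern mining: every contiguous
    window of length min_len..max_len is counted across all trajectories.
    """
    counts = Counter()
    for traj in trajectories:
        for n in range(min_len, max_len + 1):
            for start in range(len(traj) - n + 1):
                counts[tuple(traj[start:start + n])] += 1
    # most_common sorts candidates by descending frequency.
    return [seq for seq, _ in counts.most_common(top_k)]


# Hypothetical expert trajectories expressed as raw-action names.
expert_trajectories = [
    ["select_worker", "build_pylon", "harvest", "select_army", "attack"],
    ["select_worker", "build_pylon", "harvest"],
    ["select_army", "attack"],
]
macro_actions = mine_macro_actions(expert_trajectories, max_len=3, top_k=4)
print(macro_actions)
```

A real pipeline would also allow gaps between the actions of a pattern, which is what PrefixSpan handles and this contiguous-window version does not.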

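Using the macro-action space $A^{\eta}$, the MDP of the controller is now reduced to a simpler one, which is defined as:

$$M_c = \langle S_c, A_c, P_c(\cdot), R_c(\cdot), \gamma, T_c \rangle$$

Meanwhile, the MDP of each sub-policy is also reduced to a new one:

$$M_i = \langle S_i, A^{\eta}_i, P_i(\cdot), R_i(\cdot), \gamma, T_i \rangle, \quad i = 1, \ldots, n$$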

3.3 Reward Design

Reward has a significant impact on reinforcement learning. There are usually two types of reward we can use. One is dense reward, which can be obtained easily during the game, and the other is sparse reward, which is only given at a few specific states. Dense reward gives the agent more frequent positive or negative feedback. As a result, dense reward can help the agent learn to play the game faster and better than sparse reward.

We explore three types of rewards in this paper. The Win/Loss reward is a ternary value, 1 (win) / 0 (tie) / -1 (loss), received at the end of a game. The Score reward is the Blizzard score obtained from the game engine. The Mixture reward is a hand-designed reward function. It is hard for the agent to learn the game using the Win/Loss reward alone. Blizzard scores can be seen as a dense reward. We will show in the experiment section that using this score as the reward does not help the agent win more often. We have designed reward functions for the sub-policies that combine dense rewards such as the score with sparse rewards such as win/loss. These rewards turn out to be effective for training. Their results are also shown in the experiment section.

3.4 Training Algorithm

The training process of our architecture is shown in Algorithm 1 and can be summarized as follows. First, we initialize the controller and the sub-policies. Then we run $Z$ iterations, with $M$ episodes in each iteration. At the beginning of each iteration, we clear all the replay buffers. In each episode, we collect the trajectories of the controller and the sub-policies. At the end of each iteration, we use the replay buffers to update the parameters of the controller and the sub-policies.

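The update algorithm we use is PPO [9]. An entropy term is added to the PPO loss to encourage exploration. Following PPO, the resulting objective, which is maximized, is:

$$L_t(\phi) = \mathbb{E}_t\left[ L_t^{clip}(\phi) - c_1 L_t^{vf}(\phi) + c_2 S[\pi_\phi](s_t) \right]$$

where $c_1, c_2$ are the coefficients we need to tune, $L_t^{vf}$ is the squared-error value-function loss, and $S$ denotes an entropy bonus. $L_t^{clip}(\phi)$ is defined as follows:

$$L_t^{clip}(\phi) = \mathbb{E}_t\left[ \min\left( r_t(\phi)\hat{A}_t,\ \operatorname{clip}(r_t(\phi), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right]$$

where $r_t(\phi) = \pi_\phi(a_t \mid s_t) / \pi_{\phi_{old}}(a_t \mid s_t)$, and $\hat{A}_t$ is computed by a truncated version of generalized advantage estimation.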
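To make the clipped PPO objective with an entropy bonus concrete, here is a minimal pure-Python sketch over a batch of time steps. It follows the standard PPO formulation; the default values of `clip_eps`, `c1` and `c2` are illustrative assumptions, not the settings used in the paper, and the advantages and value errors are taken as precomputed inputs.

```python
import math


def ppo_objective(new_logp, old_logp, advantages, value_errors, entropies,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """Mean PPO objective (to be maximized) over a batch of time steps.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the old policy. value_errors: value prediction minus target.
    """
    total = 0.0
    for lp_new, lp_old, adv, v_err, ent in zip(
            new_logp, old_logp, advantages, value_errors, entropies):
        ratio = math.exp(lp_new - lp_old)                     # probability ratio r_t
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        l_clip = min(ratio * adv, clipped * adv)              # pessimistic surrogate
        l_vf = v_err ** 2                                     # squared value error
        total += l_clip - c1 * l_vf + c2 * ent
    return total / len(new_logp)


# Toy batch (hypothetical numbers): the new policy doubled the action probability.
objective = ppo_objective(
    new_logp=[math.log(0.5)], old_logp=[math.log(0.25)],
    advantages=[1.0], value_errors=[0.0], entropies=[0.0],
)
print(objective)
```

In the toy batch the ratio is 2, so the positive advantage is credited only up to the clipping bound; with a negative advantage the unclipped term would be kept, since the minimum picks the more pessimistic value.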

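Input: number of sub-policies $n$, time interval $K$, reward functions $R_1, \ldots, R_n$, max episodes $M$, max iteration steps $Z$

  Initialize replay buffers $\langle D_c, D_1, \ldots, D_n \rangle$, controller policy $\Pi_\phi$, each sub-policy $\pi_{\theta_i}$
  for $j = 1$ to $Z$ do
     clear data buffers $\langle D_c, D_1, \ldots, D_n \rangle$
     for $k = 1$ to $M$ do
        collect controller trajectory $\tau_c$ into $D_c$ in $1/K$ timescale
        for $i = 1$ to $n$ do
           collect sub-policy trajectory $\tau_i$ into $D_i$ in full timescale
        end for
     end for
     use $D_c$ to update $\phi$ to maximize expected return
     for $i = 1$ to $n$ do
        use $D_i$ to update $\theta_i$ to maximize expected return
     end for
  end for
Algorithm 1: RL training algorithm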

3.5 Curriculum Learning

Curriculum learning is an effective method that can be applied to reinforcement learning. It designs a curriculum that progresses from easy to difficult, and agents improve their abilities by learning along this curriculum. We use the idea of curriculum learning here so that our agents can go on to challenge more difficult tasks.

SC2 includes 10 difficulty levels of built-in AI, all of which are crafted by rules and scripts. From level-1 to level-10, the built-in AI's ability steadily increases. Training directly at a higher difficulty level gives less positive feedback, which makes it very hard for the agent to learn. Instead, we design a suitable learning sequence. In this work, we first train our agent at a low difficulty level, and then transfer to a high difficulty level using the pre-trained model in a curriculum setting.

However, when the pre-trained agent transfers to high difficulty levels, we find that if the controller and all sub-policies are updated at the same time, training is unstable due to the mutual influence of the different networks. In response, we devised a strategy that updates the controller and the sub-policies alternately. We found that this makes the training process more stable, and the winning rate at high difficulty levels rises steadily.
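One simple way to realize an alternating update is a schedule that decides, per iteration, which networks receive gradient updates. The sketch below is only an assumption about how such a schedule could look: the text does not specify the alternation period or order, so `period` and the module names are hypothetical.

```python
def modules_to_update(iteration, n_sub_policies, period=1):
    """Return the names of the modules to update at a given iteration.

    Even phases update only the controller; odd phases update only the
    sub-policies. `period` is the number of iterations per phase.
    """
    phase = (iteration // period) % 2
    if phase == 0:
        return ["controller"]
    return [f"sub_policy_{i}" for i in range(n_sub_policies)]


# Toy usage: print the schedule for the first four iterations.
for it in range(4):
    print(it, modules_to_update(it, n_sub_policies=2))
```

Freezing one level while the other learns keeps each network's learning target stationary within a phase, which is a plausible reason for the improved stability reported above.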

4 Experiments

In this section, we present the experimental results on SC2LE. We first introduce the training and experiment settings used in the evaluation. Then three sets of experimental results are shown. Finally, we discuss the results and analyze their possible causes.

4.1 Setting

The setting of our architecture is as follows: the controller selects one sub-policy every 8 seconds, and the sub-policy performs a macro-action every 1 second. In this setup, we have two sub-policies in the sub-policy pool. One sub-policy, called the base network, controls the construction of buildings and the production of units in the base. The other sub-policy is responsible for battle. This battle sub-policy has three different settings, which are explained later.

The number of training iterations is generally set to 800. Each iteration runs 100 full-length games of SC2. One full-length game of SC2 is defined as an episode. The maximum number of frames for each game is set to 18,000. To speed up learning, we adopted a distributed training method. We used 10 workers, each with 5 threads. Each worker is assigned several CPU cores and one GPU. Each worker collects data on its own and stores it in its own replay buffer. Since we need 100 episodes of data per iteration, each worker thread needs to collect data for 2 episodes. Each worker thus collects the data of 10 episodes, calculates the gradients, and passes them to the parameter server. After the parameters are updated on the parameter server, the new parameters are passed back to each worker. Since the algorithm we use is PPO, the previous (old) parameters are kept on each worker to calculate the gradients.

We ran the experiments on a machine with 8 K40 GPUs and 48 CPU cores. Two experiments run at the same time, with 4 GPUs assigned to each. It takes 8-10 minutes to run 100 games of StarCraft. Training often requires about 20,000 to 50,000 episodes to converge. This means that training an effective agent may take more than one day even in the distributed version. The resource consumption of training is therefore also a difficulty in SC2 research. In the future, it may be necessary to explore more efficient training methods.
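The figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below only recombines numbers stated above; the one added assumption is that the time per 100 games stays constant throughout training, so the resulting wall-clock range is a rough estimate.

```python
WORKERS = 10
THREADS_PER_WORKER = 5
EPISODES_PER_ITERATION = 100

# Episodes each thread and each worker must collect per iteration.
threads = WORKERS * THREADS_PER_WORKER
episodes_per_thread = EPISODES_PER_ITERATION // threads
episodes_per_worker = EPISODES_PER_ITERATION // WORKERS


def training_hours(total_episodes, minutes_per_100_games):
    """Wall-clock hours, assuming a constant time per 100-game iteration."""
    iterations = total_episodes / EPISODES_PER_ITERATION
    return iterations * minutes_per_100_games / 60.0


fastest_hours = training_hours(20_000, 8)    # fewest episodes, fastest iterations
slowest_hours = training_hours(50_000, 10)   # most episodes, slowest iterations
print(episodes_per_thread, episodes_per_worker)
print(round(fastest_hours, 1), round(slowest_hours, 1))
```

Under that assumption, the lower end of the range is slightly over one day, which matches the statement that an effective agent may take more than one day to train.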

A full-length game of SC2 in 1v1 mode proceeds as follows. First, the two players spawn at different random points on the map and accumulate resources. Second, they construct buildings. Third, they produce battle units. Fourth, they attack and destroy all of the opponent's buildings. Fig. 2 shows a screenshot of the running game.

Figure 2: Screenshot of StarCraft II.

For simplicity, we fix the agent's race to Protoss and the built-in AI's race to Terran. The map we use is the 64x64 map simple64 in SC2LE. The length of each game is capped (at 18,000 frames, as noted above). Also for simplicity, our agent does not expand to a second resource site (sub-mine), and only uses the two basic military units, Zealot and Stalker.

In our setup, the battle policy has three different settings, which are explained below.

4.1.1 Combat rule

Combat rule is the simplest battle policy, in which the enemy's position is selected automatically using prior knowledge. The enemy's position is set to the resource point that is most distant from our base.

There is only one action in combat rule: attacking the enemy. The attack action relies on the game's built-in automatic attack behavior. Thus the result of the attack depends on the construction of the base and the production of units. This setting facilitates training on base construction and unit production. Only when the agent learns to carry out base construction well (for example, mastering the build order, avoiding redundant buildings, and building more Pylons in time when supply is insufficient) can it win when attacking the enemy.

4.1.2 Combat network

Though the simple combat rule is effective, the combat process is somewhat rigid and may fail when moving to larger and more complex maps. Below we introduce a more intelligent attack approach, called the combat network. The output of the combat network consists of three actions and a position vector. The three actions are: all units attack a certain position, all units retreat to a certain position, and do nothing.

The target position is specified by a position vector. The position vector is a one-hot vector that selects one of eight coordinate points on the screen. Imagine a square in the center of the screen whose side length is half the side length of the screen. The 8 points are evenly distributed along the sides of this square and serve as the candidate positions for attack and movement. With this setup, the agent can choose the target and location of the offense more intelligently, which increases the flexibility of the battle. At the same time, because the combat network is not told the location of the enemy, it can learn to discover where the enemy is likely to be, so that it can maintain its performance when moving to a larger and more complex map.
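The geometry of the eight candidate points can be sketched as follows. The text only says the points are evenly distributed on the square, so placing them at the four corners and four edge midpoints, ordering them clockwise from the top-left corner, and assuming a square screen are all assumptions of this sketch.

```python
def attack_points(screen_size):
    """Eight points on a centred square whose side is half the screen side.

    Assumed layout: four corners plus four edge midpoints, clockwise from
    the top-left corner, in screen coordinates (x, y).
    """
    lo = screen_size / 4.0          # left/top edge of the centred square
    hi = 3.0 * screen_size / 4.0    # right/bottom edge of the centred square
    mid = screen_size / 2.0         # screen centre along either axis
    return [
        (lo, lo), (mid, lo), (hi, lo),   # top edge
        (hi, mid),                       # right midpoint
        (hi, hi), (mid, hi), (lo, hi),   # bottom edge
        (lo, mid),                       # left midpoint
    ]


def decode_position(one_hot, screen_size):
    """Map a one-hot position vector of length 8 to screen coordinates."""
    index = one_hot.index(1)
    return attack_points(screen_size)[index]


# Toy usage on a hypothetical 64x64 screen.
target = decode_position([0, 0, 0, 1, 0, 0, 0, 0], screen_size=64)
print(target)
```

Restricting the output to a handful of fixed points keeps the spatial action space small compared with predicting arbitrary pixel coordinates.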

Figure 3: Architecture of combat network.

This combat network is constructed as a Convolutional Neural Network (CNN), which is shown in Fig. 3. The CNN accepts the feature map of the minimap so that it can use the information of the full map. Since the combat network needs to specify a position on the screen, it is also necessary to input the feature map of the screen image and to specify the position of the local camera.

We use a simple strategy to decide the position of the screen. When the controller chooses the base sub-policy, the camera is moved to the location of the agent's base. When the controller chooses the battle sub-policy, the camera is moved to the location of the army. We considered two ways of choosing the location of the army. The first is the center point of all combat units. The second is the position of the most injured unit. Since an injured unit indicates that a battle has occurred recently, we found that the second setting worked better in practice.
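The two camera-placement options described above can be sketched with plain unit records. The dictionary fields (`x`, `y`, `health`, `max_health`) are hypothetical, and interpreting "most injured" as the lowest health fraction is an assumption of this sketch.

```python
def army_center(units):
    """Option 1: the mean position of all combat units."""
    n = len(units)
    return (sum(u["x"] for u in units) / n, sum(u["y"] for u in units) / n)


def most_injured_position(units):
    """Option 2: the position of the unit with the lowest health fraction."""
    worst = min(units, key=lambda u: u["health"] / u["max_health"])
    return (worst["x"], worst["y"])


# Toy usage with two hypothetical units.
army = [
    {"x": 0, "y": 0, "health": 100, "max_health": 100},
    {"x": 10, "y": 20, "health": 20, "max_health": 80},
]
print(army_center(army), most_injured_position(army))
```

When the army is spread out, the mean position can fall on empty ground between groups, whereas the most injured unit is by construction located where fighting took place.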

4.1.3 Mixture model

We can combine the combat network with the combat rule into a mixture model. When a certain value is predicted in the position vector of the combat network, the army's attack position is instead set to the position calculated from prior knowledge.

Figure 4: Winning curve in training process. Panels: (a) curriculum learning, (b) module training, (c) combat rule, (d) combat network.
Figure 5: Comparison of settings. Panels: (a) hierarchy vs single, (b) Win/Loss vs handcrafted, (c) score and win rate, (d) impact of update num.

4.2 Comparison of Training Method

First, we trained our combat rule agent on difficulty level-2 from scratch. The training runs for 800 iterations. We found that at the beginning of training, the agent constructed many redundant buildings. After training, the agent uses building resources more effectively and controls the proportions in which each unit type is produced.

4.2.1 Effect of curriculum learning

In all of our combat models, we first train at a low difficulty level and then transfer to a high difficulty level. This process follows the idea of curriculum learning. We found that when training directly at a high difficulty level, the performance of our agent is difficult to improve. If we start training at a low difficulty level and then use the resulting model as the initial parameters for training at the high level, we can often get good results.

Fig. 4 (a) shows the comparison between training using pre-trained model and training from scratch in difficulty level-7.

4.2.2 Effect of module training

Since we are using a hierarchical architecture, we can easily replace sub-policies. When a sub-policy is replaced, the parameters of the control network and the other sub-policies are retained, and only the parameters of the newly replaced sub-policy are updated. This can accelerate learning. In contrast, with a single-network architecture, changing the combat strategy requires retraining all the parameters in the network, which may lead to poor performance.

Fig. 4 (b) shows the comparison between training using pre-trained model and training from scratch when replacing combat rules with a combat network in our experiment.

4.2.3 Alternating vs simultaneous updating

When performing transfer learning, if all the networks in the hierarchical architecture are updated at the same time, the learning curve becomes very unstable, which makes convergence difficult. We therefore use an alternating update strategy that trains them in turn. This often results in a steady improvement.

Fig. 4 (c) shows the comparison between simultaneous and alternating updating when transferring the combat rule model from level-2 to level-7.

Fig. 4 (d) shows the comparison between simultaneous and alternating updating when transferring the combat network model from level-2 to level-7.

4.3 Comparison of Combat Model

Opponent's Type    Non-cheating (Training): levels 1-7      Cheating (No-training): levels 8-10
Difficulty Level   1     2     3     4     5     6     7     8     9     10
Combat Rule        1     0.99  0.94  0.99  0.95  0.88  0.78  0.70  0.73  0.60
Combat Network     0.98  0.99  0.45  0.47  0.39  0.73  0.66  0.56  0.52  0.41
Mixture Model      1     1     0.99  0.97  1     0.90  0.93  0.74  0.71  0.43
Table 1: Evaluation of models.

Opponent's Type    Non-cheating (No-training): levels 1-7   Cheating (No-training): levels 8-10
Difficulty Level   1     2     3     4     5     6     7     8     9     10
vs Zerg            1     0.99  1     0.98  0.98  0.99  0.94  0.89  0.82  0.35
vs Protoss         1     1     0.92  0.76  0.46  0.40  0.45  0.47  0.41  0.34
Table 2: Evaluation of the policy against Protoss and Zerg without re-training.

After training on difficulty level-7, we evaluate each setting of the combat model. Each evaluation covers difficulty level-1 to difficulty level-10. We run multiple games at each difficulty level and report the winning rate for each. The results are shown in Table 1. From difficulty level-1 to level-7, we found that the trained agent performs very well.

We also evaluate the policy against never-seen opponents, i.e., the built-in AI at difficulty level-8, level-9 and level-10, which use several different cheating techniques. Our trained agent still performs well against them. A video of our agent's performance is included in the supplementary material.

To test how well the policy generalizes to different types of opponents, we measured the winning rate of our Protoss agent against the other two races. Table 2 shows the evaluation results of our agent against all levels of the Zerg and Protoss built-in AI.

4.3.1 Combat rule

The combat rule agent achieves good results at difficulty level-1 to difficulty level-5. We find that since agents can only use the two basic battle units to fight, they are more inclined to use a fast attack style (called 'Rush'), which helps secure a high winning rate.

At the same time, to make effective use of resources, agents minimize the production of unnecessary buildings. At the beginning, agents tend to produce more workers to develop the economy. When the number of workers is close to saturation, the agent turns to the production of combat units. When the number of combat units is sufficient, the agent chooses to attack. This layered, progressive strategy is learned automatically through our reinforcement learning algorithm. Our agents thus learn a human-like development strategy, which illustrates the effectiveness of our approach.

4.3.2 Combat network

We test the combat network on all 10 difficulty levels. The results are shown in Table 1. Although the combat network model achieves a good winning rate on difficulty level-7, its winning rate on several other difficulty levels is not high. It is worth noting that many games at the other difficulty levels end in a tie. This means that the combat network model finds it harder to completely destroy the enemy (that is, to eliminate all enemy buildings).
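Because ties are not wins, a low winning rate can hide a large share of tied games. The tally below is a minimal sketch of how win, tie and loss rates per difficulty level could be reported side by side. The record format, the function name `outcome_rates` and the example outcomes are hypothetical and are not taken from our experiments.

```python
from collections import Counter, defaultdict

def outcome_rates(records):
    """Return {level: {"win": r, "tie": r, "loss": r}} from (level, outcome) pairs.

    Each outcome is assumed to be one of "win", "tie" or "loss".
    """
    tallies = defaultdict(Counter)
    for level, outcome in records:
        tallies[level][outcome] += 1

    rates = {}
    for level in sorted(tallies):
        counts = tallies[level]
        total = sum(counts.values())
        # A Counter returns 0 for outcomes that never occurred at this level.
        rates[level] = {k: counts[k] / total for k in ("win", "tie", "loss")}
    return rates

# Made-up outcomes for illustration only (not results from the paper).
example = [
    (7, "win"), (7, "win"), (7, "win"), (7, "loss"),
    (3, "win"), (3, "tie"), (3, "tie"), (3, "tie"),
]
rates = outcome_rates(example)
```

In this made-up example the level-3 winning rate is low only because most games are tied, not because they are lost.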

4.3.3 Mixture model

Table 1 shows that the mixture model, which combines the combat network and the combat rule, achieves the best results on difficulty levels 1-7. This can be explained as follows. The agent is free to attack either the enemy on the screen or the enemy on the minimap. That is, it can not only choose an attack area within the camera view, but can also switch to a concentrated attack on a particular point. This freedom leads to a performance improvement that makes the mixture model the best of the three combat models.

4.4 Comparison of Settings

In this section we present three experiments on the importance of the hierarchy, the design of the reward, and the impact of hyper-parameters.

4.4.1 Hierarchy vs non-hierarchy

Our method uses a hierarchical architecture. A natural question is whether a non-hierarchical architecture can also handle the SC2 problem. The SC2LE paper reports the learning performance of a non-hierarchical reinforcement learning algorithm on the original action and state space, and its results show that the problem is very difficult to solve without hierarchy. Our architecture uses two levels of abstraction. One is a reduction of the action space, which is achieved by macro-actions. The other is a two-layer architecture in which a controller selects sub-policies. We tested the effect of keeping the macro-actions while removing the controller and sub-policies. We call this the single-policy architecture.
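The difference between the two architectures compared here can be sketched in a few lines. This is a toy illustration under stated assumptions: the class names, the sub-policy names (`base`, `battle`), the macro-action names and the uniformly random placeholder policies are all hypothetical stand-ins for the learned networks and the extracted macro-actions.

```python
import random

class SubPolicy:
    """Chooses one macro-action from its own (small) macro-action set."""
    def __init__(self, macro_actions, rng):
        self.macro_actions = list(macro_actions)
        self.rng = rng

    def act(self, observation):
        # Placeholder for a learned policy: pick uniformly at random.
        return self.rng.choice(self.macro_actions)

class Controller:
    """Chooses which sub-policy is allowed to act."""
    def __init__(self, sub_policies, rng):
        self.sub_policies = dict(sub_policies)
        self.rng = rng

    def select(self, observation):
        # Placeholder for a learned policy: pick a sub-policy at random.
        name = self.rng.choice(sorted(self.sub_policies))
        return name, self.sub_policies[name]

def hierarchical_step(controller, observation):
    """Two-layer decision: controller -> sub-policy -> macro-action."""
    name, sub_policy = controller.select(observation)
    return name, sub_policy.act(observation)

def single_policy_step(policy, observation):
    """Ablation: one policy chooses directly among all macro-actions."""
    return policy.act(observation)

rng = random.Random(0)
base_actions = ["build_worker", "build_pylon", "build_gateway"]   # hypothetical
battle_actions = ["train_zealot", "attack", "retreat"]            # hypothetical

controller = Controller(
    {"base": SubPolicy(base_actions, rng), "battle": SubPolicy(battle_actions, rng)},
    rng,
)
flat_policy = SubPolicy(base_actions + battle_actions, rng)

chosen_name, chosen_action = hierarchical_step(controller, observation=None)
flat_action = single_policy_step(flat_policy, observation=None)
```

Both variants act over the same macro-actions. The only difference is whether a controller first narrows the choice to one sub-policy.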

We were surprised to find that on difficulty level-2 the single-policy architecture learns almost as well as the hierarchical architecture, which shows that macro-actions alone are effective for training a good agent. However, on more difficult levels (such as difficulty level-7), both the initial performance and the final performance of the hierarchical architecture are significantly better than those of the single-policy architecture, as shown in Fig. 5 (a). In other words, when the difficulty level is low, the difference between the hierarchical and non-hierarchical architectures is less obvious. As the difficulty increases, the hierarchical model outperforms the non-hierarchical model.

4.4.2 Outcome reward vs handcrafted reward

At low difficulty levels, the Win/Loss reward achieves results as good as the handcrafted reward. However, when training at a high difficulty level, we found that the hierarchical model performs relatively poorly with the Win/Loss reward, as shown in Fig. 5 (b). The SC2LE paper mentions that the Blizzard score can be used as a reward. However, we found that the Blizzard score and the winning rate are not proportional. While the agent tries to improve the score, it may ignore chances to attack the enemy base and lose the best opportunity to win. A larger score may therefore come with a smaller winning rate, as shown in Fig. 5 (c). The final performance of the agent trained with the Blizzard score in SC2LE shows the same problem. This is why the Blizzard score is not an ideal reward function.
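The mismatch between score and outcome can be made concrete with a small sketch. It assumes that the outcome reward is +1, 0 or -1 given only at the final step, and that a score-based reward is the change in a cumulative score between consecutive steps. Both definitions and all numbers are illustrative assumptions, and the handcrafted reward is not reproduced here.

```python
def outcome_reward(result, done):
    """Sparse reward: non-zero only at the final step of a game."""
    if not done:
        return 0.0
    return {"win": 1.0, "tie": 0.0, "loss": -1.0}[result]

def score_delta_reward(prev_score, score):
    """Dense reward: change in the cumulative game score since the last step."""
    return float(score - prev_score)

# Made-up episode in which the score keeps rising but the game is lost.
scores = [0, 50, 120, 200]
num_steps = len(scores) - 1

score_rewards = [
    score_delta_reward(prev, cur) for prev, cur in zip(scores, scores[1:])
]
outcome_rewards = [
    outcome_reward("loss", done=(step == num_steps - 1))
    for step in range(num_steps)
]

score_return = sum(score_rewards)      # positive: the score went up
outcome_return = sum(outcome_rewards)  # negative: the game was lost
```

An agent maximizing the first return is rewarded in this episode even though it loses, which is the failure mode described above.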

4.4.3 Influence of hyper-parameters

In this section we explore the effect of hyper-parameters on training. We experimented with a variety of settings and found that the number of updates in each iteration has a great influence on the learning process of the agent. When the update number is small, learning is unstable and hard to converge. In that case, increasing the update number significantly mitigates the problem, as shown in Fig. 5 (d). The other hyper-parameters do not have much impact on training.
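The sketch below only shows where such a hyper-parameter sits in a training loop. It assumes that the update number means the number of gradient steps applied to the data collected in one iteration, and it replaces the reinforcement learning objective with a toy quadratic loss, so it does not reproduce the instability seen in Fig. 5 (d). The function name `train` and all values are hypothetical.

```python
def train(num_iterations, num_updates, lr=0.1, target=3.0):
    """Toy loop: collect data once per iteration, then apply num_updates steps.

    The parameter theta is fitted to `target` under the loss (theta - target)^2.
    Returns the value of theta after each iteration.
    """
    theta = 0.0
    history = []
    for _ in range(num_iterations):
        batch_target = target  # stand-in for a freshly collected batch
        for _ in range(num_updates):
            grad = 2.0 * (theta - batch_target)  # gradient of the toy loss
            theta -= lr * grad
        history.append(theta)
    return history

few_updates = train(num_iterations=5, num_updates=1)
many_updates = train(num_iterations=5, num_updates=10)
```

With the same number of iterations, the run with more updates per iteration ends closer to the target in this toy setting.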

4.5 Discussion

The strategies used by the built-in AIs are highly optimized by game developers and therefore pose significant challenges to learning algorithms. Our method abstracts and reduces the StarCraft II problem at multiple levels. For example, learning macro-actions greatly reduces the action space of StarCraft II, and the hierarchical approach mitigates the long time horizon problem. The learning algorithm is robust (because it learns its strategy rather than following fixed rules) and efficient (because of the abstraction of the problem). The combination of these factors makes this framework suitable for solving large-scale reinforcement learning problems such as StarCraft II.

5 Conclusion

In this paper, we investigate a hierarchical reinforcement learning approach for full-length games of StarCraft II. The architecture employs two levels of abstraction. After proper training, it achieves state-of-the-art results on the challenging SC2LE platform. Although our method achieves good results in SC2LE, it still has some shortcomings. For example, the 64x64 map we currently test on is small, and we use only the two basic combat units. In the future, we will explore learning on larger maps and try to use more unit types to organize tactics. We hope that this framework can shed some light on future research on reinforcement learning for real-world problems. We will also continue to explore reinforcement learning algorithms that can better solve large-scale reinforcement learning problems.