Efficient Reinforcement Learning with a Mind-Game for Full-Length StarCraft II

03/02/2019 · by Ruo-Ze Liu, et al. · Nanjing University

StarCraft II provides an extremely challenging platform for reinforcement learning due to its huge state space and game length. The previous fastest method requires days to train a full-length game policy on a single commercial machine. In this paper, we introduce the mind-game, an abstract task model, to facilitate reinforcement learning. With the mind-game, the policy is first trained quickly in the mind-game and is then mapped to the real game for a second phase of training. In our experiments, the trained agent achieves a 100% win-rate against the non-cheating built-in bot at the highest difficulty (level-7), and the training is about 100 times faster than previous approaches under the same computational resources. To test the generalization of the agent, a Golden-level StarCraft II Ladder human player competed with it. Under restricted strategies, the agent won 4 out of 5 games against the human player. The mind-game approach may shed light on further studies of efficient reinforcement learning. The code is publicly available (https://github.com/mindgameSC2/mind-SC2).


1 Introduction

In recent years, reinforcement learning (RL) [1, 2] has received increasing attention, particularly in learning to play games, e.g., [3, 4, 5]. The combination of deep reinforcement learning and Monte-Carlo tree search has conquered the game of Go [6, 7], once a holy grail of artificial intelligence for decades. After that, some researchers shifted their attention to a more challenging game, StarCraft II, which is a real-time strategy game. Applying reinforcement learning to StarCraft II faces more difficulties than the game of Go. For example, StarCraft II has a state space much larger than Go and a much longer game length; players need to control hundreds of units, i.e., the game has a very large action space; and it is a game with imperfect information, where players know almost nothing about the state of their opponents and therefore need to scout and then speculate about the opponents' potential behaviors.

Recently, reinforcement learning methods have been applied to learn to play StarCraft II [8, 9, 10, 11]. It is noticeable that these methods are all model-free reinforcement learning methods. Model-free methods commonly learn by trial-and-error directly in the environment, which requires a large number of samples, and thus much computing resource and training time. Meanwhile, another branch of reinforcement learning, i.e., model-based reinforcement learning [12, 13, 14], first obtains a model of the environment in some way and then learns the policy in this model instead of in the environment. Once a model is available, one can use planning methods to enhance exploration, or use samples generated by the model to reduce the number of samples drawn from the environment, thus accelerating training. Despite this appealing idea, learning an accurate environment model is extremely difficult and might require even more samples than model-free policy learning. Therefore, to the best of our knowledge, no model-based method has been applied successfully to an environment as large as StarCraft II yet.

In this paper, we investigate a model-based approach, named the mind-game, to assist reinforcement learning. Since it is quite easy for human players to describe the game process, we design the mind-game manually as an abstract version of StarCraft II. However, the mind-game is not even close to the real game: the transition dynamics, time steps, and game length are all very different. We cannot expect the policy trained in the mind-game to score well in the real game directly. Nevertheless, we do expect such a mind-game to help boost learning in the real game.

Figure 1: Mind-Game Architecture.

We propose a training process to utilize the mind-game, as illustrated in Figure 1. We first train an agent in the mind-game, using a proposed adaptive curriculum reinforcement learning (ACRL) scheme to train it efficiently. Thanks to this efficiency, the reward can be set directly to the win/lose outcome instead of the hand-designed rewards used in some previous studies. We then map the agent trained in the mind-game to the StarCraft II environment; the mapping is built on the macro actions learned in a previous study [8]. Fine-grained training in the real game environment then follows to obtain the final agent.

In the experiments, we show that fine-tuning the mind-game-trained agent directly at the highest non-cheating difficulty (level-7) yields a strong agent. As a comparison, in [8], direct training at level-7 was hard, so a curriculum from level-2 to level-7 was employed. Moreover, the total time required to train a strong agent is greatly reduced: the method in [8] takes about 95 hours to learn an agent with a 93% win-rate on difficulty level-7, while our method needs only about 1 hour (including the total time in both the mind-game and the real game) to achieve a 100% win-rate.

Our main contributions of this paper are summarized as follows:


  • We validate the idea of using a mind-game to boost reinforcement learning. To the best of our knowledge, this is the first time a mind-game has been used in an environment as complex as StarCraft II.

  • We show that the mind-game approach can be robust. In our approach, we don’t need to design a very accurate mind-game in order to achieve a significant boost.

  • Our experiments show that the mind-game can lead to a nearly 100-times acceleration under the same computational resources. We also find that the high efficiency removes the need for manual reward design and for a complicated policy-model architecture.

2 Background

In this section, we first introduce the background knowledge and then review existing research on StarCraft II.

2.1 Reinforcement Learning

Reinforcement learning solves sequential decision-making problems, which can generally be formulated as a Markov decision process (MDP). An MDP can be represented by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.

The agent interacts with the environment with the goal of obtaining the largest cumulative reward. The state space observed by the agent is $\mathcal{S}$ and the action space of the agent is $\mathcal{A}$. At each time step $t$, the agent observes the current state $s_t$ of the environment and selects an action $a_t$. According to the reward function $R$, the agent obtains a reward $r_t$, and the environment shifts to the next state $s_{t+1}$ according to the state transition function $P(s_{t+1} \mid s_t, a_t)$. The discount factor $\gamma \in [0, 1]$ expresses the ratio of current rewards to future rewards.

Here we consider the finite-horizon MDP setting: when a terminal condition is met or a fixed time limit is exceeded, the agent ends a round of interaction with the environment. This round is called an episode, and the sequence of states and actions experienced by the agent in an episode, $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$, is called a trajectory. The total reward obtained along the trajectory is called the return. In general, rewards are discounted over time by the discount factor $\gamma$, so the return can be expressed as $R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$.

Our optimization goal is to maximize the expected return over many episodes. The action $a_t$ selected by the agent affects the subsequent state $s_{t+1}$ and the corresponding reward $r_t$. Actions are selected according to the policy function $\pi_\theta(a_t \mid s_t)$, whose parameters are $\theta$, so the optimization goal can be expressed as:

$$\theta^{*} = \arg\max_{\theta} \ \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \tag{1}$$
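As a concrete illustration of the return and the objective above, the following minimal Python sketch (our own illustration, not from the paper) computes the discounted return of sampled episodes and a Monte-Carlo estimate of the objective:

```python
# Minimal sketch: discounted return R(tau) = sum_t gamma^t * r_t for one
# episode, and a Monte-Carlo estimate of J(theta) over several episodes.

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one episode."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

def estimate_objective(episodes_rewards, gamma=0.99):
    """Monte-Carlo estimate of J(theta): average return over sampled episodes."""
    returns = [discounted_return(rs, gamma) for rs in episodes_rewards]
    return sum(returns) / len(returns)

# Example with a win/lose-only reward (as used later in the paper):
# reward is 0 on every step except the final one.
print(estimate_objective([[0, 0, 0, 1], [0, 0, -1]]))
```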

2.2 StarCraft II

Most previous works deal with the StarCraft II environment in a divide-and-conquer way. For example, in [9], researchers used the framework of hierarchical reinforcement learning (HRL) to reduce the difficulty of learning each part and used manually designed macro actions to reduce the agent's action space; combining these two techniques, they achieved good results on specific maps and races. Another work from around the same time [8] used an HRL architecture combined with macro actions to reduce the difficulty of learning the whole problem. Different from the previous work, it extracts macro actions by data mining and tries several different combat models (including deep models with image input), so it is more automated. [10] used a modular architecture to handle different parts of the StarCraft II game, such as scripts for scouting and micro-operations, and several machine learning models for the rest. By combining these different modules, they also achieved good results on specific maps and races.

Recently, a new program named AlphaStar has shown excellent performance in StarCraft II. However, its training resources are very expensive and the training lasts up to two weeks. Worse, if migrated to different maps or races, the agent needs to be retrained from scratch.

3 Method

In this section, we introduce our proposed method. Our method implements an abstracted model called the mind-game, which draws on the way human players think about battles: building an abstract, simplified model of the game, planning and learning in the model, and mapping the results to the original game. The mind-game is a simplified version of the original environment that retains its core parts, and it can be mapped to the original environment through a mapping function.

3.1 Mind-game Model

Suppose the environment of the original problem is an MDP $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, which represents the problem we want to optimize, and we wish to obtain the optimal policy $\pi_\theta$ of this MDP. The game setting we consider is as follows: a game has a result $z \in \{1, 0, -1\}$, representing victory, tie, and defeat. The reward is $0$ on all time steps except the final one, where it equals the game result $z$. The optimization goal can be expressed as the following formula:

$$\theta^{*} = \arg\max_{\theta} \ \mathbb{E}_{\tau \sim \pi_\theta}\left[ z(\tau) \right] \tag{2}$$

At the beginning, with a randomly initialized $\pi_\theta$, the results of almost all games are $z = -1$, so it is difficult for $\pi_\theta$ to obtain useful update signals from these data.

We build a new MDP as follows:

$$M' = (\mathcal{S}', \mathcal{A}', P', R', \gamma') \tag{3}$$

We expect to get a policy $\pi'$ in $M'$ that can help learning. Suppose we have a mapping function $f: \mathcal{S} \to \mathcal{S}'$ for the states. If we have a good policy $\pi'$ in the mind-game, we can obtain a mind-game action $a' = \pi'(f(s))$. Suppose we also have an inverse mapping function $g: \mathcal{A}' \to \mathcal{A}$; for every $a' \in \mathcal{A}'$ there is $g(a') \in \mathcal{A}$, so we have an action in the original environment:

$$a = g\big(\pi'(f(s))\big) \tag{4}$$

The above equation can also be written as:

$$\pi(s) = g\big(\pi'(f(s))\big) \tag{5}$$

In the above formula there are four unknowns, $f$, $g$, $\pi'$, and the mind-game $M'$, which are difficult to solve for jointly. Conversely, if we know $f$ and $g$ and can obtain $\pi'$ in $M'$, then $\pi$ can be obtained immediately.

Suppose we let $f$ and $g$ be simple, hand-designed mappings (rather than learned ones), so that $f$ and $g$ are especially simplified. The true transition function $P$ is unknown and difficult to model directly; what we have to do is design an approximate transition $P'$. Finally, the environment we design is as follows:

$$M' = (\mathcal{S}', \mathcal{A}', P', R, \gamma) \tag{6}$$

If we can design such an environment, we can learn in it. After learning the policy $\pi'$, we can use the mapping functions to operate the agent in the original environment.
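To make the mapping in Eq. (5) concrete, here is a minimal Python sketch (our own illustration, not the paper's code; the state fields, macro-action names, and the environment interface are assumptions) of how a mind-game policy could drive the real environment through the state mapping $f$ and the action mapping $g$:

```python
# Sketch: driving the original environment with a mind-game policy,
# pi(s) = g(pi'(f(s))). All names below are illustrative assumptions.

def f(real_state):
    """Map a real-game observation to the abstract mind-game state."""
    return {
        "minerals": real_state["minerals"],
        "army_count": real_state["army_count"],
        "worker_count": real_state["worker_count"],
    }

def g(macro_action):
    """Map a mind-game macro action to a list of executable real-game actions.
    The macro-action table is assumed given (e.g., extracted as in [8])."""
    macro_to_real = {
        "build_worker": ["select_base", "train_worker"],
        "attack": ["select_army", "attack_enemy_base"],
    }
    return macro_to_real[macro_action]

def run_in_real_game(env, mind_policy):
    """Operate the real environment using the mind-game policy."""
    s = env.reset()
    done = False
    while not done:
        a_mind = mind_policy(f(s))            # pi'(f(s))
        s, _, done, _ = env.step(g(a_mind))   # execute g(a') in the real game
    return s
```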

Input: max episodes per update $E_1$, $E_2$; max update iterations $K_1$, $K_2$; max game steps per episode $T_1$, $T_2$
Parameter: win-rate threshold $w$; max difficulty $D$ in the mind-game; target task (difficulty) in the original environment; RL algorithm (PPO in this work)
Output: mind-game policy $\pi'$, original-game policy $\pi$

1: Randomly initialize the policy $\pi'$. Let the difficulty $d \leftarrow 1$.
2: for $k = 1$ to $K_1$ do
3:     Initialize the win-rate statistics for this iteration.
4:     while $d \le D$ do
5:         Reset the mind-game environment with difficulty $d$.
6:         Clear the sample buffer $B$ and set the win counter $n \leftarrow 0$.
7:         for $e = 1$ to $E_1$ do
8:             for $t = 1$ to $T_1$ do
9:                 Collect $(s_t, a_t, r_t)$ into $B$ by using $\pi'$.
10:             end for
11:             if the episode is won then
12:                 $n \leftarrow n + 1$.
13:             end if
14:         end for
15:         if $n / E_1 \ge w$ then
16:             $d \leftarrow d + 1$.
17:         end if
18:         Use $B$ to update $\pi'$ by the RL algorithm.
19:     end while
20: end for
21: Set the original-game environment to the target task.
22: Initialize the policy $\pi$ from $\pi'$ via the mapping in Section 3.4.
23: for $k = 1$ to $K_2$ do
24:     Clear the sample buffer $B$.
25:     for $e = 1$ to $E_2$ do
26:         for $t = 1$ to $T_2$ do
27:             Collect $(s_t, a_t, r_t)$ into $B$ by using $\pi$.
28:         end for
29:     end for
30:     Use $B$ to update $\pi$ by the RL algorithm.
31: end for
Algorithm 1 Proposed Algorithm

3.2 Model Design

To model $P'$, the approach we take is not to learn or build the model from scratch, but to use a model-mixing approach.

We can think of $P$ as being determined by two groups of parameters, $\theta_1$ and $\theta_2$. The parameters in $\theta_1$ are public and can be obtained directly, while $\theta_2$ comes from hidden parameters of the environment and is harder to get (the cost of obtaining it through learning is greater).

$\theta_1$ includes the units, buildings, and rules, which come from StarCraft II itself. $\theta_2$ includes the economy, battle, and time flow, for which we use approximate designs. The most important and most difficult part, the battle, is designed after a classic turn-based tactical game (Heroes of Might & Magic III [15]). This design choice rests on the fact that the combat rules of StarCraft II are themselves based on human experience of war.

The reason humans learn games very quickly is that they do not learn from scratch. Before touching a real-time strategy game, ordinary human players already have a general idea of what a war consists of and what is key to winning one [16]. The battle design of most games comes from real-world rules, so parts of the battle mechanics can actually be shared across games.

So the $P'$ we build is

$$P' = P(\theta_1, \theta_2') \tag{7}$$

where $\theta_2'$ denotes the approximate parameters borrowed from other games in place of $\theta_2$.

The $P'$ built in this way is similar to a model that mixes StarCraft II with other games, so $\theta_2'$ becomes the main source of error in our mixed model. Since this error exists only in some of the model's parameters, and these parameters are themselves partially similar to the true ones, the error can be understood as noise in the model. In this way, we build a model with some noise.
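As an informal illustration of this model-mixing idea (our own sketch, not the released code; the class, attribute, and parameter names are assumptions), the public unit statistics ($\theta_1$) can be copied from StarCraft II while the combat resolution ($\theta_2'$) follows simplified turn-based rules:

```python
# Sketch of the model-mixing idea: theta_1 is copied from StarCraft II unit
# data, while the hidden combat dynamics are replaced by simple turn-based
# rules in the spirit of classic tactical games. Illustration only.

from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    mineral_cost: int   # public (theta_1): same as StarCraft II
    hp: int             # public (theta_1): same as StarCraft II
    damage: int         # public (theta_1): same as StarCraft II
    attack_range: int   # public (theta_1): same as StarCraft II

def battle_step(attackers, defenders, range_bonus=0.1):
    """Approximate turn-based combat (theta_2'): damage is exchanged once per
    step, with a simple bonus factor for longer attack range."""
    def total_damage(units):
        return sum(u.damage * (1.0 + range_bonus * u.attack_range) for u in units)
    dmg_to_def = total_damage(attackers)
    dmg_to_att = total_damage(defenders)
    # Remove destroyed units front-to-back until the exchanged damage is absorbed.
    for units, dmg in ((defenders, dmg_to_def), (attackers, dmg_to_att)):
        while units and dmg >= units[0].hp:
            dmg -= units.pop(0).hp
    return attackers, defenders
```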

So, do these noisy parameters constrain model performance? Through experiments, we find the answer is no, as will be shown in the experiment section.

In the mind-game, the agents are divided into two sides, so we can do either self-play or ordinary reinforcement learning. In this work, we only report results for reinforcement learning; self-play is left for future work. Since we control the opponent agent, we can control the difficulty, so the training can take the form of curriculum learning.

3.3 Learning in the Model

After obtaining the model $M'$, we can train a policy $\pi'$ on it. The reinforcement learning algorithm we use is PPO [17]. It is worth mentioning that the reinforcement learning algorithm can be arbitrary; we choose PPO because it is simple and easy to compare against. The loss function is defined as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right] \tag{8}$$

where the definitions of the probability ratio $r_t(\theta)$ and the advantage estimate $\hat{A}_t$ can be found in [17].
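For concreteness, a minimal sketch of this clipped surrogate loss (our own illustration in NumPy; any deep-learning framework could be substituted, and the function name is an assumption) looks like:

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped PPO surrogate of Eq. (8), returned as a loss to minimize.
    new_logp, old_logp: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t).
    advantages: advantage estimates A_hat_t (e.g., from GAE)."""
    ratio = np.exp(new_logp - old_logp)              # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()                          # negate: we maximize the objective
```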

In [8], curriculum learning is also used, but the curriculum uses the difficulty levels of the original game environment. The problem is that the difficulty of the original game environment is non-linear, and there may be a large gap between two adjacent difficulty levels, which makes the curriculum less smooth. In our method, we control all aspects of the model, so the difficulty of the enemy can be adjusted linearly, which stabilizes learning.

Here, we introduce the ACRL algorithm to learn in the model. The idea of ACRL is simple: observe the agent's win-rate during learning, and whenever it exceeds a certain threshold, automatically increase the difficulty.
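A minimal Python sketch of this adaptive rule (our own illustration; the environment and policy interfaces, default values, and function names are assumptions, not the released code):

```python
# Sketch of adaptive curriculum RL (ACRL): raise the mind-game difficulty
# whenever the recent win-rate exceeds a threshold. Illustration only.

def play_episode(env, policy, difficulty):
    """Play one mind-game episode at the given difficulty; return (samples, won)."""
    state = env.reset(difficulty=difficulty)
    samples, done, won = [], False, False
    while not done:
        action = policy(state)
        next_state, reward, done, won = env.step(action)
        samples.append((state, action, reward))
        state = next_state
    return samples, won

def acrl_train(env, policy, update_policy, threshold=0.8,
               max_difficulty=10, episodes_per_update=100):
    difficulty = 1
    while difficulty <= max_difficulty:
        wins, batch = 0, []
        for _ in range(episodes_per_update):
            samples, won = play_episode(env, policy, difficulty)
            batch.extend(samples)
            wins += int(won)
        update_policy(policy, batch)                 # e.g., one PPO update
        if wins / episodes_per_update >= threshold:  # curriculum step
            difficulty += 1
    return policy
```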

3.4 Mapping to Source Domain

After getting the policy $\pi'$, we can map it back to the original game. The mapping is not trivial, because the actions in the original environment are atomic operations while the actions defined in the mind-game are macro actions. To use the same action space as the mind-game in the original game environment, we need macro actions; thanks to previous work, we can directly use the macro actions obtained by [8].

There are two options for training. One is to train a new policy $\pi$ in the original environment and use $\pi'$ as its teacher. The other is to keep operating the agent with $\pi'$ and collect the data it generates in the original environment to continue updating it. For the second option, training can be thought of as being performed under the environment:

$$M'' = (\mathcal{S}', \mathcal{A}', P, R, \gamma) \tag{9}$$

$M''$ is neither equivalent to the original environment nor to the mind-game environment. Interestingly, we found that the $\pi'$ trained in the mind-game initially performs reasonably in $M''$, and after a period of training its performance rises quickly.

Our complete method is as follows. First, we use the ACRL algorithm to train an agent on the mind-SC2 model from scratch. After that, we map this policy back to the original environment and use the mind-game policy as the initial value of the source policy in the original SC2 environment for continued training. The PPO hyper-parameters used in the mind-game environment and in the SC2 environment are the same. The complete procedure is given in Algorithm 1.

Next, we analyse the time cost of our method.

3.5 Time Analysis

The training time consists of several parts: (1) the time spent sampling in the environment, which depends on the simulation speed of the environment; (2) the sample size required by the training algorithm: due to the nature of model-free reinforcement learning, a large number of samples is needed to learn a good policy, and when curriculum learning is used, the total sample size is the sum of the sample sizes over all tasks; (3) the training time of the gradient descent algorithm, $T_g$. Let $t_s$ be the average time to obtain one sample and $N$ the total number of samples; the total time overhead is then $T = t_s N + T_g$. Since sampling time generally dominates gradient descent time, $t_s N \gg T_g$, we have $T \approx t_s N$.

Suppose the training process divides the task into $n$ steps of curriculum learning, where step $n$ is the target task and the previous $n-1$ steps are the pre-training process. Then the training time can be written as follows:

$$T \approx t_s \sum_{i=1}^{n} N_i \tag{10}$$

In our approach, we move the first $n-1$ curriculum steps into the mind-game. The simulation time per sample in the mind-game, $t_s'$, is much smaller than $t_s$ in the original environment; write $t_s' = t_s / \alpha$ with $\alpha \gg 1$. Assume the samples required by the last step account for a fraction $\beta$ of the total $\sum_{i=1}^{n} N_i$. Then the time required by our algorithm can be denoted:

$$T' \approx t_s' \sum_{i=1}^{n-1} N_i + t_s N_n \approx T \left( \frac{1-\beta}{\alpha} + \beta \right) \tag{11}$$

This formula shows that the speedup $T / T'$ of the new algorithm is roughly $\min(\alpha, 1/\beta)$, i.e., it is limited by the smaller of the simulation acceleration ratio and the inverse of the last-task sample ratio. To speed up training, the simulation needs to be as fast as possible and as much of the curriculum as possible should be moved into the simulation; using both at the same time can greatly improve training speed.
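As a quick numerical illustration of Eq. (11) (our own example; the values of $\alpha$ and $\beta$ below are assumptions, not figures reported in the paper):

```python
# Illustration of the speedup in Eq. (11) with assumed values:
# alpha = how much faster the mind-game simulates than the real game,
# beta  = fraction of all samples needed for the final (real-game) task.

def speedup(alpha, beta):
    """T / T' from Eq. (11): relative cost is (1 - beta)/alpha + beta."""
    relative_cost = (1.0 - beta) / alpha + beta
    return 1.0 / relative_cost

print(speedup(alpha=1000.0, beta=0.01))  # ~91x: limited mostly by beta
print(speedup(alpha=10.0,   beta=0.001)) # ~10x: limited mostly by alpha
```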

4 Experiments

In this section, we present the experimental results of our method.

4.1 Training Settings

Our training is conducted on a common server with 2 CPUs and 4 GPUs. We use a multi-process setting to accelerate training, the same as in [8].

The multi-process parameters (the number of processes, the number of threads in each process, and the max-iteration) are set to fixed values. Note that the update-num differs between the mind-game and the original environment, because the simulation speed of the mind-game is much faster than that of the original environment.

The PPO-related parameters are as follows: the entropy coefficient is set to encourage exploration, and the batch size, the number of epochs, and the initial learning rate are fixed and shared between the mind-game and the SC2 environments.

4.2 Mind-SC2

We design and implement a model, called 'mind-SC2', that can be considered a mind-game. The model contains all the units and buildings of the three races; every unit and building is implemented as a class. Unit and building attributes include cost, health, armor, and build time, all of which are the same as in the original StarCraft II. At the same time, the model approximates other basic features of StarCraft II, such as collecting resources, constructing buildings, and producing units. It is worth noting that we adopt a turn-based system similar to Go in the mind-game; therefore, the time steps of the mind-game do not exactly correspond to the time steps in StarCraft II, and the two differ to a large extent.

In terms of the state transition function, each action brings about a specific state change; take the action of producing a worker as an example. First, the algorithm determines whether the action satisfies its conditions, such as whether enough resources are available, whether the required technology exists, and whether there is enough supply. If the conditions are met, the supply count is increased and the consumed resources are subtracted; the worker then enters the production queue of the building, and when production finishes the number of workers increases by one. The code of the mind-game is open source.
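A simplified sketch of such a transition (our own illustration based on the description above; the field names and the build-time constant are assumptions, not the released code):

```python
# Sketch of one mind-game state transition: the "produce worker" action.
# Field names and the build time are illustrative assumptions.

WORKER_MINERAL_COST = 50
WORKER_SUPPLY_COST = 1
WORKER_BUILD_TIME = 12  # mind-game steps (assumed value)

def produce_worker(state):
    """Apply the 'produce worker' action to the mind-game state, in place."""
    # 1. Check preconditions: resources, a base (tech), and supply.
    if (state["minerals"] < WORKER_MINERAL_COST
            or state["num_bases"] == 0
            or state["supply_used"] + WORKER_SUPPLY_COST > state["supply_cap"]):
        return state  # action has no effect if the conditions are not met
    # 2. Pay the cost and reserve supply.
    state["minerals"] -= WORKER_MINERAL_COST
    state["supply_used"] += WORKER_SUPPLY_COST
    # 3. Put the worker into the base's production queue.
    state["production_queue"].append(("worker", WORKER_BUILD_TIME))
    return state

def tick_production(state):
    """Advance production by one step; finished workers join the worker count."""
    remaining = []
    for unit, time_left in state["production_queue"]:
        if time_left <= 1:
            if unit == "worker":
                state["worker_count"] += 1
        else:
            remaining.append((unit, time_left - 1))
    state["production_queue"] = remaining
    return state
```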

In the model, agents are divided into two parties, the enemy and our side. An enemy can be either an agent that uses a policy or an agent that is implemented using a script. We set the enemy as a script agent. We adjust the speed at which the enemy increases the strength to control the difficulty of the mind-game. There are also several levels of difficulty in mind-game, from easy to hard.

Figure 2: (a) The process of training the agent in the mind-game with ACRL. (b) Transfer learning on Simple64.

4.3 Process of Training

We first train an effective agent through the ACRL algorithm on the mind-game; the training process is shown in Fig. 2 (a). First, we train a Protoss agent using the ACRL algorithm, from the easiest level to the hardest level, on the mind-SC2 model. After that, we map this policy back to the original environment and train a Protoss agent against a Terran bot in the SC2 environment. We compare the training process of ACRL with that of plain PPO; the win-rate threshold of ACRL is set to a fixed value.

Initially, the ACRL agent is trained at level-1, and due to the easy setting its performance gradually increases. When the win-rate breaks through the threshold, the agent transfers to the next difficulty level. Following this procedure, the agent finally reaches a high win-rate against the most difficult level in the mind-game. In contrast, the plain PPO agent trains directly on the most difficult level and improves very slowly.

Then we transfer the agent to the original SC2 game environment and train it on difficulty level-7. As shown in Fig. 2 (b), the win-rate starts at a much lower value than in the mind-game. This initial win-rate shows that although our model is effective, it is not as accurate as the original game; there is a big gap between the win-rate in the mind-game and the win-rate in SC2, which means they are quite different. However, the mind-game model still facilitates training in the original environment, and the win-rate rises quickly.

4.4 Comparison to other Methods

Method A Race O Race Map 1 2 3 4 5 6 7
[8] Protoss Terran S64 100% 100% 99% 97% 100% 90% 93%
Ours Protoss Terran S64 100% 100% 100% 100% 100% 100% 100%
Table 1: Evaluation results of our method compared to [8]. Note that we only train our agent at level-7 and test it at the other six levels. Bold marks results better than or equal to the other method. S64 = Simple64.

We compare our method with the published state-of-the-art result of [8]. The setting in [8] is as follows: they trained a Protoss agent against a built-in Terran bot and report the win-rate on the map Simple64 from level-1 to level-9. The setting in [10] is as follows: they trained a Zerg agent against a built-in Zerg bot and report the win-rate on the map AbyssalReef from level-1 to level-7.

To compare with [8], we use the Protoss agent trained by our method and test it from level-1 to level-7. We do not compare the results at level-8 and level-9, because the bots at those levels cheat, which makes the results somewhat random.

As can be seen in Fig. 2 (b), the training of our method is better than that of [8]. It is worth noting that the result of [8] is pre-trained on difficulty level-2 and difficulty level-5. Evaluation results are given in Table 1. Our agent matches theirs at low levels but is better at high levels; for example, [8] achieved 93% at level-7 while our agent achieves 100%, an increase of 7 percentage points.

Method Training time (in hours)
[8] $t_1 + t_2 + t_3 \approx 95$ in total
Ours $t'_1 + t'_2 + t'_3 \approx 1$ in total
Table 2: Comparison of training time. For [8], $t_1$, $t_2$, and $t_3$ refer to the training time at difficulty level-2, level-5, and level-7, respectively. For ours, $t'_1$ refers to the training time at difficulty level-1 of the mind-game, $t'_2$ to difficulty level-2 to level-7 of the mind-game, and $t'_3$ to difficulty level-7 in the original environment.

To demonstrate the efficiency of our method, we compared its training time with that of the previous method. As can be seen from Table 2, when training to the same win-rate on difficulty level-7, the training time of our method is roughly 1/100 of that of the previous method, and our agent takes about 1 hour to achieve a 100% win-rate on difficulty level-7.

4.5 Impact of Parameters

Figure 3: (a) The effects of changing the economy parameters (mineral income per step) in the mind-game. (b) The effects of changing the bonus damage parameters (a factor based on the unit's attack range) in the mind-game.

Our method surpasses the previous method in most comparisons, so the noise in $\theta_2'$ does not limit the final learning outcome. In addition, we empirically analyzed the effects of the parameters in $P'$ and found that even when the parameters were changed by a substantial percentage, the training results in the mind-game and in the original game remained similar, indicating that our method is robust to this noise.

We modified several parameters in the mind-game that affect combat and the economy, and then retrained. The first parameter controls the economy: it determines how many minerals a worker gathers in one step. The second parameter controls battles: it determines the effect of the attack-range attribute on damage.
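A hedged sketch of such a perturbation experiment (our own illustration; the parameter names, base values, and training function are assumptions):

```python
# Illustration of the robustness test: perturb the mind-game parameters that
# control economy and combat, retrain, and compare final win-rates.

import random

BASE_PARAMS = {
    "minerals_per_worker_step": 5.0,   # economy parameter (assumed value)
    "range_damage_bonus": 0.1,         # combat parameter (assumed value)
}

def perturbed_params(scale=0.2, seed=None):
    """Return a copy of the parameters, each scaled by a random factor
    in [1 - scale, 1 + scale]."""
    rng = random.Random(seed)
    return {k: v * (1.0 + rng.uniform(-scale, scale))
            for k, v in BASE_PARAMS.items()}

# Example usage (train_in_mind_game is a placeholder for the training routine):
# results = {seed: train_in_mind_game(perturbed_params(seed=seed)) for seed in range(5)}
```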

Figure 4: (a) The effects of changing the economy parameters in the source game. (b) The effects of changing the combat (bonus damage) parameters in the source game.

As can be seen from Fig. 3 (a) and Fig. 3 (b), the impact of these parameters is small: we can still effectively train an agent in the mind-game. We then transfer these agents to the source game. Fig. 4 (a) and Fig. 4 (b) show that the impact of these parameters remains small in the source domain, and we can still train agents that achieve similar results on difficulty level-7 of SC2.

Thinking from another perspective, these perturbations may be regarded as a type of noise added to the data generated by the mind-game model. Similar to data augmentation [18] in deep learning, training on these noisy data yields a robust policy.

4.6 Test of other Races

Figure 5: (a) The training process of the Zerg agent and the Terran agent in the mind-game. (b) The training process of the Zerg agent and the Terran agent in the source domain.

All previous methods were tested on only one race. Since our method can efficiently train an effective agent, we also tested training the two other races, namely Zerg and Terran. The training process can be seen in Fig. 5.

The economic and combat settings of Zerg are similar to those of the Protoss above. The difference is that when Zerg builds a unit, it needs an extra resource, the Larva. In the mind-game, the number of Larvae increases by one every two steps, with an additional Larva every two steps if a Queen exists. In the source game, the number of Larvae is controlled by the game itself: the Hatchery generates a Larva at a fixed interval, and a Queen with the Spawn Larva spell can inject Larva eggs into a Hatchery, which then generates additional Larvae after a delay. We set this spell to be cast automatically before each policy step ends. In addition, Zerg Drones disappear after building structures, which is also reflected in the mind-game.
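A tiny sketch of the mind-game Larva rule described above (our own illustration with assumed field names, not the released code):

```python
# Mind-game Larva rule: +1 Larva every two steps, and one more Larva every
# two steps if at least one Queen is alive. Illustration only.

def update_larva(state, step):
    """Apply the Larva-generation rule at the given mind-game step."""
    if step % 2 == 0:
        state["larva"] += 1
        if state["queen_count"] > 0:
            state["larva"] += 1   # approximated Queen injection
    return state
```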

Test A Race O Race Map 1 2 3 4 5 6 7
Race test 1 Zerg Zerg S64 100% 100% 98% 97% 99% 96% 93%
Race test 2 Terran Terran S64 100% 100% 97% 97% 95% 96% 95%
Race test 3 Terran Terran AR 100% 98% 85% 74% 62% 77% 75%
Map test 1 Protoss Terran AR 100% 100% 100% 98% 97% 96% 99%
Map test 2 Protoss Terran F64 100% 100% 99% 99% 98% 98% 97%
Map test 3 Protoss Terran S96 100% 100% 97% 97% 99% 75% 73%
Table 3: Race and map tests of our method. A = Agent, O = Opponent, AR = AbyssalReef, S64 = Simple64, F64 = Flat64, S96 = Simple96.

As Table 3 shows, the results are still strong. It is worth noting that no previous method trained a Terran agent; we claim that ours is the first Terran agent to beat the most difficult non-cheating built-in bot in StarCraft II, which is also due to the effectiveness and efficiency of our algorithm.

4.7 Test on other Maps

Figure 6: (a) The training process of the Protoss agent when migrated to AbyssalReef. (b) The training process of the Protoss agent when migrated to other maps.

To test the scalability of our approach, we also trained our agent on other maps. As can be seen in Fig. 6 (a), when migrated to a new map, the agent trained with our method still learns fast and eventually reaches nearly a 100% win-rate. Note that the initial win-rate of the policy transferred from the mind-game is much lower than that of the policy transferred from the agent previously trained on Simple64, showing that our mind-game still has a considerable gap from the real game. However, the final win-rates of the two are the same, verifying that the mind-game model facilitates training.

The results of training on two other maps, namely Flat64 and Simple96, are shown in Fig. 6 (b). All training starts from a policy pre-trained in the mind-game. It is worth noting that learning is difficult on Simple96 because of its large size and complicated terrain; however, the win-rate of our agent still rises steadily.

4.8 Comparison on other Race and Map

Method A Race O Race Map 1 2 3 4 5 6 7
[10] Zerg Zerg AR 100% 100% 99% 95% 94% 50% 60%
Ours Zerg Zerg AR 98% 94% 97% 96% 95% 79% 69%
Table 4: Comparison of our method to [10]. Bold marks results better than or equal to the other method. A = Agent, O = Opponent, AR = AbyssalReef.

Since we have implemented training on different races and maps (using the same training architecture and hyper-parameters throughout), we can compare with the method of [10]. As can be seen in Table 4, the average performance of our method from difficulty 4 to difficulty 7 exceeds theirs, further demonstrating the effectiveness of our method.

Because [10] did not report their exact training time, a direct comparison of training time is not possible. However, their training took a few days, while ours took a few hours, indicating that our training is much faster than theirs.

4.9 Play against Human

Player A Race H Race Map Result (agent:human)
SC1 player Protoss Terran S64 5:0
SC2 novice Protoss Terran S64 5:0
SC2 golden Protoss Terran S64 4:1
Table 5: Play against humans. To be fair, no player was allowed to use blocking tactics, because the agent never saw such tactics during training. A = Agent, H = Human, S64 = Simple64.

To test the performance of our method against humans, we tested our agent against three human players: a StarCraft I player, a StarCraft II novice, and a StarCraft II Golden-level player. To be fair, no player was allowed to use blocking tactics, because the agent never saw such tactics during training.

The results are in Table 5. Our agent won the match against the Golden-level player 4:1. It is worth noting that the lost game was due to the human player continuing to learn during the series and finding a weakness of our agent, while our agent did no learning at all. At the same time, human players can also use a series of micro-operations; our agent's APM (actions per minute) is far lower than the player's, which indicates the future growth potential of our agent.

4.10 Effects of Reward

Figure 7: Effect of the reward design.

In the mind-game, we found that if the difference between two successive states (for example, in population) is used as the reward, the agent improves quickly but is unstable; in the end, the agent's win-rate curve can even collapse. Conversely, if only the game result is used as the reward, the learning curve is much smoother. Therefore, in all subsequent experiments we used only the final result as the reward. This phenomenon is shown in Fig. 7 (a).
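A minimal sketch contrasting the two reward schemes (our own illustration, with assumed field names):

```python
# Two reward schemes compared in Fig. 7 (illustration only):
# (a) shaped reward = difference of a state statistic (e.g., population),
# (b) sparse reward = game result only, given at the final step.

def shaped_reward(prev_state, state):
    """Difference-based reward: fast but unstable in our experiments."""
    return state["population"] - prev_state["population"]

def outcome_reward(done, result):
    """Win/lose reward: 0 until the final step, then the game result
    (+1 win, 0 tie, -1 loss). Gives smoother learning curves."""
    return result if done else 0.0
```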

5 Conclusion

This paper proposes a simple and efficient method that obtains better results than the previous method with far less training time. Furthermore, the agent wins 4 out of 5 games against a Golden-level StarCraft II player. In the future, we will explore combining model-based reinforcement learning with other reinforcement learning techniques to train agents that are more diverse, more powerful, and more efficient.

References