Adversary agent reinforcement learning for pursuit-evasion

by X. Huang, et al.
Peking University

A reinforcement learning environment with adversary agents is proposed in this work for pursuit-evasion games in the presence of fog of war, a problem of both scientific significance and practical importance in aerospace applications. One of the most popular learning environments, StarCraft, is adopted here, and the associated mini-games are analyzed to identify the current limitations for training adversary agents. The key contributions include the analysis of the potential performance of an agent by incorporating control and differential game theory into the specific reinforcement learning environment, and the development of a StarCraft adversary-agent challenge (SAAC) environment by extending the current StarCraft mini-games. The subsequent study showcases the use of this learning environment and the effectiveness of an adversary agent for evasion units. Overall, the proposed SAAC environment should benefit pursuit-evasion studies that use rapidly emerging reinforcement learning technologies. The corresponding tutorial code can be found on GitHub.





,    =    discretization of game domain
   =    evading trajectory
   =    unit health
   =    the side length of game domain
   =    linear quadratic cost
   =    searching steps
   =    discretized blocks of game domain
   =    the number of evaders/pursuers
   =    the number of the captured evaders

   =    the probability of capture

   =    the discovery or attacking radius
   =    game domain
   =    the longest length inside game domain
   =    searching trajectory
   =    reward
   =    defeated time of all evaders
   =    game duration
   =    the moving speed of units
   =    the expected capture time
,    =    Cartesian axis
   =    the area of game domain
   =    the optimal solution of ()
   =    variables related to evaders/pursuers

I Introduction

A reinforcement learning environment is developed in this paper for the pursuit–evasion game, a classical but challenging problem with important aerospace applications, such as simultaneous and cooperative interceptions [1, 2], exoatmospheric interception [3, 4, 5, 6], and search-and-rescue operations [7]. The problem has been studied extensively under the analytical frameworks of differential game theory [8, 9] and optimal control theory [4, 10]. Recently, the merging of game theory, control theory and deep learning has become a popular topic [11]. Wang et al. have proposed a distributed cooperative pursuit strategy based on reinforcement learning and performed tests in the OpenAI Predator–Prey environment [12]. Multi-agent reinforcement learning has been further considered for pursuit–evasion with multiple unmanned aerial vehicles [13]. Li et al. have proposed an estimation algorithm for the optimal pursuing strategy based on Thompson sampling and conducted tests in the Atari Pac-Man environment. Most of those pioneering works have essentially focused on artificial intelligence (AI) strategies for only one player (usually the pursuer), while the other player is either immobile or cannot be directly controlled by another AI agent, which reduces the possible level of conflict in the pursuit–evasion game. To address this issue, the current work develops a reinforcement learning environment for pursuit–evasion in which the pursuers and the evaders can each be directly controlled by a separate agent, by extending the well-known StarCraft II game environment.

StarCraft II is one of the most popular real-time strategy games currently played worldwide. The game requires human players to make very rapid decisions on strategic, tactical, and economic levels. To study AI's capability, DeepMind has recently developed a Python interface library, PySC2 [15], which exposes StarCraft II's low-level application programming interface to a reinforcement learning environment. The combination of PySC2 and the StarCraft II learning environment has enabled deep learning studies of competition and coordination among multiple agents [16] in complex environments with representative terrains (cliff, water, forest, etc.) and partial observation (the so-called fog of war) [Footnote 1: In the game, fog of war means that a player cannot observe a region of the map when that region is not close to the player's units, buildings, or scouting abilities. It is represented by a dark region on both the radar map and the main screen.]. An agent from DeepMind, AlphaStar [17], has achieved grandmaster-level performance by beating top professional players [18]. Nevertheless, it is worthwhile to mention that the training of AlphaStar requires thousands of GPU processors, and the cost was estimated to be more than 10 million USD [19], which is prohibitive for small research groups.

To reduce the complexity of full StarCraft games, DeepMind has further provided seven single-player, fixed-length mini-games to explore deep learning capabilities on various specific tasks. Among them, the so-called FindAndDefeatZerglings mini-game is very similar to one type of classical pursuit–evasion game in a dark room (also known as the princess and monster game [8]). However, the author argues that the following unsolved issues prevent the existing StarCraft II mini-games from being a viable deep learning environment for pursuit–evasion.

  • First and foremost, the reason why a certain performance can be achieved by an AI agent is still unknown. When the opponent evasion units are controlled by the low-level built-in StarCraft II code, the best mean score of the reward [Footnote 2: The reward is defined in the mini-game as the number of evasion units that have been found and defeated within the fixed game duration.] is 46 (from Fig. 6 of the reference [20], although DeepMind later reported this value as 61 in Table 1 of another article [21]). The same problem has been repeated by other groups, and the best mean score achieved ranges from 16 (with an asynchronous advantage actor-critic (A3C) agent [22]), through 22.1 (with an advantage actor-critic (A2C) agent [23]), to 45 (with an A2C agent [24]). The best score achieved so far is 62, through a relational agent from DeepMind [21]. An intriguing open question is: what is the best performance that could be achieved by a well-designed agent, and why?

  • Second, all those mini-game maps from DeepMind only support a single agent. The well-known StarCraft multi-agent challenge (SMAC) toolkit in the reference [16] has provided a multi-agent reinforcement learning environment. Nevertheless, the SMAC environment only supports controlling the pursuer units with independent agents, whereas the opponent evasion units are still controlled by the built-in StarCraft II scripts and cannot be controlled by any external agent. Is it possible to develop a mini-game based StarCraft learning environment where adversary players can be controlled by opponent AI agents?

Figure 1: Screenshot of the FindAndDefeatZerglings mini-game from DeepMind, where 3 pursuers (marines, denoted by cyan circles) are searching and firing at 25 zerglings, while only a couple of them (denoted by red circles) are visible to the marines when fog of war (i.e., the dark area) is activated. As shown below, some of these army units are replaced with aerospace units in this work to imitate pursuit–evasion game.

The above two issues directly motivate the current work. The first issue is addressed here by merging control theory and differential game theory. Isaacs has pioneered this direction by proposing theoretical strategies under a dynamic game framework [25]. Some more recent developments, especially from the numerical direction, can be found in the monograph [26]. Moreover, Gal has provided a theoretical solution of princess and monster game in a generic geometrical domain [8]. The differences (and similarities) between Gal’s problem and the FindAndDefeatZerglings mini-game (see Fig. 1) are highlighted in this paper, which further enables the theoretical developments that explain the achievable game score under different representative scenarios.

The second issue is resolved by first identifying the pending programming issues inside the PySC2 interface [Footnote 3: The most recent version, 3.0, was used during the preparation of this article.] when two adversary agents are implemented. Next, the necessary rectifications of the corresponding code are made to fix the identified issues. The associated programming tricks and modifications are given to enable interested readers to utilize the learning environment and set up their own differential game problems in the future. Moreover, the source code and the extended mini-game maps that support two adversary agents are developed in this work and made available to the public on GitHub. Overall, the contribution of this work is twofold: (1) to enhance the understanding of reinforcement learning capability for the pursuit–evasion game by merging it with control and game theories, and (2) to propose an adversary-agent reinforcement learning environment for the pursuit–evasion game with progressively complicated set-ups of practical importance.

The remaining part of this paper is organized as follows. Section II will introduce the fundamentals of StarCraft II learning environment, with the focus on pursuit–evasion type game. A couple of pending unsolved issues will be highlighted therein. Then, a theoretical study will be given in Sec. III to explain the possible game performance that can be achieved for the current pursuit–evasion game set-up. Next, Sec. IV will introduce the proposed adversary-agent learning environment and discuss the corresponding results, especially from the perspective of game theory. Finally, Sec. V will summarize the present work. More background information regarding the units and code structures can be found in the appendix.

II StarCraft II Learning Environment

StarCraft II is a popular and challenging real-time strategy game developed by Blizzard Entertainment. The basics of StarCraft II are introduced here for the completeness of this paper. For more information, interested readers can download and play this game for free. Moreover, some important programming tricks are summarized in this section (and the appendix) to support better use of this learning environment.

The full game has a science fiction set-up with three different races: Terran (which imitates a human army), Zerg (which mimics a worm army) and Protoss (which mimics a highly intelligent alien army). Each race consists of a number of distinctive units with unique strengths and weaknesses. Some of those used in this work are summarized in the appendix. The game starts by choosing a race with a number of units and resources, followed by macromanagement to develop the economy and build up units (many of which are aerospace units) and split-second decisions on the tactical level to beat opponents, which could be computer bots, human players or intelligent agents. Due to its complexity and extremely large action space, StarCraft II has been regarded as a new challenge for reinforcement learning [20] after the game of Go.

Recently, DeepMind has developed a Python interface library, PySC2 [15], which enables users to obtain spatial observations (in the form of features, see Fig. 2) and to learn to conduct humanlike actions. In addition, several game scores/rewards can be accessed to examine how well an AI agent is working. Users should design an appropriate score for their own learning tasks to differentiate the performance of agents.

Figure 2: The feature map (left) and a number of feature layers (on the right) of height, fog of war, camera locations, alien and opponent units, etc., can be obtained instantaneously from PySC2 when the StarCraft II game is played by an agent.

As an example, Fig. 1 shows an instantaneous screenshot of the FindAndDefeatZerglings mini-game, where three marine units (from Terran) should be trained to explore the two-dimensional (2D) domain [Footnote 5: The domain can be easily extended to three dimensions by including different terrain elements in the StarCraft map editor.], with fog of war activated, to find and defeat 25 individual zerglings (from Zerg) that have been randomly deployed throughout the map. More information on each unit, regarding health, detecting range, attacking range, etc., can be found in the appendix.

Figure 3: In the mini-game of FindAndDefeatZerglings: (a) the three marines will search and attack the zergling (denoted by the red spot in each panel) at the verge of the fog of war (the dark area); (b) the zergling will run towards marines and try to push back, instead of escaping into the fog of war nearby; and (c) the counterattack will be easily defeated by the concentrated fire from the three marines.

It can be seen immediately [Footnote 6: Install StarCraft and PySC2, then type the terminal command: python -m pysc2.bin.agent --map FindAndDefeatZerglings.] that the FindAndDefeatZerglings mini-game is quite similar to a pursuit–evasion game, with the loss of the evaders being the number of zerglings defeated until the game finishes at the fixed final time. Nevertheless, compared to the game set-ups in most former theoretical works, some distinctive differences of the current game set-up and the associated effects on pursuit–evasion can be identified as follows.

  1. The mini-game contains fog of war, which increases the game complexity to such a level that the corresponding learning speed is much slower than for any other mini-game with fog of war deactivated.

  2. A pursuit–evasion game is a two-person zero-sum game, mostly consisting of only 1 pursuer and 1 evader. More complicated set-ups have been considered in the recent work [2], where several (ground) evaders are protected by many defending (airborne) pursuers, which calls for simultaneous attacking strategies. Similarly, the FindAndDefeatZerglings mini-game consists of 3 pursuers and 25 evaders.

  3. In the FindAndDefeatZerglings mini-game, instead of running away, the evaders (i.e., zerglings) are directed by the built-in StarCraft code to run towards and attack the pursuers (i.e., marines) when both are within the sight range (see Fig. 3). Hence, the mini-game is not a typical pursuit–evasion game. However, as shown in the appendix, the attacking capability of a couple of evaders is much weaker than that of the three pursuers. Hence, the attacking (or self-defense) action of the evaders actually simplifies the searching/exploration tasks of the pursuers. As shown below, such an action is deliberately disabled and only the evasion action is allowed in the proposed new adversary learning environment.

  4. The mini-game from DeepMind only supports one agent, which controls the three marine units. The built-in low-level code controls the evaders, which either remain still when no opponent is within the sight range or otherwise rush towards and attack the pursuers. Hence, the current FindAndDefeatZerglings mini-game imitates a pursuit–evasion game with immobile evaders.

  5. Last but not least, in the StarCraft II/PySC2 learning environment, a deep learning design is supposed to mimic (and then rival) the intelligence of human players. Figure 4 gives an example. It is natural to expect that an AI agent should directly pursue any evaders that have been found on the radar. A human player, however, must first move the camera view to the target area and then issue the pursuing action. Hence, when the StarCraft II environment is used, all agent actions must be designed to follow the operation/behavior habits of a human player.

[Figure 4 panels: (a)–(c) human actions; (d) agent actions, including "Select all units" and "Move camera".]
Figure 4: When the StarCraft learning environment is adopted, the agent actions should be coded by following the action habits (and frequency) of human players. For example: (a) a human player will select the marines under the present camera view (1), then observe the radar (2) to search for zerglings, and could find one (3) to the right of the radar; (b) next, the human player will move the camera to the zergling that was found (with a mouse click), and (c) then issue the attack action (with the shortcut key 'a' plus a mouse click). An agent should be designed to follow the same action steps, which are summarized in (d). Otherwise, the StarCraft environment will return unanticipated action results.
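The action ordering in Fig. 4 can be sketched as a small queue-based agent. This is a minimal illustration only: the action names below are hypothetical string labels rather than PySC2 identifiers, the fourth action (a no-op) is an assumption, and the conversion between radar and screen coordinates is omitted.

```python
from collections import deque

# Hypothetical action labels for illustration; they are not PySC2 identifiers.
SELECT_ALL_UNITS = "select_all_units"
MOVE_CAMERA = "move_camera"
ATTACK = "attack"
NO_OP = "no_op"  # assumption: used when nothing is visible on the radar


class HumanLikePursuer:
    """Queues actions in the order a human player would issue them."""

    def __init__(self):
        self.pending = deque()

    def step(self, radar_targets):
        """Return the next (action, argument) pair.

        radar_targets: list of (x, y) radar coordinates of visible evaders.
        """
        if not self.pending:
            if not radar_targets:
                return (NO_OP, None)
            target = radar_targets[0]
            # Select the units, bring the target into view, then attack it.
            self.pending.extend([
                (SELECT_ALL_UNITS, None),
                (MOVE_CAMERA, target),
                (ATTACK, target),
            ])
        return self.pending.popleft()


agent = HumanLikePursuer()
sequence = [agent.step([(20, 12)]) for _ in range(3)]
print(sequence)
```

Queuing the three steps keeps the agent from issuing an attack before the camera has been moved, which is the ordering constraint the caption describes.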

The above issues 1, 2 and 5 of the StarCraft learning environment increase the problem complexity of pursuit–evasion game. The issues 3 and 4 are resolved in the proposed new learning environment and details can be found in Sec. IV.

It is worthwhile to mention that many other popular reinforcement learning environments exist, e.g., the well-known Gym toolkit and the Atari environment [27, 28]. Compared to those reinforcement learning game problems, the StarCraft II learning environment is challenging because of the large size of its action space. For example, the size of the action space for the classical inverted pendulum control problem from Gym is 2 (i.e., push either left or right). The size for the Pac-Man game is 4 (i.e., go up, down, left or right). In contrast, the size of the action space for the StarCraft II learning environment is around 300, including move, attack, management tasks, etc., with 13 further types of possible arguments. One may argue that most of these StarCraft II actions are unnecessary for the pursuit–evasion game and can easily be disabled from the available action space during the reinforcement learning of an agent. Then, it seems that the size of the action space can be reduced to 4 (refer to the agent actions in Fig. 4). However, the actual pursuing or evasion coordinates, whose size is 1024 for the current mini-game set-up, must be given along with the associated action command. Generally speaking, the spatial coordinate outputs should be regarded as a part of the action space. Then, for the current pursuit–evasion type mini-game, the size of the action space is 1024 + 4 = 1028. For such a complicated set-up, it should be beneficial to first conduct a theoretical study from the perspective of control and differential game theory.
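The count of 1028 can be illustrated by flattening the 4 commands and the spatial outputs into a single index. The sketch below assumes the 1024 spatial outputs come from a 32 × 32 grid; that factorization and the function name `decode` are assumptions made for illustration.

```python
N_COMMANDS = 4                    # non-spatial commands kept for the mini-game
GRID_SIDE = 32                    # assumption: 32 x 32 = 1024 spatial outputs
N_COORDS = GRID_SIDE * GRID_SIDE
ACTION_SPACE_SIZE = N_COORDS + N_COMMANDS


def decode(index):
    """Map a flat action index to either a command or a grid coordinate."""
    if not 0 <= index < ACTION_SPACE_SIZE:
        raise ValueError(f"index {index} is outside the action space")
    if index < N_COMMANDS:
        return ("command", index)
    cell = index - N_COMMANDS
    # Row-major layout: x is the column, y is the row.
    return ("coordinate", (cell % GRID_SIDE, cell // GRID_SIDE))


print(ACTION_SPACE_SIZE, decode(3), decode(4), decode(ACTION_SPACE_SIZE - 1))
```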

III Merging with Control and Game Theory

III.1 A linear control perspective

The pursuit–evasion game has a long-standing connection to control theory. Normally, a strategy of the game consists of two levels: the bottom level is the control level, which is usually the well-known proportional guidance law, while the top level is the pursuing or evading strategy [29]. Moreover, the control perspective helps to show why the usual linear optimal control designs must be merged with deep learning for the pursuit–evasion game, especially when fog of war is activated.

For the current pursuit–evasion set-up, the state dynamics is absent [Footnote 7: Acceleration capability is neglected here, though some units in StarCraft, such as marines and medivac dropships, do have this capability, which could be considered in future studies.] and the state space representation is simply


where is the 2D coordinates (states) of all the units, , is the moving velocity (control inputs), and the subscripts and represent the pursuers and evaders, respectively. A more generic game set-up with nonzero state dynamics and the corresponding theoretical manipulation can be found in the reference [9].

Following [4], a linear quadratic cost can be defined for each pair of the pursuer and evader (e.g., and ) with a specified final time (i.e., s in the current mini-game set-up),


where the weight can be relaxed to 0 since there is no penalty on control input.

In the mini-game, the number of the defeated units is defined as the loss of the evaders and the gain of the pursuers, that is, when is minimized to 0 (or just within the attacking range of pursuers), the th pursuer will be able to attack the th evader. An optimal controller could be synthesized to optimize the above performance objective, whereas the opponent evasion strategy seeks to reduce the performance objective. Eventually, both sides achieve the well-known Nash equilibrium as a result of the non-cooperative dynamic game.
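The set-up above, with positions as states and velocities as control inputs, can be sketched by a simple forward integration. This is a minimal sketch under stated assumptions: one pursuer heads straight for one immobile evader at constant speed, and the function name, time step, and numerical values are hypothetical.

```python
import math


def pursue(pursuer, evader, speed, attack_range, dt, t_final):
    """Integrate position' = velocity with the velocity aimed at an immobile evader.

    Returns the first time at which the evader is within attack_range,
    or None if this does not happen before t_final.
    """
    px, py = pursuer
    ex, ey = evader
    t = 0.0
    while t <= t_final:
        dx, dy = ex - px, ey - py
        dist = math.hypot(dx, dy)
        if dist <= attack_range:
            return t
        # Constant-speed control input pointing at the evader.
        px += speed * dt * dx / dist
        py += speed * dt * dy / dist
        t += dt
    return None


# Hypothetical values: 10 length units apart, speed 2, attack range 1.
capture_time = pursue((0.0, 0.0), (10.0, 0.0), speed=2.0,
                      attack_range=1.0, dt=0.01, t_final=60.0)
print(capture_time)
```

With no penalty on the control input, the pursuer simply closes the distance at full speed, so the capture time is roughly the initial distance minus the attack range, divided by the speed.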

As to be shown in the next subsection, the performance objective usually used in differential game is the expected capture time,


where denotes the searching strategy, denotes the hiding strategy, and is the expected capture time. It is worthwhile to mention that all these symbols are consistent with the pioneering reference [8]. Given , a control method should be designed to enable the pursuers to follow the optimal .


Figure 5: The searching strategy (a) with full observations and (b) partial observations (denoted by the green circles).

Nevertheless, the StarCraft mini-game set-up only provides partial observations, which makes an optimal control impossible. To show this, an example with just 1 pursuer and 3 immobile evaders is conceived in Fig. 5. When the locations of all three evaders are known to the pursuer, it is easy to see that the optimal searching strategy is the one shown in Fig. 5(a), first from to , and then towards and consecutively. However, when the observation is partial, only is known to in the initial set-up (refer to the green circle). When captures , the initially invisible becomes visible to (refer to the light green circle). Compared to Fig. 5(a), this searching strategy is certainly less optimal, not to mention that is still in the fog of war, which requires further exploration (represented by the question marks in Fig. 5(b)).
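The contrast between full and partial observations can be sketched numerically. The coordinates, sight radius, and function names below are hypothetical and are not taken from Fig. 5; the greedy rule used for the partial case (visit the nearest visible evader) is an assumption chosen for illustration.

```python
import math
from itertools import permutations


def path_length(start, stops):
    """Total length of the path that visits the stops in the given order."""
    total, current = 0.0, start
    for stop in stops:
        total += math.dist(current, stop)
        current = stop
    return total


def best_order(start, evaders):
    """Full observations: brute-force the shortest visiting order."""
    return min(permutations(evaders), key=lambda order: path_length(start, order))


def greedy_partial(start, evaders, sight):
    """Partial observations: repeatedly visit the nearest visible evader."""
    current, remaining, visited = start, list(evaders), []
    while remaining:
        visible = [e for e in remaining if math.dist(current, e) <= sight]
        if not visible:
            break  # the remaining evaders are in the fog; exploration is required
        target = min(visible, key=lambda e: math.dist(current, e))
        visited.append(target)
        remaining.remove(target)
        current = target
    return visited, remaining


start = (0.0, 0.0)
evaders = [(2.0, 0.0), (4.0, 0.0), (-3.5, 0.0)]
full = best_order(start, evaders)
visited, hidden = greedy_partial(start, evaders, sight=3.0)
print(full, path_length(start, full))
print(visited, hidden)
```

In this example the shortest full-information route begins with the evader that the partially observing pursuer cannot see at the start, so the partial-observation route is longer even before the remaining exploration is counted.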

Figure 6: The mini-game of FindAndDefeatZerglings when fog of war is (a) activated and (b) deactivated, respectively.

It should be noted that Fig. 5 only considers a much simplified scenario. The FindAndDefeatZerglings mini-game contains more complicated features. Figure 6(a) shows a classical radar screenshot of the FindAndDefeatZerglings mini-game. Figure 6(b) shows the corresponding screenshot when fog of war is deactivated, where the three green dots represent the marine units (i.e., the pursuers) and the 25 red dots represent the zerglings (i.e., the evaders). The mini-game has been designed [Footnote 8: This is done with the script inside the StarCraft II map editor.] to randomly redeploy 25 new evaders only when the former 25 evaders have all been captured, which challenges the exploration capabilities of a searching agent in the presence of fog of war. As a result, the FindAndDefeatZerglings mini-game is the most difficult of the seven mini-games provided by DeepMind in terms of training cost. The reference [20] has reported that an AI agent can capture 46 evaders after M training steps (refer to Fig. 6 therein). The other reference [24] has reported the capture of 45 after M training steps (i.e., K episodes therein). The same test has been performed in this work for a deep Q network (DQN) agent with around M trainable parameters on a desktop with decent training hardware (Nvidia GeForce RTX 3090). The training speed is quite slow, at M steps per day.

III.2 A differential game perspective

Theoretically, a differential game consists of two or more players against one another in an adversary environment with competing objectives. The corresponding theoretic studies have produced many classical findings, such as but not limited to the references [8, 9, 25]. Such a game theory perspective is adopted here to understand the achievable performance for pursuit–evasion in the current complicated StarCraft mini-game set-up. Readers who are only interested in the reinforcement learning environment can neglect the following theoretical developments and directly jump to the next section.

More specifically, the derivations inside Theorem 3 (e.g., Eqs. (46), (61) and (62)) from the reference [8] are identified to be particularly useful for the current pursuit–evasion game problem. By essentially following those derivations (but with different simplifications), a constructive proof of the following theorem can be achieved.

Theorem 1. There exists an optimal searching strategy in the game domain such that for any evading trajectory used by the evader, the expected capture time satisfies


here is the area of the game domain , is the moving speed of the pursuer, is the longest distance between any two points inside the game domain, and and are discretized lengthscales of the 2D game domain (see Fig. 7). Moreover, was the sight range of the pursuer in the former work [8], whereas shall be the attack range in this work and the reason will be given below.

Proof. As shown in Fig. 7, and are discretizations in the and directions. Assume , where is the ceiling function. Then, when , the number of small discretized blocks inside the game area is


Otherwise, when ,


and to ensure that pursuers are able to attack any evader when the latter is within the attack range.

According to [8] (especially the proof of Theorem 3 therein), during the time segment , the pursuer could move to any small block of size , followed by first searching along the horizontal line and then by searching along the vertical line. On the other hand, when there is only one evader randomly deployed in blocks, it is easy to see that the probability of capture satisfies


The pursuer would adopt the search strategy that consists of independent repetitions of the above process for any hiding trajectory . Then, the capture probability after the th searching with time satisfies


Hence, from Eqs. (7)–(9) and the Maclaurin series expansion of , the expected capture time satisfies


Comment 1. The derivation for Theorem 3 of the reference [8] was further extensively simplified therein, which is deemed unnecessary for the current game set-up, because each term in Eq. (10) already holds clear physical meaning.
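The repetition argument in the proof can be illustrated with a geometric model. The sketch below assumes that each search repetition lasts a fixed time `tau` and captures the evader independently with probability exactly `p`; both names and the numerical values are hypothetical. Under that assumption, the number of repetitions is geometrically distributed and the expected capture time is at most `tau / p`.

```python
def capture_probability_after(n, p):
    """Probability that the evader has been captured within n repetitions."""
    return 1.0 - (1.0 - p) ** n


def expected_capture_time(p, tau, max_terms=10_000):
    """Numerically sum k * tau * P(first capture happens in repetition k)."""
    total = 0.0
    for k in range(1, max_terms + 1):
        total += k * tau * p * (1.0 - p) ** (k - 1)
    return total


p, tau = 0.05, 10.0          # hypothetical per-repetition probability and duration
numeric = expected_capture_time(p, tau)
closed_form = tau / p        # mean of a geometric distribution, scaled by tau
print(numeric, closed_form, capture_probability_after(20, p))
```

Charging a full repetition even when the capture happens partway through makes this an upper estimate of the capture time, which matches the direction of the bound in the theorem.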


Figure 7: Sketch of the game problem, where the origin is set to the top left, and the dotted arrowed curves represent the traversal path of a possible searching strategy. Here the coordinate system follows the StarCraft environment.

Proposition 1. When the evaders are immobile, the searching strategy can be simplified to a consecutive traversal. The corresponding expected capture time satisfies


Proof. For a hiding strategy with immobile evaders, a consecutive traversal strategy (the dotted arrowed curves in Fig. 7) would remove from Eq. (10), to directly produce Eq. (11).

Comment 2. From Eq. (11), it can be seen a larger will result in a more rapid capture time. As shown in Fig. 7, a possible searching strategy is traversal of the whole game domain, where could be increased to the whole length in the direction (that is, ). Moreover, with the attacking range to ensure that the pursuer would be able to defeat the evader when the latter is visible.
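A consecutive traversal of the discretized blocks can be sketched as a row-by-row sweep that reverses direction on alternate rows. The function names, the block-centre-to-block-centre path model, and the numerical values are assumptions for illustration; the sketch is not the exact path drawn in Fig. 7.

```python
def lawnmower_blocks(n_x, n_y):
    """Order in which the blocks of an n_x-by-n_y grid are visited.

    Rows are swept left to right and right to left alternately, so that
    consecutive blocks in the returned list are always neighbours.
    """
    order = []
    for j in range(n_y):
        columns = range(n_x) if j % 2 == 0 else range(n_x - 1, -1, -1)
        order.extend((i, j) for i in columns)
    return order


def traversal_time(n_x, n_y, block_x, block_y, speed):
    """Time to move between block centres along the lawnmower path."""
    horizontal = (n_x - 1) * block_x * n_y   # sweeps along every row
    vertical = (n_y - 1) * block_y           # one step down between rows
    return (horizontal + vertical) / speed


print(lawnmower_blocks(3, 2))
print(traversal_time(3, 2, block_x=2.0, block_y=2.0, speed=1.0))
```

Because every block is visited exactly once and no block is revisited, an immobile evader is found within a single sweep, which is why the repetition term is no longer needed in this case.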

Proposition 2. The expected capture time for the set-up with independent evaders is equal to that of the set-up with one evader.

Proof. Assume the number of evaders is . From Eq. (8), the new capture probability will be


for evaders. Then, from Eq. (10), the new expected capture time for all evaders is


Comment 4. The reward examined in the mini-games is the number of defeated evaders. Given the expected capture time , the reward becomes


where the game finish time is s, is the time that is required to defeat all the evaders when they are all within attack range, is the number of evaders/pursuers, is the health of each evader, and DPS represents damage per second imposed by the selected units. After substituting those unit parameter values (from the appendix) and Eq. (11) into Eq. (14), the possible best mean score (mean captured number) performance for the FindAndDefeatZerglings mini-game would be


which is very close to the best mean score of 62 currently achieved by a relational agent from DeepMind [21]. The slight difference could be caused by the effect of , which has been neglected in the above calculation. The effect could be important, especially at the game reset state, and its absence in the above estimate could thus yield a slight overestimation. On the other hand, further optimization of the agent network structures could possibly further increase the achievable mean score. Overall, the game theoretic perspective helps to increase our understanding of the achievable performance for the current StarCraft mini-games.

IV The StarCraft Adversary-Agent Challenge

In the abovementioned FindAndDefeatZerglings mini-game, all evaders are almost immobile. The evaders (zerglings here, see Fig. A1(b)) remain still but run towards the pursuers (marines here, see Fig. A1(a)) when the latter are visible. Such a hiding strategy actually simplifies the searching task for the pursuers and justifies the simplified assumption adopted in Eq. (11). Gal has pointed out that a mobile evader is more difficult to capture [8], which motivates this work to develop a new adversary-agent learning environment that can be used to train an AI agent for mobile evaders.









Figure 8: (a) The learning environment of the mini-map from DeepMind provides an interface to a single agent, which has been extended to (b) multiple cooperative agents in the SMAC learning environment [16].

First, Fig. 8 compares the structures of the existing mini-game reinforcement learning environments. The FindAndDefeatZerglings mini-game essentially follows the diagram shown in Fig. 8(a), where a single agent interacts with the StarCraft II environment and controls the pursuers to maximize future rewards, whereas the built-in script code from StarCraft II controls the evaders (to either remain still or push back). The SMAC toolkit [16] follows the diagram shown in Fig. 8(b), where the actions from each agent are concatenated through the SMAC toolkit, and the observations from the StarCraft environment are separated and redistributed to each of the agents. The underlying paradigm is centralized training with decentralized execution [30]. Hence, the SMAC toolkit enables reinforcement learning of coordinated actions among multiple cooperative agents. However, as far as this author knows, an adversary-agent environment that would enable reinforcement learning, especially for pursuit–evasion type differential games, is still rare. To fill this gap, the current work proposes an adversary-agent learning environment (named the StarCraft Adversary-Agent Challenge, SAAC).












Figure 9: Overview of the SAAC learning environment.

Figure 9 shows the corresponding structure of the proposed SAAC environment, where two adversary agents control the pursuers and the evaders, respectively. It is worthwhile to mention that each agent could be further extended into multiple coordinating agents by incorporating the SMAC toolkit.
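The two-agent structure of Fig. 9 can be sketched with a toy environment in which each step takes one action from the pursuer agent and one from the evader agent. This is a hypothetical stand-in written in plain Python; it is not the SAAC or PySC2 interface, and the class, the scripted policies, the zero-sum reward, and all numerical values are assumptions for illustration.

```python
import math


def toward(src, dst, step):
    """Displacement of length `step` pointing from src to dst (zero if they coincide)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return (0.0, 0.0)
    return (step * dx / d, step * dy / d)


class ToyAdversaryEnv:
    """Hypothetical toy: one pursuer, one evader, square domain, limited sight."""

    def __init__(self, side=10.0, sight=4.0, capture_radius=1.0, max_steps=200):
        self.side, self.sight = side, sight
        self.capture_radius, self.max_steps = capture_radius, max_steps

    def reset(self):
        self.pursuer, self.evader, self.steps = (0.0, 0.0), (6.0, 6.0), 0
        return self._observe()

    def _observe(self):
        visible = math.dist(self.pursuer, self.evader) <= self.sight
        # Each agent sees itself, and the opponent only inside the sight range.
        return ((self.pursuer, self.evader if visible else None),
                (self.evader, self.pursuer if visible else None))

    def _move(self, pos, action):
        # Apply the displacement and keep the unit inside the domain.
        return tuple(min(self.side, max(0.0, p + a)) for p, a in zip(pos, action))

    def step(self, pursuer_action, evader_action):
        self.pursuer = self._move(self.pursuer, pursuer_action)
        self.evader = self._move(self.evader, evader_action)
        self.steps += 1
        captured = math.dist(self.pursuer, self.evader) <= self.capture_radius
        reward = 1.0 if captured else 0.0
        done = captured or self.steps >= self.max_steps
        # Zero-sum rewards (assumption): the pursuer gains what the evader loses.
        return self._observe(), (reward, -reward), done


def pursuer_policy(obs, speed=1.0, centre=(5.0, 5.0)):
    me, opponent = obs
    goal = opponent if opponent is not None else centre
    return toward(me, goal, min(speed, math.dist(me, goal)))


def evader_policy(obs, speed=0.5):
    me, opponent = obs
    if opponent is None:
        return (0.0, 0.0)
    return toward(opponent, me, speed)  # move directly away from the pursuer


env = ToyAdversaryEnv()
obs_p, obs_e = env.reset()
done, totals = False, [0.0, 0.0]
while not done:
    (obs_p, obs_e), (r_p, r_e), done = env.step(pursuer_policy(obs_p),
                                                evader_policy(obs_e))
    totals[0] += r_p
    totals[1] += r_e
print(env.steps, totals)
```

Either scripted policy can be swapped for a learning agent without touching the environment, which is the separation of roles that the adversary-agent structure is meant to provide.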

The SAAC environment consists of some example mini-maps and adversary agents, which will guide interested readers to build up their own maps and agents. Some of the findings that are important for the correct implementation of the environment are summarized as follows.

  • For unknown reasons, the seven mini-game maps from DeepMind cannot support two adversary agents as opponent players. Hence, in this work, the mini-map is built from scratch with the StarCraft map editor. Interested readers can download and further edit this mini-map for their own target research problems.

  • It is well known in the StarCraft programming community that the current PySC2 interface can produce websocket errors during the low-level message passing between multiple agent interfaces. To bypass this issue, the code has been thoroughly debugged in this work to identify the part responsible. A temporary fix has then been adopted until an official fix becomes available from DeepMind.

Figure 10: The screenshot of the adversary-agent learning environment (the FindAndDefeatDrones mini-game) developed in this work.

Other important modifications include the modified optimization objectives (to maximize the number of defeated units for the pursuers and the number of living units for the evaders) and the use of different unit types to address the third issue mentioned in Sec. II. More specifically, the evaders are changed from Zerg zerglings to Zerg drones, which are farming workers and will only escape to the nearby fog of war, rather than pushing back, when they are attacked. The pursuers are changed from Terran marines to Protoss void rays, which represent a classical type of attack aircraft. Figure 10 shows a screenshot of this new, so-called FindAndDefeatDrones mini-game. Compared to the former FindAndDefeatZerglings mini-game, the FindAndDefeatDrones mini-game is more similar to the classical pursuit–evasion game. Moreover, other units can be considered in later studies. For example, a set-up with Terran medivac dropships versus Protoss void rays should be able to imitate aerospace interception and capture applications. The corresponding game modifications should be straightforward based on the proposed FindAndDefeatDrones mini-game.

An analysis similar to Eq. (IV) can be conducted for this new mini-game, which yields

However, after running this mini-map, an expert human player suggested that the above value is considerably overestimated. The reason is that Eq. (IV) only holds for immobile evaders (recall that the evaders actually run towards the pursuers in the former mini-game). In the new FindAndDefeatDrones mini-game, by contrast, the evaders (Zerg drones) escape to the nearby fog of war to avoid being attacked. Moreover, the Zerg drones move slightly faster than the pursuers (Protoss void rays). The effect of , which is the longest possible distance inside the game domain and equal to , cannot be neglected anymore. Hence, Eq. (10) is adopted to yield a new estimate of the expected number of captured units,

Before moving on to the learning of the adversary agent, a couple of tests with a simplified scripted agent and a random agent have been conducted to verify and validate the code and the new mini-game set-ups [Footnote 9: Interested readers can download the SAAC code and type the terminal command: python]. When the evader agent is random, the testing pursuit agent achieves a mean score of 50.4 (i.e., the number of captured units), which shows that the whole adversary-agents learning environment is working. Moreover, the pursuit agent is also tested on the classical FindAndDefeatZerglings mini-game [Footnote 10: Interested readers can download the SAAC code and type the terminal command: python] and achieves a mean score of 40. Both tests clearly suggest the effectiveness of this testing pursuit agent.

Figure 11: The code structure of the adversary-agents tests for the FindAndDefeatDrones mini-game, where a fully connected network is adopted in agent 1 to extract features and produce decisions.

Next, the proposed adversary-agents environment is utilized to train agents. To the best of the author's knowledge, most former works have focused on pursuit agents based on A2C, A3C, DQN and relational neural network methods, whereas the other side of the coin is rarely studied. Enabled by the new learning environment, the attention here is focused on the training of an evader agent. Figure 11 shows the code structure, where two interfaces from the StarCraft environment output observations (feature maps, etc.) to the pursuers and the evaders, respectively. Currently, the evader agent (agent 1 in the figure) adopts a four-layer, fully-connected convolutional network architecture. Other hyperparameter values can be found inside the code. The current work only uses such a network to rapidly showcase the proposed adversary learning environment. Further optimization of the network architecture and hyperparameter configuration can be straightforwardly performed by interested readers. The pursuit agent (agent 2 in Fig. 11) adopts the above-mentioned traversal agent. Again, this agent can be easily replaced with other reinforcement learning agents.
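To make the role of the evader network concrete, the following is a minimal NumPy sketch of a four-layer fully connected Q-network with epsilon-greedy action selection, as commonly used with DQN. It is not the SAAC implementation, which builds on Keras-rl: the layer sizes, the 16×16 input feature map, the two-action output and the helper names `init_mlp`, `q_values` and `epsilon_greedy` are all assumptions made for illustration, and no convolutional layers or training step are included.

```python
import numpy as np


def init_mlp(layer_sizes, rng):
    """Return one (weights, bias) pair per fully connected layer."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((weights, np.zeros(n_out)))
    return params


def q_values(params, observation):
    """Forward pass: flatten the feature map and apply the layers in turn."""
    x = np.asarray(observation, dtype=float).ravel()
    for i, (weights, bias) in enumerate(params):
        x = x @ weights + bias
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x


def epsilon_greedy(q, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))


rng = np.random.default_rng(0)
# Assumed sizes: a 16x16 feature map (256 inputs) and 2 discrete team actions.
params = init_mlp([256, 64, 64, 32, 2], rng)  # four weight layers
observation = rng.random((16, 16))
q = q_values(params, observation)
action = epsilon_greedy(q, epsilon=0.1, rng=rng)
print(q.shape, action)
```

In a full DQN agent the weights would be updated from replayed transitions; only the decision path from observation to action is shown here.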

Figure 12: Some of the representative screenshots from (a) the pursuers and (b) the evaders, respectively, during one episode of the FindAndDefeatDrones game. (c) The corresponding searching and evasion strategies, where the dashed circles represent the corresponding attack radius and the dashed lines represent searching paths.

Figure 12 shows some representative screenshots from (a) the pursuers and (b) the evaders, respectively, during one episode of the FindAndDefeatDrones game [Footnote 11: Interested readers can download the SAAC code and type the terminal command: python]. The four consecutive stages run from the initial step to the middle period of a traversal-type search. Figure 12(c) shows the trajectories of the pursuers in these four stages, and further shows the corresponding spatial distributions (from stage 2 to stage 4) of the surviving evaders. At the first stage, the initial 25 evaders are randomly scattered throughout the whole game domain and, for clarity, are not shown in Fig. 12(c). It can be seen that all three pursuers stay together to concentrate their firing capability while searching for the evaders. Similarly, after only about a dozen training episodes, the evader agent learns to control all 25 evaders to gradually move together and eventually convene at one of the corners of the game domain.
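As an illustration of what a traversal-type search can look like, the sketch below generates waypoints for a back-and-forth sweep of a rectangular domain. This is one common traversal pattern and is not necessarily the one implemented in the SAAC pursuit agent; the function name `sweep_waypoints`, the 64×64 domain and the lane spacing of 20 (twice the void ray sight range listed in Table A1) are assumptions.

```python
import math


def sweep_waypoints(width, height, max_lane_spacing):
    """Return waypoints of a back-and-forth sweep covering the rectangle.

    Lanes are evenly spaced so that no point is farther than
    max_lane_spacing / 2 (vertically) from the nearest lane.
    """
    if width <= 0 or height <= 0 or max_lane_spacing <= 0:
        raise ValueError("all dimensions must be positive")
    n_lanes = math.ceil(height / max_lane_spacing)
    spacing = height / n_lanes
    waypoints = []
    for lane in range(n_lanes):
        y = (lane + 0.5) * spacing
        # Alternate the sweep direction on every lane.
        xs = (0.0, float(width)) if lane % 2 == 0 else (float(width), 0.0)
        waypoints.append((xs[0], y))
        waypoints.append((xs[1], y))
    return waypoints


# Hypothetical 64x64 domain, lanes at most 20 apart (2 x sight range of 10).
path = sweep_waypoints(64, 64, 20)
for point in path:
    print(point)
```

Keeping all pursuers on the same path matches the grouped search seen in Fig. 12(c); splitting the lanes among pursuers would be the separate-search alternative.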

As shown in Fig. 11, the current evader agent only supports collective evasion or collective exploration, which considerably reduces the size of the action space and the reinforcement learning time. Such a team action strategy can be modified by changing the available action space. The game reward shown in Fig. 13 suggests that reinforcement learning quickly helps to reduce the number of captured evaders from around 51 (the solid line in the figure) to around 30 (the dashed line).

Figure 13: The number of captured units for the FindAndDefeatDrones game, where the solid line (–) denotes the results achieved by the random agent for the evader part, and the dashed line (- -) shows the results obtained through adversary-agent reinforcement learning with a convolutional network and the DQN method.

Theoretically, through the current group searching and team hiding strategies, the FindAndDefeatDrones mini-game with 3 pursuers and 25 evaders is actually reduced to the classical princess and monster game with one princess and one monster. From Fig. 12(c), the hiding strategy learned in the current adversary game environment is similar to the well-known solution from Gal [8], that is, all evaders behave as one unit, move to a random location as a team, stay still for a certain time interval, and then repeat the procedure. It is worthwhile to mention that the two strategies also imitate the likely behavior of ordinary human players, who tend to control a group of units together. Whether a separate search or a separate evasion would lead to better rewards is still an interesting open question that requires further study, which is, however, beyond the scope of the current paper.
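The move-then-hold pattern described above can be written down as a short generator. This is only a sketch of the qualitative behavior, not the learned policy and not Gal's optimal mixed strategy; the name `team_hiding_schedule`, the 64×64 domain, the number of cycles and the hold duration are hypothetical.

```python
import random


def team_hiding_schedule(width, height, n_cycles, hold_steps, seed=None):
    """Yield team commands: move to a random point, then hold position."""
    rng = random.Random(seed)
    for _ in range(n_cycles):
        target = (rng.uniform(0, width), rng.uniform(0, height))
        yield ("move", target)      # the whole team relocates together
        for _ in range(hold_steps):
            yield ("hold", target)  # stay still for a fixed interval


# Hypothetical settings: 64x64 domain, 3 relocations, 4 hold steps each.
schedule = list(team_hiding_schedule(64, 64, n_cycles=3, hold_steps=4, seed=1))
for command, target in schedule[:6]:
    print(command, target)
```

Because every command applies to the team as a whole, the action space stays as small as in the collective evasion and exploration actions of the current evader agent.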

V Conclusion

In this paper, a StarCraft-based reinforcement learning environment that supports adversary agents has been proposed for the study of pursuit–evasion games in the presence of fog of war. The key contributions include the analysis of the potential performance of an agent in the current pursuit–evasion mini-game, by merging control and differential game theory for the specific reinforcement learning problem set-ups, and the development of the SAAC environment by extending the current StarCraft mini-games. The current work focuses solely on evader agent learning, which is rare in former studies, and configures the pursuit agent as a testing traversal agent with decent searching performance. The proposed SAAC environment should also be applicable to future studies that wish to train adversary agents simultaneously; the bottleneck that the author can currently envision is the prohibitive training cost.

Theoretically, the most critical part of this work is the analytical explanation of the potential pursuit agent performance by differential game theory for the StarCraft mini-game set-ups. In addition, the resultant performance values help to examine the performance of the traversal pursuit agent used in the adversary-agent training. The subsequent study showcases the use of this learning environment and the effectiveness of the learned adversary agent for evasion units. On the other hand, reinforcement learning usually assumes a stationary environment, which could be inapplicable to pursuit–evasion when non-cooperative game dynamics appear. Hence, the proposed SAAC environment should enable new research directions for both the differential game and the reinforcement learning research communities, and help to promote the merging of game theory and AI technology.

Last but not least, the author wishes to emphasize that this paper serves as an introduction with a particular focus on the development of the SAAC environment, with detailed explanations of why the environment is designed in this way and how to bypass the inherent code issues and certain software limitations. The corresponding SAAC code can be found at GitHub. More studies regarding different AI network architectures and hyperparameter optimization will be given in follow-up articles.


This research was conducted during the pandemic era, when financial resources and student support were both scarce. The author wishes to acknowledge the great affection, emotional support and understanding from his family.



.1 The units

Figure A1 shows the units that have been considered in this work. Table A1 gives the corresponding unit parameters. Interested readers can try other units by editing the map developed in this work with the StarCraft map editor.

Figure A1: The StarCraft units that have been used in this work: (a) marine, (b) zergling, (c) void ray and (d) drone.
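| Name | Health | Sight range | Attack range | Speed | Damage per second (DPS) |
|---|---|---|---|---|---|
| Marine | 45 | 9 | 5 | 3.15 | 9.8 |
| Zergling | 35 | 8 | 0.1 | 4.13 | 10 |
| Drone | 40 | 8 | 0.1 | 3.94 | 4.67 |
| Void ray | 150 | 10 | 6 | 3.85 | 16.8 |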
Table A1: The information of the units [31] used in this paper.
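The values of Table A1 can be used directly for simple kinematic checks, such as comparing evader and pursuer speeds. The sketch below encodes the table as Python data; the two rows without a name in the extracted table are assumed to be the zergling (health 35) and the drone (health 40), and the `Unit` class and helper names are hypothetical. The time-to-defeat estimate deliberately ignores armor, damage bonuses and attack cooldowns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    health: float
    sight_range: float
    attack_range: float
    speed: float
    dps: float


UNITS = {
    unit.name: unit
    for unit in (
        Unit("marine", 45, 9, 5, 3.15, 9.8),
        Unit("zergling", 35, 8, 0.1, 4.13, 10),   # row name assumed
        Unit("drone", 40, 8, 0.1, 3.94, 4.67),    # row name assumed
        Unit("void ray", 150, 10, 6, 3.85, 16.8),
    )
}


def speed_ratio(evader, pursuer):
    """Evader speed divided by pursuer speed (> 1 means the evader is faster)."""
    return UNITS[evader].speed / UNITS[pursuer].speed


def seconds_to_defeat(attacker, target):
    """Crude estimate: target health divided by attacker damage per second."""
    return UNITS[target].health / UNITS[attacker].dps


print(round(speed_ratio("drone", "void ray"), 3))
print(round(seconds_to_defeat("void ray", "drone"), 2))
```

A ratio above one for the drone against the void ray is consistent with the statement that the evaders move slightly faster than the pursuers in the FindAndDefeatDrones mini-game.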

.2 Some of the main code subroutines

  • the adversary-agent program entry point, will set up the neural network architecture and conduct the fit operation.

  • defines the DQN agent.

  • is to be inherited by, and defines the key fit function.

  • sets up the StarCraft II environment, and defines the possible actions and the key step function.

  • is to be inherited by, and extends the StarCraft II environment to the pursuit–evasion problem.

  • the script tests a traversal algorithm for the pursuers in the FindAndDefeatZerglings mini-game.

  • the script tests a traversal algorithm for the pursuers in the FindAndDefeatDrones mini-game.

The other files are taken from Keras-rl with only slight modifications (most of which are explicitly pointed out in the code comments).