Hierarchical Decision Making by Generating and Following Natural Language Instructions

by Hengyuan Hu, et al.

We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data.



1 Introduction

Many complex problems can be naturally decomposed into steps of high level planning and low level control. However, plan representation is challenging—manually specifying macro-actions requires significant domain expertise, limiting generality and scalability [18, 22], but learning composite actions from only end-task supervision can result in the hierarchy collapsing to a single action [3].

We explore representing complex actions as natural language instructions. Language can express arbitrary goals, and has compositional structure that allows generalization across commands [14, 1]. Our agent has a two-level hierarchy, where a high-level instructor model communicates a sub-goal in natural language to a low-level executor model, which then interacts with the environment (Fig. 1). Both models are trained to imitate humans playing the roles. This approach decomposes decision making into planning and execution modules, with a natural language interface between them.

We gather example instructions and executions from two humans collaborating in a complex game. Both players have access to the same partial information about the game state. One player acts as the instructor and periodically issues instructions to the other player (the executor), but has no direct control over the environment. The executor acts to complete the instruction. This setup forces the instructor to focus on high-level planning, while the executor concentrates on low-level control.

To test our approach, we introduce a real-time strategy (RTS) game, developing an environment based on [23]. A key property of our game is the rock-paper-scissors unit attack dynamic, which emphasises strategic planning over micro control. Our game environment is a challenging decision making task, because of exponentially large state-action spaces, partial observability, and the variety of effective strategies. However, it is relatively intuitive for humans, easing data collection.

Using this framework, we gather a dataset of 5392 games, where two humans (the instructor and executor) control an agent against rule-based opponents. The dataset contains 76 thousand pairs of human instructions and executions, spanning a wide range of strategies. This dataset poses challenges for both instruction generation and execution, as instructions may apply to different subsets of units, and multiple instructions may apply at a given time. We design models for both problems, and extensive experiments show that using latent language significantly improves performance.

In summary, we introduce a challenging RTS environment for sequential decision making, and a corresponding dataset of instruction-execution mappings. We develop novel model architectures with planning and control components, connected with a natural language interface. Our agent with latent language significantly outperforms agents that directly imitate human actions, and we show that exploiting the compositional structure of language improves performance by allowing generalization across a large instruction set. We also release our code, models and data.

Figure 1: Two agents, designated instructor and executor, collaboratively play a real-time strategy game (§2). The instructor iteratively formulates plans and issues instructions in natural language to the executor, who then executes them as a sequence of actions. We first gather a dataset of humans playing each role (§3). We then train models to imitate human actions in each role (§4).

2 Task Environment

We implement our approach for an RTS game, which has several attractive properties compared to traditional reinforcement learning environments, such as Atari [13] or grid worlds [19]. The large state and action spaces mean that planning at different levels of abstraction is beneficial for both humans and machines. However, manually designed macro-actions typically do not match strong human performance, because of the unbounded space of possible strategies [21, 25]. Even with simple rules, adversarial games have the scope for complex emergent behaviour.

We introduce a new RTS game environment, which distills the key features of more complex games while being faster to simulate and more tractable to learn. Current RTS environments, such as StarCraft, have dozens of unit types, adding large overheads for new players to learn the game. Our new environment is based on MiniRTS [23]. It has a set of 7 unit types, designed with a rock-paper-scissors dynamic such that each has some units it is effective against and vulnerable to. Maps are randomly generated each game to force models to adapt to their environment as well as their opponent. The game is designed to be intuitive for new players (for example, catapults have long range and are effective against buildings). Numerous strategies are viable, and the game presents players with dilemmas such as whether to attack early or focus on resource gathering, or whether to commit to a strategy or to attempt to scout for the opponent’s strategy first. Overall, the game is easy for humans to learn, but challenging for machines due to the large action space, imperfect information, and need to adapt strategies to both the map and opponent. See the Appendix for more details.

3 Dataset

To learn to describe actions with natural language, we gather a dataset of two humans playing collaboratively against a rule-based opponent. Both players have access to the same information about the game state, but have different roles. One is designated the instructor, and is responsible for designing strategies and describing them in natural language, but has no direct control. The other player, the executor, must ground the instructions into low level control. The executor’s goal is to carry out commands, not to try to win the game. This setup causes humans to focus on either planning or control, and provides supervision for both generating and executing instructions.

| Statistic | Value |
|---|---|
| Total games | 5392 |
| Win rate | 58.6% |
| Total instructions | 76045 |
| Unique instructions | 50669 |
| Total words | 483650 |
| Unique words | 5007 |
| # words per instruction | 9.54 |
| # instructions per game | 14.1 |

Table 1: We gather a large language dataset for instruction generation and following. Major challenges include the wide range of unique instructions and the large number of low-level actions required to execute each instruction.

We collect 5392 games of human teams playing against our bots, using ParlAI [12]. Qualitatively, we observe a wide variety of strategies. An average game contains 14 natural language instructions and lasts for 16 minutes. Each instruction corresponds to roughly 7 low-level actions, giving a challenging grounding problem (Table 1). The dataset contains over 76 thousand instructions together with their executions, and most of the instructions are unique. The diversity of instructions shows the wide range of useful strategies. The instructions contain a number of challenging linguistic phenomena, particularly in terms of reference to locations and units in the game, which often requires pragmatic inference. Instruction execution is typically highly dependent on context. Our dataset is available at http://bit.ly/2VL9lg6. For more details, refer to the Appendix.

Analysing the list of instructions (see Appendix), we see that the head of the distribution is dominated by straightforward commands to perform the most frequent actions. However, samples from the complete instruction list reveal many complex compositional instructions, such as "Send one catapult to attack the northern guard tower [and] send a dragon for protection". We see examples of challenging quantifiers ("Send all but 1 peasant to mine"), anaphora ("Make 2 more cavalry and send them over with the other ones"), spatial references ("Build a new town hall between the two west minerals patches") and conditionals ("If attacked retreat south").

4 Model

We factorize the agent into an executor model (§4.2), which maps instructions and game states into unit-level actions in the environment, and an instructor model (§4.3), which generates language instructions given the game states. We train both models with human supervision (§4.4).

4.1 Game Observation Encoder

We condition both the instructor and executor models on a fixed-size representation of the current game state, which we construct from a spatial map observation, the internal states of visible units, and several previous natural language instructions (Fig. 2). We detail each individual encoder below.

4.1.1 Spatial Inputs Encoder

We encode the spatial information of the game map using a convolutional network. We discretize the map into a grid and represent different kinds of information in separate channels. For example, three binary channels indicate whether each cell is Invisible, Seen, or Visible. We also have a separate channel per unit type that records the number of units at each spatial position, with separate channels for our units and enemy units. Note that due to "fog-of-war", not all enemy units are visible to the player. See the Appendix for more details.

We apply several convolutional layers that preserve the spatial dimensions to the input tensor. Then we use 4 sets of different weights to project the shared convolutional features onto different 2D feature spaces, namely Our Units, Enemy Units, Resources, and Map Cells. We then use the locations of units, resources, or map cells to extract their feature vectors from the corresponding 2D feature spaces.

4.1.2 Non-spatial Inputs Encoder

We also take advantage of non-spatial attributes and the internal state of game objects. Specifically, we enrich the feature vectors for Our Units and Enemy Units by adding encodings of each unit's health points and its previous and current actions. If an enemy unit moves out of the player's visibility, we respect this by using the unit's attributes from the last moment we saw it. We project the attribute features onto the same dimensionality as the spatial features and apply an element-wise multiplication to get the final set of Our Units and Enemy Units feature vectors.

Figure 2: At each time step of the environment we encode spatial observations (e.g. the game map) and non-spatial internal states for each game object (e.g. units, buildings, or resources) via the observation encoder, which produces separate feature vectors for each unit, resource, or discrete map location. We also embed each of the most recent natural language instructions into an individual instruction feature vector. Lastly, we learn features for all the other global game attributes by employing the auxiliary encoder. We then use these features for both the executor and instructor networks.

4.1.3 Instruction Encoders

The state also contains a fixed-size representation of the current instruction. We experiment with:

  • An instruction-independent model (ExecutorOnly), that directly mimics human actions.

  • A non-compositional encoder (OneHot) which embeds each instruction with no parameter sharing across instructions (rare instructions are represented with an unknown embedding).

  • A bag-of-words encoder (BoW), where an instruction encoding is a sum of word embeddings. This model tests if the compositionality of language improves generalization.

  • An RNN encoder (Rnn), which is order-aware. Unlike BoW, this approach can differentiate instructions such as attack the dragon with the archer and attack the archer with the dragon.
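The contrast between the BoW and Rnn encoders can be illustrated with a small sketch. The random embedding table and the toy recurrence below are assumptions made for illustration; they are stand-ins, not the encoders used in the paper. The point is only that a sum of word embeddings cannot distinguish two instructions containing the same words, while an order-aware recurrence can.

```python
import math
import random

random.seed(0)
DIM = 4
VOCAB = ["attack", "the", "dragon", "with", "archer"]
# Hypothetical random word embeddings (illustration only).
EMBED = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in VOCAB}


def bow_encode(instruction):
    """Bag-of-words: sum of word embeddings, invariant to word order."""
    vec = [0.0] * DIM
    for word in instruction.split():
        for i, x in enumerate(EMBED[word]):
            vec[i] += x
    return vec


def recurrent_encode(instruction):
    """Toy order-aware encoder: h_t = tanh(0.5 * h_{t-1} + e_t)."""
    h = [0.0] * DIM
    for word in instruction.split():
        h = [math.tanh(0.5 * h_i + e_i) for h_i, e_i in zip(h, EMBED[word])]
    return h


a = "attack the dragon with the archer"
b = "attack the archer with the dragon"

# BoW encodings coincide (up to floating-point rounding); recurrent ones differ.
bow_same = all(math.isclose(x, y, abs_tol=1e-9)
               for x, y in zip(bow_encode(a), bow_encode(b)))
rnn_differs = any(abs(x - y) > 1e-6
                  for x, y in zip(recurrent_encode(a), recurrent_encode(b)))
print(bow_same, rnn_differs)
```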

4.1.4 Auxiliary Encoder

Finally, we encode additional game context, such as the amount of money the player has, through a simple MLP to get the Extra features vector.

4.2 Executor Model

The executor predicts an action for every unit controlled by the agent based on the global summary of the current observation. For each of the player's units, we first choose an Action Type and then select the Action Output. There are 7 action types available: Idle, Continue, Gather, Attack, Train Unit, Build Building, and Move. The Action Output specifies the target of the action, such as a target location for the Move action or the unit type for Train Unit. Fig. 3 gives an overview of the executor design; see also the Appendix.

Figure 3: Modeling an action for a unit requires predicting an action type based on the global summary of the current observation and then, depending on the predicted action type, computing a probability distribution over the set of action targets. In this case, the Move action is sampled, which uses the map cell features as the action targets.

For each unit, we consider a history of recent instructions (the 5 most recent in all our experiments; see §5.1), because some units may still be working on a previous instruction that has a long-term effect, such as "keep scouting" or "build 3 peasants". To encode the instructions, we first embed each of them in isolation with one of the encoders from §4.1.3. For each instruction, we take the number of frames that have passed since it was issued and discretize it into bins, where the number of bins and the bin size are constants. We also record the temporal ordering of the instructions. We embed the binned elapsed time and the ordering index, and concatenate these embeddings with the language embedding. Dot-product attention then computes an attention score between a unit and each recent instruction, and a unit-dependent instruction representation is obtained as the sum of the history instruction embeddings weighted by these attention scores.
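The attention step over the instruction history can be sketched as follows. This is a minimal illustration under stated assumptions: the vectors are plain Python lists rather than learned embeddings, and `time_bin` is a hypothetical discretisation (integer division by the bin size, clipped to the last bin), since the exact binning formula is not recoverable from the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(unit_vec, instruction_vecs):
    """Return attention weights and the unit-dependent instruction vector."""
    weights = softmax([dot(unit_vec, ins) for ins in instruction_vecs])
    dim = len(instruction_vecs[0])
    context = [sum(w * ins[i] for w, ins in zip(weights, instruction_vecs))
               for i in range(dim)]
    return weights, context


def time_bin(frames_elapsed, bin_size, num_bins):
    """Hypothetical binning of elapsed frames, clipped to the last bin."""
    return min(frames_elapsed // bin_size, num_bins - 1)


unit = [1.0, 0.0]
history = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # three past instructions
weights, context = attend(unit, history)
print(weights, context)
```

The first instruction aligns best with the unit vector, so it receives the largest weight and dominates the resulting representation.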

We use the same observation encoder (§4.1) to obtain the features mentioned above. To form a global summary, we sum our unit features, enemy unit features, and resource features respectively and then concatenate together with Extra features.

To decide the action for each unit, we first feed the concatenation of the unit feature, the unit-dependent instruction feature and the global summary into a multi-layer neural classifier to sample an Action Type. Depending on the action type, we then feed inputs into different action-specific classifiers to sample the Action Output. In the action argument classifier, the unit is represented by the concatenation of the unit feature and the instruction feature, and the targets are represented by different target embeddings. For Attack, the target embeddings are enemy features; for Gather, the target embeddings are resource features; for Move, the target embeddings are map features; for Train Unit, the target embeddings are embeddings of unit types; for Build Building, the target embeddings are embeddings of unit types and map features, and we sample the type and the location independently. The distribution over targets for the action is computed by taking the dot product between the unit representation and each target, followed by a softmax.
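A minimal sketch of the second decoding stage (choosing the Action Output once an Action Type is known) is given below. The feature values are made up, the dictionary keys are hypothetical names, and returning no target for Idle and Continue is an assumption; only the mapping from action type to target set and the dot-product-plus-softmax scoring follow the description above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Which feature set provides the targets for each action type (from the text).
TARGET_SOURCE = {
    "Attack": "enemy_units",
    "Gather": "resources",
    "Move": "map_cells",
    "Train Unit": "unit_types",
}


def target_distribution(unit_repr, targets):
    """Dot product between the unit representation and each target, then softmax."""
    scores = [sum(u * t for u, t in zip(unit_repr, target)) for target in targets]
    return softmax(scores)


def sample_index(unit_repr, targets, rng):
    probs = target_distribution(unit_repr, targets)
    return rng.choices(range(len(targets)), weights=probs, k=1)[0]


def choose_output(action_type, unit_repr, features, rng):
    if action_type == "Build Building":
        # Building type and location are sampled independently.
        return (sample_index(unit_repr, features["unit_types"], rng),
                sample_index(unit_repr, features["map_cells"], rng))
    if action_type in TARGET_SOURCE:
        return sample_index(unit_repr, features[TARGET_SOURCE[action_type]], rng)
    return None  # assumption: Idle and Continue need no target


features = {  # hypothetical 2-dimensional embeddings
    "enemy_units": [[1.0, 0.0], [0.0, 1.0]],
    "resources": [[0.5, 0.5]],
    "map_cells": [[1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]],
    "unit_types": [[0.2, 0.1], [0.1, 0.2]],
}
rng = random.Random(0)
unit_repr = [3.0, 0.0]
attack_probs = target_distribution(unit_repr, features["enemy_units"])
print(attack_probs, choose_output("Attack", unit_repr, features, rng))
```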

We add an additional binary classifier, Global Continue, that takes the global summary and current instruction embedding as an input to predict whether all the agent’s units should continue working on their previous action.

4.3 Instructor Model

The instructor maps the game state to instructions. It uses the game observation encoder (§4.1) to compute a global summary and current instruction embedding similar to the executor. We experiment with two model types:

Discriminative Models

These models operate on a fixed set of instructions. Each instruction is encoded as a fixed-size vector, and the dot product of this encoding and the game state encoding is fed into a softmax classifier over the set of instructions. As in §4.1.3, we consider non-compositional (OneHot), bag-of-words (BoW) and Rnn Discriminative encoders.

Generative Model

The discriminative models can only choose between a fixed set of instructions. We also train a generative model, Rnn Generative, which generates instructions autoregressively. To compare likelihoods with the discriminative models, which consider a fixed set of instructions, we re-normalize the probability of an instruction over the space of instructions in the set.
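The re-normalization used to compare the generative model with the discriminative ones can be sketched as below. The candidate instructions and per-token probabilities are made-up values for illustration; the sketch only shows how sequence probabilities from an autoregressive model are restricted to, and normalized over, a fixed instruction set.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a sequence from its per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def renormalize(log_probs):
    """Normalize sequence probabilities over a fixed candidate set."""
    top = max(log_probs.values())
    unnorm = {k: math.exp(v - top) for k, v in log_probs.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Hypothetical per-token probabilities assigned by a generative model.
candidates = {
    "build a dragon": [0.5, 0.5, 0.8],  # joint probability 0.2
    "attack": [0.1],                    # joint probability 0.1
    "mine": [0.1],                      # joint probability 0.1
}
log_probs = {text: sequence_log_prob(p) for text, p in candidates.items()}
normalized = renormalize(log_probs)
print(normalized)
```

After re-normalization the three candidates carry probabilities of roughly 0.5, 0.25 and 0.25, which sum to one over the set and can therefore be compared with a softmax classifier over the same instructions.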

The instructor model must also decide at each time step whether to issue a new command or leave the executor to follow previous instructions. We add a simple binary classifier that conditions on the global feature, and only sample a new instruction if its result is positive.

Because the game is only partially observable, it is important to consider historical information when deciding on an instruction. For simplicity, we add a running average of the number of enemy units of each type that have appeared in the visible region as an extra input to the model. To make the instructor model aware of how long the current instruction has been executing, we add an extra input representing the number of time steps that have passed since the current instruction was issued. As mentioned above, these extra inputs are fed into separate MLPs and become part of the Extra feature.

4.4 Training

Since one game may last for tens of thousands of frames, it is neither feasible nor necessary to use all frames for training. Instead, we sample one frame from each fixed-length window of consecutive frames to form the supervised learning dataset. To preserve unit-level actions for executor training, we move all actions that happen within a window onto the sampled frame of that window where possible. Actions that cannot happen on the sampled frame, such as actions for new units built after that frame, are simply discarded.
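The frame subsampling can be sketched as follows. The window length `skip`, the choice of the first frame of each window as the sampled frame, and the `unit_birth_frame` bookkeeping are assumptions made for this illustration; the text only states that actions are moved onto the sampled frame where possible and discarded otherwise.

```python
def subsample(actions_by_frame, unit_birth_frame, num_frames, skip):
    """Collapse each window of `skip` frames onto its first frame.

    actions_by_frame: dict mapping frame index -> list of (unit_id, action).
    unit_birth_frame: dict mapping unit_id -> frame at which the unit appeared.
    Returns a dict mapping sampled frame index -> list of (unit_id, action).
    """
    samples = {}
    for start in range(0, num_frames, skip):
        kept = []
        for frame in range(start, min(start + skip, num_frames)):
            for unit_id, action in actions_by_frame.get(frame, []):
                # A unit created after the sampled frame cannot act on it,
                # so its action is discarded.
                if unit_birth_frame[unit_id] <= start:
                    kept.append((unit_id, action))
        samples[start] = kept
    return samples


actions = {
    0: [("u1", "Move")],
    3: [("u2", "Attack")],  # u2 is built at frame 2, after sampled frame 0
    7: [("u1", "Gather")],
}
births = {"u1": 0, "u2": 2}
dataset = subsample(actions, births, num_frames=10, skip=5)
print(dataset)
```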

Human players sometimes did not execute instructions immediately. To ensure our executor acts promptly, we filter out action-less frames between a new instruction and the first new actions.

4.4.1 Executor Model

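The executor is trained to minimize the following negative log-likelihood loss:

\[
\mathcal{L} = -\log P_{\mathrm{cont}}(c \mid s) - (1 - c) \sum_{i=1}^{|u|} \log P_{A}(a_{u_i} \mid s)
\]

where $s$ represents the game state and instruction, $P_{\mathrm{cont}}(\cdot \mid s)$ is the executor Global Continue classifier (see §4.2), $c$ is a binary label that is 1 if all units should continue their previous action, and $P_{A}(a_{u_i} \mid s)$ is the likelihood of unit $u_i$ doing the correct action $a_{u_i}$.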

4.4.2 Instructor Model

The loss for the instructor model is the sum of a loss for deciding whether to issue a new instruction, and the loss for issuing the correct instruction:

\[
\mathcal{L} = -\log P_{\mathrm{cont}}(c \mid s) + (1 - c) \cdot \mathcal{L}_{\mathrm{lang}}
\]

where $s$ represents the game state and current instruction, $P_{\mathrm{cont}}(\cdot \mid s)$ is the continue classifier, and $c$ is a binary label with $c = 1$ indicating that no new instruction is issued. The language loss $\mathcal{L}_{\mathrm{lang}}$ is the loss for choosing the correct instruction, and is defined separately for each model.

For the OneHot instructor, $\mathcal{L}_{\mathrm{lang}}$ is simply the negative log-likelihood of a categorical classifier over a pool of instructions. If the true target is not in the candidate pool, $\mathcal{L}_{\mathrm{lang}}$ is 0.

Because BoW and Rnn Discriminative can compositionally encode any instruction (in contrast to OneHot), we can additionally train on instructions from outside the candidate pool. To do this, we encode the true instruction, and discriminate against the instructions in the candidate pool and additional randomly sampled instructions. The true target is forced to appear in the candidates. We then use the NLL of the true target as the language loss. This approach approximates the expensive softmax over all 40K unique instructions.

For Rnn Generative, the language loss is the standard autoregressive loss.

5 Experiments

| Executor Model | Negative Log Likelihood | Win/Lose/Draw Rate (%) |
|---|---|---|
| ExecutorOnly | 3.16 ± 0.0031 | 39.1/39.3/21.6 |
| OneHot | 3.06 ± 0.0026 | 53.2/29.0/17.8 |
| BoW | 2.91 ± 0.0040 | 61.9/28.8/9.3 |
| Rnn | 2.90 ± 0.0039 | 65.1/24.2/10.7 |

Table 2: Negative log-likelihoods of human actions for executor models, and win-rates against ExecutorOnly, which does not use latent language. We use the Rnn Discriminative instructor with 500 instructions. Modelling instructions compositionally improves performance, showing linguistic structure enables generalization.

We compare different executor (§5.1) and instructor (§5.2) models in terms of both likelihoods and end-task performance. We show that hierarchical models perform better, and that the compositional structure of language improves results by allowing parameter sharing across many instructions.

| Instructor Model | NLL (N=50) | NLL (N=250) | NLL (N=500) | Win/Lose/Draw % (N=50) | Win/Lose/Draw % (N=250) | Win/Lose/Draw % (N=500) |
|---|---|---|---|---|---|---|
| OneHot | 0.622 ± 0.004 | 0.795 ± 0.004 | 0.880 ± 0.002 | 53.9/34.3/11.8 | 59.3/25.4/15.3 | 56.9/28.8/14.3 |
| BoW | 0.603 ± 0.001 | 0.774 ± 0.002 | 0.833 ± 0.002 | 48.7/37.4/13.9 | 59.9/26.9/13.2 | 59.1/30.7/10.2 |
| Rnn Discriminative | 0.590 ± 0.002 | 0.745 ± 0.004 | 0.797 ± 0.006 | 55.7/30.6/13.7 | 61.7/25.2/13.1 | 65.1/24.2/10.7 |
| Rnn Generative | 0.603 ± 0.006 | 0.767 ± 0.003 | 0.826 ± 0.007 | 53.9/30.2/15.9 | 61.5/26.7/11.8 | 62.7/27.3/10.0 |

Table 3: Win-rates and negative log-likelihoods (NLL) for different instructor models, with the N most frequent instructions. Win-rates are against a non-hierarchical executor model, and use the Rnn executor. Better results are achieved with larger instruction sets and more compositional instruction encoders.

5.1 Executor Model

The executor model learns to ground pairs of states and instructions onto actions. With over 76 thousand examples, a large action space, and multiple sentences of context, this problem in isolation is one of the largest and most challenging tasks currently available for grounding language in action.

We evaluate executor performance with different instruction encoding models (§4.1.3). Results are shown in Table 2, and show that modelling instructions compositionally—by encoding words (BoW) and word order (Rnn)—improves both the likelihoods of human actions, and win-rates over non-compositional instructor models (OneHot). The gain increases with larger instruction sets, demonstrating that a wide range of instructions are helpful, and that exploiting the compositional structure of language is crucial for generalization across large instruction sets.

We additionally ablate the importance of considering multiple recent instructions during execution (our model performs attention over the 5 most recent commands, §4.2). When considering only the current instruction with the Rnn executor, we find that the win-rate drops to 57.0% (from 65.1%) and the negative log-likelihood worsens from 2.90 to 2.95.

5.2 Instructor Model

We compare different instructor models for mapping game states to instructions. As in §5.1, we experiment with non-compositional, bag-of-words and RNN models for instruction generation. For the RNNs, we train both a discriminative model (which maps complete instructions onto vectors, and then chooses between them) and a generative model that outputs words auto-regressively.

Evaluating language generation quality is challenging, as many instructions may be reasonable in a given situation, and they may have little word overlap. We therefore compare the likelihood of the human instructions. Our models choose from a fixed set of instructions, so we measure the likelihood of choosing the correct instruction, normalized over all instructions in the set. Likelihoods across different instruction sets are not comparable.

Table 3 shows that, as in §5.1, more structured instruction models give better likelihoods, particularly for larger instruction sets, which are harder to model non-compositionally.

We compare the win-rate of our models against a baseline which directly imitates human actions (without latent language). All latent instruction models outperform this baseline. More compositional instruction encoders improve performance, and can use more instructions effectively. These results demonstrate the potential of language for compositionally representing large spaces of complex plans.

5.3 Qualitative Analysis

Observing games played by our model, we find that most instructions are both generated and executed as humans plausibly would. The executor is often able to correctly count the number of units it should create in commands such as build 3 dragons.

There are several limitations. The executor sometimes acts without instructions. This is partly due to mimicking some human behaviour, but it also indicates a failure to learn dependencies between instructions and actions. The instructor sometimes issues commands that are impossible in its state (e.g. to attack with a unit that it does not have), causing weak behaviour from the executor model.

6 Related work

Previous work has used language to specify exponentially many policies  [14, 1, 27], allowing zero-shot generalization across tasks. We develop this work by generating instructions as well as executing them. We also show how complex tasks can be decomposed into a series of instructions and executions.

Executing natural language instructions has seen much attention. The task of grounding language into an executable representation is sometimes called semantic parsing [29], and has been applied to navigational instruction following, e.g. [2]. More recently, neural models for instruction following have been developed for a variety of domains, for example [11] and [10]. Our dataset offers a challenging new problem for instruction following, as different instructions will apply to different subsets of available units, and multiple instructions may apply at a given time.

Instruction generation has been studied as a separate task. [7] map navigational paths onto instructions. [8] generate instructions for complex tasks that humans can follow, and [9] train a model for instruction generation, which is used both for data augmentation and for pragmatic inference when following human-generated instructions. We build on this work by also generating instructions at test time, and showing that latent language improves performance.

Learning to play a complete real-time strategy game, including unit building, resource gathering, defence, invasion, scouting, and expansion, remains a challenging problem [15], in particular due to the complexity and variations of commercially successful games (e.g., StarCraft I/II) and their demand for computational resources. Traditional approaches focus on sub-tasks with hand-crafted features and value functions (e.g., building orders [5], spatial placement of buildings [4], and attack tactics between two groups of units [6]). Inspired by the recent success of deep reinforcement learning, more works focus on training a neural network to complete sub-tasks [24, 17], some with strong computational requirements [28]. For full games, [23] shows that it is possible to train an end-to-end agent on a small-scale RTS game with predefined macro actions, and TStarBot [20] applies this idea to StarCraft II and shows that the resulting agent can beat carefully designed, and even cheating, rule-based AI. By using human demonstrations, we avoid hand-crafting macro-actions.

Learning an end-to-end agent that plays RTS games with unit-level actions is even harder. Progress is reported for MOBA games, a sub-genre of RTS games with fewer units. For example, [16] shows that achieving a professional level of play in DoTA2 is possible with massive computation, and [26] shows that with supervised pre-training on unit actions and hierarchical macro strategies, a learned agent on Honor of Kings is on par with a top 1% human player.

7 Conclusion

We introduced a framework for decomposing complex tasks into steps of planning and execution, connected with a natural language interface. We experimented with this approach on a new strategy game which is simple to learn but features challenging strategic decision making. We collected a large dataset of human instruction generations and executions, and trained models to imitate each role. Results show that exploiting the compositional structure of natural language improves generalization for both the instructor and executor model, significantly outperforming agents without latent language. Future work should use reinforcement learning to further improve the planning and execution models, and explore generating novel instructions.



Appendix A Detailed game design

We develop an RTS game based on the MiniRTS framework, aspiring to make it intuitive for humans, while still providing a significant challenge to machines due to extremely high-dimensional observation and actions spaces, partial observability, and non-stationary environment dynamics imposed by the opponent. Below we describe the key game concepts.

A.1 Game unit specifications

Building units

Our game supports 6 different building types, each implementing a particular function in the game. Any building can be constructed by a Peasant at any available map location by spending a specified amount of resources. Once constructed, a building can be used to produce units. Most building types can produce at most one unit type; the exception is the Workshop, which can produce 3 different unit types. This property of the Workshop allows various strategies involving bluffing. A full list of available building units can be found in Table 4.

Army units
Figure 4: Our game implements the rock-paper-scissors attack graph, where each unit has some units it is effective against and vulnerable to.

The game provides a player with 7 army unit types, each having different strengths and weaknesses. Peasant is the only unit type that can construct building units and mine resources, so it is essential for advancing to the later stages of the game. We design the attack relationships between each unit type with a rock-paper-scissors dynamic—meaning that each unit type has another unit type that it is effective against or vulnerable to. This property means that effective agents must be reactive to their opponent’s strategy. See Fig. 4 for a visualization. Descriptions of army units can be found in Table 5.

| Building name | Description |
|---|---|
| Town Hall | The main building of the game, it allows a player to train Peasants and serves as a storage for mined resources. |
| Barrack | Produces Spearmen. |
| Blacksmith | Produces Swordmen. |
| Stable | Produces Cavalry. |
| Workshop | Produces Catapult, Dragon and Archer. The only building that can produce multiple unit types. |
| Guard Tower | A building that can attack enemies, but cannot move. |

Table 4: The list of the building units available in the game.

Resource unit

Resource is a stationary and neutral unit type. It cannot be constructed by anyone, and is only created during the map generation phase. Peasants of both teams are allowed to mine the same Resource unit until it is exhausted. The initial capacity is set to 500, and one mine action subtracts 10 points from the Resource, so a Resource unit supports 50 mine actions before it is exhausted. Several Resource units are placed randomly on the map, which gives rise to many strategies around Resource domination.

a.2 Game map

We represent the game map as a discrete 32x32 grid. Each cell of the grid is either grass or water: a grass cell is passable for any army unit, while a water cell blocks all units except Dragon. Having water cells around one’s main base can be leveraged as natural protection. We generate maps randomly for each new game. We first place one Town Hall for each player randomly. We then add some water cells to the map, making sure that there is at least one path between the two opposing Town Halls, but otherwise aiming to create bottlenecks. Finally, we randomly place several Resource units on the map such that they are approximately equidistant from the players’ Town Halls.
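The connectivity constraint in the map generation procedure can be sketched as follows. This is an illustrative simplification under stated assumptions, not the released generator. Movement is assumed to be 4-connected. Water is placed one random cell at a time, and a placement is rejected if it would cut the last grass path between the Town Halls. The `num_water` parameter is hypothetical, and Resource placement is omitted.

```python
import random
from collections import deque

SIZE = 32  # grid side length (from the text)
GRASS, WATER = 0, 1


def has_path(grid, start, goal):
    """Breadth-first search over grass cells with 4-connected moves (assumed)."""
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < SIZE and 0 <= ny < SIZE
                    and grid[ny][nx] == GRASS and (nx, ny) not in seen):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False


def generate_map(num_water=200, seed=None):
    """Place two Town Halls, then add water without disconnecting them."""
    rng = random.Random(seed)
    grid = [[GRASS] * SIZE for _ in range(SIZE)]
    cells = [(x, y) for x in range(SIZE) for y in range(SIZE)]
    hall_a, hall_b = rng.sample(cells, 2)
    candidates = [c for c in cells if c not in (hall_a, hall_b)]
    rng.shuffle(candidates)
    placed = 0
    for x, y in candidates:
        if placed == num_water:
            break
        grid[y][x] = WATER
        if has_path(grid, hall_a, hall_b):
            placed += 1
        else:
            grid[y][x] = GRASS  # revert: this cell would cut the last path
    return grid, hall_a, hall_b
```

Rejecting only the placements that break connectivity tends to leave narrow passages, which is one simple way to obtain the bottlenecks mentioned above.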

| Unit name | Description |
|---|---|
| Peasant | Gathers minerals and constructs buildings, not good at fighting. |
| Spearman | Effective against cavalry. |
| Swordman | Effective against spearmen. |
| Cavalry | Effective against swordmen. |
| Dragon | Can fly over obstacles, can only be attacked by archers and towers. |
| Archer | Great counter unit against dragons. |
| Catapult | Easily demolishes buildings. |

Table 5: The list of the army units available in the game.

Appendix B RTS game as a Reinforcement Learning environment

Our platform can also be used as an RL environment. In our code base we implement a framework that allows straightforward interaction with the game environment in a canonical RL training loop. Below we detail the environment properties.

b.1 Observation space

We leverage both the spatial representation of the map and the internal state of the game engine (e.g. units’ health points and attack cooldowns, the amount of resources, etc.) to construct an observation. We carefully address the fog of war by masking out the regions of the map that have not been visited. In addition, we remove the attributes of any unseen enemy units from the observation. The partial observability of the environment makes it especially challenging to apply RL due to the highly non-stationary state distribution.
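The two masking steps can be sketched on plain nested lists. The function names, the channel layout, and the dictionary-based unit records are assumptions made for illustration; the released code may represent observations differently.

```python
def mask_unvisited(channels, visited):
    """Zero every channel value at cells that have not been visited.

    channels: list of 2D grids, one per feature channel.
    visited:  2D grid of booleans with the same height and width.
    """
    return [
        [
            [value if visited[y][x] else 0 for x, value in enumerate(row)]
            for y, row in enumerate(grid)
        ]
        for grid in channels
    ]


def hide_unseen_enemies(enemy_units, visible_cells):
    """Keep only enemy units standing on currently visible cells."""
    return [u for u in enemy_units if (u["x"], u["y"]) in visible_cells]
```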

b.2 Action space

At each timestep of the environment we predict an action for each of our units, both buildings and army. The action space is consequently large; for example, any unit can go to any location at each timestep. Prediction of a unit’s action proceeds in steps: we first predict an action type (e.g. Move or Attack), then, based on the action type, we predict the action outputs. For example, for the Build Building action type the outputs will be the type of the future building and its location on the game map. We summarize all available action types and their structure in Table 6.

| Action Type | Action Output | Input Features |
|---|---|---|
| Continue | NULL | NULL |
| Gather | resource_id | resources_features |
| Attack | enemy_unit_id | enemy_units_features |
| Train Unit | unit_type | unit_type_features |
| Build Building | unit_type, (x,y) | unit_type_features, map_cells_features |
| Move | (x,y) | map_cells_features |

Table 6: We implement a separate action classifier per action type, because each action type needs to model a probability distribution over different objects (Action Output). For example, for the Attack action we need to estimate a probability distribution over all visible enemy units and predict an enemy unit id, while the Build Building action needs to model two probability distributions, one over the building type to be constructed and another over all possible discrete locations on the map where the future building will be placed.
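The two-stage structure in Table 6 can be sketched as a sampling routine. This is an illustration only: the probabilities are passed in as plain dictionaries standing in for classifier outputs, the slot names are hypothetical, and the caller is assumed to supply output distributions already conditioned on the chosen action type.

```python
import random

# Output slots per action type, following Table 6.
ACTION_OUTPUTS = {
    "Continue": (),
    "Gather": ("resource_id",),
    "Attack": ("enemy_unit_id",),
    "Train Unit": ("unit_type",),
    "Build Building": ("unit_type", "xy"),
    "Move": ("xy",),
}


def sample_action(type_probs, output_probs, rng):
    """Sample an action type first, then each output slot for that type.

    type_probs:   dict mapping action type -> probability.
    output_probs: dict mapping slot name -> dict of candidate -> probability.
    """
    types = list(type_probs)
    action_type = rng.choices(types, weights=[type_probs[t] for t in types])[0]
    outputs = {}
    for slot in ACTION_OUTPUTS[action_type]:
        candidates = list(output_probs[slot])
        weights = [output_probs[slot][c] for c in candidates]
        outputs[slot] = rng.choices(candidates, weights=weights)[0]
    return action_type, outputs
```

Factoring the decision this way means each classifier only has to normalize over its own kind of object (enemy units, unit types, or map cells) rather than over the joint action space.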

b.3 Reward structure

We support a sparse reward structure: for example, a reward of 1 is issued to an agent at the end of the game if the game is won, and all other timesteps receive a reward of 0. Such a reward structure makes exploration especially challenging given the high dimensionality of the action space and the length of the planning horizon.
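A minimal sketch of this reward, with hypothetical function names, makes the sparsity explicit: every timestep but the last contributes nothing to the return.

```python
def sparse_reward(game_over, won):
    """Return 1 only on the final timestep of a won game, 0 otherwise."""
    return 1 if game_over and won else 0


def episode_return(num_steps, won):
    """Sum the per-timestep rewards of an episode with num_steps timesteps."""
    return sum(sparse_reward(t == num_steps - 1, won) for t in range(num_steps))
```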

Appendix C Data collection

We design a data collection task based on ParlAI, a transparent framework to interact with human workers. We develop separate game control interfaces for both the instructor and the executor players, and ask two humans to play the game collaboratively against a rule-based AI opponent. Both players have the same access to the game observation, but different control abilities.

The instructor control interface allows the human player to perform the following actions:

  • Issue a natural language instruction to the executor at any time of the game. We allow any free-form language instruction.

  • Pause the game flow at any time. Pausing allows the instructor to analyze the game state more thoroughly and plan strategically.

  • Warn the executor player in case they do not follow issued instructions precisely. This option allows us to improve data quality, by filtering executors who do not follow instructions.

On the other hand, the executor player gets to:

  • Control the game units by direct manipulation using the computer’s input devices (e.g. mouse). The executor is tasked with completing the current instruction, rather than winning the game.

  • Ask the instructor for either a new instruction, or a clarification.

Each human worker is randomly assigned either the instructor or the executor role, so the same person can experience the game on both ends over multiple attempts.

| Strategy Name | Description |
|---|---|
| Simple | This strategy first sends all 3 initially available Peasants to mine the closest resource, then chooses one army unit type from Spearman, Swordman, Cavalry, Archer, or Dragon, constructs the corresponding building, and finally trains 3 units of the selected type and sends them to attack. The strategy then continuously maintains an army size of 3, replacing any army unit that dies. |
| Medium | Same as the Simple strategy, except that the size of the army is randomly selected between 3 and 7. |
| Strong | This strategy is adaptive and reacts to the opponent’s army. In particular, it constantly scouts the map using one Peasant to learn the opponent’s behaviour. Once it sees the opponent’s army, it immediately trains a counter army based on the attack graph (see Fig. 4). Then it follows the Medium strategy. |
| Second Base | This strategy aims to build a second Town Hall near the second closest resource and then uses the double income to build a large army of a particular unit type. The other behaviours are the same as in the Medium strategy. |
| Tower Rush | A non-standard strategy that first scouts the map with a spare Peasant in order to find the opponent. Once it finds the opponent, it starts building Guard Towers close to the opponent’s Town Hall so they can attack the opponent’s units. |
| Peasant Rush | This strategy sends the first 3 Peasants to mine, then keeps producing more Peasants and sending them to attack the opponent. The hope of this strategy is to beat the opponent by surprise. |

Table 7: The rule-based strategies we use as an opponent to the human players during data collection.

c.1 Quality control

To make sure that we collect data of high quality we take the following steps:

Game manual

We provide a detailed list of instructions to a human worker at the beginning of each game and during the game’s duration. This manual aims to give a comprehensive overview of various game elements, such as player roles, army and building units, control mechanics, etc. We also record several game replays that serve as an introductory guideline to the players.


Onboarding

We implement an onboarding process to make sure that novice players are comfortable with the game mechanics, so that they can play with other players effectively. For this, we ask a novice player to perform the executor’s duties and pair them with a bot that issues a pre-defined set of natural language instructions that implements a simple walkthrough strategy. We allocate enough time for the human player to work on the current instruction, and to also get comfortable with the game flow. We let the novice player play several games until we verify that they pass the required quality bar. We assess the performance of the player by running a set of pattern-matching scripts that verify whether the performed control actions correspond to the issued instructions (for example, if an instruction says "build a barrack", we make sure that the player executes the corresponding low-level action). If the human player doesn’t pass our qualification requirements within 5 games, we prevent them from participating in our data collection going forward and filter their games from the dataset.
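A pattern-matching check of this kind might look like the sketch below. The rules, the `(action type, unit type)` tuple format, and the function name are hypothetical; the actual qualification scripts are not described in detail in the text.

```python
import re

# Hypothetical rules: instruction pattern -> (action type, unit type) that must
# appear among the executor's low-level actions.
RULES = [
    (re.compile(r"\bbuild\b.*\bbarracks?\b", re.IGNORECASE),
     ("Build Building", "Barrack")),
    (re.compile(r"\bbuild\b.*\bstable\b", re.IGNORECASE),
     ("Build Building", "Stable")),
]


def follows_instruction(instruction, actions):
    """Return True/False when a rule applies, or None when no rule matches.

    actions: list of (action_type, unit_type) tuples issued by the executor.
    """
    for pattern, required in RULES:
        if pattern.search(instruction):
            return required in actions
    return None
```

Returning `None` for unmatched instructions keeps free-form language that no rule covers from counting against the player.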

Player profile

We track the performance of each player, broken down by role (e.g. instructor or executor). We gather various statistics about each player and build a comprehensive player profile. For example, for the instructor role we gather data such as overall win rate, the number of instructions issued per game, and the diversity of issued instructions; for the executor role we monitor how well they perform on the issued instructions (using a pattern-matching algorithm), the number of warnings they receive from the instructor, and more. We then use this profile to decide whether to upgrade a particular player to playing against stronger opponents (see Section C.2) if they are performing well, or otherwise to prevent them from participating in our data collection at all.


We use several initial rounds of data collection as a source of feedback from the human players. The received feedback helps us improve the game quality. Importantly, after we finalize the game configuration, we exclude all the previously collected data from our final dataset.

Final filtering

Lastly, we take another filtering pass against all the collected game replays and eliminate those replays that don’t meet the following requirements:

  • A game should have at least natural language instructions issued by the instructor.

  • A game should have at least low-level control actions issued by the executor.

By implementing all the aforementioned safeguards we are able to gather a high-quality dataset.
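The final filtering pass can be sketched as a simple predicate over replays. The minimum counts are deliberately left as required parameters, since their values are not given here. The replay dictionary layout is also an assumption.

```python
def keep_replay(replay, min_instructions, min_actions):
    """Check both filtering requirements for a single replay.

    replay: dict with "instructions" and "actions" lists (assumed layout).
    """
    return (len(replay["instructions"]) >= min_instructions
            and len(replay["actions"]) >= min_actions)


def filter_replays(replays, min_instructions, min_actions):
    """Return only the replays that satisfy both requirements."""
    return [r for r in replays
            if keep_replay(r, min_instructions, min_actions)]
```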

c.2 Rule-based bots

We design a set of diverse game strategies that are implemented by our rule-based bots (Table 7). Our handcrafted strategies explore many of the possibilities that the game offers, which in turn allows us to gather a multitude of emergent human behaviours in our dataset. Additionally, we employ a resource scaling hyperparameter, which controls the amount of resources a bot gets during mining. This hyperparameter offers finer control over the bot’s strength, which we find beneficial for onboarding novice human players. During data collection, we pair a team of two human players (the instructor and executor) with a randomly sampled instance of a rule-based strategy and the resource scaling hyperparameter, so the human players do not know in advance who their opponent is. This property rewards reactive players. We later observe that our models are able to learn the scouting mechanics from the data, which is a crucial skill to be successful in our game.
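Opponent selection could be sketched as below. The strategy names come from Table 7. Uniform sampling, the candidate scaling values, and the function name are assumptions, since the text does not specify the sampling distribution.

```python
import random

# Strategy names from Table 7.
STRATEGIES = [
    "Simple", "Medium", "Strong", "Second Base", "Tower Rush", "Peasant Rush",
]


def sample_opponent(rng, resource_scales):
    """Pick a strategy and a resource scaling value uniformly at random.

    resource_scales: candidate scaling values (hypothetical, caller-supplied).
    """
    return rng.choice(STRATEGIES), rng.choice(resource_scales)
```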

Figure 5: Frequency histograms for the dataset instructions and words. (a) Top most frequent instructions. (b) Top most frequent words.

Appendix D Model architecture

d.1 Convolutional channels of Spatial Encoder

We use the following set of convolutional channels to extract different bits of information from spatial representation of the current observation.

  1. Visibility: 3 binary channels for each state of visibility of a cell (Visible, Seen, and Invisible).

  2. Terrain: 2 binary channels for each terrain type of a cell (grass or water).

  3. Our Units: 13 channels for each unit type of our units. Here, a cell contains the number of our units of the same type located in it.

  4. Enemy Units: similarly 13 channels for visible enemy units.

  5. Resources: 1 channel for resource units.
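The channel layout above can be made concrete with a short sketch. The group names, the stacking order, and the `(unit_type, x, y)` tuple format are illustrative assumptions. The group sizes are taken from the list and sum to 32 channels.

```python
# Channel groups and their sizes, following the list above.
CHANNEL_GROUPS = [
    ("visibility", 3),
    ("terrain", 2),
    ("our_units", 13),
    ("enemy_units", 13),
    ("resources", 1),
]


def channel_offsets():
    """Map each group name to its (start, end) channel range, end exclusive."""
    offsets, start = {}, 0
    for name, size in CHANNEL_GROUPS:
        offsets[name] = (start, start + size)
        start += size
    return offsets


def count_units(units, num_types=13, size=32):
    """Build per-type count grids: grids[t][y][x] = number of units of type t.

    units: iterable of (unit_type_index, x, y) tuples (assumed format).
    """
    grids = [[[0] * size for _ in range(size)] for _ in range(num_types)]
    for unit_type, x, y in units:
        grids[unit_type][y][x] += 1
    return grids
```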

d.2 Action Classifiers

At each step of the game we predict actions for each of the player’s units by performing a separate forward pass of the following network for each unit. First, we run an MLP-based action classifier (Fig. 6) to sample the unit’s Action Type. We feed the unit’s global summary features (see Fig. 3 of the main paper) into the classifier and sample an action type (see Table 6 for the full list of possible actions). Then, given the sampled action type, we predict the Action Output based on the unit’s features, unit-dependent instruction features, and the action input features. We provide an overview of Action Outputs and Input Features for each action in Table 6; see also the diagram in Fig. 7.

Figure 6: The Action Type classifier is parameterized as an MLP network to model a softmax distribution over action types based on the unit’s global summary features vector.
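For intuition, an MLP that outputs a softmax distribution over the six action types of Table 6 can be sketched in plain Python. The single hidden layer, the ReLU non-linearity, and all parameter names are assumptions. The text specifies only that the classifier is an MLP over the unit's global summary features.

```python
import math

ACTION_TYPES = [
    "Continue", "Gather", "Attack", "Train Unit", "Build Building", "Move",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, inputs):
    """Affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def action_type_probs(features, w1, b1, w2, b2):
    """One hidden layer with ReLU (assumed), then softmax over action types."""
    hidden = [max(0.0, h) for h in linear(w1, b1, features)]
    return softmax(linear(w2, b2, hidden))
```

An action type can then be drawn from the returned probabilities, for example with `random.choices(ACTION_TYPES, weights=probs)`.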

Appendix E Dataset details

Through our data collection we gather a dataset of over 76 thousand instructions and corresponding executions. We observe a wide variety of different strategies and their realizations in natural language. For example, we observe the emergence of complicated linguistic constructions (Table 8).

We also study the distribution of collected instructions. While we notice that some instructions are more frequent than others, we still observe good coverage of strategy realizations, which serves as a basis for generalization. In Table 10 we provide a list of the most frequently used instructions, and Fig. 5 shows the overall frequency distribution for instructions and words in our dataset.
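Frequency statistics of this kind can be computed with a few lines of Python. The normalization choices here (trimming whitespace for instructions, lowercasing and stripping punctuation for words) are assumptions and may differ from the preprocessing behind Table 10 and Fig. 5.

```python
import string
from collections import Counter


def instruction_counts(instructions):
    """Count exact instruction strings after trimming whitespace."""
    return Counter(text.strip() for text in instructions)


def word_counts(instructions):
    """Count lowercase words with surrounding punctuation removed."""
    counts = Counter()
    for text in instructions:
        for token in text.lower().split():
            word = token.strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts
```

Calling `instruction_counts(data).most_common(100)` would then yield a ranked list in the style of Table 10.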

Finally, we provide a random sample of 50 instructions from our dataset in Table 9, showing the diversity and complexity of the collected instructions.

| Linguistic Phenomena | Example |
|---|---|
| Counting | Build 3 dragons. |
| Spatial Reference | Send him to the choke point behind the tower. |
| Locations | Build one to the left of that tower. |
| Composed Actions | Attack archers, then peasants. |
| Cross-instruction anaphora | Use it as a lure to kill them. |
Table 8: Complex linguistic phenomena emerge as humans instruct others how to play the game.
Build 1 more cavalry.
Attack peaons.
Build barrack in between south pass at new town.
Have all peasants gather minerals next to town hall.
Have all peasants mine ore.
Fight u peaas.
Stop the peasants from mining.
Build a new town hall between the two west minerals patches.
Build 2 more swords.
Use cavalry to attack enemey.
Explore and find miners.
If you see any idle peasants please have them build.
Okay that doesn’t work then build them on your side of the wall then.
Create 4 more archers.
Make a new town hall in the middle of all 3.
Attack tower with catas.
Kill cavalry and peasants then their townhall.
Attack enemy peasants with cavalry as well.
Send all peasants to collect minerals.
Attack enemy peasant.
Keep creating peasants and sending them to mine.
Send one catapult to attack the northern guard tower send a dragon for protection.
Send all but 1 peasant to mine.
Mine with the three peasants.
Use that one to scout and don’t stop.
Bring scout back to base to mine.
You’ll need to attack them with more peasants to kill them.
Build a barracks.
Send all peasants to find a mine and mine it.
Start mining there with your 3.
Make four peasants.
Move archers west then north.
Attack with cavalry.
Make two more workers.
Make 2 more calvary and send them over with the other ones.
Return to base with scout.
Build 2 peasants at the new mine.
If attacked retreat south.
Make the rest gather minerals too.
All peasants flee the enemy.
Attack the peasants in the area.
Attack the last archer with all peasants on the map.
Table 9: Examples of randomly sampled instructions.
| Instruction | Frequency | Instruction | Frequency |
|---|---|---|---|
| Attack. | 527 | Send idle peasants to mine. | 68 |
| Send all peasants to mine. | 471 | Attack that peasant. | 68 |
| Build a workshop. | 414 | Send all peasants to mine minerals. | 65 |
| Retreat. | 323 | Build a barracks. | 64 |
| Build a stable. | 278 | Build barrack. | 62 |
| Send peasants to mine. | 267 | Return to mine. | 62 |
| All peasants mine. | 266 | Build peasant. | 61 |
| Send idle peasant to mine. | 211 | Build catapult. | 61 |
| Build workshop. | 191 | Create a dragon. | 61 |
| Build a dragon. | 168 | Mine with peasants. | 60 |
| Kill peasants. | 168 | Build 3 peasants. | 59 |
| Attack enemy. | 166 | Defend. | 58 |
| Attack peasants. | 159 | Build cavalry. | 58 |
| Build a guard tower. | 146 | Make an archer. | 58 |
| Attack the enemy. | 142 | Attack dragon. | 58 |
| Stop. | 141 | Send all peasants to collect minerals. | 57 |
| Attack peasant. | 139 | Defend base. | 57 |
| Kill that peasant. | 132 | Build 2 more peasants. | 56 |
| Mine. | 119 | Build 2 peasants. | 55 |
| Build another dragon. | 113 | Make 2 archers. | 55 |
| Make another peasant. | 113 | Make dragon. | 54 |
| Build stable. | 112 | Build 2 dragons. | 54 |
| Make a dragon. | 110 | Attack dragons. | 54 |
| Build a blacksmith. | 108 | Make a stable. | 53 |
| Build a catapult. | 108 | Make a catapult. | 53 |
| Back to mining. | 106 | Build 6 peasants. | 52 |
| Build another peasant. | 104 | Attack archers. | 50 |
| Make a peasant. | 98 | Kill all peasants. | 50 |
| Build a barrack. | 97 | Build 2 catapults. | 50 |
| Build 4 peasants. | 93 | Idle peasant mine. | 49 |
| Have all peasants mine. | 92 | Make peasant. | 48 |
| Build 2 archers. | 90 | Attack enemy peasant. | 48 |
| Build dragon. | 87 | Attack archer. | 48 |
| Attack with peasants. | 87 | Build another archer. | 47 |
| Return to mining. | 87 | Make 4 peasants. | 47 |
| Build a peasant. | 86 | Make 3 peasants. | 47 |
| Idle peasant to mine. | 85 | Build 2 more archers. | 46 |
| Make a workshop. | 83 | Send idle peasant back to mine. | 46 |
| Create a workshop. | 81 | Make more peasants. | 46 |
| Mine with all peasants. | 80 | Make 2 more peasants. | 46 |
| Build 3 more peasants. | 79 | Build blacksmith. | 46 |
| Create another peasant. | 79 | Collect minerals. | 45 |
| Send all idle peasants to mine. | 77 | Kill. | 45 |
| Build 3 archers. | 77 | Build an archer. | 45 |
| Kill peasant. | 77 | Keep mining. | 45 |
| Make another dragon. | 76 | Keep attacking. | 43 |
| Kill him. | 72 | Attack dragons with archers. | 43 |
| Build guard tower. | 70 | Create a stable. | 42 |
| Attack town hall. | 70 | Make 3 more peasants. | 42 |
| Start mining. | 69 | Attack the peasant. | 41 |
Table 10: The top 100 instructions sorted by their usage frequency.
Figure 7: Separate classifiers for each of the available action types.