CraftAssist: A Framework for Dialogue-enabled Interactive Agents

07/19/2019 ∙ by Jonathan Gray, et al. ∙ Facebook 0

This paper describes an implementation of a bot assistant in Minecraft, and the tools and platform allowing players to interact with the bot and to record those interactions. The purpose of building such an assistant is to facilitate the study of agents that can complete tasks specified by dialogue, and eventually, to learn from dialogue interactions.



There are no comments yet.


page 1

page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

While machine learning (ML) methods have achieved impressive performance on difficult but narrowly-defined tasks

(Silver et al., 2016; He et al., 2017; Mahajan et al., 2018; Mnih et al., 2013), building more general systems that perform well at a variety of tasks remains an area of active research. Here we are interested in systems that are competent in a long-tailed distribution of simpler tasks, specified (perhaps ambiguously) by humans using natural language. As described in our position paper (Szlam et al., 2019), we propose to study such systems through the development of an assistant bot in the open sandbox game of Minecraft111Minecraft features: ©Mojang Synergies AB included courtesy of Mojang AB (Johnson et al., 2016; Guss et al., 2019). This paper describes the implementation of such a bot, and the tools and platform allowing players to interact with the bot and to record those interactions.

The bot appears and interacts like another player: other players can observe the bot moving around and modifying the world, and communicate with it via in-game chat. Figure 1 shows a screenshot of a typical in-game experience. Neither Minecraft nor the software framework described here provides an explicit objective or reward function; the ultimate goal of the bot is to be a useful and fun assistant in a wide variety of tasks specified and evaluated by human players.

Longer term, we hope to build assistants that interact and collaborate with humans to actively learn new concepts and skills. However, the bot described here should be taken as initial point from which we (and others) can iterate. As the bots become more capable, we can expand the scenarios where they can effectively learn.

To encourage collaborative research, the code, data, and models are open-sourced222 The design of the framework is purposefully modular to allow research on components of the bot as well as the whole system. The released data includes the human actions used to build 2,586 houses, the labeling of the sub-parts of the houses (e.g., walls, roofs, etc.), human rewordings of templated commands, and the mapping of natural language commands to bot interpretable logical forms. To enable researchers to independently collect data, the infrastructure that allows for the recording of human and bot interaction on a Minecraft server is also released. We hope these tools will help empower research on agents that can complete tasks specified by dialogue, and eventually, learn form dialogue interactions.

Figure 1: An in-game screenshot of a human player using in-game chat to communicate with the bot.

2 Minecraft

Minecraft333 is a popular multiplayer open world voxel-based building and crafting game. Gameplay starts with a procedurally generated world containing natural features (e.g. trees, mountains, and fields) all created from an atomic set of a few hundred possible blocks. Additionally, the world is populated from an atomic set of animals and other non-player characters, commonly referred to as “mobs”.

Figure 2: An in-game screenshot showing some of the block types available to the user in creative mode.

The game has two main modes: “creative” and “survival”. In survival mode the player is resource limited, can be harmed, and is subject to more restrictive physics. In creative mode, the player is not resource limited, cannot be harmed, and is subject to less restrictive physics, e.g. the player can fly through the air. An in-depth guide to Minecraft can be found at

In survival mode, blocks can be combined in a process called “crafting” to create other blocks. For example, three wood blocks and three wool can be combined to create an atomic “bed” block. In creative mode, players have access to all block types without the need for crafting.

Compound objects are arrangements of multiple atomic objects, such as a house constructed from brick, glass and door blocks. Players may build compound objects in the world by placing or removing blocks of different types in the environment. Figure 2 shows a sample of different block types. The blocks are placed on a 3D voxel grid. Each voxel in the grid contains one material. In this paper, we assume players are in creative mode and we focus on building compound objects.

Minecraft, particularly in its creative mode setting, has no win condition and encourages players to be creative. The diversity of objects created in Minecraft is astounding; these include landmarks, sculptures, temples, roller-coasters and entire cityscapes. Collaborative building is a common activity in Minecraft.

Minecraft allows multiplayer servers, and players can collaborate to build, survive, or compete. It has a huge player base (91M monthly active users in October 2018) 444, and players actively create game mods and shareable content. The multiplayer game has a built-in text chat for player to player communication. Dialogue between users on multi-user servers is a standard part of the game.

3 Client/Server Architecture

Minecraft operates through a client and server architecture. The bot acting as a client communicates with the Minecraft server using the Minecraft network protocol555We have implemented protocol version 340, which corresponds to Minecraft Computer Edition v1.12.2, and is described here: The server may receive actions from multiple bot or human clients, and returns world updates based on player and mob actions. Our implementation of a Minecraft network client is included in the top-level client directory.

Implementing the Minecraft protocol enables a bot to connect to any Minecraft server without the need for installing server-side mods, when using this framework. This provides two main benefits:

  1. A bot can easily join a multiplayer server along with human players or other bots.

  2. A bot can join an alternative server which implements the server-side component of the Minecraft network protocol. The development of the bot described in this paper uses the 3rd-party, open source Cuberite server. Among other benefits, this server can be easily modified to record the game state that can be useful information to help improve the bot.

4 Assistant v0

This section outlines our initial approach to building a Minecraft assistant, highlighting some of the major design decisions made:

  • a modular architecture

  • the use of high-level, hand-written composable actions called Tasks

  • a pipelined approach to natural language understanding (NLU) involving a neural semantic parser

A simplified module-level diagram is shown in Figure 3, and the code described here is available at: See Section 8 for a discussion of these decisions and our future plans to improve the bot.

Figure 3: A simplified block diagram demonstrating how the modular system reacts to incoming events (in-game chats and modifications to the block world)

Rather than directly modelling the action distribution as a function of the incoming chat text, our approach first parses the incoming text into a logical form we refer to as an action dictionary, described later in section 5.2.1. The action dictionary is then interpreted by a dialogue object which queries the memory module – a symbolic representation of the bot’s understanding of the world state – to produce an action and/or a chat response to the user.

The bot responds to commands using a set of higher-level actions we refer to as Tasks, such as move to location , or build a at location . The Tasks act as abstractions of long sequences of low-level movement steps and individual block placements. The Tasks are executed in a stack (LIFO) order. The interpretation of an action dictionary by a dialogue object generally produces one or more Tasks, and the execution of the Task (e.g. performing the path-finding necessary to complete a Move command) is performed in a Task object in the bot’s task stack.

4.1 Handling an Example Command

Consider a situation where a human player tells the bot: “go to the blue house”. The Dialogue Manager first checks for illegal or profane words, then queries the semantic parser. The semantic parser takes the chat as input and produces the action dictionary shown in figure 4. The dictionary indicates that the text is a command given by a human, that the high-level action requested is a MOVE, and that the destination of the MOVE is an object that is called a “house” and is “blue” in colour. More details on action dictionaries are provided in section 5.2.1. Based on the output of the semantic parser, the Dialogue Manager chooses the appropriate Dialogue Object to handle the chat, and pushes this Object to the Dialogue Stack.

In the current version of the bot, the semantic parser is a function of only text – it is not aware of the objects present in the world. As shown in figure 3, it is the job of the Dialogue Object666The code implementing the dialogue object that would handle this scenario is in to interpret the action dictionary in the context of the world state stored in the memory. In this case, the Dialogue Object would query the memory for objects tagged ”blue” and ”house”, and if present, create a Move Task whose target location is the actual coordinates of the blue house. More details on Tasks are in section 5.1.2

Once the Task is created and pushed onto the Task stack, it is the Move Task’s responsibility, when called, to compare the bot’s current location to the target location and produce a sequence of low-level step movements to reach the target.

[fontsize=] Input: [0] ”go to the blue house” Output: ”dialogue_type”: ”HUMAN_GIVE_COMMAND”, ”action”: ”action_type”: ”MOVE”, ”location”: ”location_type”: ”REFERENCE_OBJECT”, ”reference_object”: ”has_colour”: [0, [3, 3]], ”has_name”: [0, [4, 4]]

Figure 4: An example input and output for the neural semantic parser. References to words in the input (e.g. ”house”) are written as spans of word indices, to allow generalization to words not present in the dictionary at train-time. For example, the word ”house” is represented as the span beginning and ending with word 3, in sentence index 0.

A flowchart of the bot’s main event loop is shown in figure 5, and the implementation can be found in the step method in

Figure 5: A flowchart of the bot’s main event loop. On every loop, the bot responds to incoming chat or block-change events if necessary, and makes progress on the topmost Task on its stack. Note that dialogue context (e.g. if the bot has asked a question and is awaiting a response from the user) is stored in a stack of Dialogue Objects. If this dialogue stack is not empty, the topmost Dialogue Object will handle an incoming chat.

5 Modules

This section provides a detailed documentation of each module of the system as implemented, at the time of this release.

5.1 Task Stack

5.1.1 Task Primitives

The following definitions are concepts used throughout the bot’s Tasks and execution system:

BlockId: A Minecraft building material (e.g. dirt, diamond, glass, or water), characterized by an 8-bit id and 4-bit metadata777See

Location: An absolute position in the world

Schematic: An object blueprint that can be copied into the world: a map of relative

BlockObject: A real object that exists in the world: a set of absolute

Mob: A moving object in the world (e.g. cow, pig, sheep, etc.)

5.1.2 Tasks

A Task is an interruptible process with a clearly defined objective. A Task can be executed step by step, and must be resilient to long pauses between steps (to allow tasks to be paused and resumed if the user changes their priorities). A Task can also push other Tasks onto the stack, similar to the way that functions can call other functions in a standard programming language. For example, a Build may first require a Move if the bot is not close enough to place blocks at the desired location.

The following is a list of basic Tasks:

Move(Location) Move to a specific coordinate in the world. Implemented by an A* search which destroys and replaces blocks if necessary to reach a destination.

Build(Schematic, Location) Build a specific schematic into the world at a specified location.

Destroy(BlockObject) Destroy the specified BlockObject.

Dig(Location, Size) Dig a rectangular hole of a given Size at the specified Location.

Fill(Location) Fill the holes at the specified Location.

Spawn(Mob, Location) Spawn a Mob at a given Location.

Dance(Movement) Perform a defined sequence of moves (e.g. move in a clockwise pattern around a coordinate)

There are also control flow actions which take other Tasks as arguments:

Undo(Task) This Task reverses the effects of a specified Task, or defaults to the last Task executed (e.g. destroy the blocks that resulted from a Build)

Loop(StopCondition, Task) This Task keeps executing the given Task until a StopCondition is met (e.g keep digging until you hit a bedrock block)

5.2 Semantic Parser

The core of the bot’s natural language understanding is performed by a neural semantic parser called the Text-to-Action-Dictionary (TTAD) model. This model receives an incoming chat / dialogue and parses it into an action dictionary that can be interpreted by the Dialogue Object.

A detailed report of this model is available at (Jernite et al., 2019). The model is a modification of the approach in (Dong & Lapata, 2016)). We use bi-directional GRU encoder for encoding the sentences and multi-headed attention over the input sentence.

5.2.1 Action Dictionaries

An action dictionary is an unambiguous logical form of the intent of a chat. An example of an action dictionary is shown in figure 4. Every action dictionary is one of four dialogue types:

  1. HUMAN_GIVE_COMMAND: The human is giving an instruction to the bot to perform a Task, e.g. to Move somewhere or Build something. An action dictionary of this type must have an action key that has a dictionary with an action_type specifying the Task, along with further information detailing the information for the Task (e.g. “schematic” and “location” for a Build Task).

  2. GET_MEMORY: The human is asking a question or otherwise probing the bot’s understanding of the environment.

  3. PUT_MEMORY: The human is providing information to the bot for future reference or providing feedback to the bot, e.g. assigning a name to an object “that brown thing is a shed”.

  4. NOOP: No action is required.

There is a dialogue object associated with each dialogue type. For example, the GetMemoryHandler interprets a GET_MEMORY action dictionary, querying the memory, and responding to the user with an answer to the question.

For HUMAN_GIVE_COMMAND action dictionaries, with few exceptions, there is a direct mapping from “action_type” values to Task names in section 5.1.2.

5.3 Dialogue Manager & Dialogue Stack

The Dialogue Manager is the top-level handler for incoming chats. It performs the following :

  1. Checking the chat for obscenities or illegal words

  2. Calling the neural semantic parser to produce an action dictionary

  3. Routing the handling of the action dictionary to an appropriate Dialogue Object

  4. Storing (in the Dialogue Stack) persistent state and context to allow multi-turn dialogues

The Dialogue Stack is to Dialogue Objects what the Task Stack is to Tasks. The execution of a Dialogue Object may require pushing another Dialogue Object onto the Stack. For example, the Interpreter Object, while handling a Destroy command and determining which object should be destroyed, may ask the user for clarification. This places a ConfirmReferenceObject object on the Stack, which in turn either pushes a Say object to ask the clarification question or AwaitResponse object (if the question has already been asked) to wait for the user’s response. The Dialogue Manager will then first call the Say and then call the AwaitResponse object to help resolve the Interpreter object.

5.4 Memory

The data stored in the bot’s memory includes the locations of BlockObjects and Mobs (animals), information about them (e.g. user-assigned names, colour etc), the historical and current state of the Task Stack, all the chats and relations between different memory objects. Memory data is queried by DialogueObjects when interpreting an action dictionary (e.g. to interpret the action dictionary in figure 4, the memory is queried for the locations of block objects named “house” with colour “blue”).

The memory module is implemented using an in-memory SQLite888 database. Relations and tags are stored in a single triple store. All memory objects (including triples themselves) can be referenced as the subject or object of a memory triple.

How are BlockObjects populated into Memory?

At this time, BlockObjects are defined as maximally connected components of unnatural blocks (i.e. ignoring blocks like grass and stone that are naturally found in the world, unless those blocks were placed by a human or bot). The bot periodically searches for BlockObjects in its vicinity and adds them to Memory.

How are tags populated into Memory?

At this time, tag triples of the form (BlockObject_id, "has_tag", tag) are inserted as the result of some PUT_MEMORY actions, triggered when a user assigns a name or description to an object via chat or gives feedback (e.g. “that object is a house”, “that barn is tall” or “that was really cool”). Some relations (e.g. has_colour

, indicating BlockObject colours) are determined heuristically. Neural network perception modules may also populate tags into the memory.

5.5 Perception

The bot has access to two raw forms of visual sensory input:

2D block vision999The implementation of 2D block vision is found at agent.cpp#L328

By default, this produces a 64x64 image where each “pixel” contains the block type and distance to the block in the bot’s line of sight. For example, instead of a pixel containing RGB colour information representing “brown”, the bot might see block-id 17, indicating “Oak Wood”.

3D block vision101010The implementation of 3D block vision is found at agent.cpp#L321

The bot has access to the underlying block map: the block type at any absolute position nearby. This information is not available to a human player interacting normally with the Minecraft game – if it is important to compare a bot’s sensorimotor capabilities to a human’s (e.g. in playing an adversarial game against a human player), avoid the use of the get_blocks function which implements this capability.

Other common perceptual capabilities are implemented using ML models or heuristics as appropriate:

Semantic segmentation

A 3d convolutional neural network processes each Block Object and outputs a tag for each voxel, indicating for example whether it is part of a wall, roof, or floor. The code for this model is in


Relative directions

Referring to objects based on their positions relative to other objects is performed heuristically based on a coordinate shift relative to the speaker’s point of view. For example, referencing “the barn left of the house” is handled by searching for the closest object called “barn” that is to the speaker’s left of the “house”.

Size and colour

Referring to objects based on their size or colour is handled heuristically. The colour of a Block Object is based on the colours of its most common block types. Adjectives referring to size (e.g. “tiny” or “huge”) are heuristically mapped to ranges of block lengths.

6 Data

This section describes the datasets we are releasing with the framework.

6.1 The semantic parsing dataset

We are releasing a semantic parsing dataset of English-language instructions and their associated “action dictionaries”, used for human-bot interactions in Minecraft. This dataset was generated in different settings as described below:

  • Generations: Algorithmically generating action trees (logical forms over the grammar) with associated surface forms using templates. (The script for generating these is here:

  • Rephrases: We asked crowd workers to rephrase some of the produced instructions into commands in alternate, natural English that does not change the meaning of the sentence.

  • Prompts: We presented crowd workers with a description of an assistant bot and asked them for examples of commands they’d give the bot.

  • Interactive: We asked crowd workers to play creative mode Minecraft with our bot, and used the data from the in-game chat.

The dataset has four files, corresponding to the settings above:

  1. generated_dialogues.json : This file has 800000 dialogue - action dictionary pairs generated using our generation script. More can be generated using the script.

  2. rephrases.json: This file has 25402 dialogue - action dictionary pairs. These are paraphrases of dialogues generated by our grammar.

  3. prompts.json: This file contains 2513 dialogue - action dictionary pairs. These dialogues came from the prompts setting described above.

  4. humanbot.json: This file contains 708 dialogue - action dictionary pairs. These dialogues came from the interactive setting above.

The format of the data in each file is:

  • A dialogue is represented as a list of sentences, where each sentence is a sequence of words separated by spaces and tokenized using the spaCy tokenizer (Honnibal & Johnson, 2015).

  • Each json file is a list of dialogue - action dictionary pair, where “action dictionary” is a nested dictionary described in 5.2.1

For more details on the dataset see: (Jernite et al., 2019)

6.2 House dataset

We used crowd sourcing to collect examples of humans building houses in Minecraft. Each user is asked to build a house on a fixed time budget (30 minutes), without any additional guidance or instructions. Every action of the user is recorded using the Cuberite server.

The data collection was performed in Minecraft’s creative mode, where the user is given unlimited resources, has access to all material block types and can freely move in the game world. The action space of the environment is straight-forward: moving in x-y-z dimensions, choosing a block type, and placing or breaking a block.

There are hundreds of different block types someone could use to build a house, including different kinds of wood, stone, dirt, sand, glass, metal, ice, to list a few. An empty voxel is considered as a special block type “air” (block id=0).

We record sequences of atomic building actions for each user at each step using the following format:

[t, userid, [x, y, z], [block-id, meta-id], ”P”/”B”]

where the time-stamp is in monotonically increasing order; is the absolute coordinate with respect to the world origin in Minecraft; “P” and “B” refers to placing a new block and breaking (destroying) an existing block; each house is built by a single player in our data collection process with a unique user-id.

There are 2586 houses in total. Details of this work is under submission.

6.3 Instance segmentation data

For a subset of the houses collected in the house dataset described above, we asked crowd workers to add semantic segmentation labels for sub-components of the house. The format of the data is explained below. There are two files:

  • training_data.pkl : This file contains data we used for training our 3D semantic segmentation model.

  • validation_data.pkl: This file contains data used as validation set for the model.

Each pickle file has a list of :

[schematic, annotated_schematic, annotation_list, house_name]


  • schematic: The 3-d numpy array representing the house, where each element in the array is the block_id of the block at that coordinate.

  • annotated_schematic: The 3-d numpy array representing the house, where each element in the array is the id of the semantic annotation that the coordinate/block belongs to (1-indexed annotation_list).

  • annotation_list: List of semantic segmentation for the house.

  • house_name: Name of the house.

There are 2050 houses in total and 1038 distinct labels of subcomponents.

The datasets described above can be downloaded following the instructions here

7 Related Work

A number of projects have been initiated to study Minecraft agents or to build frameworks to make learning in Minecraft possible. The most well known framework is Microsoft’s MALMO project (Johnson et al., 2016)

. The majority of work using MALMO consider reinforcement learned agents to achieve certain goals e.g

(Shu et al., 2017; Udagawa et al., 2016; Alaniz, 2018; Oh et al., 2016; Tessler et al., 2017). Recently the MineRL project (Guss et al., 2019) builds on top of MALMO with playthrough data and specific challenges.

Our initial bot has a neural semantic parser (Dong & Lapata, 2016; Jia & Liang, 2016; Zhong et al., 2017) as its core NLU component. We also release the data used to train the semantic parser. There have been a number of datasets of natural language paired with logical forms to evaluate semantic parsing approaches, e.g. (Price, 1990; Tang & Mooney, 2001; Cai & Yates, 2013; Wang et al., 2015; Zhong et al., 2017). Recently (Chevalier-Boisvert et al., 2018) described a gridworld with navigation instructions generated via a grammar. Our bot also needs to update its understanding of an initial instruction during several turns of dialogue with the user, which is reminiscent of the setting of (Bordes et al., 2017).

In addition to mapping natural language to logical forms, our dataset connects both of these to a dynamic environment. In (Tellex et al., 2011; Matuszek et al., 2013) semantic parsing has been used for interpreting natural language commands for robots. In our setup, the “robot” is embodied in the Minecraft game instead of in the physical world. Semantic parsing in a voxel-world recalls (Wang et al., 2017), where the authors describe a method for building up a programming language from a small core via interactions with players. Our bot’s NLU pipeline is perhaps most similar to the one proposed in (Kollar et al., 2018), which builds a grammar for the Alexa virtual personal assistant.

A task relevant to interactive bots is that of Visual Question Answering (VQA)  (Antol et al., 2015; Krishna et al., 2017; Geman et al., 2015) in which a question is asked about an image and an answer is provided by the system. Most papers address this task using real images, but synthetic images have also been used (Johnson et al., 2017; Andreas et al., 2016). The VQA task has been extended to visual dialogues (Das et al., 2017) and videos (Tapaswi et al., 2016). Recently, the tasks of VQA and navigation have been combined using 3D environments (Gordon et al., 2018; Das et al., 2018; Kolve et al., 2017; Anderson et al., 2018) to explore bots that must navigate to certain locations before a question can be answered, e.g., “How many chairs are in the kitchen?” Similar to our framework, these papers use synthetic environments for exploration. However, these can be expanded to use those generated from real environments (Savva et al., 2019). Instead of the goal being the answering of a question, other tasks can be explored. For instance, the task of guiding navigation in New York City using dialogue (de Vries et al., 2018), or accomplishing tasks such as pushing or opening specific objects (Kolve et al., 2017).

8 Discussion

In this work we have described the design of a bot and associated data that we hope can be used as a starting point and baseline for research in learning from interaction in Minecraft. In this section, we discuss some major design decisions that were made for this bot, and contrast against other possible choices. We further discuss ways in which the bot can be improved.

8.1 Semantic Parsing

Rather than learning a mapping directly from (language, state) to an action or sequence of actions, the bot described in this paper first parses language into a program over high level tasks, called action dictionaries (see section 5.2.1). The execution of the program is scripted, rather than learned.

This arrangement has several advantages:

  1. Determining a sequence of actions, given a well-specified intent, is usually simple in Minecraft. For example, moving to a known but faraway object might require hundreds of steps, but it is simple to use a path-finding algorithm such as A* search to find the sequence of actions to actually execute the move.

  2. Training data for a semantic parsing model is easier to collect, compared to language-action pairs that would necessitate recording the actions of a human player.

  3. If it was desired to learn the low-level actions needed to complete a task, approaches such as reinforcement learning could be employed that use the completion of the task in the action dictionary as a reward without having to address the ambiguities of tasks specified through language.

  4. It may be possible to transfer the natural language understanding capabilities of the bot to another similar domain by re-implementing the interpretation and execution of action dictionaries, without needing to retrain the semantic parser.

On the other hand,

  1. The space of objectives that can be completed by the bot is limited by the specification of the action dictionaries. Adding a new capability to the bot usually requires adding a new structure to the action dictionary spec, adding code to the relevant Dialogue Object to handle it, and updating the semantic parsing dataset.

  2. A more end-to-end model with a simpler action space might only require the collection of more data and might generalize better.

  3. The use of a pipelined approach (as described in this paper) introduces the possibility for compounding errors.

There is a huge space of possible interfaces into the high-level actions we have proposed (and many other interesting constructions of high level actions). In particular, we plan to remove the strict separation between the parser and the world state in our bot.

8.2 Symbolic Memory

As described in section 5.4, the bot’s memory is implemented using a (discrete, symbolic) relational database. The major advantages of this (compared to an end-to-end machine-learned model that operates on raw sensory inputs) are:

  1. Easier to convert semantic parses into fully specified tasks that can query and write to the database.

  2. Debugging the bot’s current understanding of the world is easier.

  3. Integrating outside information, e.g. crowd-sourced building schematics, is more straightforward: doing so requires pre-loading rows into a database table, rather than re-training a generative model.

  4. Reliable symbolic manipulations, especially lookups by keyword.

On the other hand, such a memory can be brittle and limited. Even within the space of “discrete” memories, there are more flexible formats, e.g. raw text; and there have been recent successes using such memories, for example works using the Squad dataset (Rajpurkar et al., 2016). We hope our platform will be useful for studying other symbolic memory architectures as well as continuous, learned approaches, and things in between.

8.3 Modularity for ML research

The bot’s architecture is modular, and currently many of the modules are not learned. Many machine learning researchers consider the sort of tasks that pipelining makes simpler to be tools for evaluating more general learning methodologies. Such a researcher might advocate more end-to-end (or otherwise less “engineered”) approaches because the goal is not necessarily to build something that works well on the tasks that the engineered approach can succeed in, but rather to build something that can scale beyond those tasks.

We have chosen this approach in part because it allows us to more easily build an interesting initial assistant from which we can iterate; and in particular allows data collection and creation. We do believe that modular systems are more generally interesting, especially in the setting of competency across a large number of relatively easier tasks. Perhaps most interesting to us are approaches that allow modular components with clearly defined interfaces, and heterogeneous training based on what data is available. We hope to explore these with further iterations of our bot.

Despite our own pipelined approach, we consider research on more end-to-end approaches worthwhile and interesting. Even for researchers primarily interested in these, the pipelined approach still has value beyond serving as a baseline: as discussed above, it allows generating large amounts of training data for end-to-end methods.

Finally, we note that from an engineering standpoint, modularity has clear benefits. In particular, it allows many researchers to contribute components to the greater whole in parallel. As discussed above, the bot presented here is meant to be a jumping off point, not a final product. We hope that the community will find the framework useful and join us in building an assistant that can flexibly learn from interaction with people.

9 Conclusion

We have described a platform for studying situated natural language understanding in Minecraft. The platform consists of code that implements infrastructure for allowing bots and people to play together, tools for labeling data, and a baseline assistant. In addition to the code, we are releasing a diverse set of data we used for building the assistant. This includes 2586 houses built in game, and the actions used in building them, instance segmentations of those houses, and templates and rephrases of templates for training a semantic parser. In the future, we plan to continue to release data as it is collected. We hope that the community will find the framework useful and join us in building an assistant that can learn a broad range of tasks from interaction with people.