RTFM: Generalising to Novel Environment Dynamics via Reading

by   Victor Zhong, et al.

Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments. We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps.


page 3

page 11

page 15


Nested Policy Reinforcement Learning

Off-policy reinforcement learning (RL) has proven to be a powerful frame...

MANGA: Method Agnostic Neural-policy Generalization and Adaptation

In this paper we target the problem of transferring policies across mult...

Fast Adaptation via Policy-Dynamics Value Functions

Standard RL algorithms assume fixed environment dynamics and require a s...

BabyAI++: Towards Grounded-Language Learning beyond Memorization

Despite success in many real-world tasks (e.g., robotics), reinforcement...

Interpretable Reinforcement Learning with Multilevel Subgoal Discovery

We propose a novel Reinforcement Learning model for discrete environment...

Feudal Reinforcement Learning by Reading Manuals

Reading to act is a prevalent but challenging task which requires the ab...

Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning

In this paper, we consider the problem of leveraging textual description...

Code Repositories

1 Introduction

Reinforcement learning (RL) has been successful in a variety of areas such as continuous control (Lillicrap et al., 2015), dialogue systems (Li et al., 2016), and game-playing (Mnih et al., 2013). However, RL adoption in real-world problems is limited due to poor sample efficiency and failure to generalise to environments even slightly different from those seen during training. We explore language-conditioned policy learning, where agents use machine reading to discover strategies required to solve a task, thereby leveraging language as a means to generalise to new environments.

Prior work on language grounding and language-based RL (see Luketina et al. (2019) for a recent survey) are limited to scenarios in which language specifies the goal for some fixed environment dynamics (Branavan et al., 2011; Hermann et al., 2017; Bahdanau et al., 2019; Fried et al., 2018; Co-Reyes et al., 2019), or the dynamics of the environment vary and are presented in language for some fixed goal (Branavan et al., 2012). In practice, changes to goals and to environment dynamics tend to occur simultaneously—given some goal, we need to find and interpret relevant information to understand how to achieve the goal. That is, the agent should account for variations in both by selectively reading, thereby generalising to environments with dynamics not seen during training.

Our contributions are two-fold. First, we propose a grounded policy learning problem that we call Read to Fight Monsters (RTFM). In RTFM, the agent must jointly reason over a language goal, a document that specifies environment dynamics, and environment observations. In particular, it must identify relevant information in the document to shape its policy and accomplish the goal. To necessitate reading comprehension, we expose the agent to ever changing environment dynamics and corresponding language descriptions such that it cannot avoid reading by memorising any particular environment dynamics. We procedurally generate environment dynamics and natural language templated descriptions of dynamics and goals to produced a combinatorially large number of environment dynamics to train and evaluate RTFM.

Second, we propose txt2 to model the joint reasoning problem in RTFM. We show that txt2 generalises to goals and environment dynamics not seen during training, and outperforms previous language-conditioned models such as language-conditioned CNNs and FiLM (Perez et al., 2018; Bahdanau et al., 2019) both in terms of sample efficiency and final win-rate on RTFM. Through curriculum learning where we adapt txt2 trained on simpler tasks to more complex tasks, we obtain agents that generalise to tasks with natural language documents that require five hops of reasoning between the goal, document, and environment observations. Our qualitative analyses show that txt2 attends to parts of the document relevant to the goal and environment observations, and that the resulting agents exhibit complex behaviour such as retrieving correct items, engaging correct enemies after acquiring correct items, and avoiding incorrect enemies. Finally, we highlight the complexity of RTFM in scaling to longer documents, richer dynamics, and natural language variations. We show that significant improvement in language-grounded policy learning is needed to solve these problems in the future.

2 Related Work

Language-conditioned policy learning.

A growing body of research is learning policies that follow imperative instructions. The granularity of instructions vary from high-level instructions for application control (Branavan, 2012) and games (Hermann et al., 2017; Bahdanau et al., 2019) to step-by-step navigation (Fried et al., 2018). In contrast to learning policies for imperative instructions, Branavan et al. (2011, 2012)

 infer a policy for a fixed goal using features extracted from high level strategy descriptions and general information about domain dynamics. Unlike prior work, we study the combination of imperative instructions and descriptions of dynamics. Furthermore, we require that the agent learn to filter out irrelevant information to focus on dynamics relevant to accomplishing the goal.

Language grounding.

Language grounding refers to interpreting language in a non-linguistic context. Examples of such context include images (Barnard and Forsyth, 2001), games (Chen and Mooney, 2008; Wang et al., 2016), robot control (Kollar et al., 2010; Tellex et al., 2011), and navigation (Anderson et al., 2018). We study language grounding in interactive games similar to Branavan (2012); Hermann et al. (2017) or Co-Reyes et al. (2019), where executable semantics are not provided and the agent must learn through experience. Unlike prior work, we require grounding between an underspecified goal, a document of environment dynamics, and world observations. In addition, we focus on generalisation to not only new goal descriptions but new environments dynamics.

3 Read to Fight Monsters

Figure 1: RTFM requires jointly reasoning over the goal, a document describing environment dynamics, and environment observations. This figure shows key snapshots from a trained policy on one randomly sampled environment. Frame 1 shows the initial world. In 4, the agent approaches “fanatical sword”, which beats the target “fire goblin”. In 5, the agent acquires the sword. In 10, the agent evades the distractor “poison bat” while chasing the target. In 11, the agent engages the target and defeats it, thereby winning the episode. Sprites are used for visualisation — the agent observes cell content in text (shown in white). More examples are in appendix A.

We consider a scenario where the agent must jointly reason over a language goal, relevant environment dynamics specified in a text document, and environment observations. In reading the document, the agent should identify relevant information key to solving the goal in the environment. A successful agent needs to perform this language grounding to generalise to new environments with dynamics not seen during training.

To study generalisation via reading, the environment dynamics must differ every episode such that the agent cannot avoid reading by memorising a limited set of dynamics. Consequently, we procedurally generate a large number of unique environment dynamics (e.g. effective(blessed items, poison monsters)), along with language descriptions of environment dynamics (e.g. blessed items are effective against poison monsters) and goals (e.g. Defeat the order of the forest). We couple a large, customisable ontology inspired by rogue-like games such as NetHack or Diablo, with natural language templates to create a combinatorially rich set of environment dynamics to learn from and evaluate on.

In RTFM, the agent is given a document of environment dynamics, observations of the environment, and an underspecified goal instruction. Figure 1

illustrates an instance of the game. Concretely, we design a set of dynamics that consists of monsters (e.g. wolf, goblin), teams (e.g. Order of the Forest), element types (e.g. fire, poison), item modifiers (e.g. fanatical, arcane), and items (e.g. sword, hammer). When the player is in the same cell with a monster or weapon, the player picks up the item or engages in combat with the monster. The player can possess one item at a time, and drops existing weapons if they pick up a new weapon. A monster moves towards the player with 60% probability, and otherwise moves randomly. The dynamics, the agent’s inventory, and the underspecified goal are rendered as text. The game world is rendered as a matrix of text in which each cell describes the entity occupying the cell. We use human-written templates for stating which monsters belong to which team, which modifiers are effective against which element, and which team the agent should defeat (see appendix 

G for details). In order to achieve the goal, the agent must cross-reference relevant information in the document and as well as in the observations.

During every episode, we subsample a set of groups, monsters, modifiers, and elements to use. We randomly generate group assignments of which monsters belong to which team and which modifier is effective against which element. A document that consists of randomly ordered statements corresponding to this group assignment is presented to the agent. We sample one element, one team, and a monster from that team (e.g. “fire goblin” from “Order of the forest”) to be the target monster. Additionally, we sample one modifier that beats the element and an item to be the item that defeats the target monster (e.g. “fanatical sword”). Similarly, we sample an element, a team, and a monster from a different team to be the distractor monster (e.g. poison bat), as well as an item that defeats the distractor monster (e.g. arcane hammer).

In order to win the game (e.g. Figure 1), the agent must

  1. identify the target team from the goal (e.g. Order of the Forest)

  2. identify the monsters that belong to that team (e.g. goblin, jaguar, and lynx)

  3. identify which monster is in the world (e.g. goblin), and its element (e.g. fire)

  4. identify the modifiers that are effective against this element (e.g. fanatical, shimmering)

  5. find which modifier is present (e.g. fanatical), and the item with the modifier (e.g. sword)

  6. pick up the correct item (e.g. fanatical sword)

  7. engage the correct monster in combat (e.g. fire goblin).

If the agent deviates from this trajectory (e.g. does not have correct item before engaging in combat, engages with distractor monster), it cannot defeat the target monster and therefore will lose the game. The agent receives a reward of +1 if it wins the game and -1 otherwise.

RTFM presents challenges not found in prior work in that it requires a large number of grounding steps in order to solve a task. In order to perform this grounding, the agent must jointly reason over a language goal and document of dynamics, as well as environment observations. In addition to the environment, the positions of the target and distractor within the document are randomised—the agent cannot memorise ordering patterns in order to solve the grounding problems, and must instead identify information relevant to the goal and environment at hand.

We split environments into train and eval sets. No assignments of monster-team-modifier-element are shared between train and eval to test whether the agent is able to generalise to new environments with dynamics not seen during training via reading. There are more than 2 million train or eval environments without considering the natural language templates, and 200 million otherwise. With random ordering of templates, the number of unique documents exceeds 15 billion.

4 Model

We propose the txt2 model, which builds representations that capture three-way interactions between the goal, document describing environment dynamics, and environment observations. We begin with definition of the Bidirectional Feature-wise Linear Modulation (FiLM) layer, which forms the core of our model.

4.1 Bidirectional Feature-wise Linear Modulation (FiLM) layer

Figure 2: The FiLM layer.

Feature-wise linear modulation (FiLM), which modulates visual inputs using representations of textual instructions, is an effective method for image captioning 

(Perez et al., 2018) and instruction following (Bahdanau et al., 2019). In RTFM, the agent must not only filter concepts in the visual domain using language but filter concepts in the text domain using visual observations. To support this, FiLM builds codependent representations of text and visual inputs by further incorporating conditional representations of the text given visual observations. Figure 2 shows the FiLM layer.

We use upper-case bold letters to denote tensors, lower-case bold letters for vectors, and non-bold letters for scalars. Exact dimensions of these variables are shown in Table 

4 in appendix B. Let denote a fixed-length -dimensional representation of the text and the representation of visual inputs with height , width , and channels. Let denote a convolution layer. Let + and * symbols denote element-wise addition and multiplication operations that broadcast over spatial dimensions. We first modulate visual features using text features:


Unlike FiLM, we additionally modulate text features using visual features:


The output of the FiLM layer consists of the sum of the modulated features

, as well as a max-pooled summary

over this sum across spatial dimensions.

4.2 The txt2 model

Figure 3: txt2 models interactions between the goal, document, and observations.

We model interactions between observations from the environment, goal, and document using FiLM layers. We first encode text inputs using bidirectional LSTMs, then compute summaries using self-attention and conditional summaries using attention. We concatenate text summaries into text features, which, along with visual features, are processed through consecutive FiLM layers. In this case of a textual environment, we consider the grid of word embeddings as the visual features for FiLM. The final FiLM

 output is further processed by MLPs to compute a policy distribution over actions and a baseline for advantage estimation. Figure 

3 shows the txt2 model.

Let denote word embeddings corresponding to the observations from the environment, where represents the embeddings corresponding to the -word string that describes the objects in location in the grid-world. Let , , and respectively denote the embeddings corresponding to the -word document, the -word inventory, and the -word goal. We first compute a fixed-length summary of the the goal using a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) followed by self-attention (Lee et al., 2017; Zhong et al., 2018).


We abbreviate self-attention over the goal as . We similarly compute a summary of the inventory as . Next, we represent the document encoding conditioned on the goal using dot-product attention (Luong et al., 2015).

We abbreviate attention over the document encoding conditioned on the goal summary as . Next, we build the joint representation of the inputs using successive FiLM layers. At each layer, the visual input to the FiLM layer is the concatenation of the output of the previous layer with positional features. For each cell, the positional feature consists of the and distance from the cell to the agent’s position respectively, normalized by the width and height of the grid-world. The text input is the concatenation of the goal summary, the inventory summary, the attention over the document given the goal, and the attention over the document given the previous visual summary. Let denote the feature-wise concatenation of and . For the th layer, we have


is another encoding of the document similar to , produced using a separate LSTM, such that the document is encoded differently for attention with the visual features and with the goal. For , we concatenate the bag-of-words embeddings of the grid with positional features as the initial visual features

. We max pool a linear transform of the initial visual features to compute the initial visual summary

. Let denote visual summary of the last FiLM layer. We compute the policy and baseline as


where and

are 2-layer multi-layer perceptrons with

activation. We train using an implementation of IMPALA (Espeholt et al., 2018), which decouples actors from learners and uses V-trace for off-policy correction. Please refer to appendix D for details.

5 Experiments

We consider variants of RTFM by varying the size of the grid-world ( vs ), allowing many-to-one group assignments to make disambiguation more difficult (group), allowing dynamic, moving monsters that hunt down the player (dyna), and using natural language templated documents (nl). In the absence of many-to-one assignments, the agent does not need to perform steps 3 and 5 in section 3 as there is no need to disambiguate among many assignees, making it easier to identify relevant information.

Figure 4: Ablation training curves on simplest variant of RTFM. Individual runs are in light colours. Average win rates are in bold, dark lines.
Model Win rate Train Eval 66 Eval 1010 conv FiLM no_task_attn no_vis_attn no_text_mod txt2 Table 1: Final win rate on simplest variant of RTFM. The models are trained on one set of dynamics (e.g. training set) and evaluated on another set of dynamics (e.g. evaluation set). “Train” and “Eval” show final win rates on training and eval environments.

We compare txt2 to the FiLM model by Bahdanau et al. (2019) and a language-conditioned residual CNN model. We train on one set of dynamics (e.g. group assignments of monsters and modifiers) and evaluated on a held-out set of dynamics. We also study three variants of txt2. In no_task_attn, the document attention conditioned on the goal utterance (equation 16) is removed and the goal instead represented through self-attention and concatenated with the rest of the text features. In no_vis_attn, we do not attend over the document given the visual output of the previous layer (equation 18), and the document is instead represented through self-attention. In no_text_mod, text modulation using visual features (equation 6) is removed. Please see appendix C for model details on our model and baselines, and appendix D for training details.

5.1 Comparison to baselines and ablations

We compare txt2 to baselines and ablated variants on a simplified variant of RTFM in which there are one-to-one group assignments (no group), stationary monsters (no dyna), and no natural language templated descriptions (no nl). Figure 4 shows that compared to baselines and ablated variants, txt2 is more sample efficient and converges to higher performance. Moreover, no ablated variant is able to solve the tasks—it is the combination of ablated features that enables txt2 to win consistently. Qualitatively, the ablated variants converge to locally optimum policies in which the agent often picks up a random item and then attacks the correct monster, resulting in a % win rate. Table 1 shows that all models, with the exception of the CNN baseline, generalise to new evaluation environments with dynamics and world configurations not seen during training, with txt2 outperforming FiLM and the CNN model. We find similar results for txt2, its ablated variants, and baselines on other tasks (see appendix E for details).

5.2 Curriculum learning for complex environments

Transfer to
Table 2: Curriculum training results. We keep 5 randomly initialised models through the entire curriculum. A cell in row and column shows transfer from the best-performing setting in the previous stage (bolded in row ) to the new setting in column

. Each cell shows final mean and standard deviation of win rate on the training environments. Each experiment trains for 50 million frames, except for the initial stage (first row, 100 million instead). For the last stage (row 4), we also transfer to a

variant and obtain win rate.
Win rate
Train Eval
Table 3: Win rate when evaluating on new dynamics and world configurations for txt2 on the full RTFM problem.

Due to the long sequence of co-references the agent must perform in order to solve the full RTFM ( with moving monsters, many-to-one group assignments, and natural language templated documents) we design a curriculum to facilitate policy learning by starting with simpler variants of RTFM. We start with the simplest variant (no group, no dyna, no nl) and then add in an additional dimension of complexity. We repeatedly add more complexity until we obtain worlds with moving monsters, many-to-one group assignments and natural language templated descriptions. The performance across the curriculum is shown in Table 2 (see Figure 13 in appendix F for training curves of each stage). We see that curriculum learning is crucial to making progress on RTFM, and that initial policy training (first row of Table 2) with additional complexities in any of the dimensions result in significantly worse performance. We take each of the 5 runs after training through the whole curriculum and evaluate them on dynamics not seen during training. Table 3 shows variants of the last stage of the curriculum in which the model was trained on versions of the full RTFM and in which the model was trained on versions of the full RTFM. We see that models trained on smaller worlds generalise to bigger worlds. Despite curriculum learning, however, performance of the final model trail that of human players, who can consistently solve RTFM. This highlights the difficulties of the RTFM problem and suggests that there is significant room for improvement in developing better language grounded policy learners.

(a) The entities present are shimmering morning star, mysterious spear, fire spider, and lightning wolf.
(b) The entities present are soldier’s axe, shimmering axe, fire beetle, and poison wolf.
Figure 5: txt2 attention on the full RTFM. These include the document attention conditioned on the goal (top) as well as those conditioned on summaries produced by intermediate FiLM layers. Weights are normalised across words (e.g. horizontally). Darker means higher attention weight.

Attention maps.

Figure 5 shows attention conditioned on the goal and on observation summaries produced by intermediate FiLM layers. Goal-conditioned attention consistently locates the clause that contains the team the agent is supposed to attack. Intermediate layer attentions focus on regions near modifiers and monsters, particularly those that are present in the observations. These results suggests that attention mechanisms in txt2 help identify relevant information in the document.

Analysis of trajectories and failure modes.

We examine trajectories from well-performing policies (80% win rate) as well as poorly-performing policies (50% win rate) on the full RTFM. We find that well-performing policies exhibit a number of consistent behaviours such as identifying the correct item to pick up to fight the target monster, avoiding distractors, and engaging target monsters after acquiring the correct item. In contrast, the poorly-performing policies occasionally pick up the wrong item, causing the agent to lose when engaging with a monster. In addition, it occasionally gets stuck in evading monsters indefinitely, causing the agent to lose when the time runs out. Replays of both policies can be found in GIFs in the supplementary materials111Trajectories by txt2 on RTFM can be found at https://gofile.io/?c=9k7ZLk.

6 Conclusion

We proposed RTFM, a grounded policy learning problem in which the agent must jointly reason over a language goal, relevant dynamics specified in a document, and environment observations. In order to study RTFM, we procedurally generated a combinatorially large number of environment dynamics such that the model cannot memorise a set of environment dynamics and must instead generalise via reading. We proposed txt2, a model that captures three-way interactions between the goal, document, and observations, and that generalises to new environments with dynamics not seen during training. txt2 outperforms baselines such as FiLM and language-conditioned CNNs. Through curriculum learning, txt2 performs well on complex RTFM tasks that require several reasoning and coreference steps with natural language templated goals and descriptions of the dynamics. Our work suggests that language understanding via reading is a promising way to learn policies that generalise to new environments. Despite curriculum learning, our best models trail performance of human players, suggesting that there is ample room for improvement in grounded policy learning on complex RTFM problems. In addition to jointly learning policies based on external documentation and language goals, we are interested in exploring how to use supporting evidence in external documentation to reason about plans (Andreas et al., 2018) and induce hierarchical policies (Hu et al., 2019; Jiang et al., 2019).


  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Cited by: §2.
  • J. Andreas, D. Klein, and S. Levine (2018) Learning with latent language. In NAACL, Cited by: §6.
  • D. Bahdanau, F. Hill, J. Leike, E. Hughes, P. Kohli, and E. Grefenstette (2019) Learning to follow language instructions with adversarial reward induction. In ICLR, Cited by: §C.3, §1, §1, §2, §4.1, §5.
  • K. Barnard and D. Forsyth (2001) Learning the semantics of words and pictures. In ICCV, Cited by: §2.
  • S. R. K. Branavan, N. Kushman, T. Lei, and R. Barzilay (2012) Learning high-level planning from text. In ACL, Cited by: §1, §2.
  • S. R. K. Branavan, D. Silver, and R. Barzilay (2011) Learning to win by reading manuals in a monte-carlo framework. In ACL, Cited by: §1, §2.
  • S.R.K. Branavan (2012) Grounding linguistic analysis in control applications. Ph.D. Thesis, MIT. Cited by: §2, §2.
  • D. L. Chen and R. J. Mooney (2008) Learning to sportscast: a test of grounded language acquisition. In ICML, Cited by: §2.
  • J. D. Co-Reyes, A. Gupta, S. Sanjeev, N. Altieri, J. DeNero, P. Abbeel, and S. Levine (2019) Guiding policies with language via meta-learning. In ICLR, Cited by: §1, §2.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018) IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In ICML, Cited by: Appendix D, Appendix D, §4.2.
  • D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In NeurIPS, Cited by: §1, §2.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, M. Wainwright, C. Apps, D. Hassabis, and P. Blunsom (2017) Grounded language learning in a simulated 3d world. CoRR abs/1706.06551. Cited by: §1, §2, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Compututation 9 (8). Cited by: §4.2.
  • H. Hu, D. Yarats, Q. Gong, Y. Tian, and M. Lewis (2019) Hierarchical decision making by generating and following natural language instructions. CoRR abs/1906.00744. Cited by: §6.
  • Y. Jiang, S. Gu, K. Murphy, and C. Finn (2019) Language as an abstraction for hierarchical deep reinforcement learning. CoRR abs/1906.07343. Cited by: §6.
  • T. Kollar, S. Tellex, D. Roy, and N. Roy (2010) Toward understanding natural language directions. In HRI, Cited by: §2.
  • K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In EMNLP, Cited by: §4.2.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In EMNLP, Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. CoRR abs/1509.02971. Cited by: §1.
  • J. Luketina, N. Nardelli, G. Farquhar, J. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel (2019) A Survey of Reinforcement Learning Informed by Natural Language. In IJCAI, Cited by: §1.
  • M. Luong, H. Pham, and C. D. Manning (2015)

    Effective approaches to attention-based neural machine translation

    In ACL, Cited by: §4.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. CoRR abs/1312.5602. Cited by: §1.
  • E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In AAAI, Cited by: §1, §4.1.
  • S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy (2011) Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, Cited by: §2.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5—RmsPropG: Divide the gradient by a running average of its recent magnitude. Note:

    COURSERA: Neural Networks for Machine Learning

    Cited by: Appendix D.
  • S. I. Wang, P. Liang, and C. D. Manning (2016) Learning language games through interaction. In ACL, Cited by: §2.
  • V. Zhong, C. Xiong, and R. Socher (2018) Global-locally self-attentive dialogue state tracker. In ACL, Cited by: §4.2.

Appendix A Playthrough examples

These figures shows key snapshots from a trained policy on randomly sampled environments.

Figure 6: The initial world is shown in 1. In 4, the agent avoids the target “lightning shaman” because it does not yet have “arcane spear”, which beats the target. In 7 and 8, the agent is cornered by monsters. In 9, the agent is forced to engage in combat and loses.
Figure 7: The initial world is shown in 1. In 5 the agent evades the target “cold ghost” because it does not yet have “soldier’s knife”, which beats the target. In 11 and 13, the agent obtains “soldier’s knife” while evading monsters. In 14, the agent defeats the target and wins.

Appendix B Variable dimensions

Let denote a fixed-length -dimensional representation of the text and denote the representation of visual inputs with

Variable Symbol Dimension
-dim text representation
-dim visual representation with height , width , channels
Environment observations embeddings
-word string that describes the objects in location in the grid-world
-word document embeddings
-word inventory embeddings
-word goal embeddings
Table 4: Variable dimensions

Appendix C Model details

c.1 txt2


The txt2 used in our experiments consists of 5 consecutive FiLM

 layers, each with 3x3 convolutions and padding and stride sizes of 1. The 


 layers have channels of 16, 32, 64, 64, and 64, with residual connections from the 3rd layer to the 5th layer. The Goal-doc LSTM (see Figure 

3) shares weight with the Goal LSTM. The Inventory and Goal LSTMs have a hidden dimension of size 10, whereas the Vis-doc LSTM has a dimension of 100. We use a word embedding dimension of 30.

c.2 CNN with residual connections

Figure 8: The convolutional network baseline. The FiLM baseline has the same structure, but with convolutional layers replaced by FiLM layers.

Like txt2, the CNN baseline consists of 5 layers of convolutions with channels of 16, 32, 64, 64, and 64. There are residual connections from the 3rd layer to the 5th layer. The input to each layer consists of the output of the previous layer, concatenated with positional features.

The input to the network is the concatenation of the observations and text representations. The text representations consist of self-attention over bidirectional LSTM-encoded goal, document, and inventory. These attention outputs are replicated over the dimensions of the grid and concatenated feature-wise with the observation embeddings in each cell. Figure 8 illustrates the CNN baseline.

c.3 FiLM baseline

The FiLM baseline encodes text in the same fashion as the CNN model. However, instead of using convolutional layers, each layer is a FiLM layer from Bahdanau et al. (2019). Note that in our case, the language representation is a self-attention over the LSTM states instead of a concatenation of terminal LSTM states.

Appendix D Training procedure

We train using an implementation of IMPALA (Espeholt et al., 2018)

. In particular, we use 20 actors and a batch size of 24. When unrolling actors, we use a maximum unroll length of 80 frames. Each episode lasts for a maximum of 1000 frames. We optimise using RMSProp 

(Tieleman and Hinton, 2012) with a learning rate of 0.005, which is annealed linearly for 100 million frames. We set and .

During training, we apply a small negative reward for each time step of and a discount factor of 0.99 to facilitate convergence. We additionally include a entropy cost to encourage exploration. Let denote the policy. The entropy loss is calculated as


In addition to policy gradient, we add in the entropy loss with a weight of 0.005 and the baseline loss with a weight of 0.5. The baseline loss is computed as the root mean square of the advantages (Espeholt et al., 2018).

When tuning models, we perform a grid search using the training environments to select hyperparameters for each model. We train 5 runs for each configuration in order to report the mean and standard deviation. When transferring, we transfer each of the 5 runs to the new task and once again report the mean and standard deviation.

Appendix E Rock-paper-scissors

Figure 9: The Rock-paper-scissors task requires jointly reasoning over the game observations and a document describing environment dynamics. The agent observes cell content in the form of text (shown in white).
Scenario # graphs # edges # nodes
train dev unseen train dev % new train dev % new
permutation 30 30 y 20 20 n 60 60 n
new edge 20 20 y 48 36 y 17 13 n
new edge+nodes 60 60 y 20 20 y 5 5 y
Table 5: Statistics of the three variations of the Rock-paper-scissors task

In addition to the main RTFM tasks, we also study a simpler formulation called Rock-paper-scissors that has a fixed goal. In Rock-paper-scissors, the agent must interpret a document that describes the environment dynamics in order to solve the task. Given an set of characters (e.g. a-z), we sample 3 characters and set up a rock-paper-scissors-like dependency graph between the characters (e.g. “a beats b, b beats c, c beats a”). We then spawn a monster in the world with a randomly assigned type (e.g. “b goblin”), as well as an item corresponding to each type (e.g. “a”, “b”, and “c”). The attributes of the agent, monster, and items are set up such that the player must obtain the correct item and then engage the monster in order to win. Any other sequence of actions (e.g. engaging the monster without the correct weapon) results in a loss. The winning policy should then be to first identify the type of monster present, then cross-reference the document to find which item defeats that type, then pick up the item, and finally engage the monster in combat. Figure 9 shows an instance of Rock-paper-scissors.

Figure 10: Performance on the Rock-paper-scissors task across models. Left shows final performance on environments whose goals and dynamics were seen during training. Right shows performance on the environments whose goals and dynamics were not seen during training.
Figure 11: Learning curve while transferring to the development environments. Win rates of individual runs are shown in light colours. Average win rates are shown in bold, dark lines.

Reading models generalise to new environments.

We split environment dynamics by permuting 3-character dependency graphs from an alphabet, which we randomly split into training and held-out sets. This corresponds to the “permutations” setting in Table 5.

Figure 12: Ablation training curves. Win rates of individual runs are shown in light colours. Average win rates are shown in bold, dark lines.

We train models on the worlds from the training set and evaluate them on both seen and not seen during training. The left of Figure 10 shows the performance of models on worlds of varying sizes with training environment dynamics. In this case, the dynamics (e.g. dependency graphs) were seen during training. For and worlds, the world configuration not seen during training. For worlds, there is a 5% chance that the initial frame was seen during training.222There are 24360 unique grid configurations given a particular dependency graph, 4060 unique dependency graphs in the training set, and 50 million frames seen during training. After training, the model finishes an episode in approximately 10 frames. Hence the probability of seeing a redundant initial frame is . Figure 10 shows the performance on held-out environments not seen during training. We see that all models generalise to environments not seen during training, both when the world configuration is not seen (left) and when the environment dynamics are not seen (right).

Reading models generalise to new concepts.

In addition to splitting via permutations, we devise two additional ways of splitting environment dynamics by introducing new edges and nodes into the held-out set. Table 5 shows the three different settings. For each, we study the transfer behaviour of models on new environments. Figure 11 shows the learning curve when training a model on the held-out environments directly and when transferring the model trained on train environments to held-out environments. We observe that all models are significantly more sample-efficient when transferring from training environments, despite the introduction of new edges and new nodes.

txt2 is more sample-efficient and learns better policies.

In Figure 10, we see that the FiLM model outperforms the CNN model on both training environment dynamics and held-out environment dynamics.  txt2

 further outperforms FiLM, and does so more consistently in that the final performance has less variance. This behaviour is also observed in the in Figure 

11. When training on the held-out set without transferring, txt2 is more sample efficient than FiLM and the CNN model, and achieves higher win-rate. When transferring to the held-out set, txt2 remains more sample efficient than the other models.

Appendix F Curriculum learning training curves

Figure 13: Curriculum learning results for txt2 on RTFM. Win rates of individual runs are shown in light colours. Average win rates are shown in bold, dark lines.

Appendix G Language templates

We collect human-written natural language templates for the goal and the dynamics. The goal statements in RTFM describe which team the agent should defeat. We collect 12 language templates for goal statements. The document of environment dynamics consists of two types of statements. The first type describes which monsters are assigned to with team. The second type describes which modifiers, which describe items, are effective against which element types, which are associated with monsters. We collection 10 language templates for each type of statements. The entire document is composed from statements, which are randomly shuffled. We randomly sample a template for each statement, which we fill with the monsters and team for the first type and modifiers and element for the second type.