|North of House|
|You are facing the north side of a white house. There is no door here, and all the windows are barred.|
|This is a dimly lit forest, with large trees all around. One particularly large tree with some low branches stands here.|
|>climb large tree|
|Up a Tree|
|You are about 10 feet above the ground nestled among some large branches. On the branch is a small birds nest. In the bird’s nest is a large egg encrusted with precious jewels, apparently scavenged somewhere by a childless songbird.|
|>climb down tree|
|This is a dimly lit forest, with large trees all around. To the east, there appears to be sunlight.|
|You are in a clearing, with a forest surrounding you on the west and south. There is a pile of leaves on the ground.|
|Done. A grating appears on the ground.|
|The grating is locked.|
Embodied agents offer an experimental paradigm to study the development and use of semantic representations for a variety of real-world tasks, from household tasks ALFRED20 to navigation guss2019minerl to chemical synthesis tamari-etal-2021-process. While robotic agents are a primary vehicle for studying embodiment (e.g. cangelosi2015developmental), robotic models are costly to construct, and experiments can be slow or difficult to scale. Virtual agents and embodied virtual environments help mitigate many of these issues, allowing large-scale simulations to be run in parallel orders of magnitude faster than real world environments (e.g. robothor), while controlled virtual environments can be constructed for exploring specific tasks – though this benefit in speed comes at the cost of modeling virtual 3D environments, which can be substantial.
Text Worlds – embodied environments rendered linguistically through textual descriptions instead of graphically through pixels (see Table 1) – have emerged as a recent methodological focus that allow studying many embodied research questions while reducing some of the development costs associated with modeling complex and photorealistic 3D environments (e.g. cote2018textworld). More than simply reducing development costs, Text Worlds also offer paradigms to study developmental knowledge representation, embodied task learning, and transfer learning at a higher level than perceptually-grounded studies, enabling different research questions that explore these topics in isolation of the open problems of perceptual input, object segmentation, and object classification regularly studied in the vision community (e.g. he2016deep; szegedy2017inception; zhai2021scaling).
1.1 Motivation for this survey
Text Worlds are rapidly gaining momentum as a research methodology in the natural language processing community. Research on agent modeling is regularly reported, while other works aim to standardize tooling, experimental methodologies, and evaluation mechanisms, to rapidly build infrastructure for a sustained research program and community. In spite of this interest, many barriers exist to applying these methodologies, and significant development efforts are in the early stages of mitigating those barriers, at least in part.
In this review, citation graphs of recent articles were iteratively crawled, identifying 108 articles relevant to Text Worlds and other embodied environments that include text as part of the simulation or task. Frequent motivations for choosing Text Worlds are highlighted in Section 2. Tooling and modeling paradigms (in the form of simulators, intermediate languages, and libraries) are surveyed and compared to higher-fidelity 3D environments in Section 3, with text environments and common benchmarks implemented with this tooling described in Section 4. Contemporary focuses in agent modeling, including coupling knowledge graphs, question answering, and common-sense reasoning with reinforcement learning, are identified in Section 5. Recent contributions to focus areas in world generation, social reasoning, and hybrid text-3D environments are summarized in Section 6, while a distillation of near-term directions for reducing barriers to using Text Worlds more broadly as a research paradigm is presented in Section 7.
2 Why use Text Worlds?
For many tasks, Text Worlds can offer advantages over other embodied environment modelling paradigms – typically in reduced development costs, the ability to model large action spaces, and the ability to study embodied reasoning at a higher level than raw perceptual information.
Embodied agents have been proposed as a solution to the symbol grounding problem harnad1990symbol, or the problem of how concepts acquire real-world meaning. Humans likely resolve symbol grounding at least partially by assigning semantics to concepts through perceptually-grounded mental simulations barsalou1999perceptual. Using embodied agents that take in perceptual data and perform actions in real or virtual environments offers an avenue for studying semantics and symbol grounding empirically cangelosi2010integration; bisk2020experience; tamari2020language; tamari2020ecological. Text Worlds abstract some of the challenges in perceptual modeling, allowing agents to focus on higher-level semantics, while hybrid worlds that simultaneously render both text and 3D views (e.g. shridhar2020alfworld) help control what kind of knowledge is acquired, and better operationalize the study of symbol grounding.
Ease of Development:
Constructing embodied virtual environments typically has steep development costs, but Text Worlds are typically easier to construct for many tasks. Creating new objects does not require the expensive process of creating new 3D models, or performing visual-information-to-object-name segmentation or classification (since the scene is rendered linguistically). Similarly, a rich action semantics is possible, and comparatively easy to implement – while 3D environments typically have one or a small number of action commands (e.g. kolve2017ai2; ALFRED20), Text Worlds typically implement dozens of action verbs, and thousands of valid Verb-NounPhrase action combinations hausknecht2020interactive.
Complex reasoning tasks typically require multi-step (or compositional) reasoning that integrates several pieces of knowledge in an action procedure that arrives at a solution. In the context of natural language, compositional reasoning is frequently studied through question answering tasks (e.g. yang-etal-2018-hotpotqa; khot2020qasc; xie-etal-2020-worldtree; dalvi2021explaining) or procedural knowledge prediction (e.g. dalvi-etal-2018-tracking; tandon-etal-2018-reasoning; dalvi-etal-2019-everything). A contemporary challenge is that the number of valid compositional procedures is typically large compared to those that can be tractably annotated as gold, and as such automatically evaluating model performance becomes challenging. In an embodied environment, an agent’s actions have (generally) deterministic consequences for a given environment state, as actions are grounded in an underlying action language (e.g. mcdermott1998pddl) or linear logic (e.g. martens2015ceptre). Embodied environments can offer a more formal semantics to study these reasoning tasks, where correctness of novel procedures could be evaluated directly.
Training a text-only agent for embodied tasks allows the agent to learn those tasks in a distilled form, at a high level. This knowledge can then be transferred to more realistic 3D environments, where agents pretrained on text versions of the same environment learn to ground their high-level knowledge in low-level perceptual information, and complete tasks faster than when trained jointly shridhar2020alfworld. This offers the possibility of creating simplified text worlds to pretrain agents for challenging 3D tasks that are currently out of reach of embodied agents.
3 What embodied simulators exist?
Here we explore what simulation engines exist for embodied agents, the trade-offs between high-fidelity (3D) and low-fidelity (text) simulators, and modeling paradigms for text-only environments and agents.
|Example Simulators||Typical Characteristics|
|3D Environment Simulators|
|AI2-Thor kolve2017ai2||High-resolution 3D visual environments|
|CHALET yan2018chalet||Physics engine for forces, matter, light|
|House3D wu2018building||Adding objects requires complex 3D modeling|
|RoboThor robothor||Limited set of simplified actions|
|ALFRED ALFRED20||Adding actions is typically expensive|
|ALFWorld shridhar2020alfworld||Possible transfer to real environments (robotics)|
|Voxel-Based Simulators|
|Malmo johnson2016malmo||Low-resolution 3D visual environments|
|MineRL guss2019minerl||Simplified physics engine for forces, matter, light|
|Adding objects requires simple 3D modeling|
|Limited set of simplified actions|
|Adding actions is somewhat expensive|
|Limited transfer to real environments|
|GridWorld Simulators|
|Rogue-in-a-box asperti2017rogueinabox||2D Grid-world rendered graphically/as characters|
|BABYAI chevalier2018babyai||Low-fidelity physics|
|Nethack LE kuttler2020nethack||Adding objects is comparatively easy|
|VisualHints carta2020visualhints||Small or large action spaces|
|Griddly bamford2021griddly||Adding object interactions is inexpensive|
|Limited transfer to real environments|
|Text-based Simulators|
|Z-Machine learningZIL1989||Environment described to user using text only|
|Inform7 nelson2006natural||Low-fidelity, conceptual-level physics|
|Ceptre martens2015ceptre||Adding objects is comparatively easy|
|TextWorld cote2018textworld||Small or large action spaces|
|LIGHT urbanek2019learning||Adding actions is comparatively inexpensive|
|Jericho hausknecht2020interactive||Limited transfer to real environments|
Simulators provide the infrastructure to implement the environments, objects, characters, and interactions of a virtual world, typically through a combination of a scripting engine to define the behavior of objects and agents, with a rendering engine that provides a view of the world for a given agent or user. Simulators for embodied agents exist on a fidelity spectrum, from photorealistic 3D environments to worlds described exclusively with language, where a trade-off typically exists between richer rendering and richer action spaces. This fidelity spectrum (paired with example simulators) is shown in Table 2, and described briefly below. Note that many of these higher-fidelity simulators are largely out-of-scope when discussing Text Worlds, except as a means of contrast to text-only worlds, and in the limited context that these simulators make use of text.
3D Environment Simulators:
3D simulators provide the user with complex 3D environments, including near-photorealistic environments such as AI2-Thor kolve2017ai2, and include physics engines that model forces, liquids, illumination, containment, and other object interactions. Because of their rendering fidelity, they offer the possibility of inexpensively training robotic models in virtual environments that can then be transferred to the real world (e.g. RoboThor, robothor). Adding objects to 3D worlds can be expensive, as this requires 3D modelling expertise that teams may not have. Similarly, adding agent actions or object-object interactions through a scripting language can be expensive if those actions are outside what is easily implemented in the existing 3D engine or physics engine (like creating gases, or using a pencil or saw to modify an object). Because of this, action spaces tend to be small, and limited to movement and one (or a small number of) interaction commands. Some simulators and environments include text directives for an agent to perform, such as an agent being asked to “slice an apple then cool it” in the ALFRED environment ALFRED20. Other hybrid environments such as ALFWorld shridhar2020alfworld simultaneously render an environment both in 3D and as text, allowing agents to learn high-level task knowledge through text interactions, then ground this knowledge in environment-specific perceptual input.
Voxel-based simulators create worlds from (typically) large 3D blocks, lowering rendering fidelity while greatly reducing the time and skill required to add new objects. Similarly, creating new agent-object or object-object interactions can be easier because they can generally be implemented in a coarser manner – though some kinds of basic spatial actions (like rotating an object in increments smaller than 90 degrees) are generally not easily implemented. Malmo johnson2016malmo and MineRL guss2019minerl offer wrappers and training data to build agents in the popular Minecraft environment. While the agent’s action space is limited in Minecraft (see Table 4), the crafting nature of the game (that allows collecting, creating, destroying, or combining objects using one or more voxels) affords exploring a variety of compositional reasoning tasks with a low barrier to entry, while still using a 3D environment. Text directives, like those in CraftAssist gray2019craftassist, allow agents to learn to perform compositional crafting actions in this 3D environment from natural language dialog.
2D gridworlds are comparatively easier to construct than 3D environments, and as such more options are available. GridWorlds share the commonality that they exist on a discretized 2D plane, typically containing a maximum of a few dozen cells on either dimension (or, tens to hundreds of discrete cells, total). Cells are discrete locations that (in the simplest case) contain up to a single agent or object, while more complex simulators allow cells to contain more than one object, including containers. Agents move on the plane through simplified spatial dynamics, at a minimum rotate left, rotate right, and move forward, allowing the entire world to be explored through a small action space.
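The minimal movement dynamics described above (rotate left, rotate right, move forward) can be illustrated with a short sketch. This is not the implementation of any particular simulator; the class name `GridAgent`, the 5x5 grid size, and the coordinate convention are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Headings in clockwise order as (dx, dy), with y increasing downward.
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # north, east, south, west


@dataclass
class GridAgent:
    """Hypothetical agent on a discretized 2D plane (illustrative only)."""
    x: int
    y: int
    heading: int = 0   # index into HEADINGS; 0 = north
    width: int = 5     # assumed grid size
    height: int = 5

    def rotate_left(self) -> None:
        self.heading = (self.heading - 1) % 4

    def rotate_right(self) -> None:
        self.heading = (self.heading + 1) % 4

    def move_forward(self) -> None:
        dx, dy = HEADINGS[self.heading]
        nx, ny = self.x + dx, self.y + dy
        # The agent stays in place if the move would leave the grid.
        if 0 <= nx < self.width and 0 <= ny < self.height:
            self.x, self.y = nx, ny


agent = GridAgent(x=2, y=2)
agent.rotate_right()   # now facing east
agent.move_forward()   # moves to (3, 2)
agent.rotate_right()   # now facing south
agent.move_forward()   # moves to (3, 3)
print(agent.x, agent.y)
```

With only these three actions every cell of an obstacle-free grid is reachable, which is why such a small action space suffices for exploration.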
Where gridworlds tend to differ is in their rendering fidelity, and their non-movement action spaces. In terms of rendering, some (such as BABYAI, chevalier2018babyai) render a world graphically, using pixels, with simplified shapes for improving rendering throughput and reducing RL agent training time. Others such as NetHack kuttler2020nethack are rendered purely as textual characters, owing to their original nature as early terminal-only games. Some simulators (e.g. Griddly, bamford2021griddly) support a range of rendering fidelities, from sprites (slowest) to shapes to text characters (fastest), depending on how critical rendering fidelity is for experimentation. As with 3D simulators, hybrid environments (like VisualHints, carta2020visualhints) exist, where environments are simultaneously rendered as a Text World and accompanying GridWorld that provides an explicit spatial map.
Action spaces vary considerably in GridWorld simulators, owing to the different scripting environments that each affords. Some environments have a small set of hardcoded environment rules (e.g. BABYAI), while others (e.g. NetHack) offer nearly 100 agent actions, rich crafting, and complex agent-object interactions. Text can occur in the form of task directives (e.g. “put a ball next to the blue door” in BABYAI), partial natural language descriptions of changes in the environmental state (e.g. “You are being attacked by an orc” in NetHack), or as full Text World descriptions in hybrid environments.
|The Kitchen is a room. "A well-stocked Kitchen."|
|The Living Room is north of the Kitchen.|
|A stove is in the Kitchen. A table is in the Kitchen. A plate is on the table.|
|An Apple is on the plate. The Apple is edible.|
|The cook is a person in the Kitchen. The description is "A busy cook."|
|The ingredient list is carried by the cook. The description is "A list of|
|ingredients for the cook’s favorite recipe.".|
|Instead of listening to the cook:|
|say "The cook asks if you can help find some ingredients, and hands|
|you a shopping list from their pocket.";|
|move the ingredient list to the player.|
|A well-stocked Kitchen.|
|You can see a stove, a table (on which is a plate (on which is an Apple))|
|and a cook here.|
|(first taking the Apple)|
|You eat the Apple. Not bad.|
|>listen to cook|
|The cook asks if you can help find some ingredients, and hands you a|
|shopping list from their pocket.|
Text World simulators render an agent’s world view directly into textual descriptions of their environment, rather than into 2D or 3D graphical renderings. Similarly, actions the agent wishes to take are typically provided to the simulator as text (e.g. “read the letter” in Zork), requiring agent models both to parse input text from the environment and to generate output text to interact with that environment.
In terms of simulators, the Z-machine learningZIL1989 is a low-level virtual machine originally designed by Infocom for creating portable interactive fiction novels (such as Zork). It was paired with a high-level LISP-like domain-specific language (ZIL) that included libraries for text parsing, and other tools for writing interactive fiction novels. The Z-machine standard was reverse-engineered by others (e.g. nelson2014zmachine) in an effort to build their own high-level interactive fiction domain-specific languages, and has since become a standard compilation target due to the proliferation of existing tooling and legacy environments. (Footnote 1: A variety of text adventure tooling, including the Adventure Game Toolkit (AGT) and Text Adventure Development System (TADS), was developed starting in the late 1980s, but these simulators have generally not been adopted by the NLP community in favour of the more popular Inform series tools.)
Inform7 nelson2006natural is a popular high-level language designed for interactive fiction novels that allows environment rules to be directly specified in a simplified natural language, substantially lowering the barrier to entry for creating text worlds (see Table 3 for an example). The text generation engine allows substantial variation in the way the environments are described, from dry formulaic text to more natural, varied, conversational descriptions. Inform7 is compiled to Inform6, an earlier object-oriented scripting language with C-like syntax, which itself is compiled to Z-machine code.
Ceptre martens2015ceptre is a linear-logic simulation engine developed with the goal of specifying more generic tooling for operational logics than Inform7. TextWorld cote2018textworld adapts Ceptre’s linear logic state transitions for environment descriptions, and adds tooling for generative environments, visualization, and RL agent coupling, all of which is compiled into Inform7 source code. Parallel to this, the Jericho environment hausknecht2020interactive allows inferring relevant vocabulary and template-based object interactions for Z-machine-based interactive fiction games, easing action selection for agents.
3.2 Text World Modeling Paradigms
3.2.1 Environment Modelling
Environments are typically modeled as an object tree that stores all the objects in an environment and their nested locations, as well as a set of action rules that implement changes to the objects in the environment based on actions.
Because of the existing body of interactive fiction environments for the Z-machine, and because nearly all popular tooling (Inform7, TextWorld, etc.) ultimately compiles to Z-machine code, object models typically use the Z-machine model nelson2014zmachine. Z-machine objects have names (e.g. “mailbox”), descriptions (e.g. “a small wooden mailbox”), binary flags called attributes (e.g. “is_container_open”), and generic properties stored as key-value pairs. Objects are stored in the object tree, which represents the locations of all objects in the environment through parent-child relationships, as shown in Figure 1.
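As a rough illustration of this object model, the following sketch stores names, descriptions, attributes, and key-value properties on each object, and represents location through parent-child links. It is a simplified analogue written for this discussion, not the Z-machine's actual memory layout; the class name `WorldObject`, its methods, and the example room and leaflet are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # compare objects by identity, since tree links are cyclic
class WorldObject:
    """Hypothetical object-tree node loosely modelled on the description above."""
    name: str
    description: str
    attributes: set[str] = field(default_factory=set)          # binary flags
    properties: dict[str, object] = field(default_factory=dict)  # key-value pairs
    parent: WorldObject | None = None
    children: list[WorldObject] = field(default_factory=list)

    def move_to(self, new_parent: WorldObject) -> None:
        """Re-parent this object, e.g. when it is taken or put somewhere."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        new_parent.children.append(self)

    def location_path(self) -> list[str]:
        """Names from the root of the tree down to this object."""
        path, node = [], self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


room = WorldObject("West of House", "An open field west of a white house.")
mailbox = WorldObject("mailbox", "a small wooden mailbox",
                      attributes={"is_container_open"})
leaflet = WorldObject("leaflet", "a small leaflet", properties={"size": 1})
mailbox.move_to(room)
leaflet.move_to(mailbox)
print(leaflet.location_path())
```

Moving the leaflet into the player's inventory would then be a single `move_to` call, which is what makes the tree a convenient representation for nested containment.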
Action rules describe how objects in the environment change in response to a given world state, which is frequently a collection of preconditions followed by an action taken by an agent (e.g. “eat the carrot”), but can also be due to environment states (e.g. a plant dying because it hasn’t been watered for a time greater than some threshold).
Ceptre martens2015ceptre and TextWorld cote2018textworld use linear logic to represent possible valid state transitions. In linear logic, a set of preconditions in the state history of the world can be consumed by a rule to generate a set of postconditions. The set of states are delimited with the linear logic multiplicative conjunction operator ($\otimes$), while preconditions and postconditions are delimited using the linear implication operator ($\multimap$). For example, if a player $P$ is at location $L$, and a container $C$ is at that location, and that container is closed, a linear logic rule can be defined to allow the state to be changed to open, as represented by:

$$\mathrm{open}(C) \;::\; \$\mathrm{at}(P, L) \otimes \$\mathrm{at}(C, L) \otimes \mathrm{closed}(C) \multimap \mathrm{open}(C)$$

Note that prepending $ to a predicate signifies that it will not be consumed by a rule, but carried over to the postconditions. Here, that means that after opening container $C$, both $P$ and $C$ will still be at location $L$.
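The consume-and-produce behaviour of such a rule can be sketched in a few lines. The sketch below applies the "open a closed container" rule to a world state, treating the state as a set of ground facts rather than the multiset that linear logic strictly uses; that simplification, the function name `apply_rule`, the symbols P, C, and L, and the tuple encoding of predicates are all assumptions for illustration, not the Ceptre or TextWorld implementation.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

Fact = Tuple[str, ...]


def apply_rule(state: FrozenSet[Fact],
               persistent: Iterable[Fact],
               consumed: Iterable[Fact],
               produced: Iterable[Fact]) -> Optional[FrozenSet[Fact]]:
    """Apply one rule: return the new state, or None if preconditions fail.

    `persistent` facts play the role of $-prefixed predicates: they must
    hold, but are carried over unchanged. `consumed` facts are removed and
    `produced` facts are added.
    """
    persistent, consumed, produced = set(persistent), set(consumed), set(produced)
    if not (persistent | consumed) <= state:
        return None
    return frozenset((state - consumed) | produced)


# Player P and container C are both at location L; C starts closed.
state = frozenset({("at", "P", "L"), ("at", "C", "L"), ("closed", "C")})
open_rule = dict(
    persistent=[("at", "P", "L"), ("at", "C", "L")],
    consumed=[("closed", "C")],
    produced=[("open", "C")],
)
new_state = apply_rule(state, **open_rule)
print(sorted(new_state))
# Applying the rule again fails, because closed(C) was consumed.
print(apply_rule(new_state, **open_rule))
```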
Côté et al. cote2018textworld note the limitations in existing implementations of state transition systems for text worlds (such as single-step forward or backward chaining), and suggest future systems may wish to use mature action languages such as STRIPS fikes1971strips or GDL genesereth2005general; thielscher2010general; thielscher2017gdl as the basis of a world model, though each of these languages has tradeoffs in features (such as object typing) and general expressivity (such as being primarily agent-action centered, rather than easily implementing environment-driven actions) that make certain kinds of modeling more challenging. As a proof-of-concept, ALFWorld shridhar2020alfworld uses the Planning Domain Definition Language (PDDL, mcdermott1998pddl) to define the semantics for the variety of pick-and-place tasks in its text world rendering of the ALFRED benchmark.
3.2.2 Agent Modelling
While environments can be modelled as a collection of states and allowable state transitions (or rules), agents typically have incomplete or inaccurate information about the environment, and must make observations of the environment state through (potentially noisy or inadequate) sensors, and take actions based on those observations. Because of this, agents are typically modelled as partially-observable Markov decision processes (POMDP) kaelbling1998planning, defined by the tuple $(S, T, A, R, \Omega, O, \gamma)$.
A Markov decision process (MDP) contains the state history $S$, valid state transitions $T$, available actions $A$, and (for agent modeling) the expected immediate reward for taking each action $R$. POMDPs extend this to account for partial observability by supplying a finite list of observations the agent can make $\Omega$, and an observation function $O$ that returns what the agent actually observes from an observation, given the current world state. For example, the observation function might return unknown if the agent tries to examine the contents of a locked container before unlocking it, because the contents cannot yet be observed. Similarly, when observing the temperature of a cup of tea, the observation function might return coarse measurements (e.g. hot, warm, cool) if the agent uses their hand for measurement, or fine-grained measurements (e.g. a numeric temperature reading) if the agent uses a thermometer. A final discount factor ($\gamma$) influences whether the agent prefers immediate rewards, or eventual (distant) rewards. The POMDP then serves as a model for a learning framework, typically reinforcement learning (RL), to learn a policy that enables the agent to maximize the reward.
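The two observation examples and the role of the discount factor can be made concrete with a small sketch. The function names, the dictionary representation of a container, and the temperature thresholds below are arbitrary assumptions chosen for illustration; they are not taken from any specific Text World framework.

```python
from typing import Sequence, Union


def observe_contents(container: dict) -> Union[str, list]:
    """Observation function for examining a container's contents."""
    if container["locked"]:
        return "unknown"   # contents cannot be observed before unlocking
    return container["contents"]


def observe_temperature(celsius: float, sensor: str) -> Union[str, float]:
    """Coarse reading by hand, fine-grained reading by thermometer."""
    if sensor == "thermometer":
        return celsius
    # Hand measurement: thresholds are arbitrary illustrative values.
    if celsius >= 50:
        return "hot"
    if celsius >= 30:
        return "warm"
    return "cool"


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


chest = {"locked": True, "contents": ["coin"]}
print(observe_contents(chest))                  # hidden while locked
chest["locked"] = False
print(observe_contents(chest))                  # visible once unlocked
print(observe_temperature(65.0, "hand"))
print(observe_temperature(65.0, "thermometer"))
# A reward of 1 received two steps in the future, with gamma = 0.5.
print(discounted_return([0, 0, 1], gamma=0.5))
```

A discount factor near 0 makes the delayed reward in the last line nearly worthless to the agent, while a value near 1 makes it almost as attractive as an immediate reward.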
4 Text World Environments
|3D Environment Simulators|
|ALFRED||7 Command||pickup, put, heat, cool, slice, toggle-on, toggle-off|
|5 Movement||move forward, rotate left, rotate right, look up, look down|
|8 Movement||move forward, move back, strafe left, strafe right, look left, look right, etc.|
|9 Movement||pitch (+/-), yaw (+/-), forward, backward, left, right, jump|
|BABYAI||4 Command||pickup, drop, toggle, done|
|3 Movement||turn left, turn right, move forward|
|NETHACK||77 Command||eat, open, kick, read, tip over, wipe, jump, look, cast, pay, ride, sit, throw, wear, …|
|16 Movement||8 compass directions x 2 possibilities (move one step, move far)|
|SocialAI||4 Command||pickup, drop, toggle, done (from BABYAI)|
|3 Movement||turn left, turn right, move forward (from BABYAI)|
|4x16 Text||4 templates (what, where, open, close) x 16 objects (exit, bed, fridge)|
|ALFWorld||11 Command||goto, take, put, open, close, toggle, heat, cool, clean, inventory, examine|
|LIGHT||11 Command||get, drop, put, give, steal, wear, remove, eat, drink, hug, hit|
|22 Emotive||applaud, blush, cringe, cry, frown, sulk, wave, wink, …|
|PEG (Biomedical)||35 Command||add, transfer, incubate, store, mix, spin, discard, measure, wash, cover, wait, …|
|Zork||56 Command||take, open, read, drop, turn on, turn off, attack, eat, kill, cut, drink, smell, listen, …|
Note that while Text World parsers generally recognize a specified number of command verbs, the template extractor in the Jericho framework estimates that common interactive fiction games (including Zork) contain between 150 and 300 valid action templates, e.g. put OBJ in OBJ hausknecht2020interactive.
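To show how a modest number of templates expands into a large candidate action space, the sketch below fills OBJ slots with object names. It is a generic illustration and does not use or reproduce the Jericho API; the function name `fill_template` and the example templates and objects are assumptions, and the generated strings are candidates only, since a simulator would still need to decide which of them are valid in the current state.

```python
from itertools import permutations
from typing import List


def fill_template(template: str, objects: List[str]) -> List[str]:
    """Enumerate candidate actions by filling each OBJ slot with a distinct object."""
    slots = template.count("OBJ")
    actions = []
    for combo in permutations(objects, slots):
        action = template
        for obj in combo:
            # Replace the left-most remaining slot.
            action = action.replace("OBJ", obj, 1)
        actions.append(action)
    return actions


templates = ["look", "take OBJ", "put OBJ in OBJ"]
objects = ["leaflet", "mailbox"]
candidates = [a for t in templates for a in fill_template(t, objects)]
for action in candidates:
    print(action)
```

With k objects in scope, a two-slot template alone yields k(k-1) candidates, so a few hundred templates quickly produce thousands of Verb-NounPhrase combinations.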
Environments are worlds implemented in simulators, that agents explore to perform tasks. Environments can be simple or complex, test specific or domain-general competencies, be static or generative, and have small or large action spaces compared to higher-fidelity simulators (see Table 4 for a comparison of action space sizes).
4.1 Single Environment Benchmarks
Single environment benchmarks typically consist of small environments designed to test specific agent competencies, or larger challenging interactive fiction environments that test broad agent competencies to navigate around a diverse world and interact with the environment toward achieving some distant goal. Toy environments frequently evaluate an agent’s ability to perform compositional reasoning tasks of increasing lengths, such as in the Kitchen Cleanup and related benchmarks murugesan2020enhancing. Other toy worlds explore searching environments to locate specific objects yuan2018counting, or combining source materials to form new materials jiang2020wordcraft. While collections of interactive fiction environments are used as benchmarks (see Section 4.3), individual environments frequently form single benchmarks. Zork lebling1979zork and its subquests are medium-difficulty environments frequently used in this capacity, while Anchorhead anchorhead is a hard-difficulty environment where state-of-the-art performance remains below 1%.
4.2 Domain-specific Environments
Domain-specific environments allow agents to learn highly specific competencies relevant to a single domain, like science or medicine, while typically involving more modeling depth than toy environments. Tamari et al. tamari-etal-2021-process create a TextWorld environment for wet lab chemistry protocols, that describe detailed step-by-step instructions for replicating chemistry experiments. These text-based simulations can then be represented as process execution graphs (PEG), which can then be run on real lab equipment. A similar environment exists for the materials science domain tamari2019playing.
4.3 Environment Collections as Benchmarks
To test the generality of agents, large collections of interactive fiction games (rather than single environments) are frequently used as benchmarks. While the Text-Based Adventure AI Shared Task initially evaluated on a single benchmark environment, later instances switched to evaluating on 20 varied environments to gauge generalization atkinson2019text. Fulda et al. fulda2017can created a list of 50 interactive fiction games to serve as a benchmark for agents to learn common-sense reasoning. Côté et al. cote2018textworld further curate this list, replacing 20 games without scores with those more useful for RL agents. The Jericho benchmark hausknecht2020interactive includes 32 interactive fiction games that support Jericho’s in-built methods for score and world-change detection, out of a total of 56 games known to support these features.
4.4 Generative Environments
A difficulty with statically-initialized environments is that because their structure is identical each time the simulation is run, rather than learning general skills, agents quickly overfit to learn solutions to an exact instantiation of a particular task and environment, and rarely generalize to unseen environments chaudhury-etal-2020-bootstrapped. Procedurally generated environments help address this need by generating variations of environments centered around specific goal conditions.
The TextWorld simulator cote2018textworld allows specifying high-level parameters such as the number of rooms, objects, and winning conditions, then uses a random walk to procedurally generate environment maps in the Inform7 language meeting those specifications, using either forward or backward chaining during generation to verify tasks can be successfully completed in the random environment. As an example, the First TextWorld Problems shared task (https://competitions.codalab.org/competitions/21557) used TextWorld to generate 5k variations of a cooking environment, divided into train, development, and test sets. Similarly, Murugesan et al. murugesan2020text introduce TextWorld CommonSense (TWC), a simple generative environment for household cleaning tasks, modelled as a pick-and-place task where agents must pick up common objects from the floor, and place them in their common household locations (such as placing shoes in a shoe cabinet). Other related environments include Coin Collector yuan2018counting, a generative environment for a navigation task, and Yin et al.’s yin2019learn procedurally generated environment for cooking tasks.
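The idea of generating a map by random walk can be sketched as follows. This is a generic illustration and not TextWorld's generator: the function name `random_walk_map`, the grid-coordinate representation of rooms, and the seed parameter are assumptions, and the sketch produces only a connected room layout, without objects, goals, or the chaining-based solvability check described above.

```python
import random
from typing import Dict, Tuple

Coord = Tuple[int, int]

DIRECTIONS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def random_walk_map(num_rooms: int, seed: int = 0) -> Dict[Coord, Dict[str, Coord]]:
    """Walk randomly on a grid, creating rooms and two-way exits as we go.

    Returns a mapping from room coordinate to {direction: neighbour coordinate}.
    Every room is reachable from the start, because each new room is created
    by stepping into it from an existing one.
    """
    rng = random.Random(seed)
    position: Coord = (0, 0)
    rooms: Dict[Coord, Dict[str, Coord]] = {position: {}}
    while len(rooms) < num_rooms:
        direction = rng.choice(sorted(DIRECTIONS))
        dx, dy = DIRECTIONS[direction]
        nxt = (position[0] + dx, position[1] + dy)
        rooms.setdefault(nxt, {})
        rooms[position][direction] = nxt            # exit from current room
        rooms[nxt][OPPOSITE[direction]] = position  # matching exit back
        position = nxt
    return rooms


rooms = random_walk_map(num_rooms=5, seed=1)
for coord, exits in sorted(rooms.items()):
    print(coord, sorted(exits))
```

Changing the seed yields a different layout that meets the same high-level specification, which is the property that helps discourage agents from memorizing a single map.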
Adhikari et al. adhikari2020learning generate a large set of recipe-based cooking games, where an agent must precisely follow a cooking recipe that requires collecting tools (e.g. a knife) and ingredients (e.g. carrots), and processing those ingredients correctly (e.g. dice carrots, cook carrots) in the correct order. Jain et al. jain2020algorithmic propose a similar synthetic benchmark for multi-step compositional reasoning called SaladWorld. In the context of question answering, Yuan et al. yuan-etal-2019-interactive procedurally generate a simple environment that requires an agent to search and investigate attributes of objects, such as verifying their existence, locations, or specific attributes (like edibility). On balance, while tooling exists to generate simple procedural environments, the current state-of-the-art allows for generating only relatively simple environments, with simpler tasks and nearer-term goals than human-authored interactive fiction games (such as Zork).
|CALM (GPT-2) yao-etal-2020-keep||0.80||0.09||0.07||0.14||0.05||0.01|
Agent performance on benchmark interactive fiction environments. All performance values are normalized to maximum achievable scores in a given environment. Due to the lack of standard reporting practice, performance reflects values reported for agents, but is unable to hold other elements (such as number of training epochs, number of testing epochs, reporting average vs maximum performance) constant. Parentheses denote environment difficulty (E:Easy, M:Medium, H:Hard) as determined by the Jericho benchmark hausknecht2020interactive.
5 Text World Agents
Recently a large number of agents have been proposed for Text World environments. This section briefly surveys common modeling methods, paradigms, and trends, with the performance of recent agents on common easy, medium, and hard interactive fiction games (as categorized by the Jericho benchmark, hausknecht2020interactive) shown in Table 5.
While some agents rely on learning frameworks heavily coupled with heuristics (e.g., kostka2017text, Golovin), the predominant modeling paradigm for most contemporary text world agents is reinforcement learning, owing to the sampling benefits afforded by operating in a virtual environment. Narasimhan et al. narasimhan-etal-2015-language demonstrate that “Deep-Q Networks” (DQN) mnih2015human developed for Atari games can be augmented with LSTMs for representation learning in Text Worlds, and that these outperform simpler methods using n-gram bag-of-words representations. He et al. (he-etal-2016-deep, DRRN) extend this to build the Deep Reinforcement Relevance Network (DRRN), an architecture that uses separate embeddings for the state space and actions, to improve both training time and performance. Madotto et al. madotto2020exploration show that the Go-Explore algorithm ecoffet2019go, which periodically returns to promising but underexplored areas of a world, can achieve higher scores than the DRRN with fewer steps. Zahavy et al. (zahavy2018learn, AE-DQN) use an Action Elimination Network (AEN) to remove sub-optimal actions, showing improved performance over a DQN on Zork. Yao et al. (yao-etal-2020-keep, CALM) use a GPT-2 language model trained on human gameplay to reduce the space of possible input command sequences, and produce a shortlist of candidate actions for an RL agent to select from. Yao et al. (yao2021reading, INV-DY) demonstrate that semantic modeling is important, showing that models that either encode semantics through an inverse dynamic decoder, or discard semantics by replacing words with unique hashes, have different performance distributions in different environments. Taking a different approach, Tessler et al. (tessler2019action, IK-OMP) show that imitation learning combined with a compressed sensing framework can solve Zork when restricted to a vocabulary of 112 words extracted from walk-through examples.
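The idea of embedding the state text and each candidate action text separately, then scoring each state-action pair, can be illustrated with a minimal sketch. This is not the DRRN authors' implementation: the learned networks are replaced here by fixed pseudo-random bag-of-words embeddings, the pairing function is assumed to be a plain inner product, and every name (`word_vector`, `embed`, `q_values`, `DIM`) is hypothetical.

```python
import hashlib
import random

DIM = 16  # embedding size (arbitrary choice for this sketch)


def word_vector(word, salt):
    """Deterministic pseudo-random vector for a word; `salt` selects the embedding table."""
    seed = int(hashlib.md5(f"{salt}:{word}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def embed(text, salt):
    """Average the word vectors of a text (a stand-in for a learned encoder)."""
    vectors = [word_vector(w, salt) for w in text.lower().split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def q_values(state_text, action_texts):
    """Score each (state, action) pair with an inner product of separate embeddings."""
    state_vec = embed(state_text, salt="state")
    scores = {}
    for action in action_texts:
        action_vec = embed(action, salt="action")
        scores[action] = sum(s * a for s, a in zip(state_vec, action_vec))
    return scores


state = "You are in a clearing. There is a pile of leaves on the ground."
actions = ["move leaves", "go west", "go south"]
scores = q_values(state, actions)
best_action = max(scores, key=scores.get)  # greedy choice over the candidate actions
```

Because the embeddings here are untrained, the chosen action is arbitrary; in a trained agent the two encoders would be optimized so that higher scores correspond to higher expected reward.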
Augmenting reinforcement learning models to produce knowledge graphs of their beliefs can reduce training time and improve overall agent performance ammanabrolu-riedl-2019-playing. Ammanabrolu et al. (ammanabrolu2020graph, KG-A2C) demonstrate a method for training an RL agent that uses a knowledge graph to model its state-space, and use a template-based action space to achieve strong performance across a variety of interactive fiction benchmarks. Adhikari et al. adhikari2020learning demonstrate that a Graph Aided Transformer Agent (GATA) is able to learn implicit belief networks about its environment, improving agent performance in a cooking environment. Xu et al. (xu2020deep, SHA-KG) extend KG-A2C to use hierarchical RL to reason over subgraphs, showing substantially improved performance on a variety of benchmarks.
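As a rough illustration of how a knowledge graph of beliefs can feed a template-based action space, the sketch below stores beliefs as (subject, relation, object) triples and fills single-slot templates with objects drawn from the graph. It is a hand-written toy under stated assumptions, not the KG-A2C method: the class, the `in` relation, the `OBJ` slot marker, and the templates are all hypothetical, and real agents learn which template and object to select.

```python
class KnowledgeGraph:
    """Stores the agent's beliefs as (subject, relation, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def objects_at(self, location):
        """Objects believed to be in a location, sorted for stable output."""
        return sorted(s for (s, r, o) in self.triples if r == "in" and o == location)


def fill_templates(templates, objects):
    """Expand templates that have a single OBJ slot using objects from the graph."""
    return [t.replace("OBJ", obj) for t in templates for obj in objects]


graph = KnowledgeGraph()
graph.add("player", "at", "tree")  # the player's position uses a different relation
graph.add("nest", "in", "tree")
graph.add("egg", "in", "tree")

templates = ["take OBJ", "examine OBJ"]
candidate_actions = fill_templates(templates, graph.objects_at("tree"))
# candidate_actions == ["take egg", "take nest", "examine egg", "examine nest"]
```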
To support these modelling paradigms, Zelinka et al. zelinka2019building introduce TextWorld KG, a dataset for learning the subtask of updating knowledge graphs based on text world descriptions in a cooking domain, and show their best ensemble model is able to achieve 70 F1 at this subtask. Similarly, Ammanabrolu et al. ammanabrolu2021modeling introduce JerichoWorld, a dataset for world modeling using knowledge graphs that covers a broader set of interactive fiction games, and subsequently introduce WorldFormer ammanabrolu2021learning, a multi-task transformer model that performs well at both knowledge-graph prediction and next-action prediction tasks.
Agents can reframe Text World tasks as question answering tasks to gain relevant knowledge for action selection, with these agents providing current state-of-the-art performance across a variety of benchmarks. Guo et al. (guo-etal-2020-interactive, MPRC-DQN) use multi-paragraph reading comprehension (MPRC) techniques to ask questions that populate action templates for agents, substantially reducing the number of training examples required for RL agents while achieving strong performance on the Jericho benchmark. Similarly, Ammanabrolu et al. (ammanabrolu2020avoid, MC!Q*BERT) use contextually-relevant questions (such as “Where am I?”, “Why am I here?”, and “What do I have?”) to populate their knowledge base to support task completion.
Agents arguably require a large background of common-sense or world knowledge to perform embodied reasoning in general environments. Fulda et al. fulda2017can extract common-sense affordances from word vectors trained on Wikipedia using word2vec mikolov2013distributed, and use this to increase performance on interactive fiction games, as well as (more generally) on robotic learning tasks fulda2017harvesting. Murugesan et al. murugesan2020enhancing combine the ConceptNet speer2017conceptnet common-sense knowledge graph with an RL agent that segments knowledge between general world knowledge, and specific beliefs about the current environment, demonstrating improved performance in a cooking environment. Similarly, Dambekodi et al. dambekodi2020playing demonstrate that RL agents augmented with either COMET bosselut2019comet, a transformer trained on common-sense knowledge bases, or BERT devlin-etal-2019-bert, which is hypothesized to contain common-sense knowledge, outperform agents without this knowledge on the interactive fiction game 9:05. In the context of social reasoning, Ammanabrolu et al. ammanabrolu-etal-2021-motivate create a fantasy-themed knowledge graph, ATOMIC-LIGHT, and show that an RL agent using this knowledge base performs well at the social reasoning tasks in the LIGHT environment.
5.1 Generalization across environments
Agents trained in one environment rarely transfer their performance to other environments. Yuan et al. yuan2018counting propose that dividing learning into segmented episodes can improve transfer learning to unseen environments, and demonstrate transfer to more difficult procedurally-generated versions of Coin Collector environments. Yin et al. yin2019learn demonstrate that separating knowledge into universal (environment-general) and instance (environment-specific) knowledge using curriculum learning improves model generalization.
Ansari et al. ansari2018language suggest policy distillation may reduce overfitting in Text Worlds, informed by experiments using an LSTM-DQN model on 5 toy environments designed to test generalization. Similarly, Chaudhury et al. chaudhury-etal-2020-bootstrapped demonstrate that policy distillation can be used on RL agents to reduce overfitting and improve generalization to unseen environments using 10x fewer training episodes when training on the Coin Collector environment, and evaluating in a cooking environment. Adolphs et al. adolphs2020ledeepchef combine an actor-critic approach with hierarchical RL to demonstrate agent generalization on cooking environments.
Yin et al. yin2020zero propose a method for factorizing Q values that allows agents to better learn in multi-task environments where tasks have different times to reaching rewards. They empirically demonstrate this novel Q-factorization method requires an order of magnitude less training data, while enabling zero-shot performance on unseen environments. A t-SNE plot shows that Q-factorization produces qualitatively different and well-clustered learning of game states compared to conventional Q learning.
6 Contemporary Focus Areas
6.1 World Generation
Generating detailed environments with complex tasks is laborious, while randomly generating environments currently provides limited task complexity and environment cohesiveness. World generation aims to support the generation of complex, coherent environments, either through better tooling for human authors (e.g. temprado2019online), or automated generation systems that may or may not have a human-in-the-loop. Fan et al. fan2020generating explore creating cohesive game worlds in the LIGHT environment using a variety of embedding models including Starspace wu2018starspace and BERT devlin-etal-2019-bert. Automatic evaluations show performance of between 36-47% in world building, defined as cohesively populating an environment with locations, objects, and characters. Similarly, human evaluation shows that users prefer Starspace-generated environments over those generated by a random baseline. In a more restricted domain, Ammanabrolu et al. ammanabrolu2019toward show that two models, one a Markov chain model, the other a generative language model (GPT-2), are capable of generating quests in a cooking environment, though there is a tradeoff between human ratings of quest creativity and coherence.
Ammanabrolu et al. ammanabrolu2020bringing propose a large-scale end-to-end solution to world generation that automatically constructs interactive fiction environments based on a story (such as Sherlock Holmes) provided as input. Their system first builds a knowledge graph of the story by framing KG construction as a question answering task, using their model (AskBERT) to populate this graph. The system then uses either a rule-based baseline or a generative model (GPT-2) to generate textual descriptions of the world from this knowledge graph. User studies show that humans generally prefer these neural-generated worlds to the rule-generated worlds (measured in terms of interest, coherence, and genre-resemblance), but that neural-generated performance still substantially lags behind that of human-generated worlds.
6.2 Social Modeling and Dialog
While embodied environments provide an agent the opportunity to explore and interact with a physical environment, they also provide an opportunity for exploring human-to-agent or agent-to-agent social interaction grounded in particular situations, consisting of collections of locations, and the objects and agents in those locations.
The LIGHT platform urbanek-etal-2019-learning is a large text-only dataset designed specifically to study grounded social interaction. More than modeling environments and physical actions, LIGHT includes a large set of 11k crowdsourced situated agent-to-agent dialogs, and sets of emotive as well as physical actions. Specifically, LIGHT emulates a multi-user dungeon (MUD) dialog scenario, and includes 663 interconnected locations, which can be populated with 1755 characters and 3462 objects. In addition to 11 physical actions (get, eat, wear, etc.), it includes 22 emotive actions (e.g. applaud, cry, wave) that affect agent behavior. At each turn, an agent can say something (in natural language) to another agent, take a physical action, and take an emotive action. Urbanek et al. urbanek-etal-2019-learning propose a next-action prediction task (modelled as picking the correct dialog, physical action, and emotive action from 20 possible choices for each), and evaluate both model (BERT) and human performance. Human vs model performance reaches 92 vs 71 for predicting dialog, 72 vs 52 for physical actions, and 34 vs 29 for emotives, demonstrating the difficulty of the task for models, and the large variance in predicting accompanying emotives for humans. Others have proposed alternate models, such as Qiu et al. qiu2021towards, who develop a mental state parser that explicitly models an agent’s mental state (including both the agent’s physical observations and beliefs) in a graphical formalism to increase task performance.
As an alternate task, Prabhumoye et al. prabhumoye2020love use LIGHT to explore persuasion or task-cueing, requiring the player to say something that would cause an agent to perform a specific situated action (like picking up a weapon while in the armory) or emotive (like hugging the player). This is challenging, with the best RL models succeeding only about half the time.
Kovač et al. kovavc2021socialai propose using pragmatic frames vollmer2016pragmatic as a means of implementing structured dialog sessions between agents (with specific roles) in embodied environments. To explore this formalism, they extend BABYAI to include multiple non-player character (NPC) agents that the player must seek information from to solve their SocialAI navigation task. Inter-agent communication uses 4 frames and 16 referents, for a total of 64 possible utterances, and requires modeling belief states as NPCs may be deceptive. An RL agent performs poorly at this task, highlighting the difficulty of modeling even modest social interactions that involve communicating with multiple agents.
As environments move from toy problems to large spaces that contain multiple instances of the same category of object (like more than one cup or house), agents that communicate (e.g. “pick up the cup”, “go to the house”) have to resolve which cup or house is referenced. Kelleher et al. kelleher2019referring propose resolving ambiguous references in embodied agent dialog by using a decaying working memory, similar to models of human short-term memory (e.g. atkinson1968human), that resolves a reference to the most recently observed or recently thought-about object.
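A decaying working memory of this kind can be sketched in a few lines. The sketch below is only an illustration of the general idea, not the model of Kelleher et al.: it assumes a simple geometric decay per time step, resets activation to 1.0 whenever an object is observed or mentioned, and uses hypothetical names throughout (`DecayingMemory`, `attend`, `step`, `resolve`).

```python
class DecayingMemory:
    """Tracks an activation per object that decays at every time step."""

    def __init__(self, decay=0.5):
        self.decay = decay    # fraction of activation kept per step (assumed value)
        self.activation = {}  # object id -> current activation
        self.category = {}    # object id -> category, e.g. "cup"

    def attend(self, obj_id, category):
        """Refresh an object that was just observed or mentioned."""
        self.category[obj_id] = category
        self.activation[obj_id] = 1.0

    def step(self):
        """Advance time: every activation decays geometrically."""
        for obj_id in self.activation:
            self.activation[obj_id] *= self.decay

    def resolve(self, category):
        """Return the most active object of a category, or None if none is known."""
        candidates = [o for o, c in self.category.items() if c == category]
        if not candidates:
            return None
        return max(candidates, key=lambda o: self.activation[o])


memory = DecayingMemory()
memory.attend("cup_kitchen", "cup")
memory.step()
memory.attend("cup_table", "cup")
memory.step()
referent = memory.resolve("cup")  # "cup_table": attended more recently, so less decayed
```

With these settings, "pick up the cup" would resolve to the cup on the table until the kitchen cup is observed or mentioned again.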
6.3 Hybrid 3D-Text Environments
Hybrid simulators that can simultaneously render worlds both graphically (2D or 3D) as well as textually offer a mechanism to quickly learn high-level tasks without having to first solve grounding or perceptual learning challenges. The ALFWorld simulator shridhar2020alfworld combines the ALFRED 3D home environment ALFRED20 with a simultaneous TextWorld interface to that same environment, and is introduced alongside the BUTLER agent, which shows increased task generalization on the 3D environment when first trained on the text world. Prior to ALFWorld, Jansen jansen2020visually showed that a language model (GPT-2) was able to successfully generate detailed step-by-step textual descriptions of ALFRED task trajectories for up to 58% of unseen cases using task descriptions alone, without visual input. Building on this, Micheli micheli2021language confirmed GPT-2 also performs well on the text world rendering of ALFWorld, and is able to successfully complete goals in 95% of unseen cases. Taken together, these results show the promise of quickly learning complex tasks at a high level in a text-only environment, then transferring this performance to agents grounded in more complex environments.
7 Contemporary Limitations and Challenges
Environment complexity is limited, and it’s currently difficult to author complex worlds.
Two competing needs are currently at odds: the desire for complex environments to learn complex skills, and the desire for environment variation to encourage robustness in models. Current tooling emphasizes creating varied procedural environments, but those environments have limited complexity, and require agents to complete straightforward tasks. Economically creating complex, interactive environments that simulate a significant fraction of real world interactions is still well beyond current simulators or libraries, yet such environments are required for higher-fidelity interactive worlds that have multiple meaningful paths toward achieving task goals. Generating these environments semi-automatically (e.g. ammanabrolu2020bringing) may offer a partial solution. Independent of tooling, libraries and other middleware offer near-term solutions to more complex environment modeling, much in the same way 3D game engines are regularly coupled with physics engine middleware to dramatically reduce the time required to implement forces, collisions, lighting, and other physics-based modeling. Currently, few analogs exist for text worlds. The addition of a chemistry engine that knows that ice warmed above the freezing point will change to liquid water, or a generator engine that knows the sun is a source of sunlight during sunny days, or an observation engine that knows tools (like microscopes or thermometers) can change the observation model of a POMDP, may offer tractability in the form of modularization. Efforts using large-scale crowdsourcing to construct knowledge bases of common-sense knowledge (e.g., ATOMIC, sap2019atomic) may be required to support these efforts.
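To make the middleware idea concrete, the sketch below shows what a very small "chemistry engine" module for a text world might look like. It is purely illustrative: no such library is described in the source, all names (`PHASE_POINTS`, `Substance`, `describe`) are hypothetical, and the only physical facts assumed are that water melts at 0 °C and boils at 100 °C at standard pressure.

```python
from dataclasses import dataclass

# Melting and boiling points in degrees Celsius at standard pressure.
PHASE_POINTS = {"water": (0.0, 100.0)}

# How each phase of a substance is named in room descriptions.
PHASE_NAMES = {
    ("water", "solid"): "ice",
    ("water", "liquid"): "water",
    ("water", "gas"): "steam",
}


@dataclass
class Substance:
    name: str
    temperature_c: float

    @property
    def phase(self):
        """Derive the phase from temperature (boundaries treated as the warmer phase)."""
        melting, boiling = PHASE_POINTS[self.name]
        if self.temperature_c < melting:
            return "solid"
        if self.temperature_c < boiling:
            return "liquid"
        return "gas"


def describe(substance):
    """Render the substance as it would appear in a textual observation."""
    return PHASE_NAMES[(substance.name, substance.phase)]


sample = Substance("water", temperature_c=-5.0)
before = describe(sample)    # "ice"
sample.temperature_c = 5.0   # e.g. the agent carries it into a warm room
after = describe(sample)     # "water"
```

The design point is that an environment author would only set temperatures; the phase change, and the resulting change in the textual observation, would come from the shared module.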
Current planning languages offer a partial solution for environment modelling.
While simulators partially implement facilities for world modeling, some (e.g. cote2018textworld; shridhar2020alfworld) suggest using mature planning languages like STRIPS fikes1971strips or PDDL mcdermott1998pddl for more full-featured modeling. This would not be without significant development effort – existing implementations of planning languages typically assume full-world observability (in conflict with POMDP modelling), and primarily agent-directed state-space changes, making complex world modeling with partial observability, and complex environment processes (such as plants that require water and light to survive, or a sun that rises and sets causing different items to be observable in day versus night) outside the space of being easily implemented with off-the-shelf solutions. In the near-term, it is likely that a domain-specific language for complex text world modeling would be required to address these needs while simultaneously reducing the time investment and barrier-to-entry for end users.
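The flavour of STRIPS-style modeling, and the full-observability assumption noted above, can be seen in a minimal sketch. This is not the syntax of STRIPS or PDDL themselves, but a Python rendering of the usual scheme of preconditions, add effects, and delete effects over a set of facts; the fact strings and the `take_egg` action are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add_effects: frozenset     # facts made true by the action
    delete_effects: frozenset  # facts made false by the action


def applicable(state, action):
    """An action applies when all of its preconditions are in the state."""
    return action.preconditions <= state


def apply(state, action):
    """Return the successor state: remove deleted facts, then add new ones."""
    if not applicable(state, action):
        raise ValueError(f"preconditions of {action.name!r} not met")
    return (state - action.delete_effects) | action.add_effects


take_egg = Action(
    name="take egg",
    preconditions=frozenset({"at(player, tree)", "in(egg, nest)"}),
    add_effects=frozenset({"has(player, egg)"}),
    delete_effects=frozenset({"in(egg, nest)"}),
)

# The state is a complete set of true facts: nothing is hidden from the planner.
state = frozenset({"at(player, tree)", "in(egg, nest)"})
next_state = apply(state, take_egg)
# next_state == {"at(player, tree)", "has(player, egg)"}
```

Note that the state changes only when the agent acts; an environment process such as a sun that sets on its own has no natural place in this scheme, which is the limitation the paragraph above describes.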
Analyses of environment complexity can inform agent design and evaluation.
Text world articles frequently emphasize agent modeling contributions over environment, methodological, or analysis contributions – but these contributions are critical, especially in the early stages of this subfield. Agent performance in easy environments has increased incrementally, while medium-to-hard environments have seen comparatively modest improvements. Agent performance is typically reported as a distribution over a large number of environments, and the methodological groundwork required to understand when different models exceed others in time or performance over these environment distributions is critical to making forward progress. Transfer learning in the form of training on one set of environments and testing on others has become a standard feature of benchmarks (e.g. hausknecht2020interactive), but focused contributions that work to precisely characterize the limits of what can be learned from (for example) OmniQuest and transferred to Zork, and what capacities must be learned elsewhere, will help inform research programs in agent modeling and environment design.
Transfer learning between text world and 3D environments.
Tasks learned at a high level in text worlds help speed learning when those same models are transferred to more complex 3D environments shridhar2020alfworld. This framing of transfer learning may in some ways resemble how humans can converse about plans for future actions in locations remote from those eventual actions (like classrooms). As such, text-plus-3D environment rendering shows promise as a manner of controlling for different sources of complexity in input to multi-modal task learning (from high-level task-specific knowledge to low-level perceptual knowledge), and appears a promising research methodology for future work.