
Eye of the Beholder: Improved Relation Generalization for Text-based Reinforcement Learning Agents

by Keerthiram Murugesan, et al.

Text-based games (TBGs) have become a popular proving ground for the demonstration of learning-based agents that make decisions in quasi real-world settings. The crux of the problem for a reinforcement learning agent in such TBGs is identifying the objects in the world, and those objects' relations with that world. While the recent use of text-based resources for increasing an agent's knowledge and improving its generalization has shown promise, we posit in this paper that there is much yet to be learned from visual representations of these same worlds. Specifically, we propose to retrieve images that represent specific instances of text observations from the world and train our agents on such images. This improves the agent's overall understanding of the game 'scene' and objects' relationships to the world around them, and the variety of visual representations on offer allows the agent to form a better generalization of a relationship. We show that incorporating such images improves the performance of agents in various TBG settings.





1 Introduction

Reinforcement Learning (RL) has seen a resurgence in recent years thanks to advances in representation, inference, and learning techniques – led by a massive scale-up and investment in deep neural network-based methods. Successful applications of RL have included domains such as Chess silver2018general, Go silver2017mastering, and Atari games mnih2016asynchronous. However, with the emergence of natural language processing (NLP) as a key AI application area, research attention has turned towards text-based applications and domains. These domains pose their own challenges for RL algorithms, including large and intractable action spaces (the space of all possible words and combinations), partial observability of the world state, and under-specified goals and rewards.

Text-based games (TBGs) have emerged as prime exemplars of the above challenges. Inspired by games such as Dungeons & Dragons and Zork, researchers have worked on putting together challenging environments that offer the complexities of real-world interactions but in sandbox settings suitable for the training of RL agents. The foremost such example is TextWorld cote2018textworld, an open-source text-based game engine that allows for the generation of text-based game instances and the evaluation of agents on those games. Much of the recent work on text-based RL ammanabrolu2019playing ; dambekodi2020playing ; murugesan2021text has focused on the TextWorld environment, and on imbuing agents with additional information to make them learn, scale, and act more efficiently.

However, much of the information that has been used in the prior art to improve the performance of AI agents in TBGs is still restricted to the medium of text. In contrast, when humans encounter games such as Zork and TextWorld, they do not restrict themselves to only textual information. Indeed, they are able to generalize to environments and the actions within them by considering not just the form of information provided by the environment, but also by imagining or visualizing various forms of that information. This imagination is key to generalizing beyond the information present in the instance currently under consideration. In this work, we posit that using images – either retrieved or imagined (generated) – that represent information from the game instance can help improve the performance of RL agents in TBGs.

Specifically, we consider RL agents in the TextWorld and Jericho TBG environments, and the additional information that can be provided to such agents to improve their performance. Past work has focused on trying to use external knowledge to either limit chaudhury2020bootstrapped or enhance murugesan2021text the space of actions; however, this has also been restricted to the text modality. At their crux, these efforts are all trying to solve the problem of relationships within the environment: how are different things in the world related to each other? And how can the agent manipulate these relations to convert the initial state of the world – via a sequence of observations – into the desired goal state (or to maximize reward)? Purely text-based information is extremely sparse and is unable to sufficiently abstract the notion of relationships.

Consider for example the relationship at: a patio chair is at the backyard. What does this relation mean, that is, what is the at-ness? Text cannot convey this information effectively on its own: as the size of the underlying vocabulary increases, the natural language space gets sparser and it becomes harder to extract signals to understand relationships between objects (in this case, ‘patio chair’ and ‘backyard’). Images, on the other hand, go a bit further in conveying the meanings of relationships as understood by humans chen2015mind. Images also help generalize better: in text, a patio chair is always represented as patio chair; yet in a visual medium, there can exist different kinds of patio chairs, with different properties such as shape, size, color, texture, surroundings, etc.

In this paper, we introduce the Scene Images for Text-based Games (SceneIT) model (pronounced “seen-it”) that integrates an external repository of images as additional knowledge for an RL agent in text-based game environments; and measure the performance of this model against the state-of-the-art text-only method. Our images come from two sources: pre-retrieved from prior existing images; and generated anew based on textual descriptions. We show that an agent with access to this additional visual information does significantly better, and examine some specific instances that show the reason for this improved performance.

2 Methodology

Text-based reinforcement learning agents for TBGs interact with the environment only using the modality of text narasimhan2015language ; he2016deep . TBGs convey the state of the game at every step as observations in natural language text, and the text-based RL agent learns to map the current state to one of the admissible actions (also in the text modality) available to it. Most current text-based RL agents (e.g. murugesan2021text ) focus on integrating additional textual knowledge to learn and act in a complex environment. Such agents thus lack the ability for human-like imagination involved in solving TBGs efficiently.

In this section, we outline the methodology that we use to integrate the visual (image) representation of a game scene using our SceneIT approach for TBGs. In order to obtain the visual representation of the scene that the agent is currently situated in, as the first step, we extract noun phrases that represent objects and relational phrases between the objects in the scene from the text observation – for example, kitchen of the white house, bottle on the table, desk chair at bedroom, etc. These phrases portray the scene in terms of which object is located at what location, which we intend to use to create a “visual mind-map” of the scene for the agent.
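The phrase-extraction step can be illustrated with a small sketch. The paper does not specify which parser or extractor it uses, so the word lists (`RELATIONS`, `STOP_WORDS`), the window size `MAX_WORDS`, and the function name `extract_relation_phrases` below are all assumptions made for illustration. It is a crude token-window heuristic, not the authors' implementation.

```python
import re

# Assumed vocabulary for this sketch; the paper does not list its relation words.
RELATIONS = {"on", "at", "in", "of"}
DETERMINERS = {"a", "an", "the"}
STOP_WORDS = DETERMINERS | RELATIONS | {"you", "see", "is", "are", "there", "and"}
MAX_WORDS = 3  # longest noun phrase kept on either side of a relation word


def _collect(tokens, start, step):
    """Gather up to MAX_WORDS non-stop tokens walking left (-1) or right (+1)."""
    words = []
    i = start
    # On the right-hand side, skip determiners that directly follow the relation.
    while step > 0 and 0 <= i < len(tokens) and tokens[i] in DETERMINERS:
        i += step
    while 0 <= i < len(tokens) and tokens[i] not in STOP_WORDS and len(words) < MAX_WORDS:
        words.append(tokens[i])
        i += step
    return words if step > 0 else words[::-1]


def extract_relation_phrases(observation):
    """Return phrases of the form '<object> <relation> <location>'."""
    phrases = []
    for sentence in re.split(r"[.!?]", observation.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token not in RELATIONS:
                continue
            left = _collect(tokens, i - 1, -1)
            right = _collect(tokens, i + 1, +1)
            if left and right:
                phrases.append(f"{' '.join(left)} {token} {' '.join(right)}")
    return phrases
```

Each returned phrase can then serve as a query string for image retrieval or generation. A real system would more likely rely on a syntactic parser than on fixed word lists.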

Since the key component and novelty of our system is the usage of images for the TBGs under consideration, we first outline the collection process for such images. Our technique relies on two main sources of images: retrieval from the internet, and generation from pre-existing models for imagining and generating visual scenes. We describe each of these methods in detail below.

Figure 1: Examples of images obtained from (a) the web-based image retriever, and (b) imagination via AttnGAN xu2018attngan . The phrase used to retrieve or generate the picture is indicated above the respective picture.

2.1 Collecting Images

Retrieving Images from the Internet: In order to obtain images from the internet, we design an image retriever that obtains the best matching image from the list of query strings (noun phrases) that are used to represent the scene. This process also ties into one of the central motivations of our work, which is that images offer more signals to agents as they try to abstract, represent, and use the relationships between different objects in a scene.

To provide good generalization behavior, we design an image retriever that automatically searches the internet for a given query string without any human supervision (footnote 1: based on Google Image Retriever). In addition, we use image caching to improve the speed of retrieval, such that the images corresponding to encountered queries are saved to disk and need not be downloaded from the web while training the agent. It is to be noted that the caching process is completely generic and does not involve saving specific situation-relevant images. Figure 1(a) provides some examples of images that are retrieved from the internet for specific phrases.
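The caching behavior described above can be sketched as a thin wrapper around any download function. The class name `CachedImageRetriever`, the `fetch` callable, and the file-naming scheme are hypothetical; the paper does not describe its cache layout, and no real search API is called here.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CachedImageRetriever:
    """Disk cache keyed by query string (a sketch, not the authors' code)."""

    def __init__(self, cache_dir, fetch: Callable[[str], bytes]):
        # `fetch` is an assumed callable that downloads image bytes for a query.
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch

    def _path_for(self, query: str) -> Path:
        # Hash the normalized query so any phrase maps to a safe file name.
        key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.img"

    def get(self, query: str) -> bytes:
        path = self._path_for(query)
        if path.exists():
            # Cache hit: no network access is needed during training.
            return path.read_bytes()
        data = self.fetch(query)
        path.write_bytes(data)
        return data
```

Because the key depends only on the query string, the cache stays generic: it stores whatever was retrieved for a phrase, without any game-specific curation.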

Imagining Images from Generative Models:

The previous method of “visual mind-map” extraction uses pre-existing images from the internet for scene representation. Such a scene representation is useful for a human to visually parse the scene. However, we also explore the potential for representing visual scenes using images that are imagined by generative models. Our hypothesis is that such images can also provide useful visual features to improve generalization in tasks from TextWorld (and other text-based games).

We use the Attentional Generative Adversarial Network (AttnGAN) xu2018attngan for attention-driven text-to-image generation. This generative model uses a multi-stage refinement for fine-grained generation of images from a given text snippet. AttnGAN gives attention to the relevant tokens in the natural language query in order to generate details at different sub-regions of the image.

For our approach, we pre-train the AttnGAN model on the MS-COCO dataset lin2014microsoft. The queries used for image generation are the same as the ones used for the previous internet retrieval-based scene representation. We hypothesize that although such images may not always be interpretable by humans – see Figure 1(b) for a few examples – they can provide some latent image features for neural models that might contribute to better generalization in TextWorld games.

2.2 Model Description

We now detail the models that we use to encode the images retrieved or generated in the previous step. Figure 2 shows the architecture overview of our proposed approach for scene representation using the AttnGAN xu2018attngan based text-to-image generation. In order to capture the textual features from the text observation from the game, we use a stacked GRU as our text encoder: this keeps track of the state of the game across time steps. Once we have the images retrieved/generated from the text snippets, we extract the image features using image encoders, and these features are combined with the features from the textual inputs to obtain the action scores.

Specifically, we use ResNet-50 for encoding the retrieved images and for the images generated from the pre-trained AttnGAN. The text and image encoding features are then concatenated and passed to the action selector (as shown in Figure 2), which maps the encoding features to action scores using a multi-layer perceptron (MLP) to select the next action. Based on the reward from the game environment, we update the text and image encoders and the action selector. Since the reward from the game can guide the text-to-image generator (AttnGAN) to generate meaningful images for the current context of the game, we finetune the pre-trained AttnGAN along with the encoders and the action selector to yield the best results. In this case, we use the inbuilt CNN-based image encoder (Inception v3 szegedy2016rethinking) to map the generated images to the image features. We call this model SceneIT and use it by default for all our experiments in this paper.
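The fusion step (concatenate text and image features, then score actions with an MLP) can be sketched in pure Python. This shows the forward pass only, with randomly initialized weights. The class name `ActionSelector`, the single hidden layer, and the fixed-size action output are simplifying assumptions; the paper's encoders (stacked GRU, ResNet-50 or Inception v3) and its reward-driven training are not reproduced here.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small uniform initialization (an assumption of this sketch)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class ActionSelector:
    """Maps concatenated text and image features to action scores via an MLP."""

    def __init__(self, text_dim, image_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(text_dim + image_dim, hidden_dim, rng)
        self.w2, self.b2 = init_layer(hidden_dim, num_actions, rng)

    def scores(self, text_features, image_features):
        # Concatenation is the fusion operation named in the text.
        joint = list(text_features) + list(image_features)
        hidden = relu(linear(joint, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

    def select(self, text_features, image_features):
        """Greedy choice: index of the highest-scoring action."""
        s = self.scores(text_features, image_features)
        return max(range(len(s)), key=s.__getitem__)
```

In a trained agent, the weights would be updated from the game reward, and the set of admissible actions would typically vary per step rather than being a fixed-size output.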

Figure 2: Overview of our methodology of scene representation for a sample text observation taken from Zork1 using text-to-image generative model. Highlighted text snippets show some of the phrases used by the agent to generate relevant images for scene representation.

3 Experimental Results

In this section, we present experimental results that demonstrate the advantage of our proposed Scene Images for Text-based Games approach – which makes use of images in addition to text – over existing state-of-the-art techniques that are text-only. We conduct our performance evaluation on three datasets: TextWorld Commonsense (TWC), the First TextWorld Problems (FTWP), and Jericho. The TWC and FTWP datasets build on the Microsoft TextWorld Environment cote2018textworld, and offer complementary tests: while TWC tasks require the retrieval and use of commonsense knowledge for more efficient solution, the FTWP problems test the agent’s exploration capabilities. Jericho is a suite of interactive fiction games that measures human performance on text-based games by offering stories from different domains – in our case, it helps evaluate the breadth and coverage of the image generation.

Distribution: In these datasets, a set of text-adventure games are provided for training reinforcement learning (RL) agents. In addition to these training games, the datasets contain two test sets of games: 1) Test games (IN) that are generated from the same distribution as the training games – these games contain similar sets of entities and relations as the train games; and 2) Test games (OUT), which contain games generated from a set of entities that have no overlap with the training games. This is a way of testing whether the RL agent can generalize its behavior to new and unseen games by leveraging the state observation from the TextWorld environment – and additionally in our case, the visual relationships between entities.
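The OUT condition can be illustrated with a simple set check. The function name `split_label` and the rule "any shared entity means IN" are simplifications assumed for this sketch; the datasets ship with predefined splits, and the text only guarantees that OUT games share no entities with the training games.

```python
def split_label(train_entities, game_entities):
    """Label a game "OUT" if it shares no entity with training, else "IN".

    Simplified rule for illustration; the real splits are fixed by the datasets.
    """
    return "OUT" if set(train_entities).isdisjoint(game_entities) else "IN"
```

For example, a game containing only a fridge and milk would be labeled OUT relative to a training set that contains only a patio chair and a clothesline.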

Agents: We compare three RL agents in our experiments: 1) Random, where the actions are selected randomly at each step; 2) Text-Only, where the actions are selected solely based on the textual observation available at the current step, for which we use three baseline methods: DRRN he2016deep, Template DQN hausknecht19, and KG-A2C ammanabrolu2020graph; and 3) our method, SceneIT, explained in the previous section, where the RL agent is allowed to imagine visual scenes and images using Attention GAN Tao18attngan, a text-to-image generator based on Generative Adversarial Networks (GAN) NIPS2014_5ca3e9b1.

Metrics: In our experiments, we measure the performance of various agents using two metrics: (1) Average Normalized Score, calculated as the total score achieved by an agent normalized by the maximum possible score for the game; and (2) Average Steps Taken, calculated as the total number of steps taken by the agent to complete the goals. A higher score is better, while a lower number of steps taken is better.
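The two metrics can be written down as a short sketch. The function names are hypothetical, as is the choice to normalize each episode by its own maximum score before averaging; the paper does not state the exact aggregation order.

```python
def average_normalized_score(episodes):
    """Mean of score / max_score over episodes.

    `episodes` is a list of (score, max_score) pairs; per-episode
    normalization before averaging is an assumption of this sketch.
    """
    return sum(score / max_score for score, max_score in episodes) / len(episodes)


def average_steps(steps_per_episode):
    """Mean number of steps the agent took per episode."""
    return sum(steps_per_episode) / len(steps_per_episode)
```

An agent that scores 2 of 4 points in one game and 3 of 3 in another has an average normalized score of 0.75 under this definition.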

Figure 3: Training performance (showing mean and standard deviation averaged over runs) for the three difficulty levels: Easy (left), Medium (middle), Hard (right). Higher normalized score is better, while lower number of steps is better. Our Method refers to our SceneIT technique.

3.1 Quantitative Results

We first present the results of a quantitative evaluation of our proposed technique. In order to provide a well-rounded evaluation, we consider different text-based games: the TWC and FTWP problems, both based on the TextWorld cote2018textworld domain; and the Jericho hausknecht19 domain, based on interactive fiction (IF) games. Detailed experimental settings are provided in the supplementary material.

3.1.1 Experiments on TextWorld Commonsense

The first domain that we conduct our evaluation on is the TextWorld Commonsense murugesan2021text domain. This domain is an extension of the TextWorld domain that adds scenarios where commonsense knowledge is required in order to arrive at efficient solutions.

Difficulty Levels: The TWC domain comes with difficulty levels for the problem instances associated with it, defined in terms of how hard it is for an agent (human or AI) to solve that specific instance. The difficulty of a level is set as a combination of the number of goals to be achieved, the number of actions (steps) required to achieve them, and the number of objects and rooms in the instance (which may be related to goal achievement, or may simply be distractors). In our evaluation for this work, we consider three distinct difficulty settings. In increasing order of hardness, these are: easy, medium, and hard. We follow Murugesan et al. murugesan2021text – who introduce the TWC domain, and are the current state-of-the-art on this domain – in choosing these difficulty levels.

Training Performance: Figure 3 shows the training performance of three different agents/models on the TWC problems for the three difficulty levels discussed above. For each level, the performance is reported via the normalized score (higher is better) as well as the average number of steps (lower is better). It is clear that SceneIT – with access to both the textual representation of the observations from the game, as well as the image/visual representation – does much better in all three settings. Furthermore, beyond the episode mark, there is a clear divergence of our technique from the random and text-only baselines.

Norm. Score      Test Games (IN)           Test Games (OUT)
Model / Level    Easy    Medium  Hard      Easy    Medium  Hard
Random           0.52    0.49    0.49      0.51    0.54    0.31
Text             0.82    0.74    0.62      0.75    0.69    0.41
SceneIT          0.96    0.70    0.77      0.88    0.78    0.59

Num. Steps       Test Games (IN)           Test Games (OUT)
Model / Level    Easy    Medium  Hard      Easy    Medium  Hard
Random           38.52   49.66   46.21     38.92   48.94   48.95
Text             22.73   46.36   39.54     30.18   46.29   46.90
SceneIT          13.38   46.15   34.65     19.58   38.18   44.08

Table 1: Test performance (averaged over runs) on the normalized score (higher is better) and number of steps (lower is better) metrics for the three difficulty levels.

Test Performance: Table 1 shows the test results for three models: one random baseline, one text-only baseline, and SceneIT, which combines the text features with image features from the finetuned AttnGAN. We split our reporting across two conditions: Test games (IN) reports on test games that come from the same distribution as the training games, while Test games (OUT) reports on test games from outside the distribution of training games. For both conditions, SceneIT achieves the best result in all but one instance, handily beating the existing text-only state-of-the-art (Text). In the one case where it is not the best (normalized score on medium, in distribution), it is very close to the performance of the best performing model. This shows the added advantage of using visual features in addition to textual features when solving TWC games, thus validating the central hypothesis of our work.
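The comparison in Table 1 can be checked mechanically. The sketch below copies the values from Table 1 and reports which model is best in each column. The variable and function names are chosen here for illustration.

```python
COLUMNS = ["IN/Easy", "IN/Medium", "IN/Hard", "OUT/Easy", "OUT/Medium", "OUT/Hard"]

# Values transcribed from Table 1.
NORM_SCORE = {
    "Random": [0.52, 0.49, 0.49, 0.51, 0.54, 0.31],
    "Text": [0.82, 0.74, 0.62, 0.75, 0.69, 0.41],
    "SceneIT": [0.96, 0.70, 0.77, 0.88, 0.78, 0.59],
}
NUM_STEPS = {
    "Random": [38.52, 49.66, 46.21, 38.92, 48.94, 48.95],
    "Text": [22.73, 46.36, 39.54, 30.18, 46.29, 46.90],
    "SceneIT": [13.38, 46.15, 34.65, 19.58, 38.18, 44.08],
}


def best_models(table, higher_is_better):
    """Return the best model name for each column of the table."""
    pick = max if higher_is_better else min
    return [pick(table, key=lambda model: table[model][i]) for i in range(len(COLUMNS))]


score_winners = best_models(NORM_SCORE, higher_is_better=True)
step_winners = best_models(NUM_STEPS, higher_is_better=False)
```

On these numbers, SceneIT is best in every column for number of steps, and in every column except IN/Medium for normalized score, where Text leads by 0.74 to 0.70.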

3.1.2 Experiments on First TextWorld Problems

Figure 4: Test-set performance on FTWP Cooking Task (averaged over runs) on the normalized score (higher is better) and the number of steps (lower is better) metrics.

In this section, we present the results of running the various agents/models on the First TextWorld Problems (FTWP) dataset. Figure 4 shows the results across the in and out distributions, as introduced previously. Since the cooking task in FTWP focuses more on exploration than on meaningful relationships between the objects (as in TWC), SceneIT shows results that are comparable to, and even worse than, the text-only model: this shows that merely adding images to a game does not necessarily improve the metrics.

3.1.3 Experiments on Jericho

             Human                    Baselines                 Ours
Game         Max    Walkthrough-100   TDQN    DRRN    KG-A2C    SceneIT
detective    360    350               169     197.8   207.9     317.7
enchanter    400    125               8.6     20      12.1      21.6
inhumane     90     70                0.7     0       3         15.83
karn         170    40                0.7     2.1     0         0.0
snacktime    50     50                9.7     0       0         20
spellbrkr    600    160               18.7    37.8    21.3      40
zork1        350    102               9.9     32.6    34        43.58
zork3        7      3                 0       0.5     0.1       2.67

Table 2: Maximum scores on a subset of Jericho games (selected randomly based on the difficulty level) achieved by the agents (proposed and baseline) averaged over runs. Difficulty levels: easy marked in green color, difficult in tan, and extreme in red.

Next, we consider Jericho hausknecht19, a benchmark dataset in TBGs that consists of popular interactive fiction (IF) games developed for humans a decade ago. We randomly select a subset of games from different difficulty levels for our experiments. From Table 2, we can see that SceneIT outperforms the other state-of-the-art text-only baselines (Template DQN hausknecht19, DRRN narasimhan2015language ; he2016deep, and KG-A2C ammanabrolu2019graph) on seven of the eight games (all except karn), often by a significant margin. Our approach is currently able to achieve the best score (averaged over runs) on games from across the difficulty levels.

3.1.4 Images: Retrieval vs. Generation

Figure 5: Results showing an improvement across both normalized score (higher is better) and number of steps (lower is better) by using images on the TWC dataset with different difficulty levels.

Having established that adding visual features from images that represent the scene described by the game's textual observations does indeed help the performance of agents, we now compare the different image sources. Specifically, we compare the three models described in Section 2.2: SceneIT with retrieved images from the internet, SceneIT with generated/imagined images from the pretrained AttnGAN, and SceneIT with finetuned AttnGAN. This comparison is presented as a bar chart in Figure 5. As in the previous experiments, we plot the three difficulty levels across two conditions: in and out of distribution. We use a lighter shade of the corresponding color for the former, and a darker shade for the latter. It is clear that SceneIT with the finetuned AttnGAN, which combines text features with features from AttnGAN, outperforms the other two image baselines across different difficulty levels and conditions.

3.2 Qualitative Results

Figure 6: Activation maps showing the region of interest when producing the action command in each case, using the internet-based retrieval model for TWC.
Figure 7: Activation maps showing the region of interest when producing the action command in each case, using the imagination-based model for TWC. We include both the generated images and their attention plots for clarity.

In addition to the quantitative results described previously, we also present some qualitative examples of what the SceneIT agent focuses on as it uses images (retrieved or imagined) in order to solve specific problem instances. To illustrate this effectively, we use the notion of attention activation maps zhou2016learning ; selvaraju2017grad ; lu2012learning ; gupta2021recognition , which can be used to demonstrate parts of an image that an agent/technique is attending to. We split our analysis into the two main ways in which we currently produce images for use by SceneIT: retrieval, and imagination (see Section 2).
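One common way to compute such a map is class activation mapping in the spirit of zhou2016learning: a weighted sum of the final convolutional feature maps, rescaled to [0, 1]. The sketch below assumes that formulation and uses plain nested lists. The paper does not state which variant it uses, so this is an illustration rather than the authors' procedure.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps, min-max normalized to [0, 1].

    feature_maps: list of K maps, each a list of rows (height x width).
    class_weights: list of K weights for the chosen output (e.g. an action).
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal.
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / span for v in row] for row in cam]
```

High values mark image regions that contribute most to the chosen output. In practice the map is upsampled to the image size and overlaid as a heat map.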

Figure 6 shows examples of this for images retrieved from the Internet. We present three examples of the various images that are produced for a given text phrase from the observation as input (e.g. clothesline at backyard), as well as the final action that is taken by the agent (e.g. examine clothesline). The other two examples follow a similar pattern; and together, these three examples illustrate that SceneIT can focus the agent’s attention on relevant parts of the retrieved image to facilitate the final decision making.

A similar pattern is seen in the case of imagined images: Figure 7 presents both the imagined images as well as the activation maps overlaid over those respective images for a given set of text phrases from the game observation. For example, the agent can focus on the right part of the image that is imagined for the phrase wet brown dress on patio chair; and can then choose the action examine patio chair. The other examples also illustrate a similar pattern.

This analysis also presents an interesting contrast between images that are retrieved from the Internet, versus ones that are generated from scratch by an AI model. For example, consider the two text phrases patio chair at backyard and clothesline at backyard. Both these phrases and the images retrieved and generated respectively for them appear in both Figure 6 and Figure 7; however, the visual representation of the pairs of pictures is strikingly different. The SceneIT agent is also led to choose different actions in the two cases: in the case of retrieval it chooses to examine the clothesline, while in the case of imagination it instead chooses to examine the patio chair. These examples illustrate qualitatively how the two image-sourcing techniques (retrieval and generation) can work in different and often complementary ways.

4 Related Work

The field of text-based and interactive games has seen a lot of recent interest and work, thanks in large part to the creation and availability of pioneering environments such as TextWorld cote2018textworld and the Jericho hausknecht19 collection. Based on these domains, several interesting approaches have been proposed that seek to improve the efficiency of agents in these environments ammanabrolu2019playing ; dambekodi2020playing ; chaudhury2020bootstrapped ; murugesan2021text . We mention and discuss this prior work in context in the earlier parts of this paper.

Separate from this progress on TBGs, there has also been work on Inductive Logic Programming (ILP) methods – these methods have shown good relation generalization in symbolic domains using differentiable model learning on symbolic inputs evans2018learning ; richardson2006markov, even in noisy settings. Neural Logic Machines dong2019neural have shown good generalization to out-of-sample games using dedicated MLP units for first-order rule learning by interacting with the environment. The work on Logical Neural Networks riegel2020logical is a recent addition to the family of ILP methods that can learn differentiable logical connectives using constrained optimization over the differentiable neural network framework. Concurrently, there has been work in the (symbolic) automated planning community that has looked at learning and inferring the relations (predicates) that make up an underlying domain – like the eight-tile puzzle – by using variational auto-encoders asai2019unsupervised ; asai2018classical ; asai2020learning ; asai2020discrete.

5 Conclusion

In this paper, we introduced Scene Images for Text-based Games (SceneIT), a model for RL agents executing in text-based games. SceneIT uses the text from observations provided by the game to either retrieve or generate images that correspond to the scene represented by the text; and then combines the features from the images along with features from the text in order to select the next best action for the RL agent. We show via an extensive experimental evaluation that SceneIT shows better performance – in terms of the normalized reward score achieved by agents, as well as the number of steps to complete a task – than existing state-of-the-art models that rely only on the observation text. We also presented qualitative results that showed that an agent guided by SceneIT focuses its attention on those parts of an image that we may expect a human to attend to as well.


  • [1] Leonard Adolphs and Thomas Hofmann. Ledeepchef deep reinforcement learning agent for families of text-based games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7342–7349, 2020.
  • [2] Prithviraj Ammanabrolu and Matthew Hausknecht. Graph constrained reinforcement learning for natural language action spaces. In International Conference on Learning Representations, 2019.
  • [3] Prithviraj Ammanabrolu and Matthew Hausknecht. Graph constrained reinforcement learning for natural language action spaces. arXiv preprint arXiv:2001.08837, 2020.
  • [4] Prithviraj Ammanabrolu and Mark Riedl. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565, 2019.
  • [5] Masataro Asai. Unsupervised grounding of plannable first-order logic representation from images. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 583–591, 2019.
  • [6] Masataro Asai and Alex Fukunaga. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [7] Masataro Asai and Christian Muise. Learning neural-symbolic descriptive planning models via cube-space priors: The voyage home (to strips). In International Joint Conference on AI (IJCAI), 2020.
  • [8] Masataro Asai and Zilu Tang. Discrete word embedding for logical natural language understanding. arXiv preprint arXiv:2008.11649, 2020.
  • [9] Subhajit Chaudhury, Daiki Kimura, Kartik Talamadupula, Michiaki Tatsubori, Asim Munawar, and Ryuki Tachibana. Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games. In The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
  • [10] Xinlei Chen and C Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2422–2431, 2015.
  • [11] Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. In Workshop on Computer Games, pages 41–75. Springer, 2018.
  • [12] Sahith Dambekodi, Spencer Frazier, Prithviraj Ammanabrolu, and Mark O. Riedl. Playing text-based games with common sense, 2020.
  • [13] Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. arXiv preprint arXiv:1904.11694, 2019.
  • [14] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
  • [15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [16] Shikha Gupta, AD Dileep, and Veena Thenkanidiyoor. Recognition of varying size scene images using semantic analysis of deep activation maps. Machine Vision and Applications, 32(2):1–19, 2021.
  • [17] Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. CoRR, abs/1909.05398, 2019.
  • [18] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630, 2016.
  • [19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [20] Yao Lu, Wei Zhang, Cheng Jin, and Xiangyang Xue. Learning attention map from images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1067–1074. IEEE, 2012.
  • [21] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
  • [22] Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, and Murray Campbell. Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines. In The 35th AAAI Conference on Artificial Intelligence, 2021.
  • [23] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11, 2015.
  • [24] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.
  • [25] Ryan Riegel, Alexander Gray, Francois Luus, Naweed Khan, Ndivhuwo Makondo, Ismail Yunus Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, et al. Logical neural networks. arXiv preprint arXiv:2006.13155, 2020.
  • [26] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [27] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • [28] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • [29] Matthijs TJ Spaan. Partially observable markov decision processes. In Reinforcement Learning, pages 387–414. Springer, 2012.
  • [30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [31] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  • [32] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. CVPR, 2018.
  • [33] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.

Appendix A Experimental Details

In this section, we report the experimental setup and settings used in our paper.

TBGs as a POMDP: TBGs can be framed as partially observable Markov decision processes (POMDPs) [29], denoted $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r \rangle$, where $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the action space, $\mathcal{O}$ denotes the observation space, $\mathcal{T}$ denotes the state transition probabilities, $\mathcal{E}$ denotes the conditional observation emission probabilities, and $r$ is the reward function. The observation $o_t$ at time step $t$ depends on the current state $s_t$. Both observations and actions are rendered in text. The agent receives a reward at every time step $t$: $r_t = r(s_t, a_t)$, and the agent’s goal is to maximize the expected discounted sum of rewards: $\mathbb{E}\left[\sum_t \gamma^t r_t\right]$, where $\gamma$ is a discount factor. In our experiments, we set $\gamma$ to [value missing]. All policies are learned using Actor-Critic [21]. We use GloVe word embeddings to represent our observation text and project it to a [value missing]-dimensional vector. We use [value missing] dimensions as our hidden size in the (bidirectional) text encoder.
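The objective described above, the discounted sum of rewards, can be illustrated for a single episode with a minimal sketch. The reward sequence and the discount factor used here are made-up values for illustration only; they are not the settings used in the paper, and `discounted_return` is a hypothetical helper name.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative values only (not taken from the paper).
episode_rewards = [0.0, 1.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.625
```

The agent maximizes the expectation of this quantity over episodes; the sketch only evaluates it for one fixed trajectory.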

TextWorld Commonsense (TWC) consists of 30 games per difficulty level (listed below), generated and separated into 3 directories: train, test (OUT), and valid (IN). The goal of these games is house cleanup: objects are misplaced in a house (with one or more rooms), and the agent needs to return each misplaced object to a commonsensically appropriate location (e.g., an apple to the refrigerator and a dirty sock to the laundry basket). The objects in these games follow commonsensical relations, and previous work [22] has shown that leveraging external commonsense knowledge (such as ConceptNet) improves performance.

  • Easy level games have only 1 room and up to 3 objects that need to be placed in their appropriate location.

  • Medium level games have 1 room with up to 5 objects.

  • Hard level games have either 4 or 5 objects shuffled across 3 or 4 rooms.

In our experiments with TWC, we set the maximum number of episodes to [value missing], with a maximum of [value missing] steps per episode.
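The three TWC difficulty levels can be written down as a small configuration table, which is sometimes convenient for checking that a generated game falls into the intended level. This is only a sketch: the class and field names are hypothetical, and the lower bound of one object for the Easy and Medium levels is an assumption (the text only says "up to").

```python
from dataclasses import dataclass
from typing import Tuple

GAMES_PER_LEVEL = 30  # stated in the text


@dataclass(frozen=True)
class TWCLevel:
    rooms: Tuple[int, int]    # inclusive (min, max) number of rooms
    objects: Tuple[int, int]  # inclusive (min, max) misplaced objects

    def admits(self, n_rooms: int, n_objects: int) -> bool:
        """Check whether a game with these counts fits this level."""
        return (self.rooms[0] <= n_rooms <= self.rooms[1]
                and self.objects[0] <= n_objects <= self.objects[1])


TWC_LEVELS = {
    # Lower bound of 1 object is an assumption, not stated in the text.
    "easy": TWCLevel(rooms=(1, 1), objects=(1, 3)),
    "medium": TWCLevel(rooms=(1, 1), objects=(1, 5)),
    "hard": TWCLevel(rooms=(3, 4), objects=(4, 5)),
}

print(TWC_LEVELS["hard"].admits(n_rooms=3, n_objects=5))  # True
```

Note that, under these bounds, any Easy game also satisfies the Medium constraints; the levels differ only in how many objects they allow.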

First TextWorld Problems consist of different training games, validation games (IN), and test games (OUT). These belong to the cooking domain, where the cooking ingredients are placed throughout the house and the agent has to collect these objects/items (listed in the cookbook) to prepare a delicious meal. Unlike in TWC, the objects in these games are not related and the agent needs to utilize its exploration capabilities to collect these items.

In our experiments with FTWP, we set the maximum number of episodes to [value missing], with [value missing] steps per episode, as suggested in prior work [1].

In Jericho, the agent can take up to [value missing] steps per episode, with a limit of [value missing] total steps per run. The Jericho environment allows handicap configurations for agents interacting with the environment. Depending on the handicap level, the agent may request additional information from the environment for better exploration. For a fair comparison, we use the same handicap configuration as prior work (the baselines reported in Table 2 of the main paper). Specifically, at each step we generate a set of candidate actions from the action templates and check each one against the Jericho environment to determine whether it is valid. In addition to the observation text, we use the inventory and look information from the Jericho environment. Appendix C gives a sample gameplay readout for Zork1, a benchmark text-based interactive fiction game developed in the 1980s that is part of the Jericho game suite.
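The template-based action generation described for Jericho can be sketched as follows. This is not the Jericho API: the `OBJ` slot token, the function names, and the `is_valid` callback (a stand-in for querying the environment) are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Callable, Iterable, List

SLOT = "OBJ"  # assumed placeholder token inside an action template


def fill_templates(templates: Iterable[str], objects: List[str]) -> List[str]:
    """Fill every slot in every template with distinct objects."""
    candidates = []
    for template in templates:
        tokens = template.split()
        slots = [i for i, tok in enumerate(tokens) if tok == SLOT]
        # With zero slots, permutations yields one empty tuple,
        # so the template is kept unchanged.
        for combo in permutations(objects, len(slots)):
            filled = list(tokens)
            for idx, obj in zip(slots, combo):
                filled[idx] = obj
            candidates.append(" ".join(filled))
    return candidates


def valid_actions(templates: Iterable[str], objects: List[str],
                  is_valid: Callable[[str], bool]) -> List[str]:
    """Keep only the candidates that the environment accepts."""
    return [a for a in fill_templates(templates, objects) if is_valid(a)]


# Toy validity check standing in for the environment (hypothetical).
accepted = {"north", "open mailbox", "put leaflet in mailbox"}
actions = valid_actions(["north", "open OBJ", "put OBJ in OBJ"],
                        ["mailbox", "leaflet"], accepted.__contains__)
print(actions)  # ['north', 'open mailbox', 'put leaflet in mailbox']
```

The number of candidates grows quickly with the number of objects and slots, which is why the validity check against the environment matters in practice.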

Resources: The agents were trained in parallel on two machines with the following specifications:

| Resource | Setting |
|---|---|
| CPU | Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz |
| Memory | 128GB |
| GPUs | 2 x NVIDIA Tesla V100 16 GB |
| Disk1 | 100GB |
| Disk2 | 600GB |
| OS | Ubuntu 18.04-64 Minimal for VSI |

Table 3: Resources used by the agents.

Each agent was trained on a single GPU; each run took approximately 12 hours for the Text agent and 16 hours for the SceneIT agent. We use spaCy with additional hand-written rules to extract the noun phrases used for image retrieval/generation.
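The paper does not list its hand-written rules, so the following is only an illustrative sketch of what such post-processing might look like. It assumes the candidate chunks have already been produced by a parser (for example, spaCy exposes noun chunks on a parsed document) and applies three hypothetical rules: strip leading determiners, drop bare pronouns, and remove duplicates. The word lists and the function name are assumptions.

```python
import re
from typing import Iterable, List

# Hypothetical rule lists; the paper's actual rules are not given.
DETERMINERS = {"a", "an", "the", "this", "that", "some"}
PRONOUNS = {"you", "it", "they", "there", "here"}


def clean_noun_phrases(chunks: Iterable[str]) -> List[str]:
    """Normalize candidate noun chunks into image-query phrases."""
    seen = set()
    phrases = []
    for chunk in chunks:
        words = re.findall(r"[a-z][a-z'-]*", chunk.lower())
        # Rule 1: strip leading determiners ("a small mailbox").
        while words and words[0] in DETERMINERS:
            words.pop(0)
        # Rule 2: skip empty chunks and bare pronouns ("You").
        if not words or (len(words) == 1 and words[0] in PRONOUNS):
            continue
        phrase = " ".join(words)
        # Rule 3: keep each phrase once, in first-seen order.
        if phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
    return phrases


# Chunks as a parser might return them for the opening Zork1 observation.
chunks = ["You", "an open field", "a white house",
          "a boarded front door", "a small mailbox", "A small mailbox"]
print(clean_noun_phrases(chunks))
# ['open field', 'white house', 'boarded front door', 'small mailbox']
```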

Appendix B Additional Experimental Results

B.1 Quantitative Results: Training Curves for Different SceneIT Strategies

Figure 12 shows the training curves for different strategies for utilizing images in our proposed SceneIT approach. We compare the agent’s performance with both retrieved and generated images, complementing the test-set results (IN and OUT distributions) reported in Section 3.1.4 of the main paper. Text + Google (SceneIT retrieved) uses images retrieved from the internet, and Text + AttnGAN (SceneIT imagined/generated) uses images generated by AttnGAN. Text + ModelGAN (SceneIT) is the proposed SceneIT method, which fine-tunes the pretrained AttnGAN while training our agent; this agent is used for all our experiments, as mentioned in Section 3 of the main paper. The training curves show that both Text + Google and Text + AttnGAN improve in the early episodes of training, whereas Text + ModelGAN catches up to the other approaches more slowly. This is because the pretrained AttnGAN is fine-tuned alongside the agent. The test performance of SceneIT with fine-tuning shows that this approach outperforms the other strategies and the baselines.

B.2 Qualitative Results: Activation Maps for FTWP and Jericho

Figures 13 and 14 show the activation maps for First TextWorld Problems. Figures 15 and 16 show the activation maps for Jericho. The title of each sub-figure shows the next action taken based on the observation text and on the images generated/retrieved using the objects, and the relational phrases between those objects, in the scene described by the observation.

From each observation at time $t$, we extract various keywords that represent the relationships between the objects in the scene. As shown in Figures 13 and 14, for each keyword, we either retrieve an image from the web or generate one using AttnGAN. Our model uses the images for each keyword to generate the action string from a combination of textual and visual features. We use a ResNet-18 model for visual feature extraction, which we fine-tune during the training of the agent. During inference, we use Grad-CAM [26] on the visual model to extract the regions of interest in the image that are used for feature extraction; these features are ultimately used to select the final action (shown as the heading for each image) from the list of admissible actions.
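For readers unfamiliar with Grad-CAM [26], the core computation can be sketched on toy data without a deep learning framework. Given the activations of a convolutional layer and the gradients of the target score with respect to them, each channel is weighted by its spatially averaged gradient, the weighted channels are summed, and a ReLU keeps only positive evidence. The inputs below are made-up numbers, not outputs of the paper's ResNet-18, and the final rescaling to [0, 1] is a common visualization convention added here as an assumption.

```python
from typing import List

Map = List[List[float]]  # one H x W feature map


def grad_cam(activations: List[Map], gradients: List[Map]) -> Map:
    """Combine K activation maps into one Grad-CAM heatmap."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: gradient averaged over all spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU: keep only regions with a positive influence on the score.
    cam = [[max(0.0, value) for value in row] for row in cam]
    # Rescale to [0, 1] for display (assumed convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two toy 2x2 channels: the first supports the score, the second opposes it.
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(toy_activations, toy_gradients))  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy example, only the top-left position survives, because the second channel's negative weight is removed by the ReLU.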

Note that in the case of the Jericho environment, the AttnGAN-based generative model encounters out-of-sample query strings (i.e., phrases extracted from the observation text) that are unique to these fantasy games, such as “precious jewel” and “broken clockwork canary”, and that were not seen during training.

Appendix C Sample Gameplay by the Proposed SceneIT Agent for Zork1 from Jericho 

In this section, we show a sample of gameplay by our agent in Zork1, a benchmark text-based interactive fiction game developed in the 1980s (now a part of the Jericho game suite). We show the generated images from our SceneIT model with fine-tuned AttnGAN for some of the steps for clarity.

[Action Taken]:look
**Step 95078**
[Obs]:Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726
West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.
Figure 8: SceneIT generated images for this observation text.
[Action Taken]:open mailbox
**Step 95079**
[Obs]:opening the small mailbox reveals a leaflet.
[Action Taken]:west
**Step 95080**
this is a forest, with trees in all directions. to the east, there appears to be sunlight.
[Action Taken]:go around trees
**Step 95081**
this is a dimly lit forest, with large trees all around.
[Action Taken]:go around trees
**Step 95082**
this is a dimly lit forest, with large trees all around.
[Action Taken]:west
**Step 95083**
you hear in the distance the chirping of a song bird.
[Action Taken]:south
**Step 95084**
[Action Taken]:north
**Step 95085**
you are in a small clearing in a well marked forest path that extends to the east and west.
[Action Taken]:go around forest
**Step 95086**
[Action Taken]:north
**Step 95087**
you are in a clearing, with a forest surrounding you on all sides.
a path leads south.
on the ground is a pile of leaves.
[Action Taken]:take pile
**Step 95088**
[Obs]:in disturbing the pile of leaves, a grating is revealed.
[Action Taken]:put down pile
**Step 95089**
[Action Taken]:south
**Step 95090**
[Obs]:forest path
this is a path winding through a dimly lit forest. the path heads
north-south here. one particularly large tree with some low branches
stands at the edge of the path.
you hear in the distance the chirping of a song bird.
[Action Taken]:up
**Step 95091**
[Obs]:up a tree
you are about 10 feet above the ground nestled
among some large branches. the nearest branch above you is above your reach.
beside you on the branch is a small bird’s nest.
in the bird’s nest is a large egg encrusted with
precious jewels, apparently scavenged by a childless songbird.
the egg is covered with fine gold inlay, and ornamented in
lapis lazuli and mother-of-pearl. unlike most eggs,
this one is hinged and closed with a delicate looking clasp.
the egg appears extremely fragile.
Figure 9: SceneIT generated images for this observation text.
[Action Taken]:down
**Step 95092**
[Obs]:forest path
[Action Taken]:south
**Step 95093**
[Obs]:north of house
you are facing the north side of a white house.
there is no door here, and all the windows are boarded up.
to the north a narrow path winds through the trees.
[Action Taken]:west
**Step 95094**
[Obs]:west of house
there is a small mailbox here.
the small mailbox contains:
  a leaflet
[Action Taken]:north
**Step 95095**
[Obs]:north of house
[Action Taken]:north
**Step 95096**
[Obs]:forest path
[Action Taken]:go around forest
**Step 95097**
[Action Taken]:west
**Step 95098**
[Obs]:behind house
you are behind the white house. a path leads into the
forest to the east. in one corner of the house there is
a small window which is slightly ajar.
[Action Taken]:south
**Step 95099**
[Obs]:south of house
you are facing the south side of a white house.
there is no door here, and all the windows are boarded.
Figure 10: SceneIT generated images for this observation text.
[Action Taken]:east
**Step 95100**
[Obs]:behind house
[Action Taken]:open window
**Step 95101**
[Obs]:with great effort, you open the window far enough to allow entry.
[Action Taken]:west
**Step 95102**
you are in the kitchen of the white house.
a table seems to have been used recently for the preparation of food.
a passage leads to the west and a dark staircase can be seen leading upward.
a dark chimney leads down and to the east is a small window
which is open.on the table is an elongated brown sack,
smelling of hot peppers.
a bottle is sitting on the table.
the glass bottle contains:
  a quantity of water
Figure 11: SceneIT generated images for this observation text.
[Action Taken]:west
Figure 12: Performance evaluation on TextWorld Commonsense (TWC) across the three visual methods, measured during training (showing mean and standard deviation averaged over runs) for the three difficulty levels: Easy (left), Medium (middle), and Hard (right), using normalized score and the number of steps taken.
Figure 13: Activation maps for the First TextWorld Problems cooking task when using images retrieved from the internet for selecting the next action. We show the retrieved images along with their attention plots for clarity.
Figure 14: Activation maps for the First TextWorld Problems cooking task, showing the regions of interest when using the imagination-based model (AttnGAN) for selecting the next action. We include both the generated images and their attention plots for clarity.
Figure 15: Activation maps for Zork1 from the Jericho environment when using images retrieved from the internet for selecting the next action. We show the retrieved images along with their attention plots for clarity.
Figure 16: Activation maps produced by the visual model for Zork1 from the Jericho environment, on images imagined from the game using AttnGAN.