Learn How to Cook a New Recipe in a New House: Using Map Familiarization, Curriculum Learning, and Common Sense to Learn Families of Text-Based Adventure Games

08/13/2019 ∙ by Xusen Yin, et al. ∙ USC Information Sciences Institute

We consider the task of learning to play families of text-based computer adventure games, i.e., fully textual environments with a common theme (e.g. cooking) and goal (e.g. prepare a meal from a recipe) but with different specifics; new instances of such games are relatively straightforward for humans to master after a brief exposure to the genre but have been curiously difficult for computer agents to learn. We find that the deep Q-learning strategies that have been successfully leveraged for superhuman performance in single-instance action video games can be applied to learn families of text video games when adopting simple strategies that correlate with human-like learning behavior. Specifically, we build agents that learn to tackle simple scenarios before more complex ones (curriculum learning), that are equipped with the contextualized semantics of BERT (and we demonstrate that this provides a measure of common sense), and that familiarize themselves in an unfamiliar environment by navigating before acting. We demonstrate faster training convergence and improved task completion rates over reasonable baselines.




1 Introduction

Building agents able to play text-based adventure games is a useful proxy task for learning open-world goal-oriented problem-solving dialogue agents. Via an alternating sequence of natural language descriptions given by the game and natural language commands given by the player, a player-agent navigates an environment, discovers and interacts with entities, and accomplishes a goal, receiving explicit rewards for doing so. Human players are skilled at text games when they understand the situation they are placed in and can make rational decisions based on their life and game playing experience. For example, in the classic text game Zork Infocom (2001), the adventurer discovers an air pump and an uninflated plastic boat; common sense leads human players to inflate the boat with the pump.

Games such as Zork are very complicated and are designed to be played repeatedly until all the puzzles contained within have been solved; in this way, they are not very similar to real human experiences. Another kind of text game, as exemplified by the TextWorld learning environment Côté et al. (2018) and competition, expects agents to learn a particular task theme (such as rescuing victims from a burning building or preparing a meal) but evaluates on never-before-seen instances of that theme. This is a much more realistic scenario. A person who has never cooked a meal before would no doubt flounder when asked to prepare one. In order to learn to cook, one does not begin by learning to make Coq au Vin, but rather starts simply and works up to more complicated tasks. However, once the cooking skill is learned, one would reasonably expect to be able to prepare a new recipe the first time it is seen. Furthermore, even if the recipe was prepared in a somewhat unfamiliar location (say, the kitchen of a vacation home), a reasonable person would explore the new space, recognize the familiar rooms and elements, and then begin cooking.

In this work, we approach this more-realistic scenario and consider how we might train models to learn to play familiar but unseen text games by adopting a training regimen and knowledge set that mirror human skill acquisition. Specifically, we make the following contributions in our text game agent learning models:

  • We build agents that can play unseen text-based games by transferring learned knowledge rather than by simply overfitting to a single training game.

  • We show how the proper use of domain-aware curriculum learning strategies can lead to a better learned agent.

  • We draw a distinction between universal knowledge (e.g., that cooking can be done in the kitchen) and instance knowledge (e.g., that the kitchen is east of the bedroom); the former can be usefully learned from training data, but the latter cannot. We show how environment familiarization through construction of a knowledge graph improves learning.

  • We show that the incorporation of a pre-trained contextualized large language model speeds up training convergence. We also demonstrate that this is because it provides external common sense knowledge that otherwise must be learned through trial and error, or not at all.

Master: You find yourself in a kitchen. You make out a fridge. The fridge is empty. You see a cookbook on the table. you see a counter. the counter is vast. on the counter you can make out a knife.
Player: examine cookbook
Master: You open the cookbook and start reading. ‘Recipe 1: ingredients: red potato. directions: slice the red potato. roast the red potato. prepare meal’
Player: take knife from counter
Master: You take the knife from the counter.
Player: slice red potato with knife
Master: You slice the potato. Your score has gone up by 1 point.
Figure 1: Truncated example of dialogue from the First TextWorld Problems Challenge shows a portion of a ‘tier-1’ game, as described in Section 3. The concatenation of all master-player sequences constitutes a trajectory as described in Section 2.

2 Reinforcement learning for text game models

The influential Deep Q-Network (DQN) approach to learning simple action video games pioneered by Mnih et al. (2015) has motivated research into the limits of this technique when applied to other kinds of games. We follow recent work that ports this approach to text-based games Narasimhan et al. (2015); He et al. (2016); Fulda et al. (2017); Zahavy et al. (2018); Ansari et al. (2018); Kostka et al. (2017); Yuan et al. (2018); Ammanabrolu and Riedl (2018); Yin and May (2019). The core approach of DQN as described by Mnih et al. (2015) is to build a replay memory of partial games with associated scores, and use this to learn a function $Q(s, a)$, where $Q(s, a)$ is the expected reward (a.k.a. Q-value) obtained by choosing action $a$ when in state $s$; from $s$, choosing $\arg\max_a Q(s, a)$ affords the optimal action policy and this is used at inference time. As in the original work, a key innovation is using the appropriate input to determine the game state: for video games, this is a sequence of images from the game display, while for text games we use a history of system description-player action sequences, which we call a trajectory; an abbreviated example is given in Figure 1. A means of efficiently representing the infinite state space is necessary; most related work uses LSTMs Narasimhan et al. (2015); Ammanabrolu and Riedl (2018); Yuan et al. (2018); Kostka et al. (2017); Ansari et al. (2018), though we follow Zahavy et al. (2018), which uses CNNs, to achieve greater speed in training. The DQN is trained in an exploration-exploitation method ($\epsilon$-search): with probability $\epsilon$, the agent chooses a random action (explores), and otherwise the agent chooses the action that maximizes the DQN function. The hyperparameter $\epsilon$ usually decays from 1 to 0 during the training process.
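The exploration-exploitation rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the linear decay shape, and the toy Q-values are assumptions made for the example.

```python
import random


def epsilon_at(step, total_steps, start=1.0, end=0.0):
    """Linearly interpolate epsilon from `start` to `end` (assumed decay shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def choose_action(q_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise take argmax Q."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])


# Hypothetical Q-values for two admissible commands in one state.
q = {"take knife from counter": 0.2, "slice red potato with knife": 0.9}
print(epsilon_at(50, 100))            # halfway through the decay
print(choose_action(q, epsilon=0.0))  # purely greedy choice
```

With `epsilon=0.0` the agent always exploits; with `epsilon=1.0` it always explores, which corresponds to the start of training.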

Much game-learning research is concerned with the optimization of a single game, e.g. applying DQN repeatedly on Pac-Man with the goal of learning to be very good at playing Pac-Man. While this is a realistic goal when strictly limited to the domain of video game play (occasional stochasticity notwithstanding), single-game optimization is rather unsatisfying. It is difficult to tell if a single game-trained model has managed to simply overfit on its target or if it has learned something general about the task it is trying to complete. More concretely, if we consider game playing as a proxy for real-world navigation (in the action game genre) or task-oriented dialogue (in the text genre), it is clear that a properly trained agent should be able to succeed in a new, yet familiar environment. We thus depart from the single-game approach taken by others Narasimhan et al. (2015); He et al. (2016); Ammanabrolu and Riedl (2018); Zahavy et al. (2018) and evaluate principally on games that are in the same genre as those seen in training, but that have not previously been played during training.

Figure 2: The architecture of the DRRN model. Trajectories and actions are encoded by a CNN and an LSTM into hidden states and hidden actions, followed by a dense layer to compute the Q-vector. The pre-trained BERT layer may be replaced with randomly initialized word embeddings for both trajectory and actions. We also construct a knowledge graph from trajectories to add information in contradicted actions.

2.1 Handling unbounded action representations

A consequence of learning to play a game that has not been seen before is that actions not seen in training may be necessary at test time. Vanilla DQNs as introduced by Mnih et al. (2015) are incompatible with this modification; they presume a predefined finite action space and were designed for a space of up to 18 actions (each of nine joystick directions and a potential button push). Additionally, vanilla DQNs presume no semantic relatedness among actions, while in text games it would make sense for, e.g., open the door to be semantically closer to shut the door than to dice the carrot. In our experiments we assume a game’s action set is fully known at inference time but not beforehand, and that actions have some relatedness. (This is itself still a simplification, as many text games allow open text generation and thus an infinite action space. Our approach does not preclude abandoning this simplification, but the difficulty of the problem is sufficient to leave this for future work.)

We thus represent actions using Deep Reinforcement Relevance Networks (DRRN) (Figure 2) He et al. (2016), a modification of the standard DQN. Actions are encoded via an LSTM Hochreiter and Schmidhuber (1997) and scored against state representations through a learned weight matrix. In preliminary experiments we found that LSTMs worked better than CNNs on the small and similar actions in our space such as take yellow potato from fridge and dice purple potato.
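The scoring equation itself is not recoverable from this copy of the text, which only states that a learned weight matrix relates the state and action encodings. One common way to realize such scoring is a bilinear form; the sketch below assumes that form purely for illustration. The function names, the identity matrix, and the toy vectors are all hypothetical.

```python
def bilinear_score(h_state, weight, h_action):
    """Compute h_state^T * W * h_action on plain Python lists (assumed form)."""
    projected = [sum(w * a for w, a in zip(row, h_action)) for row in weight]
    return sum(s * p for s, p in zip(h_state, projected))


def q_vector(h_state, weight, action_vectors):
    """Score every currently admissible action against one state encoding."""
    return {name: bilinear_score(h_state, weight, vec)
            for name, vec in action_vectors.items()}


# Toy 2-dimensional encodings; real encodings would come from the CNN and LSTM.
state = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
actions = {"open the door": [0.9, 0.1], "dice the carrot": [0.1, 0.9]}
scores = q_vector(state, identity, actions)
print(max(scores, key=scores.get))
```

Because each action is scored from its own encoding, the set of actions can change from game to game without changing the network's output layer, which is the property the section relies on.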

3 Games

We use the games released by Microsoft for the ‘First TextWorld Problems’ competition (https://www.microsoft.com/en-us/research/project/textworld). The competition provides 4,440 cooking games generated by the TextWorld framework Côté et al. (2018). The goal of each game is to prepare a recipe. The action space is simple, yet expressive, and has a fairly large, though domain-limited, vocabulary. A portion of a simple example is shown in Figure 1.

The games are divided into 222 different types, with 20 games per type. A type is a set of attributes that increase the complexity of a game. These attributes include the number of ingredients, the set of necessary actions, and the number of rooms in the environment. One example of such a type is recipe3 + take3 + open + drop + go9 that implies the game contains three ingredients in the recipe, and players need to find and take the three items. In the process of finding these items, there could be doors to open, e.g. a door of a fridge, or a door of a room. The agent may also need to drop something in hand before taking another. Finally, the go9 means there are nine different rooms in the game. A constant reward (i.e. one point) is given for each acquisition or proper preparation of a necessary ingredient as well as for accomplishing the goal (preparing the correct recipe). Each game has a different maximum score, so we report aggregate scores as a percentage of achievable points.
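A type string such as the one quoted above can be split into its attributes mechanically. The following is a small sketch that assumes every attribute is a lowercase name with an optional trailing count, as in the single example given; the function name and the error handling are hypothetical.

```python
import re


def parse_game_type(type_string):
    """Split e.g. 'recipe3 + take3 + open + drop + go9' into named attributes."""
    attributes = {}
    for token in type_string.split("+"):
        token = token.strip()
        match = re.fullmatch(r"([a-z]+)(\d*)", token)
        if match is None:
            raise ValueError(f"unrecognized attribute: {token!r}")
        name, count = match.groups()
        # Attributes without a count (open, drop) are recorded as flags.
        attributes[name] = int(count) if count else True
    return attributes


parsed = parse_game_type("recipe3 + take3 + open + drop + go9")
print(parsed)
```

Here `recipe` and `take` carry the number of ingredients, and `go` carries the number of rooms.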

3.1 Levels of difficulty

Game types naturally cluster into tiers of increasing difficulty. The easiest games take place inside a single room and require only one (tier-1), two (tier-2), or three (tier-3) ingredients. More complicated are the multi-room games; these may have six (tier-4), nine (tier-5), or twelve (tier-6) rooms. Intuitively, it should be very easy to learn a tier-1 game. Adding additional ingredients requires knowing how to prepare each ingredient correctly, and adding additional rooms requires finding the kitchen and other locations. Table 1 contains per-tier details.

tier #ingredients #rooms #games
1 1 1 420
2 2 1 420
3 3 1 420
4 3 6 1040
5 3 9 1040
6 3 12 1040
Table 1: Tiers of games. The tiers are selected by the difficulty level of games. Tier-1 is the simplest, containing only one ingredient in a recipe and one room to explore per game. Tier-6 is the most difficult, including up to three ingredients in a recipe, and twelve rooms to explore per game. The first three tiers only contain one room, which means there need be no go actions involved in these games.
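The tier assignment in Table 1 amounts to a lookup on the number of ingredients and rooms. A minimal sketch follows, assuming the single-room tiers are distinguished by ingredient count and the multi-room tiers by room count alone; the function name is hypothetical.

```python
def tier_of(num_ingredients, num_rooms):
    """Map a game's ingredient and room counts to a tier, following Table 1."""
    if num_rooms == 1:
        if num_ingredients in (1, 2, 3):
            return num_ingredients        # tiers 1-3: one room
    else:
        by_rooms = {6: 4, 9: 5, 12: 6}    # tiers 4-6: multi-room games
        if num_rooms in by_rooms:
            return by_rooms[num_rooms]
    raise ValueError("combination not covered by Table 1")


print(tier_of(2, 1), tier_of(3, 9))
```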

4 Methods

4.1 Curriculum learning

Correctly training a DQN-like model to play even a single game can take millions of training steps Mnih et al. (2015) due to the need for heavy exploration. If our models are able to learn critical general skills in the early parts of training, they can focus on more fine-grained skills later on. For example, recognizing that the action cook potato with stove matches the cookbook instruction fry potato allows generalization to, e.g., fry eggplant. This skill is needed across all games. More specific skills, like knowing to drop items before picking up other items, are less commonly used.

Curriculum learning Bengio et al. (2009) is a good way of structuring our learning to capture core skills first and gradually build in more complicated knowledge. We initially train with tier-1 training data only. After convergence we use the best model to initialize the model of tier-2, and so on. Because tiers 1–3 differ significantly from tiers 4–6 (the latter have movements and more games per tier), we alter our approach slightly as training proceeds. We start training tier-1 with the games of tier-1 only. When we train tier-2, we mix the games of tier-1 and tier-2 in order to make the agent perform well on both tiers. We then mix tier-3 data in. But for tier-4 to tier-6, we only use the data for the specific stage of training, and do not mix in data from previous tiers. For each stage of curriculum learning we initialize $\epsilon$ to 1 and decay it evenly to 0.0001 across a maximum of 2,000,000 steps. In ablation experiments without curriculum learning we instead decay over 10,000,000 steps.
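The staging just described can be written down as a small schedule. This sketch only enumerates which tiers feed each stage and which earlier stage supplies the initial weights; the dictionary layout and function name are assumptions, and no training is performed.

```python
def curriculum_stages(num_tiers=6, last_mixed_tier=3,
                      eps_start=1.0, eps_end=0.0001, max_steps=2_000_000):
    """List the training pool and exploration schedule for each curriculum stage."""
    stages = []
    for tier in range(1, num_tiers + 1):
        if tier <= last_mixed_tier:
            train_tiers = list(range(1, tier + 1))  # mix in all earlier tiers
        else:
            train_tiers = [tier]                    # current tier only
        stages.append({
            "stage": tier,
            "train_tiers": train_tiers,
            # Each stage starts from the best model of the previous stage.
            "init_from_stage": tier - 1 if tier > 1 else None,
            "eps_start": eps_start,
            "eps_end": eps_end,
            "max_steps": max_steps,
        })
    return stages


stages = curriculum_stages()
for stage in stages:
    print(stage["stage"], stage["train_tiers"], stage["init_from_stage"])
```

Restarting the exploration rate at each stage lets the agent explore the newly introduced games while keeping the weights learned so far.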

Experiment           Test 1 score %   Test 2 score %
random action        14               14
curric go-cardinal   50               52
curric go-random     55               57
curric go-room       55               58
mixed go-room        50               54
Table 2: Core overall results on unseen games of various difficulty levels. The random action baseline gives predictably poor results. Casting directions in terms of the room destination (go-room) generalizes better than learning specific cardinal directions (go-cardinal), but the alternative of picking a direction at random (go-random) appears surprisingly competitive. Using curriculum learning (curric) is preferred to training with all games simultaneously (mixed).

4.2 Learning universally from local information

Since knowledge like the connection between the behavior of fry and using a stove can be learned from past experience and applied to future scenarios, we call this universal knowledge. Other knowledge that is specific to a particular scenario and not reusable we term instance knowledge. In a specific game from our data set, for example, the player may have to go north to reach the kitchen. However, this will not be the case in general. Thus, naively learning a policy for the action go east given a particular state is likely to be suboptimal. We would like to prevent this kind of overfitting by turning instance knowledge into universal knowledge.

As it turns out, in the domain we are studying, learning that we must leave the room we are in (generally to reach the kitchen or a room containing missing ingredients) is universal knowledge. A simple way to remove instance knowledge, which we call go-random, is to conflate all actions of the form go direction into a single go action, and then randomly choose a cardinal direction when that action is selected.

Since the room we are trying to reach is more universally important than the direction chosen in a particular game, another approach to converting instance to universal knowledge is to augment directions with the name of the room that will be reached before encoding actions. If, in a particular game, the hallway is east of the bedroom, the action go east is modified during training to be go east to hallway, enabling the action representation to incorporate the more globally useful room type. At inference time we build a simple knowledge graph with this information by a series of initial random walks.
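The room-augmented actions can be illustrated with a toy map. In this sketch the `exits` dictionary stands in for the game itself, which a real agent would only observe by issuing commands; the function names, the walk counts, and the three-room layout are all hypothetical.

```python
import random


def explore_map(exits, start, num_walks=20, walk_length=10, rng=None):
    """Record (room, direction) -> destination pairs seen on random walks."""
    rng = rng or random.Random(0)
    graph = {}
    for _ in range(num_walks):
        room = start
        for _ in range(walk_length):
            direction = rng.choice(sorted(exits[room]))
            destination = exits[room][direction]
            graph[(room, direction)] = destination
            room = destination
    return graph


def augment_action(action, room, graph):
    """Rewrite 'go <direction>' as 'go <direction> to <room>' when known."""
    parts = action.split()
    if len(parts) == 2 and parts[0] == "go":
        destination = graph.get((room, parts[1]))
        if destination is not None:
            return f"{action} to {destination}"
    return action


# Toy layout: the hallway is east of the bedroom, the kitchen north of the hallway.
exits = {
    "bedroom": {"east": "hallway"},
    "hallway": {"west": "bedroom", "north": "kitchen"},
    "kitchen": {"south": "hallway"},
}
graph = explore_map(exits, start="bedroom")
print(augment_action("go east", "bedroom", graph))
```

Actions that are not movements, or movements whose destination has not been observed, are left unchanged.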

4.3 Learning with common sense knowledge

Humans play games by both learning from failures and by using common sense. Common sense knowledge, such as that a closed door should be opened, that it is helpful to light a lamp in a dark dungeon, or that one can fry on a stove, is helpful a priori knowledge that allows agents to learn to play faster. An agent that does not have this knowledge could conceivably, through reward signal and enough random exploration, learn these associations, but humans playing these games will be extremely unlikely to attempt to fry using, e.g., a fridge. We incorporate BERT Devlin et al. (2018), a large pre-trained contextualized language model, in our system, as a source of common sense.

While it is rather controversial to claim that a model trained only to predict missing words in context has common sense akin to that of a human, the fact remains that an adequately fine-tuned BERT has been shown to answer multiple choice questions from the Situations With Adversarial Generations (SWAG) dataset Zellers et al. (2018), among others, at near-human levels. Such ‘weak common sense’ knowledge may be enough for our use case, which also may be expressed as a multiple-choice test given textual context. To save time during training, we use the first layer of BERT as an embedding-level feature extractor, and fine-tune this layer during the learning procedure. In ablation studies we compare this to a randomly initialized simple (non-contextualized) embedding baseline.

Figure 3: The training process of ‘mixed go-room’ (Table 2); all 3,596 training games without curriculum learning and with room destination. We evaluate on the dev set at every epoch (10,000 steps). The total score converges around 54% after 500 epochs of training.

5 Experiments and discussion

We hold out a selection of 10% of the games and divide this portion into two separate test sets, each consisting of 222 games, one from each type. We randomly select an additional 400 games as a dev set and keep the remaining games for training. We consider an episode to be a play-through of a game; there are multiple episodes of each game run during training and scores are taken over a 10-episode run of each game when evaluating test. An episode is run until a loss (an ingredient is damaged or the maximum of 100 steps is reached) or a win, by completing the recipe successfully. Apart from the inherent game reward, we add a small negative reward (i.e. punishment) to every step, to encourage more direct gameplay. Also, if the game stops early because of a loss, we set the instant reward to a negative value to penalize the last action.
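The reward adjustments in this paragraph can be sketched as a small function. The specific penalty values are not recoverable from this copy of the text, so the defaults below (`-0.1` per step and `-1.0` on a loss) are placeholders chosen for illustration only, and the function name is hypothetical.

```python
def shaped_reward(game_reward, lost, step_penalty=-0.1, loss_reward=-1.0):
    """Combine the game's own reward with a per-step penalty and a loss penalty.

    step_penalty and loss_reward are placeholder values, not the paper's.
    """
    if lost:
        # On an early loss the instant reward is overwritten, not incremented.
        return loss_reward
    return game_reward + step_penalty


print(shaped_reward(1, lost=False))  # a scoring step, minus the step penalty
print(shaped_reward(0, lost=True))   # the final action of a lost episode
```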

During training, we use 50,000 observation steps, 500,000 replay memory entries, and decay $\epsilon$ from 1 to 0.0001 in 10,000,000 steps for training with all games in training data.

From a training run, we select the model with the highest score on the dev set for test inference. We run 10 episodes for each game during the test phase with a small nonzero $\epsilon$, allowing for some stochasticity. The maximum total steps of evaluating on one test set is thus $222 \times 10 \times 100 = 222{,}000$. The maximum total score, by contrast, depends on the games involved, since different games have different maximum scores. We use the percentage of scores and steps as the evaluation criteria in the following sections. The higher the score, the better the agent. A lower percentage of steps means a better policy when scores tie; we show the percentage of wins alongside steps; if steps decrease and wins do not, this indicates an improving policy.
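The evaluation criteria can be made concrete with a short sketch. It assumes that percentages are aggregated as summed points over summed achievable points (rather than as a mean of per-game percentages), which the text does not state explicitly; the record layout and function name are hypothetical.

```python
def evaluate(episodes, max_steps_per_episode=100):
    """Aggregate score, step, and win percentages over a list of episode records."""
    total_score = sum(e["score"] for e in episodes)
    total_max = sum(e["max_score"] for e in episodes)
    total_steps = sum(e["steps"] for e in episodes)
    wins = sum(1 for e in episodes if e["won"])
    return {
        "score_pct": 100.0 * total_score / total_max,
        "steps_pct": 100.0 * total_steps / (max_steps_per_episode * len(episodes)),
        "wins_pct": 100.0 * wins / len(episodes),
    }


# Step budget for one test set: 222 games, 10 episodes each, 100 steps per episode.
max_total_steps = 222 * 10 * 100

episodes = [
    {"score": 3, "max_score": 3, "steps": 20, "won": True},
    {"score": 1, "max_score": 5, "steps": 100, "won": False},
]
result = evaluate(episodes)
print(max_total_steps, result)
```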

We use a CNN with 32 convolutional filters of each size (3, 4, and 5), followed by a max-pooling layer. The LSTM action encoder contains 32 units in a single layer. We use the last LSTM hidden state as the encoded action state. We initialize our models with random word embeddings and position embeddings. We use a fixed embedding size of 64. At every training step, we draw a minibatch of 32 samples and update the model with the Adam optimizer. We trim trajectories to contain no more than 21 sentences to avoid unnecessarily long concatenated strings.

5.1 Core results

We primarily report results as a percentage of total achievable points on the test sets. Core findings are shown in Table 2. For a simple, training-free baseline, we choose a random action from the set of admissible actions at each state. Our main comparisons are that of curriculum learning (curric) as described in Section 4.1 to the default (mixed), and between the three different approaches to handling instance knowledge as described in Section 4.2. We next take a more in-depth look at the differences in learning behavior.

5.2 Curriculum analysis

Table 3 breaks down the test results ‘mixed go-room’ and ‘curric go-room’ by tier, evaluating after all training is complete. Here we can see that a) curriculum training is generally helpful at every tier, and that b) the ability to reach 100% of score generally decreases by tier. The training behavior of ‘mixed go-room’ is shown in Figure 3. As training proceeds, the total score percentage on dev should go up, and as long as the percentage of wins is not decreasing, the total steps percentage should go down, indicating fewer unnecessary steps. Indeed, this is what we see; the total score gradually increases during training and finally is stable at 54%.

Training graphs for ‘curric go-room’ broken down by tier are shown in Figure 4. For tier-1 (Figure 4(a)) we converge to almost 100% of total score after 140 epochs, which means our agent grasps basic cooking abilities. However, the results of tier-2 (Figure 4(b)) and tier-3 (Figure 4(c)) are flat, indicating that some minor ingredient confusion is never resolved. For tiers 4 through 6 (Figure 4(d) to 4(f)), scores generally improve from 40% to roughly 60%, indicating progressive ability to learn to navigate rooms.

Tier   Test 1 mixed   Test 1 curric   Test 2 mixed   Test 2 curric
1      88             95              85             94
2      53             58              53             55
3      57             55              54             55
4      55             56              57             58
5      40             49              55             60
6      36             47              41             45
All    50             55              54             58
Table 3: Comparing the evaluation results of training all tiers together (mixed) and training with curriculum learning (curric) on the two separate test sets. Rows 1-6 show the breakdown of total scores and steps on each tier. The curriculum learning method generally shows better results on both test sets.
Figure 4 (panels: (a) tier-1, (b) tier-2, (c) tier-3, (d) tier-4, (e) tier-5, (f) tier-6): The training process of ‘curric go-room’ broken down by tier. Results on tier-specific dev sets are shown. The learning is generally rational (scores and wins go up, steps go down) but is less effective in tiers 2 and 3.

5.3 Analysis of universal information conversion

Table 4 breaks down performance of each strategy for dealing with instance information in each tier that requires resolution of this information. It is clear that ‘go-cardinal,’ which does not convert any instance information, is less able to learn than the other methods at any tier. As the number of rooms to navigate grows from tier-4 to tier-6, the random navigation strategy becomes less effective, such that the ‘go-room’ transferring from instance-level cardinal information into universal-level room transition information is the most effective at navigating the large twelve-room games of tier-6. (An even more pertinent strategy would be to label directions by their ability to get to key destination rooms, i.e. the kitchen and supermarket, but these strategies would not necessarily transfer well to a new domain.)

Tier go-cardinal go-random go-room
4 49 58 56
5 40 48 49
6 36 44 47
All 50 55 55
Table 4: Breakdown of information conversion strategies by tier on Test 1; the ‘go-random’ approach is less effective as map size increases.

Table 5 shows that there is a correlation between the most recently trained tier and performance on test data from that tier; we run ‘curric go-room’ but stop after the tier indicated, then subdivide test data per-tier. We see strongest performance on the main diagonal. This is reasonable because the six-room games of tier-4 use the same six rooms each time and so on; the extra rooms of tier-6 aren’t known during tier-4 training, and some decay of tier-4 rooms is observed as learning is rededicated to new rooms. Nevertheless, by training on all tiers we get best overall performance on Test 1.

Test \ Train   Tier 4   Tier 5   Tier 6
Tier 4         62       59       56
Tier 5         41       50       49
Tier 6         26       35       47
All            51       53       55
Table 5: Recency effect of curriculum learning (using go-room) on Test 1; performance on tier-specific subsets is best on the last tier used for training, though training on the entire set gives the overall best result.

5.4 Common sense analysis

We replace the randomly initialized simple type-based embedding used heretofore with the lowest layer of the pre-trained BERT-base uncased model Devlin et al. (2018) to investigate the a priori knowledge that this large language model brings to aid in game playing. Figure 5 shows the training result on tier-1 with BERT. Compared with the baseline agent, the BERT agent converges to the optimal scores and steps on tier-1 in far fewer training steps. After 260,000 steps of training on tier-1, the baseline model scores 24% on Test 1 and 29% on Test 2, while the BERT-enhanced model scores 34% on Test 1 and 35% on Test 2.

Figure 6 shows the comparison of grasping key knowledge in games of the BERT and the CNN agents; for each of six key cooking verbs, we show the percentage of games in the dev set requiring these verbs where they are used correctly. While both BERT and CNN learn to use take fairly quickly, BERT masters the remaining five much more quickly than the baseline. As both models have the same exploration ability and the contextualized representations afforded by BERT are not particularly useful in the fairly simple language of this genre, we can conclude that equipping our model with BERT gives it sufficient common sense to associate these key verbs with clues that indicate their use placed in the game text.

A major downside of replacing simple 64-dimension type-based embeddings with the contextualized BERT model is that, even though we restrict our usage to a single layer, fine-tuning this model, especially given the long trajectories used as input, becomes cumbersome. An epoch of training using BERT takes 1093 seconds on a Tesla P100, vs. 118 seconds for the baseline. As full curriculum learning through tier-6 on the baseline model on our P100 takes about 2 days per tier, it is not currently practical to fully train using BERT.

Figure 5: Comparing the training process of the BERT and baseline agents on tier-1. BERT can converge to the optimal score in five times fewer epochs than the baseline.
Figure 6: Comparison between BERT and baseline at learning six different cooking abilities. The BERT agent grasps most abilities much faster than the baseline.
Master: You find yourself in a kitchen. You make out a fridge. The fridge contains a yellow bell pepper. You see a cookbook and a knife on the table.
Player: examine cookbook
Master: You open the cookbook and start reading. ‘Recipe 1: ingredients: red apple. directions: slice the red apple. prepare meal’
Master: You are carrying: a red apple, a red onion.

Player: drop red onion.
Master: you drop the red onion on the ground.
Player: take knife from table.
Master: you take the knife from the table.
Player: slice red apple with knife.
Master: you slice the red apple. your score has just gone up by one point.
Player: prepare meal.
Master: adding the meal to your inventory. your score has just gone up by one point.
Player: eat meal.
Master: you eat the meal. your score has just gone up by one point. you win.

Player: cook red apple with stove
Master: You fried the red apple. You lost!
Figure 7: Comparison of BERT and baseline models on an example game from the dev set after 270,000 steps of training on tier 1 data; common steps are shown at the top.

6 Related Work

Many recent works Narasimhan et al. (2015); He et al. (2016); Li et al. (2016); Ansari et al. (2018); Fulda et al. (2017); Côté et al. (2018); Kostka et al. (2017) on building agents for text-based games apply the DQN Mnih et al. (2015), originally developed for playing video games, or its variants. Different extensions of DQN have been presented, such as action reduction with language correlation Fulda et al. (2017), a bounding method Zahavy et al. (2018), the introduction of a knowledge graph Ammanabrolu and Riedl (2018), text understanding with dependency parsing Yin and May (2019) and an entity relation graph Ammanabrolu and Riedl (2018).

However, previous work is chiefly focused on learning to self-train on games and then do well on the same games, instead of playing unseen games. A rare exception, Yuan et al. (2018), works on generalization of agents on variants of a very simple coin-collecting game. The simplicity of their games enables them to use an LSTM-DQN method with a counting-based reward. Ammanabrolu and Riedl (2018) use a knowledge graph as a persistent memory to encode states, while we use a knowledge graph to make actions more informative. Our work is closely related to task-oriented dialogue studies He et al. (2017); Rajendran et al. (2018); Bordes et al. (2017), though these are generally not directly transferable to our scenario, because they use customized models and rely on training data.

7 Conclusion

In this paper, we train agents to play a family of text-based games. Instead of repeatedly optimizing on a single game, we train agents to play familiar but unseen games. We show that curriculum learning helps the agent learn better. We convert instance knowledge into universal knowledge via map familiarization. We also show how the incorporation of an external knowledge source (BERT) leads the agent to learn in far fewer epochs.