Text-based games offer a unique framework to train decision-making models insofar as these models have to understand complex text instructions and interact via natural language. At each time step of a text-based game, the current environment of the player (or the ‘context’) is described in words. To move the game forward, a text command (or an ‘action’) must be issued. Based on the issued action and the current game state, the game transitions to a new state and the new context resulting from the action is described to the player. This iterative process can be naturally divided into two tasks. The first task is to recognize the commands that are possible in a given context (e.g., open the door if the context contains an unlocked door), and the second task is the reinforcement learning task of learning to act optimally in order to solve the game (narasimhan2015lstmdqn; zelinka2018; he2015drrn; haroush2018actionelimination). Most work on reinforcement learning has focused on training an agent that picks the best command from a given set of valid commands, i.e., the command that would lead to completing the game.
Humans who play a text-based game typically do not have access to a list of commands, and a large part of playing the game consists of learning how to formulate valid commands. In this paper, we propose models that try to accomplish this task. We frame it as a supervised learning problem and train a model on (input, label) pairs, where the input is the current context together with the objects that the player possesses, and the output is the list of admissible commands given this input. Following cote2018textworld, we define an admissible command as a command that changes the game’s state. We generate these (input, label) pairs with TextWorld (cote2018textworld), a sandbox environment for generating text-based games of varying difficulty.
In this work, we explore and present three neural encoder-decoder approaches:
a pointer-softmax model that uses beam search to generate multiple commands;
a hierarchical recurrent model with pointer-softmax generating multiple commands at once;
a pointer-softmax model generating all commands at once as a single concatenated sequence.
The first model has the disadvantage of imposing a fixed number of actions for any given context. The other two alleviate this constraint but suffer from conditioning on the previously generated command. We compare empirical and qualitative results from these models, and pinpoint their weaknesses.
2 Related Work
Sequence to sequence generation
Sentence generation has been studied extensively since the inception of sequence-to-sequence models (sutskever2014seq2seq) and attentive decoding (bahadanau2014attention). Pointer-based sequence-to-sequence networks (vinyals2015pointer; gulcehre2016pointersoftmax; wang2017qa) help deal with out-of-vocabulary words by introducing a mechanism for choosing between outputting a word from the vocabulary or referencing an input word during decoding. vinals2015order studied the problem of matching input sequences to output sets, i.e., where there is no natural order between the elements. Our task is similar but mixed: there is no natural order between the sentences, yet there is an order between the tokens within each sentence. One of the models we try for this task is the hierarchical encoder-decoder (sordoni2015hred), originally proposed to model the dialogue between two speakers. Another model is inspired by yuan2018keyphrase, which generates concatenated target sentences with orthogonally regularized separators.
Reinforcement learning for text-based games
Many recent attempts at solving text-based games have assumed that the agent has a predefined set of commands to choose from. For instance, the Action-Eliminating Network (haroush2018actionelimination) assumes that the agent has access to all possible permutations of commands in the entire game, and prunes that list in each state to allow the agent to better select correct commands. One attempt at command generation for a text-based game is the LSTM-DQN (narasimhan2015lstmdqn). This approach generates commands by leveraging off-policy deep Q-value approximations (mnih2013dqn), and learns two separate Q-functions for verbs and nouns. This limits the structure of generated commands to verb-noun pairs, and does not allow for more robust multi-entity commands. yuan2018twcount extends the LSTM-DQN approach with an exploration bonus in order to generalize to, and beat, games consisting of collecting coins in a maze.
Separating planning from generation in dialogue systems
The task of choosing the best next utterance to generate for a given context has been extensively studied in the literature on dialogue systems (rieser_natural_2016; Pietquin:11; Fatemi:16). Historically, dialogue systems have treated separately the tasks of understanding the context, producing the available next utterances, and generating the next utterance (Lemon:07). Recent attempts at learning to perform all these tasks through one end-to-end model have produced encouraging results (li_adversarial_2017; Bordes:17), but so far the best-performing models still separate these tasks (Wen:16; Asadi:16). Inspired by these results, we split the task of solving a text-based game into an action-generation module and an action-selection module, and we propose models for action generation in the following section.
3.1 Dataset and environment
In this section, we introduce a dataset called TextWorld Action Command Generation (TextWorld ACG). It is a collection of game walkthroughs gathered from random games generated with TextWorld. Statistics of TextWorld ACG are shown in Table 1. Each data point in TextWorld ACG consists of:
Context: concatenation of the room’s and inventory’s description for a game state;
Entities: a list of interactable object names or exits appearing within the context;
Commands: a list of strings that contains all the admissible commands recognized by TextWorld.
We define two tasks using TextWorld ACG to learn the action space of these TextWorld games. First, without conditioning on entities, the model needs to generate all the admissible commands. Second, conditioning on one entity, the model is required to generate all valid commands that are related to that entity. In the following sections, we denote the task without conditioning on entities as ACG, and the task conditioning on entities as ACGE. The data used for the ACGE task is created by splitting each data point in TextWorld ACG by its entities, so that each data point in ACGE has a single entity. There exist commands with multiple entities (e.g., put apple on table); in these cases we group the command with one of its entities and expect the models to produce the other entity. We also ignore the two commands (look and inventory) that do not affect the game state. This is because the context already consists of the exact descriptions returned by look and inventory; adding the two commands would only serve to inflate metrics.
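The splitting step above can be sketched as follows. This is an illustrative reconstruction, not the released dataset format; grouping a multi-entity command under the first entity it mentions is one possible convention.

```python
# Sketch of splitting one ACG data point into ACGE data points.
# Each command is grouped under a single entity (here: the first
# listed entity that appears in the command string).

def split_into_acge(context, entities, commands):
    groups = {e: [] for e in entities}
    for cmd in commands:
        for entity in entities:
            if entity in cmd:
                groups[entity].append(cmd)
                break  # multi-entity commands go with one entity only
    return [{"context": context, "entity": e, "commands": c}
            for e, c in groups.items() if c]

points = split_into_acge(
    "You see a table. An apple is on the table.",
    ["apple", "table"],
    ["take apple", "put apple on table"],
)
```

Both commands mention “apple” first, so they form a single ACGE point conditioned on that entity; the model must still produce “table” inside the second command.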
3.2 Command generation
In the following sections, we denote tokenized input words from the context sequence as $x$ and embedded tokens as $e$, and use a subscript ($e$ for encoder, $d$ for decoder, etc.) to denote where a representation comes from; $h$ represents hidden states, $s$ represents session states, and $y$ denotes output tokens. We use superscripts to represent time steps; the absence of a superscript represents multiple time steps. We represent concatenation with angled brackets $\langle \cdot, \cdot \rangle$, linear transformations with $W$, and linear transformations followed by a non-linear activation function with $f$. A subscript on these transformations (e.g., $W_a$) represents a transformation with different parameters.
3.2.1 Context encoding
Given a context of length $n$, we have the input sequence $x = (x^1, \dots, x^n)$, which we embed using GloVe (pennington2014glove) vectors to produce $e = (e^1, \dots, e^n)$. We feed $e$ into a bidirectional RNN (cho2014nmt; schuster1997brnn) to retrieve forward ($\overrightarrow{h}$) and backward ($\overleftarrow{h}$) encodings of the source sequence:

$\overrightarrow{h}^t = \overrightarrow{\mathrm{RNN}}(e^t, \overrightarrow{h}^{t-1}), \qquad \overleftarrow{h}^t = \overleftarrow{\mathrm{RNN}}(e^t, \overleftarrow{h}^{t+1})$
We concatenate the two to get the resulting encoded sequence $h_e = \langle \overrightarrow{h}, \overleftarrow{h} \rangle$. The next step depends on whether we condition on an entity or not. Given an entity (which is also a sequence of word tokens), we find the indices $(i_1, \dots, i_k)$ where the entity's words appear in the context. We take the context encodings at those indices and use them as input to a GRU, where $h_{ent}^0 = 0$:

$h_{ent}^j = \mathrm{GRU}(h_e^{i_j}, h_{ent}^{j-1})$

We use the final hidden state of this entity RNN as the entity encoding, which we label $h_{ent}$.
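The encoding stage can be sketched in a few lines. This is a minimal illustration with a plain tanh RNN cell standing in for the GRU; all dimensions, weights, and the toy sentence are illustrative, not the paper's configuration.

```python
import numpy as np

# Minimal sketch of the context encoder: a bidirectional RNN over embedded
# tokens, plus an entity encoding read off the positions where the entity's
# words occur. A simple tanh cell stands in for the GRU used in the paper.

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 6

def rnn(seq, W_x, W_h):
    """Run a simple tanh RNN over a sequence of vectors; return all states."""
    h = np.zeros(d_hid)
    outs = []
    for x in seq:
        h = np.tanh(W_x @ x + W_h @ h)
        outs.append(h)
    return np.stack(outs)

tokens = ["you", "see", "a", "red", "apple"]
emb = {t: rng.normal(size=d_emb) for t in tokens}  # toy embeddings
e = [emb[t] for t in tokens]

W = {k: rng.normal(size=(d_hid, d_emb)) for k in ("fx", "bx")}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in ("fh", "bh")}

fwd = rnn(e, W["fx"], U["fh"])              # forward encodings
bwd = rnn(e[::-1], W["bx"], U["bh"])[::-1]  # backward encodings, re-aligned
h_ctx = np.concatenate([fwd, bwd], axis=1)  # (n, 2*d_hid) context encodings

# Entity encoding: run another RNN over the context encodings at the
# indices where the entity's words appear (here, "red apple").
idx = [tokens.index(w) for w in ["red", "apple"]]
W_e = rng.normal(size=(d_hid, 2 * d_hid))
U_e = rng.normal(size=(d_hid, d_hid))
h_ent = rnn(h_ctx[idx], W_e, U_e)[-1]       # final hidden state = entity encoding
```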
3.2.2 Attentive decoding and Pointer Softmax
The decoder is also a recurrent model: at every time step it takes in the context encodings $h_e$ and the generated entity encoding $h_{ent}$, and produces a probability distribution over the next token. This next token can come from one of two sources: either a word in the context or a word in our shortlist vocabulary. In this case, our shortlist vocabulary is simply the entire vocabulary (all 887 unique words in the dataset). The first part of the decoder is an RNN that takes the embedding of the previous output and the previous decoder hidden state to produce the first hidden state:

$h_1^t = \mathrm{RNN}(e(y^{t-1}), h_d^{t-1})$
Next, we concatenate this hidden state with the entity representation to produce the query $\langle h_1^t, h_{ent} \rangle$. We use this as the query to an attention mechanism (bahadanau2014attention), which generates annotations from this query and a value (in this case, the context encodings $h_e$). We generate these annotations with a two-layer feed-forward network (FFN), defining a distribution $\alpha^t$ over the context encodings. The context vector $c^t$ is then computed as the weighted sum of the context encodings:

$c^t = \sum_{i=1}^{n} \alpha_i^t \, h_e^i$
We now use the annotations $\alpha^t$ as the distribution over the context sequence. We take the context vector $c^t$ and, together with the first RNN hidden state $h_1^t$, use it as input to a second RNN:

$h_2^t = \mathrm{RNN}(c^t, h_1^t)$
We use this hidden state as the previous hidden state for the next time step ($h_d^t = h_2^t$). We also apply dropout on the output of this RNN for regularization purposes. We then use the concatenation $\langle h_2^t, c^t \rangle$ as input to both the shortlist FFN and the switch FFN to generate the shortlist distribution and the switch probability, respectively:

$p_{short}^t = \mathrm{softmax}\left(f_{short}(\langle h_2^t, c^t \rangle)\right), \qquad s^t = \sigma\left(f_{switch}(\langle h_2^t, c^t \rangle)\right)$
We generate output tokens from a combined distribution over the context words ($\alpha^t$) and the shortlist words, with a switch that interpolates between the two distributions, as per the Pointer Softmax (gulcehre2016pointersoftmax) decoder framework.
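The interpolation step can be sketched numerically. All scores below are made up for illustration; the point is only that the switch probability mixes the two distributions into one valid distribution over (shortlist words + context positions).

```python
import numpy as np

# Sketch of the pointer-softmax output distribution: a switch probability s
# interpolates between the attention distribution over context words and the
# softmax over the shortlist vocabulary. Scores are illustrative.

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

context_words = ["open", "the", "wooden", "door"]
shortlist = ["go", "take", "open", "east", "<eos>"]

attn_scores = np.array([0.2, -1.0, 2.5, 3.0])        # pointer logits
short_scores = np.array([0.1, 0.3, 1.2, -0.5, 0.0])  # shortlist logits

p_ptr = softmax(attn_scores)     # distribution over context positions
p_short = softmax(short_scores)  # distribution over shortlist words
s = 0.3                          # switch: probability of using the shortlist

# Combined distribution over (shortlist words + context positions).
combined = np.concatenate([s * p_short, (1 - s) * p_ptr])
total = combined.sum()
```

Because each component distribution sums to one, the interpolated distribution also sums to one; here the most likely pointed-to word is “door”.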
3.2.3 Hierarchical session encoding
We adopt the framework of the hierarchical recurrent encoder-decoder (sordoni2015hred) as one solution to alleviate the problem of generating multiple phrases per context (Figure 2). We place the session-level RNN in between the encoder and decoder in order to condition on and summarize the previously decoded phrases. The session-level RNN takes as input a sequence of query representations $q^1, \dots, q^M$. We let $q^1 = 0$; each subsequent $q^m$ is the final decoder hidden state of the previous phrase, as per Figure 2. The session-level state $s^m$ is then used as the initial hidden state of the decoder.
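The session-level recurrence can be sketched as a loop. The tanh cell and the placeholder decoder below are illustrative stand-ins for the actual HRED components; only the wiring (decoder final state feeds the next session step, session state initializes the next decoder) reflects the description above.

```python
import numpy as np

# Sketch of the hierarchical session recurrence: after each decoded command,
# the decoder's final hidden state becomes the next query q_m, the session
# RNN updates s_m, and s_m initializes the next decoder.

rng = np.random.default_rng(1)
d = 4
W_q = rng.normal(size=(d, d))
W_s = rng.normal(size=(d, d))

def session_step(s_prev, q):
    """One step of the session-level RNN (simple tanh cell)."""
    return np.tanh(W_q @ q + W_s @ s_prev)

def decode_command(init_hidden):
    """Placeholder decoder: returns its final hidden state."""
    return np.tanh(init_hidden + 0.1)

q = np.zeros(d)  # q^1 = 0
s = np.zeros(d)
states = []
for m in range(3):         # generate three commands
    s = session_step(s, q)
    q = decode_command(s)  # decoder initialized with s; final state is next query
    states.append(s)
```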
3.2.4 Learning with command generation
We employ a cross-entropy loss for all the learning objectives. The first model architecture uses the context encoder connected with a pointer-softmax decoder on single target commands (we label this as PS + BS($k$, $b$)). During inference, we use the top $k$ out of $b$ beams to produce commands. With $y^t$ as the token produced at time step $t$, we try to maximize the following log-likelihood:

$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y^t \mid y^{<t}, x)$
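The PS + BS inference step described above reduces to keeping the $k$ highest-scoring beams. A minimal sketch, with made-up commands and log-probabilities:

```python
# Sketch of PS + BS inference: beam search produces b candidate command
# sequences with log-probability scores, and the k best are kept as the
# predicted command set. Commands and scores are illustrative.

def top_k_commands(beams, k):
    """beams: list of (command, log_prob) pairs from a width-b beam search."""
    ranked = sorted(beams, key=lambda cb: cb[1], reverse=True)
    return [cmd for cmd, _ in ranked[:k]]

beams = [("go east", -0.2), ("open box", -0.9),
         ("go south", -0.4), ("take bug", -2.1)]
cmds = top_k_commands(beams, k=3)
```

Note that $k$ is fixed regardless of how many commands are actually admissible in the context, which is the limitation discussed in the results.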
The second model applies hierarchical decoding, where we encode our context sequence as above and have a session state run through all the pointer-softmax decoder steps (we label this as HRED + PS). We use the same objective function as sordoni2015hred over the parameters of all the RNNs in the model. We let $Y = (Y_1, \dots, Y_M)$ be all generated phrases given a context, where $Y_m$ represents the phrase generated at session time step $m$, so the objective is to maximize the following log-likelihood:

$\log p(Y \mid x) = \sum_{m=1}^{M} \log p(Y_m \mid Y_{<m}, x)$
The final model uses the same architecture as yuan2018keyphrase; we train on the concatenated target commands delineated by separator tokens (we label this as PS + Cat). In this case, the objective is the single-sequence log-likelihood above, applied to the concatenation of all target commands.
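The concatenated-target format used by PS + Cat can be sketched as a simple join/split round trip. The token names `<sep>` and `<eos>` are illustrative placeholders, not necessarily the tokens used in the paper:

```python
# Sketch of the PS + Cat target format: all admissible commands for a
# context are joined into one target sequence, delineated by a separator
# token, and split back into a command list at evaluation time.

SEP, EOS = "<sep>", "<eos>"

def to_target(commands):
    return f" {SEP} ".join(commands) + f" {EOS}"

def from_target(seq):
    return [c.strip() for c in seq.replace(EOS, "").split(SEP) if c.strip()]

target = to_target(["go east", "open type p box"])
recovered = from_target(target)
```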
4 Results and Discussion
| Dataset | Model | Precision | Recall | F1 score | Unseen recall | Seen recall |
|---|---|---|---|---|---|---|
| ACG | PS + BS(11, 30) | 26.6 | 53.3 | 35.5 | 12.0 | 54.8 |
| | HRED + PS | 94.1 | 84.7 | 89.2 | 48.4 | 86.0 |
| | PS + Cat | 98.4 | 94.7 | 96.5 | 83.0 | 95.1 |
| ACGE | PS + BS(3, 10) | 20.1 | 93.0 | 33.0 | 80.9 | 93.5 |
| | HRED + PS | 96.8 | 91.7 | 94.2 | 59.7 | 92.8 |
| | PS + Cat | 98.9 | 96.3 | 97.6 | 76.7 | 96.9 |
The empirical results (Table 2) and qualitative results (Appendix A.1) show the ability of our best model to generate valid unseen commands and achieve F1 scores of 96.5 and 97.6 on ACG and ACGE, respectively. The hierarchical and concatenation models outperform the Pointer-Softmax with Beam Search by a wide margin, largely due to the over-generation of PS + BS and the mismatch between the fixed number of generated commands and the actual number of target commands (as seen in Appendix B.1). We hypothesize that PS + Cat outperforms the HRED model due to the gating mechanism between each session state. Conditioning on different queries gives HRED the ability to prevent gradients from flowing through to the next session. We can see the detriment of this gating by comparing their F1 scores. We hypothesize that, as we only have a single query from our encoded context (and hence no "noisy" queries (sordoni2015hred)), the gating mechanism hinders the model by "filtering" certain queries. We also observe a noticeable gap between the performances on ACG and ACGE, as expected. In the ACGE case, the models are more constrained by conditioning information. This narrows the scope of generation: our models generate shorter sequences on average (as shown in Table 1), which decreases the likelihood of generating missing or extra commands, as shown in Appendix A.3.
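The metrics above compare generated commands against the ground-truth admissible commands. A minimal sketch of set-level precision/recall/F1 (our reading of how such scores are computed; the paper's exact evaluation script may differ):

```python
# Sketch of set-level precision, recall, and F1 over command sets.

def prf(generated, truth):
    gen, ref = set(generated), set(truth)
    tp = len(gen & ref)                      # commands both generated and admissible
    p = tp / len(gen) if gen else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf(["go east", "go south", "open box"],
               ["go east", "open box", "take apple"])
```

Here two of three generated commands are admissible and two of three admissible commands are generated, so precision, recall, and F1 all equal 2/3.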
Experiments with models initialized without pre-trained GloVe embeddings were also conducted on both the ACG and ACGE datasets, but resulted in an almost negligible change in the model's F1 score. We postulate this is due to the mismatch between the objective GloVe is trained on and the entity relations required in our environment.
Interestingly, the generative models are able to generate a large portion of the valid commands that are unseen during training. The added diversity from beam search seems to help in producing unseen examples, but only when the number of targets for a training instance is close to the number of targets we generate during inference, as seen in Table 2. A large beam width is able to generate more unseen actions because of how beam search over-generates actions.
In this work, we explored three different approaches to generating sets of context-dependent text commands. We tested them on TextWorld ACG and ACGE, two new datasets built using TextWorld. Given these encouraging results, our next step is to combine command generation with a control policy in order to play (and solve) text-based games. While the performance of the command generation is good (on TextWorld games), using it as a fixed generator would set an upper bound on the performance of the control policy (i.e., commands mandatory for the game's progression might never be generated in the first place). Instead, our next goal is to develop a control policy that can use the generator and fine-tune it to produce more relevant commands.
Special thanks to Kaheer Suleman for his help and guidance in model architectures.
Appendix A Full Results
a.1 Qualitative results from generation
| | |
|---|---|
| Context | -= attic =- you ’ve entered an attic . you see a closed type p box . oh wow ! is that what i think it is ? it is ! it ’s a workbench . you see a type p keycard and a bug on the workbench . hmmm … what else , what else ? there is an unblocked exit to the east . you do n’t like doors ? why not try going south , that entranceway is unblocked . you are carrying nothing . |
| PS + BS | go bug; go east; go south; go type; open bug; open east; open type; open type p; open type p box; open type p keycard; take bug; take bug p keycard from; take east; take south; take type; take type p; take type p box; take type p keycard; take type p keycard from |
| HRED + PS | go east; go south; open type p box; take type p keycard from workbench |
| PS + Cat | go east; go south; open type p box; take bug from workbench; take type p keycard from workbench |
| Ground Truth | go east; go south; open type p box; take bug from workbench; take type p keycard from workbench |
a.2 Full empirical results
| Dataset | Model | Precision (w/o GloVe) | Precision (w/ GloVe) | Recall (w/o GloVe) | Recall (w/ GloVe) | F1 (w/o GloVe) | F1 (w/ GloVe) |
|---|---|---|---|---|---|---|---|
| ACG | PS + BS | 26.5 | 26.6 | 54.1 | 53.3 | 35.6 | 35.5 |
| | HRED + PS | 94.6 | 94.1 | 85.6 | 84.7 | 89.9 | 89.2 |
| | PS + Cat | - | 98.4 | - | 94.7 | - | 96.5 |
| ACGE | PS + BS | 19.9 | 20.1 | 93.0 | 93.0 | 32.7 | 33.0 |
| | HRED + PS | 96.8 | 96.8 | 91.9 | 91.7 | 94.3 | 94.2 |
| | PS + Cat | - | 98.9 | - | 96.3 | - | 97.6 |