In a text-based game, also called interactive fiction (IF), an agent interacts with its environment through a natural language interface. Actions consist of short textual commands, while observations are paragraphs describing the outcome of these actions (Figure 1). Recently, interactive fiction has emerged as an important challenge for AI techniques (Atkinson et al. 2018), in great part because the genre combines natural language with sequential decision-making.
From a reinforcement learning perspective, IF domains pose a number of challenges. First, the state space is typically combinatorial in nature, due to the presence of objects and characters with which the player can interact. Since any natural language sentence may be given as a valid command, the action space is similarly combinatorial. The player observes its environment through feedback in natural language, making this a partially observable problem. The reward structure is usually sparse, with non-zero rewards only received when the agent accomplishes something meaningful, such as retrieving an important object or unlocking a new part of the domain.
We are particularly interested in bringing deep reinforcement learning techniques to bear on this problem. In this paper, we consider how to design an agent architecture that can learn to play text adventure games from feedback alone. Despite the inherent challenges of the domain, we identify three structural aspects that make progress possible:
Rewards from subtasks. The optimal behaviour completes a series of subtasks towards the eventual game end;
Transition structure. Most actions have no effect in a given state;
Memory as state. Remembering key past events is often sufficient to deal with partial observability.
While these properties have been remarked on in previous work [4, 8], here we relax some of the assumptions previously made and provide fresh tools to more tractably solve IF domains. More generally, we believe these tools to be useful in partially observable domains with similar structure.
Our first contribution takes advantage of the special reward structure of IF domains. In IF, the accumulated reward within an episode correlates with the number of completed subtasks and provides a good proxy for an agent’s progress. Our score contextualisation architecture makes use of this fact by defining a piecewise value function composed of different deep network heads, where each piece corresponds to a particular level of cumulative reward. This separation allows the network to learn separate value functions for different portions of the complete task; in particular, when the problem is linear (i.e., there is a fixed ordering in which subtasks must be completed), our method can be used to learn a separate value function for each subtask.
Our second contribution extends the work of Zahavy et al. (2018) on action elimination. We make exploration and action selection more tractable by determining which actions are admissible in the current state. Formally, we say that an action is admissible if it leads to a change in the underlying game state. While the set of available actions is typically large in IF domains, there are usually few commands that are actually admissible in any particular context. Since the state is not directly observable, we first learn an LSTM-based auxiliary classifier that predicts which actions are admissible given the agent’s history of recent feedback. We use the predicted probability of an action being admissible to modulate or gate which actions are available to the agent at each time step. We propose and compare three simple modulation methods: masking, dropout, and finally consistent Q-learning (Bellemare et al. 2016). Compared to the algorithm of Zahavy et al. (2018), our techniques are simpler in spirit and can be learned from feedback alone.
We show the effectiveness of our methods on a suite of seven IF problems of increasing difficulty generated using the TextWorld platform (Côté et al. 2018). We find that combining the score contextualisation approach with an otherwise standard recurrent deep RL architecture leads to faster learning than when using a single value function. Furthermore, our action gating mechanism enables the learning agent to progress on the harder levels of our suite of problems.
2 Problem Setting
We represent an interactive fiction environment as a partially observable Markov decision process (POMDP) with deterministic observations. This POMDP is summarized by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \psi, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ is the transition function, $r$ is the reward function, and $\gamma$ is the discount factor. The function $\psi$ describes the observation $o = \psi(s, a, s')$ provided to the agent when action $a$ is taken in state $s$ and leads to state $s'$.
Throughout we will make use of standard notions from reinforcement learning (Sutton and Barto 1998) as adapted to the POMDP literature [3, 6]. At time step $t$, the agent selects an action according to a policy $\pi$ which maps a history $h_t$ to a distribution over actions, denoted $\pi(\cdot \mid h_t)$. This history is a sequence of observations and actions which, from the agent’s perspective, replaces the unobserved environment state $s_t$. We denote by $B(s \mid h_t)$ the probability or belief of being in state $s$ after observing $h_t$. Finally, we will find it convenient to rely on time indices to indicate the relationship between a history and its successor, and denote by $h_{t+1}$ the history resulting from taking action $a_t$ in $h_t$ and observing $o_{t+1}$ as emitted by the hidden state $s_{t+1}$.
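The action-value function $Q^\pi$ describes the expected discounted sum of rewards when choosing action $a$ after observing history $h_t$, and subsequently following policy $\pi$:
$$Q^\pi(h_t, a) = \mathbb{E}\Big[ \sum_{i \ge 0} \gamma^i r(s_{t+i}, a_{t+i}) \,\Big|\, h_t, a_t = a \Big],$$
where we assume that the action at time $t + i$, $i \ge 1$, is drawn from $\pi(\cdot \mid h_{t+i})$; note that the reward depends on the sequence of hidden states $s_t, s_{t+1}, \dots$ implied by the belief state $B(\cdot \mid h_t)$. The action-value function satisfies the Bellman equation over histories
$$Q^\pi(h_t, a) = \mathbb{E}_{s_t, s_{t+1}}\Big[ r(s_t, a) + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid h_{t+1})\, Q^\pi(h_{t+1}, a') \Big].$$
When the state is observed at each step ($o_t = s_t$), this simplifies to the usual Bellman equation for Markov decision processes:
$$Q^\pi(s_t, a) = r(s_t, a) + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s_{t+1})\, Q^\pi(s_{t+1}, a'). \tag{1}$$
In the fully observable case we will conflate $s_t$ and $h_t$.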
The Q-learning algorithm (Watkins 1989) over histories maintains an approximate action-value function $Q$ which is updated from samples $(h_t, a_t, r_t, h_{t+1})$ using a step-size parameter $\alpha$:
$$Q(h_t, a_t) \leftarrow Q(h_t, a_t) + \alpha \delta_t, \qquad \delta_t := r_t + \gamma \max_{a \in \mathcal{A}} Q(h_{t+1}, a) - Q(h_t, a_t). \tag{2}$$
Q-learning is used to estimate the optimal action-value function $Q^*$ attained by a policy $\pi^*$ which maximizes $Q^\pi(h, a)$ for all histories $h$ and actions $a$. In the context of our work, we will assume that this policy exists. Storing this action-value function in a lookup table is impractical, as there are in general an exponential number of histories to consider. Instead, we use recurrent neural networks to approximate the Q-learning process.
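To make the history-based update concrete, here is a minimal tabular sketch of one Q-learning step where the table is keyed by the whole history rather than by a state. The function name `q_learning_update`, the tuple encoding of histories, and the values of `alpha` and `gamma` are illustrative assumptions, not the agent used in the paper (which replaces the table with a recurrent network).

```python
from collections import defaultdict


def q_learning_update(q, history, action, reward, next_history, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one Q-learning step on a table keyed by (history, action)."""
    best_next = max(q[(next_history, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(history, action)]
    q[(history, action)] += alpha * td_error
    return td_error


# Histories are tuples that alternate observations and actions (an assumption).
actions = ["go north", "take key"]
h0 = ("You are in a kitchen. There is a key here.",)
h1 = h0 + ("take key", "You pick up the key.")

q = defaultdict(float)  # unseen (history, action) pairs default to 0
delta = q_learning_update(q, h0, "take key", 1.0, h1, actions)
```

Because every distinct history is its own key, the table grows with the number of histories, which is the impracticality the paragraph above points out.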
2.1 Consistent Q-Learning
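Consistent Q-learning (Bellemare et al. 2016) learns a value function which is consistent with respect to a local form of policy stationarity. Defined for a Markov decision process, it replaces the term $\delta_t$ in (2) by
$$\delta_t^{\mathrm{CQL}} := r_t + \begin{cases} \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) & \text{if } s_{t+1} \neq s_t, \\ (\gamma - 1)\, Q(s_t, a_t) & \text{if } s_{t+1} = s_t. \end{cases} \tag{3}$$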
Consistent Q-learning can be shown to decrease the action-value of suboptimal actions while maintaining the action-value of the optimal action, leading to larger action gaps and a potentially easier value estimation problem.
Observe that consistent Q-learning is not immediately adaptable to the history-based formulation, since $h_{t+1}$ and $h_t$ are sequences of different lengths (and therefore not comparable). One of our contributions in this paper is to derive a related algorithm suited to the history-based setting.
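As a small illustration of the state-based rule, the sketch below contrasts the consistent temporal-difference error with the standard one on a transition that leaves the state unchanged. It follows the consistent Bellman operator as commonly stated (bootstrap from the same state-action pair on a self-transition); the dictionary-based table and the function names are assumptions made for the example.

```python
def standard_td_error(q, s, a, r, s_next, actions, gamma=0.9):
    """Usual Q-learning error: always bootstrap from the best next action."""
    bootstrap = max(q.get((s_next, b), 0.0) for b in actions)
    return r + gamma * bootstrap - q.get((s, a), 0.0)


def consistent_td_error(q, s, a, r, s_next, actions, gamma=0.9):
    """Consistent Q-learning error: on a self-transition, bootstrap from Q(s, a)."""
    current = q.get((s, a), 0.0)
    if s_next != s:
        bootstrap = max(q.get((s_next, b), 0.0) for b in actions)
    else:
        bootstrap = current
    return r + gamma * bootstrap - current


actions = ["wait", "open door"]
q = {("hall", "wait"): 1.0, ("hall", "open door"): 2.0}

# "wait" leaves the agent in "hall" and yields no reward.
std = standard_td_error(q, "hall", "wait", 0.0, "hall", actions)
cql = consistent_td_error(q, "hall", "wait", 0.0, "hall", actions)
```

In this toy case the standard error is positive (the value of "wait" is pulled up by "open door"), while the consistent error is negative, which lowers the value of the action that changes nothing and widens the action gap.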
2.2 Admissible Actions
We will make use of the notion of an admissible action, following terminology by Zahavy et al. (2018).¹
¹ Note that our definition technically differs from that of Zahavy et al. (2018), who define an admissible action as one that is not ruled out by the learning algorithm.
An action $a$ is admissible in state $s$ if
$$P(s \mid s, a) < 1.$$
That is, $a$ is admissible in $s$ if its application may result in a change in the environment state. When $P(s \mid s, a) = 1$, we say that an action is inadmissible.
We extend the notion of admissibility to histories as follows. We say that an action $a$ is admissible given a history $h$ if it is admissible in some state that is possible given $h$, or equivalently:
$$\sum_{s \in \mathcal{S}} B(s \mid h)\, P(s \mid s, a) < 1.$$
We denote by $\mathcal{A}(s)$ the set of admissible actions in state $s$. Abusing notation, we define the admissibility function
$$\xi(s, a) := \mathbb{I}[a \in \mathcal{A}(s)], \qquad \xi(h, a) := \sum_{s \in \mathcal{S}} B(s \mid h)\, \xi(s, a).$$
We write $\mathcal{A}(h)$ for the set of admissible actions given history $h$, i.e. the actions whose admissibility in $h$ is strictly greater than zero. In IF domains, inadmissible actions are usually dominated, and we will deprioritize or altogether rule them out based on our estimate of $\xi(h, a)$.
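The following sketch computes the admissibility of an action given a history as the belief mass on states in which the action can change the state. The transition table `P`, the belief dictionary, and both function names are hypothetical and exist only to illustrate the definitions above on a two-state example.

```python
def is_admissible_in_state(P, s, a):
    """An action is admissible in s if it does not self-loop with probability 1."""
    return P[(s, a)].get(s, 0.0) < 1.0


def history_admissibility(belief, P, a):
    """Belief-weighted admissibility: total probability of states where a is admissible."""
    return sum(prob for s, prob in belief.items()
               if is_admissible_in_state(P, s, a))


# Toy transition model: P[(state, action)] maps next states to probabilities.
P = {
    ("locked", "open door"): {"locked": 1.0},   # nothing happens
    ("unlocked", "open door"): {"open": 1.0},   # the door opens
}

# The history leaves the agent unsure whether the door is locked.
belief = {"locked": 0.75, "unlocked": 0.25}
xi = history_admissibility(belief, P, "open door")
```

Here the admissibility is 0.25, which is strictly positive, so "open door" belongs to the admissible set for this history even though it is inadmissible in the more likely state.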
3 More Efficient Learning for IF Domains
We are interested in learning an action-value function which is close to optimal and from which can be derived a near-optimal policy. We would also like learning to proceed in a sample-efficient manner. In the context of IF domains, this is hindered by both the partially observable nature of the environment and the size of the action space. In this paper we propose two complementary ideas that alleviate some of the issues caused by partial observability and large action sets. The first idea contextualizes the action-value function on a surrogate notion of progress based on total reward so far, while the second seeks to eliminate inadmissible actions from the exploration and learning process.
Although our ideas are broadly applicable, for concreteness we describe their implementation in a deep reinforcement learning framework. Our agent architecture (Figure 2) is derived from the LSTM-DRQN agent (Yuan et al. 2018) and the work of Narasimhan et al. (2015).
3.1 Score Contextualisation
In applying reinforcement learning to games, it is by now customary to translate the player’s score differential into rewards [1, 5]. Our setting is similar to the Arcade Learning Environment in the sense that the environment provides the score. In IF, the player is awarded points for acquiring an important object, or completing some task relevant to progressing through the game. These awards occur in a linear, or almost linear structure, reflecting the agent’s progression through the story, and are relatively sparse. We emphasize that this is in contrast to the more general reinforcement learning setting, which may provide reward for surviving, or achieving something at a certain rate. In the video game Space Invaders, for example, the notion of “finishing the game” is ill-defined: the player’s objective is to keep increasing their score until they run out of lives.
We make use of the IF reward structure as follows. We call score the agent’s total (undiscounted) reward since the beginning of an episode, remarking that the term extends beyond game-like domains. At time step $t$, the score is
$$z_t := \sum_{i < t} r_i,$$
where $r_i$ is the reward received at step $i$ of the episode.
In IF domains, where the score reflects the agent’s progress, it is reasonable to treat it as a state variable. We propose maintaining a separate action-value function for each possible score. This action-value function is denoted $Q(h_t, a_t, z_t)$. We call this approach score contextualisation. The use of additional context variables has by now been demonstrated in a number of settings (DBLP:journals/corr/abs-1903-08254; Icarte et al. 2018; DBLP:journals/corr/abs-1711-09874). In our setting it has two benefits. First, credit assignment becomes easier since the score provides clues as to the hidden state. Second, in settings with function approximation we expect optimization to be simpler since for each $z$, the function $Q(\cdot, \cdot, z)$ needs only be trained on a subset of the data, and hence can focus on features relevant to this part of the environment.
In a deep network, we implement score contextualisation using $K$ network heads and a map $J$ from scores to $\{1, \dots, K\}$ such that the $J(z_t)$-th head is used when the agent has received a score of $z_t$ at time $t$. This provides the flexibility to either map each score to a separate network head, or multiple scores to one head. Taking $K = 1$ uses one monolithic network for all subtasks, and fully relies on this network to identify state from feedback. In our experiments, we assign scores to network heads using a round-robin scheme with a fixed $K$. Using the terminology of Narasimhan et al. (2015), our architecture consists of a shared representation generator with $K$ independent LSTM heads, followed by a feed-forward action scorer which outputs the action-values (Figure 2).
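One way to picture the score-to-head map is sketched below. It is only an interpretation of "round-robin": each newly encountered score is assigned the next head index, wrapping around after the last head. The class name `RoundRobinHeadMap` and the zero-based head indices are assumptions for illustration; the paper does not spell out the exact assignment rule.

```python
class RoundRobinHeadMap:
    """Assign each distinct score to a head index, cycling through the heads."""

    def __init__(self, num_heads):
        self.num_heads = num_heads
        self._head_of = {}  # score -> head index, filled in order of first appearance

    def __call__(self, score):
        if score not in self._head_of:
            self._head_of[score] = len(self._head_of) % self.num_heads
        return self._head_of[score]


head_map = RoundRobinHeadMap(num_heads=3)
scores_seen = [0, 0, 5, 10, 15, 5]  # cumulative score at successive time steps
heads = [head_map(z) for z in scores_seen]
```

With three heads, the fourth distinct score (15) shares a head with the first (0), which shows how several scores can map to one head when there are more score levels than heads. With a single head the map is constant and the architecture reduces to one monolithic network.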
3.2 Action Gating Based on Admissibility
In this section we revisit the idea of using the admissibility function to eliminate or more generally gate actions. Consider an action $a$ which is inadmissible in state $s$. By definition, taking this action does not affect the state. We further assume that inadmissible actions produce a constant level of reward, which we take to be 0 without loss of generality:
$$a \text{ inadmissible in } s \implies P(s \mid s, a) = 1, \quad r(s, a) = 0.$$
This assumption is reasonable in IF domains, and more generally holds true in domains that exhibit subtask structure, such as the video game Montezuma’s Revenge (NIPS2016_6383). We can combine knowledge of $P$ and $r$ for inadmissible actions with Bellman’s equation (1) to deduce that for any policy $\pi$,
$$Q^\pi(s, a) = \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s)\, Q^\pi(s, a'). \tag{4}$$
If we know that $a$ is inadmissible, then we do not need to learn its action-value.
We propose learning a classifier whose purpose is to predict the admissibility function. Given a history $h$, this classifier outputs, for each action $a$, the probability $\hat\xi(h, a)$ that this action is admissible. Because of state aliasing, this probability is in general strictly between 0 and 1; furthermore, it may be inaccurate due to approximation error. We therefore consider action gating schemes that are sensitive to intermediate values of $\hat\xi(h, a)$. The first two schemes produce an approximately admissible set $\hat{\mathcal{A}}_t$ which varies from time step to time step; the third directly uses the definition of admissibility in a history-based implementation of the consistent Bellman operator.
Dropout. The dropout method randomly adds each action $a$ to $\hat{\mathcal{A}}_t$ with probability $\hat\xi(h_t, a)$.
Masking. The masking method uses an elimination threshold $c$. The set $\hat{\mathcal{A}}_t$ contains all actions whose estimated admissibility is at least $c$:
$$\hat{\mathcal{A}}_t := \{ a \in \mathcal{A} : \hat\xi(h_t, a) \ge c \}.$$
The masking method is a simplified version of the action elimination algorithm of Zahavy et al. (2018), whose threshold is adaptively determined from a confidence interval, itself derived from assuming a value function and admissibility functions that can be expressed linearly in terms of some feature vector.
In both the dropout and masking methods, we use the action set $\hat{\mathcal{A}}_t$ in lieu of the full action set when selecting exploratory actions.
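The two set-based gating schemes can be sketched in a few lines. The helper names, the example probabilities, and the fallback to the full action set when the gated set is empty are assumptions added for the example; the sketch also applies the gated set to the greedy choice, which is one possible design rather than a statement of the paper's implementation.

```python
import random


def masking_set(xi_hat, threshold):
    """Keep actions whose estimated admissibility is at least the threshold."""
    return [a for a, p in xi_hat.items() if p >= threshold]


def dropout_set(xi_hat, rng):
    """Keep each action independently with probability equal to its admissibility."""
    return [a for a, p in xi_hat.items() if rng.random() < p]


def epsilon_greedy(q_values, allowed, epsilon, rng):
    """Pick an action from the gated set; fall back to all actions if it is empty."""
    if not allowed:
        allowed = list(q_values)  # assumption: never leave the agent without a choice
    if rng.random() < epsilon:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_values[a])


rng = random.Random(0)
xi_hat = {"take apple": 0.9, "eat wall": 0.05, "go north": 0.6}
q_values = {"take apple": 0.2, "eat wall": 5.0, "go north": 1.0}

masked = masking_set(xi_hat, threshold=0.5)
chosen = epsilon_greedy(q_values, masked, epsilon=0.0, rng=rng)
```

In this toy case the overestimated but probably inadmissible action "eat wall" is never selected, because it does not survive the threshold.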
Consistent Q-learning for histories (CQLH). The third method leaves the action set unchanged, but instead drives the action-values of purportedly inadmissible actions to 0. This is done by adapting the consistent Bellman operator (3) to the history-based setting. First, we replace the indicator $\mathbb{I}[s_{t+1} \neq s_t]$ by the probability $\hat\xi(h_t, a_t)$. Second, we drive $Q(h_t, a_t)$ to $0$ in the case when we believe the state is unchanged, following the argumentation of (4). This yields a version of consistent Q-learning which is adapted to histories, and makes use of the predicted admissibility:
$$\delta_t^{\mathrm{CQLH}} := r_t + \gamma \max_{a \in \mathcal{A}} Q(h_{t+1}, a)\, \hat\xi(h_t, a_t) - Q(h_t, a_t). \tag{5}$$
One may ask whether this method is equivalent to a belief-state average of consistent Q-learning when $\hat\xi$ is accurate, i.e. equals $\xi$. In general, this is not the case: the admissibility of an action depends on the hidden state, which in turn influences the action-value at the next step. As a result, the above method may underestimate action-values when there is state aliasing (e.g., ), and yields smaller action gaps than the state-based version when . However, when $a_t$ is known to be inadmissible ($\xi(h_t, a_t) = 0$), the methods do coincide, justifying its use as an action gating scheme.
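The sketch below shows one possible reading of the two modifications described for CQLH: the bootstrap term is weighted by the predicted admissibility, and the "state unchanged" case contributes a target of zero. This is an interpretation under stated assumptions, and the published update may weight its terms differently; the function name and the dictionary-based value tables are hypothetical.

```python
def cqlh_td_error(q_t, q_next, action, reward, xi_hat, gamma=0.9):
    """History-based consistent TD error (one reading of the description).

    q_t    : action-values at the current history
    q_next : action-values at the successor history
    xi_hat : predicted probability that `action` is admissible
    """
    bootstrap = xi_hat * max(q_next.values())  # unchanged-state case contributes 0
    return reward + gamma * bootstrap - q_t[action]


q_t = {"eat wall": 0.5, "go north": 2.0}
q_next = {"eat wall": 0.5, "go north": 2.0}

# Predicted inadmissible: the error pushes the action-value toward 0.
delta_inadmissible = cqlh_td_error(q_t, q_next, "eat wall", 0.0, xi_hat=0.0)

# Predicted admissible: the error reduces to the usual Q-learning error.
delta_admissible = cqlh_td_error(q_t, q_next, "eat wall", 0.0, xi_hat=1.0)
```

The two extreme cases bracket the behaviour: intermediate values of the predicted admissibility interpolate between shrinking the value toward zero and a standard backup.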
We implement these ideas using an auxiliary classifier $\hat\xi$. For each action $a$, this classifier outputs the estimated probability $\hat\xi(h_t, a)$, parametrized as a sigmoid function. These probabilities are learned from bandit feedback: after choosing $a_t$ from history $h_t$, the agent receives a binary signal as to whether $a_t$ was admissible or not. In our setting, learning this classifier is particularly challenging because the agent must predict admissibility solely based on the history $h_t$. As a point of comparison, using the information-gathering commands look and inventory to establish the state, as proposed by Zahavy et al. (2018), leads to a simpler learning problem, but one which does not consider the full history. The need to learn from bandit feedback also encourages methods that generalize across histories and textual descriptions.
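Learning from bandit feedback means that only the output for the chosen action receives a training signal. A minimal sketch, assuming the classifier produces one logit per action and is trained with binary cross-entropy (function names and example logits are illustrative):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bandit_bce_loss(logits, chosen, was_admissible):
    """Binary cross-entropy on the chosen action's output only."""
    p = sigmoid(logits[chosen])
    y = 1.0 if was_admissible else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bandit_bce_grad(logits, chosen, was_admissible):
    """Gradient of the loss w.r.t. the logits: zero for every unchosen action."""
    y = 1.0 if was_admissible else 0.0
    grad = [0.0] * len(logits)
    grad[chosen] = sigmoid(logits[chosen]) - y
    return grad


logits = [0.0, 2.0, -1.0]  # one logit per action for the current history
loss = bandit_bce_loss(logits, chosen=0, was_admissible=True)
grad = bandit_bce_grad(logits, chosen=0, was_admissible=True)
```

Since the other outputs receive no direct gradient from this transition, any improvement in their predictions has to come through shared parameters, which is why generalization across histories and descriptions matters here.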
4 A Synthetic IF Benchmark
Both score contextualisation and action gating are tailored to domains that exhibit the structure typical of interactive fiction. To assess how useful these methods are, we will make use of a synthetic benchmark based on the TextWorld framework (Côté et al. 2018). TextWorld provides a reinforcement learning interface to text-based games along with an environment specification language for designing new environments. Environments provide a set of locations, or rooms, objects that can be picked up and carried between locations, and a reward function based on interacting with these objects. Following the genre, special key objects are used to access parts of the environment.
Our benchmark provides seven environments of increasing complexity, which we call levels. We control complexity by adding new rooms and/or objects to each successive level. Each level also requires the agent to complete a number of subtasks (Table 1), most of which involve carrying one or more items to a particular location. Reward is provided only when the agent completes one of these subtasks. Thematically, each level involves collecting food items to make a salad, inspired by the first TextWorld competition. Example objects include an apple and a head of lettuce, while example actions include get apple and slice lettuce with knife. Accordingly we call our benchmark SaladWorld.
SaladWorld provides a graded measure of an agent architecture’s ability to deal with both partial observability and large action spaces. Indeed, completing each subtask requires memory of what has previously been accomplished, along with where different objects are. Together with this, each level in SaladWorld involves some amount of history-dependent admissibility, i.e., the admissibility of the action depends on the history rather than the state. For example, put lettuce on counter can only be accomplished once take lettuce (in a different room) has happened. Keys pose an additional difficulty as they do not themselves provide reward. As shown in Table 1, the number of possible actions rapidly increases with the number of objects in a given level. Even the small number of rooms and objects considered here precludes the use of tabular representations, as the state space for a given level is the exponentially-sized cross-product of possible object and agent locations. In fact, we have purposefully designed SaladWorld as a small challenge for IF agents, and even our best method falls short of solving the harder levels within the allotted training time. Full details are given in Table 2 in the appendix.
|Level||# Rooms||# Objects||# Sub-tasks|
|1||Following subtasks with reward and fulfilling condition:|
|2||All subtasks from previous level plus this subtask:||5, 10, 15, 20|
|3||All subtasks from level 1 plus this subtask:||5, 10, 15, 20|
|4||All subtasks from previous level plus this subtask:||5, 10, 15, 20, 25|
|5||All subtasks from previous level plus this subtask:||5, 10, 15, 20, 25, 30|
|6||All subtasks from previous level plus this subtask:||5, 10, 15, 20, 25, 30, 35|
|7||All subtasks from previous level plus this subtask:||5, 10, 15, 20, 25, 30, 35, 40|
5 Empirical Analysis
In the first set of experiments, we use SaladWorld to establish that both score contextualisation and action gating provide positive benefits in the context of IF domains. We then validate these findings on the celebrated text-based game Zork used in prior work [2, 8].
Our baseline agent is the LSTM-DRQN agent (Yuan et al. 2018) but with a different action representation. We augment this baseline with either or both score contextualisation and action gating, and observe the resulting effect on agent performance in SaladWorld. We measure this performance as the fraction of subtasks completed during an episode, averaged over time. In all cases, our results are generated from 5 independent trials of each condition. To smooth the results, we use a moving average with a window of 20,000 training steps. The graphs and the histograms report the average and standard deviation across the trials.
Score contextualisation uses network heads; the baseline corresponds to a single head ($K = 1$). Each head is trained using the Adam optimizer (kingma:adam) with a learning rate to minimize a Q-learning loss (Mnih et al. 2015) with a discount factor of . The auxiliary classifier is trained with the binary cross-entropy loss over the selected action’s admissibility (recall that our agent only observes the admissibility function for the selected action). Training is done using a balanced form of prioritized replay which we found improves baseline performance appreciably. Specifically, we use the sampling mechanism described in Hausknecht and Stone (2015) with prioritization, i.e., we sample fraction of episodes that had at least one positive reward, fraction with at least one negative reward and from whole episodic memory . Section C.1 in the Appendix compares the baseline agent with and without prioritization. For prioritization, .
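The balanced sampling scheme can be sketched as drawing fixed fractions of a batch from episodes with positive reward and from episodes with negative reward, and filling the remainder from the whole memory. The fractions `frac_pos` and `frac_neg`, the episode format, and the function name are hypothetical placeholders; the values used in the experiments are not restated here.

```python
import random


def sample_episodes(memory, batch_size, frac_pos, frac_neg, rng):
    """Sample a batch that over-represents episodes containing non-zero rewards."""
    pos = [ep for ep in memory if any(r > 0 for r in ep["rewards"])]
    neg = [ep for ep in memory if any(r < 0 for r in ep["rewards"])]

    n_pos = int(frac_pos * batch_size) if pos else 0
    n_neg = int(frac_neg * batch_size) if neg else 0

    batch = [rng.choice(pos) for _ in range(n_pos)]
    batch += [rng.choice(neg) for _ in range(n_neg)]
    # Fill the rest of the batch uniformly from the whole episodic memory.
    batch += [rng.choice(memory) for _ in range(batch_size - len(batch))]
    return batch


memory = [
    {"rewards": [0, 1]},    # at least one positive reward
    {"rewards": [0, -1]},   # at least one negative reward
    {"rewards": [0, 0]},
    {"rewards": [0]},
]
batch = sample_episodes(memory, batch_size=8, frac_pos=0.25, frac_neg=0.25,
                        rng=random.Random(0))
```

Sampling with replacement keeps the sketch short; with sparse rewards the point is simply that informative episodes are guaranteed a share of every batch.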
Actions are chosen from the estimated admissible set according to an $\epsilon$-greedy rule, with $\epsilon$ annealed linearly from 1.0 to 0.1 over the first million training steps. To simplify exploration, our agent further takes a forced look action every 20 steps (Section C.1). Each episode lasts for a maximum steps. For Level game, , whereas for rest of the levels . Full details are given in the Appendix (Section A).
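The exploration schedule stated above (linear annealing from 1.0 to 0.1 over the first million steps, and a forced look every 20 steps) can be written as two small helpers. The function names are illustrative, and counting the forced look on steps 20, 40, and so on within an episode is an assumption about the exact bookkeeping.

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)


def is_forced_look(step_in_episode, period=20):
    """True on every `period`-th step of an episode (steps 20, 40, ...)."""
    return step_in_episode > 0 and step_in_episode % period == 0


eps_start = epsilon_at(0)
eps_mid = epsilon_at(500_000)
eps_late = epsilon_at(2_000_000)
```

Halfway through the annealing period the exploration rate is 0.55, and it stays at 0.1 for all steps after the first million.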
5.1 Score Contextualisation
We first consider the effect of score contextualisation on our agents’ ability to complete tasks in SaladWorld. We ask,
Does score contextualisation mitigate the negative effects of partial observability?
We begin in a simplified setting where the agent knows the admissible set. We call this setting oracle gating. This setting lets us focus on the impact of contextualisation alone. We compare our score contextualisation (SC) to the baseline and also to two “tabular” agents. The first tabular agent treats the most recent feedback as state, and hashes each unique description-action pair to a Q-value. This results in a memoryless scheme that ignores partial observability. The second tabular agent performs the information-gathering actions look and inventory to construct its state description, and also hashes these to unique Q-values. Accordingly, we call this the “LI-tabular” agent. This latter scheme has proved to be a successful heuristic in the design of IF agents (Fulda et al. 2017), but can be problematic in domains where taking information-gathering actions can have negative consequences (as is the case in Zork).
Figure 3 shows the performance of the four methods across SaladWorld levels, after 1.3 million training steps. We observe that the tabular agents’ performance suffers as soon as there are multiple subtasks, as expected. The baseline agent performs well up to the third level, but then shows significantly reduced performance. We hypothesize that this occurs because the baseline agent must estimate the hidden state from longer history sequences and effectively learn an implicit contextualisation. Beyond the fourth level, the performance of all agents suffers, suggesting the need for a better exploration strategy, for example using expert data [7].
We find that score contextualisation performs better than the baseline when the admissible set is unknown. Figure 4 compares learning curves of the SC and baseline agents with oracle gating and using the full action set, respectively, in the simplest of levels (Level 1 and 2). We find that score contextualisation can learn to solve these levels even without access to the admissible set, whereas the baseline cannot. Our results also show that oracle gating simplifies the problem, and illustrate the value in handling inadmissible actions differently.
We hypothesize that score contextualisation results in a simpler learning problem in which the agent can more easily learn to distinguish which actions are relevant to the task, which in turn facilitates credit assignment. Our result indicates that it might be unreasonable to expect contextualisation to arise naturally (or easily) in partially observable domains with large action sets. We conclude that score contextualisation mitigates the negative effects of partial observability.
5.2 Score Contextualisation with Learned Action Gating
The previous experiment (in particular, Figure 4) shows the value of restricting action selection to admissible actions. With the goal in mind of designing an agent that can operate from feedback alone, we now ask:
Can an agent learn more efficiently when given bandit feedback about the admissibility of its chosen actions?
We address this question by comparing our three action gating mechanisms. As discussed in Section 3.2, the output of the auxiliary classifier describes our estimate of an action’s admissibility for a given history.
As an initial point of comparison, we tested the performance of the baseline agent when using the auxiliary classifier’s output to gate actions. For the masking method, we selected from a larger initial parameter sweep. The results are summarized in Figure 5. While action gating alone provides some benefits in the first level, performance is equivalent for the rest of the levels.
However, when combined with score contextualisation (see Figures 6 and 7), we observe some performance gains. In Level 3 in particular, we almost recover the performance of the SC agent with oracle gating. From our results we conclude that masking with the right threshold works best, but leave as an open question whether the other action gating schemes can be improved.
Figure 8 shows the final comparison between the baseline LSTM-DRQN and our new agent architecture which incorporates action gating and score contextualisation (full learning curves are provided in the appendix, Figure 14). Our results show that the augmented method significantly outperforms the baseline, and is able to handle more complex IF domains. From level 4 onwards, the learning curves in the appendix show that combining score contextualisation with masking results in faster learning, even though final performance is unchanged. We posit that better exploration schemes are required for further progress in SaladWorld.
As a final experiment, we evaluate our agent architecture on the interactive fiction Zork I, the first installment of the popular trilogy. Zork provides an interesting point of comparison for our methods, as it is designed by and for humans; following the ontology of Bellemare et al. (2013), it is a domain which is both interesting and independent. Our main objective is to compare the different methods studied with the AE-DQN agent of Zahavy et al. (2018). Following their experimental setup, we take and train for 2 million steps. All agents use the smaller action set (131 actions). Unlike AE-DQN, however, our agent does not use information-gathering actions (look and inventory) to establish the state.
Figure 9 shows the corresponding learning curves. Despite operating in a harder regime than AE-DQN, the score contextualizing agent reaches a score comparable to AE-DQN in about half of the training steps. All agents eventually fail to pass the 35-point benchmark, which corresponds to a particularly difficult in-game task (the “troll quest”) that involves a timing element and, we hypothesize, requires a more intelligent exploration strategy.
6 Related Work
RL applied to Text Adventure games: LSTM-DQN by Narasimhan et al. (2015) deals with parser-based text adventure games and uses an LSTM to generate a feedback representation. The representation is then used by an action scorer to generate scores for the action verb and objects. The two scores are then averaged to determine the Q-value for the state-action pair. In the realm of choice-based games, He et al. (2016) use two separate deep neural nets to generate representations for feedback and action respectively. Q-values are calculated as the dot product of these representations. None of the above approaches deals with partial observability in text adventure games.
Admissible action set learning: Tao et al. (2018) approach the issue of learning the admissible set given a context as a supervised learning problem. They train their model on (input, label) pairs where the input is the context (the concatenation of feedback from look and inventory) and the label is the list of admissible commands given this input. AE-DQN (Zahavy et al. 2018) employs an additional neural network to prune inadmissible actions from the action set given a state. Although the paper does not deal with partial observability in text adventure games, the authors show that having a tractable admissible action set leads to faster convergence. Fulda et al. (2017) work on bounding the action set through affordances. Their agent is trained through tabular Q-learning.
Partial Observability: Yuan et al. (2018) replace the shared MLP in Narasimhan et al. (2015) with an LSTM cell to calculate the context representation. However, they use the concatenation of feedback from look and inventory as the given state to make the game more observable. Their work also does not focus on pruning inadmissible actions given a context. Finally, Ammanabrolu and Riedl (2019) deal with partial observability by representing state as a knowledge graph and continuously updating it after every game step. However, the graph update rules are hand-coded; it would be interesting to see whether they can be learned during gameplay.
7 Conclusions and Future work
We introduced two algorithmic improvements for deep reinforcement learning applied to interactive fiction (IF). While naturally rooted in IF, we believe our ideas extend more generally to partially observable domains and large discrete action spaces. Our results on SaladWorld and Zork show the usefulness of these improvements. Going forward, we believe better contextualisation mechanisms should yield further gains. In Zork, in particular, we hypothesize that going beyond the 35-point limit will require more tightly coupling exploration with representation learning.
This work was funded by the CIFAR Learning in Machines and Brains program. The authors thank Compute Canada for providing the computational resources.
References
[1] (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
[2] (2017-08) What can you do with a rock? Affordance extraction via word embeddings. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
[3] (1995) Reinforcement learning with selective perception and hidden state. Ph.D. Thesis, University of Rochester.
[4] (2015) Language understanding for text-based games using deep reinforcement learning. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[5] (2018) Gym retro. GitHub. Note: https://github.com/openai/retro
[6] (2010) Monte-carlo planning in large POMDPs. In Advances in Neural Information Processing Systems.
[7] (2019) Sparse imitation learning for text based games with combinatorial action spaces. The Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM) 2019.
[8] (2018) Learn what not to learn: action elimination with deep reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 3566–3577.
Appendix A Training Details
Training hyper-parameters: For all the experiments unless specified, . Weights for the learning agents are updated every steps. Agents with the score contextualisation architecture have network heads. Parameters of the score contextualisation architecture are learned end to end with the Adam optimiser (kingma:adam kingma:adam) with learning rate . To prevent imprecise updates for the initial states in the transition sequence due to insufficient history, we use the updating mechanism proposed by lample2017playing lample2017playing. In this mechanism, considering the transition sequence of length , errors from are not back-propagated through the network. In our case, the sequence length and minimum history size for a state to be updated for all experiments. Score contextualisation heads are trained to minimise the Q-learning loss over the whole transition sequence. On the other hand, minimises the BCE (binary cross-entropy) loss between the predicted admissibility probability and the actual admissibility signal for every transition in the transition sequence. The behaviour policy during training is ε-greedy over the admissible set . Each episode lasts for a maximum of steps. For Level game, we anneal to over steps and . For the rest of the games in the suite, we anneal to over steps and .
Architectural hyper-parameters: In , the word embedding size is and the number of hidden units in the encoder LSTM is . For a network head , the number of hidden units in the context LSTM is ; is a two-layer MLP: the sizes of the first and second layers are 128 and respectively. has the same configuration as .
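The two training losses described above can be illustrated with a minimal sketch. It assumes a squared TD error for the Q-learning loss and assumes that the minimum-history exclusion applies only to the Q-learning loss, not to the BCE loss; neither detail is confirmed by the text. The function name `sequence_losses` and its arguments are hypothetical.

```python
import math


def sequence_losses(td_errors, adm_probs, adm_labels, min_history):
    """Return (q_loss, bce_loss) for one transition sequence.

    td_errors   : per-step TD errors of the selected network head
    adm_probs   : per-step predicted admissibility probabilities
    adm_labels  : per-step admissibility signals (0 or 1)
    min_history : number of leading steps excluded from the Q-learning loss
    """
    n = len(td_errors)
    if n == 0 or len(adm_probs) != n or len(adm_labels) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    # Steps before `min_history` have too little context, so their TD
    # errors do not contribute to the Q-learning loss (assumed squared error).
    kept = td_errors[min_history:]
    q_loss = sum(d * d for d in kept) / len(kept) if kept else 0.0

    # Binary cross-entropy over every transition in the sequence.
    tiny = 1e-12  # guards against log(0)
    bce = 0.0
    for p, y in zip(adm_probs, adm_labels):
        bce -= y * math.log(max(p, tiny)) + (1 - y) * math.log(max(1.0 - p, tiny))
    return q_loss, bce / n
```

In an automatic-differentiation framework the same effect would typically be obtained by multiplying the per-step errors by a 0/1 mask before averaging.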
A.2 Action Gating Implementation
For dropout and masking when selecting actions, we set for . Since is essentially an estimate of the admissibility of action given history , we use (5) to implement consistent Q-value backups: where
We observe that, under the above equation, the value of an action inadmissible in indeed reduces to over time.
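As an illustration of the masking variant of action gating, the following sketch selects an action ε-greedily over the actions whose predicted admissibility exceeds a threshold. It assumes that eliminated actions receive a value of negative infinity and that the agent falls back to the full action set when nothing passes the threshold; both are assumptions, and `gated_action` is a hypothetical name.

```python
import random


def gated_action(q_values, adm_probs, threshold, epsilon, rng=random):
    """Pick an action index epsilon-greedily over predicted-admissible actions."""
    admissible = [i for i, p in enumerate(adm_probs) if p >= threshold]
    if not admissible:
        # Assumed fallback: if every action is eliminated, consider them all.
        admissible = list(range(len(q_values)))

    if rng.random() < epsilon:
        # Exploration step: uniform over the admissible set only.
        return rng.choice(admissible)

    # Exploitation step: eliminated actions are masked with -inf
    # so they can never win the argmax.
    allowed = set(admissible)
    masked = [q if i in allowed else float("-inf") for i, q in enumerate(q_values)]
    return max(range(len(masked)), key=masked.__getitem__)
```

For example, with Q-values `[5.0, 1.0, 3.0]`, admissibility estimates `[0.1, 0.9, 0.8]`, a threshold of `0.5` and `epsilon=0.0`, the highest-valued action is gated out and the third action is chosen instead.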
A.3 Baseline Modifications
We modify LSTM-DRQN (yuan2018counting yuan2018counting) in two ways. First, we concatenate the representations and before sending them to the history LSTM; in contrast, yuan2018counting yuan2018counting concatenate the inputs and first and then generate . Second, we modify the action scorer, because the action scorer in LSTM-DRQN could only handle commands with two words.
Appendix B Notations and Algorithm
The following notation is needed to understand the algorithm:
observation (i.e. feedback), reward and admissibility signal received at time .
command executed in game-play at time .
cumulative reward/score at time .
: number of network heads in score contextualisation architecture.
: dictionary mapping cumulative rewards to network heads.
LSTM corresponding to network head .
Action scorer corresponding to network head .
agent’s context/history state at time .
: maximum steps for an episode.
boolean that indicates whether a positive reward was received in episode .
boolean that indicates whether a negative reward was received in episode .
fraction of episodes where
fraction of episodes where
minimum history size for a state to be updated.
admissible set generated at time .
update interval for target network
parameter for the ε-greedy exploration strategy.
softness parameter i.e. fraction of times .
threshold parameter for action elimination strategy Masking.
maximum steps till which training is performed.
The full training procedure is listed in Algorithm 1.
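The dictionary that maps cumulative rewards to network heads can be sketched as follows. The assignment rule used here (each newly observed score takes the next unused head, and any further scores share the last head once all heads are taken) is an assumption made for illustration and may differ from Algorithm 1; the class name `ScoreToHead` is hypothetical.

```python
class ScoreToHead:
    """Maps each cumulative score to the index of a network head."""

    def __init__(self, num_heads):
        if num_heads < 1:
            raise ValueError("need at least one network head")
        self.num_heads = num_heads
        self.mapping = {}  # cumulative score -> head index

    def head_for(self, score):
        if score not in self.mapping:
            # Assumed rule: new scores take the next free head; once all
            # heads are in use, further scores share the last one.
            self.mapping[score] = min(len(self.mapping), self.num_heads - 1)
        return self.mapping[score]
```

With two heads, scores observed in the order 0, 5, 10 would be routed to heads 0, 1 and 1 respectively, and a later return to score 0 would again select head 0.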
Appendix C More Empirical Analysis
C.1 Prioritised Sampling & Infrequent look
Our algorithm uses prioritised sampling and executes a look action every steps. The baseline agent LSTM-DRQN follows this algorithm. We now ask,
Do prioritised sampling and an infrequent look play a significant role in the baseline’s performance?
For this experiment, we compare the Baseline to two agents. The first agent is the Baseline without prioritised sampling and the second is the one without an infrequent look. Accordingly, we call them “No-priority (NP)” and “No-look (NL)” respectively. We use Zork as the testing domain.
From Fig 12, we observe that the Baseline performs better than the NP agent. This is because prioritised sampling helps the baseline agent choose the episodes in which rewards are received, thus assigning credit to the relevant states faster and learning better overall. In the same figure, the Baseline performs slightly better than the NL agent. We hypothesise that, even though the look command is executed infrequently, it helps the agent explore and assign credit better.
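Prioritised sampling of episodes can be sketched as below. The sketch assumes a single priority fraction that controls how often a rewarding episode is drawn and assumes sampling with replacement; the paper's exact scheme (including how episodes with negative rewards are treated) may differ, and `sample_episodes` is a hypothetical name.

```python
import random


def sample_episodes(episodes, batch_size, pos_fraction, rng=random):
    """Draw a batch of episodes, favouring those that received a reward.

    episodes     : list of (episode, got_positive_reward) pairs
    pos_fraction : probability of drawing from the rewarding episodes
    """
    if not episodes:
        raise ValueError("replay memory is empty")
    rewarding = [e for e, positive in episodes if positive]
    other = [e for e, positive in episodes if not positive]

    batch = []
    for _ in range(batch_size):
        want_rewarding = rng.random() < pos_fraction
        # Fall back to the non-empty pool when the preferred one is empty.
        if (want_rewarding and rewarding) or not other:
            pool = rewarding
        else:
            pool = other
        batch.append(rng.choice(pool))
    return batch
```

Setting `pos_fraction` to the share of rewarding episodes in memory would roughly recover uniform sampling, which corresponds to the NP agent in spirit.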
Our algorithm uses the CQLH implementation described in Section A.2. An important case that CQLH considers is . This manifests in the term in equation (5). We now ask: does ignoring this case worsen the agent’s performance?
For this experiment, we compare the CQLH agent with an agent that uses this error for its update:
Accordingly, we call this new agent the “alternate CQLH (ACQLH)” agent. We use Zork as the testing domain. From Fig 13, we observe that although ACQLH has a simpler update rule, its performance is less stable than that of the CQLH agent.