Action Assembly: Sparse Imitation Learning for Text Based Games with Combinatorial Action Spaces

by Chen Tessler, et al.

We propose a computationally efficient algorithm that combines compressed sensing with imitation learning to solve sequential decision making text-based games with combinatorial action spaces. We propose a variation of the compressed sensing algorithm Orthogonal Matching Pursuit (OMP), that we call IK-OMP, and show that it can recover a bag-of-words from a sum of the individual word embeddings, even in the presence of noise. We incorporate IK-OMP into a supervised imitation learning setting and show that this algorithm, called Sparse Imitation Learning (Sparse-IL), solves the entire text-based game of Zork1 with an action space of approximately 10 million actions using imperfect, noisy demonstrations.



1 Introduction

“A priest, a rabbi, and a minister walk into a bar. The bartender opens his mouth to speak, but can’t find the words. Slowly he cracks a smile, looks around and says… ‘Ugh, what was I going to say?’ Such a large action space…”

While we might chuckle at a story like this, there is more here than just a joke. If we look carefully at what happened, we notice that the bartender of our story stumbled on his words (Pruim, 2014), while thinking about how to start the joke. Such cognitive-behavior breakdowns often occur when there are many actions (in the above joke, actions are possible responses) to choose from, as described in the Action-Assembly Theory (O. Greene, 1984, AAT). According to Greene, behavior is described by two essential processes: representation and processing. Representation refers to the way information is coded and stored in the mind, whereas processing refers to the mental operations performed to retrieve this information (Greene, 2008). Having good representations of information and an efficient processing procedure allows us to quickly exploit highly rewarding nuances of an environment upon first discovery.

However, learning good representations and developing efficient action selection procedures is a major computational challenge for artificially intelligent agents. This is especially problematic when dealing with combinatorial action spaces, since the agent needs to explore all possible action interactions. Combinatorial action spaces are prevalent in many domains including natural language generation (Ranzato et al., 2015; Bahdanau et al., 2016), industry 4.0 (e.g., Internet of Things) (Preuveneers and Ilie-Zudor, 2017), electric grids (Wen et al., 2015) and more. For example, in text-based games (Côté et al., 2018), given a dictionary with $d$ entries (words), the number of possible sentences of length $L$, namely the size of the action space, is $d^L$.

In this work we propose the first computationally efficient algorithm (see Figure 1), called Sparse Imitation Learning (Sparse-IL), which is inspired by AAT and combines imitation learning with a Compressed Sensing (CS) retrieval mechanism to solve text-based games with combinatorial action spaces. Our approach is composed of:

(1) Encoder - the encoder receives a state $s$ as input (Figure 1). The state is composed of individual words, represented using word embeddings that were previously trained on a large corpus of text. We train the encoder, using imitation learning, to generate a continuous action $\mathbf{a}_{\text{SoE}}$ (a dense representation of the action). The action $\mathbf{a}_{\text{SoE}}$ corresponds to a sum of word embeddings that indicates the action that the agent intends to take, e.g., the embedding of the action ‘take egg’ is the sum of the word embedding vectors of ‘take’ and ‘egg’. As the embeddings capture a prior, i.e., similarity, over language, they enable improved generalization and robustness to noise when compared to an end-to-end approach.

(2) Retrieval Mechanism - given a continuous vector $\mathbf{a}_{\text{SoE}}$, we reconstruct the $K$ best Bag-of-Words (BoW) actions $\mathbf{a}_{\text{BoW}}$, composed of up to $L$ words, from the continuous output of the encoder. We do this using an algorithm that we term Integer K-Orthogonal Matching Pursuit (IK-OMP). We then use a fitness function to score the actions, after which the best action is fed into a language model to yield an action sentence that can be parsed by the game.

Figure 1: The Sparse-IL algorithm.
Figure 2: Zork1 example screen.

Main contributions: We propose a computationally efficient algorithm called Sparse-IL that combines CS with imitation learning to solve natural language tasks with combinatorial action spaces. We show that IK-OMP, which we adapted from White et al. (2016) and Lin et al. (2013), can be used to recover a BoW vector from a sum of the individual word embeddings in a computationally efficient manner, even in the presence of significant noise. We demonstrate that Sparse-IL can solve the entire game of Zork, for the first time, while considering a combinatorial action space of approximately 10 million actions, using noisy, imperfect demonstrations.

This paper is structured as follows: Section 2 details relevant related work. Section 3 provides an overview of the problem setting; that is, the text-based game of Zork and the challenges it poses. Section 4 provides an overview of CS algorithms and, in particular, our variant called IK-OMP. This section also includes experiments in the text-based game Zork highlighting the robustness of IK-OMP to noise and its computational efficiency. Section 5 introduces our Sparse-IL algorithm and showcases the experiments of the algorithm solving the ‘Troll Quest’ and the entire game of Zork.

2 Related work

Combinatorial action spaces in text-based games: Previous works have suggested approaches for solving text-based games (Narasimhan et al., 2015; He et al., 2016; Yuan et al., 2018; Zahavy et al., 2018; Zelinka, 2018; Tao et al., 2018). However, these techniques do not scale to combinatorial action spaces. For example, He et al. (2016) presented the DRRN architecture, which requires each action to be evaluated by the network. This results in one forward pass per action, i.e., a total of $|\mathcal{A}|$ forward passes for an action set $\mathcal{A}$. Zahavy et al. (2018) proposed the Action-Elimination DQN, resulting in a smaller action set $\mathcal{A}'$. However, this set may still be of exponential size.

CS and embeddings representation: CS was originally introduced in the Machine Learning (ML) world by Calderbank et al. (2009), who proposed the concept of compressed learning, that is, learning directly in the compressed domain, e.g., the embeddings domain in the Natural Language Processing (NLP) setting. The task of generating BoW from the sums of their word embeddings was first formulated by White et al. (2016). A greedy approach, very similar to orthogonal matching pursuit (OMP), was proposed to iteratively find the words. However, this recovery task was only explicitly linked to the field of CS two years later in Arora et al. (2018).

3 Problem setting

Zork - A text-based game:

Text-based games (Côté et al., 2018) are complex interactive games usually played through a command line terminal. An example of Zork1, a text-based game, is shown in Figure 2. In each turn, the player is presented with several lines of text which describe the state of the game, and the player acts by entering a text command. In order to cope with complex commands, the game is equipped with an interpreter which deciphers the input and maps it to in-game actions. For instance, in Figure 2, a command “climb the large tree” is issued, after which the player receives a response. In this example, the response explains that up in the tree is a collectible item - a jewel encrusted egg. The large, combinatorial action space is one of the main reasons Zork poses an interesting research problem. The actions are issued as free-text and thus the complexity of the problem grows exponentially with the size of the dictionary in use.

Our setup: In this work, we consider two tasks: the ‘Troll Quest’ (Zahavy et al., 2018) and ‘Open Zork’, i.e., solving the entire game. The ‘Troll Quest’ is a sub-task within ‘Open Zork’, in which the agent must enter the house, collect a lantern and sword, move a rug which reveals a trapdoor, open the trapdoor and enter the basement. Finally, in the basement, the agent encounters a troll which it must kill using the sword. A bad action at any stage will fail to reach the troll, whereas a wrong action when encountering the troll may result in termination, i.e., death.

In our setting, we consider a dictionary of $d$ unique words, extracted from a walk-through of the game, i.e., a demonstrated sequence of actions (sentences) that solves it. We limit the maximal sentence length to $L$ words. Thus, the number of possible, unordered, word combinations is $d^L / L!$, i.e., the dictionary size to the power of the maximal sentence length, divided by the number of possible permutations. This results in approximately 10 million possible actions.
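The count described above (dictionary size to the power of the maximal sentence length, divided by the number of permutations) can be checked with a few lines of Python. This is only a sketch: the function name `action_space_size` and the values 112 and 4 are hypothetical, chosen for illustration because the exact figures are not stated in this text.

```python
from math import factorial


def action_space_size(dict_size: int, max_len: int) -> float:
    """Approximate number of unordered word combinations: d^L / L!."""
    return dict_size ** max_len / factorial(max_len)


# Hypothetical values, used only to show the order of magnitude.
size = action_space_size(dict_size=112, max_len=4)
print(f"{size:,.0f} unordered word combinations")
```

With these assumed values the count lands in the millions, the same order of magnitude as the roughly 10 million actions quoted above.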

Markov Decision Process (MDP): Text-based games can be modeled as Markov Decision Processes. An MDP $\mathcal{M}$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, R, P)$ (Sutton and Barto, 1998). In the context of text-based games, $\mathcal{S}$ is the set of states, each a paragraph representing the current observation. $\mathcal{A}$ is the set of available discrete actions, e.g., all combinations of words from the dictionary up to a maximal given sentence length $L$. $R$ is the bounded reward function; for instance, collecting items provides a positive reward. $P$ is the transition matrix, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to $s'$ assuming action $a$ was taken.

Action Space: While the common approach may be to consider a discrete action space, such an approach may be infeasible to solve, as the complexity of solving the MDP is related to the effective action space size. Hence, in this work, we consider an alternative, continuous representation. As each action is a sentence composed of words, we represent each action using the sum of the embeddings of its tokens, or constitutive words, denoted by $\mathbf{a}_{\text{SoE}}$ (Sum of Embeddings). A simple form of embedding is the Bag of Words (BoW), which represents a word using a one-hot vector the size of the dictionary, in which the dictionary index of the word is set to $1$. Aside from the BoW embedding, there exist additional forms of embedding vectors, for instance, Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which encode the similarity between words (in terms of cosine distance). These embeddings are pre-trained using unsupervised learning techniques and, similarly to how convolutional neural networks enable generalization across similar states, word embeddings enable generalization across similar sentences, i.e., actions.

In this work, we utilize GloVe embeddings, pre-trained on the Wikipedia corpus. We chose GloVe over Word2vec, as pre-trained GloVe embeddings are available in a low-dimensional space. The embedding space dimensionality $m$ is significantly smaller than the size $d$ of the dictionary in our experiments. Given the continuous representation of an action, namely the sum of embeddings of the sentence tokens $\mathbf{a}_{\text{SoE}}$, the goal is to recover the corresponding discrete action $a$, that is, the tokens composing the sentence. These may be represented as a BoW vector $\mathbf{a}_{\text{BoW}}$. Recovering the sentence from $\mathbf{a}_{\text{BoW}}$ requires prior information on the language model.
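The relation between the two representations can be made concrete: the sum of embeddings equals the embedding matrix (one column per dictionary word) multiplied by the BoW count vector. The sketch below uses toy 3-dimensional vectors in place of real GloVe embeddings; `EMBEDDINGS`, `VOCAB` and all function names are illustrative assumptions, not part of the original work.

```python
from typing import Dict, List

# Toy 3-dimensional vectors standing in for pre-trained GloVe embeddings.
EMBEDDINGS: Dict[str, List[float]] = {
    "egg": [0.0, 1.0, 0.5],
    "north": [-1.0, 0.5, 0.0],
    "take": [1.0, 0.0, 0.5],
}
VOCAB: List[str] = sorted(EMBEDDINGS)  # word order fixes the BoW indices
DIM = 3


def bag_of_words(sentence: str) -> List[int]:
    """Count how often each dictionary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in VOCAB]


def sum_of_embeddings(sentence: str) -> List[float]:
    """Add up the embedding vectors of the words in the sentence."""
    total = [0.0] * DIM
    for word in sentence.split():
        total = [t + e for t, e in zip(total, EMBEDDINGS[word])]
    return total


def dictionary_times_bow(bow: List[int]) -> List[float]:
    """Multiply the embedding matrix (one column per word) by a BoW vector."""
    return [
        sum(EMBEDDINGS[word][k] * count for word, count in zip(VOCAB, bow))
        for k in range(DIM)
    ]


bow = bag_of_words("take egg")
print(bow, sum_of_embeddings("take egg"), dictionary_times_bow(bow))
```

Recovering the BoW vector from the sum is the inverse of `dictionary_times_bow`, which is why the problem can be posed as sparse recovery.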

The language model, the last element in Figure 1, is a central piece in many important NLP tasks. Provided a set of words, its goal is to output the most likely ordering which yields a grammatically correct sentence. In this paper, we use a rule based approach. Our rules are relatively simple. For example, given a verb and an object, the verb comes before the object, e.g., [‘sword’, ‘take’] $\rightarrow$ ‘take sword’.
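A rule of the kind described (verb before object) could look like the following minimal sketch. The verb list `VERBS` and the function `order_words` are hypothetical; the actual rule set used by the authors is not specified here.

```python
from typing import Iterable

# Hypothetical verb list; a real rule set would cover the game's vocabulary.
VERBS = {"take", "open", "climb", "go", "kill", "move"}


def order_words(words: Iterable[str]) -> str:
    """Place verbs first, then the remaining words in their input order."""
    words = list(words)
    verbs = [w for w in words if w in VERBS]
    others = [w for w in words if w not in VERBS]
    return " ".join(verbs + others)


print(order_words(["sword", "take"]))
```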

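To conclude, we train a neural network $E(s)$ to predict the sum of embeddings $\mathbf{a}_{\text{SoE}}$. Using CS (Section 4), we recover the BoW vector $R(\mathbf{a}_{\text{SoE}}) = \mathbf{a}_{\text{BoW}}$, i.e., the set of words which compose the sentence. Finally, a language model $M$ converts $\mathbf{a}_{\text{BoW}}$ into a valid discrete action, namely $M(\mathbf{a}_{\text{BoW}}) = a$. The combined approach is as follows:

$$a = M(R(E(s))).$$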

4 Compressed sensing

This section provides some background on CS and sparse recovery, including practical recovery algorithms and theoretical recovery guarantees. In particular, we describe our variant of one popular reconstruction algorithm, OMP, that we refer to as Integer K-OMP (IK-OMP). The first modification allows exploitation of the integer prior on the sparse vector and is inspired by White et al. (2016) and Sparrer and Fischer (2015b). The second mitigates the greedy nature of OMP using beam search Lin et al. (2013). The experiments presented at the end of this section compare different sparse recovery approaches and demonstrate the superiority of introducing the integer prior and the beam search strategy.

4.1 Sparse Recovery

CS is concerned with recovering a high-dimensional $p$-sparse signal $\mathbf{x} \in \mathbb{R}^d$ (the BoW vector $\mathbf{a}_{\text{BoW}}$ in our setting) from a low dimensional measurement vector $\mathbf{y} \in \mathbb{R}^m$ (the sum of embeddings vector $\mathbf{a}_{\text{SoE}}$). That is, given a dictionary $\mathbf{D} \in \mathbb{R}^{m \times d}$:

$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{s.t.} \quad \mathbf{D}\mathbf{x} = \mathbf{y}. \tag{1}$$

To ensure uniqueness of the solution of Eq. 1, the sensing matrix, or dictionary, $\mathbf{D}$ must fulfill certain properties. These are also key to providing practical recovery guarantees. Well-known properties of this kind are the spark, or Kruskal rank (Donoho and Elad, 2003), and the Restricted Isometry Property (Candes and Tao, 2005, RIP). Unfortunately, these are typically as hard to compute as solving the original problem Eq. 1. While the mutual-coherence (see Definition 1) provides looser bounds, it is easily computable. Thus, we focus on mutual-coherence based results and note that spark and RIP based guarantees may be found in Elad (2010).

Definition 1 (Elad (2010) Definition 2.3)

The mutual coherence of a given matrix $\mathbf{D}$ is the largest absolute normalized inner product between different columns from $\mathbf{D}$. Denoting the $i$-th column in $\mathbf{D}$ by $\mathbf{d}_i$, it is given by

$$\mu(\mathbf{D}) = \max_{1 \le i, j \le d,\; i \ne j} \frac{|\mathbf{d}_i^T \mathbf{d}_j|}{\|\mathbf{d}_i\|_2 \, \|\mathbf{d}_j\|_2}.$$

The mutual-coherence characterizes the dependence between columns of the matrix $\mathbf{D}$. For a unitary matrix, columns are pairwise orthogonal, and as a result, the mutual-coherence is zero. For general matrices with more columns than rows ($m < d$), as in our case, $\mu(\mathbf{D})$ is necessarily strictly positive, and we desire the smallest possible value so as to get as close as possible to the behavior exhibited by unitary matrices (Elad, 2010). This is illustrated in the following uniqueness theorem.

Theorem 1 (Elad (2010) Theorem 2.5)

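If a system of linear equations $\mathbf{D}\mathbf{x} = \mathbf{y}$ has a solution $\mathbf{x}$ obeying $p < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$, where $p = \|\mathbf{x}\|_0$, this solution is the sparsest possible.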
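To see how the coherence-based guarantee behaves, the following sketch computes the largest absolute normalized inner product between distinct columns of a small matrix, together with the sparsity bound $\frac{1}{2}(1 + 1/\mu)$ from the standard form of this theorem. The helper names and the toy matrix are assumptions made for illustration; the matrix is stored as a list of columns.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def mutual_coherence(columns: Sequence[Sequence[float]]) -> float:
    """Largest absolute normalized inner product between distinct columns."""
    def dot(u: Sequence[float], v: Sequence[float]) -> float:
        return sum(a * b for a, b in zip(u, v))

    return max(
        abs(dot(u, v)) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))
        for u, v in combinations(columns, 2)
    )


def sparsity_bound(mu: float) -> float:
    """Uniqueness is guaranteed for sparsity strictly below this value (mu > 0)."""
    return 0.5 * (1.0 + 1.0 / mu)


# Toy dictionary with 2 rows and 3 columns (more columns than rows).
toy_columns: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = mutual_coherence(toy_columns)
print(mu, sparsity_bound(mu))
```

As the coherence approaches 1, the bound approaches 1, so the guarantee covers only single-word vectors. This is the regime in which coherence-based results say little about longer sentences.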

We now turn to discuss practical methods for solving Eq. 1.

4.2 Recovery Algorithms

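The sparse recovery problem Eq. 1 is non-convex due to the $\ell_0$-norm. Although it may be solved via combinatorial search, the complexity of exhaustive search is exponential in the dictionary dimension $d$, and it has been proven that Eq. 1 is, in general, NP-Hard (Elad, 2010).
One approach to solve Eq. 1, known as basis pursuit (BP), relaxes the $\ell_0$-minimization to its $\ell_1$-norm convex surrogate,

$$\min_{\mathbf{x}} \|\mathbf{x}\|_1 \quad \text{s.t.} \quad \mathbf{D}\mathbf{x} = \mathbf{y}. \tag{2}$$

In the presence of noise, the condition $\mathbf{D}\mathbf{x} = \mathbf{y}$ is replaced by $\|\mathbf{D}\mathbf{x} - \mathbf{y}\|_2 \le \epsilon$. The Lagrangian relaxation of this quadratic program is written, for some $\lambda > 0$, as $\min_{\mathbf{x}} \lambda \|\mathbf{x}\|_1 + \frac{1}{2}\|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2$, and is known as basis pursuit denoising (BPDN).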

The above noiseless and noisy problems can be respectively cast as linear programming and second order cone programming problems (Chen et al., 2001). They may thus be solved using techniques such as interior-point methods (Ben-Tal and Nemirovski, 2001; Boyd and Vandenberghe, 2004). Large scale problems involving dense sensing matrices often preclude the use of such methods. This motivated the search for simpler gradient-based algorithms for solving Eq. 2, such as the fast iterative shrinkage-thresholding algorithm (Beck and Teboulle, 2009, FISTA).

Alternatively, one may use greedy methods, broadly divided into matching pursuit based algorithms, such as OMP (Blumensath and Davies, 2008), and thresholding based methods, including iterative hard thresholding (Blumensath and Davies, 2009). The popular OMP algorithm, reported in the Appendix, proceeds by iteratively finding the dictionary column with the highest correlation to the signal residual, computed by subtracting the contribution of a partial estimate of $\mathbf{x}$ from $\mathbf{y}$. The coefficients over the selected support set are then chosen so as to minimize the residual error. A typical halting criterion compares the residual to a predefined threshold.

4.3 Recovery Guarantees

Performance guarantees for both $\ell_1$-relaxation and greedy methods have been provided in the CS literature. In noiseless settings, under the conditions of Theorem 1, the unique solution of Eq. 1 is also the unique solution of Eq. 2 (Elad, 2010, Theorem 4.5). Under the same conditions, OMP with halting criterion threshold $\epsilon = 0$ is guaranteed to find the exact solution of Eq. 1 (Elad, 2010, Theorem 4.3). More practical results are given for the case where the measurements are contaminated by noise (Donoho et al., 2006; Elad, 2010; Eldar and Kutyniok, 2012).

4.4 Integer K-OMP (IK-OMP)

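Algorithm 1 IK-OMP
Input: Measurement vector $\mathbf{y} \in \mathbb{R}^m$, dictionary $\mathbf{D} \in \mathbb{R}^{m \times d}$, maximal number of words $L$ and beam width $K$
Initial solutions $X_0 = [\mathbf{0}_d]$
for $l = 1, \dots, L$ do
     for $\mathbf{x} \in X_{l-1}$ do
         Extend: Append $\mathbf{x} + \mathbf{1}_j$, $\forall j \in [1, \dots, d]$, to $X_l$
     Trim: $X_l = K\text{-}\arg\min_{\mathbf{x} \in X_l} \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2$
return $X_L$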

An Integer Prior: While CS is typically concerned with the reconstruction of a sparse real-valued signal, in our BoW linear representation, the signal fulfills a secondary structure constraint besides sparsity. Its nonzero entries stem from a finite, or discrete, alphabet. Such prior information on the original signal appears in many communication scenarios Candes et al. (2005); Axell et al. (2012); Rossi et al. (2014), where the transmitted data originates from a finite set.

A few works have adapted CS algorithms to the recovery of integer-valued vectors. It has been shown that discreteness constraints on the possible values of the reconstructed signal may significantly reduce the number of required measurements Keiper et al. (2017), or alternatively increase recovery performance with a given number of measurements. BP has been adapted to the recovery of sparse integer vectors in Keiper et al. (2017). However, it is restricted to recovery of binary and ternary signals. Besides, BP’s running time becomes prohibitively high with large dictionaries and the authors are not aware of any FISTA variant for integer-valued signal recovery.

In Sparrer and Fischer (2015a), the discrete prior is exploited to initialize the support set for OMP. Later, Sparrer and Fischer (2015b) consider OMP in connection with quantization and soft feedback. Unfortunately, the above approaches require knowledge of the measurement noise variance. Here, we adopt an approach similar to the greedy addition phase in the selection algorithm of White et al. (2016). In each iteration, we find the dictionary column closest to the residual in terms of mean square error and increment the corresponding entry of the recovered $\mathbf{x}$ by one. This method enjoys the same computational complexity as OMP and we empirically found that it outperforms simply adding a quantization step in the original OMP iteration.
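The greedy integer step described above (pick the column closest to the residual, increment its count, subtract it from the residual) can be sketched in plain Python as follows. The early-stopping check is an assumption added here so that sentences shorter than the maximum length are handled; the function name and the toy dictionary are also illustrative.

```python
from typing import List, Sequence


def integer_omp(
    y: Sequence[float], columns: Sequence[Sequence[float]], max_words: int
) -> List[int]:
    """Greedy integer recovery: each step adds the column closest to the residual."""
    counts = [0] * len(columns)
    residual = list(y)
    for _ in range(max_words):
        # Squared distance between the residual and every dictionary column.
        errors = [
            sum((r - c) ** 2 for r, c in zip(residual, col)) for col in columns
        ]
        best = min(range(len(columns)), key=errors.__getitem__)
        # Assumption: stop once adding another word no longer shrinks the residual.
        if errors[best] >= sum(r * r for r in residual):
            break
        counts[best] += 1
        residual = [r - c for r, c in zip(residual, columns[best])]
    return counts


# Toy dictionary: three words with 3-dimensional embeddings (one list per column).
toy_dictionary = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [-1.0, 0.5, 0.0]]
print(integer_omp([1.0, 1.0, 1.0], toy_dictionary, max_words=4))
```

Because each step only increments a count, no least-squares solve is needed, which keeps the per-iteration cost at one pass over the dictionary.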

Beam Search OMP: As OMP iteratively adds atoms to the recovered support, the choice of a new element in an iteration is blind to its effect on future iterations. Therefore, any mistakes, particularly in early iterations, may lead to large recovery errors. To mitigate this phenomenon, several methods have been proposed to amend the OMP algorithm.

To decrease the greediness of the greedy addition algorithm (which acts similarly to OMP), White et al. (2016) use a substitution based method, also referred to as swapping (Andrle and Rebollo-Neira, 2006) in the CS literature. Unfortunately, the computational complexity of this substitution strategy makes it impractical. Elad and Yavneh (2009) combine several recovered sparse representations, to improve denoising, by randomizing the OMP algorithm. However, in our case, the sum of embeddings represents a true sparse BoW vector $\mathbf{a}_{\text{BoW}}$, so that combining several recovered vectors should not lead to the correct solution.

The look ahead OMP (LAOMP) (Chatterjee et al., 2011) uses a multi-path OMP procedure that, in each iteration, considers several potential elements and evaluates the effect of selecting each one on the final residual. However, this algorithm may still discard the correct path early and incur a serious recovery error. In addition, this approach is computationally demanding. While the complexity of OMP is , that of LAOMP is , where is the number of potential candidates in each iteration. We adopt the beam search approach of Lin et al. (2013), K-best OMP, that enjoys the lower computational complexity of , with the number of candidates per beam. This method extends and preserves multiple search paths simultaneously so that the probability of finding the correct locations of nonzero elements can be much improved.

IK-OMP: We combine the integer prior with the beam search strategy, and propose IK-OMP (Algorithm 1). In the algorithm description, $\mathbf{1}_j$ is the vector with a single nonzero element at index $j$ and $K\text{-}\arg\min$ denotes the $K$ elements with smallest value for the following expression. In this work, the selected BoW is the candidate which minimizes the reconstruction score.
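A minimal sketch of the combination of integer increments with beam search is given below. It is an illustration under stated assumptions, not the authors' implementation: ties are broken lexicographically, the best candidate is tracked across all depths so that sentences shorter than the maximum length can be returned, and the names `ik_omp` and `toy_dictionary` are hypothetical.

```python
from typing import List, Sequence, Tuple

Counts = Tuple[int, ...]


def ik_omp(
    y: Sequence[float],
    columns: Sequence[Sequence[float]],
    max_words: int,
    beam_width: int,
) -> Tuple[Counts, List[Counts]]:
    """Beam search over integer count vectors; returns (best, final beam)."""
    n = len(columns)

    def error(counts: Counts) -> float:
        # Squared reconstruction error ||y - D x||^2 for the count vector x.
        recon = [
            sum(c * col[k] for c, col in zip(counts, columns))
            for k in range(len(y))
        ]
        return sum((a - b) ** 2 for a, b in zip(y, recon))

    beam: List[Counts] = [tuple([0] * n)]
    best = beam[0]
    for _ in range(max_words):
        # Extend: every beam member spawns one child per dictionary word.
        children = {
            c[:j] + (c[j] + 1,) + c[j + 1:] for c in beam for j in range(n)
        }
        # Trim: keep the K children with the smallest reconstruction error.
        beam = sorted(children, key=lambda c: (error(c), c))[:beam_width]
        # Assumption: remember the best candidate seen at any depth.
        best = min([best] + beam, key=error)
    return best, beam


toy_dictionary = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [-1.0, 0.5, 0.0]]
best, candidates = ik_omp([1.0, 1.0, 1.0], toy_dictionary, max_words=4, beam_width=3)
print(best, candidates)
```

With a beam width of 1 this reduces to the purely greedy integer step. Wider beams trade extra error evaluations for a lower chance of discarding the correct path early.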

Figure 3: Compressed Sensing: Comparison of the accuracy, and accumulated reward, of the various reconstruction algorithms on the ‘Troll Quest’ and in ‘Open Zork’. The SnR denotes the ratio between the norm of the original signal and that of the added noise.
Table 1: Runtime comparison of various CS approaches.
Figure 4: Compressed Sensing - DeepCS: Comparison of the accuracy, and accumulated reward, of the DeepCS baselines, compared to the IK-OMP approach.

4.5 Compressed Sensing Experiments

In this section, we focus on comparing several CS approaches. To do so, we follow the set of commands, extracted from a walk-through of the game, required to solve Zork1, both in the ‘Troll Quest’ and ‘Open Zork’ domains. In each state $s$, we take the ground-truth action $a$, calculate the sum of word embeddings $\mathbf{a}_{\text{SoE}}$, add noise $\boldsymbol{\epsilon}$ and test the ability of various CS methods to reconstruct $a$. We compare the run-time (Table 1), and the reconstruction accuracy (number of actions reconstructed correctly) and reward gained in the presence of noise (Figure 3). Specifically, the measured action is $\mathbf{a}_{\text{mes}} = \mathbf{a}_{\text{SoE}} + \boldsymbol{\epsilon}$, where the Gaussian noise $\boldsymbol{\epsilon}$ is normalized based on the signal to noise ratio (SnR).
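The noise normalization can be sketched as follows, taking the SnR to be the ratio between the norm of the clean signal and the norm of the added noise, as in the caption of Figure 3. The function name `add_noise` and the use of standard Gaussian draws before rescaling are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def add_noise(signal: Sequence[float], snr: float, rng: random.Random) -> List[float]:
    """Add Gaussian noise rescaled so that ||signal|| / ||noise|| equals snr."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_norm = math.sqrt(sum(v * v for v in signal))
    noise_norm = math.sqrt(sum(v * v for v in noise))
    scale = signal_norm / (snr * noise_norm)
    return [s + scale * e for s, e in zip(signal, noise)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
noisy = add_noise([3.0, 4.0], snr=5.0, rng=rng)
print(noisy)
```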

We compare 4 CS methods: the FISTA implementation of BP, OMP, IK-OMP (Algorithm 1) and a Deep Learning variant, which we term DeepCS, described below. The dictionary is composed of $d$ possible words which can be used in the game. The dimension of the embedding is $m$ and the sentence length is limited to at most $L$ words. This yields a total of approximately 10 million actions, from which the agent must choose one at each step. It is important to note that while accuracy and reward might seem similar, an inaccurate reconstruction at an early stage results in an immediate failure, even when the accuracy seems high.

Clearly, as seen from Figure 3, OMP fails to reconstruct the true BoW vectors $\mathbf{a}_{\text{BoW}}$, even in the noiseless scenario. Indeed, the mutual-coherence (Definition 1) is and from Theorem 1, there is no guarantee that OMP can reconstruct a sparse vector for any sparsity . However, our suggested approach, IK-OMP, is capable of correctly reconstructing the original action $a$, even in the presence of relatively large noise. This gives evidence that the integer prior, in particular, and the beam search strategy significantly improve the sparse recovery performance.

Deep Compressed Sensing: Besides traditional CS methods, it is natural to test the ability of deep learning methods to perform such a task. In this approach, we train a neural network to predict the BoW vector $\mathbf{a}_{\text{BoW}}$ which composes the continuous embedding vector. Our network is a multi layer perceptron (MLP), composed of two hidden layers, 100 neurons each. We use a sigmoid activation function to bound the outputs to $[0, 1]$ and train the network using a binary cross-entropy loss.

Our results, presented in Figure 4, show that the DeepCS approach works when no noise is present; however, once noise is added to the setup, DeepCS clearly performs poorly compared to classic CS methods such as IK-OMP [1]. Besides, as DeepCS requires training a new model for each domain, it is data-specific and does not transfer easily, which is not the case with traditional CS methods.

[1] In the Appendix, we provide an additional experiment, which shows that in the ‘Troll Quest’, DeepCS remains competitive as the domain is relatively small. We also provide experiments with a much larger dictionary, of 1000 words, and show that IK-OMP is capable of scaling.

5 Imitation Learning

In this section, we present our Sparse-IL algorithm and provide in-depth details regarding the design and implementation of each of its underlying components. We also detail the experiments of executing Sparse-IL on the entire game of Zork.

Sparse Imitation Learning: Our Sparse-IL architecture is composed of two major components - Encoder and Retrieval Mechanism (as seen in Figure 1). Each component has a distinct role and combining them enables a computationally efficient approach. (1) $E$, the Encoder, given a state $s$, predicts a sum of word embeddings $\mathbf{a}_{\text{SoE}}$. (2) $R$, the Retrieval Mechanism, converts the continuous action embedding $\mathbf{a}_{\text{SoE}}$ into the most probable BoW representation $\mathbf{a}_{\text{BoW}}$; one example of such a mechanism is IK-OMP (Section 4). This BoW representation $\mathbf{a}_{\text{BoW}}$ is provided to the Language Model $M$, which returns the most probable ordering of words (BoW $\rightarrow$ sentence). We will now discuss each of these components in more detail.

The Encoder (E) is a neural network trained to output the optimal action representation at each state. As we consider the task of imitation learning, this is performed by minimizing the loss between the Encoder’s output $E(s)$ and the embedding of the action provided by the expert, $\mathbf{a}_{\text{SoE}}$.

Figure 6: Sparse Imitation Learning: Comparison of the accuracy of each reconstruction algorithm on an agent trained using imitation learning to solve the entire game. In the graph on the left, IK-OMP with K=20 and K=112 result in identical performance.
Figure 5: Difference in reconstruction accuracy, between Sparse-IL and DeepCS-2. Higher value represents a higher reconstruction accuracy for Sparse-IL. DeepCS-2 fails when presented with several variants of the correct actions (synonyms).

In all of the learning experiments, the architecture we use is a convolutional neural network (CNN) that is suited to NLP tasks (Kim, 2014). Due to the structure of the game, there exist long-term dependencies. Frame-stacking, a common approach in games (Mnih et al., 2015), tackles this issue by providing the network with the N previous states. For the ‘Open Zork’ task, we stack the previous 12 states, whereas in the ‘Troll Quest’ we provide only the current state.

Retrieval Mechanism (R): The output of the Encoder, $E(s)$, is fed into a CS algorithm, such as IK-OMP. IK-OMP produces $K$ candidate actions, $\mathbf{a}_{\text{BoW}}^1, \dots, \mathbf{a}_{\text{BoW}}^K$. These actions are fed into a fitness function which ranks them, based on the reconstruction score (see Section 4), and returns the optimal candidate. Other CS approaches, e.g., OMP and FISTA, return a single candidate action.

5.1 Experiments

In an imitation learning setup, we are given a data set of state-action pairs $(s, a)$, provided by an expert; the goal is to learn a policy that achieves the best performance possible. We achieve this by training the embedding network to imitate the demonstrated actions in the embedding space, namely $\mathbf{a}_{\text{SoE}}$, at each state $s$, using the MSE between the predicted actions and those demonstrated. We consider three setups: (1) perfect demonstrations, where we test errors due to architecture capacity and function approximation; (2) Gaussian noise (see Section 4.5); and (3) discrete-action noise, in which w.p. $p$ a random incorrect action is demonstrated. The third setup can be seen as learning from demonstrations provided by an ensemble of sub-optimal experts.
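The discrete-action noise setup can be sketched as a simple corruption of the demonstration set: with some probability, the expert's action is swapped for a different action drawn uniformly at random. The function name, the uniform choice and the toy data below are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (state text, action sentence)


def corrupt_demonstrations(
    pairs: Sequence[Pair], actions: Sequence[str], p: float, rng: random.Random
) -> List[Pair]:
    """With probability p, replace the demonstrated action by a random wrong one."""
    noisy = []
    for state, action in pairs:
        if rng.random() < p:
            # Assumption: the incorrect action is drawn uniformly from the others.
            action = rng.choice([a for a in actions if a != action])
        noisy.append((state, action))
    return noisy


demos = [("West of House", "open mailbox"), ("Forest Path", "climb tree")]
all_actions = ["open mailbox", "climb tree", "take egg", "go north"]
print(corrupt_demonstrations(demos, all_actions, p=0.3, rng=random.Random(0)))
```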

Our results are shown in Figure 6. By combining CS with imitation learning techniques, our approach is capable of solving the entire game of Zork1, even in the presence of discrete-action noise. In all our experiments, IK-OMP outperforms the various baselines [2], including the end-to-end approach, i.e., DeepCS-2, which is trained to predict the BoW embedding $\mathbf{a}_{\text{BoW}}$ directly from the state $s$.

[2] We provide similar results on the ‘Troll Quest’ in the appendix.

Training: Analyzing the training graph presents an interesting picture. It shows that during the training process, the output of the Encoder can be seen as a noisy estimation of $\mathbf{a}_{\text{SoE}}$. As training progresses, the effective noise decreases (i.e., the effective SnR increases), which is seen in the increase in reconstruction performance.

Generalization: In Figure 5, we present the generalization capabilities which our method Sparse-IL enjoys, due to the use of pre-trained unsupervised word embeddings. The heatmap shows two forms of noise. The first, as before, is the probability of receiving a bad demonstration, i.e., an incorrect action. The second, synonym probability, is the probability of being presented with a correct action, yet composed of different words, e.g., drop, throw and discard result in an identical action in the environment and have a similar meaning. These results clearly show that Sparse-IL outperforms DeepCS-2 in nearly all scenarios, highlighting the generalization improvement inherent in the embeddings.

The benefit of meaningful embeddings: In our approach, the Encoder is trained to predict the sum-of-embeddings $\mathbf{a}_{\text{SoE}}$. However, it can also be trained to directly predict the BoW vector $\mathbf{a}_{\text{BoW}}$. While this approach may work, it lacks the generalization ability which is apparent in embeddings such as GloVe, in which similar words receive similar embedding vectors.

Consider a scenario in which there are 4 optimal actions (e.g., ‘go north’, ‘walk north’, ‘run north’ and ‘move north’) and 1 sub-optimal action (e.g., ‘climb tree’). Suppose each of the optimal actions is presented with probability $p$ and the sub-optimal action with probability $q$, so that $4p + q = 1$. In this example, the expected BoW representation would include ‘north’ w.p. $4p$, ‘climb’ and ‘tree’ w.p. $q$, and each of the remaining words (‘go’, ‘walk’, ‘run’ and ‘move’) w.p. $p$. On the other hand, since ‘go’, ‘walk’, ‘run’ and ‘move’ have similar meanings and in turn similar embeddings, the expected $\mathbf{a}_{\text{SoE}}$ is much closer to the optimal actions than to the sub-optimal one and thus an imitation agent is less likely to make a mistake.
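The BoW side of this argument can be worked through numerically. The probabilities below (0.15 for each optimal action, 0.4 for the sub-optimal one) are hypothetical values chosen for illustration, as is the helper `expected_bow`.

```python
from collections import Counter
from typing import Dict


def expected_bow(action_probs: Dict[str, float]) -> Dict[str, float]:
    """Expected BoW entry of each word under a distribution over action sentences."""
    expectation: Counter = Counter()
    for sentence, prob in action_probs.items():
        for word in sentence.split():
            expectation[word] += prob
    return dict(expectation)


# Hypothetical demonstration distribution: four synonymous optimal actions
# and one sub-optimal action that is individually more frequent than each.
demo_distribution = {
    "go north": 0.15,
    "walk north": 0.15,
    "run north": 0.15,
    "move north": 0.15,
    "climb tree": 0.4,
}
for word, weight in sorted(expected_bow(demo_distribution).items()):
    print(f"{word}: {weight:.2f}")
```

Under these assumed numbers, ‘north’, ‘climb’ and ‘tree’ carry the largest expected BoW weights while each synonym verb is weak, so an agent regressing onto the BoW mean is pulled towards a mixture that matches no optimal action.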

Remark 1

While in this work we used a naive fitness function, namely the reconstruction loss, as it performs a selection over K possible actions, it can be replaced by an evaluation network, e.g., critic (Dulac-Arnold et al., 2015), and used for policy improvement. Results using the Top-K metric are provided in the appendix. These results show that the optimal action is contained within the K candidates even under large noise which prevents the Top-1 from successfully solving the task, and suggests that IK-OMP can be used to produce a minimal action set, similar to the elimination network in Zahavy et al. (2018).

6 Conclusion

We have presented a computationally efficient algorithm called Sparse Imitation Learning (Sparse-IL) that combines CS with imitation learning to solve text-based games with combinatorial action spaces. We proposed a CS algorithm variant of OMP which we have called Integer K-OMP (IK-OMP) and demonstrated that it can deconstruct a sum of word embeddings into the individual BoW that make up the embedding, even in the presence of significant noise. In addition, IK-OMP is significantly more computationally efficient than the baseline CS techniques. When combining IK-OMP with imitation learning, our agent is able to solve the ‘Troll Quest’ as well as the entire game of Zork1 for the first time. Zork1 contains a combinatorial action space of approximately 10 million actions. Future work includes replacing the fitness function with a critic in order to further improve the learned policy as well as testing the capabilities of the critic agent in cross-domain tasks.


Appendix A Compressed Sensing Algorithms

In Section 4, we presented Orthogonal Matching Pursuit (OMP), a common algorithm in CS, and Integer OMP, an integer-constrained variant. Both algorithms are presented below. In the algorithm description, is the -th column of and denotes the columns of the matrix indexed by the set . Similarly, is the vector reduced to the support , implying that the remaining entries of are zeros.

Algorithm 2 OMP
Input: Measurement vector , dictionary and stopping criteria
Initial solution
Initial residual
Initial support
while  do
     Update support:
     Compute solution:
     Compute residual:

Algorithm 3 Integer OMP
Input: Measurement vector , dictionary and maximal number of words
Initial solution
Initial residual
for l = 1, L do
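
For reference, the three OMP steps named above (update support, compute solution, compute residual) can be sketched in NumPy as follows. This is a minimal sketch of standard OMP as it is commonly described in the CS literature, not the authors' implementation. The names `y`, `D`, `tol` and `max_atoms`, and the use of a residual-norm threshold as the stopping criterion, are assumptions of this sketch. The integer-constrained variant is not shown.

```python
import numpy as np


def omp(y, D, tol=1e-6, max_atoms=None):
    """Greedy sparse recovery of x such that D @ x approximates y.

    y: measurement vector of shape (m,)
    D: dictionary of shape (m, n), ideally with unit-norm columns
    tol: stop once the residual norm drops to this value or below (assumed criterion)
    max_atoms: optional cap on the support size
    """
    y = np.asarray(y, dtype=float)
    n = D.shape[1]
    max_atoms = n if max_atoms is None else max_atoms

    x = np.zeros(n)          # initial solution
    residual = y.copy()      # initial residual
    support = []             # initial (empty) support

    while np.linalg.norm(residual) > tol and len(support) < max_atoms:
        # Update support: add the column most correlated with the residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = -np.inf  # never re-select a chosen column
        support.append(int(np.argmax(correlations)))

        # Compute solution: least squares restricted to the current support.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(n)
        x[support] = coef

        # Compute residual.
        residual = y - D @ x

    return x, sorted(support)
```

The function returns both the recovered coefficient vector and the sorted support, so a caller can read off which dictionary columns (words) were selected.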

Appendix B Experiments

Throughout the paper, all experiments were evaluated across 5 random seeds. The compute used for the simulations consisted of a single i7 system with a GTX 1080 Ti GPU.

B.1 DeepCS

For clarity, we provide both the ‘Troll Quest’ and ‘Open Zork’ results in the DeepCS setting. These results show that while DeepCS is comparable to IK-OMP on a small problem such as ‘Troll Quest’, on larger, more complex problems IK-OMP performs significantly better than DeepCS in both accuracy and total reward.

Figure 7: Zork1: Comparison of the accuracy and accumulated reward of the DeepCS baselines and the IK-OMP approach.

B.2 CS with

In addition to the results provided in the paper, we test our approach when the dictionary size is increased by a factor of 10. This results in an effective action space of .

Figure 8: Zork1: Comparison of the accuracy and reward of the various CS approaches when considering a dictionary of 1000 words.

The results show that our approach is indeed capable of coping with larger action spaces; however, performance degrades as the noise grows.

B.3 Sparse Imitation Learning

Similarly to the results on ‘Open Zork’, we see that IK-OMP outperforms both standard OMP and FISTA. In addition, these results show that during training, the output of the Encoder, namely , may be viewed as a noisy representation of the original sum of embeddings. As training proceeds, the network converges; hence the approximation noise decreases (i.e., the SNR improves) and the various reconstruction schemes perform better.

Figure 9: Zork1: Comparison of the accuracy of each reconstruction algorithm on an agent trained using imitation learning to solve the Troll Quest and the entire game.

B.4 Top-K experiment

While our analysis focused on the pure imitation setting, and thus on the Top-1 accuracy, it is also important to consider the Top-K accuracy. Under the Top-K accuracy, a prediction is deemed correct if the correct action is one of the K candidates offered by the CS algorithm. Since IK-OMP, unlike OMP and FISTA, produces multiple candidate actions, we expect a significant improvement under this measure.
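
The Top-K measure described above can be computed as in the following minimal sketch. The candidate lists and ground-truth actions are made-up toy data, and treating the first candidate in each list as the Top-1 prediction is an assumption of this example.

```python
def top_k_accuracy(candidate_lists, true_actions):
    """Fraction of steps whose true action appears among the K candidates."""
    hits = sum(
        1 for cands, truth in zip(candidate_lists, true_actions) if truth in cands
    )
    return hits / len(true_actions)


# Toy data: K = 2 candidates per step (hypothetical, for illustration).
candidate_lists = [
    ["go north", "go south"],
    ["take lamp", "open door"],
    ["open mailbox", "read leaflet"],
]
true_actions = ["go north", "open door", "take sword"]

top_k = top_k_accuracy(candidate_lists, true_actions)

# Top-1 is the special case in which only the first candidate is kept.
top_1 = top_k_accuracy([cands[:1] for cands in candidate_lists], true_actions)

print(top_k, top_1)
```

By construction, Top-K accuracy is never lower than Top-1 accuracy, since the first candidate is always among the K candidates.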

Troll Quest

Figure 10: Zork1: Comparison of the various CS approaches using the Top-K accuracy with imperfect demonstrations.

The empirical results show that the Top-K measure indeed yields a dramatic increase in performance. Even in the presence of very large noise, the real action is one of the K candidates reconstructed by IK-OMP.

This result is important as it highlights the ability of our approach to incorporate self-play. Such an approach can be viewed as similar to the action elimination approach taken by Zahavy et al. (2018). While Zahavy et al. (2018) reduce the effective action set using feedback from the environment, in our approach the imperfect demonstrations reduce the effective action set, and thus the MDP, to a smaller and therefore solvable problem.