1 Introduction
“A priest, a rabbi, and a minister walk into a bar. The bartender opens his mouth to speak, but can’t find the words. Slowly he cracks a smile, looks around and says… ‘Ugh, what was I going to say?’ Such a large action space…”
While we might chuckle at a story like this, there is more here than just a joke. If we look carefully at what happened, we notice that the bartender of our story stumbled on his words (Pruim, 2014) while thinking about how to start the joke. Such cognitive-behavior breakdowns often occur when there are many actions (in the above joke, actions are possible responses) to choose from, as described in Action Assembly Theory (Greene, 1984, AAT). According to Greene, behavior is described by two essential processes: representation and processing. Representation refers to the way information is coded and stored in the mind, whereas processing refers to the mental operations performed to retrieve this information (Greene, 2008). Having good representations of information and an efficient processing procedure allows us to quickly exploit highly rewarding nuances of an environment upon first discovery.
However, learning good representations and developing efficient action selection procedures is a major computational challenge for artificially intelligent agents. This is especially problematic when dealing with combinatorial action spaces, since the agent needs to explore all possible action interactions. Combinatorial action spaces are prevalent in many domains, including natural language generation (Ranzato et al., 2015; Bahdanau et al., 2016), Industry 4.0 (e.g., the Internet of Things) (Preuveneers and Ilie-Zudor, 2017), electric grids (Wen et al., 2015) and more. For example, in text-based games (Côté et al., 2018), given a dictionary with $d$ entries (words), the number of possible sentences of length $l$, namely the size of the action space, is $d^l$.

In this work we propose the first computationally efficient algorithm (see Figure 1), called Sparse Imitation Learning (Sparse-IL), which is inspired by AAT and combines imitation learning with a Compressed Sensing (CS) retrieval mechanism to solve text-based games with combinatorial action spaces. Our approach is composed of:
(1) Encoder – the encoder receives a state as input (Figure 1). The state is composed of individual words, represented using word embeddings that were previously trained on a large corpus of text. We train the encoder, using imitation learning, to generate a continuous action (a dense representation of the action). The action corresponds to a sum of word embeddings that indicates the action the agent intends to take, e.g., the embedding of the action ‘take egg’ is the sum of the word embedding vectors of ‘take’ and ‘egg’. As the embeddings capture a prior, i.e., similarity, over language, this enables improved generalization and robustness to noise when compared to an end-to-end approach.
(2) Retrieval Mechanism – given a continuous vector, we reconstruct the $K$ best Bag-of-Words (BoW) actions, composed of up to $l$ words, from the continuous output of the encoder. We do this using an algorithm that we term Integer K-Orthogonal Matching Pursuit (IK-OMP). We then use a fitness function to score the actions, after which the best action is fed into a language model to yield an action sentence that can be parsed by the game.
Main contributions: We propose a computationally efficient algorithm called Sparse-IL that combines CS with imitation learning to solve natural language tasks with combinatorial action spaces. We show that IK-OMP, which we adapted from White et al. (2016) and Lin et al. (2013), can be used to recover a BoW vector from a sum of the individual word embeddings in a computationally efficient manner, even in the presence of significant noise. We demonstrate that Sparse-IL can solve the entire game of Zork, for the first time, while considering a combinatorial action space of approximately 10 million actions, using noisy, imperfect demonstrations.
This paper is structured as follows: Section 2 details relevant related work. Section 3 provides an overview of the problem setting, that is, the text-based game of Zork and the challenges it poses. Section 4 provides an overview of CS algorithms and, in particular, our variant called IK-OMP. This section also includes experiments in the text-based game Zork highlighting the robustness of IK-OMP to noise and its computational efficiency. Section 5 introduces our Sparse-IL algorithm and showcases the experiments of the algorithm solving the ‘Troll Quest’ and the entire game of Zork.
2 Related work
Combinatorial action spaces in text-based games: Previous works have suggested approaches for solving text-based games (Narasimhan et al., 2015; He et al., 2016; Yuan et al., 2018; Zahavy et al., 2018; Zelinka, 2018; Tao et al., 2018). However, these techniques do not scale to combinatorial action spaces. For example, He et al. (2016) presented the DRRN architecture, which requires each action to be evaluated by the network, resulting in a number of forward passes that grows with the size of the action set. Zahavy et al. (2018) proposed the Action-Elimination DQN, which results in a smaller action set; however, this set may still be of exponential size.
CS and embeddings representation: CS was originally introduced to the Machine Learning (ML) world by Calderbank et al. (2009), who proposed the concept of compressed learning, that is, learning directly in the compressed domain, e.g., the embeddings domain in the Natural Language Processing (NLP) setting. The task of generating a BoW from the sums of its word embeddings was first formulated by White et al. (2016). A greedy approach, very similar to orthogonal matching pursuit (OMP), was proposed to iteratively find the words. However, this recovery task was only explicitly linked to the field of CS two years later, in Arora et al. (2018).

3 Problem setting
Zork – A text-based game:
Text-based games (Côté et al., 2018) are complex interactive games, usually played through a command-line terminal. An example from Zork1, a text-based game, is shown in Figure 2. In each turn, the player is presented with several lines of text which describe the state of the game, and the player acts by entering a text command. In order to cope with complex commands, the game is equipped with an interpreter which deciphers the input and maps it to in-game actions. For instance, in Figure 2, the command “climb the large tree” is issued, after which the player receives a response. In this example, the response explains that up in the tree is a collectible item, a jewel-encrusted egg. The large, combinatorial action space is one of the main reasons Zork poses an interesting research problem: actions are issued as free-text, and thus the complexity of the problem grows exponentially with the size of the dictionary in use.
Our setup: In this work, we consider two tasks: the ‘Troll Quest’ (Zahavy et al., 2018) and ‘Open Zork’, i.e., solving the entire game. The ‘Troll Quest’ is a sub-task within ‘Open Zork’, in which the agent must enter the house, collect a lantern and a sword, move a rug to reveal a trapdoor, open the trapdoor and enter the basement. Finally, in the basement, the agent encounters a troll which it must kill using the sword. A bad action at any stage prevents the agent from reaching the troll, whereas a wrong action when encountering the troll may result in termination, i.e., death.
In our setting, we consider a dictionary of $d$ unique words, extracted from a walkthrough of actions which solves the game, i.e., a demonstrated sequence of actions (sentences) used to solve the game. We limit the maximal sentence length to $l$ words. Thus, the number of possible, unordered, word combinations is roughly $d^l / l!$, i.e., the dictionary size to the power of the maximal sentence length, divided by the number of possible permutations. This results in approximately 10 million possible actions.
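The count above can be sketched numerically. The dictionary size and maximum length below are illustrative stand-ins, since the exact values are not given in this excerpt:

```python
from math import factorial

def num_unordered_actions(dict_size, max_len):
    """Number of unordered word combinations of length 1..max_len:
    dict_size**l ordered sequences divided by the l! orderings."""
    return sum(dict_size ** l // factorial(l) for l in range(1, max_len + 1))

# Illustrative stand-in values (the exact dictionary size is not given here):
n_actions = num_unordered_actions(120, 4)   # 8,935,320 — on the order of 10 million
```

With a dictionary of 120 words and sentences of up to 4 words, the count already reaches the order of $10^7$, matching the scale reported above.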
Markov Decision Process (MDP):
Text-based games can be modeled as Markov Decision Processes. An MDP is defined by the tuple $(S, A, R, P)$ (Sutton and Barto, 1998). In the context of text-based games, $S$ is the set of states, each a paragraph representing the current observation. $A$ is the set of available discrete actions, e.g., all combinations of words from the dictionary up to a maximal given sentence length $l$. $R$ is the bounded reward function; for instance, collecting items provides a positive reward. $P$ is the transition matrix, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to $s'$ assuming action $a$ was taken.

Action Space: While the common approach may be to consider a discrete action space, such an approach may be infeasible here, as the complexity of solving the MDP is related to the effective action-space size. Hence, in this work, we consider an alternative, continuous representation. As each action is a sentence composed of words, we represent each action using the sum of the embeddings of its tokens, or constitutive words, denoted by $a^{SoE}$ (Sum of Embeddings). A simple form of embedding is the Bag of Words (BoW), which represents a word using a one-hot vector the size of the dictionary, in which the dictionary index of the word is set to 1. Aside from the BoW embedding, there exist additional forms of embedding vectors, for instance Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which encode the similarity between words (in terms of cosine distance). These embeddings are pre-trained using unsupervised learning techniques and, similarly to how convolutional neural networks enable generalization across similar states, word embeddings enable generalization across similar sentences, i.e., actions.
In this work, we utilize GloVe embeddings, pre-trained on the Wikipedia corpus. We chose GloVe over Word2vec as there exist pre-trained embeddings in a low-dimensional space; in our experiments, the embedding dimensionality is significantly smaller than the size of the dictionary. Given the continuous representation of an action, namely the sum of embeddings of the sentence tokens $a^{SoE}$, the goal is to recover the corresponding discrete action, that is, the tokens composing the sentence. These may be represented as a BoW vector $a^{BoW}$. Recovering the sentence from the BoW vector requires prior information on the language model.
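As a toy illustration of the linear relation between a BoW vector and its sum-of-embeddings representation (random vectors stand in for GloVe embeddings, and the vocabulary is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["take", "egg", "open", "door"]   # toy dictionary (hypothetical)
E = rng.standard_normal((3, len(vocab)))  # columns = 3-dimensional toy embeddings

def bow(words):
    """Integer Bag-of-Words vector over the toy vocabulary."""
    z = np.zeros(len(vocab), dtype=int)
    for w in words:
        z[vocab.index(w)] += 1
    return z

z = bow(["take", "egg"])
y = E @ z                                 # sum-of-embeddings action representation
assert np.allclose(y, E[:, 0] + E[:, 1])  # 'take egg' = emb('take') + emb('egg')
```

The sum of embeddings is thus a linear measurement $y = Ez$ of the sparse BoW vector $z$, which is exactly the structure compressed sensing exploits in Section 4.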
Provided a set of words, the goal of a language model (the last element in Figure 1, and a central piece in many important NLP tasks) is to output the most likely ordering which yields a grammatically correct sentence. In this paper, we use a rule-based approach with relatively simple rules. For example, given a verb and an object, the verb comes before the object, e.g., [‘sword’, ‘take’] → ‘take sword’.
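The rule-based ordering can be sketched minimally; the verb list here is hypothetical and the paper's actual rule set is not reproduced:

```python
# Hypothetical verb list; the paper's actual rule set is not reproduced here.
VERBS = {"take", "open", "drop"}

def order_words(words):
    """Order a Bag-of-Words into a command: verbs first, then objects."""
    verbs = [w for w in words if w in VERBS]
    objects = [w for w in words if w not in VERBS]
    return " ".join(verbs + objects)

assert order_words(["sword", "take"]) == "take sword"
```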
To conclude, we train a neural network to predict the sum of embeddings $a^{SoE}$. Using CS (Section 4), we recover the BoW vector $a^{BoW}$, i.e., the set of words which compose the sentence. Finally, a language model M converts $a^{BoW}$ into a valid discrete action, namely an ordered sentence. The combined approach is as follows:
4 Compressed sensing
This section provides background on CS and sparse recovery, including practical recovery algorithms and theoretical recovery guarantees. In particular, we describe our variant of one popular reconstruction algorithm, OMP, which we refer to as Integer K-OMP (IK-OMP). The first modification exploits the integer prior on the sparse vector and is inspired by White et al. (2016) and Sparrer and Fischer (2015b). The second mitigates the greedy nature of OMP using beam search (Lin et al., 2013). The experiments presented at the end of this section compare different sparse recovery approaches and demonstrate the benefit of introducing the integer prior and the beam search strategy.
4.1 Sparse Recovery
CS is concerned with recovering a high-dimensional sparse signal (the BoW vector $z$ in our setting) from a low-dimensional measurement vector (the sum-of-embeddings vector $y$). That is, given a dictionary $D$:

$$\min_{z} \|z\|_0 \quad \text{s.t.} \quad y = Dz \qquad (1)$$
To ensure uniqueness of the solution of Eq. 1, the sensing matrix, or dictionary, must fulfill certain properties; these are key to providing practical recovery guarantees as well. Well-known such properties are the spark, or Kruskal rank (Donoho and Elad, 2003), and the Restricted Isometry Property (Candes and Tao, 2005, RIP). Unfortunately, these are typically as hard to compute as solving the original problem of Eq. 1. While the mutual-coherence (see Definition 1) provides looser bounds, it is easily computable. Thus, we focus on mutual-coherence based results, and note that spark- and RIP-based guarantees may be found in Elad (2010).
Definition 1 (Elad (2010), Definition 2.3)
The mutual coherence of a given matrix $D$ is the largest absolute normalized inner product between different columns from $D$. Denoting the $i$-th column in $D$ by $d_i$, it is given by

$$\mu(D) = \max_{i \neq j} \frac{|d_i^{\top} d_j|}{\|d_i\|_2 \|d_j\|_2}.$$

The mutual-coherence characterizes the dependence between columns of the matrix $D$. For a unitary matrix, columns are pairwise orthogonal, and as a result, the mutual-coherence is zero. For general matrices with more columns than rows, as in our case, $\mu(D)$ is necessarily strictly positive, and we desire the smallest possible value so as to get as close as possible to the behavior exhibited by unitary matrices (Elad, 2010). This is illustrated in the following uniqueness theorem.

Theorem 1 (Elad (2010), Theorem 2.5)
If a system of linear equations $Dz = y$ has a solution $z$ obeying $\|z\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, this solution is the sparsest possible.
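The mutual coherence of Definition 1 translates directly into a few lines of code:

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute normalized inner product between distinct columns of D."""
    Dn = D / np.linalg.norm(D, axis=0)   # normalize each column
    G = np.abs(Dn.T @ Dn)                # Gram matrix of the normalized columns
    np.fill_diagonal(G, 0.0)             # ignore the trivial self-products
    return G.max()

assert mutual_coherence(np.eye(3)) < 1e-12   # unitary matrix: zero coherence
```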
We now turn to discuss practical methods for solving Eq. 1.
4.2 Recovery Algorithms
The sparse recovery problem of Eq. 1 is non-convex due to the $\ell_0$ norm. Although it may be solved via combinatorial search, the complexity of exhaustive search is exponential in the dictionary dimension, and it has been proven that Eq. 1 is, in general, NP-Hard (Elad, 2010).
One approach to solving Eq. 1, known as basis pursuit (BP), relaxes the $\ell_0$ minimization to its $\ell_1$-norm convex surrogate,

$$\min_{z} \|z\|_1 \quad \text{s.t.} \quad y = Dz. \qquad (2)$$

In the presence of noise, the condition $y = Dz$ is replaced by $\|y - Dz\|_2 \le \epsilon$. The Lagrangian relaxation of this quadratic program is written, for some $\lambda > 0$, as

$$\min_{z} \; \lambda \|z\|_1 + \tfrac{1}{2}\|y - Dz\|_2^2,$$

and is known as basis pursuit denoising (BPDN).
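As a concrete illustration, the BPDN objective can be minimized with plain iterative soft-thresholding; this is a minimal sketch, not the FISTA implementation used later in the experiments (FISTA adds a momentum term to the same proximal-gradient update):

```python
import numpy as np

def ista(D, y, lam, n_iter=500):
    """Minimize lam * ||z||_1 + 0.5 * ||y - D z||_2^2 by iterative
    soft-thresholding (plain ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = z - (D.T @ (D @ z - y)) / L    # gradient step on the quadratic term
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # soft threshold
    return z
```

For an orthogonal dictionary the iteration collapses to a single soft-thresholding of the measurements, which is a handy sanity check.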
The above noiseless and noisy problems can be respectively cast as linear programming and second-order cone programming problems (Chen et al., 2001). They may thus be solved using techniques such as interior-point methods (Ben-Tal and Nemirovski, 2001; Boyd and Vandenberghe, 2004). However, large-scale problems involving dense sensing matrices often preclude the use of such methods. This has motivated the search for simpler gradient-based algorithms for solving Eq. 2, such as the fast iterative shrinkage-thresholding algorithm (Beck and Teboulle, 2009, FISTA). Alternatively, one may use greedy methods, broadly divided into matching-pursuit-based algorithms, such as OMP (Blumensath and Davies, 2008), and thresholding-based methods, including iterative hard thresholding (Blumensath and Davies, 2009). The popular OMP algorithm, reported in the Appendix, proceeds by iteratively finding the dictionary column with the highest correlation to the signal residual, computed by subtracting the contribution of a partial estimate of $z$ from $y$. The coefficients over the selected support set are then chosen so as to minimize the residual error. A typical halting criterion compares the residual to a predefined threshold.

4.3 Recovery Guarantees
Performance guarantees for both relaxation and greedy methods have been provided in the CS literature. In noiseless settings, under the conditions of Theorem 1, the unique solution of Eq. 1 is also the unique solution of Eq. 2 (Elad, 2010, Theorem 4.5). Under the same conditions, OMP with its halting-criterion threshold set to zero is guaranteed to find the exact solution of Eq. 1 (Elad, 2010, Theorem 4.3). More practical results are given for the case where the measurements are contaminated by noise (Donoho et al., 2006; Elad, 2010; Eldar and Kutyniok, 2012).
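For concreteness, the OMP procedure described in Section 4.2 can be sketched as follows. This is a simplified version: roughly normalized columns and a residual-threshold halting rule are assumptions, and it is not the Appendix listing itself:

```python
import numpy as np

def omp(D, y, eps=1e-6, max_iter=None):
    """Orthogonal Matching Pursuit sketch: greedily pick the column most
    correlated with the residual, then refit the coefficients on the
    selected support by least squares."""
    m, d = D.shape
    support, z = [], np.zeros(d)
    r = y.copy()
    for _ in range(max_iter or m):
        if np.linalg.norm(r) <= eps:            # halting criterion on the residual
            break
        j = int(np.argmax(np.abs(D.T @ r)))     # most correlated column
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        z = np.zeros(d)
        z[support] = coef                       # least-squares refit on the support
        r = y - D @ z
    return z
```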
4.4 Integer K-OMP (IK-OMP)
An Integer Prior: While CS is typically concerned with the reconstruction of a sparse real-valued signal, in our BoW linear representation the signal fulfills a secondary structure constraint besides sparsity: its non-zero entries stem from a finite, or discrete, alphabet. Such prior information on the original signal appears in many communication scenarios (Candes et al., 2005; Axell et al., 2012; Rossi et al., 2014), where the transmitted data originates from a finite set.

A few works have adapted CS algorithms to the recovery of integer-valued vectors. It has been shown that discreteness constraints on the possible values of the reconstructed signal may significantly reduce the number of required measurements (Keiper et al., 2017), or alternatively increase recovery performance for a given number of measurements. Basis pursuit has been adapted to the recovery of sparse integer vectors in Keiper et al. (2017); however, it is restricted to the recovery of binary and ternary signals. Besides, BP's running time becomes prohibitively high with large dictionaries, and we are not aware of any FISTA variant for integer-valued signal recovery.
In Sparrer and Fischer (2015a), the discrete prior is exploited to initialize the support set for OMP. Later, Sparrer and Fischer (2015b) considered OMP in connection with quantization and soft feedback. Unfortunately, the above approaches require knowledge of the measurement noise variance. Here, we adopt an approach similar to the greedy addition phase in the selection algorithm of White et al. (2016): in each iteration, we find the dictionary column closest to the residual in terms of mean square error and increment the corresponding entry of the recovered vector by one. This method enjoys the same computational complexity as OMP, and we empirically found that it outperforms simply adding a quantization step to the original OMP iteration.

Beam Search OMP: As OMP iteratively adds atoms to the recovered support, the choice of a new element in an iteration is blind to its effect on future iterations. Therefore, any mistakes, particularly in early iterations, may lead to large recovery errors. To mitigate this phenomenon, several methods have been proposed to amend the OMP algorithm.
To decrease the greediness of the greedy addition algorithm (which acts similarly to OMP), White et al. (2016) use a substitution-based method, also referred to as swapping (Andrle and Rebollo-Neira, 2006) in the CS literature. Unfortunately, the computational complexity of this substitution strategy makes it impractical. Elad and Yavneh (2009) combine several recovered sparse representations, to improve denoising, by randomizing the OMP algorithm. However, in our case, the sum of embeddings represents a single true sparse BoW vector, so combining several recovered vectors should not lead to the correct solution.
The look-ahead OMP (LA-OMP) (Chatterjee et al., 2011) uses a multi-path OMP procedure that, in each iteration, considers several potential elements and evaluates the effect of selecting each one on the final residual. However, this algorithm may still discard the correct path early and incur serious recovery errors. In addition, this approach is computationally demanding: it multiplies the per-iteration cost of OMP by the number of potential candidates evaluated at each step. We instead adopt the beam search approach of Lin et al. (2013), K-best OMP, which enjoys a lower computational complexity. This method extends and preserves multiple search paths simultaneously, so the probability of finding the correct locations of the non-zero elements is much improved.
IK-OMP: We combine the integer prior with the beam search strategy and propose IK-OMP (Algorithm 1). In the algorithm description, $e_i$ denotes the vector with a single non-zero element at index $i$, and the top-$K$ operator selects the $K$ candidates attaining the smallest reconstruction score. In this work, the selected BoW is the candidate which minimizes the reconstruction score.
[Algorithm 1: IK-OMP]
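The following is not the paper's Algorithm 1 verbatim, but a compact sketch consistent with the description above: integer unit increments of candidate BoW vectors combined with a beam of the best partial reconstructions. Variable names are hypothetical.

```python
import numpy as np

def ik_omp(D, y, max_len, beam=3):
    """Sketch of IK-OMP: maintain a beam of integer-valued candidate BoW vectors.
    Each round extends every candidate by incrementing one dictionary entry
    (the integer prior), keeps the `beam` candidates with the smallest residual
    ||y - D z||_2, and finally returns the best reconstruction seen."""
    d = D.shape[1]

    def score(z):
        return np.linalg.norm(y - D @ z)

    candidates = [np.zeros(d, dtype=int)]
    best = candidates[0]
    for _ in range(max_len):
        expanded = []
        for z in candidates:
            for j in range(d):
                z_new = z.copy()
                z_new[j] += 1              # integer prior: entries grow in unit steps
                expanded.append(z_new)
        expanded.sort(key=score)
        candidates = expanded[:beam]       # beam search over partial supports
        if score(candidates[0]) < score(best):
            best = candidates[0]
    return best
```

On a toy orthogonal dictionary this recovers repeated words (entries larger than one) that plain OMP's binary support selection cannot express.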
4.5 Compressed Sensing Experiments
In this section, we focus on comparing several CS approaches. To do so, we follow the set of commands, extracted from a walkthrough of the game, required to solve Zork1, in both the ‘Troll Quest’ and ‘Open Zork’ domains. In each state $s$, we take the ground-truth action $a$, calculate the sum of word embeddings $a^{SoE}$, add noise, and test the ability of the various CS methods to reconstruct the BoW vector $a^{BoW}$. We compare the runtime (Table 1), as well as the reconstruction accuracy (the number of actions reconstructed correctly) and the reward gained in the presence of noise (Figure 3). Specifically, the measured action is $y = a^{SoE} + n$, where the noise $n$ is normalized based on the signal-to-noise ratio (SNR).
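For illustration, additive noise scaled to a target SNR can be generated as follows; the decibel amplitude convention used here is an assumption, as the paper's exact normalization is not stated:

```python
import numpy as np

def add_noise(y, snr_db, rng):
    """Additive Gaussian noise scaled so that the amplitude ratio ||y|| / ||n||
    matches a target SNR given in decibels (one common convention)."""
    n = rng.standard_normal(y.shape)
    n *= np.linalg.norm(y) / (np.linalg.norm(n) * 10.0 ** (snr_db / 20.0))
    return y + n
```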
We compare four CS methods: the FISTA implementation of BP, OMP, IK-OMP (Algorithm 1), and a deep learning variant we deem DeepCS, described below. The dictionary is composed of the $d$ possible words which can be used in the game, the embedding dimension is smaller than $d$, and the sentence length is limited to at most $l$ words. This yields a total of roughly 10 million actions, from which the agent must choose one at each step. It is important to note that while accuracy and reward might seem similar, an inaccurate reconstruction at an early stage results in an immediate failure, even when the overall accuracy seems high.

Clearly, as seen in Figure 3, OMP fails to reconstruct the true BoW vectors, even in the noiseless scenario. Indeed, the mutual-coherence (Definition 1) of our dictionary is large, so Theorem 1 provides no guarantee that OMP can reconstruct any but trivially sparse vectors. However, our suggested approach, IK-OMP, is capable of correctly reconstructing the original action, even in the presence of relatively large noise. This gives evidence that the integer prior, in particular, and the beam search strategy significantly improve the sparse recovery performance.
Deep Compressed Sensing: Besides traditional CS methods, it is natural to test the ability of deep learning methods to perform this task. In this approach, we train a neural network to predict the BoW vector underlying the continuous embedding vector. Our network is a multi-layer perceptron (MLP) composed of two hidden layers, 100 neurons each. We use a sigmoid activation function to bound the outputs to [0, 1] and train the network using a binary cross-entropy loss.

Our results, presented in Figure 4, show that the DeepCS approach works when no noise is present; however, once noise is added to the setup, it is clear that DeepCS performs poorly compared to classic CS methods such as IK-OMP.¹ Besides, as DeepCS requires training a new model for each domain, it is data-specific and does not transfer easily, which is not the case with traditional CS methods.

¹ In the Appendix, we provide an additional experiment which shows that in the ‘Troll Quest’, DeepCS remains competitive as the domain is relatively small. We also provide experiments with a much larger dictionary, and show that IK-OMP is capable of scaling.
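A sketch of the DeepCS forward pass and loss under the stated architecture. Only the layer sizes and output nonlinearity are specified above; the hidden activation (ReLU here) is our assumption:

```python
import numpy as np

def deep_cs_forward(x, params):
    """Forward pass of a DeepCS-style MLP: two hidden layers (ReLU assumed)
    and a sigmoid output layer bounding each predicted BoW entry to [0, 1]."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return 1.0 / (1.0 + np.exp(-(W3 @ h2 + b3)))   # sigmoid keeps outputs in (0, 1)

def bce_loss(p, z):
    """Binary cross-entropy between predicted probabilities p and 0/1 BoW targets z."""
    eps = 1e-9                                      # numerical safety for log(0)
    return -np.mean(z * np.log(p + eps) + (1.0 - z) * np.log(1.0 - p + eps))
```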
5 Imitation Learning
In this section, we present our Sparse-IL algorithm and provide in-depth details regarding the design and implementation of each of its underlying components. We also detail the experiments of executing Sparse-IL on the entire game of Zork.
Sparse Imitation Learning: Our Sparse-IL architecture is composed of two major components, the Encoder and the Retrieval Mechanism (as seen in Figure 1). Each component has a distinct role, and combining them enables a computationally efficient approach. (1) The Encoder E, given a state $s$, predicts a sum of word embeddings $a^{SoE}$. (2) The Retrieval Mechanism R converts the continuous action embedding into the most probable BoW representation $a^{BoW}$ using a CS algorithm such as those of Section 4, e.g., IK-OMP; this BoW representation is then provided to the Language Model M, which returns the most probable ordering of the words (a BoW sentence). We will now discuss each of these components in more detail.
The Encoder (E) is a neural network trained to output the optimal action representation at each state. As we consider the task of imitation learning, this is performed by minimizing the loss between the Encoder's output and the embedding of the action provided by the expert:
$$\min_{\theta} \; \big\| E_\theta(s) - a^{SoE}_{\text{expert}}(s) \big\|_2^2$$
In all of the learning experiments, the architecture we use is a convolutional neural network (CNN) suited to NLP tasks (Kim, 2014). Due to the structure of the game, there exist long-term dependencies. Frame-stacking, a common approach in games (Mnih et al., 2015), tackles this issue by providing the network with the N previous states. For the “Open Zork” task, we stack the previous 12 states; in the “Troll Quest”, we provide only the current state.
Retrieval Mechanism (R): The output of the Encoder, $E(s)$, is fed into a CS algorithm such as IK-OMP. IK-OMP produces $K$ candidate actions. These actions are fed into a fitness function which ranks them based on the reconstruction score (see Section 4) and returns the optimal candidate. Other CS approaches, e.g., OMP and FISTA, return a single candidate action.
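The ranking step of the Retrieval Mechanism can be sketched minimally, assuming the reconstruction score is the $\ell_2$ residual used in Section 4:

```python
import numpy as np

def select_best(D, y, candidates):
    """Fitness function sketch: rank the K candidate BoW vectors by the
    reconstruction score ||y - D z||_2 and return the best candidate."""
    return min(candidates, key=lambda z: np.linalg.norm(y - D @ z))
```

As Remark 1 later notes, this naive score could be swapped for a learned evaluation network without changing the surrounding pipeline.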
5.1 Experiments
In an imitation learning setup, we are given a data set of state-action pairs provided by an expert; the goal is to learn a policy that achieves the best performance possible. We achieve this by training the embedding network to imitate the demonstrated actions in the embedding space, namely $a^{SoE}$, at each state $s$, using the MSE between the predicted and demonstrated actions. We consider three setups: (1) perfect demonstrations, where we test errors due to architecture capacity and function approximation; (2) Gaussian noise (see Section 4.5); and (3) discrete-action noise, in which, with some probability, a random incorrect action is demonstrated. The latter experiment can be seen as learning from demonstrations provided by an ensemble of sub-optimal experts.
Our results are shown in Figure 6. By combining CS with imitation learning techniques, our approach is capable of solving the entire game of Zork1, even in the presence of discrete-action noise. In all our experiments, IK-OMP outperforms the various baselines², including the end-to-end approach, i.e., DeepCS2, which is trained to predict the BoW embedding directly from the state $s$.

² We provide similar results on the ‘Troll Quest’ in the Appendix.
Training: Analyzing the training graph presents an interesting picture: during the training process, the output of the Encoder can be seen as a noisy estimation of $a^{SoE}$. As training progresses, the effective SNR of this noise decreases, which is reflected in the increase in reconstruction performance.
Generalization: In Figure 6, we present the generalization capabilities that our method, Sparse-IL, enjoys due to the use of pre-trained unsupervised word embeddings. The heatmap shows two forms of noise. The first, as before, is the probability of receiving a bad demonstration, i.e., an incorrect action. The second, the synonym probability, is the probability of being presented with a correct action that is composed of different words; e.g., drop, throw and discard result in an identical action in the environment and have a similar meaning. These results clearly show that Sparse-IL outperforms DeepCS2 in nearly all scenarios, highlighting the generalization improvement inherent in the embeddings.
The benefit of meaningful embeddings: In our approach, the Encoder is trained to predict the sum of embeddings $a^{SoE}$. However, it can also be trained to directly predict the BoW vector $a^{BoW}$. While this approach may work, it lacks the generalization ability apparent in embeddings such as GloVe, in which similar words receive similar embedding vectors.
Consider a scenario in which there are 4 optimal actions (e.g., ‘go north’, ‘walk north’, ‘run north’ and ‘move north’) and 1 sub-optimal action (e.g., ‘climb tree’). With probability $p$ we are presented with one of the optimal actions, and with probability $1-p$ the sub-optimal action. In this example, the expected BoW representation would include ‘north’ with probability $p$, ‘climb’ and ‘tree’ with probability $1-p$, and each of ‘go’, ‘walk’, ‘run’ and ‘move’ with probability $p/4$. On the other hand, since ‘go’, ‘walk’, ‘run’ and ‘move’ have similar meanings and, in turn, similar embeddings, the expected $a^{SoE}$ is much closer to the optimal actions than to the sub-optimal one, and thus an imitation agent is less likely to make a mistake.
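A toy numeric version of this argument, with hypothetical two-dimensional embeddings standing in for the real ones:

```python
import numpy as np

base = np.array([1.0, 0.0])                 # shared "motion verb" direction (hypothetical)
synonyms = [base + 0.05 * np.array(d, dtype=float)
            for d in ([0, 0], [0, 1], [1, 0], [-1, 0])]   # go / walk / run / move
climb = np.array([-1.0, 0.5])               # unrelated sub-optimal action

expected = np.mean(synonyms, axis=0)        # expected demonstrated embedding
# The expectation stays close to every synonym yet far from the outlier:
assert all(np.linalg.norm(expected - s) < np.linalg.norm(expected - climb)
           for s in synonyms)
```

In BoW space the same averaging would instead smear probability mass over four unrelated one-hot coordinates, which is exactly the failure mode described above.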
Remark 1
While in this work we used a naive fitness function, namely the reconstruction loss, since it performs a selection over $K$ possible actions, it can be replaced by an evaluation network, e.g., a critic (Dulac-Arnold et al., 2015), and used for policy improvement. Results using the Top-K metric are provided in the Appendix. These results show that the optimal action is contained within the $K$ candidates even under noise large enough to prevent the Top-1 from successfully solving the task, suggesting that IK-OMP can be used to produce a minimal action set, similar to the elimination network of Zahavy et al. (2018).
6 Conclusion
We have presented a computationally efficient algorithm called Sparse Imitation Learning (Sparse-IL) that combines CS with imitation learning to solve text-based games with combinatorial action spaces. We proposed a CS variant of OMP, called Integer K-OMP (IK-OMP), and demonstrated that it can deconstruct a sum of word embeddings into the individual BoW components that make up the embedding, even in the presence of significant noise. In addition, IK-OMP is significantly more computationally efficient than the baseline CS techniques. Combining IK-OMP with imitation learning, our agent is able to solve the ‘Troll Quest’ as well as the entire game of Zork1, which contains a combinatorial action space of approximately 10 million actions, for the first time. Future work includes replacing the fitness function with a critic in order to further improve the learned policy, as well as testing the capabilities of the critic agent in cross-domain tasks.
References
Andrle and Rebollo-Neira [2006] Miroslav Andrle and Laura Rebollo-Neira. A swapping-based refinement of orthogonal matching pursuit strategies. Signal Processing, 86(3):480–495, 2006.
Arora et al. [2018] Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. 2018.
Axell et al. [2012] Erik Axell, Geert Leus, Erik G Larsson, and H Vincent Poor. Spectrum sensing for cognitive radio: State-of-the-art and recent advances. IEEE Signal Processing Magazine, 29(3):101–116, 2012.
Bahdanau et al. [2016] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Ben-Tal and Nemirovski [2001] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.
 Blumensath and Davies [2008] Thomas Blumensath and Mike E Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008.
 Blumensath and Davies [2009] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
 Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 Calderbank et al. [2009] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. preprint, 2009.
 Candes et al. [2005] Emmanuel Candes, Mark Rudelson, Terence Tao, and Roman Vershynin. Error correction via linear programming. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 668–681. IEEE, 2005.
 Candes and Tao [2005] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
 Chatterjee et al. [2011] S. Chatterjee, D. Sundman, and M. Skoglund. Look ahead orthogonal matching pursuit. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4024–4027, 2011.
 Chen et al. [2001] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
 Côté et al. [2018] Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. TextWorld: A learning environment for text-based games. arXiv preprint arXiv:1806.11532, 2018.
 Donoho and Elad [2003] David L Donoho and Michael Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
 Donoho et al. [2006] David L Donoho, Michael Elad, and Vladimir N Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory, 52(1):6–18, 2006.
 Dulac-Arnold et al. [2015] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
 Elad [2010] Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st edition, 2010. ISBN 144197010X, 9781441970107.
 Elad and Yavneh [2009] Michael Elad and Irad Yavneh. A plurality of sparse representations is better than the sparsest one alone. IEEE Transactions on Information Theory, 55(10):4701–4714, 2009.
 Eldar and Kutyniok [2012] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.
 Greene [2008] John O Greene. Action assembly theory. The International Encyclopedia of Communication, 2008.
 He et al. [2016] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1621–1630, 2016.
 Keiper et al. [2017] Sandra Keiper, Gitta Kutyniok, Dae Gwan Lee, and Götz E Pfander. Compressed sensing for finite-valued signals. Linear Algebra and its Applications, 532:570–613, 2017.
 Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
 Lin et al. [2013] P. Lin, S. Tsai, and G. C. Chuang. A k-best orthogonal matching pursuit for compressive sensing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5706–5709, 2013.
 Mikolov et al. [2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Narasimhan et al. [2015] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
 O. Greene [1984] John O. Greene. A cognitive approach to human communication: An action assembly theory. Communication Monographs, 51:289–306, 1984. doi: 10.1080/03637758409390203.
 Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 Preuveneers and Ilie-Zudor [2017] Davy Preuveneers and Elisabeth Ilie-Zudor. The intelligent industry of the future: A survey on emerging trends, research challenges and opportunities in industry 4.0. Journal of Ambient Intelligence and Smart Environments, 9(3):287–298, 2017.
 Pruim [2014] Doug Pruim. Action assembly theory. 2014. doi: 10.13140/2.1.2282.6566.
 Ranzato et al. [2015] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
 Rossi et al. [2014] Marco Rossi, Alexander M Haimovich, and Yonina C Eldar. Spatial compressive sensing for MIMO radar. IEEE Transactions on Signal Processing, 62(2):419–430, 2014.
 Sparrer and Fischer [2015a] S Sparrer and RFH Fischer. MMSE-based version of OMP for recovery of discrete-valued sparse signals. Electronics Letters, 52(1):75–77, 2015a.
 Sparrer and Fischer [2015b] Susanne Sparrer and Robert FH Fischer. Soft-feedback OMP for the recovery of discrete-valued sparse signals. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 1461–1465. IEEE, 2015b.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Tao et al. [2018] Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan, and Layla El Asri. Towards solving text-based games by producing adaptive action spaces. arXiv preprint arXiv:1812.00855, 2018.
 Wen et al. [2015] Zheng Wen, Daniel O’Neill, and Hamid Maei. Optimal demand response using device-based reinforcement learning. IEEE Transactions on Smart Grid, 6(5):2312–2324, 2015.
 White et al. [2016] Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. Generating bags of words from the sums of their word embeddings. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 91–102. Springer, 2016.
 Yuan et al. [2018] Xingdi Yuan, Marc-Alexandre Côté, Alessandro Sordoni, Romain Laroche, Remi Tachet des Combes, Matthew Hausknecht, and Adam Trischler. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.11525, 2018.
 Zahavy et al. [2018] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in neural information processing systems (NeurIPS), 2018.
 Zelinka [2018] Mikuláš Zelinka. Using reinforcement learning to learn how to play text-based games. arXiv preprint arXiv:1801.01999, 2018.
Appendix A Compressed Sensing Algorithms
In Section 4, we presented Orthogonal Matching Pursuit (OMP), a common algorithm in CS, and Integer OMP, an integer-constrained variant. Both algorithms are presented below. In the algorithm description, $a_i$ is the $i$-th column of $A$, and $A_S$ denotes the columns of the matrix $A$ indexed by the set $S$. Similarly, $x_S$ is the vector $x$ reduced to the support $S$, implying that the remaining entries of $x$ are zeros.
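For reference, a minimal NumPy sketch of the basic OMP iteration (greedy atom selection followed by a least-squares refit on the current support) might look as follows; the names `A`, `y`, and `k` are illustrative and are not taken from the paper's pseudocode:

```python
import numpy as np

def omp(A, y, k):
    """Greedy OMP: recover a k-sparse x such that y ≈ A @ x."""
    residual = y.copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(k):
        # Select the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        correlations[support] = -np.inf   # never re-select an atom
        support.append(int(np.argmax(correlations)))
        # Least-squares refit of the coefficients on the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - A @ x
    return x, support
```

Integer OMP follows the same loop but constrains the refit coefficients to integer values.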
Appendix B Experiments
Throughout the paper, all experiments were evaluated across 5 random seeds. The simulations were run on a single Intel i7 machine with a GTX 1080 Ti GPU.
B.1 DeepCS
For clarity, we provide both the ‘Troll Quest’ and ‘Open Zork’ results in the DeepCS setting. These results show that while DeepCS is comparable to IK-OMP on a small problem such as the ‘Troll Quest’, IK-OMP significantly outperforms DeepCS in both accuracy and total reward when scaled up to larger, more complex problems.
B.2 CS with a Larger Dictionary
In addition to the results provided in the paper, we test our approach when the dictionary size is increased by a factor of 10, which enlarges the effective action space accordingly.
The results show that our approach is indeed capable of coping with larger action spaces; however, performance degrades as the noise grows.
B.3 Sparse Imitation Learning
Similar to the results on ‘Open Zork’, we see that IK-OMP outperforms both the standard OMP and FISTA. In addition, these results show that during the training procedure, the output of the Encoder may be viewed as a noisy representation of the original sum of embeddings. As training proceeds, the network converges; hence, the approximation noise shrinks and the various reconstruction schemes perform better.
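This effect can be illustrated with a small synthetic experiment (our own stand-in, not the paper's trained encoder; the embedding matrix, vocabulary size, sentence, and noise levels are all arbitrary, and the embedding dimension is deliberately chosen larger than the vocabulary so that noiseless recovery is exact): model the encoder output as a sum of word embeddings plus Gaussian noise, and count how many words a greedy matching-pursuit-style reconstruction recovers at each noise level.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, length = 300, 250, 3
# Synthetic word-embedding matrix with orthonormal columns (stand-in).
E, _ = np.linalg.qr(rng.standard_normal((d, vocab)))
words = [7, 42, 123]                      # the "sentence" (word indices)
x = np.zeros(vocab)
x[words] = 1.0                            # bag-of-words indicator

results = []
for sigma in (0.0, 0.1, 1.0):
    # Encoder output modeled as the sum of embeddings plus Gaussian noise.
    y = E @ x + sigma * rng.standard_normal(d)
    residual, picked = y.copy(), []
    for _ in range(length):               # greedy matching-pursuit recovery
        scores = E.T @ residual
        scores[picked] = -np.inf          # never re-pick a word
        j = int(np.argmax(scores))
        picked.append(j)
        residual -= E[:, j]               # coefficients are 0/1 here
    results.append(len(set(picked) & set(words)))

print(results)  # words recovered per noise level
```

With orthonormal columns, the noiseless case recovers the full bag of words exactly; as sigma grows, spurious atoms start to dominate the correlation scores and recovery degrades, mirroring the SNR effect discussed above.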
B.4 Top-K Experiment
While our analysis focused on the pure imitation setting, and thus on Top-1 accuracy, it is also important to consider Top-K accuracy. Under the Top-K measure, a prediction is deemed correct if the true action is one of the K candidates offered by the CS algorithm. Since IK-OMP produces multiple candidate actions, unlike OMP and FISTA, we expect a significant improvement under this measure.
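The Top-K measure described above can be sketched as a short helper (the function name and the example actions below are illustrative, not from the paper's code):

```python
def top_k_accuracy(candidates, truths, k):
    """Fraction of examples whose true action appears among the
    top-k ranked candidates returned by the reconstruction algorithm."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(candidates, truths))
    return hits / len(truths)

# Hypothetical example: two states, each with a ranked candidate list.
candidates = [["go north", "open door", "take sword"],
              ["take sword", "go north", "open door"]]
truths = ["open door", "climb tree"]
print(top_k_accuracy(candidates, truths, 1))  # 0.0: neither truth ranked first
print(top_k_accuracy(candidates, truths, 2))  # 0.5: first truth is in the top 2
```

Top-1 accuracy is recovered as the special case k = 1.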
Troll Quest
The empirical results show that the Top-K measure indeed yields a dramatic increase in performance. Even in the presence of very large noise, the true action is among the K candidates reconstructed by IK-OMP.
This result is important, as it highlights the ability of our approach to incorporate self-play. Such an approach can be viewed as similar to the action-elimination approach of Zahavy et al. [2018]. While Zahavy et al. [2018] reduce the effective action set using feedback from the environment, in our approach the imperfect demonstrations reduce the effective action set, shrinking the MDP to a smaller, solvable problem.