“A priest, a rabbi, and a minister walk into a bar. The bartender opens his mouth to speak, but can’t find the words. Slowly he cracks a smile, looks around and says… ‘Ugh, what was I going to say?’ Such a large action space…”
While we might chuckle at a story like this, there is more here than just a joke. If we look carefully at what happened, we notice that the bartender of our story stumbled on his words (Pruim, 2014), while thinking about how to start the joke. Such cognitive-behavior breakdowns often occur when there are many actions (in the above joke, actions are possible responses) to choose from, as described in the Action-Assembly Theory (O. Greene, 1984, AAT). According to Greene, behavior is described by two essential processes: representation and processing. Representation refers to the way information is coded and stored in the mind, whereas processing refers to the mental operations performed to retrieve this information (Greene, 2008). Having good representations of information and an efficient processing procedure allows us to quickly exploit highly rewarding nuances of an environment upon first discovery.
However, learning good representations and developing efficient action selection procedures is a major computational challenge for artificially intelligent agents. This is especially problematic when dealing with combinatorial action spaces, since the agent needs to explore all possible action interactions. Combinatorial action spaces are prevalent in many domains, including natural language generation (Ranzato et al., 2015; Bahdanau et al., 2016), industry 4.0 (e.g., Internet of Things) (Preuveneers and Ilie-Zudor, 2017), electric grids (Wen et al., 2015) and more. For example, in text-based games (Côté et al., 2018), given a dictionary D with |D| entries (words), the number of possible sentences of length N, namely the size of the action space, is |D|^N.
In this work we propose the first computationally efficient algorithm (see Figure 1), called Sparse Imitation Learning (Sparse-IL), which is inspired by AAT and combines imitation learning with a Compressed Sensing (CS) retrieval mechanism to solve text-based games with combinatorial action spaces. Our approach is composed of:
(1) Encoder - the encoder receives a state as input (Figure 1). The state is represented by the word embeddings of its individual words, previously trained on a large corpus of text. We train the encoder, using imitation learning, to generate a continuous action (a dense representation of the action). The action
corresponds to a sum of word embeddings that indicates the action that the agent intends to take, e.g., the embedding of the action ‘take egg’ is the sum of the word embedding vectors of ‘take’ and ‘egg’. As the embeddings capture a prior, i.e., similarity, over language, it enables improved generalization and robustness to noise when compared to an end-to-end approach.
(2) Retrieval Mechanism - given the continuous output of the encoder, we reconstruct the K best Bag-of-Words (BoW) actions, each composed of up to N words. We do this using an algorithm that we term Integer K-Orthogonal Matching Pursuit (IK-OMP). We then use a fitness function to score the K candidate actions, after which the best action is fed into a language model to yield an action sentence that can be parsed by the game.
Main contributions: We propose a computationally efficient algorithm called Sparse-IL that combines CS with imitation learning to solve natural language tasks with combinatorial action spaces. We show that IK-OMP, which we adapted from White et al. (2016) and Lin et al. (2013), can be used to recover a BoW vector from a sum of the individual word embeddings in a computationally efficient manner, even in the presence of significant noise. We demonstrate that Sparse-IL can solve the entire game of Zork, for the first time, while considering a combinatorial action space of approximately 10 million actions, using noisy, imperfect demonstrations.
This paper is structured as follows: Section 2 details relevant related work. Section 3 provides an overview of the problem setting; that is, the text-based game of Zork and the challenges it poses. Section 4 provides an overview of CS algorithms and, in particular, our variant called IK-OMP. This section also includes experiments in the text-based game Zork highlighting the robustness of IK-OMP to noise and its computational efficiency. Section 5 introduces our Sparse-IL algorithm and showcases the experiments of the algorithm solving the ‘Troll Quest’ and the entire game of Zork.
2 Related work
Combinatorial action spaces in text-based games: Previous works have suggested approaches for solving text-based games (Narasimhan et al., 2015; He et al., 2016; Yuan et al., 2018; Zahavy et al., 2018; Zelinka, 2018; Tao et al., 2018). However, these techniques do not scale to combinatorial action spaces. For example, He et al. (2016) presented the DRRN architecture, which requires each action to be evaluated by the network, i.e., one forward pass per action. Zahavy et al. (2018) proposed the Action-Elimination DQN, resulting in a smaller admissible action set. However, this set may still be of exponential size.
CS and embeddings representation: CS was originally introduced to the Machine Learning (ML) world by Calderbank et al. (2009), who proposed the concept of compressed learning, that is, learning directly in the compressed domain, e.g., the embeddings domain in the Natural Language Processing (NLP) setting. The task of generating a BoW from the sum of its word embeddings was first formulated by White et al. (2016). A greedy approach, very similar to orthogonal matching pursuit (OMP), was proposed to iteratively find the words. However, this recovery task was only explicitly linked to the field of CS two years later, by Arora et al. (2018).
3 Problem setting
Zork - A text-based game:
Text-based games (Côté et al., 2018) are complex interactive games usually played through a command line terminal. An example of Zork1, a text-based game, is shown in Figure 2. In each turn, the player is presented with several lines of text which describe the state of the game, and the player acts by entering a text command. In order to cope with complex commands, the game is equipped with an interpreter which deciphers the input and maps it to in-game actions. For instance, in Figure 2, a command “climb the large tree” is issued, after which the player receives a response. In this example, the response explains that up in the tree is a collectible item - a jewel encrusted egg. The large, combinatorial action space is one of the main reasons Zork poses an interesting research problem. The actions are issued as free-text and thus the complexity of the problem grows exponentially with the size of the dictionary in use.
Our setup: In this work, we consider two tasks: the ‘Troll Quest’ (Zahavy et al., 2018) and ‘Open Zork’, i.e., solving the entire game. The ‘Troll Quest’ is a sub-task within ‘Open Zork’, in which the agent must enter the house, collect a lantern and sword, move a rug to reveal a trapdoor, open the trapdoor and enter the basement. Finally, in the basement, the agent encounters a troll which it must kill using the sword. A wrong action at any stage will prevent the agent from reaching the troll, and a wrong action when encountering the troll may result in termination, i.e., death.
In our setting, we consider a dictionary D of unique words, extracted from a walk-through of actions which solve the game, i.e., a demonstrated sequence of actions (sentences) used to solve the game. We limit the maximal sentence length to N words. Thus, the number of possible, unordered, word combinations is |D|^N / N!, i.e., the dictionary size to the power of the maximal sentence length, divided by the number of possible permutations. This results in approximately 10 million possible actions.
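The counting above can be sketched numerically. The dictionary size and sentence length used below are illustrative stand-ins, not the paper's exact values:

```python
from math import factorial

def num_ordered_sentences(dict_size: int, length: int) -> int:
    # |D|^N ordered sentences of exactly `length` words
    return dict_size ** length

def approx_unordered_sentences(dict_size: int, length: int) -> int:
    # |D|^N / N! -- the approximation used when word order is ignored
    return dict_size ** length // factorial(length)

# Illustrative values (assumed): a 112-word dictionary with 4-word
# sentences already yields millions of unordered combinations.
print(approx_unordered_sentences(112, 4))
```

Even a modest dictionary therefore produces an action space far too large to enumerate at every step.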
Markov Decision Process (MDP):
Text-based games can be modeled as Markov Decision Processes. An MDP is defined by the tuple (S, A, R, P) (Sutton and Barto, 1998). In the context of text-based games, S is the set of states, each a paragraph representing the current observation. A is the set of available discrete actions, e.g., all combinations of words from the dictionary up to a maximal given sentence length N. R is the bounded reward function; for instance, collecting items provides a positive reward. P is the transition matrix, where P(s'|s, a) is the probability of transitioning from state s to state s', assuming action a was taken.
Action Space: While the common approach may be to consider a discrete action space, such an approach may be infeasible, as the complexity of solving the MDP grows with the effective action space size. Hence, in this work, we consider an alternative, continuous representation. As each action is a sentence composed of words, we represent each action using the sum of the embeddings of its tokens, or constitutive words, denoted SoE (Sum of Embeddings). A simple form of embedding is the Bag of Words (BoW), which represents each word as a one-hot vector the size of the dictionary, in which the dictionary index of the word is set to 1. Aside from the BoW embedding, there exist additional forms of embedding vectors, for instance Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which encode the similarity between words (in terms of cosine distance). These embeddings are pre-trained using unsupervised learning techniques and, similarly to how convolutional neural networks enable generalization across similar states, word embeddings enable generalization across similar sentences, i.e., actions.
In this work, we utilize GloVe embeddings, pre-trained on the Wikipedia corpus. We chose GloVe over Word2vec, as there exist pre-trained GloVe embeddings in low dimensional spaces. In our experiments, the embedding dimension d is significantly smaller than the size of the dictionary |D|. Given the continuous representation of an action, namely the sum of embeddings of the sentence tokens, the goal is to recover the corresponding discrete action, that is, the tokens composing the sentence. These may be represented as a BoW vector. Recovering the ordered sentence from the BoW vector requires prior information on the language model.
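This recovery problem has a linear structure: stacking the word embeddings as columns of a matrix, the sum of embeddings is the product of that matrix with the BoW vector. A minimal numpy sketch, with random vectors standing in for GloVe embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dictionary: n words, each with a d-dimensional embedding
# (stand-ins for GloVe vectors; d << n as in the paper's setting).
n, d = 20, 5
D = rng.standard_normal((d, n))   # columns are word embeddings

# A sparse BoW vector selecting the words of a sentence, e.g. words 3 and 7.
x = np.zeros(n)
x[[3, 7]] = 1

# The continuous action is the sum of the selected embeddings: y = D @ x.
y = D @ x
assert np.allclose(y, D[:, 3] + D[:, 7])
```

Recovering x from y given D is exactly the sparse recovery problem addressed in Section 4.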
The goal of a language model, the last element in Figure 1 and a central piece in many important NLP tasks, is to take a set of words and output their most likely ordering, i.e., a grammatically correct sentence. In this paper, we use a rule-based approach with relatively simple rules. For example, given a verb and an object, the verb comes before the object, e.g., [‘sword’, ‘take’] → ‘take sword’.
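A minimal sketch of such a rule-based ordering step; the verb list here is hypothetical, for illustration only:

```python
# A minimal sketch of a rule-based ordering step: verbs precede objects.
# The word list below is illustrative, not the paper's actual rule set.
VERBS = {"take", "open", "climb", "kill", "move", "go"}

def order_bow(words):
    verbs = [w for w in words if w in VERBS]
    others = [w for w in words if w not in VERBS]
    return " ".join(verbs + others)

print(order_bow(["sword", "take"]))  # -> "take sword"
```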
To conclude, we train a neural network to predict the sum of embeddings of the demonstrated action. Using CS (Section 4), we recover the corresponding BoW vector, i.e., the set of words which compose the sentence. Finally, a language model M converts this BoW vector into a valid discrete action, namely an ordered sentence.
4 Compressed sensing
This section provides some background on CS and sparse recovery, including practical recovery algorithms and theoretical recovery guarantees. In particular, we describe our variant of one popular reconstruction algorithm, OMP, that we refer to as Integer K-OMP (IK-OMP). The first modification allows exploitation of the integer prior on the sparse vector and is inspired by White et al. (2016) and Sparrer and Fischer (2015b). The second mitigates the greedy nature of OMP using beam search Lin et al. (2013). The experiments presented at the end of this section compare different sparse recovery approaches and demonstrate the superiority of introducing the integer prior and the beam search strategy.
4.1 Sparse Recovery
CS is concerned with recovering a high-dimensional, sparse signal x (the BoW vector in our setting) from a low-dimensional measurement vector y (the sum-of-embeddings vector). That is, given a dictionary matrix D:

min_x ||x||_0  subject to  Dx = y.   (1)
To ensure uniqueness of the solution of Eq. 1, the sensing matrix, or dictionary, must fulfill certain properties; these are also key to providing practical recovery guarantees. Well-known such properties are the spark, or Kruskal rank (Donoho and Elad, 2003), and the Restricted Isometry Property (Candes and Tao, 2005, RIP). Unfortunately, these are typically as hard to compute as solving the original problem, Eq. 1. While the mutual-coherence (see Definition 1) provides looser bounds, it is easily computable. Thus, we focus on mutual-coherence based results and note that spark- and RIP-based guarantees may be found in Elad (2010).
Definition 1 (Elad (2010) Definition 2.3)
The mutual coherence μ(D) of a given matrix D is the largest absolute normalized inner product between different columns of D. Denoting the i-th column of D by d_i, it is given by

μ(D) = max_{i ≠ j} |d_i^T d_j| / (||d_i||_2 ||d_j||_2).
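The mutual coherence is straightforward to compute from the Gram matrix of the normalized columns; a short sketch:

```python
import numpy as np

def mutual_coherence(D: np.ndarray) -> float:
    # Largest absolute normalized inner product between distinct columns.
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)          # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)       # ignore self-products
    return float(G.max())

# A unitary (orthogonal) matrix has coherence 0.
assert mutual_coherence(np.eye(4)) == 0.0
```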
The mutual-coherence characterizes the dependence between columns of the matrix D. For a unitary matrix, columns are pairwise orthogonal, and as a result the mutual-coherence is zero. For general matrices with more columns than rows, as in our case, μ(D) is necessarily strictly positive, and we desire the smallest possible value so as to get as close as possible to the behavior exhibited by unitary matrices (Elad, 2010). This is illustrated in the following uniqueness theorem.
Theorem 1 (Elad (2010) Theorem 2.5)
If a system of linear equations Dx = y has a solution x obeying ||x||_0 < (1 + 1/μ(D))/2, this solution is the sparsest possible.
We now turn to discuss practical methods for solving Eq. 1.
4.2 Recovery Algorithms
The sparse recovery problem, Eq. 1, is non-convex due to the ℓ0-norm. Although it may be solved via combinatorial search, the complexity of exhaustive search is exponential in the dictionary dimension, and it has been proven that Eq. 1 is, in general, NP-hard (Elad, 2010).
One approach to solving Eq. 1, known as basis pursuit, relaxes the ℓ0-minimization to its ℓ1-norm convex surrogate,

min_x ||x||_1  subject to  Dx = y.   (2)

In the presence of noise, the condition Dx = y is replaced by ||Dx − y||_2 ≤ ε. The Lagrangian relaxation of this quadratic program is written, for some λ > 0, as min_x ||Dx − y||_2^2 + λ||x||_1, and is known as basis pursuit denoising (BPDN).
The above noiseless and noisy problems can be respectively cast as linear programming and second-order cone programming problems (Chen et al., 2001). They may thus be solved using techniques such as interior-point methods (Ben-Tal and Nemirovski, 2001; Boyd and Vandenberghe, 2004). However, large-scale problems involving dense sensing matrices often preclude the use of such methods. This motivated the search for simpler gradient-based algorithms for solving Eq. 2, such as the fast iterative shrinkage-thresholding algorithm (Beck and Teboulle, 2009, FISTA).
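As an illustration of such gradient-based solvers, here is a minimal ISTA iteration for the BPDN objective; FISTA adds a momentum step on top of this basic iteration, and this sketch is not the paper's implementation:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    # Iterative shrinkage-thresholding for min_x 0.5||Dx - y||^2 + lam*||x||_1.
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + D.T @ (y - D @ x) / L, lam / L)
    return x
```

For an identity dictionary, ISTA reduces to a single soft-thresholding of y, which makes the behavior easy to verify by hand.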
Alternatively, one may use greedy methods, broadly divided into matching pursuit based algorithms, such as OMP (Blumensath and Davies, 2008), and thresholding based methods, including iterative hard thresholding (Blumensath and Davies, 2009). The popular OMP algorithm, reported in the Appendix, proceeds by iteratively finding the dictionary column with the highest correlation to the signal residual, computed by subtracting the contribution of a partial estimate of x from y. The coefficients over the selected support set are then chosen so as to minimize the residual error. A typical halting criterion compares the residual to a predefined threshold.
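A compact sketch of the standard OMP iteration described above (not the paper's exact implementation):

```python
import numpy as np

def omp(D, y, k):
    # Orthogonal Matching Pursuit: greedily pick the column most correlated
    # with the residual, then re-fit coefficients on the chosen support.
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
    return x
```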
4.3 Recovery Guarantees
Performance guarantees for both ℓ1-relaxation and greedy methods have been provided in the CS literature. In noiseless settings, under the conditions of Theorem 1, the unique solution of Eq. 1 is also the unique solution of Eq. 2 (Elad, 2010, Theorem 4.5). Under the same conditions, OMP with halting criterion threshold ε = 0 is guaranteed to find the exact solution of Eq. 1 (Elad, 2010, Theorem 4.3). More practical results are given for the case where the measurements are contaminated by noise (Donoho et al., 2006; Elad, 2010; Eldar and Kutyniok, 2012).
4.4 Integer K-OMP (IK-OMP)
An Integer Prior: While CS is typically concerned with the reconstruction of a sparse real-valued signal, in our BoW linear representation, the signal fulfills a secondary structure constraint besides sparsity. Its nonzero entries stem from a finite, or discrete, alphabet. Such prior information on the original signal appears in many communication scenarios Candes et al. (2005); Axell et al. (2012); Rossi et al. (2014), where the transmitted data originates from a finite set.
A few works have adapted CS algorithms to the recovery of integer-valued vectors. It has been shown that discreteness constraints on the possible values of the reconstructed signal may significantly reduce the number of required measurements (Keiper et al., 2017), or alternatively increase recovery performance for a given number of measurements. BP has been adapted to the recovery of sparse integer vectors in Keiper et al. (2017); however, that approach is restricted to the recovery of binary and ternary signals. Moreover, BP's running time becomes prohibitively high with large dictionaries, and the authors are not aware of any FISTA variant for integer-valued signal recovery.
Sparrer and Fischer (2015b) consider OMP in connection with quantization and soft feedback. Unfortunately, the above approaches require knowledge of the measurement noise variance. Here, we adopt an approach similar to the greedy addition phase in the selection algorithm of White et al. (2016): in each iteration, we find the dictionary column closest to the residual in terms of mean squared error and increment the corresponding entry of the recovered vector by one. This method enjoys the same computational complexity as OMP, and we empirically found that it outperforms simply adding a quantization step to the original OMP iteration.
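A sketch of this integer-greedy step, under the assumption that each iteration simply picks the column closest to the residual and increments the corresponding integer entry:

```python
import numpy as np

def integer_omp(D, y, k):
    # Integer-greedy variant: at each step, pick the column closest to the
    # residual (in Euclidean distance) and increment that entry of x by one.
    # No least-squares refit, since coefficients are constrained to integers.
    x = np.zeros(D.shape[1], dtype=int)
    residual = y.copy()
    for _ in range(k):
        dists = np.linalg.norm(residual[:, None] - D, axis=0)
        j = int(np.argmin(dists))
        x[j] += 1
        residual = residual - D[:, j]
    return x
```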
Beam Search OMP: As OMP iteratively adds atoms to the recovered support, the choice of a new element in an iteration is blind to its effect on future iterations. Therefore, any mistakes, particularly in early iterations, may lead to large recovery errors. To mitigate this phenomenon, several methods have been proposed to amend the OMP algorithm.
To decrease the greediness of the greedy addition algorithm (which acts similarly to OMP), White et al. (2016) use a substitution-based method, also referred to as swapping (Andrle and Rebollo-Neira, 2006) in the CS literature. Unfortunately, the computational complexity of this substitution strategy makes it impractical. Elad and Yavneh (2009) combine several recovered sparse representations, to improve denoising, by randomizing the OMP algorithm. However, in our case, the sum of embeddings represents a single true sparse BoW vector, so combining several recovered vectors should not lead to the correct solution.
The look-ahead OMP (LAOMP) (Chatterjee et al., 2011) uses a multi-path OMP procedure that, in each iteration, considers several potential elements and evaluates the effect of selecting each one on the final residual. However, this algorithm may still discard the correct path early, resulting in serious recovery errors. In addition, this approach is computationally demanding: the complexity of LAOMP is L times that of OMP, where L is the number of potential candidates considered in each iteration. We instead adopt the beam search approach of Lin et al. (2013), K-best OMP, which enjoys a lower computational complexity, with K the number of candidates per beam. This method extends and preserves multiple search paths simultaneously, so that the probability of finding the correct locations of the nonzero elements is much improved.
IK-OMP: We combine the integer prior with the beam search strategy and propose IK-OMP (Algorithm 1). In the algorithm description, e_i denotes the vector with a single nonzero element at index i, and the selection step keeps the K candidates with the smallest residual. In this work, the selected BoW is the candidate which minimizes the reconstruction score.
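A simplified sketch of IK-OMP's beam search, assuming candidates are ranked by residual norm (Algorithm 1 may differ in details):

```python
import numpy as np

def ik_omp(D, y, sentence_len, beam_width):
    # Sketch of IK-OMP: beam search over integer-greedy steps.
    # Each beam is (x, residual); at every step each beam proposes the
    # `beam_width` columns closest to its residual, and the globally best
    # `beam_width` candidates (by residual norm) survive.
    n = D.shape[1]
    beams = [(np.zeros(n, dtype=int), y.copy())]
    for _ in range(sentence_len):
        candidates = []
        for x, r in beams:
            dists = np.linalg.norm(r[:, None] - D, axis=0)
            for j in np.argsort(dists)[:beam_width]:
                x2 = x.copy()
                x2[j] += 1
                candidates.append((x2, r - D[:, j]))
        candidates.sort(key=lambda c: np.linalg.norm(c[1]))
        beams = candidates[:beam_width]
    return [x for x, _ in beams]  # K candidate BoW vectors
```

Returning all surviving candidates, rather than only the best one, is what allows the later fitness function to choose among K reconstructions.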
4.5 Compressed Sensing Experiments
In this section, we focus on comparing several CS approaches. To do so, we follow the set of commands, extracted from a walk-through of the game, required to solve Zork1, in both the ‘Troll Quest’ and ‘Open Zork’ domains. In each state s, we take the ground-truth action a, calculate the sum of word embeddings SoE(a), add noise, and test the ability of various CS methods to reconstruct the BoW vector. We compare the run-time (Table 1), the reconstruction accuracy (number of actions reconstructed correctly) and the reward gained in the presence of noise (Figure 3). Specifically, the measured vector is y = SoE(a) + n, where the noise n is normalized based on the signal-to-noise ratio (SNR).
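The noise normalization can be sketched as follows; the exact normalization used in the paper is assumed here to target a given SNR in dB:

```python
import numpy as np

def add_noise(y, snr_db, rng=None):
    # Add white Gaussian noise scaled so that 20*log10(||y|| / ||n||)
    # matches the requested signal-to-noise ratio (in dB).
    rng = rng or np.random.default_rng(0)
    n = rng.standard_normal(y.shape)
    n *= np.linalg.norm(y) / (np.linalg.norm(n) * 10 ** (snr_db / 20))
    return y + n
```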
We compare 4 CS methods: the FISTA implementation of BP, OMP, IK-OMP (Algorithm 1) and a Deep Learning variant we deem DeepCS, described below. The dictionary is composed of the possible words which can be used in the game; the embedding dimension is much smaller than the dictionary size, and the sentence length is limited to a maximal number of words. This yields a total number of roughly 10 million actions, from which the agent must choose one at each step. It is important to note that while accuracy and reward might seem similar, an inaccurate reconstruction at an early stage results in an immediate failure, even when the overall accuracy seems high.
Clearly, as seen in Figure 3, OMP fails to reconstruct the true BoW vectors, even in the noiseless scenario. Indeed, the mutual-coherence (Definition 1) of our dictionary is large, and by Theorem 1 there is no guarantee that OMP can reconstruct the sparse vector at the sparsity levels we require. However, our suggested approach, IK-OMP, is capable of correctly reconstructing the original action, even in the presence of relatively large noise. This gives evidence that the integer prior, in particular, and the beam search strategy significantly improve the sparse recovery performance.
Deep Compressed Sensing: Besides traditional CS methods, it is natural to test the ability of deep learning methods to perform this task. In this approach, we train a neural network to predict the BoW vector from the continuous embedding vector. Our network is a multi-layer perceptron (MLP), composed of two hidden layers of 100 neurons each. We use a sigmoid activation function to bound the outputs to [0, 1], and train the network using a binary cross-entropy loss.
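A forward-pass sketch of such a DeepCS network in numpy (toy dimensions, randomly initialized; training code omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_cs_forward(y, params):
    # Two hidden layers of 100 ReLU units each, with a sigmoid output head
    # that predicts per-word BoW probabilities from the embedding vector y.
    W1, b1, W2, b2, W3, b3 = params
    h1 = np.maximum(0.0, W1 @ y + b1)
    h2 = np.maximum(0.0, W2 @ h1 + b2)
    return sigmoid(W3 @ h2 + b3)

def bce_loss(pred, target, eps=1e-9):
    # Binary cross-entropy between predicted probabilities and the true BoW.
    return float(-np.mean(target * np.log(pred + eps)
                          + (1 - target) * np.log(1 - pred + eps)))

# Toy dimensions (assumed): embedding size d=5, dictionary size n=8.
rng = np.random.default_rng(0)
d, n, h = 5, 8, 100
params = (rng.standard_normal((h, d)), np.zeros(h),
          rng.standard_normal((h, h)) * 0.1, np.zeros(h),
          rng.standard_normal((n, h)) * 0.1, np.zeros(n))
pred = deep_cs_forward(rng.standard_normal(d), params)
assert pred.shape == (n,) and np.all((pred >= 0) & (pred <= 1))
```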
Our results, presented in Figure 4, show that the DeepCS approach works when no noise is present. However, once noise is added to the setup, it is clear that DeepCS performs poorly compared to classic CS methods such as IK-OMP. (In the Appendix, we provide an additional experiment showing that in the Troll Quest, DeepCS remains competitive, as the domain is relatively small; we also provide experiments with a much larger dictionary and show that IK-OMP is capable of scaling.) Moreover, as DeepCS requires training a new model for each domain, it is data-specific and does not transfer easily, which is not the case with traditional CS methods.
5 Imitation Learning
In this section, we present our Sparse-IL algorithm and provide in-depth details regarding the design and implementation of each of its underlying components. We also detail the experiments of executing Sparse-IL on the entire game of Zork.
Sparse Imitation Learning: Our Sparse-IL architecture is composed of two major components, the Encoder and the Retrieval Mechanism (as seen in Figure 1). Each component has a distinct role, and combining them enables a computationally efficient approach. (1) The Encoder E, given a state s, predicts a sum of word embeddings. (2) The Retrieval Mechanism R converts the continuous action embedding into the most probable BoW representations, using a CS algorithm such as those of Section 4, e.g., IK-OMP; the best BoW representation is then provided to the Language Model M, which returns the most probable ordering of its words, i.e., a sentence. We now discuss each of these components in more detail.
The Encoder (E) is a neural network trained to output the optimal action representation at each state. As we consider the task of imitation learning, this is performed by minimizing the loss between the Encoder’s output and the sum of embeddings of the action provided by the expert.
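The imitation objective can be sketched as follows, with a toy two-dimensional "GloVe" table whose values are made up for illustration:

```python
import numpy as np

def imitation_loss(predicted_embedding, expert_action_words, glove):
    # MSE between the encoder output and the sum of the expert
    # action's word embeddings (`glove` maps word -> vector).
    target = np.sum([glove[w] for w in expert_action_words], axis=0)
    return float(np.mean((predicted_embedding - target) ** 2))

# Toy 2-d "embeddings" (assumed, for illustration only).
glove = {"take": np.array([1.0, 0.0]), "egg": np.array([0.0, 1.0])}
assert imitation_loss(np.array([1.0, 1.0]), ["take", "egg"], glove) == 0.0
```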
In all of the learning experiments, the architecture we use is a convolutional neural network (CNN) suited to NLP tasks (Kim, 2014). Due to the structure of the game, there exist long-term dependencies. Frame-stacking, a common approach in games (Mnih et al., 2015), tackles this issue by providing the network with the N previous states. For the “Open Zork” task, we stack the previous 12 states, whereas in the “Troll Quest” we provide only the current state.
Retrieval Mechanism (R): The output of the Encoder is fed into a CS algorithm, such as IK-OMP. IK-OMP produces K candidate actions. These actions are fed into a fitness function which ranks them based on the reconstruction score (see Section 4) and returns the optimal candidate. Other CS approaches, e.g., OMP and FISTA, return a single candidate action.
In an imitation learning setup, we are given a data set of state-action pairs provided by an expert; the goal is to learn a policy that achieves the best performance possible. We achieve this by training the Encoder to imitate the demonstrated actions in the embedding space, at each state, using the MSE between the predicted and demonstrated action embeddings. We consider three setups: (1) perfect demonstrations, where we test errors due to architecture capacity and function approximation; (2) Gaussian noise (see Section 4.5); and (3) discrete-action noise, in which, with some probability, a random incorrect action is demonstrated. This last experiment can be seen as learning from demonstrations provided by an ensemble of sub-optimal experts.
Our results are shown in Figure 6. By combining CS with imitation learning techniques, our approach is capable of solving the entire game of Zork1, even in the presence of discrete-action noise. In all our experiments, IK-OMP outperforms the various baselines (we provide similar results on the ‘Troll Quest’ in the appendix), including the end-to-end approach, i.e., DeepCS-2, which is trained to predict the BoW embedding directly from the state.
Training: Analyzing the training graph presents an interesting picture. It shows that, during the training process, the output of the Encoder can be seen as a noisy estimate of the demonstrated sum of embeddings. As training progresses, the effective SNR of this noise decreases, which is seen in the increase in reconstruction performance.
Generalization: In Figure 6, we present the generalization capabilities which our method Sparse-IL enjoys, due to the use of pre-trained unsupervised word embeddings. The heatmap shows two forms of noise. The first, as before, is the probability of receiving a bad demonstration, an incorrect action. The second, synonym probability, is the probability of being presented with a correct action, yet composed of different words, e.g., drop, throw and discard result in an identical action in the environment and have a similar meaning. These results clearly show that Sparse-IL outperforms DeepCS-2 in nearly all scenarios, highlighting the generalization improvement inherent in the embeddings.
The benefit of meaningful embeddings: In our approach, the Encoder is trained to predict the sum-of-embeddings . However, it can also be trained to directly predict the BoW vector . While this approach may work, it lacks the generalization ability which is apparent in embeddings such as GloVe, in which similar words receive similar embedding vectors.
Consider a scenario in which there are 4 optimal actions (e.g., ‘go north’, ‘walk north’, ‘run north’ and ‘move north’) and 1 sub-optimal action (e.g., ‘climb tree’). With probability p we are presented with one of the optimal actions, and with probability 1 − p the sub-optimal action. In this example, the expected BoW representation would include ‘north’ w.p. p, ‘climb’ and ‘tree’ w.p. 1 − p, and each of the optimal verbs w.p. p/4. On the other hand, since ‘go’, ‘walk’, ‘run’ and ‘move’ have similar meanings, and in turn similar embeddings, the expected sum of embeddings is much closer to the optimal actions than to the sub-optimal one, and thus an imitation agent is less likely to make a mistake.
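This argument can be checked numerically with made-up two-dimensional embeddings (illustrative values only, not GloVe vectors):

```python
import numpy as np

# Toy embeddings where the four motion verbs are near one another and
# 'climb'/'tree' are far away (assumed values, for illustration only).
emb = {
    "go":    np.array([1.0, 0.1]), "walk": np.array([1.0, 0.0]),
    "run":   np.array([0.9, 0.0]), "move": np.array([1.0, -0.1]),
    "north": np.array([0.0, 1.0]),
    "climb": np.array([-1.0, 0.0]), "tree": np.array([0.0, -1.0]),
}

def action_vec(words):
    return sum(emb[w] for w in words)

p = 0.9  # probability of an optimal demonstration, split over 4 synonyms
expected = sum((p / 4) * action_vec([v, "north"])
               for v in ["go", "walk", "run", "move"])
expected += (1 - p) * action_vec(["climb", "tree"])

# The expectation stays far closer to any optimal action than to the
# sub-optimal one.
d_opt = np.linalg.norm(expected - action_vec(["go", "north"]))
d_sub = np.linalg.norm(expected - action_vec(["climb", "tree"]))
assert d_opt < d_sub
```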
In this work we used a naive fitness function, namely the reconstruction loss. As the fitness function performs a selection over K possible actions, it can be replaced by an evaluation network, e.g., a critic (Dulac-Arnold et al., 2015), and used for policy improvement. Results using the Top-K metric are provided in the appendix. These results show that the optimal action is contained within the K candidates even under large noise which prevents the Top-1 from successfully solving the task, and suggest that IK-OMP can be used to produce a minimal action set, similar to the elimination network in Zahavy et al. (2018).
We have presented a computationally efficient algorithm called Sparse Imitation Learning (Sparse-IL) that combines CS with imitation learning to solve text-based games with combinatorial action spaces. We proposed a CS algorithm variant of OMP which we have called Integer K-OMP (IK-OMP) and demonstrated that it can deconstruct a sum of word embeddings into the individual BoW that make up the embedding, even in the presence of significant noise. In addition, IK-OMP is significantly more computationally efficient than the baseline CS techniques. When combining IK-OMP with imitation learning, our agent is able to solve Troll quest as well as the entire game of Zork1 for the first time. Zork1 contains a combinatorial action space of 10 million actions. Future work includes replacing the fitness function with a critic in order to further improve the learned policy as well as testing the capabilities of the critic agent in cross-domain tasks.
- Andrle and Rebollo-Neira  Miroslav Andrle and Laura Rebollo-Neira. A swapping-based refinement of orthogonal matching pursuit strategies. Signal Processing, 86(3):480–495, 2006.
Arora et al. 
Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli.
A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and lstms.2018.
- Axell et al.  Erik Axell, Geert Leus, Erik G Larsson, and H Vincent Poor. Spectrum sensing for cognitive radio: State-of-the-art and recent advances. IEEE Signal Processing Magazine, 29(3):101–116, 2012.
- Bahdanau et al.  Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
- Beck and Teboulle  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- Ben-Tal and Nemirovski  Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. Siam, 2001.
- Blumensath and Davies  Thomas Blumensath and Mike E Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008.
- Blumensath and Davies  Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
- Boyd and Vandenberghe  Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Calderbank et al.  Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. preprint, 2009.
- Candes et al.  Emmanuel Candes, Mark Rudelson, Terence Tao, and Roman Vershynin. Error correction via linear programming. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 668–681. IEEE, 2005.
- Candes and Tao  Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
- Chatterjee et al.  S. Chatterjee, D. Sundman, and M. Skoglund. Look ahead orthogonal matching pursuit. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4024–4027, 2011.
- Chen et al.  Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
- Côté et al.  Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. Textworld: A learning environment for text-based games. arXiv preprint arXiv:1806.11532, 2018.
- Donoho and Elad  David L Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
- Donoho et al.  David L Donoho, Michael Elad, and Vladimir N Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory, 52(1):6–18, 2006.
- Dulac-Arnold et al.  Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
- Elad  Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st edition, 2010. ISBN 144197010X, 9781441970107.
- Elad and Yavneh  Michael Elad and Irad Yavneh. A plurality of sparse representations is better than the sparsest one alone. IEEE Transactions on Information Theory, 55(10):4701–4714, 2009.
- Eldar and Kutyniok  Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.
- Greene  John O Greene. Action assembly theory. The International Encyclopedia of Communication, 2008.
- He et al.  Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1621–1630, 2016.
- Keiper et al.  Sandra Keiper, Gitta Kutyniok, Dae Gwan Lee, and Götz E Pfander. Compressed sensing for finite-valued signals. Linear Algebra and its Applications, 532:570–613, 2017.
- Kim  Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
- Lin et al.  P. Lin, S. Tsai, and G. C. Chuang. A k-best orthogonal matching pursuit for compressive sensing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5706–5709, 2013.
- Mikolov et al.  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Narasimhan et al.  Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
- O. Greene  John O. Greene. A cognitive approach to human communication: An action assembly theory. Communication Monographs - COMMUN MONOGR, 51:289–306, 12 1984. doi: 10.1080/03637758409390203.
- Pennington et al.  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- Preuveneers and Ilie-Zudor  Davy Preuveneers and Elisabeth Ilie-Zudor. The intelligent industry of the future: A survey on emerging trends, research challenges and opportunities in industry 4.0. Journal of Ambient Intelligence and Smart Environments, 9(3):287–298, 2017.
- Pruim  Doug Pruim. Action assembly theory. 11 2014. doi: 10.13140/2.1.2282.6566.
- Ranzato et al.  Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
- Rossi et al.  Marco Rossi, Alexander M Haimovich, and Yonina C Eldar. Spatial compressive sensing for mimo radar. IEEE Transactions on Signal Processing, 62(2):419–430, 2014.
- Sparrer and Fischer [2015a] S Sparrer and RFH Fischer. Mmse-based version of omp for recovery of discrete-valued sparse signals. Electronics Letters, 52(1):75–77, 2015a.
- Sparrer and Fischer [2015b] Susanne Sparrer and Robert FH Fischer. Soft-feedback omp for the recovery of discrete-valued sparse signals. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 1461–1465. IEEE, 2015b.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Tao et al.  Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan, and Layla El Asri. Towards solving text-based games by producing adaptive action spaces. arXiv preprint arXiv:1812.00855, 2018.
- Wen et al.  Zheng Wen, Daniel O’Neill, and Hamid Maei. Optimal demand response using device-based reinforcement learning. IEEE Transactions on Smart Grid, 6(5):2312–2324, 2015.
- White et al.  Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. Generating bags of words from the sums of their word embeddings. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 91–102. Springer, 2016.
- Yuan et al.  Xingdi Yuan, Marc-Alexandre Côté, Alessandro Sordoni, Romain Laroche, Remi Tachet des Combes, Matthew Hausknecht, and Adam Trischler. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.11525, 2018.
- Zahavy et al.  Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in neural information processing systems (NeurIPS), 2018.
- Zelinka  Mikuláš Zelinka. Using reinforcement learning to learn how to play text-based games. arXiv preprint arXiv:1801.01999, 2018.
Appendix A Compressed Sensing Algorithms
In Section 4, we presented Orthogonal Matching Pursuit (OMP), a common algorithm in CS, and Integer OMP, an integer-constrained variant. Both algorithms are presented below. In the algorithm descriptions, $a_i$ is the $i$-th column of $A$, and $A_S$ denotes the columns of the matrix $A$ indexed by the set $S$. Similarly, $x_S$ is the vector $x$ reduced to the support $S$, implying that the remaining entries of $x$ are zeros.
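For concreteness, the greedy OMP procedure described above can be sketched as follows. This is a minimal illustrative implementation, not the exact variant used in our experiments; function and variable names are ours.

```python
import numpy as np

def omp(A, y, k, tol=1e-6):
    """Orthogonal Matching Pursuit: greedily recover a k-sparse x with y ~ A x.

    At each step, select the column a_i of A most correlated with the current
    residual, add i to the support S, and re-fit y on the selected columns
    A_S by least squares.
    """
    m, n = A.shape
    residual = y.copy()
    support = []
    x = np.zeros(n)
    for _ in range(k):
        # Column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        correlations[support] = -np.inf  # never pick the same column twice
        i = int(np.argmax(correlations))
        support.append(i)
        # Least-squares fit of y on the selected columns A_S.
        A_S = A[:, support]
        coef, *_ = np.linalg.lstsq(A_S, y, rcond=None)
        residual = y - A_S @ coef
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coef  # entries outside the support S remain zero
    return x
```

Integer OMP additionally constrains the fitted coefficients to integer values (word counts), which this sketch omits.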
Appendix B Experiments
Throughout the paper, all experiments were evaluated across 5 random seeds. The compute used for the simulations consisted of a single i7 system with a GTX 1080 Ti GPU.
b.1 DeepCS
For clarity, we provide both the ‘Troll Quest’ and ‘Open Zork’ results in the DeepCS setting. These results show that, while DeepCS is comparable to IK-OMP on a small problem such as the ‘Troll Quest’, IK-OMP performs significantly better than DeepCS, both in accuracy and in total reward, when scaled up to larger, more complex problems.
b.2 CS with a Larger Dictionary
In addition to the results provided in the paper, we test our approach when the dictionary size is increased by a factor of 10, with a correspondingly (exponentially) larger effective action space.
The results show that our approach is indeed capable of coping with larger action spaces; however, performance degrades as the noise grows.
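To see why a tenfold larger dictionary is a substantial change, recall that with a dictionary of $|D|$ entries and sentences of length $l$, the action space has size $|D|^l$. The numbers below are purely illustrative, not the dictionary sizes used in our experiments:

```python
# Hypothetical numbers for illustration only: a 600-word dictionary
# and sentences of length 4, versus the same dictionary scaled 10x.
d, l = 600, 4
small = d ** l            # |D|^l possible sentences
large = (10 * d) ** l     # tenfold dictionary
print(small)              # 129600000000
print(large // small)     # growth factor: 10^l = 10000
```

That is, multiplying the dictionary by 10 multiplies the action space by $10^l$.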
b.3 Sparse Imitation Learning
Similarly to the results on ‘Open Zork’, we see that IK-OMP outperforms both the standard OMP and FISTA. In addition, these results show that, during the training procedure, the output of the Encoder may be viewed as a noisy representation of the original sum of word embeddings. As training proceeds, the network converges; hence the approximation noise diminishes (the SNR increases) and the various reconstruction schemes perform better.
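The noise quantity discussed here can be made concrete with a small sketch: given a sum-of-embeddings "signal" and an additive approximation error, the SNR in decibels is the (log) ratio of their energies. All numbers and names below are hypothetical, chosen only to illustrate the measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 100-word dictionary with 50-dimensional embeddings;
# the "signal" is the sum of embeddings of a 4-word sentence.
E = rng.standard_normal((50, 100))   # embedding matrix, one column per word
sentence = [3, 17, 42, 42]           # word indices (repetition allowed)
z = E[:, sentence].sum(axis=1)       # true sum of embeddings

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

# An imperfect encoder output is modeled as the true sum plus Gaussian noise;
# a smaller noise scale means a higher SNR and easier CS reconstruction.
for sigma in (1.0, 0.1, 0.01):
    noise = sigma * rng.standard_normal(50)
    print(f"sigma={sigma}: SNR = {snr_db(z, noise):.1f} dB")
```

As the Encoder converges during training, the noise term shrinks, which in this picture corresponds to moving toward the high-SNR regime where all reconstruction schemes improve.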
b.4 Top-K Experiment
While our analysis focused on the pure imitation setting, and thus on Top-1 accuracy, it is also important to consider Top-K accuracy, under which a prediction is deemed correct if the true action is among the K candidates offered by the CS algorithm. Since IK-OMP, unlike OMP and FISTA, produces multiple candidate actions, we expect a significant improvement under this measure.
The empirical results show that the Top-K criterion indeed yields a dramatic increase in performance: even in the presence of very large noise, the true action is among the K candidates reconstructed by IK-OMP.
This result is important, as it highlights the ability of our approach to incorporate self-play. Such an approach can be viewed as similar to the action-elimination approach of [Zahavy et al., 2018]. While [Zahavy et al., 2018] reduce the effective action set using feedback from the environment, in our approach the imperfect demonstrations reduce the effective action set, shrinking the MDP to a smaller, tractable problem.