A common assumption in reinforcement learning (RL) is that the agent has full knowledge of the dynamics of the environment, including the state space, transition probabilities, and a reward model. However, in many real-world applications, this assumption is not valid. Instead, the environment is often partially observable, meaning that the true state of the system is not completely visible to the agent. This partial observability makes it considerably harder both to learn the dynamics of the environment and to plan so as to maximize returns.
Partially observable Markov decision processes (POMDPs) [29, 8] provide a formal framework for single-agent planning in a partially observable environment. In contrast with MDPs, agents in POMDPs do not have direct access to the state space. Instead of observing the states, agents only receive observations and need to operate on so-called belief states, which describe the distribution over the state space given some past trajectory. POMDPs therefore model the dynamics of an RL environment in a latent-variable fashion and explicitly reason about uncertainty in both action effects and state observability. Planning under a POMDP has long been considered a difficult problem. To perform exact planning under a POMDP, one common approach is to optimize the value function over all possible belief states; value iteration for POMDPs is one particular example of this approach. However, due to the curse of dimensionality and the curse of history, this method is computationally intractable for most realistic POMDP planning problems.
As an alternative to exact planning, the family of predictive state representations (PSRs) has attracted much interest. In fact, PSRs are no weaker than POMDPs in terms of representation power, and there are many efficient algorithms to estimate PSRs and their variants, relying on likelihood-based algorithms [28, 27] or spectral learning techniques [6, 13]. However, planning with PSRs is not straightforward. Typically, a two-stage process is applied to discover the optimal policy: first, a PSR model is learned in an unsupervised fashion; then, a planning method is used to discover the optimal policy based on the learned dynamics. Several planning algorithms can be used for the second stage of this process. For example, in [6, 14], a reward function is estimated with the learned PSRs and then combined with point-based value iteration (PBVI) to obtain an approximation of the optimal policy; alternatively, the fitted-Q method can be used to iteratively regress Bellman updates on the learned state representations, thus approximating the action value function.
¹Equal contribution. ²McGill University. ³Quebec AI Institute (Mila). ⁴Université de Montréal.
However, despite numerous successes, this two-stage process still suffers from significant drawbacks. To begin with, the PSR parameters are learned independently of the reward information, resulting in a less efficient representation for planning. Secondly, planning with PSRs often involves multiple stages of regression, and these extra approximation steps can be detrimental to obtaining the optimal policy. Finally, the planning methods used with PSRs are often iterative and can be very time-consuming.
In this work, we propose an alternative to the traditional paradigm of planning in partially observable environments. Inspired by PSRs, our method leverages the spectral learning algorithm for subspace identification, treating the environment as a latent-variable model. However, instead of explicitly learning the dynamics of the environment, we learn a function that is proportional to the action value function, which we call the unnormalized Q function (UQF). In doing so, we incorporate the reward information into the dynamics in a supervised-learning fashion, which unifies the two stages of the classical learning-planning paradigm for POMDPs. In some sense, our approach effectively learns a goal-oriented representation of the environment. Therefore, in terms of planning, our method is more sample-efficient than the two-stage learning paradigm (for example, PSRs). Our algorithm relies on the spectral learning algorithm for weighted finite automata (WFAs), which extend PSRs in that they can model not only probability distributions but arbitrary real-valued functions. Our method inherits the benefits of spectral learning: it provides a consistent estimation of the UQF and is computationally more efficient than EM-based methods. Furthermore, whereas planning with PSRs usually requires multiple steps and often relies on iterative planning methods, which can be time-consuming, our algorithm directly learns a policy in one step, offering a more time-efficient method. In addition, we adopt matrix compressed sensing techniques to extend this approach to complex domains; this technique has also been used in PSR-based methods to overcome similar scalability problems.
We conduct experiments on a partially observable grid world and the S-PocMan environment, comparing our approach with classical PSR-based methods. In both domains, our approach is significantly more data-efficient than PSR-based methods, with considerably smaller running time.
In this section, we will introduce some basic RL concepts, including partially observable Markov decision processes (POMDPs), predictive state representations (PSRs) and their variants as well as the notion of WFAs. We will also introduce the spectral learning algorithms for WFAs.
2.1 Partially Observable Markov Decision Processes (POMDPs)
Markov decision processes have been widely applied in the field of reinforcement learning. A Markov decision process (MDP) of size $k$ is characterized by a 6-tuple $\langle \mathcal{S}, \mathcal{A}, \mathbf{T}, \mathbf{r}, \gamma, \boldsymbol{\mu} \rangle$, where $\mathcal{S}$ is the set of $k$ states; $\mathcal{A}$ is the set of actions; $\mathbf{T}$ is the transition probability tensor, with $\mathbf{T}^{a}_{s,s'} = P(s' \mid s, a)$; $\mathbf{r} \in \mathbb{R}^{k}$ is the reward vector over states; $\gamma \in [0,1)$ is the discount factor; and $\boldsymbol{\mu}$ is the initial state distribution. The goal of an RL task is often to learn a policy that governs the actions of the agent so as to maximize the accumulated discounted rewards (return). A policy in an MDP environment is defined as a map $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$, with $\pi(a \mid s)$ the probability of choosing action $a$ in state $s$; it operates at the state level. At each time step, an action is selected probabilistically with respect to $\pi$ given the current state. The agent then moves to the next state according to the transition probabilities indexed by the chosen action and collects the reward associated with the state it reaches.
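The interaction loop described above can be sketched in a few lines of code. This is a minimal illustration, not from the paper: the transition tensor, reward vector, and uniform policy below are made-up stand-ins for a 3-state, 2-action MDP.

```python
import numpy as np

# Illustrative sketch of one interaction step in a small MDP <S, A, T, r, gamma, mu>.
# All numbers here are made up for the example.
rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
# T[a][s, s'] = P(s' | s, a): one row-stochastic matrix per action.
T = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.8, 0.2],
               [0.0, 0.0, 1.0]],
              [[0.5, 0.5, 0.0],
               [0.1, 0.1, 0.8],
               [0.0, 0.2, 0.8]]])
R = np.array([0.0, 0.0, 1.0])        # reward vector over states
mu = np.array([1.0, 0.0, 0.0])       # initial state distribution
policy = np.full((n_states, n_actions), 0.5)  # pi(a | s), uniform here

def step(s):
    """Sample a ~ pi(.|s), then s' ~ T[a][s, :], and collect R[s']."""
    a = rng.choice(n_actions, p=policy[s])
    s_next = rng.choice(n_states, p=T[a][s])
    return a, s_next, R[s_next]

s = rng.choice(n_states, p=mu)
a, s, r = step(s)
```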
However, in practice, it is rarely the case that we can observe the exact state of the agent. For example, in the game of poker, a player only knows the cards in their hand, and this information alone does not determine the exact state of the game. Partially observable Markov decision processes (POMDPs) were introduced to model this type of problem. Under the POMDP setting, the true state space of the model is hidden from the agent through partial observability: an observation is obtained probabilistically based on the agent’s current state and the observation emission probability. A partially observable Markov decision process (POMDP) is characterized by an 8-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathbf{T}, \mathbf{O}, \mathbf{r}, \gamma, \boldsymbol{\mu} \rangle$, where $\mathcal{O}$ is a set of observations, $\mathbf{O}$ is the observation emission probability with $\mathbf{O}^{a}_{s',o} = P(o \mid s', a)$, and the remaining parameters follow the definitions in MDPs.
As the agent cannot directly observe which state it is in, one classic problem in POMDPs is to compute the belief state given the past trajectory $h = a_1 o_1 \cdots a_t o_t$. Formally, given $h$, we want to compute $\mathbf{b}_h(s) = P(s_t = s \mid h)$. This can be solved with a forward method similar to the one used for HMMs. Let $\mathbf{T}^{ao}$ denote the matrix with entries $\mathbf{T}^{ao}_{s',s} = \mathbf{T}^{a}_{s,s'} \mathbf{O}^{a}_{s',o}$. It can be shown that $\mathbf{b}_\lambda = \boldsymbol{\mu}$ and $\mathbf{b}_{hao} \propto \mathbf{T}^{ao} \mathbf{b}_h$ (renormalized to sum to one), where $\lambda$ denotes the empty string.
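The forward belief update can be sketched as follows. This is a hypothetical toy POMDP (2 states, 2 actions, 2 observations) whose transition and emission tables are invented for illustration; the update renormalizes after applying the observable operator.

```python
import numpy as np

# Hypothetical toy POMDP, for illustration only.
# T[a][s, s'] = P(s' | s, a);  Obs[a][s', o] = P(o | s', a).
T = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.4, 0.6]]])
Obs = np.array([[[0.8, 0.2], [0.3, 0.7]],
                [[0.6, 0.4], [0.1, 0.9]]])
mu = np.array([0.5, 0.5])  # initial belief b_lambda

def T_ao(a, o):
    """Observable operator with entries (T_ao)[s', s] = P(s', o | s, a)."""
    return (T[a] * Obs[a][:, o]).T

def belief_update(b, a, o):
    """b_{hao} is proportional to T_ao @ b_h, renormalized to sum to one."""
    b_new = T_ao(a, o) @ b
    return b_new / b_new.sum()

b = mu
for a, o in [(0, 1), (1, 0)]:   # a trajectory h = a1 o1 a2 o2
    b = belief_update(b, a, o)
```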
Similarly to the MDP setting, the state-level policy for POMDPs is defined by $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$. However, due to partial observability, the agent’s true state cannot be directly observed. Nonetheless, any state-level policy implicitly induces a probabilistic policy over past trajectories, defined by $\pi(a \mid h) = \sum_{s} \pi(a \mid s)\, \mathbf{b}_h(s)$ for each trajectory $h$. Similarly, every state-level policy induces a probability distribution over trajectories. With a slight abuse of notation, we denote the probability of a trajectory $h$ under the policy $\pi$ by $P_\pi(h)$. Here, we assume $\pi$ is induced by a state-level policy and define $P_\pi(h) = \mathbf{1}^\top \boldsymbol{\beta}_h$, where $\boldsymbol{\beta}_h$ is the unnormalized forward vector obtained by weighting each transition by the action probabilities and $\mathbf{1}$ is an all-one vector. To keep the notation clear, we will use a distinct symbol for deterministic policies in later sections.
2.2 Predictive state representations
One common approach for modelling the dynamics of a POMDP is the so-called predictive state representation (PSR). A PSR is a model of a dynamical system in which the current state is represented as a set of predictions about the future behavior of the system [20, 27]. This is done by maintaining a set of action-observation sequences, called tests; the representation of the current state is given by the conditional probabilities of these tests given the past trajectory, which is referred to as the history. Although there are multiple methods to select a set of tests [28, 16], it has been shown that with a large action-observation set, finding these tests can be exponentially difficult.
Transformed PSRs (TPSRs) offer an alternative. TPSRs implicitly estimate a linear transformation of the PSR via subspace methods. This approach drastically reduces the complexity of estimating a PSR model and has shown many benefits in different RL domains [6, 27].
Although this approach obtains a small transformed space of the original PSRs, it still faces scalability problems. Typically, one can estimate a TPSR by performing truncated SVD on the estimated system-dynamics matrix, which is indexed by histories and tests. The scalability issue arises in complex domains, which require a large number of histories and tests to form the system-dynamics matrix. As the time complexity of SVD is cubic in the number of histories and tests, the computation time explodes in these types of environments.
Compressed predictive state representations (CPSRs) were introduced to circumvent this issue. The main idea of this approach is to project the high-dimensional system-dynamics matrix onto a much smaller subspace spanned by randomly generated bases that satisfy the Johnson-Lindenstrauss (JL) lemma. The projection matrices corresponding to these bases are referred to as JL matrices. Intuitively, JL matrices define a low-dimensional embedding which approximately preserves the Euclidean distance between the projected points. More formally, given a matrix $\mathbf{H} \in \mathbb{R}^{m \times n}$ and JL random projection matrices $\boldsymbol{\Phi}_1 \in \mathbb{R}^{d_1 \times m}$ and $\boldsymbol{\Phi}_2 \in \mathbb{R}^{d_2 \times n}$, the compressed matrix is computed by:

$$\mathbf{H}_c = \boldsymbol{\Phi}_1 \mathbf{H} \boldsymbol{\Phi}_2^\top,$$

where $\mathbf{H}_c \in \mathbb{R}^{d_1 \times d_2}$ is the compressed matrix. The choice of random projection matrix is rather empirical and often depends on the task. Gaussian matrices and Rademacher matrices are common choices that satisfy the JL lemma. Although hashed random projections do not satisfy the JL lemma, they have also been shown to preserve certain kernel functions and perform extremely well in practice [33, 25].
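The two-sided compression can be sketched as follows. The matrix sizes, projection dimensions, and the Gaussian $1/\sqrt{d}$ scaling below are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of JL-style compression of a large "system-dynamics" matrix H.
# Dimensions and scalings are illustrative assumptions.
rng = np.random.default_rng(0)
n_hist, n_tests, d_h, d_t = 500, 400, 50, 40

H = rng.random((n_hist, n_tests))                                  # big matrix
Phi_h = rng.normal(0.0, 1.0 / np.sqrt(d_h), size=(d_h, n_hist))    # JL matrix (rows)
Phi_t = rng.normal(0.0, 1.0 / np.sqrt(d_t), size=(d_t, n_tests))   # JL matrix (cols)

H_c = Phi_h @ H @ Phi_t.T   # compressed matrix, d_h x d_t
```

All downstream computations (e.g. truncated SVD) then operate on the small `H_c` instead of `H`.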
2.3 Weighted finite automata (WFAs)
In fact, TPSRs (and CPSRs) are a subclass of a wider family called weighted finite automata (WFAs). More precisely, TPSRs belong to the stochastic weighted finite automata (SWFAs) of the formal language community, also known as observable operator models (OOMs) in control theory; further connections between SWFAs, OOMs and TPSRs have been established in the literature. WFAs extend TPSRs in the sense that, instead of only computing the probabilities of trajectories, WFAs can compute functions with arbitrary scalar outputs over the given trajectories. Formally, a weighted finite automaton (WFA) with $n$ states is a tuple $A = \langle \boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\omega} \rangle$, where $\boldsymbol{\alpha}, \boldsymbol{\omega} \in \mathbb{R}^{n}$ are the initial and terminal weight vectors and $\mathbf{A}^\sigma \in \mathbb{R}^{n \times n}$ is the transition matrix associated with symbol $\sigma$ from a finite alphabet $\Sigma$. Given a trajectory $x = x_1 x_2 \cdots x_t \in \Sigma^*$, a WFA computes a function $f_A$ defined by:

$$f_A(x) = \boldsymbol{\alpha}^\top \mathbf{A}^{x_1} \mathbf{A}^{x_2} \cdots \mathbf{A}^{x_t} \boldsymbol{\omega}.$$

We will denote $\mathbf{A}^{x} = \mathbf{A}^{x_1} \cdots \mathbf{A}^{x_t}$ in the following sections for simplicity. For a function $f : \Sigma^* \to \mathbb{R}$, the rank of $f$ is defined as the minimal number of states of a WFA computing $f$; if $f$ cannot be computed by a WFA, we let $\mathrm{rank}(f) = \infty$. In the context of TPSRs, we often let $\Sigma = \mathcal{A} \times \mathcal{O}$, and $f$ computes the probability of the trajectory.
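The WFA computation is a product of matrices sandwiched between two vectors, which takes only a few lines to implement. The 2-state, two-symbol automaton below is a made-up example for illustration.

```python
import numpy as np

# Minimal WFA <alpha, {A_sigma}, omega>: f(x1...xt) = alpha^T A_x1 ... A_xt omega.
# The automaton below is a made-up example.
alpha = np.array([1.0, 0.0])
omega = np.array([0.0, 1.0])
A = {"a": np.array([[0.5, 0.5], [0.0, 1.0]]),
     "b": np.array([[1.0, 0.0], [0.2, 0.8]])}

def wfa_value(word):
    """Compute f(word) by threading the state vector through the operators."""
    v = alpha.copy()
    for sym in word:
        v = v @ A[sym]
    return float(v @ omega)
```

For the empty string the value is simply $\boldsymbol{\alpha}^\top \boldsymbol{\omega}$.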
2.3.1 Hankel matrix
The learning algorithm for WFAs relies on the spectral decomposition of the so-called Hankel matrix. The Hankel matrix $\mathbf{H}_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ associated with a function $f : \Sigma^* \to \mathbb{R}$ is a bi-infinite matrix with entries $\mathbf{H}_f(u, v) = f(uv)$ for all words $u, v \in \Sigma^*$. The spectral learning algorithm for WFAs relies on the following fundamental relation between the rank of $f$ and the rank of the Hankel matrix [7, 11]:
For any $f : \Sigma^* \to \mathbb{R}$, $\mathrm{rank}(f) = \mathrm{rank}(\mathbf{H}_f)$.
In practice, one deals with finite sub-blocks of the Hankel matrix. Given a basis $B = (\mathcal{P}, \mathcal{S})$, where $\mathcal{P} \subset \Sigma^*$ is a set of prefixes and $\mathcal{S} \subset \Sigma^*$ is a set of suffixes, denote the corresponding sub-block of the Hankel matrix by $\mathbf{H}_B \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$, with $\mathbf{H}_B(u, v) = f(uv)$. For an arbitrary basis $B$, define its p-closure by $B' = (\mathcal{P}', \mathcal{S})$, where $\mathcal{P}' = \mathcal{P} \cup \mathcal{P}\Sigma$. It turns out that a Hankel matrix over a p-closed basis can be partitioned into $|\Sigma| + 1$ blocks of the same size:

$$\mathbf{H}_{B'}^\top = [\mathbf{H}_\lambda^\top, \mathbf{H}_{\sigma_1}^\top, \cdots, \mathbf{H}_{\sigma_{|\Sigma|}}^\top],$$

where $\lambda$ denotes the empty string and, for each $\sigma \in \Sigma \cup \{\lambda\}$, the matrix $\mathbf{H}_\sigma$ is defined by $\mathbf{H}_\sigma(u, v) = f(u \sigma v)$. We say that a basis is complete for the function $f$ if the sub-block $\mathbf{H}_B$ has full rank, i.e. $\mathrm{rank}(\mathbf{H}_B) = \mathrm{rank}(\mathbf{H}_f)$; we then call $\mathbf{H}_B$ a complete sub-block of $\mathbf{H}_f$ and $\mathbf{H}_{B'}$ a prefix-closure of $\mathbf{H}_B$. It turns out that one can recover the WFA that realizes $f$ via the prefix-closure of a complete sub-block of $\mathbf{H}_f$ using the spectral learning algorithm for WFAs.
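Building the finite sub-blocks $\mathbf{H}_B$ and $\mathbf{H}_\sigma$ amounts to evaluating $f$ on concatenated strings. In this sketch, the function $f$ (counting occurrences of the pattern "ab"), the alphabet, and the prefix/suffix sets are all invented for illustration.

```python
import numpy as np

# Build finite Hankel sub-blocks H_B(u, v) = f(uv) and H_sigma(u, v) = f(u sigma v)
# for an illustrative function f over {a, b}: here f counts occurrences of "ab".
def f(w):
    return sum(1 for i in range(len(w) - 1) if w[i:i + 2] == "ab")

prefixes = ["", "a", "b", "ab"]
suffixes = ["", "a", "b", "ba"]
alphabet = ["a", "b"]

H = np.array([[f(u + v) for v in suffixes] for u in prefixes])
H_sigma = {s: np.array([[f(u + s + v) for v in suffixes] for u in prefixes])
           for s in alphabet}
```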
2.3.2 Spectral learning of WFAs
It can be shown that the rank of any Hankel sub-block is upper bounded by the rank of $f$. Moreover, given a rank factorization of the Hankel matrix $\mathbf{H}_\lambda = \mathbf{P}\mathbf{S}$, it is also true that $\mathbf{H}_\sigma = \mathbf{P} \mathbf{A}^\sigma \mathbf{S}$ for each $\sigma \in \Sigma$. The spectral learning algorithm relies on the non-trivial observation that this construction can be reversed: given any rank factorization $\mathbf{H}_\lambda = \mathbf{P}\mathbf{S}$, the WFA defined by

$$\boldsymbol{\alpha}^\top = \mathbf{P}_{\lambda,:}, \quad \boldsymbol{\omega} = \mathbf{S}_{:,\lambda}, \quad \mathbf{A}^\sigma = \mathbf{P}^{+} \mathbf{H}_\sigma \mathbf{S}^{+}$$

is a minimal WFA computing $f$ [2, Lemma 4.1], where the $\mathbf{H}_\sigma$ for $\sigma \in \Sigma$ denote the finite matrices defined above for a prefix-closed complete basis. In practice, we compute empirical estimates $\hat{\mathbf{H}}_\sigma$ of the Hankel matrices from a dataset $D$ of sampled trajectories, by averaging the observed outputs of $f$ over the corresponding entries.
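The reversed construction can be verified numerically: build the Hankel sub-blocks of a known low-rank function, factorize via truncated SVD, and check that the recovered WFA reproduces the function. The 2-state target WFA and the basis below are made-up choices for which the basis happens to be complete.

```python
import numpy as np
from numpy.linalg import pinv, svd

# Sketch: recover a WFA from Hankel sub-blocks via a rank factorization H = P S.
# The 2-state target WFA below is made up for the demonstration.
alpha = np.array([1.0, 0.0])
omega = np.array([0.5, 0.5])
A = {"a": np.array([[0.5, 0.2], [0.1, 0.3]]),
     "b": np.array([[0.1, 0.3], [0.4, 0.2]])}

def f(w):
    v = alpha.copy()
    for c in w:
        v = v @ A[c]
    return float(v @ omega)

basis = ["", "a", "b"]                      # prefixes = suffixes here
H = np.array([[f(u + v) for v in basis] for u in basis])
H_sig = {c: np.array([[f(u + c + v) for v in basis] for u in basis]) for c in "ab"}

# Truncated SVD at the true rank n = 2 yields the factorization H = P S.
U, s, Vt = svd(H)
n = 2
P, S = U[:, :n] * s[:n], Vt[:n]

# Reversed construction: A^sigma = P^+ H_sigma S^+; alpha and omega come from
# the empty-string row of P and column of S.
A_hat = {c: pinv(P) @ H_sig[c] @ pinv(S) for c in "ab"}
alpha_hat, omega_hat = P[basis.index("")], S[:, basis.index("")]

def f_hat(w):
    v = alpha_hat.copy()
    for c in w:
        v = v @ A_hat[c]
    return float(v @ omega_hat)
```

The recovered automaton agrees with the target (up to a change of basis on the state space), so `f_hat` matches `f` on all words.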
In fact, the above algorithm can readily be used for learning TPSRs. In PSR terminology, the prefixes are the histories, the suffixes are the tests, and the alphabet is the set of all possible action-observation pairs, i.e. $\Sigma = \mathcal{A} \times \mathcal{O}$. By simply replacing the Hankel matrix with the system-dynamics matrix, one exactly recovers the TPSR learning algorithm [27, 6].
3 Planning with Unnormalized Q Function
In this section, we introduce our POMDP planning method. The main idea of our algorithm is to directly compute the optimal policy based on an estimate of the unnormalized Q function, which is proportional to the action value function. The value of this function, given a past trajectory, can be computed via a WFA, and it is then straightforward to use the classical spectral learning algorithm to recover this WFA. Unlike traditional PSR methods, our approach takes advantage of the reward information by integrating the reward into the learned representations; classical PSR-based methods construct the representations solely from the environment dynamics, completely ignoring the reward information. Consequently, our method offers a more sample-efficient representation of the environment for planning under POMDPs. In addition, our algorithm only needs to construct a WFA, with no additional iterative planning method involved; therefore, compared to traditional methods for planning with PSRs, our algorithm is more time-efficient. Finally, with the help of compressed sensing techniques, we are able to scale our algorithm to complex domains.
3.1 Unnormalized Q function
The estimation of the action value function is of great importance for planning under POMDPs. Typically, given a probabilistic sampling policy $\pi$, the action value function (Q function) of a given trajectory $h = a_1 o_1 \cdots a_{t-1} o_{t-1}$ and action $a_t$ is defined by:

$$Q^\pi(h, a_t) = \mathbb{E}_\pi\!\left[\sum_{i \geq 0} \gamma^{i} r_{t+i}\right],$$

where the expectation is over future trajectories and $r_{t+i}$ is the immediate reward collected at time step $t + i$.
Given a POMDP, denote by $\rho(h)$ the expected immediate reward collected after a trajectory $h$, which is defined as:

$$\rho(h) = \mathbb{E}[r_t \mid h] = \mathbf{r}^\top \mathbf{b}_h.$$
The action value function can then be expanded to:

$$Q^\pi(h, a) = \frac{1}{P_\pi(ha)} \sum_{k \geq 0} \gamma^{k} \sum_{h'} P_\pi(h a h')\, \rho(h a h'),$$

where $h'$ ranges over the length-$k$ continuations of the trajectory. We refer to the numerator,

$$\tilde{Q}(ha) = \sum_{k \geq 0} \gamma^{k} \sum_{h'} P_\pi(h a h')\, \rho(h a h'),$$

as the unnormalized Q function (UQF). It is immediate that, for the same trajectory $h$, $\tilde{Q}(ha) = P_\pi(ha)\, Q^\pi(h, a)$. Therefore, we have $Q^\pi(h, a) = \tilde{Q}(ha)/P_\pi(ha)$, and we can then plan according to the UQF instead of $Q^\pi$.
3.2 A spectral learning algorithm for UQF
In this section, we present our spectral learning algorithm for the UQF. First, we show that the value of the UQF given a past trajectory can be computed via a WFA. Let us denote $\tilde{\rho}(h) = P_\pi(h)\, \rho(h)$; we have:

$$\tilde{Q}(ha) = \sum_{k \geq 0} \gamma^{k} \sum_{h'} \tilde{\rho}(h a h').$$

Assuming the probabilistic sampling policy $\pi$ is given, we then only need to compute the value of the function $\tilde{\rho}$. As a special case, if $\pi$ is a random policy that uniformly selects the actions, we can replace the action probabilities by the constant $1/|\mathcal{A}|$ without affecting the learned policy.
It turns out that the UQF can be computed by a WFA. To show this, we first introduce the following lemma, stating that the function mapping a trajectory $h$ to $P_\pi(h)\rho(h)$, the trajectory probability times the expected immediate reward, can be computed by a WFA:
Given a POMDP of size $k$ and a sampling policy $\pi$ induced by a state-level policy, there exists a WFA $B = \langle \boldsymbol{\alpha}, \{\mathbf{B}^{ao}\}, \boldsymbol{\omega} \rangle$ with $k$ states that realizes the function $h \mapsto P_\pi(h)\rho(h)$, where $\boldsymbol{\alpha} = \boldsymbol{\mu}$ and $\boldsymbol{\omega} = \mathbf{r}$.
For each action-observation pair $(a, o)$, let

$$\mathbf{B}^{ao}_{s, s'} = \pi(a \mid s)\, \mathbf{T}^{a}_{s, s'}\, \mathbf{O}^{a}_{s', o}.$$

We can construct a WFA $B = \langle \boldsymbol{\alpha}, \{\mathbf{B}^{ao}\}, \boldsymbol{\omega} \rangle$ such that $\boldsymbol{\alpha} = \boldsymbol{\mu}$ and $\boldsymbol{\omega} = \mathbf{r}$. Then, by construction, one can check that the WFA $B$ computes the function $h \mapsto P_\pi(h)\rho(h)$, which also shows that the rank of this function is at most $k$. ∎
In fact, we can show that the UQF $\tilde{Q}$ can be computed by another WFA $\tilde{B}$, and one can easily convert $B$ to $\tilde{B}$.
Given a POMDP of size $k$, a sampling policy $\pi$ and a WFA $B = \langle \boldsymbol{\alpha}, \{\mathbf{B}^{ao}\}, \boldsymbol{\omega} \rangle$ realizing the function $h \mapsto P_\pi(h)\rho(h)$ such that the spectral radius of $\gamma \sum_{a,o} \mathbf{B}^{ao}$ is less than 1, the WFA $\tilde{B} = \langle \boldsymbol{\alpha}, \{\mathbf{B}^{ao}\}, (\mathbf{I} - \gamma \sum_{a,o} \mathbf{B}^{ao})^{-1} \boldsymbol{\omega} \rangle$ of size $k$ realizes the function $\tilde{Q}$.
By definition of the UQF, for any trajectory prefix $u$ we have:

$$\tilde{Q}(u) = \sum_{j \geq 0} \gamma^{j} \sum_{h' \in (\mathcal{A} \times \mathcal{O})^{j}} \boldsymbol{\alpha}^\top \mathbf{B}^{u} \mathbf{B}^{h'} \boldsymbol{\omega} = \boldsymbol{\alpha}^\top \mathbf{B}^{u} \sum_{j \geq 0} \gamma^{j} \Big(\sum_{a,o} \mathbf{B}^{ao}\Big)^{j} \boldsymbol{\omega} = \boldsymbol{\alpha}^\top \mathbf{B}^{u} \Big(\mathbf{I} - \gamma \sum_{a,o} \mathbf{B}^{ao}\Big)^{-1} \boldsymbol{\omega}.$$

Here we applied the Neumann identity $\sum_{j \geq 0} \mathbf{M}^{j} = (\mathbf{I} - \mathbf{M})^{-1}$, which holds when the spectral radius of $\mathbf{M}$ is less than 1. Therefore, the WFA $\tilde{B}$ realizes the function $\tilde{Q}$. ∎
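The terminal-vector conversion above can be checked numerically: solving the linear system $(\mathbf{I} - \gamma \mathbf{M})\,\tilde{\boldsymbol{\omega}} = \mathbf{r}$ must agree with the truncated Neumann series. The operators below are made-up stand-ins, scaled so that the spectral-radius condition holds.

```python
import numpy as np

# Sketch of the terminal-vector conversion: replace r by (I - gamma * sum B_ao)^{-1} r.
# The operators are made-up; the 0.05 scaling keeps the spectral radius of
# gamma * M below 1 so the Neumann series converges.
rng = np.random.default_rng(0)
k, gamma = 4, 0.9
B = {ao: rng.random((k, k)) * 0.05 for ao in [(0, 0), (0, 1), (1, 0), (1, 1)]}
r = rng.random(k)

M = sum(B.values())
omega_tilde = np.linalg.solve(np.eye(k) - gamma * M, r)

# Check against the truncated Neumann series sum_{j>=0} (gamma M)^j r.
series = np.zeros(k)
term = r.copy()
for _ in range(500):
    series += term
    term = gamma * M @ term
```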
Therefore, in order to compute the UQF, we only need to learn a WFA that computes the function $h \mapsto P_\pi(h)\rho(h)$. Following the classical spectral learning algorithm, we present our learning algorithm for POMDP planning in Algorithm 1. In fact, it has been shown that the spectral learning algorithm for WFAs is statistically consistent; therefore, our approximation of the UQF is consistent as the sample size grows.
3.3 Scalable learning of UQF
We have now established the spectral learning algorithm for the UQF. However, as for the spectral learning algorithm for TPSRs, one can immediately observe that both time and storage complexity are the bottleneck of this algorithm. For complex domains, in order to obtain a complete sub-block of the Hankel matrix, one will need a large number of prefixes and suffixes to form a basis, and the classical spectral learning algorithm becomes intractable.
By projecting matrices down to low-dimensional spaces via randomly generated bases, compressed sensing has been widely applied to matrix compression. In fact, previous works have successfully applied matrix sensing techniques to TPSRs and developed an efficient online algorithm for learning TPSRs. Here, we adopt a similar approach.
Assume that we are given a set of prefixes $\mathcal{P}$, a set of suffixes $\mathcal{S}$, and two independent random full-rank Johnson-Lindenstrauss (JL) projection matrices $\boldsymbol{\Phi}_{\mathcal{P}} \in \mathbb{R}^{d_{\mathcal{P}} \times |\mathcal{P}|}$ and $\boldsymbol{\Phi}_{\mathcal{S}} \in \mathbb{R}^{d_{\mathcal{S}} \times |\mathcal{S}|}$, where $d_{\mathcal{P}}$ and $d_{\mathcal{S}}$ are the projection dimensions for the prefixes and suffixes. In this work, we use Gaussian projection matrices for $\boldsymbol{\Phi}_{\mathcal{P}}$ and $\boldsymbol{\Phi}_{\mathcal{S}}$, which contain i.i.d. Gaussian entries.
Let us now define two injective functions over prefixes and suffixes, $\phi_{\mathcal{P}} : \mathcal{P} \to \mathbb{R}^{d_{\mathcal{P}}}$ and $\phi_{\mathcal{S}} : \mathcal{S} \to \mathbb{R}^{d_{\mathcal{S}}}$, mapping each prefix $u$ and suffix $v$ to the corresponding columns of $\boldsymbol{\Phi}_{\mathcal{P}}$ and $\boldsymbol{\Phi}_{\mathcal{S}}$. The core step of our algorithm is to obtain compressed estimates of the Hankel matrices, denoted by $\hat{\mathbf{H}}_c^{\sigma}$, associated with the function $h \mapsto P_\pi(h)\rho(h)$ for all $\sigma \in \Sigma \cup \{\lambda\}$. Formally, $\hat{\mathbf{H}}_c^{\sigma}$ is obtained by averaging, over the trajectories in the training dataset $D$ and their prefix-suffix splits, the observed immediate rewards weighted by the outer products $\phi_{\mathcal{P}}(u)\, \phi_{\mathcal{S}}(v)^\top$. Then, after performing a truncated SVD $\hat{\mathbf{H}}_c^{\lambda} \approx \mathbf{U} \mathbf{D} \mathbf{V}^\top$ of rank $n$, we can compute the transition matrices of the WFA by:

$$\mathbf{A}^\sigma = \mathbf{D}^{-1} \mathbf{U}^\top \hat{\mathbf{H}}_c^{\sigma} \mathbf{V}.$$
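The accumulation of the compressed Hankel estimate can be sketched as follows. This is an illustrative simplification: the feature maps hash each string to a random Gaussian column, the data are invented, and the full-trajectory discounted return is used as the target; the exact per-split targets and normalization follow Algorithm 2.

```python
import numpy as np

# Illustrative sketch: each (prefix, suffix) split of a sampled trajectory
# contributes a reward-weighted outer product of projected features.
rng = np.random.default_rng(0)
d_p, d_s, gamma = 16, 16, 0.9

def make_phi(dim):
    """Lazily assign a random Gaussian column to each new string (an assumption)."""
    cache = {}
    def phi(x):
        if x not in cache:
            cache[x] = rng.normal(0, 1 / np.sqrt(dim), size=dim)
        return cache[x]
    return phi

phi_P, phi_S = make_phi(d_p), make_phi(d_s)

# Trajectories: list of (symbols, rewards); made-up data.
data = [(("a", "b", "a"), (0.0, 0.0, 1.0)),
        (("b", "a"), (0.0, 1.0))]

H_c = np.zeros((d_p, d_s))
for syms, rews in data:
    ret = sum(gamma ** i * r for i, r in enumerate(rews))  # discounted return
    for t in range(len(syms) + 1):
        u, v = syms[:t], syms[t:]
        H_c += ret * np.outer(phi_P(u), phi_S(v))
H_c /= len(data)
```

The resulting small matrix `H_c` is what the truncated SVD then factorizes.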
We present the complete method in Algorithm 2. Instead of iteratively sweeping through the dataset as most planning methods do, one can build a UQF in just two passes over the data: one for building the compressed Hankel matrices, one for recovering the parameters. More precisely, letting $\bar{L}$ denote the maximum length of a trajectory in the dataset $D$, the time complexity of our algorithm is linear in $|D|$ and $\bar{L}$, and no extra planning time is needed. In contrast, the fitted-Q algorithm alone requires, for the planning stage only, time proportional to the expected number of fitted-Q iterations. Therefore, in terms of time complexity, our algorithm is linear in the number of trajectories, leading to a very efficient method.
Compute the compressed estimates of the Hankel matrices.
Perform truncated SVD on the estimated Hankel matrix with rank $n$.
Recover the WFA realizing the function $h \mapsto P_\pi(h)\rho(h)$.
Following Theorem 3.2, convert this WFA to the WFA realizing the UQF $\tilde{Q}$.
Return: a new deterministic policy defined by $\arg\max_{a} \tilde{Q}(ha)$.
3.4 Policy iteration
Policy iteration has been widely applied in both MDP and POMDP settings [5, 30], and has shown benefits from both empirical and theoretical perspectives. It is very natural to apply policy iteration to our algorithm, since we directly learn a policy from data. The policy iteration procedure is listed in Algorithm 3. Note that for re-sampling, we convert our learned deterministic policy to a probabilistic one in an ε-greedy fashion.
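The ε-greedy conversion used for re-sampling can be sketched in a few lines; the action count, ε value, and greedy action below are illustrative choices.

```python
import numpy as np

# Sketch of the re-sampling step: the learned deterministic policy is made
# epsilon-greedy so that re-sampled data keeps exploring.
def epsilon_greedy(greedy_action, n_actions, epsilon, rng):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return greedy_action

rng = np.random.default_rng(0)
actions = [epsilon_greedy(2, 4, 0.1, rng) for _ in range(1000)]
```

Most sampled actions follow the learned policy, while a small fraction explores uniformly.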
4 Experiments
To assess the performance of our method, we conducted experiments on two domains: a toy grid world environment and the S-PocMan game. We use TPSR/CPSR + fitted-Q as our baseline methods. Our experiments show that, in terms of both sample complexity and time complexity, we indeed outperform the classical two-stage algorithms.
4.1 Grid world experiment
The first benchmark for our method is the simple grid world environment shown in Fig. 1. The agent starts in a designated start tile and must reach the green goal state. At each time step, the agent can only perceive the number of surrounding walls, and proceeds to execute one of four actions: go up, down, left or right. To make the environment stochastic, with probability 0.2 the chosen action fails, resulting instead in a random action at the current time step. The reward function in this navigation task is sparse: the agent receives no reward until it reaches the terminal state. We ran three variants of the aforementioned grid world, each corresponding to a different starting state. As one can imagine, the further the goal state is from the starting state, the harder the task becomes.
We used a random policy to generate training data, which consisted of trajectories of length up to 100. To evaluate the policy learned by the different algorithms, we let the agent execute the learned policy for 1,000 episodes and computed the average accumulated discounted reward, with a discount factor of 0.99. The maximum length of a test episode was also set to 100. Hyperparameters (i.e. the number of rows and columns of the Hankel matrices and the rank of the SVD) were selected using cross-validation.
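The evaluation metric above can be sketched as follows; the episode reward sequences are made up, standing in for the sparse goal-reaching rewards of the grid world.

```python
# Sketch of the evaluation metric: average discounted return over test episodes,
# with discount factor 0.99. The reward sequences are made-up examples.
gamma = 0.99

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Two episodes reaching the goal after 10 and 25 steps, respectively.
episodes = [[0.0] * 10 + [1.0], [0.0] * 25 + [1.0]]
avg = sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)
```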
As a baseline, we use classical TPSRs and CPSRs as the learning method for the environment, and the fitted-Q algorithm as the planning algorithm. We also performed a hyperparameter search for the baseline methods using cross-validation. In addition, we report the rewards collected by a random policy as well as by the optimal policy for comparison.
Results on this toy domain (see Figure 1) highlight the sample and time efficiency achieved by our method. Indeed, our algorithm outperforms the classical CPSR+fitted-Q method in all three variants, notably achieving better performance in the small-data regime, which shows significant sample efficiency. Furthermore, our algorithm consistently converges to the optimal policy as the sample size increases. In addition, our method is much faster than the compared methods; for example, in the experiment with 800 samples, our method is substantially faster than CPSR+fitted-Q while achieving similar results.
4.2 S-PocMan domain
For the second experiment, we report results on the S-PocMan environment. The partially observable version of the classical game Pacman was first introduced by Silver and Veness and is referred to as PocMan. In this domain, the agent needs to navigate through a grid world to collect food while avoiding capture by the ghosts. It is an extremely large partially observable domain. However, Hamilton et al. showed that, due to the extensive reward information, a simple memoryless controller that treats the partially observable environment as if it were fully observable can perform extremely well. Hence, they proposed a harder version of PocMan, called S-PocMan. In this new domain, they drop the parts of the observation vector that allow the agent to sense the direction of the food and greatly sparsify the amount of food in the grid world, thereby making the environment more partially observable.
Table 1: Number of fitted-Q iterations, running time (s), and returns for each method.
In this experiment, we only used the combination CPSR+fitted-Q as our baseline algorithm, as TPSRs cannot scale to the large size of this environment. As in the grid world experiment, we selected the best hyperparameters through cross-validation. The discount factor for computing returns was set to 0.99999 in all runs. Table 1 shows the run time and average return for both our algorithm and the baseline. One can see that UQF achieves better performance than CPSR+fitted-Q. Moreover, UQF exhibits a significant reduction in running time: it is about 200 times faster than CPSR+fitted-Q. Note that building the CPSR takes a similar amount of time to our method; however, the extra iterative fitted-Q planning takes considerably more time to converge, as our analysis in Section 3.3 showed.
5 Conclusion
In this paper, we propose a novel learning and planning algorithm for partially observable environments. The main idea of our algorithm relies on the estimation of the unnormalized Q function via the spectral learning algorithm. Theoretically, we show that in a POMDP the UQF can be computed via a WFA, and consequently can be provably learned from data using the spectral learning algorithm for WFAs. Moreover, the UQF combines the learning and planning phases of reinforcement learning, yielding the corresponding policy in one step. Therefore, our method is more sample-efficient and time-efficient than traditional POMDP planning algorithms, as shown in the experiments on the grid world and S-PocMan environments.
Future work includes exploring theoretical properties of this planning approach. For example, a first step would be to obtain convergence guarantees for policy iteration based on the UQF spectral learning algorithm. In addition, our approach could be extended to the multitask setting by leveraging the multitask learning framework for WFAs proposed in previous work. Indeed, since we combine the environment dynamics and reward information, our approach should be able to deal with partially shared environment and reward structures, leading to a potentially flexible multitask RL framework.
-  Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
-  Borja Balle, Xavier Carreras, Franco M Luque, and Ariadna Quattoni. Spectral learning of weighted automata. Machine learning, 96(1-2):33–63, 2014.
-  Borja Balle, William Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In International Conference on Machine Learning, pages 1386–1394, 2014.
-  Richard G Baraniuk and Michael B Wakin. Random projections of smooth manifolds. Foundations of computational mathematics, 9(1):51–77, 2009.
-  Richard Bellman. Dynamic programming. Princeton University Press, 1957.
-  Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
-  Jack W. Carlyle and Azaria Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26–40, 1971.
-  Anthony R Cassandra, Leslie Pack Kaelbling, and Michael L Littman. Acting optimally in partially observable stochastic domains. 1994.
-  Manfred Droste, Werner Kuich, and Heiko Vogler. Handbook of weighted automata. Springer Science & Business Media, 2009.
-  Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
-  Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53(9):197–222, 1974.
-  William Hamilton, Mahdi Milani Fard, and Joelle Pineau. Efficient learning and planning with compressed predictive states. The Journal of Machine Learning Research, 15(1):3395–3439, 2014.
-  William L Hamilton, Mahdi Milani Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In International Conference on Machine Learning, pages 178–186, 2013.
-  Masoumeh T Izadi and Doina Precup. Point-based planning for predictive state representations. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 126–137. Springer, 2008.
-  Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.
-  Michael R James and Satinder Singh. Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the twenty-first international conference on Machine learning, page 53. ACM, 2004.
-  William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.
-  Biing Hwang Juang and Laurence R Rabiner. Hidden Markov models for speech recognition. Technometrics, 33(3):251–272, 1991.
-  Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
-  Michael L Littman and Richard S Sutton. Predictive representations of state. In Advances in neural information processing systems, pages 1555–1561, 2002.
-  Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.
-  Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.
-  Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask spectral learning of weighted automata. In Advances in Neural Information Processing Systems, pages 2588–2597, 2017.
-  Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the twenty-first international conference on Machine learning, page 88. ACM, 2004.
-  Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash kernels for structured data. Journal of Machine Learning Research, 10(Nov):2615–2637, 2009.
-  David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
-  Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 512–519. AUAI Press, 2004.
-  Satinder P Singh, Michael L Littman, Nicholas K Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 712–719, 2003.
-  Edward J Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
-  Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
-  Michael Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. The Journal of Machine Learning Research, 16(1):103–147, 2015.
-  Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
-  Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206, 2009.