1 Introduction
Decision-making and control often require that an agent interact with partially observed environments whose causal mechanisms are unknown. To enable efficient planning, one might hope to construct latent representations of histories of (action, observation) tuples. At present, this practice is dominated by two points of view: (i) the standard approach to partially observable Markov decision processes (POMDPs), where one starts from a generative model of the latent transition and emission dynamics, using observations to infer beliefs over the unobserved states [1, 5]; and (ii) predictive state representations (PSRs), where one starts from the history of the process and constructs states through modeling of the trajectories of observations [27, 33]. Both directions have drawbacks: the belief-state approach requires access to a model equivalent to the real generator, while PSRs are constrained by their high-dimensional nature, which often makes planning infeasible.
We propose a principled approach for learning state representations that generalizes PSRs to nonlinear predictive models and allows a formal comparison between generator- and history-based state abstractions. We exploit the idea of causal states [10, 31, 32, 8], i.e., the coarsest partition of histories into classes that are maximally predictive of the future. By learning this mapping from histories to clusters, causal states constitute a discrete causal graph of the observation process.
Our method exploits the existence of minimal discrete representations to derive approximately optimal representations from recurrent neural networks (RNNs) trained to model the environment. At the core lies the idea of discretizing the high-dimensional continuous states generated by RNNs into a finite set of clusters. When the clusters are used as input for predictive models, they achieve the same predictive power as the original RNN. Even if this representation is not minimal, each cluster is guaranteed to map to a single causal state [31]. We use this approach to extend, theoretically contextualize, and integrate two recent algorithms [25, 7] that learn tabular RL policies from discrete representations extracted with neural networks. We evaluate our algorithm for approximate causal-state reconstruction on a modification of the original VizDoom environment used in [25], as well as on multiple GridWorld navigation tasks with partially observable states. Our DQN models and their tabularized counterparts systematically outperform (in both return and stability) their recurrent-DQN baselines.
2 State Representation for Decision Processes
Consider the nonlinear stochastic process emerging from the interaction between an agent's policy, which chooses discrete actions $a_t$ taking values from the alphabet $\mathcal{A}$, and the environmental response $o_t$ taking values from the alphabet $\mathcal{O}$. Let $X_t$ be the joint observation-action variable with realizations $x_t = (a_t, o_t)$, with $x_t \in \mathcal{X} = \mathcal{A} \times \mathcal{O}$.¹

¹ For a fixed random variable $X_t$, we indicate its non-inclusive past with $\overleftarrow{X}_t = \ldots X_{t-2} X_{t-1}$ and its inclusive future with $\overrightarrow{X}_t = X_t X_{t+1} \ldots$, dropping the subscript $t$ from the notation for convenience when the context is clear. The set of all bi-infinite sequences with alphabet $\mathcal{X}$ is indicated as $\overleftrightarrow{\mathcal{X}}$. By $\overleftarrow{X}^{L}$ and $\overrightarrow{X}^{L}$, we indicate the finite sequences of the last $L$ and next $L$ variables, respectively.

The future dynamics depend jointly on the stationary policy, which maps the joint histories $\overleftarrow{x}$ into future actions, and the environment channel, which maps the joint histories and future actions into future observations. An agent's preferences over the future dynamics are defined via its reward function $r : \mathcal{X} \to \mathbb{R}$. The optimal policy of the agent maximizes the expected reward. We restrict our attention to environment channels and agent policies that generate a stochastic process that is ergodic and stationary, i.e., processes for which the probability of every bi-sequence of finite length is time-invariant and can be reliably estimated from empirical data.
The formalism of POMDPs supposes a hidden Markov process $S_t$, with realizations $s_t$ taking values in a discrete set $\mathcal{S}$, and observations emitted through the action-conditional probability $\Pr(o_t \mid s_t, a_t)$ [23]. This causal relationship between the observed process and the hidden states implies that the generator state, jointly with the current action, carries at least as much mutual information about the next observation as any competing representation of the history. This next-step sufficiency extends to the infinite future due to the recursive nature of generator and belief-state computations.
2.1 Belief and Predictive State Representations
A typical approach to planning in POMDPs assumes the agent has access to the transition and emission probabilities of the generator and uses them to construct the belief state $b_t = \Pr(S_t \mid \overleftarrow{x}_t)$ from the finite realizations of the history. Belief states are computed recursively using Bayes' formula from an initial belief $b_0$ and give rise to the belief process $B_t$. The belief process is a sufficient statistic of the generator state when $H[S_t \mid B_t] = 0$, and is said to be asymptotically synchronized when $\lim_{t \to \infty} H[S_t \mid B_t] = 0$, where $H[\cdot \mid \cdot]$ is the conditional-entropy function. When $\lim_{t \to \infty} H[S_t \mid B_t] > 0$, the generator states contain more information about the future observable than the complete history of observations, implying absence of asymptotic synchronization [9]; in that case the belief states are only sufficient statistics of the history.
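The recursive Bayes update of the belief state described above admits a compact sketch. The code below is a minimal numpy illustration, not the paper's implementation; the transition/emission tensors and their numbers are hypothetical.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One step of the recursive Bayes belief filter.

    b: current belief over hidden states, shape (S,)
    a: action index; o: observation index
    T: transition tensor, T[a, s, s'] = Pr(s' | s, a)
    O: emission tensor,  O[a, s, o]  = Pr(o | s, a)
    """
    predicted = b @ T[a]              # predictive distribution over next states
    unnorm = predicted * O[a][:, o]   # weight by likelihood of the observation
    return unnorm / unnorm.sum()      # renormalize to a probability vector

# Tiny 2-state, 1-action, 2-observation example (illustrative numbers only).
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b1 = belief_update(b, a=0, o=0, T=T, O=O)
```

Observing symbol 0, which is more likely under state 0, shifts the belief mass toward state 0; iterating this map over a realization yields the belief process $B_t$.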
The PSR approach relaxes the assumption of any knowledge about the underlying generator and constructs the representation using the outputs of a predictive model of the next $L$ observations, conditioned on the next $L$ actions (the test) sampled from the set of feasible length-$L$ action sequences. Each such model is a sufficient statistic of the length-$L$ future observations, and the complete collection of models over all $L$ is a sufficient statistic of the infinite future observations. Typically, PSRs are constructed for decision processes using a linear model that enables approximate solutions by assuming that the infinite-dimensional system dynamics matrix has finite rank [33].
2.2 Causal States Representations
We propose to use the causal-states representation, which extends PSRs to the general case of nonlinear predictive models and allows the definition of a formal equivalence between the generator states, when they exist, and the causal states reconstructed from history. As in the PSR framework, causal states depend on a predictive model of the observation process.
Definition 1

[31] The causal states of a stochastic process are the equivalence classes of histories induced by the map $\epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}' : \Pr(\overrightarrow{X} \mid \overleftarrow{x}') = \Pr(\overrightarrow{X} \mid \overleftarrow{x}) \}$,

where $S_t = \epsilon(\overleftarrow{X}_t)$ is the variable denoting the causal state at time $t$, overwriting the definition in Sec. 2 of the unknown ground-truth state. Since all histories belonging to the same equivalence class predict the same (conditional) future, the corresponding causal state can be used to fully summarize the information content of those histories. It can be demonstrated [31] that the partition induced by $\epsilon$ is the coarsest possible and generates the minimal sufficient representation across the model class. Sampling of new symbols in the sequence induces the creation of new histories and consequently new causal states. Because of this mapping from histories to states, the resulting hidden Markov model is unifilar.

Definition 2

[31] A unifilar hidden Markov model is an HMM whose state transition probability is deterministic when conditioned on the output symbol, i.e., $\Pr(S_{t+1} \mid S_t, X_{t+1}) \in \{0, 1\}$.
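Definition 1 suggests a direct, if naive, empirical procedure: group finite histories whose empirical next-symbol distributions coincide. The sketch below assumes a short memory and a long sample; it is a stand-in for the conditional hypothesis tests of [32], not the method proposed in this paper, and the tolerance `tol` is an illustrative parameter.

```python
from collections import defaultdict

def causal_state_partition(seq, L=2, tol=0.05):
    """Partition length-L histories of a discrete sequence into approximate
    causal states by comparing empirical next-symbol distributions."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(L, len(seq)):
        counts[tuple(seq[t - L:t])][seq[t]] += 1
    symbols = sorted(set(seq))

    def dist(h):
        total = sum(counts[h].values())
        return tuple(counts[h][s] / total for s in symbols)

    states = []  # each entry: (representative distribution, member histories)
    for h in sorted(counts):
        d = dist(h)
        for rep, members in states:
            if max(abs(x - y) for x, y in zip(rep, d)) < tol:
                members.append(h)
                break
        else:
            states.append((d, [h]))
    return states

# A period-2 alternating process has exactly two causal states.
seq = [0, 1] * 500
states = causal_state_partition(seq, L=2)
```

For the alternating sequence, the history (0, 1) always predicts 0 and (1, 0) always predicts 1, so the partition recovers two states regardless of the chosen history length.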
With explicit reference to the joint input-output history, the state transition dynamics are governed by input-conditional transition matrices $T^{(a)}(o)$ with elements:

$T^{(a)}_{ij}(o) = \Pr(S_{t+1} = s_j, O_{t+1} = o \mid S_t = s_i, A_{t+1} = a). \qquad (3)$
Since the causal states are defined over histories of joint symbols, the causal-state model is unifilar with respect to the joint variable $X_t = (A_t, O_t)$, i.e., the transitions between states are deterministic once the next action and observable have been sampled: $\Pr(S_{t+1} \mid S_t, A_{t+1}, O_{t+1}) \in \{0, 1\}$. The unifilar property implies that the recurrent dynamics of the causal states are fully specified by the state-action-conditional symbol emission probability $\Pr(O_{t+1} \mid S_t, A_{t+1})$ and the action-symbol-conditional causal-state emission probability $\Pr(S_{t+1} \mid S_t, A_{t+1}, O_{t+1})$. As a consequence, knowledge of the current causal state and of the future action-observation sequence induces a deterministic sequence of future causal states.
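The unifilar property can be checked empirically: across all observed transitions, the triple (state, action, observation) must determine the successor state uniquely. A minimal sketch, with hypothetical transition samples:

```python
def is_unifilar(transitions):
    """transitions: iterable of (s, a, o, s_next) tuples.
    Returns True iff (s, a, o) determines s_next uniquely in the sample,
    i.e. the empirical Pr(s' | s, a, o) is always 0 or 1."""
    successor = {}
    for s, a, o, s_next in transitions:
        key = (s, a, o)
        if successor.setdefault(key, s_next) != s_next:
            return False
    return True

# Deterministic successor given (state, action, observation): unifilar.
ok = is_unifilar([(0, 'a', 0, 1), (0, 'a', 1, 0), (1, 'a', 0, 1)])
# The same (s, a, o) leading to two different states: not unifilar.
bad = is_unifilar([(0, 'a', 0, 1), (0, 'a', 0, 2)])
```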
2.3 Stochastic Processes with Finite Causal States
When the joint process admits a finite causal-state representation, it is called a finitary stochastic process, a property with multiple theoretical implications. In discrete stochastic processes with finite actions, finite symbol alphabets, and finite memory of length $L$, the causal states are always finite, with a worst-case scenario in which each subsequence of length $L$ belongs to a distinct causal state, forming an order-$L$ Markov model [31]. When the causal states are finite, they are also unique up to isomorphism [31] and always generate a stationary stochastic process. If the underlying generator is non-unifilar, the causal states have the same information content as the potentially non-synchronizing belief states of the generator, and the belief states defined over the causal states always asymptotically synchronize to the actual causal states [9].
We focus on partially observable environments with discrete causal states and either continuous or discrete observations. For continuous observations, it is not possible to derive generic conditions that imply discrete or finite causal states. The existence of discrete latent states therefore has to be directly assumed or derived from alternative assumptions, such as the existence of finite discrete latent variables underlying each continuous observation. When a memoryless map from continuous observations to latent discrete variables exists, the causal states of the revealed continuous-variable process coincide with those defined over the underlying discrete variables.
3 Methods
In the previous sections, we introduced a class of stochastic processes with discrete or continuous outputs that are optimally compressed by a finite-state hidden Markov representation, called the causal-state model of the process.
We now propose a new approach to approximately reconstruct these causal states from empirical data.
3.1 Empirical Estimation of Causal States
Existing methods either directly partition past sequences of length $L$ into a finite number of causal states via conditional hypothesis tests [32] or use Bayesian inference over a set of existing candidate states [36]. Either method can be adapted to model a joint process, and consequently obtain the next-step conditional output by marginalizing out the action, but neither extends to the real-valued measurement case described above without strong assumptions on the shape of the conditional density function.

In this work, we exploit the fact that minimal sufficient statistics can be computed from any other non-minimal sufficient statistic. We obtain (approximately) minimal representations of the underlying process by discretizing a sufficient model of the measurement process. This approach amounts to learning a hierarchy of optimal predictive models with progressively stricter bounds on the dimensionality of their representations. We start with high-dimensional continuous representations learned with a deep neural network, and then partition the continuous representations into a finite set of clusters.
3.2 Learning Sufficient Statistics of History with Recurrent Networks
Recurrent neural networks (RNNs) are unifilar hidden Markov models with continuous states, where the transitions and state output probabilities are parameterized by differentiable functions. We use them to obtain recursively computable, high-dimensional sufficient statistics of the action-measurement joint process. This representation is learned via a recurrent encoder $f_\theta$ and a next-step prediction network $p_\phi$. The overall neural-network architecture resembles world models [14], except that we autoencode the observation only with image inputs and only use it for next-step prediction in the other cases. Furthermore, we use an explicit embedding layer for the action that is concatenated with the output of the recurrent encoder before predicting the next observation.
We note that when the recurrent state $q_t$ is maximally predictive of the subsequent observations (the future $\overrightarrow{O}$), it constitutes a sufficient statistic of the latent states. In practice, we estimate the continuous representations using the empirical realizations of the process (with a small abuse of notation, we keep the same convention for $x_t$ and shift measurements by one time step, so that each joint element pairs an action with the observation it induces) to learn a neural network that approximates end-to-end the maps $f_\theta$ and $p_\phi$ by minimizing the temporal loss through the following optimization problem:
$\theta^{*}, \phi^{*} = \arg\min_{\theta, \phi} \; -\sum_{t} \log p_{\phi}\big(o_{t+1} \mid q_t, a_{t+1}\big), \qquad q_t = f_{\theta}(\overleftarrow{x}_t), \qquad (4)$

where $f_{\theta}$ is the recurrent encoder and $p_{\phi}$ the next-step prediction network.
After solving Eq. 4, we can use the neural networks parameterized by the optimal parameters to derive sufficient continuous representations, from which we create discrete states that are a refinement of the causal states.
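The recursive computation of the sufficient statistic can be sketched with a plain numpy RNN cell. This is a forward-pass illustration only: the GRU of the paper is replaced by a vanilla tanh cell, the weights are random rather than trained, and all names ($H$, $A$, $O$, `encode`, `predict_next_obs`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, A, O = 16, 4, 5  # hidden size, number of actions, number of observations

# Random parameters standing in for a trained encoder and prediction head.
W_h = rng.normal(0, 0.1, (H, H))
W_x = rng.normal(0, 0.1, (H, A + O))
W_out = rng.normal(0, 0.1, (O, H + A))

def encode(history):
    """Recursively fold the (action, observation) history into a state q_t."""
    q = np.zeros(H)
    for a, o in history:
        x = np.zeros(A + O)
        x[a] = 1.0       # one-hot action
        x[A + o] = 1.0   # one-hot observation
        q = np.tanh(W_h @ q + W_x @ x)   # recursive map: q_{t+1} = f(q_t, x_t)
    return q

def predict_next_obs(q, a_next):
    """Prediction head: concatenate q_t with the embedded next action."""
    a = np.zeros(A)
    a[a_next] = 1.0
    logits = W_out @ np.concatenate([q, a])
    p = np.exp(logits - logits.max())
    return p / p.sum()   # softmax over next observations

q = encode([(0, 1), (2, 3), (1, 0)])
p = predict_next_obs(q, a_next=2)
```

The point of the sketch is the shape of the computation: a fixed-size state updated recursively from the joint history, plus an action-conditioned distribution over the next observation, exactly the two maps optimized in Eq. 4.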
3.3 Discretization of the RNN Hidden States
Together with the unifilar and Markovian nature of transitions in RNNs, the sufficiency of the learned recurrent state implies that there exists a function that describes the causal states as a partition of the learned latent states [31, 9].
We set up a second optimization problem using the trained neural network and the empirical realizations of the process to estimate the discretizer $d_\psi$, with $\hat{s}_t = d_\psi(q_t)$, and the new prediction network $p_\xi$ that maps the estimated discrete states into the next observable. We match the predictive behavior of the original network and of the new network that uses the discretized states by minimizing the knowledge distillation [19] loss:
$\psi^{*}, \xi^{*} = \arg\min_{\psi, \xi} \; \sum_{t} D_{\mathrm{KL}}\Big( p_{\phi^{*}}\big(\cdot \mid q_t, a_{t+1}\big) \,\Big\|\, p_{\xi}\big(\cdot \mid \hat{s}_t, a_{t+1}\big) \Big), \qquad \hat{s}_t = d_{\psi}(q_t), \qquad (5)$

where $d_{\psi}$ is the discretizer and $p_{\xi}$ the prediction network operating on the discrete states.
Minimizing Eq. 5 guarantees a sufficient discrete representation. To summarize, we first minimize the temporal loss of Eq. 4 to obtain a neural model able to generate continuous sufficient statistics of the future observables of the process, and subsequently minimize the distillation loss of Eq. 5 to obtain a sufficient discrete representation of the dynamical system that is a refinement of the original causal states. Fig. 1 shows the stochastic process representing the environment, our learned continuous and discrete states, and their interactions.
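The simplest picture of the discretization step is the k-means variant mentioned in the implementation details: cluster the continuous RNN states and use the cluster index as the discrete state. The sketch below uses synthetic, hypothetical states from two well-separated regimes and a deliberately simple two-means routine (deterministic extreme-point initialization); the paper's main pipeline uses a quantized bottleneck trained by distillation instead.

```python
import numpy as np

def two_means(X, iters=20):
    """Plain 2-means clustering acting as the discretizer d: q_t -> cluster id.
    Centers are initialized at the coordinate-wise min and max for determinism."""
    centers = np.stack([X.min(axis=0), X.max(axis=0)])
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Synthetic continuous states: two regimes that a sufficient RNN would have
# to keep apart because they predict different futures (illustrative data).
rng = np.random.default_rng(1)
q_states = np.concatenate([rng.normal(-2.0, 0.1, (100, 3)),
                           rng.normal(+2.0, 0.1, (100, 3))])
labels, centers = two_means(q_states)
```

When the continuous states are sufficient and well separated, each cluster collects histories that predict the same future, so feeding the cluster index to a new prediction head loses no predictive power.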
3.4 Implementation Details
For the GridWorld and ToyProcess experiments, the base world-model architecture is composed of a three-layer perceptron (MLP) encoding the observation and a single-layer linear embedding for the action. The outputs of the respective layers are concatenated and fed to a Gated Recurrent Unit (GRU) [6], and the output of the recurrent network is concatenated with the embedding of the next action and fed to a second MLP that outputs predictions for the next observation. All the embeddings have 64 neurons, except in layout 4, where we use 256-dimensional embeddings. The discretization network is composed of a Quantized Bottleneck Network [25] with ternary tanh neurons, which autoencodes the continuous representation generating the discrete variables, and an MLP decoder that uses the estimated discrete states to predict the next observation.

Both networks are trained with the RMSprop algorithm, using a cross-entropy loss for discrete observations and a reconstruction loss in the continuous setting. The world model is trained through supervised learning of the temporal loss, while the discretization network is trained via knowledge distillation using the soft outputs of the GRU decoder as targets. We run downstream evaluation of our learned representations with value iteration and compare with baselines. To approximate the value function, we use a fully-connected DQN architecture with ReLU activations and a 64-dimensional hidden layer. The RDQN baselines use the same architecture but are trained end-to-end via reward maximization. We also present earlier results where the discretization step is implemented with k-means clustering and where policies are learned with traditional tabular Q-learning.
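The ternary bottleneck of the quantized network [25] can be sketched in its forward pass alone. The thresholds below are illustrative; during training, QBN-style models back-propagate through a smooth surrogate (straight-through or soft relaxation), which is omitted here.

```python
import numpy as np

def ternary_quantize(h):
    """Forward pass of a ternary bottleneck unit: map each continuous
    activation to the nearest value in {-1, 0, +1}. With d bottleneck
    units, the number of reachable discrete states is at most 3**d."""
    return np.where(h > 0.5, 1.0, np.where(h < -0.5, -1.0, 0.0))

codes = ternary_quantize(np.array([0.9, 0.2, -0.7, -0.4]))
```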
For the VizDoom and ALE experiments, we build upon the base VaST architecture [7], which uses binary Bernoulli variables and Gumbel-Softmax [22] for its discrete bottleneck. The original architecture for MDPs is composed of a variational encoder that embeds the observation into binary variables, which are combined with the action to predict the next latent discrete state. We extend it with a recurrent encoder that gives the agent access to the history of observations. Since we build on the original implementation, we use prioritized sweeping with tabular Q-learning for the downstream policy. The world model is initialized from 10k random rollouts and trained using the reconstruction loss.
4 Experiments
We employ three partially observable environments to learn approximate causal states through selfsupervised learning and use these representations as input for reinforcement learning tasks defined over the domains.
4.1 Gridworlds
Method   Layout 1                              Layout 2
         low-disc     low-cont     ego-cont     low-disc     low-cont     ego-cont
Tab.     0.43, 0.     0.42, 0.     0.437, 0.    0.01, 0.     0.09, 0.     0.036, 0.
DQN      0.50, 0.015  0.42, 0.10   0.49, 0.032  0.17, 0.76   0.026, 0.26  0.12, 0.064
DQN      0.5, 0.      0.5, 0.      0.49, 0.029  0.30, 0.     0.30, 0.     0.30, 0.01
DQN      9.46, 0.20   8.55, 0.79   8.64, 1.01   9.48, 0.14   8.59, 0.78   9.83, 0.25
DQN      0.91, 3.0    0.49, 0.03   0.34, 2.31   0.23, 0.16   0.084, 0.18  0.07, 0.24
DRQN     9.75, 0.22   6.95, 3.66   9.27, 0.68   5.63, 3.74   3.72, 4.31   9.97, 0.06
Tab.     9.40, 0.     --           --           9.11, 0.     --           --
Tab.     0.45, 0.     0.42, 0.     0.43, 0.     0.23, 0.     0.23, 0.     0.23, 0.
DQN      0.44, 0.019  0.44, 0.027  0.44, 0.023  0.30, 0.01   0.76, 3.07   0.23, 0.10

Models trained on 1000 episodes and evaluated on 100. Numbers are mean, standard deviation across 10 random seeds. The first section is our method using k-means clustering for discretization; the second contains baselines on the current observation and on the history of observations.
The final section uses ground-truth states.

We create partially observable gridworld environments where the task is for the agent to first obtain the key and then pass through the door to obtain the final reward. The final state is unreachable (i.e., the agent cannot pass through the door) if the agent does not have the key. The agent only knows it has the key if it remembers entering the state containing the key, so without memory this task is partially observable. At each time step the agent receives a -0.1 reward, a 0.5 reward for picking up the key, and a final reward of 1 for passing through the door.
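The key-then-door task reduces to a minimal environment sketch. The layout below (a short corridor with hypothetical key and door positions) is illustrative, and we read the per-step reward as a -0.1 step penalty, consistent with the reward scheme above.

```python
class KeyDoorCorridor:
    """1-D corridor: agent starts at 0, key at `key_pos`, door at the end.
    Rewards: -0.1 per step, +0.5 for picking up the key, +1 for passing
    through the door (only possible while holding the key)."""

    def __init__(self, length=5, key_pos=2):
        self.length, self.key_pos = length, key_pos
        self.pos, self.has_key, self.done = 0, False, False

    def step(self, action):  # action: -1 (left) or +1 (right)
        assert not self.done
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        reward = -0.1
        if self.pos == self.key_pos and not self.has_key:
            self.has_key = True
            reward += 0.5
        if self.pos == self.length - 1 and self.has_key:
            reward += 1.0
            self.done = True
        # Partial observability: the observation omits `has_key`, so the
        # agent must remember the key pickup to know the door is passable.
        return self.pos, reward

env = KeyDoorCorridor()
total = 0.0
for a in [1, 1, 1, 1]:  # walk right: key at position 2, door at position 4
    obs, r = env.step(a)
    total += r
```

Because the observation is the position alone, any memoryless policy cannot distinguish the pre-key and post-key visits to the same cell, which is exactly the partial observability the causal-state representation must resolve.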
We conduct experiments with four layouts with an increasing number of states (see figures in the supplementary materials). The first layout is a one-dimensional corridor, while the other three are two-dimensional mazes. The minimal memory requirement for solving the environment is given by the shortest path from the key to the final destination.
We use three types of observation processes that give different priors to the agent's behavior but share the same underlying causal states. low-disc is the discrete observation of the agent's absolute position in the grid. low-cont is the continuous observation of the agent's absolute position in the grid. ego-cont is the egocentric continuous observation of the distance from the walls in the four cardinal directions (up, down, left, right).
DQN on the learned discrete representation is able to achieve the optimal policy across all 10 random seeds with very low or zero standard deviation, showing the stability of our learned states. We expect the discrete representation to perform as well as or better than the continuous one, as it is distilled from it and therefore contains the same information. The observation-only baseline uses just the current observation, while the history-based baseline learns using a recurrent DQN (DRQN) [16].
4.2 Doom
We modify the T-maze VizDoom [24] environment of [7] to make it partially observable. We randomize the goal location between the two ending corners and signal its location with a stochastic signal in the observation space. The agent must remember where the goal is in order to navigate to it. We convey the signal to the agent through a fourth channel (after RGB) that intermittently contains information about where the goal is. The frequency at which the information is displayed is a tunable factor for these experiments. Example trajectories of the learned agent to the different goals are shown in Fig. 3 (left).
Results. Fig. 3 (right) shows the speed-up in learning obtained by explicitly learning to cluster sequences of observations into causal states. Additional results with varying signal frequencies are in the Supplementary Material.
4.3 Atari Pong
We evaluate our model on the Arcade Learning Environment's Pong [3], an environment where causal states are less intuitive. Our algorithm learns a set of discrete causal states suitable for learning successful policies (Fig. 4). We use the same preprocessing of observations as [29], except that the agent receives a single frame as observation at each time step. The environment is thus partially observable, as the agent must learn to retain information about the velocity of the paddles and the ball in the hidden state of the RNN. We see a decrease in performance with DRQN, due to instability in training.
4.4 Toy StochasticProcesses
We apply our method to input-output stochastic processes that can be modeled as HMMs with finite memory length and alphabet size. We complicate the continuous observation process by mapping the environment's outputs into (a) multivariate Gaussians and (b) images sampled from the MNIST dataset. The goal is to maximize the occurrence of a target output symbol: the environment returns a reward of 1 at each time step at which the target condition holds, and 0 otherwise. Each episode lasts 100 time steps. Results are in the appendix.
5 Related Literature
The relationship between PSRs and causal states has been previously suggested in the computational mechanics literature [31, 32, 2]. [20] also derive parallels between PSRs, POMDPs, and automata through the construction of equivalence classes that group states with common action-conditional future observations. Causal states, and the related information-theoretic notions of complexity, minimality, and sufficiency, are used to derive task-agnostic policies with an intrinsic exploration-exploitation trade-off in [34, 35].
In [4], the authors learn PSRs in a Reproducing Kernel Hilbert Space, extending the approach to continuous, potentially infinite, action-observation processes. Closer to our latent discrete states with continuous observables, [12] model spatio-temporal processes as being generated by a finite set of discrete causal states in the form of light cones.
Other methods for learning deep representations for reinforcement learning in POMDPs have been proposed recently, starting with adding recurrence to DQN [16] to integrate the history into the estimation of the Q-value, as opposed to using only the current observation. However, this method stops short of ensuring sufficiency for next-step prediction, as it learns a task-specific representation. [21, 38] use deep variational methods to learn a probability distribution over states, i.e., belief states, and use the belief states for policy optimization with actor-critic methods [21] and planning [38]. [13] also use neural methods to learn belief states with next-step prediction. [17, 18] learn PSRs with RNNs and spectral methods, and use policy gradient with an alternating optimization method for the policy and the state representation to handle continuous action spaces. However, none of these explore the connection to causal states and compression via a discrete representation. [7] do learn discrete representations, but not in partially observable environments and with no link to PSRs; instead, they propose the discrete representation solely for using tabular Q-learning with prioritized sweeping.

The idea of extracting the implicit knowledge of a neural network [37, 11, 28, 15] is not novel and is rooted in early attempts to merge traditional rule-based methods with machine learning. The most recent examples [41, 39, 40] are focused on the ability of character-level RNNs to implicitly learn grammars. The only application of these ideas to RL in partially observable environments that we are aware of is [25], where deep recurrent policies [16] are quantized into Moore machines. The main difference between our approaches is that we reduce models of the environment, not of the policy, to ε-machines, which are edge-emitting Mealy machines. The connection between ε-machines and causal states is further discussed in the Supplementary Materials. Because the optimal policy will in general use a subset of the possible input sequences, the minimal sufficient representation of a policy is typically smaller than the causal states of the complete environment.

6 Conclusions
In this paper, we proposed a self-supervised method for learning a minimal sufficient statistic for next-step prediction and articulated its connection to the causal states of a controlled process. Our experiments demonstrate the practical utility of this representation, both with value iteration for control and with exhaustive planning, in solving stochastic and infinite-memory POMDP environments as well as higher-order MDP environments with high-dimensional observations, matching the performance achieved with ground-truth states.
We encountered multiple problems while training recurrent DQN models end-to-end, even though we know from our causal-states results that the same architecture is able to learn sufficient representations for the task. We plan to extend the Distributed Experience Replay algorithm [26] with extra asynchronous workers dedicated, respectively, to the construction of continuous and discrete world models.
Acknowledgements
Part of this work was supported by the National Science Foundation (grant number CCF1317433), CBRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. A. Anandkumar is supported in part by Bren endowed chair, Darpa PAI, Raytheon, and Microsoft, Google and Adobe faculty fellowships. K. Azizzadenesheli is supported in part by NSF Career Award CCF1254106 and AFOSR YIP FA95501510221, work done while he was visiting Caltech. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.
References
 [1] Karl J Aastrom. Optimal control of markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.
 [2] Nix Barnett and James P Crutchfield. Computational mechanics of input–output processes: Structured transformations and the epstransducer. Journal of Statistical Physics, 161(2):404–451, 2015.

 [3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
 [4] Byron Boots, Geoffrey Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819, 2013.
 [5] Anthony R Cassandra, Leslie Pack Kaelbling, and Michael L Littman. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028, 1994.
 [6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [7] Dane S. Corneil, Wulfram Gerstner, and Johanni Brea. Efficient model-based deep reinforcement learning with variational state tabulation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1057–1066, 2018.
 [8] James P Crutchfield. The origins of computational mechanics: A brief intellectual history and several clarifications. arXiv preprint arXiv:1710.06832, 2017.
 [9] James P Crutchfield, Christopher J Ellison, Ryan G James, and John R Mahoney. Synchronization and control in intrinsic and designed computation: An informationtheoretic analysis of competing models of stochastic computation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(3):037105, 2010.
 [10] James P Crutchfield and Karl Young. Inferring statistical complexity. Physical Review Letters, 63(2):105, 1989.
 [11] LiMin Fu. Rule generation from neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1114–1124, 1994.
 [12] Georg Goerg and Cosma Shalizi. Mixed licors: A nonparametric algorithm for predictive state reconstruction. In Artificial Intelligence and Statistics, pages 289–297, 2013.
 [13] Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A. Pires, Toby Pohlen, and Rémi Munos. Neural predictive belief representations. CoRR, abs/1811.06407, 2018.
 [14] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 [15] Tameru Hailesilassie. Rule extraction algorithm for deep neural networks: A review. arXiv preprint arXiv:1610.05267, 2016.
 [16] Matthew Hausknecht and Peter Stone. Deep recurrent qlearning for partially observable mdps. 2015.
 [17] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. In Advances in neural information processing systems, pages 1963–1971, 2015.
 [18] Ahmed Hefny, Zita Marinho, Wen Sun, Siddhartha Srinivasa, and Geoffrey Gordon. Recurrent predictive state policy networks. arXiv preprint arXiv:1803.01489, 2018.
 [19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [20] Christopher Hundt, Joelle Pineau, and Doina Precup. Representing systems with hidden state. 2006.
 [21] Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for pomdps. arXiv preprint arXiv:1806.02426, 2018.
 [22] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. CoRR, abs/1611.01144, 2016.
 [23] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artif. Intell., 101(12):99–134, May 1998.
 [24] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doombased AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, pages 341–348, Santorini, Greece, Sep 2016. IEEE. The best paper award.
 [25] Anurag Koul, Sam Greydanus, and Alan Fern. Learning finite state representations of recurrent policy networks. arXiv preprint arXiv:1811.12530, 2018.
 [26] Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart Russell, and Pieter Abbeel. Learning plannable representations with causal infogan. arXiv preprint arXiv:1807.09341, 2018.
 [27] Michael L Littman and Richard S Sutton. Predictive representations of state. In Advances in neural information processing systems, pages 1555–1561, 2002.
 [28] Hongjun Lu, Rudy Setiono, and Huan Liu. Effective data mining using neural networks. IEEE transactions on knowledge and data engineering, 8(6):957–961, 1996.

 [29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
 [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.
 [31] Cosma Rohilla Shalizi and James P Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of statistical physics, 104(34):817–879, 2001.
 [32] Cosma Rohilla Shalizi and Kristina Lisa Shalizi. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 504–511. AUAI Press, 2004.
 [33] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 512–519. AUAI Press, 2004.
 [34] Susanne Still. Informationtheoretic approach to interactive learning. EPL (Europhysics Letters), 85(2):28005, 2009.
 [35] Susanne Still and Doina Precup. An informationtheoretic approach to curiositydriven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.
 [36] Christopher C Strelioff and James P Crutchfield. Bayesian structural inference for hidden processes. Physical Review E, 89(4):042119, 2014.
 [37] Geoffrey G Towell and Jude W Shavlik. Extracting refined rules from knowledge-based neural networks. Machine learning, 13(1):71–101, 1993.
 [38] Sebastian Tschiatschek, Kai Arulkumaran, Jan Stühmer, and Katja Hofmann. Variational inference for data-efficient model learning in pomdps. CoRR, abs/1805.09281, 2018.
 [39] Qinglong Wang, Kaixuan Zhang, Alexander G Ororbia II, Xinyu Xing, Xue Liu, and C Lee Giles. An empirical evaluation of recurrent neural network rule extraction. arXiv preprint arXiv:1709.10380, 2017.
 [40] Qinglong Wang, Kaixuan Zhang, Alexander G Ororbia II, Xinyu Xing, Xue Liu, and C Lee Giles. A comparison of rule extraction for different recurrent neural network models and grammatical complexity. arXiv preprint arXiv:1801.05420, 2018.
 [41] Gail Weiss, Yoav Goldberg, and Eran Yahav. Extracting automata from recurrent neural networks using queries and counterexamples. arXiv preprint arXiv:1711.09576, 2017.
 [42] Amy Zhang, Adam Lerer, Sainbayar Sukhbaatar, Rob Fergus, and Arthur Szlam. Composable planning with attributes. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80, pages 5837–5846. JMLR.org, 2018.
Appendix A Causal States and ε-machines
In the computational mechanics literature, causal state models are usually called ε-machines and are formally defined as:
Definition 3 (ε-machine) Two histories are causally equivalent if and only if they induce the same conditional probability distribution over future observations. The causal states of a process are the equivalence classes of this relation, and its ε-machine is the unifilar hidden Markov model whose hidden states are the causal states, with transitions induced by the process dynamics.
To summarize, the ε-machine of a stochastic process is the minimal unifilar hidden Markov model able to generate its empirical realizations. The hidden states of an ε-machine are called the causal states of the process, and correspond to partitions of the process history.
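The causal-state construction can also be approximated directly from data, in the spirit of the blind reconstruction of [32], by grouping histories whose empirical next-symbol distributions match. A minimal sketch, where the function name, history length, and matching tolerance are illustrative choices rather than the paper's algorithm:

```python
from collections import defaultdict

def causal_states(sequence, history_len=2, tol=0.05):
    """Cluster length-`history_len` histories whose empirical
    next-symbol distributions agree within `tol`; each cluster
    approximates one causal state of the process."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(history_len, len(sequence)):
        hist = tuple(sequence[i - history_len:i])
        counts[hist][sequence[i]] += 1

    # Empirical conditional distributions P(next symbol | history).
    dists = {}
    for hist, c in counts.items():
        total = sum(c.values())
        dists[hist] = {sym: n / total for sym, n in c.items()}

    # Merge histories with approximately equal future distributions.
    states = []  # list of (representative distribution, member histories)
    for hist, d in sorted(dists.items()):
        for rep, members in states:
            support = set(rep) | set(d)
            if all(abs(rep.get(s, 0.0) - d.get(s, 0.0)) <= tol for s in support):
                members.append(hist)
                break
        else:
            states.append((d, [hist]))
    return [members for _, members in states]
```

For a strictly alternating sequence such as "abab…" with history length 1, the two histories have deterministic and distinct futures, so the sketch recovers exactly two causal states.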
Appendix B Additional Visualizations for Grid-World experiments
In Figures 5 and 6, we present additional DQN results for Layouts 1, 2, and 3 using continuous inputs. The models in Figure 5 use continuous (x, y) coordinates as input, while those in Figure 6 use the egocentric distance from walls in each cardinal direction. Figures 7 and 8 depict the configuration of the four layouts. Figures 9 and 10 contain the learning curves for Layouts 1 and 2 for all input modalities, using k-means clustering for discretization.
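For reference, the k-means discretization of continuous states used here can be sketched as a plain Lloyd's-algorithm implementation; the deterministic farthest-point initialization below is an illustrative choice, not necessarily the one used in the experiments:

```python
import numpy as np

def kmeans_discretize(states, k, iters=50):
    """Map continuous RNN states to k discrete cluster labels
    with Lloyd's algorithm; returns (labels, centroids)."""
    states = np.asarray(states, dtype=float)
    # Deterministic farthest-point initialization.
    centroids = [states[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(states - c, axis=1) for c in centroids],
                      axis=0)
        centroids.append(states[dist.argmax()])
    centroids = np.stack(centroids)
    for _ in range(iters):
        # Assign each state to its nearest centroid.
        d = np.linalg.norm(states[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = states[labels == j].mean(axis=0)
    return labels, centroids
```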
Appendix C Additional Experiments
C.1 Toy Processes
We apply our method to input-output stochastic processes that can be modeled as HMMs with finite memory length and alphabet size. We construct each process so that the probability of the next output depends only on a fixed-length window of the most recent symbols, sampled from a multinomial distribution.
We introduce a binary action space in which the action influences the emission probabilities. The goal is to maximize the occurrence of a target symbol: the environment returns a reward of 1 at a given time step if the target symbol is observed, and 0 otherwise. Each episode lasts 100 time steps. We again train on sequence data generated with a random policy, pass the actions into the RNN through a linear layer, and concatenate them with the observation embedding.
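A minimal sketch of such an environment is below; the memory effect, action effect, emission probabilities, and target symbol are placeholder choices, since the paper's exact process parameters are not reproduced here:

```python
import random

class ToyProcessEnv:
    """Illustrative input-output process: the next observation's
    multinomial distribution depends on the last `memory` observations
    and on the agent's binary action; reward 1 when the target symbol
    is emitted. All parameters are placeholder assumptions."""

    def __init__(self, memory=1, alphabet=2, target=1, seed=0):
        self.memory, self.alphabet, self.target = memory, alphabet, target
        self.rng = random.Random(seed)
        self.horizon = 100  # each episode lasts 100 time steps

    def reset(self):
        self.t = 0
        self.history = [0] * self.memory
        return tuple(self.history)

    def step(self, action):
        # The last symbol and the action both bias the emission weights.
        base = [1.0] * self.alphabet
        base[self.history[-1]] += 2.0        # memory effect (assumed)
        base[action % self.alphabet] += 1.0  # action effect (assumed)
        total = sum(base)
        obs = self.rng.choices(range(self.alphabet),
                               [w / total for w in base])[0]
        self.history = (self.history + [obs])[-self.memory:]
        self.t += 1
        reward = 1 if obs == self.target else 0
        return obs, reward, self.t >= self.horizon
```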
The multivariate Gaussian measurements are constructed by taking a vector composed of blocks, one per alphabet symbol, where the mean of each block depends on whether it matches the emitted symbol. For MNIST, we associate each process symbol with an image category; the measurements are generated by first sampling a realization from the process and then sampling a random image from the corresponding MNIST category. For the case of rendering high-dimensional measurements with images, we use the full image dataset for sampling, with the train and validation sets used for training and the test set for evaluation; this demonstrates generalization in image classification with no explicit training on class labels.
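The block construction can be sketched as follows, under the assumption that the block matching the emitted symbol gets mean +mu and the remaining blocks mean -mu; the block size and means are illustrative, as the paper's exact values are elided:

```python
import numpy as np

def gaussian_measurement(symbol, alphabet_size, block_size=4,
                         mu=1.0, sigma=0.5, rng=None):
    """Render a discrete symbol as a noisy vector of `alphabet_size`
    blocks of `block_size` entries each: the block indexed by the
    symbol has mean +mu, all other blocks have mean -mu (assumed)."""
    rng = rng or np.random.default_rng(0)
    means = -mu * np.ones(alphabet_size * block_size)
    means[symbol * block_size:(symbol + 1) * block_size] = mu
    return means + sigma * rng.normal(size=means.shape)
```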
Table 2: Mean reward ± standard deviation for DQN trained on different input representations, under the Discrete, Gaussian, and MNIST measurement settings (two configurations each).

Method                          Discrete                      Gaussian                      MNIST
DQN on single observations      50.1 ± 1.01    25.1 ± 1.12    50.6 ± 1.26    25.0 ± 1.35    50.1 ± 1.80    25.0 ± 1.27
DQN on stacked sequences        73.7 ± 0.73    55.5 ± 1.62    73.3 ± 1.20    54.9 ± 1.71    72.3 ± 1.33    54.2 ± 1.39
DQN on continuous statistics    72.7 ± 1.04    54.6 ± 1.61    73.6 ± 0.82    55.3 ± 1.91    72.8 ± 1.23    50.8 ± 1.80
DQN on discrete causal states   72.6 ± 4.10    49.2 ± 3.29    73.7 ± 2.18    52.7 ± 3.07    72.6 ± 2.50    43.2 ± 3.02
Results: We compare the learned representations with end-to-end DQN [30] trained on fixed-length sequences of action-observation pairs, on single observations, on the continuous sufficient statistics, and on the causal-state refinements. The difficulty level of these environments can be increased by growing the alphabet size and the memory length. Table 2 reports the results. As can be appreciated from the first row, it is impossible to obtain more than random reward without using memory; the task can be successfully solved by stacking frames as input, as indicated by the second row, which makes the task fully observable. The third and fourth rows show the matching performance of our task-agnostic representation for the continuous and the discrete states, respectively. For the Gaussian and MNIST results, we found that vector quantization worked better than k-means, so those results are reported with VQ.
C.2 Doom at Additional Frequencies
We vary the action frequency for the Doom environment and show results in Fig. 11. DRQN's performance degrades slightly at first and drops noticeably at longer frequencies, whereas our method (Causal States) exhibits consistently good performance at all frequencies.
C.3 Planning
With a minimal sufficient unifilar model we can perform efficient planning by representing it as a labeled directed graph, with causal states as nodes and action-observation conditional transitions as edges. Because of the unifilar property, once an agent has perfect knowledge of the current causal state, the information in the future action-observation process is sufficient to uniquely determine the future causal states. This property can be exploited by multi-step planning algorithms, which need not keep track of potentially stochastic transitions between the underlying states, enabling a variety of methods that are otherwise applicable only to MDPs, such as Dijkstra's algorithm. We plan over our learned discrete representation by building a graph whose nodes are the learned clusters and whose edges are the observed action-conditioned transitions between them. We obtain the optimal policy for Layouts 1 and 2, obtaining respectively 0.5 and 0.3 reward; we derive and save high-reward states seen during the rollouts as goal states. One can then map the initial and final observations to nodes in the graph and run Dijkstra's algorithm to find the shortest path, as proposed in [42]. Unlike value iteration, this requires no learning and no resampling of the environment, since planning makes use of the graph edges, whereas Q-learning uses only the nodes.
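The planning step over the causal-state graph can be sketched as a standard Dijkstra search; with unit edge costs as below it coincides with breadth-first shortest path, and by unifilarity the returned action sequence can be executed open-loop. All names are illustrative:

```python
import heapq

def shortest_plan(edges, start, goal):
    """Dijkstra over a causal-state graph. `edges` maps each state to a
    list of (action, next_state) transitions with unit cost. Because the
    model is unifilar, following the returned action sequence from
    `start` deterministically reaches `goal`."""
    frontier = [(0, start, [])]  # (cost so far, state, action plan)
    seen = set()
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return plan
        if state in seen:
            continue
        seen.add(state)
        for action, nxt in edges.get(state, []):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + 1, nxt, plan + [action]))
    return None  # goal unreachable from start
```

Mapping the initial observation to `start` and a saved high-reward state to `goal` then yields the shortest action sequence, with no further interaction with the environment.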