Learning Causal State Representations of Partially Observable Environments

by   Amy Zhang, et al.

Intelligent agents can cope with sensory-rich environments by learning task-agnostic state abstractions. In this paper, we propose mechanisms to approximate causal states, which optimally compress the joint history of actions and observations in partially-observable Markov decision processes. Our proposed algorithm extracts causal state representations from RNNs that are trained to predict subsequent observations given the history. We demonstrate that these learned task-agnostic state abstractions can be used to efficiently learn policies for reinforcement learning problems with rich observation spaces. We evaluate agents using multiple partially observable navigation tasks with both discrete (GridWorld) and continuous (VizDoom, ALE) observation processes that cannot be solved by traditional memory-limited methods. Our experiments demonstrate systematic improvement of the DQN and tabular models using approximate causal state representations with respect to recurrent-DQN baselines trained with raw inputs.




1 Introduction

Decision-making and control often require that an agent interact with partially-observed environments whose causal mechanisms are unknown. To enable efficient planning, one might hope to construct latent representations of histories of (action, observation) tuples. At present, this practice is dominated by two points of view: (i) a standard approach to partially observable Markov decision processes (POMDPs), where one starts from a generative model of the latent transitions and emission dynamics, using observations to infer beliefs over the unobserved states [1, 5]; and (ii) predictive state representations (PSRs), where one starts from the history of the process and constructs states through modeling of the trajectories of observations [27, 33]. Both directions have drawbacks: the belief state approach requires access to a model that is equivalent to the real generator, while PSRs are constrained by their high-dimensional nature that often makes planning unfeasible.

We propose a principled approach for learning state representations that generalizes PSRs to non-linear predictive models and allows for a formal comparison between generator- and history-based state abstractions. We exploit the idea of causal states [10, 31, 32, 8], i.e., the coarsest partition of histories into classes that are maximally predictive of the future. Through this mapping from histories to clusters, causal states constitute a discrete causal graph of the observation process.

Our method exploits the existence of minimal discrete representations to derive approximately optimal representations from recurrent neural networks (RNNs) trained to model the environment. At the core lies the idea of discretizing the high-dimensional continuous states generated by RNNs into a finite set of clusters. When the clusters are used as input for predictive models, they achieve the same predictive power as the original RNN. Even if this representation is not minimal, each cluster is guaranteed to map to a single causal state.


We use this approach to extend, theoretically contextualize, and integrate two recent algorithms [25, 7] that learn tabular RL policies from discrete representations extracted with neural networks. We evaluate our algorithm for approximate causal states reconstruction on a modification of the original VizDoom environment used in [25] as well as multiple GridWorld navigation tasks with partially observable states. Our DQN models and their tabularized counterparts systematically outperform (by both return and stability) their recurrent-DQN baselines.

2 State Representation for Decision Processes

Consider the nonlinear stochastic process emerging from the interaction between an agent’s policy, which chooses discrete actions taking values from the alphabet , and the environmental response taking values from the alphabet . Let be the joint observation-action variable with realizations with . For a fixed random variable , we indicate its non-inclusive past with and its inclusive future with , dropping the subscript from the notation for convenience when the context is clear. The set of all bi-infinite sequences with alphabet is indicated as . By and , we indicate the finite sequences and respectively. The future dynamics depend jointly on the stationary policy that maps the joint histories into future actions and the environment channel, that maps the joint histories and future actions into future observations . An agent’s preferences over the future dynamics are defined via its reward function . The optimal policy of the agent maximizes the expected reward . We restrict our attention to environment channels and agent policies that generate a stochastic process that is ergodic stationary, i.e. processes for which the probability of every bi-sequence of finite length is time-invariant, which can be reliably estimated from empirical data.

The formalism of POMDPs supposes a hidden Markov process , with realizations where is discrete, and observations emitted through the action-conditional probability  [23]. This causal relationship between the observed process and the hidden states implies that the mutual information between the generator state and current action (jointly) and the next observation is at least as great as that achieved by any competing representation of the history . This next-step sufficiency is extended to the infinite due to the recursive nature of generator and belief state computations .

2.1 Belief and Predictive State Representations

A typical approach to planning in POMDPs assumes the agent has access to and and uses them to construct the belief states from the finite realizations . Belief states are computed recursively using Bayes' formula from an initial belief and give rise to the belief process . The belief process is a sufficient statistic of the generator state when , and is said to be asymptotically synchronized when , where is the conditional-entropy function. When , the generator states contain more information about the future observable than the complete history of observations , implying the absence of asymptotic synchronization [9] and that belief states are only sufficient statistics of the history such that .
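The recursive Bayes update described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the table layouts `T[a][s][s']` and `O[a][s'][o]`, the function name `belief_update`, and the two-state model are all assumptions made for the example.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step: b'(s') is proportional to
    O[a][s'][o] * sum_s T[a][s][s'] * b(s)."""
    n = len(belief)
    unnormalized = []
    for s_next in range(n):
        # Predict: push the current belief through the transition model.
        predicted = sum(T[action][s][s_next] * belief[s] for s in range(n))
        # Correct: weight by the likelihood of the received observation.
        unnormalized.append(O[action][s_next][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


# Hypothetical two-state, one-action model (values chosen for illustration).
T = [[[0.9, 0.1],
      [0.2, 0.8]]]   # T[a][s][s']
O = [[[0.8, 0.2],
      [0.3, 0.7]]]   # O[a][s'][o]

b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=0, T=T, O=O)
```

The update needs the generator's transition and emission tables, which is exactly the model access that the belief-state approach assumes.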

The PSR approach relaxes the assumption of having any knowledge about the underlying generator and constructs the representation using the outputs of the predictive model of the next observations, conditioned on the next actions (the test) sampled from the set of feasible -length action sequences . By , we indicate the collections of predictive models for all . Each model is a sufficient statistic of the -length future observations , and the complete collection is a sufficient statistic of the infinite future observations , i.e. . Typically, PSRs are constructed for decision processes using a linear model that enables approximate solutions by assuming that the infinite-dimensional system dynamics matrix has finite rank [33].

2.2 Causal States Representations

We propose to use the causal states representation that expands PSRs to the general case of non-linear predictive models and allows the definition of a formal equivalence between the eventual generator states and the causal states reconstructed from history. As in the PSR framework, causal states depend on a predictive model of the observation process.

Definition 1

[10, 31] The causal states of a stochastic process are partitions of the space of feasible pasts induced by the causal equivalence :


Which implies:


where is the variable denoting the causal state at time , overwriting the definition in Sec. 2 of the unknown ground truth state. Since all histories belonging to the same equivalence class predict the same (conditional) future, the corresponding causal state can be used to fully summarize the information content of those histories. It can be demonstrated [31] that the partition induced by is the coarsest possible and generates the minimal sufficient representation across the model class. Sampling of new symbols in the sequence induces the creation of new histories and consequently new causal states. Because of this mapping from histories to states, the resulting hidden Markov model is unifilar.


Definition 2

[31] A unifilar hidden Markov model is an HMM whose state transition probability is deterministic when conditioned on the output symbol, i.e. .
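The unifilar property in Definition 2 can be checked mechanically on a symbol-labeled HMM. The sketch below assumes a hypothetical representation in which `labeled_T[x][s][s2]` is the probability of emitting symbol `x` while moving from state `s` to state `s2`; the two example machines are illustrative and are not taken from the paper.

```python
def is_unifilar(labeled_T, tol=1e-12):
    """Return True if every (state, symbol) pair has at most one successor.

    labeled_T[x][s][s2] = P(emit x, move to s2 | current state s).
    """
    for matrix in labeled_T.values():
        for row in matrix:
            successors = [p for p in row if p > tol]
            if len(successors) > 1:
                return False
    return True


# Golden-mean process: state 0 emits 0 (stay) or 1 (go to state 1),
# state 1 always emits 0 and returns to state 0.
golden_mean = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}

# Same alphabet, but emitting 0 from state 0 may lead to either state.
non_unifilar = {
    0: [[0.25, 0.25], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}
```

In the second machine an observer who sees a 0 emitted from state 0 cannot tell which state follows, which is the ambiguity that unifilarity rules out.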

With explicit reference to the joint input-output history, the state transition dynamics are governed by input-conditional transition matrices with elements:


Since the causal states are defined over histories of joint symbols, the causal state model is unifilar with respect to the joint variable , i.e. the transitions between states are deterministic once the next action and observable have been sampled or . The unifilar property implies that the recurrent dynamics of the causal states are fully specified by the state-action-conditional symbol emission probability and the action-symbol-conditional causal state emission probability . As a consequence, knowledge of the current causal states and of the future action-observation sequence induces a deterministic sequence of future causal states , .
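As a small illustration of the deterministic state sequence described above, the following sketch tracks causal states through a lookup table keyed by the current state and the joint (action, observation) symbol. The state names, symbols, and the `track_causal_states` helper are hypothetical and only loosely inspired by a key-collection task.

```python
def track_causal_states(transition, start, joint_symbols):
    """Follow a unifilar transition table along a joint symbol sequence.

    transition maps (state, (action, observation)) to the unique next state.
    """
    states = [start]
    for symbol in joint_symbols:
        key = (states[-1], symbol)
        if key not in transition:
            raise KeyError(f"no transition defined for {key}")
        states.append(transition[key])
    return states


# Hypothetical two-state machine: the state records whether a key was seen.
transition = {
    ("no_key", ("move", "empty")): "no_key",
    ("no_key", ("move", "key")): "has_key",
    ("has_key", ("move", "empty")): "has_key",
    ("has_key", ("move", "key")): "has_key",
}

trajectory = [("move", "empty"), ("move", "key"), ("move", "empty")]
visited = track_causal_states(transition, "no_key", trajectory)
```

No probabilities appear in the update because, once the joint symbol is known, unifilarity leaves a single admissible successor.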

2.3 Stochastic Processes with Finite Causal States

When the joint process admits a finite causal state representation, it is called a finitary stochastic process, a property with multiple theoretical implications. In discrete stochastic processes with finite actions, finite-symbol alphabets, and finite memory of length , the causal states are always finite, with a worst-case scenario in which each sub-sequence of length belongs to a distinct causal state, forming a -length Markov model [31]. When the causal states are finite, they are also unique up to isomorphisms [31] and always generate a stationary stochastic process. If the underlying generator is non-unifilar, the causal states have the same information content as the potentially non-synchronizing belief states of the generator, and the belief states defined over the causal states always asymptotically synchronize to the actual causal states [9].

We focus on partially observable environments with discrete causal states and either continuous or discrete observations. For continuous observations it is not possible to derive generic conditions that imply discrete or finite causal states. Therefore, the existence of discrete latent states has to be directly assumed or derived from alternative assumptions like the existence of finite latent discrete variables underlying each continuous observation. When a memory-less map from continuous observation to latent discrete variables exists, the causal states of the revealed continuous variable process coincide with those defined over the underlying discrete variables.

3 Methods

In the previous sections, we introduced a class of stochastic processes with discrete or continuous outputs that are optimally compressed by a finite-state hidden-Markov representation, called the causal state model of the process.

We now propose a new approach to approximately reconstruct these causal states from empirical data.

3.1 Empirical Estimation of Causal States

Existing methods either directly partition past sequences of length into a finite number of causal states via conditional hypothesis tests [32] or use Bayesian inference over a set of existing candidate states [36]. Either method can be adapted to model a joint process and consequently obtain the next-step conditional output by marginalizing out the action , but neither extends to the real-valued measurement case described above without strong assumptions on the shape of the conditional density function.
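To make the history-partitioning idea concrete, here is a simplified sketch for a discrete, action-free process. It swaps the conditional hypothesis tests of [32] for a fixed tolerance on the estimated next-symbol probability, which is an assumption made purely for brevity; the golden-mean process, the function names, and the tolerance value are likewise illustrative.

```python
import random
from collections import Counter, defaultdict


def sample_golden_mean(n, seed=0):
    """Binary sequence with no two consecutive 1s (golden-mean process)."""
    rng = random.Random(seed)
    seq, prev = [], 0
    for _ in range(n):
        symbol = 0 if prev == 1 else rng.randint(0, 1)
        seq.append(symbol)
        prev = symbol
    return seq


def estimate_predictive(seq, L):
    """Estimate P(next symbol = 1 | last L symbols) for each observed history."""
    counts = defaultdict(Counter)
    for t in range(L, len(seq)):
        counts[tuple(seq[t - L:t])][seq[t]] += 1
    return {h: c[1] / sum(c.values()) for h, c in counts.items()}


def group_histories(p_one, tol=0.1):
    """Merge histories whose estimated predictive probabilities agree within tol."""
    clusters = []  # list of (representative probability, member histories)
    for history in sorted(p_one):
        for rep, members in clusters:
            if abs(p_one[history] - rep) <= tol:
                members.append(history)
                break
        else:
            clusters.append((p_one[history], [history]))
    return [members for _, members in clusters]


seq = sample_golden_mean(20000)
clusters = group_histories(estimate_predictive(seq, L=2))
```

For this process the next-symbol distribution is enough to separate the two causal states; in general the comparison has to cover longer futures, and a proper statistical test should replace the fixed tolerance.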

In this work, we exploit the property that a minimal sufficient statistic can be computed from any other sufficient statistic. We obtain the (approximately) minimal representations of the underlying process by discretizing a sufficient model of the measurement process . This approach exploits learning a hierarchy of optimal predictive models with progressively stricter bounds on their representations’ dimensionality. We start with infinite dimensional continuous representations learned with a deep neural network, and then partition the continuous representations into a finite set of clusters.

3.2 Learning Sufficient Statistics of History with Recurrent Networks

Recurrent neural networks (RNNs) are unifilar hidden Markov models with continuous states, where the transitions and state output probabilities are parameterized by differentiable functions. We use them to obtain recursively-computable high-dimensional sufficient statistics of the action-measurement joint process. This representation is learned via a recurrent encoder , and a next-step prediction network . The overall neural-network architecture resembles world models [14], except we auto-encode the observation only with image inputs, and only use it for next-step prediction in the other cases. Furthermore, we use an explicit embedding layer for that is concatenated with the output of the recurrent encoder before predicting .

We note that when is maximally predictive of the subsequent observations (the future ), constitutes a sufficient statistic of the latent states . In practice, we estimate the continuous representations using the empirical realizations (footnote 2: with a small abuse of notation, we use the same convention of , where we shift measurements by one time step such that the joint process has elements ()) to learn a neural network that approximates end to end the maps and by minimizing the temporal loss through the following optimization problems:


After solving Eqs. 4 we can use the neural networks parameterized by the optimal parameters to derive sufficient continuous representations and create discrete states that are a refinement of the causal states.
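The forward pass and next-step loss of such a recurrent world model can be sketched in a few lines of NumPy. This is only an illustration under several assumptions: a plain tanh recurrence stands in for the GRU, observations and actions are discrete indices, all sizes and parameter names are hypothetical, and the gradient-based optimization itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS, N_ACT, EMB, HID = 4, 2, 8, 16  # hypothetical sizes

params = {
    "obs_emb": rng.normal(0, 0.1, (N_OBS, EMB)),
    "act_emb": rng.normal(0, 0.1, (N_ACT, EMB)),
    "W_in": rng.normal(0, 0.1, (2 * EMB, HID)),
    "W_rec": rng.normal(0, 0.1, (HID, HID)),
    "W_out": rng.normal(0, 0.1, (HID + EMB, N_OBS)),
}


def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def temporal_loss(params, observations, actions):
    """Mean cross-entropy of next-observation predictions.

    Assumed convention: actions[t] is taken after observations[t]
    and leads to observations[t + 1].
    """
    h = np.zeros(HID)
    prev_act = np.zeros(EMB)  # no action precedes the first observation
    total = 0.0
    for t, action in enumerate(actions):
        # Recurrent encoder: update the hidden state from (o_t, a_{t-1}).
        x = np.concatenate([params["obs_emb"][observations[t]], prev_act])
        h = np.tanh(x @ params["W_in"] + h @ params["W_rec"])
        # Predictor: combine the hidden state with the embedding of a_t.
        act = params["act_emb"][action]
        logits = np.concatenate([h, act]) @ params["W_out"]
        total -= log_softmax(logits)[observations[t + 1]]
        prev_act = act
    return total / len(actions)


observations = [0, 1, 3, 2, 0]
actions = [1, 0, 0, 1]
loss = temporal_loss(params, observations, actions)
```

With small random weights the predictions are close to uniform, so the loss starts near log(4); training would drive it down by adjusting `params` so that the hidden state carries the predictive information in the history.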

Figure 1: Graphical model generating the joint action-measurement stochastic process. Black-arrows indicate causal relationships between random variables and red-arrows indicate the predictive relationship between the combinations of action , internal state () and the next measurement . Circular boxes indicate continuous variables.

3.3 Discretization of the RNN Hidden States

Together with the unifilar and Markovian nature of transitions in RNNs, the sufficiency of implies that there exists a function that allows us to describe the causal states as a partition of the learned latent state [31, 9].

We set up a second optimization problem using the trained neural network and the empirical realizations of the process to estimate the discretizer with and the new prediction network that maps the estimated discrete states into the next observable. We match the predictive behavior between the old network and the new networks that use discretized states and corresponding prediction function by minimizing the knowledge distillation [19] loss:


Minimizing Eq. 5 guarantees a sufficient discrete representation. To summarize, we first minimize to obtain a neural model able to generate continuous sufficient statistics of the future observables of the process and subsequently minimize to obtain a sufficient representation of the dynamical system that is a refinement of the original causal states. Fig. 1 shows the stochastic process representing the environment and our learned states and and their interactions.
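A minimal sketch of the two ingredients of this step follows: a quantizer that maps continuous hidden units to discrete codes, and a distillation loss that scores a student's predictions against a teacher's soft outputs. Rounding `tanh` activations to {-1, 0, 1} is an assumed simplification of a ternary bottleneck, and the function names and example logits are hypothetical.

```python
import math


def ternary_quantize(hidden):
    """Map each hidden unit to -1, 0, or 1 by rounding its tanh activation."""
    return [int(round(math.tanh(v))) for v in hidden]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy of the student's predictions under the teacher's soft targets."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return -sum(p * math.log(q) for p, q in zip(teacher, student))


code = ternary_quantize([-2.0, 0.1, 3.0])

teacher_logits = [2.0, 0.0, -1.0]
matched = distillation_loss(teacher_logits, teacher_logits)
mismatched = distillation_loss(teacher_logits, [0.0, 0.0, 0.0])
```

The loss is smallest when the student reproduces the teacher's distribution. The rounding step has zero gradient almost everywhere, so training through it needs a workaround that is not shown here.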

3.4 Implementation Details

For the GridWorlds and Toy-Processes experiments, the base world model architecture is composed of a three-layer perceptron (MLP) encoding the observation and a single-layer linear embedding for the action . The outputs of the respective layers are concatenated and fed to a Gated Recurrent Unit (GRU) [6], and the output of the recurrent network is concatenated with the embedding of and fed to a second MLP that outputs predictions for . All the embeddings have 64 neurons except in layout 4, where we use 256-dimensional embeddings. The discretization network is composed of a Quantized-Bottleneck-Network [25] with ternary tangent neurons that auto-encodes the continuous representation generating the discrete variables , and an MLP decoder that uses the estimated discrete states for predicting the next observation .

Both networks are trained with the RMSprop algorithm using cross-entropy loss for discrete observations and reconstruction loss for the continuous setting. The world model is trained through supervised learning of the temporal loss, while the discretization network is trained via knowledge distillation using the soft outputs of the GRU decoder as targets. We ran downstream evaluation of our learned representation with value iteration and compared it with baselines. To approximate the value function we use a -layer fully-connected DQN architecture separated by ReLU with a 64-dimensional hidden layer. The R-DQN baselines use the same architectures but are trained end-to-end via reward maximization. We also present earlier results where the discretization step is implemented with k-means clustering and policies are learned with traditional tabular Q-learning.
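Once histories are summarized by discrete codes, a tabular learner can use each code directly as a table key. The following sketch shows a standard Q-learning update over such keys; the ternary codes, step size, and discount factor are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, done,
                      n_actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new estimate."""
    if done:
        best_next = 0.0
    else:
        best_next = max(Q[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
s, s_next = (-1, 0, 1), (1, 1, 0)  # hypothetical discrete codes

v1 = q_learning_update(Q, s, 0, 1.0, s_next, False, n_actions=2)
v2 = q_learning_update(Q, s, 0, 1.0, s_next, False, n_actions=2)
```

Because the codes are hashable tuples, no further encoding is needed, and the table only grows with the number of distinct codes actually visited.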

For the VizDoom and ALE experiments we built upon the base VaST architecture [7], which uses binary Bernoulli variables and Gumbel-Softmax [22] for its discrete bottleneck. The original architecture for MDPs is composed of a variational encoder that embeds the observation into binary variables, which are combined with the action to predict the next latent discrete state. We extend it with a recurrent encoder that gives the agent access to the history of observations . Since we build on the original implementation, we use prioritized sweeping with tabular Q-learning for the downstream policy. The world model is initialized from 10k random rollouts and is trained using the reconstruction loss.

4 Experiments

We employ three partially observable environments to learn approximate causal states through self-supervised learning and use these representations as input for reinforcement learning tasks defined over the domains.

4.1 Gridworlds

| Method | Layout 1, low-disc | Layout 1, low-cont | Layout 1, ego-cont | Layout 2, low-disc | Layout 2, low-cont | Layout 2, ego-cont |
|---|---|---|---|---|---|---|
| Tab. | 0.43, 0. | 0.42, 0. | 0.437, 0. | 0.01, 0. | 0.09, 0. | 0.036, 0. |
| DQN | 0.50, 0.015 | 0.42, 0.10 | 0.49, 0.032 | -0.17, 0.76 | 0.026, 0.26 | 0.12, 0.064 |
| DQN | 0.5, 0. | 0.5, 0. | 0.49, 0.029 | 0.30, 0. | 0.30, 0. | 0.30, 0.01 |
| DQN | -9.46, 0.20 | -8.55, 0.79 | -8.64, 1.01 | -9.48, 0.14 | -8.59, 0.78 | -9.83, 0.25 |
| DQN | -0.91, 3.0 | 0.49, 0.03 | -0.34, 2.31 | 0.23, 0.16 | 0.084, 0.18 | -0.07, 0.24 |
| DRQN | -9.75, 0.22 | -6.95, 3.66 | -9.27, 0.68 | -5.63, 3.74 | -3.72, 4.31 | -9.97, 0.06 |
| Tab. | -9.40, 0. | - | - | -9.11, 0. | - | - |
| Tab. | 0.45, 0. | 0.42, 0. | 0.43, 0. | 0.23, 0. | 0.23, 0. | 0.23, 0. |
| DQN | 0.44, 0.019 | 0.44, 0.027 | 0.44, 0.023 | 0.30, 0.01 | -0.76, 3.07 | 0.23, 0.10 |

Table 1: Results for Gridworlds. Reward obtained with tabular Q-learning, DQN, and DRQN with . Models trained on 1000 episodes and evaluated on 100. Numbers are mean, standard deviation across 10 random seeds. First section is our method using k-means clustering for discretization, second is baselines on current observation, history of observations, and . Final section is using ground truth states.

We create partially observable gridworld environments where the task is for the agent to first obtain the key, then pass through the door to obtain the final reward. The final state is unseen (i.e. the agent cannot pass through the door to reach it) if the agent does not have the key. The agent only knows it has the key if it remembers entering the state with the key, so without infinite memory this task is partially observable. At each time step the agent receives -0.1 reward, 0.5 reward for picking up the key, and a final reward of 1 for passing through the door.
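A toy version of such a key-door task can be written as a one-dimensional corridor. The sketch below is a hypothetical stand-in, not one of the paper's layouts: the corridor length, the start, key, and door positions, the class name `KeyDoorCorridor`, and the choice to add event rewards on top of the per-step penalty are all assumptions.

```python
class KeyDoorCorridor:
    """Hypothetical 1-D key-door corridor; the agent observes only its position."""

    LEFT, RIGHT = 0, 1

    def __init__(self, length=5, start=2, key_pos=0, door_pos=4):
        self.length, self.start = length, start
        self.key_pos, self.door_pos = key_pos, door_pos
        self.reset()

    def reset(self):
        self.pos, self.has_key = self.start, False
        return self.pos

    def step(self, action):
        delta = -1 if action == self.LEFT else 1
        target = min(max(self.pos + delta, 0), self.length - 1)
        if target == self.door_pos and not self.has_key:
            target = self.pos  # the door stays closed without the key
        self.pos = target
        reward, done = -0.1, False  # per-step penalty
        if self.pos == self.key_pos and not self.has_key:
            self.has_key = True
            reward += 0.5  # key pickup bonus
        if self.pos == self.door_pos and self.has_key:
            reward += 1.0  # final reward for passing through the door
            done = True
        return self.pos, reward, done


env = KeyDoorCorridor()
env.reset()
total, done = 0.0, False
for action in [0, 0, 1, 1, 1, 1]:  # fetch the key, then walk to the door
    obs, reward, done = env.step(action)
    total += reward
```

The observation returned at a given cell is identical before and after the key is collected, so a memoryless agent cannot tell whether heading for the door will succeed.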

We conduct experiments with four layouts with an increasing number of states (see figures in the supplementary materials). The first layout is a 1-dimensional corridor while the other three are two-dimensional mazes. The minimal memory requirement for solving the environment is given by the shortest path from the key to the final destination.

We use three types of observation processes for the agent that give different priors to the agent’s behavior but share the same underlying causal states. low-discrete is the discrete observation of the agent’s absolute position in the grid. low-cont is the continuous observation of the agent’s absolute position in the grid. ego-cont is ego-centric continuous observation (up, down, left, right) of the distance from walls in the 4 cardinal directions.

DQN on is able to achieve the optimal policy across all 10 random seeds with very low or zero standard deviation, showing the stability of our learned . We expect to perform as well as or better than , as is distilled from and therefore contains the same information. Using only current observation learns using a recurrent DQN (DRQN) [16].

Figure 2: Training curves for DQN policies - discretization with gradient descent and bottleneck networks. (left) Layout 1 using discrete inputs. (center) Layout 2 using discrete inputs. (right) Layout 3 using discrete inputs. Averages over 10 runs with different random seeds with two standard deviations shaded. Y-axis is mean reward per step. Green is the World model, Blue is causal states, red is DRQN.

4.2 Doom

We modify the t-maze VizDoom  [24] environment of [7] to make it partially observable. We randomize the goal location between the two ending corners and signal its location with a stochastic signal in the observation space. The agent must remember where the goal is in order to navigate to it. We convey the signal to the agent through a fourth channel (after RGB) that intermittently contains information about where the goal is. The frequency at which the information is displayed is a tunable factor for these experiments. Example trajectories to different goals that the learned agent takes are shown in Fig. 3 on the left.

Results. Fig. 3 (right) shows the speedup in learning obtained by explicitly learning to cluster sequences of observations into causal states. Additional results with varying signal frequencies are in the Supplementary Material.

Figure 3: (left) Example trajectories in VizDoom. First goal (above), second goal (below). (right) Doom T-Maze POMDP: Averaged over 10 runs with different random seeds with one standard deviation shaded. Y-axis is mean reward per step. Blue is causal states, red is DRQN.

4.3 Atari Pong

Figure 4: Averaged over 10 runs with different random seeds with one standard deviation shaded.

We evaluate our model on the Arcade Learning Environment’s Pong [3], an environment where causal states are less intuitive. Our algorithm learns a set of discrete causal states suitable for learning successful policies (Fig. 4). We use the same preprocessing of observations as in [29], except that the agent receives a single frame as observation at each time step. The environment is now partially observable, as the agent must learn to retain information about the velocity of the paddles and ball in the hidden state of the RNN. We see a decrease in performance with DRQN, due to instability in training.

4.4 Toy Stochastic-Processes

We apply our method to input-output stochastic processes that can be modeled as HMMs with memory length and alphabet size . We complicate the continuous observation process by mapping the environment’s outputs into a) multivariate Gaussians and b) images sampled from the MNIST dataset. The goal is to maximize the occurrence of , . The environment returns a reward of 1 at time step if , and 0 otherwise. Each episode lasts 100 time steps. Results are in the appendix.

5 Related Literature

The relationship between PSRs and causal states has been previously suggested in the computational mechanics literature [31, 32, 2]. [20] also derive parallels between PSR, POMDP, and automata through the construction of equivalence classes that group states with common action-conditional future observations. Causal states, and the related information-theoretic notions of complexity, minimality, and sufficiency, are used to derive task-agnostic policies with an intrinsic exploration-exploitation trade-off in [34, 35].

In [4] the authors learn PSRs in Reproducing Kernel Hilbert Space extending the approach to continuous, potentially infinite, action-observation processes. More similar to our latent discrete states with continuous observable, [12] model spatio-temporal processes as being generated by a finite set of discrete causal states in the form of light-cones.

Other methods for learning deep representations for reinforcement learning in POMDPs have been recently proposed, starting with adding recurrency to DQN [16] to integrate the history in the estimation of the Q-value as opposed to using only the current observation. However, this method stops short of ensuring sufficiency for next-step prediction, as it learns a task-specific representation. [21, 38] use deep variational methods to learn a probability distribution over states, i.e. belief states, and use the belief states for policy optimization with actor-critic [21] and planning [38]. [13] also use neural methods to learn belief states with next-step prediction. [17, 18] learn PSRs with RNNs and spectral methods and use policy gradient with an alternating optimization method for the policy and the state representation to handle continuous action spaces. However, none of these explore the connection to causal states and compression via a discrete representation. [7] do learn discrete representations, but not in partially observable environments and with no link to PSRs. Instead, they propose the discrete representation solely for using tabular Q-learning with prioritized sweeping.

The idea of extracting the implicit knowledge of a neural network [37, 11, 28, 15] is not novel and is rooted in early attempts to merge traditional rule-based methods with machine learning. The most recent examples [41, 39, 40] are focused on the ability of character-level RNNs to implicitly learn grammars. The only application of these ideas to RL in partially observable environments that we are aware of is in [25], where deep recurrent policies [16] are quantized into Moore machines. The main difference between our approaches is that we reduce models of the environment, not of the policy, to ε-machines, which are edge-emitting Mealy machines. The connection between ε-machines and causal states is further discussed in the Supplementary Materials. Because the optimal policy will in general use a subset of the possible input sequences, the minimal sufficient representation of a policy is typically smaller than the causal states of the complete environment.

6 Conclusions

In this paper, we proposed a self-supervised method of learning a minimal sufficient statistic for next step prediction and articulated its connection to the causal states of a controlled process. Our experiments demonstrate the practical utility of this representation, both with value iteration for control and exhaustive planning, in solving stochastic and infinite-memory POMDP environments as well as -order MDP environments with high dimensional observations, matching the performance achieved with ground truth states.

We encountered multiple problems while training recurrent DQN models end to end, even when we know from our causal states results that the same architecture is able to learn sufficient representations for the task. We plan on extending the Distributed Experience Replay algorithm [26] with extra asynchronous workers dedicated respectively to the construction of continuous and discrete world models.


Part of this work was supported by the National Science Foundation (grant number CCF-1317433), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. A. Anandkumar is supported in part by Bren endowed chair, Darpa PAI, Raytheon, and Microsoft, Google and Adobe faculty fellowships. K. Azizzadenesheli is supported in part by NSF Career Award CCF-1254106 and AFOSR YIP FA9550-15-1-0221, work done while he was visiting Caltech. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.


Appendix A Causal States and ε-machines

In the computational mechanics literature, causal state models are usually called ε-machines and are formally defined as:

Definition 3

The ε-machine of a stochastic process is given by the tuple where is the discrete alphabet of causal states, the discrete alphabet of observation and is a set of observation conditional state-to-state transition matrices. [10, 31]

To summarize, the ε-machine of a stochastic process is the minimal unifilar hidden Markov model able to generate its empirical realizations. The hidden states of an ε-machine are called the causal states of the process, and correspond to partitions of the process history.

Appendix B Additional Visualizations for GridWorlds experiments

In Figures 5 and 6, we present additional DQN results for Layouts 1, 2 and 3 using continuous inputs. The models in Figure 5 use continuous x,y coordinates as input while those in Figure 6 use the egocentric distance from walls in each cardinal direction. Figures 7 and 8 depict the configuration of the four layouts. Figures 9 and 10 contain the learning curves for Layouts 1 and 2 for all input modalities using k-means clustering for discretization.

Figure 5: LOW-CONT: Training curves for DQN policies - discretization with gradient descent and bottleneck networks. (left) Layout 1 using x-y coordinates. (center) Layout 2 using x-y coordinates. (right) Layout 3 using x-y coordinates. Averaged over 10 runs with different random seeds with two standard deviations shaded. Y-axis is mean reward per step. Green is the World model, Blue is causal states, red is DRQN.
Figure 6: EGO-CONT: Training curves for DQN policies - discretization with gradient descent and bottleneck networks. (left) Layout 1 using egocentric distance from walls in each cardinal direction. (center) Layout 2 using egocentric distance from walls in each cardinal direction. (right) Layout 3 using egocentric distance from walls in each cardinal direction. Averaged over 10 runs with different random seeds with two standard deviations shaded. Y-axis is mean reward per step. Green is the World model, Blue is causal states, red is DRQN.
Figure 7: Visual representation of Layouts 1 and 2 used in the GridWorld experiments. The starting location is shown in blue, the key location in red, and the final goal in yellow. Brown cells are walls and grey cells are walkable.
Figure 8: Visual representation of Layouts 3 and 4 used in the GridWorld experiments. The starting location is shown in blue, the key location in red, and the final goal in yellow. Brown cells are walls and grey cells are walkable.
Figure 9: DQN training curves across 10 runs with different random seeds for all input types in Tab. 1 for Layout 1, with causal states estimated using k-means clustering. The two levels of shading represent 1 and 2 standard deviations from the mean.
Figure 10: DQN training curves across 10 runs with different random seeds for all input types in Tab. 1 for Layout 2, with causal states estimated using k-means clustering. The two levels of shading represent 1 and 2 standard deviations from the mean.

Appendix C Additional Experiments

C.1 Toy Processes

We apply our method to input-output stochastic processes that can be modeled as HMMs with memory length and alphabet size . We construct the process such that the probability of the next output depends only on the value of by sampling from the multinomial distribution: and for all .

We introduce a binary action space where

The goal is to maximize the occurrence of , . The environment returns a reward of 1 at time step if , and 0 otherwise. Each episode lasts 100 time steps. We again train on sequence data generated with a random policy; the actions are embedded with a linear layer and concatenated with the observation embedding before being passed into the RNN.

The multivariate Gaussian measurements are constructed by taking a vector composed of

blocks of size , the block has mean if or if . For MNIST we use a process alphabet of up to size , and associate each symbol with an image category. The measurements are generated by first sampling the realization from the process and then sampling a random image from the corresponding MNIST category.

When rendering high-dimensional measurements as images, we sample from the full image dataset, using the train and val sets for training and the test set for evaluation. This shows that the method generalizes in image classification without any explicit training on class labels.

Input                                     Discrete                  Gaussian                  MNIST
DQN on single observation                 50.1, 1.01   25.1, 1.12   50.6, 1.26   25.0, 1.35   50.1, 1.80   25.0, 1.27
DQN on stacked observations               73.7, 0.73   55.5, 1.62   73.3, 1.20   54.9, 1.71   72.3, 1.33   54.2, 1.39
DQN on continuous sufficient statistics   72.7, 1.04   54.6, 1.61   73.6, 0.82   55.3, 1.91   72.8, 1.23   50.8, 1.80
DQN on discrete causal states             72.6, 4.10   49.2, 3.29   73.7, 2.18   52.7, 3.07   72.6, 2.50   43.2, 3.02
Table 2: Results for the Toy Processes. Reward obtained with DQN initialized with 10 random seeds. Numbers in each cell correspond to mean and standard deviations. All models are trained on episodes and evaluated on . Results within a std of the best are bolded.

Results: We compare the learned representations with end-to-end DQN [30] trained on length sequences of actions-outputs observations , single observation , the continuous sufficient statistics , and the causal states refinements . The difficulty of these environments can be increased by enlarging the class size and the memory . Table 2 reports the results for . As the first row shows, it is impossible to obtain more than the random reward of without using memory, whereas the second row shows that the task can be solved by stacking frames as input, which makes it fully observable. The third and fourth rows show the matching performance of our task-agnostic representation for the continuous and the discrete , respectively. For the Gaussian and MNIST results, we found that vector quantization (VQ) worked better than k-means, so those results are reported with VQ.
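
Both discretization schemes mentioned above ultimately map a continuous representation to the index of its nearest prototype; they differ in how the prototypes are learned. The following is a minimal sketch of that shared assignment step only, assuming a squared Euclidean distance and a small hand-written `codebook`. The function name `quantize` and the codebook values are hypothetical, and this is not the training procedure used in the experiments.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Hypothetical 2-D codebook with three prototypes, for illustration only.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Each continuous vector is replaced by a discrete symbol (the prototype index).
discrete_state = quantize([0.9, 0.1], codebook)
```

The resulting index can then serve as a discrete state for a tabular or DQN policy.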

C.2 Doom at Additional Frequencies

We tune frequency for the Doom environment and show results in Fig. 11. DRQN learns slightly on and noticeably drops with longer frequencies, whereas our method (Causal States) exhibits consistently good performance at all frequencies.

Figure 11: Performance of Causal States and DRQN on frequencies .

C.3 Planning

With a minimal sufficient unifilar model we can perform efficient planning by representing it as a labeled directed graph , with causal states as nodes and action-observation conditional transitions as edges. Because of the unifilar property, once an agent has perfect knowledge of the current causal state, the information in the future action-observation process is sufficient to uniquely determine the future causal states. This property can be exploited by multi-step planning algorithms, which need not keep track of potential stochastic transitions between the underlying states, enabling a variety of methods that are otherwise applicable only to MDPs, such as Dijkstra’s algorithm. We plan over our learned discrete representation by building a graph where , , and for . We obtain the optimal policy for Layouts 1 and 2, with rewards of 0.5 and 0.3, respectively. We derive and save high-reward states seen during the rollouts as goal states. One can then map the initial and final observations to nodes in the graph and run Dijkstra’s algorithm to find the shortest path, as proposed in [42]. Unlike value iteration, this requires no learning and no re-sampling of the environment, because it makes use of the graph edges, whereas Q-learning uses only the nodes.
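
The planning step can be sketched as a shortest-path search over a graph whose nodes are discrete causal states. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: the graph `GRAPH`, its state names, the action and observation labels, and the unit edge costs are all hypothetical, and each edge is treated as a deterministic transition.

```python
import heapq

def shortest_plan(edges, start, goal):
    """Dijkstra's algorithm over a labeled directed graph.

    edges maps a state to a list of (action, observation, next_state, cost).
    Returns (total_cost, list_of_actions), or None if the goal is unreachable.
    """
    counter = 0  # tie-breaker so the heap never compares states or plans
    frontier = [(0, counter, start, [])]
    best = {start: 0}
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale heap entry
        for action, _obs, nxt, step_cost in edges.get(state, []):
            new_cost = cost + step_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                counter += 1
                heapq.heappush(frontier, (new_cost, counter, nxt, plan + [action]))
    return None

# Hypothetical causal-state graph: s0 is the start node, s4 a saved goal node.
GRAPH = {
    "s0": [("right", "corridor", "s1", 1), ("down", "corridor", "s2", 1)],
    "s1": [("right", "key", "s3", 1)],
    "s2": [("right", "corridor", "s1", 1), ("down", "wall", "s2", 1)],
    "s3": [("up", "goal", "s4", 1)],
}

result = shortest_plan(GRAPH, "s0", "s4")
```

In this toy graph the search returns a three-step plan. No value function is learned, since the search uses only the stored edges.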