1 Introduction
High-dimensional state spaces are common in recent reinforcement learning applications (Mnih et al., 2015). Although states may be as large as images, the information required to make good decisions is typically much smaller. This motivates state abstraction: the process of encoding states into compressed representations that retain the necessary information and discard the rest. One principled approach to state abstraction is bisimulation in Markov decision processes (MDPs) (Dean and Givan, 1997). Bisimulations formalize the notion of finding smaller equivalent abstract MDPs that preserve transition and reward information, i.e., they retain the relevant decision-making information while reducing the size of the state space. We demonstrate this idea in Figure 1, where a grid world with fifteen states is compressed into an MDP with three states.
Unfortunately, finding bisimulations with maximally compressed state spaces is NP-hard (Dean et al., 1997). One common way to avoid constructing an abstract MDP is to use bisimulation metrics, which facilitate transferring existing policies to similar states (Ferns et al., 2004). However, this approach cannot generalize to new tasks, for which no policy exists yet. Instead, we pursue the original bisimulation goal of finding a discrete abstract MDP that can be used to solve unseen tasks efficiently.
In this paper, we introduce an approach to finding approximate MDP bisimulations using the variational information bottleneck (VIB) (Tishby et al., 2001; Alemi et al., 2017). This framework is typically used to learn representations that predict quantities of interest accurately while ignoring certain aspects of the domain. The VIB approach has even been explored recently in the context of state abstraction, but the resulting state abstraction does not, in general, correspond to an MDP bisimulation (Abel et al., 2019). This is problematic because the abstract MDP can only represent the policies it was trained on and cannot be used to plan for new tasks. Whereas Abel et al. (2019) use the abstract states to predict actions from an expert policy, we use abstract states to predict learned Q-values in the VIB objective.
In our setup, a learned encoder maps a state $s$ in the original MDP to a continuous embedding $z$, which we then infer belongs to a learned discrete abstract state $\bar{s}$. One perspective on the VIB objective is that it learns an encoder (a state abstraction function $s \mapsto z$) that predicts observed Q-values well using $z$ alone, subject to a prior in the embedding space. Concretely, we propose priors that prefer clusters with Markovian transition structure. A sequence of embedded states $z_1, z_2, \ldots$ is treated as observations from either a Gaussian mixture model (GMM) or an action-conditioned hidden Markov model (HMM), where each embedding $z_t$ is emitted from a latent cluster representing an abstract state $\bar{s}_t$. In the HMM case, we also learn a cluster transition matrix for each action, which serves as the abstract MDP transition model. The key insight is that abstract states group together ground states (and their embeddings) with similar Q-values and similar transition properties, thereby forming an approximate MDP bisimulation.

Although structured priors have been used in the context of variational autoencoders, one key difference in our approach is that the parameters of our GMM and HMM priors are learned as well. The learned parameters (cluster means, covariances, and the discrete transition matrix between clusters) therefore form our abstract MDP state space and transition function. When presented with tasks not seen during training, we can use the learned abstract model to plan efficiently, solving these tasks without additional learning.
In summary, our contributions are:

Framing bisimulation learning as a VIB objective.

Introducing two structured priors (GMM, HMM) with learned parameters for VIB-based state abstraction.

Using the learned parameters of the prior to extract a discrete abstract MDP, which is an approximate bisimulation of the original MDP.

Using the abstract MDP to plan for new goals.
2 Background
Markov decision process: We model our tasks as episodic Markov decision processes (MDPs). An MDP is a tuple $M = (S, A, R, T, \gamma)$ (Bellman, 1957), where $S$ and $A$ are state and action sets, respectively. The function $R(s, a)$ describes the expected reward associated with each state-action pair. The density $T(s' \mid s, a)$ describes transition probabilities between states. $\gamma \in [0, 1)$ is a discount factor. A policy $\pi(a \mid s)$ encodes the behavior of an agent as a probability distribution over $A$ conditioned on $s \in S$. The state-action value of a policy is the expected discounted reward of executing action $a$ from state $s$ and subsequently following policy $\pi$:

$Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\!\left[ Q^{\pi}(s', a') \right]. \quad (1)$

We want to behave optimally in both the ground and abstract MDPs. A policy $\pi^*$ is optimal when $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ for every policy $\pi$ and every state-action pair $(s, a)$.
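To make Equation 1 concrete, the following minimal numpy sketch (our own illustration, not code from the paper) evaluates $Q^{\pi}$ by fixed-point iteration for a tabular MDP; the array shapes are assumptions of this example.

import numpy as np

def evaluate_q(R, T, pi, gamma, tol=1e-8):
    # R: (S, A) expected rewards; T: (A, S, S) transitions, T[a, s, s2] = T(s2 | s, a);
    # pi: (S, A) policy probabilities; gamma: discount factor in [0, 1).
    Q = np.zeros(R.shape)
    while True:
        V = np.sum(pi * Q, axis=1)                        # V(s') = E_{a' ~ pi}[Q(s', a')]
        Q_new = R + gamma * np.einsum("asn,n->sa", T, V)  # Equation 1
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

Since $\gamma < 1$, the update is a contraction, so the iteration converges to $Q^{\pi}$.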
State abstraction: We approach state abstraction from the perspective of model minimization. The goal is to find a function that maps from the state space $S$ of the original MDP to a compact state space $\bar{S}$ while preserving the reward and transition dynamics (Dean and Givan, 1997; Givan et al., 2003). Concretely, we want a surjective function $\phi: S \to \bar{S}$ that induces a partition over $S$. That is, each abstract state $\bar{s} \in \bar{S}$ is associated with a block of states in $S$ defined by the preimage of $\phi$ at $\bar{s}$, $\phi^{-1}(\bar{s})$. Since $\phi$ must induce a partition over $S$, we require $\phi$ to be surjective. A bisimulation is a surjection that induces a partition over $S$ and preserves the reward and transition dynamics. It is commonly formalized as:
Definition 1 (MDP Bisimulation).
Let $M = (S, A, R, T, \gamma)$ and $\bar{M} = (\bar{S}, A, \bar{R}, \bar{T}, \gamma)$ be MDPs. A function $\phi: S \to \bar{S}$ is an MDP bisimulation from $M$ to $\bar{M}$ if the preimage $\phi^{-1}$ induces a partition of $S$ and, for each $s_1, s_2 \in S$, $a \in A$, and $\bar{s}' \in \bar{S}$, $\phi(s_1) = \phi(s_2)$ implies both $R(s_1, a) = R(s_2, a)$ and $\sum_{s' \in \phi^{-1}(\bar{s}')} T(s' \mid s_1, a) = \sum_{s' \in \phi^{-1}(\bar{s}')} T(s' \mid s_2, a)$.
Given two MDPs $M$ and $\bar{M}$, there may exist many bisimulations. These bisimulations can be placed in a partial order: given two bisimulations $\phi_1$ and $\phi_2$ from $M$ to $\bar{M}$, we say that $\phi_1$ is a refinement of $\phi_2$ if the partition induced by $\phi_1$ is a refinement of the partition induced by $\phi_2$.
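For a tabular MDP, Definition 1 can be checked directly: states mapped to the same abstract state must agree on rewards and on the total probability of transitioning into each block. The sketch below is our own illustration of this check, under assumed array shapes.

import numpy as np

def is_bisimulation(phi, R, T, atol=1e-8):
    # phi: (S,) abstract-state index for each ground state.
    # R: (S, A) rewards; T: (A, S, S) transitions, T[a, s, s2].
    num_blocks = int(phi.max()) + 1
    # block_T[a, s, b] = probability of moving from s into block b under action a.
    block_T = np.stack([T[:, :, phi == b].sum(axis=2)
                        for b in range(num_blocks)], axis=2)
    for b in range(num_blocks):
        idx = np.flatnonzero(phi == b)
        # All states in a block must agree on rewards and on block-level
        # transition probabilities (Definition 1).
        if not np.allclose(R[idx], R[idx[0]], atol=atol):
            return False
        if not np.allclose(block_T[:, idx, :], block_T[:, idx[:1], :], atol=atol):
            return False
    return True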
Information Bottleneck Methods: We approach the state abstraction problem using information bottleneck (IB) methods (Tishby et al., 2001). These methods assume a distribution $p(x, y)$ over features $X$ (in our case the state) and a prediction target $Y$ (in our case expected reward or return). Given $p(x, y)$, IB methods train an encoder $q(z \mid x)$ that maps $X$ onto a compressed representation $Z$ by maximizing the IB objective

$\max\; I(Z; Y) - \beta\, I(Z; X), \quad (2)$

where $I(\cdot\,;\cdot)$ denotes the mutual information between its arguments. The intuition behind this objective is that we would like to learn a (lossy) compressed representation $Z$ of $X$ by maximizing the mutual information between $Z$ and the target $Y$, which ensures that $Z$ is predictive of $Y$, whilst minimizing the mutual information between $Z$ and $X$, which ensures that any information in $X$ that does not correlate with $Y$ is discarded.
In practice, evaluating the IB objective is intractable. Instead of optimizing Equation (2) directly, IB methods introduce two variational distributions $q(y \mid z)$ and $m(z)$ to bound the mutual information terms:

$I(Z; Y) \geq \mathbb{E}_{p(x, y)\, q(z \mid x)}\!\left[\log q(y \mid z)\right] + H(Y), \quad (3)$

$I(Z; X) \leq \mathbb{E}_{p(x)\, q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{m(z)}\right]. \quad (4)$

Combining these terms bounds the IB objective:

$\mathcal{L} = \mathbb{E}\!\left[\log q(y \mid z)\right] - \beta\, \mathbb{E}\!\left[\log \frac{q(z \mid x)}{m(z)}\right] \leq I(Z; Y) - \beta\, I(Z; X) + H(Y).$

Maximizing this bound with respect to the encoder $q(z \mid x)$ and the variational distributions $q(y \mid z)$ and $m(z)$ is a form of variational expectation maximization; optimizing $q(y \mid z)$ and $m(z)$ tightens the bound, whereas optimizing $q(z \mid x)$ maximizes the IB objective. Note that the entropy term $H(Y)$ does not depend on the learned distributions and can therefore be safely ignored during optimization.

In practice, modern IB methods use a neural network as the encoder (Alemi et al., 2017) and perform stochastic gradient descent with reparameterized Monte Carlo estimators, resulting in models that are closely related to variational autoencoders
(Kingma and Welling, 2014; Rezende et al., 2014). We can make the connection between IB methods and variational autoencoders concrete by noting that when $\beta = 1$, the lower bound on the IB objective becomes

$\mathcal{L} = \mathbb{E}_{p(x, y)\, q(z \mid x)}\!\left[\log \frac{q(y \mid z)\, m(z)}{q(z \mid x)}\right]. \quad (5)$

Maximizing $\mathcal{L}$ is then equivalent to learning an encoder $q(z \mid x)$ that approximates the posterior of a generative model with likelihood $q(y \mid z)$ and prior $m(z)$, while optimizing the generative model to maximize the log marginal likelihood of $y$.
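The following sketch, our own illustration rather than the paper's code, computes the resulting bound for a diagonal Gaussian encoder with the marginal m(z) fixed to a unit Gaussian, in which case the second expectation reduces to a closed-form KL divergence; encode and log_q_y_given_z are hypothetical callables.

import numpy as np

rng = np.random.default_rng(0)

def vib_bound(x, y, encode, log_q_y_given_z, beta, num_samples=8):
    # encode(x) -> (mu, sigma): a diagonal Gaussian encoder q(z | x).
    # log_q_y_given_z(y, z): log-density of the variational decoder.
    mu, sigma = encode(x)
    # KL(q(z | x) || N(0, I)) in closed form.
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    # Reparameterized Monte Carlo estimate of E_{q(z|x)}[log q(y | z)].
    eps = rng.standard_normal((num_samples,) + mu.shape)
    samples = mu + sigma * eps
    log_lik = np.mean([log_q_y_given_z(y, z) for z in samples])
    return log_lik - beta * kl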
3 Related Work
Bisimulation for MDPs was first described by Dean and Givan (1997), together with an algorithm for finding it when a full model of the ground MDP is available. Without an exact model of the environment, comparing transition and reward functions is problematic, as even a fractional deviation between the dynamics of two similar states will cause them to be separated. Dean et al. (1997) proposed approximate bisimulation, which allows the dynamics of states mapped to a single abstract state to vary up to a constant $\epsilon$. However, the problem of finding the coarsest approximate grouping is NP-hard. Ferns et al. (2004, 2006) extended approximate bisimulation to the bisimulation metric: a method for comparing states based on their one-step dynamics. The bisimulation metric has been used to transfer policies from simple to complex domains (Castro and Precup, 2010, 2011), and Abel et al. (2016) proved bounds on the quality of a policy over the approximate abstraction. More recently, Castro (2019) created an objective for a deep neural network based on the bisimulation metric to learn similarities between states in grid worlds and Atari games. The main difference between this line of work and ours is that we aim to learn an abstract MDP with discrete states, in which we can plan efficiently, whereas bisimulation metrics are more commonly used to find similar states for the purpose of policy transfer.
The Information Bottleneck method defines an objective that maximizes the predictive power of a model while minimizing the number of bits needed to encode an input (Tishby et al., 2001). Abel et al. (2019) drew a connection between the Information Bottleneck and rate-distortion theory for the purpose of learning state abstractions. They propose an Expectation-Maximization-like algorithm for finding state abstractions in discrete-state environments and a loss function for learning compressed policies in Atari games. Apart from state abstraction, the Information Bottleneck method has also been used to regularize policies represented by deep neural networks. Goyal et al. (2019) improved the generalization of goal-conditioned policies by training their agent to detect decision states – states in which the agent requires information about the current goal to act optimally. Here, the Information Bottleneck objective forces the agent to access information about the current goal only when necessary. Teh et al. (2017) used a similar objective to distill a task-agnostic policy in the multi-task Reinforcement Learning setting. Strouse et al. (2018) used the Information Bottleneck to control the amount of information transferred between two agents. Tishby and Polani (2011) and Rubin et al. (2012) study the interactions of an agent with an environment from an information-theoretic perspective.

4 Learning bisimulations
We propose a variational model for finding bisimulations directly from experience. The end result of our process is an abstract MDP, in which we can efficiently plan policies. Our model consists of three parts:

a deep neural net encoder that projects states (usually represented as images) onto a low-dimensional continuous latent space (Figure 2, left),

a prior that encodes the belief that the experience was generated by a small discrete Markov process (Figure 2, lower right),

a linear decoder that predicts state-action values from the continuous encodings (Figure 2, upper right).
We tie the three models together using the deep variational information bottleneck method (Alemi et al., 2017). Unlike the standard setting, we encode pairs of ground states $(s_t, s_{t+1})$ as latent state pairs $(z_t, z_{t+1})$. This enables us to learn a tabular transition function between discrete states inside a prior $m(z_t, z_{t+1} \mid a_t)$. We treat the discrete states of the prior as abstract states, which together with the learned transition function and a given reward function define an abstract MDP.
4.1 Bisimulation as an information bottleneck
Let $p(s_t, a_t, s_{t+1})$ be an empirical distribution representing a dataset of transitions from state $s_t$ to state $s_{t+1}$ under an action $a_t$ selected by an arbitrary policy $\pi$. $Q^\pi(s_t, a_t)$ denotes the state-action value for the pair $(s_t, a_t)$ under $\pi$. We will use the IB method to find a compact latent encoding of $s_t$ that enables us to predict $Q^\pi(s_t, a_t)$ while simultaneously matching a prior on the temporal dynamics of the process that generated the data. Let $\mathbf{s} = (s_t, s_{t+1})$ denote a sequential pair of states and $\mathbf{z} = (z_t, z_{t+1})$ a corresponding sequential pair of latent states. The standard IB formulation is

$\max\; I(\mathbf{z};\, q_t) - \beta\, I(\mathbf{z};\, \mathbf{s}),$

where we use $q_t = Q^\pi(s_t, a_t)$ as shorthand notation and expand the two mutual information terms using the variational bounds of Equations 3 and 4.
We make two architectural decisions grounded in standard Markov assumptions. First, we assume that the value $q_t$ is conditionally independent of $z_{t+1}$ given $z_t$: $q(q_t \mid z_t, z_{t+1}) = q(q_t \mid z_t)$. Second, we assume that $z_t$ is conditionally independent of $s_{t+1}$, $a_t$, and $z_{t+1}$ given $s_t$, and likewise that $z_{t+1}$ is conditionally independent of $s_t$, $a_t$, and $z_t$ given $s_{t+1}$: $q(\mathbf{z} \mid \mathbf{s}) = q(z_t \mid s_t)\, q(z_{t+1} \mid s_{t+1})$.
Putting it together, we maximize an IB lower bound

$\mathcal{L} = \mathbb{E}\!\left[\log q(q_t \mid z_t)\right] - \beta\, \mathbb{E}\!\left[\log \frac{q(z_t \mid s_t)\, q(z_{t+1} \mid s_{t+1})}{m(z_t, z_{t+1} \mid a_t)}\right]. \quad (6)$

$\mathcal{L}$ presents a trade-off between encoding enough information about $s_t$ to predict $q_t$ (the first term of Equation 6) and making the sequence likely under our prior (the second term). This prior, $m(z_t, z_{t+1} \mid a_t)$, is a key element of our approach and is discussed in the next section.
Notice that $q_t$ (what we predict in the first term of Equation 6) is a state-action value, not a reward. This gives our model additional supervision; without it, our model tends to collapse, mapping all ground states to a single abstract state.
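The structure of Equation 6 can be summarized in a few lines. The sketch below is a schematic single-sample estimator; the callables and the DQN-provided target are assumptions of this illustration, not the paper's implementation.

def ib_loss_pair(z_t, z_t1, a_t, q_target,
                 log_q_value, log_enc_t, log_enc_t1, log_prior, beta):
    # z_t, z_t1: embeddings sampled from the encoder for s_t and s_{t+1};
    # q_target: Q-value of (s_t, a_t) from a pre-trained Q-network;
    # log_enc_t, log_enc_t1: log q(z_t | s_t) and log q(z_{t+1} | s_{t+1})
    # evaluated at the samples; log_prior: log m(z_t, z_{t+1} | a_t).
    prediction_term = log_q_value(q_target, z_t)                 # first term of Eq. 6
    rate_term = log_enc_t + log_enc_t1 - log_prior(z_t, z_t1, a_t)
    return prediction_term - beta * rate_term                    # quantity to maximize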
4.2 Structured Priors
The denominator of the second term in Equation 6 is the prior. We want it to express the expectation that we are observing a discrete Markov process. We explore two approaches: a prior based on a Gaussian mixture model (GMM) and a prior based on an action-conditioned hidden Markov model (HMM).
4.2.1 GMM Prior
A Gaussian mixture model consists of $K$ components, each parameterized by a mean $\mu_i$ and a covariance $\Sigma_i$:

$m(z_t, z_{t+1}) = m(z_t)\, m(z_{t+1}), \quad (7)$

$m(z) = \sum_{i=1}^{K} w_i\, \mathcal{N}(z \mid \mu_i, \Sigma_i), \quad (8)$

where $w_i$ denotes the probability that $z$ was generated by component $i$ and $\mathcal{N}(z \mid \mu_i, \Sigma_i)$ is the Gaussian distribution for the $i$-th component. In this paper, we constrain $\Sigma_i$ to be diagonal. For the GMM prior, we allow $\mu_i$ and $\Sigma_i$ to vary; the weight $w_i = 1/K$ is uniform and fixed. This encodes a desire to find a latent encoding generated by membership in a finite set of discrete states (the mixture components). Each mixture component corresponds to a distinct abstract state, and $w_i$ is the prior probability that a continuous encoding was generated by the $i$-th abstract state. In other words, the prior expresses the belief that the latent encodings of states should be distributed according to a mixture of Gaussians with unknown means and covariances. Note that while this approach gives us an encoder that projects real-valued, high-dimensional states onto a small discrete set of abstract states, it ignores the temporal aspect of the Markov process.
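A minimal numpy sketch, written by us for illustration rather than taken from the paper, of the mixture density in Equation 8 with diagonal covariances; keeping the weights in log space is an implementation choice of this sketch.

import numpy as np

def diag_gaussian_logpdf(z, mean, var):
    # Log-density of a Gaussian with diagonal covariance.
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)

def gmm_log_prior(z, means, variances, log_weights):
    # log m(z) = logsumexp_i [log w_i + log N(z | mu_i, Sigma_i)]  (Equation 8)
    logs = np.array([log_weights[i] + diag_gaussian_logpdf(z, means[i], variances[i])
                     for i in range(len(log_weights))])
    m = logs.max()
    return m + np.log(np.sum(np.exp(logs - m)))

Under Equation 7, the prior of a pair factorizes, so the pair log-density is the sum of gmm_log_prior evaluated at the two embeddings.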
4.2.2 HMM Prior

In order to capture the temporal aspect of a Markov process, we can model the prior as an action-conditioned hidden Markov model (HMM). Here, the “hidden” state is the unobserved discrete abstract state used to generate “observations” of the latent state $z_t$. As in the GMM, there are $K$ discrete abstract states, each of which generates latent states according to a multivariate normal distribution with mean $\mu_i$ and (diagonal) covariance matrix $\Sigma_i$. Since we are modelling a Markov process, we include a separate transition matrix $T^a$ for each action $a$, where $T^a_{ij}$ denotes the probability of transitioning from an abstract state $i$ to an abstract state $j$ under an action $a$. Using this model, the prior becomes:

$m(z_t, z_{t+1} \mid a_t) = \sum_{i=1}^{K} \sum_{j=1}^{K} m(\bar{s}_t = i)\; T^{a_t}_{ij}\; \mathcal{N}(z_t \mid \mu_i, \Sigma_i)\; \mathcal{N}(z_{t+1} \mid \mu_j, \Sigma_j), \quad (9)$

where

$m(z \mid \bar{s} = i) = \mathcal{N}(z \mid \mu_i, \Sigma_i), \quad (10)$

$m(\bar{s}_{t+1} = j \mid \bar{s}_t = i, a_t) = T^{a_t}_{ij}. \quad (11)$

As with the GMM model, we allow the parameters of this model ($\mu_i$, $\Sigma_i$, and $T^a$) to vary, except for the distribution over initial hidden states $m(\bar{s}_t)$, which is uniform and fixed. The transition model found during optimization is a discrete conditional probability table that defines a discrete abstract MDP. Essentially, this method finds the parameters of a hidden discrete abstract MDP that fits the observed data over which the loss of Equation 6 is evaluated.
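Under the same caveats, the pair prior of Equations 9–11 can be sketched by marginalizing over pairs of hidden states; this reuses diag_gaussian_logpdf from the GMM sketch above, and the array shapes are assumptions of this illustration.

import numpy as np

def hmm_pair_log_prior(z_t, z_t1, a, means, variances, trans, log_init):
    # trans: (num_actions, K, K) action-conditioned transition matrices T^a;
    # log_init: (K,) log of the uniform, fixed prior over hidden states.
    K = len(log_init)
    emit_t = np.array([diag_gaussian_logpdf(z_t, means[i], variances[i])
                       for i in range(K)])
    emit_t1 = np.array([diag_gaussian_logpdf(z_t1, means[j], variances[j])
                        for j in range(K)])
    # joint[i, j] = log of one term of the double sum in Equation 9.
    joint = (log_init[:, None] + emit_t[:, None]
             + np.log(trans[a]) + emit_t1[None, :])
    m = joint.max()
    return m + np.log(np.sum(np.exp(joint - m)))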
Figure 3 illustrates the latent embedding found using the HMM model for the Column World domain shown in Figure 1. The three clusters correspond to the three states in the coarsest abstract bisimulation MDP. These three clusters are over-represented by six cluster centroids (the red x’s) because we ran our algorithm with six mixture components. The algorithm “shared” these six components among the three bisimulation classes. The result is still a bisimulation – just not the coarsest one.
4.3 Deep encoder and end-to-end training
The loss (Equation 6) is defined in terms of three distributions that we need to parameterize: the encoder $q(z \mid s)$, the Q-predictor $q(q_t \mid z_t)$, and the prior $m(z_t, z_{t+1} \mid a_t)$. The encoder is a convolutional neural network that predicts the mean and the diagonal covariance of a multivariate normal distribution. In our experiments, we used a modified version of the encoder architecture from Ha and Schmidhuber (2018) (Appendix A.1 in version 4 of their arXiv submission): five convolutional layers followed by one fully-connected hidden layer and two fully-connected heads for the mean and the covariance, respectively. The Q-predictor is a single fully-connected layer (i.e., a linear transform). We chose this parameterization to impose another constraint on the latent space: the encodings not only need to form clearly separable clusters to adhere to the prior, but must also be linearly predictive of their state-action values for each action. When we train on state-action values for multiple tasks, we predict a vector of Q-values instead of a scalar. Using the reparameterization trick to sample from $q(z \mid s)$, we can compute the gradient of the objective with respect to the encoder weights, the Q-predictor weights, and the prior parameters. The prior parameters include the component means and variances, together with the transition function over hidden states in the HMM.
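A schematic of the sampling and prediction path described above (our illustration; the weights W, b and the log-variance parameterization are assumptions of this sketch):

import numpy as np

rng = np.random.default_rng(0)

def sample_encoding(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow into mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def predict_q(z, W, b):
    # Linear Q-predictor: one output per action (or per task-action pair
    # when training on state-action values for multiple tasks).
    return W @ z + b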
4.4 Planning in the Abstract MDP
A key aspect of our approach is that we can solve new tasks in the original problem domain by planning in a compact discrete MDP. This is one of the critical motivations for using bisimulations: optimal solutions in the abstract MDP induce optimal solutions in the original MDP. Using the discrete transition table found with the HMM prior, we define the abstract MDP $\bar{M} = (\bar{S}, A, \bar{R}, \bar{T}, \gamma)$. The abstract reward function $\bar{R}$ can be defined to encode any reward function in the ground MDP by projecting ground rewards into the abstract space using the encoder. We can then use standard discrete value iteration to find new policies. These policies can be applied immediately in the ground MDP: an observed ground state is projected into the discrete abstract MDP, and the new policy is used to choose an action.
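A minimal sketch of this planning loop, assuming the abstract model is given as arrays and that encode and assign_cluster stand in for the learned encoder and the most-likely-hidden-state assignment (both hypothetical names):

import numpy as np

def value_iteration(R_bar, T_bar, gamma, num_iters=500):
    # R_bar: (K, A) abstract rewards; T_bar: (A, K, K) abstract transitions.
    Q = np.zeros(R_bar.shape)
    for _ in range(num_iters):
        V = Q.max(axis=1)
        Q = R_bar + gamma * np.einsum("akn,n->ka", T_bar, V)
    return Q

def act(observation, encode, assign_cluster, Q):
    # Project a ground observation into the abstract MDP and act greedily.
    z = encode(observation)       # continuous embedding
    k = assign_cluster(z)         # most likely abstract (hidden) state
    return int(np.argmax(Q[k]))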
5 Connecting VIB abstraction to bisimulation
The HMM embedded in our model learns the parameters of a compact discrete MDP (Subsection 4.2). But is it a bisimulation? We show that under idealized conditions, every optimal solution to the objective is, in fact, a bisimulation. We analyze the idealized case where the following assumptions hold:

The agent receives a reward of 1 in goal states and zero elsewhere;

The transition function of the ground MDP is deterministic;

The HMM prior is parameterized with no fewer hidden states than there are states in the ground MDP, and no two hidden states share their means;

The prior over hidden states is held fixed;

The covariance of our encoder and prior components is 0;

The linear decoder is replaced with a 1-nearest-neighbor regressor: in order to predict a Q-value given an encoding $z$, it finds the encoding of a dataset state closest to $z$ (in Euclidean distance) and predicts that state’s state-action value for the action $a$ (a minimal sketch of this regressor follows this list).
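A sketch of this idealized decoder (our illustration; array shapes are assumptions):

import numpy as np

def nearest_neighbor_q(z, encodings, q_table, a):
    # encodings: (N, D) stored encodings of dataset states; q_table: (N, A)
    # their state-action values. Predict using the closest stored encoding.
    nearest = np.argmin(np.linalg.norm(encodings - z, axis=1))
    return q_table[nearest, a]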
Theorem 1 (HMM-bisimulation theorem).
Given:

a ground MDP $M = (S, A, R, T, \gamma)$,

the state-action value function $Q^*$ of an optimal policy,

a set of optimal parameters of our model (the encoder and the HMM prior), and

an abstract MDP $\bar{M}$ induced by the HMM prior

and adhering to the idealized assumptions described above, there exists a bisimulation mapping from $M$ to $\bar{M}$.
See Section A in the Appendix for the proof.
6 Experiments
The aim of our experiments is to investigate the following aspects of our method:

its ability to find abstractions that are compact and accurately model the ground MDP,

planning in the abstract MDP for new goals, and

its performance in environments without a clear notion of an abstract state.
We start with a simple grid world experiment to compare our method against an approximate bisimulation baseline (Subsection 6.1). Then we test it in more complex domains with image states in Subsection 6.2 (and Appendix C.1). Finally, we report results for simplified Atari games that break the assumptions of our method in Subsection 6.3.
6.1 Column World
The purpose of this experiment is to compare our method to a model-based approximate bisimulation baseline in a simple discrete environment. Column World is a grid world with 30 rows and 3 columns (Lehnert and Littman, 2018). The agent can move up, down, left, and right, and it receives a reward of 1 for any action executed in the right column; otherwise, it gets a reward of 0. Hence, the agent only needs to know whether it is in the left, middle, or right column, as illustrated in Figure 1.
First, we train a deep Q-network on this task and use it to generate a dataset of transitions. As a baseline, we train a neural network model to predict the reward and the next state given a state-action pair. We then find a coarse approximate bisimulation for this model using the greedy algorithm from Dean et al. (1997) with the approximation constant set to 0.5. We compare it with our method trained with an HMM prior on Q-values predicted by the deep Q-network. We represent each state as a discrete symbol and use fully-connected neural networks for all of our models. See Appendix B.2 for details.
Figure 4 shows the purity and the size of the abstractions found by our method and the baseline as a function of dataset size. We need a ground-truth abstraction to calculate the abstraction purity; in this case, it is the three-state abstraction shown on the right of Figure 1. We assign each ground state to an abstract state (Figure 2) and find the most common ground-truth label for each abstract state. The abstraction purity is the weighted average of the fraction of members of an abstract state that share its label. We include a snippet of code that computes this measure in Appendix B.1.
Both methods can find an abstraction with high purity. However, approximate bisimulation does not reduce the state space (there are 90 ground states) until the model of the environment is nearly perfect, which happens only when the dataset has more than 11000 examples. Our method always finds an abstraction with six states (the number of abstract states is a hyperparameter), and it finds a compact, high-purity abstraction much sooner than the baseline does. Note that we parameterize our method with more abstract states than the size of the coarsest bisimulation; in practice, this overparameterization helps our method converge.
6.2 Shapes World
[Table 1: Abstraction purities for the GMM and HMM priors on Shapes World. Rows: 2 and 3 pucks and 2 and 3 objects, each in grid worlds of three sizes; columns: GMM and HMM. Each model was trained 10 times on the same dataset; we report means and standard deviations (numeric entries omitted).]
[Table 2: Transfer results for Shapes World. Rows list source tasks (2S; 2S, 2R; 3S; ST; 3S, 3R; 3S, ST; 2&2S; 2&2S, 3S; 2&2S, 3R; 2&2S, ST) and columns list target tasks (2S, 3S, 2R, 3R, 2&2S, ST, 2D, 3D); entries are success rates of planning with the learned abstract model (numeric entries omitted).]
We use a modified version of the Pucks World from Biza and Platt (2019). The world is divided into a grid, and objects of various shapes and sizes can be placed or stacked in each cell. States are represented as simulated depth images. The agent can execute a PICK or PLACE action in each of the cells to pick up and place objects. The goal of abstraction here is to recognize that the shapes and sizes of objects have no influence on the PICK and PLACE actions. We instantiate eight different tasks in this domain, described in Figure 5.
First, we test the ability of our algorithm to find accurate bisimulation partitions. Table 1 shows the results for our method with both the GMM and the HMM priors. Both models reach a high abstraction purity (described in Section 6.1) in all cases except for the three-object stacking task in a grid world. There, the smallest MDP for which a bisimulation exists contains 936 abstract states; our algorithm has 1000 possible abstract states available. Our experiment shows that the HMM prior can leverage temporal information, which is missing from the GMM, to allocate abstract states better.
Next, we test the ability of the learned abstract models to plan for new goals. We are able to reach a goal only if it is represented as a distinct abstract state in our model; such abstract states can only exist if the training dataset contains examples of the goal. Therefore, we can generalize to unseen goals in the sense that our model does not know about these goals during training, but they are represented in the dataset. When planning for a particular goal, we create a new reward function for the abstract model and assign a reward of 1 to all transitions in the dataset that reach that goal. Then, we run Value Iteration in the abstract model and use the resulting state-action values to create a stochastic softmax policy. See Appendix B.3 for more details.
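The reward relabeling and the softmax policy can be sketched as follows (our illustration; phi, goal_test, and the temperature are hypothetical names):

import numpy as np

def abstract_reward(transitions, phi, goal_test, num_abstract, num_actions):
    # Assign reward 1 to abstract state-action pairs whose dataset
    # transitions reach the goal; transitions: iterable of (s, a, s').
    R_bar = np.zeros((num_abstract, num_actions))
    for s, a, s_next in transitions:
        if goal_test(s_next):
            R_bar[phi(s), a] = 1.0
    return R_bar

def softmax_policy(q_row, temperature):
    # Stochastic policy over actions for a single abstract state.
    logits = q_row / temperature
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()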
Our model is trained on one or two tasks, and we report its ability to plan for every task (Table 2). For tasks with abstractions that are simple to represent (e.g., 2-object stacking in a grid world has 136 abstract states in the coarsest bisimulation, one for each possible configuration of objects ignoring their shape), our method can successfully transfer to new tasks of similar complexity without additional training. For instance, the abstract model learned from two-puck stacking can plan for placing two and three pucks in a row with a high success rate. The middle section of Table 2 shows tasks whose coarsest abstractions barely fit into our abstract model; we can still transfer to similar tasks with a high success rate.
The bottom section of Table 2 demonstrates that our algorithm can find partial solutions even if the number of abstract states in the coarsest bisimulation exceeds the capacity of the HMM. This is in stark contrast to methods related to Partition Iteration (e.g., the baseline in Subsection 6.1 and Biza and Platt (2019)) that either find the coarsest bisimulation or keep creating new abstract states until the abstraction is as large as the original problem or some threshold is reached. Even approximate bisimulation, which should tolerate inaccurate models, has the unfortunate property that once it makes an erroneous split of a state block, errors in the following step of Partition Improvement become much more likely, often resulting in a useless abstraction.
We present additional transfer experiments with a house building task similar to Shapes World in Appendix C.1.
[Table 3: Mean returns in MinAtar games (rows: Breakout, Space Invaders, Freeway, Asterix) for the DQN, Value Iteration in the learned abstract model, Mean Q, and a random policy (numeric entries omitted).]
6.3 MinAtar
The challenge of the Shapes World (Subsection 6.2) is that the coarsest bisimulation can have thousands of abstract states; on the other hand, each task can be solved in fewer than ten time steps. The simplified Atari games of MinAtar pose an interesting challenge because each episode can potentially last tens or hundreds of time steps (Young and Tian, 2019). MinAtar has five Atari games – Breakout, Space Invaders, Freeway, Asterix, and Seaquest (we skip Seaquest because we had trouble running it). The state of the games is fully observable and the dynamics are simplified. We use the same process of training a deep Q-network to create a dataset of transitions and then training our model on it. Appendix B.5 contains further details.
We test the quality of the learned abstraction in two ways. First, we employ standard Value Iteration in the learned abstract model to plan for the optimal policy. Planning in this domain can be challenging, as we do not use any temporal abstractions. Second, we test our abstraction from the perspective of compression: we average the values of each state-action pair (predicted by the deep Q-network) belonging to each abstract state. This gives us a single value for each abstract-state–action pair; we call this approach Mean Q. Intuitively, we compress the policy represented by the deep Q-network into a discrete representation.
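Mean Q amounts to a membership-weighted average of ground Q-values per abstract state; a minimal sketch under assumed array shapes:

import numpy as np

def mean_q(q_values, cluster_probs):
    # q_values: (N, A) Q-values predicted by the DQN for N sampled states;
    # cluster_probs: (N, K) abstract-state membership probabilities.
    masses = cluster_probs.sum(axis=0)        # total mass per abstract state
    masses[masses == 0.0] = 1.0               # avoid division by zero
    return (cluster_probs.T @ q_values) / masses[:, None]   # (K, A) table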
Mean Q outperforms DQN in Breakout – an unexpected result – and reaches a substantial fraction of DQN’s performance in Space Invaders and Freeway (Table 3). Both Breakout policies suffer from a high variance of returns between episodes; we hypothesize that the compression makes the policy more robust. Figure 7 in Appendix C.2 further analyses Mean Q on Breakout. Value Iteration only works in Freeway, and we fail to learn a useful abstraction in Asterix.
7 Conclusion
In this work, we present a new method for finding state abstractions from collected image states. We derive our objective function from the information bottleneck framework and learn an abstract MDP through an HMM prior conditioned on actions. Our experiments demonstrate that our model is able to learn high-quality bisimulation partitions that contain up to 1000 abstract states. We also show that our abstractions enable transfer to goals not known during training. Our method fails gracefully in environments whose complexity greatly exceeds the capacity of our abstraction. Finally, we report experimental results on tasks with long time horizons, showing that we can use learned abstractions to compress DQN policies.
In future work, we plan to address the two main weaknesses of bisimulation: it does not leverage symmetries of the state-action space to minimize the size of the found abstraction, and it does not scale with the temporal horizon of the task. The former problem can be addressed with MDP homomorphisms (Ravindran, 2004); the time horizon problem could be addressed with hierarchical Reinforcement Learning.
References
Abel et al. (2019). State abstraction as compression in apprenticeship learning. In AAAI.
Abel et al. (2016). Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning, pp. 2915–2923.
Alemi et al. (2017). Deep variational information bottleneck. In 5th International Conference on Learning Representations (ICLR 2017).
Bellman (1957). A Markovian decision process. Journal of Mathematics and Mechanics 6(5), pp. 679–684.
Biza and Platt (2019). Online abstraction with MDP homomorphisms for deep learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’19), pp. 1125–1133.
Castro and Precup (2010). Using bisimulation for policy transfer in MDPs. In AAAI.
Castro and Precup (2011). Automatic construction of temporally extended actions for MDPs using bisimulation metrics. In EWRL.
Castro (2019). Scalable methods for computing state similarity in deterministic Markov decision processes. arXiv:1911.09291.
Dean et al. (1997). Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In UAI.
Dean and Givan (1997). Model minimization in Markov decision processes. In AAAI/IAAI.
Ferns et al. (2006). Methods for computing state similarity in Markov decision processes. arXiv:1206.6836.
Ferns et al. (2004). Metrics for finite Markov decision processes. In AAAI.
Givan et al. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147(1), pp. 163–223.
Goyal et al. (2019). Transfer and exploration via the information bottleneck. In International Conference on Learning Representations.
Ha and Schmidhuber (2018). World models. arXiv:1803.10122.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 448–456.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015).
Kingma and Welling (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014).
Lehnert and Littman (2018). Transfer with model features in reinforcement learning. arXiv:1807.01736.
Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
Ravindran (2004). An algebraic approach to abstraction in reinforcement learning. Ph.D. thesis, University of Massachusetts Amherst.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (PMLR 32), pp. 1278–1286.
Rubin et al. (2012). Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74.
Schaul et al. (2016). Prioritized experience replay. In 4th International Conference on Learning Representations (ICLR 2016).
Strouse et al. (2018). Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems 31, pp. 10249–10259.
Teh et al. (2017). Distral: robust multitask reinforcement learning. In NIPS.
Tishby et al. (2001). The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control and Computation.
Tishby and Polani (2011). Information theory of decisions and actions. In Perception-Action Cycle: Models, Architectures, and Hardware, pp. 601–636.
Young and Tian (2019). MinAtar: an Atari-inspired testbed for more efficient reinforcement learning experiments. arXiv:1903.03176.
Appendix A Proof of Theorem 1
Theorem 1 (HMM-bisimulation theorem).
Given:

a ground MDP $M = (S, A, R, T, \gamma)$,

the state-action value function $Q^*$ of an optimal policy,

a set of optimal parameters of our model (the encoder and the HMM prior), and

an abstract MDP $\bar{M}$ induced by the HMM prior

and adhering to the idealized assumptions described in Section 5, there exists a bisimulation mapping from $M$ to $\bar{M}$.
Proof by contradiction.
Our strategy is to show that for every set of parameters of our model with an HMM prior that does not induce a bisimulation, we can find a set of parameters with a higher objective value $\mathcal{L}$ that does induce a bisimulation. We break our analysis into two cases: first, we show that for any model that does not preserve the reward function, we can find one with a higher $\mathcal{L}$ that does. Then, we show that a reward-preserving model that violates the transition dynamics of the ground environment is also suboptimal. We start by defining a SPLIT operation used in every part of our analysis.
We define $\phi$ as the abstraction function that maps each ground state to its most likely hidden state in the HMM prior. If $\phi(s_1) \neq \phi(s_2)$ for all distinct states $s_1$ and $s_2$, $\phi$ is trivially a bisimulation. In other cases, we can always find an empty hidden state (one to which no ground state is mapped) because there are at least as many hidden states as ground states. We define SPLIT as an operation that takes two states $s_1$ and $s_2$ such that $\phi(s_1) = \phi(s_2)$. It finds an empty hidden state $k$ and assigns $\phi(s_2) = k$ (therefore, $\phi(s_1) \neq \phi(s_2)$), setting the encoding of $s_2$ to $\mu_k$, where $\mu_k$ is the mean of the observation model corresponding to hidden state $k$. For incoming transition probabilities, we copy the probabilities of the original hidden state, $T^a_{i, k} = T^a_{i, \phi(s_1)}$, for all abstract states $i$ and actions $a$; the outgoing transition probabilities are set analogously. We also assign the encoding of $s_1$ to $\mu_{\phi(s_1)}$ if that is not already the case.
Our objective function has two components: the value-prediction term $\mathbb{E}[\log q(q_t \mid z_t)]$ and the prior term $-\beta\, \mathbb{E}\!\left[\log \frac{q(z_t \mid s_t)\, q(z_{t+1} \mid s_{t+1})}{m(z_t, z_{t+1} \mid a_t)}\right]$ (Equation 6). Analysing the second term further, the encoder distributions $q(z_t \mid s_t)$ and $q(z_{t+1} \mid s_{t+1})$ turn into Dirac delta functions because their covariance is set to zero. The prior term can then be decomposed into the log-likelihoods of the encodings under the mixture components and the log-probabilities of the hidden-state transitions. Since the covariance of the hidden states is fixed to zero, the observation likelihood is maximized if the encoding of each state $s$ equals $\mu_{\phi(s)}$, the mean of the observation model for hidden state $\phi(s)$. The prior over hidden states is held uniform and fixed.
Assume that we have parameters such that $\phi(s_1) = \phi(s_2)$ but $Q^*(s_1, a) \neq Q^*(s_2, a)$ for some pair of states and some action $a$. Since $M$ is episodic and has sparse rewards, the optimal Q-values are powers of the discount factor $\gamma$ and are therefore distinguishable. We analyze two cases:

The encodings of $s_1$ and $s_2$ coincide: in this case, the nearest-neighbor decoder cannot distinguish between $Q^*(s_1, a)$ and $Q^*(s_2, a)$ given the shared encoding. We can increase the first component of the objective by executing SPLIT, which gives $s_2$ a new encoding whose nearest neighbor is itself; the second term of the objective stays fixed, or increases if the model can better simulate the ground transition dynamics.

The encoding of $s_1$ or $s_2$ does not coincide with the mean of its hidden state: we can increase the value of the second component of the objective by executing SPLIT. The first term stays fixed because we can still distinguish between $s_1$ and $s_2$, and the second term increases because the observation likelihood of the encoding of $s_1$ or $s_2$ increases while the rest stays the same.
In both cases, the parameter set is not an optimal solution. Next, we consider the case where each encoding coincides with the mean of some hidden state and the rewards are preserved, but the transition dynamics of the ground MDP are violated. Because the ground MDP is deterministic, this implies that there exists a transition $(s, a, s')$ in the data such that $T^a_{\phi(s), \phi(s')} < 1$. We can increase the transition term by setting $T^a_{\phi(s), \phi(s')} = 1$ while keeping the other terms fixed. Hence, the model is again suboptimal.
We have shown that for every set of parameters that does not induce a bisimulation, we can find a set of parameters with a higher $\mathcal{L}$ that does induce a bisimulation.
∎
Appendix B Experimental Details
We ran all of our experiments on a machine with an Intel Core i7-9700K CPU @ 3.60GHz, 64GB of RAM, and two Nvidia GeForce RTX 2080 Ti graphics cards.
B.1 Abstraction purity
We include a snippet of the Python code that computes the abstraction purity, using numpy version 1.16.1. The inputs to the function below are a matrix of probability distributions over hidden states, one for each sample in the validation dataset, and an array of labels, one for each sample.
import numpy as np

def evaluate_purity(cluster_probs, labels):
    """
    :param cluster_probs: NxK matrix where N is the number of samples
        and K the number of components.
    :param labels: A label for each sample.
    """
    # compute the probability mass of each component-label pair
    label_masses = []
    for label in np.unique(labels):
        # find all samples with a particular label and sum over them
        label_mass = np.sum(cluster_probs[labels == label], axis=0)
        label_masses.append(label_mass)
    label_masses = np.stack(label_masses)

    # total probability mass assigned to each component
    sizes = np.sum(cluster_probs, axis=0)
    sizes[sizes == 0.0] = 1.0

    # assign the label with the highest probability mass to each cluster
    # and calculate the fraction of that mass relative to the whole cluster
    purities = np.max(label_masses, axis=0) / sizes

    # average of cluster purities weighted by cluster sizes
    mean_purity = np.sum(purities * sizes) / np.sum(sizes)

    return purities, sizes, mean_purity
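As a quick sanity check of the function above, consider a hypothetical example with three samples, two components, and two labels:

probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.1, 0.9]])
labels = np.array([0, 0, 1])
purities, sizes, mean_purity = evaluate_purity(probs, labels)
# purities ~ [0.944, 0.75], sizes = [1.8, 1.2], mean_purity ~ 0.867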
B.2 Column World
The deep Q-network used to collect the dataset has two hidden layers of 256 neurons followed by ReLU activation functions. We train it for 40000 time steps with an $\epsilon$-greedy policy; $\epsilon$ linearly decays from 1 to 0.1 over 20000 time steps. We use a learning rate of 0.0001 and a minibatch size of 32; the target network is updated every 100 time steps, and we use prioritized replay with the default settings (Schaul et al., 2016). The optimizer is minibatch gradient descent with momentum. The dataset for training the abstract and direct models is collected after training, with $\epsilon$ set to 0.5. We compute the abstraction purity over every possible ground state.

Each state is represented as a 90-dimensional one-hot encoded vector. As a baseline, we train a model with two fully-connected layers of 128 neurons followed by ReLU activations and two heads, one for predicting the reward and the other for the next state. We use a mean squared error loss for the reward prediction and a cross-entropy loss for the next state (we treat each dimension of the predicted 90-dimensional vector as the probability of being in that particular state). Finally, we run an approximate partition iteration algorithm following Dean et al. (1997).

Our model with an HMM prior uses the same architecture as the above model, except it makes only one prediction: the state-action value associated with a given state-action pair. We set the number of hidden states to 6 and the dimensionality of the HMM observation model to 32; $\beta$ is 0.0001. The means of the HMM observations are initialized with 0 mean and 0.01 standard deviation, and the diagonal covariances are initialized with 1 mean and 0.1 standard deviation before being exponentiated. We train the models using the Adam optimizer (Kingma and Ba, 2015).
[Table 4: Transfer results for Stacking Buildings. Rows list source tasks (2SB; 3SB; 3T; 3L; 3SB, 3T; 3T, 3L; 3SB, 3T, 3L) and columns list target tasks (2SB, 3SB, 3T, 3B, 3L, 3R); entries are success rates (numeric entries omitted).]
B.3 Shapes World
For dataset collection, the input image is resized before being fed into a deep Q-network. We use four convolutional layers with 32, 64, 128, and 256 filters; the filter size is four and the stride is set to two (each convolutional layer downsamples the input by a factor of two); we use "same" padding. The convolutions are followed by a single fully-connected layer with 512 neurons and a head for predicting the state-action values. The learning rate is set to 0.00005, the batch size is 32, the buffer size is 100000, and we train for 100000 steps. Actions are selected with an $\epsilon$-greedy policy; $\epsilon$ is linearly decayed from 1.0 to 0.1 over 50000 time steps. We collect a dataset of 100000 transitions after training the model, with $\epsilon$ set to 0.1. Most of the dataset is used for training and the rest for computing the abstraction purity.

Our model uses the same neural network, except we insert batch normalization between each layer and its activation function (we use ReLU) (Ioffe and Szegedy, 2015). The model predicts a 32-dimensional vector of means and a diagonal covariance, from which we sample the continuous encoding $z$. The GMM or HMM uses 1000 components (hidden states); the initial means of the components are drawn from a Gaussian distribution with 0 mean and 0.1 variance, and the variances are drawn from a Gaussian with 1.0 mean and 0.1 standard deviation before being exponentiated. We train the model for 50000 steps, then collect batch normalization statistics over the whole dataset, and resume training only the prior, with a fixed encoder and unfrozen component weights (previously held uniform and fixed), for another 50000 steps. We train the model with the Adam optimizer (Kingma and Ba, 2015), with separate learning rates for the encoder and the prior.
To get a reward function over the abstract MDP induced by the HMM, we find abstract states for which a large fraction of the ground states mapped to them are goal states for a given goal. We plan state-action values for each abstract-state–action pair using Value Iteration and run an agent with a softmax policy (with a fixed temperature) for 100 episodes.
B.4 Stacking Buildings
The hyperparameters are the same as for Shapes World (Subsection B.3).
B.5 MinAtar
We use the deep Q-network architecture and training script provided by Young and Tian (2019). We collect 100000 transitions with $\epsilon$ set to 0.1 after training the network for 3M time steps. The authors train for up to 5M time steps, but we disable sticky actions and difficulty ramping, making the games easier.
The details of our model are similar to Subsection B.3, except we use a smaller convolutional network with 32 and 64 filters in two layers, the filter size set to three, and the stride set to one. We do not use batch normalization, and the hidden layer after the convolutions has only 128 neurons. $\beta$ is adjusted for this domain, and the rest of the parameters stay the same. For our abstract agent, we do not threshold goal states, and we use a fixed softmax temperature.
Appendix C Additional Experiments
C.1 Stacking Buildings
The setup for this experiment is the same as for Shapes World (Subsection 6.2). We instantiate five different tasks (Figure 6) and report our transfer results in Table 4. The tasks are more difficult than the ones in Shapes World: all tasks except the first have too many abstract states in the coarsest bisimulation to be represented by the 1000 hidden states available in the HMM prior.
One data point of interest is our model’s ability to generalize between different orientations of buildings (Figure 6, images 3 (3T – top), 4 (3B – bottom), 5 (3L – left), and 6 (3R – right)). The abstraction trained on the 3T task can generalize to 3B, but not to 3L and 3R. Conversely, 3L can generalize to 3R, but not to 3T and 3B. Training on 3T and 3L leads to an abstraction that can solve both 3B and 3R (Table 4, line 6), albeit not as well as the abstractions trained on 3T (line 3) and 3L (line 4) separately.
C.2 MinAtar
In Figure 7, we investigate the impact of the number of abstract states on the performance of the Mean Q abstract agent (Subsection 6.3) in Breakout. Even though there is high variance in the quality of the abstractions learned in different runs, the violin plot shows an approximately linear dependence between the number of abstract states and the mean return.