1 Introduction
Decision-making in humans has been shown to involve a tradeoff between achieving goals and being economical about cognitive effort (Kool and Botvinick, 2018). For instance, consider walking from one’s bedroom to the kitchen: exiting through the bedroom door, turning left, walking down the hallway, and turning right. An effort-optimized (or lazy) behavior is to pay attention and make goal-directed decisions at a small number of decision states (e.g. the door, the end of the hallway), while falling back to ‘default’ or goal-independent behavior at all other states (e.g. continuing to walk along a hallway while in it). Essentially, the high-level goal of “walking to the kitchen” is punctuated by certain decision states, where one switches from one mode of behavior to another. Understanding these decision states in an environment (say, in the context of a navigation task) has the potential to enable better transfer to novel environments and tasks, and a better understanding of the structure in the environment.
Recent work in this domain, most notably Goyal et al. (2019), discovers decision states in a task/goal-specific (or ‘supervised’) manner. In contrast, we aim to do so in a goal-independent (or ‘unsupervised’) manner with agents that explore the environment to understand what they can control. Our key hypothesis is that decision states (e.g. ends of hallways, crossings in a maze, corners of rooms) are a function of structural regularities in the environment and what the agent has seen, and not (just) of the downstream task/goal. Thus, we propose to identify decision states purely by interacting with the environment, without any goal or extrinsic reward. Fig. 1 illustrates this idea.
Goal-dependent Decision States. More formally, Goyal et al. (2019) discover decision states in partially observed Markov Decision Processes (POMDPs) with an agent operating under a goal $G$. In addition to training a goal-conditioned policy to maximize reward, they also minimize the mutual information between the goal $G$ and the action $A_t$ at each time $t$. This sets up a tradeoff between compression (actions should store as few bits as possible about the goal) and task success (actions should be useful for getting high reward), in a spirit similar to the information bottleneck (IB) principle as applied to supervised learning (Alemi et al., 2016; Tishby et al., 1999). This tradeoff allows for identifying decision states as those where the agent’s beliefs entail a high mutual information despite the bottleneck. Goyal et al. (2019) demonstrate that identifying and rewarding an agent for visiting such decision states can help solve novel tasks in novel environments.

Goal-independent Decision States. Our key hypothesis is that such decision states exist in an environment even in the absence of external rewards/goals. Thus, we propose to identify them by maximizing empowerment (Salge et al., 2013) while employing an information bottleneck penalty. Concretely, an empowered agent (Gregor et al., 2016) attempts to visit a diverse set of states in the environment while ensuring it reaches them in a reliable manner. We use the framework of Gregor et al. (2016), which learns macro-actions or options to maximize empowerment by learning a policy $\pi(a_t \mid s_t, \Omega)$ to maximize $I(\Omega; S_f)$, where $\Omega$ is the option and $S_f$ is the final state in a trajectory $\tau$. To see why this maximizes empowerment, notice that $I(\Omega; S_f) = H(S_f) - H(S_f \mid \Omega)$, where $H$ denotes entropy; thus, we would like to maximize the diversity in the final states reached by the agent while making each option highly predictive of $S_f$ (Salge et al., 2013). This empowers the agent to pick options that give it control over elements in the environment.
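The decomposition of the option/final-state mutual information into a diversity term and a reliability term can be checked numerically on a toy example. The sketch below is illustrative only: the tabular distributions and the function names (`entropy`, `option_final_state_mi`) are assumptions made for this example and are not part of the method described here.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def option_final_state_mi(p_omega, p_sf_given_omega):
    """I(Omega; S_f) = H(S_f) - H(S_f | Omega) for tabular distributions.

    p_omega[i]             : prior probability of option i
    p_sf_given_omega[i][j] : probability that option i ends in final state j
    """
    n_options = len(p_omega)
    n_states = len(p_sf_given_omega[0])
    # Marginal over final states: p(s_f) = sum_omega p(omega) p(s_f | omega)
    p_sf = [sum(p_omega[i] * p_sf_given_omega[i][j] for i in range(n_options))
            for j in range(n_states)]
    h_sf = entropy(p_sf)  # diversity of final states
    # Expected uncertainty about the final state once the option is known
    h_sf_given_omega = sum(p_omega[i] * entropy(p_sf_given_omega[i])
                           for i in range(n_options))
    return h_sf - h_sf_given_omega

# Two options that reliably reach two different final states.
reliable = option_final_state_mi([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
# Two options that both end uniformly at random.
unreliable = option_final_state_mi([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
```

In this toy case the reliable options attain the maximum of log 2 nats, while the unreliable options carry no information about where the agent ends up.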
We augment this options framework with an information bottleneck, i.e., we minimize the mutual information between the option and every state in a trajectory. Our hypothesis is that such a bottleneck should allow decision states to emerge in an empowered agent without any extrinsic reward. Analogous to Goyal et al. (2019), we can then identify decision states as those with high mutual information despite the bottleneck. For example, in Fig. 1 the agent at the middle-left decision state would need to access the option $\Omega$ to decide which of the two indicated sets of states it would like to go to. We call our approach Decision State VIC (DSVIC).
Contributions: We present an approach for an agent to discover decision states in an environment in an entirely ‘unsupervised’ or task-agnostic manner. We find that our approach, driven by an empowerment objective, identifies decision states that match our intuitive assessments on a variety of environments with different levels of difficulty in a partially observable setting. Further, we find that our decision states are useful for guiding exploration for downstream tasks of interest in novel environments. This is similar to the results obtained by Goyal et al. (2019) for identifying decision states in reward/goal-driven settings, but stronger in the sense that our decision states were never identified in the context of any goal, and are thus more general. Interestingly, on a challenging grid world navigation task we find evidence that our method outperforms the transfer results from (our implementation of) Goyal et al. (2019). Our broader contribution to the RL literature is that our work provides an unsupervised analogue of Goyal et al. (2019), in the same way that $\beta$-VAE (Higgins et al., 2017) is an unsupervised formulation of the information bottleneck (Tishby et al., 1999; Alemi et al., 2016) in traditional supervised learning.
2 Methods
2.1 Discovery of Decision States
We first describe the objective from Gregor et al. (2016) (VIC), which optimizes for intrinsic control using options. We then discuss how to blend an information bottleneck with this approach to discover decision states implicit in the choices of options.
Variational Intrinsic Control (VIC). VIC considers a finite set of options presented to the agent via a discrete random variable $\Omega$. The option is sampled once for an episode/trajectory and the agent acts according to the policy $\pi(a_t \mid s_t, \Omega)$ (see Fig. 2). The key idea is to maximize the mutual information (Cover and Thomas, 2006) between an option $\Omega$ and the final state $S_f$ in a trajectory given an initial state $s_0$, i.e. $I(\Omega; S_f \mid S_0 = s_0)$. As discussed in section 1, this maximizes the empowerment of an agent, meaning its internal options correspond to reliable states of the world that it can reach. Gregor et al. (2016) formulate a lower bound on this mutual information. Specifically, let $p(\omega)$ be a prior on options (assumed to be the same for all $s_0$), let $p^J(s_f \mid \omega, s_0)$ denote the (unknown) terminal state distribution achieved when executing the policy $\pi(a_t \mid s_t, \omega)$, and let $q_\nu(\omega \mid s_f, s_0)$ denote a (parameterized) variational approximation to the true posterior on options given a final state $s_f$ and initial state $s_0$. The VIC lower bound is then given by:

$$I(\Omega; S_f \mid S_0 = s_0) \;\geq\; \mathbb{E}_{\Omega \sim p(\omega),\; S_f \sim p^J(s_f \mid \Omega, s_0)} \left[ \log \frac{q_\nu(\Omega \mid S_f, s_0)}{p(\Omega)} \right] \;=:\; J^{VIC}(\Omega, S_f; s_0) \tag{1}$$
Note that eq. 1 (and the rest of this paper) writes the expectation for a given spawn location $S_0 = s_0$, but there is also an underlying spawn location distribution $p(s_0)$ (typically uniform). In practice, optimization proceeds by taking the expectation of eq. 1 over $p(s_0)$. Next, we describe the information bottleneck regularizer, which we employ in addition to the VIC objective.
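As a rough illustration of how the VIC bound is estimated from samples, the sketch below computes the per-trajectory quantity log q(ω | s_f, s_0) − log p(ω) and averages it over trajectories. It assumes a uniform prior over options and represents the option inference network only by its output logits; the function names are hypothetical and chosen for this example.

```python
import math

def vic_intrinsic_reward(q_logits, option, n_options):
    """Single-sample estimate of log q(omega | s_f, s_0) - log p(omega).

    q_logits : unnormalized scores over options, as produced by some
               (hypothetical) option inference network at the final state.
    option   : index of the option that was actually executed.
    Assumes a uniform prior p(omega) = 1 / n_options.
    """
    # Numerically stable log-softmax over the option logits.
    m = max(q_logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in q_logits))
    log_q = q_logits[option] - log_norm
    log_p = -math.log(n_options)
    return log_q - log_p

def vic_lower_bound(samples, n_options):
    """Monte Carlo estimate of the bound: average reward over trajectories.

    samples : list of (q_logits, option) pairs, one per sampled trajectory.
    """
    rewards = [vic_intrinsic_reward(lg, om, n_options) for lg, om in samples]
    return sum(rewards) / len(rewards)

# Uninformative posterior (all logits equal): the reward is zero.
r_flat = vic_intrinsic_reward([0.0, 0.0, 0.0, 0.0], option=2, n_options=4)
# Confident and correct posterior: the reward approaches log(n_options).
r_sharp = vic_intrinsic_reward([0.0, 0.0, 20.0, 0.0], option=2, n_options=4)
```

The estimate is bounded above by the log of the number of options, which is reached only when the final state identifies the option unambiguously.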
Policy Parameterization. Our model uses a particular parameterization for the agent’s option-conditioned policy $\pi(a_t \mid s_t, \omega)$, shown in Fig. 2 (right). At every timestep $t$ in the episode, given the current state ($s_t$) and the (global) option ($\omega$), we compute an intermediate representation $z_t$. This intermediate representation is used along with the state $s_t$ to choose an action $a_t$ to perform at time $t$.
Information Bottleneck Regularizer. Analogous to Goyal et al. (2019), we impose an information bottleneck penalty encouraging actions to have low mutual information with the options, i.e. low $I(\Omega; A_t \mid S_t, S_0 = s_0)$. This is a way of providing an inductive bias to the model that there are states with ‘default’ behavior (where one need not reason about the chosen options). Further, this allows us to identify decision states as states where the mutual information is high (despite the constraint). Combining this with the VIC objective in eq. 1 yields the following objective:

$$\max \; J^{VIC}(\Omega, S_f; s_0) \;-\; \beta \sum_t I(\Omega; A_t \mid S_t, S_0 = s_0) \tag{2}$$

where $\beta > 0$ controls the strength of the regularization. Intuitively, this says that one wants options which allow the agent to have a degree of control over the environment, while ensuring that we recover options that only need to be invoked at a sparse set of states. The first term above is optimized as in Gregor et al. (2016). We next describe how we optimize the second term.
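Bounds for $I(\Omega; A_t \mid S_t, S_0 = s_0)$. The mutual information $I(\Omega; A_t \mid S_t, S_0 = s_0)$, for a given start state $s_0$, is:

$$I(\Omega; A_t \mid S_t, S_0 = s_0) = \mathbb{E}_{\Omega \sim p(\omega),\; S_t \sim p^J(s_t \mid \Omega, s_0),\; A_t \sim \pi(a_t \mid S_t, \Omega)} \left[ \log \frac{\pi(A_t \mid S_t, \Omega)}{\pi_0(A_t \mid S_t)} \right] \tag{3}$$

where $\pi_0(a_t \mid s_t) = \sum_\omega p(\omega \mid s_t, s_0)\, \pi(a_t \mid s_t, \omega)$ is the option-marginalized action distribution. However, in practice, this objective is hard to optimize since we do not have access to $p^J(s_t \mid \omega, s_0)$ for computing eq. 3 in closed form. Thus, similar to VIC, one needs to optimize a Monte Carlo estimate. However, when computing the Monte Carlo estimate, the denominator $\pi_0(a_t \mid s_t)$ becomes problematic to compute, since it is difficult to have good estimates of $p(\omega \mid s_t, s_0)$ for arbitrary $s_t$, for example when a particular option has so far never visited a particular state during on-policy execution of VIC.

We thus resort to variational bounds on the mutual information (Barber and Agakov, 2004). In a related but slightly different context of goal-conditioned policies, previous works (Goyal et al., 2019; Galashov et al., 2018) discuss how to construct such bounds either using the action distribution (Galashov et al., 2018) or using an internal stochastic representation space $Z_t$ (InfoBot; Goyal et al., 2019). In order to be as comparable as possible to InfoBot, we regularize $I(\Omega; Z_t \mid S_t, S_0 = s_0)$, which is upper-bounded as follows and therefore yields a lower bound on the regularized objective (see Sec. A.1 in the appendix for a detailed derivation):

$$I(\Omega; Z_t \mid S_t, S_0 = s_0) \;\leq\; \mathbb{E}_{\Omega \sim p(\omega),\; S_t \sim p^J(s_t \mid \Omega, s_0),\; Z_t \sim p_\phi(z_t \mid S_t, \Omega)} \left[ \log \frac{p_\phi(Z_t \mid S_t, \Omega)}{q(Z_t)} \right] \tag{4}$$

where $q(z_t)$ is a fixed variational approximation (set to $\mathcal{N}(0, I)$ as in InfoBot), and $p_\phi(z_t \mid s_t, \omega)$ is a parameterized encoder. Eqn. (4) blends nicely with the VIC objective since we can compute a Monte Carlo estimate of eq. 4 by first sampling an option $\omega$ at $s_0$ and then keeping track of all the states visited in the trajectory, in addition to the last state required by VIC.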
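When the fixed variational approximation is a standard normal, the per-state bottleneck term has a closed form if the encoder is a diagonal Gaussian. The sketch below illustrates that computation under this assumption; the diagonal-Gaussian encoder, the `(mu, log_var)` representation and the function names are choices made for this example, not details confirmed by the text above.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def trajectory_bottleneck_penalty(encoder_outputs, beta):
    """beta * sum_t KL_t for one trajectory.

    encoder_outputs : list of (mu, log_var) pairs, one per visited state,
                      as produced by a hypothetical encoder p(z_t | s_t, omega).
    """
    return beta * sum(kl_diag_gaussian_to_standard_normal(mu, lv)
                      for mu, lv in encoder_outputs)

# An encoder output equal to the prior costs nothing ('default' behavior).
kl_default = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean by one unit in a single dimension costs 0.5 nats.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])
penalty = trajectory_bottleneck_penalty(
    [([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])], beta=0.1)
```

States where this KL term stays large after training are the ones where the latent must carry option information, which is how decision states are read off in this kind of setup.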
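In addition to the intrinsic control term and the bottleneck regularizer, we also include the expected entropy of the policy over the marginal state distribution (maximum-entropy RL; Ziebart, 2010) as a bonus to encourage sufficient exploration of the state space. Putting it all together, the DSVIC objective is:

$$\max_{\theta, \phi, \nu} \; \mathbb{E}_{\Omega \sim p(\omega),\; \tau \sim \pi(\cdot \mid \Omega, s_0),\; Z_t \sim p_\phi(z_t \mid S_t, \Omega)} \left[ \log \frac{q_\nu(\Omega \mid S_f, s_0)}{p(\Omega)} \;-\; \beta \sum_t \log \frac{p_\phi(Z_t \mid S_t, \Omega)}{q(Z_t)} \;+\; \alpha \sum_t H\big(\pi_\theta(\cdot \mid S_t, Z_t)\big) \right] \tag{5}$$

where $\theta$, $\phi$ and $\nu$ are the parameters of the policy, the bottleneck variable encoder and the option inference network respectively, and $\alpha$ weights the entropy bonus. Note that we abuse the notation slightly in eq. 5 by writing $\tau \sim \pi(\cdot \mid \Omega, s_0)$ to denote a trajectory sampled by following the policy. This is a sequential process where $a_t \sim \pi_\theta(a_t \mid s_t, z_t)$ and $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ (state transition function). Intuitively, the first term in the objective ensures reliable control over states in the environment while learning options; the second term ensures minimality in using the sampled options; and the third provides an incentive for exploration. See algorithm 1 for more details.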
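The way the three terms trade off can be sketched for a single sampled trajectory as follows. This is a minimal illustration, not the training procedure itself: the weights `beta` and `alpha`, the argument names and the example numbers are all assumptions introduced here.

```python
import math

def dsvic_trajectory_objective(log_q_option, log_p_option, kl_per_step,
                               policy_entropy_per_step, beta, alpha):
    """Single-trajectory estimate of: control - beta * bottleneck + alpha * entropy.

    log_q_option            : log q(omega | s_f, s_0) from the option inference network
    log_p_option            : log p(omega) under the option prior
    kl_per_step             : per-timestep bottleneck cost for the latent z_t
    policy_entropy_per_step : per-timestep entropy of the action distribution
    beta, alpha             : hypothetical weights for bottleneck and entropy terms
    """
    control = log_q_option - log_p_option          # reliable control of final states
    bottleneck = beta * sum(kl_per_step)            # cost of consulting the option
    exploration = alpha * sum(policy_entropy_per_step)
    return control - bottleneck + exploration

# Illustrative numbers only (not taken from any experiment).
value = dsvic_trajectory_objective(
    log_q_option=-0.1, log_p_option=math.log(0.25),
    kl_per_step=[0.0, 0.5, 0.0], policy_entropy_per_step=[1.0, 1.0, 1.0],
    beta=0.1, alpha=0.01)
stronger_bottleneck = dsvic_trajectory_objective(
    log_q_option=-0.1, log_p_option=math.log(0.25),
    kl_per_step=[0.0, 0.5, 0.0], policy_entropy_per_step=[1.0, 1.0, 1.0],
    beta=1.0, alpha=0.01)
```

With everything else held fixed, a larger bottleneck weight lowers the objective for any trajectory that consults the option, which is the pressure toward default behavior described above.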
Handling Partial Observability. At a high level, our hypothesis is that decision states are a function of (1) the information available to an agent, and (2) the structure in the environment. For (1), we are primarily interested in partially observable MDPs, similar to InfoBot (Goyal et al., 2019). Let $o_t$ be the observation made by the agent at time $t$. We then set $s_t = f(o_1, \dots, o_t)$ in all the equations above, where $f$ is a recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997) over the observations. We found it important to parameterize the policy (fig. 2, right) as $\pi(a_t \mid o_t, z_t)$ with $z_t \sim p(z_t \mid s_t, \omega)$, i.e. action distributions are modeled as ‘reactive’ (conditioning only on the current observation) while $z_t$ distributions are ‘recurrent’ (conditioning on the entire history of observations via $s_t$). Intuitively, this is because the sequence of observations might be informative of the high-level option being followed, i.e. conditioning the actions on $s_t$ in addition to $z_t$ could ‘leak’ information about the underlying option to the actions, making the bottleneck on $z_t$ circumventable and thus ineffective. Finally, when maximizing the likelihood of the option given the final state in VIC (i.e. when training the option inference network), we use the global (x, y) coordinates of the final state of the agent. Note that the agent’s policy continues to be based only on observations $o_t$, and it is only at the end of the trajectory that we use the global coordinates to compute the intrinsic reward. Please refer to Sec. A.4 in the appendix for a discussion of the issues when training the model with partial observations in the option inference network.

Related Objectives. Objectives similar to Eqn. 5 have also been explored in prior work on unsupervised skill discovery for RL, although not in the context of decision states. Specifically, Eysenbach et al. (2018) (DIAYN) attempt to learn skills (options) which can control the states visited by agents while ensuring that states, not actions, are used to distinguish skills. A key difference is the introduction of $\beta$ as a hyperparameter which controls the capacity of the latent channel (Goyal et al., 2019) and the degree to which “default behavior” is encouraged during training, thereby allowing us to discover decision states as states where the information bottleneck is violated. In section 4, we compare against decision states extracted from Eysenbach et al. (2018) and find that our approach gives better precision on decision states.
3 Experimental Setup
In this section, we first describe how we utilize decision states for transfer to goal-driven tasks in different environments (Goyal et al., 2019). We then describe the environments we perform our experiments in and finally summarize our training algorithm.
3.1 Transfer to Goal-Driven Tasks
We now discuss how this unsupervised process of identifying decision states and learning reliable options allows us to adapt to goal-driven settings in different environments. Specifically, similar to Goyal et al. (2019), we study whether providing an exploration bonus which incentivizes the goal-conditioned policy to visit decision states leads to sample-efficient transfer.
Exploration-based Transfer. After optimizing Eqn. 5 to completion, similar to Goyal et al. (2019), we wish to understand whether the machinery used to identify decision states can be leveraged directly to aid transfer to goal-driven tasks in an environment. After the unsupervised training phase, we freeze the encoder $p_\phi(z_t \mid s_t, \omega)$ and use it to provide $I(\Omega; Z_t \mid S_t)$ as an exploration bonus, in addition to the environmental reward $R^e_t$, incentivizing the agent to visit decision states. We decay the exploration bonus by the visitation frequency of $S_t$ within an episode to ensure the agent does not solely pursue the exploration bonus. (Footnote 1: Visitation counts have a number of limitations, including the requirement of a discrete state space as well as a state table for maintaining visit counts in POMDPs. However, our exploration bonus need not be restricted by these assumptions, as one could alternatively use a distillation technique similar to Burda et al. (2018) where the incentive is to learn to distill our trained encoder network, alleviating the need for visitation counts altogether.) We hypothesize that if decision states correspond to structural regularities in an environment, visiting (a subset of) them should generally aid in goal-driven tasks in an environment. An underlying assumption here is that certain local physical structures are shared across the environments we train and evaluate on. We perform experiments to demonstrate transfer to different environments. To summarize, the overall reward the goal-conditioned policy is trained to maximize is:
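$$R_t = R^e_t + \frac{\kappa}{\sqrt{c(S_t)}}\; I(\Omega; Z_t \mid S_t) \tag{6}$$

where $R^e_t$ is the environmental reward, $c(S_t)$ is the within-episode visitation count of $S_t$, and $\kappa$ scales the exploration bonus.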
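A count-decayed exploration bonus of this kind can be sketched as follows. The class name, the `bonus_fn` callable standing in for the frozen encoder's per-state score, the coefficient `kappa`, and the specific 1/sqrt(count) decay are all assumptions made for illustration; the text above only states that the bonus is decayed by within-episode visitation frequency.

```python
import math
from collections import Counter

class DecisionStateBonus:
    """Adds a count-decayed exploration bonus to the environment reward."""

    def __init__(self, bonus_fn, kappa):
        self.bonus_fn = bonus_fn   # hypothetical: state -> bottleneck score from a frozen encoder
        self.kappa = kappa         # hypothetical scale of the exploration bonus
        self.counts = Counter()    # within-episode visitation counts

    def reset(self):
        """Call at the start of every episode."""
        self.counts.clear()

    def shaped_reward(self, state, env_reward):
        self.counts[state] += 1
        # Assumed decay: the bonus shrinks as 1 / sqrt(visit count).
        decay = 1.0 / math.sqrt(self.counts[state])
        return env_reward + self.kappa * decay * self.bonus_fn(state)

# Toy scores: a doorway is a decision state, a hallway cell is not.
scores = {"door": 2.0, "hall": 0.0}
shaper = DecisionStateBonus(bonus_fn=lambda s: scores[s], kappa=0.5)
first_visit = shaper.shaped_reward("door", env_reward=0.0)
second_visit = shaper.shaped_reward("door", env_reward=0.0)
hall_visit = shaper.shaped_reward("hall", env_reward=1.0)
```

Because the counter is reset per episode, revisiting a decision state in a later episode earns the full bonus again, while lingering at it within one episode yields diminishing returns.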
Environments. The environments we consider are partially observable gridworlds built on top of MiniGrid (Chevalier-Boisvert et al., 2018). We first consider a set of simple environments with relatively low complexity, 4Room and Maze (see Fig. 3), followed by the MultiRoomNXSY task domains (similar to Goyal et al. (2019)). The MultiRoomNXSY environment consists of X rooms of size Y, connected in random orientations across multiple runs. We summarize our key implementation details and the criterion used to determine convergence of the DSVIC optimization (Eqn. 5) in Sec. A.7 of the appendix. Algorithm 1 summarizes our approach.
4 Results
Discovered Decision States. We first present the decision states obtained from our unsupervised discovery phase on the 4Room and Maze environments (see Fig. 3). Intuitively, we expect decision states to correspond to structural regularities in the environment. Experimentally, we find this intuition validated, as we notice more discernible decision states in the maze environment (with more structure) than in the four rooms environment. In the maze, features such as intersections emerge as decision states, whereas the four rooms environment shows a mixture of corners and entries to corridors as decision states.
Next, when we sweep values of $\beta$, we find that the nature of the decision states we discover changes dramatically. At higher values of $\beta$, the bottleneck in eq. 5 is enforced more and thus, optimization finds solutions where more states (trivially) become decision states (fig. 3, columns 3 and 4), which means that default behavior is rarely discovered and practiced.
We also compare the decision states obtained from our objective with an analog of DIAYN (Eysenbach et al., 2018) (fig. 3, 4th column), an approach for intrinsic control that includes a bottleneck term in its objective but is not designed explicitly for discovering decision states (see Sec. A.5 of the appendix for more details). We identify decision states using DIAYN as states where default behavior is violated, i.e. where the mutual information between option and action, $I(\Omega; A_t \mid S_t)$, is high. Notice how our approach gives better precision on decision states than the DIAYN approach. (Footnote 2: Note that one cannot run DIAYN with other values of $\beta$ since their algorithm assumes $\beta = 1$.)
Transfer to Goal-Driven Tasks. We study how an exploration bonus (eq. 6) utilizing the encoder used to identify decision states can aid in transfer to goal-driven tasks in different environments (see Sec. 3.1). We restrict ourselves to the point-navigation task for transfer in the MultiRoomNXSY set of environments. We perform two sets of experiments: (1) train on MultiRoomN2S6 and transfer to MultiRoomN3S4 and MultiRoomN5S4 (similar to Goyal et al. (2019)), and (2) train on MultiRoomN2S10 and transfer to MultiRoomN6S25, which is a more challenging version of the transfer task in InfoBot in the sense that it has larger rooms (25x25), making targeted exploration more crucial for finding the doors quickly. In setting (1) we report numbers from Goyal et al. (2019) (count-based, curiosity-driven (Pathak et al., 2017a), VIME (Houthooft et al., 2016)) as well as our implementation of InfoBot (Goyal et al., 2019). (Footnote 3: Code for InfoBot was not public. We report numbers both from our implementation and from Goyal et al. (2019).) Additionally, in the second setting we provide appropriate comparisons against our implementation of InfoBot, DIAYN, and ablations of our model (see table 1, Fig. 4 and Fig. 5).
The first half of table 1 reports results from InfoBot, whereas the second half reports our results reproduced on the environments from InfoBot.
We report mean success with standard error of the mean on goal-driven transfer across 10 random seeds, with the value of the exploration-bonus coefficient (from Eqn. 6) selected based on the best success on MultiroomN5S3 across a set of values (see Sec. A.7 in the appendix for details). For table 1, similar to Goyal et al. (2019), success is defined as the percentage of times the agent reaches the goal across a fixed set of 128 evaluation environments. We first observe that nearly all of the exploration incentives achieve success rate for the MultiRoomN2S6 to MultiRoomN3S4 and MultiRoomN5S4 transfer setups in our experiments (bottom half). (Footnote 4: Our implementation of count-based exploration and InfoBot outperforms InfoBot (Goyal et al., 2019).) In the harder setup (rightmost column), we observe that our DSVIC exploration bonus (reported for two different values of $\beta$) consistently outperforms everything else by a significant margin. Interestingly, among the baselines, we observe that the count-based exploration incentive outperforms InfoBot (although with overlapping standard error of the mean). The DIAYN-based exploration bonus, which operates directly in the action space, performs poorly. Additionally, we observe that the value of $\beta$ in the bottleneck is crucial for transfer, with a gradual decay in performance as $\beta$ is decreased (log-scale), as shown in Fig. 4. We also compare to a random network baseline in Fig. 4, where a randomly initialized DSVIC encoder is used, with the mean and variance of the exploration bonus adjusted to match those of the best trained DSVIC network. Such a random network baseline controls for the effect of scale and state-dependent variance of our DSVIC encoder on the performance of supervised transfer.
Method  MultiRoomN3S4  MultiRoomN5S4  MultiRoomN6S25
(Encoder pretrained on)  MultiRoomN2S6  MultiRoomN2S6  MultiRoomN2S10
Goal-conditioned A2C  
TRPO + VIME  N/A  
Curiosity-based Exploration  N/A  
Count-based Exploration (Goyal et al., 2019)  N/A  
InfoBot (Goyal et al., 2019)  N/A  
InfoBot (Our Implementation)  
Count-based Exploration (Our Implementation)  %  
DIAYN-based Exploration Bonus ()  
DSVIC ()  
DSVIC () 
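The summary statistic reported per method (mean success with standard error of the mean across random seeds) can be computed as in the sketch below. The per-seed numbers are placeholders invented purely to exercise the function and are not results from this work; the function name is likewise hypothetical.

```python
import math

def mean_and_sem(values):
    """Mean and standard error of the mean (using the sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

# Placeholder per-seed success rates (fraction of evaluation environments solved).
per_seed_success = [0.8, 0.9, 1.0, 0.9]
mean, sem = mean_and_sem(per_seed_success)
```

Two methods whose mean ± SEM intervals overlap, as noted for count-based exploration and InfoBot, should not be read as clearly separated.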
To further understand the behavior of the exploration bonus in Eqn. 6 compared to other exploration incentives, in addition to success rate and sample efficiency (inferred from the plot of average task return across timesteps in Fig. 5 (a)), we also consider another metric devised specifically for the MultiRoomNXSY environments. The MultiRoomNXSY task domain consists of $X$ rooms which are connected in a sequential manner with respect to the agent spawn and the goal location: the agent is always spawned in room 1 and has to traverse all the rooms in the environment sequentially to reach the goal in room $X$. Therefore, to maximize returns, it is critical for an exploration incentive to encourage the agent to visit the next room in the sequence. Fig. 5 (b) shows the number of unique rooms reached so far across episodes and random seeds, while Fig. 5 (c) shows the running best of the farthest room reached across episodes and random seeds. The rate at which the number of unique rooms visited rises over training is indicative of how targeted the exploration incentive is.
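Both room-based curves can be derived from a log of which rooms were entered in each episode. The sketch below assumes rooms are indexed 1, 2, ... in the order they must be traversed and that every episode visits at least the spawn room; the function name and the input format are assumptions for this example.

```python
def room_curves(rooms_per_episode):
    """Return (unique rooms reached so far, running best of farthest room).

    rooms_per_episode : list of non-empty sets of room indices, one set per
                        episode, with rooms numbered in traversal order.
    """
    seen = set()
    best = 0
    unique_curve, farthest_curve = [], []
    for rooms in rooms_per_episode:
        seen.update(rooms)                 # accumulate rooms across episodes
        best = max(best, max(rooms))       # farthest room reached so far
        unique_curve.append(len(seen))
        farthest_curve.append(best)
    return unique_curve, farthest_curve

# Toy log: the agent gets one room deeper in episodes 2 and 4.
unique_curve, farthest_curve = room_curves([{1}, {1, 2}, {1}, {1, 2, 3}])
```

A curve that rises within few episodes indicates that the incentive pushes the agent toward the next room rather than exhaustively covering the current one.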
We observe from Fig. 5 (a) that our DSVIC exploration bonus is significantly more sample-efficient for transfer than the baselines. Fig. 5 (b, c) further demonstrate that our bonus leads to more targeted exploration, in contrast to the exhaustive exploration exhibited by the baselines.
To summarize, we make the following observations:

- The decision states obtained by optimizing Eqn. 5 to completion are interpretable and correspond to states with more degrees of freedom in terms of possible trajectories, which aligns with our hypothesis that decision states are crucial junctions where an agent needs to carefully make a decision to make progress towards a goal.
- The ability to control $\beta$ (in Eqn. 5) as a hyperparameter, which modulates the capacity of the latent channel, is critical for feasible decision states to emerge after the unsupervised training phase. Furthermore, interpretable decision states emerge only for a specific range of $\beta$ values, unlike prior work on discovering useful abstractions in an unsupervised manner (Eysenbach et al., 2018; DIAYN), where $\beta$ is naively set to 1.
5 Related Work
Intrinsic Control and Intrinsic Motivation. Learning how to explore without reward supervision is a foundational problem in Reinforcement Learning (Machado et al., 2017; Pathak et al., 2017b; Gregor et al., 2016; Strehl and Littman, 2008; Schmidhuber, 1990). Typical curiosity-driven approaches attempt to visit states that maximize the surprise of an agent in the environment (Pathak et al., 2017b) or the improvement in predictions from a dynamics model (Machado et al., 2017). While curiosity-driven approaches seek out and explore novel states in an environment, they typically do not measure how reliably one can reach them. In contrast, approaches for intrinsic control (Eysenbach et al., 2018; Achiam et al., 2018; Gregor et al., 2016) explore novel states while ensuring those are states that an agent can reliably reach via extrinsic-reward-free options frameworks: Gregor et al. (2016) maximize the number of final states that can be reliably reached by the policy, while Eysenbach et al. (2018) distinguish an option at every state along the trajectory, and Achiam et al. (2018) learn options for entire trajectories by encoding the sequence of states at regular time intervals. Since decision states are more tied to reliably acting in the environment and achieving goals than to simply visiting novel states (without any estimate of how reliably one can reach them), we formulate our regularizer in the intrinsic control framework, specifically building on top of Gregor et al. (2016).
Default Behavior and Decision States. Recent work in policy compression has focused on learning a default policy when training on a family of tasks, to be able to reuse behavior common to all tasks. In Galashov et al. (2018) and Teh et al. (2017), a default policy is learnt using a set of task-specific policies and in turn acts as a regularizer for each policy, while Goyal et al. (2019) learn a default policy using an information bottleneck on task information and a latent variable the policy is conditioned on. We devise a similar information bottleneck objective, but in a reward-free setting, that learns default behavior to be shared by all intrinsic options so as to reduce learning pressure on option-specific policies.
Bottleneck states in MDPs. There is a rich literature on the identification of bottleneck states in Markov decision processes. The core idea is to either identify all the states that are common to multiple goals in an environment (McGovern and Barto, 2001) or use a diffusion model built using an MDP’s transition matrix (Machado et al., 2017; Şimşek and Barto, 2004; Theocharous and Mahadevan, 2002). The key distinction between bottleneck states and decision states is that decision states are more closely tied to the information available to the agent and what it can act upon, whereas bottleneck states are more tied to the connectivity structure of an MDP, representing states which when visited allow access to a novel set of states (Goyal et al., 2019). Concretely, while a corner in a room need not be a bottleneck state, since visiting a corner does not “open up” a new set of states to an agent (the way a doorway would), it is still a useful state for a goal-driven agent with partial observation to visit (since it is a distinct landmark in the state space where decisions could be made meaningfully). Indeed, we found in our initial experiments that our intrinsic control objective would not give interpretable decision states for MDPs. To see this, consider being inside a corridor with an option that takes you to the left of the room. Even in the middle of the corridor, if you had a map, you could “decide” to go to the left of the room, meaning that the end of the corridor need not be a decision state. This further illustrates that decision states are much more tied to the agent (what it has seen) and the environment, as opposed to bottleneck states, which are more agent-independent, intrinsic properties of the environment.
Information Bottleneck in Machine Learning. Since the foundational work of Tishby et al. (1999) and Chechik et al. (2005), there has been a lot of interest in making use of ideas from the information bottleneck (IB) for various tasks such as clustering (Strouse and Schwab, 2017; Still et al., 2004), sparse coding (Chalk et al., 2016), classification using deep learning (Alemi et al., 2016; Achille and Soatto, 2016), cognitive science and language (Zaslavsky et al., 2018), and reinforcement learning (Goyal et al., 2019; Galashov et al., 2019; Strouse et al., 2018). In contrast to these works, we apply an information bottleneck to a reinforcement learning agent that must learn without explicit reward supervision to identify decision states in an environment.
6 Conclusion
We devise an unsupervised approach, DSVIC, to identify decision states in an environment. These decision states are junctions where the agent needs to make a decision (as opposed to following default behavior) and are a function of environment structure as well as an agent’s partial observation. Our results on multi-room and maze environments demonstrate that the learnt decision states are human-interpretable (e.g. they appear at ends of hallways, crossings in a maze, etc.) as well as transferable, i.e. they aid exploration on external-reward tasks in terms of better success rate and sample complexity.
References
 Achiam et al. (2018) Achiam, J., H. Edwards, D. Amodei, and P. Abbeel (2018). Variational option discovery algorithms. arXiv preprint arXiv:1807.10299.
 Achille and Soatto (2016) Achille, A. and S. Soatto (2016). Information dropout: learning optimal representations through noise.
 Alemi et al. (2016) Alemi, A. A., I. Fischer, J. V. Dillon, and K. Murphy (2016). Deep variational information bottleneck. In ICLR.
 Barber and Agakov (2004) Barber, D. and F. V. Agakov (2004). Information maximization in noisy channels : A variational approach. In S. Thrun, L. K. Saul, and B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, pp. 201–208. MIT Press.
 Burda et al. (2018) Burda, Y., H. Edwards, A. Storkey, and O. Klimov (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
 Chalk et al. (2016) Chalk, M., O. Marre, and G. Tkacik (2016). Relevant sparse codes with variational information bottleneck.
 Chechik et al. (2005) Chechik, G., A. Globerson, N. Tishby, and Y. Weiss (2005). Information bottleneck for Gaussian variables. J. of Machine Learning Research 6, 165–188.
 Chevalier-Boisvert et al. (2018) Chevalier-Boisvert, M., L. Willems, and S. Pal (2018). Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
 Cover and Thomas (2006) Cover, T. M. and J. A. Thomas (2006). Elements of Information Theory. John Wiley. 2nd edition.
 Eysenbach et al. (2018) Eysenbach, B., A. Gupta, J. Ibarz, and S. Levine (2018). Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.
 Galashov et al. (2018) Galashov, A., S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2018). Information asymmetry in KL-regularized RL.
 Galashov et al. (2019) Galashov, A., S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2019, May). Information asymmetry in KL-regularized RL.
 Goyal et al. (2019) Goyal, A., R. Islam, D. Strouse, Z. Ahmed, M. Botvinick, H. Larochelle, S. Levine, and Y. Bengio (2019). Infobot: Transfer and exploration via the information bottleneck. In ICLR.
 Gregor et al. (2016) Gregor, K., D. J. Rezende, and D. Wierstra (2016). Variational intrinsic control. arXiv preprint arXiv:1611.07507.
 Higgins et al. (2017) Higgins, I., L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), 1735–1780.
 Houthooft et al. (2016) Houthooft, R., X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117.
 Kool and Botvinick (2018) Kool, W. and M. Botvinick (2018, December). Mental labour. Nat Hum Behav 2(12), 899–908.
 Kostrikov (2018) Kostrikov, I. (2018). PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.
 Machado et al. (2017) Machado, M. C., M. G. Bellemare, and M. Bowling (2017). A laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning  Volume 70, ICML’17, Sydney, NSW, Australia, pp. 2295–2304. JMLR.org.
 McGovern and Barto (2001) McGovern, A. and A. G. Barto (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML.

 Pathak et al. (2017a) Pathak, D., P. Agrawal, A. A. Efros, and T. Darrell (2017a). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
 Pathak et al. (2017b) Pathak, D., P. Agrawal, A. A. Efros, and T. Darrell (2017b). Curiosity-driven exploration by self-supervised prediction. In ICML.
 Salge et al. (2013) Salge, C., C. Glackin, and D. Polani (2013, October). Empowerment – an introduction.
 Schmidhuber (1990) Schmidhuber, J. (1990). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, Cambridge, MA, USA, pp. 222–227. MIT Press.
 Şimşek and Barto (2004) Şimşek, Ö. and A. G. Barto (2004). Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 95–. ACM.
 Still et al. (2004) Still, S., W. Bialek, and L. Bottou (2004). Geometric clustering using the information bottleneck method. In S. Thrun, L. K. Saul, and B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, pp. 1165–1172. MIT Press.
 Strehl and Littman (2008) Strehl, A. L. and M. L. Littman (2008, December). An analysis of model-based interval estimation for Markov decision processes. J. Comput. System Sci. 74(8), 1309–1331.
 Strouse et al. (2018) Strouse, D. J., M. Kleiman-Weiner, J. Tenenbaum, M. Botvinick, and D. Schwab (2018, August). Learning to share and hide intentions using information regularization.
 Strouse and Schwab (2017) Strouse, D. J. and D. J. Schwab (2017, December). The information bottleneck and geometric clustering.
 Teh et al. (2017) Teh, Y., V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017). Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506.
 Theocharous and Mahadevan (2002) Theocharous, G. and S. Mahadevan (2002). Learning the hierarchical structure of spatial environments using multiresolution statistical models. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems.
 Tishby et al. (1999) Tishby, N., F. Pereira, and W. Bialek (1999). The information bottleneck method. In The 37th annual Allerton Conf. on Communication, Control, and Computing, pp. 368–377.
 Zaslavsky et al. (2018) Zaslavsky, N., C. Kemp, T. Regier, and N. Tishby (2018, July). Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. U. S. A. 115(31), 7937–7942.
 Ziebart (2010) Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy.
Appendix A Appendix
a.1 Upper bound on
We explain the steps to derive Eqn. (4) in the main paper, as the upper bound on . By the data processing inequality (Cover and Thomas, 2006) for the graphical model in Fig 2 So we will next derive an upper bound on .
We write , given a start state as:
(7) 
The key difference is that our objective uses options internal to the agent, whereas Goyal et al. (2019) use external goal specifications provided to the agent. Similar to VIC, here denotes the (unknown) state distribution at time from which we can draw samples when we execute a policy.
We then assume a variational approximation for , and using the fact that , we get the following upper bound (for our experiments, we fix the variational approximation to be a unit Gaussian; it could also be learned):
(8) 
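For reference, the standard variational argument behind a bound of this kind can be sketched as follows. The notation is assumed here for illustration only: $\Omega$ denotes the option, $S_t$ the state, $Z_t$ the bottleneck variable, and $q(z_t)$ the variational approximation to the marginal $p(z_t \mid s_t)$. Conditioning on the start state is omitted for brevity.

```latex
\begin{align*}
I(\Omega; Z_t \mid S_t)
  &= \mathbb{E}_{\omega, s_t}\,
     \mathbb{E}_{z_t \sim p(z_t \mid \omega, s_t)}
     \left[ \log \frac{p(z_t \mid \omega, s_t)}{p(z_t \mid s_t)} \right] \\
  &= \mathbb{E}_{\omega, s_t}\,
     \mathbb{E}_{z_t \sim p(z_t \mid \omega, s_t)}
     \left[ \log \frac{p(z_t \mid \omega, s_t)}{q(z_t)} \right]
     - \mathbb{E}_{s_t}\!\left[
       D_{\mathrm{KL}}\!\left( p(z_t \mid s_t) \,\|\, q(z_t) \right) \right] \\
  &\le \mathbb{E}_{\omega, s_t}\!\left[
       D_{\mathrm{KL}}\!\left( p(z_t \mid \omega, s_t) \,\|\, q(z_t) \right) \right]
\end{align*}
```

The inequality holds because the subtracted KL term is non-negative, so dropping it can only increase the right-hand side.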
a.2 Decision State Identification – On policy with Options.
We use Eqn. 8 to identify and visualize the decision states learned in an environment, augmented with random sampling of the start state . Thus, we compute our decision states in an on-policy manner. Mathematically, we can write this as:
(9) 
where is a random spawn location chosen uniformly from the set of states in the environment and is a random option chosen at each spawn location. Thus, for each state in the environment, we aggregate over all the trajectories that pass through it and compute the values of to identify and visualize decision states. In addition to being principled, this methodology also captures our intuition that decision states can be identified as states that are common to, or frequented across, multiple options. Our results in Section 4 show that such states indeed correspond to structural regularities in the environment.
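As a minimal sketch of this aggregation, the snippet below averages a per-step score over every visit to each state across a set of rollouts. It assumes that the score is the closed-form KL divergence between a diagonal Gaussian encoder output and a unit Gaussian prior, and that each rollout has already been logged as (state, mean, log-variance) tuples. The function names and data layout are hypothetical and not taken from our implementation.

```python
from collections import defaultdict

import numpy as np


def gaussian_kl_to_unit(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def aggregate_state_scores(trajectories):
    """Average the per-step KL over every visit to each state.

    `trajectories` is a list of rollouts; each rollout is a list of
    (state, mu, log_var) tuples, where `state` is a hashable grid cell
    (an assumed logging format).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rollout in trajectories:
        for state, mu, log_var in rollout:
            totals[state] += gaussian_kl_to_unit(mu, log_var)
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# Toy data: two rollouts from different spawn locations and options.
toy_trajectories = [
    [((0, 0), [0.0, 0.0], [0.0, 0.0]), ((0, 1), [1.0, 0.0], [0.0, 0.0])],
    [((0, 1), [3.0, 0.0], [0.0, 0.0])],
]
scores = aggregate_state_scores(toy_trajectories)
```

States with a high average score are those where the encoder departs most from the prior, which is how candidate decision states would be read off a heatmap under these assumptions.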
a.3 Identification of Decision States during Transfer
As mentioned in the main paper, we would like to compute to identify decision states, given a (potentially) novel environment and a novel goal-conditioned task, in order to provide the agent with an exploration bonus. Given a state at which we would like to compute , we can write:
(10) 
However, this cannot be computed in this form in a transfer task, since a goal-driven agent is not following on-policy actions for an option that would allow us to draw samples from above (in order to do a Monte Carlo estimate of the integral above). Thus instead, we propose to compute the mutual information as follows:
(11) 
Now, given a state , this requires us to draw samples from , which in general is intractable (since this requires us to know , which is not available in closed form). In order to compute the above equation, we make the assumption that . Breaking it down, this makes two assumptions: firstly, . This means that all options have equal probability of passing through a state at time , which is not true in general. The second assumption is the same as VIC, namely that . Instead of making this assumption, one could also fit a parameterized variational approximation to and train it in conjunction with VIC. However, we found our simple scheme to work well in practice, and thus avoid fitting an additional variational approximation.
a.4 Decision states from local observations
In the main paper, our results assumed access to a good SLAM routine that performs mapping and localization from partial observations . In general, as we discuss in the main paper, this is a reasonable assumption since our model is always trained on a single environment, so we can expect it to acquire a sense of the (x, y) coordinates in its internal representation / weights as training progresses. In this section, we detail some of the pathologies that we observed early in training when we took a naive approach to SLAM, estimating a function to directly regress to from partial observations (Algorithm 1).
We note that since our particular choice of environment is a discrete gridworld, several partial observations (especially for small agent view sizes such as the 3x3 window) look the same to the agent, and the agent tends to converge to the most trivial optimum for learning options: one in which each option learns to end at an easily obtainable partial observation. For instance, while training DS-VIC, one particular policy simply learned 4 options corresponding to the four cardinal directions the agent can face, each of which can be achieved by left or right turns without requiring any movement by the agent. Since the agent has a compass which tells it the direction it is facing, it can ignore the partial observation and use the direction vector alone to predict which of the 4 options was sampled.
a.5 DIAYN comparison
We provide more details on how we compare to DIAYN in our framework. Specifically, setting and maximizing instead of renders Eqn. 2 (main paper) similar to DIAYN. Additionally, similar to DIAYN, we restrict to be deterministic – removing the physical bottleneck, and compute values of in the policy space to identify decision states.
a.6 Network Architecture
We use a 3-layer convolutional neural network with kernels of size 3x3, 2x2 and 2x2 in the 3 layers respectively to process the agent’s egocentric observation. We use ReLU as the nonlinear activation function after each convolutional layer. The output of the CNN is then concatenated with the agent’s direction vector (compass) and the option (or goal encoding). The concatenated features are then passed through a linear layer with hidden size 64 to produce the final features used by the option encoder head and the policy head . We use the (x, y) coordinates of the final state (embedded through a linear layer of hidden size 64) to regress to the option via . Furthermore, our parameterized policy is a reactive one and the encoder is recurrent over the sequence of states encountered in the episode. The bottleneck random variable is sampled from the parameterized Gaussian and is made a differentiable stochastic node using the reparameterization trick for Gaussian random variables.
a.7 Implementation Details
As we deal with partially observable settings in all our experiments, the agent receives an egocentric view of its surroundings as input, encoded as an occupancy grid where the channel dimension specifies whether the agent or an obstacle is present at a location. We set the coefficient in Eqn. 5 to be for all our experiments based on sweeps conducted across multiple values. In practice, we found it difficult to optimize our unsupervised objective in the absence of an entropy bonus: the parameterized policy collapses to a deterministic one, no options are learned, and the state space is explored inefficiently. Moreover, we found it hard to obtain reasonable results by optimizing both terms in the objective from scratch. We therefore optimize the intrinsic objective alone for episodes (i.e., we set ), after which we turn on the bottleneck term and let grow linearly for another episodes to get feasible outcomes at convergence. We experiment with a vocabulary of 2, 4 and 32 options (imitating undercomplete and overcomplete settings) for all the environments. For all the exploration incentives presented in Table 1, we first picked a value of in Eqn. (6) (which decides how to weigh the bonus with respect to the reward) from based on the best sample complexity on goal-driven transfer. For our implementation of InfoBot (Goyal et al., 2019), we chose and based on a narrower search in the cross product of the set of values above and the set of values which Goyal et al. (2019) reported. The chosen value was held fixed for all transfer experiments for the corresponding exploration incentives (rows in Table 1).
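The two-phase schedule for the bottleneck coefficient described above can be sketched as a small function. The episode counts and the final coefficient are passed in as arguments and the values in the usage line are purely illustrative. Holding the coefficient constant once the linear ramp ends is also an assumption of this sketch.

```python
def bottleneck_coefficient(episode, warmup_episodes, ramp_episodes, beta_max):
    """Coefficient on the bottleneck term at a given (0-indexed) episode.

    Phase 1: zero for the first `warmup_episodes` episodes, so only the
             intrinsic objective is optimized.
    Phase 2: linear growth from zero to `beta_max` over `ramp_episodes`.
    Afterwards the coefficient is held at `beta_max` (an assumption).
    All three settings are hypothetical placeholders.
    """
    if warmup_episodes < 0 or ramp_episodes <= 0:
        raise ValueError("warmup_episodes must be >= 0 and ramp_episodes > 0")
    if episode < warmup_episodes:
        return 0.0
    progress = (episode - warmup_episodes) / ramp_episodes
    return beta_max * min(progress, 1.0)


# Illustrative usage: 100 warmup episodes, then a 50-episode ramp.
schedule = [bottleneck_coefficient(e, 100, 50, 0.01) for e in (0, 100, 125, 150, 1000)]
```

Keeping the schedule as a pure function of the episode index makes it easy to log alongside the training curves and to reproduce across runs.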
We use Advantage Actor-Critic (A2C) (open-sourced implementation by Kostrikov (2018)) for all our experiments. We use RMSprop as the optimizer for all our experiments. Since we were unable to find code to reproduce Goyal et al. (2019), we implemented InfoBot ourselves while making sure that we are consistent with the architectural and hyperparameter choices adopted by InfoBot (as per the A2C implementation by Chevalier-Boisvert et al. (2018)). The only difference between the A2C implementation we use (by Kostrikov (2018)) and that of Chevalier-Boisvert et al. (2018) is in how updates are performed: the former does Monte Carlo updates while the latter does temporal difference updates.
Convergence Criterion. In practice, we observe that it is hard to learn a large number of options using the unsupervised objective. From the entire option vocabulary, the agent learns only a small number of options discriminatively (as identified by the option termination state), with the rest collapsing into the same set of final states. To pick a suitable checkpoint for the transfer experiments, we pick the ones which have learned the maximum number of options reliably, as measured by the likelihood of the correct option from the final state .
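One way to make the convergence criterion concrete is sketched below. It assumes that, for each checkpoint, we have logged the probability the option decoder assigns to the correct option at the final state of each evaluation episode, grouped by option. The reliability threshold, the function names and the data layout are hypothetical choices for illustration.

```python
import numpy as np


def count_reliable_options(correct_option_probs, threshold=0.9):
    """Number of options whose mean decoder likelihood exceeds `threshold`.

    `correct_option_probs[k]` holds, for option k, the probabilities that the
    decoder assigned to option k given the final states of episodes that
    actually ran option k. The threshold is a hypothetical setting.
    """
    return sum(
        1
        for probs in correct_option_probs
        if len(probs) > 0 and float(np.mean(probs)) > threshold
    )


def pick_checkpoint(checkpoints, threshold=0.9):
    """Return the name of the checkpoint with the most reliable options.

    `checkpoints` maps a checkpoint name to its per-option probabilities.
    Ties go to the earliest entry in the mapping.
    """
    return max(
        checkpoints,
        key=lambda name: count_reliable_options(checkpoints[name], threshold),
    )


# Toy example with a vocabulary of 3 options and two checkpoints.
toy_checkpoints = {
    "ep_1000": [[0.95, 0.97], [0.5, 0.4], [0.3]],
    "ep_2000": [[0.99], [0.92, 0.94], [0.2]],
}
best = pick_checkpoint(toy_checkpoints)
```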
Option Curriculum. It is standard to have a fixed-size discrete option space with a uniform prior (Gregor et al., 2016). However, learning a meaningful option space with a larger option vocabulary size has been reported to be difficult (Achiam et al., 2018). We adopt the curriculum-based approach proposed in Achiam et al. (2018), where the vocabulary size is gradually increased as the option decoder becomes more confident in mapping the final state back to the corresponding option sampled at the beginning of the episode. More concretely, whenever
(threshold chosen via hyperparameter tuning), the option vocabulary size increases according to
For our experiments, we start from and terminate the curriculum when .
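The control flow of such a curriculum can be sketched as follows. The confidence threshold, the growth rule, and the starting and maximum vocabulary sizes are left as caller-supplied parameters. The doubling rule used as the default, and the use of a greater-or-equal comparison against the threshold, are illustrative placeholders rather than the rule used in our experiments.

```python
def next_vocab_size(current_size, decoder_confidence, threshold, max_size,
                    grow=lambda k: 2 * k):
    """Return the option vocabulary size to use for the next training phase.

    The vocabulary grows only when the decoder's confidence in recovering
    the sampled option reaches `threshold`, and never exceeds `max_size`.
    `grow` is a hypothetical growth rule (doubling by default).
    """
    if decoder_confidence < threshold:
        return current_size
    return min(max_size, grow(current_size))


# Illustrative run: confidence values as they might be logged over training.
size = 2
history = [size]
for confidence in (0.5, 0.9, 0.85, 0.6, 0.95, 0.99, 0.99):
    size = next_vocab_size(size, confidence, threshold=0.8, max_size=32)
    history.append(size)
```

Passing the growth rule in as a function keeps the thresholding logic separate from the specific update used, so alternative rules can be swapped in without touching the rest of the loop.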
We run each of our experiments on a single GTX TitanX GPU, and use no more than 1 GB of GPU memory per experimental setting.