Unsupervised Discovery of Decision States for Transfer in Reinforcement Learning

We present a hierarchical reinforcement learning (HRL) or options framework for identifying decision states. Informally speaking, these are states considered important by the agent's policy, e.g., for navigation, decision states would be crossroads or doors where an agent needs to make strategic decisions. While previous work (most notably Goyal et al., 2019) discovers decision states in a task/goal-specific (or 'supervised') manner, we do so in a goal-independent (or 'unsupervised') manner, i.e. entirely without any goal or extrinsic rewards. Our approach combines two hitherto disparate ideas: 1) intrinsic control (Gregor et al., 2016; Eysenbach et al., 2018): learning a set of options that allow an agent to reliably reach a diverse set of states, and 2) the information bottleneck (Tishby et al., 2000): penalizing the mutual information between the option Ω and the states s_t visited in the trajectory. The former encourages an agent to reliably explore the environment; the latter allows identification of decision states as the ones with high mutual information I(Ω; a_t | s_t) despite the bottleneck. Our results demonstrate that 1) our model learns interpretable decision states in an unsupervised manner, and 2) these learned decision states transfer to goal-driven tasks in new environments, effectively guide exploration, and improve performance.



1 Introduction

Decision-making in humans has been shown to involve a trade-off between achieving goals and being economical about cognitive effort (Kool and Botvinick, 2018). For instance, consider walking from one's bedroom to the kitchen – by exiting through the bedroom door, turning left, walking down the hallway, and turning right. An effort-optimized (or lazy) behavior is to pay attention and make goal-directed decisions at a small number of decision states (e.g. the door, the end of the hallway), while falling back to 'default' or goal-independent behavior at all other states (e.g. continuing to walk along a hallway while in it). Essentially, the high-level goal of "walking to the kitchen" is punctuated by certain decision states, where one switches from one mode of behavior to another. Identifying these decision states in an environment (say in the context of a navigation task) has the potential to enable better transfer to novel environments and tasks, and a better understanding of the structure of the environment.

Recent work in this domain, most notably Goyal et al. (2019), discovers decision states in a task/goal-specific (or 'supervised') manner. In contrast, we aim to do so in a goal-independent (or 'unsupervised') manner with agents that explore the environment to understand what they can control. Our key hypothesis is that decision states (e.g. ends of hallways, crossings in a maze, corners of a room) are a function of structural regularities in the environment and of what the agent has seen, and not (just) of the downstream task/goal. Thus, we propose to identify decision states by merely interacting with the environment, entirely without any goal or extrinsic rewards. Fig. 1 illustrates this idea.

Figure 1: The VIC framework (Gregor et al., 2016) in a navigation context: an agent learns high-level macro-actions (or options) to reliably reach different states in an environment (left) without any extrinsic reward. Our proposal (right) identifies decision states (in red) associated with options, where the agent is empowered to make informed decisions. The rest of the states in a trajectory (dotted lines) then show default, option-independent behavior. Identification of decision states leads to improved transfer to novel environments and tasks.

Goal-dependent Decision States. More formally, Goyal et al. (2019) discover decision states in partially observed Markov Decision Processes (POMDPs) with an agent operating under a goal g. In addition to training a goal-conditioned policy to maximize reward, they also minimize the mutual information I(g; a_t | s_t) between the goal g and the action a_t at each time t. This sets up a trade-off between compression (actions should store as few bits as possible about the goal) and task success (actions should be useful for getting high reward), in a spirit similar to the information bottleneck (IB) principle as applied to supervised learning (Alemi et al., 2016; Tishby et al., 1999). This trade-off allows for identifying decision states as those where the agent's beliefs entail a high mutual information despite the bottleneck. Goyal et al. (2019) demonstrate that identifying and rewarding an agent for visiting such decision states can help solve novel tasks in novel environments.

Goal-independent Decision States. Our key hypothesis is that such decision states exist in an environment even in the absence of external rewards/goals. Thus, we propose to identify them by maximizing empowerment (Salge et al., 2013) while employing an information bottleneck penalty. Concretely, an empowered agent (Gregor et al., 2016) attempts to visit a diverse set of states in the environment while ensuring it reaches them in a reliable manner. We use Gregor et al. (2016)'s framework, which learns macro-actions or options Ω to maximize empowerment – by learning a policy π(a_t | s_t, Ω) to maximize I(Ω; s_f | s_0), where s_f is the final state in a trajectory starting at s_0. To see why this maximizes empowerment, notice that I(Ω; s_f | s_0) = H(s_f | s_0) − H(s_f | Ω, s_0), where H(·) denotes entropy; thus, we would like to maximize the diversity in the final states reached by the agent while making each option highly predictive of s_f (Salge et al., 2013). This empowers the agent to pick options that enable it to have control over elements in the environment.
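To make the empowerment quantity concrete, the following sketch (our illustration, not from the paper) computes I(Ω; s_f) from a toy joint distribution over options and final states, contrasting a reliable option set with an uninformative one:

```python
from math import log

def entropy(dist):
    """Shannon entropy (nats) of a probability list; zero entries are skipped."""
    return -sum(p * log(p) for p in dist if p > 0)

def mutual_information(joint):
    """I(Omega; s_f) from a joint table joint[omega][s_f]."""
    p_omega = [sum(row) for row in joint]
    p_sf = [sum(col) for col in zip(*joint)]
    flat = [p for row in joint for p in row]
    # I(Omega; s_f) = H(Omega) + H(s_f) - H(Omega, s_f)
    return entropy(p_omega) + entropy(p_sf) - entropy(flat)

# Two options with a uniform prior p(Omega) = 1/2, two final states.
reliable = [[0.5, 0.0],   # option 0 always terminates in state 0
            [0.0, 0.5]]   # option 1 always terminates in state 1
unreliable = [[0.25, 0.25],
              [0.25, 0.25]]  # final state carries no information about Omega

print(mutual_information(reliable))    # log 2 = H(Omega): maximal empowerment
print(mutual_information(unreliable))  # 0: no control over the final state
```

Reliable options saturate the bound I(Ω; s_f) ≤ H(Ω), while options that do not influence where the agent ends up carry zero empowerment.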

We augment this options framework with an information bottleneck – i.e., we minimize the mutual information between the option and the agent's actions at every state in a trajectory. Our hypothesis is that such a bottleneck should allow decision states to emerge in an empowered agent without any extrinsic reward. Analogous to Goyal et al. (2019), we can then identify decision states as the ones with high mutual information despite the bottleneck. For example, in Fig. 1 the agent would need to access Ω at the middle-left decision state to decide which of the indicated states it would like to go to. We call our approach Decision State VIC (DS-VIC).

Contributions: We present an approach for an agent to discover decision states in an environment in an entirely 'unsupervised' or task-agnostic manner. We find that our approach, via an empowerment objective, identifies decision states that match our intuitive assessments on a variety of environments with different levels of difficulty in a partial-observation setting. Further, we find our decision states are useful for guiding exploration in downstream tasks of interest in novel environments. This is similar to the results obtained by Goyal et al. (2019) for identifying decision states in reward/goal-driven settings, but stronger in the sense that our decision states were never identified in the context of any goal and are thus more general. Interestingly, on a challenging grid world navigation task we find evidence that our method outperforms the transfer results from (our implementation of) Goyal et al. (2019). Our broader contribution to the RL literature is that our work provides an unsupervised analogue of Goyal et al. (2019), in the same way as β-VAE (Higgins et al., 2017) is an unsupervised formulation of the information bottleneck (Tishby et al., 1999; Alemi et al., 2016) in traditional supervised learning.

2 Methods

2.1 Discovery of Decision States

We first describe the objective from Gregor et al. (2016) (VIC), which optimizes for intrinsic control using options. We then discuss how to blend an information bottleneck with this approach to discover decision states implicit in the choices of options.

Variational Intrinsic Control (VIC). VIC considers a finite set of options presented to the agent via a discrete random variable Ω. The option is sampled once for an episode/trajectory and the agent acts according to a policy π(a_t | s_t, Ω) (see Fig. 2). The key idea is to maximize the mutual information (Cover and Thomas, 2006) between an option Ω and the final state s_f in a trajectory given an initial state s_0, i.e. I(Ω; s_f | s_0). As discussed in section 1, this maximizes the empowerment of an agent, meaning its internal options correspond to reliable states of the world that it can reach. Gregor et al. (2016) formulate a lower bound on this mutual information. Specifically, let p(Ω | s_0) be a prior on options (assumed to be the same for all s_0), p^π(s_f | s_0, Ω) denote the (unknown) terminal state distribution achieved when executing the policy π, and q(Ω | s_0, s_f) denote a (parameterized) variational approximation to the true posterior on options given a final state and initial state. Now, the VIC lower bound is given by:

I(Ω; s_f | s_0) ≥ E_{p(Ω | s_0) p^π(s_f | s_0, Ω)} [ log q(Ω | s_0, s_f) − log p(Ω | s_0) ]    (1)

Note that eq. 1 (and the rest of this paper) writes the expectation for a given spawn location s_0, but there is also an underlying spawn location distribution (typically uniform). In practice, optimization proceeds by taking the expectation of eq. 1 over spawn locations. Next, we describe the information bottleneck regularizer, which we employ in addition to the VIC objective.
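The lower bound can be checked numerically. The sketch below (our illustration; the distributions are invented) evaluates E[log q(Ω | s_f) − log p(Ω)] for a candidate posterior q and shows it equals the true mutual information exactly when q is the true posterior, and falls below it otherwise:

```python
from math import log

# Toy joint p(Omega, s_f) for a fixed spawn s_0: options mostly, but not
# perfectly, determine the final state.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_omega = [sum(row) for row in joint]
p_sf = [sum(col) for col in zip(*joint)]

def vic_lower_bound(q_post):
    """E_{p(Omega, s_f)}[ log q(Omega | s_f) - log p(Omega) ]  (cf. eq. 1)."""
    return sum(joint[w][s] * (log(q_post[w][s]) - log(p_omega[w]))
               for w in range(2) for s in range(2) if joint[w][s] > 0)

true_mi = sum(joint[w][s] * log(joint[w][s] / (p_omega[w] * p_sf[s]))
              for w in range(2) for s in range(2))

exact_posterior = [[joint[w][s] / p_sf[s] for s in range(2)] for w in range(2)]
uniform_posterior = [[0.5, 0.5], [0.5, 0.5]]

print(true_mi)                            # the quantity VIC wants to maximize
print(vic_lower_bound(exact_posterior))   # tight: equals true_mi
print(vic_lower_bound(uniform_posterior)) # loose: 0, i.e. below true_mi
```

Training the inference network q(Ω | s_0, s_f) tightens the bound, which is why maximizing the bound also improves the true mutual information.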

Figure 2: Left: VIC samples an option Ω at a start state s_0, follows a policy π(a_t | s_t, Ω), and then infers the sampled Ω from the terminating state (s_f), optimizing a lower bound on I(Ω; s_f | s_0). Right: Our approach considers a particular parameterization of the policy and imposes an information bottleneck on the intermediate representation z_t.

Policy Parameterization. Our model uses a particular parameterization for the agent's option-conditioned policy, shown in Fig. 2 (right). At every timestep t in the episode, given the current state (s_t) and the (global) option (Ω), we compute an intermediate representation z_t ~ p_enc(z_t | s_t, Ω). This intermediate representation is used along with the state to choose an action a_t ~ π(a_t | s_t, z_t) to perform at time t.

Information Bottleneck Regularizer. Analogous to Goyal et al. (2019), we impose an information bottleneck penalty encouraging actions to have low mutual information with the options, i.e. we penalize Σ_t I(Ω; a_t | s_t). This is a way of providing an inductive bias to the model that there are states with 'default' behavior (where one need not reason about the chosen option). Further, this allows us to identify decision states as states where the mutual information is high (despite the constraint). Combining this with the VIC objective (eq. 1) yields the following objective:

max_π  I(Ω; s_f | s_0) − β Σ_t I(Ω; a_t | s_t)    (2)

where β controls the strength of the regularization. Intuitively, this says that we want options which give the agent a degree of control over the environment, while ensuring that the options only need to be invoked at a sparse set of states. The first term above is optimized as in Gregor et al. (2016). We next describe how we optimize the second term.

Bounds for I(Ω; a_t | s_t). The mutual information I(Ω; a_t | s_t), for a given start state s_0, is:

I(Ω; a_t | s_t) = E [ log π(a_t | s_t, Ω) − log π(a_t | s_t) ]    (3)

where π(a_t | s_t) = Σ_Ω p(Ω | s_t) π(a_t | s_t, Ω) is the option-marginalized policy. However, in practice, this objective is hard to optimize since we do not have access to the distributions in eq. 3 in closed form. Thus, similar to VIC, one needs to optimize a Monte Carlo estimate. However, when computing the Monte Carlo estimate, the denominator π(a_t | s_t) becomes problematic to compute, since it is difficult to have good estimates of p(Ω | s_t) for arbitrary s_t – for example, when a particular option has so far never visited a particular state during on-policy execution of VIC.

We thus resort to variational bounds on the mutual information (Barber and Agakov, 2004). In a related but slightly different context of goal-conditioned policies, previous works (Goyal et al., 2019; Galashov et al., 2018) discuss how to construct such bounds, either using the action distribution (Galashov et al., 2018) or using an internal stochastic representation space (InfoBot; Goyal et al., 2019). In order to be as comparable as possible to InfoBot, we regularize I(Ω; z_t | s_t), which yields the following variational bound (see Sec. A.1 in the appendix for a detailed derivation):

Σ_t I(Ω; z_t | s_t) ≤ Σ_t E [ KL( p_enc(z_t | s_t, Ω) || q(z_t) ) ]    (4)

where q(z_t) is a fixed variational approximation to the marginal over z_t (set to a unit Gaussian, as in InfoBot), and p_enc(z_t | s_t, Ω) is a parameterized encoder. Eq. 4 blends nicely with the VIC objective, since we can compute a Monte Carlo estimate of eq. 4 by first sampling an option at s_0 and then keeping track of all the states visited in the trajectory, in addition to the last state required by VIC.
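With q(z_t) fixed to a unit Gaussian and a diagonal-Gaussian encoder, each KL term in eq. 4 has a closed form. A small sketch (ours; the encoder outputs below are invented for illustration) of how the per-state penalty separates 'default' states from candidate decision states:

```python
from math import exp

def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

# At a 'default' state the encoder can ignore Omega and match the prior ...
default_state = kl_to_unit_gaussian(mu=[0.0, 0.0], log_var=[0.0, 0.0])
# ... while at a decision state it must encode option information in z_t.
decision_state = kl_to_unit_gaussian(mu=[1.5, -1.0], log_var=[-0.5, -0.5])

print(default_state)   # 0.0: no bits about Omega pass through z_t
print(decision_state)  # > 0: the bottleneck is 'violated' at this state
```

States whose encoder output deviates from the prior pay a positive KL cost, which is exactly the signal later read off to flag decision states.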

In addition to the intrinsic control term and the bottleneck regularizer, we also include the expected entropy of the policy over the marginal state distribution (maximum-entropy RL; Ziebart, 2010) as a bonus to encourage sufficient exploration of the state space. Putting it all together, the DS-VIC objective is:

max_{θ, φ, ψ}  E_τ [ log q_ψ(Ω | s_0, s_f) − log p(Ω | s_0) − β Σ_t KL( p_φ(z_t | s_t, Ω) || q(z_t) ) + α Σ_t H( π_θ(a_t | s_t, z_t) ) ]    (5)

where θ, φ and ψ are the parameters of the policy, the bottleneck variable encoder and the option inference network respectively. Note that we abuse notation slightly in eq. 5 by writing τ to denote a trajectory sampled by following π_θ. This is a sequential process where a_t ~ π_θ(a_t | s_t, z_t) and s_{t+1} ~ p(s_{t+1} | s_t, a_t) (the state transition function). Intuitively, the first term in the objective ensures reliable control over states in the environment while learning options; the second term ensures minimality in how the sampled options are used; and the third provides an incentive for exploration. See Algorithm 1 for more details.

Handling Partial Observability. At a high level, our hypothesis is that decision states are a function of (1) the information available to an agent, and (2) the structure in the environment. For (1), we are primarily interested in partially observable MDPs, similar to InfoBot (Goyal et al., 2019). Let o_t be the observation made by the agent at time t. We then set s_t = h_t in all the equations above, where h_t is the hidden state of a recurrent neural network (RNN; Hochreiter and Schmidhuber, 1997) run over the observations o_1, …, o_t. We found it important to parameterize the policy (fig. 2, right) such that action distributions are modeled as 'reactive' (conditioning only on the current observation and z_t) while the z_t distributions are 'recurrent' (conditioning on the entire history of observations via h_t). Intuitively, this is because the sequence of observations might be informative of the high-level option being followed, i.e. conditioning the actions on h_t in addition to o_t could 'leak' information about the underlying option to the actions, making the bottleneck on z_t circumventable and thus ineffective. Finally, when maximizing the likelihood of the option given the final state in VIC (the option inference network), we use the global (x, y) coordinates of the final state of the agent. Note that the agent's policy continues to be based only on observations o_t; it is only at the end of the trajectory that we use the global coordinates to compute the intrinsic reward. Please refer to Sec. A.4 in the appendix for a discussion of the issues when training the model with partial observations in the option inference network.
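A minimal numerical sketch of this parameterization (our illustration with tiny random weights, not the paper's architecture): the recurrent state h_t summarizes the observation history and, together with Ω, produces z_t, while the action head sees only the current observation and z_t.

```python
import math, random

random.seed(0)
OBS, HID, Z, N_OPT, N_ACT = 4, 6, 2, 3, 4

def mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

W_hh, W_ho = mat(HID, HID), mat(HID, OBS)  # recurrent belief update
W_z = mat(Z, HID + N_OPT)                  # z_t from (h_t, Omega)
W_a = mat(N_ACT, OBS + Z)                  # reactive action head

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def step(h, obs, omega_onehot):
    # h_t summarizes o_1..o_t; only z_t is allowed to depend on it
    h = [math.tanh(a + b) for a, b in zip(matvec(W_hh, h), matvec(W_ho, obs))]
    z = matvec(W_z, h + omega_onehot)  # (deterministic sketch of) z_t
    # the action distribution conditions on o_t and z_t but NOT on h_t, so
    # option information can only reach actions through the bottleneck z_t
    pi = softmax(matvec(W_a, obs + z))
    return h, z, pi

h = [0.0] * HID
h, z, pi = step(h, [1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print(len(pi), round(sum(pi), 6))  # a valid distribution over N_ACT actions
```

In a full implementation the z head would output a Gaussian (mean and log-variance) and z_t would be sampled with the reparameterization trick; the structural point here is only which inputs each head is allowed to see.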

Related Objectives. Objectives similar to Eqn. 5 have also been explored in prior work on unsupervised skill discovery for RL, although not in the context of decision states. Specifically, Eysenbach et al. (2018) (DIAYN) attempt to learn skills (options) which can control the states visited by agents while ensuring that states, not actions, are used to distinguish skills. A key difference is our introduction of β as a hyper-parameter which controls the capacity of the latent channel (Goyal et al., 2019) and the degree to which "default behavior" is encouraged during training, thereby allowing us to discover decision states as states where the information bottleneck is violated. In section 4, we compare against decision states extracted from Eysenbach et al. (2018) and find our proposed approach is better.

3 Experimental Setup

In this section, we first describe how we utilize decision states for transfer to goal-driven tasks in different environments (Goyal et al., 2019). We then describe the environments we perform our experiments in and finally summarize our training algorithm.

3.1 Transfer to Goal-Driven Tasks

We now discuss how this unsupervised process of identifying decision states and learning reliable options allows us to adapt to goal-driven settings in different environments. Specifically, similar to Goyal et al. (2019), we study if providing an exploration bonus which incentivizes the goal-conditioned policy to visit decision states leads to sample-efficient transfer.

Exploration-based Transfer. After optimizing Eqn. 5 to completion, similar to Goyal et al. (2019), we wish to understand if it is possible to directly leverage the existing machinery for identifying decision states to aid transfer to goal-driven tasks in an environment. After the unsupervised training phase, we freeze the encoder and provide its bottleneck term, KL( p_enc(z_t | s_t, Ω) || q(z_t) ), as an exploration bonus r^expl(s_t) incentivizing the agent to visit decision states, in addition to the environmental reward r^env_t. We decay the exploration bonus by the visitation frequency of s_t within an episode to ensure the agent does not solely pursue the exploration bonus.[1] We hypothesize that if decision states correspond to structural regularities in an environment, visiting (a subset of) them should generally aid goal-driven tasks in that environment. An underlying assumption here is that, during transfer, certain local physical structures in the environment are shared across the environments we train and evaluate on. We perform experiments to demonstrate transfer to different environments. To summarize, the overall reward the goal-conditioned policy is trained to maximize is:

r_t = r^env_t + λ · r^expl(s_t) / n(s_t)    (6)

where n(s_t) is the within-episode visit count of s_t and λ weights the bonus.

[1] Visitation counts have a number of limitations, including the requirement of a discrete state space as well as a state table for maintaining visit counts in POMDPs. However, our exploration bonus need not be restricted by these assumptions, as one could alternatively use a distillation technique similar to Burda et al. (2018), where the incentive is to learn to distill our trained encoder network, alleviating the need for visitation counts altogether.
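A sketch of the decayed bonus described above (our reading of the scheme; `kl_fn` stands in for the frozen encoder's per-state KL term, and `lam` for the bonus weight λ):

```python
from collections import defaultdict

def make_shaped_reward(kl_fn, lam):
    """Adds lam * kl_fn(s) / n(s) to the environment reward, where n(s) is the
    within-episode visit count of state s; call reset() at episode boundaries."""
    counts = defaultdict(int)
    def shaped(state, r_env):
        counts[state] += 1
        return r_env + lam * kl_fn(state) / counts[state]
    def reset():
        counts.clear()
    return shaped, reset

# hypothetical frozen-encoder KL values: high at a doorway, ~0 in a corridor
kl_table = {"door": 2.0, "corridor": 0.05}
shaped, reset = make_shaped_reward(lambda s: kl_table.get(s, 0.0), lam=0.1)

print(shaped("door", 0.0))      # 0.2 on the first visit
print(shaped("door", 0.0))      # 0.1 on the second: bonus decays with n(s)
print(shaped("corridor", 0.0))  # ~0: no incentive to linger in default states
```

The count decay keeps the bonus from dominating the environmental reward once a decision state has been visited a few times in the episode.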
Environments. The environments we consider are partially observable grid-worlds built on top of MiniGrid (Chevalier-Boisvert et al., 2018). We first consider a set of simple environments with relatively low complexity – 4-Room and Maze (see Fig. 3) – followed by the MultiRoomNXSY task domains (similar to Goyal et al. (2019)). The MultiRoomNXSY environment consists of X rooms of size Y, connected in random orientations across multiple runs. We summarize our key implementation details and the criterion used to determine convergence of the DS-VIC optimization (Eqn. 5) in Sec. A.7 of the appendix. Algorithm 1 summarizes our approach.

Require: A parameterized encoder p_enc(z_t | s_t, Ω) and policy π(a_t | s_t, z_t)
Require: A parameterized option inference network q(Ω | s_0, s_f)
Require: A parameterized goal-conditioned policy
Require: A prior p(Ω | s_0) on discrete options
Require: A variational approximation q(z_t) of the option-marginalized encoder
Require: A regularization weight β and max-ent coefficient α
Require: A set of training environments and transfer environments
  Unsupervised Discovery
  Sample training environment
  for episode = 1 to N do
     Sample a spawn location s_0 and an option Ω ~ p(Ω | s_0)
     Unroll a state-action trajectory τ under π with reparametrized z_t
     Infer Ω from (s_0, s_f) using q(Ω | s_0, s_f)
     Update the parameters θ, φ and ψ based on Eqn. 5
  end for
  Transfer to Goal-Driven Tasks
  Sample transfer environment
  for episode = 1 to M do
     Sample a goal
     Unroll a state-action trajectory under the goal-conditioned policy
     Update policy parameters to maximize the reward given by Eqn. 6
  end for
Algorithm 1 DS-VIC

4 Results

Discovered Decision States. We first present the decision states obtained from our unsupervised discovery phase on the 4-Room and Maze environments (see Fig. 3). Intuitively, we expect decision states to correspond to structural regularities in the environment. Experimentally, we find this intuition validated: we notice more discernible decision states in the maze environment (which has more structure) than in the four-room environment. In the maze, features such as intersections emerge as decision states, whereas in the four-room environment a mixture of corners and entries to corridors emerge as decision states.

Next, when we sweep values of β, we find that the nature of the discovered decision states changes dramatically. At lower values of β, the bottleneck in eq. 5 is enforced less, and optimization finds solutions where more states (trivially) become decision states (fig. 3, columns 3 and 4), which means that default behavior is rarely discovered and practiced.

We also compare the decision states obtained from our objective with an analog of DIAYN (Eysenbach et al., 2018) (fig. 3, 4th column) – an approach for intrinsic control that includes a bottleneck term in its objective, but is not designed explicitly for discovering decision states (see Sec. A.5 of the appendix for more details). We identify decision states using DIAYN as states where default behavior is violated, i.e. where the per-state mutual information with the option is high. Notice how our approach gives better precision on decision states than the DIAYN approach (note that one cannot run DIAYN with other values of β, since their algorithm assumes β = 1).

Figure 3: Decision States on Preliminary Environments – Four Room environment (top row) and Maze environment (bottom row). The first column depicts the environment layout, the second and third columns show results for DS-VIC at two values of β, and the fourth column corresponds to the DIAYN analog of DS-VIC with β = 1. Figures are plotted from a buffer of on-policy trajectories during the unsupervised training phase.

Transfer to Goal-Driven Tasks. We study how an exploration bonus (eq. 6) utilizing the encoder used to identify decision states can aid transfer to goal-driven tasks in different environments (see Sec. 3.1). We restrict ourselves to the point-navigation task for transfer in the MultiRoomNXSY set of environments. We perform two sets of experiments – (1) train on MultiRoomN2S6 and transfer to MultiRoomN3S4 and MultiRoomN5S4 (similar to Goyal et al. (2019)), and (2) train on MultiRoomN2S10 and transfer to MultiRoomN6S25, which is a more challenging version of the transfer task in InfoBot in the sense that its larger rooms (25x25) make targeted exploration to find the doors quickly more crucial. In setting (1) we report numbers from Goyal et al. (2019) (count-based, curiosity-driven (Pathak et al., 2017a), VIME (Houthooft et al., 2016)) as well as our implementation of InfoBot (Goyal et al., 2019) (code for InfoBot was not public; we report numbers both from our implementation and from Goyal et al. (2019)). Additionally, we provide appropriate comparisons to our implementation of InfoBot, to DIAYN, and to ablations of our model with different values of β in the second setting (see table 1, Fig. 4 and Fig. 5).

The first half of table 1 reports results from InfoBot, whereas the second half reports our results reproduced on the environments from InfoBot. We report mean success with standard error of the mean on goal-driven transfer across 10 random seeds, with a value of λ (from Eqn. 6) selected based on the best success on MultiroomN5S3 across a set of values (see Sec. A.7 in the appendix for details). For table 1, similar to Goyal et al. (2019), success is defined as the percentage of times the agent reaches the goal across a fixed set of 128 evaluation environments. We first observe that nearly all of the exploration incentives achieve a high success rate for the MultiRoomN2S6 → {MultiRoomN3S4, MultiRoomN5S4} setups in our experiments (bottom half); our implementations of count-based exploration and InfoBot outperform the numbers reported by Goyal et al. (2019). In the harder setup (rightmost column), we observe that our DS-VIC exploration bonus (reported for two different values of β) consistently outperforms everything else by a significant margin. Interestingly, among the baselines, we observe that the count-based exploration incentive outperforms InfoBot (although with overlapping standard error of the mean). The DIAYN-based exploration bonus, which operates directly in the action space, performs poorly. Additionally, we observe that the value of β in the bottleneck is crucial for transfer, with a gradual decay in performance as β is decreased (log-scale), as shown in Fig. 4. We also compare to a random network baseline in Fig. 4, where a randomly initialized DS-VIC encoder is used with the mean and variance of the exploration bonus adjusted to match those of the best trained DS-VIC network. Such a random network baseline controls for the effect of scale and state-dependent variance of our DS-VIC encoder on the performance of supervised transfer.

Figure 4: Effect of β during DS-VIC pre-training on success with the exploration bonus for the goal-conditioned policy (blue; error bars are standard error of the mean over 10 random seeds). Reference lines (dashed) are for count-based exploration (green) and a bonus from a randomly initialized network (orange), calibrated to match the mean and variance of the best trained DS-VIC encoder; this controls for the effect of scale and state-dependent variance on the performance of our trained network.
Method MultiRoomN3S4 MultiRoomN5S4 MultiRoomN6S25
(Encoder pretrained on) MultiRoomN2S6 MultiRoomN2S6 MultiRoomN2S10
Goal-conditioned A2C
Curiosity-based Exploration N/A
Count-based Exploration Goyal et al. (2019) N/A
InfoBot Goyal et al. (2019) N/A
InfoBot (Our Implementation)
Count-based Exploration (Our Implementation)
DIAYN-based Exploration Bonus ()
Table 1: Success rate (mean ± standard error of the mean) of the goal-conditioned policy when trained with (and without) different exploration bonuses in addition to the environmental reward r^env. We find that our exploration bonus (see Eqn. 6) consistently outperforms InfoBot (Goyal et al., 2019) and a naive count-based exploration bonus.

To further understand the behavior of the exploration bonus in Eqn. 6 compared to other exploration incentives, in addition to success rate and sample-efficiency (inferred from the plot of average task return across time-steps in Fig. 5 (a)), we also consider another metric devised specifically for the MultiRoomNXSY environments. The MultiRoomNXSY task domain consists of rooms which are connected in a sequential manner with respect to the agent spawn and the goal location – the agent is always spawned in the first room and has to traverse all the rooms in the environment sequentially to reach the goal in the last room. Therefore, to maximize returns, it is critical for an exploration incentive to encourage the agent to visit the next room in the sequence. Fig. 5 (b) shows the number of unique rooms reached so far across episodes and random seeds, while Fig. 5 (c) shows the running best of the farthest room reached across episodes and random seeds. The rate at which the number of unique rooms visited rises across training time is indicative of how targeted the exploration incentive is.
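The two room metrics can be computed from logged trajectories as follows (a sketch of our bookkeeping; rooms are indexed from 0 at the spawn room):

```python
def room_metrics(episodes):
    """episodes: list of per-episode lists of room indices visited.
    Returns, after each episode, (a) the number of unique rooms seen so far
    and (b) the running best of the farthest room reached."""
    seen, best = set(), 0
    unique_curve, farthest_curve = [], []
    for rooms in episodes:
        seen.update(rooms)
        best = max(best, max(rooms))
        unique_curve.append(len(seen))
        farthest_curve.append(best)
    return unique_curve, farthest_curve

# e.g. an agent that reaches room 2 once but then regresses
print(room_metrics([[0, 1], [0, 1, 2], [0, 1]]))
# → ([2, 3, 3], [1, 2, 2])
```

Both curves are monotonically non-decreasing by construction, so their rise time directly reflects how quickly an incentive pushes the agent into new rooms.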

(a) Avg. Task Return
(b) # Unique Rooms Reached
(c) Farthest Room Reached (running best)
Figure 5: (a) Average task return, (b) number of unique rooms reached, and (c) farthest room reached so far, versus time-steps. The exploration bonus from our method (DS-VIC) achieves higher values than all the baselines on the three metrics; on all three, higher is better. Moreover, DS-VIC shows better sample-complexity as measured by the number of time-steps needed to attain the final values. Shaded areas represent standard error of the mean over 10 random seeds.

We observe from Fig. 5 (a) that our DS-VIC exploration bonus is significantly more sample-efficient for transfer compared to the baselines. Fig. 5 (b, c) further demonstrate that our bonus leads to more targeted exploration, compared to the exhaustive search trends exhibited by the baselines.

To summarize, we make the following observations:


  • The decision states obtained by optimizing Eqn. 5 to completion are interpretable and correspond to states with more degrees of freedom in terms of possible trajectories – which aligns with our hypothesis that decision states are crucial junctions where an agent needs to carefully make a decision to make progress towards a goal.

  • The ability to control β (in Eqn. 5) as a hyper-parameter – which modulates the capacity of the latent channel z_t – is critical for feasible decision states to emerge after the unsupervised training phase. Furthermore, interpretable decision states emerge only for a specific range of β values – unlike prior work on discovering useful abstractions in an unsupervised manner (Eysenbach et al., 2018; DIAYN), where β is naively set to 1.

  • Using the decision states identified from our unsupervised DS-VIC objective (Eqn. 5) as an exploration bonus leads to better transfer (in terms of success rate and sample-complexity) to goal-driven tasks compared to Goyal et al. (2019) (InfoBot) and a naive count-based exploration bonus.

5 Related Work

Intrinsic Control and Intrinsic Motivation. Learning how to explore without reward supervision is a foundational problem in Reinforcement Learning (Machado et al., 2017; Pathak et al., 2017b; Gregor et al., 2016; Strehl and Littman, 2008; Schmidhuber, 1990). Typical curiosity-driven approaches attempt to visit states that maximize the surprise of an agent in the environment  (Pathak et al., 2017b) or improvement in predictions from a dynamics model  (Machado et al., 2017). While curiosity-driven approaches seek out and explore novel states in an environment they typically do not measure how reliably one can reach them. In contrast, approaches for intrinsic control (Eysenbach et al., 2018; Achiam et al., 2018; Gregor et al., 2016) explore novel states while ensuring those are states that an agent can reliably reach via extrinsic reward-free options frameworks – Gregor et al. (2016) maximize the number of final states that can be reliably reached by the policy, while Eysenbach et al. (2018) distinguish an option at every state along the trajectory, and Achiam et al. (2018) learn options for entire trajectories by encoding the sequence of states at regular time intervals. Since decision states are more tied to reliably acting in the environment and achieving goals rather than simply visiting novel states (without any estimate of how reliably one can reach them), we formulate our regularizer in the intrinsic control framework, specifically building on top of Gregor et al. (2016).

Default Behavior and Decision States. Recent work in policy compression has focused on learning a default policy when training on a family of tasks, to be able to re-use behavior common to all tasks. In (Galashov et al., 2018; Teh et al., 2017), a default policy is learnt using a set of task-specific policies which in turn acts as a regularizer for each policy, while Goyal et al. (2019) learn a default policy using an information bottleneck on task information and a latent variable the policy is conditioned on. We devise a similar information bottleneck objective but in a reward-free setting that learns default behavior to be shared by all intrinsic options so as to reduce learning pressure on option-specific policies.

Bottleneck states in MDPs. There is a rich literature on the identification of bottleneck states in Markov decision processes. The core idea is to either identify all the states that are common to multiple goals in an environment (McGovern and Barto, 2001) or use a diffusion model built from an MDP's transition matrix (Machado et al., 2017; Şimşek and Barto, 2004; Theocharous and Mahadevan, 2002). The key distinction between bottleneck states and decision states is that decision states are more closely tied to the information available to the agent and what it can act upon, whereas bottleneck states are more tied to the connectivity structure of an MDP, representing states which, when visited, allow access to a novel set of states (Goyal et al., 2019). Concretely, while a corner in a room need not be a bottleneck state – since visiting a corner does not "open up" a new set of states to an agent (the way a doorway would) – it is still a useful state for a goal-driven agent with partial observation to visit, since it is a distinct landmark in the state space where decisions could be made meaningfully. Indeed, we found in our initial experiments that our intrinsic control objective would not give interpretable decision states for fully observable MDPs. To see this, consider being inside a corridor with an option that takes you to the left of the room. Even in the middle of the corridor, if you had a map, you could "decide" to go to the left of the room, meaning that the end of the corridor need not be a decision state. This further illustrates that decision states are much more tied to the agent (what it has seen) and the environment, as opposed to bottleneck states, which are agent-independent, intrinsic properties of the environment.

Information Bottleneck in Machine Learning. Since the foundational work of Tishby et al. (1999) and Chechik et al. (2005), there has been considerable interest in applying ideas from the information bottleneck (IB) to tasks such as clustering (Strouse and Schwab, 2017; Still et al., 2004), sparse coding (Chalk et al., 2016), classification with deep learning (Alemi et al., 2016; Achille and Soatto, 2016), cognitive science and language (Zaslavsky et al., 2018), and reinforcement learning (Goyal et al., 2019; Galashov et al., 2019; Strouse et al., 2018). In contrast to these works, we apply an information bottleneck to a reinforcement learning agent that must learn, without explicit reward supervision, to identify decision states in an environment.

6 Conclusion

We devise an unsupervised approach – DS-VIC – to identify decision states in an environment. These decision states are junctions where the agent needs to make a decision (as opposed to following default behavior) and are a function of the environment's structure as well as the agent's partial observation. Our results on multi-room and maze environments demonstrate that the learnt decision states are human-interpretable (e.g., they appear at the ends of hallways and at crossings in a maze) as well as transferable, i.e., they aid exploration on external-reward tasks, yielding better success rates and sample complexity.


  • Achiam et al. (2018) Achiam, J., H. Edwards, D. Amodei, and P. Abbeel (2018). Variational option discovery algorithms. arXiv preprint arXiv:1807.10299.
  • Achille and Soatto (2016) Achille, A. and S. Soatto (2016). Information dropout: learning optimal representations through noise.
  • Alemi et al. (2016) Alemi, A. A., I. Fischer, J. V. Dillon, and K. Murphy (2016). Deep variational information bottleneck. In ICLR.
  • Barber and Agakov (2004) Barber, D. and F. V. Agakov (2004). Information maximization in noisy channels : A variational approach. In S. Thrun, L. K. Saul, and B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, pp. 201–208. MIT Press.
  • Burda et al. (2018) Burda, Y., H. Edwards, A. Storkey, and O. Klimov (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
  • Chalk et al. (2016) Chalk, M., O. Marre, and G. Tkacik (2016). Relevant sparse codes with variational information bottleneck.
  • Chechik et al. (2005) Chechik, G., A. Globerson, N. Tishby, and Y. Weiss (2005). Information bottleneck for Gaussian variables. J. of Machine Learning Research 6, 165–188.
  • Chevalier-Boisvert et al. (2018) Chevalier-Boisvert, M., L. Willems, and S. Pal (2018). Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid.
  • Cover and Thomas (2006) Cover, T. M. and J. A. Thomas (2006). Elements of Information Theory. John Wiley. 2nd edition.
  • Eysenbach et al. (2018) Eysenbach, B., A. Gupta, J. Ibarz, and S. Levine (2018). Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.
  • Galashov et al. (2018) Galashov, A., S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2018). Information asymmetry in kl-regularized rl.
  • Galashov et al. (2019) Galashov, A., S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2019, May). Information asymmetry in KL-regularized RL.
  • Goyal et al. (2019) Goyal, A., R. Islam, D. Strouse, Z. Ahmed, M. Botvinick, H. Larochelle, S. Levine, and Y. Bengio (2019). Infobot: Transfer and exploration via the information bottleneck. In ICLR.
  • Gregor et al. (2016) Gregor, K., D. J. Rezende, and D. Wierstra (2016). Variational intrinsic control. arXiv preprint arXiv:1611.07507.
  • Higgins et al. (2017) Higgins, I., L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), 1735–1780.
  • Houthooft et al. (2016) Houthooft, R., X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pp. 1109–1117.
  • Kool and Botvinick (2018) Kool, W. and M. Botvinick (2018, December). Mental labour. Nat Hum Behav 2(12), 899–908.
  • Kostrikov (2018) Kostrikov, I. (2018). Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.
  • Machado et al. (2017) Machado, M. C., M. G. Bellemare, and M. Bowling (2017). A laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, Sydney, NSW, Australia, pp. 2295–2304. JMLR.org.
  • McGovern and Barto (2001) McGovern, A. and A. G. Barto (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML.
  • Pathak et al. (2017a) Pathak, D., P. Agrawal, A. A. Efros, and T. Darrell (2017a). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
  • Pathak et al. (2017b) Pathak, D., P. Agrawal, A. A. Efros, and T. Darrell (2017b). Curiosity-driven exploration by self-supervised prediction. In ICML.
  • Salge et al. (2013) Salge, C., C. Glackin, and D. Polani (2013, October). Empowerment – an introduction.
  • Schmidhuber (1990) Schmidhuber, J. (1990). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, Cambridge, MA, USA, pp. 222–227. MIT Press.
  • Şimşek and Barto (2004) Şimşek, Ö. and A. G. Barto (2004). Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 95–. ACM.
  • Still et al. (2004) Still, S., W. Bialek, and L. Bottou (2004). Geometric clustering using the information bottleneck method. In S. Thrun, L. K. Saul, and B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, pp. 1165–1172. MIT Press.
  • Strehl and Littman (2008) Strehl, A. L. and M. L. Littman (2008, December). An analysis of model-based interval estimation for markov decision processes. J. Comput. System Sci. 74(8), 1309–1331.
  • Strouse et al. (2018) Strouse, D. J., M. Kleiman-Weiner, J. Tenenbaum, M. Botvinick, and D. Schwab (2018, August). Learning to share and hide intentions using information regularization.
  • Strouse and Schwab (2017) Strouse, D. J. and D. J. Schwab (2017, December). The information bottleneck and geometric clustering.
  • Teh et al. (2017) Teh, Y., V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017). Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506.
  • Theocharous and Mahadevan (2002) Theocharous, G. and S. Mahadevan (2002). Learning the hierarchical structure of spatial environments using multiresolution statistical models. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems.
  • Tishby et al. (1999) Tishby, N., F. Pereira, and W. Bialek (1999). The information bottleneck method. In The 37th Annual Allerton Conf. on Communication, Control, and Computing, pp. 368–377.
  • Zaslavsky et al. (2018) Zaslavsky, N., C. Kemp, T. Regier, and N. Tishby (2018, July). Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. U. S. A. 115(31), 7937–7942.
  • Ziebart (2010) Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy.

Appendix A Appendix

a.1 Upper bound on

We explain the steps used to derive Eqn. (4) in the main paper as the stated upper bound. By the data processing inequality (Cover and Thomas, 2006) applied to the graphical model in Fig. 2, it suffices to bound the downstream information term, so we next derive an upper bound on that quantity.

We write this quantity, given a start state, as:


The key difference is that our objective uses options internal to the agent, as opposed to Goyal et al. (2019), who use external goal specifications provided to the agent. As in VIC, the relevant term denotes the (unknown) state distribution at time t, from which we can draw samples by executing a policy.

We then assume a variational approximation (for our experiments, we fix it to be a unit Gaussian, though it could also be learned) and, using the non-negativity of the KL divergence, obtain the following lower bound:
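As a sketch of the standard argument (following Barber and Agakov (2004) and VIC; the symbols Ω, s_0, s_f, and q_φ are our assumed notation, since the equations are elided in this version):

```latex
% Barber–Agakov variational lower bound on the intrinsic-control term,
% with option \Omega, start state s_0, option-termination state s_f,
% and a variational decoder q_\phi:
I(\Omega; s_f \mid s_0)
  = \mathbb{E}\left[\log p(\Omega \mid s_f, s_0)\right] + H(\Omega \mid s_0)
  \geq \mathbb{E}\left[\log q_\phi(\Omega \mid s_f, s_0)\right] + H(\Omega \mid s_0),
```

where the inequality holds because the gap between the two sides is an expected KL divergence, which is non-negative.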


a.2 Decision State Identification – On-Policy with Options

We use Eqn. 8 to identify and visualize the decision states learned in an environment, augmented with random sampling of the start state; thus, we compute our decision states in an on-policy manner. Mathematically, we can write this as:


where the spawn location is chosen uniformly at random from the set of states in the environment and an option is chosen at random at each spawn location. Thus, for each state in the environment, we aggregate over all the trajectories that pass through it and compute the resulting scores to identify and visualize decision states. In addition to being principled, this methodology precisely captures our intuition that decision states are those common to, or frequented across, multiple options. Our results in Section 4 show that such states indeed correspond to structural regularities in the environment.
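The aggregation above can be sketched in a few lines. `decision_state_scores` and the per-visit `score` (e.g., a per-state bottleneck/KL value) are hypothetical names for illustration, not the paper's implementation:

```python
from collections import defaultdict

def decision_state_scores(trajectories):
    """Aggregate per-visit bottleneck scores over all trajectories
    passing through each state.

    trajectories: iterable of [(state, score), ...] lists collected by
    spawning the agent at uniformly random start states with random
    options. Returns a dict: state -> mean score across all visits.
    States with the highest mean score are flagged as decision states.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        for state, score in traj:
            totals[state] += score
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

States frequented by many options accumulate many visits, matching the intuition that decision states are common across options.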

a.3 Identification of Decision States during Transfer

As mentioned in the main paper, we would like to compute the decision-state score in a (potentially) novel environment with a novel goal-conditioned task, in order to provide an exploration bonus. Given a state at which we would like to compute this score, we can write:


However, this quantity cannot be computed in this form in a transfer task, since a goal-driven agent does not follow on-policy actions for an option, which would be needed to draw samples for a Monte Carlo estimate of the integral above. Instead, we propose to compute the mutual information as follows:


Now, given a state, this requires us to draw samples from the posterior over options at that state, which in general is intractable (since it requires a state marginal that is not available in closed form). In order to compute the above equation, we therefore make a factorization assumption. Breaking it down, this amounts to two assumptions. First, all options have equal probability of passing through a given state at a given time, which is not true in general. The second assumption is the same as the one made in VIC. Instead of making this assumption, one could also fit a parameterized variational approximation and train it in conjunction with VIC. However, we found our simple scheme to work well in practice, and thus avoid fitting an additional variational approximation.
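Under the uniform-option assumption described above, the mutual information between options and actions at a single state reduces to an average KL divergence between each option-conditioned policy and the option-marginal policy. A minimal sketch for tabular policy probabilities (all names are ours):

```python
import numpy as np

def decision_score(policy_probs):
    """Estimate I(Omega; a | s) at one state s, assuming p(Omega | s)
    is uniform over options (the simplifying assumption above).

    policy_probs: array of shape (n_options, n_actions); row k holds the
    action distribution pi(a | s, Omega=k). Returns the MI in nats, i.e.
    the mean KL( pi(.|s,Omega) || mean_Omega pi(.|s,Omega) ).
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    marginal = policy_probs.mean(axis=0)  # action marginal under uniform p(Omega)
    # Safe ratio: where pi(a|s,Omega) = 0 the term contributes 0 (0 log 0 := 0).
    ratios = np.divide(policy_probs, marginal,
                       out=np.ones_like(policy_probs),
                       where=policy_probs > 0)
    return float(np.mean(np.sum(policy_probs * np.log(ratios), axis=1)))
```

A state where all options induce the same action distribution scores 0; a state where options demand different actions scores high, matching the notion of a decision state.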

a.4 Decision states from local observations

In the main paper, our results assumed access to a good SLAM routine that performs mapping and localization from partial observations. As we discuss in the main paper, this is not an unreasonable assumption, since our model is always trained on a single environment, and we can thus expect it to acquire a sense of its (x, y) coordinates in its internal representation / weights as training progresses. In this section, we detail some pathologies we observed at the start of training when taking a naive approach to SLAM, i.e., estimating a function that directly regresses the agent's location from partial observations (Algorithm 1).

We note that since our particular choice of environment is a discrete gridworld, several partial observations (especially for small agent view sizes such as a 3x3 window) look identical to the agent, and the agent tends to converge to a trivial optimum for learning options – one in which each option terminates at an easily obtainable partial observation. For instance, while training DS-VIC, one policy simply learned 4 options corresponding to the four cardinal directions the agent can face; each can be achieved by left or right turns, without any actual movement by the agent. Since the agent has a compass that tells it the direction it is facing, it can ignore the partial observation and use the direction vector alone to predict which of the 4 options was sampled.

a.5 DIAYN comparison

We provide more details on how we compare to DIAYN in our framework. Specifically, dropping the bottleneck term and maximizing the option–state mutual information over all visited states, rather than only at the final state, renders Eqn. 2 (main paper) similar to DIAYN. Additionally, as in DIAYN, we restrict the encoder to be deterministic – removing the physical bottleneck – and compute decision-state scores in the policy space.

a.6 Network Architecture

We use a 3-layer convolutional neural network, with kernels of size 3x3, 2x2, and 2x2 in the three layers respectively, to process the agent's egocentric observation, applying a non-linear activation after each convolutional layer. The output of the CNN is concatenated with the agent's direction vector (compass) and the option (or goal) encoding. The concatenated features are then passed through a linear layer of hidden size 64 to produce the final features used by the option-encoder head and the policy head. We use the (x, y) coordinates of the final state (embedded through a linear layer of hidden size 64) to regress to the option. Furthermore, our parameterized policy is reactive, while the encoder is recurrent over the sequence of states encountered in the episode. The bottleneck random variable is sampled from a parameterized Gaussian and is made a differentiable stochastic node using the reparameterization trick for Gaussian random variables.
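The stochastic bottleneck node relies on the standard Gaussian reparameterization trick. A minimal NumPy sketch (in the actual model, `mu` and `log_sigma` would be produced by the encoder head, and the sample would flow through an autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bottleneck(mu, log_sigma):
    """Reparameterized Gaussian sample z = mu + sigma * eps, eps ~ N(0, I).

    Because the randomness enters only through eps, gradients can flow
    through mu and log_sigma, making z a differentiable stochastic node.
    """
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.exp(log_sigma) * eps
```

In PyTorch this corresponds to sampling with `rsample()` from a `Normal(mu, sigma)` distribution rather than `sample()`.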

a.7 Implementation Details

As we deal with partially observable settings in all our experiments, the agent receives an egocentric view of its surroundings as input, encoded as an occupancy grid whose channel dimension specifies whether the agent or an obstacle is present at a location. We set the coefficient in Eqn. 5, fixed for all our experiments, based on sweeps conducted across multiple values. In practice, we found our unsupervised objective difficult to optimize in the absence of an entropy bonus – the parameterized policy collapses to a deterministic one, no options are learned, and exploration of the state space is inefficient. Moreover, we found it hard to obtain reasonable results when optimizing both terms in the objective from scratch; we therefore optimize the intrinsic objective alone for an initial number of episodes (i.e., with the bottleneck coefficient set to zero), after which we turn on the bottleneck term and let its coefficient grow linearly for a further number of episodes to reach feasible outcomes at convergence. We experiment with vocabularies of 2, 4, and 32 options (covering undercomplete and overcomplete settings) for all environments. For all the exploration incentives presented in Table 1, we first picked the coefficient in Eqn. (6) (which decides how to weigh the bonus relative to the reward) based on the best sample complexity on goal-driven transfer. For our implementation of InfoBot (Goyal et al., 2019), we chose hyperparameters based on a narrower search over the cross product of the set of values above and the set of values reported by Goyal et al. (2019). The chosen values were held fixed for all transfer experiments for the corresponding exploration incentives (rows in Table 1).
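The warm-up-then-linear-ramp schedule for the bottleneck coefficient described above can be sketched as follows (the episode counts and maximum coefficient are hyperparameters; the function name is ours):

```python
def beta_schedule(episode, warmup_episodes, ramp_episodes, beta_max):
    """Bottleneck coefficient for a given training episode.

    The coefficient is held at 0 for `warmup_episodes` (only the
    intrinsic objective is optimized), then grows linearly to
    `beta_max` over the next `ramp_episodes` and stays there.
    """
    if episode < warmup_episodes:
        return 0.0
    t = (episode - warmup_episodes) / ramp_episodes
    return min(1.0, t) * beta_max
```
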

We use Advantage Actor-Critic (A2C) (open-sourced implementation by Kostrikov (2018)) for all our experiments, with RMSprop as the optimizer. Since we were unable to find code to reproduce Goyal et al. (2019), we implemented InfoBot ourselves, taking care to be consistent with the architectural and hyper-parameter choices adopted by InfoBot (as per the A2C implementation by Chevalier-Boisvert et al. (2018)). The only difference between the A2C implementation we use (Kostrikov, 2018) and that of Chevalier-Boisvert et al. (2018) is how updates are performed – the former uses Monte Carlo updates while the latter uses temporal-difference updates.

Convergence Criterion. In practice, we observe that it is hard to learn many options with the unsupervised objective. From the entire option vocabulary, the agent learns only a few options discriminatively (as identified by the option termination state), with the rest collapsing onto the same set of final states. To pick a suitable checkpoint for the transfer experiments, we choose those that have reliably learned the largest number of options – as measured by the likelihood of the correct option given the final state.
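The checkpoint-selection criterion above can be sketched as a simple count; the 0.9 threshold and the function name are assumed values for illustration, not from the paper:

```python
import numpy as np

def num_reliable_options(decoder_probs, threshold=0.9):
    """Count options learned 'reliably' at a checkpoint.

    decoder_probs[k] is the decoder's average likelihood of recovering
    option k from the termination states of rollouts of option k.
    Checkpoints maximizing this count are selected for transfer.
    """
    return int(np.sum(np.asarray(decoder_probs) >= threshold))
```
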

Option Curriculum. It is standard to use a fixed-size discrete option space with a uniform prior (Gregor et al., 2016). However, learning a meaningful option space with a larger option vocabulary has been reported to be difficult (Achiam et al., 2018). We adopt the curriculum-based approach proposed by Achiam et al. (2018), in which the vocabulary size is gradually increased as the option decoder becomes more confident in mapping the final state back to the option sampled at the beginning of the episode. More concretely, whenever the decoder's likelihood of the sampled option exceeds a threshold (chosen via hyperparameter tuning), the option vocabulary size is increased. For our experiments, we start from a small vocabulary and terminate the curriculum once the target vocabulary size is reached.
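One step of such a curriculum, in the spirit of Achiam et al. (2018), can be sketched as follows; the multiplicative growth rule and the parameter names are our assumptions, since the exact update is elided in this version:

```python
def grow_vocabulary(k, k_max, decoder_likelihood, threshold, growth=1.5):
    """One curriculum step for the option vocabulary size.

    When the option decoder is confident enough (average likelihood of
    the sampled option at or above `threshold`), enlarge the vocabulary
    multiplicatively, capped at `k_max`; otherwise keep it unchanged.
    """
    if decoder_likelihood >= threshold:
        return min(int(growth * k) + 1, k_max)
    return k
```

Called after each evaluation round, this grows the vocabulary from its initial size toward the target, pausing whenever the decoder's confidence drops below the threshold.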

We run each of our experiments on a single GTX Titan-X GPU, and use no more than 1 GB of GPU memory per experimental setting.