Introduction
Hierarchical reinforcement learning (HRL) provides a principled framework for decomposing problems into natural hierarchical structures BerliacHierachialRL2019. By factoring a complex task into simpler subtasks, HRL can offer benefits over non-HRL approaches in solving challenging large-scale problems efficiently. Much of HRL research has focused on discovering temporally extended high-level actions, such as options, that can be used to improve learning and planning efficiency sutton1991; doina_thesis. Notably, the option-critic (OC) framework has become popular because it can learn deep hierarchies of both option policies and termination conditions from scratch without domain-specific knowledge (bacon2016optioncritic; riemer18options). However, while OC provides a theoretically grounded framework for HRL, it is known to exhibit common failure cases in practice, including lack of option diversity, short option duration, and large sample complexity (diversity).
Our key insight in this work is that OC suffers from these problems at least partly because it considers the entire state space during option learning and thus fails to reduce problem complexity as intended. For example, consider the partially observable maze domain in Figure 1. The objective in this domain is to pick up the key and unlock the door to reach the green goal. A non-HRL learning problem in this setting can be viewed as a search over the space of deterministic policies: $|\mathcal{A}|^{|\mathcal{B}|}$, where $|\mathcal{A}|$ and $|\mathcal{B}|$ denote the size of the action space and belief state space, respectively. Frameworks such as OC consider the entire state space or belief state when learning options, which can themselves be viewed as policies over the entire state space, naively resulting in an increased learning problem size of $|\Omega||\mathcal{A}|^{|\mathcal{B}|}$, where $|\Omega|$ denotes the number of options. Note that this does not yet account for the extra complexity of learning policies to select over and switch options. Unfortunately, because $|\Omega||\mathcal{A}|^{|\mathcal{B}|} \geq |\mathcal{A}|^{|\mathcal{B}|}$, learning options in this way can only increase the effective complexity of the learning problem.
To address the problem complexity issue, we propose to consider both temporal abstraction, in the form of high-level skills, and context-specific abstraction over factored belief states, in the form of latent representations, to learn options that can fully benefit from hierarchical learning. Considering the maze example again, the agent does not need to consider the states going to the goal when trying to get the key (see Option 1 in Figure 1). Similarly, the agent only needs to consider states that are relevant to solving a specific subtask. Hence, this context-specific abstraction leads to the desired reduction in problem complexity, and an agent can mutually benefit from having both context-specific and temporal abstraction for hierarchical learning.
Remark 1
With context-specific abstraction for each option, the learning problem size can be reduced such that:
(1) 
where denotes the subset of the belief state space for each option (or subpolicy) and . We further illustrate this idea in the tree diagram of Figure 1, where each option maps to only a subset of the state space.
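As a rough numerical illustration of Remark 1, the sketch below counts deterministic policies for a flat learner, for options defined over the full belief space, and for options restricted to a subset of it. All sizes are hypothetical values chosen for illustration. The count for the abstracted case assumes that each option's problem size is the action-space size raised to the size of that option's belief subset, summed over options. The policy over options and the termination functions are ignored here.

```python
from typing import Sequence


def flat_policy_count(num_actions: int, num_beliefs: int) -> int:
    """Number of deterministic mappings from belief states to actions."""
    return num_actions ** num_beliefs


def option_policy_count(num_actions: int, belief_sizes: Sequence[int]) -> int:
    """Sum of per-option search-space sizes; one belief-subset size per option."""
    return sum(num_actions ** size for size in belief_sizes)


# Hypothetical sizes chosen only for illustration.
NUM_ACTIONS = 4
NUM_BELIEFS = 12
NUM_OPTIONS = 3

flat = flat_policy_count(NUM_ACTIONS, NUM_BELIEFS)
# Options that each see the full belief space (the naive OC case).
full_options = option_policy_count(NUM_ACTIONS, [NUM_BELIEFS] * NUM_OPTIONS)
# Options that each see a (hypothetical) subset of 4 belief states.
abstracted_options = option_policy_count(NUM_ACTIONS, [4] * NUM_OPTIONS)

print(flat, full_options, abstracted_options)
```

Under these assumed sizes, the abstracted count is several orders of magnitude below the flat count, while the full-belief option count is a multiple of it.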
Contribution.
With this insight, we introduce a new option learning framework: Context-Specific Representation Abstraction for Deep Option Learning (CRADOL).
CRADOL provides options with the flexibility to select over dynamically learned abstracted components. This allows effective decomposition of problem size when scaling up to large state and action spaces, leading to faster learning. Our experiments demonstrate how CRADOL can improve performance over key baselines in partially observable settings.
Problem Setting and Notation
Partially Observable Markov Decision Process.
An agent’s interaction in the environment can be represented by a partially observable Markov decision process (POMDP), formally defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{X}, \mathcal{O}, \gamma \rangle$ (POMDPs). $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition probability function, $\mathcal{R}$ is the reward function, $\mathcal{X}$ is the observation space, $\mathcal{O}$ is the observation probability function, and $\gamma$ is the discount factor. An agent executes an action $a_t$ according to its stochastic policy $\pi_\theta(a_t \mid b_t)$ parameterized by $\theta$, where $b_t$ denotes the belief state and the agent uses the observation history up to the current timestep to form its belief $b_t$. Each observation $x_t \in \mathcal{X}$ is generated according to $\mathcal{O}$. An action $a_t$ yields a transition from the current state $s_t$ to the next state $s_{t+1}$ with probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, and the agent obtains a reward $r_t$. The agent’s goal is to find the deterministic optimal policy $\pi^*$ that maximizes its expected discounted return $\mathbb{E}[\sum_t \gamma^t r_t]$.
Option-Critic Framework.
A Markovian option $\omega \in \Omega$ consists of a triple $(I_\omega, \pi_\omega, \beta_\omega)$. $I_\omega$ is the initiation set, $\pi_\omega$ is the intra-option policy, and $\beta_\omega$ is the option termination function sutton1991. Similar to option discovery methods Mankowitz2016AdaptiveSA; determine_options_daniel; NIPS2016_c4492cbe, we also assume that all options are available in each state. MDPs with options become semi-MDPs smdp with an associated optimal value function over options $V^*_\Omega$ and option-value function $Q^*_\Omega$. bacon2016optioncritic introduces a gradient-based framework for learning subpolicies represented by an option. An option $\omega$ is selected according to a policy over options $\pi_\Omega$, and $\pi_\omega$ selects a primitive action until the termination of this option (as determined by $\beta_\omega$), which triggers a repetition of this procedure. The framework also models another auxiliary value function $Q_U$ for computing the gradient of $\pi_\omega$.
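The call-and-return procedure described above can be sketched as a short loop. This is a minimal illustration, not the implementation used in this work: the function names, the toy 1-D environment, and the constant termination probability are all hypothetical stand-ins for learned components.

```python
import random


def run_options(num_steps, policy_over_options, intra_option_policy,
                termination_prob, env_step, initial_state, rng):
    """Call-and-return execution: keep one option until it terminates."""
    state = initial_state
    option = policy_over_options(state, rng)
    trace = []
    for _ in range(num_steps):
        action = intra_option_policy(option, state, rng)
        state = env_step(state, action)
        trace.append((option, action, state))
        # Query the termination function in the newly reached state.
        if rng.random() < termination_prob(option, state):
            option = policy_over_options(state, rng)
    return trace


# Toy stand-ins (all hypothetical): a 1-D walk with two options.
def toy_policy_over_options(state, rng):
    return rng.randrange(2)


def toy_intra_option_policy(option, state, rng):
    return 1 if option == 0 else -1  # option 0 moves right, option 1 moves left


def toy_env_step(state, action):
    return state + action


trace = run_options(
    num_steps=10,
    policy_over_options=toy_policy_over_options,
    intra_option_policy=toy_intra_option_policy,
    termination_prob=lambda option, state: 0.2,
    env_step=toy_env_step,
    initial_state=0,
    rng=random.Random(0),
)
print(trace)
```

Each entry of `trace` records the active option, the primitive action it chose, and the resulting state, which makes option duration easy to inspect.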
Factored POMDP.
In this work, we further assume that the environment structure can be modeled as a factored MDP in which the state space is factored into variables: , where each takes on values in a finite domain and Guestrin_2003. Under this assumption, a state is an assignment of values to the set of state variables: such that . These variables are also assumed to be largely independent of each other with a small number of causal parents to consider at each timestep, resulting in the following simplification of the transition function:
(2) 
This implies that in POMDP settings we can match the factored structure of the state space by representing our belief state in a factored form as well with and . Another consequence of the factored MDP assumption is context-specific independence.
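To make the factored transition idea concrete, the sketch below assumes the standard factored-MDP form, in which the joint transition probability is a product of per-variable conditionals that each read only a few parent variables. The two binary variables, their parent sets, and the probabilities are hypothetical and chosen only to illustrate the structure.

```python
from itertools import product


def factored_transition_prob(state, action, next_state, parents, local_models):
    """Product of per-variable conditionals, each reading only its parents."""
    prob = 1.0
    for i, next_value in enumerate(next_state):
        parent_values = tuple(state[j] for j in parents[i])
        prob *= local_models[i](next_value, parent_values, action)
    return prob


# Hypothetical two-variable example: (has_key, door_open), both binary.
def key_model(next_value, parent_values, action):
    (has_key,) = parent_values
    if has_key == 1:  # a held key stays held
        return 1.0 if next_value == 1 else 0.0
    p_pick = 0.8 if action == "pickup" else 0.0
    return p_pick if next_value == 1 else 1.0 - p_pick


def door_model(next_value, parent_values, action):
    has_key, door_open = parent_values
    if door_open == 1:  # an open door stays open
        return 1.0 if next_value == 1 else 0.0
    p_open = 0.9 if (action == "toggle" and has_key == 1) else 0.0
    return p_open if next_value == 1 else 1.0 - p_open


PARENTS = [(0,), (0, 1)]  # indices of each variable's parents
MODELS = [key_model, door_model]

p = factored_transition_prob((1, 0), "toggle", (1, 1), PARENTS, MODELS)
total = sum(
    factored_transition_prob((1, 0), "toggle", nxt, PARENTS, MODELS)
    for nxt in product((0, 1), repeat=2)
)
print(p, total)
```

Because each local model is a proper conditional distribution, the product sums to one over all next states, so the factored form remains a valid transition function.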
Context-Specific Independence.
At any given timestep, only a small subset of the factored state space may be necessary for an agent to be aware of. As per context_definition, we define this subset to be a context. Formally, a context is a pair , where is a subset of state space variables and is the space of possible joint assignments of state space variables in the subset. A state is in the context when its joint assignment of variables in is present in . Two variables are contextually independent under if . This independence relation is referred to as context-specific independence (CSI) (chitnis2020camps). CSI in the state space also implies that this same structure can be leveraged in the belief state for POMDP problems.
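The membership test in the definition of a context can be written in a few lines. The sketch below is illustrative only: the variable names and the example context are hypothetical and loosely follow the maze example from the introduction.

```python
def in_context(state, context_vars, context_values):
    """True when the state's joint assignment over context_vars is in context_values."""
    assignment = tuple(state[name] for name in context_vars)
    return assignment in context_values


# Hypothetical maze variables; this context covers states where the key is not yet held.
FETCH_KEY_VARS = ("has_key",)
FETCH_KEY_VALUES = {(False,)}

state_a = {"has_key": False, "door_open": False, "agent_room": 0}
state_b = {"has_key": True, "door_open": False, "agent_room": 0}

print(in_context(state_a, FETCH_KEY_VARS, FETCH_KEY_VALUES))
print(in_context(state_b, FETCH_KEY_VARS, FETCH_KEY_VALUES))
```

Only the variables named in the context are read, so variables outside it (here `door_open` and `agent_room`) do not affect membership.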
Context-Specific Belief State Abstractions for Option-Critic Learning
We now outline how factored state space structure and context-specific independence can be utilized by OC to decrease the size of the search problem over policy space. We first note that the size of the search problem over the policy space of OC can be decomposed into the sum of the search problem sizes for each subpolicy within the architecture:
(3) 
OC can provide a decrease in the size of the learning problem over the flat RL method when . This decrease is possible if an agent leverages context-specific independence relationships with respect to each option such that only a subset of the belief state variables is considered , where and , implying that because the belief state space gets exponentially smaller with each removed variable. The abstracted belief state is then sent as an input to the intra-option policy and termination policy , dictating the size of the learning problem for each:
(4) 
However, a challenge becomes that we must also consider the size of the learning problem for the policy over options . cannot simply focus on an abstracted belief state in the context of an individual option because it must reason about the impact of all options. Naively sending the full belief state to is unlikely to make the learning problem smaller for a large class of problems because we cannot achieve gains over the flat policy in this case if is of comparable or greater size than . In this work, we address this issue by sending the policy over options its own compact abstracted belief state , where . As a result, the learning problem size of is and the total problem size for is:
(5) 
Note that represents a different kind of belief state abstraction than in that it must consider all factors of the state space. , however, should also consider much less detail than is contained in for each factor, because is only concerned with the higher-level semi-MDP problem that operates at a much slower time scale when options last for significant lengths before terminating and does not need to reason about the intricate details needed to decide on individual actions at each timestep. As a result, we consider settings where and , ensuring that an agent leveraging our CRADOL framework solves a smaller learning problem than either flat RL or OC as traditionally applied. Finally, the auxiliary value function , which is used to compute the gradient of , must consider rewards accumulated across options similarly to . Thus, we also send its own compact abstracted belief state as an input to such that the resulting problem size is small .
Learning Context-Specific Abstractions with Recurrent Independent Mechanisms
In this section, we present a new framework, CRADOL, based on the theoretical motivation for applying abstractions over belief state representations outlined in the last section. In particular, our approach leverages recurrent independent mechanisms (RIMs) to model factored context-specific belief state abstractions for deep option learning.
Algorithm Overview
Figure 2 shows a high-level overview of CRADOL. CRADOL uses small LSTM networks to provide compact abstract representations for and . It additionally leverages a set of mechanisms to model the factored belief state while only sending a subset of this representation to each option during its operation.
Option Learning
Our framework for option learning broadly consists of four modules: option attention, a recurrent layer, sparse communication, and a fully connected layer (see Figure 2). We choose to model each option based on a group of mechanisms because these mechanisms can learn to specialize into subcomponents following the general structure assumed in the environment for factored POMDPs. This in turn allows our learned options to self-organize into groupings of belief state components.
Option Attention Module.
The option attention module chooses which mechanisms out of mechanisms () are activated for by passing the option through a lookup table . The lookup table ensures a fixed context selection over time (i.e., a fixed mapping to a subset of the belief state space) by an option, because considering many different subsets with the same option would be effectively equivalent to considering the entire belief state space. This would lead to a large search over policy space, as in OC. To yield the output of the attention module , we apply the following:
(6) 
where is the attention values for and denotes the value weight with the value size .
We note that a top-$k$ operation is then performed such that only $k$ mechanisms are selected from the available mechanisms (mechanisms that are not selected are zero-masked), which enables an option to operate under an exponentially reduced belief state space by operating over only a subset of mechanisms.
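The selection-and-masking step can be sketched as follows. This is a simplified illustration rather than the attention computation of Equation 6: fixed per-option scores stand in for the learned lookup table, the value vectors are made up, and the function names are hypothetical. The choice of 4 mechanisms with 3 active ones matches values listed in the hyperparameter tables.

```python
def top_k_mask(scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring mechanisms."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    active = set(ranked[:k])
    return [1.0 if i in active else 0.0 for i in range(len(scores))]


def mask_values(values, mask):
    """Zero out the value vectors of mechanisms that were not selected."""
    return [[m * v for v in vec] for vec, m in zip(values, mask)]


# Hypothetical fixed lookup table: one score per mechanism for each option.
LOOKUP = {
    0: [0.9, 0.7, 0.4, 0.1],
    1: [0.1, 0.8, 0.6, 0.9],
    2: [0.5, 0.2, 0.9, 0.8],
}
# One (made-up) value vector per mechanism.
VALUES = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

mask = top_k_mask(LOOKUP[0], k=3)
masked = mask_values(VALUES, mask)
print(mask, masked)
```

Because the scores depend only on the option, each option keeps the same active subset at every timestep, which mirrors the fixed context selection described above.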
Recurrent Layer Module.
The recurrent layer module updates the hidden representation of all active mechanisms. Here, we have a separate RNN for each of the mechanisms with all RNNs shared across all options. Each recurrent operation is done for each active mechanism separately by taking input and the previous RNN hidden state , in which we obtain the output of the recurrent layer module for all mechanisms, where denotes the RNN’s hidden size.
Sparse Communication Module.
The sparse communication module enables mechanisms to share information with one another, facilitating coordination in learning amongst the options. Specifically, all active mechanisms can read hidden representations from both active and inactive mechanisms, with only active mechanisms updating their hidden states. This module outputs the context-specific belief state given the input :
(7) 
where are communication parameters with the communication key size .
Equation 7 is similar to RIMs. We only update the top-$k$ selected mechanisms from the option attention module, but through this transfer of information, each active mechanism can update its internal belief state and have improved exploration through contextualization. We note that there are many choices in what components to share across options. We explore these choices and their implications in our empirical analysis.
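The read-and-update pattern of the sparse communication module can be illustrated with plain dot-product attention. The sketch below is an assumption-laden simplification: it omits the learned communication parameters of Equation 7, uses the raw hidden states as queries, keys, and values, and applies a residual update. The function names and the example hidden states are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def communicate(hidden, active):
    """Active mechanisms attend over all hidden states; inactive ones are unchanged."""
    dim = len(hidden[0])
    updated = [list(h) for h in hidden]
    for k in active:
        # Scaled dot-product scores between mechanism k and every mechanism.
        scores = [
            sum(q * c for q, c in zip(hidden[k], other)) / math.sqrt(dim)
            for other in hidden
        ]
        weights = softmax(scores)
        read = [
            sum(w * other[d] for w, other in zip(weights, hidden))
            for d in range(dim)
        ]
        # Residual update (an assumption of this sketch).
        updated[k] = [h + r for h, r in zip(hidden[k], read)]
    return updated


hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
active = {0, 1, 2}  # mechanism 3 is inactive for the current option
updated = communicate(hidden, active)
print(updated)
```

Inactive mechanisms are still read from, so their hidden states can inform the active ones, but only the active rows change.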
Fully Connected Module.
Lastly, we have a separate fully connected layer for each option-specific intra-option policy and termination policy , which take as input. The intra-option policy provides the final output action , and the termination policy determines whether should terminate.
Implementation
We first describe our choice for representing factored belief states. Under the state uniformity assumption discussed in non_markov_process, we assume the optimal policy network based on the agent’s history is equivalent to the network conditioned on the true state space. Hence, we refer to the representation learned as an implicit belief state. More explicit methods for modeling belief states have been considered, for example, as outlined in igl2018deep. While the CRADOL framework is certainly compatible with this kind of explicit belief state modeling, we have chosen to implement the belief state implicitly and denote it as the factored belief state in order to have a fair empirical comparison to the RIMs method RIMs that we build off.
Our implementation draws inspiration from soft actor-critic haarnoja2018soft to enable sample-efficient off-policy optimization of the option-value functions, intra-option policy, and beta policy. In the appendix, we describe the implementation of our approach (see Algorithm 1) and additional details about the architecture.
Related Work
Hierarchical RL.
There have been two major high-level approaches in the recent literature for achieving efficient HRL: the options framework to learn abstract skills sutton1991; bacon2016optioncritic; omidshafiei18casl; riemer18options; riemer2020role and goal-conditioned hierarchies to learn abstract subgoals nachum2019nearoptimal; Levy2017HierarchicalA; kulkarni; kim20hmat. Goal-conditioned HRL approaches require a predefined representation of the environment and mapping of observation space to goal space, whereas the OC framework facilitates long-timescale credit assignment by dividing the problem into pieces and learning higher-level skills. Our work is most similar to khetarpal_options_interest, which learns a smaller initiation set through an attention mechanism to better focus an option on a region of the state space, and hence achieve specialization of skills. However, it does not leverage this focus to abstract and minimize the size of the belief space as CRADOL does. We consider a more explicit method of specialization by leveraging context-specific independence for representation abstraction. Our approach could also potentially consider learning initiation sets, so we consider our contributions to be orthogonal.
While OC provides a temporal decomposition of the problem, other approaches such as Feudal Networks feudalRL decompose problems with respect to the state space. Feudal approaches use a manager to learn more abstract goals over a longer temporal scale and worker modules to perform more primitive actions with fine temporal resolution. Our approach employs a combination of both visions, necessitating both temporal and state abstraction for effective decomposition. Although some approaches such as the MAXQ framework maxq employ both, they involve learning recursively optimal policies that can be highly suboptimal BerliacHierachialRL2019.
State Abstractions.
Our approach is related to prior work that considers the importance of state abstraction and the decomposition of the learning problem KONIDARIS20191; skill_learning_abs; abs_irrelevant. Notable methods of state abstraction include PAC state abstraction, which achieves correct clustering with high probability with respect to a distribution over learning problems pmlrv80abel18a. This abstraction method can have limited applicability to deep RL methods such as CRADOL. casual_state_rep has been able to learn task-agnostic state abstractions by identifying causal states in the POMDP setting, whereas our approach considers discovering abstractions for subtask-specific learning. abstraction_kaelbling introduces the importance of abstraction in planning problems, with chitnis2020camps performing a context-specific abstraction for the purposes of decomposition in planning-related tasks. CRADOL extends this work by exploring context-specific abstraction in HRL.
Evaluation
We demonstrate CRADOL’s efficacy on a diverse suite of domains. The code is available at https://git.io/JucVH and the videos are available at https://bit.ly/3tpJc8Z. We explain further details on experimental settings, including domains and hyperparameters, in the appendix.
Algorithm  Temporal Abstraction  State Abstraction  MiniGrid Empty (final performance, AUC)  MiniGrid MultiRoom (final performance, AUC)  MiniGrid KeyDoor (final performance, AUC)
A2C  ✗  ✗
SAC  ✗  ✗
OC  ✓  ✗
A2C-RIM  ✗  ✓
CRADOL  ✓  ✓
Table 1: Final task-level performance and Area under the Curve (AUC) in MiniGrid domains. The table shows the mean and standard deviation computed with 10 random seeds. Best results are in bold (computed by a t-test). Note that CRADOL has the highest final performance and AUC compared to non-HRL (A2C, SAC), HRL (OC), and modular recurrent neural network (A2C-RIM) baselines.
Experimental Setup
Domains.
We demonstrate the performance of our approach with domains shown in Figure 3. MiniGrid domains are well suited for hierarchical learning due to the natural emergence of skills needed to accomplish the goal. Moving Bandits considers the performance of CRADOL with extraneous features in sparse reward settings. Lastly, the Reacher domain observes the effects of CRADOL on low-level observation representations.

MiniGrid gym_minigrid: A library of open-source gridworld domains in sparse reward settings with image observations. Each grid cell contains zero or one object, with possible object types such as the wall, door, key, ball, box, and goal indicated by different colors. The goal for the domain can vary from obtaining a key to matching similar colored objects. The agent receives a sparse reward of 1 when it successfully reaches the green goal tile, and 0 for failure.

Moving Bandit moving_bandit: This 2D sparse reward setting considers a number of marked positions in the environment that change at random at every episode, with 1 of the positions being the correct goal. An agent receives a reward of 1 and terminates when the agent is close to the correct goal position, and receives 0 otherwise.

Reacher brockman2016openai: In this simulated MuJoCo task from OpenAI Gym, a robot arm consisting of 2 linkages of equal length must reach a red target placed randomly at the beginning of each episode. We modify the domain to be a sparse reward setting: the agent receives a reward signal of 1 when its Euclidean distance to the target is within a threshold, and 0 otherwise.
Baselines.
We compare CRADOL to the following non-hierarchical, hierarchical, and modular recurrent neural network baselines:

A2C a2c: This on-policy method considers neither context-specific nor temporal abstraction.

SAC haarnoja2018soft: This entropy-maximization off-policy method considers neither context-specific nor temporal abstraction.

OC bacon2016optioncritic: We consider an off-policy implementation of OC based on the SAC method to demonstrate the performance of a hierarchical method considering only temporal abstraction.

A2C-RIM RIMs: This method considers A2C with recurrent independent mechanisms, a baseline that allows us to observe the performance of a method employing context-specific abstraction only.
Results
Question 1.
Does context-specific abstraction help achieve sample efficiency?
To answer this question, we compare performance in the MiniGrid domains. With a sparse reward function and dense belief state representation (i.e., image observation), MiniGrid provides the opportunity to test the temporal and context-specific abstraction capabilities of our method. Table 1 shows both final task-level performance (i.e., final episodic average reward measured at the end of learning) and area under the learning curve (AUC). Higher values indicate better results for both metrics.
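For readers who want to compute comparable summary metrics, the sketch below shows one plausible way to do so. The exact procedure used for Table 1 is not specified in this section, so the trapezoidal rule, the averaging window for final performance, and the learning curves themselves are assumptions for illustration and are not results from this work.

```python
import statistics


def area_under_curve(rewards, dx=1.0):
    """Trapezoidal area under a learning curve sampled every dx episodes."""
    return sum((a + b) / 2.0 * dx for a, b in zip(rewards, rewards[1:]))


def final_performance(rewards, window=2):
    """Mean reward over the last `window` evaluation points."""
    return statistics.mean(rewards[-window:])


# Hypothetical learning curves for three seeds (not results from the paper).
curves = [
    [0.0, 0.2, 0.6, 0.8, 0.9],
    [0.0, 0.1, 0.5, 0.9, 0.9],
    [0.0, 0.3, 0.7, 0.8, 1.0],
]

aucs = [area_under_curve(c) for c in curves]
finals = [final_performance(c) for c in curves]
print(statistics.mean(aucs), statistics.stdev(aucs))
print(statistics.mean(finals), statistics.stdev(finals))
```

Reporting the mean and sample standard deviation across seeds, as done here with `statistics.mean` and `statistics.stdev`, follows the convention described for Table 1.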
Overall, for all three scenarios of MiniGrid, CRADOL has higher final performance and AUC than the baselines. We observe that OC has lower performance than CRADOL because its learned options fail to diversify when considering the entire belief state space and because each option has a high termination probability. Both A2C and SAC result in suboptimal performance due to their failure in sparse reward settings. Finally, due to inefficient exploration and the large training time required for A2C-RIM to converge, it is unable to match the performance of CRADOL. We see a smaller difference between CRADOL and these baselines for the Empty domain, as less context-specific abstraction is required in this simplest setting of the MiniGrid domain. For the MultiRoom domain, which is more difficult than the Empty domain, there is a larger gap, as the agent needs to consider only the belief states in one room when trying to exit that room and the belief states of the other room when trying to reach the green goal. Lastly, the most abstraction is required for the KeyDoor domain, where the baselines are unable to match the performance of CRADOL. As described in Figure 1, OC equipped with only temporal abstraction is unable to match the performance of CRADOL, which combines temporal and context-specific abstraction.
Question 2.
What does it mean to have diverse options?
We visualize the behaviors of options for temporal abstraction and mechanisms for context-specific abstraction to further understand whether options are able to learn diverse subpolicies. Figure 3(a) shows the option trajectory in the DoorKey domain for the following (learned) subtasks: getting the key, opening the door, and going to the door. We find that each option is only activated for one subtask. Figure 3(b) shows the mapping between options and mechanisms, and we see that each option maps to a unique subset of mechanisms. To understand whether these mechanisms have mapped to different factors of the belief state (and hence have diverse parameters), Figure 3(c) computes the correlation between options, measured by the Pearson product-moment correlation method freedman2007statistics. We find low correlation between options 1 & 2 and options 2 & 3 but higher correlation between options 1 & 3. Specifically, we observe a high correlation between options 1 & 3 in getting the key and opening the door due to the shared states between them, because opening the door is highly dependent on obtaining the key in the environment. This visualization empirically supports our hypothesis that both temporal and context-specific abstraction are necessary to learn diverse and complementary option policies.
Question 3.
How does performance change as the need for context-specific abstraction increases?
In order to understand the full benefits of context-specific abstraction, we observe the performance with increasing context-specific abstraction in the Moving Bandits domain. We consider the performance determined by AUC for an increasing number of spurious features in the observation. Namely, we add 3 and 23 additional goals to the original 2-goal observation to test the capabilities of CRADOL (see Figure 3). Increasing the number of goals requires an increasing amount of context-specific abstraction, as there are more spurious features the agent must ignore to learn to move to the correct goal location as indicated in its observation. As shown in Figure 5, CRADOL performs significantly better than OC as the number of spurious goal locations it must ignore increases. We expect that this result is due to CRADOL’s capability for both temporal and context-specific abstraction.
Question 4.
What is shared between options in the context of mechanisms?
As described in Figure 2, there are 4 components that make up the option learning model. In order to investigate which are necessary to share between options, we perform an ablation study with a subset of the available choices shown in Figure 6. Specifically, we study sharing of the lookup table in the input attention module between options (CRADOL-JointP), learning a separate parameter between options (CRADOL-SepV), learning separate parameters for (, ) of the sparse communication layer for each option (CRADOL-SepComm), and three other combinations of these three modules.
We find the lowest performance for the method with a separate sparse communication module for each option. We hypothesize that this is due to a lack of coordination between options in updating their active mechanisms and in ensuring that other non-active mechanisms learn separate and complementary parameters. Having a joint lookup table results in the second-lowest performance. This effectively maps each option to the same set of mechanisms, leading to a lack of diversity between option policies and only allowing for differences in the fully connected layer of each option. Lastly, we observe the third-lowest performance with a separate parameter between options. Each option learning from a different representation can lead to similar option policies and to factored belief states for certain subgoals being unaccounted for. Other combinations reaffirm that sharing the sparse communication layer is essential for coordinated behavior when learning option policies.
Question 5.
Is context-specific abstraction always beneficial?
The Reacher domain allows us to observe the effects of our method when there is little benefit from reducing the problem size, namely, when the observation is very small. This domain does not require context-specific abstraction, as the entire belief state space consists of information relevant to achieving the goal at hand. Specifically, the low-level observation is a 4-element vector, with the first 2 elements containing the generalized positions and the next 2 elements containing the generalized velocities of the two arm joints, all of which are essential to reach the goal location. As expected, Figure 7 shows that the performance of CRADOL and OC is similar in this domain, as the observation space does not contain any features that are useful for CRADOL to perform context-specific abstraction. We note that our gains are larger for problems with larger intrinsic dimensionality.
Conclusion
In this paper, we have introduced Context-Specific Representation Abstraction for Deep Option Learning (CRADOL) for incorporating both temporal and state abstraction for the effective decomposition of a problem into simpler components. The key idea underlying our proposed formulation is to map each option to a reduced state space, effectively considering each option as a subset of mechanisms. We evaluated our method on several benchmarks with varying levels of required abstraction in sparse reward settings. Our results indicate that CRADOL is able to decompose the problems at hand more effectively than state-of-the-art HRL, non-HRL, and modular recurrent neural network baselines. We hope that our work can help provide the community with a theoretical foundation to build on for addressing the deficiencies in HRL methods.
Acknowledgements
Research funded by IBM, Samsung (as part of the MIT-IBM Watson AI Lab initiative) and computational support through Amazon Web Services.
References
Algorithm
Option Learning Gradients
Given a set of Markov options with stochastic intra-option policies differentiable in their parameters , we denote the gradient of the expected discounted return with respect to and initial condition :
(8) 
where denotes the discounted weighting of along trajectories originating from , with derived as denoted in Figure 2 and . We expand the option-value function upon arrival with state abstraction as:
(9) 
where equals to:
(10) 
Regarding the termination function, the gradient of the expected discounted return objective with respect to and the initial condition is:
(11) 
where denotes the discounted weighting of along trajectories originating from , and is the advantage function over options.
Additional Domain Details
MiniGrid:
The observation is a partially observable view of the gridworld environment using a compact and efficient encoding, with 3 input values per visible grid cell and 7x7x3 values in total. For the Empty Room domain, an agent is randomly initialized at the start of each episode and must learn to navigate to the green goal location. For the MultiRoom domain, a randomly initialized agent must learn to go through the green door to the green goal. Lastly, we test on the KeyDoor domain in MiniGrid as described in the motivation. For the latter two domains, the structure of the grid also changes at the start of every episode. Code for this domain can be found here: https://github.com/maximecb/gym-minigrid.
Moving Bandit:
We modify this domain’s termination condition to simulate sparse reward settings. Specifically, we terminate when the agent has reached the goal location, receiving a reward of 1. Code for this domain can be found at: https://github.com/maximilianigl/rl-msol.
Reacher:
To consider sparse reward settings, we make a minor change to the reward signal in this domain. A tolerance is specified to create a terminal state when the end effector approximately reaches the target. The tolerance creates a circular area centered at the target, which we specify as 0.003. We use the code from OpenAI Gym.
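The sparse reward modification can be sketched as a small helper. This is an illustrative assumption about how the check might be written, not the code used in the experiments: the function name is hypothetical, positions are assumed to be 2-D coordinates, and the boundary of the circle is treated as inside the tolerance.

```python
import math


def sparse_reach_reward(effector_xy, target_xy, tolerance=0.003):
    """Return (reward, done): reward 1 and termination inside the tolerance circle."""
    distance = math.dist(effector_xy, target_xy)
    reached = distance <= tolerance
    return (1.0 if reached else 0.0), reached


# Hypothetical end-effector and target positions.
print(sparse_reach_reward((0.100, 0.100), (0.101, 0.102)))
print(sparse_reach_reward((0.100, 0.100), (0.150, 0.100)))
```

Returning the termination flag together with the reward keeps the terminal-state logic and the reward signal tied to the same distance check.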
Additional Experiment Details
We use the PyTorch library and a GeForce RTX 3090 graphics card for running experiments. For specific versions of software, see the source code. We report the hyperparameter values used across all methods, as well as episodic reward performance plots, below.
Hyperparameter  Value 
Batch Size  100 
Learning Rate  0.0003 
Entropy Weight  0.005 
Options  3 
Mechanisms  4 
Top-k  3 
Hidden Size per RNN  6 
Value Size  32 
Discount Factor  0.95 
Hyperparameter  Value 
Batch Size  100 
Learning Rate  0.0005 
Entropy Weight  0.001 
Options  3 
Mechanisms  4 
Top-k  3 
Hidden Size per RNN  6 
Value Size  32 
Discount Factor  0.95 
Hyperparameter  Value 
Batch Size  100 
Learning Rate  0.005 
Entropy Weight  0.005 
Options  3 
Mechanisms  4 
Top-k  3 
Hidden Size per RNN  6 
Value Size  16 
Discount Factor  0.95 
Hyperparameter  Value 
Batch Size  100 
Learning Rate  0.001 
Entropy Weight  0.001 
Options  3 
Mechanisms  6 
Top-k  4 
Hidden Size per RNN  6 
Value Size  64 
Discount Factor  0.95 