Context-Specific Representation Abstraction for Deep Option Learning

09/20/2021
by Marwa Abdulhai, et al. (MIT, IBM)

Hierarchical reinforcement learning has focused on discovering temporally extended actions, such as options, that can provide benefits in problems requiring extensive exploration. One promising approach that learns these options end-to-end is the option-critic (OC) framework. We examine and show in this paper that OC does not decompose a problem into simpler sub-problems, but instead increases the size of the search over policy space with each option considering the entire state space during learning. This issue can result in practical limitations of this method, including sample inefficient learning. To address this problem, we introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space. Specifically, our method learns a factored belief state representation that enables each option to learn a policy over only a subsection of the state space. We test our method against hierarchical, non-hierarchical, and modular recurrent neural network baselines, demonstrating significant sample efficiency improvements in challenging partially observable environments.


Introduction

Hierarchical reinforcement learning (HRL) provides a principled framework for decomposing problems into natural hierarchical structures BerliacHierachialRL2019. By factoring a complex task into simpler sub-tasks, HRL can offer benefits over non-HRL approaches in solving challenging large-scale problems efficiently. Much of HRL research has focused on discovering temporally extended high-level actions, such as options, that can be used to improve learning and planning efficiency sutton1991; doina_thesis. Notably, the option-critic (OC) framework has become popular because it can learn deep hierarchies of both option policies and termination conditions from scratch without domain-specific knowledge (bacon2016optioncritic; riemer18options). However, while OC provides a theoretically grounded framework for HRL, it is known to exhibit common failure cases in practice including lack of option diversity, short option duration, and large sample complexity diversity.

Our key insight in this work is that OC suffers from these problems at least partly because it considers the entire state space during option learning and thus fails to reduce problem complexity as intended. For example, consider the partially observable maze domain in Figure 1. The objective in this domain is to pick up the key and unlock the door to reach the green goal. A non-HRL learning problem in this setting can be viewed as a search over the space of deterministic policies, which has size |A|^{|B|}, where |A| and |B| denote the size of the action space and belief state space, respectively. Frameworks such as OC consider the entire state space or belief state when learning options, which can themselves be viewed as policies over the entire state space, naively resulting in an increased learning problem size of |O| · |A|^{|B|}, where |O| denotes the number of options. Note that this does not yet account for the extra complexity of learning policies to select over and switch between options. Unfortunately, because |O| > 1, learning options in this way can only increase the effective complexity of the learning problem.

Figure 1: Temporal abstraction in this maze consists of 3 options: getting the key, opening the door, and going to the green goal. With context-specific belief state abstraction, the agent should only focus on the parts of the belief state pertaining to the highlighted sub-task. Equipped with both temporal and context-specific abstraction, our method achieves a significant reduction in the learning problem size compared to considering the entire belief state space for each option, as highlighted in Remark 1.

To address the problem complexity issue, we propose to consider both temporal abstraction, in the form of high-level skills, and context-specific abstraction over factored belief states, in the form of latent representations, to learn options that can fully benefit from hierarchical learning. Considering the maze example again, the agent does not need to consider the states involved in going to the goal when trying to get the key (see Option 1 in Figure 1). More generally, the agent only needs to consider the states that are relevant to solving a specific sub-task. This context-specific abstraction leads to the desired reduction in problem complexity, and an agent can mutually benefit from having both context-specific and temporal abstraction for hierarchical learning.

Remark 1

With context-specific abstraction for each option, the learning problem size can be reduced such that:

∑_{o ∈ O} |A|^{|B_o|} < |A|^{|B|}     (1)

where B_o ⊂ B denotes the subset of the belief state space considered by each option (or sub-policy) o and |B_o| < |B|. We further illustrate this idea in the tree diagram of Figure 1, where each option maps to only a subset of the state space.
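
As a concrete, purely illustrative instantiation of Equation (1), suppose a hypothetical problem with |A| = 4 actions, |B| = 4096 joint belief-state assignments, and 3 options that each attend to only |B_o| = 16 assignments; the numbers are invented solely to show the scale of the reduction:

```latex
% Hypothetical numbers, not taken from the paper's experiments.
\sum_{o \in O} |A|^{|B_o|} = 3 \cdot 4^{16} \approx 1.3 \times 10^{10}
\quad \ll \quad
|A|^{|B|} = 4^{4096} \approx 10^{2466}.
```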

Contribution.

With this insight, we introduce a new option learning framework: Context-Specific Representation Abstraction for Deep Option Learning (CRADOL).

CRADOL provides options with the flexibility to select over dynamically learned abstracted components. This allows effective decomposition of problem size when scaling up to large state and action spaces, leading to faster learning. Our experiments demonstrate how CRADOL can improve performance over key baselines in partially observable settings.

Problem Setting and Notation

Partially Observable Markov Decision Process.

An agent's interaction with the environment can be represented by a partially observable Markov decision process (POMDP), formally defined as a tuple ⟨S, A, T, R, Ω, O, γ⟩ POMDPs. S is the state space, A is the action space, T(s′ | s, a) is the state transition probability function, R(s, a) is the reward function, Ω is the observation space, O is the observation probability function, and γ ∈ [0, 1) is the discount factor. An agent executes an action a according to its stochastic policy π_θ(a | b) parameterized by θ, where b ∈ B denotes the belief state and the agent uses the observation history up to the current timestep to form its belief. Each observation ω ∈ Ω is generated according to O(ω | s). An action a yields a transition from the current state s to the next state s′ with probability T(s′ | s, a), and the agent obtains a reward r = R(s, a). The agent's goal is to find the deterministic optimal policy π* that maximizes its expected discounted return E[∑_t γ^t r_t].

Option-Critic Framework.

A Markovian option o ∈ O consists of a triple ⟨I_o, π_o, β_o⟩, where I_o ⊆ S is the initiation set, π_o is the intra-option policy, and β_o is the option termination function sutton1991. Similar to other option discovery methods Mankowitz2016AdaptiveSA; determine_options_daniel; NIPS2016_c4492cbe, we also assume that all options are available in each state. MDPs with options become semi-MDPs smdp with an associated optimal value function over options V_Ω(s) and option-value function Q_Ω(s, o). bacon2016optioncritic introduces a gradient-based framework for learning the sub-policies represented by options. An option o is selected according to a policy over options π_Ω, and π_o selects primitive actions until the termination of the option (as determined by β_o), which triggers a repetition of this procedure. The framework also models an auxiliary value function Q_U(s, o, a) for computing the gradient of π_o.

Factored POMDP.

In this work, we further assume that the environment structure can be modeled as a factored MDP in which the state space is factored into n variables: S = X_1 × … × X_n, where each X_i takes on values in a finite domain Dom(X_i) Guestrin_2003. Under this assumption, a state s ∈ S is an assignment of values to the set of state variables, s = (x_1, …, x_n), such that x_i ∈ Dom(X_i). These variables are also assumed to be largely independent of each other, with only a small number of causal parents to consider at each timestep, resulting in the following simplification of the transition function:

T(s′ | s, a) = ∏_{i=1}^{n} P(x′_i | Parents(x′_i), a)     (2)

This implies that in POMDP settings we can match the factored structure of the state space by representing our belief state in a factored form as well, with b = (b_1, …, b_n) and each b_i corresponding to the state variable X_i. Another consequence of the factored MDP assumption is context-specific independence.
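
The following minimal sketch illustrates the product form of the factored transition in Equation (2); the variables, parent sets, and stand-in conditional probabilities are hypothetical and are not taken from the paper:

```python
# Hypothetical factored transition model: each next-state variable depends only
# on a small set of causal parents, so P(s'|s,a) factors into a product.
parents = {
    "key_held": ["key_held", "agent_pos"],    # assumed causal parents
    "door_open": ["door_open", "key_held"],
}

def cond_prob(var, value, parent_values, action):
    # Stand-in for P(x_i' = value | parents(x_i'), a); a real model would use a
    # learned or given conditional probability table.
    persists = parent_values.get(var, value) == value
    return 0.9 if persists else 0.1

def transition_prob(next_state, state, action):
    # P(s' | s, a) = prod_i P(x_i' | parents(x_i'), a)
    prob = 1.0
    for var, value in next_state.items():
        parent_values = {p: state[p] for p in parents[var]}
        prob *= cond_prob(var, value, parent_values, action)
    return prob

s = {"key_held": False, "door_open": False, "agent_pos": 0}
s_next = {"key_held": True, "door_open": False}
print(transition_prob(s_next, s, action="pickup"))  # 0.1 * 0.9 = 0.09
```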

Context-Specific Independence.

At any given timestep, only a small subset of the factored state space may be relevant to the agent. As per context_definition, we define this subset to be a context. Formally, a context is a pair ⟨C, U⟩, where C is a subset of the state space variables and U is the space of possible joint assignments of the variables in C. A state is in the context when its joint assignment of the variables in C is present in U. Two variables X_i and X_j are contextually independent under ⟨C, U⟩ if Pr(X_i | X_j, ⟨C, U⟩) = Pr(X_i | ⟨C, U⟩); that is, once the context holds, X_j provides no further information about X_i. This independence relation is referred to as context-specific independence (CSI) (chitnis2020camps). CSI in the state space also implies that this same structure can be leveraged in the belief state for POMDP problems.
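
To make the definition concrete, the toy distribution below (entirely hypothetical) exhibits context-specific independence: within the context where the key is not held, the door variable is independent of the goal-visibility variable, even though the two are dependent unconditionally:

```python
from itertools import product

# Hypothetical joint distribution over (key_held, door_open, goal_visible).
joint = {}
for key, door, goal in product([False, True], repeat=3):
    if not key:                       # context: without the key, the door stays closed
        p_val = 0.25 if not door else 0.0
    elif goal:
        p_val = 0.20 if door else 0.05
    else:
        p_val = 0.05 if door else 0.20
    joint[(key, door, goal)] = p_val

def p_door(door, goal=None, key=None):
    """P(door_open = door | goal_visible, key_held), with None meaning 'unconditioned'."""
    num = sum(v for (k, d, g), v in joint.items()
              if d == door and (goal is None or g == goal) and (key is None or k == key))
    den = sum(v for (k, d, g), v in joint.items()
              if (goal is None or g == goal) and (key is None or k == key))
    return num / den

# Within the context key_held = False, door_open is independent of goal_visible...
assert abs(p_door(True, goal=True, key=False) - p_door(True, key=False)) < 1e-9
# ...but unconditionally the two variables are dependent.
print(p_door(True, goal=True), p_door(True))   # 0.4 vs 0.25
```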

Context-Specific Belief State Abstractions for Option-Critic Learning

We now outline how factored state space structure and context-specific independence can be utilized by OC to decrease the size of the search problem over policy space. We first note that the size of the search problem over the policy space of OC can be decomposed into the sum of the search problem sizes for each sub-policy within the architecture:

|O|^{|B|} + ∑_{o ∈ O} ( |A|^{|B|} + 2^{|B|} )     (3)

where the three terms correspond to the policy over options π_Ω, the intra-option policies π_o, and the (binary) termination policies β_o, respectively. OC can provide a decrease in the size of the learning problem over the flat RL method only when this quantity is smaller than |A|^{|B|}. This decrease is possible if an agent leverages context-specific independence relationships with respect to each option o such that only a subset of the belief state variables B_o ⊂ B is considered, with |B_o| ≪ |B| because the belief state space gets exponentially smaller with each removed variable. The abstracted belief state b_o is then sent as an input to the intra-option policy π_o and termination policy β_o, dictating the size of the learning problem for each option:

|A|^{|B_o|} + 2^{|B_o|}     (4)

However, a challenge is that we must also consider the size of the learning problem for the policy over options π_Ω. π_Ω cannot simply focus on an abstracted belief state in the context of an individual option because it must reason about the impact of all options. Naively sending the full belief state to π_Ω is unlikely to make the learning problem smaller for a large class of problems, because we cannot achieve gains over the flat policy in this case if |O| is of comparable or greater size than |A|. In this work, we address this issue by sending the policy over options its own compact abstracted belief state b^Ω, where |B^Ω| ≪ |B|. As a result, the learning problem size of π_Ω is |O|^{|B^Ω|} and the total problem size for CRADOL is:

|O|^{|B^Ω|} + ∑_{o ∈ O} ( |A|^{|B_o|} + 2^{|B_o|} )     (5)

Note that B^Ω represents a different kind of belief state abstraction than B_o in that it must consider all factors of the state space. B^Ω, however, should also contain much less detail than B for each factor: π_Ω is only concerned with the higher-level semi-MDP problem, which operates at a much slower time scale when options last for significant lengths before terminating, and does not need to reason about the intricate details needed to decide on individual actions at each timestep. As a result, we consider settings where |B_o| ≪ |B| and |B^Ω| ≪ |B|, ensuring that an agent leveraging our CRADOL framework solves a smaller learning problem than either flat RL or OC as traditionally applied. Finally, the auxiliary value function Q_U, which is used to compute the gradient of π_o, must consider rewards accumulated across options, similarly to π_Ω. Thus, we also send the compact abstracted belief state b^Ω as an input to Q_U so that the resulting problem size remains small.
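
To illustrate the comparison between Equations (3) and (5), the short calculation below plugs in hypothetical sizes (|A| = 4, |B| = 64, |O| = 3, |B_o| = |B^Ω| = 8); the exact counts are astronomically large, so only their order of magnitude is printed:

```python
# Hypothetical problem sizes; chosen only to show relative scale.
A, B, O = 4, 64, 3              # |A|, |B|, |O|
B_o, B_omega = 8, 8             # |B_o| per option, |B^Omega| for the policy over options

flat = A ** B                                                        # flat RL: |A|^|B|
oc = O ** B + sum(A ** B + 2 ** B for _ in range(O))                 # Equation (3)
cradol = O ** B_omega + sum(A ** B_o + 2 ** B_o for _ in range(O))   # Equation (5)

for name, size in [("flat RL", flat), ("OC", oc), ("CRADOL", cradol)]:
    print(f"{name:8s} ~ 10^{len(str(size)) - 1}")
```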

Figure 2: Architecture for the intra-option policy in Context-Specific Representation Abstraction for Deep Option Learning (CRADOL). The left side gives an overview of CRADOL, and the right side details the option learning process.

Learning Context-Specific Abstractions with Recurrent Independent Mechanisms

In this section, we present CRADOL, a new framework based on the theoretical motivation for applying abstractions over belief state representations outlined in the previous section. In particular, our approach leverages recurrent independent mechanisms (RIMs) to model factored context-specific belief state abstractions for deep option learning.

Algorithm Overview

Figure 2 shows a high-level overview of CRADOL. CRADOL uses small LSTM networks to provide compact abstracted belief representations for the policy over options and the option-value functions. It additionally leverages a set of mechanisms to model the factored belief state, while sending only a subset of this representation to each option during its operation.

Option Learning

Our framework for option learning broadly consists of four modules: option attention, a recurrent layer, sparse communication, and a fully connected module (see Figure 2). We choose to model each option as a group of mechanisms because these mechanisms can learn to specialize into sub-components following the general structure assumed in the environment for factored POMDPs. This in turn allows our learned options to self-organize into groupings of belief state components.

Option Attention Module.

The option attention module chooses which k mechanisms out of the M available mechanisms (k < M) are activated for option o by passing the option through a look-up table. The look-up table ensures a fixed context selection over time (i.e., a fixed mapping to a subset of the belief state space) for each option, as considering many different subsets with the same option would be effectively equivalent to considering the entire belief state space. This would lead to a large search over policy space, as in OC. To yield the output of the attention module, we apply the following:

(6)

where the look-up table provides the attention values for the selected option and a learned value weight matrix (of a fixed value size) projects them to the module output.

We note that a top-k operation is then performed such that only k mechanism components are selected from the M available mechanisms (mechanisms that are not selected are zero-masked), which enables an option to operate over an exponentially reduced belief state space by using only a subset of mechanisms.
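
A minimal sketch of the option attention step described above, written in PyTorch under several assumptions (the module name, the embedding-based look-up table, and the shape of the value projection are ours, not the authors'):

```python
import torch
import torch.nn as nn

class OptionAttention(nn.Module):
    """Maps an option index to a fixed top-k subset of mechanisms (hypothetical sketch)."""
    def __init__(self, n_options: int, n_mechanisms: int, top_k: int, value_size: int):
        super().__init__()
        # Look-up table of per-option attention logits over the M mechanisms.
        self.table = nn.Embedding(n_options, n_mechanisms)
        self.value = nn.Linear(n_mechanisms, value_size)   # assumed form of the value weight
        self.top_k = top_k

    def forward(self, option_idx: torch.Tensor):
        attn = torch.softmax(self.table(option_idx), dim=-1)          # attention values, (batch, M)
        top = torch.topk(attn, self.top_k, dim=-1)
        mask = torch.zeros_like(attn).scatter_(-1, top.indices, 1.0)  # zero-mask unselected mechanisms
        return self.value(attn * mask), mask

# Example with the appendix hyperparameters (3 options, 4 mechanisms, top-k = 3, value size 32).
attention = OptionAttention(n_options=3, n_mechanisms=4, top_k=3, value_size=32)
out, mask = attention(torch.tensor([0, 2]))
print(out.shape, mask)   # torch.Size([2, 32]); one zeroed mechanism per option
```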

Recurrent Layer Module.

The recurrent layer module updates the hidden representations of all active mechanisms. Here, we have a separate RNN for each of the M mechanisms, with all RNNs shared across all options. The recurrent operation is performed for each active mechanism separately, taking as input the output of the attention module and the previous RNN hidden state, from which we obtain the output of the recurrent layer module for all mechanisms (with a fixed RNN hidden size).

Sparse Communication Module.

The sparse communication module enables mechanisms to share information with one another, facilitating coordination in learning amongst the options. Specifically, all active mechanisms can read hidden representations from both active and inactive mechanisms, with only the active mechanisms updating their hidden states. This module outputs the context-specific belief state b_o given the output of the recurrent layer module as input:

(7)

where the communication parameters consist of query, key, and value weights with a fixed communication key size.

Equation 7 is similar to the communication step in RIMs. We only update the top-k mechanisms selected by the option attention module, but through this transfer of information each active mechanism can update its internal belief state and benefit from improved exploration through contextualization. We note that there are many choices of which components to share across options; we explore these choices and their implications in our empirical analysis.

Fully Connected Module.

Lastly, we have a separate fully connected layer for each option-specific intra-option policy π_o and termination policy β_o, which take the context-specific belief state b_o as input. The intra-option policy provides the final output action, and the termination policy determines whether the current option should terminate.
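
Putting the four modules together, the following is a compact and heavily simplified PyTorch sketch of one option-conditioned forward step (option attention, per-mechanism recurrent update, sparse communication, and the per-option heads). The class name, the GRU choice, the dot-product attention parameterization, and the flattened-state heads are our assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class OptionLearner(nn.Module):
    def __init__(self, obs_dim, n_options, n_mechanisms, top_k, hidden, n_actions, key_size=16):
        super().__init__()
        self.M, self.k = n_mechanisms, top_k
        self.table = nn.Embedding(n_options, n_mechanisms)            # option -> mechanism logits
        # Recurrent layer: one small RNN per mechanism, shared across all options.
        self.rnns = nn.ModuleList([nn.GRUCell(obs_dim, hidden) for _ in range(n_mechanisms)])
        # Sparse communication: dot-product attention over mechanism hidden states.
        self.q = nn.Linear(hidden, key_size)
        self.kk = nn.Linear(hidden, key_size)
        self.v = nn.Linear(hidden, hidden)
        # Separate fully connected heads per option for the intra-option and termination policies.
        self.pi = nn.ModuleList([nn.Linear(n_mechanisms * hidden, n_actions) for _ in range(n_options)])
        self.beta = nn.ModuleList([nn.Linear(n_mechanisms * hidden, 1) for _ in range(n_options)])

    def forward(self, obs, option, h):                                # obs: (obs_dim,), h: (M, hidden)
        # 1) Option attention: choose the top-k mechanisms for this option.
        attn = torch.softmax(self.table.weight[option], dim=-1)
        active = torch.topk(attn, self.k).indices
        mask = torch.zeros(self.M).scatter_(0, active, 1.0)
        # 2) Recurrent layer: only active mechanisms update their hidden states.
        new_h = torch.stack([
            self.rnns[m](obs.unsqueeze(0), h[m].unsqueeze(0)).squeeze(0) if mask[m] > 0 else h[m]
            for m in range(self.M)
        ])
        # 3) Sparse communication: active mechanisms read from all mechanisms.
        scores = self.q(new_h) @ self.kk(new_h).t() / (self.q.out_features ** 0.5)
        comm = torch.softmax(scores, dim=-1) @ self.v(new_h)
        b_o = torch.where(mask.unsqueeze(-1) > 0, new_h + comm, new_h)  # context-specific belief state
        # 4) Fully connected heads for the selected option.
        action_logits = self.pi[option](b_o.flatten())
        termination_prob = torch.sigmoid(self.beta[option](b_o.flatten()))
        return action_logits, termination_prob, b_o

model = OptionLearner(obs_dim=10, n_options=3, n_mechanisms=4, top_k=3, hidden=6, n_actions=7)
logits, term, b_o = model(torch.randn(10), option=1, h=torch.zeros(4, 6))
```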

Implementation

We first describe our choice for representing factored belief states. Under the state uniformity assumption discussed in non_markov_process, we assume that the optimal policy network conditioned on the agent's history is equivalent to the network conditioned on the true state space. Hence, we refer to the learned representation as an implicit belief state. More explicit methods for modeling belief states have been considered, for example, as outlined in igl2018deep. While the CRADOL framework is certainly compatible with this kind of explicit belief state modeling, we have chosen to implement the belief state implicitly and denote it as the factored belief state in order to have a fair empirical comparison to the RIMs method RIMs that we build on.

Our implementation draws inspiration from soft actor-critic (SAC) haarnoja2018soft to enable sample-efficient off-policy optimization of the option-value functions, intra-option policies, and termination (beta) policies. In the appendix, we describe the implementation of our approach (see Algorithm 1) and additional details about the architecture.

Related Work

Hierarchical RL.

There have been two major high-level approaches in the recent literature for achieving efficient HRL: the options framework, which learns abstract skills sutton1991; bacon2016optioncritic; omidshafiei18casl; riemer18options; riemer2020role, and goal-conditioned hierarchies, which learn abstract sub-goals nachum2019nearoptimal; Levy2017HierarchicalA; kulkarni; kim20hmat. Goal-conditioned HRL approaches require a pre-defined representation of the environment and a mapping from observation space to goal space, whereas the OC framework facilitates long-timescale credit assignment by dividing the problem into pieces and learning higher-level skills. Our work is most similar to khetarpal_options_interest, which learns a smaller initiation set through an attention mechanism to better focus an option on a region of the state space and hence achieve specialization of skills. However, it does not leverage this focus to abstract and reduce the size of the belief space as CRADOL does. We consider a more explicit method of specialization by leveraging context-specific independence for representation abstraction. Our approach could also potentially incorporate learned initiation sets, so we consider the two contributions orthogonal.

While OC provides a temporal decomposition of the problem, other approaches such as Feudal Networks feudalRL decompose problems with respect to the state space. Feudal approaches use a manager to learn more abstract goals over a longer temporal scale and worker modules to perform more primitive actions with fine temporal resolution. Our approach employs a combination of both visions, necessitating both temporal and state abstraction for effective decomposition. Although some approaches such as the MAXQ framework maxq employ both, they involve learning recursively optimal policies that can be highly sub-optimal BerliacHierachialRL2019.

Figure 3: Visualization of MiniGrid gym_minigrid, Moving Bandit moving_bandit, and Reacher brockman2016openai domains.

State Abstractions.

Our approach is related to prior work that considers the importance of state abstraction and the decomposition of the learning problem KONIDARIS20191; skill_learning_abs; abs_irrelevant. Notable methods of state abstraction include PAC state abstraction, which achieves correct clustering with high probability with respect to a distribution over learning problems pmlr-v80-abel18a; this abstraction method has limited applicability to deep RL methods such as CRADOL. casual_state_rep learns task-agnostic state abstractions by identifying causal states in the POMDP setting, whereas our approach discovers abstractions for sub-task-specific learning. abstraction_kaelbling highlights the importance of abstraction in planning problems, and chitnis2020camps performs context-specific abstraction for the purpose of decomposition in planning tasks. CRADOL extends this line of work by exploring context-specific abstraction in HRL.

Evaluation

We demonstrate CRADOL's efficacy on a diverse suite of domains. The code is available at https://git.io/JucVH and the videos are available at https://bit.ly/3tpJc8Z. We explain further details on the experimental settings, including domains and hyperparameters, in the appendix.

Table 1: Final episodic reward and area under the learning curve (AUC) in the MiniGrid domains (Empty, MultiRoom, KeyDoor) for A2C, SAC, OC, A2C-RIM, and CRADOL, together with whether each method uses temporal and/or state abstraction. The table shows mean and standard deviation computed with 10 random seeds; best results in bold (determined by t-test). Note that CRADOL has the highest final reward and AUC compared to non-HRL (A2C, SAC), HRL (OC), and modular recurrent neural network (A2C-RIM) baselines.

Experimental Setup

Domains.

We demonstrate the performance of our approach on the domains shown in Figure 3. The MiniGrid domains are well suited for hierarchical learning due to the natural emergence of skills needed to accomplish the goal. Moving Bandits tests the performance of CRADOL in the presence of extraneous features in sparse reward settings. Lastly, the Reacher domain examines the effect of CRADOL on low-level observation representations. A minimal environment-construction snippet follows the list below.

  • MiniGrid gym_minigrid: A library of open-source grid-world domains in sparse reward settings with image observations. Each grid cell contains at most one object, with possible object types such as wall, door, key, ball, box, and goal, indicated by different colors. The goal of a domain can vary from obtaining a key to matching similarly colored objects. The agent receives a sparse reward of 1 when it successfully reaches the green goal tile and 0 otherwise.

  • Moving Bandit moving_bandit: This 2D sparse reward setting contains a number of marked positions in the environment that change randomly every episode, with one of the positions being the correct goal. The agent receives a reward of 1 and the episode terminates when the agent is close to the correct goal position; otherwise it receives 0.

  • Reacher brockman2016openai: In this simulated MuJoCo task from OpenAI Gym, a robot arm consisting of two linkages of equal length must reach a red target placed randomly at the beginning of each episode. We modify the domain to be a sparse reward setting: the agent receives a reward of 1 when its Euclidean distance to the target is within a threshold, and 0 otherwise.
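
For reference, the MiniGrid evaluation environments can be instantiated as follows, assuming the gym-minigrid package linked in the appendix is installed; the environment IDs match those in Figures 8-10, and the exact API may differ across library versions:

```python
import gym
import gym_minigrid  # noqa: F401 -- importing registers the MiniGrid environments with gym

for env_id in ["MiniGrid-Empty-Random-6x6-v0",
               "MiniGrid-MultiRoom-N2-S4-v0",
               "MiniGrid-DoorKey-6x6-v0"]:
    env = gym.make(env_id)
    obs = env.reset()
    # The partial egocentric view: obs["image"] is the compact 7x7x3 encoding
    # described in the appendix.
    print(env_id, obs["image"].shape, env.action_space)
    env.close()
```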

Baselines.

We compare CRADOL to the following non-hierarchical, hierarchical, and modular recurrent neural network baselines:

  • A2C a2c: This on-policy method considers neither context-specific nor temporal abstraction.

  • SAC haarnoja2018soft: This entropy-maximization off-policy method considers neither context-specific nor temporal abstraction.

  • OC bacon2016optioncritic: We consider an off-policy implementation of OC based on the SAC method to demonstrate the performance of a hierarchical method considering only temporal abstraction.

  • A2C-RIM RIMs: This method combines A2C with recurrent independent mechanisms, allowing us to observe the performance of a method employing context-specific abstraction only.

Results

(a) Option Trajectory
(b) Mechanisms Selection
(c) Option Correlation
Figure 4: (a) Temporal abstraction via the use of 3 options for the following sub-tasks: option 1 for getting the key, option 2 for opening the door, and option 3 for going to the goal. (b) Each option maps to a unique subset of mechanisms, corresponding to their distinct functions in the domain. (c) There is low correlation between options 1 & 2 and options 2 & 3, but higher correlation between options 1 & 3, corresponding to the shared belief states between them.
Question 1.

Does context-specific abstraction help achieve sample efficiency?

To answer this question, we compare performance in the MiniGrid domains. With a sparse reward function and a dense belief state representation (i.e., image observations), MiniGrid provides the opportunity to test the temporal and context-specific abstraction capabilities of our method. Table 1 shows both final task-level performance (i.e., the final episodic average reward measured at the end of learning) and area under the learning curve (AUC). Higher values indicate better results for both metrics.

Overall, for all three MiniGrid scenarios, CRADOL achieves the highest final reward and AUC among the baselines. We observe that OC performs worse than CRADOL because its options consider the entire belief state space and thus fail to diversify, and because each option has a high termination probability. Both A2C and SAC result in sub-optimal performance due to their failure in sparse reward settings. Finally, due to inefficient exploration and the large training time required for A2C-RIM to converge, it is unable to match the performance of CRADOL. We see a smaller difference between CRADOL and these baselines in the Empty domain, as only a small amount of context-specific abstraction is required in this simplest MiniGrid setting. For the MultiRoom domain, which is more difficult than the EmptyRoom domain, there is an increasingly larger gap, as the agent needs to consider only the belief states of one room when trying to exit that room and the belief states of the other room when trying to reach the green goal. Lastly, we see the most abstraction required in the KeyDoor domain, where the baselines are unable to match the performance of CRADOL. As described in Figure 1, OC equipped with only temporal abstraction is unable to match the performance of CRADOL, which combines temporal and context-specific abstraction.

Question 2.

What does it mean to have diverse options?

We visualize the behaviors of options for temporal abstraction and of mechanisms for context-specific abstraction to further understand whether options are able to learn diverse sub-policies. Figure 4(a) shows the option trajectory in the DoorKey domain for the following (learned) sub-tasks: getting the key, opening the door, and going to the goal. We find that each option is only activated for one sub-task. Figure 4(b) shows the mapping between options and mechanisms, and we see that each option maps to a unique subset of mechanisms. To understand whether these mechanisms have mapped to different factors of the belief state (and hence have diverse parameters), Figure 4(c) computes the correlation between options, measured by the Pearson product-moment correlation method freedman2007statistics. We find low correlation between options 1 & 2 and options 2 & 3, but higher correlation between options 1 & 3. Specifically, we observe a high correlation between options 1 & 3, for getting the key and opening the door, due to the shared states between them, because opening the door is highly dependent on obtaining the key in the environment. This visualization empirically verifies our hypothesis that both temporal and context-specific abstraction are necessary to learn diverse and complementary option policies.

Question 3.

How does performance change as the need for context-specific abstraction increases?

In order to understand the full benefits of context-specific abstraction, we observe performance as the amount of required context-specific abstraction increases in the Moving Bandits domain. We consider performance, measured by AUC, for an increasing number of spurious features in the observation. Namely, we add 3 and 23 additional goals to the original 2-goal observation to test the capabilities of CRADOL (see Figure 3). Increasing the number of goals requires an increasing amount of context-specific abstraction, as there are more spurious features the agent must ignore to learn to move to the correct goal location as indicated in its observation. As shown in Figure 5, CRADOL performs significantly better than OC as the number of spurious goal locations it must ignore increases. We attribute this result to CRADOL's capability for both temporal and context-specific abstraction.

Figure 5: AUC for CRADOL and OC in the Moving Bandit domain. As the number of spurious features on the x-axis increases, the gap in AUC between CRADOL and OC grows, indicating a greater need for context-specific abstraction. Note that the AUC scale differs across the number of goals simply because of the different maximum number of training iterations for each goal setting. Mean and 95% confidence interval computed over 10 seeds are shown.

Figure 6: Ablation study of the various components that can be shared or not shared between options in the option learning process shown in Figure 2. Too much or too little sharing between options can lead to sub-optimal performance due to a lack of coordination. This result shows the mean computed over 10 seeds.
Question 4.

What is shared between options in the context of mechanisms?

As described in Figure 2, four components make up the option learning model. In order to investigate which of them should be shared between options, we perform an ablation study over a subset of the available choices, shown in Figure 6. Specifically, we study sharing the look-up table of the option attention module between options (CRADOL-JointP), learning a separate value weight for each option (CRADOL-SepV), learning separate sparse communication parameters for each option (CRADOL-SepComm), and three other combinations of these three modules.

We find the lowest performance for the method with a separate sparse communication module for each option. We hypothesize that this is due to a lack of coordination between options in updating their active mechanisms and in ensuring that the other, non-active mechanisms learn separate and complementary parameters. Having a joint look-up table results in the second-lowest performance: it effectively maps each option to the same set of mechanisms, leading to a lack of diversity between option policies and allowing the options to differ only in their fully connected layers. Lastly, we observe the third-lowest performance with a separate value weight for each option; when each option learns from a different representation, option policies can become similar and factored belief states for certain sub-goals can be left unaccounted for. The other combinations reaffirm that sharing the sparse communication layer is essential for coordinated behavior when learning option policies.

Question 5.

Is context-specific abstraction always beneficial?

The Reacher domain allows us to observe the effects of our method when there is little benefit in reducing the problem size, namely when the observation space is very small. This domain does not require context-specific abstraction, as the entire belief state space consists of information relevant to the goal at hand. Specifically, the low-level observation is a 4-element vector, with the first 2 elements containing the generalized positions and the next 2 elements containing the generalized velocities of the two arm joints, all of which are essential for reaching the goal location. As expected, Figure 7 shows that the performance of CRADOL and OC is similar in this domain, as the observation space does not contain features that CRADOL can exploit for context-specific abstraction. We note that our gains are larger for problems with larger intrinsic dimensionality.

Figure 7: In domains with little or no need for context-specific abstraction, the benefit of CRADOL is less significant, as shown in the Reacher domain, where the low-dimensional observation representation does not require context-specific abstraction. This figure shows the mean and 95% confidence interval computed over 10 seeds.

Conclusion

In this paper, we have introduced Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), which incorporates both temporal and state abstraction for the effective decomposition of a problem into simpler components. The key idea underlying our formulation is to map each option to a reduced state space, effectively implementing each option as a subset of mechanisms. We evaluated our method on several benchmarks with varying levels of required abstraction in sparse reward settings. Our results indicate that CRADOL decomposes the problems at hand more effectively than state-of-the-art HRL, non-HRL, and modular recurrent neural network baselines. We hope that our work provides the community with a theoretical foundation to build on for addressing the deficiencies of HRL methods.

Acknowledgements

Research funded by IBM and Samsung (as part of the MIT-IBM Watson AI Lab initiative), with computational support provided through Amazon Web Services.


Algorithm

1: Initialize inter-Q network parameters
2: Initialize intra-Q network parameters
3: Initialize intra-option policy parameters
4: Initialize beta-policy (termination) parameters
5: Initialize learning rates
6: Initialize the coefficient for soft target updates
7: Initialize the replay buffer
8: Initialize the internal (recurrent) states of the mechanisms
9: for each iteration do
10:     Get initial observation
11:     Get initial option according to the policy over options
12:     for each environment step do
13:         Get action and updated context-specific belief state from the intra-option policy
14:         Get updated abstracted belief states for the policy over options and value functions
15:         Take the action and observe the reward and next observation
16:         Store the transition in the replay buffer
17:         Update the internal states of the active mechanisms
18:         if the current option terminates (as determined by the beta policy) then
19:             Get a new option according to the policy over options
20:     for each gradient step do
21:         Update the inter-Q network parameters for each option
22:         Update the intra-Q network parameters for each option
23:         Update the intra-option policy parameters
24:         Update the beta-policy parameters for each option
25:         Soft-update the target network parameters for each option
Algorithm 1 CRADOL with Entropy Maximization
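
The snippet below sketches only the environment-interaction portion of Algorithm 1 (option selection, intra-option action execution, termination-triggered option re-selection, and replay-buffer storage). The environment, policies, and termination function are random stand-ins introduced purely to show the control flow; they are not the learned components of CRADOL:

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)           # the replay buffer initialized in Algorithm 1
n_options, n_actions = 3, 4

def policy_over_options(belief):               # stand-in for the policy over options
    return random.randrange(n_options)

def intra_option_policy(option, belief):       # stand-in for the intra-option policy
    return random.randrange(n_actions)

def termination(option, belief):               # stand-in for the beta (termination) policy
    return random.random() < 0.1

def env_step(action):                          # stand-in environment: next belief, reward, done
    return random.random(), random.random(), random.random() < 0.05

for iteration in range(5):
    belief = 0.0                               # initial (implicit) belief state
    option = policy_over_options(belief)
    done = False
    while not done:
        action = intra_option_policy(option, belief)
        next_belief, reward, done = env_step(action)
        replay_buffer.append((belief, option, action, reward, next_belief, done))
        belief = next_belief
        if termination(option, belief):        # termination triggers option re-selection
            option = policy_over_options(belief)
    # Gradient updates on minibatches sampled from replay_buffer would follow here.

print(len(replay_buffer), "transitions collected")
```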

Option Learning Gradients

Given a set of Markov options with stochastic intra-option policies π_{o,θ} differentiable in their parameters θ, we denote the gradient of the expected discounted return with respect to θ and an initial condition (b_0, o_0) as:

∂Q_Ω(b_0, o_0) / ∂θ = ∑_{b, o} μ_Ω(b, o | b_0, o_0) ∑_a (∂π_{o,θ}(a | b_o) / ∂θ) Q_U(b, o, a)     (8)

where μ_Ω(b, o | b_0, o_0) denotes the discounted weighting of (b, o) along trajectories originating from (b_0, o_0), with the context-specific belief state b_o derived as denoted in Figure 2. We expand the option-value function upon arrival with state abstraction as:

U(o, b′) = (1 − β_{o,φ}(b′_o)) Q_Ω(b′, o) + β_{o,φ}(b′_o) V_Ω(b′)     (9)

where Q_Ω(b′, o) equals:

Q_Ω(b′, o) = ∑_a π_{o,θ}(a | b′_o) Q_U(b′, o, a)     (10)

Regarding the termination function, the gradient of the expected discounted return objective with respect to the termination parameters φ and an initial condition (b_1, o_0) is:

∂U(o_0, b_1) / ∂φ = − ∑_{b′, o} μ_Ω(b′, o | b_1, o_0) (∂β_{o,φ}(b′_o) / ∂φ) A_Ω(b′, o)     (11)

where μ_Ω(b′, o | b_1, o_0) denotes the discounted weighting of (b′, o) along trajectories originating from (b_1, o_0), and A_Ω(b′, o) = Q_Ω(b′, o) − V_Ω(b′) is the advantage function over options.

Additional Domain Details

MiniGrid:

The observation is a partially observable view of the grid-world environment using a compact and efficient encoding, with 3 input values per visible grid cell (7x7x3 values in total). For the EmptyRoom domain, the agent is randomly initialized at the start of each episode and must learn to navigate to the green goal location. For the MultiRoom domain, a randomly initialized agent must learn to go through the green door to reach the green goal. Lastly, we test on the DoorKey domain in MiniGrid, as described in the motivation. For the latter two domains, the structure of the grid also changes at the start of every episode. Code for this domain can be found at https://github.com/maximecb/gym-minigrid.

Moving Bandit:

We modify this domain’s termination condition to simulate sparse reward settings. Specifically, we terminate when the agent has reached the goal location, receiving a reward of 1. Code for this domain can be found at: https://github.com/maximilianigl/rl-msol.

Reacher:

To consider sparse reward settings, we make a minor change to the reward signal in this domain. A tolerance is specified to create a terminal state when the end effector approximately reaches the target. The tolerance creates a circular area centered at the target, which we specify as 0.003. We use the code from OpenAI Gym.

Additional Experiment Details

We use the PyTorch library and a GeForce RTX 3090 graphics card for running experiments. For specific software versions, see the source code. We report the hyperparameter values used across all methods, as well as episodic reward performance plots, below.

Hyperparameter Value
Batch Size 100
Learning Rate 0.0003
Entropy Weight 0.005
Options 3
Mechanisms 4
Top-k 3
Hidden Size per RNN 6
Value Size 32
Discount Factor 0.95
Table 3: MiniGrid Multi-Room & DoorKey
Hyperparameter Value
Batch Size 100
Learning Rate 0.0005
Entropy Weight 0.001
Options 3
Mechanisms 4
Top-k 3
Hidden Size per RNN 6
Value Size 32
Discount Factor 0.95
Table 4: Moving Bandit
Hyperparameter Value
Batch Size 100
Learning Rate 0.005
Entropy Weight 0.005
Options 3
Mechanisms 4
Top-k 3
Hidden Size per RNN 6
Value Size 16
Discount Factor 0.95
Table 5: Reacher
Hyperparameter Value
Batch Size 100
Learning Rate 0.001
Entropy Weight 0.001
Options 3
Mechanisms 6
Top-k 4
Hidden Size per RNN 6
Value Size 64
Discount Factor 0.95
Table 2: MiniGrid EmptyRoom

Figure 8: MiniGrid-Empty-Random-6x6-v0

Figure 9: MiniGrid-MultiRoom-N2-S4-v0

Figure 10: MiniGrid-DoorKey-6x6-v0

Figure 11: Moving Bandits with 2 goal locations.