# Options Discovery with Budgeted Reinforcement Learning

We consider the problem of learning hierarchical policies for Reinforcement Learning that are able to discover options, an option corresponding to a sub-policy over a set of primitive actions. Different models have been proposed during the last decade, but they usually rely on a predefined set of options. We specifically address the problem of automatically discovering options in decision processes. We describe a new learning model called Budgeted Option Neural Network (BONN), which discovers options based on a budgeted learning objective. The BONN model is evaluated on different classical RL problems, with interesting quantitative and qualitative results.

## 1 Introduction

Works in cognitive science have long emphasized that human or animal behavior can be seen as a hierarchical process, in which solving a task amounts to sequentially solving sub-tasks [Botvinick et al.2009]. Examples of those in a simple maze environment are go to the door or go to the end of the corridor; these sub-tasks themselves correspond to sequences of primitive actions or other sub-tasks.

In the computer science domain, these works have led to the hierarchical reinforcement learning paradigm [Dayan and Hinton1993, Dietterich1998, Parr and Russell1998], motivated by the idea that it makes the discovery of complex skills easier. One line of approaches consists in modeling sub-tasks through options [Sutton et al.1999], giving rise to two main research questions: (a) choosing the best suited option and then (b) selecting actions to apply in the environment based on the chosen option. These two challenges are respectively handled by a high-level and a low-level controller.

In the Reinforcement Learning (RL) literature, different models have been proposed to solve these questions. But the additional question of discovering options is rarely addressed: the hierarchical structure in the majority of existing models has to be manually constrained, e.g. by predefining possible sub-goals [Kulkarni et al.2016]. Learning automatic task decomposition, without supervision, remains an open challenge in the field of RL. The difficulty of discovering hierarchical policies is emphasized by the fact that it is not well understood how they emerge in behaviors.

Recent studies in neuroscience suggest that a hierarchy can emerge from habits [Dezfouli and Balleine2013]. Indeed, [Keramati et al.2011] and [Dezfouli and Balleine2012] distinguish between goal-directed and habitual action control, the latter not using any information from the environment. The underlying idea is that the agent first uses goal-directed control, and then progressively switches to habitual action control, since goal-directed control requires a higher cognitive cost to process the acquired information (in that case, the cognitive cost is the time the animal spends deciding which action to take).

Based on this idea, we propose the BONN architecture (Budgeted Options Neural Network). In this framework, the agent has access to different amounts of information coming from the environment. More precisely, we consider that at each time step $t$, the agent can use a basic observation denoted $x_t$, used by the low-level controller as in classical RL problems. In addition, it can also choose to acquire a supplementary observation $y_t$, as illustrated in Fig. 1. This extra observation provides more information about the current state of the system and is used by the high-level controller, but at a higher “cognitive” cost. On top of this setting, we assume that options naturally emerge as a way to reduce the overall cognitive effort generated by a policy. In other words, by constraining the amount of high-level information used by our system (i.e. the $y_t$), BONN learns a hierarchical structure where the resulting policy is a sequence of low-cost sub-policies. In that setting, we show that the structure of the resulting policy can be seen as a sequence of intrinsic options [Gregor et al.2016], i.e. vectors in a latent space.

The contributions of the paper are threefold:

• We propose a new assumption about options discovery, arguing that it is a consequence of learning a trade-off between policy efficiency and cognitive effort. The hierarchical structure emerges as a way to reduce the overall cost.

• We define a new model called BONN that implements this idea as a hierarchical recurrent neural network for which, at each time step, two observations are available at two different prices. This model is learned using a policy gradient method over a budgeted objective.

• We propose different sets of experiments in multiple settings showing the ability of our approach to discover relevant options.

The paper is organized as follows: related works are presented in Section 2. We define the background for RL and for recurrent policy gradient methods in Section 3. We introduce the BONN model and the learning algorithm for the budgeted problem in Section 4. Experiments are presented and discussed in Section 5. Finally, Section 6 concludes and opens perspectives.

## 2 Related Work

The closest architecture to BONN is the Hierarchical Multiscale Recurrent Neural Network [Chung et al.2016], which discovers hierarchical structures in sequences. It uses a binary boundary detector learned with a straight-through estimator, similar to the acquisition model of BONN (see Section 4.1).

More generally, Hierarchical Reinforcement Learning [Dayan and Hinton1993, Dietterich1998, Parr and Russell1998] has been the subject of many different works during the last decade, since it is deemed one solution to long-range planning tasks and to the transfer of knowledge between tasks. Many different models assume that subtasks are known a priori, e.g., the MAXQ method in [Dietterich1998]. The concept of option is introduced by [Sutton et al.1999]. In this architecture, each option consists of an initiation set, its own policy (over primitive actions or other options), and a termination function which defines the probability of ending the option given a certain state. The concept of options is at the core of many recent articles. For example, in [Kulkarni et al.2016], the authors propose an extension of the Deep Q-Learning framework to integrate hierarchical value functions using intrinsic motivation to learn the option policies. But in these different models, the options have to be manually chosen a priori and are not discovered during the learning process. Still in the options framework, [Daniel et al.2016] and [Bacon and Precup2015b] discover options (both internal policies and the policy over options) without supervision, using respectively the Expectation Maximization algorithm and the option-critic architecture. Our contribution differs from these last two in that the BONN model does not have a fixed discrete number of options and rather uses an “intrinsic option” represented by a latent vector. Moreover, we clearly state how options arise by finding a good trade-off between efficiency and cognitive effort.

Close to our work, the concept of cognitive effort was introduced in [Bacon and Precup2015a] (but with discrete options), while intrinsic options, i.e. options as latent vectors, were used in [Gregor et al.2016].

Finally, some articles propose hierarchical policies based on different levels of observations. A first category of models is those that use open-loop policies, i.e. policies that do not use an observation from the environment at every time step. [Hansen et al.1996] propose a model that mixes open-loop and closed-loop control while considering that sensing incurs a cost. Some models focus on the problem of learning macro-actions [Hauskrecht et al.1998, Mnih et al.2016]: in that case, a given state is mapped to a sequence of actions. Another category of models divides the state space into several components. For instance, the Abstract Hidden Markov Model [Bui et al.2002] is based on discrete options defined on each space region. [Heess et al.2016] use a low-level controller that only has access to the proprioceptive information, and a high-level controller that has access to all observations. [Florensa et al.2016] use a similar idea of factoring the state space into two components, and learn a stochastic neural network for the high-level controller. The blind setting of the BONN model, described in Section 5.2, is similar to (stochastic) macro-actions, but open-loop policies are rather limited in complex environments. The general BONN architecture is more comparable to works using two different observations; however, those models do not learn when to use the high-level controller.

## 3 Background

### 3.1 (PO-) Markov Decision Processes and Reinforcement Learning

Let us denote a Markov Decision Process (MDP) as a set of states $\mathcal{S}$, a discrete set of possible actions $\mathcal{A}$, a transition distribution $P(s_{t+1} \mid s_t, a_t)$ and a reward function $r(s_t, a_t)$. We consider that each state $s_t$ is associated with an observation $x_t \in \mathbb{R}^n$, and that $x_t$ is a partial view of $s_t$ (i.e. a POMDP), $n$ being the size of the observation space. Moreover, we denote by $P_I$ the probability distribution over the possible initial states of the MDP.

Given a trajectory $x_0, a_0, x_1, a_1, \dots, x_t$, a policy is defined by a probability distribution $\pi(a_t \mid x_0, a_0, \dots, x_t)$, which is the probability of each possible action at time $t$, knowing the history of the agent.

### 3.2 Learning with Recurrent Policy Gradient

Let us denote by $\gamma$ the discount factor, and by $R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r(s_k, a_k)$ the discounted sum of rewards (or discounted return) at time $t$, computed over the remainder of the trajectory, with $T$ the size of the trajectories sampled by the policy. (We describe finite-horizon problems where $T$ is the size of the horizon and $\gamma \le 1$, but the approach can also be applied to infinite-horizon problems with discount factor $\gamma < 1$.) Note that $R_0$ corresponds to the classical discounted return.

We can define the reinforcement learning problem as an optimization problem in which the optimal policy is computed by maximizing the expected discounted return $J(\pi)$:

$$J(\pi) = \mathbb{E}_{s_0 \sim P_I;\; a_0, \dots, a_{T-1} \sim \pi}\left[ R_0 \right] \tag{1}$$

where $s_0$ is sampled following $P_I$ and the actions are sampled based on $\pi$.

Different learning algorithms aim at maximizing $J(\pi)$. In the case of policy gradient techniques, if we consider that, for the sake of simplicity, $\pi$ also denotes the set of parameters of the policy, the gradient of the objective can be approximated with:

$$\nabla_\pi J(\pi) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{T-1} \nabla_\pi \log \pi(a_t \mid x_0, a_0, \dots, x_t)\,(R_t - b_t) \tag{2}$$

where $M$ is the number of sampled trajectories used for approximating the gradient using Monte Carlo sampling techniques, $b_t$ is a variance reduction term at time $t$ estimated during learning, and we consider that future actions do not depend on past rewards (see [Wierstra et al.2010] for details on recurrent policy gradients).
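As an illustration of the quantities entering Eq. (2), the following minimal sketch computes the per-step weights $(R_t - b_t)$ for one sampled trajectory. It assumes the usual discounted return $R_t = \sum_{k \ge t} \gamma^{k-t} r_k$; the function names and the baseline values are hypothetical and are not taken from the paper's implementation.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_{T-1}] with R_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_weights(rewards, baselines, gamma):
    """Per-step weights (R_t - b_t) multiplying grad log pi(a_t | history)."""
    returns = discounted_returns(rewards, gamma)
    return [r - b for r, b in zip(returns, baselines)]


if __name__ == "__main__":
    # Hypothetical three-step episode with a hand-picked baseline.
    print(policy_gradient_weights([1.0, 1.0, 1.0], [0.75, 0.5, 0.0], gamma=0.5))
```

In an actual implementation, each weight would scale the gradient of the log-probability of the action taken at that step, and the result would be averaged over the $M$ sampled trajectories.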

## 4 Budgeted Option Neural Network

### 4.1 The BONN Architecture

In a typical (PO-)MDP setting, the agent uses an observation from the environment at every time step. In contrast, in the BONN architecture, the agent always uses a low-level observation $x_t$, but can also choose to acquire a high-level observation $y_t$ that provides more relevant information, as illustrated in Figure 1. This situation corresponds to many practical cases: for example, a robot that acquires information through its camera ($x_t$) can sometimes decide to make a complete scan of the room ($y_t$); a user driving a car (using $x_t$) can decide to consult a map or GPS ($y_t$); a virtual agent taking decisions in a virtual world (based on $x_t$) can ask instructions from a human ($y_t$), etc. Note that a particular case (called the blind setting) is when $x_t$ is empty, and $y_t$ is the classical observation over the environment. In that case, the agent basically decides whether it wants to use the current observation, as in the goal-directed vs. habits paradigm [Keramati et al.2011, Dezfouli and Balleine2012].

The structure of BONN is close to a hierarchical recurrent neural network with two hidden states, an option state and an actor state $h_t$, and is composed of three components. The acquisition model aims at choosing whether observation $y_t$ has to be acquired or not. If the agent decides to acquire $y_t$, the option model uses both observations $x_t$ and $y_t$ to compute a new option, represented as a vector in an option latent space. The actor model updates the state of the actor $h_t$ and aims at choosing which action $a_t$ to perform.

We now formally describe these three components. A schema of the BONN architecture is provided in Fig. 2, and the complete inference procedure is given in Algorithm 1. Note that, for the sake of simplicity, we consider that the last chosen action $a_{t-1}$ is included in the low-level observation $x_t$, as often done in reinforcement learning, and we do not explicitly write the last chosen action in all equations. Relevant representations of $x_t$ and $y_t$ are learned through neural networks (linear models in our case). In the following, the notations $x_t$ and $y_t$ directly denote these representations, used as inputs in the BONN architecture.

#### Acquisition Model:

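The acquisition model aims at deciding whether a new high-level observation $y_t$ needs to be acquired. It draws $\sigma_t \in \{0, 1\}$ according to a Bernoulli distribution whose parameter $P(\sigma_t = 1 \mid h_{t-1}, x_t)$ is computed from the previous actor state $h_{t-1}$ and the low-level observation $x_t$ (in our experiments, by a simple linear model applied to the concatenation of $h_{t-1}$ and $x_t$). If $\sigma_t = 1$, the agent uses $y_t$ to compute a new option (see next paragraph); otherwise it only uses $x_t$ to decide which action to apply.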

#### Option Model:

If $\sigma_t = 1$, the option model computes a new option state from the observations $x_t$ and $y_t$ and from the last option computed before time step $t$. This update is performed by a GRU cell [Cho et al.2014].

#### Actor Model:

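The actor model updates the actor state $h_t$ and computes the next action $a_t$. The update of $h_t$ depends on $\sigma_t$, i.e. on the availability of the high-level observation $y_t$:

• if $\sigma_t = 0$: the actor state is updated from the previous actor state $h_{t-1}$ and the low-level observation $x_t$, as in a classical recurrent policy;

• if $\sigma_t = 1$: the update additionally takes into account the option state newly computed by the option model.

The next action $a_t$ is then drawn from the distribution $P(a_t \mid h_t)$. The recurrent update is a GRU cell, while the action distribution is computed, in our experiments, by a simple perceptron.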

#### Options and BONN:

When a new high-level observation is acquired, i.e. $\sigma_t = 1$, the option state is updated based on the current observations $x_t$ and $y_t$. Then, the policy behaves as a classical recurrent policy until the next high-level observation is acquired. In other words, when acquiring a new high-level observation, a new sub-policy is chosen depending on $x_t$ and $y_t$ (and possibly the previous option). The option state can be seen as a latent vector representing the option chosen at time $t$, while $\sigma_t$ represents what is usually called the termination function in option settings. In BONN, since the option is chosen directly according to the state of the environment (more precisely, according to the observations $x_t$ and $y_t$ of the agent), there is no need for an explicit initiation set defining the states where an option can begin.
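The interplay between the three components can be sketched as a single inference loop. The code below is a deliberately simplified, hypothetical illustration and not the authors' implementation: it swaps the GRU cells for plain tanh layers, uses random untrained weights, and assumes that the actor restarts from the new option state whenever $\sigma_t = 1$. The class name `BonnSketch`, the helper functions, and the callback `get_y` are all invented for this example.

```python
import math
import random


def affine(weights, bias, vec):
    """Compute weights @ vec + bias on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class BonnSketch:
    """Hypothetical BONN-like policy with tanh cells in place of GRU cells."""

    def __init__(self, x_dim, y_dim, hidden, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.hidden = hidden
        self.acq = init_layer(self.rng, 1, hidden + x_dim)               # acquisition model
        self.opt = init_layer(self.rng, hidden, hidden + x_dim + y_dim)  # option model
        self.act = init_layer(self.rng, hidden, hidden + x_dim)          # actor state update
        self.out = init_layer(self.rng, n_actions, hidden)               # action scores

    def run_episode(self, xs, get_y):
        """xs: list of low-level observations; get_y(t) acquires y_t on demand."""
        h = [0.0] * self.hidden  # actor state
        o = [0.0] * self.hidden  # last computed option
        actions, sigmas = [], []
        for t, x in enumerate(xs):
            # Acquisition: Bernoulli draw from the previous actor state and x_t.
            logit = affine(*self.acq, h + x)[0]
            sigma = 1 if self.rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0
            if sigma == 1:
                # Option: new option from x_t, y_t and the previous option.
                o = [math.tanh(v) for v in affine(*self.opt, o + x + get_y(t))]
                prev = o  # assumption: the actor restarts from the new option
            else:
                prev = h
            # Actor: recurrent update, then sample an action.
            h = [math.tanh(v) for v in affine(*self.act, prev + x)]
            probs = softmax(affine(*self.out, h))
            actions.append(self.rng.choices(range(len(probs)), weights=probs)[0])
            sigmas.append(sigma)
        return actions, sigmas
```

Note that `get_y` is only called at the steps where $\sigma_t = 1$, which is what makes the number of acquisitions a meaningful cost.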

### 4.2 Budgeted Learning for Options Discovery

Inspired by cognitive sciences [Kool and Botvinick2014], BONN considers that discovering options aims at reducing the cognitive effort of the agent. In our case, the cognitive effort is measured by the amount of high-level observations acquired by the model to solve the task, and thus by the amount of options vectors computed by the model over a complete episode. By constraining our model to find a good trade-off between policy efficiency and the number of high-level observations acquired, BONN discovers when this extra information is essential and has to be acquired, and thus when to start new sub-policies.

The acquisition cost of a particular episode is the number of high-level observations acquired during that episode, i.e. $\sum_{t=0}^{T-1} \sigma_t$. We propose to integrate this acquisition cost (or cognitive effort) in the learning objective, relying on the budgeted learning paradigm already explored in different RL-based applications [Contardo et al.2016, Dulac-Arnold et al.2012]. We define an augmented immediate reward that includes the generated cost:

$$r^*(s_t, a_t, \sigma_t) = r(s_t, a_t) - \lambda \sigma_t \tag{3}$$

where $\lambda$ controls the trade-off between the policy efficiency and the cognitive charge. The associated discounted return is denoted $R^*_t$, so that the expected value of $R^*_0$ is used as the new objective to maximize, resulting in the following policy gradient update rule:

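$$\pi \leftarrow \pi - \gamma \sum_{t=0}^{T-1} \left( \nabla_\pi \log P(a_t \mid h_t) + \nabla_\pi \log P(\sigma_t \mid h_{t-1}, x_t) \right) (R^*_t - b^*_t) \tag{4}$$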

where $\gamma$ here denotes the learning rate (distinct from the discount factor of Section 3.2). Note that this rule now updates both the probabilities of the chosen actions $a_t$ and the probabilities of the $\sigma_t$, which can be seen as internal actions that decide whether a new option has to be computed or not. $b^*_t$ is the new resulting variance reduction term as defined in [Wierstra et al.2010].
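To make the effect of the budget concrete, the following small sketch applies the augmented reward of Eq. (3) to one episode and accumulates the resulting returns $R^*_t$. The function name, the numeric values, and the argument `discount` (the discount factor, named this way to avoid confusion with the learning rate) are illustrative assumptions.

```python
def budgeted_returns(rewards, sigmas, lam, discount):
    """Apply r*_t = r_t - lam * sigma_t, then accumulate discounted returns R*_t."""
    augmented = [r - lam * s for r, s in zip(rewards, sigmas)]
    returns = [0.0] * len(augmented)
    running = 0.0
    for t in reversed(range(len(augmented))):
        running = augmented[t] + discount * running
        returns[t] = running
    return augmented, returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    # Same environment rewards, two acquisition patterns (hypothetical values).
    _, frugal = budgeted_returns(rewards, [1, 0, 0], lam=0.5, discount=1.0)
    _, greedy = budgeted_returns(rewards, [1, 1, 1], lam=0.5, discount=1.0)
    print(frugal[0], greedy[0])  # the frugal pattern has the higher budgeted return
```

With equal environment rewards, the trajectory that acquires $y_t$ less often obtains the larger $R^*_0$, which is the pressure that pushes the policy toward longer sub-policies.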

### 4.3 Discovering a discrete set of options

In the previous sections, we considered that the option generated by the option model is a vector in a latent space. This is slightly different from the classical option definition, which usually considers that an agent has a given “catalog” of possible sub-routines, i.e. the set of options is a finite discrete set. We propose here a variant of the model that learns a finite discrete set of options.

The number of options one wants to discover is fixed manually. Each option is associated with a (learned) embedding. The option model stores the different possible options and chooses which one to use each time an option is required. In that case, the option model is a stochastic model that samples one option index, denoted $i_t$, by using a multinomial distribution on top of a softmax computation. As the option model now makes stochastic choices, the policy gradient update rule integrates these additional internal actions with:

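$$\pi \leftarrow \pi - \gamma \sum_{t=0}^{T-1} \left( \nabla \log P(a_t \mid z_t) + \nabla \log P(\sigma_t \mid z_{t-1}, a_{t-1}, x_t) + \nabla \log P(i_t \mid y_t) \right) (R^*_t - b_t) \tag{5}$$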

Since $P(i_t \mid y_t)$ is computed by a softmax over a differentiable scoring function, learning updates both the scoring function and the option embeddings.
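A minimal sketch of this sampling step is given below. It assumes, purely for illustration, that the scoring function is a dot product between each option embedding and the representation of $y_t$; the paper only requires the scoring function to be differentiable. The function names and values are hypothetical.

```python
import math
import random


def option_distribution(embeddings, y):
    """Softmax over dot-product scores between each option embedding and y."""
    scores = [sum(e * v for e, v in zip(emb, y)) for emb in embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_option(embeddings, y, rng):
    """Sample an option index i_t and return it with log P(i_t | y_t)."""
    probs = option_distribution(embeddings, y)
    index = rng.choices(range(len(probs)), weights=probs)[0]
    return index, math.log(probs[index])


if __name__ == "__main__":
    rng = random.Random(0)
    embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three hypothetical options
    index, log_prob = sample_option(embeddings, [2.0, 0.0], rng)
    print(index, log_prob)
```

The returned log-probability is the quantity whose gradient appears as the third term of Eq. (5).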

## 5 Experiments

### 5.1 Experimental setting

For all experiments, we used the ADAM optimizer [Kingma and Ba2014] (an open-source version of the model is available at https://github.com/aureliale/BONN-model). The learning rates were optimized by grid-search. Observations $x_t$ and $y_t$ are each represented through a linear model with a hidden layer followed by an activation function, and the size of these layers and of the GRU cells [Cho et al.2014] depends on the environment.

### 5.2 Blind Setting

Given a POMDP (or MDP), the easiest way to design the two observations $x_t$ and $y_t$ needed for the BONN model is to consider that $x_t$ is the empty observation and $y_t$ is the usual observation coming from the environment. This case is similar to the “goal-directed vs habit action control” paradigm and corresponds to a case in which the agent chooses whether or not to acquire the observation. It also corresponds to a (stochastic) macro-actions framework: the agent chooses a sequence of actions for each observation.

Several environments were used to evaluate BONN in this setting:

• (i) CartPole: This is the classical cart-pole environment as implemented in the OpenAI Gym platform [Brockman et al.2016], in which observations are the cart position, cart velocity, pole angle and pole angular velocity, and possible actions are pushing the cart left or right. The reward is +1 for every time step without failure. In this environment, the sizes of the networks used are and .

• (ii) LunarLander: This environment corresponds to the Lunar Lander environment proposed in OpenAI Gym, where observations describe the position, velocity and angle of the agent and whether it is in contact with the ground or not, and possible actions are do nothing, fire left engine, fire main engine, fire right engine. The reward is +100 if landing, +10 for each leg on the ground, -100 if crashing and -0.3 each time the main engine is fired. Here and .

• (iii) -rooms: This environment corresponds to a maze composed of rooms with doors between them (see Figure 3(a)). The agent always starts at the upper-left corner, while the goal position is chosen randomly at each episode: it can be in any room and in any position. The reward function is -1 when moving and +20 when reaching the goal, and 4 different actions are possible: up, down, left and right. The observation describes the agent position, the position of the doors in the room, and the goal position if the goal is in the same room as the agent (i.e. the agent only observes the current room). Note that this environment is much more difficult than other 4-rooms problems (introduced by [Sutton et al.1999]). In the latter, there are only one or two possible goal positions, while in our case the goal can be anywhere. Moreover, in our more realistic setting, the agent only observes the room it is in. Here and .

In simple environments like CartPole and LunarLander, the option model does not need to be recurrent and depends only on $x_t$ and $y_t$.

#### Results:

We compare BONN to a recurrent policy gradient (R-PG) with GRU cells. Note that R-PG has access to all observations ($x_t$ and $y_t$) at every time step and finds the optimal policy in these environments, while BONN learns to use $y_t$ only from time to time.

We illustrate the quality of BONN in Table 1. There are two versions of each environment: a deterministic one and a stochastic one, in which the movement of the agent can fail with a given probability, in which case a random transition is applied. In deterministic environments, we can see that the BONN model is able to perform as well as classical baselines while acquiring the observation only a few times per episode: for example, in the CartPole environment, the agent needs to use the observation only 6% of the time. These results clearly show that in simple environments there is no need to receive observation feedback at every step, and that a single planning step can generate several actions. However, in stochastic environments, observations are used more often, and even then performance degrades much more. Indeed, due to the stochasticity, it is not possible for the agent to deduce its position using only the chosen action (when actions fail, the agent no longer knows which state the environment is in), and the observation thus needs to be acquired more often. This demonstrates the limits of open-loop policies in unpredictable environments, justifying the use of a basic observation in the following.

### 5.3 Using low-level and high-level observations

As seen above, the problem of open-loop control (i.e. control without any feedback from the environment) is that the stochasticity of the environment cannot be “anticipated” by the agent without receiving feedback. We study here a setting in which $x_t$ provides simple and light information, while $y_t$ provides richer information. The motivation is that $x_t$ can help to decide which action to take, without as much cognitive effort as when using the complete observation.

For that, we use another version of the -rooms environment where $x_t$ contains the agent position in the current room, while $y_t$ corresponds to the remaining information (i.e. positions of the doors and position of the goal if it is in the room). The difference with the previous setting is that the agent has permanent access to its position. In this version, we used , and .

In a stochastic environment, BONN only uses $y_t$ 16% of the time (versus 60% in the blind setting) and even so achieves a cumulative reward of -18 (roughly the same). We can see in Fig. 3 the reward versus cost curves obtained by computing the Pareto front over BONN models with different cost levels $\lambda$. We note that the drop of performance in this version of the -rooms environment happens at a lower cost than in the blind setting. Indeed, with $x_t$, the agent knows where it is at each time step and is better able to “compensate” for the stochasticity of the environment than in the first case, in which the position is only available through $y_t$. As in the deterministic blind setting, the agent is able to discover meaningful options and to acquire the relevant information only once per room (see Section 5.4).
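The Pareto front mentioned above can be extracted from a set of trained models with a few lines of code. The sketch below assumes each model is summarised by a pair (acquisition cost, cumulative reward), where lower cost and higher reward are better; the function name and the numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Keep the (cost, reward) pairs not dominated by a cheaper-or-equal, better pair."""
    front = []
    best_reward = float("-inf")
    # Sort by increasing cost; for equal costs, look at the best reward first.
    for cost, reward in sorted(points, key=lambda p: (p[0], -p[1])):
        if reward > best_reward:
            front.append((cost, reward))
            best_reward = reward
    return front


if __name__ == "__main__":
    # Hypothetical (cost, reward) pairs, one per trained model.
    models = [(1, 2.0), (2, 5.0), (3, 4.0), (5, 9.0), (6, 9.0)]
    print(pareto_front(models))
```

Each point on the resulting front corresponds to a model for which no other model achieves at least the same reward at a lower or equal acquisition cost.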

This experiment shows the usefulness of permanently using a basic observation with a low cognitive cost, in contrast to temporary “pure” open-loop policies where no observation is used.

### 5.4 Analysis of option discovered

Figure 3(a) illustrates trajectories generated by the agent in the -rooms environment, and the positions where the options are generated. We can see that the agent learns to observe only once in each room and that it uses the resulting option until it reaches another room. Thus the agent deduces from $y_t$ whether it must move to another room, or reach the goal if it is in the current room. Note that the agent does not go directly to the goal room because it has no information about it. Seeing only the current room, it learns to explore the maze in a “particular” order until reaching the goal room. We have also visualized the option latent vectors using the t-SNE algorithm (Figure 3(b)). Similar colors (for example all green points) mean that the computed options correspond to observations for which the goals are in similar areas. We can for example see that all green options are close to each other, showing that the latent option space effectively captures relevant information about option similarity.

The D-BONN model (the discrete-options variant of Section 4.3) was evaluated on the -rooms environment, and an example of generated trajectories is given in Figure 5. Each color corresponds to one of the learned discrete options. One can see that the model is still able to learn a good policy, but the constraint of a fixed number of discrete options clearly decreases the quality of the obtained policy. It thus seems more interesting to use continuous options instead of discrete ones, the continuous options being grouped in smooth clusters as illustrated in Figure 3(b).

### 5.5 Instructions as options

Finally, we consider a setting in which $y_t$ is information provided by an oracle, while $x_t$ is a classical observation. The underlying idea is that the agent can choose an action based only on its observation, or use information from an optimal model at a higher cost. To study this case, we consider a maze environment where the maze is randomly generated for each episode (so the agent cannot memorize the exact map) and the goal and agent initial positions are also randomly chosen. The observation $x_t$ is the 9 cells surrounding the agent. The observation $y_t$ is a one-hot vector of the action computed by a simple path planning algorithm that has access to the whole map. The parameters of the model are and (no representation is used for $y_t$). Note that the computation of $y_t$ can be expensive, leading to the idea that it has a higher cognitive cost.
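One possible way to produce such an oracle observation is sketched below, using breadth-first search as the path planning algorithm. The paper does not specify which planner is used, so BFS, the grid encoding (`#` for walls) and the action ordering are assumptions made for this example.

```python
from collections import deque

# Action order is an arbitrary choice for this sketch: (name, row delta, column delta).
ACTIONS = [("up", -1, 0), ("down", 1, 0), ("left", 0, -1), ("right", 0, 1)]


def distances_to_goal(grid, goal):
    """Breadth-first search from the goal over free cells ('#' marks a wall)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        row, col = queue.popleft()
        for _, d_row, d_col in ACTIONS:
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in dist:
                dist[nxt] = dist[(row, col)] + 1
                queue.append(nxt)
    return dist


def oracle_observation(grid, agent, goal):
    """One-hot vector of an action that moves the agent one step closer to the goal."""
    dist = distances_to_goal(grid, goal)
    one_hot = [0] * len(ACTIONS)
    if agent == goal or agent not in dist:
        return one_hot  # nothing to advise: already there, or goal unreachable
    for index, (_, d_row, d_col) in enumerate(ACTIONS):
        nxt = (agent[0] + d_row, agent[1] + d_col)
        if dist.get(nxt, float("inf")) < dist[agent]:
            one_hot[index] = 1
            break
    return one_hot


if __name__ == "__main__":
    maze = ["...",
            "##.",
            "..."]
    print(oracle_observation(maze, agent=(0, 0), goal=(2, 0)))
```

Because the search explores the whole map, calling the oracle at every step is much more expensive than reading the local observation, which matches the idea of a higher cognitive cost.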

Two examples of generated trajectories are illustrated in Fig. 6. Figure 5(b) illustrates a generated trajectory when learning is not finished. It shows that, at a certain point, the low-level controller has learned to follow a straight path but needs to ask for instructions at each crossroad, or when a change of direction is needed. Some relevant options have already emerged in the agent behavior but are not optimal. When learning is finished (Fig 5(a)), the agent asks for instructions only at crossroads where the decision is essential to reach the goal. Between crossroads, the agent learns to follow the corridors, which corresponds to an intuitive and realistic behavior. Note that future work will use outputs of more expensive models in lieu of a high-level observation $y_t$, and the BONN model seems to be an original way to study the Model-Free/Model-Based paradigm proposed in neuroscience [Gläscher et al.2010].

## 6 Conclusion and Perspectives

We proposed a new model for learning options in POMDPs in which the agent chooses, at each time step, whether to acquire a more informative but costly observation. The model is learned in a budgeted learning setting in which the acquisition of the additional information, and thus the use of a new option, has a cost. The learned policy is a trade-off between the efficiency and the cognitive effort of the agent. In our setting, the options are handled through learned latent representations. Experimental results demonstrate the possibility of reducing the cognitive cost (i.e. acquiring and computing information) without a drastic drop in performance. We also show the benefit of using different levels of observation and the relevance of extracted options. This work opens different research directions. One is to study whether BONN can be applied to multi-task reinforcement learning problems (the -rooms environment, since the goal position is randomly chosen at each episode, can be seen as a multi-task problem). Another question would be to study problems where many different observations can be acquired by the agent at different costs, e.g. many different sensors on a robot. Finally, a promising perspective is learning how and when to interact with another expensive model.

## Acknowledgments

This work has been supported within the Labex SMART supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-LABX-65.