1 Introduction
Reinforcement learning (RL) techniques have experienced much of their success in simulated environments, such as video games [11] or board games [14, 19]. One of the main reasons why RL has worked so well in these applications is that we are able to simulate millions of interactions with the environment in a relatively short period of time, allowing the agent to experience a large number of different situations in the environment and learn the consequences of its actions.
In many real-world applications, however, where the agent interacts with the physical world, it might not be easy to generate such a large number of interactions. The time and cost associated with training such systems could render RL an infeasible approach for training at large scale.
As a concrete example, consider training a large number of humanoid robots (agents) to move quickly, as in the RoboCup competition [4]. Although the agents have similar dynamics, subtle variations mean that a single policy shared across all agents would not be an effective solution. Furthermore, learning a policy from scratch for each agent is too data-inefficient to be practical. As shown by Farchy et al. (2013), this type of problem can be addressed by leveraging the experience obtained from solving a related task (e.g., walking) to quickly learn a policy for each individual agent that is tailored to a new task (e.g., running).
The situation where agents might need to solve many related, but unique, tasks also occurs in industry; an example would be robots (agents) tasked with sorting items in fulfillment centers. A simple approach, like using PD controllers, would fail to adapt to the forces generated from picking up objects with different weight distributions, causing the agent to drop the objects. RL is able to mitigate this problem by learning a policy for each agent that is able to make corrections quickly, which is tailored to the robot’s dynamics. However, training a new policy for each agent would be far too costly to be a practical solution.
In these scenarios, it is possible to use a small number of policies learned by a subset of the agents, and then leverage the experience obtained from learning those policies to allow the remaining agents to quickly learn their corresponding policies. This approach can turn problems that are prohibitively expensive to solve into relatively simple problems.
To make use of prior experience and improve learning on new related problems in RL, several complementary lines of work have been proposed and are actively being studied. Transfer learning [18] refers to the problem of adapting information acquired while solving one task to another. One might consider learning a mapping function that allows a policy learned in one task to be used in a different task [1], or simply learning a mapping of the value function learned in one task to another [17]. These techniques can be quite effective, but are also limited in that they consider mapping information from a source task to a target task; that is, they do not learn a general transfer strategy for many related tasks.
Another approach to reusing prior knowledge is through meta learning or learning to learn [13, 12]. In the context of RL, the goal under this framework is usually for an agent to be exposed to a number of tasks where it can learn some general behavior that generalizes to new tasks. For example, Finn et al. (2017) showed that an agent that learns how to walk forward is able to find a general policy that can quickly be adapted to learn to walk backwards [5].
One last technique to leverage prior experience, and the one this paper focuses on, is through temporally extended actions or temporal abstractions [9, 15]. While in the standard RL framework the agent has access to a set of primitive actions (i.e., actions that last for one time step), temporally extended actions allow an agent to execute actions that last for several timesteps. They introduce a bias in the behavior of the agent which, if appropriate for the problem at hand, results in dramatic improvements in how quickly the agent learns to solve a new task compared to only using primitive actions [9].
A popular representation for temporally extended actions is the options framework (formally introduced in the next section), which is the focus of this work. It has been shown that options learned in a specific task or set of tasks can be reused to improve learning on new tasks [7, 2]; however, this often requires knowledge from the user about which options or how many options are appropriate for the type of problems the agent will face.
In this paper, we propose learning reusable options for a set of related tasks with minimal information provided by the user. We consider the scenario where the agent must solve a large number of tasks and show that after learning a well-performing policy for a small number of problems, we can learn an appropriate number of options that facilitates learning in a remaining set of tasks. Ideally the trajectories used to learn options would be obtained from optimal policies, but for many learning algorithms it cannot be guaranteed that the learned policies are actually optimal. We propose learning a set of options that minimize the expected number of decisions needed to represent trajectories generated from the policies learned by the agent for a small number of problems, while also maximizing the probability of generating those trajectories. Our experiments show that after learning to solve a small number of tasks, the learned options allow the agent to solve the remaining tasks much more quickly.
2 Background and Notation
A Markov decision process (MDP) is a tuple, $M = (\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{A}$ is the set of possible actions that the agent can take, $P(s, a, s')$ is the probability that the environment will transition to state $s'$ if the agent executes action $a$ in state $s$, $R(s, a, s')$ is the expected reward received after taking action $a$ in state $s$ and transitioning to state $s'$, $d_0$ is the initial state distribution, and $\gamma$ is a discount factor for rewards received in the future. We use $t$ to index the timestep and write $S_t$, $A_t$, and $R_t$ to denote the state, action, and reward at time $t$. A policy, $\pi$, provides a conditional distribution over actions given each possible state: $\pi(s, a) = \Pr(A_t = a \mid S_t = s)$. We denote a trajectory of length $t$ as $h_t$; that is, $h_t$ is defined as a sequence of states, actions and rewards observed after following some policy for $t$ timesteps.
This work focuses on learning temporally extended actions (actions lasting for multiple timesteps) that can be used for a set of related tasks. We consider the setting where an agent must solve a set of related tasks, where each task, $c$, is an MDP, $M_c = (\mathcal{S}, \mathcal{A}, P_c, R_c, d_0^c, \gamma)$; that is, each task is an MDP with its own transition function, reward function and initial state distribution, with shared state and action sets. Specifically, our work focuses on learning reusable options [16, 15] for a set of related tasks.
An option, $o = (\mathcal{I}_o, \mu_o, \beta_o)$, is a tuple in which $\mathcal{I}_o \subseteq \mathcal{S}$ is the set of states in which option $o$ can be executed (the initiation set), $\mu_o$ is a policy that governs the behavior of the agent while executing $o$, and $\beta_o$ is a termination function that determines the probability that $o$ terminates in a given state. We assume that $\mathcal{I}_o = \mathcal{S}$ for all options $o$; that is, the options are available at every state. The options framework does not dictate how an agent should choose between available options or how options should be discovered. A common approach to selecting between options is to learn a policy over options, which is defined by the probability of choosing an option in a particular state. Two recent popular approaches to option discovery are eigenoptions [7] and the option-critic architecture [2].
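To make the execution model concrete, the following is a minimal tabular sketch of how an agent might act with options that are available in every state. The array layout (`pi`, `mu`, `beta`), the `step` callback, and the function name are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rollout_with_options(pi, mu, beta, step, s0, horizon, rng):
    """Sample a trajectory while counting decisions of the policy over options.

    Assumed (hypothetical) tabular layout:
      pi[s, o]    probability that the policy over options picks option o in state s
      mu[o, s, a] probability that option o takes action a in state s
      beta[o, s]  probability that option o terminates on arriving in state s
      step(s, a, rng) -> next state index (stand-in for the environment)
    """
    n_options, _, n_actions = mu.shape
    s, option, decisions, trajectory = s0, None, 0, []
    for _ in range(horizon):
        # A new option is chosen at the start and whenever the current one terminates.
        if option is None or rng.random() < beta[option, s]:
            option = rng.choice(n_options, p=pi[s])
            decisions += 1
        a = rng.choice(n_actions, p=mu[option, s])
        trajectory.append((s, int(option), int(a)))
        s = step(s, a, rng)
    return trajectory, decisions
```

With termination probabilities of zero the policy over options is consulted once; with termination probabilities of one it is consulted at every timestep, which is the behavior of primitive actions.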
The eigenoptions [7] of an MDP are the optimal policies for a set of implicitly defined reward functions called eigenpurposes. Eigenpurposes are defined in terms of proto-value functions [8], which are in turn derived from the eigenvectors of a modified adjacency matrix over states for the MDP. The intuition is that no matter the true reward function, the eigenoptions allow an agent to quickly traverse the transition graph, resulting in better exploration of the state space and faster learning. However, there are two major downsides: 1) the adjacency matrix is often not known a priori, and may be difficult or impossible to construct for large MDPs, and 2) for each eigenpurpose, constructing the corresponding eigenoption requires solving a new MDP.
The option-critic architecture [2] is a more direct approach that learns options and a policy over options simultaneously. The option policies and their termination functions are trained using policy gradient methods, while the policy over options may be trained using any technique. One issue that often arises within this framework is that the termination functions of the learned options tend to collapse to “always terminate”. In a later publication, the authors built on this work to consider the case where there is a cost associated with switching options [6]. This method resulted in the agent learning to use a single option while it was appropriate and terminate when an option switch was needed, allowing it to discover improved policies for a particular task. The authors argue that minimizing the use of the policy over options may be desirable, as the cost of choosing an option may be greater than the cost of choosing a primitive action when using an option (e.g., when a planner is used to select an option). Work recently presented by Harutyunyan et al. (2019) approaches the aforementioned termination problem by explicitly optimizing the termination function of options to focus on small regions of the state space. However, while all of these methods can be effective in learning a policy for the task at hand, they do not explicitly take into consideration that the agent might face related, but different, tasks in the future. In contrast, our method discovers options that are useful for a variety of tasks.
We build on the idea that minimizing the number of decisions an agent must make will lead to the discovery of generally useful temporal abstractions, and propose an offline technique where options are learned after solving a small number of tasks. The options can then be leveraged to quickly solve new related problems the agent will face in the future. We use the trajectories generated by the agent when learning policies for a small number of problems, and learn an appropriate set of options by directly minimizing the expected number of decisions the agent makes while simultaneously maximizing the probability of generating the observed trajectories.
3 Learning Reusable Options from Experience
In this section, we formally introduce the objective we use to learn a set of options that are reusable for a set of related tasks. Our algorithm introduces one option at a time until introducing a new option does not improve the objective further. This procedure results in a natural way of learning an adequate number of options without having to predefine it; a new option is included only if it is able to improve the probability of generating optimal behavior while minimizing the number of decisions made by the agent.
3.1 Problem Formulation
In the options framework, at each timestep, $t$, the agent chooses an action, $A_t$, based on the current option, $O_t$. Let $T_t$ be a Bernoulli random variable, where $T_t = 1$ if the previous option, $O_{t-1}$, terminated at time $t$, and $T_t = 0$ otherwise. If $T_t = 1$, $O_t$ is chosen using the policy over options, $\pi$. If $T_t = 0$, then the previous option continues, that is, $O_t = O_{t-1}$. To ensure that we can represent any trajectory, we consider primitive actions to be options which always select one specific action and then terminate; that is, for an option, $o$, corresponding to a primitive, $a$, for all $s \in \mathcal{S}$, the termination function would be given by $\beta_o(s) = 1$, and the policy by $\mu_o(s, a') = 1$ if $a' = a$ and $0$ otherwise.
Let $\mathcal{O}$ denote a set of options, $\mathcal{O} = \mathcal{O}_A \cup \mathcal{O}_O$, where $\mathcal{O}_A$ refers to the set of options corresponding to primitive actions and $\mathcal{O}_O$ to the set of options corresponding to temporal abstractions. Furthermore, let $H$ be a random variable denoting a trajectory of length $|H|$ generated by a well-performing policy, and let $H_t$ be a random variable denoting the subtrajectory of $H$ up to the state encountered at timestep $t$. We seek to find a set of options, $\mathcal{O}^{*}$, that maximizes the following objective:
(1) 
where is a regularizer that encourages a diverse set of options, and is a scalar hyperparameter. If we are also free to learn the parameters of , then .
One choice for is the average KL divergence on a given trajectory over the set of options being learned: . Note that this term is only defined when we consider two or more options. When that is not the case we set this term to 0.
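As an illustration of a diversity term of this kind, the sketch below averages the KL divergence between the action distributions of every ordered pair of options over the states visited in one trajectory. Which distributions are compared and how the average is taken are assumptions made for this example; the function name and array layout are hypothetical.

```python
import numpy as np

def average_pairwise_kl(mu, states, eps=1e-12):
    """Average KL divergence between option policies along a trajectory.

    mu[o, s, a] : probability that option o takes action a in state s (assumed layout)
    states      : sequence of visited state indices
    Returns 0.0 when fewer than two options are given, mirroring the convention in the text.
    """
    n_options = mu.shape[0]
    if n_options < 2:
        return 0.0
    # Clip to avoid log(0); the clipped rows are not renormalized in this sketch.
    probs = np.clip(mu[:, list(states), :], eps, 1.0)  # shape [options, steps, actions]
    total, pairs = 0.0, 0
    for i in range(n_options):
        for j in range(n_options):
            if i == j:
                continue
            kl_per_state = np.sum(probs[i] * (np.log(probs[i]) - np.log(probs[j])), axis=-1)
            total += kl_per_state.mean()
            pairs += 1
    return float(total / pairs)
```

Identical options give a value of zero, so maximizing this quantity pushes the learned options toward distinct behaviors on the demonstrated states.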
Intuitively, we seek to find options that terminate as infrequently as possible while still generating well-performing trajectories with high probability. Notice that minimizing the number of terminations is the same as minimizing the number of decisions made by the policy over options, as each termination requires the policy to subsequently choose a new option. Given a set of options, a policy over options, and a sample trajectory, we can calculate the joint probability for a trajectory exactly. Therefore, we can obtain an accurate estimate of Equation 1 by averaging over a set of sample trajectories. In the next section, we present a slight modification to our objective that results in a practical optimization problem.
3.2 Optimization Objective for Learning Options
Given that the agent must learn the corresponding policy for a set of tasks, we can use the experience gathered from solving a subset of tasks to obtain trajectories demonstrating the optimal behavior learned for these problems. Given a set of trajectories generated from an initial subset of tasks, we can now estimate the expectation in (1) to learn options that can be leveraged in the remaining problems.
Because the probability of generating any trajectory approaches $0$ as the length of the trajectory increases, we make a slight modification to the original objective that leads to better numerical stability. We explain these modifications after introducing the objective in (2), which we optimize in practice:
(2)  
The objective in (2) is derived from (1) with the following modifications: 1) the sum of the first two terms replaces a product of two terms obtained from computing the joint probability in (1), 2) the summation over terminations for a trajectory (second term) is normalized by the length of the trajectory, and 3) we introduce a scalar weight to balance the contribution of each term to the objective. Although (2) is not an unbiased estimator of (1), it is derived from (1) with modifications introduced for numerical stability. A more detailed discussion of how we arrived at this objective is provided in Appendix A.
We can express (2) entirely in terms of the policy over options, $\pi$, the options, $\mathcal{O}$, and the transition function, $P$. When the transition function is not known, we can estimate it from data by assuming a family of distributions and fitting the parameters. The following theorems show how to calculate the first two terms in (2) from known quantities, allowing us to efficiently maximize the proposed objective.
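One possible reading of the three modifications listed above is sketched here: for each sample trajectory, the probability of generating it is rewarded, the expected number of terminations divided by the trajectory length is penalized, and a scalar weight balances the two. This is an illustration under that reading, not a transcription of the objective in (2); the names `lam1`, `lam2`, and `diversity` are hypothetical.

```python
import numpy as np

def surrogate_objective(traj_probs, expected_terminations, lengths,
                        lam1, lam2=0.0, diversity=0.0):
    """Sample-average objective to be maximized (illustrative form only).

    traj_probs[i]            probability of generating trajectory i under the options
    expected_terminations[i] expected number of option terminations on trajectory i
    lengths[i]               number of timesteps in trajectory i
    lam1                     assumed weight trading off the two terms
    lam2, diversity          assumed weight and value of an optional regularizer
    """
    traj_probs = np.asarray(traj_probs, dtype=float)
    normalized_terminations = (np.asarray(expected_terminations, dtype=float)
                               / np.asarray(lengths, dtype=float))
    return float(np.mean(traj_probs - lam1 * normalized_terminations)
                 + lam2 * diversity)
```

Using a sum rather than a product keeps either term from vanishing when the other is close to zero, which is the numerical-stability concern raised for long trajectories.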
Theorem 1.
Given a set of options, , and a policy, , over options, the expected number of terminations for a trajectory is given by:
(3)  
where we use as a shorthand notation for ,
and .
Proof.
See Appendix B. ∎
Theorem 2.
Given a set of options and a policy over options, the probability of generating a trajectory of length is given by:
where is a recursive function defined as:
Proof.
See Appendix C. ∎
Given a parametric representation of the option policies and termination functions for each option and for the policy over options, we use Theorems 1 and 2 to differentiate the objective in (2) with respect to their parameters and optimize with any standard optimization technique.
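The quantities in the two theorems can be computed with a forward recursion over the active option, in the spirit of the forward algorithm for hidden Markov models. The sketch below is a tabular illustration under stated assumptions, not the exact statement of either theorem: environment transition probabilities are omitted because they do not depend on the options, so the returned likelihood is proportional to the trajectory probability, and the termination probabilities are conditioned on the history observed so far. All names are hypothetical.

```python
import numpy as np

def forward_pass(states, actions, pi, mu, beta):
    """Return (action likelihood, expected number of terminations) for one trajectory.

    states, actions : equal-length sequences; actions[t] was taken in states[t]
    pi[s, o], mu[o, s, a], beta[o, s] : assumed tabular policy over options,
        option policies, and termination probabilities
    The option choice at the first timestep is not counted as a termination.
    """
    # alpha[o] is proportional to Pr(actions so far, option o currently active).
    alpha = pi[states[0]] * mu[:, states[0], actions[0]]
    expected_terminations = 0.0
    for s, a in zip(states[1:], actions[1:]):
        total = alpha.sum()
        if total == 0.0:
            return 0.0, float(expected_terminations)  # trajectory cannot be generated
        terminated = np.dot(alpha, beta[:, s])       # previous option ends on arrival in s
        expected_terminations += terminated / total  # Pr(termination | history so far)
        continued = alpha * (1.0 - beta[:, s])       # previous option keeps control
        alpha = (terminated * pi[s] + continued) * mu[:, s, a]
    return float(alpha.sum()), float(expected_terminations)
```

Because every operation is differentiable in `pi`, `mu`, and `beta`, the same recursion written in an automatic-differentiation framework would yield the gradients needed to optimize the parameters.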
3.3 Learning Options Incrementally
One common issue in option discovery is identifying how many options are needed for a given problem. Oftentimes this number is predefined by the user based on intuition. In such a scenario, one could learn options by simply randomly initializing the parameters of a number of options and optimizing the proposed objective in (2). Instead, we propose learning not only the options, but also the number of options needed, by the procedure shown in Algorithm 1. This algorithm introduces one option at a time and optimizes the objective with respect to the parameters of the policy over options, $\pi$, and the parameters of the policy and termination function of the newly introduced option, for a fixed number of epochs. Optimizing both the policy over options and the new option allows us to estimate how much we can improve the objective given that we keep any previously introduced options fixed.
After the new option is trained, we measure how much the objective has improved; if it fails to improve above some threshold, the procedure terminates. This results in a natural way of obtaining an appropriate number of options, as options stop being added once a new option no longer improves the ability to represent the demonstrated behavior.
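The incremental procedure can be sketched as a simple loop. This is not a reproduction of Algorithm 1: the `train_candidate` callback, the `baseline` value (the objective before any learned option is added), the absolute improvement test, and the `max_options` cap are all assumptions made for illustration.

```python
def learn_options_incrementally(train_candidate, baseline, threshold, max_options=10):
    """Add options one at a time until the objective stops improving enough.

    train_candidate(options) -> (new_option, score): hypothetical callable that
        trains one new option together with the policy over options, keeping the
        options already in `options` fixed, and returns the objective reached.
    baseline  : objective value before any learned option is added (assumed input)
    threshold : minimum improvement required to keep a new option
    """
    options, best = [], baseline
    while len(options) < max_options:
        candidate, score = train_candidate(options)
        if score - best < threshold:
            break  # the new option does not help enough; discard it and stop
        options.append(candidate)
        best = score
    return options, best
```

A relative test, for example requiring the improvement to exceed a fraction of the previous objective value, can be obtained by replacing the comparison inside the loop.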
4 Experimental Results
This section describes experiments used to evaluate the proposed offline option learning approach. We show results in the “four rooms” domain to allow us to visualize and understand the options produced by the approach, and to show empirically that these options produce a clear improvement in learning. We compare against options generated by the option-critic architecture [2] and eigenoptions [7]. We then extend our experiments to assess the performance of the technique in a few selected problems from the Atari 2600 emulator provided by OpenAI Gym [3]. These experiments demonstrate that when an agent faces a large number of related tasks, by using the trajectories obtained from solving a small subset of tasks, our approach is able to discover options that significantly improve the learning ability of the agent in the tasks it has yet to solve.
4.1 Experiments on Four Rooms Environment
We tested our approach in the four rooms domain: a gridworld of size , in which the agent is placed in a randomly selected start state and needs to reach a randomly selected goal state. At each timestep, the agent executes one of four available actions: moving left, right, up or down, and receives a reward of . After taking a particular action, the agent moves in the intended direction with probability and in any other direction with probability . Upon reaching the goal state, the agent receives a reward of . We generated different task variations (by changing the goal and start location) and collected six sample trajectories from optimal policies, learned using Q-learning, from six different start and goal configurations. We evaluated our method on the remaining tasks.
Each option was represented as a two-layer neural network, with neurons in each layer, and two output layers: a softmax output layer over the four possible actions representing the option policy, and a separate sigmoid layer representing the termination function. We used the tabular form of Q-learning as the learning algorithm with $\epsilon$-greedy exploration.
Figure 1(a) shows the change in the average expected number of terminations and average probability of generating the observed trajectories while learning options, as new options are introduced and adapted to the sampled trajectories. Options were learned over the six sampled optimal trajectories and every epochs a new option was introduced to the option set, for a total of options. For every new option, the change in probability of generating the observed trajectories as well as the change in expected number of decisions reaches a plateau after or training epochs. When a new option is introduced, there is a large jump in the loss because a new policy over options is initialized arbitrarily to account for the new option set being evaluated. However, after training the new candidate option, the overall loss improves beyond what was possible before introducing the new option.
In Figure 1(b), we compare the performance of Q-learning on novel test tasks (randomly selected start and goal states) using options discovered from offline option learning (with and without regularization using KL divergence), eigenoptions, and the option-critic. We allowed each competing method to learn options from the same six training tasks and, to ensure a fair comparison, we used the original code provided by the authors. As baselines, we also compare against primitive actions and randomly initialized options. It might seem surprising that both eigenoptions and the option-critic failed to reach an optimal policy when they were shown to work well in this type of problem; for that we offer the following explanation. Our implementation of four rooms is defined in a much larger state space than the ones where these methods were originally tested, making each individual room much larger. Since the options identified by these methods tend to lead the agent from room to room, it is possible that, once in the correct room, the agent executes an option leading to a different room before it has had the opportunity to find the goal. When testing our approach in the smaller version of the four rooms problem, we found no clear difference to the performance of the competing methods. In this setting, the options learned by our method found an optimal policy in all testing tasks in the allotted number of episodes. We set the threshold for introducing a new option to of at the previous iteration and the hyperparameter . When adding KL regularization, we set .
To understand the reason behind the improvement in performance resulting from offline option learning, we turn the reader’s attention to Figure 2. The figure is a visualization of the policy learned by the agent on a particular task: navigate from a specific location in the bottom-left room to a location in the top-right room in a small “four-room” domain of size . (We show a smaller domain than the one used in the experiments for ease of visualization.) The new task to solve is shown in the top-left figure, while the solution found is shown in the top-right figure. Each of the remaining rows of images shows how each option was learned and used in the new task. The first row shows the options learned after training; the highlighted path depicts one of the sample trajectories used for training, the colors correspond to the probability that the options would take the demonstrated action, and the arrows indicate the most likely action to be taken by the option.
The middle row depicts a heatmap indicating how each option was used to solve this specific task. It shows the probability that would execute each option at any given state. Finally, the last row depicts a heatmap indicating the probability of termination for each option given the state.
Looking at the learned options from these different perspectives provides some insight into how they are being exploited. For example, option is generally useful to navigate towards the top rooms and, since the goal in this task is in the topright room, the option is mainly called in the bottom rooms. Also notice that the option is likely to terminate in the top left and bottom right rooms in the states that would lead the agent to get “trapped” against a wall. These options, when used in combination in specific regions, allow the agent to efficiently tackle problems it has not encountered before.
4.2 Experiments using Atari 2600 Games
We evaluated the quality of the options learned by our framework in two different Atari 2600 games: Breakout and Amidar. We trained the policy over options using A3C [10] with grayscale images as inputs. Options were represented by a two-layer convolutional neural network, and were given the grayscale images of the previous two frames as input. For each task variation, we randomly picked an integer between and , and let the agent act randomly for that number of timesteps before learning (changing the initial state distribution); we sampled a number in the range and used it to scale the reward the agent received (changing the reward function); and we allowed for a number of frames to be skipped after taking each action (changing the transition function). We used three different tasks for training for each game, and sampled trajectories for training; we used five new tasks for testing. Each trajectory lasted until a life was lost, not for the entire duration of the episode. We ran training agents in parallel on CPUs, the learning rate was set to and the discount factor was set to .
Figures 3 and 4 show the performance of the agent as a function of training time in Breakout and Amidar, respectively. The plots show that given good choices of hyperparameters, the learned options led to a clear improvement in performance during training. For both domains, we found that led to a reasonable tradeoff between the first two terms in (2), and report results with three different values for the regularization term: (no regularization), and . Note that our results do not necessarily show that the options result in a better final policy, but they improve exploration early in training and enable the agent to learn more effectively.
Figure 5 depicts the behavior for one of the learned options on Breakout. The option efficiently catches the ball after it bounces off the left wall, and then terminates with high probability before the ball has to be caught again. Bear in mind that the option remains active for many timesteps, significantly reducing the number of decisions made by the policy over options. However, it does not maintain control for so long that the agent is unable to respond to changing circumstances. Note that the option is only useful in specific cases; for example, it was not helpful in returning a ball bounced off the right wall. That is to say, the option specialized in a specific subtask within the larger problem: a highly desirable property for generally useful options.
Figure 6 shows the selection of two of the options learned for Amidar when starting a new game. At the beginning of the game, option 1 is selected, which takes the agent to a specific intersection before terminating. The agent then selects option 2, which chooses a direction at the intersection, follows the resulting path, and terminates at the next intersection. Note that the agent does not need to repeatedly select primitive actions in order to simply follow a previously chosen path. Having access to these types of options enables an agent to easily replicate known good behaviors, allowing for faster and more meaningful exploration of the state space.
5 Conclusion and Future Work
In this work we presented an optimization objective for learning options from demonstrations obtained from learned policies on a set of tasks. Optimizing the objective results in a set of options that allows an agent to reproduce the demonstrated behavior while minimizing the number of decisions made by the policy over options, and that improves the learning ability of the agent on new tasks.
There are some clear directions for future development. While we have shown that our method is capable of discovering powerful options, properly tuning the hyperparameters, and , is necessary for learning appropriate options. In complex environments, this is not an easy task. Future work could study methods for finding the right balance between hyperparameters automatically or, if possible, eliminate the need for such hyperparameters altogether. Another possible dimension of improvement is to study how to extend the proposed ideas to the online setting; an agent may be able to sample trajectories as it is learning a task and progressively use them to continuously improve its option set.
We provided results showing how options adapt to the trajectories provided and showed, through several experiments, that the identified options are capable of significantly improving the learning ability of an agent. The resulting options encode meaningful abstractions that help the agent interact with and learn from its environment more efficiently.
References

[1] (2015) Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2504–2510. ISBN 0262511290.
[2] (2017) The option-critic architecture. In AAAI.
[3] (2016) OpenAI Gym. CoRR. arXiv:1606.01540.
[4] (May 2013) Humanoid robots learning to walk faster: from the real world to simulation and back. In Proc. of 12th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS).
[5] (6–11 Aug. 2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1126–1135.
[6] (2018) When waiting is not an option: learning options with a deliberation cost. In AAAI.
[7] (2017) A Laplacian framework for option discovery in reinforcement learning. CoRR.
[8] (2005) Proto-value functions: developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 553–560.
[9] (1998) Macro actions in reinforcement learning: an empirical analysis. Technical Report, University of Massachusetts Amherst, Massachusetts, USA.
[10] (20–22 Jun. 2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1928–1937.
[11] (Feb. 2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. ISSN 0028-0836.
[12] (1998) Learning to learn. S. Thrun and L. Pratt (Eds.), pp. 293–309. ISBN 0792380479.
[13] (1995) On learning how to learn learning strategies. Technical report.
[14] (Jan. 2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. ISSN 0028-0836.
[15] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
[16] (1998) Intra-option learning about temporally abstract actions. In Proceedings of the 15th International Conference on Machine Learning (ICML 1998).
[17] (Dec. 2007) Transfer learning via inter-task mappings for temporal difference learning. J. Mach. Learn. Res. 8, pp. 2125–2167. ISSN 1532-4435.
[18] (Dec. 2009) Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, pp. 1633–1685. ISSN 1532-4435.
[19] (Mar. 1995) Temporal difference learning and TD-Gammon. Commun. ACM 38 (3), pp. 58–68. ISSN 0001-0782.
Appendix A Appendix
The following list defines the notation used in all derivations:

$A_t$: random variable denoting the action taken at step $t$.
$S_t$: random variable denoting the state at step $t$.
$H_t$: random variable denoting the history up to step $t$. $H_t = (S_0, A_0, S_1, A_1, \dots, S_t)$.
$T_t$: random variable denoting the event that the option used at step $t-1$ terminates at state $S_t$.
$\pi$: policy over options.
$P$: transition function. $P(s, a, s')$ denotes the probability of transitioning to state $s'$ by taking action $a$ in state $s$.
$O_t$: random variable denoting the option selected for execution at state $S_t$.
$o$: option defined as $o = (\mu_o, \beta_o)$, where $\mu_o$ is the option policy for option $o$ and $\beta_o$ is the termination function.
Assume primitives are options that perform only 1 action and last for 1 timestep.
$\mathcal{O}$: set of available options.
We can compute the probability of an option terminating at state and generating a trajectory as:
(4) 
To compute the proposed objective we need to find an expression for and in terms of known quantities.
A.1 Appendix A - Derivation of the Objective in (2)
Recall , ignoring the regularization term. Assuming access to a set of sample trajectories, we start by estimating from sample averages and derive the objective as follows:
It can easily be seen that to maximize the above expression, the expected number of terminations should be minimized while the probability of generating the trajectories should be maximized. Given that for long trajectories the expected number of terminations increases while the probability of generating the trajectories goes to $0$, we normalize the number of terminations by the length of the trajectory and adjust a hyperparameter to prevent one term from dominating the other during optimization. Based on this observation we propose optimizing the following objective:
This objective allows us to control a tradeoff, through this hyperparameter, between how much we care about the options reproducing the demonstrated trajectories and how much we want the agent to minimize the number of decisions.
A.2 Appendix B - Proof of Theorem 1
Theorem 1 Given a set of options and a policy over options, the expected number of terminations for a trajectory of length is given by:
where,
and .
Proof.
Notice that , so if we find an expression for , we can calculate the expectation exactly. We define for ease of derivation even though there is no option to terminate at .
We are left with finding an expression in terms of known probabilities for .
Given that by convention, , we are now left with figuring out how to calculate
where
Using the recursive function , the expected number of terminations for a given trajectory is given by:
∎
A.3 Appendix C - Proof of Theorem 2
Theorem 2
Given a set of options and a policy over options, the probability of generating a trajectory of length is given by:
, where is a recursive function defined as:
Proof.
We define to be the history from time to time , that is, ), where . If , the history would contain a single state.