Introduction
Humans have a remarkable ability to acquire skills, and knowing when to apply each skill plays an important role in their ability to quickly solve new tasks. In this work, we tackle the problem of learning such skills in reinforcement learning (RL). AI agents that aim to achieve goals face two difficulties in large problems: the depth of the lookahead needed to obtain a good solution, and the breadth generated by having many choices. The first problem is often solved by providing shortcuts that skip over multiple time steps, for example, by using macro-actions [12]. The second problem is handled by restricting the agent's attention at each step to a reasonable number of possible choices. Temporal abstraction methods aim to solve the first problem, and a lot of recent literature has been devoted to this topic [15, 21, 17, 20, 1, 19]. We focus specifically on the second problem: learning how to reduce the number of choices considered by an RL agent.
In classical planning, the early work on STRIPS [7] used preconditions that had to be satisfied before applying a certain action. Similar ideas can also be found in later work on macro-operators [16] or the Schema model [6]. In RL, the framework of options [31] uses a similar concept, initiation sets, which limit the availability of options (i.e., temporally extended actions) in order to deal with the possibly high cost of choosing among many options. Moreover, initiation sets can also lead to options that are more localized [14], which can be beneficial in transfer learning. For example, in continual learning (Ring, 1997), specialization is key both to scaling up learning in large environments and to “protecting” knowledge that has already been learned from forgetting due to new updates.
The option-critic architecture [1] is a gradient-based approach for learning options in order to optimize the usual long-term return obtained by an RL agent from the environment. However, the notion of initiation sets originally introduced in [31] was omitted from [1] due to the difficulty of learning sets with gradient-based methods. We propose a generalization of initiation sets to interest functions [33, 37]. We build on the fact that a set can be represented through its membership function; interest functions generalize membership functions in a way that allows smooth parameterization. Without this extension, determining suitable initiation sets would require a non-differentiable, search-based approach.
Key Contributions: We generalize initiation sets for options to interest functions, which are differentiable and hence easier to learn. We derive a gradient-based learning algorithm capable of learning all components of options end-to-end. The resulting interest-option-critic architecture generates options that are specialized to different regions of state space. We demonstrate through experiments in both discrete and continuous problems that our approach generates options that are useful in a single task, interpretable, and reusable in multi-task learning.
Preliminaries
Markov Decision Processes (MDPs). A finite, discrete-time MDP [29, 32] is a tuple $\langle \mathcal{S}, \mathcal{A}, r, P, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the environment transition probability function, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t$, the learning agent perceives a state $s_t \in \mathcal{S}$, takes an action $a_t \in \mathcal{A}$ drawn from a policy $\pi(\cdot \mid s_t)$, and with probability $P(s_{t+1} \mid s_t, a_t)$ enters next state $s_{t+1}$, receiving a numerical reward $r_{t+1}$ from the environment. The value function of policy $\pi$ is defined as $V_\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\big]$ and its action-value function is $Q_\pi(s, a) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\big]$.
The Options framework. A Markovian option $\omega \in \Omega$ [31] is composed of an intra-option policy $\pi_\omega$, a termination condition $\beta_\omega : \mathcal{S} \to [0, 1]$, where $\beta_\omega(s)$ is the probability of terminating the option upon entering state $s$, and an initiation set $\mathcal{I}_\omega \subseteq \mathcal{S}$. In the call-and-return option execution model, when an agent is in state $s$, it first examines the options that are available, i.e., those for which $s \in \mathcal{I}_\omega$. Let $\Omega_s$ denote this set of available options. The agent then chooses $\omega \in \Omega_s$ according to the policy over options $\pi_\Omega$, follows the internal policy of $\omega$, $\pi_\omega$, until it terminates according to $\beta_\omega$, at which point this process is repeated. Note that $\mathcal{S}$ is the union of all sets $\mathcal{I}_\omega$. The option-value function of $\pi_\Omega$ is defined as:
$$Q_{\pi_\Omega}(s, \omega) = \sum_a \pi_\omega(a \mid s)\, Q_U(s, \omega, a),$$
where $Q_U(s, \omega, a)$ is the value of executing primitive action $a$ in the context of state-option pair $(s, \omega)$:
$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \big[ (1 - \beta_\omega(s'))\, Q_{\pi_\Omega}(s', \omega) + \beta_\omega(s')\, V_{\pi_\Omega}(s') \big].$$
Note that if an option $\omega$ cannot initiate in a state $s$, its value $Q_{\pi_\Omega}(s, \omega)$ is considered undefined.
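The call-and-return execution model above can be sketched as a short standalone Python loop (hypothetical names; initiation sets are represented as plain Python sets, and termination conditions as state-dependent probabilities — an illustration, not the paper's implementation):

```python
import random

def call_and_return(env_step, options, policy_over_options, state, max_steps=100):
    """Run the call-and-return option execution model.

    options: list of (initiation_set, intra_option_policy, termination_prob) triples.
    policy_over_options: maps (state, available option indices) -> chosen index.
    """
    trajectory = []
    current = None
    for _ in range(max_steps):
        if current is None:
            # Only options whose initiation set contains the state are available.
            available = [i for i, (init, _, _) in enumerate(options) if state in init]
            current = policy_over_options(state, available)
        _, pi, beta = options[current]
        action = pi(state)
        state = env_step(state, action)
        trajectory.append((current, action, state))
        if random.random() < beta(state):  # option terminates; choose again
            current = None
    return trajectory
```

On a toy chain with two options whose initiation sets partition the states, the loop switches options exactly when the agent crosses out of the current initiation set.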
InterestOptionCritic
In [31], the focus is on discrete environments, and the notion of initiation set provides a direct analog of preconditions from classical planning. In large problems, options would be applicable in parts of the state space described by certain features. For example, an option of the form stop if the traffic light is red would only need to be considered in states where a traffic light is detected. Let $\mathbb{1}_{\mathcal{I}_\omega}$ be the indicator function corresponding to set $\mathcal{I}_\omega$: $\mathbb{1}_{\mathcal{I}_\omega}(s) = 1$ iff $s \in \mathcal{I}_\omega$ and $0$ otherwise.
An interest function $i_\omega : \mathcal{S} \to [0, 1]$ generalizes the set indicator function, with $i_\omega(s) > 0$ iff $\omega$ can be initiated in $s$. A bigger value of $i_\omega(s)$ means the interest in executing $\omega$ in $s$ is larger. Note that, depending on how $i_\omega$ is parameterized, one could view the interest as a prior on the likelihood of executing $\omega$ in $s$. However, we will not use this perspective here, because our goal is to learn $i_\omega$. Instead, we choose a parameterized form $i_{\omega, z}$, differentiable in its parameters $z$, in order to leverage the power of gradients in the learning process.
The value of $i_{\omega, z}(s)$ modulates the probability of option $\omega$ being sampled in state $s$ by a policy over options $\pi_\Omega$, resulting in an interest policy over options defined as:
$$\pi_I(\omega \mid s) = \frac{i_{\omega, z}(s)\, \pi_\Omega(\omega \mid s)}{\sum_{\omega'} i_{\omega', z}(s)\, \pi_\Omega(\omega' \mid s)}. \qquad (1)$$
Note that this specializes immediately to the usual initiation sets when the interest is the indicator function.
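Equation (1) is a simple interest-weighted renormalization of the policy over options, which can be sketched as follows (hypothetical names; the `sigmoid` helper illustrates one differentiable parameterization of an interest value, not the paper's exact architecture):

```python
import math

def sigmoid(x):
    """One simple differentiable parameterization of an interest value."""
    return 1.0 / (1.0 + math.exp(-x))

def interest_policy(interests, pi_omega):
    """Equation (1): pi_I(o|s) proportional to i_o(s) * pi_Omega(o|s).

    interests: interest value of each option in the current state.
    pi_omega:  base policy-over-options probabilities in that state.
    """
    weighted = [i * p for i, p in zip(interests, pi_omega)]
    z = sum(weighted)
    return [w / z for w in weighted]
```

With indicator interests (0/1 values) the renormalization simply restricts the base policy to the initiation set, recovering the classical behavior; smooth sigmoid interests interpolate between options instead of switching them off abruptly.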
We will now describe an approach to learning options which includes interest functions. We propose a policy gradient algorithm, in the style of option-critic [1], based on the following result:
Theorem 1.
Given a set of Markov options with differentiable interest functions $i_{\omega, z}$, where $z$ is the parameter vector, the gradient of the expected discounted return with respect to $z$ at $(s_0, \omega_0)$ is:
$$\sum_{s, \omega} \mu(s, \omega \mid s_0, \omega_0) \sum_{s'} P(s' \mid s, \omega)\, \beta_\omega(s') \sum_{\omega'} \frac{\partial \pi_I(\omega' \mid s')}{\partial z}\, Q_{\pi_I}(s', \omega'),$$
where $P(s' \mid s, \omega) = \sum_a \pi_\omega(a \mid s) P(s' \mid s, a)$, $\mu$ is the discounted weighting of the state-option pairs along trajectories starting from $(s_0, \omega_0)$, sampled from the distribution determined by $\pi_I$, $\beta_\omega$ is the termination function, and $Q_{\pi_I}$ is the value function over options corresponding to $\pi_I$.
The proof is in Appendix A.2.1, available on the project page (https://sites.google.com/view/optionsofinterest). We can derive policy gradients for intra-option policies and termination functions as in option-critic [1] (see Appendix A.2.2, A.2.3), with the difference that the discounted weighting of state-option pairs is now according to the new option sampling distribution determined by $\pi_I$. This is natural, as the introduction of the interest function should only impact the choice of option in each state. Pseudocode of the interest-option-critic (IOC) algorithm using intra-option Q-learning is shown in Algorithm 1.
Intuitively, the gradient update to $z$ can be interpreted as increasing the interest in an option which terminates in states with good value. It links initiation and termination, which is natural. Note that the proposed gradient works at the level of the augmented chain, not at the SMDP level. Implementing a policy gradient at the SMDP level for the policy over options would entail performing gradient updates only upon termination, whereas using the augmented chain allows for updates throughout. To the best of our knowledge, this approach does not appear in previous work.
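As a sanity check on this kind of gradient, the derivative of the interest policy in Eq. (1) can be computed in closed form for a toy parameterization (one scalar sigmoid interest per option, state held fixed — an illustrative assumption, not the paper's function approximator) and verified against finite differences:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pi_interest(z, pi_omega):
    """Eq. (1) with scalar interests i_o = sigmoid(z_o), state held fixed."""
    w = [sigmoid(zi) * p for zi, p in zip(z, pi_omega)]
    total = sum(w)
    return [wi / total for wi in w]

def grad_pi_interest(z, pi_omega, o):
    """Analytic d pi_I(o') / d z_o for every o'.

    Quotient rule on Eq. (1) gives:
      d pi_I(o')/d z_o = c * ([o' == o] - pi_I(o')),
    with c = sigmoid'(z_o) * pi_Omega(o) / normalizer.
    """
    interests = [sigmoid(zi) for zi in z]
    total = sum(i * p for i, p in zip(interests, pi_omega))
    pi = pi_interest(z, pi_omega)
    c = interests[o] * (1 - interests[o]) * pi_omega[o] / total
    return [c * ((1.0 if k == o else 0.0) - pi[k]) for k in range(len(z))]
```

The structure mirrors a softmax gradient: raising one option's interest parameter increases its own selection probability and decreases everyone else's, in proportion to their current probabilities.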
Illustration
In order to elucidate how interest functions can help regulate the complexity of learning to make decisions, we provide a small illustration. Consider a point mass agent in a continuous 2D maze, which starts in a uniformly random position and must reach a fixed goal state. Consider a scalar threshold $\tau$, so that at any choice point, only options whose interest is at least $\tau$ can be initiated by the interest policy over options $\pi_I$. The agent uses 16 options in total. Intuitively, an agent which has fewer option choices at a decision point should learn faster, since it has fewer alternatives to explore; but in the long run, this limits the space of usable policies. Fig. 1 confirms this trade-off between speed of learning and ultimate quality. Note that the same trade-off arises in planning as well (as discussed extensively in classical planning work).
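The thresholding rule used in this illustration can be sketched as follows (hypothetical names; the fallback to the single highest-interest option when nothing clears the threshold is our assumption, made so the agent always has at least one choice):

```python
def available_options(interests, tau):
    """Indices of options whose interest clears the threshold tau.

    Falls back to the single highest-interest option if none clears it
    (an assumption made here so the choice set is never empty).
    """
    idx = [i for i, v in enumerate(interests) if v >= tau]
    if not idx:
        idx = [max(range(len(interests)), key=interests.__getitem__)]
    return idx
```

Raising `tau` shrinks the choice set at each decision point, which speeds up early learning but restricts the policies the agent can express — the trade-off shown in Fig. 1.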
Experimental Results
We now study the empirical behavior of IOC in order to answer the following questions: (1) Are options with interest functions useful in a single task? (2) Do interest functions facilitate learning reusable options? (3) Do interest functions provide better interpretability of the skills of an agent? A link to the source code for all experiments is provided on the project page.
Learning in a single task
To analyze the utility of interest functions when learning in a single task, consider a given, fixed policy over options, either specified by a just-in-time planner or via human input. This setup allows us to isolate the impact of interest functions on the learning process.
Four rooms (FR). We first consider the classic FR domain [31] (Fig. 3(a)). The agent starts at a uniformly random state and there is a goal state in the bottom right hallway. With some probability, the agent transitions randomly to one of the empty adjacent cells instead of making the desired movement. The reward is positive at the goal and zero otherwise. We used a fixed set of options, whose intra-option policies were parameterized with Boltzmann distributions, and whose termination and interest functions were represented as linear-sigmoid functions. Options were learned using either IOC or OC with tabular intra-option Q-learning, as described in Algorithm 1. Learning proceeds over multiple episodes, with a cap on the time steps allowed per episode. Additional details are provided in Appendix A.3.1.
Results: Fig. 2 shows the steps to goal for both OC and IOC, averaged over multiple independent runs. The IOC agent consistently reaches the goal in fewer steps than the OC agent. One potential reason for the improvement of IOC is that options become specialized to different regions of the state space, as can be seen in Fig. 3. We also observe that the termination functions naturally become coherent with the learned interest functions, and are mostly room-specific for each option (see appendix Fig. A1). In contrast, options learned by OC do not show such specialization and terminate everywhere (see appendix Fig. A1). These results demonstrate that the IOC agent not only corrects for the given higher-level policy, but also leads to more understandable options as a side effect.
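The two parameterizations used above can be sketched as follows (hypothetical names; with tabular one-hot features, a linear-sigmoid function reduces to a sigmoid of a per-state weight — a minimal sketch, not the experiment code):

```python
import math

def boltzmann(prefs, temperature=1.0):
    """Boltzmann (softmax) distribution over action preferences.

    Subtracting the max preference keeps exp() numerically stable.
    """
    m = max(prefs)
    exps = [math.exp((p - m) / temperature) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def linear_sigmoid(weights, state):
    """Sigmoid of a linear function of one-hot state features.

    With tabular one-hot features this is just sigmoid(weights[state]),
    giving one learnable termination/interest value per state.
    """
    return 1.0 / (1.0 + math.exp(-weights[state]))
```

Lowering the Boltzmann temperature sharpens the intra-option policy toward its preferred action, while the linear-sigmoid form keeps termination and interest values in $[0, 1]$ and differentiable in their weights.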
T-Maze. Next, we illustrate the learning and use of interest functions in the nonlinear function approximation setting, using simple continuous control tasks implemented in MuJoCo [35]. A point mass agent (blue) is located at the bottom end of a T-shaped maze and must navigate to within a given distance of the goal position at the right end of the maze (green location) (Fig. 4(b)). The state space consists of the coordinates of the agent and the action space consists of force applied in the directions. We use a uniform fixed policy over options for both IOC and OC. We reuse the Proximal Policy Option-Critic (PPOC) algorithm [13] (which we refer to simply as OC) and add a 2-layer network with sigmoid outputs to compute the interest functions. However, we correct the implementation of the gradient of the policy over options, which was overlooked in that work. The remaining update rules are consistent with Algorithm 1. Complete details about the implementation and hyperparameter search are provided in Appendix A.3.2.
Results: We report the average performance over independent runs. The IOC agent converges in almost half the time steps needed by the OC agent. A possible explanation is that interest functions provide the IOC agent with an attention mechanism and thus facilitate learning options which are more diverse (see Fig. 5 for evidence). A deeper analysis of the interest functions learned in this domain is deferred to subsequent sections.
MiniWorld. We also explore learning in a more complex 3D first-person visual environment from the MiniWorld framework [3]. We use the OneRoom task, where the agent has to navigate to a randomly placed red box in a closed room (Fig. 4(c)). This requires the agent to turn around and scan the room to find the red box, then navigate to it. The observation space is a 3-channel RGB image and the action space consists of discrete actions. At the start of each episode, the red box is placed randomly. The episode terminates when the agent reaches the box or a maximum number of time steps is reached. We used the DQN architecture of [26]. See Appendix A.3.3 for details about implementation and hyperparameters.
Results: With a given uniform policy over options, the IOC agent converges much faster than the OC agent (Fig. 2). The performance is averaged across independent runs.
Based on these experiments, IOC provides consistent performance improvements across a range of tasks, from the simple four-rooms domain to complex visual navigation in MiniWorld, indicating the utility of learning interest functions.
Option reusability
One of the primary reasons for an agent to autonomously learn options is the ability to generalize its knowledge quickly in new tasks. We now evaluate our approach in settings where adaptation to changes in the task is vital.
T-Maze. The point mass agent starts at the bottom of the maze, with two goal locations (Fig. 4(a)), both giving a reward. After a number of episodes, the goal that has been visited the most is removed and the agent has to adapt its policy to the remaining goal (Fig. 4(b)). Both OC and IOC learn options. We use a softmax policy over options for both IOC and OC, which is learned at the same time.
Results: In the initial phase, the difference in performance between IOC and the other two agents (OC and PPO) is striking (Fig. 4): IOC converges twice as fast. Moreover, when the most visited goal is removed and adaptation to the task change is required, the IOC agent explores faster and its performance declines less. This suggests that the structure learned by IOC provides more generality. At the end of task 2, IOC recovers its original performance, whereas PPO fails to recover during the allotted learning trials.
MiniWorld. Initially, the agent is tasked with searching for and navigating to a randomly placed red box in a closed room (Fig. 4(c)). After a number of episodes, the agent has to adapt its skills to navigate to a randomly located blue box (Fig. 4(d)), which it has never seen before. Here, the policy over options as well as all the option components are learned at the same time.
Results: The IOC agent outperforms both OC and PPO agents when required to adapt to the new task (Fig. 4). This result indicates that options learned with interest functions are more easily transferable: the IOC agent adapts faster to unseen scenarios.
HalfCheetah. We also study adaptation in a complex locomotion task for a planar cheetah. The initial configuration follows the standard HalfCheetah-v1 environment from OpenAI Gym: the agent is rewarded for moving forward as fast as possible. After a number of iterations, we modify the reward function so that the agent is encouraged to move backward as fast as possible [8].
Results: PPO outperforms both OC and IOC in the initial task. However, as soon as the task changes, IOC reacts most efficiently and converges to the highest score (Fig. 4). As seen consistently in all the environments, IOC generalizes much better across tasks, whereas PPO seems to overfit to the first task and generalizes poorly when the task changes.
In all our experiments, we observe that interest functions result in option specialization, which leads to both reusability and adaptability (i.e., an option may get slightly tweaked when the task changes), especially in the complex tasks.
Option interpretability
To gain a better understanding of the agent’s behavior, we visualize different aspects of the learning process in several tasks.
T-Maze. We visualize the interest functions learned in T-Maze (Fig. 5). Initially, the interest functions are randomized. At the end of the first task, the interest function of one option specializes in the lower diagonal of the state space (Fig. 5), whereas the other option's interest function is completely different (Fig. 5). When the task changes, the options readjust their interest. Eventually, the interest functions for the two options automatically specialize in different regions of the state space (last column of Fig. 5). Fig. 5 also illustrates the agent's trajectories at different time instances, where yellow and red dots indicate the two different options used along the trajectory. A visualization of the emergence of interest functions during the learning process is also available on the project page (see footnote 1). In contrast, the options learned by the OC agent are employed everywhere and do not specialize as much (see Appendix Fig. A2).
HalfCheetah. We analyze the skills learned in HalfCheetah. During the task of moving forward as fast as possible, the IOC agent employs one option to move forward by dragging its limbs, and the other to take much larger hopping steps (Fig. 6). Fig. 6 demonstrates the emergence of these very distinct skills and the agent's switching between them across time. Additionally, we analyzed each option at the end of the task in which the agent was rewarded for moving backward. One option now specializes in moving forward while the other focuses on moving backward. This is desirable, as the agent preserves some ability to solve both tasks. OC does not learn options which are as distinct: both of its options end up going backward and overfitting to the new task (see the accompanying videos, footnote 1).
MiniWorld. We visualize the acquired skills by inspecting the agent's behavior at the end of the first task. The IOC agent has learned two distinct options: one scans the surroundings, whereas the other is used to navigate directly towards the box once it has been located (Fig. 7). During task 2, one option is used primarily to move forward, whereas the other is employed when jittery motion is involved, such as turning and scanning.
Related Work
Temporal abstraction in RL has a rich history [27, 34, 4, 5, 24, 25, 30]. Options in particular have been shown to speed up convergence both empirically [28] and theoretically [22]. Constructing such temporal abstractions automatically from data has also been tackled extensively, and with some success [15, 21, 17, 20, 1, 19]. While some of these approaches require prior knowledge, fix the time horizon of partial policies [36], or use intrinsic rewards [17], the option-critic architecture [1] provides an end-to-end differentiable approach without needing subgoals or intrinsic motivation. We generalize its policy gradient approach to learn interest functions. While we use rewards in our gradient-based algorithm, our qualitative analysis also indicates some clustering of the states in which a given option starts, as in [18, 2, 23]. Our approach is closely related in motivation to [20]. However, our method does not make assumptions about a particular structure for the initiation and termination functions (beyond smoothness).
Initiation sets were an integral part of [31] and provide a way to control the complexity of exploration and planning with options. This aspect of options has since been ignored, including in recent works [1, 9, 10, 11], because there was no elegant way to learn initiation sets. We address this open problem by generalizing initiation sets to differentiable interest functions. Since an interest function is a component of an option, it can be transferred once learned.
Discussion and Future Directions
We introduced the notion of interest functions for options, which generalize initiation sets, with the purpose of controlling search complexity. We presented a policy-gradient method to learn options with interest functions, using general function approximation. Because the learned options are specialized, they both learn faster in a single task and adapt to changes much more efficiently than options which can initiate everywhere. Our qualitative results suggest that the interest function can be interpreted as an attention mechanism (see Appendix Fig. A4). To some extent, the learned interest functions also mitigate termination degeneracy (only one option being active all the time, or options switching too often), although our approach was not designed to tackle that problem directly. Exploring further the interaction of initiation and termination functions, and imposing more coordination between the two, is an interesting topic for future work.
In our current experiments, the agent optimizes a single external reward function. However, the same algorithm could be used with intrinsic rewards as well.
We did not explore in this paper the impact of interest functions in the context of planning. However, given the intuitions from classical planning, learning models for options with interest functions could lead to better and faster planning, which should be explored in the future.
Finally, other ways of incorporating interest functions into the policy over options are worth considering, so that only a few options need to be evaluated at each decision point.
Acknowledgments
The authors would like to thank NSERC & CIFAR for funding this research; Emmanuel Bengio, Kushal Arora for useful discussions throughout this project; Michael Littman, Zafarali Ahmed, Nishant Anand, and the anonymous reviewers for providing critical and constructive feedback.
References
 [1] (2017) The option-critic architecture. In AAAI, pp. 1726–1734.
 [2] (2013) On the bottleneck concept for options discovery.
 [3] (2018) gym-miniworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-miniworld
 [4] (1993) Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278.
 [5] (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303.
 [6] (1991) Made-up minds: a constructivist approach to artificial intelligence. MIT Press, Cambridge, MA, USA. ISBN 0262041200.
 [7] (1972) Learning and executing generalized robot plans. Artificial Intelligence 3, pp. 251–288.
 [8] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
 [9] (2017) When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.
 [10] (2019) The termination critic. arXiv preprint arXiv:1902.09996.
 [11] (2019) Per-decision option discounting. In International Conference on Machine Learning, pp. 2644–2652.
 [12] (1998) Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 220–229.
 [13] (2017) Learning options end-to-end for continuous action tasks. CoRR abs/1712.00004.
 [14] (2007) Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7, pp. 895–900.
 [15] (2011) Autonomous skill acquisition on a mobile manipulator. In AAAI.
 [16] (1983) Learning to solve problems by searching for macro-operators. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
 [17] (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 3675–3683.
 [18] (2016) Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv preprint arXiv:1605.05359.
 [19] (2017) A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.
 [20] (2016) Adaptive skills, adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, pp. 1588–1596.
 [21] (2015) Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research 53, pp. 375–438.
 [22] (2014) Scaling up approximate value iteration with options: better policies with fewer iterations. In International Conference on Machine Learning, pp. 127–135.
 [23] (2004) Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 71.
 [24] (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, Vol. 1, pp. 361–368.
 [25] (2002) Q-cut — dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295–306.
 [26] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
 [27] (1998) Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049.
 [28] (2000) Temporal abstraction in reinforcement learning. Ph.D. Thesis, University of Massachusetts.
 [29] (1995) Markov decision processes: discrete stochastic dynamic programming. Journal of the Operational Research Society 46 (6), pp. 792–792.
 [30] (2002) Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223.
 [31] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
 [32] (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA.
 [33] (2016) An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research 17, pp. 73:1–73:29.
 [34] (1995) Finding structure in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 385–392.
 [35] (2012) MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.
 [36] (2017) FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.
 [37] (2017) Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3742–3750.