Real-world applications of reinforcement learning (RL) face two main challenges: complex long-running tasks and partial observability. Options, the particular instance of Hierarchical RL we focus on, address the first challenge by factoring a complex task into simpler sub-tasks [Barto2003, Roy2006, Tessler2016]. Instead of learning what action to perform depending on an observation, the agent learns a top-level policy that repeatedly selects options, which in turn execute sequences of actions before returning [Sutton1999]. The second challenge, partial observability, is addressed by maintaining a belief of what the agent thinks the full state is [Kaelbling1998, Cassandra1994], reasoning about possible future observations [Littman2001, Boots2009], storing information in an external memory for later reuse [Peshkin2001, Zaremba2015, Graves2016], or using recurrent neural networks (RNNs) to allow information to flow between time-steps [Bakker2001, Mnih2016].
Combined solutions to the above two challenges have recently been designed for planning [He2011], but solutions for learning algorithms are not yet ideal. HQ-Learning decomposes a task into a sequence of fully-observable subtasks [Wiering1997], which precludes cyclic tasks from being solved. Using recurrent neural networks in options and for the top-level policy [Sridharan2010] addresses both challenges, but brings in the design complexity of RNNs [Jozefowicz15, Angeline94, Mikolov14]. RNNs also have limitations regarding long time horizons, as their memory decays over time [Hochreiter1997].
In her PhD thesis, Precup (Precup2000, page 126) suggests that options may already be close to addressing partial observability, thus removing the need for more complicated solutions. In this paper, we prove this intuition correct by:
Showing that standard options do not suffice in POMDPs;
Introducing Option-Observation Initiation Sets (OOIs), that make the initiation sets of options conditional on the previously-executed option;
Proving that OOIs make options at least as expressive as Finite State Controllers (Section 3.2), thus able to tackle challenging POMDPs.
In contrast to existing HRL algorithms for POMDPs [Wiering1997, Theocharous2002, Sridharan2010], OOIs handle repetitive tasks, do not restrict the action set available to sub-tasks, and keep the top-level and option policies memoryless. A wide range of robotic and simulated experiments in Section 4 confirm that OOIs allow partially observable tasks to be solved optimally, demonstrate that OOIs are much more sample-efficient than a recurrent neural network over options, and illustrate the flexibility of OOIs regarding the amount of domain knowledge available at design time. In Section 4.5, we demonstrate the robustness of OOIs to sub-optimal option sets. While it is generally accepted that the designer provides the options and their initiation sets, we show in Section 4.4 that random initiation sets, combined with learned option policies and termination functions, allow OOIs to be used without any domain knowledge.
1.1 Motivating Example
OOIs are designed to solve complex partially-observable tasks that can be decomposed into a set of fully-observable sub-tasks. For instance, a robot with first-person sensors may be able to avoid obstacles, open doors or manipulate objects even if its precise location in the building is not observed. We now introduce such an environment, on which our robotic experiments of Section 4.3 are based.
A Khepera III robot (http://www.k-team.com/mobile-robotics-products/old-products/khepera-iii) has to gather objects from two terminals separated by a wall, and to bring them to the root (see Figure 1). Objects have to be gathered one by one from a terminal until it becomes empty, which requires many journeys between the root and a terminal. When a terminal is emptied, the other one is automatically refilled. The robot therefore has to alternately gather objects from both terminals, and the episode finishes after the terminals have been emptied some random number of times. The root is colored in red and marked by a paper QR-code encoding 1. Each terminal has a screen displaying its color and a dynamic QR-code (1 when full, 2 when empty). Because the robot cannot read QR-codes from far away, the state of a terminal cannot be observed from the root, where the agent has to decide to which terminal it will go. This makes the environment partially observable, and requires the robot to remember which terminal was last visited, and whether it was full or empty.
The robot is able to control the speed of its two wheels. A wireless camera mounted on top of the robot detects bright color blobs in its field of view, and can read nearby QR-codes. Such low-level actions and observations, combined with a complicated task, motivate the use of hierarchical reinforcement learning. Fixed options allow the robot to move towards the largest red, green or blue blob in its field of view. The options terminate as soon as a QR-code is in front of the camera and close enough to be read. The robot has to learn a policy over options that solves the task.
The robot may have to gather a large number of objects, alternating between terminals several times. The repetitive nature of this task is incompatible with HQ-Learning [Wiering1997]. Options with standard initiation sets are not able to solve this task, as the top-level policy is memoryless [Sutton1999] and cannot remember from which terminal the robot arrives at the root, and whether that terminal was full or empty. Because the terminals are a dozen feet away from the root, almost a hundred primitive actions have to be executed to complete any root/terminal journey. Without options, this represents a time horizon much larger than usually handled by recurrent neural networks [Bakker2001] or finite history windows [Lin1993].
OOIs allow each option to be selected conditionally on the previously executed one (see Section 3.1), which is much simpler than combining options and recurrent neural networks [Sridharan2010]. The ability of OOIs to solve complex POMDPs builds on the time abstraction capabilities and expressiveness of options. Section 4.3 shows that OOIs allow a policy for our robotic task to be learned to expert level. Additional experiments demonstrate that both the top-level and option policies can be learned by the agent (see Section 4.4), and that OOIs lead to substantial gains over standard initiation sets even if the option set is reduced or unsuited to the task (see Section 4.5).
This section formally introduces Markov Decision Processes (MDPs), Options, Partially Observable MDPs (POMDPs) and Finite State Controllers, before presenting our main contribution in Section 3.
2.1 Markov Decision Processes
A discrete-time Markov Decision Process (MDP) with discrete actions is defined by a possibly-infinite set $S$ of states, a finite set $A$ of actions, a reward function $R(s_t, a_t, s_{t+1}) \in \mathbb{R}$ that provides a scalar reward $r_t$ for each state transition, a transition function $T(s_t, a_t, s_{t+1}) \in [0, 1]$ that outputs a probability distribution over new states $s_{t+1}$ given a state-action pair $(s_t, a_t)$, and the discount factor $0 \le \gamma < 1$, which defines how sensitive the agent should be to future rewards.
A stochastic memoryless policy $\pi(a_t \mid s_t) \in [0, 1]$ maps a state to a probability distribution over actions. The goal of the agent is to find a policy that maximizes the expected cumulative discounted reward obtainable by following that policy.
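Written out, the objective described above is the standard expected discounted return. The symbols are introduced here for illustration: $r_t$ is the reward received at time-step $t$, $R$ the reward function, $\gamma$ the discount factor and $\pi^{*}$ an optimal policy.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right],
\qquad r_{t} = R(s_{t}, a_{t}, s_{t+1})
```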
2.2 Options
The options framework, defined in the context of MDPs [Sutton1999], consists of a set of options $O$ where each option $\omega \in O$ is a tuple $\langle \pi_\omega, I_\omega, \beta_\omega \rangle$, with $\pi_\omega(a_t \mid s_t)$ the memoryless option policy, $\beta_\omega(s_t) \in [0, 1]$ the termination function that gives the probability for the option $\omega$ to terminate in state $s_t$, and $I_\omega \subseteq S$ the initiation set that defines in which states $\omega$ can be started [Sutton1999].
The memoryless top-level policy $\mu(\omega_t \mid s_t)$ maps states to a distribution over options and selects which option to start in a given state. When an option $\omega$ is started, it executes until termination (due to $\beta_\omega$), at which point $\mu$ selects a new option based on the now current state.
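The call-and-return execution described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Option` class, the `run_options` function and the toy five-state corridor are hypothetical, and option policies are deterministic for brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Option:
    name: str
    policy: Callable[[int], int]          # option policy: state -> action
    termination: Callable[[int], float]   # probability of terminating in a state
    initiation: Set[int]                  # states in which the option may start


def run_options(options, top_level, step, state, max_steps, rng):
    """Execute options in call-and-return fashion; return the visited states."""
    trajectory: List[int] = [state]
    current = None
    for _ in range(max_steps):
        if current is None:
            # The top-level policy only chooses among options that may start here.
            available = [o for o in options if state in o.initiation]
            current = top_level(state, available)
        state = step(state, current.policy(state))
        trajectory.append(state)
        if rng.random() < current.termination(state):
            current = None  # option finished, control returns to the top level
    return trajectory


# Toy corridor with states 0..4 (an assumption made for this example only).
def step(state, action):
    return max(0, min(4, state + action))


go_right = Option("right", lambda s: +1, lambda s: 1.0 if s == 4 else 0.0, {0, 1, 2, 3})
go_left = Option("left", lambda s: -1, lambda s: 1.0 if s == 0 else 0.0, {1, 2, 3, 4})


def top_level(state, available):
    # Memoryless: the choice depends only on the current state.
    return available[0]


trajectory = run_options([go_right, go_left], top_level, step, 0, 8, random.Random(0))
print(trajectory)
```

In this toy run, the initiation sets alone decide which option is available at each end of the corridor, so the agent walks right to state 4 and then back to state 0.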
2.3 Partially Observable MDPs
Most real-world problems are not completely captured by MDPs, and exhibit at least some degree of partial observability. A Partially Observable MDP (POMDP) is an MDP extended with two components: the possibly-infinite set $\Omega$ of observations, and the observation function $S \to \Omega$ that produces observations $x$ based on the unobservable state $s$ of the process. Two different states, requiring two different optimal actions, may produce the same observation. This makes POMDPs remarkably challenging for reinforcement learning algorithms, as memoryless policies, which select actions or options based only on the current observation, typically no longer suffice.
2.4 Finite State Controllers
Finite State Controllers (FSCs) are commonly used in POMDPs. An FSC $\langle \mathcal{N}, \psi, \eta, \eta^0 \rangle$ is defined by a finite set $\mathcal{N}$ of nodes, an action function $\psi(n, a) \in [0, 1]$ that maps nodes to a probability distribution over actions, a successor function $\eta(n', n, x) \in [0, 1]$ that maps nodes and observations to a probability distribution over next nodes, and an initial function $\eta^0(n, x) \in [0, 1]$ that maps initial observations to nodes [Meuleau1999].
At the first time-step, the agent observes $x_0$ and activates a node $n_0$ by sampling from $\eta^0(\cdot, x_0)$. An action is performed by sampling from $\psi(n_0, \cdot)$. At each time-step $t$, a node $n_t$ is sampled from $\eta(n_{t-1}, \cdot, x_t)$, then an action $a_t$ is sampled from $\psi(n_t, \cdot)$. FSCs allow the agent to select actions according to the entire history of past observations [Meuleau1999], which has been shown to be one of the best approaches for POMDPs [Lin1992]. OOIs, our main contribution, make options at least as expressive and as relevant to POMDPs as FSCs, while being able to leverage the hierarchical structure of the problem.
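The sampling procedure described above can be sketched in a few lines. The dictionary-based encoding (`psi`, `eta`, `eta0`), the helper names and the two-node example FSC are assumptions made for illustration; they are not taken from the paper.

```python
import random


def sample(dist, rng):
    """Sample a key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def run_fsc(psi, eta, eta0, observations, rng):
    """Run an FSC on a sequence of observations and return the emitted actions.

    psi[n]      -> distribution over actions for node n
    eta[(n, x)] -> distribution over next nodes, given node n and observation x
    eta0[x]     -> distribution over initial nodes, given the first observation x
    """
    actions = []
    node = None
    for t, x in enumerate(observations):
        # The first node comes from the initial function, later ones from the successor function.
        node = sample(eta0[x], rng) if t == 0 else sample(eta[(node, x)], rng)
        actions.append(sample(psi[node], rng))
    return actions


# Hypothetical two-node FSC that remembers a hint seen at the first time-step.
psi = {"n_left": {"left": 1.0}, "n_right": {"right": 1.0}}
eta0 = {"hint_left": {"n_left": 1.0}, "hint_right": {"n_right": 1.0}}
eta = {(n, "blank"): {n: 1.0} for n in psi}  # later observations keep the node unchanged

actions = run_fsc(psi, eta, eta0, ["hint_right", "blank", "blank"], random.Random(0))
print(actions)
```

The node acts as the controller's memory: the hint is only observed once, yet every later action still depends on it.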
3 Option-Observation Initiation Sets
Our main contribution, Option-Observation Initiation Sets (OOIs), make the initiation sets of options conditional on the option that has just terminated. We prove that OOIs make options at least as expressive as FSCs (thus suited to POMDPs, see Section 3.2), even if the top-level and option policies are memoryless, while options without OOIs are strictly less expressive than FSCs (see Section 3.3). In Section 4, we show on one robotic and two simulated tasks that OOIs allow challenging POMDPs to be solved optimally.
3.1 Conditioning on Previous Option
Descriptions of partially observable tasks in natural language often contain allusions to sub-tasks that must be sequenced or cycled through, possibly with branches. This is easily mapped to a policy over options (learned by the agent) and sets of options that may or may not follow each other.
A good memory-based policy for our motivating example, where the agent has to bring objects from two terminals to the root, can be described as “go to the green terminal, then go to the root, then go back to the green terminal if it was full, to the blue terminal otherwise”, and symmetrically so for the blue terminal. This sequence of sub-tasks, which contains a condition, is easily translated to a set of options. Two options, $\omega_{GF}$ and $\omega_{GE}$, sharing a single policy, go from the green terminal to the root (using low-level motor actions). $\omega_{GF}$ is executed when the terminal is full, $\omega_{GE}$ when it is empty. At the root, the option that goes back to the green terminal can only follow $\omega_{GF}$, not $\omega_{GE}$. When the green terminal is empty, going back to it is therefore forbidden, which forces the agent to switch to the blue terminal when the green one is empty.
We now formally define our main contribution, Option-Observation Initiation Sets (OOIs), which describe which options may follow which ones. We define the initiation set $I_\omega$ of option $\omega$ so that the set $O_t$ of options available at time $t$ depends on the observation $x_t$ and previously-executed option $\omega_{t-1}$:
$$I_\omega \subseteq \Omega \times (O \cup \{\emptyset\})$$
$$O_t \equiv \{\omega \in O : (x_t, \omega_{t-1}) \in I_\omega\}$$
with $\omega_{-1} = \emptyset$, $\Omega$ the set of observations and $O$ the set of options. $O_t$ allows the agent to condition the option selected at time $t$ on the one that has just terminated, even if the top-level policy does not observe $\omega_{t-1}$. The top-level and option policies remain memoryless. Not having to observe $\omega_{t-1}$ keeps the observation space of the top-level policy small, instead of extending it to $\Omega \times O$, without impairing the representational power of OOIs, as shown in the next sub-section.
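A minimal sketch of how the set of available options could be computed from OOIs, using the green-terminal half of the motivating example. The option and observation names are hypothetical, `None` stands for the absence of a previous option at the start of an episode, and the choice to let `to_blue` follow either return option is an assumption of this example.

```python
from typing import Dict, List, Optional, Set, Tuple

# An initiation set is a set of (observation, previous option) pairs.
InitiationSet = Set[Tuple[str, Optional[str]]]


def available_options(ooi: Dict[str, InitiationSet],
                      observation: str,
                      previous: Optional[str]) -> List[str]:
    """Return the options whose initiation set contains (observation, previous option)."""
    return sorted(name for name, pairs in ooi.items() if (observation, previous) in pairs)


OOI: Dict[str, InitiationSet] = {
    # Going back to green is only allowed after returning from a full green terminal.
    "to_green": {("root", None), ("root", "green_full_to_root")},
    "to_blue": {("root", None), ("root", "green_full_to_root"), ("root", "green_empty_to_root")},
    # Both return options share a policy but record what was seen at the terminal.
    "green_full_to_root": {("green_full", "to_green")},
    "green_empty_to_root": {("green_empty", "to_green")},
}

print(available_options(OOI, "root", "green_empty_to_root"))
print(available_options(OOI, "root", "green_full_to_root"))
```

The top-level policy sees the same observation, `root`, in both calls. Only the set of options it may choose from differs, which is how the information gathered at the terminal reaches the root without the policy itself having memory.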
3.2 OOIs Make Options as Expressive as FSCs
Finite State Controllers are state-of-the-art in policies applicable to POMDPs [Meuleau1999]. By proving that options with OOIs are as expressive as FSCs, we provide a lower bound on the expressiveness of OOIs and ensure that they are applicable to a wide range of POMDPs.
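Theorem 1. OOIs allow options to represent any policy that can be expressed using a Finite State Controller.
Proof. The reduction from any FSC, with action function $\psi$, successor function $\eta$ and initial function $\eta^0$, to options requires one option $\langle n'_{t-1}, n_t \rangle$ per ordered pair of nodes in the FSC, and one option $\langle \emptyset, n_0 \rangle$ per node in the FSC. Assuming that $n_{-1} = \emptyset$ and $\eta(\emptyset, \cdot, \cdot) = \eta^0(\cdot, \cdot)$, the options are defined by:
$$\beta_{\langle n'_{t-1}, n_t \rangle}(x_t) = 1 \qquad (1)$$
$$\pi_{\langle n'_{t-1}, n_t \rangle}(a_t \mid x_t) = \psi(n_t, a_t) \qquad (2)$$
$$\mu(\langle n'_{t-1}, n_t \rangle \mid x_t) = \eta(n'_{t-1}, n_t, x_t) \qquad (3)$$
$$I_{\langle n'_{t-1}, n_t \rangle} = \Omega \times \{\langle n_{t-2}, n_{t-1} \rangle : n'_{t-1} = n_{t-1}\} \qquad (4)$$
Each option corresponds to an edge of the FSC. Equation 1 ensures that every option stops after having emitted a single action, as the FSC takes one transition every time-step. Equation 2 maps the current option to the action emitted by the destination node of its corresponding FSC edge. We show that $\mu$ and $I_{\langle n'_{t-1}, n_t \rangle}$ implement $\eta(n_{t-1}, n_t, x_t)$, with $\omega_{t-1} = \langle n_{t-2}, n_{t-1} \rangle$ and $O_t$ the set of options available at time $t$, by:
$$\mu(\langle n'_{t-1}, n_t \rangle \mid x_t, \omega_{t-1}) \propto \begin{cases} \eta(n'_{t-1}, n_t, x_t) & \text{if } \langle n'_{t-1}, n_t \rangle \in O_t \Leftrightarrow n'_{t-1} = n_{t-1} \\ 0 & \text{otherwise} \end{cases}$$
Because $\eta$ maps nodes to nodes and $\mu$ selects options representing pairs of nodes, $\mu$ is extremely sparse and returns a value different from zero, $\eta(n_{t-1}, n_t, x_t)$, only when $\langle n'_{t-1}, n_t \rangle$ and $\langle n_{t-2}, n_{t-1} \rangle$ agree on $n_{t-1}$. ∎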
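The reduction can be illustrated with a small sketch. Everything here is an assumption made for the example: the dictionary encoding of the FSC, the function names, the use of `None` for the empty node, and the two-node alternating FSC (our reading of the kind of controller described for Figure 3). The top-level preferences are unnormalised and only become a distribution once restricted to the allowed options.

```python
from itertools import product

EMPTY = None  # stands for the empty node / absence of a previous option


def reduce_fsc(nodes, eta, eta0, psi):
    """Build single-step options <n_prev, n> from an FSC (illustrative sketch)."""
    options = list(product(nodes, nodes)) + [(EMPTY, n) for n in nodes]

    def allowed(option, previous_option):
        # OOI: <n', n> may only follow an option that ended in node n'.
        prev_node = EMPTY if previous_option is None else previous_option[1]
        return option[0] == prev_node

    def mu(option, x):
        # Memoryless top-level preference: it never looks at the previous option.
        n_prev, n = option
        if n_prev is EMPTY:
            return eta0[x].get(n, 0.0)
        return eta[(n_prev, x)].get(n, 0.0)

    def option_policy(option):
        # Each option emits the action of the destination node of its edge.
        return psi[option[1]]

    def top_level(x, previous_option):
        weights = {o: mu(o, x) for o in options if allowed(o, previous_option)}
        total = sum(weights.values())
        return {o: w / total for o, w in weights.items() if w > 0.0}

    return options, top_level, option_policy


# Two-node FSC emitting A, B, A, B, ... under a constant observation "x".
nodes = ["a", "b"]
eta0 = {"x": {"a": 1.0}}
eta = {("a", "x"): {"b": 1.0}, ("b", "x"): {"a": 1.0}}
psi = {"a": {"A": 1.0}, "b": {"B": 1.0}}

options, top_level, option_policy = reduce_fsc(nodes, eta, eta0, psi)

previous, emitted = None, []
for _ in range(4):
    dist = top_level("x", previous)
    previous = max(dist, key=dist.get)      # deterministic in this example
    action_dist = option_policy(previous)
    emitted.append(max(action_dist, key=action_dist.get))

print(len(options), "".join(emitted))
```

For two nodes the construction yields four pair options plus two initial options, and the option-based controller reproduces the alternating sequence even though `mu` itself ignores the previous option.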
Our reduction uses options with trivial policies that execute for a single time-step, which requires a large number of options to compensate. In practice, we expect to be able to express policies for real-world POMDPs with far fewer options than the number of states an FSC would require, as shown in our simulated (Section 4.4, 2 options) and robotic experiments (Section 4.3, 12 options). OOIs are not only sufficient: the next sub-section proves that they are also necessary for options to be as expressive as FSCs.
3.3 Original Options are not as Expressive as FSCs
While options with regular initiation sets are able to express some memory-based policies [Sutton1999, page 7], the tiny but valid Finite State Controller presented in Figure 3 cannot be mapped to a set of options and a policy over options (without OOIs). This proves that options without OOIs are strictly less expressive than FSCs.
Theorem 2. Options without OOIs are not as expressive as Finite State Controllers.
Proof. Figure 3 shows a Finite State Controller that emits a sequence of alternating A’s and B’s, based on a constant uninformative observation. This task requires memory because the observation does not provide any information about which letter was last emitted, or which one must now be emitted. Because options have memoryless policies, options executing for multiple time-steps are unable to represent the FSC exactly. A combination of options that execute for a single time-step cannot represent the FSC either, as the options framework is unable to represent memory-based policies with single-time-step options [Sutton1999]. ∎
The experiments in this section illustrate how OOIs allow agents to perform optimally in environments where options without OOIs fail. Section 4.3 shows that OOIs allow the agent to learn an expert-level policy for our motivating example (Section 1.1). Section 4.4 shows that the top-level and option policies required by a repetitive task can be learned, and that learning option policies allows the agent to leverage random OOIs, thereby removing the need for designing them. In Section 4.5, we progressively reduce the number of options available to the agent, and demonstrate how OOIs still allow good memory-based policies to emerge when a sub-optimal number of options is used.
All our results are averaged over 20 runs, with standard deviation represented by the light regions in the figures. The source code, raw experimental data, run scripts, and plotting scripts of our experiments, along with a detailed description of our robotic setup, are available as supplementary material. A video detailing our robotic experiment is available at http://steckdenis.be/oois_demo.mp4.
4.1 Learning Algorithm
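The policies $\pi_\omega$ and $\mu$ are represented by a single feed-forward neural network with one hidden layer. The network takes three inputs: the observation features $x$, the one-hot encoded current option $\omega$ ($\omega = 0$ when executing the top-level policy), and a mask, $\mathit{mask}$. The output $y$ is the joint probability distribution over selecting actions or options (so that the same network can be used for the top-level and option policies), while terminating or continuing the current option:
$$h_1 = \tanh(W_1 [x^T \omega^T]^T + b_1)$$
$$\hat{y} = \sigma(W_2 h_1 + b_2) \circ \mathit{mask}$$
$$y = \frac{\hat{y}}{\mathbf{1}^T \hat{y}}$$
with $W_i$ and $b_i$ the trainable weights and biases of layer $i$, $\sigma$ the sigmoid function, and $\circ$ the element-wise product of two vectors. The fraction ensures that a valid probability distribution is produced by the network. The initiation sets of options are implemented using the $\mathit{mask}$ input of the neural network, a vector of integers of the same dimension as the output $y$. When executing the top-level policy ($\omega = 0$), the mask forces the probability of primitive actions to zero, preserves option $\omega_i$ according to $I_{\omega_i}$, and prevents the top-level policy from terminating. When executing an option policy ($\omega \neq 0$), the mask only allows primitive actions to be executed. For instance, if there are two options and three actions, the mask is
$$\mathit{mask} = \begin{pmatrix} \text{end} & 0 & 0 & 1 & 1 & 1 \\ \text{cont} & 0 & 0 & 1 & 1 & 1 \end{pmatrix}$$
when executing any of the options. When executing the top-level policy,
$$\mathit{mask} = \begin{pmatrix} \text{end} & 0 & 0 & 0 & 0 & 0 \\ \text{cont} & a & b & 0 & 0 & 0 \end{pmatrix}$$
with $a = 1$ if and only if the option that has just finished is in the initiation set of the first option, and $b$ according to the same rule but for the second option. The neural network is trained using Policy Gradient, with the following loss:
$$\mathcal{L}(\pi) = -\sum_{t=0}^{T} \left( R_t - V(x_t, \omega_t) \right) \log \pi(a_t \mid x_t, \omega_t)$$
with $a_t \sim \pi(\cdot \mid x_t, \omega_t)$ the action executed at time $t$. The return $R_t = \sum_{\tau=t}^{T} \gamma^{\tau - t} r_\tau$, with $r_\tau = R(s_\tau, a_\tau, s_{\tau+1})$, is a simple discounted sum of future rewards, and ignores changes of current option. This gives the agent information about the complete outcome of an action or option, by directly evaluating its flattened policy. A baseline $V(x_t, \omega_t)$ is used to reduce the variance of the $\mathcal{L}$ estimate [Sutton2000]. $V(x_t, \omega_t)$ predicts the expected cumulative reward obtainable from $x_t$ in option $\omega_t$ using a separate neural network, trained on the Monte-Carlo return obtained from $x_t$ in $\omega_t$.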
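The masking and normalisation steps, together with the discounted return, can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the output is laid out as an "end" row followed by a "continue" row with options listed before primitive actions, and the return is discounted relative to the current time-step.

```python
import math


def build_mask(n_options, n_actions, current_option, allowed_options=None):
    """Return the mask as an 'end' row followed by a 'continue' row.

    current_option is None while the top-level policy is executing.
    allowed_options lists the indices of the options whose initiation set
    contains the option that has just terminated (top level only).
    """
    if current_option is None:
        allowed = set(allowed_options or [])
        end_row = [0] * (n_options + n_actions)  # the top level never terminates
        cont_row = [1 if i in allowed else 0 for i in range(n_options)] + [0] * n_actions
    else:
        # Option policies may only emit primitive actions, then end or continue.
        end_row = [0] * n_options + [1] * n_actions
        cont_row = [0] * n_options + [1] * n_actions
    return end_row + cont_row


def masked_distribution(logits, mask):
    """Sigmoid, element-wise mask, then normalisation so the outputs sum to one.

    Assumes at least one mask entry is non-zero.
    """
    y_hat = [m / (1.0 + math.exp(-z)) for z, m in zip(logits, mask)]
    total = sum(y_hat)
    return [v / total for v in y_hat]


def discounted_returns(rewards, gamma):
    """Monte-Carlo returns, one per time-step, computed backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


option_mask = build_mask(2, 3, current_option=0)
top_mask = build_mask(2, 3, current_option=None, allowed_options=[0])
probabilities = masked_distribution([0.0] * 10, option_mask)
print(option_mask, top_mask, discounted_returns([1.0, 1.0, 1.0], 0.5))
```

Masking after the sigmoid rather than before it means a forbidden output gets exactly zero probability, whatever the network weights are.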
4.2 Comparison with LSTM over Options
In order to provide a complete evaluation of OOIs, a variant of the policy and baseline networks of Section 4.1, where the hidden layer is replaced with a layer of 20 LSTM units [Hochreiter1997, Sridharan2010], is also evaluated on every task. We use 20 units as this leads to the best results in our experiments, which ensures a fair comparison of LSTM against OOIs. In all experiments, the LSTM agents are provided the same set of options as the agent with OOIs. Not providing any option, or fewer options, leads to worse results. Options allow the LSTM network to focus on important observations, and reduce the time horizon to be considered. Shorter time horizons have been shown to be beneficial to LSTM [Bakker2001].
Despite our efforts, LSTM over options only manages to learn good policies in our robotic experiment (see Section 4.3), and requires more than twice as many episodes as OOIs to do so. In our repetitive task, dozens of repetitions seem to confuse the network, which quickly diverges from any good policy it may learn (see Section 4.4). On TreeMaze, a much more complex version of the T-maze task, originally used to benchmark reinforcement learning LSTM agents [Bakker2001], the LSTM agent learns the optimal policy after more than 100K episodes (not shown on the figures). These results illustrate how learning with recurrent neural networks is sometimes difficult, and how OOIs allow good results to be obtained reliably, with minimal engineering effort.
4.3 Object Gathering
The first experiment illustrates how OOIs allow an expert-level policy to be learned for a complex robotic partially-observable repetitive task. The experiment takes place in the environment described in Section 1.1. A robot has to gather objects one by one from two terminals, green and blue, and bring them back to the root location. Because our actual robot has no effector, it navigates between the root and the terminals, but only pretends to move objects. The agent receives a reward of +2 when it reaches a full terminal, -2 when the terminal is empty. At the beginning of the episode, each terminal contains 2 to 4 objects, this amount being selected randomly for each terminal. When the agent goes to an empty terminal, the other one is re-filled with 2 to 4 objects. The episode ends after 2 or 3 emptyings (combined across both terminals). Whether a terminal is full or empty is observed by the agent only when it is at the terminal. The agent therefore has to remember information acquired at terminals in order to properly choose, at the root, to which terminal it will go.
The agent has access to 12 memoryless options that go to red (), green () or blue objects (), and terminate when the agent is close enough to them to read a QR-code displayed on them. The initiation set of is , of is , and of is . This description of the options and their OOIs is purposefully uninformative, and illustrates how little information the agent has about the task. The option set used in this experiment is also richer than the simple example of Section 3.1, so that the solution of the problem, not going back to an empty terminal, is not encoded in OOIs but must be learned by the agent.
Agents with and without OOIs learn top-level policies over these options. We compare them to a fixed agent, using an expert top-level policy that interprets the options as follows: the options going to red go to the root from a full/empty green/blue terminal (and are selected accordingly at the terminals depending on the QR-code displayed on them), while the options going to green or blue go to the green/blue terminal from the root when the previous terminal was full/empty and green/blue. At the root, OOIs ensure that only one option amongst “go to green after a full green”, “go to green after an empty blue”, “go to blue after a full blue” and “go to blue after an empty green” is selected by the top-level policy: the one that corresponds to what color the last terminal was and whether it was full or empty. The agent goes to a terminal until it is empty, then switches to the other terminal, leading to an average reward of 10 (2 or 3 emptyings of terminals that contain 2 to 4 objects; average confirmed experimentally from 1000 episodes using the policy).
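The expert policy's expected return can be checked with a short calculation. It rests on assumptions that are not stated explicitly above: the number of objects per terminal and the number of emptyings are drawn uniformly and independently, every visit to a full terminal removes one object (+2), and each emptying costs exactly one visit to an empty terminal (-2). The symbols $k$ (objects in a terminal) and $m$ (number of emptyings) are introduced here for illustration.

```latex
\mathbb{E}[k] = \frac{2 + 3 + 4}{3} = 3, \qquad
\mathbb{E}[m] = \frac{2 + 3}{2} = 2.5, \qquad
\mathbb{E}[\text{return}] = \mathbb{E}[m] \, \bigl( 2\,\mathbb{E}[k] - 2 \bigr) = 2.5 \times 4 = 10
```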
When the top-level policy is learned, OOIs allow the task to be solved, as shown in Figure 4, while standard initiation sets do not allow the task to be learned. Because experiments on a robot are slow, we developed a small simulator for this task, and used it to produce Figure 4 after having verified its accuracy using two 1000-episode runs on the actual robot. The agent learns to properly select options at the terminals, depending on the QR-code, and to output a proper distribution over options at the root, thereby matching our expert policy. The LSTM agent learns the policy too, but requires more than twice as many episodes to do so. The high variance displayed in Figure 4 comes from the varying number of objects in the terminals, and the random selection of how many times they have to be emptied.
Because fixed option policies are not always available, we now show that OOIs allow them to be learned at the same time as the top-level policy.
4.4 Modified DuplicatedInput
In some cases, a hierarchical reinforcement learning agent may not have been provided policies for several or any of its options. In this case, OOIs allow the agent to learn its top-level policy, the option policies and their termination functions. In this experiment, the agent has to learn its top-level and option policies to copy characters from an input tape to an output tape, removing duplicate B’s and D’s (mapping ABBCCEDD to ABCCED for instance; B’s and D’s always appear in pairs). The agent only observes a single input character at a time, and can write at most one character to the output tape per time-step.
The input tape is a sequence of $N$ symbols $x \in \Omega$, with $\Omega = \{A, B, C, D, E\}$ and $N$ a random number between 20 and 30. The agent observes a single symbol $x_t \in \Omega$, read from the $i$-th position in the input sequence, and does not observe $i$. When $t = 0$, $i = 0$. There are 20 actions ($5 \times 2 \times 2$), each of them representing a symbol (5), whether it must be pushed onto the output tape (2), and whether $i$ should be incremented or decremented (2). A reward of 1 is given for each correct symbol written to the output tape. The episode finishes with a reward of -0.5 when an incorrect symbol is written.
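The target behaviour and the size of the action space can be made concrete with a short sketch. The helper `expected_output` and the tuple encoding of actions are hypothetical; the five-symbol alphabet is inferred from the example tape given earlier.

```python
from itertools import product

SYMBOLS = "ABCDE"


def expected_output(tape):
    """Copy the tape, dropping the second character of every BB or DD pair."""
    out, i = [], 0
    while i < len(tape):
        out.append(tape[i])
        # B's and D's always come in pairs, so skip the duplicate that follows.
        i += 2 if tape[i] in "BD" else 1
    return "".join(out)


# One action per (symbol, push to output?, head movement) triple: 5 * 2 * 2 = 20.
ACTIONS = list(product(SYMBOLS, (False, True), (-1, +1)))

print(expected_output("ABBCCEDD"), len(ACTIONS))
```

Note that the duplicated C's are kept: only B and D are deduplicated, so the agent must remember whether the B or D it currently observes is the first or the second of its pair.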
The agent has access to two options, $\omega_1$ and $\omega_2$. OOIs are designed so that $\omega_2$ cannot follow itself, with no such restriction on $\omega_1$. No reward shaping or hint about what each option should do is provided. The agent automatically discovers that $\omega_1$ must copy the current character to the output, and that $\omega_2$ must skip the character without copying it. It also learns the top-level policy, which selects $\omega_2$ (skip) when observing B or D and $\omega_2$ is allowed, otherwise $\omega_1$ (copy).
Figure 5 shows that an agent with two options and OOIs learns the optimal policy for this task, while an agent with two options and only standard initiation sets fails to do so. The agent without OOIs only learns to copy characters and never skips any (having two options does not help it). This shows that OOIs are necessary for learning this task, and allow the agent to learn top-level and option policies suited to our repetitive partially observable task.
When the option policies are learned, the agent becomes able to adapt itself to random OOIs, thereby removing the need for designing OOIs. For an agent with options, each option has randomly-selected options in its initiation set, with the initiation sets re-sampled for each run. The agents learn how to leverage their option set, and achieve good results on average (16 options used in Figure 5, more options lead to better results). When looking at individual runs, random OOIs allow optimal policies to be learned, but several runs require more time than others to do so. This explains the high variance and noticeable steps shown in Figure 5.
The next section shows that an improperly-defined set of human-provided options, as may happen in design phase, still allows the agent to perform reasonably well. Combined with our results with random OOIs, this shows that OOIs can be tailored to the exact amount of domain knowledge available for a particular task.
4.5 TreeMaze
The optimal set of options and OOIs may be difficult to design. When the agent learns the option policies, the previous section demonstrates that random OOIs suffice. This experiment focuses on human-provided option policies, and shows that a sub-optimal set of options, arising from a mis-specification of the environment or normal trial-and-error in the design phase, does not prevent agents with OOIs from learning reasonably good policies.
TreeMaze is our generalization of the T-maze environment [Bakker2001] to arbitrary heights. The agent starts at the root of the tree-like maze depicted in Figure 6, and has to reach the extremity of one of the 8 leaves. The leaf to be reached (the goal) is chosen uniformly randomly before each episode, and is indicated to the agent using 3 bits, observed one at a time during the first 3 time-steps. The agent receives no bit afterwards, and has to remember them in order to navigate to the goal. The agent observes its position in the current corridor (0 to 4) and the number of T junctions it has already crossed (0 to 3). A reward of -0.1 is given each time-step, +10 when reaching the goal. The episode finishes when the agent reaches any of the leaves. The optimal reward is 8.2.
We consider 14 options with predefined memoryless policies, several of them sharing the same policy, but encoding distinct states (among 14) of a 3-bit memory where some bits may be unknown. 6 partial-knowledge options $\omega_{0--}$, $\omega_{1--}$, $\omega_{00-}$, …, $\omega_{11-}$ go right then terminate. 8 full-knowledge options $\omega_{000}$, $\omega_{001}$, …, $\omega_{111}$ go to their corresponding leaf. OOIs are defined so that any option may only be followed by itself, or one that represents a memory state where a single 0 or - has been flipped to 1. Five agents have to learn their top-level policy, which requires them to learn how to use the available options to remember to which leaf to go. The agents do not know the name or meaning of the options. Three agents have access to all 14 options (with OOIs, without OOIs, and LSTM). The agent with OOIs (8) only has access to full-knowledge options, and therefore cannot disambiguate unknown and 0 bits. The agent with OOIs (4) is restricted to options $\omega_{000}$, $\omega_{010}$, $\omega_{100}$ and $\omega_{110}$, and therefore cannot reach odd-numbered goals. The options of the (8) and (4) agents terminate in the first two cells of the first corridor, to allow the top-level policy to observe the second and third bits.
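The option set and the OOI rule stated above can be enumerated with a short sketch. The string encoding of memory states (bits revealed left to right, `-` for an unknown bit), the function names, and the convention that a goal's number is the binary value of its three bits are assumptions made for this illustration.

```python
from itertools import product


def memory_states():
    """Enumerate the 14 memory states as strings such as '0--', '01-' or '011'."""
    partial = [a + "--" for a in "01"] + [a + b + "-" for a, b in product("01", repeat=2)]
    full = ["".join(bits) for bits in product("01", repeat=3)]
    return partial, full


def may_follow(previous, option):
    """OOI rule: same state, or exactly one '0' or '-' turned into '1'."""
    if previous == option:
        return True
    diffs = [(p, o) for p, o in zip(previous, option) if p != o]
    return len(diffs) == 1 and diffs[0][0] in "0-" and diffs[0][1] == "1"


partial, full = memory_states()
even_goals = [s for s in full if s.endswith("0")]  # options kept by the (4) agent

print(len(partial), len(full), even_goals)
print([s for s in partial + full if may_follow("0--", s)])
```

Under this rule a bit can only ever move towards 1, so following a chain of options never loses information that has already been stored.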
Figure 7 shows that the agent with OOIs (14) consistently learns the optimal policy for this task. When the number of options is reduced, the quality of the resulting policies decreases, while still remaining above that of the agent without OOIs. Even the agent with 4 options, which cannot reach half the goals, performs better than the agent with 14 options but no OOIs. This experiment demonstrates that OOIs provide measurable benefits over standard initiation sets, even if the option set is largely reduced.
Combined, our three experiments demonstrate that OOIs lead to optimal policies in challenging POMDPs, consistently outperform LSTM over options, allow the option policies to be learned, and can still be used when reduced or no domain knowledge is available.
5 Conclusion and Future Work
This paper proposes OOIs, an extension of the initiation sets of options so that they restrict which options are allowed to be executed after one terminates. This makes options as expressive as Finite State Controllers. Experimental results confirm that challenging partially observable tasks, simulated or on physical robots, one of them requiring exact information storage for hundreds of time-steps, can now be solved using options. Our experiments also illustrate how OOIs lead to reasonably good policies when the option set is improperly defined, and that learning the option policies allows random OOIs to be used, thereby providing a turnkey solution to partial observability.
Options with OOIs also perform surprisingly well compared to an LSTM network over options. While LSTM over options does not require the design of OOIs, its ability to learn without any a-priori knowledge comes at the cost of sample efficiency and explainability. Furthermore, random OOIs are as easy to use as an LSTM and lead to superior results (see Section 4.4). OOIs therefore provide a compelling alternative to recurrent neural networks over options, applicable to a wide range of problems.
Finally, the compatibility between OOIs and a large variety of reinforcement learning algorithms leads to many future research opportunities. For instance, we have obtained very encouraging results in continuous action spaces, using CACLA [VanHasselt2007] to implement parametric options, that take continuous arguments when executed, in continuous-action hierarchical POMDPs.
The first author is “Aspirant” with the Science Foundation of Flanders (FWO, Belgium), grant number 1129317N. The second author is “Postdoctoral Fellow” with the FWO, grant number 12J0617N.
Thanks to Finn Lattimore, who gave a computer to the first author, so that he could finish this paper while attending the UAI 2017 conference in Sydney, after his own computer unexpectedly fried. Thanks to Joris Scharpff for his very helpful input on this paper.