For robots at the current stage, it is considered more practical for them to work in a specific environment or solve a specific problem. Such a concept is often referred to as scenario intelligence, in contrast to general artificial intelligence. To achieve such scenario intelligence, humans must transfer knowledge to robots by developing goal-oriented algorithms. Though feasible, these human-designed methods are sometimes unable to adapt to dynamically changing environments.
Recently, Deep Reinforcement Learning (DRL) has achieved significant success in various games and shows a promising future. However, directly applying such methods to real robots is quite difficult. First, most of these methods are trained via thousands of trial-and-error episodes, while real robots are too fragile to sustain such a process. Besides, considering that many well-developed algorithms already exist off the shelf in the robotics area, it is sometimes inefficient to make the robot learn from scratch. In this paper, the relationship between the existing human-designed methods and DRL is re-examined. The main concern here is whether we can develop an algorithm that inherits the efficiency and flexibility of learning-based methods while holding a controllable training process on the basis of existing methods.
Human decision making routinely involves choices over a broad range of time scales. By contrast, Markov Decision Process (MDP) based Reinforcement Learning (RL) does not have such foresight and makes each action based only on the current observation. Facing such limitations, Sutton proposed options to represent courses of action that take place at different time scales. Different from typical RL, this Hierarchical Reinforcement Learning (HRL) architecture is based on the Semi-Markov Decision Process (SMDP) and is intended to model temporally extended action sequences.
A schematic diagram of HRL is depicted in Fig. 1. Instead of picking the action directly, the agent first selects an option according to the meta-policy $\pi_\Omega$, then follows its intra-option policy until it is ended by its termination function $\beta$. Based on this HRL structure, the Option-Interruption framework is proposed in this paper. Inspired by the fact that the intra-option policies are independent of the meta-policy, we encode the options with existing human-designed methods, augmented with learnable termination functions. Such a combination brings several benefits: 1) By imparting human knowledge, the training process can be significantly sped up; 2) When combined with well-developed algorithms, the robot’s behavior can be restrained; for instance, equipped with obstacle avoidance algorithms, the agent will not hit the wall when learning to navigate; 3) The interruption mechanism enables the system to constantly monitor and respond to its external environment; 4) Our architecture is flexible and can be embedded in various scenarios by replacing the options.
In this paper, we propose the Option-Interruption architecture, which combines traditional methods with hierarchical reinforcement learning to achieve scenario intelligence. The update rule of the meta-policy is derived and the training process is given. The experimental results verify the effectiveness of our method.
A finite discounted Markov Decision Process can be denoted as a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ represent the sets of states and actions respectively. The state transition probability function $P(s' \mid s, a)$ is a conditional distribution over next states $s'$ given that an action $a$ is taken under the current state $s$. $r(s, a)$ is a reward function and $\gamma \in [0, 1]$ is a discount factor. The policy $\pi(a \mid s)$ is a probability distribution over actions conditioned on states. The objective of the agent is to learn a policy that maximizes the state-value function, i.e. the expected discounted future reward: $V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$, and similarly its action-value function is defined as $Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$.
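As a concrete illustration of these definitions, the state-value function of a fixed policy can be computed by iterating the Bellman expectation equation until a fixed point. The toy two-state MDP below (its states, rewards, and uniform policy are all illustrative assumptions, not from the paper) is a minimal sketch:

```python
# Minimal sketch: policy evaluation on a toy two-state MDP by iterating the
# Bellman expectation equation V(s) = sum_a pi(a|s)[r(s,a) + gamma * E V(s')].
# The MDP, rewards, and uniform policy are illustrative assumptions.

# P[s][a] -> list of (next_state, probability); R[s][a] -> reward
P = {0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 0.5, "go": 0.0}}
pi = {0: {"stay": 0.5, "go": 0.5},   # pi(a|s): uniform policy
     1: {"stay": 0.5, "go": 0.5}}
gamma = 0.9

V = {0: 0.0, 1: 0.0}
for _ in range(500):                  # fixed-point iteration to convergence
    V = {s: sum(pi[s][a] * (R[s][a]
                            + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                for a in P[s])
         for s in P}
```

For this symmetric toy MDP the iteration converges to $V(0) = 3.875$ and $V(1) = 3.625$, which can be verified by solving the two Bellman equations by hand.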
The MDP is conventionally conceived as not involving temporally extended actions and is hence unable to take advantage of high-level temporal abstraction. To correct this deficiency, the Semi-Markov Decision Process was proposed for continuous-time discrete-event systems. The actions in an SMDP last multiple time steps and are intended to model temporally extended courses of action. Therefore the transition probability generalizes to a joint probability $P(s', \tau \mid s, a)$: the transition from state $s$ to state $s'$ occurs after a positive waiting time, denoted by $\tau$, when action $a$ is executed.
The idea of temporally extended actions and the hierarchical structure is formulated as options. A Markovian option $\omega \in \Omega$ can be represented by a triple $\langle I_\omega, \pi_\omega, \beta_\omega \rangle$, in which $I_\omega \subseteq \mathcal{S}$ is an initiation set, $\pi_\omega$ is an intra-option policy, and $\beta_\omega: \mathcal{S} \to [0, 1]$ is a termination function. The option $\omega$ is available in state $s$ if and only if $s \in I_\omega$. We consider the call-and-return option execution model, in which an agent picks an option $\omega$ according to its meta-policy $\pi_\Omega(\omega \mid s)$, then follows the intra-option policy $\pi_\omega$ until the option terminates stochastically according to $\beta_\omega$ or finishes. Thus the value function can be reformulated as $V_\Omega(s) = \sum_{\omega} \pi_\Omega(\omega \mid s) Q_\Omega(s, \omega)$, where $Q_\Omega(s, \omega)$ is the option-value function.
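The call-and-return execution model described above can be sketched as a nested loop: the outer loop queries the meta-policy for an option, and the inner loop follows that option's intra-option policy until the termination function fires. The environment, policies, and termination functions below are toy stand-ins, not the paper's implementation:

```python
import random

# Sketch of call-and-return option execution: pick an option with the
# meta-policy, follow its intra-option policy, terminate stochastically
# according to beta(s'). All components are illustrative toy functions.

def run_episode(env_step, state, meta_policy, options, max_steps=100):
    """options: dict name -> (intra_policy, beta); beta(s) is in [0, 1]."""
    trajectory = []
    t = 0
    while t < max_steps:
        omega = meta_policy(state)              # meta-policy picks an option
        intra_policy, beta = options[omega]
        while t < max_steps:                    # follow the intra-option policy
            action = intra_policy(state)
            state = env_step(state, action)
            trajectory.append((omega, action, state))
            t += 1
            if random.random() < beta(state):   # stochastic termination
                break
    return trajectory
```

For example, on an integer "corridor" clipped at 5 with a single "right" option that terminates at the boundary, the loop steadily advances the state and re-selects the option upon each termination.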
Combined with deep learning techniques, such hierarchical architectures have attracted a lot of attention recently. Kulkarni et al. proposed a hierarchical-DQN framework operating at different temporal scales. By specifying subgoals, a top-level meta-controller learns a policy over intrinsic goals while a lower-level function learns a policy over primitive actions to satisfy the subgoals. Bacon et al. derived a set of policy gradient theorems for options and proposed an option-critic architecture capable of learning both the internal policies and the termination functions, in tandem with the meta-policy, without providing additional rewards or subgoals. Tessler et al. presented a deep hierarchical approach for lifelong learning, evaluated it in the Minecraft game with impressive performance, and showed a strong ability to reuse knowledge. However, all of these works ignore the fact that a variety of well-developed algorithms already exist off the shelf in the robotics field. Learning from scratch is inadvisable for robots, since it may cause serious damage to fragile hardware.
In this section, we first introduce the framework of our method, then derive the update rule based on policy gradient methods.
III-A Option-Interruption Framework
As introduced in Section II, a complete decision-making process in HRL involves two stages: select an option according to the meta-policy $\pi_\Omega$, then follow the chosen option's policy until termination. Each option is composed of $\langle I_\omega, \pi_\omega, \beta_\omega \rangle$, representing the initiation set, intra-option policy and termination function, respectively. In our Option-Interruption framework, the intra-option policy is embedded with existing methods and is thus deterministic, representing the temporal abstraction of human knowledge. The meta-policy and termination functions are obtained by training. The intra-option policy can be flexibly replaced by various existing methods according to the needs of the scenario.
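A minimal sketch of such an option follows: a fixed, scripted controller serves as the intra-option policy, while the termination function is a learnable sigmoid over state features. The class layout, feature encoding, and zero initialization are our assumptions for illustration, not the paper's exact implementation:

```python
import math

# Sketch of an Option-Interruption option: fixed human-designed intra-option
# policy, learnable sigmoid termination function (feature choice is assumed).

class Option:
    def __init__(self, scripted_policy, n_features):
        self.policy = scripted_policy            # fixed human-designed method
        self.theta = [0.0] * n_features          # learnable termination weights

    def beta(self, features):
        """Termination probability: sigmoid of a linear feature combination."""
        z = sum(w * f for w, f in zip(self.theta, features))
        return 1.0 / (1.0 + math.exp(-z))

    def act(self, state):
        return self.policy(state)                # delegate to the fixed policy
```

With zero-initialized weights the termination probability starts at 0.5 everywhere; training then shapes where the option should be interrupted.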
Take autonomous exploration as an example, where the robot is expected to explore the whole environment with a path as short as possible. Strategies that train the agent with primitive actions, e.g. up, down, left and right, are quite inefficient, not to mention that frequent collisions with obstacles further increase the training difficulty. By substituting options with path planning algorithms such as the $A^*$ algorithm, the agent can focus on high-level decision making and ignore low-level collision checking, thus making the training process more efficient.
III-B Update Rules
In this subsection, we derive how to train our framework utilizing policy gradient methods. Policy gradient methods learn the policy parameters $\theta$ based on the gradient of some performance measure $J(\theta)$ with respect to the policy parameters. Seeking to maximize the performance, their updates approximate gradient ascent on $J$: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$, where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of $J$ with respect to $\theta_t$.
We start from the undiscounted case where $\gamma = 1$. Let $\pi_\Omega$ denote the meta-policy parameterized by $\theta$ and $\beta_{\omega, \vartheta}$ the termination function of option $\omega$ parameterized by $\vartheta$. In our case, the intra-option policies are deterministic and do not need to be learned from scratch. We define the value of the start state of each episode as the performance measure: $J(\theta) \doteq V_{\pi_\Omega}(s_0)$,
where $V_{\pi_\Omega}$ is the value function under the meta-policy $\pi_\Omega$. To keep the notation simple, we leave it implicit that $\pi_\Omega$ is parameterized by $\theta$ and that the gradients are also with respect to $\theta$. The gradient of the state-value function can be rewritten in terms of the option-value function as $\nabla V_{\pi_\Omega}(s) = \nabla \Big[ \sum_{\omega} \pi_\Omega(\omega \mid s)\, Q_{\pi_\Omega}(s, \omega) \Big]$.
Since $Q_{\pi_\Omega}(s, \omega) = \sum_{s'} P(s' \mid s, \omega) \big[ r(s, \omega) + V_{\pi_\Omega}(s') \big]$, where $P(s' \mid s, \omega)$ is the transition probability under option $\omega$, Eq. (3) can be written, after repeatedly unrolling $V_{\pi_\Omega}$, as $\nabla V_{\pi_\Omega}(s) = \sum_{x} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi_\Omega) \sum_{\omega} \nabla \pi_\Omega(\omega \mid x)\, Q_{\pi_\Omega}(x, \omega)$, where $\Pr(s \to x, k, \pi_\Omega)$ represents the probability of transitioning from state $s$ to state $x$ in $k$ option steps under policy $\pi_\Omega$. Thus $\nabla J(\theta) \propto \mathbb{E}_{\pi_\Omega} \big[ \nabla \ln \pi_\Omega(\omega \mid s)\, Q_{\pi_\Omega}(s, \omega) \big]$,
where the expectation is over the on-policy state distribution under $\pi_\Omega$, and $s$ and $\omega$ are the sampled state and option. Note that both $s$ and $\omega$ lie on the option scale, i.e. they are sampled at the time step when the option is initialized. In this way, the expectation of the sample gradient is proportional to the actual gradient of the performance measure (which is the value function in our case) with respect to the parameters, and thus the meta-policy can be updated via $\theta \leftarrow \theta + \alpha\, Q_{\pi_\Omega}(s, \omega)\, \nabla_\theta \ln \pi_\Omega(\omega \mid s)$. More generally, we use a critic $V_w$, parameterized by $w$, to approximate the state-value function, reducing the variance and speeding up the learning process. We then generalize the previous undiscounted version to the discounted case ($\gamma < 1$). As Thomas showed, the discount factor makes the usual policy gradient estimator biased. However, correcting for this discrepancy also incurs data inefficiency, so we build our model on the biased policy gradient estimator for simplicity. Suppose the duration of the option $\omega$ is $\tau$, i.e. $\omega$ lasts for $\tau$ time steps; then $G = \sum_{k=0}^{\tau-1} \gamma^k r_{t+k+1}$ is the discounted accumulated return from $t$ to $t+\tau$. The parameters of the meta-policy are updated at the option scale: $\theta \leftarrow \theta + \alpha_\theta \big( G + \gamma^{\tau} V_w(s_{t+\tau}) - V_w(s_t) \big) \nabla_\theta \ln \pi_\Omega(\omega_t \mid s_t)$.
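The option-scale update above can be sketched with a tabular softmax (Boltzmann) meta-policy. The tabular preference representation, the learning rate, and the use of a plain critic-value baseline are simplifying assumptions for illustration:

```python
import math

# Sketch of a REINFORCE-style meta-policy update at the option scale, using
# a tabular Boltzmann (softmax) meta-policy. G is the discounted return
# accumulated over the option's duration; the baseline stands in for the
# critic value V_w(s). Learning rate and tabular form are assumptions.

def softmax(prefs):
    m = max(prefs)                       # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def meta_policy_update(theta, s, omega, G, baseline, alpha=0.1):
    """theta[s]: preference vector over options for state s."""
    probs = softmax(theta[s])
    advantage = G - baseline
    for o in range(len(theta[s])):
        # gradient of log-softmax: 1{o == omega} - pi(o|s)
        grad = (1.0 if o == omega else 0.0) - probs[o]
        theta[s][o] += alpha * advantage * grad
    return theta
```

With zero preferences and a positive advantage, the chosen option's preference rises and the alternatives' fall symmetrically, as expected from the log-softmax gradient.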
Interrupting options before they would finish naturally according to their termination conditions endows the agent with the flexibility to switch options when necessary. As derived in the option-critic architecture, the parameters $\vartheta$ of the termination function can be updated at each action time step as follows: $\vartheta \leftarrow \vartheta - \alpha_\vartheta\, \nabla_\vartheta \beta_{\omega, \vartheta}(s_{t+1}) \big( Q_\Omega(s_{t+1}, \omega_t) - V_\Omega(s_{t+1}) \big)$.
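A sketch of this termination update for a sigmoid-parameterized termination function over state features (the linear feature encoding and the learning rate are illustrative assumptions): the weights move opposite to the advantage $Q_\Omega(s', \omega) - V_\Omega(s')$, so the option terminates more readily where it has no advantage.

```python
import math

# Sketch of the termination-function update at the action time scale.
# beta is a sigmoid over a linear combination of state features, so
# d(beta)/d(w_i) = beta * (1 - beta) * f_i.

def termination_update(vartheta, features, q_sw, v_s, alpha=0.05):
    """vartheta: weights of the sigmoid termination function."""
    z = sum(w * f for w, f in zip(vartheta, features))
    beta = 1.0 / (1.0 + math.exp(-z))
    advantage = q_sw - v_s
    # Gradient step: lower beta where the current option is still
    # advantageous (advantage > 0), raise it where it is not.
    return [w - alpha * advantage * beta * (1.0 - beta) * f
            for w, f in zip(vartheta, features)]
```

For a single zero weight and unit feature, a positive advantage of 1 lowers the weight by $0.05 \times 0.5 \times 0.5 = 0.0125$, discouraging termination of a still-useful option.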
The whole Option-Interruption algorithm with the policy gradient update is presented in Algorithm 1. Notice that the meta-policy $\pi_\Omega$, the approximated state-value function $V_w$ and the option termination function $\beta_\vartheta$ are updated at different temporal scales.
IV-A Four-Room Navigation
To verify the effectiveness of our algorithm, we first consider the navigation problem in the classic four-room grid world. The four-room environment is depicted in Fig. 2(a), where the cells of the grid correspond to the states of the environment. There are four hallways connecting adjacent rooms, and our goal is the east hallway, marked in red in the figure. At the beginning of each episode, the agent is placed at a random location. From any state, it can perform one of four primitive actions: up, down, left or right. In addition, primitive movements can fail with some probability, in which case the agent randomly transits to one of the empty adjacent cells. Walls, represented by black cells, are not accessible; if the agent hits a wall, it remains in the same cell.
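The grid dynamics just described can be sketched as follows. The miniature map and the default slip probability are illustrative assumptions, not the paper's exact layout or value:

```python
import random

# Toy sketch of the four-room grid dynamics: a primitive move succeeds with
# probability 1 - slip, otherwise the agent slips to a random free neighbor;
# moves into walls leave the state unchanged. The tiny map is illustrative.

WALL, FREE = "#", "."
GRID = ["####",
        "#..#",
        "#..#",
        "####"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def neighbors(r, c):
    """Free cells adjacent to (r, c); the map border is all wall."""
    return [(r + dr, c + dc) for dr, dc in MOVES.values()
            if GRID[r + dr][c + dc] == FREE]

def step(state, action, slip=1/3, rng=random):
    r, c = state
    if rng.random() < slip:                       # movement failure: slip
        free = neighbors(r, c)
        return rng.choice(free) if free else state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if GRID[nr][nc] == FREE else state   # walls block moves
```

Setting `slip=0.0` recovers deterministic movement, which is convenient for checking the wall-blocking behavior in isolation.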
In our environment, there are four options corresponding to the four hallways. At any cell, only the two options that lead the agent to the hallways of the current room are available. Fig. 2(b) gives an example, showing the initiation set of the east hallway option along with its policy $\pi_\omega$, which follows the shortest path within the room to its target hallway. Besides, the reward is zero on all state transitions except the one reaching the goal.
We compare our method with the four-option Option-Critic (OC) architecture, which learns both the policy over options and the intra-option policies from scratch. We also implement the Actor-Critic (AC) method at the primitive action level. In all methods, $\pi_\Omega$ is parameterized with a Boltzmann distribution and the termination functions $\beta$ are parameterized with sigmoid functions. The discount factor $\gamma$ is the same across methods, and all value functions are updated with one-step learning. All weights are initialized to zero.
As shown in Fig. 3, during the training process the average episode length of our method converges much faster than that of OC and AC, which learn primitive actions from scratch, indicating that the involvement of prior knowledge can indeed help speed up the learning process. Another observation is that at the early stage of the training process, the average episode length of our method is remarkably lower, as displayed in TABLE I, which suggests that the options involving temporal abstraction are the key to reducing the search space. Such an advantage is extremely important for real robots, since it constrains the robot's behavior and accordingly protects it from unexpected damage when deployed in the real world.
It is noted that the effect of the termination function in this typical setting is not fully validated, since the optimal policy actually does not contain any termination. Hence we design a variant of the four-room navigation problem, where one of the three non-goal hallways is randomly blocked for a period of time steps. We compare our method with the non-interruption version, i.e. the option is terminated only when reaching the target hallway, and the result during the learning stage is shown in Fig. 4. After the early episodes, the episode length of the version without termination functions becomes higher than that of the version with termination functions, and the average option duration is much shorter when equipped with the termination function, from which we can easily recognize the effect of the interruption mechanism. The interruption mechanism endows the agent with the flexibility to switch policies in a timely manner; in our case, the agent interrupts the ongoing option if the target hallway is blocked.
IV-B Autonomous Exploration in Indoor Environments
Autonomously exploring an unknown environment is an essential task for mobile robots, where the agent is expected to find a safe path that covers the whole map while keeping the path cost as low as possible. In this subsection, we perform the exploration task in an indoor environment and compare our Option-Interruption architecture with typical DRL.
For learning-based exploration methods, the agent is expected to distill a strategy from its experience, since the indoor layouts of houses are well structured and contain rich spatial information. In each episode, the mobile range-sensing robot starts from a random location and explores the environment in a discrete manner: at every time step it moves a fixed length, and the episode ends when the whole area is covered. The key challenges for this task are: 1) the agent must learn to avoid obstacles; 2) the agent is expected to learn to move towards the unknown areas.
In our experiment, the state is an image patch of fixed size (in pixels) centered at the current location of the robot, as shown in Fig. 5. The robot is drawn at the center of the patch to indicate its orientation. The goal of the robot exploration task is to 1) cover the house with a low path length; 2) guarantee obstacle avoidance. In terms of the reward function, ideally three signals are enough to reflect the aforementioned goals: 1) a time penalty at each step to urge the agent to finish the task; 2) a success reward when completing the task; 3) a collision penalty when hitting walls. However, a reward function defined in this way is too difficult for the agent to learn from, due to the sparsity of the positive signals. Hence we define the reward in an informative way: the newly explored area is taken into consideration to encourage the agent to collect more information at each step. Our reward function at time step $t$ is defined as $r_t = c \cdot \Delta A_t + r_{\mathrm{time}} + r_{\mathrm{collision}} + r_{\mathrm{success}}$, where $\Delta A_t$ describes the newly covered area (measured by the number of pixels), $c$ is a constant coefficient for scaling, $r_{\mathrm{time}}$ and $r_{\mathrm{collision}}$ represent the time penalty and the collision penalty, and $r_{\mathrm{success}}$ is the reward given when the agent completes the exploration task.
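A minimal sketch of this reward computation follows. All numeric constants are illustrative assumptions, since the paper's exact values are not given in this section:

```python
# Sketch of the exploration reward: scaled newly-explored-area term plus
# time, collision, and success signals. Constants are illustrative only.

C_SCALE = 0.01        # scaling coefficient c for newly explored pixels
R_TIME = -0.1         # per-step time penalty
R_COLLISION = -1.0    # penalty when hitting a wall
R_SUCCESS = 10.0      # reward on completing full coverage

def exploration_reward(new_pixels, collided, done):
    """new_pixels: newly covered area at this step, in pixels."""
    r = C_SCALE * new_pixels + R_TIME
    if collided:
        r += R_COLLISION
    if done:
        r += R_SUCCESS
    return r
```

With these placeholder constants, a step uncovering 100 new pixels yields a positive reward even before task completion, which is exactly the dense shaping signal the sparse three-term reward lacks.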
The actions of the typical RL agent are four directions, up, down, left and right, with a fixed step length. As for our hierarchical structure, each option is specified as a one-step action, the same as in typical RL, but equipped with obstacle avoidance ability, i.e. for each option, the initiation set only contains the available adjacent free cells. In this way, the agent does not need to learn to avoid walls.
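Restricting each option's initiation set to free adjacent cells can be sketched as a simple mask over the four one-step options; the occupancy-grid encoding below is an assumption:

```python
# Sketch of safe option masking: an option is available only if its one-step
# move lands on a free cell, so collisions are ruled out by construction.
# Grid encoding ('#' obstacle, '.' free) is an illustrative assumption.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def available_options(grid, r, c):
    """grid: list of equal-length strings; returns names of safe options."""
    options = []
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
        if in_bounds and grid[nr][nc] == ".":
            options.append(name)
    return options
```

The meta-policy then only chooses among the returned options, so the collision penalty in the reward is never triggered by the hierarchical agent.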
To enable the robot to explore environments autonomously and efficiently, we build an Asynchronous Advantage Actor-Critic (A3C) network with parallel workers. The network details are as follows:
A convolution layer;
A convolution layer;
A convolution layer;
A fully connected layer;
A policy head $\pi$ and a value head $V$ for RL,
or a meta-policy head $\pi_\Omega$, a value head $V$ and termination functions $\beta$ for HRL;
where each convolution layer is specified by its kernel size, number of outputs, and stride. With an emphasis on the influence on the training process, we trained our agent on a single map without testing its generalization ability. The result is shown in Fig. 6.
As we can see, our method converges much faster than Actor-Critic (AC), with a large gap between the two learning curves. This indicates that by imparting human knowledge the training process can be significantly sped up, and our result could be further improved by plugging in other existing methods.
In summary, this paper proposes an Option-Interruption architecture that embeds existing methods into a hierarchical reinforcement learning structure. On the basis of the existing methods, the search space is considerably reduced and hence the training process is significantly sped up. On the other hand, the interruption mechanism provides flexibility with respect to changes in the external world, which existing methods lack. The experiments demonstrate the efficiency of our architecture given proper human knowledge.
At the same time, several directions remain for future work. For example, the training process of the termination function can be further investigated. As displayed in Fig. 4(a), although the final performance of the no-interruption version is worse, it performs well at the initial training stage; this suggests that, when training a real robot, one option is to disable the termination function at first to protect the fragile hardware. Another direction is to apply our method to real robots, which leaves considerable work to be done.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
-  R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
-  A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
-  P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in AAAI, 2017, pp. 1726–1734.
-  T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in neural information processing systems, 2016, pp. 3675–3683.
-  C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in Minecraft,” in AAAI, vol. 3, 2017, p. 6.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
-  R. S. Sutton, A. G. Barto, et al., Reinforcement learning: An introduction. MIT press, 1998.
-  P. Thomas, “Bias in natural actor-critic algorithms,” in International Conference on Machine Learning, 2014, pp. 441–448.
-  D. Zhu, T. Li, D. Ho, and Q. H. Meng, “Deep reinforcement learning supervised autonomous exploration in office environments,” in IEEE International Conference on Robotics and Automation, 2018.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.