I Introduction
At the current stage, it is considered more practical for robots to work in a specific environment or to solve a specific problem. This concept is often referred to as scenario intelligence, in contrast to general artificial intelligence. To achieve scenario intelligence, humans must transfer knowledge to robots by developing goal-oriented algorithms. Though feasible, these human-designed methods are sometimes insensitive to dynamically changing environments.
Recently, Deep Reinforcement Learning (DRL) has achieved significant success in various games [1][2] and shows a promising future. However, directly applying such methods to real robots is difficult. First, most of these methods are trained over thousands of trial-and-error episodes, whereas a real robot is too fragile to sustain such a process. Second, since many algorithms are already available off the shelf in robotics, making the robot learn from scratch is often inefficient. In this paper, the relationship between existing human-designed methods and DRL is re-examined. The main question is whether we can develop an algorithm that inherits the efficiency and flexibility of learning-based methods while keeping the training process controllable by building on existing methods.
Human decision making routinely involves choices over a broad range of time scales. By contrast, Reinforcement Learning (RL) based on the Markov Decision Process (MDP) has no such foresight and selects each action based only on the current observation. To address this limitation, Sutton et al. [3] proposed options to represent courses of action that take place at different time scales. Unlike typical RL, this Hierarchical Reinforcement Learning (HRL) architecture is based on the Semi-Markov Decision Process (SMDP) and is intended to model temporally extended action sequences. A schematic diagram of HRL is depicted in Fig. 1. Instead of picking an action directly, the agent first selects an option according to the meta-policy, then follows that option’s intra-option policy until it is ended by the option’s termination function. Based on this HRL structure, the Option-Interruption framework is proposed in this paper. Since the intra-option policies are independent of the meta-policy, we encode the options with existing human-designed methods, augmented with learnable termination functions. This combination brings several benefits: 1) imparting human knowledge can significantly speed up the training process; 2) combining with well-developed algorithms restrains the robot’s behavior, for instance, equipped with an obstacle avoidance algorithm, the agent will not hit the wall when learning to navigate; 3) the interruption mechanism enables the system to constantly monitor and respond to its external environment; 4) the architecture is flexible and can be embedded in various scenarios by replacing the options.
In this paper, we propose the Option-Interruption architecture, which combines traditional methods with hierarchical reinforcement learning to achieve scenario intelligence. The update rule of the meta-policy is derived and the training process is given. The experimental results verify the effectiveness of our method.
II Preliminaries
A finite discounted Markov Decision Process can be denoted as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the sets of states and actions, respectively. The state transition probability function $P(s' \mid s, a)$ is a conditional distribution over next states $s'$ given that action $a$ is taken in the current state $s$. Here $r$ is a reward function and $\gamma$ is a discount factor. The policy $\pi(a \mid s)$ is a probability distribution over actions conditioned on states. The objective of the agent is to learn a policy that maximizes the state-value function, i.e. the expected discounted future reward:
$$V_\pi(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s\Big],$$
and similarly its action-value function is defined as $Q_\pi(s, a) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s, a_0 = a\big]$.
An MDP is conventionally understood not to involve temporally extended actions [3] and hence cannot take advantage of high-level temporal abstraction. To remedy this deficiency, the Semi-Markov Decision Process has been proposed [3][4] for continuous-time discrete-event systems. The actions in an SMDP last multiple time steps and are intended to model temporally extended courses of action. The transition probability therefore generalizes to a joint probability $P(s', \tau \mid s, a)$: the transition from state $s$ to state $s'$ occurs after a positive waiting time, denoted by $\tau$, when action $a$ is executed.
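The idea of temporally extended actions and the hierarchical structure is formulated as options in [3]. A Markovian option $\omega \in \Omega$ can be represented by a triple $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$, in which $\mathcal{I}_\omega \subseteq \mathcal{S}$ is an initiation set, $\pi_\omega$ is an intra-option policy, and $\beta_\omega : \mathcal{S} \to [0, 1]$ is a termination function. The option $\omega$ is available in state $s$ if and only if $s \in \mathcal{I}_\omega$. We consider the call-and-return option execution model [5], in which an agent picks an option $\omega$ according to its meta-policy $\pi_\Omega$, then follows the intra-option policy $\pi_\omega$ until the option terminates stochastically according to $\beta_\omega$ or the option is finished. Thus the value function can be reformulated as $V(s) = \sum_{\omega} \pi_\Omega(\omega \mid s)\, Q(s, \omega)$, where $Q(s, \omega)$ is the option-value function.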
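The option triple and the call-and-return execution model can be illustrated with a small sketch. The class name `Option`, its field names, and the `step(state, action)` environment hook are assumptions made for illustration only and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable


@dataclass
class Option:
    """A Markovian option (I, pi, beta); all field names are illustrative."""
    initiation_set: Set[State]              # I: states where the option may start
    policy: Callable[[State], Action]       # pi: intra-option policy
    termination: Callable[[State], float]   # beta: probability of stopping in a state

    def available(self, state: State) -> bool:
        return state in self.initiation_set


def run_option(option, state, step, rng, max_steps=1000):
    """Call-and-return execution: follow the option until beta fires.

    `step(state, action) -> (next_state, reward)` is a hypothetical environment hook.
    Returns the final state, the list of rewards, and the option duration.
    """
    assert option.available(state), "option started outside its initiation set"
    rewards = []
    for _ in range(max_steps):
        state, reward = step(state, option.policy(state))
        rewards.append(reward)
        if rng.random() < option.termination(state):
            break
    return state, rewards, len(rewards)


# Toy usage: a 1-D corridor where the option walks right and stops at cell 5.
walk_right = Option(
    initiation_set=set(range(5)),
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
final_state, rewards, duration = run_option(
    walk_right, 2, lambda s, a: (s + a, -1.0), random.Random(0)
)
```

In this toy corridor the termination function is deterministic, so the option always runs until cell 5; a learned termination function would instead return intermediate probabilities.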
Combined with deep learning techniques, such hierarchical architectures have attracted a lot of attention recently. Kulkarni et al. [6] propose a hierarchical-DQN framework that operates at different temporal scales. By specifying subgoals, a top-level meta-controller learns a policy over intrinsic goals while a lower-level function learns a policy over primitive actions to satisfy the subgoals. Bacon et al. [5] derive a set of policy gradient theorems for options and propose an option-critic architecture capable of learning both the internal policies and the termination functions, in tandem with the meta-policy, without providing additional rewards or subgoals. Tessler et al. [7] present a deep hierarchical approach for lifelong learning and evaluate it in the game Minecraft, reporting impressive performance and a strong ability to reuse knowledge. However, all of these works ignore the fact that a variety of well-developed algorithms are available off the shelf in the robotics field. Learning from scratch is inadvisable for robots since it may cause serious damage to fragile hardware.
III Methodology
In this section, we first introduce the framework of our method, then derive the update rule based on policy gradient methods [8].
III-A Option-Interruption Framework
As introduced in Section II, a complete decision-making process in HRL involves two stages: select an option according to the meta-policy $\pi_\Omega$, then follow the chosen option’s policy until termination. Each option $\omega$ is composed of $(\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$, representing the initiation set, the intra-option policy and the termination function, respectively. In our Option-Interruption framework, the intra-option policy is given by an existing method and is thus deterministic, representing a temporal abstraction of human knowledge. The meta-policy and the termination functions are obtained by training. The intra-option policy can be flexibly replaced by various existing methods according to the needs of the scenario.
Take autonomous exploration as an example, where the robot is expected to explore the whole environment along a path that is as short as possible. Training the agent with primitive actions, e.g. up, down, left and right, is quite inefficient, not to mention the frequent collisions with obstacles, which further increase the training difficulty. By encoding the options with an existing path planning algorithm, the agent can focus on high-level decision making and ignore low-level collision checking, thus making the training process more efficient.
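As a rough illustration of encoding an option with an existing planner, the sketch below derives a deterministic intra-option policy from a breadth-first search on a grid. BFS is used here only as a stand-in for whichever path planner is actually available; the grid encoding (`#` for walls) and the function name `bfs_policy` are assumptions of this sketch.

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def bfs_policy(grid, target):
    """Return {cell: action} moving along a shortest path to `target`.

    `grid` is a list of strings where '#' marks a wall; BFS stands in for
    whatever planner is available and is an assumption of this sketch.
    """
    rows, cols = len(grid), len(grid[0])
    policy = {}
    visited = {target}
    queue = deque([target])
    while queue:
        r, c = queue.popleft()
        for name, (dr, dc) in MOVES.items():
            # The neighbour (pr, pc) reaches (r, c) by taking action `name`.
            pr, pc = r - dr, c - dc
            if not (0 <= pr < rows and 0 <= pc < cols):
                continue
            if grid[pr][pc] == "#" or (pr, pc) in visited:
                continue
            visited.add((pr, pc))
            policy[(pr, pc)] = name
            queue.append((pr, pc))
    return policy


GRID = [
    "#####",
    "#..##",
    "#.#.#",
    "#...#",
    "#####",
]
policy = bfs_policy(GRID, target=(2, 3))
```

Because the planner never proposes a move into a wall, an agent that only chooses among such options does not have to learn collision avoidance itself; only the choice of target and the decision to interrupt remain to be learned.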
III-B Update Rules
In this subsection, we derive how to train our framework using policy gradient methods. Policy gradient methods [8] learn the policy parameters $\theta$ based on the gradient of some performance measure $J(\theta)$ with respect to the policy parameters. Seeking to maximize the performance, their updates approximate gradient ascent in $J$:
$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J(\theta_t)} \tag{1}$$
where $\alpha$ is the step size and $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of $J$ with respect to $\theta_t$.
We start from the undiscounted case where $\gamma = 1$. Let $\pi_{\Omega,\theta}$ denote the meta-policy parameterized by $\theta$, and $\beta_{\omega,\nu}$ the termination function of option $\omega$ parameterized by $\nu$. In our case, the intra-option policies are deterministic and do not need to be learned from scratch. We define the value of the start state $s_0$ of each episode as the performance measure
$$J(\theta) = V_{\pi_\Omega}(s_0) \tag{2}$$
where $V_{\pi_\Omega}$ is the value function under the meta-policy $\pi_\Omega$. To keep the notation simple, we leave it implicit that $\pi_\Omega$ is parameterized by $\theta$ and that the gradients are also taken with respect to $\theta$. The gradient of the state-value function can be rewritten in terms of the action-value function as
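$$\nabla V(s) = \nabla \Big[\sum_{\omega} \pi_\Omega(\omega \mid s)\, Q(s, \omega)\Big] = \sum_{\omega} \Big[\nabla \pi_\Omega(\omega \mid s)\, Q(s, \omega) + \pi_\Omega(\omega \mid s)\, \nabla Q(s, \omega)\Big] \tag{3}$$
Since $Q(s, \omega) = \mathbb{E}\big[R + V(s')\big]$, where $R$ is the reward accumulated while option $\omega$ executes and $s'$ is the state in which it terminates, Eq. (3) can be written as
$$\nabla V(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi_\Omega) \sum_{\omega} \nabla \pi_\Omega(\omega \mid x)\, Q(x, \omega) \tag{4}$$
after repeated unrolling [9], where $\Pr(s \to x, k, \pi_\Omega)$ represents the probability of transitioning from state $s$ to state $x$ in $k$ option steps under policy $\pi_\Omega$. Thus
$$\nabla J(\theta) = \nabla V(s_0) \propto \sum_{s} \mu(s) \sum_{\omega} Q(s, \omega)\, \nabla \pi_\Omega(\omega \mid s) = \mathbb{E}\big[Q(s_t, \omega_t)\, \nabla \ln \pi_\Omega(\omega_t \mid s_t)\big] \tag{5}$$
where $\mu$ is the on-policy state distribution under $\pi_\Omega$, and $s_t$ and $\omega_t$ are the sampled state and option. Note that both $s_t$ and $\omega_t$ lie on the option scale, i.e. they are sampled at the time step when the option is initiated. In this way, the expectation of the sample gradient is proportional to the actual gradient of the performance measure (the value function in our case) with respect to the parameter $\theta$, and thus the meta-policy can be updated via $\theta \leftarrow \theta + \alpha\, Q(s_t, \omega_t)\, \nabla \ln \pi_\Omega(\omega_t \mid s_t)$. More generally, we use a critic $V_w$, parameterized by $w$, to approximate the state-value function in order to reduce the variance and speed up the learning process.
We generalize the previous undiscounted version to the discounted case ($\gamma < 1$). As Thomas showed in [10], the discount factor makes the usual policy gradient estimator biased. However, correcting for this discrepancy also incurs data inefficiency. As discussed in [5], we build our model on the usual policy gradient estimator for simplicity. Suppose the duration of the option $\omega_t$ is $k$, i.e. $\omega_t$ lasts for $k$ time steps; then $R_t = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i+1}$ is the discounted accumulated return from $t$ to $t+k$. The parameters of the meta-policy are updated at the option scale:
$$\theta \leftarrow \theta + \alpha \big(R_t + \gamma^{k} V_w(s_{t+k}) - V_w(s_t)\big)\, \nabla \ln \pi_\Omega(\omega_t \mid s_t) \tag{6}$$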
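The option-scale update can be made concrete with a tabular sketch. It assumes a Boltzmann (softmax) meta-policy over per-state option preferences and a tabular critic, and it uses the k-step advantage form (discounted option return plus bootstrapped critic value minus the critic value at the start state); the function names, learning rates and toy numbers are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def discounted_return(rewards, gamma):
    """Sum_i gamma^i * r_i over the rewards collected while one option ran."""
    return float(sum(gamma ** i * r for i, r in enumerate(rewards)))


def meta_policy_update(theta, v, s, omega, rewards, s_next, gamma, lr_theta, lr_v,
                       terminal=False):
    """One option-scale actor-critic step (tabular sketch).

    theta[s] holds Boltzmann preferences over options, v[s] the critic value.
    """
    k = len(rewards)                          # option duration in primitive steps
    ret = discounted_return(rewards, gamma)
    bootstrap = 0.0 if terminal else gamma ** k * v[s_next]
    delta = ret + bootstrap - v[s]            # option-level advantage estimate
    grad_log = -softmax(theta[s])             # d log pi(omega|s) / d theta[s]
    grad_log[omega] += 1.0
    theta[s] += lr_theta * delta * grad_log   # actor: ascend the sampled gradient
    v[s] += lr_v * delta                      # critic: move towards the k-step target
    return delta


theta = np.zeros((3, 2))   # 3 states, 2 options, zero-initialised weights
v = np.zeros(3)
delta = meta_policy_update(theta, v, s=0, omega=1, rewards=[-1.0, -1.0, 10.0],
                           s_next=2, gamma=0.9, lr_theta=0.5, lr_v=0.1)
```

Note that the update is applied once per option, at the moment the option ends, so a long option contributes a single gradient step regardless of how many primitive steps it lasted.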
Interrupting options before they would finish naturally according to their termination conditions gives the agent the flexibility to switch options when necessary. As derived in [5], the parameters $\nu$ of the termination function $\beta_{\omega,\nu}$ can be updated at each action time step as follows:
$$\nu \leftarrow \nu - \alpha_\nu\, \nabla_\nu \beta_{\omega,\nu}(s_{t+1}) \big(Q(s_{t+1}, \omega) - V(s_{t+1})\big) \tag{7}$$
The whole Option-Interruption algorithm with policy gradient updates is presented in Algorithm 1. Notice that the meta-policy, the approximated state-value function and the option termination function are updated at different temporal scales.
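A minimal sketch of the per-step termination update, following the option-critic form of [5] with a sigmoid termination function over tabular logits. The arrays `nu`, `q` and `v`, their shapes, and the learning rate are assumptions for illustration; in practice the option values would come from a learned critic.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def termination_update(nu, omega, s_next, q, v, lr):
    """Per-step update of the termination logit nu[omega, s_next], after [5].

    beta = sigmoid(nu); q[s, omega] and v[s] are critic estimates.
    """
    beta = sigmoid(nu[omega, s_next])
    advantage = q[s_next, omega] - v[s_next]
    # d beta / d nu = beta * (1 - beta) for a sigmoid
    nu[omega, s_next] -= lr * beta * (1.0 - beta) * advantage
    return sigmoid(nu[omega, s_next])


nu = np.zeros((2, 3))                    # 2 options, 3 states
q = np.array([[0.0, 0.0], [1.0, 3.0], [0.0, 0.0]])
v = np.array([0.0, 2.0, 0.0])
beta_bad = termination_update(nu, omega=0, s_next=1, q=q, v=v, lr=1.0)
beta_good = termination_update(nu, omega=1, s_next=1, q=q, v=v, lr=1.0)
```

The sign of the advantage drives the behavior: an option whose value in the new state is below the state value becomes more likely to be interrupted there, while an option that is better than average becomes more likely to continue.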
IV Experiments
IV-A Four-room Navigation
To verify the effectiveness of our algorithm, we first consider the navigation problem in a four-room grid world environment, as in [3]. The four-room environment is depicted in Fig. 2(a), where the cells of the grid correspond to the states of the environment. There are four hallways connecting adjacent rooms, and our goal is the east hallway, marked in red in the figure. At the beginning of each episode, the agent is placed at a random location. From any state, it can perform one of four primitive actions: up, down, left or right. In addition, primitive movements can fail with a fixed probability, in which case the agent moves at random to one of the empty adjacent cells. Walls, represented by black cells, are not accessible. If the agent hits a wall, it remains in the same cell.
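The primitive dynamics described above can be sketched as follows. The failure probability is left as the free parameter `fail_prob`, the tiny grid is only a stand-in for the four-room layout, and the sketch assumes the map is fully enclosed by wall cells (`#`) so that no bounds check is needed for a regular move.

```python
import random

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def free_neighbours(grid, cell):
    """Adjacent cells that are inside the map and not walls."""
    r, c = cell
    cells = [(r + dr, c + dc) for dr, dc in MOVES.values()]
    return [(nr, nc) for nr, nc in cells
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#"]


def step(grid, cell, action, fail_prob, rng):
    """One primitive move; `fail_prob` is a free parameter of this sketch."""
    if rng.random() < fail_prob:
        # Failed move: jump to a random empty adjacent cell instead.
        options = free_neighbours(grid, cell)
        return rng.choice(options) if options else cell
    dr, dc = MOVES[action]
    nr, nc = cell[0] + dr, cell[1] + dc
    if grid[nr][nc] == "#":
        return cell            # bumping into a wall leaves the agent in place
    return (nr, nc)


GRID = [
    "####",
    "#..#",
    "#.##",
    "####",
]
```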


In our environment, there are four options corresponding to the four hallways. In any cell, only the two options that lead the agent to the hallways of the current room are available. Fig. 2(b) gives an example, showing the initiation set of the east hallway option along with its policy, which follows the shortest path within the room to its target hallway. The reward takes the same value on all state transitions except the transition into the goal.
AC  362.22  11.28  65  226 
OC  357.69  15.45  119  622 
Our method  62.59  9.97  3  21 
No Interruption  22.56  10.85    4 
We compare our method with a four-option Option-Critic (OC) architecture [5], which learns both the policy over options and the option policies from scratch. We also implement an Actor-Critic (AC) method at the primitive action level. In all methods, the policy is parameterized with a Boltzmann distribution and the termination function is parameterized with sigmoid functions. A single discount factor is used, and all value functions are updated by one-step learning. All the weights are initialized to zero.
As shown in Fig. 3, during training the average episode length of our method converges much faster than that of OC and AC, which learn primitive actions from scratch, indicating that incorporating prior knowledge can indeed speed up the learning process. Another observation is that at the early stage of training, the average episode length of our method is remarkably lower, as displayed in Table I, which suggests that options involving temporal abstraction are the key to reducing the search space. This advantage is particularly important for real robots, since it constrains the robot’s behavior and accordingly protects it from unexpected damage when deployed in the real world.
Note that the effect of the termination function is not fully validated in this typical setting, since the optimal policy does not actually require any termination. Hence we design a variant of the four-room navigation problem, in which one of the three hallways other than the goal is randomly blocked for a random number of time steps drawn from a uniform distribution. We compare our method with the non-interruption version, in which an option terminates only upon reaching its target hallway; the results during learning are shown in Fig. 4. After an initial phase of training, the episode length of the version without the termination function is higher than that of the version with it, and the average option duration is much shorter when the termination function is used, which shows the effect of the interruption mechanism. The interruption mechanism gives the agent the flexibility to switch policies in time; in our case, the agent interrupts the ongoing option if the target hallway is blocked.


IV-B Autonomous Exploration in Indoor Environments
Autonomously exploring an unknown environment is an essential task for mobile robots, in which the agent is expected to find a safe path that covers the whole map while keeping the path cost as low as possible [11]. In this subsection, we perform the exploration task in an indoor environment and compare our Option-Interruption architecture with typical DRL.
For learning-based exploration methods, the agent is expected to derive a strategy from its experience, since the indoor layouts of houses are well structured and contain rich spatial information. In each episode, the mobile range-sensing robot starts from a random location and explores the environment in a discrete manner. At every time step it moves a fixed distance, and the episode ends when the whole area is covered. The key challenges for this task are: 1) the agent must learn to avoid obstacles; 2) the agent is expected to learn to move towards unknown areas.
In our experiment, the state is a fixed-size image patch centered at the current location of the robot, as shown in Fig. 5. The robot is drawn at the center of the patch to indicate its orientation. The goals of the robot exploration task are to 1) cover the house with a short path length; 2) guarantee obstacle avoidance. In terms of the reward function, ideally three signals are enough to reflect these goals: 1) a time penalty at each step to urge the agent to finish the task; 2) a success reward when completing the task; 3) a collision penalty when hitting walls. However, a reward function defined in this way is too difficult for the agent to learn from, due to the sparsity of the positive signals. Hence we define it in a more informative way: the newly explored area is taken into consideration to encourage the agent to collect more information at each step. Our reward function at time step $t$ is defined as
$$r_t = \lambda A_t + r_{\mathrm{time}} + \mathbb{1}[\text{collision}]\, r_{\mathrm{col}} + \mathbb{1}[\text{complete}]\, r_{\mathrm{done}} \tag{8}$$
where $A_t$ describes the newly covered area (measured in number of pixels) and $\lambda$ is a constant scaling coefficient. $r_{\mathrm{time}}$ and $r_{\mathrm{col}}$ represent the time penalty and the collision penalty, and $r_{\mathrm{done}}$ is the reward given if the agent completes the exploration task.
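The reward terms described above can be sketched as a small function. Combining the terms additively is an assumption of this sketch, and every numeric default below is a placeholder chosen for illustration, not a value reported in the paper.

```python
def exploration_reward(new_pixels, collided, finished,
                       scale=0.01, time_penalty=-0.1,
                       collision_penalty=-1.0, finish_reward=10.0):
    """Per-step exploration reward; every default value here is a placeholder."""
    reward = scale * new_pixels + time_penalty   # information gain minus time cost
    if collided:
        reward += collision_penalty              # only when a wall is hit
    if finished:
        reward += finish_reward                  # only when the whole map is covered
    return reward


# A step that uncovers 100 new pixels without incident:
r_explore = exploration_reward(100, collided=False, finished=False)
# A step that uncovers nothing and hits a wall:
r_crash = exploration_reward(0, collided=True, finished=False)
```

The area term gives a dense signal at almost every step, which is what makes this shaping easier to learn from than the sparse three-signal version.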
The actions of typical RL are the four directions up, down, left and right, with a fixed step length. In our hierarchical structure, each option is a one-step action, the same as in typical RL, but equipped with obstacle avoidance: an option can only be initiated when the adjacent cell it moves to is free. In this way, the agent does not need to learn to avoid walls.
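One way to realise such initiation sets is to mask unavailable options before sampling from the meta-policy. The sketch below assumes an occupancy grid with 1 for obstacles and 0 for free cells and a softmax meta-policy over four one-step options; both function names are hypothetical, and `masked_softmax` assumes at least one option is available.

```python
import numpy as np

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def available_options(occupancy, cell):
    """Boolean mask: option i is available iff its target cell is free.

    `occupancy` is a 2-D array with 1 for obstacles and 0 for free cells.
    """
    rows, cols = occupancy.shape
    mask = np.zeros(len(MOVES), dtype=bool)
    for i, (dr, dc) in enumerate(MOVES):
        r, c = cell[0] + dr, cell[1] + dc
        mask[i] = 0 <= r < rows and 0 <= c < cols and occupancy[r, c] == 0
    return mask


def masked_softmax(logits, mask):
    """Softmax restricted to available options; unavailable ones get probability 0."""
    shifted = np.where(mask, logits - np.max(logits[mask]), -np.inf)
    weights = np.exp(shifted)
    return weights / weights.sum()


occupancy = np.array([[1, 1, 1],
                      [1, 0, 0],
                      [1, 0, 1]])
mask = available_options(occupancy, (1, 1))
probs = masked_softmax(np.zeros(4), mask)
```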
To enable the robot to explore environments autonomously and efficiently, we build an Asynchronous Advantage Actor-Critic (A3C) network [12] with multiple parallel workers. The network consists of three convolution layers, each specified by its kernel size, number of outputs and stride, followed by a fully connected layer and the output heads: a policy head and a value head for RL, or a meta-policy head, a value head and a termination function head for HRL.
With an emphasis on the influence on the training process, we trained our agent on one map without testing its generalization ability. The result is shown in Fig. 6.
As we can see, our method converges much faster than Actor-Critic (AC), and the two learning curves differ markedly during training. This indicates that providing human knowledge can significantly speed up the training process. Our result might be further improved by incorporating other existing methods.
V Conclusion
In summary, this paper proposes an Option-Interruption architecture that embeds existing methods into a hierarchical reinforcement learning structure. By building on existing methods, the search space is considerably reduced and hence the training process is significantly sped up. In addition, the interruption mechanism provides flexibility with respect to changes in the external world, which existing methods lack. The experiments show the efficiency of our architecture given proper human knowledge.
Some future work remains. For example, the training process of the termination function can be further investigated. As displayed in Fig. 4(a), although the final performance of the no-interruption version is worse, it performs well at the initial training stage, which suggests that disabling the termination function at first may be an option for protecting a fragile robot during real-world training. Another direction is to apply our method to real robots, which we leave for future work.
References
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
[3] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[4] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
[5] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in AAAI, 2017, pp. 1726–1734.
[6] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.
[7] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in Minecraft,” in AAAI, vol. 3, 2017, p. 6.
[8] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[9] R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] P. Thomas, “Bias in natural actor-critic algorithms,” in International Conference on Machine Learning, 2014, pp. 441–448.
[11] D. Zhu, T. Li, D. Ho, and Q. H. Meng, “Deep reinforcement learning supervised autonomous exploration in office environments,” in IEEE International Conference on Robotics and Automation, 2018.
[12] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.