Learning to Interrupt: A Hierarchical Deep Reinforcement Learning Framework for Efficient Exploration

07/30/2018 ∙ by Tingguang Li, et al. ∙ The Chinese University of Hong Kong

To achieve scenario intelligence, humans must transfer knowledge to robots by developing goal-oriented algorithms, which are sometimes insensitive to dynamically changing environments. While deep reinforcement learning has achieved significant success recently, it remains extremely difficult to deploy directly on real robots. In this paper, we propose a hybrid structure named Option-Interruption in which human knowledge is embedded into a hierarchical reinforcement learning framework. Our architecture has two key components: options, represented by existing human-designed methods, which significantly speed up the training process; and an interruption mechanism, based on learnable termination functions, which enables our system to respond quickly to the external environment. To implement this architecture, we derive a set of update rules based on policy gradient methods and present a complete training process. In the experiments, our method is evaluated on a Four-room navigation task and an exploration task, which demonstrates the efficiency and flexibility of our framework.


I Introduction

At the current stage, it is considered more practical for robots to work in a specific environment or solve a specific problem. Such a concept is often referred to as scenario intelligence, in contrast to general artificial intelligence. To achieve such scenario intelligence, humans must transfer knowledge to robots by developing goal-oriented algorithms. Though feasible, these human-designed methods are sometimes insensitive to dynamically changing environments.

Recently, Deep Reinforcement Learning (DRL) has achieved significant success in various games [1][2] and shows a promising future. However, directly applying such methods to real robots is quite difficult. First of all, the majority of these works are trained via thousands of trial-and-error episodes, while a robot is too fragile to sustain such a process. Besides, considering that plenty of off-the-shelf algorithms already exist in the robotics area, it is sometimes inefficient to make the robot learn from scratch. In this paper, the relationship between existing human-designed methods and DRL is re-examined. The main concern here is whether we can develop an algorithm that inherits the efficiency and flexibility of learning-based methods while keeping the training process controllable on the basis of existing methods.

Fig. 1: Schematic diagram of Hierarchical Reinforcement Learning (HRL) with three options, where option 1 is selected.

Human decision making routinely involves choices over a broad range of time scales. By contrast, Markov Decision Process (MDP) based Reinforcement Learning (RL) does not have such foresight and chooses an action based only on the current observation. Facing such limitations, Sutton et al. [3] proposed options to represent courses of action that take place at different time scales. Different from typical RL, this Hierarchical Reinforcement Learning (HRL) architecture is based on the Semi-Markov Decision Process (SMDP) and is intended to model temporally extended action sequences.

A schematic diagram of HRL is depicted in Fig. 1. Instead of picking an action directly, the agent first selects an option according to the meta-policy π_Ω, then follows that option's intra-option policy π_ω until it is ended by its termination function β_ω. Based on this HRL structure, the Option-Interruption framework is proposed in this paper. Inspired by the fact that the intra-option policies are independent of the meta-policy, we encode the options with existing human-designed methods, augmented with learnable termination functions. Such a combination brings several benefits: 1) by imparting human knowledge, the training process can be significantly sped up; 2) when combined with well-developed algorithms, the robot's behavior can be constrained; for instance, equipped with an obstacle avoidance algorithm, the agent will not hit the wall while learning to navigate; 3) the interruption mechanism enables the system to constantly monitor and respond to the external environment; 4) our architecture is flexible and can be embedded in various scenarios by replacing the options.

In this paper, we propose the Option-Interruption architecture, which combines traditional methods with hierarchical reinforcement learning to achieve scenario intelligence. The update rule of the meta-policy is derived and a complete training process is given. The experimental results verify the effectiveness of our method.

II Preliminaries

A finite discounted Markov Decision Process can be denoted as a tuple (S, A, P, R, γ), where S and A represent the set of states and the set of actions, respectively. The state transition probability function P(s'|s, a) is a conditional distribution over next states s' given that action a is taken in the current state s. R(s, a) is a reward function and γ ∈ [0, 1) is a discount factor. The policy π(a|s) is a probability distribution over actions conditioned on states. The objective of the agent is to learn a policy that maximizes the state-value function, i.e. the expected discounted future reward

V_π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ],

and similarly its action-value function is defined as Q_π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ].
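For reference, the two value functions are related by the standard Bellman identities (a textbook fact, restated here because it is precisely this decomposition that the option framework below generalizes):

V_\pi(s) = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a),
\qquad
Q_\pi(s, a) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma\, V_\pi(s_{t+1}) \mid s_t = s,\ a_t = a \right].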

The MDP formulation is conventionally conceived as not involving temporally extended actions [3] and hence cannot take advantage of high-level temporal abstraction. To address this deficiency, the Semi-Markov Decision Process (SMDP) was proposed [3][4] for continuous-time discrete-event systems. Actions in an SMDP last multiple time steps and are intended to model temporally extended courses of action. The transition probability therefore generalizes to a joint distribution P(s', τ | s, a): the transition from state s to state s' occurs after a positive waiting time τ when action a is executed.

The idea of temporally extended actions and the hierarchical structure is formalized as options in [3]. A Markovian option ω ∈ Ω can be represented by a triple (I_ω, π_ω, β_ω), in which I_ω ⊆ S is an initiation set, π_ω is an intra-option policy, and β_ω : S → [0, 1] is a termination function. The option ω is available in state s if and only if s ∈ I_ω. We consider the call-and-return option execution model [5], in which an agent picks an option ω according to its meta-policy π_Ω, then follows the intra-option policy π_ω until the option terminates stochastically according to β_ω or finishes. Thus the value function can be reformulated as V_Ω(s) = Σ_ω π_Ω(ω|s) Q_Ω(s, ω), where Q_Ω(s, ω) is the option-value function.
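To make the call-and-return execution concrete, the following minimal Python sketch (our own illustration; the environment interface and attribute names are assumptions, not code from the paper) runs one episode under this model: the meta-policy picks an available option, the intra-option policy acts until the termination function fires, and control then returns to the meta-policy.

import random

def run_episode(env, meta_policy, options, max_steps=1000):
    """Call-and-return option execution (illustrative sketch).

    meta_policy(state, available_ids) -> option index
    options[i].initiation(state)      -> bool, membership of the initiation set I_w
    options[i].policy(state)          -> primitive action from pi_w
    options[i].termination(state)     -> probability beta_w(state) of terminating
    """
    state, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        available = [i for i, o in enumerate(options) if o.initiation(state)]
        omega = meta_policy(state, available)           # meta-policy picks an option
        while not done and t < max_steps:
            action = options[omega].policy(state)       # intra-option policy acts
            state, reward, done = env.step(action)
            t += 1
            if random.random() < options[omega].termination(state):
                break                                   # option terminates; control returns
    return t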

Combined with deep learning techniques, such hierarchical architectures have attracted a lot of attention recently. Kulkarni et al. [6] propose a hierarchical-DQN framework that operates at different temporal scales: by specifying subgoals, a top-level meta-controller learns a policy over intrinsic goals while a lower-level controller learns a policy over primitive actions to satisfy the subgoals. Bacon et al. [5] derive a set of policy gradient theorems for options and propose an option-critic architecture capable of learning both the intra-option policies and the termination functions, in tandem with the meta-policy, without providing additional rewards or subgoals. Tessler et al. [7] present a deep hierarchical approach for lifelong learning, evaluate it in the Minecraft game with impressive performance, and show a strong ability to reuse knowledge. However, all of these works ignore the fact that a variety of well-developed algorithms already exist off the shelf in the robotics field. Learning from scratch is inadvisable for robots since it may cause serious damage to fragile hardware.

III Methodology

In this section, we first introduce the framework of our method, then derive the update rule based on policy gradient methods [8].

III-A Option-Interruption Framework

As introduced in Section II, a complete decision-making step in HRL involves two stages: select an option according to the meta-policy π_Ω, then follow the chosen option's policy until termination. Each option ω is composed of (I_ω, π_ω, β_ω), representing the initiation set, the intra-option policy and the termination function, respectively. In our Option-Interruption framework, the intra-option policy is embedded with an existing method and is thus deterministic, representing a temporal abstraction of human knowledge. The meta-policy and the termination functions are obtained by training. The intra-option policies can be flexibly replaced by various existing methods according to the needs of the scenario.

Take autonomous exploration as an example, where the robot is expected to explore the whole environment with as short a path as possible. Training the agent with primitive actions, e.g. up, down, left and right, is quite inefficient, not to mention the frequent collisions with obstacles, which further increase the training difficulty. By substituting options with path planning algorithms such as A*, the agent can focus on high-level decision making and ignore low-level collision checking, thus making the training process more efficient.
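As an illustration of how an off-the-shelf method becomes an option in this framework, the sketch below (our own code; the injected planner is a hypothetical stand-in for any path planner such as A*) pairs a deterministic, human-designed intra-option policy with a learnable sigmoid termination function.

import numpy as np

class PlannerOption:
    """An Option-Interruption option: hand-designed policy + learnable termination (sketch)."""

    def __init__(self, goal, planner, feature_dim):
        self.goal = goal
        self.planner = planner            # any off-the-shelf planner returning the next action
        self.phi = np.zeros(feature_dim)  # termination parameters, learned during training

    def initiation(self, state, reachable):
        # The option is available wherever the planner can produce a path to its goal.
        return reachable

    def policy(self, grid, position):
        # Deterministic, human-designed intra-option policy: one step of a planned path.
        return self.planner(grid, position, self.goal)

    def termination_prob(self, features):
        # beta_w(s) = sigmoid(phi . s): probability of interrupting the option in state s.
        return 1.0 / (1.0 + np.exp(-self.phi @ features))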

III-B Update Rules

In this subsection, we derive how to train our framework utilizing policy gradient methods. Policy gradient methods [8] learn the policy parameters θ based on the gradient of some performance measure J(θ) with respect to those parameters. Seeking to maximize the performance, their updates approximate gradient ascent on J:

θ_{t+1} = θ_t + α g_t,     (1)

where g_t is a stochastic estimate whose expectation approximates the gradient of J with respect to θ_t.

We start from the undiscounted case where γ = 1. Let π_Ω(ω|s; θ) denote the meta-policy parameterized by θ, and β_ω(s; φ) the termination function of option ω parameterized by φ. In our case, the intra-option policies are deterministic and do not need to be learned from scratch. We define the value of the start state of each episode as the performance measure

J(θ) ≐ V_{π_Ω}(s_0),     (2)

where V_{π_Ω} is the value function under the meta-policy π_Ω. To keep the notation simple, we leave it implicit that π_Ω is parameterized by θ and that the gradients are taken with respect to θ. The gradient of the state-value function can be rewritten in terms of the action-value function as

∇V_{π_Ω}(s) = Σ_ω [ ∇π_Ω(ω|s) Q_{π_Ω}(s, ω) + π_Ω(ω|s) ∇Q_{π_Ω}(s, ω) ],     (3)

Since Q_{π_Ω}(s, ω) = Σ_{s'} p(s'|s, ω) [ r(s, ω) + V_{π_Ω}(s') ] in the undiscounted case, where p(s'|s, ω) is the probability of reaching state s' after executing option ω in s, Eq. (3) can be written as

∇V_{π_Ω}(s) = Σ_x Σ_{k=0}^{∞} Pr(s → x, k, π_Ω) Σ_ω ∇π_Ω(ω|x) Q_{π_Ω}(x, ω)     (4)

after repeated unrolling [9], where Pr(s → x, k, π_Ω) represents the probability of transitioning from state s to state x in k option steps under policy π_Ω. Thus

∇J(θ) ∝ Σ_s μ(s) Σ_ω Q_{π_Ω}(s, ω) ∇π_Ω(ω|s) = E_{π_Ω}[ Q_{π_Ω}(S_t, O_t) ∇ ln π_Ω(O_t | S_t) ],     (5)

where μ(s) is the on-policy state distribution under π_Ω, and S_t and O_t are the sampled state and option. Note that both S_t and O_t lie on the option scale, i.e. they are sampled at the time steps when an option is initiated. In this way, the expectation of the sample gradient is proportional to the actual gradient of the performance measure (which is the value function in our case) with respect to the parameters, and thus the meta-policy can be updated following Eq. (1). More generally, we use a critic V(s; w), parameterized by w, to approximate the state-value function, which reduces the variance and speeds up the learning process. We then generalize the previous undiscounted version to the discounted case (γ < 1). As Thomas showed in [10], the discount factor makes the usual policy gradient estimator biased; however, correcting for this discrepancy also incurs data inefficiency. As discussed in [5], we build our model on the biased policy gradient estimator for simplicity. Suppose the duration of the option ω_t chosen at time t is τ, i.e. ω_t lasts for τ time steps; then G_t = Σ_{k=0}^{τ−1} γ^k r_{t+k+1} + γ^τ V(s_{t+τ}; w) is the discounted accumulated return from t to t + τ. The parameters of the meta-policy are updated at the option scale:

θ ← θ + α_θ ( G_t − V(s_t; w) ) ∇_θ ln π_Ω(ω_t | s_t).     (6)
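As a concrete reading of Eq. (6), the sketch below (illustrative Python under assumed variable names; not the paper's code) accumulates the discounted return over one option's duration and applies the advantage-weighted update to the meta-policy parameters.

def meta_policy_update(theta, grad_log_pi, rewards, v_next, v_start, gamma, lr):
    """Option-scale meta-policy update, cf. Eq. (6) (illustrative sketch).

    grad_log_pi : gradient of ln pi_Omega(omega_t | s_t) with respect to theta
    rewards     : rewards r_{t+1}, ..., r_{t+tau} collected while the option ran
    v_next      : critic estimate V(s_{t+tau}; w) at the state where the option ended
    v_start     : critic estimate V(s_t; w) at the state where the option was chosen
    """
    # Discounted accumulated return over the option's tau steps, bootstrapped by the critic.
    G = sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** len(rewards) * v_next
    # Gradient ascent weighted by the advantage G - V(s_t; w).
    return theta + lr * (G - v_start) * grad_log_pi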
Initialize global step counter T ← 0;
Initialize meta-policy weights θ, termination weights φ, critic weights w;
repeat
       Initialize episode step counter t ← 0;
       Get state s_t;
       repeat
             Reset gradients dθ ← 0, dφ ← 0, dw ← 0;
             t_start ← t;
             Choose option ω according to π_Ω(·|s_t; θ);
             repeat
                   Choose action a_t according to π_ω(s_t);
                   Receive reward r_t and new state s_{t+1};
                   t ← t + 1, T ← T + 1;
             until ω terminates according to β_ω(s_t; φ) or s_t is terminal;
             G ← 0 if s_t is terminal, else V(s_t; w);
             for i ∈ {t − 1, …, t_start} do
                   G ← r_i + γG;
                   dw ← dw + ∂(G − V(s_i; w))² / ∂w;
                   dφ ← dφ + ∇_φ β_ω(s_{i+1}; φ) (G − V(s_{i+1}; w));   ▷ Eq. (7), with G estimating Q_Ω(s_{i+1}, ω)
             end for
             dθ ← (G − V(s_{t_start}; w)) ∇_θ ln π_Ω(ω | s_{t_start}; θ);   ▷ Eq. (6)
             Update θ, φ, w using dθ, dφ, dw;
       until s_t is terminal or t ≥ t_max;
until T ≥ T_max;
Algorithm 1: Option-Interruption algorithm with policy gradient update

Interrupting options before they would finish naturally according to their termination conditions endows the agent with the flexibility to switch options when necessary. As derived in [5], the parameters φ of the termination function can be updated at each action time step as follows:

φ ← φ − α_φ ∇_φ β_{ω_t}(s_{t+1}) ( Q_Ω(s_{t+1}, ω_t) − V_Ω(s_{t+1}) ).     (7)
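The per-step termination update of Eq. (7) can be read as follows (illustrative sketch assuming a sigmoid parameterization, as used in our experiments): when continuing the current option is worth less than terminating and re-deciding, the termination probability at the next state is pushed up.

import numpy as np

def termination_update(phi, next_features, q_omega, v_omega, lr):
    """Per-action-step termination update, cf. Eq. (7) (illustrative sketch).

    q_omega : Q_Omega(s_{t+1}, omega_t), value of continuing the current option
    v_omega : V_Omega(s_{t+1}),          value of terminating and re-selecting an option
    """
    beta = 1.0 / (1.0 + np.exp(-phi @ next_features))   # sigmoid termination beta_w(s_{t+1})
    grad_beta = beta * (1.0 - beta) * next_features     # gradient of beta with respect to phi
    # A negative advantage (q_omega - v_omega < 0) increases beta, i.e. interrupt sooner.
    return phi - lr * grad_beta * (q_omega - v_omega)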

The whole Option-Interruption algorithm with policy gradient updates is presented in Algorithm 1. Notice that the meta-policy π_Ω, the approximated state-value function V and the option termination functions β_ω are updated at different temporal scales.

IV Experiments

IV-A Four-room Navigation

To verify the effectiveness of our algorithm, we first consider the navigation problem in the four-room grid world introduced in [3]. The environment is depicted in Fig. 2(a), where the cells of the grid correspond to the states of the environment. Four hallways connect the adjacent rooms, and the goal is the east hallway, marked in red in the figure. At the beginning of each episode, the agent is placed at a random location. From any state, it can perform one of four primitive actions: up, down, left or right. In addition, a primitive movement can fail with a certain probability, in which case the agent randomly transits to one of the empty adjacent cells. Walls, represented by black cells, are not accessible; if the agent hits a wall, it remains in the same cell.

Fig. 2: (a) Four-room grid world environment. The goal is marked in red. (b) The initiation set and the intra-option policy of the east hallway option.

In our environment, there are four options corresponding to the four hallways. At any cell, only the two options leading to the hallways of the current room are available. Fig. 2(b) gives an example, showing the initiation set of the east hallway option along with its intra-option policy, which follows the shortest path within the room to the target hallway. Besides, the reward is the same on all state transitions except the one into the goal.
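To make the availability rule explicit, the sketch below (our own illustration; the room and hallway labels are assumptions about the layout) exposes exactly two hallway options per room, i.e. the initiation set of a hallway option is the pair of rooms adjacent to that hallway.

# Illustrative sketch of four-room option availability; room/hallway labels are assumed.
HALLWAY_OPTIONS = ["north", "south", "east", "west"]

# Each room borders exactly two hallways, so exactly two options are available in it.
ROOM_HALLWAYS = {
    "room_nw": {"north", "west"},
    "room_ne": {"north", "east"},
    "room_sw": {"south", "west"},
    "room_se": {"south", "east"},
}

def available_options(room_id):
    """Indices of the hallway options whose initiation set contains the current room."""
    return [i for i, h in enumerate(HALLWAY_OPTIONS) if h in ROOM_HALLWAYS[room_id]]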

Method            Episode length (early)   Episode length (late)   Episodes to threshold 1   Episodes to threshold 2
AC                362.22                   11.28                   65                        226
OC                357.69                   15.45                   119                       622
Our method        62.59                    9.97                    3                         21
No Interruption   22.56                    10.85                   -                         4

TABLE I: Comparison results on Four-room Navigation. The first two columns report the episode length after a given number of training episodes; the last two columns report the number of training episodes needed before the episode length drops below two thresholds.

We compare our method with the four-option Option-Critic (OC) architecture [5], which learns both the policy over options and the intra-option policies from scratch. We also implement the Actor-Critic (AC) method at the primitive action level. In all methods, the meta-policy π_Ω is parameterized with a Boltzmann distribution and the termination functions β_ω are parameterized with sigmoid functions. All methods use the same discount factor γ, all value functions are updated by one-step learning, and all weights are initialized to zero.
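A minimal sketch of this tabular parameterization (our own illustrative code; the temperature value is an assumption): the meta-policy is a Boltzmann (softmax) distribution over the options available in the current state, and each termination function is a sigmoid of its own per-state weight.

import numpy as np

def boltzmann_meta_policy(theta, state_idx, available, temperature=1.0):
    """Sample an option from pi_Omega(. | s) parameterized as a Boltzmann distribution.

    theta : preference table of shape (num_states, num_options), initialized to zero.
    """
    prefs = theta[state_idx, available] / temperature
    probs = np.exp(prefs - prefs.max())       # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(available, p=probs))

def sigmoid_termination(phi, state_idx):
    """beta_w(s) parameterized as a sigmoid of a per-state weight (tabular case)."""
    return 1.0 / (1.0 + np.exp(-phi[state_idx]))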

Fig. 3: Average episode length. All curves are averaged over repeated runs.

As shown in Fig. 3, during the training process the average episode length of our method converges much faster than that of OC and AC, which learn primitive actions from scratch, indicating that the injection of prior knowledge indeed helps to speed up the learning process. Another observation is that at the early stage of training, the average episode length of our method is remarkably lower, as displayed in TABLE I, which suggests that options with temporal abstraction are the key to reducing the search space. Such an advantage is particularly important for real robots, since it constrains the robot's behavior and accordingly protects it from unexpected damage when deployed in the real world.

It is noted that the effect of the termination function is not fully validated in this standard setting, since the optimal policy does not actually contain any termination. Hence we design a variant of the Four-room navigation problem in which one of the three hallways other than the goal is randomly blocked for a number of time steps drawn uniformly at random. We compare our method with a no-interruption version, i.e. one in which an option terminates only when the target hallway is reached; the results during learning are shown in Fig. 4. After the initial episodes, the episode length of the version without the termination function is higher than that of the version with it, and the average option duration is much shorter when the termination function is used, from which the effect of the interruption mechanism is readily apparent. The interruption mechanism endows the agent with the flexibility to switch policies in a timely manner; in our case, the agent interrupts the ongoing option if the target hallway is blocked.

Fig. 4: Experiment results in the dynamically blocked environment. (a) Average episode length; our method with interruption achieves a lower length. (b) Average option duration. All curves are averaged over repeated runs.

IV-B Autonomous Exploration in Indoor Environments

Autonomously exploring an unknown environment is an essential task for mobile robots: the agent is expected to find a safe path that covers the whole map while keeping the path cost as low as possible [11]. In this subsection, we perform the exploration task in an indoor environment and compare our Option-Interruption architecture with typical DRL.

With learning-based exploration methods, the agent is expected to distill a strategy from its experience, since the indoor layouts of houses are well structured and contain rich spatial information. At each episode, the mobile range-sensing robot starts from a random location and explores the environment in a discrete manner: at every time step it moves a fixed distance, and the episode ends when the whole area is covered. The key challenges of this task are: 1) the agent must learn to avoid obstacles; 2) the agent is expected to learn to move towards unknown areas.

Fig. 5: The input of the network. White pixels denote free space, gray pixels unknown areas, and black pixels obstacles. The input is a square image patch centered at the current location of the robot.

In our experiment, the state s_t is a square image patch centered at the current location of the robot, as shown in Fig. 5. The robot is drawn at the center of the patch to indicate its orientation. The goals of the robot exploration task are to 1) cover the house with a short path and 2) guarantee obstacle avoidance. In terms of the reward function, ideally three signals are enough to reflect these goals: 1) a time penalty at each step to urge the agent to finish the task; 2) a success reward when completing the task; 3) a collision penalty when hitting walls. However, a reward function defined in this way is too difficult for the agent to learn from, due to the sparsity of the positive signals. Hence we define the reward in a more informative way: the newly explored area is taken into consideration to encourage the agent to collect more information at each step. Our reward function at time step t is defined as

r_t = λ · ΔA_t + r_time + r_collision + r_success,     (8)

where ΔA_t describes the newly covered area (measured by the number of pixels) and λ is a constant scaling coefficient. r_time and r_collision represent the time penalty and the collision penalty (the latter applied only when the robot hits a wall), and r_success is the reward given when the agent completes the exploration task.
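A sketch of how such a reward can be computed from two consecutive occupancy maps (illustrative only; the coefficient values, the unknown-pixel code, and the function name are assumptions rather than the paper's settings):

import numpy as np

# Assumed coefficients for illustration only (not the paper's values).
LAMBDA, R_TIME, R_COLLISION, R_SUCCESS = 0.01, -0.1, -1.0, 10.0

def exploration_reward(prev_map, new_map, collided, finished, unknown=127):
    """Reward of Eq. (8): scaled newly explored area plus time/collision/success terms.

    prev_map, new_map : occupancy grids where the value `unknown` marks unexplored pixels.
    """
    newly_explored = int(np.sum((prev_map == unknown) & (new_map != unknown)))  # Delta A_t in pixels
    reward = LAMBDA * newly_explored + R_TIME
    if collided:
        reward += R_COLLISION
    if finished:
        reward += R_SUCCESS
    return reward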

The actions of typical RL are the four directions up, down, left and right, each with a fixed step length. In our hierarchical structure, each option is likewise specified as a one-step action, but equipped with obstacle avoidance: for each option, the initiation set only contains states whose corresponding adjacent cell is free. In this way, the agent does not need to learn to avoid walls.
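The obstacle-avoidance constraint can thus be enforced entirely through the initiation sets, as in the short sketch below (illustrative; the grid encoding is an assumption): a move option is offered to the meta-policy only if the adjacent cell in that direction is free.

import numpy as np

# Assumed grid encoding for illustration: 0 = free, nonzero = obstacle or unknown.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def available_move_options(grid, row, col):
    """One-step move options whose initiation set contains the current cell."""
    options = []
    for name, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1] and grid[r, c] == 0:
            options.append(name)
    return options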

To enable the robot to explore environments autonomously and efficiently, we build an Asynchronous Advantage Actor-Critic (A3C) network [12] with multiple parallel workers. The network details are as follows:

  1. A convolution layer;

  2. A convolution layer;

  3. A convolution layer;

  4. A fully connected layer;

  5. A policy head π and a value head V for RL,
    or a meta-policy head π_Ω, a value head V and termination functions β_ω for HRL;

where each convolution layer is specified by its kernel size, number of output channels, and stride, and the fully connected layer by its number of units.
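One possible realization of these heads on top of a small convolutional trunk is sketched below in PyTorch (our own illustration; the kernel sizes, channel counts, strides and hidden width are assumed placeholders, not the paper's settings):

import torch
import torch.nn as nn

class OptionInterruptionNet(nn.Module):
    """Conv trunk with meta-policy, value and termination heads (assumed layer sizes)."""

    def __init__(self, num_options, in_channels=1):
        super().__init__()
        self.trunk = nn.Sequential(                      # three conv layers + one FC layer
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        self.meta_policy = nn.Linear(256, num_options)   # logits of pi_Omega(. | s)
        self.value = nn.Linear(256, 1)                   # state value V(s)
        self.termination = nn.Linear(256, num_options)   # beta_w(s) after a sigmoid

    def forward(self, x):
        h = self.trunk(x)
        return self.meta_policy(h), self.value(h), torch.sigmoid(self.termination(h))

The sigmoid on the termination head keeps each β_ω(s) in [0, 1], matching the parameterization described in Section III.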

With an emphasis on the influence on the training process, we train our agent on a single map without testing its generalization ability. The result is shown in Fig. 6.

Fig. 6: Average episode length during training. All curves are averaged over the parallel threads.

As we can see, our method converges much faster than Actor-Critic (AC), and the gap between the two curves is largest in the early episodes. This clearly shows that providing human knowledge can significantly speed up the training process, and our result could be further improved by plugging in other existing methods.

V Conclusion

In summary, this paper proposes an Option-Interruption architecture that embeds existing methods into a hierarchical reinforcement learning structure. On the basis of the existing methods, the search space is considerably reduced and hence the training process is significantly sped up. On the other hand, the interruption mechanism provides a flexibility in responding to changes in the external world that the existing methods alone do not possess. The experiments show the efficiency of our architecture given proper human knowledge.

At the same time, there is still future work to do. For example, the training process of the termination function can be further investigated: as displayed in Fig. 4(a), although the final performance of the no-interruption version is worse, it performs well at the initial training stage, which suggests that disabling the termination function at first could be an option to protect a fragile robot when training in the real world. Another direction is to deploy our method on real robots, which still leaves considerable work to be done.

References

  • [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning.” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, “Mastering the game of go without human knowledge.” Nature, vol. 550, no. 7676, p. 354, 2017.
  • [3] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
  • [4] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
  • [5] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture.” in AAAI, 2017, pp. 1726–1734.
  • [6] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in neural information processing systems, 2016, pp. 3675–3683.
  • [7] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in minecraft.” in AAAI, vol. 3, 2017, p. 6.
  • [8] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
  • [9] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 1998.
  • [10] P. Thomas, “Bias in natural actor-critic algorithms,” in International Conference on Machine Learning, 2014, pp. 441–448.
  • [11] D. Zhu, T. Li, D. Ho, and Q. H. Meng, “Deep reinforcement learning supervised autonomous exploration in office environments,” in IEEE International Conference on Robotics and Automation, 2018.
  • [12] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.