Hierarchical Reinforcement Learning for Sequencing Behaviors

03/05/2018 ∙ by Hadi Salman, et al. ∙ Lebanese University, Carnegie Mellon University

Recent literature in the robot learning community has focused on learning robot skills that abstract out lower-level details of robot control, such as Dynamic Movement Primitives (DMPs), the options framework in hierarchical RL, and subtask policies. To fully leverage the efficacy of these macro actions, it is necessary to then sequence these primitives to achieve a given task. Our objective is to jointly learn a set of robot skills and a sequence of these learnt skills to accomplish a given task. We consider the task of navigating a robot across various environments using visual input, maximizing the distance traveled through the environment while avoiding static obstacles. Traditional planning methods to solve this problem rely on hand-crafted state representations and heuristics for planning, and often fail to generalize. In contrast, deep neural networks have proved to be powerful function approximators, successfully modeling complex control policies. In addition, the ability of such networks to learn good representations of high-dimensional sensory inputs makes them a valuable tool when dealing with visual inputs. In this project, we explore the capability of deep neural networks to learn and sequence robot skills for navigation, directly using visual input.




I Introduction

During autonomous deployments, mobile robots require the capability to react online to changes in their surroundings, such as terrain changes, dynamic obstacles, etc. A common way to approach this challenge is to define high-level robot behaviors, and to endow robots with the ability to switch between such behaviors based on their environment [1, 2, 3]. In this paper, we consider the task of improving the navigation of different mobile robots by sequencing learned or designed behaviors based on visual feedback. Traditional planning methods to solve this problem rely on hand-crafted state representations and heuristics for planning, which often fail to generalize to new scenarios [2]. Inspired by recent results in machine learning, where deep neural networks can model complex control policies directly from raw inputs [4, 5, 6, 7], we propose to cast this problem in a reinforcement learning (RL) framework. Our main contribution is a hierarchical RL framework, where agents are trained to both learn and sequence robot behaviors for autonomous navigation by relying solely on raw visual input from a monocular camera. In this framework, agents learn low-level locomotive behaviors, while meta-agents explore the use of these behaviors in different scenarios to maximize the distance travelled through the environment while avoiding obstacles.

Fig. 1: Depiction of elementary environment 1 (left) and elementary environment 2 (center), as well as a composed environment (right) constructed by combining these environments.

Specifically, we rely on deep Q-learning [4] to teach the robot low-level behaviors, as well as to train a meta-agent to sequence the robot’s behaviors. Each of the low-level behaviors allows the robot to locomote while avoiding obstacles in a given environment based on raw visual feedback. These low-level behaviors can be complex gaits, or can be learnt policies that execute simple actions such as forward motion or left and right turns. We also maintain a meta-level policy that selects the most appropriate low-level behavior for the current situation.

We present results of our hierarchical approach on both wheeled and legged robots in simulation. Our low-level behaviors are tailored to a specific environment, each with uniform appearance or structure (such as textured walls, rough terrain, etc.). In contrast, the meta-level policies are learnt in environments composed of several of the training appearances and terrains as shown in Fig. 1. We show how the robots are able to navigate these novel environments by sequencing the appropriate lower-level behaviors based on their immediate surroundings. We further show how learning to sequence low-level behaviors results in a more effective overall policy than either of the individual sub-policies, even in the respective environments they were designed/trained for.

This paper is organized as follows: Section II reviews related work. In Section III, we present our framework, which uses hierarchical reinforcement learning to sequence behaviors given only visual input in a navigation task. In Section IV, we present the results of our framework and compare it with a simple deep Q-network (DQN) architecture. In Section V, we conclude by summarizing our results and outlining future work.

Fig. 2: Depiction of the 6-legged robot traversing a tall obstacle by sequencing two gaits based on raw camera pixels.

II Related Work

Several efforts in the robotics community have focused on learning primitive actions or skills; we provide a brief review of such works, and of some recent developments that have taken steps towards sequencing these skills.

Skill Trees [8] introduced a method to segment demonstrations into skills and to chain these skills. The problem of hierarchical reinforcement learning has been of interest for a considerable amount of time [9]. In hierarchical RL, multi-step action policies are represented as options; a meta-controller then selects which option to apply. [10] developed an approach to hierarchical RL that provides “intrinsic motivation” to agents to perform certain subtasks, drastically helping the agent’s ability to achieve overall task completion.

[3] constructed a layered approach to adapt, select, and sequence DMPs. [11] provided annotations of task structure, and optimized for overall task completion over a set of modular subpolicies. [12] built models of the preconditions and effects of parameterized skills. While [13] address monocular vision-based navigation via reinforcement learning, they do not make use of existing robot behaviors. We note that while our paradigm of learning a meta-level policy to sequence behaviors is also adopted by [14], they do not address the challenge of learning from visual inputs.

III Approach

The problem of sequencing a set of robot behaviors from visual feedback can be posed as a Markov Decision Process (MDP) with temporal abstractions, inspired by [9]. At a given state $s_t$ corresponding to time step $t$, the robot chooses a behavior (i.e., a low-level policy) $\pi_i$ from a predetermined set of behaviors $\{\pi_1, \pi_2, \ldots\}$ according to a meta-policy $\pi_{meta}$, and follows this behavior for $N$ time steps. During these $N$ steps, an action $a_t$ is chosen according to the low-level policy $\pi_i$, resulting in a new state $s_{t+1}$ of the robot and a collected reward $r_t$. The goal is to learn the meta-level policy that sequentially chooses a low-level policy every $N$ steps to maximize the cumulative reward.

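In order to learn the meta-level policy (and, in certain cases, the low-level policies), we make use of deep Q-learning [4]. Q-learning estimates the Q-value of a state-action pair $(s, a)$, which is defined as the expected cumulative reward upon taking action $a$ from state $s$, and following policy $\pi$ thereafter. Formally, the Q-value of a state-action pair can be written as

$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi \right],$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the reward collected at time step $t$.

In particular, Q-learning is a temporal difference method that optimizes the loss function

$$L = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^{2} \right],$$

where $s'$ is the state reached after taking action $a$ in state $s$ and receiving reward $r$.

Deep Q-learning employs deep neural networks as function approximators to estimate these Q-values. For more details, we refer the reader to [4].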
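As a rough illustration of the temporal-difference loss used in Q-learning, the following sketch computes the mean squared TD error over a small batch of transitions in plain Python. The function name, the argument layout, and all numeric values (including the discount factor) are assumptions made for illustration and are not taken from the paper.

```python
def td_loss(q_values, next_q_values, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a batch of transitions.

    q_values:      list of per-action Q estimates for each state s
    next_q_values: list of per-action Q estimates for each next state s'
    actions:       index of the action taken in each state
    rewards:       immediate reward of each transition
    dones:         True if s' is terminal (no bootstrapping)
    gamma:         discount factor (assumed value, not from the paper)
    """
    total = 0.0
    for q_s, q_next, a, r, done in zip(q_values, next_q_values,
                                       actions, rewards, dones):
        # Bootstrap from the best next-state value unless the episode ended.
        bootstrap = 0.0 if done else gamma * max(q_next)
        td_target = r + bootstrap
        total += (td_target - q_s[a]) ** 2
    return total / len(actions)


# Toy batch of two transitions with hypothetical Q estimates.
loss = td_loss(
    q_values=[[1.0, 2.0], [0.5, 0.0]],
    next_q_values=[[0.0, 3.0], [1.0, 1.0]],
    actions=[1, 0],
    rewards=[1.0, -1.0],
    dones=[False, True],
    gamma=0.5,
)
```

In a deep Q-learning setup the two lists of Q estimates would come from neural networks rather than hand-written numbers; the arithmetic of the loss is unchanged.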

Fig. 3: Schematic architecture of our approach. Here, we jointly depict the architecture of the meta-policy (Duelling DQN), as well as the low-level policies $\pi_1$ and $\pi_2$. We further depict the schematic of calling the meta-policy first, choosing a low-level control policy to execute, running the low-level policy for $N$ steps, and then using the meta-level policy again to choose another low-level policy.

III-A Framework

We consider two robot agents in this paper, each equipped with a monocular camera that provides an observation of the state of the robot. The first robot, a six-legged robot, must navigate an environment that consists of varying terrain. The second robot, a differential-drive robot (Turtlebot), must navigate an environment consisting of visually dissimilar regions. In both cases, the objective of the robot is to maximize the distance it travels in the environment, while coping with changes in the terrain or the appearance of the environment, and avoiding obstacles (in the case of the Turtlebot). Our reward function encodes this objective by rewarding the distance travelled by the robot and penalizing any collisions with obstacles.

Key to our approach is the idea of hierarchical reinforcement learning, where an agent learns control policies in a hierarchical framework. In our approach we consider two such levels:

  1. The low-level control policies select robot actions (such as moving forward or turning left or right) on the basis of perceptual input (i.e. raw camera input).

  2. At the higher level, there is a meta-level policy, which selects which of these lower level policies to apply over an extended period of time.

The use of such a hierarchy of policies is augmented by constructing compound environments that are combinations of several dissimilar elementary environments, as depicted in Figure 1. One low-level policy is trained to navigate each elementary environment, while the meta-level policy is trained to sequence these low-level policies to navigate the compound environment. We note that the low-level policies only observe one of the elementary environments during training (or are only designed to navigate one type of terrain in the case of the legged robot), and hence perform poorly on the other elementary environments. It is thus necessary for the meta-level policy to alternate between these low-level policies in order to successfully navigate the compound environment.

We present a schematic of this hierarchical framework in Figure 3. The meta-level policy (depicted in blue) selects a control policy (in green), which then provides low-level commands to the robot. Because the compound environments are composed of distinct elementary components, each of which is handled well by only one low-level policy, a hierarchical framework is a natural fit for this task. We provide a description of the low-level policies specific to each of these robots below, followed by a description of the meta-level policy.

III-B Lower-level control policies

The low-level control policies enable the robots to locomote in their respective environments. We describe the form of the low-level policies for each of the robots we use.

III-B1 Turtlebot

We provide the Turtlebot with actions to move forward and to turn left or right. The low-level policy must specify which of these actions the Turtlebot takes. To evaluate the quality of each of these actions, we make use of a DQN, as described in Section III, to map an observation of the state $s$ of the robot to an estimate of the values $Q(s, a)$ of these actions $a$. In this paper, we use a variant of the DQN, namely a double duelling deep Q-network [15]. The low-level policy is then derived by acting greedily with respect to the estimated values, i.e., $\pi(s) = \arg\max_{a} Q(s, a)$. We train two instances $\pi_1$ and $\pi_2$ of the low-level control policy in visually differing elementary environments.
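To make the greedy low-level policy concrete, the sketch below combines a state value and per-action advantages into Q-values using the standard dueling aggregation (value plus mean-centred advantage), and then selects the action with the largest Q-value. The numeric values are hypothetical, and the functions are an illustration of the general technique rather than the authors' network.

```python
ACTIONS = ["forward", "left", "right"]  # action set described for the Turtlebot


def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


def greedy_action(q_values):
    """Return the name of the action with the largest Q-value."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best_index]


# Hypothetical network outputs for a single camera observation.
q = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
action = greedy_action(q)
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, which is the usual motivation for this form of aggregation.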

III-B2 Legged Robot

In the case of the 6-legged robot, our low-level policies take the form of two different gaits. The first gait takes low steps at a relatively fast pace, ideal for traversing flat terrain quickly. The second gait takes higher steps at a slower pace, which is suitable for traversing obstacles present in the environment. We make use of Central Pattern Generators (CPGs) [16] to generate these gaits; overloading notation, we refer to these gaits as $\pi_1$ and $\pi_2$. To quickly traverse a compound environment, the legged robot would ideally use an appropriate combination of these gaits.

We note that the actions provided to each of the robots are executed by underlying controllers that incorporate the dynamics of the robot - for example, the left command steers the Turtlebot forward by rotating its wheels at varying speeds. We also note that we do not assume access to a model of the robots' dynamics; instead, we rely on a physics simulator. Specifically, we use gym-gazebo [17], which is an extension of the OpenAI Gym [18] for robotics that is based on the Robot Operating System (ROS) [19] and the Gazebo simulator [20].

Fig. 4: Depiction of the trained meta-policy running on the rough terrain environment. The legged robot learns to sequence two behaviors to traverse the obstacle present in the middle of the environment. The top left corner of each frame shows the camera view of the robot. These camera images are used by the robot to decide which gait to use at a specific time step.

III-C Meta-level control policy

Our meta-level policy learns an appropriate sequence of the low-level control policies, based on camera observations of the surrounding environment. The action space of the meta-policy corresponds to which low-level policy to execute, i.e., $\pi_1$ or $\pi_2$, as discussed in Section III-B. The meta-level policy thus selects a low-level control policy to execute, and the robot executes this policy for $N$ time steps. The meta-policy then chooses which policy to run next, as depicted in Figure 3.
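The select-then-execute loop described above can be sketched as follows. The `ToyCorridor` class is a hypothetical stand-in for the simulator (a one-dimensional corridor with two region types), and the two low-level policies and the meta-policy are hand-written placeholders rather than learned networks; only the control flow mirrors the text.

```python
class ToyCorridor:
    """Hypothetical 1-D corridor: region 0 in the first half, region 1 after."""

    def __init__(self, length=20):
        self.length = length
        self.position = 0

    def _observe(self):
        # The observation is simply the region type at the current position.
        return 0 if self.position < self.length // 2 else 1

    def reset(self):
        self.position = 0
        return self._observe()

    def step(self, action):
        # Action 1 moves one cell forward; action 0 stays in place.
        self.position += action
        reward = float(action)
        done = self.position >= self.length
        return self._observe(), reward, done


def run_episode(env, meta_policy, low_level_policies, n_steps, max_meta_steps=100):
    """Meta-policy picks a low-level policy, which then acts for n_steps."""
    obs = env.reset()
    total_reward, choices, done = 0.0, [], False
    for _ in range(max_meta_steps):
        index = meta_policy(obs)
        choices.append(index)
        policy = low_level_policies[index]
        meta_reward = 0.0  # reward accumulated over this block of n_steps
        for _ in range(n_steps):
            obs, reward, done = env.step(policy(obs))
            meta_reward += reward
            if done:
                break
        total_reward += meta_reward
        if done:
            break
    return total_reward, choices


# Each placeholder low-level policy only makes progress in "its" region.
policies = [lambda obs: 1 if obs == 0 else 0,
            lambda obs: 1 if obs == 1 else 0]

meta_total, meta_choices = run_episode(ToyCorridor(), lambda obs: obs, policies, n_steps=5)
single_total, _ = run_episode(ToyCorridor(), lambda obs: 0, policies, n_steps=5)
```

In this toy setting, a meta-policy that switches policies at the region boundary reaches the end of the corridor, while always running the first policy stalls halfway.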

Given the high-dimensional observation space of the robots’ states considered in this paper (raw camera pixels), we utilize a DQN meta-policy that uses a neural network as a function approximator for the Q-values. The network is composed of convolutional layers followed by fully connected layers. The outputs of this network are the meta-level Q-values of running each of the low-level policies forward for $N$ time steps. Note that we use the same meta-policy architecture for both of the robots we consider in this paper.

III-D Implementation Details

Below we provide details of the training setup. The training settings are identical for both low-level control policies and the meta-policy.

III-D1 Reward Function

The robot’s objective is to navigate the largest distance possible without running into the walls, given a particular starting point. This can be framed as a reward maximization problem typical of reinforcement learning. This translates to a positive reward signal proportional to the distance moved by the robot’s centre of mass (+5 for moving forward, and +1 for turning left or right), with a large penalty on collisions with obstacles (-200). We use the ground truth position of the robot (relative to its starting point) to determine the distance traveled by the robot. We detect collisions with the obstacles using the distance to obstacle measured by a LIDAR mounted on the robots (only used during training), and penalize such collisions with a reward value of -200.
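A minimal sketch of this per-step reward is given below. The reward magnitudes (+5, +1, -200) come from the text; the function name, the action strings, and the LIDAR distance threshold used to declare a collision are assumptions for illustration.

```python
FORWARD_REWARD = 5.0        # from the text: +5 for moving forward
TURN_REWARD = 1.0           # from the text: +1 for turning left or right
COLLISION_PENALTY = -200.0  # from the text: -200 on collision
COLLISION_DISTANCE = 0.2    # assumed LIDAR threshold in metres (hypothetical)


def step_reward(action, lidar_ranges):
    """Return (reward, collided) for a single low-level step.

    action:       one of "forward", "left", "right"
    lidar_ranges: distances to obstacles reported by the LIDAR
    """
    # A reading below the threshold is treated as a collision.
    if min(lidar_ranges) < COLLISION_DISTANCE:
        return COLLISION_PENALTY, True
    if action == "forward":
        return FORWARD_REWARD, False
    return TURN_REWARD, False
```

For example, `step_reward("forward", [1.0, 0.5])` yields the forward reward, whereas any reading below the assumed threshold yields the collision penalty regardless of the action.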

We note that the reward functions for the low-level policy and the meta-policy are different. The low level policy collects an immediate reward after executing each step, whereas the meta-policy uses the cumulative reward collected over N steps of execution of the low-level policy. Note also that we do not require additional reward functions (such as intrinsic motivation) in contrast with [10].

III-D2 Training Details

During training, we utilize a number of techniques that have proved useful in training reinforcement learning agents. Specifically, we use experience replay [21] of memory size of , with a burn-in of time steps. We also maintain a target Q-network for training stability as in double Deep Q-Networks [22]. The target network is updated every time steps. We train our DQN using the Adam optimizer with a learning rate of . We use a discount factor of , and gradient clipping of magnitude . Furthermore, we make use of a decaying epsilon-greedy policy during training, to ensure sufficient exploration of states and actions.
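The training techniques listed above can be sketched with a small replay buffer, a linearly decaying epsilon schedule, and a periodic target-network update check. Every numeric value in this sketch (capacity, burn-in, decay length, epsilon bounds, update period) is a placeholder chosen for illustration, and the linear form of the decay is also an assumption; none of these are taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience replay with a burn-in threshold."""

    def __init__(self, capacity, burn_in):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped
        self.burn_in = burn_in

    def add(self, transition):
        self.memory.append(transition)

    def ready(self):
        # Training starts only after burn_in transitions have been collected.
        return len(self.memory) >= self.burn_in

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)


def epsilon_at(step, start=1.0, end=0.05, decay_steps=1000):
    """Linearly decay epsilon from start to end over decay_steps (assumed schedule)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def should_update_target(step, update_every):
    """True on the steps where the target network would be synchronised."""
    return step > 0 and step % update_every == 0


# Placeholder sizes, chosen only to keep the example small.
buffer = ReplayBuffer(capacity=3, burn_in=2)
for t in range(4):
    buffer.add((f"s{t}", 0, 1.0, f"s{t + 1}", False))
batch = buffer.sample(2)
```

With a capacity of three, the fourth transition evicts the first, so the buffer always holds the most recent experience.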

IV Results

In this section, we describe the results of our framework applied to two robots: a legged robot and a wheeled mobile robot. We show results of the meta-policy sequencing a set of low-level policies for each of the robots. We demonstrate the ability of meta-level control policies to navigate the robots through the respective environments based only on monocular camera feedback.

IV-A Legged-Robot Navigation

The legged robot has to switch between gaits to traverse rough terrain. The low-level policies in this scenario are two CPG gaits. The first gait is characterized by low steps and high forward speed. The second gait is characterized by high steps and low forward speed. We visualize one such episode in Fig. 4, where we depict the progress of the robot at varying time steps. Note the robot’s ability to sequence both gaits to traverse the high obstacle in the middle of the path. For a video of the legged robot performing this task, check the following link.

Fig. 5: Depiction of the trained meta-policy running on the compound environment. The robot is marked with a pink circle for visibility. Note that the robot is able to traverse a relatively long, complex path with several turns.

IV-B Wheeled Robot Navigation

The wheeled mobile robot (Turtlebot) has to navigate in a maze-like environment without colliding with obstacles. The Turtlebot is trained separately on two elementary environments: environment 1, shown in Fig. 1-left, and environment 2, shown in Fig. 1-center. The learned policies in these environments are $\pi_1$ and $\pi_2$ respectively. The results for the Turtlebot navigating environment 2 may be viewed at this link.

Fig. 6: The human-constructed deterministic policy (center) is to use $\pi_1$ (depicted in dark blue) in areas with bricks, and $\pi_2$ (depicted in light green) in areas with black walls / purple floors. The learned meta-policy chooses the assignment of policies depicted on the left. While there are areas in which this learned meta-policy differs from the human deterministic policy, the meta-policy is able to do better by switching between different policies.

We then learn a new policy that sequences the policies $\pi_1$ and $\pi_2$ by training on the compound environment shown in Fig. 1-right. Note that here, $\pi_1$ performs poorly on environment 2, and $\pi_2$ performs poorly on environment 1. This disparity in performance arises because each low-level policy is not trained on the opposite environment. The meta-policy learns and exploits this difference in the performance of the respective low-level policies across environments.

IV-C Discussions

Intuitively, the meta-policy should learn to apply $\pi_1$ in the area of the compound environment with bricks, and apply $\pi_2$ in the area of the environment with black walls and purple flooring. We represent this pictorially in Fig. 6-center. Upon training the meta-policy to sequence the policies $\pi_1$ and $\pi_2$, we plot the low-level policies chosen by the meta-policy as a function of space, as shown in Fig. 6-left. These policy activations provide insight into how the meta-policy functions. As is visible from Fig. 6-left, the learned meta-policy chooses policy $\pi_2$ in environment 2, consistent with the performance of $\pi_2$ in its training environment.

| Policy | Average Cumulative Reward (over 10 episodes) | Percentage of Successful Trials |
| --- | --- | --- |
| Meta-level Policy (ours) | 988.5 | 80% |
| Hand-crafted meta-level Policy | 669.3 | 40% |
| Low-level Policy (Compound Env.) | 408.9 | 0% |
| Low-level Policy (Elementary Env. 1) | 317.9 | 0% |
| Low-level Policy (Elementary Env. 2) | - | - |

TABLE I: Average reward and task success rate of our meta-policy versus hand-crafted and learnt baselines

Further, we see that outside of the black-walled area with purple flooring, there are occasions when the meta-policy chooses to alternate between $\pi_1$ and $\pi_2$. Observing the video of the Turtlebot in Gazebo while it is executing the meta-policy, we note the following interesting behavior.

The low-level policy $\pi_1$ traverses environment 1 by slowly switching between left and right actions, without moving forward. However, it is able to pass environment 1. Since the forward action has a larger velocity provided to it, the meta-policy learns that such slow, careful behavior is better for portions of environment 1 with turns in it. In regions of environment 1 that are straight, the meta-policy learns to exploit $\pi_2$, which tends to go straight when possible. This modulation of which policy is active, as chosen by the meta-policy, is visible in Fig. 6-left.

This interesting behavior guarantees that the meta-policy is able to traverse the entire path, while neither of the low-level policies can traverse the entire path. Further, the low-level policies are not perfect, and tend to crash occasionally in their own environments. This implies that the human deterministic policy, that simply calls each policy in its respective environment, is not guaranteed to traverse the entire path. The meta-policy, however, smartly interleaves the two low level policies to traverse the entire path with reasonably good probability, as depicted in Figure


We report the average reward obtained over 10 episodes in Table I. The human deterministic policy achieves an average cumulative reward of 669.3, while the learned meta-policy achieves an average cumulative reward of 988.5. This significant difference in the rewards is due to the ability of the meta-policy to switch between low-level control policies so that the robot traverses the entire path in most trials.

In the following link, we show that a single low-level policy is unable to traverse the entire path, thus necessitating the use of a meta-policy. Next, we show the results of the hand-crafted deterministic policy, which sometimes fails to traverse the path, in the following link. Finally, we show the results of the learned meta-policy, which manages to traverse the path and achieve higher rewards by alternating between control policies, in the following link. Please note that the video has annotations, in the background on the left, of which low-level control policy is being invoked. Meta-action refers to , and meta-action refers to .

Fig. 7: Depiction of the task of navigating a wheeled robot in a dynamic environment using our framework.

IV-D Wheeled-Robot Navigation in Dynamic Environments

To explore the capability of our framework to accommodate sensor modalities other than cameras, such as LIDARs, and its applicability in dynamic environments, we test our framework on the task of navigating a wheeled mobile robot in an environment with both static and dynamic obstacles. We depict the robot and the dynamic environment used in this task in Fig. 7. The results of the successful navigation of the robot can be seen in the video in the following link.

V Conclusion and Future Work

In this paper, we present an approach to both learn and sequence robot behaviors, applied to the problem of visual navigation of mobile robots. The layered representation of control policies that we employ allows the robot to adapt to changes in the environment and select the low-level behavior most appropriate for the current situation, enabling significantly improved task performance. The meta-level policies that we learn are agnostic to the nature of the low-level behaviors used, enabling the use of both hand-crafted and learnt policies. In addition, these meta-level policies may be trained with other input modalities. These features exemplify why maintaining a hierarchy of control policies is a potent tool for endowing robots with more autonomous capabilities.

For future work, we would like to explore how this framework may be modified to learn both levels of policies jointly. This would allow the low-level behaviors to be adapted online, further extending the capabilities of the robots.