During autonomous deployments, mobile robots require the capability to react online to changes in their surroundings, such as terrain variations and dynamic obstacles. A common way to approach this challenge is to define high-level robot behaviors, and to endow robots with the ability to switch between such behaviors based on their environment [1, 2, 3]. In this paper, we consider the task of improving the navigation of different mobile robots by sequencing learned or designed behaviors based on visual feedback. Traditional planning methods for this problem rely on hand-crafted state representations and heuristics, which often fail to generalize to new scenarios. Inspired by recent results in machine learning, where deep neural networks can model complex control policies directly from raw inputs [4, 5, 6, 7], we propose to cast this problem in a reinforcement learning (RL) framework. Our main contribution is a hierarchical RL framework in which agents are trained to both learn and sequence robot behaviors for autonomous navigation, relying solely on raw visual input from a monocular camera. In this framework, agents learn low-level locomotive behaviors, while meta-agents learn to deploy these behaviors in different scenarios so as to maximize the distance travelled through the environment while avoiding obstacles.
Specifically, we rely on deep Q-learning to teach the robot low-level behaviors, as well as to train a meta-agent to sequence the robot’s behaviors. Each of the low-level behaviors allows the robot to locomote while avoiding obstacles in a given environment based on raw visual feedback. These low-level behaviors can be complex gaits, or learnt policies that execute simple actions such as forward motion or left and right turns. We also maintain a meta-level policy that selects the most appropriate low-level behavior for the current situation.
We present results of our hierarchical approach on both wheeled and legged robots in simulation. Our low-level behaviors are tailored to a specific environment, each with uniform appearance or structure (such as textured walls, rough terrain, etc.). In contrast, the meta-level policies are learnt in environments composed of several of the training appearances and terrains, as shown in Fig. 1. We show how the robots are able to navigate these novel environments by sequencing the appropriate lower-level behaviors based on their immediate surroundings. We further show that learning to sequence low-level behaviors results in a more effective overall policy than any of the individual sub-policies, even in the respective environments those sub-policies were designed or trained for.
This paper is organized as follows: Section II reviews work related to this paper. In Section III, we present our framework that uses hierarchical reinforcement learning for sequencing behaviors given only visual input in a navigation task. In Section IV, we present the results of our framework and compare it with a simple deep Q-network (DQN) architecture. In Section V, we conclude by summarizing our results and outlining future work.
II Related Work
Several efforts in the robotic community have focused on learning primitive-actions or skills; we provide a brief review of such works, and some recent developments that have taken steps towards sequencing these skills.
Konidaris et al. introduced skill trees, a method to segment demonstrations into skills and to chain these skills together. The problem of hierarchical reinforcement learning has been of interest for a considerable amount of time. In hierarchical RL, multi-step action policies are represented as options; a meta-controller then selects which option to apply. Kulkarni et al. developed an approach to hierarchical RL that provides “intrinsic motivation” to agents to perform certain subtasks, drastically improving the agents’ ability to achieve overall task completion.
Other efforts constructed a layered approach to adapt, select, and sequence dynamic movement primitives (DMPs). Andreas et al. provided annotations of task structure, and optimized for overall task completion over a set of modular subpolicies. Kaelbling and Lozano-Pérez built models of the preconditions and effects of parameterized skills. While Xie et al. address monocular-vision-based navigation via reinforcement learning, they do not make use of existing robot behaviors. We note that while our paradigm of learning a meta-level policy to sequence behaviors is also adopted by Frans et al., they do not address the challenge of learning from visual inputs.
The problem of sequencing a set of robot behaviors from visual feedback can be posed as a Markov Decision Process (MDP) with temporal abstractions, inspired by the options framework. At a given state $s_t$ corresponding to time step $t$, the robot chooses a behavior (i.e., a low-level policy) $\pi_i$ from a predetermined set of behaviors $\{\pi_1, \dots, \pi_K\}$ according to a meta-policy $\mu$, and follows this behavior for $N$ time steps. During these steps, an action $a_t$ is chosen according to the low-level policy $\pi_i$, resulting in a new state $s_{t+1}$ of the robot and a collected reward $r_t$. The goal is to learn the meta-level policy $\mu$ that sequentially chooses a low-level policy every $N$ steps to maximize the cumulative reward $\sum_t \gamma^t r_t$.
In order to learn the meta-level policy (and, in certain cases, the low-level policies), we make use of deep Q-learning. Q-learning estimates the Q-value of a state-action pair, defined as the expected cumulative reward upon taking action $a$ from state $s$ and following policy $\pi$ thereafter. Formally, the Q-value of a state-action pair can be written as

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a,\; \pi\right].$$
In particular, Q-learning is a temporal-difference method that optimizes the following loss function:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^2\right],$$

where $\theta^{-}$ denotes the parameters of a periodically updated target network.
Deep Q-learning employs deep neural networks as function approximators to estimate these Q-values. For more details, we refer the reader to the original DQN work [4].
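To make the temporal-difference update above concrete, the following sketch shows it in tabular form; deep Q-learning replaces the table with a neural network, but the Bellman target $r + \gamma \max_{a'} Q(s', a')$ is the same. This is an illustrative sketch, not the authors' implementation, and the function names are our own.

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman backup target: r + gamma * max_a' Q(s', a'), or just r at terminal states."""
    return reward if done else reward + gamma * np.max(next_q_values)

def q_learning_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99, done=False):
    """One Q-learning step: move Q(s, a) toward the TD target, reducing the squared TD error."""
    target = td_target(reward, q_table[s_next], gamma, done)
    td_error = target - q_table[s][a]
    q_table[s][a] += alpha * td_error
    return td_error
```

In the deep setting, the same TD error is back-propagated through the network parameters $\theta$, with the target computed from the frozen parameters $\theta^{-}$.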
We consider two robot agents in this paper, each equipped with a monocular camera that provides an observation of the robot's state. The first robot, a six-legged robot, must navigate an environment that consists of varying terrain. The second robot, a differential-drive robot (Turtlebot), must navigate an environment consisting of visually dissimilar regions. In both cases, the objective of the robot is to maximize the distance it travels in the environment, while coping with changes in the terrain or the appearance of the environment, and avoiding obstacles (in the case of the Turtlebot). Our reward function encodes this objective by positively rewarding the distance travelled by the robot and penalizing any collisions with obstacles.
Key to our approach is the idea of hierarchical reinforcement learning, where an agent learns control policies in a hierarchical framework. In our approach we consider two such levels:
The low-level control policies select robot actions (such as moving forward or turning left or right) on the basis of perceptual input (i.e. raw camera input).
At the higher level, there is a meta-level policy, which selects which of these lower level policies to apply over an extended period of time.
The use of such a hierarchy of policies is augmented by constructing compound environments that are combinations of several dissimilar elementary environments depicted in Figure 1. One low-level policy is trained to navigate each elementary environment, while the meta-level policy is trained to sequence these low-level policies to navigate the compound environment. We note that the low-level policies only observe one of the elementary environments during training (or are only designed to navigate one type of terrain, in the case of the legged robot), and hence perform poorly on the other elementary environments. It is thus necessary for the meta-level policy to alternate between these low-level policies in order to successfully navigate the compound environment.
We present a schematic of this hierarchical framework in Figure 3. The meta-level policy (depicted in blue) selects a control policy (in green), which then provides low-level commands to the robot. This hierarchical representation of policies, combined with the notion of environments composed of distinct elementary components, necessitates a hierarchical learning framework. We provide a description of the low-level policies specific to each of these robots below, followed by a description of the meta-level policy.
III-B Lower-level control policies
The low-level control policies enable the robot to locomote in their respective environments. We describe the form of the low-level policies for each of the robots we use.
III-B1 Wheeled Robot
We provide the Turtlebot with actions to move forward and to turn left or right. The low-level policy must specify which of these actions the Turtlebot should take. To evaluate the quality of each of these actions, we make use of a DQN, as described in Section III, to map an observation $o_t$ of the state of the robot to an estimate $Q(o_t, a)$ of the values of these actions. In this paper, we use a variant of the DQN, a double dueling deep Q-network, for this purpose. The low-level policy is then derived by acting greedily with respect to the estimated values, i.e., $a_t = \arg\max_a Q(o_t, a)$. We train two instances $\pi_1$ and $\pi_2$ of the low-level control policy in visually differing elementary environments.
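The dueling architecture splits the Q-value estimate into a state value and per-action advantages; a minimal sketch of that aggregation step and of the greedy low-level policy is shown below. The value and advantage inputs would come from the network's two heads; here they are plain arrays for illustration.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean advantage keeps V and A identifiable."""
    return value + advantages - np.mean(advantages)

def greedy_action(q_values):
    """Low-level policy: act greedily with respect to the estimated Q-values."""
    return int(np.argmax(q_values))
```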
III-B2 Legged Robot
In the case of the six-legged robot, our low-level policies take the form of two different gaits. The first gait takes low steps at a relatively fast pace, ideal for traversing flat terrain quickly. The second gait takes higher steps at a slower pace, which is suitable for traversing obstacles present in the environment. We make use of Central Pattern Generators (CPGs) to generate these gaits; overloading notation, we refer to these gaits as $\pi_1$ and $\pi_2$ as well. To quickly traverse a compound environment, the legged robot would ideally use an appropriate combination of these gaits.
We note that the actions provided to each of the robots are executed by underlying controllers that incorporate the dynamics of the robot; for example, the left command steers the Turtlebot by rotating its wheels at different speeds. We also note that we do not assume access to the dynamics of the robots; instead, we use a physics simulator. Specifically, we use gym-gazebo, an extension of the OpenAI Gym for robotics that is based on the Robot Operating System (ROS) and the Gazebo simulator.
III-C Meta-level control policy
Our meta-level policy learns an appropriate sequence of the low-level control policies, based on camera observations of the surrounding environment. The action space of the meta-policy corresponds to which low-level policy to execute, i.e., $\pi_1$ or $\pi_2$ as discussed in Section III-B. The meta-level policy thus selects a low-level control policy to execute, and the robot executes this policy for $N$ time steps. The meta-policy then chooses which policy to run next, as depicted in Figure 3.
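The resulting hierarchical rollout can be sketched as the loop below, assuming a gym-style environment interface with `reset()` and `step()`; the exact interface and policy signatures are illustrative assumptions, not the authors' code.

```python
def run_meta_episode(meta_policy, low_level_policies, env, N, max_steps=1000):
    """Hierarchical rollout: every N primitive steps, the meta-policy picks
    a low-level policy by index; that policy then issues primitive actions."""
    obs, total_reward, steps, done = env.reset(), 0.0, 0, False
    while steps < max_steps and not done:
        k = meta_policy(obs)              # meta-action: index of a low-level policy
        policy = low_level_policies[k]
        for _ in range(N):                # commit to the chosen policy for N steps
            obs, reward, done = env.step(policy(obs))
            total_reward += reward
            steps += 1
            if done:
                break
    return total_reward
```

The cumulative reward collected inside the inner loop is exactly the quantity the meta-policy's Q-values estimate.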
Given the high-dimensional observation space of the robots’ states considered in this paper (raw camera pixels), we utilize a DQN meta-policy that uses a neural network as a function approximator for the Q-values. The network is composed of convolutional layers followed by fully connected layers. The outputs of this network are the meta-level Q-values of running each low-level policy forward for $N$ time steps. Note that we use the same meta-policy architecture for both of the robots considered in this paper.
III-D Implementation Details
Below we provide details of the training setup. The training settings are identical for both low-level control policies and the meta-policy.
III-D1 Reward Function
The robot’s objective is to navigate the largest distance possible without running into the walls, given a particular starting point. This can be framed as a reward-maximization problem typical of reinforcement learning. Concretely, the robot receives a positive reward proportional to the distance moved by its centre of mass (+5 for moving forward, and +1 for turning left or right), with a large negative penalty for collisions with obstacles (-200). We use the ground-truth position of the robot (relative to its starting point) to determine the distance traveled, and provide a reward signal directly proportional to this distance. We detect collisions using the distance to obstacles measured by a LIDAR mounted on the robots (only used during training), and penalize such collisions with a reward of -200.
We note that the reward functions for the low-level policies and the meta-policy are different. A low-level policy collects an immediate reward after executing each step, whereas the meta-policy uses the cumulative reward collected over the $N$ steps of execution of the chosen low-level policy. Note also that, in contrast with prior hierarchical approaches, we do not require additional reward functions (such as intrinsic motivation).
III-D2 Training Details
During training, we utilize a number of techniques that have proved useful in training reinforcement learning agents. Specifically, we use experience replay with a fixed-size memory and a burn-in period before learning begins. We also maintain a target Q-network for training stability, as in double deep Q-networks; the target network is updated at a fixed interval of time steps. We train our DQN using the Adam optimizer with a discounted return objective, and apply gradient clipping. Furthermore, we use a decaying epsilon-greedy policy during training to ensure sufficient exploration of states and actions.
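The decaying epsilon-greedy exploration can be sketched as below. The specific schedule and its endpoint values are not preserved in the text, so the linear decay and the numbers here are assumptions for illustration only.

```python
import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=10000):
    """Linearly anneal the exploration rate from eps_start to eps_end
    over decay_steps, then hold it at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a uniformly random action,
    otherwise act greedily with respect to the Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```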
In this section, we describe the results of our framework applied to two robots; a legged robot and a mobile robot. We show results of the meta-policy sequencing a set of low-level policies for each of the robots. We demonstrate the ability of meta-level control policies to navigate the robots through the respective environments based only on monocular camera feedback.
IV-A Legged-Robot Navigation
The legged robot has to switch between gaits to traverse a rough terrain. The low-level policies in this scenario are the two CPG gaits described in Section III-B2: one with low steps and high forward speed, and one with high steps and low forward speed. We visualize one such episode in Fig. 4, where we depict the progress of the robot at varying time steps. Note the robot's ability to sequence both gaits to traverse the high obstacle in the middle of the path. A video of the legged robot performing this task is available at the following link.
IV-B Wheeled-Robot Navigation
The wheeled mobile robot (Turtlebot) has to navigate a maze-like environment without colliding with obstacles. The Turtlebot is trained separately on two elementary environments: environment 1, shown in Fig. 1-left, and environment 2, shown in Fig. 1-center. The policies learned in these environments are $\pi_1$ and $\pi_2$, respectively. The results for the Turtlebot navigating environment 2 may be viewed at this link.
We then learn a new meta-policy that sequences the policies $\pi_1$ and $\pi_2$ by training on the compound environment shown in Fig. 1-right. Note that $\pi_1$ performs poorly on environment 2, and $\pi_2$ performs poorly on environment 1; this disparity arises because each low-level policy is never trained on the other environment. The meta-policy learns and exploits this disparity in the performance of the low-level policies across environments.
Intuitively, the meta-policy should learn to apply one low-level policy in the area of the compound environment with bricks, and the other in the area with black walls and purple flooring; this is represented pictorially in Fig. 6-center. Upon training the meta-policy to sequence $\pi_1$ and $\pi_2$, we plot the low-level policy chosen by the meta-policy as a function of space, as shown in Fig. 6. These policy activations provide insight into how the meta-policy functions. As is visible from Fig. 6-left, the learned meta-policy chooses policy $\pi_1$ in environment 1, consistent with the performance of $\pi_1$ in its training environment.
| Policy | Avg. cumulative reward | Success rate |
| Meta-level Policy (ours) | 988.5 | 80% |
Further, we see that outside of the black-walled area with purple flooring, there are occasions when the meta-policy chooses to alternate between $\pi_1$ and $\pi_2$. Observing the video of the Turtlebot in Gazebo while it executes the behavior, we observe the following interesting property of the meta-policy.
The low-level policy $\pi_1$ traverses environment 1 by slowly switching between left and right actions, without moving forward much; nevertheless, it is able to pass through environment 1. Since the forward action is assigned a larger velocity, the meta-policy learns that such slow, careful behavior is better suited to the portions of environment 1 with turns. In the straight regions of environment 1, the meta-policy learns to exploit $\pi_2$, which tends to go straight when possible. This modulation of the active policy, as chosen by the meta-policy, is visible in Fig. 6-left.
This interesting behavior allows the meta-policy to traverse the entire path, while neither of the low-level policies can do so on its own. Further, the low-level policies are not perfect, and tend to crash occasionally even in their own environments. This implies that the hand-crafted deterministic policy, which simply calls each policy in its respective environment, is not guaranteed to traverse the entire path. The meta-policy, however, smartly interleaves the two low-level policies to traverse the entire path with reasonably high probability, as depicted in Figure 5.
We report the average reward obtained over the evaluation episodes in Table I. The learned meta-policy achieves an average cumulative reward of 988.5, significantly higher than that of the hand-crafted deterministic policy. This difference in rewards is due to the ability of the meta-policy to switch between low-level control policies so that the robot nearly always traverses the entire path.
We show that a single low-level policy is unable to traverse the entire path, thus necessitating the use of a meta-policy, in the following link. Next, we show the results of the hand-crafted deterministic policy, which sometimes fails to traverse the path, in the following link. Finally, we show the results of the learned meta-policy, which manages to traverse the path and achieve higher rewards by alternating between control policies, in the following link. Please note that the video has annotations, in the background on the left, of which low-level control policy ($\pi_1$ or $\pi_2$) is being invoked as the meta-action.
IV-D Wheeled-Robot Navigation in Dynamic Environments
To explore the capability of our framework to accommodate sensor modalities other than cameras, such as LIDAR, and its applicability in dynamic environments, we test our framework on the task of navigating a wheeled mobile robot in an environment with combined static and dynamic obstacles. We depict the robot and the dynamic environment used in this task in Fig. 7. The results of the successful navigation of the robot can be seen in the video at the following link.
V Conclusion and Future Work
In this paper, we present an approach to both learn and sequence robot behaviors, applied to the problem of visual navigation of mobile robots. The layered representation of control policies that we employ allows the robot to adapt to changes in the environment and to select the low-level behavior most appropriate for the current situation, enabling significantly improved task performance. The meta-level policies that we learn are agnostic to the nature of the low-level behaviors used, enabling the use of both hand-crafted and learnt policies. In addition, these meta-level policies may also be trained with other input modalities. These features exemplify why maintaining a hierarchy of control policies is a potent tool for endowing robots with more autonomous capabilities.
For future work, we would like to explore how this framework may be modified to learn both levels of policies jointly. This would allow adapting the low-level behaviors online, enabling further robot capabilities.
-  M. N. Nicolescu and M. J. Matarić, “A hierarchical architecture for behavior-based robots,” in Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 1, ser. AAMAS ’02. New York, NY, USA: ACM, 2002, pp. 227–233. [Online]. Available: http://doi.acm.org/10.1145/544741.544798
-  M. Richter, Y. Sandamirskaya, and G. Schöner, “A robotic architecture for action selection and behavioral organization inspired by human cognition,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 2457–2464.
-  G. Neumann, C. Daniel, A. Paraschos, A. Kupcsik, and J. Peters, “Learning modular policies for robotics,” Frontiers in Computational Neuroscience, vol. 8, p. 62, 2014. [Online]. Available: http://journal.frontiersin.org/article/10.3389/fncom.2014.00062
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
-  J. Koutník, J. Schmidhuber, and F. Gomez, “Evolving deep unsupervised convolutional networks for vision-based reinforcement learning,” in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM, 2014, pp. 541–548.
-  G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “CST: Constructing skill trees by demonstration,” in Proceedings of the ICML Workshop on New Developments in Imitation Learning, July 2011. [Online]. Available: http://lis.csail.mit.edu/pubs/konidaris-icmlws11.pdf
-  R. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, pp. 181–211, 1999.
-  T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.
-  J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” CoRR, vol. abs/1611.01796, 2016. [Online]. Available: http://arxiv.org/abs/1611.01796
-  T. Lozano-Pérez and L. P. Kaelbling, “Learning composable models of parameterized skills,” in IEEE Conference on Robotics and Automation (ICRA), 2017. [Online]. Available: http://lis.csail.mit.edu/pubs/lpk/ICRA17.pdf
-  L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular vision based obstacle avoidance through deep reinforcement learning,” arXiv preprint arXiv:1706.09829, 2017.
-  K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman, “Meta Learning Shared Hierarchies,” ArXiv e-prints, Oct. 2017.
-  Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” CoRR, vol. abs/1511.06581, 2015. [Online]. Available: http://arxiv.org/abs/1511.06581
-  G. Sartoretti, S. Shaw, K. Lam, N. Fan, M. Travers, and H. Choset, “Central pattern generator with inertial feedback for stable locomotion and climbing in unstructured terrain.”
-  I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, “Extending the openai gym for robotics: a toolkit for reinforcement learning using ros and gazebo,” arXiv preprint arXiv:1608.05742, 2016.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
-  M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng, “Ros: an open-source robot operating system,” in Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan, May 2009.
-  N. Koenig and A. Howard, “Gazebo-3d multiple robot simulator with dynamics,” 2006.
-  L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Mach. Learn., vol. 8, no. 3-4, pp. 293–321, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992699
-  H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461