
Robot Perception enables Complex Navigation Behavior via Self-Supervised Learning

by   Marvin Chancán, et al.

Learning visuomotor control policies in robotic systems is a fundamental problem when aiming for long-term behavioral autonomy. Recent supervised-learning-based vision and motion perception systems, however, are often built separately with limited capabilities, and are restricted to a few behavioral skills such as passive visual odometry (VO) or mobile robot visual localization. Here we propose an approach to unify those successful robot perception systems for active target-driven navigation tasks via reinforcement learning (RL). Our method temporally incorporates compact motion and visual perception data (directly obtained using self-supervision from a single image sequence) to enable complex goal-oriented navigation skills. We demonstrate our approach on two real-world driving datasets, KITTI and Oxford RobotCar, using the new interactive CityLearn framework. The results show that our method can accurately generalize to extreme environmental changes such as day to night cycles with up to an 80% success rate, compared to 30% for the vision-only navigation systems.





I Introduction

Recent advances in self-supervised learning have shown promising results in a range of visuomotor tasks including robotic manipulation [18, 20, 5, 11, 6, 16] using deep reinforcement learning (RL), both in simulation and on real hardware. For mobile robots, these self-supervised learning techniques are now being explored and have already been shown to achieve results comparable to classical robot perception pipelines for passive visual odometry (VO) [21, 19], visual localization or place recognition (VPR) [7], and also active outdoor navigation tasks [9] in real environments. Nevertheless, end-to-end learning of visuomotor policies for long-term, all-weather autonomous navigation tasks using self-supervision remains unexplored.

Large-scale outdoor navigation is a key component for enabling the deployment of mobile robots and autonomous vehicles in the real world. Recent RL-based navigation systems for real environments rely on GPS-based ground-truth data for labeling raw sensory images. They then reduce the problem of navigation to vision-only methods [15] or extend it with language-based sensory inputs. These approaches: 1) are generally hard to train, due to their weakly related input sensor modalities; 2) rely on the precision of GPS data, which may not be reliable across month-spaced traversals of the same route; and 3) require a large amount of experience with the environment in terms of RL training episodes, which might be impractical for real robots. Moreover, their ability to generalize to environmental changes such as lighting or weather transitions has not been explored.

Fig. 1: Overview of our unified robot learning framework for navigation tasks. Given a single traversal of a car ride, we use self-supervised learning to obtain optimized VO data and visual representations. We then temporally combine these compact visuomotor signals to learn control policies for goal-driven navigation skills via RL. Our method can accurately generalize to extreme environmental changes such as day to night transitions.

In this paper, instead of relying on supervised learning methods for capturing motion and visual representations, we investigate how to leverage recent self-supervised learning approaches for enabling efficient and robust long-term robot navigation skills. Our key contributions are:

  • An approach to temporally integrate motion states (classical VO self-optimized with optical flow and depth prediction) with visual observations (self-enhanced with image-to-region similarities) via RL for large-scale, all-weather navigation tasks (see Fig. 1), and

  • Experimental trade-off between the RL navigation success rate and the motion estimation precision, providing key insights to decide which ego-motion sensor would be appropriate for a particular application.

We demonstrate the effectiveness and advantages of our method on two large, real driving datasets for goal-oriented navigation tasks, compared to motion-only and vision-only navigation systems. Furthermore, we report experimental results where our approach is capable of generalizing to extreme environmental transitions such as day to night cycles with high navigation success rate, where vision-only navigation systems typically fail.

II Problem Formulation

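We formulate the goal-driven navigation task as a Markov decision process $\mathcal{M}$: at any given discrete state $s_t \in \mathcal{S}$ at time $t$, the robot executes a discrete action $a_t \in \mathcal{A}$ following the policy $\pi_\theta(a_t \mid s_t)$, then transitions to a new state $s_{t+1}$, receiving a corresponding reward $r_t$. We train our policy to find an optimal $\theta^{*}$ that maximizes the objective function given by $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, with a transition operator $\mathcal{T}$ and a $\gamma$-discounted reward function over a finite horizon $T$.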
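As a small illustration of the discounted objective described above, the following sketch computes the discounted return of a single finite-horizon episode. The function name `discounted_return`, the discount value, and the example reward sequence are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * r_t for one finite-horizon episode."""
    total = 0.0
    # Accumulate backwards so each step is one multiply-add:
    # G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sparse-reward episode: the goal is reached on the fourth step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

With a sparse reward, the return depends only on how many steps the agent needs to reach the goal, so a discount below one favors shorter routes.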

In this work, following the main ideas proposed in [4] and [2], we investigate how to temporally incorporate compact motion states with equally compact visual observations, both obtained via self-supervised learning from a single monocular image sequence, using the state-of-the-art RL algorithm PPO [17] (see Fig. 1).

III Approach

Our objective is to train an RL agent to perform goal-driven navigation tasks across a range of real-world environmental conditions, especially where noisy or poor GPS data typically limit the capabilities of supervised learning approaches. We therefore developed a combined motion-and-vision-based perception method that can be trained using self-supervision. Our approach operates by temporally associating local motion states, obtained from VO-based techniques, with visual observations to efficiently train our navigation policy network (see Fig. 1). This enables our policy to learn from both motion and visual information in a self-supervised manner while training within an RL framework, making it robust to visual changes in the environment and to poor GPS data.

III-A Self-Supervised Single-Frame Visual Localization

In image-based localization, weak GPS or geo-tagged labels can be problematic when training visual place recognition (VPR) systems using supervised learning. To overcome these challenges, successful VPR systems such as NetVLAD [1] have achieved state-of-the-art results via weakly supervised learning, with a range of recent developments [10, 13]. More recently, however, a self-supervised fine-grained region similarities (SFRS) system, specifically designed to deal with noisy pairwise image labels, has outperformed these VPR pipelines [7]. In this work, we attempt to merge the desirable properties of SFRS into our RL-based navigation system to leverage image-to-region similarities when GPS labels are poor or unavailable for large-scale image perception.

III-B Self-Supervised Monocular Visual Odometry

In robot navigation research, visual odometry (VO) and SLAM techniques are also typically used for visual localization, providing key complementary information about the environment alongside GPS, IMU, or LiDAR sensors. While SLAM extends VO with loop closing and global map optimization to build a geometrically consistent map of the environment, VO continues to be a fundamental component for providing ego-motion estimates for mobile robots. With the rapid progress of deep learning techniques in computer vision, roboticists have been drawn to incorporating these learning capabilities into VO over the past 4 years [22, 12]. Only recently, however, has the use of more advanced self-supervised learning techniques made it possible to outperform those purely geometry-based or deep-learning-based VO systems [21, 19]. Here we incorporate a self-supervised deep pose corrections method (SS-DPC-Net) [19], which combines depth estimation, optical flow, and classical VO in a hybrid manner for robust VO, into our RL-based system, providing compact and optimized ego-motion estimates.

Fig. 2: Deployment results on the KITTI dataset. The agent navigates from left to right towards the goal destination.
Fig. 3: Deployment results on the Oxford RobotCar dataset. The agent navigates from left to right towards the goal destination on the traversal it was trained on (top), and generalizes well at night (bottom).

III-C Reinforcement learning-based navigation

Goal-driven navigation: We merge both motion states and visual observations, obtained via self-supervision from raw image sequences, for learning to navigate through actions towards a required goal destination via RL [17].
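To make the goal-driven setup more concrete, here is a minimal toy environment in which a route is a sequence of frame indices and the agent steps along it towards a goal index. It is only a sketch under stated assumptions: the class `RouteEnv`, its three-action set, the reward value of 1.0, and the step limit are hypothetical and do not reproduce the CityLearn API.

```python
from dataclasses import dataclass


@dataclass
class RouteEnv:
    """Toy 1-D route: the agent moves between frame indices until it reaches the goal."""
    num_frames: int
    goal: int
    max_steps: int = 50
    position: int = 0
    steps: int = 0

    # Hypothetical discrete action set: 0 = move back, 1 = stay, 2 = move forward.
    MOVES = (-1, 0, 1)

    def reset(self, start: int = 0) -> int:
        self.position = start
        self.steps = 0
        return self.position

    def step(self, action: int):
        move = self.MOVES[action]
        # Clamp to the route so the agent cannot leave the traversal.
        self.position = min(max(self.position + move, 0), self.num_frames - 1)
        self.steps += 1
        reached = self.position == self.goal
        reward = 1.0 if reached else 0.0  # assumed sparse reward value
        done = reached or self.steps >= self.max_steps
        return self.position, reward, done


env = RouteEnv(num_frames=10, goal=3)
state = env.reset(start=0)
done = False
while not done:
    state, reward, done = env.step(2)  # always move forward
print(state, reward)
```

In a real deployment, each frame index would be replaced by the motion state and visual observation recorded at that point of the traversal.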

Architecture: Our policy network is inspired by [15] and includes a single linear layer that encodes the motion states and visual observations. Then, using a single recurrent long short-term memory (LSTM) layer, current states and observations are combined with the agent's previous actions. The updated agent's actions are then used to estimate both the new actions and the value function.
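The data flow around the recurrent layer can be sketched in plain Python as follows. This is an assumption-laden illustration, not the authors' network: the LSTM itself is omitted, the helper names (`build_recurrent_input`, `actor_critic_heads`) and all sizes and weights are hypothetical, and the hidden vector is a stand-in for whatever the recurrent layer would output.

```python
import math
from typing import List, Sequence, Tuple


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_recurrent_input(encoded: Sequence[float], prev_action: int,
                          num_actions: int) -> List[float]:
    """Concatenate encoded motion+vision features with the one-hot previous action."""
    return list(encoded) + one_hot(prev_action, num_actions)


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_heads(hidden: Sequence[float],
                       policy_weights: Sequence[Sequence[float]],
                       value_weights: Sequence[float]) -> Tuple[List[float], float]:
    """Map a hidden vector to action probabilities and a scalar value estimate."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in policy_weights]
    value = sum(w * h for w, h in zip(value_weights, hidden))
    return softmax(logits), value


# Hypothetical 2-D encoded feature and 3 discrete actions.
lstm_input = build_recurrent_input([0.2, -0.1], prev_action=1, num_actions=3)
hidden = lstm_input  # placeholder: a real model would pass this through the LSTM
probs, value = actor_critic_heads(hidden, [[0.0] * 5] * 3, [1.0] * 5)
print(lstm_input, probs, value)
```

Feeding the previous action back in lets the recurrent layer relate what the agent did to how the motion and visual inputs changed.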

Reward design and curriculum learning: We use multiple levels of curriculum learning to gradually encourage our agent to explore the environment, and a sparse reward function that gives the agent a reward only when it finds the target.
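One way such a scheme could look is sketched below: a sparse reward paired with a curriculum that widens the allowed start-to-goal distance once the agent succeeds often enough. The level distances, the promotion threshold, the reward value, and the names `Curriculum` and `sparse_reward` are all assumptions made for illustration; the paper does not give these details here.

```python
from dataclasses import dataclass


def sparse_reward(position: int, goal: int) -> float:
    """Reward only on reaching the target; the value 1.0 is an assumption."""
    return 1.0 if position == goal else 0.0


@dataclass
class Curriculum:
    """Raise the maximum start-to-goal distance as the agent improves."""
    max_distances: tuple = (2, 5, 10, 20)  # hypothetical levels, in route frames
    promote_at: float = 0.8                # hypothetical success-rate threshold
    level: int = 0

    @property
    def max_distance(self) -> int:
        return self.max_distances[self.level]

    def update(self, recent_success_rate: float) -> int:
        # Advance only if the threshold is met and a harder level exists.
        has_next = self.level < len(self.max_distances) - 1
        if recent_success_rate >= self.promote_at and has_next:
            self.level += 1
        return self.max_distance


curriculum = Curriculum()
for rate in (0.5, 0.9, 0.95):
    print(rate, curriculum.update(rate))
```

Starting with nearby goals gives the agent frequent rewards early on, which matters because a sparse reward provides no signal until the target is actually reached.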

Fig. 4: RL training curves on the KITTI dataset. Our approach incorporates the desirable properties of motion-only and vision-only methods for navigation tasks. Using images alone seems to increase complexity and reduce performance, but combining them with motion data compensates for these shortcomings.
Fig. 5: RL training curves on the Oxford RobotCar dataset. In contrast to the results in Fig. 4, we found that using motion+visual data can actually boost the RL training. In this case, our full model required 15k training episodes, compared to 60k and 44k for the motion- and vision-only baselines.
Fig. 6: RL deployment statistics on the KITTI dataset. We trained and tested on sequence 05 of the raw data.

IV Experiments

We evaluate our model on two real driving datasets, Oxford RobotCar [14] and KITTI [8], using the CityLearn environment [3] (see Figs. 2 and 3). We additionally conduct experiments to characterize the trade-off between the RL success rate and the motion estimation precision.

Figs. 4 and 5 provide the corresponding quantitative results averaged over 6 runs with different random seeds. We compare our full model (green) with two baselines which correspond to pure motion-only RL (blue), and vision-only RL (orange). These two baselines use the same setup as the full model, except that they only use either motion estimate data or visual observations, respectively, as shown in Fig. 1.

We also report deployment statistics in Figs. 6 and 7. For the KITTI dataset, our full model can solve navigation tasks with an 80% success rate, compared to 65% for the vision-only system. In contrast, the agent using only motion states appears competitive with our full model; however, its main limitation is that it does not incorporate visual information for distinguishing between environmental changes. For the Oxford RobotCar dataset, where we test generalization from day to night, our full model consistently obtains a success rate of around 80%, compared to 30% for the vision-only system.
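A success rate of this kind is simply the fraction of deployment episodes in which the agent reaches its goal. The sketch below shows the computation on made-up outcomes; the function name `success_rate` and the example data are hypothetical and are not results from the paper.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of deployment episodes that ended at the goal."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one deployment episode")
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for two conditions, for illustration only.
episodes = {
    "day": [True, True, True, False],
    "night": [True, False, True, False],
}
for condition, outcomes in episodes.items():
    print(condition, success_rate(outcomes))
```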

Fig. 7: RL deployment statistics on the Oxford RobotCar dataset on the traversal it was trained on (top) and generalizing at night (bottom).

To further analyze the influence of motion estimation precision, in all our experiments we compare the ego-motion data obtained using classical VO and SS-DPC-Net against the ground-truth data provided by each dataset; see Fig. 8 (left) for the KITTI dataset. Interestingly, the difference between these ego-motion results does not seem to affect our three models on the KITTI dataset, as all of them are deployed on the same traversal used for training. Conversely, on the Oxford RobotCar dataset, where we also deploy under drastic visual changes (day to night), we note that our full model retains good navigation performance compared to vision-only systems. We also provide insights into the influence of VO precision on our full model in Fig. 8 (right).
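One simple way to quantify the gap between estimated and ground-truth ego-motion is the root-mean-square position error over a trajectory. The text does not state which precision metric was used, so the RMSE below is an assumed stand-in, and the function name `trajectory_rmse` and the 2-D points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def trajectory_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """Root-mean-square Euclidean error between two aligned 2-D trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        (ex - gx) ** 2 + (ey - gy) ** 2
        for (ex, ey), (gx, gy) in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up trajectories: the estimate drifts sideways over time.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
vo_estimate = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]
print(trajectory_rmse(vo_estimate, ground_truth))
```

Plotting such an error against the navigation success rate for each ego-motion source would give the kind of trade-off curve the text describes.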

Fig. 8: Influence of motion estimation precision. Ground truth and VO-based data of the KITTI dataset (left). Trade-off between the RL navigation success rate and the ego-motion estimation precision (right).

V Conclusion

We have shown that combining self-supervised learning for visuomotor perception and RL for decision-making considerably improves the ability to deploy robotic systems capable of solving complex navigation tasks from raw image sequences only. We proposed a method, including a new neural network architecture, that temporally integrates two fundamental sensor modalities, motion and vision, for large-scale target-driven navigation tasks using real data via RL. Our approach was demonstrated to be robust to drastic changes in visual conditions, where typical vision-only navigation pipelines fail. This suggests that odometry-based data can be used to improve the overall performance and robustness of conventional vision-based systems for learning complex navigation tasks. In future work, we seek to extend this approach by using unsupervised learning for both decision-making and perception.


  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In CVPR. Cited by: §III-A.
  • [2] M. Chancán, L. Hernandez-Nunez, A. Narendra, A. B. Barron, and M. Milford (2020-04) A hybrid compact neural architecture for visual place recognition. IEEE Robotics and Automation Letters 5 (2), pp. 993–1000. Cited by: §II.
  • [3] M. Chancán and M. Milford (2019) From visual place recognition to navigation: learning sample-efficient control policies across diverse real world environments. arXiv preprint arXiv:1910.04335. Cited by: §IV.
  • [4] M. Chancán and M. Milford (2020) MVP: Unified Motion and Visual Self-Supervised Learning for Large-Scale Robotic Navigation. arXiv preprint arXiv:2003.00667. Cited by: §II.
  • [5] X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox (2019) Self-supervised 6D Object Pose Estimation for Robot Manipulation. arXiv preprint arXiv:1909.10159. Cited by: §I.
  • [6] F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn (2018) Robustness via retrying: closed-loop robotic manipulation with self-supervised learning. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, pp. 983–993. Cited by: §I.
  • [7] Y. Ge, H. Wang, F. Zhu, R. Zhao, and H. Li (2020) Self-supervising fine-grained region similarities for large-scale image localization. arXiv preprint arXiv:2006.03926. Cited by: §I, §III-A.
  • [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research. Cited by: §IV.
  • [9] G. Kahn, P. Abbeel, and S. Levine (2020) BADGR: An autonomous self-supervised learning-based navigation system. arXiv preprint arXiv:2002.05700. Cited by: §I.
  • [10] H. J. Kim, E. Dunn, and J. Frahm (2017) Learned contextual feature reweighting for image geo-localization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3251–3260. Cited by: §III-A.
  • [11] M. A. Lee, Y. Zhu, P. Zachares, M. Tan, K. Srinivasan, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2020) Making sense of vision and touch: learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics. Cited by: §I.
  • [12] R. Li, S. Wang, Z. Long, and D. Gu (2018) UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning. In IEEE International Conference on Robotics and Automation (ICRA), pp. 7286–7291. Cited by: §III-B.
  • [13] L. Liu, H. Li, and Y. Dai (2019-10) Stochastic attraction-repulsion embedding for large scale image localization. In ICCV. Cited by: §III-A.
  • [14] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research 36 (1), pp. 3–15. Cited by: §IV.
  • [15] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell (2018) Learning to navigate in cities without a map. In Advances in Neural Information Processing Systems 31, pp. 2419–2430. Cited by: §I, §III-C.
  • [16] S. Nair and C. Finn (2020) Hierarchical foresight: self-supervised learning of long-horizon tasks via visual subgoal generation. In ICLR. Cited by: §I.
  • [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §II, §III-C.
  • [18] P. Sermanet, C. Lynch, J. Hsu, and S. Levine (2017) Time-contrastive networks: self-supervised learning from multi-view observation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 486–487. Cited by: §I.
  • [19] B. Wagstaff, V. Peretroukhin, and J. Kelly (2020) Self-supervised deep pose corrections for robust visual odometry. arXiv preprint arXiv:2002.12339. Cited by: §I, §III-B.
  • [20] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245. Cited by: §I.
  • [21] H. Zhan, C. S. Weerasekera, J. Bian, and I. Reid (2019) Visual odometry revisited: what should be learnt? arXiv preprint arXiv:1909.09803. Cited by: §I, §III-B.
  • [22] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR. Cited by: §III-B.