Navigation is a key component for enabling the deployment of mobile robots and autonomous vehicles in real-world environments. Current large-scale, real-world navigation systems rely on GPS data alone as ground truth for sensory image labeling [IEEEexample:mirowski2018learning, IEEEexample:ma2019towards, IEEEexample:streetinst, IEEEexample:hermann2019learning, IEEEexample:Chen_2019_CVPR, IEEEexample:talkwalk]. They then reduce the problem of navigation to vision-only methods [IEEEexample:mirowski2018learning], GPS-level localization combined with publicly available maps [IEEEexample:ma2019towards], or extend it with language-based tasks [IEEEexample:streetinst, IEEEexample:hermann2019learning, IEEEexample:Chen_2019_CVPR, IEEEexample:talkwalk]. These end-to-end learning approaches are hard to train due to their large network models and weakly related input sensor modalities. Moreover, their generalization capabilities to environments with different visual conditions are not well explored. In contrast, we have recently shown an alternative, non-end-to-end vision-based approach that uses preprocessed compact image representations to achieve practical training and deployment on real data with challenging environmental transitions [IEEEexample:chancan2020citylearn].
In this paper, we build on the main ideas of our previous work [IEEEexample:chancan2020citylearn]—that combines reinforcement learning (RL) and visual place recognition (VPR) techniques for navigation tasks—to present a new, more efficient and robust approach. The main contributions of this paper are detailed as follows:
Leveraging successful robotic motion estimation methods, including visual odometry (VO) [IEEEexample:mohamed2019surveyodometry] or radar odometry [IEEEexample:barnes2019oxford], to capture compact motion information through an environment that can then be used to perform goal-driven navigation tasks (see Fig. 1). This makes our system more efficient and robust to extreme environmental changes, even with limited or no GPS data availability (Fig. 2-left).
Using RL to temporally incorporate compact motion representations with equally compact image observations, obtained via deep-learning-based VPR models [IEEEexample:netvlad], for large-scale, all-weather navigation tasks.
Experimental results on the trade-off between RL navigation success rate and VO motion estimation precision (Fig. 2-right), showing how our proposed navigation system can improve its overall performance when built on precise motion estimation techniques such as VO.
We evaluate our motion and visual perception (MVP) method using our interactive CityLearn framework [IEEEexample:chancan2020citylearn], and present extensive experiments on two large real-world driving datasets, the Oxford RobotCar [IEEEexample:maddern2017oxford] and the Nordland Railway [IEEEexample:nordland] datasets. The results of our MVP-based approach are consistently high across multiple, month-spaced traversals with extreme environmental changes, such as winter, spring, fall, and summer for Nordland, and overcast, night, snow, sun, rain, and clouds for Oxford (see blue bar in Fig. 2). For Nordland, we show how our approach outperforms corresponding vision-only navigation methods under extreme environmental changes, especially when GPS data is fully available and consistent across multiple traversals of the same route. For Oxford, we show the robustness of our approach across a range of real GPS data reception situations, including poor and no data reception at all, where vision-only navigation systems typically fail.
II Related Work
We present a brief overview of some successful work in motion estimation research, related visual place recognition methods for sequence-based driving datasets, and recent RL-based navigation systems for large real-world environments.
II-A Motion Estimation in Robotics
Odometry-based sensors (i.e. wheel, inertial, laser, radar, and visual) for self-localization have long attracted attention in robotics research as an alternative approach to estimate motion information, especially in situations where GPS data is not reliable such as multi-path reception and changes in environmental conditions [IEEEexample:maddern2017oxford, IEEEexample:mohamed2019surveyodometry, IEEEexample:cadena2016slam]. Traditional VO methods [IEEEexample:vo2004cvpr], SLAM-based systems [IEEEexample:dissanayake2001radar], including MonoSLAM [IEEEexample:davison2007monoslam] and ORB-SLAM [IEEEexample:orbslam], and also bio-inspired models such as RatSLAM [IEEEexample:ratslam2004, IEEEexample:ratslam2008] have captured the main challenges of the localization problem in large outdoor environments and indoor spaces—also with a range of alternative systems [IEEEexample:wifi2010, IEEEexample:renato2016, IEEEexample:renato2017, IEEEexample:renato2019ral]. These methods have shown good invariance to changes in viewpoint and illumination by associating hand-crafted visual place features to locally optimized maps. Odometry, however, is known to be the first phase of solving a SLAM problem, which also includes loop closing and global map optimization. 
Consequently, multi-sensor fusion techniques combining vision [IEEEexample:orbslam2, IEEEexample:muratal2017vimslam, IEEEexample:vo, IEEEexample:doh2006relative, IEEEexample:pascoe2017nid, IEEEexample:kim2018lowdriftvo, IEEEexample:zhan2019visual], inertial sensors [IEEEexample:svo, IEEEexample:hong2017vio, IEEEexample:babu2018vio], LiDAR [IEEEexample:zhang2015visionlidar, IEEEexample:borges2018posemap, IEEEexample:wang2019vilaser], and radar [IEEEexample:cen2018radar] have been proposed to further improve both localization accuracy and real-time performance, with a number of recent deep-learning-based methods that can match the precision of those traditional methods [IEEEexample:wang2017learnVO, IEEEexample:pillai2017learnvo, IEEEexample:zhou2017learnvo, IEEEexample:zhan2018unsupervised, IEEEexample:Casser_2019_CVPR_Workshops, IEEEexample:wang2019learnvo, IEEEexample:loo2019cnn-vo, IEEEexample:shen2019learnvo, IEEEexample:Shen_2019_ICCV, IEEEexample:barnes2019oxford, IEEEexample:suaftescu2020kidnapped]. In this work, we demonstrate our approach using both learning-based and conventional odometry methods for motion estimation when GPS data is not available, as occurs in several traversals of the Oxford RobotCar dataset [IEEEexample:maddern2017oxford].
II-B Visual Place Recognition
VPR approaches can be broadly split into two categories: image-retrieval-based methods that compare a single frame (query) to an image database (reference) [IEEEexample:Torii_2015_CVPR, IEEEexample:weyand2016planet, IEEEexample:netvlad, IEEEexample:2017geo, IEEEexample:gordo2017end, IEEEexample:fine2019imret], and multi-frame-based VPR techniques built on top of those single-frame methods to work on image sequences [IEEEexample:seqslam, IEEEexample:pepperell2014all, IEEEexample:cnnlanmark, IEEEexample:sunderhauf2015performance, IEEEexample:mpf], as typically found in driving datasets [IEEEexample:nordland, IEEEexample:geiger2012kitti, IEEEexample:maddern2017oxford, IEEEexample:guo2018safe, IEEEexample:naseer, IEEEexample:fabrat, IEEEexample:kitti, IEEEexample:caesar2019nuscenes]. Both approaches typically first require computing image descriptors using diverse hand-crafted [IEEEexample:vprsurvey] or deep-learning-based models [IEEEexample:chen2014convolutional, IEEEexample:zetao2017, IEEEexample:lost]. However, using large image representations can be computationally expensive and can limit the deployment of these methods on real robots. Alternatively, we have recently demonstrated how compact image representations can be used to achieve state-of-the-art results in visual localization [IEEEexample:chancan2020hybrid] by modeling temporal relationships between consecutive frames to improve the performance of compact single-frame-based methods. In this paper, we build on these main ideas to propose our MVP-based approach, which uses compact but rich image representations, such as those from NetVLAD [IEEEexample:netvlad], and can also temporally exploit movement data through an environment via odometry-based techniques.
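The single-frame, image-retrieval formulation above can be sketched as nearest-neighbor matching over compact descriptors. The cosine-similarity scoring below is a common choice for L2-normalized descriptors such as NetVLAD's, not necessarily the exact scoring used by every method cited here:

```python
import numpy as np

def match_place(query_desc, reference_descs):
    """Single-frame VPR as nearest-neighbor retrieval: return the
    index of the best-matching reference image and its cosine
    similarity to the query. Descriptors are assumed to be compact
    feature vectors (e.g. 64-d NetVLAD outputs)."""
    q = query_desc / np.linalg.norm(query_desc)
    R = reference_descs / np.linalg.norm(reference_descs, axis=1, keepdims=True)
    sims = R @ q  # cosine similarity against every reference frame
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

A multi-frame method would aggregate these per-frame scores over a sequence instead of taking a single argmax.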
II-C Learning-based Navigation
Significant progress has recently been made in goal-driven navigation tasks using learning methods [IEEEexample:mirowski2018learning, IEEEexample:ma2019towards, IEEEexample:streetinst, IEEEexample:hermann2019learning, IEEEexample:Chen_2019_CVPR, IEEEexample:talkwalk, IEEEexample:kahn2018composable, IEEEexample:mirowski2016complex, IEEEexample:chaplot2018active, IEEEexample:zhang2017neural, IEEEexample:khan2018drlnav, IEEEexample:oh2019learnaction, IEEEexample:kahn2020badgr, IEEEexample:gupta2017cognitive, IEEEexample:gupta2017unifying, IEEEexample:chen2019learning, IEEEexample:chaplot2020Learning, IEEEexample:banino2018vector], inspired by advances in deep RL [IEEEexample:guo2014atari, IEEEexample:mnih2015human]. Most of these algorithms can successfully train navigation agents end-to-end based on raw images. These approaches, however, are typically only evaluated using either synthetic data [IEEEexample:kahn2018composable, IEEEexample:mirowski2016complex, IEEEexample:chaplot2018active, IEEEexample:zhang2017neural, IEEEexample:banino2018vector], indoor spaces [IEEEexample:khan2018drlnav, IEEEexample:oh2019learnaction], or relatively small outdoor environments [IEEEexample:kahn2020badgr], which generally do not require GPS data or map information. Alternatively, combining map-like or SLAM-based input modalities, including motion sensor data, with images for goal-driven navigation tasks has been proposed [IEEEexample:gupta2017cognitive, IEEEexample:gupta2017unifying, IEEEexample:chen2019learning, IEEEexample:chaplot2020Learning, IEEEexample:mirowski2018learning], but again these methods are trained only in small indoor environments.
For large-scale outdoor navigation, however, different approaches that rely on GPS data as ground truth have been proposed [IEEEexample:mirowski2018learning], with a range of developments using language-based tasks [IEEEexample:streetinst, IEEEexample:hermann2019learning, IEEEexample:Chen_2019_CVPR, IEEEexample:talkwalk] or publicly available maps [IEEEexample:ma2019towards]. However, relying on GPS data alone for benchmarking purposes may not be reliable, especially when using large driving datasets recorded over many month-spaced traversals, as highlighted in previous work [IEEEexample:maddern2017oxford, IEEEexample:barnes2019oxford].
In this paper, we propose a different approach that overcomes the limitations of prior work for large-scale, all-weather navigation tasks. We unify two fundamental and highly related sensor modalities: motion and visual perception (MVP) information. Our MVP-based method builds on the main ideas presented in our previous works [IEEEexample:chancan2020citylearn, IEEEexample:chancan2020hybrid]—that use compact image representations to achieve sample-efficient RL-based visual navigation using real data [IEEEexample:chancan2020citylearn], and also demonstrate how to leverage motion information for VPR tasks [IEEEexample:chancan2020hybrid]. We propose a network architecture that can incorporate motion information with visual observations via RL to perform accurate navigation tasks under extreme environmental changes and with limited or no GPS data, where visual-based navigation approaches typically fail. We provide extensive experimental results in both visual place recognition and navigation tasks, using two large real-world datasets, that show how our method efficiently overcomes the limitations of those vision-only navigation pipelines.
III MVP-based Method Overview
Our objective is to train an RL agent to perform goal-driven navigation tasks across a range of real-world environmental conditions, especially under poor GPS data conditions. We therefore developed an MVP-based approach that can be trained using motion estimation and visual data gathered in large environments. Our MVP method operates by temporally associating local estimates of motion with compact visual representations to efficiently train our policy network. Using this data, our policy can learn to associate motion representations with visual observations in a self-supervised manner, enabling our system to be robust to both changing visual conditions and poor GPS data.
In the following sections, we describe our problem formulation via RL, the driving datasets we used in our experiments, details of our MVP representations, our evaluation metrics for VPR and navigation tasks, and related visual navigation methods against which we compare our approach.
III-A Problem Formulation
We formulate our goal-driven navigation tasks as a Markov Decision Process with discrete state space $\mathcal{S}$, discrete action space $\mathcal{A}$, and transition operator $\mathcal{T}$, as in a finite-horizon problem. Our goal is to find a policy $\pi_\theta$ that maximizes this objective function:

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right], \quad (1)$$

where $\pi_\theta$ is the stochastic policy we want to learn, and $r_t$ is the reward function with discount factor $\gamma$. We parametrize our navigation policy $\pi_\theta$ with a neural network that can learn to optimize it. We also define our state space by our compact bimodal MVP space representation ($m_t$, $v_t$), and our action space by discrete action movements in the agent action space ($a_t$).
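The quantity inside the expectation of the objective above is the discounted episode return; a minimal sketch (the $\gamma$ default is illustrative, not the paper's setting):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of episode rewards: the quantity whose
    expectation over policy rollouts defines the objective J
    that policy optimization maximizes."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))
```

With the sparse reward used later in the paper, most episodes contribute a single discounted terminal reward to this sum.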
III-B Real-World Driving Datasets
We evaluate our approach using our interactive CityLearn framework [IEEEexample:chancan2020citylearn] on two challenging large real-world datasets, the Oxford RobotCar dataset [IEEEexample:maddern2017oxford] and the Nordland Railway dataset [IEEEexample:nordland], that include diverse environmental changes and real GPS data reception situations.
Oxford RobotCar: This dataset [IEEEexample:maddern2017oxford] was collected using the Oxford RobotCar platform over a 10 km route in central Oxford, UK. The data, recorded with a range of sensors (e.g., LiDARs, monocular cameras, and trinocular stereo cameras), includes more than 100 traversals (image sequences) of the same route with a large range of transitions across weather, season, and dynamic urban environment changes over a period of 1 year. In Fig. 3 we show the six traversals selected for our experiments, referred to here as overcast, night, snow, sun, rain, and clouds (referred to as 2015-05-19-14-06-38, 2014-12-10-18-10-50, 2015-02-03-08-45-10, 2015-08-04-09-12-27, 2015-10-29-12-18-17, and 2015-05-15-08-00-49, respectively, in [IEEEexample:maddern2017oxford]). Fig. 5-right shows the raw GPS data of our selected traversals, where the sun and clouds traversals have poor GPS reception and no GPS data at all, respectively.
Nordland Railway: The Nordland dataset [IEEEexample:nordland] covers a 728 km train journey from Trondheim to Bodø in Nordland, Norway. This 10-hour train ride has been recorded four times, once per season: summer, spring, fall, and winter. Fig. 4 shows a sample image for each traversal we used in our experiments, and Fig. 5-left shows the related raw GPS data, which is more consistent than the Oxford RobotCar raw GPS data (see Fig. 5-right).
III-C Motion Estimation
To provide our agent with motion data, we separately use three different sensor modalities in our experiments: raw GPS data, visual odometry (VO), and optimized radar odometry (RO). The Oxford RobotCar dataset already provides both GPS and VO sensor data. For RO, we used the optimized ground-truth RO sensor data provided in the extended Oxford Radar RobotCar dataset [IEEEexample:barnes2019oxford]—which has been demonstrated to be more accurate under challenging visual transitions—carefully chosen to visually match our selected traversals. For the Nordland dataset, however, we used the provided raw GPS data only, as it is consistent across every traversal (Fig. 5-left).
The goal of using these two datasets is to evaluate the effectiveness of our MVP-based approach in situations where vision-only navigation methods typically fail. For Nordland, when GPS data is fully available and consistent, we show that our approach can generalize better than visual-based navigation systems under extreme visual transitions. Similarly, for Oxford RobotCar, our method outperforms these visual-based navigation pipelines again under extreme visual changes, and also when GPS data reception fails.
III-D Visual Representations
To enable sample-efficient RL training, as per previous work [IEEEexample:chancan2020citylearn], we encode all our full-resolution RGB sensory images using the off-the-shelf VPR model NetVLAD, based on a VGG-16 network architecture [IEEEexample:simonyan2014very] with PCA plus whitening on top of the model. This deep-learning-based model is known to provide significantly better image representations than other VPR approaches [IEEEexample:levelling], and also yields compact feature dimensions (e.g., from 4096-d all the way down to 64-d). Other deep-learning- or VPR-based models could, however, equally be used to encode our raw images. In this work, we used 64-d image representations, $v_t$, in all our MVP-based experiments. We then combine them with compact 2-d motion representations, $m_t$, to generate equally compact bimodal representations, ($m_t$, $v_t$), that feed our navigation policy network; see Fig. 6 (a). We encoded $m_t$ into compact 2-d feature vectors to preserve the compactness of our bimodal representation, but it can be encoded using larger representations as in [IEEEexample:mirowski2018learning].
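Assembling the bimodal input can be sketched as a simple concatenation of the two compact representations. The function name and the position-normalization scheme below are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np

def mvp_observation(netvlad_desc, position_xy, env_extent):
    """Build the compact bimodal MVP input: a 64-d visual
    descriptor (L2-normalized, as NetVLAD descriptors are)
    concatenated with a 2-d motion representation, here taken
    as a GPS/odometry position scaled by the traversal extent."""
    v_t = netvlad_desc / (np.linalg.norm(netvlad_desc) + 1e-12)
    m_t = np.asarray(position_xy, dtype=np.float64) / env_extent
    return np.concatenate([m_t, v_t])  # 66-d policy network input
```

Keeping the motion part at 2-d preserves the compactness of the overall observation, which is what makes the RL training sample-efficient.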
III-E MVP-based Policy Learning
Goal-driven navigation: Our method is trained on both motion ($m_t$) and visual ($v_t$) representations to successfully navigate through actions ($a_t$) towards a required goal destination, which is also encoded using a 2-d feature representation, over a single traversal in our CityLearn environment; see Fig. 1 and Fig. 6 (a) for further details.
Network Architecture: We design our network model inspired by [IEEEexample:mirowski2018learning]; see Fig. 6 (a). A single linear layer encodes our MVP bimodal representation ($m_t$, $v_t$), which is then combined with the agent's previous actions using a single-layer long short-term memory (LSTM) recurrent network [IEEEexample:lstm]. The updated agent state is used to estimate both the required next actions ($a_t$) and the value function from our policy network ($\pi_\theta$). To optimize $\pi_\theta$, we use the proximal policy optimization (PPO) algorithm [IEEEexample:ppo], which evaluates our objective function in Eq. (1) for policy learning. We choose PPO as it properly balances the small sample complexity of our compact input modalities with fine-tuning requirements.
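The forward pass of such a policy can be sketched in NumPy as a stand-in for the actual deep learning framework. All layer sizes, the tanh encoder nonlinearity, and the parameter initialization below are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

def lstm_step(x, h, c, W):
    """One step of a standard LSTM cell (input, forget,
    candidate, and output gates stacked in W)."""
    z = W["x"] @ x + W["h"] @ h + W["b"]
    i, f, g, o = np.split(z, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

def init_params(d_mvp=66, d_enc=32, d_h=64, n_act=2, seed=0):
    """Random parameters for the sketch (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    d_x = d_enc + n_act
    return {
        "enc_W": rng.normal(0, 0.1, (d_enc, d_mvp)),
        "enc_b": np.zeros(d_enc),
        "lstm": {"x": rng.normal(0, 0.1, (4 * d_h, d_x)),
                 "h": rng.normal(0, 0.1, (4 * d_h, d_h)),
                 "b": np.zeros(4 * d_h)},
        "pi_W": rng.normal(0, 0.1, (n_act, d_h)),
        "val_W": rng.normal(0, 0.1, (1, d_h)),
    }

def policy_step(m_t, v_t, a_prev_onehot, h, c, params):
    """One policy step: encode the bimodal (m_t, v_t) input with a
    linear layer, append the previous action, run the LSTM core,
    then read out action logits and the value estimate for PPO."""
    s = np.concatenate([m_t, v_t])
    e = np.tanh(params["enc_W"] @ s + params["enc_b"])
    x = np.concatenate([e, a_prev_onehot])
    h, c = lstm_step(x, h, c, params["lstm"])
    logits = params["pi_W"] @ h            # policy head
    value = float((params["val_W"] @ h)[0])  # value head
    return logits, value, h, c
```

In practice this module would be written in an autodiff framework so PPO can backpropagate through both heads.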
Reward design and curriculum learning: We use multiple levels of curriculum learning [IEEEexample:curriculumlearning] to gradually encourage our agent to explore the environment, and a sparse reward function that gives the agent a positive reward only when it finds the target.
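The reward and curriculum described above can be sketched as follows. The +1 reward value and the radius-doubling curriculum schedule are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def sparse_reward(agent_idx, goal_idx):
    """Sparse reward: positive only when the agent reaches the
    target frame (the exact value, +1 here, is assumed)."""
    return 1.0 if agent_idx == goal_idx else 0.0

def curriculum_goal(rng, start_idx, level, n_frames, base_radius=10):
    """Sample targets progressively farther from the start frame
    as the curriculum level grows; the schedule is illustrative."""
    radius = min(base_radius * (level + 1), n_frames - 1)
    lo = max(start_idx - radius, 0)
    hi = min(start_idx + radius, n_frames - 1)
    return int(rng.integers(lo, hi + 1))
```

Starting with nearby goals gives the agent frequent rewards early on, which mitigates the exploration difficulty a sparse reward would otherwise cause.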
III-F Vision-based Navigation Agent
We compare our MVP-based agent against a visual navigation agent with the network architecture proposed in [IEEEexample:mirowski2018learning]; see Figs. 6 (a) and (b). This raw-image-based agent is adapted to also use a 2-d feature vector for the goal to enable a fair comparison. In contrast to our method, this agent relies on GPS data for image labeling during both training and deployment, and does not incorporate motion learning. Its network architecture includes a visual module of two convolutional layers, as per previous work [IEEEexample:impala, IEEEexample:mirowski2016complex], with resized RGB input images: the first CNN layer uses an 8×8 kernel with stride 4 and 16 feature maps, and the second a 4×4 kernel with stride 2 and 32 feature maps.
III-G Evaluation Metrics
VPR experiments: We report extensive VPR results, obtained using our compact image representations ($v_t$), to provide an indicator of the performance of the visual component underlying our overall RL-based MVP system. A linear classifier is trained on each reference traversal and then evaluated on the remaining query traversals. The classification scores obtained for each image are used to compute precision-recall curves, from which we calculate our area under the curve (AUC) results. AUC results over 10 experiments per traversal are presented in Fig. 7.
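The precision-recall AUC computation can be sketched as average precision over the ranked classifier scores; this is a standard metric implementation, not the authors' exact evaluation code:

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve, computed as average
    precision: rank images by classifier score, then average the
    precision values attained at each true-positive rank."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    labels = np.asarray(labels, dtype=np.float64)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    n_pos = max(labels.sum(), 1.0)
    return float((precision * labels).sum() / n_pos)
```

A perfect ranking (all positives scored above all negatives) gives 1.0; mixing a negative above a positive lowers the score.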
RL-based navigation tasks: We evaluate our trained agents on all corresponding dataset traversals, and provide statistics on the number of successful tasks in terms of success rate over 10 deployment iterations, each with 100 different targets. Average results of these evaluations are reported in Fig. 8. We additionally constrain the maximum number of agent steps per episode to be less than the number of images in the traversal, as in [IEEEexample:chancan2020citylearn].
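The deployment protocol above can be sketched on a toy 1-D traversal environment. The environment and policy interfaces here are illustrative stand-ins for CityLearn, used only to show how the success-rate statistic and the per-episode step cap are computed:

```python
import numpy as np

def run_episode(policy, start, goal, n_frames, max_steps):
    """Roll out one goal-driven episode along a traversal of
    n_frames images; success = reaching the goal frame within the
    step budget (capped at the traversal length)."""
    pos = start
    for _ in range(min(max_steps, n_frames)):
        pos = int(np.clip(pos + policy(pos, goal), 0, n_frames - 1))
        if pos == goal:
            return True
    return False

def success_rate(policy, n_frames=100, n_iters=10, n_targets=100, seed=0):
    """Fraction of successful episodes over n_iters deployment
    iterations with n_targets random start/goal pairs each."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_iters):
        for _ in range(n_targets):
            start, goal = rng.integers(0, n_frames, size=2)
            wins += run_episode(policy, start, goal, n_frames, n_frames)
    return wins / (n_iters * n_targets)

# A trivially optimal policy for the toy environment: step toward the goal.
greedy = lambda pos, goal: np.sign(goal - pos)
```

A learned policy would replace `greedy` with the network's sampled actions.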
IV Experiments: Results
We present two main experimental results: conventional single-frame visual place recognition evaluations (Fig. 7), and reinforcement learning deployment of the navigation agents, both evaluated on our two datasets (Fig. 8). We also provide details on the influence of incorporating motion data into our network architecture, and present selected illustrative results from our MVP method during deployment.
IV-A Visual Place Recognition Results
In Fig. 7 we provide a full overview of our visual place recognition experiments, as described in Sections III-D and III-G, in terms of AUC performance. For the Nordland dataset (Fig. 7-left), we trained a linear classifier on the summer traversal, and then evaluated it on spring, fall, and winter conditions. We observe that extreme environmental changes such as those from fall and winter significantly reduce the results, to around 0.50 and 0.25 AUC, respectively. It is worth noting, again, that each traversal of this dataset (and also of the Oxford RobotCar dataset) is a single sequence of images, meaning that we have used a single image from a particular place for training, and we do not use any data augmentation technique.
For the Oxford RobotCar dataset (Fig. 7-right), we trained a linear classifier on the overcast traversal, and evaluated it on the remaining traversals. AUC results on this dataset are relatively lower than those from the Nordland dataset, except for the rain traversal, which achieves around 0.80 AUC. This is mainly because these traversals include significant viewpoint changes, diverse environmental transitions, and also real GPS data. For the sun and clouds traversals, which have poor and no GPS data reception, respectively, the results of around 0.12 and 0.05 AUC highlight the importance of good GPS data for ground-truth labeling, especially for VPR. In contrast, the night and snow traversals, with good GPS data, still present relatively good results of around 0.30 and 0.55 AUC, respectively, regardless of their significantly different visual and lighting conditions compared to the overcast traversal.
IV-B Navigation Policy Deployment
Navigation success rate results of our RL policies are reported on the Nordland (Fig. 8-left) and Oxford RobotCar (Fig. 8-right) datasets. We also compare our method to a vision-only approach, as described in Sections III-E and III-F, using our CityLearn environment.
For the vision-based agent (referred to as vision only in Fig. 8), trained on raw images only, the success rate results are notably better than those from our VPR experiments (Fig. 7), with over 84% success on the Nordland dataset (Fig. 8-left) and more than 46% on the Oxford RobotCar dataset (Fig. 8-right), except for the clouds traversal, which has no GPS data. This suggests that the generalization capability of the whole RL-based system is robust to environmental variations. However, this method still does not generalize well under different weather conditions with significant viewpoint changes and occlusions, as in the Oxford RobotCar dataset, especially with poor GPS data (see vision only in Fig. 8-right for the sun and clouds traversals), suggesting that RL-based vision-only navigation methods that rely on precise GPS information are likely to fail when the motion estimation information is poor.
In contrast, our MVP-based agents overcome the limitations of the VPR module underlying the vision-only method by temporally incorporating those visual representations with precise motion information into our navigation policy network using either GPS (when fully available) or odometry-based techniques, referred to in Fig. 8 as MVP-GPS and MVP-RO, respectively. On the Nordland dataset (Fig. 8-left), the vision only agent achieves around 86% success rate under challenging winter conditions, compared to around 94% for the MVP-GPS agent (see green bar in both cases). Similarly, on the Oxford RobotCar dataset (Fig. 8-right), the MVP-RO agent achieves 93% success rate under clouds conditions, with no GPS data available, compared to 7% for the vision only agent (see red bar in both cases).
IV-C Influence of Motion Estimation Precision
To analyze the influence of including motion data as an input to our policy network, we provide additional results on the Oxford RobotCar dataset, shown in Fig. 2. Vision-based navigation methods actually generalize relatively well under extreme changes, achieving a 75% success rate on day-to-night transitions when GPS reception is good. However, these methods can fail when GPS data is not precise, even under similar visual conditions, such as day to clouds (day) changes, with a 7% success rate (see green bars for both cases in Fig. 2-left). Conversely, our MVP-based approach leverages relatively precise motion estimation data, including but not limited to radar or visual odometry, on top of these vision navigation methods to improve overall performance under both visual changes and when there is no GPS data available (see orange and blue bars in Fig. 2-left). In Fig. 2-right, we characterize the deployment performance of our MVP-based method using VO to estimate motion data. This graph shows how incorporating precise motion information can improve the overall navigation performance of our system, suggesting that current vision-only navigation methods can also benefit from MVP-like approaches. As also demonstrated in related work [IEEEexample:vo2004cvpr], odometry-based techniques can be used directly for navigation tasks. This method, however, may require additional baseline metrics to estimate global scale factors during deployment on real robots, particularly when using motion data expressed relative to the robot's initial pose.
IV-D Generalization Results
We present illustrative navigation deployment results in Figs. 9 and 10 for the vision-based agent and for our MVP-based approach, respectively. The agent is required to navigate from the same starting location to a distant target over all our selected traversals of the Oxford RobotCar dataset; see navigation states from left to right in Figs. 9 and 10 including two intermediate states. Our approach is capable of precisely navigating to the target for every condition change (see Fig. 10), while the vision-based agent fails under extreme condition variations and also where GPS data is poor or not available (Fig. 9).
V Conclusions
We have proposed a method, including a new network architecture, that temporally integrates two fundamental sensor modalities, motion and visual perception (MVP) information, for large-scale target-driven navigation tasks using real data via reinforcement learning (RL). Our MVP-based approach was demonstrated to be robust to both extreme visual condition changes and poor absolute positioning information such as that from GPS, where typical vision-only navigation pipelines fail. This suggests that the incorporation of motion information, including but not limited to GPS (when fully available) or visual/radar odometry, can improve the overall performance and robustness of conventional visual-based navigation systems that rely on raw images only for learning complex navigation tasks. Future work will likely combine different motion estimation modalities, such as linear/angular velocities, with visual representations. However, this could potentially increase the network complexity and training requirements [IEEEexample:banino2018vector, IEEEexample:cueva2018emergence], especially when using real data. Quantifying the relationship between required RL performance, visual place recognition generalization capabilities, and motion estimation quality can also provide new insights for selecting between different motion estimation sensor modalities for a specific robotic navigation system.