Long-term navigation across large appearance change is a challenge for robots that use cameras for sensing. Visual Teach and Repeat (VT&R) tackles this problem with the help of multi-experience localization [19]. While the user manually drives the robot to teach a path, VT&R builds a visual map. Afterwards, it localizes live images to the map, allowing the robot to repeat the taught path autonomously. As the environment changes, VT&R stores data from each path repetition (experience). These intermediate experiences are used to bridge the appearance gap as localizing to the initial map becomes more challenging. With the help of machine learning, we aim to remove the need for such intermediate experiences for localization. Previously, we trained a neural network that would predict relative pose for localization directly from two input images [9]. The network was able to directly predict pose across large appearance change without using intermediate experiences. However, it learned the full pose estimation pipeline, including the parts that are easily solved with classical methods. Moreover, it did not generalize well to new paths not seen in the training data. These results align with Sattler et al. [25], who found that accuracy can be an issue for methods that regress global pose directly from sensor data.
In this paper, we choose a different strategy. Instead of predicting poses directly from images, we use learning to tackle the perception front-end of long-term localization, while the remaining components of pose estimation are implemented with classical tools. We insert the learned features into the existing VT&R pipeline and perform autonomous path following outdoors without needing intermediate bridging experiences. In particular, we teach a path (and build a visual map) at noon and repeat it for a full day including driving after dark. In another experiment, we show the learned features can generalize by extending our path to two new areas not present in the feature training dataset.
For training, we use data collected across lighting and seasonal change in 2016 and 2017 by a robot using multi-experience VT&R
to bridge the appearance gap. We train a network to provide keypoints with associated descriptors and scores. Using a differentiable pose estimator allows us to backpropagate a loss based on poses from the training data.
Our method differs from others that use learned features for localization across environmental change, such as [21, 30, 31, 11, 26, 24], since they only test localization standalone, while we include our learned features in the full path-following VT&R system. Gladkova et al. [8] combine existing learned features for localization with visual odometry (VO). However, they only test on datasets, while we test in closed loop on a robot, showing the features are good enough to provide precise pose estimates across all lighting conditions.
II Related Work
There has been a wide range of work on deep learning for pose estimation. Chen et al. [4] provide a thorough survey of deep learning for mapping and localization. Some research has focused on learning pose for localization directly from images via absolute pose regression [12], relative pose regression [14, 1], or combining learning for localization and VO [28]. Sattler et al. [25] note that learning pose directly from image data can struggle with accuracy.
More structure can be imposed on the learning problem by using features to tackle front-end visual matching, while retaining a classical method for pose estimation. A wide range of papers have been published on learning sparse visual features with examples in [22, 29, 34, 17, 6, 7, 15, 23, 32]. Another option is to learn descriptors densely for the image [16, 27, 30, 31, 11, 33], which can also be used for sparse matching [26, 24].
Although several papers have tackled descriptor learning for challenging appearance change, including seasonal change [21, 30, 31, 11, 26, 24], they test localization standalone. In our work we include the learned features in the VT&R pipeline, where relative localization to a map is combined with VO for long-term path following. Moreover, we use mostly off-road data with fewer permanent structures and where appearance change can be more drastic than in urban areas.
Sarlin et al. [24] show good generalization to new domains with their learned features. For instance, they are able to use features trained on outdoor data for indoor localization. In [30, 31, 26], the authors generalize to unseen seasonal conditions on the same path, while Piasco et al. [21] train and test on different sections of a path. In our work, we complete one experiment where we extend a path to areas not included in the training data, showing the generalizability of our features to novel environments.
The work from Gladkova et al. [8] is the closest to ours in that they integrate existing learned features for localization into a VO pipeline. The localization poses are used as a prior for front-end tracking and integrated into back-end bundle adjustment. Finally, global localization poses are fused with VO estimates. While the authors test on urban datasets with seasonal change, we test offline on a seasonal dataset, but also test our approach in closed loop on a robot.
We base our learning pipeline and network architecture on the design presented by Barnes and Posner [3], which learns keypoints, descriptors, and scores for radar localization supervised by only a pose loss.
Our fully differentiable training pipeline takes a pair of source and target stereo images and estimates the relative pose between their associated frames. We build on the approach for radar localization in [3] with some modifications to use a vision sensor. In short, the neural network detects keypoints and computes their descriptors and scores. We match keypoints from the source and target before computing their 3D coordinates with a stereo camera model. Finally, the point correspondences are used in a differentiable pose estimator. For an overview, see Figure 2.
III-A Keypoint Detection and Description
We start by detecting sparse 2D keypoints at sub-pixel locations in the left stereo image and computing descriptor vectors and scores for all pixels in the image. The descriptor and score for a given keypoint are found using bilinear interpolation. The score determines how important a point is for pose estimation. We pass an image to an encoder-decoder neural network following the design from [3], illustrated in Figure 3. After the bottleneck, the network splits into two decoder branches, one to compute keypoint coordinates and one for the scores. We divide the image into equally sized windows and detect one keypoint per window by taking the weighted average of the pixel coordinates, where the weights are the softmax over the network output within that window. Applying a sigmoid to the output of the second decoder branch gives us the scores. Finally, the descriptors are found by resizing and concatenating the feature maps of each encoder block, leaving us with a length 496 descriptor vector for each pixel.
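The per-window keypoint detection described above can be sketched as follows. This is a minimal NumPy illustration rather than the training code: the function name `detect_keypoints`, the `window` parameter, and the assumption that the image dimensions are divisible by the window size are all hypothetical choices made for the example.

```python
import numpy as np


def detect_keypoints(logits: np.ndarray, window: int) -> np.ndarray:
    """Return one sub-pixel (u, v) keypoint per window x window cell.

    logits: (H, W) detector output; H and W are assumed divisible by `window`.
    """
    h, w = logits.shape
    rows, cols = h // window, w // window
    # Pixel coordinate grids: v indexes image rows, u indexes image columns.
    v, u = np.meshgrid(
        np.arange(h, dtype=float), np.arange(w, dtype=float), indexing="ij"
    )

    def to_cells(a: np.ndarray) -> np.ndarray:
        # (H, W) -> (rows, cols, window * window): one flat vector per cell.
        return (
            a.reshape(rows, window, cols, window)
            .transpose(0, 2, 1, 3)
            .reshape(rows, cols, -1)
        )

    z = to_cells(logits)
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponential
    weights = np.exp(z)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax within each cell

    # Keypoint = softmax-weighted average of the pixel coordinates in the cell.
    ku = (weights * to_cells(u)).sum(axis=-1)
    kv = (weights * to_cells(v)).sum(axis=-1)
    return np.stack([ku, kv], axis=-1).reshape(-1, 2)
```

With a flat detector output every keypoint falls on its window centre, while a strong peak pulls the keypoint towards the peak pixel. Because the result is an average, it is differentiable with respect to the network output.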
III-B Keypoint Matching
We have a set of keypoints for the source image and need to perform data association between these and points in the target image. Descriptors are compared using zero-normalized cross correlation (ZNCC), which means that the resulting value will be in the range $[-1, 1]$. For each keypoint in the source image, $\mathbf{q}_s^i$, we compute a matched point, $\hat{\mathbf{q}}_t^i$, in the target image. This point is the weighted sum of all image coordinates in the target image, where the weight is based on how well descriptors match:
$$\hat{\mathbf{q}}_t^i = \sum_{j=1}^{N} \sigma_\tau\!\left(f_{\mathrm{zncc}}\!\left(\mathbf{d}_s^i, \mathbf{d}_t^j\right)\right) \mathbf{q}_t^j, \tag{1}$$
where $N$ is the total number of pixels in the target image, $\mathbf{d}_s^i$ is the descriptor of the source keypoint, $\mathbf{d}_t^j$ is the descriptor of the target pixel at image coordinate $\mathbf{q}_t^j$, $f_{\mathrm{zncc}}(\cdot)$ computes the ZNCC between the descriptors, and $\sigma_\tau(\cdot)$ takes the temperature-weighted softmax over the target pixels with $\tau$ as the temperature. The keypoint matching is differentiable. We found that using all target pixels worked better in practice than only including keypoints detected in the target image. Finally, we find the descriptor and score for each computed target point using bilinear interpolation. An example of detected and matched keypoints and corresponding scores for a pair of images can be seen in Figure 6.
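The soft matching step can be illustrated with a short NumPy sketch. The function names `zncc` and `soft_match` are hypothetical, and dividing the ZNCC values by the temperature is an assumed convention, since the text does not state how the temperature enters the softmax.

```python
import numpy as np


def zncc(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """ZNCC between every row of a (M, C) and every row of b (N, C).

    Returns an (M, N) array with values in [-1, 1].
    """
    a = (a - a.mean(axis=1, keepdims=True)) / (a.std(axis=1, keepdims=True) + 1e-12)
    b = (b - b.mean(axis=1, keepdims=True)) / (b.std(axis=1, keepdims=True) + 1e-12)
    return a @ b.T / a.shape[1]


def soft_match(
    src_desc: np.ndarray,    # (M, C) descriptors of the source keypoints
    tgt_desc: np.ndarray,    # (N, C) descriptors of all target pixels
    tgt_coords: np.ndarray,  # (N, 2) image coordinates of the target pixels
    temperature: float = 0.01,  # hypothetical value, for illustration only
) -> np.ndarray:
    """Return (M, 2) matched target points as softmax-weighted coordinate sums."""
    scores = zncc(src_desc, tgt_desc)
    logits = scores / temperature  # assumed convention: divide by the temperature
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponential
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tgt_coords
```

A small temperature makes the softmax close to a hard nearest-descriptor match, while a larger one blends the coordinates of several similar pixels.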
III-C Stereo Camera Model
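In order to estimate the pose from matched 2D keypoints, we need their corresponding 3D coordinates, which is straightforward with a stereo camera. The camera model, $g(\cdot)$, for a pre-calibrated stereo rig maps a 3D point, $\mathbf{p} = [x \;\, y \;\, z]^T$, in the camera frame to a left stereo image coordinate, $(u_l, v_l)$, together with disparity, $d$, as follows:
$$\begin{bmatrix} u_l \\ v_l \\ d \end{bmatrix} = g(\mathbf{p}) = \begin{bmatrix} f_u \, x / z + c_u \\ f_v \, y / z + c_v \\ f_u \, b / z \end{bmatrix}, \tag{2}$$
where $f_u$ and $f_v$ are the horizontal and vertical focal lengths in pixels, $c_u$ and $c_v$ are the camera’s horizontal and vertical optical centre coordinates in pixels, and $b$ is the baseline in metres. We use the inverse stereo camera model to get each keypoint’s 3D coordinates:
$$\mathbf{p} = g^{-1}(u_l, v_l, d) = \begin{bmatrix} b \, (u_l - c_u) / d \\ b \, f_u \, (v_l - c_v) / (f_v \, d) \\ f_u \, b / d \end{bmatrix}. \tag{3}$$
We perform stereo matching to obtain the disparity, $d$, by using semi-global matching [10] as implemented in OpenCV.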
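The stereo camera model and its inverse are the standard pinhole stereo relations, which can be sketched in plain Python. The class and function names and the calibration values in the usage note are hypothetical.

```python
from typing import NamedTuple, Tuple


class StereoCamera(NamedTuple):
    f_u: float  # horizontal focal length [pixels]
    f_v: float  # vertical focal length [pixels]
    c_u: float  # horizontal optical centre [pixels]
    c_v: float  # vertical optical centre [pixels]
    b: float    # baseline [metres]


def project(cam: StereoCamera, p: Tuple[float, float, float]) -> Tuple[float, float, float]:
    """Map a 3D point (x, y, z) in the camera frame to (u_l, v_l, disparity)."""
    x, y, z = p
    return (cam.f_u * x / z + cam.c_u, cam.f_v * y / z + cam.c_v, cam.f_u * cam.b / z)


def back_project(cam: StereoCamera, obs: Tuple[float, float, float]) -> Tuple[float, float, float]:
    """Inverse model: recover (x, y, z) from (u_l, v_l, disparity); disparity must be > 0."""
    u, v, d = obs
    z = cam.f_u * cam.b / d        # depth from disparity
    x = (u - cam.c_u) * z / cam.f_u
    y = (v - cam.c_v) * z / cam.f_v
    return (x, y, z)
```

For example, with the made-up calibration `StereoCamera(400.0, 400.0, 256.0, 192.0, 0.24)`, projecting a point and then calling `back_project` on the result returns the original point up to floating-point error.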
III-D Pose Estimation
Given the correspondences between the source keypoints and the matched target keypoints, we can compute the relative pose from the source to the target, which includes the translation from the target frame to the source frame expressed in the target frame. As described in Section III-C, we use the inverse stereo camera model (3) to compute the 3D coordinates of the source and target points from the corresponding 2D keypoints. This allows us to estimate the pose by minimizing a weighted cost on the error between the transformed source points and the target points.
The minimization is implemented using Singular Value Decomposition (SVD) (more details can be found in [2]), which is differentiable. The weight for a matched point pair is a combination of the learned scores of the two points and how well their descriptors match.
We additionally remove large outliers based on ground truth keypoint coordinate error at training time and using Random Sample Consensus (RANSAC) at inference.
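A weighted point-cloud alignment solved with SVD can be sketched as below. This is the standard closed-form weighted least-squares solution written in NumPy, offered as an illustration of the technique and not as the implementation used in the pipeline. The function name and argument layout are hypothetical.

```python
import numpy as np


def weighted_rigid_alignment(p_s: np.ndarray, p_t: np.ndarray, w: np.ndarray):
    """Find rotation C and translation r minimising sum_i w_i * ||C p_s_i + r - p_t_i||^2.

    p_s, p_t: (N, 3) corresponding source and target points.
    w: (N,) non-negative weights with a positive sum.
    """
    w = w / w.sum()
    mu_s = w @ p_s                      # weighted centroids
    mu_t = w @ p_t
    q_s = p_s - mu_s
    q_t = p_t - mu_t
    h = (w[:, None] * q_t).T @ q_s      # 3x3 weighted cross-covariance
    u, _, vt = np.linalg.svd(h)
    # Guard against a reflection so that det(C) = +1.
    d = np.diag([1.0, 1.0, np.linalg.det(u) * np.linalg.det(vt)])
    c = u @ d @ vt
    r = mu_t - c @ mu_s
    return c, r
```

Points with a weight near zero barely affect the estimate, which is how low-scoring or poorly matched keypoints are down-weighted in a scheme of this kind.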
III-E Loss Functions
In their paper on radar localization, Barnes and Posner [3] supervise training using only a loss on the estimated pose. We found that this was insufficient for our approach and therefore also include a loss on the 3D coordinates of the matched keypoints, similar to [5]. For our particular training datasets, we only use a subset of the pose degrees of freedom (DOF) for supervision. Specifically, we use the robot’s longitudinal direction, lateral offset, and heading. For this reason, we modify the keypoint loss to only use these DOFs.
Using (3), we can compute the 3D coordinates of the keypoints in the source and target camera frames. Given the ground truth pose, we form a pose that we use to transform the source keypoints. Because we transform the source points in the plane, the keypoint loss only compares the two in-plane point coordinates of the transformed source points and the matched target points.
For the pose loss, we compare the rotation and translation formed from the ground truth pose with those formed from the estimated pose, using a scalar weight to balance rotation and translation. The rotation error is expressed relative to the identity matrix. The total loss is a weighted sum of the keypoint loss and the pose loss, where the weight is determined empirically to balance the influence of the two loss terms.
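One plausible form of a planar pose loss is sketched below for the three supervised DOFs. Several details are assumptions made for illustration: the squared norms, the placement of the balancing weight `lam` on the rotation term, the weight placement in `total_loss`, and all function names.

```python
import numpy as np


def rot2d(theta: float) -> np.ndarray:
    """2x2 rotation matrix for a heading angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def pose_loss(pose_gt, pose_est, lam: float) -> float:
    """Planar pose loss for poses given as (longitudinal, lateral, heading).

    Assumed form: squared translation error plus lam times the squared
    Frobenius norm of (C_gt @ C_est^T - identity).
    """
    r_gt = np.asarray(pose_gt[:2], dtype=float)
    r_est = np.asarray(pose_est[:2], dtype=float)
    trans_err = np.sum((r_gt - r_est) ** 2)
    rot_err = np.sum((rot2d(pose_gt[2]) @ rot2d(pose_est[2]).T - np.eye(2)) ** 2)
    return float(trans_err + lam * rot_err)


def total_loss(keypoint_loss: float, pose_loss_value: float, weight: float) -> float:
    """Weighted sum of the two loss terms; which term carries the weight is assumed."""
    return keypoint_loss + weight * pose_loss_value
```

The rotation term is zero when the estimated and ground truth headings agree, because the product of one rotation with the transpose of the other is then the identity matrix.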
The training data were previously collected using a Clearpath Grizzly robot with a Bumblebee XB3 camera (see Figure 1). By using multi-experience VT&R [19], the robot repeats paths accurately across large lighting and seasonal change with the help of intermediate bridging experiences and Speeded-Up Robust Features (SURF). We can use the resulting data as ground truth for supervised learning. In VT&R, stereo image keyframes are stored as vertices in a spatio-temporal pose graph. Edges contain the relative pose between temporally adjacent vertices and between a repeat vertex and the teach vertex to which it has been localized. We sample image pairs and poses from the pose graph to build a training dataset.
We use data from two different paths for training (data available at http://asrl.utias.utoronto.ca/datasets/2020-vtr-dataset). The In-The-Dark dataset contains 39 runs of a path collected at our campus in summer 2016 along a road and on grass. The path was repeated once per hour for 30 hours, systematically capturing lighting change. The Multiseason dataset contains 136 runs of a path in an area with more vegetation and undulating terrain. It was repeated from January until May 2017, capturing varying seasons and weather. Overhead path views can be seen in Figure 4. We generate two separate datasets from these two paths that each have 100,000 training samples and 20,000 validation samples.
IV-B Training and Inference
We train the network by giving it pairs of images and the ground truth relative pose from the training dataset. Large outliers are removed based on keypoint error using the ground truth pose during training and with RANSAC during inference. The network is trained using the Adam optimizer [13], with all parameters other than the learning rate set to default values. We determine the number of training epochs with early stopping based on the validation loss. The network is trained on an NVIDIA Tesla V100 DGXS GPU.
IV-C Visual Teach and Repeat
In order to test the performance of the learned features, we add them to the VT&R system, which normally relies on sparse SURF for localization. More detail on VT&R can be found in [20], and the VT&R code base is available online (VT&R code available at http://utiasasrl.github.io/vtr3 from Sept. 28, 2021). We insert the learned features without making substantial changes to VT&R feature matching and pose estimation. Instead, we rely on the existing sparse descriptor matching and do not add the keypoint scores used in training, but plan to include dense descriptor matching (1) and keypoint scores in future work. Note that in the experiments, we match live images directly to the map without using intermediate experiences.
We complete three experiments. For the first we train a network on the Multiseason dataset and run VT&R’s localization module in offline mode on held-out repeats from the same dataset to show that we can match features from different seasonal conditions. See Figure 8 for images from each repeat.
The last two experiments are run in closed loop on the robot. First we train a network on the In-The-Dark dataset. As opposed to the offline experiment, we now teach a new path by physically driving the robot, rather than using held-out data from the training dataset. The new path is similar to the one from the training data (see Figure 5), albeit five years later. We will refer to this as the ‘lighting-change experiment’. We taught the approximately 250 m path at noon on August 2nd and repeated it on August 3rd and 9th covering lighting conditions from 3 a.m. until 10.30 p.m. (see Figure 9). For this experiment the network had a higher number of features for each layer (first layer of size 32 instead of 16). We later found this was unnecessary and both other experiments use the architecture from Figure 3.
In the second closed-loop experiment, which we refer to as the ‘generalization experiment’, we test the learned features’ ability to generalize to new areas. We train a network using both the Multiseason and In-The-Dark datasets and teach a longer, approximately 750 m path. As shown in Figure 5, we include two new areas that are not observed by training data. In particular, the larger of the two new sections is driven in an area with more vegetation and taller grass than seen in the most relevant Multiseason dataset. Moreover, the experiment is conducted in August, while the Multiseason dataset only contains data until May. This path was taught on August 14th and repeated on August 15th, 16th, and 20th covering lighting change from 4 a.m. until 9 p.m. (see Figure 10).
V-A Offline: Seasonal Change
In this experiment, we ran VT&R with learned features offline on held-out repeats from the Multiseason dataset. VT&R localization works the same way offline as on the robot, except there is no robot receiving commands. We compare all the conditions shown in Figure 8 against each other, meaning that we teach using images from one condition and repeat with the rest. We compute the mean number of matched feature inliers (as determined by RANSAC in VT&R) and display them in Figure 7. In VT&R, we consider a drop below 20 inliers to constitute a localization failure and we can see that all repeats have averages well above this. We are able to localize winter with snow to spring with green grass.
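The inlier statistics used in this evaluation can be computed with a few lines of Python. This is a sketch with hypothetical names and made-up inlier counts; only the threshold of 20 inliers comes from the text.

```python
from statistics import mean
from typing import Dict, Sequence

FAILURE_THRESHOLD = 20  # a frame with fewer inliers counts as a localization failure


def summarize_repeat(inliers_per_frame: Sequence[int]) -> Dict[str, float]:
    """Summarise one teach/repeat pairing from its per-frame RANSAC inlier counts."""
    failures = sum(1 for n in inliers_per_frame if n < FAILURE_THRESHOLD)
    return {
        "mean_inliers": mean(inliers_per_frame),
        "failed_frames": failures,
        "failure_rate": failures / len(inliers_per_frame),
    }


# Made-up counts for four frames; only the 19-inlier frame is a failure.
summary = summarize_repeat([100, 50, 19, 20])
```

Reporting the failed-frame count alongside the mean matters because a high average can hide a short stretch where localization drops out.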
Across all 42 experiments we had localization failure for a total of 17 frames. In this case VT&R relies on VO until it regains localization. In one experiment, localizing data from 10/02 to data from 14/02, we had a localization failure that VT&R did not recover from after the turn in the top right corner of the path as seen in Figure 4. Both of these conditions have the most snow cover, which makes it more difficult to detect features, let alone match them. As seen in Figure 7, localizing was most difficult when using data from 14/02 with the most snow and very bright images. Similarly, the data from 10/02 is also challenging due to snow cover.
V-B Closed-loop: Lighting Change
For this experiment, we ran VT&R with learned features for localization in closed loop. VT&R still relies on SURF for visual odometry. Hence, learned features introduce additional computation and we had to reduce the maximum speed of the robot to 0.8 m/s in order to run the experiment. With more work we can make the implementation more efficient and increase driving speed.
We taught a new path similar to the one from the In-The-Dark dataset on August 2nd at noon and repeated on August 3rd and 8th with a 100% autonomy rate. The repeats span lighting change from 3 a.m. until 10.30 p.m., see Figures 9 and 11. All days had sunny weather with only occasional clouds. The robot repeated the path accurately for all conditions including driving in the dark with headlights and during challenging sun flares. The path root mean squared error (RMSE) across all repeats is 0.049 m with 0.070 m being the highest RMSE for an individual repeat, which was driven at 17:06.
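For reference, a path-tracking RMSE of this kind can be computed as sketched below. The function name, the idea of pooling samples over repeats for the overall value, and the error values are assumptions made for illustration, since the text does not describe how the per-sample errors are obtained.

```python
import math
from typing import Dict, Sequence


def path_rmse(errors_m: Sequence[float]) -> float:
    """Root mean squared error of per-sample path-tracking errors in metres."""
    return math.sqrt(sum(e * e for e in errors_m) / len(errors_m))


# Hypothetical signed errors for two repeats, keyed by start time.
repeats: Dict[str, Sequence[float]] = {
    "12:00": [0.03, -0.04],
    "17:06": [0.05, 0.05, -0.05, -0.05],
}

per_repeat = {name: path_rmse(errs) for name, errs in repeats.items()}
# Overall RMSE pools all samples, so longer repeats carry more weight.
overall = path_rmse([e for errs in repeats.values() for e in errs])
```

Pooling the samples is only one convention. Averaging the per-repeat RMSE values instead would weight every repeat equally regardless of its length.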
Finally, we compute the mean number of matched feature inliers, see Figure 12 (a). For every repeat we get a mean number of inliers well above the 20 required to localize. Across all runs there are 3 frames with localization failure, which occurred during one night-time drive. In this case VO is used until localization recovers. As expected, we get the highest number of inlier matches in the middle of the day and the lowest number when it is dark. The dips in numbers during sunrise and sunset are caused by sun flares. Before sunrise on August 8th there was fog, a condition not seen in the training dataset. This explains the lower number of inliers compared to driving after dark at night.
V-C Closed-loop: Generalization to New Areas
In the previous experiment, we showed that we can learn features based on data collected in 2016, teach a new path five years later, and repeat it across large lighting change. This experiment tests generalization to areas not included in the training datasets and tackles some new seasonal appearance. We train a network with data from both the In-The-Dark and Multiseason datasets so we can drive a path both along the road and in an off-road area. Both the on-road and off-road sections of the path have one new unseen area, as shown in Figure 5. In particular, the new off-road area is very unlike the training data from the Multiseason dataset (which provides off-road samples, while In-The-Dark covers mostly on-road driving). The new area has a lot more vegetation and grass. Moreover, the training data was collected from January until May, while we drive in August.
The path was taught on August 14th at noon and repeated with a 100% autonomy rate on August 15th, 16th, and 20th spanning lighting change from 4 a.m. until 9 p.m., see Figures 10 and 11. We plot the mean number of inliers for all repeats in Figure 12 (b) and see a similar behavior to the previous experiment with the highest number of inliers during the day, some dips at sunrise and sunset due to sun flares, and the lowest inlier counts in the dark. There are localization failures for 160 (0.019%), 190 (0.022%), and 60 (0.021%) frames for repeats at 03:55, 05:21, and 20:59, respectively. In the same plot, we compare to localization using SURF together with colour-constant images [18]. In eight examples SURF fails to localize at the beginning of the path. For the remaining repeats the mean number of inliers is much lower than for the learned features and localization failure happens in 43.7% of frames across these runs.
Since a large part of the path falls in an area with tree cover and other vegetation, we were unable to reliably collect RTK GPS ground truth. In the new off-road area the robot swayed slightly at a few spots during the dark or strong sun flares, but quickly recovered and never drove off the path. In the new on-road area, for one repeat at 10:20, the robot turned too sharply and was off the path by roughly 1.5 meters at the far end of the turn, but recovered when exiting the turn.
Finally, we present two plots in Figure 12 (c) and (d) that compare the mean number of matched feature inliers when driving in the new areas versus the areas already seen in the training dataset. For the purpose of this comparison, we divide the path into on-road and off-road sections since the on-road section generally gets more inliers. We see that the new off-road area consistently gets fewer inliers than the area contained in the training data, but at the same time it gets enough inliers for driving. Moreover, the inlier numbers for the new area exhibit the same variation over time as those for the known area. In the new on-road area, the average number of inliers is a little less intuitive as it remains similar or higher compared to the known area. The reason is that the new area is much more similar to the rest of the path in the on-road case. Moreover, the new part of the path is not as affected by sun flare at sunrise and sunset as the known on-road section, explaining why it has higher values at these times.
VI Conclusions and Future Work
We have shown that we can use learned features for autonomous path-following under large lighting change and that we can extend our path to new areas not seen in the data used to train the features. From data gathered in summer 2016 and from January to May 2017, we have trained networks to predict visual features that work reliably several years later. We learn front-end visual features, while relying on classical methods for pose estimation. In the experiments we matched features directly between seasonal conditions with snow and green grass in an offline experiment, repeated autonomously across lighting-change from 3 a.m. to 10:30 p.m. over two days, and, finally, repeated a longer and more challenging path with new unseen sections from 4 a.m. until 9 p.m. across three days without failure.
For this paper, we used the existing sparse feature matcher in VT&R. In future work we will implement dense matching in VT&R as described in Section III-B and use the learned scores. We also aim to make the implementation more efficient so that we can drive faster. We plan a longer closed-loop experiment to test against weather and some seasonal change. There is also room to further test generalization. Finally, since the training data is collected using VT&R we do not have a lot of training data with very large viewpoint offsets. We want to test robustness to larger viewpoint offsets and update our training method accordingly.
We would like to thank Clearpath Robotics and the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this work. We also thank Yuchen Wu, Ben Congram, and Sherry Chen for their help with VT&R and offline experiments.
[1] (2018) Relocnet: continuous metric learning relocalisation using neural nets. In Proceedings of ECCV, pp. 751–767.
[2] (2017) State estimation for robotics. Cambridge University Press.
[3] (2020) Under the radar: learning to predict robust keypoints for odometry estimation and metric localisation in radar. In Proceedings of IEEE ICRA.
[4] (2020) A survey on deep learning for localization and mapping: towards the age of spatial machine intelligence. arXiv preprint arXiv:2006.12567.
[5] (2019) Unsuperpoint: end-to-end unsupervised interest point detector and descriptor. arXiv preprint arXiv:1907.04011.
[6] (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of IEEE/CVF CVPR Workshops, pp. 224–236.
[7] (2019) D2-net: a trainable cnn for joint description and detection of local features. In Proceedings of IEEE/CVF CVPR, pp. 8092–8101.
[8] (2021) Tight integration of feature-based relocalization in monocular direct visual odometry. arXiv preprint arXiv:2102.01191.
[9] (2020) DeepMEL: compiling visual multi-experience localization into a deep neural network. In Proceedings of IEEE ICRA.
[10] (2008) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence 30 (2), pp. 328–341.
[11] (2020) Unsupervised metric relocalization using transform consistency loss. CoRR.
[12] (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of IEEE ICCV, pp. 2938–2946.
[13] (2014-12) Adam: a method for stochastic optimization. International Conference on Learning Representations.
[14] Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of IEEE ICCV Workshops, pp. 929–938.
[15] (2020) Aslfeat: learning local features of accurate shape and localization. In Proceedings of IEEE/CVF CVPR, pp. 6589–6598.
[16] (2019) Taking a deeper look at the inverse compositional algorithm. In Proceedings of IEEE/CVF CVPR, pp. 4581–4590.
[17] (2018) LF-net: learning local features from images. In Advances in neural information processing systems, pp. 6234–6244.
[18] (2015-05) It’s not easy seeing green: lighting-resistant stereo visual teach & repeat using color-constant images. In Proceedings of IEEE ICRA, pp. 1519–1526.
[19] (2016-10) Bridging the appearance gap: multi-experience localization for long-term visual teach and repeat. In Proceedings of IEEE/RSJ IROS, pp. 1918–1925.
[20] (2018) I can see for miles and miles: an extended field test of visual teach and repeat 2.0. In Field and Service Robotics, pp. 415–431.
[21] (2019) Learning scene geometry for visual localization in challenging conditions. In Proceedings of IEEE ICRA, pp. 9094–9100.
[22] (2008) Faster and better: a machine learning approach to corner detection. IEEE transactions on pattern analysis and machine intelligence 32 (1), pp. 105–119.
[23] (2020) Superglue: learning feature matching with graph neural networks. In Proceedings of IEEE/CVF CVPR, pp. 4938–4947.
[24] (2021) Back to the feature: learning robust camera localization from pixels to pose. In Proceedings of IEEE/CVF CVPR, pp. 3247–3257.
[25] (2019) Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of IEEE/CVF CVPR, pp. 3302–3312.
[26] (2020) Same features, different day: weakly supervised feature learning for seasonal invariance. In Proceedings of IEEE/CVF CVPR, pp. 6459–6468.
[27] (2019) BA-net: dense bundle adjustment networks. In ICLR.
[28] (2018-09) Deep Auxiliary Learning for Visual Localization and Odometry. In Proceedings of IEEE ICRA, pp. 6939–6946.
[29] (2015) Tilde: a temporally invariant learned detector. In Proceedings of IEEE/CVF CVPR, pp. 5279–5288.
[30] (2020) GN-net: the gauss-newton loss for multi-weather relocalization. IEEE RAL 5 (2), pp. 890–897.
[31] (2020) LM-reloc: levenberg-marquardt based direct visual relocalization. In Proceedings of IEEE 3DV, pp. 968–977.
[32] (2020) Learning feature descriptors using camera pose supervision. In Proceedings of ECCV, pp. 757–774.
[33] (2020) Deep probabilistic feature-metric tracking. IEEE RAL 6 (1), pp. 223–230.
[34] (2016) Lift: learned invariant feature transform. In Proceedings of ECCV, pp. 467–483.