Self-Supervised Localisation between Range Sensors and Overhead Imagery

06/03/2020 · by Tim Y. Tang, et al. · University of Oxford

Publicly available satellite imagery can be a ubiquitous, cheap, and powerful tool for vehicle localisation when a prior sensor map is unavailable. However, satellite images are not directly comparable to data from ground range sensors because of their starkly different modalities. We present a learned metric localisation method that not only handles the modality difference, but is cheap to train, learning in a self-supervised fashion without metrically accurate ground truth. By evaluating across multiple real-world datasets, we demonstrate the robustness and versatility of our method for various sensor configurations. We pay particular attention to the use of millimetre wave radar, which, owing to its complex interaction with the scene and its immunity to weather and lighting, makes for a compelling and valuable use case.


I Introduction

The ability to localise relative to an operating environment is central to robot autonomy. Localisation using range sensors, such as lidars [25, 44] and, more recently, scanning millimetre wave radars [33, 36], is an established proposition. Both are immune to changing lighting conditions and directly measure scale, while the latter adds resilience to weather conditions.

Current approaches to robot localisation typically rely on a prior map built using a sensor configuration that will also be equipped on-board, for example a laser map for laser localisation. This paper looks at an alternative. Public overhead imagery, such as satellite images, is a reliable map source: it is readily available, and it often captures information also observable, albeit perhaps in some complex and incomplete way, by sensors on the ground. We can pose the localisation problem in a natural way: find the pixel location of a sensor in an overhead (satellite) image given range data taken from the ground. The task is, however, non-trivial because of the drastic modality difference between a satellite image and sparse, ground-based radar or lidar data.

Recent work on learning to localise a ground scanning radar against satellite images [39] provides a promising direction which addresses the modality difference by first generating a synthetic radar image from a satellite image. The synthetic image is then "compared" against a live radar image for pose estimation. Such an approach learns metric, cross-modality localisation end-to-end, and therefore does not require hand-crafted features limited to a specific environment.

The method in [39] trains a multi-stage network, and needs pixel-wise aligned radar and satellite image pairs for supervision at all stages. This in turn requires sub-metre accurate ground truth position and sub-degree accurate ground truth heading. In practice, collecting such accurate ground truth requires a high-end GPS/INS, and possibly bundle adjustment along with other on-board sensor solutions, imposing burdens in both cost and time.

Fig. 1: Given a map image of modality A (left) and a live data image of modality B (middle), we wish to find the unknown offset between them. To do so, our method generates a synthetic image of modality B (right) that is pixel-wise aligned with the map image, but contains the same appearance and observed scenes as the live data image. Top: localising radar data against satellite imagery. Middle: localising lidar data against satellite imagery. Bottom: localising radar data against a prior lidar map.

Building on [39], we propose a method for localising against satellite imagery that is self-supervised. The core idea of both approaches is to generate a synthetic image with the appearance and observed scenes of a live range sensor image, but pixel-wise aligned with the satellite image. We assume a coarse initial pose estimate is available from place recognition, such that there is reasonable overlap between the live ground sensor field-of-view and a queried satellite image.

Vitally, here we make no use of metrically accurate ground truth for training. Note also that although designed for localising against satellite imagery, our method naturally handles other forms of cross-modality registration, such as localising a radar against a prior lidar map. Figure 1 shows synthetic images generated by our method and used for pose estimation.

To the best of our knowledge, this paper presents the first method to learn the cross-modality, metric localisation of a range sensor in a self-supervised fashion. Our method is validated experimentally on multiple datasets and achieves performance on par with a state-of-the-art supervised approach.

II Related Work

II-A Localisation Using Overhead Images

Localisation using aerial or overhead images has been of interest to the community for over a decade. The methods in [24, 26, 34] localise a ground camera using aerial images by detecting Canny edges from aerial imagery and matching them against lines detected by a ground camera. Several other vision-based approaches project the ground camera images to a top-down perspective via a homography, and compare against the aerial imagery by detecting lane markings [35], SURF features [32], or dense matching [37]. Recent work [12] localises a ground robot in a crop field by matching camera features against landmarks from an aerial map, and incorporates crop semantics to reduce ambiguity.

Metric localisation of range sensors or point-clouds against overhead imagery requires further pre-processing due to the modality difference. Kaminsky et al. [19] projected point-clouds into images and matched them against binary edge images from overhead imagery. The method in [19] also constructs a ray image by ray-tracing each point, and introduces a free-space cost to aid the image registration. The work by Veronese et al. [14] accumulates several lidar scans to produce dense lidar intensity images, which are then matched against satellite images using Normalised Mutual Information. Similar to [19], several other methods also pre-process the aerial image before matching against ground laser observations, for example using edge detection [21] or semantic segmentation [15]. Our method directly learns the metric localisation of a range sensor end-to-end, without the need for careful pre-processing.

II-B Cross-Modality Localisation

Other forms of cross-modality localisation are also heavily studied by the community. A number of works localise a forward-facing camera against a prior 3D point-cloud map [43, 10, 47]. Carle and Barfoot [9] localised a ground laser scanner against an orbital elevation map. The works in [40, 7, 41, 31] localise an indoor lidar or stereo camera against architectural floor plans. Brubaker et al. [8] and Floros et al. [16] concurrently proposed matching visual odometry paths to road layouts from OpenStreetMap for localisation.

II-C Learning-based State Estimation for Range Sensors

A number of recent works learn odometry or localisation for lidars. Barsan et al. [5] represented lidar data as intensity images, and learned a deep embedding for metric localisation by comparing embeddings of live and map lidar intensity images. Methods such as [13, 27] learn deep lidar odometry by projecting lidar point-clouds into other representations before passing them through the network. Lu et al. [30] used point-clouds as input to learn descriptors, and utilised 3D CNNs to solve metric localisation by searching in a 3D cost volume. In later work, Lu et al. proposed a method to learn lidar point-cloud registration end-to-end [29].

As radar is an emerging sensor for outdoor state estimation, learning-based methods have also been proposed for scanning FMCW radars. Aldera et al. [1] applied an encoder-decoder to a polar image representation of radar scans to learn key-points that speed up classical radar odometry [11]. Barnes et al. learned image-based radar odometry [4] by masking out regions that are distracting for pose estimation, and point-based radar odometry [3] by detecting key-points from radar images. Saftescu et al. [36] encoded images of radar polar scans through a rotation-invariant architecture to perform topological localisation (place recognition). These methods, however, are designed to compare data of the same sensor type, and do not address the modality difference. Our approach is similar to [5, 13, 1, 4, 42, 36, 3] in that we also represent lidar and radar data as 2D images before passing them through the network.

II-D Unsupervised Image Generation

We seek to generate a synthetic image prior to pose computation, where there is no pixel-wise aligned target image for supervision. CycleGAN [48] achieves unsupervised image-to-image transfer between two domains X and Y by learning two pairs of generators and discriminators, and enforcing cycle-consistency when an image is mapped from X to Y and back to X, and vice versa. A number of other methods [28, 22] also utilise cycle-consistency, but make different assumptions on how the latent spaces of the two domains are treated. These methods are concerned with generating photo-realistic images. For the problem of metric localisation, however, we need to explicitly encourage the synthetic image to contain information appropriate for pose estimation.

Several prior works are also geometry-aware. The methods in [38, 45, 46] use separate encoders and/or decoders to disentangle geometry and appearance, resulting in networks that can interpolate the geometry and appearance of the output images separately. Similarly, our method separately encodes information about appearance and about the relative pose offset, resulting in an architecture in which the two are disentangled.

III Overview and Motivation

We seek to solve for the pose between a map image I_A of modality A and a live data image I_B of modality B. Our main focus is the case where modality A is satellite imagery, while modality B is range sensor data represented as an image.

Previously, RSL-Net [39] was proposed to solve for the metric localisation between matched pairs of radar and satellite images. In particular, a synthetic image is generated such that it preserves the appearance and observed scenes of the live radar image, and is pixel-wise aligned with the paired satellite image. The synthetic image and the live radar image are then projected onto deep embeddings, where their pose offset is found by maximising a correlation surface. We follow the same general approach, but, unlike RSL-Net, our method learns in a self-supervised fashion.

III-A Hand-crafting Features vs. Learning

A number of works listed in Sections II-B and II-A can achieve decent accuracy on localising a ground range sensor against aerial imagery. However, they typically rely on pre-processing the aerial images using hand-crafted features or transforms designed for a specific set-up and may not generalise to other sensors or different, more complex environments. For example, [21] focuses on detecting edges from a campus dominated by buildings. While [14] directly matches accumulated lidar intensity images against aerial imagery without pre-processing, the same method is unlikely to work for radars.

Our data-driven approach instead learns to directly infer the geometric relationship across modalities, remaining free of hand-crafted features. We show in Section V that, when localising against satellite imagery, our method works for various types of scenes, including urban, residential, campus, and highway.

III-B Generating Images vs. Direct Regression

A naive approach would be to take a satellite image and a live data image as inputs, and directly regress the pose. As originally shown in [39], this led to poor results even for the supervised case. Our hypothesis is that when the two images are starkly different in appearance and observed scenes, the problem becomes too complex for direct regression to succeed.

Generating synthetic images prior to pose estimation brings two advantages over directly regressing the pose. First, generating synthetic images is a simpler and less ill-posed problem than direct pose regression, particularly because we can use the live data image to condition the generation. Moreover, to generate images, the network loss is distributed over an entire image of H × W pixels, where H and W are the image height and width, rather than over just three parameters (θ, x, and y). This introduces greater constraint during optimisation.

III-C Conditional Image Generation

Our method of conditional image generation takes in both a map (e.g., satellite) image and a live data image as inputs. An alternative approach is to learn a domain adaptation from the map modality to the live data modality without conditioning on the live data image (e.g., standard image-to-image transfer such as CycleGAN [48]).

Fig. 2: Two radar images captured 15 seconds apart from each other (2 & 4), pixel-wise aligned with satellite images (1 & 3). Though the overlapping scenes in the satellite images are identical, the radar scans appear significantly different, as they capture different regions in their field-of-view.

In practice, the map (e.g., satellite) image is a denser representation of the environment than a single frame of data captured by a range sensor. Only a fraction of the scene captured in a satellite map is present in a ground sensor's field-of-view, causing the scan to appear drastically different depending on the sensor pose. As shown in Figure 2, the overlapping regions of the two satellite images are identical, while the two radar images observe different regions of the scene.

With a naive image-to-image transfer approach, there is no guarantee that the generated image will contain regions of the scene that are useful for pose comparison against the live data image. Figure 11 shows examples of images generated using CycleGAN [48], where the synthetic image highlights different scenes than those observed by the live data image. The issue with observability or occlusion can potentially be handled by ray-tracing, as in [19]. However, not only is this computationally expensive, it does not apply to FMCW radars, which have multiple range returns per azimuth (and in which we are particularly interested). This problem is inherently addressed by our approach: by conditioning the image generation on the live data image, we can encourage the synthetic image to capture regions of the scene also observed by the live data image, as shown in Sections IV and V.

IV Self-Supervised Cross-Modality Localisation

IV-A Rotation Inference

Given a paired map (e.g., satellite) image I_A and live data image I_B with an unknown offset, we seek to generate a synthetic image Î_B that contains the same appearance and observed scenes as I_B, but is pixel-wise aligned with I_A.

Let the pose difference between I_A and I_B be parametrised as (θ, x, y), such that by rotating I_B by θ followed by a translation of (x, y), one can pixel-wise align I_B onto I_A. The image generation can be formulated as

    Î_B = h(I_A, I_B),    (1)

where Î_B is a generated image of modality B that synthesises the input live sensor image I_B applied with a rotation of θ followed by a translation of (x, y). Thus, Î_B is pixel-wise aligned with the input map image I_A, but contains the same observed scenes as I_B.

However, as originally noted in [39], the mapping in (1) is difficult to learn, as the inputs I_A and I_B are offset by both a translation and a rotation. CNNs are inherently equivariant to translation, but not to rotation [23]. As a result, the CNNs in the generator cannot automatically utilise the mutual information between the inputs and thereby capture their geometric relationship.

The method in [39] proposes to infer the rotation prior to image generation, reducing (1) to two steps:

    I_B^θ = f_θ(I_A, I_B),    (2)
    Î_B = g(I_A, I_B^θ).    (3)

Here f_θ is a function that infers the rotation offset θ between I_A and I_B, and outputs I_B^θ, which is the input image I_B rotated by θ. Now, I_B^θ is rotation-aligned with the map frame, and is therefore offset from I_A only by a translation, which CNNs can naturally handle. g is an image generation function that produces the synthetic image Î_B. The experiments in [39] show that learning (2) and (3) sequentially results in better performance than learning (1) directly, as the former is congruous with the equivariance properties of CNNs.

In [39], the rotation inference function f_θ is parametrised by a deep network as shown in Figure 3, where satellite imagery and radar images are used as an example. Given a coarse initial heading estimate, the live data image I_B is rotated a number of times with small increments to form a stack of rotated images S_B, where the number of rotations N and the increment are design parameters. Each rotated image is further concatenated with the map image I_A to form a stacked tensor input of N pairs of map and live data images. The output of the network is a softmaxed image from S_B that corresponds to I_B rotated to be rotation-aligned with I_A, namely I_B^θ. The core idea is that the network will assign a large softmax weight to the image in S_B whose heading most closely aligns with the map image, and small weights to all other images in S_B.
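As a concrete illustration, below is a minimal PyTorch-style sketch of this forward pass. The helper names, the single-channel images, the default number of rotations and increment, and the way scores are pooled from the scorer CNN are illustrative simplifications rather than the exact network of [39] (the full layer listing is given in the supplementary material, Table IV).

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate


def build_rotation_stack(live_img, theta0, n_rot=11, increment=2.0):
    """Rotate the live image about its centre for a range of candidate headings.

    live_img: (1, H, W) tensor; theta0: coarse initial heading estimate (degrees).
    Returns a (N, 1, H, W) stack of rotated live images and the candidate angles.
    """
    angles = [theta0 + (i - n_rot // 2) * increment for i in range(n_rot)]
    stack = torch.stack([rotate(live_img, a) for a in angles])  # (N, 1, H, W)
    return stack, angles


def soft_select(cond_img, stack, scorer):
    """Concatenate the conditioning image with each rotated image, score each pair,
    and return the softmax-weighted combination of the stack."""
    n = stack.shape[0]
    cond = cond_img.unsqueeze(0).expand(n, -1, -1, -1)      # (N, 1, H, W)
    pairs = torch.cat([cond, stack], dim=1)                  # (N, 2, H, W)
    logits = scorer(pairs).flatten(1).sum(dim=1)             # one score per rotation
    weights = F.softmax(logits, dim=0)                       # (N,)
    selected = (weights.view(n, 1, 1, 1) * stack).sum(dim=0) # softmax-weighted image
    return selected, weights
```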

Fig. 3: Prior work in [39] proposes a network to infer the rotation offset. The rotation offset is found by softmaxing a stack of rotated radar images to produce a radar image with the same heading as the satellite image.

If metrically accurate heading ground truth is available, then one can rotate I_B to form an image target I_B^θ used for supervising the rotation inference, as in Figure 3. In this work we assume this is never the case, so the network for f_θ must learn to infer the rotation offset in a self-supervised manner.

Fig. 4: Given I_A and a rotation stack S_B, the network f_θ finds I_B^θ by taking a softmax. Then, given I_B^θ and a rotation stack of map images S_A, the network outputs a softmaxed map image from S_A. A loss is applied to enforce the output of the second pass to be I_A, which in turn enforces the output of the first pass to be I_B^θ. Both instances of f_θ in the figure refer to the same network, at different forward passes.

For this reason, while following the same architecture as [39], our method for inferring rotation uses a different training strategy that enables self-supervised learning. In order for the network to produce the correct output, it must be able to infer the rotation from the solution space despite the modality difference between the map image I_A and the live data image I_B. We make the observation that if the network can infer the rotation offset from a stack of rotated live data images S_B, then, given a live data image, it should also be able to pick out the correct image from a stack of rotated map images S_A. Specifically, if the first pass correctly outputs I_B^θ, then the softmaxed map image selected from S_A should be I_A itself, as I_A and I_B^θ are rotation-aligned.

As such, to learn rotation inference self-supervised, we pass through the network f_θ twice. The first pass is identical to the supervised approach in Figure 3, where we denote the output softmaxed image as I_B^θ. I_B^θ is then used as input to the second pass through f_θ, together with a stack of rotated map images S_A. The rotation angles for S_A can be chosen randomly, and the order of S_A is shuffled such that the original non-rotated map image I_A can be at any index within S_A. Each image in S_A is concatenated with I_B^θ to form the input stack for passing through f_θ the second time. The network is supervised with a loss that enforces the output of the second pass to be the non-rotated map image I_A, which in turn enforces the output of the first pass to be I_B^θ, as I_B^θ is rotation-aligned with I_A. Our approach is shown in Figure 4. We use a fixed angular increment when forming the rotation stack S_B.

The estimate for the rotation offset, θ̂, can then be found from the arg-softmax over the rotation stack S_B.
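Continuing the sketch above, one training iteration of the two-pass scheme might look as follows; the random ±45° angle range for the map stack, the use of torchvision's rotate, and the choice of an L1 image loss are illustrative assumptions rather than the paper's exact recipe.

```python
import random
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate


def rotation_self_supervised_step(f_theta, map_img, live_img, theta0, optimiser, n_rot=11):
    """One self-supervised iteration of the two-pass rotation training."""
    # Pass 1: rotated live stack, conditioned on the map image.
    live_stack, live_angles = build_rotation_stack(live_img, theta0, n_rot)
    live_selected, live_w = soft_select(map_img, live_stack, f_theta)

    # Pass 2: rotated map stack (random angles, one of them zero, order shuffled),
    # conditioned on the pass-1 output.
    map_angles = [0.0] + [random.uniform(-45.0, 45.0) for _ in range(n_rot - 1)]
    random.shuffle(map_angles)
    map_stack = torch.stack([rotate(map_img, a) for a in map_angles])
    map_selected, _ = soft_select(live_selected, map_stack, f_theta)

    # Enforce that the second pass returns the original, non-rotated map image,
    # which in turn pushes the first pass to output the rotation-aligned live image.
    loss = F.l1_loss(map_selected, map_img)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

    # Rotation estimate: arg-softmax over the candidate headings of the live stack.
    theta_hat = live_angles[int(torch.argmax(live_w))]
    return loss.item(), theta_hat
```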

IV-B Image Generation

Given I_A and I_B^θ, we seek to generate a synthetic image Î_B as in (3), where Î_B is pixel-wise aligned with I_A. [39] learns the image generation function g in a supervised manner, concatenating I_A and I_B^θ and applying an encoder-decoder architecture, as shown in Figure 5. This is possible since a target for the synthetic image can be obtained by applying the ground truth transform.

Fig. 5: Architecture for image generation in prior supervised approach [39].
Fig. 6: Top: during pre-training, we can learn an appearance encoder E_app and a pose encoder E_pose that discovers the translation offset between an image of modality B and a shifted version of itself. Bottom: taking E_app and the decoder D and fixing their weights, we seek to learn a cross-modality pose encoder E_x, which discovers the translation offset between two images from different modalities. E_app and D provide the necessary geometric and appearance relationships used for learning E_x self-supervised.

To generate synthetic images self-supervised, we propose an architecture we call PASED, shown in Figure 6. PASED is trained in two steps: the first is a pre-training, intra-modality process that can be supervised (top half of Figure 6), while the second handles the cross-modality comparison (bottom half of Figure 6).

Taking two random images I_1 and I_2 in the live data modality from the training set, where I_1 and I_2 can be at arbitrary headings, we apply a known translation offset (Δx, Δy) to I_2. This forms an image I_2' that is a shifted version of I_2. We pass I_1 through an appearance encoder E_app that encodes its appearance and observed scenes. I_2 and I_2' are passed as inputs to a pose encoder E_pose that encodes the translation offset between the input images. The latent spaces from E_app and E_pose are combined before passing through a decoder D, which outputs a synthetic image: I_1 shifted by the translation (Δx, Δy). In other words, PASED discovers the translation offset between the two images passed as input to E_pose, and applies the latent translation encoding to the input image of E_app. The pre-training can be supervised as (Δx, Δy) is known; we can shift I_1 by (Δx, Δy) to produce the target. The fact that we use different images I_1 and I_2 as inputs to E_app and E_pose ensures that appearance and pose are disentangled from each other. As shown later, this allows modules of PASED to be separated and re-combined with newly learned modules.
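A minimal sketch of one pre-training step is given below; E_app, E_pose, and D stand for the appearance encoder, intra-modality pose encoder, and decoder (layer listings are in the supplementary material), while the circular shifting, the shift range, and the L1 loss are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F


def shift_image(img, dx, dy):
    """Circularly shift a (1, H, W) image by (dx, dy) pixels; a simple stand-in
    for translating and re-cropping the underlying data."""
    return torch.roll(img, shifts=(dy, dx), dims=(-2, -1))


def pased_pretrain_step(E_app, E_pose, D, img_1, img_2, optimiser, max_shift=32):
    """Pre-training: re-render img_1 under the translation that relates img_2 to a
    shifted copy of itself. Using different images for the appearance and pose
    branches keeps appearance and pose disentangled."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    img_2_shifted = shift_image(img_2, dx, dy)

    z_app = E_app(img_1.unsqueeze(0))                               # appearance latent
    z_pose = E_pose(torch.cat([img_2, img_2_shifted]).unsqueeze(0)) # encodes (dx, dy)
    out = D(torch.cat([z_app, z_pose], dim=1))                      # img_1 moved by (dx, dy)

    target = shift_image(img_1, dx, dy).unsqueeze(0)                # known, so supervised
    loss = F.l1_loss(out, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```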

In the second step, we fix the weights of E_app and D, which are optimised in the pre-training step. This narrows the self-supervision problem down to learning a cross-modality pose encoder E_x that discovers the translation offset between an image of modality A and another of modality B. Taking I_A and I_B^θ as inputs, E_x should encode the unknown translation offset (x, y) between them. Concurrently, I_B^θ is encoded by E_app, whose latent space is combined with the latent space produced by E_x before being decoded by D. This encoder-decoder combination generates a synthetic image Î_B, for which we do not have a target.

We can apply a known shift (Δx, Δy) to the centre position of I_A to query another map image I_A', where I_A' is offset from I_B^θ by an unknown translation (x + Δx, y + Δy). Using the same encoder-decoder combination as before, we can take I_A' and I_B^θ to generate a synthetic image Î_B'. Furthermore, given Î_B and the networks learned from pre-training, we can easily reproduce Î_B by encoding a zero shift. If we instead pass Î_B and a copy of it shifted by (Δx, Δy) to the pre-trained pose encoder E_pose, the latent space will encode a shift of (Δx, Δy). Combining this latent space with the appearance encoding of Î_B, we can decode a synthetic image that is Î_B moved by (Δx, Δy). Here (Δx, Δy) is a known value, as it is the translation offset applied to I_A to obtain I_A'.

That is, we can shift Î_B by (Δx, Δy) to get Î_B^Δ. Using Î_B and Î_B^Δ, we can generate a second synthetic image with the pre-trained E_pose, E_app, and D, as shown on the bottom right of Figure 6. A loss can then be established between the two synthetic images, where the latter is a target image created by modules with weights fixed. By back-propagation, the loss optimises the network E_x. Alternatively, we can use Î_B^Δ itself as the target, but using the generated target led to faster convergence.

For the loss to be minimised, two conditions must hold. First, Î_B must correctly capture the appearance and observed scenes in I_B^θ. Second, Î_B and Î_B' must have the correct translations (x, y) and (x + Δx, y + Δy) respectively. By satisfying these two constraints we ensure that E_x is able to discover the translation offset across modalities, and is compatible with the pre-trained networks E_app and D for image generation.
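The sketch below shows one way this training signal can be wired up; in particular, the exact construction of the fixed-weight target (here: the frozen pre-trained modules re-render Î_B under the known shift (Δx, Δy)) is our reading of the description above and should be treated as an assumption, as are the single-channel images and the L1 loss.

```python
import torch
import torch.nn.functional as F


def generate(E_x, E_app, D, map_img, live_rot):
    """Generate a synthetic image aligned with map_img, conditioned on live_rot."""
    z_pose = E_x(torch.cat([map_img, live_rot]).unsqueeze(0))   # cross-modality pose latent
    z_app = E_app(live_rot.unsqueeze(0))                        # appearance of the live image
    return D(torch.cat([z_app, z_pose], dim=1))


def cross_modality_step(E_x, E_app, E_pose, D, map_img, map_img_shifted,
                        live_rot, dxy, optimiser):
    """E_app, E_pose and D are frozen (pre-trained); only E_x is optimised.
    map_img_shifted was queried by moving the map centre by the known dxy."""
    syn = generate(E_x, E_app, D, map_img, live_rot)            # should align with map_img
    syn_from_shifted = generate(E_x, E_app, D, map_img_shifted, live_rot)

    # Target: re-render syn under the known shift using only the frozen modules.
    with torch.no_grad():
        dx, dy = dxy
        syn_moved = torch.roll(syn, shifts=(dy, dx), dims=(-2, -1))
        z_pose = E_pose(torch.cat([syn[0], syn_moved[0]]).unsqueeze(0))  # encodes dxy
        z_app = E_app(syn[0].unsqueeze(0))
        target = D(torch.cat([z_app, z_pose], dim=1))

    loss = F.l1_loss(syn_from_shifted, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```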

Fig. 7: The networks φ_L and φ_S are learned to project real live images and synthetic images to a joint embedding, where their translation offset can be found by maximising correlation.
Fig. 8: Overall data flow of our method at inference: given map image I_A and live data image I_B, based on the initial heading estimate we form a stack of rotated images S_B, from which f_θ discovers I_B^θ, that is, I_B rotated to be rotation-aligned with I_A. This process also infers the heading estimate. I_A and I_B^θ are used to generate a synthetic image Î_B that has the same appearance and observed scene as I_B^θ and is pose-aligned with I_A. Î_B and I_B^θ are projected to deep embeddings, where the estimate for the translation offset is found by correlation maximisation.

IV-C Pose Estimation

Taking Î_B and I_B^θ, we embed them into a joint space, where their translation offset is found by maximising correlation on the learned embeddings. This can be performed efficiently in the Fourier domain, as is done in prior works that use a similar approach [5, 4, 39]. In this step, we infer (x̂, ŷ), our posterior estimate of the translation.

The embeddings are learned to further ensure that the synthetic image and the live image can be correctly correlated. Without ground truth, we can self-supervise using a similar approach as in learning PASED, by applying a known shift. The architecture for learning the embeddings is shown in Figure 7, where we denote the embedding networks for real and synthetic images as φ_L and φ_S respectively. Given the learned deep embeddings of I_B^θ and Î_B, the translation offset found by correlation maximisation is (x, y). If we replace Î_B with Î_B' and reverse the order, the offset found will be -(x + Δx, y + Δy). The sum of the two offsets is known, and can be used to establish a loss term. As in Section IV-B, Î_B' is obtained by shifting the centre of the map image I_A to query I_A' and generating from I_A' and I_B^θ.
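The correlation step itself is standard; a minimal sketch of FFT-based cross-correlation over the embeddings is shown below, assuming single-channel embeddings of equal size and omitting any normalisation or sub-pixel refinement the full system may use.

```python
import torch


def translation_by_correlation(emb_a, emb_b):
    """Find the integer pixel shift that best aligns emb_b to emb_a.

    emb_a, emb_b: (H, W) embeddings of the same size. Cross-correlation is
    computed in the Fourier domain, and the argmax of the correlation surface
    gives the offset estimate.
    """
    Fa = torch.fft.rfft2(emb_a)
    Fb = torch.fft.rfft2(emb_b)
    corr = torch.fft.irfft2(Fa * torch.conj(Fb), s=emb_a.shape[-2:])

    idx = torch.argmax(corr)
    H, W = corr.shape[-2:]
    dy, dx = divmod(int(idx), W)
    # Wrap offsets into a signed range, since the correlation is circular.
    if dy > H // 2:
        dy -= H
    if dx > W // 2:
        dx -= W
    return dx, dy
```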

V Experimental Validation

The overall pipeline for data flow at inference time is shown in Figure 8. Inference runs on a single 1080 Ti GPU. We evaluate on a large number of public, real-world datasets collected with vehicles equipped with on-board range sensors. The datasets come with reasonably accurate metric ground truth, though we noticed that the GPS/INS solutions can drift by up to a few metres in certain places.

We add large artificial pose offsets to the ground truth when querying for a satellite image, thereby simulating a realistic robot navigation scenario where the initial pose estimate is good enough for place recognition, but too coarse to serve as the robot's metric pose. Using a map (e.g., satellite) image queried at this coarse initial pose estimate, our method solves metric localisation by comparing against the live sensor data. The true pose offsets are hidden during training, as our method is self-supervised, and are only revealed at test time for evaluation purposes.

The artificial offset is chosen such that the initial estimate has an unknown heading error within a fixed range; given the initial estimate, the rotation inference must therefore cover a solution space at least as large to guarantee that the correct solution can be found. We likewise apply a pixel-wise translation error within a fixed range of pixels; the corresponding metric error depends on the image resolution used for a specific experiment.
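A sketch of how such perturbed queries can be produced is shown below; the sampling ranges and the ground resolution are placeholders (the exact values are not reproduced here), and query_satellite_image is a hypothetical helper around the map tile API.

```python
import random


def perturbed_query(gt_easting, gt_northing, gt_heading_deg,
                    max_trans_px=48, max_rot_deg=30.0, metres_per_pixel=0.85):
    """Simulate a coarse place-recognition prior: perturb the ground-truth pose,
    query the map at the perturbed position, and keep the true offset hidden from
    training. All ranges and the resolution here are illustrative placeholders."""
    dx_px = random.uniform(-max_trans_px, max_trans_px)
    dy_px = random.uniform(-max_trans_px, max_trans_px)
    dtheta = random.uniform(-max_rot_deg, max_rot_deg)

    init_easting = gt_easting + dx_px * metres_per_pixel
    init_northing = gt_northing + dy_px * metres_per_pixel
    init_heading = gt_heading_deg + dtheta

    map_img = query_satellite_image(init_easting, init_northing)  # hypothetical helper
    # (dx_px, dy_px, dtheta) is used only at test time, for evaluation.
    return map_img, (init_easting, init_northing, init_heading), (dx_px, dy_px, dtheta)
```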

V-A Radar Localisation Against Satellite Imagery

We evaluate on two datasets with FMCW radar and GPS: the Oxford Radar RobotCar Dataset [2] and the MulRan Dataset [20]. The satellite images for RobotCar are queried using the Google Maps Platform [18]. For MulRan they are queried using the Bing Maps Platform [6], as high-definition Google satellite imagery is unavailable at the place of interest.

We benchmark against the prior supervised method RSL-Net [39], which was originally evaluated only on the RobotCar Dataset. Both datasets contain repeated traversals of the same routes. We separately train, validate, and test for every dataset, splitting the data as in Figure 9. For the RobotCar Dataset, we split the trajectories in the same way as [39] for a fair comparison: the training set consists of training data from sequences no. 2, no. 5, and no. 6, while we test on the test data from sequence no. 2. For the MulRan Dataset, we use sequences KAIST 01 and Sejong 01. The RobotCar test set features an urban environment, while KAIST 01 is in a campus and Sejong 01 is primarily a highway.

Fig. 9: Training (blue), validation (green), and test (red) trajectories for RobotCar (top left), KAIST (top right), Sejong (bottom left) and 20111003_drive0034 (bottom right). Certain data are removed to avoid overlap between the splits.
Mean Error                            x (m)   y (m)   heading (°)   x (px)   y (px)
RobotCar (ours)                        3.44    5.40      3.03        3.97     6.23
MulRan (ours)                          6.02    7.02      2.92        7.64     8.91
RobotCar (RSL-Net [39], supervised)    2.74    4.26      3.12        3.16     4.92
MulRan (RSL-Net [39], supervised)      5.85    7.11      1.88        7.42     9.03
TABLE I: Mean error for radar localisation against satellite imagery.

We test on every fifth frame, resulting in 201 frames from the RobotCar Dataset and 358 from the MulRan Dataset. The satellite image resolution differs between RobotCar and MulRan; the mean errors are therefore reported in Table I in both metres and pixels.

V-B Lidar Localisation Against Satellite Imagery

For this experiment, we evaluate on the RobotCar Dataset [2], which also has two Velodyne HDL-32E lidars mounted in a tilted configuration, and the KITTI raw dataset [17], which has a Velodyne HDL-64E lidar and GPS data.

For the RobotCar Dataset, the trajectories are split into training, validation, and test sets approximately the same way as in Section V-A. For the KITTI Dataset, the training set includes sequences 20110929_drive0071, 20110930_drive0028, and 20111003_drive0027. Sequence 20110926_drive0117 is used for validation. Finally, data in 20111003_drive0034 are split into training and test, as shown in Figure 9. To turn 3D lidar point-clouds into lidar images, the point-clouds are projected onto the x-y plane. We discard points with z values smaller than zero to remove ground points when creating the lidar images.
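A minimal sketch of this projection is shown below, assuming points are expressed in the sensor frame with z pointing up; the image size, resolution, and binary occupancy encoding are illustrative choices rather than the exact settings used in our experiments.

```python
import numpy as np


def lidar_to_bev_image(points, image_size=256, metres_per_pixel=0.5):
    """Project a 3D point-cloud (N, 3) onto the x-y plane as a top-down image.

    Points with z < 0 are discarded (a crude ground-removal step); the surviving
    points are binned into an occupancy image centred on the sensor.
    """
    pts = points[points[:, 2] >= 0.0]                  # drop (approximate) ground points
    half_extent = image_size * metres_per_pixel / 2.0
    keep = (np.abs(pts[:, 0]) < half_extent) & (np.abs(pts[:, 1]) < half_extent)
    pts = pts[keep]

    # Convert metric x-y coordinates to pixel indices (sensor at the image centre).
    cols = ((pts[:, 0] + half_extent) / metres_per_pixel).astype(int)
    rows = ((half_extent - pts[:, 1]) / metres_per_pixel).astype(int)

    img = np.zeros((image_size, image_size), dtype=np.float32)
    img[np.clip(rows, 0, image_size - 1), np.clip(cols, 0, image_size - 1)] = 1.0
    return img
```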

Since lidars have a shorter range than radars, we use satellite images at a greater zoom level (finer resolution) for both RobotCar and KITTI. The test set consists of 200 frames for RobotCar and 253 for KITTI. The test set for KITTI features a residential area. The results are reported in Table II.

Mean Error            x (m)   y (m)   heading (°)   x (px)   y (px)
RobotCar (ours)        1.54    1.85      2.34        3.55     4.27
KITTI (ours)           3.05    3.13      1.67        6.64     6.82
RobotCar (RSL-Net)     2.31    2.55      2.08        5.33     5.89
KITTI (RSL-Net)        2.45    2.79      1.59        5.34     6.08
TABLE II: Mean error for lidar localisation against satellite imagery.

V-C Radar Localisation Against Prior Lidar Map

Fig. 10: Estimated pose (blue) vs. ground truth pose (red) for localising a radar (left) and a lidar (right) against satellite imagery. Our system continuously tracks the vehicle's pose over more than a kilometre, occasionally falling back to odometry for the radar experiment (green). Our system is stand-alone and requires GPS only for the first frame.
Fig. 11: Results of CycleGAN: satellite image (left), ground truth radar image (middle), synthetic radar image (right). This led to large localisation errors, as the synthetic image does not contain the scenes observed by the live radar image.
Fig. 12: Images at various stages of our method: map image (a), live data image (b), output of rotation inference (c), embedding (d), pixel-wise aligned ground truth (e), synthetic image (f), embedding (g). From top to bottom: radar localisation against satellite imagery evaluated on MulRan, lidar localisation against satellite imagery evaluated on RobotCar and KITTI, radar localisation against lidar map evaluated on MulRan.
Mean Error             x (m)   y (m)   heading (°)   x (px)   y (px)
RobotCar (ours)         2.21    2.57      2.65        2.55     2.97
MulRan (ours)           3.57    3.26      2.15        4.53     4.13
RobotCar (RSL-Net)      2.66    3.41      2.45        3.07     3.93
MulRan (RSL-Net)        3.37    2.61      1.40        4.28     3.32
RobotCar (CycleGAN)     6.41    9.05      2.65        7.40    10.44
MulRan (CycleGAN)       4.84    4.39      2.15        6.14     5.58
TABLE III: Mean error for radar localisation against a prior lidar map.

Though our method is designed for localising against satellite imagery, we show it can also handle more standard forms of cross-modality localisation. Here we build a lidar map using a prior traversal, and localise using radar from a later traversal.

We demonstrate on the RobotCar and MulRan datasets, using the same resolution as in Section V-A. For RobotCar, we use ground truth to build a lidar map from sequence no. 2. Radar data in the training sections from sequences no. 5 and no. 6, as in Figure 9, form the training set, while the test section from sequence no. 5 forms the test set. For MulRan, lidar maps are built from KAIST 01 and Sejong 01, and we localise using radar data from KAIST 02 and Sejong 02, which are split into training, validation, and test sets. This results in a test set of 201 frames from RobotCar and 272 frames from MulRan. The localisation results are shown in Table III.

This experiment is more amenable to naive image generation methods such as CycleGAN [48] than the previous experiments, as the fields-of-view are considerably more compatible when both modalities come from range sensors. In Table III, we show results where we replace the image generation stage of our method by CycleGAN and keep the other modules. The localisation results are, however, much worse when modality A is satellite imagery, as shown qualitatively in Figure 11.

V-D Online Pose-Tracking System

In prior experiments we assumed place recognition is always available, providing a coarse initial estimate for every frame. Here we present a stand-alone pose-tracking system by continuously localising against satellite imagery. Given a coarse initial estimate (e.g., from GPS) for the first frame, the vehicle localises and computes its pose within the satellite map. The initial estimate for every frame onward is then set to be the computed pose of the previous frame. We only need place recognition once at the very beginning; the vehicle then tracks its pose onward without relying on any other measurements.

V-D1 Introspection

As localising using satellite imagery is challenging, the result will not always be accurate. Our method, however, naturally allows for introspection. A synthetic image Î_B was generated from I_A and I_B^θ. We can apply a known small translation offset to I_A to form I_A'. Taking I_A' and I_B^θ, we can generate Î_B'. Finally, we can compute a translation offset between Î_B and Î_B' by passing them through the learned embeddings and maximising correlation.

Let e be the difference between this computed offset and the known offset. A large value of e indicates that the generated images are erroneous. This allows us to examine the solution quality; our system falls back to using odometry for dead-reckoning when e exceeds a threshold. We do not require high-quality odometry; rather, we use a naive approach that directly maximises correlation between two consecutive frames, without any learned modules. In our experiments, we fix the known small offset and set the threshold for e to 5.
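A sketch of this introspection check is shown below, reusing the generate and translation_by_correlation helpers sketched in Section IV; the fallback is the naive correlation of consecutive live frames described above, and the specific shift value and threshold handling are illustrative.

```python
import math
import torch


def introspect_and_localise(E_x, E_app, D, phi_L, phi_S,
                            map_img, map_img_shifted, live_rot, dxy_known,
                            prev_live_rot, error_threshold=5.0):
    """Check solution quality by re-localising between two synthetic images whose
    true relative shift is known; fall back to naive odometry if the check fails."""
    syn = generate(E_x, E_app, D, map_img, live_rot)            # aligned with map_img
    syn_p = generate(E_x, E_app, D, map_img_shifted, live_rot)  # aligned with shifted map

    dx_hat, dy_hat = translation_by_correlation(phi_S(syn)[0, 0], phi_S(syn_p)[0, 0])
    err = math.hypot(dx_hat - dxy_known[0], dy_hat - dxy_known[1])

    if err < error_threshold:
        # Trust the cross-modality solution: correlate the live embedding against syn.
        offset = translation_by_correlation(phi_L(live_rot.unsqueeze(0))[0, 0],
                                            phi_S(syn)[0, 0])
        return offset, err
    # Dead-reckon: naive correlation between consecutive live frames, no learned modules.
    return translation_by_correlation(prev_live_rot[0], live_rot[0]), err
```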

V-D2 Results

We conduct two experiments on the test set of RobotCar: one where we track a radar using satellite imagery, and one where we track a lidar. For both experiments we run localisation at a fixed rate. The results are shown in Figure 10. If the solution error is too large, then the initial estimate will be too far off to provide sufficient overlap between the next queried satellite image and the live data, resulting in losing track of the vehicle. Though the solution error can be large at times, our system continuously localises the vehicle for over a kilometre without completely losing track. For the lidar experiment, the solutions are sufficiently accurate that no odometry is required. Our experiments are single-frame localisations, and we make no attempt at windowed/batch optimisation or loop closures.

V-E Further Qualitative Results

Additional qualitative results are presented in Figure 12, showing various stages of our method for different modalities.

Vi Conclusion and Future Work

We present a self-supervised learning method to address cross-modality metric localisation between satellite imagery and on-board range sensors, without using metrically accurate ground truth for training. Our method is validated across a large number of experiments for multiple modes of localisation, with results on par with a prior supervised approach. A coarse initial pose estimate is needed for our method to compute metric localisation; an extension would be to also solve place recognition for a range sensor within a large satellite map.

Acknowledgments

We thank Giseop Kim from IRAP Lab, KAIST for providing GPS data for the MulRan Dataset.

References

  • [1] R. Aldera, D. De Martini, M. Gadd, and P. Newman (2019) Fast Radar Motion Estimation with a Learnt Focus of Attention Using Weak Supervision. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1190–1196. Cited by: §II-C.
  • [2] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner (2020) The Oxford Radar RobotCar Dataset: A Radar Extension to the Oxford RobotCar Dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris. External Links: Link Cited by: §V-A, §V-B.
  • [3] D. Barnes and I. Posner (2020) Under the Radar: Learning to Predict Robust Keypoints for Odometry Estimation and Metric Localisation in Radar. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris. External Links: Link Cited by: §II-C.
  • [4] D. Barnes, R. Weston, and I. Posner (2019) Masking by Moving: Learning Distraction-Free Radar Odometry from Pose Information. In Conference on Robot Learning (CoRL), External Links: Link Cited by: §II-C, §IV-C.
  • [5] I. A. Barsan, S. Wang, A. Pokrovsky, and R. Urtasun (2018) Learning to Localize Using a LiDAR Intensity Map.. In CoRL, pp. 605–616. Cited by: §II-C, §II-C, §IV-C.
  • [6] Bing Maps. Note: https://docs.microsoft.com/en-us/bingmaps/ Cited by: §V-A.
  • [7] F. Boniardi, T. Caselitz, R. Kümmerle, and W. Burgard (2017) Robust LiDAR-based Localization in Architectural Floor Plans. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3318–3324. Cited by: §II-B.
  • [8] M. A. Brubaker, A. Geiger, and R. Urtasun (2013) Lost! Leveraging the Crowd for Probabilistic Visual Self-localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3057–3064. Cited by: §II-B.
  • [9] P. J. Carle and T. D. Barfoot (2010) Global Rover Localization by Matching Lidar and Orbital 3D Maps. In 2010 IEEE International Conference on Robotics and Automation, pp. 881–886. Cited by: §II-B.
  • [10] T. Caselitz, B. Steder, M. Ruhnke, and W. Burgard (2016) Monocular Camera Localization in 3D Lidar Maps. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1926–1931. Cited by: §II-B.
  • [11] S. H. Cen and P. Newman (2019) Radar-Only Ego-Motion Estimation in Difficult Settings via Graph Matching. arXiv preprint arXiv:1904.11476. Cited by: §II-C.
  • [12] N. Chebrolu, P. Lottes, T. Läbe, and C. Stachniss (2019) Robot Localization Based on Aerial Images for Precision Agriculture Tasks in Crop Fields. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1787–1793. Cited by: §II-A.
  • [13] Y. Cho, G. Kim, and A. Kim (2019) DeepLO: Geometry-Aware Deep LiDAR Odometry. arXiv preprint arXiv:1902.10562. Cited by: §II-C, §II-C.
  • [14] L. de Paula Veronese, E. de Aguiar, R. C. Nascimento, J. Guivant, F. A. A. Cheein, A. F. De Souza, and T. Oliveira-Santos (2015) Re-emission and Satellite Aerial Maps Applied to Vehicle Localization on Urban Environments. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4285–4290. Cited by: §II-A, §III-A.
  • [15] C. U. Dogruer, A. B. Koku, and M. Dolen (2010) Outdoor Mapping and Localization Using Satellite Images. Robotica 28 (7), pp. 1001–1012. Cited by: §II-A.
  • [16] G. Floros, B. Van Der Zander, and B. Leibe (2013) OpenStreetSLAM: Global Vehicle Localization Using OpenStreetMaps. In 2013 IEEE International Conference on Robotics and Automation, pp. 1054–1059. Cited by: §II-B.
  • [17] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision Meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR). Cited by: §V-B.
  • [18] Google Maps Platform. Note: https://developers.google.com/maps/documentation/maps-static/intro/ Cited by: §V-A.
  • [19] R. S. Kaminsky, N. Snavely, S. M. Seitz, and R. Szeliski (2009) Alignment of 3D Point Clouds to Overhead Images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 63–70. Cited by: §II-A, §III-C.
  • [20] G. Kim, Y. S. Park, Y. Cho, J. Jeong, and A. Kim (2020) MulRan: Multimodal Range Dataset for Urban Place Recognition. In IEEE International Conference on Robotics and Automation (ICRA), Note: Submitted. Under Review. Cited by: §V-A.
  • [21] R. Kümmerle, B. Steder, C. Dornhege, A. Kleiner, G. Grisetti, and W. Burgard (2011) Large Scale Graph-based SLAM using Aerial Images as Prior Information. Autonomous Robots 30 (1), pp. 25–39. Cited by: §II-A, §III-A.
  • [22] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse Image-to-Image Translation via Disentangled Representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §II-D.
  • [23] K. Lenc and A. Vedaldi (2015) Understanding Image Representations by Measuring Their Equivariance and Equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 991–999. Cited by: §IV-A.
  • [24] K. Y. K. Leung, C. M. Clark, and J. P. Huissoon (2008) Localization in Urban Environments by Matching Ground Level Video Images with an Aerial Image. In 2008 IEEE International Conference on Robotics and Automation, pp. 551–556. Cited by: §II-A.
  • [25] J. Levinson and S. Thrun (2010) Robust Vehicle Localization in Urban Environments Using Probabilistic Maps. In 2010 IEEE International Conference on Robotics and Automation, pp. 4372–4378. Cited by: §I.
  • [26] A. Li, V. I. Morariu, and L. S. Davis (2014) Planar Structure Matching Under Projective Uncertainty for Geolocation. In European Conference on Computer Vision, pp. 265–280. Cited by: §II-A.
  • [27] Q. Li, S. Chen, C. Wang, X. Li, C. Wen, M. Cheng, and J. Li (2019) LO-Net: Deep Real-time Lidar Odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8473–8482. Cited by: §II-C.
  • [28] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised Image-to-Image Translation Networks. In Advances in Neural Information Processing Systems, pp. 700–708. Cited by: §II-D.
  • [29] W. Lu, G. Wan, Y. Zhou, X. Fu, P. Yuan, and S. Song (2019-10) DeepVCP: An End-to-End Deep Neural Network for Point Cloud Registration. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §II-C.
  • [30] W. Lu, Y. Zhou, G. Wan, S. Hou, and S. Song (2019) L3-Net: Towards Learning Based LiDAR Localization for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6389–6398. Cited by: §II-C.
  • [31] M. Mielle, M. Magnusson, and A. J. Lilienthal (2019) The Auto-Complete Graph: Merging and Mutual Correction of Sensor and Prior Maps for SLAM. Robotics 8 (2), pp. 40. Cited by: §II-B.
  • [32] M. Noda, T. Takahashi, D. Deguchi, I. Ide, H. Murase, Y. Kojima, and T. Naito (2010) Vehicle Ego-Localization by Matching in-Vehicle Camera Images to an Aerial Image. In Asian Conference on Computer Vision, pp. 163–173. Cited by: §II-A.
  • [33] Y. S. Park, J. Jeong, Y. Shin, and A. Kim (2019) Radar Dataset for Robust Localization and Mapping in Urban Environment. In ICRA Workshop on Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR, Montreal. Cited by: §I.
  • [34] M. P. Parsley and S. J. Julier (2010) Towards the Exploitation of Prior Information in SLAM. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2991–2996. Cited by: §II-A.
  • [35] O. Pink (2008) Visual Map Matching and Localization Using a Global Feature Map. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7. Cited by: §II-A.
  • [36] S. Saftescu, M. Gadd, D. De Martini, D. Barnes, and P. Newman (2020) Kidnapped Radar: Topological Radar Localisation using Rotationally-Invariant Metric Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris. External Links: Link Cited by: §I, §II-C.
  • [37] T. Senlet and A. Elgammal (2011) A Framework for Global Vehicle Localization Using Stereo Images and Satellite and Road maps. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2034–2041. Cited by: §II-A.
  • [38] Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos (2018) Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 650–665. Cited by: §II-D.
  • [39] T. Y. Tang, D. De Martini, D. Barnes, and P. Newman (2020-04) RSL-Net: Localising in Satellite Images From a Radar on the Ground. IEEE Robotics and Automation Letters 5 (2), pp. 1087–1094. External Links: Document, ISSN 2377-3774 Cited by: §I, §I, §I, §III-B, §III, Fig. 3, Fig. 5, §IV-A, §IV-A, §IV-A, §IV-A, §IV-A, §IV-B, §IV-C, §V-A, TABLE I.
  • [40] X. Wang, R. J. Marcotte, and E. Olson GLFP: Global Localization from a Floor Plan. Cited by: §II-B.
  • [41] X. Wang, S. Vozar, and E. Olson (2017) FLAG: Feature-based Localization between Air and Ground. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3178–3184. Cited by: §II-B.
  • [42] R. Weston, S. Cen, P. Newman, and I. Posner (2019) Probably Unknown: Deep Inverse Sensor Modelling Radar. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5446–5452. Cited by: §II-C.
  • [43] R. W. Wolcott and R. M. Eustice (2014) Visual Localization within Lidar Maps for Automated Urban Driving. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 176–183. Cited by: §II-B.
  • [44] R. W. Wolcott and R. M. Eustice (2015) Fast LIDAR Localization Using Multiresolution Gaussian Mixture Maps. In 2015 IEEE international Conference on Robotics and Automation (ICRA), pp. 2814–2821. Cited by: §I.
  • [45] W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy (2019) Transgaga: Geometry-Aware Unsupervised Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8012–8021. Cited by: §II-D.
  • [46] X. Xing, T. Han, R. Gao, S. Zhu, and Y. N. Wu (2019) Unsupervised Disentangling of Appearance and Geometry by Deformable Generator Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10354–10363. Cited by: §II-D.
  • [47] Y. Xu, V. John, S. Mita, H. Tehrani, K. Ishimaru, and S. Nishino (2017) 3D Point Cloud Map Based Vehicle Localization Using Stereo Camera. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 487–492. Cited by: §II-B.
  • [48] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §II-D, §III-C, §III-C, §V-C.

Supplementary Material

VI-A Network Architecture

Here we provide details on the network architecture of the various networks used in our method. We make use of the following abbreviations:

  • RP(p): 2D reflection padding of p

  • Conv(c_in, c_out, k, s, p): convolution with c_in input channels, c_out output channels, kernel size k, stride s, padding p, and bias

  • IN: instance normalisation

  • LReLU(α): leaky ReLU with negative slope α

  • Drop(r): dropout with ratio r

  • ConvT(c_in, c_out, k, s, p, op): transposed convolution with c_in input channels, c_out output channels, kernel size k, stride s, padding p, output padding op, and bias

 

Rotation Inference Function
Input shape: N × 4 × H × W, where N is the number of rotations in the stack
Conv(4, 32, 3, 2, 1) + IN + ReLU
Conv(32, 64, 3, 2, 1) + IN + ReLU
Conv(64, 128, 3, 2, 1) + IN + ReLU
Conv(128, 256, 3, 2, 1) + IN + ReLU
Latent shape: N × 256 × H/16 × W/16
Sum along the channel and spatial dimensions + Softmax + Reshape
Latent vector shape: N × 1, which are the softmax weights
Matrix-multiply the softmax weights with the input stack
Shape of the multiplication product: 4 × H × W
Extract the associated channel(s) to get I_B^θ (or the softmaxed map image during training)
TABLE IV: Architecture of the rotation inference network f_θ

 

Appearance Encoder
RP(3) + Conv(1, 16, 7, 1, 0) + IN + ReLU
Conv(16, 32, 3, 2, 1) + IN + ReLU
Conv(32, 64, 3, 2, 1) + IN + ReLU
Conv(64, 128, 3, 2, 1) + IN + ReLU
Conv(128, 256, 3, 2, 1) + IN + ReLU
ResNet blocks:
Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5)
Conv(256, 256, 3, 1, 0) + IN

 

Intra-Modality Pose Encoder
RP(3) + Conv(2, 16, 7, 1, 0) + IN + ReLU
Conv(16, 32, 3, 2, 1) + IN + ReLU
Conv(32, 64, 3, 2, 1) + IN + ReLU
Conv(64, 128, 3, 2, 1) + IN + ReLU
Conv(128, 256, 3, 2, 1) + IN + ReLU
ResNet blocks:
Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5)
Conv(256, 256, 3, 1, 0) + IN

 

Cross-Modality Pose Encoder
RP(3) + Conv(4, 16, 7, 1, 0) + IN + ReLU
Conv(16, 32, 3, 2, 1) + IN + ReLU
Conv(32, 64, 3, 2, 1) + IN + ReLU
Conv(64, 128, 3, 2, 1) + IN + ReLU
Conv(128, 256, 3, 2, 1) + IN + ReLU
ResNet blocks:
Conv(256, 256, 3, 1, 0) + IN + ReLU + Drop(0.5)
Conv(256, 256, 3, 1, 0) + IN

 

Decoder
ConvT(512, 256, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(256, 128, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(128, 64, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(64, 32, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
RP(3) + Conv(32, 1, 7, 1, 0) + Sigmoid
TABLE V: Architecture of PASED (appearance encoder, pose encoders, and decoder) for image generation

 

Encoder
RP(3) + Conv(1, 32, 7, 1, 0) + IN + ReLU
Conv(32, 64, 3, 2, 1) + IN + ReLU
Conv(64, 128, 3, 2, 1) + IN + ReLU
Conv(128, 256, 3, 2, 1) + IN + ReLU
Conv(256, 512, 3, 2, 1) + IN + ReLU
ResNet blocks:
Conv(512, 512, 3, 1, 0) + IN + ReLU + Drop(0.5)
Conv(512, 512, 3, 1, 0) + IN

 

Decoder
ConvT(512, 256, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(256, 128, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(128, 64, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
ConvT(64, 32, 3, 2, 1, 1) + IN + ReLU + Drop(0.5)
RP(3) + Conv(32, 1, 7, 1, 0) + Sigmoid
TABLE VI: Image generation for our implementation of RSL-Net [39]

 

Embedding Networks φ_L and φ_S
Conv(1, 32, 4, 2, 0)
LReLU(0.2) + Conv(32, 64, 4, 2, 0) + IN
LReLU(0.2) + Conv(64, 128, 4, 2, 0) + IN
LReLU(0.2) + Conv(128, 256, 4, 2, 0) + IN
LReLU(0.2) + Conv(256, 512, 4, 2, 0) + IN
LReLU(0.2) + ReLU + Conv(512, 1024, 4, 2, 0)
ReLU + ConvT(1024, 512, 4, 2, 1, 0) + IN
ReLU + ConvT(512, 256, 4, 2, 1, 0) + IN
ReLU + ConvT(256, 128, 4, 2, 1, 0) + IN
ReLU + ConvT(128, 64, 4, 2, 1, 0) + IN
ReLU + ConvT(64, 32, 4, 2, 1, 0) + IN
ReLU + ConvT(32, 1, 4, 2, 1, 0) + Sigmoid
With skip connections in-between intermediate layers
TABLE VII: U-Net architecture for learning embeddings

The network architectures are shown in Tables IV to VII. For comparison against the prior supervised approach, we use the same architectures where possible. We implemented its image generation network to have the same latent space size at the bottleneck, and the same number of down-samples and up-samples, as ours.
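For concreteness, a PyTorch rendering of the appearance encoder listed above might look like the following; the number of ResNet blocks is left as a parameter since it is not reproduced here, and the reflection padding inside the residual blocks is an assumption needed to keep the 3 × 3, padding-0 convolutions shape-preserving.

```python
import torch
import torch.nn as nn


class ResnetBlock(nn.Module):
    """CycleGAN-style residual block matching the 'Conv(256, 256, 3, 1, 0)' rows;
    reflection padding is assumed so the block preserves spatial size."""
    def __init__(self, ch=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3, 1, 0),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3, 1, 0),
            nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)


class AppearanceEncoder(nn.Module):
    """Appearance encoder following the layer listing in Table V."""
    def __init__(self, n_resnet_blocks=4):
        super().__init__()

        def down(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1),
                                 nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True))

        self.net = nn.Sequential(
            nn.ReflectionPad2d(3), nn.Conv2d(1, 16, 7, 1, 0),
            nn.InstanceNorm2d(16), nn.ReLU(inplace=True),
            down(16, 32), down(32, 64), down(64, 128), down(128, 256),
            *[ResnetBlock(256) for _ in range(n_resnet_blocks)])

    def forward(self, x):
        return self.net(x)
```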

VI-B Handling Larger Initial Offset

Models for the experiments in Sections V-A, V-B, and V-C were trained assuming an initial translation offset within a fixed range of pixels, which corresponds to a substantial metric offset for the radar experiments. In practice, the amount of offset our method can handle depends on the effective receptive field of the convolutional layers in the encoder and decoder networks for generating images. If the offset is too large, the networks will not be able to encode and decode the information needed to correctly generate Î_B.

Our method, however, naturally allows for a strategy to deal with larger initial offsets, without needing to train different models. At inference, rather than using just I_A during image generation, we can apply known translation offsets to shift I_A into each of the four quadrants. This is depicted in Figure 13, where, as an example, I_A is shifted by fixed pixel offsets to form the four shifted map images.

Fig. 13: The map image I_A is shifted into each of the four quadrants.
Fig. 14: The unknown translation offset between I_A and I_B^θ is larger than the networks are designed for.
Fig. 15: If we shift I_A into the correct quadrant to form a shifted map image, then the offset between it and I_B^θ is within what the networks are designed for. In this case, generating both synthetic images should be accurate, as the offsets in both cases are within what the networks are trained for.
Fig. 16: The resulting synthetic image will still be erroneous if an incorrect quadrant is selected. Here the offset between the shifted map image and I_B^θ is larger than what the networks can handle. In this case, generating both synthetic images will be problematic due to the issue with large offsets.

Figure 14 depicts a case where the translation offset in the satellite image is larger than the range our networks can handle, which is shown by the dashed box around the origin. However, the offset between I_B^θ and the correctly shifted map image is within the range our networks can handle, as shown in Figure 15.

Using I_A and I_B^θ during image generation might therefore lead to incorrect results when generating Î_B, as the offset between them is too large. However, we can also generate a synthetic image using a shifted map image and I_B^θ, and such a combination does not suffer from the issue with large offsets.

The question is then which shifted map image to choose. As shown in Figure 16, generating with an incorrectly shifted map image and I_B^θ will also be problematic, as this combination suffers from the issue with large offsets. The selection cannot be made ahead of the image generation, as the true offset is unknown.

We can instead generate five versions of the synthetic image, one using I_A and one using each of the four shifted map images, and introspect the quality of each. To do so, we apply a known shift to each map image to query another map image, as in the introspection method of Section V-D1 and as depicted in Figures 15 and 16.

For each shift (and zero shift for I_A), we take the corresponding pair of map images together with I_B^θ to generate a pair of synthetic images, which should be pixel-wise aligned with their respective map images. If generating the first synthetic image is problematic due to large offsets, then so will be generating the second, as shown in Figure 16. On the other hand, if the networks can correctly produce the first, they can also correctly produce the second, as shown in Figure 15.

For each shift, we can compute a translation offset between the pair of synthetic images using the learned embeddings, along with an error term e as in Section V-D1. Among the five pairs of synthetic images, the one that results in the smallest e is used and passed downstream to solve for the translation. This forms an augmented approach for handling initial offsets larger than what the models are trained for.
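A sketch of this selection logic is shown below, reusing the helpers sketched earlier; the list of candidate quadrant shifts and the query_map helper for retrieving shifted map images are illustrative assumptions.

```python
import math


def localise_with_quadrant_shifts(E_x, E_app, D, phi_S, query_map, live_rot,
                                  dxy_known, shift_candidates):
    """Generate a candidate synthetic image for the original map image and for each
    quadrant-shifted map image, keep the candidate with the smallest introspection
    error e, and return it together with its map image for downstream pose solving."""
    best = None
    for shift in [(0, 0)] + list(shift_candidates):      # zero shift plus four quadrants
        map_img = query_map(shift)                       # hypothetical map query helper
        map_img_chk = query_map((shift[0] + dxy_known[0], shift[1] + dxy_known[1]))

        syn = generate(E_x, E_app, D, map_img, live_rot)
        syn_chk = generate(E_x, E_app, D, map_img_chk, live_rot)
        dx_hat, dy_hat = translation_by_correlation(phi_S(syn)[0, 0],
                                                    phi_S(syn_chk)[0, 0])
        err = math.hypot(dx_hat - dxy_known[0], dy_hat - dxy_known[1])

        if best is None or err < best[0]:
            best = (err, shift, map_img, syn)
    return best   # (smallest e, chosen shift, its map image, its synthetic image)
```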

Mean Error                             x (m)   y (m)   x (px)   y (px)
Direct                                  6.62    7.88    7.64     9.09
Augmented                               4.67    5.54    5.39     6.40
Ours (original offset range)            3.44    5.40    3.97     6.23
TABLE VIII: Radar localisation against satellite imagery evaluated on the test set of RobotCar, where the initial translation offset is larger than the range used for training.

Table VIII shows results on the RobotCar Dataset for radar localisation against satellite imagery, where the initial offset is now larger than the training range. Taking a model trained for the smaller offset range and evaluating it directly, the errors are high compared to the results in Sections V-A, V-B, and V-C. However, with our augmented approach of shifting and generating multiple times, we can handle larger offsets without sacrificing significant accuracy. The augmented method was not used in the experiments in Section V due to its increased computational cost.

VI-C Further Implementation Details

Our method is implemented in PyTorch. We use one learning rate for training the rotation inference network f_θ and the image generation networks, and a different learning rate for the embedding networks φ_L and φ_S. We use Adam as the optimiser for all experiments. Training is terminated when the validation loss increases for more than 5 epochs. This results in approximately 80 to 150 epochs of training for the rotation inference and image generation networks, depending on the dataset and the specific experiment, and approximately 10 to 20 epochs for the embedding networks.
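A sketch of this training loop is shown below; the learning rate and the data loaders are placeholders, and only the Adam optimiser and the patience-of-5 early-stopping rule come from the description above.

```python
import torch


def train_with_early_stopping(model, train_step, val_loss_fn, train_loader, val_loader,
                              lr=1e-4, patience=5, max_epochs=500):
    """Optimise with Adam and stop once the validation loss has increased for more
    than `patience` consecutive epochs. `train_step(model, batch, opt)` runs one
    optimisation step; `val_loss_fn(model, batch)` returns a scalar loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, epochs_rising = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            train_step(model, batch, opt)

        model.eval()
        with torch.no_grad():
            val = sum(val_loss_fn(model, b) for b in val_loader) / max(len(val_loader), 1)

        if val < best_val:
            best_val, epochs_rising = val, 0
        else:
            epochs_rising += 1
            if epochs_rising > patience:
                break   # validation loss has risen for more than `patience` epochs
    return model
```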

For the introspection method in Section V-D, we fix the known small offset and set the threshold for e to 5. For rotation inference, we use a fixed angular increment when forming the stack of rotated images.

VI-D Ablation Study

We perform an ablation study to investigate the effect of reduced training data. For radar localisation against satellite imagery on the RobotCar Dataset, we trained one model using only an initial, contiguous portion of the training data, and another using a uniform subsample (every k-th frame) of the training data. The results are shown in Table IX.

Mean Error                        x (m)   y (m)   heading (°)   x (px)   y (px)
RobotCar (full)                    3.44    5.40      3.03        3.97     6.23
RobotCar (first contiguous portion) 7.96   7.45      6.03        9.18     8.59
RobotCar (every k-th frame)        4.36    6.18      4.40        5.03     7.14
TABLE IX: Ablation study using reduced training data, evaluated on radar localisation against satellite imagery on the RobotCar Dataset.

By using only every k-th frame, we use a fraction of the training data. However, by sampling the data uniformly, we obtain a training set that is more varied than selecting only the first contiguous portion, which leads to better performance.

VI-E Additional Qualitative Results

Additional qualitative results are presented in Figure 17 and Figure 18.

Fig. 17: Images at various stages of our method: map image (a), live data image (b), output of rotation inference (c), embedding (d), pixel-wise aligned ground truth (e), synthetic image (f), embedding (g). From top to bottom: radar localisation against satellite imagery evaluated on RobotCar (rows 1-3) and MulRan (rows 4-6), lidar localisation against satellite imagery evaluated on RobotCar (rows 7-9).
Fig. 18: Images at various stages of our method: map image (a), live data image (b), output of rotation inference (c), embedding (d), pixel-wise aligned ground truth (e), synthetic image (f), embedding (g). From top to bottom: lidar localisation against satellite imagery evaluated on KITTI (rows 1-3), radar localisation against prior lidar map evaluated on RobotCar (rows 4-6) and MulRan (7-9).