Continuous Self-Localization on Aerial Images Using Visual and Lidar Sensors

by   Florian Fervers, et al.

This paper proposes a novel method for geo-tracking, i.e. continuous metric self-localization in outdoor environments by registering a vehicle's sensor information with aerial imagery of an unseen target region. Geo-tracking methods offer the potential to supplant noisy signals from global navigation satellite systems (GNSS) and expensive and hard to maintain prior maps that are typically used for this purpose. The proposed geo-tracking method aligns data from on-board cameras and lidar sensors with geo-registered orthophotos to continuously localize a vehicle. We train a model in a metric learning setting to extract visual features from ground and aerial images. The ground features are projected into a top-down perspective via the lidar points and are matched with the aerial features to determine the relative pose between vehicle and orthophoto. Our method is the first to utilize on-board cameras in an end-to-end differentiable model for metric self-localization on unseen orthophotos. It exhibits strong generalization, is robust to changes in the environment and requires only geo-poses as ground truth. We evaluate our approach on the KITTI-360 dataset and achieve a mean absolute position error (APE) of 0.94m. We further compare with previous approaches on the KITTI odometry dataset and achieve state-of-the-art results on the geo-tracking task.




I Introduction

An accurate method for self-localization is essential for the navigation of ground vehicles. This is the case for human drivers supported by navigational systems as well as fully autonomous driving suites. Short-term trajectories can be estimated accurately via odometry methods utilizing visual, lidar or inertial measurements. However, even under ideal benchmark conditions state-of-the-art odometry methods suffer from a drift upwards of 5m after traveling for 1km [1]. GNSS can be used to alleviate long-term drift, but suffers from outages and noisy measurements (e.g. due to the multipath effect [2]). A common alternative to GNSS is to continuously realign the local trajectory with a pre-built map of the environment [3]. While ground-based maps (e.g. using lidar point clouds) are costly to produce and to keep up-to-date, two-dimensional grid-maps such as aerial orthophotos are available globally and offer the potential for self-localization in unseen areas without corresponding ground maps.

Recent work in this area can be roughly divided into two categories: (1) Geo-localization methods use no or only a coarse (e.g. city-scale) initial pose estimate and determine the location by matching the ground vehicle data against a database of geo-registered aerial images. (2) Geo-tracking methods start from a known pose and continuously track the vehicle’s movement on aerial images or semantic maps provided by geographic information systems (GIS). While state-of-the-art geo-localization approaches [4, 5, 6] have high recall on benchmark datasets [7, 8] they suffer from low metric accuracy (e.g. when employed in a Markov localization scheme [9]). Further, initial pose estimates are often known or can be provided by cheap GNSS receivers.

Geo-tracking methods are applicable when the initial pose is known, and have demonstrated feasible results using input from visual cameras [10], range scanners [11, 12] and both [13]. Nevertheless, there are several limitations of current approaches to geo-tracking that we address in this work:

  • Several methods [13, 14, 15, 16] train and test on data from the same city area. Others [17, 18] assume that the ground region or aerial data to be tested on has already been seen during the training stage. We emphasize that one of the advantages of using orthophotos or GIS for self-localization is not having to travel through the target area beforehand to gather training data or build ground maps. We train and test our method on entirely different datasets from the United States and Germany to demonstrate the strong generalization of our model to unseen ground and aerial data.

  • Some methods rely on specific environment features such as edges of buildings [19, 20, 14, 15, 16] or semantic features [13, 21]. This limits the applicability to locations where these features are present and visible, i.e. urban regions.

  • Several related works [10, 22, 11, 12] rely on matching low-level visual or lidar-reflectance features and assume that ground and aerial images have a similar low-level appearance. However, dynamic objects in the ground and aerial data do not match unless the aerial and ground data are captured synchronously. Further, aerial images from sources such as Google Maps [23] or Bing Maps [24] are captured over several years [25] and potentially miss recent changes (e.g. from construction sites and vegetation growth).

We present a new method for geo-tracking that aligns ground with aerial images via learned visual features. We use lidar point clouds captured synchronously with the ground data to project features extracted from ground images into a nadir (i.e. top-down) perspective. The matching provides a two-dimensional relative pose w.r.t. the aerial image that translates into latitude, longitude and bearing of the vehicle using the geo-registered information of the aerial image.

Our method aims to complement existing odometry and simultaneous localization and mapping (SLAM) methods by providing continuous geo-registration that reduces long-term drift to within the registration error. We choose a simple inertial odometry method based purely on measurements of acceleration and angular velocity and show that in combination with our geo-registration it achieves meter-accurate global poses regardless of sequence length.

The main contribution of this work is a novel method for continuous geo-tracking on orthophotos with meter-level accuracy that exhibits strong generalization, does not require hand-crafted features or ground truth beyond geo-poses, and is robust to changes in the environment. Our experiments demonstrate that we achieve state-of-the-art geo-tracking results on the KITTI odometry dataset [26].

II Related Works

| Group | Authors | Ground data | Aerial data | Extracted features | Tracking | Evaluated on |
|---|---|---|---|---|---|---|
| (a) | Vysotska et al. [19] | Lidar | OpenStreetMaps | Buildings | Pose graph | Non-public data |
| (a) | Kim et al. [20] | Lidar | Orthophoto | Buildings | Particle filter | CULD [27] |
| (a) | Kümmerle et al. [28] | Lidar or camera | Orthophoto | Vertical structures | Particle filter | Non-public data |
| (a) | Wang et al. [29] | Camera | Orthophoto | Vertical structures | Particle filter | Non-public data |
| (a) | Pink et al. [30] | Camera | Orthophoto | Lane markings | - | Non-public data |
| (a) | Javanmardi et al. [31] | Lidar | Orthophoto | Lane markings | - | Non-public data |
| (a) | Viswanathan et al. [32] | Lidar and camera | Orthophoto | Ground-nonground | Particle filter | Non-public data |
| (a) | Brubaker et al. [33] | Camera | OpenStreetMaps | Trajectory | Custom Bayesian filter | KITTI [26] |
| (a) | Floros et al. [34] | Camera | OpenStreetMaps | Trajectory | Particle filter | KITTI [26] |
| (a) | Yan et al. [21] | Lidar | OpenStreetMaps | Semantic descriptor | Particle filter | KITTI [26] |
| (a) | Miller et al. [13] | Lidar and camera | Orthophoto | Semantic classes | Particle filter | KITTI [26] |
| (b) | Veronese et al. [11] | Lidar | Orthophoto | - | Particle filter | Non-public data |
| (b) | Vora et al. [12] | Lidar | Orthophoto | - | Extended Kalman Filter | Non-public data |
| (b) | Senlet et al. [10] | Camera | Orthophoto | Edges | Particle filter | Non-public data |
| (b) | Noda et al. [22] | Camera | Orthophoto | SURF | - | Non-public data |
| (c) | Tang et al. [14, 15, 16] | Radar or lidar | Orthophoto | Learned | - | Robotcar [35], KITTI [26], Mulran [36] |
| (c) | Zhu et al. [37] | Lidar | Orthophoto | Vertical structures | Pose graph | KITTI [26] |
| (c) | Ours | Lidar and camera | Orthophoto | Learned | Extended Kalman Filter | KITTI [26], KITTI-360 [38] |

TABLE I: Related Works

II-A Handcrafted and Semantic Features

Most geo-tracking approaches involve identifying abstract hand-crafted or semantic features in both ground and aerial data that are matched to find the vehicle’s relative pose (cf. Table Ia). The ground data is captured by visual cameras and/or range scanners (i.e. lidar, radar), and the aerial data is represented by orthophotos or grid-maps from GIS such as OpenStreetMaps (OSM) [39]. Different types of features have been used for this purpose in previous works, such as building outlines [19, 20], general vertical structures [28, 29], lane markings [30, 31], a binary ground-nonground distinction [32], the driven trajectory [33, 34] and multiple semantic classes [21, 13].

Semantic and handcrafted features are typically defined to be invariant to seasonal and daylight variations, but also discard much of the information contained in the input data that could otherwise be leveraged for the tracking task. This limits the applicability to areas where the desired type of feature is abundant, i.e. mostly urban regions with a sufficient amount of buildings.

II-B Low-level Features

In contrast to hand-crafted or semantic features, methods based on low-level features allow leveraging the input data on a more fine-grained level by directly comparing color and reflectance values [11, 12] or low-level visual features [10, 22] (cf. Table Ib). The camera or lidar data is first projected into a top-down perspective before applying the matching algorithm.

Methods based on low-level features are negatively impacted by changes in illumination and material appearance when ground and aerial images are taken years apart. Furthermore, the different surface areas of vertical structures are typically only visible from either the aerial view (e.g. roofs) or the ground view (e.g. walls) and can thus not be exploited for the tracking task. As a consequence, low-level feature methods rely almost entirely on contrast between bright lane markings, sidewalks and darker roads.

II-C End-to-end Learnable Features

End-to-end trainable models for the geo-tracking task have only recently gained some attention in the research community (cf. Table Ic). Unlike low-level features, learned features allow exploiting vertical structures in the environment and can be trained to be invariant to illumination and appearance changes due to a higher level of abstraction. Unlike handcrafted and semantic features, these models do not blanket discard part of the input data, but instead learn to predict invariant features in a data-centric manner. End-to-end learned features can thus combine the advantages of both semantic or handcrafted features and low-level features. They further differ from learned semantic features by being trained directly on the tracking target rather than on the auxiliary semantic ground truth.

Previous methods based on end-to-end trainable models [14, 15, 37, 16] utilize only input from range-scanners (i.e. radar and lidar) on the ground vehicle. The works by Tang et al. [14, 15, 16] also discard lidar points on the ground plane and thereby focus entirely on vertical structures.

II-D Cross-View Geo-localization

Cross-view geo-localization refers to the problem of matching a given ground image against a dataset of geo-registered aerial images to find the geo-location of the vehicle. Current state-of-the-art approaches [4, 5, 6] achieve high recall on corresponding benchmark datasets like CVUSA [7] and CVACT [8]. Such models can also be used to perform coarse geo-tracking (e.g. in combination with particle filters [9]), but do not compete with the metric accuracy of dedicated geo-tracking approaches.

II-E Placement of Our Work

We propose an end-to-end trainable model for geo-tracking on aerial images in unseen areas. Our work is the first to utilize ground cameras in an end-to-end differentiable model since previous similar works have only used range scanners for this purpose (cf. Section II-C). Camera images both contain rich visual information and have a smaller domain gap to the aerial imagery than range scans.

The model extracts high-level features from aerial and ground images which are more robust to appearance changes than low-level features (cf. Section II-B) and can exploit vertical structures. In contrast to handcrafted and semantic high-level features (cf. Section II-A), the model is not restricted to a predefined subset of the input data and therefore does not discard information that could help improve tracking performance.

III Method

Our method extracts visual features from both ground and aerial images and aligns them from a top-down perspective to find the two-dimensional pose of the vehicle relative to the aerial image. We train a model in a metric learning setting to predict similar features for matching pixels of aerial and ground images and dissimilar features for non-matching pixels. The transformation between projected ground features and aerial features that maximizes the sum of pixel matching scores defines the vehicle’s pose relative to the aerial image.

III-A Input and Coordinate Systems

We use a scaled Spherical Mercator projection [40] to transform geo-poses into a locally Euclidean, metric coordinate system of the earth's surface. All poses are represented by 3-DoF transformations between local vehicle coordinates and global coordinates. The global coordinate system is centered on the initial pose estimate and stores coordinates in east-north order.
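The coordinate conversion can be sketched as follows. This is a minimal illustration, assuming the common spherical (Web) Mercator radius and a scale factor equal to the cosine of the reference latitude; the function names `mercator` and `to_local_east_north` are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius commonly used by Spherical Mercator


def mercator(lat_deg, lon_deg):
    """Unscaled Spherical Mercator coordinates (x east, y north)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = EARTH_RADIUS_M * lon
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y


def to_local_east_north(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """East-north offset in (approximate) meters relative to a reference pose.

    Assumption: scaling by cos(reference latitude) compensates the Mercator
    stretch, so that distances near the reference point are metric.
    """
    scale = math.cos(math.radians(lat0_deg))
    x, y = mercator(lat_deg, lon_deg)
    x0, y0 = mercator(lat0_deg, lon0_deg)
    return scale * (x - x0), scale * (y - y0)
```

The approximation is only locally metric, which is sufficient here because each registration considers a small area around the prior pose estimate.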

To perform the geo-registration of the vehicle at a given point in time, the method takes the following inputs:

  1. An initial pose estimate and an aerial image that is centered on and rotated by this estimate, with a fixed resolution in meters per pixel and fixed pixel dimensions.

  2. Multiple ground images, each with its own pixel dimensions, captured synchronously by different cameras on the vehicle.

  3. A lidar point cloud with three-dimensional points in vehicle coordinates captured by one or more lidar scanners.

All data from the ground vehicle are acquired synchronously up to a negligible delay from a single vehicle pose and represent the measured state of the environment at the given point in time. Our registration method estimates the relative transformation between the prior pose estimate and the vehicle pose.

Fig. 1: Summary of the steps required to compute the likelihood of a relative pose hypothesis for the ground vehicle. (a) The input consists of an aerial image centered and rotated to match the prior pose estimate, images captured synchronously by cameras on the ground vehicle, and a lidar point cloud. (b) The aerial and ground images are processed by separate aerial and ground networks to predict pixel-wise feature maps. (c) All ground image features are projected onto a single top-down grid-map using the lidar point cloud and the pose hypothesis. The aerial features are padded to produce the aerial grid-map that matches the size of the ground grid-map. Red pixels in both grid-maps refer to valid features, black pixels to missing features. (d) The grid-maps are compared using the similarity function to produce a confidence score for the pose hypothesis. Multiple hypotheses are tested jointly by computing the scores using a cross-correlation.

III-B Feature Map Computation

We consider different hypotheses for the vehicle pose. For each candidate hypothesis, the ground image features of all cameras are first transformed from the vehicle frame into the frame of the aerial image by the hypothesized transformation and then projected via the lidar points onto a top-down grid-map. The grid-map for a hypothesis shall match the aerial grid-map if and only if it is the correct hypothesis, i.e. if it equals the true relative pose. Both maps have the same spatial dimensions and number of channels. The true pose hypothesis is then determined as the hypothesis that maximizes the similarity between the two grid-maps (1).

This is akin to a maximum likelihood estimation, but uses an uncalibrated likelihood. We train our model to output high similarity values for the correct hypothesis, and low values for incorrect hypotheses. A summary of the system is shown in Figure 1.

The measured lidar points are not used as an input to compute the feature vectors themselves, but rather only serve to project features from ground view into a top-down perspective.

III-B1 Feature prediction

To compute the ground and aerial grid-maps for a given pose hypothesis, the aerial and ground images are first processed by separate aerial and ground encoder-decoder networks. This produces pixel-level feature maps at the original image resolutions with a fixed number of feature channels (cf. Figure 1b).


III-B2 Feature projection

The visual features are transferred from the ground feature maps onto the lidar points by projecting the points onto all camera planes and sampling the feature maps at the projected pixel locations. When multiple cameras observe the same lidar point due to an overlapping field of view, the corresponding pixel features are mean-pooled to produce a single feature vector for the lidar point. This yields a feature matrix for the lidar points.
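The transfer of features onto the lidar points might look like the sketch below. It assumes a pinhole camera model with the optical axis along +z, nearest-pixel sampling, and known intrinsic and vehicle-to-camera extrinsic matrices; none of these details are specified in the text, and `sample_point_features` is a hypothetical name.

```python
import numpy as np


def sample_point_features(points, feature_maps, intrinsics, extrinsics):
    """Mean-pool camera features onto lidar points.

    points:       (N, 3) lidar points in vehicle coordinates
    feature_maps: list of (H, W, C) arrays, one per camera
    intrinsics:   list of (3, 3) pinhole matrices (assumed camera model)
    extrinsics:   list of (4, 4) vehicle-to-camera transforms
    Returns (N, C) features and an (N,) mask of points seen by any camera.
    """
    n = points.shape[0]
    c = feature_maps[0].shape[2]
    feat_sum = np.zeros((n, c))
    count = np.zeros(n)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)
    for fmap, K, T in zip(feature_maps, intrinsics, extrinsics):
        h, w, _ = fmap.shape
        cam = (T @ homog.T).T[:, :3]            # points in the camera frame
        in_front = cam[:, 2] > 1e-6             # keep points in front of the camera
        z = np.where(in_front, cam[:, 2], 1.0)  # avoid division by zero
        uvw = (K @ cam.T).T
        u = np.floor(uvw[:, 0] / z).astype(int)
        v = np.floor(uvw[:, 1] / z).astype(int)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        feat_sum[visible] += fmap[v[visible], u[visible]]
        count[visible] += 1
    valid = count > 0
    features = np.zeros((n, c))
    features[valid] = feat_sum[valid] / count[valid, None]  # mean over cameras
    return features, valid
```

Points that are not observed by any camera receive no feature and are flagged as invalid, so that they can be skipped in the subsequent projection.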

Next, the features of the lidar points are transformed into the top-down grid-map that matches the pixel resolution of the aerial image. The lidar points are first transformed by the pose hypothesis into the coordinate system of the aerial image and then projected onto two-dimensional pixels in the grid-map using an orthogonal projection along the vertical axis. The features of lidar points that are projected onto the same pixel are max-pooled to produce a single feature vector. Since lidar points generally do not cover all grid-map pixels, the ground grid-map represents a sparse pixel-level feature map (cf. Figure 1c).
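A minimal sketch of the max-pooled orthogonal projection is given below. It assumes that the points have already been transformed into the aerial frame as east-north offsets from the grid center and that the grid is square; `project_to_grid` and its parameters are illustrative names only.

```python
import numpy as np


def project_to_grid(points_xy, features, resolution, size):
    """Scatter point features onto a top-down grid with max-pooling.

    points_xy:  (N, 2) east-north offsets in meters from the grid center
    features:   (N, C) feature vectors of the points
    resolution: meters per pixel (assumed equal to the aerial image)
    size:       side length of the square grid in pixels
    Returns the (size, size, C) grid and a (size, size) validity mask.
    """
    c = features.shape[1]
    col = np.floor(points_xy[:, 0] / resolution + size / 2.0).astype(int)
    row = np.floor(size / 2.0 - points_xy[:, 1] / resolution).astype(int)  # north is up
    inside = (col >= 0) & (col < size) & (row >= 0) & (row < size)
    grid = np.full((size, size, c), -np.inf)
    # Unbuffered element-wise maximum handles several points per pixel
    np.maximum.at(grid, (row[inside], col[inside]), features[inside])
    valid = np.isfinite(grid).all(axis=2)
    grid[~valid] = 0.0  # pixels without lidar points store no feature
    return grid, valid
```

The validity mask is kept alongside the grid because the matching step only compares pixels that actually received a projected feature.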

Lastly, the aerial network output is symmetrically padded with zeros in both spatial dimensions to produce the aerial grid-map, whose spatial dimensions equal those of the ground grid-map that it is matched with (cf. Figure 1c).

III-C Feature Map Alignment

Given the ground and aerial feature maps with equal dimensions, the similarity function computes their matching score as the average of the similarities of individual pixel embeddings. It only considers pixels where both maps store valid features, i.e. it ignores padded pixels in the aerial grid-map and pixels in the ground grid-map without projected lidar points. We define the set of valid pixels for a hypothesis as the set of pixels that store valid features in both grid-maps.

The matching score between individual pixel embeddings of the two grid-maps is measured as their cosine similarity:


The similarity function is defined as the average similarity of the features stored per valid pixel and measures the confidence assigned to the hypothesis as shown in (5).


We further consider only hypotheses where the proportion of valid ground features that are matched with aerial features is above a threshold parameter. This rules out hypotheses with no or only a small overlap between the two grid-maps (i.e. when the hypothesized offset is too large). In this case, the measurement of the similarity is unreliable.
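For a single hypothesis, the similarity and the overlap check can be sketched as follows. The default threshold `min_overlap=0.5` is a placeholder and not a value from the paper, and the function name `similarity` is chosen for illustration.

```python
import numpy as np


def similarity(ground, ground_valid, aerial, aerial_valid, min_overlap=0.5, eps=1e-8):
    """Average cosine similarity over pixels that are valid in both grid-maps.

    ground, aerial:             (H, W, C) grid-maps
    ground_valid, aerial_valid: (H, W) boolean masks
    Returns None if too few valid ground pixels overlap with the aerial map
    (min_overlap is a hypothetical threshold value).
    """
    both = ground_valid & aerial_valid
    n_ground = ground_valid.sum()
    if n_ground == 0 or both.sum() / n_ground <= min_overlap:
        return None  # hypothesis rejected as unreliable
    g = ground[both]  # (M, C) embeddings at jointly valid pixels
    a = aerial[both]
    cos = (g * a).sum(axis=1) / (
        np.linalg.norm(g, axis=1) * np.linalg.norm(a, axis=1) + eps
    )
    return float(cos.mean())
```

Because the cosine similarity is bounded, the averaged score stays within [-1, 1] regardless of how many pixels overlap.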

To determine the maximum likelihood hypothesis according to (1), the similarity has to be evaluated for all hypotheses. This becomes infeasible for a large number of hypotheses. To alleviate this problem, we choose the set of hypotheses as a pixel-spaced grid of translations around the origin. This allows the corresponding confidence scores to be computed jointly via cross-correlation, which reduces to a simple element-wise multiplication in the Fourier domain [41]. The step is repeated for a discrete set of possible rotations that are chosen as a hyperparameter.

The result of the matching step is the set of confidence scores for all pose hypotheses.
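The joint evaluation over all translations can be illustrated with a circular cross-correlation computed via the FFT. The sketch assumes that wrap-around effects are avoided by sufficient zero padding and that features are L2-normalized beforehand, so that the correlation sums cosine similarities; `correlate_scores` is a hypothetical name, and handling of rotations and of the overlap threshold is omitted.

```python
import numpy as np


def correlate_scores(ground, ground_valid, aerial, aerial_valid, eps=1e-8):
    """Average cosine similarity for every (circular) pixel translation.

    scores[dy, dx] compares aerial[y + dy, x + dx] with ground[y, x].
    """
    def normalize(f, valid):
        norm = np.linalg.norm(f, axis=2, keepdims=True)
        return np.where(valid[..., None], f / (norm + eps), 0.0)

    def xcorr(a, g):
        # Cross-correlation theorem: F{corr} = F{a} * conj(F{g})
        spec = np.fft.fft2(a, axes=(0, 1)) * np.conj(np.fft.fft2(g, axes=(0, 1)))
        return np.fft.ifft2(spec, axes=(0, 1)).real

    g = normalize(ground, ground_valid)
    a = normalize(aerial, aerial_valid)
    num = xcorr(a, g).sum(axis=2)  # sum of cosine similarities per shift
    cnt = xcorr(aerial_valid.astype(float), ground_valid.astype(float))  # overlap size
    return np.where(cnt > 0.5, num / np.maximum(cnt, 1.0), -np.inf)
```

The second correlation counts the jointly valid pixels for each shift, which turns the summed similarities into the per-pixel average used as the confidence score.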

III-D Training and Loss

In the previous sections, we presented a model that predicts the confidence scores of a set of pose hypotheses. The entire model including projections and transformations into the Fourier domain is end-to-end differentiable up to the confidence scores.

Each training sample defines one positive hypothesis based on the ground truth pose of the vehicle, and multiple negative hypotheses with poses that differ from the ground truth. We employ a soft triplet loss on each pair of positive and negative hypotheses and a soft online hard example mining (OHEM) [42] strategy to train the model to predict discriminative pixel embeddings.

A triplet is defined by an anchor, a positive example and a negative example. The classical hard triplet loss aims to push the distance of the positive example to the anchor below the distance of the negative example to the anchor by at least a margin (see (6)). The distance is represented by the negative similarity between anchor and example.


Taking the average of the hard triplet loss over all possible negative hypotheses results in very small loss values as soon as the majority of triplets are classified correctly. This prevents the model from learning from the remaining hard negative hypotheses, i.e. those that still violate the margin. Instead, we employ an OHEM strategy during training by only considering hard negative hypotheses and taking the average of the triplet loss over these remaining hypotheses. The loss function in (7) encodes a hard triplet loss with an OHEM strategy.


The derivative of the relu function in (7) acts as a binary indicator for a hard negative triplet, i.e. a triplet with a non-zero loss.

To improve the stability of the training, we smoothly approximate the hard loss with a soft loss function. To this end, we first replace the relu function with a smooth softplus function as shown in (8).


The temperature parameter determines the hardness of the softplus function. We define the new smooth loss function as follows:


The derivative of the softplus function in (9) (i.e. the sigmoid function) acts as a soft indicator for a hard negative triplet. As in the hard loss (7), we propagate the loss gradient only through the numerator of (9).

The loss function encodes a soft triplet loss with a soft online hard example mining strategy and requires only the vehicle's pose as ground truth. The gradient at the output of the ground and aerial networks is sparse and dense, respectively, but is accumulated over multiple images for the ground network.
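The two loss variants can be sketched on plain score arrays as shown below. The exact form of (7) to (9) is not reproduced here: the sketch assumes the softplus is parametrized as `t * log(1 + exp(x / t))` and that the loss is normalized by the (soft) count of hard negatives, which is consistent with the description above but remains an assumption. NumPy computes no gradients, so the stop-gradient on the denominator is only noted in a comment.

```python
import numpy as np


def hard_triplet_ohem(pos_score, neg_scores, margin):
    """Hard triplet loss averaged over hard negatives only (assumed form)."""
    # Distance = negative similarity, so d_pos - d_neg + m = m + s_neg - s_pos
    x = margin + np.asarray(neg_scores, dtype=float) - pos_score
    hard = x > 0.0  # derivative of relu: binary indicator of a hard negative
    if not hard.any():
        return 0.0
    return float(np.maximum(x, 0.0).sum() / hard.sum())


def soft_triplet_ohem(pos_score, neg_scores, margin, temperature):
    """Smooth approximation using softplus and a sigmoid soft indicator."""
    x = (margin + np.asarray(neg_scores, dtype=float) - pos_score) / temperature
    softplus = temperature * np.logaddexp(0.0, x)  # t * log(1 + exp(x / t))
    weight = np.exp(-np.logaddexp(0.0, -x))        # numerically stable sigmoid
    # In an autodiff framework the denominator would be wrapped in a
    # stop-gradient so that gradients flow only through the numerator.
    return float(softplus.sum() / weight.sum())
```

For a small temperature the soft loss approaches the hard variant, while larger temperatures let easy negatives contribute with a small weight.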

III-E Tracking

The method presented in the previous sections provides a means for registering aerial and ground data with small translational and rotational offsets, i.e. with a limited number of hypotheses to test. The approach aims to achieve an upper bound on the long-term drift of a given tracking method, rather than to replace the tracking method itself.

We choose a simple tracking method based on an Extended Kalman Filter (EKF) with the constant turn-rate and acceleration (CTRA) motion model [43] that continuously integrates measurements of an inertial measurement unit (IMU). The EKF keeps track of the current vehicle state and corresponding state uncertainty. To demonstrate the effectiveness of our registration, we use only the turn-rate and acceleration of the IMU which on its own results in large long-term drift. The acceleration term is integrated twice to produce position values, such that small acceleration noise leads to large translational noise over time. We use our registration method to continuously align the trajectory with aerial images such that the drift is reduced to within the registration error of the method.
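The effect of integrating the acceleration twice can be illustrated with a simplified motion model. The sketch below covers only the state propagation with forward Euler steps, not the EKF covariance update or the closed-form CTRA equations of [43]; the state layout and the function names are assumptions made for illustration.

```python
import math


def ctra_step(state, turn_rate, accel, dt):
    """One forward-Euler step of a turn-rate and acceleration driven model.

    state: (x, y, heading, speed) in meters, radians and meters per second
    """
    x, y, heading, speed = state
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    heading += turn_rate * dt
    speed += accel * dt
    return (x, y, heading, speed)


def integrate(state, measurements, dt):
    """Propagate the state through a sequence of (turn_rate, accel) readings."""
    for turn_rate, accel in measurements:
        state = ctra_step(state, turn_rate, accel, dt)
    return state
```

Feeding a constant acceleration bias of 0.1 m/s² for 100 s into a stationary vehicle already yields a position error of roughly 500 m, which illustrates why the IMU-only tracker requires continuous re-alignment.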

III-F Calibration of Confidence Scores

The EKF takes as input the IMU measurement as well as a pose observation and an observation uncertainty that represent the network prediction of position and bearing.

Determining the observation and its covariance from the network output is non-trivial since the confidence scores predicted by the model are uncalibrated and do not represent the actual likelihoods of the underlying hypotheses. The confidence scores are further not guaranteed to follow a normal distribution, which is required by the EKF.

We first transform the confidence scores of all hypotheses into a proper likelihood function as shown in (10).


The observation is determined via maximum a posteriori (MAP) estimation using this likelihood and the prior probabilities provided by the EKF.

The covariance cannot directly be estimated from the confidence scores, which are uncalibrated and generally do not follow a normal distribution. Instead, we map the likelihood into a more suitable representation with a forward transformation, compute its covariance in the transformed space and apply the inverse transformation to arrive at the actual covariance as shown in (11).


We define the transformed representation as the posterior probability distribution over the set of hypotheses given the likelihood and the prior probabilities. The forward transform thus multiplies likelihood and prior and acts as a soft windowing function around the prior state estimate. Since this also decreases the estimated covariance in accordance with the prior covariance, the inverse transform removes this effect and arrives at an undistorted covariance estimate.

The choice of the transformation function is motivated by the fact that it is homomorphic when the network output already follows a normal distribution. If the confidence scores do not follow a normal distribution, the transformation still allows capturing the curvature of the correlation volume in a soft window around the prior state estimate and converting it into a better calibrated normal distribution.
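One plausible reading of this procedure is sketched below; it is an interpretation, not the exact form of (11). It assumes that the inverse transform subtracts the prior precision from the posterior precision, which recovers the likelihood covariance exactly when both likelihood and prior are Gaussian. The function name `calibrated_covariance` is hypothetical.

```python
import numpy as np


def calibrated_covariance(hypotheses, likelihood, prior, prior_cov):
    """Estimate an observation covariance from discrete hypothesis scores.

    hypotheses: (N, D) candidate poses
    likelihood: (N,) non-negative likelihood values of the hypotheses
    prior:      (N,) prior probabilities of the hypotheses (from the filter)
    prior_cov:  (D, D) covariance of the prior
    """
    # Forward transform: posterior = likelihood * prior (soft window)
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()
    mean = posterior @ hypotheses
    centered = hypotheses - mean
    post_cov = (centered * posterior[:, None]).T @ centered
    # Assumed inverse transform: remove the prior precision again
    obs_precision = np.linalg.inv(post_cov) - np.linalg.inv(prior_cov)
    return np.linalg.inv(obs_precision)
```

For Gaussian inputs the product of likelihood and prior has a precision equal to the sum of both precisions, which is why subtracting the prior precision leaves the likelihood covariance undistorted.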

We discard unreliable registration results based on an empirical heuristic using the translational variance of the hypotheses, the confidence score of the predicted hypothesis and the Mahalanobis distance of the predicted hypothesis to the prior probability distribution. In that case, only the IMU data is processed by the EKF.

IV Evaluation

IV-A Data

To train and test our method, a sufficient number of geo-registered ground trajectories are required that contain both lidar point clouds and cameras ideally covering 360° around the vehicle. While many datasets for autonomous driving contain large numbers of ground samples, the trajectories often follow repeated routes with short distances and thus cover only few aerial images. Therefore, we choose four datasets for our training split (i.e. Lyft L5 [44], Nuscenes [45], Pandaset [46], Ford AV Dataset [47]), one dataset for validating the training (i.e. Argoverse [48]) and report test results on two datasets (i.e. KITTI odometry [26] and KITTI-360 [38]). We use Google Maps [23] and Bing Maps [24] to acquire orthophotos of the test split and the training/validation splits, respectively. Aerial and ground data are thus captured up to several years apart.

We train on datasets covering urban and sub-urban areas in the United States and test on datasets of an area around Karlsruhe, Germany. We thus demonstrate that our model is able to generalize over a large domain gap caused by 1. the diverging appearance of regions in the United States and Germany, 2. different capture modalities of the orthophotos and 3. different capture modalities of the ground-based data.

We gather aerial images with a fixed pixel size and pixel resolution. We resize all ground images to a fixed minimum size in pixels, as we find this resolution to be sufficient for the registration task while enabling inference with multiple images on a single graphics processing unit (GPU). Since the locations of the ground data are distributed unevenly over the aerial images, we select ground samples during the training according to a uniform distribution over disjoint aerial cells of a fixed size.

While modern autonomous driving datasets (e.g. [44, 45, 46, 47, 48]) provide cameras with full 360° field-of-view, older datasets such as KITTI have only a front-facing stereo camera. Since our method utilizes only lidar points that are covered by the viewing frustum of at least one on-board camera, it cannot reach its full potential on the older KITTI dataset. Despite these unfavorable conditions, our quantitative evaluation shows that our method outperforms other state-of-the-art geo-tracking approaches on the KITTI dataset. We demonstrate our main results on the KITTI-360 dataset which contains a front-facing stereo camera and two lateral-facing fisheye cameras.

IV-B Implementation

We choose the ConvNeXt-T architecture [49] as encoder and UperNet [50] as decoder for both the aerial and ground feature networks. The encoder is pre-trained on ImageNet, and the entire encoder-decoder network is pre-trained on ADE20K [52]. The last layer is replaced with a convolution that predicts the feature vectors with a fixed embedding dimension.

We train the model for 30 epochs with the RectifiedAdam optimizer [53] and a small batch size. An epoch consists of a single ground sample chosen randomly per cell. We replace all BatchNorm (BN) layers [54] in UperNet with LayerNorm (LN) layers [55] since BN layers perform worse with small batch sizes [56]. We employ a cosine-decay learning-rate schedule. Generalization is improved via a decoupled weight decay [57], a layer decay [58] of 0.8 and an exponential moving average [59] over the model weights with a decay rate of 0.999. The loss function is parametrized with a margin and a temperature (cf. Section III-D).

For data augmentation, we randomly rotate and flip the aerial image and lidar point clouds, and apply a color augmentation scheme to the aerial and ground camera images. We randomly shift the center of the aerial image to prevent the network from simply learning a centering bias on the image. We choose separate values for the threshold that discards unreliable hypotheses (cf. Section III-C) during training and during inference.

For the tracker, we test a fixed number of possible rotations sampled from a bounded range of angles with a non-uniform sampling density.

IV-C Results

The experimental results on KITTI are summarized in Table II. We measure the performance as the mean APE of the two-dimensional poses in meters. Our method is able to track 8 of 10 scenes successfully and achieves state-of-the-art results on 7 scenes. Due to the limited camera field-of-view of KITTI, it loses track in two scenes at some point during the trajectory and is not able to recover afterwards. Only the method of Brubaker et al. [33] is able to successfully track the same number of sequences, but suffers from significantly larger APE.
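The evaluation metric can be computed as in the following sketch, assuming that predicted and ground truth positions are given as time-aligned two-dimensional coordinates in meters; `mean_ape` is an illustrative name.

```python
import numpy as np


def mean_ape(predicted_xy, ground_truth_xy):
    """Mean absolute position error of time-aligned 2D trajectories in meters."""
    predicted_xy = np.asarray(predicted_xy, dtype=float)
    ground_truth_xy = np.asarray(ground_truth_xy, dtype=float)
    errors = np.linalg.norm(predicted_xy - ground_truth_xy, axis=1)
    return float(errors.mean())
```

The metric averages the Euclidean distance per frame and therefore does not reward errors that cancel out over the course of a trajectory.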

The results on KITTI-360 are summarized in Table III. Due to the larger field-of-view of the on-board cameras, our method successfully tracks all sequences and achieves a mean APE of 0.94m. Without the continuous alignment of our registration method, the purely IMU-based tracker exhibits large long-term drift and a substantially larger average APE.

Figure 2 shows an overlay of the trajectory of scene 02 with aerial images of the area. The predicted and ground truth positions differ by 0.94m on average (cf. Table III) and stay within a bounded distance of each other over the entire trajectory.

Method 00 01 02 04 05 06 07 08 09 10
Brubaker et al. [33] 2.1 3.8 4.1 2.6 1.8 2.4 4.2 3.9
Floros et al. [34]
Yan et al. [21]
Miller et al. [13] 2.0 9.1 7.2
Tang et al. [14, 15, 16] (3.7)
Zhu et al. [37] 4.45
Ours 2.53 1.42 0.66 0.77 0.57 0.85 2.51 0.96
TABLE II: Absolute Position Error in meters on KITTI Odometry Scenes. Results marked with parentheses correspond to a subset of the scene. Lower bounds are given for methods that did not report exact results. Missing cells indicate scenes that are not evaluated or not tracked successfully. Tang et al. [14, 15, 16] show results for two dimensions separately; we report a single combined value. For each method, the best reported results are shown. Scene 03 is omitted since the GNSS ground truth is not available at the time of publication.
| Method | 00 | 02 | 03 | 04 | 05 | 06 | 07 | 09 | 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours | 0.70 | 0.94 | 0.67 | 0.95 | 0.75 | 1.16 | 0.99 | 0.75 | 2.16 | 0.94 |

TABLE III: Absolute Position Error in meters on KITTI-360 Scenes
Fig. 2: Scene 02 of the KITTI-360 dataset [38] with the ground truth trajectory in red and the predicted trajectory in blue. Images taken from Google Maps [23].

V Conclusion

We present a novel geo-tracking method that allows tracking a vehicle on aerial images with meter-level accuracy. We perform evaluation on the KITTI-360 and KITTI odometry datasets and report state-of-the-art results on the geo-tracking task. Our method is robust to changes in the environment and requires only geo-poses as ground truth. By training and testing on entirely different datasets we demonstrate its strong generalization. In the future, we will investigate the potential of the approach for change detection w.r.t. aerial imagery, and for identifying the misalignment of orthophotos w.r.t. high-accuracy GNSS signals.


References

  • [1] (2022) KITTI odometry benchmark. [Online].
  • [2] M. Karaim, M. Elsheikh, A. Noureldin, and R. Rustamov, "GNSS error sources," Multifunctional Operation and Application of GPS, 2018.
  • [3] G. Bresson, Z. Alsayed, L. Yu, and S. Glaser, "Simultaneous localization and mapping: A survey of current trends in autonomous driving," Transactions on Intelligent Vehicles, 2017.
  • [4] Y. Shi, L. Liu, X. Yu, and H. Li, "Spatial-aware feature aggregation for image based cross-view geo-localization," Advances in Neural Information Processing Systems, 2019.
  • [5] R. Rodrigues and M. Tani, "Are these from the same place? Seeing the unseen in cross-view image geo-localization," in Winter Conference on Applications of Computer Vision, 2021.
  • [6] T. Wang, Z. Zheng, C. Yan, J. Zhang, Y. Sun, B. Zheng, and Y. Yang, "Each part matters: Local patterns facilitate cross-view geo-localization," Transactions on Circuits and Systems for Video Technology, 2021.
  • [7] S. Workman, R. Souvenir, and N. Jacobs, "Wide-area image geolocalization with aerial reference imagery," in International Conference on Computer Vision, 2015.
  • [8] L. Liu and H. Li, "Lending orientation to neural networks for cross-view geo-localization," in Conference on Computer Vision and Pattern Recognition, 2019.
  • [9] L. Heng, B. Choi, Z. Cui, M. Geppert, S. Hu, B. Kuan, P. Liu, R. Nguyen, Y. C. Yeo, A. Geiger, et al., "Project AutoVision: Localization and 3D scene perception for an autonomous vehicle with a multi-camera system," in International Conference on Robotics and Automation, 2019.
  • [10] T. Senlet and A. Elgammal, "A framework for global vehicle localization using stereo images and satellite and road maps," in International Conference on Computer Vision Workshops, 2011.
  • [11] L. D. P. Veronese, E. de Aguiar, R. C. Nascimento, J. Guivant, F. A. A. Cheein, A. F. De Souza, and T. Oliveira-Santos, "Re-emission and satellite aerial maps applied to vehicle localization on urban environments," in International Conference on Intelligent Robots and Systems, 2015.
  • [12] A. Vora, S. Agarwal, G. Pandey, and J. McBride, "Aerial imagery based lidar localization for autonomous vehicles," arXiv preprint, 2020.
  • [13] I. D. Miller, A. Cowley, R. Konkimalla, S. S. Shivakumar, T. Nguyen, T. Smith, C. J. Taylor, and V. Kumar, "Any way you look at it: Semantic crossview localization and mapping with lidar," Robotics and Automation Letters, 2021.
  • [14] T. Y. Tang, D. De Martini, D. Barnes, and P. Newman, "RSL-Net: Localising in satellite images from a radar on the ground," Robotics and Automation Letters, 2020.
  • [15] T. Y. Tang, D. De Martini, S. Wu, and P. Newman, "Self-supervised learning for using overhead imagery as maps in outdoor range sensor localization," International Journal of Robotics Research, 2021.
  • [16] T. Y. Tang, D. De Martini, and P. Newman, "Get to the point: Learning lidar place recognition and metric localisation using overhead imagery," Robotics: Science and Systems, 2021.
  • [17] H. Chu, H. Mei, M. Bansal, and M. R. Walter, "Accurate vision-based vehicle localization using satellite imagery," arXiv preprint, 2015.
  • [18] B. Zha and A. Yilmaz, "Map-based temporally consistent geolocalization through learning motion trajectories," in International Conference on Pattern Recognition, 2021.
  • [19] O. Vysotska and C. Stachniss, "Improving SLAM by exploiting building information from publicly available maps and localization priors," Journal of Photogrammetry, Remote Sensing and Geoinformation Science, 2017.
  • [20] J. Kim and J. Kim, "Fusing lidar data and aerial imagery with perspective correction for precise localization in urban canyons," in International Conference on Intelligent Robots and Systems, 2019.
  • [21] F. Yan, O. Vysotska, and C. Stachniss, "Global localization on OpenStreetMap using 4-bit semantic descriptors," in European Conference on Mobile Robots, 2019.
  • [22] M. Noda, T. Takahashi, D. Deguchi, I. Ide, H. Murase, Y. Kojima, and T. Naito, "Vehicle ego-localization by matching in-vehicle camera images to an aerial image," in Asian Conference on Computer Vision, 2010.
  • [23] "Google Maps." [Online].
  • [24] "Bing Maps." [Online].
  • [25] M. Lesiv, L. See, J. C. Laso Bayas, T. Sturn, D. Schepaschenko, M. Karner, I. Moorthy, I. McCallum, and S. Fritz, "Characterizing the spatial and temporal availability of very high resolution satellite imagery in Google Earth and Microsoft Bing Maps as a source of reference data," Land, 2018.
  • [26] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, 2013.
  • [27] J. Jeong, Y. Cho, Y.-S. Shin, H. Roh, and A. Kim, "Complex urban lidar data set," in International Conference on Robotics and Automation, 2018.
  • [28] R. Kümmerle, B. Steder, C. Dornhege, A. Kleiner, G. Grisetti, and W. Burgard, "Large scale graph-based SLAM using aerial images as prior information," Autonomous Robots, 2011.
  • [29] X. Wang, S. Vozar, and E. Olson, "FLAG: Feature-based localization between air and ground," in International Conference on Robotics and Automation, 2017.
  • [30] O. Pink, "Visual map matching and localization using a global feature map," in Conference on Computer Vision and Pattern Recognition Workshops, 2008.
  • [31] M. Javanmardi, E. Javanmardi, Y. Gu, and S. Kamijo, "Towards high-definition 3D urban mapping: Road feature-based registration of mobile mapping systems and aerial imagery," Remote Sensing, 2017.
  • [32] A. Viswanathan, B. R. Pires, and D. Huber, "Vision-based robot localization across seasons and in remote locations," in International Conference on Robotics and Automation, 2016.
  • [33] M. A. Brubaker, A. Geiger, and R. Urtasun, "Lost! Leveraging the crowd for probabilistic visual self-localization," in Conference on Computer Vision and Pattern Recognition, 2013.
  • [34] G. Floros, B. Van Der Zander, and B. Leibe, "OpenStreetSLAM: Global vehicle localization using OpenStreetMaps," in International Conference on Robotics and Automation, 2013.
  • [35] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, "The Oxford Radar RobotCar dataset: A radar extension to the Oxford RobotCar dataset," in International Conference on Robotics and Automation, 2020.
  • [36] G. Kim, Y. S. Park, Y. Cho, J. Jeong, and A. Kim, "MulRan: Multimodal range dataset for urban place recognition," in International Conference on Robotics and Automation, 2020.
  • [37] M. Zhu, Y. Yang, W. Song, M. Wang, and M. Fu, "AGCV-LOAM: Air-ground cross-view based lidar odometry and mapping," in Chinese Control And Decision Conference, 2020.
  • [38] Y. Liao, J. Xie, and A. Geiger, "KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D," arXiv preprint, 2021.
  • [39] "OpenStreetMaps." [Online].
  • [40] Spherical Mercator. [Online].
  • [41] D. Barnes, R. Weston, and I. Posner, "Masking by moving: Learning distraction-free radar odometry from pose information," arXiv preprint, 2019.
  • [42] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Conference on Computer Vision and Pattern Recognition, 2016.
  • [43] D. Svensson, "Derivation of the discrete-time constant turn rate and acceleration motion model," in Sensor Data Fusion: Trends, Solutions, Applications, 2019.
  • [44] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, L. Chen, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska, "One thousand and one hours: Self-driving motion prediction dataset," arXiv preprint, 2020.
  • [45] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in Conference on Computer Vision and Pattern Recognition, 2020.
  • [46] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang, et al., "PandaSet: Advanced sensor suite dataset for autonomous driving," in International Intelligent Transportation Systems Conference, 2021.
  • [47] S. Agarwal, A. Vora, G. Pandey, W. Williams, H. Kourous, and J. McBride, "Ford multi-AV seasonal dataset," International Journal of Robotics Research, 2020.
  • [48] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., "Argoverse: 3D tracking and forecasting with rich maps," in Conference on Computer Vision and Pattern Recognition, 2019.
  • [49] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," arXiv preprint, 2022.
  • [50] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in European Conference on Computer Vision, 2018.
  • [51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Conference on Computer Vision and Pattern Recognition, 2009.
  • [52] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ADE20K dataset," International Journal of Computer Vision, 2019.
  • [53] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," arXiv preprint, 2019.
  • [54] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015.
  • [55] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint, 2016.
  • [56] Y. Wu and K. He, "Group normalization," in European Conference on Computer Vision, 2018.
  • [57] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint, 2017.
  • [58] H. Bao, L. Dong, and F. Wei, "BEiT: BERT pre-training of image transformers," arXiv preprint, 2021.
  • [59] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, 1992.