Learning Cross-Scale Visual Representations for Real-Time Image Geo-Localization

by Tianyi Zhang, et al.
University of Michigan

Robot localization remains a challenging task in GPS-denied environments. State estimation approaches based on local sensors, e.g. cameras or IMUs, are prone to drift on long-range missions as error accumulates. In this study, we aim to address this problem by localizing image observations in a 2D multi-modal geospatial map. We introduce the cross-scale dataset and a methodology to produce additional data from cross-modality sources. We propose a framework that learns cross-scale visual representations without supervision. Experiments are conducted on data from two different domains, underwater and aerial. In contrast to existing studies in cross-view image geo-localization, our approach a) performs better on smaller-scale multi-modal maps; b) is more computationally efficient for real-time applications; c) can serve directly in concert with state estimation pipelines.




I Introduction

Geo-localization plays a key role in autonomous and robotic systems exploring a priori unknown environments in the wild. To achieve better localization accuracy, a wide range of sensors have been used on today’s field robots. According to the reference frame, sensors can be generally categorized as local or global. Local sensors, e.g. cameras and IMU, observe the environment in a local coordinate frame. Global sensors, e.g. GPS, barometers, and magnetometers, provide global measurements in fixed global frames. While local sensors give high-precision local measurements, global sensors are noisier but do not suffer from the same drift effects when localizing the vehicle. The algorithmic combination of both kinds of sensors achieves locally accurate and globally drift-free performance on long-range tasks [Qin2019AGO].

However, there are many scenarios where global information is unavailable or only partially available, such as underwater, underground, or other GPS-denied environments. Taking underwater as an example, neither GPS nor land-based station towers can be accessed since electromagnetic waves are heavily attenuated. Acoustic localization is degraded by variations in salinity or temperature in the water body. Magnetometers and depth sensors are reliable global sensors underwater; however, they only provide measurements of up to 4 DOF in total. Global measurements of the most important 2 DOF, those on the horizontal plane, are still missing.

The absence of global sensing has raised a global challenge for geo-localization: the incremental state estimation based on local-only sensor systems, i.e. dead reckoning, is prone to drifts which accumulate with time. Hence, in long-range missions, we need to find an approach to control the growing localization error online.

Fig. 1: Cross-scale geo-localization: Small-scale raw map and large-scale image observation are encoded into probabilistic representations. Location of the image observation can then be inferred in the map.

In this study, we conceive a real-time geo-localization system (see Fig. 1) for robot platforms equipped with RGB cameras. In this system, a 2D geospatial map is encoded into a belief map. Observations from the camera are encoded into feature representations, and their 2D locations can be inferred in the belief map. The key problem we address in this study is how to encode 2D geospatial maps of various modalities and image observations into consistent representations. The contributions of this study are as follows:

  • We formulate the cross-scale geo-localization task, which can help robots with large-scale image observations navigate in small-scale maps;

  • We propose a workflow to sample cross-scale visual data from cross-modality resources. We build and release two datasets from different domains;

  • We propose to use a pixels-to-pixels map encoder to efficiently encode map patches with small scale;

  • We propose a framework which trains a map encoder and an image observation encoder jointly without supervision. We use the Bhattacharyya coefficient [bhattacharyya1946measure] as the similarity metric between the probabilistic outputs of both encoders. We modify NT-Xent [chen2020simple] as the loss function for our case.

  • We propose to use Dirichlet distribution to model the probability of encoded observation in an encoded map, which can be potentially leveraged in downstream inference applications.

II Related Work

II-A TAN with particle filtering

Navigating an autonomous underwater vehicle (AUV) or unmanned aerial vehicle (UAV) with a terrain elevation map has been studied for decades. Early work was conducted on an AUV equipped with a single-point sonar and a water depth sensor [uwpf2003]. Since acquiring information from a map is a highly non-linear operation, a particle filter method is applied to terrain-aided navigation (TAN) for state estimation. Subsequent work has improved the TAN method by plugging in different kinds of range sensors [tan2008profile, tan2019multibeam], upgrading the particle filter into different variants [tan2008pmf, tan2017pfcompare, tan2019mpmf], developing efficient mapping features [tan2006floodslam, tan2009effslam, tan2010slam], and realizing cooperative TAN with multiple vehicles [tan2016cooperative, tan2020colla, tan2021communication]. However, the success of a TAN system requires sufficient excitation from terrain elevation, which is not guaranteed in many scenarios.

In our work, we break the limitation of using range sensors and a terrain elevation map. Instead, we explore the possibility of using RGB cameras as sensors, and maps of different modalities which provide richer geospatial information.

II-B Cross-view image geo-localization

Cross-view image geo-localization refers to determining the geolocation of a query image with an overhead satellite image. This problem was first formulated as an information retrieval problem in ground-and-overhead image databases and was attempted based on the extraction and matching of hand-crafted features [cros2013lin]. Workman et al. [cros2015workman] and Vo and Hays [vo2016localizing] approached this problem with convolutional neural network (CNN) backbones for different views and evaluated the performance of different training strategies and embedding architectures. Hu et al. [cros2018cvmnet] proposed CVM-Net, which embeds the NetVLAD layer [cros2018netvlad] on top of a CNN to extract descriptors invariant to viewpoint changes. Other studies [cros2019airgan, cros2019safa, cros2020WhereAI] developed domain transfer methods to bridge the gap between different viewpoints, but they only apply to cases where the query image is panoramic.

Methods developed in this field of study are all based on the end-to-end framework of information retrieval and ranking. This makes it difficult to integrate such systems into TAN or other general simultaneous localization and mapping (SLAM) workflows, which are mostly based on filtering and optimization. Moreover, the exhaustive sliding window search is not efficient enough to deploy on mobile platforms that need real-time localization in a dynamically updating map.

In our study, we move away from the information retrieval framework and develop solutions more efficient and compatible for real-time geo-localization.

II-C RS image classification

Studies in remote sensing (RS) image classification have inspired our treatment of the feature association problem. Cao et al. [rs2018cao] introduced a land use classification network that integrates both aerial and street view images. Street view features are interpolated by geo-coordinates and concatenated with aerial features. Hong et al. [hong2020more] proposed feature fusion and network training strategies for multi-modal and cross-modal RS image classification. Both of these approaches need supervision from ground truth labels to train.

II-D Contrastive learning

Recent progress in contrastive learning has shown how CNNs can learn visual representations without supervision [chen2020simple, he2020momentum]. Further, Pielawski et al. [pielawski2020comir] proposed CoMIR, which addresses the multimodal image registration problem with contrastive learning. However, CoMIR works with different modalities at exactly the same scale, so it is not directly applicable to our localization problem, which has a large scale ratio between image observations and the map.

III Problem Formulation

We aim to address the geo-localization problem with RGB image observations and multi-modal maps referenced in 2D geo-coordinates. An RGB camera provides rich visual information and is one of the most affordable and widely equipped perceptual sensors on mobile robots. However, high-resolution RGB satellite images, which serve as maps in existing studies, have limited coverage of the earth. We seek to exploit other lower-resolution, smaller-scale modalities to serve as maps. In contrast to existing image geo-localization approaches, we focus on the following goals:

  1. Localization in smaller-scale maps (typically, scale of an image observation has a magnitude of or larger than a map);

  2. Localization in maps of different modalities, which massively extend the data that can serve as maps;

  3. Efficient computation for real-time data processing;

  4. Compact map description for potentially efficient transmission over network;

  5. Compatibility as a plug-in module (instead of an end-to-end standalone program) in state estimation pipelines.

IV Methodology

IV-A Cross-Scale Dataset

The basic data unit of our proposed cross-scale dataset is a data tuple, which consists of a 2D map patch, a set of image observations, and the pixel coordinates indicating where each image observation is located in the map patch. The whole dataset consists of a certain number of such data tuples sampled from one or multiple areas of interest.

Fig. 2 shows the workflow for sampling a data tuple. First, we sample a map patch with a random coordinate and a random rotation from the small-scale map data source. Then we randomly sample pixel coordinates within the map patch as the locations where image observations will be taken. These pixel coordinates are converted into global coordinates, which are used to sample the image observations from the large-scale image data source.
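The coordinate bookkeeping of this workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the released sampling code: the function names, the square patch, the rectangular sampling extent, and the metres-per-pixel parameter are all hypothetical, and the image-axis orientation is ignored for simplicity.

```python
import math
import random


def pixel_to_global(px, py, center_xy, theta, patch_size, metres_per_pixel):
    """Map a pixel coordinate in a rotated square patch to global (x, y) in metres."""
    # Offset from the patch centre, in metres, expressed in the patch frame.
    half = (patch_size - 1) / 2.0
    dx = (px - half) * metres_per_pixel
    dy = (py - half) * metres_per_pixel
    # Rotate the offset by the patch rotation, then translate to the patch centre.
    gx = center_xy[0] + dx * math.cos(theta) - dy * math.sin(theta)
    gy = center_xy[1] + dx * math.sin(theta) + dy * math.cos(theta)
    return gx, gy


def sample_data_tuple(extent, patch_size, metres_per_pixel, num_obs, rng):
    """Sample one data tuple: patch pose, pixel coordinates, and global coordinates.

    extent = (x_min, x_max, y_min, y_max) is a hypothetical area of interest.
    """
    center = (rng.uniform(extent[0], extent[1]), rng.uniform(extent[2], extent[3]))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    pixels = [(rng.randrange(patch_size), rng.randrange(patch_size))
              for _ in range(num_obs)]
    coords = [pixel_to_global(px, py, center, theta, patch_size, metres_per_pixel)
              for px, py in pixels]
    return center, theta, pixels, coords
```

The global coordinates returned here are where the large-scale image observations would be cropped; the cropping itself depends on the data source and is left out.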

Fig. 2: Workflow of sampling the cross-scale dataset from data source of map and image observations. As the basic unit of our proposed dataset, a data tuple consists of a map patch, certain number of image observations and their pixel coordinates in the map patch.
Fig. 3: The overview of our proposed framework. A map patch is encoded into a belief map. An image observation is augmented into two views then encoded into 1D representation. Anchor features are extracted from belief map by pixel coordinates. Image’s representations and anchor features with corresponding pixel coordinates will serve as positive pairs and the rest of the images will be the negative examples.

IV-B Network

The proposed network consists of a map encoder and an observation encoder that extract features from map patches and image observations (see Fig. 3). The map encoder encodes a map patch into a belief map, which has the same height and width as the input. The number of channels of the belief map corresponds to the number of categories of terrain representations. The observation encoder encodes an image into a 1D representation whose size matches the channel dimension of the belief map, so that the two can be compared. Both encoders are expected to learn representations that are consistent for the same location while distinguishable for different types of terrain.

Contrastive learning is applied to train both encoders jointly. For clarity, we use one index for the data tuples in a mini-batch and a second index for the image observations and their corresponding pixel coordinates within one data tuple. Within a mini-batch, each map patch is encoded into a belief map. Regardless of how the map encoder is realized, a softmax function as its last layer converts the output scores of each pixel into a discrete probability distribution. We expect potential downstream applications to perform inference based on this property. The anchor feature is then extracted from the belief map at the pixel coordinate of each image observation. Each image observation is first randomly augmented into two views by two different augmentation operators randomly sampled from the same augmentation family. While various kinds of image augmentation have been recommended by [chen2020simple], which augmentations to apply in training depends on the task. For example, color distortion can be necessary for underwater applications. Both augmented views are then encoded into representations by the observation encoder. As in the map encoder, a softmax operation is applied as the last layer of the observation encoder. Both representations are considered positive examples of the anchor feature. We treat the augmented views of all other image observations in the mini-batch as negative examples.
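The per-pixel softmax and the anchor read-out described above can be illustrated with a small sketch. It assumes a channel-last nested-list layout (score_map[row][col] is a list of channel scores) and hypothetical function names; a real implementation would operate on tensors produced by the encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def belief_map_from_scores(score_map):
    """Apply softmax independently at every pixel of a channel-last score map."""
    return [[softmax(pixel) for pixel in row] for row in score_map]


def anchor_feature(belief_map, row, col):
    """Read out the discrete distribution stored at one pixel coordinate."""
    return belief_map[row][col]
```

Because every pixel holds a distribution that sums to one, an anchor feature can be compared directly with the softmax output of the observation encoder.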

IV-C Similarity and Contrastive Loss

We adapt NT-Xent (the normalized temperature-scaled cross entropy loss) [chen2020simple] to our case where one anchor feature has two positive examples. The loss function for an anchor feature is defined as:


Here the loss involves a temperature parameter [chen2020simple] and a similarity function between an anchor feature and an encoded image representation. Since the outputs after the softmax layer are interpreted as discrete probability distributions, we use the Bhattacharyya coefficient [bhattacharyya1946measure] as the similarity function, which is defined by:



The square root is applied element-wise. Though cosine similarity is widely used in well-established contrastive learning frameworks [chen2020simple, he2020momentum], we do not use it in this work. Cosine similarity normalizes the encoder outputs before the softmax, removing the ability to constrain the magnitude of the network activations. This property leads to inconsistent outputs between the two encoders after the softmax.
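For reference, the standard Bhattacharyya coefficient between two discrete distributions is the sum of the element-wise square roots of their products. The sketch below implements it together with one possible NT-Xent-style loss for an anchor with two positives. The loss form is an assumption for illustration (each positive is scored against all candidate views and the per-positive terms are averaged), the function names are hypothetical, and the default temperature is a placeholder rather than the value used in the experiments.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two discrete distributions of equal length."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def multi_positive_nt_xent(anchor, positives, negatives, tau=0.5):
    """Assumed NT-Xent-style loss: mean negative log-softmax of the positives.

    Each candidate view (positive or negative) gets the logit sim(anchor, view) / tau.
    """
    candidates = list(positives) + list(negatives)
    logits = [bhattacharyya(anchor, z) / tau for z in candidates]
    # Log-sum-exp over all candidates, computed stably.
    peak = max(logits)
    log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits))
    pos_logits = logits[:len(positives)]
    return -sum(v - log_denom for v in pos_logits) / len(positives)
```

The coefficient equals 1 for identical distributions and 0 for distributions with disjoint support, so it is bounded without any normalization of the encoder outputs.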

IV-D Inference

At a given location (a pixel coordinate in the belief map), we want to find the distribution of the observed representation given the anchor feature extracted from the belief map at that location. We model the distribution of the observed representation as a Dirichlet distribution:


where the parameter vector of the Dirichlet distribution has the same size as the anchor feature, and a free parameter is left to tune.
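A minimal sketch of evaluating such a density over a belief map is given below. The Dirichlet log-density itself is standard; the parameterization alpha = 1 + c * anchor, with a free scale c, is an assumption chosen here so that the density peaks when the observed representation equals the anchor feature, and it may differ from the parameterization used in this work. Function names and the channel-last layout are hypothetical.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex (all x_k > 0)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))


def observation_log_density(z, anchor, c):
    """Assumed parameterization: alpha = 1 + c * anchor (mode located at the anchor)."""
    alpha = [1.0 + c * m for m in anchor]
    return dirichlet_log_pdf(z, alpha)


def log_density_map(belief_map, z, c):
    """Evaluate the log density of observation z at every pixel of a belief map."""
    return [[observation_log_density(z, pixel, c) for pixel in row]
            for row in belief_map]
```

The resulting map can be exponentiated and rendered as a heat map, or used directly as a log-likelihood in a downstream filter.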

V Experiments

V-A Dataset

We build two datasets for this study, Scott Reef dataset and Kempten dataset. Each dataset contains 1000 data tuples to split for training and validation, and 200 data tuples for testing. For both datasets, we sample the image observation with the resolution of and the map patch with .

Scott Reef dataset is built from the Scott Reef 25 dataset (2009) [acfr2010dataset] provided by the Australian Centre for Field Robotics (ACFR). The 2D RGB reconstruction covers an area of approximately with the resolution . Image observations are sampled from the raw-resolution reconstruction with scale. Map patches are sampled from the reconstruction down-scaled by a factor of , which has a scale of .

Kempten dataset is built with the Sentinel-1 SAR [ee_sentinel] as map and Google Map Satellite [gmap] as image observations. Sentinel-1 data has a scale of , and 4 channels, i.e. , , and . We project the Sentinel-1 data using Pseudo-Mercator (EPSG:3857) [ee_sentinel]. Google Map Satellite’s RGB images have a scale of with zoom level . Data are collected around and . We choose this area because of diverse terrain types and consistent satellite image quality.
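For readers unfamiliar with the projection, the sketch below shows the standard spherical Pseudo-Mercator forward formulas and the ground resolution implied by a web-map zoom level. The function names are hypothetical, and the 256-pixel tile size is an assumption about the tiling scheme; the latitude and zoom arguments are left to the caller and are not the values used for the Kempten dataset.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_pseudo_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Pseudo-Mercator x, y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat_deg) / 2.0))
    return x, y


def ground_resolution_m_per_pixel(lat_deg, zoom, tile_size=256):
    """Approximate metres per pixel at a latitude for a given zoom level."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_M
    return circumference * math.cos(math.radians(lat_deg)) / (tile_size * 2 ** zoom)
```

The inverse of the ground resolution gives a scale in pixels per metre, which is the unit used for the scales quoted in this section.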

V-B Implementation

The first parameter to determine is the size of the output feature, which should be sufficient to describe the types of terrain in the dataset. In practice, this size is tuned as a hyperparameter in training. However, its magnitude can be predetermined with some human knowledge of terrain types. For example, if we believe the number of terrain types in the area of interest lies within a certain range, then that is the range over which we tune the parameter.

We realize the map encoder with FCN-ResNet50 [long2015fully]. We set the last conv2d layer of FCNHead to output a feature map whose channel count equals the output feature size. We add a softmax layer at the end of the network to convert the score map to the belief map. We choose ResNet18 [he2016resnet] as the image encoder. Similarly, we set the FC layer to output representations of the same size and add a softmax layer as the last layer.

The temperature of the loss function used for the presented results is selected empirically.

In our image augmentation family, we include the finite, cyclic symmetry group of rotations by multiples of a fixed angle [pielawski2020comir] to learn rotation-invariant representations. We also apply random changes in brightness, contrast, saturation, and hue.

The network is trained with an SGD optimizer with momentum [sutskever2013momentum] for 300 epochs. The learning rate starts from an initial value and is reduced by a constant factor once learning stagnates. We include 8 data tuples in a mini-batch, and each data tuple contains 6 image observations. In other words, a mini-batch contains 8 map patches and 48 image observations, which is the largest that fits in a 12GB GPU.
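To make the mini-batch composition concrete, the sketch below enumerates the positive and negative views for every anchor, assuming two augmented views per image observation as in the network description. The index layout (tuple index, observation index, view index) and the function name are hypothetical.

```python
def contrastive_pairs(num_tuples, obs_per_tuple, views_per_obs=2):
    """Return {anchor: (positives, negatives)} using (tuple, obs, view) indices."""
    views = [(i, j, v)
             for i in range(num_tuples)
             for j in range(obs_per_tuple)
             for v in range(views_per_obs)]
    pairs = {}
    for i in range(num_tuples):
        for j in range(obs_per_tuple):
            # Views of the same observation are positives for its anchor.
            positives = [view for view in views if view[:2] == (i, j)]
            # Views of every other observation in the mini-batch are negatives.
            negatives = [view for view in views if view[:2] != (i, j)]
            pairs[(i, j)] = (positives, negatives)
    return pairs
```

With 8 data tuples and 6 observations each, this yields 48 anchors, each with 2 positive views and 94 negative views.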

V-C Visualizations

We visualize the representations learned by both backbones to show that they learn consistent representations, see Fig. 4 and Fig. 5. Fig. 4(a) shows an RGB map patch from the Scott Reef dataset. Fig. 4(b) presents the belief map in segmentation style by performing an argmax operation along the channel dimension. As shown in 4 different colors, the map patch is encoded into 4 different types of representations. The segmentation-style map shows a terrain pattern that generally aligns with the raw RGB map. Fig. 4(c) profiles the belief map along the axis in red. We can observe changes in the 4 types of representations along the axis. We compare the representations encoded from image observations and from the map in Fig. 4(d). The positions where the image observations are located are also indicated in Fig. 4(a), 4(b) and 4(c). From the comparison we see that both encoders learn similar representations across scales. We also find that the representations learned without supervision can be given a description by a human observer: observations corresponding to b/B or c/C are densely reef-covered terrain, d/D or e/E are sparsely reef-covered, a/A has some dense texture, and f/F is almost pure sand. Note that both d and D are combinations of two types of terrain, and the proportions reflected by their representations are close to each other.

(a) Map patch with RGB channels (scale ≈ 61 pixel/m)
(b) Belief map visualized in segmentation style
(c) Profile of the belief map along the axis in red
(d) Image observations (scale ≈ 500 pixel/m) and representations, labelled with a-f. Corresponding representations extracted from belief map are labelled with A-F.
Fig. 4: Visualized representations for Scott Reef dataset

(a) Map patch with hh, hv and vv channels (scale = 0.1 pixel/m)
(b) Belief map visualized in segmentation style
(c) Profile of the belief map along the axis in red
(d) Image observations (scale ≈ 7 pixel/m) and representations, labelled with a-f. Corresponding representations extracted from belief map are labelled with A-F.
Fig. 5: Visualized representations for Kempten dataset

We do the same visualization to a map patch from the Kempten dataset (see Fig. 5). In this patch, we see 5 different kinds of terrains represented. a/A or d/D are human artifacts, b/B is water area, c/C is farmland with farming texture, e/E is the transition area between farmland and woods, and f/F is woods and forests.

We evaluate the probability density of image observation with each pixel in the belief map by Eq. 3 (with ), and visualize as a heat map (Fig. 6 and Fig. 7). The arrows indicate the ground truth position of image observations in the map. We can see that the images presented all lie in the area with deep red color indicating a high probability density.

(a) Raw RGB map
Fig. 6: Scott Reef Dataset: locations of image observations inferred in the map
(a) Raw SAR map
Fig. 7: Kempten Dataset: locations of image observations inferred in the map

V-D Numerical Evaluation and Comparisons

We compare our approach with Triplet Network [vo2016localizing] and CVM-Net I [cros2018cvmnet]. To keep the comparison fair, we also experiment with their backbones replaced by ResNet18, which is identical to our image encoder (without the softmax layer). Since the other approaches all need image pairs from both scales for training and a sliding window search for testing, we sample small-scale image tiles from the raw map. We choose the tile resolution because it covers approximately the same (or a larger) field of view (FOV) under the scale ratio, and because the performance of the backbone networks has been demonstrated on datasets with this resolution, e.g. CIFAR-10 (32×32 pixels).
                              Scott Reef            Kempten
Method                        1%        5%          1%        5%
Triplet [vo2016localizing]    7.11%     13.92%      18.10%    22.40%
Triplet (ResNet18)            63.86%    66.98%      34.91%    37.56%
CVM-Net [cros2018cvmnet]      29.48%    51.31%      33.61%    38.66%
CVM-Net (ResNet18)            27.27%    48.85%      24.48%    34.42%
Ours                          51.92%    56.35%      35.81%    40.28%
TABLE I: Comparison by average recall rate

V-D1 Recall

We first investigate the recall rate within the top-ranked percentage of responses, following [vo2016localizing, cros2018cvmnet]. A higher recall rate means that the ground truth location is more likely to be included in the area with the highest inferred responses. Since we expect our approach to work in conjunction with other localization frameworks rather than alone, we evaluate the recall rate on each map patch from the testing set, instead of on the whole region of interest. The recall rates at the top 1% and top 5% are reported in Table I. Our approach shows the second-highest recall on the Scott Reef dataset and the highest on the Kempten dataset. It is worth mentioning that the scale ratio between image observation and map patch is larger for Kempten than for Scott Reef (compare the scales given in Fig. 4 and Fig. 5). The results above imply that as the scale ratio gets larger, it becomes harder to extract and associate local features across scales. Since our approach observes the map as a whole instead of through a local sliding window, its recall rate appears to degrade the least.
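One way to compute such a recall rate on a per-patch response map is sketched below. This is an assumed protocol for illustration (a hit is counted when fewer than the given fraction of pixels score strictly higher than the ground truth pixel); the exact evaluation protocol of the compared works may differ, and the function names are hypothetical.

```python
def in_top_fraction(response_map, gt_row, gt_col, fraction):
    """True if the ground truth pixel ranks within the top fraction of responses."""
    values = [v for row in response_map for v in row]
    gt_value = response_map[gt_row][gt_col]
    # Count pixels that respond strictly more strongly than the ground truth pixel.
    num_better = sum(1 for v in values if v > gt_value)
    return num_better < fraction * len(values)


def recall_at(fraction, cases):
    """Average hit rate over (response_map, gt_row, gt_col) test cases."""
    hits = sum(in_top_fraction(m, r, c, fraction) for m, r, c in cases)
    return hits / len(cases)
```

For our approach the response map would be the density evaluated at every belief-map pixel; for sliding-window baselines it would be the similarity score of each window.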

(a) Scott Reef dataset
(b) Kempten dataset
Fig. 8: Running particle filter on selected map patches. The groundtruth trajectory, noisy estimations and filtered estimations are plotted in red, grey and green respectively. The trajectory propagates from left to right.

V-D2 Synthetic trajectory

Compared to the urban environments studied in [vo2016localizing, cros2018cvmnet, cros2018netvlad], features in nature are more repetitive. Particularly in small-scale maps, the network learns similar representations for terrains with the same appearance. Hence we do not report any precision metrics, which depend heavily on the unequal coverage of each terrain type. Instead, we directly evaluate how effective the learned representations are in state estimation with synthetic trajectories and particle filtering. We select 5 map patches with variations in terrain appearance from each dataset and generate a straight-line 2D trajectory on each. We add Gaussian noise to the 2D translation and rotation in the incremental estimation. For each trajectory, we randomly generate 100 noisy sequences. Our goal is to evaluate the average improvement in state estimation with particle filtering. The implementation of particle filtering follows [uwpf2003]. For the other approaches, we use a transform of the cosine similarity to update the particle weights.
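The measurement update at the core of such a filter can be sketched generically as below. This is not the implementation of [uwpf2003]; it only illustrates how a per-particle likelihood (for instance the density of the observed representation at each particle's map location) re-weights the particles, followed by standard systematic resampling. Function names are hypothetical.

```python
import random


def update_weights(weights, likelihoods):
    """Multiply prior weights by observation likelihoods and renormalize."""
    updated = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(updated)
    if total == 0.0:
        # Degenerate case: no particle explains the observation; reset to uniform.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in updated]


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples with evenly spaced positions on the weight CDF."""
    n = len(particles)
    start = rng.random() / n
    resampled = []
    cumulative = weights[0]
    j = 0
    for i in range(n):
        position = start + i / n
        # Advance to the particle whose CDF interval contains this position.
        while position >= cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        resampled.append(particles[j])
    return resampled
```

Between updates, each particle would be propagated with the noisy incremental motion estimate before the next observation is applied.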

                              Scott Reef                                       Kempten
Method                        Patch 1  Patch 2  Patch 3  Patch 4  Patch 5     Patch 1  Patch 2  Patch 3  Patch 4  Patch 5
Triplet [vo2016localizing]    -5.2%    1.3%     -1.8%    -13.3%   -1.1%       5.1%     6.5%     12.3%    6.6%     25.4%
Triplet (ResNet18)            45.7%    33.2%    18.4%    12.1%    8.1%        31.1%    19.13%   22.8%    0.1%     13.3%
CVM-Net [cros2018cvmnet]      -5.6%    21.7%    9.3%     -0.2%    23.0%       30.2%    30.9%    23.2%    28.15%   44.91%
CVM-Net (ResNet18)            23.9%    8.5%     31.5%    7.0%     14.6%       43.3%    15.8%    13.1%    29.4%    22.9%
Ours                          58.4%    18.5%    25.7%    25.05%   25.09%      46.6%    37.9%    28.2%    15.4%    19.9%
TABLE II: Error reduction on selected map patches. Negative values mean that the error is increased.
Method                        Time         GPU Memory   Belief Map Size
Triplet [vo2016localizing]    0.3061 s     1.97 GB      5.01 Mb
Triplet (ResNet18)            7.7604 s     1.61 GB      5.01 Mb
CVM-Net [cros2018cvmnet]      11.6331 s    1.13 GB      2560.06 Mb
CVM-Net (ResNet18)            19.6503 s    1.70 GB      2560.06 Mb
Ours                          0.1606 s     1.05 GB      5.01 Mb
TABLE III: Computational efficiency. The belief map size depends on the spatial size of the belief map, the number of channels, and the descriptor size.

We visualize the particle filtering on patch 1 from each dataset in Fig. 8. We see that for both patches, estimation errors along the trajectory are generally reduced with particle filtering, which means that our proposed network learns an effective representation for inference across the scale ratio and modality gap. However, from the visualization, we also observe certain outlying sequences that do not converge to the ground truth trajectory. One possible reason is that inconsistent representations push the estimate away from the true trajectory. Also, when neither the observations nor the map vary, particle filtering only adds uncertainty to the system. In such cases, the particle filtering only exaggerates the drift of the estimate, which raises a concern about deployment in real environments.

The correction performance of particle filtering on all selected map patches is presented in Table II. For each estimated trajectory sequence, we use the accumulated L2 error as the evaluation metric. To eliminate the effect of outliers, we report the median accumulated error among all the sequences for each map patch instead of the average error. Our approach outperforms the other approaches on three out of five trajectories in each dataset. Although a particle filter usually performs unpredictably across different maps or trajectories, our approach is one of the most stable, reducing the error on all selected patches.
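The metric described above can be sketched as follows. The definition of the reduction as the relative drop of the median accumulated error from noisy to filtered sequences is an assumption for illustration; the function names are hypothetical, and trajectories are lists of (x, y) points of equal length.

```python
import math
import statistics


def accumulated_l2_error(estimated, ground_truth):
    """Sum of point-wise Euclidean distances between two trajectories."""
    return sum(math.dist(e, g) for e, g in zip(estimated, ground_truth))


def median_error_reduction(noisy_sequences, filtered_sequences, ground_truth):
    """Relative reduction of the median accumulated error (assumed definition)."""
    noisy = statistics.median(
        [accumulated_l2_error(s, ground_truth) for s in noisy_sequences])
    filtered = statistics.median(
        [accumulated_l2_error(s, ground_truth) for s in filtered_sequences])
    return (noisy - filtered) / noisy
```

A negative return value corresponds to a filter that increased the error, as in the negative entries of Table II.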

V-D3 Computational cost and efficiency

The computational resources consumed to encode a single map patch are presented in Table III. The experiment is conducted on a single Titan V GPU with a 1.20GHz core, and we limit GPU memory usage to 2 GB, which is more practical on mobile robot platforms. Our approach is the most efficient in terms of time and memory consumed. This is because the other approaches conduct a sliding window search that processes a large batch of image tiles in parallel, while our approach takes the whole map patch as input. The belief map of our approach has the same compact size as that of the Triplet network. CVM-Net uses a high-dimensional descriptor for deep features of each category, hence its map size is multiplied by the descriptor size. The compact map size of our approach enables potential applications with real-time map streaming over a network.

VI Conclusions

In our proposed cross-scale geo-localization task, we break the limitations of existing studies, which rely on large-scale high-resolution satellite images as maps. Instead, we extend the range of modalities that can serve as a map in geo-localization. We build two datasets from two domains, underwater and aerial, across different modalities and platforms. On both datasets, our proposed framework demonstrates the ability to learn consistent representations from image observations and maps. In particular, our approach handles the significant scale ratio between them. In contrast to previous studies, we move away from the paradigm of image retrieval and exhaustive search. We encode a map patch into a belief map, which results in the best computational efficiency in terms of time, memory, and storage consumption. These properties make our approach suitable where a map needs to be updated in real time. We also believe that for small-scale maps, a pixel-wise encoder looking at the whole map is better at extracting and associating features across scales, which is supported by the comparison of recall rates. Experiments with synthetic trajectories show that representations learned with our approach are the most effective in localization on three of the five selected patches in each dataset. Finally, the probabilistic nature of our cross-scale representation makes it compatible as a plug-in module in state estimation pipelines.

This study overall provides an idea for localizing a perceptual robot in the field with a map. To train the system to describe the map and observations with abstract language, what we need is just image observations tagged with 2D geospatial locations in the map. No labelling is needed at all as our approach is based on contrastive learning.

Our future work will focus on compressing and transmitting the belief map over a robotic network, and on demonstrating this work on real robots. On the theoretical side, we will work on understanding the uncertainty of learned representations to make the localization inference robust against inconsistent representations.


The authors would like to acknowledge the Australian Centre for Field Robotics’ Marine Robotics Group for providing the data.