On the Role of Geometry in Geo-Localization

06/26/2019 ∙ by Moti Kadosh, et al. ∙ IDC Herzliya 12

Humans can build a mental map of a geographical area to find their way and recognize places. The basic task we consider is geo-localization - finding the pose (position & orientation) of a camera in a large 3D scene from a single image. We aim to experimentally explore the role of geometry in geo-localization in a convolutional neural network (CNN) solution. We do so by ignoring the often available texture of the scene. We therefore deliberately avoid using texture or rich geometric details and use images projected from a simple 3D model of a city, which we term lean images. Lean images contain solely information that relates to the geometry of the area viewed (edges, faces, or relative depth). We find that the network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions of this paper are: (i) providing insight into the role of geometry in the CNN learning process; and (ii) demonstrating the power of CNNs for recovering camera pose using lean images.




1 Introduction

Recently, several works in the field have focused on trying to understand how neural networks represent data and what their limits are [33]. Our paper’s main goal is to study the role of geometry in a CNN solution to the geo-localization task, rather than to propose a working system for application purposes.

What is the geo-localization task? Imagine you are brought blindfolded to a street corner of a city you know well. Now, you remove the blindfold. Can you tell where you are? In the computer vision field, this amounts to estimating the position (and sometimes the orientation) of a camera given its current view. Although localization devices such as Global Positioning Systems (GPS) have improved significantly in recent years, they often do not work well in city scenes and do not provide highly accurate results. Autonomous cars, drones, and IoT devices are expected to benefit tremendously from the ability to determine their pose (position & orientation) accurately in their environment.

Figure 1: Top: lean images contain mostly geometric features: edges (left), faces (center), and depth information (right). We train a CNN to solve the localization problem using such images alone. Bottom: a top view of a city area (buildings are marked as white) where color indicates the localization success rate of the network from red (high) to blue (low). For instance, note how open spaces are more distinct than narrow streets.
Figure 2: Bird’s-eye view of one of the areas we used.

A solution for geo-localization, either by a human or by a machine, can use appearance cues (e.g., the texture of a unique building), geometric cues (e.g., a unique shape of a building), or both. In the ‘80s and ‘90s, many computer vision tasks were based on edge images (i.e., mostly geometry). More recently, significant improvements were obtained in object recognition, scene recognition, and localization tasks, largely by exploiting the appearance of the scene (i.e., color and texture, and image features such as SIFT [26, 15, 14]). Later, these methods were improved by adding coarse geometric constraints to the image features (e.g., [20, 2]). Nowadays, methods are based on machine learning, in particular convolutional neural networks (CNNs), where the input is the unprocessed image. Clearly, both appearance and geometry play an important role in these methods.

We aim to explore the role of geometry alone in geo-localization using an end-to-end deep neural network, while ignoring the often available texture of the scene. To do this, we consider the geo-localization task using lean images. Lean images contain mostly information that relates to the geometry of the scene, while lacking texture or rich geometric details. In particular, we use a city scene and consider two types of binary images that consist of the edges of the buildings’ outlines and of the buildings’ facades. In addition, we also consider depth images that contain more geometric information.

Examples of the three types of lean images are shown in Figure 1 (top). Note that in the first row, the view contains dominant landmarks, while the second row shows very little distinct information that might be expected to assist localization. Such non-distinct views are very common in large environments such as a city, making localization with lean images very challenging. Further note that we deliberately do not consider real images or synthetic images with texture, since our goal is to study only the information available from purely geometric information.

We use an untextured 3D mesh model of Berlin [3] to generate our data. A bird’s-eye view of one of the areas is shown in Figure 2. Using such a model allows us to study the role of geometry for geo-localization in a controlled manner and at a larger scale than ever before, both in terms of the area covered (many city streets) and in terms of the number of images (up to hundreds of thousands). Our images are obtained simply by projecting the model onto various positions in the scene. Each image is defined by four parameters, (x, y, θ, φ): (x, y) represents the camera position on the ground plane, and θ and φ represent the yaw and pitch angles of the camera, respectively. We assume for simplicity that the picture is taken at a fixed height, and the roll angle is fixed as horizontal.
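The projection of model vertices for a given pose can be sketched as follows. This is a minimal illustrative pinhole-projection sketch, not the authors' actual rendering pipeline; the axis conventions, focal length, and principal point are assumptions.

```python
import numpy as np

def camera_axes(yaw, pitch):
    """Right/up/forward axes for a camera with fixed roll.
    Conventions (assumed): z is up, yaw rotates about z, pitch tilts the view."""
    forward = np.array([np.sin(yaw) * np.cos(pitch),
                        np.cos(yaw) * np.cos(pitch),
                        np.sin(pitch)])
    right = np.array([np.cos(yaw), -np.sin(yaw), 0.0])
    up = np.cross(right, forward)
    return right, up, forward

def project(points, x, y, yaw, pitch, height=1.7, focal=256.0, cx=128.0, cy=128.0):
    """Project 3D model vertices to pixel coordinates for pose (x, y, yaw, pitch).
    The camera height is fixed (human height) and the roll is fixed as horizontal."""
    right, up, forward = camera_axes(yaw, pitch)
    rel = np.asarray(points, dtype=float) - np.array([x, y, height])
    depth = rel @ forward
    in_front = depth > 1e-6            # discard points behind the camera
    rel, depth = rel[in_front], depth[in_front]
    u = focal * (rel @ right) / depth + cx   # horizontal pixel coordinate
    v = cy - focal * (rel @ up) / depth      # vertical pixel coordinate (y grows down)
    return np.stack([u, v], axis=-1), depth
```

For example, a vertex straight ahead of the camera projects to the principal point, and vertices behind the camera are discarded.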

A typical geo-localization solution will return either the pose from which an image was taken or the most similar image from a database. We consider two variants of the geo-localization task. The first task is recognizing a previously seen view of the scene, which we refer to as Geo-Matching. The second is determining the camera pose in a previously unseen view, which we refer to as Geo-Interpolation.
The geo-matching task can be regarded as image retrieval from a database of all available views of the city. A naïve solution would store all images and then perform a brute-force search in the database. However, this is inefficient and can become infeasible as the database gets larger. Defining a compact representation and an efficient search is clearly desired, and this is often done with a manually engineered image representation (e.g., a dictionary of image features) and an image retrieval approach, including the metric between the stored representation and a target one (e.g., [27, 15, 21, 34, 18, 25, 7, 14]). Neural networks were shown to be effective in geo-localization tasks (e.g., [12, 31, 17]). They may be used to perform both functions: provide storage and learn the metric. The questions we address here are (i) whether CNNs can also be used to solve the geo-matching task from lean images, and (ii) whether geometric information is used by the CNN to solve the task. In the geo-interpolation task the image query is not part of the training set. In this case we ask (iii) whether the CNN can generalize and support geo-interpolation in such large environments using only geometric and spatial data.

As discussed in the results section (Sec. 6), we found positive answers to all these questions, but the results depend on the number of images and their sampling density. We believe this indicates that networks can learn some sort of a spatial map for an area using only geometric data, since no colors or textures are available in our data. The success of geo-localization also depends on the specific position. Figure 1 (bottom) shows how certain positions in the streets of a city are more recognizable than others.

The paper presents an empirical study regarding the information that can be used by CNNs; we do not propose a practical solution based on lean images. The main contributions of our study are: (i) proposing a systematic method to study the role of geometry in CNNs when trained to solve geo-localization tasks; (ii) demonstrating the power of CNNs to use the geometric information efficiently; and (iii) showing that lean images contain sufficient information for solving the geo-matching and geo-interpolation tasks.

2 Related Work

Place recognition (e.g., recognizing the Eiffel Tower in an image) can be regarded as a coarse geo-localization task. Finding images of the same place is a basic tool for solving this task. Classic approaches use visual features to represent each image in a set of images of a given location (e.g., by a bag of words) and then match a target image with the stored representations (e.g., [26, 27, 15, 21]). Hays & Efros [7] were the first to address the place recognition task using millions of geo-tagged images. In their study they compare the results obtained when various visual features are used (tiny images, color histograms, texton histograms, line features, gist descriptors with color, and geometric context).

In our study, we consider the geo-localization task, where both the position and the orientation of a camera with respect to a scene should be estimated. A possible solution can be obtained using triangulation with images of known pose (e.g., [34]). In most studies, 3D models of the scene are used by means of point clouds (e.g., [9, 23, 16, 29]), Digital Elevation Maps (DEMs) (e.g., [1, 2]), or full 3D models (e.g., [20]). One of the main challenges of these works is to develop an efficient computation of 2D-to-3D feature matching. The matching can then be used to determine the query image pose with respect to the model. Computing the matching requires dealing with a very large search space, and outliers must also be discarded. Works that deal with these challenges include classic studies of outlier removal (e.g., [5, 6, 29, 13, 24]).

New 3D feature representations have also been developed (e.g., [9, 23]). Bansal & Daniilidis [2] introduced a feature more closely related to the lean images we consider: 3D corners and direction vectors extracted from a Digital Elevation Map (DEM), matched geometrically to the corners and roof-line edges of buildings visible in a street-level query image.

Efficiency and robustness become even more important when dealing with a city-scale 3D model. A fast method for inlier detection that enables solving the correspondence problem at such a scale was suggested by Svärm et al. [28]. A recent survey of existing localization methods can be found in [19].

One of the key ideas that bypasses the challenge of defining an efficient and robust 2D-3D feature matching, as required by the abovementioned methods, is to use an end-to-end CNN solution that performs both feature extraction and matching. PoseNet [12] is an impressive CNN-based approach for solving the pose of real images. A dataset of images was used for training GoogLeNet [30], where the 6-DoF pose of the camera was used as ground truth. The final softmax layer of GoogLeNet, originally used for an object classification task, was replaced by a vector for a regression task. GoogLeNet was pre-trained on the ImageNet dataset for the object recognition challenge [4, 22]. Walch et al. [31] suggested an improvement to the PoseNet CNN architecture by adding an LSTM, which reduces the dimensionality of the feature vector. Melekhov et al. [17] used ResNet34 [8] with an encoder-decoder structure to improve model accuracy. Kendall & Cipolla [10] improved their earlier work [12] by applying an uncertainty framework to the CNN pose regressor. In another work, Kendall & Cipolla [11] studied the effect of various loss functions on the result of PoseNet.

In our study we assume a 3D model of a city is given. Our setup is very challenging since the model and the images consist of only the coarse 3D structure of the scene, without texture for computing image features. On the other hand, our images are noise-free and there are no object-level occlusions such as trees, cars, and people. Our method uses a CNN in a similar manner to PoseNet; however, we use the ResNet50 architecture, also modified for regression, which produced better results for our task. We trained our network from scratch: since we use lean images, which are projections of an untextured 3D model, pre-training done on textured images is irrelevant. In addition, we were not limited by data size, as we could project as many images as we chose.

Most importantly, our goal differs from that of the aforementioned methods: whereas they focus on obtaining a better and faster solution for geo-localization, we focus on trying to understand the role of geometry, alone, in geo-localization, by systematically training and testing the same neural network on controlled datasets.

3 Data

“Your network is as good as your data” is a common phrase in the world of neural networks. Our case is no different. In this respect, using a 3D model as the data source is highly advantageous: we can sample as many images as necessary from the 3D model in any position, orientation and resolution.

All images used in our study are projections of a 3D model of Berlin [3] without textures. This model is very simple: it contains only the geometry of building walls and rooftops, and does not contain any fine geometric details such as window frames or doors (see the bird’s-eye view in Figure 2). We consider three types of images: edge, face, and depth map (see Figure 1, top), and we call them lean images since they contain no texture or structural details. Face images are in fact binary images of the buildings’ facades.

We generated several image datasets that are sampled uniformly along a 4D grid, where each image is defined by its camera pose. That is, (x, y, θ, φ), where (x, y) is the position on the ground plane and (θ, φ) is the camera orientation. The sampling density in the (x, y) domain varies between the datasets but is fixed in the (θ, φ) domain. Each set of images is created in a defined area of the city. The number of images in the set is determined by the size of the area and the grid sampling density. The three types of lean images were generated for each sampled pose.

Figure 3: Example of sampling positions on an area of the map. For the training set: green indicates valid samples and red invalid samples. For the test set: blue indicates valid samples and orange invalid samples. (Better viewed on screen.)

When dealing with lean images, care must be taken not to include empty images, for example when the camera faces a building wall from a short distance. Such images contain almost no visual information and do not contribute to the learning process. We define an image as invalid if it contains fewer than a threshold number of edges, or if it does not contain a skyline (at least a fixed fraction of its top-most pixel row must be sky). Moreover, images associated with a pose inside a building are irrelevant to geo-localization and are also defined as invalid. We remove invalid images from both the training and the test sets (see Figure 3).
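A validity check along these lines can be sketched as follows. The thresholds `MIN_EDGE_PIXELS` and `MIN_SKY_FRACTION` are hypothetical placeholders, since the paper does not state the exact values.

```python
import numpy as np

MIN_EDGE_PIXELS = 500    # hypothetical threshold on edge content
MIN_SKY_FRACTION = 0.2   # hypothetical fraction of sky in the top pixel row

def is_valid_lean_image(edge_img, sky_mask, pose_inside_building):
    """edge_img: binary HxW edge image; sky_mask: binary HxW mask, 1 where sky."""
    if pose_inside_building:                   # poses inside buildings are irrelevant
        return False
    if edge_img.sum() < MIN_EDGE_PIXELS:       # almost empty view, e.g. facing a wall
        return False
    if sky_mask[0].mean() < MIN_SKY_FRACTION:  # no visible skyline in the top row
        return False
    return True
```

The same check is applied when generating both the training and the test sets.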

Although representing the orientation by Euler angles (θ, φ) is natural and easy to comprehend, it suffers from ambiguity and gimbal lock. Therefore, in practice we use quaternions, which offer stability, efficiency, and compactness (see [12]). Each image sampled from the 3D model is thus defined by a 6D pose vector of the form (x, y, q1, q2, q3, q4), where q is a unit quaternion; this vector in fact represents 4 degrees of freedom.
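The conversion from a sampled (x, y, yaw, pitch) grid pose to the 6D training label can be sketched as below; the yaw-then-pitch rotation order and the axis conventions are our assumptions.

```python
import numpy as np

def pose_vector(x, y, yaw, pitch):
    """6D pose label (x, y, qw, qx, qy, qz) for a 4-DoF camera pose.
    Roll is fixed, so the quaternion composes yaw (about z) with pitch (about x)."""
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    # Hamilton product q_yaw * q_pitch, with roll fixed at zero
    q = np.array([cy * cp, cy * sp, sy * sp, sy * cp])
    return np.array([x, y, *q])
```

The resulting quaternion is always unit-norm, and the identity orientation maps to (1, 0, 0, 0).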

4 Network

We examined several convolutional neural network (CNN) architectures that proved to be successful on object recognition tasks. Specifically, we tested VGG, GoogLeNet [30], and ResNet50 [8], built for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [22, 4].

Our geo-matching task could have been defined as a classification task where each pose is considered a class. However, this would involve learning a huge number of classes. In addition, a classification setup loses the spatial relations among the images, because each class is considered completely unrelated to the others. This prevents the network from exploiting the geometric structure and information, and can preclude an answer to one of our main research questions: can a network use geometric information?

Thus, it is more suitable for our problem to use a CNN that solves a regression task. This also allows using the same trained CNN for the geo-interpolation task, by directly returning the pose of unseen images in the test set. Had a classification CNN been used, it would have required post-processing of the result by classic methods, such as averaging the k-nearest classification matches. Because the considered CNNs were designed for classification tasks, we follow [12] and modify them to solve a regression task by simply removing the last softmax layer and replacing it with a fully connected layer that outputs our 6D pose vector. Although position and orientation are usually considered as different tasks [12] that require some weighting factor during the learning process, we noticed that normalizing the positions with respect to the total area size eliminates the need for such weighting. Our loss function is the L2 distance for the position (x, y) and the L2 distance for the orientation quaternion q.
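A simplified NumPy sketch of such a loss follows, with the area normalization folded in. The exact formulation is elided in the text, so this follows the PoseNet-style L2 losses of [12]; the `area_size` value is an assumption.

```python
import numpy as np

def pose_loss(pred, target, area_size):
    """L2 position loss (positions normalized by the total area size) plus
    L2 orientation loss on the quaternion part of the 6D pose vector."""
    pos_loss = np.linalg.norm((pred[:2] - target[:2]) / area_size)
    q_pred = pred[2:] / np.linalg.norm(pred[2:])  # renormalize the regressed quaternion
    ori_loss = np.linalg.norm(q_pred - target[2:])
    return pos_loss + ori_loss
```

Because positions are divided by the area size, both terms live on comparable scales and no extra weighting factor is needed.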

In a set of preliminary experiments we found that ResNet50 had the combination of the smallest network size in terms of parameters and the best training and testing results. Therefore, we report our experiments using only the ResNet50 architecture. We decided not to use transfer learning from pre-trained weights, since the networks we tested were trained on ImageNet, which contains real photographs. Our assumption is that the pre-trained models are tuned for texture information that is not available in lean images. Hence, in our experiments, we trained the CNNs from scratch (note that we did use transfer learning within our setup; see Section 6.4).


5 Tasks & Hypothesis

We considered two localization tasks: retrieving the camera pose of an image from the training set (geo-matching), and recovering the camera pose of an image not in the training set (geo-interpolation).

Our goal was to answer the following questions: (i) Does geometry play a role when training the CNN for localization? (ii) Can a CNN be trained to solve these localization tasks from lean images?

5.1 Geo-Matching Task

Given an image from the training set, we tested whether the correct camera pose could be determined. In a sense, the network is trained to overfit. Success, however, would mean that the network managed to encode the entire set of images in some feature space, as well as to compute an efficient matching function between image features to find the right pose.

(A) Geometrically Correlated:

We examined whether a CNN can solve the geo-matching task using lean images. In this test, the camera pose for generating the image was used as ground truth for training. Hence, the pose of nearby images is correlated and the network has access to this geometric information.

(B) Geometrically Decorrelated:

An alternative interpretation of the geo-matching task is that the network solves a simple indexing task, where the image’s pose serves as a 4D label. Under this interpretation, the CNN does not use the available geometric information; hence, arbitrary labels for the images would work just as well as in task (A). To test this, we used arbitrary poses as ground truth for task (B): we randomly shuffled the pose information between images, so that poses were not spatially correlated with the images. If no geometry is used by the CNN, the results on this training data are expected to be similar to those obtained with the real poses as ground truth.
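The decorrelated ground truth for task (B) amounts to a random permutation of the pose labels across the images, e.g.:

```python
import numpy as np

def decorrelate_poses(poses, seed=0):
    """Shuffle pose labels across images so that the labels are no longer
    spatially correlated with the views (task B); the seed is arbitrary."""
    rng = np.random.default_rng(seed)
    return poses[rng.permutation(len(poses))]
```

The shuffled set contains exactly the same pose labels, only assigned to the wrong images.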


Since our network is a regression network, the computed pose does not necessarily exactly match a pose of an image from the training set (see Figure 4(a)). We used the grid sample nearest to the computed pose for pose retrieval. We report the percentage of images whose correct pose is the nearest neighbor (1nn), as well as the percentage of images whose correct pose is among the three nearest neighbors (3nn) of the computed pose. These evaluations were used for both geo-matching tasks (A) and (B). An additional advantage of this measure is that it is given in grid steps rather than in meter/angle units, circumventing the difficulty of comparing distances and angles and enabling a comparison of results across different grid densities (we do provide numerical errors in Table 2).
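The nearest-neighbor evaluation can be sketched in 2D as follows (the paper's measure is 4D; the candidate-enumeration details are our simplification).

```python
import numpy as np

def pose_is_knn_correct(pred, true_pose, step, k):
    """Return True if the true grid pose is among the k grid samples
    nearest to the regressed pose (2D illustration of the 4D measure)."""
    base = np.floor(pred / step) * step
    # enumerate the grid nodes surrounding the prediction
    cands = np.array([base + step * np.array([i, j])
                      for i in (-1, 0, 1, 2) for j in (-1, 0, 1, 2)])
    order = np.argsort(np.linalg.norm(cands - pred, axis=1))
    return any(np.allclose(cands[i], true_pose) for i in order[:k])
```

For instance, a regressed position one grid step away from the true sample can fail the 1nn test while still passing the 3nn test.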

(a) Nearest Neighbor
(b) Manhattan Distance
Figure 4: Illustration in 2D of the evaluation measures for geo-matching (a: nearest neighbor) and geo-interpolation (b: Manhattan distance). The real measures are 4D in nature.

5.2 Geo-Interpolation Task (C)

We tested whether the network can estimate the pose of an image that is not in the training set. To avoid over-fitting and to allow generalization, the network was trained until the best result was achieved on a validation set. We do not expect the network to return a correct position outside the learned area; thus, this task is viewed as an interpolation task from known samples on the grid to unknown positions. For this reason our test set is comprised of images sampled at the midpoints of the training grid. These are the images farthest from the training-set samples.


We considered the grid hyper-cube containing the computed pose. A computed pose is considered correct if it lies within the same grid hyper-cube as the test sample; we report the fraction of correctly computed poses (D < 1). In addition, we considered the Manhattan distance between the hyper-cubes of the computed pose and the test sample (see Figure 4(b)), and report the fraction of images for which this distance is smaller than 3 (D < 3). Note that these measurements are invariant to the sampling step size; thus, we are able to compare results of experiments sampled with different step sizes. For completeness, we also provide the standard errors in Table 2.
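The hyper-cube measure can be sketched as follows; the per-dimension grid steps used in the example are hypothetical.

```python
import numpy as np

def cube_index(pose, steps):
    """Integer index of the grid hyper-cube containing a 4D pose (x, y, yaw, pitch)."""
    return np.floor(np.asarray(pose) / np.asarray(steps)).astype(int)

def cube_manhattan(pred_pose, test_pose, steps):
    """Manhattan distance, in grid cells, between the hyper-cubes of two poses."""
    return int(np.abs(cube_index(pred_pose, steps) - cube_index(test_pose, steps)).sum())
```

A computed pose counts as correct (D < 1) when this distance is 0, and as a near miss (D < 3) when it is below 3; since the distance is measured in cells, it is invariant to the step sizes.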

| Dataset | Input type | (B) Arbitrary pose, 1nn | (A) Correct pose, 1nn | (A) 3nn | (C) 2D, D<1 | (C) 2D, D<3 | (C) 4D, D<1 | (C) 4D, D<3 |
|---|---|---|---|---|---|---|---|---|
| Area 400x400, step 20, 37K images | Edges | 0.45 | 0.97 | 0.99 | 0.64 | 0.82 | 0.58 | 0.75 |
| | Faces | 0.35 | 0.99 | 0.99 | 0.56 | 0.76 | 0.51 | 0.69 |
| | Depth | 0.23 | 0.99 | 0.99 | 0.61 | 0.79 | 0.55 | 0.72 |
| | Edges+Faces | 0.29 | 0.98 | 0.99 | 0.72 | 0.88 | 0.65 | 0.82 |
| | Edges+Faces+Depth | 0.24 | 0.98 | 0.99 | 0.71 | 0.88 | 0.64 | 0.81 |
| Area 400x400, step 10, 140K images | Edges | 0.11 | 0.98 | 0.98 | 0.85 | 0.94 | 0.84 | 0.93 |
| | Faces | 0.05 | 0.97 | 0.97 | 0.80 | 0.90 | 0.79 | 0.88 |
| | Depth | 0.06 | 0.97 | 0.97 | 0.83 | 0.92 | 0.82 | 0.91 |
| | Edges+Faces | 0.09 | 0.97 | 0.97 | 0.88 | 0.96 | 0.87 | 0.95 |
| | Edges+Faces+Depth | 0.08 | 0.94 | 0.95 | 0.88 | 0.96 | 0.87 | 0.95 |
| Area 800x800, step 20, 170K images | Edges | 0.06 | 0.96 | 0.96 | 0.62 | 0.78 | 0.59 | 0.75 |
| | Faces | 0.01 | 0.96 | 0.96 | 0.51 | 0.68 | 0.48 | 0.65 |
| | Depth | 0.01 | 0.96 | 0.97 | 0.61 | 0.77 | 0.59 | 0.73 |
| | Edges+Faces | 0.04 | 0.92 | 0.93 | 0.70 | 0.86 | 0.67 | 0.83 |
| | Edges+Faces+Depth | 0.03 | 0.95 | 0.96 | 0.70 | 0.85 | 0.67 | 0.81 |

Table 1: Results of our experiments: the fraction of images for which a correct estimation was obtained, out of the total number of valid images evaluated (higher is better). For geo-matching we use the nearest-neighbor measure (nn) and for geo-interpolation the Manhattan distance (D). The number of images is the average number of valid views in the training sets of three experiments on different AOIs. See the detailed discussion in the text.

6 Experiments & Results

We tested and evaluated the ResNet50 network for the three tasks described in Section 5. The datasets, which are described in Section 3, are defined by the following parameters:

  1. Area of interest (AOI): the size of the city region covered by the dataset.

  2. Grid-step, s: the distance between adjacent positions of the sampling grid; that is, adjacent to (x, y) are (x ± s, y) and (x, y ± s). The grid density in the orientation domain was fixed.

  3. Input type: edges, faces, depth, edges + faces, edges + faces + depth. For the last two input types the images were fed to the network by stacking them channel-wise.

  4. Validation set: created by randomly choosing 10% of the training samples.

  5. Test set: created from images sampled at the midpoints of the training grid.
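The channel-wise stacking of input types (item 3 above) can be sketched as:

```python
import numpy as np

def stack_inputs(edges, faces=None, depth=None):
    """Stack lean images channel-wise into a single HxWxC network input.
    Any subset of the three lean-image types may be combined."""
    channels = [np.asarray(c, dtype=np.float32)
                for c in (edges, faces, depth) if c is not None]
    return np.stack(channels, axis=-1)
```

The image resolution shown in the test below is an assumption; the code itself is resolution-agnostic.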

| Dataset | (A) position mean (m) | (A) position median | (A) orientation mean (°) | (A) orientation median | (C) position mean (m) | (C) position median | (C) orientation mean (°) | (C) orientation median |
|---|---|---|---|---|---|---|---|---|
| Area 400x400, step 20, 37K images | 3.65 | 3.26 | 0.84 | 0.69 | 26.30 | 11.26 | 10.95 | 1.84 |
| Area 400x400, step 10, 140K images | 2.37 | 2.10 | 0.57 | 0.48 | 7.99 | 3.67 | 2.65 | 0.67 |
| Area 800x800, step 20, 170K images | 5.43 | 4.71 | 0.67 | 0.54 | 40.23 | 12.28 | 9.80 | 1.40 |

Table 2: Examples of the errors for an experiment with Edges+Faces image types in each sub-space: spatial (x, y) errors in (approximate) meters, and orientation (θ, φ) errors in degrees. As in this example, the errors usually show a long-tail distribution: many images have small errors and a few have very large errors.

We used various step sizes for the camera position on the grid, given in model units (one unit is approximately one meter). The yaw angle (θ) and the pitch angle (φ) were each sampled at fixed angular steps within fixed ranges. The height was set to a fixed value above the ground (human height), and no roll was used.

We report the main results in Table 1 for tasks (A)-(C). Each block of rows corresponds to a different dataset, defined by the area size and the grid step. For each block we considered the different types of lean images and evaluated them on the three tasks described in Section 5. Each entry is an average over three different AOIs. For completeness, Table 2 shows an example of the mean and median errors of the pose estimation for the edges+faces experiment; similar results were obtained in other experiments. Tables 3 and 4 show the results of testing the limitations of the CNN with respect to the sparsity of the grid and the size of the datasets. We next discuss the obtained results.

6.1 Geo-Matching

Very poor results were obtained for the geo-matching task when arbitrary poses were used as ground truth (Table 1, Task (B)), i.e., when no geometric correlation between the images and their ground truth was available. The highest fraction of correct matches (0.45) was obtained for the smallest set of images considered (37K images). For the largest set (170K images), the fraction of correct matches was at most 0.06. As can be seen, the quality of the results decreases as the number of training samples increases. This is expected: for a fixed number of network parameters, memorization becomes impossible when more and more training samples are added. Note that we do not report the 3nn measure, since it is meaningless for this randomized-pose task.

| Dataset | (A) 1nn | (A) 3nn | (C) 2D, D<1 | (C) 2D, D<3 | (C) 4D, D<1 | (C) 4D, D<3 |
|---|---|---|---|---|---|---|
| Area 800x800, step 10, 636K images | 0.82 | 0.82 | 0.80 | 0.92 | 0.79 | 0.92 |
| Area 1600x1600, step 20, 666K images | 0.58 | 0.59 | 0.46 | 0.69 | 0.44 | 0.67 |

Table 3: Testing network learning capacity. These results are from a single experiment where the image input type is only edges. The network's ability to learn drops when the number of images grows beyond a certain point.
| Dataset | Input type | (A) 1nn | (A) 3nn | (C) 2D, D<1 | (C) 2D, D<3 | (C) 4D, D<1 | (C) 4D, D<3 |
|---|---|---|---|---|---|---|---|
| Area 800x800, step 40, 61K images | Edges | 0.90 | 0.96 | 0.39 | 0.62 | 0.30 | 0.49 |
| | Edges+Faces | 0.91 | 0.97 | 0.48 | 0.72 | 0.38 | 0.59 |
| | Edges+Faces+Depth | 0.96 | 0.98 | 0.48 | 0.72 | 0.38 | 0.58 |
| Area 1600x1600, step 40, 174K images | Edges | 0.89 | 0.90 | 0.22 | 0.39 | 0.19 | 0.32 |
| | Edges+Faces | 0.94 | 0.95 | 0.30 | 0.50 | 0.24 | 0.41 |
| | Edges+Faces+Depth | 0.94 | 0.96 | 0.32 | 0.51 | 0.26 | 0.43 |
| Area 800x800, step 40 / sparse ang., 2.5K images | Edges | 0.40 | 0.41 | Failed | Failed | Failed | Failed |
| | Edges+Faces | 0.37 | 0.38 | Failed | Failed | Failed | Failed |
| | Edges+Faces+Depth | 0.26 | 0.27 | Failed | Failed | Failed | Failed |
| Area 1600x1600, step 40 / sparse ang., 7K images | Edges | 0.16 | 0.18 | Failed | Failed | Failed | Failed |
| | Edges+Faces | 0.17 | 0.19 | Failed | Failed | Failed | Failed |
| | Edges+Faces+Depth | 0.13 | 0.14 | Failed | Failed | Failed | Failed |

Table 4: Low grid-density results. Datasets (a single experiment each) with sparser spatial sampling (top two blocks), and with sparser spatial and orientation sampling (bottom two blocks), where the pitch and yaw angles are sampled at larger steps. The sparser the data, the worse the results. Geo-interpolation could not succeed on very sparse and very small datasets.

In contrast, when the correct poses were used as ground truth (Table 1, Task (A)), the CNN succeeded in 1nn localization of more than 90% of the training samples in all cases. These results show that a CNN with around 8.5 million parameters is able to exploit the geometric structure of the scene and match an image against hundreds of thousands of images with high accuracy; that is, on average only about 50 parameters are used per image (8.5M parameters for 170K images). Our interpretation is that using a CNN makes it possible to avoid the direct storage of the images (or their edges) and their labels. Given the trained network, the matching (images are evaluated in a batch on an Nvidia GeForce GTX 1080 GPU) is much faster than with any traditional search algorithm on such a large dataset of images.

We believe the significant difference between the two geo-matching tasks (A) and (B) is due to the network exploiting the geometric correlations when learning a metric between images.

We also considered much larger datasets with more than 600K images (Table 3). The percentage of correct matches dropped to 82% for a dense grid (step 10) and to 58% for a sparser grid (step 20). For 636K and 666K images, the network capacity is probably saturated. A comparison of these results to those reported in Table 1 (Task (A)) for the same step values indicates that both the number of images and the grid step determine how successfully the CNN models the data.

In addition, we tested datasets with sparser sampling in the position domain (Table 4, top two blocks), and in both the position and the orientation domains (Table 4, bottom two blocks). With sparse sampling only in the position domain, the percentage of correct matches is reduced marginally. However, when the sampling in the orientation domain is also reduced, the percentage of correct matches drops dramatically. This indicates that it is easier for the CNN to model a denser grid (probably because of the higher geometric correlation between images), and to model fewer images (probably because of network capacity).

6.2 Geo-Interpolation

Here we tested whether the pose estimation by the CNN generalizes to unseen images. We used the same training as in geo-matching, with the correct poses as ground truth, and tested on images sampled from the midpoint of each grid cell. We report our results with respect to the 2D position as well as with respect to the 4D parameters of a pose (Table 1).

The network was able to generalize image position with good accuracy: on the dense grid, up to 88% of the images are correctly positioned in their grid cell, and up to 96% of the computed poses are within three cells of the correct one. As expected, this task achieves better results on a tighter grid (step 10) than on a sparse grid (step 20). The 4D position error is lower bounded by the 2D position error, and hence is greater. Moreover, the sampling rate in the orientation domain is much higher than in the position domain; hence, a small error in orientation estimation has a greater effect on the 4D errors. Still, the accuracy in 4D for the dense grid reaches 87%.

We further tested the effect of the grid density. It is clear from Table 1 (Task (C)) that the results for step 10 are better than for step 20, even when the number of images is larger. We further explored this observation with a sparser grid, step 40, where the fraction of correct estimations dropped significantly, to below 0.5 and 0.35 for 61K and 174K images, respectively (Table 4, Task (C)). For step 10 with 636K images, 80% of the estimations were correct (Table 3, Task (C)). For this task, sparse sampling is more critical than for the geo-matching task, as can be seen in Table 4. For very sparse sampling of the space, the network cannot really generalize to positions not seen before. Here again we believe that not only the number of images plays a role but also their density: the denser the grid, the higher the correlation between images, and hence the better the generalization that can be obtained.

A nice application of our results is the ability to rate the distinctiveness of positions in the city. Figure 1 (bottom) illustrates how certain places can be easily recognized (high geo-interpolation success rate) while others are more difficult. Note, for instance, how open spaces are more distinct than narrow streets.

6.3 Effect of Data Type

We compared the results on the several types of lean images separately and in combination. Faces alone provide the least geometric information, and indeed in most cases were inferior to edges or depth. Surprisingly, edges alone provide better information than depth alone.

When combining edges with faces or with faces+depth, we expected the results of all tasks to improve relative to those obtained with edges alone. For the geo-matching task (A) (Table 1), similar results were obtained for all data types. However, for a very sparse grid (Table 4, two upper blocks), richer geometric information improves the results. We believe this is because the results with edges alone were very high to begin with.

Figure 5: Transfer learning: learning from scratch vs. starting with pre-trained weights. These graphs are from one experiment where the input type was Edges+Faces, but similar behavior appeared in other experiments.

For the geo-interpolation task (C), adding the faces information significantly improved the results, as expected. Surprisingly, the depth information did not yield a significant performance gain for the denser grids. This may indicate that edges+faces provide sufficient information in these cases. However, for a very sparse grid with a relatively small number of images, adding the depth significantly improves the results (Table 4, 174K images).

For the data with geometrically decorrelated pose (Task B) and for the very sparse sampling (Table 4, bottom two blocks), the more information we added, the worse the results were. The reason for this is still unclear to us. A possible explanation is that as the problem becomes more of a memorization task, the added information makes it harder for the CNN to find discriminative features.

6.4 Transfer Learning

Once we had trained a CNN for some AOI, we applied transfer learning to a new AOI by using the learned weights as initialization values for the new area. As can be seen in Figure 5, doing so sped up learning. This indicates that the network managed to learn features of lean images that transfer to other, similar experiments, and that it does not rely solely on memorizing the area.
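The warm-start procedure can be sketched as follows. The architecture and names are hypothetical stand-ins (a minimal PyTorch example), not the model actually used in the paper.

```python
import torch
import torch.nn as nn

class PoseNetLean(nn.Module):
    """Toy pose-regression CNN for single-channel lean images."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose_head = nn.Linear(32, 4)  # e.g. x, y, sin(theta), cos(theta)

    def forward(self, x):
        return self.pose_head(self.features(x))

model_aoi1 = PoseNetLean()   # assume: trained on the first AOI
model_aoi2 = PoseNetLean()   # fresh model for the new AOI
# Transfer learning: initialize AOI 2 from AOI 1's learned weights
# instead of a random initialization, then fine-tune on AOI 2 data.
model_aoi2.load_state_dict(model_aoi1.state_dict())
```

Training on the new AOI then proceeds as usual; only the starting point changes, which is what yields the faster convergence observed in Figure 5.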

7 Conclusions

In this work we showed that (i) a CNN can achieve good results in geo-localization tasks using only lean images taken from a very simple 3D model, and (ii) geometry plays an important role in geo-localization, as good results are achieved even while texture and scene details are ignored. The results indicate that noise-free lean images are sufficient for solving the geo-matching task with a CNN, and that using geometrically decorrelated images makes the task nearly impossible. In addition, our results indicate that (iii) geo-interpolation, which is a generalization task, can also be solved by CNNs using lean images.

From a more practical perspective, it would be interesting to explore whether geometric information can be used for real-life geo-localization tasks, especially since 3D models, e.g., the OpenStreetMap project [32], are readily available.


References

  • [1] G. Baatz, O. Saurer, K. Köser, and M. Pollefeys. Large scale visual geo-localization of images in mountainous terrain. In European Conference on Computer Vision (ECCV), pages 517–530. Springer, 2012.
  • [2] M. Bansal and K. Daniilidis. Geometric urban geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3978–3985. IEEE, 2014.
  • [3] Berlin Partner für Wirtschaft und Technologie GmbH. Berlin 3d city model, 2016. https://www.businesslocationcenter.de/en/WA/B/seite0.jsp.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [5] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [6] R. M. Haralick, H. Joo, C. Lee, X. Zhuang, V. G. Vaidya, and M. B. Kim. Pose estimation from corresponding point data. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1426–1446, Nov 1989.
  • [7] J. Hays and A. A. Efros. im2gps: estimating geographic information from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [9] A. Irschara, C. Zach, J. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2599–2606, June 2009.
  • [10] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. In IEEE International Conference on Robotics and Automation (ICRA), pages 4762–4769. IEEE, 2016.
  • [11] A. Kendall, R. Cipolla, et al. Geometric loss functions for camera pose regression with deep learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 8, 2017.
  • [12] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In IEEE International Conference on Computer Vision (ICCV), pages 2938–2946. IEEE, 2015.
  • [13] H. Li. Consensus set maximization with guaranteed global optimality for robust geometry estimation. In IEEE International Conference on Computer Vision (ICCV), pages 1074–1080, Sept 2009.
  • [14] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In European Conference on Computer Vision (ECCV), pages 791–804. Springer, 2010.
  • [15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004.
  • [16] B. C. Matei, N. V. Valk, Z. Zhu, H. Cheng, and H. S. Sawhney. Image to lidar matching for geotagging in urban environments. In IEEE Workshop on Applications of Computer Vision (WACV), pages 413–420, Jan 2013.
  • [17] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Image-based localization using hourglass networks. arXiv preprint arXiv:1703.07971, 2017.
  • [18] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2161–2168, June 2006.
  • [19] N. Piasco, D. Sidibé, C. Demonceaux, and V. Gouet-Brunet. A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognition, 74:90–109, 2018.
  • [20] S. Ramalingam, S. Bouaziz, and P. Sturm. Pose estimation using both points and lines for geo-localization. In IEEE International Conference on Robotics and Automation (ICRA), pages 4716–4723. IEEE, 2011.
  • [21] D. P. Robertson and R. Cipolla. An image-based system for urban navigation. In British Machine Vision Conference (BMVC), volume 19, page 165, 2004.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [23] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In IEEE International Conference on Computer Vision (ICCV), pages 667–674, Nov 2011.
  • [24] T. Sattler, B. Leibe, and L. Kobbelt. Improving image-based localization by active correspondence search. In European Conference on Computer Vision (ECCV), pages 752–765. Springer, 2012.
  • [25] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7, June 2007.
  • [26] S. Se, D. Lowe, and J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. The International Journal of Robotics Research, 21(8):735–758, 2002.
  • [27] J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In IEEE International Conference on Computer Vision (ICCV), pages 1470–1477 vol. 2, Oct 2003.
  • [28] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson. City-scale localization for cameras with known vertical direction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(7):1455–1461, 2017.
  • [29] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate localization and pose estimation for large 3d models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 532–539, 2014.
  • [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [31] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using lstms for structured feature correlation. In IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
  • [32] OpenStreetMap Wiki. OSM-3D.org — OpenStreetMap wiki, 2018. [Online; accessed 1-November-2018].
  • [33] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • [34] W. Zhang and J. Kosecka. Image based localization in urban environments. In Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’06), pages 33–40, June 2006.