Semantic Signatures for Large-scale Visual Localization

by Li Weng, et al.

Visual localization is a useful alternative to standard localization techniques. It works by utilizing cameras. In a typical scenario, features are extracted from captured images and compared with geo-referenced databases. Location information is then inferred from the matching results. Conventional schemes mainly use low-level visual features. These approaches offer good accuracy but suffer from scalability issues. In order to assist localization in large urban areas, this work explores a different path by utilizing high-level semantic information. It is found that object information in a street view can facilitate localization. A novel descriptor scheme called "semantic signature" is proposed to summarize this information. A semantic signature consists of type and angle information of visible objects at a spatial location. Several metrics and protocols are proposed for signature comparison and retrieval. They illustrate different trade-offs between accuracy and complexity. Extensive simulation results confirm the potential of the proposed scheme in large-scale applications. This paper is an extended version of a conference paper in CBMI'18. A more efficient retrieval protocol is presented with additional experiment results.






I Introduction

Visual localization [1, 2] represents a range of applications where location information is derived from images. As an alternative to conventional positioning solutions, visual localization finds potential applications in automatic navigation [3] and location-related multimedia services [4, 5], such as landmark recognition and augmented reality (AR). For example, if a landmark recognition system is given the photos in Fig. 1, it might be able to return the landmark name, the city name, or the coordinates. In general, the problem of visual localization is to infer where an image was acquired by matching it with a geo-referenced database. It is typically modelled as an image feature retrieval scenario, and solved by exact or approximate nearest neighbour search. More specifically, features are extracted from a query image and compared with features in a database; the location is inferred from the best matches. Depending on the required accuracy, a visual localization algorithm is designed for one of the following tasks:

  • Place recognition (coarse localization);

  • Camera pose estimation (precise localization).

The former estimates the zone where the image was acquired, in the form of a spatial area, a semantic label, etc. [1]; the latter estimates the camera pose up to six degrees of freedom (6-DOF), including three parameters of translation (x, y, z) and three parameters of rotation (pitch, roll, yaw) [6, 7]. Conventional schemes typically accomplish these tasks using low-level hand-crafted visual features, such as bag of SIFT features [8], and more recently learned features [2]. Related research mainly focuses on accuracy and efficiency in challenging conditions (e.g. season change, night/day, long-term datasets). Various efforts have been devoted to database indexing and query strategies [9, 10, 11, 12, 1, 2]. They offer good accuracy but suffer from scalability issues due to large amounts of data.

In this work, a novel approach is pursued to complement conventional visual localization. Instead of low-level visual features, we exploit high-level semantic features, which are related to what we see in the environment. For example, in dense urban areas, one can typically see buildings, cars, trees, etc. It is found that such information can facilitate localization too. Compared with conventional visual features, semantic features have several advantages. First, they can be encoded in a compact way and require much less storage. Second, they can be efficiently obtained from geographic information systems, such as OpenStreetMap. Nevertheless, how to represent and utilize semantic features in localization remains an open problem. This paper summarizes our effort to address the encoding and comparison of such information.

In particular, we focus on “semantic objects” which are static and widely available in urban areas. We assume that such objects can be detected from street-view images by object detection algorithms such as [13, 14]. Once they are detected, localization can be achieved by finding locations with similar objects from a database. In order to represent the distribution of semantic objects at a location, we propose a descriptor called “semantic signature”. It is a compact string that consists of type and angle information of street objects. We then model our localization problem as string matching and solve it with a retrieval framework. Compared with conventional approaches, the proposed scheme has a few advantages, including small database size, large coverage area, and fast retrieval.

This paper is an extended version of [15], where the semantic signature is originally proposed to summarize semantic objects. The contribution of [15] also includes suitable metrics and a “metric fusion” protocol for signature matching and retrieval, and a simulation framework for performance evaluation. In this paper, a more efficient retrieval protocol called “2-stage metric fusion” is presented, as well as additional experiment results. Although there is still a gap from practical deployment, promising results from extensive simulation indicate the potential of our approach in large-scale applications.

The rest of the paper is organized as follows: Section II introduces the motivation of our approach and the role of semantic information in a big picture of visual localization; Section III is a brief literature review; Section IV describes the proposed method; Section V shows experiment results and analysis; Section VI concludes the work with a discussion.

Fig. 1: Example query images for visual localization.

II Background

Our work is motivated by the emergence of AR and “open” data. In AR applications, a user can interact with what he/she sees on a screen, which enables manual annotation of street-views. On the other hand, several national and international open data initiatives dedicated to the description of territories exist, leading to databases of semantic information at different scales (city-scale up to world-scale), for example OpenStreetMap and Mapillary. Additionally, thanks to the evolution of territorial policies, more and more national mapping agencies also make available different thematic layers of their maps, which contain abundant semantic information. All these databases are regularly updated and represent a rich information source that can be linked to multimedia data, but they are currently under-exploited for visual localization.

The proposed semantic signature can be used in two ways: 1) as an individual localization method, it achieves coarse localization; 2) as a complement to other localization methods, it can effectively reduce the search scope by filtering out irrelevant regions. In large-scale applications, coarse localization can be used as a preceding step before pose estimation. Given a query image, a sophisticated localization work flow might consist of the following steps (see Fig. 


  1. Perform feature detection;

  2. Narrow down the search scope using semantic features;

  3. Retrieve relevant images or low-level visual features;

  4. Perform place recognition or pose estimation.

This paper only covers the second step in the above pipeline, which focuses on the representation, indexing, and matching of semantic information. We foresee that real applications of urban localization using street-view images captured by mobile devices will emerge in the near future.

III Related work

Existing work on visual localization can be mainly divided into two categories: feature point retrieval and image retrieval. In the former approach, place recognition and camera pose estimation are solved by point-based voting and matching. For example, Schindler et al. propose a city-scale place recognition scheme [9]. They use a vocabulary tree [16] to index SIFT features with improved strategies for tree construction and traversal. Irschara et al. [17] also use a vocabulary tree for sparse place recognition using 3D-point clouds. They not only use real views, but also generate synthetic views to extend localization capability. Li et al. [10] address city-scale place recognition and focus on query efficiency. They prioritize certain database features according to a set covering criterion, and use a randomized neighborhood graph for data indexing and approximate nearest neighbor search [18]. Zamir and Shah [19] use Google street-view images for place recognition. They distinguish single image localization and image group localization, and derive corresponding voting and post-processing schemes for refined matching. Chen et al. [20] study the localization of mobile phone images using street-view databases. They propose to enhance the matching by aggregating the query results from two datasets with different viewing angles. Sattler et al. [11] propose to accelerate 2D-to-3D matching by associating 3D-points with visual words and prioritizing certain words. Zhang et al. [21] address performance degradation in large urban environments by dividing the search area into multiple overlapping cells. Relevant cells are identified according to coarse position estimates by e.g. A-GPS and query results are merged. Li et al. [12] consider worldwide image pose estimation. They propose a co-occurrence prior based RANSAC [22] and bidirectional matching to maintain efficiency and accuracy. Lim et al. [3] address real-time 6-DOF estimation in large scenes for auto-navigation. They use a dense local descriptor DAISY [23] instead of SIFT for fast key-point extraction, and a binary descriptor BRIEF [24] for key-point tracking.

The other category of localization techniques is based on image retrieval. Conventionally, this is only used for place recognition [5, 25]. For example, Zamir and Shah [26] propose multiple nearest neighbor feature matching with generalized graphs. Arandjelovic and Zisserman [27] propose an improved bag-of-features model. Torii et al. [28] apply the VLAD descriptor [29] to synthesized views. Arandjelovic et al. [30] extend VLAD with a deep neural network architecture to address scene appearance changes due to long-term acquisitions, day/night or seasonal changes. Iscan et al. [31] propose to aggregate descriptors from panoramic views. Since 3D-point datasets can be built from 2D images with structure-from-motion techniques [32], it is possible to directly estimate 6-DOF with an image database. Recently, Song et al. [33] propose to estimate 6-DOF after image retrieval. Sattler et al. [34] show experimentally that image retrieval approaches are perhaps more suitable for large-scale applications. With the success of deep learning, more approaches based on learned features are proposed to highlight effective visual features and exploit multiple modalities. A recent survey can be found in


Our approach is different from existing work, because we use semantic information instead of visual information. A relevant idea can be found in [35], where Ardeshir et al. use existing knowledge of objects to assist object detection. While they show the potential of semantic objects in localization, we perform a more extensive study in this paper. We also find that edit distance works better than their histogram-based metric. Another related scheme is [36], where Arth et al. use a different kind of semantic information. They perform re-localization by extracting straight line segments of buildings from a query image and comparing them with a database. While our work focuses on objects, it can also be extended to include other semantic features, such as building corners [36]. On the other hand, our approach can also be used as an initial step to narrow down the search scope for some existing work, such as [33, 37, 34].

IV The proposed scheme

The target application is localization in urban environments. In a typical scenario, a user has a mobile device that captures images of the surrounding area. The goal is to tell the user’s location according to these images. In a retrieval-based approach, it is tackled by extracting information from the images and comparing with a geo-referenced database. Figure 2 illustrates a complete application scenario, which is divided into coarse localization and refined localization. A critical question there is what kind of information to extract from images. In this work, the focus is on semantic information, which corresponds to the upper path in Fig. 2. Semantic information is high-level information based on human perception. In our context, it is about what people see from images. For example, people can tell their location by describing their surroundings. The same principle can be applied to localization. Since the images taken by the mobile device are typically street views, the semantic information contains objects such as buildings, streets, the sky, the ground, cars, humans, etc. It is found that some of these objects are useful for localization. In general, semantic objects with the following properties are of particular interest:

  • Permanent – the object does not move;

  • Informative – the object is distinguishable from others;

  • Widely available – the object is distributed in the scene.

Additionally, the objects should have unambiguous locations and be suitable for object detection algorithms. In this paper, we assume that detecting such objects is feasible and focus on retrieval aspects.

Fig. 2: A complete application scenario of visual localization. This paper only focuses on the upper path.

IV-A Semantic signatures

Once semantic objects have been detected, they are encoded into a compact representation, which we call semantic signature. A semantic signature describes some properties of the corresponding objects. It is required to be compact and easily indexable. In this work, we propose to compose a semantic signature by:

  • Object type – the category (class) of an object;

  • Object angle – the relative angle of an object.

Specifically, the object type is a class label, and the object angle is the bearing of the object as seen from a view point, measured relative to the north. Given a view point coordinate and a visibility range, each location can be associated with a semantic signature, which is related to the semantic objects that can be seen from that location. In our implementation, semantic objects are identified by a panoramic sweep in a clockwise order starting from the north. A semantic signature is the concatenation of two parts: the type sequence of the visible objects and the corresponding angle sequence, each with one entry per semantic object visible within the visibility range. Figure 3 illustrates the generation of semantic signatures. Some examples of semantic objects and their distribution are shown in Fig. 4 and Fig. 5 (see Table I for a complete list). Ideally, each signature is unique, so that localization can be achieved by matching a query signature with a signature database. A database of semantic signatures can be built from existing data sources, such as geographic information systems.

Fig. 3: The generation of semantic signatures.
alignment tree (B) autolib station (J) bike station (H)
traffic light (G) bus stop (M) automatic WC (I)
Fig. 4: Some examples of semantic objects (see Table I).

In addition, it is required by one of our signature comparison metrics that the north is known when generating a signature. This is not unrealistic, because nowadays mobile devices are typically equipped with a compass.

Fig. 5: Distribution of semantic objects on Paris streets.

The centroid of an object is used for representing its location. In order to have a stable angle sequence, it is necessary to quantize angle values. We use 4-bit quantization, i.e., each angle value is quantized into 16 levels (22.5° per step).
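The signature generation described in this subsection can be sketched as follows. This is only an illustrative sketch under stated assumptions: object positions are taken to be centroids in a local metric frame whose +y axis points north, and the function names (`bearing_deg`, `quantize_angle`, `make_signature`) as well as the hexadecimal encoding of angle bins are hypothetical choices, not part of the original work.

```python
import math

def bearing_deg(viewpoint, obj_xy):
    """Clockwise angle from north (+y axis, an assumed convention) in [0, 360)."""
    dx = obj_xy[0] - viewpoint[0]
    dy = obj_xy[1] - viewpoint[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def quantize_angle(angle_deg, levels=16):
    """Map an angle to one of `levels` bins (22.5 degrees per bin for 16 levels)."""
    step = 360.0 / levels
    return int(angle_deg // step) % levels

def make_signature(viewpoint, objects, visibility=30.0, levels=16):
    """Build (type_string, angle_string) for one view point.

    objects: iterable of (symbol, (x, y)) pairs, with (x, y) the object centroid
    in meters. Objects farther than `visibility` meters are ignored.
    """
    visible = []
    for symbol, xy in objects:
        if math.dist(viewpoint, xy) <= visibility:
            visible.append((bearing_deg(viewpoint, xy), symbol))
    visible.sort()  # panoramic sweep: clockwise, starting from north
    types = "".join(symbol for _, symbol in visible)
    # One hexadecimal digit per angle; this encoding only works for levels <= 16.
    angles = "".join(format(quantize_angle(a, levels), "x") for a, _ in visible)
    return types, angles
```

For instance, `make_signature((0.0, 0.0), [("D", (2.0, 10.0)), ("B", (10.0, -2.0))])` yields a two-symbol type string together with the two matching angle bins.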

IV-B Signature comparison

Given two semantic signatures, an important question is how to compare them. Since localization is achieved by signature search and retrieval, a similarity metric is needed. Since a signature has two parts, for simplicity it is preferable to use a metric that is compatible with both parts. This is possible if the two parts are considered as two general sequences. In this work, we use the following metrics:

  • Jaccard distance;

  • Histogram distance;

  • Edit distance.

Consider two ordered sequences of symbols. The Jaccard distance [38] is defined as one minus the ratio between the number of distinct symbols shared by the two sequences and the number of distinct symbols found in either of them.

The histogram distance is defined by comparing, for each object class, the number of occurrences of that class in the two sequences. This metric was used in [35], so it is a good candidate for performance comparison. The edit distance (a.k.a. the Levenshtein distance) [39] is defined by a recurrence over the prefixes of the two sequences, with separate weight factors for deletion, insertion and substitution. This metric requires that the north is known when generating signatures.
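For reference, the three metrics are commonly written as below. The notation is an assumption introduced here and may differ from the authors' formulation: $X = x_1 \dots x_m$ and $Y = y_1 \dots y_n$ are the two sequences, $S_X$ and $S_Y$ their sets of distinct symbols, $h_X(c)$ the number of occurrences of class $c$ in $X$, and $w_d$, $w_i$, $w_s$ the deletion, insertion and substitution weights. The unnormalized absolute-difference form of the histogram distance is likewise an assumed, common choice.

```latex
% Jaccard distance on the sets of distinct symbols
d_J(X, Y) = 1 - \frac{|S_X \cap S_Y|}{|S_X \cup S_Y|}

% Histogram distance (assumed L1 form), summed over object classes c
d_H(X, Y) = \sum_{c} \left| h_X(c) - h_Y(c) \right|

% Weighted edit (Levenshtein) distance via the standard recurrence
D(i, 0) = i \, w_d, \qquad D(0, j) = j \, w_i,
D(i, j) = \min \begin{cases}
  D(i-1, j) + w_d, \\
  D(i, j-1) + w_i, \\
  D(i-1, j-1) + w_s \, [x_i \neq y_j],
\end{cases}
\qquad d_E(X, Y) = D(m, n)
```

Here $[x_i \neq y_j]$ equals 1 when the two symbols differ and 0 otherwise.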

Given two sequences of symbols, these metrics compare the value or the order of the symbols, but they exhibit different levels of “strictness”. The Jaccard distance only considers the occurrence and completely ignores the order and the frequency; the histogram distance also ignores the order but counts the frequency of symbols; the edit distance takes into account both the order and the frequency. By selecting different metrics, different trade-offs between robustness and discrimination power can be achieved. A coarse metric is useful for rough and quick matching, while a fine-grained metric is useful for refined matching. On the other hand, the computation cost is also different. The more complex the metric, the more computation.
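The different levels of strictness can be seen in a minimal Python sketch of the three metrics. The function names, the unit default weights, and the unnormalized form of the histogram distance are assumptions made for illustration; they are not taken from the original implementation.

```python
from collections import Counter

def jaccard_distance(x, y):
    """Compares only which symbols occur; ignores order and frequency."""
    sx, sy = set(x), set(y)
    if not sx and not sy:
        return 0.0
    return 1.0 - len(sx & sy) / len(sx | sy)

def histogram_distance(x, y):
    """Compares symbol counts; ignores order (assumed L1 form)."""
    hx, hy = Counter(x), Counter(y)
    return sum(abs(hx[c] - hy[c]) for c in set(hx) | set(hy))

def edit_distance(x, y, w_del=1.0, w_ins=1.0, w_sub=1.0):
    """Weighted Levenshtein distance; sensitive to order and frequency."""
    prev = [j * w_ins for j in range(len(y) + 1)]
    for i, cx in enumerate(x, start=1):
        cur = [i * w_del]
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + w_del,                           # delete cx
                cur[j - 1] + w_ins,                        # insert cy
                prev[j - 1] + (0.0 if cx == cy else w_sub) # match or substitute
            ))
        prev = cur
    return prev[-1]
```

With the type sequences "BDDG" and "DBDG", the Jaccard and histogram distances are both zero because the symbol sets and counts agree, while the edit distance is non-zero because the order differs.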

IV-C Retrieval schemes

The localization problem is solved by a retrieval-based framework. The procedure starts with a panoramic query image captured by a mobile device. Then the following basic steps apply:

  1. A query signature is computed from the query image;

  2. Similar signatures are retrieved from a database according to the query signature;

  3. The best matches are returned.

After the best matches are identified, post-processing schemes may follow depending on the specific application. In this paper, the focus is to find the best matches in an accurate and efficient way. Since a semantic signature has two parts – type and angle, an essential question is how each part contributes to the similarity between two signatures. In order to accommodate different occasions where one may choose to favour accuracy or efficiency, we propose two retrieval methods: “metric fusion” and “two-stage metric fusion”. They are specially designed for our scenario, but have some resemblance to the concepts of early fusion and late fusion in content classification [40] and retrieval [41].

IV-C1 Metric fusion

In this method (see Fig. 6a), a similarity score is first computed from each part of the signature. Then a weighted sum of the two scores is computed. Signatures are ranked according to the total score. Given two signatures, the distance between them is defined as the weighted sum of a type distance and an angle distance: each part is compared with its chosen similarity metric and scaled by its own weight factor. By maintaining sufficiently large weight factors, both type and angle information is aggregated. When either weight factor is zero, the scheme reduces to single-metric ranking.
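The weighted-sum ranking can be sketched as below. This is a hedged illustration only: signatures are assumed to be `(type_string, angle_string)` pairs, the database is assumed to be a list of `(location, signature)` records, and the names `fused_distance`, `rank_by_fusion`, `w_type` and `w_angle` are hypothetical. The small `jaccard` helper is included only so that the example is self-contained; any of the metrics could be passed in.

```python
def jaccard(x, y):
    """Stand-in metric for the example (distance on sets of distinct symbols)."""
    sx, sy = set(x), set(y)
    return 1.0 - len(sx & sy) / len(sx | sy) if (sx or sy) else 0.0

def fused_distance(query, record, type_metric, angle_metric,
                   w_type=0.5, w_angle=0.5):
    """Weighted sum of the type distance and the angle distance."""
    q_types, q_angles = query
    r_types, r_angles = record
    return (w_type * type_metric(q_types, r_types)
            + w_angle * angle_metric(q_angles, r_angles))

def rank_by_fusion(query, database, type_metric, angle_metric,
                   w_type=0.5, w_angle=0.5):
    """Linear scan: score every record, then sort by ascending fused distance."""
    scored = [
        (fused_distance(query, sig, type_metric, angle_metric, w_type, w_angle), loc)
        for loc, sig in database
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored
```

Setting `w_angle=0` (or `w_type=0`) gives the single-metric ranking mentioned above.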

Fig. 6: Metric fusion (a) and two-stage metric fusion (b).

IV-C2 Two-stage metric fusion

A drawback of the previous scheme is heavy computation: two metrics are computed for each pair of signatures. Although this approach might give the most accurate ranking, in practice only the top ranks are useful. Therefore, it is possible to improve the speed. Intuitively, if two signatures do not match, then any metric is likely to result in a low rank. An improved scheme works as follows (see Fig. 6b):

  1. A similarity score is first computed from one part of the signature;

  2. Top candidate locations are retrieved;

  3. For the retrieved candidates, another score is computed from the other part of the signature, and added to the previous score as in Eqn. (4);

  4. The top candidates are re-ranked according to the new score.

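By adjusting the number of candidates retrieved in the first stage, different trade-offs between accuracy and efficiency can be achieved. When every database record is retained as a candidate, the scheme becomes the standard metric fusion.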
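The four steps above can be sketched as follows, again as an assumption-laden illustration rather than the authors' implementation. Signatures are assumed to be `(type_string, angle_string)` pairs, `n_candidates` is a hypothetical name for the size of the first-stage shortlist, and `first_part` selects which part (0 for type, 1 for angle) is scored first. The `jaccard` helper is a stand-in metric.

```python
import heapq

def jaccard(x, y):
    """Stand-in metric for the example (distance on sets of distinct symbols)."""
    sx, sy = set(x), set(y)
    return 1.0 - len(sx & sy) / len(sx | sy) if (sx or sy) else 0.0

def two_stage_fusion(query, database, first_metric, second_metric,
                     n_candidates, w_first=0.5, w_second=0.5, first_part=0):
    """Score one signature part for all records, then re-rank a shortlist."""
    second_part = 1 - first_part
    # Stage 1: cheap pass over the whole database using a single part.
    stage1 = [
        (w_first * first_metric(query[first_part], sig[first_part]), idx)
        for idx, (_, sig) in enumerate(database)
    ]
    shortlist = heapq.nsmallest(n_candidates, stage1)
    # Stage 2: add the second score only for the shortlisted candidates.
    reranked = []
    for score, idx in shortlist:
        loc, sig = database[idx]
        total = score + w_second * second_metric(query[second_part], sig[second_part])
        reranked.append((total, loc))
    reranked.sort(key=lambda pair: pair[0])
    return reranked
```

With `n_candidates=len(database)` the second metric is evaluated for every record, which corresponds to the plain metric fusion case; smaller values skip the second metric for most of the database.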

IV-D Prerequisite and post-processing

The proposed scheme utilizes two properties of semantic objects – type and angle. In general, an object recognition algorithm is needed to provide such information. State-of-the-art candidate algorithms are typically based on region proposals and convolutional neural networks, such as [42, 43, 44]. In case an object recognition algorithm is not available, the type information can also be provided by a human user (because the semantic objects are easy to recognize) as a query, which is an alternative way to use the proposed scheme. Experiment results later show that even if angle information is missing, type information can individually facilitate localization, and vice versa.

The general goal of the proposed scheme is to provide a list of potential locations according to a query signature. How to derive a final answer from the candidate locations is the task of post-processing. This is not the focus of the paper, but we briefly discuss some particular procedures in the following.

If no further processing is desired, the most straightforward way is to take the best match as the answer, i.e., to return only the top-ranked candidate. When more than one candidate is returned, some analysis can be performed with the candidate locations. For example, it is possible to narrow down the search range by obtaining some prior information about the “popularity” of locations – some locations are more likely to be visited than others. If extra information is available, such as street-view images or 3D models at the candidate locations, then one may perform 2D-to-2D [33, 37] or 2D-to-3D [11] matching using the query image. However, since these operations are expensive in computation and data storage, it is desirable to restrict them to a small scale. Therefore, it is important that the proposed scheme returns “good” candidates in a short list. This is confirmed by the experiment results.

V Experiments

The proposed scheme has been extensively evaluated with a city-scale dataset. The dataset, the evaluation framework, and the results are presented in this section.

V-A The dataset

Our dataset covers Paris. It consists of approximately 312,000 semantic signatures (see Table IIa) that cover most of the city. These signatures are built from 11 categories of objects, as listed in Table I.

ID Name Number Symbol
1 Alignment tree 1752696 B
2 Water fountain 6713 C
3 Street light 2299639 D
4 Indicator 36333 E
5 Traffic light 102240 G
6 Bike station 14397 H
7 Automatic WC 8006 I
8 Autolib (car) station 4421 J
9 Taxi station 2537 K
10 Public chair 135748 L
11 Bus stop 32320 M
TABLE I: Semantic objects.

These objects are found from Open Data Paris, which hosts a collection of public datasets provided by the city of Paris and its partners, with known coordinates. The signature database is constructed by sampling the Paris region with a fixed step. At each sampling point (cell), a semantic signature is created to summarize objects within 30 meters, i.e., the visibility range is set to 30 meters. Some basic properties of the database are listed in Table IIa. Each database record contains a location (represented by a cell) and its signature. If database records are grouped by signature using only type information, then the number of groups is approximately 45% of the number of signatures (see Table IIb), i.e., on average less than three cells have the same signature. Ideally, each signature group would contain only one cell. Some more statistics about the signature groups are listed in Table IIb. Indeed, most signature groups (at least 75%, according to the quartiles in Table IIb) have only one cell. This is crucial to effective localization. Note that there are also rare cases when it is almost impossible to find the correct location. For example, a large number of cells share the type signature “DDD”, which means three street lights. This implies that our proposed solution works in a probabilistic sense. In general, whether a location query will be successful depends on the entropy of its signature. Table IIa gives the average length of a signature. If a signature is longer than the average and contains multiple object types, it is likely to be effective, and vice versa. Some examples of successful and unsuccessful query locations are shown in Figs. 7 and 8. Nevertheless, the localization power can be improved when type and angle information is combined. The last column of Table IIb shows that angle information is even more discriminative than type information. It is also worth noting that the overall file storage only takes 38.7 MB (without optimization) to cover a large area. This is an extremely small cost for city-scale localization. Conventional low-level feature based approaches, e.g. [12, 37], require at least several GBs even for a small scene.
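The group statistics discussed above can be computed with a short sketch such as the following. It assumes the database is available as a plain list of signature strings (one per cell); the function name `group_statistics` and the returned field names are hypothetical.

```python
from collections import Counter

def group_statistics(signatures):
    """Group cells by identical signature and summarize the group sizes."""
    groups = Counter(signatures)           # signature -> number of cells
    sizes = list(groups.values())
    return {
        "records": len(signatures),
        "groups": len(sizes),
        "group_ratio": len(sizes) / len(signatures),
        "mean_size": len(signatures) / len(sizes),
        "singleton_fraction": sum(1 for s in sizes if s == 1) / len(sizes),
        "max_size": max(sizes),
    }
```

Applied to the counts reported in Table II, the same ratios give 140296 / 312134 ≈ 0.45 groups per signature and 312134 / 140296 ≈ 2.2 cells per group for type information, consistent with the listed mean group size.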

(a) Basic properties
Property Value
Visibility range 30 meters
No. of signatures 312134
Mean signature length 14 objects
Covered area 79 km²
Data storage 38.7 MB

(b) Signature group size
Statistic By type By angle
count 140296 204891
mean 2.2 1.5
std 121.5 11.1
min 1 1
25% 1 1
50% 1 1
75% 1 1
max 29958 1240
TABLE II: Database properties.

For large-scale retrieval applications, it is necessary to consider database indexing schemes for efficiency. Nevertheless, for the demonstration in this work, it suffices to use a linear scan scheme for signature retrieval, thanks to the compactness of signatures. Since a signature is encoded by symbols of small alphabets, more efficient indexing is possible if necessary. For example, signatures can be clustered and indexed according to certain patterns. A natural way to generate a pattern is to gather distinct symbols in a signature and sort them, which is actually the representation used by Jaccard distance.
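The pattern-based indexing idea can be sketched as below. This is one possible reading of the suggestion, not a description of an existing implementation: the database is assumed to be a list of `(location, type_string)` records, and the names `pattern`, `build_pattern_index`, `candidates` and the `max_jaccard` threshold are hypothetical.

```python
from collections import defaultdict

def pattern(type_string):
    """Distinct symbols of a signature, sorted (the set used by Jaccard distance)."""
    return "".join(sorted(set(type_string)))

def build_pattern_index(database):
    """Map each pattern to the list of locations whose signature has that pattern."""
    index = defaultdict(list)
    for location, type_string in database:
        index[pattern(type_string)].append(location)
    return index

def candidates(index, query_types, max_jaccard=0.5):
    """Return locations whose pattern is within `max_jaccard` of the query pattern."""
    q = set(query_types)
    selected = []
    for pat, locations in index.items():
        p = set(pat)
        union = q | p
        distance = 1.0 - len(q & p) / len(union) if union else 0.0
        if distance <= max_jaccard:
            selected.extend(locations)
    return selected
```

Because the number of distinct patterns is bounded by the small alphabet, scanning the patterns is much cheaper than scanning every record; a finer metric can then be applied to the selected locations only.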

(a) 2.2961483 48.8372312 (b) 2.3181783 48.8299839
(c) 2.3714248 48.8426869 (d) 2.3879806 48.8832431
Fig. 7: Some examples of successful query locations (longitude/latitude). Their signatures contain mostly alignment trees and street lights.
(a) 2.3639067 48.8223214 (b) 2.2998637 48.8547905
(c) 2.2991023 48.8607219 (d) 2.2893088 48.869208
Fig. 8: Some examples of unsuccessful query locations (longitude/latitude). Their signatures only contain a few street lights.

V-B The evaluation framework

The signature database can be matched with other signatures obtained by various means. In order to evaluate the retrieval aspects of the proposed scheme, we skip object detection. A query set is formed by randomly selecting ten thousand locations and the associated signatures from the database. Each signature in the query set is used for querying the database. The average performance for all queries is noted. We mainly consider two benchmarks:

  • Cumulative distribution of distance errors;

  • Recall rate of correct locations.

The first benchmark measures the average distance from the ground truth location to a candidate location. The best results among top candidates is noted. In our experiments, we set . The second benchmark examines the rank of the ground truth location among all candidates, emphasizing the capability as a filtering tool. It can be considered as a special retrieval scenario with only one relevant answer per query. They will be explained with more details later.
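As a rough illustration of the two benchmarks, the following sketch computes an empirical cumulative distribution of distance errors and a recall rate at a given fraction of the database. The function names and the convention that ranks are 1-based are assumptions made for this example.

```python
import math

def best_error_in_top_k(ranked_locations, truth, k, dist=math.dist):
    """Smallest distance between the ground truth and any of the top-k candidates."""
    return min(dist(loc, truth) for loc in ranked_locations[:k])

def error_cdf(errors, thresholds):
    """Fraction of queries whose distance error does not exceed each threshold."""
    n = len(errors)
    return [sum(1 for e in errors if e <= t) / n for t in thresholds]

def recall_at(ranks, db_size, fraction):
    """Fraction of queries whose ground-truth rank (1-based) falls within
    the top `fraction` of the database."""
    cutoff = fraction * db_size
    return sum(1 for r in ranks if r <= cutoff) / len(ranks)
```

For example, `recall_at(ranks, db_size, 0.1)` corresponds to the recall when only the top 10% of database candidates are kept.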

V-B1 Distortion simulation

In practice, object detection is not perfect. Using the query set directly does not reveal the performance in reality. Therefore, we propose to simulate errors in object detection. The simulated operations are listed in Table III.

ID Distortion type Comment
1 Miss detection Remove objects
2 False detection Introduce new objects
3 False classification Change object type
4 Angle noise Add noise to each angle
TABLE III: Simulated signature distortion.

In a more complete setting, each query item is first randomly distorted before matching with the database. We consider three levels of distortion – light, medium, and strong, corresponding to 1, 7, or 13 occurrences of random distortion, including miss detection, false detection, and false classification. Each occurrence distorts a bounded number of objects in a signature. In addition to type distortion, angle noise is always applied following a normal distribution with a fixed standard deviation and a clipped maximum value. The distortion parameters are set empirically. They serve as guidelines if upstream visual processing tools are to be designed for our application.

In reality, the queries might be chosen off-grid, which corresponds to distance and angle changes from the nearest sampling points. These effects are also simulated by the distortions to some extent.
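The distortion procedure of Table III can be approximated by the sketch below. Several details are assumptions made here for illustration: each occurrence affects a single object, angle noise is expressed in quantization bins, and the parameters `sigma_bins` and `clip_bins` are placeholder values rather than the settings used in the experiments.

```python
import bisect
import random

ALPHABET = "BCDEGHIJKLM"  # object symbols listed in Table I

def distort(types, angle_bins, occurrences, rng,
            levels=16, sigma_bins=1.0, clip_bins=2):
    """Apply `occurrences` random type distortions, then angle noise.

    types: string of object symbols; angle_bins: list of quantized angles
    (same length, non-decreasing). Returns (new_types, new_angle_bins).
    """
    types, bins = list(types), list(angle_bins)
    for _ in range(occurrences):
        op = rng.choice(("miss", "false", "misclass"))
        if op == "miss" and types:                 # miss detection: remove an object
            i = rng.randrange(len(types))
            del types[i], bins[i]
        elif op == "false":                        # false detection: add an object
            new_bin = rng.randrange(levels)
            i = bisect.bisect_right(bins, new_bin) # keep the clockwise order
            types.insert(i, rng.choice(ALPHABET))
            bins.insert(i, new_bin)
        elif op == "misclass" and types:           # false classification: change type
            i = rng.randrange(len(types))
            types[i] = rng.choice([s for s in ALPHABET if s != types[i]])
    noisy = []
    for b in bins:                                 # angle noise on every object
        noise = round(rng.gauss(0.0, sigma_bins))
        noise = max(-clip_bins, min(clip_bins, noise))
        noisy.append((b + noise) % levels)
    return "".join(types), noisy
```

Calling `distort(types, angle_bins, 7, random.Random(0))` would correspond to the medium distortion level under these assumptions; a seeded generator keeps the simulation reproducible.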

V-C Experiment results

In this section, we evaluate the proposed scheme in terms of the two benchmarks defined in Sect. V-B. According to the average signature length (14 objects), we mainly consider performance under medium level distortion. The two signature parts, type and angle, are separately tested first, followed by metric fusion and two-stage metric fusion schemes. In addition, the computation complexity is also measured. After extensive tests, practical configurations are identified and further examined with various system parameters.

V-C1 Localization performance

We first examine the effectiveness of the signature scheme. Figure 9 shows the ideal localization error when only one part of a signature is used without distortion. There are six curves corresponding to the three metrics and the two signature components. Each point on a curve gives the fraction of queries whose distance error is not larger than the corresponding value. In general the cumulative probability increases with the localization error. A higher curve means better performance. We observe that both type and angle information can be used for localization, but the smaller error shows that angle generally works better. Among the three metrics, edit distance is the best, followed by histogram distance and Jaccard distance. In the best case, i.e. perfect object detection, a large fraction of queries result in the correct location or have errors less than 10 meters, which is close to GPS accuracy.

Fig. 9: Localization error (single metric, no distortion).

When there is distortion, a similar trend can be observed in Fig. 10. The localization error increases with the distortion, and the maximum percentage of queries with no error drops. But edit distance still performs the best.

Fig. 10: Localization error (single metric, medium distortion).

Next, we consider another benchmark – the recall rate. In Fig. 11, a point means for queries, the corresponding ground truth rank is not lower than . Ideally, we expect the recall to be as high as possible. The results show that given a query, our proposed method can effectively filter out irrelevant regions. For example, almost in all cases (except for Jaccard distance with type information), all queries have recall@10%=. That means only the top database candidates need to be considered. In Fig. 12, the recall is plotted for medium distortion. Although there is some performance degradation, the good settings can still keep the ground truth rank within top for of queries. It is also noted that metrics with worse performance for higher ranks sometimes give better recalls for lower ranks. For example, the histogram distance with angle information gives higher recalls when ranks lower than 10% are considered.
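As an illustration of the recall benchmark, the sketch below computes the proportion of queries whose ground truth location is ranked within a given top share of the database. It assumes 1-based ranks and a cutoff expressed as a percentage of the database size; the name `recall_at` and the sample ranks are hypothetical.

```python
def recall_at(ranks, db_size, top_percent):
    """Fraction of queries whose ground-truth location is ranked within
    the top `top_percent` percent of the database (ranks are 1-based)."""
    cutoff = db_size * top_percent / 100
    hits = sum(1 for rank in ranks if rank <= cutoff)
    return hits / len(ranks)


# Hypothetical ground-truth ranks for six queries against 1000 locations.
ranks = [1, 3, 250, 12, 980, 47]
recall_10 = recall_at(ranks, db_size=1000, top_percent=10)
```

Here four of the six queries have their ground truth within the top 100 candidates, so only those candidates would need to be passed to a later, more expensive matching step.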

Fig. 11: Location recall (single metric, no distortion).
Fig. 12: Location recall (single metric, medium distortion).

Figure 13 shows the distribution of localization errors for metric fusion. Since edit distance performs best as a single metric, we fix it for type information and try different metrics for angle information. For example, the legend “edit + jaccard (0.5, 0.5)” means that edit distance is used for type and Jaccard distance is used for angle, with weight factors 0.5 and 0.5. The figure confirms that combining type and angle information indeed improves performance. The initial probability increases from to more than . All combinations seem to perform equally well. When there is distortion (see Fig. 14), it is more obvious that “edit + edit” is the best combination. On the other hand, note that using edit distance with angle information alone even outperforms the other combinations. That means, although angle information is harder to measure, it has stronger discrimination power than type information, which is consistent with the statistics in Table IIb. It is also an indication that spatial distribution is useful in localization if properly utilized. The corresponding recall is shown in Fig. 15. Compared with Fig. 12, the advantage of metric fusion is clear for higher ranks, where the curves are relatively close; for lower ranks, “edit + edit” continues to outperform “edit (angle)”, but “edit + hist” performs worse than “hist (angle)” as a trade-off for a slight improvement at higher ranks. We conclude that metric fusion is beneficial in general.

Another important question is what weight factors are the best for metric fusion. In Fig. 16, various weights are tested. Even weights or slightly larger weights for the angle turn out to be good choices, because settings (0.3, 0.7) and (0.5, 0.5) perform the best. Since type and angle are two independent information sources and the angle performs better when used alone, a slightly higher weight for the angle is reasonable. On the other hand, biasing too much, especially towards the type, such as (0.9, 0.1), decreases the performance. In the following, we keep using even weights (0.5, 0.5).
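To make the role of the weight factors concrete, the following sketch fuses a type distance and an angle distance for the “edit + edit” setting. It assumes that fusion is a weighted sum of edit distances normalized by the longer sequence length, and that angles are already quantized to integers; the normalization, the dictionary layout, and all names are illustrative assumptions rather than the exact definitions of the scheme.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def fused_distance(query, candidate, w_type=0.5, w_angle=0.5):
    """Weighted sum of the type distance and the angle distance."""
    d_type = normalized_edit_distance(query["types"], candidate["types"])
    d_angle = normalized_edit_distance(query["angles"], candidate["angles"])
    return w_type * d_type + w_angle * d_angle


# Hypothetical signatures: object types and quantized angles.
query = {"types": ["tree", "light", "tree", "stop"], "angles": [0, 3, 5, 9]}
cand_a = {"types": ["tree", "light", "tree", "stop"], "angles": [0, 3, 6, 9]}
cand_b = {"types": ["light", "tree", "stop"], "angles": [3, 5, 9]}
score_a = fused_distance(query, cand_a)
score_b = fused_distance(query, cand_b)
```

With even weights, `cand_a` (one angle off) scores lower, i.e. closer, than `cand_b` (one object missing). Passing `w_type=0.3, w_angle=0.7` reproduces the other well-performing setting discussed above.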

Fig. 13: Localization error (metric fusion, no distortion).
Fig. 14: Localization error (metric fusion, medium distortion).
Fig. 15: Location recall (metric fusion, medium distortion).
Fig. 16: Localization error for different weight factors (metric fusion, medium distortion).

Finally, we look at the results of two-stage metric fusion in Fig. 17. In the legend, the percentage numbers such as represent the proportion of candidates for re-ranking. It is interesting that re-ranking less than candidates results in almost the same performance as , which implies a significant saving in computation. On the other hand, there is a clear performance drop in recall (see Fig. 18). This is a trade-off between recall and computation efficiency.
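The two-stage idea can be sketched as follows. This is an illustration under assumptions: the first stage ranks the whole database with the type metric only, and the second stage re-ranks the top share of candidates with the weighted fusion of type and angle distances. The function `two_stage_retrieve`, the use of Jaccard distance for both parts, and the toy database are hypothetical.

```python
def jaccard_distance(a, b):
    """One minus the ratio of shared to total distinct elements."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def two_stage_retrieve(query, database, type_dist, angle_dist,
                       w_type=0.5, w_angle=0.5, rerank_fraction=0.05):
    """Return database indices ordered from best to worst match."""
    # Stage 1: rank every candidate with the cheaper type metric.
    scored = sorted((type_dist(query["types"], c["types"]), idx)
                    for idx, c in enumerate(database))
    keep = max(1, int(len(database) * rerank_fraction))
    head, tail = scored[:keep], scored[keep:]
    # Stage 2: compute the angle metric only for the shortlisted candidates
    # and re-rank them by the fused distance.
    reranked = sorted(
        (w_type * d + w_angle * angle_dist(query["angles"],
                                           database[idx]["angles"]), idx)
        for d, idx in head)
    return [idx for _, idx in reranked] + [idx for _, idx in tail]


# Hypothetical toy database of four signatures.
database = [
    {"types": ["tree", "light", "sign"], "angles": [2, 5, 8]},
    {"types": ["tree", "light", "sign"], "angles": [1, 4, 7]},
    {"types": ["tree", "bench"], "angles": [1, 4]},
    {"types": ["bin"], "angles": [3]},
]
query = {"types": ["tree", "light", "sign"], "angles": [1, 4, 7]}
half = two_stage_retrieve(query, database, jaccard_distance,
                          jaccard_distance, rerank_fraction=0.5)
tiny = two_stage_retrieve(query, database, jaccard_distance,
                          jaccard_distance, rerank_fraction=0.05)
```

In this toy case, re-ranking half of the candidates moves the true match (index 1) to the top, while re-ranking only one candidate leaves it in second place. The same effect at scale explains why a small re-ranking share saves computation but can lower recall.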

Fig. 17: Localization error (2-stage metric fusion, no distortion).
Fig. 18: Location recall (2-stage metric fusion, medium distortion).

V-C2 Computation complexity

We also investigate the computation complexity for different configurations. Table IV lists the measured retrieval time for a single query with the average length (14). The results are approximate, because only the signature comparison time is counted. They are obtained by averaging over repetitions using a single metric on a PC with a 3.6 GHz CPU. Clearly, the retrieval time grows with the localization performance: the more accurate a setting, the slower it is. For example, edit distance with angle information performs the best among the single metric settings, but it takes the longest time; Jaccard distance does not offer the best localization accuracy, but it is the fastest. For metric fusion, it is straightforward to estimate the retrieval time by adding up the time for the selected metrics. Some example numbers for two-stage metric fusion are listed in the last column of Table IV. Evidently, “edit + edit” is the slowest combination, although it provides the best accuracy. Therefore, it is sometimes necessary to compromise on accuracy for speed.

| Metric | Type | Angle | Type+Angle, 5% |
| --- | --- | --- | --- |
| Jaccard distance | 82 ms | 496 ms | 107 ms |
| Histogram distance | 173 ms | 538 ms | 200 ms |
| Edit distance | 277 ms | 1.85 s | 370 ms |

TABLE IV: Approximate retrieval time for a single query.
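The last column of Table IV can be checked with simple arithmetic. The sketch below assumes that the two-stage time equals the type-metric time for the whole database plus the angle-metric time for the re-ranked 5% of candidates. This cost model is an inference from the table values, and the helper name `estimate_two_stage_ms` is hypothetical.

```python
# Retrieval times from Table IV, in milliseconds.
TABLE_IV_MS = {
    "jaccard":   {"type": 82,  "angle": 496,  "two_stage": 107},
    "histogram": {"type": 173, "angle": 538,  "two_stage": 200},
    "edit":      {"type": 277, "angle": 1850, "two_stage": 370},
}


def estimate_two_stage_ms(type_ms, angle_ms, rerank_percent=5):
    """Type metric on all candidates plus angle metric on the re-ranked share."""
    return type_ms + angle_ms * rerank_percent / 100


estimates = {
    name: estimate_two_stage_ms(row["type"], row["angle"])
    for name, row in TABLE_IV_MS.items()
}

for name, row in TABLE_IV_MS.items():
    print(f"{name}: estimated {estimates[name]:.1f} ms, "
          f"reported {row['two_stage']} ms")
```

Under this assumption, each estimate agrees with the reported two-stage time to within one millisecond.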

V-C3 Practical localization and parameter dependence

Previous results show that two-stage metric fusion achieves a good trade-off between accuracy and speed. Therefore, we consider the configuration “edit+edit (0.5,0.5) 5%” as a practical setting. Figure 19 shows the localization error distribution for t=1 and several distortion levels. This is a “worst-case” scenario, because the top candidate location is judged as the query location. The cumulative probability for small distance errors ranges from to . This is still an encouraging result. The corresponding recall is shown in Fig. 20. The recall for top ranks varies from to more than . These results imply the effectiveness of the proposed scheme in practice. When used as a filtering tool for other methods such as [11, 33, 37], a significant amount of computation and storage might be saved by only considering data associated with top candidate locations. In fact, among the ten thousand query signatures, 4180 (41.8%) have distinct patterns. If only these queries are used, better localization performance could be expected. Additional tests are performed using those “good” queries. The results are shown in Figs. 21 and 22. They are significantly improved compared with the previous ones, which again confirms the value of our scheme.

Fig. 19: Localization error for 2-stage metric fusion “edit+edit (0.5,0.5) 5%”, t=1.
Fig. 20: Location recall for 2-stage metric fusion “edit+edit (0.5,0.5) 5%”.

We further examine the performance dependence on some system parameters with the same two-stage metric fusion setting “edit+edit (0.5,0.5) 5%, t=1”. In particular, we focus on the visibility range and the angle quantization level. They both depend on the camera and the object detection algorithm. The signature length generally grows with the visibility range: the larger the visibility, the longer the signatures. As the signature entropy increases, it should be easier to distinguish one location from another. This is confirmed by the results in Table V, where the cumulative probability and the recall are shown for different visibility ranges, with a fixed maximum localization error (50m) and a fixed rank range (top 10%). The results are obtained by generating new databases and repeating the experiments. They suggest that localization performance can be improved by using more powerful imaging devices and algorithms.

On the other hand, angle quantization is meant to counteract noise, e.g. measurement errors. The stronger the quantization, the greater the resistance to noise (and potentially the lower the storage and computation cost), but also the weaker the discrimination power. Table VI shows the cumulative probability for localization errors up to 50m with various quantization strengths. The performance first increases with the number of quantization levels, then slightly drops, which indicates that finer quantization does not always bring better performance. Thus a balance should be sought between noise resistance and discrimination power.
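The trade-off can be illustrated with a uniform quantizer. The sketch below assumes angles in degrees mapped to equal-width bins over [0, 360); the actual quantization rule is defined elsewhere in the paper, and the name `quantize_angle` is hypothetical.

```python
def quantize_angle(angle_deg, levels):
    """Map an angle in degrees to one of `levels` uniform bins over [0, 360)."""
    bin_width = 360.0 / levels
    return int((angle_deg % 360.0) // bin_width)


# A hypothetical object observed at 100 degrees, re-measured at 106 degrees.
coarse = (quantize_angle(100, 8), quantize_angle(106, 8))    # 45-degree bins
fine = (quantize_angle(100, 72), quantize_angle(106, 72))    # 5-degree bins

# Two different objects, 30 degrees apart.
coarse_pair = (quantize_angle(100, 8), quantize_angle(130, 8))
fine_pair = (quantize_angle(100, 72), quantize_angle(130, 72))
```

With 8 levels the 6-degree measurement error is absorbed (both readings fall in the same bin), but the two distinct objects also collapse into one bin. With 72 levels the objects are separated, but the noisy reading changes bins as well.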

P(error ≤ 50m) recall@10%
TABLE V: Performance dependence on visibility. denotes the visibility range.
TABLE VI: Performance dependence on angle quantization. denotes the number of quantization levels.
Fig. 21: Localization error for 2-stage metric fusion “edit+edit (0.5,0.5) 5%”, t=1. Unambiguous queries are used.
Fig. 22: Location recall for 2-stage metric fusion “edit+edit (0.5,0.5) 5%”. Unambiguous queries are used.

VI Conclusion and discussion

In this work, we propose to use semantic information for urban localization. We focus on special objects that can be seen from street views, such as trees, street lights, bus stops, etc. These semantic objects can be obtained from public data sources. They are encoded as semantic signatures. The localization problem is solved by signature matching. Given a query signature, similar signatures are retrieved from a database. The query location is inferred from the best matches’ geo-reference. A semantic signature consists of two parts, a type sequence and an angle sequence. We select a few metrics for sequence matching and find that edit distance shows promising results. In order to aggregate both type and angle information, a metric fusion framework is proposed for signature retrieval. In addition, a two-stage fusion approach is proposed to improve computation efficiency.

Simulation shows that the proposed technique ideally achieves close-to-GPS accuracy. In practice, it can be used alone for coarse localization, or integrated with other techniques, such as pose estimation, for more accurate localization. It is of interest for, e.g., tourism applications in urban areas. Since the scheme can effectively filter out irrelevant regions, it is a suitable step before other matching techniques that require heavy computation.

This paper focuses on retrieval. A number of existing semantic objects are used. While object detection is not covered here, the main message of this self-contained work is that if a sufficient number of semantic objects exist, then satisfactory localization is possible even at a large scale. There are also other obstacles in reality, such as inaccurate distance measurement, object occlusion, out-of-date databases, etc. These issues are partly taken into account by simulated distortion. Fine-tuning the system together with an object detection pipeline is an interesting topic for future research.


References

  • [1] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, Feb. 2016.
  • [2] N. Piasco, D. Sidibé, C. Demonceaux, and V. Gouet-Brunet, “A survey on visual-based localization: On the benefit of heterogeneous data,” Pattern Recognition, vol. 74, pp. 90–109, 2018.
  • [3] H. Lim, S. N. Sinha, M. F. Cohen, and M. Uyttendaele, “Real-time image-based 6-DOF localization in large-scale environments,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012, pp. 1043–1050.
  • [4] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo collections in 3D,” ACM Trans. Graph., vol. 25, no. 3, pp. 835–846, Jul. 2006.
  • [5] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg, “Mapping the world’s photos,” in Proc. of International Conference on World Wide Web (WWW). ACM, 2009, pp. 761–770.
  • [6] X. Qu, B. Soheilian, and N. Paparoditis, “Vehicle localization using mono-camera and geo-referenced traffic signs,” in Proc. of IEEE Intelligent Vehicles Symposium, June 2015, pp. 605–610.
  • [7] E. Brachmann and C. Rother, “Learning less is more - 6D camera localization via 3D surface regression,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [8] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal on Computer Vision (IJCV), vol. 60, no. 2, pp. 91–110, Nov. 2004.
  • [9] G. Schindler, M. Brown, and R. Szeliski, “City-scale location recognition,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2007, pp. 1–7.
  • [10] Y. Li, N. Snavely, and D. P. Huttenlocher, “Location recognition using prioritized feature matching,” in Proc. of European Conference on Computer Vision (ECCV), 2010, pp. 791–804.
  • [11] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2D-to-3D matching,” in Proc. of International Conference on Computer Vision (ICCV), Nov 2011, pp. 667–674.
  • [12] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, “Worldwide pose estimation using 3D point clouds,” in Proc. of European Conference on Computer Vision (ECCV), 2012, pp. 15–29.
  • [13] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2999–3007.
  • [14] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” CoRR, vol. abs/1804.02767, 2018.
  • [15] L. Weng, B. Soheilian, and V. Gouet-Brunet, “Semantic signatures for urban visual localization,” in International Conference on Content-Based Multimedia Indexing (CBMI), Sep. 2018, pp. 1–6.
  • [16] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2006, pp. 2161–2168.
  • [17] A. Irschara, C. Zach, J. M. Frahm, and H. Bischof, “From structure-from-motion point clouds to fast location recognition,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 2599–2606.
  • [18] S. Arya and D. M. Mount, “Approximate nearest neighbor queries in fixed dimensions,” in Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993, pp. 271–280.
  • [19] A. R. Zamir and M. Shah, “Accurate image localization based on google maps street view,” in Proc. of European Conference on Computer Vision (ECCV), 2010, pp. 255–268.
  • [20] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk, “City-scale landmark identification on mobile devices,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2011, pp. 737–744.
  • [21] J. Zhang, A. Hallquist, E. Liang, and A. Zakhor, “Location-based image retrieval for urban environments,” in Proc. of IEEE International Conference on Image Processing (ICIP), 2011, pp. 3677–3680.
  • [22] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, Jun. 1981.
  • [23] E. Tola, V. Lepetit, and P. Fua, “A fast local descriptor for dense matching,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2008, pp. 1–8.
  • [24] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in Proc. of European Conference on Computer Vision (ECCV), 2010, pp. 778–792.
  • [25] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, “Data-driven visual similarity for cross-domain image matching,” ACM Trans. Graph., vol. 30, no. 6, p. 10, Dec. 2011.
  • [26] A. R. Zamir and M. Shah, “Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1546–1558, Aug 2014.
  • [27] R. Arandjelović and A. Zisserman, “Dislocation: Scalable descriptor distinctiveness for location recognition,” in Proc. of Asian Conference on Computer Vision (ACCV), 2014, pp. 188–204.
  • [28] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1808–1817.
  • [29] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 3304–3311.
  • [30] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307.
  • [31] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Panorama to panorama matching for location recognition,” in Proc. of ACM International Conference on Multimedia Retrieval, 2017, pp. 392–396.
  • [32] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, “Building Rome in a day,” in Proc. of International Conference on Computer Vision (ICCV), Sept 2009, pp. 72–79.
  • [33] Y. Song, X. Chen, X. Wang, Y. Zhang, and J. Li, “6-DOF image localization from massive geo-tagged reference images,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1542–1554, Aug 2016.
  • [34] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla, “Are large-scale 3D models really necessary for accurate visual localization?” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, p. 10.
  • [35] S. Ardeshir, A. R. Zamir, A. Torroella, and M. Shah, “GIS-assisted object detection and geospatial localization,” in Proc. of European Conference on Computer Vision (ECCV), 2014, pp. 602–617.
  • [36] C. Arth, C. Pirchheim, J. Ventura, D. Schmalstieg, and V. Lepetit, “Instant outdoor localization and SLAM initialization from 2.5D maps,” in Proc. of International Symposium on Mixed and Augmented Reality (ISMAR), 2015.
  • [37] N. Bhowmik, L. Weng, V. Gouet-Brunet, and B. Soheilian, “Cross-domain image localization by adaptive feature fusion,” in Proc. of Joint Urban Remote Sensing Event, 2017, p. 4.
  • [38] P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.
  • [39] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
  • [40] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, “Early versus late fusion in semantic video analysis,” in ACM International Conference on Multimedia, New York, NY, USA, 2005, pp. 399–402.
  • [41] S. Vrochidis, B. Huet, E. Y. Chang, and I. Kompatsiaris, Eds., Big Data Analytics for Large-Scale Multimedia Search. Wiley, 2019.
  • [42] R. Girshick, “Fast R-CNN,” in Proc. of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  • [44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems (NIPS), 2015, pp. 91–99.