Deep Stochastic Attraction and Repulsion Embedding for Image Based Localization

08/27/2018 · by Liu Liu, et al. · Australian National University

This paper tackles the problem of large-scale image-based localization, where the geographic location at which a query image was taken is estimated by retrieving geo-tagged reference images depicting the same place from a large database. For this problem, an important yet under-researched issue is how to learn discriminative image representations that are best tailored to the task of geo-localization. Aiming to find a novel image representation with higher location-discriminating power, this paper presents the following contributions: 1) we represent a place (location) as a set of exemplar images depicting the same landmarks, instead of as a pre-defined geographic cell obtained by partitioning the world; 2) we advocate the use of competitive learning among places, directly via feature embeddings, aiming to maximize similarities among intra-class images while minimizing similarities among inter-class images. This represents a significant departure from the state-of-the-art IBL methods using the triplet ranking loss, which only enforces that intra-place visual similarities are larger than inter-place ones; 3) we propose a new Stochastic Attraction and Repulsion Embedding (SARE) loss function to facilitate the competitive learning. Our SARE loss is easy to implement and pluggable into any convolutional neural network. Experiments show that the method improves localization performance on standard benchmarks by a large margin.


1 Introduction

Figure 1: The pipeline of our method. We use the VGG16 network [5] (convolutional layers only) as our base architecture. NetVLAD [3] pooling is used to obtain compact image representations, and the resulting feature vectors are L2-normalized. The distances between the query-positive and query-negative embeddings are computed and converted to a probability distribution, which is compared with the ground-truth match-ability distribution, yielding the Kullback-Leibler divergence loss.

The task of Image-Based Localization (IBL) is to estimate the geographic location at which a query image was taken by comparing it against geo-tagged images from a city-scale image database (i.e., a map). IBL has attracted considerable attention recently due to widespread potential applications such as robot navigation [6], VR/AR [7, 8, 9, 10], and autonomous driving [11]. Depending on whether or not 3D point clouds are used in the map, existing IBL methods can be roughly classified into two groups: image-retrieval based methods [3, 4, 12, 13, 2, 14] and direct 2D-3D matching based methods [15, 16, 17, 18, 19].

This paper follows the image-retrieval approach because of its effectiveness at large scale and its robustness to changing conditions [20]. For image-retrieval based methods, the main challenge is how to represent images discriminatively, so that images depicting the same landmarks have similar representations while those depicting different landmarks have dissimilar representations. The challenge is exacerbated by the typically large-scale image database, in which many images may contain repetitive structures and similar landmarks, causing severe ambiguities.

Convolutional Neural Networks (CNNs) have demonstrated great success on the IBL task [3, 4, 13, 21, 22, 14]. Typically, CNNs trained for image classification are fine-tuned for IBL. As far as we know, all state-of-the-art IBL methods focus on how to effectively aggregate a CNN feature map into a discriminative image representation, but overlook another important aspect which can potentially boost IBL performance markedly: how to effectively organize the aggregated image representations. So far, all state-of-the-art IBL methods use triplet or contrastive embedding to supervise this organization process.

This paper fills this gap by proposing a new method to effectively organize the image representations (embeddings). We first define a "place" as a set of images depicting the same landmarks at a location, and then directly enforce intra-place image similarity and inter-place dissimilarity in the embedding space. Our goal is to cluster learned embeddings from the same place while separating embeddings from different places. Intuitively, we are organizing image representations using places as agents.

The above idea would directly lead to a multi-class classification problem if we could label each image with a "place" tag. Apart from the time-consuming labeling process, such a formulation would also result in too many pre-defined classes, requiring a large training image set to train the classification CNN. Recently-proposed methods [2, 1] solve this multi-class classification problem using large GPS-tagged training datasets. In their setting, a class is defined as the images captured from nearby geographic positions, disregarding their visual appearance. Since images within the same class do not necessarily depict the same landmarks, the CNN may only learn high-level information [2] for each geographic position, which is inadequate for accurate localization.

Can we capture the intra-place image "attraction" and inter-place image "repulsion" relationship with limited data, while still enabling competitive embedding? To capture the "attraction" and "repulsion" relationship, we formulate the IBL task as image similarity-based binary classification in the feature embedding space. Specifically, the similarity for images from the same place is defined as 1, and 0 otherwise. This binary partition of similarity captures the intra-place "attraction" and inter-place "repulsion". To tackle the limited-data issue, we train the CNN with image triplets, each consisting of a query, a positive image (from the same place as the query), and a negative image (from a different place). Note that a triplet is the minimum set needed to define the intra-place "attraction" and inter-place "repulsion".

Our CNN architecture is given in Fig. 1. We name our metric-learning objective Stochastic Attraction and Repulsion Embedding (SARE), since it captures pairwise image relationships in a probabilistic framework. Moreover, our SARE objective is easily extended to handle multiple negative images coming from different places, i.e., enabling each place to compete with multiple other places. In experiments, we demonstrate that SARE yields improved performance on various IBL benchmarks. Validation on standard image retrieval benchmarks further demonstrates the strong generalization ability of our method.

2 Related Work

There is a rich family of work in IBL. We briefly review CNN-based image representation learning methods. Please refer to [23] for an overview.

While there have been many works [24, 21, 22, 14, 3, 13, 4, 12] on designing effective CNN feature map aggregation methods for IBL, almost all of them exclusively use a triplet or contrastive embedding objective to supervise CNN training. Both objectives in spirit pull matchable image pairs closer while pushing non-matching image pairs apart. While they are effective, we will show that our SARE objective outperforms them on the IBL task. Three interesting exceptions which do not use a triplet or contrastive embedding objective are PlaNet [1], IM2GPS-CNN [2], and CPlaNet [25]. They formulate IBL as a geographic position classification task: they first partition the 2D geographic space into cells using GPS tags and then define one class per cell. The CNN training process is supervised by a cross-entropy classification loss which penalizes incorrectly classified images. We also show that our SARE objective outperforms this multi-class classification objective on the IBL task.

Although our SARE objective is formulated in a competitive learning framework, it differs from traditional competitive learning methods such as Self-Organizing Maps [26] and Vector Quantization [27]. Both are devoted to learning cluster centers that separate the original vectors; no constraints are imposed on the original vectors themselves. Under our formulation, we directly impose the "attraction-repulsion" relationship on the original vectors to supervise the CNN learning process.

3 Problem Definition and Method Overview

Given a large geo-tagged image database, the IBL task is to estimate the geographic position of a query image $q$. An image-retrieval based method first identifies the most visually similar database image for $q$, and then uses the location of that database image as the location of $q$. If the identified most similar image comes from the same place as $q$, we deem $q$ successfully localized, and call the retrieved image a positive image, denoted $p$. If the identified most similar image comes from a different place than $q$, we have falsely localized $q$, and the retrieved image is a negative image, denoted $n$.

Mathematically, an image-retrieval based method is executed as follows. First, the query image and the database images are converted to compact representations (vectors). This step is called image feature embedding and is done by a CNN: for example, query image $q$ is converted to a fixed-size vector $f_{\theta}(q)$, where $f$ is a CNN and $\theta$ denotes its weights. Second, we define a similarity function on pairs of vectors; it takes vectors $f_{\theta}(q)$ and $f_{\theta}(p)$ and outputs a scalar value describing their similarity. Since we compare $q$ against the entire large database to find its most similar image, the similarity function should be simple and efficient to compute, enabling fast nearest neighbor search. A typical choice is the $\ell_2$ distance, or a function that increases/decreases monotonically with the $\ell_2$ distance.
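To make the retrieval step concrete, the following minimal sketch performs exhaustive nearest-neighbor search over precomputed embeddings. It is an illustration only, not the authors' code (their implementation is in MatConvNet); the function names and the random stand-in data are ours.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_top_k(query_vec, db_vecs, db_gps, k=5):
    """Return the indices and GPS tags of the k database images closest to the query.

    query_vec: (D,) embedding of the query image.
    db_vecs:   (N, D) embeddings of the geo-tagged database images.
    db_gps:    (N, 2) latitude/longitude of each database image.
    For L2-normalized vectors, maximizing the dot product is equivalent to
    minimizing the Euclidean distance, so one matrix product suffices.
    """
    query_vec = l2_normalize(query_vec)
    db_vecs = l2_normalize(db_vecs)
    sims = db_vecs @ query_vec              # (N,) cosine similarities
    top_idx = np.argsort(-sims)[:k]         # indices of the k most similar images
    return top_idx, db_gps[top_idx]

# Toy usage with random data standing in for CNN embeddings.
rng = np.random.default_rng(0)
db_vecs = rng.normal(size=(1000, 4096))
db_gps = rng.uniform(size=(1000, 2))
query = rng.normal(size=4096)
idx, gps = retrieve_top_k(query, db_vecs, db_gps, k=5)
print(idx, gps[0])   # the location of the top-1 image is the estimate for the query
```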

Relying on feature vectors extracted by an untrained CNN to perform nearest neighbor search would often return a negative image for $q$. Thus, we need to train the CNN using easily obtained geo-tagged training images (Sec. 7.1). The training process in general defines a loss function on the CNN-extracted feature vectors and uses it to update the CNN weights $\theta$. The state-of-the-art triplet ranking loss (Sec. 4.1) takes triplet training images $(q, p, n)$ and imposes that $q$ is more similar to $p$ than to $n$. The contrastive loss (Sec. 4.2) instead tries to separate the $(q, n)$ pair by a pre-defined distance margin (see Fig. 2). While the two losses are effective, we construct our metric embedding objective in a substantially different way.

Given triplet training images $(q, p, n)$, we have the prior knowledge that the pair $(q, p)$ is matchable and the pair $(q, n)$ is non-matchable. This simple match-ability prior actually defines a probability distribution: for the $(q, p)$ pair the match-ability is 1, and for the $(q, n)$ pair it is 0. Can we respect this match-ability prior in the feature embedding space? Our answer is yes. To do so, we fit a kernel on the $\ell_2$ distances of the $(q, p)$ and $(q, n)$ pairs and obtain a probability distribution. Our metric-learning objective is then to minimize the Kullback-Leibler divergence between the two probability distributions (Sec. 4.3).

What is the benefit of respecting the match-ability prior in the feature embedding space? Conceptually, it captures the intra-place (defined by the $(q,p)$ pair) "attraction" and inter-place (defined by the $(q,n)$ pair) "repulsion" relationship in the embedding space. Potentially, this attraction-repulsion relationship balances the embedded positions of the entire image database. Mathematically, we analyze the gradients of the resulting metric-learning objective with respect to the triplet images and find that our objective adaptively adjusts the force (gradient) that pulls the $(q,p)$ pair together while pushing the $(q,n)$ pair apart (Sec. 5).

4 Deep Metric Embedding Objectives in IBL

In this section, we first review the two widely-used deep metric embedding objectives in IBL, triplet and contrastive embedding, which are realized by minimizing the triplet ranking loss and the contrastive loss, respectively. We then present our own objective, Stochastic Attraction and Repulsion Embedding (SARE).

4.1 Triplet Ranking Loss

The triplet ranking loss is defined by

$L_{\mathrm{triplet}} = \max\big(0,\; m + d^2(q,p) - d^2(q,n)\big), \qquad (1)$

where $d(x,y) = \|f_{\theta}(x) - f_{\theta}(y)\|_2$ denotes the $\ell_2$ distance between embeddings, and $m$ is an empirical margin, typically $m = 0.1$ [28, 3, 21, 14]. The $\max(0,\cdot)$ operator prunes out triplets with $d^2(q,n) \ge d^2(q,p) + m$.
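For reference, a minimal PyTorch sketch of Eq. (1) is given below. It is an illustrative re-implementation under the definitions above (batched embeddings, squared $\ell_2$ distances), not the authors' MatConvNet code.

```python
import torch

def triplet_ranking_loss(f_q, f_p, f_n, margin=0.1):
    """Eq. (1): hinge on squared L2 distances of a (query, positive, negative) triplet.

    f_q, f_p, f_n: (B, D) batches of embedding vectors.
    """
    d2_qp = ((f_q - f_p) ** 2).sum(dim=1)   # squared distance d^2(q, p)
    d2_qn = ((f_q - f_n) ** 2).sum(dim=1)   # squared distance d^2(q, n)
    return torch.clamp(margin + d2_qp - d2_qn, min=0).mean()
```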

4.2 Contrastive Loss

The contrastive loss imposes a constraint on an image pair $(q, x)$ by:

$L_{\mathrm{contrastive}} = y\, d^2(q,x) + (1 - y)\, \max\big(0,\; \tau - d(q,x)\big)^2, \qquad (2)$

where $y = 1$ for a $(q,p)$ pair and $y = 0$ for a $(q,n)$ pair, and $\tau$ is an empirical margin used to prune out negative images with $d(q,n) \ge \tau$. Typically, $\tau = 0.7$ [14].
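A corresponding PyTorch sketch of Eq. (2) follows; again this is an illustration under the definitions above (with the margin defaulting to the typical value $\tau = 0.7$ quoted in the text), not the reference implementation of [14].

```python
import torch

def contrastive_loss(f_q, f_x, y, margin=0.7):
    """Eq. (2): y = 1 for (q, p) pairs, y = 0 for (q, n) pairs.

    f_q, f_x: (B, D) embeddings; y: (B,) binary labels.
    """
    d = torch.norm(f_q - f_x, dim=1)                           # L2 distance d(q, x)
    pos_term = y * d ** 2                                      # pull matchable pairs together
    neg_term = (1 - y) * torch.clamp(margin - d, min=0) ** 2   # push non-matching pairs apart
    return (pos_term + neg_term).mean()
```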

The intuitions behind the above two losses are compared in Fig. 2.

Figure 2: The triplet ranking loss imposes the constraint $d^2(q,p) + m \le d^2(q,n)$. The contrastive loss pulls the distance of the $(q,p)$ pair towards zero, while pushing the $(q,n)$ pair at least $\tau$ apart.

4.3 SARE-Stochastic Attraction and Repulsion Embedding

In this subsection, we present our Stochastic Attraction and Repulsion Embedding (SARE) objective, which is optimized to learn discriminative embeddings for each "place". A triplet of images $(q, p, n)$ defines two places, one given by the $(q, p)$ pair and the other by $n$. The intra-place and inter-place similarities are defined in a probabilistic framework.

Given a query image $q$, the probability that $q$ picks $p$ as its match is the conditional probability $P(p|q)$, which equals 1 based on the co-visible (matchable) prior. The conditional probability $P(n|q)$ equals 0 following the above definition. Since we are interested in modeling pairwise similarities, we set $P(q|q) = 0$. Note that these triplet probabilities define a probability distribution (summing to 1).

In the feature embedding space, we would like the CNN-extracted feature vectors to respect the above probability distribution. We therefore define another probability distribution $Q(\cdot|q)$ in the embedding space and try to minimize the mismatch between the two distributions. The Kullback-Leibler divergence is employed as the (cross-entropy) loss and is given by:

$L_{\mathrm{SARE}} = \mathrm{KL}\big(P(\cdot|q)\,\|\,Q(\cdot|q)\big) = -\log Q(p|q). \qquad (3)$

In order to define the probability $Q(p|q)$ that $q$ picks $p$ as its match in the feature embedding space, we fit a kernel on the pairwise $\ell_2$ feature vector distances. We compare the effectiveness of three commonly-used kernels: Gaussian, Cauchy, and Exponential. In the following paragraphs we use the Gaussian kernel to demonstrate our method; the loss functions defined by the Cauchy and Exponential kernels are given in the Appendix.

For the Gaussian kernel, we have:

$Q(p|q) = \dfrac{\exp\big(-d^2(q,p)\big)}{\exp\big(-d^2(q,p)\big) + \exp\big(-d^2(q,n)\big)}. \qquad (4)$

In the feature embedding space, the probability that $q$ picks $n$ as its match is given by $Q(n|q) = 1 - Q(p|q)$. If the embedded feature vectors $f_{\theta}(q)$ and $f_{\theta}(p)$ are sufficiently close, and $f_{\theta}(q)$ and $f_{\theta}(n)$ are far enough apart under the $\ell_2$ metric, the conditional distributions $P(\cdot|q)$ and $Q(\cdot|q)$ will match. Thus, our SARE objective aims to find an embedding function that pulls the distance of the $(q,p)$ pair towards zero while pushing the distance of the $(q,n)$ pair towards infinity.
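A minimal PyTorch sketch of the single-negative SARE loss with the Gaussian kernel (Eqs. (3)-(4)) follows. Since $P(p|q) = 1$, the KL divergence reduces to $-\log Q(p|q)$, which is computed below in a numerically stable way with logsumexp. This is an illustration under the definitions above, not the authors' MatConvNet implementation.

```python
import torch

def sare_gaussian_triplet(f_q, f_p, f_n):
    """SARE loss for one (q, p, n) triplet per batch element (Eqs. (3)-(4)).

    With the Gaussian kernel, -log Q(p|q) = log(1 + exp(d^2(q,p) - d^2(q,n))),
    i.e. a softplus of the difference of squared distances.
    """
    d2_qp = ((f_q - f_p) ** 2).sum(dim=1)
    d2_qn = ((f_q - f_n) ** 2).sum(dim=1)
    # Logits are the negative squared distances; the loss is a cross-entropy
    # against the "positive" class in slot 0.
    logits = torch.stack([-d2_qp, -d2_qn], dim=1)           # (B, 2)
    loss = torch.logsumexp(logits, dim=1) - logits[:, 0]    # -log Q(p|q)
    return loss.mean()
```

Switching to the Cauchy or Exponential kernel only changes the logits: $-\log(1 + d^2)$ or $-d$ instead of $-d^2$.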

5 Comparing the Three Losses

In this section, we illustrate the connections between the above three loss functions. This is approached by deriving and comparing their gradients, which are key to the back-propagation stage of network training. Note that a gradient may be interpreted as the resultant force created by a spring between an image pair [29]: the gradient with respect to the positive image $p$ acts as a spring pulling the $(q,p)$ pair together, while the gradient with respect to the negative image $n$ acts as a spring pushing the $(q,n)$ pair apart.

In Fig. 3, we compare the magnitudes of the gradients with respect to $f_{\theta}(p)$ and $f_{\theta}(n)$ for the different objectives. The corresponding analytic expressions are given in Table 1. For each objective, the gradient with respect to the query is given by $\partial L / \partial f_{\theta}(q) = -\big(\partial L / \partial f_{\theta}(p) + \partial L / \partial f_{\theta}(n)\big)$.

Loss | $\partial L / \partial f_{\theta}(p)$ | $\partial L / \partial f_{\theta}(n)$
Triplet ranking | $2\,\Delta_p$ | $-2\,\Delta_n$
Contrastive | $2\,\Delta_p$ | $-2\,(\tau - d_n)\,\Delta_n / d_n$
Gaussian SARE | $2\,Q(n|q)\,\Delta_p$ | $-2\,Q(n|q)\,\Delta_n$
Cauchy SARE | $2\,Q_C(n|q)\,\Delta_p / (1 + d_p^2)$ | $-2\,Q_C(n|q)\,\Delta_n / (1 + d_n^2)$
Exponential SARE | $Q_E(n|q)\,\Delta_p / d_p$ | $-Q_E(n|q)\,\Delta_n / d_n$
Table 1: Gradients with respect to $f_{\theta}(p)$ and $f_{\theta}(n)$ for the different objectives, where $\Delta_p = f_{\theta}(p) - f_{\theta}(q)$, $\Delta_n = f_{\theta}(n) - f_{\theta}(q)$, $d_p = d(q,p)$, and $d_n = d(q,n)$. For the triplet ranking and contrastive losses, the listed gradients hold only where the corresponding loss is non-zero (they are zero elsewhere). Note that $Q_C(n|q)$ and $Q_E(n|q)$ are different from $Q(n|q)$ since they are defined by the Cauchy and Exponential kernels, respectively; they share a similar form to $Q(n|q)$ and are given in the Appendix.
Figure 3: Comparison of gradient magnitudes with respect to $f_{\theta}(p)$ and $f_{\theta}(n)$ as functions of $d(q,p)$ and $d(q,n)$ for the different objectives. (Best viewed in color on screen)

In the case of the triplet ranking loss, $\|\partial L / \partial f_{\theta}(p)\|$ and $\|\partial L / \partial f_{\theta}(n)\|$ increase linearly with the distances $d(q,p)$ and $d(q,n)$, respectively. The saturation regions in which the gradients equal zero correspond to triplets producing a zero loss (Eq. (1)). For triplets producing a non-zero loss, $\partial L / \partial f_{\theta}(p)$ is independent of $d(q,n)$ and vice versa. Thus, the update of $f_{\theta}(p)$ disregards the current embedded position of $f_{\theta}(n)$, and vice versa.

For the contrastive loss, $\partial L / \partial f_{\theta}(p)$ is independent of $d(q,n)$ and increases linearly with the distance $d(q,p)$, while $\|\partial L / \partial f_{\theta}(n)\|$ decreases linearly with the distance $d(q,n)$. The region in which $\partial L / \partial f_{\theta}(n)$ equals zero corresponds to negative images with $d(q,n) \ge \tau$.

For all kernel-defined SAREs, $\partial L / \partial f_{\theta}(p)$ and $\partial L / \partial f_{\theta}(n)$ depend on both distances $d(q,p)$ and $d(q,n)$; the distances enter implicitly through the probability $Q(n|q)$ (Eq. (4)). Thus, the updates of $f_{\theta}(p)$ and $f_{\theta}(n)$ take into account the current embedded positions of the triplet images, which is beneficial given the possibly diverse feature distribution in the embedding space.

The benefit of kernel-defined SARE objectives can be better understood when combined with the hard-negative mining strategy widely used in CNN training. The strategy returns a set of hard negative images (i.e., the nearest negatives in the $\ell_2$ metric) for training. Note that both the triplet ranking loss and the contrastive loss rely on empirical parameters ($m$, $\tau$) to prune out negatives (cf. the saturation regions). In contrast, our kernel-defined SARE objectives do not rely on such parameters; they always take the current embedded positions into account. For example, a hard negative with $d(q,n) < d(q,p)$ (top-left triangle in the gradient plots) triggers a large force to pull the $(q,p)$ pair together while pushing the $(q,n)$ pair apart. A "semi-hard" [30] negative with $d(q,n) > d(q,p)$ (bottom-right triangle in the gradient plots) still triggers a force to pull the $(q,p)$ pair and push the $(q,n)$ pair, but the force decays with increasing $d(q,n)$. Here, a large $d(q,n)$ may correspond to well-trained samples or to noise, and the gradient decay has the potential benefit of reducing over-fitting.

Figure 4: Comparison of the gradient magnitudes with respect to $f_{\theta}(n)$ for the different objectives, with $d(q,p)$ fixed at $\sqrt{2}$.

To better understand the gradient decay ability of the kernel-defined SARE objectives, we fix $d(q,p) = \sqrt{2}$ and compare $\|\partial L / \partial f_{\theta}(n)\|$ for all objectives in Fig. 4. Here, $d(q,p) = \sqrt{2}$ means that for uniformly distributed feature embeddings, a randomly sampled $(q,p)$ pair is likely to be about $\sqrt{2}$ apart [31]; uniformly distributed feature embeddings correspond to an initial untrained/un-fine-tuned CNN. For the triplet ranking loss, Gaussian SARE, and Cauchy SARE, $\|\partial L / \partial f_{\theta}(n)\|$ increases with $d(q,n)$ when $d(q,n)$ is small. In contrast to the gradual decay of the SAREs, the triplet ranking loss abruptly "closes" the force once the triplet produces a zero loss (Eq. (1)). For the contrastive loss and Exponential SARE, $\|\partial L / \partial f_{\theta}(n)\|$ decreases with $d(q,n)$. Again, the contrastive loss "closes" the force once the negative image produces a zero loss.
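The qualitative behavior described above can be reproduced numerically with autograd, as in the sketch below, which sweeps $d(q,n)$ with $d(q,p)$ fixed at $\sqrt{2}$ using 1-D toy embeddings. It is an illustration only (the margin values are the typical ones quoted in Sec. 4), not the code used to produce the paper's figures.

```python
import torch

def grad_wrt_negative(loss_fn, d_qn, d_qp=2 ** 0.5):
    """|dL/df(n)| for 1-D embeddings with f(q)=0, f(p)=d_qp, f(n)=d_qn."""
    f_q = torch.zeros(1)
    f_p = torch.full((1,), d_qp)
    f_n = torch.full((1,), d_qn, requires_grad=True)
    loss_fn(f_q, f_p, f_n).backward()
    return f_n.grad.abs().item()

def triplet(f_q, f_p, f_n, m=0.1):
    return torch.clamp(m + (f_q - f_p) ** 2 - (f_q - f_n) ** 2, min=0).sum()

def contrastive_neg(f_q, f_p, f_n, tau=0.7):
    # Only the negative-pair term contributes to dL/df(n).
    return (torch.clamp(tau - (f_q - f_n).abs(), min=0) ** 2).sum()

def sare_gaussian(f_q, f_p, f_n):
    d2p, d2n = (f_q - f_p) ** 2, (f_q - f_n) ** 2
    return torch.nn.functional.softplus(d2p - d2n).sum()   # -log Q(p|q)

for d_qn in [0.5, 1.0, 1.5, 2.0, 3.0]:
    print(d_qn,
          grad_wrt_negative(triplet, d_qn),
          grad_wrt_negative(contrastive_neg, d_qn),
          grad_wrt_negative(sare_gaussian, d_qn))
```

The printed magnitudes match the analytic expressions in Table 1: the triplet and contrastive gradients drop abruptly to zero outside their active regions, while the Gaussian SARE gradient decays smoothly for large $d(q,n)$.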

6 Handling Multiple Negatives

In this section, we give two methods to handle multiple negative images in the CNN training stage. Equation (3) defines a SARE loss on a triplet and aims to shorten the embedded distance between the query and positive images while enlarging the distance between the query and negative images. In the IBL task, the number of positive images is usually small, since they must depict the same landmarks as the query image, while the number of negative images is large, since any image from a different place is a negative. At the same time, the time-consuming hard-negative mining process returns multiple negative images for each query image [3, 4]. There are two ways to handle these negatives: treat them independently, or handle them jointly; both strategies are illustrated in Fig. 5.

Figure 5: Handling multiple negative images. Left: The first method treats multiple negatives independently. Each triplet focuses on the competition between two places, one defined by the query $q$ and positive $p$, and the other defined by the negative $n_i$. Right: The second strategy handles multiple negative images jointly, enabling competition over multiple places.

Given $N$ negative images, treating them independently results in $N$ triplets, which are substituted into Eq. (3) to calculate the loss used to train the CNN. Each triplet focuses on the competition between two places (positive vs. negative). The attraction and repulsion forces from the multiple place pairs are averaged to balance the embeddings.

Jointly handling multiple negatives aims to balance the distance of the positive against multiple negatives. In our formulation, we can easily construct an objective that pushes all negative images away simultaneously. Specifically, the match-ability priors for all negative images are defined as zero, i.e., $P(n_i|q) = 0$ for $i = 1, \dots, N$. The Kullback-Leibler divergence loss over multiple negatives is given by:

$L_{\mathrm{SARE}} = -\log Q(p|q), \qquad (5)$

where, for the Gaussian kernel SARE, $Q(p|q)$ is defined as:

$Q(p|q) = \dfrac{\exp\big(-d^2(q,p)\big)}{\exp\big(-d^2(q,p)\big) + \sum_{i=1}^{N}\exp\big(-d^2(q,n_i)\big)}. \qquad (6)$

The gradients of Eq. (5) can be easily computed to train the CNN. All kernel-defined multi-negative loss functions and their gradients are given in the Appendix.
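A PyTorch sketch of the joint multi-negative Gaussian SARE loss (Eqs. (5)-(6)) is given below; it amounts to a cross-entropy over one positive and N negatives, with negative squared distances as logits. It is an illustrative re-implementation, not the authors' MatConvNet code.

```python
import torch

def sare_gaussian_joint(f_q, f_p, f_ns):
    """Joint SARE loss (Eqs. (5)-(6)).

    f_q, f_p: (B, D) query and positive embeddings.
    f_ns:     (B, N, D) N negative embeddings per query.
    """
    d2_qp = ((f_q - f_p) ** 2).sum(dim=1, keepdim=True)      # (B, 1)
    d2_qn = ((f_q.unsqueeze(1) - f_ns) ** 2).sum(dim=2)      # (B, N)
    logits = torch.cat([-d2_qp, -d2_qn], dim=1)              # (B, 1 + N)
    # -log Q(p|q): cross-entropy with the positive in slot 0.
    loss = torch.logsumexp(logits, dim=1) - logits[:, 0]
    return loss.mean()

# Treating the N negatives independently instead amounts to averaging the
# single-negative loss over each (q, p, n_i) triplet.
```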

7 Experiments

This section discusses the performance of the SARE objectives for training CNNs. We show that with SARE we improve performance on various standard place recognition and image retrieval datasets.

7.1 Implementation Details

Datasets.

Google Street View Time Machine datasets have been widely used in IBL [32, 3, 4]. They provide multiple street-level panoramic images taken at different times at close-by spatial locations on the map. The panoramic images are projected into multiple perspective images, yielding the training and testing datasets. Each image is associated with a GPS tag giving its approximate geographic location, which can be used to identify nearby images that do not necessarily depict the same landmark. We follow [3, 2] to identify the positive and negative images for each query image. For each query image, the positive image is the closest neighbor in the feature embedding space among images at nearby geo-positions, and the negatives are images that are geographically far away. This positive/negative mining method is very efficient, although some outliers may exist in the resulting positives and negatives. If accurate positives and negatives are needed, pairwise image matching with geometric verification [4] or SfM reconstruction [14] can be used, but these are time-consuming.
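A minimal sketch of this weakly supervised mining strategy is shown below, assuming precomputed embeddings and GPS tags in a local metric frame. The distance thresholds (10 m for potential positives, 25 m for definite negatives) follow the common Street View Time Machine protocol of [3] and, like the helper names, are assumptions here.

```python
import numpy as np

def mine_pairs(q_vec, q_gps, db_vecs, db_gps,
               pos_radius=10.0, neg_radius=25.0, n_neg=10):
    """Pick one positive and several hard negatives for a query.

    q_vec: (D,) query embedding; q_gps: (2,) query position in meters (local frame).
    db_vecs: (N, D) database embeddings; db_gps: (N, 2) database positions.
    Positive: the embedding-space nearest neighbor among geographically close
    images (weak supervision, so outliers are possible).
    Negatives: the embedding-space nearest images among geographically
    far-away ones (hard negatives).
    """
    geo_dist = np.linalg.norm(db_gps - q_gps, axis=1)
    feat_dist = np.linalg.norm(db_vecs - q_vec, axis=1)

    near = np.where(geo_dist <= pos_radius)[0]   # assumed non-empty for this query
    far = np.where(geo_dist > neg_radius)[0]

    positive = near[np.argmin(feat_dist[near])]               # best-matching nearby image
    hard_negatives = far[np.argsort(feat_dist[far])[:n_neg]]  # closest far-away images
    return positive, hard_negatives
```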

The Pitts30k-train dataset [3] is used to train the CNN, which has been shown to yield the best-performing models [3]. To test our method for IBL, the Pitts250k-test [3], TokyoTM-val [3], 24/7 Tokyo [32], and Sf-0 [33, 12] datasets are used. To show the generalization ability of our method for image retrieval, the Oxford 5k [34], Paris 6k [35], and Holidays [36] datasets are used. Details of these datasets are given in the Appendix.

CNN Architecture.

We use the widely-used compact feature extraction method NetVLAD [3, 13, 4, 12, 20] to demonstrate the effectiveness of our method. Our CNN architecture is given in Fig. 1.

Evaluation Metric.

For the place recognition datasets Pitts250k-test [3], TokyoTM-val [3], 24/7 Tokyo [32], and Sf-0 [33], we use the recall@$N$ measure to evaluate performance. Specifically, for Pitts250k-test [3], TokyoTM-val [3], and 24/7 Tokyo [32], a query image is deemed correctly localized if at least one of the top $N$ retrieved database images is within $d = 25$ meters of the ground-truth position of the query image. The percentage of correctly recognized queries (recall) is then plotted for different values of $N$. For the large-scale Sf-0 [33] dataset, a query image is deemed correctly localized if at least one of the top $N$ retrieved database images shares the same building IDs (manually labeled by [33]). For the image-retrieval datasets Oxford 5k [34], Paris 6k [35], and Holidays [36], the mean Average Precision (mAP) is reported.
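For concreteness, a sketch of the recall@$N$ computation described above follows, using the 25 m threshold and assuming positions expressed in a local metric frame; the function and variable names are ours.

```python
import numpy as np

def recall_at_n(retrieved_idx, query_gps, db_gps, n_values=(1, 5, 10), dist_thresh=25.0):
    """retrieved_idx: (Q, K) indices of the top-K database images per query,
    sorted by similarity. A query counts as correctly localized at N if any of
    its top-N retrievals lies within dist_thresh meters of the query position."""
    recalls = {}
    for n in n_values:
        correct = 0
        for q, idx in enumerate(retrieved_idx):
            top_n_gps = db_gps[idx[:n]]                          # (n, 2)
            d = np.linalg.norm(top_n_gps - query_gps[q], axis=1)
            correct += bool((d <= dist_thresh).any())
        recalls[n] = 100.0 * correct / len(retrieved_idx)
    return recalls
```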

Training Details.

We use the training methodology of [3] to compare the different objectives. For the state-of-the-art triplet ranking loss, the off-the-shelf implementation [3] is used. For the contrastive loss [14], the triplet images are partitioned into $(q,p)$ and $(q,n)$ pairs to calculate the loss (Eq. (2)) and gradients. For our method which treats multiple negatives independently (Our-Ind.), we first calculate the probability $Q(n|q)$ (Eq. (4)), which is then used to calculate the gradients (Table 1) with respect to the images; the gradients are back-propagated to train the CNN. For our method which jointly handles multiple negatives (Our-Joint), we use Eq. (5) to train the CNN. Our implementation (pre-trained model and code are available at https://github.com/Liumouliu/deepIBL) is based on MatConvNet [37]. Details are given in the Appendix.

Figure 6: Comparison of recalls for the different kernel-defined SARE objectives. From left to right and top to bottom: Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0.

7.2 Kernels for SARE

To assess the impact of the kernel used to fit the pairwise $\ell_2$ feature vector distances, we compare CNNs trained with the Gaussian, Cauchy, and Exponential kernel-defined SARE objectives, respectively. All hyper-parameters are identical across objectives, and the results are given in Fig. 6. The CNN trained with the Gaussian kernel-defined SARE generally outperforms the others.

We find that handling multiple negatives jointly (Gaussian-Joint) leads to better training and validation performances than handling multiple negatives independently (Gaussian-Ind.). However, when testing the trained CNNs on Pitts250k-test, TokyoTM-val, and 24/7 Tokyo datasets, the recall performances are similar. Gaussian-Ind. behaves surprisingly well on the large-scale Sf-0 dataset.

7.3 Comparison with state-of-the-art

We use the Gaussian kernel-defined SARE objectives to train CNNs and compare our method with the state-of-the-art NetVLAD [3] and NetVLAD with Contextual Feature Reweighting (CRN) [4]. The complete Recall@N performance for the different methods is given in Table 2.

Method | Pitts250k-test (r@1 / r@5 / r@10) | TokyoTM-val (r@1 / r@5 / r@10) | 24/7 Tokyo (r@1 / r@5 / r@10) | Sf-0 (r@1 / r@5 / r@10)
Our-Ind. | 88.97 / 95.50 / 96.79 | 94.49 / 96.73 / 97.30 | 79.68 / 86.67 / 90.48 | 80.60 / 86.70 / 89.01
Our-Joint | 88.43 / 95.06 / 96.58 | 94.71 / 96.87 / 97.51 | 80.63 / 87.30 / 90.79 | 77.75 / 85.07 / 87.52
CRN [4] | 85.50 / 93.50 / 95.50 | - / - / - | 75.20 / 83.80 / 87.30 | - / - / -
NetVLAD [3] | 85.95 / 93.20 / 95.13 | 93.85 / 96.77 / 97.59 | 73.33 / 82.86 / 86.03 | 75.58 / 83.31 / 85.21
Table 2: Comparison of recalls (%) on the Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0 datasets.

CNNs trained with the Gaussian-SARE objectives consistently outperform state-of-the-art CNNs by a large margin on almost all benchmarks. For example, on the challenging 24/7 Tokyo dataset, Our-Ind.-trained NetVLAD achieves a recall@1 of 79.68%, compared to the second-best 75.20% obtained by CRN [4], i.e., an improvement of 4.48 percentage points. On the large-scale and challenging Sf-0 dataset, Our-Ind.-trained NetVLAD achieves a recall@1 of 80.60%, compared to 75.58% obtained by NetVLAD [3], i.e., an improvement of 5.02 percentage points. Note that we do not use the Contextual Reweighting layer to capture the "context" within images, which has been shown to be more effective than the original NetVLAD structure [4]. Similar improvements are observed on the other datasets. This confirms the central premise of this work: formulating the IBL problem in a competitive learning framework and using SARE to supervise the CNN training process learns discriminative yet compact image representations for IBL. We give a visualization of 2D feature embeddings of query images from the 24/7 Tokyo and Sf-0 datasets in the Appendix. Images taken from the same place are mostly embedded at nearby 2D positions despite significant variations in viewpoint, pose, and configuration.

7.4 Qualitative Evaluation

To visualize the areas of the input image which are most important for localization, we adopt [38] to obtain a heat map showing the importance of different regions of the input image. The results are given in Fig. 7. As can be seen, our method focuses on regions that are useful for image geo-localization and emphasizes the distinctive details of buildings. In contrast, NetVLAD [3] emphasizes local features rather than the overall building style.

Figure 7: Example retrieval results on the Sf-0 benchmark dataset. From left to right: (a) query image, (b) heat map of Our-Ind., (c) heat map of NetVLAD [3], (d) top retrieved image using our method, (e) top retrieved image using NetVLAD. Green and red borders indicate correct and incorrect retrievals, respectively. (Best viewed in color on screen)

7.5 Generalization on Image Retrieval Datasets

To show the generalization ability of our method, we compare the compact image representations trained by the different methods on standard image retrieval benchmarks (Oxford 5k [34], Paris 6k [35], and Holidays [36]) without any fine-tuning. The results are given in Table 3 (the complete set of results for different output dimensions is given in the Appendix). Comparing the CNNs trained by our methods with the off-the-shelf NetVLAD [3] and CRN [4], in most cases the mAP of our methods is higher. Since our CNNs are trained on a city-scale, building-oriented dataset from urban areas, they lack the ability to model non-building content (e.g., water, boats, cars), resulting in a performance drop relative to the city-scale, building-oriented benchmarks. Training the CNN on images similar to those encountered at test time can increase retrieval performance [39]; however, our purpose here is to demonstrate the generalization ability of SARE-trained CNNs, which these results verify.

We also fine-tuned on the training dataset from the Google Landmark Retrieval Challenge [40] without tuning any hyper-parameters of our method. Our entry (SevenSpace) finished "in the money" (i.e., in the top 3).

Method | Oxford 5k (full / crop) | Paris 6k (full / crop) | Holidays
Our-Ind. | 71.66 / 75.51 | 82.03 / 81.07 | 80.71
Our-Joint | 70.26 / 73.33 | 81.32 / 81.39 | 84.33
NetVLAD [3] | 69.09 / 71.62 | 78.53 / 79.67 | 83.00
CRN [4] | 69.20 / - | - / - | -
Table 3: Retrieval performance of CNNs on image retrieval benchmarks. No spatial re-ranking or query expansion is performed. Accuracy is measured by mean Average Precision (mAP).
Figure 8: Comparison of recalls for different deep metric learning objectives. From left to right and top to bottom: Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0.

7.6 Comparison with Metric-learning Methods

Although deep metric-learning methods have shown their effectiveness in classification and fine-grained recognition tasks, their effectiveness on the IBL task is unknown. As another contribution of this paper, we evaluate five current state-of-the-art deep metric-learning methods on IBL and compare our method against: (1) the contrastive loss used by [14]; (2) lifted structured embedding [41]; (3) N-pair loss [42]; (4) N-pair angular loss [43]; and (5) the geo-classification loss [2]. Implementation details are given in the Appendix.

Fig. 8 shows the quantitative comparison between our method and the other deep metric learning methods. The complete Recall@N performance for the different methods is given in the Appendix. Our method outperforms the contrastive loss [14] and the geo-classification loss [2], while remaining comparable with the other state-of-the-art metric-learning methods.

8 Conclusion

This paper has addressed the problem of learning discriminative image representations specifically tailored for the task of Image-Based Localization (IBL). We have proposed a new Stochastic Attraction and Repulsion Embedding (SARE) objective for this task. SARE directly enforces the "attraction" and "repulsion" constraints on intra-place and inter-place feature embeddings, respectively, formulated as a similarity-based binary classification task. Experiments show that SARE improves IBL performance, outperforming other state-of-the-art methods.

References

  • [1] Weyand, T., Kostrikov, I., Philbin, J.: PlaNet - photo geolocation with convolutional neural networks. In: European Conference on Computer Vision, Springer (2016) 37–55

  • [2] Vo, N., Jacobs, N., Hays, J.: Revisiting IM2GPS in the deep learning era. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [3] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307

  • [4] Kim, H.J., Dunn, E., Frahm, J.M.: Learned contextual feature reweighting for image geo-localization. In: CVPR. (2017)
  • [5] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [6] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5) (2015) 1147–1163
  • [7] Middelberg, S., Sattler, T., Untzelmann, O., Kobbelt, L.: Scalable 6-dof localization on mobile devices. In: European conference on computer vision, Springer (2014) 268–283
  • [8] Ventura, J., Arth, C., Reitmayr, G., Schmalstieg, D.: Global localization from monocular slam on a mobile phone. IEEE transactions on visualization and computer graphics 20(4) (2014) 531–539
  • [9] Pan, L., Dai, Y., Liu, M., Porikli, F.: Simultaneous stereo video deblurring and scene flow estimation. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 6987–6996
  • [10] Pan, L., Dai, Y., Liu, M., Porikli, F.: Depth map completion by jointly exploiting blurry color images and sparse depth maps. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). (March 2018) 1377–1386
  • [11] Liu, L., Li, H., Dai, Y., Pan, Q.: Robust and efficient relative pose with a multi-camera system for autonomous driving in highly dynamic environments. IEEE Transactions on Intelligent Transportation Systems 19(8) (Aug 2018) 2432–2444
  • [12] Sattler, T., Torii, A., Sivic, J., Pollefeys, M., Taira, H., Okutomi, M., Pajdla, T.: Are large-scale 3d models really necessary for accurate visual localization? In: CVPR 2017-IEEE Conference on Computer Vision and Pattern Recognition. (2017)
  • [13] Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [14] Radenović, F., Tolias, G., Chum, O.: Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In: European Conference on Computer Vision, Springer (2016) 3–20
  • [15] Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2d-to-3d matching. In: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE (2011) 667–674
  • [16] Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence 39(9) (2017) 1744–1756
  • [17] Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: European conference on computer vision, Springer (2010) 791–804
  • [18] Liu, L., Li, H., Dai, Y.: Efficient global 2d-3d matching for camera localization in a large-scale 3d map. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [19] Toft, C., Stenborg, E., Hammarstrand, L., Brynte, L., Pollefeys, M., Sattler, T., Kahl, F.: Semantic match consistency for long-term visual localization. ECCV (2018)
  • [20] Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J., et al.: Benchmarking 6dof outdoor visual localization in changing conditions. In: Proc. CVPR. Volume 1. (2018)
  • [21] Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: European Conference on Computer Vision, Springer (2016) 241–257
  • [22] Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124(2) (2017) 237–254
  • [23] Wu, Y.: Image based camera localization: an overview. CoRR abs/1610.03660 (2016)
  • [24] Razavian, A., Sullivan, J., Maki, A., Carlsson, S.: A baseline for visual instance retrieval with deep convolutional networks. arXiv preprint (2014)
  • [25] Seo, P.H., Weyand, T., Sim, J., Han, B.: Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. ECCV (2018)
  • [26] Kohonen, T.: The self-organizing map. Neurocomputing 21(1) (1998) 1–6
  • [27] Muñoz-Perez, J., Gómez-Ruiz, J.A., López-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural processing letters 15(3) (2002) 261–273
  • [28] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99) (2017) 1–1
  • [29] Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov) (2008) 2579–2605
  • [30] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
  • [31] Manmatha, R., Wu, C.Y., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 2859–2867
  • [32] Torii, A., Arandjelovic, R., Sivic, J., Okutomi, M., Pajdla, T.: 24/7 place recognition by view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1808–1817
  • [33] Chen, D.M., Baatz, G., Köser, K., Tsai, S.S., Vedantham, R., Pylvänäinen, T., Roimela, K., Chen, X., Bach, J., Pollefeys, M., et al.: City-scale landmark identification on mobile devices. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE (2011) 737–744
  • [34] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
  • [35] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE (2008) 1–8
  • [36] Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. Computer Vision–ECCV 2008 (2008) 304–317
  • [37] Vedaldi, A., Lenc, K.: Matconvnet – convolutional neural networks for matlab. In: Proceeding of the ACM Int. Conf. on Multimedia. (2015)
  • [38] Grün, F., Rupprecht, C., Navab, N., Tombari, F.: A taxonomy and library for visualizing learned features in convolutional neural networks. In: ICML Visualization for Deep Learning Workshop. (2016)
  • [39] Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European conference on computer vision, Springer (2014) 584–599
  • [40] Google Landmark Retrieval Challenge. https://www.kaggle.com/c/landmark-retrieval-challenge/leaderboard
  • [41] Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4004–4012
  • [42] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems. (2016) 1857–1865
  • [43] Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)

Appendix

In this Appendix, we describe the loss functions and gradients for jointly handling multiple negative images (Sec. 1), provide implementation details (Sec. 2), and include additional experimental results (Sec. 3).

1. Handling Multiple Negatives

Given a query image $q$, a positive image $p$, and multiple negative images $\{n_i\}_{i=1}^{N}$, the Kullback-Leibler divergence loss over multiple negatives is given by:

$L_{\mathrm{SARE}} = -\log Q(p|q). \qquad (7)$

For the Gaussian kernel SARE, $Q(p|q)$ is defined as:

$Q(p|q) = \dfrac{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big)}{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big) + \sum_{i=1}^{N}\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2\big)}, \qquad (8)$

where $f_{\theta}(q)$, $f_{\theta}(p)$, and $f_{\theta}(n_i)$ are the feature embeddings of the query, positive, and negative images, respectively.

Substituting Eq. (8) into Eq. (7) gives:

$L_{\mathrm{SARE}} = \|f_{\theta}(q) - f_{\theta}(p)\|_2^2 + \log\Big(\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big) + \sum_{i=1}^{N}\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2\big)\Big). \qquad (9)$

Denoting $Q(n_i|q) = \dfrac{\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2\big)}{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big) + \sum_{j=1}^{N}\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_j)\|_2^2\big)}$, the gradients of Eq. (9) with respect to the query, positive, and negative images are given by:

$\dfrac{\partial L}{\partial f_{\theta}(q)} = 2\big(1 - Q(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big) - 2\sum_{i=1}^{N} Q(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big), \qquad (10)$
$\dfrac{\partial L}{\partial f_{\theta}(p)} = -2\big(1 - Q(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big), \qquad (11)$
$\dfrac{\partial L}{\partial f_{\theta}(n_i)} = 2\,Q(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big). \qquad (12)$

Similarly, for the Cauchy kernel, the loss function is given by:

$L_{\mathrm{SARE}} = -\log Q_C(p|q), \quad Q_C(p|q) = \dfrac{\big(1 + \|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big)^{-1}}{\big(1 + \|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big)^{-1} + \sum_{i=1}^{N}\big(1 + \|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2\big)^{-1}}. \qquad (13)$

Denoting $Q_C(n_i|q) = \dfrac{\big(1 + \|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2\big)^{-1}}{\big(1 + \|f_{\theta}(q) - f_{\theta}(p)\|_2^2\big)^{-1} + \sum_{j=1}^{N}\big(1 + \|f_{\theta}(q) - f_{\theta}(n_j)\|_2^2\big)^{-1}}$, the gradients of Eq. (13) with respect to the query, positive, and negative images are given by:

$\dfrac{\partial L}{\partial f_{\theta}(q)} = \dfrac{2\big(1 - Q_C(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big)}{1 + \|f_{\theta}(q) - f_{\theta}(p)\|_2^2} - \sum_{i=1}^{N}\dfrac{2\,Q_C(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big)}{1 + \|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2}, \qquad (14)$
$\dfrac{\partial L}{\partial f_{\theta}(p)} = -\dfrac{2\big(1 - Q_C(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big)}{1 + \|f_{\theta}(q) - f_{\theta}(p)\|_2^2}, \qquad (15)$
$\dfrac{\partial L}{\partial f_{\theta}(n_i)} = \dfrac{2\,Q_C(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big)}{1 + \|f_{\theta}(q) - f_{\theta}(n_i)\|_2^2}. \qquad (16)$

For the Exponential kernel, the loss function is given by:

$L_{\mathrm{SARE}} = -\log Q_E(p|q), \quad Q_E(p|q) = \dfrac{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2\big)}{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2\big) + \sum_{i=1}^{N}\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_i)\|_2\big)}. \qquad (17)$

Denoting $Q_E(n_i|q) = \dfrac{\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_i)\|_2\big)}{\exp\big(-\|f_{\theta}(q) - f_{\theta}(p)\|_2\big) + \sum_{j=1}^{N}\exp\big(-\|f_{\theta}(q) - f_{\theta}(n_j)\|_2\big)}$, the gradients of Eq. (17) with respect to the query, positive, and negative images are given by:

$\dfrac{\partial L}{\partial f_{\theta}(q)} = \dfrac{\big(1 - Q_E(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big)}{\|f_{\theta}(q) - f_{\theta}(p)\|_2} - \sum_{i=1}^{N}\dfrac{Q_E(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big)}{\|f_{\theta}(q) - f_{\theta}(n_i)\|_2}, \qquad (18)$
$\dfrac{\partial L}{\partial f_{\theta}(p)} = -\dfrac{\big(1 - Q_E(p|q)\big)\big(f_{\theta}(q) - f_{\theta}(p)\big)}{\|f_{\theta}(q) - f_{\theta}(p)\|_2}, \qquad (19)$
$\dfrac{\partial L}{\partial f_{\theta}(n_i)} = \dfrac{Q_E(n_i|q)\big(f_{\theta}(q) - f_{\theta}(n_i)\big)}{\|f_{\theta}(q) - f_{\theta}(n_i)\|_2}. \qquad (20)$

The gradients are back propagated to train the CNN.

2. Implementation Details

We exactly follow the training method of [3], without fine-tuning any hyper-parameters. The VGG-16 [5] network is cropped at the last convolutional layer (conv5), before ReLU. The learning rates for the Pitts30k-train and Pitts250k-train datasets are set to 0.001 and 0.0001, respectively, and are halved every 5 epochs; we use momentum 0.9, weight decay 0.001, and a batch size of 4 tuples. Each tuple consists of one query image, one positive image, and ten negative images. The CNN is trained for at most 30 epochs, but convergence usually occurs much faster (typically in fewer than 5 epochs). The network which yields the best recall@5 on the validation set is used for testing.
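This schedule can be expressed as follows in PyTorch-style code (the original implementation is in MatConvNet, so the optimizer/scheduler choices and the stand-in model and data loader are assumptions; the hyper-parameter values are those quoted above).

```python
import torch

# Stand-in for the VGG16+NetVLAD network so the schedule itself is runnable.
model = torch.nn.Linear(128, 32)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,   # 0.0001 for Pitts250k-train
                            momentum=0.9, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs

for epoch in range(30):                                      # at most 30 epochs
    for _ in range(10):                                      # stand-in for the tuple loader
        # Each tuple: 1 query, 1 positive, 10 negatives; batch of 4 tuples.
        x = torch.randn(4, 12, 128)
        f = torch.nn.functional.normalize(model(x), dim=-1)
        f_q, f_p, f_ns = f[:, 0], f[:, 1], f[:, 2:]
        loss = sare_gaussian_joint(f_q, f_p, f_ns)           # defined in the earlier sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
# The checkpoint with the best recall@5 on the validation set is kept for testing.
```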

Triplet Ranking Loss

For the triplet ranking loss [3], we set the margin $m = 0.1$, and only triplet images producing a non-zero loss are used in gradient computation, the same as [3].

Contrastive Loss

For the contrastive loss [14], we set the margin $\tau = 0.7$, and only negative images producing a non-zero loss are used in gradient computation. Note that positive images are always used in training since they are never pruned out.

Geographic Classification Loss

For the geographic classification method [2], we use the Pitts250k-train dataset for training. We first partition the 2D geographic space into square cells, with the cell size chosen to match the evaluation metric (25 m) for compatibility, so that correctly classified images are also correctly localized images according to our evaluation metric. We remove geo-classes that do not contain any images. We then append a randomly initialized fully connected layer and a softmax-log-loss layer after the NetVLAD pooling layer to predict which class an image belongs to.
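For concreteness, a sketch of the geo-class assignment used by this classification baseline is shown below, assuming GPS tags already projected to a local metric frame; the 25 m cell size mirrors the evaluation threshold, and the helper names are ours.

```python
import numpy as np

def assign_geo_classes(db_xy, cell_size=25.0):
    """Map each image's metric position (N, 2) to a dense geo-class index.

    Images falling in the same cell_size x cell_size square share a class;
    empty cells simply never appear, so remapping to dense ids drops them.
    """
    cells = np.floor(db_xy / cell_size).astype(np.int64)        # (N, 2) integer cell coords
    _, class_ids = np.unique(cells, axis=0, return_inverse=True)
    return class_ids, class_ids.max() + 1                       # labels and number of classes

# A fully connected layer with num_classes outputs and a softmax log-loss is
# then appended after the NetVLAD pooling layer and trained with these labels.
```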

SARE Loss

For our methods, Our-Ind. treats multiple negative images independently while Our-Joint handles them jointly. The two methods differ only in the loss function and gradient computation; for each method, the corresponding gradients are back-propagated to train the CNN.

Triplet Angular Loss

For the triplet angular loss [43], we use the N-pair loss function (Eq. (8) in their paper) with the angle hyper-parameter set to the value that achieves the best performance on the Stanford Car dataset.

N-pair Loss

For the N-pair loss [42], we use the N-pair loss function (Eq. (3) in their paper).

Lifted Structured Loss

For the lifted structured loss [41], we use the smooth loss function (Eq. (4) in their paper). Note that training images producing a zero loss are pruned out.

3. Additional Results

Dataset.

Table 4 gives the details of datasets used in our experiments.

Visualization of feature embeddings.

Fig. 9 and Fig. 10 visualize the feature embeddings of the 24/7 Tokyo-query and Sf-0-query datasets computed by our method (Our-Ind.) in 2D using t-SNE [29]. Images are displayed exactly at their embedded locations. Note that images taken from the same place are mostly embedded at nearby 2D positions although they differ in lighting and perspective.

Dataset #database images #query images
Pitts250k-train 91,464 7,824
Pitts250k-val 78,648 7,608
Pitts250k-test 83,952 8,280
Pitts30k-train 10,000 7,416
Pitts30k-val 10,000 7,608
Pitts30k-test 10,000 6,816
TokyoTM-val 49,056 7,186
Tokyo 24/7 (-test) 75,984 315
Sf-0 610,773 803
Oxford 5k 5063 55
Paris 6k 6412 220
Holidays 991 500
Table 4: Datasets used in experiments. The Pitts250k-train dataset is only used to train the Geographic classification CNN [2]. For all the other CNNs, Pitts30k-train dataset is used to enable fast training.
Figure 9: Visualization of feature embedding computed by our method ( Our-Ind. ) using t-SNE [29] on the 24/7 Tokyo-query dataset. (Best viewed in color on screen)
Figure 10: Visualization of feature embedding computed by our method ( Our-Ind. ) using t-SNE [29] on the Sf-0-query dataset. (Best viewed in color on screen)

Image retrieval for varying dimensions.

Table 5 compares image retrieval performance for different output dimensions.

Metric learning methods

Table 6 gives the complete Recall@N performance for the different methods. Our method outperforms the contrastive loss [14] and the geo-classification loss [2], while remaining comparable with the other state-of-the-art metric learning methods.

Method | Dim. | Oxford 5k (full / crop) | Paris 6k (full / crop) | Holidays
Our-Ind. | 4096 | 71.66 / 75.51 | 82.03 / 81.07 | 80.71
Our-Joint | 4096 | 70.26 / 73.33 | 81.32 / 81.39 | 84.33
NetVLAD [3] | 4096 | 69.09 / 71.62 | 78.53 / 79.67 | 83.00
CRN [4] | 4096 | 69.20 / - | - / - | -
Our-Ind. | 2048 | 71.11 / 73.93 | 80.90 / 79.91 | 79.09
Our-Joint | 2048 | 69.82 / 72.37 | 80.48 / 80.49 | 83.17
NetVLAD [3] | 2048 | 67.70 / 70.84 | 77.01 / 78.29 | 82.80
CRN [4] | 2048 | 68.30 / - | - / - | -
Our-Ind. | 1024 | 70.31 / 72.20 | 79.29 / 78.54 | 78.76
Our-Joint | 1024 | 68.46 / 70.72 | 78.49 / 78.47 | 83.15
NetVLAD [3] | 1024 | 66.89 / 69.15 | 75.73 / 76.50 | 82.06
CRN [4] | 1024 | 66.70 / - | - / - | -
Our-Ind. | 512 | 68.96 / 70.59 | 77.36 / 76.44 | 77.65
Our-Joint | 512 | 67.17 / 69.19 | 76.80 / 77.20 | 81.83
NetVLAD [3] | 512 | 65.56 / 67.56 | 73.44 / 74.91 | 81.43
CRN [4] | 512 | 64.50 / - | - / - | -
Our-Ind. | 256 | 65.85 / 67.46 | 75.61 / 74.82 | 76.27
Our-Joint | 256 | 65.30 / 67.51 | 74.50 / 75.32 | 80.57
NetVLAD [3] | 256 | 62.49 / 63.53 | 72.04 / 73.47 | 80.30
CRN [4] | 256 | 64.20 / - | - / - | -
Our-Ind. | 128 | 63.75 / 64.71 | 71.60 / 71.23 | 73.57
Our-Joint | 128 | 62.92 / 63.63 | 69.53 / 70.24 | 77.81
NetVLAD [3] | 128 | 60.43 / 61.40 | 68.74 / 69.49 | 78.65
CRN [4] | 128 | 61.50 / - | - / - | -
Table 5: Retrieval performance of CNNs on image retrieval benchmarks for different output dimensions. No spatial re-ranking or query expansion is performed. Accuracy is measured by mean Average Precision (mAP).
Method | Pitts250k-test (r@1 / r@5 / r@10) | TokyoTM-val (r@1 / r@5 / r@10) | 24/7 Tokyo (r@1 / r@5 / r@10) | Sf-0 (r@1 / r@5 / r@10)
Our-Ind. | 88.97 / 95.50 / 96.79 | 94.49 / 96.73 / 97.30 | 79.68 / 86.67 / 90.48 | 80.60 / 86.70 / 89.01
Our-Joint | 88.43 / 95.06 / 96.58 | 94.71 / 96.87 / 97.51 | 80.63 / 87.30 / 90.79 | 77.75 / 85.07 / 87.52
Contrastive [14] | 86.33 / 94.09 / 95.88 | 93.39 / 96.09 / 96.98 | 75.87 / 86.35 / 88.89 | 74.63 / 82.23 / 84.53
N-pair [42] | 87.56 / 94.57 / 96.21 | 94.42 / 96.73 / 97.41 | 80.00 / 89.52 / 91.11 | 76.66 / 83.85 / 87.11
Angular [43] | 88.60 / 94.86 / 96.44 | 94.84 / 96.83 / 97.45 | 80.95 / 87.62 / 90.16 | 79.51 / 86.57 / 88.06
Liftstruct [41] | 87.40 / 94.52 / 96.28 | 94.48 / 96.90 / 97.47 | 77.14 / 86.03 / 89.21 | 78.15 / 84.67 / 87.11
Geo-Classification [2] | 83.19 / 92.67 / 94.59 | 93.54 / 96.80 / 97.50 | 71.43 / 82.22 / 85.71 | 67.84 / 78.15 / 81.41
Table 6: Comparison of recalls (%) on the Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0 datasets for different deep metric-learning objectives.