The task of Image-Based Localization (IBL) is to estimate the geographic location at which a query image was taken by comparing it against geo-tagged images from a city-scale image database (i.e., a map). IBL has recently attracted considerable attention due to its widespread potential applications, such as robot navigation, VR/AR [7, 8, 9, 10], and autonomous driving. Depending on whether or not 3D point clouds are used in the map, existing IBL methods can be roughly classified into two groups: image-retrieval based methods [3, 4, 12, 13, 2, 14] and direct 2D-3D matching based methods [15, 16, 17, 18, 19].
This paper belongs to the image-retrieval group, for its effectiveness at large scale and robustness to changing conditions. For image-retrieval based methods, the main challenge is how to represent images discriminatively, so that images depicting the same landmarks have similar representations while those depicting different landmarks have dissimilar ones. The challenge is underpinned by the typically large-scale image database, in which many images may contain repetitive structures and similar landmarks, causing severe ambiguities.
Convolutional Neural Networks (CNNs) have demonstrated great success on the IBL task [3, 4, 13, 21, 22, 14]. Typically, CNNs trained for the image classification task are fine-tuned for IBL. As far as we know, all the state-of-the-art IBL methods focus on how to effectively aggregate a CNN feature map into a discriminative image representation, but have overlooked another important aspect that can potentially boost IBL performance markedly: how to effectively organize the aggregated image representations. So far, all the state-of-the-art IBL methods use triplet or contrastive embedding to supervise the representation organization process.
This paper fills this gap by proposing a new method to effectively organize the image representations (embeddings). We first define a “place” as a set of images depicting landmarks at the same location, and then directly enforce intra-place image similarity and inter-place dissimilarity in the embedding space. Our goal is to cluster learned embeddings from the same place while separating embeddings from different places. Intuitively, we organize image representations using places as agents.
The above idea would directly lead to a multi-class classification problem if we could label a “place” tag for each image. Apart from the time-consuming labeling process, this formulation would also result in too many pre-defined classes, and a large training image set would be needed to train the classification CNN. Recently proposed methods [2, 1] attempt to solve the multi-class classification problem using large GPS-tagged training datasets. In their setting, a class is defined as the set of images captured from nearby geographic positions, disregarding visual appearance. Since images within the same class do not necessarily depict the same landmarks, the CNN may only learn high-level information for each geographic position, which is inadequate for accurate localization.
Can we capture the intra-place image “attraction” and inter-place image “repulsion” relationships with limited data, while still enabling competitive embedding? To capture the “attraction” and “repulsion” relationships, we formulate the IBL task as image similarity-based binary classification in the feature embedding space. Specifically, the similarity for images in the same place is defined as 1, and 0 otherwise. This binary partition of similarity captures the intra-place “attraction” and inter-place “repulsion”. To tackle the limited-data issue, we train the CNN with image triplets, each consisting of a query image, a positive image (from the same place as the query), and a negative image (from a different place). Note that a triplet is the minimal set needed to define the intra-place “attraction” and inter-place “repulsion”.
Our CNN architecture is given in Fig. 1. We name our metric-learning objective Stochastic Attraction and Repulsion Embedding (SARE), since it captures pairwise image relationships under a probabilistic framework. Moreover, the SARE objective can be easily extended to handle multiple negative images coming from different places, i.e., enabling each place to compete with multiple other places. In experiments, we demonstrate that SARE improves performance on various IBL benchmarks. Validation on standard image retrieval benchmarks further confirms the generalization ability of our method.
2 Related Work
There is a rich family of work on IBL. We briefly review CNN-based image representation learning methods; please refer to  for an overview.
While there have been many works [24, 21, 22, 14, 3, 13, 4, 12] on designing effective CNN feature map aggregation methods for IBL, almost all of them exclusively use the triplet or contrastive embedding objective to supervise CNN training. Both objectives, in spirit, pull matchable image pairs closer while pushing non-matching pairs apart. While they are effective, we show later that our SARE objective outperforms them on the IBL task. Three interesting exceptions that do not use a triplet or contrastive embedding objective are PlaNet , IM2GPS-CNN , and CPlaNet . They formulate IBL as a geographic-position classification task: they first partition the 2D geographic space into cells using GPS tags and then define one class per cell. The CNN training process is supervised by a cross-entropy classification loss that penalizes incorrectly classified images. We also show that our SARE objective outperforms the multi-class classification objective on the IBL task.
Although our SARE objective is formulated in a competitive learning framework, it differs from traditional competitive learning methods such as Self-Organizing Maps  and Vector Quantization . Both are devoted to learning cluster centers that separate the original vectors; no constraints are imposed on the original vectors themselves. Under our formulation, we directly impose the “attraction-repulsion” relationship on the original vectors to supervise the CNN learning process.
3 Problem Definition and Method Overview
Given a large geo-tagged image database, the IBL task is to estimate the geographic position of a query image q. An image-retrieval based method first identifies the most visually similar image in the database for q, and then uses the location of that database image as the location of q. If the identified most similar image comes from the same place as q, we deem q successfully localized, and the most similar image is a positive image, denoted p. If the identified most similar image comes from a different place than q, we have falsely localized q, and the most similar image is a negative image, denoted n.
Mathematically, an image-retrieval based method proceeds as follows. First, the query image and the database images are converted to compact representations (vectors). This step is called image feature embedding and is performed by a CNN: for example, the query image q is converted to a fixed-size vector f(q; θ), where f is a CNN and θ denotes its weights. Second, we define a similarity function S(·, ·) on pairs of vectors; for example, S takes the vectors f(q; θ) and f(p; θ) and outputs a scalar describing the similarity between q and p. Since we compare the query against the entire large database to find its most similar image, S should be simple and efficient to compute, to enable fast nearest neighbor search. A typical choice for S is the L2-metric distance, or a function that monotonically increases/decreases with the L2 distance.
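The retrieval step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function names, the toy GPS tags, and the use of L2-normalized vectors are assumptions for the example.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Unit-normalize so that L2 distance is a monotone function of cosine similarity.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def localize(query_vec, db_vecs, db_gps):
    # Return the GPS tag of the database image nearest to the query in L2 distance.
    q = l2_normalize(query_vec)
    db = l2_normalize(db_vecs)
    dists = np.linalg.norm(db - q, axis=1)  # L2 distance to every database image
    best = int(np.argmin(dists))
    return db_gps[best], dists[best]

# Toy example: 3 "database images" (random vectors standing in for CNN embeddings)
# with hypothetical GPS tags; the query is a slightly perturbed copy of image 1.
rng = np.random.default_rng(0)
db = rng.normal(size=(3, 8))
gps = [(40.44, -79.99), (35.68, 139.69), (48.86, 2.35)]
tag, d = localize(db[1] + 0.01 * rng.normal(size=8), db, gps)
```

Because the query is a small perturbation of database image 1, the retrieved tag is that image's GPS tag, i.e., the query inherits the location of its nearest database neighbor.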
Relying on feature vectors extracted by an untrained CNN to perform nearest neighbor search would often return a negative image for q. Thus, we need to train the CNN using easily obtained geo-tagged training images (Sec. 7.1). The training process in general defines a loss function on CNN-extracted feature vectors and uses it to update the CNN weights θ. The state-of-the-art triplet ranking loss (Sec. 4.1) takes triplet training images (q, p, n) and imposes that q is more similar to p than to n. The contrastive loss (Sec. 4.2) instead tries to separate the (q, n) pair by a pre-defined distance margin (see Fig. 2). While these two losses are effective, we construct our metric embedding objective in a substantially different way.
Given triplet training images (q, p, n), we have the prior knowledge that the pair (q, p) is matchable and the pair (q, n) is non-matchable. This simple matchability prior actually defines a probability distribution: for the (q, p) pair, the matchability is defined as 1; for the (q, n) pair, it is defined as 0. Can we respect this matchability prior in the feature embedding space? Our answer is yes. To do so, we directly fit a kernel on the L2 distances of the (q, p) and (q, n) pairs to obtain a probability distribution. Our metric-learning objective is to minimize the Kullback-Leibler divergence between these two probability distributions (Sec. 4.3).
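The idea can be sketched numerically. Because the matchability prior puts all of its mass on the positive image, the KL divergence reduces to a negative log-probability. The following is a minimal sketch assuming a Gaussian kernel on squared L2 distances (one of the kernels discussed in Sec. 4.3); it is illustrative, not the paper's implementation.

```python
import numpy as np

def sare_loss(fq, fp, fn):
    # Gaussian-kernel SARE for a single triplet (q, p, n).
    # The matchability prior P is one-hot on p, so KL(P || Q) = -log q(p|q).
    d_qp2 = np.sum((fq - fp) ** 2)
    d_qn2 = np.sum((fq - fn) ** 2)
    # q(p|q): softmax over negative half squared distances.
    a, b = np.exp(-0.5 * d_qp2), np.exp(-0.5 * d_qn2)
    return -np.log(a / (a + b))

# When p and n are equidistant from q, q(p|q) = 0.5 and the loss is log 2.
loss_equal = sare_loss(np.zeros(2), np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Moving the positive closer to the query (or the negative farther away) monotonically decreases this loss, which is exactly the attraction-repulsion behavior described above.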
What is the benefit of respecting the matchability prior in the feature embedding space? Conceptually, in this way we capture the intra-place (defined by the (q, p) pair) “attraction” and inter-place (defined by the (q, n) pair) “repulsion” relationships in the embedding space. Potentially, these attraction and repulsion relationships balance the embedded positions of the entire image database well. Mathematically, we use the gradients of the resulting metric-learning objective with respect to the triplet images to characterize its behavior, and find that our objective adaptively adjusts the force (gradient) that pulls the (q, p) pair together while pushing the (q, n) pair apart (Sec. 5).
4 Deep Metric Embedding Objectives in IBL
In this section, we first review the two widely-used deep metric embedding objectives in IBL, triplet and contrastive embedding, which are realized by minimizing the triplet ranking loss and the contrastive loss, respectively. We then present our own objective, Stochastic Attraction and Repulsion Embedding (SARE).
4.1 Triplet Ranking Loss

The triplet ranking loss takes triplet images (q, p, n) and imposes that q is closer to p than to n by a margin m:

L_triplet = max(0, d²(q, p) + m − d²(q, n))    (1)

where d(·, ·) is the L2 distance between embedded feature vectors and m is an empirical margin.
4.2 Contrastive Loss
The contrastive loss imposes a constraint on an image pair (q, x) by:

L_contrastive = y · d²(q, x) + (1 − y) · max(0, τ − d(q, x))²    (2)

where y = 1 for a (q, p) pair and y = 0 for a (q, n) pair. τ is an empirical margin used to prune out negative images with d(q, n) ≥ τ.
Intuitions for the above two losses are compared in Fig. 2.
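For concreteness, the two losses can be sketched as follows. This is a minimal NumPy sketch; the margin values m and τ used here are illustrative defaults, not the values used in the paper.

```python
import numpy as np

def triplet_ranking_loss(fq, fp, fn, m=0.1):
    # Hinge on squared distances: the positive must beat the negative by margin m.
    d_qp2 = np.sum((fq - fp) ** 2)
    d_qn2 = np.sum((fq - fn) ** 2)
    return max(0.0, d_qp2 + m - d_qn2)

def contrastive_loss(f1, f2, y, tau=0.7):
    # y = 1 for a (q, p) pair, y = 0 for a (q, n) pair; tau is the margin.
    d = np.linalg.norm(f1 - f2)
    return y * d**2 + (1 - y) * max(0.0, tau - d)**2
```

Both losses saturate at zero: the triplet loss once the negative is a margin farther than the positive, the contrastive loss once a negative pair is separated by more than τ. This saturation is the pruning behavior discussed in Sec. 5.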
4.3 SARE-Stochastic Attraction and Repulsion Embedding
In this subsection, we present our Stochastic Attraction and Repulsion Embedding (SARE) objective, which is optimized to learn discriminative embeddings for each “place”. A triplet of images defines two places, one by the (q, p) pair and the other by n. The intra-place and inter-place similarities are defined in a probabilistic framework.
Given a query image q, the probability that q picks p as its match is the conditional probability p(p|q), which equals 1 based on the co-visibility (matchability) prior. The conditional probability p(n|q) equals 0 by the same definition. Since we are only interested in modeling pairwise similarities, we set p(q|q) = 0. Note that the triplet probabilities then define a valid probability distribution (summing to 1).
In the feature embedding space, we would like CNN-extracted feature vectors to respect the above probability distribution. We therefore define another probability distribution q(·|q) in the embedding space and minimize the mismatch between the two distributions. The Kullback-Leibler divergence is employed to measure this mismatch; since p(·|q) puts all of its mass on p, the resulting cross-entropy loss is:

L_SARE = KL( p(·|q) ‖ q(·|q) ) = −log q(p|q)    (3)
In order to define the probability q(p|q) that q picks p as its match in the feature embedding space, we fit a kernel on pairwise L2 feature vector distances. We compare three typically-used kernels: Gaussian, Cauchy, and Exponential. In the following paragraphs, we use the Gaussian kernel to demonstrate our method; loss functions defined using the Cauchy and Exponential kernels are given in the Appendix.
For the Gaussian kernel, the probability that q picks p as its match in the feature embedding space is given by:

q(p|q) = exp(−‖f(q) − f(p)‖²/2) / ( exp(−‖f(q) − f(p)‖²/2) + exp(−‖f(q) − f(n)‖²/2) )    (4)

If the embedded feature vectors f(q) and f(p) are sufficiently near, and f(q) and f(n) are sufficiently far under the L2 metric, the conditional probability distributions p(·|q) and q(·|q) will coincide. Thus, our SARE objective aims to find an embedding function that pulls the (q, p) distance toward zero while pushing the (q, n) distance toward infinity.
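The three candidate kernels differ only in how a distance is converted into an unnormalized similarity before the softmax-style normalization. A minimal sketch, where the exact parameterizations (bandwidths fixed to 1) are illustrative assumptions:

```python
import numpy as np

def match_prob(d_qp, d_qn, kernel="gaussian"):
    # q(p|q): probability that q picks p over n in the embedding space,
    # under three candidate kernels on the L2 distance.
    if kernel == "gaussian":
        sp, sn = np.exp(-0.5 * d_qp**2), np.exp(-0.5 * d_qn**2)
    elif kernel == "cauchy":
        sp, sn = 1.0 / (1.0 + d_qp**2), 1.0 / (1.0 + d_qn**2)
    elif kernel == "exponential":
        sp, sn = np.exp(-d_qp), np.exp(-d_qn)
    else:
        raise ValueError(kernel)
    return sp / (sp + sn)
```

All three kernels agree on the qualitative behavior (equal distances give probability 0.5, a nearer positive gives probability above 0.5); they differ in how quickly the probability, and hence the training gradient, saturates with distance, which is the subject of Sec. 5.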
5 Comparing the Three Losses
In this section, we illustrate the connections between the three loss functions above by deriving and comparing their gradients, which are key to the back-propagation stage in network training. Note that a gradient may be interpreted as the resultant force created by a set of springs between an image pair. For the gradient with respect to the positive image p, the spring pulls the (q, p) pair together; for the gradient with respect to the negative image n, the spring pushes the (q, n) pair apart.
In Fig. 3, we compare the magnitudes of the gradients with respect to p and n for the different objectives. The corresponding mathematical expressions are given in Table 1. For each objective, the gradient with respect to q satisfies ∂L/∂f(q) = −( ∂L/∂f(p) + ∂L/∂f(n) ).
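The identity that the query gradient is the negative sum of the positive and negative gradients can be checked numerically with finite differences. The sketch below assumes the Gaussian-kernel SARE of Sec. 4.3 (−log of a softmax over negative half squared distances); the helper names are our own.

```python
import numpy as np

def sare_loss(fq, fp, fn):
    # Gaussian-kernel SARE loss for a triplet (q, p, n).
    d_qp2 = np.sum((fq - fp) ** 2)
    d_qn2 = np.sum((fq - fn) ** 2)
    a, b = np.exp(-0.5 * d_qp2), np.exp(-0.5 * d_qn2)
    return -np.log(a / (a + b))

def num_grad(f, x, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
fq, fp, fn = rng.normal(size=(3, 4))
gq = num_grad(lambda x: sare_loss(x, fp, fn), fq)
gp = num_grad(lambda x: sare_loss(fq, x, fn), fp)
gn = num_grad(lambda x: sare_loss(fq, fp, x), fn)
# gq should equal -(gp + gn): the loss depends only on the differences
# f(q) - f(p) and f(q) - f(n), so translating all three vectors together
# leaves the loss unchanged.
```

The same argument applies to the triplet ranking and contrastive losses, since they too depend only on pairwise distances.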
In the case of the triplet ranking loss, ‖∂L/∂f(p)‖ and ‖∂L/∂f(n)‖ increase linearly with the distances d(q, p) and d(q, n), respectively. The saturation regions in which the gradients equal zero correspond to triplets producing a zero loss (Eq. (1)). For triplets producing a non-zero loss, ∂L/∂f(p) is independent of d(q, n), and vice versa. Thus, the update of p disregards the current embedded position of n, and vice versa.
For the contrastive loss, ∂L/∂f(p) is independent of d(q, n) and increases linearly with the distance d(q, p). ‖∂L/∂f(n)‖ decreases linearly with the distance d(q, n). The region in which ∂L/∂f(n) equals zero corresponds to negative images with d(q, n) ≥ τ.
For all kernel-defined SAREs, ∂L/∂f(p) and ∂L/∂f(n) depend on both distances d(q, p) and d(q, n). This implicit dependence on the distances comes from the probability q(p|q) (Eq. (4)). Thus, the updates of p and n take into account the current embedded positions of all triplet images, which is beneficial given the possibly diverse feature distribution in the embedding space.
The benefit of kernel-defined SARE objectives can be better understood in combination with the hard-negative mining strategy widely used in CNN training. The strategy returns a set of hard negative images (i.e., the nearest negatives under the L2 metric) for training. Note that both the triplet ranking loss and the contrastive loss rely on empirical parameters (the margins m and τ) to prune out negatives (cf. the saturation regions). In contrast, our kernel-defined SARE objectives do not rely on such parameters; they preemptively account for the current embedded positions. For example, a hard negative with d(q, n) < d(q, p) (top-left triangle in the gradient figures) triggers a large force pulling the (q, p) pair together while pushing the (q, n) pair apart. A “semi-hard” negative with d(q, n) > d(q, p) (bottom-right triangle in the gradient figures) still triggers a force pulling the (q, p) pair and pushing the (q, n) pair, but the force decays as d(q, n) increases. Here, a large d(q, n) may correspond to well-trained samples or noise, and the gradient-decay behavior has the potential benefit of reducing over-fitting.
To better understand the gradient-decay behavior of kernel-defined SARE objectives, we fix d(q, p) and compare ‖∂L/∂f(n)‖ as a function of d(q, n) for all objectives in Fig. 4. The fixed value of d(q, p) is chosen such that, for uniformly distributed feature embeddings, a randomly sampled (q, p) pair is likely to be about that far apart; uniformly distributed feature embeddings correspond to an initial untrained/un-fine-tuned CNN. For the triplet ranking loss, Gaussian SARE, and Cauchy SARE, ‖∂L/∂f(n)‖ increases with d(q, n) when d(q, n) is small. In contrast to the gradual decay of the SAREs, the triplet ranking loss abruptly “closes” the force once the triplet produces a zero loss (Eq. (1)). For the contrastive loss and Exponential SARE, ‖∂L/∂f(n)‖ decreases with d(q, n); again, the contrastive loss “closes” the force once the negative image produces a zero loss.
6 Handling Multiple Negatives
In this section, we give two methods to handle multiple negative images in the CNN training stage. Equation (3) defines a SARE loss on a triplet and aims to shorten the embedded distance between the query and positive images while enlarging the distance between the query and negative images. Usually, in the IBL task, the number of positive images is very small, since they must depict the same landmarks as the query image, while the number of negative images is very large, since images from all other places are negative. At the same time, the time-consuming hard-negative mining process returns multiple negative images for each query image [3, 4]. There are two ways to handle these negative images: treat them independently, or handle them jointly. Both strategies are illustrated in Fig. 5.
Given N negative images, treating them independently results in N triplets, which are substituted into Eq. (3) to calculate the loss used to train the CNN. Each triplet focuses on the competition between two places (positive vs. negative). The attraction and repulsion forces from the multiple place pairs are averaged to balance the embeddings.
Jointly handling multiple negatives aims to balance the distance to the positive against multiple negatives simultaneously. In our formulation, we can easily construct an objective function that pushes all negative images at once. Specifically, the matchability priors for all negative images n_i (i = 1, …, N) are defined as zero, i.e., p(n_i|q) = 0. The Kullback-Leibler divergence loss over multiple negatives is then given by:

L_SARE-Joint = −log q(p|q)    (5)

where, for the Gaussian-kernel SARE, q(p|q) is defined as:

q(p|q) = exp(−‖f(q) − f(p)‖²/2) / ( exp(−‖f(q) − f(p)‖²/2) + Σ_i exp(−‖f(q) − f(n_i)‖²/2) )
The gradients of Eq. (5) can be easily computed to train the CNN. All kernel-defined multi-negative loss functions and their gradients are given in the Appendix.
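The two strategies can be contrasted in a short sketch, again assuming the Gaussian kernel on squared L2 distances; the averaging in the independent variant is our reading of the strategy, used here for illustration.

```python
import numpy as np

def sare_joint_loss(fq, fp, fns):
    # Jointly handle N negatives: the positive competes against all of them
    # in one softmax, q(p|q) = a / (a + sum_i b_i).
    d_qp2 = np.sum((fq - fp) ** 2)
    d_qn2 = np.sum((fq - np.asarray(fns)) ** 2, axis=1)
    a = np.exp(-0.5 * d_qp2)
    b = np.sum(np.exp(-0.5 * d_qn2))
    return -np.log(a / (a + b))

def sare_indep_loss(fq, fp, fns):
    # Treat each negative independently: N two-way triplet losses, averaged.
    losses = []
    d_qp2 = np.sum((fq - fp) ** 2)
    a = np.exp(-0.5 * d_qp2)
    for fn in fns:
        bn = np.exp(-0.5 * np.sum((fq - fn) ** 2))
        losses.append(-np.log(a / (a + bn)))
    return float(np.mean(losses))
```

With a single negative, the two variants coincide; with more negatives, the joint loss grows because every additional negative shrinks q(p|q), so each place competes against all others at once.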
7 Experiments

This section discusses the performance of the SARE objectives for training CNNs. We show that with SARE, we improve performance on various standard place recognition and image retrieval datasets.
7.1 Implementation Details
Google Street View Time Machine datasets have been widely used in IBL [32, 3, 4]. They provide multiple street-level panoramic images taken at different times at close-by spatial locations on the map. The panoramic images are projected into multiple perspective images, yielding the training and testing datasets. Each image is associated with a GPS tag giving its approximate geographic location, which can be used to identify nearby images, though not necessarily images depicting the same landmark. We follow [3, 2] to identify the positive and negative images for each query image: the positive image is the closest neighbor in the feature embedding space among the images at nearby geo-positions, and the negatives are images that are geographically far away. This positive/negative mining method is very efficient, although some outliers may exist in the resulting positives/negatives. If accurate positives and negatives are needed, pairwise image matching with geometric validation or SfM reconstruction can be used, but these are time-consuming.
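The mining procedure can be sketched as follows. This is a simplified illustration: the distance thresholds, the planar GPS coordinates in meters, and the function name are all assumptions, not the paper's values.

```python
import numpy as np

def mine_pos_neg(q_idx, feats, gps, pos_radius=10.0, neg_radius=25.0):
    # Weak supervision from GPS tags: candidate positives are geographically
    # nearby images; the chosen positive is the nearest candidate in the
    # current embedding space; negatives are geographically far-away images.
    geo_d = np.linalg.norm(gps - gps[q_idx], axis=1)
    cand = np.where((geo_d <= pos_radius) & (np.arange(len(gps)) != q_idx))[0]
    feat_d = np.linalg.norm(feats[cand] - feats[q_idx], axis=1)
    positive = int(cand[np.argmin(feat_d)])
    negatives = np.where(geo_d > neg_radius)[0]
    return positive, negatives
```

Because the positive is selected in the embedding space rather than by GPS alone, the choice improves as the CNN improves, at the cost of occasional outliers early in training.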
The Pitts30k-training dataset is used to train the CNNs, as it has been shown to yield the best-performing CNN. To test our method on IBL, the Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0 [33, 12] datasets are used. To show the generalization ability of our method on image retrieval, the Oxford 5k, Paris 6k, and Holidays datasets are used. Details of these datasets are given in the Appendix.
For the place recognition datasets Pitts250k-test, TokyoTM-val, 24/7 Tokyo, and Sf-0, we use the Precision-Recall curve to evaluate performance. Specifically, for Pitts250k-test, TokyoTM-val, and 24/7 Tokyo, a query image is deemed correctly localized if at least one of the top-N retrieved database images is within a distance threshold of the ground-truth position of the query. The percentage of correctly recognized queries (Recall) is then plotted for different values of N. For the large-scale Sf-0 dataset, a query image is deemed correctly localized if at least one of the top-N retrieved database images shares the same manually labeled building IDs. For the image retrieval datasets Oxford 5k, Paris 6k, and Holidays, the mean Average Precision (mAP) is reported.
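The Recall@N protocol described above can be sketched directly. A minimal NumPy version, where the 25-meter threshold and planar coordinates are illustrative assumptions:

```python
import numpy as np

def recall_at_n(query_feats, db_feats, query_gps, db_gps, n=5, dist_thresh=25.0):
    # A query counts as correctly localized if any of its top-n retrieved
    # database images lies within dist_thresh meters of its true position.
    hits = 0
    for qf, qg in zip(query_feats, query_gps):
        d = np.linalg.norm(db_feats - qf, axis=1)   # embedding distances
        topn = np.argsort(d)[:n]
        geo = np.linalg.norm(db_gps[topn] - qg, axis=1)
        hits += int(np.any(geo <= dist_thresh))
    return hits / len(query_feats)

# Toy check: one query whose nearest-embedding database image is 5 m away.
db_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
db_gps = np.array([[0.0, 0.0], [1000.0, 0.0]])
r1 = recall_at_n(np.array([[0.9, 0.1]]), db_feats,
                 np.array([[5.0, 0.0]]), db_gps, n=1)
```

Recall is non-decreasing in N, since enlarging the retrieved set can only add chances of a geographically correct hit.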
We use the training method of  to compare the different objectives. For the state-of-the-art triplet ranking loss, the off-the-shelf implementation  is used. For the contrastive loss, triplet images are partitioned into (q, p) and (q, n) pairs to calculate the loss (Eq. (2)) and gradients. For our method treating multiple negatives independently (Our-Ind.), we first calculate the probability q(p|q) (Eq. (4)), which is then used to calculate the gradients (Table 1) with respect to the images; the gradients are back-propagated to train the CNN. For our method jointly handling multiple negatives (Our-Joint), we use Eq. (5) to train the CNN. Our implementation is based on MatConvNet (the pre-trained model and code are available at https://github.com/Liumouliu/deepIBL). Details are given in the Appendix.
7.2 Kernels for SARE
To assess the impact of the kernel used to fit the pairwise L2 feature vector distances, we compare CNNs trained with the Gaussian-, Cauchy-, and Exponential-kernel SARE objectives, respectively. All hyper-parameters are kept the same across objectives, and the results are given in Fig. 6. The CNN trained with the Gaussian-kernel SARE generally outperforms the others.
We find that handling multiple negatives jointly (Gaussian-Joint) leads to better training and validation performance than handling them independently (Gaussian-Ind.). However, when testing the trained CNNs on the Pitts250k-test, TokyoTM-val, and 24/7 Tokyo datasets, the recall performances are similar. Gaussian-Ind. performs surprisingly well on the large-scale Sf-0 dataset.
7.3 Comparison with state-of-the-art
We use the Gaussian-kernel SARE objectives to train CNNs and compare our method with the state-of-the-art NetVLAD  and NetVLAD with Contextual Feature Reweighting (CRN) . The complete Recall@N performance for the different methods is given in Table 2.
CNNs trained with the Gaussian-SARE objectives consistently outperform state-of-the-art CNNs by a large margin on almost all benchmarks. For example, on the challenging 24/7 Tokyo dataset, Our-Ind.-trained NetVLAD achieves a recall@1 of 79.68%, compared to the second-best 75.20% obtained by CRN , i.e., an improvement in recall of 4.48 percentage points. On the large-scale, challenging Sf-0 dataset, Our-Ind.-trained NetVLAD achieves a recall@1 of 80.60%, compared to 75.58% obtained by NetVLAD , i.e., an improvement in recall of 5.02 percentage points. Note that we do not use the Contextual Reweighting layer to capture the “context” within images, which has been shown to be more effective than the original NetVLAD structure . Similar improvements are observed on the other datasets. This confirms the central premise of this work: formulating the IBL problem in a competitive learning framework and using SARE to supervise the CNN training process learns discriminative yet compact image representations for IBL. We give a visualization of 2D feature embeddings of query images from the 24/7 Tokyo and Sf-0 datasets in the Appendix: images taken from the same place are mostly embedded at nearby 2D positions despite significant variations in viewpoint, pose, and configuration.
7.4 Qualitative Evaluation
To visualize the areas of the input image that are most important for localization, we adopt  to obtain a heat map showing the importance of different regions of the input image. The results are given in Fig. 7. As can be seen, our method focuses on regions that are useful for image geo-localization and emphasizes distinctive details on buildings, whereas NetVLAD  emphasizes local features rather than the overall building style.
7.5 Generalization on Image Retrieval Datasets
To show the generalization ability of our method, we compare the compact image representations trained by the different methods on standard image retrieval benchmarks (Oxford 5k, Paris 6k, and Holidays) without any fine-tuning. The results are given in Table 3 (the complete set of results for different output dimensions is given in the Appendix). Comparing the CNNs trained by our methods with the off-the-shelf NetVLAD  and CRN , the mAP of our methods is higher in most cases. Since our CNNs are trained on a city-scale, building-oriented dataset from urban areas, they lack the ability to understand natural landmarks and other objects (e.g., water, boats, cars), resulting in a performance drop compared with results on city-scale, building-oriented datasets. Training a CNN on images similar to those encountered at test time can increase retrieval performance . However, our purpose here is to demonstrate the generalization ability of SARE-trained CNNs, which these results confirm.
We also fine-tuned our model on the training dataset from the Google Landmark Retrieval Challenge  without tuning any hyper-parameters of our method. Our entry (SevenSpace) finished “in the money” (i.e., top 3).
[Table 3 columns: Method | Oxford 5k | Paris 6k | Holidays]
7.6 Comparison with Metric-learning Methods
Although deep metric-learning methods have shown their effectiveness in classification and fine-grained recognition tasks, their performance on the IBL task is unknown. As another contribution of this paper, we evaluate five current state-of-the-art deep metric-learning methods on IBL and compare our method with: (1) the contrastive loss used by ; (2) lifted structure embedding ; (3) the N-pair loss ; (4) the N-pair angular loss ; (5) the Geo-classification loss . Implementation details are given in the Appendix.
Fig. 8 shows the quantitative comparison between our method and the other deep metric-learning methods. The complete Recall@N performance for the different methods is given in the Appendix. Our method outperforms the contrastive loss  and the Geo-classification loss , while remaining comparable with the other state-of-the-art metric-learning methods.
8 Conclusion

This paper has addressed the problem of learning discriminative image representations specifically tailored for the task of Image-Based Localization (IBL). We have proposed a new Stochastic Attraction and Repulsion Embedding (SARE) objective for this task. SARE directly enforces “attraction” and “repulsion” constraints on intra-place and inter-place feature embeddings, respectively; these constraints are formulated as a similarity-based binary classification task. Experiments show that SARE improves IBL performance, outperforming other state-of-the-art methods.
References

-  Weyand, T., Kostrikov, I., Philbin, J.: Planet - photo geolocation with convolutional neural networks. In: European Conference on Computer Vision, Springer (2016) 37–55
-  Vo, N., Jacobs, N., Hays, J.: Revisiting im2gps in the deep learning era. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
-  Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
-  Kim, H.J., Dunn, E., Frahm, J.M.: Learned contextual feature reweighting for image geo-localization. In: CVPR. (2017)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics 31(5) (2015) 1147–1163
-  Middelberg, S., Sattler, T., Untzelmann, O., Kobbelt, L.: Scalable 6-dof localization on mobile devices. In: European conference on computer vision, Springer (2014) 268–283
-  Ventura, J., Arth, C., Reitmayr, G., Schmalstieg, D.: Global localization from monocular slam on a mobile phone. IEEE transactions on visualization and computer graphics 20(4) (2014) 531–539
-  Pan, L., Dai, Y., Liu, M., Porikli, F.: Simultaneous stereo video deblurring and scene flow estimation. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 6987–6996
-  Pan, L., Dai, Y., Liu, M., Porikli, F.: Depth map completion by jointly exploiting blurry color images and sparse depth maps. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). (March 2018) 1377–1386
-  Liu, L., Li, H., Dai, Y., Pan, Q.: Robust and efficient relative pose with a multi-camera system for autonomous driving in highly dynamic environments. IEEE Transactions on Intelligent Transportation Systems 19(8) (Aug 2018) 2432–2444
-  Sattler, T., Torii, A., Sivic, J., Pollefeys, M., Taira, H., Okutomi, M., Pajdla, T.: Are large-scale 3d models really necessary for accurate visual localization? In: CVPR 2017-IEEE Conference on Computer Vision and Pattern Recognition. (2017)
-  Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
-  Radenović, F., Tolias, G., Chum, O.: Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In: European Conference on Computer Vision, Springer (2016) 3–20
-  Sattler, T., Leibe, B., Kobbelt, L.: Fast image-based localization using direct 2d-to-3d matching. In: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE (2011) 667–674
-  Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence 39(9) (2017) 1744–1756
-  Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: European conference on computer vision, Springer (2010) 791–804
-  Liu, L., Li, H., Dai, Y.: Efficient global 2d-3d matching for camera localization in a large-scale 3d map. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
-  Carl, T., Erik, S., Lars, H., Lucas, B., Marc, P., Torsten, S., Fredrik, K.: Semantic match consistency for long-term visual localization. ECCV (2018)
-  Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J., et al.: Benchmarking 6dof outdoor visual localization in changing conditions. In: Proc. CVPR. Volume 1. (2018)
-  Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: European Conference on Computer Vision, Springer (2016) 241–257
-  Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124(2) (2017) 237–254
-  Wu, Y.: Image based camera localization: an overview. CoRR abs/1610.03660 (2016)
-  Razavian, A., Sullivan, J., Maki, A., Carlsson, S.: A baseline for visual instance retrieval with deep convolutional networks. 4 (12 2014)
-  Seo, P.H., Weyand, T., Sim, J., Han, B.: Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. ECCV (2018)
-  Kohonen, T.: The self-organizing map. Neurocomputing 21(1) (1998) 1–6
-  Muñoz-Perez, J., Gómez-Ruiz, J.A., López-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural processing letters 15(3) (2002) 261–273
-  Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99) (2017) 1–1
-  Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov) (2008) 2579–2605
-  Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 815–823
-  Manmatha, R., Wu, C.Y., Smola, A.J., Krahenbuhl, P.: Sampling matters in deep embedding learning. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 2859–2867
-  Torii, A., Arandjelovic, R., Sivic, J., Okutomi, M., Pajdla, T.: 24/7 place recognition by view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1808–1817
-  Chen, D.M., Baatz, G., Köser, K., Tsai, S.S., Vedantham, R., Pylvänäinen, T., Roimela, K., Chen, X., Bach, J., Pollefeys, M., et al.: City-scale landmark identification on mobile devices. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE (2011) 737–744
-  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE (2007) 1–8
-  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE (2008) 1–8
-  Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. Computer Vision–ECCV 2008 (2008) 304–317
-  Vedaldi, A., Lenc, K.: Matconvnet – convolutional neural networks for matlab. In: Proceeding of the ACM Int. Conf. on Multimedia. (2015)
-  Grün, F., Rupprecht, C., Navab, N., Tombari, F.: A taxonomy and library for visualizing learned features in convolutional neural networks. In: ICML Visualization for Deep Learning Workshop. (2016)
-  Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European conference on computer vision, Springer (2014) 584–599
-  Google landmark retrieval challenge. https://www.kaggle.com/c/landmark-retrieval-challenge/leaderboard
-  Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4004–4012
-  Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Advances in Neural Information Processing Systems. (2016) 1857–1865
-  Wang, J., Zhou, F., Wen, S., Liu, X., Lin, Y.: Deep metric learning with angular loss. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
In this Appendix, we describe the gradients of the loss functions that jointly handle multiple negative images (Sec. 1), provide implementation details (Sec. 2), and include additional experimental results (Sec. 3).
1. Handling Multiple Negatives
Given a query image $q$, a positive image $p$, and multiple negative images $\{n_i\}_{i=1}^{N}$, the Kullback-Leibler divergence loss over multiple negatives is given by:
$\mathcal{L} = \mathrm{KL}(\hat{h} \,\|\, h) = -\log h_{q,p},$
where $\hat{h}$ is the ground-truth matching distribution placing all mass on the positive image, and $h$ is the estimated matching distribution.
For Gaussian kernel SARE, $h_{q,p}$ is defined as:
$h_{q,p} = \dfrac{\exp(-\|f_q - f_p\|^2/2)}{\exp(-\|f_q - f_p\|^2/2) + \sum_{i=1}^{N}\exp(-\|f_q - f_{n_i}\|^2/2)},$
where $f_q$, $f_p$, and $f_{n_i}$ are the feature embeddings of the query, positive, and negative images, respectively.
Denoting $\dfrac{\exp(-\|f_q - f_{n_i}\|^2/2)}{\exp(-\|f_q - f_p\|^2/2) + \sum_{j=1}^{N}\exp(-\|f_q - f_{n_j}\|^2/2)}$ as $h_{q,n_i}$, the gradients of Eq. (9) with respect to the query, positive, and negative images are given by:
$\dfrac{\partial \mathcal{L}}{\partial f_q} = (1 - h_{q,p})(f_q - f_p) - \sum_{i=1}^{N} h_{q,n_i}(f_q - f_{n_i}),$
$\dfrac{\partial \mathcal{L}}{\partial f_p} = -(1 - h_{q,p})(f_q - f_p),$
$\dfrac{\partial \mathcal{L}}{\partial f_{n_i}} = h_{q,n_i}(f_q - f_{n_i}).$
Similarly, for the Cauchy kernel, the loss function is given by:
$\mathcal{L} = -\log h_{q,p}, \quad h_{q,p} = \dfrac{(1 + \|f_q - f_p\|^2)^{-1}}{(1 + \|f_q - f_p\|^2)^{-1} + \sum_{i=1}^{N}(1 + \|f_q - f_{n_i}\|^2)^{-1}}.$
Denoting $\dfrac{(1 + \|f_q - f_{n_i}\|^2)^{-1}}{(1 + \|f_q - f_p\|^2)^{-1} + \sum_{j=1}^{N}(1 + \|f_q - f_{n_j}\|^2)^{-1}}$ as $h_{q,n_i}$, the gradients of Eq. (13) with respect to the query, positive, and negative images are given by:
$\dfrac{\partial \mathcal{L}}{\partial f_q} = \dfrac{2(1 - h_{q,p})(f_q - f_p)}{1 + \|f_q - f_p\|^2} - \sum_{i=1}^{N} \dfrac{2\, h_{q,n_i}(f_q - f_{n_i})}{1 + \|f_q - f_{n_i}\|^2},$
$\dfrac{\partial \mathcal{L}}{\partial f_p} = -\dfrac{2(1 - h_{q,p})(f_q - f_p)}{1 + \|f_q - f_p\|^2},$
$\dfrac{\partial \mathcal{L}}{\partial f_{n_i}} = \dfrac{2\, h_{q,n_i}(f_q - f_{n_i})}{1 + \|f_q - f_{n_i}\|^2}.$
For the Exponential kernel, the loss function is given by:
$\mathcal{L} = -\log h_{q,p}, \quad h_{q,p} = \dfrac{\exp(-\|f_q - f_p\|)}{\exp(-\|f_q - f_p\|) + \sum_{i=1}^{N}\exp(-\|f_q - f_{n_i}\|)}.$
Denoting $\dfrac{\exp(-\|f_q - f_{n_i}\|)}{\exp(-\|f_q - f_p\|) + \sum_{j=1}^{N}\exp(-\|f_q - f_{n_j}\|)}$ as $h_{q,n_i}$, the gradients of Eq. (17) with respect to the query, positive, and negative images are given by:
$\dfrac{\partial \mathcal{L}}{\partial f_q} = \dfrac{(1 - h_{q,p})(f_q - f_p)}{\|f_q - f_p\|} - \sum_{i=1}^{N} \dfrac{h_{q,n_i}(f_q - f_{n_i})}{\|f_q - f_{n_i}\|},$
$\dfrac{\partial \mathcal{L}}{\partial f_p} = -\dfrac{(1 - h_{q,p})(f_q - f_p)}{\|f_q - f_p\|},$
$\dfrac{\partial \mathcal{L}}{\partial f_{n_i}} = \dfrac{h_{q,n_i}(f_q - f_{n_i})}{\|f_q - f_{n_i}\|}.$
The gradients are back propagated to train the CNN.
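As a sanity check, the Gaussian-kernel loss and its analytic gradients can be sketched in NumPy and verified against finite differences. This is a minimal sketch, not the released implementation; function and variable names are ours.

```python
import numpy as np

def sare_gaussian_loss_and_grads(q, p, negs):
    """SARE loss with a Gaussian kernel over one positive and N negatives.

    q, p: (D,) embeddings of the query and positive image.
    negs: (N, D) embeddings of the negative images.
    Returns the scalar loss and analytic gradients w.r.t. q, p, and negs.
    """
    d_p = np.sum((q - p) ** 2)                  # squared distance to positive
    d_n = np.sum((q - negs) ** 2, axis=1)       # squared distances to negatives
    # Softmax-style matching probabilities (Gaussian kernel).
    w = np.exp(-0.5 * np.concatenate(([d_p], d_n)))
    w /= w.sum()
    h_p, h_n = w[0], w[1:]                      # h_{q,p} and h_{q,n_i}
    loss = -np.log(h_p)
    # Attraction to the positive, repulsion from each negative,
    # weighted by the matching probabilities.
    g_q = (1.0 - h_p) * (q - p) - (h_n[:, None] * (q - negs)).sum(axis=0)
    g_p = -(1.0 - h_p) * (q - p)
    g_n = h_n[:, None] * (q - negs)
    return loss, g_q, g_p, g_n
```

The gradients sum to zero across all inputs, reflecting the translation invariance of the loss, which makes a convenient extra check alongside finite differences.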
2. Implementation Details
The base network is cropped at the last convolutional layer (conv5), before ReLU. The learning rates for the Pitts30K-train and Pitts250K-train datasets are set to 0.001 and 0.0001, respectively, and are halved every 5 epochs; we use momentum 0.9, weight decay 0.001, and a batch size of 4 tuples. Each tuple consists of one query image, one positive image, and ten negative images. The CNN is trained for at most 30 epochs, but convergence usually occurs much faster (typically in fewer than 5 epochs). The network that yields the best recall@5 on the validation set is used for testing.
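The learning-rate schedule above can be sketched as follows (a trivial helper, with `base_lr` equal to 0.001 for Pitts30K-train and 0.0001 for Pitts250K-train; the function name is ours):

```python
def learning_rate(epoch, base_lr=0.001):
    """Return the learning rate at a given epoch: the base rate,
    halved every 5 epochs (epochs are 0-indexed)."""
    return base_lr * 0.5 ** (epoch // 5)
```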
Triplet Ranking Loss
For the contrastive loss, we set the margin, and only negative images producing a non-zero loss are used in the gradient computation. Note that positive images are always used in training since they are not pruned out.
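The pruning behavior above follows from the shape of the contrastive loss itself: negatives farther than the margin contribute zero loss (and zero gradient), while positives always contribute. A minimal per-pair sketch, assuming the common squared-hinge form (names and the exact form are our assumptions):

```python
def contrastive_loss(dist, is_positive, margin):
    """Contrastive loss for one pair given the embedding distance.

    Positives are always penalized by their squared distance; negatives
    contribute only when closer than the margin, so negatives beyond the
    margin produce zero loss and are effectively pruned from training.
    """
    if is_positive:
        return 0.5 * dist ** 2
    return 0.5 * max(margin - dist, 0.0) ** 2
```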
Geographic Classification Loss
For the geographic classification method, we use the Pitts250k-train dataset for training. We first partition the 2D geographic space into square cells. The cell size is selected to match the evaluation metric for compatibility, so that correctly classified images are also correctly localized images according to our evaluation metric. We remove Geo-classes that contain no images. We append a fully connected layer (randomly initialized) and a Softmax-log-loss layer after the NetVLAD pooling layer to predict which class an image belongs to.
For our methods (Our-Ind. and Our-Joint), Our-Ind. treats multiple negative images independently, while Our-Joint treats them jointly. The two methods differ only in the loss function and gradient computation. For each method, the corresponding gradients are back-propagated to train the CNN.
Triplet Angular Loss
For the triplet angular loss, we use the N-pair version of the loss function (Eq. (8) in their paper), as it achieves the best performance on the Stanford Car dataset.
For the N-pair loss, we use the N-pair loss function (Eq. (3) in their paper).
Lifted Structured Loss
For the lifted structured loss, we use the smooth loss function (Eq. (4) in their paper). Note that training images producing a zero loss are pruned out.
3. Additional Results
Table 4 gives the details of datasets used in our experiments.
Visualization of feature embeddings.
Fig. 9 and Fig. 10 visualize the feature embeddings of the 24/7 Tokyo-query and Sf-0-query datasets computed by our method (Our-Ind.), projected to 2-D using t-SNE, respectively. Images are displayed exactly at their embedded locations. Note that images taken from the same place are mostly embedded to nearby 2D positions, although they differ in lighting and perspective.
| Dataset | #database images | #query images |
| --- | --- | --- |
| Tokyo 24/7 (-test) | 75,984 | 315 |
Image retrieval for varying dimensions.
Table 5 gives the comparison of image retrieval performance for different output dimensions.
Metric learning methods
Table 6 gives the complete Recall@N performance for different methods. Our method outperforms the contrastive loss and the Geo-classification loss, while remaining comparable with other state-of-the-art metric learning methods.
| Method | Dim. | Oxford 5K | Paris 6K | Holidays |
| --- | --- | --- | --- | --- |