Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval

03/25/2020 ∙ by Shuhei Yokoo, et al. ∙ University of Tsukuba ∙ Preferred Infrastructure

We propose an efficient pipeline for large-scale landmark image retrieval that addresses the diversity of the dataset through two-stage discriminative re-ranking. Our approach is based on embedding the images in a feature space using a convolutional neural network trained with a cosine softmax loss. Due to the variance of the images, which includes extreme viewpoint changes such as having to retrieve images of the exterior of a landmark from images of its interior, this is very challenging for approaches based exclusively on visual similarity. Our proposed re-ranking approach improves the results in two steps: in the sort-step, we use k-nearest neighbor search with soft-voting to sort the retrieved results based on their label similarity to the query images, and in the insert-step, we add samples from the dataset that were not retrieved by image similarity. This approach allows us to overcome the low visual diversity of the retrieved images. In-depth experimental results show that the proposed approach significantly outperforms existing approaches on the challenging Google Landmarks Datasets. Using our methods, we achieved 1st place in the Google Landmark Retrieval 2019 challenge and 3rd place in the Google Landmark Recognition 2019 challenge on Kaggle. Our code is publicly available here: <>




1 Introduction

Image retrieval is a fundamental problem in computer vision: given a query image, similar images must be found in a large dataset. In the case of landmark images, the variation between points of view and between different parts of a landmark can be extreme, proving challenging even for humans without deep knowledge of the landmark in question. One such complicated example is shown in Fig. 1. The Scuderie del Quirinale is visually very similar to other structures such as the Vatican obelisk and the Inco Superstack, leading to erroneous retrievals. Our proposed re-ranking approach is able to exploit labeled information from the training dataset to improve the retrieval results, even when the correct images are visually very dissimilar, e.g., drawings, different viewpoints, or diverse illumination.

Instance image retrieval can be seen as the task of converting the image information into an embedding where similar images are nearby. Similar to recent approaches, we focus on learning this embedding with a convolutional neural network (CNN). We adopt a cosine softmax loss to train the neural network for the retrieval task. Afterward, instead of simply using the distance in the embedding space to find related images, we exploit the label information to perform re-ranking. Our re-ranking is based on a two-step approach. In the sort-step, a discriminative model based on k-NN search with soft voting allows us to sort the initially retrieved results such that results more label-similar to the query image are given higher priority. In the insert-step, images that were originally not retrieved are inserted into the retrieval results based on the same discriminative model. This combined approach shows a significant improvement over existing approaches.

Noh et al. [32] have recently provided a challenging dataset named Google Landmarks Dataset v1 (GLD-v1) for instance-level landmark image retrieval. For each landmark, there is a diversity of images including both interior and exterior views. Being able to identify the images without context is very challenging, and in many cases, positive pairs have a very different visual appearance. More recently, the dataset has been expanded in a second version (GLD-v2) to be more complex and challenging. We focus on the retrieval task in this challenging setting which, being recent, has not been fully explored yet.

Although the GLD-v2 dataset is a significant improvement over the previous version, consistency and quality are still significant open issues that can be very detrimental to results in the retrieval task. For this purpose, we also propose an automatic data cleaning approach based on filtering the training data. Although this reduces the dataset size and training budget, it ends up being beneficial to the overall performance of the model.

To summarize, our contributions are: (1) an effective pipeline for high-quality landmark retrieval, (2) a re-ranking approach based on exploiting label information, and (3) results that significantly outperform existing approaches on challenging datasets.

2 Related Work

Instance Image Retrieval. Image retrieval is usually posed as the problem of finding an image embedding in which similar images have small distance, and has traditionally been addressed with local descriptor based methods [43, 9, 34, 24], including the popular SIFT [30], RootSIFT [2], and SURF [4]. The Bag-of-Words model [43, 9] and its variants (VLAD [24], Fisher Vector [34]) were popular in image retrieval before the advent of learning-based approaches; they construct image embeddings by aggregating local descriptors. More recently, DELF [32] has been proposed as a deep learning-based local descriptor method, which uses an attention map over CNN activations learned from image-level annotations only. See [52] for a survey of instance image retrieval.

After the emergence of deep learning, many image retrieval methods based on it have been presented, and most recent image retrieval approaches are deep learning-based [40, 3, 25, 26, 1, 15, 38]. Both utilizing off-the-shelf CNN activations as image embeddings [40, 3, 25, 26] and further fine-tuning on specific datasets [1, 15, 38] are popular approaches. NetVLAD, an extension of VLAD that is differentiable and trainable in an end-to-end fashion, has also been proposed recently [1]. Gordo et al. [15] proposed using a region proposal network to localize the landmark region and training a triplet network in an end-to-end fashion.

The current state-of-the-art local descriptor based method is D2R-R-ASMK [44] combined with spatial verification [35]. D2R-R-ASMK is a regional aggregation method comprising a region detector built on ASMK (Aggregated Selective Match Kernels) [45], a local feature aggregation technique. The current state-of-the-art CNN global descriptor method is that of Radenović et al. [38], which employs an AP loss [41] along with re-ranking methods [50, 5]. We construct our pipeline mainly on the latter strategy and show that, by using a two-stage discriminative re-ranking approach, we obtain results that compare favorably to existing approaches.

Retrieval Loss Functions.

Instance image retrieval requires an image embedding that captures similarity well, and the loss used during training plays an important role. Using off-the-shelf CNN embeddings has been effective for image retrieval [40, 3, 25, 26]. Babenko and Lempitsky [3] proposed sum-pooling of CNN activations, and Lin et al. [26] proposed max-pooling over multiple regions of CNN activations. However, training specifically for the task of instance retrieval has proven more effective, with contrastive loss [6] and triplet loss [49, 18, 42] among the most used losses in image retrieval [13, 15, 38, 19]. Recently, the AP loss [41], which directly optimizes the global mean average precision by leveraging list-wise loss formulations, has been proposed and achieved state-of-the-art results. In the face recognition field, cosine softmax losses [39, 28, 27, 48, 47, 11, 51] have recently shown astonishing results and have become more favorable than other losses [31]. Cosine softmax losses impose an L2-constraint on the features, restricting them to lie on a hypersphere of fixed radius; popular variants are SphereFace [28, 27], ArcFace [11], and CosFace [48, 47], which use a multiplicative angular margin penalty, an additive angular margin penalty, and an additive cosine margin penalty, respectively. While contrastive loss and triplet loss require training techniques such as hard negative mining [42, 1], cosine softmax losses do not; they are easy to implement and stable during training. We show through comparative experiments that their success extends beyond face recognition to instance image retrieval.

Re-ranking Methods. Re-ranking is an essential approach to enhance retrieval results on top of the image embedding. Query expansion (QE)-based techniques are simple and popular re-ranking methods for improving the recall of a retrieval system. AQE [8] was the first work to apply query expansion in the vision field; it averages the embeddings of top-ranked images retrieved by an initial query and uses the averaged embedding as a new query. αQE [38] uses a weighted average of the descriptors of top-ranked images, putting heavier weights on higher ranks. DQE [2] uses an SVM classifier and the signed distance from its decision boundary for re-ranking. Spatial verification (SP) [35, 33] checks geometric consistency using local descriptors and RANSAC [14], and can be combined with QE to filter the images used for expansion [8]. SP can also be used as a re-ranking step [7, 37] to improve precision, but it is computationally expensive; therefore, it is generally performed only on a shortlist of top-ranked images. HQE [46] leverages Hamming Embedding [23] to filter images instead of SP.
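As a concrete illustration, AQE and the weighted αQE variant can be sketched in a few lines of NumPy. This is a minimal sketch, not the reference implementations of [8, 38]: descriptors are assumed L2-normalized so that dot products equal cosine similarities, and the default k and α are illustrative values.

```python
import numpy as np

def average_query_expansion(query, ranked_descs, k=10):
    """AQE [8]: average the query with its top-k retrieved descriptors
    and re-normalize the result to form the new query."""
    expanded = np.vstack([query[None, :], ranked_descs[:k]]).mean(axis=0)
    return expanded / np.linalg.norm(expanded)

def alpha_query_expansion(query, ranked_descs, k=10, alpha=3.0):
    """alphaQE [38]: weight each top-ranked descriptor by its cosine
    similarity to the query raised to the power alpha, so higher-ranked
    (more similar) images contribute more to the new query."""
    weights = (ranked_descs[:k] @ query) ** alpha
    expanded = query + (weights[:, None] * ranked_descs[:k]).sum(axis=0)
    return expanded / np.linalg.norm(expanded)
```

The new query is then used for a second round of k-NN search, replacing the original one.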

Diffusion [12] is a major manifold-based approach, also known as similarity propagation, which can also be used for re-ranking. Many diffusion approaches have been proposed to enhance the performance of instance image retrieval [12, 22, 21, 50]. Diffusion captures the image manifold in the feature space by a random walk on the k-NN graph. Because the diffusion process tends to be expensive, spectral methods have been proposed to reduce its computational cost [21], and Yang et al. [50] propose decoupling diffusion into online and offline processes to reduce online computation. EGT [5] is a recently proposed k-NN graph traversal algorithm that outperforms diffusion methods in both performance and efficiency.

Conventional re-ranking methods are unsupervised: they do not consider label information even when it is available. In contrast, our re-ranking method can exploit label information, which is commonly available in many problems, and shows excellent performance on landmark retrieval tasks.

3 Method

Our approach consists of training an embedding space with a CNN using a cosine softmax loss. Afterward, retrieval is done by k-NN search, which is corrected and improved using two-stage discriminative re-ranking.

3.1 Embedding Model

Our model is based on a CNN that embeds each image into a feature space amenable to k-NN search. We use a ResNet-101 [17] augmented with Generalized Mean (GeM) pooling [38] to aggregate the spatial information into a global descriptor.

Reducing the descriptor dimension is crucial, since it dramatically affects the computational budget and alleviates the risk of over-fitting. We reduce the dimension from 2048 to 512 by adding a fully-connected layer after the GeM-pooling layer. Additionally, one-dimensional Batch Normalization [20] after the fully-connected layer is used to improve the generalization ability.
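The pooling and reduction head described above can be sketched as follows. This is an inference-only NumPy sketch under stated assumptions: the GeM exponent p=3.0 and the Batch Normalization statistics are illustrative placeholders, not the paper's trained values.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling [38] over the spatial dimensions of a
    (C, H, W) CNN activation map. p=1 recovers average pooling and large p
    approaches max pooling; p=3.0 here is an illustrative value."""
    flat = np.clip(feature_map, eps, None).reshape(feature_map.shape[0], -1)
    return np.power(flat, p).mean(axis=1) ** (1.0 / p)

def embed_head(pooled, W_fc, bn_mean, bn_var, eps=1e-5):
    """Reduction head: fully-connected layer (2048 -> 512 in the paper)
    followed by 1-D Batch Normalization (inference mode), then L2
    normalization of the final descriptor."""
    z = W_fc @ pooled
    z = (z - bn_mean) / np.sqrt(bn_var + eps)
    return z / np.linalg.norm(z)
```

The final L2 normalization makes dot products between descriptors equal to cosine similarities, which the retrieval and re-ranking stages rely on.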

Training is done using the ArcFace [11] loss with weight regularization, defined as follows:

$$ L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s\cos\theta_j}} + \lambda \lVert W \rVert_2^2, \qquad \cos\theta_j = W_j^\top f(x_i; \Theta), $$

where $x_i$ is the input image with target class $y_i$, $N$ is the batch size, $W$ denotes the weights of the last layer, $\Theta$ is the parameters of the whole network excluding the last layer, $f(x_i; \Theta)$ is the embedding of $x_i$ using $\Theta$, $s$ is a scaling hyperparameter, $m$ is a margin hyperparameter, and $\lambda$ weights the regularization term. We note that $\lVert W_j \rVert = 1$ and $\lVert f(x_i; \Theta) \rVert = 1$ are enforced by normalizing at every iteration.
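For concreteness, the classification part of this loss can be sketched in NumPy as below. This is a minimal forward-pass sketch: the regularization term and backpropagation are omitted, and the scale s=30.0 is an illustrative assumption (only the margin value is reported in Section 5.1).

```python
import numpy as np

def arcface_loss(embeddings, labels, W, s=30.0, m=0.3):
    """ArcFace [11] cross-entropy: add an angular margin m to the target
    class before scaling by s. Rows of W and the embeddings are normalized
    here, as enforced at every iteration in the paper. The weight
    regularization term of the full loss is omitted in this sketch."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = emb @ W.T                                   # cos(theta_j), (N, C)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = s * cos
    rows = np.arange(len(labels))
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)  # add margin
    # numerically stable cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

Setting m=0 recovers a plain cosine softmax; the margin tightens the decision boundary around each class on the hypersphere.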

3.2 Two-stage Discriminative Re-ranking

The diversity of images belonging to the same instance is one of the main problems in image retrieval. For example, an instance of a church may contain diverse samples, such as outdoor and indoor images. These images are extremely hard to identify as the same landmark without any context. Furthermore, the visual dissimilarity makes it nearly impossible to retrieve them using only visual-based embeddings. To overcome this issue, we propose a two-stage discriminative re-ranking that exploits the label information. An overview of our re-ranking approach is shown in Fig. 2.

Our proposed method is composed of an auxiliary offline step and two re-ranking stages. Suppose we have a query, an index set, and a train set. The index set is the database over which we perform image retrieval; it has no labels, only images. First, we predict the instance-id of each sample from the index set by k-NN search with soft-voting, where each sample from the index set is regarded as a query and the train set as the database.

The score of each instance-id $c$ is calculated by accumulating the cosine similarities of the $k$ nearest samples as follows:

$$ s(c) = \sum_{x' \in \mathcal{N}_k(x)} \langle f(x), f(x') \rangle \, \mathbb{1}[y_{x'} = c], $$

where $f$ is the feature embedding function with $\Theta$ omitted for brevity, $\mathcal{N}_k(x)$ is the set of $k$ nearest neighbours of $x$ in the train set, $y_{x'}$ is the label of $x'$, and $\mathbb{1}[\cdot]$ is an indicator function. The prediction then becomes the class $\hat{y} = \arg\max_c s(c)$. The index set prediction can be computed once in an offline manner.
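A minimal sketch of this soft-voting prediction, assuming L2-normalized embeddings so that dot products equal cosine similarities:

```python
import numpy as np

def soft_vote_predict(query_emb, train_embs, train_labels, k=3):
    """k-NN soft-voting: accumulate the cosine similarities of the k
    nearest train samples per instance-id and predict the id with the
    highest accumulated score. Returns (predicted_id, score)."""
    sims = train_embs @ query_emb          # cosine similarities to train set
    nn_idx = np.argsort(-sims)[:k]         # k nearest neighbours
    scores = {}
    for i in nn_idx:
        c = train_labels[i]
        scores[c] = scores.get(c, 0.0) + float(sims[i])
    pred = max(scores, key=scores.get)
    return pred, scores[pred]
```

Running this once per index-set image yields the offline instance-id predictions; the same function is applied to the query at search time.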

When a query is given, its instance-id is predicted in the same way as described above. Index set samples that are predicted to have the same id as the query sample are treated as “positive samples”, and those with a different id as “negative samples”; both play an important role in our re-ranking approach.

Figure 3: Overview of our re-ranking procedure. “Positive” denotes samples whose predicted id matches the predicted id of the query sample; “Negative” denotes samples whose predicted id differs from it. Re-ranking is performed in each step based on these predicted ids.

Our re-ranking method is illustrated in Figure 3 and consists of a sort-step and an insert-step. The top row in the figure shows a query (in blue) and samples retrieved from the index set by k-NN search, with positive samples shown in green and negative samples shown in red. Here, images on the left are considered more relevant to the query than those on the right. In the sort-step, positive samples are moved to the left of the negative samples in the ranking, maintaining their relative order. This re-ranking step makes the results more reliable, becoming less dependent on factors such as lighting and occlusion.

In the insert-step, we insert positive samples from the entire index set that were not retrieved by the k-NN search, after the re-ranked positive samples, in descending order of their scores calculated from the k-NN cosine similarities. This step enables us to retrieve samples that are visually dissimilar to the query by utilizing the label information of the train set. However, the predicted instance-id is not always reliable, especially when the prediction score is low; if the instance of the query does not exist in the train set, the prediction score tends to be very low. To handle this situation, we do not insert a sample when the sum of the prediction scores of the query and the candidate sample is lower than a threshold.
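Both steps reduce to simple list manipulation, sketched below. The helper is illustrative: `positive_set` holds the retrieved index ids predicted to share the query's instance-id, `candidate_scores` maps every predicted-positive index-set id to its soft-voting score, and the default threshold of 0.6 mirrors the best-performing value in Table 4.

```python
def rerank(retrieved_ids, positive_set, candidate_scores, query_score,
           threshold=0.6):
    """Two-stage discriminative re-ranking sketch.
    Sort-step: positives are moved ahead of negatives, keeping the
    relative order within each group. Insert-step: predicted positives
    missed by the k-NN search are appended after the re-ranked positives
    in descending score order, and skipped when the sum of the query's
    and the candidate's prediction scores falls below the threshold."""
    pos = [i for i in retrieved_ids if i in positive_set]
    neg = [i for i in retrieved_ids if i not in positive_set]
    retrieved = set(retrieved_ids)
    inserts = sorted(
        (i for i in candidate_scores
         if i not in retrieved
         and query_score + candidate_scores[i] >= threshold),
        key=lambda i: -candidate_scores[i])
    return pos + inserts + neg
```

For example, with retrieved ranking [1, 2, 3, 4], positives {2, 4}, and an unretrieved positive 5 with a high score, the output becomes [2, 4, 5, 1, 3].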

Discussion. Our re-ranking method can be applied when a train set exists and some instances overlap between the train set and the index set (database). Although conventional instance image retrieval datasets have no instance overlap between the train set and the index set, in order to measure the generalization performance of methods, instance overlap between them is natural in most real situations. For example, in a potential landmark image search system, some users may upload their landmark photos together with the landmark name, which can serve as a label. Thus, using such meta information is natural and essential to improve search results.

In the evaluation on the GLD-v2 dataset, the train set is not used. Since our re-ranking follows the GLD-v2 evaluation criteria, we constructed the algorithm under the assumption that samples from the train set are not targets of retrieval. However, in an actual retrieval system, the train set can also be considered part of the database to be searched. Even in such cases, our re-ranking extends naturally. Specifically, our re-ranking requires predicting the instance-id of each sample of the index set in advance, but the instance-ids of samples from the train set are already known. Thus, we can use the instance-ids of train set samples directly by setting their prediction score to 1.0. By doing so, our re-ranking can be executed without changing any other step, regardless of whether retrieved samples come from the index set or the train set.

Dataset (train set) # Samples # Labels
GLD-v1 1,225,029 14,951
GLD-v2/2.1 4,132,914 203,094
GLD-v2/2.1 (clean) 1,580,470 81,313
Table 1: Dataset statistics used in our experiments. The index and test images are not included here. GLD-v2 and GLD-v2.1 only differ in the index set and test set and thus are shown together for the train set.

4 Dataset

The Google Landmarks Dataset (GLD) is the largest dataset for instance image retrieval, containing photos of landmarks from all over the world. The photos include many variations, e.g., occlusion and lighting changes. GLD has three versions, v1, v2, and v2.1, whose differences we overview in Table 1. GLD-v1 [32], the first version of GLD, was released in 2018. This dataset has more than 1 million samples and around 15 thousand labels. GLD-v1 was created based on the algorithm described in [53], which uses visual features and GPS coordinates for ground-truth correction. Simultaneously, the Google Landmarks Challenge 2018 was launched, using GLD-v1. The dataset can still be downloaded, but cannot be used for evaluation since its ground-truth was not released. GLD-v2, used for the Google Landmarks Challenge 2019, is the largest worldwide landmark recognition dataset available at the time. This dataset includes over 5 million images of more than 200 thousand different landmarks. It is divided into three sets: train, test, and index. Only samples from the train set are labeled.

Since GLD-v2 was constructed by mining web landmark images without any cleaning step, each category may contain quite diverse samples: for example, images from a museum may contain outdoor images showing the building and indoor images depicting a statue located in the museum. In comparison with GLD-v1, there is significantly more noise in the annotations. GLD-v2.1 is a minor update of GLD-v2; only the ground truth of the test set and index set is updated.

Automated Data Cleaning. The train set of GLD-v2 is very noisy because it was constructed by mining web landmark images without any cleaning step. Furthermore, training with the entire train set of GLD-v2 is complicated due to its huge scale. Therefore, inspired by [15], we automatically remove noise such as mis-annotations, reducing the dataset size and training budget while avoiding the adverse effects of the noise on deep metric learning.

To build a clean train set, we apply spatial verification [35] to images filtered by k-NN search. Specifically, cleaning the train set is a three-step process. First, for each image descriptor $x$ in the train set, we search its 1000 nearest neighbors in the train set; the image descriptors are obtained by our embedding model trained on the GLD-v1 dataset. Second, spatial verification is performed on up to the 100 nearest neighbors assigned to the same label as $x$. For spatial verification, we use RANSAC [14] with an affine transformation and deep local attentive features (DELF) [32]. If the inlier count between $x$ and a nearest-neighbor image descriptor is greater than 30, we consider that neighbor a verified image. Finally, if the count of verified images in the second step reaches a threshold, $x$ is added to the cleaned dataset.
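The three steps above can be sketched as follows. This is a simplified sketch: `count_inliers` stands in for the DELF + affine-RANSAC verification, a brute-force similarity matrix replaces a proper nearest-neighbor index, and the default `verified_thresh` is an illustrative placeholder, not the paper's setting.

```python
import numpy as np

def clean_train_set(embs, labels, count_inliers, n_neighbors=1000,
                    sv_limit=100, inlier_thresh=30, verified_thresh=2):
    """Automated data cleaning sketch. embs: (N, d) global descriptors;
    labels: per-image instance-ids; count_inliers(i, j): caller-supplied
    RANSAC inlier count between the local features of images i and j.
    Returns the indices of images kept in the cleaned train set."""
    sims = embs @ embs.T
    kept = []
    for i in range(len(embs)):
        # Step 1: nearest neighbors of image i within the train set.
        order = np.argsort(-sims[i])
        nns = [j for j in order if j != i][:n_neighbors]
        # Step 2: spatially verify up to sv_limit same-label neighbors.
        same_label = [j for j in nns if labels[j] == labels[i]][:sv_limit]
        verified = sum(1 for j in same_label
                       if count_inliers(i, j) > inlier_thresh)
        # Step 3: keep the image if enough neighbors were verified.
        if verified >= verified_thresh:
            kept.append(i)
    return kept
```

At GLD-v2 scale the brute-force similarity matrix would be replaced by an approximate nearest-neighbor index, but the filtering logic is the same.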

This automated data cleaning is very costly due to the use of spatial verification; however, it only has to be run once. Table 1 summarizes the statistics of the datasets used in our experiments. We show the effectiveness of the cleaned dataset through the experiments in the following sections.

Method | GLD-v2: mAP@100 (Private, Public) | GLD-v2.1 Private: mAP@100, P@10, MeanPos | GLD-v2.1 Public: mAP@100, P@10, MeanPos
k-NN search 30.22 27.81 29.63 30.76 27.02 27.66 28.87 32.60
SP [37] 23.75 22.40 23.29 24.72 28.72 22.15 23.46 33.18
AQE [8] 32.17 30.47 31.60 32.97 27.44 30.28 31.35 31.54
αQE [38] 32.21 30.34 31.71 33.04 26.67 30.23 31.03 31.13
Iscen et al.’s DFS [22] 32.01 30.55 31.91 32.51 29.52 30.81 30.50 33.23
Yang et al.’s DFS [50] 31.20 29.36 30.90 31.48 29.87 29.29 29.63 33.83
EGT [5] 30.33 28.44 31.00 32.89 34.82 29.77 30.74 38.19
Ours 36.85 34.89 36.04 36.27 24.43 34.41 33.40 29.23
Ours + αQE 37.34 35.59 36.55 36.68 24.44 35.12 33.85 28.11
Table 2: Comparison of our re-ranking against other state-of-the-art re-ranking methods on top of our baseline. We report mAP@100 on GLD-v2 and mAP@100, P@10, and MeanPos on GLD-v2.1. mAP@100 is the mean average precision at rank 100. P@10 is the mean precision at rank 10; higher is better. MeanPos is the mean position of the first relevant image (101 if no relevant image is in the top-100); lower is better.
Figure 4: Two examples of top-3 retrieved results from GLD-v2.1 using αQE, EGT, and our approach. Query images are in blue, correct samples are in green and incorrect samples are in red. Best viewed in color.
Figure 5: Two examples of top-3 retrieved results from GLD-v2.1, improved by using our re-ranking. The first row is the result of -NN search, and the second row is the result after the sort-step, and the third row is the result after the insert-step, including the sort-step. Query images are in blue, correct samples are in green and incorrect samples are in red. Best viewed in color.

5 Experiments

5.1 Implementation Details

We pre-train the model on ImageNet [10] and the train set of GLD-v1 [32] first, before training on the cleaned GLD-v2 train set with a cosine softmax loss. We use GeM pooling and a 512-dimensional embedding space. We use a margin of 0.3 for the ArcFace loss together with the weight regularization term. For re-ranking, we use k = 3 and a threshold of 0.6 for the k-NN soft-voting, the best-performing setting in Table 4.

We train each network for 5 epochs with commonly used data augmentation methods such as brightness shift, random cropping, and scaling. In particular, images are randomly scaled between 80% and 120% of their original size, and then either cropping or zero-padding is used to return the image to the original resolution, depending on whether the image was downscaled or upscaled. Brightness is randomly modified by 0% to 10%. When constructing mini-batches for training, the images are resized to the same size for efficient training. This can distort the input images, degrading the accuracy of the network [16]. To avoid this, we choose mini-batch samples so that they have similar aspect ratios and resize them to a common size, determined by selecting a tuple of width and height from a predefined set depending on their aspect ratio.

Model training is done using stochastic gradient descent with momentum, where the initial learning rate, momentum, and batch size are set to 0.001, 0.9, and 32, respectively. The cosine annealing [29] learning rate scheduler is used during training.

For the other approaches we compare to, we follow the settings described in their respective papers. However, we changed some hyperparameters that were found to give non-competitive results. In particular, spatial verification (SP) follows the procedure from [37], except that DELF [32] trained on GLD-v1 is used as the local descriptor. In AQE [8] and αQE [38], the number of retrieved results used for query expansion is set to 10, including the query itself. The α of αQE is set to 3.0. Unlike [8], SP is not used to filter the samples for constructing the new query in αQE. For Iscen et al.'s diffusion (DFS) [22] and Yang et al.'s diffusion (DFS) [50], the default hyperparameters are used. The threshold of EGT [5] is likewise tuned. The hyperparameters of each method are tuned using the GLD-v2 Public split.

We use the multi-scale feature extraction described in [15] at test time in all experiments. The resulting features are averaged and re-normalized.
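A sketch of this test-time procedure, where `extract(image, scale)` is a placeholder for the model forward pass at a given scale and the scale set is an illustrative assumption:

```python
import numpy as np

def multi_scale_descriptor(extract, image, scales=(0.5, 1.0, 1.5)):
    """Multi-scale feature extraction as in [15]: embed the image at
    several scales, average the per-scale descriptors, and L2
    re-normalize the result."""
    descs = np.stack([extract(image, s) for s in scales])
    avg = descs.mean(axis=0)
    return avg / np.linalg.norm(avg)
```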

5.2 Evaluation Protocol

We use the Google Landmarks Dataset (GLD) [32], Oxford-5K [37], and Paris-6K [37] for our experiments. GLD-v1 and GLD-v2 have three data splits: train, index, and test. The train sets of GLD-v1 and GLD-v2 are used for training; additionally, the train set of GLD-v2 is used as the train set for re-ranking. The index and test sets of GLD-v2 and GLD-v2.1 are used for evaluation. The index and test sets of GLD-v1 are not used for evaluation, since we cannot obtain the GLD-v1 ground-truth or use its evaluation server. Note that evaluation on GLD-v2 is performed on the evaluation server of the competition page, which reports only mAP@100. We report results on two splits, “Private” and “Public”; the Private split accounts for 67% and the Public split for 33% of GLD-v2 and GLD-v2.1, respectively.

Additionally, Oxford-5K [37] and Paris-6K [37] are used for the evaluation of loss functions and the dataset comparison. Oxford-5K and Paris-6K here refer to the revisited versions [37] of Oxford [35] and Paris [36]. We follow the Hard evaluation protocol [37].

5.3 Comparison with Other Re-ranking Methods

We evaluate our re-ranking method and other state-of-the-art re-ranking methods on top of our baseline in Table 2, evaluating on the GLD-v2 and GLD-v2.1 datasets. The baseline consists of the results retrieved by k-NN search using descriptors extracted by our trained model. Surprisingly, spatial verification (SP) [37] harms performance drastically, contrary to common wisdom in instance image retrieval. After visual inspection of the SP results, we hypothesize that this is caused by the large number of instances that are very similar: in many cases, the RANSAC inlier count is artificially inflated by the geometric consistency of partial regions even between different instances, degrading accuracy as a result.

Experimental results show that our approach outperforms the previous re-ranking approaches on the challenging GLD dataset. Furthermore, a combination of ours and αQE boosts the performance further, suggesting that our re-ranking method can be combined with existing re-ranking methods to further improve performance. A qualitative comparison with other approaches is shown in Fig. 4. We can see that our re-ranking retrieves samples that share no visual clue with the query; these samples fail to be retrieved by αQE and EGT.

5.4 Ablation Study

We perform an ablation study, reported in Table 3, to validate each step of our re-ranking approach. Both the sort-step and the insert-step significantly improve results with respect to the k-NN search-only baseline.

Description Private Public
Baseline 30.22 27.81
+ Sort-step 33.79 30.91
+ Insert-step 36.85 34.89
Table 3: Ablation study of each step of our re-ranking. We show the effect of adding the sort-step and both the sort-step and insert-step with respect to our strong baseline on the GLD-v2 dataset.

Additionally, we show the top-3 ranked results after each step in Fig. 5. The baseline k-NN search retrieves visually similar images regardless of whether they show the same landmark as the query image. After each step, correct images that were not ranked at the top due to visual dissimilarity are promoted to higher ranks.

We test the effect of the k-NN hyperparameter k and the insert threshold t, and report the results in Table 4. Our re-ranking approach is fairly insensitive to the setting of these hyperparameters.

k t Private Public
1 0.0 35.35 33.28
1 0.6 35.35 33.28
1 1.2 35.36 33.28
3 0.0 36.77 34.88
3 0.6 36.85 34.89
3 1.2 35.76 33.12
5 0.0 35.78 33.98
5 0.6 35.88 34.12
5 1.2 34.68 32.09
Table 4: Effect of the two-stage discriminative re-ranking hyperparameters on the GLD-v2 dataset: the k used for k-NN search and the insert threshold t.

5.5 Comparison of Loss Functions

Table 5 compares loss functions when trained on GLD-v1. ResNet-101 [17] is used as the backbone network in all loss function experiments. For the triplet loss and AP loss, we use the implementation described in [38], which offers a state-of-the-art global descriptor model. For CosFace [48] and ArcFace [11], we use the model described in Section 5.1 with a margin of 0.3. Note that we do not use supervised whitening in the CosFace and ArcFace experiments for the sake of simplicity. We set the dimension of the global descriptor to 2048 for the triplet loss and AP loss, following the settings of [38, 41], and to 512 for CosFace and ArcFace.

Loss Private Public Oxf Par
TripletLoss [38] 18.94 17.14 43.61 61.39
AP Loss [41] 18.71 16.30 40.87 61.62
CosFace [48] 21.35 18.41 44.78 62.95
ArcFace [11] 20.74 18.13 46.25 66.62
Table 5: Comparison of loss functions. We train with GLD-v1 and use ResNet-101 [17] for the loss function comparison. We report mAP@100 on GLD-v2 (Private and Public splits) and mAP on Hard evaluation protocol of Oxford-5K [37] (Oxf) and Paris-6K [37] (Par).
Dataset (GLD) Private Public Oxf Par
v1 20.74 18.13 46.25 66.62
v2 27.81 24.97 54.81 74.40
v2-clean 28.83 26.86 58.94 78.13
v1 + v2 29.20 26.84 56.59 77.35
v1 + v2-clean 30.22 27.81 59.93 77.82
Table 6: Evaluation of the effect of the training dataset. We use the ArcFace loss with a ResNet-101 model for this experiment. We report mAP@100 on GLD-v2 (Private and Public splits) and mAP on Hard evaluation protocol of Oxford-5K [37] (Oxf) and Paris-6K [37] (Par).

Although it is hard to compare the loss functions entirely fairly due to implementation differences, CosFace and ArcFace outperform the triplet loss and AP loss on multiple benchmarks. CosFace outperforms ArcFace on the Private and Public sets of GLD-v2, while ArcFace outperforms CosFace on the other metrics.

5.6 Datasets

We perform experiments to validate the influence of the training dataset. Table 6 shows the results for various dataset combinations. “v1” denotes the train set of GLD-v1, and “v2” denotes the train set of GLD-v2. “v2-clean” is the GLD-v2 train set cleaned by the automated procedure described in Section 4. We find that training with v2 significantly increases performance with respect to v1. Training on v2-clean outperforms training on v2, both with and without v1 pre-training, despite the train set shrinking to less than 40% of its original size (Table 1). Using v2-clean with v1 pre-training gives the best results overall.

6 Conclusion

We have presented an efficient pipeline for the retrieval of landmark images from large datasets. Building on recent approaches, we propose a two-stage discriminative re-ranking method that shows significant improvements with respect to existing approaches. In-depth experimental results corroborate the efficacy of our approach.


  • [1] R. Arandjelovic, P. Gronát, A. Torii, T. Pajdla, and J. Sivic (2018) NetVLAD: CNN architecture for weakly supervised place recognition. TPAMI 40 (6), pp. 1437–1451. Cited by: §2, §2.
  • [2] R. Arandjelovic and A. Zisserman (2012) Three things everyone should know to improve object retrieval. In CVPR, pp. 2911–2918. Cited by: §2, §2.
  • [3] A. Babenko and V. S. Lempitsky (2015) Aggregating local deep features for image retrieval. In ICCV, pp. 1269–1277. Cited by: §2, §2.
  • [4] H. Bay, T. Tuytelaars, and L. V. Gool (2006) SURF: speeded up robust features. In ECCV, pp. 404–417. Cited by: §2.
  • [5] C. Chang, G. Yu, C. Liu, and M. Volkovs (2019) Explore-exploit graph traversal for image retrieval. In CVPR, Cited by: §2, §2, Table 2, §5.1.
  • [6] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546. Cited by: §2.
  • [7] O. Chum, A. Mikulík, M. Perdoch, and J. Matas (2011) Total recall II: query expansion revisited. In CVPR, pp. 889–896. Cited by: §2.
  • [8] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. In ICCV, pp. 1–8. Cited by: §2, Table 2, §5.1.
  • [9] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray (2004) Visual categorization with bags of keypoints. In ECCVW, Cited by: §2.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §5.1.
  • [11] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699. Cited by: §2, §3.1, §5.5, Table 5.
  • [12] M. Donoser and H. Bischof (2013) Diffusion processes for retrieval revisited. In CVPR, pp. 1320–1327. Cited by: §2.
  • [13] R. Filip, T. Giorgos, and C. Ondřej (2016) CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples. In ECCV, pp. 3–20. Cited by: §2.
  • [14] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 (6), pp. 381–395. Cited by: §2, §4.
  • [15] A. Gordo, J. Almazán, J. Revaud, and D. Larlus (2017) End-to-end learning of deep visual representations for image retrieval. IJCV 124 (2), pp. 237–254. Cited by: §2, §2, §4, §5.1.
  • [16] J. Hao, J. Dong, W. Wang, and T. Tan (2016) What is the best practice for cnns applied to visual instance retrieval?. arXiv:1611.01640. Cited by: §5.1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. External Links: Document, ISBN 978-1-4673-8851-1 Cited by: §3.1, §5.5, Table 5.
  • [18] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In ICLRW, Cited by: §2.
  • [19] S. S. Husain and M. Bober (2019) REMAP: multi-layer entropy-guided pooling of dense cnn features for image retrieval. TIP 28 (10), pp. 5201–5213. Cited by: §2.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: §3.1.
  • [21] A. Iscen, Y. Avrithis, G. Tolias, T. Furon, and O. Chum (2018) Fast spectral ranking for similarity search. In CVPR, pp. 7632–7641. Cited by: §2.
  • [22] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum (2017) Efficient diffusion on region manifolds: recovering small objects with compact CNN representations. In CVPR, pp. 926–935. Cited by: §2, Table 2, §5.1.
  • [23] H. Jégou, M. Douze, and C. Schmid (2008) Hamming embedding and weak geometric consistency for large scale image search. In ECCV, pp. 304–317. Cited by: §2.
  • [24] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid (2012) Aggregating local image descriptors into compact codes. TPAMI 34 (9), pp. 1704–1716. Cited by: §2.
  • [25] Y. Kalantidis, C. Mellina, and S. Osindero (2016) Cross-dimensional weighting for aggregated deep convolutional features. In ECCVW, pp. 685–701. Cited by: §2, §2.
  • [26] Z. Lin, Z. Yang, F. Huang, and J. Chen (2018) Regional maximum activations of convolutions with attention for cross-domain beauty and personal care product retrieval. In ACMMM, pp. 2073–2077. Cited by: §2, §2.
  • [27] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR, pp. 6738–6746. Cited by: §2.
  • [28] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, pp. 507–516. Cited by: §2.
  • [29] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §5.1.
  • [30] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. IJCV 60 (2), pp. 91–110. Cited by: §2.
  • [31] I. Masi, Y. Wu, T. Hassner, and P. Natarajan (2018) Deep face recognition: A survey. In SIBGRAPI, pp. 471–478. Cited by: §2.
  • [32] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han (2017) Large-scale image retrieval with attentive deep local features. In ICCV, pp. 3476–3485. Cited by: §1, §2, §4, §4, §5.1, §5.1, §5.2.
  • [33] M. Perdoch, O. Chum, and J. Matas (2009) Efficient representation of local geometry for large scale object retrieval. In CVPR, pp. 9–16. Cited by: §2.
  • [34] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier (2010) Large-scale image retrieval with compressed fisher vectors. In CVPR, pp. 3384–3391. Cited by: §2.
  • [35] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman (2007) Object retrieval with large vocabularies and fast spatial matching. In CVPR, Cited by: §2, §2, §4, §5.2.
  • [36] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In CVPR, Cited by: §5.2.
  • [37] F. Radenovic, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum (2018) Revisiting oxford and paris: large-scale image retrieval benchmarking. In CVPR, pp. 5706–5715. Cited by: §2, Table 2, §5.1, §5.2, §5.2, §5.3, Table 5, Table 6.
  • [38] F. Radenovic, G. Tolias, and O. Chum (2018) Fine-tuning cnn image retrieval with no human annotation. TPAMI. Cited by: §2, §2, §2, §2, §3.1, Table 2, §5.1, §5.5, Table 5.
  • [39] R. Ranjan, C. D. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507. Cited by: §2.
  • [40] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In CVPRW, pp. 512–519. Cited by: §2, §2.
  • [41] J. Revaud, J. Almazán, R. S. de Rezende, and C. R. de Souza (2019) Learning with average precision: training image retrieval with a listwise loss. In ICCV, Cited by: §2, §2, §5.5, Table 5.
  • [42] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 926–935. Cited by: §2.
  • [43] J. Sivic and A. Zisserman (2003) Video google: A text retrieval approach to object matching in videos. In ICCV, pp. 1470–1477. Cited by: §2.
  • [44] M. Teichmann, A. Araujo, M. Zhu, and J. Sim (2019) Detect-to-retrieve: efficient regional aggregation for image search. In CVPR, pp. 5109–5118. Cited by: §2.
  • [45] G. Tolias, Y. Avrithis, and H. Jégou (2016) Image search with selective match kernels: aggregation across single and multiple images. IJCV 116 (3), pp. 247–261. Cited by: §2.
  • [46] G. Tolias and H. Jégou (2014) Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern Recognition 47 (10), pp. 3466–3476. Cited by: §2.
  • [47] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. SPL 25 (7), pp. 926–930. Cited by: §2.
  • [48] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274. Cited by: §2, §5.5, Table 5.
  • [49] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In CVPR, pp. 1386–1393. Cited by: §2.
  • [50] F. Yang, R. Hinami, Y. Matsui, S. Ly, and S. Satoh (2019) Efficient image retrieval via decoupling diffusion into online and offline processing. In AAAI, Cited by: §2, §2, Table 2, §5.1.
  • [51] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li (2019)

    AdaCos: adaptively scaling cosine logits for effectively learning deep face representations

    In CVPR, pp. 10823–10832. Cited by: §2.
  • [52] L. Zheng, Y. Yang, and Q. Tian (2018) SIFT meets CNN: A decade survey of instance retrieval. TPAMI 40 (5), pp. 1224–1244. Cited by: §2.
  • [53] Y. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T. Chua, and H. Neven (2009) Tour the world: building a web-scale landmark recognition engine. In CVPR, pp. 1085–1092. Cited by: §4.