Geometric Image Correspondence Verification by Dense Pixel Matching

04/15/2019 ∙ by Zakaria Laskar, et al.

This paper addresses the problem of determining dense pixel correspondences between two images and its application to geometric correspondence verification in image retrieval. The main contribution is a geometric correspondence verification approach for re-ranking a shortlist of retrieved database images based on their dense pair-wise matching with the query image at a pixel level. We determine a set of cyclically consistent dense pixel matches between the pair of images and evaluate local similarity of matched pixels using neural network based image descriptors. Final re-ranking is based on a novel similarity function, which fuses the local similarity metric with a global similarity metric and a geometric consistency measure computed for the matched pixels. For dense matching our approach utilizes a modified version of a recently proposed dense geometric correspondence network (DGC-Net), which we also improve by optimizing the architecture. The proposed model and similarity metric compare favourably to the state-of-the-art image retrieval methods. In addition, we apply our method to the problem of long-term visual localization demonstrating promising results and generalization across datasets.




1 Introduction

Image retrieval is a well-studied problem in computer vision and robotics, with applications in place recognition [4, 10, 32], localization [18, 32, 35], autonomous driving [22], and virtual reality [25], among many others. Given a query image, the image retrieval pipeline returns a list of database images ranked by their relevance to the query image. As raw pixels are not a good representation, extensive research has gone into finding discriminative and efficient image representations. The seminal work of Sivic and Zisserman [33] proposed a Bag-of-Words based image representation using SIFT [19]. Later, more advanced and efficient representations were proposed in the form of VLAD [15] descriptors and Fisher vectors [29]. More recently, off-the-shelf [2, 16, 17] and fine-tuned [1, 9, 26] convolutional neural network (CNN) representations have demonstrated great success in image retrieval. These models encode an input image into a global vector representation, which leads to efficient retrieval because a simple dot product can serve as the similarity measure for finding relevant database images. Once fine-tuned on auxiliary datasets with a distribution similar to the target one, those methods have achieved state-of-the-art image retrieval performance [1, 9, 26]. However, the main limitation of such fine-tuned CNN representations is their generalization capability, which is crucial in the context of city-scale localization where the database images can be quite similar in structure and appearance. Moreover, variations in illumination (night-time queries) or occlusion can significantly affect the encoded global representations, degrading retrieval performance due to the lack of spatial information.
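The dot-product retrieval step described above can be illustrated with a minimal sketch. It assumes the global descriptors are already available as NumPy arrays; the function name `rank_by_dot_product` and the toy descriptors are hypothetical and only stand in for the output of a network such as NetVLAD.

```python
import numpy as np

def rank_by_dot_product(query_desc, db_descs, top_k):
    """Rank database descriptors by dot-product similarity to the query.

    query_desc: (D,) global descriptor of the query image.
    db_descs:   (N, D) global descriptors of the database images.
    Returns the indices of the top_k database images and their scores.
    """
    # L2-normalise so that the dot product equals the cosine similarity.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    scores = db @ q
    # Sort in descending order of similarity and keep the shortlist.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 2-D descriptors (illustrative values only).
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])
order, scores = rank_by_dot_product(query, db, top_k=2)
```

For L2-normalized descriptors, ranking by descending dot product gives the same order as ranking by ascending Euclidean distance, so either measure can produce the initial shortlist.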

In this paper we leverage advances in spatial geometry to obtain a better ranking of the database images. To this end, we revisit the geometric verification problem in the context of image retrieval. That is, given an initial ranked list of database images returned by a CNN model (NetVLAD), we seek to re-rank a shortlist of images by using dense pixel correspondences [24], which are verified by the proposed similarity functions. Previously, DGC-Net [24] has been successfully applied only to positive image pairs, i.e. pairs with overlapping fields of view. In this work we extend its applicability to verifying both positive and negative image pairs in the framework of geometric verification. That is, we demonstrate how dense pixel correspondence methods such as DGC-Net can be used to improve image retrieval by geometric verification.

In summary, the contributions of this work are threefold. First, we improve the baseline DGC-Net by constraining the matching layer to be locally and globally consistent. Second, we replace multiple decoders of the original DGC-Net architecture by the proposed universal decoder, which can be shared for feature maps in different layers of the feature pyramid of DGC-Net. Third, we formulate two similarity functions, which first rank the shortlisted database images based on structural similarity and then re-rank them using appearance based similarity.

Figure 2: Overview of the proposed pipeline. Given a query image, we first rank the database images based on global similarity (using NetVLAD). In the next stage dense pixel correspondences are computed between the query and top ranked database images. These correspondences are then verified by the proposed similarity functions utilizing geometry and CNN based image descriptors to re-rank database images according to the input query. See Sec. 4 and 5 for more details.

2 Related work

This work is closely related to image retrieval and image matching tasks. We provide a brief overview of existing approaches below.

Image retrieval methods can be broadly divided into two categories: local descriptors [6, 13, 14, 20, 33] and global representations [1, 9, 26]. The approaches of the first category are based on either hand-engineered features such as SIFT [19] or CNN descriptors learnt on the task of local image patch matching [23, 37]. Similarly, global representation methods can be further divided into traditional hand-designed descriptors such as VLAD [15], Fisher Vectors [29] and Bag-of-Words [33], and CNN-based methods [1, 2, 9, 26]. Babenko et al. [2] demonstrate that the performance of off-the-shelf CNN models pre-trained on ImageNet [7] falls behind traditional local descriptors. However, when trained on an auxiliary dataset, the performance improves over such hand-engineered descriptors [1, 9, 26].

In addition to the standard retrieval approaches, there are several methods that attempt to explain the similarity between the query and top-ranked database images using a geometric model [5, 21, 35]. The geometric model is estimated by fitting a simple transformation model (planar homography) to the correspondence set obtained using local descriptors such as SIFT, or off-the-shelf CNN descriptors [35]. In this work, we also use pre-trained CNN descriptors. However, in contrast to [35], which uses exhaustive nearest-neighbor search in descriptor space, we model the similarity using a learnt convolutional decoder. Moreover, [35] only uses a coarse correspondence estimate, while our similarity decoder allows fine, high-resolution pixel-level correspondence estimation. This is particularly important in city-scale localization due to the subtle differences within an overall similar architectural style observed in this scenario (Fig. 7).

Image matching. This task relates to the optical flow estimation problem. Recently proposed optical flow methods [12, 34] utilize a local correlation layer that performs spatially constrained matching in a coarse-to-fine manner. DGC-Net [24] extends this process of learning iterative refinement of pixel correspondences by using a global correlation layer to handle wide viewpoint changes in the task of instance matching. Such a global correlation layer for instance matching has been used to estimate geometric transformations [27]. Melekhov et al. [24] demonstrate that such a method falls behind dense correspondence approaches due to the constrained range of transformations estimated by [27]. Recently, Rocco et al. [28] proposed a locally and globally constrained matching network on top of the global correlation layer, which leads to improvements in instance and semantic matching. However, such a global correlation layer can only provide coarse correspondence estimates.

3 Method overview

Our contributions are related to the two last stages of the following three-stage image retrieval pipeline: 1) Given a query image, we retrieve a shortlist of relevant database images using a fast and scalable retrieval method based on representing images with a descriptor vector; 2) We perform dense pixel matching between the query and each shortlisted database image in a pairwise manner using a correspondence estimation network; 3) We determine a set of cyclically consistent dense pixel matches for each image pair and use them to compute a similarity metric, which provides the final re-ranking of the shortlist.

The particular architecture of the aforementioned retrieval pipeline used in this work is illustrated in Fig. 2. That is, we use NetVLAD [1] for the first stage, our own modified version of DGC-Net [24] for the second stage, and the proposed approach with a novel similarity metric for the third stage. NetVLAD is used for retrieval here, but other global image-level descriptors could be used instead.

Our contributions related to stages 2) and 3) above are described in the following sections. The geometric verification method is presented in Section 4 and our modifications to the DGC-Net architecture are described in Section 5.

4 Geometric verification

Dense pixel correspondences produced by [24] do not take into account the underlying model explaining the 3D structure of the scene by the image pair. RANSAC [8] has been a popular method of choice to find the set of inliers from the whole correspondence set. However, dense pixel correspondences predicted by CNNs [24] are locally smooth due to the shared convolutional filters at different layers. As a result, RANSAC usually finds a large set of inliers even for non-matching image pairs. We propose two methods to eliminate these limitations in the following sections. That is, given an initial ranked shortlist of database images based on global representation similarity with the query image, we re-rank a new shortlist through a series of geometric verification steps (Sec. 4.1 and 4.2).

4.1 Cyclically consistent geometry

First, we determine a set of cyclically consistent pixel matches for an image pair by running our DGC-Net variant on both ordered pairs (query to database image and database image to query), so that we obtain dense correspondence maps in both directions. The cyclically consistent matches are those matches for which the combined forward and backward mapping is close to an identity mapping.
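The cyclic consistency test can be sketched as follows. This is only an illustration under stated assumptions: both images share the same resolution, each correspondence map stores absolute `(x, y)` target coordinates per pixel, the backward map is sampled by nearest-neighbour lookup, and "close to identity" is taken to mean a round-trip error of at most `tol` pixels. The function name and the tolerance value are hypothetical.

```python
import numpy as np

def cyclic_consistency_mask(fwd, bwd, tol=1.0):
    """Boolean (H, W) mask of cyclically consistent pixels.

    fwd[y, x] = (x', y') is the location in image B matched to pixel (x, y)
    of image A; bwd is the analogous map from B to A.
    """
    h, w = fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup of the backward map at the forward targets.
    tx = np.rint(fwd[..., 0]).astype(int)
    ty = np.rint(fwd[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    back = bwd[np.clip(ty, 0, h - 1), np.clip(tx, 0, w - 1)]
    # Round-trip error: distance between A -> B -> A and the starting pixel.
    err = np.hypot(back[..., 0] - xs, back[..., 1] - ys)
    return inside & (err <= tol)

# Toy example: a one-pixel horizontal shift and its exact inverse.
ys, xs = np.mgrid[0:3, 0:4]
identity = np.stack([xs, ys], axis=-1).astype(float)
mask_identity = cyclic_consistency_mask(identity, identity)
mask_shift = cyclic_consistency_mask(identity + [1.0, 0.0], identity - [1.0, 0.0])
```

In the shifted example, every pixel except those in the last column maps back to itself; the last column is mapped outside image B and is therefore rejected.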

The number of cyclically consistent pixel matches alone is not a sufficient condition to verify image similarity. Unstructured regions in image pairs, such as the sky or roads and pavements, also produce cyclically consistent pixel matches. These regions can occupy a significant percentage of the image area, resulting in a large number of cyclically consistent pixel matches.

Thus, we formulate a similarity cost function that combines RANSAC-based geometric model estimation with cyclic consistency. Given a dense pixel correspondence map, RANSAC outputs a set of inliers to a transformation model (planar homography). We then estimate the subset of inliers that are cyclically consistent, using the forward and backward correspondence maps predicted by our network. The intuition is that RANSAC estimation will eliminate pixels from unstructured regions in images as outliers. For geometrically dissimilar images, the cyclic consistency constraint will further reduce the number of inliers, as the assumption here is that the transformation model obtained by RANSAC may be inconsistent in either the forward or the backward direction. We define a similarity function based on the number of inliers and the number of cyclically consistent inliers.


The similarity function involves the image resolution and an exponential term. The exponential term is added to down-weight the similarity cost for image pairs which have fewer cyclically consistent correspondences in the inlier set. The similarity is computed in both directions, and the final similarity is the maximum of the two values. The shortlist is re-ranked using this similarity, resulting in a new shortlist.

Figure 3: Overview of the dense pixel correspondence network. A pre-trained VGG-16 network is used to create a multi-scale feature pyramid of the input image pair. Correlation and neighborhood consensus layers use features from the top level of the pyramid to establish coarse pixel correspondences, which are then refined by the proposed unified correspondence map decoder (UCMD). In contrast to DGC-Net [24] with multiple decoders, UCMD can be applied to each level of the multi-scale feature pyramid seamlessly, leading to a smaller memory footprint.

4.2 Global and local similarity

Using the geometry-based similarity function to re-rank the shortlist typically improves retrieval accuracy, but the retrieved list may still contain outliers, as the global and local appearance similarity is not directly taken into account when computing it. Hence, the top-ranked database images in the geometrically verified shortlist are passed through a second similarity function based on global and local descriptor similarity. The second similarity function, detailed below, is more costly to evaluate because it requires dense image feature extraction. We therefore use a two-stage re-ranking, where the second re-ranking is done only for a subset of top-ranked images from the first stage.

To obtain the global similarity we use normalized global descriptors from a pre-trained NetVLAD network [1]. The network was originally trained to learn powerful representations for image retrieval. The Euclidean distance between the global representations is defined as the global similarity value. To compute the local similarity, we extract hypercolumn [11] features from different NetVLAD layers, L2-normalize them and concatenate them along the channel dimension. The final features are again L2-normalized, resulting in dense feature maps whose spatial size is the image resolution and whose depth is the final descriptor length. The local descriptor similarity is then obtained as:


Here, the inner product is taken between the hypercolumn NetVLAD features at each location of the warped source and target feature maps, and a mask containing 1s at cyclically consistent pixels restricts the computation to those locations. Thus, Eq. 2 computes the cosine similarity between normalized warped source and target hypercolumn descriptors at cyclically consistent pixel locations.
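A minimal sketch of such a masked cosine similarity is given below. The normalization by the number of masked pixels (i.e. taking the mean) is an assumption made for illustration and is not taken from the paper's equation; the function name `local_similarity` and the random toy features are likewise hypothetical.

```python
import numpy as np

def local_similarity(warped_src, tgt, mask, eps=1e-12):
    """Mean cosine similarity between descriptors at masked pixels.

    warped_src, tgt: (H, W, D) dense descriptor maps.
    mask:            (H, W) array with 1 at cyclically consistent pixels.
    """
    # L2-normalise each descriptor so the inner product is a cosine.
    src_n = warped_src / (np.linalg.norm(warped_src, axis=-1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=-1, keepdims=True) + eps)
    cos = np.sum(src_n * tgt_n, axis=-1)
    m = mask.astype(float)
    count = m.sum()
    if count == 0:
        return 0.0  # no consistent pixels: treat as no local support
    return float((cos * m).sum() / count)

# Toy example: descriptors agree on the masked half and are opposite elsewhere.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(4, 5, 8))
mask = np.zeros((4, 5))
mask[:2] = 1
src = np.where(mask[..., None] == 1, tgt, -tgt)
sim_masked = local_similarity(src, tgt, mask)
sim_full = local_similarity(src, tgt, np.ones((4, 5)))
```

The toy example shows why the mask matters: restricted to consistent pixels the similarity is close to 1, while averaging over all pixels lets the mismatched half cancel it out.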

The final similarity function between the images of a pair is a function of the global and local similarities:


Here, the local similarity score is weighted by the structural similarity score of Sec. 4.1. We use this final similarity to re-rank the top-ranked images in the geometrically verified shortlist to get the final shortlist for a given query.

5 Pixel correspondence estimation

To obtain dense matching between two images we use a CNN based on the architecture of DGC-Net [24]. In this section, we describe two modifications to DGC-Net that lead to a more compact yet effective model.

Figure 4: Overview of the unified correspondence map decoder (UCMD). The feature maps of the target and the warped source images have been split into several tensors and then concatenated along the channel dimension. Further, each tensor is complemented by the correspondence map estimates (omitted from the figure for clarity) and then fed into a convolutional block with shared weights. The output feature maps of this block are then averaged and processed by the remaining layers of the decoder to produce refined pixel correspondence estimates.

5.1 Unified correspondence map decoder

In general, DGC-Net consists of a series of convolutional layers and activation functions acting as a multi-layer encoder. Each image of an input pair is fed into the encoder independently to obtain a multi-resolution feature pyramid, where each level is the feature map at the output of the corresponding encoder layer. The encoded feature maps at the top level of the pyramid are passed through a global correlation layer that computes the exhaustive pairwise cosine similarity between features. The output of the correlation layer is then passed through a decoder that estimates the initial correspondence map at the same resolution as the top-level feature maps. This estimate is then iteratively refined by a series of decoders to obtain the final correspondence grid at the same resolution as the input images. Each decoder takes as input the upsampled correspondence map estimated by the previous decoder, together with the warped source and target feature maps at the current pyramid level. However, since the feature maps at different levels of the pyramid have different numbers of channels, each decoder has a different structure, which leads to increased memory costs.

In this work, we propose a unified correspondence map decoder (UCMD), illustrated in Fig. 4. The unified decoder behaves like a recursive refinement function that operates on feature maps across different levels of the feature pyramid. More specifically, we divide the concatenated input feature maps along the channel dimension into non-overlapping components. Each feature component is concatenated with the upsampled correspondence map along the channel dimension and fed into the decoder. In practice, rather than propagating the concatenated feature components through the whole decoder, we feed each of them into the first convolutional block (CB0) of the decoder, as shown in Fig. 4. The resulting feature maps at the output of CB0 are subsequently averaged and passed through the remaining layers to obtain refined correspondence estimates. Thus, this process can be seen as a convolution of the universal refinement function across the component channels of the concatenated input feature maps.

The number of inputs of CB0 is determined by the number of channels in the feature-map components that are concatenated along the channel dimension, plus 2 additional channels. The additional 2 channels consist of the upsampled coarser pixel correspondence map estimate from the previous level of the pyramid. The number of components therefore depends on the dimensionality of the feature maps at the current layer.
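The channel-splitting idea can be sketched in NumPy as follows. This is a toy illustration, not the actual decoder: CB0 is replaced by a single shared 1x1 convolution followed by a ReLU (an assumption, since the real block's layers are not specified here), and the names `ucmd_first_block`, `k` and `weight` are hypothetical. What it is meant to show is that one set of weights can serve pyramid levels with different channel counts.

```python
import numpy as np

def ucmd_first_block(feat_tgt, feat_src_warped, corr_up, weight, k):
    """Apply one shared block to k-channel components and average the outputs.

    feat_tgt, feat_src_warped: (C, H, W) feature maps at one pyramid level.
    corr_up: (2, H, W) upsampled correspondence map from the coarser level.
    weight:  (C_out, k + 2) weights of the stand-in shared block.
    """
    feats = np.concatenate([feat_tgt, feat_src_warped], axis=0)  # (2C, H, W)
    assert feats.shape[0] % k == 0, "channel count must be divisible by k"
    components = np.split(feats, feats.shape[0] // k, axis=0)
    outputs = []
    for comp in components:
        x = np.concatenate([comp, corr_up], axis=0)    # (k + 2, H, W)
        y = np.einsum("oc,chw->ohw", weight, x)        # shared 1x1 convolution
        outputs.append(np.maximum(y, 0.0))             # ReLU
    # Averaging makes the output shape independent of the number of components.
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
k, c_out = 4, 16
weight = rng.normal(size=(c_out, k + 2))
# Two pyramid levels with different channel counts and resolutions.
out_coarse = ucmd_first_block(rng.normal(size=(8, 6, 6)), rng.normal(size=(8, 6, 6)),
                              rng.normal(size=(2, 6, 6)), weight, k)
out_fine = ucmd_first_block(rng.normal(size=(4, 12, 12)), rng.normal(size=(4, 12, 12)),
                            rng.normal(size=(2, 12, 12)), weight, k)
```

Both calls reuse the same `weight`, which is the property that removes the need for a separately shaped decoder per pyramid level.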

Inference. During the testing phase, apart from evaluating the trained network directly, we additionally follow a second strategy. We infer the pixel correspondences by feeding each feature component through the complete decoder, resulting in one correspondence map estimate per component. The process is applied at each level of the feature pyramid. The mean of these estimates is used as the final pixel correspondence map estimate. This formulation was not used during training as it did not lead to convergence.

5.2 Match consistency

The global correlation layer only measures similarities in one direction, from the target to the source image. However, many related works in optical flow have shown that cyclic consistency allows the network to achieve better performance. In [28], a similar kind of global correlation layer was applied with cyclic consistency and neighborhood consensus to learn optimal feature correspondences. The idea is that matches should be consistent both locally and cyclically. That is, nearby matches should be locally consistent, and matches should also be consistent in both the forward and backward directions. We therefore integrated the Neighborhood Consensus Network (NCNet) [28] into our network. In contrast to the original DGC-Net, the output of the correlation layer is now passed through NCNet, which has learnable parameters, before being fed forward through the decoders to obtain dense pixel correspondences. We refer to this network as DGC-NC-UCMD-Net.

(a) Viewpoint I
(b) Viewpoint II
(c) Viewpoint III
(d) Viewpoint IV
(e) Viewpoint V
Figure 5: PCK metric calculated for different Viewpoint IDs of the HPatches dataset. The proposed architectures (DGC-NC-*) outperform all strong baseline methods by a large margin.

6 Experiments

We discuss the experimental settings and evaluate the proposed method on two closely related tasks, establishing dense pixel correspondences between images (image matching) and retrieval-based localization.

6.1 Image matching

For this task we compare our approach with DGC-Net [24], which can handle strong geometric transformations between two views. We use the training and validation splits proposed by [24] to compare both approaches fairly. More specifically, diverse synthetic transformations (affine, TPS, and homography) have been applied to the Tokyo Time Machine dataset [1] to generate approximately 20k training samples. Similarly to [24], the proposed network has been trained by minimizing the distance between the ground-truth and estimated correspondence maps at each level of the feature pyramid (Fig. 3). Details of the training procedure are given in the supplementary material.

We evaluate our method on the HPatches dataset [3] and report the average endpoint error (AEPE) of the predicted pixel correspondence map. The HPatches dataset consists of several sequences of real images with varying photometric changes. Each image sequence consists of a reference image and 5 corresponding source images, each taken from a different viewpoint and provided with an estimated ground-truth homography. As predicting a dense pixel correspondence map is closely related to optical flow estimation, we also provide AEPE for two strong optical flow (OF) baseline methods, FlowNet2 [12] and PWC-Net [34].
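For reference, AEPE can be computed as the mean Euclidean distance between predicted and ground-truth correspondences. The sketch below assumes both maps are stored as `(H, W, 2)` arrays of pixel coordinates; the optional `valid` mask for excluding pixels without ground truth is an added convenience, not something specified in the text.

```python
import numpy as np

def aepe(pred, gt, valid=None):
    """Average endpoint error between two (H, W, 2) correspondence maps."""
    err = np.linalg.norm(pred - gt, axis=-1)  # per-pixel Euclidean distance
    if valid is not None:
        err = err[valid]                      # keep only pixels with ground truth
    return float(err.mean())

# Toy example: every prediction is off by the vector (3, 4), i.e. 5 pixels.
gt = np.zeros((2, 2, 2))
pred = gt + np.array([3.0, 4.0])
error = aepe(pred, gt)
```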

Method | I | II | III | IV | V
FlowNet2 [12] | 5.99 | 15.55 | 17.09 | 22.13 | 30.68
PWC-Net [34] | 4.43 | 11.44 | 15.47 | 20.17 | 28.30
Rocco [27] | 9.59 | 18.55 | 21.15 | 27.83 | 35.19
DGC-Net [24] | 1.55 | 5.53 | 8.98 | 11.66 | 16.70
DGC-NC-UCMD-Net | 1.90 | 5.02 | 9.08 | 10.18 | 13.24
DGC-NC-UCMD-Net (avg. est.) | 1.51 | 4.46 | 8.66 | 9.59 | 12.62
DGC-NC-Net | 1.24 | 4.25 | 8.21 | 9.71 | 13.35
Table 1: AEPE metric for different viewpoint IDs of the HPatches dataset (lower is better).

We calculate AEPE over all image sequences belonging to the same Viewpoint ID of the HPatches dataset and report the results in Tab. 1. Here, DGC-NC-Net refers to the original DGC-Net architecture complemented by the NC layer (Sec. 5.2), with a set of independent decoders at each level of the spatial feature pyramid. Compared to DGC-Net, this model achieves better performance, reducing the overall EPE by 20% for the most extreme viewpoint difference between the reference and source images (Viewpoint V). According to Tab. 1, DGC-NC-UCMD-Net with one universal correspondence map decoder (Sec. 5.1) falls slightly behind DGC-NC-Net (by 12% on average across all Viewpoint IDs), but it demonstrates significant advantages in terms of computation costs, reducing the number of learnable parameters by 2.8 times. However, the performance of DGC-NC-UCMD-Net can be improved further if, at inference time, rather than averaging the feature maps produced by the first convolutional block of UCMD (Fig. 4), we average the predicted pixel correspondence estimates for each input feature map. We refer to this model as DGC-NC-UCMD-Net (avg. est.).
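As a quick check, the roughly 20% reduction quoted for Viewpoint V can be reproduced from the two corresponding AEPE values in Tab. 1 (the variable names below are illustrative):

```python
# AEPE for Viewpoint V, taken from Tab. 1.
dgc_net_v = 16.70
dgc_nc_net_v = 13.35

# Relative reduction of the error with respect to the DGC-Net baseline.
relative_reduction = (dgc_net_v - dgc_nc_net_v) / dgc_net_v
print(f"{relative_reduction:.1%}")  # prints 20.1%
```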

In addition, we report the number of correctly matched pixels between two images by calculating the PCK (Percentage of Correct Keypoints) metric with different thresholds. As shown in Fig. 5, the proposed DGC-NC-* models outperform DGC-Net by about 4% and correctly match around 62% of pixels for the case where geometric transformations are the most challenging (Viewpoint V).
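A minimal PCK sketch is shown below. It assumes that a pixel counts as correctly matched when its endpoint error is at most a threshold given in pixels; the exact threshold convention used for Fig. 5 is not stated in this section, so this choice is an assumption.

```python
import numpy as np

def pck(pred, gt, threshold):
    """Fraction of pixels whose endpoint error is at most `threshold` pixels."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return float((err <= threshold).mean())

# Toy example: four pixels with endpoint errors 0, 1, 5 and 10.
gt = np.zeros((2, 2, 2))
pred = np.array([[[0.0, 0.0], [1.0, 0.0]],
                 [[3.0, 4.0], [6.0, 8.0]]])
pck_1px = pck(pred, gt, threshold=1.0)
pck_5px = pck(pred, gt, threshold=5.0)
```

With these toy values, half of the pixels are correct at a 1-pixel threshold and three quarters at a 5-pixel threshold.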

6.2 Localization

We study the performance of our pipeline in the context of image retrieval for image-based city-scale localization. We consider two localization datasets: Tokyo24/7 [36] and Aachen Day-Night [30]. For both datasets, we follow the same procedure. For a given query, we first obtain a ranked list of database images based on the Euclidean distance between their global NetVLAD representations. The top 100 ranked database images are re-ranked according to their geometric similarity score (Sec. 4.1). From these geometrically verified, re-ranked database images, we pass the top 20 through the more expensive and stricter representation similarity function (Sec. 4.2). The final re-ranking of these 20 images is based on this final similarity.
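The two-stage procedure can be summarized by the following sketch. The scoring functions are passed in as callables and stand in for the geometric and appearance-based similarities; the function name `two_stage_rerank` is hypothetical, and appending the images that were not re-scored in the second stage after the re-ranked head is an assumption about how the final list is assembled.

```python
def two_stage_rerank(initial_ranking, geometric_score, appearance_score,
                     first_k=100, second_k=20):
    """Re-rank a retrieval shortlist in two stages.

    initial_ranking:  database ids ordered by global descriptor distance.
    geometric_score:  callable id -> cheaper geometric similarity (higher is better).
    appearance_score: callable id -> costlier final similarity (higher is better).
    """
    # Stage 1: geometric verification of the top first_k candidates.
    shortlist = initial_ranking[:first_k]
    stage1 = sorted(shortlist, key=geometric_score, reverse=True)
    # Stage 2: the expensive similarity is evaluated only for the top second_k.
    head = sorted(stage1[:second_k], key=appearance_score, reverse=True)
    return head + stage1[second_k:]

# Toy example with made-up scores (not values from the paper).
geom = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.7}
appearance = {1: 0.2, 3: 0.8}
result = two_stage_rerank(list(range(6)), geom.get, lambda i: appearance[i],
                          first_k=4, second_k=2)
```

In the toy run, images 4 and 5 never enter the shortlist, and the appearance score is evaluated only for images 1 and 3, which illustrates how the costly step is confined to a small subset.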

Tokyo24/7. The dataset consists of 80k database images collected from the city of Tokyo using the Google Street View system. The 315 queries were later collected using hand-held devices under challenging conditions: viewpoint variation, night-time capture and substantial occlusion.

Aachen Day-Night. The dataset has similar properties to the Tokyo24/7 dataset, with day-night variations in the captured queries. The total number of database images is 4k, with around 900 queries.

Localization metrics. The performance on the Tokyo24/7 dataset is evaluated using Recall@N, which is the percentage of queries that are correctly localized given the N nearest-neighbor database images returned by the model. A query is considered correctly localized if at least one of the relevant database images is present among the top N ranked database images. In contrast, the localization performance on Aachen Day-Night is measured in terms of the accuracy of the estimated query pose. The accuracy is defined as the percentage of queries whose estimated 6DOF pose lies within a pre-defined threshold of the ground-truth pose.
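The Recall@N definition above translates directly into a few lines of code. In this sketch, the ranked lists and the sets of relevant database images are toy placeholders, and the function name `recall_at_n` is illustrative.

```python
def recall_at_n(ranked_lists, relevant_sets, n):
    """Percentage of queries with at least one relevant image in the top n.

    ranked_lists:  one ranked list of database ids per query.
    relevant_sets: one set of relevant database ids per query.
    """
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(db_id in relevant for db_id in ranked[:n])
    )
    return 100.0 * hits / len(ranked_lists)

# Toy example with three queries.
ranked_lists = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
relevant_sets = [{1}, {5}, {0}]
recall_1 = recall_at_n(ranked_lists, relevant_sets, n=1)
recall_2 = recall_at_n(ranked_lists, relevant_sets, n=2)
```

Here only the second query is localized at N=1, the first joins it at N=2, and the third has no relevant image in its list at all.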

Figure 6: Comparison of the proposed methods with state-of-the-art approaches for place recognition.
Method | r@1 | r@5 | r@10
DenseVLAD [36] | 67.1 | 74.2 | 76.1
NetVLAD-Pitts [1] | 61.27 | 73.02 | 78.73
NetVLAD-TokyoTM [1] | 71.1 | 83.1 | 86.2
Proposed Pitts | 71.43 | 82.54 | 85.08
Proposed Pitts | 77.14 | 84.44 | 86.67
Table 2: Localization performance on the Tokyo24/7 dataset (higher is better).

6.2.1 Tokyo 24/7

We compare the proposed approach with several strong baseline methods for place recognition. The hand-crafted methods are represented by DenseVLAD [36] which aggregates densely extracted SIFT descriptors [19]. As a CNN-based baseline method, we use NetVLAD [1] achieving state-of-the-art results on Tokyo24/7 dataset.

The Recall@N values for the baseline methods are presented in Fig. 6. Our geometric verification based pipeline achieves state-of-the-art performance at Recall@1-10. However, it is worth noting that the proposed approach (Eq. 5) utilizes image descriptors obtained by NetVLAD pre-trained on the Pittsburgh dataset (NetVLAD-Pitts in Fig. 6). Our approach significantly outperforms NetVLAD-Pitts for all Recall@N thresholds. Moreover, it is noteworthy that our method pushes the generalization performance of NetVLAD-Pitts above that of NetVLAD-TokyoTM, which was trained on images with a distribution similar to Tokyo24/7.

6.2.2 Aachen Day-Night dataset

NetVLAD-Pitts is the only CNN-based approach available as a baseline for this dataset [30]. Since NetVLAD-Pitts outputs global vector representations, the query pose is approximated using the pose of the nearest retrieved database image. We follow a similar protocol for our method. In contrast, Active Search (AS) [31] is a state-of-the-art method which uses SIFT descriptors in a Bag-of-Words (BoW) [33] and prioritized 2D-3D matching framework to estimate the 6DOF query camera pose. The method can benefit heavily from an accurate shortlist of relevant database images by running the prioritized matching only on the shortlisted database images and their corresponding 3D points. Thus, the proposed approach can be considered complementary to AS.

Results obtained by the baseline methods and our approach are presented in Tab. 3. The proposed method outperforms NetVLAD-Pitts in accurately localizing both day and night time queries. The numbers are not indicative of the localization quality of both NetVLAD-Pitts and our method since the pose of the nearest retrieved database image often exceeds the pre-defined threshold (5 meters) for computing query pose accuracy. Therefore, simple assignment of nearest database pose to the query can be erroneous and needs further attention. However, it can be seen that the retrieval quality of the proposed approach is consistently better than NetVLAD-Pitts. We leave the problem of estimating 6DOF pose for future work.

Qualitative image retrieval results on both datasets are illustrated in Fig. 7.

Method | day | night
Active Search [31] | 96.6 | 43.9
NetVLAD-Pitts [1] | 17.0 | 10.2
Proposed | 19.7 | 15.3
Table 3: Localization performance on the Aachen Day-Night dataset (higher is better). The best performance among image retrieval based approaches is highlighted as italics.
Method | I | II | III | IV | V
DGC-NC-UCMD-Net (avg. est.) | 1.51 | 4.46 | 8.66 | 9.59 | 12.62
DGC-NC-UCMD-Net (permuted) | 1.91 | 4.92 | 9.15 | 10.16 | 13.06
DGC-NC-UCMD-Net (Resnet-101) | 5.27 | 8.71 | 11.67 | 13.32 | 16.91
Table 4: Ablation study. AEPE metric for different viewpoint IDs of the HPatches dataset (lower is better). We analyze the influence of different aspects of the proposed method on the performance. See Sec. 6.3 for more details.

6.3 Ablation study

In this section we study the channel convolution property of the decoder in more detail.

In Sec. 5.1, we argue that the decoder trained using channel convolutions learns a unified refinement function. As such, the decoder should be invariant to the order of the input feature channels. To analyze this, we evaluate our model DGC-NC-UCMD-Net on the HPatches dataset, where the input feature maps to the decoder at different layers of the pyramid are re-ordered according to a permutation of the channel indices. The results presented in Tab. 4 demonstrate that the model achieves comparable performance across different viewpoints compared to the non-permuted case. Furthermore, being a refinement function, the decoder operates on the space of representation similarity and thus should be invariant to the representations themselves. Hence, we replaced the VGG16 encoder with ResNet-101 truncated at the 3rd residual layer. From the results in Tab. 4, we observe that across viewpoints III, IV and V, the ResNet-101 model achieves performance comparable to VGG16, which was used during training.

7 Conclusion

We have presented novel methods for CNN based dense pixel to pixel correspondence learning and its application to geometric verification for image retrieval. In particular, we have proposed a compact but effective CNN model for dense pixel correspondence estimation using the universal correspondence map decoder block. This reduces the memory footprint by 3 times compared to the baseline DGC-Net model.

In addition, we have integrated the matching layer in our model with neighborhood consensus [28] which further enhances the matching performance. This modified dense correspondence model along with the proposed geometric similarity functions are then applied to improve the initial ranking of database images given by NetVLAD descriptor. We have evaluated our approach on two challenging city-scale localization datasets (Tokyo24/7 and Aachen Day-Night) achieving state-of-the-art retrieval results.

(a) Tokyo24/7
(b) Aachen Day-Night
Figure 7: Qualitative results produced by NetVLAD [1] (rows 2 and 5) and the proposed method (rows 3 and 6) on two localization datasets: Tokyo24/7 and Aachen Day-Night. Each column corresponds to one test case: for each query (rows 1 and 4), the top-1 (Recall@1) nearest database image has been retrieved. The green and red strokes correspond to correctly and incorrectly retrieved images, respectively. The proposed approach can handle different illumination conditions (day/night) and significant viewpoint changes (the second column in Fig. 7(b)). More examples are presented in the supplementary material.


In these appendices we show additional qualitative and quantitative results of the proposed approach. In Sec. B we provide an ablation study and analyze the influence of different design choices of our method on the localization performance. We demonstrate the benefits of the unified correspondence map decoder (UCMD) compared to the architecture with multiple decoders in Sec. C. Finally, qualitative localization and pixel correspondence estimation results are shown in Sec. D.

Appendix A Additional Baselines

In this work, we propose two similarity functions for geometric verification:


These functions are defined in terms of the number of inliers and the number of cyclically consistent inliers between two images, the local similarity between the hypercolumns of the NetVLAD [1] image descriptors at each location, and the global similarity value.

We compare our method with two baselines: i) the recently proposed geometric verification pipeline Inloc [35], and ii) a neural network based method that learns the two scoring functions, taking the same quantities as input. We present more details about the baselines next.

Inloc. Inloc is an indoor localization pipeline consisting of three primary stages: i) database images are ranked by measuring global representation similarity with a given query, where the global representations are obtained from the image retrieval pipeline NetVLAD [1]; ii) a shortlist of top-ranked database images is re-ranked based on geometric verification using dense CNN descriptors. The dense descriptors are obtained from different layers of the NetVLAD pipeline, followed by coarse-to-fine matching using nearest-neighbor search. The geometric verification is done using a standard RANSAC-based inlier count, and the final score is the sum of global similarity and inlier count; iii) the top-ranked geometrically verified database images are fed into a pose verification stage. This final stage first estimates candidate query poses from the currently shortlisted database images. The estimated pose is then verified using view synthesis, a process requiring dense database depth maps. Our proposed geometric verification pipeline is similar to Inloc components i) and ii). The pose verification stage requires depth maps, which are not always available. Therefore, we evaluate the Inloc pipeline up to the geometric verification stage and report results in Tab. 4(b).
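The re-ranking in stage ii) can be illustrated with a minimal sketch. This is not Inloc's code: the function name `rerank_by_verification` and the toy numbers are hypothetical, and the inlier counts are assumed to come from a separate RANSAC step that is not shown.

```python
from typing import Dict, List, Tuple


def rerank_by_verification(
    global_sim: Dict[str, float],
    inlier_count: Dict[str, int],
    shortlist_size: int,
) -> List[Tuple[str, float]]:
    """Re-rank the top `shortlist_size` database images.

    The score is the sum of global similarity and inlier count, as the
    text describes for stage ii). Images with no recorded inliers keep
    only their global similarity.
    """
    # Stage i): initial ranking by global descriptor similarity.
    initial = sorted(global_sim, key=global_sim.get, reverse=True)
    shortlist = initial[:shortlist_size]
    # Stage ii): add the inlier count to the global similarity.
    scored = [
        (name, global_sim[name] + inlier_count.get(name, 0))
        for name in shortlist
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy values (hypothetical, for illustration only).
global_sim = {"db_a": 0.90, "db_b": 0.85, "db_c": 0.40}
inliers = {"db_a": 3, "db_b": 25, "db_c": 100}
ranking = rerank_by_verification(global_sim, inliers, shortlist_size=2)
print(ranking)
```

Note that `db_c` never enters the shortlist despite its large inlier count, which reflects the dependence of any re-ranking scheme on the quality of the initial ranking.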

Learnt similarity functions. Since both Eq. 4 and Eq. 5 are hand-crafted, we provide an FCNN-based model that can learn the similarity function. More specifically, we experiment with two independent models (for and ) which predict whether two images are similar or not based on , , , and . Both models have similar architectures , where the following shorthand notation is used: FC is a fully connected linear layer; is the number of input units (2 for and 4 for , respectively). We refer to these models as -FCNN and -FCNN. Both models have been trained in a supervised manner by minimizing the binary cross-entropy loss.
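To make the training objective concrete, the following sketch trains a classifier with the binary cross-entropy loss. It is a deliberate simplification and not the FCNN used here: it has a single fully connected layer followed by a sigmoid, the two input features are unnamed placeholders, and the toy data, learning rate, and epoch count are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def mean_loss(w, b, features, labels) -> float:
    return sum(bce(predict(w, b, x), y) for x, y in zip(features, labels)) / len(labels)


def train(features, labels, lr: float = 0.5, epochs: int = 500, seed: int = 0
          ) -> Tuple[List[float], float]:
    """Plain stochastic gradient descent on the BCE loss."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in features[0]]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            grad = predict(w, b, x) - y  # d(BCE)/d(logit) for a sigmoid output
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Hypothetical 2-dimensional inputs; label 1 means "same place".
features = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (0.1, 0.3), (0.2, 0.1), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

initial_loss = mean_loss([0.0, 0.0], 0.0, features, labels)
w, b = train(features, labels)
final_loss = mean_loss(w, b, features, labels)
print(initial_loss, final_loss)
```

A deeper network would replace `predict` with several stacked layers and non-linearities; the loss and the supervised setup stay the same.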

Results. We now compare and with the Inloc geometric verification pipeline on the Tokyo24/7 dataset. Results demonstrate that our proposed functions and outperform Inloc across all Recall rates, as shown in Tab. 4(a). We observed that for many query-database image pairs, Inloc fails to find any inliers. This can be attributed to significant clutter, illumination change (day-night), and occlusion in this challenging dataset. The learnt similarity functions -FCNN and -FCNN show very promising results and perform better than NetVLAD. In particular, -FCNN has comparable performance to the proposed . However, -FCNN could not achieve any improvement compared to -FCNN. We leave further analysis for future work.

| Methods | Recall r@1 | Recall r@5 | Recall r@10 |
|---|---|---|---|
| Inloc [35] | 62.54 | 67.62 | 70.48 |
| NetVLAD-Pitts [1] | 61.27 | 73.02 | 78.73 |
| DenseVLAD [36] | 67.10 | 74.20 | 76.10 |
| -FCNN | 67.94 | 81.90 | 85.08 |
| -FCNN | 63.49 | 81.59 | 85.71 |
| Proposed Pitts | 71.43 | 82.54 | 85.08 |
| Proposed Pitts | 77.14 | 84.44 | 86.67 |

(a) The proposed similarity functions and perform better than strong baseline methods.
| Methods | Recall r@1 | Recall r@5 | Recall r@10 |
|---|---|---|---|
| NetVLAD-Pitts [1] | 61.27 | 73.02 | 78.73 |
| (inliers) | 56.83 | 78.41 | 83.81 |
| (cyclically consistent inliers) | 70.16 | 82.86 | 85.71 |
| | 64.76 | 82.54 | 85.71 |
| Proposed Pitts | 71.43 | 82.54 | 85.08 |

(b) Localization performance on the Tokyo247 dataset (higher is better).
| Methods | Recall r@1 | Recall r@5 | Recall r@10 |
|---|---|---|---|
| NetVLAD-Pitts [1] | 61.27 | 73.02 | 78.73 |
| | 73.65 | 83.49 | 86.67 |
| | 69.84 | 80.95 | 85.08 |
| Proposed Pitts | 77.14 | 84.44 | 86.67 |

(c) Localization performance on the Tokyo247 dataset (higher is better).
77.78 81.34
82.04 83.70
85.37 85.94
(d) Localization performance (Recall@1) on the Pittsburgh test dataset (higher is better). We analyze the performance of different and of the original similarity function (5). The baseline, NetVLAD, achieves 81.59 Recall@1.
Table 5: Ablation study. We evaluate the proposed similarity functions and with different settings on Tokyo24/7 and Pittsburgh datasets.
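For readers unfamiliar with the metric, the sketch below computes Recall@N under the usual place-recognition convention: a query counts as correctly localized if at least one of its top-N retrieved database images is a correct match. The function name, data layout, and toy rankings are hypothetical and are not taken from the evaluation code behind the tables.

```python
from typing import Dict, List, Set


def recall_at_n(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    n: int,
) -> float:
    """Percentage of queries with at least one correct image in the top n."""
    hits = sum(
        1
        for query, retrieved in ranked.items()
        if any(db in relevant[query] for db in retrieved[:n])
    )
    return 100.0 * hits / len(ranked)


# Toy rankings (hypothetical): each query maps to its ranked database images.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
# Ground-truth matches per query; q4 has no correct image in its list.
relevant = {"q1": {"a"}, "q2": {"e"}, "q3": {"i"}, "q4": {"z"}}

for n in (1, 2, 3):
    print(n, recall_at_n(ranked, relevant, n))
```

Because a larger N can only add hits, Recall@N is non-decreasing in N, which matches the pattern in every row of the tables.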

Appendix B Ablation study

In this section we perform an ablation study on the proposed Eq. 4 and Eq. 5 for geometric verification. For Eq. 4, we independently analyze the impact of each variable on retrieval performance on the Tokyo24/7 dataset. The results are presented in Tab. 4(b).

Results. First we provide the ablation study for Eq. 4. Results demonstrate that a simple inlier count performs worse than the baseline NetVLAD and our proposed at Recall@1. However, the retrieval performance improves over NetVLAD for Recall@5 and Recall@10. Cyclically consistent inliers outperform NetVLAD across various Recall rates. Similarly, the ratio performs marginally better than NetVLAD but falls behind the proposed function for Recall@1 (by about 6 percentage points). Both and perform on par with the proposed across Recall@5 and Recall@10. However, has a clear performance advantage over and for Recall@1 as shown in Tab. 4(b).

Now, we perform an ablation study for Eq. 5. As mentioned in the main manuscript, the proposed is used to re-rank the top 20 database images in the shortlist, as ranked by . Here, we perform the final re-ranking using just the local descriptor similarity component, , and global representation distance, . Results in Tab. 4(c) demonstrate that re-ranking with decreases retrieval performance compared to the initial ranking by . On the other hand, local descriptor similarity weighted by significantly improves over the baselines and initial ranking by . However, the proposed combination of local and global representation similarity outperforms each individual component across all Recall rates.

The key idea here is to combine the similarity functions, , and . It is important to note that and are similarity functions, while is a distance function; hence, it is inversely proportional to global similarity. The inversely proportional functions, and can be combined in many different ways. We present a few in Tab. 4(d). The coefficients (5 and 10) associated with in the columns of Tab. 4(d) have been obtained using a grid search over the range on the Pittsburgh test dataset. In addition, we found performs clearly better than = . Hence, we only present results for various and for in Tab. 4(d). The precise form of the combination of these similarity functions has been obtained based on validation experiments on the test set of the Pittsburgh dataset. Tab. 4(d) shows that various possible combinations give better performance than NetVLAD, which achieves 81.59 at Recall@1. The weighting with structural similarity leads to a significant boost in retrieval performance. Such a form of weighting provides a good balance, requiring image pairs to have both high local similarity () and high structural similarity (). Among the various combinations, the proposed Eq. 5 achieves the best performance (highlighted in bold in Tab. 4(d)).
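The coefficient selection described above can be sketched as a small grid search. The combination used below, local similarity multiplied by a decaying exponential of the global distance, is a hypothetical stand-in chosen only because it decreases with distance; it is not Eq. 5. The grid values, candidate tuples, and function names are likewise assumptions.

```python
import math
from typing import List, Sequence, Tuple

# (local similarity, global distance, is this the correct database image?)
Candidate = Tuple[float, float, bool]


def combined_score(local_sim: float, global_dist: float, alpha: float) -> float:
    """Hypothetical fusion: similarity damped by the global distance."""
    return local_sim * math.exp(-alpha * global_dist)


def recall_at_1(queries: Sequence[List[Candidate]], alpha: float) -> float:
    """Percentage of queries whose best-scoring candidate is correct."""
    hits = 0
    for candidates in queries:
        best = max(candidates, key=lambda c: combined_score(c[0], c[1], alpha))
        hits += int(best[2])
    return 100.0 * hits / len(queries)


def grid_search(queries: Sequence[List[Candidate]], grid: Sequence[float]) -> float:
    """Return the coefficient with the highest Recall@1 on the given queries."""
    return max(grid, key=lambda alpha: recall_at_1(queries, alpha))


# Toy shortlist per query (hypothetical values).
queries = [
    [(0.9, 0.8, False), (0.8, 0.2, True)],
    [(0.9, 0.5, True), (0.3, 0.1, False)],
]
grid = [0.0, 1.0, 5.0, 10.0]
best_alpha = grid_search(queries, grid)
print(best_alpha, [recall_at_1(queries, a) for a in grid])
```

In this toy setting a coefficient of zero ignores the global distance and a very large one lets it dominate, so an intermediate value gives the best Recall@1.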

Figure 8: AEPE averaged over all HPatches [3] sequences versus memory footprint. The accuracy of both proposed methods (DGC-NC-Net and DGC-NC-UCMD-Net) is about on par; however, UCMD reduces the memory footprint by 30%.

Appendix C The benefits of UCMD

As shown in the main manuscript, we propose the unified correspondence map decoder, which leads to a compact but efficient architecture. To elaborate on the benefits of UCMD, here we report the average endpoint error (AEPE) over all sequences of the HPatches [3] dataset, together with the allocated GPU memory, for the proposed models and each strong baseline method (PWC-Net [34], geometric matching GM [27], and DGC-Net [24]). The results are illustrated in Fig. 8. In contrast to DGC-NC-Net with 5 separate decoders, the proposed UCMD significantly decreases the memory footprint (by 30%) while achieving comparable accuracy.
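As a reminder of what the metric measures, the sketch below computes the average endpoint error as the mean Euclidean distance between predicted and ground-truth correspondences. The flat list-of-points layout, the optional validity mask, and the function name are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def average_endpoint_error(
    predicted: Sequence[Point],
    ground_truth: Sequence[Point],
    valid: Optional[Sequence[bool]] = None,
) -> float:
    """Mean Euclidean distance between predicted and true locations.

    `valid` optionally marks which pixels have a ground-truth match;
    pixels marked False are skipped.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if valid is None:
        valid = [True] * len(predicted)
    errors: List[float] = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy), keep in zip(predicted, ground_truth, valid)
        if keep
    ]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


# Toy example: one exact match and one match that is off by (3, 4) pixels.
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
aepe = average_endpoint_error(pred, gt)
print(aepe)
```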

| Model | Number of learnable parameters |
|---|---|
| PWC-Net [34] | 8 749 280 |
| GM [27] | 3 271 576 |
| DGC-Net [24] | 2 675 338 |
| Proposed (DGC-NC-Net) | 2 685 079 |
| Proposed (DGC-NC-UCMD-Net) | 940 561 |

Table 6: Number of learnable parameters of two proposed architectures and strong baseline methods.

The amount of memory allocated by GM [27], DGC-Net [24], DGC-NC-Net, and DGC-NC-UCMD-Net is higher than that of PWC-Net, since all of these models use a pre-trained VGG-16 network as the encoder. Therefore, in addition to memory consumption, we compute the total number of learnable parameters of each model and provide the results in Tab. 6.
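The relative savings implied by Tab. 6 can be checked with a few lines of arithmetic. The parameter counts below are copied from the table; the helper name `reduction_percent` is hypothetical.

```python
# Learnable parameter counts as listed in Tab. 6.
PARAMS = {
    "PWC-Net": 8_749_280,
    "GM": 3_271_576,
    "DGC-Net": 2_675_338,
    "DGC-NC-Net": 2_685_079,
    "DGC-NC-UCMD-Net": 940_561,
}


def reduction_percent(baseline: str, model: str) -> float:
    """Percentage of learnable parameters saved by `model` over `baseline`."""
    return 100.0 * (1.0 - PARAMS[model] / PARAMS[baseline])


for baseline in ("PWC-Net", "GM", "DGC-Net", "DGC-NC-Net"):
    saved = reduction_percent(baseline, "DGC-NC-UCMD-Net")
    print(f"{baseline}: {saved:.1f}% fewer learnable parameters with UCMD")
```

Relative to DGC-NC-Net, the UCMD variant has roughly 65% fewer learnable parameters, which is a larger relative saving than the 30% reduction in memory footprint reported in Fig. 8.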

Appendix D Qualitative results

Localization (image retrieval) performance. Fig. 9 reports an additional set of results obtained for the Tokyo24/7 dataset. Namely, it includes top-1 Nearest Neighbour (Recall@1 metric) obtained by NetVLAD [1] and our approach, respectively, for a given query. It clearly shows the proposed method improves retrieval results compared to NetVLAD and can cope with major changes in appearance (illumination changes in the scene) between the database and query images. Qualitative image retrieval results on Aachen Day-Night [30] are illustrated in Fig. 9(a).

Dense pixel correspondences are presented in Fig. 10. Each row shows one test pair from the Aachen Day-Night and Tokyo24/7 datasets, respectively. Ground truth matching keypoints are illustrated in different colors and have been used only for pixel correspondence evaluation. Keypoints of the same color are supposed to match each other. We manually indicated 3 keypoints in the target image for visualization purposes and the corresponding locations in the source image have been obtained by the proposed automatic dense matching approach. That is, given an input image pair (source and target images), our method predicts the correspondence map which is then used to obtain the location of keypoints. The results demonstrate that the proposed method can handle such challenging cases as different illumination (day/night) conditions, occlusions, and significant viewpoint changes producing accurate pixel correspondences.
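The keypoint transfer described above amounts to a lookup in the predicted correspondence map. The sketch below assumes a map stored as a nested list in which `corr_map[y][x]` holds the source-image location matched to target pixel `(x, y)`; this layout, the function name, and the synthetic map are assumptions and not the output format of the proposed network.

```python
from typing import List, Tuple

Point = Tuple[float, float]
CorrespondenceMap = List[List[Point]]


def transfer_keypoints(
    corr_map: CorrespondenceMap,
    target_keypoints: List[Tuple[int, int]],
) -> List[Point]:
    """Look up the source location for each (x, y) target keypoint."""
    height, width = len(corr_map), len(corr_map[0])
    source_points = []
    for x, y in target_keypoints:
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"keypoint {(x, y)} lies outside the map")
        source_points.append(corr_map[y][x])
    return source_points


# Synthetic 4x6 map (hypothetical): every pixel is shifted by (+2, +1).
height, width = 4, 6
corr_map = [
    [(float(x + 2), float(y + 1)) for x in range(width)]
    for y in range(height)
]

keypoints = [(0, 0), (3, 2), (5, 3)]
matches = transfer_keypoints(corr_map, keypoints)
print(matches)
```

For keypoints at sub-pixel positions, the lookup would be replaced by bilinear interpolation of the map.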

Appendix E Limitations and future directions

We have demonstrated that the proposed method can localize queries under challenging conditions but it fails for very large viewpoint change ( rotation while observing the same place) and significant scale change. In addition, it would be interesting to propose an end-to-end semi-supervised approach which can efficiently learn similarity functions.

Figure 9: Qualitative results produced by NetVLAD [1] (rows 2 and 5) and the proposed method (rows 3 and 6) on Tokyo24/7 [36]. Each column corresponds to one test case: for each query (row 1 and 4) top-1 (Recall@1) nearest database image has been retrieved. The green and red strokes correspond to correct and incorrect retrieved images, respectively. The proposed approach can handle different illumination conditions (day/night) and significant viewpoint changes.
(a) Retrieval performance and pixel correspondences on Aachen Day-Night
(b) Pixel correspondences on Tokyo24/7
Figure 10: Qualitative image retrieval 9(a) and dense pixel correspondence estimation results produced by the proposed approach. We evaluate our approach on two challenging datasets: Tokyo24/7 and Aachen Day-Night. More image retrieval results are illustrated in Fig. 9. Each row of Fig. 9(b) corresponds to one test case. Ground truth keypoints have been manually selected in the target image for visualization purposes and the corresponding locations in the source image are obtained by the proposed dense matching method. Keypoints of the same color are supposed to match each other.


  • [1] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. CVPR, 2016.
  • [2] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky. Neural Codes for Image Retrieval. In Proc. ECCV, 2014.
  • [3] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proc. CVPR, 2017.
  • [4] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, et al. City-scale landmark identification on mobile devices. In Proc. CVPR, 2011.
  • [5] O. Chum and J. Matas. Matching with PROSAC - progressive sampling consensus. In Proc. CVPR, 2005.
  • [6] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval. In Proc. ICCV, 2007.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
  • [8] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM, 24(6), 1981.
  • [9] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep Image Retrieval: Learning global representations for image search. In Proc. ECCV, 2016.
  • [10] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In Proc. CVPR, 2013.
  • [11] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. CVPR, pages 447–456, 2015.
  • [12] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proc. CVPR, 2017.
  • [13] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. CVPR, 2010.
  • [14] H. Jegou, H. Harzallah, and C. Schmid. A contextual dissimilarity measure for accurate and efficient image search. In Proc. CVPR, 2007.
  • [15] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. CVPR, 2010.
  • [16] Y. Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional Weighting for Aggregated Deep Convolutional Features. In Proc. ECCVW, 2016.
  • [17] Z. Laskar and J. Kannala. Context aware query image representation for particular object retrieval. In Proc. SCIA, 2017.
  • [18] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proc. ICCVW, pages 929–938, 2017.
  • [19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov 2004.
  • [20] A. Makadia. Feature tracking for wide-baseline image retrieval. In Proc. ECCV, 2010.
  • [21] J. Matas and O. Chum. Optimal randomized RANSAC with SPRT. In Proc. ICCV, 2005.
  • [22] C. McManus, W. Churchill, W. Maddern, A. D. Stewart, and P. Newman. Shady dealings: Robust, long-term visual localisation using illumination invariance. In Proc. ICRA, 2014.
  • [23] I. Melekhov, J. Kannala, and E. Rahtu. Image Patch Matching using Convolutional Descriptors with Euclidean Distance. In Proc. ACCVW, 2016.
  • [24] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, and J. Kannala. DGC-Net: Dense Geometric Correspondence Network. In Proc. WACV, 2019.
  • [25] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt. Scalable 6-dof localization on mobile devices. In Proc. ECCV, 2014.
  • [26] F. Radenović, G. Tolias, and O. Chum. Fine-tuning CNN image retrieval with no human annotation. TPAMI, 2018.
  • [27] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.
  • [28] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. In Proc. NeurIPS, 2018.
  • [29] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image Classification with the Fisher Vector: Theory and Practice. IJCV, 105(3):222–245, 2013.
  • [30] T. Sattler, W. Maddern, A. Torii, J. Sivic, T. Pajdla, M. Pollefeys, and M. Okutomi. Benchmarking 6DOF Urban Visual Localization in Changing Conditions. In Proc. CVPR, 2018.
  • [31] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? In Proc. CVPR, 2017.
  • [32] T. Sattler, T. Weyand, B. Leibe, and L. Kobbelt. Image retrieval for image-based localization revisited. In Proc. BMVC, 2012.
  • [33] J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
  • [34] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, 2018.
  • [35] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. In Proc. CVPR, 2018.
  • [36] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 Place Recognition by View Synthesis. In Proc. CVPR, 2015.
  • [37] S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. In Proc. CVPR, 2015.