1 Introduction
Keypoint matching has played a pivotal role in computer vision for well over a decade. This is clearly demonstrated by the fact that SIFT [23] remains the most cited paper in computer vision history. While many areas of computer vision are currently dominated by dense deep networks, that is, methods that take entire images as input, some problems remain best approached using sparse features. For example, despite recent attempts at tackling 6DOF pose estimation using dense networks, the top-performing models for wide-baseline stereo and large-scale Structure-from-Motion (SfM) still rely on sparse features [49, 51, 33].
As a result, the quest for ever-improving local feature descriptors goes on [23, 5, 46, 42, 39, 12, 50, 38, 41, 28, 45, 19, 25, 15, 24, 10, 31]. These methods all seek to achieve invariance to small changes in location, orientation, scale, perspective, and illumination, along with imaging artefacts and partial occlusions. Most descriptors, however, whether learned or handcrafted, operate on SIFT-like keypoints and thus rely on simple heuristics to estimate the scale. If the scales for two keypoints do not correspond, neither will the support regions used to extract their descriptors, which is widely accepted as an unrecoverable situation. This is damaging because scale detection is often unreliable.
In this paper we demonstrate that this does not need to be the case. To this end, we go beyond the current paradigm for local descriptors, which we call the cartesian approach. This paradigm confines local descriptors to small, regularly sampled regions and relies on accurate scale estimates. By contrast, we posit that extracting the support region with a log-polar sampling scheme allows us to generate a better local representation by oversampling the immediate neighborhood of the point. We show that this approach is conducive to learning scale-invariant descriptors with off-the-shelf deep networks, enabling us to match keypoints across mismatched scales; see Fig. 4. Furthermore, we demonstrate that this representation is far less sensitive to occlusions or background motion than its cartesian counterpart, which allows us to exploit much larger image regions than was possible before to further boost performance.
Note that while log-polar representations have been used extensively by local features, this has typically involved log-polar aggregation of local statistics that are still computed on the cartesian image grid. By contrast, we propose to warp the patch using a log-polar sampling scheme and learn an optimal descriptor on this data. Fig. 1 illustrates the difference between these two approaches.
In short, we propose a new approach to represent local patches and show how to leverage it to achieve scale invariance. In the remainder of the paper, we first briefly review how scale has been handled in the vast body of literature pertaining to matching descriptors, whether learned or designed. We then describe our method and show that it outperforms the state of the art on several challenging datasets.
2 Related works
In this section we first review techniques representative of the many that have been proposed to achieve scale invariance for local feature matching, with and without explicit scale detection. Next, we discuss approaches to learning models for patch descriptors. Finally, we study the use of log-polar representations in local features. For a thorough, up-to-date survey on local features please refer to [8].
Scale Invariance via Scale Detection.
The vast majority of work in the literature assumes that scale estimation is handled by the keypoint detector and that keypoints can be put in correspondence only if their scales match. This includes classical handcrafted pipelines such as SIFT [23] or SURF [5]. Image measurements are then aggregated over a correspondingly-sized support region to extract the descriptor. As a result, errors in this a priori scale estimation cannot be recovered from, and the affected keypoints are simply written off as potential correspondences.
Two-stage pipelines.
Special strategies can be used for rigid matching under large zoom. Zhou et al. [52] propose a two-stage approach to first coarsely register the image in scale-space and then narrow down the search scope to matches of commensurate scale. Shan et al. [36] assume that dense SfM models are available, along with an approximate pose, and synthesize ground views from aerial viewpoints using the 3D model, for aerial-to-ground matching. Both methods rely on SIFT features and would directly benefit from improved, scale-invariant descriptors such as ours.
Scale Invariance without Scale Detection.
A simple way to achieve scale invariance is to concatenate multi-scale descriptors and find the best match among them. This was done in [47] to improve robustness against scale changes with ORB features [32]. Scale-Less SIFT (SLS) [14] goes beyond that and exploits the observation that SIFT descriptors do not change drastically over close, contiguous scales, which suggests that they are embedded in a low-dimensional space. This observation can be used to find a representation more compact than their concatenation. The resulting feature vectors are still high-dimensional (8k) but can be reduced by PCA to a 512-dimensional vector. However, this requires a singular value decomposition for each keypoint to find its subspace, which is very costly.
The Scale- and rotation-Invariant Descriptor (SID) [20] samples axis-aligned derivatives over a log-polar grid, along with incremental smoothing over image regions further away from the keypoint. Thus, scale changes and rotations result in translations on the measurement matrix. Using the Fourier Transform Modulus of this signal, which is translation-invariant, makes the descriptors scale- and rotation-invariant. However, SID requires fine sampling over large support regions, which fails in real-world scenarios with viewpoint changes and occlusions. Seg-SID [43] addresses this shortcoming by leveraging segmentation cues to suppress image measurements from image regions not associated to the keypoint, but this requires image-level segmentation and is failure-prone. SID also suffers from high dimensionality (3k).
More importantly, both SID and SLS were designed for dense matching with SIFT Flow [22] as a backend and are not suitable for large-scale reconstruction due to their computational cost. Finally, they both rely on handcrafted features and cannot compete with the machine learning models that currently dominate the field. We now turn to these.
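The translation invariance that SID relies on can be checked numerically. The following is a minimal sketch, not the SID implementation: it assumes a hypothetical measurement matrix whose rows index angle and whose columns index log-radius, and it models rotation and scale changes as circular shifts, which is an idealization (real scale changes move content in and out of the support region).

```python
import numpy as np


def fourier_modulus(measurements):
    """Magnitude of the 2D DFT, which is unchanged by circular shifts of the input."""
    return np.abs(np.fft.fft2(measurements))


rng = np.random.default_rng(0)
# Hypothetical measurement matrix: 16 angular bins x 32 log-radius bins.
grid = rng.standard_normal((16, 32))
# Idealized rotation (shift along axis 0) and scale change (shift along axis 1).
shifted = np.roll(grid, shift=(3, 5), axis=(0, 1))

same_modulus = np.allclose(fourier_modulus(grid), fourier_modulus(shifted))
same_signal = np.allclose(grid, shifted)
```

Under these assumptions the two matrices differ while their Fourier moduli coincide, which is the property that makes such a descriptor insensitive to shifts on the log-polar grid.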
Learned Descriptors.
Early works applied PCA to SIFT [18], learned comparison metrics [40], or learned descriptors with convex optimization [39]. Current research on patch descriptors is dominated by convolutional neural networks. MatchNet [12] and DeepCompare [50] train descriptor extraction and distance metric networks using a Siamese architecture. DeepDesc [38] uses hard positive and negative mining to learn discriminative features. A triplet-based loss is introduced in [4]. L2-Net [41] improves the loss function by enforcing similarity in the intermediate feature maps and penalizing highly correlated descriptor bins. HardNet [28] extends the formulation of [38] to mine over all the samples in the batch. In [15], mining heuristics are replaced by a differentiable approximation of the average precision metric that is then used for optimization. Spectral pooling is introduced in [45] to deal with geometric transformations. An alternative to siamese and triplet-based loss functions is proposed in [19] to address their shortcomings. GeoDesc [25] uses geometry constraints for optimization. ContextDesc [24] incorporates global context, and geometric context from the keypoint distribution.
All of the deep methods, except [25, 24], are trained on the same dataset [7], which consists of patches pre-extracted on keypoints using Difference of Gaussians (DoG) [23] or multi-scale Harris corners [13]. Only keypoints that survive a 3D reconstruction by Structure from Motion (SfM) are considered, and similarly to the traditional approach, the learned models are simply expected to fail if the detector fails first. To the best of our knowledge there is no learning-based method that explicitly addresses scale invariance.
Another line of works comprises those that use deep architectures to learn keypoints and descriptors jointly. LIFT [48] is trained on patches extracted around SIFT keypoints with corresponding scales, similarly to the previous methods. LF-Net [29] learns to detect the scale with self-supervision, but in practice seems to perform best over a very narrow set of scales. SuperPoint [9] learns scale invariance at the descriptor level, which works for visual odometry but breaks in more generalized problems. D2-Net [10] focuses on difficult imaging conditions and relies on a single network for detection and description. R2D2 [31] applies L2-Net convolutionally while penalizing repeatable but non-discriminative patches.
Leveraging Polar representations.
[Fig. 1: panels (a) Cartesian Pooling (SIFT), (b) Log-Polar Pooling, (c) Log-Polar Sampling, (d) Log-Polar Patch, shown at scales 1x, 2x and 4x.]
Polar and log-polar representations have been extensively used in computer vision to aggregate local information, because they are robust to small changes in scale and rotation. Traditional handcrafted patch descriptors typically consist of two stages: feature extraction and feature pooling. First, image measurements such as gradients are computed for every pixel. Then, they are aggregated over small regions around the point given its location, orientation, and scale. SIFT, for instance, aggregates features (histograms of gradient orientations) over 4×4 cells around the keypoint; see Fig. 1.
Several descriptors aggregate features over polar or log-polar regions. GLOH [26] computes SIFT over a log-polar grid and then reduces the dimensionality by PCA. Daisy [42] aggregates oriented image gradients over a polar grid using a Gaussian kernel with a size proportional to the distance between the keypoint and the grid point, to bypass aliasing effects. The seminal Shape Contexts paper [6] introduces a descriptor for object recognition by picking points on the contour of a shape and histogramming the location of each point relative to every other point over log-polar bins. Local Self-Similarities (LSS) [37] proposes a technique to match different image modalities by measuring internal self-similarities over the regions determined by a log-polar grid. Winder and Brown [46] study many pooling configurations within a framework similar to Daisy and find log-polar to be optimal among their choices. Several binary descriptors, such as BRISK [21] or FREAK [2], rely on sampling patterns over similarly-defined grids to compute intensity differences and extract the features.
Note that all of these methods define polar or log-polar regions for feature pooling, that is, the pixel-wise features are always computed in cartesian space, and it is only their aggregation that takes place in log-polar space. As shown in Fig. 1, this is drastically different from our approach, which consists in warping the raw pixel data and using that representation to learn scale-invariant models.
3 Method



First, we describe our sampling scheme in Section 3.1 and, then, our network architecture and training strategy in Section 3.2. For the purposes of this section, we assume that the training data consists of pairs of keypoints across two images that are in correspondence in terms of location and orientation, but not necessarily scale. The actual procedure used to generate the training data is described in Section 4.1.
3.1 Log-Polar Sampling
As in most papers about learning descriptors [12, 50, 38, 4, 41, 28], we use SIFT keypoints [23]. Given an image of size , a keypoint on is fully described by its center coordinates , its scale , and its orientation . We use a Polar Transformer Network (PTN) [11] to extract a patch around keypoint . To this end, we rely on the following coordinate transform: (1)
Variables denote source coordinates and target coordinates, after the transform. The coordinate origin is centered on , the angle is , and the radius is given by , where λ is a factor that converts the SIFT scale to image pixels (given the convention followed by OpenCV, λ=12 denotes the scale multiplier of SIFT; one can extract larger image regions setting λ>12). Finally, we construct the warped patches by looking up the intensity values in image at coordinates with bilinear interpolation, as done in [11]. This process is illustrated in Fig. 1.
We denote patches extracted in this way as LogPol. For comparison purposes, we also consider the standard cartesian approach, using Spatial Transformer Networks (STN) [17] on a regularly spaced sampling grid, defined by (2)
We denote these patches as Cart. Note that STN and PTN were designed to facilitate whole-image classification by allowing deep networks to manipulate data spatially, thus removing the burden of learning spatial invariance from the classifier. This is not applicable here: we only use their respective samplers, which allow us to efficiently sample the images with inline data augmentation at a negligible computational cost by applying small perturbations when extracting the patches.
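As an illustration of what such a sampler does, the sketch below warps a patch with a generic log-polar grid and bilinear interpolation. It is a sketch under stated assumptions rather than the sampler used in the paper: the grid follows the common log-polar form (radius growing exponentially along the columns, angle growing linearly along the rows), and the names `logpolar_patch`, `bilinear_lookup`, `radius`, `angle0` and `out_size` are hypothetical.

```python
import numpy as np


def bilinear_lookup(image, xs, ys):
    """Sample a 2D image at float coordinates (xs, ys) with bilinear interpolation."""
    h, w = image.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = xs - x0
    wy = ys - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def logpolar_patch(image, cx, cy, radius, angle0=0.0, out_size=32):
    """Warp the region around (cx, cy) into an out_size x out_size log-polar patch.

    Assumed grid: columns index log-radius, rows index angle.
    """
    u = np.arange(out_size)                      # column index (log-radius axis)
    v = np.arange(out_size)                      # row index (angular axis)
    rho = radius ** (u / out_size)               # radii from 1 pixel towards `radius`
    theta = angle0 + 2.0 * np.pi * v / out_size  # angles covering the full circle
    xs = cx + rho[None, :] * np.cos(theta[:, None])
    ys = cy + rho[None, :] * np.sin(theta[:, None])
    return bilinear_lookup(image, xs, ys)


# Toy image whose intensity equals the x coordinate, so samples are easy to check.
image = np.tile(np.arange(64, dtype=float), (64, 1))
patch = logpolar_patch(image, cx=32.0, cy=32.0, radius=16.0)
```

With `radius=16`, half of the columns sample radii below 4 pixels, which shows how this kind of grid oversamples the immediate neighbourhood of the keypoint and undersamples the periphery.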
The following properties of logpolar patches distinguish them from cartesian ones:

(i) Rotations in cartesian space correspond to shifts on the polar axis in log-polar space (rotation equivariance).
(ii) Peripheral regions are undersampled, which means that paired patches look similar to the eye even under drastic scale changes (scale equivariance).
This phenomenon is illustrated in Fig. 2. Note how the log-polar representation facilitates visual matching even when scales are mismatched. Our approach is predicated on leveraging this information effectively with the deep networks and training framework introduced in the next section.
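The two properties above can be verified at the level of coordinates. This is a small sketch with made-up points and transformation parameters (the scale factor `s` and the angle `phi` are illustrative values, not taken from the paper): scaling and rotating points around the origin translates their (log-radius, angle) coordinates by a constant.

```python
import numpy as np


def to_logpolar(points):
    """Map (x, y) points, expressed relative to the keypoint, to (log-radius, angle)."""
    x, y = points[:, 0], points[:, 1]
    return np.stack([np.log(np.hypot(x, y)), np.arctan2(y, x)], axis=1)


# Illustrative points around a keypoint placed at the origin.
pts = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 3.0]])
s, phi = 2.0, np.deg2rad(20.0)
rotation = np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi), np.cos(phi)]])
moved = s * pts @ rotation.T

# Every point moves by the same offset: (log s) along the radial axis, phi along the angular one.
delta = to_logpolar(moved) - to_logpolar(pts)
```

The angles here are chosen so that no point crosses the ±π boundary; in general the angular shift is circular.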
3.2 Network Architecture and Training
To extract patch descriptors, we use a HardNet [28] architecture. As shown in Fig. 3, our network has seven convolutional layers and takes as input grayscale patches of size 32×32. Input patches are preprocessed with Instance Normalization [44]. Feature maps are zero-padded after each convolutional layer, and we use strided convolutions instead of pooling layers. Each convolution is followed by a ReLU and Batch Normalization, but the last convolution layer omits the ReLU. We apply dropout regularization with a rate of 0.1 after the last ReLU. The final convolutional layer is followed by Batch Normalization and L2 normalization. The output of the network is a descriptor of unit length and size 128. We found this to be a good compromise between descriptor size and performance.
The standard way to train such networks is in a siamese configuration, with two copies of the network, sharing weights. Among the many loss formulations that have been proposed [38, 4, 19, 15], we use the triplet loss of [4], as in [28]. To build the required triplets, we consider a collection of patch pairs which contain two different views of a 3D point, where , with denoting the batch size. We systematically check that the 3D points in a given batch are unique, so that and only correspond if . We denote their respective descriptors as . We then mine negative samples with the ‘hardest-in-batch’ procedure of [28]. Specifically, we build a pairwise distance matrix , , where is the Euclidean distance between descriptors and if , and an arbitrarily large value otherwise. We denote the hardest negative sample for , , the one with the smallest distance, as , and the hardest negative sample for as . We consider both and as possible anchors, for all . Denoting a triplet with anchor (), positive () and negative () patches as , we form triplet taking the hardest negative example, if and otherwise. We then take the loss to be
We set the batch size to 1000. For optimization we use Stochastic Gradient Descent (SGD) with a learning rate of 10, momentum of 0.9, weight decay , and decay the learning rate linearly to zero within a set number of training epochs [28]. Sampling the patches inline allows us to apply data augmentation at training time, jittering the orientation of each anchor keypoint by degrees. Our implementation uses PyTorch as a backend. Code, models and training data are all available.
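The hardest-in-batch mining described in this section can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the training code: the function name is hypothetical, the margin of 1.0 is an assumed value, and the loss is the usual triplet margin form applied to the mined negatives.

```python
import numpy as np


def hardest_in_batch_triplet_loss(desc_a, desc_p, margin=1.0):
    """desc_a[i] and desc_p[i] describe two views of the same 3D point (one point per row).

    `margin` is an assumed hyperparameter, not a value stated in the text.
    """
    # Pairwise Euclidean distances: dist[i, j] = ||desc_a[i] - desc_p[j]||.
    diff = desc_a[:, None, :] - desc_p[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    positive = np.diag(dist).copy()
    # Put a large value on the diagonal so matching pairs are never mined as negatives.
    masked = dist + np.eye(len(dist)) * 1e6
    hardest_for_a = masked.min(axis=1)  # closest non-matching row of desc_p for each desc_a[i]
    hardest_for_p = masked.min(axis=0)  # closest non-matching row of desc_a for each desc_p[i]
    negative = np.minimum(hardest_for_a, hardest_for_p)
    return float(np.maximum(0.0, margin + positive - negative).mean())


# Well-separated descriptors give zero loss; collapsed descriptors give the margin.
loss_separated = hardest_in_batch_triplet_loss(np.eye(4), np.eye(4))
collapsed = np.tile([1.0, 0.0, 0.0], (4, 1))
loss_collapsed = hardest_in_batch_triplet_loss(collapsed, collapsed)
```

Taking the minimum over both rows and columns mirrors the idea of treating either patch of a pair as the anchor.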
Footnote 2: https://github.com/cvlabepfl/logpolardescriptors
4 Experiments
In Section 4.1, we introduce the dataset we built to train scale-invariant descriptors, because there is currently none available for this purpose. We then compare ourselves to the state of the art on it. In Sections 4.2, 4.3 and 4.4, we benchmark our models on three publicly available datasets: HPatches [3], AMOS patches [30], and the CVPR’19 Photo Tourism image matching challenge [1]. As baselines, we consider: SIFT [23], TFeat [4], L2-Net [41], HardNet [28], and GeoDesc [25]. We use OpenCV for SIFT, and public implementations for the rest. For our own method we consider descriptors learned with either cartesian or log-polar patches.
4.1 Results on the New Dataset
Table 1. FPR95 on the test sequences of the new dataset.
Sequence  SIFT  TFeat  L2-Net  GeoDesc  HardNet  Ours-Cart (λ=12)  Ours-LogPol (λ=12)  Ours-LogPol (λ=96)
‘british_museum’  5.91  3.53  3.52  4.30  3.21  2.17  2.18  0.96 
‘florence_cathedral_side’  4.36  1.30  0.51  2.13  0.40  0.23  0.23  0.20 
‘lincoln_memorial_statue’  2.89  4.32  2.28  2.61  1.65  1.30  1.31  0.91 
‘milan_cathedral’  7.08  1.98  1.48  1.86  0.35  0.19  0.12  0.07 
‘mount_rushmore’  18.71  11.94  2.52  2.27  0.43  0.42  0.32  0.22 
‘reichstag’  2.22  0.44  0.30  0.42  0.21  0.19  0.19  0.09 
‘sagrada_familia’  9.01  2.41  0.85  1.08  0.27  0.21  0.19  0.03 
‘st_pauls_cathedral’  8.64  2.01  1.48  2.45  0.68  0.42  0.46  0.20 
‘united_states_capitol’  8.67  3.90  2.64  5.43  1.60  1.33  0.98  0.53 
Average  7.50  3.54  1.73  2.51  0.98  0.72  0.67  0.36 
Nearly all learned descriptors rely on the dataset of [7] for training, which provides patches extracted over different viewpoints for three different scenes. Correspondences were established from SfM reconstructions and SIFT. They are thus biased towards keypoints that can be matched with SIFT, i.e., commensurate in terms of scale. In order to learn scale-invariant descriptors under real-world conditions, we require patches extracted at non-corresponding scales, for which we need the original images, which are not provided by [7]. Other datasets, such as [27] or [3], provide images along with homographies for correspondence, but focus on affine transformations and are much too small to train deep networks effectively. Therefore we collected a new dataset for training purposes. In the remainder of this section, we detail how we created it and then report our results on it.
4.1.1 Creating the Dataset
We applied COLMAP [34], a state-of-the-art SfM framework, over large collections of photo-tourism images originally collected by [16]. These images show drastic changes in terms of viewpoint, illumination, and other imaging properties, which is crucial to learn invariance [48]. In addition to sparse reconstructions, COLMAP provides dense correspondences in the form of depth maps. We used them to generate training data by randomly selecting a pair of images and , extracting SIFT keypoints for both, and using the depth maps to build ground truth correspondences. To do this we projected each keypoint from one image to the other using the estimated poses and depth maps. We took a correspondence to be valid if the projection of keypoint in image falls within 1.5 pixels of keypoint in image . We performed a bijective check to ensure one-to-one correspondences. We applied this projection in a cycle, from to and back to , to ensure that the depth estimates are consistent across both views, and discarded the putative correspondence otherwise. Points which fall in occluded areas were likewise discarded. Note that we only check for corresponding locations, but not scales: in this manner we are collecting SIFT keypoints with non-matching scales whose distribution comes from real-world data.
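The distance threshold and the one-to-one check can be illustrated as follows. This is a simplified sketch: it assumes the keypoints of one image have already been projected into the other, it implements the bijective check as a mutual nearest-neighbour test (one possible reading of the text), and it omits the cycle-consistency and occlusion tests. The function name and the toy coordinates are hypothetical.

```python
import numpy as np


def mutual_matches(projected_a, keypoints_b, max_dist=1.5):
    """Return index pairs (i, j) that are mutual nearest neighbours within max_dist pixels.

    projected_a: (N, 2) keypoints of the first image projected into the second image.
    keypoints_b: (M, 2) keypoints detected in the second image.
    """
    dist = np.linalg.norm(projected_a[:, None, :] - keypoints_b[None, :, :], axis=2)
    nearest_b = dist.argmin(axis=1)  # best candidate in B for every projected keypoint
    nearest_a = dist.argmin(axis=0)  # best candidate in A for every keypoint in B
    matches = []
    for i, j in enumerate(nearest_b):
        if nearest_a[j] == i and dist[i, j] <= max_dist:
            matches.append((i, int(j)))
    return matches


# Toy data: two projected points compete for the same keypoint, one point has no match.
projected_a = np.array([[10.0, 10.0], [50.0, 50.0], [10.5, 10.2]])
keypoints_b = np.array([[10.4, 10.1], [80.0, 80.0]])
matches = mutual_matches(projected_a, keypoints_b)
```

Only the closer of the two competing points is kept, so each keypoint takes part in at most one correspondence.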
We also require the orientations to be compatible across views. To guarantee this we use the ground truth camera poses to compute the difference between orientation estimates and filter out keypoint matches that differ by more than 25 degrees, as in [7]. Finally, we suppress pairs of keypoints closer than 7 pixels to each other, to exclude patches with large overlaps, which would be particularly problematic for cartesian patches.
We can similarly use the ground truth to warp the scale across images, which we do in order to estimate the frequency of inaccurate scale estimates. Given a correspondence comprised of two keypoints with scales , we warp the scale from image to image to obtain , and compute the scale difference ratio as , so that , with 1 encoding perfect scale correspondence. We histogram this ratio and use it to evaluate each method under scale changes, as depicted in Fig. 4.
[Fig. 4: results broken down by scale change and orientation change. Panels: (a) SIFT, (b) L2-Net, (c) Ours-Cart (), (d) HardNet, (e) Ours-LogPol (), (f) Ours-LogPol ().]
We select 11 sequences for training, and 9 for testing. Please refer to the supplementary material for details. We split the training sequences into training and validation sets on a per-image basis, with a 70:30 ratio. Images are downsampled to a maximum height or width of 1024 pixels, which is the resolution that we extract keypoints at, and mirror-padded to 1500×1500 to alleviate boundary effects. To obtain patches in cartesian space, we sample the image at the desired keypoints with STN. For log-polar patches we use PTN over a support region commensurate with STN; see Section 3.1. We also consider larger patches, increasing λ. We generate up to 1000 correspondences for each image pair, and extract the patches from the images on the fly.
Training requires negative samples, that is, points not in correspondence. Finding negatives is easy when a SfM reconstruction is available, as done in [7], ensuring that keypoints are stable across all views. This is not feasible in our case. Instead, we generate training samples from a single image pair at a time. Specifically, we take one image pair from each of the 11 sequences and use it to fill roughly 1/11th of each training batch. We can then perform negative mining over the entire batch, as outlined in Section 3.2.
4.1.2 Patch Correspondence Verification
In this section we evaluate performance in terms of patch matching over the test sequences. We extract descriptors for SIFT keypoints with corresponding locations, but using their original scales, which are not always in correspondence. We train our networks with cartesian and log-polar patches, keeping all other settings identical. We use the standard metric in patch matching benchmarks, FPR95, i.e., the False Positive Rate at 95% True Positive recall. For the baselines, we extract patches at the SIFT scale, i.e., λ=12. We also consider λ=96 for log-polar patches. We report the results in Table 1 and discuss them below.
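For reference, FPR95 can be computed from descriptor distances and match labels as sketched below. The function name, the thresholding convention (a pair is accepted when its distance is at or below the threshold) and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def fpr_at_recall(distances, labels, recall=0.95):
    """False positive rate at the smallest threshold that accepts `recall` of the true matches.

    distances: descriptor distance for every candidate pair.
    labels: True for matching pairs, False for non-matching pairs.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    positives = np.sort(distances[labels])
    # Index of the positive pair that brings recall up to the requested level.
    k = int(np.ceil(recall * len(positives))) - 1
    threshold = positives[k]
    negatives = distances[~labels]
    return float((negatives <= threshold).mean())


# Toy example: 20 matching pairs with small distances and 10 non-matching pairs.
pos_dist = np.arange(1, 21) * 0.01
neg_dist = np.array([0.05, 0.15, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
distances = np.concatenate([pos_dist, neg_dist])
labels = np.concatenate([np.ones(20, dtype=bool), np.zeros(10, dtype=bool)])
fpr95 = fpr_at_recall(distances, labels)
```

In this toy example two of the ten non-matching pairs fall below the threshold, so the resulting rate is 0.2; lower values indicate better separation between matches and non-matches.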
Comparison to the state of the art.
Our models trained with log-polar patches deliver the best performance on each sequence, followed by our models trained on cartesian patches, and then HardNet. Remarkably, we achieve our best results with λ=96, which corresponds to patches much larger than those best-suited for traditional descriptors, extracted with λ=12, a fact that we will examine more closely below. Note the small gap between HardNet and Ours-Cartesian, which is due to the innate differences between datasets and training the latter with mismatched scales. The other baselines perform significantly worse.
Performance under large scale mismatches.
In Fig. 4 we break down the results of Table 1 in terms of orientation and scale mismatches. Note how models trained on log-polar representations can tolerate a wide range of scale mismatches. Our results show a negligible drop in performance under scale changes up to 2-3x, and remain useful even at 3-4x. All baselines degrade significantly under scale changes of 2x and become essentially useless beyond that. Note that this invariance is made possible by leveraging log-polar representations and cannot be achieved by simply exposing the models to cartesian patches exhibiting scale changes, as evidenced by the performance of Ours-Cartesian shown in Fig. 4(c). Finally, remember that this data has been collected from real-world settings with unreliable scale detection. In other words, our models allow us to retrieve more correspondences without changing the detector.
Increasing the size of the support region.
As shown in Fig. 2, patches extracted with log-polar sampling are remarkably similar across different scales, because scale changes correspond to shifts in the horizontal dimension. This representation is not only easier to interpret visually, but also easier to learn invariant models with. Moreover, oversampling the immediate neighbourhood of the point allows us to leverage larger support regions, because the effect of occlusions and background motion in log-polar patches is smaller than in their cartesian counterparts. We demonstrate this by training models for different values of λ, and report the results in Table 2. Our models are able to exploit support regions much larger than cartesian-based approaches. We see performance flatten out at λ=96, and observe boundary issues beyond that point, so we use this value for all experiments in the paper. Note how the radius of the circle determining the support region is 8 times larger than the optimal value for cartesian patches, and its area 64 times larger. Note that we use an identical architecture, which can only leverage this information effectively thanks to the log-polar representation.
4.1.3 Image-level Patch Retrieval
Next, we evaluate our performance in terms of patch retrieval. For every image pair in the test sequence, we extract SIFT keypoints on each image and establish ground truth correspondences using the procedure outlined in Section 4.1. Matches with a difference of up to degrees in orientation are considered positives. Typically, a large percentage of the image pixels are occluded, so that it is not possible to generate a large number of matches. Instead, for every pair of images, we extract up to matches and then generate distractors, defined as keypoints further than 3 pixels away from a keypoint. The task is thus to find the ‘needle in the haystack’, where every keypoint has one positive match and negatives. We compute the distance between descriptors, extract the rank of each match, and accumulate it over all keypoints and image pairs. The results are summarized in Fig. 5. Our models with log-polar patches obtain the best results, retrieving the correct match about 97% of the time for our best model, for λ=96. They are followed by our models with cartesian patches, and HardNet. Notice that contrary to the previous experiment, we evaluate on a realistic patch retrieval scenario with a large number of distractors, which indicates that our performance holds even when sampling keypoints densely, and that it does so regardless of λ.
Table 2. FPR95 for different values of λ.
λ  12  16  32  64  96  128
Ours, Cart  0.72  0.77  1.36  4.79  7.03  8.43 
Ours, LogPol  0.67  0.61  0.47  0.40  0.36  0.36 
4.2 Results on HPatches
The HPatches dataset [3] contains 116 sequences with 6 images each, with either viewpoint or illumination changes. As in [7], HPatches provides pre-extracted patches sampled at corresponding scales, which are not useful for our purposes. However, it also provides the original images and ground truth homographies. We thus define the following protocol. We use SIFT to find keypoints and determine correspondences among them using the ground truth homographies. We consider sequences with viewpoint and illumination changes separately. This provides us with 20733 correspondences for the illumination split and 22079 correspondences for the viewpoint split. For every match, we compute the distance between a pair of corresponding descriptors and also to all the negatives in the dataset, and evaluate our models in terms of the rank-1 metric, i.e., the percentage of samples for which we can retrieve the correct match with rank 1. We show the results in Table 3. As expected, our log-polar models outperform most of the baselines, and perform better as λ increases. For this experiment we use the models trained on our dataset, without fine-tuning.
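The rank-1 metric can be sketched as follows, assuming that row i of one descriptor set matches row i of the other and that every other row acts as a negative. The function name and the toy two-dimensional descriptors are hypothetical and chosen only so that the ranks are easy to follow.

```python
import numpy as np


def rank1_accuracy(desc_query, desc_gallery):
    """Fraction of queries whose true match (same row index) is the closest gallery entry."""
    dist = np.linalg.norm(desc_query[:, None, :] - desc_gallery[None, :, :], axis=2)
    true_dist = np.diag(dist)
    # Rank of the true match: 1 + number of gallery entries strictly closer than it.
    ranks = (dist < true_dist[:, None]).sum(axis=1) + 1
    return float((ranks == 1).mean()), ranks


# Toy descriptors: the third query is closer to a wrong gallery entry than to its match.
desc_query = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
desc_gallery = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.2]])
accuracy, ranks = rank1_accuracy(desc_query, desc_gallery)
```

Here two of the three queries retrieve their match first, so the rank-1 score is 2/3. The same ranks can also be accumulated into the retrieval curves used for the image-level experiment.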
4.3 Results on AMOS patches
We also consider AMOS patches [30], a dataset released recently featuring pairs of images captured by webcams and carefully curated in order to provide correspondences. We evaluate our method on the training split, which consists of 27 sequences, each with 50 images, and which also provides keypoints with scales and orientations for every image. We use unique matching keypoint pairs across all images, obtaining a split of 13268 unique keypoint pairs. We use the same metric as for HPatches, and summarize the results in Table 4. As before, we do not retrain the models in any way. Again, our models outperform the state of the art and our results improve with the size of the support region, unlike for methods based on cartesian patches.
Table 3. Rank-1 on HPatches.
Method  Viewpoint split  Illumination split
SIFT,  0.740  0.607
HardNet,  0.813  0.707
GeoDesc,  0.879  0.727
Ours, Cart,  0.828  0.722
Ours, Cart,  0.831  0.732
Ours, Cart,  0.825  0.736
Ours, Cart,  0.752  0.666
Ours, Cart,  0.681  0.616
Ours, LogPol,  0.833  0.729
Ours, LogPol,  0.838  0.743
Ours, LogPol,  0.849  0.764
Ours, LogPol,  0.849  0.774
Ours, LogPol,  0.847  0.774
Table 4. Rank-1 on AMOS patches, for different values of λ.
Method  λ=6  λ=12  λ=16  λ=32  λ=64  λ=96
SIFT  0.551  0.518  0.516  0.510  0.480  0.436 
GeoDesc  0.434  0.396  0.389  0.416  0.438  0.417 
HardNet  0.529  0.464  0.450  0.451  0.470  0.449 
Ours, Cart  0.554  0.507  0.530  0.549  0.524  0.481 
Ours, LogPol  0.607  0.604  0.625  0.641  0.648  0.651 
Table 5. Results on the Phototourism challenge.
Type  Method  Stereo mAP  Stereo Rank  Multiview mAP  Multiview Rank
DoG  SIFT (IJCV’04)  0.0277  9  0.4146  8
DoG  TFeat (BMVC’16)  0.0357  8  0.4643  7
DoG  L2-Net (CVPR’17)  0.0400  6  0.5087  5
DoG  HardNet (NIPS’17)  0.0425  4  0.5481  1
DoG  GeoDesc (ECCV’18)  0.0368  7  0.5298  4
DoG  ContextDesc (CVPR’19)  0.0439  3  0.5399  3
e2e  SuperPoint (CVPR’18)  0.0415  5  0.4778  6
e2e  D2-Net (CVPR’19)  0.0490  1  0.3967  9
Ours (DoG)  Ours-Cartesian,  0.0405  -  0.5208  -
Ours (DoG)  Ours-LogPol,  0.0420  -  0.5389  -
Ours (DoG)  Ours-LogPol,  0.0432  -  0.5396  -
Ours (DoG)  Ours-LogPol,  0.0448  2  0.5427  2
4.4 Results on the Phototourism Challenge
Patch matching performance does not always translate to downstream applications, as evidenced by [48, 35]. We thus also evaluate our method on the public Phototourism Image Matching challenge [1]. This benchmark features two tracks: stereo and multi-view matching, and evaluates local features in terms of the quality of the reconstructed poses. Features are submitted to the organizers, who compute the results. We provide them in Table 5, including comparable baselines (up to 8k features per image, matched by brute-force nearest-neighbour) extracted from the public leaderboards. Our method ranks second on both tracks, and first in terms of average rank. Note that our observations from Section 4.1.2 carry over: models trained on log-polar patches improve with patch size, and outperform cartesian models.
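The average-rank claim can be checked directly from the ranks reported in Table 5. The snippet below simply averages the stereo and multi-view ranks of each ranked entry; the dictionary keys are shorthand labels introduced here for readability.

```python
# (stereo rank, multi-view rank) for every ranked entry of Table 5.
ranks = {
    "SIFT": (9, 8),
    "TFeat": (8, 7),
    "L2-Net": (6, 5),
    "HardNet": (4, 1),
    "GeoDesc": (7, 4),
    "ContextDesc": (3, 3),
    "SuperPoint": (5, 6),
    "D2-Net": (1, 9),
    "Ours-LogPol": (2, 2),
}

average_rank = {name: sum(r) / len(r) for name, r in ranks.items()}
best = min(average_rank, key=average_rank.get)
```

The ranked log-polar entry averages 2.0, ahead of HardNet at 2.5 and ContextDesc at 3.0, while D2-Net, first on the stereo track, averages 5.0.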
5 Conclusions and Future Work
We have introduced a novel approach to learn local descriptors that goes beyond the current paradigm, which relies on image measurements sampled in cartesian space. We show that we can learn richer and more scale-invariant representations by coupling log-polar sampling with state-of-the-art deep networks. This allows us to match local descriptors across a wider range of scales, virtually for free.
References
 [1] Phototourism Challenge, CVPR 2019 Image Matching Workshop. https://imagematchingworkshop.github.io. Accessed August 1, 2019.
 [2] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. FREAK: Fast Retina Keypoint. In CVPR, 2012.
 [3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In CVPR, 2017.
 [4] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. In BMVC, 2016.
 [5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
 [6] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape Matching and Object Recognition Using Shape Contexts. PAMI, 24(4):509–522, April 2002.
 [7] Matthew Brown, Gang Hua, and Simon Winder. Discriminative Learning of Local Image Descriptors. PAMI, 2011.
 [8] Gabriela Csurka and Martin Humenberger. From handcrafted to deep local invariant features. In arXiv preprint arXiv:1807.10254, 2018.

 [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In CVPR Workshop on Deep Learning for Visual SLAM, 2018.
 [10] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, 2019.
 [11] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar Transformer Networks. In ICLR, 2018.
 [12] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C. Berg. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In CVPR, 2015.
 [13] Christopher G. Harris and Mike J. Stephens. A Combined Corner and Edge Detector. In Fourth Alvey Vision Conference, 1988.
 [14] Tal Hassner, Viki Mayzels, and Lihi Zelnik-Manor. On SIFTs and Their Scales. In CVPR, 2012.
 [15] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average precision. In CVPR, 2018.
 [16] Jared Heinly, Johannes L. Schönberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the World in Six Days. In CVPR, 2015.
 [17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. In NIPS, pages 2017–2025, 2015.
 [18] Yan Ke and Rahul Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In CVPR, pages 111–119, 2000.
 [19] Michel Keller, Zetao Chen, Fabiola Maffra, Patrik Schmuck, and Margarita Chli. Learning Deep Descriptors with Scale-Aware Triplet Networks. In CVPR, 2018.
 [20] Iasonas Kokkinos, Michael Bronstein, and Alan Yuille. Dense Scale Invariant Descriptors for Images and Surfaces. Technical report, INRIA, 2012.
 [21] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
 [22] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense Correspondence Across Scenes and Its Applications. In ECCV, 2008.
 [23] David Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov 2004.
 [24] Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In CVPR, 2019.
 [25] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints. In ECCV, 2018.
 [26] Krystian Mikolajczyk and Cordelia Schmid. A Performance Evaluation of Local Descriptors. PAMI, 27(10):1615–1630, 2005.
 [27] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A Comparison of Affine Region Detectors. IJCV, 65(1/2):43–72, 2005.
 [28] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working Hard to Know Your Neighbor’s Margins: Local Descriptor Learning Loss. In NIPS, 2017.
 [29] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning Local Features from Images. In NIPS, 2018.
 [30] Milan Pultar, Dmytro Mishkin, and Jiri Matas. Leveraging outdoor webcams for local descriptor learning. In Computer Vision Winter Workshop, 2019.
 [31] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. In arXiv Preprint, 2019.
 [32] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An Efficient Alternative to SIFT or SURF. In ICCV, 2011.
 [33] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In CVPR, 2018.
 [34] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In CVPR, 2016.
 [35] Johannes L. Schönberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative Evaluation of Hand-Crafted and Learned Local Features. In CVPR, 2017.
 [36] Qi Shan, Changchang Wu, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M. Seitz. Accurate Geo-registration by Ground-to-Aerial Image Matching. In 3DV, 2014.
 [37] Eli Shechtman and Michal Irani. Matching Local Self-Similarities Across Images and Videos. In CVPR, 2007.
 [38] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
 [39] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning Local Feature Descriptors Using Convex Optimisation. PAMI, 2014.
 [40] Christoph Strecha, Alex Bronstein, Michael Bronstein, and Pascal Fua. LDAHash: Improved Matching with Smaller Descriptors. PAMI, 34(1), January 2012.
 [41] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR, 2017.
 [42] Engin Tola, Vincent Lepetit, and Pascal Fua. A Fast Local Descriptor for Dense Matching. In CVPR, 2008.
 [43] Eduard Trulls, Iasonas Kokkinos, Alberto Sanfeliu, and Francesc Moreno-Noguer. Dense Segmentation-Aware Descriptors. Dense Image Correspondences for Computer Vision, 2015.
 [44] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis. In CVPR, 2017.
 [45] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng. Kernelized Subspace Pooling for Deep Local Descriptors. In CVPR, 2018.
 [46] Simon Winder and Matthew Brown. Learning Local Image Descriptors. In CVPR, June 2007.
 [47] Alessio Xompero, Oswald Lanz, and Andrea Cavallaro. MORB: A Multi-Scale Binary Descriptor. In ICIP, 2018.
 [48] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
 [49] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to Find Good Correspondences. In CVPR, 2018.
 [50] Sergey Zagoruyko and Nikos Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. In CVPR, 2015.

 [51] Dang Zheng, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In ECCV, 2018.
 [52] Lei Zhou, Siyu Zhu, Tianwei Shen, Jinglu Wang, Tian Fang, and Long Quan. Progressive Large Scale-Invariant Image Matching In Scale Space. In ICCV, 2017.
6 Beyond Cartesian Representations for Local Descriptors: Supplementary Material
6.1 Regarding the Dataset
To train scale-invariant models with real data relevant to wide-baseline stereo, we needed to collect our own training data. For this we rely on public collections of phototourism images in the Yahoo Flickr Creative Commons 100M (YFCC) dataset. We use COLMAP, a Structure-from-Motion (SfM) framework, to obtain 3D reconstructions. COLMAP provides us with sparse point clouds and depth maps for each image. We clean up the depth maps following the procedure outlined in the paper and use them, along with the ground truth camera poses, to project keypoints between corresponding images.
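The projection of keypoints between two views from depth and pose can be sketched as follows. This is an illustrative example rather than our released code: it assumes a pinhole camera with intrinsics `K`, world-to-camera poses `(R, t)`, and depth measured along the optical axis, and the function name `project_keypoints` is hypothetical.

```python
import numpy as np

def project_keypoints(kp1, depth1, K1, R1, t1, K2, R2, t2):
    """Reproject pixel keypoints from image 1 into image 2.

    kp1:    (N, 2) pixel coordinates (x, y) in image 1.
    depth1: (N,) depth along the optical axis at each keypoint (assumed).
    K1, K2: (3, 3) pinhole intrinsics (assumed).
    R*, t*: world-to-camera rotation (3, 3) and translation (3,) (assumed).
    Returns (N, 2) pixel coordinates in image 2 and (N,) depths in camera 2.
    """
    ones = np.ones((kp1.shape[0], 1))
    # Back-project pixels to 3D points in camera-1 coordinates.
    rays = np.linalg.inv(K1) @ np.hstack([kp1, ones]).T  # (3, N)
    X_cam1 = rays * depth1[None, :]
    # Camera 1 -> world -> camera 2.
    X_world = R1.T @ (X_cam1 - t1[:, None])
    X_cam2 = R2 @ X_world + t2[:, None]
    # Perspective projection into image 2.
    x2 = K2 @ X_cam2
    return (x2[:2] / x2[2]).T, X_cam2[2]
```

The returned depths in the second camera can be compared against that view's own depth map to discard points that are occluded or fall behind the camera.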
We sample pairs of images with a visibility check in order to guarantee that a minimum number of keypoints can be extracted and matched across both views. Specifically, we retrieve the SfM keypoints common to both views, extract their bounding box in each image, and reject the image pair if that bounding box is smaller than a given threshold (we use 0.5) for either image.
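A minimal sketch of such a visibility check is given below. The text above does not state the units of the threshold, so this example assumes it is the bounding-box area expressed as a fraction of the image area; the function names `bbox_area_fraction` and `keep_pair` are likewise hypothetical.

```python
import numpy as np

def bbox_area_fraction(keypoints, width, height):
    """Area of the keypoints' axis-aligned bounding box over the image area."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    return float((x_max - x_min) * (y_max - y_min)) / float(width * height)

def keep_pair(kp_a, size_a, kp_b, size_b, threshold=0.5):
    """Keep the pair only if the shared keypoints cover enough of both images.

    kp_a, kp_b: (N, 2) pixel coordinates of the keypoints common to both views.
    size_a, size_b: (width, height) of each image.
    threshold: minimum bounding-box area fraction (an assumed interpretation).
    """
    frac_a = bbox_area_fraction(kp_a, *size_a)
    frac_b = bbox_area_fraction(kp_b, *size_b)
    # The pair is rejected as soon as either view falls below the threshold.
    return frac_a >= threshold and frac_b >= threshold
```

Under this reading, a pair is discarded when the co-visible region is small in even one of the two images, which filters out pairs with little usable overlap.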
We use 11 sequences for training and validation and 9 for testing. We list their details in Table 6, and give some examples in Fig. 6. This data will be made publicly available along with code and pretrained models.