Being able to extract reliable sets of point correspondences between images is a fundamental requirement for a large variety of computer vision pipelines, such as Structure from Motion (SfM) [schoenberger2016sfm], SLAM [orbslam], Visual Localization [visual_localization], object detection [object_detection0, object_detection1], and object tracking [object_tracking]. The problem has historically been divided into two sequential steps: local feature extraction and pairwise matching.
The feature extraction step starts with the detection of a sparse set of salient points, referred to as keypoints, in each image. Objects visible in multiple images should trigger the detection of the same set of keypoints, in order to permit the establishment of correspondences between the images. As a consequence, the detection process is required to be robust to some degree of image alteration, such as illumination and viewpoint changes or occlusions. Several algorithms have been proposed over the last decades, classified as blob [sift, detector_scaleaffine, detector_scale], corner [harris, fast], or region detectors [mser]. The feature extraction step continues with the assignment of a descriptor vector to each detected keypoint, whose purpose is to describe the keypoint neighborhood. Among the many proposed algorithms for descriptor extraction [sift, orb, surf, brief, brisk], SIFT [sift] and its improved versions [sift_pca, rootsift] are the most successful ones and still remain widely used nowadays.
More recently, with the spread of data-driven approaches, a multitude of local feature extraction methods based on deep learning have emerged. Early methods relied on existing keypoint detectors and focused on designing networks meant to extract the corresponding descriptors [hardnet, l2-net]. Later, motivated by the tight entanglement of keypoints and descriptors, the focus shifted towards the design of network architectures meant to predict keypoints and descriptors jointly [d2-net, r2d2, aslfeat]. Our method belongs to the latter group.
After the feature extraction step, in the pairwise matching step, the descriptors from the different images are compared against each other in order to establish the correspondences. This last step is often responsible for a considerable part of the total computational cost of the sparse correspondence search. In order to reduce the matching complexity, we propose a deep feature extraction network capable of extracting multiple complementary keypoint sets. This permits restricting the comparison to descriptors that belong to the same set, thus reducing the computational complexity of the matching. In order to train such a network, we propose a novel unsupervised loss that discourages overlaps between the different keypoint sets.
For 3D reconstruction tasks, it has been shown that the keypoint distribution has a strong influence on the quality of the recovered camera poses [down-to-earth], with worse results when large image portions are not covered. To encourage an even keypoint distribution, we employ the unsupervised loss formulation originally proposed in [r2d2]. However, the use of this loss may lead to detections in non-discriminative regions as well. In order to mitigate this side effect, we propose a variance-based weighting scheme that dampens the loss in areas where the descriptors are less discriminative.
Differently from the classical methods, which rely on carefully handcrafted algorithms, deep methods require large amounts of data in order to generalize to unseen scenes. Moreover, depth maps and poses generated from 3D reconstructions, which are possibly inaccurate and incomplete, are often used for training [r2d2, aslfeat]. In an attempt to overcome these limitations, we train our feature extraction network exclusively on images warped using random homographies. Furthermore, we augment the data with photometric distortions.
Our contribution is threefold:
We propose a deep architecture, named MD-Net, trained with a novel unsupervised loss formulation, which is capable of extracting multiple complementary sets of features. This reduces the computational complexity of the subsequent matching phase.
A training loss re-weighting based on the local descriptor variance is introduced. This discourages the detection of keypoints with less discriminative descriptors.
Our feature extraction network, which is trained exclusively on images warped using random homographies, generalizes well to 3D-related tasks, as demonstrated on two well-known online benchmarks [aachen, imb].
II Related works
In the last decades, a multitude of algorithms addressing the sparse correspondence problem have been designed: in-depth evaluations have been carried out in [localfeaturesurvey, localfeaturebenchmark0, localfeaturebenchmark1]. With the advent of deep learning, data-driven methods were proposed to address one or more steps of the existing feature extraction pipelines. Early methods were trained to either detect repeatable keypoints [tilde, taskdetector, quadnet, keynet], or to distill compact descriptors from normalized patches, previously extracted by means of a classical method [hardnet, l2-net, sosnet]. Later, deep methods were proposed to both detect keypoints and extract their descriptors [lift, superpoint, lfnet], with a shift towards joint learning with D2-Net, R2D2 and ASLFeat [r2d2, d2-net, aslfeat]. Differently from the approaches listed above, [disk] uses reinforcement learning to train a deep network for sparse feature extraction, obtaining good performance at the cost of a more expensive training procedure. Our method is most closely related to R2D2 [r2d2], with which we share the core architecture and one of the unsupervised losses. Differently from R2D2, we employ a variance-based loss dampening, supported by a two-stage training scheme, to avoid detections in areas where the resulting descriptors are not locally discriminative. Additionally, our network is capable of detecting multiple complementary sets of keypoints. While in [largescale] a weight is predicted for each local feature based on its relevance for the downstream image retrieval task, our loss re-weighting is based on a parameter-free local measure of discriminativeness.
Deep learning has been applied successfully to the matching task as well, with [superglue] completely replacing the traditional matching based on mutual nearest neighbors, and other methods proposing learnt outlier filters [oanet, acne]. These methods lead to better matching results, but increase the matching computational complexity significantly.
Multiple strategies have been proposed in order to reduce the matching computational complexity for SfM pipelines. These are particularly useful when dealing with the reconstruction of a scene from an unordered set of images, potentially captured in different conditions. In fact, in this scenario, correspondences need to be established by matching all the possible image pairs, which results in a complexity growing quadratically in both the number of images and the number of extracted keypoints per image. One possible approach toward reducing this computational burden is to lower the number of image pairs by using strategies based on image similarities [schoenberger2016vote]. Alternatively, the number of matching operations can be reduced by using approximate nearest neighbor algorithms [approximate_nearest_neighbor, flann]. However, the former approach introduces the risk of missing valid image pairs and the latter decreases the quality of the matches [imb]. For those reasons, when high reconstruction quality is required, many 3D reconstruction pipelines still match all the possible image pairs and use the exact Mutual Nearest Neighbor (MNN) matching [schoenberger2016sfm, opensfm]. With MD-Net, we propose a novel approach that reduces the matching complexity by extracting a predefined number of disjoint feature sets at each image, which permits limiting the matching to the features within the same set.
III Model overview
III-A Network architecture
The network architecture, depicted in Fig. 2, is a streamlined version of R2D2 [r2d2] with the addition of our multi-detector branch. The backbone consists of a fully convolutional network where the commonly used convolution pyramid is replaced by a series of dilated convolutions, meant to increase the effective field-of-view of the network without lowering the output resolution. The backbone processes the input RGB image and outputs a feature volume, which is then fed to two different branches: the descriptor branch and the multi-detector branch. In the descriptor branch, the feature volume is L2-normalized along the channel dimension to produce the final descriptor volume, which associates a descriptor vector to each pixel of the input image. In the multi-detector branch, instead, the feature volume is squared and a single 1x1 convolutional layer is used to generate a detection heatmap volume with one channel per desired keypoint set. Each channel of this volume will be used to extract one set of keypoints. The resulting Multi-Detector network, named MD-Net, is rather compact and counts less than half a million parameters.
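To make the two-branch design concrete, here is a minimal PyTorch sketch of an MD-Net-style model. The layer count, channel widths and dilation rates are illustrative assumptions, not the paper's exact configuration; only the overall structure (dilated backbone, L2-normalized descriptor branch, squared features followed by a 1x1 convolution in the multi-detector branch) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNetSketch(nn.Module):
    """Illustrative MD-Net-style model: a dilated fully convolutional
    backbone, a descriptor branch (L2 normalization along channels) and
    a multi-detector branch (squaring + one 1x1 convolution)."""
    def __init__(self, desc_dim=128, num_detectors=2):
        super().__init__()
        # Dilated convolutions enlarge the receptive field while keeping
        # the full output resolution (channel widths are assumptions).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(64, desc_dim, 3, padding=4, dilation=4),
        )
        # One 1x1 conv maps the squared features to one heatmap per set.
        self.detector_head = nn.Conv2d(desc_dim, num_detectors, 1)

    def forward(self, image):
        feats = self.backbone(image)               # shared feature volume
        descriptors = F.normalize(feats, dim=1)    # per-pixel unit descriptors
        heatmaps = self.detector_head(feats ** 2)  # one heatmap per detector
        return descriptors, heatmaps

model = MDNetSketch(desc_dim=128, num_detectors=2)
desc, heat = model(torch.randn(1, 3, 64, 64))
```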
III-B Feature Extraction and Matching
In each heatmap, the candidate keypoints are detected as the pixel coordinates of the heatmap local maxima, after filtering out low values and applying a local Non-Maxima Suppression (NMS). Given a keypoint budget, for each heatmap we select only the candidate keypoints with the highest values in the heatmap. Finally, we obtain the local features by coupling each keypoint with its descriptor, sampled from the descriptor volume at the keypoint pixel location.
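The detection procedure above can be sketched as follows; the brute-force NMS loop and the default parameter values are simplifications for clarity, not the paper's implementation.

```python
import numpy as np

def extract_keypoints(heatmap, threshold=0.7, nms_radius=2, budget=500):
    """Detect keypoints as local maxima of one detection heatmap:
    filter out low scores, apply non-maxima suppression in a square
    window, then keep the highest-scoring `budget` candidates."""
    h, w = heatmap.shape
    # Pad with -inf so border pixels compare only against real values.
    padded = np.pad(heatmap, nms_radius, mode="constant",
                    constant_values=-np.inf)
    keypoints = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y, x]
            if score < threshold:
                continue  # filter out low heatmap values
            window = padded[y:y + 2 * nms_radius + 1,
                            x:x + 2 * nms_radius + 1]
            if score >= window.max():  # local maximum in the NMS window
                keypoints.append((score, y, x))
    keypoints.sort(reverse=True)       # highest scores first
    return [(y, x) for _, y, x in keypoints[:budget]]
```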
For a pair of images with n features each, the Mutual Nearest Neighbor matching boils down to computing a distance matrix between the two image descriptor sets, which has a computational complexity of O(n^2). Thanks to our network architecture, instead, only descriptors associated with the same detector heatmap need to be matched, which reduces the distance matrix size to (n/N) x (n/N) and the corresponding computational complexity for each set to O((n/N)^2). Repeating the matching for each one of the N feature sets results in an overall complexity reduction by a factor N, as follows:

N * O((n/N)^2) = O(n^2 / N).
A visual intuition for the reduced computational complexity is provided in Fig. 3. The aggregated matches are obtained by joining all the sets of matches.
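The set-wise matching can be sketched as follows, assuming L2-normalized descriptors so that similarity is the inner product; `mutual_nn` and `matched_sets` are hypothetical helper names, not part of any released code.

```python
import numpy as np

def mutual_nn(desc_a, desc_b):
    """Mutual Nearest Neighbor matching on two descriptor arrays
    (rows are L2-normalized descriptors); returns index pairs."""
    sim = desc_a @ desc_b.T                  # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)               # best match a -> b
    nn_ba = sim.argmax(axis=0)               # best match b -> a
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def matched_sets(desc_sets_a, desc_sets_b):
    """Match each of the N keypoint sets independently and join the
    results; each similarity matrix is (n/N) x (n/N) instead of n x n."""
    matches = []
    for set_id, (da, db) in enumerate(zip(desc_sets_a, desc_sets_b)):
        matches += [(set_id, i, j) for i, j in mutual_nn(da, db)]
    return matches
</```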
IV Loss formulation
The loss formulation can be split into two main components: the descriptor loss and the detector loss, applied at the outputs of the corresponding branches, respectively.
IV-A Descriptor loss
The goal of the descriptor loss is to promote discriminative descriptors that permit recognizing the correct correspondences between the keypoints of two images. Similarly to previous works [l2-net, hardnet], we frame descriptor learning as a metric learning problem, where two corresponding keypoints should have similar descriptors, while non-corresponding keypoints should have dissimilar ones. To this purpose, we employ a simple hinged formulation of the Triplet Loss:
where the similarity between descriptors is measured by the inner product, the loss is computed over the set of all the sampled triplets, a hinge margin is applied, and each triplet consists of an anchor descriptor, the positive correspondence descriptor and one negative descriptor. While it is trivial to build the (anchor, positive) descriptor pair if the geometric transformation that relates the considered image pair is known, there is a virtually infinite number of possible (anchor, negative) candidates. As suggested in [hardnet], we pick the negative following the Hardest-in-Batch strategy.
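A possible NumPy sketch of the hinged triplet loss with Hardest-in-Batch negatives, assuming a batch where the i-th anchor corresponds to the i-th positive and all descriptors are L2-normalized:

```python
import numpy as np

def hinged_triplet_loss(anchors, positives, margin=0.5):
    """Hinged triplet loss with the Hardest-in-Batch negative strategy.
    Row i of `positives` is assumed to correspond to row i of `anchors`;
    similarity is the inner product of unit-norm descriptors."""
    sim = anchors @ positives.T              # batch similarity matrix
    pos = np.diag(sim).copy()                # similarities of true pairs
    np.fill_diagonal(sim, -np.inf)           # mask out the true pairs
    hardest_neg = sim.max(axis=1)            # most similar wrong match
    # Hinge: penalize when a negative gets within `margin` of the positive.
    return float(np.maximum(0.0, margin - pos + hardest_neg).mean())
```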
IV-B Detector loss
The goal of the detector loss is twofold. First, promoting heatmaps with well-localized maxima, as these will determine the detected keypoints. Second, promoting repeatable heatmaps: content appearing in two images should lead to similar heatmaps, such that keypoint correspondences can be established between the two images. We design our loss as the sum of three components: the peakyness loss, the similarity loss and the dissimilarity loss. While the first two losses are applied to each detection heatmap independently and then mean aggregated, the dissimilarity loss formulation considers each possible pair of detection heatmaps, in order to discourage any overlap between sets of keypoints selected by different detectors. For the sake of clarity, in the following we express each loss for a single pixel. The losses are then mean aggregated over the entire image domain.
IV-B1 Peakyness loss
In order to encourage the network to produce well-distributed local peaks, while avoiding non-discriminative areas, we propose a modified version of the peaky loss formulated in [r2d2]. The loss is defined as follows:
where the peakyness is measured over a square patch centered at the considered pixel, and a pixelwise weight is designed to avoid peaks in areas where the local descriptors are not discriminative. This weight is defined as follows:
it represents the local variance of the backbone output, computed over a patch centered at the considered pixel and averaged along the channel dimension. Additionally, the loss in Eq. (3) is computed on the detection heatmaps obtained from the warped image, and the two losses are averaged. An example of the effect of the pixelwise weighting is shown in Fig. 4, where the detection heatmap appears smoother in the less discriminative regions.
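The variance-dampened peakyness term could be sketched as below; the non-overlapping patch tiling and the unnormalized variance weight are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np

def variance_weighted_peaky_loss(heatmap, features, patch=4):
    """Sketch of a variance-dampened peakyness loss: for each patch,
    the R2D2-style peaky term 1 - (max - mean) is scaled by the local
    variance of the backbone features (averaged over channels), so flat
    non-discriminative regions contribute less to the loss."""
    h, w = heatmap.shape
    losses = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            win = heatmap[y:y + patch, x:x + patch]
            peaky = 1.0 - (win.max() - win.mean())   # low when peaked
            fwin = features[:, y:y + patch, x:x + patch]
            weight = fwin.var(axis=(1, 2)).mean()    # local feature variance
            losses.append(weight * peaky)
    return float(np.mean(losses))
```

With this weighting, a flat feature region (zero variance) contributes nothing, and for a given discriminative region a sharply peaked heatmap patch yields a lower loss than a flat one.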
IV-B2 Similarity loss
In order to promote repeatable heatmaps, we adopt the following loss, which enforces consistency between the heatmaps produced for the original image and for its warped counterpart:
where the heatmap of the warped image is brought back to the original image frame by the inverse warping.
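A simplified illustration of the consistency idea, replacing the homography warp with a pure integer translation and using one minus the cosine similarity as the penalty (both are assumptions made for the sake of a compact example):

```python
import numpy as np

def similarity_loss(heatmap, warped_heatmap, shift):
    """Simplified heatmap-consistency term: undo a known integer
    translation (standing in for the inverse homography warp), then
    penalize disagreement with 1 - cosine similarity."""
    dy, dx = shift
    unwarped = np.roll(warped_heatmap, (-dy, -dx), axis=(0, 1))
    a, b = heatmap.ravel(), unwarped.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - cos
```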
IV-B3 Dissimilarity loss
Finally, in order to promote that the detection heatmaps lead to different sets of keypoints, we propose a novel loss that penalizes co-located peaks for each pair of detection heatmaps. Our loss is formulated as follows:
where N is the number of detectors and the binomial coefficient counts the possible detector heatmap pairs. Similarly to the peakyness loss, Eq. (6) is applied to the detection heatmaps of the warped image as well. Fig. 5 provides an example of the resulting keypoint sets.
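A minimal sketch of the overlap penalty; the exact per-pair term of Eq. (6) may differ, here the mean elementwise product of each heatmap pair is used as an illustrative stand-in.

```python
import numpy as np

def dissimilarity_loss(heatmaps):
    """Sketch of the overlap penalty: for every pair of detector
    heatmaps, penalize co-located responses via the mean elementwise
    product, averaged over the N-choose-2 possible pairs."""
    n = len(heatmaps)
    pair_losses = []
    for i in range(n):
        for j in range(i + 1, n):
            # Nonzero only where both detectors respond at the same pixel.
            pair_losses.append((heatmaps[i] * heatmaps[j]).mean())
    return float(np.mean(pair_losses))
```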
V Model training
While the local variance computed over the input image can be helpful in discerning textured and flat areas, it does not directly relate to the local descriptor discriminativeness. For this reason, Eq. (3) employs the local variance of the backbone output instead, which is directly related to the descriptor volume and is therefore more suitable to avoid keypoint detections in areas whose descriptors would not be particularly discriminative. However, this reasoning does not hold true at the beginning of training, when the network weights are randomly distributed and the predicted descriptor volume is not meaningful. Thus, we adopt a two-stage training procedure:
First, in the descriptor volume priming, we train only the backbone and the descriptor branch with the descriptor loss.
Then, in the joint training, we train our overall architecture with the full loss, in which three weights balance the individual loss terms.
The descriptor volume priming represents the main training effort, while the joint training needs only a few iterations. An added benefit of this training procedure is that changing the number of keypoint sets requires us to repeat only the joint training stage. Finally, during the joint training, the local variance is used purely as a weighting term, i.e., the weight gradients do not participate in the backpropagation.
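In PyTorch, excluding the variance weight from backpropagation amounts to detaching it from the computation graph; a minimal illustration (the stand-in loss terms are hypothetical):

```python
import torch

# Illustrative: the variance-based weight is treated as a constant during
# backpropagation by detaching it, so gradients flow only through the
# loss term it scales, not through the variance computation itself.
features = torch.randn(8, 16, 16, requires_grad=True)
weight = features.var(dim=(1, 2)).mean().detach()  # no gradient through weight
peaky_term = features.mean()                       # stand-in for the peaky loss
loss = weight * peaky_term
loss.backward()
```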
VI-A Training details
We train MD-Net on patches randomly drawn from the Revisiting Oxford and Paris distractors dataset [oxford_paris]. We implement our model in PyTorch [pytorch] and train it with the Adam optimizer [adam] and a fixed learning rate on a single Nvidia GTX1080Ti. The descriptor volume priming takes hours, while the joint training consists of far fewer iterations and is completed in minutes. Each iteration employs a batch of patches. Overall, the training procedure consumes a total of 710k images. Concerning the descriptor loss in Eq. (2), we set a hinge margin, sample the positive and negative descriptors on a regular grid with step 10px, and classify a descriptor as a negative candidate when it lies beyond a minimum distance from the correct location. We adopt 128-dimensional descriptors. Concerning the peaky loss in Eqs. (3) and (4), we compute the peakyness and the variance over square patches. Finally, for the training scenario with two detectors, we balance the three loss components with fixed weights.
We test MD-Net on three popular benchmarks: HPatches [hpatches], Aachen Day-Night [aachen_extended] and the Image Matching Benchmark [imb]. In all the experiments we employ MD-Net with two detectors, denoted MD-2-Net. The filtering threshold and the NMS radius introduced for the keypoint extraction in Sec. III-B are set to 0.7 and a fixed radius, respectively. We run MD-2-Net on a multi-scale image pyramid obtained by downscaling the input image by a constant factor until the shortest image dimension drops below a minimum size. Finally, for each detector, we select the keypoints with the highest scores across the multiple scales. The main metrics in the experiments are the following, involving a pair of images that have to be matched:
MMA: The Mean Matching Accuracy is the mean ratio between the number of correct matches and the total number of proposed matches [d2-net].
MS: The Matching Score is the mean ratio between the number of correct matches and the number of keypoints extracted at one image in the area shared with the other. The metric is computed for both the images and the results are averaged [superpoint].
mAA: The mean Average Accuracy is the area under the curve of the fraction of correctly estimated relative poses as a function of the pose error [imb].
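As an illustration of the first metric, MMA at a threshold can be computed as below, assuming one reprojection error per proposed match is available (`mean_matching_accuracy` is a hypothetical helper name):

```python
import numpy as np

def mean_matching_accuracy(errors, threshold):
    """MMA at a pixel threshold: the fraction of proposed matches whose
    reprojection error falls below the threshold."""
    errors = np.asarray(errors, dtype=float)
    return float((errors <= threshold).mean())
```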
MMA and MS are evaluated at a given pixel error threshold. In all the tables we mark the best result in bold and underline the second best. We compare MD-2-Net with two state-of-the-art deep feature extraction networks: R2D2 [r2d2] and ASLFeat [aslfeat]. We employ their official implementations and adopt either their default parameters or those specified by the authors for each benchmark, when provided. In addition, we also consider Upright-SIFT [sift], the baseline method in the Image Matching Benchmark [imb], employing their implementation. For the purpose of a fair comparison, we do not compare with methods employing deep matchers, such as [superglue].
Table III: percentage of successfully localized images per method, at the (0.25m, 2°), (0.5m, 5°) and (5m, 10°) error thresholds.
VI-B1 HPatches [hpatches]
This benchmark considers both indoor and outdoor scenes divided in two sets: the viewpoint set v contains images of mostly planar scenes captured from different angles in the same lighting conditions, while the illumination set i contains images captured from a fixed camera in different lighting conditions. Each scene contains a set of images and the ground truth homographies linking the first image to all the others. We evaluate following the methodology of D2-Net [d2-net], with a maximum keypoint budget per image and evaluation on the sets v, i and their union, denoted overall. The performance at error thresholds greater than a few pixels is of little interest in real-world applications, such as 3D reconstruction, due to the tight geometric filters employed. For this reason, in Tab. II we report only the numerical values of MMA and MS up to 3px error. MD-2-Net obtains competitive MMA results on all three image sets, at all the error thresholds. In particular, it is the best performing method on the overall set at both 1px and 2px, while following R2D2 closely at 3px. Additionally, MD-2-Net provides good MS results, following the best performing method ASLFeat.
VI-B2 Aachen Day-Night [aachen, aachen_extended]
This online benchmark is part of the long-term visual localization benchmark [visuallocalization_benchmark]. It consists of two sets of images of the German city Aachen. The first set is captured during daytime and the corresponding ground truth camera intrinsics and poses are provided. The second set is captured at night instead, and the benchmark target is to re-localize these query images using the first set. The online benchmark has recently been updated with more precise ground truth poses and additional query images. For a fair comparison, we run MD-2-Net, R2D2, ASLFeat and Upright-SIFT using the same re-localization pipeline based on COLMAP [schoenberger2016sfm], available at [visuallocalization_code]. The results are reported in Tab. III, where MD-2-Net achieves the highest percentage of successfully localized images at the strictest error threshold under the given keypoint budget, and it follows the other deep methods closely at the higher error thresholds. In contrast to R2D2 [r2d2] and ASLFeat [aslfeat], our network MD-2-Net is trained exclusively using synthetic homographies and neither on day-night pairs nor on 3D data.
VI-B3 Image Matching Benchmark [imb]
This is a recent online benchmark proposed to evaluate the performance of local features [imb] in the context of stereo pose recovery and multiview reconstruction on two sets of sequences, namely Phototourism and PragueParks. It considers multiple intermediate metrics (Number of Features (NF), Number of Inlier matches (NI), Repeatability (Rep), Matching Score (MS), Number of inlier Matches filtered by COLMAP [schoenberger2016sfm] (NM), Number of triangulated Landmarks (NL), Track Length (TL), Absolute Trajectory Error (ATE)) as well as the resulting mean Average Accuracy (mAA) up to a fixed pose error threshold. For a more detailed description of the metrics, we refer to the benchmark documentation [imb]. We evaluate MD-2-Net on the restricted keypoint category: maximum 2048 keypoints per image. The benchmark results are reported in Tab. IV. MD-2-Net achieves competitive results in all the metrics. In particular, it provides the best mAA on both sets, for both the stereo and multiview tasks. A qualitative comparison between the considered methods is provided in Fig. 1 for the stereo pose recovery task.
VI-C Ablation studies
In order to test the performance of our method with a varying number of detectors, we train different instances of MD-Net with four and eight detectors, using the same primed backbone, and test them on the HPatches dataset [hpatches]. It is important to note that the weight of the dissimilarity loss plays a crucial role: the smaller its value, the higher the chances for multiple detectors to find very similar keypoints, and vice versa. To quantify this, we introduce the Separability metric at n pixels, denoted Sep@n px. This measures the overlap between all the detected keypoints as one minus the ratio between the number of keypoints selected by one detector that are closer than n pixels to any keypoint detected by the other detectors and the total number of detected keypoints. The higher the separability, the lower the chances of observing keypoints from different detectors falling within n pixels of each other. As an example, in our tests a larger dissimilarity weight leads to a higher separability, while a smaller one lowers it. In order to ensure that Sep@3px remains high, we empirically set the dissimilarity loss weight for the four and eight detector cases. The results of our test are reported in Tab. I (refer to Sec. VI for more details about the dataset and metrics) and show that the model trained with two detectors, denoted MD-2-Net, offers the best trade-off between the single, four and eight detector versions, in terms of metrics and matching complexity.
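The separability metric described above can be sketched as follows (`separability` is a hypothetical helper name; keypoints are (y, x) pixel coordinates):

```python
import numpy as np

def separability(keypoint_sets, radius):
    """Sep@n px sketch: one minus the fraction of keypoints lying within
    `radius` pixels of any keypoint from a *different* detector, over
    the total number of detected keypoints."""
    total, overlapping = 0, 0
    for i, kps in enumerate(keypoint_sets):
        # All keypoints detected by the other detectors.
        others = np.array([p for j, s in enumerate(keypoint_sets)
                           if j != i for p in s], dtype=float)
        for p in kps:
            total += 1
            dists = np.linalg.norm(others - np.asarray(p, dtype=float), axis=1)
            if len(others) and dists.min() < radius:
                overlapping += 1
    return 1.0 - overlapping / total
```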
When comparing runtimes, matching all the possible pairs between 300 images with 8000 keypoints each takes 288s with the single detector, 161s using two, 87s using four and only 51s when using eight, with an average of 6.4ms, 3.6ms, 1.9ms and 1.1ms per pair, respectively. Tests are carried out on a single Nvidia GTX1080Ti and the matching time considers the scores computation, the mutual nearest neighbor search and the match aggregation.
VII Conclusion and future works
We introduced MD-Net, a novel deep feature extraction network capable of extracting multiple disjoint sets of local features: these can be matched independently, thus reducing the computational complexity of the matching phase. The high separability values obtained in our analysis, with a varying number of detectors, confirm the effectiveness of the novel unsupervised dissimilarity loss at the basis of MD-Net. Additionally, we proposed a variance-based loss dampening scheme that, together with the two-stage training, avoids the detection of keypoints associated with non-discriminative descriptors.
Our experiments show that the network, trained unsupervised, achieves competitive results on different 3D-related tasks at a reduced matching complexity, despite being trained exclusively on images warped with random homographies.
In the future, we will consider different strategies to select the keypoints from each heatmap, and couple the proposed multi-detector paradigm with a deep matcher architecture, such as [superglue], in order to benefit from additional learnt geometric consistency while keeping the matching cost manageable.
Acknowledgement: This work has been supported by the FFG, Contract No. 881844: "ProFuture".