Detecting interest points in RGB images and matching them across views is a fundamental capability of many robotic systems. Tasks such as Simultaneous Localization and Mapping (SLAM) (cadena2016past), Structure-from-Motion (SfM) (agarwal2010bundle) and object detection assume that salient keypoints can be detected and re-identified in a wide range of scenarios, which requires invariance to lighting effects, viewpoint changes, scale, time of day, etc. However, these tasks still mostly rely on handcrafted image features such as SIFT (lowe1999object) or ORB (rublee2011orb), which have been shown to be limited in performance when compared to learned alternatives (balntas2017hpatches).
Deep learning methods have revolutionized many computer vision applications including 2D/3D object detection (lang2019pointpillars; tian2019fcos), semantic segmentation (li2018learning; kirillov2019panoptic), human pose estimation (sun2019deep), etc. However, most learning algorithms need supervision and rely on labels which are often expensive to acquire. Moreover, supervising interest point detection is unnatural, as a human annotator cannot readily identify salient regions in images, nor the key signatures or descriptors that would allow their re-identification. Self-supervised learning methods have recently gained popularity, being used for tasks such as depth regression (guizilini2019packnet), tracking (vondrick2018tracking) and representation learning (wang2019learning; kolesnikov2019revisiting). Following detone2018superpoint and christiansen2019unsuperpoint, we propose a self-supervised methodology for jointly training a keypoint detector as well as its associated descriptor.
Our main contributions are: (i) We introduce IO-Net (i.e. InlierOutlierNet), a novel proxy task for the self-supervision of keypoint detection, description and matching. By using a neurally-guided outlier-rejection scheme (brachmann2019neural) as an auxiliary task, we show that we are able to simultaneously self-supervise keypoint description and generate optimal inlier sets from possible corresponding point-pairs. While the keypoint network is fully self-supervised, it is able to effectively learn distinguishable features for two-view matching, via the flow of gradients from consistently matched point-pairs. (ii) We introduce KeyPointNet, and propose two modifications to the keypoint-network architecture described in christiansen2019unsuperpoint. First, we allow the keypoint location head to regress keypoint locations outside their corresponding cells, enabling keypoint matching near and across cell-boundaries. Second, by taking advantage of sub-pixel convolutions to interpolate the descriptor feature-maps to a higher resolution, we show that we are able to improve keypoint descriptor fidelity and matching performance, as the upsampled feature-maps retain more fine-grained detail for pixel-level metric learning in the self-supervised regime. Through extensive experiments and ablation studies, we show that the proposed architecture allows us to establish state-of-the-art performance for the task of self-supervised keypoint detection, description and matching.
2 Related Work
The recent success of deep learning-based methods in many computer vision applications, especially feature descriptors, has motivated general research in the direction of image feature detection beyond handcrafted methods. Such state-of-the-art learned keypoint detectors and descriptors have recently demonstrated improved performance on challenging benchmarks (detone2018superpoint; christiansen2019unsuperpoint; Sarlin:etal:CVPR2019). In TILDE (Verdie:etal:CVPR2015), the authors introduced multiple piece-wise linear regression models to detect features under severe changes in weather and lighting conditions. To train the regressors, they generate pseudo ground truth interest points by applying a Difference-of-Gaussian (DoG) detector (Lowe:IJCV2004) to an image sequence captured at different times of day and seasons. LIFT (Yi:etal:ECCV2016) is able to estimate features which are robust to significant viewpoint and illumination differences using an end-to-end learning pipeline consisting of three modules: interest point detection, orientation estimation and descriptor computation. In LF-Net (Ono:etal:NIPS2018), the authors introduced an end-to-end differentiable network which estimates position, scale and orientation of features by jointly optimizing the detector and descriptor in a single module.
An unsupervised learning scheme has also been introduced for training a shallow 2-layer network to predict feature points. SuperPoint (detone2018superpoint) is a self-supervised framework that is trained on whole images and is able to predict both interest points and descriptors. Its architecture shares most of the computation in the detection and description modules, making it fast enough for real-time operation, but it requires multiple stages of training, which is not desirable in practice. Most recently, UnsuperPoint (christiansen2019unsuperpoint) presented a fast deep-learning based keypoint detector and descriptor which requires only one round of training in a self-supervised manner. Inspired by SuperPoint, it also shares most of the computation in the detection and description modules, and uses a siamese network to learn descriptors. They employ simple homography adaptation along with non-spatial image augmentations to create the 2D synthetic views required to train their self-supervised keypoint estimation model, which is advantageous because it trivially solves data association between these views. In their work, christiansen2019unsuperpoint predict keypoints that are evenly distributed within the cells and enforce that the predicted keypoint locations do not cross cell boundaries (i.e. each cell predicts a keypoint inside it). We show that this leads to sub-optimal performance, especially when stable keypoints appear near cell borders. Instead, our method explicitly handles the detection and association of keypoints across cell-boundaries, thereby improving the overall matching performance. In Self-Improving Visual Odometry (DeTone:etal:arXiv2018), the authors first estimate 2D keypoints and descriptors for each image in a monocular sequence using a convolutional network, and then use a bundle adjustment method to classify the stability of those keypoints based on re-projection error, which serves as a supervisory signal to re-train the model. Their method, however, is not fully differentiable, so it cannot be trained in an end-to-end manner. Instead, we incorporate an end-to-end differentiable and neurally-guided outlier-rejection mechanism (IO-Net) that explicitly generates an additional proxy supervisory signal for the matching input keypoint-pairs identified by our KeyPointNet architecture. This allows the keypoint descriptions to be further refined as a result of the outlier-rejection network predictions occurring during the two-view matching stage.
3 Self-supervised KeyPoint Learning
In this work, we aim to regress a function which takes as input an image and outputs keypoints, descriptors, and scores. Specifically, we define $K: I \to \{\mathbf{p}, \mathbf{f}, \mathbf{s}\}$, with input image $I \in \mathbb{R}^{3 \times H \times W}$, and output keypoints $\mathbf{p} = \{[u, v]\} \in \mathbb{R}^{2 \times N}$, descriptors $\mathbf{f} \in \mathbb{R}^{256 \times N}$ and keypoint scores $\mathbf{s} \in \mathbb{R}^{N}$; $N$ represents the total number of keypoints extracted, which varies with the input image resolution, as defined in the following sections. We note that throughout this paper we use $\mathbf{p}$ to refer to the set of keypoints extracted from an image, while $p$ is used to refer to a single keypoint.
Following the work of christiansen2019unsuperpoint, we train the proposed learning framework in a self-supervised fashion by receiving as input a source image $I_s$ such that $K(I_s) = \{\mathbf{p}_s, \mathbf{f}_s, \mathbf{s}_s\}$ and a target image $I_t$ such that $K(I_t) = \{\mathbf{p}_t, \mathbf{f}_t, \mathbf{s}_t\}$. Images $I_s$ and $I_t$ are related through a known homography transformation $\mathbf{H}$ which warps a pixel from the source image and maps it into the target image. We define $\mathbf{p}_t^* = \{[u_i^*, v_i^*]\} = \mathbf{H}(\mathbf{p}_s)$, i.e. the corresponding locations of source keypoints $\mathbf{p}_s$ after being warped into the target frame.
Inspired by recent advances in Neural Guided Sample Consensus methods (brachmann2019neural), we define a second function $C$ which takes as input point-pairs along with associated weights according to a distance metric, and outputs the likelihood that each point-pair belongs to an inlier set of matches. Formally, we define $C: \mathbb{R}^{5 \times N} \to \mathbb{R}^{N}$, where each of the $N$ inputs concatenates a point-pair with its descriptor distance, as a mapping which computes the probability that a point-pair belongs to an inlier set. We note that $C$ is only used at training time to choose an optimal set of consistent inliers from possible corresponding point pairs and to encourage the gradient flow through consistent point-pairs.
An overview of our method is presented in Figure 1. We define the model $K$ parametrized by $\theta_K$ as an encoder-decoder style network. The encoder consists of 4 VGG-style blocks stacked to reduce the resolution of the image from $H \times W$ to $H/8 \times W/8$. This allows an efficient prediction of keypoint locations and descriptors. In this low resolution embedding space, each pixel corresponds to an $8 \times 8$ cell in the original image. The decoder consists of 3 separate heads for the keypoints, descriptors and scores respectively. Thus for an image of input size $H \times W$, the total number of keypoints regressed is $(H \times W)/64$, each with a corresponding score and descriptor. For every convolutional layer except the final one, batch normalization is applied with leakyReLU activation. A detailed description of our network architecture can be seen in Figure 2. The IO-Net is a 1D CNN parametrized by $\theta_{IO}$, for which we closely follow the structure from brachmann2019neural, with 4 default-setting residual blocks and with the original activation function of the final layer removed. A more detailed description of these networks can be found in the Appendix (Tables 6 and 7).
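As a quick check of the keypoint count implied by the cell structure above, the following minimal sketch counts one regressed keypoint per 8x8 cell for the two image resolutions used later in the evaluation. The function name and the requirement that image sides be multiples of the cell size are assumptions made for illustration.

```python
def num_keypoints(height: int, width: int, cell_size: int = 8) -> int:
    """Number of cells, i.e. regressed keypoints, for a given image size."""
    if height % cell_size or width % cell_size:
        # Assumption for this sketch: sides are multiples of the cell size.
        raise ValueError("image sides must be multiples of the cell size")
    return (height // cell_size) * (width // cell_size)


low_res = num_keypoints(240, 320)   # 30 x 40 cells
high_res = num_keypoints(480, 640)  # 60 x 80 cells
print(low_res, high_res)
```

Only a subset of these candidates is kept at test time, selected by score.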
3.1 KeyPointNet: Neural Keypoint Detector and Descriptor Learning
Detector Learning. Following christiansen2019unsuperpoint, the keypoint head outputs a location relative to the grid in which it operates for each pixel in the encoder embedding: $[u_i', v_i']$. The corresponding input image resolution coordinates $[u_i, v_i]$ are computed taking into account the grid’s position in the encoder embedding. We compute the corresponding keypoint location $[u_i^*, v_i^*]$ in the target frame after warping via the known homography $\mathbf{H}$. For each warped keypoint, the closest corresponding keypoint in the target frame is associated based on Euclidean distance. We discard keypoint pairs for which the distance is larger than a threshold $\epsilon_{uv}$. The associated keypoints in the target frame are denoted by $\hat{\mathbf{p}}_t = \{[\hat{u}_t, \hat{v}_t]\}$. We optimize keypoint locations using the following self-supervised loss formulation, which enforces keypoint location consistency across different views of the same scene:
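To make the association and consistency steps described above concrete, the following is a minimal sketch in plain Python. It assumes that the location loss is the sum of Euclidean distances between warped source keypoints and their associated target keypoints; the function names, the example homography and the threshold value are illustrative assumptions, not values from the paper.

```python
import math


def warp_point(H, u, v):
    """Apply a 3x3 homography (nested lists) to a pixel location."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w


def associate(source_pts, target_pts, H, eps_uv):
    """Pair each warped source keypoint with its closest target keypoint,
    discarding pairs farther apart than eps_uv."""
    pairs = []
    for i, (u, v) in enumerate(source_pts):
        wu, wv = warp_point(H, u, v)
        candidates = [(j, math.hypot(wu - tu, wv - tv))
                      for j, (tu, tv) in enumerate(target_pts)]
        j, dist = min(candidates, key=lambda c: c[1])
        if dist <= eps_uv:
            pairs.append((i, j, dist))
    return pairs


def location_loss(pairs):
    """Assumed form: sum of distances over the associated pairs."""
    return sum(dist for _, _, dist in pairs)


# Illustrative example: a pure translation by 2 pixels along u.
H = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source = [(0.0, 0.0), (10.0, 10.0)]
target = [(2.0, 1.0), (50.0, 50.0)]
pairs = associate(source, target, H, eps_uv=4.0)
loss = location_loss(pairs)
print(pairs, loss)
```

In this example the second source keypoint has no target keypoint within the threshold, so it does not contribute to the loss.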
As described earlier, the method of christiansen2019unsuperpoint does not allow the predicted keypoint locations for each cell to cross cell-boundaries. Instead, we propose a novel formulation which allows us to effectively aggregate keypoints across cell boundaries. Specifically, we map the relative cell coordinates to input image coordinates via the following function:
with $\sigma_2 = 8$, i.e. the cell size, and $\sigma_1$ a ratio relative to the cell size. $[\mathrm{row}_i^{center}, \mathrm{col}_i^{center}]$ are the center coordinates of each cell. By setting $\sigma_1$ larger than $1$, we allow the network to predict keypoint locations across cell borders. Our formulation predicts keypoint locations with respect to the cell center, and allows the predicted keypoints to drift across cell boundaries. We illustrate this in Figure 3, where keypoints predicted outside their cell boundary can be matched effectively, which matters most for keypoints near cell borders. In the ablation study (Section 4.3), we quantify the effect of this contribution and show that it significantly improves the performance of our keypoint detector.
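The effect of the ratio on cross-border prediction can be sketched as follows. The mapping used here, cell center plus relative offset times ratio times half the cell extent, is an assumed form consistent with the description above, with relative offsets taken to lie in (-1, 1); the function name and the ratio values in the example are illustrative only.

```python
def cell_to_image_coords(row, col, dv, du, cell_size=8, cross_ratio=2.0):
    """Map a cell-relative offset (dv, du) in (-1, 1) to image coordinates.

    Assumed form: center + offset * cross_ratio * (cell_size - 1) / 2.
    """
    half_extent = (cell_size - 1) / 2.0
    center_v = row * cell_size + half_extent
    center_u = col * cell_size + half_extent
    v = center_v + dv * cross_ratio * half_extent
    u = center_u + du * cross_ratio * half_extent
    return v, u


# With a ratio of 1 the largest offset stays on the border of cell (0, 0).
_, u_inside = cell_to_image_coords(0, 0, 0.0, 1.0, cross_ratio=1.0)
# With a ratio of 2 the same offset lands in the neighbouring cell.
_, u_across = cell_to_image_coords(0, 0, 0.0, 1.0, cross_ratio=2.0)
print(u_inside, u_across, int(u_across // 8))
```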
Descriptor Learning. As recently shown by pillai2018superdepth and guizilini2019packnet, subpixel convolutions via pixel-shuffle operations (shi2016real) can greatly improve the quality of dense predictions, especially in the self-supervised regime. In this work, we include a fast upsampling step before regressing the descriptor, which promotes the capture of finer details in a higher resolution grid. The architectural diagram of the descriptor head is shown in Figure 2. In the ablative analysis (Section 4.3), we show that the addition of this step greatly improves the quality of our descriptors.
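The pixel-shuffle rearrangement behind the upsampling step can be illustrated without any deep learning library. The sketch below operates on nested lists and is intended to mirror the channel layout of `torch.nn.PixelShuffle`; it is a didactic reimplementation under that assumption, not the code used in the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Each output pixel is read from one of the r*r input channels.
                src_channel = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src_channel][i // r][j // r]
    return out


# Four 1x1 channels become one 2x2 channel.
tiny = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)

# 512 channels on a 2x3 grid become 128 channels on a 4x6 grid.
features = [[[c] * 3 for _ in range(2)] for c in range(512)]
upsampled = pixel_shuffle(features, 2)
print(tiny, len(upsampled), len(upsampled[0]), len(upsampled[0][0]))
```

With an upscale factor of 2, the channel count drops by a factor of 4 while each spatial side doubles, which is how a 512-channel map at 1/8 resolution becomes a 128-channel map at 1/4 resolution.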
We employ metric learning for training the descriptors. While the contrastive loss (Hadsell2006contrastive) is commonly used in the literature for this task, we propose to use a per-pixel triplet loss (schroff2015triplet) with nested hardest sample mining as described in tang2018geometric to train the descriptor. Recall that each keypoint $p_i \in \mathbf{p}_s$ in the source image has an associated descriptor $f_i$, an anchor descriptor, which we obtain by sampling the appropriate location in the dense descriptor map $\mathbf{f}_s$ as described in detone2018superpoint. The associated descriptor $f_{i,+}^*$ in the target frame, a positive descriptor, is obtained by sampling the appropriate location in the target descriptor map $\mathbf{f}_t$ based on the warped keypoint position $p_i^*$. The nested triplet loss is therefore defined as:
which minimizes the distance between the anchor and positive descriptors, and maximizes the distance between the anchor and a negative sample $f_{i,-}^*$. We pick the negative sample which is the closest in the descriptor space that is not a positive sample. Any sample other than the true match can be used as the negative pair for the anchor, but the hardest negative sample will contribute the most to the loss function, thereby accelerating the metric learning. Here, $m$ denotes the distance margin enforcing how far dissimilar descriptors should be pushed away in the descriptor space.
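The hardest-negative selection described above can be sketched in a few lines. The hinge form `max(0, d_pos - d_neg + margin)` summed over anchors is the standard triplet loss and is assumed here; the relaxation criterion for negative mining mentioned in the implementation details is not modelled, and the margin value in the example is arbitrary.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss_hardest_negative(anchors, targets, margin):
    """anchors[i] is matched to targets[i]; every other target descriptor is
    a negative candidate, and the closest one (the hardest) is used.
    Requires at least two target descriptors."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        d_pos = l2(anchor, targets[i])
        d_neg = min(l2(anchor, t) for j, t in enumerate(targets) if j != i)
        total += max(0.0, d_pos - d_neg + margin)
    return total


# Well separated descriptors: no triplet violates the margin.
good = triplet_loss_hardest_negative(
    [(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.1), (1.0, 0.0)], margin=0.2)
# Swapped matches: each anchor is closer to a negative than to its positive.
bad = triplet_loss_hardest_negative(
    [(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (0.0, 0.0)], margin=0.5)
print(good, bad)
```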
Score Learning. The third head of the decoder is responsible for outputting the score associated with each descriptor. At test time, this value will indicate the most reliable keypoints from which a subset will be selected. Thus the objective of the score loss $\mathcal{L}_{score}$ is two-fold: (i) we want to ensure that feature-pairs have consistent scores, and (ii) the network should learn that good keypoints are the ones with low feature point distance. Following christiansen2019unsuperpoint we achieve this objective by minimizing the squared distance between scores for each keypoint-pair, and minimizing or maximizing the average score of a keypoint-pair if the distance between the paired keypoints is greater or less than the average distance respectively:
Here, $s_i$ and $\hat{s}_i$ are the scores of the source and target frames respectively, and $\bar{d}$ is the average reprojection error of associated points in the current frame, $\bar{d} = \frac{1}{L} \sum_{i}^{L} d(p_i, \hat{p}_i)$, with $d$ being the feature distance in 2D Euclidean space and $L$ being the total number of feature pairs.
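One form of the score loss that matches the verbal description above is, per keypoint pair, the mean score multiplied by the deviation of the pair distance from the average distance, plus the squared score difference. The sketch below implements this assumed form; the function and argument names are illustrative.

```python
def score_loss(src_scores, tgt_scores, distances):
    """Assumed form: sum_i [ (s_i + s_hat_i) / 2 * (d_i - d_mean)
                             + (s_i - s_hat_i) ** 2 ]."""
    mean_distance = sum(distances) / len(distances)
    total = 0.0
    for s, s_hat, d in zip(src_scores, tgt_scores, distances):
        total += 0.5 * (s + s_hat) * (d - mean_distance) + (s - s_hat) ** 2
    return total


distances = [0.0, 2.0]
# High scores on the well-localized pair are rewarded (lower loss).
rewarded = score_loss([1.0, 0.0], [1.0, 0.0], distances)
# High scores on the poorly localized pair are penalized (higher loss).
penalized = score_loss([0.0, 1.0], [0.0, 1.0], distances)
print(rewarded, penalized)
```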
3.2 IO-Net: Neural Outlier Rejection as an Auxiliary Task
Keypoint and descriptor learning is a task which is tightly coupled with outlier rejection. In this work, we propose to use the latter as a proxy task to supervise the former. Specifically, we associate keypoints from the source and target images based on descriptor distance. In addition, only keypoints with the lowest K predicted scores are used for training. Similar to the hardest sample mining, this approach accelerates convergence and encourages the generation of a richer supervisory signal from the outlier rejection loss. To disambiguate from the earlier association of keypoint pairs based on reprojected distance defined in Section 3.1, we denote the distance metric by $x$ and specify that it refers to Euclidean distance in descriptor space. The resulting keypoint pairs along with the computed distance are passed through our proposed IO-Net which outputs the probability that each pair is an inlier or outlier. Formally, we define the loss at this step as:
where $r$ is the output of the IO-Net, while $\epsilon_{uv}$ is the same Euclidean distance threshold used in Section 3.1. Unlike a normal classifier, we also back-propagate the gradients to the input sample, i.e., $\{\mathbf{p}_s, \mathbf{f}_s\}$, thus allowing us to optimize both the location and descriptor for these associated point-pairs in an end-to-end differentiable manner.
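The data flow around the IO-Net can be sketched as follows: the K lowest-scoring source keypoints are matched by descriptor distance, each match is packed into a 5-dimensional input, and the targets come from thresholding the reprojection distance. The target convention (+1 for inliers, -1 for outliers) and the squared-error form are assumptions of this sketch, as are all names and example values; the network itself is not modelled.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_io_inputs(src_pts, src_desc, src_scores, tgt_pts, tgt_desc, k):
    """Rows of [u_s, v_s, u_t, v_t, descriptor_distance] for the k
    lowest-scoring source keypoints and their nearest target descriptor."""
    order = sorted(range(len(src_scores)), key=lambda i: src_scores[i])[:k]
    rows = []
    for i in order:
        dists = [l2(src_desc[i], f) for f in tgt_desc]
        j = min(range(len(dists)), key=lambda n: dists[n])
        rows.append([*src_pts[i], *tgt_pts[j], dists[j]])
    return rows


def io_targets(reproj_dists, eps_uv):
    """Assumed convention: +1 for inliers (within eps_uv), -1 otherwise."""
    return [1.0 if d < eps_uv else -1.0 for d in reproj_dists]


def io_loss(outputs, reproj_dists, eps_uv):
    """Assumed form: sum_i 0.5 * (r_i - target_i) ** 2."""
    targets = io_targets(reproj_dists, eps_uv)
    return sum(0.5 * (r - t) ** 2 for r, t in zip(outputs, targets))


rows = build_io_inputs(
    src_pts=[(0, 0), (8, 8), (16, 16)],
    src_desc=[(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    src_scores=[0.9, 0.1, 0.5],
    tgt_pts=[(1, 1), (9, 9)],
    tgt_desc=[(0.0, 1.0), (1.0, 0.0)],
    k=2,
)
print(rows, io_loss([0.0, 0.0], [0.5, 7.0], eps_uv=3.0))
```

In a differentiable framework the same loss would also propagate gradients to the keypoint locations and descriptors that form the rows, which is the mechanism the text relies on.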
The outlier rejection task is related to neural network based RANSAC (brachmann2019neural) in terms of the final goal. In our case, since the ground truth homography transform is known, the random sampling and consensus steps are not required. Intuitively, this can be seen as a special case where only one hypothesis is sampled, i.e. the ground truth. Therefore, the task is simplified to directly classifying the outliers among the input point-pairs. A second difference with respect to existing neural RANSAC methods arises from the way the outlier network is used. Specifically, we use the outlier network to explicitly generate an additional proxy supervisory signal for the input point-pairs, as opposed to rejecting outliers.
The final training objective we optimize is defined as:
where $\alpha$, $\beta$ and $\lambda$ are weights balancing the different losses.
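Since the four loss terms introduced above are combined with three balancing weights, one form consistent with the text is a weighted sum in which the outlier-rejection term carries unit weight. The symbol names and the choice of which term is left unweighted are assumptions of this sketch.

```latex
\mathcal{L} = \alpha \, \mathcal{L}_{loc}
            + \beta \, \mathcal{L}_{desc}
            + \lambda \, \mathcal{L}_{score}
            + \mathcal{L}_{IO}
```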
4 Experimental Results
We train our method using the COCO dataset (lin2014microsoft), specifically the 2017 version which contains training images. Note that we solely use the images, without any training labels, as our method is completely self-supervised. Training on COCO allows us to compare against SuperPoint (detone2018superpoint) and UnsuperPoint (christiansen2019unsuperpoint), which use the same data for training. We evaluate our method on image sequences from the HPatches dataset (balntas2017hpatches), which contains 57 illumination and 59 viewpoint sequences. Each sequence consists of a reference image and 5 target images with varying photometric and geometric changes, for a total of 580 image pairs. In Table 2 and Table 3 we report results averaged over the whole dataset. For a fair comparison, we evaluate results generated without applying Non-Maxima Suppression (NMS).
To evaluate our method and compare with the state-of-the-art, we follow the same procedure as described in (detone2018superpoint; christiansen2019unsuperpoint) and report the following metrics: Repeatability, Localization Error, Matching Score (M.Score) and Homography Accuracy. For the Homography Accuracy we use thresholds of 1, 3 and 5 pixels respectively (denoted as Cor-1, Cor-3 and Cor-5 in Table 3). The definitions of these metrics can be found in the appendix.
4.2 Implementation details
We implement our networks in PyTorch (paszke2017automatic) and we use the ADAM (kingma2014adam) optimizer. We set the learning rate to
and train for 50 epochs with a batch size of 8, halving the learning rate once after 40 epochs of training. The weights of both networks are randomly initialized. We set the weights for the total training loss as defined in Equation (6) to , , and . These weights are selected to balance the scales of the different terms. We set in order to avoid border effects while maintaining distributed keypoints over the image, as described in Section 3.1. The triplet loss margin is set to . The relaxation criterion for negative sample mining is set to . When training the outlier rejection network described in Section 3.2, we set , i.e. we choose the lowest scoring pairs to train on.
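The step schedule described above (50 epochs, with the learning rate halved once after 40 epochs) can be written as a small helper. The base learning rate is left as a parameter, the value used in the example is a placeholder rather than the paper's setting, and zero-indexed epochs are an assumption of this sketch.

```python
def learning_rate(base_lr: float, epoch: int, halve_after: int = 40) -> float:
    """Learning rate for a zero-indexed epoch, halved once after 40 epochs."""
    return base_lr * 0.5 if epoch >= halve_after else base_lr


# Placeholder base rate of 1.0, used only to show the shape of the schedule.
schedule = [learning_rate(1.0, epoch) for epoch in range(50)]
print(schedule[39], schedule[40])
```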
We perform the same types of homography adaptation operations as detone2018superpoint: crop, translation, scale, rotation, and symmetric perspective transform. After cropping the image with (relative to the original image resolution), the amplitudes for other transforms are sampled uniformly from a pre-defined range: scale , rotation and perspective . Following christiansen2019unsuperpoint, we then apply non-spatial augmentation separately on the source and target frames to allow the network to learn illumination invariance. We add random per-pixel Gaussian noise with magnitude (for image intensity normalized to ) and Gaussian blur with kernel sizes together with color augmentation in brightness , contrast , saturation and hue . In addition, we randomly shuffle the color channels and convert color image to gray with probability .
4.3 Ablative study
|Method||Repeat.||Loc. Error||Cor-1||Cor-3||Cor-5||M.Score|
|V0 - Baseline||0.633||1.044||0.503||0.796||0.868||0.491|
|V1 - Cross||0.689||0.935||0.491||0.805||0.874||0.537|
|V2 - CrossUpsampling||0.686||0.918||0.579||0.866||0.916||0.544|
|V3 - IO-Net||0.685||0.885||0.602||0.836||0.886||0.520|
|V4 - Proposed||0.686||0.890||0.591||0.867||0.912||0.544|
In this section, we evaluate five different variants of our method. All experiments described in this section are performed on images of resolution 240x320. We first define V0-V2 as (i) V0: baseline version with cross border detection and descriptor upsampling disabled; (ii) V1: V0 with cross border detection enabled; (iii) V2: V1 with descriptor upsampling enabled. These three variants are trained without neural outlier rejection, while the other two variants are (iv) V3: V2 with the descriptor trained using only the IO-Net loss, without the triplet descriptor loss; and finally (v) V4 - Proposed: V3 together with the triplet descriptor loss. The evaluation of these methods is shown in Table 1. We notice that by avoiding the border effect described in Section 3.1, V1 achieves a clear improvement in Repeatability as well as in the Matching Score. Adding the descriptor upsampling step improves the matching performance greatly without degrading the Repeatability, as can be seen by the numbers reported under V2. Importantly, even though V3 is trained without the descriptor loss defined in Section 3.1, we note a further improvement in homography accuracy at the strictest threshold (Cor-1). This validates our hypothesis that the proxy task of inlier-outlier prediction can generate supervision for the original task of keypoint and descriptor learning. Finally, by adding the triplet loss, our model reported under V4 - Proposed achieves good performance which is within error-margin of the best-performing model variant, while generalizing well across all performance metrics, including repeatability, localization error, homography accuracy and matching score.
To quantify our runtime performance, we evaluated our model on a desktop with an Nvidia Titan Xp GPU on images of 240x320 resolution. We recorded FPS and FPS when running our model with and without the descriptor upsampling step.
4.4 Performance Evaluation
|Method||240x320, 300 points||480 x 640, 1000 points|
In this section, we compare the performance of our method with the state-of-the-art, as well as with traditional methods, on images of resolutions 240x320 and 480x640 respectively. For the results obtained using traditional features as well as for LF-Net (Ono:etal:NIPS2018) and SuperPoint (detone2018superpoint) we report the same numbers as computed by (christiansen2019unsuperpoint). During testing, keypoints are extracted in each view, keeping the top 300 points for the lower resolution and 1000 points for the higher resolution from the score map. The evaluation of keypoint detection is shown in Table 2. For Repeatability, our method notably outperforms other methods and is not significantly affected when evaluated with different image resolutions. For the Localization Error, UnsuperPoint performs better on lower resolution images while our method performs better at the higher resolution.
The homography estimation and matching performance results are shown in Table 3. In general, self-supervised learning methods provide keypoints with higher matching score and better homography estimation for the Cor-3 and Cor-5 metrics, as compared to traditional handcrafted features (e.g. SIFT). For the more stringent threshold Cor-1, SIFT performs the best; however, our method outperforms all other learning based methods. As shown in Table 1, our best performing model for this metric is trained using only supervision from the outlier rejection network, without the triplet loss. This indicates that, even though fully self-supervised, this auxiliary task can generate high quality supervisory signals for descriptor training. We show additional qualitative and quantitative results of our method in the appendix.
5 Conclusion
In this paper, we proposed a new learning scheme for training a keypoint detector and associated descriptor in a self-supervised fashion. Different from existing methods, we used a proxy network to generate an extra supervisory signal from a task tightly connected to keypoint extraction: outlier rejection. We show that even without an explicit keypoint descriptor loss in the IO-Net, the supervisory signal from the auxiliary task can be effectively propagated back to the keypoint network to generate distinguishable descriptors. Using the combination of the proposed method and the improved network structure, we achieve competitive results in the homography estimation benchmark.
Appendix A Homography Estimation Evaluation Metric
We evaluated our results using the same metrics as detone2018superpoint. The Repeatability, Localization Error and Matching Score are generated with a correctness distance threshold of . All the metrics are evaluated from both view points for each image pair.
Repeatability. The repeatability is the ratio of correctly associated points after warping into the target frame. The association is performed by selecting the closest in-view point and comparing the distance with the correctness distance threshold.
Localization Error. The localization error is calculated by averaging the distance between warped points and their associated points.
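The two detection metrics above can be sketched for a single direction (source warped into target) as follows. The inclusive threshold comparison, the function name and the example values are assumptions for illustration; the full protocol evaluates both directions for each image pair.

```python
import math


def repeatability_and_localization_error(warped_pts, target_pts, threshold):
    """Fraction of warped points with a target point within the threshold,
    and the mean distance over those associated points."""
    matched = []
    for u, v in warped_pts:
        nearest = min(math.hypot(u - tu, v - tv) for tu, tv in target_pts)
        if nearest <= threshold:
            matched.append(nearest)
    repeatability = len(matched) / len(warped_pts)
    loc_error = sum(matched) / len(matched) if matched else float("nan")
    return repeatability, loc_error


rep, loc = repeatability_and_localization_error(
    warped_pts=[(0.0, 0.0), (10.0, 10.0)],
    target_pts=[(0.0, 1.0), (20.0, 20.0)],
    threshold=3.0,
)
print(rep, loc)
```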
Matching Score (M.Score). The matching score is the success rate of retrieving correctly associated points through nearest neighbour matching using descriptors.
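A one-directional sketch of the matching score is given below: each source descriptor retrieves its nearest target descriptor, and the retrieval counts as correct when the corresponding keypoints are within the correctness threshold after warping. Names, the inclusive comparison and the example values are illustrative assumptions.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matching_score(src_desc, tgt_desc, warped_src_pts, tgt_pts, threshold):
    """Success rate of nearest-neighbour descriptor retrieval."""
    correct = 0
    for i, descriptor in enumerate(src_desc):
        j = min(range(len(tgt_desc)), key=lambda n: l2(descriptor, tgt_desc[n]))
        (u, v), (tu, tv) = warped_src_pts[i], tgt_pts[j]
        if math.hypot(u - tu, v - tv) <= threshold:
            correct += 1
    return correct / len(src_desc)


m_score = matching_score(
    src_desc=[(1.0, 0.0), (0.0, 1.0)],
    tgt_desc=[(0.0, 1.0), (1.0, 0.0)],
    warped_src_pts=[(5.0, 5.0), (20.0, 20.0)],
    tgt_pts=[(20.0, 21.0), (40.0, 40.0)],
    threshold=3.0,
)
print(m_score)
```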
Homography Accuracy. The homography accuracy is the success rate of correctly estimating the homographies. The mean distance between the four corners of the image plane warped using the estimated and the ground-truth homography matrices is compared with distance thresholds of 1, 3 and 5 pixels. To estimate the homography, we perform reciprocal descriptor matching, and we use OpenCV’s method with RANSAC, maximum number of iterations, confidence threshold of and error threshold .
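The corner-based correctness test can be sketched as follows, given an already estimated homography. Taking the corners at pixel coordinates 0 and size minus 1 is an assumption of this sketch, as are the function names; the estimation step itself (reciprocal matching followed by RANSAC) is not reproduced here.

```python
import math


def warp_point(H, u, v):
    """Apply a 3x3 homography (nested lists) to a pixel location."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w


def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance between the image corners warped by both homographies."""
    corners = [(0, 0), (width - 1, 0), (0, height - 1), (width - 1, height - 1)]
    total = 0.0
    for u, v in corners:
        eu, ev = warp_point(H_est, u, v)
        gu, gv = warp_point(H_gt, u, v)
        total += math.hypot(eu - gu, ev - gv)
    return total / len(corners)


def homography_correct(H_est, H_gt, width, height, threshold):
    return mean_corner_error(H_est, H_gt, width, height) <= threshold


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
error = mean_corner_error(shifted, identity, width=320, height=240)
print(error, [homography_correct(shifted, identity, 320, 240, t) for t in (1, 3, 5)])
```

Averaging this boolean over all image pairs at each threshold gives the Cor-1, Cor-3 and Cor-5 figures.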
Appendix B Detailed qualitative and quantitative analysis on HPatches
To capture the variance induced by the RANSAC component during evaluation we perform additional experiments summarized in Table 4, where each entry reports the mean and standard deviation across 10 runs with varying RANSAC seeds. We notice better homography performance on the illumination subset than on the viewpoint subset. This is to be expected, as the viewpoint subset contains image pairs with extreme rotation, which are problematic for our fully convolutional method.
We also evaluate our method as well as SIFT and ORB on the graffiti, bark and boat sequences of the HPatches dataset and summarize our results in Table 5, again reporting averaged results over 10 runs. We note that our method consistently outperforms ORB. Our method performs worse than SIFT (which is more robust to extreme rotations) on the bark and boat sequences, but we obtain better results on the graffiti sequence.
Figure 4 shows examples of successful matching under strong illumination, rotation and perspective transformation. Additionally, we show our matches on pairs of images from the challenging graffiti, bark and boat sequences of HPatches in Figures 5, 6, and 7. Specifically, the top row in each figure shows our results, while the bottom row shows SIFT. The left sub-figure on each row shows images (1,2) of each sequence, while the right sub-figure shows images (1,6). We note that on images (1,2) our results are comparable to SIFT, while on images (1,6) we get fewer matches. Despite the extreme perspective change, our method is able to successfully match features on images (1,6) of the boat sequence.
Appendix C Architecture Diagram
|Layer Description||K||Output Tensor Dim.|
|#0||Input RGB image||3×H×W|
|#1||Conv2d + BatchNorm + LReLU||3||32×H×W|
|#2||Conv2d + BatchNorm + LReLU + Dropout||3||32×H×W|
|#3||Max. Pooling (×1/2)||3||32×H/2×W/2|
|#4||Conv2d + BatchNorm + LReLU||3||64×H/2×W/2|
|#5||Conv2d + BatchNorm + LReLU + Dropout||3||64×H/2×W/2|
|#6||Max. Pooling (×1/2)||3||64×H/4×W/4|
|#7||Conv2d + BatchNorm + LReLU||3||128×H/4×W/4|
|#8||Conv2d + BatchNorm + LReLU + Dropout||3||128×H/4×W/4|
|#9||Max. Pooling (×1/2)||3||128×H/8×W/8|
|#10||Conv2d + BatchNorm + LReLU||3||256×H/8×W/8|
|#11||Conv2d + BatchNorm + LReLU + Dropout||3||256×H/8×W/8|
|#12||Conv2d + BatchNorm + Dropout (#11)||3||256×H/8×W/8|
|#13||Conv2d + Sigmoid||3||1×H/8×W/8|
|#14||Conv2d + BatchNorm + Dropout (#11)||3||256×H/8×W/8|
|#15||Conv2d + Tan. Hyperbolic||3||2×H/8×W/8|
|#16||Conv2d + BatchNorm + Dropout (#11)||3||256×H/8×W/8|
|#17||Conv2d + BatchNorm||3||512×H/8×W/8|
|#18||Pixel Shuffle (×2)||3||128×H/4×W/4|
|#19||Conv2d + BatchNorm (#8, #18)||3||256×H/4×W/4|
|Layer Description||K||Output Tensor Dim.|
Conv1d + ReLU
|#3||ResidualBlock (#2 + #1)||-||128×N|
|#4||ResidualBlock (#3 + #2)||-||128×N|
|#5||ResidualBlock (#4 + #3)||-||128×N|
|Conv1d + InstNorm + BatchNorm + ReLU||1||128×N|
|Conv1d + InstNorm + BatchNorm + ReLU||1||128×N|
IO-Net diagram, composed of 4 residual blocks. The network receives as input a series of 5-dimensional vectors, each consisting of a keypoint pair and its descriptor distance, and outputs a binary inlier-outlier classification. Numbers in parentheses indicate input layers, and + denotes feature addition.