Official Pytorch implementation of GLAMpoints
We introduce a novel CNN-based feature point detector - GLAMpoints - learned in a semi-supervised manner. Our detector extracts repeatable, stable interest points with a dense coverage, specifically designed to maximize the correct matching in a specific domain, which is in contrast to conventional techniques that optimize indirect metrics. In this paper, we apply our method on challenging retinal slit lamp images, for which classical detectors yield unsatisfactory results due to low image quality and insufficient amount of low-level features. We show that GLAMpoints significantly outperforms classical detectors as well as state-of-the-art CNN-based methods in matching and registration quality for retinal images.
Unofficial PyTorch implementation of GLAMpoints: Greedily Learned Accurate Match points
Digital fundus images of the human retina are widely used to diagnose a variety of eye diseases, such as Diabetic Retinopathy (DR), glaucoma, and Age-related Macular Degeneration (AMD) [42, 52]. For retinal images acquired during the same session and presenting small overlaps, image registration can be used to create mosaics depicting larger areas of the retina. Through image mosaicking, ophthalmologists can display the retina in one large picture, which is helpful during diagnosis and treatment planning. Besides, mosaicking of retinal images taken at different time points has been shown to be important for monitoring the progression or identification of eye diseases. More importantly, fundus image registration has been explored in eye laser treatment for DR. It allows real-time tracking of the vessels during surgical operations to ensure accurate application of the laser on the retina and minimal damage to the healthy tissues.
[Figure: a) SIFT, b) GLAMpoints]
Mosaicking usually relies on extracting repeatable interest points from the images, matching the correspondences and searching for transformations relating them. As a result, the keypoint detection is the first and the most crucial stage of this pipeline, as it conditions all further steps and therefore the success of the registration.
At the same time, classical feature detectors are general-purpose and manually optimized for outdoor, in-focus, low-noise images with sharp edges and corners. They usually fail to work with medical images, which can be distorted, noisy, have no guarantee of focus and depict soft tissue with no sharp edges (see Figure 3). Traditional methods perform sub-optimally on such images, making more sophisticated optimization necessary at a later step in the registration, such as Random Sampling Consensus (RanSaC), bundle adjustment and Simultaneous Localization and Mapping (SLAM) techniques. In these cases, supervised learning methods for keypoint detection fail or are not applicable, due to missing ground truths for feature points.
In this paper we present a method for learning feature points in a semi-supervised manner. Learned feature detectors were shown to outperform the heuristics-based methods, but they are usually optimized for repeatability, which is only a proxy for the matching quality, and as a result they may underperform during the final matching. In contrast, our keypoints - GLAMpoints - are trained for the final matching accuracy, and when associated with the Scale-Invariant Feature Transform (SIFT) descriptor they outperform the state of the art in matching performance and registration quality on retinal images. As shown in Figure 1, GLAMpoints produces significantly more correct matches than the SIFT detector.
Registration based on feature points is inherently non-differentiable due to point matching and transformation estimations. We take inspiration from the loss formulation in Reinforcement Learning (RL), using a reward to compute the suitability of the detected keypoints based on the final registration quality. This makes it possible to use the key performance measure, matching power, to directly train a Convolutional Neural Network (CNN). Our contribution is therefore a formulation for keypoint detection that is directly optimized for the final matching performance in an image domain.
Existing registration algorithms can be classified as area-based and feature-based approaches. The former typically rely on a similarity metric such as cross-correlation, mutual information [38, 31] or phase correlation  to compare the intensity patterns of an image pair and estimate the transformation. However, in the case of changes in illumination or small overlapping areas, the application of area-based approaches becomes challenging or infeasible. Conversely, feature-based methods extract corresponding points on pairs of images along with a set of features and search for a transformation that minimizes the distance between the detected key points. Compared with area-based registration techniques, they are more robust to changes of intensity, scale and rotation and therefore, they are considered more appropriate for problems such as medical image registration.
Typically, feature extraction and matching of two images comprise four steps: detection of interest points, computing a feature descriptor for each of them, matching of corresponding keypoints and estimation of a transformation between the images using the matches. As can be seen, the detection step influences every further step and is therefore crucial for a successful registration. It requires a high image coverage and stable key points in low-contrast images.
In the literature, local interest point detectors have been thoroughly studied. SIFT is probably the best-known detector/descriptor in computer vision. It computes corners and blobs on different scales to achieve scale invariance and extracts descriptors using the local gradients. Speeded-Up Robust Features (SURF) is a faster alternative, using Haar filters and integral images, while KAZE exploits non-linear scale space for more accurate keypoint detection.
In the field of fundus imaging, a widely used technique relies on vascular trees and branch point analysis [33, 25]. However, accurate segmentation of the vascular trees is challenging and registration often fails on images with few vessels. Alternative registration techniques are based on matching repeatable local features; Chen detected Harris corners on low quality multi-modal retinal images and assigned them a partial intensity invariant feature (Harris-PIIFD) descriptor. They achieved good results on low quality images with an overlapping area greater than 30%, but the method is characterised by low repeatability. Wang used SURF features to increase the repeatability and introduced a new method for point matching to reject a large number of outliers, but the success rate drops significantly when the overlapping area diminishes below 50%. Cattin also demonstrated that SURF can be efficiently used to create mosaics of retina images even for cases with no discernible vascularisation. However, this technique only appeared successful in the case of highly self-similar images. The D-saddle detector/descriptor was shown to outperform the previous methods in terms of rate of successful registration on the Fundus Image Registration (FIRE) Dataset, enabling the detection of interest points on low quality regions.
Recently, with the advent of deep learning, learned detectors based on CNN architectures were shown to outperform state-of-the-art computer vision detectors [22, 20, 50, 37, 10]. Learned Invariant Feature Transform (LIFT) uses patches to train a fully differentiable deep CNN for interest point detection, orientation estimation and descriptor computation based on supervision from classical Structure from Motion (SfM) systems. SuperPoint introduced a self-supervised framework for training interest point detectors and descriptors. It achieves state-of-the-art homography estimation results on HPatches when compared to SIFT, LIFT and Oriented Fast and Rotated Brief (ORB). The training procedure is, however, complicated and their self-supervision implies that the network can only find corner points. Altwaijry proposed a two-step CNN for matching aerial image patches, which is a particularly challenging task due to the ultra-wide baseline. Another approach detects keypoint locations on different scales, utilizing high activations in recursive network feature maps. KCNN was shown to emulate hand-crafted detectors by training small networks using keypoints detected by other methods as ground truth. Local Feature Network (LF-NET) is the closest to our method: a keypoint detector and descriptor is trained end-to-end in a two-branch set-up, one branch being differentiable and feeding on the output of the other, non-differentiable branch. However, they optimized their detector for repeatability between image pairs, not taking into account the matching performance.
Truong  presented an evaluation of SURF, KAZE, ORB, Binary Robust Invariant Scalable Keypoints (BRISK) , Fast Retina Keypoint (FREAK) , LIFT, SuperPoint and LF-NET both in terms of image matching and registration quality on retinal fundus images. They found that while SuperPoint outperforms all the others relative to the matching performance, LIFT demonstrates the highest results in terms of registration quality, closely followed by KAZE and SIFT. The highlighted issue was that even the best-performing detectors produce feature points which are densely positioned and as a result may be associated with a similar descriptor. This can lead to false matches and thus inaccurate or failed registrations.
Our goal is to tackle this problem by introducing a novel semi-supervised learned method for keypoint detection. Detectors are often optimized for repeatability (such as LF-NET) and not for the quality of the associated matches between image pairs. Our training procedure uses a reward concept akin to RL to extract repeatable, stable interest points with a uniform coverage and is specifically designed to maximize correct matching on a specific domain, as shown for challenging retinal slit lamp images.
Our trained network predicts the location of stable interest points, called GLAMpoints, on a full-sized gray-scale image. In this section, we explain how our training set was produced and our training procedure. As we used standard convolutional network architecture, we only briefly discuss it in the end.
We trained our model on a dataset from the ophthalmic field, namely slit lamp fundus videos, used in laser treatment (examples in Figure 3). In this application, live registration is required for an accurate ablation of the retinal tissue. Our training dataset consists of 1336 images with different resolutions, ranging from to by to . These images were acquired with multiple cameras and devices to cover large variability of appearances. They come from eye examination of 10 different patients, who were healthy or with diabetic retinopathy.
Let $B$ be the set of base images of size $H \times W$. At every step $i$, an image pair $(I_i, I'_i)$ is generated from an original image $B_i$ by applying two separate, randomly sampled homography transforms $g_i$ and $g'_i$. Images $I_i$ and $I'_i$ are thus related according to the homography $H_{I_i, I'_i} = g'_i \circ g_i^{-1}$ (see supplementary material). On top of the geometric transformations, standard data augmentation methods are used: Gaussian noise, changes of contrast, illumination, gamma, motion blur and the inverse of the image. A subset of these appearance transformations is randomly chosen for each image $I_i$ and $I'_i$.
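The pair generation can be illustrated with a minimal NumPy sketch. It only handles the geometry: two homographies are sampled, and the homography relating the two warped views is obtained by composing one with the inverse of the other. The sampling ranges and the function names (`random_homography`, `pair_homography`, `apply_homography`) are assumptions made for illustration, not the values or code used for training.

```python
import numpy as np


def random_homography(rng, max_shift=20.0, max_angle_deg=15.0, max_persp=1e-4):
    """Sample a 3x3 homography (all ranges here are illustrative assumptions)."""
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(0.8, 1.2)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    px, py = rng.uniform(-max_persp, max_persp, size=2)
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s, c, ty],
                     [px, py, 1.0]])


def apply_homography(h, points):
    """Map an (n, 2) array of (x, y) points through the homography h."""
    pts = np.hstack([np.asarray(points, dtype=float), np.ones((len(points), 1))])
    mapped = pts @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def pair_homography(g, g_prime):
    """Homography taking coordinates of I = g(B) to coordinates of I' = g'(B)."""
    return g_prime @ np.linalg.inv(g)
```

A point of the base image seen at `apply_homography(g, p)` in the first view lands at `apply_homography(g_prime, p)` in the second, which is exactly what `pair_homography(g, g_prime)` maps between.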
We define our learned function $f_\theta(I) \rightarrow S$, where $S$ denotes the pixel-wise feature point probability map of size $H \times W$. Lacking a direct ground truth of keypoint locations, a delayed reward can be computed instead. We base this reward on the matching success, computed after registration. The training proceeds as follows:
1. Given a pair of images $I$ and $I'$ related by the ground truth homography $H_{I,I'}$, our model provides a score map for each image, $S = f_\theta(I)$ and $S' = f_\theta(I')$.
2. The locations of interest points are extracted on both score maps using standard non-differentiable Non-Maximum Suppression (NMS), with a window size $w$.
3. A 128-dimensional root-SIFT feature descriptor is computed for each detected keypoint.
4. The keypoints from image $I$ are matched to those of image $I'$ and vice versa using a brute force matcher. Only the matches that are found in both directions are kept.
5. The matches are checked according to the ground truth homography $H_{I,I'}$. A match is defined as a true positive if the corresponding keypoint $x$ in image $I$ falls into an $\varepsilon$-neighborhood of the point $x'$ in $I'$ after applying $H_{I,I'}$. This is formulated as $\lVert H_{I,I'}\,x - x' \rVert \le \varepsilon$, where $\varepsilon$ is the chosen pixel tolerance.
6. Let $T$ denote the set of true matches. If a given feature point ends up in the set of true positive points, it gets a positive reward. All other points/pixels are given a reward of 0. Consequently, the reward matrix $R$ for a keypoint $(x, y)$ can be defined as follows:
This leads to the following loss function:
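As a sketch, assuming a unit reward for true positives and a squared-error penalty between the score map and the reward (both are assumptions, as are the symbols $R$, $T$ and $f_\theta$), the reward matrix and the resulting loss can be written as:

```latex
R_{x,y} =
\begin{cases}
1 & \text{if } (x, y) \in T, \\
0 & \text{otherwise,}
\end{cases}
\qquad
L_{\text{simple}}(\theta, I) = \bigl(f_\theta(I) - R\bigr)^2 .
```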
However, a major drawback of this formulation is the large class imbalance between positively rewarded points and null-rewarded ones, where the latter prevail by far, especially in the first stages of training. Given a reward with mostly zero values, the network converges to a zero output. Hard mining has been shown to boost training of descriptors. Negative hard mining on the false positive matches might also enhance performance in our method, but has not been investigated in this work. Instead, to counteract the imbalance, we use sample mining: we select all $n$ true positive points and randomly sample an additional $n$ from the set of false positives. We only back-propagate through the true positive feature points and mined false positive key points. If there are more true positives than false positives, gradients are backpropagated through all found matches. This mining is mathematically formulated as a mask $M$, equal to 1 at the locations of the true positive key points and of the subset of mined feature points, and equal to 0 otherwise. The loss is thus formulated as follows:
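Assuming a squared-error penalty between the score map $f_\theta(I)$ and the reward matrix $R$, a binary mask $M$, and normalization by the number of selected pixels (all of which are assumptions), the masked loss can be sketched as:

```latex
L(\theta, I) = \frac{\sum \bigl(f_\theta(I) - R\bigr)^2 \odot M}{\sum M}
```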
where $\odot$ denotes the element-wise multiplication.
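The sample mining and the masked loss can be sketched in NumPy as follows. The sketch assumes a squared-error penalty averaged over the masked pixels, keypoints given as integer (row, col) coordinates, and hypothetical function names (`mined_mask`, `masked_loss`); it illustrates the mechanism rather than reproducing the training code.

```python
import numpy as np


def mined_mask(true_pos, false_pos, shape, rng):
    """Binary mask: all true positives plus an equal number of random false positives.

    true_pos and false_pos are integer arrays of shape (n, 2) holding (row, col).
    """
    mask = np.zeros(shape, dtype=np.float32)
    n_tp, n_fp = len(true_pos), len(false_pos)
    if n_tp >= n_fp:
        # More true positives than false positives: keep every found match.
        chosen_fp = false_pos
    else:
        idx = rng.choice(n_fp, size=n_tp, replace=False)
        chosen_fp = false_pos[idx]
    for pts in (true_pos, chosen_fp):
        if len(pts):
            mask[pts[:, 0], pts[:, 1]] = 1.0
    return mask


def masked_loss(score_map, reward, mask):
    """Squared error averaged over masked pixels (the squared error is an assumption)."""
    denom = mask.sum()
    if denom == 0:
        return 0.0
    return float((((score_map - reward) ** 2) * mask).sum() / denom)
```

Pixels outside the mask contribute nothing, so the many null-rewarded locations cannot pull the score map towards zero.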
An overview of the training steps is given in Figure 2. Importantly, only step 1 is differentiable with respect to the loss. We learn directly on a reward which is the result of non differentiable actions, without supervision.
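The non-differentiable keypoint extraction can be illustrated with a greedy non-maximum suppression over the score map. This is a minimal sketch under assumptions: the window is treated as a half-width in pixels, pixels at or below `min_score` are ignored, and the function name `nms_keypoints` is hypothetical.

```python
import numpy as np


def nms_keypoints(score_map, window=10, min_score=0.0):
    """Greedy NMS: repeatedly take the best pixel and suppress its neighborhood.

    Returns an (n, 2) integer array of (row, col) keypoints, best first.
    """
    scores = np.array(score_map, dtype=float)  # copy, so the input is not modified
    keypoints = []
    while True:
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[r, c] <= min_score:
            break
        keypoints.append((int(r), int(c)))
        # Suppress everything within the window around the selected pixel.
        r0, c0 = max(0, r - window), max(0, c - window)
        scores[r0:r + window + 1, c0:c + window + 1] = -np.inf
    return np.array(keypoints, dtype=int).reshape(-1, 2)
```

Because the selection involves an argmax and hard suppression, no gradient flows through it, which is why only the score-map prediction is differentiable with respect to the loss.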
It should be noted that the descriptor we used is the root-SIFT version without rotation invariance. The reason is that it performs better on slit lamp images than the root-SIFT detector/descriptor with rotation invariance (see supplementary material for details). The aim of this paper is to investigate the detector only and therefore we used rotation-dependent root-SIFT for consistency.
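For illustration, the descriptor and matching steps can be sketched in NumPy. Root-SIFT is obtained from SIFT descriptors by L1-normalizing each vector and taking the element-wise square root; matches are then kept only when they are nearest neighbors in both directions. The SIFT descriptors themselves are assumed to be computed elsewhere, and the function names are hypothetical.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """L1-normalize each SIFT descriptor, then take the element-wise square root."""
    d = np.asarray(descriptors, dtype=np.float64)
    d = d / (np.abs(d).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(d)


def mutual_matches(desc_a, desc_b):
    """Brute-force matching that keeps only matches found in both directions.

    Returns an (m, 2) array of index pairs (index in desc_a, index in desc_b).
    """
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    a_to_b = dist.argmin(axis=1)  # best match in B for each descriptor of A
    b_to_a = dist.argmin(axis=0)  # best match in A for each descriptor of B
    idx_a = np.arange(len(desc_a))
    keep = b_to_a[a_to_b] == idx_a
    return np.stack([idx_a[keep], a_to_b[keep]], axis=1)
```

Both functions expect at least one descriptor per image; an empty set of keypoints should be handled by the caller.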
A standard 4-level deep Unet with a final sigmoid activation was used to learn the score-map function $f_\theta$. It comprises 3x3 convolution blocks with batch normalization and Rectified Linear Unit (ReLU) activations (see Figure 2, c). Since the task of keypoint detection is similar to pixel-wise binary segmentation (class interest point or not), Unet was a promising choice due to its past successes in binary and semantic segmentation tasks.
In this section, we describe the testing dataset and the evaluation protocol. We then compare state-of-the-art detectors, quantitatively and qualitatively to our proposed GLAMpoints.
In this study we used the following test datasets:
The slit lamp dataset: from retinal videos of 3 patients (different from the ones used for training), a random set of 206 frame pairs was selected as testing samples, with size to by to . Examples are shown in Figure 3. The pairs were selected to have an overlap ranging from 20 to 100%. They are related by affine transformations and rotations up to 15 degrees. Using a dedicated software tool, all pairs of images were manually annotated following common procedures  with at least 5 corresponding points, which were then used to estimate the ground truth homographies relating the pairs. As the slit lamp images depict small area of retina, it is justified to apply the planar assumption in generating homographies [13, 24].
The FIRE dataset: a publicly available retinal image registration dataset with ground truth annotations. It consists of 129 retinal images forming 134 image pairs. The original images of 2912x2912 pixels were down-scaled to 15% of their original size, to match the resolution of the training set. Examples of such images are shown in Figure 5.
As a pre-processing step for testing fundus images we isolated the green channel, applied adaptive histogram equalization and a bilateral filter to reduce noise and enhance the appearance of edges as proposed in . The effect of pre-processing can be seen in Figure 1.
We evaluated the performance using the following metrics:
Repeatability describes the percentage of detected points in image $I$ that are within an $\varepsilon$-distance of points in $I'$ after transformation with $H_{I,I'}$.
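A minimal sketch of this metric, following the verbal definition: points of the first image are warped with the ground truth homography and counted as repeated when a detected point of the second image lies within the tolerance. The normalization by the number of points in the first image and the function names are assumptions; the tolerance `eps` must be supplied by the caller.

```python
import numpy as np


def warp_points(h, points):
    """Map an (n, 2) array of (x, y) points through the homography h."""
    pts = np.hstack([np.asarray(points, dtype=float), np.ones((len(points), 1))])
    mapped = pts @ np.asarray(h, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]


def repeatability(points_a, points_b, h_ab, eps):
    """Percentage of points of image A with a detected point of B within eps pixels."""
    if len(points_a) == 0 or len(points_b) == 0:
        return 0.0
    warped = warp_points(h_ab, points_a)
    dist = np.linalg.norm(warped[:, None, :] - np.asarray(points_b, dtype=float)[None, :, :], axis=2)
    repeated = int((dist.min(axis=1) <= eps).sum())
    return 100.0 * repeated / len(points_a)
```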
Matching performance. Matches were found using the Nearest Neighbor Distance Ratio (NNDR) strategy, as proposed in : two keypoints are matched if the descriptor distance ratio between the first and the second nearest neighbor is below a certain threshold . Then, the following metrics were evaluated:
M.score, the ratio of correct matches over the total number of keypoints extracted by the detector in the shared viewpoint region .
Coverage fraction, measures the coverage of an image by correctly matched key points. A coverage mask was generated from correctly matched key points, each one adding a disk of fixed radius (25px) as in .
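The matching metrics above can be sketched in NumPy as follows. The ratio threshold is left as a parameter, keypoints are given as (row, col) coordinates, and the function names are hypothetical; deciding which matches are correct is assumed to be done separately using the ground truth homography.

```python
import numpy as np


def nndr_matches(desc_a, desc_b, ratio):
    """Match when nearest / second-nearest descriptor distance is below `ratio`.

    Requires at least two descriptors in desc_b. Returns (m, 2) index pairs.
    """
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    rows = np.arange(len(desc_a))
    first = dist[rows, order[:, 0]]
    second = dist[rows, order[:, 1]]
    keep = first < ratio * second
    return np.stack([rows[keep], order[keep, 0]], axis=1)


def m_score(num_correct_matches, num_keypoints_shared):
    """Correct matches over keypoints detected in the shared viewpoint region."""
    if num_keypoints_shared == 0:
        return 0.0
    return num_correct_matches / num_keypoints_shared


def coverage_fraction(points, shape, radius=25):
    """Fraction of the image covered by disks centred on correctly matched points."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=bool)
    for r, c in points:
        mask |= (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
    return float(mask.mean())
```

Clustered keypoints produce overlapping disks and therefore a low coverage fraction, even when many of them are matched correctly.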
We computed the homography relating the reference to the transformed image by applying RanSaC algorithm to remove outliers from the detected matches.
Registration success rate. We furthermore evaluated the registration accuracy achieved after using key points computed by different detectors as in [15, 47]. To do so, we compared the reprojection error of six fixed points of the reference image (denoted as ) onto the other. For each image pair for which a homography was found, the quality of the registration was assessed with the median error (MEE) and the maximum error (MAE) of the distances between corresponding points after transformation.
Using these metrics, we defined different thresholds on MEE and MAE that define failed, inaccurate and acceptable registrations. We consider a registration failed if not enough keypoints or matches were found to estimate a homography (minimum 4), if it involves a flip, or if the estimated scaling component is greater than 4 or falls below a lower bound. We classified the result as acceptable when both MEE and MAE are below their respective thresholds, and as inaccurate otherwise. The values for the thresholds were found empirically by post-viewing the results. Using the above definitions, we calculated the success rate of each class, equal to the percentage of image pairs for which the registration falls into each category. These metrics are the most important quantitative evaluation criteria of the overall performance in a real-world setting.
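The classification rule can be sketched as below. The error thresholds and the lower scale bound are deliberately left as required parameters. Detecting a flip through a non-positive determinant of the upper-left 2x2 block, and measuring scale as the square root of that determinant, are assumptions made for this sketch, as are the function names.

```python
import numpy as np


def warp_points(h, points):
    """Map an (n, 2) array of (x, y) points through the homography h."""
    pts = np.hstack([np.asarray(points, dtype=float), np.ones((len(points), 1))])
    mapped = pts @ np.asarray(h, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]


def reprojection_errors(h_est, h_gt, control_points):
    """Median (MEE) and maximum (MAE) distance between estimated and true positions."""
    diff = warp_points(h_est, control_points) - warp_points(h_gt, control_points)
    dist = np.linalg.norm(diff, axis=1)
    return float(np.median(dist)), float(dist.max())


def classify_registration(h_est, n_matches, mee, mae, mee_max, mae_max,
                          min_scale, max_scale=4.0, min_matches=4):
    """Return 'failed', 'inaccurate' or 'acceptable' for one image pair."""
    if h_est is None or n_matches < min_matches:
        return "failed"
    det = np.linalg.det(np.asarray(h_est, dtype=float)[:2, :2])
    if det <= 0:  # assumed flip test: orientation-reversing linear part
        return "failed"
    scale = np.sqrt(det)  # assumed definition of the scaling component
    if scale > max_scale or scale < min_scale:
        return "failed"
    if mee < mee_max and mae < mae_max:
        return "acceptable"
    return "inaccurate"
```

The success rate of a class is then the share of image pairs for which `classify_registration` returns that label.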
To evaluate the performance of our GLAMpoints detector associated with the root-SIFT descriptor, we compared its matching ability and registration quality against well-known detectors and descriptors. Among them, SIFT, KAZE and LIFT were shown to perform well on fundus images by Truong. Moreover, we compared our method to other CNN-based detectors-descriptors: LF-NET and SuperPoint. We used the authors' implementation of LIFT (pretrained on Piccadilly), SuperPoint and LF-NET (pretrained on indoor data, which gives substantially better results on fundus images than the version pretrained on outdoor data) and the OpenCV implementation for SIFT and KAZE. A rotation-dependent version of the root-SIFT descriptor is used due to its better performance on our test set compared to the rotation-invariant version. For the remainder of the paper, SIFT descriptor refers to root-SIFT.
Training of GLAMpoints was performed using TensorFlow with mini-batch size of 5 and the Adam optimizer with learning rate and $\beta$ = (0.9, 0.999) for 35 epochs. For each batch we randomly cropped patches of the full-resolution image to speed up the computation. GLAMpoints (NMS10) was trained and tested with an NMS window equal to 10px. It must be noted that other NMS windows can be applied, which obtain similar performance.
Table 1 presents the success rate of registration evaluated on the slit lamp dataset. Without pre-processing, most detectors show lower performance compared to the pre-processed images, but GLAMpoints performs well even on raw images. While the success rate of acceptable registrations of SIFT, KAZE and SuperPoint drops by 20 to 30% between pre-processed and raw images, GLAMpoints as well as LIFT and LF-NET show only a decrease of 3 to 6%. Besides, LF-NET, LIFT and GLAMpoints detect a steady average number of keypoints (around 485 for pre-processed and 350 for non-pre-processed images) independently of the pre-processing, whereas the other detectors see a reduction by half. In general, GLAMpoints shows the highest performance for both raw and pre-processed images in terms of registration success rate. The robust results of our method indicate that while our detector performs as well or better on good quality images compared to the heuristic-based methods, its performance does not drop on lower quality images.
|Detector||Failed [%]||Inaccurate [%]||Acceptable [%]|
|Random grid (SIFT)||0.0||62.62||37.38|
While SIFT extracts a large number of keypoints (205.69 on average for unprocessed images and 431.03 for pre-processed), they appear in clusters as shown in Figure 5. As a result, even if the repeatability is relatively high, the close positioning of the interest points leads to a large number of rejected matches, as the nearest neighbours are very close to each other. This is evidenced by the low coverage fraction, M.score and AUC (Figure 4). With a similar value of repeatability, our approach extracts interest points that are widely spread and trained for their matching ability (highest coverage fraction), resulting in more true positive matches (second highest M.score and AUC), as shown in Figure 4.
LF-NET, similar to SIFT, shows high repeatability, which can be explained by its training strategy, which favors repeatability over an accurate matching objective. However, its M.score and AUC are in the bottom part of the ranking (Figure 4). While the performance of LF-NET might increase if it were trained on fundus images, its training procedure requires image pairs with their relative pose and corresponding depth maps, which would be extremely difficult - if not impossible - to obtain for fundus images.
It is worth noting that SuperPoint scored the highest M.score and AUC, but in this case the metrics are artificially inflated because very few keypoints are detected (35.88 and 59.21 on average for raw and pre-processed images respectively). This translates to a relatively small coverage fraction and one of the lowest repeatability values, leading to few possible correct matches.
As part of an ablation study, we trained GLAMpoints with different descriptors (Table 1, top). While it performs best with the SIFT descriptor, the results show that for every considered descriptor (SIFT, ORB, BRISK), it improves upon the corresponding original detector.
To benchmark the detection results we used the descriptors that were developed/trained jointly with the given detector and thus can be considered as optimal. For instance in  the combination of the LIFT/LIFT detector/descriptor outperformed LIFT/SIFT. For completeness, we present the registration results of baseline detectors combined with root-SIFT descriptor in Table 1. As can be seen, it does not improve the result compared to the original descriptor.
Finally, to verify that the performance gain of GLAMpoints does not come solely from the uniform and dense spread of the detected keypoints, we computed the success rate for keypoints in a random, uniformly distributed grid (Table 1, bottom), which underperforms in comparison. This shows that our detector predicts not only uniform but also significant points.
[Table 2: registration success rates on the FIRE dataset, with columns Failed [%], Inaccurate [%] and Acceptable [%]]
Table 2 shows the results for success rates of registrations. The presented method outperforms baselines both in terms of success rate and global accuracy of non-failed registrations. As all the images in FIRE dataset present good quality with highly contrasted vascularization, we did not apply pre-processing. We also did not find it necessary to use the available background masks to filter out keypoints detected outside of the retina as generally they were not matched and did not contribute to the final registration.
It is interesting to note the gap of 33.6% in the success rate of acceptable registrations between GLAMpoints and SIFT. As both use the same descriptor, this difference can be only explained by the quality of the detector. As can be seen in Figure 5, SIFT detects a restricted number of keypoints densely positioned solely on the vascular tree and in the image borders, while GLAMpoints extracts interest points over the entire retina, including challenging areas such as the fovea and avascular zones, leading to a substantial rise in the number of correct matches.
Even though GLAMpoints outperforms all other detectors, LIFT and SuperPoint also present high performance on the FIRE dataset. This dataset contains images with well-defined corners on a clearly contrasted vascular tree and LIFT extracts keypoints spread over the entire image, while SuperPoint was trained to detect corners on synthetic primitive shapes. However, as evidenced on the slit lamp dataset, the performance of SuperPoint strongly deteriorates on images with less clear features.
[Figure: a) SIFT, b) GLAMpoints]
To further demonstrate a possible extension of our method to other image domains, we computed its predictions on natural images. Note that we used the same GLAMpoints model trained on slit lamp images.
Globally, GLAMpoints reaches a success rate of 75.38% for acceptable registrations, against 85.13% for the best performing detector - SIFT with rotation invariance - and 83.59% for SuperPoint. In terms of M.score, AUC and coverage fraction it scores respectively second, second and first best. In contrast, the repeatability of GLAMpoints is only second to last after SIFT, KAZE and LF-NET, even though it successfully registers more images. This result shows once again that repeatability is not the most adequate metric to measure the performance of a detector. The detailed results can be found in the supplementary material.
Finally, it should be noted that the outdoor images of this dataset are significantly different from medical fundus images and contain much greater variability of structures, which indicates a promising generalization of our model to unseen image domains.
In case of slit lamp videos the end goal is to create retinal mosaics. Using 10 videos containing 25 to 558 images, we generated mosaics by registering consecutive frames using keypoints detected by different methods. We calculated the average number of frames before the registration failed (due to the lack of extracted keypoints or correct matches between a pair of images). Over those 10 videos, the average number of registered frames before failure is 9.98 for GLAMpoints and only 1.04 for SIFT.
Example mosaics are presented in Figure 6. For the same video, SIFT failed after 34 frames when the data was pre-processed and only after 11 frames on the original data. In contrast, GLAMpoints successfully registered 53 consecutive images, without visual errors. The mosaics were created with frame to frame matching with the blending method of  and without bundle adjustment.
[Figure 6: example mosaics from a) 11 frames, b) 34 frames, c) 53 frames]
The run time of detection is computed over 84 pairs of images with a resolution of 660px by 350px. The GLAMpoints architecture was run on an Nvidia GeForce GTX 1080 GPU, while NMS and SIFT used the CPU. Mean and standard deviation of run time for GLAMpoints and SIFT are presented in Table 3.
|Stage||GLAMpoints||SIFT|
|Detection image I||CNN: 16.28 ± 96.86, NMS: 11.2 ± 1.05||28.94 ± 1.88|
|Total||27.48 ± 98.74||45.58 ± 4.69|
In this paper we introduce GLAMpoints - a keypoint detector optimized for matching performance. This is in contrast to other detectors that are optimized for repeatability of keypoints, ignoring their correctness for matching. GLAMpoints detects significantly more keypoints that lead to correct matches even in low textured images, which do not present many features. As a result, no explicit pre-processing of the images is required. We train our detector on generated image pairs avoiding the need for ground truth correspondences. Our method produces state-of-the-art matching and registration results of medical fundus images and our experiments show that it can be further extended to other domains, such as natural images.