RF-Net: An End-to-End Image Matching Network based on Receptive Field
This paper proposes a new end-to-end trainable matching network based on receptive field, RF-Net, to compute sparse correspondences between images. Building an end-to-end trainable matching framework is desirable and challenging. The very recent approach, LF-Net, successfully embeds the entire feature extraction pipeline into a jointly trainable pipeline and produces state-of-the-art matching results. This paper introduces two modifications to the structure of LF-Net. First, we propose to construct receptive feature maps, which lead to more effective keypoint detection. Second, we introduce a general loss function term, the neighbor mask, to facilitate training patch selection, which improves the stability of descriptor training. We trained RF-Net on the open dataset HPatches and compared it with other methods on multiple benchmark datasets. Experiments show that RF-Net outperforms existing state-of-the-art methods.
Establishing correspondences between images plays a key role in many Computer Vision tasks, including but not limited to wide-baseline stereo, image retrieval, and image matching. A typical feature-based matching pipeline consists of two components: detecting keypoints with their attributes (scale, orientation), and extracting descriptors. Many existing methods focus on building or training keypoint detectors or feature descriptors individually. However, when these separately optimized subcomponents are integrated into a matching pipeline, the individual performance gains may not directly add up. Jointly training detectors and descriptors so that they cooperate optimally with each other is therefore more desirable. However, training such a network is difficult because the two subcomponents have different objectives to optimize. Not many successful end-to-end matching pipelines have been reported in the literature. LIFT
is probably the first notable design toward this goal. However, LIFT relies on the output of the SIFT detector to initialize training, and hence its detector behaves similarly to the SIFT detector. The recent network SuperPoint achieves end-to-end training, but its detector needs to be pre-trained on synthetic image sets, and the whole network is trained on images under synthesized affine transformations. The more recent LF-Net is inspired by Q-learning and uses a Siamese architecture to train the entire network without the help of any hand-crafted method. In this paper, we develop an end-to-end matching network with enhanced detector and descriptor training modules, which we elaborate as follows.
[Figure 1: response map construction in (a) LF-Det and (b) RF-Det]
Keypoint Detection. Constructing response maps is a general way to find keypoints. LIFT obtains response maps by directly applying convolutions at different resolutions of the input image. SuperPoint builds response maps whose width and height are only 1/8 of those of the input; hence each response represents a highly abstract feature of the input image, and the size of the feature's receptive field is larger than 8 pixels. LF-Net uses ResNet to produce abstract feature maps from the input image, then builds response maps by convolution on the abstract feature maps at different resolutions. Therefore, the response on each map has a large receptive field. In this work, we build response maps with controlled receptive fields. Specifically, we apply convolutions to produce feature maps with increasing receptive fields (Figure 1 (b)). For example, stacking convolutions with a 3×3 kernel and a stride of 1, the receptive field increases to 3, 5, 7, and so on. This design produces more effective response maps for keypoint detection.
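The growth of the receptive field under stacked convolutions follows a standard recursion; the following small sketch (illustrative only, not from the paper's code) reproduces the 3, 5, 7, … sequence mentioned above:

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field sizes of a stack of identical conv layers.

    Standard recursion: r_n = r_{n-1} + (kernel - 1) * (effective stride so far).
    With a 3x3 kernel and stride 1, each layer adds 2 pixels: 1 -> 3 -> 5 -> 7.
    """
    r, jump = 1, 1          # receptive field and effective stride ("jump")
    sizes = []
    for _ in range(num_layers):
        r += (kernel - 1) * jump
        jump *= stride
        sizes.append(r)
    return sizes

print(receptive_field(4))   # [3, 5, 7, 9]
```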
Feature Descriptor. Training descriptors in an end-to-end network is very different from training them individually. Existing (individual) descriptor training is often done on well-prepared datasets such as the Oxford Dataset, the UBC PhotoTour Dataset, and the HPatches Dataset. In contrast, in end-to-end network training, patches must be produced from scratch. In LF-Net, patch pairs are sampled by rigidly transforming patches surrounding keypoints from one image to the other. However, a defect of this simple sampling strategy can harm descriptor training: two originally far-away keypoints can become very close to each other after the transformation. As a result, a negative patch can look very similar to an anchor patch and its positive patch, which confuses the network during training and introduces labeling ambiguity. We propose a general loss function term, called the neighbor mask, to overcome this issue. The neighbor mask can be used in both the triplet loss and its variants.
Integrating our new backbone detector and the descriptor network, our sparse matching pipeline is also trained in an end-to-end manner, without involving any hand-designed component. We observe that the descriptor's performance greatly influences the detector's training, and a more robust descriptor helps the detector learn better. Therefore, in each training iteration, we train the descriptor twice and the detector once. To show the effectiveness of our approach through comprehensive and fair evaluations, we compare RF-Net with other methods under three evaluation protocols on two public datasets, HPatches and the EF Dataset. Matching experiments demonstrate that RF-Net outperforms existing state-of-the-art approaches.
The main contributions of this paper are in three aspects. (1) We propose a new receptive field based detector, which generates more effective scale space and response maps. (2) We propose a general loss function term for descriptor learning which improves the robustness of patch sampling. (3) Our integrated RF-Net supports effective end-to-end training, which leads to better matching performance than existing approaches.
A typical feature-based matching pipeline consists of two components: detecting keypoints with their attributes (scale, orientation), and extracting descriptors. Many recent learning-based pipelines focus on improving one of these modules, such as feature detection [22, 33, 19, 26], orientation estimation, and descriptor representation [17, 24, 8]. The deficiency of these approaches is that the performance gain from one improved component may not directly translate into an improvement of the entire pipeline [29, 23].
Among hand-crafted approaches, SIFT is probably the most well-known traditional local feature descriptor. A big limitation of SIFT is its speed. SURF approximates the LoG with a box filter and significantly speeds up detection. Other popular hand-crafted features include WADE, Edge Foci, Harris corners, and its affine-covariant extension.
FAST uses a machine learning approach to speed up the process of corner detection. TILDE learns from pre-aligned images of the same scene under different illumination conditions. Although trained with assistance from SIFT, TILDE can still identify keypoints missed by SIFT and performs better than SIFT on the evaluated datasets. Quad-Network is trained unsupervisedly with a "ranking" loss. A follow-up approach combines this "ranking" loss with a "peakedness" loss and produces a more repeatable detector. Lenc et al. propose to train a feature detector directly from the covariant constraint. Zhang et al. extend the covariant constraint by defining the concepts of "standard patch" and "canonical feature". Another method learns to estimate orientation to improve feature point matching.
Descriptor learning is the focus of many works on image matching. DeepDesc applies a Siamese network, while MatchNet and DeepCompare learn a nonlinear distance metric for matching. A series of recent works have considered more advanced model architectures and triplet-based deep metric learning formulations, including UCN, TFeat, GLoss, L2-Net, Hard-Net, and He et al. Recent works focus on designing better loss functions while still using the network architecture proposed in L2-Net.
Building end-to-end matching frameworks has been less explored. LIFT was probably the first attempt to build such a network. It combines three CNNs (for the detector, orientation estimator, and descriptor) through differentiable operations. While it aims to extract an SfM-surviving subset of DoG detections, its detector and orientation estimator are fed with a patch instead of the whole image, and hence are not trained end-to-end. SuperPoint
trains a fully-convolutional neural network that consists of a single shared encoder and two separate decoders (for feature detection and description, respectively). Synthetic shapes are used to generate images for the detector's pre-training, and synthetic homographic transformations are used to produce image pairs for the detector's fine-tuning. The more recent LF-Net presents a novel deep architecture and a training strategy to learn a local feature pipeline from scratch. Based on a Siamese network structure, LF-Net predicts on one branch and generates ground truth on the other. It is fed a QVGA-sized image and produces multi-scale response maps, which it then processes to output three dense maps representing keypoint saliency, scale, and orientation, respectively.
Our RF-Net consists of a detector, called RF-Det, which is based on receptive feature maps, and a descriptor extractor whose architecture is the same as L2-Net but with a modified loss function. The design of the whole network structure is depicted in Figure 2. During testing, the detector network RF-Det takes in an image and outputs a score map, an orientation map, and a scale map. These three maps give the locations, orientations, and scales of keypoints, respectively. Patches cropped according to these maps are fed to the descriptor module to extract fixed-length feature vectors for matching.
Constructing scale-space response maps is the basis for keypoint detection. We denote the response maps as h^1, ..., h^N, where N is the total number of layers. LF-Net uses abstract feature maps extracted from ResNet to construct its response maps. Each response in the abstract feature maps represents a high-level feature extracted from a large region of the image, while low-level features are not preserved. Thus, every response map is a large-scale response in the scale space.
Our idea is to preserve both high-level and low-level features when constructing the response maps, using some maps (e.g., those with smaller indices) to offer small-scale responses and others (e.g., those with bigger indices) to offer large-scale responses.
Following this idea, we use hierarchical convolutional layers to produce feature maps with increasing receptive fields. Each response in a feature map therefore describes features abstracted from a certain range of the image, and this range grows as more convolutions are applied. We then apply one convolution on each feature map to produce the response maps in the multi-scale space.
In our implementation, the hierarchical convolutional layers each consist of sixteen kernels followed by an instance normalization and leaky ReLU activations. We also add shortcut connections between layers, which do not change the receptive field of the feature maps and make the network easier to train. To produce the multi-scale response maps, we apply a single convolution kernel followed by an instance normalization to each feature map. All convolutions are zero-padded so that the output size is the same as the input.
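For reference, instance normalization normalizes each channel of a feature map by that channel's own spatial statistics. The following minimal numpy sketch is illustrative only (the function and argument names are ours, not the authors'):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for a (C, H, W) feature map:
    each channel is normalized by its own spatial mean and variance."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positives, small slope for negatives."""
    return np.where(x >= 0, x, slope * x)
```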
Following the commonly adopted strategy, we select high-response pixels as keypoints. The response maps represent pixels' responses at multiple scales, so we produce the keypoint score map from them. Our keypoint detection is designed similarly to LF-Net's, except that our response maps are constructed from receptive feature maps.
Specifically, we perform two softmax operations to produce the score map. The first softmax is applied over a window sliding on each response map (with zero padding) and serves to sharpen the responses. We then merge all the sharpened maps into the final score map with a second softmax, applied across scales over a sliding window, combined through a Hadamard product; the resulting score indicates the probability of a pixel being a keypoint.
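A minimal numpy sketch of this two-softmax merge follows. It is illustrative only: for brevity the first softmax is taken over the whole map rather than the paper's sliding window, and all names are ours:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def merge_score_maps(h):
    """h: (N, H, W) multi-scale response maps.
    First softmax sharpens each map spatially (here over the whole map
    instead of a sliding window); the second softmax, taken across the
    scale axis, weights each scale's response, and a Hadamard product
    plus a sum over scales merges everything into one score map."""
    sharp = softmax(h.reshape(h.shape[0], -1), axis=1).reshape(h.shape)
    scale_weights = softmax(sharp, axis=0)        # softmax across scales
    return (sharp * scale_weights).sum(axis=0)    # (H, W) score map

scores = merge_score_maps(np.random.rand(5, 8, 8))
```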
Estimations of orientation and scale are also produced from the feature maps. We apply convolutions with two kernels to produce multi-scale orientation maps (see Figure 1 (b)) whose values indicate the sine and cosine of the orientation; an angle is computed from them with the arctangent function. We then apply the same weighted merging to combine all the orientation maps into the final orientation map.
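The sine/cosine parameterization can be sketched as follows (an illustrative numpy version with our own names; the per-scale weights are assumed to be already normalized across scales):

```python
import numpy as np

def merge_orientation(sin_maps, cos_maps, weights):
    """sin_maps/cos_maps: (N, H, W) per-scale orientation components
    predicted by two conv kernels; weights: (N, H, W) per-scale score
    weights. The weighted sin/cos components are merged first and only
    then converted to an angle with arctan2, which avoids averaging
    raw angles across the -pi/+pi wrap-around."""
    s = (sin_maps * weights).sum(axis=0)
    c = (cos_maps * weights).sum(axis=0)
    return np.arctan2(s, c)                     # (H, W) angles in radians
```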
To produce the scale map, we apply the same weighted merging used in orientation estimation, with each response map contributing the receptive field size of its layer as the scale value.
We develop the descriptor extraction module following a structure similar to L2-Net. This structure is also adopted in other recent descriptor learning frameworks such as Hard-Net and He et al. Specifically, the descriptor network consists of seven convolution layers, each followed by batch normalization and ReLU, except for the last one. The output descriptors are L2-normalized and have dimension 128. While we adopt this effective network structure shared by many recent descriptor extraction modules, we use a different loss function, discussed in the following.
A keypoint detector predicts keypoints' locations, orientations, and scales. Its loss function therefore consists of a score loss and a patch loss. The patch descriptor is independent of the detection component once the keypoints are selected, so we use a separate description loss to train it.
[Figure 3: qualitative matching results of (a) SIFT, (b) FAST+Hard-Net, (c) LF-Net, (d) RF-Net]
Score loss. In the feature matching problem, it is unclear which points are important, so we cannot produce ground-truth score maps through human labeling. A good detector should find the corresponding interest points when the image undergoes a transformation. A simple approach is to let the two score maps produced from an image pair have the same score at corresponding locations, implemented by minimizing the mean square error (MSE) between corresponding locations of the two maps. However, this approach turned out to be not very effective in our experiments.
LF-Net suggests another approach. We feed an image pair I and I' into the network to produce score maps S and S'. We process S' to produce a ground truth S_gt, then define the score loss as the MSE between S and S_gt. More specifically, given the ground-truth perspective matrix, we first warp S' with the perspective transform, denoted t(.); we then select the top K keypoints from the warped score map, an operation denoted top_K(.); finally, we generate a clean ground-truth score map by placing Gaussian kernels with standard deviation sigma at those locations, an operation denoted g(.). The score loss is finally written as

    L_score = | S - g(top_K(t(S'))) |^2

If a keypoint falls outside the image, we drop it from the optimization process.
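The ground-truth construction above can be sketched in numpy as follows. This is an illustrative version under stated simplifications: the perspective warp is assumed to have been applied already, and `k`/`sigma` stand in for the paper's K and sigma:

```python
import numpy as np

def gaussian_gt(shape, keypoints, sigma=1.0):
    """Clean ground-truth score map: a Gaussian kernel at each keypoint."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    gt = np.zeros(shape)
    for y, x in keypoints:
        g = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        gt = np.maximum(gt, g)
    return gt

def score_loss(score_map, warped_score_map, k=3, sigma=1.0):
    """MSE between the predicted score map and a ground truth built from
    the top-k locations of the (already warped) score map."""
    flat = warped_score_map.ravel()
    idx = np.argpartition(flat, -k)[-k:]           # top-k responses
    W = score_map.shape[1]
    kps = [(i // W, i % W) for i in idx]
    gt = gaussian_gt(score_map.shape, kps, sigma)
    return ((score_map - gt) ** 2).mean()
```

By construction, a score map that already equals the ground truth built from its own peaks gives zero loss.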
Patch loss. Keypoint orientation and scale affect the patches cropped from the image, and the descriptors extracted from those patches influence matching precision. We define a patch loss to drive the detector toward detecting more consistent keypoints: patches cropped from corresponding positions should be as similar as possible.
Specifically, we select the top K keypoints from the ground-truth score map, warp their spatial coordinates back to the first image, and form keypoints with the orientations and scales predicted for each image. We extract descriptors d_k and d'_k at the corresponding patches p_k and p'_k. The patch loss can be formulated as

    L_patch = (1/K) * sum_k || d_k - d'_k ||_2
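The formulation above reduces to a mean L2 distance over corresponding descriptor pairs; a minimal sketch (our illustrative implementation, not the authors' code):

```python
import numpy as np

def patch_loss(desc_a, desc_b):
    """Mean L2 distance between descriptors of corresponding patches.
    desc_a, desc_b: (K, D) descriptors of the K matched patch pairs,
    row i of each array being a corresponding pair."""
    return np.linalg.norm(desc_a - desc_b, axis=1).mean()
```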
Unlike LF-Net, which selects keypoints from the score map before warping, we select keypoints from the warped map. This is because many public training datasets (e.g., HPatches) provide no background mask, and keypoints selected before warping may fall out of range on the other image after the transformation. The training data sampling method we use is therefore more general.
Description loss. Our description loss is based on the hard loss proposed in Hard-Net, which maximizes the distance between the closest positive and the closest negative example in the batch. Since patches sampled from scratch may carry label ambiguity, we improve the hard loss with a neighbor mask, which makes descriptor training more stable. Denoting the anchor and positive descriptors of the k-th pair by a_k and p_k, we formulate the description loss as

    L_des = (1/K) * sum_k max(0, 1 + d(a_k, p_k) - min(d(a_k, p_{j_k}), d(a_{j'_k}, p_k)))

where p_{j_k} is the closest non-matching descriptor to a_k, with j_k = argmin_{j != k, C(p_j, p_k) > theta} d(a_k, p_j), and a_{j'_k} is the closest non-matching descriptor to p_k, defined analogously. The function C(., .) computes the Euclidean distance between the centroids of two patches; we call this constraint the neighbor mask. If a patch p_j is very close to p_k, then a_k and p_j should in fact be a correct match (and likewise for a_j and p_k). Therefore, we treat a patch as a positive of another whenever their centroid distance is less than a threshold theta, and mask it when collecting negative samples.
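A minimal numpy sketch of a Hard-Net-style loss with such a neighbor mask follows (our illustrative implementation, not the authors' code; `radius` plays the role of the threshold theta):

```python
import numpy as np

def hard_loss_with_neighbor_mask(anchors, positives, centers_a, centers_p,
                                 margin=1.0, radius=5.0):
    """Hard-Net-style triplet margin loss with a neighbor mask.

    anchors, positives: (K, D) descriptors; row i of each is a matching
    pair. centers_a, centers_p: (K, 2) patch centroids in a common frame.
    Candidate negatives whose centroid lies within `radius` of a patch's
    centroid are masked out, since such patches look like true matches
    and would mislabel the triplet."""
    # pairwise descriptor distances d[i, j] = ||a_i - p_j||
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos = np.diag(d)                                  # matching-pair distances
    # pairwise centroid distances used for the neighbor mask
    cd = np.linalg.norm(centers_a[:, None, :] - centers_p[None, :, :], axis=2)
    big = 1e6
    masked = d + big * np.eye(len(d)) + big * (cd < radius)
    hardest_neg = np.minimum(masked.min(axis=1),      # closest negative to a_i
                             masked.min(axis=0))      # closest negative to p_i
    return np.maximum(0.0, margin + pos - hardest_neg).mean()
```

Masking is done by adding a large constant to the distances of excluded candidates, so they can never be selected as the hardest negative.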
In summary, we train the description network with the description loss, and the detection network with the combination of the score loss and the patch loss.
Training data. We trained our network on the open dataset HPatches, a recent dataset for local patch descriptor evaluation consisting of 116 sequences of 6 images with known homography. The dataset is split into two parts: 59 sequences with significant viewpoint change and 57 sequences with significant illumination change, both natural and artificial. We split the viewpoint sequences into 53 sequences for training and validation and the remaining 6 sequences for testing.
At the training stage, we resized all images to a fixed resolution, converted them to grayscale for simplicity, and normalized them individually using their mean and standard deviation. Unlike LF-Net, we do not have depth maps for the images, so all pixels in an image were used for training.
For the training patches of the descriptor extractor, we cropped image patches around the top keypoints using their orientation and scale, and resized them to a fixed patch size. To keep the operation differentiable, we used a bilinear sampling scheme for cropping.
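Bilinear sampling is the standard way to make cropping differentiable with respect to keypoint location and scale; a minimal numpy sketch of the interpolation itself (illustrative only):

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample `img` (H, W) at real-valued coordinates (ys, xs) by
    bilinear interpolation of the four surrounding pixels. The same
    interpolation, applied over a grid of patch coordinates, crops and
    resizes a patch while remaining differentiable in the coordinates."""
    H, W = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy, wx = ys - y0, xs - x0                      # fractional offsets
    return ((1 - wy) * (1 - wx) * img[y0, x0]
            + (1 - wy) * wx * img[y0, x0 + 1]
            + wy * (1 - wx) * img[y0 + 1, x0]
            + wy * wx * img[y0 + 1, x0 + 1])
```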
Training details. At the training stage we extracted a fixed number of keypoints, but at the testing stage we can choose as many keypoints as desired. For optimization we used ADAM, with the same initial learning rate for the detector and the descriptor, and in each iteration we trained the descriptor twice and then the detector once. The threshold in the neighbor mask is set to 5.
Besides the HPatches illumination and viewpoint sequences, we also evaluated our model on the EF Dataset, which has 5 sequences of 38 images containing drastic illumination and background clutter changes.
The definition of a match depends on the matching strategy. To evaluate the performance of the entire local feature pipeline, we use three matching strategies to calculate the match score for quantitative evaluation:
The first is nearest neighbor (NN) matching: two regions A and B are matched if the descriptor D_B is the nearest neighbor to D_A. With this approach, a descriptor has only one match.
The second is nearest neighbor with a threshold (NNT) matching: two regions A and B are matched if D_B is the nearest neighbor to D_A and the distance between them is below a threshold t.
The third is nearest neighbor distance ratio (NNR) matching: two regions A and B are matched if ||D_A - D_B|| / ||D_A - D_C|| < t, where D_B is the first and D_C is the second nearest neighbor to D_A.
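The three protocols can be sketched in one small numpy function (our illustrative implementation; names and the default threshold are ours):

```python
import numpy as np

def match(desc_ref, desc_t, strategy="nn", t=0.7):
    """Match reference descriptors to transformed-image descriptors under
    the three protocols: 'nn' (nearest neighbor), 'nnt' (nearest neighbor
    below distance threshold t), 'nnr' (first-to-second nearest-neighbor
    distance ratio below t). Returns a list of (i, j) index pairs."""
    d = np.linalg.norm(desc_ref[:, None, :] - desc_t[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(d):
        order = np.argsort(row)
        j = order[0]                       # nearest neighbor
        if strategy == "nn":
            ok = True
        elif strategy == "nnt":
            ok = row[j] < t
        else:                              # "nnr"
            ok = row[j] / row[order[1]] < t
        if ok:
            matches.append((i, j))
    return matches
```

Note how the same nearest neighbor can pass NN but fail NNT or NNR, which is exactly what the three protocols are designed to distinguish.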
All matching strategies compare each descriptor of the reference image with each descriptor of the transformed image. To emphasize accurate localization of keypoints, following [18, 19], we used a pixel-distance threshold instead of the overlap measure used elsewhere. All learned descriptors are L2-normalized, so their distances lie in a bounded range. For fairness, we also L2-normalized the hand-crafted descriptors, and used fixed values for the nearest neighbor threshold and the nearest neighbor distance ratio threshold.
We compared RF-Net with three types of methods. The first is full local feature pipelines: SIFT, SURF, and LF-Net. The second is hand-crafted detectors integrated with learned descriptors: DoG, SURF, FAST, and ORB, each combined with L2-Net and Hard-Net. The third is a learned detector integrated with a learned descriptor: Zhang et al. combined with L2-Net and Hard-Net. We use the authors' releases for L2-Net, Hard-Net, LF-Net, and Zhang et al., and OpenCV implementations for the rest. LF-Net and Zhang et al. were trained in the same way as RF-Net on the 53 viewpoint image sequences from HPatches; Hard-Net and L2-Net were trained on the 53 viewpoint patch sequences provided by HPatches. All feature descriptors are 128-dimensional and L2-normalized.
As shown in Table 1, RF-Net outperforms all others and sets a new state-of-the-art on HPatches and the EF Dataset, beating the closest competitor by a clear relative margin on all three sequence sets.
The match score represents the ratio of correct predictions, while the match quantity represents the number of correct predictions. Figure 4 depicts match score and match quantity across all evaluations; RF-Net achieves both a high match score and a high match quantity. The pipeline of ORB combined with Hard-Net also achieves good match quantity under the NN and NNT protocols, but it does not perform well under the NNR protocol. This indicates that descriptors extracted by that pipeline have a high nearest neighbor distance ratio, a problem RF-Net does not have.
[Figure 6: keypoints detected by (a) FAST, (b) LF-Det, (c) RF-Det]
Figure 5 shows how the number of response layers affects RF-Net and LF-Net. For RF-Net, the match score increases with the number of response layers and then saturates; the performance gap between LF-Net and RF-Net appears early and grows as the number of layers increases. This demonstrates that receptive field based response maps are more effective than the abstract feature based approach.
In this section, we examine the importance of various components of our architecture. We replaced LF-Det with RF-Det and trained both with the same training data to show the effectiveness of RF-Det. Table 2 shows the pipeline performance improvement gained by replacing LF-Det with RF-Det.
To examine the effectiveness of the individual modules in RF-Net, we removed the neighbor mask and the orientation estimation module in turn. Table 3 shows that the neighbor mask brings a remarkable match improvement to RF-Net. Even with orientation prediction removed, RF-Net still achieves state-of-the-art match scores, which demonstrates the robustness of RF-Det.
Table 4 shows the repeatability of the hand-crafted approaches, Zhang et al., LF-Det, and our RF-Det. Although FAST does not perform best on image matching, it achieves the highest repeatability. The matching pipeline is a cooperative task between detector and descriptor. As shown in Figure 6, the keypoints detected by learned end-to-end detectors (LF-Det and RF-Det) are sparser than those of FAST. This suggests that sparse keypoints are easier to match, because keypoints that are too close may produce patches too similar to distinguish; a sparse detector therefore works better on this task. Comparing RF-Det with LF-Det, RF-Det indeed achieves higher repeatability than LF-Det on all sequences, which also benefits from the receptive field design.
Figure 3 gives qualitative results on matching challenging image pairs from the EF Dataset and HPatches. We first selected the top keypoints, then matched them with the nearest neighbor distance ratio strategy. We compared our method with SIFT, the FAST detector integrated with Hard-Net, and LF-Net. The images in the top two rows are from the EF Dataset, and those in the bottom two rows are from HPatches; they exhibit large illumination changes or perspective transformations. As shown in Figure 3, our method produces the most green (correct) match lines and fewer red (failed) match lines.
We presented a novel end-to-end deep network, RF-Net, for local feature detection and description. To learn more robust response maps, we proposed a novel keypoint detector based on receptive fields. We also designed a loss function term, the neighbor mask, to learn a more stable descriptor. Both designs bring significant performance improvements to the matching pipeline. We conducted qualitative and quantitative evaluations on three data sequences and showed significant improvements over existing state-of-the-art methods.
This work was supported by the National Natural Science Foundation of China (No. U1605254, 61728206) and the National Science Foundation of USA EAR-1760582.
Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection. CVPR, 2017.
L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. CVPR, pages 6128–6136, 2017.