1 Introduction
As a fundamental research direction in computer vision, finding the correspondences among pairs of images has been widely utilized in plenty of down-stream tasks, including optical flow estimation
[19, 8, 53], visual localization [46, 32, 34], camera position calibration [15, 41], 3D reconstruction [5, 11], and visual tracking [30]. Given a pair of images, the applications mentioned above can be broadly categorized into sparse matching and dense matching according to how the queries and correspondences are determined. The former extracts two sets of keypoints sparsely from the two images and matches them so as to minimize a pre-defined alignment error [21, 34, 15]; the latter treats all pixels in the first image as queries and densely maps them to the other image for correspondences [19, 37, 59, 51]. These two kinds of applications were studied independently for a long time, and various optimizations were designed separately. Recently, COTR [14] showed that they can be naturally modeled within a unified framework, since the only difference between sparse and dense matching is the number of points to query. It recursively applies a transformer-based [4, 52, 7] model at multiple scales in a gradually zooming-in manner to obtain accurate correspondences. Although impressive performance has been achieved, its complex off-line pipeline and slow inference speed seriously limit its practicality in real-world applications.

We argue that there are three main reasons for COTR's unsatisfactory efficiency. The first is its recursive zoom-in refinement framework, which must re-extract features for each local patch at the next level; when there are many queries, these patches are likely to overlap, which means plenty of repeated and redundant computation. The second is that COTR switches the roles of the queries and correspondences to filter out mismatched queries, which doubles the overall computation. The third is that its staged training strategy leads to unstable convergence and needs to be carefully fine-tuned.
Instead of sacrificing speed for performance, in this work we present an efficient correspondence transformer network (ECO-TR), showing that both efficiency and effectiveness can be achieved within a single feed-forward pass. Specifically, we complete the coarse-to-fine refinement of the found correspondences in a stage-by-stage manner. Our framework consists of a bottom-up convolutional neural network (CNN) for multi-scale feature extraction and several top-down transformer blocks corresponding to different matching accuracies. During the coarse-to-fine refinement, rather than cropping image patches of different positions and sizes according to the coarsely predicted coordinates and recursively re-feeding them into the CNN to obtain the corresponding feature maps, we obtain the multi-scale feature maps w.r.t. the input image in a single pass by taking advantage of the pyramid structure and translation invariance of modern CNNs, and crop directly on the collected feature maps. The proposed feature-level cropping effectively avoids repeated computation, so that, to a large extent, the inference time of the model no longer grows linearly with the number of query points. To further improve the efficiency of our framework, an Adaptive Query-Clustering (AQC) module is proposed to gather similar queries into a cluster, which speeds up inference. Moreover, we propose an uncertainty module to estimate the confidence of the predicted correspondences, which achieves good performance on outlier detection nearly for free. As illustrated in Table 1, our approach can process 1,000 queries within one second on a single NVIDIA Tesla V100 GPU for a pair of input images, which is around 40 times faster than COTR under the same conditions.
To evaluate the performance of the proposed approach, we report the results on multiple challenging datasets covering both sparse and dense correspondence finding tasks. Experimental results demonstrate that our method surpasses COTR in performance and speed by a large margin. In addition, we conduct extensive ablation experiments to better understand the impact of each component in our framework. The contributions are summarized below:
-
We propose a new coarse-to-fine framework for correspondence finding that can be applied to both sparse and dense matching tasks. Our method can be optimized end-to-end and evaluates an arbitrary number of queries within a single feed-forward pass.
-
We design an adaptive query-clustering strategy and an uncertainty-based outlier filtering module to achieve a better balance between efficiency and effectiveness.
-
Our method significantly outperforms the existing best-performing functional method in speed, while achieving comparable performance on sparse correspondence tasks and better performance on dense correspondence tasks.
2 Related Work
2.0.1 Sparse methods.
The most common paradigm for sparse image matching pipelines consists of three stages: keypoint detection, keypoint description, and feature matching. In terms of the detection stage, a sparse set of repeatable and matchable keypoints are selected by the detection methods [31, 35, 2, 50], which are robust against viewpoint changes and different lighting conditions. Then, the keypoints are described by patch-level input or image-level input. Patch-based description methods [44, 24, 45, 10] take cropped patches as inputs and are usually trained by metric learning. Image-based description methods such as [9, 6, 27, 22, 43]
take a full image as input and apply fully-convolutional neural networks
[20] to generate dense descriptors. This kind of method usually combines detector and descriptor, which share the same backbone in training and yield better performance on both tasks. Traditional feature matching methods use Nearest Neighbor (NN) search to find potential matches. Recently, many approaches [3, 54, 55, 56, 40]
filter outliers by heuristics or learned priors. SuperGlue
[33] uses an attentional graph neural network and an optimal transport method to obtain state-of-the-art performance on sparse matching tasks. Unlike the methods mentioned above, given some keypoints as queries, COTR [14] recursively refines the matches in the other image with a correspondence neural network. Following COTR, we design an end-to-end model to accelerate this scheme.

2.0.2 Dense methods.
The main purpose of dense matching is to estimate the optical flow. NC-Net [29] represents all keypoints and possible correspondences as a 4D correspondence volume restricted to low-resolution images. Sparse NC-Net [28] applies sparse correlation layers instead of all possible correspondences to mitigate this restriction, whereby higher resolution images can be tackled. DRC-Net [17] reduces the computational cost and promotes performance by using coarse-resolution and fine-resolution feature maps of different layers. GLU-Net [48] finds pixel-wise correspondences by global and local features extracted from images with different resolutions. GOCor [47] disambiguates features in similar regions via an improved feature correlation layer. PDC-Net [49] excludes incorrect dense matches in occluded and homogeneous regions by estimating an uncertainty map and filtering the inaccurate correspondences. Patch2Pix [58] replaces pixel-level matches with patch-level match proposals and later refines them by regression layers. LoFTR [39]
establishes accurate semi-dense matches with linear transformers in a coarse-to-fine manner. For COTR, the dense matching result is generated by interpolating the results of sufficiently many sparse queries. Like COTR, our method can also produce dense matching results via interpolation.
2.0.3 Functional methods.
COTR is the first functional method in image matching, obtaining matches with a functional correspondence-finding architecture. Given a pair of images and the coordinates of one query, COTR regresses the possible match in the other image via a transformer-based correspondence-finding network. Each query is processed independently, and dense correspondences are estimated by interpolating sparse correspondences using a Delaunay triangulation of the queries. However, being a recursive method, it becomes extremely time-consuming when many keypoints are queried. We mitigate this problem with an end-to-end design, which runs dozens of times faster than COTR and achieves comparable or superior performance.
3 Coarse-to-Fine Refinement Network
This section describes in detail the proposed end-to-end framework, which finds the correspondences for arbitrary queries within a single feed-forward pass given a pair of images.
3.1 Overall Pipeline

We show a schematic diagram of the overall pipeline of the proposed framework in Fig. 2. It mainly consists of a bottom-up multi-scale feature extraction pathway based on a CNN and a top-down coarse-to-fine prediction refinement pathway based on transformers. Given a pair of images, we first resize them to the same spatial resolution and feed them into the CNN backbone to obtain multi-scale features. After that, the collected multi-scale features are used along with the input queries to predict the correspondences in a coarse-to-fine, gradually refining manner in the top-down pathway. We also predict an uncertainty score w.r.t. each correspondence, representing how confident the network is of its prediction, which can be used to filter out outliers nearly for free. Since a large batch of queries may need to be processed in one feed-forward pass, we further introduce an adaptive query-clustering strategy to better balance efficiency and effectiveness. The following subsections describe the above-mentioned components in detail.
3.2 Efficient Feature Extraction
To obtain correspondence locations precisely, existing work usually crops image patches around potential matching regions and iteratively feeds them back into the network in a progressively zoomed-in manner. The main drawbacks of this practice are: 1) the input image is cropped and resized into patches multiple times with different zoom-in factors around each query position, and each generated patch is then fed into the network, which involves many redundant computations; 2) image patches for each query are cropped and processed by the network independently, which usually means serial processing and inefficient use of computational resources. We find that both shortcomings stem from cropping patches at different spatial levels directly on the image.
Considering the pyramid structure and translation invariance of modern CNNs, we alleviate the drawbacks mentioned above by deferring the cropping operation until after feature extraction. Specifically, we first obtain the multi-scale feature maps w.r.t. each input image in a single pass and then crop directly on the collected feature maps to get feature patches at any position and scale. Without loss of generality, we take the ResNet-50 [26] network as our backbone for multi-scale feature extraction. Following previous success in generating more powerful and representative features, we attach a pyramid pooling module (PPM) [57] on top of ResNet-50 to capture more global information. The output of the PPM and the side-outputs at the res1-4 stages of the ResNet-50 network are collected to build a hierarchical multi-scale feature integration structure. As shown in the left part of Fig. 2, to meet the needs of the subsequent top-down pathway, which has three refinement stages (i.e., coarse, middle, and fine), we combine the intermediate outputs of the {PPM, res4}, {res2-4}, and {res1-3} stages, respectively. The three integrated sets of features are then resized to the spatial resolutions required by the coarse, middle, and fine stages w.r.t. the input stitched image pair, respectively.
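To make the feature-level cropping concrete, the following is a minimal PyTorch sketch (not the authors' released code) of cropping fixed-size windows from a feature map that has already been extracted once for the whole image; the bilinear sampling via grid_sample and the window size are illustrative choices.

```python
import torch
import torch.nn.functional as F

def crop_feature_patches(feat, centers, win):
    """Crop win x win patches from a precomputed feature map around given centers.

    feat:    (B, C, H, W) feature map extracted once for the whole image.
    centers: (B, N, 2) patch centers as normalized (x, y) coordinates in [0, 1].
    win:     side length (in feature-map pixels) of the square crop window.
    Returns: (B * N, C, win, win) feature patches, re-batched for the next stage.
    """
    B, C, H, W = feat.shape
    N = centers.shape[1]

    # Build a win x win grid of pixel offsets around each center.
    offsets = torch.arange(win, device=feat.device, dtype=feat.dtype) - (win - 1) / 2.0
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")          # (win, win)

    cx = centers[..., 0] * (W - 1)                                    # (B, N)
    cy = centers[..., 1] * (H - 1)
    gx = cx[..., None, None] + dx                                     # (B, N, win, win)
    gy = cy[..., None, None] + dy

    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack((gx / (W - 1) * 2 - 1, gy / (H - 1) * 2 - 1), dim=-1)
    grid = grid.reshape(B, N * win, win, 2)

    patches = F.grid_sample(feat, grid, align_corners=True)          # (B, C, N*win, win)
    return (patches.reshape(B, C, N, win, win)
                   .permute(0, 2, 1, 3, 4)
                   .reshape(B * N, C, win, win))
```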
3.3 Coarse-to-Fine Prediction Refinement
The schematic pipeline of the coarse-to-fine prediction refinement process is shown in the middle part (light orange parallelogram background) of Fig. 2. Generally speaking, it consists of three successively connected stages: coarse, middle, and fine, respectively responsible for predicting correspondences with different precision. Each stage is a transformer building block of three encoders and three decoders. The coarse stage takes a set of queries of arbitrary size and the entire previously combined coarse-level features as input, and outputs the coarsely predicted correspondence set along with its uncertainty scores. Guided by the coordinates of the queries and the coarse predictions, we crop square patches centered at them on the previously collected middle-level features with a fixed window size, as illustrated by the dashed arrows in the middle left of Fig. 2. The cropped feature patches are then re-arranged into a new batch and, together with the input queries (normalized based on the cropping centers and window sizes), forwarded to the next stage, i.e., the middle stage. The fine stage follows the same procedure as the middle stage. After the fine stage, we obtain the final outputs of the proposed framework: the finest correspondences and their uncertainty scores.
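The stage-by-stage refinement can be sketched as below, under the assumption that each stage is an opaque callable returning coordinates and uncertainties, and reusing the crop helper sketched in Sec. 3.2; the window fractions and the way local predictions are mapped back to global coordinates are simplifications for illustration, not the paper's exact formulation.

```python
def coarse_to_fine_refine(stages, feats, queries):
    """Schematic top-down refinement pass (a simplified sketch, not the
    authors' exact pipeline).

    stages:  (coarse, middle, fine) callables. The coarse stage maps
             (features, queries) -> (correspondences, uncertainties) in global
             normalized coordinates; the refinement stages map
             (feature patches, local queries) -> (local predictions, uncertainties),
             with local coordinates expressed relative to the crop window.
    feats:   (coarse, middle, fine) feature maps of shape (B, C, H, W),
             extracted once by the CNN backbone.
    queries: (B, N, 2) query coordinates normalized to [0, 1] (torch tensor).
    """
    coarse, middle, fine = stages
    corr, unc = coarse(feats[0], queries)                          # coarse estimate

    # Illustrative window sizes (as a fraction of the image side) for the two
    # refinement stages; the paper instead fixes crop sizes in feature pixels.
    for stage, feat, win in ((middle, feats[1], 0.25), (fine, feats[2], 0.125)):
        win_px = max(2, int(round(win * feat.shape[-1])))
        patches = crop_feature_patches(feat, corr, win_px)         # (B*N, C, w, w)
        local_q = (queries - (corr - win / 2)) / win               # queries in crop frame
        delta, unc = stage(patches, local_q)                       # local prediction in [0, 1]
        corr = ((corr - win / 2) + delta * win).clamp(0.0, 1.0)    # back to global coords
    return corr, unc
```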
For each stage, the concatenated backbone features are supplemented with 2D linear positional encodings in sinusoidal format and flattened before being fed into the transformer encoder. During decoding, the coordinates of the queries, with positional encodings, attend to the output of the transformer encoder. Here, we disallow self-attention among the query points, since the queries are independent of each other. COTR computes cycle consistency errors and rejects matches whose errors exceed a specified threshold to filter out uncertain matches, which doubles its computational cost. To further accelerate our framework, we instead introduce an uncertainty estimation branch. Two FFN branches follow the outputs of the last transformer decoder: one regresses the relative coordinates of the correspondence for each query, and the other predicts the uncertainty of these coordinates. Unreliable predictions with high uncertainty are filtered out during inference.
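As a concrete illustration of the two FFN branches, here is a minimal sketch under assumed layer widths and activations (the paper does not spell out the exact head architecture in this section).

```python
import torch
import torch.nn as nn

class CorrespondenceHead(nn.Module):
    """Two small FFN branches on top of the last decoder output: one regresses
    the (relative) correspondence coordinates, the other predicts a per-query
    uncertainty score. Layer sizes here are assumptions for illustration."""

    def __init__(self, d_model=256, hidden=256):
        super().__init__()
        self.coord_ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2), nn.Sigmoid(),        # normalized (x, y)
        )
        self.unc_ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),                      # raw uncertainty logit
        )

    def forward(self, decoder_out):                    # (B, N, d_model)
        coords = self.coord_ffn(decoder_out)           # (B, N, 2)
        unc = self.unc_ffn(decoder_out).squeeze(-1)    # (B, N)
        return coords, unc
```

At inference, predictions whose uncertainty (after a sigmoid) exceeds a chosen threshold are simply discarded, which replaces COTR's second, role-switched forward pass.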
Having predicted the matches and their uncertainties at a given level, the loss for that level is calculated by:

(1)

where the ground-truth match coordinates of the queries and a level-specific threshold enter the loss, and the levels correspond to the coarse, middle, and fine stages. The per-level thresholds are fixed during training.
All three stages are supervised simultaneously during training. Specifically, the final loss is defined as the combination of the three per-level losses:

(2)
Experiments show that the middle- and fine-level supervision during training not only provides predictions for the corresponding stages but also gives distinctive back-propagation signals to the CNN backbone, which is beneficial to the coarse-level prediction as well. More details are provided in Sec. 4.5.
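Since Eq. (1) is not reproduced above, the sketch below shows one plausible instantiation consistent with the description: an L1 term on the coordinates plus a binary cross-entropy term whose target marks predictions whose error exceeds the level-specific threshold. The concrete loss form and threshold values used in the paper may differ, and the thresholds below are placeholders.

```python
import torch
import torch.nn.functional as F

def stage_loss(pred, unc_logit, gt, threshold):
    """One plausible per-level loss (the exact Eq. (1) is not reproduced here):
    an L1 term on the predicted coordinates plus a BCE term asking the
    uncertainty branch to flag predictions whose error exceeds the
    level-specific threshold.

    pred, gt:  (B, N, 2) predicted / ground-truth coordinates.
    unc_logit: (B, N) raw uncertainty outputs.
    threshold: scalar threshold for this level (coarse / middle / fine).
    """
    coord_loss = F.l1_loss(pred, gt)
    with torch.no_grad():
        err = (pred - gt).norm(dim=-1)                     # per-query error
        target = (err > threshold).float()                 # 1 = unreliable
    unc_loss = F.binary_cross_entropy_with_logits(unc_logit, target)
    return coord_loss + unc_loss

def total_loss(preds, uncs, gts, thresholds=(0.05, 0.02, 0.01)):
    """Eq. (2): all three stages are supervised jointly; here simply summed.
    The threshold values are placeholders, not the paper's settings."""
    return sum(stage_loss(p, u, g, t)
               for p, u, g, t in zip(preds, uncs, gts, thresholds))
```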
Algorithm 1: Adaptive Query-Clustering (inputs: K-means class number; distance threshold)

3.4 Adaptive Query-Clustering
The transformer structure is capable of processing many queries in one forward propagation. To improve efficiency, each patch should therefore contain as many queries as possible. A straightforward practice is to directly slice the input image pair into two sets of grids according to pre-defined window sizes and strides (usually, the stride is set equal to the corresponding window size). By densely coupling the patches between these two sets, any query-correspondence pair can be assigned to one of the patch pairs. We denote this way of point-to-patch assignment as GRID for simplicity. However, an inevitable drawback of such query-independent assignment strategies is that some matches will always lie near the patch borders, which usually yields sub-optimal matching results. We attribute this unsatisfying phenomenon to the lack of sufficient contextual information around the border area.
To achieve a better trade-off between efficiency and effectiveness, we propose an Adaptive Query-Clustering (AQC) algorithm to automatically and dynamically assign image patches to all query-correspondence pairs, as illustrated in Alg. 1. To demonstrate the superiority of AQC, we compare it with GRID in Sec. 4.5.3; experiments show that clustering with AQC gives better performance than GRID.
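Alg. 1 itself is not reproduced here; the snippet below is a simplified sketch of the clustering idea using scikit-learn's K-means, with the re-assignment of points that violate the distance threshold left abstract.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_query_clustering(points, k, dist_thresh):
    """Simplified sketch of the AQC idea (not a verbatim copy of Alg. 1):
    cluster point coordinates with K-means so that each cluster becomes one
    crop window, and mark points farther than `dist_thresh` from their
    cluster center for re-assignment (e.g., to another center that satisfies
    the threshold, or to a new cluster).

    points:      (N, 2) coordinates to be grouped into shared patches.
    k:           number of K-means clusters (patches per stage).
    dist_thresh: maximum allowed distance from a point to its patch center.
    """
    k = min(k, len(points))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    centers = km.cluster_centers_                        # (k, 2) patch centers
    labels = km.labels_                                  # cluster id per point
    dists = np.linalg.norm(points - centers[labels], axis=1)
    inside = dists <= dist_thresh                        # points served by their patch
    return centers, labels, inside                       # points outside need re-assignment
```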
3.5 Implementation Details
We implemented our model in PyTorch
[25]. The local feature CNN uses a modified version of ResNet-50 as its backbone, without pretraining. For the coarse-to-fine refinement modules, we use fixed crop window sizes for the middle and fine stages. For the AQC module, the K-means cluster number is fixed, and the distance threshold is set to 0.8 times the corresponding patch side during training and 0.6 times during inference. More details can be found in the supplementary material.

4 Experiments
We evaluate our method across several datasets. We do not retrain or fine-tune our model on any other dataset for a fair comparison. Experiments are arranged as follows:
-
We evaluate dense matching on the HPatches [1], KITTI [12], and ETH3D [36] datasets.
-
We evaluate the pose estimation task for sparse matching on the same scene as COTR from the MegaDepth [18] dataset.
-
For ablation studies, we evaluate the impact of each proposed contribution on the ETH3D dataset.
Method | AEPE | PCK-1px | PCK-3px | PCK-5px |
---|---|---|---|---|
LiteFlowNet [13] | 118.85 | 13.91 | - | 31.64 |
PWC-Net [38] | 96.14 | 13.14 | - | 37.14 |
GLU-Net [48] | 25.05 | 39.55 | 71.52 | 78.54 |
GLU-Net+GOCor [47] | 20.16 | 41.55 | - | 81.43 |
COTR+Interp (reproduce) [14] | 3.83 | 36.64 | 76.65 | 87.42 |
ECO-TR+Interp | 2.67 | 40.19 | 79.89 | 90.24 |
COTR (reproduce) [14] | 3.62 | 38.72 | 80.90 | 90.85 |
ECO-TR | 2.52 | 38.02 | 79.79 | 90.71 |
4.1 Results on HPatches Dataset
We first evaluate ECO-TR on the HPatches dataset for the dense matching task. The HPatches dataset contains 116 scenes, with 57 scenes changing in viewpoint and 59 scenes changing in lighting conditions. Following COTR, we evaluate the dense matching results on the viewpoint-changing split. As with GLU-Net, we resize the reference image during our evaluation, whereas COTR was evaluated at the original scale in its experiments, so the PCK values are not directly comparable; we therefore reproduce COTR's numbers under the same settings. For each method, we find a maximum of 1,000 matches per pair, then interpolate correspondences on the Delaunay triangulation of the queries to obtain dense correspondences. The results are reported in Table 1.
For the dense matching task, ECO-TR achieves better performance than COTR under all metrics. For sparse matching accuracy, COTR is slightly better than ECO-TR in terms of PCK. We attribute this gap to the difference in image resolution: COTR can exploit high-resolution images via four recursive zoom-ins, which is impractical for ECO-TR due to its end-to-end architecture. Nevertheless, the average end-point error (AEPE) of ECO-TR is lower than that of COTR.
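The "+Interp." variants densify sparse matches by interpolating on a Delaunay triangulation of the queries; a minimal sketch using SciPy (whose LinearNDInterpolator triangulates the input points internally) is given below, with the caveat that the exact interpolation code of the compared methods may differ.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def densify_matches(src_pts, dst_pts, h, w):
    """Turn sparse matches into a dense correspondence map by linear
    interpolation over the Delaunay triangulation of the query points.

    src_pts, dst_pts: (N, 2) matched (x, y) coordinates in the two images.
    h, w:             size of the dense map to produce.
    Returns:          (h, w, 2) dense correspondence map (NaN outside the hull).
    """
    interp = LinearNDInterpolator(src_pts, dst_pts)      # triangulates src_pts
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))     # query every pixel
    return interp(np.stack([gx, gy], axis=-1))           # (h, w, 2)
```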
Method | KITTI-2012 AEPE | KITTI-2012 Fl. [%] | KITTI-2015 AEPE | KITTI-2015 Fl. [%]
---|---|---|---|---
LiteFlowNet [13] | 4.00 | 17.47 | 10.39 | 28.50 |
PWC-Net [38] | 4.14 | 20.28 | 10.35 | 33.67 |
DGC-Net [23] | 8.50 | 32.28 | 14.97 | 50.98 |
GLU-Net [48] | 3.34 | 18.93 | 9.79 | 37.52 |
RAFT [42] | - | - | 5.04 | 17.8 |
GLU-Net+GOCor [47] | 2.68 | 15.43 | 6.68 | 27.57 |
PDC-Net [49] | 2.08 | 7.98 | 5.22 | 15.13 |
COTR + Interp. [14] | 1.47 | 8.79 | 3.65 | 13.65
ECO-TR + Interp. | 1.46 | 6.64 | 3.16 | 12.10 |
COTR [14] | 1.15 | 6.98 | 2.06 | 9.14
ECO-TR | 0.96 | 3.77 | 1.40 | 6.39 |
4.2 Results on KITTI Dataset
We use the KITTI dataset to evaluate the performance of our method under real road scenes. KITTI2012 dataset contains static scenes only, while the KITTI2015 dataset has more challenging dynamic scenes. Following [42, 47, 14]
, we use the training split, which provides ground-truth camera intrinsics, poses, and depth maps collected by LiDAR. All of the above-mentioned methods were trained on other datasets and evaluated on this training split. In line with previous works [23, 48, 47, 14], we employ the Average End-Point Error (AEPE) and the percentage of optical flow outliers (Fl) as evaluation metrics; here, a prediction is counted as an inlier when its end-point error is below 3 pixels or below a fixed fraction of the ground-truth flow magnitude. As with COTR, we sample the same number of points for a fair comparison.

As shown in Table 2, our method outperforms all others on these two datasets. For example, our method achieves AEPEs of 0.96 and 1.40 on KITTI-2012 and KITTI-2015, respectively, both lower than those of COTR. The interpolated results are slightly worse than the sparse results, yet still better than the other dense methods by a large margin, including PDC-Net, which also estimates dense correspondences and excludes unreliable matches. Qualitative examples on the KITTI dataset are illustrated in Fig. 4.
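For reference, the two metrics can be computed as follows; the 5% relative component of the outlier criterion is the standard KITTI convention and is an assumption here, since the exact definition is garbled in the extraction above.

```python
import numpy as np

def aepe_and_fl(pred_flow, gt_flow, valid, px_thresh=3.0, rel_thresh=0.05):
    """Average end-point error and Fl outlier percentage, as commonly defined
    for KITTI.

    pred_flow, gt_flow: (H, W, 2) flow fields.
    valid:              (H, W) boolean mask of pixels with ground truth.
    """
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)[valid]
    mag = np.linalg.norm(gt_flow, axis=-1)[valid]
    aepe = epe.mean()
    outlier = (epe > px_thresh) & (epe > rel_thresh * mag)   # standard Fl definition
    return aepe, 100.0 * outlier.mean()
```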
Fig. 4 (qualitative results on KITTI): (a) Input image | (b) COTR | (c) ECO-TR | (d) COTR | (e) ECO-TR
4.3 Results on ETH3D Dataset
The ETH3D dataset contains ten image sequences of indoor and outdoor scenes and provides ground-truth sparse correspondences under different frame intervals. Following COTR, we report the performance of our method on pairs with seven different intervals (rates), from 3 to 15. The results in Table 3 show that our approach outperforms all competitors at all rates, especially when matching pairs with large geometric transformations, i.e., pairs with a higher rate.
Method | AEPE, rate=3 | AEPE, rate=5 | AEPE, rate=7 | AEPE, rate=9 | AEPE, rate=11 | AEPE, rate=13 | AEPE, rate=15
---|---|---|---|---|---|---|---
LiteFlowNet [13] | 1.66 | 2.58 | 6.05 | 12.95 | 29.67 | 52.41 | 74.96 |
PWC-Net [38] | 1.75 | 2.10 | 3.21 | 5.59 | 14.35 | 27.49 | 43.41 |
DGC-Net [23] | 2.49 | 3.28 | 4.18 | 5.35 | 6.78 | 9.02 | 12.23 |
GLU-Net [48] | 1.98 | 2.54 | 3.49 | 4.24 | 5.61 | 7.55 | 10.78 |
COTR+Interp. [14] | 1.71 | 1.92 | 2.16 | 2.47 | 2.85 | 3.23 | 3.76 |
ECO-TR+Interp. | 1.52 | 1.70 | 1.87 | 2.06 | 2.21 | 2.44 | 2.69 |
COTR [14] | 1.66 | 1.82 | 1.97 | 2.13 | 2.27 | 2.41 | 2.61 |
ECO-TR | 1.48 | 1.61 | 1.72 | 1.81 | 1.89 | 1.97 | 2.06 |
4.4 Results on Megadepth Dataset
MegaDepth [18] images exhibit extreme viewpoint and appearance variations. The image poses are generated via structure-from-motion and multi-view stereo (MVS), and can be used as ground truth during evaluation. We choose St. Paul's Cathedral as our test scene and sample 900 pairs of images that share commonly visible regions. The mean average accuracy (mAA) at the 5° and 10° error thresholds is reported, where the error is defined as the maximum of the angular errors in rotation and translation. For COTR, we follow the strategy used in its paper and evaluate the performance under different numbers of matches. For ECO-TR, we first estimate the scale of the buildings in each pair: we sample sparse points in one image as queries and predict their correspondences with the coarse stage of ECO-TR, then crop the original images to obtain patches covering the regions shared by the two images. We resize the cropped patches, feed them to the model again, take random points in one image as queries, and keep matches with low uncertainty in the other image. To further improve performance, a cycle consistency check is applied. To compare performance under the same number of matches, we drop some matches randomly. For a fair comparison, all settings except the matching method are fixed for both methods. The results in Table 4 show that ECO-TR gives comparable performance while being significantly faster than COTR. Qualitative examples on MegaDepth are illustrated in Fig. 5.
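The cycle consistency check mentioned above can be sketched as follows, with match_fn standing in for a forward pass of the matcher in a given direction (a hypothetical interface, not the actual API).

```python
import numpy as np

def cycle_consistency_filter(match_fn, queries, thresh):
    """Map queries from image A to image B, map the results back to A, and
    keep only matches whose round-trip error is below `thresh`.

    match_fn: callable (points, direction) -> (N, 2) matched points; a
              stand-in for one forward pass of the matcher.
    queries:  (N, 2) query coordinates in image A.
    Returns:  boolean mask of matches that pass the check.
    """
    forward = match_fn(queries, "A->B")                  # (N, 2) in image B
    backward = match_fn(forward, "B->A")                 # (N, 2) back in image A
    err = np.linalg.norm(backward - queries, axis=1)
    return err < thresh
```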
Method \ #Matches | N=2048 | | N=1024 | | N=512 | | N=300 | | N=100 |
---|---|---|---|---|---|---|---|---|---|---
| @5° | @10° | @5° | @10° | @5° | @10° | @5° | @10° | @5° | @10° |
COTR | 0.443 | 0.660 | 0.448 | 0.665 | 0.434 | 0.650 | 0.434 | 0.654 | 0.410 | 0.626 |
ECO-TR | 0.453 | 0.661 | 0.452 | 0.664 | 0.447 | 0.656 | 0.430 | 0.652 | 0.418 | 0.636 |
4.5 Ablation Studies
In this section, we conduct several ablation experiments on the ETH3D dataset to analyze the efficiency and effectiveness of our method. More ablations on the KITTI dataset are provided in the supplementary material.
4.5.1 Analysis of inference time.
Table 5 reports the time cost of each component of ECO-TR. Table 6 further compares the runtimes of the corresponding components of ECO-TR and COTR under similar GPU memory budgets (about 8192 MB). As can be seen, all components of ECO-TR are more efficient than COTR's, and the end-to-end framework (pre- and post-processing handled in a single end-to-end pass) contributes most to the efficiency.
#points | pre- and post-process | backbone | transformer (coarse) | transformer (middle) | transformer (fine)
---|---|---|---|---|---
0.1k | 0.036 | 0.064 | 0.012 | 0.120 | 0.081 |
10k | 0.037 | 0.062 | 0.026 | 0.480 | 1.740 |
4.5.2 Analysis of multistage zoom-ins.
First, we analyze the effect of the multistage zoom-in architecture. As shown in Table 7, we evaluate ECO-TR without middle- and fine-stage inference, which leads to substantially worse results. Adding middle-stage inference improves the results but is still less effective than the full three-stage version, showing that the three-stage refinement design is essential for good performance. Furthermore, instead of training with the supervision of all three branches, we detach the middle- and fine-stage branches during training. The results show that this leads to worse performance, which indicates that deeply supervised models learn more distinctive features and thus perform better.
Method | #points | backbone | transformer | pre- and post-process | sum |
---|---|---|---|---|---|
COTR | 0.1k | 0.67 | 3.74 | 1.03 | 5.44 |
ECO-TR | 0.1k | 0.06 | 0.21 | 0.04 | 0.31 |
COTR | 10k | 92.55 | 60.71 | 280.27 | 433.53 |
ECO-TR | 10k | 0.06 | 2.24 | 0.05 | 2.35 |
4.5.3 Analysis of clustering method.
We test the performance of our pipeline with the different clustering methods described in Sec. 3.4. GRID and AQC are evaluated under the same distance threshold for a fair comparison; their results are provided in Table 7. The results show that our Adaptive Query-Clustering yields better performance than GRID clustering, and the gap between the two strategies gradually widens as the difficulty of the test pairs increases.
4.5.4 Analysis of transformer type.
We replace the full-attention transformer blocks in our middle- and fine-stage models with the linear substitution [16] used in LoFTR, and report the corresponding results in Table 7. Compared with the full-attention results, the AEPE increases only slightly (on the order of 0.01-0.02), while still remaining better than the other methods in Table 3 by a large margin. Furthermore, the average inference time of ECO-TR is reduced by 20 percent when the linear transformer is applied, at the cost of this slight degradation in performance. This shows that our pipeline has the potential to be further accelerated at a small cost.
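For completeness, the linear attention of [16] replaces the softmax with a kernel feature map so that the cost grows linearly with sequence length; a minimal sketch of the substituted attention operation (not of the full middle-/fine-stage blocks) is given below.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention in the style of [16] (as used by LoFTR): attention is
    approximated with the kernel feature map phi(x) = elu(x) + 1, giving cost
    linear in the sequence length.

    q, k, v: (B, L, D) query / key / value tensors.
    """
    q = F.elu(q) + 1.0                                   # phi(Q)
    k = F.elu(k) + 1.0                                   # phi(K)
    kv = torch.einsum("bld,ble->bde", k, v)              # sum_l phi(k_l) v_l^T
    z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + eps)
    return torch.einsum("bld,bde,bl->ble", q, kv, z)     # normalized output
```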
4.5.5 Analysis of outlier filtering method.
We compare the effectiveness of the uncertainty-based outlier filtering algorithm in Table 7 by running ECO-TR with different filtering strategies: one variant employs the cycle consistency check as a filter, and another employs uncertainty estimation as a filter. The results show that filtering by uncertainty estimation gives better performance than filtering by the cycle consistency check. Additionally, a third variant employs uncertainty estimation and the cycle consistency check together; combining the two strategies yields the best performance.
AEPE | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
rate=3 | 5.21 | 5.63 | 2.47 | 1.53 | 1.53 | 1.64 | 1.53 | 1.55 | 1.53 | 1.48 | 1.48 |
rate=9 | 7.17 | 7.50 | 3.09 | 2.11 | 2.11 | 2.32 | 2.11 | 2.12 | 2.00 | 1.82 | 1.81 |
rate=15 | 9.19 | 9.53 | 3.83 | 2.72 | 2.72 | 3.10 | 2.72 | 2.74 | 2.45 | 2.08 | 2.06 |
5 Conclusions
This paper introduces an efficient coarse-to-fine transformer-based network for local feature matching. The main improvements are threefold: 1) we propose an efficient coarse-to-fine network structure that fully utilizes information from different layers and can be trained end-to-end; 2) we design an adaptive query-clustering (AQC) module that gathers similar query points into the same patch and achieves a better balance between efficiency and effectiveness; 3) we propose an uncertainty-based outlier detection module to filter out queries without correspondences. Our method significantly improves the speed of functional matching and achieves comparable or better performance on both sparse and dense matching tasks.
5.0.1 Limitations
The main limitation is that training ECO-TR requires a large amount of GPU computing resources. In addition, the simple interpolation and refinement techniques limit the performance of dense estimates. We leave these issues for future work.
5.0.2 Acknowledgments
This work was supported by the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), the Guangdong Basic and Applied Basic Research Foundation (No. 2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002).
References
- [1] (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5173–5182. Cited by: item 1.
- [2] (2019) Key.net: keypoint detection by handcrafted and learned cnn filters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5836–5844. Cited by: §2.0.1.
- [3] (2017) Gms: grid-based motion statistics for fast, ultra-robust feature correspondence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4181–4190. Cited by: §2.0.1.
- [4] (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §1.
- [5] (2014) Fast and accurate image matching with cascade hashing for 3d reconstruction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–8. Cited by: §1.
- [6] (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236. Cited by: §2.0.1.
- [7] (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1.
- [8] (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1.
- [9] (2019) D2-net: a trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 8092–8101. Cited by: §2.0.1.
- [10] (2019) Beyond cartesian representations for local descriptors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 253–262. Cited by: §2.0.1.
- [11] (2019) A performance evaluation of local features for image-based 3d reconstruction. IEEE Transactions on Image Processing 28 (10), pp. 4774–4789. Cited by: §1.
- [12] (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: item 1.
- [13] (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8981–8989. Cited by: Table 1, Table 2, Table 3.
- [14] (2021) Cotr: correspondence transformer for matching across images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6207–6217. Cited by: Figure 1, §1, §2.0.1, §4.2, Table 1, Table 2, Table 3.
- [15] (2021) Image matching across wide baselines: from paper to practice. International Journal of Computer Vision 129 (2), pp. 517–547. Cited by: §1.
- [16] (2020) Transformers are rnns: fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. Cited by: §4.5.4.
- [17] (2020) Dual-resolution correspondence networks. Advances in Neural Information Processing Systems 33, pp. 17346–17357. Cited by: §2.0.2.
- [18] (2018) Megadepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041–2050. Cited by: item 2, §4.4.
- [19] (2010) Sift flow: dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 978–994. Cited by: §1.
- [20] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.0.1.
- [21] (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §1.
- [22] (2020) Aslfeat: learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6589–6598. Cited by: §2.0.1.
- [23] (2019) Dgc-net: dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1034–1042. Cited by: Table 2, Table 3.
- [24] (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. Advances in neural information processing systems 30. Cited by: §2.0.1.
- [25] (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §3.5.
- [26] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §3.2.
- [27] (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195. Cited by: §2.0.1.
- [28] (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. In European conference on computer vision, pp. 605–621. Cited by: §2.0.2.
- [29] (2018) Neighbourhood consensus networks. Advances in neural information processing systems 31. Cited by: §2.0.2.
- [30] (2005) Fusing points and lines for high performance tracking. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Vol. 2, pp. 1508–1515. Cited by: §1.
- [31] (2006) Machine learning for high-speed corner detection. In European conference on computer vision, pp. 430–443. Cited by: §2.0.1.
- [32] (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: §1.
- [33] (2020) Superglue: learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947. Cited by: §2.0.1.
- [34] (2012) Improving image-based localization by active correspondence search. In European conference on computer vision, pp. 752–765. Cited by: §1.
- [35] (2017) Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1822–1830. Cited by: §2.0.1.
- [36] (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3260–3269. Cited by: item 1.
- [37] (2014) A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision 106 (2), pp. 115–137. Cited by: §1.
- [38] (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8934–8943. Cited by: Table 1, Table 2, Table 3.
- [39] (2021) LoFTR: detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8922–8931. Cited by: §2.0.2.
- [40] (2020) Acne: attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11286–11295. Cited by: §2.0.1.
- [41] (2016) City-scale localization for cameras with known vertical direction. IEEE transactions on pattern analysis and machine intelligence 39 (7), pp. 1455–1461. Cited by: §1.
- [42] (2020) Raft: recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pp. 402–419. Cited by: §4.2, Table 2.
- [43] (2020) D2d: keypoint extraction with describe to detect approach. In Proceedings of the Asian Conference on Computer Vision, Cited by: §2.0.1.
- [44] (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 661–669. Cited by: §2.0.1.
- [45] (2019) Sosnet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §2.0.1.
- [46] (2018) Semantic match consistency for long-term visual localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 383–399. Cited by: §1.
- [47] (2020) GOCor: bringing globally optimized correspondence volumes into your neural network. Advances in Neural Information Processing Systems 33, pp. 14278–14290. Cited by: §2.0.2, §4.2, Table 1, Table 2.
- [48] (2020) GLU-net: global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6258–6268. Cited by: §2.0.2, Table 1, Table 2, Table 3.
- [49] (2021) Learning accurate dense correspondences and when to trust them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5714–5724. Cited by: §2.0.2, Table 2.
- [50] (2020) DISK: learning local features with policy gradient. Advances in Neural Information Processing Systems 33, pp. 14254–14265. Cited by: §2.0.1.
- [51] (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5038–5047. Cited by: §1.
- [52] (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1.
- [53] (2013) DeepFlow: large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pp. 1385–1392. Cited by: §1.
- [54] (2018) Learning to find good correspondences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2666–2674. Cited by: §2.0.1.
- [55] (2019) Learning two-view correspondences and geometry using order-aware network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5845–5854. Cited by: §2.0.1.
- [56] (2019) Nm-net: mining reliable neighbors for robust feature correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 215–224. Cited by: §2.0.1.
- [57] (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §3.2.
- [58] (2021) Patch2pix: epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4669–4678. Cited by: §2.0.2.
- [59] (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858. Cited by: §1.