
ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement

by   Dongli Tan, et al.
Xiamen University

Modeling sparse and dense image matching within a unified functional correspondence model has recently attracted increasing research interest. However, existing efforts mainly focus on improving matching accuracy while ignoring efficiency, which is crucial for real-world applications. In this paper, we propose an efficient structure named Efficient Correspondence Transformer (ECO-TR), which finds correspondences in a coarse-to-fine manner and significantly improves the efficiency of the functional correspondence model. To achieve this, multiple transformer blocks are connected stage by stage to gradually refine the predicted coordinates on top of a shared multi-scale feature extraction network. Given a pair of images and arbitrary query coordinates, all correspondences are predicted within a single feed-forward pass. We further propose an adaptive query-clustering strategy and an uncertainty-based outlier detection module that work with the proposed framework for faster and better predictions. Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness over existing state-of-the-art methods.





1 Introduction

As a fundamental research direction in computer vision, finding the correspondences among pairs of images has been widely utilized in plenty of down-stream tasks, including optical flow estimation [19, 8, 53], visual localization [46, 32, 34], camera position calibration [15, 41], 3D reconstruction [5, 11], and visual tracking [30]. Given a pair of images, according to how the queries and correspondences are determined, the applications mentioned above can be generally categorized into sparse matching and dense matching. The former focuses on two sets of keypoints being sparsely and respectively extracted from both images and matched to minimize a pre-defined alignment error [21, 34, 15]; the latter treats all pixels in the first image as queries which are densely mapped to the other image for correspondences [19, 37, 59, 51].

The above two kinds of applications were studied independently for a long time, and various optimizations were designed separately. Recently, COTR [14] claims that these two applications can be naturally modeled within a unified framework since the only difference between the sparse and dense matching is the number of points to query. It proposes to recursively apply a transformer-based [4, 52, 7] model at multiple scales in a gradually zooming-in manner to obtain accurate correspondences. Though impressive performance has been achieved, its complex off-line pipeline and slow inference speed seriously limit its practicality in real-world applications.

Figure 1: Comparison of the inference time between the proposed ECO-TR and COTR [14]. The number of queries ranges from 100 to 10,000. The inference time of COTR increases linearly with the number of query points, while that of our method remains almost unchanged.

We argue that three main factors limit the efficiency and practicality of COTR. The first is the recursive zoom-in refinement framework, which must re-extract features for every local patch in the next matching round. With many queries, these patches are likely to overlap, which leads to many repeated and redundant calculations. The second is that COTR switches the roles of the queries and correspondences to filter out mismatched queries, which doubles the overall computation. The third is that the staged training strategy leads to unstable convergence, so training must be carefully fine-tuned.

Instead of sacrificing speed for performance, in this work, we present an efficient correspondence transformer network (ECO-TR), showing that both efficiency and effectiveness can be achieved within a single feed-forward pass. Specifically, we propose to complete the coarse-to-fine refinement of the found correspondences in a stage-by-stage manner. Our framework consists of a bottom-up convolutional neural network (CNN) for multi-scale feature extraction and several top-down transformer blocks corresponding to different matching accuracies. During the coarse-to-fine refinement process, rather than cropping image patches of different positions and sizes according to the coarsely predicted coordinates and recursively re-feeding them into the CNN to obtain the corresponding feature maps, we obtain the multi-scale feature maps w.r.t. the input image in one pass by taking advantage of the pyramid and translation invariance nature of modern CNNs, and crop directly on the collected feature maps. The proposed feature-level cropping method effectively avoids repeated calculations. As a result, the inference time of the model does not, to a certain extent, grow linearly with the number of query points.

To further improve the efficiency of our framework, an Adaptive Query-Clustering (AQC) module is proposed to gather similar queries into a cluster, which speeds up the inference. Moreover, we propose an uncertainty module to estimate the confidence of the predicted correspondences, which achieves good performance on outlier detection nearly for free. As illustrated in Table 1, our approach can process 1000 queries within one second on a single NVIDIA Tesla V100 GPU for a pair of images with size , which is around 40 times faster than COTR under the same conditions.

To evaluate the performance of the proposed approach, we report the results on multiple challenging datasets covering both sparse and dense correspondence finding tasks. Experimental results demonstrate that our method surpasses COTR in performance and speed by a large margin. In addition, we conduct extensive ablation experiments to better understand the impact of each component in our framework. The contributions are summarized below:

  • We propose a new coarse-to-fine framework for finding correspondences that can be applied to both sparse and dense matching tasks. Our method can be optimized end-to-end and evaluates an arbitrary number of queries within a single feed-forward pass.

  • We design an adaptive query-clustering strategy and an uncertainty-based outlier filtering module to achieve a better balance between efficiency and effectiveness.

  • Our method significantly outperforms the existing best-performing functional method in speed and still achieves comparable performance in sparse correspondence tasks and better in dense correspondence tasks.

2 Related Work

2.0.1 Sparse methods.

The most common paradigm for sparse image matching pipelines consists of three stages: keypoint detection, keypoint description, and feature matching. In the detection stage, a sparse set of repeatable and matchable keypoints, robust against viewpoint changes and different lighting conditions, is selected by detection methods [31, 35, 2, 50]. Then, the keypoints are described from patch-level or image-level input. Patch-based description methods [44, 24, 45, 10] take cropped patches as inputs and are usually trained by metric learning. Image-based description methods such as [9, 6, 27, 22, 43] take a full image as input and apply fully-convolutional neural networks [20] to generate dense descriptors. Methods of this kind usually combine the detector and descriptor, which share the same backbone in training and yield better performance on both tasks.

Traditional feature matching methods use Nearest Neighbor (NN) search to find potential matches. Recently, many approaches [3, 54, 55, 56, 40] filter outliers by heuristics or learned priors. SuperGlue [33] uses an attentional graph neural network and an optimal transport method to obtain state-of-the-art performance on sparse matching tasks. Unlike the methods mentioned above, COTR [14] takes keypoints as queries and recursively refines their matches in the other image with a correspondence neural network. Following COTR, we design an end-to-end model to accelerate this scheme.

2.0.2 Dense methods.

The main purpose of dense matching is to estimate the optical flow. NC-Net [29] represents all keypoints and possible correspondences as a 4D correspondence volume restricted to low-resolution images. Sparse NC-Net [28] applies sparse correlation layers instead of all possible correspondences to mitigate this restriction, whereby higher resolution images can be tackled. DRC-Net [17] reduces the computational cost and promotes performance by using coarse-resolution and fine-resolution feature maps of different layers. GLU-Net [48] finds pixel-wise correspondences by global and local features extracted from images with different resolutions. GOCor [47] disambiguates features in similar regions via an improved feature correlation layer. PDC-Net [49] excludes incorrect dense matches in occluded and homogeneous regions by estimating an uncertainty map and filtering the inaccurate correspondences. Patch2Pix [58] replaces pixel-level matches with patch-level match proposals and later refines them by regression layers. LoFTR [39] establishes accurate semi-dense matches with linear transformers in a coarse-to-fine manner. For COTR, the dense matching result is generated by interpolating the results of sufficiently many sparse queries. Like COTR, our method can also produce dense matching results by interpolation.

2.0.3 Functional methods.

COTR is the first method that obtains matches with a functional correspondence-finding architecture. Given a pair of images and the coordinates of one query, COTR regresses the possible match in the other image via a transformer-based correspondence finding network. Each query is processed independently, and dense correspondences are estimated by interpolating sparse correspondences using Delaunay triangulation of the queries. However, being a recursive method, it becomes extremely time-consuming when many keypoints are queried. We mitigate this problem in an end-to-end manner: our method runs dozens of times faster than COTR and achieves comparable or superior performance.
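To make the interpolation step concrete, the following is a minimal sketch of how a dense correspondence could be interpolated inside a single triangle of a Delaunay triangulation using barycentric weights. The function names are hypothetical, the triangulation is assumed to be given, and this is an illustration rather than the code used by COTR or ECO-TR.

```python
def barycentric_weights(p, a, b, c):
    """Return the barycentric weights of point p w.r.t. triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def interpolate_match(p, tri_queries, tri_matches):
    """Interpolate the match of pixel p from the matches of its triangle's vertices.

    tri_queries: three query points (triangle vertices) in the first image.
    tri_matches: their predicted correspondences in the second image.
    """
    wa, wb, wc = barycentric_weights(p, *tri_queries)
    (ax, ay), (bx, by), (cx, cy) = tri_matches
    return (wa * ax + wb * bx + wc * cx, wa * ay + wb * by + wc * cy)
```

For a pixel inside the triangle all three weights are non-negative, so the interpolated match is a convex combination of the three sparse matches.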

3 Coarse-to-Fine Refinement Network

This section describes in detail the proposed end-to-end framework, which finds the correspondences for arbitrary queries given a pair of images within a single feed-forward pass.

3.1 Overall Pipeline

Figure 2: The pipeline of our proposed framework. It takes a pair of images (bottom-left) and a set of queries ({Q}) of arbitrary numbers as input and outputs the correspondences ({}) and uncertainty scores ({}), respectively. The right part illustrates the feature patches cropping process during each prediction refinement stage.

We show a schematic diagram of the overall pipeline of the proposed framework in Fig. 2. It mainly consists of a bottom-up multi-scale feature extraction pathway based on the CNN and a top-down coarse-to-fine prediction refinement pathway based on the transformer. Given a pair of images and , we first resize them to the same spatial resolution (, is the ‘batch’ dimension) and feed them into the CNN backbone to obtain multi-scale features. After that, the collected multi-scale features are used along with the input queries to predict the correspondences in a coarse-to-fine, gradually refining manner in the top-down pathway. We also predict an uncertainty score w.r.t. each correspondence representing how confident the network is of its prediction, which can be utilized to filter out the outliers nearly for free. Since a large number of queries may need to be processed in one feed-forward pass, we further introduce an adaptive query-clustering strategy to better balance efficiency and effectiveness. The following subsections describe the above-mentioned components in detail.

3.2 Efficient Feature Extraction

To obtain correspondence locations precisely, existing work usually crops image patches around potential matching regions and iteratively feeds them back into the network in a progressively enlarged manner. The main drawbacks of the aforementioned practice are: 1) the input image is cropped and resized into patches multiple times with different zoom-in factors around each query position. Each patch generated is then fed into the network, which involves many redundant computations. 2) Image patches for each query are cropped and processed by the network independently, which usually means serial processing and inefficient use of computational resources. We found that the main cause of these two shortcomings can be attributed to the setting of cropping patches at different spatial levels directly on the image.

Considering the pyramid and translation invariance nature of modern CNNs, we propose to alleviate the drawbacks mentioned above by deferring the cropping operation after the feature extraction step. Specifically, we first obtain the multi-scale feature maps w.r.t. each input image at one time and then directly crop on the collected feature maps to get feature patches at any position and scale. We take the ResNet-50[26] network as our backbone for multi-scale feature extraction without loss of generality. Following the previous success in generating more powerful and representative features, we attach a pyramid pooling module (PPM) [57] to capture more global information at the top of ResNet-50. The output of PPM and the side-outputs at res1-4 stages of the ResNet-50 network are collected to build a hierarchical multi-scale feature integration structure. As shown in the left part of Fig. 2, to meet the needs of the subsequent top-down pathway which has three refinement stages (i.e., coarse, middle, and fine), we choose to combine the intermediate outputs of {PPM, res4}, {res2-4} and {res1-3} stages, respectively. The integrated three sets of features (denoted as ) are then resized to , , and spatial resolutions w.r.t. the input stitched images pair, respectively.
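The idea of cropping on feature maps rather than on the image can be sketched as follows. This is a simplified illustration on a plain 2D list standing in for one channel of a feature map; the function name and the border-clamping behaviour are assumptions made for the example, not details given in the paper.

```python
def crop_feature_patch(feature_map, center, window):
    """Crop a window x window patch around `center` = (x, y) from a 2D feature map.

    The top-left corner is clamped so that the window stays inside the map
    (an assumption of this sketch). Returns the patch and its top-left corner.
    """
    height, width = len(feature_map), len(feature_map[0])
    if window > height or window > width:
        raise ValueError("window is larger than the feature map")
    cx, cy = center
    x0 = min(max(int(round(cx)) - window // 2, 0), width - window)
    y0 = min(max(int(round(cy)) - window // 2, 0), height - window)
    patch = [row[x0:x0 + window] for row in feature_map[y0:y0 + window]]
    return patch, (x0, y0)


# The feature map is computed once and then reused for every crop.
features = [[r * 10 + c for c in range(6)] for r in range(6)]
patch_a, origin_a = crop_feature_patch(features, (3, 3), 2)
patch_b, origin_b = crop_feature_patch(features, (0, 0), 2)
```

Because every crop reads from the same precomputed map, adding more queries adds slicing work but no extra backbone passes, which is the source of the savings described above.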

Figure 3: Illustration of the uncertainty estimation branch. Green and red points indicate matches with low uncertainties and high uncertainties, respectively. ECO-TR gives ambiguous predictions in textureless regions and the border area with high uncertainties.

3.3 Coarse-to-Fine Prediction Refinement

The schematic pipeline of the coarse-to-fine prediction refinement process is shown in the middle part (light orange parallelogram background) of Fig. 2. Generally speaking, it consists of three successively connected stages: coarse, middle, and fine, respectively responsible for predicting correspondences with different precision. Each stage is a transformer building block of three encoders and three decoders. The coarse stage () takes a set of queries of arbitrary numbers and the entire previously combined features as input. It outputs the coarsely predicted correspondences set along with their uncertainty scores. With the guidance of the coordinates in and , we crop square patches centered at them on the previously collected middle-level features with a fixed window size of , as illustrated by the dashed arrows in the middle left of Fig. 2. The cropped feature patches are then re-arranged into a new batch along with the input queries (normalized based on the cropping centers and window sizes) being forwarded to the next stage (i.e., the middle stage ()). The fine stage shares similar procedures with the middle stage. After the fine stage, we obtain the final outputs of the proposed framework: the finest correspondences and their uncertainty scores .
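The normalization of queries with respect to the cropping centers and window sizes can be illustrated with a small sketch. The helper names and the choice of a [0, 1] range inside the window are assumptions for illustration; the paper does not state the exact convention.

```python
def to_patch_coords(point, center, window):
    """Map image coordinates to [0, 1] coordinates inside a square window."""
    return tuple((p - (c - window / 2.0)) / window for p, c in zip(point, center))


def to_image_coords(local, center, window):
    """Inverse of to_patch_coords: map window coordinates back to the image."""
    return tuple(v * window + (c - window / 2.0) for v, c in zip(local, center))


# A query at (110, 90) inside a 40-pixel window centered at (100, 100).
local = to_patch_coords((110, 90), (100, 100), 40)
restored = to_image_coords(local, (100, 100), 40)
```

A stage would predict coordinates in the local frame of its patch, and the inverse mapping brings the refined prediction back to image coordinates before the next crop.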

For each stage, concatenated backbone features are supplemented by 2D linear positional encoding in the sinusoidal format and flattened before being fed into the transformer encoder. During the decoding stage, coordinates of queries with positional encoding attend to the output of the transformer encoder. Here, we disallow self-attention among the query points, since queries are independent of each other. COTR computes the cycle consistency errors and rejects matches whose errors are greater than a specified threshold to filter out uncertain matches, which doubles the computational cost. To further accelerate our framework, we introduce an uncertainty estimation branch. Two FFN branches follow the outputs of the last transformer decoder. One regresses the corresponding relative coordinates of each query, and the other predicts the uncertainties of these coordinates. Unreliable predictions with high uncertainties are filtered out during the inference stage.
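The inference-time filtering can be pictured with the short sketch below. The function name and the threshold parameter `max_uncertainty` are hypothetical; the paper does not give the threshold value used.

```python
def filter_by_uncertainty(matches, uncertainties, max_uncertainty):
    """Keep the matches whose predicted uncertainty does not exceed a threshold.

    Returns (index, match) tuples so the surviving queries can be recovered.
    """
    if len(matches) != len(uncertainties):
        raise ValueError("one uncertainty score is expected per match")
    kept = []
    for index, (match, score) in enumerate(zip(matches, uncertainties)):
        if score <= max_uncertainty:
            kept.append((index, match))
    return kept
```

Unlike a cycle consistency check, this filter needs no second forward pass, since the scores come from the same decoder output as the coordinates.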

Having predicted matches and their uncertainties of level , loss is calculated by:


where is ground truth matches coordinates of queries and is the threshold of level , where represents stages coarse, middle, and fine. We set , , during training.

All three stages are supervised during training at the same time. Specifically, the final loss is defined as


Experiments show that the mid- and fine-level supervision during training provides predictions for corresponding stages and gives distinctive back-propagation signals to the CNN backbone, which is beneficial to the prediction of coarse-level. More details are provided in Sec. 4.5.

Input: Coordinates of queries ; Matches of predicted by previous stage ; Iteration number ; K-means class number ; Distance threshold
Output: All patch pairs and corresponding matches in these patches
for i = to do
    Divide to clusters by K-means algorithm, and assign class labels to every pair in ;
    for each class do
        Set = all pairs labeled ;
        Set = the center coordinates of ;
        Set = the center coordinates of ;
        for each pair in do
            if or
                Set the class label of
        end for
        Crop patches centered at and and assign pairs labeled to these patches
    end for
    Set = all pairs labeled -1
end for
for each pair labeled in do
    Crop patches centered at and , and assign pair to these patches
end for
Algorithm 1 Adaptive Query-Clustering Algorithm

3.4 Adaptive Query-Clustering

The transformer structure is capable of processing many queries in one forward propagation. To improve efficiency, each patch should contain as many queries as possible. A straightforward practice is to directly slice the input image pair into two sets of grids according to the pre-defined window sizes and strides (usually, the stride is set equal to the corresponding window size). By densely coupling the patches between these two sets, any query-correspondence pair can be assigned to one of the patch pairs. We denote this point-to-patch assignment as GRID for simplicity. However, we observe that assignment strategies that are independent of the query-correspondence pairs have an inevitable drawback: some matches always fall near the patch borders, where they usually yield sub-optimal matching results. We attribute this to the lack of sufficient contextual information around the border area.
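The GRID assignment described above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes the stride equals the window size, as in the usual setting mentioned in the text.

```python
def grid_cell(point, window):
    """Index of the grid cell containing a point (stride equal to window size)."""
    x, y = point
    return (int(x // window), int(y // window))


def grid_assign(pairs, window):
    """Group (query, match) pairs by the pair of grid cells they fall into."""
    groups = {}
    for query, match in pairs:
        key = (grid_cell(query, window), grid_cell(match, window))
        groups.setdefault(key, []).append((query, match))
    return groups


pairs = [((10, 10), (70, 20)), ((63, 5), (90, 10)), ((70, 5), (65, 5))]
groups = grid_assign(pairs, 64)
```

In this toy input the query at (63, 5) sits one pixel from the right border of its cell, so most of its true neighbourhood lies in the adjacent patch. That is the border effect that motivates a data-dependent assignment.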

To achieve a better trade-off between efficiency and effectiveness, we propose an Adaptive Query-Clustering (AQC) algorithm to automatically and dynamically assign image patches to all query-correspondence pairs, as illustrated in Alg. 1. To demonstrate the superiority of AQC, we compare it with GRID in Sec. 4.5.3. Experiments show that clustering by AQC gives better performance than GRID.
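The following is a minimal, self-contained sketch of the clustering loop in the spirit of Alg. 1. Several choices are assumptions made for illustration and are not specified in the text: K-means is run on the concatenated (query, match) coordinates, a small deterministic K-means is used, and the distance to a cluster center is measured per axis (Chebyshev distance), since the patches are square. All function and parameter names are hypothetical.

```python
import math


def _mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(len(points[0])))


def kmeans(vectors, k, steps=10):
    """Tiny deterministic K-means; returns one cluster label per vector."""
    k = min(k, len(vectors))
    # Evenly spaced initial centers (an arbitrary, deterministic choice).
    centers = [vectors[(i * len(vectors)) // k] for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(steps):
        labels = [min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = _mean(members)
    return labels


def _too_far(point, center, max_dist):
    """True if the point is farther than max_dist from the center on any axis."""
    return max(abs(point[0] - center[0]), abs(point[1] - center[1])) > max_dist


def adaptive_query_clustering(pairs, iterations, k, max_dist):
    """Assign (query, match) pairs to patch pairs.

    Returns a list of (query_center, match_center, member_pairs) tuples.
    """
    remaining = list(pairs)
    patches = []
    for _ in range(iterations):
        if not remaining:
            break
        labels = kmeans([q + m for q, m in remaining], k)
        leftover = []
        for label in sorted(set(labels)):
            members = [p for p, lab in zip(remaining, labels) if lab == label]
            q_center = _mean([q for q, _ in members])
            m_center = _mean([m for _, m in members])
            kept = []
            for q, m in members:
                # Pairs too far from either center wait for the next iteration.
                if _too_far(q, q_center, max_dist) or _too_far(m, m_center, max_dist):
                    leftover.append((q, m))
                else:
                    kept.append((q, m))
            if kept:
                patches.append((q_center, m_center, kept))
        remaining = leftover
    # Pairs that never fit a cluster get patches centered on themselves.
    for q, m in remaining:
        patches.append((q, m, [(q, m)]))
    return patches
```

In this sketch an outlying pair can drag a cluster center away from the other members in the first iteration, so those members are deferred and re-clustered in the next one, which keeps the surviving pairs away from patch borders.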

3.5 Implementation Details

We implemented our model in PyTorch [25]. The local feature CNN uses a modified version of ResNet-50 as a backbone without pretraining. For coarse-to-fine refinement modules, we set the crop window size , . For the AQC module, we set , . The distance threshold is set to 0.8 times the corresponding side length of the patches during training and 0.6 times during inference. More details can be found in the supplementary material.

4 Experiments

We evaluate our method across several datasets. We do not retrain or fine-tune our model on any other dataset for a fair comparison. Experiments are arranged as follows:

  1. Dense matching tasks are evaluated on the HPatches [1], KITTI [12], and ETH3D [36] datasets. Following COTR’s evaluation protocol, we evaluate the results of sampled matches and interpolated dense optical flow.

  2. We evaluate the pose estimation task on the same scene as COTR from the MegaDepth [18] dataset for sparse matching.

  3. For ablation studies, we evaluate the impact of each proposed contribution using the ETH3D dataset.

| Method | AEPE | PCK-1px | PCK-3px | PCK-5px |
|---|---|---|---|---|
| LiteFlowNet [13] | 118.85 | 13.91 | - | 31.64 |
| PWC-Net [38] | 96.14 | 13.14 | - | 37.14 |
| GLU-Net [48] | 25.05 | 39.55 | 71.52 | 78.54 |
| GLU-Net+GOCor [47] | 20.16 | 41.55 | - | 81.43 |
| COTR+Interp (reproduce) [14] | 3.83 | 36.64 | 76.65 | 87.42 |
| ECO-TR+Interp | 2.67 | 40.19 | 79.89 | 90.24 |
| COTR (reproduce) [14] | 3.62 | 38.72 | 80.90 | 90.85 |
| ECO-TR | 2.52 | 38.02 | 79.79 | 90.71 |

Table 1: Quantitative results on HPatches. Average End Point Error (AEPE) and Percentage of Correct Keypoints (PCK) are reported here. For each method, different thresholds (1px, 3px and 5px) of PCK are used. For a fair comparison of PCK, we report the reproduced results of COTR under the same image size.

4.1 Results on HPatches Dataset

In the first experiment, we evaluate ECO-TR on the HPatches dataset for dense matching tasks. The HPatches dataset contains 116 scenes, with 57 scenes changing in viewpoint and 59 scenes changing in lighting conditions. Following COTR, we evaluate the dense matching results on the viewpoint-changing split. As in GLU-Net, we resize the reference image during our evaluation, whereas COTR is evaluated at the original scale in its experiments, so its reported PCK values are not directly comparable. Therefore, we reproduce the results of COTR under the same settings. For each method, we find a maximum of 1,000 matches from each pair. Then, we interpolate correspondences on the Delaunay triangulation map of the queries and get the dense correspondences. The results are reported in Table 1.

For the interpolated dense matching task, ECO-TR+Interp outperforms COTR+Interp on all metrics. For the sparse matches, COTR is slightly better than ECO-TR in PCK (e.g., 38.72 vs. 38.02 at 1px). We attribute this gap to the difference in image resolution: COTR can exploit high-resolution images via four recursive zoom-ins, which is unmanageable for ECO-TR due to its end-to-end architecture. The average end-point error (AEPE) of ECO-TR is nevertheless lower than that of COTR (2.52 vs. 3.62).
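For readers unfamiliar with the two metrics, a minimal sketch of how AEPE and PCK can be computed from predicted and ground-truth coordinates is given below. The function names are illustrative, and counting an error exactly equal to the threshold as correct is an assumption of this sketch.

```python
import math


def aepe(pred, gt):
    """Average end-point error: mean Euclidean distance between pred and gt."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def pck(pred, gt, threshold_px):
    """Percentage of predictions whose error is within threshold_px pixels."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return 100.0 * sum(e <= threshold_px for e in errors) / len(errors)
```

AEPE is sensitive to a few large errors, whereas PCK only counts how many predictions fall within a tolerance, which is why a method can lead on one metric and trail slightly on the other.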

| Method | KITTI-2012 AEPE | KITTI-2012 Fl.[%] | KITTI-2015 AEPE | KITTI-2015 Fl.[%] |
|---|---|---|---|---|
| LiteFlowNet [13] | 4.00 | 17.47 | 10.39 | 28.50 |
| PWC-Net [38] | 4.14 | 20.28 | 10.35 | 33.67 |
| DGC-Net [23] | 8.50 | 32.28 | 14.97 | 50.98 |
| GLU-Net [48] | 3.34 | 18.93 | 9.79 | 37.52 |
| RAFT [42] | - | - | 5.04 | 17.8 |
| GLU-Net+GOCor [47] | 2.68 | 15.43 | 6.68 | 27.57 |
| PDC-Net [49] | 2.08 | 7.98 | 5.22 | 15.13 |
| COTR + Interp. [14] | 1.47 | 8.79 | 3.65 | 13.65 |
| ECO-TR + Interp. | 1.46 | 6.64 | 3.16 | 12.10 |
| COTR [14] | 1.15 | 6.98 | 2.06 | 9.14 |
| ECO-TR | 0.96 | 3.77 | 1.40 | 6.39 |

Table 2: Quantitative results on KITTI. Average End Point Error (AEPE) and flow outlier ratio (Fl) on KITTI-2012 and KITTI-2015 are reported. The COTR results were evaluated with the DenseMatching tools provided by the authors of GLU-Net.

4.2 Results on KITTI Dataset

We use the KITTI dataset to evaluate the performance of our method on real road scenes. The KITTI-2012 dataset contains static scenes only, while the KITTI-2015 dataset has more challenging dynamic scenes. Following [42, 47, 14], we use the training split, which has ground truth of camera intrinsics, poses, and depth maps collected by LIDAR. All the above-mentioned methods were trained on other datasets and evaluated on this training split. In line with previous works [23, 48, 47, 14], we employ the Average End-Point Error (AEPE) and the percentage of optical flow outliers (Fl) as evaluation metrics. Here, inliers are defined as points whose end-point error is below 3 pixels or below 5% of the ground-truth flow magnitude. As with COTR, we sample points for a fair comparison.

As shown in Table 2, our method outperforms all others on these two datasets. For example, our method achieves an AEPE of 0.96 and 1.40 on KITTI-2012 and KITTI-2015, respectively, improving on the 1.15 and 2.06 obtained by COTR. The interpolated results are slightly worse than the sparse results, yet still better than the other dense methods by a large margin, including PDC-Net, which also estimates dense correspondences and excludes unreliable matches. Qualitative examples on the KITTI dataset are illustrated in Fig. 4.
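The Fl metric can be sketched as follows, using the standard KITTI convention that a flow vector is an outlier when its end-point error is at least 3 pixels and at least 5% of the ground-truth flow magnitude. The function name and argument layout are illustrative.

```python
import math


def fl_outlier_ratio(pred_flow, gt_flow, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of flow vectors that are outliers under the KITTI criterion."""
    outliers = 0
    for p, g in zip(pred_flow, gt_flow):
        epe = math.dist(p, g)   # end-point error of this flow vector
        magnitude = math.hypot(g[0], g[1])  # length of the ground-truth flow
        if epe >= abs_thresh and epe >= rel_thresh * magnitude:
            outliers += 1
    return 100.0 * outliers / len(gt_flow)
```

The relative term keeps large but proportionally accurate flows from being counted as outliers: a 4-pixel error on a 100-pixel flow is an inlier, while the same error on a 10-pixel flow is not.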

(a) Input image (b) COTR (c) ECO-TR (d) COTR (e) ECO-TR
Figure 4: Qualitative results on KITTI. We show the error map (Columns (b, c)) and optical flow (Columns (d, e)) for three pairs from KITTI-2015. ECO-TR provides clearer outlines of moving objects.

4.3 Results on ETH3D Dataset

ETH3D dataset contains ten image sequences of indoor and outdoor scenes and provides ground truth sparse correspondences under different frame intervals. Following COTR, we report the performance of our method under pairs with seven different intervals, from 3 to 15, respectively. The results in Table 3 show that our proposal outperforms other competitors under all rates, especially when matching pairs with large geometric transformations, i.e. pairs with a higher rate.

| Method | rate=3 | rate=5 | rate=7 | rate=9 | rate=11 | rate=13 | rate=15 |
|---|---|---|---|---|---|---|---|
| LiteFlowNet [13] | 1.66 | 2.58 | 6.05 | 12.95 | 29.67 | 52.41 | 74.96 |
| PWC-Net [38] | 1.75 | 2.10 | 3.21 | 5.59 | 14.35 | 27.49 | 43.41 |
| DGC-Net [23] | 2.49 | 3.28 | 4.18 | 5.35 | 6.78 | 9.02 | 12.23 |
| GLU-Net [48] | 1.98 | 2.54 | 3.49 | 4.24 | 5.61 | 7.55 | 10.78 |
| COTR+Interp. [14] | 1.71 | 1.92 | 2.16 | 2.47 | 2.85 | 3.23 | 3.76 |
| ECO-TR+Interp. | 1.52 | 1.70 | 1.87 | 2.06 | 2.21 | 2.44 | 2.69 |
| COTR [14] | 1.66 | 1.82 | 1.97 | 2.13 | 2.27 | 2.41 | 2.61 |
| ECO-TR | 1.48 | 1.61 | 1.72 | 1.81 | 1.89 | 1.97 | 2.06 |

Table 3: Results on ETH3D. We evaluated our method over pairs of ETH3D images sampled from different frame intervals. Average End Point Error (AEPE) is reported for each rate. Lower AEPE is better.

4.4 Results on MegaDepth Dataset

Figure 5: Qualitative results on MegaDepth dataset. We set queries on left images and obtain matches in right images. We estimate the relative pose between image pairs and the angular errors in rotation and translation are reported in the upper-left corner. The number of inliers evaluated by epipolar distance is shown as well.

MegaDepth [18] images show extreme viewpoint and appearance variations. The poses of images are generated via structure-from-motion and multi-view stereo (MVS) methods, which can be used as ground truth during evaluation. We choose St. Paul’s Cathedral as our test scene. We sample 900 pairs of images that have commonly visible regions. Mean average accuracy (mAA) at 5° and 10° error thresholds is reported here, where the error is defined as the maximum of the angular errors in rotation and translation. For COTR, we follow the strategy used in its paper and evaluate the performance under different numbers of matches. For ECO-TR, we first estimate the scale of buildings in each pair. We sample sparse points in one image as queries and predict their correspondences by coarse-stage ECO-TR. Then, we crop the original images and obtain patches that cover the regions shared by the two images. We resize the cropped patches and feed them to the model again, take random points in one image as queries, and find reliable matches with low uncertainty in the other image. To further improve performance, a cycle consistency check is applied here. To compare the performance under the same number of matches, we drop some matches randomly. For a fair comparison, all settings other than the matching method are fixed for the two methods. The results in Table 4 show that ECO-TR gives comparable performance, while our pipeline is significantly faster than COTR. Qualitative examples on MegaDepth are illustrated in Fig. 5.
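As an illustration of the metric, the sketch below computes mAA from per-pair pose errors. It assumes the accuracy is averaged over integer-degree thresholds from 1° up to the stated threshold, which is a common convention in pose benchmarks but is not spelled out in the text; the function names are hypothetical.

```python
def pose_error(rotation_error_deg, translation_error_deg):
    """Pose error of one pair: the larger of the two angular errors."""
    return max(rotation_error_deg, translation_error_deg)


def mean_average_accuracy(errors_deg, max_threshold_deg):
    """Mean of the accuracies at thresholds 1, 2, ..., max_threshold_deg degrees."""
    thresholds = range(1, max_threshold_deg + 1)
    accuracies = [
        sum(e < t for e in errors_deg) / len(errors_deg) for t in thresholds
    ]
    return sum(accuracies) / len(accuracies)


errors = [pose_error(0.5, 0.2), pose_error(1.0, 2.5), pose_error(7.0, 3.0), pose_error(4.0, 20.0)]
maa_5 = mean_average_accuracy(errors, 5)
maa_10 = mean_average_accuracy(errors, 10)
```

Averaging over thresholds rewards pairs that are solved with small errors more than pairs that only just pass the largest threshold.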

| Method | N=2048 @5 | N=2048 @10 | N=1024 @5 | N=1024 @10 | N=512 @5 | N=512 @10 | N=300 @5 | N=300 @10 | N=100 @5 | N=100 @10 |
|---|---|---|---|---|---|---|---|---|---|---|
| COTR | 0.443 | 0.660 | 0.448 | 0.665 | 0.434 | 0.650 | 0.434 | 0.654 | 0.410 | 0.626 |
| ECO-TR | 0.453 | 0.661 | 0.452 | 0.664 | 0.447 | 0.656 | 0.430 | 0.652 | 0.418 | 0.636 |

Table 4: Quantitative results on MegaDepth. We evaluated our method against COTR with different numbers of predicted matches (N). Mean average accuracy (mAA) at 5° and 10° error thresholds is reported here.

4.5 Ablation Studies

In this section, we conduct several ablation experiments on the ETH3D dataset to discuss the efficiency and effectiveness of our method. More ablations on the KITTI dataset are provided in the supplementary material.

4.5.1 Analysis of inference time.

Table 5 reports the time cost of each component of ECO-TR. Table 6 further compares the runtimes of the corresponding components between ECO-TR and COTR with similar GPU memory costs (about 8192MB). As can be seen, all components in ECO-TR are more efficient than COTR’s, where the end-to-end framework (pre- and post-process in an end-to-end manner) contributes most to the efficiency.

| #points | pre- and post-process | backbone | | | |
|---|---|---|---|---|---|
| 0.1k | 0.036 | 0.064 | 0.012 | 0.120 | 0.081 |
| 10k | 0.037 | 0.062 | 0.026 | 0.480 | 1.740 |

Table 5: Detailed inference time (sec.) of each component.

4.5.2 Analysis of multistage zoom-ins.

First, we analyze the effect of the multistage zoom-in architecture. As shown in Table 7, we evaluate ECO-TR without middle- and fine-stage inference (). It leads to substantially worse results. Adding middle-stage inference improves the results () but is still less effective than the three-stage version (). We can see that the three-stage refinement design is essential for good performance. Furthermore, instead of training with the supervision of all three branches, we detach the middle-stage and fine-stage branches during training (). This leads to worse results, which indicates that deeply supervised models learn more distinctive features and yield better performance.

| Method | #points | backbone | transformer | pre- and post-process | sum |
|---|---|---|---|---|---|
| COTR | 0.1k | 0.67 | 3.74 | 1.03 | 5.44 |
| ECO-TR | 0.1k | 0.06 | 0.21 | 0.04 | 0.31 |
| COTR | 10k | 92.55 | 60.71 | 280.27 | 433.53 |
| ECO-TR | 10k | 0.06 | 2.24 | 0.05 | 2.35 |

Table 6: Detailed comparison of inference time (sec.) with COTR.

4.5.3 Analysis of clustering method.

We test the performance of our pipeline with different clustering methods mentioned in Sec. 3.4. GRID and AQC are evaluated under the same distance threshold for a fair comparison. The results of AQC and GRID clustering are provided in and in  Table 7, respectively. The result shows that our Adaptive Query-Clustering yields better performance than GRID clustering. The gap between the two strategies gradually increases as the difficulty of test pairs increases.

4.5.4 Analysis of transformer type.

We replace the full attention transformer block in our middle- and fine-stage model with the linear substitution [16] used in LoFTR, and the corresponding results are shown in . Compared with full attention result in , the AEPE of pairs with rate=3 increases by 0.02 and pairs with rate=3,5 increase by 0.01, while still better than other methods in Table 3 by a large margin. Furthermore, the average inference time of ECO-TR is reduced by 20 percent when the linear transformer is applied, but this generally leads to a slight degradation in performance. It shows our pipeline has the potential to be further accelerated at a small cost.

4.5.5 Analysis of outlier filtering method.

We compare the effectiveness of the uncertainty-based outlier filtering algorithm in Table 7 by running ECO-TR with different filtering strategies: one variant uses a cycle consistency check as the filter, and another uses uncertainty estimation. Filtering by uncertainty estimation gives better performance than filtering by the cycle consistency check. A third variant applies uncertainty estimation and the cycle consistency check together, and combining the two strategies further improves the performance of ECO-TR.
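
The three filtering strategies compared above can be sketched as boolean masks over the predicted correspondences. The snippet is a minimal illustration under stated assumptions: the back-projected coordinates and per-query uncertainties are taken as given inputs, and the function names and the thresholds `max_error` and `max_uncertainty` are hypothetical, not values from our experiments.

```python
import numpy as np


def cycle_consistency_mask(queries, back_projected, max_error):
    """Keep a query if mapping it forward and then back lands close to where it started."""
    queries = np.asarray(queries, dtype=float)
    back_projected = np.asarray(back_projected, dtype=float)
    cycle_error = np.linalg.norm(queries - back_projected, axis=1)
    return cycle_error <= max_error


def uncertainty_mask(uncertainty, max_uncertainty):
    """Keep a query if its predicted uncertainty is below a threshold."""
    return np.asarray(uncertainty, dtype=float) <= max_uncertainty


def combined_mask(queries, back_projected, uncertainty, max_error, max_uncertainty):
    """Keep a query only if it passes both checks."""
    return (cycle_consistency_mask(queries, back_projected, max_error)
            & uncertainty_mask(uncertainty, max_uncertainty))


queries = [[10, 10], [20, 20], [30, 30]]
back = [[10.5, 10], [25, 20], [30, 30.2]]   # hypothetical round-trip results
sigma = [0.1, 0.2, 0.9]                     # hypothetical uncertainties
print(combined_mask(queries, back, sigma, max_error=1.0, max_uncertainty=0.5))
```

Note that the cycle consistency check needs a second inference pass in the reverse direction to obtain the back-projected points, whereas the uncertainty mask only uses quantities from the forward pass.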

rate=3 5.21 5.63 2.47 1.53 1.53 1.64 1.53 1.55 1.53 1.48 1.48
rate=9 7.17 7.50 3.09 2.11 2.11 2.32 2.11 2.12 2.00 1.82 1.81
rate=15 9.19 9.53 3.83 2.72 2.72 3.10 2.72 2.74 2.45 2.08 2.06
Table 7: Ablations on ETH3D. We evaluate the impact of each component of our method over image pairs from the ETH3D dataset. Pairs are sampled from 3 different frame intervals, which indicate varying difficulty levels. Average End Point Error (AEPE) is reported here. Lower AEPE is better.
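
For reference, AEPE is the mean Euclidean distance between predicted and ground-truth correspondences. The sketch below shows the computation; the function name `aepe` and the optional `valid` mask are illustrative assumptions rather than part of our evaluation code.

```python
import numpy as np


def aepe(pred, gt, valid=None):
    """Average End Point Error: mean L2 distance between predicted and true points.

    pred, gt: arrays of shape (N, 2) holding pixel coordinates.
    valid: optional boolean array of shape (N,) selecting points that have
           ground truth (hypothetical convenience argument).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=-1)
    if valid is not None:
        errors = errors[np.asarray(valid, dtype=bool)]
    return float(errors.mean())


# Two correspondences with end point errors of 0 and 5 pixels.
print(aepe([[0, 0], [3, 4]], [[0, 0], [0, 0]]))
```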

5 Conclusions

This paper introduces an efficient coarse-to-fine transformer-based network for local feature matching. Our contributions are threefold: 1) We propose an efficient coarse-to-fine network structure that fully utilizes the information from different layers and can be trained as a whole. 2) We design an adaptive query-clustering (AQC) module that gathers similar query points into the same patch and achieves a better balance between efficiency and effectiveness. 3) We propose an uncertainty-based outlier detection module to filter out queries without correspondences. Our method significantly improves the speed of functional matching and achieves comparable or better performance on both sparse and dense matching tasks.

5.0.1 Limitations

The main limitation is that training ECO-TR requires a large amount of GPU computing resources. In addition, the simple interpolation and refinement techniques limit the performance of dense estimates. We leave these issues for future work.

5.0.2 Acknowledgments

This work was supported by the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), the Guangdong Basic and Applied Basic Research Foundation (No. 2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No. 2021J01002).

References


  • [1] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5173–5182. Cited by: item 1.
  • [2] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk (2019) Key.net: keypoint detection by handcrafted and learned cnn filters. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5836–5844. Cited by: §2.0.1.
  • [3] J. Bian, W. Lin, Y. Matsushita, S. Yeung, T. Nguyen, and M. Cheng (2017) Gms: grid-based motion statistics for fast, ultra-robust feature correspondence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4181–4190. Cited by: §2.0.1.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §1.
  • [5] J. Cheng, C. Leng, J. Wu, H. Cui, and H. Lu (2014) Fast and accurate image matching with cascade hashing for 3d reconstruction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–8. Cited by: §1.
  • [6] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) Superpoint: self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236. Cited by: §2.0.1.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1.
  • [8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1.
  • [9] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-net: a trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 8092–8101. Cited by: §2.0.1.
  • [10] P. Ebel, A. Mishchuk, K. M. Yi, P. Fua, and E. Trulls (2019) Beyond cartesian representations for local descriptors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 253–262. Cited by: §2.0.1.
  • [11] B. Fan, Q. Kong, X. Wang, Z. Wang, S. Xiang, C. Pan, and P. Fua (2019) A performance evaluation of local features for image-based 3d reconstruction. IEEE Transactions on Image Processing 28 (10), pp. 4774–4789. Cited by: §1.
  • [12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: item 1.
  • [13] T. Hui, X. Tang, and C. C. Loy (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8981–8989. Cited by: Table 1, Table 2, Table 3.
  • [14] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi (2021) Cotr: correspondence transformer for matching across images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6207–6217. Cited by: Figure 1, §1, §2.0.1, §4.2, Table 1, Table 2, Table 3.
  • [15] Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and E. Trulls (2021) Image matching across wide baselines: from paper to practice. International Journal of Computer Vision 129 (2), pp. 517–547. Cited by: §1.
  • [16] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are rnns: fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. Cited by: §4.5.4.
  • [17] X. Li, K. Han, S. Li, and V. Prisacariu (2020) Dual-resolution correspondence networks. Advances in Neural Information Processing Systems 33, pp. 17346–17357. Cited by: §2.0.2.
  • [18] Z. Li and N. Snavely (2018) Megadepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041–2050. Cited by: item 2, §4.4.
  • [19] C. Liu, J. Yuen, and A. Torralba (2010) Sift flow: dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 978–994. Cited by: §1.
  • [20] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.0.1.
  • [21] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §1.
  • [22] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2020) Aslfeat: learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6589–6598. Cited by: §2.0.1.
  • [23] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, and J. Kannala (2019) Dgc-net: dense geometric correspondence network. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1034–1042. Cited by: Table 2, Table 3.
  • [24] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. Advances in neural information processing systems 30. Cited by: §2.0.1.
  • [25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §3.5.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §3.2.
  • [27] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger (2019) R2D2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195. Cited by: §2.0.1.
  • [28] I. Rocco, R. Arandjelović, and J. Sivic (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. In European conference on computer vision, pp. 605–621. Cited by: §2.0.2.
  • [29] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. Advances in neural information processing systems 31. Cited by: §2.0.2.
  • [30] E. Rosten and T. Drummond (2005) Fusing points and lines for high performance tracking. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Vol. 2, pp. 1508–1515. Cited by: §1.
  • [31] E. Rosten and T. Drummond (2006) Machine learning for high-speed corner detection. In European conference on computer vision, pp. 430–443. Cited by: §2.0.1.
  • [32] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: §1.
  • [33] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) Superglue: learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947. Cited by: §2.0.1.
  • [34] T. Sattler, B. Leibe, and L. Kobbelt (2012) Improving image-based localization by active correspondence search. In European conference on computer vision, pp. 752–765. Cited by: §1.
  • [35] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys (2017) Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1822–1830. Cited by: §2.0.1.
  • [36] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3260–3269. Cited by: item 1.
  • [37] D. Sun, S. Roth, and M. J. Black (2014) A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision 106 (2), pp. 115–137. Cited by: §1.
  • [38] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8934–8943. Cited by: Table 1, Table 2, Table 3.
  • [39] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou (2021) LoFTR: detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8922–8931. Cited by: §2.0.2.
  • [40] W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, and K. M. Yi (2020) Acne: attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11286–11295. Cited by: §2.0.1.
  • [41] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson (2016) City-scale localization for cameras with known vertical direction. IEEE transactions on pattern analysis and machine intelligence 39 (7), pp. 1455–1461. Cited by: §1.
  • [42] Z. Teed and J. Deng (2020) Raft: recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pp. 402–419. Cited by: §4.2, Table 2.
  • [43] Y. Tian, V. Balntas, T. Ng, A. Barroso-Laguna, Y. Demiris, and K. Mikolajczyk (2020) D2d: keypoint extraction with describe to detect approach. In Proceedings of the Asian Conference on Computer Vision. Cited by: §2.0.1.
  • [44] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 661–669. Cited by: §2.0.1.
  • [45] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) Sosnet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §2.0.1.
  • [46] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl (2018) Semantic match consistency for long-term visual localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 383–399. Cited by: §1.
  • [47] P. Truong, M. Danelljan, L. V. Gool, and R. Timofte (2020) GOCor: bringing globally optimized correspondence volumes into your neural network. Advances in Neural Information Processing Systems 33, pp. 14278–14290. Cited by: §2.0.2, §4.2, Table 1, Table 2.
  • [48] P. Truong, M. Danelljan, and R. Timofte (2020) GLU-net: global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6258–6268. Cited by: §2.0.2, Table 1, Table 2, Table 3.
  • [49] P. Truong, M. Danelljan, L. Van Gool, and R. Timofte (2021) Learning accurate dense correspondences and when to trust them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5714–5724. Cited by: §2.0.2, Table 2.
  • [50] M. Tyszkiewicz, P. Fua, and E. Trulls (2020) DISK: learning local features with policy gradient. Advances in Neural Information Processing Systems 33, pp. 14254–14265. Cited by: §2.0.1.
  • [51] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5038–5047. Cited by: §1.
  • [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1.
  • [53] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pp. 1385–1392. Cited by: §1.
  • [54] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to find good correspondences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2666–2674. Cited by: §2.0.1.
  • [55] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan, and H. Liao (2019) Learning two-view correspondences and geometry using order-aware network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5845–5854. Cited by: §2.0.1.
  • [56] C. Zhao, Z. Cao, C. Li, X. Li, and J. Yang (2019) Nm-net: mining reliable neighbors for robust feature correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 215–224. Cited by: §2.0.1.
  • [57] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §3.2.
  • [58] Q. Zhou, T. Sattler, and L. Leal-Taixe (2021) Patch2pix: epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4669–4678. Cited by: §2.0.2.
  • [59] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858. Cited by: §1.