Establishing reliable correspondences across images is an essential step in recovering relative camera pose and scene structure for many computer vision tasks, such as Structure-from-Motion (SfM), Multi-view Stereo (MVS) and Simultaneous Localization and Mapping (SLAM). In classical pipelines, correspondences are obtained by nearest neighbour (NN) search over local feature descriptors and are usually further pruned by heuristics such as the mutual nearest neighbour check (MNN) and the ratio test (RT).
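These classical pruning heuristics are simple to state in code. Below is a minimal sketch of NN matching with MNN and RT pruning; the function name and the 0.8 ratio threshold are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def mutual_nn_ratio_match(desc_a, desc_b, ratio=0.8):
    """Putative matching: NN search pruned by the mutual nearest
    neighbour check (MNN) and Lowe's ratio test (RT)."""
    # Pairwise L2 distances between the two descriptor sets (Na x Nb).
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)  # best match in B for each keypoint of A
    nn_ba = d.argmin(axis=0)  # best match in A for each keypoint of B
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:          # MNN: keep only mutual nearest pairs
            continue
        d1, d2 = np.sort(d[i])[:2]
        if d1 < ratio * d2:        # RT: first NN clearly beats the second
            matches.append((i, int(j)))
    return matches
```

Both checks discard ambiguous correspondences: MNN removes one-sided assignments, while RT removes matches whose best and second-best candidates are nearly indistinguishable.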
A line of learning-based methods adopts PointNet-like permutation-equivariant networks to reject outliers from putative correspondences: the correspondence coordinates are fed into the network, which predicts an inlier likelihood score for each correspondence. Despite showing exciting results, these methods are limited in two aspects: 1) they operate on pre-matched correspondences, so they cannot recover matches beyond those found by vanilla nearest-neighbour matching; 2) they reason only about the geometric distribution of putative correspondences, neglecting the critical information in the original local visual descriptors.
Another thread of methods casts feature matching as a graph matching problem [36, 5, 50], which mitigates the limits of vanilla nearest neighbour correspondences. The representative work, SuperGlue, constructs densely-connected graphs between image keypoints to exchange messages about both visual and geometric context. However, the superior performance comes with high computation and memory cost, especially when applied to a large number of keypoints. As illustrated in Fig. 1(a), the message passing layer of SuperGlue first calculates similarity scores exhaustively between every pair of nodes, then gathers features to pass messages densely in the graph. Supposing the keypoint number is $N$ and the feature channel is $D$, this results in a computational complexity of $O(N^2D)$ for the matrix multiplications and a memory occupation of $O(N^2)$ to hold the attention matrix. The cost grows even more drastically for deeper graph networks. Given this, exploring a more efficient and compact message passing operation is of practical significance.
Beyond the efficiency bottleneck, it is debatable whether such a densely-connected graph introduces redundant or insignificant message exchange that may hinder the representation ability, especially in the context of feature matching, where the match set is highly outlier-contaminated and a large portion of keypoints are unrepeatable. As a result, most graph edges in SuperGlue tend to have near-zero strength, as reported in its original paper and also observed in our experiments. This phenomenon indicates that even a sparse graph is largely sufficient and suffers less distraction from unnecessary message exchange.
In this paper, we propose the Seeded Graph Matching Network (SGMNet) to mitigate the above limitations from two aspects. First, inspired by guided matching approaches [10, 43, 30], we design a Seeding Module that initializes the matching from a small set of reliable matches so as to more effectively identify inlier compatibility. Second, we draw inspiration from graph pooling operations [62, 56] and construct a Seeded Graph Neural Network whose graph structure is largely sparsified to lower the computation and reduce the redundancy. Specifically, three operations are proposed to construct our message passing blocks. As illustrated in Fig. 1(b), instead of densely attending to all features within/across images, the original keypoint features are first pooled by 1) Attentional Pooling through a small set of seed nodes, whose features are further enhanced by 2) Seed Filtering, and finally recovered back to the original keypoints through 3) Attentional Unpooling.
By using seeds as an attention bottleneck between images, the computational complexity for attention is reduced from $O(N^2)$ to $O(NS)$, where $S$ is the number of seeds. When $S \ll N$, i.e., the features are pooled into a small set of seeds, the actual computation is significantly cut down. We evaluate SGMNet on different tasks to demonstrate both its efficiency and effectiveness, and summarize our contributions as threefold:
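To make the savings concrete, a back-of-the-envelope comparison; the sizes N, D and S below are hypothetical values chosen only for illustration, not figures from the text.

```python
# Hypothetical sizes: N keypoints, D feature channels, S seeds.
N, D, S = 10_000, 128, 128

dense_ops = N * N * D   # O(N^2 D): every node attends to every node
seeded_ops = N * S * D  # O(N S D): messages routed through S seed nodes

print(f"dense : {dense_ops:.1e} multiply-adds")
print(f"seeded: {seeded_ops:.1e} multiply-adds")
print(f"speedup factor: {dense_ops / seeded_ops:.1f}")  # = N / S
```

The asymptotic speedup is simply $N/S$, so the gap between the two designs widens as the keypoint number grows.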
A seeding mechanism is introduced into the graph matching framework to effectively identify inlier compatibility.
A greatly sparsified graph neural network is designed that enables more efficient and cleaner message passing.
Competitive or higher accuracy is reported with remarkably improved efficiency over dense attentional GNNs. For example, when matching a large number of features, SGMNet runs up to an order of magnitude faster and consumes 50% less GPU memory than SuperGlue.
2 Related Works
Learnable image matching.
Integrating deep learning techniques into geometry-based computer vision tasks, such as MVS [58, 59, 64, 65] and visual localization [67, 46], has achieved inspiring success over the past few years. As the front-end component of geometry estimation, learnable image matching has also proven effective. Works in this area can be roughly divided into two categories. The first focuses on improving local descriptors [60, 41, 27, 26, 31, 49] and keypoints [53, 12, 28, 35, 11] with convolutional neural networks, while methods in the second category embed learning techniques into the matching strategy itself, including learnable outlier rejection [61, 44, 63] and robust estimators.
Recently, a new framework, SuperGlue, was proposed to integrate feature matching and outlier rejection into a single graph neural network (GNN). Though exhibiting promising results on different tasks, SuperGlue suffers from the excessive computational cost of its fully connected self-/cross-attention operation, especially when used to match a large number of features.
Compared with SuperGlue, our method shares the same advantages: feature matching and refinement are integrated into one single network, allowing for end-to-end training. However, our network significantly reduces the computation and memory cost thanks to its efficient attention block, which is specially designed for image matching.
Efficient transformer architectures. Transformer architectures have gained intensive interest during the past few years. Specifically, in the context of graph convolution, the attention mechanism in transformers can be used to pass messages across nodes in a graph structure [13, 54]. Despite its effectiveness in a wide range of tasks, one major concern about the transformer is its quadratic complexity w.r.t. the input size, which hinders its application when the number of query/key elements is large.
Recently, many efforts have been made to address the attention efficiency problem. In [24, 7], predefined sparse attention patterns are adopted to cut down memory/computation cost. In [47, 19], the attention span is pruned by learnable partitioning or grouping of input elements. In [55, 21], pooling operations are utilized to reduce the number of elements. Despite the inspiring progress, works in this area generally focus on self-attention, where keys and queries are derived from the same element set; their effectiveness in cross-attention, where keys and queries come from two unaligned sets, remains unstudied.
We draw inspiration from induced set attention (ISA), where a set of learned but fixed nodes is utilized as a bottleneck for efficient self-attention. To be compatible with cross-attention in graph matching, we instead establish attention between seed matches and the original point sets. The selected reliable correspondences align features on both sides and pass messages at low cost.
Graph matching. Graph matching, which aims to generate node correspondences across graphs, is a widely used model for feature matching in both the 2D [50, 5] and 3D [22, 2] domains. Mathematically formulated as a quadratic assignment problem (QAP), graph matching is NP-hard in its most general form and requires infeasibly expensive solvers for exact solutions. Despite the intractable nature of general graph matching, some methods [14, 15, 29] leverage partially pre-matched correspondences, also called seeds, to assist matching, and are referred to as Seeded Graph Matching (SGM). Inspired by SGM, our network integrates seeds into a GNN framework for compact message passing and robust matching.
We present the Seeded Graph Matching Network, abbreviated as SGMNet, for learning correspondences between two sets of keypoints and their associated visual descriptors. As illustrated in Fig. 2, our network produces matches in two stages: 1) a Seeding Module generates seeds to guide compact message passing, and 2) a Seeded Graph Neural Network leverages the seeds as a message bottleneck to update per-node features. In the following parts, an overview of our network architecture is introduced first, followed by a detailed description of each module.
Given a pair of images $A$ and $B$, with $N^A$ and $N^B$ keypoints and associated visual descriptors respectively, our objective is to establish reliable and robust keypoint matches across the two images.
We formulate the keypoint matching task as a graph matching problem, where the nodes are the keypoints of each image. Instead of applying a fully connected graph, we generate a set of keypoint correspondences, which we refer to as seed matches, to guide message passing across the nodes of the two graphs for subsequent matching. This critical difference allows for processing a large number of keypoints with significantly lower memory and computation cost.
The input to our network is the keypoints $\{p_i^A\}$ and $\{p_j^B\}$ in the two images, where each $p_i = (x_i, y_i)$ is the coordinate of a keypoint in its image and is associated with a $D$-dimensional visual descriptor $d_i$.
The positions of the keypoints are embedded into the high-dimensional feature space and combined with the descriptors by element-wise summation to form the initial representations.
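A minimal sketch of this initialization step; the two-layer MLP and its layer sizes are our assumption, as the text only specifies a high-dimensional position embedding summed element-wise with the descriptor.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128                                      # descriptor dimension (assumed)

def embed_position(p, w1, b1, w2, b2):
    """Tiny MLP lifting 2-D keypoint coordinates to D dimensions."""
    h = np.maximum(p @ w1 + b1, 0.0)         # hidden layer with ReLU
    return h @ w2 + b2

w1, b1 = rng.normal(size=(2, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, D)), np.zeros(D)

coords = rng.uniform(0.0, 1.0, size=(5, 2))  # 5 normalized keypoints
descs = rng.normal(size=(5, D))              # their visual descriptors

# Initial node representation: descriptor + position embedding.
x_init = descs + embed_position(coords, w1, b1, w2, b2)
print(x_init.shape)  # (5, 128)
```

The summation keeps the representation the same size as the descriptor, so the downstream attention blocks need no extra input channels for geometry.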
A Seeding Module follows to construct a set of seed matches. The initial representations and the seed matches are then fed into our Seeded Graph Neural Network, which jointly reasons about visual appearance similarity, neighbourhood consensus, and the guidance provided by the seed matches to update the keypoint features.
Inspired by the cascade refinement structure in OANet, a second seeding module, or reseeding module, is introduced to generate more accurate seeds based on the updated features, which helps further refine matches with another Seeded GNN. Final matches are then generated by formulating an assignment matrix.
3.2 Seeding Module
Proposing a set of seed matches lays the foundation for subsequent matching. For initial seeding, we adopt a simple yet effective strategy: we generate putative matches by nearest neighbour matching and use the inverse distance ratio, i.e., the ratio of the distances to the second and first nearest neighbours, as the reliability score. We adopt Non-Maximum Suppression (NMS) for better spatial coverage of the seeds. More details of the seeding module can be found in Appendix A.2. Despite potential noise in the initial seeds, our network remains robust thanks to the proposed weighted unpooling operation and reseeding strategy, which will be discussed later.
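One possible implementation of this strategy; applying NMS in image A only and the tie-breaking between equally reliable matches are our assumptions, as the text leaves these details to the appendix.

```python
import numpy as np

def propose_seeds(desc_a, desc_b, kpts_a, num_seeds, nms_radius):
    """Seeding sketch: NN matches scored by the inverse distance ratio,
    pruned by NMS in image A for spatial coverage."""
    d = np.linalg.norm(desc_a[:, None] - desc_b[None, :], axis=-1)
    nn = d.argmin(axis=1)                        # putative NN matches
    d_sorted = np.sort(d, axis=1)
    # Reliability: second-NN distance over first-NN distance (higher = better).
    score = d_sorted[:, 1] / (d_sorted[:, 0] + 1e-8)
    seeds = []
    for i in np.argsort(-score):                 # most reliable first
        p = kpts_a[i]
        if all(np.linalg.norm(p - kpts_a[s]) > nms_radius for s, _ in seeds):
            seeds.append((i, int(nn[i])))        # NMS: keep spread-out seeds
        if len(seeds) == num_seeds:
            break
    return seeds
```

Greedily accepting the most reliable match whose keypoint is outside the NMS radius of all accepted seeds trades a few high-score matches for better spatial coverage.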
The seeding module outputs index lists identifying the seed matches in each image.
3.3 Seeded Graph Neural Network (Seeded GNN)
The Seeded GNN takes the initial position-embedded features and leverages seed matches as attention bottlenecks for message passing. To this end, we adopt a pooling-processing-unpooling strategy in each processing unit: seed features first gather information from the full point set on each side through Attentional Pooling, are then processed by the Seed Filtering operation, and are finally recovered back to the original size by Attentional Unpooling. Our Seeded GNN is constructed by stacking 6 (3) such processing units for the initial (refinement) stage.
Weighted attentional aggregation. We first introduce a weighted version of attentional aggregation, which allows for sharper and cleaner data-dependent message passing.
In a $D$-dimensional feature space, for vectors to be updated $X \in \mathbb{R}^{n \times D}$, vectors to be attended to $Y \in \mathbb{R}^{m \times D}$, and a weight vector $w \in \mathbb{R}^{m}$, the weighted attentional aggregation Att is defined as
$$\mathrm{Att}(X, Y, w) = \sigma\!\left(QK^\top/\sqrt{D}\right)\mathrm{diag}(w)\,V,$$
where $\sigma$ denotes row-wise softmax, $Q$ is a linear projection of $X$, and $K$, $V$ are linear projections of $Y$; the output is the renewed representation of $X$.
By attentional aggregation, the elements in $X$ retrieve and aggregate information from the elements in $Y$. The weight vector $w$ adjusts the importance of each element in $Y$.
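In code, one reading of this operation; the exact placement of the weights is an assumption on our part, chosen so that the weighting scales the message each attended element contributes and an all-one weight vector recovers standard attention.

```python
import numpy as np

def softmax(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_att(x, y, w, proj_q, proj_k, proj_v):
    """Weighted attentional aggregation: rows of x gather messages from
    rows of y, with each element of y rescaled by the weight vector w."""
    q, k, v = x @ proj_q, y @ proj_k, y @ proj_v
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # row-wise softmax
    return (a * w) @ v                            # weight the elements of y

# Toy check with identity projections: zero weight silences an element.
x, y, eye = np.eye(2), np.eye(2), np.eye(2)
out = weighted_att(x, y, np.array([1.0, 0.0]), eye, eye, eye)
print(out[:, 1])  # the second element of y contributes nothing
```

Setting a weight to zero removes the corresponding element's message entirely, which is exactly the behavior used later to suppress false seed matches during unpooling.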
Attentional pooling. As the first step in message passing, seed matches retrieve context from the full keypoint set through attentional aggregation.
For the input features of each layer, the features of the seed matches are first retrieved by indexing with the seed index lists, i.e., gathering rows by index. The seed matches are then updated by retrieving context from the nodes of each graph through the weighted attentional aggregation, with an all-one weight vector so that no weights are applied.
A multilayer perceptron follows to fuse the seed features from the two sides, which are concatenated along the row dimension.
The outputs, which encode both the visual and positional context of each graph and information from the seed matches themselves, are fed into the subsequent operations.
Seed filtering. We propose the seed filtering operation to (1) conduct intra-/inter-graph communication between seed matches and (2) suppress the influence of outlier seed matches. More specifically, intra-/inter-graph attentional aggregation is applied to the input seed correspondence features.
In addition, a context normalization branch is used to predict an inlier likelihood score for each seed correspondence, which serves as the weighting score for the seed features in the later unpooling stage,
where CN denotes a lightweight stack of context normalization blocks. The detailed structure of the CN branch can be found in Appendix A.2.
The outputs of seed filtering are the filtered seed features and the inlier scores of the seed matches.
Attentional unpooling. After the message exchange between seed matches and inlier score prediction, an inlier-score-weighted attentional aggregation is adopted to broadcast the pooled contexts to every keypoint in each graph, which we refer to as attentional unpooling.
Taking the filtered seed features, the inlier scores, and the current keypoint features as input, attentional unpooling outputs the updated keypoint features for both images.
Applying the inlier scores to the aggregation process suppresses information broadcast from false seed matches and results in cleaner feature updates, which contributes to the robustness of our network w.r.t. seeding noise (Appendix D.2, Appendix Fig. 10).
Assignment matrix formulation. After all processing units, the updated features are used to construct the assignment matrix. The Sinkhorn algorithm is applied to the correlation matrix of the features, augmented with a dustbin channel, to produce the final assignment matrix.
Given the keypoint features after all processing blocks, we compute the correlation matrix between the two feature sets, append a dustbin row and column holding a learnable parameter, and run Sinkhorn normalization to obtain the assignment matrix. We derive the final matches from the assignment matrix with a confidence threshold to remove outliers.
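A minimal log-domain sketch of this step, assuming the dustbin augmentation popularized by SuperGlue: a single score alpha fills the extra row and column, and the marginals let the dustbin absorb all unmatched points.

```python
import numpy as np

def logsumexp(m, axis):
    mx = m.max(axis=axis, keepdims=True)
    return (mx + np.log(np.exp(m - mx).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbin(corr, alpha=1.0, iters=100):
    """Run Sinkhorn on a correlation matrix augmented with a dustbin."""
    na, nb = corr.shape
    z = np.full((na + 1, nb + 1), alpha)      # dustbin row/column = alpha
    z[:na, :nb] = corr
    log_mu = np.log(np.r_[np.ones(na), nb])   # row marginals
    log_nu = np.log(np.r_[np.ones(nb), na])   # column marginals
    u, v = np.zeros(na + 1), np.zeros(nb + 1)
    for _ in range(iters):                    # alternate row/column scaling
        u = log_mu - logsumexp(z + v[None, :], axis=1)
        v = log_nu - logsumexp(z + u[:, None], axis=0)
    return np.exp(z + u[:, None] + v[None, :])  # assignment matrix
```

With a confidently diagonal correlation matrix, the assignment mass concentrates on the diagonal; matches whose assignment score falls below the confidence threshold (0.2 in our experiments) would then be discarded.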
Although the Seeded GNN based on initial seeding exhibits a strong capability to identify the underlying matches, a second seeding module using the updated features provides even cleaner and richer seeds to further improve the performance. Thus, we adopt a reseeding module. Different from initial seeding, where NN matches and ratio scores of the raw descriptors are used, the reseeding module employs the assignment matrix of the updated features to regenerate seeds. More specifically, matches with the highest score in both their row and their column are selected as candidates, among which the top-k are taken as new seeds and fed into a second Seeded GNN for refinement. More details can be found in Appendix A.2.
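The candidate selection can be sketched as follows; this is a minimal reading of the description, and the assignment matrix here is assumed to exclude the dustbin row and column.

```python
import numpy as np

def reseed(assignment, k):
    """Pick matches that maximize both their row and their column of the
    assignment matrix, then keep the top-k scored ones as new seeds."""
    row_best = assignment.argmax(axis=1)
    col_best = assignment.argmax(axis=0)
    cands = [(assignment[i, j], i, int(j))
             for i, j in enumerate(row_best) if col_best[j] == i]
    cands.sort(reverse=True)                  # highest scores first
    return [(i, j) for _, i, j in cands[:k]]

scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.0],
                   [0.3, 0.1, 0.05]])
print(reseed(scores, 2))  # [(0, 0), (1, 1)]
```

Requiring a match to be the maximum of both its row and its column is the assignment-matrix analogue of the mutual nearest neighbour check on raw descriptors.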
The seeding module only outputs seed indices, which require no gradient back-propagation, so our network is fully differentiable and can be trained end-to-end with supervision from the indices of ground truth matches and unmatchable points, where a point is regarded as unmatchable if there is no matchable point in the other image. From the assignment matrix for reseeding, the final assignment matrix, and the inlier scores of the processing units, we formulate our loss in two parts,
The first part is the cross-entropy loss for inlier/outlier binary classification in each processing unit, where a seed correspondence is labeled as an inlier if its epipolar distance is less than a threshold. A weight balances the two loss terms.
Results on the four FM-Bench subsets; each group of columns reports %Recall for fundamental matrix estimation and #Corrs (-m), the mean number of correct correspondences after (before) RANSAC:

| Feature | Matcher | %Recall | #Corrs (-m) | %Recall | #Corrs (-m) | %Recall | #Corrs (-m) | %Recall | #Corrs (-m) |
|---|---|---|---|---|---|---|---|---|---|
| RootSIFT | NN+RT | 52.9 | 92 (123) | 82.1 | 208 (287) | 61.9 | 365 (438) | 90.6 | 847 (928) |
| RootSIFT | OANet | 58.6 | 119 (167) | 84.7 | 219 (306) | 62.3 | 454 (396) | 89.0 | 773 (854) |
| RootSIFT | SuperGlue | 61.1 | 218 (466) | 86.8 | 382 (767) | 65.9 | 655 (1037) | 91.0 | 1261 (1746) |
| RootSIFT | SGMNet | 62.0 | 248 (524) | 85.9 | 397 (789) | 66.6 | 704 (1132) | 91.2 | 1097 (1506) |
| ContextDesc | NN+RT | 62.4 | 169 (277) | 85.5 | 222 (426) | 58.7 | 456 (625) | 90.6 | 1134 (1416) |
| ContextDesc | OANet | 65.3 | 187 (288) | 86.7 | 294 (425) | 53.2 | 295 (327) | 89.0 | 791 (907) |
| ContextDesc | SuperGlue | 67.0 | 260 (579) | 89.1 | 491 (695) | 60.1 | 408 (690) | 91.1 | 1401 (1897) |
| ContextDesc | SGMNet | 70.8 | 370 (616) | 89.9 | 514 (705) | 61.6 | 423 (705) | 90.3 | 1204 (1724) |
| SuperPoint | MNN | 34.5 | 152 (421) | 72.8 | 287 (717) | 56.7 | 280 (420) | 88.6 | 848 (1490) |
| SuperPoint | OANet | 62.9 | 186 (343) | 91.2 | 280 (477) | 61.4 | 332 (473) | 82.2 | 482 (736) |
| SuperPoint | SuperGlue | 68.8 | 287 (719) | 92.9 | 414 (987) | 59.1 | 512 (1038) | 88.7 | 957 (1777) |
| SuperPoint | SGMNet | 70.3 | 327 (829) | 93.2 | 450 (1098) | 65.8 | 666 (1315) | 86.3 | 954 (1851) |
3.6 Implementation Details
We train our network on the GL3D dataset, which covers both indoor and outdoor scenes, to obtain a general-purpose model. We sample keypoints and seeds during training, use the Adam optimizer for optimization, and set the inlier score weight in the loss to 250. We use 6/3 processing blocks for the initial/refinement stage, and the gradient flow between the two stages is blocked in early iterations (the first 140k iterations). We use 4-head attention in both the attentional pooling and unpooling operations. For all experiments, we use a confidence threshold of 0.2 to retain matches, and the seeding number is set as a function of the number of keypoints. More details, including training data generation and hyper-parameters, can be found in Appendix A.
In the following sections, we provide experimental results of our method on a wide range of tasks, as well as further analysis of its computation and memory efficiency.
4.1 Image Matching
The performance of our method is first evaluated on image matching, using three benchmarks for two-view pose estimation: the YFCC100M, FM-Bench and ScanNet datasets.
For YFCC100M, we follow the setting in OANet and choose 4 sequences for testing. FM-Bench comprises four subsets in different scenarios: KITTI for driving settings, TUM for indoor SLAM settings, and Tanks and Temples (T&T) and CPC for wide-baseline reconstruction tasks. ScanNet is a widely used indoor reconstruction dataset. Following SuperGlue, we use the 1500 pairs of its test set for evaluation.
Evaluation protocols. On the YFCC100M and ScanNet datasets, pose estimation is performed on the correspondences after RANSAC post-processing. We report 1) AUC [36, 63, 61] under different thresholds, computed from the angular differences between the ground truth and estimated vectors for both rotation and translation; 2) mean matching score (M.S.) [36, 11], the ratio of correct matches to the total keypoint number; and 3) mean precision (Prec.) [36, 11] of the generated matches. We detect up to a fixed keypoint budget for all features on YFCC100M, with separate budgets for SuperPoint and the other features on ScanNet.
On the FM-Bench dataset, we estimate the fundamental matrix for each evaluated pair with RANSAC post-processing, and use the normalized symmetric geometry distance (SGD) [66, 3], as originally defined in the FM-Bench paper, to measure the difference between the estimated fundamental matrix and the ground truth. An estimate is considered correct if its normalized SGD to the ground truth is lower than a threshold (the default value is used), and a fixed number of keypoints is detected for each test pair. Following the FM-Bench paper, we report: 1) the recall (%Recall) of fundamental matrix estimation; 2) the mean number of correct correspondences (#Corrs (-m)) after/before RANSAC.
| Feature | Matcher | AUC@5° | AUC@10° | AUC@20° | M.S. | Prec. |
|---|---|---|---|---|---|---|
| RootSIFT | NN + RT | 49.07 | 58.76 | 68.58 | 8.23 | 29.79 |
| ContextDesc | NN + RT | 57.90 | 68.47 | 78.35 | 9.39 | 59.72 |
| RootSIFT | NN + RT | 9.08 | 19.75 | 32.66 | 2.28 | 28.83 |
| ContextDesc | NN + RT | 11.07 | 23.52 | 37.66 | 5.29 | 28.71 |
Comparative methods. We compare our method with heuristic pruning strategies, the ratio test and MNN, and a variety of learning-based matching methods [63, 36, 45, 10, 61, 40]. These methods are applied to both handcrafted descriptors [25, 1] and learning-based local features [26, 11].
For a fair comparison, OANet, SuperGlue and SGMNet are all re-trained on the same sequences of GL3D, where 1k keypoints are sampled per image. Note that the official training code of SuperGlue is not available, and its public model (denoted as SuperGlue) is trained on the MegaDepth and Oxford-Paris datasets. Instead, we retrain SuperGlue on GL3D with a data selection criterion similar to that described in the original paper. This re-implementation achieves even better results on YFCC100M (Table 3) for RootSIFT than those reported in the original paper. However, there remains some performance gap when using SuperPoint, even though we have carefully tuned the training and enquired the authors about details. Nevertheless, we consider our re-implementation of SuperGlue faithful, so it can be fairly compared. We report results of both the official model and our re-implementation.
Results. For YFCC100M, ScanNet and the two wide-baseline subsets of FM-Bench (CPC and T&T), our method mostly shows competitive results compared with the state of the art. For the two small-baseline subsets of FM-Bench (TUM and KITTI), the advantages of all learnable methods tend to shrink due to the reduced matching difficulty. Our method recovers the most inlier correspondences on almost all datasets, in terms of M.S. on YFCC100M/ScanNet and #Corrs (-m) on FM-Bench, while maintaining high matching precision, which contributes to the final pose accuracy. Though not specially trained on indoor scenarios, our method generalizes well to indoor settings.
4.2 Visual Localization
To evaluate how our method benefits real downstream applications, we integrate it into a visual localization pipeline and evaluate its performance.
Datasets. We resort to the Aachen Day-Night dataset to evaluate the effectiveness of our method on the visual localization task. Aachen Day-Night consists of reference images and (daytime and nighttime) query images, all taken in urban scenes.
Evaluation protocols. We use the official pipeline of the Aachen Day-Night benchmark. Correspondences between reference images are first used to triangulate a 3D reconstruction. Correspondences between each query and its retrieved reference images are then generated to recover the camera pose. Consistent with the official benchmark, we report the pose estimation accuracy under different thresholds. We extract keypoints for RootSIFT and ContextDesc, and keypoints for SuperPoint.
Results. Compared with SuperGlue, our method exhibits better results when using RootSIFT and competitive results when using SuperPoint or ContextDesc. Our method consistently outperforms OANet with all three descriptors. The overall performance demonstrates the generalization ability of our method in real, challenging applications.
| Feature | Matcher | 0.25m, 2° | 0.5m, 5° | 5m, 10° |
|---|---|---|---|---|
In the above experiments, the proposed method has shown competitive results against the state of the art. In this section, we demonstrate the major advantage of our method in time/memory efficiency compared with SuperGlue, a closely related GNN-based method.
Time/memory consumption. As shown in Fig. 5(a), the time cost of our method is remarkably lower than SuperGlue's. Specifically, we report run time both with and without Sinkhorn iterations on a GTX 1080 GPU, in order to more precisely demonstrate the substantial improvements of the GNN design itself. It is noteworthy that, without Sinkhorn iterations, the proposed method reduces the runtime by one order of magnitude for large keypoint numbers. Besides, due to the reduced redundancy, SGMNet also exhibits better convergence during training (see Appendix C).
As shown in Fig. 5(b), at test time our method consumes half the memory of SuperGlue when the keypoint number is large, with our memory peaks lying in the seeding phase and the Sinkhorn iterations. The advantage becomes even more significant during training: with a batch size of 16, SuperGlue occupies up to 23GB of GPU memory, while SGMNet consumes less than 9GB.
Performance gain when using more keypoints. Within a reasonable range, a larger keypoint number generally improves the performance of downstream tasks, so a manageable matching cost is of practical significance for extending applicability. As a showcase, we vary the keypoint number of RootSIFT when evaluating on the Aachen Day-Night dataset. As seen from Fig. 5(c), the accuracy of both SGMNet and SuperGlue increases as more keypoints are used. Considering the efficiency advantage of our method, SGMNet delivers a better trade-off when increasing the keypoint number. We also provide an SfM experiment, a typical keypoint-consuming application, in Appendix D.3.
| Method | AUC | M.S. | Prec. |
|---|---|---|---|
| NN + RT | 68.58 | 10.05 | 56.38 |
| SGMNet w/ Rand. Seed | 71.25 | 12.94 | 55.57 |
| SGMNet w/o W.U. | 78.64 | 17.07 | 81.26 |
| SGMNet w/o A.P. | 79.35 | 17.11 | 82.15 |
| SGMNet w/o Reseeding | 80.41 | 17.12 | 84.47 |
5.1 Ablation Study
To evaluate the effectiveness of the different components of our method, we conduct an ablation study on the YFCC100M dataset using RootSIFT. As shown in Table 5, every component of our network contributes notably to the final performance. In particular, seeding reliable matches plays an important role, which further proves that seed matches are able to guide messages across images for robust matching.
5.2 Comparison with Filter-based Methods
For a well-rounded comparison, we provide in Table 6 more experimental results against filter-based methods (outlier rejection). When using 4k keypoints, SGMNet achieves the best performance among all comparative methods while running 4 times faster than SuperGlue when the number of Sinkhorn iterations is set to 10. Despite the fast inference speed of state-of-the-art filter-based methods, a considerable performance gap remains compared with GNN-based methods.
5.3 Designs of SGMNet
We experiment with other designs for SGMNet, including alternative seeding strategies and pooling operations from GNN/Transformer architectures [52, 55, 62], e.g., DiffPool [63, 62] and the Set Transformer. We find that 1) learnable seeding achieves limited improvement over our simple heuristic seeding strategy, and 2) other general pooling operations, which are usually only verified for self-attention in GNN/Transformer architectures, are not effective alternatives to our seed-based pooling. Details of these experiments can be found in Appendix B. A hyper-parameter study on the seeding number can be found in Appendix D.
In this paper, we propose SGMNet, a novel graph neural network for efficient image matching. The operations we develop enable message passing with a compact attention pattern. Experiments on different tasks and datasets show that our method brings the accuracy of feature matching and downstream tasks to a competitive or higher level against the state of the art, at a modest computation/memory cost.
Acknowledgment. This work is supported by Hong Kong RGC GRF 16206819, 16203518, T22-603/15N and Guangzhou Okay Information Technology with the project GZETDZ18EG05.
A Implementation Details
We provide in this part details about our implementation as well as the experimental settings.
A.1 Training Details
Training Data. We use the GL3D dataset to generate training data. GL3D is originally built on 3D reconstructions of 543 different scenes, including landmarks and small objects, while its latest version adds 713 sequences of internet tourism photos.
We sample 1000 pairs for each sequence and filter out pairs that are either too hard or too easy for training. More specifically, we use the common track ratio provided by the original dataset and the rotation angle between cameras to determine pair difficulty; only pairs whose common track ratio and rotation angle fall within preset ranges are kept.
We reproject keypoints between images using depth maps and use reprojection distances to determine ground truth matches and unmatchable points. More specifically, a keypoint is labeled as unmatchable if its reprojection distances to all keypoints in the other image are larger than 10 pixels, while a pair of keypoints that are mutual nearest after reprojection and whose reprojection distance is lower than 3 pixels is considered a ground truth match. We further filter out pairs with fewer than 50 ground truth matches.
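A sketch of this labeling rule; taking a precomputed distance matrix as input is an assumption for brevity, since in practice the distances come from depth-based reprojection.

```python
import numpy as np

def label_pairs(proj_dist, match_thr=3.0, unmatch_thr=10.0):
    """Ground-truth labeling from a reprojection-distance matrix
    (rows: keypoints of image A, columns: keypoints of image B)."""
    nn_ab = proj_dist.argmin(axis=1)
    nn_ba = proj_dist.argmin(axis=0)
    # Ground truth matches: mutual nearest after reprojection, < 3 px.
    gt = [(i, int(j)) for i, j in enumerate(nn_ab)
          if nn_ba[j] == i and proj_dist[i, j] < match_thr]
    # Unmatchable: farther than 10 px from every keypoint in image B.
    unmatchable_a = np.where(proj_dist.min(axis=1) > unmatch_thr)[0]
    return gt, unmatchable_a
```

The gap between the two thresholds leaves keypoints in the 3-10 px band unlabeled, so ambiguous points supervise neither the match nor the dustbin assignment.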
Our data generation protocol yields around 400k training pairs in total.
Training Parameters. We use the Adam optimizer for training. We apply learning rate decay after 300k iterations with a decay rate of 0.999996 until 900k iterations. We block the gradient flow between the initial/refinement stages for the first 140k iterations.
A.2 Network Details
As in SuperGlue, we apply the multi-head attention mechanism to all weighted attention operations in our network, with a head number of 4.
For the initial Seeding Module, we apply an additional mutual nearest neighbour check and ratio test to the seed correspondences. NMS is also employed in seeding; the NMS radius is set to a hyper-parameter times the average distance between keypoints, and this hyper-parameter is fixed for all experiments.
For the Reseeding Module, we apply the Sinkhorn algorithm with 10 iterations to the correlation matrix to obtain the assignment matrix. Correspondences with top-k scores along both dimensions are sampled as seeds. NMS is also applied in the reseeding module.
For the inlier likelihood predictor in Seeded GNN, we use the structure illustrated in Fig 7 (down).
We set the number of iterations of the Sinkhorn algorithm to 100.
A.3 Experiment Settings
We use the OpenCV implementation of SIFT and the official implementations of SuperPoint and ContextDesc. For ContextDesc, we use the latest public model (denoted as ContextDesc++upright in the official GitHub repository).
YFCC100M. We extract SIFT and ContextDesc from images at their original resolution, and resize images so that the longest dimension is 1600 to extract SuperPoint. We use the OpenCV findEssentialMat and recoverPose functions to recover relative poses with the embedded RANSAC, whose threshold is set to 1 pixel at the resized resolution. For matching score and precision, we use an epipolar distance threshold to determine inlier matches.
ScanNet. We resize images to 640×480 resolution to extract keypoints, and use the same protocol as YFCC100M to recover relative poses, matching scores and precision.
FM-Bench. The original evaluation pipeline of FM-Bench is based on Matlab; we reimplement it in Python. The evaluation parameters are consistent with the original implementation. We use the OpenCV function for fundamental matrix estimation and set the threshold of the embedded RANSAC to 1 pixel. Compared with the original implementation, our evaluation pipeline tends to yield higher accuracy, especially on the wide-baseline datasets. We believe this is due to the better performance of the OpenCV function and is beneficial for a more precise evaluation.
Aachen Day-Night. We use the official pipeline and default parameters for evaluation. We extract upright features for both RootSIFT and ContextDesc.
B Designs of SGMNet
Despite the effectiveness of SGMNet, we provide in this part our experimental results and analysis for some other potential designs.
Learned Seeding. After obtaining initial nearest neighbour correspondences, we employ a lightweight permutation-invariant network (architecture illustrated in Fig. 7) to predict each correspondence's inlier likelihood score, as is done in previous works [63, 61]. We then sample correspondences with top-k inlier scores, instead of ratio scores, as seeding correspondences.
Results. We report results in Table 7. Although applying a lightweight PointCN block for inlier seed prediction slightly increases the seeding precision, the gain is not enough to bring meaningful impact at the level of pose estimation.
In general, we are open to the possibility of improving matching quality with more sophisticated seeding strategies. However, since our goal is efficient matching, our seeding method strikes a good balance between performance and cost.
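The ratio-based seeding that learned seeding is compared against can be sketched as follows (a minimal numpy illustration with our own function name and an illustrative seed number, not the released implementation):

```python
import numpy as np

def seed_by_ratio(desc0, desc1, num_seeds=128):
    """Select seed correspondences by ratio score (sketch).

    desc0: (N0, d) descriptors of image 0; desc1: (N1, d) descriptors of
    image 1. Returns (num_seeds, 2) index pairs into the two keypoint sets.
    """
    # Pairwise L2 distances between the two descriptor sets.
    d = np.linalg.norm(desc0[:, None, :] - desc1[None, :, :], axis=-1)
    nn = np.argmin(d, axis=1)                 # nearest neighbour in image 1
    sorted_d = np.sort(d, axis=1)
    # Lowe's ratio between first and second nearest neighbour: lower = better.
    ratio = sorted_d[:, 0] / (sorted_d[:, 1] + 1e-8)
    # Keep the correspondences with the smallest (most confident) ratios.
    seeds = np.argsort(ratio)[:num_seeds]
    return np.stack([seeds, nn[seeds]], axis=1)
```

A learned seeder would replace the ratio score above with a network-predicted inlier likelihood, keeping the same top-k selection.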
The critical component of our method for efficient message passing is essentially a pooling of the original keypoints. In this part, we apply two well-studied pooling designs from the GNN and transformer literature, namely DiffPool and Induced Set Attention, to the image matching task and evaluate their performance.
DiffPool. As a pooling operation for GNNs, DiffPool predicts an assignment matrix based on each node’s embedding in the graph, and is designed to build hierarchical and sparse graph representations. In OANet, DiffUnpool, the counterpart of DiffPool, is proposed to recover node clusters to the original size. As an experiment, we apply DiffPool/Unpool to the keypoint graph in each image as a substitute for our proposed attentional pooling. Cross-/self-attention is then performed on the pooled clusters for message exchange.
Induced Set Attention. Induced set attention (ISA) is first proposed in Set Transformer. Instead of using seed features as the attention bottleneck, ISA adopts a set of learned fixed features (induced points) as the attention bottleneck between set elements, and is only verified with self-attention on sparse input. We substitute our seed-based attention with ISA. More specifically, we let the network learn induced points for both images and let the induced points attend to the original keypoints (for both cross- and self-attention) to perform message passing.
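A single ISA block can be sketched in numpy as below (single-head, unparameterised attention for illustration; in the real network the induced points I are learned and the projections are linear layers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def induced_set_attention(X, I):
    """One ISA block (sketch): n keypoint features X (n, d) exchange
    messages through m induced points I (m, d), with m << n."""
    d = X.shape[-1]
    # Stage 1: induced points attend to all keypoints (m x n attention map).
    H = softmax(I @ X.T / np.sqrt(d)) @ X
    # Stage 2: keypoints attend back to the summaries (n x m attention map).
    return softmax(X @ H.T / np.sqrt(d)) @ H
```

Both stages cost O(nmd) rather than the O(n^2 d) of dense attention, which is the same bottleneck idea our seed-based attention exploits, but with learned fixed features instead of seed correspondences.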
Results. We report evaluation results on YFCC100M. For all pooling methods, we set the pooling number to 128 and extract up to a fixed number of keypoints. As illustrated in Table 7, both DiffPool and ISA show only marginal improvements over the baselines, which indicates that pooling methods designed for generic GNNs or efficient transformers are not necessarily effective for the image matching task, and further proves that our seed-based attentional pooling/unpooling operation is critical to the success of our method.
C Fast Convergence
The compactness of SGMNet not only cuts down the computation/memory complexity but also leads to faster convergence in training. In Fig. 8 we plot the training curves of both SGMNet and SuperGlue. As illustrated, SGMNet takes fewer iterations to converge.
D Additional Experiment Results
D.1 Impact of Seeding Number
SGMNet requires a set of seed correspondences, whose number influences not only the efficiency but also the accuracy of our method. It is therefore important to investigate the impact of the seeding number. We conduct a careful grid search on YFCC100M over different keypoint and seed numbers.
As illustrated in Fig. 9, an approximately proportional relationship between the keypoint and seed numbers yields the best performance: too many seeds may deliver less reliable guidance, while seeding too few correspondences results in severe information loss.
Table 8. SfM results on the Alamo scene.

|Methods|#Registered Images|#Sparse Points|Mean Repro. Error|Mean Track Len.|Matching Time|Total Time|
|---|---|---|---|---|---|---|
|NN+Ratio+Mutual Check|799|132265|0.58px|11.27|1h 13min|2h 22min|
|SuperGlue|916|223950|0.95px|10.93|42h 34min|44h 06min|
|SGMNet|943|276240|1.10px|10.73|5h 37min|7h 56min|
D.2 Robustness to Seed Noise
To evaluate the robustness of our method w.r.t. potential false seeds, we conduct an experiment on YFCC100M. More specifically, for each pair we select a set of inlier matches, determined using the ground truth, and pad them with randomly sampled noise to construct seed correspondences of different precision. We feed the pre-selected seed correspondences directly into the Seeded GNN instead of applying the Seeding Module.
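The padding procedure above can be sketched as follows (function and parameter names are ours, for illustration only):

```python
import numpy as np

def pad_seeds_to_precision(inliers, n0, n1, precision, rng=None):
    """Build seed correspondences of a given precision (sketch).

    inliers: (k, 2) ground-truth inlier index pairs.
    n0, n1: keypoint counts of the two images, used to sample random outliers.
    precision: target fraction of inliers among the returned seeds.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    k = len(inliers)
    # Choose the noise count so that k / (k + n_noise) == precision.
    n_noise = int(k * (1.0 - precision) / precision)
    noise = np.stack([rng.integers(0, n0, n_noise),
                      rng.integers(0, n1, n_noise)], axis=1)
    seeds = np.concatenate([inliers, noise], axis=0)
    # Shuffle so the network cannot exploit the ordering.
    return rng.permutation(seeds)
```

Sweeping `precision` then yields seed sets of controlled noise levels for the robustness study.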
As shown in Fig. 10, SGMNet maintains high matching quality even with heavily noisy seed correspondences. It is noteworthy that for SGMNet without weighted unpooling, the pose estimation accuracy degenerates more rapidly as seed precision decreases, which indicates lower robustness to seed noise. More visualizations of noisy seeds and matching results can be seen in Fig. 6.
D.3 SfM Experiment
A typical Structure-from-Motion (SfM) pipeline usually involves extracting a large number of keypoints (e.g. 8k) and matching across hundreds or thousands of images to obtain highly accurate poses and more complete reconstructions. In this section, we embed different matching methods into the COLMAP SfM pipeline for comparison. We reconstruct the challenging Alamo scene from 1DSFM, which involves 2915 images taken under very different illumination conditions. 8k RootSIFT features are extracted for each image, and we set the number of Sinkhorn iterations to 10 for both SGMNet and SuperGlue. We use a GTX 1080 GPU to perform matching sequentially. Apart from common reconstruction statistics, we also report the time consumption for matching and for the whole SfM pipeline.
As shown in Tab. 8, both SuperGlue and SGMNet produce much more complete reconstructions compared with vanilla NN matching and heuristic pruning. However, SuperGlue largely lengthens the time of the whole SfM pipeline, while our method keeps the matching time at a feasible level.
E More Visualizations
-  Relja Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
-  Xuyang Bai, Zixin Luo, Lei Zhou, Hongkai Chen, Lei Li, Zeyu Hu, Hongbo Fu, and Chiew-Lan Tai. Pointdsc: Robust point cloud registration using deep spatial consistency. In CVPR, 2021.
-  Jia-Wang Bian, Yu-Huan Wu, Ji Zhao, Yun Liu, Le Zhang, Ming-Ming Cheng, and Ian Reid. An evaluation of feature matchers for fundamental matrix estimation. In BMVC, 2019.
-  Eric Brachmann and Carsten Rother. Neural-guided ransac: Learning where to sample model hypotheses. In ICCV, 2019.
-  T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
-  Luca Cavalli, Viktor Larsson, Martin Ralf Oswald, Torsten Sattler, and Marc Pollefeys. Handcrafted outlier detection revisited. In ECCV, 2020.
-  Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv, 2019.
-  Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
-  Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics (TOG), 2017.
-  François Darmon, Mathieu Aubry, and Pascal Monasse. Learning to guide local feature matches. 2020.
-  D. DeTone, T. Malisiewicz, and A. Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPRW, 2018.
-  Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In CVPR, 2019.
-  Matthias Fey, Jan E. Lenssen, Christopher Morris, Jonathan Masci, and Nils M. Kriege. Deep graph matching consensus. In ICLR, 2020.
-  Donniell E. Fishkind, Sancar Adali, Heather G. Patsolic, Lingyao Meng, Digvijay Singh, Vince Lyzinski, and Carey E. Priebe. Seeded graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  Catherine Fraikin and Paul Van Dooren. Graph matching with type constraints on nodes and edges. Dagstuhl Seminar Proceedings, 2007.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
-  M. Goesele, B. Curless, and S.M. Seitz. Multi-view stereo revisited. In CVPR, 2006.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
-  Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 2017.
-  Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
-  M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In ICCV, 2005.
-  Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
-  Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In ICLR, 2018.
-  David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
-  Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Contextdesc: Local descriptor augmentation with cross-modality context. In CVPR, 2019.
-  Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. Geodesc: Learning local descriptors by integrating geometry constraints. In ECCV, 2018.
-  Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Aslfeat: Learning local features of accurate shape and localization. In CVPR, 2020.
-  Vince Lyzinski, Daniel L. Sussman, Donniell E. Fishkind, Henry Pao, Li Chen, Joshua T. Vogelstein, Youngser Park, and Carey E. Priebe. Spectral clustering for divide-and-conquer graph matching. Parallel Computing, 2015.
-  Josef Maier, Martin Humenberger, Markus Murschitz, Oliver Zendel, and Markus Vincze. Guided matching based on statistical optical flow for fast and robust correspondence analysis. In ECCV, 2016.
-  Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In NeurIPS, 2017.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 2015.
-  Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
-  Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
-  Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: Repeatable and reliable detector and descriptor. In NeurIPS, 2019.
-  Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
-  Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6dof outdoor visual localization in changing conditions. In CVPR, 2018.
-  Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
-  Tianwei Shen, Zixin Luo, Lei Zhou, Runze Zhang, Siyu Zhu, Tian Fang, and Long Quan. Matchable image retrieval by learning from surface reconstruction. In ACCV, 2018.
-  Xi Shen, François Darmon, Alexei A Efros, and Mathieu Aubry. Ransac-flow: generic two-stage image alignment. ECCV, 2020.
-  Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
-  J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, 2012.
-  K. Sun, W. Tao, and Y. Qian. Guide to match: Multi-layer feature matching with a hybrid gaussian mixture model. IEEE Transactions on Multimedia, 2020.
-  Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Acne: Attentive context normalization for robust permutation-equivariant learning. In CVPR, 2020.
-  Shitao Tang, Chengzhou Tang, Rui Huang, Siyu Zhu, and Ping Tan. Learning camera localization via dense scene matching. In CVPR, 2021.
-  Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In ICML, 2020.
-  Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 2016.
-  Yurun Tian, Bin Fan, and Fuchao Wu. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In CVPR, 2017.
-  Lorenzo Torresani, Vladimir Kolmogorov, and Carsten Rother. Feature correspondence via graph matching: Models and global optimization. In ECCV, 2008.
-  S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
-  Yannick Verdie, Kwang Moo Yi, Pascal Fua, and Vincent Lepetit. Tilde: A temporally invariant learned detector. In CVPR, 2015.
-  Runzhong Wang, Junchi Yan, and Xiaokang Yang. Learning combinatorial embedding networks for deep graph matching. In ICCV, 2019.
-  Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. In arXiv, 2020.
-  Z. Wang and S. Ji. Second-order pooling for graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
-  Kyle Wilson and Noah Snavely. Robust global translations with 1dsfm. In ECCV, 2014.
-  Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
-  Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In CVPR, 2019.
-  Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV, 2016.
-  Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In CVPR, 2018.
-  Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NeurIPs, 2018.
-  Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In ICCV, 2019.
-  Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. In BMVC, 2020.
-  Jingyang Zhang, Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Learning stereo matchability in disparity regression networks. In ICPR, 2020.
-  Zhengyou Zhang. Determining the epipolar geometry and its uncertainty: A review. IJCV, 1998.
-  Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, and Long Quan. Kfnet: Learning temporal camera relocalization using kalman filtering. In CVPR, 2020.