Learning to Match Features with Seeded Graph Matching Network

08/19/2021 · Hongkai Chen et al. · Tsinghua University and The Hong Kong University of Science and Technology

Matching local features across images is a fundamental problem in computer vision. Targeting high accuracy and efficiency, we propose the Seeded Graph Matching Network, a graph neural network with a sparse structure that reduces redundant connectivity and learns compact representations. The network consists of 1) a Seeding Module, which initializes matching by generating a small set of reliable matches as seeds, and 2) a Seeded Graph Neural Network, which utilizes seed matches to pass messages within/across images and predicts assignment costs. Three novel operations are proposed as basic elements for message passing: 1) Attentional Pooling, which aggregates keypoint features within each image to the seed matches; 2) Seed Filtering, which enhances seed features and exchanges messages across images; and 3) Attentional Unpooling, which propagates seed features back to the original keypoints. Experiments show that our method significantly reduces computational and memory complexity compared with typical attention-based networks while achieving competitive or higher performance.


1 Introduction

Establishing reliable correspondences across images is an essential step to recover relative camera pose and scene structure in many computer vision tasks, such as Structure-from-Motion (SfM) [38], Multi-View Stereo (MVS) [17] and Simultaneous Localization and Mapping (SLAM) [32]. In classical pipelines, correspondences are obtained by nearest neighbour search (NN) of local feature descriptors and are usually further pruned by heuristic tricks, such as the mutual nearest neighbour check (MNN) and the ratio test (RT) [25].

In the past few years, great efforts have been made on designing learnable matching strategies. Early works [61, 44, 63] in this direction utilize PointNet-like [33] networks to reject outliers of putative correspondences. In these works, the correspondence coordinates are fed into permutation-equivariant networks, and an inlier likelihood score is then predicted for each correspondence. Despite showing exciting results, these methods are limited in two aspects: 1) They operate on pre-matched correspondences, so they cannot find more matches than vanilla nearest-neighbour matching provides. 2) They only reason about the geometric distribution of putative correspondences, neglecting the critical information in the original local visual descriptors.

Figure 1: The designs of the message passing layer. (a) SuperGlue densely connects every node in the graph, resulting in O(N²) computational complexity. (b) Instead, the proposed network, SGMNet, adopts pooling/unpooling operations, reducing the complexity to O(NS), where S ≪ N in general.

Another thread of methods casts feature matching as a graph matching problem [36, 5, 50], which mitigates the limits of vanilla nearest neighbour correspondences. The representative work, SuperGlue [36], constructs densely-connected graphs between image keypoints to exchange messages about both visual and geometric context. However, the superior performance comes along with high computation and memory cost, especially when applied to larger numbers of keypoints (e.g. up to 10k). As illustrated in Fig. 1(a), the message passing layer of SuperGlue first calculates similarity scores exhaustively between every two nodes, then gathers features to pass messages densely in the graph. This results in a computational complexity of O(N²C) for the matrix multiplications, and a memory occupation of O(N²) to hold the attention matrix, supposing that the keypoint number is N and the feature channel number is C. The cost grows even more drastically for deeper graph networks. Given this, exploring more efficient and compact message passing operations is of practical significance.

Besides the major efficiency bottleneck, it is debatable whether such a densely-connected graph introduces too much redundancy or insignificant message exchange that may hinder the representation ability, especially in the context of feature matching, where the match set is highly outlier-contaminated and a large portion of keypoints are unrepeatable. As a result, most graph edges in SuperGlue [36] tend to have zero strength, as reported in its original paper and also observed in our experiments. This phenomenon indicates that even a sparse graph is largely sufficient and less distracted by unnecessary message exchanges.

In this paper, we propose the Seeded Graph Matching Network (SGMNet) to mitigate the above limitations from two aspects. First, inspired by guided matching approaches [10, 43, 30], we design a Seeding Module that initializes the matching from a small set of reliable matches so as to identify inlier compatibility more effectively. Second, we draw inspiration from graph pooling operations [62, 56] and construct a Seeded Graph Neural Network whose graph structure is largely sparsified to lower the computation and reduce the redundancy. Specifically, three operations are proposed to construct our message passing blocks. As illustrated in Fig. 1(b), instead of densely attending to all features within/across images, the original keypoint features are first pooled by 1) Attentional Pooling through a small set of seed nodes, whose features are further enhanced by 2) Seed Filtering, and finally recovered back to the original keypoints through 3) Attentional Unpooling.

By using seeds as an attention bottleneck between images, the computational complexity of attention is reduced from O(N²) to O(NS), where S is the number of seeds. Since S ≪ N (e.g., thousands of keypoint features are pooled into a few hundred seeds), the actual computation is significantly cut down; the cost sketch after the contribution list below makes this concrete. We evaluate SGMNet on different tasks to demonstrate both its efficiency and effectiveness, and summarize our contributions as threefold:


  • A seeding mechanism is introduced into the graph matching framework to effectively identify inlier compatibility.

  • A greatly sparsified graph neural network is designed that enables more efficient and cleaner message passing.

  • Competitive or higher accuracy is reported with remarkably improved efficiency over dense attentional GNNs. As an example, when matching 10k features, SGMNet runs about 10 times faster and consumes 50% less GPU memory than SuperGlue.
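To make the asymptotic difference concrete, the back-of-the-envelope cost model below (our own illustration, not code from the paper) counts the dominant matrix-multiplication FLOPs and attention-matrix entries for the two patterns in Fig. 1; the constants are rough and the seed number of 128 is only an assumed example.

```python
# Rough cost model for the two attention patterns in Fig. 1 (illustrative only).

def dense_attention_cost(n, c):
    """SuperGlue-style dense attention: all N nodes attend to all N nodes."""
    flops = 2 * n * n * c      # QK^T plus the attention-weighted sum of V
    attn_entries = n * n       # the N x N attention matrix must be held
    return flops, attn_entries

def seeded_attention_cost(n, s, c):
    """SGMNet-style pooling/unpooling: attention only between N keypoints
    and S seeds, in both directions."""
    flops = 2 * 2 * n * s * c  # pooling (S x N) plus unpooling (N x S)
    attn_entries = 2 * n * s
    return flops, attn_entries

if __name__ == "__main__":
    n, s, c = 10_000, 128, 128  # 10k keypoints, 128 seeds (assumed), 128 channels
    print(dense_attention_cost(n, c))      # ~2.6e10 FLOPs, 1e8 attention entries
    print(seeded_attention_cost(n, s, c))  # ~6.6e8 FLOPs, 2.6e6 attention entries
```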

2 Related Works

Learnable image matching. Integrating deep learning techniques into geometry-based computer vision tasks, such as MVS [58, 59, 64, 65] and visual localization [67, 46], has achieved inspiring success during the past few years. As the front-end component for geometry estimation, learnable image matching has also been proven effective. Works in this area can be roughly divided into two categories. The first focuses on improving local descriptors [60, 41, 27, 26, 31, 49] and keypoints [53, 12, 28, 35, 11] with convolutional neural networks, while methods in the second category attempt to embed learning techniques into the matching strategy, which involves learnable outlier rejection [61, 44, 63] and robust estimators [4].

Recently, a new framework, SuperGlue [36], was proposed to integrate feature matching and outlier rejection into a single graph neural network (GNN). Though exhibiting promising results on different tasks, SuperGlue still suffers from the excessive computational cost of its fully connected self-/cross-attention operations, especially when used to match a large number of features.

Compared with SuperGlue, our method shares the same advantages: feature matching and refinement are integrated into a single network that allows for end-to-end training. However, our network significantly reduces the computational and memory cost thanks to its efficient attention block, which is specially designed for image matching.

Figure 2: The network architecture of SGMNet, which takes local features as input, generates seed matches with the seeding module, and extracts correspondence features with a Seeded GNN composed of multiple attentional blocks. In practice, the updated features are fed into a reseeding module and a 3-layer Seeded GNN for refinement; we omit this procedure here for simplicity.

Efficient transformer architectures. Transformer [52] architectures have gained intensive interest during the past few years. Specifically, in the context of graph convolution, the attention mechanism in transformers can be used to pass messages across nodes in a graph structure [13, 54]. Despite its effectiveness in a wide range of tasks, one major concern about the transformer is its quadratic complexity w.r.t. the input size, which hinders its application when the number of query/key elements is large.

Recently, many efforts have been made to address the attention efficiency problem. In [24, 7], predefined sparse attention patterns are adopted to cut down the memory/computation cost. In [47, 19], the attention span is pruned by learnable partitioning or grouping of input elements. In [55, 21], pooling operations are utilized to reduce the number of elements. Despite the inspiring progress, works in this area generally focus on self-attention, where keys and queries are derived from the same element set, while their effectiveness in cross-attention, where keys and queries come from two unaligned sets, remains unstudied.

We draw inspiration from induced set attention (ISA) [21], where a set of learned but fixed nodes is utilized as a bottleneck for efficient self-attention. To be compatible with cross-attention in graph matching, we instead establish attention between seed matches and the original point sets. The selected reliable correspondences align features on both sides and pass messages at low cost.

Graph matching. Graph matching, which aims at generating node correspondences across graphs, is a widely used model for feature matching in both the 2D [50, 5] and 3D [22, 2] domains. Mathematically formulated as a quadratic assignment problem (QAP) [51], graph matching is NP-hard in its most general form and requires infeasibly expensive solvers for precise solutions. Despite the intractable nature of general graph matching, some methods [14, 15, 29] leverage partially pre-matched correspondences, also called seeds, to help matching, which is referred to as Seeded Graph Matching (SGM). Inspired by SGM, our network integrates seeds into a GNN framework for compact message passing and robust matching.

3 Methodology

We present the Seeded Graph Matching Network (SGMNet for short) for learning correspondences between two sets of keypoints and their associated visual descriptors. As illustrated in Fig. 2, our network produces matches in two stages: 1) a Seeding Module generates seeds to guide compact message passing, and 2) a Seeded Graph Neural Network leverages the seeds as a message bottleneck to update per-node features. In the following parts, an overview of our network architecture is introduced first, followed by a detailed description of each module.

3.1 Overview

Given a pair of images A and B, with M and N keypoints and associated visual descriptors respectively, indexed by i and j, our objective is to establish reliable and robust keypoint matches across the two images.

We formulate the keypoint matching task as a graph matching problem, where the nodes are the keypoints of each image. Instead of applying a fully connected graph, we generate a set of keypoint correspondences, which we refer to as seed matches, to guide message passing across the nodes of the two graphs for subsequent matching. This critical difference allows for processing large numbers of keypoints with significantly lower memory and computation cost.

The input to our network is the keypoints of the two images together with their descriptors: for image A, the keypoints p_i = (x_i, y_i) give the coordinates of keypoint i, and each keypoint is associated with a C-dimensional visual descriptor d_i ∈ R^C (and likewise for image B).

Positions of keypoints are embedded into a high-dimensional feature space and combined with the descriptors by element-wise summation to form the initial representations x_i^(0):

    x_i^(0) = d_i + Enc(p_i),    (1)

where Enc is a multilayer perceptron that encodes keypoint positions.
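A minimal PyTorch sketch of this encoder, assuming a small MLP for the positional embedding (the hidden sizes are our own choice; the paper does not specify them here):

```python
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Embed keypoint coordinates and add them to the visual descriptors,
    as in Eq. (1). Hidden sizes are illustrative assumptions."""
    def __init__(self, desc_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, 32), nn.ReLU(),
            nn.Linear(32, desc_dim),
        )

    def forward(self, kpts, descs):
        # kpts: (N, 2) keypoint coordinates; descs: (N, desc_dim) descriptors
        return descs + self.mlp(kpts)
```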

A Seeding Module follows to construct a set of seed matches. The seed matches, together with the keypoint features of both images, are then fed into our Seeded Graph Neural Network, which jointly reasons about visual appearance similarity, neighbourhood consensus, and the guidance provided by the seed matches to update the keypoint features.

Inspired by the cascade refinement structure in OANet [63], a second seeding module, or reseeding module, is introduced to generate more accurate seeds based on the updated features, which helps to further refine the matches with another Seeded GNN. Final matches are then generated by formulating an assignment matrix.

3.2 Seeding Module

Proposing a set of seed matches lays the foundation for subsequent matching. For initial seeding, we adopt a simple yet effective strategy: we generate putative matches by nearest neighbour matching and use the inverse of the distance ratio, i.e., the ratio of distances to the first and second nearest neighbours [25], as the reliability score. We adopt Non-Maximum Suppression (NMS) for better spatial coverage of the seeds. More details of the seeding module can be found in Appendix A.2. Despite potential noise in the initial seeds, our network maintains robustness thanks to the proposed weighted unpooling operation and reseeding strategy, which are discussed later.

The seeding module outputs index lists (I_A, I_B), where I_A and I_B index the seed matches in each image.
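The following NumPy sketch illustrates this initial seeding under stated simplifications: it performs mutual nearest-neighbour matching, scores matches by the inverse distance ratio, and keeps the top-k; the spatial NMS described above (detailed in Appendix A.2) is omitted, and the helper name and ratio threshold are our own.

```python
import numpy as np

def seed_matches(desc_a, desc_b, num_seeds, ratio_max=0.9):
    """Initial seeding sketch: mutual NN matching + inverse-ratio scores."""
    # Pairwise L2 distances between the two descriptor sets.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_b = d.argmin(axis=1)                  # best match in B for each point in A
    mutual = d.argmin(axis=0)[nn_b] == np.arange(len(desc_a))

    # Lowe's ratio of first to second nearest-neighbour distance.
    two_smallest = np.partition(d, 1, axis=1)
    ratio = two_smallest[:, 0] / (two_smallest[:, 1] + 1e-8)
    score = 1.0 / (ratio + 1e-8)             # inverse ratio as reliability score

    cand = np.where(mutual & (ratio < ratio_max))[0]
    keep = cand[np.argsort(-score[cand])][:num_seeds]
    return keep, nn_b[keep]                  # seed index lists (I_A, I_B)
```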

3.3 Seeded Graph Neural Network (Seeded GNN)

The Seeded GNN takes the initial position-embedded features and leverages the seed matches as attention bottlenecks for message passing. To this end, we adopt a pooling-processing-unpooling strategy in each processing unit: seed features first gather information from the full point set on each side through Attentional Pooling, are then processed by the Seed Filtering operation, and are finally recovered back to the original size by Attentional Unpooling. Our Seeded GNN is constructed by stacking 6 (3) such processing units for the initial (refinement) stage.

Weighted attentional aggregation. We first introduce a weighted version of attentional aggregation, which allows for sharper and cleaner data-dependent message passing.

In a D-dimensional feature space, given vectors to be updated X ∈ R^{m×D}, vectors to be attended to Y ∈ R^{n×D}, and a weight vector w ∈ R^n, the weighted attentional aggregation Att is defined as

    Att(X, Y, w) = A (w ⊙ V),    (2)

where

    A = σ(Q K^T / √D)    (3)

and σ(·) means row-wise softmax; ⊙ broadcasts the weights over the feature channels. Q is a linear projection of X, and K, V are linear projections of Y. Att(X, Y, w) is the renewed representation of X.

By attentional aggregation, elements in X retrieve and aggregate information from elements in Y. The weighting vector w adjusts the importance of each element in Y.

Attentional pooling. As the first step of message passing, seed matches retrieve context from the full keypoint sets through attentional aggregation.

For input features X_A^t, X_B^t in layer t, the features of the seed matches are first retrieved by the seed indices:

    S_A^t = Ind(X_A^t, I_A),  S_B^t = Ind(X_B^t, I_B),    (4)

where Ind(·, ·) is the indexing operation. The seed matches are then updated by retrieving context from the nodes of each graph:

    Ŝ_A^t = Att(S_A^t, X_A^t, 1),  Ŝ_B^t = Att(S_B^t, X_B^t, 1),    (5)

where 1 is an all-one vector, which means no weights are applied.

A multilayer perceptron follows to fuse the seed features:

    S̃_A^t = MLP([S_A^t ∥ Ŝ_A^t]),  S̃_B^t = MLP([S_B^t ∥ Ŝ_B^t]),    (6)

where [· ∥ ·] means concatenation along the channel dimension.

The outputs S̃_A^t, S̃_B^t, which encode both visual and position context for each graph as well as information from the seed matches themselves, are fed into the subsequent operations.

Figure 3: Visualization of attentional pooling/unpooling. In attentional pooling, each pair of seed matches aggregates context from the other keypoints, while in attentional unpooling each original keypoint retrieves renewed messages from the seed matches.

Seed filtering. We propose the seed filtering operation to 1) conduct intra-/inter-graph communication between seed matches and 2) suppress the influence of outlier seed matches. More specifically, intra-/inter-graph attentional aggregation is applied to the input seed correspondence features:

    S̃_A^t ← Att(S̃_A^t, S̃_A^t, 1),  S̃_B^t ← Att(S̃_B^t, S̃_B^t, 1),    (7)
    S̃_A^t ← Att(S̃_A^t, S̃_B^t, 1),  S̃_B^t ← Att(S̃_B^t, S̃_A^t, 1).    (8)

In addition, a context normalization [61] branch is used to predict an inlier likelihood score for each seed correspondence, which is used as a weighting score for the seed features in the later unpooling stage:

    w^t = CN([S̃_A^t ∥ S̃_B^t]),    (9)

where CN is a lightweight stack of context normalization [61] blocks. The detailed structure of the CN branch can be found in Appendix A.2.

The outputs of seed filtering are the filtered features S̃_A^t, S̃_B^t and the inlier scores w^t of the seed matches.

Attentional unpooling. After the message exchange between seed matches and the inlier score prediction, an inlier-score-weighted attentional aggregation is adopted to broadcast the pooled contexts to every keypoint in each graph, which we refer to as attentional unpooling.

Taking the filtered seed features S̃_A^t, S̃_B^t, the inlier scores w^t, and the keypoint features X_A^t, X_B^t as input, attentional unpooling outputs the updated keypoint features:

    X_A^{t+1} = Att(X_A^t, S̃_A^t, w^t),  X_B^{t+1} = Att(X_B^t, S̃_B^t, w^t).    (10)

Applying the inlier scores to the aggregation process suppresses information broadcast from false seed matches and results in cleaner feature updates, which contributes to the robustness of our network w.r.t. seeding noise (Appendix D.2, Appendix Fig. 10).
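Putting the three operations together, one processing unit can be sketched as below. This is a simplified, single-head reading of the section: the projections are shared across the three steps, the residual connections are our own simplification, and a plain MLP with a sigmoid stands in for the context-normalization branch of Eq. (9).

```python
import torch
import torch.nn as nn

class SeededProcessingUnit(nn.Module):
    """One pooling -> filtering -> unpooling unit (simplified sketch)."""
    def __init__(self, d=128):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.inlier = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def att(self, x, y, w):
        # Weighted attentional aggregation of Eqs. (2)-(3), single head.
        a = torch.softmax(self.q(x) @ self.k(y).t() / x.shape[-1] ** 0.5, dim=-1)
        return a @ (w.unsqueeze(-1) * self.v(y))

    def forward(self, xa, xb, idx_a, idx_b):
        ones = lambda t: torch.ones(t.shape[0], device=t.device)
        # 1) Attentional pooling: seeds gather context from all keypoints.
        sa = self.fuse(torch.cat([xa[idx_a], self.att(xa[idx_a], xa, ones(xa))], -1))
        sb = self.fuse(torch.cat([xb[idx_b], self.att(xb[idx_b], xb, ones(xb))], -1))
        # 2) Seed filtering: intra-/inter-graph exchange, then inlier scores.
        sa, sb = sa + self.att(sa, sa, ones(sa)), sb + self.att(sb, sb, ones(sb))
        sa, sb = sa + self.att(sa, sb, ones(sb)), sb + self.att(sb, sa, ones(sa))
        w = torch.sigmoid(self.inlier(torch.cat([sa, sb], -1))).squeeze(-1)
        # 3) Attentional unpooling: broadcast seed context, weighted by w.
        return xa + self.att(xa, sa, w), xb + self.att(xb, sb, w)
```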

Assignment matrix formulation. After all processing units, the updated features are used to construct the assignment matrix. The Sinkhorn [8] algorithm is applied to the correlation matrix of the features, augmented with a dustbin channel, to produce the final assignment matrix P.

Given the keypoint features X_A^T, X_B^T after T processing blocks, we compute the assignment matrix by

    C_ij = ⟨x_i^A, x_j^B⟩,    (11)
    P = Sinkhorn(C̄),    (12)

where C̄ is C augmented with an extra dustbin row and column filled with a learnable parameter z. We derive the final matches from the assignment matrix with a confidence threshold to remove outliers.
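A compact log-domain sketch of this step is given below, with marginals chosen so that each keypoint carries unit mass and the dustbins absorb the remainder (the same convention as SuperGlue [36]); the released codebases run this in batched form.

```python
import torch

def sinkhorn_with_dustbin(scores, z, iters=100):
    """Log-domain Sinkhorn over a score matrix augmented with a dustbin
    row/column (Eqs. (11)-(12)). scores: (M, N) correlation matrix;
    z: scalar tensor, the learnable dustbin parameter."""
    m, n = scores.shape
    cbar = torch.cat([
        torch.cat([scores, z.expand(m, 1)], dim=1),
        torch.cat([z.expand(1, n), z.expand(1, 1)], dim=1),
    ], dim=0)                                          # shape (M+1, N+1)

    # Marginals: every keypoint carries unit mass; the dustbin on each side
    # absorbs the mass of all unmatched keypoints of the other image.
    log_r = torch.cat([torch.zeros(m), torch.tensor([float(n)]).log()])
    log_c = torch.cat([torch.zeros(n), torch.tensor([float(m)]).log()])

    u, v = torch.zeros(m + 1), torch.zeros(n + 1)
    for _ in range(iters):   # alternate row/column normalization in log space
        u = log_r - torch.logsumexp(cbar + v[None, :], dim=1)
        v = log_c - torch.logsumexp(cbar + u[:, None], dim=0)
    return (cbar + u[:, None] + v[None, :]).exp()      # assignment matrix P
```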

3.4 Reseeding

Although the Seeded GNN based on initial seeding exhibits a strong capability to identify the underlying matches, a second seeding module using the updated features provides even cleaner and richer seeds that further improve the performance. Thus, we adopt a reseeding module. Different from initial seeding, where NN matches and ratio scores of the raw descriptors are used, the reseeding module employs the assignment matrix of the updated features to regenerate seeds. More specifically, matches with the highest score in both their row and column are selected as candidates, among which the top-k matches are selected as new seeds and fed into a second Seeded GNN for refinement. More details can be found in Appendix A.2.
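A sketch of this selection follows (hypothetical helper; the NMS applied in the actual reseeding module, Appendix A.2, is omitted):

```python
import torch

def reseed(p, k):
    """Reseeding sketch: keep matches that are both row- and column-wise
    maxima of the assignment matrix, then take the top-k as new seeds."""
    scores = p[:-1, :-1]                       # drop the dustbin row/column
    best_b = scores.argmax(dim=1)              # best column for each row
    mutual = scores.argmax(dim=0)[best_b] == torch.arange(scores.shape[0])
    cand = torch.nonzero(mutual).squeeze(1)
    vals = scores[cand, best_b[cand]]
    top = cand[vals.argsort(descending=True)[:k]]
    return top, best_b[top]                    # new seed index lists (I_A, I_B)
```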

3.5 Loss

The seeding module only outputs seed indices, which require no gradient back-propagation; our network is thus fully differentiable and can be trained end-to-end with supervision from the indices of ground-truth matches M and unmatchable points I, J, where a point is regarded as unmatchable if there is no matchable point in the other image. From the assignment matrix used for reseeding, the final assignment matrix P, and the inlier scores w^t of the processing units, we formulate our loss in two parts:

    L = L_corr + λ Σ_t L_inlier^t,    (13)

where L_corr is the negative log-likelihood of the ground-truth matches and unmatchable points under the dustbin-augmented assignment matrices, and

    L_inlier^t = CrossEntropy(w^t, y^t)    (14)

is the cross-entropy loss for inlier/outlier binary classification in the t-th processing unit; a seed correspondence is labeled as an inlier if its epipolar distance is less than a threshold. λ is a weight that balances the two loss terms.
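The sketch below illustrates one plausible realization of Eqs. (13)-(14); the averaging scheme is our assumption, and for brevity only one assignment matrix is handled.

```python
import torch
import torch.nn.functional as F

def sgm_loss(p, gt, un_a, un_b, inlier_logits, inlier_labels, lam=1.0):
    """Loss sketch for Eqs. (13)-(14). p: (M+1, N+1) assignment matrix with
    dustbins; gt: (K, 2) ground-truth match indices; un_a/un_b: unmatchable
    keypoint indices; inlier_logits/labels: per-unit seed classifications."""
    eps = 1e-8
    nll = -(torch.log(p[gt[:, 0], gt[:, 1]] + eps).sum() +
            torch.log(p[un_a, -1] + eps).sum() +      # A-side dustbin column
            torch.log(p[-1, un_b] + eps).sum())       # B-side dustbin row
    l_corr = nll / max(len(gt) + len(un_a) + len(un_b), 1)
    l_seed = sum(F.binary_cross_entropy_with_logits(s, y)
                 for s, y in zip(inlier_logits, inlier_labels))
    return l_corr + lam * l_seed
```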

 

Feature           Matcher          CPC                  T&T                  TUM                  KITTI
                                   %Recall  #Corrs(-m)  %Recall  #Corrs(-m)  %Recall  #Corrs(-m)  %Recall  #Corrs(-m)
RootSIFT [1]      NN+RT            52.9     92 (123)    82.1     208 (287)   61.9     365 (438)   90.6     847 (928)
                  OANet [63]       58.6     119 (167)   84.7     219 (306)   62.3     454 (396)   89.0     773 (854)
                  SuperGlue [36]   61.1     218 (466)   86.8     382 (767)   65.9     655 (1037)  91.0     1261 (1746)
                  SGMNet           62.0     248 (524)   85.9     397 (789)   66.6     704 (1132)  91.2     1097 (1506)
ContextDesc [26]  NN+RT            62.4     169 (277)   85.5     222 (426)   58.7     456 (625)   90.6     1134 (1416)
                  OANet            65.3     187 (288)   86.7     294 (425)   53.2     295 (327)   89.0     791 (907)
                  SuperGlue        67.0     260 (579)   89.1     491 (695)   60.1     408 (690)   91.1     1401 (1897)
                  SGMNet           70.8     370 (616)   89.9     514 (705)   61.6     423 (705)   90.3     1204 (1724)
SuperPoint [11]   MNN              34.5     152 (421)   72.8     287 (717)   56.7     280 (420)   88.6     848 (1490)
                  OANet            62.9     186 (343)   91.2     280 (477)   61.4     332 (473)   82.2     482 (736)
                  SuperGlue        68.8     287 (719)   92.9     414 (987)   59.1     512 (1038)  88.7     957 (1777)
                  SuperGlue*       76.7     302 (712)   96.6     431 (985)   59.0     121 (177)   89.5     354 (526)
                  SGMNet           70.3     327 (829)   93.2     450 (1098)  65.8     666 (1315)  86.3     954 (1851)

Table 1: Evaluation results on FM-Bench [3], where %Recall denotes the mean recall over all pairs and #Corrs(-m) denotes the mean number of inlier correspondences after (before) RANSAC. SuperGlue* indicates results obtained with the officially released model.

3.6 Implementation Details

We train our network on the GL3D dataset [39], which covers both indoor and outdoor scenes, to obtain a general-purpose model. We sample 1k keypoints per image (and seeds accordingly) during training. We use the Adam optimizer, and the inlier score weight λ in the loss is set to 250. We use 6/3 processing blocks for the initial/refinement stage, and the gradient flow between the two stages is blocked in early training (the first 140k iterations). We use 4-head attention in both the attentional pooling and unpooling operations. For all experiments, we use a confidence threshold of 0.2 to retain matches and a seeding number proportional to the keypoint number N. More details, including training data generation and hyper-parameters, can be found in Appendix A.

4 Experiments

In the following sections, we provide experimental results of our method on a wide range of tasks, as well as further analysis of its computation and memory efficiency.

4.1 Image Matching

Datasets. The performance of our method is first evaluated on image matching tasks. Three two-view pose estimation benchmarks are used for demonstration: YFCC100M [48], FM-Bench [3] and ScanNet [9].

For YFCC100M [48], we follow the setting in OANet [63] and choose 4 sequences for testing. FM-Bench [3] comprises four subsets in different scenarios: KITTI [16] for driving settings, TUM [42] for indoor SLAM settings, and Tanks and Temples (T&T) [20] and CPC [57] for wide-baseline reconstruction tasks. ScanNet [9] is a widely used indoor reconstruction dataset. Following SuperGlue [36], we use 1500 pairs of its test set for evaluation.

Evaluation protocols. On the YFCC100M and ScanNet datasets, pose estimation is performed on the correspondences after RANSAC post-processing. We report 1) the AUC [36, 63, 61] under different thresholds, computed from the angular differences between the ground-truth and estimated rotations and translation vectors; 2) the mean matching score (M.S.) [36, 11], the ratio between the number of correct matches and the total keypoint number; and 3) the mean precision (Prec.) [36, 11] of the generated matches. We detect up to 2k keypoints for all features on YFCC100M, up to 1k keypoints for SuperPoint on ScanNet, and up to 2k keypoints for the other features.

On FM-Bench, we estimate a fundamental matrix for each evaluated pair with RANSAC post-processing, and use the normalized symmetric geometric distance (SGD) [66, 3], as originally defined in the FM-Bench paper, to measure the difference between the estimated fundamental matrix and the ground truth. An estimate is considered correct if its normalized SGD to the ground truth is lower than a threshold (0.05 by default), and the same keypoint budget is used for each test pair. Following the FM-Bench paper [3], we report 1) the recall (%Recall) of fundamental matrix estimation and 2) the mean number of correct correspondences (#Corrs(-m)) after (before) RANSAC.

 

Feature      Matcher           AUC@5°  AUC@10°  AUC@20°  M.S.   Prec.
RootSIFT     NN + RT [25]      49.07   58.76    68.58    8.23   29.79
             AdaLAM (4k) [6]   57.78   68.01    77.38    7.92   83.15
             OANet [63]        58.00   67.80    77.46    5.84   81.80
             SuperGlue* [36]   59.25   70.38    80.44    -      -
             SuperGlue         63.82   73.33    82.26    16.59  81.08
             SGMNet            62.72   72.52    81.48    17.08  86.08
ContextDesc  NN + RT           57.90   68.47    78.35    9.39   59.72
             AdaLAM (4k) [6]   60.75   70.91    80.23    9.12   85.45
             OANet             62.28   72.56    81.80    9.33   88.49
             SuperGlue         65.98   75.17    83.64    20.38  82.95
             SGMNet            66.63   76.21    84.33    20.57  87.34
SuperPoint   MNN               31.05   40.85    52.64    15.12  24.64
             AdaLAM (2k) [6]   40.20   49.03    59.11    10.17  72.57
             OANet             48.80   59.06    70.02    12.48  71.95
             SuperGlue*        67.10   76.18    84.37    21.58  88.64
             SuperGlue         60.37   70.51    80.00    19.47  78.74
             SGMNet            61.22   71.02    80.45    22.36  85.44
SIFT         PointCN [61]      47.98   58.13    68.67    -      -
             ACNe [45]         -       -        78.00    -      -
             LGLFM [10]        49.60   60.36    71.37    -      -
-            RANSAC-Flow [40]  64.88   73.31    81.56    -      -

Table 2: Results on YFCC100M [48], where AUC evaluates the pose accuracy, M.S. denotes the mean matching score and Prec. denotes the mean matching precision. SuperGlue* indicates results obtained from the original paper or the officially released model.

 

Feature      Matcher        AUC@5°  AUC@10°  AUC@20°  M.S.   Prec.
RootSIFT     NN + RT [25]   9.08    19.75    32.66    2.28   28.83
             AdaLAM [6]     8.24    18.57    31.01    3.10   47.59
             OANet [63]     10.71   23.10    37.42    3.20   36.93
             SuperGlue      13.12   27.99    43.92    8.50   42.53
             SGMNet         12.82   27.92    44.55    8.79   45.55
ContextDesc  NN + RT        11.07   23.52    37.66    5.29   28.71
             AdaLAM         8.45    19.81    33.11    6.58   44.08
             OANet          11.95   24.49    40.56    5.12   40.43
             SuperGlue      15.70   31.67    48.22    10.75  42.83
             SGMNet         15.46   31.55    48.64    9.99   48.14
SuperPoint   MNN            9.44    21.57    36.41    13.27  30.17
             AdaLAM         6.72    15.82    27.37    13.19  44.22
             OANet          10.04   25.09    38.01    10.56  44.61
             SuperGlue      13.95   29.48    46.07    15.82  44.18
             SuperGlue*     16.19   33.82    51.86    18.50  47.32
             SGMNet         15.40   32.06    48.32    16.97  48.01

Table 3: Results on ScanNet [9]. SuperGlue* indicates results obtained with the officially released model.
Figure 4: Correspondence visualizations. We showcase results with SIFT features, comparing traditional matching (MNN + RT), SuperGlue and our method (SGMNet). More visualizations are available in the Appendix.

Comparative methods. We compare our method with heuristic pruning strategies, namely the ratio test [25] and MNN, as well as various learning-based matching methods [63, 36, 45, 10, 61, 40]. These methods are applied to both handcrafted descriptors [25, 1] and learning-based local features [26, 11].

For a fair comparison, OANet, SuperGlue and SGMNet are all re-trained on the same sequences of GL3D [39], where 1k keypoints are sampled per image. Note that the official training code of SuperGlue is not available, and its public model (denoted as SuperGlue*) is trained on MegaDepth [23] and the Oxford and Paris dataset [34]. We therefore retrain SuperGlue on GL3D [39] with data selection criteria similar to those described in the original paper. This re-implementation achieves even better results on YFCC100M (Table 2) for RootSIFT than those reported in the original paper. However, some performance gap remains when using SuperPoint [11], even though we carefully tuned the training and inquired with the authors about the details. Nevertheless, we consider our re-implementation of SuperGlue [36] faithful, so it can be fairly compared. We report results of both the official model and our re-implementation.

Results. On YFCC100M, ScanNet and the two wide-baseline subsets of FM-Bench (CPC and T&T), our method mostly shows competitive results compared with the state of the art. On the two small-baseline subsets of FM-Bench (TUM and KITTI), the advantage of all learnable methods tends to diminish due to the reduced matching difficulty. Our method recovers the most inlier correspondences on almost all datasets, in terms of M.S. on YFCC100M/ScanNet and #Corrs(-m) on FM-Bench, while maintaining high matching precision, which contributes to the final pose accuracy. Though not specially trained on indoor scenarios, our method generalizes well to indoor settings.

4.2 Visual Localization

To evaluate how our method benefits real downstream applications, we integrate it into a visual localization pipeline and evaluate its performance.

Datasets. We resort to the Aachen Day-Night dataset [37] to evaluate the effectiveness of our method on the visual localization task. Aachen Day-Night consists of 4,328 reference images and 922 query images (824 daytime, 98 nighttime). All images are taken in urban scenes.

Evaluation protocols. We use the official pipeline of the Aachen Day-Night benchmark. Correspondences between reference images are first used to triangulate a 3D reconstruction. Correspondences between each query and its retrieved reference images are then generated to recover the query pose. Consistent with the official benchmark, we report the pose estimation accuracy under different thresholds. We extract up to 8k keypoints for RootSIFT and ContextDesc, and up to 4k keypoints for SuperPoint.

Results. Compared with SuperGlue, our method exhibits better results with RootSIFT and competitive results with SuperPoint and ContextDesc. Our method consistently outperforms OANet on all three descriptors. The overall performance demonstrates the generalization ability of our method in challenging real applications.

 

Feature           Matcher         0.25m, 2°  0.5m, 5°  5m, 10°
RootSIFT [1]      MNN             43.9       56.1      65.3
                  OANet [63]      69.4       83.7      94.9
                  SuperGlue [36]  63.3       80.6      98.0
                  SGMNet          70.4       85.7      98.0
ContextDesc [26]  MNN             65.3       80.6      90.8
                  OANet           74.5       86.7      99.0
                  SuperGlue       77.6       86.7      99.0
                  SGMNet          75.5       87.8      99.0
SuperPoint [11]   MNN             71.4       78.6      87.8
                  OANet           77.6       86.7      98.0
                  SuperGlue*      79.6       90.8      100.0
                  SuperGlue       76.5       88.8      99.0
                  SGMNet          77.6       88.8      99.0

Table 4: Evaluation results on the Aachen Day-Night dataset. We report the pose accuracy under different thresholds for the challenging night split. We include the results of the officially released SuperGlue model with SuperPoint (denoted as SuperGlue*).
Figure 5: Computation (a) and memory (b) efficiency of the proposed method compared with SuperGlue. We report the memory occupation averaged over batch size for training. The effect of the keypoint number on the Aachen Day-Night dataset is illustrated in (c).

4.3 Scalability

In the above experiments, the proposed method has shown competitive results against the state of the art. In this section, we demonstrate the major advantage of our method, its time/memory efficiency, compared with SuperGlue, a closely related GNN-based method.

Time/memory consumption. As shown in Fig. 5(a), the time cost of our method is remarkably lower than that of SuperGlue. Specifically, we report runtimes both with and without Sinkhorn iterations on a GTX 1080 GPU, in order to more precisely demonstrate the improvement due to the GNN design itself. It is noteworthy that with 10k keypoints and without Sinkhorn iterations, the proposed method reduces the runtime by one order of magnitude. Besides, due to the reduced redundancy, SGMNet also delivers better convergence during training (see Appendix C).

As shown in Fig. 5(b), during the test phase our method consumes half as much memory as SuperGlue for large keypoint numbers, where the major memory peaks of our method are the seeding phase and the Sinkhorn iterations. The advantage becomes even more significant during training: with a batch size of 16 and 1k keypoints, SuperGlue occupies up to 23GB of GPU memory for training, while SGMNet consumes less than 9GB.

Performance gain when using more keypoints. Within a reasonable range, a larger keypoint number generally improves the performance of downstream tasks, so a manageable matching cost is of practical significance to extend applicability. As a showcase, we vary the keypoint number of RootSIFT when evaluating on the Aachen Day-Night dataset. As can be seen from Fig. 5(c), the accuracy of both SGMNet and SuperGlue increases as more keypoints are used. Considering its efficiency advantage, SGMNet delivers a better trade-off when increasing the keypoint number. We also provide an SfM experiment, a typical keypoint-consuming application, in Appendix D.3.

 

Matcher                AUC@20°  M.S.   Prec.
NN + RT                68.58    10.05  56.38
SGMNet w/ Rand. Seed   71.25    12.94  55.57
SGMNet w/o W.U.        78.64    17.07  81.26
SGMNet w/o A.P.        79.35    17.11  82.15
SGMNet w/o Reseeding   80.41    17.12  84.47
SGMNet full            81.48    17.08  86.08

Table 5: Results of the ablation study. w/o A.P. stands for without attentional pooling, where seed features are sent directly to the seed filtering process without attending to the original keypoints. w/o W.U. stands for without weighted unpooling, where vanilla attention is performed in the unpooling process. Rand. Seed means selecting seed correspondences randomly instead of picking the top-k scores. w/o Reseeding means only the initial seeding is used.

5 Discussions

5.1 Ablation Study

To evaluate the effectiveness of the different components of our method, we conduct an ablation study on YFCC100M using RootSIFT. As shown in Table 5, every component of our network contributes notably to the final performance. In particular, seeding reliable matches plays an important role, which further shows that seed matches are able to guide messages across images for robust matching.

 

Type         Matcher       AUC@5°  AUC@10°  AUC@20°  Time(ms)
GNN          SGMNet-10     66.62   76.33    84.71    114.81
             SGMNet        67.42   76.63    84.66    284.66
             SuperGlue-10  65.85   75.35    84.05    458.52
             SuperGlue     66.65   75.41    84.12    604.11
Filter       OANet         63.75   73.60    82.43    21.30
             ACNe          63.37   73.67    82.74    18.26
Handcrafted  AdaLAM        57.78   68.01    77.38    4.59
             GMS           26.15   33.53    42.13    5.44

Table 6: Results on YFCC100M using 4k SIFT features. "-10" means setting the number of Sinkhorn iterations to 10 instead of 100.

5.2 Comparison with Filter-based Methods

For a well-rounded comparison, we provide in Table 6 more experimental results against filter-based (outlier rejection) methods. When using 4k keypoints, SGMNet achieves the best performance among all comparative methods while running 4 times faster than SuperGlue when the Sinkhorn iteration number is set to 10. Despite the fast inference speed of state-of-the-art filter-based methods, a considerable performance gap remains compared with GNN-based methods.

5.3 Designs of SGMNet

We experiment with other designs for SGMNet, including other seeding strategies and pooling operations from GNN/Transformer architectures [52, 55, 62], e.g., DiffPool [63, 62] and the Set Transformer [21]. We find that 1) learnable seeding achieves limited improvement over our simple heuristic seeding strategy, and 2) other general pooling operations, which are usually verified on self-attention in GNN/Transformer architectures, are not effective alternatives to our seed-based pooling. Details of these experiments can be found in Appendix B. A hyper-parameter study on the seeding number can be found in Appendix D.

6 Conclusion

In this paper, we propose SGMNet, a novel graph neural network for efficient image matching. The new operations we develop enable message passing with a compact attention pattern. Experiments on different tasks and datasets show that our method brings the accuracy of feature matching and downstream tasks to a level competitive with or higher than the state of the art, at a modest computation/memory cost.

Acknowledgment. This work is supported by Hong Kong RGC GRF 16206819, 16203518, T22-603/15N and Guangzhou Okay Information Technology with the project GZETDZ18EG05.

Supplementary Appendix

A Implementation Details

We provide in this part details about our implementation as well as the experiment settings.

A.1 Training Details

Training data. We use the GL3D dataset to generate training data. GL3D is originally built on 3D reconstructions of 543 different scenes, including landmarks and small objects; in its latest version, 713 additional sequences of internet tourism photos are added.

We sample 1000 pairs for each sequence and filter out pairs that are either too hard or too easy for training. More specifically, we use the common track ratio provided by the original dataset and the rotation angle between cameras to determine pair difficulty; pairs whose common track ratio and rotation angle fall within preset ranges are kept.

We reproject keypoints between images with depth maps and use the reprojection distances to determine ground-truth matches and unmatchable points. More specifically, a keypoint is labeled as unmatchable if its reprojection distances to all keypoints in the other image are larger than 10 pixels, while a pair of keypoints that are mutually nearest after reprojection and have a reprojection distance lower than 3 pixels is considered a ground-truth match. We further filter out pairs with fewer than 50 ground-truth matches.
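The labeling rule can be sketched as follows (our own illustrative helper), given the matrix of reprojection distances between the keypoints of a pair:

```python
import numpy as np

def label_matches(dist, match_thresh=3.0, unmatch_thresh=10.0):
    """Ground-truth labeling from a matrix of reprojection distances (pixels)
    between the keypoints of two images (illustrative helper)."""
    m = dist.shape[0]
    nn_b = dist.argmin(axis=1)                       # nearest point in B per A
    mutual = dist.argmin(axis=0)[nn_b] == np.arange(m)
    matched = mutual & (dist[np.arange(m), nn_b] < match_thresh)
    gt = np.stack([np.nonzero(matched)[0], nn_b[matched]], axis=1)
    unmatch_a = np.nonzero(dist.min(axis=1) > unmatch_thresh)[0]
    unmatch_b = np.nonzero(dist.min(axis=0) > unmatch_thresh)[0]
    return gt, unmatch_a, unmatch_b                  # matches, unmatchables
```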

Our data generation protocol yields around 400k training pairs in total.

Figure 6: Visualizations of raw putative matches (left), seed correspondences (middle) and matches obtained by SGMNet (right). Note that even with heavily noisy seeds, SGMNet is capable of discovering the underlying patterns, which are leveraged to guide message passing across keypoints for robust and accurate matching.

Training parameters. We use the Adam optimizer for training. We apply learning rate decay after 300k iterations with a decay rate of 0.999996 per iteration until 900k iterations. We block the gradient flow between the initial/refinement stages for the first 140k iterations.

A.2 Network Details

As in SuperGlue [36], we apply the multi-head attention mechanism with 4 heads for all weighted attention operations in our network.

For the initial Seeding Module, we apply an additional mutual nearest neighbour check and ratio test to the seed correspondences. NMS is also employed in seeding. The NMS radius is set adaptively as r = (λ/|K|) Σ_{i∈K} min_{j≠i} d(p_i, p_j), where K is the index set of all keypoints, d(p_i, p_j) is the distance between keypoints p_i and p_j, |·| denotes set size, and λ is a hyper-parameter set identically for all experiments.

For the Reseeding Module, we apply the Sinkhorn algorithm with 10 iterations on the correlation matrix to obtain the assignment matrix. Correspondences with top-k scores along both dimensions are sampled as seeds. NMS is also applied in the reseeding module.

For the inlier likelihood predictor in the Seeded GNN, we use the structure illustrated in Fig. 7 (bottom).

We set the number of Sinkhorn iterations to 100.

A.3 Experiment Settings

We use the OpenCV implementation of SIFT and the official implementations of SuperPoint and ContextDesc. For ContextDesc, we use the latest public model (denoted ContextDesc++upright in the official GitHub repository).

YFCC100M. We extract SIFT and ContextDesc from images at their original resolution, and resize images so that the longest dimension is 1600 pixels to extract SuperPoint. We use the OpenCV findEssentialMat and recoverPose functions to recover relative poses, with the threshold of the embedded RANSAC set to 1 pixel at the resolution of the resized images. For matching score and precision, matches below a fixed epipolar distance threshold are counted as inliers.

ScanNet. We resize images to 640×480 resolution to extract keypoints, and use the same protocol as on YFCC100M to recover relative poses, matching scores and precision.

FM-Bench. The original evaluation pipeline of FM-Bench is based on Matlab; we reimplement it in Python. The evaluation parameters are consistent with the original implementation. We use the OpenCV function for fundamental matrix estimation and set the threshold of the embedded RANSAC to 1 pixel. Compared with the original implementation, our evaluation pipeline tends to yield higher accuracy, especially on the wide-baseline subsets. We attribute this to the better performance of the OpenCV function and consider it beneficial for a more precise evaluation.

Aachen Day-Night. We use the official pipeline and default parameters for evaluation. We extract upright features for both RootSIFT and ContextDesc.

B Designs of SGMNet

Despite the effectiveness of SGMNet, we provide in this part our experimental results and analysis of some other potential designs.

Designs          AUC@5°  AUC@10°  AUC@20°  M.S.   Prec.  Prec.(S)
Learned Seeding  62.80   72.55    81.09    17.25  85.15  51.22
DiffPool         50.85   60.50    69.68    10.18  61.52  -
ISA              52.11   61.44    70.03    10.65  63.24  -
SGMNet           62.72   72.52    81.48    17.08  86.08  39.24
NN+RT            49.07   58.76    68.58    10.05  56.38  -

Table 7: Evaluation results on YFCC100M using RootSIFT with different designs. Prec.(S) denotes the precision of the seed correspondences.
Figure 7: The learned seeding module (top) and the PointCN structure (bottom) we use to construct it. CN/BN denote context normalization [61] and batch normalization [18] blocks, annotated with their number of input channels.

Learned seeding. After obtaining the initial nearest neighbour correspondences, we employ a lightweight permutation-invariant network (architecture illustrated in Fig. 7) to predict each correspondence's inlier likelihood score, as done in previous works [63, 61]. We then sample the correspondences with top-k inlier scores, instead of top-k ratio scores, as seed correspondences.

Results. We report the results in Table 7. Although applying a lightweight PointCN block for inlier seed prediction slightly increases the seeding precision, the gain is not enough to bring a meaningful impact at the level of pose estimation.

In general, we are open to the possibility of increasing the matching quality with more sophisticated seeding strategies. However, targeting efficient matching, our seeding method achieves a good balance between performance and cost.

The critical component of our method for efficient message passing is essentially the pooling of original keypoints. In this part, we apply two well-studied pooling designs from the GNN and transformer literature, namely DiffPool [62] and Induced Set Attention [21], to the image matching task and evaluate their performance.

DiffPool. As a pooling operation for GNNs, DiffPool predicts an assignment matrix based on each node's embedding in the graph, and is designed to build hierarchical and sparse graph representations [62]. In OANet [63], DiffUnpool, the counterpart of DiffPool, is proposed to recover node clusters to the original size. As an experiment, we apply DiffPool/DiffUnpool to the keypoint graph of each image as a substitute for our proposed attentional pooling. Cross/self attention [36] is performed on the pooled clusters for message exchange.

Induced Set Attention. Induced set attention (ISA) was first proposed in the Set Transformer [21]. Different from using seed features as the attention bottleneck, ISA adopts a set of learned but fixed features (induced points) as the attention bottleneck between set elements, and has only been verified on self-attention for sparse input. We substitute our seed-based attention with ISA. More specifically, we let the network learn induced points for both sides and let the induced points attend to the original keypoints (both cross and self) to perform message passing.

Results. We report the evaluation results on YFCC100M. For all pooling methods, we set the pooling number to 128 and extract up to 2k keypoints. As illustrated in Table 7, both DiffPool and ISA show only marginal improvements over the baselines, which indicates that applying pooling methods designed for generic GNNs/efficient transformers is not necessarily effective for image matching tasks, and further proves that our seed-based attentional pooling/unpooling operations are critical to the success of our method.

C Fast Convergence

The compactness of SGMNet not only cuts down the computation/memory complexity but also leads to faster convergence in training. In Fig. 8 we plot the training curves of both SGMNet and SuperGlue. As illustrated, SGMNet takes fewer iterations to reach convergence.

Figure 8: Convergence curve.

D Additional Experiment Results

Figure 9: The effect of the seed number when varying the keypoint number. Numbers in the grid are exact AUC@20° using RootSIFT.
Figure 10: Relationship between pose estimation accuracy and seed precision. Note that SGMNet maintains a high matching quality even with a seed precision of only 20%.
Figure 11: Reconstruction results of vanilla nearest neighbour matching (left) and SGMNet (right). The completeness of the reconstruction is determined by the matching quality between some critical frames. In this case, NN matching fails to generate decent correspondences between the tall building (b) and the statue (c), and between the tall building (b) and the remains (a), which results in an incomplete reconstruction, while SGMNet registers these critical frames successfully.

D.1 Impact of Seeding Number

SGMNet requires selecting a set of seed correspondences, the number of which influences not only our method's efficiency but also its accuracy. It is therefore important to investigate the impact of the seeding number. We conduct a careful grid search on YFCC100M over different keypoint and seeding numbers.

As illustrated in Fig. 9, an approximately proportional relationship between keypoint and seed numbers yields the best performance: too many seeds may deliver less reliable guidance, while seeding too few correspondences results in severe information loss.

Methods                #Registered Images  #Sparse Points  Mean Repro. Error  Mean Track Len.  Matching Time  Total Time
NN+Ratio+Mutual Check  799                 132265          0.58px             11.27            1h 13min       2h 22min
SuperGlue              916                 223950          0.95px             10.93            42h 34min      44h 06min
SGMNet                 943                 276240          1.10px             10.73            5h 37min       7h 56min

Table 8: SfM results for the Alamo scene of the 1DSFM dataset.

D.2 Robustness to Seed Noise

To evaluate the robustness of our method w.r.t. potentially false seeds, we conduct experiments on YFCC100M. More specifically, for each pair we select a set of inlier matches, determined using the ground truth, and pad them with randomly sampled noise to construct seed correspondences of different precisions. We feed these pre-selected seed correspondences to the Seeded GNN directly, instead of applying the Seeding Module.

As shown in Fig. 10, SGMNet maintains high matching quality even with heavily noisy seed correspondences. It is noteworthy that for SGMNet without weighted unpooling, the pose estimation accuracy degenerates more rapidly as the seed precision decreases, which indicates lower robustness to seed noise. More visualizations of noisy seeds and the corresponding matching results can be found in Fig. 6.

D.3 SfM Experiment

A typical Structure-from-Motion (SfM) pipeline usually involves extracting a large number of keypoints (e.g. 8k) and matching among hundreds or thousands of images to obtain highly accurate poses and more complete reconstructions. In this section, we embed different matching methods into the COLMAP SfM pipeline for comparison. We reconstruct the challenging Alamo scene from 1DSFM [57], which involves 2915 images taken under very different illumination conditions. 8k RootSIFT features are extracted for each image, and we set the Sinkhorn iteration number to 10 for both SGMNet and SuperGlue. We use a GTX 1080 GPU to perform matching sequentially. Apart from the common statistics of the reconstruction, we also report the time consumption of matching and of the whole SfM pipeline.

As shown in Tab. 8, both SuperGlue and SGMNet produce much more complete reconstructions than vanilla NN matching with heuristic pruning. However, SuperGlue largely lengthens the whole SfM pipeline, while our method keeps the matching time at a feasible level.

E More Visualizations

See Fig. 12 and Fig. 6.

Figure 12: More visualizations.

References

  • [1] Relja Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
  • [2] Xuyang Bai, Zixin Luo, Lei Zhou, Hongkai Chen, Lei Li, Zeyu Hu, Hongbo Fu, and Chiew-Lan Tai. Pointdsc: Robust point cloud registration using deep spatial consistency. In CVPR, 2021.
  • [3] Jia-Wang Bian, Yu-Huan Wu, Ji Zhao, Yun Liu, Le Zhang, Ming-Ming Cheng, and Ian Reid. An evaluation of feature matchers for fundamental matrix estimation. In BMVC, 2019.
  • [4] Eric Brachmann and Carsten Rother. Neural-guided ransac: Learning where to sample model hypotheses. In ICCV, 2019.
  • [5] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
  • [6] Luca Cavalli, Viktor Larsson, Martin Ralf Oswald, Torsten Sattler, and Marc Pollefeys. Handcrafted outlier detection revisited. In ECCV, 2020.
  • [7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv, 2019.
  • [8] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
  • [9] Angela Dai, Matthias Nießner, Michael Zollöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics 2017 (TOG), 2017.
  • [10] François Darmon, Mathieu Aubry, and Pascal Monasse. Learning to guide local feature matches. In 3DV, 2020.
  • [11] D. DeTone, T. Malisiewicz, and A. Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPRW, 2018.
  • [12] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In CVPR, 2019.
  • [13] Matthias Fey, Jan E. Lenssen, Christopher Morris, Jonathan Masci, and Nils M. Kriege. Deep graph matching consensus. In ICLR, 2020.
  • [14] Donniell E. Fishkind, Sancar Adali, Heather G. Patsolic, Lingyao Meng, Digvijay Singh, Vince Lyzinski, and Carey E. Priebe. Seeded graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [15] Catherine FRAIKIN and Paul Van Dooren. Graph matching with type constraints on nodes and edges. Dagstuhl Seminar Proceedings, 2007.
  • [16] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.
  • [17] M. Goesele, B. Curless, and S.M. Seitz. Multi-view stereo revisited. In CVPR, 2006.
  • [18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [19] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
  • [20] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 2017.
  • [21] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
  • [22] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In ICCV, 2005.
  • [23] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
  • [24] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In ICLR, 2018.
  • [25] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [26] Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Contextdesc: Local descriptor augmentation with cross-modality context. In CVPR, 2019.
  • [27] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. Geodesc: Learning local descriptors by integrating geometry constraints. In ECCV, 2018.
  • [28] Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Aslfeat: Learning local features of accurate shape and localization. In CVPR, 2020.
  • [29] Vince Lyzinski, Daniel L. Sussman, Donniell E. Fishkind, Henry Pao, Li Chen, Joshua T. Vogelstein, Youngser Park, and Carey E. Priebe. Spectral clustering for divide-and-conquer graph matching. Parallel Computing, 2015.
  • [30] Josef Maier, Martin Humenberger, Markus Murschitz, Oliver Zendel, and Markus Vincze. Guided matching based on statistical optical flow for fast and robust correspondence analysis. In ECCV, 2016.
  • [31] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In NeurIPS, 2017.
  • [32] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 2015.
  • [33] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  • [34] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
  • [35] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: Repeatable and reliable detector and descriptor. In NeurIPS, 2019.
  • [36] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.
  • [37] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6dof outdoor visual localization in changing conditions. In CVPR, 2018.
  • [38] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
  • [39] Tianwei Shen, Zixin Luo, Lei Zhou, Runze Zhang, Siyu Zhu, Tian Fang, and Long Quan. Matchable image retrieval by learning from surface reconstruction. In ACCV, 2018.
  • [40] Xi Shen, François Darmon, Alexei A Efros, and Mathieu Aubry. Ransac-flow: generic two-stage image alignment. ECCV, 2020.
  • [41] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
  • [42] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, 2012.
  • [43] K. Sun, W. Tao, and Y. Qian. Guide to match: Multi-layer feature matching with a hybrid gaussian mixture model. IEEE Transactions on Multimedia, 2020.
  • [44] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Acne: Attentive context normalization for robust permutation-equivariant learning. In CVPR, 2020.
  • [45] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Acne: Attentive context normalization for robust permutation-equivariant learning. In CVPR, 2020.
  • [46] Shitao Tang, Chengzhou Tang, Rui Huang, Siyu Zhu, and Ping Tan. Learning camera localization via dense scene matching. In CVPR, 2021.
  • [47] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In ICML, 2020.
  • [48] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 2016.
  • [49] Yurun Tian, Bin Fan, and Fuchao Wu. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In CVPR, 2017.
  • [50] Lorenzo Torresani, Vladimir Kolmogorov, and Carsten Rother. Feature correspondence via graph matching: Models and global optimization. In ECCV, 2008.
  • [51] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988.
  • [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [53] Yannick Verdie, Kwang Moo Yi, Pascal Fua, and Vincent Lepetit. Tilde: A temporally invariant learned detector. In CVPR, 2015.
  • [54] Runzhong Wang, Junchi Yan, and Xiaokang Yang. Learning combinatorial embedding networks for deep graph matching. In ICCV, 2019.
  • [55] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. In arXiv, 2020.
  • [56] Z. Wang and S. Ji. Second-order pooling for graph neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [57] Kyle Wilson and Noah Snavely. Robust global translations with 1dsfm. In ECCV, 2014.
  • [58] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
  • [59] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In CVPR, 2019.
  • [60] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In ECCV, 2016.
  • [61] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In CVPR, 2018.
  • [62] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NeurIPS, 2018.
  • [63] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In ICCV, 2019.
  • [64] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. In BMVC, 2020.
  • [65] Jingyang Zhang, Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Learning stereo matchability in disparity regression networks. In ICPR, 2020.
  • [66] Zhengyou Zhang. Determining the epipolar geometry and its uncertainty: A review. IJCV, 1998.
  • [67] Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, and Long Quan. Kfnet: Learning temporal camera relocalization using kalman filtering. In CVPR, 2020.