Image Retrieval for Structure-from-Motion via Graph Convolutional Network

09/17/2020 · by Shen Yan, et al.

Conventional image retrieval techniques for Structure-from-Motion (SfM) suffer from a limited ability to recognize repetitive patterns and cannot guarantee to create just enough match pairs with both high precision and high recall. In this paper, we present a novel retrieval method based on a Graph Convolutional Network (GCN) to generate accurate pairwise matches without costly redundancy. We formulate the image retrieval task as a node binary classification problem on graph data: a node is marked as positive if it shares scene overlap with the query image. The key idea is that the local context in feature space around a query image contains rich information about the matchable relation between this image and its neighbors. By constructing a subgraph surrounding the query image as input data, we adopt a learnable GCN to determine whether nodes in the subgraph have overlapping regions with the query photograph. Experiments demonstrate that our method performs remarkably well on a challenging dataset of highly ambiguous and duplicated scenes. Moreover, compared with state-of-the-art matchable retrieval methods, the proposed approach significantly reduces useless attempted matches without sacrificing the accuracy and completeness of reconstruction.




1 Introduction

Contemporary Structure-from-Motion (SfM) systems [1, 2, 3] widely employ image retrieval techniques to relieve the heavy computational burden of the image matching process, on the assumption that only image pairs with high visual similarity are likely to match. Retrieval methods for SfM are commonly implemented in two steps: Step 1, map every image in the dataset to an individual vector via an embedding function; Step 2, for each query image, find its nearest neighbors under a certain similarity metric between the quantized vectors. A variety of approaches have been developed for these two steps. For example, in Step 1, vocabulary tree models [4, 5] or CNN-based approaches [6, 7] have been proposed to describe image features as a whole. In Step 2, KD-Tree [8] or Ball-Tree [9] is often adopted to accelerate the approximate search. However, while these indexing techniques have shown promising results in filtering unnecessary matches, they are still underachieving.
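The two-step pipeline above can be sketched in a few lines. This is a minimal illustration with assumed names (`embed_fn` stands in for any vocabulary-tree or CNN embedding), using plain cosine similarity instead of an accelerated KD-Tree or Ball-Tree index:

```python
import numpy as np

def embed_images(images, embed_fn):
    """Step 1: map every image to a global descriptor via some embedding
    function (hypothetical embed_fn) and L2-normalize it."""
    feats = np.stack([embed_fn(im) for im in images])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def retrieve_top_k(feats, query_idx, k):
    """Step 2: rank by cosine similarity and return the k nearest
    neighbors of the query (excluding the query itself)."""
    sims = feats @ feats[query_idx]
    order = np.argsort(-sims)
    return [i for i in order if i != query_idx][:k]
```

The retrieved indices would then be handed to the matcher as candidate pairs; the paper's point is precisely that a fixed `k` here is hard to choose well.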

Figure 1: Basic idea of our method. (a) This paper performs matchable image retrieval for SfM. (b-c) Direct retrieval with fixed hyper-parameters (retrieval number or similarity threshold). (d) Our idea: use a GCN to learn the surrounding local context for retrieval prediction.

We believe that former retrieval techniques face two major challenges. First, the embedding function in Step 1 is vulnerable to symmetric or repetitively textured patterns in ambiguous scenes. The embedding function, whether trained as a vocabulary tree or a CNN model, cannot make features extracted from ambiguous structures exhibit a notable difference. Such visually similar yet distinct patterns are therefore incorrectly identified as overlapping in Step 2. To make matters worse, these misidentified pairs not only deceive the retrieval algorithm but can also pass two-view geometric verification and form erroneous pairwise epipolar geometry. The false matches significantly mislead the reconstruction and give rise to incomplete, folded structures or the total collapse of SfM. As a result, a sufficient subset of matches that contains no potentially wrong epipolar geometry is preferable for 3D reconstruction to a redundant set that may include incorrect matching pairs.

Second, it is extremely difficult to set up exactly sufficient match pairs for SfM. Previous research usually tries to achieve this by empirically adjusting the number of retrieved items or the similarity threshold score in Step 2. The problem is that, on the one hand, conservative hyper-parameter settings (a small retrieval number or a strict similarity threshold) cause true positive matches to be missed, which may lead to a decline in the completeness of SfM or even model disconnection. On the other hand, permissive settings (a large retrieval number or a loose threshold) bring in false positive matches, which inevitably result in inefficient and inaccurate SfM. Moreover, as the density of scene coverage varies, it is improper to assume that all views share a consistent number of retrieved items.

To mitigate the two challenges mentioned above, we introduce a novel retrieval method based on a Graph Convolutional Network (GCN) to generate accurate pairwise matches without costly redundancy. The framework of the proposed method can be summarized as follows.

For a query image, we build a Query Enclosing Subgraph (QES) around it to bring in candidate retrieved items and depict its local context. The motivation behind this design is that the similarity likelihood between a node and its neighbors can be effectively reasoned from its local topological information [10, 11, 12]. Then, we adopt a Graph Convolutional Network (GCN) that learns to integrate valuable context knowledge and classifies nodes in the QES with a positive or negative label. All positive samples are regarded as sharing scene overlap with the query image. Note that in practice we only consider the nearest neighbors of the query image as candidate retrieved items, since only a few matches need to be retrieved for SfM.

Since our algorithm grasps the context information provided by the QES, a similarity that cannot be computed in image feature space can be effectively calculated in topological space. Symmetric and repetitively textured patterns can be successfully distinguished, as they reveal different properties in the latter space. Besides, because the GCN model directly returns prediction results for candidate index images, it is no longer necessary to carefully select hyper-parameters (retrieval number or similarity threshold). We show that the retrieved items inferred by the GCN model essentially cover all match pairs required for SfM without containing much redundancy. The main idea of the proposed method is illustrated in Fig 1.

Our main contributions could be summarized as follows:

  1. We convert the problem of matchable image retrieval into a node binary classification problem in subgraphs, which helps overcome the difficulty of ambiguous scenes.

  2. We propose a learnable GCN to automatically predict the matchable relationship between candidate image pairs; the generated pairwise matches prove to be just sufficient for SfM.

  3. We conduct extensive experiments on various kinds of 3D reconstruction datasets and compare our approach with vocabulary tree and CNN-based models. Our method outperforms state-of-the-art retrieval methods on a challenging ambiguous dataset and can offer precisely enough matchable pairs for SfM.

2 Related Work

2.1 Image Retrieval Techniques for Structure-from-Motion

Vocabulary tree [4, 5] is the most extensively used technique to rank the images in a dataset given a query photo, and has been implemented by most publicly available SfM pipelines [1, 2, 13, 3, 14] as a preemptive pruning step. A vocabulary tree is typically learned by hierarchically clustering the local feature descriptors of all images in the dataset. Then, Term Frequency-Inverse Document Frequency (TF-IDF) is utilized to efficiently score image similarity with inverted files. The research community has reached a higher level of maturity by improving the quantization procedure [15, 16, 17], adopting compact representations [18, 19, 20, 21, 22], incorporating geometric cues [23, 16, 15, 24], and applying query expansion [25, 26, 27]. Although vocabulary trees help SfM pipelines reduce computation cost, substantial memory footprints are still required during both the construction and indexing processes.

Recent developments [28, 29, 30, 31] illuminate that Convolutional Neural Networks (CNN) offer an attractive alternative for image encoding with a small memory footprint. Object retrieval has already applied deep CNN descriptors to represent images. However, most CNN-based object retrieval methods build on the assumption that images share salient semantic regions such as landscapes or architecture. In real 3D reconstruction, many photographs merely serve as bridges to connect partial scenes, with discontinuous or even no semantically meaningful regions.

Radenović et al. [6] and Shen et al. [7] specialize in the matchable image retrieval task for 3D reconstruction. They employ state-of-the-art reconstruction algorithms to rebuild 3D models, which are re-projected onto images to generate ground-truth supervision. These training data ensure that images are retrieved according to scene overlap rather than semantic similarity. Radenović et al. adopt a siamese architecture with a contrastive loss [32]; in addition, they introduce learned whitening and R-MAC [30] to improve performance. Shen et al. employ a triplet-loss [33, 34] architecture, with a pre-matching regional code (PRC) that boosts accuracy at the expense of efficiency.

In summary, the matchable image retrieval task in 3D reconstruction has evolved from vocabulary trees to CNN-based methods. The central component of previous work is an embedding function that maps images into a compact feature space. However, these methods consider only visual information, so ambiguous patterns cannot be convincingly differentiated. Besides, previous research ignores how to acquire exactly enough match pairs for SfM.

2.2 Graph Convolutional Network (GCN)

Recently, there has been increasing interest in extending deep learning approaches to graph data [35, 36, 37, 38, 39, 40, 41], such as e-commerce, social networks and molecular chemistry. Analogous to applying CNNs to Euclidean data, GCNs were proposed to deal with irregular graph data. According to the definition of convolution on graph structures, GCNs are divided into two main streams: spectral-based approaches [35, 36, 37] and spatial-based approaches [38, 39, 40, 41]. Spectral-based GCNs develop graph convolution based on Graph Fourier Transform theory, while spatial-based GCNs directly perform manually defined convolutions based on a node's spatial relations.

GCNs have many applications across different tasks and domains, including node classification [42] and link prediction [43]. In fact, link prediction can be regarded as a binary classification problem. Traditional methods calculate the linkage likelihood between two given nodes by developing carefully designed heuristics [44, 45, 46, 47]. However, a significant limitation of these heuristics is that they lack universal applicability to different kinds of graphs. Zhang and Chen therefore propose a Weisfeiler-Lehman Neural Machine [10] and a graph neural network [11] to learn general subgraph structure for linkage likelihood computation. Based on their work, Wang et al. further propose a linkage-based face clustering algorithm [12], utilizing potential identities to group a set of faces. These methods are closely related to ours, since we solve the image retrieval problem by adopting a GCN to infer the matchable relation between a query image and its neighbors.

3 Proposed Approach

3.1 Overview

Assume that we have a collection of unordered images with geometric overlaps. For each query image, we aim to find an index set of retrieved items. To find this retrieval set, one typical pipeline first maps the images into a compact feature space via an embedding function, and then searches the nearest neighbors of the query under a defined similarity measure.


However, when images are acquired from a site with highly ambiguous structure, this methodology fails. In what follows we provide an example for illustration. Fig 2(a) shows an extremely symmetric dataset, Temple-of-Heaven, which is composed of 341 rotationally symmetric images. In Fig 2(b-c), we observe that some photos taken at quite different positions look extremely similar. This phenomenon explains why it is unreliable to retrieve images by visual features alone. Fortunately, in Fig 2(d), we notice that although both image pairs stay close in feature space, they have totally different distances in topological space. Intuitively, we consider utilizing the local context to supply extra information for boosting matchable retrieval performance.

Figure 2: Example of Temple-of-Heaven dataset.

Suppose the image collection has already been embedded into feature vectors; we treat each feature vector as a graph node. For a query node, the primary task is to exploit some data structure that describes its local context. We regard the query node as a center and build a subgraph around it, called the Query Enclosing Subgraph (QES). The construction of the QES is described in detail in Section 3.2. Given a QES as input data, we then deploy a Graph Convolutional Network (GCN) on it for node binary classification, and the network directly outputs the retrieved items (those marked as positive). The mechanism of the GCN is presented in Section 3.3.

3.2 Construction of Query Enclosing Subgraph

For a query node, the Query Enclosing Subgraph (QES) is represented as G = (V, E), where V is the set of nodes and E is the set of undirected edges. Let v_i ∈ V denote a node, (v_i, v_j) ∈ E denote an undirected edge between v_i and v_j, and n = |V| denote the number of nodes. The adjacency matrix A is an n × n matrix with A_ij = 1 if (v_i, v_j) ∈ E and A_ij = 0 otherwise. G carries node attributes in the form of a node feature matrix X, whose i-th row x_i is the feature vector of node v_i. As a QES consists of three data types, namely nodes V, edges E and features X, we correspondingly construct it in the three stages illustrated in Fig 3.
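As a small illustration of these definitions, the adjacency matrix can be materialized directly from the edge set. The helper below is a sketch, not the authors' code; node indices are local to the subgraph:

```python
import numpy as np

def adjacency_matrix(num_nodes, edges):
    """Build the symmetric adjacency matrix of a QES:
    entry (i, j) is 1 if an undirected edge (i, j) exists, else 0."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # undirected: mirror the entry
    return A
```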

Figure 3: Construction of Query Enclosing Subgraph (QES) with three stages.

3.2.1 Stage 1: Node discovery

We use N(v) to denote the set of nearest neighbors (NNs) of a node v, which are searched through Equation 1. For the query node q, we first add its 1-hop NNs, the k₁ nearest neighbors of q, to the unordered node set V. Then, the k₂ NNs of each 1-hop node are iteratively added to V as 2-hop nodes; k₁ and k₂ denote the number of nearest neighbors in the first and second hop, respectively. Although this chain could be extended further, we only sample NNs up to 2 hops, because a 2-hop QES already covers all the information needed to calculate any first- and second-order heuristics for link prediction. In addition, note that the query node q itself is excluded from V.

3.2.2 Stage 2: Append edges among nodes

Assuming we have obtained the node set V from Stage 1, the next step is to append edges among the nodes. We traverse every node v ∈ V and search its NNs among all nodes in the original, entire dataset. If such a neighbor u also appears in V, we insert an undirected edge (v, u) into the edge set E.
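Stages 1 and 2 can be summarized in the following sketch. The names and the cosine-similarity NN search are assumptions for illustration (the paper searches NNs via its Equation 1); `k1`, `k2` and `k_edge` stand for the hop-wise and edge-appending neighbor numbers:

```python
import numpy as np

def knn(feats, idx, k):
    """k nearest neighbors of node idx by cosine similarity
    (feats are assumed L2-normalized; the node itself is excluded)."""
    sims = feats @ feats[idx]
    sims[idx] = -np.inf
    return list(np.argsort(-sims)[:k])

def build_qes_nodes_edges(feats, q, k1, k2, k_edge):
    """Stage 1: collect the k1 1-hop NNs of query q, then the k2 NNs of
    each 1-hop node (2-hop), excluding q itself.  Stage 2: for every
    collected node, link it to any of its k_edge NNs (searched over the
    entire dataset) that also lie inside the subgraph."""
    nodes = knn(feats, q, k1)                       # 1-hop
    for v in list(nodes):                           # 2-hop
        for u in knn(feats, v, k2):
            if u != q and u not in nodes:
                nodes.append(u)
    node_set, edges = set(nodes), set()
    for v in nodes:                                 # Stage 2: append edges
        for u in knn(feats, v, k_edge):
            if u in node_set and u != v:
                edges.add(tuple(sorted((v, u))))
    return nodes, edges
```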

3.2.3 Stage 3: Node feature calculation

The embedding function used to extract the global image feature is a pre-trained CNN-based model [7]. We assign each node v_i its extracted feature vector f_i. In order to make the nodes share information with the query node q, we uniformly subtract the query feature f_q from the node features; each row of the final feature matrix X is thus given by x_i = f_i − f_q.

Figure 4: Overall architecture of GCN.

3.3 Graph Convolutional Network on QES

After completing the construction of the QES for a query node, we apply a Graph Convolutional Network (GCN) on it to perform retrieval. The GCN determines whether a node in the QES is positive (should be retrieved for the query node) or negative (should not). Specifically, we introduce the adopted GCN in two aspects: the graph convolutional layer and the overall architecture.

3.3.1 Graph Convolutional Layer

The graph convolutional layer basically follows GCN [37] with slight modifications; it takes the node feature matrix X together with the adjacency matrix A as input and outputs a filtered feature matrix Y.

A graph convolutional layer first encapsulates each node's hidden representation by aggregating feature information from its neighbors. This operation is achieved by left-multiplying X by an aggregation matrix G. The aggregation matrix is defined as G = D⁻¹A, where D is a diagonal degree matrix with D_ii = Σ_j A_ij. Then we concatenate the feature matrix X with the aggregated feature matrix GX along the feature dimension. After feature aggregation, a non-linear transformation with a learnable weight matrix W is applied to the result. Formally, a graph convolutional layer in our paper has the following formulation:

Y = σ([X ∥ GX] W),

where the operator ∥ represents matrix concatenation along the feature dimension and σ is a non-linear activation function.
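A minimal numpy sketch of one such layer, assuming the mean-aggregation reading G = D⁻¹A of the text (W is a learned matrix of shape 2·d_in × d_out; ReLU is used as the default activation):

```python
import numpy as np

def graph_conv_layer(X, A, W, activation=lambda t: np.maximum(t, 0.0)):
    """One layer: aggregate neighbor features with G = D^-1 A
    (D_ii = sum_j A_ij), concatenate [X || GX] along the feature axis,
    then apply the learned weights W and a non-linearity."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0          # guard against isolated nodes
    GX = (A @ X) / deg           # row-wise mean of neighbor features
    return activation(np.concatenate([X, GX], axis=1) @ W)
```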

3.3.2 Overall Architecture

The proposed GCN model can be regarded as a combination of two components, a feature extraction part and a node classification part, as shown in Fig 4. For the feature extraction part, the main block is a stack of four graph convolutional layers activated by the ReLU function. By stacking four layers, the final hidden representation of each node receives messages from a wider neighborhood. After that, we add a couple of fully connected layers to wrap up the high-level node representations. For the node classification part, we use the cross-entropy loss after a sigmoid activation for optimization. Because only a few retrieved items matter in SfM, in the training phase we only backpropagate gradients for the 1-hop neighbor nodes; in the test phase, we likewise perform node binary classification only on the 1-hop nodes.
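The overall forward pass can be sketched as follows. The weight shapes and FC sizes are hypothetical, but the structure (four graph conv layers, then fully connected layers, then a sigmoid over the 1-hop nodes) follows the description above:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def gcn_forward(X, A, conv_weights, fc_weights, one_hop_idx):
    """Four graph conv layers (ReLU) for feature extraction, two fully
    connected layers to wrap up node representations, and a sigmoid
    scoring only the 1-hop nodes (the candidate retrieved items)."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    H = X
    for W in conv_weights:                                # feature extraction
        H = relu(np.concatenate([H, (A @ H) / deg], axis=1) @ W)
    H = relu(H @ fc_weights[0])                           # FC layer 1
    logits = H @ fc_weights[1]                            # FC layer 2 -> logit
    return 1.0 / (1.0 + np.exp(-logits[one_hop_idx, 0]))  # sigmoid scores
```

Training would threshold these scores against the binary labels with a cross-entropy loss, restricted to the 1-hop nodes as the text describes.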

(a) View graph of CNN-based method
(b) View graph of our method
Figure 5: (a) We observe numerous match errors when directly applying pretrained CNN-based models with a fixed retrieval number. (b) The proposed GCN method effectively eliminates these mistakes and still finds sufficient retrieved images.
Figure 6: Experiment results on the ambiguous dataset. From top to bottom: Books, Cereal, Cup, Desk, Oats and Street. From left to right: the 1st column shows one view of an ambiguous scene; the 2nd to 4th columns show SfM models using matches from Voc-Tree, SiaMAC and MIRorR; the 5th column shows the SfM model using our retrieved matches; the 6th column shows the SfM model using manually judged matches. A green box indicates structure rebuilt only by our method, while a red box indicates wrong structure reconstructed by the compared methods.
(a) sensitivity of the labeling thresholds with MAC features
(b) sensitivity of the labeling thresholds with R-MAC features
Figure 7: Sensitivity of the labeling thresholds and the feature type.

To demonstrate the effect of the GCN on the image retrieval task, we use the view graph of Temple-of-Heaven as an illustration in Fig 5. In this dataset, the center part of the view graph should be empty, as front and back views cannot be matched. We can distinctly see that our method produces far fewer false matches than the pretrained CNN-based model.

(a-d) Sensitivity of the QES construction parameters, varying one parameter at a time while fixing the others.
Figure 8: Sensitivity of the subgraph construction parameters (node-discovery and edge-appending neighbor numbers).

4 Experiment

4.1 Datasets Overview

To evaluate the effectiveness of our GCN-based image retrieval algorithm for SfM, we conduct experiments on different kinds of datasets: GL3D [7], the HKUST ambiguous dataset [14], a public outdoor dataset [3], and the 1DSfM dataset [48].

GL3D is a large-scale dataset created specifically for 3D reconstruction and geometry-related learning problems. GL3D provides the degree of mesh overlap and the number of common tracks between image pairs from accurate mesh re-projection, which serve as supervision for our GCN training. The HKUST ambiguous dataset is a small-scale dataset containing scenes with symmetric and duplicated structures. The public outdoor dataset includes medium-scale image sets specifically for the 3D reconstruction task. The 1DSfM dataset consists of thousands of Internet photos downloaded from Flickr; note that a large number of images in this dataset may be unrelated to 3D reconstruction.

4.2 Evaluation Metrics

For an information retrieval system, mean Average Precision (mAP) is a popular performance metric. However, mAP is not suitable for assessing image retrieval in SfM, for two reasons: (1) since all retrieved items in the list (whether ranked high or low) are used equally by the later SfM pipeline, the retrieval ranking does not matter; (2) the SfM pipeline has to fix a retrieval number k to form match pairs, so precision and recall computed at depths other than k are meaningless. Therefore, we adopt precision, recall and F-measure at a fixed retrieval number k to measure experimental performance.
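The metric can be made concrete with a small set-based helper (a sketch with assumed names); ranking inside the retrieved list is deliberately ignored, matching reason (1) above:

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision/recall/F-measure at a fixed retrieval number:
    the order inside `retrieved` is ignored, since SfM matches every
    retrieved pair regardless of rank."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if tp else 0.0
    return p, r, f
```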

As for SfM evaluation, in matching procedure, we report the number of total attempted matches (TAM), the number of useful matches (UM) which pass geometric verification, and the running time to express the efficiency of our method. In mapping procedure, we record the number of recovered cameras, the number of sparse points, the number of observation points, and the re-projection error to describe the completeness and accuracy of rebuilt models.

            | VOC-Tree [14]      | SiaMAC [6]               | MIRorR [7]               | Ours
Score       | depth=6, branch=8  | MAC    R-MAC   MAC+Lw    | MAC    R-MAC   PR-MAC    | MAC    R-MAC
Precision   | 0.589              | 0.591  0.5971  0.6197    | 0.6309 0.6263  0.6422    | 0.6661 0.5864
Recall      | 0.564              | 0.5626 0.5667  0.5936    | 0.6015 0.5947  0.6113    | 0.5882 0.7008
F-measure   | 0.498              | 0.4979 0.5027  0.5248    | 0.5336 0.5288  0.5434    | 0.55   0.5651
Table 1: Experimental results on the matchable image retrieval task (k=25).

            | VOC-Tree [14]      | SiaMAC [6]               | MIRorR [7]               | Ours
Score       | depth=6, branch=8  | MAC    R-MAC   MAC+Lw    | MAC    R-MAC   PR-MAC    | MAC    R-MAC
Precision   | 0.2288             | 0.2712 0.2726  0.2835    | 0.2753 0.282   0.2921    | 0.6661 0.5864
Recall      | 0.7414             | 0.8489 0.8534  0.8857    | 0.8687 0.8792  0.9027    | 0.5882 0.7008
F-measure   | 0.3166             | 0.3728 0.3749  0.3903    | 0.3801 0.3881  0.4013    | 0.55   0.5651
Table 2: Experimental results on the matchable image retrieval task (k=100).

4.3 Baselines Introduction

We include three baselines, VOC-Tree [14], MIRorR [7], and SiaMAC [6], which are widely used for the image retrieval task in SfM. For a fair comparison, we tune the compared baselines to their best performance as described in their papers.

4.4 Parameters Selection

During QES construction, there are three kinds of hyper-parameters: the neighbor numbers k₁ and k₂ for discovering nodes; the neighbor number for appending edges; and the feature type, MAC or R-MAC, from the pretrained embedding function [7] used for calculating node features. Type MAC means that the representation features are extracted from full images, while type R-MAC means that representation features are generated by summing up regional features at multiple scales.

During GCN training on QES, the mesh-overlap and common-track thresholds determine whether image pairs should be labeled as matched: we treat an image pair as matched as long as its mesh overlap or its common-track score exceeds the corresponding pre-set threshold.

In the training phase, as only a few matches matter in SfM, we keep the number of retrieved neighbors small. To prevent the QES from becoming too complicated and hurting efficiency, we also bound k₁ and k₂. To explore the impact of the two labeling thresholds and of the MAC and R-MAC feature types, we conduct two groups of experiments, with results shown in Fig 7. We find that R-MAC consistently outperforms MAC, and the GCN model achieves the highest F-measure at an appropriate threshold setting. In the following experiments, we therefore adopt R-MAC as the feature type.

In the testing phase, since there is no need to follow the same settings as the training phase, we carry out comprehensive experiments to investigate the impact of the three subgraph construction parameters, with results reported in Fig 8. First, we keep two parameters constant and vary the 2-hop neighbor number; we observe in Fig 8(a) that it has no significant effect on the results. Next, we fix the other two parameters and test the sensitivity of the 1-hop neighbor number. As reported in Fig 8(b), a larger value brings more candidate links to be predicted, thus yielding higher recall but lower precision. At last, for the edge-appending neighbor number, we conduct two groups of experiments; in Fig 8(c) and Fig 8(d), we observe that it has a similar effect to the 1-hop neighbor number.

4.5 Experiments for Matchable Image Retrieval

From the parameter selection study, we find that the chosen setting of the subgraph construction parameters and labeling thresholds gives a balanced performance between precision and recall and achieves the highest F-measure score. In this section, we apply this configuration to the matchable image retrieval experiments on GL3D.

The results are shown in Table 1 with k=25 and in Table 2 with k=100, respectively. Our approach outperforms the compared methods in terms of F-measure. Through Table 1 and Table 2 we can clearly see how difficult it is for previous methods to select a proper retrieval number k that guarantees both completeness and efficiency; our approach does not need to set this sensitive parameter at all.

VOC-Tree [14] SiaMAC [6] MIRorR [7] Ours
 Scene UM/TAM Time (min) UM/TAM Time (min) UM/TAM Time (min) UM/TAM Time (min) Speedup
fc 1939/2339 (0.83) 0.400 Fail Fail 2090/2624 (0.80) 0.440 1412/1458 (0.97) 0.260 x1.5
stadium 1619/2699 (0.60) 1.367 1513/2658 (0.57) 1.459 1330/2685 (0.50) 1.523 939/1298 (0.72) 0.604 x2.3
garrard-hall 1030/1520 (0.68) 0.829 1018/1520 (0.67) 0.810 1044/1567 (0.67) 0.867 502/546 (0.92) 0.272 x3.0
south-building 1546/2079 (0.74) 1.263 1613/2043 (0.79) 1.322 1650/1996 (0.83) 1.287 900/941 (0.96) 0.569 x2.2
graham-hall 5898/10475 (0.56) 5.902 5965/8845 (0.67) 4.964 5831/8600 (0.68) 4.740 4846/6635 (0.73) 3.935 x1.2
person-hall 3645/5734 (0.64) 3.484 3168/5162 (0.61) 3.185 3165/4986 (0.63) 3.031 2422/3608 (0.67) 2.199 x1.4
Table 3: Experimental results of image matching on public outdoor dataset.
Scene Method Images Registered Sparse Points Observations Track Length Repro. Error
fc VOC-Tree 150 150 26513 140062 5.2828 0.4839
SiaMAC Fail Fail Fail Fail Fail
MIRorR 150 26295 136650 5.1968 0.4837
Ours 150 24738 126795 5.1255 0.4745
stadium VOC-Tree 157 157 84723 381464 4.5024 1.0471
SiaMAC 154 81387 366107 4.5 1.0373
MIRorR 156 77175 345419 4.4758 1.0279
Ours 155 72632 322246 4.4367 1.0101
garrard-hall VOC-Tree 100 100 57081 331992 5.8161 1.024
SiaMAC 100 56920 331489 5.8238 1.025
MIRorR 100 57047 331838 5.817 1.0248
Ours 100 55184 319800 5.7952 0.9997
south-building VOC-Tree 128 128 85599 514822 6.0145 0.5909
SiaMAC 128 85660 514248 6.0033 0.5905
MIRorR 128 85626 514552 6.0093 0.5911
Ours 128 83630 501698 5.999 0.5824
graham-hall VOC-Tree 562 556 271255 1711980 6.3113 1.0869
SiaMAC 559 292897 1603609 5.475 1.0268
MIRorR 560 287487 1623510 5.6473 1.0492
Ours 555 261222 1560843 5.9751 1.0343
person-hall VOC-Tree 330 330 200196 1406306 7.0247 1.156
SiaMAC 238 139566 975117 6.9868 1.0845
MIRorR 330 198354 1389135 7.0033 1.141
Ours 330 202641 1362836 6.7254 1.108
Table 4: Experimental results of mapping on public outdoor dataset.
VOC-Tree [14] SiaMAC [6] MIRorR [7] Ours
 Scene UM/TAM Time (min) UM/TAM Time (min) UM/TAM Time (min) UM/TAM Time (min) Speedup
Alamo 32532/214257 (0.15) 82.144 54665/209073 (0.26) 75.730 27629/160767 (0.17) 57.890 21740/49604 (0.43) 19.612 x3.0
Ellis Island Fail Fail 46699/183036 (0.26) 62.286 Fail Fail 23227/52847 (0.44) 20.007 x2.3
Gendarmenmarkt 31494/100432 (0.31) 40.043 43558/104397 (0.42) 40.736 20018/82438 (0.24) 31.626 19875/32856 (0.60) 13.291 x2.4
Madrid Metropolis 13477/97553 (0.14) 32.712 23337/92380 (0.25) 28.725 9584/73410 (0.13) 21.389 9632/26543 (0.36) 7.129 x3.0
Roman Forum 39711/168677 (0.24) 65.761 65516/168650 (0.39) 65.510 30894/129521 (0.24) 49.525 29228/47952 (0.61) 18.317 x2.7
Tower of London 17181/115211 (0.15) 44.538 28761/110655 (0.26) 41.054 12536/87786 (0.14) 31.879 11452/27272 (0.42) 11.153 x2.9
Table 5: Experimental results of image matching on 1DSfM dataset.
Scene Method Images Registered Sparse Points Observations Track Length Repro. Error
Alamo VOC-Tree 2915 938 166907 2003316 12.0026 0.6496
SiaMAC 967 172913 2067066 11.9544 0.6606
MIRorR 925 159171 1894934 11.905 0.6364
Ours 862 146979 1767485 12.0254 0.6317
Ellis Island VOC-Tree 2587 Fail Fail Fail Fail Fail
SiaMAC 410 106275 738273 6.947 0.8137
MIRorR Fail Fail Fail Fail Fail
Ours 326 87070 606156 6.9617 0.8109
Gendarmenmarkt VOC-Tree 1463 1057 208925 1366892 6.5425 0.7185
SiaMAC 1054 209716 1358449 6.4776 0.7217
MIRorR 980 168822 1090269 6.4581 0.7205
Ours 997 181568 1138802 6.272 0.6926
Madrid Metropolis VOC-Tree 1344 452 71281 518857 7.279 0.6228
SiaMAC 503 77667 554452 7.1388 0.6186
MIRorR 489 66476 464502 6.9875 0.5898
Ours 420 61111 437402 7.1575 0.6098
Roman Forum VOC-Tree 2364 1641 332421 2769546 8.3314 0.7308
SiaMAC 1677 348262 2888378 8.2937 0.7339
MIRorR 1623 306893 2506663 8.1678 0.7117
Ours 1575 296194 2405047 8.1198 0.7105
Tower of London VOC-Tree 1567 742 166622 13883301 8.332 0.6152
SiaMAC 765 171273 1426670 8.3298 0.62198
MIRorR 740 151625 1252361 8.2596 0.6014
Ours 570 135843 1151801 8.4789 0.6078
Table 6: Experimental results of mapping on 1DSfM dataset.

4.6 Experiments for SfM

In this section, we conduct extensive reconstruction experiments to demonstrate the integration of image retrieval techniques with SfM. All reconstructions are implemented within the COLMAP [3] framework; implementation details can be found in the supplementary material.

First, we report SfM results on the challenging HKUST ambiguous dataset. As even a small number of wrong pairwise matches in this dataset may cause reconstruction to fail, the accuracy of image retrieval is extremely important. For our GCN model, we adjust the construction parameters and labeling thresholds accordingly in the training and testing phases; for fairness, we select a small retrieval number for the compared retrieval methods to improve their accuracy. Since the different retrieval methods lead to great diversity in reconstruction results, we directly show the rebuilt models in Fig 6. We find that our GCN-based retrieval method displays obvious advantages over the visually-based retrieval approaches.

Second, we conduct experiments on the public outdoor dataset. As this dataset is created specifically for 3D reconstruction, we adopt the standard configuration of construction parameters and thresholds in the training and testing phases, and select a moderate retrieval number for the compared methods in consideration of both efficiency and completeness. Table 3 provides statistics of the matching process, and Table 4 of the mapping process. The matching and mapping experiments imply that although our method generates the fewest attempted match pairs, almost all of them pass epipolar geometry verification and contribute to accurate and complete 3D models. This property helps to save massive computation resources, especially on large-scale datasets.

Finally, we conduct experiments on the 1DSfM dataset. Since 1DSfM is downloaded from Flickr, a large number of its images have nothing to do with the final reconstruction, which is quite different from our training dataset GL3D. To narrow this gap, we try to retrieve more images to avoid missing necessary ones, enlarging the neighbor numbers in the testing phase; for the compared methods, we likewise set a larger retrieval number for the same reason. The matching results are reported in Table 5, and the mapping results in Table 6. The experiments show that we achieve results comparable to the other retrieval approaches at a much lower computational cost.

5 Conclusion

In this paper, we propose a novel image retrieval method for SfM via a Graph Convolutional Network (GCN). We emphasize that the local context surrounding a query image provides rich information about the matchable likelihood between this image and its neighbors. By constructing a Query Enclosing Subgraph (QES) to depict the local context, we adopt a GCN to directly predict whether test image pairs share scene overlap. Extensive experiments indicate that the proposed method can handle challenging scenes with ambiguous structure, and significantly reduces the number of image pairs for matching without degrading the quality of the subsequent SfM pipeline.


  • [1] Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S.M., Szeliski, R.: Building rome in a day. Communications of the ACM 54 (2011) 105–112
  • [2] Moulon, P., Monasse, P., Marlet, R.: Global fusion of relative motions for robust, accurate and scalable structure from motion.

    In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 3248–3255

  • [3] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4104–4113

  • [4] Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE (2003) 1470–1477
  • [5] Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). Volume 2., IEEE (2006) 2161–2168
  • [6] Radenović, F., Tolias, G., Chum, O.: Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In: European conference on computer vision, Springer (2016) 3–20
  • [7] Shen, T., Luo, Z., Zhou, L., Zhang, R., Zhu, S., Fang, T., Quan, L.: Matchable image retrieval by learning from surface reconstruction. In: Asian Conference on Computer Vision, Springer (2018) 415–431
  • [8] Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (1975) 509–517
  • [9] Omohundro, S.M.: Five balltree construction algorithms. International Computer Science Institute Berkeley (1989)
  • [10] Zhang, M., Chen, Y.: Weisfeiler-lehman neural machine for link prediction. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (2017) 575–583
  • [11] Zhang, M., Chen, Y.: Link prediction based on graph neural networks. In: Advances in Neural Information Processing Systems. (2018) 5165–5175
  • [12] Wang, Z., Zheng, L., Li, Y., Wang, S.: Linkage based face clustering via graph convolution network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 1117–1125
  • [13] Sweeney, C., Sattler, T., Hollerer, T., Turk, M., Pollefeys, M.: Optimizing the viewing graph for structure-from-motion. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 801–809
  • [14] Shen, T., Zhu, S., Fang, T., Zhang, R., Quan, L.: Graph-based consistent matching for structure-from-motion. In: European Conference on Computer Vision, Springer (2016) 139–155
  • [15] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: 2007 IEEE conference on computer vision and pattern recognition, IEEE (2007) 1–8
  • [16] Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: European conference on computer vision, Springer (2008) 304–317
  • [17] Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: 2008 IEEE conference on computer vision and pattern recognition, IEEE (2008) 1–8
  • [18] Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE (2010) 3384–3391
  • [19] Jegou, H., Perronnin, F., Douze, M., Sánchez, J., Perez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE transactions on pattern analysis and machine intelligence 34 (2011) 1704–1716
  • [20] Radenović, F., Jégou, H., Chum, O.: Multiple measurements and joint dimensionality reduction for large scale image search with short vectors. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. (2015) 587–590
  • [21] Arandjelovic, R., Zisserman, A.: All about vlad. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. (2013) 1578–1585
  • [22] Tolias, G., Furon, T., Jégou, H.: Orientation covariant aggregation of local descriptors with embeddings. In: European Conference on Computer Vision, Springer (2014) 382–397
  • [23] Chum, O., et al.: Large-scale discovery of spatially related images. IEEE transactions on pattern analysis and machine intelligence 32 (2009) 371–377
  • [24] Shen, X., Lin, Z., Brandt, J., Wu, Y.: Spatially-constrained similarity measure for large-scale object retrieval. IEEE transactions on pattern analysis and machine intelligence 36 (2013) 1229–1241
  • [25] Chum, O., Mikulik, A., Perdoch, M., Matas, J.: Total recall ii: Query expansion revisited. In: CVPR 2011, IEEE (2011) 889–896
  • [26] Qin, D., Gammeter, S., Bossard, L., Quack, T., Van Gool, L.: Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In: CVPR 2011, IEEE (2011) 777–784
  • [27] Tolias, G., Jégou, H.: Visual query expansion with or without geometry: refining local descriptors by feature aggregation. Pattern recognition 47 (2014) 3466–3476
  • [28] Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: European conference on computer vision, Springer (2014) 584–599
  • [29] Sharif Razavian, A., Sullivan, J., Maki, A., Carlsson, S.: A baseline for visual instance retrieval with deep convolutional networks. In: International Conference on Learning Representations, May 7-9, 2015, San Diego, CA, ICLR (2015)
  • [30] Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879 (2015)
  • [31] Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. arXiv preprint arXiv:1510.07493 (2015)
  • [32] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Volume 1., IEEE (2005) 539–546
  • [33] Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning fine-grained image similarity with deep ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 1386–1393
  • [34] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 815–823
  • [35] Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
  • [36] Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems. (2016) 3844–3852
  • [37] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  • [38] Hechtlinger, Y., Chakravarti, P., Qin, J.: A generalization of convolutional neural networks to graph-structured data. arXiv preprint arXiv:1704.08165 (2017)
  • [39] Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in neural information processing systems. (2017) 1024–1034
  • [40] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
  • [41] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI conference on artificial intelligence. (2018)
  • [42] Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018)
  • [43] Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (2007) 1019–1031
  • [44] Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30 (1998) 107–117
  • [45] Jeh, G., Widom, J.: Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. (2002) 538–543
  • [46] Zhou, T., Lü, L., Zhang, Y.C.: Predicting missing links via local information. The European Physical Journal B 71 (2009) 623–630
  • [47] Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18 (1953) 39–43
  • [48] Wilson, K., Snavely, N.: Robust global translations with 1dsfm. In: European Conference on Computer Vision, Springer (2014) 61–75