1 Introduction
The prestigious large-scale SfM methods [1, 16, 25, 27, 43, 46] have already provided ingenious designs in feature extraction [31, 55], overlapping image detection [1, 16, 25, 37], feature matching and verification [56], and bundle adjustment [13, 35, 57]. However, the large-scale, accurate, and consistent camera registration problem has not been completely solved, let alone in a parallel fashion. To fit a whole camera registration problem into a single computer, previous works [1, 16, 25, 43, 46] generally drastically discard the connectivities among cameras and tracks by first building a skeletal geometry of iconic images [30] and registering the remaining cameras with respect to the skeletal reconstruction. Other approaches [23, 34, 40, 49, 51] generate exclusive camera clusters for partial reconstruction and finally merge them together. Such losses of camera-to-camera connectivities remarkably decrease the accuracy and consistency of the final reconstruction. Instead, this work preserves the camera-to-camera connectivities and their corresponding tracks for a highly accurate and consistent reconstruction. We propose an iterative camera clustering algorithm that splits the original SfM problem into several smaller sub-problems in terms of overlapping camera clusters. We then exploit this scalable framework to solve the whole SfM problem, including track generation, local SfM, 3D point triangulation, and bundle adjustment far exceeding the memory of a single computer, in a parallel scheme.
To obtain the global camera poses from partial sparse reconstructions, the hybrid SfM methods [3, 49] directly use similarity transformations to roughly merge camera clusters together, possibly leading to inconsistent camera poses across clusters. Others [14, 23, 29, 40, 51] hierarchically merge camera pairs and triplets and are sensitive to the order of the merging process. Given that the camera-to-camera connectivities are preserved by our clustering algorithm wherever possible, we instead feed the accurate and robust relative poses from incremental SfM [1, 39, 42, 45, 56] into the global motion averaging framework [2, 5, 7, 8, 9, 10, 17, 18, 19, 22, 32, 38, 44] to obtain the global camera poses.
The contributions of our approach are threefold. First, we introduce a highly scalable framework to handle SfM problems exceeding the memory of a single computer. Second, a camera clustering algorithm is proposed to guarantee that sufficient camera-to-camera connectivities and corresponding tracks are preserved in camera registration. Finally, we present a hybrid SfM method that uses relative motions from incremental SfM to globally average the camera poses and achieves state-of-the-art accuracy on benchmark datasets [47]. To the best of our knowledge, ours is the first pipeline able to reconstruct highly accurate and consistent camera poses from more than one million high-resolution images in a parallel manner.
2 Related Work
Based on an initial camera pair, the well-known incremental SfM method [45] and its derivations [1, 39, 42, 56] progressively recover the pose of the "next-best-view" by carrying out perspective-three-point (P3P) [28] combined with RANSAC [15] and nonlinear bundle adjustment [52] to effectively remove outlier epipolar geometry and feature correspondences. However, frequent intermediate bundle adjustment leads to considerable time consumption and drifting, especially on large-scale datasets. In contrast, the global SfM methods [2, 5, 7, 8, 9, 10, 17, 18, 19, 22, 32, 38, 44] solve all the camera poses simultaneously from the available relative poses, the computation of which is highly parallel, and can effectively avoid drifting errors. Compared with incremental SfM methods, however, global SfM methods are more sensitive to possibly erroneous epipolar geometry despite the various delicate designs of epipolar geometry filters [10, 20, 24, 26, 34, 41, 53, 54, 58, 59]. In this paper, we embrace the advantages of both incremental and global SfM methods and exploit a hybrid SfM formulation. The previous hybrid methods [14, 23, 29, 40, 51] are limited to small-scale or sequential datasets. Havlena et al. [23] form the final 3D model by merging atomic 3D models from camera triplets together, but the merging process, relying solely on common 3D points, is not robust. Bhowmick et al. [3]
directly estimate the similarity transformations to combine camera clusters but possibly produce inconsistent camera poses across clusters. The work in
[49] incrementally merges multiple cameras but suffers from severe drifting errors. In contrast, we feed the robust relative poses from partial reconstructions by local incremental SfM into the global motion averaging framework and provide highly consistent and accurate camera poses. The work in [49] optimizes the relative poses by solving a single global optimization problem rather than multiple local problems, and thus suffers from scalability issues on very large-scale datasets. To tackle the scalability problem of large-scale SfM, previous works generally exploit a skeletal [46] or simplified graph [1, 16, 25, 43] of iconic images [30]. Although millions of densely sampled Internet images can be roughly registered, numerous geometry connectivities are discarded. Therefore, such approaches can hardly guarantee a highly accurate and consistent reconstruction in our scenario consisting of uniformly captured high-resolution images. The hybrid SfM pipelines [3, 23] employing exclusive camera clusters likewise lose a large number of connectivities among cameras and tracks during cluster partition. Instead, our proposed camera clustering algorithm produces overlapping camera clusters, guaranteeing that sufficient camera-to-camera connectivities and corresponding tracks are validated and preserved in camera registration, and consequently achieves superior reconstruction accuracy and consistency.
3 Scalable Formulation
3.1 Preliminary
We start with a given set of images $\mathcal{I} = \{I_i\}$, their corresponding SIFT [31] features $\{F_i\}$, and matching correspondences $\{M_{ij}\}$, where $M_{ij}$ is a set of inlier feature correspondences verified by epipolar geometry [21] between two images $I_i$ and $I_j$. Each image $I_i$ is associated with a camera $C_i$. The target of this paper is then to compute the global camera poses of all the cameras, with projection matrices denoted by $\{P_i\}$.
3.2 Camera Clustering
As the problem of SfM, in particular camera registration, scales up, the following two problems emerge. First, the problem size gradually exceeds the memory of a single computer. Second, the high degree of parallelism of our distributed computing system can hardly be fully utilized. We therefore introduce a camera clustering algorithm to split the original SfM problem into several smaller, manageable subproblems in terms of clusters of cameras and associated images. Specifically, our goal of camera clustering is to find camera clusters such that all the SfM operations of each cluster fit into a single computer for efficient processing (size constraint), and such that all the clusters have sufficient overlapping cameras with adjacent clusters to guarantee a complete reconstruction when their corresponding partial reconstructions are merged together in motion averaging (completeness constraint).
3.2.1 Clustering Formulation
In order to encode the relationships between all the cameras and associated tracks, we introduce a camera graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, in which each node $v_i \in \mathcal{V}$ represents a camera $C_i$, and each edge $e_{ij} \in \mathcal{E}$ with weight $w_{ij}$ connects two different cameras $C_i$ and $C_j$. In the subsequent scalable SfM, both local incremental SfM and bundle adjustment [13] encourage cameras with great numbers of common features to be grouped together for a robust geometry estimation. We therefore define the edge weight as the number of feature correspondences, namely $w_{ij} = |M_{ij}|$. Our target is then to partition all the cameras denoted by the graph $\mathcal{G}$ into a set of camera clusters denoted by $\{\mathcal{G}_k = (\mathcal{V}_k, \mathcal{E}_k)\}$ while satisfying the following size and completeness constraints.
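To make the graph construction concrete, here is a minimal Python sketch. The `matches` dictionary, mapping an image pair to its verified inlier correspondences, is our own hypothetical input format, not the paper's actual data structure.

```python
def build_camera_graph(matches):
    """Build the camera graph G = (V, E): one node per camera, and an
    edge between two cameras weighted by the number of verified inlier
    correspondences, so strongly connected pairs resist being cut."""
    vertices, edges = set(), {}
    for (i, j), inliers in matches.items():
        vertices.update((i, j))
        if inliers:  # skip pairs with no surviving inliers
            edges[(min(i, j), max(i, j))] = len(inliers)
    return vertices, edges
```

The edge weight directly mirrors the paper's choice $w_{ij} = |M_{ij}|$, so downstream partitioning favors keeping well-matched camera pairs in the same cluster.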
Size constraint
We encourage the camera clusters to be small and of similar size. First, each camera cluster should be small enough to fit into a single computer for efficient local SfM operations. Particularly for local incremental SfM, a comparatively small-scale problem can effectively avoid redundant, time-consuming intermediate bundle adjustment [52] and possible drifting. Second, a balanced problem partition facilitates full utilization of the distributed computing system. The size constraint is therefore defined as
(1)  $|\mathcal{V}_k| \le \Delta_{\text{up}}, \quad \forall\, \mathcal{G}_k = (\mathcal{V}_k, \mathcal{E}_k)$
where $\Delta_{\text{up}}$ is the upper bound on the number of cameras in a cluster. We can see from Figure 3 that both the average relative rotation and translation errors computed from local incremental SfM in a cluster first remarkably decrease and then stabilize as the number of cameras in a cluster increases. The acceptable number of cameras in a cluster therefore lies in a large range, and we choose $\Delta_{\text{up}}$ within this range as a trade-off between accuracy and efficiency.
Completeness constraint
The completeness constraint is introduced to preserve camera-to-camera connectivities, which provide relative poses for motion averaging to generate global camera poses. However, completely preserving camera-to-camera connectivities introduces many repeated cameras in different clusters, and the size constraint can hardly be satisfied [4]. We therefore define the completeness ratio $\delta_k$ of a camera cluster $\mathcal{G}_k$ as the fraction of its cameras that are also covered by other camera clusters. It limits the number of repeated cameras and guarantees that all the clusters have sufficient overlapping cameras with adjacent clusters for a complete reconstruction. Then, we have
(2)  $\delta_k \ge \delta, \quad \forall\, \mathcal{G}_k = (\mathcal{V}_k, \mathcal{E}_k)$
As shown in Figure 4, a large completeness ratio encourages less loss of camera-to-camera connectivities but results in more duplicated cameras in different clusters. Balancing the trade-off between accuracy and efficiency, we choose a moderate completeness ratio. Here, only a small fraction of camera-to-camera connectivities are discarded, and approximately 1.8 times the original number of cameras are reconstructed in local SfM. In contrast, exclusive camera clustering (a completeness ratio of zero) leads to a significant loss of camera-to-camera connectivities.
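The completeness ratio described above can be computed per cluster as in this small sketch; the function name and the representation of clusters as Python sets are our own illustration.

```python
def completeness_ratio(cluster, other_clusters):
    """Fraction of cameras in `cluster` that also appear in at least
    one other cluster -- the per-cluster completeness ratio that the
    completeness constraint bounds from below."""
    covered = set().union(*other_clusters) if other_clusters else set()
    return sum(1 for c in cluster if c in covered) / len(cluster)
```

For example, a cluster {1, 2, 3, 4} sharing cameras 3 and 4 with a neighboring cluster has a completeness ratio of 0.5, while fully exclusive clustering yields a ratio of zero everywhere.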
3.2.2 Clustering Algorithm
We propose a two-step algorithm to solve the camera clustering problem. A sample output of this algorithm is illustrated in Figure 5.
1. Graph division
We guarantee the size constraint by recursively splitting any camera cluster that violates it into smaller components. Starting with the camera graph $\mathcal{G}$, we iteratively apply the normalized-cut algorithm [12], which guarantees an unbiased vertex partition, to divide any subgraph not satisfying the size constraint into two balanced subgraphs, until no subgraph violates the size constraint. Intuitively, camera pairs with great numbers of common features have high edge weights and are less likely to be cut.
2. Graph expansion
We enforce the completeness constraint by introducing sufficient overlapping cameras between adjacent camera clusters. More specifically, we first sort the edges discarded in graph division by edge weight in descending order, and iteratively add each discarded edge and its associated vertices randomly to one of the two subgraphs it connects, provided the completeness ratio of that subgraph is smaller than the threshold $\delta$. This process is iterated until no additional edges can be added to any of the subgraphs. It is noteworthy that the completeness constraint is not difficult to satisfy after adding a small subset of the discarded edges and associated vertices.
The size constraint may be violated after graph expansion, and we iterate between graph division and graph expansion until both constraints are satisfied.
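The two-step algorithm can be sketched as follows. This is a simplified, single-pass illustration under stated assumptions: `split_balanced` is a stand-in for the normalized cut [12], discarded edges are re-added greedily rather than randomly, and the iteration between division and expansion described above is omitted.

```python
def split_balanced(vertices, edges):
    """Stand-in for the normalized cut [12]: simply halves the sorted
    vertex set, keeping only the interface of the real partitioner."""
    vs = sorted(vertices)
    mid = len(vs) // 2
    return [set(vs[:mid]), set(vs[mid:])]

def cluster_cameras(vertices, edges, max_size, delta):
    """One division pass plus one expansion pass (the full algorithm
    iterates the two steps until both constraints hold)."""
    # 1. Graph division: recursively split clusters violating the
    #    size constraint |V_k| <= max_size.
    clusters, queue = [], [set(vertices)]
    while queue:
        v = queue.pop()
        if len(v) <= max_size:
            clusters.append(v)
        else:
            queue.extend(split_balanced(v, edges))

    def ratio(c):
        # Completeness ratio: fraction of c's cameras shared with others.
        others = [d for d in clusters if d is not c]
        shared = {x for x in c if any(x in d for d in others)}
        return len(shared) / len(c)

    # 2. Graph expansion: re-add discarded edges, heaviest first, to a
    #    cluster containing exactly one endpoint, while that cluster's
    #    completeness ratio is still below delta.
    cut = sorted(((w, i, j) for (i, j), w in edges.items()
                  if not any(i in c and j in c for c in clusters)),
                 reverse=True)
    for w, i, j in cut:
        for c in clusters:
            if (i in c) != (j in c) and ratio(c) < delta:
                c.update((i, j))
                break
    return clusters
```

On a small chain-shaped graph, expansion duplicates the camera on a cut edge into its neighboring cluster, which is exactly the overlap the completeness constraint demands.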
3.3 Camera Cluster Categorization
The camera clusters from the clustering algorithm are divided into two categories, namely independent and interdependent camera clusters. We define the final camera clusters from our clustering algorithm as interdependent camera clusters, since they share overlapping cameras with adjacent clusters. Such interdependent clusters are used in the subsequent parallel local incremental SfM. Accordingly, we define the fully exclusive camera clusters before graph expansion as independent camera clusters, which are used in the following parallel 3D point triangulation and parallel bundle adjustment. We also leverage the independent camera clusters to build a hierarchical camera cluster tree $\mathcal{T}$, in which each leaf node corresponds to an independent camera cluster and each non-leaf node is associated with an intermediate camera cluster produced during the recursive binary graph division. The hierarchical camera cluster tree is an important structure in the subsequent parallel track generation. Next, we build on the camera clusters from our clustering algorithm to implement a scalable SfM pipeline.
4 Scalable Implementation
4.1 Track Generation
The first step of scalable SfM is to use the pairwise feature correspondences to generate globally consistent tracks across all the images, a problem solved by a standard Union-Find [33] algorithm. However, as the number of input images scales up, it gradually becomes impossible to concurrently load all the feature and associated match files into the memory of a single computer for track generation. We therefore rely on the hierarchical camera cluster tree $\mathcal{T}$ to perform track generation and avoid caching all the features and correspondences in memory at once. In detail, for the track generation subproblem associated with two sibling leaf nodes, we load all their features and correspondences into memory, generate the tracks corresponding to their parent node, release the memory of features and correspondences, and save the generated tracks to storage. As for two sibling non-leaf nodes, we only load the correspondences and tracks associated with both nodes, merge them, and save the tracks corresponding to their parent node to storage. This process is performed iteratively from the bottom up until the globally consistent tracks with respect to the root node of $\mathcal{T}$ are obtained. All the track generation processes associated with each level of $\mathcal{T}$ are handled in parallel under a standard MapReduce [11] framework.
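A minimal Union-Find [33] track generator for a single in-memory chunk is sketched below; the hierarchical, out-of-core scheduling over the cluster tree is not shown, and keying features by an (image_id, feature_id) pair is our own convention.

```python
class UnionFind:
    """Standard Union-Find [33] with path compression."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def generate_tracks(correspondences):
    """Chain pairwise feature correspondences into consistent tracks.
    Each feature is keyed by an (image_id, feature_id) pair; a track
    is the set of features transitively linked by correspondences."""
    uf = UnionFind()
    for fa, fb in correspondences:
        uf.union(fa, fb)
    tracks = {}
    for f in uf.parent:
        tracks.setdefault(uf.find(f), []).append(f)
    return list(tracks.values())
```

In the hierarchical scheme, the same `union` operation is applied when merging the track sets of two sibling nodes, which is why only their correspondences and tracks, not the raw features, need to be resident in memory.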
4.2 Local Incremental SfM
For the cameras and corresponding tracks of every interdependent camera cluster, we perform local incremental SfM in parallel. Local incremental SfM is vital to the subsequent motion averaging in two aspects. First, RANSAC [15] based filters and repeated partial bundle adjustment [52] can remove erroneous epipolar geometry and feature correspondences. Second, incremental SfM performs robust multi-view pose estimation [28, 36] and produces more accurate and robust relative rotations and translations than the generally adopted essential matrix based [2, 5, 18, 38] and trifocal tensor based methods [26, 34], even for camera pairs with weak association, large angles of view, and great scale variation. Figure 6 and the statistics of the benchmark datasets [47] in Table 2 confirm the statement above.
4.3 Motion Averaging
Now, all the relative motions of camera pairs with feature correspondences from local incremental SfM are used to compute the global camera poses. The work in [8] is first adopted for efficient and robust global rotation averaging.
4.3.1 Translation Averaging
Translation averaging is challenging for two reasons. First, it is difficult to discard erroneous epipolar geometry resulting from noisy feature correspondences. Second, an essential matrix can only encode the direction of a relative translation [38]. Thanks to local incremental SfM, the majority of the erroneous epipolar geometry is filtered out, and the only remaining problem is to resolve the scale ambiguity.
The work in [10] first globally averages the scales of all the relative translations and then performs a convex optimization to solve scale-aware translation averaging. Özyesil et al. [38] obtain the convex "least unsquared deviations" formulation by introducing a complicated quadratic constraint. Given that all the relative translations from one camera cluster are up to the same scale factor, we instead formulate our translation averaging as a convex problem by solving for the camera positions and cluster scales simultaneously. Obviously, the scale factors computed in terms of clusters are more robust than the pairwise scales [10, 38] in terms of relative poses, especially for camera pairs with weak association.
With the global rotations computed from [8] fixed, a linear equation in the camera positions can be obtained as:
(3)  $c_i - c_j = s_k\, R_j^{\top} t_{ij}^{k}$
where $t_{ij}^{k}$ is the relative translation between two cameras $C_i$ and $C_j$ estimated in the $k$-th cluster, associated with a scale $s_k$, $R_j$ is the global rotation of camera $C_j$, and $c_i$, $c_j$ are camera positions. Equation 3 can be rewritten as $c_i - c_j - s_k R_j^{\top} t_{ij}^{k} = \mathbf{0}$. Then we stack all the cluster scales and camera positions into the vectors $s = [s_1, \ldots, s_K]^{\top}$ and $c = [c_1^{\top}, \ldots, c_N^{\top}]^{\top}$ respectively, and we have:
(4)  $A_{ij}\, s + B_{ij}\, c = \mathbf{0}$
Here, $A_{ij}$ is a $3 \times K$ matrix whose $k$-th column is $-R_j^{\top} t_{ij}^{k}$ and zero elsewhere, and $B_{ij}$ is a $3 \times 3N$ matrix whose $i$-th and $j$-th $3 \times 3$ blocks are $I_3$ and $-I_3$ respectively, and zero elsewhere. Then, we can collect all such linear equations from the available camera-to-camera connectivities into the following single linear equation system:
(5)  $A\, s + B\, c = \mathbf{0}$
where $A$ and $B$ are sparse matrices made by stacking all the associated matrices $A_{ij}$ and $B_{ij}$ respectively.
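The assembly of the stacked system and a gauge-fixed solve can be sketched as follows. We assume the per-pair relation c_i − c_j = s_k R_jᵀ t_ij (our reading of Equation 3), and, for brevity, this sketch uses dense least squares in place of the paper's sparse, robust L1 objective.

```python
import numpy as np

def solve_translation_averaging(n_cams, n_clusters, observations, rotations):
    """Assemble the stacked system A s + B c = 0 and solve it.

    `observations`: list of (i, j, k, t_ij) tuples -- the relative
    translation t_ij between cameras i and j estimated in cluster k.
    `rotations`: global camera rotations R_i from rotation averaging.
    Gauge freedom is removed by fixing c_0 = 0 and s_0 = 1.  The paper
    minimizes an L1 norm; this sketch uses least squares for brevity.
    """
    rows = 3 * len(observations)
    A = np.zeros((rows, n_clusters))   # cluster-scale coefficients
    B = np.zeros((rows, 3 * n_cams))   # camera-position coefficients
    for r, (i, j, k, t_ij) in enumerate(observations):
        # Assumed relation per pair: c_i - c_j - s_k * R_j^T t_ij = 0.
        B[3*r:3*r+3, 3*i:3*i+3] = np.eye(3)
        B[3*r:3*r+3, 3*j:3*j+3] = -np.eye(3)
        A[3*r:3*r+3, k] = -rotations[j].T @ t_ij
    # Fix the gauge: drop c_0's columns, move s_0 = 1 to the right side.
    M = np.hstack([A[:, 1:], B[:, 3:]])
    b = -A[:, 0]
    x, *_ = np.linalg.lstsq(M, b, rcond=None)
    scales = np.concatenate([[1.0], x[:n_clusters - 1]])
    positions = np.vstack([np.zeros(3), x[n_clusters - 1:].reshape(-1, 3)])
    return scales, positions
```

A tiny synthetic example with two clusters, where the second cluster's reconstruction is at half the metric scale, recovers the cluster scale factor together with the camera positions in one solve, which is the key property the formulation exploits.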
Dataset | # images | Avg. epipolar error [pixels] | # connected camera pairs | # 3D points
(Epipolar error columns: [3], [49], then Ours at four increasing completeness ratios; connected-pair columns: [49], [3], then Ours; 3D-point columns: [3], [49], then Ours.)
Pittsburg | 388 | 6.74, 5.48, 2.88, 0.90, 0.81, 0.78 | 7.4K, 6.7K, 8.2K, 9.4K, 10.1K, 10.3K | 57K, 64K, 74K, 77K, 80K, 81K
Campus | 1550 | 3.42, 3.93, 2.74, 1.22, 0.72, 0.66 | 43.2K, 34.7K, 61.7K, 69.4K, 72.2K, 76.0K | 156K, 173K, 248K, 252K, 276K, 294K
After removing the gauge freedom by fixing the position of one camera and the scale of one cluster (e.g., $c_1 = \mathbf{0}$ and $s_1 = 1$), we can obtain the positions of all the cameras by solving the following robust convex optimization problem, which is more robust to outliers than $L_2$ methods and converges rapidly to a global optimum,
(6)  $\min_{s,\, c}\ \| A\, s + B\, c \|_{1}$
Since the baseline length is encoded by the changes of cluster scales, our translation averaging algorithm can effectively handle the scale ambiguity, especially for collinear camera motion, and is much better posed than the essential matrix based approaches [5, 18, 38, 54], which only consider the directions of relative translations and are limited to parallel rigid graphs [38].
4.4 Bundle Adjustment
For each independent camera cluster, we triangulate [21] the corresponding 3D points with a sufficient number of visible cameras from the feature correspondences validated by local incremental SfM, based on the averaged global camera geometry. Then, we follow the state-of-the-art algorithm proposed by Eriksson et al. [13] for distributed bundle adjustment. Since this work [13] claims no restriction on the partitions of cameras, we take the independent camera clusters, with their associated cameras, tracks, and projections, as the subproblems of the objective function of bundle adjustment.
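For the triangulation step, a standard linear (DLT) triangulation [21] of one track can be sketched as follows; this is the textbook method, not necessarily the exact variant used in the pipeline.

```python
import numpy as np

def triangulate_point(projections, points2d):
    """Linear (DLT) triangulation of one track from two or more views.

    `projections` are the 3x4 camera matrices built from the averaged
    global poses; `points2d` are the observed image coordinates.  The
    homogeneous point minimizing the algebraic error is the smallest
    singular vector of the stacked constraint matrix."""
    A = []
    for P, (x, y) in zip(projections, points2d):
        # Each observation contributes two rows: x*(P row 3) - (P row 1)
        # and y*(P row 3) - (P row 2).
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]
```

Because each extra visible camera simply appends two more rows, tracks with many validated observations are triangulated more stably, which is why a sufficient number of visible cameras is required.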
4.5 Discussion
Given the same global camera rotations from [8] and relative translations from local SfM, Figure 7 verifies that our translation averaging algorithm recovers more accurate camera positions than the state-of-the-art translation averaging methods [10, 34, 38, 48, 54]. Although our clustering algorithm cannot guarantee that no relative motions are lost compared with the original camera graph, the statistical comparison shown in Table 2 still demonstrates the superior accuracy of the camera poses from our pipeline over the state-of-the-art SfM approaches [10, 34, 48, 56] on the benchmark dataset [47].
Figure 8 shows the comparison with the hybrid SfM methods [3, 49] using exclusive camera clusters on the datasets [10] consisting of sequential images with closed loops. We regard our independent camera clusters as the clusters adopted in [3, 49]. We can see that our global method with interdependent camera clusters successfully closes the loops while the methods [3, 49] with exclusive camera clusters fail.
The statistical comparisons with the hybrid SfM methods [3, 49] are shown in Table 1. To measure the consistency of camera poses, we use the epipolar error, that is, the median distance between the features and the corresponding epipolar lines computed from the feature correspondences of all the camera pairs, the number of camera pairs connected by 3D points, and the number of final 3D points. Since our clustering algorithm introduces sufficient camera connectivities for a fully constrained global motion averaging rather than directly merging exclusive camera clusters [3, 49], the epipolar error of our approach is only a fraction of that of [3, 49], the number of connected camera pairs is several times larger, and we generate several times more 3D points than the works in [3, 49]. Table 1 also provides the results of our approach with different completeness ratios. We can see that a larger completeness ratio, namely more camera-to-camera connectivities, guarantees a more accurate and complete sparse reconstruction.
Dataset  Wu [56]  Cui [10]  Moulon [34]  Sweeney [48]  Ours  

Fountain-P11  7.7  16.2  0.08  0.07  69.7  0.11  0.13  11.3  2  0.04  0.05  2.9 
Entry-P10  8.5  59.4  0.13  0.16  71.5  0.11  0.19  67.1  2  0.04  0.08  6.0 
Herz-Jesu-P8  10.7  21.7  0.69  4.26  69.7  0.68  4.23  5.7  2  0.71  4.58  3.7 
Herz-Jesu-P25  21.3  73.1  0.11  0.19  293.0  0.13  0.20  8.6  4  0.03  0.07  17.3 
Castle-P19  320.1  573.9  0.34  0.64  544.6  0.43  0.76  619.1  3  0.07  0.08  23.3 
Castle-P30  204.1  671.2  0.34  0.64  739.6  0.32  0.59  566.8  5  0.06  0.09  35.8 
Datasets  Accuracy [meters]  Time [seconds]  

# images  1DSfM [54]  Colmap [42]  Cui [10]  Sweeney [50]  Theia [48]  Ours  1DSfM [54]  Colmap [42]  Cui [10]  Sweeney [50]  Theia [48]  Ours  
Alamo  577  529  0.3  552  0.3  540  0.5  533  0.4  558  0.4  549  0.2  752  910  499  840  476  568  129  198  413  497  173  63  264 
Ellis Island  227  214  0.3  209  0.6  206  0.7  203  0.5  220  4.7  221  0.5  139  171  137  301  158  209  14  33  14  28  26  14  45 
Metropolis  341  291  0.5  324  1.4  281  3.0  272  0.4  321  1.0  298  0.2  201  244  302  532  27  64  94  161  34  47  88  26  125 
Montreal N.D.  450  427  0.4  437  0.3  433  0.3  416  0.3  448  0.4  445  0.3  1135  1249  352  688  632  678  133  266  107  164  167  72  261 
Notre Dame  553  507  1.9  543  0.5  547  0.2  501  1.2  540  0.2  514  0.2  1445  1599  432  708  458  549  161  247  196  331  246  64  338 
NYC Library  332  295  0.4  304  0.6  303  0.3  294  0.4  321  0.9  290  0.3  392  468  311  412  169  210  83  154  47  62  79  52  144 
Piazza del Popolo  350  308  2.2  332  1.2  336  1.4  302  1.8  326  1.0  334  0.5  191  249  246  336  126  191  72  101  46  61  72  16  93 
Piccadilly  2152  1956  0.7  2062  0.6  1980  0.4  1928  1.0  2055  0.7  2114  0.4  2425  3483  623  1814  984  1553  702  1246  72  330  932  542  1614 
Roman Forum  1084  989  0.2  1062  1.7  1033  2.8  966  0.7  1045  2.2  1079  0.4  1245  1457  823  1122  310  482  847  1232  183  244  604  201  902 
Tower of London  572  414  1.0  450  0.7  458  1.2  409  0.9  456  1.4  458  1.0  606  648  542  665  488  558  92  246  130  154  320  75  410 
Union Square  789  710  3.4  726  2.8  570  4.2  701  2.1  720  5.0  720  1.5  340  452  430  532  45  99  102  243  27  48  145  50  207 
Vienna Cathedral  836  770  0.4  799  1.2  774  1.6  771  0.6  797  2.6  793  0.5  2837  3139  930  1254  438  580  422  607  111  244  712  167  905 
Yorkminster  437  401  0.1  416  0.8  407  0.6  409  0.3  414  1.4  407  0.3  777  899  724  924  602  662  71  102  59  92  199  67  281 
5 Experiments
Implementation
We implement our approach in C++ and perform all the experiments on a distributed computing system consisting of 10 computers, each of which has a 6-core (12 threads) Intel 3.40 GHz processor and 128 GB memory. All the computers are deployed on a scalable network file system similar to the Hadoop File System. We implement a multicore bundle adjustment solver similar to PBA [57] to solve all the nonlinear optimization problems, and a solver like [6] to solve Equation 6. We also utilize Graclus [12] to handle the normalized-cut problem.
Benchmark datasets
The statistics of the comparisons on the benchmark datasets [47], with absolute measurements of camera poses, between the state-of-the-art methods [10, 34, 48, 56] and our proposed method are shown in Table 2. Since the number of cameras of the largest benchmark dataset, Castle-P30, is only 30, we set a much smaller cluster size upper bound than the one adopted by our pipeline to force valid camera clusters to be generated. Specifically, we can see that the average errors of relative rotations, relative translations, and corresponding camera positions from our algorithm are all obviously smaller than those of the works in [10, 34, 48, 56].
Internet datasets
Table 3 shows the statistical comparisons with the state-of-the-art SfM pipelines [10, 42, 50, 48, 54] on the Internet datasets. We can see that our approach achieves the best accuracy, measured by the median camera position errors (in meters) after bundle adjustment, in 8 out of 13 datasets. Moreover, we register the most cameras in 4 out of 13 datasets. Among these methods [10, 42, 50, 48, 54], Theia SfM [48] is the most efficient. We can therefore conclude that our SfM pipeline achieves slightly better accuracy and comparable efficiency with respect to the state-of-the-art methods [10, 42, 50, 48, 54] on the datasets captured in the wild.
Dataset  # images  Resolution  Clustering time [minutes]  Pipeline time [hours]  Peak memory [GB]  
Partition  Expansion  Total  TG  LS  MA  BA  Total  Original  Ours  
TG  MA  BA  TG  MA  BA  
City A  1210106  50 Mpixel  164.8k  23867  25.24  18.84  46.88  59.02  34.62  75.26  56.04  275.74  2933.76  39.81  10159.62  34.62  39.81  0.53 
City B  138200  24 Mpixel  73.0k  2721  6.62  4.61  11.71  5.73  3.62  7.34  6.24  23.43  207.76  4.59  666.92  16.47  4.59  0.63 
City C  91732  50 Mpixel  170.1k  1723  5.12  3.17  8.62  2.64  2.30  4.27  7.76  18.10  162.50  3.04  492.39  12.33  3.04  0.62 
City D  36480  36 Mpixel  96.4k  635  2.01  1.25  3.57  1.11  1.21  1.71  3.31  7.64  55.70  1.21  176.57  4.87  1.21  0.67 
Method  Resolution [Mpixels]  # registered cameras  # tracks  Avg. track length  Avg. reproj. error [pixels]
Theia [48]  2.25  19,014  6.78M  4.6  1.84 
OpenMVG [34]  2.25  13,254  4.21M  4.9  1.67 
VisualSfM [56]  2.25  7,230  2.64M  4.3  0.88 
Colmap [42]  2.25  21,431  5.75M  5.2  0.86 
Ours  36.15  36,428  27.8M  6.2  1.18 
City-scale datasets
The statistics of the input city-scale datasets are shown in Table 4. The image resolution ranges from 24 to 50 megapixels, and the average number of detected features per image ranges from 73.0K to 170.1K. We can see that the estimated peak memory for the largest City A dataset would be 2.9 TB, 39.81 GB, and 10.2 TB in track generation, motion averaging, and bundle adjustment respectively if handled by the standard SfM pipeline [34] on a single computer, which obviously exceeds the memory of our servers with 128 GB memory. The same goes for the other standard SfM pipelines [42, 48, 56]. However, our pipeline can recover 1.21 million accurate and consistent camera poses and 1.68 billion sparse 3D points of the largest City A dataset. The corresponding peak memory dramatically drops to 34.62 GB and 0.53 GB in track generation and bundle adjustment respectively. In Figure 10, we further provide the visual results of the city-scale datasets, containing both mesh and textured models with delicate details, to qualitatively demonstrate the high accuracy of the finally recovered camera poses. As shown in Table 5, we fit the whole City D dataset to the standard SfM pipelines [34, 42, 48, 56] by resizing images. We can see that downsampling images leads to an obviously smaller number of registered cameras.
Running time
We test the Internet dataset [54] on a single computer to make a fair comparison of running time, and Table 3 shows that our efficiency is comparable to the works in [26, 38, 48, 54]. As for the city-scale datasets, we note in Table 4 that the running time of track generation and local incremental SfM grows linearly as the number of images increases, while the running time of bundle adjustment, whose complexity grows super-linearly with the numbers of cameras and 3D points even in a distributed manner, and of motion averaging, which can only be handled on a single computer, gradually dominates as the number of images drastically increases. Even for the City B dataset, our parallel computing system composed of 10 computers can successfully reconstruct 138 thousand cameras and 100 million sparse 3D points within one day. Notably, because of the concise design of our clustering algorithm, its running time on the city-scale datasets ranges from 3.57 to 11.71 minutes, which is extremely efficient compared with the time cost of the whole SfM pipeline.
Limitations
Thanks to the fully scalable formulation of our SfM pipeline in terms of camera clusters, the peak memory of track generation of our pipeline is only 2.1% to 8.7% of that of the standard pipelines [10, 34, 45, 48, 56], and the peak memory of bundle adjustment of our approach is even only 0.1‰ to 3.8‰ of that of the standard pipelines. However, since our motion averaging formulation (Section 4.3) still solves all the camera poses from the available relative motions at once, it is limited by the memory of a single computer. We are therefore interested in exploiting our scalable formulation to solve large-scale motion averaging problems in a scalable and parallel manner, and leave this for future study.
6 Conclusions
In this paper, we propose a parallel pipeline able to handle accurate and consistent SfM problems far exceeding the memory of a single computer. A graph-based camera clustering algorithm is first introduced to divide the original problem into subproblems while preserving sufficient connectivities among cameras for a highly accurate and consistent reconstruction. A hybrid SfM method embracing the advantages of both incremental and global SfM methods is subsequently proposed to merge the partial reconstructions into a globally consistent reconstruction. Our pipeline is able to handle city-scale SfM problems, including one dataset with 1.21 million high-resolution images that runs out of memory in the available approaches, in a highly scalable and parallel manner, with superior accuracy and consistency over the state-of-the-art methods.
References
 [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Commun. ACM, 54(10):105–112, 2011.
 [2] M. Arie-Nachimson, S. Z. Kovalsky, I. Kemelmacher-Shlizerman, A. Singer, and R. Basri. Global motion estimation from point matches. In 3DIMPVT, 2012.
 [3] B. Bhowmick, S. Patra, and A. Chatterjee. Divide and conquer: Efficient large-scale structure from motion using graph partitioning. In ACCV, 2014.
 [4] F. Bourse, M. Lelarge, and M. Vojnovic. Balanced graph edge partition. In SIGKDD, 2014.
 [5] M. Brand, M. Antone, and S. Teller. Spectral solution of large-scale extrinsic camera calibration as a graph embedding problem. In ECCV, 2004.
 [6] E. Candès and J. Romberg. l1-magic: Recovery of sparse signals via convex programming, 2005.
 [7] L. Carlone, R. Tron, K. Daniilidis, and F. Dellaert. Initialization techniques for 3D SLAM: a survey on rotation estimation and its use in pose graph optimization. In ICRA, 2015.
 [8] A. Chatterjee and V. M. Govindu. Efficient and robust large-scale rotation averaging. In ICCV, 2013.
 [9] Z. Cui, N. Jiang, C. Tang, and P. Tan. Linear global translation estimation with feature tracks. In BMVC, 2015.
 [10] Z. Cui and P. Tan. Global structure-from-motion by similarity averaging. In ICCV, 2015.
 [11] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
 [12] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. PAMI, 29(11):1944–1957, 2007.
 [13] A. Eriksson, J. Bastian, T.-J. Chin, and M. Isaksson. A consensus-based framework for distributed bundle adjustment. In CVPR, 2016.
 [14] M. Farenzena, A. Fusiello, and R. Gherardi. Structure-and-motion pipeline on a hierarchical cluster tree. In ICCV Workshops, 2009.
 [15] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
 [16] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In ECCV, 2010.
 [17] T. Goldstein, P. Hand, C. Lee, V. Voroninski, and S. Soatto. ShapeFit and ShapeKick for robust, scalable structure from motion. In ECCV, 2016.
 [18] V. M. Govindu. Combining two-view constraints for motion estimation. In CVPR, 2001.
 [19] V. M. Govindu. Lie-algebraic averaging for globally consistent motion estimation. In CVPR, 2004.
 [20] V. M. Govindu. Robustness in motion averaging. In ACCV, 2006.
 [21] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
 [22] R. I. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation averaging. IJCV, 103(3):267–305, 2013.
 [23] M. Havlena, A. Torii, and T. Pajdla. Efficient structure from motion by graph optimization. In ECCV, 2010.
 [24] J. Heinly, E. Dunn, and J.-M. Frahm. Correcting for duplicate scene structure in sparse 3D reconstruction. In ECCV, 2014.
 [25] J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the world* in six days. In CVPR, 2015.
 [26] N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In ICCV, 2013.
 [27] B. Klingner, D. Martin, and J. Roseborough. Street view motion-from-structure-from-motion. In ICCV, 2013.
 [28] L. Kneip, D. Scaramuzza, and R. Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In CVPR, 2011.
 [29] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. PAMI, 27(3):418–433, 2005.
 [30] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV, 2008.
 [31] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 [32] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multi-view reconstruction. In ICPR, 2007.
 [33] P. Moulon and P. Monasse. Unordered feature tracking made fast and easy. In CVMP, 2012.
 [34] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In ICCV, 2013.
 [35] K. Ni, D. Steedly, and F. Dellaert. Out-of-core bundle adjustment for large-scale 3D reconstruction. In ICCV, 2007.
 [36] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–770, 2004.
 [37] D. Nistér and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
 [38] O. Özyesil and A. Singer. Robust camera location estimation by convex programming. In CVPR, 2015.
 [39] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a handheld camera. IJCV, 59(3):207–232, 2004.
 [40] B. Resch, H. P. Lensch, O. Wang, M. Pollefeys, and A. S. Hornung. Scalable structure from motion for densely sampled videos. In CVPR, 2015.
 [41] R. Roberts, S. N. Sinha, R. Szeliski, and D. Steedly. Structure from motion for scenes with large duplicate structures. In CVPR, 2011.
 [42] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
 [43] J. L. Schönberger, F. Radenović, O. Chum, and J.-M. Frahm. From single image query to detailed 3D reconstruction. In CVPR, 2015.
 [44] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. In ECCV Workshop on RMLE, 2010.
 [45] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring image collections in 3d. SIGGRAPH, 2006.
 [46] N. Snavely, S. M. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, 2008.
 [47] C. Strecha, W. von Hansen, L. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multiview stereo for high resolution imagery. In CVPR, 2008.
 [48] C. Sweeney. Theia multi-view geometry library: Tutorial & reference. http://theiasfm.org.
 [49] C. Sweeney, V. Fragoso, T. Höllerer, and M. Turk. Large-scale SfM with the distributed camera model. In 3DV, 2016.
 [50] C. Sweeney, T. Sattler, T. Höllerer, M. Turk, and M. Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, 2015.
 [51] R. Toldo, R. Gherardi, M. Farenzena, and A. Fusiello. Hierarchical structure-and-motion recovery from uncalibrated images. CoRR, abs/1506.00395, 2015.
 [52] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjustment: a modern synthesis. In LNCS, 2000.
 [53] K. Wilson and N. Snavely. Network principles for sfm: Disambiguating repeated structures with local context. In ICCV, 2013.
 [54] K. Wilson and N. Snavely. Robust global translations with 1DSfM. In ECCV, 2014.
 [55] C. Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/~ccwu/siftgpu, 2007.
 [56] C. Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
 [57] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In CVPR, 2011.
 [58] C. Zach, A. Irschara, and H. Bischof. What can missing correspondences tell us about 3d structure and motion? In CVPR, 2008.
 [59] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, 2010.