Progressive Structure from Motion

by Alex Locher, et al.

Structure from Motion (SfM), the sparse 3D reconstruction from individual photos, is a long-studied topic in computer vision. Yet none of the existing reconstruction pipelines fully addresses a progressive scenario in which images only become available during the reconstruction process and intermediate results are delivered to the user. Incremental pipelines are capable of growing a 3D model but often get stuck in local minima due to wrong (binding) decisions taken based on incomplete information. Global pipelines, on the other hand, need access to the complete viewgraph and are not capable of delivering intermediate results. In this paper we propose a new reconstruction pipeline that works in a progressive manner rather than in a batch processing scheme. The pipeline is able to recover from failed reconstructions in early stages, avoids taking binding decisions, delivers a progressive output and yet maintains the capabilities of existing pipelines. We demonstrate and evaluate our method on diverse challenging public and dedicated datasets, including those with highly symmetric structures, and compare it to the state of the art.



1 Introduction

3D reconstruction from individual photographs is a long-studied topic in computer vision [30, 1, 12]. The field of Structure from Motion (SfM) deals with the intrinsic and extrinsic calibration of sets of images and recovers a sparse 3D structure of the scene at the same time. Traditional methods are usually designed as batch processing algorithms, where image acquisition and image processing are separated into two independent steps. This contrasts with current demand, where one would like to be able to convert an object or a scene into a 3D model anytime and anywhere, just by using the mobile phone in one's pocket. Recent developments in mobile technology and the availability of 3D printers have raised the need for 3D content even more and underline the importance of 3D modeling. In a user-centric scenario, images are taken on the spot and processed by a 3D modeling pipeline on-the-fly [15, 18]. Any feedback that becomes available to the user helps to guide the acquisition and, even more importantly, ensures that the 3D model represents the real-world object in the desired quality. In a collaborative scenario, multiple users acquire pictures of the same object and the images are gathered on a reconstruction server in the cloud. The 3D model is progressively built and intermediate reconstruction results are shown to the user. Images become available as they are taken and the SfM pipeline never has access to the complete set of information in the dataset. Moreover, the whole reconstruction process might have a starting point but no predefined end point: users might always decide to add more images to an existing model.

(a) incremental SfM points
(b) progressive SfM points
(c) progressive SfM clusters
(d) progressive SfM viewgraph
Figure 1: As opposed to existing methods, a favorable image order is not crucial for the proposed progressive SfM pipeline. While the baseline method (a) fails to recover the structure of the scene, our method successfully reconstructs the temple (b). Individual clusters are identified in the viewgraph (d) and merged (c) based on a lightweight optimization.

In this work we therefore propose a progressive SfM pipeline which avoids taking (potentially fatal) binding decisions and is thus as independent of the input image order as possible. Moreover, the proposed pipeline reuses already computed intermediate results in later steps and is suited for delivering progressive modeling results back to the user within seconds.

1.1 Related Work

Classical SfM pipelines are typically not suited to such a progressive, multiuser-centric scenario. Global pipelines [21, 28] start by estimating the poses of all cameras in the dataset and estimate the structure in a second step. Evidently, this relies on access to the complete dataset, which contradicts the idea of progressive 3D modeling. Sequential SfM pipelines, sometimes termed SLAM [7, 22], are inherently suitable for processing (potentially infinite) streams of images. Nevertheless, the underlying assumption often is that images neighboring in the sequence are spatially close in the scene, which is easily violated when streams from multiple users are combined. Incremental SfM pipelines [30, 35, 26] build a 3D model by initializing the structure from a small seed and gradually growing it by adding additional cameras. This scheme is closer to the requested progressive scenario but, unfortunately, depends strongly on the order in which images are added to the model [20, 27]. View selection algorithms [33] carefully determine the image order, usually by employing the global matching information, which is not available in the progressive case. Hierarchical SfM pipelines [10, 8] try to overcome the problem of improper seed selection by starting from several seed locations at the same time and eventually merging the partial 3D models into a single global model in later stages. Sweeney et al. [32] group multiple individual cameras and optimize them jointly as a distributed camera. However, all hierarchical methods require knowledge of all images in advance. Our proposed method is partially inspired by these approaches, but in order to provide the progressive capability, the hierarchy is not fixed but rather re-defined every time new images are added to the reconstruction process.

Due to the lack of global information, an incremental pipeline would connect new images to the existing model based on incomplete information, which often causes corrupted 3D models [36, 25, 14, 11]. Even more importantly, incremental pipelines cannot recover from wrong decisions taken based on missing information and can therefore easily get stuck in only locally optimal reconstructions. The only reliable solution would be to re-run an incremental or global pipeline from scratch whenever new images become available, which leads to an impractical algorithm runtime. Heinly et al. [11] detect erroneous reconstructions of scenes with duplicate structure in a post-processing step by evaluating different splits of the model into submodels and potentially merging them in the correct configuration by leveraging conflicting observations. In contrast, our method is a full-fledged SfM pipeline which avoids getting trapped in a local minimum in the first place. Faulty configurations are detected and corrected on-the-fly and not in a post-processing step.

Our work is based on a lightweight representation of the complete scene as a viewgraph. Many existing approaches investigated robustification of the global viewgraph by filtering out bad epipolar geometries [33, 6] and enforcing loop constraints [37]. Wilson and Snavely [34] are able to reconstruct scenes with repetitive structures by scoring repetitive features using local clustering. Recent work by Cui [5] takes an approach similar to our pipeline. The Hybrid SfM (HSfM) pipeline estimates all rotations of the viewgraph in a community-based global algorithm. The estimated orientations are leveraged in the second phase of the pipeline by estimating the translations and structure of the scene in an incremental scheme with reduced dimensionality. While sharing the idea of combining global and incremental schemes, HSfM is a pure batch processing algorithm. The global rotation averaging needs access to the complete viewgraph in advance, which is not available in a progressive scheme. In our work we combine a dynamic global viewgraph with a local clustering based on a connectivity score and thereby combine the advantages of incremental and global structure from motion. In order to accommodate the demands of a progressive pipeline, we allow for flexibility in already reconstructed parts of the model and constantly verify the local reconstructions against the globally optimized viewgraph.

1.2 Contributions

We propose a novel progressive SfM pipeline which enables 3D reconstruction in an anytime, anywhere, multiuser-centric scenario. Unlike traditional pipelines, the proposed approach avoids taking binding decisions, does not depend on the order of incoming images and is able to recover from wrong decisions taken due to the lack of information. Moreover, the computed intermediate results are propagated along the reconstruction process, resulting in an efficient use of computational resources.

2 Progressive Reconstruction

In the following section we give an overview of our progressive SfM pipeline and detail individual components and key aspects later on.

2.1 Overview

Our progressive SfM pipeline takes an ordered sequence of images with their geometrically verified correspondences as the input and delivers a sparse pointcloud and calibrated camera poses as the output (see Figure 2). The resulting sparse configuration is updated with every image added to the scene. A viewgraph with images as nodes and edges connecting pairs of matched images is gradually built and serves as the global knowledge throughout the whole reconstruction. In every iteration of the algorithm, the viewgraph is clustered based on local connectivity and individual clusters are processed locally. In each of the local clusters a robust rotation averaging scheme filters out wrong two-view geometries and the 3D structure is estimated using either an incremental or a global SfM pipeline. The cluster configuration between two time steps is tracked and the already estimated parts of the 3D model are passed to the next stage. The global configuration of individual clusters is estimated in the last stage by robustly estimating 7 DoF similarity transforms between them using the remaining inter-cluster constraints. Generally, the local incremental method enables robust and efficient reconstructions, while the viewgraph combined with the robust rotation averaging injects the global knowledge and allows for correction of corrupted 3D models.

Figure 2: An overview of the main steps of the progressive SfM pipeline and its involved components.

2.2 Progressive Viewgraph

The algorithm takes a (randomly) ordered sequence of images as the input. Every incoming image is matched against the most relevant images already present in the scene and geometrically verified pairwise correspondences are obtained. A viewgraph with images as vertices is maintained at every time step. Two vertices are connected by an undirected edge iff the matching matrix contains at least a minimum number of correspondences between them. Every edge has an associated relative rotation and translation direction, which is obtained by decomposing either the estimated essential matrix in the calibrated case or the fundamental matrix when the focal length is not known.
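The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `ViewGraph`, the method names and the threshold value of 30 are assumptions made for the example, and the relative rotation and translation stored on each edge are omitted.

```python
from collections import defaultdict

MIN_CORRESPONDENCES = 30  # hypothetical threshold; the value used in the paper is not known here


class ViewGraph:
    """Undirected graph: images are vertices, sufficiently matched pairs are edges."""

    def __init__(self, min_correspondences=MIN_CORRESPONDENCES):
        self.min_correspondences = min_correspondences
        # image id -> {neighbor id: number of verified correspondences}
        self.adjacency = defaultdict(dict)

    def add_image(self, image_id, matches):
        """Add a new image; `matches` maps existing image ids to verified match counts."""
        self.adjacency[image_id]  # touch the key so that isolated images are vertices too
        for other_id, count in matches.items():
            if other_id == image_id or other_id not in self.adjacency:
                continue
            if count >= self.min_correspondences:
                self.adjacency[image_id][other_id] = count
                self.adjacency[other_id][image_id] = count

    def edges(self):
        """Return each undirected edge once as {(a, b): count} with a < b."""
        return {
            (a, b): count
            for a, neighbors in self.adjacency.items()
            for b, count in neighbors.items()
            if a < b
        }


graph = ViewGraph()
graph.add_image("a", {})
graph.add_image("b", {"a": 120})
graph.add_image("c", {"a": 12, "b": 45})  # the pair (a, c) has too few matches
```

Note that the sketch only thresholds on the match count and performs no geometric outlier rejection on the graph itself.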

The order in which images are fed to a reconstruction pipeline plays an important role, and every snapshot of the viewgraph only captures the past information of the reconstruction process. This is why filtering supposedly wrong two-view geometries at this stage can be very dangerous. It might happen that a geometry is inconsistent with other local geometries in the neighborhood at the current time step (this can, e.g., be checked by computing the cumulative rotation of loops in which the edge participates [37]), but connecting images added in later steps may show that the two-view geometry was actually correct. As a wrong decision in the global viewgraph could lead to a local minimum in the reconstruction, i.e. a corrupted 3D model, we do not conduct any outlier rejection and defer robustification to a later stage.

2.3 Clustering

Motivated by the general observation that densely connected regions of the viewgraph are likely to form a 3D model worth reconstructing, the viewgraph is clustered in a second step. The distance between two vertices is based on the weighted Jaccard distance of their adjacent edges, where connections to neighboring vertices are weighted by the number of verified correspondences.


A set of clusters is hierarchically grown by single-linkage clustering until no single edge with a distance below the defined threshold exists between two clusters. Single-linkage clustering tends to generate clusters with a chain-like topology in which the two end nodes might have a distance far larger than the defined threshold. While this might be a disadvantage in other applications, it is actually beneficial in ours, as the local (incremental) reconstruction pipeline performs well for such graph structures.
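The clustering step can be sketched as follows. The exact distance used in the paper is not reproduced here, so the code relies on the standard weighted Jaccard distance, one minus the sum of element-wise minima over the sum of element-wise maxima of the two neighbor-weight vectors, and on the assumption that the shared edge counts as common to both vertices. Cutting a single-linkage hierarchy at a distance threshold is equivalent to taking the connected components of the edges below that threshold, which is what the union-find loop does. The function names and the threshold of 0.5 are hypothetical.

```python
def weighted_jaccard_distance(adjacency, i, j):
    """Weighted Jaccard distance between the neighbor-weight vectors of i and j.

    Assumption: each vertex is treated as its own neighbor with the weight of
    the shared edge, so the edge (i, j) contributes to the intersection.
    """
    weights_i = dict(adjacency[i])
    weights_j = dict(adjacency[j])
    shared = adjacency[i].get(j, 0)
    weights_i[i] = shared
    weights_j[j] = shared
    keys = set(weights_i) | set(weights_j)
    intersection = sum(min(weights_i.get(k, 0), weights_j.get(k, 0)) for k in keys)
    union = sum(max(weights_i.get(k, 0), weights_j.get(k, 0)) for k in keys)
    return 1.0 - intersection / union if union else 1.0


def single_linkage_clusters(adjacency, max_distance):
    """Merge vertices across every edge whose distance is below max_distance."""
    parent = {v: v for v in adjacency}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i in adjacency:
        for j in adjacency[i]:
            if i < j and weighted_jaccard_distance(adjacency, i, j) < max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for v in adjacency:
        clusters.setdefault(find(v), set()).add(v)
    return sorted(clusters.values(), key=min)


# A well-connected triangle (a, b, c) with a weakly attached image d.
adjacency = {
    "a": {"b": 100, "c": 100},
    "b": {"a": 100, "c": 100},
    "c": {"a": 100, "b": 100, "d": 40},
    "d": {"c": 40},
}
clusters = single_linkage_clusters(adjacency, max_distance=0.5)
```

In this toy graph the triangle forms one cluster, while `d` stays on its own because its only edge has a distance of about 0.71.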

Incremental Clustering

Due to the single-linkage hierarchical clustering, a simplified incremental scheme can be used to update an existing cluster topology with a new node. While a single extra node can in general cause a complete change in the topology of the clustered graph, the changes are limited to clusters which are connected to the new node. As a result, only clusters with an edge connecting to the new node have to be re-clustered, which leads to an efficient and scalable implementation. Note that in the worst case the whole graph might still be updated, but in most cases a new image only connects to a few clusters and therefore most of the existing clusters remain untouched.
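A sketch of this update rule, under the assumption that clusters are stored as Python sets: only the clusters that share an edge with the new node are pooled and re-clustered, all others are passed through unchanged. The `recluster` callback is a placeholder for the actual clustering step; `connected_components` is a hypothetical stand-in used only to keep the example runnable.

```python
def update_clusters(clusters, adjacency, new_node, recluster):
    """Re-cluster only those clusters that are connected to new_node."""
    neighbors = set(adjacency[new_node])
    touched = [c for c in clusters if c & neighbors]
    untouched = [c for c in clusters if not (c & neighbors)]
    affected = set().union(*touched) | {new_node}
    return untouched + recluster(adjacency, affected)


def connected_components(adjacency, nodes):
    """Hypothetical stand-in for the real re-clustering of the affected nodes."""
    remaining, components = set(nodes), []
    while remaining:
        stack, component = [remaining.pop()], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for n in adjacency[v]:
                if n in remaining:
                    remaining.discard(n)
                    stack.append(n)
        components.append(component)
    return components


clusters = [{"a", "b"}, {"c", "d"}, {"e"}]
adjacency = {
    "a": {"b": 1}, "b": {"a": 1, "f": 1},
    "c": {"d": 1}, "d": {"c": 1},
    "e": {"f": 1}, "f": {"b": 1, "e": 1},  # f is the new image
}
updated = update_clusters(clusters, adjacency, "f", connected_components)
```

Here the cluster `{c, d}` is never visited, while the two clusters touched by `f` are pooled and handed to the re-clustering step.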

2.4 Cluster Tracking and Recycling

In the transition between two consecutive timesteps, the topology of the clusters can undergo large changes, but it will mostly either stay the same or be extended by the new image. All changes in the clusters have to be propagated to the eventually estimated 3D structure. We therefore keep track of the nodes changing cluster between the two timesteps and add or remove the corresponding images in the local reconstruction. The recycling of intermediate (partial) 3D reconstructions can be realized by a merge and split scheme. If a group of interconnected nodes transfers together from one cluster to another, the corresponding cameras in the 3D model can be separated from the rest and merged into the potentially existing structure of the new cluster.
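One simple way to derive the add and remove operations is to pair every new cluster with the old cluster it overlaps most. The following sketch assumes this greedy pairing, which is an illustration rather than the tracking rule of the paper; the function name and the dictionary keys are hypothetical. Nodes listed under `add` that come from another old cluster are the candidates for transferring already estimated structure.

```python
def plan_recycling(old_clusters, new_clusters):
    """For each new cluster, decide which cameras to reuse, add and drop."""
    plans = []
    for new in new_clusters:
        # Greedy assumption: recycle the reconstruction of the best-overlapping old cluster.
        best = max(old_clusters, key=lambda old: len(old & new), default=set())
        if not (best & new):
            best = set()  # no overlap at all: nothing to recycle
        plans.append({
            "reuse": best & new,   # cameras whose estimated poses are kept
            "add": new - best,     # new or transferred cameras to be inserted
            "drop": best - new,    # cameras that left this cluster
        })
    return plans


old = [{1, 2, 3}, {4, 5}]
new = [{1, 2}, {3, 4, 5, 6}]  # image 3 moved, image 6 is new
plans = plan_recycling(old, new)
```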

2.5 Cluster Reconstruction

Once individual clusters have been identified by the hierarchical clustering, the 3D structure of the images and correspondences of every cluster is estimated. The cluster reconstruction process is only triggered if the topology of the cluster has changed, meaning that nodes were either added to or removed from the cluster. If the topology of the cluster is unchanged between two consecutive time steps, the reconstruction step is skipped as a whole.

A sub-graph capturing all vertices and edges of the cluster is extracted in a first step. All following operations are restricted to the scope of the extracted sub-graph.

Robust Rotation Averaging

The unfiltered viewgraph is potentially corrupted by outliers and has to be cleaned up for further usage. As this step is performed in every iteration, it is safe to reject outlier two-view geometries. Because the filtering stage is executed repeatedly, it is important to use a computationally efficient filtering scheme. As proposed by Chatterjee et al. [3], we estimate the global rotations of the subgraph with a robust optimization. Global rotations are initialized by concatenating the relative rotations along an MST extracted from the viewgraph, using the number of verified inliers as edge weights. The global rotations are then optimized by minimizing the relative rotation errors.


By using the Lie-algebraic approximation of the relative rotation, the error can be expressed as a difference of the corresponding rotations, where $\omega = \theta n$ denotes a rotation by angle $\theta$ around the unit axis $n$.


The relative rotations can be encoded in a sparse matrix where each row only has two nonzero entries. The robust norm combined with an edge weighting depending on the number of verified correspondences allows us to obtain the global rotations of the cluster in an iterative scheme by optimizing Eqn. 4 in every step. For more details we refer the reader to the original publication [3].
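For orientation, the formulation of [3] that the preceding paragraphs summarise can be sketched as follows. The symbols ($R_i$, $R_{ij}$, $\omega$, $A$) and the convention $R_j = R_{ij} R_i$ are chosen for this sketch and are assumptions about notation; the lines are not numbered and do not correspond one-to-one to the numbered equations referenced in the text.

```latex
\begin{align*}
  R_{ij} &= R_j R_i^{-1}
    && \text{relative rotation on edge } (i, j) \\
  \omega_{ij} &\approx \omega_j - \omega_i
    && \text{first-order Lie-algebraic approximation, } \omega = \theta n \\
  A\,\omega_{\mathrm{global}} &= \omega_{\mathrm{rel}}
    && \text{each row of } A \text{ holds one } +I \text{ and one } -I \text{ block} \\
  \Delta\omega_{\mathrm{global}} &= \arg\min_{x}\,
      \lVert A x - \Delta\omega_{\mathrm{rel}} \rVert_{\ell_1}
    && \text{robust update solved in every iteration}
\end{align*}
```

Here $\Delta\omega_{\mathrm{rel}}$ stacks the residual rotations of all edges under the current estimate, and the estimate is updated with $\Delta\omega_{\mathrm{global}}$ until convergence.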


The obtained global rotations can optionally be refined by an iteratively reweighted least squares (IRLS) method using a robustified norm. As we are primarily interested in rejecting outliers, the result of the optimization is accurate enough and we omit the refinement step.

Finally, edges whose relative rotation error exceeds a threshold are removed from the local viewgraph.
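The initialization and the final edge filter can be illustrated with NumPy. The sketch below makes several assumptions that are not taken from the paper: the convention `R_j = R_ij @ R_i`, a spanning tree that prefers edges with more inliers (built with Prim's algorithm), and a hypothetical threshold of 5 degrees. The robust optimization of [3] that sits between the two steps is deliberately left out, so the filter is applied directly to the initial rotations.

```python
import heapq
import numpy as np


def rotation_z(angle):
    """Rotation by `angle` radians about the z axis (helper for the example)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_angle(R):
    """Angle in radians of the rotation matrix R."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))


def init_global_rotations(relative, inliers, root):
    """Chain relative rotations along a maximum-inlier spanning tree."""
    neighbors = {}
    for (i, j), R_ij in relative.items():
        weight = inliers[(i, j)]
        neighbors.setdefault(i, []).append((weight, j, R_ij))
        neighbors.setdefault(j, []).append((weight, i, R_ij.T))
    rotations = {root: np.eye(3)}
    heap, tie, frontier = [], 0, [root]
    while frontier or heap:
        for src in frontier:
            for weight, dst, R in neighbors.get(src, []):
                if dst not in rotations:
                    # The unique tie counter keeps the heap from comparing arrays.
                    heapq.heappush(heap, (-weight, tie, src, dst, R))
                    tie += 1
        frontier = []
        if heap:
            _, _, src, dst, R = heapq.heappop(heap)
            if dst not in rotations:
                rotations[dst] = R @ rotations[src]
                frontier = [dst]
    return rotations


def filter_edges(relative, global_rotations, max_error_rad):
    """Keep edges whose relative rotation agrees with the global rotations."""
    kept = {}
    for (i, j), R_ij in relative.items():
        residual = R_ij.T @ global_rotations[j] @ global_rotations[i].T
        if rotation_angle(residual) <= max_error_rad:
            kept[(i, j)] = R_ij
    return kept


# Three cameras; the weakly supported edge (0, 2) is a wrong two-view geometry.
relative = {(0, 1): rotation_z(0.3), (1, 2): rotation_z(0.4), (0, 2): rotation_z(2.0)}
inliers = {(0, 1): 200, (1, 2): 150, (0, 2): 20}
global_rotations = init_global_rotations(relative, inliers, root=0)
kept = filter_edges(relative, global_rotations, max_error_rad=np.deg2rad(5.0))
```

In the example the spanning tree avoids the edge (0, 2) because of its low inlier count, and the filter then removes it because it disagrees with the chained rotations by 1.3 rad.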


Using the global rotations and the pairwise matches as the input, the structure of the individual clusters is estimated with different methods depending on the current status (Figure 3).


If no structure could be recovered in earlier time steps and the cluster contains at least a minimum number of nodes but fewer than a maximum number of images, a new reconstruction is started using an incremental SfM pipeline similar to [30, 35]. If the number of images in the cluster exceeds this maximum, incremental pipelines become inefficient and a global pipeline is better suited. We therefore refine the global rotations and estimate the global positions using the 1DSfM method [34]. The structure is then obtained by triangulating consistent feature tracks and the configuration is refined by a bundle adjustment step.


In most cases an existing local reconstruction can be extended by one or multiple images which are either new or transferred from another cluster. In order to extend an existing 3D reconstruction with a new image, we extend existing feature tracks by the new correspondences and estimate the 3D position of the camera with the P3P algorithm [16]. New tracks are added and potentially conflicting tracks are split up. A local bundle adjustment refines the structure and calibration of the newly added cameras. If the model has grown by more than a given percentage, a bundle adjustment step over the local reconstruction refines the whole structure.


If a large part (more than a minimum number of nodes) of an already estimated local model is transferred to another cluster, its already estimated structure is kept and transferred to the new cluster. Commonly estimated tracks are fused and a bundle adjustment step refines the structure. Potentially unestimated cameras are then added in an incremental scheme as described before.
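The case distinction above can be summarised as a small decision function. All parameter names and default values are hypothetical placeholders, since the thresholds of the paper are not available here, and the order in which the cases are tested is an assumption of this sketch.

```python
def choose_strategy(num_images, has_structure, num_transferred,
                    *, min_images=3, max_incremental=100, min_transfer=10):
    """Pick how the structure of one cluster is (re-)estimated.

    The thresholds are illustrative defaults, not the values from the paper.
    """
    if num_transferred > min_transfer:
        return "transfer"     # keep the transferred sub-model, fuse tracks, bundle adjust
    if has_structure:
        return "extend"       # add new cameras to the existing local reconstruction
    if num_images < min_images:
        return "wait"         # too few images to start a reconstruction
    if num_images < max_incremental:
        return "incremental"  # start a new incremental reconstruction
    return "global"           # large cluster without structure: global pipeline


strategies = [
    choose_strategy(2, False, 0),
    choose_strategy(25, False, 0),
    choose_strategy(400, False, 0),
    choose_strategy(26, True, 1),
    choose_strategy(60, True, 30),
]
```

A cluster with exactly `max_incremental` images is sent to the global pipeline here; the text leaves this boundary case open.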

Figure 3: Depending on the status of an individual cluster, its structure is estimated with different methods.

Detection of Bad Configurations

Due to the incremental scheme used for cluster reconstruction, we can reuse most of the intermediate results from an earlier stage and propagate them to later stages. However, the incremental scheme is highly dependent on the actual image ordering, and therefore some unfortunate decisions taken in early stages (e.g. wrongly connecting an image due to missing information) cannot be undone in later stages. Global clustering usually solves this problem for us, as the wrongly connected image or sub-model is likely to be transferred to another cluster at a later stage. But it might also happen that the image actually belongs to the same cluster and yet is wrongly connected. Without any additional countermeasures, we would end up with a corrupted reconstruction in such cases.

By using the relative global rotations of the local viewgraph (which are independent of the image order), we can evaluate an error measure between the rotations of the current local reconstruction and the global rotations.


If the rotation error is above a certain limit, the reconstruction is likely to be erroneous. Hence it is reset and re-estimated in the next phase of the reconstruction.
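A possible form of such an error measure is sketched below. Because the reconstruction and the rotation averaging live in different coordinate frames, the sketch compares rotations relative to a common reference camera, which removes the unknown global rotation. The choice of the mean angular error, the reference camera, and the function names are assumptions for illustration, not the measure defined in the paper.

```python
import numpy as np


def rotation_angle(R):
    """Angle in radians of the rotation matrix R."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))


def reconstruction_rotation_error(reconstructed, averaged):
    """Mean angular disagreement between two sets of camera rotations.

    Both inputs map camera ids to 3x3 rotation matrices. Rotations are
    compared relative to the first common camera, so a global change of
    coordinate frame does not contribute to the error.
    """
    common = sorted(set(reconstructed) & set(averaged))
    if len(common) < 2:
        return 0.0
    ref = common[0]
    errors = []
    for cam in common[1:]:
        rel_rec = reconstructed[cam] @ reconstructed[ref].T
        rel_avg = averaged[cam] @ averaged[ref].T
        errors.append(rotation_angle(rel_rec.T @ rel_avg))
    return float(np.mean(errors))


def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# The reconstruction uses a different world frame G than the averaged rotations.
G = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
averaged = {0: rotation_z(0.0), 1: rotation_z(0.3), 2: rotation_z(0.7)}
good = {k: R @ G for k, R in averaged.items()}
bad = dict(good)
bad[2] = rotation_z(1.7) @ G  # camera 2 is off by 1 rad

good_error = reconstruction_rotation_error(good, averaged)
bad_error = reconstruction_rotation_error(bad, averaged)
```

A reset would then be triggered whenever the returned value exceeds the chosen limit.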

2.6 Cluster Pose Estimation

The output of the pipeline so far consists of multiple locally highly consistent but disconnected model parts. In order to merge them into a final representation, we make use of the remaining inter-cluster connections. As this step of merging multiple models is often rather fragile and can easily go wrong, several methods employ rather conservative merging criteria, such as multiple overlapping cameras [10] or a high number of common points [35]. Instead, we estimate pairwise similarity transforms between pairs of clusters and optimize the global positions of the cluster centers. For robustification, we use a two-stage verification scheme for individual inter-cluster constraints.

In the first stage, a similarity transform based on 3D-to-3D correspondences is estimated in a RANSAC scheme [13, 9] from every single edge between two clusters. The vertices (camera centers) of one cluster are afterwards transformed using the estimated transformation, and the camera configuration of the obtained combined model is compared to the individual configurations before merging. A neighborhood similarity measure evaluates how many of the cameras in the merged cluster would retain the closest neighbors they had within their respective original clusters, where the camera centers of the transformed cluster are mapped by the similarity transform.


Edges with a neighborhood similarity measure below a threshold are rejected as outliers for the final constraint estimation. The measure is motivated by the observation that nearby cameras should already have been clustered by the original algorithm, and it prevents merged models in which the cameras are extensively interleaved.
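The neighborhood similarity measure can be sketched with NumPy as the fraction of cameras whose nearest neighbor is the same before and after merging. The exact definition in the paper is not reproduced here, so the use of a single nearest neighbor, the function name and the parametrisation of the similarity transform as `(scale, R, t)` are assumptions. Each cluster needs at least two cameras for the measure to be defined.

```python
import numpy as np


def neighborhood_similarity(centers_a, centers_b, scale, R, t):
    """Fraction of cameras that keep their nearest neighbor after merging.

    centers_b is mapped into the frame of centers_a by x -> scale * R @ x + t.
    """
    a = np.asarray(centers_a, dtype=float)
    b = scale * np.asarray(centers_b, dtype=float) @ np.asarray(R).T + np.asarray(t)
    merged = np.vstack([a, b])

    def nearest(points):
        d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
        np.fill_diagonal(d, np.inf)  # a camera is not its own neighbor
        return d.argmin(axis=1)

    # Nearest neighbors inside each cluster, expressed as indices into `merged`.
    before = np.concatenate([nearest(a), nearest(b) + len(a)])
    after = nearest(merged)
    return float(np.mean(before == after))


cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
cluster_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]

# Plausible transform: cluster b continues the camera path next to cluster a.
separated = neighborhood_similarity(cluster_a, cluster_b, 1.0, np.eye(3), (10.0, 0.0, 0.0))
# Implausible transform: the cameras of both clusters end up interleaved.
interleaved = neighborhood_similarity(cluster_a, cluster_b, 1.0, np.eye(3), (0.4, 0.0, 0.0))
```

The first transform leaves all nearest neighbors intact, whereas the second one changes every single neighbor and would be rejected for any positive threshold.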

In the second stage, a final transformation between the clusters is estimated using all correspondences of edges that survived the first filtering stage. All relative constraints are gathered and a cluster graph with local reconstructions as nodes and pairwise inter-cluster constraints as edges is created.


The global poses of the clusters are afterwards obtained by iteratively minimizing the residuals of the pairwise relative pose constraints under the robust Huber error function.
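To illustrate why the Huber function makes this optimization tolerant to wrong inter-cluster constraints, the following sketch evaluates a robust cost on a toy cluster graph. It is strongly simplified: poses are reduced to 3D positions and constraints to expected position offsets, whereas the pipeline optimizes full similarity transforms in g2o. The function names and `delta=1.0` are assumptions.

```python
import numpy as np


def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude ** 2
    return delta * (magnitude - 0.5 * delta)


def cluster_graph_cost(positions, constraints, delta=1.0):
    """Sum of Huber losses over translation-only inter-cluster constraints.

    constraints is a list of (a, b, offset) with the expectation
    positions[b] - positions[a] == offset.
    """
    total = 0.0
    for a, b, offset in constraints:
        diff = np.asarray(positions[b], float) - np.asarray(positions[a], float)
        total += huber(np.linalg.norm(diff - np.asarray(offset, float)), delta)
    return total


positions = {0: (0.0, 0.0, 0.0), 1: (2.0, 0.0, 0.0)}
constraints = [
    (0, 1, (2.0, 0.0, 0.0)),  # consistent constraint, residual 0
    (0, 1, (7.0, 0.0, 0.0)),  # outlier constraint, residual 5
]
cost = cluster_graph_cost(positions, constraints)
```

With a squared error the outlier would contribute 12.5 to the cost; under the Huber function it contributes only 4.5, which limits its pull on the solution.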

While this in the optimal case leads to a single 3D model, the flexibility and efficiency of the global optimization allows us to choose a much less conservative threshold for cluster merging. In addition to its efficiency, the cluster graph representation can easily incorporate additional extrinsic constraints, e.g., the gravity vector for orientation with respect to the ground plane or coarse cluster-wise GPS locations for geo-referencing. Such additional constraints can help to put individual models in relative perspective even if pure vision based inter-cluster constraints are not (yet) available.

3 Experiments and Results

In the following section, we first give some implementation details of the proposed solution and introduce the conducted experiments.

3.1 Implementation

We implemented the proposed pipeline in C++. SIFT [19] features are detected on all images and described with the RootSIFT descriptor [2]. The pipeline is set up for stream processing, as dictated by the progressive reconstruction scheme. Every incoming image is indexed by a 1M word vocabulary trained on the independent Oxford5k [23] dataset and the top 100 nearest neighbors are retrieved using a fast tf-idf scoring [29]. Matches are geometrically verified by estimating the fundamental or essential matrix (depending on the availability of the focal length) as well as a homography matrix in a RANSAC scheme using the general USAC framework [24]. The relative rotation and translation direction are obtained by decomposing the fundamental, essential, or homography matrix (depending on the number of inliers). The incremental and global reconstruction of cluster centers is based on the publicly available Theia library [31] and the final cluster graph optimization is realized in the g2o [17] framework. The following parameters were used within all the conducted experiments: , , , , , .

3.2 Baselines

Throughout our experiments we compare the proposed progressive pipeline against two linear pipelines (VisualSFM [35] and incremental Theia [31]) as well as against the recently published hybrid pipeline HSfM [5]. For an unbiased comparison, all geometrically verified matches were precomputed and imported into the various pipelines. The streaming part of the pipeline was skipped and matches were loaded directly from disk.

Figure 4: Individual snapshots showing the viewgraph (top) and the sparse structure (bottom) of a reconstruction of the temple scene using a very unfortunate image ordering document the algorithm’s recovery ability. Two separate parts of the temple are initially wrongly reconstructed in a single model (a-b) and are later successfully separated (c) and the real structure is recovered (d).

Within the experiments we run the pipelines in different modes:

batch mode

All images are added to the pipeline at once and the model is reconstructed without intermediate results. This is the classical SfM operation mode.

progressive mode

Images are fed to the pipeline in a certain order and intermediate reconstructions are enforced. In VisualSFM this is realized by the “Add Image” option and in Theia we dictated the order in which views are added to an ongoing reconstruction by a slight code modification.

Figure 5: Behavior of the incremental and progressive SfM pipelines on the TempleOfHeaven dataset. In the well-behaved “linear” case (blue), both pipelines reconstruct the temple. If the input order is shuffled, the incremental pipeline gets stuck in a local minimum (green) whereas the proposed pipeline recovers and reconstructs the whole scene (red).

3.3 Randomized Image Order

One of the key contributions of the proposed pipeline is the ability to recover from wrong connections between image pairs made in previous reconstruction cycles. Wrong connections often occur in symmetric environments, which is why we chose the publicly available TempleOfHeaven dataset [14], with its rotationally symmetric structure, for the first experiment. The dataset consists of 341 images taken at regular spacing and clearly demonstrates the problems arising from unfavorable orderings. As a reference model, we reconstructed the scene using VisualSFM and restricted the matches to sequential matching in the original order. Figure 5 shows the development of the model as a function of the timestep. We report the number of clusters before and after global cluster position estimation as well as the number of outlier cameras. A camera is considered to be an outlier if its position error is larger than half the minimum distance between two cameras in the reference model.
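The outlier criterion in the last sentence translates directly into code. The sketch assumes that the estimated camera centers have already been aligned to the coordinate frame of the reference model; the function name and the dictionary-based interface are hypothetical.

```python
import numpy as np


def outlier_cameras(estimated, reference):
    """Return the ids of cameras whose position error exceeds the threshold.

    The threshold is half the minimum distance between any two cameras of
    the reference model. Both inputs map camera ids to 3D centers that are
    assumed to be expressed in the same coordinate frame.
    """
    names = sorted(reference)
    ref = np.array([reference[n] for n in names], dtype=float)
    distances = np.linalg.norm(ref[:, None] - ref[None, :], axis=2)
    np.fill_diagonal(distances, np.inf)
    threshold = 0.5 * distances.min()
    return [
        n for n in names
        if n in estimated
        and np.linalg.norm(np.asarray(estimated[n], float) - np.asarray(reference[n], float)) > threshold
    ]


reference = {"c0": (0.0, 0.0, 0.0), "c1": (1.0, 0.0, 0.0), "c2": (3.0, 0.0, 0.0)}
estimated = {"c0": (0.1, 0.0, 0.0), "c1": (1.0, 0.6, 0.0), "c2": (3.0, 0.0, 0.2)}
outliers = outlier_cameras(estimated, reference)
```

In the example the closest reference cameras are 1 unit apart, so the threshold is 0.5 and only `c1`, which is off by 0.6, counts as an outlier.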

As expected, both methods perform well in the “linear” case, where the images are fed to the pipeline in the original ordering. A single cluster is reconstructed and a maximum of 8 cameras are classified as outliers. In a second experiment we randomly shuffled the order of images. After an initial set of 20 images, the baseline merges all clusters into a single one and, as a result, the number of outlier cameras increases linearly. The final 3D model is highly corrupted and only covers about a third of the temple (Figure 


In contrast, the proposed pipeline creates up to 6 individual but locally consistent clusters. After 84 out of the 341 images the clusters are correctly localized into a single effective cluster by the cluster positioning (dashed line). Figure 1 shows the resulting reconstruction after 100 images as well as the corresponding clustered viewgraph.

Figure 6: Comparison of the proposed progressive pipeline (left, center) to the incremental SfM pipeline (right) on the Stanford dataset.

3.4 Very Unfortunate Image Order

In order to demonstrate the recovery capabilities of our proposed pipeline, we conducted an additional experiment on the TempleOfHeaven dataset. The temple has a strong rotational symmetry which repeats every 60 degrees. One of the worst conditions for an algorithm would be an image ordering with exactly this 60 degree periodicity. To test the algorithm, we created an artificial image sequence by feeding the images alternately in the following order: . Figure 7 shows the number of clusters, images, and outliers in every timestep. Due to the high number of matches between the symmetric parts and the lack of images in between, the clustering algorithm sees enough evidence for putting all images into a single cluster (Figure 4(a)) and roughly every second image is an outlier. After the addition of the 36th image, the connectivity has changed sufficiently so that the two clusters are recognized and the 3D models are separated.

Due to the highly interleaved configuration, the structure of the individual models cannot be recycled in this case and both clusters are reconstructed from scratch. At this stage there are no inter-cluster constraints available yet, which is why the global cluster positioning does not succeed and the models are displayed as independent sub-models (Figure 4(c)). With the 95th image enough inter-cluster evidence is available such that the models can be placed into a common coordinate system (Figure 4(d)), and with the 101st image a single cluster is formed (this time the structure of the individual sub-models can be recycled and only a simple merge is needed). The experiment shows that the proposed pipeline can recover from a wrong local minimum of the reconstruction.

Figure 7: Evolution of the clusters and images in a very unfortunately ordered image sequence. Despite the heavy rotational symmetry of the temple dataset, the pipeline is able to recover from a wrong configuration (timestep 36).

3.5 Real-world Progressive Reconstruction

While the experiments so far demonstrated the capabilities of the progressive pipeline for unfortunate image orderings, its behavior in real-world applications remains to be shown. We therefore collected a series of 4516 images with three different mobile devices (HTC Nexus 9, LG Nexus 5x, and Google Pixel), consisting of 29 sequences on the main quad of Stanford University. The sequences were then progressively reconstructed in an interleaved order (simulating multiple users collaboratively acquiring the pictures) as well as in the batch-processing mode. An additional similar experiment was conducted using the publicly available Quad dataset [4] consisting of 6514 images, mainly taken with an iPhone 3G. Images were shuffled randomly and pushed to the reconstruction algorithms.

Table 1 shows the proposed pipeline compared to the incremental Theia [31] pipeline in batch and progressive modes. In addition we compare against the HSfM [5] pipeline purely in batch-processing mode. The proposed algorithm is the only pipeline that can deal with the progressive scheme, while the others get stuck in a local minimum and only reconstruct a very small subpart of the scene (experiments were repeated using multiple random orderings). In the case of the progressive pipeline, we also report the number of failure recoveries during the whole reconstruction (the number of resets of a local cluster due to major topological changes). Figure 6 illustrates the final result of the proposed pipeline versus the local minimum of the incremental pipeline. The compute times of the proposed pipeline are comparable to those of existing pipelines despite the continuous flow of intermediate results. This is due to the fact that our pipeline practically never operates on the whole image set, but only on the local clusters, which allows significantly faster execution of the bundle adjustment.

We furthermore ran our method on several of the datasets presented by [11]. We used a random ordering of the crowd-sourced data and pushed the images to the progressive pipeline as in the Quad [4] experiment. Figure 8 shows a sample result on the Radcliffe scene. While both HSfM and the incremental baseline method were confused by the duplicate structure, our pipeline successfully separated the front and rear views of Radcliffe. The two clusters were merged into the correct configuration by the global cluster pose optimization. In contrast to [11], this is not done in a post-processing step; erroneous connections are detected on the fly during the reconstruction. Table 1 reports the results on these datasets in the same way as for the previous experiment. Our method was able to separate duplicate structure in all datasets. As the reconstruction backbone of the presented pipeline is based on the well-known incremental [31] and global [34] pipelines, the accuracy of the resulting structure equals the accuracy of these pipelines.

Figure 8: Result on the Radcliffe dataset with HSfM (left), Theia (incr.) (middle) and the proposed method (right).

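| Mode | Pipeline | Metric | Stanford | Quad [4] | BigBen | Radcliffe Camera | Alexander Nevsky Cathedral | Brandenburg Gate | Arc De Triomphe |
|---|---|---|---|---|---|---|---|---|---|
| | total | imgs [#] | 4516 | 6514 | 402 | 282 | 448 | 175 | 434 |
| batch | HSfM [5] | imgs [#] | 2,140 | 4,832 | 375 | 272 | 430 | 173 | 441 |
| batch | HSfM [5] | time [s] | 11,490 | 23,570 | 1,932 | 1,654 | 1,900 | 389 | 2,131 |
| batch | Theia (incr.) [31] | imgs [#] | 3,298 | 5,462 | 394 | 278 | 443 | 173 | 410 |
| batch | Theia (incr.) [31] | time [s] | 34,749 | 158,853 | 2,385 | 1,307 | 3,687 | 1,018 | 3,348 |
| progressive | Theia (incr.) [31] | imgs [#] | 51 | 527 | 394 | 280 | 418 | 173 | 416 |
| progressive | Theia (incr.) [31] | time [s] | 7,605 | 215,064 | 2,026 | 1,473 | 2,268 | 1,141 | 3,516 |
| progressive | proposed | imgs [#] | 3,165 | 3,894 | 285 | 279 | 427 | 173 | 405 |
| progressive | proposed | time [s] | 13,276 | 25,713 | 552 | 1,121 | 2,163 | 627 | 377 |
| progressive | proposed | clusters [#] | 76 | 92 | 5 | 2 | 4 | 2 | 5 |
| progressive | proposed | recoveries [#] | 14 | 29 | 1 | 0 | 5 | 1 | 1 |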

Table 1: Evaluation of our pipeline versus the incremental Theia pipeline and HSfM. Only the proposed pipeline can successfully handle the progressive scenario.
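To make the difference between the modes easier to read, the image counts of Table 1 can be turned into registration ratios. The short sketch below does this for the two large datasets. It assumes that the "imgs [#]" rows count the images registered in the final model. The numbers are copied from Table 1, and the helper name `registration_ratio` is hypothetical.

```python
# Image counts copied from Table 1 (Stanford and Quad columns only).
total = {"Stanford": 4516, "Quad": 6514}
registered = {
    "Theia (incr.), batch": {"Stanford": 3298, "Quad": 5462},
    "Theia (incr.), progressive": {"Stanford": 51, "Quad": 527},
    "proposed, progressive": {"Stanford": 3165, "Quad": 3894},
}


def registration_ratio(n_registered, n_total):
    """Fraction of the input images that ended up in the model."""
    return n_registered / n_total


for pipeline, counts in registered.items():
    for dataset, n in counts.items():
        pct = 100 * registration_ratio(n, total[dataset])
        print(f"{pipeline:28s} {dataset:8s} {pct:5.1f}%")
```

On the Stanford dataset, this gives roughly 70% for the proposed pipeline in progressive mode against roughly 1% for the incremental baseline in the same mode.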

4 Conclusions

We proposed a novel progressive SfM pipeline which addresses a multiuser-centric scenario, where a 3D model is simultaneously reconstructed from multiple image streams handled by a cloud-based reconstruction service. In contrast to existing work, our pipeline does not depend on the image order and does not require any a priori global knowledge about image connectivity. The progressive pipeline avoids taking any binding decisions and is able to recover from erroneous configurations. A global viewgraph is incrementally built and maintained. The graph is clustered based on the local connectivity of cameras, and individual clusters are reconstructed using either an incremental or a global reconstruction pipeline. In the last step, individual models are merged using a lightweight posegraph optimization on the cluster centers only. We demonstrated the effectiveness and efficiency of our pipeline on multiple datasets and compared it to existing solutions.
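The idea of maintaining a viewgraph incrementally and re-deriving clusters from it, rather than committing to earlier groupings, can be sketched as follows. This is a simplified stand-in and not the clustering used in the paper: clusters are taken here to be connected components over edges with enough feature matches. The class name `ViewGraph`, the `min_matches` threshold, and the toy image names are all assumptions made for illustration.

```python
from collections import defaultdict


class ViewGraph:
    """Toy viewgraph: nodes are images, edge weights are match counts."""

    def __init__(self, min_matches=30):
        self.min_matches = min_matches  # hypothetical threshold for a strong edge
        self.adj = defaultdict(dict)

    def add_image(self, image, matches):
        """Add an image and its match counts to already known images."""
        self.adj[image]  # touching the defaultdict registers the node
        for other, n in matches.items():
            if other in self.adj and other != image:
                self.adj[image][other] = n
                self.adj[other][image] = n

    def clusters(self):
        """Recompute clusters from scratch as connected components
        over strong edges, so no earlier grouping is binding."""
        seen, result = set(), []
        for start in self.adj:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                component.add(node)
                for neighbor, n in self.adj[node].items():
                    if n >= self.min_matches and neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
            result.append(component)
        return result


graph = ViewGraph(min_matches=30)
graph.add_image("front_1", {})
graph.add_image("rear_1", {"front_1": 12})   # weak, possibly spurious link
graph.add_image("front_2", {"front_1": 80})
graph.add_image("rear_2", {"rear_1": 95, "front_2": 8})
before = graph.clusters()

# A later image bridges both groups with strong links.
graph.add_image("side_1", {"front_2": 60, "rear_1": 45})
after = graph.clusters()
print(before, after)
```

Because the clusters are derived from the current graph each time, a new image can merge two groups without any earlier decision having to be undone. Each cluster would then be handed to a reconstruction back end, which this sketch does not cover.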


  • [1] S. Agarwal et al. Building rome in a day. ACM, pages 105–112, 2011.
  • [2] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. IEEE Conf. Comput. Vis. Pattern Recognit., (April):2911–2918, 2012.
  • [3] A. Chatterjee and V. M. Govindu. Efficient and robust large-scale rotation averaging. ICCV, pages 521–528, 2013.
  • [4] D. Crandall, A. Owens, N. Snavely, and D. Huttenlocher. SfM with MRFs: Discrete-continuous optimization for large-scale structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(12):2841–2853, December 2013.
  • [5] H. Cui, X. Gao, S. Shen, and Z. Hu. HSfM: Hybrid structure-from-motion. In CVPR, pages 1212–1221, 2017.
  • [6] Z. Cui and P. Tan. Global Structure-from-Motion by Similarity Averaging. In 2015 IEEE Int. Conf. Comput. Vis., pages 864–872, 2015.
  • [7] A. J. Davison. Real-time Simultaneous Localisation and Mapping with a Single Camera. ICCV, 2:1403–1410, 2003.
  • [8] M. Farenzena, A. Fusiello, and R. Gherardi. Structure-and-motion pipeline on a hierarchical cluster tree. 2009 IEEE 12th Int. Conf. Comput. Vis. Work. ICCV Work. 2009, pages 1489–1496, 2009.
  • [9] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • [10] M. Havlena, A. Torii, and T. Pajdla. Efficient Structure from Motion by Graph Optimization. ECCV, pages 1–14, 2010.
  • [11] J. Heinly, E. Dunn, and J. M. Frahm. Correcting for duplicate scene structure in sparse 3D reconstruction. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 8692 LNCS(PART 4):780–795, 2014.
  • [12] J. Heinly, J. L. Schönenberger, E. Dunn, and J.-M. Frahm. Reconstructing the World * in Six Days. CVPR, 2015.
  • [13] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc. Am. A, 4(4):629, 1987.
  • [14] N. Jiang, P. Tan, and L. F. Cheong. Seeing double without confusion: Structure-from-motion in highly ambiguous scenes. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pages 1458–1465, 2012.
  • [15] Z. Kang and G. Medioni. Progressive 3D model acquisition with a commodity hand-held camera. Proc. - 2015 IEEE Winter Conf. Appl. Comput. Vision, WACV 2015, pages 270–277, 2015.
  • [16] L. Kneip, D. Scaramuzza, and R. Siegwart. A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation. CVPR, 2011.
  • [17] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard. G2o: A general framework for graph optimization. In ICRA, pages 3607–3613. IEEE, may 2011.
  • [18] A. Locher, M. Perdoch, and L. Van Gool. Progressive Prioritized Multi-view Stereo. CVPR, pages 3244–3252, 2016.
  • [19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
  • [20] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2007.
  • [21] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. ICCV, pages 3248–3255, 2013.
  • [22] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, Oct 2015.
  • [23] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2007.
  • [24] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. A Universal Framework for Random Sample Consensus : USAC. PAMI, pages 1–17, 2013.
  • [25] R. Roberts, S. N. Sinha, R. Szeliski, and D. Steedly. Structure from motion for scenes with large duplicate structures. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pages 3137–3144, 2011.
  • [26] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. CVPR, 2016.
  • [27] J. L. Schönberger, F. Radenovic̀, O. Chum, and J. M. Frahm. From single image query to detailed 3d reconstruction. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5126–5134, June 2015.
  • [28] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 6554 LNCS(PART 2):267–281, 2012.
  • [29] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. IEEE Int. Conf. Comput. Vis., (Iccv):1470–1477, 2003.
  • [30] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189–210, 2008.
  • [31] C. Sweeney. Theia multiview geometry library: Tutorial & reference.
  • [32] C. Sweeney, V. Fragoso, T. Höllerer, and M. Turk. Large scale sfm with the distributed camera model. In 3D Vision (3DV), pages 230–238. IEEE, 2016.
  • [33] C. Sweeney, T. Sattler, T. Höllerer, M. Turk, and M. Pollefeys. Optimizing the Viewing Graph for Structure-from-Motion. Int. Conf. Comput. Vis., pages 801–809, 2015.
  • [34] K. Wilson and N. Snavely. Robust global translations with 1DSfM. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 8691 LNCS(PART 3):61–75, 2014.
  • [35] C. Wu. Towards linear-time incremental structure from motion. 3DV, pages 127–134, 2013.
  • [36] C. Zach, A. Irschara, and H. Bischof. What can missing correspondences tell us about 3D structure and motion? CVPR, 1, 2008.
  • [37] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating Visual Relations Using Loop Constraints. CVPR, 2010.