1 Introduction
The structure-from-motion pipeline makes it possible to take photos of the same object or scene and not only align and calibrate them, but also reconstruct the observed surface in high detail. Progress in camera sensor development now lets a regular user take photos of up to hundreds of megapixels, a number that has grown rapidly over the past decades. Additionally, thanks to UAVs, affordable quadcopters and automatic flight planners, the number of pictures one can take in a short span of time keeps increasing. Therefore, in the area of photogrammetry, being able to use all of the available data for detailed, noise-free surface reconstruction in an out-of-core fashion is necessary to make highly detailed large-scale reconstruction possible on affordable computers with limited RAM.
We present a surface reconstruction method with strong noise-filtering properties that can take both depth maps and terrestrial LIDAR scans as input. The whole method is implemented in an out-of-core way: the required memory stays low even for very large datasets; we targeted a usage of around 16 GB even for the Copenhagen city dataset with 27472 photos (see Fig. 1). Each processing stage is divided into independent parts to guarantee out-of-core execution, which additionally yields massive parallelism (i.e. the pipeline is cluster-friendly). Computation-heavy stages (the computation of histograms and the iterative numeric scheme) are accelerated on GPUs via the OpenCL API.
2 Related Work
The Poisson surface reconstruction method [18] preserves local geometric detail well by respecting the normals of input point clouds; however, it normally fails to handle scale diversity and often fails to filter noise between the surface and the sensor origins. It should be noted that handling of scale diversity can be added, and noise filtering can be implemented at an early stage as depth map filtering [22]. Nevertheless, while Poisson reconstruction can be implemented in an out-of-core fashion [2], the preceding depth map filtering will likely require a large amount of memory to keep all depth maps relevant to the current one in RAM.
Graph-cut-based reconstruction methods [15], [17], [14], [33] explicitly take into account visibility rays from the sensors' origins to samples in depth maps, and therefore have great noise-filtering properties. Scale diversity is also naturally supported via Delaunay-tetrahedralized space discretization. However, Delaunay tetrahedralization and minimum graph cut estimation on an irregular graph [3], [12] have high memory consumption and are computationally heavy. Because of this, the out-of-core SSR [24] method turns out to be more than an order of magnitude slower than our method.
Local fusion methods [5], [20], [19], including FSSR [9], [10], are well suited for parallelization and scalability [20]. On the other hand, due to their local nature they have weak hole-filling properties and cannot filter strong depth map noise in difficult cases such as the basin of the fountain in the Citywall dataset, as shown in [27].
Photoconsistent mesh refinement-based methods [29], [21] are fast due to GPU acceleration but cannot change the topology of the input mesh, and thus they depend heavily on the quality of the initial model reconstruction.
Total variation minimization-based methods [32], [13], [25] have been shown to have great noise-filtering properties due to visibility constraints, and can easily be GPU-accelerated [31]. Additionally, as shown in [27], [28], a variational minimization scheme can be implemented in a compact and scale-diversity-aware way using a balanced octree, but even with such a compact space representation, peak memory consumption becomes critical for large-scale scene reconstruction.
Our functional formulation follows [25], adapted for 3D space in the way discussed in [32]. We use a 2:1 balanced octree similar to [27] for 3D space representation. In contrast to their method, our framework has strict peak memory guarantees and is much faster thanks to GPU acceleration of the most time-consuming stages (as shown in [27], its bottlenecks were the computation of the histograms (17%) and the energy minimization (80%), so we implemented both on GPU).
3 Algorithm Overview
To begin with, we discuss the chosen functional minimization with a focus on noise robustness, scale-diversity awareness and GPU-friendliness, without yet taking memory requirements into account. Section 4 then shows how to adapt this minimization scheme to an out-of-core setting.
3.1 Distance Fields as Input Data
We prefer to be able to support different kinds of range image data as input, such as:

Depth maps built with stereo methods such as SGM [16] from regular terrestrial photos or aerial UAV photos;

Depth maps from RGB-D cameras, which are essentially the same as the above;

Terrestrial LIDAR point clouds with a known sensor origin.
This means we need to generalize over all these types of data and work with an abstraction of a range image that can be used in the functional formulation. All these data can naturally be formulated, in the way described in [32], as a distance field f, which is equal to 1 on each ray between the camera and a depth map sample and then fades away to -1 right under the surface sample (see Fig. 3). The only difference in our case is that we want to work with scenes of diverse scales, so δ (the width of the relevant near-surface region) and η (the width of the occluded region behind the surface) should be adaptive to the radius r of a material point. Thus, in all our experiments both widths are fixed multiples of r (smaller values lead to holes in thin surfaces, larger values lead to 'bubbliness', i.e. excessive thickness of thin surfaces).
Both depth maps from stereo photos and from RGB-D cameras can naturally be represented this way. The only non-trivial question is how to estimate the material point radius for each pixel of a depth map. For each depth map pixel, we estimate the distance in 3D space to the neighboring pixels and take half of that distance as the sample point's radius r.
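The per-pixel radius estimation above can be sketched as follows. This is an illustrative helper, not the authors' code: the camera model (central principal point, focal lengths `fx`, `fy`) and the choice of the minimum over the two axis-aligned neighbors are assumptions.

```python
import numpy as np

def sample_radii(depth, fx, fy):
    """Estimate a material-point radius for every depth-map pixel as half the
    3D distance to the adjacent samples. `depth` is an HxW array of metric
    depths; fx, fy are focal lengths in pixels."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Back-project pixels to camera-space 3D points (principal point at center).
    pts = np.dstack([(xs - w / 2) * depth / fx,
                     (ys - h / 2) * depth / fy,
                     depth])
    # 3D distance to the right and bottom neighbors, edge-padded at the border.
    dx = np.linalg.norm(pts[:, 1:] - pts[:, :-1], axis=2)
    dy = np.linalg.norm(pts[1:, :] - pts[:-1, :], axis=2)
    dx = np.pad(dx, ((0, 0), (0, 1)), mode='edge')
    dy = np.pad(dy, ((0, 1), (0, 0)), mode='edge')
    return 0.5 * np.minimum(dx, dy)
```

For a fronto-parallel plane at depth z this yields the expected radius z / (2 fx), i.e. half the pixel footprint on the surface.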
To represent terrestrial LIDAR point clouds as range images, we rely on the fact that the structure of such point clouds is very similar to the structure of pictures taken with 360-degree cameras (see Fig. 3). Because of that, we treat them just like depth maps of 360-degree cameras, with the only difference that LIDAR data is nearly noise-free, and thus we can rely on it with more confidence (i.e. with a weaker regularization term); see Section 3.4 below.
3.2 Functional Formulation
In our task, given multiple distance fields f_i, we want to find an indicator field u (where u = 0 corresponds to the reconstructed isosurface, u = 1 to the exterior of the object, and u = -1 to the interior) that closely represents these distance fields. One way to formalize what a good field u would be is to introduce an energy functional: the lower the energy, the better the indicator field.
A total variation (TV) regularization term for u together with an L1 data fidelity term between u and the f_i yields one such energy functional, named TV-L1 [32], defined as:

E(u) = \int_{\Omega} |\nabla u| + \lambda \sum_i |u - f_i| \, dx    (1)
Note that while the TV term prevents the surface from having discontinuities, there is no term that would force regularity of the surface normals so that the reconstruction tends to piecewise polynomial functions of arbitrary order (see details in [25]). Such a term was introduced as part of the energy functional via an additional vector field v in [4], and it was adapted to 2.5D reconstruction in [25]:

E(u, v) = \int_{\Omega} \alpha_1 |\nabla u - v| + \alpha_0 |\varepsilon(v)| + \lambda \sum_i |u - f_i| \, dx    (2)

where \varepsilon denotes the symmetric gradient operator

\varepsilon(v) = \tfrac{1}{2}\left(\nabla v + \nabla v^{T}\right).    (3)
In order to minimize this functional, we use the primal-dual method [25]. Also, as in [32], we run the primal-dual iterations in a coarse-to-fine scheme, with the iterations themselves executed on GPU for faster convergence. We have found that a modest fixed number of iterations is enough for convergence on each level of the scheme.
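To make the primal-dual scheme concrete, here is a minimal 2D sketch of Chambolle-Pock iterations for the TV-L1 functional of Eq. 1 with a single distance field on a regular grid. It is not the authors' GPU/octree implementation; the step sizes and `lam` are illustrative choices.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary (zero at far edges)."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Backward-difference divergence, the negative adjoint of grad."""
    d = np.zeros_like(px)
    d[:, 0] = px[:, 0]
    d[:, 1:] = px[:, 1:] - px[:, :-1]
    d[0, :] += py[0, :]
    d[1:, :] += py[1:, :] - py[:-1, :]
    return d

def tvl1_primal_dual(f, lam=1.0, iters=300):
    """Primal-dual iterations for min_u TV(u) + lam*||u - f||_1 on a 2D grid."""
    u = f.copy()
    ubar = u.copy()
    px = np.zeros_like(f); py = np.zeros_like(f)
    tau = sigma = 1.0 / np.sqrt(8.0)  # valid step sizes since ||grad||^2 <= 8
    for _ in range(iters):
        # Dual ascent, then projection onto the unit ball |p| <= 1.
        gx, gy = grad(ubar)
        px += sigma * gx; py += sigma * gy
        scale = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))
        px /= scale; py /= scale
        # Primal descent, then the shrinkage prox of lam*|u - f|.
        u_prev = u.copy()
        u = u + tau * div(px, py)
        u = np.where(u - f > tau * lam, u - tau * lam,
            np.where(u - f < -tau * lam, u + tau * lam, f))
        ubar = 2.0 * u - u_prev
    return u
```

The shrinkage step is what gives TV-L1 its robustness: isolated outliers in f (here, a single-pixel spike) are removed entirely rather than merely attenuated.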
3.3 Space Discretization
To minimize the functional w.r.t. u, we need to choose a space discretization. A regular grid [32] does not accommodate scale diversity and would potentially lead to high memory consumption. In our approach, we use an adaptive octree. Let R be the radius of the root cube of the octree. For each point sample with center p and radius r, the octree should have a cube containing p. This cube should be at an octree depth d for which the following inequality holds:

R / 2^{d} \le r < R / 2^{d-1}    (4)
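The depth selection rule can be written as a one-liner. A sketch, assuming the cube's natural radius R/2^d is chosen as the largest one not exceeding the sample radius; the exact constant factor in the paper's inequality may differ.

```python
import math

def cube_depth(R, r):
    """Octree depth d for a sample of radius r in a root cube of radius R,
    so that R/2^d <= r < R/2^(d-1) (illustrative rule)."""
    if r >= R:
        return 0
    return max(0, math.ceil(math.log2(R / r)))
```

For example, with R = 8 a sample of radius 1 lands at depth 3 (cube radius 1), and a sample of radius 3 at depth 2 (cube radius 2).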
In the implementation of the primal-dual minimization, we need pointers from each octree cube to its neighboring cubes in order to access the neighbors' primal values and dual variables. In an adaptive octree, a cube can have any number of neighbors due to adaptive subdivision and scene scale diversity. However, we aim to execute the iterative scheme on GPU, meaning that in order to achieve good performance we need to limit the number of neighboring voxels, which in turn limits the number of references stored per voxel. We utilize the approach discussed in [27], which results in a 2:1 balancing of the adaptive octree, so that each cube has at most four neighbors over each face.
As will be discussed later in Section 3.4, it is important to know the average radius of the point samples that led to each octree cube's creation. Therefore, for each octree cube c we also store its density, which we also call the cube's radius r_c, defined as the average radius of its spawning samples S_c:

r_c = \frac{1}{|S_c|} \sum_{i \in S_c} r_i    (5)
3.4 Distance Fields to Histograms
We transform all distance fields into histograms stored in our octree, as in [32]. This allows us to run iterations with compact histograms [31] (thanks to their fixed size per voxel) instead of keeping large, highly overlapping depth maps [32] in memory. The main difference is that we want the algorithm to be aware of scale diversity and to implement the minimization framework in a coarse-to-fine scheme. Because of this, it is impossible to ignore how big or small a voxel's projection in a depth map is: if a projected voxel is large (for example, on the coarser levels) and is covered by many depth map pixels (i.e. intersected by many distance field rays), we need to account for all of them. This problem is very similar to texture aliasing in computer graphics, which can be solved with texture mipmaps [30]. Similarly, we use depth map pyramids built as mipmaps of each depth map. Therefore, when we project a voxel into a depth map, we choose the appropriate level of detail, and only then estimate the histogram bin of the current voxel to which this depth map will contribute; see the listing in Algorithm 1.
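The mipmap-level choice can be sketched as matching the projected voxel footprint to the pixel size of a pyramid level. This is illustrative only; the exact rounding rule and camera model are assumptions, not the paper's specification.

```python
import math

def mip_level(voxel_radius, depth_to_cam, focal_px, n_levels):
    """Pick a depth-map pyramid level whose pixel footprint roughly matches
    the projected voxel size: level k halves resolution k times, so one
    level-k pixel covers 2^k level-0 pixels."""
    # Projected voxel diameter measured in level-0 pixels (pinhole model).
    footprint = 2.0 * voxel_radius * focal_px / depth_to_cam
    level = int(math.floor(math.log2(max(footprint, 1.0))))
    return min(max(level, 0), n_levels - 1)
```

Coarse voxels far from fine-resolution data thus read from high pyramid levels, aggregating many rays in a single lookup instead of iterating over every covered pixel.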
Note that such voxel projection into a depth map pyramid is very unstable and changes heavily with any change of the size of the working region's bounding box. This happens because the natural radius of a voxel in the octree equals R/2^d, where d is the voxel's depth and R is the radius of the octree's root voxel, which depends on the size of the whole working region. To be invariant to the choice of working region size, for each cube we store, in addition to its center, its density value, which is equal to the average radius of the point samples it represents; see Eq. 5.
These details lead to local and stable progressive isosurface refinement from coarse to fine levels; see Fig. 5.
4 OutofCore Adaptation
Our main contribution is an out-of-core adaptation of the minimization scheme on a 2:1 balanced octree. Consequently, we implement each stage of the algorithm in an out-of-core way, where the stages are:

Build a linear octree from all the cubes discussed in Space discretization (Section 3.3);

Balance the octree, so that each cube has a limited number of neighbors;

Build an indexed treetop to run primal-dual iterations independently on each treetop leaf's part;

Accumulate the distance fields' votes into the voxels' histogram bins over the balanced octree (GPU-accelerated);

Coarse-to-fine functional minimization over each part of the balanced octree (GPU-accelerated);

Surface extraction via the marching cubes algorithm.
4.1 From Distance Fields to Octree
For each distance field, we estimate each sample's radius r as half the distance to its neighbors in this distance field. Then for each sample we spawn an octree cube containing the sample at the depth d appropriate to r, as formulated in Eq. 4. All cubes are encoded with 96-bit 3D Morton codes [23] and are saved to a single file per distance field.
Afterwards, we need to merge all these files of cubes (i.e. Morton codes). The Morton codes impose a Z-curve order on the cubes; see a Z-curve example on a balanced 2D quadtree in Fig. 7. Thus, merging the cubes into a single linear octree can be done with an out-of-core k-way merge sort.
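Both building blocks of this stage are simple to sketch: bit interleaving produces the Morton code, and a streaming k-way merge combines the sorted per-field files while holding only one element per file in memory. This sketch uses fewer bits per axis than the paper's 96-bit codes (32 bits per axis) and deduplicates cubes during the merge, which is an assumption about the merge's behavior.

```python
import heapq

def morton3d(x, y, z, bits=21):
    """Interleave the bits of three unsigned integer coordinates into a
    3D Morton (Z-curve) code: bit i of x, y, z goes to bits 3i, 3i+1, 3i+2."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def merge_sorted_runs(runs):
    """Out-of-core-style k-way merge: each run is an already-sorted iterator
    of Morton codes (e.g. streamed from one file per distance field); only
    one element per run is resident at a time. Duplicates are collapsed."""
    last = None
    for code in heapq.merge(*runs):
        if code != last:
            yield code
            last = code
```

`heapq.merge` never materializes its inputs, so replacing the lists below with file-backed generators gives the out-of-core behavior directly.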
4.2 Octree Balancing
To limit the number of neighbors of each cube, we need to balance the obtained octree. A linear octree can be arbitrarily large because it describes the whole scene. Out-of-core octree balancing is described in [26]. Balancing also relies on the Morton code ordering: we only need to load a part of the sorted linear octree, balance that part independently of the others, and save the balanced part to a separate file. Later, we merge all balanced parts, which can be accomplished as in the previous stage of linear octree merging, via an out-of-core k-way merge sort.
4.3 Octree Treetop
At this point, we need a high-level scene representation to be able to compute the histograms and run the minimization iterations over the balanced octree part by part. In fact, such a subdivision into parts makes it possible to split each of the next stages, including the final polygonal surface extraction, into independent parts (i.e. with massive parallelism, which is useful for cluster acceleration).
Let us calculate how many descendants each intermediate cube of the octree has on deeper octree levels. Consider a treetop: a rooted subtree that contains the minimum number of octree cubes in its leaves, with the restriction that each leaf cube contains fewer than a fixed threshold of descendants in the original tree; see Fig. 7. In all experiments we chose this threshold small enough to guarantee that each subsequent step fits in 16 GB of RAM, while still limiting the number of leaves in a treetop to just a couple of thousand even on the largest datasets.
Due to out-of-core constraints, we cannot estimate a global treetop by loading the whole octree into memory. Therefore, we build independent treetops for all linear balanced octree parts, and then merge those treetops into a global one. At this stage, we can easily save indices of all covered relevant cubes for each treetop leaf. Moreover, these indices are consecutive due to the Z-curve ordering of Morton codes (see Fig. 7). Hence, we only need to store two indices with each treetop leaf: the indices of the first and last relevant cubes in the linear balanced octree. This allows us to load the cubes relevant to the current treetop leaf from the balanced octree in an IO-friendly consecutive way. In addition, we have strong guarantees that the number of such cubes is bounded by the treetop threshold.
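Because Z-curve order makes every cube's descendants a contiguous code range, a treetop can be built by binary search alone. A simplified sketch: all cubes live at one fixed depth and the whole sorted code list is in memory, whereas the real pipeline mixes depths and streams parts from disk; `C` plays the role of the descendant threshold above.

```python
import bisect

def build_treetop(codes, C, depth, prefix=0, level=0):
    """Recursively split the root cube until every treetop leaf covers fewer
    than C cubes of the sorted linear octree, returning per-leaf (first, last)
    half-open index ranges. `codes` are sorted Morton codes of cubes at the
    fixed `depth`; a cube at `level` with Morton prefix `prefix` covers all
    codes in [prefix << shift, (prefix + 1) << shift)."""
    shift = 3 * (depth - level)
    lo = bisect.bisect_left(codes, prefix << shift)
    hi = bisect.bisect_left(codes, (prefix + 1) << shift)
    if hi - lo < C or level == depth:
        return [(lo, hi)] if hi > lo else []
    leaves = []
    for octant in range(8):
        leaves += build_treetop(codes, C, depth, (prefix << 3) | octant, level + 1)
    return leaves
```

Each returned range is exactly the pair of first/last indices stored per treetop leaf, enabling sequential reads of the relevant cubes.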
4.4 Computation of the Histograms
Now we have the scene representation provided by the balanced linear octree and its indexed treetop. As the next part of our method, we need to add the votes of all distance fields to all relevant cubes of the octree.
We process all treetop leaves one by one and determine the relevant distance fields for each leaf, which is achieved simply by checking each distance field frustum for intersection with the treetop leaf's cube volume. Then we load the relevant distance fields for each treetop leaf one by one and add their votes to all descendants of the current treetop leaf, as shown in the listing in Algorithm 1.
Note that at any moment during the computation of the histograms, memory holds no more than the treetop threshold of octree cubes and only a single distance field.
4.5 Functional Minimization
Now we need to iteratively minimize the functional from Eq. 2. As shown in [32], a coarse-to-fine scheme is highly beneficial for convergence speed, especially in regions with a lack of data. As we will see in this subsection, the scheme also helps avoid introducing seams between the processed parts.
Suppose that we have already minimized the functional over the whole octree up to depth d. Now we want to execute primal-dual iterations at depth d+1 in an out-of-core way without producing any seams between the parts. As in the previous stage of histogram computation, we process treetop leaves one by one. Let us load a treetop leaf's cubes into a set A and their neighbors into a set B. We can then iterate the numeric scheme of [25] over the cubes of A, with the only difference on the treetop leaf's border: we want the neighbors' indicator values in the cubes of B to be equal to the indicator values of their parent cubes, estimated on the previous level thanks to the coarse-to-fine scheme. I.e., we update the indicator for all cubes inside the current leaf's border (set A) while the indicator of the neighboring cubes outside the border (set B) is frozen; see the two examples in Fig. 9.
By following this routine, at any given time we process only the cubes of one treetop leaf and their neighbors, and thus our memory consumption is bounded by their number. We do not face any misalignment of the surface next to treetop leaf borders, due to the explicit border constraints and the fact that from one level to the next the surface does not move far but just progressively becomes more detailed; see Fig. 5 and the details of histogram computation in Section 3.4.
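The frozen-border update can be illustrated on a grid: only cells inside the leaf (the interior mask) are updated, while border neighbors keep the values upsampled from the coarser level. Plain Laplacian averaging stands in for the primal-dual update here; the masks and iteration count are illustrative assumptions.

```python
import numpy as np

def relax_leaf(u, interior_mask, iters=500):
    """Relax values of one treetop leaf: cells where `interior_mask` is True
    (set A) are updated from their 4-neighborhood average, while cells where
    it is False (set B, the frozen border) never change."""
    u = u.copy()
    for _ in range(iters):
        avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                      np.roll(u, 1, 1) + np.roll(u, -1, 1))
        u[interior_mask] = avg[interior_mask]  # set B stays frozen
    return u
```

Because the border values are pinned, two leaves sharing a border relax toward the same constrained solution along it, which is the mechanism behind the seamless stitching.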
We notice that only few cubes appear on the coarsest levels, meaning that each separate treetop leaf normally contains very few cubes there. Therefore, we find it beneficial for performance to process multiple treetop leaves at once on the coarsest levels (w.r.t. the leaves' total number of cubes on the current level).
4.6 Marching Cubes
As the last part, after estimating the indicator value for all cubes of the octree, we need to extract a polygonal isosurface corresponding to the zero indicator value. For that purpose, we perform the marching cubes algorithm on a per-leaf basis, using the same out-of-core tree partition as before.
Marching cubes in a part of a balanced octree is straightforward: for each octree cube, we extract 3D points between indicator values of different signs (i.e. points of the isosurface corresponding to the zero indicator value), and then build their triangulation via dynamic programming by minimizing the total surface area, similar to [1].
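The isosurface points come from sign changes of the indicator along cube edges. A minimal linear-interpolation helper (not the authors' code) might look like:

```python
def zero_crossing(p0, p1, u0, u1):
    """Linearly interpolate the point where the indicator crosses zero along
    an edge from p0 (indicator u0) to p1 (indicator u1); u0 and u1 must have
    opposite signs."""
    t = u0 / (u0 - u1)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))
```

Since both sides of a treetop-leaf border see identical indicator values, this interpolation yields bit-identical vertices on shared faces, which is what makes the per-leaf extraction seamless.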
Note that neighboring surface parts have seamlessly matching borders, because both parts have the same indicator values across the border due to the stability of the progressive refinement discussed in the previous subsection.
Finally, it is important to note that the number of triangle faces will be extremely large for any large dataset. We therefore follow the per-part marching cubes with QSlim-based [11] decimation, which we modified with a strict border constraint: no border edge (i.e. a triangle edge lying on a treetop leaf's cube face) is ever collapsed. This way we achieve strict guarantees of a seamless surface between neighboring treetop leaves.
Processing stage | Time | Share of total
Linear octree + merge | 30 + 11 min | 11% + 4%
Balance octree + merge | 7 + 11 min | 3% + 4%
Index treetop | 8 min | 3%
Histograms (GPU) | 49 min | 19%
Primal-dual method (GPU) | 88 min | 34%
Marching cubes | 59 min | 22%
5 Results
We evaluated our method on an affordable computer with an 8-core CPU and a GeForce GTX 1080 GPU on five large datasets: Citywall [10] (https://www.gcc.tu-darmstadt.de/home/proj/mve/) and Breisach [27] (https://lmb.informatik.uni-freiburg.de/people/ummenhof/multiscalefusion/), two datasets from previous papers with high scale diversity; Copenhagen [8] (https://download.kortforsyningen.dk/content/skraafoto), large-scale aerial photos of the city (this dataset was additionally evaluated on a small cluster); and Palacio Tschudi [7] (https://doi.org/10.26301/4h297e80) and Tomb of Tu Duc [6] (https://doi.org/10.26301/n06nqa49, 42 noise-free terrestrial LIDAR scans), two large public datasets collected by CyArk and distributed by Open Heritage 3D.
A summary of these datasets is presented in Table 2.
For the photo-based datasets, we executed the structure-from-motion pipeline to estimate depth maps with the SGM [16] method and evaluated our algorithm using these depth maps as input. Note that to speed up depth map estimation, we downscaled the original photos for some datasets; see Table 2. For the Tomb of Tu Duc LIDAR dataset, we converted each input scan into a 360-degree camera's depth map and used histogram votes with an increased weight; see the listing in Algorithm 1. Processing breakdowns for the other datasets, including the Copenhagen city dataset (evaluated twice: on an affordable computer and on a small cluster) with reconstruction results for many different city scenes, are provided in the supplementary material.
We verified that the results are detailed and clean (see Figs. 4, 8, 10, 11) for all datasets, and that our method's peak memory usage was between 10 GB and 17 GB. The comparison with previous work [27], [24], presented in Table 3, shows that our method has significantly lower peak memory usage and is notably faster. To ensure that this speedup did not come at the cost of quality, we compared our results with [27] in Fig. 4 and Fig. 11 (the referenced results were obtained with the authors' software, pointfusion 0.2.0, publicly available at http://lmb.informatik.uni-freiburg.de/people/ummenhof/multiscalefusion; we used the same depth maps for the quality comparison).
Table 2: dataset summary (resolution with downscale factor, inputs, intermediate counts, peak RAM in GB, total time):
Citywall [10] | 2000x1500 (x1) | 564 depth maps | 1205 mil | 404 mil | 135 mil | 15 mil | 13.17 GB | 63 min
Breisach [27] | 2592x1728 (x2) | 2111 depth maps | 2642 mil | 1457 mil | 558 mil | 57 mil | 10.07 GB | 260 min
Tomb of Tu Duc (LIDAR) [6] | 8000x4000 (x1) | 42 LIDAR scans | 661 mil | 1304 mil | 672 mil | 48 mil | 10.05 GB | 160 min
Palacio Tschudi [7] | 1840x1228 (37%), 1500x1000 (63%) (x4) | 13703 depth maps | 16 billion | 6 billion | 3159 mil | 243 mil | 16.75 GB | 1213 min
Copenhagen city [8] | 3368x2168 (26%), 2575x1925 (74%) (x4) | 27472 depth maps | 28 billion | 24 billion | 7490 mil | 267 mil | 13.35 GB | 1758 min
Table 3: comparison of peak memory and total time with GDMR [27] and SSR [24]:
Dataset | Input | GDMR [27] | Ours | SSR [24]
Citywall | 564 depth maps | 75 GB, 19 h | 13.17 GB, 63 min | 32x8.9 GB, 58 h
Breisach | 2111 depth maps | 64 GB, 76 h | 10.07 GB, 260 min | N/A
6 Conclusions
In this work, we presented an out-of-core method for surface reconstruction from depth maps and terrestrial LIDAR scans. Our results show that the out-of-core specifics do not increase the running time; on the contrary, thanks to GPU acceleration our implementation proved to be much faster than previously published results on the datasets we used for testing. We have also shown that the quality of the results is comparable to the in-core reconstruction method GDMR [27]. Note that an out-of-core balanced octree with treetop indexing is a rather general concept that can serve as a framework for other methods, in the same way we used it for the minimization method.
One of the main contributions of our work is an outofcore framework for fast and detailed surface reconstruction.
References
 [1] Jules Bloomenthal. Polygonization of implicit surfaces. Citeseer, 1988.
 [2] Matthew Bolitho, Michael Kazhdan, Randal Burns, and Hugues Hoppe. Multilevel streaming for out-of-core surface reconstruction. pages 69–78, 2007.
 [3] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE transactions on pattern analysis and machine intelligence, 26(9):1124–1137, 2004.
 [4] Kristian Bredies, Karl Kunisch, and Thomas Pock. Total generalized variation. SIAM Journal on Imaging Sciences, 3(3):492–526, 2010.
 [5] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303–312, 1996.
 [6] CyArk. Complex of Hué Monuments - Tomb of Tu Duc, Vietnam, 2019.
 [7] CyArk. Palacio Tschudi - Chan Chan, Peru, 2020.
 [8] Danish Agency for Data Supply and Efficiency. Skraafoto of Copenhagen, 2019.
 [9] Simon Fuhrmann and Michael Goesele. Floating scale surface reconstruction. ACM Transactions on Graphics (ToG), 33(4):1–11, 2014.
 [10] Simon Fuhrmann, Fabian Langguth, and Michael Goesele. MVE: a multi-view reconstruction environment. In GCH, pages 11–18, 2014.
 [11] Michael Garland and Paul S Heckbert. Simplifying surfaces with color and texture using quadric error metrics. In Proceedings Visualization’98 (Cat. No. 98CB36276), pages 263–269. IEEE, 1998.
 [12] Andrew V Goldberg, Sagi Hed, Haim Kaplan, Pushmeet Kohli, Robert E Tarjan, and Renato F Werneck. Faster and more dynamic maximum flow by incremental breadthfirst search. In AlgorithmsESA 2015, pages 619–630. Springer, 2015.
 [13] Gottfried Graber, Thomas Pock, and Horst Bischof. Online 3d reconstruction using convex optimization. pages 708–711, 2011.
 [14] Jiali Han and Shuhan Shen. Scalable point cloud meshing for image-based large-scale 3d modeling. Visual Computing for Industry, Biomedicine, and Art, 2(1):1–9, 2019.
 [15] Vu Hoang Hiep, Renaud Keriven, Patrick Labatut, and Jean-Philippe Pons. Towards high-resolution large-scale multi-view stereo. pages 1430–1437, 2009.
 [16] Heiko Hirschmuller. Stereo processing by semi-global matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2007.
 [17] Michal Jancosek and Tomas Pajdla. Exploiting visibility information in surface reconstruction to preserve weakly supported surfaces. International scholarly research notices, 2014, 2014.
 [18] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. 7, 2006.
 [19] Andreas Kuhn, Heiko Hirschmüller, Daniel Scharstein, and Helmut Mayer. A TV prior for high-quality scalable multi-view stereo reconstruction. International Journal of Computer Vision, 124(1):2–17, 2017.
 [20] Andreas Kuhn and Helmut Mayer. Incremental division of very large point clouds for scalable 3d surface reconstruction. pages 10–18, 2015.
 [21] Shiwei Li, Sing Yu Siu, Tian Fang, and Long Quan. Efficient multi-view surface refinement with adaptive resolution control. pages 349–364, 2016.
 [22] Paul Merrell, Amir Akbarzadeh, Liang Wang, Philippos Mordohai, Jan-Michael Frahm, Ruigang Yang, David Nistér, and Marc Pollefeys. Real-time visibility-based fusion of depth maps. pages 1–8, 2007.
 [23] Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. 1966.
 [24] Christian Mostegel, Rudolf Prettenthaler, Friedrich Fraundorfer, and Horst Bischof. Scalable surface reconstruction from point clouds with extreme scale and density diversity. pages 904–913, 2017.
 [25] Thomas Pock, Lukas Zebedin, and Horst Bischof. TGV-Fusion. pages 245–258, 2011.
 [26] Tiankai Tu and David R. O'Hallaron. Balance refinement of massive linear octrees. 2004.
 [27] Benjamin Ummenhofer and Thomas Brox. Global, dense multiscale reconstruction for a billion points. pages 1341–1349, 2015.
 [28] Benjamin Ummenhofer and Thomas Brox. Global, dense multiscale reconstruction for a billion points. International Journal of Computer Vision, pages 1–13, 2017.
 [29] Hoang-Hiep Vu, Patrick Labatut, Jean-Philippe Pons, and Renaud Keriven. High accuracy and visibility-consistent dense multi-view stereo. IEEE transactions on pattern analysis and machine intelligence, 34(5):889–901, 2011.
 [30] Lance Williams. Pyramidal parametrics. In Proceedings of the 10th annual conference on Computer graphics and interactive techniques, pages 1–11, 1983.
 [31] Christopher Zach. Fast and high quality fusion of depth maps. 1(2), 2008.
 [32] Christopher Zach, Thomas Pock, and Horst Bischof. A globally optimal algorithm for robust TV-L1 range image integration. pages 1–8, 2007.
 [33] Yang Zhou, Shuhan Shen, and Zhanyi Hu. Detail preserved surface reconstruction from point cloud. Sensors, 19(6):1278, 2019.