Log In Sign Up

Real-time CPU-based large-scale 3D mesh reconstruction

In Robotics, especially in this era of autonomous driving, mapping is one key ability of a robot to be able to navigate through an environment, localize on it and analyze its traversability. To allow for real-time execution on constrained hardware, the map usually estimated by feature-based or semi-dense SLAM algorithms is a sparse point cloud; a richer and more complete representation of the environment is desirable. Existing dense mapping algorithms require extensive use of GPU computing and they hardly scale to large environments; incremental algorithms from sparse points still represent an effective solution when light computational effort is needed and big sequences have to be processed in real-time. In this paper we improved and extended the state of the art incremental manifold mesh algorithm proposed in [1] and extended in [2]. While these algorithms do not achieve real-time and they embed points from SLAM or Structure from Motion only when their position is fixed, in this paper we propose the first incremental algorithm able to reconstruct a manifold mesh in real-time through single core CPU processing which is also able to modify the mesh according to 3D points updates from the underlying SLAM algorithm. We tested our algorithm against two state of the art incremental mesh mapping systems on the KITTI dataset, and we showed that, while accuracy is comparable, our approach is able to reach real-time performances thanks to an order of magnitude speed-up.


page 1

page 3

page 6


Efficient moving point handling for incremental 3D manifold reconstruction

As incremental Structure from Motion algorithms become effective, a good...

Incremental Reconstruction of Urban Environments by Edge-Points Delaunay Triangulation

Urban reconstruction from a video captured by a surveying vehicle consti...

Real-time Semantic 3D Dense Occupancy Mapping with Efficient Free Space Representations

A real-time semantic 3D occupancy mapping framework is proposed in this ...

SE-SLAM: Semi-Dense Structured Edge-Based Monocular SLAM

Vision-based Simultaneous Localization And Mapping (VSLAM) is a mature p...

OVPC Mesh: 3D Free-space Representation for Local Ground Vehicle Navigation

This paper presents a novel approach for local 3D environment representa...

I Introduction

Robot navigation and localization is a long studied topic in Robotics. In the last decade, the advancements in Simultaneous Localization and Mapping (SLAM) algorithms, especially since the proposal of the Parallel Tracking and Mapping (PTAM) [3]

paradigm, have led to algorithms results in robot localization and environment mapping. The most preeminent paradigms shown in the literature can be classified in feature-based 

[4, 5], direct [6, 7], and dense algorithms [8, 9].

Feature-based and direct approaches are able to reconstruct a sparse or semi-dense map of large-scale environment on a CPU On the other hand, even if dense algorithms have been proved to be scalable, as in [10], they rely on GPU-computing. However, in many robotics applications, such as autonomous vehicle and cost-effective surveying, computational power is a limited resource and, when available, GPUs are not as powerful as needed by dense mapping algorithms.

To propose a trade-off between accuracy and computation effort and to have a dense output, some works in literature have recently focused on incremental reconstruction of a mesh from sparse feature point extracted by SLAM algorithms [11, 1, 2]. These methods estimate a low resolution model of the environment, represented by a dense mesh, which is informative enough for traversability analysis and path planning. Among these methods, the most advanced enforce the manifold property111A mesh is manifold if each vertex v is regular, i.e., if and only if the edges connecting the vertices opposite to v are homeomorphic to a disk .i.e., they form path without loops and closed (see [12] for more details). which is needed for both online improvement, e.g., in mesh smoothing [13], or offline refinements, e.g., in photometric mesh optimization [14, 15, 16, 17]. This property can be enforced incrementally as in [1] and [2], but this makes the existing incremental algorithms not able to run in real-time.

Fig. 1: Mesh reconstruction in real-time with the proposed algorithm.

Starting from the previous works on incremental mesh reconstruction [18, 2], in this paper we propose the first real-time incremental algorithm that is able to reconstruct a manifold mesh of the scene running on a single CPU core, and leaving other cores available to perform camera tracking and sparse data estimation via classical feature-based SLAM. As a main contribution we improve the computational effort of the methods proposed in [1] and [2], reaching a comparable accuracy (see in Figure 1 an example of result). An open source implementation is available at

Ii Relevant Related Works

Real-time 3D mapping has been addressed with many different approaches. The seminal work in [19] fuses depth maps and it creates an initial 3D rectangle mesh made by two triangles that covers all the depth images, then each triangle is subdivided iteratively where a discontinuity or a non-planarity in the depth map region enclosed in the triangle occurs. The meshes generated for each depth map are finally registered and the redundant triangles are deleted. This approach does not handle very well occlusions and it requires GPU computing. Bandino et al. [20] reconstruct an urban scene by approximating vertical 3D surfaces with adjacent rectangular sticks; as in the previous case, they exploit depth maps computed by means of GPU processing and do not recover a complete model of the scene.

Recently, Vineet et al.  [21] fuse semantic segmentation and stereo data to obtain a robust estimate of the environment model via Conditional Random Fields (CRF) optimization; their work is close to real-time, but it still requires a significant computational effort. Similarly, the authors of [22] have proposed a real-time reconstruction algorithm which fuses dense maps into a Truncated Distance Signed Function (TDSF) represented by an octree data structure; they estimate the surface only where strong evidence has been collected, therefore the output is non continuous and more prone to localization, and path planning errors, e.g., where the ground is missing. Following a similar approach Klingensmith et al. [23] fuses dense maps into a TDSF, but they use RGB-D data not available in outdoor scenarios. In [24] the authors propose an efficient real-time navigation platform that fuses FPGA stereo disparities into an integer voxel-based maps. Even if they build a local map on a CPU, it is very noisy and prone to localization errors and it has been tested in limited real-world scenarios.

A different class of reconstruction algorithms subdivides the space into tetrahedra through Delaunay Triangulation; this avoids unnecessary memory consumption and results in fast and scalable algorithms. Online mesh reconstruction of a small object from sparse data that relies on Delaunay triangulation has first been proposed in [25], however this method recomputes the map at each iteration, it is therefore not suitable for large-scale environments. The first incremental approach was proposed by Lovi et al. [11]; in their work they create a Delaunay Triangulation incrementally and they update the labels of the tetrahedra according to the visibility rays from cameras to points. The mesh extracted by the proposed method is the sharp boundary between free space and occupied tetrahedra. In a latter work, Hoppe et al. [26] have been able to extract the mesh incrementally with a graph cut algorithm, obtaining smoother results but without reaching real-time performance. Teixeira and Chli [27] build a 2D triangulation of image features, and project it in 3D; they reconstruct a 3D mesh that covers the region of the scene currently captured by the camera.

The previous incremental methods are not able to enforce the manifold property and therefore further mesh refinements, as a simple smoothing or the one proposed in [28], would be problematic to obtain. For this reason Litvinov and Lhuiller [1] proposed an algorithm to update incrementally the mesh while keeping the manifold property valid in all the iterations. The growing procedure, in that algorithm, carves the 3D space sometimes resulting in visual artifacts that affect the quality of the final reconstruction. This issue was faced in [29] where the same authors proposed an ad-hoc method to explicitly remove visual artifacts and in [2] where the authors changed the ordering of the carving procedure in order to preemptively remove problematic tetrahedra, i.e., those most likely to result in visual artifacts.

The main issue with all these approaches is that they are not able to reach real-time performance, even if their main goal, beside mapping, is to support a robot while navigating autonomously. Moreover, they are not able to cope with modifications of the 3D points estimates which is the common situation in SLAM algorithms; a first attempt to overcome the latter issue was proposed in [18]

where a simple, but effective, heuristic has been proposed to update the triangulation coherently when moving points already in it. However it is still an approximation and in the long term it may cause drifting of the tetrahedra labeling.

In Table I we summarized the features of the state-of-the-art approaches presented in this Section. Since we are mainly interested in 3D manifold meshes, in the following we focus our comparison against the algorithms proposed in [1] and improved in [2].

px [19] [20] [21] [22] [23] [24] [25] [11] [26] [27] [1] [2] Real-time CPU-only Large-Scale Manifold Mesh Continuous Mesh

TABLE I: State-of-the-art 3D reconstruction algorithms

Iii State of the Art Incremental Manifold Reconstruction

In this section we show a brief overview of the manifold reconstruction algorithm first proposed in [1] and improved in [2] and which is at the basis of our CPU-based real-time approach. It relies on a volumetric representation of the environment and it processes batches of points, cameras, and visibility rays, which are estimated incrementally by a feature-based SLAM algorithm, such as PTAM [3] or ORB-SLAM [30].

The algorithm builds a 3D Delaunay triangulation of the input SLAM points; the reconstructed surface, , partitions the triangulation between the set of Outside tetrahedra, i.e., the manifold subset of the free space (not all free space tetrahedra would be included in this set), and the complementary set of Inside tetrahedra, i.e., the remaining tetrahedra that roughly represent the matter. The notation means the boundary of the set . This manifold is updated as new points and camera positions are estimated. In this process, tetrahedra are associated to a weight and they are considered as free space if . Notice that in [1] this weight is simply the counter of the rays that intersects the tetrahedron while in [2] the weight is computed according to the inverse cone heuristics.


Fig. 2: Example of manifold growing. The red line is the manifold, numbers represent the weights, in yellow are the tetrahedra in the queue , in dark gray the tetrahedra not intersected.

The first manifold at time , named , is estimated in four steps. Point insertion: it adds all the 3D points estimated up to time and computes their 3D Delaunay triangulation. Label initialization: it initializes all the tetrahedra weight to . Ray tracing: for each camera to point ray it adds a weight to the intersected tetrahedra and a weight to their neighbors, this to preemtively remove most visual artifacts (see [2]); the list of free tetrahedra is named . Growing: it initializes a queue with the tetrahedron with the highest weight; then it iterates the following procedure until is empty: (a) remove the tetrahedron with the highest weight from ; (b) add it to only if the resulting is manifold; (c) add to the queue neighboring tetrahedra of that are not already in the queue or in the set (see Figure 2). Finally, to handle non-zero genus structures, it checks, for each vertex in the boundary, if all the incident tetrahedra are free-space and it tries to remove them; if is manifold, i.e., all the vertices in keeps the manifoldness, then , otherwise it keeps unchanged.

(a) (b)
Fig. 3: Naive point insertion: a new point added to the triangulation triggers a new triangulation of a subset of the tetrahedra; it is not trivial to infer the new tetrahedra status (inside/outside) keeping the manifoldness valid without a new ray tracing.

Once the system is initialized, a new set of points is estimated at time , where is the keyframe index, and is the period (in our experiments varies and depends on the SLAM algorithm, however it is usually about 4 frames). In this context with the term keyframe we refer to a batch of points and cameras to be inserted in the Delaunay triangulation. The insertion of each point would cause the removal of a set of the tetrahedra that invalidates the Delaunay property (Figure 3(a)); the surface is not guaranteed to be manifold anymore and any naive update of the triangulation potentially invalidates the manifoldness (Figure 3(b)).

To avoid manifoldness violation, the authors in [1] define a new set of tetrahedra , named enclosing set, (Figure 4(b)) and they apply the so called Shrinking procedure first (Figure 4(c)), i.e., the inverse of Growing; this procedure subtracts iteratively from the tetrahedra keeping the manifoldness valid. After this process, it is likely that ; however, in the case of the point is not added to the triangulation and it is discarded. In case of the point is added into the triangulation with the Point Addition step (Figure 4(d)), and once all points in have been added (or dropped), the algorithm applies Ray tracing (Figure 4(e)), and finally the Growing step (Figure 4(f)) similarly to what is described in the initialization procedure.

(a) (b) (c)
(d) (e) (f)
Fig. 4: Incremental manifold reconstruction steps: (a) a new point is estimated and it would invalidate the red tetrahedra (set ); (b) put a subset of the tetrahedra in ; (c) shrink the manifold; (d) add the point into the triangulation; (e) perform the ray tracing; (f) grow the manifold.

To keep the reconstruction scalable, both [1] and [2] create a grid of fictitious 3D points, named Steiner points, which spans the entire space the map is supposed to occupy. Steiner points partition the space using big cubes and the cube diagonal becomes the maximum diameter of the circumsphere used to verify the Delaunay property of the triangulation; during the estimation of the enclosing set, the cube diagonal defines a bounded set of tetrahedra affected by the point insertion.

Iv Real-Time Mesh Reconstruction

In this section we illustrate the main contribution of this paper, i.e., we describe how we build upon the previous method in order to achieve real-time performances and to extend it to moving point management. We redesigned some steps of the reconstruction pipeline proposed in [1] and [2], mostly by leveraging on the use of Steiner points, which are arbitrary 3D points added to the 3D Delaunay triangulation to bound the maximum dimension of the tetrahedra. Even if some of the methods we propose are designed to optimize [1] and [2], some of the concepts can be easily adapted to improve the efficiency of other Delaunay-based reconstruction algorithms.

Iv-a Incremental Steiner points insertion

The original method initializes the algorithm with a fixed grid of Steiner points, instead, we estimate the Steiner points grid iteratively keeping the manifold property valid and while new parts of the scene are discovered.

Let us define as the distance among the Steiner points in the grid; we keep track of the boundaries of the current grid by saving and . We bootstrap with an initial grid around the first camera pose , then, at each iteration , if the points visible by camera span outside the existing grid, we expand it. More formally, before adding a point in position , we check if , and ; in case all these conditions are verified, the point can be added without extending the grid; otherwise we extend the grid along the appropriate semi-axis, e.g., if , the grid is extended on the positive semiaxis by inserting a layer of Steiner points with coordinate .

Iv-B Shrink what you need

The efficiency of the shrinking procedure is strictly related to the size of the enclosing set . Indeed the set contains the tetrahedra that the algorithm need to shrink to keep the manifold property valid before inserting a set of new points. In the original method of  [1] and [2] the authors fix as the maximum accepted distance between a camera and point, and they define with respect to a camera , as the tetrahedra contained in the sphere centered in with radius

The drawback of this approach is that the enclosing volume is needlessly large, in fact, except for the case of an omnidirectional camera or an omnidirectional laser sensor, the point cloud visible from a camera is not evenly spread in a sphere with radius , but it is mostly clustered on the actual surface of the scene perceived by the sensor. Therefore unnecessarily large number of tetrahedra requires to be shrunk before adding new points, even if they occupy just a small part of the sphere. Shrinking huge regions of the space can produce visual artifacts.

Fig. 5: Efficient enclosing proposed in this paper

Instead, we propose to choose a smaller enclosing volume, in particular, we define as the set of tetrahedra exactly in the neighborhood of the 3D points that have to be added to the triangulation, regardless of the associated cameras. Given a set of new points that have to be inserted, the ideal enclosing volume is the convex hull of , i.e., the yellow region in Figure 5, expanded by a radius of , i.e., the red region in Figure 5.

We propose to approximate this volume by exploiting the Steiner grid. Let be a point contained in , where are the indexes identifying the cell inside the Steiner grid. Since a cell diagonal is at most , the corresponding sphere would span over at most two cells in each direction. The enclosing set induced by is then obtained by selecting all the tetrahedra that have some vertices inside , where . The final enclosing set for the -th camera is .

Iv-C Boundary spatial hashing

In the original method both the shrinking and growing procedure iterate over the boundary of the manifold mesh in order to subtract or add new tetrahedra to the manifold set . The lookup procedure that checks if a free space tetrahedra can be carved without invalidating the manifold property, and keeps the boundary ordered, does not scale properly since it has to consider the entire boundary of the mesh.

To improve on this lookup procedure, we propose to store the boundary tetrahedra into a spatial hash representation based on the Steiner grid cells. We create a hash function such that

is a vector defined for a Steiner cell

; contains all the pointrs to the boundary tetrahedra that intersects . Thanks to the hash function we can quickly and directly retrieve the tetrahedra needed to initialize the shrinking and growing procedures.

Iv-D Next Tetrahedron Caching

The ray tracing described in Section III walks along the triangulation to update the visibility information encoded in the tetrahedra weights. It bootstraps from the ending point (e.g., in Figure 6) and checks which facets of the set of incident tetrahedra intersects the ray; as a consequence, the tetrahedron containing the intersecting facet is selected and its visibility information updated; then ray tracing moves to the tetrahedron adjacent to facet and iterates the process until the selected tetrahedron contains the camera (e.g., in Figure 6). Depending on the number of new points inserted per frame, this step may require a big computational effort.

Fig. 6: Ray tracing from camera to point

In a robotics scenario, especially when a surveying vehicle explores the environment, the camera-to-point viewing rays have often the same direction, e.g., we keep perceiving a point in the front of the vehicle from the same direction while the camera moves forward. By relying on this assumption, we propose a method to speed up the identification of the next traversed tetrahedron on the walk by Next Tetrahedron Caching. While walking on a ray, for each traversed tetrahedron we store the index of the next tetrahedron ; then, when we trace any other ray, if in the current tetrahedron the index has been saved, the cell is checked first. Such process likely chooses the right next tetrahedra first, and avoids to check the intersection between the ray and the the other facets if not needed; moreover, by simply prioritizing the tetrahedra checks this caching mechanism is guaranteed to find the same solution of the original ray tracing procedure.

Iv-E Ray tracing scheduler

The authors in [1] and [2] do not manage moving points, however, as new images are processed, the point positions estimates are updated by the SLAM algorithm. Only in [18] a simple heuristic to manage moving points has been proposed, but it approximates the visibility updates induced by the moving points.

To handle moving points in an exact way we use three procedures, analogously to [11]: Ray tracing to increment the weights according to the Inverse Cone Heuristic proposed in [2]; Ray retracing, which is a particular ray tracing that affects only the new tetrahedra added after point insertion or removal; Ray untracing, which decrements the weights and removes the stored ray. To avoid redundant computations, especially when points positions change, we propose to carefully schedule the ray tracing operations. When we add a point , we schedule the tracing procedure for all the rays ending in and we schedule the ray retracing of each ray intersecting the conflicting tetrahedra, indeed they have been removed and new tetrahedra have been added, therefore they need to be updated. When we remove a point , we schedule ray retracing for each ray intersecting the tetrahedra conflicting with , the point is removed, and the ray untracing procedure is scheduled for each ray that connects a camera to the removed point. When we move a point, we first remove it from the old position and we add it back to the new position.

After we processed a new frame we collect a list of tracing untracing and retracing procedure for each ray . To avoid redundancy, contains at most one occurrence of each procedure. In general we execute the procedures in in this order: ray untracing, ray tracing and ray retracing, subject to the following rules. If contains both ray tracing and ray retracing, only the ray tracing is executed, since retracing is equivalent, but it only affects a subset of the tetrahedra. If contains ray untracing and ray retracing, without ray tracing, the is only untraced. This happens when intersected at least one tetrahedron destroyed by point insertion or removal, and it is then removed.

During the ray tracing or ray retracing procedures the path of the traversed cells is stored for each ray, avoiding recomputing the path from the camera to the point multiple times. This reduces significantly the effort needed to move and remove the points, indeed for each of them all the corresponding rays must be untraced and then traced in the new position.

V Experimental Results

We tested the proposed real-time incremental manifold mesh reconstruction algorithm on the KITTI dataset [31]; in particular we used four stereo sequences of the visual odometry dataset: 00, 01, 02, and 05, we discarded the 03 and 04 since they are relatively short. We executed the experiments on a Intel(R) Core(TM) i7-4770S at 3.10GHz with 8GB of DDR3 RAM with the reconstruction algorithm running on a single core.

We estimated the 3D points, camera poses and visibility rays through ORB-SLAM [30] which runs on the same machine, using three threads, in real-time. Using the ORB-SLAM output we have compared the proposed approach against the two state of the art incremental manifold mesh reconstruction algorithms, i.e., [1] and [2]. In our experiments we discarded the handle and artifacts removal algorithm from [1] since it takes a big computational effort, the number of handles created in these sequences is limited and the smoothing operator mitigates the influence of the artifacts. However some of the artifacts remains especially because of the sparsity of the point cloud estimated with ORB-SLAM. Since the compared algorithms do not manage loop closures, we considered 1000, 1101, 2000 and 1300 frames, respectively from the sequences 00, 01, 02, 05, which do not have any loop closure.

Sequence 00 Sequence 01 Sequence 02 Sequence 05
Fig. 7: Comparisons of run times between our approach and [2] (x-axis number of keyframe, y-axis: timings in seconds).

The proposed algorithm has been able to reconstruct the test sequences in real-time, improving substantially the computational effort with respect to its competitors. As in most mapping algorithms that decouple tracking and mapping [3, 22, 32], we build the map every keyframe, i.e., batch of points and cameras, while the ORB-SLAM algorithm tracks the camera position. In our case, map updates happen at 3-4 Hz while the tracking threads run at the same frame-frequency of the camera (10Hz). Comparing the different steps of the reconstruction in Table II, we improved each of them thanks to the novel algorithms described in Section IV. In the last column of Table II we report also the frequency of our algorithm with respect to the total number of frames; as it it can be noticed, the proposed mapping approach is able to process the frames at a frequency which is 4 times higher than the actual data rate. All techniques used in the novel algorithm have contributed to a substantial improvements, we highlight just that the first tetrahedron proposed by the Next Tetrahedron Caching was actually the correct one the 88% of the cases.

To visualize the improvement of the proposed method, in Figure 7 and Figure 8, we report the time spent for each keyframe extracted by ORB-SLAM. The run time of [2] grows linearly, especially because of the growing process which execution time increases as the dimension of the map increases. Instead, in our algorithm the boundary spatial hashing bounds significantly the computational cost of the growing procedure and the overall computational time results almost constant.

We also tested the proposed approach on the KITTI full sequences, i.e., including parts where a loop closure happens. In this case we cannot handle explicitly the loop closures and when they happen, small mesh misalignments occur; however, this experiment was designed to proof that the proposed method handles non-zero genus surfaces, and that it scales with the whole sequence. In Fig. 11 we illustrate for instance the reconstruction of the complete sequence 00, and we plot the run time in Fig. 9. In Fig. 10 we illustrate the correlation between the timings and the number of tetrahedra contained in the enclosing volume, i.e., those tetrahedra that required to be shrunk. This volume is bounded by the Steiner points, however the peaks that appears in the graphs corresponds to the regions where the trajectory overlaps with a region previously mapped, therefore the number of the tetrahedra contained in the the volume is higher. However the number of enclosed tetrahedra is still .

px Initialize Point Ray Steiner Shrink Insertion Tracing Grow Total freq seq00 [1] 0.07 31.59 4.23 41.49 257.74 335.12 2.98 [2] 0.07 54.68 3.95 39.21 154.18 252.09 3.97 proposed 0.03 3.06 0.82 4.50 4.66 13.07 76.51 seq01 [1] 1.21 76.78 5.12 50.60 1165.37 1299.19 0.85 [2] 1.32 112.08 6.09 56.50 945.81 1121.8 0.98 proposed 0.84 7.23 1.83 7.78 6.6 24.28 45.35 seq02 [1] 0.56 144.60 11.17 116.16 1957.12 2229.61 0.90 [2] 0.38 239.51 13.00 137.89 1936.38 2327.16 0.86 proposed 0.19 13.53 4.16 18.51 11.35 47.74 41.89 seq05 [1] 0.10 84.14 6.50 58.69 480.83 630.26 2.06 [2] 0.07 181.79 8.93 76.42 386.72 653.93 1.99 proposed 0.03 8.46 1.95 8.19 11.99 30.62 42.46

TABLE II: Run times for each step on KITTI sequences: times are expressed in seconds, frequencies in frame/second

px sequence 00 sequence 01 sequence 02 sequence 05

Fig. 8: Timings for each step: on the left [2]; on the right the proposed algorithm (x-axis number of keyframe, y-axis: timings in seconds).

px sequence 00 sequence 01 sequence 02 sequence 05

Fig. 9: Timings for each step from the proposed algorithm on the complete sequences (x-axis number of keyframe, y-axis: timings in seconds).
sequence 00 sequence 01
sequence 02 sequence 05
Fig. 10: Run time (blue) vs number of enclosing sets (orange). The graphs shows the high correlation. (x-axis number of keyframe, y-axis (left): timings in seconds, y-axis(right): number of tetrahedra)
Fig. 11: Top view (without ceiling) of the reconstruction of the 00 sequence of the KITTI dataset. The plot shows that in the correspondence of genus changes the number of enclosing sets increases, therefore the total timing increases
Sequence 00 Sequence 01 Sequence 02 Sequence 05
Fig. 12: Comparisons of run times with and without the moving points.

We tested also the accuracy of the reconstruction by comparing the depth images of the reconstruction rendered in each frame, against the distance of the velodyne points projected on the same image. In this case we use the velodyne data as ground truth for the 3D reconstructed mesh. In addition to real-time performance, our algorithm also results in a more accurate reconstruction with respect both to [2] and [1] (see Table III). This is mainly because the shrink heuristic avoids shrinking big parts of meshes which could remain shrunk after the successive growing process to preserve manifoldness. Overall the experiments show that the ORB-SLAM point cloud leads to similar results with respect to the Harris-based Structure from motion point cloud.

Since both [2] and [1] are not able to manage moving points, in the previous experiments we have integrated the points positions after they have been observed by two keyframes, and then they are not changed anymore. We have also evaluated the performance of our algorithm in presence of moving points, i.e., when ORB-SLAM updates the estimate of a point already in the mesh we change its position in the triangulation and we update its visibility information; even if the presence of moving points more likely causes the shrink procedure to get stuck in local minima and the generation of small visual artifacts, the resulting accuracy is not significantly diminished and still remains comparable with respect to both [2] and [1].

In Fig. 12 we show how the moving points handling affects the run times. The points which moved from their first insertion in the triangulation correspond to the 9%, 3%, 7% and 8% respectively in the 00, 01, 02, 05 sequences. In Table IV we reported the complete average per keyframe statistics. Since these points are usually very close to the vertex we need to modify, the computational overhead is marginal: indeed, Figure 12 shows very similar timings and variations are only to be ascribed to experiment-specific latencies.

seq00 seq01 seq02 seq05
[1] (w/o moving points) 1.32 0.79 0.70 1.48
[2] (w/o moving points) 1.13 1.22 0.72 1.50
proposed (w/o moving points) 0.91 0.72 0.62 1.14
proposed (w/ moving points) 1.05 1.14 0.74 1.21
TABLE III: Mean Absolute Error (MAE) of the reconstruction for the KITTI sequences errors in meters
seq00 seq01 seq02 seq05
# moving points 12 28 10 10
# new points 128 72 125 12
# tetrahedra shrinked 1171 739 12334 2100
# Steiner cells shrinked 26 29 350 27
# points not added 0 1 0 0
# traced rays 492 2188 454 443
# re-traced rays 112 696 97 155
# untraced rays 53 110 44 46
TABLE IV: Reconstruction average per-keyframe statistics with moving points

Vi Conclusion and Future Works

We proposed the first real-time incremental manifold reconstruction algorithm that runs on a single CPU and can handle large scale scenarios; the proposed algorithm is able to speed up the run time of state of the art algorithms significantly and, at the same time, it also improves on the accuracy of the reconstruction. We were able to obtain such results by redesigning few of the classical manifold reconstruction steps proposing a novel shrinking method and a novel ray tracing approach which leverage on hashing and caching strategies. In contrast to existing algorithm our algorithm is also able to manage moving points without any approximate heuristics and with negligible overheads.

As a possible future development we plan to manage loop closures and update the map shape accordingly, which means handling also the change the mesh genus when needed. We are also working on improving the accuracy of the reconstruction by incorporating shape prior, such as planes, through constrained Delaunay triangulation, still preserving real-time execution.


This work has been supported by the POLISOCIAL Grant “Maps for Easy Paths (MEP)” and the “Cloud4Drones” project founded by EIT Digital. We thank Nvidia who has kindly supported our research through the Hardware Grant Program.


  • [1] V. Litvinov and M. Lhuillier, “Incremental solid modeling from sparse and omnidirectional structure-from-motion data,” in BMVC, 2013.
  • [2] A. Romanoni and M. Matteucci, “Incremental reconstruction of urban environments by edge-points delaunay triangulation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2015, pp. 4473–4479.
  • [3] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on.   IEEE, 2007, pp. 225–234.
  • [4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 6, pp. 1052–1067, 2007.
  • [5] H. Strasdat, A. J. Davison, J. Montiel, and K. Konolige, “Double window optimisation for constant time visual slam,” in Computer Vision (ICCV), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2352–2359.
  • [6] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in Computer Vision–ECCV 2014.   Springer, 2014, pp. 834–849.
  • [7] J. Engel, J. Stückler, and D. Cremers, “Large-scale direct slam with stereo cameras,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on.   IEEE, 2015, pp. 1935–1942.
  • [8] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in real-time,” in Computer Vision (ICCV), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2320–2327.
  • [9] O. Kähler, V. A. Prisacariu, and D. W. Murray, “Real-time large-scale dense 3d reconstruction with loop closure,” in European Conference on Computer Vision.   Springer, 2016, pp. 500–516.
  • [10] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald, “Kintinuous: Spatially extended kinectfusion,” 2012.
  • [11] D. I. Lovi, N. Birkbeck, D. Cobzas, and M. Jagersand, “Incremental free-space carving for real-time 3d reconstruction,” in Fifth International Symposium on 3D Data Processing Visualization and Transmission(3DPVT), 2010.
  • [12] M. Lhuillier, “2-manifold tests for 3d delaunay triangulation-based surface reconstruction,” Journal of Mathematical Imaging and Vision, vol. 51, no. 1, pp. 98–105, 2015.
  • [13] G. Taubin, “A signal processing approach to fair surface design,” in Proceedings of the 22nd annual conference on Computer graphics and interactive techniques.   ACM, 1995, pp. 351–358.
  • [14] H. H. Vu, P. Labatut, J.-P. Pons, and R. Keriven, “High accuracy and visibility-consistent dense multiview stereo,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 5, pp. 889–901, 2012.
  • [15] A. Romanoni, A. Delaunoy, M. Pollefeys, and M. Matteucci, “Automatic 3d reconstruction of manifold meshes via delaunay triangulation and mesh sweeping,” in Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2016.
  • [16] A. Romanoni, D. Fiorenti, and M. Matteucci, “Mesh-based 3d textured urban mapping,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on.   IEEE, 2017.
  • [17] A. Romanoni, M. Ciccone, F. Visin, and M. Matteucci, “Multi-view stereo with single-view semantic mesh refinement,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 706–715.
  • [18] A. Romanoni and M. Matteucci, “Efficient moving point handling for incremental 3d manifold reconstruction,” in International Conference on Image Analysis and Processing (ICIAP).   Springer, 2015, pp. 489–499.
  • [19] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, et al., “Detailed real-time urban 3d reconstruction from video,” International Journal of Computer Vision, vol. 78, no. 2-3, pp. 143–167, 2008.
  • [20] H. Badino, U. Franke, and D. Pfeiffer, “The stixel world-a compact medium level representation of the 3d-world.” in DAGM-Symposium.   Springer, 2009, pp. 51–60.
  • [21] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Perez, and P. H. S. Torr, “Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction,” in IEEE International Conference on Robotics and Automation (ICRA), 2015.
  • [22] T. Schops, T. Sattler, C. Hane, and M. Pollefeys, “3d modeling on the go: Interactive 3d reconstruction of large-scale scenes on mobile devices,” in 3D Vision (3DV), 2015 International Conference on.   IEEE, 2015, pp. 291–299.
  • [23] M. Klingensmith, I. Dryanovski, S. Srinivasa, and J. Xiao, “Chisel: Real time large scale 3d reconstruction onboard a mobile device using spatially hashed signed distance fields.” in Robotics: Science and Systems, vol. 4, 2015.
  • [24] S. Xu, D. Honegger, M. Pollefeys, and L. Heng, “Real-time 3d navigation for autonomous vision-guided mavs,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on.   IEEE, 2015, pp. 53–59.
  • [25] Q. Pan, G. Reitmayr, and T. Drummond, “Proforma: Probabilistic feature-based on-line rapid model acquisition.” in BMVC, 2009, pp. 1–11.
  • [26] C. Hoppe, M. Klopschitz, M. Donoser, and H. Bischof, “Incremental surface extraction from sparse structure-from-motion point clouds,” in BMVC, 2013, pp. 94–1.
  • [27] L. Teixeira and M. Chli, “Real-time mesh-based scene estimation for aerial inspection,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on.   IEEE, 2016, pp. 4863–4869.
  • [28] Z. Li, K. Wang, W. Zuo, D. Meng, and L. Zhang, “Detail-preserving and content-aware variational multi-view stereo reconstruction,” arXiv preprint arXiv:1505.00389, 2015.
  • [29] V. Litvinov and M. Lhuillier, “Incremental solid modeling from sparse structure-from-motion data with improved visual artifacts removal,” in

    International Conference on Pattern Recognition (ICPR)

    , 2014.
  • [30] R. Mur-Artal, J. Montiel, and J. D. Tardós, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [31] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
  • [32] T. Schöps, T. Sattler, C. Häne, and M. Pollefeys, “Large-scale outdoor 3d reconstruction on a mobile device,” Computer Vision and Image Understanding, vol. 157, pp. 151–166, 2017.