Fully automatic structure from motion with a spline-based environment representation

by   Zhirui Wang, et al.

While the common environment representation in structure from motion is given by a sparse point cloud, the community has also investigated the use of lines to better enforce the inherent regularities in man-made surroundings. Following the potential of this idea, the present paper introduces a more flexible higher-order extension of points that provides a general model for structural edges in the environment, no matter if straight or curved. Our model relies on linked Bézier curves, the geometric intuition of which provides great benefits during parameter initialization and regularization. We present the first fully automatic pipeline that is able to generate spline-based representations without any human supervision. Besides a full graphical formulation of the problem, we introduce both geometric and photometric cues as well as higher-level concepts such as overall curve visibility and viewing angle restrictions to automatically manage the correspondences in the graph. Results show that curve-based structure from motion with splines is able to outperform state-of-the-art sparse feature-based methods, as well as to model curved edges in the environment.




1 Introduction

Reconstructing a 3D model from multiple planar projections is a classical inverse problem with a long-standing history in computer vision. The solution typically involves three steps. The first one is given by the extraction of stable features from the images, the second one by the establishment of correspondences between features in different images, and the third one by the exploitation of incidence relations that permit the recovery of camera poses and 3D structure. We call a stable feature any point in the image that can be reliably extracted from different viewpoints while always pointing at the exact same point in 3D. Ignoring complex cases such as occlusions and apparent contours, these are all the image points for which a local extremum in the first-order derivative can be observed. This, together with the intuition behind line drawings, has led to the initial belief that the most useful features in an image are simply all the edges. However, later research has shown that point correspondences are not only much easier to establish, but also easier to use as part of incidence relations from which even closed-form solutions to camera resectioning and direct relative orientation can be derived [26, 16]. Point-based paradigms are therefore the dominating solution to the structure from motion problem.

A fundamental interest in edge-based structure from motion however remains. Edges provide more data, and thus have the potential to yield higher accuracy. A common way to exploit the potential of edges while still enabling comfortable matching of features and exploitation of incidence relations is given by relying on straight lines. Many works have proven the feasibility of purely line-based structure from motion [3], or even hybrid architectures that rely simultaneously on points and lines [29, 36, 23]. The latter, in particular, have successfully demonstrated superior accuracy in comparison to purely point-based representations. The problem with lines is that they do not represent a general model for describing the 3D location of imaged edges; they are limited to specific, primarily man-made environments in which straight lines are abundant, either in the form of occlusion boundaries, or in the form of appearance boundaries in the texture.

More general approaches in which 3D information for curved edges is recovered have recently been demonstrated in the online, incremental visual localization and mapping community. Works such as [9] and [22] notably reconstruct semi-dense depth maps for all image edges. The depth estimates are updated and propagated from frame to frame. While certainly very successful in terms of an economic generation of enhanced map information, the representation is only local and therefore highly redundant in nature. A unique, global representation of the environment optimized jointly over all observations is not provided. Ignoring works that only focus on small-scale object reconstruction [7], the same holds for fully dense approaches that estimate depth over the entire image [25].

Inspired by [27], we present an incremental spline-based structure-from-motion pipeline that provides a unique, global and general higher-order model for edges in the environment (straight or curved). In particular, our contributions over the literature are:

  • The first complete framework that automatically extracts, matches, initializes, and optimizes a purely curve-based environment representation without any human intervention. The optimization does not involve lifting, and hence remains fast on standard architectures.

  • Novel photometric and geometric criteria for verifying correspondences. In particular, our method uses color images, and the photometric error is evaluated in the HSV space. The quality of the correspondences is further reinforced by checking the consistency of edge identities and viewing directions.

  • A successful use of linked Bézier curves for representing 3D edges. The straightforward geometric meaning of Bézier spline parameters provides benefits during both initialization and regularization of the structure.

The paper is organized as follows. After a summary of further related work, Section 2 provides further details about the employed spatial parametrization. Section 3 then presents the core of our contribution, a fully automatic strategy for optimizing the curve-based representation. Section 4 finally concludes with our experimental results, which confirm that structure from motion based on a higher-order model is able to outperform point-based implementations.

1.1 Further related work

Curve-based geometric incidence relations have long intrigued the structure-from-motion community. For example, rather than solving the relative pose problem from point correspondences [26, 16], early works such as [28] and later on [12] and [20] looked into the possibility of using curves and surface tangents to solve the stereo calibration problem. However, the presented constraints for solving geometric computer vision problems are easily influenced by noise, and not practically useful. In order to improve the quality of curve-based structure from motion, further works therefore looked at special types of curves such as straight lines and cones, respectively [11, 19].

Our primary interest is the solution of structure-from-motion over many frames and observations. Point-based solutions are very mature from both theoretical [15] and practical perspectives [1]. However, point-based representations are somewhat unsatisfying as they simply do not present a complete, visually appealing result. It is therefore natural that the structure-from-motion community has been striving for higher-level formulations that are able to return fully dense surface estimates [7]. Fully dense estimation is however very computationally demanding, which is why a compromise in the form of line-based representations has also received significant attention from the community [3, 30]. Lines are however not a general model for representing the environment: they fail in environments where the majority of edges are curved. [17, 18, 34, 33] provide a solution to this problem by introducing curve-based representations of the environment, such as sub-division curves, non-rational B-splines, and implicit representations via 3D probability distributions. However, they do not exploit the edge measurements to also improve the quality of the pose estimates, as they do not optimize the curves and poses in a joint optimization process.

Full bundle adjustment over general curve models and camera poses was first shown in [4]. The approach however suffers from a bias that occurs when the model is only partially observed. [27] discusses this problem in detail, and presents a lifting approach that transparently handles missing data. [10] solves the problem by modeling curves as a set of shorter line segments, and [6] models the occlusions explicitly. While [27] is the most related to our approach, the lifted formulation is computationally demanding, and the work does not discuss the fully automatic establishment of a correspondence graph that would enable fully automatic incremental structure-from-motion. Our main contribution is an efficient, fully automatic solution to this problem, thus enabling automatic curve-based structure from motion in larger-scale environments.

Further related work can be found in the online visual SLAM community, which equally aimed at finding an efficient, general compromise between point-based [21, 24] and dense [25] formulations. While [29, 36, 23] have again looked at using lines (primarily through a combination with points), recent works such as [8, 9, 32, 22] have also successfully realized visual SLAM pipelines based on general edge features by estimating depth along all edges that can be identified in the images (e.g. by applying Canny-edge extraction [5]). While these methods do represent fully automatic pipelines, their results do not rely on a global representation of curves, and consequently fail to jointly optimize over poses and structure.

2 Bézier splines as a higher-order curve model

In contrast to the traditional map representation, which uses points, we aim at using Bézier splines and thus reach a more complete, higher-order representation of the environment. In this section, we give a brief review of Bézier curves as well as the linked Bézier spline parametrization. We will conclude with an exposition of the basic registration cost for aligning a Bézier curve with a set of pixels in an image.

2.1 A short review of Bézier splines

A Bézier spline is a continuous curve expressed as a function of control points $\mathbf{P}_k$ and a continuous curve parameter $t$. Every point on the curve can be obtained by referring to a unique value of $t$. The general definition of a Bézier spline is

$$\mathbf{B}(t) = \sum_{k=0}^{n} b_{k,n}(t)\,\mathbf{P}_k, \qquad (1)$$

where $b_{k,n}(t) = \binom{n}{k}\,t^{k}(1-t)^{n-k}$ and $t \in [0,1]$. $\mathbf{P}_k$ is the $k$-th control point of the Bézier curve, and $n$ is the degree of the curve. In our work, we use cubic Bézier splines for representing curves as they are a powerful representation allowing for independent spatial gradients at the beginning and the end of the curve. Cubic splines employ four control points $\mathbf{P}_0, \ldots, \mathbf{P}_3$, hence $n = 3$, and

$$\mathbf{B}(t) = (1-t)^3\,\mathbf{P}_0 + 3(1-t)^2 t\,\mathbf{P}_1 + 3(1-t)\,t^2\,\mathbf{P}_2 + t^3\,\mathbf{P}_3. \qquad (2)$$

Besides compactness, the choice of Bézier splines is motivated by their simple geometric meaning: the first and last control points are simply the beginning and ending point of the curve, while the directions from the first control point to the second and from the third control point to the fourth are equal to the local gradient direction at the beginning and the end of the curve. As we will see, this clear geometric meaning facilitates initialization and regularization of parameters. Furthermore, Bézier curves are invariant with respect to affine transformations, which means that the transformation of a Bézier curve between different coordinate frames is done by simply applying the same transformation to its control points. Finally, Bézier splines provide implicit smoothness and a scale-invariant density of points along the curve by simply adjusting the step size of $t$.
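The general definition and its cubic special case can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library; the function names `bernstein`, `bezier_point`, and `sample_curve` are hypothetical and not taken from the paper's implementation.

```python
from math import comb


def bernstein(k, n, t):
    """Bernstein basis polynomial b_{k,n}(t) = C(n, k) * t^k * (1 - t)^(n - k)."""
    return comb(n, k) * t**k * (1.0 - t) ** (n - k)


def bezier_point(control_points, t):
    """Evaluate a Bezier curve of degree len(control_points) - 1 at t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    return tuple(
        sum(bernstein(k, n, t) * control_points[k][d] for k in range(n + 1))
        for d in range(dim)
    )


def sample_curve(control_points, num_samples):
    """Sample the curve at num_samples homogeneously spaced values of t."""
    return [
        bezier_point(control_points, j / (num_samples - 1))
        for j in range(num_samples)
    ]
```

With four control points the function evaluates a cubic segment; changing `num_samples` changes the density of points along the curve without touching the curve itself.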

2.2 Smooth polybéziers

A single cubic Bézier spline is unable to fit arbitrarily complex contours. We overcome this problem by borrowing a simple idea from computer graphics: composite, piece-wise Bézier curves (i.e. so-called polybéziers). As indicated in Figure 1(a), polybéziers separate a contour into multiple segments where every segment is represented by a single Bézier spline of limited order. Composite splines stay continuous by simply sharing the ending point of a segment with the starting point of the subsequent segment. In order to maintain smoothness, we furthermore make sure that the spatial gradient at the end of one segment coincides with the spatial gradient at the beginning of the next.

Imposing these constraints is done by making the control points a function of latent variables that are shared among neighbouring segments. On the one hand, continuity requires the first control point of one segment to be equal to the last control point of the previous segment. Let $\mathbf{P}_0^i, \mathbf{P}_1^i, \mathbf{P}_2^i, \mathbf{P}_3^i$ be the control points of segment $i$. Furthermore, let the first control point of segment $i$ also be denoted by $\mathbf{Q}_i$. We obtain $\mathbf{P}_0^i = \mathbf{Q}_i$ and $\mathbf{P}_3^i = \mathbf{Q}_{i+1}$. On the other hand, sharing the gradient is made possible by explicitly introducing the local direction at the beginning of each segment, denoted $\mathbf{n}_i$. $\mathbf{n}_i$ is a 3-vector constrained to unit norm, which means it is a spatial direction with only two degrees of freedom. Since the second control point by definition lies in the direction of the local gradient at the first control point, it may be parametrised as $\mathbf{P}_1^i = \mathbf{Q}_i + \alpha_i\,\mathbf{n}_i$, where the scalar $\alpha_i$ is the distance from the link point. In order to guarantee smoothness, the third control point of segment $i$ in turn becomes a function of the gradient at the beginning of the next segment, i.e. $\mathbf{P}_2^i = \mathbf{Q}_{i+1} - \beta_i\,\mathbf{n}_{i+1}$.

In summary, an arbitrary curve is given by a sequence of parameters $\mathbf{Q}_i$, $\mathbf{n}_i$, and $(\alpha_i, \beta_i)$. By sharing parameters between neighboring segments, a 3D composite Bézier curve is represented as a sequence of cubic Bézier segments $\mathbf{B}_i(t)$, where

$$\mathbf{B}_i(t) = (1-t)^3\,\mathbf{Q}_i + 3(1-t)^2 t\,(\mathbf{Q}_i + \alpha_i\,\mathbf{n}_i) + 3(1-t)\,t^2\,(\mathbf{Q}_{i+1} - \beta_i\,\mathbf{n}_{i+1}) + t^3\,\mathbf{Q}_{i+1}.$$

We parametrize the unit direction vectors minimally by making them a function of two rotation angles, which avoids the addition of side-constraints during optimization. However, in order to facilitate optimisation and avoid gimbal lock, the parametrisation is local about an initial direction expressed by a pre-rotation, i.e. the direction generated by the two angles is rotated by this pre-rotation.
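The way shared latent variables produce the four control points of each segment can be sketched as follows. This is an illustrative sketch only: the argument names (`link_points`, `directions`, `distances`) and the representation of the two per-segment distances as a pair are assumptions made for the example, not the paper's data layout.

```python
def segment_control_points(q_start, q_end, n_start, n_end, dist_out, dist_in):
    """Control points of one cubic segment from shared link points and directions.

    q_start, q_end : link points shared with the neighbouring segments
    n_start, n_end : unit curve directions at those link points
    dist_out       : distance of the 2nd control point from q_start (assumed name)
    dist_in        : distance of the 3rd control point from q_end (assumed name)
    """
    p0 = q_start
    p1 = tuple(q + dist_out * n for q, n in zip(q_start, n_start))
    p2 = tuple(q - dist_in * n for q, n in zip(q_end, n_end))
    p3 = q_end
    return [p0, p1, p2, p3]


def polybezier_control_points(link_points, directions, distances):
    """Expand M+1 link points, M+1 directions and M distance pairs into M segments."""
    segments = []
    for i, (dist_out, dist_in) in enumerate(distances):
        segments.append(
            segment_control_points(
                link_points[i], link_points[i + 1],
                directions[i], directions[i + 1],
                dist_out, dist_in,
            )
        )
    return segments
```

Because neighbouring segments read the same link point and the same direction, continuity and tangent continuity hold by construction and need no explicit constraint in the optimiser.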

2.3 Initialization of Bézier splines

Figure 1: Left: Two segments of a smooth 3D Bézier curve. Some of the optimization variables (the control points $\mathbf{Q}_i$ at the links and the gradient directions $\mathbf{n}_i$) are shared among adjacent segments. Right: Epipolar matching with 1D patches. The three best local minima are subsequently disambiguated by 2D patch matching. The colors in the right frame indicate the depth of pixels and thus show an example result of semi-dense matching.

As we will see in Section 3, our method relies on a sparse technique to first initialize the poses of all frames in a video sequence. We proceed by extracting Canny-edges [5] in each image and grouping them into curves based on simple connectivity and thresholding of local curvature. We then initialize the depth of each pixel on an edge by using a variant of the semi-dense epipolar tracking method presented in [8]. For each pixel within each group, we perform the following steps to recover the depth:

  • Find a good reference frame for stereo matching by considering the length of the baseline and the parallelism between the epipolar direction and the local image gradient.

  • Extract the epipolar line in the reference frame.

  • Perform a 1D search for photometric consistency along the epipolar line by comparing 1D image patches.

  • Short-list the three best local minima.

  • For each local minimum, perform a 2D patch comparison to find the best.

  • Minimise the photometric error via sub-pixel refinement of the disparity along the epipolar line.

  • Take a robust average among all the depths recovered within a local window.
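The 1D search and the short-listing of local minima in the steps above can be sketched as follows. The sketch assumes that intensities have already been sampled along the epipolar line and uses a sum of squared differences as the photometric cost; both choices, and all function names, are assumptions for illustration rather than details confirmed by the paper.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally long 1D patches."""
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b))


def cost_profile(ref_patch, line_intensities):
    """Slide the 1D reference patch along intensities sampled on the epipolar line."""
    width = len(ref_patch)
    return [
        ssd(ref_patch, line_intensities[start:start + width])
        for start in range(len(line_intensities) - width + 1)
    ]


def best_local_minima(costs, keep=3):
    """Return the positions of the `keep` lowest local minima of a 1D cost profile.

    A position counts as a local minimum if its cost is not larger than the
    cost of either neighbour (plateaus therefore yield several candidates).
    """
    minima = []
    for i, cost in enumerate(costs):
        left = costs[i - 1] if i > 0 else float("inf")
        right = costs[i + 1] if i + 1 < len(costs) else float("inf")
        if cost <= left and cost <= right:
            minima.append(i)
    return sorted(minima, key=lambda i: costs[i])[:keep]
```

Each short-listed position would then be re-scored with a 2D patch comparison before sub-pixel refinement, as described in the list.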

The procedure is visualised in Figure 1(b). Knowing camera poses as well as semi-dense depth maps in each frame, it now becomes possible to initialize the 3D position of points along the curve expressed in world coordinates:

$$\mathbf{X} = \mathbf{R}\,\big(d\,\mathbf{K}^{-1}\,\tilde{\mathbf{x}}\big) + \mathbf{t},$$

where $\tilde{\mathbf{x}} = [u, v, 1]^T$ is the homogeneous pixel location, $\mathbf{R}$ and $\mathbf{t}$ are the rotation and translation transforming a point from the camera to the world coordinate frame, $\mathbf{K}$ is the matrix of intrinsic camera parameters, and $d$ is the depth of the point along the principal axis. We transform all points for which a depth has been recovered into 3D, and furthermore separate the contours into segments such that each segment has roughly the same number of pixels. We furthermore assume that each segment can now be correctly represented by a cubic Bézier spline. We finally estimate the control points of each segment by counting the number of pixels composing the segment, sampling an equal number of values for $t$ distributed homogeneously within the interval $[0, 1]$, and assuming that the Bézier spline evaluated at those continuous curve parameter values leads exactly to the hypothesised world point. Let us assume that there are $N$ world points $\mathbf{X}_1, \ldots, \mathbf{X}_N$. Under the above assumption, the curve parameter for each one of the points becomes

$$t_j = \frac{j-1}{N-1}, \qquad j = 1, \ldots, N.$$

Using (2), we can then find the control points by constructing and solving the linear problem

$$\begin{bmatrix} (1-t_1)^3 & 3(1-t_1)^2 t_1 & 3(1-t_1)\,t_1^2 & t_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ (1-t_N)^3 & 3(1-t_N)^2 t_N & 3(1-t_N)\,t_N^2 & t_N^3 \end{bmatrix} \begin{bmatrix} \mathbf{P}_0^T \\ \mathbf{P}_1^T \\ \mathbf{P}_2^T \\ \mathbf{P}_3^T \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^T \\ \vdots \\ \mathbf{X}_N^T \end{bmatrix}.$$

Note that the left-hand matrix only depends on a discrete number of homogeneously sampled values for $t$, and therefore can be computed upfront. To conclude the initialisation, the Bézier segments are grouped into curves following the same order as the original segments extracted in the image. To enforce continuity and smoothness in the initialised curve, the first control point of each segment is simply replaced by the last control point of the previous segment, and the direction at the link point and the distances from the link point are initialised from the fitted control points of the adjoining segments.
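The control-point fit described above is an ordinary linear least-squares problem. The following is a minimal sketch that solves it through the normal equations with a small hand-written Gauss-Jordan solver so that it runs with the standard library only; the paper does not state which solver is used, and the function names are hypothetical.

```python
def bernstein_row(t):
    """One row of the design matrix: the four cubic Bernstein weights at t."""
    s = 1.0 - t
    return [s**3, 3.0 * s * s * t, 3.0 * s * t * t, t**3]


def solve_linear(a, b):
    """Solve a @ x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_cubic_bezier(points):
    """Least-squares control points for points assumed homogeneously spaced in t."""
    count = len(points)
    dim = len(points[0])
    rows = [bernstein_row(j / (count - 1)) for j in range(count)]
    # Normal equations (A^T A) P = A^T X; A depends only on the number of points.
    ata = [[sum(r[i] * r[k] for r in rows) for k in range(4)] for i in range(4)]
    atx = [
        [sum(r[i] * p[d] for r, p in zip(rows, points)) for d in range(dim)]
        for i in range(4)
    ]
    return solve_linear(ata, atx)
```

Since the design matrix depends only on the number of points in a segment, it (or its pseudo-inverse) could be cached per segment length, which is the precomputation the text refers to.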

3 Fully automatic, spline-based structure from motion

This section explains the structure of our curve-based structure-from-motion problem, which can notably be formulated as a graph optimisation problem. The first part of the section assumes that the structure of this graph is already initialised, and in turn focusses on how the individual registration costs as well as the overall bundle adjustment are computed. The second part of the section then presents the detailed flow-chart of our incremental structure-from-motion, and in particular provides all the details on the automatic management of the graph structure (i.e. correspondences between segments and frames).

3.1 Curve-based structure-from-motion as a graphical optimisation problem

The overall factor graph of our optimisation problem is illustrated in Figure 2. Our map is given as a set of composite Bézier curves, each one being composed of one or multiple cubic Bézier splines, which we call here segments. However, as explained in the previous section, in order to ensure continuity and smoothness in the curves, we do not optimise the control points of the segments directly. The control points of the segments are dependent on latent variables, which potentially are even shared across neighbouring segments. With respect to Figure 2, these parameters are simply called Bézier spline parameters. In order to prevent the latter from collapsing or drifting off into unobservable directions, we have our first cost terms added to the graph, which are regularisation constraints on the Bézier spline parameters. We then sample points from each segment, and reproject them into frames for which a correspondence has been established. A second type of cost term occurs here in the form of our 3D-to-2D curve registration loss. The correspondence management will be discussed later; here we focus on the cost terms and the actual optimisation of the graph.

Figure 2: Factor graph of our optimisation problem. Curves are composed of Bézier segments, which in turn are sampled to return 3D points. The latter are reprojected into frames if a correspondence between this segment and that frame exists. The curve parameters are not directly optimised, but depend on latent variables some of which are shared among neighbouring segments (bordering control points, as well as curve directions in those points).

Let us define the vector $\boldsymbol{\theta}_i$ as the vector of parameters defining the Bézier spline $i$, which may hence be written as a function $\mathbf{B}(t; \boldsymbol{\theta}_i)$. It is clear that many of the parameters are shared among different segments, but we ignore this here for the sake of a simplified notation. We assume to have $M$ splines. Let us furthermore assume that we have $F$ camera poses and that the pose of each camera $j$ is parametrized by the 6-vector $\boldsymbol{\xi}_j$ that expresses a local change with respect to the original pose $\mathbf{T}_j$. Let $\pi(\mathbf{X}, \boldsymbol{\xi}_j)$ be a function that transforms a point $\mathbf{X}$ from the world frame into the image plane of a camera at position $\boldsymbol{\xi}_j$. The function assumes and uses known camera intrinsic parameters. Let $\mathcal{E}_j$ denote all pixels along an edge in keyframe $j$, and $\eta_j(\mathbf{x})$ a function that returns the nearest pixel on an edge (i.e. within $\mathcal{E}_j$) to a reprojected image location $\mathbf{x}$. The final objective of the global optimisation is given by

$$\min_{\{\boldsymbol{\theta}_i\},\,\{\boldsymbol{\xi}_j\}} \; \sum_{i=1}^{M} \sum_{j=1}^{F} v_{ij} \sum_{k=0}^{S} \rho\Big( \mathbf{g}_j\big(\eta_j(\mathbf{x}_{ijk})\big)^T \big(\mathbf{x}_{ijk} - \eta_j(\mathbf{x}_{ijk})\big) \Big) + \lambda_1 \sum_{i=1}^{M} E_{\mathrm{len}}(\boldsymbol{\theta}_i) + \lambda_2 \sum_{i=1}^{M} E_{\mathrm{curv}}(\boldsymbol{\theta}_i), \qquad (7)$$

with $\mathbf{x}_{ijk} = \pi\big(\mathbf{B}(t_k; \boldsymbol{\theta}_i), \boldsymbol{\xi}_j\big)$ and $t_k = k/S$.

$v_{ij}$ is an indicator function that equals $1$ if the segment $i$ is visible in frame $j$, and $0$ otherwise. Index $k$ runs from $0$ to $S$, causing a homogeneous sampling of 3D points along the segment through the continuous curve parameters $t_k$. $\mathbf{x}_{ijk}$ results as the $k$-th 3D point sampled along the spline $i$ and reprojected into the view $j$. The first term as a result denotes the geometric registration cost given as the sum of disparities between reprojected 3D points and their nearest neighbours in $\mathcal{E}_j$. A projection onto the local gradient direction $\mathbf{g}_j(\cdot)$ is added in order to facilitate optimisation in the sliding situation. Note that, as discussed in [35], this step also helps to efficiently overcome the bias discussed in [27] without having to employ the more expensive technique of variable lifting. To conclude, a robust norm $\rho(\cdot)$ such as the Huber norm is added to account for outliers and missing data.

Besides the residuals in the form of curve alignment errors, we additionally have the regularisation costs $E_{\mathrm{len}}$ and $E_{\mathrm{curv}}$, which are weighted in using the trade-off parameters $\lambda_1$ and $\lambda_2$. The regularization terms are added to the cost function in order to prevent convergence into wrong local minima. First, $E_{\mathrm{len}}$ enforces the length of each segment (i.e. the distance between the first and the last control point) to be consistent with its original value after initialization. The term introduces a penalty in the situation where the algorithm aims at collapsing a segment such that the entire segment would match to a single pixel, and thus return a very low registration cost. Second, because curves are composed of relatively short segments, we want to prevent single segments from presenting too high curvature. $E_{\mathrm{curv}}$ penalises high curvature by making sure that the unit curve directions $\mathbf{n}_i$ and $\mathbf{n}_{i+1}$ at the first and the last control points $\mathbf{Q}_i$ and $\mathbf{Q}_{i+1}$ of a segment do not deviate too much from the vector between those two points. $E_{\mathrm{curv}}$ is particularly helpful to prevent uncontrolled curve behaviours in a situation where depth is badly observable.

Initial poses are given from a sparse initialisation, and the initial values for the Bézier splines are obtained using the procedure outlined in Section 2.3 (further details about graph management and initialisation are given in the following section). After initialisation, we solve the bundle adjustment problem (7) using an off-the-shelf nonlinear implementation of Levenberg-Marquardt [2]. The latter iteratively performs local linearizations of the residual and regularisation terms in order to gradually update the pose and Bézier spline parameters.
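A single registration residual, consisting of a disparity projected onto the local gradient direction and passed through a robust norm, can be sketched as below. The Huber threshold `delta` and the function names are assumptions for illustration; the paper only states that a robust norm such as the Huber norm is used.

```python
def huber(residual, delta=1.0):
    """Huber norm: quadratic near zero, linear beyond |residual| = delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def projected_residual(reprojected, nearest, gradient_dir):
    """Disparity to the nearest edge pixel, projected onto the unit gradient there.

    Motion of the reprojected point along the edge (perpendicular to the
    gradient) leaves this residual unchanged, which is the sliding situation.
    """
    dx = reprojected[0] - nearest[0]
    dy = reprojected[1] - nearest[1]
    return gradient_dir[0] * dx + gradient_dir[1] * dy


def registration_cost(samples, delta=1.0):
    """Sum of robustified residuals over (reprojected, nearest, gradient) triples."""
    return sum(
        huber(projected_residual(x, nn, g), delta) for x, nn, g in samples
    )
```

In a real solver the raw projected residuals would be handed to the least-squares back end together with a loss function, rather than robustified by hand as in this sketch.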

3.2 Efficient residual error computation

Figure 3: Example nearest neighbour field.

One of the more expensive parts of the computation is given by the nearest-neighbour look-up. Inspired by [35], we employ a simple solution to speed up the optimisation by pre-computing a nearest-neighbour look-up field that indicates the nearest pixel on an edge for any pixel in the entire image. An example is given in Figure 3. The extraction of the nearest-neighbour field is accelerated by limiting it to pixels which are at most 15 pixels away from an edge.
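A truncated nearest-neighbour field can be pre-computed with a multi-source breadth-first search from all edge pixels. The sketch below is an assumption about how such a field could be built: it uses 8-connected steps, so "nearest" is measured in chessboard distance rather than Euclidean distance, and the paper does not specify which construction is used. The function name is hypothetical.

```python
from collections import deque


def nearest_edge_field(edge_mask, max_dist=15):
    """Map every pixel within max_dist steps of an edge to a nearest edge pixel.

    edge_mask is a list of rows of booleans. The result has the same shape and
    holds (row, col) of a nearest edge pixel, or None beyond max_dist.
    """
    height, width = len(edge_mask), len(edge_mask[0])
    nearest = [[None] * width for _ in range(height)]
    dist = [[None] * width for _ in range(height)]
    queue = deque()
    for r in range(height):
        for c in range(width):
            if edge_mask[r][c]:
                nearest[r][c] = (r, c)
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        if dist[r][c] == max_dist:
            continue  # do not grow the field beyond the truncation radius
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width and dist[rr][cc] is None:
                    dist[rr][cc] = dist[r][c] + 1
                    nearest[rr][cc] = nearest[r][c]
                    queue.append((rr, cc))
    return nearest
```

During optimisation the look-up for a reprojected point then reduces to rounding its coordinates and reading one entry of the field.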

3.3 Overall flow-chart

The input of the system is simply a sequence of RGB images from a calibrated camera. Before we initialize the Bézier map, we run ORB-SLAM [24] to obtain an initial guess for the camera positions. ORB-SLAM is a sparse feature-based simultaneous localization and mapping system which can provide an accurate guess of the camera position in real time. With the initial camera positions in hand, we then incrementally parse our frames and initialise new Bézier splines. Each time a new keyframe is added, we first establish the correspondence with existing segments before adding potentially new segments. The initialisation of splines from a single frame uses the strategy proposed in Section 2.3. A flow-chart of the overall system is shown in Figure 4.

Figure 4: Overall flow-chart of our Bézier spline-based structure from motion framework including the initialisation from a sparse point-based method.

The Bézier splines are grouped into two distinct maps, one global map that stores well observed splines and a temporary map that stores new spline initialisations. All Bézier splines are initially put into the temporary map and then moved to the global map once sufficient observations are available. This delayed initialisation scheme helps to robustify the optimisation, as bundle adjustment uses only the well-observed splines in the global map. In order to prevent the addition of redundant representations, the establishment of correspondences in new keyframes (outlined in Section 3.4) needs to first consider the segments in the global map before moving on to the temporary map.

New segments are added to the temporary map whenever a sufficiently large group of connected pixels has not been registered with existing splines, and the semi-dense depth measurement for those pixels succeeded. The segment is added to an existing curve if the seed group of pixels is smoothly connected to the pixels of an already existing curve. To prevent the algorithm from losing too many correspondences in difficult passages, newly initialized Bézier splines may also be added directly into the global map to keep the tracking of subsequent frames alive. The addition of a new keyframe is concluded by local bundle adjustment over all recently observed frames and landmarks in the global map. Segments with fewer than three observations in keyframes will not be considered for updating the pose of the cameras. We alternately fix the parameters of the Bézier splines or the camera poses and optimize the other. After all keyframes have been loaded, we perform global bundle adjustment over all frames and splines.
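The bookkeeping behind the two maps can be sketched with a small class. The class name, its methods, and the promotion threshold `promote_after` are assumptions introduced for the example (the text only speaks of "sufficient observations"); the minimum of three observations for pose updates is taken from the text.

```python
class SplineMaps:
    """Temporary and global maps storing, per segment, the observing keyframes."""

    def __init__(self, promote_after=3):
        self.promote_after = promote_after  # assumed promotion threshold
        self.temporary = {}   # segment id -> set of keyframe ids
        self.global_map = {}  # segment id -> set of keyframe ids

    def add_segment(self, segment_id, keyframe_id, force_global=False):
        """Register a new segment; force_global keeps tracking alive if needed."""
        target = self.global_map if force_global else self.temporary
        target[segment_id] = {keyframe_id}

    def add_observation(self, segment_id, keyframe_id):
        """Record an observation and promote well-observed temporary segments."""
        if segment_id in self.global_map:
            self.global_map[segment_id].add(keyframe_id)
            return
        observations = self.temporary[segment_id]
        observations.add(keyframe_id)
        if len(observations) >= self.promote_after:
            self.global_map[segment_id] = self.temporary.pop(segment_id)

    def segments_for_pose_update(self, min_observations=3):
        """Global segments with enough observations to constrain camera poses."""
        return sorted(
            seg for seg, obs in self.global_map.items()
            if len(obs) >= min_observations
        )
```

A segment that was forced into the global map with a single observation stays there for tracking but is excluded from pose updates until it has been observed in three keyframes.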

3.4 Correspondence establishment

In this section, we are going to explain how we establish and manage the correspondences between segments and key-frames. Correspondences are verified based on four criteria:

  • Spatial distance: We require the initial geometric registration cost to be small enough. Points from a spline are required to consistently reproject near a set of connected edge pixels in the image. We evaluate the average reprojection error. Only splines with an average error lower than a given threshold will be considered as an inlier correspondence.

  • Appearance-based error: We store the average color of a segment in its original observation. We do not only consider the pixels forming the edge itself, but an isotropically enlarged region around each segment. We again set a threshold on the difference in appearance for determining inliers.

  • Viewing direction: Since the appearance in the neighbourhood of edges can depend heavily on the viewing direction (in particular for occlusion boundaries), we add a limitation on the range of possible viewing directions. We assume the viewing direction of an observation to be the vector from the camera center to the center of a segment, transformed into the world frame. A correspondence is no longer established if the current viewing direction deviates by more than sixty degrees from the average viewing direction of all previous observations.

  • Depth check: We check the relative depth of each reprojected segment, and discard segments with negative depth.

We further add correspondence pruning based on weak overall curve observations. For curves where more than fifty percent of all segments have not been observed in the three most recently added keyframes, the entire curve will be disabled. A disabled curve will no longer be used in local bundle adjustment until it is reactivated by sufficient new observations in a new frame.
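The viewing-direction check, the depth check, and the curve-disabling rule can be sketched as simple predicates. The sixty-degree limit, the fifty-percent fraction, and the three-keyframe window come from the text; the function names and the representation of observations by integer keyframe indices are assumptions.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def viewing_direction_ok(camera_center, segment_center, mean_direction,
                         max_angle_deg=60.0):
    """Accept if the current viewing ray stays close to the average past ray."""
    view = tuple(s - c for s, c in zip(segment_center, camera_center))
    return angle_deg(view, mean_direction) <= max_angle_deg


def depth_ok(depth_in_camera):
    """Reject segments that reproject with negative depth."""
    return depth_in_camera > 0.0


def curve_disabled(last_seen, newest_keyframe, window=3, max_unseen_fraction=0.5):
    """Disable a curve if too many of its segments were missed recently.

    last_seen holds, per segment, the index of the latest keyframe observing it.
    A segment is unseen if that index lies before the `window` newest keyframes.
    """
    unseen = sum(1 for k in last_seen if k <= newest_keyframe - window)
    return unseen > max_unseen_fraction * len(last_seen)
```

The spatial and appearance criteria would be evaluated in the same way, each returning a boolean that must hold before a correspondence is accepted.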

The average color error between a segment in its original image and the closest pixels retrieved via the nearest-neighbor field is evaluated as follows. We represent the color in HSV format, thus obtaining the average hue, saturation, and value (brightness) of each segment. Operating in the HSV color space can be more robust to illumination changes and to differences caused by viewing direction changes. However, due to its definition, the hue value becomes unobservable at zero saturation. Errors in H therefore need to be scaled by the average saturation. The error between two segments is finally computed from the differences between these three averages, with the hue difference scaled by the average saturation.
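One possible form of such a saturation-scaled HSV error is sketched below. The exact combination of the three channels used in the paper is not recoverable from the text, so the sum of absolute differences here is purely an assumption, as are the function names. Channels are assumed to lie in [0, 1], with hue treated as circular.

```python
import colorsys


def hue_difference(hue_a, hue_b):
    """Circular difference between two hues in [0, 1); result lies in [0, 0.5]."""
    diff = abs(hue_a - hue_b) % 1.0
    return min(diff, 1.0 - diff)


def hsv_error(mean_a, mean_b):
    """Assumed appearance error between two average HSV triples.

    The hue term is scaled by the average saturation so that it vanishes for
    grey regions, where hue is unobservable.
    """
    hue_a, sat_a, val_a = mean_a
    hue_b, sat_b, val_b = mean_b
    saturation_scale = 0.5 * (sat_a + sat_b)
    return (
        saturation_scale * hue_difference(hue_a, hue_b)
        + abs(sat_a - sat_b)
        + abs(val_a - val_b)
    )


# Example: convert average RGB colours (in [0, 1]) to HSV with the stdlib.
red_hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
cyan_hsv = colorsys.rgb_to_hsv(0.0, 1.0, 1.0)
```

Two grey segments with arbitrary hue values yield zero error under this form, while fully saturated opposite hues yield the maximum hue contribution.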


4 Experimental evaluation

The algorithm is implemented in C++ and depends on OpenCV and the Google Ceres Solver [2] for solving our curve-based bundle adjustment. We evaluate the pipeline on several indoor and outdoor images from open-source benchmark sequences [31, 14, 13]. Although some of them are RGB-D datasets, we only use the RGB channels in our pipeline. For initialization, we precompute the camera poses using ORB-SLAM [24], and also compare our solution against its global optimization result. We evaluate our result in terms of both the accuracy of the camera trajectory and the quality of the environment mapping.

4.1 Evaluation on Simulation Datasets

Before we test our pipeline on a larger scale dataset, we perform a dedicated experiment where the performance of edge-based registration is analysed and compared against a state-of-the-art point-based solution in different environments. Each test is generated by taking an image of the real-world, assuming it to be a planar environment, and then generating novel views by assuming a circular orbit on top of the plane (new views can easily be generated by homography warping). This allows us to work with features from the real world, but at the same time focuses the experiment on the actual accuracy of the estimation (i.e. issues related to the correspondence establishment are not taken into account). It furthermore allows us to explore a larger variety of environments while each time maintaining information about the ground-truth trajectory.

Image type          Level 1    Level 2    Level 3    Level 4    Level 5
Logos (m)           0.016215   X          X          X          X
Indoor imgs (m)     0.000703   0.000738   0.000691   X          X
Outdoor imgs (m)    0.001634   0.000962   X          X          X

Table 1: Average position error of ORB-SLAM on different types of images for the five noise levels ('X' means failure).

Image type          Level 1    Level 2    Level 3    Level 4    Level 5
Logos (m)           0.000415   0.000312   0.000458   0.0035     0.0111
Indoor imgs (m)     0.000533   0.000533   0.000459   0.0014     0.0046
Outdoor imgs (m)    0.000885   0.0015     0.00153    0.0042     0.0068
Overall (m)         0.000611   0.000798   0.000812   0.003      0.0075

Table 2: Average position error of our method on different types of images for the five noise levels.

Figure 5: The first row shows example images for each type of experiment. The second row shows the obtained distributions of position errors between the estimated result and ground truth for varying noise levels. For each pattern, we evaluate 5 different noise levels.

Three types of images are explored: logos, indoor environments (images taken from [31]), and outdoor environments (images taken from [13]). We analyze three images for each category, each time adding a varying amount of Gaussian noise to each individual image. Noise is added as a random per-pixel intensity disturbance of increasing magnitude. A visualization of some patterns, together with the distributions of the obtained position errors for the different noise levels, is shown in Figure 5.

As can be observed in Tables 1 and 2, ORB-SLAM quickly fails as noise is added to the images, especially for the logo patterns which do not contain much texture. In contrast, the proposed curve-based optimization shows that, if a reasonably good initialization for poses is available, a high level of accuracy and robustness can be achieved for all analysed images.

4.2 Evaluation on a full 3D dataset

We evaluate the complete algorithm on the living room sequence of the ICL-NUIM benchmark [14]. This synthetic dataset provides realistic images of a camera exploring an indoor environment. Ground-truth information for both the poses and the 3D model is available, thus permitting the evaluation of both the accuracy of the motion and the quality of the reconstruction. As we use only the RGB channels of the dataset, the recovered scale of the estimation is in fact arbitrary. To properly evaluate the output trajectory, we therefore perform a 7 DoF alignment between the estimated trajectory and ground truth (i.e. a similarity transformation that identifies rotation, translation, and scale for an optimal alignment). The trajectory error is simply the distance between the aligned recovered position and ground truth.
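A 7 DoF alignment of this kind can be sketched with the well-known closed-form least-squares solution for similarity transformations (often attributed to Umeyama). The sketch below is one common way to perform such an alignment and is not necessarily the exact procedure used in our evaluation; the function names and the synthetic trajectory are assumptions made for illustration.

```python
import numpy as np

def align_similarity(est, gt):
    """Least-squares scale s, rotation R, translation t with gt ~ s * R @ est + t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / est.shape[0]                 # cross-covariance of the centred points
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                           # enforce a proper rotation (det = +1)
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / est.shape[0]        # variance of the estimated positions
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def trajectory_errors(est, gt, s, R, t):
    """Per-frame distance between aligned estimated positions and ground truth."""
    aligned = s * est @ R.T + t
    return np.linalg.norm(aligned - gt, axis=1)

# Synthetic check: distort a random trajectory by a known similarity and recover it.
rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
angle = 0.7
R0 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
s0, t0 = 2.5, np.array([0.5, -1.0, 2.0])
est = (gt - t0) @ R0 / s0                        # so that gt = s0 * R0 @ est + t0
s, R, t = align_similarity(est, gt)
errors = trajectory_errors(est, gt, s, R, t)
```

Because the scale is estimated jointly with rotation and translation, a monocular trajectory with arbitrary scale can be compared directly against metric ground truth.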

We run the pipeline ten times and compute the average RMSE, mean, and median of the trajectory error for both ORB-SLAM and our Bézier-spline based optimization. The results are indicated in Table 3. As can be observed, ORB-SLAM has a lower RMSE while the Bézier splines achieve a smaller median error. While this indicates generally better accuracy, the inferior performance in terms of RMSE is attributed to a few occasions in the dataset where only few contours are observed, thus leading to no substantial improvement in the optimized pose. To further explore the potential of curve-based optimization, we also analysed the quality obtained if more curves are initialized by additionally using the available depth channel (note, however, that depth is only used to initialize the curves; it does not constrain them during bundle adjustment). The result is indicated in the last row of Table 3. It shows that, if sufficient curves can be initialized, curve-based optimization is able to outperform state-of-the-art point-based approaches in terms of mean and median error.

Algorithm RMSE (cm) mean (cm) median (cm)
ORB (Mono) 3.82 3.41 3.02
Bézier 4.45 3.68 2.96
Bézier (RGBD) 4.11 3.07 2.37
Table 3: Average errors for ORB-SLAM [24] and our Bézier-spline based optimization.
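The interplay between the statistics reported in Table 3 can be illustrated with a small sketch. The two error sequences below are made-up numbers (in cm) and are not measurements from our experiments; they merely show how a handful of poorly constrained frames can raise the RMSE even though the median error is lower.

```python
import numpy as np

def error_statistics(errors):
    """RMSE, mean, and median of a sequence of per-frame trajectory errors."""
    errors = np.asarray(errors, dtype=np.float64)
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mean": float(np.mean(errors)),
        "median": float(np.median(errors)),
    }

# Hypothetical method A: uniformly moderate error on every frame.
steady = np.full(100, 3.0)
# Hypothetical method B: lower error on most frames, large error on a few frames.
mostly_better = np.concatenate([np.full(95, 2.5), np.full(5, 15.0)])

stats_a = error_statistics(steady)
stats_b = error_statistics(mostly_better)
```

Since the RMSE squares each error before averaging, it is dominated by the few large deviations, whereas the median reflects the typical frame.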

Since the ICL-NUIM dataset also provides 3D models of the environment, we can also visualize the quality of the mapping by overlaying some of our estimated curves onto the ground-truth CAD model of the environment. As illustrated in Figure 6, the curves align well with real-world edges, and thus provide a visually appealing, more meaningful representation of the environment than sparse point-based approaches.

Figure 6: Mapping results on the ICL-NUIM living room sequence [14].

5 Discussion

Our main novelty is a fully automatic structure from motion pipeline in which a general, higher-order curve model has been successfully embedded into global bundle adjustment. This result stands in contrast with prior semi-dense visual SLAM pipelines, which alternate between tracking and mapping, and are thus unable to provide a globally consistent, jointly optimised result that explores all correlations between poses and structure. We employ polybéziers, the geometric intuition of which proves greatly beneficial during initialization and regularisation. Our work furthermore illustrates the importance of managing the correspondences between segments and frames, and the resulting graphical form of the optimisation problem. Introducing such correspondences also enables us to avoid the more computationally demanding data-to-model registration paradigm. We present an evaluation on several synthetically generated datasets simulating the appearance of different environments and application scenarios. We demonstrate that it is indeed possible to improve on the accuracy provided by purely sparse methods, and to return visually expressive, complete semi-dense models that are jointly optimized over all frames.

References


  • Agarwal et al. [2009] S. Agarwal, N. Snavely, I. Simon, S.M. Seitz, and R. Szeliski. Building Rome in a day. In Proceedings of the International Conference on Computer Vision (ICCV), pages 72–79, 2009.
  • Agarwal et al. [2010] Sameer Agarwal, Keir Mierle, and Others. Ceres solver. http://ceres-solver.org, 2010.
  • Bartoli and Sturm [2005] A. Bartoli and P. Sturm. Structure from motion using lines: Representation, triangulation and bundle adjustment. Computer Vision and Image Understanding (CVIU), 100(3):416–441, 2005.
  • Berthilsson et al. [2001] R. Berthilsson, K. Astrom, and A. Heyden. Reconstruction of general curves, using factorization and bundle adjustment. International Journal of Computer Vision (IJCV), 41(3):171–182, 2001.
  • Canny [1986] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
  • Cashman and Fitzgibbon [2013] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(1):232–244, 2013.
  • Delaunoy and Pollefeys [2014] A. Delaunoy and M. Pollefeys. Photometric Bundle Adjustment for Dense Multi-View 3D Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Engel et al. [2013] Jakob Engel, Jürgen Sturm, and Daniel Cremers. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1456, 2013.
  • Engel et al. [2014] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
  • Fabbri and Kimia [2010] R. Fabbri and B. Kimia. 3D curve sketch: Flexible curve-based stereo reconstruction and calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • Faugeras and Mourrain [1995] O. Faugeras and B. Mourrain. On the geometry and algebra of the point and line correspondences between n images. In Proceedings of the International Conference on Computer Vision (ICCV), 1995.
  • Feldmar et al. [1995] J. Feldmar, F. Betting, and N. Ayache. 3D-2D projective registration of free-form curves and surfaces. In Proceedings of the International Conference on Computer Vision (ICCV), 1995.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Handa et al. [2014] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014.
  • Hartley and Zisserman [2004] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, NY, USA, second edition, 2004.
  • Hartley [1997] Richard I Hartley. In defense of the eight-point algorithm. IEEE Transactions on pattern analysis and machine intelligence, 19(6):580–593, 1997.
  • Kaess et al. [2004] M. Kaess, R. Zboinski, and F. Dellaert. MCMC-based multi-view reconstruction of piecewise smooth subdivision curves with a variable number of control points. In Proceedings of the European Conference on Computer Vision (ECCV), 2004.
  • Kahl and August [2003] F. Kahl and J. August. Multiview reconstruction of space curves. In Proceedings of the International Conference on Computer Vision (ICCV), 2003.
  • Kahl and Heyden [1998] F. Kahl and A. Heyden. Using conic correspondences in two images to estimate the epipolar geometry. In Proceedings of the International Conference on Computer Vision (ICCV), 1998.
  • Kaminski and Shashua [2004] J. Y. Kaminski and A. Shashua. Multiple view geometry of general algebraic curves. International Journal of Computer Vision (IJCV), 56(3):195–219, 2004.
  • Klein and Murray [2007] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 6th IEEE and ACM International Symposium on, pages 225–234, 2007.
  • Kuse and Shen [2016] Manohar Prakash Kuse and Shaojie Shen. Robust camera motion estimation using direct edge alignment and sub-gradient method. In IEEE International Conference on Robotics and Automation (ICRA-2016), Stockholm, Sweden, 2016.
  • Lu and Song [2015] Y. Lu and D. Song. Robust RGB-D odometry using point and line features. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.
  • Mur-Artal et al. [2015] Raul Mur-Artal, JMM Montiel, and Juan D Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  • Newcombe et al. [2011] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. DTAM: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320–2327. IEEE, 2011.
  • Nistér [2004] David Nistér. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence, 26(6):756–770, 2004.
  • Nurutdinova and Fitzgibbon [2015] I. Nurutdinova and A. Fitzgibbon. Towards pointless structure from motion: 3D reconstruction and camera parameters from general 3D curves. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • Porrill and Pollard [1991] J. Porrill and S. Pollard. Curve matching and stereo calibration. Image and Vision Computing (IVC), 9(1):45–50, 1991.
  • Pumarola et al. [2017] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer. PL-SLAM: Real-time monocular visual SLAM with points and lines. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017.
  • Schindler et al. [2006] G. Schindler, P. Krishnamurthy, and F. Dellaert. Line-based structure from motion for urban environments. In Proceedings of the International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), pages 846–853, Chapel Hill, USA, 2006.
  • Sturm et al. [2012] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pages 573–580, 2012.
  • Tarrio and Pedre [2015] J. J. Tarrio and S. Pedre. Realtime edge-based visual odometry for a monocular camera. In IEEE International Conference on Computer Vision (ICCV), pages 702–710, 2015.
  • Teney and Piater [2012] D. Teney and J. Piater. Sampling-based multiview reconstruction without correspondences for 3D edges. In Proceedings of the International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), 2012.
  • Xiao and Li [2005] Y. J. Xiao and Y. Li. Optimized stereo reconstruction of free-form space curves based on a nonuniform rational B-spline model. Journal of the Optical Society of America, 22(9):1746–1762, 2005.
  • Zhou et al. [2017] Yi Zhou, Laurent Kneip, and Hongdong Li. Semi-dense visual odometry for RGB-D cameras using approximate nearest neighbour fields. Accepted by ICRA 2017, abs/1702.02512, 2017. URL http://arxiv.org/abs/1702.02512.
  • Zuo et al. [2017] X. Zuo, X. Xie, Y. Liu, and G. Huang. Robust visual SLAM with point and line features. Arxiv Computing Research Repository, abs/1711.08654, 2017.