DefSLAM: Tracking and Mapping of Deforming Scenes from Monocular Sequences

08/20/2019 ∙ by Jose Lamarca, et al. ∙ University of Zaragoza 0

We present the first monocular SLAM capable of operating in deforming scenes in real-time. Our DefSLAM approach fuses isometric Shape-from-Template (SfT) and Non-Rigid Structure-from-Motion (NRSfM) techniques to deal with the exploratory sequences typical of SLAM. A deformation tracking thread recovers the pose of the camera and the deformation of the observed map at frame rate by means of SfT. A deformation mapping thread runs in parallel to update the template at keyframe rate by means of NRSfM with a batch of covisible keyframes. In our experiments, DefSLAM processes sequences of deforming scenes both in a laboratory controlled experiment and in medical endoscopy sequences, being able to produce accurate 3D models of the scene with respect to the moving camera.




1 Introduction

The goal of visual Simultaneous Localization and Mapping (SLAM) algorithms is to locate a visual sensor in an uncertain map which is being reconstructed simultaneously. The typical use case in SLAM is an exploratory trajectory, where the camera images a scene for the first time. In this case, visual monocular SLAM has to process several images rendering enough parallax to recover the map of the new scene region wrt. the camera. Once the map is available, the camera can be localized wrt. this map from just one image, as long as the camera does not move to unexplored areas. The rigidity assumption constrains the problem significantly, and it is intensively exploited by state-of-the-art monocular SLAM systems [engel2017direct, klein2007parallel, mur2015orb].

However, the rigidity assumption becomes invalid in applications where deformation is predominant, as in endoscopy. Recently, Augmented Reality (AR) has been shown to be a breakthrough tool for specialists to orient themselves in intracorporeal scenes by simply processing the video stream from a standard endoscope. One of the main tools for AR is visual monocular SLAM; however, the deformation of the scene severely limits the use of the classical algorithms. To this end, we introduce DefSLAM, a calibrated monocular deformable SLAM system which can operate in deforming environments.

In the literature, non-rigid monocular scenes have been handled by Non-Rigid Structure-from-Motion (NRSfM) [chhatkuli2014non, chhatkuli2016inextensible, parashar2017isometric, taylor2010non, vicente2012soft] and Shape-from-Template (SfT) [chhatkuli2017stable, lamarca2018, ngo2016template, salzmann2011linear] methods. NRSfM recovers the deformation of a non-rigid scene by processing a batch of images, which is computationally demanding. In contrast, SfT does it from only a single image, at a low computational cost but at the expense of a previously known 3D template, which is a textured model of the imaged scene. Our DefSLAM framework combines the advantages of these two classes of non-rigid monocular methods. We propose a parallel algorithm composed of a deformation tracking thread based on SfT as front-end running at frame rate, and a deformation mapping thread based on NRSfM to compute the template as a back-end running at a slower keyframe rate. Our algorithm is the first monocular SLAM that locates the camera and recovers the deformation of the map, while reconstructing new explored zones as shown in Figure 1.

The deformation tracking thread recovers the camera pose and the deformation of the map at frame rate. We define a template for the viewed part of the map which allows us to recover the observed deformation. The map points move according to this template by minimizing a combination of reprojection error and deformation energy for each frame. The deformation mapping thread initializes and refines map points and substitutes the template when new scene regions are visited, extending the map. It does not process all the frames, but just a selection of frames imaging the same scene region; these selected frames are called keyframes. This thread recovers the scene surface for the last keyframe. It produces a map of the deformed scene and defines the shape-at-rest of the template used by the deformation tracking thread to process the subsequent frames.

We validated our DefSLAM algorithm in exploratory trajectories in rigid and deforming scenes. We compared our results with the state-of-the-art rigid SLAM algorithm ORB-SLAM [mur2015orb]. These experiments validate the ability of DefSLAM to accurately capture the deforming nature of the scene in rigid, deformable and medical scenarios.

Figure 1: Real-time reconstruction of a deforming scene with DefSLAM. The kerchief mandala deforms and the camera moves. DefSLAM locates the camera, shown as a green frustum, while recovering the deformation of the kerchief and extending its 3D model in newly explored parts.

Notation. We use calligraphic letters for sets of geometrical entities in the deforming scene, e.g. the set of all map points. Bold letters represent matrices and vectors. Scalars are represented in italics. Separate indexes denote the frames and the keyframes. Super indexes represent the temporal instant of the estimation. Further indexes denote the map points, the nodes and the edges of the mesh describing the template surface.

2 Related Work

2.1 SLAM

Deformable visual SLAM. The deformable SLAM methods in the literature rely on sensors that give depth information, i.e. RGB-D or stereo sensors. DynamicFusion [newcombe2015dynamicfusion] is a seminal work in deformable SLAM with an RGB-D camera. It fuses the frame-by-frame depth information by dynamically warping it into a canonical shape that incrementally maps the entire scene after an exploratory trajectory of partial observations. [innmann2016volumedeform] proposes to also use the photometric information for matching. Other recent approaches, [song2018mis] and [turan2017non], propose SLAM with a stereo sensor in medical scenes. We aim for similar SLAM capabilities in deformable scenes, but in the more challenging monocular case.

Rigid visual SLAM. There are many approaches to visual monocular SLAM. The state-of-the-art monocular rigid methods such as [engel2017direct, mur2015orb] provide very accurate, robust and fast results. Some works have attempted to apply rigid methods to in-vivo medical quasi-rigid scenes. [grasa2014visual] proposes an EKF-SLAM algorithm, and [mahmoud2018live] obtains dense maps based on [mur2015orb]. [marmol2019dense] uses a rigid SLAM system to locate the camera in arthroscopic images. All of these methods assume that the deformation is negligible, and hence that a purely rigid SLAM system is able to survive just by excluding from the map any deforming scene region. We aim for similar performance in strongly deforming scenarios: real-time operation and the capability to handle sequences of close-ups corresponding to exploratory trajectories.

2.2 Non-Rigid Monocular Techniques

The methods in the literature which aim to recover the structure of a non-rigid scene from monocular sequences are SfT and NRSfM.

Shape-from-Template. SfT algorithms recover the deformed shape of an object from a monocular image and the object's textured shape at rest. This textured shape-at-rest of the object is the so-called template. These methods associate a deformation model with this template to recover the deformed shape. The main difference between these methods is the definition of their deformation model. We distinguish between analytic and energy-based methods. Among the analytic solutions, we focus on the isometric deformation, which assumes that the geodesic distance between points on the surface is preserved. Isometry for SfT has proven to be well-posed and it quickly evolved to stable and real-time solutions [bartoli2015shape, chhatkuli2017stable, collins2010locally]. Energy-based methods [agudo2014good, lamarca2018, ngo2016template, salzmann2011linear] jointly minimize the energy of the shape wrt. the shape-at-rest and the reprojection error for the image correspondences. These optimization methods are well suited to implement sequential data association with robust kernels to deal with outliers.

Orthographic Non-Rigid Structure-from-Motion. The earliest non-rigid monocular techniques are NRSfM methods. They were formulated using statistical models, first proposed in [bregler2000recovering]. This work gave rise to a family of methods [dai2014simple, moreno2011probabilistic, paladini2009factorization] that used a low dimensional basis model to obtain the configuration of the 3D points from the images of a sequence. They exploited spatial constraints [dai2014simple, garg2013dense], temporal constraints [akhter2011trajectory] and spatio-temporal constraints [agudo2015simultaneous, gotardo2011kernel, gotardo2011non]. These methods may handle small surface deformations or articulated objects, but they usually fail with very large deformations. They use an orthographic camera projection, which is a weak approximation to the perspective camera. This is an important limitation due to the significant perspective effect in many applications.

Perspective Non-Rigid Structure-from-Motion. The isometry assumption, first proposed in SfT methods, has also produced excellent results in NRSfM [chhatkuli2014non, chhatkuli2016inextensible, parashar2017isometric, taylor2010non, vicente2012soft]. It brought not only improvements in terms of accuracy, but also the ability to handle perspective cameras. The perspective camera is a more accurate model for the close-ups in the exploratory sequences which are targeted in SLAM. [parashar2017isometric] is a local method, which naturally handles the occlusions and missing data that are common in many applications.

Our approach. We propose the first visual SLAM system capable of working with deforming monocular sequences. We propose a deformation tracking thread based on [lamarca2018], which uses a template to recover the camera pose and the deformation of the scene. In this work, the template is re-estimated when new zones are explored. We propose a deformation mapping thread which processes several keyframes in batch to estimate the shape-at-rest of the new template in new areas. This is a full-fledged deformable mapping, based on isometric NRSfM [parashar2017isometric], with automatic data association, estimation of the new up-to-scale surface, scale recovery in the current map and map point embedding. We run deformation tracking and mapping in parallel, as proposed in [klein2007parallel], in a similar way to the state-of-the-art rigid SLAM methods [engel2017direct, klein2007parallel, mur2015orb], to achieve real-time performance. DefSLAM is thus a new type of non-rigid monocular reconstruction method which can process exploratory sequences incrementally.

3 DefSLAM System Overview

DefSLAM recovers the structure of the scene, its deformation and the camera pose. It is composed of three main components:

  • The map. The map represents the structure of the scene reconstructed by the algorithm, as a set of 3D map points. The map is deformable and the position of the map points evolves along the sequence. Each map point is represented by its position in each processed frame.

  • The deformation tracking thread. This thread is the front-end of the system and runs at frame rate. It uses SfT, which estimates the shape of the map and the camera pose for each frame. It also updates the position of the map points embedded in the template.

  • The deformation mapping thread. This thread is the back-end of the system and runs at keyframe rate. It uses NRSfM, which estimates the surface observed in the last keyframe. From the estimate of the surface, we initialize the new template for deformation tracking.

4 Deformation Tracking

Deformation tracking is performed to recover the camera pose and the position of the map points embedded in the template for the deformation observed in each frame. We recover them by jointly minimizing the reprojection error and the deformation energy of the map wrt. the shape-at-rest reconstructed in the keyframe.

In SLAM sequences, the camera usually images a zone of the scene smaller than the global scene. For efficiency and scalability, we only optimize the observed zone of the template and its closest vicinity. We refer to this part of the map as the local zone.

Figure 2 shows all the components of the deformable tracking: the template, the local zone and the camera pose.

4.1 Template

The template is a 2D triangular mesh embedded in the 3D space. It is composed of a set of planar triangular facets, defined by a set of nodes and connected by a set of edges. The deformation of the map at a given frame is defined through the pose of the nodes of the template defined in the keyframe. The location of a facet at that frame is defined by the pose of its three nodes. The map points observed in the keyframe are embedded in the facets of the mesh. The position of a map point in a frame is defined by means of its barycentric coordinates wrt. the position of the nodes of its facet: it is the sum of the positions of the three nodes, each weighted by the corresponding barycentric coordinate.
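The embedding can be illustrated with a minimal sketch. It computes the barycentric coordinates of a map point in the shape-at-rest and reuses them once the facet nodes have moved. The function names `barycentric_coordinates` and `map_point_position` are hypothetical, and the point is assumed to lie on the plane of its facet.

```python
def barycentric_coordinates(p, a, b, c):
    """Barycentric weights of point p wrt. triangle nodes a, b, c (3D tuples).

    Assumes p lies on the plane of the triangle (hypothetical helper).
    """
    v0 = [b[k] - a[k] for k in range(3)]
    v1 = [c[k] - a[k] for k in range(3)]
    v2 = [p[k] - a[k] for k in range(3)]
    dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))
    d00, d01, d11 = dot(v0, v0), dot(v0, v1), dot(v1, v1)
    d20, d21 = dot(v2, v0), dot(v2, v1)
    denom = d00 * d11 - d01 * d01
    wb = (d11 * d20 - d01 * d21) / denom
    wc = (d00 * d21 - d01 * d20) / denom
    return (1.0 - wb - wc, wb, wc)


def map_point_position(bary, nodes):
    """Position of an embedded map point: weighted sum of the facet nodes."""
    return tuple(sum(w * n[k] for w, n in zip(bary, nodes)) for k in range(3))


# Shape-at-rest facet and a map point lying on it.
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
bary = barycentric_coordinates((0.25, 0.25, 0.0), *rest)

# The same facet after the tracking has moved its nodes.
deformed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
moved = map_point_position(bary, deformed)
```

Because the weights are fixed at embedding time, only the node positions have to be optimized per frame; the map points follow them at no extra cost.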


4.2 Camera Model

We use the calibrated pinhole model. A 3D point is projected into a frame by a camera located at a given pose through the projection function: the point is first transformed with the rotation and the translation of the camera pose, then divided by its depth, scaled by the focal lengths and shifted by the principal point. The focal lengths and the principal point are obtained from the camera calibration. The set of observations in the image is the set of keypoints matched with a map point. The normalized coordinates of a matched keypoint are computed from its pixel coordinates by subtracting the principal point and dividing by the focal length along each image axis.
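A minimal sketch of the pinhole projection and of the normalized coordinates follows. The names `project`, `normalize`, `fu`, `fv`, `cu` and `cv` are assumptions made for this illustration, as are the calibration values in the usage lines.

```python
def project(R, t, X, fu, fv, cu, cv):
    """Pinhole projection of a 3D point X given rotation R (3x3) and translation t."""
    # Transform the point into the camera frame.
    x, y, z = (sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3))
    # Perspective division, then focal lengths and principal point.
    return (fu * x / z + cu, fv * y / z + cv)


def normalize(u, v, fu, fv, cu, cv):
    """Normalized coordinates of a keypoint given in pixels."""
    return ((u - cu) / fu, (v - cv) / fv)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
calib = dict(fu=500.0, fv=500.0, cu=320.0, cv=240.0)  # illustrative values

pixel = project(identity, (0.0, 0.0, 0.0), (0.5, -0.25, 2.0), **calib)
normalized = normalize(pixel[0], pixel[1], **calib)
```

The normalized coordinates equal the camera-frame point divided by its depth, which is why they are a convenient parametrization for the surface embedding used later by NRSfM.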

4.3 Camera Pose and Template Deformation

Figure 2: Deformation tracking: estimating the camera pose and the deformation of the viewed map. The figure shows the entire map shape in the frame, the local map shape in the frame and the camera pose. Black points belong to the global map. Some of them are embedded in the template. Currently matched points are shown in red.

To estimate the deformed shape and the camera pose, we jointly minimize the reprojection error in the image and the deformation energy of the template, i.e. the sum of both terms.

We solve this minimization (3) with the Levenberg-Marquardt optimization method. The initial guess for the shape and the camera pose is the solution of the previous frame. To determine the position of the camera, we fix the pose of the boundary nodes during the optimization to constrain the gauge freedoms.

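The reprojection error for the set of keypoints in the image accumulates, over all the matches, the residual between each keypoint and the projection of its matched map point. The reprojection error is robust against outliers, as each residual is weighted with a Huber robust kernel.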
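The effect of the robust kernel can be shown with a short sketch. The Huber threshold `delta`, the use of the Euclidean pixel distance as residual and the function names are assumptions for illustration; the source does not state them.

```python
import math


def huber(residual, delta):
    """Huber kernel: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def reprojection_cost(keypoints, projections, delta=2.0):
    """Sum of Huber-weighted distances between keypoints and projected map points."""
    total = 0.0
    for (ku, kv), (pu, pv) in zip(keypoints, projections):
        total += huber(math.hypot(ku - pu, kv - pv), delta)
    return total


keypoints = [(0.0, 0.0), (0.0, 0.0)]
projections = [(1.0, 0.0), (10.0, 0.0)]  # the second match is an outlier
robust = reprojection_cost(keypoints, projections, delta=2.0)
quadratic = sum(0.5 * math.hypot(k[0] - p[0], k[1] - p[1]) ** 2
                for k, p in zip(keypoints, projections))
```

In this toy case the outlier contributes 18 to the robust cost instead of 50 to the purely quadratic one, so a wrong match pulls the estimate far less.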

We define a deformation energy wrt. the shape-at-rest as a weighted combination of a stretching energy, a bending energy and a temporal term. One weight per term sets the influence of each of them.
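The structure of the tracking objective described in this section can be summarized compactly. The symbol names below (camera pose $\mathbf{T}^{t}$, node positions $\mathcal{V}^{t}$, energies $\varphi$ and weights $\lambda$) are assumptions chosen for this summary; only the additive structure is taken from the text.

```latex
\begin{aligned}
\{\hat{\mathbf{T}}^{t}, \hat{\mathcal{V}}^{t}\}
  &= \arg\min_{\mathbf{T}^{t},\,\mathcal{V}^{t}}
     \; \varphi_{d}\!\left(\mathbf{T}^{t}, \mathcal{V}^{t}\right)
     + \varphi_{e}\!\left(\mathcal{V}^{t}\right), \\
\varphi_{e}
  &= \lambda_{s}\,\varphi_{s} + \lambda_{b}\,\varphi_{b} + \lambda_{t}\,\varphi_{t}
\end{aligned}
```

Here $\varphi_{d}$ stands for the Huber-weighted reprojection error, and $\varphi_{s}$, $\varphi_{b}$ and $\varphi_{t}$ for the stretching, bending and temporal terms.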

The stretching energy measures the difference between the length of each edge of the local zone in the current frame and its length in the shape-at-rest of the template.

The bending energy measures the change in mean curvature at each node wrt. the curvature estimated in the shape-at-rest of the template. We estimate the mean curvature through the discrete Laplacian operator [floater2003mean]. We make the bending term dimensionless by dividing it by the mean length of the edges connected with the node.


The temporal term is a temporal filter which penalizes abrupt changes between estimates.


4.4 Data Association

To match the keypoints in the current frame with the map points, we apply an active matching strategy as proposed in [chli2009active]. First, the ORB points are detected in the current frame. Next, the camera pose is predicted with a camera motion model as a function of the past camera poses. Then, we use the last estimated shape of the template and the barycentric coordinates to predict where the map points would be imaged. Around each map point prediction, we define a search region. We match the query map point with the keypoint inside this search region that has the most similar ORB descriptor. The similarity is measured as the Hamming distance between the ORB descriptors. The ORB descriptor of the map point is taken from its initialization. We apply a threshold on the ORB similarity to definitively accept a match.
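The search-region matching step can be sketched as follows. Descriptors are represented as Python integers standing in for binary ORB descriptors, and the search radius, the acceptance threshold and the name `match_in_region` are illustrative assumptions.

```python
def hamming(a, b):
    """Hamming distance between two binary descriptors stored as integers."""
    return bin(a ^ b).count("1")


def match_in_region(prediction, descriptor, keypoints, radius, max_distance):
    """Index of the best keypoint inside the search region, or None.

    keypoints: list of (u, v, descriptor) tuples detected in the current frame.
    A match is accepted only if its Hamming distance is <= max_distance.
    """
    best_index, best_distance = None, max_distance + 1
    for index, (u, v, kp_descriptor) in enumerate(keypoints):
        # Skip keypoints outside the circular search region.
        if (u - prediction[0]) ** 2 + (v - prediction[1]) ** 2 > radius ** 2:
            continue
        distance = hamming(descriptor, kp_descriptor)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


keypoints = [
    (100.0, 100.0, 0b11110000),
    (102.0, 101.0, 0b11110001),
    (300.0, 300.0, 0b11110000),
]
accepted = match_in_region((101.0, 100.0), 0b11110000, keypoints, 10.0, 2)
rejected = match_in_region((101.0, 100.0), 0b00001111, keypoints, 10.0, 2)
```

Restricting the comparison to the predicted region keeps the cost per map point low and removes visually similar keypoints elsewhere in the image from consideration.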

4.5 Keyframe Selection

The deformation tracking can recover the position of the camera and the deformation of the scene region covered by the template. In exploratory sequences, we need to incorporate new regions in the map and recompute the template to cover those areas. The deformation mapping is the process which recovers new parts of the map. It is launched when we decide to select the current frame as the next keyframe. We use the heuristic of selecting a new keyframe when the mapping finishes the last estimation.

5 Deformation Mapping

Figure 3: Deformable mapping: extension of the map. Local area in green. Matched points in red. Surface estimated by NRSfM and final surface growing using as reference the correct in blue.

Deformation mapping recovers the observed map as a surface for the keyframe. This surface contains the map points observed in the keyframe during the tracking. We can thus refine the map points and introduce new ones. The recovered surface will be the shape-at-rest of the template for the deformation tracking of the next frames, as shown in Figure 3.

Deformation mapping is performed as follows: first, we recover the warps between the new keyframe and the set of its best covisible keyframes. Second, we estimate an up-to-scale surface by processing the best covisible keyframes with NRSfM. Third, we align this surface with the entire map to obtain the scaled surface wrt. the old map in the keyframe. Finally, with this new surface, we create the new template, which involves computing a triangular mesh and embedding the map points in its facets.

5.1 NRSfM

In isometric NRSfM, the surface deformation is modelled locally for each point under the assumption of isometry and infinitesimal planarity. According to the assumption of infinitesimal planarity, any surface can be approximated with a plane at an infinitesimal level, while maintaining its curvature at the global level. Isometric NRSfM can handle both rigid and non-rigid scenes. Since we use a local method, it can handle missing data and occlusions inherently. Next, we summarize the isometric NRSfM method [parashar2017isometric], which we refer to as NRSfM.

Figure 4: Simplified notation for two keyframes. The figure shows the embeddings of the two surfaces in their respective images, the warp between the two images and the deformation field between the two surfaces.

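The embedding of the scene surface is parametrized using the normalized coordinates of the image: each surface point is the ray through its normalized image coordinates divided by the inverse depth of the point. The normal of the surface expressed wrt. this parametrization depends on the normalized coordinates and on two quantities per point, namely the ratios between each partial derivative of the inverse depth along the image coordinates and the inverse depth itself.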
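The dependence of the normal on the inverse depth can be checked with a short derivation. The symbols below ($\hat{u}, \hat{v}$ for the normalized coordinates, $\beta$ for the inverse depth, $k_1, k_2$ for the two ratios) are assumed names, and the normal is given up to normalization and sign.

```latex
\begin{aligned}
\phi(\hat{u},\hat{v}) &= \frac{1}{\beta(\hat{u},\hat{v})}
  \begin{pmatrix}\hat{u} & \hat{v} & 1\end{pmatrix}^{\top},
  \qquad k_1 = \frac{\beta_{\hat{u}}}{\beta},
  \quad k_2 = \frac{\beta_{\hat{v}}}{\beta}, \\
\phi_{\hat{u}} &= \frac{1}{\beta}
  \begin{pmatrix}1 - k_1\hat{u} & -k_1\hat{v} & -k_1\end{pmatrix}^{\top},
  \qquad
\phi_{\hat{v}} = \frac{1}{\beta}
  \begin{pmatrix}-k_2\hat{u} & 1 - k_2\hat{v} & -k_2\end{pmatrix}^{\top}, \\
\mathbf{n} &\propto \phi_{\hat{u}} \times \phi_{\hat{v}}
  = \frac{1}{\beta^{2}}
  \begin{pmatrix}k_1 & k_2 & 1 - k_1\hat{u} - k_2\hat{v}\end{pmatrix}^{\top}.
\end{aligned}
```

The direction of the normal therefore depends only on $k_1$, $k_2$ and the normalized coordinates, not on the absolute depth, which is consistent with the surface being recovered only up to scale.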

NRSfM exploits the relationship between the metric tensor and the Christoffel symbols of the surface of the new keyframe and those of the surfaces of its best covisible keyframes. Assuming infinitesimal planarity and isometry, the metric tensor and the Christoffel symbols depend only on the two quantities that define the surface normal, for each point in every keyframe image. The warp between two keyframes represents the transformation from the image of one to the image of the other. Figure 4 shows the different elements of the two-view relation: the warp, the surface embedding for each keyframe of the pair and the isometric deformation between the two surfaces. Due to the infinitesimal planarity and isometry assumptions, the metric tensor and the Christoffel symbols of two different shapes are related through the warp between their keyframes, in terms of the Jacobian and the Hessian of the warp. These relations, Eqs. (11) and (12), can be transformed into two cubic polynomial equations for each point correspondence. The coefficients of these polynomials depend only on the normalized coordinates of the points and on the first- and second-order derivatives of the warp. We refer to [parashar2017isometric] for further details on the coefficients.

If a point is matched in more than two keyframes, we can recover the two quantities that define its normal by means of non-linear optimization. Hence, we obtain an estimate of the normal for each point in the last inserted keyframe. The initial solution of this optimization is given by the normals of the deformed template when the keyframe was inserted.

Lastly, we recover the surface from the set of estimated unit normals using Shape-from-Normals (SfN) [chhatkuli2014non]. The surface is modelled as a bicubic spline parametrized by the depth of its control nodes. The control nodes are defined in the image of the keyframe. We fit the depth of the nodes to obtain a surface orthogonal to the estimated normals, with a regularizer in terms of bending energy. The final depth estimate is up-to-scale, as shown in Figure 5.

Figure 5: Envelope of solutions for the keyframe, showing the set of normals and the estimated up-to-scale surface.

5.2 Surface Alignment

Figure 6: Sim(3) alignment. We align the map points of the up-to-scale estimation with the pose of the map points estimated by the tracking for the frame in which the keyframe was inserted.
Figure 7: DefSLAM system overview. The deformation tracking and mapping threads run in parallel. Deformation mapping is composed of NRSfM (Sec. 5.1), surface alignment (Sec. 5.2) and SfT template substitution (Sec. 5.3).

The new estimated surface is up-to-scale. We need to recover the solution with a coherent scale wrt. the already estimated map. This means that the scale-corrected shape-at-rest must match with the scale of the template estimated by the tracking when the keyframe was inserted.

We align the map points of these two surfaces through a non-linear optimization over Sim(3), the group of similarity transformations of 3-space. The optimization estimates the rotation, the translation and the scale defining the transformation. Figure 6 shows the alignment process.

To build our new template, we finally create a triangular mesh from the scale-corrected surface by means of Delaunay triangulation. The 3D position of the new map points is computed from the camera observation, constraining them to lie on the estimated surface. Then, we embed all the re-observed map points and the new ones by projecting them into their corresponding facet. With this embedding, we calculate the barycentric coordinates of the map points which will be used by the tracking.

5.3 Template Substitution

The tracking runs at frame rate while deforming the template. At the same time, when a new keyframe is dropped, the mapping has to estimate the new template. As the mapping operations are computationally demanding, the tracking will have processed several frames before the new template computation finishes.

We propose to compute the shape for the current frame by deforming the most recently computed template instead of the old one, and then to switch the templates. The process is summarized as follows. From the tracking thread, we compute the matches between the current frame and the old template, which coincide with the matches between the current frame and the new template. Then, we estimate the shape of the current frame from the new template with SfT, Eq. (3). In this case, we neglect the temporal term to allow a strong template deformation. Once we have recovered this shape, the deformation tracking continues with the new template.

In summary, see Figure 7, we compute the surface from the batch processing of the current keyframe image along with its covisible keyframes (Section 5.1). Then, we compute the Sim(3) transformation that aligns this surface with the previous map (Section 5.2). Finally, we substitute the old template with the new one in the current frame.

5.4 Warp Estimation and Non-Rigid Guided Matching

Figure 8: Warp estimation and non-rigid guided matching. We use Schwarps between two keyframes (left and right). The warp between them is plotted in blue. Green points are the result of the guided matching stage and yellow points are the initially matched map points.

The input of NRSfM is the set of warps between the keyframe and its best covisible keyframes. A warp is a function that maps the coordinates of a point in the original image to the coordinates of its match in the other image.

We use a particular family of warps called Schwarps [pizarro2016schwarps], because, as discussed in [parashar2017isometric], the regularizers based on the 2D Schwarzian equations are equivalent to the infinitesimal planarity assumption of NRSfM. See Figure 8 for two examples of warps between keyframes.

To increase the number of matches, we use the estimated warp as a reference to perform a non-rigid guided search for more matches in the covisible keyframes.

First, we estimate an initial warp between the keyframe and its best covisible keyframes with the matches given by the deformation tracking. With this initial warp, we can estimate where a point would be seen in the rest of the keyframes. Thus, we define a search region around the estimated position. We select the keypoint inside the search region with the smallest Hamming distance for the ORB descriptor. We apply a threshold on the ORB similarity to definitively accept a match. Once we have the new matches, we add them to the initial ones and estimate the final warp.

5.5 Initialization

At initialization we need to have available a template for the imaged scene surface. We have assumed this initial template as a plane parallel to the image. We compute a triangular mesh with a Poisson surface reconstruction [kazhdan2006poisson].

Once our system has created several new keyframes, we launch the mapping thread and start to compute our template, which replaces the initial one. The accuracy of the first computed templates strongly depends on how many keyframes are fed into the NRSfM and on how much parallax they render.

According to the experiments, our algorithm can track from an inaccurate template with a high quality data association between keyframes, yielding long tracks and a low false positive rate. As a result, as more keyframes rendering high parallax are created, the estimated template eventually converges to the actual scene shape.

6 Implementation Details

The method is implemented in C++ and runs entirely on the CPU. We have used the OpenCV library [opencv_library] for base computer vision functions. For the SfT optimization and the LS registration, we have used the g2o library [kummerle2011g] and its implementation of Levenberg-Marquardt. For the Schwarps optimization, the normal estimation and the SfN, we have used the Ceres library [ceres-solver]. The runtime depends on the resolution of the mesh used as template. For a mesh of 1010 nodes the runtime is approximately  ms for the deformable tracking thread and approximately  ms for the deformable mapping on a machine with an i7-4700HQ CPU and 7.7 GB of RAM. The code will be made available in the public git repository under the GPL license.

7 Experiments

We tested the performance of DefSLAM in two sets of sequences: the Mandala dataset, which we propose to evaluate monocular deformable SLAM systems, and two sequences of the Hamlyn dataset [mountney2010three, stoyanov2005soft], with which we validated the algorithm in medical endoscopy. All the tested sequences are recorded with a stereo pair, which we use to calculate the ground truth.

In the Mandala dataset, we analyzed the overall quality of our algorithm. We focus on two metrics: the RMS error of the estimated map against the ground truth and the fraction of matched map points. The RMS error is calculated after a scale alignment computed for each frame of the sequence. The fraction of matched map points is the quotient between the matched map points and the potentially matching ones contained in the frustum of the frame. A low fraction signals a poor map that can only partially represent the imaged frustum. We benchmark DefSLAM against a rigid method to compare the results. We also validated the proposed monocular initialization against initializing with a given initial template in the sequences of the Mandala dataset.
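Both metrics can be illustrated with a minimal sketch. The text does not state how the per-frame scale is computed, so the closed-form least-squares scale used here is an assumption, as are the function names; the points are assumed to be expressed in the camera frame and already in correspondence.

```python
import math


def scale_aligned_rmse(estimated, ground_truth):
    """Scale that best maps the estimate onto the ground truth, and the RMS error.

    The scale minimizes sum ||s * e - g||^2, i.e. s = sum(e.g) / sum(e.e).
    """
    num = sum(e[k] * g[k] for e, g in zip(estimated, ground_truth) for k in range(3))
    den = sum(e[k] * e[k] for e in estimated for k in range(3))
    scale = num / den
    squared = sum(
        sum((scale * e[k] - g[k]) ** 2 for k in range(3))
        for e, g in zip(estimated, ground_truth)
    )
    return scale, math.sqrt(squared / len(estimated))


def matched_fraction(matched, potential):
    """Matched map points divided by the map points inside the frustum."""
    return matched / potential if potential else 0.0


truth = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
half_scale = [tuple(0.5 * c for c in p) for p in truth]  # correct shape, wrong scale

scale, rmse = scale_aligned_rmse(half_scale, truth)
fraction = matched_fraction(30, 40)
```

Aligning the scale per frame removes the global scale ambiguity of a monocular reconstruction, so the remaining error reflects the shape of the recovered surface.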

In addition, we carried out an ablation analysis of the mapping and the tracking in one of the sequences of the Mandala dataset. In the mapping, we focused on the influence of the SfN on the normals estimation. For the tracking, we tested the deformation tracking against a rigid tracking. We also analyzed the sensitivity of the system to the weights of the regularizers of the tracking.

7.1 Mandala Dataset

The Mandala dataset consists of five sequences in which a kerchief with a mandala is imaged following an exploratory trajectory in circles while deforming. We increased the hardness of the deformation progressively. The kerchief deforms near-isometrically. We define the hardness of the deformation based on the period of the waves generated on the surface and on their amplitude from the shape-at-rest.

The first sequence is called mandala0 and it is a rigid experiment where the kerchief remains on the floor. In the sequence mandala1, the deformation has an amplitude of approximately 15 cm and a period of approximately  s. In the sequence mandala2, the amplitude is approximately  cm and the period approximately  s. In mandala3, the amplitude is approximately  cm and the mandala is oscillating with a period of approximately  s. In mandala4, the amplitude is approximately 30 cm and its period is halved; in addition, the movements are caused in two different parts of the kerchief simultaneously. The first three experiments contain smoother deformation, whereas the last two have a much stronger deformation.

7.2 Monocular Initialization

The initialization for a SLAM system is always a challenging process. We tested the monocular initialization which we have proposed against the stereo initialization, i.e., using the stereo pair to estimate the initial template.

We compared them in the first 200 frames of all the sequences of the Mandala dataset. As shown in Figure 11, the initialization with the monocular method starts with a bigger error in all the sequences, but within the first 50 frames it converges to the same solution as the initialization with stereo.

7.3 Overall Quality Dataset

Figure 9: Overall quality for Mandala dataset sequences. From left to right, the scenario contains more deformation. Top: 3D RMS error (mm) per frame, (the smaller, the better). Bottom: Computed matches with respect to the potential matches per frame (the higher, the better). DefSLAM is able to process all the sequences accurately, with a map covering the imaged area of the scene.
Figure 10: Recovering local deformations in the mandala3 sequence. The sequence shows a wave in the kerchief. (Top row) 2D image from Mandala3. (Bottom row) 3D reconstruction. We observe that DefSLAM can perceive and reconstruct the deforming behaviour of the scene.

We analyze the overall quality of the reconstruction of DefSLAM. We compare its performance with ORBSLAM, a rigid system.

We tuned ORBSLAM to process the Mandala dataset. ORBSLAM initialization failed dramatically with the deforming cases, hence we have applied the stereo initialization. Secondly, we have increased the keyframe creation rate to the maximum possible, only limited by the computing power. Finally, we have relaxed the thresholds for matching and outlier rejection in order to process observations with some degree of deformation.

Concerning DefSLAM, we apply the stereo initialization for a fair comparison. We repeated each experiment 5 times and we report the median value of the five executions per frame.

Figure 11: Stereo vs monocular initialization. Per frame RMS scene reconstruction error (mm) after per frame scale alignment. Monocular initialization converges to the stereo solution after a few frames.

Figure 9 shows the final results for the five sequences. The results of DefSLAM are shown in green and the results of ORBSLAM in blue. The first row shows the RMS error, where smaller is better, and the second row shows the fraction of matched map points, where higher is better. We now analyze the results sequence by sequence.

In mandala0, the error of DefSLAM was comparable to that of the rigid SLAM algorithm. Although the initialization was easier for DefSLAM due to the assumption of a surface, ORBSLAM converged faster to a better solution. DefSLAM obtains numbers slightly worse than a state-of-the-art rigid system in a completely rigid scene. Concerning the fraction of matched map points, both DefSLAM and ORBSLAM achieved a high percentage, which means that the points they reconstruct are highly reused, as expected given the rigidity of the scene.

In mandala1, there is low-frequency deformation. In this sequence, DefSLAM obtains an RMS error similar to the one obtained in mandala0, recovering the deformation of the kerchief. In contrast, although ORBSLAM could process the entire sequence, its RMS error was heavily penalized by the deformation: it was about three times both its own RMS error in mandala0 and the RMS error of DefSLAM in mandala1. For DefSLAM, the fraction of matched map points still does not show an important difference wrt. the rigid sequence, whereas for ORBSLAM this fraction decreased. This is because the rigid system cannot accurately predict the map points in deforming areas and creates new map points for each deformation stage. The rigid map becomes populated with map points that are only matched if the deformation stage repeats, which reduces the fraction of matched map points. Meanwhile, the non-rigid method is able to capture the deforming nature of the scene.
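The effect described above on the fraction of matched map points can be illustrated with a toy computation. The function name `matched_fraction` and all the numbers below are made up for illustration; they are not measurements from the Mandala dataset.

```python
def matched_fraction(potential_ids, matched_ids):
    """Fraction of the potentially visible map points that were matched."""
    potential = set(potential_ids)
    if not potential:
        return 0.0
    return len(potential & set(matched_ids)) / len(potential)

# Toy setting: 100 map points that are matched in the current frame.
stable = set(range(100))
# A rigid map additionally holds 50 points created at an earlier deformation
# stage; they still project into the frame but do not match at this stage.
stale = set(range(100, 150))

deformable_fraction = matched_fraction(stable, stable)
rigid_fraction = matched_fraction(stable | stale, stable)
```

In this toy setting the deformable map keeps a fraction of 1.0, while the rigid map drops to 100/150, because the stage-specific points enlarge the set of potential matches without contributing actual matches.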

                30                   60                   120
        0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03
RMSE      25   29   26   29    27   30   26   25    25   23   26   24
Median    21   23   22   23    22   25   18   18    17   18   18   19
Max       61   64   51   61    61   72   66   61    66   62   74   55

                30                   60                   120
        0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03
RMSE      29   25   25   23    26   24   26   26    25   26   25   27
Median    24   19   19   17    19   17   20   18    20   18   18   23
Max       71   73   67   60    69   58   67   69    67   73   65   64

                30                   60                   120
        0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03  0.00 0.01 0.02 0.03
RMSE      32   25   25   27    23   26   27   22    25   24   23   25
Median    25   18   20   22    18   18   23   18    20   18   17   19
Max       80   66   59   64    66   65   65   59    55   58   60   58

Table 1: 3D error (mm) wrt. parameter variation, for frames #280 to #400 of mandala3.

In mandala2, the displacement is smaller but faster: the waving period is halved. DefSLAM obtained an error smaller than 2 cm during the first 500 frames. ORBSLAM had some problems processing the first frames, but eventually it could assume rigidity without accumulating a large error. In any case, DefSLAM recovers the deformation observed during the sequence better, both in terms of RMS error and in the fraction of matched map points per frame.

In the mandala3 and mandala4 sequences, the conditions are more extreme. ORBSLAM could not process either of these sequences entirely. In mandala3, the fast deformation makes it difficult for DefSLAM to follow the deformation and adds some delay before it obtains the correct shape. This causes some peaks in the RMS error. In any case, the average error is around 4 cm during the entire sequence. In Figure 10 we can observe the quality of the reconstruction of one wave travelling across the kerchief in mandala3. As the scene in this case is much more challenging, the fraction of matched map points for DefSLAM is directly affected: although DefSLAM is able to process the entire sequence, the fraction of matched points is smaller.

In mandala4, the displacements are larger but the period is the same. DefSLAM is again able to recover the deformation without notably increasing the error. As in mandala3, the fraction of matched map points decreased. However, DefSLAM can still track the entire sequence. The supplementary material includes a video with fragments of the Mandala dataset quality results.

7.4 Sensitivity Analysis

Concerning the sensitivity analysis, we ran the sequence mandala3 from frame #280 to #400, where there is a strong deformation. We varied the three weights and ran the sequence to see the changes in the RMS, the median and the maximum value of the 3D error in the execution of the algorithm. We report the results in Table 1. We conclude that the system errors are similar over the range of weight values tested. Increasing two of these weights ultimately leads to a rigid model (see Sec. 7.6). Decreasing them leaves the system underconstrained.
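The three statistics reported in Table 1 can be computed from a list of 3D errors as in the following sketch. The function name `error_summary` is hypothetical, and pooling all errors of the frame range into one list is an assumption; the paper does not specify how the errors are aggregated.

```python
import math
from statistics import median

def error_summary(errors_mm):
    """RMS, median and maximum of a list of 3D errors in mm."""
    rmse = math.sqrt(sum(e * e for e in errors_mm) / len(errors_mm))
    return {"RMSE": rmse, "Median": median(errors_mm), "Max": max(errors_mm)}

# Made-up errors (mm), only to show the call.
summary = error_summary([3.0, 4.0])
```

Reporting the median and the maximum next to the RMS helps to tell apart a weight setting that is slightly worse everywhere from one that is mostly accurate but fails on a few frames.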

7.5 Deformation Mapping Analysis

Figure 12: (Left) Keyframe surface estimation by NRSfM and SfN. Per keyframe RMS error of the angle for the template reconstruction in degrees. (Right) Rigid tracking vs deformation tracking surface error as 3D RMS scene reconstruction error per frame in mm.

We analyzed the quality of the deformation mapping on the sequence mandala3, focusing on NRSfM and SfN. The output of the NRSfM is the set of normals of the surface for each map point processed. SfN estimates the surface from these normals and simultaneously smooths them. As a metric, we propose the angle between the estimated normal and the ground truth normal. Figure 12 shows the angle RMS error of the reconstruction estimated by the NRSfM versus the error after the estimation of the surface with the SfN. Crucially, SfN consistently reduces the error. The mean error was approximately 12 degrees. We used an NRSfM batch size of between 5 and 15 keyframes to estimate the normals.
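The normal-angle metric can be sketched as follows. The function name `normal_angle_rms_deg` is hypothetical, and the normals are assumed to be given as 3D vectors that are not necessarily unit length.

```python
import math

def normal_angle_rms_deg(estimated, ground_truth):
    """RMS of the angles (degrees) between paired surface normals."""
    squared = 0.0
    for n, g in zip(estimated, ground_truth):
        dot = sum(a * b for a, b in zip(n, g))
        norms = math.sqrt(sum(a * a for a in n)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] to guard acos against rounding errors.
        cos_angle = max(-1.0, min(1.0, dot / norms))
        squared += math.degrees(math.acos(cos_angle)) ** 2
    return math.sqrt(squared / len(estimated))

# Made-up example: one normal is exact, the other is off by 90 degrees.
angle_rms = normal_angle_rms_deg(
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
)
```

Because the metric only depends on the direction of the normals, it is independent of the scale of the reconstruction and isolates the quality of the local surface orientation.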

7.6 Deformation Tracking Analysis

To assess the contribution of deformation tracking to DefSLAM, we compare the proposed deformation tracking to a purely rigid tracking like that of [mur2015orb]. In the rigid case, the template points are assumed to be fixed in the scene, and the camera pose is recovered by pose estimation. The rigid template is only updated by the deformation mapping after each new keyframe.

We ran the experiment on the sequence mandala3. As shown in Figure 12, the rigid tracking can process the part of the sequence with slow deformation, because the keyframe-rate update of the template is fast enough to track the deformation of the scene. However, when there is strong deformation between two keyframe-rate updates, it fails. In contrast, our proposed deformation tracking updates the template at frame rate and was able to successfully process the entire sequence. Another shortcoming of the rigid tracking lies in the alignment step, where it has to register two surfaces from different instants, which eventually results in a failure.

7.7 Hamlyn Dataset

Figure 13: Processing Hamlyn sequences. Per frame RMS scene reconstruction error in mm after a per frame scale alignment with the stereo ground truth. (Left) Heart sequence. (Right) Abdominal sequence.
Figure 14: Hamlyn dataset sequence. (Two top rows) Heart sequence. (Two bottom rows) Abdominal Sequence. (Top) 2D images processed by the algorithm with the matches. (Bottom) 3D reconstruction.

Our last experiments test DefSLAM on intracorporeal sequences from the Hamlyn dataset. It contains medical sequences that we processed to show the performance of our algorithm on medical images. We analyzed two sequences. The first is a heart sequence, which corresponds to a non-rigid scene with non-exploratory camera motion. The second is an abdominal exploration where the scene is predominantly rigid for the first 200 frames. It contains a small deformation from frame #200 to #400. In the middle of the sequence, there is significant motion clutter due to a tool interfering with the scene. Finally, from frame #420, the tool causes severe deformations of the scene.

Figure 13 shows the evolution of the RMS error during the Hamlyn sequences, as the median of 5 executions. The DefSLAM RMS error was approximately 3 mm for the heart sequence and 8 mm for the abdomen sequence. Compared with ORBSLAM, DefSLAM obtained half the RMS error in the heart sequence. In the abdomen sequence, the RMS error was similar in the initially rigid section. When the deformation starts, the RMS error of ORBSLAM increases, while DefSLAM maintains a low error. When the tool crossed in front of the camera, ORBSLAM failed and could not start again due to the deformation and the lack of parallax. In contrast, DefSLAM was able to process this section, although its RMS error increased. When the tool started to manipulate the scene, DefSLAM could recover the scene deformation, achieving an RMS error similar to that of the initial section. In these experiments we initialized ORBSLAM with the stereo camera and DefSLAM with the monocular initialization proposed in Section 5.5.

Figure 14 shows the overall quality of the 3D reconstruction of the medical sequences. Supplementary material includes a video with the results in both sequences.

8 Conclusions and Future Work

We have formulated DefSLAM, the first deformable SLAM able to process monocular sequences. We have proposed to split the computation of DefSLAM into two parallel threads. The deformation tracking thread, based on SfT, is devoted to estimating the camera pose and the deformation of the scene. SfT needs a prior of the geometry of the scene encoded in the template. When exploring new zones, we have proposed to estimate new templates to cover the new areas. Our second thread, the deformation mapping, is devoted to periodically re-estimating the template to better adapt it to the currently observed scene.
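The two-thread organization summarized above can be sketched schematically as follows. This is only an illustration of the frame-rate versus keyframe-rate split: the names `SharedTemplate`, `tracking_thread`, `mapping_thread` and `run`, as well as the fixed keyframe interval, are hypothetical, and the SfT and NRSfM computations are replaced by stand-ins.

```python
import queue
import threading

KEYFRAME_EVERY = 5  # hypothetical keyframe interval, for illustration only

class SharedTemplate:
    """Template shared by both threads; `version` counts re-estimations."""
    def __init__(self):
        self.lock = threading.Lock()
        self.version = 0

def tracking_thread(frames, template, keyframes, tracked):
    # Runs at frame rate: every frame is tracked against the current template.
    for index in frames:
        with template.lock:
            version = template.version
        tracked.append((index, version))  # stand-in for SfT pose and deformation
        if index % KEYFRAME_EVERY == 0:
            keyframes.put(index)          # hand the keyframe to the mapping thread
    keyframes.put(None)                   # signal the end of the sequence

def mapping_thread(template, keyframes):
    # Runs at keyframe rate: each keyframe triggers a template re-estimation.
    while True:
        keyframe = keyframes.get()
        if keyframe is None:
            break
        with template.lock:
            template.version += 1         # stand-in for the NRSfM template update

def run(num_frames):
    template = SharedTemplate()
    keyframes = queue.Queue()
    tracked = []
    tracker = threading.Thread(
        target=tracking_thread,
        args=(range(1, num_frames + 1), template, keyframes, tracked),
    )
    mapper = threading.Thread(target=mapping_thread, args=(template, keyframes))
    tracker.start()
    mapper.start()
    tracker.join()
    mapper.join()
    return len(tracked), template.version

frames_tracked, template_updates = run(20)
```

The queue decouples the two rates: tracking never waits for a template re-estimation to finish, and simply uses whichever template version is available when a frame arrives.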

Our experiments confirm that the proposed method is able to handle real exploratory trajectories of a deforming scene. Compared with a rigid SLAM, DefSLAM is not only able to cover wider areas of the scene, but also to produce more accurate scene estimates in deformable sequences.

We have also shown in our experiments that the system is able to handle endoscopy images. The next step will be its adaptation to medical imagery to handle the challenges not taken into account in the present work, such as uneven illumination, poor visual texture, non-isometric deformations or ultra close-up shots exploring the endoluminal cavities. Another line of future work is to develop other mapping tools, such as relocalization or loop closure, to further improve robustness.


The authors are with the I3A, Universidad de Zaragoza, Spain. This research was funded by the Spanish government with the projects DPI2015-67275-P and the FPI grant BES-2016-078678.