1 Introduction
Motion estimation in 3D is a problem of great importance in computer vision, robotics, and computer graphics, playing a central role in a wide range of applications that include 3D scene reconstruction/modeling, human and object pose tracking, robot localization, augmented reality, human-computer interfaces, and deformable shape manipulation. The advent of affordable, commercial depth sensors has spurred significant research effort on 3D motion estimation from 3D input, leading to the development of RGBD algorithms for fast visual odometry
[1, 2], efficient and accurate scene flow estimation [3, 4], as well as notable SLAM systems for both static [5, 6] and dynamic [7, 8] environments.

Given the availability of 3D input, dense nonrigid registration is the most general, and a particularly challenging, motion estimation problem. In its general form, the problem can be described as computing a motion field, densely supported on the surface of a 3D shape, that deforms the latter in order to geometrically align it to another, fixed "template" shape. This process of nonrigid 3D registration shares fundamental similarities with 2D image registration, known in the computer vision community as optical flow estimation: both problems pose similar challenges in deriving formulations that lead to accurate alignment while encoding reasonable prior constraints (regularization) to overcome ill-posedness.
A classical problem variant that is closely related to 3D nonrigid registration is that of RGBD scene flow. Given a pair of images, scene flow refers to the per-pixel 3D motion of observed points in space from the first frame to the second, whereas optical flow refers to the per-pixel 2D projected motion. There have been a number of successful recent works on scene flow estimation from RGBD frame pairs, following both classical (variational) [9, 10, 11, 3, 4]
and deep learning
[12] frameworks. While of great relevance to a number of motion reasoning tasks, RGBD scene flow targets a specific instance of dense 3D motion estimation, as it inherently registers pairs of 2.5D geometries (depth maps). This hinders its application in scenarios that require alignment of fully 3D geometries, such as model-to-frame registration for dynamic reconstruction or model-to-model shape deformation.

Recently introduced dynamic reconstruction pipelines from RGBD input [7, 8, 13, 14] solve a more general problem by implementing warp field optimization algorithms for their model-to-frame registration step. Despite adopting different approaches for their model representations and surface fusion steps, they all rely on similar, point-cloud-based formulations for nonrigid registration. Scenes with dynamic topology are a challenging case for dynamic reconstruction systems: [7] and [8] make no provisions at all for these cases, while [13] and [14] deal with registration errors that occur because of dynamic topology at a subsequent stage, by discarding problematic regions and reinitializing model tracking. The fully volumetric approach of [15] does not use point representations for registration, directly aligning Signed Distance Fields (SDFs) [16] instead. While it intrinsically handles topological changes, it introduces significant scalability limitations by relying on volumetric representations. To the best of our knowledge, there exists no nonrigid point cloud registration algorithm producing warp fields that are error-free on motion boundaries induced by dynamic scene topology.
Contributions. In this paper, we introduce a complete pipeline for the nonrigid registration of general, unorganized 3D point cloud pairs, which explicitly detects topology changes between the input point sets and produces piecewise-smooth warp fields that respect the motion boundaries that result from these events. At the core of our approach lies a general warp field estimation algorithm (Section 3.3), inspired by those employed in recent dynamic reconstruction systems from RGBD input. We improve motion estimation on motion boundaries associated with topology changes in an efficient post-processing phase (Section 3.4) that exploits the different properties of warp fields that are estimated in different directions (i.e., forward and backward) with respect to different types of topological events (i.e., 'contact' or 'separation', Section 3.2). After explicitly detecting regions of topology change events by means of simple, intuitive tests of local deformation, our method blends the forward and inverted backward motion hypotheses on a local basis, based on the type and proximity of detected events, ensuring smooth, seamless hypothesis transitions on the deformed surface. This stage makes no assumptions about the underlying registration engine and can easily be adapted for integration into existing pipelines. The implementation of our warp field estimation module (without the topology event handling) is openly available as part of our point cloud processing library [17]. Furthermore, the ability to detect and classify motion boundaries associated with dynamic topology is a byproduct of our pipeline that may be useful in tasks beyond geometric registration.
2 Related Work
RGBD scene flow estimation. The term 'scene flow' was introduced in [18] to refer to "the three-dimensional motion field of points in the world, just as optical flow is the two-dimensional motion field of points in an image." Since then, significant research focus has shifted towards scene flow estimation from RGBD input. The formulation of [9] couples a robust-norm data term derived from the optical flow and range flow [19] constraints with weighted TV regularization. In [10], the authors follow a similar variational approach but use a rigid motion parameterization of the flow field, computing 6DoF per-pixel transformations and enforcing a local rigidity prior on the solution. A 6DoF local parameterization is also used in [20], which introduces a correspondence search mechanism that relies on 3D spheres rather than image plane patches and effectively handles large displacements. In [21], a probabilistic method for joint segmentation and motion estimation is proposed; a depth-based segmentation is used for motion estimation, which is in turn regularized based on the mean rigid motion of each layer. A joint segmentation and scene flow estimation method is also presented in [11], which assumes that the scene movement can be described by a small number of latent rigid motions. Starting with a spatial k-means clustering for the motion label initialization, the algorithm iterates between motion estimation and segmentation (soft labeling), merging labels in the process. In [3], the first real-time RGBD variational scene flow method is introduced, achieving state-of-the-art accuracy. An efficient joint odometry and piecewise-rigid scene flow estimation method is proposed in [4], where the scene is segmented into 'static' and 'moving' geometric clusters, from which odometry and independent nonrigid motions are computed.
As mentioned in our introduction, scene flow solves a somewhat restricted problem in the context of dense 3D registration, as the support of the computed motion field is image-bound.
Dynamic scene reconstruction. General nonrigid 3D registration algorithms have been developed in the context of online reconstruction of dynamic scenes from RGBD input. Most of them are formulated within a nonrigid Iterative Closest Point (ICP) framework, similar to the one introduced in [22], with the goal of registering a point cloud representation of the scene model to the current frame, while there also exist purely volumetric approaches [15] that align Signed Distance Field (SDF) geometry representations. DynamicFusion [7] was the first system to achieve high-quality, real-time dense reconstructions from RGBD input. While it performs volumetric (SDF) fusion [16], its warp field estimation algorithm is based on point cloud "renders" of the model geometry. The estimated warp field is defined on a sparse 'Embedded Deformation' (ED) graph [23]
, with a 6DoF transformation attached to each node, and its evaluation on arbitrary points is performed via interpolation. The registration objective consists of a point-to-plane ICP cost, coupled with an 'As-Rigid-As-Possible' (ARAP)
[24], hierarchically defined regularization term, both under robust loss functions. The nonrigid tracker of VolumeDeform
[8] does not rely on an ED graph and estimates individual 6DoF transformations for every source geometry point. Its cost function consists of a dense point-to-plane cost, a sparse point-to-point term derived from SIFT [25] correspondences, and an ARAP prior based on a 'flat' neighborhood graph, with all terms being quadratic. Fusion4D [13] combines the input of multiple range cameras for the task of dynamic reconstruction, using an ED warp field parameterization and following a similar registration objective formulation that additionally includes a 'visual hull' term. SurfelWarp [14] is a purely point-cloud-based approach that also relies on an ED motion field representation and uses the same registration costs as DynamicFusion, but under the quadratic loss function. On the other end of the spectrum, KillingFusion [15] is a purely volumetric approach that relies on direct SDF-to-SDF alignment via variational minimization under novel regularizers that enforce the motion field to be isometric (an ARAP prior analogue) and preserve level set geometry.

All of the above systems produce results of remarkable quality, especially given their real-time budget. However, with the exceptions of [13], [14], and [15], they cannot handle scenes with dynamic topology, with the 'close-to-open' case ('separation', in our terminology) being particularly problematic. As discussed in our introduction, [13] and [14] deal with these cases essentially by discarding affected regions, while the volumetric registration of [15] is inherently immune to these events. Our proposed method is the first to tackle dynamic topology within the context of motion estimation and within a scalable point-based representation framework.
3 Our Approach
3.1 Problem statement
Given a pair of unorganized 3D point sets, our goal is to estimate a warping function that nonrigidly deforms the first point cloud (source geometry) towards the second one (target) in a piecewise-smooth manner.
Let S and D be the source and target geometry point sets, respectively, and let W be a warping function that maps 3D points to 3D points. In our nonrigid alignment setting, W is required to have the following properties:

1. The image of the point set S under W, W(S), should be aligned as closely as possible to the target geometry D. Typically, this is formulated as the minimization of the sum of residuals between points in W(S) and their corresponding points (e.g., nearest neighbors) in D.

2. Local transformations of neighboring points that lie on the same moving surface in S should be similar; i.e., W should be smooth. At the same time, motion discontinuities should be preserved: neighboring points in S that lie on independently moving surfaces should be allowed to have different local transformations. This combined prior is known as piecewise smoothness.
In a typical registration objective minimization formulation, the first property is expressed by the sum of registration residuals (e.g., point-to-point and/or point-to-plane distances) over corresponding point pairs in the objective (data term), while the second one renders the otherwise under-constrained problem well-posed by introducing terms that penalize differences in local transformations of neighboring points (regularization term).
The loss function used to model the regularization penalty terms plays an important role in the behavior of the warping function in motion boundary regions. For example, it is well known from the optical flow literature that quadratic regularization tends to oversmooth motion boundaries. On the other hand, robust loss functions (e.g., L1-norm approximations for the penalty terms) are more effective in producing piecewise-smooth motion fields that preserve discontinuities.
In this work, we focus on estimating warp fields that respect motion boundaries resulting from changes in scene topology. Our notion of topology is directly derived from object-level connectivity: a change in scene topology can occur either when two or more separate objects come into contact, or when two or more initially connected objects separate. We note that we use the term 'object' simply to refer to independently moving scene surface regions, without attaching any semantic meaning to them.
In the following, we show that simply adopting a robust loss function for regularization still produces visible warping artifacts in motion discontinuity regions that result from scene topology changes, and we present a complete registration pipeline that effectively and efficiently solves this problem.
3.2 Motivation and overview of our approach
Motion estimation errors on motion boundaries typically manifest as oversmoothing of the warp field because of excessive regularization, and can be suppressed by eliminating regularization penalty terms for points in S that lie on different sides of the discontinuity. However, without any knowledge about S and its motion (e.g., some form of segmentation into independently moving objects), we cannot obtain a "correct" regularization graph a priori. Instead, the common choice is to use a kNN graph of the points in S to define the regularization terms. It is easy to see that this choice is particularly problematic in cases where connected objects in S move apart in D, as kNN regularization over S will introduce penalty terms that relate points that lie on different objects, resulting in some amount of motion field oversmoothing over the separation boundary.
Such a challenging scenario that involves object 'separations' is depicted in Fig. (a), where the general, topology-agnostic warp field estimation algorithm described in Section 3.3 was used to nonrigidly align two RGBD frames. Despite the fact that the algorithm's regularization term is formulated based on the robust, discontinuity-preserving Huber loss function (see Section 4.1 for parameter details), the registration result (warped source geometry) shows visible artifacts near the object separation areas. Quadratic regularization is known to induce even more excessive smoothing on motion boundaries. Since quadratic and L1-norm approximation regularization are the most commonly used types in the literature, most current nonrigid registration algorithms are expected to exhibit very similar behavior on these types of motion boundaries.
At the same time, it is clear that any change in topology between S and D will be directly reflected in the (different) nearest-neighbor graph structures of the source and target geometries. We exploit this fact in the following way. Consider the case where two or more objects that are connected in S become separate in D. As discussed above, estimating the warp field W_S that aligns S to D using the source geometry's kNN graph to define the regularization penalties is expected to result in some amount of oversmoothing over the motion boundary of the separation. However, in the backward motion direction (from D to S), the same topology change manifests as a connection of separate objects. Estimating the backward warp field W_D that aligns D to S using the target (D) geometry's kNN graph to define the regularization penalties should not exhibit any oversmoothing over the connection boundary, because the corresponding kNN graph edges that would define regularization terms over the discontinuity are not there in the first place. Inverting the warping function W_D yields another forward warp field hypothesis, Ŵ_D, that will be free of oversmoothing over separation motion boundaries. Of course, the latter, being derived from W_D, may suffer from oversmoothing over motion boundaries that correspond to object separations in the backward motion direction (from D to S), or, equivalently, to objects coming into contact from S to D. These cases are expected to be handled correctly in the first place by the standard forward warp, W_S.
Based on the above observations, the standard forward warp W_S is expected to exhibit good behavior over contact boundaries, but to oversmooth separation boundaries. On the other hand, the inverted backward warp Ŵ_D is expected to behave the opposite way, preserving separation discontinuities, but possibly blurring motion estimates in contact areas. Our proposed registration pipeline builds upon this idea by first detecting regions in S that are likely to be contact or separation boundaries, and then locally blending the warp hypotheses W_S and Ŵ_D accordingly, in a seamless manner. The final result is a piecewise-smooth warp field that aligns S to D and respects motion boundaries caused by changes in scene topology.
An overview of our approach is provided in Fig. 4. The (topologyagnostic) warp field estimation algorithm used to obtain the initial forward and backward warp hypotheses is described in detail in Section 3.3. Our topology event detection mechanism, as well as our local hypothesis blending approach, are presented in Section 3.4.
3.3 Warp field estimation
We implement our warp field estimation algorithm within a nonrigid Iterative Closest Point (ICP) framework [22], similarly to the nonrigid trackers used in the recently introduced dynamic reconstruction pipelines of [7], [8], and [14].
Given the source and target geometries S and D, as well as an initial estimate of the unknown warp field W (usually taken as the identity warp), the algorithm iteratively refines the latter until convergence has been reached. At the top level, the process iterates between a point correspondence search step between the warped source (according to the current estimate) and D, and a warp field optimization step that updates W given the established point correspondences (Algorithm 1). The two algorithm phases are presented in detail in the following.
3.3.1 Correspondence association
Our framework supports two complementary types of point correspondences between the (warped) source and the target geometries: dense correspondences that are established based on spatial point proximity, and sparse correspondences that result from keypoint matching. Each individual correspondence is represented as a pair of point indices (i, j), whose first component indexes a point s_i in S and whose second indexes a point d_j in D.
We support two mechanisms to establish dense correspondences. By default, we assume that both S and D have a fully 3D structure, and we establish dense correspondences by finding the nearest neighbor in D, in terms of Euclidean distance, of each (warped) point in S, with the search being performed efficiently by parallel kd-tree queries. For certain applications that only require frame-to-frame (2.5D-to-2.5D) or model-to-frame (3D-to-2.5D) registration, we can further speed up the process by obtaining projective correspondences. This amounts to projecting the warped source points onto the target frame image and extracting correspondences based on points that are projected to the same pixel. This is the mechanism adopted in most real-time reconstruction pipelines [5, 7, 8].
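To make the default mechanism concrete, the following Python/NumPy sketch establishes dense correspondences with a brute-force nearest-neighbor search (a stand-in for the parallel kd-tree queries described above; function and variable names are our own, not from the paper's implementation):

```python
import numpy as np

def dense_correspondences(warped_src, tgt):
    """Return one (i, j) index pair per warped source point, where j is the
    Euclidean nearest neighbor of point i in the target point set.

    Brute-force O(N*M) search; a kd-tree (as in the text) scales better.
    """
    # Pairwise squared distances between warped source and target points.
    d2 = ((warped_src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2)
    return [(i, int(j)) for i, j in enumerate(d2.argmin(axis=1))]
```

A kd-tree (or, for 2.5D targets, the projective association described above) would replace the quadratic-cost distance matrix in any practical implementation.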
In many common situations, dense geometric/depth correspondences alone are not enough to disambiguate the underlying motion. For example, tracking points on flat surfaces that lack geometric texture and slide parallel to each other may exhibit drift. Establishing robust keypoint correspondences between the source and target geometries can effectively mitigate this problem. We assume that our input geometries are equipped with sparse interest points; our sparse correspondences are established from interest point descriptor matches between S and D. In our implementation, we focus on input geometries that are either RGBD frames or 3D reconstructions from RGBD input. The availability of regular images along with (registered) geometry allows us to adopt SIFT keypoint [25] matches (lifted to 3D) for our sparse correspondences.
To make optimization more stable, we discard correspondence candidates from the above mechanisms that do not meet some basic proximity and local similarity criteria. Let n_s, c_s and n_d, c_d be the surface normals and colors (e.g., RGB value 3-vectors) of the source and target geometries, indexed in the same way as their support points in S and D. A correspondence candidate (i, j) is considered valid and used in the optimization if all of the following hold:
1. ‖s_i − d_j‖ < ε_d (point proximity);

2. angle(n_{s_i}, n_{d_j}) < ε_n (normal similarity);

3. ‖c_{s_i} − c_{d_j}‖ < ε_c (color similarity).

In the above, ε_d is a point distance threshold, ε_n is a normal angle threshold, and ε_c is a color "distance" threshold.
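In code, the three validity tests can be sketched as below (Python/NumPy); the threshold defaults are illustrative assumptions, not the paper's actual parameter values:

```python
import numpy as np

def filter_correspondences(src, tgt, n_src, n_tgt, c_src, c_tgt, pairs,
                           eps_d=0.05, eps_n=np.deg2rad(30.0), eps_c=0.2):
    """Keep only candidate pairs (i, j) passing the point distance, normal
    angle, and color distance tests. Thresholds are illustrative defaults."""
    valid = []
    for i, j in pairs:
        if np.linalg.norm(src[i] - tgt[j]) >= eps_d:
            continue  # point distance test failed
        cos_angle = np.clip(np.dot(n_src[i], n_tgt[j]), -1.0, 1.0)
        if np.arccos(cos_angle) >= eps_n:
            continue  # normal angle test failed
        if np.linalg.norm(c_src[i] - c_tgt[j]) >= eps_c:
            continue  # color "distance" test failed
        valid.append((i, j))
    return valid
```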
3.3.2 Warp field optimization
Given a set of dense and sparse point correspondences, we shall now describe our warp field optimization step. Modeling the warp field using locally affine [22] or locally rigid [7] transformations provides better motion estimation results than adopting a simple translational local model, due to more effective regularization. In our approach, we adopt a locally rigid (6DoF) model.
Instead of computing a unique rigid transformation for each point in S, we use the more efficient embedded deformation graph representation [23] for the warp field W, similarly to [7] and [13]. Let G = {(g_j, σ_j, T_j)}, j = 1, ..., m, be the set of virtual deformation nodes, where g_j is the position of the j-th node, σ_j is a radius parameter that controls the j-th node's area of effect, and T_j is the 6DoF rigid transformation attached to the j-th node. The deformation node positions are obtained by downsampling the source geometry by means of a voxel grid of bin size d_g. A reasonable choice for σ_j that ensures sufficient area-of-effect overlap among neighboring nodes is a fixed multiple of d_g. As in [8], each local transformation is parameterized during optimization by a 6D vector θ_j of 3 Euler angles and 3 translational offsets. The effect of the warp field W, represented by G, on a point x is given by interpolating the local node deformations in the neighborhood of x. Let N(x) be the indices of the k nearest neighbors of x in {g_1, ..., g_m}. The local transformation parameter vector at x is given by:
θ(x) = Σ_{j ∈ N(x)} w_j(x) θ_j,   (1)

where w_j(x) ∝ exp(−‖x − g_j‖² / (2σ_j²)), normalized so that the weights over N(x) sum to one. The image of x via W is then:

W(x) = R(θ(x)) x + t(θ(x)),   (2)

where R(·) and t(·) extract the rotation matrix and translation vector from our 6D parameterization.
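A minimal sketch of the interpolation in equations (1) and (2) follows; the Gaussian weight form, the Euler angle convention, and all helper names are our own assumptions:

```python
import numpy as np

def euler_to_rt(theta):
    """Rotation matrix and translation from a 6D parameter vector
    (3 Euler angles, 3 offsets); one possible angle convention."""
    ax, ay, az = theta[:3]
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx, theta[3:]

def interp_params(x, nodes, thetas, sigmas, k=4):
    """Eq. (1): blend the parameter vectors of the k nearest nodes with
    normalized Gaussian weights w_j ~ exp(-||x - g_j||^2 / (2 sigma_j^2))."""
    d2 = ((nodes - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]
    w = np.exp(-d2[idx] / (2.0 * sigmas[idx] ** 2))
    w /= w.sum()
    return (w[:, None] * thetas[idx]).sum(axis=0)

def warp_point(x, nodes, thetas, sigmas, k=4):
    """Eq. (2): apply the interpolated local transformation to x."""
    R, t = euler_to_rt(interp_params(x, nodes, thetas, sigmas, k))
    return R @ x + t
```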
We note that the above 6D parameterization is only used within optimization (line 5 of Algorithm 1) and that both the estimated incremental warp and the final composite estimate have their node transformations expressed in terms of transformation matrices. The fact that we continuously warp S and compute incremental warps starting from the identity warp, combined with the smoothness prior imposed on the warp field (shown below), allows us to overcome any problems associated with Euler angle parameterizations of rotation.
Our registration objective, as a function of the unknown warp field W (which, in the context of Algorithm 1, is the incremental warp) and the point correspondences between the warped source and D, which are fixed for this step, is formulated as:
E(W) = E_data(W) + λ E_reg(W).   (3)

Our data term, E_data, is a weighted sum of a point-to-plane and a point-to-point ICP cost:

E_data(W) = w_plane E_plane(W) + w_point E_point(W).   (4)
Pure point-to-plane metric optimization generally converges faster and to better solutions than pure point-to-point [26], and is the standard trend in recent rigid [5, 6] and nonrigid [7, 8, 14] registration pipelines. However, as discussed in Section 3.3.1, integrating a point-to-point term for robust point matches into the registration cost can effectively disambiguate motion estimation in cases where surfaces that lack geometric texture slide parallel to each other. Similarly to [8], we use our dense geometric correspondences to define our point-to-plane cost and our sparse keypoint correspondences for our point-to-point cost:
E_plane(W) = Σ_{(i,j) ∈ C_dense} ( n_{d_j}ᵀ (W(s_i) − d_j) )²,   (5)

E_point(W) = Σ_{(i,j) ∈ C_sparse} ‖W(s_i) − d_j‖².   (6)
Our regularization term directly penalizes differences in the transformation parameters of neighboring virtual nodes of G under the robust Huber loss function. If N(j) is the set of indices of the k nearest neighbors of node g_j in {g_1, ..., g_m}, our regularization term is formulated as:

E_reg(W) = Σ_{j=1}^{m} Σ_{j' ∈ N(j)} w_{jj'} ρ_δ(θ_j − θ_{j'}),   (7)

where w_{jj'} weights the pairwise penalties based on node distance, and ρ_δ denotes the sum of the Huber loss function values over the 6 parameter residual components. The parameter δ controls the point at which the loss function behavior switches from quadratic (squared L2 norm) to absolute-linear (L1 norm). Since L1-norm regularization is known to better preserve solution discontinuities, we use a small value of δ. Given our locally rigid motion model, this regularization scheme enforces an 'As-Rigid-As-Possible' [24] prior on the estimated warp field.
The registration objective of equation (3) is nonlinear in the unknowns. We minimize it by performing a small number of Gauss-Newton iterations. As in [13], we handle non-quadratic terms using the square-rooting technique of [27]. At every step, we linearize around the current solution θ (the vector concatenation of all node transformation parameters θ_j) and obtain a solution increment Δθ by solving the system of normal equations JᵀJ Δθ = Jᵀr, where J is the Jacobian matrix of the residual terms in E and r is the vector of (negative) residual values. We solve this sparse system iteratively, using the Conjugate Gradient algorithm with a diagonal preconditioner.
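A dense stand-in for one such Gauss-Newton update, solving the normal equations with a Jacobi (diagonal) preconditioned conjugate gradient loop, might look as follows (the actual system is sparse and solved in parallel; all names are illustrative):

```python
import numpy as np

def gauss_newton_step(J, r, iters=100, tol=1e-12):
    """One Gauss-Newton update: solve (J^T J) dx = -J^T r for the parameter
    increment using Jacobi-preconditioned conjugate gradients."""
    A = J.T @ J
    b = -J.T @ r
    Minv = 1.0 / np.maximum(np.diag(A), 1e-12)  # diagonal preconditioner
    x = np.zeros_like(b)
    res = b.copy()
    z = Minv * res
    p = z.copy()
    rz = res @ z
    for _ in range(iters):
        if np.linalg.norm(res) < tol:
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        res = res - alpha * Ap
        z = Minv * res
        rz_new = res @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

In a real pipeline, A is never formed densely; sparse Jacobian products are evaluated on the fly.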
3.4 Handling topology changes
In our post-processing phase for handling topology change motion boundaries, we first explicitly detect likely contact and separation regions in the source geometry, and proceed by appropriately blending our (default) forward and inverted backward warp hypotheses in a local yet seamless manner. It follows from the discussion in Section 3.2 that, as far as topology changes are concerned, the forward warp is only problematic in separation areas. However, instead of only focusing on and amending separations, it is beneficial to also explicitly consider contact events. As will become clear in the following, considering both types of events and treating them symmetrically robustifies their detection and allows for a less biased hypothesis blending scheme.
Using the same notation as in Section 3.2, let W_S be the warp field that aligns S to D and W_D the backward warp that aligns D to S, both computed by Algorithm 1. We will be using the 'S' subscript for forward (S to D) motion entities and the 'D' subscript for backward (D to S) ones. For the needs of the following discussion, we will consider these motion fields to be represented by the per-point local rigid transformations of their support geometries, so that:
W_S(s_i) = R_{S,i} s_i + t_{S,i},  s_i ∈ S,   (8)

W_D(d_i) = R_{D,i} d_i + t_{D,i},  d_i ∈ D.   (9)
To invert W_D in order to obtain an alternative forward warp, we appropriately rebase its inverse local transformations on S. To that end, we first compute the target geometry's image W_D(D) (which should be closely aligned to S) and, to each point s_i ∈ S, we assign the inverse of the transformation (R_{D,j}, t_{D,j}), where j is the index of the nearest neighbor of s_i in the point set W_D(D). Of course, we assume that the latter is indexed in the same way as D. The inverted backward warp Ŵ_D is then represented as:
Ŵ_D(s_i) = R_{D,j}ᵀ (s_i − t_{D,j}),   (10)

where s_i ∈ S and j is the nearest neighbor index of s_i in W_D(D). Analogously, we obtain an alternative backward warp, Ŵ_S, by inverting our forward hypothesis:

Ŵ_S(d_i) = R_{S,j}ᵀ (d_i − t_{S,j}),   (11)

where d_i ∈ D and j is the nearest neighbor index of d_i in W_S(S). To summarize, we have two forward motion (S to D) warp hypotheses (W_S and Ŵ_D) and two backward motion (D to S) ones (W_D and Ŵ_S).
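The rebasing-based inversion of equation (10) can be sketched as follows, with the backward warp given as per-point rotations R_d and translations t_d on the target geometry (names are our own):

```python
import numpy as np

def invert_warp(src, tgt, R_d, t_d):
    """Rebase the inverse of a backward warp onto the source points.

    The backward warp maps each target point d_i to R_d[i] @ d_i + t_d[i],
    which should land near the source. Each source point then inherits the
    inverse transform of its nearest warped target point.
    """
    warped_tgt = np.einsum('nij,nj->ni', R_d, tgt) + t_d
    d2 = ((src[:, None, :] - warped_tgt[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    R_inv = np.transpose(R_d[nn], (0, 2, 1))          # R^T
    t_inv = -np.einsum('nij,nj->ni', R_inv, t_d[nn])  # -R^T t
    return R_inv, t_inv
```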
3.4.1 Detecting topology change events
We detect topology change regions in S based on how our warp estimates affect local neighborhoods of the source geometry.
Naturally, we expect that if a point of S is close to a separation boundary, its distance to some of its neighbors in S should increase after applying the correct warp to S. We shall refer to this effect as neighborhood stretching. The dual case of a contact event manifests in exactly the same way in the backward motion direction (stretching of neighborhoods of D), in which the event is perceived as a separation. In the following, we will use a local measure of stretch over points in S to detect separation areas, and map the same measure over D in the backward direction onto S to obtain a dual measure of "compression" that will allow us to detect contacts.
We quantify the above intuition by defining a local "stretch" operator for point x under the warp W as the maximum ratio of the distances to its neighbors after and before applying W:
str(x; W) = max_{y ∈ N_r(x)} ‖W(x) − W(y)‖ / ‖x − y‖,   (12)
where N_r(x) contains the neighbors of x in its point set that lie within distance r from it. The choice of the neighborhood radius value r depends on the scale and resolution of the input geometries; for close-range point clouds acquired with Kinect-like cameras, we use a fixed value.
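Equation (12) admits a direct implementation; the sketch below uses a brute-force radius search (function and variable names are assumed):

```python
import numpy as np

def local_stretch(points, warped, i, radius):
    """Eq. (12): max ratio of post-warp to pre-warp distance from point i
    to its radius-neighbors; values well above 1 indicate local stretching."""
    d_before = np.linalg.norm(points - points[i], axis=1)
    mask = (d_before > 0) & (d_before < radius)
    if not mask.any():
        return 1.0  # no neighbors in range: treat as no stretch
    d_after = np.linalg.norm(warped - warped[i], axis=1)
    return float((d_after[mask] / d_before[mask]).max())
```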
To each point s_i in S, we associate one stretch value for each of our two forward warp hypotheses, according to definition (12):

str_S(s_i) = str(s_i; W_S),   (13)

str_D(s_i) = str(s_i; Ŵ_D).   (14)

We also compute the local stretch of the target geometry under each of the two backward warps:

str_S(d_i) = str(d_i; Ŵ_S),   (15)

str_D(d_i) = str(d_i; W_D),   (16)

which we subsequently map onto S, interpreting the values as compression measures (contact indicators), according to:

com_S(s_i) = str_S(d_j),   (17)

com_D(s_i) = str_D(d_j),   (18)

where j is the index of the nearest neighbor of s_i in the point set W_D(D).
Using the above pointwise stretch/compress values on S, we extract subsets of the source geometry that are likely to lie on topology change motion boundaries. Let Sep and Con be the sets of candidate separation and contact boundary points, respectively. According to the above discussion, points on a separation boundary are expected to have high stretch scores, while points on a contact boundary should exhibit high local compression. To decide whether a point in S is a boundary candidate, we perform two symmetric tests per case that rely on two threshold values: an absolute score threshold t_abs and a relative (ratio) threshold t_rel. A point of S is a member of Sep (Con) if and only if the maximum of its two stretch (compress) scores is greater than t_abs and also greater than t_rel times its maximum compress (stretch) score. The process is summarized in Algorithm 2. A sample output is shown in Fig. (b) (left), marked in red; note that Con is empty in this case.
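The two symmetric tests can be sketched as a per-point filter (threshold values below are illustrative, not the paper's):

```python
def classify_events(stretch, compress, t_abs=1.5, t_rel=1.2):
    """Per-point separation/contact candidate tests: a point is a separation
    (contact) candidate if its stretch (compress) score exceeds the absolute
    threshold and also t_rel times its compress (stretch) score.

    stretch/compress hold the per-point maxima over the two hypotheses.
    """
    sep, con = [], []
    for i, (s, c) in enumerate(zip(stretch, compress)):
        if s > t_abs and s > t_rel * c:
            sep.append(i)
        if c > t_abs and c > t_rel * s:
            con.append(i)
    return sep, con
```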
As we will show in Section 4.3, the above procedure is very effective at detecting and classifying topology changes. However, because of the continuous nature of our local stretch/compression measures and depending on the selected threshold values, it may produce "false positives" (e.g., in areas of actual surface stretching or compression under constant topology). Under the assumption that our two forward warp hypotheses behave similarly in the false positive areas, and as will become clear in the next section, this does not affect our final warp estimate.
3.4.2 Local hypothesis blending
Our blending scheme produces a topology-aware warp field by combining the forward warp hypotheses W_S and Ŵ_D on a per-point basis. Our objective is to assign a higher weight to Ŵ_D (inverted backward warp) near separation areas, and to ensure that W_S (forward warp) has a stronger weight near contact areas. At the same time, it is desirable that point weights vary smoothly on S, so that our warp blending does not introduce seam artifacts due to differences in our original warp hypotheses.
We achieve the above by attaching a smoothly decaying kernel to each of our detected event points in Con and Sep and locally accumulating the weight for each event class. Assuming a maximum radius of effect r_e (free parameter) for our event points, we model the influence of each event with an RBF kernel of bandwidth proportional to r_e. The weights w_S(s_i) and w_D(s_i) of W_S and Ŵ_D for the source point s_i are computed by accumulating the influences of the event points in Con and Sep, respectively, that are within a radius r_e from s_i:
w_S(s_i) = (1/Z_i) (1 + Σ_{p ∈ Con_r(s_i)} K(s_i, p)),   (19)

w_D(s_i) = (1/Z_i) Σ_{q ∈ Sep_r(s_i)} K(s_i, q),   (20)

where Z_i is a normalizing constant ensuring that w_S(s_i) + w_D(s_i) = 1, and Con_r(s_i), Sep_r(s_i) index the radius-r_e neighbors of s_i in Con and Sep, respectively. In the absence of any topology event influence (e.g., for points at distance at least r_e from any event point), the above defaults to w_S(s_i) = 1 and w_D(s_i) = 0, giving full weight to the standard forward hypothesis W_S. The local transformation of our final, topology-aware warp field estimate at s_i is then given by blending the local transformations T_{S,i} of W_S and T'_{D,i} of Ŵ_D:

T(s_i) = Ψ( w_S(s_i) T_{S,i} + w_D(s_i) T'_{D,i} ),   (21)
where Ψ(·) converts the linear blend of the two transformation matrices back to a valid transformation matrix. The complete blending process is summarized in Algorithm 3 (see also equations (8) and (10)). A visualization of the blending weights is given in Fig. (b) (middle), where each source geometry point is colored according to its inverted backward warp hypothesis weight, using a 'blue-to-red' colormap.
We note that, for source points close to topology events, one of $w_f$ and $w_b$ will dominate the other, essentially rendering the blending of (21) a binary selection. As we move farther from topology events, the two weights may assume comparable values. For our blended output (21) to be seamless and error-free in that case, $\mathcal{W}_f$ and $\mathcal{W}_b^{-1}$ should not differ too much in the areas that are “far enough” from event points. This highlights the importance of the parameter $r_e$; we have found that setting it proportionally to $d$, the resolution of our virtual deformation graph (Section 3.3.2), works well in practice.
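To make the blending step concrete, the following numpy sketch implements one plausible reading of the weight accumulation described above: RBF influences of contact and separation event points, truncated at the radius of effect, normalized so that the two weights sum to one and default to full forward weight far from events. Function and variable names are ours, and the exact normalization is an assumption, not the paper's authoritative definition.

```python
import numpy as np

def blend_weights(points, con_events, sep_events, r_e, sigma):
    """Per-point blending weights for the forward (w_f) and inverted
    backward (w_b) warp hypotheses, from contact/separation event points.

    points:     (N, 3) source geometry points
    con_events: (C, 3) detected contact event points
    sep_events: (S, 3) detected separation event points
    r_e:        radius of effect of an event point
    sigma:      RBF kernel bandwidth
    """
    def rbf_influence(pts, events):
        if len(events) == 0:
            return np.zeros(len(pts))
        # squared distances from every point to every event point
        d2 = ((pts[:, None, :] - events[None, :, :]) ** 2).sum(-1)
        k = np.exp(-d2 / (2.0 * sigma ** 2))
        k[d2 > r_e ** 2] = 0.0          # events only act within radius r_e
        return k.sum(axis=1)

    s_con = rbf_influence(points, con_events)
    s_sep = rbf_influence(points, sep_events)
    Z = 1.0 + s_con + s_sep             # normalizer; 1 far from any event
    w_f = (1.0 + s_con) / Z             # forward warp weight (contacts)
    w_b = s_sep / Z                     # inverted backward weight (separations)
    return w_f, w_b
```

Note how a point far from all events gets $w_f = 1$, while a point sitting on a separation event point gets equal weight for both hypotheses, with the separation weight growing as more separation events accumulate nearby.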
As a concluding remark, we observe that the most costly operation of our post-processing phase is the calculation, based on radius neighborhoods, of the stretch/compression values of Section 3.4.1. Algorithm 3 also performs radius queries on the point sets Con and Sep, but the latter are typically very small in size. In our experience, the overall running time of the entire phase is significantly smaller than that of a single warp field estimation. Furthermore, in the case of RGBD input, image structure can easily be exploited to accelerate the extraction of point neighborhoods.
4 Experiments
We conduct two sets of experiments for the evaluation of our registration pipeline. The first one is performed on a public dataset and examines our algorithm’s motion estimation accuracy, both with and without the topology handling phase (Section 4.2). For the second one, we use a custom dataset with topology event annotations and evaluate our event detection performance, as well as our estimated warp field quality in the presence of separation events (Section 4.3).
4.1 General setup details
For both sets of experiments, the input is RGBD data, either synthetic (first set) or captured by a Kinect-like camera (second set).
Point cloud generation. RGBD frames are converted to point clouds equipped with surface normal and color information, as well as a sparse set of interest points derived from SIFT features. In all cases, the full resolution of the input depth map is used, which is 1024×436 for the synthetic (Sintel) sequences and the sensor's native depth resolution for the camera data. We use a fixed maximum depth cutoff for all sequences in the first set, and vary the cutoff value for the camera data, depending on the sequence. For normal estimation, we use k-nearest-neighbor neighborhoods in our first set of experiments, and radius neighborhoods in our second set. SIFT keypoints are extracted from the RGB images and lifted to 3D, discarding the ones that lie on depth boundaries.
Warp field estimation. In the correspondence association step (Section 3.3.1) of our nonrigid ICP algorithm, we set a fixed maximum correspondence distance for each set of experiments, and use common thresholds for the maximum normal angle and color difference (colors are RGB triplets in $[0,1]^3$). The embedded deformation graph $G$ for our warp field parameterization (Section 3.3.2) has a fixed resolution $d$, which also controls each node's area of effect. To evaluate the local transformation for each point in the source geometry (equation (1)), we use its 4 nearest neighbors in $G$. The point-to-point weight in our data term (4) is kept fixed. To favor $\ell_1$-norm behavior by our regularization term (7), we use a small Huber loss parameter value, and keep the term's weight fixed (equation (3)). Regularization topology is given by the 6 nearest neighbor neighborhoods of the nodes of $G$. We perform a maximum of 10 top-level ICP iterations, while the process typically converges in fewer. Within each optimization step, we perform a maximum of 5 Gauss-Newton iterations.
Topology handling. In all experiments, local stretch is computed on fixed-radius neighborhoods (Section 3.4.1). To detect and classify topology change events, we use an absolute score threshold and a relative ratio threshold. For our blending step (Section 3.4.2), we assume that every detected event has the same radius of effect $r_e$.
Table I: Frame-wise EPE and AE over entire sequences (median / mean of the per-frame-pair averages).

| Sequence | PD-Flow EPE (med / mean) | F-Warp EPE (med / mean) | FB-Warp EPE (med / mean) | PD-Flow AE (med / mean) | F-Warp AE (med / mean) | FB-Warp AE (med / mean) |
|---|---|---|---|---|---|---|
| alley_1 | 0.774 / 0.964 | 0.485 / 0.519 | 0.485 / 0.516 | 7.005 / 7.319 | 8.256 / 8.379 | 8.285 / 8.361 |
| ambush_5 | 12.651 / 21.172 | 1.714 / 7.017 | 1.639 / 6.411 | 30.435 / 29.505 | 5.558 / 6.920 | 5.262 / 6.386 |
| ambush_6 | 25.467 / 28.981 | 4.409 / 7.535 | 5.106 / 8.028 | 37.813 / 39.470 | 7.165 / 7.520 | 7.083 / 7.457 |
| ambush_7 | 1.861 / 4.969 | 0.633 / 0.925 | 0.584 / 0.891 | 11.517 / 27.146 | 11.377 / 10.885 | 11.270 / 10.763 |
| bamboo_1 | 1.104 / 1.221 | 0.677 / 0.712 | 0.677 / 0.713 | 8.159 / 8.560 | 5.517 / 5.705 | 5.517 / 5.713 |
| bamboo_2 | 0.889 / 2.628 | 0.687 / 1.027 | 0.686 / 1.014 | 7.134 / 11.924 | 8.315 / 8.729 | 8.357 / 8.669 |
| bandage_1 | 1.683 / 2.160 | 0.973 / 0.957 | 0.936 / 0.935 | 15.201 / 15.244 | 10.111 / 9.829 | 9.862 / 9.681 |
| bandage_2 | 0.769 / 1.062 | 0.329 / 0.334 | 0.330 / 0.331 | 9.035 / 10.754 | 6.525 / 6.554 | 6.523 / 6.542 |
| shaman_2 | 0.286 / 0.317 | 0.294 / 0.284 | 0.292 / 0.284 | 5.571 / 5.643 | 7.801 / 7.838 | 7.779 / 7.831 |
| shaman_3 | 0.485 / 0.552 | 0.178 / 0.291 | 0.179 / 0.298 | 8.125 / 8.115 | 3.155 / 4.559 | 3.153 / 4.604 |
| sleeping_1 | 0.502 / 0.514 | 0.149 / 0.541 | 0.149 / 0.541 | 5.939 / 6.110 | 1.954 / 6.487 | 1.954 / 6.488 |
| sleeping_2 | 0.146 / 0.143 | 0.170 / 0.171 | 0.170 / 0.171 | 2.508 / 2.520 | 3.352 / 3.335 | 3.352 / 3.335 |
| Overall | 0.701 / 4.122 | 0.488 / 1.379 | 0.487 / 1.336 | 7.899 / 13.009 | 6.826 / 7.213 | 6.815 / 7.136 |
4.2 Motion estimation accuracy evaluation
Due to the lack of publicly available datasets with ground truth dense 3D motion, we perform our accuracy assessments on MPI Sintel [28], a synthetic optical flow evaluation dataset. The dataset contains multiple sequences of (typically) 50 frames that capture motions ranging from slow, almost rigid to very large, highly nonrigid ones. In addition to ground truth optical flow, metric ground truth depth and camera intrinsics are provided, which we use to emulate RGBD input.
We base our evaluation on two classical optical flow performance measures, the end-point error (EPE) and the angular error (AE) [29]. If $\mathbf{u}$ is an optical flow estimate at a given pixel whose ground truth value is $\mathbf{u}^{gt}$, EPE is computed as:
(22)  $\mathrm{EPE} = \|\mathbf{u} - \mathbf{u}^{gt}\|$
The angular error AE is defined as the angle between the 3D space-time vectors $(\mathbf{u}, 1)$ and $(\mathbf{u}^{gt}, 1)$, as:
(23)  $\mathrm{AE} = \arccos\left( \dfrac{\mathbf{u} \cdot \mathbf{u}^{gt} + 1}{\sqrt{\|\mathbf{u}\|^2 + 1}\,\sqrt{\|\mathbf{u}^{gt}\|^2 + 1}} \right)$
effectively enabling evaluation at pixels of zero flow. We convert 3D motion estimates to optical flow by first warping the source points in 3D and then computing the 2D point/pixel displacements as differences of the endpoints projected onto the image plane.
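As an illustration, both error measures and the 3D-to-2D flow conversion can be sketched in a few lines of numpy. This is our own minimal implementation, not the authors' code; pinhole intrinsics (fx, fy, cx, cy) are assumed for the projection.

```python
import numpy as np

def flow_from_3d_motion(points, warped_points, fx, fy, cx, cy):
    """Project 3D source points and their warped counterparts with pinhole
    intrinsics (fx, fy, cx, cy) and return per-point 2D flow (assumes z > 0)."""
    def project(p):
        return np.stack([fx * p[:, 0] / p[:, 2] + cx,
                         fy * p[:, 1] / p[:, 2] + cy], axis=1)
    return project(warped_points) - project(points)

def epe(u, u_gt):
    """End-point error: Euclidean distance between flow vectors (eq. 22)."""
    return np.linalg.norm(u - u_gt, axis=-1)

def ae(u, u_gt):
    """Angular error in degrees: angle between the space-time vectors
    (u, 1) and (u_gt, 1) (eq. 23)."""
    num = (u * u_gt).sum(-1) + 1.0
    den = np.sqrt((u ** 2).sum(-1) + 1.0) * np.sqrt((u_gt ** 2).sum(-1) + 1.0)
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
```

The `clip` guards against arccos domain errors from floating-point round-off when the two flows are identical.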
We evaluate and compare three methods: PD-Flow [3], a state-of-the-art scene flow algorithm; F-Warp, our general warp field estimation algorithm defined in Section 3.3 (without the topology change handling phase); and FB-Warp, our complete topology-aware warp field estimation pipeline. We run the three algorithms on the entire duration of 12 Sintel sequences and compute the average EPE and AE values per consecutive frame pair. We report the median and mean frame-level average errors over each sequence in Table I. Median values are not easily affected by extreme values, often providing a better picture of how accurate estimation is “half of the time”. We also report overall mean and median error values for each method, computed over the total number of frames from all sequences.
Our FB-Warp method achieves the highest overall accuracy in terms of both error metrics, followed closely by our baseline, F-Warp. PD-Flow is very accurate in estimating slow motions (e.g., in the sleeping_2 sequence), but falls behind in most cases, producing particularly large errors in sequences that contain very fast motions, such as ambush_5 and ambush_6. We also refer the reader to the Sintel-based evaluation of MC-Flow [11] (Table 2 of that paper), another state-of-the-art scene flow algorithm with significantly better performance than PD-Flow in estimating large motions. We were unable to evaluate MC-Flow ourselves, because its implementation has not been released. While 6 Sintel sequences are common between our experiments and those in [11], there are significant differences between our evaluation setups: 1) their reported EPE and AE values are computed on one specific frame pair per sequence, not over whole sequences as here; 2) they downsample their input to half its original resolution per dimension (to 512×218), whereas we use full resolution images; and 3) they consider only non-occluded pixels, while we compute errors on all valid ones. The above prohibit reaching definitive conclusions. However, the fact that we adopt an arguably more difficult evaluation strategy (at full resolution, which implies a smaller pixel size for EPE interpretation, and evaluating over whole sequences) and still obtain EPE and AE absolute values comparable to the ones reported in [11] (averages of 1.203 and 6.559, respectively), leads us to argue that FB-Warp compares favorably to MC-Flow.
4.3 Topology change event handling
To evaluate the performance of the topology-handling phase of our pipeline, we collected, using a Kinect-like camera, an RGBD dataset of 7 sequences that contain changes in scene topology. The regions of visible topology changes were manually annotated on the color image as binary masks, drawn in freehand mode, and classified as either contacts or separations. Our collected sequences capture the following diverse set of actions:

- Hand-clapping (clap sequence)
- Fast hand-drumming on a desk (drum sequence)
- Two pick-and-place actions on objects lying on a flat surface or on top of each other (stack and unstack sequences)
- A separation of two touching objects using both hands (separate sequence)
- Two drawer opening action sequences (top_drawer and bot_drawer)
The annotated ground truth events capture all visible instances of hand-hand, hand-object, and object-object interaction. Snapshots of our sequences are shown in Fig. 5 (top two rows).
Our evaluation of this phase is twofold. First, we assess how well our detected event points in Con and Sep (Section 3.4.1) relate to the ground truth events. Second, we compare the geometric registration error between F-Warp (our baseline warp) and FB-Warp (our topology-aware motion estimate) in the vicinity of the ground truth topology change events, focusing on separations.
4.3.1 Topology event detection
We uniformly represent topology change events as triplets $e = (\ell, t, P)$, where $\ell$ is a binary label indicating contact or separation, $t$ is the time (frame index) of the event, and $P$ is the subset of points in the source geometry that lie very close to event motion boundaries. We use the superscript ‘gt’ to denote ground truth event entities, and ‘det’ to denote the ones associated with detections by our algorithm. Let $E^{gt}$ and $E^{det}$ denote the sets of ground truth and detected events, respectively. Given an annotated sequence, these are populated according to the following:

- Labels and timestamps for events in $E^{gt}$ come directly from the annotation data. Ground truth event point clouds $P^{gt}$ are obtained from the image annotation binary masks, which directly mask regions of the $t$th frame's point cloud (input color and depth maps are registered).
- Detected events are derived from the per-frame outputs of Algorithm 2. At time $t$ (frame pair index), we interpret the connected components, in the Euclidean sense, of the point sets Con and Sep as separate, meaningful events, and insert the resulting contact and separation triplets into $E^{det}$. We use a fixed distance threshold for the Euclidean segmentation; to avoid noisy detections, we discard components that contain fewer than 75 points.
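The Euclidean segmentation of event points into connected components can be sketched as a flood fill over a distance-threshold adjacency graph. This is an illustrative brute-force stand-in; an actual implementation would use a spatial index for the radius searches, and the `min_size` default mirrors the 75-point noise filter described above.

```python
import numpy as np

def euclidean_components(points, dist_thresh, min_size=75):
    """Group event points into connected components: two points are linked
    if they lie within dist_thresh of each other; components smaller than
    min_size are discarded as noise. Returns a list of index arrays."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    adj = d2 <= dist_thresh ** 2
    labels = -np.ones(n, dtype=int)
    comp = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = comp
        while stack:                      # flood fill over the adjacency graph
            i = stack.pop()
            for j in np.flatnonzero(adj[i] & (labels == -1)):
                labels[j] = comp
                stack.append(j)
        comp += 1
    return [np.flatnonzero(labels == c) for c in range(comp)
            if (labels == c).sum() >= min_size]
```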
Adopting a 3D point-set-based representation for the spatial extent of topology events enables reasoning about event similarity in terms of metric distances. Our assessments of the spatial overlap of events will be based on the ‘overlap’ metric, defined for a pair of point clouds $X$ and $Y$ as:
(24)  $\mathrm{overlap}_r(X, Y) = \dfrac{|X_r^Y| + |Y_r^X|}{|X| + |Y|}$
where $X_r^Y$ contains exactly the points in $X$ that lie within distance $r$ from their nearest neighbor in set $Y$. Clearly, $0 \le \mathrm{overlap}_r(X, Y) \le 1$. It is easy to verify that this metric is simply the intersection-over-union ratio for the sets $X_r^Y \cup Y$ and $X \cup Y_r^X$. We use a fixed radius value $r$ in all experiments.
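The overlap metric can be computed with a brute-force nearest-neighbor search, as in the sketch below. It follows the textual definition (the fraction of points with a neighbor in the other set within radius r); the symmetric IoU-style normalization is our reconstruction, and a KD-tree would replace the dense distance matrix in practice.

```python
import numpy as np

def overlap(X, Y, r):
    """Spatial overlap of point clouds X, Y: points of each set that have a
    neighbor in the other set within radius r, over the total point count."""
    def n_within(A, B):
        if len(A) == 0 or len(B) == 0:
            return 0
        # dense squared-distance matrix; fine for small event point sets
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return int((d2.min(axis=1) <= r * r).sum())   # |A_r^B|
    return (n_within(X, Y) + n_within(Y, X)) / (len(X) + len(Y))
```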
We derive a many-to-many matching between the sets $E^{gt}$ and $E^{det}$ by associating events in the two sets that have a significant spatiotemporal overlap. A ground truth event is matched to a detected event if and only if they both belong to the same class (contact or separation), their timestamps are very close, and they share a substantial spatial overlap. If $M$ is the set of event matches, then $M$ contains all pairs $(e^{gt}, e^{det})$, for $e^{gt} \in E^{gt}$ and $e^{det} \in E^{det}$, that satisfy all three conditions:

- $\ell^{gt} = \ell^{det}$ (same event class),
- $|t^{gt} - t^{det}|$ does not exceed a temporal distance threshold,
- $\mathrm{overlap}_r(P^{gt}, P^{det})$ exceeds a minimum overlap threshold.
Based on the set of all valid spatiotemporal matches $M$, we derive two interesting event mappings: one that maps each ground truth event to a single detected one, and an ‘inverse’ one that maps each detected event to a ground truth one. Both our ‘ground truth to detected’ (GtToDet) and ‘detected to ground truth’ (DetToGt) mappings associate an event in the first set with its match in the second one that maximizes overlap:
(25)  $\mathrm{GtToDet}(e^{gt}) = \arg\max_{e^{det} : (e^{gt}, e^{det}) \in M} \mathrm{overlap}_r(P^{gt}, P^{det})$
(26)  $\mathrm{DetToGt}(e^{det}) = \arg\max_{e^{gt} : (e^{gt}, e^{det}) \in M} \mathrm{overlap}_r(P^{gt}, P^{det})$
Of course, the above are only defined for events (ground truth/detected) that have at least one valid match in $M$.
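With matches represented as (ground truth id, detected id, overlap) triplets, the two mappings reduce to per-event argmax selections. A small illustrative helper (naming is ours):

```python
def max_overlap_mappings(matches):
    """Given matches as (gt_id, det_id, overlap) triplets, build the
    GtToDet and DetToGt mappings by keeping, for each event, the match
    with the highest overlap."""
    gt_best, det_best = {}, {}
    for g, d, o in matches:
        if g not in gt_best or o > gt_best[g][1]:
            gt_best[g] = (d, o)          # best detected match for gt event g
        if d not in det_best or o > det_best[d][1]:
            det_best[d] = (g, o)         # best gt match for detected event d
    return ({g: d for g, (d, _) in gt_best.items()},
            {d: g for d, (g, _) in det_best.items()})
```

Unmatched events simply do not appear as keys, mirroring the fact that the mappings are only defined for events with valid matches.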
Table II: Topology event detection and registration performance per sequence. Columns 2–3 give event counts; columns 4–6 and 7–9 evaluate the GtToDet and DetToGt mappings in terms of coverage, average spatial overlap, and average detection delay; the last two columns report average registration errors over separation areas.

| Sequence | GT events | Detected events | GtToDet cover. (%) | Overlap | Delay | DetToGt cover. (%) | Overlap | Delay | F-Warp error | FB-Warp error |
|---|---|---|---|---|---|---|---|---|---|---|
| clap | 11 | 32 | 100.00 | 0.832 | 1.000 | 78.12 | 0.633 | 0.800 | 1.325 | 1.134 |
| drum | 21 | 63 | 100.00 | 0.763 | 0.190 | 76.19 | 0.616 | 0.042 | 1.609 | 1.207 |
| stack | 4 | 15 | 100.00 | 0.820 | 0.250 | 40.00 | 0.713 | 0.000 | 1.996 | 1.453 |
| unstack | 4 | 20 | 100.00 | 0.689 | 0.250 | 40.00 | 0.527 | 0.000 | 3.135 | 2.193 |
| separate | 5 | 17 | 100.00 | 0.661 | 0.400 | 70.59 | 0.527 | 0.583 | 2.419 | 1.984 |
| top_drawer | 4 | 11 | 100.00 | 0.951 | 0.000 | 54.55 | 0.819 | 0.667 | 2.305 | 1.325 |
| bot_drawer | 3 | 12 | 100.00 | 0.882 | 0.000 | 66.67 | 0.757 | 0.375 | 2.380 | 1.222 |
| Overall | 52 | 170 | 100.00 | 0.788 | 0.058 | 66.47 | 0.630 | 0.035 | 1.901 | 1.376 |
We present our detection results on our custom dataset in Table II. On average, our pipeline extracts about three times more events than the annotated ones (columns 2–3). This is expected, as topology changes may manifest gradually in continuous video sequences, while our annotation process treats them as instantaneous. In columns 4–6 and 7–9, we evaluate each of the mappings GtToDet and DetToGt in terms of event coverage (the fraction of $E^{gt}$ and $E^{det}$, respectively, that was matched), average spatial overlap, and average detection delay (the signed difference between matched timestamps). All ground truth events are covered by our detections (column 4), while an average of 66.47% of the detected events have a valid ground truth match (column 7). These coverage fractions directly correspond to recall (100%) and precision (66.5%), yielding an F-score of approximately 0.80. At the same time, the spatial overlap of matched events is high (almost 0.8 on average for the covered ground truth events) and within the error margins of our freehand annotation, while average detection delays are very small. A small number of our detections (mostly separations) are depicted in the third row of Fig. 5, with the respective ground truth events shown in the first row. We note that, in our context of nonrigid registration, high recall is more important than high precision, because missing a topology change event is very likely to result in motion estimation errors. At the same time, as discussed in the concluding remarks of Section 3.4.1, a small number of “false positive” detections have essentially no impact on motion estimation under reasonable assumptions. Therefore, our topology change detection mechanism has desirable properties from a motion estimation perspective, while, at the same time, being able to detect contacts and separations with reasonable accuracy.
4.3.2 Registration under dynamic topology
We now quantitatively evaluate our motion estimation performance in the presence of dynamic topology. As discussed before, object separation events tend to produce warp field artifacts when not accounted for, while regular ‘forward’ warp estimates properly handle contacts. Therefore, we focus our assessments on areas of separation events.
For each ground truth separation event, we compute the average point-to-point distance between points in the source geometry, warped under F-Warp and FB-Warp, and their nearest neighbors in the target frame. In order to avoid obscuring the differences between the two estimates in the areas of interest, instead of averaging over the whole source frame, we only consider points within our ground truth annotation masks, which provide reasonable approximations of the true motion boundaries. Furthermore, we discard occluded warped points using a simple depth test against the target geometry's depth map, with a small tolerance threshold.
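The masked, occlusion-aware error computation can be sketched as follows. This is our own illustrative code, assuming pinhole intrinsics (fx, fy, cx, cy) for projecting warped points into the target depth map, and using a brute-force nearest-neighbor search in place of a spatial index.

```python
import numpy as np

def masked_registration_error(warped, target, mask, depth_map, fx, fy, cx, cy,
                              depth_tol):
    """Mean nearest-neighbor distance from warped source points (restricted
    to an annotation mask) to the target cloud, after discarding warped
    points that are occluded in the target view (depth test with tolerance)."""
    pts = warped[mask]
    # project each warped point into the target depth map
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    h, w = depth_map.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pts, u, v = pts[valid], u[valid], v[valid]
    # a point deeper than the observed surface (beyond tolerance) is occluded
    visible = pts[:, 2] <= depth_map[v, u] + depth_tol
    pts = pts[visible]
    # nearest-neighbor distances to the target geometry (brute force)
    d2 = ((pts[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean())
```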
We report our per-sequence average registration error values over separation areas in the last two columns of Table II. FB-Warp is consistently more accurate on all sequences, achieving significant error reductions of up to 49% (bot_drawer sequence), with an average improvement of about 28% over the baseline. We also provide qualitative registration results for a subset of our ground truth separation events in the last two rows of Fig. 5. Clearly, F-Warp (fourth row) significantly over-smooths separation motion boundaries, while FB-Warp (last row) correctly preserves discontinuities in almost all cases.
5 Conclusions
We presented a complete pipeline for the nonrigid registration of arbitrary, unorganized point clouds that may be topologically different. Building upon a general warp field estimation algorithm, we introduced an efficient topology event handling postprocessing phase that detects and classifies object contact and separation events, and, by exploiting the different qualities of forward and backward motion estimates with respect to different event types, locally selects the most appropriate one, in a seamless manner. We evaluated the motion estimation accuracy of our method on the MPI Sintel dataset, achieving stateoftheart performance. Our evaluation on a custom dataset with sequences of highly dynamic scene topology demonstrated the success of our method in estimating motion on topological event boundaries, and showed promising performance in event detection. To the best of our knowledge, this is the first approach to handle dynamic topology in the context of raw point cloud registration. Furthermore, we openly release the implementation of our baseline warp field estimation algorithm as part of our point cloud processing library [17].
Acknowledgments
The support of the Northrop Grumman Mission Systems University Research Program, of ONR under grant award N000141712622, and of the National Science Foundation under grants BCS 1824198 and CNS 1544787 is gratefully acknowledged.
References
 [1] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, “Visual odometry and mapping for autonomous flight using an rgbd camera,” in Robotics Research. Springer, 2017, pp. 235–252.
 [2] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for rgbd cameras,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3748–3754.
 [3] M. Jaimez, M. Souiai, J. GonzalezJimenez, and D. Cremers, “A primaldual framework for realtime dense rgbd scene flow,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 98–104.
 [4] M. Jaimez, C. Kerl, J. GonzalezJimenez, and D. Cremers, “Fast odometry and scene flow from rgbd cameras based on geometric clustering,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3992–3999.
 [5] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Realtime dense surface mapping and tracking,” in Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on. IEEE, 2011, pp. 127–136.
 [6] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison, “Elasticfusion: Dense slam without a pose graph,” in Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.

 [7] R. A. Newcombe, D. Fox, and S. M. Seitz, “Dynamicfusion: Reconstruction and tracking of nonrigid scenes in realtime,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 343–352.
 [8] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger, “Volumedeform: Realtime volumetric nonrigid reconstruction,” in European Conference on Computer Vision. Springer, 2016, pp. 362–379.
 [9] E. Herbst, X. Ren, and D. Fox, “Rgbd flow: Dense 3d motion estimation using color and depth,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 2276–2282.
 [10] J. Quiroga, T. Brox, F. Devernay, and J. Crowley, “Dense semirigid scene flow estimation from rgbd images,” in European Conference on Computer Vision. Springer, 2014, pp. 567–582.
 [11] M. Jaimez, M. Souiai, J. Stückler, J. GonzalezJimenez, and D. Cremers, “Motion cooperation: Smooth piecewise rigid scene flow from rgbd images,” in 3D Vision (3DV), 2015 International Conference on. IEEE, 2015, pp. 64–72.
 [12] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
 [13] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, P. Kohli, V. Tankovich, and S. Izadi, “Fusion4d: Realtime performance capture of challenging scenes,” ACM Trans. Graph., vol. 35, no. 4, pp. 114:1–114:13, Jul. 2016. [Online]. Available: http://doi.acm.org/10.1145/2897824.2925969
 [14] W. Gao and R. Tedrake, “Surfelwarp: Efficient nonvolumetric single view dynamic reconstruction,” in Robotics: Science and Systems, 2018.
 [15] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic, “Killingfusion: Nonrigid 3d reconstruction without correspondences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, no. 4, 2017, p. 7.
 [16] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 303–312.
 [17] K. Zampogiannis, C. Fermuller, and Y. Aloimonos, “Cilantro: A lean, versatile, and efficient library for point cloud data processing,” in Proceedings of the 26th ACM International Conference on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 1364–1367.
 [18] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, “Threedimensional scene flow,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 722–729.
 [19] H. Spies, B. Jähne, and J. L. Barron, “Range flow estimation,” Computer Vision and Image Understanding, vol. 85, no. 3, pp. 209–231, 2002.
 [20] M. Hornacek, A. Fitzgibbon, and C. Rother, “Sphereflow: 6 dof scene flow from rgbd pairs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3526–3533.
 [21] D. Sun, E. B. Sudderth, and H. Pfister, “Layered rgbd scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 548–556.
 [22] B. Amberg, S. Romdhani, and T. Vetter, “Optimal step nonrigid icp algorithms for surface registration,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
 [23] R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation for shape manipulation,” in ACM Transactions on Graphics (TOG), vol. 26, no. 3. ACM, 2007, p. 80.
 [24] O. Sorkine and M. Alexa, “Asrigidaspossible surface modeling,” in Symposium on Geometry processing, vol. 4, 2007, p. 30.
 [25] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
 [26] S. Rusinkiewicz and M. Levoy, “Efficient variants of the icp algorithm,” in 3D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on. IEEE, 2001, pp. 145–152.
 [27] C. Engels, H. Stewénius, and D. Nistér, “Bundle adjustment rules,” Photogrammetric computer vision, vol. 2, no. 2006, 2006.
 [28] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conf. on Computer Vision (ECCV), ser. Part IV, LNCS 7577, A. Fitzgibbon et al. (Eds.), Ed. SpringerVerlag, Oct. 2012, pp. 611–625.
 [29] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.