Motion estimation in 3D is a problem of great importance in computer vision, robotics, and computer graphics, playing a central role in a wide range of applications that include 3D scene reconstruction/modeling, human and object pose tracking, robot localization, augmented reality, human-computer interfaces, and deformable shape manipulation. The advent of affordable, commercial depth sensors has spurred significant research effort on 3D motion estimation from 3D input, leading to the development of RGB-D algorithms for fast visual odometry [1, 2], efficient and accurate scene flow estimation [3, 4], as well as notable SLAM systems for both static [5, 6] and dynamic [7, 8] environments.
Given the availability of 3D input, dense non-rigid registration is the most general motion estimation problem and it is particularly challenging. In its general form, the problem can be described as computing a motion field, densely supported on the surface of a 3D shape, that deforms the latter in order to geometrically align it to another, fixed “template” shape. This process of non-rigid 3D registration shares fundamental similarities with 2D image registration, known in the computer vision community as optical flow estimation: both problems pose similar challenges in deriving formulations that lead to accurate alignment while encoding reasonable prior constraints (regularization) to overcome ill-posedness.
A classical problem variant that is closely related to 3D non-rigid registration is that of RGB-D scene flow. Given a pair of images, scene flow refers to the per-pixel 3D motion of observed points in space from the first frame to the second; optical flow refers to the per-pixel 2D projected motion. There have been a number of successful recent works on scene flow estimation from RGB-D frame pairs, following both classical (variational) [9, 10, 11, 3, 4]
and deep learning frameworks. While of great relevance to a number of motion reasoning tasks, RGB-D scene flow targets a specific instance of dense 3D motion estimation, as it inherently registers pairs of 2.5D geometries (depth maps). This hinders its application in scenarios that require alignment of fully 3D geometries, such as model-to-frame registration for dynamic reconstruction or model-to-model shape deformation.
Recently introduced dynamic reconstruction pipelines from RGB-D input [7, 8, 13, 14] solve a more general problem by implementing warp field optimization algorithms for their model-to-frame registration step. Despite adopting different approaches for their model representations and surface fusion steps, they all rely on similar, point cloud based formulations for non-rigid registration. Scenes with dynamic topology are a challenging case for dynamic reconstruction systems:  and  make no provisions at all for these cases, while  and  deal with registration errors that occur because of dynamic topology at a subsequent stage, by discarding problematic regions and reinitializing model tracking. The fully volumetric approach of  does not use point representations for registration, directly aligning Signed Distance Fields (SDFs) instead. While it intrinsically handles topological changes, it suffers from significant scalability limitations due to its reliance on volumetric representations. To the best of our knowledge, there exists no non-rigid point cloud registration algorithm producing warp fields that are error-free on motion boundaries induced by dynamic scene topology.
Contributions. In this paper, we introduce a complete pipeline for the non-rigid registration of general, unorganized 3D point cloud pairs, which explicitly detects topology changes between the input point sets and produces piecewise-smooth warp fields that respect motion boundaries that result from these events. At the core of our approach lies a general warp field estimation algorithm (Section 3.3), inspired by those employed in recent dynamic reconstruction systems from RGB-D input. We improve motion estimation on motion boundaries associated with topology changes in an efficient post-processing phase (Section 3.4) that exploits the different properties of warp fields that are estimated in different directions (i.e. forward and backward) with respect to different types of topological events (i.e. ‘contact’ or ‘separation’, Section 3.2). After explicitly detecting regions of topology change events by means of simple, intuitive tests of local deformation, our method blends the forward and inverted backward motion hypotheses on a local basis, based on the type and proximity of detected events, ensuring smooth, seamless hypothesis transitions on the deformed surface. This stage makes no assumptions about the underlying registration engine and can be easily adapted for integration into existing pipelines. The implementation of our warp field estimation module (without the topology event handling) is openly available as part of our point cloud processing library . Furthermore, the ability to detect and classify motion boundaries associated with dynamic topology is a byproduct of our pipeline that may be useful in tasks beyond geometric registration.
2 Related Work
RGB-D scene flow estimation. The term ‘scene flow’ was introduced in  to refer to “the three-dimensional motion field of points in the world, just as optical flow is the two-dimensional motion field of points in an image.” Since then, significant research focus has shifted towards scene flow estimation from RGB-D input. The formulation of  couples an -norm data term derived from the optical flow and range flow  constraints with weighted TV regularization. In , the authors follow a similar variational approach but use a rigid motion parameterization of the flow field, computing 6DoF per-pixel transformations and enforcing a local rigidity prior on the solution. A 6DoF local parameterization is also used in , which introduces a correspondence search mechanism that relies on 3D spheres rather than image plane patches and effectively handles large displacements. In , a probabilistic method for joint segmentation and motion estimation is proposed; a depth-based segmentation is used for motion estimation, which is in turn regularized based on the mean rigid motion of each layer. A joint segmentation and scene flow estimation method is also presented in , which assumes that the scene movement can be described by a small number of latent rigid motions. Starting with a spatial -means clustering for the motion label initialization, the algorithm iterates between motion estimation and segmentation (soft labeling), merging labels in the process. In , the first real-time RGB-D variational scene flow method is introduced, achieving state-of-the-art accuracy. An efficient joint odometry and piecewise-rigid scene flow estimation method is proposed in , where the scene is segmented into ‘static’ and ‘moving’ geometric clusters, from which odometry and independent non-rigid motions are computed.
As mentioned in our introduction, scene flow solves a somewhat restricted problem in the context of dense 3D registration, as the support of the computed motion field is image bound.
Dynamic scene reconstruction. General non-rigid 3D registration algorithms have been developed in the context of online reconstruction of dynamic scenes from RGB-D input. Most of them are formulated within a non-rigid Iterative Closest Point (ICP) framework, similar to the one introduced in , with the goal of registering a point cloud representation of the scene model to the current frame, while there also exist purely volumetric approaches  that align Signed Distance Field (SDF) geometry representations. DynamicFusion  was the first system to achieve high quality, real-time dense reconstructions from RGB-D input. While it performs volumetric (SDF) fusion , its warp field estimation algorithm is based on point cloud “renders” of the model geometry. The estimated warp field is defined on a sparse ‘Embedded Deformation’ (ED) graph 
, with a 6DoF transformation attached to each node, and its evaluation on arbitrary points is performed via interpolation. The registration objective consists of a point-to-plane ICP cost, coupled with an ‘As-Rigid-As-Possible’ (ARAP)
, hierarchically defined regularization term, both under robust loss functions. The non-rigid tracker of VolumeDeform does not rely on an ED graph and estimates individual 6DoF transformations for every source geometry point. Its cost function consists of a dense point-to-plane cost, a sparse point-to-point term derived from SIFT  correspondences, and an ARAP prior based on a ‘flat’ neighborhood graph, with all terms being quadratic. Fusion4D  combines the input of multiple range cameras for the task of dynamic reconstruction, using an ED warp field parameterization and following a similar registration objective formulation that additionally includes a ‘visual hull’ term. SurfelWarp  is a purely point cloud based approach that also relies on an ED motion field representation and uses the same registration costs as DynamicFusion, but under the quadratic loss function. On the other end of the spectrum, KillingFusion  is a purely volumetric approach that relies on direct SDF-to-SDF alignment via variational minimization under novel regularizers that enforce the motion field to be isometric (ARAP prior analogue) and preserve level set geometry.
All of the above systems produce results of remarkable quality, especially given their real-time budget. However, with the exceptions of , , and , they cannot handle scenes with dynamic topology, with the ‘close-to-open’ case (‘separation’, in our terminology) being particularly problematic. According to our introductory discussion,  and  deal with these cases essentially by discarding affected regions, while the volumetric registration of  is inherently immune to these events. Our proposed method is the first to tackle dynamic topology within the context of motion estimation and within a scalable point-based representation framework.
3 Our Approach
3.1 Problem statement
Given a pair of unorganized 3D point sets, our goal is to estimate a warping function that non-rigidly deforms the first point cloud (source geometry) towards the second one (target) in a piecewise-smooth manner.
Let and be the source and target geometry point sets, respectively, and be a warping function. In our non-rigid alignment setting, is required to have the following properties:
The image of point set via , , should be aligned as closely as possible to the target geometry. Typically, this is formulated as the minimization of the sum of residuals between points in and their corresponding points (e.g., nearest neighbors) in .
Local transformations of neighboring points that lie on the same moving surface in should be similar; i.e. should be smooth. At the same time, motion discontinuities should be preserved: neighboring points in that lie on independently moving surfaces should be allowed to have different local transformations. This combined prior is known as piecewise-smoothness.
In a typical registration objective minimization formulation, the first property is expressed by the sum of registration residuals (e.g., point-to-point and/or point-to-plane distances) over corresponding point pairs in the objective (data term), while the second one renders the otherwise under-constrained problem well-posed by introducing terms that penalize differences in local transformations of neighboring points (regularization term).
The loss function used to model the regularization penalty terms plays an important role in the behavior of the warping function in motion boundary regions. For example, it is well known from the optical flow literature that quadratic regularization tends to oversmooth motion boundaries. On the other hand, robust loss functions (e.g., -norm approximations for the penalty terms) are more effective in producing piecewise-smooth motion fields that preserve discontinuities.
In this work, we focus on estimating warp fields that respect motion boundaries resulting from changes in scene topology. Our notion of topology is directly derived from object-level connectivity: a change in scene topology can occur either when two or more separate objects come into contact or when two or more initially connected objects separate. We note that we use the term ‘object’ simply to refer to independently moving scene surface regions, without attaching any semantic meaning to them.
In the following, we show that simply adopting a robust loss function for regularization still produces visible warping artifacts in motion discontinuity regions that result from scene topology changes, and we present a complete registration pipeline that effectively and efficiently solves this problem.
3.2 Motivation and overview of our approach
Motion estimation errors on motion boundaries typically manifest as oversmoothing of the warp field because of excessive regularization and can be suppressed by eliminating regularization penalty terms for points in that lie on different sides of the discontinuity. However, without any knowledge about and its motion (e.g., some form of segmentation into independently moving objects), we cannot obtain a “correct” regularization graph a priori. Instead, the common choice is to use a -NN graph of points in to define the regularization terms. It is easy to see that this choice is particularly problematic in cases where connected objects in move apart in , as -NN regularization over will introduce penalty terms that relate points that lie on different objects, resulting in some amount of motion field oversmoothing over the separation boundary.
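To make the issue concrete, the following Python sketch (illustrative only; the names and the brute-force neighbor search are ours, not the paper's implementation) builds a k-NN regularization graph over a source geometry in which two objects touch, and shows that penalty edges inevitably straddle the future separation boundary:

```python
import numpy as np

def knn_regularization_edges(points, k):
    """Build the k-NN regularization graph: one edge per (point, neighbor) pair.

    Brute-force O(n^2) distances for illustration; a k-d tree would be
    used in practice.
    """
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    return {(i, j) for i in range(len(points)) for j in nbrs[i]}

# Two touching "objects": a point near the interface gets neighbors on the
# other object, so regularization terms will straddle any future separation.
obj_a = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])
obj_b = np.array([[0.3, 0.0, 0.0], [0.4, 0.0, 0.0], [0.5, 0.0, 0.0]])
edges = knn_regularization_edges(np.vstack([obj_a, obj_b]), k=2)
cross = [(i, j) for (i, j) in edges if (i < 3) != (j < 3)]
assert len(cross) > 0  # cross-object penalty edges exist
```

If the two objects subsequently move apart, every such cross-object edge contributes a penalty that pulls their local transformations together, which is precisely the source of oversmoothing over the separation boundary.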
Such a challenging scenario that involves object ‘separations’ is depicted in Fig. (a), where the general, topology-agnostic warp field estimation algorithm described in Section 3.3 was used to non-rigidly align two RGB-D frames. Despite the fact that the algorithm’s regularization term is formulated based on the robust, discontinuity-preserving Huber- loss function (see Section 4.1 for parameter details), the registration result (warped source geometry) shows visible artifacts near the object separation areas. Quadratic regularization is known to induce even more excessive smoothing on motion boundaries. Since quadratic and -norm approximation regularization types are the most commonly used ones in the literature, most current non-rigid registration algorithms are expected to exhibit very similar behavior on these types of motion boundaries.
At the same time, it is clear that any change in topology between and will be directly reflected on the (different) nearest-neighbor graph structures of the source and target geometries. We exploit this fact in the following way. Consider the case where two or more objects that are connected in become separate in . As discussed above, estimating the warp field that aligns to using the source geometry’s -NN graph to define the regularization penalties is expected to result in some amount of oversmoothing over the motion boundary of the separation. However, in the backward motion direction (from to ), the same topology change manifests as a connection of separate objects. Estimating the backward warp field that aligns to using the target () geometry’s -NN graph to define the regularization penalties should not exhibit any oversmoothing over the connection boundary, because the corresponding -NN graph edges that would define regularization terms over the discontinuity are not there in the first place. Inverting the warping function yields another forward warp field hypothesis, , that will be free of oversmoothing over separation motion boundaries. Of course, the latter, being derived from , may suffer from oversmoothing over motion boundaries that correspond to object separations in the backward motion direction (from to ), or, equivalently, to objects coming into contact from to . These cases are expected to be handled correctly in the first place by the standard forward warp, .
Based on the above observations, the standard forward warp is expected to exhibit good behavior over contact boundaries, but to oversmooth separation boundaries. On the other hand, the inverted backward warp is expected to behave the opposite way, preserving separation discontinuities, but possibly blurring motion estimates in contact areas. Our proposed registration pipeline builds upon this idea by first detecting regions in that are likely to be contact or separation boundaries, and then locally blending the warp hypotheses and accordingly in a seamless manner. The final result is a piecewise-smooth warp field that aligns to and respects motion boundaries because of changes in scene topology.
An overview of our approach is provided in Fig. 4. The (topology-agnostic) warp field estimation algorithm used to obtain the initial forward and backward warp hypotheses is described in detail in Section 3.3. Our topology event detection mechanism, as well as our local hypothesis blending approach, are presented in Section 3.4.
3.3 Warp field estimation
We implement our warp field estimation algorithm within a non-rigid Iterative Closest Point (ICP) framework , similarly to the non-rigid trackers used in the recently introduced dynamic reconstruction pipelines of , , and .
Given the source and target geometries and , as well as an initial estimate of the unknown warp field (usually taken as the identity warp), the algorithm iteratively refines the latter until convergence has been reached. At the top level, the process iterates between a point correspondence search step between the warped source (according to the current estimate) and , and a warp field optimization step that updates given the established point correspondences (Algorithm 1). The two algorithm phases are presented in detail in the following.
3.3.1 Correspondence association
Our framework supports two complementary types of point correspondences between the (warped) source and the target geometries: dense correspondences that are established based on spatial point proximity, and sparse correspondences that result from keypoint matching. Each individual correspondence is represented as a pair of point indices, whose first component indexes a point in and whose second one indexes a point in : , where .
We support two mechanisms to establish dense correspondences. By default, we assume that both and have a fully 3D structure and we establish dense correspondences by finding the nearest neighbor in , in terms of Euclidean distance, of each point in , with the search being performed efficiently by parallel k-d tree queries. For certain applications that only require frame-to-frame (2.5D-to-2.5D) or model-to-frame (3D-to-2.5D) registration, we can further speed up the process by obtaining projective correspondences. This amounts to projecting and onto the target frame image and extracting correspondences based on points that are projected to the same pixel. This is the mechanism adopted in most real-time reconstruction pipelines [5, 7, 8].
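A minimal sketch of the default (fully 3D) dense correspondence mechanism, assuming a SciPy k-d tree and a hypothetical distance cutoff (not one of the paper's settings):

```python
import numpy as np
from scipy.spatial import cKDTree

def dense_correspondences(warped_src, tgt, max_dist=np.inf):
    """For each warped source point, find its Euclidean nearest neighbor
    in the target via a k-d tree; returns (src_idx, tgt_idx) index pairs.
    (Projective correspondences would instead match points that project
    to the same target-frame pixel.)"""
    tree = cKDTree(tgt)
    dists, idx = tree.query(warped_src, k=1)
    keep = dists <= max_dist
    return np.stack([np.nonzero(keep)[0], idx[keep]], axis=1)

src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
tgt = np.array([[0.05, 0.0, 0.0], [1.02, 0.0, 0.0], [5.0, 0.0, 0.0]])
corr = dense_correspondences(src, tgt, max_dist=0.5)
assert corr.tolist() == [[0, 0], [1, 1]]  # each point matched to its neighbor
```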
In many common situations, dense geometric/depth correspondences alone are not enough to disambiguate the underlying motion. For example, tracking points on flat surfaces that lack geometric texture and slide parallel to each other may exhibit drift. Establishing robust keypoint correspondences between the source and target geometries can effectively mitigate this problem. We assume that our input geometries are equipped with sparse interest points; our sparse correspondences are established by interest point descriptor matches between and . In our implementation, we focus on input geometries that are either RGB-D frames or 3D reconstructions from RGB-D input. The availability of regular images along with (registered) geometry allows us to adopt SIFT keypoint  matches (lifted to 3D) for our sparse correspondences.
To make optimization more stable, we discard correspondence candidates from the above mechanisms that do not meet some basic proximity and local similarity criteria. Let , , and ,
be the surface normals and colors (e.g., RGB value 3-vectors) of the source and target geometries, indexed in the same way as their support points in and . A correspondence candidate is considered valid and used in the optimization if all of the following hold:
In the above, is a point distance threshold, is a normal angle threshold, and is a color “distance” threshold.
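The validity tests can be sketched as follows; the threshold values below are placeholders for illustration, not the settings of Section 4.1:

```python
import numpy as np

def is_valid_correspondence(p_src, p_tgt, n_src, n_tgt, c_src, c_tgt,
                            max_dist, max_normal_angle, max_color_dist):
    """Prune a candidate pair unless it passes all three gating tests:
    point proximity, normal agreement, and color similarity."""
    if np.linalg.norm(p_src - p_tgt) > max_dist:
        return False
    cos_angle = np.clip(np.dot(n_src, n_tgt), -1.0, 1.0)
    if np.arccos(cos_angle) > max_normal_angle:
        return False
    return np.linalg.norm(c_src - c_tgt) <= max_color_dist

p, n, c = np.zeros(3), np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.5, 0.5])
assert is_valid_correspondence(p, p + 0.01, n, n, c, c,
                               max_dist=0.1,
                               max_normal_angle=np.deg2rad(20),
                               max_color_dist=0.1)
assert not is_valid_correspondence(p, p + 1.0, n, n, c, c,
                                   max_dist=0.1,
                                   max_normal_angle=np.deg2rad(20),
                                   max_color_dist=0.1)
```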
3.3.2 Warp field optimization
Given a set of dense and sparse point correspondences, we shall now describe our warp field optimization step. Modeling the warp field using locally affine  or locally rigid  transformations provides better motion estimation results than adopting a simple translational local model, due to more effective regularization. In our approach, we adopt a locally rigid (6DoF) model.
Instead of computing a unique rigid transformation for each point in , we use the more efficient embedded deformation graph representation  for the warp field , similarly to  and . Let be the set of virtual deformation nodes, where is the position of the th node, is a radius parameter that controls the th node’s area of effect, and is the 6DoF rigid transformation attached to the th node. The deformation node positions are obtained by downsampling the source geometry by means of a voxel grid of bin size . A reasonable choice for that ensures sufficient area of effect overlap among neighboring nodes is , for . As in , each local transformation is parameterized during optimization by a 6D vector of 3 Euler angles and 3 translational offsets. The effect of the warp field , represented by , on a point is given by interpolating the local node deformations in the neighborhood of . Let be the indices of the -nearest neighbors of in . The local transformation parameter vector at is given by:
where . The image of via is then:
where and extract the rotation matrix and translation vector from our 6D parameterization.
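The interpolation scheme can be sketched as below; the Gaussian weighting and the Euler-angle convention are assumptions made for illustration, not necessarily the exact choices of the system:

```python
import numpy as np

def rot_from_euler(rx, ry, rz):
    """Rotation matrix from XYZ Euler angles (one possible convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def warp_point(x, node_pos, node_params, k, sigma):
    """Evaluate the ED warp at x: weighted average of the 6D parameter
    vectors of the k nearest deformation nodes, then apply the resulting
    rigid transform. The Gaussian weights (bandwidth sigma) are assumed."""
    d = np.linalg.norm(node_pos - x, axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-d[nn] ** 2 / (2 * sigma ** 2))
    w /= w.sum()
    theta = (w[:, None] * node_params[nn]).sum(axis=0)  # blended 6D vector
    R, t = rot_from_euler(*theta[:3]), theta[3:]
    return R @ x + t

# Two nodes sharing a pure translation: any point is translated identically.
nodes = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
params = np.array([[0, 0, 0, 0.0, 0.0, 0.5],
                   [0, 0, 0, 0.0, 0.0, 0.5]])
y = warp_point(np.array([0.5, 0.0, 0.0]), nodes, params, k=2, sigma=0.5)
assert np.allclose(y, [0.5, 0.0, 0.5])
```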
We note that the above 6D parameterization is only used within optimization (line 5 of Algorithm 1) and that both the estimated incremental warp and the final composite estimate have their node transformations expressed in terms of transformation matrices. The fact that we continuously warp and compute starting from the identity warp, combined with the smoothness prior imposed on the warp field (shown below), allows us to overcome any problems associated with Euler angle parameterizations of rotation.
Our registration objective, as a function of the unknown warp field , which in the context of Algorithm 1 is the incremental warp , and the point correspondences between and , which are fixed for this step, is formulated as:
Our data term, , is a weighted sum of a point-to-plane and a point-to-point ICP cost:
Pure point-to-plane metric optimization generally converges faster and to better solutions than pure point-to-point , and is the standard trend in recent rigid [5, 6] and non-rigid [7, 8, 14] registration pipelines. However, as discussed in Section 3.3.1, integrating a point-to-point term for robust point matches into the registration cost can effectively disambiguate motion estimation in cases where surfaces that lack geometric texture slide parallel to each other. Similarly to , we use our dense geometric correspondences to define our point-to-plane cost and our sparse keypoint correspondences for our point-to-point cost:
Our regularization term directly penalizes differences in transformation parameters of neighboring virtual nodes of under the robust Huber- loss function. If is the set of indices of the -nearest neighbors of in , our regularization term is formulated as:
where weights the pairwise penalties based on node distance, and denotes the sum of the Huber loss function values over the 6 parameter residual components. Parameter controls the point at which the loss function behavior switches from quadratic (squared -norm) to absolute-linear (-norm). Since -norm regularization is known to better preserve solution discontinuities, we use a small value of . Given our locally rigid motion model, this regularization scheme enforces an ‘As-Rigid-As-Possible’  prior on the estimated warp field.
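For reference, the Huber loss underlying the regularization penalties can be written as follows (a small sketch; the parameter value in the checks is arbitrary):

```python
import numpy as np

def huber(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear (L1-like) beyond.
    A small delta makes the penalty behave mostly like an L1 norm,
    which better preserves discontinuities in the warp field."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

assert np.isclose(huber(np.array(0.1), 1.0), 0.005)   # quadratic regime
assert np.isclose(huber(np.array(10.0), 1.0), 9.5)    # linear regime
```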
The registration objective of equation (3) is non-linear in the unknowns. We minimize by performing a small number of Gauss-Newton iterations. As in , we handle non-quadratic terms using the square-rooting technique of . At every step, we linearize around the current solution (vector concatenation of all node transformation parameters ) and obtain a solution increment by solving the system of normal equations , where is the Jacobian matrix of the residual terms in and is the vector of (negative) residual values. We solve this sparse system iteratively, using the Conjugate Gradient algorithm with a diagonal preconditioner.
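One Gauss-Newton outer loop with a Jacobi-preconditioned Conjugate Gradient inner solver can be sketched on a toy least-squares problem as follows (the problem and all names are illustrative, not the registration objective itself):

```python
import numpy as np
from scipy.sparse.linalg import cg

def gauss_newton(residuals, jacobian, x0, iters=5):
    """A few Gauss-Newton iterations, each solving the normal equations
    J^T J dx = -J^T r with diagonally preconditioned Conjugate Gradient."""
    x = x0.astype(float)
    for _ in range(iters):
        r, J = residuals(x), jacobian(x)
        A, b = J.T @ J, -J.T @ r
        M = np.diag(1.0 / np.diag(A))          # Jacobi (diagonal) preconditioner
        dx, _ = cg(A, b, M=M)
        x = x + dx
    return x

# Toy problem: fit (a, b) so that a*t + b matches the line 2t + 1.
t = np.linspace(0.0, 1.0, 10)
y = 2.0 * t + 1.0
residuals = lambda p: p[0] * t + p[1] - y
jacobian = lambda p: np.stack([t, np.ones_like(t)], axis=1)
p = gauss_newton(residuals, jacobian, np.zeros(2))
assert np.allclose(p, [2.0, 1.0], atol=1e-5)
```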
3.4 Handling topology changes
In our post-processing phase for handling topology change motion boundaries, we first explicitly detect likely contact and separation regions in the source geometry, and proceed by appropriately blending our (default) forward and inverted backward warp hypotheses in a local yet seamless manner. It follows from the discussion in Section 3.2 that, as far as topology changes are concerned, the forward warp is only problematic in separation areas. However, instead of only focusing on and amending separations, it is beneficial to also explicitly consider contact events. As will become clear in the following, considering both types of events and treating them symmetrically robustifies their detection and allows for a less biased hypothesis blending scheme.
Using the same notation as in Section 3.2, let be the warp field that aligns to and the backward warp that aligns to , both computed by Algorithm 1. We will be using the ‘S’ subscript for forward () motion entities and the ‘D’ subscript for backward () ones. For the needs of the following discussion, we will consider these motion fields to be represented by the per-point local rigid transformations of their support geometries, so that:
To invert in order to obtain an alternative forward warp, we appropriately rebase its inverse local transformations on . To that end, we first compute the target geometry’s image (which should be closely aligned to ) and, to each point , we assign the transformation , where is the index of the nearest neighbor of in the point set . Of course, we assume that the latter is indexed in the same way as . The inverted backward warp is then represented as:
where and is the nearest neighbor index of in . Analogously, we obtain an alternative backward warp, by inverting our forward hypothesis:
where and is the nearest neighbor index of in . To summarize, we have two forward motion () warp hypotheses ( and ) and two backward motion () ones ( and ).
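The rebasing-based inversion can be sketched as follows, with per-point local transformations stored as (R, t) pairs (an assumed representation for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

def invert_warp(src, tgt, tgt_transforms):
    """Rebase a backward warp onto the source: warp the target by its
    per-point rigid transforms, then assign each source point the
    *inverse* transform of its nearest warped-target neighbor."""
    warped_tgt = np.array([R @ p + t for p, (R, t) in zip(tgt, tgt_transforms)])
    _, nn = cKDTree(warped_tgt).query(src, k=1)
    inv = []
    for i in nn:
        R, t = tgt_transforms[i]
        inv.append((R.T, -R.T @ t))            # (R, t)^-1 = (R^T, -R^T t)
    return inv

# Target is the source shifted by +1 in z; the backward warp undoes that
# shift, so its inverse recovers the forward +1 translation.
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
tgt = src + np.array([0.0, 0.0, 1.0])
back = [(np.eye(3), np.array([0.0, 0.0, -1.0]))] * len(tgt)
fwd = invert_warp(src, tgt, back)
R0, t0 = fwd[0]
assert np.allclose(R0 @ src[0] + t0, tgt[0])
```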
3.4.1 Detecting topology change events
We detect topology change regions in based on how our warp estimates affect local neighborhoods of the source geometry.
Naturally, we expect that if is close to a separation boundary, its distance to some of its neighbors in should increase after applying the correct warp to . We shall refer to this effect as neighborhood stretching. The dual case of a contact event manifests exactly the same way in the backward motion direction (stretching of neighborhoods of ), in which the event is perceived as a separation. In the following, we will use a local measure of stretch over points in to detect separation areas, and map the same measure over in the backward direction onto to obtain a dual measure of “compression” that will allow us to detect contacts.
We quantify the above intuition by defining a local “stretch” operator for point under the warp as the maximum ratio of the distance to its neighbors before and after applying :
where indexes the neighbors of in that lie within distance from it. The choice of the neighborhood radius value depends on the scale and resolution of the input geometries. For close-range point clouds acquired with Kinect-like cameras, we use .
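A direct transcription of the stretch operator (the neighborhood radius value below is arbitrary, chosen only for the toy example):

```python
import numpy as np

def local_stretch(points, warped, i, radius):
    """Max ratio of post- to pre-warp distance from point i to its
    radius-neighbors: values well above 1 indicate a nearby separation."""
    d_pre = np.linalg.norm(points - points[i], axis=1)
    nbrs = np.nonzero((d_pre > 0) & (d_pre <= radius))[0]
    d_post = np.linalg.norm(warped[nbrs] - warped[i], axis=1)
    return (d_post / d_pre[nbrs]).max()

# A 1D "surface" torn in the middle: the right half moves apart by 1 unit.
pts = np.array([[x, 0.0, 0.0] for x in np.arange(0, 1.0, 0.1)])
warped = pts.copy()
warped[pts[:, 0] >= 0.5] += np.array([1.0, 0.0, 0.0])
assert local_stretch(pts, warped, i=4, radius=0.15) > 5.0           # boundary
assert np.isclose(local_stretch(pts, warped, i=0, radius=0.15), 1.0)  # interior
```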
To each point in , we associate one stretch value for each of our two forward warp hypotheses, according to definition (12). For , we have:
We also compute the local stretch of the target geometry under each of the two backward warps:
which we subsequently map onto , interpreting them as a compression measure (contact indicator), according to:
where is the index of the nearest neighbor of point in point set .
Using the above point-wise stretch/compress values on , we extract subsets of the source geometry that are likely to lie on topology change motion boundaries. Let Sep and Con be the sets of candidate separation and contact boundary points, respectively. According to the above discussion, points on a separation boundary are expected to have high stretch scores, while points on a contact boundary should exhibit high local compression. To decide whether a point in is a boundary candidate, we perform two symmetric tests per case that rely on two threshold values, an absolute score threshold , and a relative (ratio) threshold . A point of is a member of Sep (Con) if and only if the maximum of its two stretch (compress) scores is greater than and also greater than times its maximum compress (stretch) score. The process is summarized in Algorithm 2. A sample output is shown in Fig. (b) (left), marked in red; note that in this case.
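The two symmetric tests of Algorithm 2 can be sketched as follows (the threshold values are illustrative):

```python
import numpy as np

def detect_events(stretch_a, stretch_b, compress_a, compress_b, t_abs, t_rel):
    """Classify each source point as a separation/contact candidate:
    the max score of a kind must exceed the absolute threshold AND
    exceed t_rel times the max score of the opposite kind."""
    s = np.maximum(stretch_a, stretch_b)    # per-point max over both warps
    c = np.maximum(compress_a, compress_b)
    sep = (s > t_abs) & (s > t_rel * c)
    con = (c > t_abs) & (c > t_rel * s)
    return sep, con

stretch = np.array([1.0, 4.0, 1.1])
compress = np.array([1.0, 1.1, 4.0])
sep, con = detect_events(stretch, stretch, compress, compress,
                         t_abs=2.0, t_rel=1.5)
assert sep.tolist() == [False, True, False]
assert con.tolist() == [False, False, True]
```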
As we will show in Section 4.3, the above procedure is very effective at detecting and classifying topology changes, but, because of the continuous nature of our local stretch/compression measures and depending on the selected threshold values, it may produce “false positives” (e.g., in areas of actual deforming surface stretching or compression but constant topology). However, under the assumption that our two forward warp hypotheses behave similarly in the false positive areas and as will become clear in the next section, this does not affect our final warp estimate.
3.4.2 Local hypothesis blending
Our blending scheme produces a topology-aware warp field by combining the forward warp hypotheses and on a per-point basis. Our objective is to assign a higher weight to (inverted backward warp) near separation areas, and ensure that (forward warp) has a stronger weight near contact areas. At the same time, it is desirable that point weights vary smoothly on , so that our warp blending does not introduce seam artifacts on due to differences in our original warp hypotheses.
We achieve the above by attaching a smoothly decaying kernel on each of our detected event points in Con and Sep and locally computing the weight for each event class. Assuming a maximum radius of effect (free parameter) for our event points, we model the influence of each event with an RBF kernel of bandwidth . The weights and of and for the source point are computed by accumulating influences of the event points in and respectively that are within a -radius from :
where is a normalizing constant ensuring that , and , index the -radius neighbors of in Con and Sep respectively. In the absence of any topology event influence (e.g., for at least from any event point), the above defaults to and , giving full weight to the standard forward hypothesis . The local transformation of our final, topology-aware warp field estimate at location is then given by:
where converts the linear blend of the two transformation matrices back to a valid transformation matrix. The complete blending process is summarized in Algorithm 3 (see also equations (8) and (10)). A visualization of the blending weights is given in Fig. (b) (middle), where each source geometry point is colored according to its inverted backward warp hypothesis weight , using a ‘blue-to-red’ colormap.
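A sketch of the blending step: the RBF accumulation follows the description above, while the SVD projection back to a valid rotation is one reasonable choice of "rectifier", not necessarily the one used in the paper:

```python
import numpy as np

def blend_weights(x, con_pts, sep_pts, r, sigma):
    """Accumulate RBF influences of contact/separation event points within
    radius r of x and normalize so w_fwd + w_inv = 1; with no events in
    range, this defaults to full weight on the forward hypothesis."""
    def influence(events):
        if len(events) == 0:
            return 0.0
        d = np.linalg.norm(events - x, axis=1)
        return np.exp(-d[d <= r] ** 2 / (2 * sigma ** 2)).sum()
    a, b = influence(con_pts), influence(sep_pts)
    if a + b == 0:
        return 1.0, 0.0                        # forward-only default
    return a / (a + b), b / (a + b)

def blend_transforms(Ra, ta, Rb, tb, wa, wb):
    """Linear blend of two rigid transforms; the blended rotation is
    projected back to SO(3) via SVD."""
    U, _, Vt = np.linalg.svd(wa * Ra + wb * Rb)
    R = U @ Vt
    if np.linalg.det(R) < 0:                   # guard against reflections
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R, wa * ta + wb * tb

x = np.array([0.0, 0.0, 0.0])
w_fwd, w_inv = blend_weights(x, np.empty((0, 3)), np.empty((0, 3)),
                             r=1.0, sigma=0.5)
assert (w_fwd, w_inv) == (1.0, 0.0)
R, t = blend_transforms(np.eye(3), np.zeros(3), np.eye(3), np.ones(3), 0.5, 0.5)
assert np.allclose(R, np.eye(3)) and np.allclose(t, 0.5)
```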
We note that, for source points close to topology events, one of and will dominate the other, essentially rendering the blending of (21) a binary selection. As we move farther from topology events, the two weights may assume comparable values. For our blended output (21) to be seamless and error-free in that case, and should not differ too much in the areas that are “far enough” from event points. This highlights the importance of parameter ; we have found that values work well in practice, where is the resolution of our virtual deformation graph (Section 3.3.2).
As a concluding remark, we observe that the most costly operation of our post-processing phase is the calculation, based on radius-neighborhoods, of the stretch/compress values of Section 3.4.1. Algorithm 3 also performs radius queries on point sets Con and Sep, but the latter are typically very small in size. In our experience, the overall running time of the entire phase is significantly smaller than a single warp field estimation. Furthermore, in the case of RGB-D input, image structure can be easily exploited in order to accelerate the extraction of point neighborhoods.
We conduct two sets of experiments for the evaluation of our registration pipeline. The first one is performed on a public dataset and examines our algorithm’s motion estimation accuracy, both with and without the topology handling phase (Section 4.2). For the second one, we use a custom dataset with topology event annotations and evaluate our event detection performance, as well as our estimated warp field quality in the presence of separation events (Section 4.3).
4.1 General setup details
For both sets of experiments, the input is RGB-D data, either synthetic (first set) or captured by a Kinect-like camera (second set).
Point cloud generation. RGB-D frames are converted to point clouds equipped with surface normal and color information, as well as a sparse set of interest points derived from SIFT features. In all cases, the full resolution of the input depth map is used, which is for the synthetic sequences and for the camera data. We use a fixed maximum depth of for all sequences in the first set, and vary the cut-off value in the range of to for the camera data, depending on the sequence. For normal estimation, we use -NN neighborhoods with in our first set of experiments, and -radius neighborhoods with in our second set. SIFT keypoints are extracted from the RGB images and lifted to 3D, discarding the ones that lie on depth boundaries.
Warp field estimation. In the correspondence association step (Section 3.3.1) of our non-rigid ICP algorithm, we set the maximum correspondence distance to for our first set of experiments and for our second one, while we use common values and for the maximum normal angle and color difference (colors are RGB triplets in ). The embedded deformation graph for our warp field parameterization (Section 3.3.2) has a resolution of , with each node’s area of effect being controlled by . To evaluate the local transformation for each point in the source geometry (equation (1)), we use its 4 nearest neighbors in . The point-to-point weight in our data term (4) is set to . To favor -norm behavior by our regularization term (7), we use a small Huber loss parameter value of , while we set the term’s weight to (equation (3)). Regularization topology is given by the 6 nearest neighbor neighborhoods of . We perform a maximum of 10 top-level ICP iterations, while the process typically converges in less. Within each optimization step, we perform a maximum of 5 Gauss-Newton iterations.
Topology handling. In all experiments, local stretch is computed on neighborhoods of radius (Section 3.4.1). To detect and classify topology change events, we use an asbsolute score threshold of and a relative ratio of . For our blending step (Section 3.4.2), we assume that every detected event has a radius of effect equal to .
|Sequence||Frame-wise EPE over entire sequence||Frame-wise AE over entire sequence|
4.2 Motion estimation accuracy evaluation
Due to the lack of publicly available datasets with ground truth dense 3D motion, we perform our accuracy assessments on MPI Sintel , a synthetic optical flow evaluation dataset. The dataset contains multiple sequences of (typically) 50 frames that capture motions ranging from slow, almost rigid to very large, highly non-rigid ones. In addition to ground truth optical flow, metric ground truth depth and camera intrinsics are provided, which we use to emulate RGB-D input.
We base our evaluation on two classical optical flow performance measures, the endpoint error (EPE) and the angular error (AE) . If is an optical flow estimate at a given pixel whose ground truth value is , EPE is computed as:
The angular error AE is defined as the angle between the 3D space-time vectors and , as:
effectively enabling evaluation at pixels of zero flow. We convert 3D motion estimates to optical flow by first warping the source points in 3D and then computing the 2D point/pixel displacements as differences of the projected endpoints onto the image plane.
We evaluate and compare three methods: PD-Flow , a state-of-the-art scene flow algorithm, F-Warp, our general warp field estimation algorithm defined in Section 3.3 (without the topology change handling phase), and FB-Warp, our complete topology-aware warp field estimation pipeline. We run the three algorithms on the entire duration of 12 Sintel sequences and compute the average EPE and AE values per consecutive frame pair. We report the median and mean frame-level average errors over each sequence in Table I. Median values are not easily affected by extreme values, often providing a better picture of how accurate estimation is “half of the time”. We also report overall mean and median error values for each method, computed over the total number of frames from all sequences.
Our FB-Warp method overall achieves the highest accuracy in terms of both error metrics, followed closely by our baseline, F-Warp. PD-Flow is very accurate in estimating slow motions (e.g., in the sleeping_2 sequence), but falls behind in most cases, producing particularly large errors in sequences that contain very fast motions, such as ambush_5 and ambush_6. We also refer the reader to the Sintel-based evaluation of MC-Flow  (Table 2 of that paper), another state-of-the-art scene flow algorithm with significantly better performance than PD-Flow in estimating large motions. We were unable to evaluate MC-Flow ourselves, because its implementation has not been released. While 6 of the Sintel sequences in our experiments and the ones in  are common, there are significant differences between our evaluation setups: 1) their reported EPE and AE values are computed on one specific frame pair per sequence, not over whole sequences as in here, 2) they downsample their input to half its original resolution per dimension (to ), whereas we use full resolution images, and 3) they consider only non-occluded pixels, while we compute errors on all valid ones. The above prohibit reaching definitive conclusions. However, the fact that we adopt an arguably more difficult evaluation strategy (at full resolution, which means smaller pixel size for EPE interpretation, and evaluating over whole sequences) and still obtain comparable EPE and AE absolute values to the ones reported in  (averages of 1.203 and 6.559, respectively), leads us to argue that FB-Warp compares favorably to MC-Flow.
4.3 Topology change event handling
To evaluate the performance of the topology-handling phase of our pipeline, we collected, using a Kinect-like camera, an RGB-D dataset of 7 sequences that contain changes in scene topology. The regions of visible topology changes were manually annotated on the color image as binary masks, drawn in freehand mode, and classified as either contacts or separations. Our collected sequences capture the following diverse set of actions:
Hand-clapping (clap sequence)
Fast hand-drumming on a desk (drum sequence)
Two pick-and-place actions on objects lying on a flat surface or on top of each other (stack and unstack sequences)
A separation of two touching objects using both hands (separate sequence)
Two drawer opening action sequences (top_drawer and bot_drawer)
The annotated ground truth events capture all visible instances of hand-hand, hand-object, and object-object interaction. Snapshots of our sequences are shown in Fig. 5 (top two rows).
Our evaluation of this phase is twofold. First, we assess how well our detected event points in Con and Sep (Section 3.4.1) relate to the ground truth events. Second, we compare the geometric registration error between F-Warp (baseline warp) and FB-Warp (topology-aware motion estimate) in the vicinity of the ground truth topology change events, focusing on separations.
4.3.1 Topology event detection
We uniformly represent topology change events as triplets , where is a binary label indicating contact or separation, is the time (frame index) of the event, and is the subset of points in the source geometry that lie very close to event motion boundaries. We use the superscript ‘gt’ to denote ground truth event entities, and ‘det’ to denote the ones associated with detections by our algorithm. Let and denote the sets of ground truth and detected events, respectively. Given an annotated sequence, these are populated according to the following:
Labels and timestamps for events in come directly from the annotation data. Ground truth event point clouds are obtained by the image annotation binary masks, which directly mask regions of the -th frame’s point cloud (input color and depth maps are registered).
Detected events are derived from the per-frame outputs of Algorithm 2. At time (frame pair index), we interpret the connected components, in the Euclidean sense, of point sets Con and Sep as separate, meaningful events. We insert all triplets and into , where and denote the respective connected component sets. We use a distance threshold of for the Euclidean segmentation; to avoid noisy detections, we discard components that contain less than 75 points.
Adopting a 3D point set based representation for the spatial extent of topology events enables reasoning about event similarity in terms of metric distances.
Our assessments on spatial overlap of events will be based on the ‘-overlap’ metric, defined for a pair of point clouds and as:
where contains exactly the points in that lie within distance from their nearest neighbor in set . Clearly, . It is easy to verify that this metric is simply the intersection-over-union ratio for the sets and . We use a radius value of .
We derive a many-to-many matching between sets and by associating events in the two sets that have a significant spatiotemporal overlap. A ground truth event is matched to a detected event if and only if they both belong to the same class (contact or separation), their timestamps are very close, and they share a substantial spatial overlap. If is the set of event matches, then contains all pairs , for and , that satisfy all three conditions:
Based on the set of all valid spatiotemporal matches , we derive two interesting event mappings: one that maps each ground truth event to a single detected one, and an ‘inverse’ one that maps each detected event to a ground truth one. Both our ‘ground truth to detected’ and ‘detected to ground truth’ mappings associate an event in the first set with its match in the second one that maximizes overlap:
Of course, the above are only defined for events (ground truth/detected) that have valid matches, i.e. .
|Sequence||Number of events||
We present our detection results on our custom dataset in Table II. On average, our pipeline extracts three times more events than the annotated ones (columns 2-3). This is normal, as topology changes may manifest gradually in continuous video sequences, while our annotation process treats them as being instantaneous. In columns 4-6 and 7-9, we evaluate each of the mappings GtToDet and DetToGt in terms of event coverage (fraction of and , respectively, that was matched), average spatial overlap, and average detection delay (signed difference ). All ground truth events are covered by our detections (column 4), while an average of of the detected events have a valid ground truth match (column 7). These coverage fractions directly correspond to recall and precision
, yielding an F-score of. At the same time, the spatial overlap of matched events is high (almost on average for the covered ground truth events) and within the error margins of our freehand annotation, while average detection delays are very small. A small number of our detections (mostly separations) are depicted in the third row of Fig. 5, with the respective ground truth events shown in the first row.
We note that, in our context of non-rigid registration, high recall is more important than high accuracy, because missing a topology change event is very likely to result in motion estimation errors. At the same time, as discussed in the concluding remarks of Section 3.4.1, a small number of “false positive” detections have essentially no impact on motion estimation under reasonable assumptions. Therefore, our topology change detection mechanism has desirable properties from a motion estimation perspective, while, at the same time, being able to detect contacts and separations with reasonable accuracy.
4.3.2 Registration under dynamic topology
We now quantitatively evaluate our motion estimation performance in the presence of dynamic topology. As discussed before, object separation events tend to produce warp field artifacts when not accounted for, while regular ‘forward’ warp estimates properly handle contacts. Therefore, we focus our assessments on areas of separation events.
For each ground truth separation event, we compute the average point-to-point distance between points in the source geometry, warped under F-Warp and FB-Warp, and their nearest neighbors in the target frame. In order to avoid obscuring the differences between the two estimates in the areas of interest, instead of averaging over the whole source frame, we only consider points within our ground truth annotation masks, which provide reasonable approximations of the true motion boundaries. Furthermore, we discard occluded warped points using a simple depth test against the target geometry’s depth map, with a tolerance threshold of .
We report our per-sequence average registration error values (in ) over separation areas in the last two columns of Table II. FB-Warp is consistently more accurate on all sequences, achieving significant error reductions of up to (bot_drawer sequence), with an average improvement of over the baseline. We also provide qualitative registration results for a subset of our ground truth separation events in the last two rows of Fig. 5. Clearly, F-Warp (fourth row) significantly oversmooths separation motion boundaries, while FB-Warp (last row) correctly preserves discontinuities in almost all cases.
We presented a complete pipeline for the non-rigid registration of arbitrary, unorganized point clouds that may be topologically different. Building upon a general warp field estimation algorithm, we introduced an efficient topology event handling post-processing phase that detects and classifies object contact and separation events, and, by exploiting the different qualities of forward and backward motion estimates with respect to different event types, locally selects the most appropriate one, in a seamless manner. We evaluated the motion estimation accuracy of our method on the MPI Sintel dataset, achieving state-of-the-art performance. Our evaluation on a custom dataset with sequences of highly dynamic scene topology demonstrated the success of our method in estimating motion on topological event boundaries, and showed promising performance in event detection. To the best of our knowledge, this is the first approach to handle dynamic topology in the context of raw point cloud registration. Furthermore, we openly release the implementation of our baseline warp field estimation algorithm as part of our point cloud processing library .
The support of Northrop Grumman Mission Systems University Research Program, of ONR under grant award N00014-17-1-2622, and the support of the National Science Foundation under grants BCS 1824198 and CNS 1544787 are greatly acknowledged.
-  A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, “Visual odometry and mapping for autonomous flight using an rgb-d camera,” in Robotics Research. Springer, 2017, pp. 235–252.
-  C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for rgb-d cameras,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3748–3754.
-  M. Jaimez, M. Souiai, J. Gonzalez-Jimenez, and D. Cremers, “A primal-dual framework for real-time dense rgb-d scene flow,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 98–104.
-  M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, “Fast odometry and scene flow from rgb-d cameras based on geometric clustering,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3992–3999.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on. IEEE, 2011, pp. 127–136.
-  T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison, “Elasticfusion: Dense slam without a pose graph,” in Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
R. A. Newcombe, D. Fox, and S. M. Seitz, “Dynamicfusion: Reconstruction and
tracking of non-rigid scenes in real-time,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 343–352.
-  M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger, “Volumedeform: Real-time volumetric non-rigid reconstruction,” in European Conference on Computer Vision. Springer, 2016, pp. 362–379.
-  E. Herbst, X. Ren, and D. Fox, “Rgb-d flow: Dense 3-d motion estimation using color and depth,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 2276–2282.
-  J. Quiroga, T. Brox, F. Devernay, and J. Crowley, “Dense semi-rigid scene flow estimation from rgbd images,” in European Conference on Computer Vision. Springer, 2014, pp. 567–582.
-  M. Jaimez, M. Souiai, J. Stückler, J. Gonzalez-Jimenez, and D. Cremers, “Motion cooperation: Smooth piece-wise rigid scene flow from rgb-d images,” in 3D Vision (3DV), 2015 International Conference on. IEEE, 2015, pp. 64–72.
-  N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
-  M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, P. Kohli, V. Tankovich, and S. Izadi, “Fusion4d: Real-time performance capture of challenging scenes,” ACM Trans. Graph., vol. 35, no. 4, pp. 114:1–114:13, Jul. 2016. [Online]. Available: http://doi.acm.org/10.1145/2897824.2925969
-  W. Gao and R. Tedrake, “Surfelwarp: Efficient non-volumetric single view dynamic reconstruction,” in Robotics: Science and Systems, 2018.
-  M. Slavcheva, M. Baust, D. Cremers, and S. Ilic, “Killingfusion: Non-rigid 3d reconstruction without correspondences,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, no. 4, 2017, p. 7.
-  B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 303–312.
-  K. Zampogiannis, C. Fermuller, and Y. Aloimonos, “Cilantro: A lean, versatile, and efficient library for point cloud data processing,” in Proceedings of the 26th ACM International Conference on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 1364–1367.
-  S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, “Three-dimensional scene flow,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 722–729.
-  H. Spies, B. Jähne, and J. L. Barron, “Range flow estimation,” Computer Vision and Image Understanding, vol. 85, no. 3, pp. 209–231, 2002.
-  M. Hornacek, A. Fitzgibbon, and C. Rother, “Sphereflow: 6 dof scene flow from rgb-d pairs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3526–3533.
-  D. Sun, E. B. Sudderth, and H. Pfister, “Layered rgbd scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 548–556.
-  B. Amberg, S. Romdhani, and T. Vetter, “Optimal step nonrigid icp algorithms for surface registration,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
-  R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation for shape manipulation,” in ACM Transactions on Graphics (TOG), vol. 26, no. 3. ACM, 2007, p. 80.
-  O. Sorkine and M. Alexa, “As-rigid-as-possible surface modeling,” in Symposium on Geometry processing, vol. 4, 2007, p. 30.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  S. Rusinkiewicz and M. Levoy, “Efficient variants of the icp algorithm,” in 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on. IEEE, 2001, pp. 145–152.
-  C. Engels, H. Stewénius, and D. Nistér, “Bundle adjustment rules,” Photogrammetric computer vision, vol. 2, no. 2006, 2006.
-  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conf. on Computer Vision (ECCV), ser. Part IV, LNCS 7577, A. Fitzgibbon et al. (Eds.), Ed. Springer-Verlag, Oct. 2012, pp. 611–625.
-  S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.