Localization and Tracking of User-Defined Points on Deformable Objects for Robotic Manipulation

05/19/2021 ∙ by Sven Dittus, et al. ∙ KIT

This paper introduces an efficient procedure to localize user-defined points on the surface of deformable objects and track their positions in 3D space over time. To cope with a deformable object's infinite number of DOF, we propose a discretized deformation field, which is estimated at runtime using a multi-step non-linear solver pipeline. The resulting high-dimensional energy minimization problem describes the deviation between an offline-defined reference model and a pre-processed camera image. An additional regularization term allows for assumptions about the object's hidden areas and increases the solver's numerical stability. Our approach is capable of solving the localization problem online in a data-parallel manner, making it ideally suited for the perception of non-rigid objects in industrial manufacturing processes.




I Introduction

Many manufacturing processes rely on image processing to enable industrial robots to manipulate objects. Whereas many sophisticated camera systems can localize and track user-defined Points of Interest (POIs) on rigid objects, there is still no sufficiently accurate solution to this problem for deformable objects. Moreover, existing approaches reconstruct the deformable object's model online, which requires POIs to be defined at runtime and makes these approaches unsuitable for fully automated processes. This paper proposes a solution for defining POIs on an offline model and then localizing and tracking these points on a deformable object in an online detection pipeline.

Fig. 1: Overview of the localization and tracking pipeline

II Related Work

Existing approaches for localizing and tracking POIs are only applicable to specific object categories (e.g. linear [Tang.2018] or planar [Tang.2013, Schulman.2013]), assume specific deformation models (e.g. articulated models [Schmidt.2015] or skeletons [Gall.2009]) or particular materials (e.g. textiles [Li.2014, Li.2018]), or are restricted to detecting specific features (e.g. points on corners or edges [Yamazaki.2014, Ramisa.2012]). To reach the precision required for sophisticated manipulation tasks, prior work requires external markers [Finnegan.2006, Trumble.2017] or elaborate physics models [Bay.2006, Tian.2010, Lang.2011, Schulman.2013, Leizea.2014]. We combine several state-of-the-art (SotA) solutions for rigid object state estimation (such as SHOT descriptors), physical modelling (such as deformation fields) and computer graphics (such as projection) into a processing pipeline which permits the tracking and localization of user-defined points on arbitrary deformable objects using only a single depth camera, without markers or prior physics modelling.

III Method

Our algorithm consists of three distinct phases: (1) a demonstration of the reference model and the POIs; (2) an iterative localization and tracking process consisting of the observation of a new point cloud, the identification of correspondences between the observation and the deformed reference model of the previous timestep, and the estimation of the deformation; and (3) a coordinate transformation of the localized POIs into the robot end-effector coordinate system for subsequent manipulation.
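Phase (3) can be illustrated with a minimal sketch that applies a homogeneous transform to a localized POI. The transform `T_ee_cam`, its values and the function name are hypothetical; in practice such a transform would come from a camera-to-robot calibration, which is not described here.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform (nested tuples, row-major) to a 3D point."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


# Hypothetical camera-to-end-effector transform (assumption for illustration):
# a 90 degree rotation about the z-axis followed by a translation in metres.
T_ee_cam = (
    (0.0, -1.0, 0.0, 0.10),
    (1.0, 0.0, 0.0, 0.00),
    (0.0, 0.0, 1.0, -0.05),
    (0.0, 0.0, 0.0, 1.0),
)

poi_cam = (0.2, 0.0, 0.5)                    # POI localized in the camera frame
poi_ee = transform_point(T_ee_cam, poi_cam)  # same POI in the end-effector frame
print(poi_ee)
```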

III-A Surface and deformation model

To perform computations efficiently, we model object surfaces as triangle meshes, while deformations are modelled via a deformation grid [Innmann.2016]. Unlike [Innmann.2016], we use a highly detailed mesh as a surface representation which is independent of the deformation model’s resolution. This increases computational performance while maintaining a highly detailed surface. Our deformation model consists of two data structures, both containing grid points. Whereas the equally spaced static grid describes the undeformed reference model, the deformation field represents the object’s deformed state at the current timestep. Each grid point is defined by a position vector in 3D space, which allows the position of an undeformed vertex within the static grid to be expressed as a sum of the surrounding grid point positions, weighted with trilinear weights (cf. fig. 2). The position of the same vertex in the deformation field can be described analogously by a weighted sum of deformed grid point positions. We define an observation as an organized point cloud of the deformed object.

Fig. 2: Trilinear weights (l.); static and deformed grid cells (r.)
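The weighted-sum representation can be sketched as follows. This is a minimal illustration assuming a single axis-aligned cubic grid cell; the function names, the corner ordering and the example values are assumptions and not part of the original method description.

```python
from itertools import product


def trilinear_weights(vertex, cell_origin, cell_size):
    """Return the 8 trilinear weights of `vertex` inside a cubic grid cell.

    Corners are ordered like itertools.product((0, 1), repeat=3).
    """
    # Local coordinates in [0, 1]^3 relative to the cell's minimum corner.
    u = [(v - o) / cell_size for v, o in zip(vertex, cell_origin)]
    weights = []
    for corner in product((0, 1), repeat=3):
        w = 1.0
        for axis, bit in enumerate(corner):
            w *= u[axis] if bit else 1.0 - u[axis]
        weights.append(w)
    return weights


def interpolate(weights, corner_positions):
    """Weighted sum of the 8 corner positions (static or deformed)."""
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, corner_positions))
        for axis in range(3)
    )


# Static unit cell at the origin and a vertex inside it.
static_corners = [tuple(float(c) for c in corner) for corner in product((0, 1), repeat=3)]
vertex = (0.25, 0.5, 0.75)
weights = trilinear_weights(vertex, (0.0, 0.0, 0.0), 1.0)

# Deformed cell: here every grid point is simply shifted along x.
deformed_corners = [(x + 1.0, y, z) for x, y, z in static_corners]

print(interpolate(weights, static_corners))    # reproduces the undeformed vertex
print(interpolate(weights, deformed_corners))  # vertex position in the deformed cell
```

The weights are computed once on the static grid and then reused for every deformed state, which is what decouples the mesh resolution from the grid resolution.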

III-B Demonstration

Most SotA approaches use either a low-resolution reference model of the deformable object [Zollhofer.2014] or none at all [Newcombe.2015, Innmann.2016]. To allow for user demonstrations of POIs, our algorithm requires a high-detail reference model to be created offline. To improve the stability of our solver, we limit the object’s initial deformation with respect to its reference model by demonstrating a library of reference models in an offline step. Each model in the library is a triangle mesh of the object in a distinct deformation state. The models are generated by fusing several depth images into a Truncated Signed Distance Field (TSDF) and then extracting the triangles via the Marching Cubes algorithm [Lorensen.1987]. After the demonstration of the reference models, the user can select relevant POIs on the meshed surface via a graphical user interface. At runtime, after the first observation, the model in the library most similar to this observation is used for initialization.

III-C Correspondence identification

The definition and identification of correspondences links the current observation and the deformed reference model of the previous timestep and forms the basis of the deformation estimation: The estimation of a deformation is equivalent to the minimization of the distances between all correspondences. We define three correspondence types:


Point-to-point (P2P) correspondences result from projecting each surface point of the deformed reference model into the image plane and comparing it to the corresponding point that has been measured. The quality of a P2P-correspondence can be described by a weight that depends on the distance between the projected point and its correspondent, the deviation between the normal of the projected point and that of its correspondent, and the angle between the camera view direction and the surface normal.


Projective correspondences such as P2P typically yield only approximate, not exact, correspondences. P2S-correspondences add another degree of freedom by associating a point in the model with a plane in the observation, defined by the P2P-correspondent and its normal. During deformation estimation, this allows minimizing only the distance along the normal.
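The difference between the two projective error measures can be sketched as follows. The function names and example values are assumptions for illustration, and the observed normal is assumed to have unit length.

```python
def point_to_point_residual(model_point, observed_point):
    """Euclidean distance between a model point and its observed correspondent."""
    return sum((m - o) ** 2 for m, o in zip(model_point, observed_point)) ** 0.5


def point_to_plane_residual(model_point, observed_point, observed_normal):
    """Signed distance of the model point to the plane through `observed_point`
    with unit normal `observed_normal` (distance along the normal only)."""
    return sum(
        n * (m - o) for n, m, o in zip(observed_normal, model_point, observed_point)
    )


# The model point is offset mostly tangentially (along x) from its correspondent.
model_point = (0.3, 0.0, 1.02)
observed_point = (0.0, 0.0, 1.0)
observed_normal = (0.0, 0.0, 1.0)

p2p = point_to_point_residual(model_point, observed_point)
p2s = point_to_plane_residual(model_point, observed_point, observed_normal)
print(p2p, p2s)
```

In this example the point-to-point residual is dominated by the tangential offset, while the plane-based residual only measures the small offset along the normal, so the model point remains free to slide within the plane.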

Feature correspondences

Unlike projective correspondences, correspondences based on feature matching can detect large deformations, tangential movements and rotations of the object out of the image plane. Prior work [Guo.2016, Hansch.2014] and our own experiments have found the PFH and FPFH descriptors to be highly sensitive and specific but to scale poorly with the size of the point cloud, while SHOT descriptors scale linearly and are robust against outliers. We implement feature correspondences using SHOT, as its sensitivity suffices for most real-world applications.

III-D Deformation estimation

The deformation of the reference object can be estimated by formulating an optimization problem over the degrees of freedom of the grid points. Using the notation introduced in III-A, the deformation is expressed per grid point. For estimating the deformation field, we split up all unknowns into a single global rigid transformation and many local transformations and combine them in a single vector of unknowns.


Interpreting correspondences as error terms allows the deformation estimation to be formulated as an energy minimization problem, as also suggested by [Innmann.2016, Zollhofer.2014, Newcombe.2015]. This optimization can be regarded as a model regression problem and solved by existing solvers.
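One plausible general form of such an energy is sketched below. It is an illustration only: the symbol $\mathcal{X}$ for the vector of unknowns, the weights $w$ and the names of the individual terms are assumptions, and the exact formulation of the original work is not reproduced here.

```latex
E(\mathcal{X}) =
    w_{\mathrm{p2p}}  \, E_{\mathrm{p2p}}(\mathcal{X})
  + w_{\mathrm{p2s}}  \, E_{\mathrm{p2s}}(\mathcal{X})
  + w_{\mathrm{feat}} \, E_{\mathrm{feat}}(\mathcal{X})
  + w_{\mathrm{reg}}  \, E_{\mathrm{reg}}(\mathcal{X}),
\qquad
\mathcal{X}^{*} = \operatorname*{arg\,min}_{\mathcal{X}} E(\mathcal{X})
```

Each data term would sum the squared residuals of one correspondence type, and the regularization term would constrain the regions that are not covered by correspondences.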


[Zollhofer.2014] and [Innmann.2016] solve a similar high-dimensional nonlinear optimization problem by linearizing the model and using the Gauss-Newton method, incurring a significant overhead for the computation of the Jacobian. [Innmann.2016] splits the optimization into a two-stage process composed of a fixed registration followed by a deformation estimation. We leverage the observation in [Sorkine.2007] that the deformation estimation can again be split into two independent sub-problems, which allows solving for nonlinear rotations and linear translations using iterative Gauss-Newton on each sub-problem in turn (“flip-flop” strategy). We perform the fixed registration, i.e. the estimation of a global transformation, via Prerejective RANSAC (PSC). For the deformation estimation, setting up the Jacobian for the error terms of the three correspondence types is straightforward.

Fig. 3: Precision for different deformation types (l.); estimated deformation grid and ground-truth point cloud (blue) before and after one deformation cycle (r.)

Here, the number of error terms equals the number of correspondences, and each Jacobian entry is addressed by its row and column index. The Jacobians for the other two correspondence types can be found analogously.


With a single camera’s perspective, it is impossible to observe the complete surface of an object. The resulting spatial lack of correspondences leads to an underdetermined and ill-conditioned equation system. To alleviate this problem, we use an as-rigid-as-possible (ARAP) regularizer [Sorkine.2007], where non-observable surface points are deformed such that the total deformation of the body is as rigid as possible. Unlike prior work [Sorkine.2007, Sorkine.2017], we estimate the deformation in terms of the deformation grid instead of the mesh, leading to an adapted ARAP term.


In this term, the neighborhood of a grid point consists of its 6 surrounding grid points. For each grid point, our solver must solve for 6 unknowns describing its pose. As shown in [Sorkine.2017], the (non-linear) estimation of the rotations can be solved in closed form given the translations. For the translations, we obtain a linear system whose left-hand side is the product of the Laplace matrix with the vector of all unknowns.
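The idea of a grid-based ARAP regularizer can be illustrated with a small sketch. It assumes the standard ARAP energy of [Sorkine.2007] evaluated on grid points instead of mesh vertices, with uniform edge weights; the exact term and weighting of the original work are not reproduced, and all function and variable names are hypothetical.

```python
from itertools import product

IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def rotate(R, v):
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))


def grid_neighbors(index, shape):
    """Yield the (up to 6) axis-aligned neighbors of a grid index inside `shape`."""
    for axis in range(3):
        for step in (-1, 1):
            n = list(index)
            n[axis] += step
            if 0 <= n[axis] < shape[axis]:
                yield tuple(n)


def arap_energy(static, deformed, rotations, shape):
    """Sum over grid points i and neighbors j of
    || (d_i - d_j) - R_i (g_i - g_j) ||^2  (uniform weights assumed)."""
    energy = 0.0
    for i in static:
        for j in grid_neighbors(i, shape):
            rigid_edge = rotate(rotations[i], sub(static[i], static[j]))
            diff = sub(sub(deformed[i], deformed[j]), rigid_edge)
            energy += sum(c * c for c in diff)
    return energy


shape = (2, 2, 2)
static = {idx: tuple(float(c) for c in idx) for idx in product(range(2), repeat=3)}
rotations = {idx: IDENTITY for idx in static}
translated = {idx: (x + 0.5, y, z) for idx, (x, y, z) in static.items()}
stretched = {idx: (2.0 * x, y, z) for idx, (x, y, z) in static.items()}

print(arap_energy(static, translated, rotations, shape))  # rigid motion
print(arap_energy(static, stretched, rotations, shape))   # non-rigid stretch
```

A rigid translation of the whole grid leaves the energy at zero, whereas stretching the grid along one axis is penalized. This is the behavior that lets the regularizer fill in unobserved regions with the most rigid deformation compatible with the observed ones.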

Flip-flop solver

We iteratively estimate the rotations and translations in turn by solving for the rotations in closed form via singular value decomposition (see [Sorkine.2017] for details) and approximating the translations via Gauss-Newton, where the update step is obtained via preconditioned conjugate gradients (PCG). Using the Jacobians derived above, the deformation after an update step follows from the Gauss-Newton update.
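The linear part of such a solver can be sketched as a single Gauss-Newton update whose normal equations are solved with preconditioned conjugate gradients. This is a minimal illustration on small dense lists: the Jacobi (diagonal) preconditioner, the function names and the toy problem are assumptions, and the closed-form rotation step is omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(A, v):
    return [dot(row, v) for row in A]


def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A using conjugate
    gradients with a Jacobi (diagonal) preconditioner."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = mat_vec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


def gauss_newton_step(J, residuals):
    """Return the update delta solving (J^T J) delta = -J^T r."""
    cols = len(J[0])
    JTJ = [[sum(row[a] * row[b] for row in J) for b in range(cols)] for a in range(cols)]
    rhs = [-sum(row[a] * r for row, r in zip(J, residuals)) for a in range(cols)]
    return pcg(JTJ, rhs)


# Toy linear least-squares problem: residuals r(x) = J x - y, evaluated at x0 = 0.
J = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
x0 = [0.0, 0.0]
residuals = [dot(row, x0) - yi for row, yi in zip(J, y)]
delta = gauss_newton_step(J, residuals)
x1 = [a + d for a, d in zip(x0, delta)]
print(x1)
```

Because conjugate gradients only needs matrix-vector products, the same scheme carries over to large sparse systems where each product can be evaluated in a data-parallel fashion.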

Fig. 4: POI tracking on a deformable tripod (l.); rubber seal assembly (r.)

IV Results


In a first set of experiments, we assess the precision of our approach by comparing tracking results against manually labeled correspondences. 10 POIs on a deformable tripod were considered, with each POI also fitted with a color-coded marker to facilitate manual labeling. (Since our algorithm only considers geometric features, the presence of the markers neither helped nor hurt the algorithm.) The tripod was repeatedly deformed and the poses of the POIs were estimated by our algorithm as well as via the markers (cf. fig. 3). Our approach was capable of localizing all POIs with sub-millimeter accuracy, and tracking all POIs with errors between 0.6 and 2.2 mm. Unlike feature-matching based approaches, we always estimate the deformation of the complete surface and thereby avoid “mismatching” POIs by design.


A significant advantage of our approach is that each step of the solver pipeline can be efficiently parallelized. We benchmarked our algorithm using reference and deformation models at fine (reference model: 30000 vertices; deformation model: 3250 grid points) and coarse (reference model: 15000 vertices; deformation model: 700 grid points) resolutions. A parallelized CPU implementation of our algorithm localized all POIs in under in both cases on consumer hardware.

Robot experiments

In a first robot experiment, we track a point on the surface of a tripod subjected to several deformations of up to 20% of the tripod’s arm length, or ca. 2.5 cm. We use a UR5 robot equipped with a measuring tip to visualize tracking results (cf. fig. 4 (l.)), confirming precision within 2 mm. In a second experiment, we use our approach to position a flat rubber seal on a housing, illustrating its potential for real-world industrial applications (cf. fig. 4 (r.)).

V Discussion and Outlook

Our approach and solver pipeline allow efficient tracking and localization of POIs on deformable objects. Where prior work requires markers or explicit modelling, or does not allow for offline POI definition, our approach achieves sub-millimeter precision in localization and millimeter precision in tracking without these drawbacks. This makes it particularly suitable for applications in industrial robotics and for flexible, quickly reconfigurable assembly or surface treatment tasks. We are working on integrating our solution into an industrial robot manipulation framework, on a more efficient GPU implementation and on a more extensive evaluation on a wider set of benchmarks.