We focus on the problem of tracking 3D object poses during in-hand manipulations using tactile image measurements from vision-based tactile sensors [lambeta2020digit, yuan2017gelsight]. Specifically, we look at tracking small objects without prior geometric models. For instance, a dexterous robot operating in a real-world household environment will need to manipulate novel household objects for which CAD models may not be available. We address the question: Can an object be tracked precisely enough for in-hand manipulation using only local measurements of its geometry?
Prior work has looked at the object tracking problem primarily in the context of planar pushing [sodhi2021tactile, suresh2021shape, lambert2019joint, yu2018realtime]. However, 3D in-hand manipulation poses additional challenges. First, the motion is less constrained, such that different object motions can explain the same measurements. Second, physics priors, such as quasi-static planar pushing models, are less informative in the 3D case. Hence, prior work on in-hand object tracking using tactile feedback has relied on a priori information about the object being localized, such as 3D global models or a database generated by a simulator [wang2021gelsight, bauza2020tactile, liang2020hand, bauza2019tactile].
Our key insight is that the tactile object tracking problem can be efficiently decomposed in two ways. First, we can decompose an object into many smaller local surface patches, which can be treated independently. Second, most of the information needed to infer the local surface patch geometry is already embedded in the corresponding tactile images.
We find that reliable tracking is achievable with only a local patch, i.e. a fused map created from a sequence of keyframe images within a continuous contact episode. For instance, even though two objects can have very different global geometries, they can contain very similar local patches that suffice for tracking.
To both create the local patch and track motion relative to it, we must fuse multiple tactile image measurements online while inferring the latent object poses. We formulate this as an inference problem over a factor graph that offers a flexible and efficient way to fuse such information while incorporating other priors derived from physical and geometric constraints [dellaert2020factor, dellaert2017factor].
What makes for a good representation for a local patch? An idealized tactile image captures the surface normals of the gel’s reflective layer, based on the color and intensity of illumination at each pixel. Hence, it is natural to learn a mapping from image to gel surface normals. While objects may have varying global shapes, the learned surface normal mapping can generalize across these different shapes, because the relationship between pixel intensities and surface normals depends only on the sensor configuration and the local contact geometry. Hence, we infer the surface normals at each pixel in the tactile image, then integrate those normals to create a 3D model of the visible section of the local patch.
We propose a novel two-stage approach for tracking objects in-hand without any prior object model information (Fig. 1). First, we learn a mapping from tactile images to surface normals using an image translation network. Second, we use these surface normals within a factor graph to both reconstruct a local patch map and use it to infer 3D object poses. Our key contributions are:
A factor graph formulation for 3D in-hand tactile tracking that does not rely on prior object models.
A factor that works across different global object shapes by relying on local patches generated from learned surface normals.
Empirical evaluation on both simulation and real-world trials.
II Related Work
Factor graphs for localization Localization and mapping problems are increasingly formulated as optimization objectives that leverage the inherent sparsity of the problem to give tractable and more accurate solutions than filtering approaches [cadena2016past]. Factor graphs are a popular way of solving such optimization objectives [dellaert2020factor, dellaert2017factor, sodhi2021tactile, czarnowski2020deepfactors, hartley2018hybrid]. They offer a flexible way to fuse multiple measurements while being computationally efficient to optimize. Factors in the graph encode local potentials on variables, such as observation models between measurements and states, as well as other priors from physics and geometry. We formulate our problem using a factor graph based framework in this paper.
Vision-based touch sensing The advent of vision-based tactile sensors [lambeta2020digit, yuan2017gelsight, donlon2018gelslim, yamaguchi2016fingervision] has enabled high-dimensional tactile image measurements that capture the local deformation at the point of contact. Recent work has also looked at creating accurate simulation models for such sensors [wang2020tacto, si2021taxim, agarwal2020simulation]. These sensors are being explored for various tactile manipulation applications. One class of approaches uses tactile images directly as local feedback to solve for control actions on tasks such as object insertion [dong2021tactilerl], box packing [dong2019boxpacking], and in-hand manipulation [she2020cable, lambeta2020digit]. However, such representations tend to overfit to the particular task or require a significant amount of data to generalize across tasks. Our work focuses on extracting a state representation, like the global object pose, that is easy to use and generalizes across different downstream control and planning tasks.
Estimation from touch
Prior work on estimating states from touch during manipulation has included filtering methods [izatt2017tracking, saund2017touch, koval2015mpf, pezzementi2011object], learning-only methods [sundaralingam2019robust, li2014localization], methods utilizing prior model information [bauza2020tactile, liang2020hand, bauza2019tactile], and graph-based optimization for planar pushing [yu2018realtime, lambert2019joint, suresh2021shape, sodhi2021tactile]. In particular, graph-based optimization offers benefits such as higher accuracy than filtering, the ability to incorporate analytic as well as learned models, and real-time operation using efficient, incremental solvers [kaess2008isam, kaess2012isam2, sodhi2020ics].
Of these different approaches, the work in [bauza2019tactile, liang2020hand, bauza2020tactile, wang2021gelsight] is most closely related in terms of the final objective of tracking 3D object poses during in-hand manipulations. These, however, require prior object model information, either as offline models [wang2021gelsight, bauza2019tactile, liang2020hand] or as a database from a simulator [bauza2020tactile]. In contrast, we do not require a prior model of the object being tracked. Instead, we build a local patch map on the fly for the current contact episode and use that within a factor graph framework.
III Problem Formulation
We begin by formalizing the estimation problem as factor graph optimization. A factor graph is a bipartite graph with two types of nodes: variables $x$ and factors $\phi$. Variable nodes are the latent states to be estimated, and factor nodes encode potentials on these variables from observations, physics, or geometry (Fig. 2).
We solve for the maximum a posteriori (MAP) objective by maximizing the product of all factor graph potentials, i.e.,

$\hat{x} = \operatorname*{argmax}_{x} \prod_{i} \phi_i(x_i) \quad (1)$
To solve this inference objective efficiently, we assume the potentials $\phi_i(\cdot)$ to be Gaussian factors corrupted by zero-mean, normally distributed noise. Under Gaussian noise model assumptions, MAP inference is equivalent to a nonlinear least-squares objective [dellaert2017factor], i.e.,

$\hat{x} = \operatorname*{argmin}_{x} \sum_{i} \| h_i(x_i) - z_i \|^2_{\Sigma_i} \quad (2)$

where $h_i(\cdot)$ are measurement models, $z_i$ are measurements, and $\Sigma_i$ are the corresponding noise covariances.
For the in-hand object tracking problem, we define states in the graph to be the 6-DOF object and end-effector poses at every time step $t$, i.e. $x_t = \{o_t, e_t\}$, where $o_t, e_t \in SE(3)$. Factors in the graph include image-to-image factors $\phi_{ii}$, image-to-patch factors $\phi_{ip}$, velocity smoothness priors $\phi_{vel}$, end-effector pose priors $\phi_{eff}$, and vision priors $\phi_{vis}$ for re-localization at the beginning of a contact episode. At every time step, new variables and factors are added to the graph. Writing out Eq. 2 for our problem,

$\hat{x} = \operatorname*{argmin}_{x} \sum_{t} \Big\{ \|\phi_{ii}(o_{t-1}, o_t)\|^2_{\Sigma_{ii}} + \|\phi_{ip}(o_t)\|^2_{\Sigma_{ip}} + \|\phi_{vel}(o_{t-2}, o_{t-1}, o_t)\|^2_{\Sigma_{vel}} + \|\phi_{eff}(e_t)\|^2_{\Sigma_{eff}} + \|\phi_{vis}(o_t)\|^2_{\Sigma_{vis}} \Big\} \quad (3)$
Individual cost terms in Eq. 3 are described in detail in Section IV-C. Eq. 3 is the optimization objective that we must solve at every time step. Instead of re-solving from scratch at each step, we use efficient, incremental solvers [kaess2012isam2] to solve it in real-time.
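Under the Gaussian assumptions above, solving the graph amounts to iterating Gauss-Newton on a stacked least-squares problem. A minimal sketch on a toy 1D pose chain illustrates this; the prior and "odometry" factors here are hypothetical stand-ins, and plain NumPy replaces the GTSAM/iSAM2 machinery used in the paper:

```python
import numpy as np

# Toy instance of Eq. 2: a unary prior on x0 plus binary relative factors
# between consecutive 1D "poses". All factors are hypothetical examples.
def residuals(x, prior, rels):
    r = [x[0] - prior]                                    # unary prior factor
    r += [x[i + 1] - x[i] - d for i, d in enumerate(rels)]  # binary factors
    return np.array(r)

def gauss_newton(x, prior, rels, iters=10):
    eps = 1e-6
    for _ in range(iters):
        r = residuals(x, prior, rels)
        # Numerical Jacobian of the stacked residual vector
        J = np.zeros((len(r), len(x)))
        for j in range(len(x)):
            xp = x.copy()
            xp[j] += eps
            J[:, j] = (residuals(xp, prior, rels) - r) / eps
        x = x - np.linalg.solve(J.T @ J, J.T @ r)         # Gauss-Newton update
    return x

# Prior anchors x0 at 0; relative factors say x1-x0=1.0 and x2-x1=0.5
x = gauss_newton(np.zeros(3), prior=0.0, rels=[1.0, 0.5])
# x ≈ [0.0, 1.0, 1.5]
```

In the paper's setting, iSAM2 avoids recomputing this full solve at every step by incrementally updating a factored representation of the system.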
We present a two-stage approach: First, we learn a mapping from tactile images to surface normals (Section IV-A) which can be integrated to create a 3D reconstruction (Section IV-B). Second, we use these surface normals to create a 3D local patch map online within a factor graph for inferring the latent 3D object poses (Section IV-C).
IV-A Learning surface normals
Here we discuss how we learn to predict surface normals using tactile color images from the Digit sensor [lambeta2020digit]. Color images are generated by illuminating the gel surface with three light sources (Fig. 3(a)), such that each light has a unique color and direction. When pressing an object into the gel, the gel surface conforms to the object. Under an idealized model of the sensor, the color and intensity of light reaching the camera due to diffuse reflection from a point on the surface is directly related to the surface normal of the gel at that point. Hence, the RGB intensity values at each image pixel are expected to contain significant information about the corresponding local surface normal of the object.
To infer surface normal images from tactile images we train an image translation network, pix2pix [isola2017]. The pix2pix model is based on a generator-discriminator network architecture that enables it to learn mappings from a small amount of training data. It also enables us to learn a generalized mapping, i.e. we test on objects unseen during training. To train the model, we use a dataset of ground truth pairs of color and surface normal images. We generate datasets in two ways. For the simulation dataset, we generate surface normals by adding a normal shader to the Tacto simulator [wang2020tacto] based on Pyrender [matl2019pyrender]. For the real dataset, we follow a similar procedure as [yuan2017gelsight, wang2021gelsight], using a ball bearing of known radius whose ground truth normals can be synthesized. We manually annotate the circular patches in RGB images, and synthesize ground truth normals in the foreground annotations. We also augment our dataset with additional simulated images to improve generalization.
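Synthesizing the ball-bearing labels can be sketched as follows; the pixel grid size, circle center, and radius (all in pixels) are assumed values standing in for the manual annotations:

```python
import numpy as np

# Ground-truth normal synthesis for a sphere cap of known radius pressed
# into the gel. (cx, cy, r) would come from the manual circle annotation.
def sphere_normals(h, w, cx, cy, r):
    """Per-pixel unit normals of a sphere cap; background points along +z."""
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    inside = dx**2 + dy**2 < r**2
    nz = np.sqrt(np.clip(r**2 - dx**2 - dy**2, 0, None)) / r
    n = np.stack([dx / r, dy / r, nz], axis=-1)
    n[~inside] = [0.0, 0.0, 1.0]      # flat, undisturbed gel elsewhere
    return n

normals = sphere_normals(64, 64, cx=32, cy=32, r=10)
# at the contact center the surface points straight up: normals[32, 32] ≈ [0, 0, 1]
```

The three normal components are then mapped to the three channels of the target image for pix2pix training.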
IV-B Reconstruction from normals
Given an image with local surface normal information, we generate a 3D representation of the corresponding surface geometry. A surface normal $n = (n_x, n_y, n_z)$ is related to the gradient of the corresponding depth image $z(x, y)$ as,

$g_x = \frac{\partial z}{\partial x} = -\frac{n_x}{n_z}, \quad g_y = \frac{\partial z}{\partial y} = -\frac{n_y}{n_z} \quad (4)$
Given the depth gradients $(g_x, g_y)$, we can recover the depth map $z$ by integration using a fast Poisson solver with the discrete sine transform (DST) [doerner2012poisson], as used in prior work [yuan2017gelsight, wang2021gelsight]. For the boundary conditions, we use the mean distance to the undisturbed gel surface. A consequence of this choice is that contact regions crossing the edge of the image are not correctly handled by the solver; we filter out such images in our trials.
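To make the integration step concrete, here is a dense least-squares formulation of the same problem, shown in place of the DST-based Poisson solver purely for illustration (it is only practical on tiny grids):

```python
import numpy as np

# Recover depth z from forward-difference gradients (gx, gy) by stacking
# one linear equation per gradient sample and solving in least squares.
# gx[i, j] constrains z[i, j+1] - z[i, j]; gy[i, j] constrains
# z[i+1, j] - z[i, j]. The constant depth offset is unobservable from
# gradients, so we pin the mean (the paper's boundary condition uses the
# mean distance to the undisturbed gel surface instead).
def integrate_gradients(gx, gy, z_mean=0.0):
    h, w = gx.shape
    n = h * w
    idx = lambda i, j: i * w + j
    rows, b = [], []
    for i in range(h):
        for j in range(w - 1):
            r = np.zeros(n); r[idx(i, j + 1)] = 1; r[idx(i, j)] = -1
            rows.append(r); b.append(gx[i, j])
    for i in range(h - 1):
        for j in range(w):
            r = np.zeros(n); r[idx(i + 1, j)] = 1; r[idx(i, j)] = -1
            rows.append(r); b.append(gy[i, j])
    z, *_ = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)
    return z.reshape(h, w) - z.mean() + z_mean
```

The DST-based solver computes the same least-squares solution in closed form by diagonalizing the Laplacian, which is what makes it fast enough for full-resolution tactile images.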
Finally, once we have computed the depth map, we inverse project it to obtain a 3D point cloud. We use the OpenGL clip projection model [openGLproj] to map from pixel to world coordinates, i.e. $X_w = V^{-1} P^{-1} x_{pix}$, where $P$ is the projection matrix based on the near and far plane values of the projection frustum, and $V$ is the camera view matrix.
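The inverse projection can be sketched with a plain pinhole model; the intrinsics fx, fy, cx, cy here are assumed values, and the full OpenGL clip model used in the paper adds near/far-plane handling on top of this idea:

```python
import numpy as np

# Back-project a depth map to a point cloud in the camera frame under a
# pinhole model (illustrative simplification of the OpenGL clip projection).
def depth_to_points(depth, fx, fy, cx, cy):
    h, w = depth.shape
    vs, us = np.mgrid[0:h, 0:w]
    x = (us - cx) * depth / fx        # pixel column -> metric x at that depth
    y = (vs - cy) * depth / fy        # pixel row    -> metric y at that depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

Transforming the resulting camera-frame points by the inverse view matrix then yields the world-frame (or end-effector-frame) cloud used for registration.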
IV-C Factor graph optimization
Once we have the surface normal images, we integrate those along with other priors as factors within a factor graph. The factor graph optimizer then solves for the joint objective in Eq. 3. We look at each of the cost terms in Eq. 3 in detail.
Image-to-image factors
In some cases, it suffices to look at consecutive tactile images and infer the relative transformation between them. Color images from the tactile sensor at the current and previous time steps are converted into point clouds in their respective end-effector or sensor frames. The two point clouds are registered against each other using a point-to-plane iterative closest point (ICP) algorithm. The resultant relative transformation is added as a binary factor between consecutive poses in the graph. This is expressed as the $\phi_{ii}$ term in Eq. 3, i.e.,

$\phi_{ii}(o_{t-1}, o_t) = (o_{t-1}^{-1} o_t) \ominus \hat{T}_{t-1,t} \quad (5)$
where $o_{t-1}, o_t$ are the current relative pose estimates from the graph and $\hat{T}_{t-1,t}$ is the measured transformation from ICP registration. Both point clouds use the gel center as their origin. The registration operates on pairs of point correspondences $(p_i, q_i)$ in the two point clouds. $\ominus$ denotes the difference between two SE(3) manifold elements.
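The inner step of ICP, once correspondences are fixed, has a closed-form solution in the point-to-point case, sketched below via the Kabsch/SVD method; the paper's point-to-plane variant replaces this with a linearized 6-DOF solve that also uses the surface normals:

```python
import numpy as np

# Closed-form rigid alignment given known point correspondences
# (point-to-point ICP inner step; illustrative stand-in for point-to-plane).
def rigid_align(P, Q):
    """Find R, t minimizing sum_i ||R @ P[i] + t - Q[i]||^2."""
    p0, q0 = P.mean(0), Q.mean(0)
    H = (P - p0).T @ (Q - q0)                 # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflection
    R = Vt.T @ D @ U.T
    return R, q0 - R @ p0
```

Full ICP alternates this solve with re-estimating nearest-neighbor correspondences until the transformation converges; the converged transformation is what enters the graph as the factor measurement.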
Image-to-patch factors
Image-to-image factors fail whenever the tactile image changes by a non-trivial amount, leading to large registration errors. This happens whenever the object undergoes a larger transformation or moves in and out of the gel. To stabilize tracking in these situations, we introduce a local patch model by fusing together multiple point clouds and registering new tactile point cloud data against this patch. The local patch is maintained by fusing together images at specific keyframes within the current contact episode, rather than fusing all available frames. We choose these keyframes at fixed intervals given uniform motions, but one can also select them based on a field-of-view overlap threshold. The current cloud is registered against the local patch map cloud, and the relative transformation is added as factors to the graph,

$\phi_{ip}(o_t) = o_t \ominus \hat{T}_{M,t} \quad (6)$
where $o_t$ is the current estimate from the graph and $\hat{T}_{M,t}$ is the measured transformation from ICP registration against the patch. The registration operates on pairs of point correspondences in the current point cloud and the local patch map.
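Keyframe selection for the patch map can be sketched with a simple motion-based trigger; the distance threshold here is an assumed stand-in for the fixed-interval or field-of-view-overlap criteria mentioned above:

```python
import numpy as np

# Keep a frame as a keyframe once the sensor has moved far enough since
# the last keyframe (hypothetical distance-based variant of the paper's
# fixed-interval / overlap-threshold selection).
def select_keyframes(positions, dist_thresh=0.005):
    keys = [0]                                   # first contact frame is always kept
    for t in range(1, len(positions)):
        if np.linalg.norm(positions[t] - positions[keys[-1]]) > dist_thresh:
            keys.append(t)
    return keys
```

Fusing only these keyframes keeps the patch map compact while still covering the geometry swept during the contact episode.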
Constant velocity priors
We add a prior that assumes objects move at constant velocity, which has the effect of smoothing tracked trajectories. This is a ternary factor between triplets of object poses,

$\phi_{vel}(o_{t-2}, o_{t-1}, o_t) = (o_{t-1}^{-1} o_t) \ominus (o_{t-2}^{-1} o_{t-1}) \quad (7)$
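The constant-velocity residual compares consecutive relative pose increments; a planar (x, y, theta) version, used here as a simplified stand-in for the full SE(3) factor, makes the computation explicit:

```python
import numpy as np

# Relative planar pose b^-1 * a, expressed in b's frame, with angle wrapping.
def ominus(a, b):
    dx, dy = a[0] - b[0], a[1] - b[1]
    c, s = np.cos(b[2]), np.sin(b[2])
    dth = (a[2] - b[2] + np.pi) % (2 * np.pi) - np.pi
    return np.array([c * dx + s * dy, -s * dx + c * dy, dth])

# Ternary constant-velocity residual: zero when the last two pose
# increments are identical, i.e. the object moves at constant velocity.
def const_vel_residual(o_prev2, o_prev1, o_t):
    return ominus(o_t, o_prev1) - ominus(o_prev1, o_prev2)
```

Penalizing this residual (weighted by its covariance) smooths the trajectory without forbidding velocity changes outright.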
End-effector pose priors
We model uncertainty about end-effector locations as unary priors on end-effector variables,

$\phi_{eff}(e_t) = e_t \ominus \hat{e}_t \quad (8)$
where $\hat{e}_t$ are poses from the motion capture system with added Gaussian noise. For a robot end-effector, these pose measurements would instead come from the robot kinematics.
Vision priors
We add a global vision pose prior for only the object pose at the start of an episode,

$\phi_{vis}(o_0) = o_0 \ominus \hat{o}_0 \quad (9)$
where $\hat{o}_0$ are poses with added Gaussian noise. For re-localization during multi-contact episodes, we can add such a factor at the start pose of every new contact episode.
V Results and Evaluation
We evaluate our approach qualitatively and quantitatively on a number of episodes where an object, unknown a priori, must be tracked from a sequence of tactile measurements. We compare against a set of baselines on two fronts: a) on surface normal predictions from images and b) on the final tracking error of object poses. We use PyTorch[paszke2019pytorch] for training surface normal models, and GTSAM C++ library [dellaert2012factor] for factor graph optimization. We specifically use the iSAM2 [kaess2012isam2] solver for efficient, real-time optimization.
Table I: Mean-squared pixel loss on the validation dataset for each model type, across object shapes: Sim Sphere, Cube, Real sphere, Toy brick, Toy human.
V-A Experimental setup
We collected simulation data using the Tacto simulator [wang2020tacto], in which one can load the Digit sensor and an object, and render high-resolution tactile image readings in real-time. The simulator uses PyBullet [coumans2016pybullet] as the underlying physics engine and Pyrender [matl2019pyrender] as the back-end rendering engine for generating images. We generated trials by using a position controller to move the object on the sensor surface. We collected data for a diverse set of objects: Sphere, Cube, Toy human, and Toy brick.
For real-world episodes, we used a Digit sensor [lambeta2020digit] to get tactile measurements. We mounted the object on a workbench and mounted the Digit on a movable plate, both of which were tracked by an OptiTrack motion capture system for ground truth poses. We collected real-world data for two objects: a Sphere (ball bearing) of 1/2” diameter and a Pyramid 1/2” tall with a 1.75” side length.
V-B Surface normal reconstructions
We first analyze the accuracy of learned surface normals from color images collected by the Digit tactile sensor. For simulation episodes, we train on only two simple objects: 50 images of Sphere and 50 images of Cube. We tested the model on two different held-out shapes: Toy brick and Toy human. For real episodes, we only had ground truth normals for the real Sphere. Hence, we expanded the training dataset to include both the training data from simulation and the real Sphere data. We tested the model on the held-out real Pyramid object. As baselines, we picked two different pix2pix architectures, unet and resnet. We also trained a baseline 3-layer MLP (5-32-32-3, tanh activations, L2 loss) mapping a single color pixel (r, g, b, x, y) to a surface normal (nx, ny, nz), similar to prior work [wang2021gelsight].
Fig. 4 shows qualitative performance of the pix2pix (unet) model on both real and simulated images. We can see the model generalizes to fairly different shapes such as the Toy human and Toy brick, even though it is trained on simple objects. Moreover, fine-tuning this model with very little real data, i.e., 50 images of the real Sphere, enables it to generalize to unseen geometries in the real-world such as different local patches of a Pyramid.
Quantitative model performance
Table I compares mean-squared pixel loss on the validation dataset for different model choices. We see that pix2pix, for both unet and resnet architecture, has a fairly low MSE loss and generalizes to unseen shapes such as the Toy human and Toy brick. On the other hand, the per-pixel MLP baseline incurs a high MSE loss. In general, the per-pixel MLP is likely to be insufficient for non-ideal tactile sensors with effects such as self-shadowing of the gel. Having convolutional layers, such as with the pix2pix architectures, can address such confounding effects since these are typically localized in the image, e.g. shadows are cast by nearby object features since object depth is small relative to the image size.
V-C Factor graph optimization
We now look at the final task performance of tracking 3D object poses using tactile image measurements. We compare 4 objectives, each of which uses different factors: ConstVel uses only a constant velocity prior, ImageToImage uses image-to-image factors, PatchGraph (ours) uses image-to-patch factors, and GroundtruthPatch uses a global object model. GroundtruthPatch assumes the object is known a priori and hence represents the best a method can do.
Simulation tracking performance
Fig. 5 shows the qualitative tracking performance of PatchGraph for various objects in simulated trials. For Cube, Toy brick and Toy human, PatchGraph is able to reliably track rotations of the object. The local patch, constructed from online estimates, appears to be consistent with the local object geometry, thus explaining the good tracking performance. The object that was most difficult to track was Toy brick, as evidenced by the distortions in the patch, owing to the jerkiness of the in-contact motions. We also note that tracking rotations for a Sphere has high errors, which is expected since rotations of a sphere are unobservable from tactile images.
Fig. 6 shows the quantitative rotation and translation tracking performance against baselines on all objects, with distinct contact sequences per object. Overall, PatchGraph has the lowest errors, matching closely those of GroundtruthPatch, which has the global object model. This lends credence to our claim that a local patch suffices for reliable object tracking. ConstVel understandably has the highest variance of all baselines. ImageToImage fails to outperform PatchGraph on any of the datasets. Toy brick appears to be the most challenging among datasets, where GroundtruthPatch has a clear performance gap.
Real tracking performance
Fig. 7 shows the qualitative tracking performance of PatchGraph for various objects in real trials. Fig. 7(b) shows the local patches, which appear consistent with the local object geometry and are key to reliable tracking. Fig. 7(c) shows good rotation tracking for Pyramid and good translation tracking for Sphere.
Fig. 8 shows quantitative rotation and translation tracking performance against baselines on all objects, with distinct contact sequences per object. The main observation is that ImageToImage performs much worse than ImageToPatch, particularly in translation errors for Sphere. This is primarily because point clouds generated from individual images are not as geometrically discriminative as the fused local patch, causing ImageToImage factors alone to diverge quickly. PatchGraph keeps translation errors under 4mm and rotation errors under 0.2rad, which looks promising for use in dexterous object manipulations.
VI Conclusion
We presented a factor graph-based approach for tracking 3D object poses from tactile image sequences during in-hand manipulations. We showed reliable tracking on simulated and real objects without relying on any a priori object information. We achieved this by exploiting two decompositions of the tracking problem. First, a complex object can be treated as a composition of many local patches, each of which can be mapped and tracked largely independently. Second, surface normal information is highly localized within a tactile image and independent of the global object shape.
A primary limitation is that tracking can fail when the local patch map is not sufficiently discriminative geometrically, e.g. flat or featureless patches, or when the patch motions are degenerate in the observed image space, e.g. rotations of a spherical object. As future work, it would be interesting to explore solutions that take into account geometric degeneracies [zhang2016degeneracy, westman2019degeneracy], as well as approaches that can detect slip and shear [yuan2015measurement] to disambiguate motion degeneracies. Another interesting future direction would be to complement the tracker with a global first-pose re-localization that generalizes across objects, e.g. using visual images to predict a contact location likelihood.