We look at the problem of estimating object states such as poses from touch. Consider the example of a robot manipulating an object in-hand: as the object is being manipulated, it is occluded by the fingers. This occlusion renders visual estimation alone insufficient. Touch in such cases can provide local, yet precise information about the object state.
The advent of new touch sensors [lambeta2020digit, yuan2017gelsight, yamaguchi2016fingervision] has enabled rich, local tactile image-based measurements at the point of contact. However, a single tactile image only reveals limited information that may correspond to multiple object poses, making it hard to directly predict poses from an image. We hence need to be able to reason over multiple such measurements to collapse uncertainty. Then the underlying inference problem is to compute the sequence of latent object poses, given a stream of tactile image measurements.
We solve this inference problem using a factor graph [dellaert2017factor] that offers a flexible way to process a stream of measurements while incorporating other priors including physics and geometry. The factor graph relies on having local observation models that can map measurements into the state space. Observation models for high-dimensional tactile measurements are, however, challenging to design. Prior work [yu2018realtime, lambert2019joint] used low-dimensional force measurements or hand-designed functions for high-dimensional tactile measurements. These, however, can be brittle and difficult to scale across objects and sensors.
In this paper, we propose learning tactile observation models that are incorporated as factors within a factor graph during inference (Fig. 1). The learner takes high-dimensional tactile measurements and predicts noisy low-dimensional poses that are then integrated by the factor graph optimizer. For the learner, however, accurately predicting object pose directly from a single tactile image is typically not possible without additional information such as a tactile map and a geometric object model. Our key insight is to instead have the learner predict relative poses from image pairs. That is, given a pair of non-sequential tactile images, predict the difference in the pose of the tactile sensor relative to the object in contact. This relative pose information is then used to correct for accumulated drift in the object pose estimate. Observation models trained in this fashion work across a large class of objects, as long as the local contact surface patches are similar in distribution to those found in the training data. This allows our method to generalize to new objects composed of new arrangements of familiar surface patches.
We propose a two-stage approach: we first learn local tactile observation models supervised with ground truth data, and then integrate these along with physics and geometric models as factors within a factor graph optimizer. The tactile observation model is learning-based, object-model free, and integrates with a factor graph. This integration leverages the benefits of both learning-based and model-based methods: (a) our learnt factor is general and more accurate than hand-designed functions, and (b) integrating this learnt model within a factor graph lets us reason over a stream of measurements while incorporating structural priors in an efficient, real-time manner. Our main contributions are:
A novel learnable tactile observation model that seamlessly integrates as factors within a factor graph.
Tactile factors that work across objects.
Real-time object tracking during real-world pushing trials using only tactile measurements.
II Related Work
Localization and state estimation are increasingly solved as smoothing problems, given their increased accuracy and efficiency over filtering counterparts [cadena2016past]. Typically, the smoothing objective is formulated as MAP inference over a factor graph whose variable nodes encode latent states and whose factor nodes encode the measurement likelihoods [dellaert2017factor]. To incorporate high-dimensional measurements as factors in the graph, one needs an observation model to map between latent states and measurements. These are typically analytic functions, e.g. projection geometry for images [mur2017orb, engel2014lsd], or scan matching for point clouds [dong2019gpu, zhang2014loam]. Recent work in visual SLAM has also looked at using such functions on learnt, low-dimensional encodings of the original image measurements [czarnowski2020deepfactors, bloesch2018codeslam].
While there exist well-studied analytic models within the visual SLAM literature, these are hard to define for tactile sensor images, which capture deformations that do not have a straightforward metric interpretation. Within the tactile sensing literature, rich tactile image sensors [lambeta2020digit, yuan2017gelsight] are increasingly used for manipulation tasks. One class of approaches uses local tactile measurements directly as feedback to solve various control tasks such as cable manipulation [she2020cable], in-hand marble manipulation [lambeta2020digit], and box-packing [dong2019tactile]. While efficient for a particular task, such approaches can be difficult to scale across different tasks. An inference module is additionally needed to estimate a common latent state representation, such as global object poses, that can be used for different downstream control and planning tasks.
Prior work on estimating states from touch during manipulation has included filtering methods [izatt2017tracking, saund2017touch, koval2015mpf], learning-only methods [sundaralingam2019robust], methods utilizing a prior map [bauza2019tactile], and graph-based smoothing methods [yu2018realtime, lambert2019joint]. Smoothing approaches that model the problem as inference over a factor graph have the benefits of (a) being more accurate than filtering methods, (b) incorporating structured priors unlike learning-only methods, and (c) recovering global object poses from purely local observation models without needing a global map. Moreover, the graph inference objective can be solved in real-time making use of fast, incremental tools [kaess2008isam, kaess2012ijrr, sodhi2020ics] in the literature. Table I summarizes the closest related work on graph-based tactile estimation [yu2018realtime, yu2018realtimesuction, lambert2019joint] or using sensor data most similar to ours [bauza2019tactile]. While each work addresses an aspect of the estimation problem, none of them utilize high-dimensional tactile images as learnt models within a factor graph that can work across object shapes.
Method | Graph-based | Tactile images | Across objects
Bauza et al. [bauza2019tactile] | ✗ | ✓ | ✓
Yu et al. [yu2018realtime, yu2018realtimesuction] | ✓ | ✗ | ✗
Lambert et al. [lambert2019joint] | ✓ | ✗ | ✗
III Problem Formulation
We formulate the estimation problem as inference over a factor graph. A factor graph is a bipartite graph with two types of nodes: variables and factors. Variable nodes are the latent states to be estimated, and factor nodes encode constraints on these variables, such as measurement likelihood functions or physics and geometric models. Maximum a posteriori (MAP) inference over a factor graph involves maximizing the product of all factor potentials $\phi_i$, i.e.,

$\hat{x} = \arg\max_x \prod_i \phi_i(x_i) \quad (1)$
Under Gaussian noise model assumptions, MAP inference is equivalent to solving a nonlinear least-squares problem [dellaert2017factor]. That is, for Gaussian factors

$\phi_i(x_i) \propto \exp\left(-\tfrac{1}{2} \| c_i(x_i) - z_i \|^2_{\Sigma_i}\right)$

corrupted by zero-mean, normally distributed noise,

$\hat{x} = \arg\min_x \sum_i \| c_i(x_i) - z_i \|^2_{\Sigma_i} \quad (2)$

where $c_i(\cdot)$ are cost functions defined over states and include measurement likelihoods or priors derived from physical or geometric assumptions, and $\| \cdot \|^2_{\Sigma}$ is the Mahalanobis distance with covariance $\Sigma$.
For our planar pushing setup, states in the graph are the planar object and end-effector poses at every time step $t$, i.e. $x_t = \{o_t, e_t\}$, where $o_t, e_t \in SE(2)$. Factors in the graph incorporate tactile observations $c_{tac}$, quasi-static pushing dynamics $c_{qs}$, geometric constraints $c_{geo}$, and priors on end-effector poses $c_{eff}$.
At every time step, new variables and factors are added to the graph. Writing out Eq. 2 for our setup at time step $t$,

$\hat{x}_{1:t} = \arg\min_{x_{1:t}} \sum_{\tau=1}^{t} \left\{ \| c_{tac} \|^2_{\Sigma_{tac}} + \| c_{qs} \|^2_{\Sigma_{qs}} + \| c_{geo} \|^2_{\Sigma_{geo}} + \| c_{eff} \|^2_{\Sigma_{eff}} \right\} \quad (3)$
Eq. 3 is the optimization objective that we must solve at every time step. Instead of re-solving from scratch at every time step, we make use of efficient, incremental solvers such as iSAM2 [kaess2012ijrr] for real-time inference. Individual cost terms in Eq. 3 are described in more detail in Section IV-B.
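As a toy illustration of the least-squares inference in Eq. 2, the following sketch solves a 1D pose-chain objective with one unary prior and a set of relative-measurement factors via the normal equations. This is not the paper's implementation (which uses GTSAM's iSAM2 incremental solver); function names and noise values are illustrative.

```python
# Toy sketch: batch least squares over a 1D "factor graph" with one prior
# factor on x_0 and relative factors z_i ~ x_{i+1} - x_i (all Gaussian).
import numpy as np

def solve_chain(prior, rel_meas, sigma_prior=0.01, sigma_rel=0.1):
    """Estimate states x_0..x_n by stacking whitened residual rows A x = b
    and solving the normal equations (Eq. 2 for linear factors)."""
    n = len(rel_meas) + 1
    rows, rhs = [], []
    prior_row = np.zeros(n)
    prior_row[0] = 1.0 / sigma_prior              # unary prior on x_0
    rows.append(prior_row)
    rhs.append(prior / sigma_prior)
    for i, z in enumerate(rel_meas):              # relative factors
        row = np.zeros(n)
        row[i], row[i + 1] = -1.0 / sigma_rel, 1.0 / sigma_rel
        rows.append(row)
        rhs.append(z / sigma_rel)
    A, b = np.vstack(rows), np.array(rhs)
    # One Gauss-Newton step suffices since all factors here are linear.
    return np.linalg.solve(A.T @ A, A.T @ b)

x = solve_chain(prior=0.0, rel_meas=[1.0, 1.0, 1.0])
```

With consistent measurements the least-squares solution reproduces the chain exactly; iSAM2 obtains the same estimates incrementally as new factors arrive.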
We present a two-stage approach: we first learn local tactile observation models from ground truth data (Section IV-A), and then integrate these models along with physics and geometric models as factors within a factor graph (Section IV-B). Fig. 2 illustrates our overall approach showing the three factors, tactile, physics and geometric, being integrated into the factor graph.
IV-A Tactile observation model
The goal of learning a tactile observation model is to derive the tactile factor cost term in Eq. 3 to be used during graph optimization. We do this by predicting a relative transformation and penalizing deviations from this prediction. The relative transformation is that of the sensor (or end-effector) pose relative to the object in contact.
Our learnt tactile observation model consists of a transform prediction network: given a pair of non-sequential tactile image inputs at times $i, j$, it predicts the relative transformation $T_{ij}$. This is done by first featurizing each image.
For encoding an image as a feature vector, we use an auto-encoder with a structural keypoint bottleneck proposed in [lambeta2020digit]. It consists of an encoder and a decoder using a tiny version of ResNet-18 as the backbone network. The encoder processes the image input into a feature map, from which 2D keypoint locations corresponding to maximum feature activations are extracted. At decoding, a Gaussian blob is drawn on an empty feature map for each extracted keypoint. The decoder then takes these as inputs and produces a reconstructed image. The auto-encoder is trained in a self-supervised manner with an L2 image reconstruction loss along with auxiliary losses that optimize for sparse, non-redundant keypoints.
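The core of the keypoint bottleneck can be sketched as a spatial soft-argmax that turns each channel of an encoder feature map into a 2D keypoint at its activation peak. This is our reading of the architecture, not the exact implementation; `temperature` is an illustrative parameter.

```python
# Sketch of a spatial soft-argmax keypoint extractor: each feature channel
# yields one 2D keypoint as the softmax-weighted expectation of pixel coords.
import numpy as np

def soft_argmax_keypoints(feat, temperature=1.0):
    """feat: (C, H, W) feature map -> (C, 2) keypoints in [-1, 1] coords."""
    C, H, W = feat.shape
    flat = feat.reshape(C, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    p = p.reshape(C, H, W)                             # per-channel distribution
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    ky = (p.sum(axis=2) * ys).sum(axis=1)              # expected y per channel
    kx = (p.sum(axis=1) * xs).sum(axis=1)              # expected x per channel
    return np.stack([kx, ky], axis=1)
```

A sharp activation peak in a channel maps to a keypoint near that pixel, which is what lets the keypoints track the contact patch center over time.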
Transform prediction network
Once we have a trained feature encoder, we use it in the transform prediction network to map input image pairs into keypoint feature vectors. This is followed by a fully-connected regression network that predicts a relative 2D transformation between times $i, j$ (Fig. 2). To make the same network work across object classes, we also pass in a one-hot class label vector $l$. We expand the feature inputs via an outer product with the class vector, $f_i \otimes l$, $f_j \otimes l$, and pass this expanded input to the fully-connected layers. The labels $l$ come from a simple classifier network trained on the activation maps extracted from the encoder.
The transform prediction network is trained using a mean-squared error (MSE) loss against ground truth relative transforms. We make the loss symmetric so that the network learns both the regular and the inverse transform, that is,

$\mathcal{L} = \sum_{(i,j) \in \mathcal{P}} \| T_{ij} - \hat{T}_{ij} \|^2 + \| T_{ji} - \hat{T}_{ji} \|^2 \quad (4)$

Here, $\hat{T}_{ij}$ is the ground truth relative transform between end-effector poses at time steps $i, j$ expressed in the object coordinate frame. That is, $\hat{T}_{ij} = (o_i^{-1} e_i)^{-1} (o_j^{-1} e_j)$, where $o, e$ are ground truth object and end-effector poses obtained using a motion capture system. $\mathcal{P}$ is the set of all non-sequential image pairs over a chosen time window $w_{train}$.
IV-B Factor graph optimization
Once we have the learnt local tactile observation model, we integrate it along with physics and geometric models as factors within a factor graph. The factor graph optimizer then solves for the joint objective in Eq. 3. Here we look at each of the cost terms in Eq. 3 in more detail.
Measurements from tactile sensor images are incorporated as relative tactile factors, denoted by the cost term $c_{tac}$ in Eq. 3. Our relative tactile factor is a quadratic cost penalizing deviation from a predicted value, that is,

$c_{tac} = \| T_{ij} \ominus \tilde{T}_{ij} \|^2_{\Sigma_{tac}}$

where $\tilde{T}_{ij}$ is the predicted relative transform from the transform prediction network, which takes as inputs the tactile image measurements at time steps $i, j$. $T_{ij}$ is the estimated relative transform computed using current variable estimates in the graph, and $\ominus$ denotes the difference between two manifold elements. $w_{graph}$ is the time step window over which these relative tactile factors are added; we choose this to be a subset of the training window set $w_{train}$ in Eq. 4.
$T_{ij}$, computed using current graph variable estimates, and $\tilde{T}_{ij}$, computed as the transform prediction network output, can be expressed as

$T_{ij} = (o_i^{-1} e_i)^{-1} (o_j^{-1} e_j), \qquad \tilde{T}_{ij} = \mathcal{F}(I_i, I_j)$

where $o, e$ are object and end-effector pose estimates in the graph, and $\mathcal{F}$ is the learnt transform prediction network with tactile images $I_i, I_j$ as inputs.
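The tactile factor's residual can be sketched with elementary SE(2) operations, assuming poses are $(x, y, \theta)$ vectors: the relative transform implied by the current graph estimates, minus the network prediction, with the angle difference wrapped. This is an illustrative reimplementation, not the GTSAM factor itself.

```python
# Sketch of the relative tactile factor residual using SE(2) "between" ops.
import numpy as np

def between(a, b):
    """Relative pose a^{-1} * b for SE(2) poses (x, y, theta)."""
    ca, sa = np.cos(a[2]), np.sin(a[2])
    dx, dy = b[0] - a[0], b[1] - a[1]
    return np.array([ca * dx + sa * dy, -sa * dx + ca * dy, b[2] - a[2]])

def tactile_residual(o_i, e_i, o_j, e_j, T_pred):
    """Residual between the graph-implied relative transform
    (o_i^{-1} e_i)^{-1} (o_j^{-1} e_j) and the network prediction T_pred."""
    T_ij = between(between(o_i, e_i), between(o_j, e_j))
    r = T_ij - T_pred
    r[2] = (r[2] + np.pi) % (2 * np.pi) - np.pi   # wrap angle difference
    return r
```

When the graph estimates agree with the prediction, the residual is zero and the factor contributes no cost.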
Quasi-static physics factor
To model object dynamics as it is pushed, we use a quasi-static physics model. The quasi-static approximation assumes end-effector trajectories executed with negligible acceleration, i.e. the applied pushing force is just large enough to overcome friction without imparting an acceleration [lynch1992manipulation].
We use the velocity-only quasi-static model from [zhou2017fast] that uses a convex polynomial to approximate the limit surface for the force-motion mapping. For sticking contact, the contact point pushing velocity must lie inside the motion cone, resulting in the following relationship between object and contact point velocities:

$v_x = \frac{(c^2 + p_x^2)\,u_x + p_x p_y\,u_y}{c^2 + p_x^2 + p_y^2}, \qquad v_y = \frac{p_x p_y\,u_x + (c^2 + p_y^2)\,u_y}{c^2 + p_x^2 + p_y^2}, \qquad \omega = \frac{p_x v_y - p_y v_x}{c^2}$

where $V = (v_x, v_y, \omega)$ is the object twist, $v_p = (u_x, u_y)$ is the contact point velocity, $(p_x, p_y)$ is the contact point in the object frame, and $c$ is a hyperparameter dependent on the pressure distribution of the object [lynch1992manipulation]. We calculate this value assuming pressure to be distributed either uniformly or at the corners/edges of the objects.
Object twist $V$ is computed using consecutive object poses $(o_{t-1}, o_t)$, and contact point velocity $v_p$ is computed using consecutive end-effector contact point estimates $(p_{t-1}, p_t)$. That is,

$V = R^T (o_t \ominus o_{t-1}) / \Delta t, \qquad v_p = R^T (p_t - p_{t-1}) / \Delta t$

where $R^T$ rotates object and contact point velocities into the current object frame.
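A sketch of a sticking-contact quasi-static pushing model. The paper's factor uses the convex-polynomial limit surface of [zhou2017fast]; for concreteness the code below shows the closely related ellipsoidal limit-surface special case [lynch1992manipulation], with illustrative parameter names.

```python
# Sketch: quasi-static sticking-contact pushing under an ellipsoidal
# limit-surface approximation, mapping contact point velocity to object twist.
import numpy as np

def quasi_static_twist(p, v_p, c):
    """p: contact point (px, py) in the object frame; v_p: contact point
    velocity (ux, uy); c: pressure-distribution hyperparameter.
    Returns the object twist (vx, vy, omega)."""
    px, py = p
    ux, uy = v_p
    denom = c**2 + px**2 + py**2
    vx = ((c**2 + px**2) * ux + px * py * uy) / denom
    vy = (px * py * ux + (c**2 + py**2) * uy) / denom
    omega = (px * vy - py * vx) / c**2    # rotation induced by off-center push
    return np.array([vx, vy, omega])
```

The quasi-static factor cost then penalizes the deviation between the twist computed from consecutive pose estimates and the twist predicted by this model. As a sanity check, pushing through the center of mass yields pure translation.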
We would like to add a geometric constraint to ensure that the contact point lies on the object surface. We do so as an intersection cost between the end-effector and the object. This is similar to the obstacle avoidance cost in [mukadam2018continuous, ratliff2009chomp] but instead of a one-sided cost function, we use a two-sided function to penalize the contact point from lying on either side of the object surface.
To compute this intersection cost, we transform the end-effector contact point $p_t$ into the current object frame, and look up its distance value in a precomputed 2D signed distance field (SDF) map centered on the object. This is incorporated as the quadratic geometric factor cost term $c_{geo}$ in Eq. 3. Expanding this,

$c_{geo} = \| \mathrm{SDF}(o_t^{-1} p_t) \|^2_{\Sigma_{geo}}$
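The SDF lookup at the heart of the geometric factor can be sketched as a bilinear interpolation into a precomputed grid; the grid layout and parameter names here are illustrative assumptions, not the paper's data structures.

```python
# Sketch: two-sided geometric cost via bilinear lookup in a 2D signed
# distance field (positive outside the object, negative inside).
import numpy as np

def sdf_cost(sdf, origin, cell_size, p_obj):
    """sdf: (H, W) grid of signed distances, rows indexed by y;
    origin: object-frame (x, y) of cell (0, 0); p_obj: contact point in
    the object frame. Returns the squared interpolated distance."""
    gx = (p_obj[0] - origin[0]) / cell_size
    gy = (p_obj[1] - origin[1]) / cell_size
    x0, y0 = int(np.floor(gx)), int(np.floor(gy))
    tx, ty = gx - x0, gy - y0
    d = (sdf[y0, x0]         * (1 - tx) * (1 - ty) +
         sdf[y0, x0 + 1]     * tx       * (1 - ty) +
         sdf[y0 + 1, x0]     * (1 - tx) * ty +
         sdf[y0 + 1, x0 + 1] * tx       * ty)
    return d ** 2   # two-sided: penalizes both penetration and separation
```

Squaring the signed distance gives the two-sided penalty described above: the cost grows whether the contact point penetrates the object or floats off its surface.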
Finally we also model uncertainty about end-effector locations as unary pose priors on the end-effector variables. These priors currently come from motion capture readings with added noise, but for a robot end-effector, these would instead come from the robot kinematics.
The pose priors are incorporated as the quadratic end-effector factor cost term $c_{eff}$ in Eq. 3. Expanding,

$c_{eff} = \| e_t \ominus \tilde{e}_t \|^2_{\Sigma_{eff}}$

where $\tilde{e}_t$ are poses from the motion capture system with added Gaussian noise.
V Results and Evaluation
We evaluate our approach qualitatively and quantitatively on a number of real-world planar pushing trials where the pose of an object is unknown and must be estimated. We compare against a set of baselines on metrics such as learning errors, estimation accuracy, and runtime performance. Learnt tactile factors are trained using PyTorch [paszke2019pytorch]. These, along with the engineered physics and geometric factors, are incorporated within the GTSAM C++ library [dellaert2012factor]. We use the iSAM2 [kaess2012ijrr] solver for efficient, incremental optimization.
V-A Experimental setup
Fig. 3(a) shows the overall experimental setup for the pushing trials. We use an OptiTrack motion capture system to record ground truth object and end-effector poses for training and evaluating tactile observation models. Fig. 3(b) shows a closeup of the object, end-effector and the Digit tactile sensor [lambeta2020digit] mounted on an end-effector. The Digit sensor provides high-dimensional RGB images of the local deformation at the contact point.
V-B Tactile factor learning
We now look at performance of the first stage of our approach, i.e. learning tactile observation models.
The first step is to detect contact in tactile images. We do so by first subtracting a mean no-contact image from the current image. If the percentage of significantly different pixels exceeds a threshold, then contact is declared. We found this simple method works reliably across different pushing trials.
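The contact detector described above can be sketched as follows; the pixel and fraction thresholds are illustrative values, not the ones used in the experiments.

```python
# Sketch: declare contact when the fraction of pixels that differ
# significantly from a mean no-contact image crosses a threshold.
import numpy as np

def detect_contact(img, mean_img, pix_thresh=10.0, frac_thresh=0.01):
    """img, mean_img: (H, W) or (H, W, 3) arrays of intensities in [0, 255]."""
    diff = np.abs(img.astype(float) - mean_img.astype(float))
    if diff.ndim == 3:                     # collapse color channels
        diff = diff.max(axis=2)
    changed = (diff > pix_thresh).mean()   # fraction of changed pixels
    return changed > frac_thresh
```

Only frames flagged by this detector need to be passed on to the transform prediction network and added as tactile factors.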
Model | rect | disc | ellip | combined
linear with eng-feat | 5.1e-3 | 6.7e-3 | 4.2e-3 | 5.9e-3
linear with learnt-feat | 1.0e-3 | 1.1e-3 | 1.7e-3 | 1.5e-3
nonlinear with eng-feat | 3.0e-3 | 5.2e-3 | 3.3e-3 | 4.2e-3
nonlinear with learnt-feat | 1.0e-3 | 1.7e-3 | 3.0e-3 | 2.4e-3
Fig. 5 shows results for keypoint features learnt using the auto-encoder network described in Sec. IV-A. The tactile image shows an elliptical contact patch whose curvature varies with the local surface geometry of the object. The learnt keypoint features are able to track the patch center over time.
Tactile model performance
Table II compares mean-squared losses for different choices of the transform prediction network described in Sec. IV-A. const is a zeroth-order model that predicts the mean relative transform of the training dataset; this is equivalent to using only contact detection information, irrespective of contact patch locations in the images. eng-feat and learnt-feat denote engineered vs. learnt keypoint features. For engineered features, we find a least-squares ellipse fit to detected contours in the image, as a generalization of the point/line features used in related work [hogan2020tactile]. linear and nonlinear refer to using linear or nonlinear activations in the fully-connected layers. We see that models using learnt-feat have lower losses than those using eng-feat, and that linear models have lower losses than const. The nonlinear models do not show a significant improvement over the linear models.
V-C Factor graph optimization
We now look at the final task performance of estimating object poses using learnt tactile factors along with physics and geometric factors. For all runs, we use the same tactile model trained on the combined rect, disc, ellip datasets. The model is conditioned on the object class being used. We use the same fixed covariance parameters $\Sigma_{tac}$, $\Sigma_{qs}$, $\Sigma_{geo}$, $\Sigma_{eff}$ in the graph across all runs.
Qualitative tracking performance
Fig. 6(a) shows that physics and geometric factors alone cause object poses to drift over time. Figs. 6(b)-(d) show tracking with different tactile models: const, learnt, and oracle. const predicts a constant relative transform value for each tactile factor; this is equivalent to using only a contact detector. While it improves over no tactile, it is unable to correct object rotations relative to the contact point, leading to drifting object pose estimates. This effect is most pronounced in the rect trajectories, where pushing along a corner causes large object rotations about the contact point. learnt is our proposed method using the transform prediction network, while oracle provides ground truth relative transforms. We see that the learnt tactile model recovers object poses close to their true trajectory, and closely matches oracle performance.
Quantitative tracking performance
Fig. 6(e),(f) show mean-variance plots of RMSE rotational and translational object pose errors over time. Errors are computed over the pushing sequences for each of the rect, disc, ellip objects. The sequences have varying pushing trajectories, making and breaking contact typically 2-3 times. Fig. 7 additionally shows summary statistics of the RMSE translational and rotational errors at the final time step. The learnt model's performance is closest to the oracle tactile performance, closely recovering the true object poses in both translation and rotation.
Finally, Fig. 8 shows runtime per iteration of the graph optimizer. Runtime stays relatively constant with new measurements and priors added every step.
We presented a factor graph based inference approach for estimating object poses from touch using vision-based tactile sensors. We proposed learning tactile observation models that directly integrate as factors within the graph. We demonstrated that our method is able to reliably track object poses for over 150 real-world planar pushing trials using tactile measurements alone.
As future work, in the tactile observation model, we would like to learn a distribution over relative poses instead of only point estimates trained with a mean-squared error loss. This should improve performance in cases where the relative pose uncertainty is asymmetric or varies significantly between contact episodes. We would also like to learn richer feature descriptors that capture contact patch geometry in addition to the patch centers. This should allow class-conditioning information to be captured within the feature descriptor itself. Finally, to make these tactile factors work on more complex manipulation tasks, different physics models will need to be incorporated.
We’d like to thank the DIGIT team, in particular P.W. Chou, M. Lambeta and R. Calandra, for support with the sensor, software and helpful discussions. We’d also like to thank D. Gandhi and Z. Dong for helpful discussions.
Here we look at some additional results for object pose tracking during planar pushing tasks. Fig. 9 shows the three different objects being pushed by the Digit tactile sensor mounted on an end-effector. Each object is pushed over multiple varying sequences. We perform the following variations across the pushing sequences:
Trajectories are varied as combinations of straight-line and curved pushes.
Each sequence makes and breaks contact typically 2-3 times.
Object is pushed at different locations, e.g. pushing at both edges and corners for the rect object.
Figs. 10, 11 show qualitative object tracking results for the rect object, and Fig. 12 for the other objects, disc and ellip, over these varying sequences. In the no tactile case, use of physics and geometric factors alone causes object poses to drift over time. The last three columns show tracking with different tactile models: const, learnt, oracle. const predicts a constant relative transform for each tactile factor; this is equivalent to using only a contact detector, i.e. the measurements that a standard non-image-based contact sensor would provide. While it improves over no tactile, it is unable to correct object rotations relative to the contact point, leading to drifting object pose estimates. This effect is most pronounced in the rect trajectories, where pushing along a corner causes large object rotations about the contact point. learnt is our proposed method using the transform prediction network, while oracle provides ground truth relative transforms. We see that the learnt tactile model recovers object poses close to their true trajectory, and closely matches oracle performance.