I Introduction
We look at the problem of estimating object states, such as poses, from touch. Consider the example of a robot manipulating an object in-hand: as the object is being manipulated, it is occluded by the fingers. This occlusion renders visual estimation alone insufficient. Touch in such cases can provide local, yet precise, information about the object state.
The advent of new touch sensors [lambeta2020digit, yuan2017gelsight, yamaguchi2016fingervision] has enabled rich, local tactile image-based measurements at the point of contact. However, a single tactile image reveals only limited information that may correspond to multiple object poses, making it hard to predict poses directly from an image. We hence need to reason over multiple such measurements to collapse uncertainty. The underlying inference problem is then to compute the sequence of latent object poses, given a stream of tactile image measurements.
We solve this inference problem using a factor graph [dellaert2017factor], which offers a flexible way to process a stream of measurements while incorporating other priors, including physics and geometry. The factor graph relies on local observation models that can map measurements into the state space. Observation models for high-dimensional tactile measurements are, however, challenging to design. Prior work [yu2018realtime, lambert2019joint] used low-dimensional force measurements or hand-designed functions for high-dimensional tactile measurements. These, however, can be brittle and difficult to scale across objects and sensors.
In this paper, we propose learning tactile observation models that are incorporated as factors within a factor graph during inference (Fig. 1). The learner takes high-dimensional tactile measurements and predicts noisy low-dimensional poses that are then integrated by the factor graph optimizer. For the learner, however, accurately predicting object pose directly from a tactile image is typically not possible without additional information such as a tactile map and a geometric object model. Our key insight is to instead have the learner predict relative poses from image pairs. That is, given a pair of non-sequential tactile images, predict the difference in the pose of the tactile sensor relative to the object in contact. This relative pose information is then used to correct for accumulated drift in the object pose estimate. Observation models trained in this fashion can work across a large class of objects, as long as the local contact surface patches are within the distribution of those found in the training data. This allows our method to generalize to new objects composed of new arrangements of familiar surface patches.
We propose a two-stage approach: we first learn local tactile observation models supervised with ground truth data, and then integrate these along with physics and geometric models as factors within a factor graph optimizer. The tactile observation model is learning-based, object-model free, and integrates with a factor graph. This integration leverages the benefits of both learning-based and model-based methods: (a) our learnt factor is general and more accurate than hand-designed functions, and (b) integrating this learnt model within a factor graph lets us reason over a stream of measurements while incorporating structural priors in an efficient, real-time manner. Our main contributions are:

A novel learnable tactile observation model that integrates seamlessly as factors within a factor graph.

Tactile factors that work across objects.

Real-time object tracking during real-world pushing trials using only tactile measurements.
II Related Work
Localization and state estimation are increasingly solved as smoothing problems, given their increased accuracy and efficiency over their filtering counterparts [cadena2016past]. Typically, the smoothing objective is formulated as MAP inference over a factor graph whose variable nodes encode latent states and whose factor nodes encode the measurement likelihoods [dellaert2017factor]. To incorporate high-dimensional measurements as factors in the graph, one needs an observation model to map between latent states and measurements. These are typically analytic functions, e.g. projection geometry for images [mur2017orb, engel2014lsd], or scan matching for point clouds [dong2019gpu, zhang2014loam]. Recent work in visual SLAM has also looked at using such functions on learnt, low-dimensional encodings of the original image measurements [czarnowski2020deepfactors, bloesch2018codeslam].
While well-studied analytic models exist within the visual SLAM literature, these are hard to define for tactile sensor images that capture deformations without a straightforward metric interpretation. Within the tactile sensing literature, rich tactile image sensors [lambeta2020digit, yuan2017gelsight] are increasingly used for manipulation tasks. One class of approaches uses local tactile measurements directly as feedback to solve various control tasks such as cable manipulation [she2020cable], in-hand marble manipulation [lambeta2020digit], and box-packing [dong2019tactile]. While efficient for a particular task, such approaches can be difficult to scale across different tasks. An inference module is additionally needed to estimate a common latent state representation, like global object poses, that can be used for different downstream control and planning tasks.
Prior work on estimating states from touch during manipulation has included filtering methods [izatt2017tracking, saund2017touch, koval2015mpf], learning-only methods [sundaralingam2019robust], methods utilizing a prior map [bauza2019tactile], and graph-based smoothing methods [yu2018realtime, lambert2019joint]. Smoothing approaches that model the problem as inference over a factor graph have the benefits of (a) being more accurate than filtering methods, (b) incorporating structured priors, unlike learning-only methods, and (c) recovering global object poses from purely local observation models without needing a global map. Moreover, the graph inference objective can be solved in real time using fast, incremental tools [kaess2008isam, kaess2012ijrr, sodhi2020ics] from the literature. Table I summarizes the closest related work on graph-based tactile estimation [yu2018realtime, yu2018realtimesuction, lambert2019joint] or work using sensor data most similar to ours [bauza2019tactile]. While each work addresses an aspect of the estimation problem, none of them utilize high-dimensional tactile images as learnt models within a factor graph that can work across object shapes.
Table I: Comparison with the closest related work.

                                                    Factor graph   High-dimensional   Learnt
                                                    inference      tactile images     models
Bauza et al. [bauza2019tactile]                          ✗               ✓               ✓
Yu et al. [yu2018realtime, yu2018realtimesuction]        ✓               ✗               ✗
Lambert et al. [lambert2019joint]                        ✓               ✗               ✗
Ours                                                     ✓               ✓               ✓
III Problem Formulation
We formulate the estimation problem as inference over a factor graph. A factor graph is a bipartite graph with two types of nodes: variables and factors. Variable nodes are the latent states to be estimated, and factor nodes encode constraints on these variables, such as measurement likelihood functions or physics and geometric models. Maximum a posteriori (MAP) inference over a factor graph involves maximizing the product of all factor graph potentials φ_i(·), i.e.,
X̂ = argmax_X ∏_i φ_i(X_i)   (1)
Under Gaussian noise model assumptions, MAP inference is equivalent to solving a nonlinear least-squares problem [dellaert2017factor]. That is, for Gaussian factors φ_i(X_i) ∝ exp(−½ ‖h_i(X_i) − z_i‖²_{Σ_i}) corrupted by zero-mean, normally distributed noise,

X̂ = argmin_X Σ_i ‖h_i(X_i) − z_i‖²_{Σ_i}   (2)
where h_i(·) are cost functions defined over states X_i and include measurement likelihoods or priors derived from physical or geometric assumptions, z_i are the measurements, and ‖·‖²_Σ is the squared Mahalanobis distance with covariance Σ.
For our planar pushing setup, states in the graph are the planar object and end-effector poses at every time step t, i.e. x_t = {o_t, e_t}, where o_t, e_t ∈ SE(2). Factors in the graph incorporate tactile observations, quasi-static pushing dynamics, geometric constraints, and priors on end-effector poses.
At every time step, new variables and factors are added to the graph. Writing out Eq. 2 for our setup at time step t,

X̂_t = argmin_{X_t} { Σ_{(i,j)} ‖T̂_ij ⊖ T_ij‖²_{Σ_tac} + Σ_τ ‖V_τ − f_qs(ṗ_τ)‖²_{Σ_qs} + Σ_τ ‖SDF(o_τ^{-1} p_τ)‖²_{Σ_sdf} + Σ_τ ‖e_τ ⊖ ē_τ‖²_{Σ_eff} }   (3)
Eq. 3 is the optimization objective that we must solve at every time step. Instead of re-solving from scratch each time, we make use of efficient, incremental solvers such as iSAM2 [kaess2012ijrr] for real-time inference. Individual cost terms in Eq. 3 are described in more detail in Section IV-B.
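To make the least-squares formulation of Eq. 2 concrete, here is a minimal sketch of MAP inference on a toy linear factor graph over three 1-D poses. The variables, measurements, and noise values are our own illustrative choices, not the paper's setup; for linear-Gaussian factors the MAP estimate is a single weighted least-squares solve.

```python
import numpy as np

# Toy linear factor graph over three 1-D poses x0, x1, x2.
# Factors (residual = h(x) - z, weight = 1/sigma):
#   prior:    x0      = 0.0   (sigma 0.01)
#   relative: x1 - x0 = 1.0   (sigma 0.1)
#   relative: x2 - x1 = 1.0   (sigma 0.1)
#   relative: x2 - x0 = 2.1   (sigma 0.1), a slightly inconsistent measurement
A = np.array([
    [1.0, 0.0, 0.0],   # prior on x0
    [-1.0, 1.0, 0.0],  # x1 - x0
    [0.0, -1.0, 1.0],  # x2 - x1
    [-1.0, 0.0, 1.0],  # x2 - x0
])
z = np.array([0.0, 1.0, 1.0, 2.1])
sigmas = np.array([0.01, 0.1, 0.1, 0.1])

# Whiten rows by 1/sigma (Mahalanobis weighting), then solve the
# weighted least-squares problem of Eq. 2 in one shot.
W = np.diag(1.0 / sigmas)
Aw, zw = W @ A, W @ z
x_map, *_ = np.linalg.lstsq(Aw, zw, rcond=None)
print(x_map)  # MAP estimate of [x0, x1, x2]; the 2.1 measurement pulls x1, x2 up
```

Nonlinear factors (e.g. over SE(2) poses) repeat such a linearized solve inside Gauss-Newton iterations, which is what incremental solvers like iSAM2 perform efficiently as factors are appended.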
IV Approach
We present a two-stage approach: we first learn local tactile observation models from ground truth data (Section IV-A), and then integrate these models along with physics and geometric models as factors within a factor graph (Section IV-B). Fig. 2 illustrates our overall approach, showing the three factors (tactile, physics, and geometric) being integrated into the factor graph.
IV-A Tactile observation model
The goal of learning a tactile observation model is to derive the tactile factor cost term in Eq. 3 to be used during graph optimization. We do this by predicting a relative transformation and penalizing deviations from this prediction. The relative transformation is that of the sensor (or end-effector) pose relative to the object in contact.
Our learnt tactile observation model consists of a transform prediction network: given a pair of non-sequential tactile image inputs (I_i, I_j) at times i and j, it predicts the relative transformation T̂_ij. This is done by first featurizing each image.
Feature learning
For encoding an image as features, we use an autoencoder with a structural keypoint bottleneck proposed in [lambeta2020digit]. It consists of an encoder and a decoder using a tiny version of ResNet-18 as the backbone network. The encoder processes the image input into a feature map, from which 2D keypoint locations corresponding to maximum feature activations are extracted. At decoding, a Gaussian blob is drawn on an empty feature map for each extracted keypoint. The decoder then takes these as inputs and produces a reconstructed image. The autoencoder is trained in a self-supervised manner with an L2 image reconstruction error along with auxiliary losses that optimize for sparse, non-redundant keypoints.
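The keypoint extraction step can be illustrated with a spatial soft-argmax, a common differentiable way to turn feature-map activations into 2D keypoint coordinates. The exact mechanism in [lambeta2020digit] may differ, so treat this as a sketch:

```python
import numpy as np

def soft_argmax_keypoints(feature_maps):
    """Extract one (x, y) keypoint per channel as the softmax-weighted
    mean of pixel coordinates (a 'spatial softmax' over each map)."""
    k, h, w = feature_maps.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    keypoints = np.zeros((k, 2))
    for c in range(k):
        logits = feature_maps[c].reshape(-1)
        p = np.exp(logits - logits.max())   # stable softmax over pixels
        p /= p.sum()
        keypoints[c] = [np.sum(p * xs.reshape(-1)), np.sum(p * ys.reshape(-1))]
    return keypoints

# A single channel with a sharp peak at (x=12, y=5) should yield a
# keypoint near that location, since the softmax is nearly a delta there.
fmap = np.zeros((1, 32, 32))
fmap[0, 5, 12] = 50.0
kp = soft_argmax_keypoints(fmap)
print(kp)  # approximately [[12., 5.]]
```

In a real encoder the feature maps come from the backbone network; the soft-argmax keeps keypoint locations differentiable so the reconstruction loss can be backpropagated through them.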
Transform prediction network
Once we have a trained feature encoder, we use it in the transform prediction network to map input image pairs (I_i, I_j) into keypoint feature vectors (f_i, f_j). This is followed by a fully-connected regression network that predicts a relative 2D transformation T̂_ij between times i and j (Fig. 2). To make the same network work across object classes, we also pass in a one-hot class label vector c. We expand the feature inputs via an outer product with the class vector, f_i ⊗ c and f_j ⊗ c, and pass this expanded input to the fully-connected layers. The labels are produced by a simple classifier network trained on the activation maps extracted from the encoder.
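The outer-product class conditioning described above can be sketched as follows; the feature values and class count are illustrative:

```python
import numpy as np

def class_conditioned_features(f, class_id, num_classes):
    """Expand a keypoint feature vector f via an outer product with a
    one-hot class label, then flatten for the fully-connected layers."""
    c = np.zeros(num_classes)
    c[class_id] = 1.0
    return np.outer(f, c).reshape(-1)  # shape (len(f) * num_classes,)

f = np.array([0.2, -1.0, 0.5])  # e.g. flattened keypoint coordinates
x = class_conditioned_features(f, class_id=1, num_classes=3)
print(x.shape)  # (9,)
```

With a one-hot label, only the block of the expanded input corresponding to the active class is non-zero, so the fully-connected weights can specialize per object class while the image featurizer stays shared.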
The transform prediction network F(·,·) is trained using a mean-squared error (MSE) loss against ground truth data T_ij. We make the loss symmetric so the network learns both the regular and the inverse transform, that is,

L = Σ_{(i,j)∈P} ‖F(I_i, I_j) − T_ij‖² + ‖F(I_j, I_i) − T_ij^{-1}‖²   (4)

Here, T_ij is the relative transform between end-effector poses at time steps i, j in the object coordinate frame. That is, T_ij = (o_i^{-1} e_i)^{-1}(o_j^{-1} e_j), where o, e are ground truth object and end-effector poses obtained using a motion capture system. P is the set of all non-sequential image pairs over a chosen time window W_train.
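A plausible sketch of computing the ground-truth label from motion capture poses, reading the relative transform as the change in the end-effector pose expressed in the object frame (the paper's exact convention may differ; poses below are made-up values):

```python
import numpy as np

def se2_mat(x, y, theta):
    """Homogeneous 3x3 matrix for a planar pose (x, y, theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def relative_label(obj_i, ee_i, obj_j, ee_j):
    """Relative end-effector transform between times i and j expressed
    in the object frame: T_ij = (O_i^-1 E_i)^-1 (O_j^-1 E_j)."""
    rel_i = np.linalg.inv(se2_mat(*obj_i)) @ se2_mat(*ee_i)
    rel_j = np.linalg.inv(se2_mat(*obj_j)) @ se2_mat(*ee_j)
    return np.linalg.inv(rel_i) @ rel_j

# If the object and end-effector move rigidly together between i and j
# (the contact geometry is unchanged), the label is the identity.
th = 0.3
obj_i, ee_i = (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)
obj_j = (0.5, 0.2, th)
ee_j = (0.5 + 0.1 * np.cos(th), 0.2 + 0.1 * np.sin(th), th)
T_label = relative_label(obj_i, ee_i, obj_j, ee_j)
```

Expressing the label in the object frame is what makes it recoverable from touch alone: the tactile images see only the sensor's pose relative to the contacted surface, not the world frame.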
IV-B Factor graph optimization
Once we have the learnt local tactile observation model, we integrate it along with physics and geometric models as factors within a factor graph. The factor graph optimizer then solves for the joint objective in Eq. 3. Here we look at each of the cost terms in Eq. 3 in more detail.
Tactile factor
Measurements from tactile sensor images are incorporated as relative tactile factors, denoted by the tactile cost term in Eq. 3. Our relative tactile factor is a quadratic cost penalizing deviation from a predicted value, that is,

‖T̂_ij ⊖ T_ij‖²_{Σ_tac}   (5)
where T̂_ij is the predicted relative transform from the transform prediction network, which takes as input tactile image measurements (I_i, I_j) at time steps i, j; T_ij is the estimated relative transform using current variable estimates in the graph; ⊖ denotes the difference between two manifold elements; and W_tac is the time-step window over which these relative tactile factors are added. We choose this to be a subset of the training window W_train in Eq. 4.
T_ij, computed using graph variable estimates {o_i, e_i, o_j, e_j}, and T̂_ij, computed as the transform prediction network output, can be expressed as,

T_ij = (o_i^{-1} e_i)^{-1}(o_j^{-1} e_j),   T̂_ij = F(I_i, I_j)   (6)
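The manifold difference ⊖ in the tactile factor can be sketched for SE(2) as follows. We use local (x, y, θ) coordinates of the relative transform as the standard small-residual approximation of the exact log map, and the square-root information weighting is illustrative:

```python
import numpy as np

def se2_ominus(T_a, T_b):
    """Local-coordinate difference of two SE(2) elements: the
    (dx, dy, dtheta) of the relative transform T_b^-1 T_a."""
    D = np.linalg.inv(T_b) @ T_a
    return np.array([D[0, 2], D[1, 2], np.arctan2(D[1, 0], D[0, 0])])

def tactile_residual(T_pred, T_est, sqrt_info):
    """Whitened residual of the relative tactile factor (Eq. 5 form)."""
    return sqrt_info @ se2_ominus(T_pred, T_est)

# A small discrepancy between the predicted and estimated relative
# transforms shows up directly as the residual vector.
T_pred = np.array([[1.0, 0.0, 0.01],
                   [0.0, 1.0, -0.02],
                   [0.0, 0.0, 1.0]])
r = tactile_residual(T_pred, np.eye(3), np.eye(3))
print(r)  # [ 0.01 -0.02  0.  ]
```

During optimization, the solver drives this residual toward zero by adjusting the poses that define T_ij, which is how a single relative measurement corrects accumulated drift.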
Quasi-static physics factor
To model the object dynamics as it is pushed, we use a quasi-static physics model. The quasi-static approximation assumes end-effector trajectories are executed with negligible acceleration, i.e. the applied pushing force is just large enough to overcome friction without imparting an acceleration [lynch1992manipulation].
We use the velocity-only quasi-static model from [zhou2017fast] that uses a convex polynomial to approximate the limit surface for force-motion mapping. For sticking contact, the contact point pushing velocity must lie inside the motion cone, resulting in the following relationship between object and contact point velocities:

V = f_qs(ṗ; c)   (7)

where V is the object twist, ṗ is the contact point velocity, and c is a hyperparameter dependent on the pressure distribution of the object [lynch1992manipulation]. We calculate this value assuming pressure to be distributed either uniformly or at the corners/edges of the objects. The dynamics in Eq. 7 is incorporated as the quadratic quasi-static cost term in Eq. 3. Expanding this,

‖V_t − f_qs(ṗ_t; c)‖²_{Σ_qs}   (8)
Object twist V_t is computed using object poses o_{t−1}, o_t, and contact point velocity ṗ_t is computed using end-effector contact point estimates p_{t−1}, p_t. That is,

V_t = R_t (o_t − o_{t−1})   (9)

ṗ_t = R_t (p_t − p_{t−1})   (10)

where R_t rotates object and contact point velocities into the current object frame.
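As an illustrative stand-in for the convex-polynomial model of [zhou2017fast], the widely used ellipsoidal limit-surface approximation gives a closed-form sticking-contact mapping from pusher velocity to object twist. The contact location, velocity, and c value below are made-up numbers:

```python
import numpy as np

def sticking_twist(px, py, vpx, vpy, c):
    """Object twist (vx, vy, omega) predicted from the pusher contact-point
    velocity (vpx, vpy) at contact location (px, py), all in the object
    frame, under the ellipsoidal limit-surface sticking-contact model."""
    d = c**2 + px**2 + py**2
    vx = ((c**2 + px**2) * vpx + px * py * vpy) / d
    vy = (px * py * vpx + (c**2 + py**2) * vpy) / d
    omega = (px * vy - py * vx) / c**2
    return np.array([vx, vy, omega])

def quasistatic_residual(twist_est, px, py, vpx, vpy, c):
    """Residual whose squared Mahalanobis norm plays the role of the
    quasi-static cost: estimated twist minus the model prediction."""
    return twist_est - sticking_twist(px, py, vpx, vpy, c)

# Pushing straight through the center of friction (py = 0, purely
# x-directed push): the object translates with the pusher, no rotation.
tw = sticking_twist(px=-0.05, py=0.0, vpx=0.01, vpy=0.0, c=0.03)
print(tw)  # [0.01 0.   0.  ]
```

Smaller c (pressure concentrated near the center) makes the object rotate more readily for off-center pushes, which is why the assumed pressure distribution matters.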
Geometric factor
We would like to add a geometric constraint to ensure that the contact point lies on the object surface. We do so as an intersection cost between the end-effector and the object. This is similar to the obstacle avoidance cost in [mukadam2018continuous, ratliff2009chomp], but instead of a one-sided cost function, we use a two-sided function to penalize the contact point from lying on either side of the object surface.
To compute this intersection cost, we transform the end-effector contact point p_t into the current object frame as o_t^{-1} p_t, and look up its distance value in a precomputed 2D signed distance field map of the object, centered around the object. This is incorporated as the quadratic geometric factor cost term in Eq. 3. Expanding this,

‖SDF(o_t^{-1} p_t)‖²_{Σ_sdf}   (11)
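A sketch of the SDF lookup behind this geometric factor, with bilinear interpolation on a toy disc SDF; the grid resolution and object size are illustrative:

```python
import numpy as np

def sdf_lookup(sdf, origin, cell, point):
    """Bilinearly interpolate a 2-D signed distance field at a metric
    point; origin is the world position of cell (0, 0), cell the grid
    resolution. sdf is indexed [row=y, col=x]."""
    gx = (point[0] - origin[0]) / cell
    gy = (point[1] - origin[1]) / cell
    x0, y0 = int(np.floor(gx)), int(np.floor(gy))
    tx, ty = gx - x0, gy - y0
    return ((1 - tx) * (1 - ty) * sdf[y0, x0] + tx * (1 - ty) * sdf[y0, x0 + 1]
            + (1 - tx) * ty * sdf[y0 + 1, x0] + tx * ty * sdf[y0 + 1, x0 + 1])

def geometric_residual(sdf, origin, cell, contact_pt_obj):
    """Two-sided residual: any nonzero signed distance of the contact point
    (already transformed into the object frame) is penalized, pulling the
    contact point onto the object surface from either side."""
    return sdf_lookup(sdf, origin, cell, contact_pt_obj)

# Toy SDF of a disc of radius 0.05 m on a 64x64 grid with 5 mm cells.
cell, origin = 0.005, (-0.16, -0.16)
ys, xs = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
X, Y = origin[0] + xs * cell, origin[1] + ys * cell
sdf = np.sqrt(X**2 + Y**2) - 0.05
r = geometric_residual(sdf, origin, cell, (0.05, 0.0))  # point on the surface
```

A point exactly on the surface yields a residual of (numerically) zero; points inside or outside yield negative or positive residuals respectively, and the quadratic cost penalizes both.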
End-effector priors
Finally, we also model uncertainty about end-effector locations as unary pose priors on the end-effector variables. These priors currently come from motion capture readings with added noise, but for a robot end-effector, they would instead come from the robot kinematics.
The pose priors are incorporated as the quadratic end-effector factor cost term in Eq. 3. Expanding,

‖e_t ⊖ ē_t‖²_{Σ_eff}   (12)

where ē_t are poses from the motion capture system with added Gaussian noise.
V Results and Evaluation
We evaluate our approach qualitatively and quantitatively on a number of real-world planar pushing trials where the pose of an object is unknown and must be estimated. We compare against a set of baselines on metrics such as learning errors, estimation accuracy, and runtime performance. Learnt tactile factors are trained using PyTorch [paszke2019pytorch]. These, along with engineered physics and geometric factors, are incorporated within the GTSAM C++ library [dellaert2012factor]. We use the iSAM2 [kaess2012ijrr] solver for efficient, incremental optimization.
V-A Experimental setup
Fig. 3(a) shows the overall experimental setup for the pushing trials. We use an OptiTrack motion capture system to record ground truth object and end-effector poses for training and evaluating tactile observation models. Fig. 3(b) shows a close-up of the object, the end-effector, and the Digit tactile sensor [lambeta2020digit] mounted on the end-effector. The Digit sensor provides high-dimensional RGB images of the local deformation at the contact point.
V-B Tactile factor learning
We now look at the performance of the first stage of our approach, i.e. learning tactile observation models.
Contact detection
The first step is to detect contact in tactile images. We do so by first subtracting a mean no-contact image from the current image. If the fraction of pixels that differ significantly exceeds a threshold, then contact is declared. We found this simple method works reliably across different pushing trials.
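A minimal sketch of this contact detection scheme; the pixel and fraction thresholds are illustrative, not the paper's values:

```python
import numpy as np

def detect_contact(img, mean_no_contact, pix_delta=10, frac_thresh=0.01):
    """Declare contact if the fraction of pixels differing from the mean
    no-contact image by more than pix_delta exceeds frac_thresh."""
    diff = np.abs(img.astype(np.int32) - mean_no_contact.astype(np.int32))
    frac = np.mean(diff > pix_delta)
    return frac > frac_thresh

# A synthetic sensor image with a bright contact patch triggers detection;
# the unchanged baseline image does not.
baseline = np.full((240, 320), 128, dtype=np.uint8)
pressed = baseline.copy()
pressed[100:140, 150:200] += 40  # contact patch (~2.6% of pixels)
print(detect_contact(pressed, baseline))   # contact
print(detect_contact(baseline, baseline))  # no contact
```

Casting to a signed integer type before subtracting avoids uint8 wraparound; in practice the mean no-contact image would be averaged over several frames per sensor.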
Table II: Mean-squared losses for different transform prediction models across datasets.

Model complexity              Disc     Rect     Ellip    Combined
const                         5.5e-3   15e-3    6.7e-3   9.6e-3
linear with eng-feat          5.1e-3   6.7e-3   4.2e-3   5.9e-3
linear with learnt-feat       1.0e-3   1.1e-3   1.7e-3   1.5e-3
nonlinear with eng-feat       3.0e-3   5.2e-3   3.3e-3   4.2e-3
nonlinear with learnt-feat    1.0e-3   1.7e-3   3.0e-3   2.4e-3
Keypoint features
Fig. 5 shows results for keypoint features learnt using the autoencoder network described in Sec. IV-A. The tactile image shows an elliptical contact patch whose curvature varies with the local surface geometry of the object. The learnt keypoint features are able to track the patch center over time.
Tactile model performance
Table II compares mean-squared losses for different choices of the transform prediction network described in Sec. IV-A. const is a zeroth-order model that predicts the mean relative transform of the training dataset; this is equivalent to using only contact detection information, irrespective of contact patch locations in the images. eng-feat and learnt-feat represent engineered vs. learnt keypoint features. For engineered features, we find a least-squares ellipse fit to detected contours in the image, as a generalization of the point/line features used in related work [hogan2020tactile]. linear and nonlinear refer to using linear or nonlinear activations in the fully-connected layers. We see that models using learnt-feat have lower losses than eng-feat. We also see that linear models have lower losses than const. The nonlinear models don't show a significant improvement over the linear models.
V-C Factor graph optimization
We now look at the final task performance of estimating object poses using learnt tactile factors along with physics/geometric factors. For all runs, we use the same tactile model trained on the combined rect, disc, and ellip datasets. The model is conditioned on the object class being used. We use the same covariance parameters in the graph for all trials.
Qualitative tracking performance
Fig. 6(a) shows that physics and geometric factors alone cause object poses to drift over time. Fig. 6(b)-(d) shows tracking with different tactile models: const, learnt, and oracle. const predicts a constant relative transform value for each tactile factor. This approach is equivalent to using only a contact detector. While it improves over no tactile, it is unable to correct object rotations relative to the contact point, leading to drifting object pose estimates. This effect is most pronounced in the rect trajectories, where pushing along a corner causes large object rotations about the contact point. learnt is our proposed method using the transform prediction network, while oracle provides ground truth relative transforms. We see that the learnt tactile model recovers object poses close to their true trajectory, and matches oracle performance closely.
Quantitative tracking performance
Fig. 6(e),(f) show mean-variance plots of RMSE rotational and translational object pose errors over time. Errors are computed over multiple pushing sequences each for the rect, disc, and ellip objects. The sequences have varying pushing trajectories, making and breaking contact typically 2-3 times. Fig. 7 additionally shows summary statistics of the RMSE translational and rotational errors at the final time step. The learnt model performance is closest to oracle tactile performance, recovering true object poses with small translational and rotational errors.
Runtime performance
Finally, Fig. 8 shows the runtime per iteration of the graph optimizer. Runtime stays relatively constant as new measurements and priors are added every step.
VI Discussion
We presented a factor-graph-based inference approach for estimating object poses from touch using vision-based tactile sensors. We proposed learning tactile observation models that directly integrate as factors within the graph. We demonstrated that our method is able to reliably track object poses over 150 real-world planar pushing trials using tactile measurements alone.
As future work, in the tactile observation model, we'd like to learn a distribution over relative poses instead of only mean-squared-error point estimates. This should improve performance in cases where the relative pose uncertainty is asymmetric or varies significantly between contact episodes. We'd also like to learn richer feature descriptors that describe contact patch geometry in addition to the patch centers. This should allow the conditioning variable information to be captured within the feature descriptor itself. Finally, to make these tactile factors work on more complex manipulation tasks, different physics models will need to be incorporated.
Acknowledgements
We’d like to thank the DIGIT team, in particular P.W. Chou, M. Lambeta and R. Calandra, for support with the sensor, software and helpful discussions. We’d also like to thank D. Gandhi and Z. Dong for helpful discussions.
References
Appendix
Here we look at some additional results for object pose tracking during planar pushing tasks. Fig. 9 shows the three different objects being pushed by the Digit tactile sensor mounted on an end-effector. Objects are pushed over varying sequences for each object. We perform the following variations across the pushing sequences:

Trajectories are varied as combinations of straightline and curved pushes.

Each sequence makes and breaks contact typically 2-3 times.

The object is pushed at different locations, e.g. pushing at both edges and corners for the rect object.
Figs. 10 and 11 show qualitative object tracking results for the rect object, and Fig. 12 for the other objects, disc and ellip, over these varying sequences. In the no tactile case, the use of physics and geometric factors alone causes object poses to drift over time. The last three columns show tracking with different tactile models: const, learnt, and oracle. const predicts a constant relative transform for each tactile factor. This approach is equivalent to using only a contact detector, i.e. using measurements that a standard non-image-based contact sensor would give. While it improves over no tactile, it is unable to correct object rotations relative to the contact point, leading to drifting object pose estimates. This effect is most pronounced in the rect trajectories, where pushing along a corner causes large object rotations about the contact point. learnt is our proposed method using the transform prediction network, while oracle provides ground truth relative transforms. We see that the learnt tactile model recovers object poses close to their true trajectory, and matches oracle performance closely.