Efficient shape mapping through dense touch and vision

09/20/2021 ∙ by Sudharshan Suresh, et al. ∙ Brigham Young University

Knowledge of 3-D object shape is of great importance to robot manipulation tasks, but may not be readily available in unstructured environments. While vision is often occluded during robot-object interaction, high-resolution tactile sensors can give a dense local perspective of the object. However, tactile sensors have limited sensing area and the shape representation must faithfully approximate non-contact areas. In addition, a key challenge is efficiently incorporating these dense tactile measurements into a 3-D mapping framework. In this work, we propose an incremental shape mapping method using a GelSight tactile sensor and a depth camera. Local shape is recovered from tactile images via a learned model trained in simulation. Through efficient inference on a spatial factor graph informed by a Gaussian process, we build an implicit surface representation of the object. We demonstrate visuo-tactile mapping in both simulated and real-world experiments, to incrementally build 3-D reconstructions of household objects.




1 Introduction

For general-purpose manipulation in unstructured scenes, robots must have an accurate understanding of object properties. In particular, knowledge of 3-D shape and its uncertainty enables a breadth of downstream tasks like grasping, dexterous manipulation, and non-prehensile actions. Agents in household or warehouse environments may encounter a priori unknown objects, which they must reconstruct on the fly.

Vision- and depth-based shape perception has been well studied [newcombe2011kinectfusion], but is prone to failure in the context of manipulation. Self-occlusion, occlusion due to clutter, and a fixed viewpoint hinder visual methods when robots interact with a scene. Furthermore, the sensory signal is degraded by poor illumination, limited range, and ambiguities arising from transparent or specular objects.

Studies show humans can optimally fuse touch and vision to reconstruct shape [helbig2007optimal], reinforcing their complementarity. Vision gives coarse global context, while touch gives precise local information. The development of vision-based tactile sensing [yamaguchi2016combining, yuan2017gelsight, donlon2018gelslim, ward2018tactip, alspach2019soft, lambeta2020digit, padmanabha2020omnitact, wang2021gelsight], like the GelSight [yuan2017gelsight], has led to renewed interest in the shape mapping problem. Fusing both modalities requires globally integrating tactile signals at the distal end, joint kinematics, and vision.

Mapping with high-resolution touch is an open research direction [wang20183d, bauza2019tactile, smith20203d, smith2021active], and a key challenge is efficiently incorporating these dense measurements into a 3-D mapping framework. Moreover, the tactile sensor’s coverage is limited by its size and durability, while cameras only provide partial visibility of the object. The shape representation must therefore faithfully approximate regions lacking sensor measurements.

Figure 1: We perform incremental 3-D shape mapping with a vision-based tactile sensor, GelSight, and an overlooking depth-camera. We combine multi-modal sensor measurements into our Gaussian process spatial graph (GP-SG) for efficient incremental mapping. The depth-camera gives us a partial, noisy estimate of 3-D shape, after which we sequentially add tactile measurements as Gaussian potentials into our GP-SG. The tactile measurements are recovered from GelSight images via a learned model trained in simulation. The results demonstrate accurate implicit surface reconstruction and uncertainty prediction for interactive perception tasks.

In this paper, we propose a framework that incrementally reconstructs tabletop 3-D objects from a sequence of tactile images and a noisy depth-map (Figure 1). We leverage optical tactile simulation to learn local shape from GelSight-object interactions. We represent 3-D shape as a signed distance function (SDF) sampled from a Gaussian process (GP), and re-formulate shape mapping as probabilistic inference on a spatial graph. We show that visuo-tactile measurements can be incorporated into an incremental graph optimizer as local Gaussian potentials. This affords efficient access to the implicit surface and SDF uncertainty. We present both simulated and real experiments, generating effective reconstructions of global shape despite limited sensor coverage. Specifically, our contributions include:

  1. Accurate recovery of local shape from touch learned via tactile simulation of GelSight-object interactions,

  2. Incremental shape mapping through efficient inference in our Gaussian process spatial graph (GP-SG),

  3. Evaluation of visuo-tactile shape mapping on our YCBSight-Sim and YCBSight-Real datasets.

2 Related work

2.1 Tactile sensing and local shape

For vision-based tactile sensors, photometric stereo [hertzmann2005example] has been widely used to reconstruct local shape [retrographic, microgeometry, yuan2017gelsight]. The approach maps image intensities to gradients via a lookup table, and integrates the gradients to obtain a height-map. However, this method does not consider spatial position in the calibration, and leads to large variance around the boundary of the sensor. A multilayer perceptron network was later used to encode spatial variance [wang2021gelsight]; however, an end-to-end learning method could prove more robust. For example, works such as [bauza2019tactile, ambrus2021monocular] learn a model from a limited set of real-world tactile interactions. Our method (Section 4) differs from the above as we train our model via a tactile simulator [si2021taxim] that mimics intensity distributions from the real sensor. Simulation allows us to scale supervised learning to a wider range of objects and ground-truth.

2.2 Visuo-tactile shape perception

Global information from vision has complemented low-resolution touch in a multi-modal setting [bjorkman2013enhancing, ilonen2014three, varley2017shape, gandler2020object]. Wang et al. [wang20183d] use monocular shape completion augmented with GelSight readings. However, they rely primarily on the visual shape prediction, and tactile sensing serves as a refinement step. Smith et al. [smith20203d, smith2021active] demonstrate a learned perception model on simulated datasets, predicting local mesh deformations via high-resolution touch and filling in the rest through vision. The context of our work resembles that of [wang20183d] and [smith20203d], with partial vision and high-dimensional touch. Our contributions differ from these methods as we (i) perform incremental inference on the measurement stream, and (ii) do not rely on data-driven shape priors.

2.3 Gaussian processes and graphs

We wish to faithfully approximate non-contact regions, capture surface uncertainty, and probabilistically handle measurement noise. Gaussian process implicit surfaces (GPIS) [williams2007gaussian] showcase these properties and have found preference in manipulation research—over point-clouds [bauza2019tactile] and other parametric methods [bierbaum2008robust]. The GPIS considers the object’s SDF magnitude and gradient as a GP, conditioned on noisy sensor measurements. This has been successfully applied to both passive [dragiev2011gaussian, ottenhaus2016local] and active 3-D reconstruction [bjorkman2013enhancing, jamali2016active, yi2016active, driess2017active] with low-resolution tactile data. We extend these ideas, scaling them to a stream of high-dimensional touch measurements for incremental shape reconstruction.

The key challenge, especially for GelSight point-clouds, is that GPs scale poorly due to matrix inversion costs. In the SLAM community, common approximations include local GPs [lee2019online, stork2020ensemble] and compact kernels [ranganathan2010online]. These have further been incorporated into factor graphs [dellaert2017factor] for trajectory estimation [yan2017incremental], target tracking [rosen2014inference], motion planning [mukadam2018continuous], elevation modeling [wang2019underwater], and planar mapping [Suresh21tactile]. Inspired by these, our representation encodes GP potentials as local constraints in a spatial factor graph.

3 Problem formulation

We consider a robot arm with a GelSight tactile sensor interacting with an unknown 3-D object fixed on a tabletop. Given a sequence of images from the GelSight sensor, robot kinematics, and depth-map from a depth-camera, we incrementally estimate the object’s shape and signed distance function (SDF) uncertainty.

Object shape: We represent the object’s shape as an implicit surface $\mathcal{S}$ in the robot’s frame, with SDF uncertainty (refer to Section 5.3).

Tactile measurements: During interaction, upon detecting contact, we record the corresponding tactile image $I_t$ and sensor pose $x_t$:

$$z_t = \{ I_t,\ x_t \}, \qquad x_t \in SE(3) \tag{1}$$

Depth-map: We capture a depth-map $D$ of the object from the camera, represented in the robot frame.

Assumptions: In line with prior efforts, we assume:

  • Calibrated robot-camera extrinsics,

  • Fixed object pose and known approximate object dimensions,

  • A passive exploration algorithm for object coverage.

The rest of the paper is organized as follows: Section 4 presents a GelSight image to height-map model for tactile perception. Section 5 combines tactile point-clouds with a depth-map in an incremental GP spatial graph. In Section 6, we demonstrate our method in simulated and real visuo-tactile experiments. Finally, we summarize our efforts in Section 7.

Figure 2: Our learned model takes in tactile images, and outputs both estimated height-maps and binary contact masks. The residual network is trained on a corpus of GelSight-object interactions in simulation.
Figure 3: Tactile images generated from GelSight interactions in [top] simulated and [bottom] real settings. Pictured alongside are the height-maps and contact masks output from our learned model.
Figure 4: Local shape recovery benchmarked on our YCBSight-Sim dataset (refer to Section 6.1). [top] We evaluate our learned model with respect to the baseline lookup table method for height-map estimation. Here we use a pixel-wise root-mean-square error (RMSE) metric, and observe consistent, low error for our method when compared with the lookup table. [bottom] We compare our learned contact mask model against intensity-based thresholding on the intersection over union (IoU) metric. All test data is randomly generated, and gray represents the hold-out objects not encountered in training.

4 Local shape from touch

Vision-based tactile sensors perceive contact geometries as images. The soft, illuminated gelpad deforms elastically on contact and is captured by an embedded camera. We represent local shape recovery as the inverse sensor model $\Omega$, which maps a tactile image $I_t$ to a height-map $H_t$ and a contact mask $C_t$:

$$\Omega : I_t \mapsto \{ H_t,\ C_t \} \tag{2}$$

With $H_t$, $C_t$, and knowledge of the sensor pose $x_t$ from robot kinematics, we can obtain a tactile point-cloud $P_t$, comprising 3-D points $p_i$ and normals $n_i$:

$$P_t = \{ p_i,\ n_i \}_{i=1}^{N_t} \tag{3}$$

In this section, we learn $\Omega$ through simulation, and its output forms the basis for our visuo-tactile mapping in Section 5.
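As an illustration of how a height-map, contact mask, and sensor pose can yield a point-cloud with normals, here is a minimal sketch. It is not the authors' implementation: the function name, the orthographic pixel grid, the `pixel_size` parameter, and the convention that the sensor's z-axis points along the height direction are all assumptions made for this example.

```python
import numpy as np

def heightmap_to_pointcloud(height, mask, pose, pixel_size):
    """Convert a height-map to 3-D points and unit normals in the robot frame.

    height:     (R, C) heights in the sensor frame (same units as pixel_size)
    mask:       (R, C) boolean contact mask
    pose:       4x4 homogeneous sensor-to-robot transform (assumed convention)
    pixel_size: metric width of one pixel (assumed orthographic grid)
    """
    rows, cols = height.shape
    v, u = np.mgrid[0:rows, 0:cols]
    # Center the pixel grid on the sensor origin.
    x = (u - (cols - 1) / 2.0) * pixel_size
    y = (v - (rows - 1) / 2.0) * pixel_size
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    dz_dy, dz_dx = np.gradient(height, pixel_size)
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(height)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    points = np.stack([x, y, height], axis=-1)
    # Keep only in-contact pixels, then move them into the robot frame.
    rot, trans = pose[:3, :3], pose[:3, 3]
    return points[mask] @ rot.T + trans, normals[mask] @ rot.T
```

Normals are rotated but not translated, since they are directions rather than positions.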

4.1 Learning from simulation

For tactile sensors with soft body deformation, local shape geometry can be learned through supervision. Image-to-depth estimation networks [eigen2014depth, laina2016deeper] can learn true depth from GelSight images even without sensor calibration. However, this requires a large corpus of tactile images and corresponding ground-truth depths. Since collecting such a corpus is impractical in the real world, we instead render images from a tactile simulator [si2021taxim]. To ensure transfer to the real world, the simulator is calibrated with reference data from a real GelSight sensor, thus mimicking the same intensity distributions.

Network and training: We use an implementation [fcrn2018github] of the fully convolutional residual network [laina2016deeper] as our depth estimator, as shown in Figure 3. The network combines ResNet-50 as the encoder and up-sampling blocks as the decoder. Our model takes tactile images as input, and outputs predictions of both the height-map $H_t$ and the contact mask $C_t$. We choose 30 household objects from the YCB dataset [calli2017yale], and hold out 6 objects for testing generalization. For each object, we generate 660 images from randomly sampled sensor poses on its ground-truth mesh model. We split these into train-validation-test sets of 550-50-60.

Benchmarks: We compare our learned model with the standard lookup table method [yuan2017gelsight]. The lookup table maps tactile images to gradients of the local shape, and uses fast Poisson integration to derive their height-maps. The baseline contact masks are generated by intensity-based thresholding of contact vs. non-contact frames.
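To make the gradient-integration step of the baseline concrete, the sketch below recovers a height-map from a gradient field with FFT-based least-squares integration (the Frankot-Chellappa approach). It stands in for the fast Poisson solver mentioned above and is not the cited method's code. The function name is hypothetical, and periodic image boundaries are assumed.

```python
import numpy as np

def integrate_gradients(gx, gy):
    """Least-squares height-map from gradients gx = dH/dx, gy = dH/dy.

    Uses the Fourier-domain solution and assumes periodic boundaries.
    The result is defined up to a constant, fixed here to zero mean.
    """
    rows, cols = gx.shape
    # Angular frequencies per pixel along x (columns) and y (rows).
    wx = 2.0 * np.pi * np.fft.fftfreq(cols)
    wy = 2.0 * np.pi * np.fft.fftfreq(rows)
    WX, WY = np.meshgrid(wx, wy)  # both of shape (rows, cols)
    denom = WX**2 + WY**2
    denom[0, 0] = 1.0  # avoid dividing by zero at the DC term
    GX = np.fft.fft2(gx)
    GY = np.fft.fft2(gy)
    # Differentiation multiplies the spectrum by i*w, so invert that here.
    H = (-1j * WX * GX - 1j * WY * GY) / denom
    H[0, 0] = 0.0  # zero-mean height
    return np.real(np.fft.ifft2(H))
```

Real tactile images are not periodic, so a practical version would pad or mirror the gradient field before integrating.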

Evaluation: Figure 4 compares our learned model against the benchmarks on our YCBSight-Sim dataset (refer to Section 6.1). We compare each estimated height-map and contact mask against the ground-truth. Specifically, we evaluate:

  1. Pixel-wise RMSE on height-maps, and

  2. Intersection over union (IoU) on contact masks.
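The two metrics can be written in a few lines. This is a minimal sketch with hypothetical function names; the convention that two empty masks score an IoU of 1 is an assumption, not something stated in the paper.

```python
import numpy as np

def heightmap_rmse(pred, gt):
    """Pixel-wise root-mean-square error between two height-maps."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def mask_iou(pred, gt):
    """Intersection over union of two binary contact masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)
```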

On height-map estimation, we outperform the benchmark with an average RMSE of 0.094 mm across all object classes. The lookup table has larger variance, with an average RMSE of 0.182 mm. Note that the maximum penetration depth of the simulation is 1 mm. On contact mask estimation, we have an average IoU of 0.752, while the handcrafted image thresholding performs much worse with 0.379. Finally, in Figure 3, we demonstrate generalization of our model to both unseen simulation and real-world tactile interactions.

5 3-D shape estimation

5.1 Standard Gaussian processes

A GP is a nonparametric method to learn a continuous function from data, well-suited to model spatial and temporal phenomena [rasmussen2003gaussian]. To estimate shape, a classical GP considers the object’s SDF to be a joint Gaussian distribution over noisy measurements of its surface. At any given point in space, the SDF $d$ represents the signed distance from the surface: $d = 0$ on the surface, $d < 0$ inside, and $d > 0$ outside. The GP meaningfully approximates the global shape, even in regions lacking sensor information. Given a dense tactile measurement (or depth-map), we learn a function $f$ between positions $p_i$ and normals $n_i$:

$$f : p_i \mapsto \{ d_i,\ n_i \} \tag{4}$$

More generally, treating the left- and right-hand sides of Equation 4 as the GP’s input $X$ and output $Y$:

$$Y = f(X) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\ \sigma^2 I) \tag{5}$$

The posterior distribution at a query point $x_*$ for a full GP with $N$ measurements is given by [rasmussen2003gaussian]:

$$\bar{y}_* = K_*^\top \left( K + \sigma^2 I \right)^{-1} Y, \qquad \Sigma_* = K_{**} - K_*^\top \left( K + \sigma^2 I \right)^{-1} K_* \tag{6}$$

where $\sigma^2 I$ is the sensor noise covariance, and $K$, $K_*$, and $K_{**}$ are the train-train, train-query, and query-query kernels respectively. Each kernel’s constituent block is a $4 \times 4$ kernel basis, in our case a thin-plate function [williams2007gaussian]. This inference is computationally intractable for the large $N$ that accrues from high-dimensional tactile measurements. The update operations involve costly $O(N^3)$ matrix inversions, and $O(N^2)$ per-query costs (refer to Equation 6). We now present a local approximation that can be updated and queried incrementally, with bounded computational costs.
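The sketch below shows where the cost of full GP inference comes from, using a small dense GP over SDF values. It makes several simplifying assumptions that differ from the paper: a squared-exponential kernel replaces the thin-plate kernel, only SDF values (no normals) are modeled, and a constant positive prior mean stands for empty space. All names and default values are illustrative.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=0.2):
    """Squared-exponential kernel matrix between point sets A (n,3) and B (m,3)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_query, noise_std=1e-2, prior_mean=0.0):
    """Posterior mean and covariance of a dense GP at the query points."""
    K = sq_exp_kernel(X_train, X_train)      # train-train
    K_s = sq_exp_kernel(X_train, X_query)    # train-query
    K_ss = sq_exp_kernel(X_query, X_query)   # query-query
    A = K + noise_std**2 * np.eye(len(X_train))
    # The Cholesky factorization is the O(N^3) step that grows with every touch.
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - prior_mean))
    mean = prior_mean + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov
```

With tens of thousands of points per tactile image, the matrix `A` quickly becomes too large to factorize at every step, which is what motivates a local approximation.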

5.2 GP-SG: Gaussian process spatial graph

We represent the scene as a spatial factor graph [dellaert2017factor], comprising query nodes $Y = \{ y_j \}$ that we optimize for and factors that constrain them. These query nodes are at their respective spatial positions $q_j$, distributed over a gridded volume. Our optimization goal is to recover the posterior over $Y$, which represents the SDF of the volume and its underlying uncertainty.

Implementing the full GP (Equation 6) in the graph is costly, as each measurement constrains all query nodes. Motivated by prior work in spatial partitioning [lee2019online, stork2020ensemble], we decompose the GP into local unary factors as a sparse approximation. Given that a surface measurement $m_i$ (taken at position $p_i$) and a query node $y_j$ follow a GP, the joint distribution and conditional are:

$$\begin{bmatrix} m_i \\ y_j \end{bmatrix} \sim \mathcal{N}\!\left( 0,\ \begin{bmatrix} K_{ii} + \sigma^2 I & K_{ij} \\ K_{ij}^\top & K_{jj} \end{bmatrix} \right), \qquad y_j \mid m_i \sim \mathcal{N}\!\left( \mu_{j|i},\ \Sigma_{j|i} \right) \tag{7}$$

with $\mu_{j|i} = K_{ij}^\top \left( K_{ii} + \sigma^2 I \right)^{-1} m_i$ and $\Sigma_{j|i} = K_{jj} - K_{ij}^\top \left( K_{ii} + \sigma^2 I \right)^{-1} K_{ij}$.

This gives us a unary Gaussian potential which can be incorporated into a least-squares setting:

$$\phi(y_j;\ m_i) \propto \exp\!\left( -\tfrac{1}{2} \left\| y_j - \mu_{j|i} \right\|^2_{\Sigma_{j|i}} \right) \tag{8}$$

At a timestep $t$, given the measurements in the point-cloud $P_t$, we add the set of associated factors within a local radius $r$ of each query node’s position $q_j$. Thus, for all query nodes, we accumulate a small set of factors:

$$F_t = \{ (i, j)\ :\ \| p_i - q_j \| < r,\ \ p_i \in P_t \} \tag{9}$$

This sparsifies an otherwise intractable optimization, pictorially represented in Figure 5 for a 2-D case. Taking the Stanford bunny as an example, we illustrate how a set of noisy surface measurements are converted into local GP factors. The final optimization recovers a posterior SDF mean and uncertainty. More specifically, for the visuo-tactile problem, the maximum a posteriori estimation is:

$$\hat{Y} = \arg\min_{Y} \Bigg( \sum_{(i,j) \in F_D} \left\| y_j - \mu_{j|i} \right\|^2_{\Sigma_{j|i}} + \sum_{t=1}^{T} \sum_{(i,j) \in F_t} \left\| y_j - \mu_{j|i} \right\|^2_{\Sigma_{j|i}} + \sum_{j} \left\| y_j - y_0 \right\|^2_{\Sigma_0} \Bigg) \tag{10}$$

where $F_D$ is the factor set from the depth-map $D$, and $F_t$ is the factor set from the tactile measurement at timestep $t$. The term in $y_0$ applies a positive SDF prior to nodes, initializing the volume as empty space. Inference is carried out at each timestep via incremental smoothing and mapping (iSAM2) [kaess2012isam2].

Figure 5: [right] A 2-D illustration of our GP spatial graph (GP-SG), an efficient local approximation to a full GP. The graph consists of SDF query nodes $y_j$, each at its spatial position $q_j$. Each surface measurement (position $p_i$, value $m_i$) produces a unary factor at every query node within the local radius $r$. This represents a local Gaussian potential for the GP implicit surface. [left] Optimizing for $Y$ yields posterior SDF mean + uncertainty. The zero-level set of the SDF gives us the implicit surface $\mathcal{S}$.
Figure 6: Results from simulated visuo-tactile mapping on our YCBSight-Sim dataset. Shown for each object are (i) sample GelSight images, (ii) tactile and depth-map measurements on the ground-truth mesh, and (iii) frames from the incremental mapping. Each object is initialized with a noisy rendered depth-map (Depth only), and with each sequential GelSight measurement, we gain further understanding of global shape and reduce surface uncertainty. Visualized here are the implicit surface + SDF + uncertainty for the intervals of touches.

This framework combines the computational benefits of an online local GPIS [lee2019online, stork2020ensemble] with those of an incremental least-squares solver. This is well-suited for sensors like the GelSight, as the dense point-clouds are too expensive to incorporate into a full GP. When querying, we recover the posterior mean and covariance only for the nodes updated—the remaining grid is accessed from cache.
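The local-factor idea can be sketched without a factor-graph library. In the toy version below, each surface point within a radius of a query node contributes one scalar Gaussian potential on that node's SDF value, and the potentials are fused in information form. That fusion is the least-squares optimum when a node has only unary factors. Everything here is an assumption for illustration: the squared-exponential kernel, scalar SDF-only nodes, the function names, and the batch re-fusion in place of iSAM2.

```python
import numpy as np

def sq_exp(a, b, length_scale):
    """Squared-exponential kernel between two 3-D points (unit prior variance)."""
    return float(np.exp(-0.5 * np.sum((a - b) ** 2) / length_scale**2))

def unary_potential(node_pos, meas_pos, meas_sdf, prior_mean, noise_var, length_scale):
    """Gaussian conditional N(mu, var) of one node's SDF given one measurement."""
    k = sq_exp(node_pos, meas_pos, length_scale)
    gain = k / (1.0 + noise_var)
    mu = prior_mean + gain * (meas_sdf - prior_mean)
    var = 1.0 - gain * k
    return mu, var

def fuse(potentials):
    """Minimizer of sum_k (y - mu_k)^2 / var_k for a scalar node y, and its variance."""
    info = sum(1.0 / var for _, var in potentials)
    mean = sum(mu / var for mu, var in potentials) / info
    return mean, 1.0 / info

def add_measurements(node_pos, factors, surface_pts, radius,
                     prior_mean=1.0, noise_var=1e-4, length_scale=0.1):
    """Attach a unary factor to every node within `radius` of a surface point."""
    for j, q_j in enumerate(node_pos):
        dist = np.linalg.norm(surface_pts - q_j, axis=1)
        for p_i in surface_pts[dist < radius]:
            # Surface points have SDF value 0 by definition.
            factors[j].append(
                unary_potential(q_j, p_i, 0.0, prior_mean, noise_var, length_scale))
    return [fuse(f) for f in factors]
```

A typical use seeds `factors` with one prior potential per node, for example `[[(1.0, 1.0)] for _ in node_pos]`, and then calls `add_measurements` once per touch. Nodes far from every measurement keep only the prior, so the work per update is bounded by the number of nodes near the new contact.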

5.3 Implicit surface generation

The posterior estimate $\hat{Y}$ represents the SDF’s mean and uncertainty, sampled over the volume. A marching cubes algorithm [lorensen1987marching] gives us both the implicit surface $\mathcal{S}$ and the corresponding SDF uncertainty on that surface. $\mathcal{S}$ is generated as the zero-level set of the SDF:

$$\mathcal{S} = \{ q \in \mathbb{R}^3\ :\ d(q) = 0 \} \tag{11}$$

Finally, we prune faces and vertices from $\mathcal{S}$ that lie outside the local radius $r$ of every sensor measurement. These areas have high surface uncertainty, and our spatial graph approximates them poorly. Furthermore, this is necessary for sequential data as we cannot expect a watertight mesh from partial coverage.
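The sketch below illustrates the two steps described above on a regular SDF grid. It is not a full marching cubes implementation: it performs only the vertex-placement step (linear interpolation of sign changes along grid edges) and then prunes points farther than a radius from every measurement. Function names and the grid layout (`origin`, `spacing`) are assumptions.

```python
import numpy as np

def zero_crossings(sdf, origin, spacing):
    """Surface points where the SDF changes sign along axis-aligned grid edges.

    sdf:     (nx, ny, nz) grid of SDF values
    origin:  position of grid index (0, 0, 0)
    spacing: distance between neighboring grid nodes
    Nodes that are exactly zero are not detected by the strict sign test.
    """
    pts = []
    for axis in range(3):
        lo_sl = [slice(None)] * 3
        hi_sl = [slice(None)] * 3
        lo_sl[axis] = slice(0, -1)
        hi_sl[axis] = slice(1, None)
        lo, hi = sdf[tuple(lo_sl)], sdf[tuple(hi_sl)]
        crossing = lo * hi < 0
        idx = np.argwhere(crossing).astype(float)
        # Linear interpolation of the zero along the edge, as a fraction in (0, 1).
        t = lo[crossing] / (lo[crossing] - hi[crossing])
        idx[:, axis] += t
        pts.append(origin + spacing * idx)
    return np.vstack(pts)

def prune_far_points(points, measurements, radius):
    """Keep only surface points within `radius` of at least one measurement."""
    d = np.linalg.norm(points[:, None, :] - measurements[None, :, :], axis=-1)
    return points[d.min(axis=1) <= radius]
```

The pairwise distance matrix in `prune_far_points` is fine for small examples. A spatial index such as a k-d tree would be the usual choice for dense tactile point-clouds.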

Figure 7: [top] The Chamfer distance (CD) with respect to ground-truth meshes for our YCBSight-Sim experiments. Objects are initialized with high CD from noisy depth-map, but converge to low-error in 35–40 touches. [bottom] Average execution time for update and query operations on our GP spatial graph (GP-SG). At each touch we add Gaussian potentials during update, and recover posterior mean and covariance during query. We notice a dip in timing towards the end, due to smaller contact areas on the top of objects.
Figure 8: Results from real visuo-tactile mapping on our YCBSight-Real dataset. This is structured similar to Figure 6, except with reconstruction frames at intervals of touches. The Kinect performs poorly for specular objects such as tomato_soup_can and potted_meat_can, but high-precision GelSight measurements can disambiguate global shape. Our mapping generalizes well and we observe similar results between simulated and real experiments.
Figure 9: Experimental setup for the YCBSight-Real dataset, with a GelSight tactile sensor, a depth-camera and the YCB objects. Objects are firmly secured on a mechanical bench vise, to ensure they stay stationary. We collect measurements by approaching from a discretized set of angles and heights, and detecting contact from the tactile images. The overlooking Kinect collects a depth-map to initialize our visuo-tactile mapping.
Figure 10: The Chamfer distance (CD) with respect to ground-truth meshes for our YCBSight-Real experiments. We observe that the error converges to a magnitude similar to that in Figure 7 after 30 touches. The real experiments start out with a lower error than simulation as a result of the hallucinated base measurements we add to each object (refer to Section 6.3).

6 Experimental evaluation

We illustrate our method in both simulated (Section 6.2) and real-world (Section 6.3) visuo-tactile experiments. We compare our shape estimates with respect to the ground-truth meshes using the Chamfer distance (CD) [barrow1977parametric], a commonly-used shape similarity metric.
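For reference, one common form of the Chamfer distance between two point sets is sketched below: the sum of the mean squared nearest-neighbor distances in both directions. The exact normalization used in the paper's evaluation is not stated in this section, so this particular definition is an assumption, chosen because the results are reported in squared units.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (n,3) and B (m,3).

    Returns mean squared nearest-neighbor distance A->B plus B->A,
    in squared units of the input coordinates.
    """
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

In practice both the reconstruction and the ground-truth mesh would first be sampled into point sets of comparable density.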

Implementation: The framework is executed on an Intel Core i7-7820HQ CPU, 32GB RAM without GPU parallelization. We use the GTSAM [dellaert2012factor] optimizer with iSAM2 [kaess2012isam2] for incremental inference. Due to the precision of sensing, we empirically weight the noise of tactile measurements to be lower than that of the depth-map. We set the grid size , which occupies a volume of side larger than the objects. The local radius is tuned to of the side length.

6.1 Visuo-tactile data collection

We collect the YCBSight-Sim and YCBSight-Real datasets for evaluating our method. These comprise YCB ground-truth meshes [calli2017yale], GelSight images from interaction, sensor poses, and a depth-map. While we consider 30 household objects in simulation, we restrict our shape mapping evaluation to 6 objects. This subset of objects has varied geometries (curved, rectangular, and complex) to verify the generalization of our method.

YCBSight-Sim: We generate GelSight-object interactions using Taxim, an example-based tactile simulator [si2021taxim]. We simulate 60 uniformly spread sensor poses on each object, normal to the local surface of the mesh. We render a depth-map from the perspective of an overlooking camera using Pyrender [pyrender]. Finally, zero-mean Gaussian noise is added to tactile point-clouds, sensor poses, and depth-map.

YCBSight-Real: We use a UR5e 6-DoF robot arm, mounting the GelSight sensor on a WSG50 parallel gripper. The depth-map is captured via a fixed-pose, calibrated Azure Kinect, approximately 1 m away from the object. Our complete setup can be seen in Figure 9. The GelSight captures 640 × 480 RGB images of the tactile interactions in a 2.66 cm area. The objects are secured by a mechanical bench vise at a known pose, to ensure they remain static. After capturing the depth-map $D$, we approach each object from a discretized set of angles and heights. We detect contact events by thresholding the tactile images. We collect tactile images of the object’s lateral surface, along with the gripper poses via robot kinematics.
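Contact detection by image thresholding could look like the sketch below, which compares each frame against a no-contact reference image. The paper does not give its thresholds or exact procedure, so the reference-frame differencing, the function name, and both threshold values are hypothetical.

```python
import numpy as np

def detect_contact(image, background, pixel_thresh=15, area_frac=0.01):
    """Flag contact when enough pixels differ from a no-contact reference frame.

    image, background: (H, W, 3) uint8 tactile images
    pixel_thresh:      per-pixel intensity change counted as deformation (assumed)
    area_frac:         fraction of changed pixels needed to declare contact (assumed)
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(image.astype(np.int32) - background.astype(np.int32)).max(axis=-1)
    mask = diff > pixel_thresh
    return bool(mask.mean() > area_frac), mask
```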

6.2 Simulated tactile mapping

In Figure 6 we highlight mapping results for the 6 objects in YCBSight-Sim. We first visualize the implicit surface and SDF uncertainty from the depth-map only. After this, touch measurements are added incrementally and are reflected in the shape estimate. The surface uncertainty is typically high for regions that lack depth/tactile information, and reduces over time. Figure 7 shows that the CD with respect to the ground-truth mesh decreases with a greater number of touches, and converges within 35–40 touches. The timing plot of graph operations shows near-constant graph update and query time. The execution time reduces towards the end of the datasets as a result of smaller contact areas on the top surface of the objects. These timings can be further improved by parallelizing spatial operations.

6.3 Real-world tactile mapping

In Figure 8, we show our method working on real data collected in YCBSight-Real. The Kinect depth-maps for specular objects like tomato_soup_can and potted_meat_can are erroneous, but tactile information provides more precise local shape. To prevent damage to the robot and sensor, we do not explore near the base of the object; we instead hallucinate measurements at the bottom based on the nearest corresponding sensor poses. In Figure 10, we plot the CD over time for the 6 YCB objects. The initial error is lower than in simulation due to the additional hallucinated measurements. We see the error converge to an average CD of 18.3 mm², similar in magnitude to the simulated experiments.

7 Conclusion

We present an incremental framework for 3-D shape estimation from dense touch and vision. We formulate a GP spatial graph (GP-SG) structure that efficiently infers an object’s implicit surface and SDF uncertainty. To integrate GelSight tactile images, we recover local shape with a model learned in tactile simulation. Our method is first demonstrated in a simulated visuo-tactile setting, and is later shown to generalize to real-world shape perception.

As future work, we wish to actively reconstruct these shapes using surface uncertainty information. The current method can further benefit from (i) parallelized spatial graph operations, and (ii) data-driven shape priors [varley2017shape, wang20183d]. Finally, we wish to consider relaxing the fixed-pose assumption [Suresh21tactile], and perception of deformable objects.