1 Introduction
For general-purpose manipulation in unstructured scenes, robots must have an accurate understanding of object properties. In particular, knowledge of 3-D shape and its uncertainty enables a breadth of downstream tasks like grasping, dexterous manipulation, and non-prehensile actions. Agents in household or warehouse environments may encounter a priori unknown objects, which they must reconstruct on the fly.
Vision and depth-based shape perception has been well-studied [newcombe2011kinectfusion], but is often prone to failure in the context of manipulation. Self-occlusion, occlusion due to clutter, and fixed viewpoints hinder visual methods when robots interact with a scene. Furthermore, the sensory signal is degraded by poor illumination, limited range, and ambiguities arising from transparent or specular objects.
Studies show humans can optimally fuse touch and vision to reconstruct shape [helbig2007optimal], reinforcing their complementarity. Vision gives coarse global context, while touch gives precise local information. The development of vision-based tactile sensing [yamaguchi2016combining, yuan2017gelsight, donlon2018gelslim, ward2018tactip, alspach2019soft, lambeta2020digit, padmanabha2020omnitact, wang2021gelsight], like the GelSight [yuan2017gelsight], has led to renewed interest in the shape mapping problem. Fusing both modalities requires globally integrating tactile signals at the distal end, joint kinematics, and vision.
Mapping with high-resolution touch is an open research direction [wang20183d, bauza2019tactile, smith20203d, smith2021active], and a key challenge is to efficiently incorporate these dense measurements into a 3-D mapping framework. Moreover, the tactile sensor’s coverage is limited by its size and durability, while cameras provide only partial visibility of the object. The shape representation must therefore faithfully approximate regions lacking sensor measurements.

We perform incremental 3-D shape mapping with a vision-based tactile sensor, GelSight, and an overlooking depth-camera. We combine multi-modal sensor measurements in our Gaussian process spatial graph (GP-SG) for efficient incremental mapping. The depth-camera gives us a partial, noisy estimate of the 3-D shape, after which we sequentially add tactile measurements as Gaussian potentials into our GP-SG. The tactile measurements are recovered from GelSight images via a learned model trained in simulation. The results demonstrate accurate implicit surface reconstruction and uncertainty prediction for interactive perception tasks.
In this paper, we propose a framework that incrementally reconstructs tabletop 3-D objects from a sequence of tactile images and a noisy depth-map (Figure 1). We leverage optical tactile simulation to learn local shape from GelSight-object interactions. We represent 3-D shape as a signed distance function (SDF) sampled from a Gaussian process (GP), and re-formulate shape mapping as probabilistic inference on a spatial graph. We show that visuo-tactile measurements can be incorporated into an incremental graph optimizer as local Gaussian potentials. This affords efficient access to the implicit surface and SDF uncertainty. We present both simulated and real experiments, generating effective reconstructions of global shape despite limited sensor coverage. Specifically, our contributions include:
- Accurate recovery of local shape from touch, learned via tactile simulation of GelSight-object interactions,
- Incremental shape mapping through efficient inference in our Gaussian process spatial graph (GP-SG),
- Evaluation of visuo-tactile shape mapping on our YCBSight-Sim and YCBSight-Real datasets.
2 Related work
2.1 Tactile sensing and local shape
For vision-based tactile sensors, photometric stereo [hertzmann2005example] has been widely used to reconstruct local shape [retrographic, microgeometry, yuan2017gelsight]. The approach maps image intensities to gradients via a lookup table, and integrates the gradients to obtain a height-map. However, this method does not consider spatial position in the calibration, which leads to large variance around the boundary of the sensor. A multilayer perceptron network was later used to encode spatial variance [wang2021gelsight]; however, an end-to-end learning method could prove more robust. For example, works such as [bauza2019tactile, ambrus2021monocular] learn a model from a limited set of real-world tactile interactions. Our method (Section 4) differs from the above as we train our model via a tactile simulator [si2021taxim] that mimics intensity distributions from the real sensor. Simulation allows us to scale supervised learning to a wider range of objects and ground-truth.
2.2 Visuo-tactile shape perception
Global information from vision has complemented low-resolution touch in multi-modal settings [bjorkman2013enhancing, ilonen2014three, varley2017shape, gandler2020object]. Wang et al. [wang20183d] use monocular shape completion augmented with GelSight readings. However, they rely primarily on the visual shape prediction, and tactile sensing serves only as a refinement step. Smith et al. [smith20203d, smith2021active] demonstrate a learned perception model on simulated datasets that predicts local mesh deformations from high-resolution touch and fills in the rest through vision. The context of our work resembles that of [wang20183d] and [smith20203d], with partial vision and high-dimensional touch. Our contributions differ from these methods as we (i) perform incremental inference on the measurement stream, and (ii) do not rely on data-driven shape priors.
2.3 Gaussian processes and graphs
We wish to faithfully approximate non-contact regions, capture surface uncertainty, and probabilistically handle measurement noise. Gaussian process implicit surfaces (GPIS) [williams2007gaussian] showcase these properties and have found preference in manipulation research—over point-clouds [bauza2019tactile] and other parametric methods [bierbaum2008robust]. The GPIS considers the object’s SDF magnitude and gradient as a GP, conditioned on noisy sensor measurements. This has been successfully applied to both passive [dragiev2011gaussian, ottenhaus2016local] and active 3-D reconstruction [bjorkman2013enhancing, jamali2016active, yi2016active, driess2017active] with low-resolution tactile data. We extend these ideas, scaling them to a stream of high-dimensional touch measurements for incremental shape reconstruction.
The key challenge, especially for GelSight point-clouds, is that GPs scale poorly due to matrix inversion costs. In the SLAM community, common approximations include local GPs [lee2019online, stork2020ensemble] and compact kernels [ranganathan2010online]. These have further been incorporated into factor graphs [dellaert2017factor] for trajectory estimation [yan2017incremental], target tracking [rosen2014inference], motion planning [mukadam2018continuous], elevation modeling [wang2019underwater], and planar mapping [Suresh21tactile]. Inspired by these, our representation encodes GP potentials as local constraints in a spatial factor graph.
3 Problem formulation
We consider a robot arm with a GelSight tactile sensor interacting with an unknown 3-D object fixed on a tabletop. Given a sequence of images from the GelSight sensor, robot kinematics, and depth-map from a depth-camera, we incrementally estimate the object’s shape and signed distance function (SDF) uncertainty.
Object shape: We represent the object’s shape as an implicit surface in the robot’s frame, with SDF uncertainty (refer to Section 5.3).
Tactile measurements: During interaction, upon detecting contact, we record the corresponding tactile image $\mathcal{I}_t$ and sensor pose $\mathbf{T}_t$:

$z_t = \{\mathcal{I}_t,\ \mathbf{T}_t\}, \quad t = 1, \dots, T \qquad (1)$
Depth-map: We capture a depth-map $\mathcal{D}$ of the object from the camera, represented in the robot frame.
Assumptions: In line with prior efforts, we assume:
- Calibrated robot-camera extrinsics,
- Fixed object pose and known approximate object dimensions,
- A passive exploration algorithm for object coverage.
The rest of the paper is organized as follows: Section 4 presents a model that maps GelSight images to height-maps for tactile perception. Section 5 combines tactile point-clouds with a depth-map in an incremental GP spatial graph. In Section 6, we demonstrate our method in simulated and real visuo-tactile experiments. Finally, we summarize our work in Section 7.



4 Local shape from touch
Vision-based tactile sensors perceive contact geometries as images. The soft, illuminated gelpad deforms elastically on contact, and the deformation is captured by an embedded camera. We represent local shape recovery as the inverse sensor model that maps a tactile image $\mathcal{I}_t$ to a height-map $H_t$ and contact mask $C_t$:

$(H_t,\ C_t) = f_{\text{inv}}(\mathcal{I}_t) \qquad (2)$
With $H_t$, $C_t$, and knowledge of the sensor pose $\mathbf{T}_t$ from robot kinematics, we can obtain a tactile point-cloud $\mathcal{P}_t$, comprising 3-D points $\mathbf{p}_i$ and surface normals $\mathbf{n}_i$:

$\mathcal{P}_t = \{(\mathbf{p}_i,\ \mathbf{n}_i)\}_{i=1}^{N_t} \qquad (3)$
In this section, we learn $f_{\text{inv}}$ through simulation, and its output forms the basis for our visuo-tactile mapping in Section 5.
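To make the back-projection of Equation 3 concrete, the sketch below is a minimal NumPy version that lifts a height-map and contact mask to a point cloud with normals in the robot frame. It assumes a simple orthographic gel geometry and a hypothetical `pixel_size_mm` calibration constant; it is an illustration under those assumptions, not the paper's exact implementation.

```python
import numpy as np

def tactile_point_cloud(height_map, contact_mask, T_sensor, pixel_size_mm=0.05):
    """Back-project a GelSight height-map into a 3-D point cloud with normals.

    height_map   : (H, W) gel penetration depth in mm (sensor-frame z-axis)
    contact_mask : (H, W) boolean mask of pixels in contact
    T_sensor     : (4, 4) sensor pose in the robot frame (from kinematics)
    pixel_size_mm: metric size of one pixel on the gel surface (assumed value)
    """
    H, W = height_map.shape
    # Pixel grid -> metric coordinates on the gel plane, centred at the sensor origin.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - W / 2.0) * pixel_size_mm
    y = (v - H / 2.0) * pixel_size_mm
    z = height_map

    # Surface normals from the height-map gradient: n is proportional to (-dz/dx, -dz/dy, 1).
    dz_dv, dz_du = np.gradient(height_map, pixel_size_mm)
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(z)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)

    # Keep only pixels in contact, then transform to the robot frame.
    pts = np.dstack([x, y, z])[contact_mask]    # (N, 3)
    nrm = normals[contact_mask]                 # (N, 3)
    R, t = T_sensor[:3, :3], T_sensor[:3, 3]
    return pts @ R.T + t, nrm @ R.T
```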
4.1 Learning from simulation
For tactile sensors with soft-body deformation, local shape geometry can be learned through supervision. Image-to-depth estimation networks [eigen2014depth, laina2016deeper] can learn true depth from GelSight images even without sensor calibration. However, this would require a large corpus of tactile images and corresponding ground-truth depths. While this is impractical in the real world, we instead render images from a tactile simulator [si2021taxim]. To ensure transfer to the real world, the simulator is calibrated with reference data from a real GelSight sensor, thus mimicking the same intensity distributions.
Network and training: We use an implementation [fcrn2018github] of the fully convolutional residual network [laina2016deeper] as our depth estimator, as shown in Figure 3. The network combines a ResNet-50 encoder with up-sampling blocks as the decoder. Our model takes tactile images $\mathcal{I}_t$ as input, and outputs predictions of both the height-map $H_t$ and the contact mask $C_t$. We choose 30 household objects from the YCB dataset [calli2017yale], and hold out 6 objects to test generalization. For each object, we generate 660 images from randomly sampled sensor poses on their ground-truth mesh models. We split the train-validation-test sets as 550-50-60.
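As a rough illustration of such a two-headed encoder-decoder, the PyTorch sketch below pairs a ResNet-50 encoder with simple bilinear up-sampling blocks and separate height-map and contact-mask heads. The `TactileDepthNet` name, layer widths, up-sampling blocks, and loss weighting are assumptions for illustration, not the exact FCRN architecture used in the paper.

```python
import torch
import torch.nn as nn
import torchvision

class UpBlock(nn.Module):
    """Bilinear upsample followed by a 3x3 convolution (stand-in for FCRN up-projection)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class TactileDepthNet(nn.Module):
    """ResNet-50 encoder + up-sampling decoder with height-map and contact-mask heads."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50()
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.decoder = nn.Sequential(
            UpBlock(2048, 1024), UpBlock(1024, 512),
            UpBlock(512, 256), UpBlock(256, 128), UpBlock(128, 64),
        )
        self.height_head = nn.Conv2d(64, 1, kernel_size=3, padding=1)  # height-map (mm)
        self.mask_head   = nn.Conv2d(64, 1, kernel_size=3, padding=1)  # contact-mask logits

    def forward(self, img):
        feat = self.decoder(self.encoder(img))
        return self.height_head(feat), self.mask_head(feat)

def loss_fn(pred_h, pred_m, gt_h, gt_m, w_mask=1.0):
    """Hypothetical joint loss: L2 on the height-map plus BCE on the contact mask."""
    return nn.functional.mse_loss(pred_h, gt_h) + \
           w_mask * nn.functional.binary_cross_entropy_with_logits(pred_m, gt_m)
```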
Benchmarks: We compare $f_{\text{inv}}$ with the standard lookup table method [yuan2017gelsight]. This maps tactile images to gradients of the local shape, and uses fast Poisson integration to derive height-maps. The contact masks are generated from intensity-based thresholding of contact vs. non-contact frames.
Evaluation: Figure 4 compares $f_{\text{inv}}$ against the benchmark on our YCBSight-Sim dataset (refer to Section 6.1). We compare each estimated height-map and contact mask against the ground-truth. Specifically, we evaluate:
- Pixel-wise RMSE on height-maps, and
- Intersection over union (IoU) on contact masks.
On height-map estimation, we outperform the benchmark with an average RMSE of 0.094 mm across all object classes. The lookup table has larger variance, with an average RMSE of 0.182 mm. Note that the maximum penetration depth in simulation is 1 mm. On contact mask estimation, we achieve an average IoU of 0.752, while the handcrafted image thresholding performs much worse at 0.379. Finally, in Figure 3, we demonstrate the generalization of $f_{\text{inv}}$ to both unseen simulated and real-world tactile interactions.
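For reference, the two metrics above can be computed as in the minimal NumPy sketch below; the function names are illustrative.

```python
import numpy as np

def heightmap_rmse(pred_h, gt_h):
    """Pixel-wise RMSE between predicted and ground-truth height-maps (mm)."""
    return float(np.sqrt(np.mean((pred_h - gt_h) ** 2)))

def contact_iou(pred_mask, gt_mask):
    """Intersection-over-union between boolean contact masks."""
    pred_mask, gt_mask = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```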
5 3-D shape estimation
5.1 Standard Gaussian processes
A GP is a nonparametric method to learn a continuous function from data, and is well-suited to model spatial and temporal phenomena [rasmussen2003gaussian]. To estimate shape, a classical GP considers the object’s SDF to be a joint Gaussian distribution over noisy measurements of its surface. At any given point $\mathbf{x}$ in space, the SDF $d(\mathbf{x})$ represents the signed distance from the surface: zero on the surface, negative inside, and positive outside. The GP meaningfully approximates the global shape, even in regions lacking sensor information. Given a dense tactile measurement $\mathcal{P}_t$ (or the depth-map $\mathcal{D}$), we learn a function between positions $\mathbf{p}_i$ and normals $\mathbf{n}_i$:

$[\,0,\ \mathbf{n}_i\,]^{\top} = f(\mathbf{p}_i) + \boldsymbol{\epsilon}, \quad \forall\, (\mathbf{p}_i,\ \mathbf{n}_i) \in \mathcal{P}_t \qquad (4)$
More generally, treating the left- and right-hand sides of Equation 4 as the GP’s input-output pair:

$\mathbf{y}_i = f(\mathbf{x}_i) + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\ \Sigma_n), \quad f \sim \mathcal{GP}\big(\mathbf{0},\ k(\mathbf{x}, \mathbf{x}')\big) \qquad (5)$
The posterior distribution at a query point $\mathbf{x}_*$ for a full GP with $N$ measurements is given by [rasmussen2003gaussian]:

$\boldsymbol{\mu}_* = K_{*N}\,[K_{NN} + \Sigma_n]^{-1}\,\mathbf{y}, \qquad \Sigma_* = K_{**} - K_{*N}\,[K_{NN} + \Sigma_n]^{-1}\,K_{N*} \qquad (6)$

where $\Sigma_n$ is the sensor noise covariance, and $K_{NN}$, $K_{N*}$, and $K_{**}$ are the train-train, train-query, and query-query kernels respectively. Each kernel’s constituent block is a matrix-valued kernel basis, in our case a thin-plate function [williams2007gaussian]. This inference is computationally intractable for the large $N$ that accrues from high-dimensional tactile measurements: the update operations involve costly $\mathcal{O}(N^3)$ matrix inversions, with $\mathcal{O}(N^2)$ per-query costs (refer to Equation 6). We now present a local approximation that can be updated and queried incrementally, with bounded computational costs.
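To ground Equations 5-6, the NumPy sketch below implements the full-GP posterior with a thin-plate covariance, restricted to scalar SDF values (the normal/gradient block is omitted for brevity). The kernel form, the bounding constant R, and all parameter values are assumptions for illustration.

```python
import numpy as np

def thin_plate_kernel(A, B, R):
    """Thin-plate covariance k(r) = 2r^3 - 3R r^2 + R^3, with R upper-bounding
    the pairwise distances in the workspace (GPIS-style choice)."""
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return 2.0 * r**3 - 3.0 * R * r**2 + R**3

def gp_posterior(X_train, y_train, X_query, R, sigma_n=1e-2):
    """Full-GP posterior mean/variance of the SDF at query points (Eq. 6, scalar case)."""
    K_nn = thin_plate_kernel(X_train, X_train, R) + (sigma_n**2) * np.eye(len(X_train))
    K_qn = thin_plate_kernel(X_query, X_train, R)
    K_qq = thin_plate_kernel(X_query, X_query, R)
    alpha = np.linalg.solve(K_nn, y_train)             # O(N^3) factorization cost
    mean = K_qn @ alpha
    cov = K_qq - K_qn @ np.linalg.solve(K_nn, K_qn.T)  # O(N^2) per query point
    return mean, np.diag(cov)

if __name__ == "__main__":
    # Toy example: noisy on-surface points from a unit sphere, plus one interior
    # and one exterior anchor, queried along a ray from the centre outwards.
    rng = np.random.default_rng(0)
    surf = rng.normal(size=(200, 3))
    surf /= np.linalg.norm(surf, axis=1, keepdims=True)        # SDF = 0 on the surface
    X = np.vstack([surf, [[0.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]])
    y = np.concatenate([np.zeros(len(surf)), [-1.0], [1.0]])   # interior < 0, exterior > 0
    query = np.linspace(0.0, 2.0, 5)[:, None] * np.array([[1.0, 0.0, 0.0]])
    mu, var = gp_posterior(X, y, query, R=4.0)
    print(mu)   # negative inside, near zero at the surface, positive outside
    print(var)
```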
5.2 GP-SG: Gaussian process spatial graph
We represent the scene as a spatial factor graph [dellaert2017factor], comprising nodes we optimize for and factors that constrain them. These query nodes lie at fixed spatial positions $\mathbf{x}_j$, distributed over a regular 3-D grid spanning the object volume. Our optimization goal is to recover the posterior over the node values $F = \{\mathbf{f}_j\}$, which represents the SDF of the volume and its underlying uncertainty.
Implementing the full GP (Equation 6) in the graph is costly, as each measurement constrains all query nodes. Motivated by prior work in spatial partitioning [lee2019online, stork2020ensemble], we decompose the GP into local unary factors as a sparse approximation. Given that the measurements $\mathcal{P}$ and a query node at $\mathbf{x}_j$ jointly follow a GP, the joint distribution and conditional are:

$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_j \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\ \begin{bmatrix} K_{NN} + \Sigma_n & K_{Nj} \\ K_{jN} & K_{jj} \end{bmatrix}\right), \qquad \mathbf{f}_j \mid \mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\mu}_j,\ \Sigma_j\big) \qquad (7)$

with $\boldsymbol{\mu}_j = K_{jN}[K_{NN} + \Sigma_n]^{-1}\mathbf{y}$ and $\Sigma_j = K_{jj} - K_{jN}[K_{NN} + \Sigma_n]^{-1}K_{Nj}$.
This gives us a unary Gaussian potential which can be incorporated into a least-squares setting:

$\phi_j(\mathbf{f}_j) \propto \exp\!\left(-\tfrac{1}{2}\,\|\mathbf{f}_j - \boldsymbol{\mu}_j\|^2_{\Sigma_j}\right) \qquad (8)$
At a timestep $t$, given measurements $\mathcal{P}_t$, we add the set of associated factors within a local radius $r$ of each query node’s position $\mathbf{x}_j$. Thus, for all query nodes, we accumulate a small set of factors:

$\Phi_t = \big\{\, \phi_j \;:\; \exists\, \mathbf{p}_i \in \mathcal{P}_t,\ \|\mathbf{p}_i - \mathbf{x}_j\| \le r \,\big\} \qquad (9)$
This sparsifies an otherwise intractable optimization, pictorially represented in Figure 5 for a 2-D case. Taking the Stanford bunny as an example, we illustrate how a set of noisy surface measurements is converted into local GP factors. The final optimization recovers a posterior SDF mean and uncertainty. More specifically, for the visuo-tactile problem, the maximum a posteriori estimate is:

$\hat{F} = \operatorname*{arg\,max}_{F}\ \prod_{j} \phi^{\text{prior}}_j(\mathbf{f}_j) \prod_{\phi \in \Phi_{\mathcal{D}}} \phi(\mathbf{f}_j) \prod_{t=1}^{T} \prod_{\phi \in \Phi_t} \phi(\mathbf{f}_j) \qquad (10)$

where $\Phi_{\mathcal{D}}$ is the factor set from the depth-map $\mathcal{D}$, and $\Phi_t$ is the factor set from tactile measurement $\mathcal{P}_t$. The term $\phi^{\text{prior}}_j$ applies a positive SDF prior to the nodes, initializing the volume as empty space. Inference is carried out at each timestep via incremental smoothing and mapping (iSAM2) [kaess2012isam2].
This framework combines the computational benefits of an online local GPIS [lee2019online, stork2020ensemble] with those of an incremental least-squares solver. This is well-suited for sensors like the GelSight, as the dense point-clouds are too expensive to incorporate into a full GP. When querying, we recover the posterior mean and covariance only for the nodes updated—the remaining grid is accessed from cache.
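The sketch below illustrates the GP-SG update in its simplest form: a NumPy, scalar-SDF version that fuses the local unary potentials of Equations 7-9 in information form. It is not the paper's GTSAM/iSAM2 implementation; the class name, parameter values, and per-node scalar state are hypothetical simplifications.

```python
import numpy as np

class GPSpatialGraph:
    """Minimal GP-SG sketch: each grid node keeps an information-form Gaussian over its
    scalar SDF value, fused incrementally from local GP potentials (Eqs. 7-9)."""

    def __init__(self, grid_pts, prior_mean=0.05, prior_var=1.0, radius=0.03):
        self.X = grid_pts                       # (M, 3) query node positions
        self.radius = radius                    # local support radius r
        # Positive-SDF prior: initializes the volume as empty space.
        self.Lambda = np.full(len(grid_pts), 1.0 / prior_var)  # information (1 / variance)
        self.eta = self.Lambda * prior_mean                     # information vector

    def add_measurement(self, pts, sdf_vals, kernel, sigma_n=1e-3):
        """Fuse one measurement batch (e.g. a tactile point-cloud with SDF targets = 0)."""
        for j, xq in enumerate(self.X):
            near = np.linalg.norm(pts - xq, axis=1) <= self.radius
            if not near.any():
                continue                        # node untouched at this timestep
            P, y = pts[near], sdf_vals[near]
            # Local GP conditional at node j (Eq. 7), restricted to nearby points.
            K_nn = kernel(P, P) + (sigma_n**2) * np.eye(len(P))
            k_qn = kernel(xq[None, :], P)
            mu_j = (k_qn @ np.linalg.solve(K_nn, y)).item()
            var_j = (kernel(xq[None, :], xq[None, :])
                     - k_qn @ np.linalg.solve(K_nn, k_qn.T)).item()
            var_j = max(var_j, 1e-9)
            # Unary Gaussian potential (Eq. 8), fused in information form.
            self.Lambda[j] += 1.0 / var_j
            self.eta[j] += mu_j / var_j

    def posterior(self):
        """Posterior SDF mean and variance over the full grid (node-wise MAP, Eq. 10)."""
        var = 1.0 / self.Lambda
        return self.eta * var, var
```

The `kernel` argument can be any positive-definite covariance, e.g. `lambda A, B: thin_plate_kernel(A, B, R)` from the earlier sketch.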
5.3 Implicit surface generation
The posterior estimate represents the SDF’s mean and uncertainty, sampled over the volume. A marching cubes algorithm [lorensen1987marching] can give us both the implicit surface $\mathcal{S}$ and the corresponding SDF uncertainty. $\mathcal{S}$ is generated as the zero-level set of the SDF:

$\mathcal{S} = \{\, \mathbf{x} \in \mathbb{R}^3 \;:\; d(\mathbf{x}) = 0 \,\} \qquad (11)$
Finally, we prune faces and vertices of $\mathcal{S}$ that lie farther than the local radius $r$ from all sensor measurements. These areas have high surface uncertainty, and our spatial graph approximates them poorly. Furthermore, this pruning is necessary for sequential data, as we cannot expect a watertight mesh from partial coverage.
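A minimal sketch of this surface-extraction step is given below, assuming scikit-image's marching cubes and a SciPy k-d tree for the pruning radius test; the grid conventions and helper name are illustrative rather than the paper's implementation.

```python
import numpy as np
from skimage import measure
from scipy.spatial import cKDTree

def extract_pruned_surface(sdf_grid, grid_origin, voxel_size, measured_pts, radius):
    """Zero-level-set extraction (Eq. 11) followed by pruning of unsupported vertices.

    sdf_grid     : (nx, ny, nz) posterior SDF mean sampled on the query grid
    grid_origin  : (3,) world position of voxel (0, 0, 0)
    voxel_size   : scalar grid spacing
    measured_pts : (N, 3) all fused sensor points (tactile + depth)
    radius       : prune vertices farther than this from every measurement
    """
    # Marching cubes on the zero level set of the SDF.
    verts, faces, normals, _ = measure.marching_cubes(
        sdf_grid, level=0.0, spacing=(voxel_size,) * 3)
    verts = verts + np.asarray(grid_origin)

    # Keep only vertices within `radius` of at least one sensor measurement.
    tree = cKDTree(measured_pts)
    dists, _ = tree.query(verts)
    keep = dists <= radius

    # Drop faces that reference any pruned vertex, then re-index the rest.
    face_ok = keep[faces].all(axis=1)
    old_to_new = -np.ones(len(verts), dtype=int)
    old_to_new[keep] = np.arange(keep.sum())
    return verts[keep], old_to_new[faces[face_ok]], normals[keep]
```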




6 Experimental evaluation
We illustrate our method in both simulated (Section 6.2) and real-world (Section 6.3) visuo-tactile experiments. We compare our shape estimates with respect to the ground-truth meshes using the Chamfer distance (CD) [barrow1977parametric], a commonly-used shape similarity metric.
Implementation: The framework is executed on an Intel Core i7-7820HQ CPU with 32 GB RAM, without GPU parallelization. We use the GTSAM [dellaert2012factor] optimizer with iSAM2 [kaess2012isam2] for incremental inference. Given the precision of tactile sensing, we empirically weight the noise of tactile measurements lower than that of the depth-map. We set the grid size such that it occupies a volume with sides larger than the objects, and tune the local radius $r$ to a fraction of the side length.
6.1 Visuo-tactile data collection
We collect the YCBSight-Sim and YCBSight-Real datasets to evaluate our method. These comprise YCB ground-truth meshes [calli2017yale], GelSight images from interaction, sensor poses, and a depth-map. While we consider 30 household objects in simulation, we restrict our shape mapping evaluation to 6 objects. This subset has varied geometries (curved, rectangular, and complex) to verify the generalization of our method.
YCBSight-Sim: We generate GelSight-object interactions using Taxim, an example-based tactile simulator [si2021taxim]. We simulate 60 uniformly spread sensor poses on each object, normal to the local surface of the mesh. We render a depth-map from the perspective of an overlooking camera using Pyrender [pyrender]. Finally, zero-mean Gaussian noise is added to tactile point-clouds, sensor poses, and depth-map.
YCBSight-Real: We use a UR5e 6-DoF robot arm, mounting the GelSight sensor on a WSG50 parallel gripper. The depth-map is captured via a fixed-pose, calibrated Azure Kinect, approximately 1 m from the object. Our complete setup can be seen in Figure 9. The GelSight captures 640 × 480 RGB images of the tactile interactions over a 2.66 cm sensing area. The objects are secured by a mechanical bench vise at a known pose, ensuring they remain static. After capturing the depth-map $\mathcal{D}$, we approach each object from a discretized set of angles and heights. We detect contact events by thresholding the tactile images. We collect tactile images of the object’s lateral surface, along with the gripper poses via robot kinematics.
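As an illustration of this contact-detection heuristic, the sketch below thresholds the per-pixel deviation from a no-contact reference frame; the threshold values are assumed, not the paper's calibrated settings.

```python
import numpy as np

def detect_contact(frame, ref_frame, diff_thresh=12.0, area_thresh=500):
    """Flag a contact event when enough pixels deviate from the no-contact reference image.

    frame, ref_frame : (H, W, 3) uint8 GelSight images
    diff_thresh      : per-pixel intensity change counted as deformation (assumed value)
    area_thresh      : minimum number of changed pixels to declare contact (assumed value)
    """
    diff = np.abs(frame.astype(np.float32) - ref_frame.astype(np.float32)).mean(axis=2)
    return int((diff > diff_thresh).sum()) > area_thresh
```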
6.2 Simulated tactile mapping
In Figure 6, we highlight mapping results for the 6 objects in YCBSight-Sim. We first visualize the implicit surface and SDF uncertainty from the depth-map only. After this, touch measurements are added incrementally and are reflected in the shape estimate. The surface uncertainty is typically high for regions that lack depth/tactile information, and it reduces over time. Figure 7 shows that the CD with respect to the ground-truth mesh decreases with a greater number of touches, and converges within 35–40 touches. The timing plot of graph operations shows near-constant graph update and query times. The execution time reduces towards the end of the datasets as a result of smaller contact areas on the top surfaces of the objects. These timings can be further improved by parallelizing spatial operations.
6.3 Real-world tactile mapping
In Figure 8, we show our method working on real data collected in YCBSight-Real. The Kinect depth-maps for specular objects like tomato_soup_can and potted_meat_can are erroneous, but tactile information provides more precise local shape. To prevent damage to the robot and sensor, we do not explore near the base of the object; instead, we hallucinate measurements at the bottom based on the nearest corresponding sensor poses. In Figure 10, we plot the CD over time for the 6 YCB objects. The initial error is lower than in simulation due to the additional hallucinated measurements. We see the error converge to an average CD of 18.3 mm², a magnitude similar to that of the simulated experiments.
7 Conclusion
We present an incremental framework for 3-D shape estimation from dense touch and vision. We formulate a GP spatial graph (GP-SG) structure that efficiently infers an object’s implicit surface and SDF uncertainty. To integrate GelSight tactile images, we recover local shape with a model learned in tactile simulation. Our method is first demonstrated in a simulated visuo-tactile setting, and is later shown to generalize to real-world shape perception.
As future work, we wish to actively reconstruct these shapes using surface uncertainty information. The current method can further benefit from (i) parallelized spatial graph operations, and (ii) data-driven shape priors [varley2017shape, wang20183d]. Finally, we wish to relax the fixed-pose assumption [Suresh21tactile], and to consider the perception of deformable objects.