1 Introduction
For general-purpose manipulation in unstructured scenes, robots must have an accurate understanding of object properties. In particular, knowledge of 3D shape and its uncertainty enables a breadth of downstream tasks like grasping, dexterous manipulation, and non-prehensile actions. Agents in household or warehouse environments may encounter a priori unknown objects, which they must reconstruct on the fly.
Vision- and depth-based shape perception has been well-studied [newcombe2011kinectfusion], but is often prone to failure in the context of manipulation. Self-occlusion, occlusion due to clutter, and fixed viewpoints hinder visual methods when robots interact with a scene. Furthermore, the sensory signal is degraded by poor illumination, limited range, and ambiguities arising from transparent or specular objects.
Studies show humans can optimally fuse touch and vision to reconstruct shape [helbig2007optimal], reinforcing their complementarity. Vision gives coarse global context, while touch gives precise local information. The development of vision-based tactile sensing [yamaguchi2016combining, yuan2017gelsight, donlon2018gelslim, ward2018tactip, alspach2019soft, lambeta2020digit, padmanabha2020omnitact, wang2021gelsight], like the GelSight [yuan2017gelsight], has led to renewed interest in the shape mapping problem. Fusing both modalities requires globally integrating tactile signals at the distal end, joint kinematics, and vision.
Mapping with high-resolution touch is an open research direction [wang20183d, bauza2019tactile, smith20203d, smith2021active], and a key challenge is to efficiently incorporate these dense measurements into a 3D mapping framework. Moreover, the tactile sensor’s coverage is limited by its size and durability, while cameras only provide partial visibility of the object. A shape representation should therefore faithfully approximate regions lacking sensor measurements.
In this paper, we propose a framework that incrementally reconstructs tabletop 3D objects from a sequence of tactile images and a noisy depthmap (Figure 1). We leverage optical tactile simulation to learn local shape from GelSight-object interactions. We represent 3D shape as a signed distance function (SDF) sampled from a Gaussian process (GP), and reformulate shape mapping as probabilistic inference on a spatial graph. We show that visuotactile measurements can be incorporated into an incremental graph optimizer as local Gaussian potentials. This affords efficient access to the implicit surface and SDF uncertainty. We present both simulated and real experiments, generating effective reconstructions of global shape despite limited sensor coverage. Specifically, our contributions include:

Accurate recovery of local shape from touch learned via tactile simulation of GelSight-object interactions,

Incremental shape mapping through efficient inference in our Gaussian process spatial graph (GPSG),

Evaluation of visuotactile shape mapping on our YCBSightSim and YCBSightReal datasets.
2 Related work
2.1 Tactile sensing and local shape
For vision-based tactile sensors, photometric stereo [hertzmann2005example] has been widely used to reconstruct local shape [retrographic, microgeometry, yuan2017gelsight]. The approach maps image intensities to gradients via a lookup table, and integrates the gradients to obtain a heightmap. However, this method does not consider spatial position in the calibration, which leads to large variance around the boundary of the sensor. A multilayer perceptron was later used to encode spatial variance [wang2021gelsight]; however, an end-to-end learning method could prove more robust. For example, works such as [bauza2019tactile, ambrus2021monocular] learn a model from a limited set of real-world tactile interactions. Our method (Section 4) differs from the above as we train our model via a tactile simulator [si2021taxim] that mimics intensity distributions from the real sensor. Simulation allows us to scale supervised learning to a wider range of objects and ground-truth.
2.2 Visuotactile shape perception
Global information from vision has complemented low-resolution touch in a multimodal setting [bjorkman2013enhancing, ilonen2014three, varley2017shape, gandler2020object]. Wang et al. [wang20183d] use monocular shape completion augmented with GelSight readings. However, they rely primarily on the visual shape prediction, and tactile sensing serves as a refinement step. Smith et al. [smith20203d, smith2021active] demonstrate a learned perception model on simulated datasets, predicting local mesh deformations via high-resolution touch and filling-in through vision. The context of our work resembles that of [wang20183d] and [smith20203d], with partial vision and high-dimensional touch. Our contributions differ from these methods as we (i) perform incremental inference on the measurement stream, and (ii) do not rely on data-driven shape priors.
2.3 Gaussian processes and graphs
We wish to faithfully approximate non-contact regions, capture surface uncertainty, and probabilistically handle measurement noise. Gaussian process implicit surfaces (GPIS) [williams2007gaussian] showcase these properties and have found preference in manipulation research over pointclouds [bauza2019tactile] and other parametric methods [bierbaum2008robust]. The GPIS considers the object’s SDF magnitude and gradient as a GP, conditioned on noisy sensor measurements. This has been successfully applied to both passive [dragiev2011gaussian, ottenhaus2016local] and active 3D reconstruction [bjorkman2013enhancing, jamali2016active, yi2016active, driess2017active] with low-resolution tactile data. We extend these ideas, scaling them to a stream of high-dimensional touch measurements for incremental shape reconstruction.
The key challenge, especially for GelSight pointclouds, is that GPs scale poorly due to matrix inversion costs. In the SLAM community, common approximations include local GPs [lee2019online, stork2020ensemble] and compact kernels [ranganathan2010online]. These have further been incorporated into factor graphs [dellaert2017factor] for trajectory estimation [yan2017incremental], target tracking [rosen2014inference], motion planning [mukadam2018continuous], elevation modeling [wang2019underwater], and planar mapping [Suresh21tactile]. Inspired by these, our representation encodes GP potentials as local constraints in a spatial factor graph.
3 Problem formulation
We consider a robot arm with a GelSight tactile sensor interacting with an unknown 3D object fixed on a tabletop. Given a sequence of images from the GelSight sensor, robot kinematics, and a depthmap from a depth camera, we incrementally estimate the object’s shape and signed distance function (SDF) uncertainty.
Object shape: We represent the object’s shape as an implicit surface in the robot’s frame, with SDF uncertainty (see Section 5.3).
Tactile measurements: During interaction, upon detecting contact, we record the corresponding tactile image and sensor pose :
(1) 
Depthmap: We capture a depthmap of the object from the camera, represented in the robot frame: .
Assumptions: In line with prior efforts, we assume:

Calibrated robot-camera extrinsics,

Fixed object pose and known approximate object dimensions,

A passive exploration algorithm for object coverage.
The rest of the paper is organized as follows: Section 4 presents a GelSight image-to-heightmap model for tactile perception. Section 5 combines tactile pointclouds with a depthmap in an incremental GP spatial graph. In Section 6, we demonstrate our method in simulated and real visuotactile experiments. Finally, we summarize our efforts in Section 7.
4 Local shape from touch
Vision-based tactile sensors perceive contact geometries as images. The soft, illuminated gelpad deforms elastically on contact and is captured by an embedded camera. We represent local shape recovery as the inverse sensor model:
(2) 
With the heightmap, contact mask, and knowledge of sensor pose from robot kinematics, we can obtain a tactile pointcloud, comprising 3D points and normals:
(3) 
In this section, we learn the inverse sensor model through simulation, and its output forms the basis for our visuotactile mapping in Section 5.
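As an illustration of how a predicted heightmap, contact mask, and sensor pose could be turned into a tactile pointcloud, the following is a minimal sketch rather than the implementation used here. It assumes an orthographic projection with a uniform `pixel_size`, a heightmap measured along the sensor's z-axis, and a 4x4 sensor-to-robot transform; the function name and all of these conventions are hypothetical.

```python
import numpy as np

def heightmap_to_pointcloud(heightmap, mask, pose, pixel_size):
    """Back-project a heightmap into 3D points and normals in the robot frame.

    heightmap:  (H, W) heights along the sensor z-axis (assumed convention)
    mask:       (H, W) boolean contact mask
    pose:       (4, 4) homogeneous sensor-to-robot transform (assumed)
    pixel_size: metric size of one pixel (assumed uniform, orthographic)
    """
    h, w = heightmap.shape
    # Pixel grid centred on the sensor's optical axis.
    v, u = np.mgrid[0:h, 0:w]
    x = (u - (w - 1) / 2.0) * pixel_size
    y = (v - (h - 1) / 2.0) * pixel_size
    # Normals from heightmap gradients: n is proportional to (-dz/dx, -dz/dy, 1).
    dz_dy, dz_dx = np.gradient(heightmap, pixel_size)
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(heightmap)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    points = np.stack([x, y, heightmap], axis=-1)
    # Keep only pixels in contact, then move them into the robot frame.
    points, normals = points[mask], normals[mask]
    rot, trans = pose[:3, :3], pose[:3, 3]
    return points @ rot.T + trans, normals @ rot.T
```

Normals are rotated but not translated, since they are directions rather than positions.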
4.1 Learning from simulation
For tactile sensors with soft body deformation, local shape geometry can be learned through supervision. Image-to-depth estimation networks [eigen2014depth, laina2016deeper] can learn true depth from GelSight images even without sensor calibration. However, this would require a large corpus of tactile images and corresponding ground-truth depths. Since this is impractical in the real world, we instead render images from a tactile simulator [si2021taxim]. To ensure transfer to the real world, the simulator is calibrated with reference data from a real GelSight sensor, thus mimicking the same intensity distributions.
Network and training: We use an implementation [fcrn2018github] of the fully convolutional residual network [laina2016deeper] as our depth estimator, as shown in Figure 3. The network combines ResNet-50 as the encoder and upsampling blocks as the decoder. Our model takes tactile images as input, and outputs predictions of both the heightmap and the contact mask. We choose 30 household objects from the YCB dataset [calli2017yale], and hold out 6 objects for testing generalization. For each object, we generate 660 images from randomly sampled sensor poses on their ground-truth mesh models. We split the train/validation/test sets as 550/50/60.
Benchmarks: We compare our learned model with the standard lookup table method [yuan2017gelsight]. This maps tactile images to gradients of the local shape, and uses fast Poisson integration to derive their heightmaps. The contact masks are generated from an intensity-based thresholding of contact vs. non-contact frames.
Evaluation: Figure 4 compares our model against the benchmark on our YCBSightSim dataset (see Section 6.1). We compare each estimated heightmap and contact mask against the ground-truth. Specifically, we evaluate:

Pixel-wise RMSE on heightmaps, and

Intersection over union (IoU) on contact masks.
On heightmap estimation, we outperform the benchmark with an average RMSE of 0.094 mm across all object classes. The lookup table has larger variance, with an average RMSE of 0.182 mm. Note that the maximum penetration depth of the simulation is 1 mm. On contact mask estimation, we have an average IoU of 0.752, while the handcrafted image thresholding performs much worse with 0.379. Finally, in Figure 3, we demonstrate generalization of our model to both unseen simulation and real-world tactile interactions.
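The two evaluation metrics above can be sketched as follows. This is a minimal illustration using their standard definitions; the function names are hypothetical, and returning an IoU of 1 when both masks are empty is an assumed convention rather than something stated in the text.

```python
import numpy as np

def heightmap_rmse(pred, gt):
    """Pixel-wise root-mean-square error between two heightmaps (same units as input)."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def mask_iou(pred, gt):
    """Intersection over union between two binary contact masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```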
5 3D shape estimation
5.1 Standard Gaussian processes
A GP is a nonparametric method to learn a continuous function from data, well-suited to model spatial and temporal phenomena [rasmussen2003gaussian]. To estimate shape, a classical GP considers the object’s SDF to be a joint Gaussian distribution over noisy measurements of its surface. At any given point in space, the SDF represents the signed distance from the surface: zero on the surface, negative inside, and positive outside. The GP meaningfully approximates the global shape, even in regions lacking sensor information. Given a dense tactile measurement (or depth map), we learn a function between positions and normals:
(4)
More generally, treating the left- and right-hand sides of Equation 4 as the GP’s input-output:
(5) 
The posterior distribution at a query point for a full GP with measurements, is given by [rasmussen2003gaussian]:
(6) 
where is the sensor noise covariance, and , and are the train-train, train-query, and query-query kernels respectively. Each kernel’s constituent block is an kernel basis, in our case a thin-plate function [williams2007gaussian]. This inference is computationally intractable for the large that accrues from high-dimensional tactile measurements. The update operations involve costly matrix inversions, and per-query costs (see Equation 6). We now present a local approximation that can be updated and queried incrementally, with bounded computational costs.
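To make the cost of the full GP concrete, the sketch below computes a textbook GP posterior mean and covariance on a toy 2D problem. It is an illustration only: it regresses scalar SDF values without the gradient (normal) outputs, and it swaps in a squared-exponential kernel as a stand-in for the thin-plate function. The function names, noise level, and length scale are all assumptions.

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.5):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-2, length=0.5):
    """Posterior mean and covariance of a zero-mean GP at the query points."""
    k_tt = sq_exp_kernel(x_train, x_train, length) + noise**2 * np.eye(len(x_train))
    k_tq = sq_exp_kernel(x_train, x_query, length)
    k_qq = sq_exp_kernel(x_query, x_query, length)
    # Factorizing the train-train kernel is cubic in the number of measurements.
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_tq)
    return k_tq.T @ alpha, k_qq - v.T @ v

# Toy data: SDF = 0 on a unit circle and SDF = -1 at its centre.
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
x_train = np.vstack([circle, [[0.0, 0.0]]])
y_train = np.concatenate([np.zeros(20), [-1.0]])

# Query the centre, a surface point, and a point far from all measurements.
x_query = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
mean, cov = gp_posterior(x_train, y_train, x_query)
```

Far from any measurement the posterior falls back to the prior, so the variance there stays close to 1 while it collapses near the observed points. Every new measurement enlarges the train-train kernel, which is what makes this form impractical for dense tactile pointclouds.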
5.2 GPSG: Gaussian process spatial graph
We represent the scene as a spatial factor graph [dellaert2017factor], comprising nodes we optimize for and factors that constrain them. These query nodes are at their respective spatial positions , distributed in an volume. Our optimization goal is to recover the posterior , which represents the SDF of the volume and its underlying uncertainty.
Implementing the full GP (Equation 6) in the graph is costly, as each measurement constrains all query nodes . Motivated by prior work in spatial partitioning [lee2019online, stork2020ensemble], we decompose the GP into local unary factors as a sparse approximation. Given that and query node follow a GP, the joint distribution and conditional are:
(7) 
This gives us a unary Gaussian potential which can be incorporated into a least-squares setting:
(8) 
At a timestep , given measurements , we add the set of associated factors within a local radius of each query node’s position . Thus, for all query nodes, we accumulate a small set of factors:
(9) 
This sparsifies an otherwise intractable optimization, pictorially represented in Figure 5 for a 2D case. Taking the Stanford bunny as an example, we illustrate how a set of noisy surface measurements are converted into local GP factors. The final optimization recovers a posterior SDF mean and uncertainty. More specifically, for the visuotactile problem, the maximum a posteriori estimation is:
(10) 
where is the factor set from the depthmap , and is the factor set from tactile measurement . The term applies a positive SDF prior to nodes, initializing the volume as empty space. Inference is carried out at each timestep via incremental smoothing and mapping (iSAM2) [kaess2012isam2].
This framework combines the computational benefits of an online local GPIS [lee2019online, stork2020ensemble] with those of an incremental least-squares solver. This is well-suited for sensors like the GelSight, as the dense pointclouds are too expensive to incorporate into a full GP. When querying, we recover the posterior mean and covariance only for the nodes updated, while the remaining grid is accessed from cache.
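The local-factor idea can be sketched in a few lines, under strong simplifying assumptions. Each measurement batch is conditioned into a unary Gaussian on every query node within a radius, and the unary factors on a node are then fused together with a positive SDF prior. Because every factor in this sketch is unary, the least-squares optimum per node reduces to a precision-weighted average; a real system would instead hand the factors to an incremental solver such as iSAM2. The squared-exponential kernel, the scalar (gradient-free) SDF measurements, and all names and parameter values are assumptions for illustration.

```python
import numpy as np

def sq_exp(a, b, length):
    """Squared-exponential kernel (a stand-in for the thin-plate function)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def local_gp_factor(node, pts, vals, noise, length):
    """Condition a zero-mean GP on nearby measurements: unary (mean, var) on the node."""
    k_tt = sq_exp(pts, pts, length) + noise**2 * np.eye(len(pts))
    k_tq = sq_exp(pts, node[None, :], length)
    w = np.linalg.solve(k_tt, k_tq)
    mean = float(w[:, 0] @ vals)
    var = float(1.0 - k_tq[:, 0] @ w[:, 0])
    return mean, max(var, 1e-9)

def fuse_unary_factors(factors):
    """Least-squares optimum of scalar unary Gaussian factors (precision weighting)."""
    precisions = np.array([1.0 / var for _, var in factors])
    means = np.array([mean for mean, _ in factors])
    var = 1.0 / precisions.sum()
    return float(var * (precisions * means).sum()), float(var)

def update_nodes(nodes, batches, radius, prior=(1.0, 1.0), noise=1e-2, length=0.3):
    """Per node: gather factors from measurements within `radius`, then fuse with the prior."""
    estimates = []
    for node in nodes:
        factors = [prior]  # positive SDF prior: the volume starts as empty space
        for pts, vals in batches:
            near = np.linalg.norm(pts - node, axis=1) < radius
            if near.any():
                factors.append(local_gp_factor(node, pts[near], vals[near], noise, length))
        estimates.append(fuse_unary_factors(factors))
    return estimates

# Two query nodes; one batch of surface measurements (SDF = 0) near the first node only.
nodes = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
batch = (np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0]]), np.zeros(2))
estimates = update_nodes(nodes, [batch], radius=0.2)
```

The node next to the measurements is pulled towards zero with low variance, while the distant node keeps its prior. Only nodes that receive new factors need to be recomputed.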
5.3 Implicit surface generation
The posterior estimate represents the SDF’s mean and uncertainty, sampled from the volume. A marching cubes algorithm [lorensen1987marching] can give us both the implicit surface and the corresponding SDF uncertainty . is generated as the zero-level set of the SDF:
(11) 
Finally, we prune faces/vertices from that lie outside for any of the sensor measurements. These areas have high surface uncertainty, and our spatial graph will poorly approximate them. Furthermore, this is necessary for sequential data as we cannot expect a watertight mesh from partial coverage.
6 Experimental evaluation
We illustrate our method in both simulated (Section 6.2) and real-world (Section 6.3) visuotactile experiments. We compare our shape estimates with respect to the ground-truth meshes using the Chamfer distance (CD) [barrow1977parametric], a commonly used shape similarity metric.
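For reference, one common form of the Chamfer distance between two sampled pointclouds is sketched below. Several variants exist, and this sketch assumes the symmetric sum of mean squared nearest-neighbour distances, which matches squared-length units; the exact variant used in the evaluation is not specified here. The brute-force pairwise computation is only practical for small pointclouds.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between pointclouds a (n, 3) and b (m, 3).

    Assumed variant: mean squared nearest-neighbour distance from a to b,
    plus the same from b to a.
    """
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```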
Implementation: The framework is executed on an Intel Core i7-7820HQ CPU, 32 GB RAM without GPU parallelization. We use the GTSAM [dellaert2012factor] optimizer with iSAM2 [kaess2012isam2] for incremental inference. Due to the precision of sensing, we empirically weight the noise of tactile measurements to be lower than that of the depthmap. We set the grid size , which occupies a volume of side larger than the objects. The local radius is tuned to of the side length.
6.1 Visuotactile data collection
We collect the YCBSightSim and YCBSightReal datasets for evaluating our method. These comprise YCB ground-truth meshes [calli2017yale], GelSight images from interaction, sensor poses, and a depthmap. While we consider 30 household objects in simulation, we restrict our shape mapping evaluation to 6 objects. This subset of objects has varied geometries (curved, rectangular, and complex) to verify the generalization of our method.
YCBSightSim: We generate GelSight-object interactions using Taxim, an example-based tactile simulator [si2021taxim]. We simulate 60 uniformly spread sensor poses on each object, normal to the local surface of the mesh. We render a depthmap from the perspective of an overlooking camera using Pyrender [pyrender]. Finally, zero-mean Gaussian noise is added to tactile pointclouds, sensor poses, and depthmap.
YCBSightReal: We use a UR5e 6-DoF robot arm, mounting the GelSight sensor on a WSG50 parallel gripper. The depthmap is captured via a fixed-pose, calibrated Azure Kinect, approximately 1 m away from the object. Our complete setup can be seen in Figure 9. The GelSight captures 640 × 480 RGB images of the tactile interactions in a 2.66 cm area. The objects are secured by a mechanical bench vise at a known pose, to ensure they remain static. After capturing the depthmap , we approach each object from a discretized set of angles and heights. We detect contact events by thresholding the tactile images. We collect tactile images of the object’s lateral surface, along with the gripper poses via robot kinematics.
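One plausible way to threshold tactile images for contact detection is sketched below. It is an assumption-laden illustration, not the procedure used in data collection: it compares each frame against a no-contact reference frame, and the function name and both thresholds (`pixel_thresh`, `area_thresh`) are hypothetical values that would need tuning on a real sensor.

```python
import numpy as np

def detect_contact(image, reference, pixel_thresh=15.0, area_thresh=0.01):
    """Flag contact when enough pixels differ from a no-contact reference frame.

    image, reference: (H, W, 3) RGB frames from the tactile sensor
    pixel_thresh:     per-pixel intensity change counted as deformation (assumed)
    area_thresh:      fraction of changed pixels required for contact (assumed)
    """
    diff = np.abs(image.astype(np.float32) - reference.astype(np.float32)).max(axis=-1)
    changed = diff > pixel_thresh
    return bool(changed.mean() > area_thresh), changed
```

The returned boolean map could also serve as a crude contact mask, similar in spirit to the intensity-based baseline in Section 4.1.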
6.2 Simulated tactile mapping
In Figure 6, we highlight mapping results for the 6 objects in YCBSightSim. We first visualize the implicit surface and SDF uncertainty from the depthmap only. After this, touch measurements are added incrementally and are reflected in the shape estimate. The surface uncertainty is typically high for regions that lack depth/tactile information, and reduces over time. Figure 7 shows that the CD with respect to the ground-truth mesh decreases with a greater number of touches, and converges within 35–40 touches. The timing plot of graph operations shows near-constant graph update and query time. The execution time reduces towards the end of the datasets as a result of smaller contact areas on the top surface of the objects. These timings can be further improved by parallelizing spatial operations.
6.3 Real-world tactile mapping
In Figure 8, we show our method working on real data collected in YCBSightReal. The Kinect depthmaps for specular objects like tomato_soup_can and potted_meat_can are erroneous, but tactile information provides more precise local shape. To prevent damage to the robot and sensor, we do not explore near the base of the object; instead, we hallucinate measurements at the bottom based on the nearest corresponding sensor poses. In Figure 10, we plot the CD over time for the 6 YCB objects. The initial error is lower than in simulation due to the additional hallucinated measurements. We see the error converge to an average CD of 18.3 mm², a magnitude similar to that of the simulated experiments.
7 Conclusion
We present an incremental framework for 3D shape estimation from dense touch and vision. We formulate a GP spatial graph (GPSG) structure that efficiently infers an object’s implicit surface and SDF uncertainty. To integrate GelSight tactile images, we recover local shape with a model learned in tactile simulation. Our method is first demonstrated in a simulated visuotactile setting, and is later shown to generalize to real-world shape perception.
As future work, we wish to actively reconstruct these shapes using surface uncertainty information. The current method can further benefit from (i) parallelized spatial graph operations, and (ii) data-driven shape priors [varley2017shape, wang20183d]. Finally, we wish to consider relaxing the fixed-pose assumption [Suresh21tactile], and perception of deformable objects.