Learning to Grasp Without Seeing

05/10/2018 ∙ by Adithyavairavan Murali, et al. ∙ Carnegie Mellon University

Can a robot grasp an unknown object without seeing it? In this paper, we present a tactile-sensing-based approach to this challenging problem of grasping novel objects without prior knowledge of their location or physical properties. Our key idea is to combine touch-based object localization with tactile-based re-grasping. To train our learning models, we created a large-scale grasping dataset, including more than 30K RGB frames and over 2.8 million tactile samples from 7800 grasp interactions of 52 objects. To learn a representation of tactile signals, we propose an unsupervised auto-encoding scheme, which yields a significant improvement of 4-9% on a variety of tactile perception tasks. Our system consists of two steps. First, our touch localization model sequentially 'touch-scans' the workspace and uses a particle filter to aggregate beliefs from multiple hits of the target. It outputs an estimate of the object's location, from which an initial grasp is established. Next, our re-grasping model learns to progressively improve grasps with tactile feedback based on the learned features. This network learns to estimate grasp stability and predict the adjustment for the next grasp. Re-grasping is thus performed iteratively until our model identifies a stable grasp. Finally, we demonstrate extensive experimental results on grasping a large set of novel objects using tactile sensing alone. Furthermore, when applied on top of a vision-based policy, our re-grasping model significantly boosts the overall accuracy by 10.6%. We believe this is one of the first demonstrations that a robot can grasp with only tactile sensing and without any prior object knowledge.




I Introduction

Consider the task of grasping a slippery glass bottle. We use vision to determine the object’s location and its properties such as shape. Based on these estimates, we can even plan how to approach and make contact with the bottle. However, not until we get tactile feedback by touching can we adjust our hands for a reliable grasp. In many cases, the hand completely occludes the object after contact, severely diminishing the use of hand-eye coordination; yet in all these cases we humans are invariably successful in grasping the objects. In fact, we are even capable of grasping objects solely based on touch. A good example is when we probe around on a nightstand for our phone. Haptics and the sense of touch play a vital role in grasping. Yet most existing grasping algorithms build primarily on visual sensing (RGB-D or laser scanners). In fact, in the recent Amazon Picking Challenge, only one of 26 teams used a tactile sensor [1]. Can a robot learn to grasp solely based on touch, without even using vision? More importantly, can a robot incorporate both visual inputs and tactile feedback for robust grasping?

Fig. 1: Our Fetch robot learns to localize and grasp a novel object of unknown shape from tactile sensing alone. Our method estimates the target’s location by touch-probing the workspace (top right), and establishes an initial grasp (bottom left). We then learn to extract features from haptic feedback, and predict how to adjust the grasp (bottom right). This re-grasping process is repeated until our method identifies a stable grasp.

Sensory inputs affect the success of a grasp in all stages: localization of the object, planning of the grasp control parameters (gripper pose, approach direction, etc.; grasp planning here refers to both analytic and data-driven techniques) and the execution of the grasp on the robot. Vision-based methods, such as object detection, segmentation and point cloud registration, are widely used for localization. Without using visual sensing, tactile exploration has demonstrated promising results on locating objects and estimating their 6-DOF poses [2, 3, 4, 5, 6, 7]. However, haptics has rarely been considered in the context of grasping beyond simple, individual objects. Recently, there has also been tremendous progress in data-driven grasp planning, namely in learning grasp policies from RGB-D images [8, 9, 10, 11], but most of these approaches ignore haptic feedback during execution. Tactile sensing has previously been used for grasp execution, for instance in assessing grasp stability [12, 13, 14], thus enabling the hand to adjust its posture and position online [15, 16, 17, 18]. Nonetheless, these methods assume that either the initial grasp or the object information is inferred with vision, with few exceptions [19]. Felip et al. [19] presented a full system for tactile grasping using hand-crafted rules. In this light, no general learning framework exists for a complete grasp (localization, planning and execution) using solely touch sensors.

In this work, we present the first general framework for learning to grasp with only tactile sensing and without prior object knowledge. Our goal is to scale to a diverse set of unknown objects. To this end, we focus on 2D planar grasps of a single object. To start with, we design a localization module to obtain an approximate location of the object. Intuitively, we control the robot to sequentially “touch-scan” the grasp plane until hitting the object and we use a particle filter to aggregate the measurements and track the target location.

With all the uncertainty in object location, tactile sensing and kinematics, how can the robot reliably grasp the object? Our core idea is to treat grasping as a multi-step process with error recovery. Specifically, we propose a re-grasping module that refines the initial grasp over multiple re-trials. To extract rich, meaningful features for the re-grasping task, we use a recurrent auto-encoder to learn an unsupervised representation from all the unlabelled data. These features are then fed to another neural network that simultaneously estimates grasp stability and predicts the adjustment for the next grasp. Our framework iterates on the grasps until the network estimates a high chance of success or the number of trials reaches a predefined limit.

Our high-capacity deep network requires a large-scale tactile dataset for training, which is missing in the community. We have thus created a new dataset of grasping with both tactile and visual sensing. Specifically, we record images, haptic measurements as the robot gripper encloses its fingers on an object, high-level re-grasp actions sent to the motion planner, and labels of whether an object has been successfully grasped. Our publicly available dataset includes 7.8K interactions with 52 unique objects with material labels. We hope that it will serve as a major resource for future research on visio-haptic manipulation.

Our method is trained using our dataset, and tested on unseen objects. We systematically vary components of our framework and benchmark the performance. First, we show that our unsupervised representation learning produces rich tactile features for a variety of passive (material recognition) and active (re-grasping) tasks. Next, we show that haptic based re-grasping improves a baseline policy, with the ground truth object location provided by vision-based localization. Finally, with touch based localization, our full method achieves a grasping accuracy of using tactile sensing alone. We believe this is one of the first results of grasping a large set of unknown objects without seeing. Furthermore, we explore combining haptic and visual sensing for robust grasping. Our results indicate that our multi-step re-grasping with tactile feedback 1) improves the robustness of grasp execution and 2) offers an easy plug-in for existing grasp planning methods.

II Related Work

Grasping is one of the fundamental problems in robotic manipulation and we refer readers to recent surveys [20, 21, 22].

Vision-Based Grasping. Visual perception has been the primary modality for sensing, grasp planning and execution. Several works on model-based grasping make use of visual information like point clouds/images to estimate physical properties of objects (e.g., shape [23] or pose [24]), and finally to generate control commands for grasping. Sensing detailed physical properties from visual inputs can be exceedingly challenging, and might not be necessary for finding desired controls. Therefore, recent papers have focused on learning-based approaches [25, 8]. These methods directly map input visual data to the control signals for open-loop grasping. Recently, a lot of progress has been made in this direction using deep models [26, 9, 10, 11]. However, using visual inputs alone leads to errors such as slippage due to low friction, or wrong grasp locations due to self-occlusion.

Tactile Exploration. In contrast, humans make great use of tactile signals for grasping and can even grasp unknown objects without using visual sensing [27]. Therefore, recent work in robotics has also explored the use of haptics for sensing an object’s shape, pose, location or attributes [28, 29, 30]. For example, if the location of an object is known, its shape can be estimated by actively touching its parts [31, 32]. Similarly, given the 3D models of objects, several recent works seek to infer the 6-DOF pose of the objects with a series of information-gathering actions [4, 3, 2]. However, these results have neither been considered for the task of grasping nor generalize to unknown objects. The most relevant works are [6] and [7]. Pezzementi et al. [6] built occupancy grid maps using tactile sensing of unknown 2D objects. Kaboli et al. [7] proposed a pre-touch strategy to localize novel objects in a 3D workspace. These works are similar to our touch localization step, yet they do not complete the full pipeline of touch-based grasping.

Fig. 2: Overview of our system and approach. (a) Our robot and sensors: We equip a Fetch robot with a Robotiq gripper and additional sensor packages. Our sensors include force sensors on the fingers of the gripper and an RGB-D camera on the head of the robot. (b) Our touch-based object localization: We touch-probe a 2D grasp plane of the workspace, and use particle filtering to aggregate evidence of the object’s location. An initial grasp is established given an estimate of the object’s 2D location. (c) Our unsupervised learning scheme for haptic features: We learn to represent haptic data during grasping using a conditional auto-encoder. The learned features are fed into our re-grasping model to correct the initial grasp. (d) Our re-grasping model: Based on haptic features from the current grasp, we estimate grasp stability and predict how to adjust the grasp. A new grasp is generated by applying the adjustment to the current grasp. This process repeats until our method predicts a stable grasp.

Re-grasping with Tactile Sensing. Haptic feedback is widely used for closed-loop control when executing a grasp, also known as re-grasping. Early work [33] focused on analytical solutions for 2D planar grasps given ideal tactile sensing of a known object shape. For real-world tactile data, hand-crafted rules can be highly effective if the object shape is known [17]. Several recent works addressed the task of re-grasping or assessing grasp stability without prior object knowledge [15, 12, 34, 18, 35, 14]. However, they all rely on a good initial grasp given by another sensor modality. The most relevant works are [16, 36] and [19]. Based on tactile feedback, Dang et al. [16] learned to predict grasp stability [13], which is further used to guide grasping. Their method can generalize to unknown objects but requires accurate object locations. Moreover, their approach only used simulated data with hand-designed features. Koval et al. [36] utilized haptic feedback to learn both pre- and post-contact push-grasping policies. Their method accounts for inaccurate sensing of object location and pose, yet is limited to objects with known shapes. Conversely, our method learns tactile-based re-grasping policies with neither prior knowledge of the object (shape/physics) nor necessarily a good initial grasp. In addition, our approach makes use of large-scale real-world visual and haptic data to learn how to grasp. Felip et al. [19] presented a full tactile grasping pipeline (exploration and re-grasping) with a wrist force-torque sensor, fingertip tactile sensors and a fully actuated multi-fingered gripper. They used a set of hand-crafted rules/features and demonstrated success on a small set of novel objects. Conversely, our tactile perception modules are learned from data and use only the fingertip tactile sensors. We show that our learned model can be applied to successfully grasp a larger set of novel objects, including deformable and elastic ones.

Grasping Datasets. Alongside algorithmic developments, large-scale datasets have fueled the success of learning to grasp [26, 9]. However, when it comes to haptic datasets, there have been only a few attempts such as [37, 38]. These datasets either focus on passive tasks, e.g., material recognition [38], or are limited to grasping a small set of objects with a small number of trials [37, 39]. As part of our effort, we created the first large-scale grasping dataset with both tactile and visual sensing to facilitate future research on visio-haptic grasping. As a result, our work is also deeply intertwined with the unsupervised learning of tactile feature representations. Previous work has primarily used hand-crafted features for haptic data [17]. Schneider et al. [40] constructed haptic features using bag-of-words. Madry et al. [41] explored unsupervised learning of haptic features using sparse coding. The learned representation has been shown effective for re-grasping [34], though it is intended for a specific class of sensors providing a matrix/image of tactile responses. We propose a novel method for learning haptic features using a deep recurrent network similar to [42].

III Dataset

In this section, we present our effort in creating a visio-haptic dataset for grasping. A large-scale haptic grasping dataset is important for learning high-capacity deep models; unfortunately, no such dataset exists in the community. We seek to bridge this gap by collecting a new grasping dataset that includes both visual and haptic sensor data. Specifically, our dataset consists of 7800 grasp interactions with 52 different objects. Each grasp interaction lasts several seconds and is recorded with:

  • RGB Frames: We capture images at four specific events of a grasp: the initial scene, and before, during and after grasp execution. These images have a resolution of 1280x960.

  • Haptic Measurements: Tactile signals are measured by force sensors mounted on each of the three fingers of the gripper. Each 3-axis sensor measures the magnitude and direction of the contact force at a fixed sampling rate.

  • Grasping Actions and Labels: We record the pose of all 2D planar grasps, including the initial grasp and subsequent re-grasps. We also record whether each re-grasp succeeded.

  • Material Labels of Objects: We label one of seven material categories for each object: metal, hard plastic, elastic plastic, stuffed fabric, wood, glass and ceramic.
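Putting the four items above together, one interaction can be pictured as the following record. This is a hypothetical schema for illustration only; the field names and file names are our own and may differ from the released dataset's actual format:

```python
# Hypothetical schema for a single grasp interaction in the dataset.
# All field and file names here are illustrative, not the dataset's format.
interaction = {
    "rgb_frames": {                    # 1280x960 images at four events
        "initial_scene": "img_0001_a.png",
        "before_grasp":  "img_0001_b.png",
        "during_grasp":  "img_0001_c.png",
        "after_grasp":   "img_0001_d.png",
    },
    "haptic": [                        # per-timestep 3-axis force readings
        {"finger": "left",   "force": [0.1, -0.2, 1.3]},
        {"finger": "middle", "force": [0.0,  0.1, 0.9]},
        {"finger": "right",  "force": [0.2,  0.0, 1.1]},
    ],
    "grasps": [                        # initial grasp plus re-grasps (2D planar)
        {"x": 0.42, "y": 0.10, "z": 0.05, "theta": 1.2, "success": False},
        {"x": 0.44, "y": 0.08, "z": 0.05, "theta": 1.0, "success": True},
    ],
    "material": "hard plastic",        # one of the 7 material categories
}
```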

Data Collection. To collect this dataset, we sample and execute a large set of grasps. The robot lifts the object and automatically detects whether the grasp succeeded. A major issue with this data collection process is obtaining enough successful grasps: it is easy to collect failure cases by applying random grasps, but successful grasps, which are critical for learning, are much rarer. To address this issue, we sample initial grasps from a pre-learned vision-based grasping policy [43]. We collect two sets of data and combine them to form our final dataset. The first set includes all objects, each with several initial grasps; each initial grasp is followed by a single random re-grasp, so the grasps in this set have a higher rate of success. Our second set contains a subset of objects covering the different material types. For each object in this set, we sample several initial grasps and allow multiple random re-grasps, resulting in a higher failure rate.

Dataset Statistics. Overall, our dataset includes more than 30K RGB frames and over 2.8 million tactile samples from 7800 grasp interactions with 52 objects. We provide grasping actions and labels for each interaction, as well as material labels for each object. To the best of our knowledge, this is by far the largest dataset for visio-haptic grasping. Our dataset is publicly available at: cs.cmu.edu/GraspingWithoutSeeing.

IV Overview

We present an overview of our framework in Fig 2. Our goal is to reliably grasp a target object using just fingertip tactile sensors, without knowing the location, pose or shape of the object. Similar to previous work, our framework has two main stages: grasp planning and grasp execution. For planning, we use particle filtering to localize the object based on a sequence of touch probes. For execution, we learn to iteratively adjust the grasp based on haptic feedback, using a deep neural network. Unlike other work in robot learning [26] that learns torque control, we infer position-control commands and use a motion planner to reach that configuration. We also explore the benefit of applying our re-grasping model on top of a vision-based grasping policy. Our methods for planning and execution are detailed in Sections V and VI, respectively.

Platform. We implement our method on a real-world robotic platform: a research edition of the Fetch mobile manipulator [44], equipped with a 3-Finger Adaptive Gripper (Robotiq). We use ROS [45] and position control with the Expansive Space Trees (EST) motion planner from MoveIt to generate collision-free trajectories for the robot. For haptic sensing, we mount a 3-axis OptoForce sensor onto each of the three Robotiq fingers; customized 3D-printed fixtures ensure that this mounting is rigid (see left panel of Fig 2). For vision, we use a PrimeSense Carmine 1.09 short-range RGB-D camera mounted on the robot’s head. Note that visual data is not used in our method, except when we explore combining RGB frames from the PrimeSense with haptic sensing for grasping.

V Initial Grasp from Touching

We present our method for grasp planning. Traditionally, the goal of planning is to generate a good initial grasp of a target object. This usually requires the robot to sense the physical properties of the object, such as shape or pose. This is especially challenging with tactile sensing alone. Nevertheless, our key observations are that 1) we can infer a rough location of the object by probing the grasp plane and hitting the target multiple times; 2) even a poor initial grasp is often sufficient for successful grasping, if we allow the robot to correct the grasp a few times using haptic feedback. Thus, we propose a simple method for grasping. We first localize the object by touching, and then generate a random initial grasp. We will show that this method can be highly effective when combined with our learning based re-grasping policy.

V-A Particle Filter for Touch Localization

The core of our grasp planning is a simple touch-based localization method using contact sensing. We consider the task of grasping a single target object within a known workspace, in our setting a constrained packaging box in which the object could be in any pose. We control the robot to line-scan a fixed 2D plane of the workspace using one of its fingers, which functions as a touch probe. The probe moves along a Cartesian path until it detects a contact (defined by a threshold on the force magnitude). Our method makes multiple contacts and uses particle filtering to infer the object’s location on the 2D plane.

The choice of a particle filter is tailored to our problem, as our contact measurement is highly non-linear and lacks analytic derivatives. Particle filters are a non-parametric formulation of the recursive Bayes filter:

bel(x_t) = η · p(z_t | x_t) ∫ p(x_t | u_t, x_{t-1}) · bel(x_{t-1}) dx_{t-1}

The belief bel(x_t) is approximated using a finite set of particles {x_t^[i]}. Here x_t denotes the target location at time t, u_t is the line-scan action and z_t the contact-sensing measurement. The touch-localization framework is summarized in Algorithm 1, and the detailed mechanics of the particle filter can be found in [46]. At the end of touch-scanning, the centroid of the resampled particles is returned as the final estimate of the target object’s location.

Initialize particles X_0 with uniform random samples;
for t = 1 : T do
        Run linear scan u_t to get observation z_t;
        for each particle x_{t-1}^[i] in X_{t-1} do
                Sample x_t^[i] from motion model p(x_t | u_t, x_{t-1}^[i]);
                Update measurement weight w_t^[i] = p(z_t | x_t^[i]);
        end for
        X_t = Resample(X_t, w_t);
end for
return: mean(X_T) as object location;
Algorithm 1 Touch Localization using Contact Sensing

We present details of our measurement and motion models.

  • Motion model: Touching the object might change its location. This displacement is usually small, yet it is determined by how the robot moves (u_t) and by the physical properties of the object and its environment. We simplify the motion model by assuming a Gaussian distribution independent of u_t: p(x_t | u_t, x_{t-1}) = N(x_{t-1}, Σ), where Σ is a small noise covariance.

  • Measurement model: Our measurement model tracks the physical occupancy of probed locations. Any location on the 2D plane can be either free space (no contact) or occupied by the object (contact). We either increase (occupied) or decrease (free space) the weights of particles that lie within the vicinity (a sphere of radius 2.5 cm in our experiments) of the location. An example is shown in Fig 2, where particles in the swept area of the probe are down-weighted and particles near the contact point (red circle) are up-weighted.
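The localization loop described above can be sketched in a few lines of numpy. This is a minimal 2D toy simulation under assumed values (workspace size, noise scale, weighting factors and the simulated `probe` are all our own choices, not the paper's), but it shows the motion-jitter, occupancy-based re-weighting, and resampling steps working together:

```python
import numpy as np

rng = np.random.default_rng(0)
true_obj = np.array([0.3, 0.6])       # hidden object location (toy value)
N, RADIUS = 2000, 0.05                # particle count and vicinity radius (toy)

# Initialize particles uniformly over a unit workspace.
particles = rng.uniform(0.0, 1.0, size=(N, 2))
weights = np.ones(N) / N

def probe(x_scan):
    """Simulated line scan along x: returns a contact point or None."""
    if abs(x_scan - true_obj[0]) < RADIUS:
        return np.array([x_scan, true_obj[1]])
    return None

for x_scan in np.linspace(0.0, 1.0, 40):
    contact = probe(x_scan)
    # Motion model: touching may move the object slightly (Gaussian jitter).
    particles += rng.normal(0.0, 0.002, particles.shape)
    if contact is None:
        # Swept area is free space: down-weight particles near the scan line.
        weights[np.abs(particles[:, 0] - x_scan) < RADIUS] *= 0.2
    else:
        # Contact: up-weight particles in the vicinity of the hit point.
        near = np.linalg.norm(particles - contact, axis=1) < RADIUS
        weights[near] *= 5.0
        weights[~near] *= 0.5
    weights /= weights.sum()
    # Resample when the effective sample size drops below half.
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = rng.choice(N, N, p=weights)
        particles, weights = particles[idx], np.ones(N) / N

# Centroid of the (weighted) particles is the location estimate.
estimate = np.average(particles, axis=0, weights=weights)
```

With enough scans crossing the object, the particle cloud collapses around the true location, and the centroid serves as the grasp-plane estimate.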

Once we estimate the target location, our next step is to generate a grasp. Without prior object information, we select a grasp by randomly sampling from the rest of the parameter space. Executing such a grasp is highly likely to fail, as this sample can be far away from feasible grasps. Somewhat surprisingly, we will show that this random policy can produce a successful grasp, if we allow the robot to re-grasp a few times and adjust its controls each time based on tactile feedback.

VI Grasp Execution via Re-grasping

Fig. 3: Tactile responses from both successful and failed grasps. These grasps are from objects with varying shape/material/compliance properties. We plot the time series of force magnitude from our sensors on three fingers (red: right, green: middle, blue: left). The maximum force during grasping is also displayed. We record signals before and after the gripper closes (shown at bottom). These signals contain important information about the object (e.g., material, shape) and the grasp (e.g., stability), and we explore using them to estimate how to correct a previous grasp.

Given a noisy object location and a randomly selected grasp, how can the robot reliably grasp the object? To address this question, let us first look at what the haptic sensors measure during grasping. Fig 3 shows haptic responses during grasping. It is evident that these signals encode important information about the object in contact: for example, the magnitude of force hints at the object's material, and the temporal force variation across the three fingers indicates its shape. These signals also capture critical aspects of the grasp itself; for example, we can predict grasp stability by tracking the temporal structure of the signals before and after contact. We therefore hypothesize that these tactile signals can be used to correct the initial grasp.

We will demonstrate that this is indeed possible if we consider grasping as a multi-stage process and allow the robot to re-grasp a few times. Each new grasp is generated by adjusting the previous one using haptic feedback; re-grasping thus helps to reduce sensing uncertainty. To this end, we propose a learning-based approach to tactile re-grasping. Our method learns representations from haptic data, estimates grasp stability and predicts the adjustment for the next grasp, all using deep models. We now present our methods for haptic feature learning and tactile-based re-grasping.

VI-A Learning Haptic Features

The next question is how to learn a general representation of haptic data. Should we use hand-designed features or some task-specific representation? Raw tactile signals form a time series, with a low-dimensional vector at each time step. Since they do not encode much global information compared to modalities like vision, it is difficult to interpret haptic data without the context of the robot control applied. Therefore, what we need to learn is a conditional representation. To this end, we train a conditional auto-encoder model over the haptic signals, shown in Fig 4. Both encoder and decoder in our model have a recurrent architecture (LSTMs [47]). Our encoder takes a sequence of haptic data and control signals as inputs, and encodes them into a low-dimensional latent space z. Our decoder reconstructs the input haptic data from the latent space z.

By conditioning the reconstruction on control actions, the network must learn to embody the temporal structure of haptic data within the motion of the robot during grasping. This allows us to re-use z to represent the haptic and control signals for re-grasping. Note that the learning is unsupervised in nature and does not require manual labeling.

More specifically, our haptic signals include a force vector at each time step from all three fingers. Our control signals include the configuration of the gripper: its mode m and its final finger position d. The mode m of the adaptive gripper describes the angle between the fingers and takes the categorical values "pinch", "normal" and "wide angle". The under-actuated gripper fingers have three links each but only one degree of freedom, summarized by d. d is valid when the gripper has fully enclosed the object; if no object was enclosed (grasp failure), d takes the maximum possible value. We use an L2 loss and stochastic gradient descent for training. For feature extraction, we discard the decoder and use only the encoder to extract the hidden state from a fixed-size time window of a few seconds.
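As a toy illustration of this conditional reconstruction objective, the following replaces the recurrent encoder/decoder with plain linear maps trained on synthetic data (all dimensions, data and the linear architecture are stand-ins of our own; the paper's model is an LSTM auto-encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "haptic" signals X that depend on a "control" signal G.
N, DX, DG, DZ = 256, 9, 2, 4          # samples, haptic, control, latent dims
G = rng.normal(size=(N, DG))
X = G @ rng.normal(size=(DG, DX)) + 0.01 * rng.normal(size=(N, DX))

# Linear conditional auto-encoder: z = [x, g] W_e ; x_hat = [z, g] W_d.
W_e = 0.1 * rng.normal(size=(DX + DG, DZ))
W_d = 0.1 * rng.normal(size=(DZ + DG, DX))

lr, losses = 0.01, []
for _ in range(500):
    Z = np.hstack([X, G]) @ W_e       # encode, conditioned on the control G
    X_hat = np.hstack([Z, G]) @ W_d   # decode, also conditioned on G
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    # Backprop through the two linear maps (L2 reconstruction loss).
    dW_d = np.hstack([Z, G]).T @ err / N
    dZ = err @ W_d[:DZ].T
    dW_e = np.hstack([X, G]).T @ dZ / N
    W_d -= lr * dW_d
    W_e -= lr * dW_e
```

After training, only the encoder (here `W_e`) would be kept, and its latent output used as the feature for downstream tasks.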

Fig. 4: Network architectures for learning haptic features (top) and the re-grasping policy (bottom). Our conditional auto-encoder learns to reconstruct haptic data using both haptic signals and applied gripper control. We treat the learned latent space as features for learning the re-grasping policy. Our re-grasping policy maps the hidden representation to the adjustments of the planar grasping parameters (4D). These high-level parameters are then executed using the motion planner to generate a new grasp.

VI-B Learning to Re-grasp

We consider a multi-stage grasping problem, where each grasp is conditioned on the previous one. Formally, given a current grasp g_i, we measure the haptic data and the grasp configuration parameters, and encode them into a hidden state h_i that captures the haptic responses of the current grasp. Next, we learn a policy that predicts the corrective action Δg leading to better grasp stability; the architecture is shown in Fig 4. At the same time, we learn a score function S to predict grasp stability, i.e., the empirical probability of grasp success. S is a simple feedforward network with five fully connected layers and a final sigmoid to estimate the probability. At test time, we iteratively apply the predicted Δg to the current grasp g_i, and continue until S predicts a high probability of success. Algorithm 2 summarizes our method.

Localize object with vision/touch;
Sample initial grasp g_1 from the vision/random policy;
Execute g_1 on robot;
Collect first haptic measurement m_1;
for i = 2 : N do
        Encode h_{i-1} = encoder(m_{i-1}, g_{i-1});
        Compute stability s = S(h_{i-1});
        if s < threshold then
                Compute Δg = policy(h_{i-1});
                Execute g_i = g_{i-1} + Δg on robot;
                Collect haptic measurement m_i;
        end if
end for
Algorithm 2 Grasping Without Seeing
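In code, this execution loop amounts to something like the following sketch. The function and threshold names are ours, and the stub models in the usage example are toys that merely echo the grasp; in the real system `encode`, `score` and `policy` would be the learned networks and `robot.execute` a motion-planned grasp:

```python
MAX_TRIALS = 5
SUCCESS_THRESHOLD = 0.8

def regrasp_loop(encode, score, policy, robot, initial_grasp):
    """Iteratively adjust a grasp until the stability score is high.

    encode(haptics, grasp) -> latent feature h
    score(h)               -> estimated probability of grasp success
    policy(h)              -> 4D adjustment (dx, dy, dz, dtheta)
    robot.execute(grasp)   -> haptic measurements for that grasp
    """
    grasp = initial_grasp
    haptics = robot.execute(grasp)
    for _ in range(MAX_TRIALS):
        h = encode(haptics, grasp)
        if score(h) >= SUCCESS_THRESHOLD:
            break                      # predicted stable: stop re-grasping
        dx, dy, dz, dtheta = policy(h)
        grasp = (grasp[0] + dx, grasp[1] + dy,
                 grasp[2] + dz, grasp[3] + dtheta)
        haptics = robot.execute(grasp)
    return grasp

# Toy stand-ins: haptics echo the grasp; a grasp is "stable" near x = 0.5.
class ToyRobot:
    def execute(self, grasp):
        return grasp

def encode(haptics, grasp):
    return haptics[0]

def score(h):
    return 1.0 - abs(h - 0.5)

def policy(h):
    return (0.1, 0.0, 0.0, 0.0) if h < 0.5 else (-0.1, 0.0, 0.0, 0.0)

final = regrasp_loop(encode, score, policy, ToyRobot(), (0.1, 0.0, 0.0, 0.0))
```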

Our output action Δg is parameterized by the change of the gripper’s position (each of Δx, Δy, Δz bounded within ±0.025 m) and orientation Δθ; Δg is thus a 4D vector. Given that the haptic measurement is only relevant in the local neighborhood of the current grasp, we constrain these parameters to small adjustments tailored to our setting. During data collection, continuous values of the re-grasp (Δx, Δy, Δz, Δθ) are sampled randomly. However, for the deep network we use a discretized output space. Specifically, we discretize each control dimension into bins. Thus, learning the policy function is similar to multi-way binary classification.
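The continuous-to-discrete mapping can be sketched as follows. The bin count is an assumption of ours (the paper's exact value is not recoverable here), and executing a bin as its centre is likewise our illustrative choice:

```python
import numpy as np

# Illustrative bin count per control dimension (an assumption, not the
# paper's value). Position adjustments are bounded within +/- 0.025 m.
N_BINS = 5
POS_RANGE = (-0.025, 0.025)

def to_bin(value, lo, hi, n_bins=N_BINS):
    """Map a continuous control value to a bin index in [0, n_bins)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    return int(np.clip(np.searchsorted(edges, value, side="right") - 1,
                       0, n_bins - 1))

def from_bin(index, lo, hi, n_bins=N_BINS):
    """Map a bin index back to the bin centre (the executed adjustment)."""
    width = (hi - lo) / n_bins
    return lo + (index + 0.5) * width
```

For example, a sampled Δx of 0.0 m falls in the middle bin, and out-of-range values clamp to the boundary bins, so each control dimension becomes a small classification problem.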


Eq 2 shows our loss for learning the policy function:

L = -(1/B) Σ_{b=1}^{B} Σ_{i=1}^{K} Σ_{j=1}^{N_i} 1[i,j] · [ y_b log σ(w_{b,i,j}) + (1 - y_b) log(1 - σ(w_{b,i,j})) ]    (2)

Here y_b corresponds to the success/failure label, while w is the output of the final dense layer before the sigmoid. N_i gives the number of discretized bins for control parameter i, K (=4) is the number of control parameters, B is the batch size and σ is the sigmoid activation. 1[i,j] is an indicator function equal to 1 when bin j of control parameter i was the one applied. The three networks use learning rates of 5e-7, 1e-5 and 5e-5, and all models are trained with the ADAM optimizer [48] for around 20 epochs. The networks and optimization are implemented in TensorFlow and Keras. Similarly, the stability score function S is learned using a cross-entropy loss.
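A numpy sketch of this masked per-bin objective for a single sample may make the indicator structure concrete (this is our reconstruction of the described loss on toy values; the symbols follow the description above):

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

def regrasp_loss(logits, applied_bin, success):
    """Masked sigmoid cross-entropy over discretized control bins.

    logits:      list of K arrays, one of length N_i per control parameter
    applied_bin: list of K ints, the bin actually executed per parameter
    success:     1 if the re-grasp succeeded, else 0
    """
    loss = 0.0
    for i, w in enumerate(logits):        # K control parameters
        p = sigmoid(w[applied_bin[i]])    # only the applied bin contributes
        loss -= success * np.log(p) + (1 - success) * np.log(1 - p)
    return loss / len(logits)

# Toy example: 4 control parameters, 5 bins each, all logits zero.
logits = [np.zeros(5) for _ in range(4)]
loss = regrasp_loss(logits, applied_bin=[2, 2, 2, 2], success=1)
```

With zero logits every applied bin predicts probability 0.5, so the loss equals log 2, the usual chance-level value for binary cross-entropy.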

VI-C Improving Vision-Based Grasping with Re-grasping

Finally, for our experiments we also explore combining the haptic re-grasping module with vision-based grasping. In practice, any vision-based policy could be used [9, 11, 10]. We adopt a variant of [43], hereafter denoted the vision policy, to generate an initial grasp, followed by our re-grasping model. We also use this policy to collect our dataset: we sample control parameters that are more likely to produce a successful grasp, which increases the number of successful grasps in our dataset.

Specifically, five control parameters are inferred from the object’s image: the 2D grasp location (x, y) in the image plane (converted to 3D coordinates with a calibrated depth camera), the angle θ of the gripper about the vertical axis in a planar grasp (similar to [9]), the configuration of the gripper (also used in our learning of haptic features), and the estimated height of the object from depth sensing. For both testing and data collection, we sample parameters from the vision policy and choose the command for each control dimension by taking the highest-scoring prediction.

VII Experimental Evaluation

We now present our experimental results. Our experiments are divided into two parts. First, we evaluate the learned haptic features for two key tactile perception tasks of material recognition and grasp stability estimation. We compare against state-of-the-art haptic feature extraction methods, and benchmark the choice of classifiers. Second, we test our tactile based grasping framework. We report results for our re-grasping module, tactile-only grasping, and visio-haptic grasping.

Test Set for Grasping. To evaluate our grasping framework, we physically test grasping methods on a set of novel objects, measuring the grasp accuracy averaged over multiple trials per object. This test setting is challenging: the test objects are not present in the training set and thus have been seen by neither our model nor the vision policy. Our test set is divided into two parts, as shown in Fig 5. Each set consists of different objects. Set A is more difficult than Set B, as it contains objects with more complex geometry, heterogeneous material distribution (e.g., plastic toy guns and a stapler) and articulations. This test set is also used for grasp stability estimation.

Fig. 5: Our test set of objects. These objects were not in the training data. We divide our test set into two parts. Set A contains slightly harder objects to grasp (such as the red and orange toy guns) compared to Set B.

Vii-a Learning Haptic Features

Our first experiment tests our haptic feature learning scheme. Our decoder achieves a low reconstruction error (L2 norm) on both the training set and a held-out test set of the recorded data. This error (around 1 Newton of force) is reasonable when compared to the sensing noise of our force sensor. To further evaluate the learned haptic features, we consider two key tasks in tactile perception: (1) material recognition; and (2) grasp stability estimation. We consider different combinations of haptic features and classifiers for both tasks.
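As a minimal sketch of the metric (our assumption: the L2 norm is taken per time step and averaged over the sequence, which the text does not spell out):

```python
import numpy as np

def reconstruction_error(signal, reconstruction):
    """Mean per-step L2 distance between a force sequence and its
    auto-encoder reconstruction (assumed definition of the metric)."""
    signal = np.asarray(signal, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    return float(np.linalg.norm(signal - reconstruction, axis=-1).mean())

# Toy 2-step, 2-channel force sequence; a good decoder yields a small error.
true_seq = np.array([[1.0, 2.0], [3.0, 4.0]])
noisy = true_seq + 0.1
err = reconstruction_error(true_seq, noisy)
```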

Tactile Features. We compare our learned haptic features with two baseline representation learning methods.

  • Auto-encoder. These are our haptic features learned using an unsupervised recurrent auto-encoder. Once learned, only the encoder is used to extract features.

  • Sparse Coding. This is a variant [50] of ST-HMP features [41]. These features are learned using dictionary learning and sparse coding on the spectrogram of 1D time series of tactile signals. Note that directly using ST-HMP is not feasible for us, as it requires 2D tactile images.

  • Hand Crafted. This is from [29], where raw signals from three specific events (before contact, when the finger-closing movement stalls due to object-finger contact, and after the fingers reach equilibrium) are extracted.
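The hand-crafted baseline can be sketched as follows: concatenate raw force readings sampled at the three grasp events. The event indices here are illustrative placeholders; detecting the actual contact and equilibrium events is hardware-specific and not shown.

```python
import numpy as np

def event_features(force_series, event_indices=(0, None, -1)):
    """Concatenate raw readings at three events (before contact, at contact,
    at equilibrium), in the spirit of [29]. Index None defaults to the
    midpoint as a stand-in for a detected contact event."""
    series = np.asarray(force_series, dtype=float)
    mid = len(series) // 2 if event_indices[1] is None else event_indices[1]
    idx = [event_indices[0], mid, event_indices[2]]
    return np.concatenate([np.atleast_1d(series[i]) for i in idx])

# Toy 1-D force trace: rest, ramp-up on contact, plateau at equilibrium.
feats = event_features([0.0, 0.2, 1.5, 2.0, 2.0])
```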

Choice of Classifiers. We further vary the classifiers used for both material recognition and grasp stability estimation.

  • Deep Network. We train a five-layer neural network with cross entropy loss for classification.

  • SVM. This is a linear classifier trained with hinge loss.
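A linear classifier trained with the hinge loss can be sketched in a few lines of NumPy via subgradient descent; the experiments presumably used standard library implementations, so treat this as an illustration of the objective rather than the paper's code.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, reg=0.01, epochs=200):
    """Minimal linear SVM: subgradient descent on the L2-regularized hinge
    loss. Labels y must be in {-1, +1}. Returns weights w and bias b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:
                # Inside the margin: step on hinge subgradient + regularizer.
                w += lr * (yi * xi - reg * w)
                b += lr * yi
            else:
                # Outside the margin: only the regularizer acts.
                w -= lr * reg * w
    return w, b

# Toy linearly separable data.
X = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
preds = [1 if np.array(xi) @ w + b > 0 else -1 for xi in X]
```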

Vii-A1 Material Recognition

The task is to classify different materials in our dataset using tactile signals during grasping. All features are learned from the full training set, as no supervision is required. Our classifiers are trained on a subset of the training set (80%) and tested on the held out testing set (the remaining 20%). We report average class accuracy. The results are presented in Table I.
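Average class accuracy, as reported in Table I, is commonly defined as the mean of per-class recall, which is robust to class imbalance. A minimal sketch under that assumed definition:

```python
from collections import defaultdict

def average_class_accuracy(y_true, y_pred):
    """Mean per-class recall: average of (correct / total) over classes."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example with three material classes.
acc = average_class_accuracy(
    ["wood", "wood", "metal", "metal", "metal", "glass"],
    ["wood", "metal", "metal", "metal", "glass", "glass"],
)
```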

Feature Type             Deep Network (%)   SVM (%)
Auto-encoder (Ours)      42.86              40.68
Sparse Coding [50, 41]   36.35              35.93
Hand Crafted [29]        33.50              33.66
TABLE I: Results of Material Recognition

The features learned by our auto-encoder outperform sparse coding and hand-crafted features for both the deep network and the SVM by a significant margin (at least 4.7%). The feed-forward network also performs comparably to or slightly better than the SVM for all features. In particular, our haptic features with a deep network improve on the traditional combination of sparse coding with an SVM by 6.9%. Furthermore, we show the confusion matrix for material recognition in Fig 6. The majority of the errors come from hard objects composed of wood/metal/glass being mis-classified as hard plastic. This result demonstrates that our haptic features encode physical properties of the object.
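A confusion matrix like the one in Fig 6 is typically row-normalized, so entry (true, predicted) is the fraction of the true class assigned to each prediction. A minimal sketch:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Row-normalized confusion matrix: cm[t][p] is the fraction of samples
    of true class t that were predicted as class p."""
    counts = Counter(zip(y_true, y_pred))
    row_totals = Counter(y_true)
    labels = sorted(row_totals)
    return {
        t: {p: counts[(t, p)] / row_totals[t] for p in labels}
        for t in labels
    }

# Toy example mirroring the observed error mode (wood confused with plastic).
cm = confusion_matrix(
    ["wood", "wood", "plastic"],
    ["plastic", "wood", "plastic"],
)
```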

Fig. 6: Confusion matrix for material recognition on a held out test set. Using our learned haptic features, we achieve an accuracy of 42.86%.

Vii-A2 Grasp Stability Estimation

The task is to estimate whether a grasp will be successful given tactile signals during grasping. Again, all features are learned from the training set. We train the classifiers on the training set and apply them to our full test set of grasping trials on unseen objects. We report the binary classification accuracy. The results are summarized in Table II.

Feature Type             Deep Network (%)   SVM (%)
Auto-encoder (Ours)      85.92              84.50
Sparse Coding [50, 41]   81.37              80.12
Hand Crafted [29]        82.54              82.66
TABLE II: Results of Grasp Stability Estimation

The results for grasp stability follow the same trend as material recognition. Our haptic features significantly outperform the other features, and the combination of our learned haptic features with a deep network achieves the best accuracy. This result suggests that the learned haptic features contain important information for grasping. To better understand our tactile features for grasping, we visualize the t-SNE embedding of the learned features and plot example results of our grasp stability estimation in Fig 7. We observe two main failure modes: (1) the part of the finger containing the haptic sensor may not come into contact with the object; and (2) the object may slip in the gripper.

Vii-A3 Remarks

We demonstrate that our learned features are highly effective for two key tactile perception tasks. Compared to other haptic features, our feature learning substantially improves performance. We also show that deep networks are on average better than a classical linear SVM for all haptic features. These results provide strong support for our design of the re-grasping model, i.e., the combination of our learned haptic features and deep networks.

Fig. 7: Visualization of learned haptic features using t-SNE Embedding. Red and blue dots correspond to failed and successful grasps respectively. We also plot four typical examples for grasp stability estimation.

Vii-B Tactile Based Grasping

Our second experiment focuses on the tactile based grasping framework. We first evaluate our core touch based re-grasping model. We then benchmark the full pipeline, and explore incorporating vision based grasping with our re-grasping model.

Vii-B1 Re-grasping Model

We evaluate our core re-grasping model using the full test set (20 objects). Note that in this case, we assume an oracle object location is given: we place each object in eight canonical orientations (N, S, W, E, NE, SE, SW and NW). Moreover, the initial grasp is randomly selected given the object location. We then compare three different settings: a single random re-grasp, multiple random re-grasps, and our re-grasping model. For a fair comparison, we set the number of trials for random re-grasps equal to the maximum number of trials of our model. Both the random re-grasp baseline and our model use our grasp stability estimation.
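The iterative re-grasping loop described above can be sketched as follows. The stability estimator and adjustment predictor are stand-ins for the learned networks; the 1-D grasp, threshold, and toy functions below are illustrative assumptions.

```python
def regrasp(initial_grasp, estimate_stability, predict_adjustment,
            max_trials=4, threshold=0.5):
    """Attempt a grasp, estimate stability from tactile feedback, and either
    stop (stable) or apply a predicted adjustment and retry."""
    grasp, history = initial_grasp, []
    for _ in range(max_trials):
        stability = estimate_stability(grasp)
        history.append((grasp, stability))
        if stability >= threshold:
            # Model believes the current grasp is stable; keep it.
            break
        # Otherwise adjust the grasp and try again.
        grasp = grasp + predict_adjustment(grasp)
    return grasp, history

# Toy stand-ins: a 1-D grasp position whose stability peaks at x = 1.0.
final, hist = regrasp(
    initial_grasp=0.0,
    estimate_stability=lambda g: 1.0 - abs(g - 1.0),
    predict_adjustment=lambda g: 0.5 if g < 1.0 else -0.5,
)
```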

Object Location   Initial Grasp   Re-grasp            Set A   Set B   A+B
Vision            Random          -                   16.3    32.5    24.4
Vision            Random          Random (4 trials)   17.5    28.8    23.2
Vision            Random          Ours (4 trials)     33.8    41.3    37.5
TABLE III: Re-grasping results with oracle object locations (grasp accuracy, %)

The results are shown in Table III. Grasp accuracy on Set B is always higher than on Set A. For the full set, the baseline accuracy for chance grasping is 24.4%, where the first (and only) grasp is sampled from a random policy with no re-grasping. Interestingly, multiple random re-grasps slightly decreased the accuracy by 1.2%. Our re-grasping model achieves the best accuracy of 37.5%, which is 14.3% better than the baseline of multiple random re-grasps. This result demonstrates the effectiveness of our re-grasping module.

Vii-B2 Grasping without Seeing

Going beyond re-grasping, we test our full pipeline of tactile based grasping, which includes touch based localization and re-grasping. In this case, we simplify our benchmark by only considering test set B and using a limited number of trials per object. This is primarily due to the run time of our experiments. Our results are shown in Table IV. Our pipeline improves on the baseline of random grasping by 14% and reaches an accuracy of 40% with only tactile sensing. This is one of the first results for a complete grasping pipeline that handles multiple novel objects using only the sense of touch.
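The touch-localization stage aggregates beliefs from multiple touch hits with a particle filter. A 1-D toy version can illustrate the belief aggregation; the paper operates over a 2-D workspace, and all noise scales below are assumptions.

```python
import numpy as np

# Illustrative 1-D particle filter for touch localization: each simulated
# touch "hit" reweights particles by a Gaussian likelihood around the
# contact point, followed by resampling with Gaussian motion noise.
rng = np.random.default_rng(0)
true_location = 0.7                       # unknown object position (toy)
particles = rng.uniform(0.0, 1.0, 500)    # uniform prior over the workspace

for _ in range(10):
    hit = true_location + rng.normal(0.0, 0.02)   # noisy contact reading
    # Weight particles by likelihood of the observed hit.
    weights = np.exp(-0.5 * ((particles - hit) / 0.05) ** 2)
    weights /= weights.sum()
    # Resample, then perturb with motion noise to avoid degeneracy.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx] + rng.normal(0.0, 0.01, len(particles))

estimate = particles.mean()   # posterior mean as the location estimate
```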

Object Location   Initial Grasp   Re-grasp          Accuracy (% on Set B)
Touch             Random          -                 26.0
Touch             Random          Ours (4 trials)   40.0
Vision            Vision          -                 51.3
Vision            Vision          Ours (4 trials)   61.9
TABLE IV: Grasping accuracy of our full method. We also present results of combining our re-grasping module with a vision based policy to further improve grasping.

Vii-B3 Visio-Haptic Grasping

Our last experiment combines the proposed re-grasping model with a vision based policy from [43]. The results are shown in Table IV. Our framework further benefits from a good initial grasp (from 40.0% to 61.9%). More importantly, combining vision based grasping with our tactile based re-grasping improves the accuracy by 10.6%. These results provide strong evidence for the need to combine visual and tactile sensing for robust grasping. This experiment also shows the flexibility of our re-grasping model, which can be readily plugged into existing grasp planning methods.

Viii Conclusions

In this paper, we demonstrate one of the first attempts at learning to grasp novel objects using only tactile sensing and without prior knowledge about the object. The core of our method lies in the combination of a) a simple method for touch based localization, b) unsupervised learning of rich tactile features, and c) a learning based method for re-grasping using haptic feedback. First, we created a large-scale dataset for visio-haptic grasping to evaluate our method and to facilitate future research. With this dataset, we used an auto-encoder to learn rich features from raw tactile signals. These features proved effective for both passive tasks like material recognition and active tasks like re-grasping, and showed an improvement of around 4-9% over prior methods. Finally, we show that our novel re-grasping model can progressively improve the grasp, leading to a significantly higher success rate even from a noisy initial grasp. Our method achieved a grasping accuracy of 40% using only tactile sensing for both localization and grasping. We also demonstrate that this re-grasping model can be combined with existing vision based grasping to further improve the accuracy by about 10%. We hope that our method together with our dataset can provide valuable insights for solving the challenging problem of autonomous grasping.

Ix Future Work

Our current method is limited in the sense that re-grasping has to start from a random initial grasp, which is far from optimal. Looking forward, tactile exploration could be used to build a representation of object shape (e.g., Gaussian Process Implicit Surfaces) followed by grasp planning [51]. Also, the major failure mode with our current hardware setup is one of partial observability: the regions of the robot's finger not covered by the sensor might come into contact with the object and push it. This in turn affects all stages of our pipeline, from feature learning and localization to grasp stability estimation and re-grasping. It could be mitigated by using novel skin/contact sensors and wrist force-torque sensors alongside incidental contact algorithms [52]. Furthermore, instead of adding symmetric Gaussian noise in the motion model of the particle filter, we could bias the model in the direction of the detected contact force. Finally, joint learning of localization and re-grasping with reinforcement learning is interesting to explore. Staged learning or policy iteration on the learned policy would greatly improve its performance, as in prior work [43, 10, 9].
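The proposed biased motion model can be sketched as follows: shift the mean of the particle-motion noise along the detected contact-force direction instead of using zero-mean Gaussian noise. The bias and noise magnitudes here are illustrative assumptions.

```python
import numpy as np

def biased_motion_noise(n, force_dir, bias=0.02, sigma=0.01, seed=0):
    """Motion noise for n particles: zero-mean Gaussian perturbation plus a
    mean shift of `bias` along the (normalized) contact-force direction."""
    rng = np.random.default_rng(seed)
    force_dir = np.asarray(force_dir, dtype=float)
    force_dir = force_dir / np.linalg.norm(force_dir)
    return bias * force_dir + rng.normal(0.0, sigma, size=(n, len(force_dir)))

# Contact force detected along +x: noise mean shifts toward +x.
noise = biased_motion_noise(10000, force_dir=[1.0, 0.0])
```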

ACKNOWLEDGEMENTS This work was supported by ONR MURI N000141612007, NSF IIS-1320083 and Google Focused Award. Abhinav Gupta was supported in part by Sloan Research Fellowship and Adithya was partly supported by a Uber Fellowship. The authors would also like to thank Brian Okorn, Nick Rhinehart, Ankit Bhatia, Reuben Aronson, Lerrel Pinto, Tess Hellebrekers and Suren Jayasuriya for discussions.


  • [1] N. Correll, K. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. Romano, and P. Wurman, “Analysis and observations from the first amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, 2017.
  • [2] S. Javdani, M. Klingensmith, J. A. Bagnell, N. S. Pollard, and S. S. Srinivasa, “Efficient touch based localization through submodularity,” in ICRA, 2013.
  • [3] B. Saund, S. Chen, and R. Simmons, “Touch based localization of parts for high precision manufacturing,” in ICRA, 2017.
  • [4] A. Petrovskaya and O. Khatib, “Global localization of objects via touch,” IEEE Transactions on Robotics, 2011.
  • [5] M. C. Koval, M. R. Dogar, N. S. Pollard, and S. S. Srinivasa, “Pose estimation for contact manipulation with manifold particle filters,” in IROS, 2013.
  • [6] Z. Pezzementi, C. Reyda, and G. D. Hager, “Object mapping, recognition, and localization from tactile geometry,” in ICRA, 2011.
  • [7] M. Kaboli, D. Feng, K. Yao, P. Lanillos, and G. Cheng, “A tactile-based framework for active object learning and discrimination using multimodal robotic skin,” IEEE Robotics and Automation Letters, 2017.
  • [8] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in NIPS, 2015.
  • [9] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” ICRA, 2016.
  • [10] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” ISER, 2016.
  • [11] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” RSS, 2017.
  • [12] Y. Bekiroglu, J. Laaksonen, J. A. Jorgensen, V. Kyrki, and D. Kragic, “Assessing grasp stability based on learning and haptic data,” IEEE Transactions on Robotics, 2011.
  • [13] H. Dang and P. K. Allen, “Stable grasping under pose uncertainty using tactile feedback,” Autonomous Robots, 2013.
  • [14] R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. Adelson, and S. Levine, “The feeling of success: Does touch sensing help predict grasp outcomes?” Conference on Robot Learning, 2017.
  • [15] J. M. Romano, K. Hsiao, G. Niemeyer, S. Chitta, and K. J. Kuchenbecker, “Human-inspired robotic grasp control with tactile sensing,” IEEE Transactions on Robotics, 2011.
  • [16] H. Dang, J. Weisz, and P. K. Allen, “Blind grasping: Stable robotic grasping using tactile feedback and hand kinematics,” in ICRA, 2011.
  • [17] K. Hsiao, S. Chitta, M. Ciocarlie, and E. G. Jones, “Contact-reactive grasping of objects with partial shape information,” in IROS, 2010.
  • [18] H. Dang and P. K. Allen, “Grasp adjustment on novel objects using tactile experience from similar local geometry,” in IROS, 2013.
  • [19] J. Felip, J. Bernabe, and A. Morales, “Contact-based blind grasping of unknown objects,” IEEE-RAS International Conference on Humanoid Robots, 2012.
  • [20] A. Bicchi and V. Kumar, “Robotic grasping and contact: a review,” in ICRA, 2000.
  • [21] J. Tegin and J. Wikander, “Tactile sensing in intelligent robotic manipulation–a review,” Industrial Robot: An International Journal, 2005.
  • [22] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on Robotics, 2014.
  • [23] A. T. Miller, S. Knoop, H. I. Christensen, and P. K. Allen, “Automatic grasp planning using shape primitives,” in ICRA, 2003.
  • [24] A. Collet, D. Berenson, S. S. Srinivasa, and D. Ferguson, “Object recognition and full pose registration from a single image for robotic manipulation,” in ICRA, 2009.
  • [25] A. Saxena, J. Driemeyer, J. Kearns, C. Osondu, and A. Y. Ng, “Learning to grasp novel objects using vision,” in ISER, 2006.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR, 2016.
  • [27] R. S. Johansson and J. R. Flanagan, “Coding and use of tactile signals from the fingertips in object manipulation tasks,” Nature reviews. Neuroscience, 2009.
  • [28] V. Chu, I. McMahon, L. Riano, C. G. McDonald, Q. He, J. M. Perez-Tejada, M. Arrigo, N. Fitter, J. C. Nappo, T. Darrell et al., “Using robotic exploratory procedures to learn the meaning of haptic adjectives,” in ICRA, 2013.
  • [29] A. Spiers, M. Liarokapis, B. Calli, and A. Dollar, “Single-grasp object classification and feature extraction with simple robot hands and tactile sensors,” IEEE Transactions on Haptics, vol. 9, 2016.
  • [30] W. Yuan, C. Zhu, A. Owens, M. A. Srinivasan, and E. H. Adelson, “Shape-independent hardness estimation using deep learning and a gelsight tactile sensor,” in ICRA, 2017.
  • [31] Z. Yi, R. Calandra, F. Veiga, H. van Hoof, T. Hermans, Y. Zhang, and J. Peters, “Active tactile object exploration with gaussian processes,” in IROS, 2016.
  • [32] M. Meier, M. Schopfer, R. Haschke, and H. Ritter, “A probabilistic approach to tactile shape reconstruction,” IEEE Transactions on Robotics, 2011.
  • [33] R. Fearing, “Simplified grasping and manipulation with dextrous robot hands,” IEEE Journal on Robotics and Automation, 1986.
  • [34] Y. Chebotar, K. Hausman, Z. Su, G. S. Sukhatme, and S. Schaal, “Self-supervised regrasping using spatio-temporal tactile features and reinforcement learning,” in IROS, 2016.
  • [35] S. Dragiev, M. Toussaint, and M. Gienger, “Uncertainty aware grasping and tactile exploration,” in ICRA, 2013.
  • [36] M. C. Koval, N. S. Pollard, and S. S. Srinivasa, “Pre-and post-contact policy decomposition for planar contact manipulation under uncertainty,” The International Journal of Robotics Research, 2016.
  • [37] Y. Chebotar, K. Hausman, Z. Su, A. Molchanov, O. Kroemer, G. Sukhatme, and S. Schaal, “Bigs: Biotac grasp stability dataset,” in ICRA Workshop on Grasping and Manipulation Datasets, 2016.
  • [38] Z. Erickson, S. Chernova, and C. C. Kemp, “Semi-supervised haptic material recognition for robots using generative adversarial networks,” arXiv preprint arXiv:1707.02796, 2017.
  • [39] Y. Chebotar, K. Hausman, O. Kroemer, G. Sukhatme, and S. Schaal, “Generalizing regrasping with supervised policy learning,” ISER, 2016.
  • [40] A. Schneider, J. Sturm, C. Stachniss, M. Reisert, H. Burkhardt, and W. Burgard, “Object identification with tactile sensors using bag-of-features,” IROS, 2009.
  • [41] M. Madry, L. Bo, D. Kragic, and D. Fox, “St-hmp: Unsupervised spatio-temporal feature learning for tactile data,” in ICRA, 2014.
  • [42] J. Sung, J. K. Salisbury, and A. Saxena, “Learning to represent haptic feedback for partially-observable tasks,” arXiv preprint arXiv:1705.06243, 2017.
  • [43] A. Murali, L. Pinto, D. Gandhi, and A. Gupta, “CASSL: Curriculum accelerated self-supervised learning,” ICRA, 2018.
  • [44] M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich, “Fetch & freight: Standard platforms for service robot applications,” Workshop on Autonomous Mobile Service Robots, IJCAI, 2016.
  • [45] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, 2009.
  • [46] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlo localization for mobile robots,” Artificial intelligence, 2001.
  • [47] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 1997.
  • [48] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [49] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [50] J.-P. Roberge, S. Rispal, T. Wong, and V. Duchaine, “Unsupervised feature learning for classifying dynamic tactile events using sparse coding,” in ICRA, 2016.
  • [51] J. Mahler, S. Patil, B. Kehoe, J. van den Berg, M. Ciocarlie, P. Abbeel, and K. Goldberg, “GP-GPIS-OPT: Grasp planning with shape uncertainty using gaussian process implicit surfaces and sequential convex programming,” ICRA, 2015.
  • [52] T. Bhattacharjee, A. Kapusta, J. Rehg, and C. Kemp, “Rapid categorization of object properties from incidental contact with a tactile sensing robot arm,” Humanoids, 2013.