Learning Latent Space Dynamics for Tactile Servoing

11/08/2018 ∙ by Giovanni Sutanto, et al.

In order to achieve dexterous robotic manipulation, we need to equip our robot with a tactile feedback capability, i.e. the ability to drive action based on tactile sensing. In this paper we specifically address the challenge of tactile servoing: given the current tactile sensing and a target/goal tactile sensing (for example, one memorized from a successful task execution in the past), what is the action that will bring the current tactile sensing closer to the target tactile sensing at the next time step? We develop a data-driven approach to acquire a dynamics model for tactile servoing by learning from demonstration. Moreover, our method represents the tactile sensing information as lying on a surface (a 2D manifold) and performs manifold learning, making it applicable to any tactile skin geometry. As a proof of concept, we evaluate our method on a robot equipped with a tactile finger. A video demonstrating our approach can be seen at https://youtu.be/5EJSAoUO0E0




I Introduction

The ability to adapt actions based on tactile sensing is the key to robustly interacting with and manipulating objects in the environment. Previous experiments have shown that when tactile-driven control is impaired, humans have difficulty performing even basic object manipulation tasks [1, 2]. Therefore, we believe that equipping robots with tactile feedback capability is important to make progress in robotic manipulation, assisting humans in daily activities and advanced manufacturing.

In line with this direction, a variety of tactile sensors [3, 4, 5, 6] have recently been developed and used in the robotics research community, and researchers have designed several tactile-driven control algorithms, popularly termed tactile servoing. However, many tactile servoing algorithms were designed for specific kinds of tactile sensor geometry, such as a planar surface [7] or a spherical surface [8], and therefore do not apply to the broad class of tactile sensors in general. For example, if we would like to equip a robot with a tactile skin of arbitrary geometry, or if the sensor geometry changes due to wear or damage, we will need a more general tactile servoing algorithm.

In this paper, we present our work on a learning-based tactile servoing algorithm that does not assume a specific sensor geometry. Our method comprises three steps. At the core of our approach, we treat the tactile skin as a manifold; hence, we first perform offline neural-network-based manifold learning to learn a latent space representation that encodes the essence of the tactile sensing information. Second, we learn a latent space dynamics model from demonstration, also offline. Finally, we deploy our model to compute control actions online, based on both the current and target tactile signals, for tactile servoing on a robot.

Fig. 3: Learning tactile servoing platform. (a) Simulated robot. (b) Real robot.

This paper is organized as follows. Section II provides some related work. Section III presents the model that we use for learning tactile servoing from demonstration. We then present our experimental setup and evaluations in Section IV. Finally, we discuss our results and future work in Section V.

II Related Work

Our work is mostly inspired by previous works on learning control and dynamics in the latent space [9, 10]. Both of these works learn a latent space representation of the state, and also learn a dynamics model in the latent space. Watter et al. [10] designed the latent space's state transition model to be locally linear, such that a stochastic optimal control algorithm can be directly applied to the learned model for control afterwards. Byravan et al. [9] designed the latent space to represent SE(3) poses of the tracked objects in the scene, and the transition model is simply the SE(3) transformations of these poses. Control in [9] is done by following the gradient of the squared Euclidean distance between the target and current latent space poses with respect to action.

The latent space dynamics model that we train takes the latent space representation of the current tactile sensing and the applied action, and predicts the latent space representation of the next tactile sensing; this is termed forward dynamics. Since we use the model for control, i.e. tactile servoing, it is also essential that we can recover the action given both the current and next tactile sensing; this is termed inverse dynamics.

Previous work [11] learns both the forward and inverse dynamics model for poking, and the inverse dynamics model is represented as additional layers in the neural network model. In our work, we engineer the latent space representation to be Euclidean, such that the inverse dynamics model’s output action prediction can be acquired simply by computing the gradient of the squared distance between the current and next latent states with respect to action.

In terms of latent space representation, our work is inspired by the work of Hadsell et al. [12], where they use a Siamese neural network and a loss function such that similar data points are close to each other in the latent space and dissimilar data points are further away from each other in the latent space. On the other hand, we use a Siamese neural network and a loss function that performs Multi-Dimensional Scaling (MDS) [13], such that the first two dimensions of the latent space represent the 2D map of the contact point on the tactile skin surface. Our third dimension in the latent space represents the degree of contact applied on the skin surface, i.e. how much pressure was applied at the point of contact.

Regarding tactile servoing, besides the previous works [7, 8] mentioned in Section I, Su et al. [14] designed a heuristic for tactile servoing with a tactile finger [3]. Our work treats the tactile sensor as a general manifold; hence the method should apply to any tactile sensor.

Previously, learning tactile feedback has been done through reinforcement learning or a combination of imitation learning and reinforcement learning [16, 17]. Sutanto et al. [18] learn a tactile feedback model for a trajectory-centric reactive policy. In this work, we learn a tactile servoing policy indirectly by learning latent space dynamics from demonstration. Because we engineer the latent space to be Euclidean by performing MDS and maintaining contact degree information, the control action can be computed from the gradient of the squared latent space distance between the current and target states with respect to action. Hence our method does not need to perform reinforcement learning.

III Data-Driven Tactile Servoing Model

III-A Tactile Servoing Problem Formulation

Fig. 8: Neural network diagrams and their loss functions (drawn as dotted lines), and the action computation for tactile servoing from the learned model.

Given the current tactile sensing and the target tactile sensing, the objective is to find the action that will bring the next tactile sensing closer to the target tactile sensing (Eq. 1).
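As a hedged illustration of this objective, one plausible way to write it is shown below. All symbols are assumed notation introduced here for readability and are not necessarily the ones used in Eq. 1: $s_t$ is the current tactile sensing, $s_T$ the target tactile sensing, $a$ a candidate action, $s_{t+1}(s_t, a)$ the tactile sensing that results from applying $a$, and $d$ a distance metric.

```latex
a_t \;=\; \arg\min_{a} \; d\big(s_{t+1}(s_t, a),\; s_T\big)
```

Under this reading, the choice of $d$ determines how hard the minimization is, which motivates the latent space construction in the next subsection.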


III-B Latent Space Representation

If the distance metric is a squared distance between two states that lie in a Euclidean space, then computing the action in Eq. 1 becomes trivial, i.e. the action will be a gradient vector pointing from the current state to the target state. Unfortunately, the current and target tactile sensing may not lie in a Euclidean space.

On the other hand, there seem to be some natural characterizations of tactile sensing, such as the contact point and the degree of contact pressure applied at that point. The contact point in particular is a 3D coordinate which lies on the skin surface. The skin surface is not Euclidean, i.e. we cannot go from the current contact point to the target contact point by simply following the vector between them, because the path may leave the skin surface. (Footnote 1: The correct way of traversing from one contact point to another is by following the geodesic between the two points on the skin surface.) However, if we are able to flatten the skin surface in 3D space into a 2D surface, then traversing between the two contact points translates into following the vector from one 2D point to the other on the 2D surface, which ensures that the intermediate points being traversed all remain on the 2D surface. Fortunately, there is a method for mapping/embedding a 3D surface into a 2D surface, called Multi-Dimensional Scaling (MDS) [13].

In this paper, we choose the latent space embedding to be three-dimensional. (Footnote 2: Here we assume that there exists a mapping from a tactile sensing into the 3D contact point on the tactile skin surface, as well as a mapping from a tactile sensing into the degree of contact pressure information.) The three dimensions are:

  1. The first two dimensions of the latent space, called the x and y dimensions of the latent space, correspond to the 2D embedding of the 3D contact point on the tactile skin surface.

  2. The third dimension of the latent space represents the degree of contact pressure applied at the contact point.

We understand that the above representation can only represent a contact as a single 3D coordinate in the latent space. Therefore, it cannot capture a richer set of features, such as an object's edges and orientations. Tactile servoing for edge tracking is left for future work.

We refer to the embedding of a tactile sensing as its latent state.

III-C Approach

We define the distance metric as the squared distance in the latent space between the embeddings of two tactile sensings, computed by the embedding function (Eq. 2). (Footnote 3: Subscripts in Eq. 2 correspond to time indices.)


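We assume latent space dynamics in which a forward dynamics function maps the current latent state and action to the rate of change of the latent state (Eq. 3); numerical integration then gives us the next latent state (Eq. 4).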
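A minimal sketch of the integration step follows, assuming a forward-Euler scheme (the text only says "numerical integration", so the scheme is an assumption). The function `toy_latent_velocity` is a hypothetical stand-in for the learned forward dynamics network, and `dt = 0.1` is chosen to match a 10 Hz prediction rate.

```python
def euler_step(z, action, latent_velocity, dt):
    """One forward-Euler step: z_next = z + dt * f(z, action)."""
    z_dot = latent_velocity(z, action)
    return [zi + dt * zdi for zi, zdi in zip(z, z_dot)]


def toy_latent_velocity(z, action):
    # Hypothetical stand-in for the learned network: the latent velocity
    # is simply the first three components of the 6D action.
    return list(action[:3])


z_current = [0.0, 0.0, 1.0]                    # 3D latent state
action = [0.5, -0.2, 0.0, 0.0, 0.0, 0.0]       # 6D end-effector velocity
z_next = euler_step(z_current, action, toy_latent_velocity, dt=0.1)
```

Chained predictions several steps ahead amount to calling `euler_step` repeatedly on its own output.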


We represent the embedding function by the encoder part of an auto-encoder neural network, while the latent space forward dynamics function is represented by a feedforward neural network.

For achieving the latent space representation as mentioned in III-B, we impose the following structure:

  1. We would like to map points on a surface in 3D space into 2D coordinates. Essentially, we are dealing with a 2D manifold embedded in 3D space. For such a manifold, the distance between any pair of 3D points on the manifold is given by the geodesic, i.e. the shortest curve between them on the surface. For this mapping, we would like to preserve these pairwise geodesic distances in the resulting 2D map. That is, for pairs of data points, we want the embedding function to produce latent space pairs whose distance in the x and y dimensions is as close as possible to the pairwise geodesic distance. Therefore we define the loss function in Eq. 5 [19].


    In Eq. 5, the number of data point pairs is quadratic in the total number of data points, and each pair is associated with the geodesic distance between its two data points. The pairwise geodesic distance between any two data points is approximated using the shortest path algorithm on a sparse distance matrix of M-nearest-neighbors of each data point. We use M-nearest-neighbors because the space is not 2D-Euclidean globally due to skin curvature, but it is locally 2D-Euclidean (i.e. flat) on a small neighborhood (a small patch) of the skin. The result is stored as a symmetric dense approximate geodesic distance matrix before training begins. The pairwise loss function in Eq. 5 is applied by using a Siamese neural network, as depicted in Figure 8(a).

  2. We encode the third dimension of the latent space with the contact pressure information. This is done by imposing the loss function in Eq. 6.
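The two structural constraints above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the geodesic approximation uses Dijkstra's algorithm on a nearest-neighbor graph, `mds_loss` is a simple mean-squared mismatch that may differ from the exact form of Eq. 5, and `contact_pressure_loss` assumes a squared-error form for Eq. 6. All function names and the toy quarter-circle "skin" are hypothetical.

```python
import heapq
import math


def knn_graph(points, m):
    """Sparse symmetric graph linking each point to its m nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:m]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_paths_from(graph, source):
    """Dijkstra's algorithm: approximate geodesic distances from one point."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > best.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            candidate = d + w
            if candidate < best.get(v, math.inf):
                best[v] = candidate
                heapq.heappush(queue, (candidate, v))
    return best


def mds_loss(latent_xy, geodesic):
    """Mean squared mismatch between 2D latent distances and geodesic distances."""
    n = len(latent_xy)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum((math.dist(latent_xy[i], latent_xy[j]) - geodesic[i][j]) ** 2
                for i, j in pairs)
    return total / len(pairs)


def contact_pressure_loss(latent_third_dim, pressures):
    """Assumed squared-error form tying the third latent dimension to pressure."""
    errors = [(z - p) ** 2 for z, p in zip(latent_third_dim, pressures)]
    return sum(errors) / len(errors)


# Toy curved "skin": nine points on a quarter circle of radius 1.
angles = [k * (math.pi / 2) / 8 for k in range(9)]
points = [(math.cos(t), math.sin(t), 0.0) for t in angles]
graph = knn_graph(points, m=2)
geodesic = [shortest_paths_from(graph, i) for i in range(len(points))]

# Flattening the arc by arc length nearly preserves the geodesic distances.
flat_embedding = [(t, 0.0) for t in angles]
flat_loss = mds_loss(flat_embedding, geodesic)
```

On this toy arc, the graph distance between the two end points is longer than the straight chord and close to the true arc length, which is the property the 2D flattening is meant to preserve.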


While we have the ground truth for the third dimension of the latent state, i.e. the contact pressure, we do not have the ground truth for the x and y dimensions. We have the 3D coordinate of each data point on the skin (footnote 4: for BioTacs, these 3D coordinates can be computed from electrode values by using the point of contact estimation model presented in [20]), which is used to compute the sparse distance matrix of M-nearest-neighbors of each data point, but we do not know how it is mapped to the x and y dimensions of the latent space; this is our reason for using an auto-encoder neural network representation. The auto-encoder reconstruction loss (Eq. 7) compares each tactile sensing with its reconstruction, obtained by applying the encoder (embedding function) followed by the decoder (inverse-embedding function).

Furthermore, we would like to be able to predict the forward dynamics in the latent space. For this purpose, we use the loss function in Eq. 8, in which the predicted next latent state is computed from Eq. 3 and 4. For additional robustness, we can also make chained predictions several time steps ahead and sum the loss in Eq. 8 over these chains, similar to the work by Nagabandi et al. [21].

Besides forward dynamics, we also found the ability of the model to predict inverse dynamics to be essential for action selection or control. This is in agreement with previous work by Agrawal et al. [11], who use additional neural network layers to predict the action given the current and next states. However, in our case, since we engineer the latent space to be Euclidean, we know that the gradient vector of the latent space distance between the current and next states with respect to action will have the opposite direction to the action vector itself. Therefore, we model our inverse dynamics loss as a cosine distance (Eq. 9).
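To make the cosine-distance idea concrete, here is a small sketch. It assumes the loss has the form "one minus the cosine similarity between the negated gradient and the demonstrated action", which is one natural reading of the description and may differ in detail from Eq. 9. The function names are hypothetical, and both vectors are assumed to be nonzero.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes both vectors are nonzero."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (norm_u * norm_v)


def inverse_dynamics_loss(grad_wrt_action, demonstrated_action):
    """Penalize misalignment between the negated gradient and the action."""
    negated_gradient = [-g for g in grad_wrt_action]
    return cosine_distance(negated_gradient, demonstrated_action)


# Gradient exactly opposite to the action: loss is 0 (perfect agreement).
loss_aligned = inverse_dynamics_loss([-2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
# Gradient pointing along the action: loss is 2 (worst case).
loss_opposed = inverse_dynamics_loss([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Because the cosine distance ignores vector magnitudes, this form only constrains the direction of the gradient, not its scale.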


We combine the loss functions and optimize them together as a total weighted-sum loss function (Eq. 10), with the weights tuned so that the loss components are comparable to each other in magnitude. The overall model is trained by minimizing the total loss on a human demonstration's trajectory data.

After the model is trained, at test time we can perform tactile servoing by computing the gradient of the distance metric in Eq. 2 with respect to the action and then following the opposite direction of this gradient, similar to [22]. The individual loss functions and the action computation during tactile servoing using the trained model are depicted in Figure 8.
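The test-time action computation can be sketched as follows. This is a toy illustration under several assumptions: the gradient is taken by central finite differences rather than automatic differentiation, it is evaluated at the zero action, and `toy_predict_next` is a hypothetical stand-in for the trained encoder and forward dynamics model. The `gain` parameter is also an assumption.

```python
def squared_distance(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def servo_action(z_current, z_target, predict_next, action_dim=6, eps=1e-5, gain=1.0):
    """Action along the negative gradient of the squared latent distance."""
    def cost(action):
        return squared_distance(predict_next(z_current, action), z_target)

    gradient = []
    for k in range(action_dim):
        plus = [0.0] * action_dim
        minus = [0.0] * action_dim
        plus[k] = eps
        minus[k] = -eps
        gradient.append((cost(plus) - cost(minus)) / (2.0 * eps))
    return [-gain * g for g in gradient]


def toy_predict_next(z, action, dt=0.1):
    # Hypothetical dynamics: the first three action components move the
    # latent state directly; the last three have no effect.
    return [zi + dt * ai for zi, ai in zip(z, action[:3])]


z_current = [0.0, 0.0, 1.0]
z_target = [1.0, 0.0, 1.0]
action = servo_action(z_current, z_target, toy_predict_next)
z_after = toy_predict_next(z_current, action)
```

In this toy setting the computed action points along the first latent dimension, and applying it reduces the squared latent distance to the target.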

IV Experiments

IV-A Experimental Setup

We use an ABB Yumi robot as the hardware platform for our experiments. Yumi is a position-controlled bi-manual robot, but we only use the left arm for our experiments. We mount a finger equipped with a biomimetic tactile sensor, the BioTac [3], on the left hand of the robot using a 3D-printed adapter. The setup is pictured in Figure 3. The BioTac has 19 electrodes distributed on the skin surface, capable of measuring deformation of the skin by measuring the change of impedance when the conductive fluid underneath the skin is compressed or deformed due to contact with an object. In our experiments, the tactile sensing is a vector of 19 values corresponding to the digital readings of the 19 electrodes, each minus its offset value, which is estimated when the finger is in the air and not in contact with any object. The contact pressure information is a scalar quantity obtained by negating the mean of this vector, where each element is the digital reading of an electrode minus its offset.
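The contact pressure computation described above is simple enough to sketch directly. The function name and the example readings are hypothetical; only the rule itself (negated mean of the offset-subtracted electrode readings) comes from the text.

```python
def contact_pressure(electrode_readings, offsets):
    """Negated mean of the offset-subtracted electrode readings."""
    if len(electrode_readings) != len(offsets):
        raise ValueError("readings and offsets must have the same length")
    deltas = [r - o for r, o in zip(electrode_readings, offsets)]
    return -sum(deltas) / len(deltas)


# Made-up example: 19 electrodes, each reading 10 counts below its offset.
offsets = [110.0] * 19
readings = [100.0] * 19
pressure = contact_pressure(readings, offsets)
```

With these made-up values every offset-subtracted reading is -10, so the contact pressure is 10.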

IV-A1 Human Demonstration Collection

For collecting human demonstrations, we set the Yumi robot to Leadthrough mode, which makes the robot joints compliant with some degree of gravity compensation. In this mode, the robot tracks its joint positions and transmits this information to a computer for recording at 250 Hz. The tactile information is recorded at 100 Hz, and the contact pressure can be computed from it later.

The demonstrator shows several minutes of contact interaction between the BioTac finger and a box object, particularly swiping around the edges of the box. We collect data points of the tactile sensing vector, as well as tuples of the current observed tactile state, the current action, and the next observed tactile state. The final set of state-action pairs is obtained after excluding the pairs that contain states whose contact pressure is below a specific threshold. We exclude these pairs because we deem them off-contact tactile states that are not informative for performing tactile servoing. (Footnote 5: In the extreme case, when the robot is not in contact with any object, there is no point in performing tactile servoing.)

After collecting the demonstration data, we pre-process the data by low-pass filtering with a cut-off frequency of 5 Hz. We determined the cut-off frequency by a frequency-domain analysis, using a visualization of the Fourier transform of the data. This frequency selection for tactile servoing is also supported by previous work by Johansson et al. [23]. We perform the forward dynamics prediction at 10 Hz.


IV-A2 Action Representation

From the demonstration, we collected the trajectory of joint positions. However, if we use the time-derivative of the joint positions, i.e. the joint velocity, as the action policy for tactile servoing, it is unlikely to generalize to new situations. The reason is that such a policy at a given time is only reasonable in the context of the robot's joint position at that time. In other words, such a policy requires the joint position to be part of the state, together with the tactile sensing information. This makes learning a tactile servoing model more difficult, as the state has additional dimensions.

A better policy representation, one that most likely generalizes better with a minimal state dimensionality, is the end-effector velocity expressed with respect to the end-effector frame. By representing the end-effector velocity with respect to the end-effector frame, we effectively cancel out the dependency of the state representation on the end-effector pose information. Therefore we choose this policy representation.

To achieve this, we first compute the joint velocity by numerical time-differentiation of the joint positions, and then map it to the end-effector velocity with respect to the robot base frame via the kinematic Jacobian (Eq. 11). To get the end-effector velocity with respect to the end-effector frame, we then apply the transformation in Eq. 12 [24], which uses the end-effector orientation with respect to the base frame, expressed as a rotation matrix. We use the robot control framework Riemannian Motion Policies (RMP) [25] for these computations. Hence, the action has dimensionality 6, where the first three dimensions are the linear velocity and the last three are the angular velocity.
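The two velocity mappings can be sketched with plain lists. This sketch assumes the standard textbook forms: the base-frame velocity is the Jacobian times the joint velocity, and the end-effector-frame velocity is obtained by applying the transpose of the rotation matrix to the linear and angular parts separately. Whether Eq. 11 and Eq. 12 have exactly these forms is an assumption, and all function names and example values are hypothetical.

```python
def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def twist_base_from_joints(jacobian, joint_velocities):
    """Base-frame end-effector velocity: Jacobian times joint velocity."""
    return mat_vec(jacobian, joint_velocities)


def twist_base_to_ee(twist_base, rotation_ee_in_base):
    """Express a base-frame 6D velocity in the end-effector frame."""
    r_t = transpose(rotation_ee_in_base)
    return mat_vec(r_t, twist_base[:3]) + mat_vec(r_t, twist_base[3:])


def twist_ee_to_base(twist_ee, rotation_ee_in_base):
    """Inverse mapping, needed when the robot tracks base-frame velocities."""
    r = rotation_ee_in_base
    return mat_vec(r, twist_ee[:3]) + mat_vec(r, twist_ee[3:])


# End-effector frame rotated 90 degrees about the base z-axis.
rotation = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0]]
twist_base = [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]   # linear (x, y, z), angular (x, y, z)
twist_ee = twist_base_to_ee(twist_base, rotation)
round_trip = twist_ee_to_base(twist_ee, rotation)
```

In this example, motion along the base y-axis appears as motion along the end-effector x-axis, and mapping back with `twist_ee_to_base` recovers the original base-frame velocity.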

IV-A3 Machine Learning Framework and Training Process

Our auto-encoder takes in a 19-dimensional input vector and compresses it down to a 3-dimensional latent state embedding. The intermediate hidden layers are fully connected layers of size 19, 12, and 6 with ReLU, tanh, and tanh activation functions, respectively, forming the encoder function. The decoder is a mirrored structure of the encoder. The latent space forward dynamics function is a feedforward neural network with a 9-dimensional input (the 3-dimensional latent state and the 6-dimensional action), one hidden layer of size 6 with tanh activation functions, and a 3-dimensional output.

Our training process is split into two steps. In the first step, we pre-train the auto-encoder with the loss function for k training iterations, each with a batch size of 128. In the second step, we train both the auto-encoder and the latent space forward dynamics function together with the total loss in Eq. 10 for k training iterations, each with a batch size of 128. We set the loss weights empirically.

We implement all components of our model in TensorFlow. We also noticed a significant improvement in learning speed and fitting quality after adding Batch Normalization [27] layers to our model.

IV-B Auto-Encoder Reconstruction Performance

Our first evaluation is on the reconstruction performance of the auto-encoder in terms of normalized mean squared error (NMSE). NMSE is the mean squared prediction error divided by the variance of the ground truth. The result is summarized in Table I. As we can see, all NMSE values are below 0.25 for the training, validation, and test sets.

Auto-Encoder Reconstruction NMSE
Electrode   Training Set   Validation Set   Test Set
            ( Split)       ( Split)         ( Split)
1.          0.1154         0.1015           0.1310
2.          0.1053         0.0939           0.0862
3.          0.0647         0.0866           0.0550
4.          0.1138         0.1325           0.1136
5.          0.1163         0.1273           0.1148
6.          0.2061         0.2345           0.1993
7.          0.1361         0.2045           0.1102
8.          0.1438         0.1150           0.1578
9.          0.1033         0.0886           0.0904
10.         0.1072         0.1048           0.0963
11.         0.1179         0.1081           0.1217
12.         0.1321         0.1137           0.1197
13.         0.0788         0.0882           0.0804
14.         0.0522         0.0540           0.0565
15.         0.0843         0.0762           0.0857
16.         0.0722         0.0898           0.0607
17.         0.1095         0.1053           0.1191
18.         0.1197         0.1076           0.1021
19.         0.0909         0.1010           0.1453
TABLE I: NMSE of Auto-Encoder Reconstruction.
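For reference, the NMSE reported per electrode in Table I follows the definition given above: mean squared prediction error divided by the variance of the ground truth. A minimal sketch is shown below; the use of the population variance (dividing by the number of samples) is an assumption, and the example values are made up.

```python
def nmse(predictions, targets):
    """Mean squared error divided by the (population) variance of the targets."""
    n = len(targets)
    mean_target = sum(targets) / n
    variance = sum((t - mean_target) ** 2 for t in targets) / n
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    return mse / variance


targets = [1.0, 2.0, 3.0, 4.0]
nmse_perfect = nmse(targets, targets)             # perfect reconstruction
nmse_mean_only = nmse([2.5] * 4, targets)         # always predicting the mean
```

A perfect reconstruction gives an NMSE of 0, while always predicting the mean of the ground truth gives an NMSE of 1, so values well below 1 indicate that the reconstruction captures most of the signal variance.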

IV-C Latent Dynamics Prediction Performance

Fig. 9: Normalized mean squared error (NMSE) vs. the length of chained forward dynamics prediction, averaged over latent space dimensions, on test dataset.

We evaluate the latent space forward dynamics function by chain-predicting the next latent states and measuring the NMSE as a function of chain length. In Figure 9, we compare the performance of four combinations of the following two training choices:

  • using both latent space structure loss functions (Eq. 5 and Eq. 6) during training, indicated by LatStruct, or using neither of them (i.e. without any structure imposed on the latent space representation), indicated by noLatStruct, and

  • using the inverse dynamics loss during training, indicated by IDloss, or not using it, indicated by noIDloss.

We see that in all cases where no latent space structure is imposed, the performance is generally worse than in those with imposed latent space structure. We believe this happens because it is hard to train a forward dynamics predictor on an unstructured latent space. On the other hand, in general all models trained with the inverse dynamics loss perform worse than those trained without it. We think this is most likely because training a model without the inverse dynamics loss is easier than training with it. However, as we will see in Section IV-D, the model trained without the inverse dynamics loss does not provide correct action policies for tactile servoing, as it was not trained to do so.

IV-D Real Robot Experiment

Fig. 22: Snapshots of our experiments executing tactile servoing with the learned model (four snapshots per row). First row, figures (a)-(d): real robot execution by a model trained with the latent space structure losses and the inverse dynamics loss imposed (success). Second row, figures (e)-(h): robot execution in a simulator with real-time sensor feed from the BioTac, by a model trained with the same three losses imposed (success). Third row, figures (i)-(l): robot execution in a simulator with real-time sensor feed from the BioTac, by a model trained with the latent space structure losses but without the inverse dynamics loss (failed). The red marker on the BioTac on the real robot (first row) and the red blob in the simulator (second and third rows) indicate the approximate target/goal locations of the contact point, while the purple blob in the simulator (second and third rows) indicates the current contact point computed from the contact point estimation model in [20]. The yellow and cyan arrows display the angular and translational velocity components, respectively, of the policy computed by the model.
Fig. 23: x-y dimensions of the latent state trajectory during a real robot execution.

To test the trained model's ability to perform tactile servoing on a real robot, we created a velocity-tracking RMP [25] policy that tracks the end-effector velocity produced by the trained model. (Footnote 6: The model outputs the end-effector velocity expressed in the end-effector frame, while the robot only knows how to track the velocity expressed in the base frame; thus we need to invert Eq. 12 to perform tactile servoing.) In Figure 22, we provide snapshots of a robot execution on real hardware and in a simulator fed with real-time tactile sensing from the BioTac finger. The model trained with the latent space structure losses and the inverse dynamics loss imposed (first and second rows of Figure 22) was able to successfully execute tactile servoing and bring the initial contact point with an object (a screwdriver handle) to the target contact point indicated by the red marker/blob. This was achieved by the model providing a velocity policy composed mostly of an angular velocity component (indicated by the yellow arrow), resulting in a rolling motion on the surface of the object. On the other hand, the model trained with the latent space structure losses but without the inverse dynamics loss (third row) failed to produce a tactile servoing behavior: instead of providing a rolling motion, it moved in a translational way, as indicated by the cyan arrow. In Figure 23, we provide the latent space trajectory traversed during the real robot execution, from the start point (the circle mark) to the goal point (the triangle mark). We see that the trajectory resembles a line from the start to the goal, but remains imperfect due to the learning approximation. The experiment can be seen in the video https://youtu.be/5EJSAoUO0E0.

V Discussion and Future Work

In this paper we present a learning-from-demonstration framework for achieving tactile servoing behavior. We showed that our manifold representation of tactile sensing information is critical to the success of our approach. We also showed that, for learning a tactile servoing model, it is important that the model can not only predict the next state from the current state and action (forward dynamics), but also predict the action given the current and next states (inverse dynamics).

In the future, we would like to extend our work to not only track a contact point, but also a contact profile surrounding the contact point. This can be useful to produce interesting behavior such as tactile navigation on the edges of an object.


We thank David Crombecque from the Dept. of Mathematics, University of Southern California, for insightful discussions on mathematical manifolds. We also thank Arunkumar Byravan for discussions on the SE3-Pose-Nets paper, and Kendall Lowrey for help with finishing the BioTac mount on the Yumi robot, both from the University of Washington.