DeformerNet: A Deep Learning Approach to 3D Deformable Object Manipulation

by Bao Thach, et al.

In this paper, we propose a novel approach to 3D deformable object manipulation leveraging a deep neural network called DeformerNet. Controlling the shape of a 3D object requires an effective state representation that can capture the full 3D geometry of the object. Current methods work around this problem by defining a set of feature points on the object or by deforming the object only in 2D image space, which does not truly address the 3D shape control problem. Instead, we explicitly use 3D point clouds as the state representation and apply a convolutional neural network to the point clouds to learn the 3D features. These features are then mapped to the robot end-effector's position using a fully-connected neural network. Once trained in an end-to-end fashion, DeformerNet directly maps the current point cloud of a deformable object, as well as a target point cloud shape, to the desired displacement in robot gripper position. In addition, we investigate the problem of predicting the manipulation point location given the initial and goal shape of the object.




I Introduction

Following Navarro-Alarcon et al. [5], we adopt the terminology shape servo to describe the manipulation task in which a robot manipulates a 3D deformable object into a desired shape. While performing this servo control, the robot estimates the state of the object with visual sensors and uses this as a feedback signal. However, the biggest question is: How do we obtain a good state representation of the object to inform the robot about the object’s shape?

3D in this context refers to triparametric or volumetric objects [12] which have no dimension significantly smaller than the other two, unlike uniparametric (rope) and biparametric objects (cloth).

A series of papers [6, 7, 8] define a set of feature points on the object as the state representation. These methods only work for known objects with distinct texture and cannot generalize to a diverse set of objects. This formulation also controls only the displacements of individual points, which does not fully reflect the 3D shape of the object. For precise control, one must use a large number of feature points, making control highly susceptible to noise and occlusion. Navarro-Alarcon and Liu [5] and Qi et al. [10] leverage 2D image contours as the feedback representations. However, using only 2D data severely limits the space of deformation control in 3D.

Hu et al. [1] address these shortcomings by using point clouds. They use extended FPFH [11] to extract a feature vector from the point cloud and learn the deformation function via a Deep Neural Network (DNN). We show that this hard-coded feature descriptor fails to generalize well.

Previous learning-based methods such as [4, 14, 15] focus on rope and cloth. These works, which often operate in image space, do not transfer directly to 3D object manipulation problems. There are also physical differences between rope or cloth and 3D objects (e.g. a 3D elastic object like soft tissue will return to its initial shape after the robot releases it), which make methods from one domain inapplicable to the other.

Therefore, we propose a novel deep learning approach to solving the 3D shape servo problem. Instead of using the hard-coded feature vector as in [1], we develop a DNN that takes point clouds of the deformable objects as the inputs and outputs feature vectors. In addition, we develop a DNN that maps these feature vectors to the desired end-effector’s Cartesian position. We train both the feature extraction and deformation control neural networks together in an end-to-end fashion. Once trained, given point clouds of the object’s current and goal shapes, the robot computes the desired position of its gripper at every time step. Finally, we study the problem of predicting the manipulation point. We are the first to propose a solution to this problem for 3D shape servoing.

II Shape Servoing

Fig. 1: (Top) Architecture of DeformerNet; (Bottom) architecture of the feature extraction module.

We define a set of manipulation points located on the surface of the deformable object where the robot gripper makes contact with the object and can directly control their positions. We formulate the shape servoing problem as finding a deformation function $F$ which maps the state representation $\sigma$ of the deformable object to the manipulation points’ positions $p$. The intuition here is that we can drive the object to a desired shape by controlling the locations of these manipulation points: $p = F(\sigma)$.

We choose $\sigma$ to be a feature vector which encodes the shape of the deformable object. We leverage a DNN derived from PointConv [13] which takes the point cloud of the current object shape and that of the goal object shape as the inputs and learns a 256-dimensional feature vector. Figure 1 visualizes the network architecture of our feature extraction network.

For closed-loop operation, we modify the original equation to reflect the position and feature displacements: $\Delta p = F(\Delta\sigma)$, where $\Delta\sigma = \sigma_g - \sigma_c$ is the difference between the feature vector of the goal shape and that of the current object shape, and $\Delta p = p_d - p_c$ is the relative displacement between the desired and current manipulation points. $\Delta p$ is also the Cartesian displacement of the robot end-effector since the robot directly controls the positions of the manipulation points.

We define this deformation function as a fully-connected neural network. The goal of the shape servoing problem then becomes learning a model $F$ which maps the feature vector displacement $\Delta\sigma$ to the desired displacement of the manipulation points $\Delta p$. Given the learned $F$, we can now compute the desired end-effector position at every time step. The entire network architecture is shown in Fig. 1.

At runtime, the robot is given the point cloud of the current object shape and the point cloud of the goal object shape. First, we pass the point clouds through DeformerNet which outputs the desired gripper position. We then use the RRT-connect motion planner to plan a joint space path following the desired end-effector displacement. After the robot has reached an intermediate desired position, it gets a new point cloud of the current object shape and repeats the process.
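The runtime loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three callables (`observe_point_cloud`, `predict_displacement`, `move_gripper_by`) are hypothetical stand-ins for the camera, the trained DeformerNet, and the motion planner plus execution, and the stopping rule (`max_steps`, `tolerance`) is an assumption, since the text does not state a termination criterion.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]
Cloud = Sequence[Point]


def shape_servo(
    goal_cloud: Cloud,
    observe_point_cloud: Callable[[], Cloud],                 # hypothetical sensor call
    predict_displacement: Callable[[Cloud, Cloud], Point],    # hypothetical network call
    move_gripper_by: Callable[[Point], None],                 # hypothetical plan + execute
    max_steps: int = 50,       # assumption: the text gives no step limit
    tolerance: float = 1e-3,   # assumption: stop once the commanded move is tiny
) -> int:
    """Run the sense, predict, move loop and return the number of moves executed."""
    for step in range(max_steps):
        # 1. Get a fresh point cloud of the current object shape.
        current_cloud = observe_point_cloud()
        # 2. Ask the model for the desired gripper displacement.
        delta_p = predict_displacement(current_cloud, goal_cloud)
        # 3. Stop if the commanded displacement is negligible.
        if math.sqrt(sum(c * c for c in delta_p)) < tolerance:
            return step
        # 4. Otherwise move the gripper and repeat.
        move_gripper_by(delta_p)
    return max_steps
```

Passing the perception, prediction, and motion steps in as functions keeps the loop itself independent of any particular simulator or planner.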

Prior works [1, 6, 7, 8] assume that the manipulation points are given to the robot. However, a robot should select the best possible points to achieve the task. To understand the importance of selecting a good manipulation point, consider a simple scenario in Fig. 2. The leftmost image describes the goal shape; if the robot grasps the object as in the second image, it can successfully deform the object to a shape very close to the goal shape (third image). However, with the manipulation point shown in the fourth image, the shape servo task now becomes impossible (rightmost image).

We formulate the manipulation point $p$ as a function of the visual representations of the object: $p = G(P_c, P_g)$, where $P_c$ is the current point cloud and $P_g$ is that of the goal shape. We propose two methods for deriving this function $G$.

II-1 Using Keypoint Detection Heuristic

We use an unsupervised keypoint detection algorithm derived from the method in [2] to identify a set of $K$ keypoints on the current point cloud $P_c$ and the goal point cloud $P_g$. We define a displacement $d_i$ for each keypoint $i = 1, \dots, K$, which measures how far the keypoint moves from the initial to the goal point cloud. We estimate the manipulation point location as the weighted average of the top keypoints that displace the most, with weights equal to the displacements of the keypoints.
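A minimal sketch of the displacement-weighted average described above, in plain Python. It rests on several assumptions that the text does not confirm: keypoint `i` in the initial cloud corresponds to keypoint `i` in the goal cloud, displacement is the Euclidean distance between the two, the average is taken over the initial-cloud positions, and the number of top keypoints is a free parameter (`top_n`). The function name is hypothetical, and the keypoint detector itself is not shown.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def estimate_manipulation_point(
    initial_keypoints: Sequence[Point],
    goal_keypoints: Sequence[Point],
    top_n: int,
) -> Point:
    """Displacement-weighted average of the top_n most displaced keypoints."""
    if len(initial_keypoints) != len(goal_keypoints):
        raise ValueError("keypoint sets must have the same length")
    # Assumed displacement measure: Euclidean distance between matched keypoints.
    displacements = [
        math.dist(a, b) for a, b in zip(initial_keypoints, goal_keypoints)
    ]
    # Indices of the keypoints that move the most, largest first.
    ranked = sorted(
        range(len(displacements)), key=lambda i: displacements[i], reverse=True
    )
    chosen = ranked[:top_n]
    total = sum(displacements[i] for i in chosen)
    if total == 0.0:
        raise ValueError("no keypoint moved; the weights are all zero")
    # Weighted average of the chosen keypoints' initial positions.
    return tuple(
        sum(displacements[i] * initial_keypoints[i][axis] for i in chosen) / total
        for axis in range(3)
    )
```

For example, with keypoints at x = 0, 1, 2 that move by 0, 1, and 3 units respectively and `top_n=2`, the estimate lands at x = (3·2 + 1·1) / 4 = 1.75, closer to the keypoint that moved the most.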

II-2 Using PointConv

We learn the manipulation point function $G$ as a regression problem using the same architecture as DeformerNet (Fig. 1). We modify the model to output the Cartesian position of the manipulation point instead of displacement.

Fig. 2: Importance of manipulation point (MP) selection. Leftmost: goal shape; Red: successful MP; Purple: failed MP.

III Experiments and Results

Fig. 3: Left two columns: initial and goal shapes. Center columns: manipulation points predicted by keypoint method (left) and by PointConv (right); Blue spheres are the predicted MPs and green spheres are the ground truth. Right columns: shape servo results using MPs predicted in the center column.

In the Isaac Gym environment [3], we use the bimanual daVinci surgical robot to manipulate a box object. One arm grasps one end of the object and holds it in place, while the other deforms the object into a desired shape. We created a training dataset by randomly sampling 150 pairs of object initial configurations and manipulation points. For each pair, the robot deforms the object to 200 random shapes.

Figure 3 shows the experimental results of our manipulation point selection algorithm using the two methods. We ran the keypoint method and the PointConv method on 100 test cases; the average Euclidean distance between the predicted manipulation point and the ground truth was 0.035 m and 0.036 m, respectively.

We evaluate the performance on 4 different pairs of initial-goal object shapes. Figure 3 visualizes the shape servo results. We use Chamfer distance as the evaluation metric to describe how close the final object point cloud is to the goal point cloud. Table I shows the final Chamfer distance in each scenario, using the manipulation points predicted by the two methods.
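For readers unfamiliar with the metric, a brute-force Chamfer distance can be sketched in plain Python as below. Several variants of this metric are in use (squared or unsquared distances, sum or average of the two directions), and the text does not say which one is reported. This sketch assumes the sum of the two directional mean nearest-neighbor Euclidean distances, so it should not be read as the exact formula behind the reported numbers.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest_distance(source: Sequence[Point], target: Sequence[Point]) -> float:
    """Average distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(cloud_a: Sequence[Point], cloud_b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed variant: sum of both directions)."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")
    return _mean_nearest_distance(cloud_a, cloud_b) + _mean_nearest_distance(
        cloud_b, cloud_a
    )
```

The nested loop costs O(N·M) distance evaluations, which is fine for illustration; clouds with thousands of points are usually handled with a spatial index or a GPU implementation.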

          Keypoint method    PointConv method
Case 1    0.23               0.18
Case 2    0.32               0.39
Case 3    0.26               0.53
Case 4    0.22               0.52
TABLE I: Shape servo results in Chamfer distance [m]

We compare our DeformerNet with [1], the current state-of-the-art work for learning-based 3D shape servoing. We show that the hard-coded feature vector limits the deformation function model to only learn a small set of shapes. In contrast, using learned feature vectors, DeformerNet can fit a large number of shapes and hence outperforms the method from [1].

Hu et al. [1]’s model underfits the data and results in very high train and test MSE losses (41.2 and 125.1 mm). Thus, the controller performs poorly and leads to very high final Chamfer distances in all 4 test cases, even with the ground-truth manipulation point. In contrast, the losses of our DeformerNet are close to zero (1.5 and 2.4 mm). Furthermore, when we train Hu et al.’s model on a dataset with only a few shapes, the resulting MSE losses and final Chamfer distances become much smaller. This indicates that while the previous method works well when trained on only a small set of shapes, it fails to generalize on our full dataset.



III-A Additional Experiment Visualizations

Due to space limits, we only present the final shape servo results in the main text. Here, for each shape servo test case mentioned in the experiment, we provide key frames of the whole manipulation sequence (Fig. 5). In addition, Fig. 6 shows the Chamfer distance between the current object point cloud and the goal object point cloud over time when running our DeformerNet controller.

Figure 4 shows the experimental setup. We use the bimanual daVinci surgical robot to manipulate a soft box object which mimics human tissue. The left arm grasps one end of the object and holds it in place, while the right arm deforms the object into a desired shape.

Fig. 4: Experimental setup of the two-armed daVinci surgical robot in the Isaac Gym simulator [3].

III-B DeformerNet Details

As shown in Fig. 1, DeformerNet consists of two stages: feature vector extraction and deformation control inference. In the first stage, we perform convolution on 3D point clouds to extract feature vectors representing the states of the object shapes. This stage takes two point clouds as the inputs: one of the current object shape and one of the goal object shape. The output of this stage is two 256-dimensional feature vectors.

We then subtract one feature vector from the other and feed the result to the second stage. The deformation control inference stage takes this 256-dimensional differential feature vector and passes it through a series of fully-connected layers (128, 64, and 32 neural units, respectively). The fully-connected output layer produces the 3D manipulation point displacement. Note that this is also equivalent to the robot’s end-effector Cartesian position displacement since the robot directly controls the position of the manipulation point. We use a ReLU activation function and batch normalization for all convolutional and fully-connected layers except for the output layer.
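The second stage described above can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions rather than the trained model: the weights are random placeholders, batch normalization is omitted for brevity, the two 256-dimensional feature vectors are taken as given (the PointConv-based extractor is not shown), and the names `init_layers` and `predict_displacement` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Layer widths from the text: 256-d input, hidden layers of 128, 64, 32, 3-d output.
LAYER_SIZES = [256, 128, 64, 32, 3]

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])


def init_layers(seed: int = 0) -> List[Layer]:
    """Create placeholder weights; a real model would load trained parameters."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict_displacement(
    goal_feature: Sequence[float],
    current_feature: Sequence[float],
    layers: List[Layer],
) -> List[float]:
    """Map the feature difference (goal minus current) to a 3D displacement."""
    x = [g - c for g, c in zip(goal_feature, current_feature)]
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x
```

With these zero-initialized biases and no batch normalization, identical goal and current features yield a zero displacement; the trained network with batch normalization need not have this property exactly.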

We use the standard mean squared error loss function for training our DNN. We adopt the Adam optimizer with a decaying learning rate that decreases by 1/10 every 50 epochs.
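A step-decay schedule of this kind can be written in a few lines. The sketch below reads "decreases by 1/10" as multiplying the rate by 0.1 at each step, which is an assumption; the initial learning rate is left as a parameter (`initial_lr`) because its value is not available here, and the function name is hypothetical.

```python
def learning_rate(
    epoch: int,
    initial_lr: float,        # placeholder: the starting value is not given here
    step_epochs: int = 50,    # from the text: decay every 50 epochs
    factor: float = 0.1,      # assumption: each decay multiplies the rate by 1/10
) -> float:
    """Return the learning rate to use at a given (zero-based) epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * factor ** (epoch // step_epochs)
```

Under this reading, epochs 0 to 49 use the initial rate, epochs 50 to 99 use one tenth of it, and so on.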

III-C Experiment Details

Partial point clouds are generated and segmented out from the robot and background using the depth camera available inside the Isaac Gym environment. We sample 2048 points on each object point cloud using the Furthest Point Sampling method from [9]. For the Keypoint Detection Heuristic method, we use 200 keypoints on each point cloud. The physical properties of the object used in the experiment are: Young’s modulus = 1000 Pa, Poisson’s ratio = 0.3.
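The sampling step can be illustrated with a straightforward greedy version of furthest point sampling in plain Python. This is a generic sketch of the technique, not the implementation from [9]; the fixed `start_index` is an assumption made for determinism (implementations often pick the first point at random), and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def furthest_point_sampling(
    points: Sequence[Point], num_samples: int, start_index: int = 0
) -> List[int]:
    """Greedily pick indices so each new point is furthest from those already chosen."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    if not 0 <= start_index < len(points):
        raise ValueError("start_index is out of range")
    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        # Update nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Compared with uniform random subsampling, this greedy rule spreads the retained points more evenly over the surface, at a cost of O(N) distance updates per selected point.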

(a) Case 1 shape servo sequence
(b) Case 2 shape servo sequence
(c) Case 3 shape servo sequence
(d) Case 4 shape servo sequence
Fig. 5: Additional visualizations of the robot performing shape servoing to a variety of target shapes. The sparse red clouds visualize the target shapes of the object.
Fig. 6: Chamfer distance between the current object point cloud and the goal object point cloud over time. From left to right: cases 1, 2, 3, & 4 respectively.