Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data

by Priya Sundaresan, et al.

Robotic manipulation of deformable 1D objects such as ropes, cables, and hoses is challenging due to the lack of high-fidelity analytic models and large configuration spaces. Furthermore, learning end-to-end manipulation policies directly from images and physical interaction requires significant time on a robot and can fail to generalize across tasks. We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation. This facilitates the design of interpretable and transferable geometric policies built on top of the learned representations, decoupling visual reasoning and control. We present an approach that learns point-pair correspondences between initial and goal rope configurations, which implicitly encodes geometric structure, entirely in simulation from synthetic depth images. We demonstrate that the learned representation – dense depth object descriptors (DDODs) – can be used to manipulate a real rope into a variety of different arrangements either by learning from demonstrations or using interpretable geometric policies. In 50 trials of a knot-tying task with the ABB YuMi Robot, the system achieves a 66% knot-tying success rate from previously unseen configurations. See https://tinyurl.com/rope-learning for supplementary material and videos.





I Introduction

Manipulating deformable objects is valuable for a wide variety of applications from surgery and manufacturing to household robotics [superhuman-surgery, SAVED, robotic-heart-surgery, dense-obj-nets, grasp2vec, bed-making, thananjeyan2017multilateral, grasping-deformable-obj, feeling-the-grip, interactive-comp-imaging]. We specifically consider manipulation of rope, whose infinite-dimensional configuration space makes it difficult to build accurate dynamical models. Rope manipulation is also difficult because of significant perception challenges due to self occlusions, loops, and self-similarity [knot-theory]. There has been prior work successfully utilizing finite element models [flexible-objects] and hard-coded representations for deformable manipulation [learning-by-watching, knotting, knot-planning, towel-folding], but these techniques can fail to generalize to novel configurations.

Fig. 1: The robot uses dense depth object descriptors (DDODs), learned from synthetic depth images, to compare its current depth observation to a depth image of the desired configuration and plan actions to guide the rope to the goal configuration. We use this strategy to track video demonstrations of rope manipulation tasks and to define a geometric algorithm that ties knots from previously unseen starting configurations. A ball is added to the rope to break symmetry and enable consistent correspondence mapping. Although we exclusively use depth images for training and recording observations during manipulation, we show color images of the workspace for visual clarity.

These perception and modeling challenges motivate learning-based strategies. Past learning-based approaches have achieved impressive results on a variety of rope manipulation tasks, but require many hours of real-world data collection to learn action-conditioned visual dynamics models of the rope [vision-based-rope-manip, visual-planning-acting, zero-shot-visual-IL]. We address these issues by decoupling perception from planning and control. We learn abstract visual representations of rope by extending the techniques from [visual-descriptors, dense-obj-nets] to learn descriptors for the rope that are invariant across different configurations (Figure 2). We then demonstrate that these representations can be leveraged to create both interpretable (visually intuitive and geometrically structured) and transferable policies (task-agnostic, learned from synthetic images, deployed on real images) for achieving various planar and non-planar rope configurations (Figure 1). Shifting the representational load from the control policy to a separate perception module enables learning to encode information about rope geometry in simulation without real data. Furthermore, because the object descriptors are trained only on images of the rope in different configurations and are agnostic to the actions that generated them, accurate dynamic simulation of the rope is unnecessary.
This paper provides four contributions: (1) a novel approach to achieve complex planar and non-planar rope configurations with a single video demonstration of the task by tracking the learned dense depth object descriptors (DDODs); (2) experiments suggesting that the dense object descriptors from [visual-descriptors, dense-obj-nets], previously applied to learn representations for rigid bodies and slightly deformable objects using real data, can be extended to learning representations for highly deformable objects such as rope using only synthetic depth images; (3) a geometrically-motivated algorithm using DDODs to tie knots from unseen rope configurations; and (4) experiments with an ABB YuMi robot suggesting the learned DDODs can be used to achieve a set of planar/non-planar rope configurations and successful knot-tying in 33/50 trials from previously unseen states.

II Background and Related Work

There is recent work on tracking deformable objects in videos such as [non-rigid-registration, hand-tracking, visual-descriptors, dynamic-fusion, online-deformable-tracking, interactive-comp-imaging]. There is also extensive literature on deformable manipulation [learning-by-watching, knotting, knot-planning, towel-folding, LfD-non-rigid]. We primarily focus on learning-based methods, which have been shown to generalize to a variety of tasks [vision-based-rope-manip, visual-planning-acting, zero-shot-visual-IL]. Due to the challenge of designing accurate analytical models for deformable objects, [vision-based-rope-manip, visual-planning-acting, zero-shot-visual-IL] provide effective learning-based algorithms for rope manipulation by either generating a visual plan or using an existing one from demonstrations, and then executing the plan by generating controls using learned dynamics models given a single video demonstration. However, these methods require tens of hours of real data collection to learn rope dynamics. These approaches also do not impose any geometric structure on the learned visual representations, limiting the interpretability of the learned policies. In contrast, we impose geometric structure on the learned visual representations via DDODs, learn them in simulation, and decouple them from robot actions. This substantially reduces training time and makes it easier to transfer the learned visual representation across domains. We learn geometrically meaningful visual representations for rope by using dense object descriptors, introduced in the context of robotic manipulation by [dense-obj-nets]. While task-agnostic manipulation requires geometric understanding of the objects being manipulated, fine-grained understanding of the object configuration is often unnecessary to effectively grasp or push an object [dexnet1, dexnet2, dexnet3, linear-pushing, grasping-deformable-obj, feeling-the-grip].
We leverage dense descriptors for task-oriented manipulation, which often requires detailed geometric understanding to manipulate objects in the specific ways needed to achieve task success [dense-obj-nets, task-oriented-semantic]. There is extensive literature on generating descriptors for keypoints in images [SIFT, HOG], but these approaches rely on image intensity gradients, which provide little signal in images where the pixel intensities and textures are largely homogeneous, as for a rope. This motivates a deep learning-based approach that utilizes global information about the rope to generate descriptors and correspondences [visual-descriptors, dense-obj-nets, compact-geometric, pose-reg]. visual-descriptors propose a deep learning approach to learn a function that maps pixels corresponding to the same point on an object to the same descriptor and pixels corresponding to different points to different descriptors. dense-obj-nets use these dense object descriptors for task-oriented manipulation of rigid and slightly deformable objects such as stuffed animals. In contrast to prior work, we demonstrate that similar descriptors can be learned and leveraged for manipulation of highly deformable 1D structures such as rope. While [visual-descriptors, dense-obj-nets, compact-geometric] learn descriptors from color image input, we use synthetic depth input, which facilitates sim-to-real transfer of the learned representations [bed-making, dexnet2] and richly encodes the geometric structure of ropes in knotted configurations.

Fig. 2: A visualization of learned descriptors, where the right column images display predicted pixel correspondences (red cursors) relative to the left image source pixels (green cursors) and predicted best match regions (darkened) [dense-obj-nets]. This is generated by applying the learned descriptor mapping independently to both synthetic depth images, computing the pixelwise norm of the differences in descriptor space, and scaling these differences linearly to [0, 255]. The darkened regions can be interpreted as a measure of uncertainty in predicted correspondences. Note that the predicted correspondences are sensitive to self-intersections.

III Simulator

Fig. 3: Rope simulation design. 1) The underlying representation of the rope is a set of 12 Bezier control points (visualized as black points with orange handles). These nodes can be randomly displaced along the x, y, or z axes to produce arbitrary deformation, or can be fixed according to a control polygon to produce structured deformation such as loops, overlaps, and knots. The Bezier curve is of variable length while the rope mesh is of fixed length. The Bezier nodes may become unequally spaced during displacement, as shown in sub-figure 1 where only 7 of the 12 nodes are visible after deformation. 2) The wireframe rope mesh with ordered vertices of known coordinates. 3) A rendered depth map. 4) A visualization of the densely annotated scene with 1,465 pixels corresponding to vertices sampled from the rope mesh in 3). The pixels are colored in a stream to demonstrate the ordering of the dense ground truth annotations in simulation.

We use Blender 2.8 [roosendaal2007blender], an open-source 3D graphics, animation, and rendering suite, to model the rope in simulation and generate synthetic depth training data. Hyperparameter details for the simulation environment are provided in Section IX (Table III). The simulated rope is modelled by twisting four thin cylindrical meshes to produce a realistic braided twine appearance as in [blendertutorial]. A sphere mesh is added at one end to break the symmetry of the rope, which was experimentally shown to reduce ambiguity in descriptor learning. This rope representation consists of a mesh with over fifty thousand ordered vertices of known global coordinates and an underlying Bezier curve with 12 control points (Figure 3). A larger number of control points enables higher manipulation fidelity and a larger configuration space for the rope. Simple configurations consist of purely planar deformations, formed by picking random points along the rope and pulling arbitrarily along the x and y directions. Complex configurations include planar deformations in addition to randomized overlaps, loops, and knots. Producing varied synthetic depth training data requires simulating the rope in a variety of configurations and exporting the relevant ground truth data and rendered image. First, we randomize the positions of a subset of the Bezier control points to produce varied deformations. Next, for a given scene, we export a depth image from the scene's Z-Buffer output along with a mapping that projects sampled mesh vertex world coordinates to pixel coordinates in the synthetic camera frame. A parameter specifies how many pixels to annotate on the image, so a higher value produces denser pixel match sampling between images during training. This raw projection mapping fails to account for complex rope geometries, since multiple mesh vertices can project to the same pixel coordinate at regions of self-intersection or occlusion. Thus, we reparent all pixels in a given region to the top-most mesh vertex in that region using a nearest-neighbor search: given two annotated pixels that fall in the same image region but map to different mesh vertices, we compare the z-coordinates of the corresponding vertex world coordinates, and the exported mapping assigns both pixels to the vertex with the greater z-coordinate, i.e., the top-most one. Pixel matches can then be sampled across images of varying configurations by pairing pixels through their corresponding mesh vertices.
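The occlusion reparenting step described above can be sketched as follows. The function name, data layout, and pixel-radius heuristic are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def reparent_occlusions(pixel_to_vertex, vertex_xyz, radius=2.0):
    """Reassign annotated pixels that collide at self-intersections to the
    top-most mesh vertex in their image neighborhood (sketch of the exported
    mapping fix from Section III; names and the radius are assumptions).

    pixel_to_vertex: dict mapping (u, v) pixel -> mesh vertex index
    vertex_xyz: (V, 3) array of vertex world coordinates (z is height)
    radius: pixel distance within which two annotations are treated as
            covering the same image region
    """
    pixels = list(pixel_to_vertex.keys())
    coords = np.array(pixels, dtype=float)
    fixed = dict(pixel_to_vertex)
    for i, p in enumerate(pixels):
        # all annotated pixels within `radius` of p (its local image region)
        d = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = [pixels[j] for j in np.nonzero(d <= radius)[0]]
        # pick the top-most (largest z) vertex projecting into this region
        top = max((pixel_to_vertex[q] for q in nbrs),
                  key=lambda v: vertex_xyz[v, 2])
        fixed[p] = top
    return fixed
```

With this mapping repaired, two images can be paired for training by matching pixels that reparent to the same vertex index.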

IV Dense Descriptor Learning

Here we describe the procedure for training dense object descriptors for rope manipulation from synthetic depth data. Hyperparameters for descriptor learning are specified in Appendix IX (Table IV).

IV-A Preliminaries

Fig. 4: A visualization of trained, normalized rope descriptors applied to synthetic depth images unseen during training. The first and third images show examples of synthetic depth images of a rope in different configurations. The second and fourth represent the output of the dense correspondence network, where for each pixel on the rope mask, the normalized 3D descriptor vector is visualized as an RGB tuple. The visualizations suggest descriptor consistency across deformations.

We consider an environment consisting of a static flat plane and a braided rope, and learn policies to achieve specific planar and non-planar configurations. We do this by learning a structured visual representation of the rope to estimate point-pair correspondences between an overhead depth image of the rope and a subgoal image. These correspondences are then used to generate interpretable geometric policies which move the rope to better align it with the subgoal. For more details on how the policies are defined, see Section V. For visual representation learning, we build on the work in [visual-descriptors, dense-obj-nets] by learning descriptors from depth images in addition to RGB and extending the framework to a highly deformable object. In dense-obj-nets, representation learning is done by first sampling a variety of points on the surface of a given object. The camera pose is changed via a randomly sampled rigid body transformation, and the sampled points are associated with corresponding points in the new view using standard static scene reconstruction techniques. These correspondences are then used to train a Siamese network [siamese] with a pixelwise contrastive loss to learn the desired embedding space; see [dense-obj-nets] for more details. dense-obj-nets demonstrate that these descriptors can be used to pick up rigid and slightly deformable objects at specific grasp points from multiple views, even when the target grasp is only identified in one view. Unlike [dense-obj-nets], since the rope is not rigid, it is insufficient to simply change the pose of the camera to learn object descriptors for manipulation. Instead, the rope must be manipulated into a variety of different configurations to generate useful correspondences. Since ground truth correspondences are difficult to obtain for a real rope, we leverage simulation to obtain point-pair correspondences, which are then used to learn DDODs. Also unlike [dense-obj-nets], which trains descriptors on RGB images, we train on synthetic depth images.

IV-B Descriptor Learning from Synthetic Depth Images

The training procedure involves sampling a random initial configuration of the rope in simulation and applying a transformation to yield a new configuration. As in dense-obj-nets, the goal is to learn a mapping to a descriptor space in which corresponding points in the two configurations are encouraged to be close together while non-corresponding points are encouraged to be farther apart.
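A minimal numpy sketch of this pixelwise contrastive objective, following the dense-object-nets formulation; the function names and margin value are assumptions, not the paper's hyperparameters:

```python
import numpy as np

def contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """Pixelwise contrastive loss on two descriptor maps of shape (H, W, D).
    `matches` and `non_matches` are lists of pixel index pairs
    ((ua, va), (ub, vb)) between the two images.
    """
    def dists(pairs):
        pairs = np.asarray(pairs)                       # (N, 2, 2)
        a = desc_a[pairs[:, 0, 0], pairs[:, 0, 1]]      # descriptors in image A
        b = desc_b[pairs[:, 1, 0], pairs[:, 1, 1]]      # descriptors in image B
        return np.linalg.norm(a - b, axis=1)

    # matches: pull corresponding descriptors together
    match_loss = np.mean(dists(matches) ** 2)
    # non-matches: hinge loss pushes them apart until they clear the margin
    hinge = np.maximum(0.0, margin - dists(non_matches))
    non_match_loss = np.mean(hinge ** 2)
    return match_loss + non_match_loss
```

In training, the descriptor maps would come from a Siamese network applied to both depth images, with the loss backpropagated through shared weights.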

Fig. 5: Three examples of rope manipulation action sequences the YuMi robot performed by one-shot visual imitation of a demonstrated sequence of observations. Each demonstrated sequence consists of a starting configuration followed by pick-and-place actions performed by a human supervisor to produce a different final state. For each step in the demonstration, the YuMi is given a fixed number of pick-and-place attempts (1 for non-planar sequences, 3 for planar sequences) to produce the next sequential state, moving on early if the IoU of the current workspace image and the goal state exceeds a hand-tuned threshold (0.67). We allow fewer attempts in the non-planar case because we observed that it is more difficult for the robot to recover from poor non-planar actions, which often produce entanglement or particularly pathological configurations, whereas missteps in the planar action sequence are typically less costly since the rope is likely to remain planar and correspondences can be resampled. For a single action, the YuMi executes a greedy policy by grasping the point on the rope in the current image that is farthest from its pixelwise match in the goal image and placing it at that match. Qualitative results suggest the efficacy of the geometric policy defined over the learned descriptors.

We generate planar transforms by randomly translating the coordinates of a subsample of the rope's Bezier knots along the x and y axes to simulate pulling the rope arbitrarily along different directions. We also generate transforms that simulate more complex rope configurations including overlap, loops, and knots by geometrically arranging the control points into the respective control polygons for these configurations as in [ma2003point], and then slightly perturbing the control point positions for variation. We sample a set of corresponding point pairs on the rope between the two configurations. This allows us to sample a wide variety of possible rope deformations, making it easier to generalize to different tasks at test time. Learning in simulation also makes it possible to inject noise to enable robustness to varying experimental conditions, as described in Section IX-A3. Then, we utilize the same training procedure as in [dense-obj-nets] to learn d-dimensional DDODs, where the descriptor dimension d is a hyperparameter that we experimentally vary between 3 and 16.
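The planar deformation step can be sketched as below; the function name, the number of perturbed points, and the displacement magnitude are illustrative assumptions rather than the simulator's exact parameters:

```python
import numpy as np

def perturb_control_points(ctrl, n_perturb=4, scale=0.3, rng=None):
    """Planar deformation sketch: randomly translate a subset of the rope's
    Bezier control points in the x-y plane, leaving z fixed.

    ctrl: (K, 3) array of control point world coordinates.
    """
    rng = np.random.default_rng(rng)
    out = ctrl.copy()
    # choose a random subset of control points to displace
    idx = rng.choice(len(ctrl), size=n_perturb, replace=False)
    # translate only x and y, simulating a planar pull
    out[idx, :2] += rng.uniform(-scale, scale, size=(n_perturb, 2))
    return out
```

Complex (non-planar) configurations would instead fix the control points to a loop or knot control polygon before adding small perturbations.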

V Policy Design

Given the learned descriptors, we design interpretable geometric policies defined over the learned DDODs. We assume that the rope manipulation tasks considered can be performed by a sequence of pick and place actions by a single robotic arm as in prior work [vision-based-rope-manip, zero-shot-visual-IL]. Hyperparameter details regarding manipulation policies are specified in Appendix IX (Table V). We consider two algorithmic policies for rope manipulation tasks:

V-A Algorithm 1: One-Shot Visual Imitation

In this setting, a human demonstrator makes sequential pick-and-place actions to arrange the physical rope into a desired configuration. The robot observes one demonstration as a sequence of images from an overhead depth camera, then takes actions based on a greedy geometric policy. Actions, defined by a start point (grasp) and an end point (drop), are generated by using the frames in the provided demonstration as subgoals and using the descriptors to sparsely estimate point-pair correspondences between points on the current depth image of the rope and the current subgoal, given by a demonstration frame (Figure 5). To find correspondences, we sparsely sample a set of roughly evenly spaced pixels on the rope mask in the current depth image by enforcing the constraint that the inter-pixel distance between any two points is above a margin. For each of the sparsely sampled pixels, we compute its correspondence on the goal image by computing the nearest neighbors in descriptor space and taking the best match to be the median of the associated pixels; we choose the median correspondence due to its robustness to outliers. Then, we find the pair of corresponding points with the highest discrepancy (largest pixel distance between them) and take the following action to align these points in 3D space: the robot grasps the rope at the current-image point of this maximally discrepant pair and places it at the corresponding goal-image point, aligning the farthest-apart points in the image. This process is repeated a fixed number of times for each subgoal image, or until the intersection-over-union (IoU) of the current and goal state image masks exceeds a hand-tuned threshold of 0.67. The IoU is a standard metric across segmentation tasks [mask-rcnn] and provides an indication of the degree of alignment between two masks, which we use to judge the similarity of two rope configurations. We found the IoU to be a noisy measure of alignment between current and subgoal rope masks, and use a relatively low threshold to account for this; the noise is likely caused by the long, thin geometry of the rope, which complicates pixelwise alignment of two otherwise very similar rope configurations.
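The greedy correspondence step can be sketched as follows. The function name, the k-nearest-neighbor median matching, and the flattened descriptor-map layout are illustrative assumptions:

```python
import numpy as np

def greedy_action(desc_cur, desc_goal, cur_pixels, k=5):
    """Greedy pick-and-place selection sketch for Algorithm 1.

    For each sparsely sampled pixel on the current rope mask, find its k
    nearest neighbors in the goal descriptor map, take the coordinate-wise
    median pixel as the robust match, and return the (grasp, drop) pair
    with the largest pixel discrepancy.

    desc_cur, desc_goal: (H, W, D) descriptor maps; cur_pixels: list of (u, v).
    """
    H, W, D = desc_goal.shape
    goal_flat = desc_goal.reshape(-1, D)
    best_pair, best_gap = None, -1.0
    for (u, v) in cur_pixels:
        # descriptor-space distances from this pixel to every goal pixel
        d = np.linalg.norm(goal_flat - desc_cur[u, v], axis=1)
        idx = np.argsort(d)[:k]                     # k nearest neighbors
        uv = np.stack([idx // W, idx % W], axis=1)  # back to pixel coords
        match = np.median(uv, axis=0)               # median is outlier-robust
        gap = np.linalg.norm(match - np.array([u, v], dtype=float))
        if gap > best_gap:
            best_gap = gap
            best_pair = ((u, v), tuple(int(x) for x in match))
    return best_pair  # (grasp pixel, drop pixel)
```

In the full policy this selection would be repeated per subgoal until the IoU termination check fires or the attempt budget is exhausted.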

Fig. 6: To perform knot-tying, we label the centered loop point and endpoint of the rope in a reference image, and define two geometric pick-and-place actions in terms of the relative spacing of these points to generate a knot. To generalize to a new initial loop configuration, we recompute loop and endpoint correspondences and execute the sequence.

V-B Algorithm 2: Descriptor Parameterized Knot-Tying

In this setting, we use a two-action sequence of a knot-tying task from a human demonstrator to parameterize a sequence of motion primitives for knot-tying that generalizes to unseen rope configurations. As in [vision-based-rope-manip], we assume the rope initially contains a single loop. The sequence is annotated with the two pick-and-place actions used to execute the task (Figure 6). The first action involves picking the side of the loop close to the end of the rope without the ball and placing it around the endpoint of the rope. We record the descriptor vectors for the grasp point and the end of the rope and use them to define an action in terms of DDODs. When faced with a new, unseen rope configuration with a loop, the robot grasps the closest point in descriptor space to the grasp point in the reference image and pulls it in the direction of the end of the rope, which is also found by matching with the closest descriptor in the reference frame. The next step involves grasping the end of the rope in the loop and pulling it to tighten the knot. To define this primitive, we record the descriptor vector for the end of the rope in the reference image. When executing this maneuver in a new configuration, the robot detects the end of the rope by finding the closest pixel in descriptor space to the recorded endpoint descriptor, grasps at this point, and pulls to tighten the knot.
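The re-localization used by both primitives reduces to a nearest-descriptor lookup; a minimal sketch, with assumed names and an optional rope-mask restriction:

```python
import numpy as np

def locate_by_descriptor(desc_map, ref_descriptor, mask=None):
    """Find the pixel in a new image whose descriptor is closest to a
    descriptor recorded in the reference demonstration (used to re-localize
    the loop grasp point and rope endpoint in Algorithm 2).

    desc_map: (H, W, D) descriptor map of the new image.
    ref_descriptor: (D,) descriptor recorded in the reference frame.
    mask: optional boolean (H, W) rope mask restricting the search.
    """
    d = np.linalg.norm(desc_map - ref_descriptor, axis=2)
    if mask is not None:
        d = np.where(mask, d, np.inf)   # ignore off-rope pixels
    return np.unravel_index(np.argmin(d), d.shape)
```

Each of the two knot-tying motion primitives is then parameterized by one or two such lookups (grasp point, pull target).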

Fig. 7: Ablations measuring pixel-match error for the learned descriptors in simulation when descriptor dimension, sensing modality, number of correspondences used for training, and occlusion handling method are varied. Results suggest that the learned representations are largely insensitive to small changes in these parameters, with the exception of the ball added to the end of the rope. Asymmetry is critical for good performance: as expected, removing the ball results in a significant deterioration. Furthermore, we note that depth input performs nearly identically to RGB for non-planar configurations, which is consistent with the increased depth variation in non-planar settings.

VI Experiments

VI-A Baseline

We propose an analytical method for acquiring rope correspondences and performing manipulation which we compare against the dense correspondence method. This analytical method is detailed in the Appendix (Section IX) and relies on following the pixel intensity gradient of the rope at local crops to collect annotations along the rope geometry. This method is largely hand-tuned to the rope images observed by the depth camera and lacks robustness to small irregularities in the rope such as fraying and non-uniform thickness. It is also currently unsupported for the non-planar case, as following the gradient of the rope is nontrivial when the rope overlaps on itself. These challenges with designing an analytical baseline motivate the DDOD-based approach.

VI-B Experimental Setup

We use a 4 ft. by 1/2 in. braided white nylon rope with a punctured tennis ball attached to one end to resolve ambiguity between the ends of the rope and to match the appearance of the rope in simulation. We assume access to observations from an overhead depth camera (Photoneo Phoxi 3D Scanner) and visibility of the rope in its entirety (including endpoints) throughout the duration of the task. We further assume a relatively flat background with no distractor objects. In this experimental setup, it is infeasible for the robot to do one-shot visual imitation of the human demonstrator in sequence, since the rope state is changed from its initial configuration by the end of the demonstration. Thus, we also assume that the robot can start from the last recorded demonstration frame and do one-shot visual imitation in reverse to restore the original configuration of the rope. Additional details about the experimental setup are provided in the Appendix (Section IX).

VI-C Simulated Experiments

In simulation, we train the deep network used in dense-obj-nets to learn point-pair correspondences for a variety of rope deformations as described in Section IV, for both the simple and complex configuration tiers. For each network, we train on a set of 3,600 generated synthetic depth and RGB images (roughly 1 hr. of data generation time) and evaluate on a held-out test set of 100 pairs of previously unseen images. Descriptor quality is measured in terms of pixel-match error on the held-out test set as in [dense-obj-nets]. Experiments suggest that the learned descriptors are consistent and able to accurately locate correspondences in images of rope in unseen configurations; Figure 4 shows a few qualitative examples. In Figure 7, we evaluate the quality of the learned descriptors when we vary the sensing modality (synthetic RGB/synthetic depth), the descriptor dimension, and the number of annotated correspondences, and when we ignore or account for occlusions in the non-planar datasets using the method described in Section III. We see that descriptor quality is largely invariant to small changes in descriptor dimension, sensing modality, annotation density, and occlusion handling. For non-planar deformations, the gap in pixelwise error between descriptors trained on RGB and depth data is significantly lower than for planar deformations; given the greater depth variation in these images, depth data is likely richer and more useful in the non-planar case. We also observe the benefit of the added ball for breaking symmetry.
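The pixel-match error metric used for these evaluations can be sketched as below; the function name and array layout are assumptions:

```python
import numpy as np

def pixel_match_error(pred_pixels, true_pixels):
    """Mean Euclidean distance (in pixels) between predicted correspondences
    and ground-truth annotations on a held-out image pair, following the
    evaluation style of dense-obj-nets. Both arguments are (N, 2) arrays
    of pixel coordinates in matching order.
    """
    pred = np.asarray(pred_pixels, dtype=float)
    true = np.asarray(true_pixels, dtype=float)
    return float(np.mean(np.linalg.norm(pred - true, axis=1)))
```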

Type          Subgoal   Trials w/ Improvement   Med. % Improvement
Baseline (P)  0         9/9                     56
Baseline (P)  0         4/9                     -4
Baseline (P)  0         6/9                     42
DDOD (P)      0         28/32                   58
DDOD (P)      1         28/32                   42
DDOD (P)      2         23/32                   33
DDOD (NP)     0         14/21                   30
DDOD (NP)     1         13/21                   4

TABLE I: Physical Experiment Results (Visual Imitation): We report the number of trials that improve with respect to the subgoal-based loss defined in Section VI-D1 for planar (P) and non-planar (NP) visual imitation experiments. We find that even in the non-planar case, the robot makes positive progress in most trials, but note that performance decreases as the task progresses. We also report the median percent improvement of the loss over each subgoal's starting configuration; we report the median because failures cause large negative outlier loss values, skewing the mean. We find that the visual imitation policy using dense object descriptors is able to drive the rope to configurations closer to the target configurations. We observe that performance deteriorates in later subgoals, which we hypothesize is due to compounding errors over time, and that non-planar manipulation is more challenging.

Mode   Explanation                        Count
A      wrong endpoint correspondence      4
B      wrong loop point correspondence    6
C      endpoint occluded after pull       3
D      loop pulled misaligned             3

TABLE II: Classification of Failure Modes for Physical Knot Tying

VI-D Physical Experiments

We evaluate the learned representations for designing rope manipulation policies with an ABB YuMi robot equipped with one parallel jaw gripper. For planar physical experiments, we use a 3-dimensional DDOD network trained on simulated planar configurations with 1,400 labeled correspondences per rendering. For nonplanar manipulation, we use a 16-dimensional DDOD network trained on simulated nonplanar configurations with 557 annotations per rendering. Both networks are trained on noise-injected simulation images (Appendix, Section IX-A3) to enable transfer to the real rope. We use the networks to perform manipulation using the geometric policies from Section V.

VI-D1 Alg 1

We evaluate Algorithm 1 on its ability to track and repeat video sequences of both planar and non-planar rope manipulation, as shown in Figure 5 and Table I. Each planar and non-planar sequence consists of three or four frames respectively, including a starting configuration. For each of the subgoals, the robot executes up to 3 actions for planar experiments or 1 action for non-planar experiments, and proceeds early to the next subgoal if the IoU threshold in Section V-A is met.

Evaluation Metric

To evaluate the agent's ability to track the subgoals in the video sequence, we define a loss function that compares the realized image to the goal image. For each image, a sequence of points along the rope is manually annotated, and a parametric piecewise linear function is fit to the points. The sum of squared errors between the two fitted curves is then computed at evenly spaced points along the curve for a range of shifts and rotations of the first curve, and the minimum over these transformations is returned as the loss. For each subgoal in the demonstration trajectory, the loss is computed for all frames in the segment corresponding to it in the robot trajectory, and we report the percent improvement of the best frame over the segment's starting configuration (Table I).
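Assuming the annotated curve points are sampled at corresponding arc-length positions, this metric can be sketched as follows; the shift and rotation grids are assumptions, not the paper's exact search ranges:

```python
import numpy as np

def alignment_loss(points_a, points_b, shifts, angles):
    """Tracking-metric sketch: minimum sum of squared errors between two
    annotated rope polylines over a grid of candidate rigid transforms
    (shifts and rotations) applied to the first curve.

    points_a, points_b: (N, 2) arrays of evenly spaced annotated points.
    shifts: iterable of (dx, dy) translations; angles: iterable of radians.
    """
    best = np.inf
    for dx, dy in shifts:
        for theta in angles:
            c, s = np.cos(theta), np.sin(theta)
            R = np.array([[c, -s], [s, c]])       # 2D rotation matrix
            moved = points_a @ R.T + np.array([dx, dy])
            best = min(best, float(np.sum((moved - points_b) ** 2)))
    return best
```

Minimizing over rigid transforms makes the loss insensitive to where the rope sits in the workspace, so it scores only configuration shape.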

VI-D2 Alg 2

We evaluate the method in Section V-B on a knot-tying task from previously unseen configurations with the rope starting in a loop. As in prior work [vision-based-rope-manip, zero-shot-visual-IL], we report the success rate of the task by visually inspecting whether a knot was successfully tied. Figure 6 illustrates the knot-tying procedure used. The robot successfully ties a knot in 33/50 trials (66%). This rate is higher than the knot-tying accuracy reported in [vision-based-rope-manip] (38%) and [zero-shot-visual-IL] (60%), and requires weaker supervision, although we do not provide a direct comparison due to differences in experimental setup. Failure modes occur when the robot fails to accurately identify the loop and endpoint correspondences, fails to align the loop over the endpoint, or occludes the endpoint during alignment, preventing task completion (Table II).

VII Discussion and Future Work

This work presents a new method for designing interpretable and transferable policies for rope manipulation by learning a geometrically structured visual representation (DDOD) entirely in simulation, building on the techniques from [dense-obj-nets]. The visual correspondence-driven manipulation policies allow for ease of interpretation and understanding of robotic actions in both a one-shot visual imitation framework and a descriptor-parameterized task setting. We use this representation to design intuitive geometric policies to track planar and non-planar rope deformations from demonstrations and to design a geometric algorithm for knot tying which achieves a 66% success rate. In future work, we will explore learning more complex manipulation primitives in descriptor space, such as suturing. We will also investigate whether the learned descriptors provide appropriate representations for reinforcement learning and for manipulation of 2D deformable objects like cloth.

VIII Acknowledgments

This research was performed at the AUTOLAB at UC Berkeley with partial support from Toyota Research Institute, the Berkeley AI Research (BAIR) Lab and by equipment grants from PhotoNeo. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors. Ashwin Balakrishna is supported by an NSF GRFP. We thank our colleagues who provided helpful feedback, code, and suggestions, especially David Tseng, Aditya Ganapathi, Michael Danielczuk, Jeffrey Ichnowski, and Daniel Seita.

IX Appendix

The appendix is organized as follows:

  • Appendix A contains additional details on the rope simulator

  • Appendix B describes the baseline policy for planar manipulation

  • Appendix C contains additional details on the experimental setup

  • Appendix D specifies hyperparameters for the simulator, descriptor training, and manipulation policies.

IX-A Rope Simulator Details

The rope simulator is implemented in the graphics/rendering engine Blender 2.8 using its Python API. We only use Blender’s rendering capabilities, rather than its dynamic simulation capabilities, to produce varied training data of the rope in different configurations. Blender preserves the ordering of mesh vertices after deformation, so for each image rendered, we export dense, ordered pixel-wise annotations using the world-to-camera transform on queried mesh vertices (Figure 9). Finally, the rendered images are injected with noise to resemble real depth images.
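Blender exposes this projection through `bpy_extras.object_utils.world_to_camera_view`; the numpy sketch below shows the underlying pinhole math for projecting queried mesh vertices to pixel annotations. The intrinsics (fx, fy, cx, cy) and the 4x4 world-to-camera transform are assumed inputs, not values from the paper.

```python
import numpy as np

def project_vertices(verts_world, T_world_cam, fx, fy, cx, cy):
    """Project 3D mesh vertices (world frame) to pixel coordinates using a
    4x4 world-to-camera transform and pinhole intrinsics."""
    v = np.asarray(verts_world, dtype=float)
    v_h = np.hstack([v, np.ones((len(v), 1))])   # homogeneous coordinates
    v_cam = (T_world_cam @ v_h.T).T[:, :3]       # transform into camera frame
    z = v_cam[:, 2]
    u = fx * v_cam[:, 0] / z + cx                # pixel column
    w = fy * v_cam[:, 1] / z + cy                # pixel row
    return np.stack([u, w], axis=1)
```

Because Blender preserves mesh vertex ordering under deformation, projecting the same vertex indices in every rendered frame yields semantically consistent annotations.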

IX-A1 Rope Model

The rope is modeled as a deformable four-stranded braid as in Figure 8, resembling the twisted nylon ropes commonly used in lifting, towing, and pulley applications. A sphere is attached at one end to disambiguate the two ends of the rope.

Fig. 8: Blender rope modeling pipeline, described in [blendertutorial]. 1: Four circles are joined as one mesh. 2: The 'Screw Modifier' is applied to the mesh to produce a helix-like braided appearance. 3–4: The rope is elongated by increasing the 'Screw Length' and 'Iterations' attributes. 5: A Bezier curve is added to the scene. The black points represent control points and the orange segments are handles for the control points; together they determine the shape of the Bezier curve. 6: The Bezier curve is added as a 'Curve Modifier' to the rope mesh along the Z-axis, so that the mesh traverses the curve. 7: The rope is further elongated. 8–10: Control points along the Bezier curve can be freely displaced in 3D space to deform the rope, or can be arranged to produce a specific configuration.

IX-A2 Deformation

Arbitrary planar deformations are generated by randomly displacing the x and y coordinates of a subset of the rope’s Bezier control points. We simulate more complex nonplanar rope configurations by geometrically arranging the control points to yield the desired curvature in the shape of loops and knots, as shown in Figure 10.
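The planar deformation step can be sketched as follows. This is an illustrative stand-in for the Blender calls: in practice the displaced coordinates would be `bezier_points[i].co` on the curve's spline, and the displacement range is an assumed value.

```python
import random

def perturb_control_points(control_points, n_perturb=2, max_disp=1.0, seed=None):
    """Randomly displace the x and y coordinates of a subset of Bezier
    control points to produce a planar deformation. Control points are
    plain (x, y, z) tuples here; z is left untouched (planar motion)."""
    rng = random.Random(seed)
    pts = [list(p) for p in control_points]
    for i in rng.sample(range(len(pts)), n_perturb):
        pts[i][0] += rng.uniform(-max_disp, max_disp)  # displace x
        pts[i][1] += rng.uniform(-max_disp, max_disp)  # displace y
    return pts
```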

Fig. 9: Ground truth annotations. Using the transform between Blender’s world coordinate system and the simulated camera used in rendering, we collect dense, semantically consistent pixel-wise annotations of the rope in various configurations.
Fig. 10: Four Bezier control points are needed to parametrize an overlapping configuration, and six Bezier control points form the control polygon for a knotted configuration. In practice, we slightly randomize the positions of these control points to yield non-uniform nonplanar arrangements.
Fig. 11: Sim-to-Real Processing Pipeline. A raw synthetic depth image is post-processed to look like a real reference depth image (top left) by scaling the pixel range and strategically inpainting black along noise, edge, and gradient masks as described in IX-A3. This post-processing of the simulation images models the noise and black pixel corruption in real depth images along regions of high gradient. A DDOD mapping is trained on these processed simulated images to enable sim-to-real transfer.

IX-A3 Domain Randomization

We leverage several domain randomization and image processing techniques to enable sim-to-real transfer by training on rendered synthetic depth images that are post-processed to match real images. In simulation, we slightly randomize the rope hyperparameters within the ranges specified in Table III (all in Blender standard units) to account for slight dimension mismatch between domains. Additionally, we inject both zero-mean, unit-variance Gaussian noise and Poisson noise into the simulated images to model the noise in real depth images. In real depth images of the rope, corrupted pixels tend to occur along regions of high gradient, particularly on braided rope contours. To model this in the simulated images, we randomly color pixels black along areas of high Laplacian gradient and along edges detected with a Canny edge detector [bao2005canny], modulated by a Perlin noise mask [perlin2002improving]. The Canny edges and Laplacian gradient provide rope contours, and the Perlin noise provides realistic gradient noise. Finally, we rescale the pixel range of the simulated images to match the pixel range of a single reference real depth image. This process is illustrated in Figure 11.
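A simplified version of this post-processing can be sketched in numpy. For brevity, a gradient-magnitude threshold stands in for the Canny, Laplacian, and Perlin masks described above, and the noise parameters and drop probability are assumed values.

```python
import numpy as np

def corrupt_depth(sim_img, real_min, real_max, drop_prob=0.5,
                  grad_thresh=0.1, seed=0):
    """Post-process a synthetic depth image (float array) to resemble a real
    one: rescale to a reference pixel range, inject Gaussian and Poisson
    noise, and blacken pixels along high-gradient regions."""
    rng = np.random.default_rng(seed)
    img = sim_img.astype(float)
    # rescale to the pixel range of a single reference real depth image
    lo, hi = img.min(), img.max()
    img = real_min + (real_max - real_min) * (img - lo) / max(hi - lo, 1e-8)
    # zero-mean, unit-variance Gaussian noise plus Poisson noise
    img = img + rng.normal(0.0, 1.0, img.shape)
    img = img + rng.poisson(1.0, img.shape)
    # blacken pixels along regions of high gradient (stand-in for the
    # Canny / Laplacian / Perlin masks used in the full pipeline)
    gy, gx = np.gradient(sim_img.astype(float))
    edge = np.hypot(gx, gy) > grad_thresh
    drop = edge & (rng.random(img.shape) < drop_prob)
    img[drop] = 0.0
    return img
```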

IX-B Gradient Tracking Details

We describe the implementation of the analytical method for acquiring rope correspondences from Section VI. The method traces along the length of the rope from one endpoint to the other, recording ordered pixel annotations along the way; examining the pixel intensity gradient of the local patch at each step yields the direction of the next step. Given a segmentation mask of the rope in an arbitrary planar configuration, this gradient-based method finds correspondences as an analytical alternative to the descriptor-based approach. The full method consists of three steps: pre-processing the images, acquiring annotations by tracing orthogonally to the gradient along the rope, and matching between a pair of images given a set of ordered annotations for each image.

IX-B1 Image Preprocessing Pipeline

We apply OpenCV inpainting to the input segmentation mask of the rope to fill missing pixels from their neighbors. Next, we use Gaussian blurring followed by binary thresholding to smooth the edges of the rope; this ensures that small frays in the rope do not affect the overall gradient within a crop. To find the starting point on the rope, we use OpenCV Hough circle detection to locate the center of the attached ball. Given this point as a reference, the starting point of the rope is taken to be the nearest neighbor on the rope mask outside a fixed radius from the ball center.

1:procedure preprocess(img, circle_radius)
2:     img ← inpaint(img)
3:     img ← gaussian_blur(img)
4:     img ← binary_threshold(img)
5:     circle_center ← hough_circles(img)
6:     ▷ Find starting point on the rope circle_radius + 5 pixels away from the circle center
7:     angle ← 0
8:     while TRUE do
9:         dx ← (circle_radius + 5) * sin(angle)
10:         dy ← (circle_radius + 5) * cos(angle)
11:         rope_start ← circle_center + [dx, dy]
12:         if img[rope_start] > 0 then
13:             break
14:         angle ← angle + 5
15:     return img, rope_start, circle_center
Algorithm 1 Image Preprocessing Pipeline
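The angle-sweep portion of the preprocessing can be written as runnable Python. The sketch below assumes the angle increments are in degrees and the mask is a 2D array indexed [row][column]; these conventions are not stated in the pseudocode.

```python
import math

def find_rope_start(mask, circle_center, circle_radius, angle_step=5):
    """Sweep angles around the ball center until a rope pixel is found
    circle_radius + 5 pixels away (the starting-point search above)."""
    cx, cy = circle_center
    r = circle_radius + 5
    for angle_deg in range(0, 360, angle_step):
        a = math.radians(angle_deg)
        x = cx + int(round(r * math.sin(a)))
        y = cy + int(round(r * math.cos(a)))
        # stay inside the image; a positive mask value means a rope pixel
        if 0 <= y < len(mask) and 0 <= x < len(mask[0]) and mask[y][x] > 0:
            return x, y
    return None  # no rope pixel found on the sweep circle
```

Bounding the sweep to a full revolution (rather than the pseudocode's unbounded loop) avoids an infinite loop when the rope is not adjacent to the ball.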

IX-B2 Gradient Annotation Algorithm

The algorithm for accumulating annotations by following the pixel intensity gradient is presented in Algorithm 2.

1:procedure gradient(crop)
2:     grad_x, grad_y ← np.gradient(crop)
3:     directions ← [0, 0, 0, 0]
4:     ▷ Find which direction (up, down, right, left) most of the pixel intensity gradients point
5:     for each row i in crop do
6:         for each column j in crop do
7:             temp_grad ← [grad_x[i][j], grad_y[i][j]]
8:             directions ← update(directions, temp_grad)
9:             ▷ Update directions with the direction that temp_grad points
10:     if directions[2] > directions[3] then
11:         final_x ← directions[2]
12:     else
13:         final_x ← -1 * directions[3]
14:     if directions[0] > directions[1] then
15:         final_y ← directions[0]
16:     else
17:         final_y ← -1 * directions[1]
18:     return normalize([final_x, final_y])
19:procedure update_step(direction, curr_x, curr_y, step_size)
20:     dir_x, dir_y ← direction
21:     next_x ← curr_x + dir_x * step_size
22:     next_y ← curr_y + dir_y * step_size
23:     return next_x, next_y
24:procedure follow_grad(img, step_size, crop_size)
25:     img, rope_start, circle_center ← preprocess(img)
26:     curr_x, curr_y ← rope_start
27:     steps_taken ← 0
28:     points ← [ ]
29:     ▷ Follow gradient orthogonally until max number of steps reached (empirically determined, approximates end of rope)
30:     while steps_taken < MAX_NUM_STEPS do
31:         points.append([curr_x, curr_y])
32:         crop ← local_crop(img, curr_x, curr_y, crop_size)
33:         ▷ Take a 2 * crop_size by 2 * crop_size crop of the image centered at (curr_x, curr_y)
34:         grad ← gradient(crop)
35:         direction ← adjust(orthogonal(grad))
36:         ▷ adjust flips direction if it points towards, instead of away from, the previous point
37:         curr_x, curr_y ← update_step(direction, curr_x, curr_y, step_size)
38:         steps_taken ← steps_taken + 1
39:     return points ▷ The final annotations list
Algorithm 2 Gradient Annotation Algorithm

IX-B3 Matching

To acquire correspondences between a pair of images, we first compute their gradient-based annotation lists independently using the gradient annotation algorithm. Then, we subsample each list at a fixed stride proportional to the length of the annotation list, yielding sparse, ordered correspondences.

Fig. 12: Gradient-based baseline correspondence method. Given a raw segmentation mask of the rope, we preprocess it by inpainting the black artifacts and low-pass filtering using OpenCV functions. Next, we locate a starting annotation (blue): we use OpenCV Hough circle detection to find the center of the ball and use this point as a reference to find the start of the rope. We choose this point because the circle is a consistent, localizable feature to find across images using classical methods. From the start point, we take a local crop, compute the pixel intensity gradient of the crop, and take a step in the direction orthogonal to the gradient. We re-sample the crop and repeat for a fixed number of steps until the end of the rope is reached.

IX-B4 Failure Cases

Irregularities in rope thickness that persist after image preprocessing may result in a pixel intensity gradient that is not orthogonal to the direction of the overall rope (Figure 13). When this occurs, the algorithm steps across the width of the rope rather than along its length until it reaches the opposite side. It then continues stepping along the length of the rope, but in the opposite direction, looping backwards so that new annotations overlap with previous ones. The gradient annotation algorithm is also currently unstable for nonplanar configurations of the rope: for crops containing self-occluding regions, it is unclear where the pixel intensity gradient will point, leading to ambiguous annotations at the point of overlap (Figure 14). In future work, we will consider using the depth image gradient (as opposed to the segmentation mask gradient), which may provide richer information in these cases.

Fig. 13: Rope irregularities failure mode. The nylon rope has non-uniform thickness and fraying. This can confuse the gradient direction, causing the annotations to follow along the curve of an irregular bump rather than along the rope geometry.
Fig. 14: Nonplanar failure mode. One failure of the analytical approach is that the gradient is unclear at areas of overlap in the rope. In this case, the gradient does not follow the rope geometry but rather starts collecting annotations in the opposite direction.

IX-C Experimental Details

IX-C1 Physical Experiment Setup

The pipeline for taking raw depth observations from the Photoneo PhoXi 3D Scanner and performing manipulation from these observations is as follows:

  • We acquire a raw point cloud of the rope on top of the surface used in manipulation.

  • Using pre-computed workspace boundaries in the world frame, we pre-process the resulting point cloud by removing points that lie on the manipulation surface or above the rope. This is to ensure that the final image of the workspace has a clean segmentation of the rope, since we did not domain-randomize the background of images seen in training.

  • We project the point cloud from the world frame to the camera frame using a known world-to-camera calibration acquired via a chessboard. This results in a grayscale image of size 772 × 1032.

  • We downscale this image to size 640 × 480 (the dimensions used for images in training) before passing the image through the neural network, and obtain pixel-wise correspondences via matching in descriptor space.

  • We upscale all pixel annotations to match the 772 × 1032 scale of the original image.

  • Finally, the robot end-effector can grasp at a point in world space given the properly scaled pixel annotations using the camera-to-world transformation.
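The down/up-scaling of pixel annotations in the steps above amounts to a simple proportional remapping. The sketch below assumes (width, height) image sizes and (u, v) pixel order; the helper name is our own.

```python
def rescale_annotations(pixels, src_size=(640, 480), dst_size=(1032, 772)):
    """Map (u, v) pixel annotations from the downscaled network input back
    to the original 1032 x 772 depth image resolution."""
    sx = dst_size[0] / src_size[0]   # horizontal scale factor
    sy = dst_size[1] / src_size[1]   # vertical scale factor
    return [(round(u * sx), round(v * sy)) for (u, v) in pixels]
```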

IX-C2 One-Shot Visual Imitation Details

We elaborate on the experimental setup for the one-shot visual imitation policy described in Section V-A.

  • A human demonstrator records a sequence of images separated by one pick-and-place action to deform the rope.

  • To “imitate” the sequence, the robot uses the last recorded image, which captures the workspace after the human demonstration, and sequentially treats each previously recorded image as the next subgoal until the first recorded image is reached.

  • In this fashion, the robot repeatedly “undoes” the actions of the human demonstrator, performing imitation in reverse rather than in sequence. Imitation in sequence would require the human and robot to start from the exact same rope configuration, which is not possible after the human demonstrator has already deformed the rope during the demonstration.

IX-D Hyperparameters

Parameter Range of Values
rope thickness [0.05, 0.065]
rope length [14.3, 15]
coil length (length of braid texture) [12.5, 14]
attached sphere radius [0.35, 0.37]
TABLE III: Rope Simulator Hyperparameters.
Parameter Value(s)
number of training images 3600
learning rate 1.0
learning rate decay 0.9
steps between learning rate decay 250
training iterations 3500
descriptor contrastive margin (M) 0.5
descriptor dimension (3,6,9,16)
number of annotations 500-1600
TABLE IV: Descriptor Training Hyperparameters.
Parameter | Explanation | Value
inter-annotation distance | distance between sparsely sampled pixels on rope mask | 50 px
number of nearest neighbors | number of matches sampled per annotation (median of these is taken as the final correspondence) | 100
IoU threshold | intersection-over-union threshold between current workspace image and goal image that determines when to move to the next subgoal | 0.67
subgoals | number of steps in the demonstration | 4
attempts per subgoal | number of robot actions to reach each subgoal | 3
TABLE V: One-Shot Visual Imitation Hyperparameters.
Fig. 15: Representative examples of both planar and non-planar simulated depth images rendered with Blender, with noise injection to model missing pixels in real images.
Fig. 16: Representative examples of both planar and non-planar real depth images.
Fig. 17: Representative example of six actions in one rollout of the one-shot visual imitation policy (Section VI-D1). The robot attempts to achieve two subgoals in a given demonstration, with three pick-and-place attempts per subgoal. Left column (from top to bottom): actions planned and taken by robot. Right column: two subgoals from human demonstration.