Using visual model-based learning for deformable object manipulation is challenging due to difficulties in learning plannable visual representations along with complex dynamic models. In this work, we propose a new learning framework that jointly optimizes both the visual representation model and the dynamics model using contrastive estimation. Using simulation data collected by randomly perturbing deformable objects on a table, we learn latent dynamics models for these objects in an offline fashion. Then, using the learned models, we use simple model-based planning to solve challenging deformable object manipulation tasks such as spreading ropes and cloths. Experimentally, we show substantial improvements in performance over standard model-based learning techniques across our rope and cloth manipulation suite. Finally, we transfer our visual manipulation policies trained on data purely collected in simulation to a real PR2 robot through domain randomization.
Robotic manipulation of rigid objects has received significant interest over the last few decades, from grasping novel objects in clutter [28, 25, 47, 39, 12] to dexterous in-hand manipulation [22, 2, 59]. However, the objects we interact with in our daily lives are not always rigid. From putting on clothes to packing a shopping bag, we constantly need to manipulate objects that deform. Even seemingly rigid objects like metal wires deform significantly during everyday interactions. As a result, there has been a growing interest in algorithms that can tackle deformable object manipulation [54, 15, 42, 43, 46, 58, 45, 29, 50].
Deformable object manipulation presents two key challenges for robots. First, unlike rigid objects, there is no direct representation of the state. Consider the manipulation problem in the introductory figure, where the robot needs to straighten a rope from a start configuration to any goal configuration. How does one track the shape of the rope? This lack of a canonical state often limits representations to discrete approximations. Second, the dynamics of deformable objects are complex and non-linear. Due to microscopic interactions within the object, even simple objects can exhibit complex and unpredictable behavior, which makes modeling and performing traditional task and motion planning with such deformable objects difficult.
One class of techniques that circumvents the challenges in state estimation and dynamics modeling is image-based model-free learning [13, 44, 26]. For instance, Matas et al., Seita et al., and Wu et al. use model-free methods in simulation for several difficult cloth manipulation tasks. However, without expert demonstrations, model-free learning is notoriously inefficient and often needs millions of samples to learn from. This challenge is further exacerbated in the multi-task learning framework, where the robot needs to learn to reach multiple goals.
Model-based techniques, on the other hand, have shown promise in sample-efficient learning [57, 4, 35]. However, using such model-based learning techniques for deformable objects necessitates tackling the challenges of state representation and dynamics modeling head-on. So how does one learn models given high-dimensional observations and complex underlying dynamics? Some approaches learn complex dynamics models directly in pixel space [19, 8]. Another approach, by Agrawal et al. and Nair et al., learns forward dynamics models in conjunction with inverse dynamics models for manipulating deformable objects. However, during robotic execution, only the inverse model is used. Other model-based approaches, such as Wang et al., train Causal InfoGANs [23, 6] to extract both visual representations and forward models, and use the learned forward models for planning. However, these techniques are not robust due to the training instabilities associated with GANs.
In this paper, we introduce a new visual model-based framework that uses contrastive optimization to jointly learn both the underlying visual latent representations and the dynamics models for deformable objects. We hypothesize that using contrastive methods for model-based learning achieves better generalization and latent space structure due to its inherent information maximization objective. We re-frame the objective introduced in contrastive predictive coding to allow for learning effective model dynamics and latent representations. Once the latent models for representations and dynamics are learned across offline random interactions, we use standard model predictive control (MPC) with one-step predictions to manipulate deformable objects to desired visual goal configurations. Given this controller, we empirically demonstrate substantial improvements over standard model-based learning approaches across multi-goal rope and cloth spreading manipulation tasks. Videos of our real robot runs and reference code can be found on the project website: https://sites.google.com/view/contrastive-predictive-model.
In summary, we present three key contributions in this paper: (a) We propose a contrastive predictive modeling approach to model learning that is compatible with model predictive control. To our knowledge, this is the first use of contrastive estimation for model-based learning. (b) We demonstrate substantial improvements in multi-task deformable object manipulation over other model learning approaches. (c) We show the applicability of our method to real robot rope and cloth manipulation tasks by using sim-to-real transfer without additional real-world training data.
There has been a substantial amount of prior work in the area of robotic manipulation of deformable objects. A detailed survey of past work can be found in Khalil and Payeur and in Henrich and Wörn.
A standard approach to tackling deformable object manipulation is to use deformable object simulations with planning methods. Past work in this domain has focused on simple linear deformable objects [41, 55, 34], creating better simulations, and faster planning. However, the large number of states for deformable objects makes it difficult to plan correctly while being computationally efficient.
Instead of directly planning on the full dynamics, some prior research has focused on planning on simpler approximations, by using local controllers to handle the actual complex dynamics. One approach to using local controllers is model-based servoing [48, 54], where the end-effector is controlled to a goal location instead of explicit planning. However, since the controller is optimized over simple dynamics, it often gets stuck in local minima with more complex dynamics. To solve this, several works [3, 31] have proposed Jacobian controllers that do not need explicit models, while [17, 16] have proposed learning-based techniques for servoing. We note that our proposed work on learning latent dynamics models is compatible with several of these model-based optimization techniques.
Learning good representations remains a difficult challenge in deformable object manipulation. There has been a large amount of prior work on contrastive predictive methods to learn better representations of data. Word2Vec optimizes a contrastive loss to demonstrate semantic and syntactic structure in the learned latent space for words. Oord et al. show that it is possible to learn high-level representations of images, video, and speech data by employing a large number of negative samples. Tian et al. learn high-level representations by encouraging different views of scenes to be embedded close to one another, and further from others, through a similarly framed contrastive loss. Recently, SimCLR, another contrastive learning framework, achieved state-of-the-art results in self-supervised representation learning, bridging the gap with supervised learning.
In this section, we describe our proposed framework for learning deformable object manipulation: Contrastive Forward Modeling (CFM). We begin by discussing the formalism for predictive modeling and contrastive learning. Following that, we discuss our method for learning contrastive predictive models. See the overview figure for our training scheme.
For our problem setting, we consider a fully observable environment with observations $o_t$, actions $a_t$, and deterministic transition dynamics $(o_t, a_t) \mapsto o_{t+1}$. We would like to learn a predictive model to approximate the observation at the next timestep. This can be done by directly learning a visual model in pixel space with regression over observation-action-observation tuples [10, 19]. Once we have successfully learned a predictive model, it is natural to use it for planning to reach different desired goal states, for example, different configurations of a rope or cloth. However, planning directly in pixel space can be difficult, as pixel-value comparisons between images often do not correlate well with the true distances between the underlying states. For example, consider an environment with a ball, where the task is to learn a policy that pushes the ball to the center. If the ball is far from the center, then the predicted next images for all candidate actions would be equidistant from the goal ball-in-center image when comparing pixel values, since there would be no image overlap. Therefore, we consider the framework of planning within a learned latent space obtained by encoding observations. We learn an encoder $g_\theta$ to embed our observations into latent vectors $z_t = g_\theta(o_t)$, coupled with a predictive model in latent space between $z$’s, where our learned predictive model is now formulated as $\hat{z}_{t+1} = f_\phi(z_t, a_t)$. In this work, we propose to learn the latent space using a contrastive learning method.
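The ball example can be checked numerically. The following is a minimal sketch, assuming a toy 32x32 binary image and a squared pixel distance (both are illustrative choices, not the environments or metrics used in the experiments). It shows that every ball position that does not overlap the goal has the same pixel distance to the goal image, regardless of how far the ball actually is from the center.

```python
def ball_image(size, cx, cy, radius=2):
    """Return a size x size binary image with a filled disc centered at (cx, cy)."""
    return [
        [1.0 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]


def pixel_sq_distance(img_a, img_b):
    """Sum of squared per-pixel differences between two equally sized images."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    )


SIZE = 32
goal = ball_image(SIZE, 16, 16)  # goal: ball at the center

# Ball positions at different true offsets from the center; none overlaps the goal.
centers_x = [4, 8, 24, 28]
distances = [pixel_sq_distance(ball_image(SIZE, cx, 16), goal) for cx in centers_x]

for cx, d in zip(centers_x, distances):
    print(f"ball x={cx:2d}, true offset={abs(cx - 16):2d} px, pixel distance={d}")
```

Because the pixel distance carries no information about which position is closer, a planner that ranks candidate actions by this distance cannot distinguish between them.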
In our contrastive learning framework, we jointly learn an encoder $g_\theta$ and a forward model $f_\phi$. We use the InfoNCE contrastive loss described by Oord et al.:

$$\mathcal{L} = -\mathbb{E}\left[\log \frac{h(\hat{z}_{t+1}, z_{t+1})}{h(\hat{z}_{t+1}, z_{t+1}) + \sum_{i=1}^{k} h(\hat{z}_{t+1}, \tilde{z}_i)}\right] \tag{1}$$

where $h$ is some similarity function between the computed embeddings from the encoder, $z_{t+1} = g_\theta(o_{t+1})$ is the embedding of the true next observation, and $\hat{z}_{t+1} = f_\phi(g_\theta(o_t), a_t)$ is the prediction of the forward model. The $\tilde{z}_i$ represent negative samples, which are incorrect embeddings of the next state, and we use $k$ such negative samples in our loss. The motivation behind this learning objective lies in maximizing mutual information between the predicted encodings and their respective positive samples. Within the embedding space, this results in the positive sample pairs being aligned together while the negative samples are pushed further apart, as illustrated in the overview figure. Since we are jointly learning a forward model that seeks to minimize $\|\hat{z}_{t+1} - z_{t+1}\|_2^2$, we use the similarity function:

$$h(z_1, z_2) = \exp\left(-\|z_1 - z_2\|_2^2\right) \tag{2}$$

where the norm is an $\ell_2$-norm. After learning the encoder and dynamics model, we plan using a simple version of Model Predictive Control (MPC), where we sample several actions, run them through the forward model from the current $z_t$, and choose the action that produces the $\hat{z}_{t+1}$ closest (in $\ell_2$-distance) to the goal embedding.
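To make the objective concrete, the following is a minimal sketch of an InfoNCE loss over a batch. It assumes a similarity of the form h(u, v) = exp(-||u - v||^2) with a Euclidean norm, that the positive pair is included in the denominator, and that the other batch elements serve as negatives. The function names `sq_l2` and `info_nce_loss` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def sq_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def info_nce_loss(z_pred, z_next):
    """InfoNCE over a batch with similarity h(u, v) = exp(-||u - v||^2).

    z_pred[i] is the forward-model prediction for sample i and z_next[i] is
    the embedding of its true next observation (the positive). The embeddings
    z_next[j] with j != i act as negatives for sample i.
    """
    total = 0.0
    for i, pred in enumerate(z_pred):
        logits = [-sq_l2(pred, z) for z in z_next]  # log h(pred, z_j)
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(z_pred)


# Three well-separated target embeddings.
z_next = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]

accurate = info_nce_loss(z_next, z_next)                     # predictions hit their targets
mismatched = info_nce_loss(z_next[1:] + z_next[:1], z_next)  # predictions hit the wrong targets
print(f"accurate: {accurate:.4f}, mismatched: {mismatched:.4f}")
```

The loss is close to zero when each prediction lands on its own target and far from the others, and grows when predictions land on the embeddings of other samples.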
In this section, we experimentally evaluate our method in various rope and cloth manipulation settings, both in simulation and in the real world. Our experiments seek to address the following questions:
Do contrastive learning methods learn better latent spaces and forward models for planning in deformable object manipulation tasks?
What aspects of our contrastive learning methods contribute the most to performance?
Can we successfully manipulate deformable objects on a real-world robot?
To simulate deformable objects such as cloth and rope, we used the DeepMind Control platform with MuJoCo 2.0. We use an overhead camera that renders RGB images as input observations for training our method.
We design the following tasks in simulation:
1. Rope: The rope is represented by 25 geoms in simulation with a four-dimensional action space: the first two dimensions are the pixel pick point on the rope, and the last two are the delta direction to perturb the rope. At the start of each episode, the rope’s state is randomly initialized by applying 120 random actions.
2. Cloth: The cloth is represented by a grid of geoms in simulation with a five-dimensional action space: the first two dimensions are the pixel pick point on the cloth, and the last three are the delta direction to perturb the cloth. At the start of each episode, the cloth’s state is randomly initialized by applying random actions. In MuJoCo 2.0, the skin of the cloth can be changed by using images taken of a real cloth.
For both rope and cloth environments, we evaluate our method by planning to a desired goal state image and computing the sum of the pairwise geom distances between the achieved and true goal states. We observe that taking an average of 1000 trials suffices to maintain high-confidence evaluation estimates.
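The evaluation metric can be sketched as follows. This is a minimal illustration that assumes geoms are matched by index between the achieved and goal states and that each geom position is a coordinate tuple; the function name `geom_distance_sum` is hypothetical and the simulator interface is not shown.

```python
import math


def geom_distance_sum(achieved, goal):
    """Sum of Euclidean distances between corresponding geom positions."""
    if len(achieved) != len(goal):
        raise ValueError("achieved and goal states must have the same number of geoms")
    return sum(math.dist(p, q) for p, q in zip(achieved, goal))


# Toy three-geom rope: the last geom is displaced by (3, 4) from its goal position.
achieved = [(0.0, 0.0), (1.0, 0.0), (5.0, 4.0)]
goal = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
error = geom_distance_sum(achieved, goal)
print(error)
```

A lower value means the achieved configuration is closer to the goal, and averaging this value over many trials gives the reported estimate.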
Table I: Quantitative comparison of our method against baselines (including the joint dynamics model and the visual forward model) on rope and cloth tasks, with and without domain randomization (DR).
Since collecting real-world data on robots is expensive, our method seeks to address this problem by collecting randomly perturbed rope and cloth data in simulation. Using random perturbations allows for a diverse set of deformable objects and interactions for learning the latent space and dynamics model. We collect 4000 trajectories of length 50 for rope (200k samples), and 8000 trajectories of length 50 for cloth (400k samples).
To show the substantial improvements of our model over prior methods, we compare our method against several baselines: a random policy, a visual forward model, an autoencoder trained jointly with a latent dynamics model, PlaNet, and a joint dynamics model. In order to ensure that pick points are always on the rope or cloth, we constrain our pick points using a binary segmentation of the observation image computed by RGB thresholding. During planning, all methods use MPC with one-step prediction.
Random Policy: We sample pick actions uniformly over the binary segmentation, and place actions are sampled uniformly at random in a unit square centered around the pick location.
Visual Forward Model: We train a forward model similar to Kaiser et al. to perform modeling and planning purely in pixel space.
Autoencoder: We learn a simple latent space model by jointly training a classical autoencoder with a forward dynamics model. The autoencoder learns to minimize the distance between reconstructed and actual images.
PlaNet: We train PlaNet, a stochastic variant of an autoencoder, as another latent space model. PlaNet models a sequential VAE and optimizes a temporal variational lower bound.
Joint Dynamics Model: We jointly learn a forward and inverse model following Agrawal et al.
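As an illustration of the random policy baseline and the segmentation constraint on pick points, the following is a minimal sketch. The thresholding rule (brightest channel above 0.5), the image representation (rows of RGB tuples in [0, 1]) and the function names are assumptions made for the example; the place offset is drawn from a square of side 1 centered on the pick point, in whatever units the action space uses.

```python
import random


def segment_by_threshold(image, threshold=0.5):
    """Binary mask of an RGB image given as rows of (r, g, b) tuples in [0, 1].

    A pixel counts as foreground when its brightest channel exceeds
    `threshold`; this particular rule is an assumption for the sketch.
    """
    return [[max(pixel) > threshold for pixel in row] for row in image]


def sample_random_action(mask, rng=random):
    """Sample a pick pixel uniformly over the mask and a place offset uniformly
    in a unit square centered on it. Returns (pick_x, pick_y, delta_x, delta_y)."""
    foreground = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not foreground:
        raise ValueError("segmentation mask is empty")
    pick_x, pick_y = rng.choice(foreground)
    delta_x = rng.uniform(-0.5, 0.5)
    delta_y = rng.uniform(-0.5, 0.5)
    return pick_x, pick_y, delta_x, delta_y


# Toy image: a bright horizontal "rope" on row 2 of a dark 5 x 6 background.
dark, bright = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
image = [[bright if y == 2 else dark for x in range(6)] for y in range(5)]
mask = segment_by_threshold(image)
action = sample_random_action(mask, random.Random(0))
print(action)
```

Restricting the pick point to the mask guarantees that every sampled action touches the object, which is the purpose of the segmentation step.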
For consistency across all latent space models, we use the same latent size for both the rope and the cloth environments. For all methods, we sample the same number of candidate one-step trajectories when performing closed-loop planning. See the baseline comparison figure for example trajectories from each baseline in comparison to our method.
We used the same encoder architecture for all models. The encoder is a series of six 2D convolutions, with a Leaky ReLU activation between consecutive convolutional layers. Finally, the output is flattened and fed into a fully connected layer to produce the latent embedding. The forward model is a multi-layer perceptron (MLP) with two hidden layers, which outputs the parameters of a linear transformation on the latent embedding. Specifically for our method (CFM), we use the other batch elements as the negative samples for each positive pair. For PlaNet, following Hafner et al., the decoder architecture is a dense layer followed by transposed convolutions that upscale to the size of the image. The visual forward models follow the same convolutional encoder and decoder architectures as the previous model, with action conditioning implemented in a similar way to Kaiser et al., where actions are processed by a separate dense layer at each resolution, multiplied channel-wise, and broadcast spatially. The images and actions were centered and scaled. We trained all models with batch size 128 using the Adam optimizer. Each model was trained on a single NVIDIA TitanX GPU. All of our simulated environments, evaluation metrics, and training code will be publicly released.
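The forward model structure described above can be sketched as follows. This is an illustrative, untrained implementation using only plain Python lists. The hidden size of 16, the Leaky ReLU activations inside the MLP, the absence of a bias term in the predicted linear map, and the class name `LinearParamForwardModel` are all assumptions made for the example.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


class LinearParamForwardModel:
    """MLP that maps (z, a) to a matrix W and predicts z_next = W z."""

    def __init__(self, z_dim, a_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        self.z_dim = z_dim
        sizes = [z_dim + a_dim, hidden, hidden, z_dim * z_dim]  # two hidden layers
        self.layers = []
        for n_in, n_out in zip(sizes, sizes[1:]):
            bound = 1.0 / math.sqrt(n_in)
            weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                       for _ in range(n_out)]
            self.layers.append((weights, [0.0] * n_out))

    def transition_matrix(self, z, a):
        """Run the MLP on the concatenated (z, a) and reshape its output to d x d."""
        h = list(z) + list(a)
        for index, (weights, bias) in enumerate(self.layers):
            h = dense(h, weights, bias)
            if index < len(self.layers) - 1:  # no activation on the output layer
                h = leaky_relu(h)
        d = self.z_dim
        return [h[row * d:(row + 1) * d] for row in range(d)]

    def predict(self, z, a):
        """Apply the predicted linear map to the current latent."""
        matrix = self.transition_matrix(z, a)
        return [sum(w * zi for w, zi in zip(row, z)) for row in matrix]


model = LinearParamForwardModel(z_dim=4, a_dim=4)
z_next = model.predict([0.1, -0.2, 0.3, 0.0], [0.5, 0.5, 0.1, -0.1])
print(z_next)
```

The weights here are random, so the output only demonstrates the shapes involved: the network is nonlinear in the latent and the action, but the final prediction is a linear function of the latent given the predicted matrix.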
In this section, we compare the results of using our method with those of our baselines, analyzing the advantages and benefits that contrastive models bring over prior methods. Consider a naive baseline where we replace the InfoNCE loss with an MSE loss. This is equivalent to jointly fitting an encoder and dynamics model that minimizes the squared error between the predicted and actual next embeddings, $\|\hat{z}_{t+1} - z_{t+1}\|_2^2$. We can see that the optimal solution is for the encoder to encode all observations to a constant vector to achieve zero loss. To prevent this form of degenerate solution, we are required to regularize our latent space in some way. Both prior methods and contrastive learning do this in different ways, so we analyzed which methods performed better than others. Table I shows the quantitative results comparing our method against baselines in different rope and cloth environments, with and without domain randomization for robot transfer. Note that our method does better on all randomly sampled goals with and without domain randomization, indicating stronger generalization in latent spaces for planning. The baseline comparison figure shows example simulator trajectories for each baseline. Each trajectory has the same starting location, same goal image, and was run for 20 actions.
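The degenerate solution can be demonstrated with a small sketch. It assumes a collapsed encoder that maps every observation to the same vector and a forward model that returns its input unchanged, and it compares a latent MSE objective with an InfoNCE objective that uses the similarity exp(-||u - v||^2) and treats other batch elements as negatives. All function names and the toy transitions are hypothetical.

```python
import math


def constant_encoder(observation):
    """Degenerate encoder: ignores the observation entirely."""
    return [0.0, 0.0]


def identity_forward(z, action):
    """Forward model that returns its input latent unchanged."""
    return list(z)


def sq_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mse_objective(transitions, encoder, forward):
    """Mean squared error between predicted and actual next embeddings."""
    errors = [sq_l2(forward(encoder(o), a), encoder(o_next))
              for o, a, o_next in transitions]
    return sum(errors) / len(errors)


def info_nce_objective(transitions, encoder, forward):
    """InfoNCE with h(u, v) = exp(-||u - v||^2); other batch elements are negatives."""
    preds = [forward(encoder(o), a) for o, a, _ in transitions]
    targets = [encoder(o_next) for _, _, o_next in transitions]
    total = 0.0
    for i, pred in enumerate(preds):
        logits = [-sq_l2(pred, t) for t in targets]
        m = max(logits)
        total += m + math.log(sum(math.exp(l - m) for l in logits)) - logits[i]
    return total / len(preds)


# Toy (observation, action, next observation) tuples; the values are arbitrary.
transitions = [([float(i)], [1.0], [float(i + 1)]) for i in range(8)]

mse = mse_objective(transitions, constant_encoder, identity_forward)
nce = info_nce_objective(transitions, constant_encoder, identity_forward)
print(f"MSE under collapse: {mse}")
print(f"InfoNCE under collapse: {nce:.4f} (log 8 = {math.log(8):.4f})")
```

Under collapse the MSE objective is exactly zero, so nothing discourages the constant encoder, whereas the contrastive objective stays at the logarithm of the batch size because the positive can no longer be told apart from the negatives.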
An autoencoder regularizes its latent space by additionally training a decoder to reconstruct the observation from its latent embedding. The model does well in some scenarios, such as a diagonal, but performs poorly when domain randomization is introduced to allow for transfer to a real robot. This is most likely because the autoencoder is optimized to have pixel-level perfect reconstructions, so features such as lighting and color must be encoded in the latent space even when they are not needed for the task. PlaNet behaves similarly to the autoencoder, as it is also a form of stochastic autoencoder. It performs reasonably competitively with our method but again fails when domain randomization is introduced.
The joint dynamics model regularizes its latent space by jointly learning an inverse model with the forward model. The joint model performed the best across all the baselines when moving to domain randomized data. However, our method still outperforms the joint model for every task.
The visual forward model is the only method that plans in pixel space. It generally performs poorly for tasks with objects with low area coverage, such as the different rope goal orientations, but does better than our method on the cloth flattening task. However, since the forward model operates purely in pixel space, it unsurprisingly suffers from a sharp degradation in performance when introducing domain randomization. As such, it generalizes poorly to the real robot setting.
| Robot Experiments (Intersection in pixels) | Rope (Horizontal) | Rope (Vertical) | Rope (45°) | Rope (135°) | Rope (Squiggle) | Cloth (Flat) |
|---|---|---|---|---|---|---|
| Joint Dynamics Model | 17.722 | 23.636 | 33.631 | 21.267 | 18.311 | 772.303 |
| Contrastive Forward Model (Ours) | 32.827 | 36.387 | 33.891 | 38.952 | 20.711 | 1001.082 |
In this section, we perform an ablation study on our method, examining the impact of architectural designs on performance. We ablate over two aspects of our method: the forward model architecture, and the contrastive similarity function. For the forward model, our method uses a Multi-Layer Perceptron (MLP) that outputs the parameters of a linear function that is then applied to the latent embedding. For the contrastive similarity function, our method follows Equation 2. The quantitative results, measured as the sum of pairwise geom distances between the final and goal images, appear in Table II.
We compare our similarity function with the original InfoNCE similarity function in Oord et al., the log-bilinear similarity function. We achieve the largest boost in performance when switching to our similarity function, as it is more in line with the minimization objective of learning a correct forward model, whereas the log-bilinear model only encourages alignment (as opposed to closeness) of embedding vectors.
We experiment with a few different forward model architectures: linear, a small MLP, and a small MLP that outputs parameters for a linear transformation. As expected, the biggest drop in performance occurs when learning the simpler linear dynamics model, and a slight drop when using an MLP for both rope and cloth tasks. This demonstrates the need for more complex models for latent forward-dynamics learning.
We use a PR2 robot to perform our experiments and an overhead camera looking down on the deformable objects to get the RGB image inputs. To ensure the policy learned in the simulator transfers over to the real world, we apply domain randomization by changing the lighting, texture, friction, damping, inertia, and mass of the object during every training step within the simulator. We also use a pick and place strategy to mimic the same four-dimensional actions within the simulator.
To compute the actions, we employ a model predictive control (MPC) approach, replanning our action at each time step based on the previous image. We segment the rope/cloth against the background to get the list of valid pick locations on the object. We then generate candidate actions by uniformly sampling 100 random deltas and combining them with randomly chosen start locations. We feed these into our forward model along with the encoding of our start image to get the latent encoding for each of the prospective next states. To pick the optimal action, we find the location and delta that minimize the Euclidean distance from these next states to our goal state and return this action to the robot. The delta from the policy is rescaled to pixels in both the x and y coordinates. On the robot side, we use a learned linear mapping to transform from the image’s pixel coordinates to the Cartesian coordinates that the robot uses. To emulate the simulator, the robot’s left arm moves to the start location, moves down and closes the gripper, moves up, moves to the new location, and moves down and opens the gripper, where the height of the gripper is hard-coded to a manually tuned value.
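The one-step planning loop can be sketched as follows. This is a minimal illustration in which the encoder outputs are given directly as latent vectors, the learned forward model is replaced by a toy stand-in, and deltas are sampled from [-1, 1] in each coordinate (an assumed range for the example). The names `plan_one_step` and `toy_forward_model` are hypothetical.

```python
import math
import random


def plan_one_step(z_current, z_goal, pick_locations, forward_model,
                  num_samples=100, rng=random):
    """Sample candidate actions and return the one whose predicted next latent
    is closest in Euclidean distance to the goal latent."""
    best_action, best_distance = None, math.inf
    for _ in range(num_samples):
        pick_x, pick_y = rng.choice(pick_locations)  # restricted to the segmented object
        # The delta range [-1, 1] is an assumption made for this sketch.
        action = (pick_x, pick_y, rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        z_pred = forward_model(z_current, action)
        distance = math.dist(z_pred, z_goal)
        if distance < best_distance:
            best_action, best_distance = action, distance
    return best_action, best_distance


def toy_forward_model(z, action):
    """Stand-in for a learned model: the latent simply shifts by the delta."""
    _, _, dx, dy = action
    return (z[0] + dx, z[1] + dy)


pick_locations = [(3, 4), (10, 12)]
action, distance = plan_one_step(
    z_current=(0.0, 0.0),
    z_goal=(0.5, 0.5),
    pick_locations=pick_locations,
    forward_model=toy_forward_model,
    rng=random.Random(0),
)
print(action, distance)
```

In closed-loop use, the chosen action would be executed, a new image captured and encoded, and the same routine called again at the next time step.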
We use three baselines along with our contrastive method for real-world evaluation. The first is random actions and the others are the two policies that performed the best with domain randomization: the autoencoder and the joint dynamics model. For the rope, all the models are evaluated on five goal states: horizontal, vertical, a straight line at 45°, a straight line at 135°, and a squiggly rope on the left. For the cloth, the models are evaluated on one goal state, a flat blue cloth with no rotation. The metric we use is the intersection in pixels between the segmented final image and the segmented goal image. We prefer this to intersection over union (IOU) because the objects have the same shape, so the union normalization is unnecessary. Additionally, the simpler intersection values provide more insight for comparisons than IOU. The models are run for 40 actions on the rope or 100 actions on the cloth, and the image after each action is stored as an observation. Among all the observations, the one with the highest intersection with the goal is chosen for each method. To account for different seeds, we use 4 starting locations for our contrastive method and 2 starting locations for the baselines, with the scores being averaged across the different start locations. For the cloth, the seed also involves different colors of cloths (blue, gold, white).
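The intersection metric and the best-observation selection can be sketched as follows. This minimal example assumes the segmented images are given as equally sized binary masks (nested lists of booleans); the function names are hypothetical.

```python
def intersection_pixels(mask_a, mask_b):
    """Number of pixels that are foreground in both binary masks."""
    return sum(
        1
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
        if a and b
    )


def best_intersection(observation_masks, goal_mask):
    """Highest intersection with the goal among all stored observations."""
    return max(intersection_pixels(mask, goal_mask) for mask in observation_masks)


# Toy 3 x 4 masks: the goal is a horizontal strip on the middle row.
goal_mask = [[False] * 4, [True] * 4, [False] * 4]
observations = [
    [[True] * 4, [False] * 4, [False] * 4],                 # strip on the wrong row
    [[False] * 4, [True, True, False, False], [False] * 4],  # half of the strip in place
]
score = best_intersection(observations, goal_mask)
print(score)
```

Averaging this score over several starting locations produces non-integer values such as those reported for the robot experiments.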
The specific evaluation metrics are found in Table III, which shows that our model performed the best on all the rope and cloth tasks. The joint dynamics model is the second best and came close to our results on the 45° and squiggle rope tasks. Some example trajectories from our model are seen from a forward view in the introductory figure and from an overhead view in Figure 1. Visual comparisons between our method and baseline methods on the real robot are found in Figure 2. We see that our method more accurately plans towards correct goal states compared to the baselines.
In this paper, we propose a contrastive learning approach for predictive modeling of deformable objects. We show that contrastive learning learns stronger and more plannable latent representations compared to existing methods. Since our method only requires collecting random data in an environment, it allows for easier transfer to real robots without the need for real-world training.
We thank AWS for computing resources. We also gratefully acknowledge the support from Berkeley DeepDrive, NSF, and the ONR PECASE award.
Benchmarking deep reinforcement learning for continuous control. In ICML.
Deep auto-encoder neural networks in reinforcement learning. In IJCNN.
Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. ISER.
Deep imitation learning of sequential fabric smoothing policies. arXiv preprint.
Deep transfer learning of pick points on fabric for robot bed-making. arXiv preprint.