Over the last few decades, we have seen tremendous progress in robotic manipulation. From grasping objects in clutter [56, 45, 31, 27, 12] to dexterous in-hand manipulation of objects [1, 69], modern robotic algorithms have transformed object manipulation. But much of this success has come at the price of a key assumption: rigidity of objects. Most robotic algorithms require, implicitly or explicitly, strict rigidity constraints on objects. But the objects we interact with every day, from the clothes we put on to the shopping bags we pack, are deformable. In fact, even ‘rigid’ objects deform under different shape factors (like a metal wire). Because of this departure from the ‘rigid-body’ assumption, several real-world applications of manipulation fail. So why haven’t we created equally powerful algorithms for deformable objects yet?
Deformable object manipulation has been a long-standing problem [65, 15, 59, 32, 55], with two unique challenges. First, in contrast with rigid objects, there is no obvious representation of state. Consider the cloth manipulation problem in Fig. 1(a). How do we track the shape of the cloth? Should we use a raw point cloud, or fit a continuous function? This lack of a canonical state often limits state representations to discrete approximations. Second, the dynamics are complex and non-linear. Due to microscopic interactions in the object, even simple-looking objects can exhibit complex and unpredictable behavior. This makes it difficult to model them and to perform traditional task and motion planning.
, where robotic algorithms can reason about interactions directly from raw sensory observations. This can alleviate the challenge of state estimation for deformable objects, since we can learn directly from images. Moreover, since these methods do not require an explicit model of the object, they can overcome the challenge of complex deformable object dynamics. But model-free learning has notoriously poor sample complexity. This has limited the application of learning to settings where human demonstrations are available [41, 33]. In concurrent and independent work, [54] has shown how simulated demonstrators can be used to learn manipulation strategies to spread out a cloth.
In this work, we tackle the sample-complexity issue by focusing on an often ignored aspect of learning: the action space. Inspired by [16, 5], we start by using an iterative pick-place action space, where the robot can decide which point to grasp (or pick) and to which point it should drop (or place). But how should one learn with this action space? One option is to directly output both the pick point and place location for the deformable object. But the optimal placing location is heavily correlated with picking location, i.e. where you place depends heavily on what point you pick. This conditional structure makes it difficult to simultaneously learn without modeling this aspect of the action space.
To solve this, we propose a conditional action space, where the output of the picking policy is fed as input into the placing policy. But this leads us to a second problem: the placing policy is constrained by the picking policy. When learning starts, the picking policy often collapses into a suboptimal restrictive set of pick points. This inhibits the exploration of the placing policy, since the picking points it takes as input are only from a restrictive set, and results in a suboptimal placing policy. Now, since the rewards for picking come after the placing is executed, the picking policy receives poor rewards and results in inefficient learning. This illustrates the chicken and egg problem with conditional action spaces. Learning a good picking strategy involves having a good placing strategy, while learning a good placing strategy involves having a good picking strategy.
To break this chicken-and-egg loop, we learn the placing strategy independently of the picking strategy. This allows us both to learn the placing policy efficiently and to use the learned placing value approximator to inform the picking policy. More concretely, since the value of the placing policy is conditioned on the pick point, we can find the pick point that maximizes the value. We call this picking policy Maximum Value of Placing (MVP). During training, the placing policy is trained with a random picking policy. However, during testing, the MVP picking policy is used. Through this, we observe a significant speedup in convergence on three difficult deformable object manipulation tasks on rope and cloth objects. Finally, we demonstrate how this policy can be transferred from a simulator to a real robot using simple domain randomization without any additional real-world training or human demonstrations. Videos of our PR2 robot performing deformable object manipulation along with our code can be accessed on the project website: https://sites.google.com/view/alternating-pick-and-place.
In summary, we present three key contributions in this paper: (a) we propose a novel learning algorithm for picking based on the maximal value of placing; (b) we show that the conditional action space formulation significantly accelerates the learning for deformable object manipulation; and (c) we demonstrate transfer to real-robot cloth and rope manipulation.
II Related Work
II-A Deformable Object Manipulation
Robotic manipulation of deformable objects has had a rich history that has spanned different fields from surgical robotics to industrial manipulation. For a more detailed survey, we refer the reader to [23, 15].
Motion planning has been a popular approach to tackle this problem, where several works combine deformable object simulations with efficient planning. Early work [49, 66, 39] focused on using planning for linearly deformable objects like ropes. Other works developed methods for fully deformable simulation environments and for faster planning with deformable environments. One of the challenges of planning with deformable objects is the large number of degrees of freedom, and hence the large configuration space involved when planning. This, coupled with the complex dynamics, has prompted work in using high-level planners or demonstrations and local controllers to follow the plans.
Instead of planning on the full complex dynamics, we can plan on simpler approximations, but use local controllers to handle the actual complex dynamics. One way to use local controllers is model-based servoing [57, 65], where the end-effector is locally controlled to a given goal location instead of explicit planning. However, since the controllers are optimized over simpler dynamics, they often get stuck in local minima with more complex dynamics. To remove this dependency on a model, several works [2, 35, 42] have looked at Jacobian-approximated controllers that do not need explicit models, while [18, 17] have looked at learning-based techniques for servoing. However, since the controllers are still local in nature, they are still susceptible to reaching globally suboptimal policies. To address this, [36] interleaves planning with local controllers. Although this produces better behavior, transferring it to a robot involves solving the difficult state-estimation problem [51, 52]. Instead of a two-step planner and local controller, we propose to directly use model-free visual learning, which should alleviate the state-estimation problem while working with the true complex dynamics of the manipulated objects.
II-B Reinforcement Learning for Manipulation
Reinforcement Learning (RL) has made significant progress in many areas such as robotics. RL has enabled robots to handle unstructured perception such as visual inputs and reason about actions directly from raw observations. It has been shown to solve manipulation problems such as in-hand block manipulation [1, 46], object pushing, and valve rotation with a three-fingered hand. However, these algorithms have not yet seen wide applicability to deformable object manipulation. This is primarily because learning is inefficient with complex dynamics, which we address in this work.
Over the last few years, deformable object manipulation has also been studied in reinforcement learning [41, 25, 33, 67, 54]. However, many of these works [25, 33] require expert demonstrations to guide learning for cloth manipulation. These expert demonstrations can also be used to learn wire threading [34, 50]. In concurrent work, [54] shows that instead of human demonstrators, a simulated demonstrator using state information can be used to obtain demonstrations. Other works that do not need demonstrations for training require them at test time. We note that since our conditional action spaces and MVP technique can be applied to any actor-critic algorithm, they are complementary to most methods that learn from expert demonstrations.
III-A Reinforcement Learning
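We consider a continuous Markov Decision Process (MDP), represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, r, \gamma, \rho_0)$, with continuous state and action spaces, $\mathcal{S}$ and $\mathcal{A}$, and a partial observation space $\mathcal{O}$. $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ defines the transition probability of the next state $s_{t+1}$ given the current state-action pair $(s_t, a_t)$. For each transition, the environment generates a reward $r(s_t, a_t)$, with future reward discounted by $\gamma$.
Starting from an initial state $s_0$ sampled from distribution $\rho_0$, the agent takes actions $a_t$ according to policy $\pi(a_t \mid s_t)$ and receives reward $r_t = r(s_t, a_t)$ at every timestep $t$. The next state $s_{t+1}$ is sampled from the transition distribution $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The objective in reinforcement learning is to learn a policy that maximizes the expected sum of discounted rewards $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$. In the case of a partially observable model, the agent receives observations $o_t$ and learns $\pi(a_t \mid o_t)$.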
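As a small illustration of the objective above, the discounted sum of rewards for one finite episode can be computed as in the following sketch. The function name and the reward values are hypothetical and chosen only for illustration; they are not taken from our experiments.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate backwards so each step applies one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative rewards only (hypothetical values).
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

A policy that maximizes the expectation of this quantity over episodes is what the RL objective asks for.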
III-B Off-Policy Learning
On-policy reinforcement learning [53, 22, 68] iterates between data collection and policy updates, and hence requires new on-policy data at every iteration, which tends to be expensive to obtain. On the other hand, off-policy reinforcement learning retains past experiences in a replay buffer and is able to re-use past samples. Thus, in practice, off-policy algorithms have achieved significantly better sample efficiency [14, 24]. Off-policy learning can be divided into three main categories: model-based RL, Actor-Critic (AC), and Q-learning. In model-based RL, we learn the dynamics of the system. In the AC framework, we learn both the policy (actor) and value function (critic). Finally, in Q-learning we often learn only the value function, and choose actions that maximize it.
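The replay buffer mentioned above can be sketched as a fixed-capacity queue of transitions that is sampled uniformly at random. This is a minimal, generic illustration; the class name, the transition layout, and the capacity are assumptions and do not describe the code base used in our experiments.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical transition layout: (observation, action, reward, next_observation, done).
Transition = Tuple[list, list, float, list, bool]


class ReplayBuffer:
    """Fixed-capacity store of past transitions for off-policy updates."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A bounded deque silently evicts the oldest transition when full.
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement from everything stored so far.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(([float(step)], [0.0], 1.0, [float(step + 1)], False))
print(len(buffer))  # only the 3 most recent transitions are kept
```

Because samples are drawn from stored experience rather than from the current policy, the same transition can contribute to many gradient updates.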
In this work, we consider the actor-critic framework since it is the most suitable for continuous control, as well as data-efficient and stable. Recent state-of-the-art actor-critic algorithms such as Twin Delayed DDPG (TD3) [10] and Soft Actor-Critic (SAC) [13] show better performance than prior off-policy algorithms such as Deep Deterministic Policy Gradient (DDPG) [29] and Asynchronous Advantage Actor-Critic (A3C) [38]. The gains come from variance reduction in TD3, which uses a second critic network to reduce over-estimation of the value function, and from an additional entropy term in SAC that encourages exploration. In this work, we use SAC since its empirical performance surpasses TD3 (and other off-policy algorithms) on most RL benchmark environments. However, our method is not tied to SAC and can work with any off-policy learning algorithm.
We now describe our learning framework for efficient deformable object manipulation. We start with the pick and place problem. Following this, we discuss our algorithm.
IV-A Deformable Object Manipulation as a Pick and Place Problem
We look at a more amenable action space while retaining the expressivity of the general action space: pick and place. The pick and place action space has had a rich history in planning with rigid objects [5, 30]. Here, the action space is the location to pick (or grasp) the object, $a_{pick}$, and the location to place (or drop) the object, $a_{place}$. This operation is done at every step $t$, but we will drop the superscript for ease of reading. With rigid objects, the whole object hence moves according to $a_{pick} \rightarrow a_{place}$. However, for a deformable object, only the point corresponding to $a_{pick}$ on the object moves to $a_{place}$, while the other points move according to the kinematics and dynamics of the deformable object. Empirically, since in each action the robot picks and places a part of the deformable object, there is significant motion in the object, which means that the robot gets a more informative reward signal after each action. Also note that this setting allows for multiple pick-and-place operations, which are necessary for tasks such as spreading out a scrunched-up piece of cloth.
IV-B Learning with Composite Action Spaces
The straightforward approach to learning with a pick-place action space is to learn a policy $\pi_{joint}$ that directly outputs the optimal locations to pick, $a_{pick}$, and to place, $a_{place}$, i.e. $\pi_{joint}(a_{pick}, a_{place} \mid o)$, where $o$ is the observation of the deformable object (Fig. 2(a)). However, this approach fails to capture the underlying composite and conditional nature of the action space, where the location to place $a_{place}$ is strongly dependent on the pick point $a_{pick}$.
One way to learn with conditional output spaces is to explicitly factor the output space during learning. This has provided benefits in several other learning problems, from generating images [64] to predicting large dimensional robotic actions [40, 62]. Hence, instead of learning the joint policy, we factor the policy as:
$$\pi_{joint}(a_{pick}, a_{place} \mid o) = \pi_{pick}(a_{pick} \mid o) \cdot \pi_{place}(a_{place} \mid o, a_{pick}) \tag{1}$$
This factorization will allow the policy to reason about the conditional dependence of placing on picking (Fig. 2(b)). However, in the context of RL, we face another challenge: action credit assignment. Using RL, the reward for a specific behavior comes through the cumulative discounted reward at the end of an episode. This results in the temporal credit assignment problem, where attributing the reward to a specific action is difficult. With our factored action spaces, we now have an additional credit assignment problem on the different factors of the action space. This means that if an action receives high reward, we do not know if it is due to $\pi_{pick}$ or $\pi_{place}$. Due to this, training $\pi_{pick}$ and $\pi_{place}$ jointly is inefficient and often leads to the policy selecting a suboptimal pick location. This suboptimal $\pi_{pick}$ then does not allow $\pi_{place}$ to learn, since $\pi_{place}$ only sees suboptimal picking locations during early parts of training. Thus, this leads to a mode collapse as shown in Sec. V-D.
To overcome the action credit assignment problem, we propose a two-stage learning scheme. Here the key insight is that a placing policy can be trained given a full-support picking policy, and the picking policy can then be obtained from the placing policy by accessing the value approximator for placing. Algorithmically, this is done by first training $\pi_{place}$ conditioned on picking actions drawn from the uniform random distribution $U_{pick}$. Using SAC, we train and obtain $\pi_{place}(a_{place} \mid o, a_{pick})$, as well as the place value approximator $V_{place}(o, a_{pick})$. Since the value is also conditioned on the pick point $a_{pick}$, we can use this to obtain our picking policy as:
$$\pi_{pick} \equiv \arg\max_{a_{pick}} V_{place}(o, a_{pick}) \tag{2}$$
We call this picking policy Maximum Value under Placing (MVP). MVP allows us to get an informed picking policy without having to explicitly train for picking. This makes training efficient for off-policy learning with conditional action spaces, especially in the context of deformable object manipulation.
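The MVP picking rule can be illustrated with a short sketch: given a value function for placing that is conditioned on the pick point, choose the candidate pick point with the highest value. Everything named here is hypothetical: `mvp_pick`, the finite candidate set, and `toy_place_value` (a hand-written stand-in for a learned critic) are assumptions for illustration, and maximizing over a finite set of candidates is only one possible way to realize the maximization.

```python
from typing import Callable, Sequence, Tuple

PickPoint = Tuple[float, float]


def mvp_pick(
    observation: object,
    candidate_picks: Sequence[PickPoint],
    place_value: Callable[[object, PickPoint], float],
) -> PickPoint:
    """Return the candidate pick point with the highest placing value."""
    if not candidate_picks:
        raise ValueError("need at least one candidate pick point")
    return max(candidate_picks, key=lambda pick: place_value(observation, pick))


def toy_place_value(observation: object, pick: PickPoint) -> float:
    # Hypothetical stand-in for a learned critic: prefers picks near (1, 1).
    return -((pick[0] - 1.0) ** 2 + (pick[1] - 1.0) ** 2)


candidates = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
best = mvp_pick(observation=None, candidate_picks=candidates, place_value=toy_place_value)
print(best)  # (1.0, 1.0)
```

Note that no picking network is trained in this sketch; the picking decision is read off the placing critic, which is the point of the two-stage scheme.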
V Experimental Evaluation
In this section we analyze our method MVP across a suite of simulations and then demonstrate real-world deformable object manipulation using our learned policies.
V-A Cloth Manipulation in Simulation
Most current RL environments, like OpenAI Gym [4] and DM Control [61], offer a variety of rigid body manipulation tasks. However, they do not have environments for deformable objects. Therefore, for consistent analysis, we build our own simulated environments for deformable objects using the DM Control API. To simulate deformable objects, we use composite objects from MuJoCo 2.0 [63]. This allows us to create and render complex deformable objects like cloths and ropes. Using MVP, we train policies both on state (locations of the composite objects) and image observations (RGB). For image-based experiments, we uniformly randomly select a pick point on a binary segmentation of the cloth or rope in order to guarantee a pick point on the corresponding object.
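Uniformly sampling a pick point on a binary segmentation can be sketched as follows. The mask is represented as a nested list of 0/1 values and the function name is hypothetical; this is an illustration of the idea rather than the code used in our environments.

```python
import random
from typing import List, Optional, Tuple


def sample_pick_pixel(
    mask: List[List[int]], rng: random.Random
) -> Optional[Tuple[int, int]]:
    """Uniformly sample a (row, col) pixel where the mask is non-zero."""
    object_pixels = [
        (i, j)
        for i, row in enumerate(mask)
        for j, value in enumerate(row)
        if value
    ]
    if not object_pixels:
        return None  # nothing segmented, so there is no valid pick point
    return rng.choice(object_pixels)


# Tiny illustrative mask: the object occupies three pixels.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
print(sample_pick_pixel(mask, random.Random(0)))
```

Sampling only from segmented pixels guarantees that every pick lands on the object, so no action is wasted on the empty table.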
The details for the three environments we use are as follows:
1. Rope : The goal is to stretch the rope (simulated as a 25-joint composite) horizontally straight in the center of the table. The action space is divided into two parts, $a_{pick}$ and $a_{place}$. $a_{pick}$ is the two-dimensional pick point on the rope, and $a_{place}$ is the relative distance to move and place the rope. All other parts of the rope move based on the simulator dynamics after each action is applied. The reward for this task is computed from the segmentation of the rope in RGB images as a sum over pixel locations $(i, j)$, where $i$ is the row number of the image, $j$ is the column number, and $s_{i,j}$ is the binary segmentation at pixel location $(i, j)$. The reward encourages the rope to be in the center row of the image, with an exponential penalty on rows further from the center. At the start of each episode, the rope is initialized by applying a random action for the first 50 timesteps.
2. Cloth-Simplified : The cloth consists of an 81-joint composite arranged in a 9×9 grid. The robot needs to pick a corner joint of the cloth and move it to the target place. The action space is similar to the rope environment, except that the picking location can only be one of the four corners. In this environment, the goal is to flatten the cloth in the middle of the table. Our reward function is the intersection of the binary mask of the cloth with the goal cloth configuration. In MuJoCo, the skin of the cloth can be simulated by uploading an image. However, in this environment, we use a colormap [3] skin with four different colors in the corners.
3. Cloth : In contrast to the Cloth-Simplified environment, which can only pick one of the 4 corners, Cloth allows picking any pixel on the cloth (if it is trained with image observations) or any composite particle (if state observations are used). The reward used is the same as in Cloth-Simplified. For both the Cloth and Cloth-Simplified environments, the cloth is initialized by applying a random action for the first 130 timesteps of each episode.
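The two segmentation-based rewards described in the list above can be sketched as follows. This is an illustration under stated assumptions, not our exact reward code: the decay rate `decay`, the choice of `height // 2` as the center row, and both function names are hypothetical, and masks are represented as nested lists of 0/1 values.

```python
import math
from typing import List

Mask = List[List[int]]


def rope_reward(mask: Mask, decay: float = 0.5) -> float:
    """Sum of rope pixels, each weighted by an exponential penalty on its
    distance from the center row. `decay` is an assumed rate."""
    center_row = len(mask) // 2  # assumed definition of the center row
    return sum(
        math.exp(-decay * abs(i - center_row)) * value
        for i, row in enumerate(mask)
        for value in row
    )


def cloth_reward(mask: Mask, goal_mask: Mask) -> int:
    """Number of pixels where the cloth mask and the goal mask overlap."""
    return sum(
        1
        for row, goal_row in zip(mask, goal_mask)
        for value, goal_value in zip(row, goal_row)
        if value and goal_value
    )


# A 5x5 example: a straight rope in the center row scores higher than the
# same rope lying along the top row.
centered = [[1] * 5 if i == 2 else [0] * 5 for i in range(5)]
top = [[1] * 5 if i == 0 else [0] * 5 for i in range(5)]
print(rope_reward(centered), rope_reward(top))
```

Both rewards depend only on a binary segmentation of the image, which is what makes them usable with image observations.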
V-B Learning Methods for Comparison
Fig. 3 and Fig. 4 show our experimental results for various model architectures on the rope and cloth environments. To understand the significance of our algorithm, we compare the following learning methods: random, independent, conditional, learned placing with uniform pick, and MVP (ours).
Random: We sample actions uniformly from the pick-place action space of the robot.
Independent: We use a joint factorization of $\pi_{joint}(a_{pick}, a_{place} \mid o)$ by simultaneously outputting $a_{pick}$ and $a_{place}$.
Conditional: We first choose a pick location, and then choose a place vector distance given the pick location, modeled as $\pi_{pick}(a_{pick} \mid o) \cdot \pi_{place}(a_{place} \mid o, a_{pick})$.
Learned Placing with Uniform Pick: We use the conditional distribution $\pi_{place}(a_{place} \mid o, a_{pick})$, where $a_{pick}$ is uniformly sampled from the pick action space.
MVP (ours): We use the trained learned placing with uniform pick policy and choose $a_{pick}$ by maximizing over the learned Q-function.
V-C Training Details
For state-based experiments, we use an MLP with 2 hidden layers of 256 units each; approximately 150k parameters. For image-based experiments, we use a CNN with 3 convolutions with channel sizes (64, 64, 4), each with a kernel size of 3 and a stride of 2. This is followed by 2 fully connected hidden layers of 256 units each. In total approximately 200k parameters are learned. For all models, we repeat the pick information 50 times before concatenating with the state observations or flattened image embeddings. The horizon for Rope is 200 and 120 for both Cloth environments. The minimum replay pool size is 2000 for Rope and 1200 for the Cloth environments. The image size used for all environments is. Based on the original code, we added parallel environment sampling to speed up overall training by 35 times.
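The step of repeating the pick information before concatenation can be sketched as below. The function name, the ordering (features first, tiled pick point second), and the example feature length are assumptions for illustration; only the repeat count of 50 comes from the description above.

```python
from typing import List, Sequence


def build_place_input(
    features: Sequence[float], pick: Sequence[float], repeats: int = 50
) -> List[float]:
    """Concatenate features with the pick point tiled `repeats` times.

    Tiling gives the low-dimensional pick point a larger share of the
    input vector relative to the observation features.
    """
    return list(features) + list(pick) * repeats


# Hypothetical sizes: a 100-dimensional embedding and a 2-D pick point.
embedding = [0.0] * 100
pick_point = [0.25, 0.75]
place_input = build_place_input(embedding, pick_point)
print(len(place_input))  # 100 + 2 * 50 = 200
```

Without tiling, a 2-D pick point would make up only a small fraction of the input to the placing network.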
V-D Does conditional pick-place learning help?
To understand the effects of our learning technique, we compare our learned placing with uniform pick technique with the independent representation in Fig. 3. We can see that our proposed method shows significant improvement in learning speed for state-based cloth experiments, and for image-based experiments in general. The state-based rope experiments do not show much of a difference due to the inherent simplicity of the tasks. Our method shows significantly higher rewards in the Cloth-Simplified environment, and learns about 2X faster in the harder Cloth environment. For image-based experiments, the baseline methods do no better than random, while our method gives an order of magnitude (5-10X) higher reward. The independent and conditional factored policies for image-based cloth spreading end up performing worse than random, suggesting that some form of mode collapse is occurring [11]. This demonstrates that conditional learning indeed speeds up learning for deformable object manipulation, especially when the observation is an image.
V-E Does setting the picking policy based on MVP help?
One of the key contributions of this work is to use the placing value to inform the picking policy (Eq. 2) without explicitly training the picking policy. As we see in both the state-based (Fig. 3) and image-based (Fig. 4) cases, training with MVP gives consistently better performance. Even when our conditional policies with uniform pick location fall below the baselines, as seen in Cloth (State) and Rope (State), using MVP significantly improves the performance. Note that MVP brings relatively smaller boosts in performance compared to the gains brought by the learned placing with uniform pick method. This is because the learned placing with uniform pick policy already achieves a high success rate on completing the task; even so, a small boost in performance is visually substantial when running evaluations in simulation and on our real robot.
V-F How do we transfer our policies to a real robot?
To transfer our policies to the real robot, we use domain randomization (DR) [62, 44, 48] in the simulator, along with images of real cloths. DR is performed on visual parameters (lighting and textures) as well as physics parameters (mass and joint friction) of the cloth. On our PR2 robot (Fig. 1(a)) we capture RGB images from a head-mounted camera and input the image into our policy learned in the simulator. Since $a_{pick}$ and $a_{place}$ are both defined as points on the image, we can easily command the robot to perform pick-place operations on the deformable object placed on the green table. Additionally, in simulation evaluation, we notice no degradation in performance due to DR while training using MVP (Fig. 6).
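Domain randomization over the visual and physics parameters listed above can be sketched as drawing a fresh set of parameters at the start of each episode. All names and numeric ranges below are hypothetical placeholders chosen for illustration; they are not the values used in our experiments, and applying the parameters to a simulator is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class RandomizedParams:
    light_intensity: float  # visual: scene lighting scale
    texture_id: int         # visual: index into a set of cloth textures
    mass_scale: float       # physics: multiplier on the nominal cloth mass
    joint_friction: float   # physics: friction of the composite joints


def sample_randomized_params(rng: random.Random, num_textures: int) -> RandomizedParams:
    """Draw one set of episode parameters. All ranges are assumed values."""
    return RandomizedParams(
        light_intensity=rng.uniform(0.5, 1.5),
        texture_id=rng.randrange(num_textures),
        mass_scale=rng.uniform(0.8, 1.2),
        joint_friction=rng.uniform(0.0, 0.1),
    )


rng = random.Random(0)
for episode in range(3):
    print(episode, sample_randomized_params(rng, num_textures=10))
```

Training across many such draws encourages the policy to rely on features that are stable across appearance and dynamics changes, which is what allows a single policy to handle cloths of different colors.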
V-G Evaluation on the real robot
We evaluate our policy on the rope-spread and cloth-spread experiments. As seen in Fig. 5, policies trained using MVP are successfully able to complete both spreading tasks. For our cloth spreading experiment, we also note that due to domain randomization, a single policy can spread cloths of different colors. For quantitative evaluations, we select 4 start configurations for the cloth and the rope and compare with various baselines (Table I) on the spread coverage metric. For the rope task, we run the policies for 20 steps, while for the much harder cloth task we run policies for 150 steps. The large gap between MVP trained policies and independent policies supports our hypothesis that the conditional structure is crucial for learning deformable object manipulation. Robot execution videos can be accessed from the project website: https://sites.google.com/view/alternating-pick-and-place.
| Domains | Random policy | Conditional Pick-Place | Joint policy | MVP (ours) |
VI Conclusion and Future Work
We have proposed a conditional learning approach for manipulating deformable objects. We have shown that this significantly improves sample complexity. To our knowledge, this is the first work that trains RL from scratch for deformable object manipulation and demonstrates it on a real robot. This finding opens up many exciting avenues for deformable object manipulation, from bubble wrapping a rigid object to folding a T-shirt, which pose additional challenges in specifying a reward function and handling partial observability. Additionally, since our technique only assumes an actor-critic algorithm, we believe it can be combined with existing learning-from-demonstration techniques to obtain further improvements in performance.
We thank AWS for computing resources and Boren Tsai for support in setting up the robot. We also gratefully acknowledge the support from Komatsu Ltd., The Open Philanthropy Project, Berkeley DeepDrive, NSF, and the ONR PECASE award.
[1] (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
[2] (2013) Manipulation of deformable objects without modeling and simulating deformation. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4525–4532.
[3] (1995) A rule-based tool for assisting colormap selection. In Proceedings Visualization’95, pp. 118–125.
[4] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
[5] (1983) Planning collision-free motions for pick-and-place operations. The International Journal of Robotics Research 2 (4), pp. 19–44.
[6] Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, pp. 1329–1338.
[7] (2012) Soft material modeling for robotic manipulation. In Applied Mechanics and Materials, Vol. 162, pp. 184–193.
[8] (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793.
[9] (2011) Efficient motion planning for manipulation robots in environments with deformable objects. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2180–2185.
[10] (2018-02) Addressing Function Approximation Error in Actor-Critic Methods. arXiv e-prints, arXiv:1802.09477.
[11] (2014) Generative adversarial nets. In NIPS.
[12] (2018) Robot learning in homes: improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems, pp. 9094–9104.
[13] (2018-01) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv e-prints, arXiv:1801.01290.
[14] (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
[15] (2012) Robot manipulation of deformable objects. Springer Science & Business Media.
[16] (2000) Intelligent learning for deformable object manipulation. Autonomous Robots 9 (1), pp. 51–58.
[17] (2018) Three-dimensional deformable object manipulation using fast online gaussian process regression. IEEE Robotics and Automation Letters 3 (2), pp. 979–986.
[18] (2018) Learning-based feedback controller for deformable object manipulation. arXiv preprint arXiv:1806.09618.
[19] (2012) Survey on model-based manipulation planning of deformable objects. Robotics and Computer-Integrated Manufacturing 28 (2), pp. 154–163.
[20] (2015) Team IHMC’s lessons learned from the DARPA robotics challenge trials. Journal of Field Robotics 32 (2), pp. 192–208.
[21] Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, pp. 237–285.
[22] (2002) A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538.
[23] (2010) Dexterous robotic manipulation of deformable objects with multi-sensory feedback-a review. In Robot Manipulators Trends and Development.
[24] (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.
[25] (2015) Learning from multiple demonstrations using trajectory-aware non-rigid registration with applications to deformable object manipulation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5265–5272.
[26] (2016) End-to-end training of deep visuomotor policies. JMLR.
[27] Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. ISER.
[28] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[29] (2015) Continuous control with deep reinforcement learning. arXiv e-prints, arXiv:1509.02971.
[30] (1989) Task-level planning of pick-and-place robot motions. Computer 22 (3), pp. 21–29.
[31] (2016) Dex-net 1.0: a cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In ICRA.
[32] (2010) Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In 2010 IEEE International Conference on Robotics and Automation, pp. 2308–2315.
[33] (2018) Sim-to-real reinforcement learning for deformable object manipulation. arXiv preprint arXiv:1806.07851.
[34] A system for robotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics 22 (13-14), pp. 1521–1537.
[35] (2018) Estimating model utility for deformable object manipulation using multiarmed bandit methods. IEEE Transactions on Automation Science and Engineering 15 (3), pp. 967–979.
[36] (2017) Interleaving planning and control for deformable object manipulation. In International Symposium on Robotics Research (ISRR).
[37] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[38] (2016-02) Asynchronous Methods for Deep Reinforcement Learning. arXiv e-prints, arXiv:1602.01783.
[39] (2006) Path planning for deformable linear objects. IEEE Transactions on Robotics 22 (4), pp. 625–636.
[40] CASSL: curriculum accelerated self-supervised learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6453–6460.
[41] (2017) Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2146–2153.
[42] (2014) On the visual deformation servoing of compliant objects: uncalibrated control methods and experiments. The International Journal of Robotics Research 33 (11), pp. 1462–1480.
[43] (2001) Tight open knots. The European Physical Journal E 6 (2), pp. 123–128.
[44] (2017) Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542.
[45] (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. ICRA.
[46] (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
[47] (2006) An obstacle-based rapidly-exploring random tree. In Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), pp. 895–900.
[48] (2016) Cad2rl: real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
[49] (2007) Manipulation planning for deformable linear objects. IEEE Transactions on Robotics 23 (6), pp. 1141–1150.
[50] (2013) A case study of trajectory transfer through non-rigid registration for a simplified suturing scenario. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4111–4117.
[51] (2013) Generalization in robotic manipulation through the use of non-rigid registration. In Proceedings of the 16th International Symposium on Robotics Research (ISRR).
[52] (2013) Tracking deformable objects with point clouds. In 2013 IEEE International Conference on Robotics and Automation, pp. 1130–1137.
[53] (2015) Trust region policy optimization. In ICML, pp. 1889–1897.
[54] Deep imitation learning of sequential fabric smoothing policies. arXiv preprint arXiv:1910.04854.
[55] Deep transfer learning of pick points on fabric for robot bed-making. arXiv preprint arXiv:1809.09810.
[56] (1996) Robot grasp synthesis algorithms: a survey. The International Journal of Robotics Research 15 (3), pp. 230–266.
[57] (2009) Deformation planning for robotic soft tissue manipulation. In 2009 Second International Conferences on Advances in Computer-Human Interactions, pp. 199–204.
[58] Rlpyt: a research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500.
[59] (2014) Garment perception and its folding using a dual-arm robot. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 61–67.
[60] (1998) Introduction to reinforcement learning. Vol. 2, MIT Press, Cambridge.
[61] (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690.
[62] (2018) Domain randomization and generative models for robotic grasping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489.
[63] (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
[64] (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
[65] (2001) Robust manipulation of deformable objects by a simple pid feedback. In Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), Vol. 1, pp. 85–90.
[66] (2006) Knotting/unknotting manipulation of deformable linear objects. The International Journal of Robotics Research 25 (4), pp. 371–395.
[67] (2019) Learning robotic manipulation through visual planning and acting. arXiv preprint arXiv:1905.04411.
[68] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[69] (2011) Tactile sensing for dexterous in-hand manipulation in robotics—a review. Sensors and Actuators A: Physical 167 (2), pp. 171–187.