In recent years robotic manipulation has been an active field of research, nevertheless work focusing on dual-arm manipulation is still rare. A second robotic arm enhances dexterity but also introduces new challenges and additional modeling effort as further degrees of freedom and interaction between manipulators need to be considered. Thus, it is common to use a complex task-specific control structure with multiple control loops. However, as our goal is to contribute towards a general framework for dual-arm manipulation, the method needs to be task-independent and modeling efforts should be restricted to a minimum.
To this end, it is possible to utilize deep reinforcement learning (RL). Thereby manipulation tasks can be learned from scratch by interaction with the environment. However, deep RL alone would require a lot of training samples which are expensive to collect in a real-world setup. Instead, it is preferable to introduce inductive biases into our architecture as to facilitate the learning process. Namely, we train a policy network only to generate high-level trajectories and use well-established control techniques to track those trajectories. Such a modular architecture also allows for zero-shot sim-to-real transfer. This enables us to do all the training in simulation. In general the distinction between decentralization and centralization has to be made on both control level and policy level. On a control level centralized control requires large modeling efforts and is not task-agnostic, our work considers a decentralized approach. With that in mind, two general paradigms are conceivable: the first one involves two separate decoupled RL agents which can be trained in a multi-agent RL setting and the second one utilizes a single policy controlling both arms. The latter is more feasible as it couples control of both manipulators through a policy network, resulting in an overall centralized method, and, thus increases precision and efficiency. Our method is based on the latter approach and attempts to learn a single policy using off-policy RL. Intuitively, such an approach can be thought of as a way to centralize decentralized control based on RL. This paper aims at exploring the applicability of deep RL to dual-arm manipulation tasks. Hence, we propose a framework to learn such tasks based on a combination of recent advances in RL and well-established control techniques. To reduce the need for task-specific knowledge and to avoid introducing additional bias in the learning process, we test our methods solely with sparse rewards. Nevertheless, only receiving a reward after successfully solving the task provides less guidance to the agent as no intermediate feedback is given. Thus, the already challenging task of dual-arm manipulation becomes more complicated and sample-inefficient. That is why we rely on simulation to train our policy and transfer the results to the real-world (fig. 1). Moreover, we design our framework with the goal of avoiding elaborate sim-to-real transfer procedures.
To demonstrate the effectiveness of our approach we evaluate our method on a dual-arm peg-in-hole task, as it requires high dexterity to manipulate two objects with small clearances under consideration of contact forces. We first use PyBullet  to create a realistic simulation environment and analyze the proposed approach with a focus on sample efficiency, performance and robustness. We then test the learned behavior in a real-world setup with two Franka Emika Panda  robots and demonstrate the feasibility of our method under minimized sim-to-real transfer efforts. Our contributions can be summarized as follows:
We explore and formulate a new paradigm for learning dual-arm manipulation tasks.
We compare the performance of different action spaces and controllers on the success and robustness of the learned policies.
We show the feasibility of zero-shot transferring policies trained in simulation to the real-world.
To our knowledge, our work is the first to explore the applicability of model-free RL to contact-rich dual-arm manipulation tasks.
Ii Related Work
Dual-arm manipulation is a challenging area of research, which can be divided into decentralized and centralized approaches. The first one utilizes independent controllers for each robot with explicit  or implicit [33, 30] communication channels and is often combined with leader/follower behaviour [28, 33]. Besides improved scalability and variability, decentralized control hardly reaches the efficiency and precision of centralized control, which integrates the control of both manipulators in a central unit.
Among feasible manipulation objectives, peg-in-hole insertion can be seen as a benchmark, since it requires accurate positioning, grasping, and handling of objects in contact-rich situations. Therefore, we select the task of dual-arm peg-in-hole to evaluate the performance of our approach.
Single-arm peg-in-hole. As research focusing on dual-arm peg-in-hole assembly is rare and mostly limited to extensive modeling [28, 22, 34], research on classical peg-in-hole assembly with a single robotic arm provides a perspective on model-free approaches based on reinforcement learning.  and  show that sparse rewards are sufficient to successfully learn a policy for an insertion task if combined with learning from demonstrations. In  a similar setup is combined with the concept of Hindsight Experience Replay (HER)  for the robotic manipulation tasks push as well as pick-and-place. Their results point out, that HER is sufficient to enhance the performance if only sparse rewards are available. The work in  utilizes residual reinforcement learning to leverage classical control, which performs well given spare rewards only and provides a hint, that the choice of action space can be crucial. An evaluation of action spaces on the task of single-arm peg-in-hole with a clearance of and a shaped reward function is presented in , where cartesian impedance control performs best. Moreover, 
applies position-force control with model-free reinforcement learning for peg-in-hole with a focus on transfer-learning and domain randomization.
Decentralized Dual-Arm Manipulation. In the work by , a decentralized approach for the dual-arm peg-in-hole task is proposed. The method is based on a leader/follower architecture. Hence, no explicit coupling between both manipulators is required. The leader would perform the insertion, and the follower would hold its position and be compliant to the applied forces. Similar to the previously mentioned work,  utilizes a decentralized approach, where the hole keeps the desired position with low compliance and the peg is steered in a spiral-screw motion towards insertion with high compliance. But despite reducing the necessity to model their interaction, both approaches lack dexterity, i.e., there is only one robot actively acting in the environment. In a general pipeline, there should be enough flexibility for both arms to be actively contributing towards the objective. Furthermore,  presents a method based on decomposing the task into phases and utilizing a sophisticated control flow for the whole assembly process. Despite reducing efforts in modeling the interaction explicitly, the control flow is only engineered for one specific task and lacks dexterity as movements are bound to the pre-programmed procedure.
Centralized Dual-Arm Manipulation. Pairet et al. proposes a centralized approach for dual-arm manipulation, which formulates an accurate model of the objective . A set of primitive behaviors are demonstrated to the robot by a human, the robot combines those behaviors and tries to solve a given task. Finally, an evaluator measures its performance and decides if further demonstrations are necessary. The approach has promising potential towards more robust and less task-specific dual-arm manipulation. However, besides the required modeling efforts, it is limited by the human teaching process, which introduces a different set of assumptions on top of the previous ones, limiting its applicability to semi-structured environments. Besides that, work on centralized methods, mostly focuses on cooperative object manipulation [5, 6, 11, 14, 26, 4] and highly relies on accurate and complex modeling of the underlying system dynamics.
Sim-to-Real Transfer. Sample inefficiency is one of the main challenges of deep RL algorithms. The problem is even worse for robotic tasks, which involve high-dimensional states and actions as well as complex dynamics. This motivates the use of simulation for data collection and training. However, due to the inaccuracies in the physics modeling and image rendering in simulation, policies trained in simulation tend to fail in the real-world. This is usually referred to as the ”reality gap”. The most popular paradigm to approach this problem is domain randomization (DR) . The main goal of DR is to subject the agent to samples based on diverse simulation parameters concerning the object  and the dynamics properties . By doing so the learned policy should be able to generalize to the different physical properties of real-world tasks. Recent work has explored active parameters sampling strategies as to dedicate more training time for troublesome parameter settings . Another approach for sim-to-real transfer is system modularity 
. Here a policy is split into different modules responsible for different objectives such as pose detection, online motion planning and control. Only components that won’t suffer from the reality gap are trained in simulation and the rest is adapted or tailor-made for the real-world setup. This comes in contrast to the most common end-to-end training in deep RL. In our work, we use a modular architecture to enable zero-shot sim-to-real transfer. Namely, we parameterize the controllers differently in simulation compared to the real-world to allow using the same high-level policy network.
Despite various contributions towards a general framework for dual-arm manipulation, we do not know of any work that successfully fulfills all requirements. Therefore, this work aims at proposing a unified pipeline for dual-arm manipulation based on a minimal set of assumptions. To the best of our knowledge, no publication exists which solves the challenge of contact-rich dual-arm peg-in-hole with model-free reinforcement learning, nor sparse rewards.
Iii Learning to Centralize
In this section, we introduce an approach to learn a centralized policy for dual-arm manipulation. Moreover, we intend to reduce the required modeling effort to a minimum as the proposed method is based on model-free reinforcement learning with sparse rewards. The approach does not require a specific dual-arm controller, since control is only coupled at a policy level.
The common approach in centralized dual-arm control strategies is to model the manipulator’s and the object’s dynamics explicitly and achieve coupling of both robotic arms by using multiple control loops. The system dynamics of the -th manipulator can be described as follows:
is the vector of joint positions,is the inertia matrix, is the coriolis/centrifugal matrix, is gravity vector, is the vector of joint torques and represents external torques, which can be further decomposed into the external wrench as in (2).
As both robots interact with each other directly or through the manipulation of objects, the respective external wrenches have to include all forces and moments which are applied to the robot. Hence, to create a valid dynamic chain, commonly (1) is also used to define the object dynamics: With as torques exerted by the manipulators or other objects on the respective object. Thereby, the dynamic equations need to be adapted to the specific manipulation task, but are not universally valid. Furthermore, the end-effector movements are restricted by object geometries, which need to be known in advance. Based on the dynamics model, commonly a hierarchical control strategy through multiple loops is applied, where the outer loops realizes the main objective such as desired object movements and the inner loop accounts to generate a firm grasp and bounded internal forces. [5, 6, 11, 14, 26, 4]. The particular control loops can utilize any control strategy. Nevertheless, impedance control is a common choice to achieve compliant behaviour [5, 6, 14, 26], as contact forces are limited by coupling the control torque with position and velocity (3). The control torque is calculated by multiplying the gains and with the difference of desired and actual position and velocity respectively. The principal can be applied in joint space or task space. and are placeholders to account for the variety of control laws and their additions (e.g. 5).
Besides, the need to define an explicit set of dynamic equations and constraints for each task, also the control loops are adapted for a specific use case. Hence, a general framework for dual-arm robotic manipulation is not feasible.
An alternative way to approach the problem is to utilize a policy network as high-level control and thereby dispose the need of designing a coupled control method. Figure 2 highlights the differences between those paradigms for dual-arm manipulation: The classical way to have a unified control scheme, which can include multiple control loops, to couple both robot arms (figure 2), and the proposed method, which uses two standard controllers with a high-level policy instead (figure 2). Thereby the controllers can be designed in a straightforward way without the need for purpose-built dual-arm control algorithms, allowing the use of any single-arm action space. In all cases, the framework uses a policy trained by RL to provide an action input to the controllers. The policy learns to inherently compensate for the constraints resulting from dual-arm interactions. Furthermore, coupling at policy level is convenient to implement as the policies action space can simply be extended to include a second controller.
Iii-B Action space
Classical control approaches for manipulation are often based on impedance control, as it comes with the previously mentioned advantages. However, since our method tries to compensate for interaction forces at a trajectory level, we explore different control laws as action spaces and study their effect on the task success.
Joint position control. First of all, we use joint position control (equation 4) to compute a torque command: Both gains, and , are set to a constant value, and evaluated at run-time and inferred by the policy.
Cartesian impedance control. Second, we implement cartesian impedance control : The action space allows to move control from joint space to cartesian space and includes model information such as the cartesian inertia matrix and the jacobian matrix as well as the gravity compensation term . As the degrees of freedom exceed the number of joints, nullspace compensation is added. Instead of , is directly passed in as action input.
Variable cartesian impedance control. Variable cartesian impedance control  is based on classical cartesian impedance control, though adds to the action space making control more variable to react with higher or lower stiffness if needed. We use anisotropic gains and couple velocity gains via to achieve critical damping.
Iii-C Reinforcement Learning.
In our method, the policy is responsible for generating high-level trajectories, which are later-on tracked by the chosen controller. We train the policy network using model-free RL. The policy receives the robot states to infer a control signal (action) for the above mentioned control laws (action spaces). We combine joint positions , joint velocities and joint torques of both robotic arms respectively, as well as cartesian positions and orientations of the end-effectors as state input . Nevertheless, the state might need to be adjusted if the framework is applied for a different task, which for instance includes additional objects.
The proposed method is not restricted to a specific model-free RL algorithm, though an off-policy algorithm is desirable to facilitate high sample efficiency and allow the use of experience replay. Thus, we use Soft Actor-Critic (SAC)  as the algorithm presents state of the art performance and sample efficiency, but could potentially be replaced by others such as Deep Deterministic Policy Gradients (DDPG)  or Twin Delayed Deep Deterministic Policy Gradients (TD3) .
To enhance sample efficiency, the environments should be implemented in a goal-based way. Thereby, the achieved goal is returned alongside the environment state and can easily be compared to the desired goal to compute a reward . Reward engineering is not necessary as we use a sparse reward (6).
To counteract the more challenging training task when using sparse compared to dense rewards, we use HER to augment past experiences. By replaying experiences with goals which have been or will be achieved along a trajectory the agent shall generalize a goal reaching behaviour. Hence, unsuccessful experiences still help to guide an agent, as a sparse reward otherwise does not provide feedback on the closeness to the desired goal.
We implement the general training and evaluation procedure in the following way: During each epoch, one full episode is gathered by interacting with the environment followed byoptimization steps. Moreover, we calculate the success rate every fifth epoch by averaging over test cycles. We use ADAM  as optimizer with a learning rate of and a batch size of . The experience replay memory size is set to and training starts after storing samples. The q-network and policy-network consist of and linear layers respectively with a hidden dimension of
and ReLUactivation functions. To update the target networks, an updating factor of is chosen. HER is set to use the strategy ”future” with sampling of additional experiences . All hyper-parameters are tuned manually and kept fixed for all experiments.
Iv Experimental Setup
We design experiments to answer the following questions: (i) can a central policy successfully learn dual-arm manipulations skills based on a decentralized control architecture? (ii) what action space leads to the highest success rate and the most robust policies? (iii) is our method robust against disturbances and position uncertainty? (iv) is a modular design enough to zero-shot transfer policies from simulation to the real-world?
To answer these questions, we evaluate the proposed method on the task of peg-in-hole assembly with two Franka Emika panda manipulators  both in simulation (fig. 1) and in real-world (fig. 1). The simulation environment is created using PyBullet . We design it to match the real-world setup as closely as possible. Both setups are similar except for the environment frequency which is 240Hz in simulation and 1KHz in the real-world. The policy is operating at 60Hz. Nevertheless, the robots only obtain the last state for control. To exclude the process of gripping and restrict movements of peg and hole, both are fixed to the end-effector of the respective robot arm. In the real-world experiments, this corresponds to the peg and hole being attached to the gripper of each manipulator. Furthermore, to enable an evaluation with increasing difficulty, pairs of peg and hole have been created with a clearance of and . Moreover, we define the goal state as relative distance between the positions of peg and hole. Both robots start with a significant distance and varying orientation with an offset around the initial joint position of . We restrict robot movements by their joint limits , whereas the workspace is not bounded. Furthermore, the robot bases are positioned on a table with a distance of
and both oriented to the same side. We use PyTorch to implement the models and train them on a NVIDIA GeForce RTX 2080 GPU.
V Simulation Results
We use the simulation to first train the policy and also to perform ablation studies and robustness tests. As can be seen in the supplementary video, the policy can be trained in simulation to learn a successful peg-in-hole insertion strategy. Both manipulators move equally towards each other without any bigger diversion. Furthermore, both are evenly involved in the process of aligning the end-effectors and pushing the peg inside. Hence, the approach does not lead to any leader/follower behaviour where one end-effector just keeps its position similar to a single-arm solution.
We design the experiments to start with an offset of from the initial joint positions and average all results over 4 seeds. All successful runs up to a clearance of converge to a success rate above in-between and epochs (fig. 3). As we use sparse rewards, no intermediate information about the task success is available. Hence, the success rates of a single run tend to converge either to or stay at , what can be easily seen for cartesian impedance control and a clearance of
, where only 1 out of 4 runs converges in the given time frame. Overall, variance and training duration is increasing with smaller clearances, which confirms that the task becomes more challenging as clearances decrease.
V-a Ablation Studies
We compare the following control variants to study the effect of different action spaces: joint position control, cartesian impedance control and variable cartesian impedance control as introduced in section III-B. Figure 3 shows the results when using a clearance of , where policies trained in all three action spaces converge in a comparable manner. Moreover, to analyze the effect of smaller clearances, we conduct the same experiments using a clearance between peg and hole (fig. 3). However, only joint position control converges in the respective time frame. Overall, the results are different to , where cartesian impedance control converges faster than joint position control for single-arm peg-in-hole, and , where variable cartesian impedance control performs best in contact-rich manipulation tasks.
As peg-in-hole manipulation requires stiffness adaption, variable impedance control should perform best among the evaluated action spaces. But, as it does not, a couple of improvements are conceivable: State representation and hyper-parameters could be optimized, as they have been set fixed during all experiments. Moreover, the increased size of the action space makes learning tasks more challenging, but could be decreased by introducing isotropic gains. In the case of cartesian impedance control, the first two possible improvements apply as well. Furthermore, the stiffness values are not tuned and leave lots of room for improvements.
First we showcase the robustness, by evaluating disturbance recovery, and second, we demonstrate the robustness against positioning errors.
Disturbance recovery. To investigate the robustness to unforeseen events, such as collision or active manipulation by humans, we evaluate the success rate after being disturbed from time step 10 till 40, resulting in an end-effector offset. Figure 4 shows the results. Each episode duration, with a maximum of 400 steps, is averaged over 60 random disturbances, 3 test cycles and all seeds. Afterwards, we compare their trajectories to a reference and calculate the disturbance as the difference between end-effector positions. Comparing all action spaces, it turns out that in our framework all variants can recover from external disturbances. Joint position control yields faster success, but episode duration’s increase proportionately more with higher external disturbances. Overall, the ability to recover depends mostly on the available time frame, hence increasing the maximum time steps could allow to handle larger disturbance offsets. Figure 5 visualizes trajectories of desired joint positions given by the policy network, when using joint position control. Two time steps after the external disturbance is applied, the policy starts to adapt to the external influence, varying on the disturbance magnitude. The exemplary results show, that the policy network is actively reacting to the disturbance and not purely relying on the controller.
Position uncertainties. Furthermore, to evaluate the performance under position uncertainties, for instance, caused by grasping the peg in an earlier phase, we apply a random offset to the relative location of the peg to the end-effector. Figure 4 shows the success rates for three offsets. We average each result over 50 different offsets and all seeds. In general, the success rates decrease with higher offsets and smaller clearances. Cartesian impedance control turns out to be less robust compared to joint position control and variable cartesian impedance control ends up last, which is comparable to the previous robustness evaluation of disturbance recovery. Nevertheless, joint position control and cartesian impedance control are capable to handle large offsets up to with high success rates, which should already be sufficient for most applications and is significant considering that no randomization has been applied during training. The evaluation under positional uncertainties shows, that the policy does not simply learn the underlying kinematics since peg and hole positions are fixed during training, but infers a peg insertion behaviour on the underlying state information.
V-C Real-World Results
To evaluate the approach in the real-world (fig. 1), we transfer the policies trained in simulation without taking further measures to improve transferability such as domain randomization or domain adaption. We explicitly evaluate the performance without any additional steps targeting sim-to-real transfer, to precisely investigate if the policy is robust enough to be applied in reality and both decentralized controllers can compensate to bridge the gap between simulation and reality.
To enable zero-shot transfer of the approach, the simulation environment has been adapted to match the real-world as close as possible. However, to ensure safety in real-world experiments, additional torque, velocity and stiffness limitations need to be applied to guarantee observability and non-critical behaviour, thereby even further increasing the sim-to-reality gap. For all experiments we use peg and hole with a clearance of , since all action spaces have been successfully trained in simulation when using that clearance.
Figure 6 shows the success rates for each action space when starting with a random deviation of for each joint from the initial joint positions . Using cartesian impedance control leads by far to the highest success rate as the action space helps to compensate for the sim-to-reality gap and is robust to the applied parameter changes for safe operation. Variable impedance control confirms the results of previous robustness evaluations as the variant can not reach a high success rate in real-world evaluations. One reason for the performance decrease might be that the variable stiffness led to overfitting to the system dynamics in simulation instead of learning a generalized stiffness adaption. Joint position control, which performed best in simulation, is not able to keep up in reality at all. The action space is not robust to joint torque and joint velocity limitations, thus would require additional retraining using the applied limitations. Overall, the results show that a well-chosen action space can help to enhance robustness and transfer the approach from simulation to reality without applying further methods to target sim-to-real transfer. Moreover, the modular design helped to incorporate adaptions after training the policy, which would not have been possible in an end-to-end approach. Nevertheless, the proposed method leaves room for improvements: Among them, the impact of domain randomization and domain adaption should be explored in future as well as fine-tuning in the real-world to adapt to the additionally applied safety constraints.
We introduce a framework for dual-arm assembly with the goal to compensate for constraint and interaction modeling of traditional centralized control. The approach explores the applicability of reinforcement learning by utilizing a policy network to couple decentralized control of both robotic arms without any explicit modeling of their interaction. The policy is trained through model-free reinforcement learning and can be combined with various well-established single-arm controllers. As we aim to explore a framework with a minimal set of task-specific assumptions we only use sparse rewards. Moreover, sample efficiency needs to be enhanced for higher precision tasks such as peg-in-hole with smaller clearances, therefore we plan to further optimize exploration, experience replay and action spaces in the future. We evaluate the approach in simulation on the task of dual-arm peg-in-hole and show that joint position control provides good results up to a investigated clearances of . Furthermore, we point out that in simulation the approach can recover from external disturbances and proof, that the method learns a general peg insertion behaviour by evaluating position uncertainties. Lastly, we zero-shot transfer the policy trained in simulation to the real-world and show that a well-chosen action space can help to compensate for the sim-to-reality gap. The framework can be seen as a first step in the direction of reducing modeling efforts for centralized dual-arm manipulation and leaves lots of room for further research including the investigation of methods to improve sim-to-reality transferability and the evaluation of further dual-arm manipulation tasks.
We greatly acknowledge the funding of this work by Microsoft Germany, the Alfried Krupp von Bohlen und Halbach Foundation and project KoBo34 (project number V5ARA202) by the BMBF (grant no. 16SV7985).
We would like to thank Carlos Magno C. O. Valle, Luis F.C. Figueredo, Konstantin Ritt, Maximilian Ulmer and Sami Haddadin for their general support and comments on this work.
-  Abien Fred Agarap. Deep learning using rectified linear units (relu). ArXiv, abs/1803.08375, 2018.
-  Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017.
-  Cristian Camilo Beltran-Hernandez, Damien Petit, Ixchel Georgina Ramirez-Alpizar, and Kensuke Harada. Variable compliance control for robotic peg-in-hole assembly: A deep reinforcement learning approach. CoRR, abs/2008.10224, 2020.
-  Magnus Bjerkeng, Johannes Schrimpf, Torstein Myhre, and Kristin Ytterstad Pettersen. Fast dual-arm manipulation using variable admittance control: Implementation and experimental results. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, September 14-18, 2014, pages 4728–4734. IEEE, 2014.
-  F. Caccavale and L. Villani. An impedance control strategy for cooperative manipulation. IEEE/ASME International Conference on Advanced Intelligent Mechatronics, AIM, 1(July):343–348, 2001.
-  Fabrizo Caccavale, Pasquale Chiacchio, Alessandro Marino, and Luigi Villani. Six-DOF impedance control of dual-arm cooperative manipulators. IEEE/ASME Transactions on Mechatronics, 13(5):576–586, 2008.
-  Ignasi Clavera, David Held, and Pieter Abbeel. Policy transfer via modularity and reward guiding. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1537–1544. IEEE, 2017.
Erwin Coumans and Yunfei Bai.
Pybullet, a python module for physics simulation for games, robotics and machine learning.http://pybullet.org, 2016–2019.
-  Franka Emika. Franka emika panda robot arm. https://www.franka.de/technology.
-  Franka Emika. Panda control parameters. https://frankaemika.github.io/docs/control_parameters.html.
-  Sebastian Erhart, Dominik Sieber, and Sandra Hirche. An impedance-based control architecture for multi-robot cooperative dual-arm mobile manipulation. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, November 3-7, 2013, pages 315–322. IEEE, 2013.
-  Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018.
-  Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
-  Dennis J. F. Heck, Dragan Kostic, Alper Denasi, and Henk Nijmeijer. Internal and external force-based impedance control for cooperative manipulation. In European Control Conference, ECC 2013, Zurich, Switzerland, July 17-19, 2013, pages 2299–2304. IEEE, 2013.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
-  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
-  Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
-  Roberto Martín-Martín, Michelle A. Lee, Rachel Gardner, Silvio Savarese, Jeannette Bohg, and Animesh Garg. Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks. CoRR, abs/1906.08880, 2019.
-  Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. In Conference on Robot Learning, pages 1162–1176. PMLR, 2020.
-  Christian Ott. Cartesian Impedance Control of Redundant and Flexible-Joint Robots. Springer Publishing Company, Incorporated, 1 edition, 2008.
-  Èric Pairet, Paola Ardón, Frank Broz, Michael Mistry, and Yvan Petillot. Learning and generalisation of primitives skills towards robust dual-arm manipulation. arXiv preprint arXiv:1904.01568, 2019.
-  Hyeonjun Park, Peter Ki Kim, Ji-Hun Bae, Jae-Han Park, Moonhong Baeg, and Jaeheung Park. Dual arm peg-in-hole assembly with a programmed compliant system. pages 431–433, 2014.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban
Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan
Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith
Pytorch: An imperative style, high-performance deep learning library.In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
-  Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3803–3810. IEEE, 2018.
-  Antonio Petitti, Antonio Franchi, Donato Di Paola, and Alessandro Rizzo. Decentralized motion control for cooperative manipulation with a team of networked mobile manipulators. In Danica Kragic, Antonio Bicchi, and Alessandro De Luca, editors, 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016, pages 441–446. IEEE, 2016.
-  Yi Ren, Yang Zhou, Yechao Liu, Minghe Jin, and Hong Liu. Adaptive object impedance control of dual-arm cooperative humanoid manipulators. Proceedings of the World Congress on Intelligent Control and Automation (WCICA), 2015-March(March):3333–3339, 2015.
-  Gerrit Schoettler, Ashvin Nair, Jianlan Luo, Shikhar Bahl, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. CoRR, abs/1906.05841, 2019.
-  Markku Suomalainen, Sylvain Calinon, Emmanuel Pignat, and Ville Kyrki. Improving dual-arm assembly by master-slave compliance. CoRR, abs/1902.07007, 2019.
J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel.
Domain randomization for transferring deep neural networks from simulation to the real world.In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
-  Anastasios Tsiamis, Christos K. Verginis, Charalampos P. Bechlioulis, and Kostas J. Kyriakopoulos. Cooperative manipulation exploiting only implicit communication. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015, pages 864–869. IEEE, 2015.
-  Patrick Varin, Lev Grossman, and Scott Kuindersma. A comparison of action spaces for learning manipulation tasks. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2019, Macau, SAR, China, November 3-8, 2019, pages 6015–6021. IEEE, 2019.
-  Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin A. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR, abs/1707.08817, 2017.
-  Zijian Wang and Mac Schwager. Multi-robot manipulation without communication. In Nak Young Chong and Young-Jo Cho, editors, Distributed Autonomous Robotic Systems - The 12th International Symposium, DARS 2014, Daejeon, Korea, November 2-5, 2014, volume 112 of Springer Tracts in Advanced Robotics, pages 135–149. Springer, 2014.
-  Xianmin Zhang, Yanglong Zheng, Jun Ota, and Yanjiang Huang. Peg-in-hole assembly based on two-phase scheme and F/T sensor for dual-arm robot. Sensors, 17(9):2004, 2017.
-  Guoyu Zuo, Qishen Zhao, Jiahao Lu, and Jiangeng Li. Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards. International Journal of Advanced Robotic Systems, 17(1):1–13, 2020.