I Introduction
In recent years, robotic manipulation has been an active field of research; nevertheless, work focusing on dual-arm manipulation is still rare. A second robotic arm enhances dexterity but also introduces new challenges and additional modeling effort, as further degrees of freedom and the interaction between manipulators need to be considered. Thus, it is common to use a complex task-specific control structure with multiple control loops. However, as our goal is to contribute towards a general framework for dual-arm manipulation, the method needs to be task-independent and modeling efforts should be kept to a minimum.
To this end, it is possible to utilize deep reinforcement learning (RL), whereby manipulation tasks can be learned from scratch by interacting with the environment. However, deep RL alone would require a large number of training samples, which are expensive to collect in a real-world setup. Instead, it is preferable to introduce inductive biases into our architecture to facilitate the learning process. Namely, we train a policy network only to generate high-level trajectories and use well-established control techniques to track those trajectories. Such a modular architecture also allows for zero-shot sim-to-real transfer, which enables us to do all the training in simulation. In general, the distinction between decentralization and centralization has to be made on both the control level and the policy level. On the control level, centralized control requires large modeling efforts and is not task-agnostic; hence, our work considers a decentralized approach. With that in mind, two general paradigms are conceivable: the first involves two separate decoupled RL agents trained in a multi-agent RL setting, while the second utilizes a single policy controlling both arms. The latter is more feasible, as it couples the control of both manipulators through a policy network, resulting in an overall centralized method and thus increasing precision and efficiency. Our method is based on this latter approach and learns a single policy using off-policy RL. Intuitively, such an approach can be thought of as a way to centralize decentralized control based on RL. This paper aims at exploring the applicability of deep RL to dual-arm manipulation tasks. Hence, we propose a framework to learn such tasks based on a combination of recent advances in RL and well-established control techniques. To reduce the need for task-specific knowledge and to avoid introducing additional bias into the learning process, we test our methods solely with sparse rewards.
Nevertheless, only receiving a reward after successfully solving the task provides less guidance to the agent, as no intermediate feedback is given. Thus, the already challenging task of dual-arm manipulation becomes more complicated and sample-inefficient. That is why we rely on simulation to train our policy and transfer the results to the real world (fig. 1). Moreover, we design our framework with the goal of avoiding elaborate sim-to-real transfer procedures.
To demonstrate the effectiveness of our approach, we evaluate our method on a dual-arm peg-in-hole task, as it requires high dexterity to manipulate two objects with small clearances under consideration of contact forces. We first use PyBullet [8] to create a realistic simulation environment and analyze the proposed approach with a focus on sample efficiency, performance and robustness. We then test the learned behavior in a real-world setup with two Franka Emika Panda [9] robots and demonstrate the feasibility of our method under minimized sim-to-real transfer efforts. Our contributions can be summarized as follows:

We explore and formulate a new paradigm for learning dual-arm manipulation tasks.

We compare the effect of different action spaces and controllers on the success and robustness of the learned policies.

We show the feasibility of zero-shot transfer of policies trained in simulation to the real world.

To our knowledge, our work is the first to explore the applicability of model-free RL to contact-rich dual-arm manipulation tasks.
II Related Work
Dual-arm manipulation is a challenging area of research, which can be divided into decentralized and centralized approaches. The former utilizes independent controllers for each robot with explicit [25] or implicit [33, 30] communication channels and is often combined with leader/follower behaviour [28, 33]. Despite improved scalability and variability, decentralized control hardly reaches the efficiency and precision of centralized control, which integrates the control of both manipulators in a central unit.
Among feasible manipulation objectives, peg-in-hole insertion can be seen as a benchmark, since it requires accurate positioning, grasping, and handling of objects in contact-rich situations. Therefore, we select the task of dual-arm peg-in-hole to evaluate the performance of our approach.
Single-arm peg-in-hole. As research focusing on dual-arm peg-in-hole assembly is rare and mostly limited to extensive modeling [28, 22, 34], research on classical peg-in-hole assembly with a single robotic arm provides a perspective on model-free approaches based on reinforcement learning. [32] and [27] show that sparse rewards are sufficient to successfully learn a policy for an insertion task if combined with learning from demonstrations. In [35], a similar setup is combined with the concept of Hindsight Experience Replay (HER) [2] for the robotic manipulation tasks push as well as pick-and-place. Their results indicate that HER is sufficient to enhance performance if only sparse rewards are available. The work in [27] utilizes residual reinforcement learning to leverage classical control; it performs well given sparse rewards only and provides a hint that the choice of action space can be crucial. An evaluation of action spaces on the task of single-arm peg-in-hole with a small clearance and a shaped reward function is presented in [31], where cartesian impedance control performs best. Moreover, [3] applies position-force control with model-free reinforcement learning for peg-in-hole with a focus on transfer learning and domain randomization.
Decentralized Dual-Arm Manipulation. In the work by [28], a decentralized approach for the dual-arm peg-in-hole task is proposed. The method is based on a leader/follower architecture; hence, no explicit coupling between both manipulators is required. The leader performs the insertion, while the follower holds its position and remains compliant to the applied forces. Similar to the previously mentioned work, [34] utilizes a decentralized approach, where the hole keeps the desired position with low compliance and the peg is steered in a spiral-screw motion towards insertion with high compliance. Despite reducing the necessity to model the interaction, both approaches lack dexterity, i.e., only one robot is actively acting in the environment. In a general pipeline, there should be enough flexibility for both arms to actively contribute towards the objective. Furthermore, [22] presents a method based on decomposing the task into phases and utilizing a sophisticated control flow for the whole assembly process. Despite reducing the effort of modeling the interaction explicitly, the control flow is engineered for one specific task only and lacks dexterity, as movements are bound to the preprogrammed procedure.
Centralized Dual-Arm Manipulation. Pairet et al. propose a centralized approach for dual-arm manipulation, which formulates an accurate model of the objective [21]. A set of primitive behaviors is demonstrated to the robot by a human; the robot combines those behaviors and tries to solve a given task. Finally, an evaluator measures its performance and decides if further demonstrations are necessary. The approach has promising potential towards more robust and less task-specific dual-arm manipulation. However, besides the required modeling efforts, it is limited by the human teaching process, which introduces a different set of assumptions on top of the previous ones, limiting its applicability to semi-structured environments. Beyond that, work on centralized methods mostly focuses on cooperative object manipulation [5, 6, 11, 14, 26, 4] and relies highly on accurate and complex modeling of the underlying system dynamics.
Sim-to-Real Transfer. Sample inefficiency is one of the main challenges of deep RL algorithms. The problem is even worse for robotic tasks, which involve high-dimensional states and actions as well as complex dynamics. This motivates the use of simulation for data collection and training. However, due to inaccuracies in the physics modeling and image rendering in simulation, policies trained in simulation tend to fail in the real world. This is usually referred to as the "reality gap". The most popular paradigm to approach this problem is domain randomization (DR) [29]. The main goal of DR is to subject the agent to samples based on diverse simulation parameters concerning the object [29] and the dynamics properties [24]. By doing so, the learned policy should be able to generalize to the different physical properties of real-world tasks. Recent work has explored active parameter sampling strategies to dedicate more training time to troublesome parameter settings [19]. Another approach for sim-to-real transfer is system modularity [7]. Here, a policy is split into different modules responsible for different objectives, such as pose detection, online motion planning and control. Only components that will not suffer from the reality gap are trained in simulation; the rest is adapted or tailor-made for the real-world setup. This comes in contrast to the more common end-to-end training in deep RL [16]. In our work, we use a modular architecture to enable zero-shot sim-to-real transfer. Namely, we parameterize the controllers differently in simulation compared to the real world to allow using the same high-level policy network. Despite various contributions towards a general framework for dual-arm manipulation, we do not know of any work that successfully fulfills all requirements. Therefore, this work aims at proposing a unified pipeline for dual-arm manipulation based on a minimal set of assumptions. To the best of our knowledge, no publication exists which solves the challenge of contact-rich dual-arm peg-in-hole with model-free reinforcement learning and sparse rewards.
III Learning to Centralize
In this section, we introduce an approach to learn a centralized policy for dual-arm manipulation. Moreover, we intend to reduce the required modeling effort to a minimum, as the proposed method is based on model-free reinforcement learning with sparse rewards. The approach does not require a specific dual-arm controller, since control is only coupled at the policy level.
III-A Controller
The common approach in centralized dual-arm control strategies is to model the manipulator's and the object's dynamics explicitly and to achieve coupling of both robotic arms by using multiple control loops. The system dynamics of the $i$-th manipulator can be described as follows:

$M_i(q_i)\ddot{q}_i + C_i(q_i,\dot{q}_i)\dot{q}_i + g_i(q_i) = \tau_i + \tau_{ext,i}$ (1)

$\tau_{ext,i} = J_i^T(q_i)\,F_{ext,i}$ (2)

where $q_i$ is the vector of joint positions, $M_i(q_i)$ is the inertia matrix, $C_i(q_i,\dot{q}_i)$ is the coriolis/centrifugal matrix, $g_i(q_i)$ is the gravity vector, $\tau_i$ is the vector of joint torques and $\tau_{ext,i}$ represents external torques, which can be further decomposed into the external wrench $F_{ext,i}$ via the jacobian $J_i(q_i)$ as in (2). As both robots interact with each other directly or through the manipulation of objects, the respective external wrenches have to include all forces and moments applied to the robot. Hence, to create a valid dynamic chain, (1) is commonly also used to define the object dynamics, with the torques exerted by the manipulators or other objects on the respective object. Thereby, the dynamic equations need to be adapted to the specific manipulation task and are not universally valid. Furthermore, the end-effector movements are restricted by object geometries, which need to be known in advance. Based on the dynamics model, a hierarchical control strategy with multiple loops is commonly applied, where the outer loop realizes the main objective, such as desired object movements, and the inner loop generates a firm grasp and bounded internal forces [5, 6, 11, 14, 26, 4]. The particular control loops can utilize any control strategy. Nevertheless, impedance control is a common choice to achieve compliant behaviour [5, 6, 14, 26], as contact forces are limited by coupling the control torque with position and velocity (3): the control torque is calculated by multiplying the gains $K_p$ and $K_d$ with the difference of desired and actual position and velocity respectively. The principle can be applied in joint space or task space; $e_p$ and $e_v$ are placeholders for the position and velocity errors, to account for the variety of control laws and their additions (e.g. (5)).

$\tau_c = K_p e_p + K_d e_v$ (3)
Besides the need to define an explicit set of dynamic equations and constraints for each task, the control loops themselves are adapted to a specific use case. Hence, a general framework for dual-arm robotic manipulation is not feasible this way.
An alternative way to approach the problem is to utilize a policy network as high-level control and thereby remove the need to design a coupled control method. Figure 2 highlights the differences between those paradigms for dual-arm manipulation: the classical way of having a unified control scheme, which can include multiple control loops, to couple both robot arms, and the proposed method, which uses two standard controllers with a high-level policy instead. Thereby, the controllers can be designed in a straightforward way without the need for purpose-built dual-arm control algorithms, allowing the use of any single-arm action space. In all cases, the framework uses a policy trained by RL to provide an action input to the controllers. The policy learns to inherently compensate for the constraints resulting from dual-arm interactions. Furthermore, coupling at the policy level is convenient to implement, as the policy's action space can simply be extended to include a second controller.
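As a minimal sketch of this paradigm, a single centralized policy can output one action vector that is simply split between two independent single-arm controllers; no explicit coupling exists at the control level, and coordination emerges only through the shared policy. The 7-DoF arm dimension and the simple PD tracking law below are illustrative assumptions, not the exact controllers of the paper.

```python
import numpy as np

N_JOINTS = 7  # per arm (e.g. a 7-DoF manipulator); an assumption for illustration

def split_action(action):
    """Split the centralized policy action into per-arm controller inputs."""
    assert action.shape == (2 * N_JOINTS,)
    return action[:N_JOINTS], action[N_JOINTS:]

def track(q, dq, q_des, kp=50.0, kd=5.0):
    """Decoupled single-arm tracking controller (assumed PD form, placeholder gains)."""
    return kp * (q_des - q) - kd * dq

def step(policy_action, left_state, right_state):
    """One control step: each arm runs its own standard controller."""
    q_des_l, q_des_r = split_action(policy_action)
    tau_left = track(*left_state, q_des_l)
    tau_right = track(*right_state, q_des_r)
    return tau_left, tau_right
```

Extending the setup to a second arm thus only doubles the policy's action dimension, while the controllers themselves remain unchanged single-arm components.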
III-B Action Space
Classical control approaches for manipulation are often based on impedance control, as it comes with the previously mentioned advantages. However, since our method tries to compensate for interaction forces at a trajectory level, we explore different control laws as action spaces and study their effect on task success.
Joint position control. First, we use joint position control to compute a torque command (4). Both gains, $K_p$ and $K_d$, are set to constant values; the joint velocity $\dot{q}$ is evaluated at runtime, and the desired joint position $q_d$ is inferred by the policy.

$\tau = K_p (q_d - q) - K_d \dot{q}$ (4)
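A minimal sketch of this control law, assuming the PD form above; the numeric gains are placeholder values, not the (unreported) ones from the experiments:

```python
import numpy as np

def joint_position_torque(q, dq, q_des, kp=100.0, kd=10.0):
    """tau = Kp (q_d - q) - Kd dq, with constant placeholder gains."""
    return kp * (q_des - q) - kd * dq

# The policy infers q_des; q and dq are read from the robot at runtime.
q = np.array([0.0, 0.5, -0.3])
dq = np.array([0.1, 0.0, 0.0])
q_des = np.array([0.1, 0.5, -0.2])
tau = joint_position_torque(q, dq, q_des)
```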
Cartesian impedance control. Second, we implement cartesian impedance control [20] (5). This action space moves control from joint space to cartesian space and includes model information such as the cartesian inertia matrix $\Lambda$ and the jacobian matrix $J$, as well as the gravity compensation term $g(q)$. As the number of joints exceeds the task-space degrees of freedom, nullspace compensation $\tau_{null}$ is added. Instead of $q_d$, the desired cartesian pose $x_d$ is directly passed in as action input.

$\tau = J^T(q)\,\Lambda(q)\left(K_p (x_d - x) - K_d \dot{x}\right) + \tau_{null} + g(q)$ (5)
Variable cartesian impedance control. Variable cartesian impedance control [18] is based on classical cartesian impedance control, but additionally adds the stiffness gains $K_p$ to the action space, making control more variable to react with higher or lower stiffness if needed. We use anisotropic gains and couple the velocity gains via $K_d = 2\sqrt{K_p}$ to achieve critical damping.
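The variable-gain computation can be sketched as follows: the policy outputs anisotropic stiffness gains per task-space axis, and the damping is coupled for critical damping. The clipping bounds are illustrative assumptions.

```python
import numpy as np

def variable_gains(kp_action, kp_min=10.0, kp_max=1000.0):
    """Map a policy stiffness action to critically damped impedance gains."""
    kp = np.clip(kp_action, kp_min, kp_max)  # anisotropic stiffness per axis
    kd = 2.0 * np.sqrt(kp)                   # critical-damping coupling Kd = 2*sqrt(Kp)
    return kp, kd

def impedance_force(x, dx, x_des, kp, kd):
    """Task-space restoring force, before mapping through the jacobian transpose."""
    return kp * (x_des - x) - kd * dx
```

Coupling the damping to the stiffness keeps the action space small while still allowing the policy to stiffen or soften individual axes during contact.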
III-C Reinforcement Learning
In our method, the policy is responsible for generating high-level trajectories, which are later on tracked by the chosen controller. We train the policy network using model-free RL. The policy receives the robot states to infer a control signal (action) for the above-mentioned control laws (action spaces). We combine the joint positions $q$, joint velocities $\dot{q}$ and joint torques $\tau$ of both robotic arms, as well as the cartesian positions and orientations of the end-effectors, as state input $s$. Nevertheless, the state might need to be adjusted if the framework is applied to a different task, which for instance includes additional objects.
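The state construction can be sketched as a concatenation of the per-arm quantities; the quaternion representation of orientation and the resulting dimensions are assumptions for illustration.

```python
import numpy as np

def build_state(arms):
    """arms: one dict per robot with q, dq, tau (7,), ee_pos (3,), ee_quat (4,)."""
    parts = []
    for arm in arms:
        parts += [arm["q"], arm["dq"], arm["tau"], arm["ee_pos"], arm["ee_quat"]]
    return np.concatenate(parts)

arm = dict(q=np.zeros(7), dq=np.zeros(7), tau=np.zeros(7),
           ee_pos=np.zeros(3), ee_quat=np.array([0.0, 0.0, 0.0, 1.0]))
state = build_state([arm, arm])  # two arms -> 2 * (3*7 + 3 + 4) = 56 dims
```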
The proposed method is not restricted to a specific model-free RL algorithm, though an off-policy algorithm is desirable to facilitate high sample efficiency and allow the use of experience replay. Thus, we use Soft Actor-Critic (SAC) [13], as the algorithm achieves state-of-the-art performance and sample efficiency, but it could potentially be replaced by others such as Deep Deterministic Policy Gradients (DDPG) [17] or Twin Delayed Deep Deterministic Policy Gradients (TD3) [12].
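The off-policy scheme can be sketched as follows: interaction data goes into a replay buffer, and the algorithm (e.g. SAC) optimizes from sampled batches. The env/agent interfaces and all numbers are illustrative placeholders, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay, as used by off-policy algorithms."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def train(env, agent, buffer, episodes=10, opt_steps=5, batch_size=4):
    for _ in range(episodes):
        for transition in env.rollout(agent):        # one full episode of interaction
            buffer.store(transition)
        for _ in range(opt_steps):
            agent.update(buffer.sample(batch_size))  # off-policy update from replay
```

Because updates draw on stored transitions rather than only fresh rollouts, each expensive interaction sample can be reused many times, which is the efficiency argument made above.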
To enhance sample efficiency, the environments should be implemented in a goal-based way. Thereby, the achieved goal $g_a$ is returned alongside the environment state and can easily be compared to the desired goal $g_d$ to compute a reward $r$. Reward engineering is not necessary, as we use a sparse reward (6).

$r(g_a, g_d) = 0$ if $\lVert g_a - g_d \rVert \leq \epsilon$, else $-1$ (6)
To counteract the more challenging training task when using sparse instead of dense rewards, we use HER to augment past experiences. By replaying experiences with goals which have been or will be achieved along a trajectory, the agent should generalize a goal-reaching behaviour. Hence, unsuccessful experiences still help to guide the agent, as a sparse reward otherwise does not provide feedback on the closeness to the desired goal.
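The sparse reward and the "future" relabeling strategy can be sketched as follows: past transitions are replayed with goals actually achieved later in the same episode. The distance threshold and the number of relabeled goals per transition are illustrative assumptions.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.01):
    """0 on success, -1 otherwise (threshold eps is an assumption)."""
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) <= eps else -1.0

def her_future_relabel(achieved_goals, k=4, rng=None):
    """achieved_goals: the goal actually achieved at each time step of one episode."""
    rng = rng or np.random.default_rng(0)
    relabeled = []
    for t in range(len(achieved_goals) - 1):
        # sample k goals achieved at future time steps of the same episode
        future = rng.integers(t + 1, len(achieved_goals), size=k)
        for ft in future:
            new_goal = achieved_goals[ft]
            relabeled.append((t, new_goal, sparse_reward(achieved_goals[t], new_goal)))
    return relabeled
```

Relabeled transitions turn a failed episode into successful experience with respect to the goals it did reach, which is exactly the guidance a sparse reward otherwise withholds.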
III-D Implementation
We implement the general training and evaluation procedure in the following way: during each epoch, one full episode is gathered by interacting with the environment, followed by a number of optimization steps. Moreover, we calculate the success rate every fifth epoch by averaging over several test cycles. We use ADAM [15] as optimizer with a fixed learning rate and batch size. Training starts after an initial number of samples has been stored in the experience replay memory. The q-network and policy-network consist of linear layers with ReLU [1] activation functions. To update the target networks, a soft updating factor is chosen. HER is set to use the strategy "future", sampling additional experiences per transition [2]. All hyperparameters are tuned manually and kept fixed for all experiments.

IV Experimental Setup
We design experiments to answer the following questions: (i) can a central policy successfully learn dual-arm manipulation skills based on a decentralized control architecture? (ii) which action space leads to the highest success rate and the most robust policies? (iii) is our method robust against disturbances and position uncertainty? (iv) is a modular design enough to zero-shot transfer policies from simulation to the real world?
To answer these questions, we evaluate the proposed method on the task of peg-in-hole assembly with two Franka Emika Panda manipulators [9], both in simulation (fig. 1) and in the real world (fig. 1). The simulation environment is created using PyBullet [8]. We design it to match the real-world setup as closely as possible. Both setups are similar except for the environment frequency, which is 240 Hz in simulation and 1 kHz in the real world. The policy operates at 60 Hz; nevertheless, the robots only obtain the last state for control. To exclude the process of gripping and to restrict movements of peg and hole, both are fixed to the end-effector of the respective robot arm. In the real-world experiments, this corresponds to the peg and hole being attached to the gripper of each manipulator. Furthermore, to enable an evaluation with increasing difficulty, pairs of peg and hole have been created with different clearances. Moreover, we define the goal state as the relative distance between the positions of peg and hole. Both robots start at a significant distance from each other and with varying orientation, produced by a random offset around the initial joint positions. We restrict robot movements by their joint limits [10], whereas the workspace is not bounded. Furthermore, the robot bases are positioned on a table at a fixed distance, both oriented to the same side. We use PyTorch [23] to implement the models and train them on an NVIDIA GeForce RTX 2080 GPU.

V Simulation Results
We use the simulation to first train the policy and also to perform ablation studies and robustness tests. As can be seen in the supplementary video, the policy can be trained in simulation to learn a successful peg-in-hole insertion strategy. Both manipulators move equally towards each other without any major deviation. Furthermore, both are evenly involved in the process of aligning the end-effectors and pushing the peg inside. Hence, the approach does not lead to any leader/follower behaviour where one end-effector just keeps its position, similar to a single-arm solution.
We design the experiments to start with a random offset from the initial joint positions and average all results over 4 seeds. All successful runs converge to a high success rate within the given number of epochs (fig. 3). As we use sparse rewards, no intermediate information about the task success is available. Hence, the success rates of a single run tend to either converge or stay at zero, which can easily be seen for cartesian impedance control at the smaller clearance, where only 1 out of 4 runs converges in the given time frame. Overall, variance and training duration increase with smaller clearances, which confirms that the task becomes more challenging as clearances decrease.
V-A Ablation Studies
We compare the following control variants to study the effect of different action spaces: joint position control, cartesian impedance control and variable cartesian impedance control, as introduced in section III-B. Figure 3 shows the results for the larger clearance, where policies trained in all three action spaces converge in a comparable manner. Moreover, to analyze the effect of smaller clearances, we conduct the same experiments using a smaller clearance between peg and hole (fig. 3). However, only joint position control converges in the respective time frame. Overall, the results differ from [31], where cartesian impedance control converges faster than joint position control for single-arm peg-in-hole, and from [18], where variable cartesian impedance control performs best in contact-rich manipulation tasks.
As peg-in-hole manipulation requires stiffness adaptation, variable impedance control should perform best among the evaluated action spaces. Since it does not, a couple of improvements are conceivable: the state representation and hyperparameters could be optimized, as they have been kept fixed during all experiments. Moreover, the increased size of the action space makes the learning task more challenging, but it could be decreased by introducing isotropic gains. In the case of cartesian impedance control, the first two possible improvements apply as well. Furthermore, the stiffness values are not tuned and leave room for improvement.
V-B Robustness
First, we showcase robustness by evaluating disturbance recovery; second, we demonstrate robustness against positioning errors.
Disturbance recovery. To investigate the robustness to unforeseen events, such as collision or active manipulation by humans, we evaluate the success rate after the robots are disturbed from time step 10 till 40, resulting in an end-effector offset. Figure 4 shows the results. Each episode duration, with a maximum of 400 steps, is averaged over 60 random disturbances, 3 test cycles and all seeds. Afterwards, we compare the trajectories to a reference and calculate the disturbance as the difference between end-effector positions. Comparing all action spaces, it turns out that in our framework all variants can recover from external disturbances. Joint position control yields faster success, but episode durations increase proportionately more with higher external disturbances. Overall, the ability to recover depends mostly on the available time frame; hence, increasing the maximum number of time steps could allow handling larger disturbance offsets. Figure 5 visualizes trajectories of desired joint positions given by the policy network when using joint position control. Two time steps after the external disturbance is applied, the policy starts to adapt to the external influence, depending on the disturbance magnitude. These exemplary results show that the policy network actively reacts to the disturbance and does not purely rely on the controller.
Position uncertainties. Furthermore, to evaluate the performance under position uncertainties, for instance caused by grasping the peg in an earlier phase, we apply a random offset to the relative location of the peg with respect to the end-effector. Figure 4 shows the success rates for three offsets. We average each result over 50 different offsets and all seeds. In general, the success rates decrease with higher offsets and smaller clearances. Cartesian impedance control turns out to be less robust compared to joint position control, and variable cartesian impedance control ends up last, which is consistent with the previous robustness evaluation of disturbance recovery. Nevertheless, joint position control and cartesian impedance control are capable of handling large offsets with high success rates, which should already be sufficient for most applications and is significant considering that no randomization has been applied during training. The evaluation under positional uncertainties shows that the policy does not simply learn the underlying kinematics, since peg and hole positions are fixed during training, but infers a peg insertion behaviour from the underlying state information.
V-C Real-World Results
To evaluate the approach in the real world (fig. 1), we transfer the policies trained in simulation without taking further measures to improve transferability, such as domain randomization or domain adaptation. We explicitly evaluate the performance without any additional steps targeting sim-to-real transfer, to precisely investigate whether the policy is robust enough to be applied in reality and whether both decentralized controllers can compensate to bridge the gap between simulation and reality.
To enable zero-shot transfer of the approach, the simulation environment has been adapted to match the real world as closely as possible. However, to ensure safety in the real-world experiments, additional torque, velocity and stiffness limitations need to be applied to guarantee observability and non-critical behaviour, thereby further increasing the sim-to-reality gap. For all experiments we use the peg and hole with the largest clearance, since all action spaces have been successfully trained in simulation when using that clearance.
Figure 6 shows the success rates for each action space when starting with a random deviation for each joint from the initial joint positions. Using cartesian impedance control leads by far to the highest success rate, as the action space helps to compensate for the sim-to-reality gap and is robust to the parameter changes applied for safe operation. Variable impedance control confirms the results of the previous robustness evaluations, as the variant cannot reach a high success rate in real-world evaluations. One reason for the performance decrease might be that the variable stiffness led to overfitting to the system dynamics in simulation instead of learning a generalized stiffness adaptation. Joint position control, which performed best in simulation, is not able to keep up in reality at all. The action space is not robust to joint torque and joint velocity limitations and would thus require additional retraining using the applied limitations. Overall, the results show that a well-chosen action space can help to enhance robustness and to transfer the approach from simulation to reality without applying further sim-to-real transfer methods. Moreover, the modular design helped to incorporate adaptations after training the policy, which would not have been possible in an end-to-end approach. Nevertheless, the proposed method leaves room for improvement: among others, the impact of domain randomization and domain adaptation should be explored in the future, as well as fine-tuning in the real world to adapt to the additionally applied safety constraints.
VI Conclusion
We introduce a framework for dual-arm assembly with the goal of avoiding the constraint and interaction modeling of traditional centralized control. The approach explores the applicability of reinforcement learning by utilizing a policy network to couple decentralized control of both robotic arms without any explicit modeling of their interaction. The policy is trained through model-free reinforcement learning and can be combined with various well-established single-arm controllers. As we aim to explore a framework with a minimal set of task-specific assumptions, we only use sparse rewards. Moreover, sample efficiency needs to be enhanced for higher-precision tasks such as peg-in-hole with smaller clearances; therefore, we plan to further optimize exploration, experience replay and action spaces in the future. We evaluate the approach in simulation on the task of dual-arm peg-in-hole and show that joint position control provides good results up to the investigated clearances. Furthermore, we point out that in simulation the approach can recover from external disturbances, and we show that the method learns a general peg insertion behaviour by evaluating position uncertainties. Lastly, we zero-shot transfer the policy trained in simulation to the real world and show that a well-chosen action space can help to compensate for the sim-to-reality gap. The framework can be seen as a first step towards reducing modeling efforts for centralized dual-arm manipulation and leaves room for further research, including the investigation of methods to improve sim-to-reality transferability and the evaluation of further dual-arm manipulation tasks.
Acknowledgment
We gratefully acknowledge the funding of this work by Microsoft Germany, the Alfried Krupp von Bohlen und Halbach Foundation and project KoBo34 (project number V5ARA202) by the BMBF (grant no. 16SV7985).
We would like to thank Carlos Magno C. O. Valle, Luis F.C. Figueredo, Konstantin Ritt, Maximilian Ulmer and Sami Haddadin for their general support and comments on this work.
References
 [1] Abien Fred Agarap. Deep learning using rectified linear units (relu). ArXiv, abs/1803.08375, 2018.
 [2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017.
 [3] Cristian Camilo Beltran-Hernandez, Damien Petit, Ixchel Georgina Ramirez-Alpizar, and Kensuke Harada. Variable compliance control for robotic peg-in-hole assembly: A deep reinforcement learning approach. CoRR, abs/2008.10224, 2020.
 [4] Magnus Bjerkeng, Johannes Schrimpf, Torstein Myhre, and Kristin Ytterstad Pettersen. Fast dual-arm manipulation using variable admittance control: Implementation and experimental results. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, September 14-18, 2014, pages 4728–4734. IEEE, 2014.
 [5] F. Caccavale and L. Villani. An impedance control strategy for cooperative manipulation. IEEE/ASME International Conference on Advanced Intelligent Mechatronics, AIM, 1(July):343–348, 2001.
 [6] Fabrizio Caccavale, Pasquale Chiacchio, Alessandro Marino, and Luigi Villani. Six-DOF impedance control of dual-arm cooperative manipulators. IEEE/ASME Transactions on Mechatronics, 13(5):576–586, 2008.
 [7] Ignasi Clavera, David Held, and Pieter Abbeel. Policy transfer via modularity and reward guiding. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1537–1544. IEEE, 2017.
 [8] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
 [9] Franka Emika. Franka emika panda robot arm. https://www.franka.de/technology.
 [10] Franka Emika. Panda control parameters. https://frankaemika.github.io/docs/control_parameters.html.
 [11] Sebastian Erhart, Dominik Sieber, and Sandra Hirche. An impedance-based control architecture for multi-robot cooperative dual-arm mobile manipulation. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, November 3-7, 2013, pages 315–322. IEEE, 2013.
 [12] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018.
 [13] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
 [14] Dennis J. F. Heck, Dragan Kostic, Alper Denasi, and Henk Nijmeijer. Internal and external force-based impedance control for cooperative manipulation. In European Control Conference, ECC 2013, Zurich, Switzerland, July 17-19, 2013, pages 2299–2304. IEEE, 2013.
 [15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
 [16] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [17] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
 [18] Roberto Martín-Martín, Michelle A. Lee, Rachel Gardner, Silvio Savarese, Jeannette Bohg, and Animesh Garg. Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks. CoRR, abs/1906.08880, 2019.
 [19] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. In Conference on Robot Learning, pages 1162–1176. PMLR, 2020.
 [20] Christian Ott. Cartesian Impedance Control of Redundant and Flexible-Joint Robots. Springer Publishing Company, Incorporated, 1 edition, 2008.
 [21] Èric Pairet, Paola Ardón, Frank Broz, Michael Mistry, and Yvan Petillot. Learning and generalisation of primitives skills towards robust dual-arm manipulation. arXiv preprint arXiv:1904.01568, 2019.
 [22] Hyeonjun Park, Peter Ki Kim, Ji-Hun Bae, Jae-Han Park, Moonhong Baeg, and Jaeheung Park. Dual arm peg-in-hole assembly with a programmed compliant system. pages 431–433, 2014.
 [23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
 [24] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3803–3810. IEEE, 2018.
 [25] Antonio Petitti, Antonio Franchi, Donato Di Paola, and Alessandro Rizzo. Decentralized motion control for cooperative manipulation with a team of networked mobile manipulators. In Danica Kragic, Antonio Bicchi, and Alessandro De Luca, editors, 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016, pages 441–446. IEEE, 2016.
 [26] Yi Ren, Yang Zhou, Yechao Liu, Minghe Jin, and Hong Liu. Adaptive object impedance control of dual-arm cooperative humanoid manipulators. Proceedings of the World Congress on Intelligent Control and Automation (WCICA), 2015-March(March):3333–3339, 2015.
 [27] Gerrit Schoettler, Ashvin Nair, Jianlan Luo, Shikhar Bahl, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. CoRR, abs/1906.05841, 2019.
 [28] Markku Suomalainen, Sylvain Calinon, Emmanuel Pignat, and Ville Kyrki. Improving dual-arm assembly by master-slave compliance. CoRR, abs/1902.07007, 2019.
 [29] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
 [30] Anastasios Tsiamis, Christos K. Verginis, Charalampos P. Bechlioulis, and Kostas J. Kyriakopoulos. Cooperative manipulation exploiting only implicit communication. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015, pages 864–869. IEEE, 2015.
 [31] Patrick Varin, Lev Grossman, and Scott Kuindersma. A comparison of action spaces for learning manipulation tasks. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2019, Macau, SAR, China, November 3-8, 2019, pages 6015–6021. IEEE, 2019.
 [32] Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin A. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR, abs/1707.08817, 2017.
 [33] Zijian Wang and Mac Schwager. Multi-robot manipulation without communication. In Nak Young Chong and Young-Jo Cho, editors, Distributed Autonomous Robotic Systems - The 12th International Symposium, DARS 2014, Daejeon, Korea, November 2-5, 2014, volume 112 of Springer Tracts in Advanced Robotics, pages 135–149. Springer, 2014.
 [34] Xianmin Zhang, Yanglong Zheng, Jun Ota, and Yanjiang Huang. Peg-in-hole assembly based on two-phase scheme and F/T sensor for dual-arm robot. Sensors, 17(9):2004, 2017.
 [35] Guoyu Zuo, Qishen Zhao, Jiahao Lu, and Jiangeng Li. Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards. International Journal of Advanced Robotic Systems, 17(1):1–13, 2020.