Open-Sourced Reinforcement Learning Environments for Surgical Robotics

03/05/2019 · Florian Richter et al. · University of California, San Diego

Reinforcement Learning (RL) is a machine learning framework for artificially intelligent systems to solve a variety of complex problems. Recent years have seen a surge of successes solving challenging games and smaller-domain problems, including simple though non-specific robotic manipulation and grasping tasks. Rapid successes in RL have come in part from the strong collaborative effort by the RL community to work on common, open-sourced environment simulators, such as OpenAI's Gym, that allow for expedited development and valid comparisons between different, state-of-the-art strategies. In this paper, we aim to bridge the RL and surgical robotics communities by presenting the first open-sourced reinforcement learning environments for surgical robotics, called dVRL. Through the proposed RL environments, which are functionally equivalent to Gym, we show that it is easy to prototype and implement state-of-the-art RL algorithms on surgical robotics problems that aim to introduce autonomous robotic precision and accuracy to assisting, collaborative, or repetitive tasks during surgery. Learned policies are furthermore successfully transferable to a real robot. Finally, combining dVRL with the international network of over 40 da Vinci Surgical Research Kits in active use at academic institutions, we see dVRL as enabling the broad surgical robotics community to fully leverage the newest strategies in reinforcement learning, and reinforcement learning scientists with no knowledge of surgical robotics to test and develop new algorithms that can solve the real-world, high-impact challenges of autonomous surgery.


I Introduction

Reinforcement Learning (RL) is a framework that has been utilized in areas largely outside of surgical robotics to incorporate artificial intelligence into a variety of problems [1]. The problems solved, however, have mostly been in extremely structured environments such as video games [2] and board games [3]. There has also been recent success in robotic manipulation, specifically grasping, with evidence that the learned policies are transferable from simulation to real robots [4, 5]. These successes have hinged on having simulation environments that are lightweight and efficient, as RL tends to require thousands to millions of simulated attempts to evaluate and explore policy options. For robotics, this is crucial for real-world use of RL due to the impracticality of running millions of attempts on a physical system only to learn a low-level behavior.

Fig. 1: Reinforcement Learning in Action: a learned policy from our RL environment is used in a collaborative human-robot context, performing autonomous suction (right arm) of blood to iteratively reveal debris that a surgeon-controlled arm then removes from a simulated abdomen.

Surgical robots, such as Intuitive Surgical's da Vinci® Surgical System, have brought about more efficient surgeries by improving the dexterity and reducing the fatigue of the surgeon through teleoperational control. While these systems are already providing great care to patients, they have also opened the door to a variety of research, including surgeon performance metrics [6], remote teleoperation [7, 8], and surgical task automation [9]. Surgical task automation has furthermore been a growing area of research, in an effort to improve patient throughput, reduce quality-of-care variance among surgeries, and potentially deliver automated surgery in the future. Automation efforts include automating subtasks such as knot tying [10, 11], endoscopic motions [12, 13], surgical cutting [14, 15], and debris removal [16, 17, 18]. One of the challenges moving forward for the surgical robotics community is that, despite these successes, many have been based on hand-crafted control policies that can be difficult both to develop at scale and to generalize across a variety of environments. RL offers a solution to these problems by shifting the human time-costs and the limitations of feature- and controller-design to autonomously learning these via large-scale, faster-than-real-time, parallelized simulations (Fig. 1).

To bridge reinforcement learning with surgical robotics, simulation environments need to be provided on which RL algorithms past, present, and future can be prototyped and tested. OpenAI's Gym [19] has offered perhaps one of the most impactful resources to the RL community for testing a range of environments and domains through a common API, and has been wildly successful in engaging a broad range of machine learning researchers, engineers, and hobbyists. In this paper, we aim to bring RL to the surgical robotics domain via the first open-sourced reinforcement learning environments for surgical robotics, called dVRL. We are motivated to engage the broader community, including surgical roboticists and non-domain experts alike, such that reinforcement learning enthusiasts with no domain knowledge of surgery can still easily prototype their algorithms with such an environment and contribute to solutions that would have real-world significance for robotic surgery and the patients that undergo those procedures. To accomplish this, we present the following novel contributions in this work:

  1. the first, open-sourced reinforcement learning environment for surgical robotics,

  2. demonstration of learned policies from the RL environment effectively transferring to a real robot with minimal effort, and

  3. automating surgically relevant, human-robot collaborative tasks using the learned policies.

The syntactic interface with the environment is inherited from OpenAI's Gym environments [19], and is thus easy to include in an existing pipeline of environments to test. The RL environments are developed for the widely used da Vinci® Surgical System so that any RL-learned strategy can be applied on these platforms. Specifically, newly learned policies can be transferred onto any of the internationally networked, 40+ da Vinci Research Kit platforms and participating labs [20], including the one at UC San Diego, to encourage international collaborations and reduce the barriers for all to validate on a real-world system.
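For readers unfamiliar with Gym, the intended usage pattern is the standard reset/step loop. The sketch below illustrates this; the environment ID, the package name in the import, and the random placeholder policy are illustrative assumptions rather than confirmed dVRL names.

```python
# Sketch of the Gym-style interaction loop. The environment ID "dVRLReach-v0"
# and the registering package "dVRL_simulator" are assumptions, not confirmed names.
import gym
import dVRL_simulator  # hypothetical package that registers the dVRL environments

env = gym.make("dVRLReach-v0")
obs = env.reset()

for t in range(100):  # 100 steps per episode, as used in the paper
    # Placeholder random policy; a trained DDPG actor would go here.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    if reward == 0:  # sparse reward: 0 only once within the goal threshold
        break
env.close()
```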

II Background in RL

The RL framework considered is based on a Markov Decision Process where an agent interacts with an environment. The environment observations are defined by the state space $\mathcal{S}$, and the agent interacts with the environment through the action space $\mathcal{A}$. The initial state $s_0$ is sampled from a distribution of initial states $p(s_0)$, where $s_0 \in \mathcal{S}$. When an agent performs an action $a_t \in \mathcal{A}$ on the environment, the next state is sampled from the transition probability $p(s_{t+1} \mid s_t, a_t)$, where $s_t, s_{t+1} \in \mathcal{S}$, and a reward is generated from a reward function $r_t = r(s_t, a_t)$.

In RL, the agent aims to find a policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$ that maximizes the cumulative reward $\sum_{t=0}^{T} \gamma^t r_t$, where $T$ is the time horizon and $\gamma \in [0, 1]$ is the discount factor. The Q-Function, $Q^{\pi}(s_t, a_t)$, gives the expected value of the cumulative reward when in state $s_t$, taking an action $a_t$, and thereafter following the policy $\pi$. Therefore an optimal policy $\pi^*$, which aims to maximize the cumulative reward, can be formalized as $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and policies $\pi$. $Q^* = Q^{\pi^*}$ is considered the optimal Q-Function.

There is a substantial amount of research in RL to find the optimal policy. A few examples are policy gradient methods, which solve for the policy directly [21, 22], Q-Learning that solve for the optimal Q-Function [2, 23], and actor-critic methods which find both [24, 25]. OpenAI also created a well established standard in the RL community for developing new environments to allow for easier evaluation of RL algorithms [19].

III Methods

The environments presented inherit from the OpenAI Gym environments and utilize the V-REP simulation of the da Vinci Research Kit developed by Fontanelli et al. [26]. When instantiated, the simulated environment is created and communicated with through V-REP's remote API in synchronous mode. To ensure safe creation and deletion of the simulated environment, the V-REP simulation is run in a separate docker container. This also allows multiple instances of the environments on the same system, which can be utilized for distributed reinforcement learning [27, 28].
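For context, the legacy V-REP remote API lets a Python client drive the simulator in synchronous mode, where each simulation step advances only when the client triggers it. The sketch below shows this pattern with the standard bindings; the port, loop contents, and overall structure are assumptions for illustration, not dVRL's actual implementation.

```python
# Minimal sketch of driving V-REP in synchronous mode through its legacy
# remote API (the Python bindings shipped with V-REP). Port 19997 is V-REP's
# default continuous remote API port; adjust to match the running container.
import vrep  # from V-REP's programming/remoteApiBindings/python

client_id = vrep.simxStart("127.0.0.1", 19997, True, True, 5000, 5)
assert client_id != -1, "could not connect to V-REP"

vrep.simxSynchronous(client_id, True)                        # client controls stepping
vrep.simxStartSimulation(client_id, vrep.simx_opmode_blocking)

for _ in range(100):
    # ... send end-effector / joint set points here ...
    vrep.simxSynchronousTrigger(client_id)                   # advance one simulation step

vrep.simxStopSimulation(client_id, vrep.simx_opmode_blocking)
vrep.simxFinish(client_id)
```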

III-A Simulation Details

Fig. 2: Simulation scene in V-REP of the single PSM arm. This is the fundamental scene that the presented environments, PSM Reach and PSM Pick, are based on. The highlighted EndoWrist portion of the model can be switched with other models to support tool specific surgical tasks.
Fig. 3: Example policy solving the PSM Pick Environment. The purple cylinder is the object, and the red sphere is the goal. From left to right the following is done: move to the object, grasp the object, transport the object to the goal.

The presented environments utilize only one slave arm from the da Vinci® Surgical System, also known as a Patient Side Manipulator (PSM) arm, as shown in Fig. 2. New environments can easily be scaled up through the addition of multiple PSM arms and the endoscopic camera arm. The PSM arms on the da Vinci® Surgical System also have a variety of attachable tools, known as EndoWrists, to accomplish different surgical tasks. The current environments use the Large Needle Driver (LND), which has a jaw gripper to grab objects such as suturing needles. Other tools can be supported in simulation by switching out the tool portion of the model in V-REP.

The environments also work in end-effector space rather than joint space, so trained policies that do not require specific tooling, such as the gripper, can transfer to the real da Vinci® Surgical System across a variety of tools, since each tool has unique kinematics. Furthermore, end-effector control is how surgeons operate the da Vinci® Surgical System, which gives the flexibility to use demonstrations from real operations. For the sake of simplicity, the end-effector orientation is held constant. Therefore, the PSM can be characterized by its three dimensional end-effector position in the base frame, $p_t \in \mathbb{R}^3$, and its jaw angle $j_t$.

To set the workspace for the environments, it is bounded by a range $w$ and centered around a position $c \in \mathbb{R}^3$. The workspace can therefore be written as:

$\mathcal{W} = \{\, p \in \mathbb{R}^3 : |p^{(i)} - c^{(i)}| \leq w, \; i = 1, 2, 3 \,\}$   (1)

where $p^{(i)}$ and $c^{(i)}$ are the $i$-th dimensions of the vectors. In addition, the workspace is limited by the joint limits of the PSM arm and by obstacles in the environment. Currently, a table is the only obstacle, but more obstacles can be added.

The jaw angle $j_t$ is bounded inclusively from 0 to 1, where 0 is completely closed and 1 is completely open. The values $j_t$ takes on directly correlate with the values used on the real LND during operation.

To grasp an object in simulation, a proximity sensor is placed in the gripper of the LND. The object is considered rigidly attached to the gripper if the jaw angle is less than 0.25 and the proximity sensor is triggered. In one of the presented environments, there is a single, small cylindrical object, and only its three dimensional position in the PSM arm base frame, $o_t \in \mathbb{R}^3$, is utilized in the state space.
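The grasp condition therefore reduces to a simple predicate; a minimal sketch with illustrative names:

```python
JAW_GRASP_THRESHOLD = 0.25  # jaw angle below which a grasp can occur

def object_grasped(jaw_angle: float, proximity_triggered: bool) -> bool:
    """Object is treated as rigidly attached to the gripper when the jaw
    is sufficiently closed and the proximity sensor detects the object."""
    return jaw_angle < JAW_GRASP_THRESHOLD and proximity_triggered
```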

Due to the millimeter scale at which the PSM arms operate, positions are normalized by the range of the environment. Normalization of both states and actions is regularly used by popular RL libraries, and performance improvements have been found empirically [29, 30]. The normalized end-effector position and object position are:

$\bar{p}_t = \dfrac{p_t - c}{w}$   (2)

$\bar{o}_t = \dfrac{o_t - c}{w}$   (3)

Another advantage of making the states relative to $c$ is that the learned policies can be rolled out at various joint configurations by re-centering the states.
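In code, the normalization of Equations (2) and (3) and the re-centering trick amount to a few lines; the function and argument names below are illustrative, not dVRL's API:

```python
import numpy as np

def normalize(position: np.ndarray, center: np.ndarray, w: float) -> np.ndarray:
    """Map a position in the PSM base frame into the workspace-scaled
    coordinates of Eqs. (2) and (3)."""
    return (position - center) / w

def recenter(normalized_state: np.ndarray, old_center: np.ndarray,
             new_center: np.ndarray, w: float) -> np.ndarray:
    """Re-express a normalized state about a new workspace center so a learned
    policy can be rolled out from a different joint configuration."""
    return normalized_state + (old_center - new_center) / w
```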

Since the orientation is fixed and the PSM arms are operated in end-effector space, the actions change the end-effector position and set the jaw angle directly. This matches the real da Vinci® Surgical System. To keep the actions normalized between -1 and 1, the next-state equation for the PSM arm is:

$p_{t+1} = p_t + \eta \, \Delta p_t$   (4)

$j_{t+1} = \Delta j_t$   (5)

where $\Delta p_t$ and $\Delta j_t$ are bounded from -1 to 1 and are considered the actions that can be applied to the environment. The term $\eta$ is critical to ensuring effective transfer of policies from simulation to the real robot. On the da Vinci® Research Kit, joint-level control is utilized [31], so every new end-effector position gives new set points for the joint angles through inverse kinematics. This means overshoot or even instability can occur if the difference between the new set point and the current joint angle is too great. By choosing a value for $\eta$ that ensures negligible overshoot and no instability on the real robot, no dynamics are required for the simulation of the PSM arm, which significantly speeds up the simulation time. Furthermore, prior work has shown the difficulty of modelling the dynamics of the PSM arm, and currently not all dynamic parameters can be explicitly solved for [32].
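A sketch of the resulting kinematic update of Equations (4) and (5), with the actions clipped to their bounds; the value of $\eta$, the clamping of the jaw command to its [0, 1] range, and all names are illustrative assumptions:

```python
import numpy as np

ETA = 0.001  # illustrative step scale in meters (millimeter order); tuned on the dVRK

def apply_action(p_t: np.ndarray, delta_p: np.ndarray, delta_jaw: float):
    """Kinematic next-state update: no dynamics are simulated, the commanded
    end-effector set point simply moves by a small, bounded increment (Eq. 4)
    and the jaw angle is set directly (Eq. 5)."""
    delta_p = np.clip(delta_p, -1.0, 1.0)
    p_next = p_t + ETA * delta_p
    jaw_next = float(np.clip(delta_jaw, 0.0, 1.0))  # clamp to the jaw's physical range
    return p_next, jaw_next
```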

III-B PSM Reach Environment

The PSM Reach environment is similar to the Fetch Reach environment [33]. The environment aims to find a policy to move the PSM arm to a goal position $g$ given a starting position $p_0$. This type of environment is called a goal environment, where an agent is capable of accomplishing multiple goals in a single environment [34]. The state and action spaces of the environment are:

$s_t = [\, \bar{p}_t, \; \bar{g} \,]$   (6)

$a_t = \Delta p_t$   (7)

where $\bar{g}$ is normalized in a similar fashion to Equations (2) and (3). When resetting the environment to begin training, $g$ and $p_0$ are uniformly sampled from the workspace previously specified. The reward function is:

$r(s_t, g) = \begin{cases} 0 & \text{if } \lVert p_t - g \rVert_2 \leq \delta \\ -1 & \text{otherwise} \end{cases}$   (8)

where $\delta$ is the threshold distance. By giving a negative reward until the goal is reached, the policy is also encouraged to minimize the distance to the goal. Note that this environment only uses the end-effector position, so the policy can be applied with all EndoWrists.
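This sparse reward matches the compute_reward convention of Gym goal environments; a minimal sketch using the 3 mm threshold from the experiments (the signature mirrors Gym's GoalEnv and is an assumption about dVRL's exact implementation):

```python
import numpy as np

DELTA = 0.003  # 3 mm threshold distance used in the experiments

def compute_reward(achieved_goal: np.ndarray, desired_goal: np.ndarray,
                   info=None) -> float:
    """Sparse reward of Eq. (8): 0 once within the threshold distance of the
    goal, -1 otherwise."""
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 0.0 if distance <= DELTA else -1.0
```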

III-C PSM Pick Environment

The PSM Pick environment is also a goal environment, and is similar to the Fetch Pick environment [33]. The agent needs to reach the object at $o_0$ from a starting position $p_0$, grasp the object, and move it to the goal position $g$. This sequence is shown in Fig. 3. The state and action spaces are:

$s_t = [\, \bar{p}_t, \; j_t, \; \bar{o}_t, \; \bar{g} \,]$   (9)

$a_t = [\, \Delta p_t, \; \Delta j_t \,]$   (10)

Similar to the PSM Reach environment, $g$ is uniformly sampled from the workspace when resetting the environment. The starting position of the object is directly below the gripper, on the table. The reward function is:

$r(s_t, g) = \begin{cases} 0 & \text{if } \lVert o_t - g \rVert_2 \leq \delta \\ -1 & \text{otherwise} \end{cases}$   (11)

where $\delta$ is once again the threshold distance.
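A sketch of how the PSM Pick observation could be assembled in the Gym goal-environment layout; the dict keys and helper name are assumptions based on Gym's GoalEnv convention, not dVRL's confirmed interface:

```python
import numpy as np

def pick_observation(p_norm: np.ndarray, jaw: float,
                     o_norm: np.ndarray, g_norm: np.ndarray) -> dict:
    """Assemble the PSM Pick observation of Eq. (9) in a goal-env layout."""
    return {
        "observation": np.concatenate([p_norm, [jaw], o_norm]),
        "achieved_goal": o_norm.copy(),   # reward is computed on the object position
        "desired_goal": g_norm.copy(),
    }
```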

IV Experiments

To show the efficiency of the simulated environments, performance measurements are made. State-of-the-art RL algorithms are utilized to solve the environments in simulation. The learned policies are then transferred to the real da Vinci® Surgical System using the da Vinci Research Kit (dVRK) [31] running at 50 Hz. The policy transfer is evaluated both individually, by replicating the simulated scene, and through completion of the surgical tasks: suction and debris removal. Both the training of the RL policies and the dVRK ran on an Intel® Core™ i9-7940X processor and an NVIDIA GeForce RTX 2080.

IV-A Solving Environments

Both the PSM Reach and PSM Pick environments are given 100 steps per episode with no early termination, and the threshold $\delta$ is set to 3 mm. The range $w$ is set to 5 cm and 2.5 cm for PSM Reach and PSM Pick, respectively. Through experimentation on a da Vinci® Surgical System with the dVRK, $\eta$ was set to the highest value, on the order of millimeters, at which the PSM joints do not overshoot at 50 Hz.

The environments are solved in simulation using Deep Deterministic Policy Gradients (DDPG) [25]. DDPG is from the class of actor-critic algorithms, approximating both the policy and the Q-Function with separate neural networks. The Q-Function is optimized by minimizing the Bellman error loss:

$L_{critic} = \mathbb{E}\left[ \left( Q(s_t, a_t) - \left( r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) \right) \right)^2 \right]$   (12)

and the policy is optimized by minimizing:

$L_{actor} = -\mathbb{E}\left[ Q(s_t, \pi(s_t)) \right]$   (13)

Hindsight Experience Replay (HER) is used as well to generate new experiences for faster training [34]. HER generates new experiences for the optimization of the policy and/or Q-Function where the goal portion of the state is replaced with previously achieved goals. This improves the sample efficiency of the algorithms and combats the challenge of sparse rewards, which is the case for both the PSM Reach and PSM Pick environments.
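As a reminder of the mechanism, HER simply relabels stored transitions with goals that were actually achieved later in the same episode, so even unsuccessful rollouts yield reward signal. Below is a minimal sketch of the "future" relabeling strategy under illustrative data structures; it is not the Baselines implementation used here:

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """Given one episode of (state, action, achieved_goal, desired_goal)
    tuples, emit extra transitions whose goal is replaced by an achieved
    goal observed at a later time step (the 'future' strategy)."""
    relabeled = []
    for t, (s, a, ag, _dg) in enumerate(episode[:-1]):
        for _ in range(k):
            future = random.randint(t + 1, len(episode) - 1)
            new_goal = episode[future][2]             # an achieved goal seen later
            r = compute_reward(episode[t + 1][2], new_goal)
            next_state = episode[t + 1][0]
            relabeled.append((s, a, new_goal, r, next_state))
    return relabeled
```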

The size of the state space relative to the maximum action step is very large in the presented environments. This makes exploration very challenging, especially for the PSM Pick environment. To overcome this, demonstrations that reach the goal are generated in simulation, and the behavioral cloning loss:

$L_{BC} = \sum_{i} \lVert \pi(s_i) - a_i \rVert^2$   (14)

computed over demonstration transitions $(s_i, a_i)$, is used to augment the DDPG policy loss, as done by Nair et al. [35]. The OpenAI Baselines implementation and hyperparameters of DDPG + HER, with the addition of the behavioral cloning augmentation, were used [29].
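Conceptually, the resulting actor objective is a weighted sum of the DDPG policy loss of Equation (13) and the behavioral cloning loss of Equation (14) evaluated on demonstration transitions. The numpy sketch below illustrates the combination; the weighting and the omission of the Q-filter from Nair et al. [35] are simplifications for illustration:

```python
import numpy as np

LAMBDA_BC = 1.0  # illustrative relative weight of the cloning term

def actor_loss(q_values: np.ndarray,
               policy_actions_on_demos: np.ndarray,
               demo_actions: np.ndarray) -> float:
    """Combined actor objective: maximize Q on sampled states (Eq. 13)
    while staying close to demonstrated actions (Eq. 14)."""
    ddpg_term = -np.mean(q_values)  # -E[Q(s, pi(s))]
    bc_term = np.mean(np.sum((policy_actions_on_demos - demo_actions) ** 2, axis=-1))
    return ddpg_term + LAMBDA_BC * bc_term
```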

IV-B Transfer to Real World

Using the LND tool with the dVRK, the policies are tested on the real da Vinci® Surgical System after completing training in simulation. The positional state information for the end-effector is found by calculating forward kinematics from encoder readings. The PSM Reach policy transfer is evaluated by giving random goal locations and checking whether the threshold distance to the goal is met. The PSM Pick environment is rolled out in a recreated scene of the simulation, including the initial PSM position, initial object position, and table location. To simplify the recreated scene, the object is assumed rigidly attached to the end-effector whenever the jaw is closed, similar to how the object is grasped in simulation, but this time without the proximity sensor. The object in this experiment is a small sponge.

IV-C Suction & Irrigation Tool

The PSM Reach policy can be rolled out on any EndoWrist since it does not use any tool-specific action. To show this, both the LND and the Suction & Irrigation EndoWrists were used to roll out the PSM Reach policy on the real da Vinci® Surgical System. The Denavit-Hartenberg (DH) parameters for both tools are shown in Table I. The table highlights the variability of the kinematics across EndoWrists. Note that $q_i$ for $i = 1, \ldots, 6$ is the joint configuration, $a$ and $\alpha$ represent the positional and rotational change respectively along the x-axis relative to the previous frame, and $d$ and $\theta$ represent the positional and rotational change respectively along the z-axis relative to the frame transformed by $a$ and $\alpha$.

TABLE I: DH parameters for the LND and Suction & Irrigation EndoWrists (frame-by-frame $a$, $\alpha$, $\theta$, $d$ values for each tool).
Disk 1 Disk 2 Disk 3 Disk 4
Pitch 0.6 0.6 0 0
Yaw -0.6 0.6 0 0
Suction 0 0 1 0
Irrigation 0 0 0 1
TABLE II:
Actuator to joint matrix for Suction & Irrigation tool

The Suction & Irrigation tool was integrated into the dVRK with slight modifications to the configuration files. The actuator-to-joint matrix for the EndoWrist portion of the Suction & Irrigation tool is given in Table II for reference. Furthermore, the end-effector is set using an analytical inverse kinematics solution computed from the position and direction of the end-effector and the DH parameters of Table I. Note that the orientation of the Suction & Irrigation tool can be defined by a single directional vector since the tool tip is symmetric about the roll axis.

IV-D Suction and Debris Removal

Fig. 4: The surgical scene for suction and debris removal tasks. Fake blood is removed from a simulated abdomen using a learned policy to reveal debris that must be removed by either an expert surgeon or another learned policy.

A simulated abdomen was created by molding pig liver, sausage, and pork rinds in gelatin. The gelatin mold has two large cavities that can be filled with fake blood made from food coloring and water. The surgical task is to use the Suction & Irrigation tool to remove the fake blood and the LND to grasp the debris revealed by the suction and hand it to the first assistant. The debris used is a 3 mm by 28 mm dowel spring pin. The setup for the surgical scene is shown in Fig. 4.

The suction tool uses the policy trained in the PSM Reach environment. The experiment was performed twice: once with the LND teleoperated by an expert surgeon who regularly provides care with the da Vinci® Surgical System, and once with the LND controlled autonomously, using the learned PSM Pick and PSM Reach policies to grasp the debris and hand it to the first assistant. For the policies, the goal locations are preset by manually moving the arms to the goals and saving the positions. The PSM Pick task in this experiment also uses the same grasping simplification described previously. To bring the LND into position to pick the debris, the learned PSM Reach policy is used.

V Results

Fig. 5: Results of training PSM Reach and PSM Pick using DDPG + HER and Behavioral Cloning (BC). Each epoch consists of six environments rolling out 50 times per environment for training. The success rate is the fraction of 50 separate evaluation runs in which the final state reaches the goal within the threshold.

Fig. 8: The suction tool using a trained PSM Reach policy to remove fake blood and reveal debris so the surgeon can remove it from the simulated abdomen. After being located and removed from the simulated abdomen via teleoperational control, the debris is handed off to the first assistant.

The timing results of the environments are shown in Table III. As seen in the table, the parallelization achieved by running the simulations in separate docker containers can allow for more efficient training of RL algorithms. The results from training both PSM Reach and PSM Pick with DDPG + HER are shown in Fig. 5. Note that a rollout is considered successful if the final state gives a reward of 0, which occurs when the goal is reached within the threshold distance. Without behavioral cloning, we were unable to solve the PSM Pick environment. When analyzing the final trained PSM Reach policy, we found it can reach the goal with 100% success rate if given 1000 simulation steps instead of 100.

Num. of Env. PSM Reach PSM Pick
1 2.09 sec 2.09 sec
2 2.36 sec 2.35 sec
4 2.78 sec 2.78 sec
6 3.03 sec 3.02 sec
8 3.27 sec 3.26 sec
TABLE III:
Timing Results of one rollout per Environment

Photos of rolling out the learned PSM Reach and PSM Pick policies are shown in Fig. 7. The policies used were the final PSM Reach policy and the final PSM Pick policy with Behavioral Cloning from training. Both policies were able to reach the threshold distance of 3 mm with 100% success rate for ten randomly chosen goal locations.

Photos showing the surgical suction and debris removal are in Figs. 8 and 9. The suction tool, utilizing the learned PSM Reach policy, reached the threshold distance of 3 mm for every goal and removed the fake blood in both experiments. For the autonomous debris removal, the learned PSM Pick policy on the LND successfully grasped all the debris and reached the threshold distance of 3 mm. The learned PSM Reach policy on the LND also successfully handed all the debris to the first assistant and reached the threshold distance.

Fig. 7: Trained PSM Reach and PSM Pick policies rolled out on the da Vinci® Surgical System, shown in the left and right figures, respectively.

VI Discussion and Conclusion

In this work, we present the first open-sourced reinforcement learning environments for surgical robotics, called dVRL. dVRL provides RL environments syntactically common with OpenAI Gym, backed by a simulation of the da Vinci® Surgical System, a widely used platform with an international network of academic research systems onto which learned policies can be transferred. Using state-of-the-art techniques from the RL community, such as DDPG and HER, we show that control policies were effectively learned through dVRL and, importantly, could be transferred to a real robot with minimal effort. Under a realistic surgeon-collaborative surgical setting, the learned policies could be used to share tasks in locating and assisting in debris removal. We see dVRL as enabling the broad surgical robotics community to fully leverage the newest strategies in reinforcement learning, and reinforcement learning scientists with no previous domain knowledge of surgical robotics to test and develop new algorithms that can have a real-world, positive impact on patient care and the future of autonomous surgery.

Under dVRL, many options exist moving forward. First, including new objects in the simulator, such as new instruments, needles, gauze, and thread, will advance the simulator's capabilities. Soft-tissue simulation, even at a coarse level, would be extremely useful for achieving greater depth of realism. Modeling of endoscopic stereo cameras, with their uniquely tight disparities and narrow field of view, would allow visual servoing and visuo-motor policy approaches to be explored.

Fig. 9: The suction tool using a trained PSM Reach policy to remove fake blood to reveal debris. After the debris is revealed, the Large Needle Driver utilized a composition of trained PSM Reach and PSM Pick policies to remove the debris and hand it to the first assistant.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
  • [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
  • [4] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
  • [5] J. Tobin, L. Biewald, R. Duan, M. Andrychowicz, A. Handa, V. Kumar, B. McGrew, A. Ray, J. Schneider, P. Welinder, et al., “Domain randomization and generative models for robotic grasping,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489, IEEE, 2018.
  • [6] A. J. Hung, J. Chen, D. H. Anthony Jarc, H. Djaladat, and I. S. Gilla, “Development and validation of objective performance metrics for robot-assisted radical prostatectomy: A pilot study,” The Journal of Urology, vol. 199, pp. 296–304, Jan 2018.
  • [7] F. Richter, R. K. Orosco, and M. C. Yip, “Motion scaling solutions for improved performance in high delay surgical teleoperation,” arXiv preprint arXiv:1902.03290, 2019.
  • [8] F. Richter, Y. Zhang, Y. Zhi, R. K. Orosco, and M. C. Yip, “Augmented reality predictive displays to help mitigate the effects of delayed telesurgery,” arXiv preprint arXiv:1809.08627, 2018.
  • [9] M. Yip and N. Das, "Robot autonomy for surgery," ch. 10, pp. 281–313.
  • [10] T. Osa, N. Sugita, and M. Mitsuishi, “Online trajectory planning in dynamic environments for surgical task automation.,” in Robotics: Science and Systems, pp. 1–9, 2014.
  • [11] J. Van Den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X.-Y. Fu, K. Goldberg, and P. Abbeel, “Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations,” in 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 2074–2081, IEEE, 2010.
  • [12] J. J. Ji, S. Krishnan, V. Patel, D. Fer, and K. Goldberg, “Learning 2d surgical camera motion from demonstrations,” in 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pp. 35–42, IEEE, 2018.
  • [13] O. Weede, H. Mönnich, B. Müller, and H. Wörn, “An intelligent and autonomous endoscopic guidance system for minimally invasive surgery,” in 2011 IEEE International Conference on Robotics and Automation, pp. 5762–5768, IEEE, 2011.
  • [14] B. Thananjeyan, A. Garg, S. Krishnan, C. Chen, L. Miller, and K. Goldberg, “Multilateral surgical pattern cutting in 2d orthotropic gauze with deep reinforcement learning policies for tensioning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2371–2378, IEEE, 2017.
  • [15] A. Murali, S. Sen, B. Kehoe, A. Garg, S. McFarland, S. Patil, W. D. Boyd, S. Lim, P. Abbeel, and K. Goldberg, “Learning by observation for surgical subtasks: Multilateral cutting of 3d viscoelastic and 2d orthotropic tissue phantoms,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1202–1209, IEEE, 2015.
  • [16] B. Kehoe, G. Kahn, J. Mahler, J. Kim, A. Lee, A. Lee, K. Nakagawa, S. Patil, W. D. Boyd, P. Abbeel, et al., “Autonomous multilateral debridement with the raven surgical robot,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1432–1439, IEEE, 2014.
  • [17] J. Mahler, S. Krishnan, M. Laskey, S. Sen, A. Murali, B. Kehoe, S. Patil, J. Wang, M. Franklin, P. Abbeel, et al., “Learning accurate kinematic control of cable-driven surgical robots using data cleaning and gaussian process regression,” in 2014 IEEE International Conference on Automation Science and Engineering (CASE), pp. 532–539, IEEE, 2014.
  • [18] D. Seita, S. Krishnan, R. Fox, S. McKinley, J. Canny, and K. Goldberg, “Fast and reliable autonomous surgical debridement with cable-driven robots using a two-phase calibration procedure,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6651–6658, IEEE, 2018.
  • [19] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” CoRR, vol. abs/1606.01540, 2016.
  • [20] “da vinci research kit wiki.” https://research.intusurg.com/index.php/Main_Page. Accessed: 2019-02-20.
  • [21] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, pp. 1057–1063, 2000.
  • [22] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
  • [23] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning.,” in AAAI, vol. 2, p. 5, Phoenix, AZ, 2016.
  • [24] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, pp. 1008–1014, 2000.
  • [25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [26] G. A. Fontanelli, M. Selvaggio, M. Ferro, F. Ficuciello, M. Vendittelli, and B. Siciliano, “A v-rep simulator for the da vinci research kit robotic platform,” in BioRob, 2018.
  • [27] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al., “Massively parallel methods for deep reinforcement learning,” arXiv preprint arXiv:1507.04296, 2015.
  • [28] N. Ono and K. Fukumoto, “Multi-agent reinforcement learning: A modular approach,” in Second International Conference on Multiagent Systems, pp. 252–258, 1996.
  • [29] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “Openai baselines.” https://github.com/openai/baselines, 2017.
  • [30] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, pp. 1329–1338, 2016.
  • [31] P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio, “An open-source research kit for the da vinci ®surgical system,” IEEE Intl. Conf. on Robotics and Automation, pp. 6434–6439, 2014.
  • [32] G. A. Fontanelli, F. Ficuciello, L. Villani, and B. Siciliano, “Modelling and identification of the da vinci research kit robotic arms,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1464–1469, IEEE, 2017.
  • [33] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv preprint arXiv:1802.09464, 2018.
  • [34] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in Neural Information Processing Systems, 2017.
  • [35] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299, IEEE, 2018.