Code for "SurRoL: An Open-source Reinforcement Learning Centered and dVRK Compatible Platform for Surgical Robot Learning"
Autonomous surgical execution relieves tedious routines and surgeons' fatigue. Recent learning-based methods, especially reinforcement learning (RL) based methods, achieve promising performance for dexterous manipulation, which usually requires simulation to collect data efficiently and reduce hardware cost. Existing learning-based simulation platforms for medical robots suffer from limited scenarios and simplified physical interactions, which degrade the real-world performance of learned policies. In this work, we design SurRoL, an RL-centered simulation platform for surgical robot learning that is compatible with the da Vinci Research Kit (dVRK). SurRoL integrates a user-friendly RL library for algorithm development with a real-time physics engine, which is able to support more PSM/ECM scenarios and more realistic physical interactions. Ten learning-based surgical tasks, common in real-world autonomous surgical execution, are built into the platform. We evaluate SurRoL using RL algorithms in simulation, provide in-depth analysis, deploy the trained policies on the real dVRK, and show that SurRoL achieves better transferability to the real world.
Nowadays, robotic surgery systems, such as the da Vinci® system, have been widely used in minimally invasive surgeries, including urology, gynecology, cardiothoracic, and many other procedures. Recently, there has been increasing interest in autonomous execution of surgical tasks or sub-tasks, especially with the help of the open-source da Vinci Research Kit (dVRK), which significantly relieves tedious routines and reduces the surgeon's fatigue. Nonetheless, designing manually tuned control policies requires substantial task-specific expertise and a complicated development process [29, 26, 31].
Learning-based methods, especially reinforcement learning (RL) based methods, provide a promising alternative to such manual effort. These approaches are able to develop controllers for complex skills and generalize to a broader range of tasks and environments [8, 3]. However, robot learning typically requires a large amount of labeled data and interactions with the environment [16, 18, 30], which is usually infeasible on real surgical robots due to the expensive time cost and hardware wear and tear.
One intuitive way to efficiently collect data and quickly prototype learning-based algorithms is to use simulation, where a set of labeled training data is generated by the computer. Preliminary works mitigate the limited-access situation by proposing medical robot simulation platforms with robotics tasks [6, 22]. More recently, the learning-based platforms dVRL and UnityFlexML build RL simulation environments for surgical robots on top of V-REP and Unity, paving the way for follow-up research on surgical manipulation and perception.
However, the existing learning-based platforms only support limited scenarios in their simulated environments [28, 32], as detailed in Table I. Models trained on such platforms ignore some important scenarios, such as bimanual patient side manipulator (PSM) manipulation and endoscopic camera manipulator (ECM) control. Moreover, the physical interactions supported by current learning-based simulators are simplified. For example, they consider an object successfully grasped when the relative distance between the jaw tip and the object is smaller than a threshold. Models trained in such simplified settings may suffer from the reality gap and fail to transfer to the real world.
In this work, we build a novel surgical robotic simulation platform, SurRoL, which is an open-source RL-centered and dVRK compatible simulation platform for Surgical Robot Learning. The system design of SurRoL is shown in Fig. 1. Our SurRoL is able to support more surgical operation scenarios by incorporating more single-handed/bimanual PSM(s) and ECM control tasks. Further, the designed SurRoL with carefully modeled assets can successfully deal with more realistic physical interactions. Code is publicly available at https://github.com/med-air/SurRoL.
Our main contributions are summarized as follows:
Table I: Comparison of learning-based surgical robot simulation platforms.

| Platform | Physics | Objects | ECM Support | Action DoF | Bimanual Task | Task Number | Interface |
|---|---|---|---|---|---|---|---|
| dVRL | Static+ | Cylinder | ✗ | 3 | ✗ | 2 | Python, V-REP |
| UnityFlexML | Static+ | Fat tissue | ✗ | 3 | ✗ | 1 | Python, Unity |
| SurRoL (ours) | Dynamic | Needle, Block, etc. | ✓ | 4 | ✓ | 10 | Python |

Static+: grasps the object using a simplified attachment manner with limited physical interaction.
- We design an open-source surgical robot learning simulation platform centered on reinforcement learning for surgical skills, which enables low-cost data collection and accelerates the development of learning-based surgical robotic methods.
- We build a dVRK-compatible simulated environment on a real-time physics engine, with diverse surgical contents and physical interactions. We build ten tasks (e.g., single-handed/bimanual PSM and ECM manipulation) in the platform, which are common in real-world autonomous surgical execution.
- We conduct extensive experiments for RL algorithm evaluation in simulation using the proposed tasks, provide in-depth analysis, and deploy the trained policies on the real dVRK. Results show that SurRoL, with its richer physical interactions, achieves better transferability to the real world.
Most of deep RL's success in complex robotic manipulation skills originates from large amounts of interaction, using real-world robots or physics simulations. Recent approaches leverage a data-driven manner to iteratively collect data with physical robots and optimize the policy for continuous control, including grasping, poking, door opening, etc. However, few dVRK systems are available worldwide, and their use carries stricter safety concerns. Alternatively, simulation is a proxy for real robots, with the benefits of low time cost and safety guarantees. Still, due to the non-trivial development process, there has been no learning-based, dVRK-compatible simulation environment with a range of surgical contents, tasks, and reasonable physical interaction.
Previous robotics efforts in surgical skills concentrate on sophisticated controllers specifically designed for sub-tasks, including looping, knot-tying, needle manipulation [29, 31], cutting, tissue dissection, and endoscopic guidance [25, 15]. Although these carefully tuned methods handle separate tasks reasonably well, designing such algorithms requires substantial expertise and raises generalization concerns. Instead, learning-based methods, typically RL, demonstrate a significant advantage in task generalization and surgical automation with improved performance. Therefore, we propose an easy-to-use simulated environment with low data collection cost and state-of-the-art reinforcement learning to facilitate the development of surgical robotic manipulation.
With the advancement of physics simulation and the demand for RL algorithm development, there has been a surge of robotics simulation platforms. OpenAI Gym is widely used in RL as a benchmark. Other platforms focus on different features, e.g., RLBench with a wide range of manipulation tasks, SAPIEN with home assistant robots, and RoboSuite with reproducible research. However, there have been few simulation attempts for surgical robots. Fontanelli et al. relieve the situation by developing a dVRK V-REP simulator with operation scenes. Though AMBF can produce dynamic environments with medical robots, it provides minimal learning environment support. The works most related to ours are dVRL and UnityFlexML, reinforcement learning platforms for the dVRK. However, their low task capacity with limited physical interaction restricts their functionality and sim-to-real transferability. In this work, we develop a robot learning environment with improved scenarios and physics simulation, opening the way for future progress in surgical manipulation.
To provide a simulated platform for surgical robot learning, we first build a user-friendly RL library for agents to interact with. Then, we construct the dVRK robots and surgical contents on top of the physics engine. Finally, ten learning-based surgical tasks are built for algorithm development and evaluation. SurRoL builds on the open-source PyBullet because of its state-of-the-art physics simulation, wide adoption in the machine learning community, and freedom from the limits of commercial software such as V-REP.
SurRoL enables surgical robot learning by providing the widely used Gym-like RL environment interface for algorithm development and evaluation.
Given the partially observed model of the system dynamics, we formulate the manipulation problem as a Markov Decision Process (MDP), represented by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. At each step $t$, the agent interacts with the environment, receives the current state $s_t$ and a reward $r_t$ based on the task specification, and generates an action $a_t$ according to its policy, which forms a trajectory $\tau = (s_0, a_0, r_0, \dots, s_T)$ with discount factor $\gamma$, where $T$ denotes the episode time horizon. The transition probability $P(s_{t+1} \mid s_t, a_t)$ is computed by the underlying physics engine.
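With these definitions, the discounted return of a trajectory is standard RL bookkeeping; a minimal helper (not code from the platform):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over an episode's reward sequence.

    Iterating backwards accumulates g_t = r_t + gamma * g_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, two unit rewards with `gamma=0.5` give a return of `1 + 0.5 = 1.5`.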
In practice, the dVRK robot base frame is frequently changed relative to the virtual world frame, so Cartesian-space control is used as the action space, which is easier to transfer across different settings. Though SurRoL supports six degrees of freedom (DoF) of motion, in this work we focus on PSM tasks with rotation restricted to a plane. Specifically, we restrict the action space to $(\Delta x, \Delta y, \Delta z, \Delta\theta, j)$, where $\Delta x, \Delta y, \Delta z$ determine the position movement in Cartesian space, $\Delta\theta$ determines the orientation movement (yaw or pitch) in a top-down or vertical setting, respectively, and $j$ determines whether the jaw is open or closed. For the ECM, besides Cartesian-space position control, the velocity of the camera in its own frame or roll angle control is used when the observation is in the camera space.
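A minimal sketch of how such a normalized 5-D action could be mapped onto the tip pose; `POS_SCALE`, `ROT_SCALE`, and the jaw sign convention are illustrative assumptions, not SurRoL's actual values:

```python
import numpy as np

# Hypothetical scaling constants for one control step.
POS_SCALE = 0.01             # metres per unit action
ROT_SCALE = np.deg2rad(15)   # radians per unit action

def apply_action(tip_pose, action):
    """Map a normalised 5-D action (dx, dy, dz, dtheta, jaw) onto the tip pose.

    tip_pose: (x, y, z, theta); returns the new pose and a jaw-open flag.
    """
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    x, y, z, theta = tip_pose
    new_pose = (x + POS_SCALE * a[0],
                y + POS_SCALE * a[1],
                z + POS_SCALE * a[2],
                theta + ROT_SCALE * a[3])
    jaw_open = a[4] > 0  # the sign convention here is an assumption
    return new_pose, jaw_open
```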
SurRoL supports two observation modes: (i) low-dimensional ground-truth states (e.g., object 3D Cartesian positions, 6D poses), and (ii) high-dimensional RGB, depth, and mask images rendered by OpenGL. The first mode abstracts away the perception procedure and lets the agents concentrate on sample-efficient continuous control learning. The latter requires raw image perception, which is essential in robotic control. In this work, we focus on low-level continuous control skills for reinforcement learning, as some of the built tasks are challenging even in this setting. Unless stated otherwise, we use the low-dimensional object state (object position, orientation, etc.) and robot proprioceptive features (tip position, jaw status), represented by a fixed-length vector, as the observation.
As reward shaping can be difficult to scale in practice, most SurRoL tasks are goal-based. Given the goal requirement and a success-check function, the agent receives a binary reward: a negative reward at each step unless the goal requirement is met. In the ECM continuous tracking task, by contrast, the tracked object is constantly moving, so a dense reward function is designed that encourages the agent to follow the target.
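A minimal sketch of such a goal-based sparse reward; the 5 mm tolerance is an illustrative assumption (the actual tolerance is task-dependent):

```python
import numpy as np

def is_success(achieved_goal, desired_goal, tol=0.005):
    """Success when the achieved goal is within a distance tolerance (metres)."""
    diff = np.asarray(achieved_goal) - np.asarray(desired_goal)
    return np.linalg.norm(diff) < tol

def sparse_reward(achieved_goal, desired_goal, tol=0.005):
    """0 on success, -1 otherwise, as in goal-based Gym environments."""
    return 0.0 if is_success(achieved_goal, desired_goal, tol) else -1.0
```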
Reinforcement learning algorithms aim to achieve the specified goal by learning a policy that maximizes the expected return. Our RL library is compatible with the popular OpenAI Gym, which provides an easy-to-use interface for state-of-the-art RL algorithm evaluation and benchmarking, such as DDPG, PPO, etc. Meanwhile, our tasks, detailed in Section III-C, involve long-horizon reasoning and can enable future research with more recent RL advances, e.g., learned skill priors.
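The Gym-style interaction loop used for evaluation can be sketched as follows, with a toy stand-in environment; `DummyReachEnv` is hypothetical and only mimics the `reset`/`step` protocol, not a SurRoL task:

```python
class DummyReachEnv:
    """Toy stand-in for a Gym-style goal-based task (not the real SurRoL env)."""

    def __init__(self, horizon=50):
        self.horizon = horizon

    def reset(self):
        self.t = 0
        return {"observation": 0.0, "desired_goal": 1.0}

    def step(self, action):
        self.t += 1
        success = action == 1.0
        reward = 0.0 if success else -1.0   # sparse goal-based reward
        done = self.t >= self.horizon
        obs = {"observation": action, "desired_goal": 1.0}
        return obs, reward, done, {"is_success": success}

def rollout(env, policy, horizon=50):
    """Run one episode and accumulate the episode return."""
    obs = env.reset()
    total = 0.0
    for _ in range(horizon):
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```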
We build the dVRK compatible simulation environment by supporting PSM and ECM manipulation with diverse surgical contents, based on the state-of-the-art physics simulation with relatively rich robotic interactions.
We build our simulation environment based on PyBullet, a Python wrapper API for the real-time Bullet physics engine. Unlike previous works that approximate grasping by attaching the object to the jaw when the tip-object relative distance is below a certain threshold [28, 32], we consider more realistic scenarios by enabling inter-object physical interactions and friction-based grasping. A grasp is considered stable only if the PSM can lift the grasped object above a threshold, which introduces realism and difficulty into low-level skill learning.
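The lift-based stability check can be sketched as a small predicate; the 1 cm threshold and the explicit contact flag are illustrative assumptions:

```python
def grasp_is_stable(object_z, table_z, lift_threshold=0.01, contact=True):
    """A grasp counts as stable only if the jaw has real (friction-based)
    contact with the object AND has lifted it above the table surface by a
    threshold (assumed 1 cm here; SurRoL's actual value may differ)."""
    return contact and (object_z - table_z) > lift_threshold
```

In contrast, the simplified attachment manner in prior platforms would report success as soon as the tip-object distance drops below a threshold, with no contact or lifting required.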
Our simulation platform considers the manipulation of both PSM and ECM, compatible with the dVRK interface, as shown in Fig. 2. We build our dVRK robots based on the meshes from AMBF. As dVRK robots contain many redundant mechanisms with parallel linkages, we rebuild the link frames into a serially linked kinematic chain and use the built-in inverse kinematics. While PyBullet supports off-the-shelf velocity and torque control, the dynamics discrepancy between simulation and the real world is more significant than for position control, which is beyond the scope of this work. The simulated robots exhibit identical joint-space and Cartesian-space behavior to the real dVRK, which allows commonly used high-level control and smooth transfer.
PSM has seven DoFs, of which we consider the first six, since the last DoF corresponds to the jaw angle. PSM includes revolute (R) and prismatic (P) actuated joints in an RRPRRR sequence (Fig. 2, a). ECM is a 4-DoF actuated arm with an RRPR sequence (Fig. 2, b). Note that the tip pose calculated from forward kinematics is not the final jaw/camera pose. We adopt a transformation matrix to transform the tip pose into the tool pose. Finally, we acquire the tool pose relative to the remote center of motion (RCM) as the base frame.
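The tip-to-tool correction is a standard composition of homogeneous transforms; the helper below is a minimal sketch (the actual fixed offset matrix used in SurRoL is not reproduced here):

```python
import numpy as np

def translation(x, y, z):
    """4x4 homogeneous transform for a pure translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def tool_pose_from_tip(tip_pose_rcm, tip_to_tool):
    """Tool pose in the RCM base frame: the kinematic tip pose
    right-multiplied by a fixed tip-to-tool offset transform."""
    return tip_pose_rcm @ tip_to_tool
```

For example, a 1 cm offset along the tip's local z-axis shifts the tool origin accordingly while leaving the orientation unchanged.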
To enrich the manipulated contents and reflect the challenges during control, we create the SurRoL object assets (e.g., suture needle and pegboard), modeled using Blender. All articulated object links are organized in a tree structure following the URDF format. We randomly or manually tune each object's physically related parameters, including shape, mass, and friction, to mimic its real-world counterpart. To enable reliable collision detection and physical interaction between instruments and objects, we compute the convex decomposition of the meshes using V-HACD.
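The parameter randomization described above can be sketched as follows; the ranges are illustrative assumptions rather than SurRoL's actual values, and the dictionary keys are chosen to match PyBullet's `changeDynamics` keyword arguments:

```python
import random

def sample_object_dynamics(rng=None):
    """Randomise physically relevant object parameters to mimic real-world
    variation. The ranges below are illustrative assumptions."""
    rng = rng or random.Random(0)
    return {
        "mass": rng.uniform(0.005, 0.05),           # kg
        "lateralFriction": rng.uniform(0.5, 1.5),
        "spinningFriction": rng.uniform(0.01, 0.1),
    }
```

A sampled dictionary could then be applied to a loaded body, e.g. `p.changeDynamics(body_id, -1, **params)` after loading the object's URDF with PyBullet.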
We have established a spectrum of learning-based tasks based on the dexterity and precision demands of the surgical context, covering multiple levels of surgical skill and involving the manipulation of PSM(s) and ECM. We build ten diverse tasks, including nine goal-based tasks (four PSM single-handed tasks, two PSM bimanual tasks, three ECM tasks) and one reward-based ECM task, ranging from entry-level to sophisticated, as summarized in Table II.
This serves as a validation task for the environment since, with hindsight experience replay, the policy can quickly acquire the skill. The goal is to move the PSM jaw tip to a location slightly above the needle within a tolerance, where the needle is randomly placed on a surgical tray and the jaw is closed with a fixed orientation.
Imagine that we want to retrieve suture gauze during a surgical operation. The goal is to sequentially pick up the gauze and bring it back (place it at the target position), with one additional DoF to indicate jaw open/close.
Based on GauzeRetrieve, NeedlePick involves an additional yaw angle DoF, which considers the pose of the needle.
Peg transfer is one of the Fundamentals of Laparoscopic Surgery (FLS) tasks for hand-eye coordination , which requires collision avoidance and long-horizon reasoning. We build a single-handed version that moves the block from one peg to the other peg without handover.
An initial needle grasp with one PSM often results in a non-ideal picking pose. This task requires handing over the held needle from one arm to the other with bimanual operations.
This is an advanced version of PegTransfer with bimanual operations, where the grasping arm needs to hand the block to the other arm before placing it.
Similar to NeedleReach, the goal is to move the camera mounted on the ECM to a randomly sampled position. Note that the 4th joint is fixed since it does not affect the camera position but only alters the orientation.
Misorientation, the difference between the camera orientation and the Natural Line-of-Sight (NLS), is inevitable during surgery since the endoscope moves under the RCM constraint. This task requires adjusting the ECM's 4th joint such that the misorientation with the desired NLS, computed from an affine transformation, is minimized. The goal is achieved when the misorientation is within a tolerance.
The goal is to let the ECM track a static red target cube, surrounded by distracting cubes, which mimics the scenario of focusing on the primary instrument during surgery. A successful tracking requires the tracked cube's position in image space to be close to the image center and the misorientation to be within a tolerance.
Instead of remaining static, the target cube keeps moving, following an online-generated path at constant speed. The goal is to keep the ECM tracking the moving cube, with a relaxed misorientation requirement but the chance of losing the target out of the view. A dense reward is designed that combines the image-space position error and the misorientation (the same terms as in Equ. 1), with weighting hyperparameters chosen as 1 and 0.1, respectively.
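A sketch of such a dense tracking reward; the exact functional form of the paper's Equ. 2 is not reproduced here, so the combination below is an illustrative assumption that only uses the stated weights 1 and 0.1:

```python
import numpy as np

def active_track_reward(cube_uv, misorient, w_pos=1.0, w_orient=0.1):
    """Illustrative dense tracking reward (an assumed form, not the paper's
    exact Equ. 2): penalise the normalised image-space distance of the
    tracked cube from the image centre, plus a smaller penalty on the
    misorientation angle."""
    pos_err = np.linalg.norm(np.asarray(cube_uv))  # (0, 0) is the image centre
    return -(w_pos * pos_err + w_orient * abs(misorient))
```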
In this section, we focus on the proposed learning-based tasks and want to answer the following questions: 1) Do the tasks in our simulation platform cover a range of surgical scenarios and difficulties? 2) Does the physical interaction matter for surgical robots with low-level skills? 3) Can we smoothly transfer the policy trained in the simulated environment to the real dVRK?
To answer these questions, we first evaluate recent reinforcement learning algorithms, optionally combined with imitation learning, in SurRoL. Second, we give an in-depth analysis of the effect of physical interaction using the PSM manipulation task. Finally, we demonstrate that, with the highly compatible interface, policies from simulation can be successfully deployed on the dVRK, including PSM GauzeRetrieve, NeedlePick, PegTransfer, and ECM StaticTrack.
The initial experiment verifies that our proposed tasks are solvable with existing reinforcement learning algorithms. As we find the manipulation tasks extremely challenging, mainly due to the tiny objects and high precision requirements, we present results with low-dimensional state observations.
In our RL environments, we set up a manipulation workspace for robots and objects to interact within. For PSM tasks, we fix the workspace size and the goal tolerance distance. Every time the environment resets, the initial object and goal positions are randomly sampled from the workspace. For ECM tasks, we fix the workspace for the target cube, the misorientation tolerance, and the normalized image position error. Each episode lasts 50 timesteps for goal-based tasks and 500 timesteps for reward-based tasks.
For all tasks, we evaluate model-free RL algorithms, including the off-policy deep deterministic policy gradient (DDPG) and the on-policy proximal policy optimization (PPO). We collect agent experience by interacting with multiple separate environments during training and maintain a shared replay buffer for gradient updates. As model-free methods suffer from sample complexity, we also evaluate hindsight experience replay (HER), a sample-efficient learning algorithm desirable for goal-based tasks. The success rates and episode returns are used as the evaluation metrics for goal-based and reward-based tasks, respectively, as in [2, 18, 30].
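HER's relabeling idea can be sketched with the "future" strategy: each stored transition is duplicated with goals resampled from achieved goals later in the same episode, so failed episodes still yield successful (relabeled) experience. The transition layout and `k` value below are illustrative:

```python
import random

def her_relabel(episode, reward_fn, k=4, rng=None):
    """'Future' strategy of hindsight experience replay.

    episode: list of dicts with 'obs', 'action', 'achieved_goal'.
    reward_fn(achieved_goal, goal): recomputes the sparse reward.
    """
    rng = rng or random.Random(0)
    relabeled = []
    for t, tr in enumerate(episode):
        future = episode[t:]  # achieved goals from this step onwards
        for _ in range(k):
            new_goal = rng.choice(future)["achieved_goal"]
            relabeled.append({
                "obs": tr["obs"],
                "action": tr["action"],
                "goal": new_goal,
                "reward": reward_fn(tr["achieved_goal"], new_goal),
            })
    return relabeled
```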
Our SurRoL runs in real time, at about 150 Hz of simulation in the reaching tasks with position control and random actions, where the environment is stabilized at each timestep with multiple simulation steps. Most of the training and testing experiments are performed on a desktop with Ubuntu 18.04, an Intel 3.6 GHz CPU with 32 GB RAM, and an Nvidia TITAN RTX GPU.
To demonstrate our manipulation tasks, we design scripted policies with heuristics, given the ground-truth states available in the simulation and with the help of manual engineering. Meanwhile, it is challenging to obtain satisfactory RL performance for the PSM tasks, such as NeedlePick and PegTransfer, which contain rich physical contacts between the instruments and the objects. RL algorithms typically suffer from the exploration problem of discovering the high-reward space when the agents are trained from scratch, especially in the sparse reward setting. To sidestep exploration challenges and ease the training, we integrate demonstrations into the learning process by collecting a small number of samples using the scripted policies for behavior cloning.
Specifically, we divide the PSM manipulation tasks into multi-stage sequences, where waypoints indicate the critical transition conditions between simplified operations. For example, the trajectories for NeedlePick and PegTransfer are composed of approaching, picking, placing, and optional releasing, as shown in Fig. 3 (a), (b). Waypoints are built manually with position, orientation, and collision avoidance taken into consideration, while the trajectories in between are generated by interpolation. Besides, we demonstrate the ECM tracking tasks using visual servoing, implemented by a null-space method for camera velocity control, as in Fig. 3 (c).
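The in-between trajectories can be generated with simple linear interpolation between consecutive waypoints; this sketch assumes Cartesian positions only (orientation and jaw handling omitted):

```python
import numpy as np

def interpolate_waypoints(waypoints, steps_per_segment=10):
    """Linearly interpolate a Cartesian trajectory through manually designed
    waypoints (approach, pick, place, ...), yielding a dense path."""
    waypoints = [np.asarray(w, dtype=float) for w in waypoints]
    path = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for i in range(steps_per_segment):
            alpha = i / steps_per_segment
            path.append((1 - alpha) * a + alpha * b)
    path.append(waypoints[-1])
    return path
```

A scripted policy would then emit, at each step, the delta between the current tip position and the next point on this path.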
A summary of the evaluation results for the RL baselines is shown in Fig. 5. For ECM goal-based tasks without instrument-object physical interaction, the agent can successfully capture the complicated action-observation relationship using HER, even for MisOrient and StaticTrack, which involve complex matrix transformations. We also observe that in StaticTrack, the learned policy can smoothly center the target object without the jittering effect, which is non-trivial for the visual servoing method that requires careful parameter tuning. For the reward-based task, DDPG with the proposed dense reward (Equ. 2) incentivizes the agent to actively control the ECM and track the moving object, which follows online-generated paths in a dynamic environment, as illustrated in Fig. 4.
However, in the PSM settings, HER alone cannot solve all tasks within the given time horizon, mainly due to the tiny objects and the physically rich interactions. By visually inspecting the training progress, we find that the agents quickly learn to approach objects such as the needle and attempt to pick them reasonably, but fail because the approximate positioning exceeds the millimeter-level tolerance and the grasping is unstable. Few experiences with high reward lead the learning to diverge in the early stage, as the policy gradually finds that random actions produce similar no-gain returns.
To overcome the exploration challenge, we record a small amount of demonstration data using the scripted policies for imitation learning. After combining HER and demonstrations (HER+DEMO) with Q-filtered behavior cloning, the agents manage to solve many challenging tasks with physics-rich simulation within 50 epochs of training, e.g., PegTransfer. From the results, though HER(+DEMO) performs well for robots with relatively large grippers and error tolerances, it performs poorly with tiny surgical instruments and objects (around 10 times smaller error tolerance), which indicates the difficulty of the medical robot field.
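The Q-filter behind the behavior-cloning auxiliary loss can be sketched as follows; this is a NumPy illustration of the filtering rule (apply the cloning penalty only where the critic ranks the demonstration action above the policy's own action), not the actual training code:

```python
import numpy as np

def q_filtered_bc_loss(q_demo, q_policy, policy_actions, demo_actions):
    """Behaviour-cloning loss applied only on states where the critic values
    the demonstration action higher than the policy's action (the Q-filter).

    All inputs are arrays over a batch of demonstration states.
    """
    q_demo = np.asarray(q_demo, dtype=float)
    q_policy = np.asarray(q_policy, dtype=float)
    mask = (q_demo > q_policy).astype(float)  # 1 where the demo is better
    sq_err = np.sum(
        (np.asarray(policy_actions, dtype=float)
         - np.asarray(demo_actions, dtype=float)) ** 2, axis=-1)
    return float(np.sum(mask * sq_err) / max(mask.sum(), 1.0))
```

Transitions where the policy already outperforms the demonstration contribute nothing, so imperfect demonstrations do not drag the policy back down.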
We further analyze BiPegTransfer, the most challenging long-range task, which fails even with imitation learning, by constructing several variants with different levels of simplification. As shown in Fig. 6 left, we initialize the environment by letting PSM2 accomplish the approach, pick, and lift steps manually, to inspect which part makes HER+DEMO suffer. Surprisingly, even with the correct grasping points, HER+DEMO fails to learn the picking action, which shows the extreme exploration difficulty during learning (Fig. 6, right). With successful picking and lifting, the agents succeed in handing over the blocks from PSM2 to PSM1, a non-trivial coordination skill. From this disentangled analysis, integrating motion planning and low-level control is one way to solve long-range peg transfer efficiently.
As we find that the simplified instrument-object interaction in [28, 32] may cause unstable grasping and a further sim-to-real reality gap, we evaluate different levels of physical interaction using NeedlePick. Note that the simulation backends differ among these works, so we construct a similar environment to mimic the simplified setting, i.e., the needle is attached to the jaw when the relative distance is less than a threshold, denoted as "Approx@". Meanwhile, the needle picking point is restricted to the jaw tip to avoid unsafe jaw collisions with the holding surface. We compare this approximate manner with ours using physical interaction and friction-based grasping (denoted as "Interact"), shown in Fig. 7 top. The mean success rate and standard deviation of three trained policies for the two manners, evaluated over 200 episodes per model, are presented in Table III.
We show the experimental results when a policy is trained in one physical interaction manner and tested in the other settings. Though the transition probability changes with the interaction manner, policies trained with Interact are robust in the Approx settings (from 81.3% in Interact to 70.7% in Approx@2), which indicates accurately learned picking points. However, policies trained in the Approx settings suffer from dynamics confusion and significant performance degeneration in Interact (from 76.5% to 34.2%) and usually fail with unrealistic no-contact grasping. A relatively large performance improvement in a relaxed setting also reflects the inaccurately learned dynamics (from 76.5% in Approx@ to 88.8% in Approx@).
Table III: Success rate (%) by approach.
To demonstrate transferability, we conduct physical experiments by deploying the policies trained in SurRoL to the real-world dVRK platform. Four tasks, PSM GauzeRetrieve, NeedlePick, PegTransfer, and ECM StaticTrack, are selected for demonstration. Thanks to the compatible dVRK interface, we can smoothly transfer the learned skills, with experiment snapshots shown in Fig. 8.
For the first three PSM tasks, we set up the physical experiment and carefully align a workspace to ensure consistency between the simulated and real environments. Since these tasks are solvable only with HER+DEMO, we select the corresponding best-performing policies trained in simulation to generate actions for deployment. With 4-DoF actions to adjust the PSM position and the jaw's open/close state, the learned GauzeRetrieve policy can pick and retrieve the gauze to the target position with a 96% success rate over 25 episodes. For PegTransfer, the learned policy sequentially picks the block, lifts it, and puts it on the target peg while avoiding collisions in the complex environment.
To investigate the reality gap that different levels of simulated interaction may cause, we conduct the physical NeedlePick experiment using the policies introduced earlier. We choose the best policies trained in the Approx@ and Interact manners, with success rates of 82.0% and 83.5% in their corresponding simulated settings, respectively. The physical evaluation environments are initialized identically, using only episodes that succeed for both policies in simulation, to ensure a fair comparison. The success rates are reported over 50 episodes for each method, as shown in Table IV. From the results, the policy trained in the Approx@2 manner suffers from low real-world deployment success rates, mainly due to imprecise picking points that are close to, but not in physical contact with, the needle (Fig. 7 bottom). By contrast, the policy trained in the Interact manner with improved physics simulation is more robust to environment changes, with a high success rate. Besides, we find some failure cases resulting from dynamics discrepancies between the simulation and the real world, as also observed in previous work.
Table IV: Trials and success rate (%) by approach.
For the ECM StaticTrack, we mimic the simulated scene with several colored cubes, where the target cube is red. The best-trained policy using HER is deployed on the real dVRK for ten episodes. The target cube is first segmented from the image captured by the ECM, and the position extracted from the segmentation serves as the observation. The policy generates joint-position actions at each step, converted from the corresponding velocities expressed in the camera frame, and centers the cube in the captured image within the normalized position and misorientation error tolerances, with a 90% success rate.
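The segmentation-to-observation step can be sketched with a simple color threshold; the RGB thresholds below are illustrative assumptions, and the real system may use a different segmentation method:

```python
import numpy as np

def red_cube_offset(rgb):
    """Segment the dominant red region of an H x W x 3 image with a simple
    channel threshold (a stand-in for the real segmentation) and return its
    centroid offset from the image centre, normalised to [-1, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    h, w, _ = rgb.shape
    mask = (rgb[..., 0] > 150) & (rgb[..., 1] < 100) & (rgb[..., 2] < 100)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # target lost from the view
    cy, cx = ys.mean(), xs.mean()
    return ((cx - w / 2) / (w / 2), (cy - h / 2) / (h / 2))
```

The normalized offset (zero when the cube is centered) is exactly the quantity the tracking policy tries to drive towards the origin.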
In this work, we present SurRoL, a simulated platform for surgical robot learning compatible with the dVRK. Ten surgically relevant learning-based tasks with enriched assets and physical interactions are constructed, which involve manipulating PSM(s) and ECM across difficulty levels. Extensive experiments in simulation, with further physical deployment, reveal the difficulty of low-level surgical skill learning. Moreover, the physical interaction experiments in SurRoL show that reproducing physics is one step towards a realistic simulation for surgical robot learning that transfers to the real world. We believe SurRoL will embrace advances in learning-based methods, especially RL, and surgical robotics, enabling more researchers to take part in the development of surgical robot manipulation.
This project was supported by Hong Kong Research Grants Council TRS Project No. T42-409/18-R, CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20), and Hong Kong Multi-Scale Medical Robotics Center.
Superhuman surgical peg transfer using depth-sensing and deep recurrent neural networks. arXiv preprint arXiv:2012.12844.
Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37(4-5), pp. 421–436.