Superhuman Surgical Peg Transfer Using Depth-Sensing and Deep Recurrent Neural Networks

by   Minho Hwang, et al.
berkeley college

We consider the automation of the well-known peg-transfer task from the Fundamentals of Laparoscopic Surgery (FLS). While human surgeons teleoperate robots to perform this task with great dexterity, it remains challenging to automate. We present an approach that leverages emerging innovations in depth sensing, deep learning, and Peiper's method for computing inverse kinematics with time-minimized joint motion. We use the da Vinci Research Kit (dVRK) surgical robot with a Zivid depth sensor, and automate three variants of the peg-transfer task: unilateral, bilateral without handovers, and bilateral with handovers. We use 3D-printed fiducial markers with depth sensing and a deep recurrent neural network to improve the precision of the dVRK to less than 1 mm. We report experimental results for 1800 block transfer trials. Results suggest that the fully automated system can outperform an experienced human surgical resident, who performs far better than untrained humans, in terms of both speed and success rate. For the most difficult variant of peg transfer (with handovers) we compare the performance of the surgical resident with performance of the automated system over 120 trials for each. The experienced surgical resident achieves success rate 93.2 seconds. The automated system achieves success rate 94.1 time of 8.1 seconds. To our knowledge this is the first fully automated system to achieve "superhuman" performance in both speed and success on peg transfer. Supplementary material is available at



page 2

page 3

page 5

page 15


Applying Depth-Sensing to Automated Surgical Manipulation with a da Vinci Robot

Recent advances in depth-sensing have significantly increased accuracy, ...

Intermittent Visual Servoing: Efficiently Learning Policies Robust to Instrument Changes for High-precision Surgical Manipulation

Automation of surgical tasks using cable-driven robots is challenging du...

Learning to Localize, Grasp, and Hand Over Unmodified Surgical Needles

Robotic Surgical Assistants (RSAs) are commonly used to perform minimall...

A surgical dataset from the da Vinci Research Kit for task automation and recognition

The use of datasets is getting more relevance in surgical robotics since...

Robots of the Lost Arc: Learning to Dynamically Manipulate Fixed-Endpoint Ropes and Cables

High-speed arm motions can dynamically manipulate ropes and cables to va...

Recognizing Surgical Activities with Recurrent Neural Networks

We apply recurrent neural networks to the task of recognizing surgical a...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Robotic Surgical Assistants (RSAs) such as the da Vinci regularly augment the ability of surgeons to perform complex tasks with improved dexterity and visualization [general_surgery_2002, intuitive_dexterity_2000] through teleoperation. The use of minimally invasive surgery also has potential to reduce post-operative patient pain and length of stay in a hospital [danyal_fer_suggestion2, danyal_fer_suggestion]. Furthermore, RSAs provide an experimental platform for accelerating research in autonomous robot surgery, with the goal of eventually aiding surgeons by reducing fatigue or tedium [yip2017robot]. Arms and end-effectors on these robots are cable-driven to reduce their size, but this results in cable-related hysteresis and tracking errors that make it challenging to automate surgical tasks that require high precision [pastor2013, miyasaka2015, kalman_filter_2016, mahler2014case, Kehoe2014]. These errors are difficult to detect and track, because encoders are located far from the joints and therefore do not capture cabling effects. We propose a method to automate high-precision surgical subtasks by leveraging recent advances in depth sensing, recurrent dynamics modeling, and trajectory optimization.

We consider the Fundamentals of Laparoscopic Surgery (FLS) [fls_2004] peg-transfer surgeon training task, in which a surgeon must transfer 6 blocks from one set of pegs to another set of pegs, and then transfer them back (Fig. 1). Since each block’s opening has just a 4.5 mm radius, and blocks must be placed on cylindrical pegs which are 2.25 mm wide, the task requires high precision, making it a popular benchmark task for evaluating and training human surgeons [haptic-feedback-surgery-2019, ur5-peg-transfer-2019, rating_peg_transfer_2017, laparoscopic_transfer_2014, madapana2019desk]. The first research result on automating a variant of peg transfer with a surgical robot was from Rosen and Ma [auto_peg_transfer_2015], who built a system enabling a Raven II surgical robot to move three blocks from one side of a peg board to another.

This paper combines and extends our prior work on automating peg transfer using calibrated depth-sensing [hwang2020applying], fiducial tracking to learn models of cabling effects [hwang2020efficiently], and intermittent visual servoing [paradis2020intermittent], by additionally proposing several novel techniques for improving surgical robot precision, speed, and ease of use:

Hwang et al. [hwang2020applying] develop the first system for a surgical robot to autonomously perform a variant of the six block FLS peg-transfer task using depth sensing, and experiment with single and bilateral peg transfer without handover of the blocks between the two arms. The system moves the robot via open loop commands to pick and place blocks from one side of the peg board to another.

Hwang et al. [hwang2020efficiently] improve the accuracy of the physical robot using deep dynamics models trained on robot trajectories collected using fiducial markers and depth sensing, and improve the perception pipeline. The paper evaluates a new robot controller on the peg-transfer task and improves the physical properties of the peg board and blocks using 3D printing software, making the setup more reproducible and less likely to cause hardware damage.

Paradis et al. [paradis2020intermittent] propose intermittent visual servoing (IVS), a paradigm for using a coarse open-loop controller in free space, while using learned visual servoing [Kragic02surveyon] when the robot is close to an area of interest, such as when picking or placing a block. Paradis et al. learn visual servoing from demonstrations, and in peg-transfer experiments, show that the learned model can be robust to tool changes, which would not be the case with the calibration methods used in Hwang et al. [hwang2020applying, hwang2020efficiently] However, IVS requires additional time during the correction phase.

This paper extends the prior papers in several ways. First, the deep RNN models applied in Hwang et al. [hwang2020efficiently] require training data consisting of trajectories that are similar to those encountered in the target task, so using a different trajectory distribution for training results in a performance drop. In this new paper, we densely sample states on randomly-generated trajectories for training, and demonstrate that this achieves comparable performance on peg transfer without requiring task-specific trajectories.

Second, while Hwang et al. [hwang2020efficiently, hwang2020applying] and Paradis et al. [paradis2020intermittent] are able to meet human-level success rates on the peg transfer task, these approaches are 1.5x to 2x slower than a skilled human operator, respectively. In this new paper, we speed up motions by optimizing the robot arm trajectories. Furthermore, while human operators may have trouble executing multiple tasks simultaneously, a robot can parallelize task computation and execution using multiple concurrent processes. In contrast, prior work [hwang2020applying] considers only a basic bilateral setup without coordination between the arms, resulting in slower performance. We also do not utilize intermittent visual servoing [paradis2020intermittent] in this work, which means the robot does not need to spend time executing corrective actions when it picks or places blocks.

In addition, the full FLS peg-transfer task that human surgeons train on requires the blocks to be handed over from one arm to the other before placing them onto the pegs. This new paper also considers handovers, the full FLS peg-transfer task. We compare the performance of an experienced surgical resident with performance of the automated system over 120 trials for each variant. The surgical resident performs far better than untrained humans. On the peg transfer variant with handovers, the experienced surgical resident achieves success rate with mean transfer time of 8.6 seconds, while the automated system achieves success rate with mean transfer time of 8.1 seconds. To our knowledge this is the first fully automated system to achieve “superhuman” performance in both speed and success on peg transfer.

Automated peg-transfer task setup. We use a da Vinci Research Kit (dVRK) robot with two arms. The blocks, pegs, and peg board are monochrome to simulate a surgical setting. The dimensions of the pegs and the blocks are shown in the lower left, along with a top-down visualization of the peg board to the lower right. The robot takes actions based on images taken from a camera, installed from the task space and inclined from vertical.

2 Related Work

Surgical robots are routinely utilized in surgical operations worldwide [dvrk_meta_analysis_2010]

. Trained human surgeons control robots via teleoperation and can naturally compensate for cable-related effects by observing and reacting to the robot’s motion. No surgical procedures in clinical settings today use autonomous robots due to demanding accuracy and perception requirements, but researchers have made tremendous strides in automating surgery in laboratory settings 

[yip2017robot]. For example, prior researchers have shown promising results in automating tasks that are part of the FLS curriculum, such as peg transfer [auto_peg_transfer_2015, hwang2020applying], suturing [Schulman2013, sen2016automating, rosen_icra_suturing_2017, comparison_suturing_2018, saeidi_suturing_icra_2019] and cutting gauze [thananjeyan2017multilateral, rosen_icra_tissues_2019].

Researchers have additionally proposed surgical robot systems for automating tasks such as tumor identification [garg2016gpas, mckinley2016], debridement to remove dead or damaged particles [Kehoe2014, murali2015learning], tissue manipulation [wang2018unified, alambeigi2018toward, rosen_tissues_qlearn_2019], inserting and extracting surgical needles [automated_needle_pickup_2018, extraction_needles_2019, needle_insertion_deformation_2019, bimanual_regrasping_needles_2020], bloodflow detection [ritcher_bloodflow_2020], and eye surgery [auto_eyes_2020, EYES_EYES_EYES]. Researchers have also shown promise in using surgical robots for complex tasks with wider applicability, such as tying knots [vandenBerg2010, thananjeyan2019safety] and smoothing fabrics [seita_fabrics_2020]. The methods proposed in these papers often utilize recent advances in technologies such as depth sensing, trajectory optimization [Osa-RSS-14], 3D reconstructions [super_yip_2019]

, imitation learning 


, and reinforcement learning 


2.1 Improving Precision of Surgical Robots

Surgical tasks often require positional accuracy of the end-effector in the workspace within , and this is difficult to autonomously obtain with cable-driven surgical arms [dvrk2014, raven2013] due to effects such as cable tension, cable stretch, and hysteresis [k-flex, hysteresis_effect], and such effects may be exacerbated with flexible arms [hwang2017single]

and usage-related wear. To compensate for these errors, prior methods use techniques such as unscented Kalman filters to improve joint angle estimation 

[kalman_filter_2016] by estimating cable stretch and friction [miyasaka2015], or by learning corrective offsets for robot end-effector positions and orientations [pastor2013, mahler2014case, seita_icra_2018].

Peng et al. [peng2020real] recently propose a data-driven calibration method for the Raven II [raven2013] that uses three spheres and four RGB cameras to estimate the end-effector position. They collect a labeled dataset of 49407 poses including velocities and torques, and train a neural network to predict the 3D position to within accuracy on an offline test set. In contrast to their work, in prior work [hwang2020efficiently] we consider the problem of estimating the joint configuration, which can be incorporated more directly in collision checking, and we also learn to predict the commanded input given a history and desired joint angle. Additionally, we use both RGB and depth sensing to track the sphere fiducials and consider historical motions in the estimation of joint angles, which enable compensation for hysteresis and backlash-like effects. Furthermore, we design practical controllers using these models and benchmark the result of applying the proposed calibration procedure on a challenging peg transfer task. We use the techniques in Hwang et al. [hwang2020efficiently] to control the dVRK in experiments.

2.2 Peg Transfer Task

The Fundamentals of Laparoscopic Surgery (FLS) [fls_1998, fls_2004] consists of tasks designed to train and evaluate human surgeons teleoperating surgical robots, such as peg transfer, precision cutting, ligating loop, and suturing with either extracorporeal or intracorporeal knots. In this paper, we focus on the first task, peg transfer, in which the goal is to transfer six rubber triangular blocks from one half of a pegboard to the other, and then back. For a single block, this process requires grasping the block, lifting it off of a peg, handing it to the other arm, moving it over another peg, and then placing the block around the targeted peg. In a study of surgical residents performing the peg transfer task, Fried et al. [fls_2004] found that better performance in peg transfer correlates with performance during real laparoscopic surgical settings. This provides justification for having surgeons practice peg transfer, and consequently, much of prior research on the peg-transfer task explores how to improve human performance. For example, Abiri et al. [haptic-feedback-surgery-2019] focus on providing better feedback to the surgeon to compensate for the loss of force feedback when teleoperating, and Rivas-Blanco et al. [ur5-peg-transfer-2019] use a learning-based approach to better position a camera for the surgeon during teleoperated peg transfer. Other works [rating_peg_transfer_2017, laparoscopic_transfer_2014, madapana2019desk] focus on benchmarks and evaluations for peg transfer.

Rosen and Ma [auto_peg_transfer_2015] provide the first research efforts towards automating a version of peg transfer. Using one robot arm from the Raven II robot [raven2013] to transfer 3 rubber blocks in one direction, Rosen and Ma compare performance over 20 trials of the autonomous Raven II versus a human teleoperator. With three blocks, each trial consists of 60 total block transfer attempts from one side to another. Their results suggest that the autonomous robot attains almost the same block transfer success rate while operating at twice the speed of the human operator. We extend Rosen and Ma’s work, and our prior work [hwang2020applying, hwang2020efficiently, paradis2020intermittent], by considering 3 variants of the peg transfer task, all of which require 6 blocks and involve transfer in both directions.

3 Task Definition

Variants of Peg Transfer Task. (a) Unilateral, transfers performed by a single arm. (b) Parallel Bilateral: transfers performed by two arms in parallel. (c) Bilateral Handover: transfers performed by both arms, with a hand-off between arms during each transfer.

We focus on the FLS peg-transfer task, using the setup in Hwang et al. [hwang2020efficiently] that uses red 3D-printed blocks and a 3D-printed pegboard (see Fig. 1). The red setup simulates a surgical environment where blood is common and surgeons must rely on subtle visual cues to perceive the state. To perform the task, the robot moves each of the 6 blocks from the 6 left pegs to the 6 right pegs before moving them back. Unlike prior work, we consider bilateral variants of this task with handovers.

As in Paradis et al. [paradis2020intermittent], we define the peg-transfer task as consisting of a series of subtasks, with the following success criteria:

Pick: the robot grasps a block and lifts it off the pegboard.

Place: the robot securely places a block on a target peg.

Handover: the robot passes a block from one arm to the other. We define a transfer as a successful pick followed by a successful place, with a potential handover in between. The 6 blocks are initially randomly placed on the left 6 pegs of the pegboard. In a trial of the peg-transfer task, the robot attempts to move all 6 to the right 6 pegs before moving them back. A trial is considered successful if all 12 transfers are successful, and a trial can have fewer than 12 transfers if failures occur during the first 6 transfers and a block is irrecoverable.

We define the following variants of the peg-transfer task, visualized in Fig. 3:

Unilateral: all 12 transfers are performed by a single arm. This is the variant considered in most prior work [hwang2020applying, hwang2020efficiently, paradis2020intermittent].

Parallel bilateral: each transfer is performed by a single arm, but either arm can be used. This means that two transfers can be performed simultaneously. This variant is also considered in one prior work [hwang2020applying], but we propose ways to speed up the robot on this task.

Bilateral handover: each transfer is performed by both arms and consists of a pick followed by a handover followed by a place. Subsequent transfers can be pipelined as a pick can be performed for the next block while a place is performed for the current block. This is the standard surgeon training task in the FLS curriculum, though pipelining is not typically performed. In this work, we consider the bilateral handover variant of the peg-transfer task for the first time in an automation setting, and additionally propose ways to better optimize the robot for the parallel bilateral variant. These two settings require precise coordination between two arms to avoid collisions and to efficiently execute handovers if needed.

4 Notation

In this paper, we borrow notation from Hwang et al. [hwang2020efficiently] to define quantities related to the da Vinci Research Kit [dvrk2014] and its dynamics. Let be the robot’s physical configuration, and define to be the set of all possible configurations. The robot’s commanded configuration is , which is equal to the encoder readings if the robot is allowed to come to a rest. This can differ from , because the encoders are located at the motors and away from the joints, which can result in a mismatch between the joint configuration measured by the encoders and the true joint configurations due to cabling effects. Ideally,

after motions are completed, however, in the presence of a cable-driven actuation, this is not necessarily true. Subscripts index specific joints in vectors, e.g.,

. For ease of readability, we suppress the and subscripts when the distinction is not needed. We visualize the six joints in Fig. 7.1. Let


encode the prior trajectory information of the robot up to time .

We would like to estimate the function


which is a dynamics model that maps the current command at time and prior state information to the current physical configuration of the arm. Prior state information carries information about cable deformation, and is necessary for high-precision control [hwang2020efficiently]. In this work, this consists of a fixed horizon of prior commands sent to the robot. At execution time, we use the controller derived from by approximately inverting it for a desired output waypoint (Section 5.5).

It is difficult to derive an accurate model for the dynamics that incorporates hysteresis, so we estimate a parametric approximation from a finite sequence of samples .

When collecting training data, we wait for the robot to come to a stop in between commands, due to the capture frequency of the Zivid depth camera used to estimate joint configurations (Section 5.1

). However, at execution time, we do not use the Zivid for pose estimation, so we do not require the robot to stop. This allows the robot to use the calibrated model and execute fast and precise maneuvers.

5 Robot Calibration using Depth Sensing and 3D Printed Fiducial Markers

In this section, we describe methods to learn from data and then to use to more accurately control the robot. We start by sending a sequence of commands and tracking the physical trajectories of 3D printed fiducial markers attached to the robot. We then convert the marker’s positions to

using kinematic equations. After collecting a dataset of commands and physical configurations, we train a recurrent neural network to minimize the empirical risk for the mean square error (MSE) loss function.

5.1 Estimating Surgical Tool Configuration

It is challenging to accommodate additional encoders or sensors due to the confined space inside the surgical end effector. To measure the state of a surgical robot, prior work [peng2020real] uses multiple RGB cameras to localize 3 colored spheres on the jaw, which requires a delicate process of camera calibrations and post-processing of RGB images to reconstruct the 3D positions of spheres.

In this work, to estimate , we build on our prior work [hwang2020efficiently] and use an RGBD camera to track fiducials of colored spheres attached to the end effector (shown in Fig. 5.1). The two spheres on the shaft allows us to decouple the first three joints (, , ) from the last three (, , ) and thus to accurately estimate them excluding the cabling coupling. We design and place the four spheres on the jaw where they cannot overlap in the camera view within the working range of joints.

Given the point clouds provided by the camera, we cluster the six groups of point sets by masking the color and size of each sphere. We then calculate each sphere’s 3D position by formulating a least-squares regression.

We calculate six joint angles of the robot based on the dVRK kinematics. Since the end position of the tool shaft is only dependent on the first three joints, , , , we can get the inverse function in combinations of the end position, which simply extend the two measured positions of spheres. We obtain the last three joints, , , , by equating the rotation matrix of the end effector with the rotation matrix measured from the four spheres. The measurement accuracy is in sphere detection and less than 0.01 radians in joint angle estimation. [hwang2020efficiently].

Estimating tool configuration. We strategically attach six fiducial spheres among which two of them on the tool shaft and four on the jaw. We convert the detected sphere positions into joint configurations based on the dVRK kinematics. Left: The detected spheres are circled in yellow and the estimated joint configurations are overlaid with green dotted lines. Right: We cluster the point sets using color masking and the size of each sphere.

5.2 Data Collection

To facilitate data collection, we first obtain the transformation from the RGBD camera to the robot. While the robot randomly moves throughout its workspace, we collect the wrist positions using sphere detection and compare them with the positions reported by the robot. We use least squares estimation to match 684 sample points for the different base frames to obtain a transformation matrix from the camera to robot . We collect a training dataset consisting of random, smooth motions of the arm, visualized in Fig. 5.2. This long trajectory is first collected by teleoperating the robot in random, continuous motions, densely recording the waypoints, and then replaying them with the fiducials attached. This enables us to collect ground truth information for trajectories that are not specific to peg transfer. During the process, we collect the configuration estimated from the fiducials and commanded joint angles to compile a dataset :


The dataset consists of data points, and takes 24 minutes to collect. In prior work [hwang2020efficiently], we collect roughly the same amount of data, but from trajectories that consist of long, jerky motions resembling the peg transfer task, and sparsely sample waypoints from them. We find that dynamics models trained on those trajectories are prone to distribution shift errors if they are tested on tasks that consist of different motions from the training distribution. We hypothesize that we can address the distribution shift by collecting data in a more randomized process, but while sampling at a finer-grained resolution.

From the camera view, we define the manipulator on the right as Patient Side Manipulator1 (PSM1) and on the left as PSM2. We repeat the data collection and calibrate both arms separately, as they have different cabling characteristics.

Random, smooth Trajectories: We collect a dataset of random, smooth motions in the workspace. The dataset consists of 2809 data points. These motions are not specifically optimized for the peg transfer task.

Figure 1: We plot sub-sampled trajectories of the PSM1 and PSM2 from their training datasets (see Section 5.2). Comparing the desired and measured configurations, we observe that the wrist joints (, , ) are the main source of the positioning error of the end effector, while the three joints from the external arm (, , ) are relatively accurate. The quantitative results are shown in the Table 5.2.

Statistics of error in PSM1 and PSM2. In both cases, the first three joints have relatively small errors. Joint errors grow as joints get further from the proximal. PSM1 and PSM2 commonly have a small offset of and hysteresis with coupling of and , which produces errors larger than 0.2 rad. PSM1 PSM2 (rad) (rad) (m) (rad) (rad) (rad) (rad) (rad) (m) (rad) (rad) (rad) RMSE 0.0031 0.0016 0.00085 0.099 0.13 0.28 0.0026 0.0035 0.00034 0.17 0.21 0.21 SD 0.0011 0.00098 0.00024 0.030 0.055 0.092 0.0012 0.0014 0.00013 0.022 0.11 0.12 Max 0.0061 0.0055 0.0013 0.17 0.23 0.45 0.0062 0.0078 0.00066 0.29 0.38 0.44

5.3 Error Identification

We sub-sample a portion of the dataset from Section 5.2 with . Fig. 1 presents the desired and measured trajectory of each joint angle in both cases. We notice that the three joints of the robot arm, , , and , rarely contribute to the error compared to the last three joints from the statistics in Table 5.2. We observe that the three joints of the surgical tool, , , and , are repeatable and not affected by the arm joints. In addition, the last two joints are closely coupled, since synchronously moved with even though it was commanded to be stationary, and vice versa. We hypothesize this occurs because these two joints have two additional cables that extend together along the shaft of the tool.

5.4 Fiducial-free State Estimation

To estimate without the fiducials attached, we train a function approximator , such that

. The model used is a Long Short-Term Memory (LSTM) 

[hochreiter1997long] with 256 hidden units followed by two more hidden layers of 256 units, which was the best model identified in prior work [hwang2020efficiently]. Five prior commands are supplied as input. We minimize the mean squared error (MSE) loss function between the model predictions and ground-truth targets over the training dataset.

1:Target configuration , state estimator , number of iterations , learning rate
3:for  do
4:      Estimate error
5:      Adjust based on error
6:end for

5.5 Controller Design

Once we train , we would like to use it to accurately control the robot while compensating for the robot’s cabling effects. At time , the controller takes as input the target joint configuration and history-dependent input (Eq. 1) and computes joint configuration command to get the robot to that configuration. The controller (Alg. 5.4) iteratively refines the command based on the error relative to the target position. It evaluates the forward dynamics for an candidate command to obtain an estimate of the next configuration . Then, the algorithm modifies the input command to compensate for the error relative to the target position, executes the command, then updates the history . This process repeats for iterations.

6 Perception and Grasp Planning

In this section, we discuss how the calibrated and trajectory-optimized robot perceives the state of the pegs and blocks and generates a sequence of motions to perform peg transfer.

In previous work [hwang2020applying], we installed the depth camera to have a view perpendicular to the task board, which allows the point sets on the blocks to be obtained by cropping a depth range. To enable perception from a camera in a greater variety of poses, we use a point set registration algorithm, Iterative Closest Point (ICP) [icp], which is simple and computationally efficient.

At the beginning of the task, we register the captured point clouds from the known 3D model of the pegboard to obtain the transformation matrix from the camera to the pegboard. Then, we obtain the 12 peg positions by cropping a height range with respect to the base frame of the pegboard. Given the desired peg position and the known height of the block, we cluster a point set on the block to be transferred and get a transformation from the camera to the block using ICP. Since the transformation from camera to robot is known from the procedure in Section 5.2, we finally obtain the block pose with respect to the robot’s base frame. During the task, we perform a sequence of RGBD capturing, point set clustering, and registration of the block between each transfer motion. This takes an average of 241 ms per each block.

After detecting the blocks, we find a best grasp among a total of the six potential poses per block. We subdivide each block edge into two potential grasps and pre-calculate the six grasps for the 3D block model. See Fig. 6 for details. Given the block poses, we transform these grasping poses using and calculate the distance from the peg to find the farthest point among the poses that allows joint limits. We avoid the grasping point behind a peg to decrease the chances to collision.

Perception Pipeline We use point set registration to get poses of the pegboard and blocks. We perform a sequence of RGBD capturing, point set clustering, and registration of the block between each transfer motion.  Left: Peg transfer environment from the inclined camera view.  Right: Point sets of the detected pegs (black) and blocks (yellow), the registered 3D block model (green), the planned grasping point of the left arm (blue) and right arm (red).

7 Non-Linear Trajectory Optimization

In this section, we describe a trajectory time-optimization to improve the speed of task performance. In prior formulations [ichnowski2020gomp, ichnowski2020djgomp] of pick-and-place trajectory optimization, a time-minimized trajectory is found by discretizing the trajectory into a sequence of waypoints and formulating a sequential quadratic program that minimizes the sum-of-squared-acceleration or sum-of-squared distance between the waypoints. We observe that this prior formulation, while versatile enough for the peg transfer tasks, can be simplified to minimizing a sequence of splines defined by up to four waypoints, and relying on the kinematic design of the robot to avoid collisions. By reducing to four waypoints, we tradeoff some optimality of motion for the reduced computational complexity of the trajectory optimization.

The objective of this optimization is to minimize the trajectory motion time. The peg transfer task consists of 12 block transfers, each of which sequences motions that lift up, translate horizontally, and lower down. Due to the small clearance between blocks and pegs, the lifting motion for a block should be parallel to the peg until the block clears the top of the peg. Lowering the block onto a peg requires a similar parallel motion (Fig. 3).

7.1 Closed-form Inverse Kinematics

When computing a trajectory for the robot, we convert end-effector poses (e.g., for the pick location) to robot configuration using an inverse kinematic (IK) calculation. General IK solvers typically treat this problem as an iterative optimization problem and thus may take many iterations to converge to a solution. Due to the kinematic design of the dVRK robot, we observe that a closed-form analytic solution exists that allows for fast IK solutions.

We use Peiper’s method[peiper1968kinematics] to find a closed-form IK solution. The method proves that an analytic solution always exists for a 6 DoF manipulator under the condition that the last three axes mutually intersect. Many laparoscopic surgical robots have a special mechanical constraint, called Remote Center of Motion (RCM), to constrain movement at a point where the patient’s incision is positioned. We apply this method by inversely re-assigning the coordinates as shown in Fig. 7.1. We consider the end effector as a base frame. This method can be applied to any 6 DoF surgical manipulator with RCM motion constraint.

By using the closed-form IK solution, we further speed up trajectory computation time. Comparing to a numerical optimization method using 1000 random joint configurations, we get a computing time of

in average and standard deviation with the proposed IK solution, while

with the numerical optimization method. We use SciPy module in Python for the comparison.

Coordinate frames for Closed-form Inverse Kinematics We follow the modified Denavit-Hatenberg convention as in our prior work [hwang2020efficiently], except we inversely re-assign coordinates to calculate a closed-form IK solution according to Peiper’s method [peiper1968kinematics].

We define as the IK function, where the input is the pose of the end-effector and the output is the robot configuration:

and from Peiper’s method, we derive the following:

Here we use the notation to refer to the coefficient at row 1 and column 2 of the matrix , and subscripts in and to refer to the respective coefficients of the vector . The scalars , , , and represent physical measurements of the dVRK, visualized in Fig. 7.1.

7.2 Cubic-backed Time Minimization

Optimized trajectory segments for a transfer motion. Each peg transfer consists of 4 waypoints and the three time-optimized splines between. The motion starts at waypoint (0) and ends at waypoint (3) with zero velocity. At waypoints (1) and (2), the splines are connected to ensure continuity.

We propose an optimization based on a cubic spline interpolation in the joint space to compute a fast, smooth, and feasible trajectory to do the peg-transfer motions. To compute the spline, we define a cubic polynomial based on the configuration

and configuration-space velocity at the start and end of each motion segment . We then combine a sequence of splines to traverse through the whole peg-transfer motion. At each point where one spline ends and the next starts, we ensure continuity by setting the configuration and velocity of the end point. We define a spline as a function of time , segment parameterized start and end points, and duration of the spline segment as follows:


For a peg-transfer motion we define a trajectory by 4 waypoints (and therefore 3 segments): (0) the pose in which the robot is initially grasping the block on its starting peg, (1) the pose with the block lifted to clear its starting peg, (2) the pose with the block vertically over its ending peg, and (3) the pose with the block lowered onto the ending peg and about to release its grasp (Fig. 7.2). The configurations of these waypoints are all defined by the IK solution and denoted by . We denote the velocities at these waypoints by . The velocity at waypoints (0) and (3) is zero. The velocities at (1) and (2) are in the direction of lifting and lowering respectively, with their magnitude optimized through gradient descent. We let denote the duration of segment . For a single transfer motion, we thus have the following trajectory sequence:

We compute the direction of the velocity at (1) and (2) in joint space by computing the direction of the motion in end-effector space and transforming it through the inverse of the Jacobian of the robot at the corresponding configuration, thus:

where is the direction to lift for (1), and the direction to lower for (2). In practice, we orient the peg board such that the pegs are aligned to the -axis.

To compute the duration of the splines , , , we compute spline coefficients for integer multiples of 0.01 s, and select the shortest duration that satisfies all joint velocity and acceleration limits. The maximum velocity of the spline occurs at the root of the second derivative (when the spline’s coefficients ). The maximum acceleration occurs at either end of the spline (thus or ).

To compute the magnitude of the velocity at (1) and (2), and thus get the velocity , we iteratively take gradient steps according to a cost function defined to be that minimizes the time for each segment of the trajectory. This algorithm computes the time-optimized trajectories and is shown in Alg. 7.2.

1:Waypoints , directions , step rate , maximum iterations max_iter
2: Initial guess
4:for iter  do
7:     if  then
8:         break
9:     end if
11:end for

8 Physical Experiments

We evaluate the proposed calibration and trajectory optimization procedures by performing three peg-transfer tasks defined in Section 3. We use a bilateral dVRK [dvrk2014] with two large needle driver tools. We use a Zivid One+S RGBD camera mounted 0.5 m from the workspace. The camera has roughly a vertical incline. For consistency, we use the same 3D-printed peg board and blocks as in Hwang et al. [hwang2020efficiently]. Fig. 1 shows the setup.

8.1 Baselines

For each experiment, we consider the following approaches involving autonomous robot surgery:

Uncalibrated, Unoptimized: This uses the default robot PID controller that uses encoder estimates to track trajectories and performs no trajectory optimization.

Calibrated, Unoptimized: This uses the recurrent dynamics model (Section 5) for control but does not optimize naively generated trajectories for pick and place and handover. We hypothesize the calibration procedure will increase the accuracy of the system, but may not necessarily change its speed.

Calibrated, Optimized: This uses both the recurrent dynamics model (Section 5) for control and the trajectory optimization techniques (Section 7). We do not consider trajectory optimization without calibration, as high-speed collisions may damage instruments and we do not expect optimization to affect the accuracy of the uncalibrated model. We hypothesize that this method will be as accurate as Calibrated, Unoptimized, but result in much faster motions.

Unilateral Peg-Transfer Task Experiments: The calibrated methods and the Human Resident both achieve a success rate on 10 trials. Despite trajectory optimization, the surgeon is still faster than the automated system. However, the calibrated, optimized method is faster than our prior results. [hwang2020efficiently] Method Mean Transfer Time (s) Success / Attempts Success Rate (%) Human Untrained 9.9 112/117 95.7 Human Resident 4.8 120/120 100 Uncalibrated, Unoptimized 7.2 52/94 55.3 Calibrated, Unoptimized 7.0 120/120 100 Calibrated, Optimized 6.3 120/120 100

Parallel Bilateral Peg-Transfer Task Experiments: The calibrated, optimized method is slightly more successful and faster than the Human Resident. Unlike the unilateral case, even the unoptimized trajectories are faster than the Human Resident, whose mean transfer time did not decrease significantly from Table 8.1. This suggests that parallelism is significantly easier for the autonomous system to achieve than for the surgeon. Method Mean Transfer Time (s) Success / Attempts Success Rate (%) Human Untrained 7.1 118/119 99.2 Human Resident 4.5 117/119 98.3 Uncalibrated, Unoptimized 4.1 41/82 50.0 Calibrated, Unoptimized 4.0 116/120 96.7 Calibrated, Optimized 3.5 118/120 98.3

Bilateral Handover Peg-Transfer Task Experiments: We observe that the calibrated, optimized method is able to outperform the Human Resident in terms of success rate and speed. The mean transfer time for the Human Resident increased (i.e., slowed down) from the unilateral task, while the mean transfer time for the calibrated, optimized method only increased by (Table 8.1). As in Table 8.1, this suggests that the autonomous system is better able to parallelize subtasks than the human. Method Mean Transfer Time (s) Success / Attempts Success Rate (%) Human Untrained 10.8 117/120 97.5 Human Resident 8.6 109/117 93.2 Uncalibrated, Unoptimized 8.8 10/69 14.5 Calibrated, Unoptimized 8.3 108/118 91.5 Calibrated, Optimized 8.1 111/118 94.1

Failure Modes: We discuss the frequency of errors of all methods on all peg transfer task variants. We observe that the human surgeon makes few errors, except on the handover maneuver, which he reported to be a challenging subtask. The second row demonstrates the importance of calibration, since many failures occur during the picking and handover subtasks. Both calibrated versions rarely fail in comparison, and almost all failures occur during placement. The frequency of errors is comparable to the frequency of errors made by the surgical resident. Method Unilateral, Unilateral, Parallel, Parallel, Handover, Handover, Handover, Pick Place Pick Place Pick Place Handover Human Untrained 0 (0.0%) 5 (4.3%) 0 (0.0%) 1 (0.8%) 0 (0.0%) 0 (0.0%) 3 (2.5%) Human Resident 0 (0.0%) 0 (0.0%) 0 (0.0%) 2 (1.7%) 0 (0.0%) 0 (0.0%) 8 (6.8%) Uncalibrated, Unoptimized 35 (37.2%) 7 (11.9%) 40 (48.8%) 1 (2.4%) 9 (13.0%) 10 (50.0%) 40 (66.7%) Calibrated, Unoptimized 0 (0.0%) 0 (0.0%) 0 (0.0%) 4 (3.3%) 2 (1.7%) 8 (6.9%) 0 (0.0%) Calibrated, Optimized 0 (0.0%) 0 (0.0%) 0 (0.0%) 2 (1.7%) 0 (0.0%) 7 (5.9%) 0 (0.0%)

We compare the autonomous system to two human teleoperators. One does not have prior experience with robotic surgery or teleoperation (Human Untrained) and the other is a surgical resident (Human Resident). The untrained human is a graduate student in the Department of Mechanical Engineering at the University of California, Berkeley, who practiced transferring blocks for 1 hour prior to the experiment. The surgical resident is coauthor Dr. Danyal Fer. He has 3 years of general surgery experience, and has performed over 600 open and minimally invasive procedures. He also provided the human baselines for experiments reported in Hwang et al. [hwang2020applying]. Both humans teleoperate the arms and view the workspace through an endoscopic stereo camera.

8.2 Peg Transfer Results

We present and discuss results for the unilateral, parallel bilateral, and bilateral handover cases in Tables 8.18.1, and 8.1, respectively. Each case involves running the three proposed methods described in Subsection 8.1, plus the surgical resident, for 10 trials (see Section 3). We also report failure cases for all peg-transfer variants in Table 8.1.

8.2.1 Unilateral (One Arm)

In this case, we find that the human resident has the fastest mean transfer time of 4.8 s, with a perfect 120/120 success rate (Table 8.1). The two calibrated methods also attain perfect 120/120 success rates across 10 trials, matching the surgical resident’s performance. While the untrained human has a relatively high success rate of 95.7%, this comes at the cost of taking an average of 9.9 s per transfer attempt, suggesting that the untrained human optimized more for precision than speed. We observe a similar trend for the untrained human across the other task variants.

The fastest reported times for an autonomous surgical robot on this task and setup with a comparable success rate is 13.6 s per transfer (see Table III in Hwang et al. [hwang2020efficiently]), suggesting a substantial improvement of 65%. Within this study, the trajectory optimization procedure is able to reduce mean transfer time by 10%. However, the calibrated, optimized method is still slower than the surgeon, suggesting that there is still further room for improvement. Factors that delay the automated system include computing registration of the block and grasping pose, calculation of trajectory planning, and grasping timing which is not optimized in this paper.

8.2.2 Parallel Bilateral (Two Arms)

In this case, we find that the calibrated, optimized method is 22% faster than the human resident in mean transfer time, and is just a hair better than the human resident in success rate (Table 8.1). This task is also studied in in prior work (see Table I in Hwang et al. [hwang2020applying]), which reports a success rate of 78.0% and a mean transfer time of 5.7 s. In contrast, this work improves the success rate by over 20% and mean transfer time by 39%.

8.2.3 Bilateral Handover

In the bilateral handover case, we find that the proposed Calibrated, Optimized method has the highest success rate and the fastest mean transfer time (Table 8.1). The calibrated, optimized method outperforms the human resident by 0.9% in accuracy and 6% in mean transfer time. We attribute the faster speed in this setting to the surgical resident’s observation that ensuring the handover step worked correctly is difficult and requires time and care. Table 8.1 indicates that handovers were the main source of the resident’s failures. In addition, the way the surgical robot does the automated handover is slightly different and does not involve a significant rotation of the wrist joints. While the surgeon rotates the block with the first gripper to grasp it on the opposite side with the second gripper, the robot grasps the same side of the block with both grippers during handover. This does not require rotation of the block during the automated handover motion, which increases efficiency. To the best of our knowledge, this is the first automation study of the bilateral handover peg-transfer task.

9 Discussion and Future Work

This paper unifies and extends prior work on automating the Fundamentals of Laparoscopic Surgery peg-transfer task. We present results on automating three variants: unilateral, bilateral without handovers, and bilateral with handovers. Results suggest that the proposed system attains high accuracy and speed. Both are important; results from the untrained human suggest that he optimized primarily for accuracy, resulting in significantly slower block transfer times.

For the most difficult variant of peg transfer (with handovers) over 120 trials, the experienced surgical resident achieves success rate with mean transfer time of 8.6 s. The automated system achieves success rate with mean transfer time of 8.1 s. To our knowledge this is the first fully automated system to achieve “superhuman” performance in speed and success on peg transfer.

We are continuing to investigate methods for increasing transfer speeds with additional improvements to sensing, motion planning, and pipelining. We will also explore how this system could be extended for other fully- and partially- supervised surgical subtasks such as suturing. Acknowledgments This research was performed at the AUTOLAB at UC Berkeley in affiliation with the Berkeley AI Research (BAIR) Lab, Berkeley Deep Drive (BDD), the Real-Time Intelligent Secure Execution (RISE) Lab, the CITRIS “People and Robots” (CPAR) Initiative, and with UC Berkeley’s Center for Automation and Learning for Medical Robotics (Cal-MR). The authors wish to acknowledge the support of the Technology & Advanced Telemedicine Research Center (TATRC) for their support of this research, performed under contract No. W81XWH-19-C-0096. The research is an component of the Telerobotic Operative Network (TRON) project, being led by SRI International. The authors were supported in part by donations from Siemens, Google, Toyota Research Institute, Honda, Intel, and Intuitive Surgical. The da Vinci Research Kit was supported by the National Science Foundation, via the National Robotics Initiative (NRI), as part of the collaborative research project “Software Framework for Research in Semi-Autonomous Teleoperation” between The Johns Hopkins University (IIS 1637789), Worcester Polytechnic Institute (IIS 1637759), and the University of Washington (IIS 1637444). Daniel Seita is supported by a Graduate Fellowships for STEM Diversity. We thank Jaydev Desai and the ISMR program chairs, and Gary Guthart, Simon DiMaio, and Dale Bergman of Intuitive Surgical. We thank Yoonsoo Rho for conducting physical experiments to produce “Human Untrained” data.