Robotic Surgical Assistants (RSAs) currently rely on human supervision for the entirety of surgical tasks, which can consist of many repetitive subtasks such as suturing. Automation of surgical subtasks may reduce surgeon fatigue [yip2017robot], with initial results in surgical cutting [thananjeyan2017multilateral, murali2015learning], debridement [seita_icra_2018, murali2015learning], suturing [sen2016automating, superhuman_dict, thananjeyan2019safety, chiu2020bimanual, saeidi_suturing_icra_2019, extraction_needles_2019, automated_needle_pickup_2018, improved_knots_case_2013], hemostasis [ritcher_bloodflow_2020], and peg transfer [paradis2020intermittent, hwang2020applying, hwang2020efficiently, auto_peg_transfer_2015]. This paper considers automation of the bimanual regrasping subtask [chiu2020bimanual] of surgical suturing, which involves passing a surgical needle from one end effector to the other. This handover motion is performed between stitches during suturing and is a critical step, as accurately positioning the needle in the end effector affects the stability of its trajectory when guided through tissue. Because varying tension in the cables driving the arms causes inaccuracies in motion, a high-precision task such as passing a needle between the end effectors is challenging [mahler2014case, hwang2020applying, paradis2020intermittent, hwang2020efficiently, peng2020real]. The task is also difficult because 3D pose information is critical for successfully manipulating needles, and surgical needles are challenging to perceive with RGB or active depth sensors due to their reflective surface, thin profile [kollar2021simnet], and self-occlusions (Figure 2). Prior work has mitigated this by painting needles [sen2016automating, chiu2020bimanual] and using color segmentation, but this solution is not practical for clinical use. To manipulate unmodified surgical needles, we combine recent advances in deep learning, active sensing, and visual servoing. We present HOUSTON: Handoff of Unmodified, Surgical, Tool-Obstructed Needles, a problem and algorithm for using stereo vision with coarse and fine-grained control policies (Figure 1) to sequentially localize, orient, and hand over unmodified surgical needles. We present a localization method using stereo RGB with a deep segmentation network to output a point cloud of the needle in the workspace. This point cloud is used to define a coarse robot policy that uses visual servoing to reorient the needle for handover in a pose that is visible to the cameras and accessible by the other end effector. However, due to inaccuracies in robot positioning and the perception system, further corrections may be necessary. We train a fine robot policy from a small set of human demonstrations to perform these subtle but critical corrections from images of unmodified surgical needles.
This paper makes the following contributions:
A perception pipeline using stereo RGB to accurately estimate the pose of surgical steel needles in 3D space, enabling needle manipulation without active depth sensors or painted needles.
A visual servoing algorithm to perform coarse reorientation of a surgical needle for grasping.
A needle grasping policy that performs fine control of the needle learned from a small set of human demonstrations to compensate for robot positioning and needle sensing inaccuracies.
A combination of the pose estimator (1), the servoing algorithm (2), and the needle controller (3) to perform bimanual surgical needle regrasping, where physical experiments on the da Vinci Research Kit (dVRK) [dvrk2014] suggest a success rate of on needles used in training, and on needles unseen in training. On sequential handovers, HOUSTON successfully executes 32.4 handovers on average before failure.
II Related Work
II-A Automation in Surgical Robotics
Automation of surgical subtasks is an active area of research with a rich history. Prior literature has studied automation of tasks related to surgical cutting [thananjeyan2017multilateral, murali2015learning, krishnan2019swirl], debridement [murali2015learning, kehoe2014autonomous], hemostasis [ritcher_bloodflow_2020], peg transfer [hwang2020applying, hwang2020efficiently, paradis2020intermittent], and suturing [sen2016automating, thananjeyan2019safety, chiu2020bimanual, extraction_needles_2019, automated_needle_pickup_2018, saeidi_suturing_icra_2019]. While automated suturing has been studied in prior work [saeidi_suturing_icra_2019, sen2016automating], suturing without modifications such as painted fiducial markers remains an open research problem. Recent work studies robust and general approaches to specific subproblems within suturing, including the precise manipulation of surgical needles from needle extraction [extraction_needles_2019] to bimanual regrasping [chiu2020bimanual], which is the focus of this work. For example, [extraction_needles_2019] studies the extraction of needles from tissue phantoms and computes robust grasps of the needle even in self-occluded configurations. Bimanual needle regrasping was studied in detail by [chiu2020bimanual], with impressive results on simulation-trained policies that take needle and end effector poses as input. We extend their problem definition to consider multiple handoffs of unmodified needles and end effectors, which requires perception of the needle and robot pose from images without color segmentation. Needle grasping has also been studied using visual servoing policies in [automated_needle_pickup_2018], where the needle is painted with green markers to track its position during closed-loop visual servoing.
[varier2020collaborative] study tabular RL policies for needle regrasping in a discretized space, in a fixed setup with a known needle pose, with experiments on the dVRK conducted without a needle. The experiments in [varier2020collaborative] suggest that value iteration-trained policies can mimic the expert trajectories used for inverse reinforcement learning. In contrast, we present an algorithm compatible with significantly varying initial needle and gripper poses using only image observations. We additionally present extensive physical experiments with a needle, evaluating the success rate and speed of the algorithm.
II-B Visual Servoing and Active Perception
Visual servoing (VS) is a popular technique in robotics [hutchinson1996tutorial, kragic2002survey], and has recently been applied to compensate for surgical robot imprecision in the surgical peg transfer task [paradis2020intermittent]. While classical VS approaches typically make use of hand-tuned visual features and known system dynamics [chaumette2006visual, caron2013photometric], recent work proposes learning end-to-end visual servoing policies from examples [levine2018learning, QT-Opt]. To reduce the need for tuned features and dynamics models, and to reduce the number of training samples required to create a robust VS policy for bimanual needle regrasping, we present a hybrid approach that combines coarse motion planning with fine control, where a learned VS policy is used only in the parts of the task that require high precision. This framework, called intermittent visual servoing (IVS), was studied in detail by [paradis2020intermittent], where the system switches between a classical trajectory optimizer and an imitation-learned VS policy based on the precision required at the time. Inspired by this technique, we present an IVS approach to bimanual needle regrasping that combines coarse perception and planning with fine VS control. Because this task requires reasoning about depth along several directions, we present a multi-view VS policy that learns to precisely hand over the needle based on several camera views. Active perception is a popular technique with many variations for localizing objects prior to manipulation by maximizing information gain about their poses [mihaylova2002comparison, salaris2017online, whitehead1990active, bajcsy1988active, arruda2016active]. This has been studied in the context of robot-assisted surgery, where the endoscope position is automatically adjusted via a policy learned from demonstrations to center the camera focus on inclusions during surgeon teleoperation. In this work, we actively servo the needle to highly visible poses to maximize the accuracy of its pose estimate. This is most similar to [arruda2016active], where the authors propose an algorithm to actively select views of a grasping workspace to uncover enough information about unknown objects to plan grasps.
III Problem Formulation
The HOUSTON problem extends and generalizes the previous problem definition from [chiu2020bimanual] to include unmodified needles, occlusion, and multiple handoffs.
In the HOUSTON problem, the surgical robot starts with a curved surgical needle with known curvature and radius grasped by one gripper and must accurately pass it to the other gripper and back. This is challenging due to the needle’s reflective surface, thin profile, and pathological configurations [extraction_needles_2019, chiu2020bimanual] as depicted in Figure 2. Once the needle is successfully passed to the other end effector, it is passed back to the first end effector. This process is repeated times, or until the needle is dropped. We also consider a special case of this problem, the single-handover version, in which .
Let and denote the poses of the left and right grippers, respectively, at discrete timestep with respect to a world coordinate frame. The needle has pose with respect to the world frame. Observations of the workspace are available via RGB images from a stereo camera, and , or overhead monocular RGB images from an RGB camera. The left and right cameras in the stereo pair have world poses and , respectively. The overhead camera has pose in world frame . Each trial starts with the needle in the left gripper and ends when the needle is dropped. Additionally, the trial terminates if no successful handoff occurs in timesteps.
At timestep , the algorithm is provided observation which contains images from the sensors: . The algorithm outputs a target pose and jaw state for each end effector , where indicates the whether the left jaw is closed at timestep .
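The observation/action interface described above can be sketched as plain data containers. This is an illustrative sketch: the field names and array shapes below are our assumptions, not the paper's notation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """Sensor images available to the algorithm at one timestep."""
    left_stereo: np.ndarray    # RGB image from the left camera of the stereo pair
    right_stereo: np.ndarray   # RGB image from the right camera of the stereo pair
    overhead: np.ndarray       # monocular RGB image from the overhead camera


@dataclass
class Action:
    """Per-arm command: a target end-effector pose and a jaw open/close state."""
    left_pose: np.ndarray      # 4x4 homogeneous target pose for the left gripper
    left_jaw_closed: bool      # whether the left jaw is commanded closed
    right_pose: np.ndarray     # 4x4 homogeneous target pose for the right gripper
    right_jaw_closed: bool     # whether the right jaw is commanded closed
```

A policy for this problem then maps an `Observation` to an `Action` at each timestep.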
In order to deterministically evaluate HOUSTON policies in a wide variety of needle configurations, we discretize the needle-in-gripper pose possibilities by choosing a number of categories across three degrees of freedom:
The needle’s curve can face either towards or away from the camera, providing 2 possibilities.
The gripper may hold the needle either at the tip, or 30° inwards following the curvature of the needle. This degree of freedom has 2 possibilities.
The rotation of the needle about the tangent line at the point where the gripper grasps it ranges from to , and is discretized into 7 bins as in Figure 3.
This gives a total of 28 possible needle configurations and we perform a grid search over these possibilities. We chose these configurations to be representative of those seen in suturing tasks post-needle extraction. While the robot encoders provide an estimate of the gripper poses and , the precise needle pose is unknown due to cabling effects of the arms. We assume access to a stereo RGB pair in the workspace, an overhead RGB camera, and the transforms between the coordinate frames of these cameras and the robot arms.
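The 2 x 2 x 7 evaluation grid above can be enumerated directly; only the counts come from the paper, while the label strings below are illustrative:

```python
from itertools import product

# Degrees of freedom of the needle-in-gripper configuration grid:
# 2 curve-facing directions x 2 grasp points x 7 rotation bins = 28.
CURVE_FACING = ("towards_camera", "away_from_camera")
GRASP_POINT = ("tip", "30deg_inward")       # tip, or 30 degrees along the curve
ROTATION_BINS = range(7)                    # discretized rotation about the tangent

CONFIGURATIONS = list(product(CURVE_FACING, GRASP_POINT, ROTATION_BINS))
```

A grid search over `CONFIGURATIONS` then visits each of the 28 starting states once per evaluation pass.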
III-D Evaluation Metrics
We evaluate HOUSTON by recording: i) the number of successful handoffs in a multi-handoff trial, ii) the success rate per arm of single handoffs beginning from each configuration in III-C, and iii) the average time for each handoff.
IV HOUSTON Algorithm
HOUSTON uses active stereo visual servoing with both a coarse-motion and fine-motion learned policy for the bimanual regrasping task.
IV-A Phase 1: Active Needle Presentation
In the first phase, the algorithm repositions the needle to a pose where the other arm can easily grasp it without collisions and the cameras can clearly view it. Throughout execution of the coarse policy, we parameterize the needle as a circle of known radius, and measure its state in the world frame as a center point, normal vector, and needle tip point, as shown in Figure 6. Active Needle Presentation consists of two stages: needle acquisition and handover positioning. The needle acquisition stage moves the needle to maximize visibility, and the positioning stage uses visual servoing to move the needle into a graspable state.
IV-A1 Needle State Estimation
The state estimator passes stereo images into a fully convolutional neural network trained to output a segmentation mask for the needle in each image (see the project website for architecture details). It computes a distance transform of each segmentation mask to label each pixel with its distance to the nearest unactivated pixel. Next, it finds peaks in the distance transform along horizontal lines in each image, which correspond to points near the centers of activated patches. It then triangulates all pairs of peaks along each horizontal line in the two images to obtain a point cloud, as in Figure 4. Because this may triangulate outlier points from the gripper or incorrectly match points on different parts of the needle, RANSAC is applied to filter out incorrect point correspondences and return the final predicted needle state. At each iteration, it samples a set of 3 points, to which a plane is fit. Each sampled subset generates 2 candidate circles in the plane, corresponding to the two circles of known radius that pass through one of the pairs of points. RANSAC uses an inlier radius of 1 mm and runs for 300 iterations. The network is first trained on a dataset of 2000 simulated stereo images of randomly placed, textured, and scaled floating needles and random objects [calli2017yale] above a surface plane, generated in Blender 2.92. Lighting intensity, size, and position, as well as stereo camera position, are also randomized. The segmentation network is then fine-tuned on a dataset of 200 manually labeled images of the surgical needle in the end effector. Training the network on a PC with an NVIDIA V100 GPU takes 3 hours, and fine-tuning takes 10 minutes.
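The RANSAC filtering step can be sketched as follows, assuming a triangulated point cloud as input. This is a simplified illustration of fitting a circle of known radius: the candidate-center construction and the tolerances follow the description above, but the function and variable names are ours.

```python
import numpy as np


def ransac_circle(points, radius, n_iters=300, tol=1e-3, seed=0):
    """Fit a 3D circle of known radius to a noisy point cloud with RANSAC.

    points: (N, 3) array in meters; tol is the inlier radius (1 mm here).
    Returns the (center, plane normal) of the best-supported candidate.
    """
    rng = np.random.default_rng(seed)
    best_count, best = -1, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)             # normal of the sampled plane
        if np.linalg.norm(n) < 1e-9:
            continue                               # degenerate (collinear) sample
        n /= np.linalg.norm(n)
        chord = p1 - p0
        half = 0.5 * np.linalg.norm(chord)
        if half > radius:
            continue                               # chord longer than the diameter
        # The two candidate centers lie in the plane on the chord's
        # perpendicular bisector, at distance sqrt(r^2 - half^2) from its midpoint.
        mid = 0.5 * (p0 + p1)
        perp = np.cross(n, chord / (2 * half))
        h = np.sqrt(radius ** 2 - half ** 2)
        for center in (mid + h * perp, mid - h * perp):
            d_plane = (points - center) @ n        # signed distance to the plane
            proj = points - np.outer(d_plane, n)   # projection into the plane
            d_circ = np.abs(np.linalg.norm(proj - center, axis=1) - radius)
            count = int(np.sum((np.abs(d_plane) < tol) & (d_circ < tol)))
            if count > best_count:
                best_count, best = count, (center, n)
    return best
```

In the full pipeline, the inlier set of the winning circle would also provide the tip estimate (the inlier furthest from the gripper) used later during presentation.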
IV-A2 Visual Servoing
HOUSTON uses Algorithm 1 to compute updates for visual servoing in both the needle presentation and handover positioning phases. Algorithm 1 is a fixed-point iteration method that uses a state estimator to iteratively visually servo to a target state. Similar to first- and second-order optimization algorithms, it computes a global update based on its current state and iterates until the computed update is zero. In each iteration, it queries the current 3D state estimate of the needle and then computes an update step in the direction of the target state. To compensate for estimation errors due to challenging needle poses, this process is repeated at each iteration until the algorithm converges to a local optimum within a pose error tolerance. During needle acquisition, the arm moves to a home pose, then rotates around the world and axes, stopping when the state estimator observes a threshold number of inlier points from RANSAC circle fitting. During trials, at most 2 consecutive rotations sufficed to resolve the state to this degree. After this initial acquisition, we apply Algorithm 1 to align towards the left stereo camera position , with defined as a rotation about the axis . Once the needle is clearly presented to the camera, we measure by choosing the inlier point from the circle fitting step that is furthest from the gripper in 3D space. Subsequently, during handover positioning, we compute inverse kinematics to move towards the center of the workspace with pointing towards the other gripper and orthogonal to the table plane. This flat configuration is critical for the grasping step, since the dVRK arm is primarily designed for top-down grasps near the center of its workspace. Because the arm only has 5 rotational degrees of freedom, we use a numerical IK solver [tracik] and attempt to find a configuration minimizing rotational error within a cm tolerance region on end effector translation.
After moving to this pose, we repeat Algorithm 1, with defined as a rotation aligning towards the other gripper and orthogonal to the table. We calculate IK to a configuration with needle curvature towards the camera and one with curvature away from the camera, then pick the configuration which minimizes rotational error to the goal.
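The fixed-point structure of Algorithm 1 can be sketched as follows. Here `estimate_state` and `command_delta` are hypothetical stand-ins for the needle state estimator and the robot motion interface, and the step fraction and tolerance are illustrative values, not the paper's.

```python
import numpy as np


def servo_fixed_point(estimate_state, command_delta, target,
                      step=0.5, tol=1e-3, max_iters=50):
    """Fixed-point iteration toward a target state.

    Each iteration re-queries the state estimator, computes an update in the
    direction of the target, and commands the robot to execute it; iteration
    stops once the remaining error is within the pose error tolerance, so
    estimation and actuation errors are corrected on subsequent iterations.
    """
    for _ in range(max_iters):
        state = np.asarray(estimate_state(), dtype=float)
        error = target - state
        if np.linalg.norm(error) < tol:
            break                       # converged within tolerance
        command_delta(step * error)     # partial correction; re-estimate next loop
    return np.asarray(estimate_state(), dtype=float)
```

Re-estimating the state after every commanded motion is what lets the loop absorb both perception noise and the dVRK's cabling-induced positioning error.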
IV-B Phase 2: Executing a Grasping Policy
After the active presentation phase described in Section IV-A, the pose of the needle is known relatively accurately and is accessible to the grasping arm, and we execute a grasping policy to grasp the needle. However, the needle pose estimate after the first phase may not be perfect, so we visually servo to compensate for these errors when grasping. Even if the needle pose were perfectly known, reliable grasping of a small needle would still be challenging due to the positioning errors of the robot, which are a result of its cable-driven arms [paradis2020intermittent, hwang2020efficiently, hwang2020applying, seita_icra_2018, peng2020real, mahler2014case]. The policy splits corrective actions between the x- and y-axes with two sub-policies. Each sub-policy uses RGB inputs ego-centrically cropped around the grasping arm, with the x- and y-axis policies using pixel crops from the inclined camera and from the overhead camera, respectively. The cropping forces the policy to condition on the relative position of the gripper and needle without the ability to overfit to texture cues from other parts of the scene. The fine-grained grasping policy and image crops are displayed in Figure 5. We ablate different design choices for the grasping correction policy and also present open-loop grasping results in Section V. Each grasping sub-policy is a neural network classifier that outputs whether the grasping arm should move in one direction along its axis (down in the crop) or the other (up in the crop). The policies are trained on offline human demonstrations collected through two methods: 1) we sample poses for the arms in the workspace such that the needle orientation is perturbed about each axis, then move the robot to a good grasping position via a keyboard teleoperation interface; 2) we execute the pre-handover positioning routine and position the robot at the desired grasp location by hand, after which the robot autonomously iterates through offsets in the x and y directions, labeling actions according to the offset from the goal position. We experimentally find that separating the policy across the two axes significantly improves grasp accuracy (Section V). A separate grasping policy is trained for each arm on 100 demonstrations each; each demonstration takes 5-10 actions, and each dataset takes about an hour to collect. The two sub-policies are each represented by voting ensembles of 5 classifiers, each of which has three convolutional layers and two fully connected layers. Details about model architectures are located on the project website. During policy execution, we iteratively sample actions first from the x-axis policy, then the y-axis policy, waiting for each to converge before continuing. We multiply the action magnitude by a scalar every time the network outputs the opposite of the previous action. Servoing terminates when the action magnitude decays below a fixed threshold. This enables implicit convergence to the goal without explicitly training the policy to stop. After both sub-policies converge, we execute a simple downward motion of 1 cm to grasp the needle.
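The decayed-step execution of a single-axis grasping sub-policy can be sketched as follows, with `predict_direction` (the classifier ensemble) and `move` (the robot motion interface) as hypothetical interfaces and illustrative numeric defaults:

```python
def servo_with_decay(predict_direction, move, step=1.0, decay=0.5,
                     stop_below=0.01, max_steps=100):
    """Run a direction-only classifier policy with a decaying step size.

    The classifier outputs only a direction (+1 or -1); the step magnitude
    is multiplied by `decay` each time the output reverses, and servoing
    stops once the step falls below `stop_below`, giving implicit
    convergence without a learned stop action.
    """
    prev = None
    for _ in range(max_steps):
        if step < stop_below:
            break                      # step has decayed to a negligible size
        d = predict_direction()        # +1 or -1 from the ensemble classifier
        if prev is not None and d != prev:
            step *= decay              # shrink the step on direction reversal
        move(d * step)
        prev = d
```

Because the step only shrinks when the classifier flips, the arm oscillates around the goal with geometrically decreasing amplitude, which is what terminates the servoing loop.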
V Physical Experiments
The experiments aim to answer the following question: how efficient and reliable is HOUSTON compared to baseline approaches? We also perform several ablation studies of method components in this section.
To evaluate the method for the task of active needle presentation, we compare to the following baselines:
Depth-based presentation: Instead of using the stereo RGB network to detect the needle pose, this baseline uses a depth image-based detection algorithm and servos the needle to the flat grasping pose. It takes the depth image from the stereo camera's built-in depth computation as input, masks the gripper out of the depth image using the dVRK's forward kinematics, performs a volume crop around the end effector, and fits a circle of known radius to the points using RANSAC to extract the state.
No-Sim-Data: An ablation of the RGB stereo segmentation network that is trained only on the small dataset of real images.
[Table I: single-handover results, reporting successes/total, success rate with confidence interval (low/high), completion time (s), and failure counts by mode (P/X/Y) for Shared Grasp Policy, No Sim Data, HOUSTON (Left to Right), HOUSTON (Right to Left), and HOUSTON (Right to Left) on Needles 2-4, which were unseen in training.]
[Table II: handovers completed in each of Trials 1-5 and the average.]
To evaluate the design choices used in the fine-grained grasping policy, we compare to the following baselines:
Open Loop: Executes an open loop grasping motion to grasp the needle based only on needle geometry and inverse kinematics.
Shared Grasp Policy: Trains a single policy to output displacements along both axes, taking both camera crops as input.
To evaluate whether the system can transfer to needles unseen in training, we evaluate HOUSTON on three additional needles as in Figure 7.
V-B Experimental Setup
We perform experiments using the da Vinci Research Kit (dVRK) surgical robot [dvrk2014], a cable-driven surgical robot with two needle driver arms, each with a gripper that can open 1 cm. For perception, the setup includes a ZED Mini stereo camera angled slightly downwards to face the arms and an overhead Zivid One Plus M camera facing directly down. Stereo images are captured at 2K resolution, and overhead images are captured at 1080p. The locations of the arms relative to each of the cameras are statically calibrated.
V-B1 Single handover
For single handover experiments, we manually vary the orientation of the gripper before each trial to the orientations described in III-C. A handoff is considered successful if the needle switches from one gripper to the other, and at the end is fully supported by the other gripper.
V-B2 Multiple handovers
For multiple handover experiments, we start the needle in the left gripper in a visible configuration to the camera, so that all errors are a result of handoffs rather than initialization. We evaluate two configurations: one where the needle arc ends facing the stereo camera in the grasping configuration (Towards), and one where it faces the opposite direction (Away). This configuration is typically maintained throughout each multi-handover trial because of the consistency of the needle presentation step.
V-C Single Handover Results
We evaluate HOUSTON and baselines on the single handover task in Table I, performing multiple systematic passes over the starting configurations described in Section III-C. We find that HOUSTON performs the task more reliably than the baselines, which experience either many presentation errors or many grasp positioning errors.
V-D Multiple Handover Results
We evaluate HOUSTON on the multiple handover task with two different starting configurations (Table II). We observe that the algorithm completes successful handovers on average in the first configuration and in the second. In three trials, no errors occur, and we manually terminate them after successful handovers.
V-E Failure Analysis
HOUSTON encounters three failure modes:
P: Presentation error: the robot fails to present the needle in an orientation that is in the plane of the table with the needle tip pointing toward the grasping arm. This may lead to grasping angles that are unreachable or out of the training distribution for the grasping arm.
X: Grasping positioning error (X): the x-axis grasping policy fails to line up with the needle prior to executing the y-axis grasping policy.
Y: Grasping positioning error (Y): the y-axis grasping policy fails to line up with the needle prior to grasping.
We categorize all of the failure modes encountered in Table I. We find that the open-loop grasping policies are not able to consistently position well for grasping. HOUSTON's failures are evenly distributed across the failure modes. Grasp policy servoing errors stem mainly from needle configurations far outside the distribution seen in training. Presentation phase failures stem primarily from mis-detection of the needle's true tip, either because of incomplete segmentation masks or because drift in the robot kinematics causes the most distal needle point to not be the tip. This causes the servoing policy to rotate the needle away from the camera, after which it sometimes loses visibility and fails to bring the needle to the pre-handover pose. Multi-handoff failures most frequently arise from subtle imperfections in grasp execution that rotate the needle to a difficult angle; during the subsequent handover, the needle can become obstructed by the holding gripper, inhibiting the grasping policy.
In this work we present HOUSTON, a problem and an algorithm for reliably completing the bimanual regrasping task on unpainted surgical needles. To our knowledge, this work is the first to study the unmodified variant of the regrasping task. The main limitations of this approach are its reliance on human demonstrations to learn the grasping policy and its sensitivity to needle and environment appearance. We hypothesize that the former could be mitigated via self-supervised demonstration collection, or by exploring unsupervised methods for fine-tuning behavior-cloned policies. Future work will address the latter issue by exploring more powerful network architectures that leverage stereo disparity, such as [kollar2021simnet], and designing more autonomous data collection techniques that can label real needle data without human input. In future work, we will also study how to reorient needles between handovers for precise control of the needle-in-hand pose, and attempt to make needle tracking more robust to occlusions from tissue phantoms.