SuPer: A Surgical Perception Framework for Endoscopic Tissue Manipulation with Surgical Robotics

Traditional control and task automation have been successfully demonstrated in a variety of structured, controlled environments through the use of highly specialized modeled robotic systems in conjunction with multiple sensors. However, application of autonomy in endoscopic surgery is very challenging, particularly in soft tissue work, due to the lack of high-quality images and the unpredictable, constantly deforming environment. In this work, we propose a novel surgical perception framework, SuPer, for surgical robotic control. This framework continuously collects 3D geometric information that allows for mapping of a deformable surgical field while tracking rigid instruments within the field. To achieve this, a model-based tracker is employed to localize the surgical tool with a kinematic prior in conjunction with a model-free tracker to reconstruct the deformable environment and provide an estimated point cloud as a mapping of the environment. The proposed framework was implemented on the da Vinci Surgical System in real-time with an end-effector controller where the target configurations are set and regulated through the framework. Our proposed framework successfully completed autonomous soft tissue manipulation tasks with high accuracy. The demonstration of this novel framework is promising for the future of surgical autonomy. In addition, we provide our dataset for further surgical research.


I Introduction

Surgical robotic systems, such as the da Vinci® robotic platform (Intuitive Surgical, Sunnyvale, CA, USA), are increasingly utilized in operating rooms around the world. Use of the da Vinci robot has been shown to improve accuracy by reducing tremors, and it provides wristed instrumentation for precise manipulation of delicate tissue [1]. Recent research has developed new control algorithms for surgical task automation [2]. Surgical task automation could reduce surgeon fatigue and improve procedural consistency through the completion of tasks such as suturing [3], cutting [4], and tissue debridement [5].

Significant advances have been made in surgical robotic control and task automation. However, the integration of perception into these controllers is deficient even though the capabilities of surgical tool and tissue tracking technologies have advanced dramatically in the past decade. Without properly integrating perception, control algorithms will never be successful in non-structured environments, such as those under surgical conditions.

Fig. 1: A demonstration of the proposed surgical perception framework. A green point on the perception model of the tissue, shown in top right, is selected by the user and the real surgical robot grasps and stretches the tissue at that location. As seen in the bottom two images, the framework is able to capture the tissue’s deformation from the stretching.

In this work, we propose a novel Surgical Perception framework, SuPer, which integrates visual perception from endoscopic image data with a surgical robotic control loop to achieve tissue manipulation. A vision-based tracking system is carefully designed to track both the surgical environment and robotic agents, e.g. tissue and surgical tool as shown in Fig. 1. However, endoscopic procedures have limited sensory information provided by endoscopic images and take place in a constantly deforming environment. Therefore, we separate the tracking system into two methodologies: model-based tracking to leverage the available kinematic prior of the agent and model-free tracking for the unstructured physical world. With the proposed 3D visual perception framework, surgical robotic controllers can manipulate the environment in a closed loop fashion as the framework maps the environment, tracking the tissue deformation and localizing the agent continuously and simultaneously. In the experimental section, we also demonstrate an efficient implementation of the proposed framework on a da Vinci Research Kit in which we successfully manipulate tissue.

To the best of our knowledge, the proposed perception framework is the first work to combine 3D visual perception algorithms for general control of a surgical robot in an unstructured, deforming environment. More specifically, our contributions can be summarized as

  1. a perception framework with both model-based tracking and model-free tracking components to track the tissue and localize the robot simultaneously,

  2. deformable environment tracking to track tissue from stereo-endoscopic image data,

  3. surgical tool tracking to accurately localize and control the surgical tool in the endoscopic camera frame, and

  4. a released data set of tissue manipulation with the da Vinci Surgical® System.

The framework is implemented on a da Vinci Surgical® System and multiple tissue manipulation experiments were conducted to highlight its accuracy and precision. We believe that the proposed framework is a fundamental step toward endoscopic surgical autonomy in unstructured environments. With a uniform perception framework in the control loop, more advanced surgical task automation can be achieved.

II Related Works

As the presented work is at the intersection of multiple communities, the related works are split into three sections.

II-1 Deformable Reconstruction

The first group of related works are from the 3D reconstruction or motion capture community [6, 7, 8]. Newcombe et al. [9] proposed a real-time method for reconstruction of a static 3D model using a consumer-level depth camera based on volumes for their internal data structure, while Keller et al. [10] employed surfel points rather than volumes. The rigidness assumption was then removed to capture motion of a deforming scene [11]. To enhance the robustness of reconstruction, key-point alignment was added to the original cost function of the deformable reconstruction [12]. In addition, multiple-sensor approaches have been shown to further improve accuracy [13]. Gao et al. [14] achieved similar results for deformable object reconstruction with surfel points.

II-2 Endoscopic Tissue Tracking

Tissue tracking is a specific area of visual tracking that often utilizes 3D reconstruction techniques. A comprehensive evaluation of different optical techniques for geometry estimation of tissue surfaces concluded that stereoscopy is the only feasible and practical approach to tissue reconstruction and tracking during surgery [15]. For image-guided surgery, Yip et al. [16] proposed a tissue tracking method with key-point feature detection and registration. 3D dynamic reconstruction was introduced by Song et al. [17] to track in-vivo deformations. Meanwhile, dense SLAM methods [18, 19] are applied to track and localize the endoscope in the surgical scene with image features. In contrast with the above mentioned algorithms, our proposed framework not only tracks the surgical environment through deformable reconstruction, but also integrates the control loop of the surgical robotic arm for automation.

II-3 Endoscopic Surgical Tool Tracking and Control

A recent literature survey by Bouget et al. [20] gave a detailed summary of image-based surgical tool detection. Markerless tracking algorithms [21, 22, 23] require features which can be learned [24, 25], generated via template matching [26], or hand-crafted [27]. After the features have been extracted, they are fused with kinematic information and encoder readings to fully localize the surgical robotic tools [28].

Once the surgical tool is localized, control algorithms can be applied to it to manipulate the environment. Previous work in control algorithms for surgical robotics includes compliant object manipulation [29], debridement removal [30, 31], suture needle manipulation [3, 32, 33], and cutting [4, 34]. These control algorithms show advanced and sophisticated manipulations; however, they rely on structured environments and would have difficulties in the real surgical scene.

III Methodology

The goal of the SuPer framework, as shown in Fig. 2, is to provide geometric information about the entire surgical scene including the robotic agent and the deforming environment. A model-based tracker via particle filter is chosen to localize the surgical robotic tool by utilizing a kinematic prior and fusing the encoder readings and endoscopic image data. For the surgical environment, a model-free deformable tracker is employed since the surgical environment is unstructured and constantly deforming. The model-free tracker uses the stereo-endoscopic data as an observation to reconstruct the deformable scene. To efficiently combine the two separate trackers, a mask of the surgical tool is generated based on the surgical tool tracker and removed from the observation given to the model-free tracking component. Since both trackers operate in the same camera coordinate frame, a surgical robotic controller can be used in our SuPer framework to manipulate the unstructured surgical scene.

Fig. 2: Flow chart of the proposed SuPer framework which integrates perception for localization and environment mapping into surgical robotic control.

III-A Surgical Tool Tracking

Surgical robots, such as the da Vinci® Surgical System, utilize setup-joints to position the base robotic arm and the endoscopic camera. These setup-joints have long links and therefore have large errors relative to the active joints during a procedure of the surgical robot [24, 26]. Furthermore, calibrating the transform from the base of the robot to the camera, also known as the hand-eye transform, rather than relying on the setup-joint kinematics has been highlighted as unreliable when controlling surgical robots [35]. Modelling this explicitly, a point on the $j$-th link, $\mathbf{p}^j \in \mathbb{R}^3$, is transformed to the camera frame:

$$\overline{\mathbf{p}}^c = \mathbf{T}^c_{b-}\,\mathbf{T}^{b-}_{b} \prod_{i=1}^{j} \mathbf{T}^{i-1}_{i}(\theta^i_t)\,\overline{\mathbf{p}}^j \tag{1}$$

where $\mathbf{T}^c_{b-}$ is the homogeneous hand-eye transform from calibration or the setup-joints, $\mathbf{T}^{b-}_{b}$ is the error in the hand-eye transform, and $\mathbf{T}^{i-1}_{i}(\theta^i_t)$ is the $i$-th homogeneous joint transform with joint angle $\theta^i_t$ at time $t$. Note that coordinate frame 0 is the base of the robot and that $\overline{\cdot}$ represents the homogeneous representation of a point (e.g. $\overline{\mathbf{p}} = [\mathbf{p}, 1]^T$). To track the surgical tools accurately, $\mathbf{T}^{b-}_{b}$ will be estimated in real-time. Similar problem formulations have been utilized in prior works for surgical tool tracking [24, 25, 26].

To track the error, the hand-eye error transform $\mathbf{T}^{b-}_{b}$ is parameterized by six scalar values: an axis-angle vector, $\mathbf{w} \in \mathbb{R}^3$, and a translational vector, $\mathbf{b} \in \mathbb{R}^3$. The motion model, feature detection algorithm, and observation models are described in the remainder of this subsection. For implementation, we elected to use the particle filter because of its flexibility to model the posterior probability density with a finite number of samples.


III-A1 Motion Model

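For initialization, the error of the hand-eye is assumed to be zero and the uncertainty of the calibration or setup-joints is modelled as Gaussian noise:

$$[\hat{\mathbf{w}}_{0|0}, \hat{\mathbf{b}}_{0|0}]^T \sim \mathcal{N}([0, \dots, 0]^T, \Sigma_0) \tag{2}$$

where $\hat{\mathbf{w}}$ and $\hat{\mathbf{b}}$ are the estimated axis-angle and translational components of the hand-eye error and $\Sigma_0$ is the covariance matrix. Similarly, the motion model is set to have additive zero-mean Gaussian noise since the particle filter is tracking the uncertainty in the hand-eye, which is a constant transform:

$$[\hat{\mathbf{w}}_{t+1|t}, \hat{\mathbf{b}}_{t+1|t}]^T \sim \mathcal{N}([\hat{\mathbf{w}}_{t|t}, \hat{\mathbf{b}}_{t|t}]^T, \Sigma_{\mathbf{w},\mathbf{b}}) \tag{3}$$

where $\Sigma_{\mathbf{w},\mathbf{b}}$ is the covariance matrix.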
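The initialization and motion model described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names, the particle count, and the covariance values are placeholders chosen for illustration, and each particle is assumed to store the six hand-eye error parameters (axis-angle followed by translation).

```python
import numpy as np

def init_particles(n_particles, sigma_0, rng):
    """Sample initial particles from a zero-mean Gaussian with covariance sigma_0."""
    return rng.multivariate_normal(np.zeros(6), sigma_0, size=n_particles)

def predict(particles, sigma_motion, rng):
    """Motion model: perturb every particle with zero-mean Gaussian noise."""
    noise = rng.multivariate_normal(np.zeros(6), sigma_motion, size=len(particles))
    return particles + noise

rng = np.random.default_rng(0)
# Placeholder covariances (assumed values, not taken from the paper).
sigma_0 = np.diag([1e-3] * 3 + [1.0] * 3)
sigma_motion = np.diag([1e-5] * 3 + [1e-2] * 3)

particles = init_particles(500, sigma_0, rng)
predicted = predict(particles, sigma_motion, rng)
```

Because the hand-eye transform is constant, the prediction step only diffuses the particles; all of the information about the true error enters through the observation update.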

III-A2 Feature Detection and Camera Projections

Algorithms developed in previous literature can be utilized to detect features from the endoscopic image data on the surgical tool to update the estimation for the particle filter [20]. For the sake of simplicity, however, colored markers were drawn on the surgical tool using a paint pen. The locations of the markers are similar to the detected features in the tool tracking work of Ye et al. [26].

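The painted markers were detected by converting the image to the Hue-Saturation-Value (HSV) color space and thresholding the hue channel. The mask generated from the thresholding is then eroded and dilated to reduce the rate of small, false detections. Centroids, $\mathbf{m}_k$, are then calculated for each of the distinct contours of the mask to give a pixel point measurement of the markers. The camera projection equation for the detected pixel point of marker $i$ is:

$$\overline{\mathbf{m}}_i(\mathbf{w}, \mathbf{b}) = \frac{1}{s}\,\mathbf{K}\,\mathbf{T}^c_{b-}\,\mathbf{T}^{b-}_{b}(\mathbf{w}, \mathbf{b}) \prod_{l=1}^{j_i} \mathbf{T}^{l-1}_{l}(\theta^l_t)\,\overline{\mathbf{p}}^{j_i} \tag{4}$$

where $\mathbf{p}^{j_i}$ is the known marker position on link $j_i$, $\frac{1}{s}$ is the standard camera projection operation, $\mathbf{K}$ is the intrinsic camera calibration matrix, and $\mathbf{w}$ and $\mathbf{b}$ are the axis-angle and translational parameters of the hand-eye error.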
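To make the projection step concrete, the sketch below projects a single marker position into the image with a standard pinhole model. It is an illustration under assumptions: `project_point` is a hypothetical helper, the intrinsic values are made up, and `T_cam_link` stands for the already-composed transform from the marker's link frame to the camera frame (hand-eye, hand-eye error, and joint transforms multiplied together).

```python
import numpy as np

def project_point(K, T_cam_link, p_link):
    """Project a 3D point given in a link frame to pixel coordinates."""
    p_h = np.append(p_link, 1.0)          # homogeneous representation [p, 1]
    p_cam = (T_cam_link @ p_h)[:3]        # point expressed in the camera frame
    uvw = K @ p_cam                       # un-normalized pixel coordinates
    return uvw[:2] / uvw[2]               # divide by depth (the 1/s operation)

# Hypothetical intrinsics and link-to-camera transform.
K = np.array([[1000.0, 0.0, 480.0],
              [0.0, 1000.0, 270.0],
              [0.0, 0.0, 1.0]])
T_cam_link = np.eye(4)
T_cam_link[2, 3] = 0.1                    # link origin 10 cm in front of the camera

pixel = project_point(K, T_cam_link, np.array([0.01, 0.0, 0.0]))
```

In the particle filter this projection would be evaluated once per particle and per marker, since each particle carries its own hypothesis of the hand-eye error.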

The second feature detected is the projected edges of the insertion shaft of the surgical tool, which is a cylinder. Pixels potentially associated with the edges are detected using the Canny edge detector [37] and classified into distinct lines using the Hough transform [38]. This results in a list of detected lines parameterized by scalars $\rho_k$ and $\phi_k$:

$$\rho_k = u\cos(\phi_k) + v\sin(\phi_k) \tag{5}$$

where $u$ and $v$ are pixel coordinates. For the sake of brevity, the camera projection equations for a cylinder, which result in two lines, are omitted. Please refer to Chaumette’s work for a full derivation and expression [39]. The camera projection equation for a single line $i$ is denoted as $\rho_i(\mathbf{w}, \mathbf{b})$ and $\phi_i(\mathbf{w}, \mathbf{b})$, using the same parameterization as (5), where $\mathbf{w}$ and $\mathbf{b}$ are the axis-angle and translational parameters of the hand-eye error.

III-A3 Observation Model

To make associations between the detected marker features, $\mathbf{m}_k$, and their corresponding marker, a greedy matching technique is used because of its low computation time. An ordered list of the cost

$$C^m_{k,i}(\mathbf{w}, \mathbf{b}) = \gamma_m \left\| \mathbf{m}_k - \mathbf{m}_i(\mathbf{w}, \mathbf{b}) \right\|^2 \tag{6}$$

for detection $k$ and projected marker $i$ is made, where $\gamma_m$ is a tuned parameter. Iteratively, detection $k$ and marker $i$ from the lowest value of this cost list are matched, the tuple $(k, i)$ is added to the associated data list $A^m$, and all subsequent costs associated with either $k$ or $i$ are removed from the list. This is done until a max cost, $C^m_{max}$, is reached.
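The greedy association described above can be sketched as follows. This is a minimal sketch, assuming the costs have already been arranged into a matrix whose rows are detections and whose columns are projected markers; the function name `greedy_match` and the example values are hypothetical.

```python
import numpy as np

def greedy_match(cost, max_cost):
    """Greedily pair rows (detections) with columns (projected features).

    Pairs are accepted in order of increasing cost; a row or column is used
    at most once, and matching stops once the cost exceeds max_cost.
    """
    cost = np.asarray(cost, dtype=float)
    pairs, used_rows, used_cols = [], set(), set()
    for flat in np.argsort(cost, axis=None):
        k, i = (int(v) for v in np.unravel_index(flat, cost.shape))
        if cost[k, i] > max_cost:
            break                      # all remaining costs are larger
        if k in used_rows or i in used_cols:
            continue                   # detection or marker already matched
        pairs.append((k, i))
        used_rows.add(k)
        used_cols.add(i)
    return pairs

# Two detections, three projected markers (made-up costs).
example_cost = [[0.1, 2.0, 5.0],
                [1.5, 0.2, 4.0]]
matches = greedy_match(example_cost, max_cost=1.0)
```

Greedy matching is not guaranteed to minimize the total association cost, but it avoids solving an assignment problem for every particle, which fits the low computation time motivation stated in the text.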

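The same procedure is utilized for the detected lines, $\rho_k, \phi_k$, and the projected edges of the insertion shaft except the cost equation is

$$C^l_{k,i}(\mathbf{w}, \mathbf{b}) = \gamma_\rho \left| \rho_k - \rho_i(\mathbf{w}, \mathbf{b}) \right| + \gamma_\phi \left| \phi_k - \phi_i(\mathbf{w}, \mathbf{b}) \right| \tag{7}$$

where $\gamma_\rho$ and $\gamma_\phi$ are tuned parameters, the data list is denoted as $A^l$, and the max cost is $C^l_{max}$.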

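The association costs are wrapped in a radial basis function so they can be directly used for the observation models. The probability of the detected markers, $\mathbf{m}_t$, is modelled as:

$$P(\mathbf{m}_t | \mathbf{w}_t, \mathbf{b}_t) \propto (n_m - |A^m|)\, e^{-C^m_{max}} + \sum_{(k,i) \in A^m} e^{-C^m_{k,i}(\mathbf{w}_t, \mathbf{b}_t)} \tag{8}$$

where $A^m$ is the list of associated detection and marker pairs, $C^m_{k,i}$ and $C^m_{max}$ are the marker association cost and its max cost, and there are a total of $n_m$ markers painted on the surgical tool. Similarly, the probability of the detected lines, $\rho_t$, $\phi_t$, is modelled as

$$P(\rho_t, \phi_t | \mathbf{w}_t, \mathbf{b}_t) \propto (2 n_c - |A^l|)\, e^{-C^l_{max}} + \sum_{(k,i) \in A^l} e^{-C^l_{k,i}(\mathbf{w}_t, \mathbf{b}_t)} \tag{9}$$

where $A^l$, $C^l_{k,i}$, and $C^l_{max}$ are the corresponding quantities for the lines and there are a total of $n_c$ cylinders used as features on the surgical tool. In this case $n_c = 1$, since only the insertion shaft is used. These functions are chosen since they increase the weight of a particle for stronger associations, but do not completely zero out the weight if no associations are made, which can occur in cases of obstruction or missed detections. Since these two observations occur synchronously, the update is combined using the switching observation models-synchronous case [40]. Example images of the tool tracking are shown in Fig. 3.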
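The behaviour described above (stronger associations raise a particle's weight, while missing associations never drive it to zero) can be illustrated with a small function. The exact functional form used here is an assumption made for illustration: each matched feature contributes `exp(-cost)` and each unmatched expected feature contributes `exp(-max_cost)`. The name `observation_weight` and the example numbers are hypothetical.

```python
import numpy as np

def observation_weight(matched_costs, n_expected, max_cost):
    """Unnormalized particle weight from a list of association costs.

    matched_costs: costs of the accepted (detection, feature) pairs.
    n_expected:    number of features that could have been observed.
    max_cost:      cost threshold used during matching.
    """
    matched_costs = np.asarray(matched_costs, dtype=float)
    n_unmatched = n_expected - matched_costs.size
    # Unmatched features still contribute a small, non-zero amount.
    return n_unmatched * np.exp(-max_cost) + np.exp(-matched_costs).sum()

weight_no_match = observation_weight([], n_expected=3, max_cost=1.0)
weight_good_match = observation_weight([0.1, 0.2], n_expected=3, max_cost=1.0)
```

With this form, a particle that explains no detections at all (for example during an occlusion) keeps a small positive weight, so the filter does not collapse when the tool is briefly hidden.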

III-B Depth Map from Stereo Images

The depth map from the stereoscopic image data is generated using the Library for Efficient Large-Scale Stereo Matching (LIBELAS) [41]. To fully exploit the prior and enhance the robustness of our system, the surgical tool portion of the image and depth data is not passed to the deformable tissue tracker since the surgical tool is already being tracked. Therefore, a mask of the surgical tool is generated using the same OpenGL rendering pipeline we previously developed [42], and applied to the depth and image data passed to the deformable tissue tracker. To ensure the mask covers all of the tool, it is dilated before being applied.

III-C Deformable Tissue Tracking

Fig. 3: Surgical tool tracking implementation on the da Vinci Surgical® System running 30fps in real-time. From left to right the figures show: detected markers and edges, re-projected kinematic tool and shaft edges, and the full Augmented Reality rendering of the surgical tool [42] on top of the raw endoscopic data. These images are best viewed in color.

To represent the environment, we choose surfels [10] as our data structure due to the direct conversion to a point cloud, which is a standard data type for the robotics community. A surfel represents a region of an observed surface and is parameterized by the tuple $(\mathbf{p}, \mathbf{n}, \mathbf{c}, r, c, t)$, where $\mathbf{p}, \mathbf{n}, \mathbf{c}$ are the expected position, normal, and color respectively and scalars $r, c, t$ are the radius, confidence score, and time stamp of last update respectively. Alongside the geometric structure the surfel data provides, it also gives the confidence and timestamp of last update, which can be exploited to further optimize a controller working in the tracked environment. For adding/deleting and fusing of surfels, refer to work done by Keller et al. [10] and Gao et al. [14].
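As an illustration of why surfels convert directly to a point cloud, the sketch below stores the six surfel attributes in a small data class and extracts the positions of sufficiently confident surfels. The class layout, the field names, and the confidence threshold are assumptions for this example and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray    # expected 3D position
    normal: np.ndarray      # expected unit normal
    color: np.ndarray       # RGB color
    radius: float           # radius of the surface patch
    confidence: float       # confidence score
    timestamp: int          # frame index of the last update

def to_point_cloud(surfels, min_confidence=0.0):
    """Return an (N, 3) array of positions of surfels above a confidence threshold."""
    points = [s.position for s in surfels if s.confidence >= min_confidence]
    return np.array(points, dtype=float).reshape(-1, 3)

surfels = [
    Surfel(np.array([0.0, 0.0, 0.10]), np.array([0.0, 0.0, -1.0]),
           np.array([200, 120, 110]), 0.001, 0.2, 3),
    Surfel(np.array([0.01, 0.0, 0.11]), np.array([0.0, 0.0, -1.0]),
           np.array([190, 115, 100]), 0.001, 0.9, 5),
]
cloud = to_point_cloud(surfels, min_confidence=0.5)
```

A controller could use the same confidence and timestamp fields to avoid planning on regions that are stale or poorly observed.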

III-C1 Driven/Parameterized Model

The number of surfels grows proportionally to the number of image pixels provided to the deformable tracker, so it is infeasible to track every surfel individually. Inspired by Embedded Deformation (ED) [43], we drive our surfel set with a less-dense ED graph which has significantly fewer parameters to track. The ED graph can be thought of as a skeletonization of the surfels and captures their deformations. Thus, the transformation of every surfel position $\mathbf{p}$ is modeled as follows:

$$\tilde{\overline{\mathbf{p}}} = \mathbf{T}_g \sum_{i \in KNN(\mathbf{p})} \alpha_i \left[ T(\mathbf{q}_i, \mathbf{b}_i)\left(\overline{\mathbf{p}} - \vec{\mathbf{g}}_i\right) + \vec{\mathbf{g}}_i \right] \tag{10}$$

where $\mathbf{T}_g$ is the global homogeneous transformation (e.g. common motion shared with all surfels), $\alpha_i$ is a normalized weight, and $KNN(\mathbf{p})$ is an index set that contains all the ED nodes connected to the surfel. An ED node consists of the tuple $(\mathbf{g}_i, \mathbf{q}_i, \mathbf{b}_i)$ where $\mathbf{g}_i$ is the position of the ED node and $\mathbf{q}_i$ and $\mathbf{b}_i$ are the quaternion and translation parameters respectively, converted to a homogeneous transform matrix with $T(\mathbf{q}_i, \mathbf{b}_i)$. Both $KNN(\mathbf{p})$ and $\alpha_i$ are generated using the same method proposed by Sumner et al. [43]. Note that $\vec{\mathbf{g}}_i$ is a vector in homogeneous representation (e.g. $\vec{\mathbf{g}} = [\mathbf{g}, 0]^T$). The normal transformation is similarly defined as:

$$\tilde{\vec{\mathbf{n}}} = \mathbf{T}_g \sum_{i \in KNN(\mathbf{p})} \alpha_i\, T(\mathbf{q}_i, \mathbf{0})\, \vec{\mathbf{n}} \tag{11}$$

When implementing the ED graph, the $\mathbf{q}_i$ and $\mathbf{b}_i$ for node $i$ are the current frame's estimated deformation. After every frame, the deformations are committed to $\mathbf{g}_i$ and the surfels based on (10) and (11). Therefore, with an ED graph of $n$ nodes, the whole surfel model is estimated with $7n + 7$ parameters. Note that the extra 7 parameters come from $\mathbf{T}_g$, which is also estimated with a quaternion and translational vector. An example of using this model to track deformations is shown in Fig. 4.
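The way an ED graph drives a surfel can be sketched as a weighted blend of per-node rigid transforms, each applied about its node position, followed by the global transform. The code below is a minimal sketch under assumptions: quaternions are stored as `(w, x, y, z)`, the neighbor set and weights are given rather than computed, and `warp_point` and `quat_to_rot` are hypothetical helper names.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def warp_point(p, nodes, weights, R_global=None, t_global=None):
    """Deform point p with its connected ED nodes.

    nodes:   list of (g, q, b) tuples: node position, quaternion, translation.
    weights: normalized blending weights, one per node (should sum to 1).
    """
    blended = np.zeros(3)
    for (g, q, b), alpha in zip(nodes, weights):
        # Rotate about the node position, then translate.
        blended += alpha * (quat_to_rot(q) @ (p - g) + g + b)
    R = np.eye(3) if R_global is None else R_global
    t = np.zeros(3) if t_global is None else t_global
    return R @ blended + t

# One node at the origin rotating 90 degrees about the z axis.
c = np.cos(np.pi / 4)
rotated = warp_point(np.array([1.0, 0.0, 0.0]),
                     [(np.zeros(3), np.array([c, 0.0, 0.0, c]), np.zeros(3))],
                     [1.0])
```

Each node contributes seven numbers (four for the quaternion, three for the translation), which is where the per-node parameter count in the text comes from.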

Fig. 4: Deformable tracking results with testing dataset [12]. The color represents the normal of our surfel data. As the model fuses with more data from left to right, the normal becomes smooth and the deformations are captured.

III-C2 Cost Function

To track the visual scene with the parameterized surfel model, a cost function is defined to represent the distance between an observation and the estimated model. It is defined as follows:

$$E = E_{data} + \lambda_a E_{arap} + \lambda_r E_{rot} + \lambda_c E_{corr} \tag{12}$$

where $E_{data}$ is the error between the depth observation and estimated model, $E_{arap}$ is a rigidness cost such that ED nodes nearby one another have similar deformation, $E_{rot}$ is a normalization term for the quaternions to satisfy a rotation in space, $E_{corr}$ is a visual feature correspondence cost to ensure texture consistency, and $\lambda_a$, $\lambda_r$, $\lambda_c$ are scalar weights.

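More specifically, the traditional point-plane error metric [9] is used for the depth data cost. When minimized, the model is aligned with the observed depth image. The expression is:

$$E_{data} = \sum_i \left( \tilde{\mathbf{n}}_i^T \left( \tilde{\mathbf{p}}_i - \mathbf{o}_i \right) \right)^2 \tag{13}$$

where $\mathbf{o}_i = D(u, v)\,\mathbf{K}^{-1}[u, v, 1]^T$ is the observed position from the depth map, $D$, at pixel coordinate $(u, v)$, and $\tilde{\mathbf{p}}_i$ and $\tilde{\mathbf{n}}_i$ are the associated surfel position and normal from the most up to date model. This cost term, however, is highly non-linear and not easy to solve. To simplify the optimization, the normal is fixed at every iteration during optimization. This results in the following expression at iteration $j$:

$$E_{data}^{(j)} = \sum_i \left( \mathbf{n}_i^{(j-1)T} \left( \tilde{\mathbf{p}}_i^{(j)} - \mathbf{o}_i \right) \right)^2 \tag{14}$$

where $\mathbf{n}_i^{(j-1)}$ and $\tilde{\mathbf{p}}_i^{(j)}$ are evaluated with $\mathcal{G}^{(j-1)}$ and $\mathcal{G}^{(j)}$ respectively, and $\mathcal{G}^{(j)}$ is the set of ED nodes at iteration $j$. This is a normal-difference cost term similar to Iterative Closest Point [9].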
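The point-plane metric with fixed normals reduces to a dot product per correspondence. The sketch below evaluates it for a batch of already-associated model and observed points; it is an illustration only, with hypothetical names and made-up data, and it leaves out the data association and the warping of the model points.

```python
import numpy as np

def point_plane_cost(model_points, model_normals, observed_points):
    """Sum of squared point-to-plane distances for associated point pairs.

    Each residual is the offset between model and observation measured
    along the (fixed) model normal, so sliding along the surface is free.
    """
    offsets = model_points - observed_points
    residuals = np.einsum("ij,ij->i", model_normals, offsets)
    return float(np.sum(residuals ** 2)), residuals

# Flat model patch at z = 0 with normals along +z (made-up data).
model_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_normals = np.tile([0.0, 0.0, 1.0], (3, 1))
# Observations shifted 0.1 along the normal and 0.05 within the plane.
observed_points = model_points + np.array([0.05, 0.0, 0.1])

cost, residuals = point_plane_cost(model_points, model_normals, observed_points)
```

The in-plane shift of 0.05 does not change the cost, which is the property that makes this metric tolerant of imperfect correspondences on smooth surfaces.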

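The rigid term is constructed by the $\ell_2$ norm of the difference between the positions of an ED node transformed by two nearby transformations. The cost expression is:

$$E_{arap} = \sum_i \sum_{k \in KNN(i)} \left\| T(\mathbf{q}_i, \mathbf{b}_i)\left(\overline{\mathbf{g}}_k - \vec{\mathbf{g}}_i\right) + \vec{\mathbf{g}}_i - \left(\overline{\mathbf{g}}_k + \vec{\mathbf{b}}_k\right) \right\|^2 \tag{15}$$

where $KNN(i)$ is the set of ED nodes neighboring node $i$, generated by the k-nearest neighbor algorithm based on their positions. This cost term forces the model to have consistent motion among the nearby ED nodes. Intuitively, it gives hints to the model when a portion of the ED nodes do not receive enough data from the observation in the current frame.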

To have a rigid-like transformation, the normalizing term in the cost function is set to:

$$E_{rot} = \sum_k \left( 1 - \mathbf{q}_k^T \mathbf{q}_k \right)^2 \tag{16}$$

since quaternions hold $\|\mathbf{q}\| = 1$. Both the rigid term and the normalizing term are critical to ensuring all ED nodes move as rigidly as possible. This is because the parameter space of the ED graph is very large relative to the observed data. For example, in cases of obstruction the optimization problem is ill-defined without these terms.

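The final cost term is for visual feature correspondence to force visual texture consistency between the model and the observed data. The expression for the cost is:

$$E_{corr} = \sum_{(\tilde{\mathbf{p}}_c, \mathbf{o}_c) \in \mathcal{F}} \left\| \tilde{\mathbf{p}}_c - \mathbf{o}_c \right\|^2 \tag{17}$$

where $\mathcal{F}$ is a set of associated pairs of matched feature points, $(\tilde{\mathbf{p}}_c, \mathbf{o}_c)$, between the rendered color image of our model and the observed color image data respectively. The observed point is obtained using the same expression as before: $\mathbf{o}_c = D(u, v)\,\mathbf{K}^{-1}[u, v, 1]^T$. The feature matching gives a sparse but strong hint for the model to fit the current data.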

III-C3 Optimization Solver

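To solve the non-linear least squares problem proposed in (12), the Levenberg-Marquardt (LM) algorithm [44] is implemented to efficiently obtain the solution for the model. The LM algorithm requires the cost function to be in the form of a sum of squared residuals. Therefore, all the parameters from the ED graph are stacked into a vector, $\mathbf{x}$, and all cost terms are reorganized into vector form such that $E = \mathbf{f}(\mathbf{x})^T \mathbf{f}(\mathbf{x})$. In this form, the function is linearized with a Taylor expansion:

$$\delta^* = \arg\min_{\delta} \left\| \mathbf{f}(\mathbf{x}) + \mathbf{J}\delta \right\|^2 \tag{18}$$

where $\mathbf{J}$ is the Jacobian matrix of $\mathbf{f}(\mathbf{x})$. Following the LM algorithm, $\delta$ is solved for by using:

$$\left( \mathbf{J}^T \mathbf{J} + \mu \mathbf{I} \right) \delta = -\mathbf{J}^T \mathbf{f}(\mathbf{x}) \tag{19}$$

where $\mu$ is a damping factor. The LM algorithm accepts the $\delta$ by setting $\mathbf{x} \leftarrow \mathbf{x} + \delta$ when the cost function decreases: $\|\mathbf{f}(\mathbf{x} + \delta)\|^2 < \|\mathbf{f}(\mathbf{x})\|^2$. Otherwise, it increases the damping factor. Intuitively, the LM algorithm tries to find a balance between the Gauss-Newton method and the gradient descent solver. In our implementation, (19) is solved with a GPU version of the preconditioned conjugate gradient method within 10 iterations.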
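The accept/reject logic of the damped update can be sketched on a small curve-fitting problem. This is a toy sketch under assumptions, not the GPU solver described in the text: the linear system is solved with a dense direct solver instead of preconditioned conjugate gradient, the damping schedule (halve on success, double on failure) is a common choice rather than the paper's, and the test problem is invented.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, mu=1e-3, iterations=50):
    """Minimize the sum of squared residuals with a basic LM loop."""
    x = np.asarray(x0, dtype=float)
    cost = np.sum(residual(x) ** 2)
    for _ in range(iterations):
        J, f = jacobian(x), residual(x)
        # Damped normal equations: (J^T J + mu I) delta = -J^T f
        delta = np.linalg.solve(J.T @ J + mu * np.eye(x.size), -J.T @ f)
        new_cost = np.sum(residual(x + delta) ** 2)
        if new_cost < cost:
            x, cost = x + delta, new_cost   # accept the step
            mu *= 0.5                       # behave more like Gauss-Newton
        else:
            mu *= 2.0                       # behave more like gradient descent
    return x

# Toy problem: recover (a, b) in y = a * exp(b * t) from noise-free samples.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)

def residual(x):
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x_hat = levenberg_marquardt(residual, jacobian, x0=[1.0, 0.0])
```

For the tracking problem the Jacobian is large and sparse, which is why an iterative solver such as preconditioned conjugate gradient is a natural replacement for the dense solve used here.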

IV Experiments

To measure the effectiveness of the proposed framework, our implementation was deployed on a da Vinci Surgical® System. The stereo camera is the standard 1080p laparoscopic camera running at 30fps. The Open Source da Vinci Research Kit (dVRK) [45] was used to send end-effector commands and get joint angles and the end-effector location in the base frame of a single surgical robotic arm with a gripper, also known as a Patient Side Manipulator (PSM). The data for the PSM is sent at a rate of 100Hz. All of the communication between subsystems of the code was done using the Robot Operating System (ROS), and everything ran on two identical computers with an Intel® Core™ i9-7940X Processor and NVIDIA’s GeForce RTX 2080.

IV-A Implementation Details

Details for implementation of the proposed framework on the dVRK are stated below and organized by the components of the framework.

IV-A1 Surgical Tool Tracking

The particle filter used particles, bootstrap approximation for the prediction step, and stratified resampling when the number of effective particles dropped below to avoid particle depletion. For initialization, the covariance, is set to diag() where is in radians and is in mm. The motion model covariance, , is set to . For the observation model, and . The endoscopic image data is resized to 960 by 540 before processing for features. For the initial hand-eye transform, , OpenCV’s perspective-n-point solver is used on the segmented centroids of the markers.
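The resampling trigger and the stratified resampling step mentioned above can be sketched as follows. This is a generic textbook formulation, assumed for illustration: the effective particle count is computed as the inverse of the sum of squared normalized weights, and the function names and example data are hypothetical.

```python
import numpy as np

def effective_particles(weights):
    """Effective number of particles: 1 / sum of squared normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def stratified_resample(particles, weights, rng):
    """Draw one sample from each of n equal-width strata of the weight CDF."""
    n = len(weights)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    positions = (np.arange(n) + rng.random(n)) / n   # one point per stratum
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0                             # guard against round-off
    indices = np.searchsorted(cumulative, positions, side="right")
    return particles[indices], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
example_particles = np.arange(600, dtype=float).reshape(100, 6)

uniform_weights = np.full(100, 0.01)
degenerate_weights = np.zeros(100)
degenerate_weights[3] = 1.0

n_eff_uniform = effective_particles(uniform_weights)
n_eff_degenerate = effective_particles(degenerate_weights)
resampled, new_weights = stratified_resample(example_particles, degenerate_weights, rng)
```

In a filter loop, resampling would only be called when the effective particle count falls below the chosen threshold, which limits the loss of particle diversity.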

IV-A2 Depth Map from Stereo Images

The endoscopic image data is resized to 640 by 480 before processing. The LIBELAS parameters used are the default robotics settings from its open sourced repository [41]. After computing the depth map, $D$, it is masked by the rendered surgical tool. The mask is dilated by nine pixels before being applied. The depth map is then smoothed spatially with a bilateral filter and temporally with a median filter of four frames to decrease noise.
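The temporal smoothing step can be illustrated with a sliding-window median over the most recent depth maps. This is a minimal sketch: the class name `TemporalMedianFilter` is hypothetical, the spatial bilateral filter and the tool mask are omitted, and the example frames are constant images chosen to make the effect easy to see.

```python
from collections import deque
import numpy as np

class TemporalMedianFilter:
    """Per-pixel median over the last `window` depth maps."""

    def __init__(self, window=4):
        self.frames = deque(maxlen=window)   # oldest frame is dropped automatically

    def update(self, depth):
        self.frames.append(np.asarray(depth, dtype=float))
        return np.median(np.stack(self.frames), axis=0)

median_filter = TemporalMedianFilter(window=4)
# Four constant 2x2 "depth maps"; the third one contains an outlier value.
for value in [1.0, 2.0, 100.0, 3.0]:
    smoothed = median_filter.update(np.full((2, 2), value))
```

A median suppresses isolated depth spikes such as the third frame here, at the cost of some lag when the tissue moves quickly.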

IV-A3 Deformable Tracking

The surfel radius is set to and confidence score is calculated with at pixel coordinate where is the z component of camera frame normal, is the cameras focal length, and is the normalized distance from the pixel coordinate to the center of the image [9][10]. Whenever new surfels are added to the model, ED nodes are randomly sampled from them [14]

. This typically results in 300 ED nodes, and therefore roughly 2K parameters to estimate. OpenCV’s implementation of SURF is used for feature extraction and matching in the cost functions visual correspondence term. For the cost function, the parameters

are set to .

Fig. 5: Autonomous tissue manipulation with the proposed SuPer framework implemented on the da Vinci® Surgical System in real-time. From left to right the figures show: the real scene, tool tracking from the endoscopic camera, deformable reconstruction, and RViz with point cloud of the environment, robot localization, and the tracked point to grasp.

Fig. 6: Results from the repeated tissue manipulation experiment without using the complete proposed SuPer framework. Panels: (a) test environment, (b) without tool tracking, (c) without mask, (d) without deformable tracking. None of these results are ideal since they do not properly capture the real surgical scene through failed robotic localization or improper environmental mapping.

IV-B Repeated Tissue Manipulation

To test the effectiveness of the proposed framework, a simple controller was implemented to grasp and tug on tissue at the same tracked point repeatedly. A small cluster of surfels is selected on the tissue in the deformable tracker, and their resulting averaged position and normal define the tracked point to be grasped. The following steps are then repeated five times or until failure on the PSM gripper.

  1. Align above surface: move to where cm and orientation such that the opening of the gripper is pointed towards the surface normal, .

  2. Move to the tissue: stop updating and from the deformable tracker and move to where cm and orientation

  3. Grasp and stretch the tissue: close the gripper to grasp the tissue and move to where cm and orientation

  4. Place back the tissue: move to where cm and orientation and open the gripper

  5. Continue updating and from the deformable tracker.

To move the PSM to the target end-effector position and orientation, trajectories are generated using linear and spherical linear interpolation respectively. The trajectories are re-generated after every update to the tracked position and normal from the deformable tracker and are generated in the camera frame from the current end-effector pose. The current end-effector pose is calculated by transforming the PSM end-effector pose from dVRK with the hand-eye transform from the surgical tool tracker. Finally, to follow the trajectory, the end-effector poses are transformed back to the base frame of the PSM using the surgical tool tracker and set via dVRK.
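The trajectory generation described above, linear interpolation for position and spherical linear interpolation (slerp) for orientation, can be sketched as follows. It is a minimal sketch with assumptions: orientations are unit quaternions stored as `(w, x, y, z)`, waypoints are evenly spaced, and `slerp` and `make_trajectory` are hypothetical helper names rather than dVRK functions.

```python
import numpy as np

def slerp(q0, q1, s):
    """Spherical linear interpolation between unit quaternions, s in [0, 1]."""
    q0 = np.asarray(q0, dtype=float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + s * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

def make_trajectory(p0, q0, p1, q1, n_steps):
    """Return a list of (position, quaternion) waypoints from start to target."""
    return [((1.0 - s) * p0 + s * p1, slerp(q0, q1, s))
            for s in np.linspace(0.0, 1.0, n_steps)]

# Move 2 cm along z while rotating 90 degrees about z (made-up poses).
p_start, p_goal = np.zeros(3), np.array([0.0, 0.0, 0.02])
q_start = np.array([1.0, 0.0, 0.0, 0.0])
q_goal = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])

trajectory = make_trajectory(p_start, q_start, p_goal, q_goal, n_steps=3)
```

Re-generating the trajectory from the current pose after each tracker update amounts to calling `make_trajectory` again with the latest start and target poses.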

This experiment is repeated with these configurations:

  • The complete proposed framework

  • The proposed framework without deformable tracking, just static reconstruction, by setting the number of ED nodes to 0

  • The proposed framework without surgical tool masking

  • The proposed framework without surgical tool tracking, and instead relying on calibrated hand-eye

The tissue used is the skin of a chicken leg.

IV-C Additional Experiments

Using the same trajectories as described in the Repeated Tissue Manipulation experiment, the surgical tool is regulated to precisely move over and around a lump of chicken leg skin within two millimeters of the surface. This experiment highlights the precision of the implemented framework. Moreover, using teleoperation control of the da Vinci Surgical® System, an operator grasps tissue and deforms it. We qualitatively observe the results of both the localization of the surgical tool and the tracked environment.

V Results

Table I: Repeated Tissue Manipulation Results

Complete Framework | No Deformable Tracking | No Surgical Tool Masking | No Surgical Tool Tracking
5 | 3 | 3 | 0
Fig. 7: Left shows the gripper following a trajectory just above the surface of tissue and never coming in contact with it. The right is the deformable tracker capturing the tissue deformations when the surgical tool is tele-operated.

The separate components of the framework ran at 30fps, 30fps, 8fps, and 3fps for the surgical tool tracking, surgical tool rendering, depth map generation, and deformable tissue tracker respectively. An example of the procedure used for the repeated tissue manipulation experiment is shown in Fig. 1 and the results are shown in Table I. When using the complete framework, the PSM arm continuously grasped the same location of the tissue even after repeated deformations. As shown in Fig. 5, the deformable tracker even managed to capture the structure of the tissue that was not visible to the endoscopic camera during stretching.

When not using the deformable tracker, the computer crashed due to memory overflow after 3 grasps and the reconstruction was not at all representative of the real environment. With no mask, the reconstructed scene in the deformable tracker was unable to converge properly. Finally, when not using surgical tool tracking, no attempt was successful because the gripper missed the tissue. All three of these failure cases are shown in Fig. 6.

Results from the additional experiments are shown in Fig. 7. The trajectory experiment successfully regulated the tip of the surgical tool just above the tissue without coming into contact. For the teleoperation experiment, the operator stretched the tissue very aggressively and the deformable tracker still managed to capture the deformations, including crevices.

VI Discussion and Conclusion

The ability to continuously and accurately track the tissue during manipulation enables control algorithms to be successful in the unstructured surgical environment. Currently, we believe that the limiting factor of our system is the noise in the depth map reconstructed from the stereo-endoscopic camera. Improving this component will vastly enhance the deformable tracking. Furthermore, the certainty of the perception can be used for optimal control algorithms, endoscopic camera control to maximize certainty, and other advanced control techniques. Handling blood and topological changes, such as cutting, are the next big challenges to overcome to make our proposed framework even more suitable for real clinical scenarios.

In conclusion, we proposed a surgical perception framework, SuPer, to localize the surgical tool and track the deformable tissue. With this system, a preliminary demonstration of surgical autonomy was completed on the da Vinci® Surgical System and showed very promising results for invasive surgical tasks. In addition, a deformable tissue tracking dataset was released for further community research.


  • [1] G. H. Ballantyne and F. Moll, “The da vinci telerobotic surgical system: the virtual operative field and telepresence surgery,” Surgical Clinics, vol. 83, no. 6, pp. 1293–1304, 2003.
  • [2] M. Yip and N. Das, ROBOT AUTONOMY FOR SURGERY, ch. 10, pp. 281–313. World Scientific, 2018.
  • [3] R. C. Jackson and M. C. Çavuşoğlu, “Needle path planning for autonomous robotic surgical suturing,” in Intl. Conf. on Robotics and Automation, pp. 1669–1675, IEEE, 2013.
  • [4] B. Thananjeyan et al.

    , “Multilateral surgical pattern cutting in 2d orthotropic gauze with deep reinforcement learning policies for tensioning,” in

    Intl. Conf. on Robotics and Automation, pp. 2371–2378, IEEE, 2017.
  • [5] B. Kehoe et al., “Autonomous multilateral debridement with the raven surgical robot,” in Intl. Conf. on Robotics and Automation, pp. 1432–1439, IEEE, 2014.
  • [6] D. T. Ngo et al., “Dense image registration and deformable surface reconstruction in presence of occlusions and minimal texture,” in

    Intl. Conf. on Computer Vision

    , pp. 2273–2281, 2015.
  • [7] M. Salzmann and P. Fua, Deformable Surface 3D Reconstruction from Monocular Images. Synthesis Lectures on Computer Vision, Morgan & Claypool Publishers, 2010.
  • [8] J. Zhu, S. C. H. Hoi, Z. Xu, and M. R. Lyu, “An effective approach to 3d deformable surface tracking,” in European Conf. on Computer Vision, pp. 766–779, 2008.
  • [9] R. A. Newcombe et al., “Kinectfusion: Real-time dense surface mapping and tracking,” in Symp. on Mixed and Augmented Reality, vol. 11, pp. 127–136, IEEE, 2011.
  • [10] M. Keller et al., “Real-time 3d reconstruction in dynamic scenes using point-based fusion,” in Intl. Conf. on 3D Vision, pp. 1–8, IEEE, 2013.
  • [11] R. A. Newcombe, D. Fox, and S. M. Seitz, “Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time,” in

    Conf. on Computer Vision and Pattern Recognition

    , pp. 343–352, IEEE, 2015.
  • [12] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger, “Volumedeform: Real-time volumetric non-rigid reconstruction,” in European Conf. on Computer Vision, pp. 362–379, Springer, 2016.
  • [13] M. Dou et al., “Fusion4d: Real-time performance capture of challenging scenes,” Transactions on Graphics, vol. 35, no. 4, p. 114, 2016.
  • [14] W. Gao and R. Tedrake, “Surfelwarp: Efficient non-volumetric single view dynamic reconstruction,” in Robotics: Science and System, 2018.
  • [15] L. Maier-Hein et al., “Comparative validation of single-shot optical techniques for laparoscopic 3-d surface reconstruction,” Transactions on Medical Imaging, vol. 33, no. 10, pp. 1913–1930, 2014.
  • [16] M. C. Yip, D. G. Lowe, S. E. Salcudean, R. N. Rohling, and C. Y. Nguan, “Tissue tracking and registration for image-guided surgery,” Transactions on Medical Imaging, 2012.
  • [17] J. Song, J. Wang, L. Zhao, S. Huang, and G. Dissanayake, “Dynamic reconstruction of deformable soft-tissue with stereo scope in minimal invasive surgery,” Robotics and Automation Letters, vol. 3, no. 1, pp. 155–162, 2017.
  • [18] N. Mahmoud et al., “Live tracking and dense reconstruction for handheld monocular endoscopy,” Transactions on Medical Imaging, vol. 38, no. 1, pp. 79–89, 2018.
  • [19] A. Marmol, A. Banach, and T. Peynot, “Dense-arthroslam: Dense intra-articular 3-d reconstruction with robust localization prior for arthroscopy,” Robotics and Automation Letters, vol. 4, no. 2, pp. 918–925, 2019.
  • [20] D. Bouget, M. Allan, D. Stoyanov, and P. Jannin, “Vision-based and marker-less surgical tool detection and tracking: a review of the literature,” Medical Image Analysis, vol. 35, pp. 633–654, 2017.
  • [21] M. Kristan et al., “The visual object tracking vot2015 challenge results,” in Proc. of the IEEE Intl. Conf. on Computer Vision Workshops, pp. 1–23, 2015.
  • [22] Y. Li et al., “Robust estimation of similarity transformation for visual object tracking,” in

    Proc. of the AAAI Conf. on Artificial Intelligence

    , vol. 33, pp. 8666–8673, AAAI, 2019.
  • [23] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in European Conf. on Computer Vision, pp. 254–265, Springer, 2014.
  • [24] A. Reiter, P. K. Allen, and T. Zhao, “Appearance learning for 3d tracking of robotic surgical tools,” The Intl. Journal of Robotics Research, vol. 33, no. 2, pp. 342–356, 2014.
  • [25] A. Reiter, P. K. Allen, and T. Zhao, “Feature classification for tracking articulated surgical tools,” in Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention, pp. 592–600, Springer, 2012.
  • [26] M. Ye, L. Zhang, S. Giannarou, and G.-Z. Yang, “Real-time 3d tracking of articulated tools for robotic surgery,” in Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention, pp. 386–394, Springer, 2016.
  • [27] R. Hao, O. Özgüner, and M. C. Çavuşoğlu, “Vision-based surgical tool pose estimation for the da vinci® robotic surgical system,” in Intl. Conf. on Intelligent Robots and Systems, IEEE, 2018.
  • [28] T. Zhao, W. Zhao, B. D. Hoffman, W. C. Nowlin, and H. Hui, “Efficient vision and kinematic data fusion for robotic surgical instruments and other applications,” 2015. US Patent 8,971,597.
  • [29] F. Alambeigi, Z. Wang, R. Hegeman, Y.-H. Liu, and M. Armand, “A robust data-driven approach for online learning and manipulation of unmodeled 3-d heterogeneous compliant objects,” Robotics and Automation Letters, vol. 3, no. 4, pp. 4140–4147, 2018.
  • [30] F. Richter, R. K. Orosco, and M. C. Yip, “Open-sourced reinforcement learning environments for surgical robotics,” arXiv preprint arXiv:1903.02090, 2019.
  • [31] B. Kehoe et al., “Autonomous multilateral debridement with the raven surgical robot,” in Intl. Conf. on Robotics and Automation, pp. 1432–1439, IEEE, 2014.
  • [32] C. D’Ettorre et al., “Automated pick-up of suturing needles for robotic surgical assistance,” in Intl. Conf. on Robotics and Automation, pp. 1370–1377, IEEE, 2018.
  • [33] F. Zhong, Y. Wang, Z. Wang, and Y.-H. Liu, “Dual-arm robotic needle insertion with active tissue deformation for autonomous suturing,” Robotics and Automation Letters, vol. 4, no. 3, pp. 2669–2676, 2019.
  • [34] A. Murali et al., “Learning by observation for surgical subtasks: Multilateral cutting of 3d viscoelastic and 2d orthotropic tissue phantoms,” in Intl. Conf. on Robotics and Automation, pp. 1202–1209, 2015.
  • [35] D. Seita et al., “Fast and reliable autonomous surgical debridement with cable-driven robots using a two-phase calibration procedure,” in Intl. Conf. on Robotics and Automation, pp. 6651–6658, IEEE, 2018.
  • [36] S. Thrun, “Particle filters in robotics,” in Proc. of the Eighteenth Conf. on Uncertainty in Artificial Intelligence, pp. 511–518, Morgan Kaufmann Publishers Inc., 2002.
  • [37] J. Canny, “A computational approach to edge detection,” Transactions on Pattern Analysis and Machine Intelligence, 1986.
  • [38] J. Matas, C. Galambos, and J. Kittler, “Robust detection of lines using the progressive probabilistic hough transform,” Computer Vision and Image Understanding, vol. 78, no. 1, pp. 119–137, 2000.
  • [39] F. Chaumette, La relation vision-commande: théorie et application à des tâches robotiques. PhD thesis, L’Université de Rennes I, 1990.
  • [40] F. Caron, M. Davy, E. Duflos, and P. Vanheeghe, “Particle filtering for multisensor data fusion with switching observation models: Application to land vehicle positioning,” Transactions on Signal Processing, vol. 55, no. 6, pp. 2703–2719, 2007.
  • [41] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo matching,” in Asian Conf. on Computer Vision, pp. 25–38, 2010.
  • [42] F. Richter, Y. Zhang, Y. Zhi, R. K. Orosco, and M. C. Yip, “Augmented reality predictive displays to help mitigate the effects of delayed telesurgery,” in Intl. Conf. on Robotics and Automation, IEEE, 2019.
  • [43] R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation for shape manipulation,” Transactions on Graphics, vol. 26, no. 3, p. 80, 2007.
  • [44] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing. New York, NY, USA: Cambridge University Press, 1992.
  • [45] P. Kazanzides et al., “An open-source research kit for the da vinci ®surgical system,” Intl. Conf. on Robotics and Automation, pp. 6434–6439, 2014.