Minimally Invasive Surgery (MIS) has flourished over the past decade owing to its reduced surgical trauma, less pain, and shorter recovery time. In MIS, a laparoscope is inserted through a trocar into the human body to provide surgeons with visual information about the surgical scene. To facilitate automatic/semi-automatic robotic surgery, surgical navigation systems, which generally offer internal structural information for intra-operative planning and employ external trackers for laparoscope localization, are commonly integrated into existing platforms. However, compared with conventional open surgeries, laparoscopic images observed in MIS are usually two-dimensional, and the view of the surgical field provided by the laparoscope is commonly limited, which significantly hinders the understanding of the internal anatomy and negatively affects practical operations. Moreover, extra tracking sensors may add complexity to robotic systems used in the Operating Room (OR).
To improve the visualization available to surgeons during surgery, depth information of the tissue surface needs to be extracted from the 2D stereo laparoscope. Over the past decade, numerous depth estimation algorithms have been presented that provide depth measurements by establishing correspondences between the pixels of rectified left and right images, and the results can be adopted for 3D reconstruction. Traditionally, stereo depth computation is performed in four steps: 1) feature matching and matching cost computation; 2) cost aggregation; 3) disparity computation; 4) disparity refinement. Considering the characteristics of tissue surfaces, Stoyanov et al. adopted salient points based on Lucas-Kanade tracking to establish sparse feature matching. Chang et al. proposed a stereo matching algorithm that constructs a 3D cost volume with a pixel data term and then performs convex optimization. However, these methods can only operate at 10 fps for images with 360×288 resolution. Zhou et al.
presented post-processing refinement steps, such as outlier removal, hole filling, and disparity smoothing, to address low-texture problems. However, the zero-mean normalized cross-correlation (ZNCC)-based local matching part considered only 100 candidate disparity values. Recently, learning-based stereo depth estimation methods have been proposed. Liang et al.
used a convolutional neural network (CNN) to extract features and compute the per-pixel similarity for feature matching. Li et al. proposed a transformer-based method that exploits the sequential nature of videos in performing feature matching, running at 2 fps for 640×512 image pairs. However, these methods produce suboptimal depth information because of either the poor texture and unique color of tissues or an insufficient number of disparity candidates.
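The classical four-step stereo pipeline described above can be illustrated with a minimal block-matching sketch (a didactic NumPy toy, not any of the cited methods): absolute-difference matching cost over candidate disparities, box-filter cost aggregation, winner-take-all disparity computation, and median-filter refinement.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=16, window=5):
    """Classic four-step stereo sketch: matching cost, cost aggregation,
    disparity computation (winner-take-all), and disparity refinement."""
    h, w = left.shape
    half = window // 2
    # 1) matching cost: per-pixel absolute difference for each candidate disparity
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost[d, :, d:] = diff
    # 2) cost aggregation: average the cost over a local window (box filter)
    agg = np.empty_like(cost)
    for d in range(max_disp):
        c = np.where(np.isinf(cost[d]), 255.0, cost[d])  # penalize invalid pixels
        pad = np.pad(c, half, mode="edge")
        agg[d] = sum(
            pad[i:i + h, j:j + w]
            for i in range(window) for j in range(window)
        ) / window ** 2
    # 3) disparity computation: winner-take-all over the candidates
    disp = np.argmin(agg, axis=0).astype(np.float64)
    # 4) refinement: 3x3 median filter to suppress outlier disparities
    padd = np.pad(disp, 1, mode="edge")
    stack = np.stack([padd[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(stack, axis=0)
```

On textured synthetic data this recovers the true shift in the image interior; on texture-less tissue surfaces the winner-take-all step becomes ambiguous, which is exactly the failure mode the learning-based methods above address.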
Furthermore, to provide extensive views of the surgical site for surgeons, a simultaneous localization and mapping (SLAM)-based reconstruction module is utilized, which can enlarge the reconstructed portion of the cavity by dynamically moving the laparoscope and fusing the 3D surfaces reconstructed at different times. Chen et al. extended the SLAM algorithm to recover sparse point clouds of the tissue surface. However, this method required the Poisson surface reconstruction method to fit the sparse points in order to infer the tissue surface. Mahmoud et al. embedded a multi-view reconstruction approach into a SLAM system, but the reconstruction results were not smooth and dense enough for surgical visualization and navigation. Marmol et al. also combined a multi-view stereo method with a SLAM module for 3D reconstruction, but it required an external camera and odometry data from the surgical robot to calculate the arthroscope's localization. In this paper, we propose a reconstruction method that estimates the online depth of the surgical scene and reconstructs large-scale, smooth, and dense 3D anatomical structures of the tissues in view, based only on stereo images from the laparoscope.
After constructing the 3D structure of the surgical scene, surgeons can navigate in the environment and automatically localize the laparoscope within a given view. Traditional methods using external trackers, such as optical and electromagnetic tracking systems, may increase system complexity when tracking the position and orientation of the camera, and they cannot provide the direct positional relationship between the laparoscope and the surgical scene. Given recent advances in autonomous driving, several learning-based visual localization methods, which can recover environments and camera poses, have been proposed. However, estimating the pose of a laparoscope using only images is rarely explored in surgical navigation because of the texture-less surfaces and complex geometry of surgical scenes. To locate the laparoscope using images alone, we combine the dense 3D model from our reconstruction module with a laparoscopic localization method.
In this paper, we propose a novel learning-driven framework to recover the dense 3D structure of the surgical scene and estimate the laparoscopic pose. The contributions of our work are summarized as follows:
We fine-tune a learning-based depth estimation module for dense depth calculation of each single frame, using both supervised and unsupervised training on surgical data. It can be applied to challenging surgical scenarios such as tissues with texture-less and monochromatic surfaces.
To reconstruct the entire surgical scene, we propose a dense visual reconstruction algorithm that leverages surfels to efficiently represent the 3D structure and simultaneously computes the camera pose. It utilizes only the stereo images from the laparoscope, thus completing the entire process from online depth estimation to dense surgical scene reconstruction.
On the basis of the reconstructed dense 3D structure, we propose a laparoscopic localization module to achieve coarse-to-fine pose estimation, in which a knowledge distillation strategy is used to train an efficient feature extraction network.
Based on the SCARED dataset, our in-house in-vivo DaVinci robotic surgery dataset, and self-collected ex-vivo phantom and tissue data with 3D anatomy ground truths obtained using structured light techniques, we conducted quantitative and qualitative experiments. The corresponding results extensively demonstrate the accuracy and efficacy of our proposed reconstruction and localization modules, showing their potential applications in robot-assisted surgical navigation systems.
The remainder of this paper is organized as follows. Section II introduces the proposed framework. In Section III, the method is evaluated in experiments on different datasets. Finally, conclusions and future work are given in Section IV.
Fig. 1(a) shows the overview of the proposed stereo dense reconstruction and laparoscope tracking framework. The left (L) and right (R) RGB images at time stamp t are denoted as I_t^L and I_t^R. By fine-tuning an HSM network, the depth image D_t can be computed from I_t^L and I_t^R. The depth images and left frames from t = 0 to t = T are input to a dense visual reconstruction algorithm, which recovers the whole 3D structure of the tissue surface. Notably, both the depth estimation network and the reconstruction algorithm are designed to enable real-time performance. Combining the scale-aware 3D model of the surgical scene with a visual tracking method, we formulate a laparoscopic localization module to estimate the laparoscope pose of a newly given frame.
II-A Learning-based Depth Estimation for Single Frame
Considering the poor, homogeneous textures and unique color of tissue appearance shown in Fig. 2, we found that learned features with large receptive fields and multi-scale properties help establish accurate pixel-level correspondences between left and right stereo images. Moreover, given that generating high-resolution textures is important to help clinicians make a diagnosis, a large number of candidate disparities is required; thus, a high-resolution cost volume must be handled. Therefore, we select the Hierarchical Deep Stereo Matching (HSM) network as the depth estimation module for a single frame. The HSM network uses a U-Net (encoder-decoder) architecture to efficiently extract features at different levels, and its encoder is followed by 4 spatial pyramid pooling (SPP) layers to broaden the receptive fields. After feature extraction, a 3D convolution-based feature volume decoder is utilized to reduce memory and computation time in processing the cost volume. Because HSM is designed for high-resolution images, it can estimate depth more accurately by providing more candidate disparities when computing the feature volume. The detailed structure is shown in Fig. 1(b).
Given that publicly available datasets with depth annotations are scarce in the surgical domain, an expert HSM model is pre-trained on the autonomous driving dataset KITTI, a commonly used training dataset for stereo matching networks. To alleviate the domain gap between the KITTI driving data and surgical scenes, we first used the SERV-CT tissue data, which include endoscopic images and corresponding depth annotations, to fine-tune the expert model in a supervised manner, and then applied an unsupervised method to further refine the depth estimation network by using the warping-based view generation loss.
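The warping-based view generation loss can be illustrated as follows (a NumPy sketch under the usual rectified-stereo assumption; the actual network training uses a differentiable bilinear sampler): the right image is warped into the left view by sampling it at x - d(x), and the photometric difference to the real left image supervises the predicted disparity d.

```python
import numpy as np

def warp_right_to_left(right, disp):
    """Synthesize the left view by sampling the right image at x - d(x),
    with linear interpolation along the horizontal (epipolar) direction."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disp          # horizontal sampling coordinates
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = xs - np.floor(xs)
    rows = np.arange(h)[:, None]
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def view_synthesis_loss(left, right, disp):
    """L1 photometric loss between the real left image and the
    disparity-warped right image (warping-based view generation loss)."""
    return np.mean(np.abs(left - warp_right_to_left(right, disp)))
```

A correct disparity map makes the warped right image reproduce the left image, so minimizing this loss refines the disparity without any depth annotation.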
II-B Dense Visual Reconstruction of the Whole Surgical Scene
To reconstruct the whole surgical scene, the estimated depths of single frames at different times are gradually fused. We adopt an unordered list of surfels, which is memory-efficient, to represent the 3D structure of the tissue surface, where each surfel s contains the following attributes: a 3D point p, a surface normal n, a radius r, a confidence c, and a timestamp. When a pair (I_t^L, D_t) arrives from the depth estimation module, new surfels under the current camera coordinates are obtained. For each 2D pixel in the depth image D_t, the depth sample is back-projected into the 3D point of a surfel using the laparoscope intrinsic matrix K. The process is presented in Fig. 3(b). The normal n of a surfel is computed from the cross product of the difference vectors between neighboring back-projected points.
The radius r represents the local area around a point and scales with the ratio of the depth to the focal length f, where f is the focal-length part of K. The surfel confidence c is initialized according to the normalized distance of the pixel from the camera center (c_x, c_y), so that measurements near the image center receive higher confidence. After each surfel is calculated, it is fused into the canonical surfel set S, which is defined in the reference coordinates of the first frame, on the basis of the current laparoscope pose T_t. The surfels are illustrated in Fig. 3(a).
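The per-pixel surfel construction above can be sketched as follows. The radius and confidence formulas here follow common point-based fusion conventions and are assumptions for illustration, not the paper's exact equations.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with depth sample d to the 3D surfel point
    p = d * K^{-1} [u, v, 1]^T, where K is the laparoscope intrinsic matrix."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) / fx * depth, (v - cy) / fy * depth, depth])

def surfel_from_depth(u, v, D, K, sigma=0.6):
    """Build one surfel (point, normal, radius, confidence) from depth map D."""
    p = backproject(u, v, D[v, u], K)
    # normal: cross product of difference vectors to neighboring 3D points
    dx = backproject(u + 1, v, D[v, u + 1], K) - p
    dy = backproject(u, v + 1, D[v + 1, u], K) - p
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n)
    # radius: local area around the point, growing with depth / focal length
    f = 0.5 * (K[0, 0] + K[1, 1])
    radius = np.sqrt(2.0) * D[v, u] / f
    # confidence: higher near the image center (normalized radial distance)
    h, w = D.shape
    gamma = np.hypot(u - K[0, 2], v - K[1, 2]) / np.hypot(w / 2, h / 2)
    conf = np.exp(-gamma ** 2 / (2 * sigma ** 2))
    return p, n, radius, conf
```

For a fronto-parallel depth plane, the computed normal points along the optical axis and the confidence is highest at the principal point, matching the intended behavior.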
For computing the current pose T_t, the reference surfels S are first transformed into the camera coordinates of the previous frame, and we then iteratively minimize the geometric and photometric reprojection errors between the transformed reference surfels and the newly estimated ones. If the point distance and the normal angle between two surfels are below given thresholds, the pair is added to a surfel set P. The geometric reprojection error is accumulated over the associated pairs in P, where each residual is measured after applying the transformation pose T from the reference to the current frame. The photometric error is defined as the image intensity difference between corresponding pixels. We define the minimization objective as the sum of the geometric error and the photometric error weighted by an adjustable parameter λ ∈ [0, 10]. The laparoscope pose T_t at time t is then calculated by minimizing this objective.
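A minimal sketch of evaluating this joint objective for a candidate pose is given below, assuming a point-to-plane form for the geometric term and squared intensity differences for the photometric term (assumed forms for illustration; the paper minimizes the objective iteratively over the pose).

```python
import numpy as np

def combined_error(T, pts_ref, pts_cur, normals_cur, i_ref, i_cur, lam=0.1):
    """Joint cost E = E_geo + lam * E_photo for a candidate 4x4 pose T.

    pts_ref / pts_cur : associated surfel points (N, 3)
    normals_cur       : normals of the current surfels (N, 3)
    i_ref / i_cur     : intensities of the associated pixels (N,)
    """
    # geometric: point-to-plane residuals after applying T to reference points
    p = (T[:3, :3] @ pts_ref.T).T + T[:3, 3]
    e_geo = np.sum(np.einsum("ij,ij->i", p - pts_cur, normals_cur) ** 2)
    # photometric: squared intensity differences of associated pixels
    e_photo = np.sum((i_ref - i_cur) ** 2)
    return e_geo + lam * e_photo
```

In practice the pose that minimizes this cost is found iteratively (e.g., by Gauss-Newton steps on a pose parameterization); the sketch only shows how a candidate pose is scored.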
After calculating the current laparoscope pose, the new surfels are integrated into the canonical set S through surfel association and fusion. Each new surfel is paired with a corresponding canonical surfel to establish the association. First, the canonical surfels S are transformed into the current camera coordinates by using the camera pose T_t, and each point is further projected onto the image plane to construct a projected depth image D'_t; the process is shown in Fig. 3(c). Second, for each pixel in D_t, we find a neighborhood I around the same position in D'_t, as illustrated in Fig. 3(c). Then, three association metrics are calculated for every pixel within I. If the first two metrics are lower than their respective thresholds, the pixel holding the smallest value of the third metric is considered the matching pixel in D'_t; thus, a corresponding surfel in S can be found for each new surfel. Once the association between the new surfels and S is established, we use the following rules to update the reference surfels S.
The corresponding pseudo code of the surfel association and fusion algorithm is summarized in Algorithm 1.
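The fusion step can be sketched with confidence-weighted update rules in the spirit of point-based fusion (assumed rules for illustration, not the paper's verbatim equations): the attributes of an associated pair become their confidence-weighted average, and the confidences accumulate.

```python
import numpy as np

def fuse_surfel(p, n, r, c, p_new, n_new, r_new, c_new):
    """Fuse an associated surfel pair (canonical vs. newly observed).

    Point, normal, and radius become the confidence-weighted average
    (the normal is renormalized), and the confidences are summed so that
    repeatedly observed surfels become increasingly stable.
    """
    w, w_new = c, c_new
    p_out = (w * p + w_new * p_new) / (w + w_new)
    n_out = w * n + w_new * n_new
    n_out /= np.linalg.norm(n_out)
    r_out = (w * r + w_new * r_new) / (w + w_new)
    return p_out, n_out, r_out, c + c_new
```

Unassociated new surfels would simply be appended to the canonical set, while surfels that stay at low confidence for too long can be pruned as outliers.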
II-C Accurate Laparoscope Localization for Navigation
Based on the computed 3D structure of the whole surgical scene, we aim to localize the camera of a given intra-operative view using a coarse-to-fine laparoscopic localization module. The process is shown in the intra-operative laparoscopic localization part of Fig. 1(a). First, a global map is established to combine the 3D structure and the input images. Second, a learning-based image retrieval system infers a set of images from the global map whose locations are most similar to that of the given query image. An iterative estimation process is then used to compute the fine pose of the laparoscope.
Map building: We build a global map, shown in Fig. 1(a), using the input pre-operative images, the estimated laparoscope poses, and the reconstructed 3D structure of the tissue surface from the proposed reconstruction framework. First, we gather the input images into an image database. Second, the 3D points of the reconstructed structure are projected onto the image plane of the camera coordinates defined by each estimated laparoscope pose. Then, the 2D keypoints in each image are found from these projections, and the correspondence between the 3D structure and the input images is stored through the corresponding image pixels.
Coarse retrieval: Based on the NetVLAD network, we use knowledge distillation to train an efficient, smaller student feature extraction network to reproduce the global features predicted by the teacher (NetVLAD). The training process is shown in Fig. 1(c). The student network is composed of an encoder and a smaller VLAD layer. Using the student network, global features are computed and indexed for every image in the image database. For each intra-operative query image, we first extract its global feature and then use the KNN algorithm to find the nearest images, i.e., those with the shortest distances in feature space, in the image database. These nearest images are then clustered by the 3D points they co-observe; Fig. 4 shows the clustering process. By retrieving a list of nearest images in the global feature space using KNN, the laparoscope pose can be roughly estimated.
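The coarse retrieval stage can be sketched as a KNN search over L2-normalized global features, followed by grouping the retrieved images by co-observed 3D point ids (the data layout here is a hypothetical illustration of the map described above).

```python
import numpy as np

def knn_retrieve(query_feat, db_feats, k=4):
    """Return indices of the k database images whose L2-normalized global
    features are nearest to the query feature (Euclidean distance on the
    unit sphere, equivalent to ranking by cosine similarity)."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)
    return np.argsort(dists)[:k]

def cluster_by_covisibility(retrieved, observed_points):
    """Group retrieved image indices that co-observe at least one 3D point.

    observed_points: dict mapping image index -> set of observed 3D point ids.
    """
    clusters = []
    for idx in retrieved:
        pts = observed_points[idx]
        for cl in clusters:
            if cl["points"] & pts:              # shares a 3D point: same place
                cl["images"].append(idx)
                cl["points"] |= pts
                break
        else:
            clusters.append({"images": [idx], "points": set(pts)})
    return clusters
```

The cluster containing the most images is then passed on to the fine localization step.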
Fine localization: The cluster with the most images is first used to estimate a fine laparoscopic pose by utilizing a perspective-n-point (PnP) geometric consistency check. We extract hand-crafted ORB features from the query image and the retrieved nearest images, and then calculate the feature matches between them. The corresponding 3D points in the reconstructed structure for the 2D keypoints of the query image can thus be selected. After outlier rejection within a RANSAC scheme, we estimate a global laparoscopic pose from the n 3D-2D matches using the PnP method. If a valid pose is calculated, the process terminates, and the query laparoscopic view is successfully localized.
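The matching step can be illustrated with Hamming-distance matching of binary (ORB-style) descriptors plus a nearest/second-nearest ratio test (a sketch only; the actual module then feeds the resulting 3D-2D matches into PnP with RANSAC).

```python
import numpy as np

# per-byte popcount table for fast Hamming distance on packed binary descriptors
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int32)

def match_binary_descriptors(desc_q, desc_db, ratio=0.8):
    """Match binary descriptors (rows of uint8 bytes, e.g. 32 bytes for ORB)
    by Hamming distance, keeping a match only when the nearest neighbour is
    clearly better than the second nearest (Lowe-style ratio test).
    Returns a list of (query_index, database_index) pairs."""
    matches = []
    for i, d in enumerate(desc_q):
        ham = POPCOUNT[np.bitwise_xor(desc_db, d)].sum(axis=1)
        order = np.argsort(ham)
        best, second = order[0], order[1]
        if ham[best] < ratio * ham[second]:
            matches.append((i, int(best)))
        # otherwise the match is ambiguous and is discarded
    return matches
```

Each surviving 2D-2D match indexes a stored 3D point of the reconstructed structure, yielding the 3D-2D correspondences required by the PnP solver.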
III-A Experiment Setup
The effectiveness of our stereo dense reconstruction and laparoscopic tracking methods is evaluated on three datasets. Dataset 1: the public SCARED dataset consists of seven training sets and two test sets captured by a da Vinci Xi surgical robot. Each set corresponds to a porcine subject and contains four or five keyframes; a keyframe comprises a 1280×1024 stereo video, relative endoscopic poses, and a 3D point cloud of the scene computed by a structured-light-based reconstruction method. Dataset 2: our ex-vivo phantom and tissue data are collected by a Karl Storz laparoscope held by a UR5 robot; each case consists of 640×480 calibrated stereo videos, laparoscope poses calculated from the robot's joint data, and ground-truth 3D point clouds reconstructed by an active stereo surface reconstruction method assisted with structured light (SL), of which the accuracy is 45.4. Dataset 3: our in-vivo DaVinci robotic surgery dataset contains six cases of stereo videos recording the whole procedure of robotic prostatectomy.
We use the Root Mean Squared Error (RMSE) to quantify the accuracy of our reconstructed 3D models. The RMSE is computed as follows: the reconstructed 3D structure is first registered to the ground-truth 3D point cloud by manually selecting landmarks such as edge points, and the registration is then refined by the ICP method. We also adopt three metrics, namely the absolute trajectory error (ATE), the relative translation error (RTE), and the relative rotation error (RRE), to evaluate the precision of the laparoscope poses; all three compare the ground-truth camera pose with the estimated pose through the rigid transformation between them.
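Under their common definitions (an assumption here, since the exact formulas are given in the paper's equations), the three pose metrics can be computed from the ground-truth and estimated pose sequences as follows: ATE is the RMSE of the absolute translation offsets, while RTE and RRE compare consecutive relative motions.

```python
import numpy as np

def pose_inv(T):
    """Invert a rigid 4x4 transform [R | t]."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def trajectory_errors(gt, est):
    """Return (ATE, mean RTE, mean RRE in degrees) for two pose lists."""
    # ATE: RMSE of the translation part of the per-frame residual transforms
    ate = np.sqrt(np.mean([
        np.linalg.norm((pose_inv(g) @ e)[:3, 3]) ** 2 for g, e in zip(gt, est)]))
    rte, rre = [], []
    for i in range(len(gt) - 1):
        dg = pose_inv(gt[i]) @ gt[i + 1]    # ground-truth relative motion
        de = pose_inv(est[i]) @ est[i + 1]  # estimated relative motion
        err = pose_inv(dg) @ de             # residual relative motion
        rte.append(np.linalg.norm(err[:3, 3]))
        # rotation angle of the residual rotation matrix
        cos = np.clip((np.trace(err[:3, :3]) - 1) / 2, -1, 1)
        rre.append(np.degrees(np.arccos(cos)))
    return ate, float(np.mean(rte)), float(np.mean(rre))
```

A globally shifted but otherwise identical trajectory thus yields a nonzero ATE while its relative errors stay at zero, which is why all three metrics are reported together.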
III-B Performance Assessment of 3D Dense Reconstruction
Four keyframes in the SCARED dataset and two cases of our ex-vivo data were utilized to quantitatively evaluate the precision of the 3D structures and the corresponding poses. Considering that acquiring the ground truth of a tissue's 3D model during surgery is currently impractical because of clinical regulations, we qualitatively tested our method on the in-vivo DaVinci robotic surgery dataset.
Quantitative Evaluation on ex-vivo Data: As shown in Fig. 5(a), the obtained 3D tissue models usually contain millions of points, which provide rich details of the surface texture. Furthermore, a surgical simulator was established for rendering color images generated from the estimated camera poses and the 3D structure. We compared the rendered images with the corresponding input images, and the results are presented in Fig. 5(b). Our reconstructed 3D tissue surfaces and the textures on their re-projections both match those observed in the input images. As for the quantitative results, we compared the SL-Points and SR-Points, which refer to the numbers of points in the surface geometry computed using structured light and our stereo reconstruction method, respectively. As shown in Table I, the RMSE is below 1.71 mm in all testing sets, which to a certain extent demonstrates the high accuracy of our reconstruction framework.
We simultaneously estimate the laparoscope pose during surfel fusion. Since the precision of camera pose estimation also affects the accuracy of our reconstruction results, we validated the poses by comparing the calculated results with the ground-truth camera poses using the ATE, RTE, and RRE metrics. Fig. 6 shows the camera trajectories for each dataset and the quantitative comparisons; the estimated camera poses match the ground-truth poses closely, proving the effectiveness of the proposed reconstruction framework.
The holistic stereo reconstruction framework was further run on our platform (a UR5 robot and a Karl Storz laparoscope). The average computation time per frame is 83.3 ms (12 fps), suggesting that our method is efficient and meets real-time requirements.
Qualitative Evaluation on in-vivo Data: We collected four video clips from our in-vivo surgery dataset to evaluate the performance of the reconstruction method in a real-world surgical environment. Each clip lasts around 1 second (20–60 frames) with fast camera movements and includes complex tissue surfaces, small tissue deformations, and instrument motions. As shown in Fig. 7, although the laparoscope moved quickly and the surgical scene was complicated by slight deformations, a plausible 3D point cloud and smooth laparoscope poses could still be estimated, which qualitatively demonstrates the accuracy of the proposed method.
III-C Performance of Laparoscope Localization for Navigation
To evaluate our visual localization method, three laparoscopic movement types that commonly occur in laparoscopic surgery, namely zoom-in, zoom-out, and follow, together with a random camera motion, were adopted to generate the test data. Typical examples of these motions are shown in Fig. 8. We first collected validation data on our platform, and sampled images from SCARED and the other datasets were used for map building. Furthermore, new views were created from the reconstructed 3D structure of the tissue surface using our simulator. In total, ten groups of data, each with 100–200 frames, were generated to validate the performance of laparoscope motion estimation for the four types of camera motion.
In Fig. 9, we show typical examples of the comparison between the estimated and ground-truth poses. For each type of motion, the black wireframe marks the starting point of the camera movement, whereas the red and blue wireframes denote the ground-truth camera poses and those computed by our visual localization module, respectively. As can be noticed in these figures, the estimated poses are qualitatively similar to the ground truths in both rotation and translation.
Given that the ground truths of the camera poses are available in each group of data, we can quantitatively evaluate the accuracy of the calculated laparoscopic poses. In addition, MapNet, an end-to-end network-based camera pose estimation method, was compared with our approach. Table II reports the translation and rotation errors of the camera pose estimation. Notably, our average errors in translation and rotation were only 2.17 mm and 2.23°, showing that our localization method can track the camera in both real laparoscopic views and simulated new views, whereas MapNet lacks localization ability in new scenes. Therefore, our visual localization module has the preliminary ability to track the laparoscope in complicated surgical scenes with only images as input.
| Motion type | MapNet (trans., rot.) | Ours (trans., rot.) |
|---|---|---|
| zoom-in | 15.712 mm, 26.677° | 2.398 mm, 2.321° |
| zoom-out | 16.028 mm, 25.249° | 2.205 mm, 2.463° |
| follow | 10.539 mm, 17.614° | 2.866 mm, 2.744° |
| random | 0.588 mm, 2.439° | 1.194 mm, 1.374° |
| Average | 10.717 mm, 17.994° | 2.166 mm, 2.226° |
In this paper, we propose an efficient learning-driven framework that achieves image-only 3D reconstruction of surgical scenes and preliminary laparoscopic localization. A fine-tuned learning-based stereo estimation network and a dense visual reconstruction algorithm are proposed to recover the 3D structure of the tissue surface. In addition, a visual localization module that incorporates our reconstructed 3D structure is presented to achieve coarse-to-fine laparoscopic tracking using only images as input. The proposed reconstruction method runs at 12 fps on complicated surgical scenes. We also evaluated our framework qualitatively and quantitatively on three datasets to demonstrate its accuracy and efficiency.
This work assumes a surgical scene with only small deformations for the reconstruction and localization framework. In the future, we will apply our stereo dense reconstruction and camera localization framework to ENT surgery.
- (2021) Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133.
- (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307.
- (2014) Stitching and surface reconstruction from endoscopic image sequences: a review of applications and methods. IEEE Journal of Biomedical and Health Informatics 20 (1), pp. 304–321.
- (2018) Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625.
- (2013) Real-time dense stereo reconstruction using convex optimisation with a cost-volume for image-guided robotic surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 42–49.
- (2018) SLAM-based dense surface reconstruction in monocular minimally invasive surgery and its application to augmented reality. Computer Methods and Programs in Biomedicine 158, pp. 135–146.
- (2020) SERV-CT: a disparity dataset from CT for validation of endoscopic 3D reconstruction. arXiv preprint arXiv:2012.11779.
- (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
- (2016) Clinical application of a surgical navigation system based on virtual laparoscopy in laparoscopic gastrectomy for gastric cancer. International Journal of Computer Assisted Radiology and Surgery 11 (5), pp. 827–836.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2013) Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 2013 International Conference on 3D Vision (3DV), pp. 1–8.
- (2021) Local to global: efficient visual localization for a monocular camera. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2231–2240.
- (2018) Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE Transactions on Medical Imaging 37 (10), pp. 2185–2195.
- (2020) Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2011.02910.
- (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820.
- (2021) E-DSSR: efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception. arXiv preprint arXiv:2107.00229.
- (2018) Live tracking and dense reconstruction for handheld monocular endoscopy. IEEE Transactions on Medical Imaging 38 (1), pp. 79–89.
- (2013) Optical techniques for 3D surface reconstruction in computer-assisted laparoscopic surgery. Medical Image Analysis 17 (8), pp. 974–996.
- (2019) Dense-ArthroSLAM: dense intra-articular 3-D reconstruction with robust localization prior for arthroscopy. IEEE Robotics and Automation Letters 4 (2), pp. 918–925.
- (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31 (5), pp. 1147–1163.
- (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725.
- (2018) Leveraging deep visual descriptors for hierarchical efficient localization. In Conference on Robot Learning, pp. 456–465.
- (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47 (1), pp. 7–42.
- (2003) High-accuracy stereo depth maps using structured light. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I.
- (2010) Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 275–282.
- (2020) Active stereo 3-D surface reconstruction using multistep matching. IEEE Transactions on Automation Science and Engineering 17 (4), pp. 2130–2144.
- (2014) Augmented reality navigation with automatic marker-free image registration using 3-D image overlay for dental surgery. IEEE Transactions on Biomedical Engineering 61 (4), pp. 1295–1304.
- (2015) ElasticFusion: dense SLAM without a pose graph.
- (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
- (2019) Real-time dense reconstruction of tissue surface from stereo optical video. IEEE Transactions on Medical Imaging 39 (2), pp. 400–412.