Stereo Dense Scene Reconstruction and Accurate Laparoscope Localization for Learning-Based Navigation in Robot-Assisted Surgery

10/08/2021 ∙ by Ruofeng Wei, et al. ∙ City University of Hong Kong ∙ The Chinese University of Hong Kong

The computation of anatomical information and laparoscope position is a fundamental building block of robot-assisted surgical navigation in Minimally Invasive Surgery (MIS). Recovering a dense 3D structure of the surgical scene from visual cues remains a challenge, and online laparoscope tracking mostly relies on external sensors, which increases system complexity. In this paper, we propose a learning-driven framework that achieves image-guided laparoscope localization together with 3D reconstruction of complex anatomical structures. To reconstruct the 3D structure of the whole surgical environment, we first fine-tune a learning-based stereoscopic depth perception method, which is robust to texture-less and variable soft tissues, for depth estimation. Then, we develop a dense visual reconstruction algorithm that represents the scene by surfels, estimates the laparoscope pose, and fuses the depth data into a unified reference coordinate frame for tissue reconstruction. To estimate the poses of new laparoscope views, we realize a coarse-to-fine localization method that incorporates our reconstructed 3D model. We evaluate the reconstruction method and the localization module on three datasets, namely, the stereo correspondence and reconstruction of endoscopic data (SCARED), ex-vivo phantom and tissue data collected with a Universal Robot (UR) and a Karl Storz laparoscope, and an in-vivo DaVinci robotic surgery dataset. Extensive experiments demonstrate the performance of our method in 3D anatomy reconstruction and laparoscope localization, which shows its potential for integration into surgical navigation systems.




I Introduction

Minimally Invasive Surgery (MIS) has flourished over the past decade owing to its small surgical trauma, reduced pain, and shorter recovery [3]. In MIS, a laparoscope is inserted through a trocar into the human body to provide surgeons with visual information about the surgical scene [18]. To facilitate automatic or semi-automatic robotic surgery, a surgical navigation system, which generally offers internal structural information for intra-operative planning and employs external trackers for laparoscope localization, is commonly integrated into existing platforms [9]. However, compared with conventional open surgery, the laparoscopic images observed in MIS are usually two-dimensional and the view of the surgical field provided by the laparoscope is limited, which significantly reduces the understanding of the internal anatomy and negatively affects practical operations. Moreover, extra tracking sensors may add complexity to the robotic systems used in the Operating Room (OR).

To improve visualization for surgeons during surgery, depth information of the tissue surface needs to be extracted from the 2D images of the stereo laparoscope. During the past decade, numerous depth estimation algorithms have been presented that provide depth measurements by establishing correspondences between the pixels of rectified left and right images, and the result can be adopted for 3D reconstruction [1]. Traditionally, stereo depth computation is performed in four steps: 1) feature matching and matching cost computation; 2) cost aggregation; 3) disparity computation; 4) disparity refinement [23]. Considering the characteristics of the tissue surface, Stoyanov et al. [25] adopted salient points based on Lucas-Kanade to establish sparse feature matching. Chang et al. [5] proposed a stereo matching algorithm that constructs a 3D cost volume with a pixel data term and then performs convex optimization. However, these methods can only run at an estimation speed of 10 fps for images with 360 × 288 resolution. Zhou et al. [30] presented post-processing refinement steps, such as outlier removal, hole filling, and disparity smoothing, to address low-texture problems. However, the zero-mean normalized cross-correlation (ZNCC)-based local matching part only considered 100 candidate disparity values. Recently, learning-based stereo depth estimation methods have been proposed. Liang et al. [15] used a convolutional neural network (CNN) to extract features and compute the similarity of each pixel for feature matching. Li et al. [14] proposed a transformer-based method that considered the sequential nature of videos in performing feature matching, running at 2 fps for 640 × 512 image pairs. However, these methods produce suboptimal depth information because of either the poor texture and unique color of tissues or insufficient disparity candidates.

Furthermore, to provide surgeons with extensive views of the surgical site, a simultaneous localization and mapping (SLAM)-based reconstruction module is utilized, which can enlarge the reconstructed portion of the cavity by dynamically moving the laparoscope and fusing the 3D surfaces reconstructed at different times. Chen et al. [6] extended the SLAM algorithm to recover sparse point clouds of the tissue surface. However, this method required Poisson surface reconstruction to fit the sparse points for inferring the tissue surface. Mahmoud et al. [17] embedded a multi-view reconstruction approach into the SLAM system, but the reconstruction results were not smooth and dense enough for surgical visualization and navigation. Marmol et al. [19] also combined a multi-view stereo method with a SLAM module for 3D reconstruction, but their approach required an external camera and odometry data from the surgical robot to calculate the arthroscope's location. In this paper, we propose a reconstruction method that can estimate the depth of the surgical scene online and reconstruct large-scale, smooth, and dense 3D surface anatomical structures of the tissues in view based only on stereo images from the laparoscope.

After the 3D structure of the surgical scene is constructed, surgeons can navigate in the environment and automatically localize the laparoscope within a given view. Traditional methods that use external trackers, such as optical tracking systems [27] and electromagnetic tracking systems [13], may increase the system complexity when tracking the position and orientation of the camera, and they cannot provide the direct positional relationship between the laparoscope and the surgical scene [17]. Given the recent advances in autonomous driving, several learning-based visual localization methods that can recover environments and camera poses have been proposed [22] [21] [12]. However, pose estimation of the laparoscope using only images is scarce in surgical navigation because of the texture-less nature and complex geometry of surgical tissues and scenes. To locate the laparoscope using only images, we combine the dense 3D model from our reconstruction module with the laparoscopic localization method.

In this paper, we propose a novel learning-driven framework to recover the dense 3D structure of the surgical scene and estimate the laparoscopic pose. The contributions of our work are summarized as follows:

  • We fine-tune a learning-based depth estimation module on surgical data, using supervised and unsupervised methods, for dense depth calculation of each single frame. It can be applied to challenging surgical scenarios such as tissues with texture-less and monochromatic surfaces.

  • To reconstruct the entire surgical scene, we propose a dense visual reconstruction algorithm that leverages surfels to efficiently represent the 3D structure and simultaneously computes the camera pose. It utilizes only the stereo images from the laparoscope, thus completing the entire process from online depth estimation to dense surgical scene reconstruction.

  • On the basis of the reconstructed dense 3D structure, we propose a laparoscopic localization module to achieve coarse-to-fine pose estimation, in which a knowledge distillation strategy is used to train an efficient feature extraction network.

  • Based on the SCARED dataset, our in-house in-vivo DaVinci robotic surgery dataset, and self-collected ex-vivo phantom and tissue data with 3D anatomy ground truths obtained using structured light techniques, we conducted quantitative and qualitative experiments. The results demonstrate the accuracy and efficacy of the proposed reconstruction and localization modules and show their potential application in robot-assisted surgical navigation systems.

The remainder of this paper is organized as follows. Section II introduces the proposed framework. In Section III, the method is evaluated in experiments on different datasets. Finally, conclusions and future work are given in Section IV.

II Methods

Fig. 1(a) shows an overview of the proposed stereo dense reconstruction and laparoscope tracking framework. The left (L) and right (R) RGB images at time stamp t form the input stereo pair, from which a fine-tuned HSM network computes the depth image. The depth images and left frames from t = 0 to t = T are input to a dense visual reconstruction algorithm, which recovers the whole 3D structure of the tissue surface. Notably, both the depth estimation network and the reconstruction algorithm are designed to enable real-time performance. By combining the scale-aware 3D model of the surgical scene with a visual tracking method, we formulate a laparoscopic localization module to estimate the laparoscope pose of a new given frame.

II-A Learning-based Depth Estimation for Single Frame

Considering the poor, homogeneous textures and unique color of the tissue appearance shown in Fig. 2, we found that learned features with large receptive fields and multi-scale properties help establish accurate pixel-level correspondence between the left and right stereo images. Furthermore, given that generating high-resolution textures is important for helping clinicians make a diagnosis, a large number of candidate disparities is required; thus, a high-resolution cost volume must be handled. Therefore, we select the Hierarchical Deep Stereo Matching (HSM) network as the depth estimation module for a single frame. The HSM network uses a U-Net (encoder-decoder) architecture to efficiently extract features at different levels, and its encoder is followed by 4 spatial pyramid pooling (SPP) [29] layers to broaden the receptive fields. After feature extraction, a 3D convolution-based feature volume decoder is utilized to reduce the memory and computational time needed to process the cost volume. Because HSM is designed for high-resolution images, it can estimate depth information more accurately by providing more candidate disparities when computing the feature volume. The detailed structure is shown in Fig. 1(b).

Fig. 2: Examples of the SCARED dataset, our ex-vivo phantom data and the in-vivo DaVinci robotic surgery dataset.

Given that publicly available datasets with depth annotations are scarce for surgical scenes, an expert HSM model is pre-trained on the autonomous driving dataset KITTI, which is commonly used for training stereo matching networks. To reduce the domain gap between the KITTI driving data and surgical scenes, we first fine-tune the expert model in a supervised manner on the SERV-CT [7] tissue data, which include endoscopic images and corresponding depth annotations, and then continue with the unsupervised method [8] to build a refined depth estimation network by using the warping-based view generation loss.
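The idea behind a warping-based view generation loss can be sketched as follows. This is a minimal NumPy illustration, not the training code of this paper: it assumes a rectified grayscale stereo pair, a left-referenced disparity map in pixels, and only the mean absolute photometric term (the method in [8] additionally uses structural similarity and left-right consistency terms). The function names are hypothetical.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Synthesize the left view by sampling the right image at x - d (rectified pair)."""
    h, w = right.shape
    disparity = np.broadcast_to(np.asarray(disparity, dtype=np.float64), (h, w))
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity  # source column in the right image
    valid = (xs >= 0) & (xs <= w - 1)                         # pixels that map inside the image
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation along the horizontal (epipolar) direction.
    warped = (1 - frac) * right[rows, x0] + frac * right[rows, x1]
    return warped, valid

def view_synthesis_loss(left, right, disparity):
    """Mean absolute difference between the left image and the warped right image."""
    warped, valid = warp_right_to_left(right, disparity)
    return float(np.abs(left - warped)[valid].mean())
```

A correct disparity map makes the synthesized view agree with the real left image, so the loss is small; no depth annotation is needed, which is why such a loss suits surgical data without ground truth.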

Fig. 3: (a). Illustration of surfels; (b). Conversion of the depth sample to point of surfel; (c). Transformation of surfel from the reference camera coordinate to the image coordinate and illustration of corresponding surfel searching in depth image.

II-B Dense Visual Reconstruction of Whole Surgical Scene

To reconstruct the whole surgical scene, the depth estimated for single frames at different times is gradually fused. We adopt an unordered list of surfels [11] [16], which is memory efficient, to represent the 3D structure of the tissue surface, where each surfel s contains the following attributes: a 3D point, a surface normal, a radius, a confidence, and a timestamp. When a new pair of left frame and depth image arrives from the depth estimation module, new surfels are obtained in the current camera coordinates. Each depth sample at a 2D pixel of the depth image is converted into the 3D point of a surfel by using the laparoscope intrinsic parameters. The process is presented in Fig. 3(b). The normal of each surfel is expressed as:


The radius represents the local area around a point, calculated as:


where f is the focal length part of the laparoscope intrinsic parameters. The surfel confidence is initialized as:


where (,) are the center of camera and . After each surfel is calculated, the new surfels are fused into the canonical surfels, which are expressed in the reference coordinates defined by the first frame, based on the current laparoscope pose. The surfels are illustrated in Fig. 3(a).
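The conversion of depth samples into surfel points and normals can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: a pinhole intrinsic matrix K, standard back-projection, and normals taken from the cross product of differences between neighboring 3D points (a common choice in point-based fusion, assumed here rather than taken from this paper). Radius and confidence are left out, and the function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth image (H, W) into camera-frame 3D points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns and rows
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def estimate_normals(points):
    """Unit normals from neighboring points (assumed formulation), shape (H-1, W-1, 3)."""
    dx = points[:-1, 1:] - points[:-1, :-1]  # step to the right neighbor
    dy = points[1:, :-1] - points[:-1, :-1]  # step to the neighbor below
    n = np.cross(dy, dx)                     # this order makes normals face the camera (-z)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    return n
```

For a flat surface facing the camera, every normal comes out as (0, 0, -1), i.e. pointing back along the optical axis.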

To compute the current pose, the reference surfels are first transformed into camera coordinates, and the geometric and photometric reprojection errors between the transformed reference surfels and the new surfels are then minimized iteratively. If the point distance and the normal angle between a transformed reference surfel and a new surfel are below a threshold [28], the pair is added to a surfel set P. Thus, the geometric reprojection error is expressed as:


Here, the unknown is the transformation pose that aligns the two surfel sets. The photometric error, which is the image intensity difference, is defined as follows:


We define the minimization function as follows:


where the weight, adjustable within [0, 10], balances the two error terms. The laparoscope pose at time t is then obtained from the minimizing transformation.
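The structure of such a combined objective can be sketched as follows. This is a minimal NumPy illustration, not the exact cost of this paper: it assumes a point-to-plane geometric residual (common in surfel-based SLAM such as [28]), a squared intensity difference as the photometric residual, and a hypothetical weight named lam. It only evaluates the cost for one candidate pose; an actual solver would minimize it iteratively.

```python
import numpy as np

def geometric_error(T, src_pts, tgt_pts, tgt_normals):
    """Sum of squared point-to-plane distances after moving src_pts by the 4x4 pose T."""
    src_h = np.c_[src_pts, np.ones(len(src_pts))]
    moved = (T @ src_h.T).T[:, :3]
    residual = np.einsum("ij,ij->i", moved - tgt_pts, tgt_normals)
    return float(np.sum(residual ** 2))

def photometric_error(src_intensity, tgt_intensity):
    """Sum of squared intensity differences between associated pixels."""
    diff = np.asarray(src_intensity) - np.asarray(tgt_intensity)
    return float(np.sum(diff ** 2))

def total_cost(T, src_pts, tgt_pts, tgt_normals, src_intensity, tgt_intensity, lam=1.0):
    """Geometric error plus lam times photometric error (lam is the adjustable weight)."""
    return (geometric_error(T, src_pts, tgt_pts, tgt_normals)
            + lam * photometric_error(src_intensity, tgt_intensity))
```

A larger lam trusts image intensities more, which can help on smooth tissue where geometry alone constrains the pose weakly.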

After the current laparoscope pose is calculated, the new surfels are integrated into the reference surfels through surfel association and fusion. Each new surfel is paired with a corresponding reference surfel to establish the association between the two sets. First, the reference surfels are transformed to the current camera coordinates by using the camera pose, and each point is further projected onto the image plane to construct a depth image, as shown in Fig. 3(c). Second, for each pixel in the new depth image, we find a neighborhood I around the same position in the projected depth image, as illustrated in Fig. 3(c). Then, three metrics are calculated as follows:


where u is a pixel within I. If the first two metrics are lower than their respective thresholds, then the pixel with the smallest value of the third metric is considered the matching pixel in the projected depth image; thus, the corresponding reference surfel can be found for each new surfel. Once the association between the new and reference surfels is established, we use the following rules to update the reference surfels:


The pseudocode of the surfel association and fusion algorithm is summarized in Algorithm 1.

Input: Reference surfels, new surfels, and current laparoscope pose;
Transform the reference surfels to the current camera coordinates;
Calculate the projected depth image;
for pixel p in the new depth image do
      for pixel u within I in the projected depth image do
            Compute the first two metrics using Eq. (7)(8);
            if both metrics are below their thresholds then
                  Compute the third metric using Eq. (9);
            end if
      end for
      Find the location of the pixel u that has the smallest third metric;
end for
Obtain the corresponding surfel in the reference surfels;
Fuse the new surfel into the reference surfels using Eq. (10) (11) (12) (13);
Output: Updated reference surfels;
Algorithm 1 Surfel association and fusion
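
The fusion step at the end of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the update rules of this paper: it assumes the confidence-weighted running average that is commonly used in point-based fusion [11], and the Surfel class and fuse function are hypothetical names.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    point: np.ndarray    # 3D position in reference coordinates
    normal: np.ndarray   # unit surface normal
    radius: float
    confidence: float
    timestamp: int

def fuse(ref: Surfel, new: Surfel) -> Surfel:
    """Merge an associated new surfel into a reference surfel (assumed weighted average)."""
    w = ref.confidence + new.confidence
    point = (ref.confidence * ref.point + new.confidence * new.point) / w
    normal = ref.confidence * ref.normal + new.confidence * new.normal
    normal = normal / np.linalg.norm(normal)  # renormalize after averaging
    radius = (ref.confidence * ref.radius + new.confidence * new.radius) / w
    # Confidence accumulates; the timestamp records the latest observation.
    return Surfel(point, normal, radius, w, new.timestamp)
```

With this rule, surfels that have been observed many times move only slightly when a new measurement arrives, which smooths the reconstructed surface over time.
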
Fig. 4: Example of clustering. (a). Five nearest images (yellow) are retrieved from the image database, along with the 3D points they see (red); (b). Two clusters are found by the co-observed 3D points (orange and green), and the intra-operative image is initially matched to Cluster_0, which has more frames.

II-C Accurate Laparoscope Localization for Navigation

Fig. 5: (a). 3D reconstruction results on the SCARED and ex-vivo data; (b). The input laparoscope images and the corresponding rendered images. For each dataset, the left column shows the input images, and the right column shows the rendered images.

Based on the computed 3D structure of the whole surgical scene, we aim to localize the camera of a given intra-operative view by using the coarse-to-fine laparoscopic localization module. The process is shown in the intra-operative laparoscopic localization part of Fig. 1(a). First, a global map is established to combine the 3D structure and the input images. Second, a learning-based image retrieval system infers a set of images from the global map whose locations are most similar to that of the given query image. An iterative estimation process is then used to compute the fine pose of the laparoscope.

Map building: We build the global map shown in Fig. 1(a) by using the input pre-operative images, the estimated laparoscope poses, and the reconstructed 3D structure of the tissue surface from the proposed reconstruction framework. First, we collect the input images into an image database. Second, the 3D points of the reconstructed structure are projected onto the image plane of each camera coordinate frame, which is defined by the corresponding estimated laparoscope pose. The 2D keypoints in each image are then found from the projection, and the correspondence between the 3D structure and the input images is stored through the corresponding image pixels.
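The projection step of map building can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: a pinhole intrinsic matrix K, poses given as 4x4 matrices that map reference coordinates into each camera frame (the inverse may be needed depending on the pose convention), and no occlusion handling. The function names are hypothetical.

```python
import numpy as np

def project_points(points_ref, T_cam_from_ref, K, image_size):
    """Project reference-frame 3D points into one camera view.

    Returns the pixel coordinates of the visible points and their indices
    into points_ref, i.e. the stored 2D-3D correspondences for that image.
    """
    w, h = image_size
    pts_h = np.c_[points_ref, np.ones(len(points_ref))]
    pts_cam = (T_cam_from_ref @ pts_h.T).T[:, :3]
    idx = np.flatnonzero(pts_cam[:, 2] > 0)  # keep points in front of the camera
    uvw = (K @ pts_cam[idx].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[inside], idx[inside]

def build_map(points_ref, poses, K, image_size):
    """Map each image id to its (pixels, 3D point indices) pair."""
    return {i: project_points(points_ref, T, K, image_size) for i, T in enumerate(poses)}
```

Storing point indices per image is what later allows retrieved images to be clustered by the 3D points they co-observe and allows 2D matches to be lifted to 3D.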

Coarse retrieval: Based on the NetVLAD network [2], we use knowledge distillation [10] to train a smaller and more efficient student feature extraction network that learns the global features predicted by the teacher (NetVLAD). The training process is shown in Fig. 1(c). The student network is composed of an encoder and a smaller VLAD layer [22]. Using the student network, global features are computed and indexed for every image in the image database. For each intra-operative query image, we first extract its features and then use the KNN algorithm to find the nearest images, i.e., those with the shortest distance in feature space in the image database. These nearest images are then clustered by the 3D points they co-observe. Fig. 4 shows the clustering process. By retrieving a list of nearest images in the global feature space using KNN, the laparoscope pose can be roughly calculated.
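The retrieval and clustering steps can be sketched as follows. This is a minimal illustration under stated assumptions: global features are plain vectors compared by Euclidean distance with a brute-force search, and two retrieved images fall into the same cluster when they are linked by shared 3D point ids (connected components). The function names and the data layout are hypothetical.

```python
import numpy as np

def knn(query, database, k):
    """Indices of the k database features closest to the query (brute force)."""
    dist = np.linalg.norm(database - query, axis=1)
    return np.argsort(dist)[:k]

def cluster_by_covisibility(image_ids, observed_points):
    """Group images that are linked by co-observed 3D point ids.

    observed_points maps an image id to the set of 3D point ids it sees.
    Clusters are returned largest first.
    """
    remaining = list(image_ids)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster, frontier = [seed], [seed]
        while frontier:
            cur = frontier.pop()
            linked = [i for i in remaining if observed_points[cur] & observed_points[i]]
            for i in linked:
                remaining.remove(i)
            cluster.extend(linked)
            frontier.extend(linked)
        clusters.append(cluster)
    return sorted(clusters, key=len, reverse=True)
```

The first cluster in the returned list is the one with the most images, which is the cluster tried first in the fine localization step.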

Fine localization: The cluster with the most images is initially used to estimate a fine laparoscopic pose by utilizing a perspective-n-point (PnP) geometric consistency check. We first extract hand-crafted ORB features [20] from the query image and retrieved nearest images and then calculate the feature matches between them. Therefore, the corresponding 3D points in the reconstructed structure for the 2D keypoints of the query image can be selected. After outlier rejection within a RANSAC scheme, we can estimate a global laparoscopic pose from n 3D-2D matches using PnP method. If a valid pose is calculated, then the process will terminate, and the image of query laparoscopic view is successfully localized.
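Pose estimation from 3D-2D matches can be sketched as follows. This is a minimal NumPy illustration, not the solver used in this paper: it substitutes a plain Direct Linear Transform (DLT) for the PnP method, needs at least six non-coplanar matches, and leaves out the RANSAC loop, which would repeatedly call these functions on random subsets and keep the pose with the most low-error matches. The function names are hypothetical.

```python
import numpy as np

def pnp_dlt(points_3d, points_2d, K):
    """Estimate R, t (world to camera) from n >= 6 non-coplanar 3D-2D matches via DLT."""
    n = len(points_3d)
    # Normalized image coordinates remove the intrinsics from the problem.
    xy = (np.linalg.inv(K) @ np.c_[points_2d, np.ones(n)].T).T
    X = np.c_[points_3d, np.ones(n)]
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = X
    A[0::2, 8:12] = -xy[:, 0:1] * X
    A[1::2, 4:8] = X
    A[1::2, 8:12] = -xy[:, 1:2] * X
    # The right singular vector of the smallest singular value is [R | t] up to scale.
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)
    if np.linalg.det(P[:, :3]) < 0:  # fix the sign so that the rotation is proper
        P = -P
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt                       # closest rotation matrix
    t = P[:, 3] / S.mean()           # undo the unknown scale
    return R, t

def reprojection_errors(R, t, points_3d, points_2d, K):
    """Pixel distance between observed keypoints and reprojected 3D points."""
    cam = points_3d @ R.T + t
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return np.linalg.norm(uv - points_2d, axis=1)
```

Thresholding the reprojection errors gives the inlier set used for outlier rejection, and a pose can be accepted as valid when enough matches are inliers.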

Fig. 6: Quantitative pose evaluation. The second row’s 3D plot shows the laparoscope trajectory (red for the ground truth and blue for the estimation). The fourth row is the distance of camera movement. The last three rows present the ATE, RTE and RRE errors between the estimated poses and the real poses.

III Experiments

III-A Experiment Setup

The effectiveness of our stereo dense reconstruction method and the laparoscopic tracking is evaluated on three datasets. Dataset 1: The public SCARED dataset [1] consists of seven training datasets and two test datasets captured by a da Vinci Xi surgical robot. Each dataset corresponds to a porcine subject and has four or five keyframes. A keyframe contains a 1280 × 1024 stereo video, relative endoscope poses, and a 3D point cloud of the scene computed by a structured light-based reconstruction method [24]. Dataset 2: Our ex-vivo phantom and tissue data were collected by a Karl Storz laparoscope held by a UR5 robot; each case consists of 640 × 480 calibrated stereo videos, laparoscope poses calculated from the robot’s joint data, and the corresponding ground-truth 3D point cloud reconstructed by an active stereo surface reconstruction method assisted with structured light (SL) [26], of which the accuracy is 45.4. Dataset 3: Our in-vivo DaVinci robotic surgery dataset contains six cases of stereo videos recording the whole procedure of robotic prostatectomy.

We used the Root Mean Squared Error (RMSE) to quantify the accuracy of our reconstructed 3D model. The RMSE is computed as follows. The reconstructed 3D structure is first registered with the ground-truth 3D point cloud by manually selecting landmarks such as edge points. Then, the registration is refined by the ICP method. We also adopt three metrics, namely, absolute trajectory error (ATE), relative translation error (RTE), and relative rotation error (RRE), to evaluate the precision of the laparoscope pose, and the three metrics are defined as follows:


where is the ground truth camera pose; denotes the estimated pose, and is the rigid transformation between and .
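The three pose metrics can be sketched as follows. This is a minimal NumPy illustration that uses common definitions, which may differ in detail from those of this paper: poses are 4x4 matrices, the estimated trajectory is assumed to be already aligned to the ground truth by the rigid transformation, ATE is the RMSE of position differences, and RTE and RRE compare the relative motion between consecutive frames. The function names are hypothetical.

```python
import numpy as np

def ate_rmse(gt_poses, est_poses):
    """Absolute trajectory error: RMSE of camera position differences."""
    diffs = [g[:3, 3] - e[:3, 3] for g, e in zip(gt_poses, est_poses)]
    return float(np.sqrt(np.mean(np.sum(np.square(diffs), axis=1))))

def relative_errors(gt_poses, est_poses):
    """Per-step relative translation errors (pose units) and rotation errors (degrees)."""
    rte, rre = [], []
    for i in range(len(gt_poses) - 1):
        d_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + 1]     # true motion between frames
        d_est = np.linalg.inv(est_poses[i]) @ est_poses[i + 1]  # estimated motion
        err = np.linalg.inv(d_gt) @ d_est                       # residual motion
        rte.append(np.linalg.norm(err[:3, 3]))
        cos = np.clip((np.trace(err[:3, :3]) - 1) / 2, -1.0, 1.0)
        rre.append(np.degrees(np.arccos(cos)))
    return np.array(rte), np.array(rre)
```

ATE captures accumulated drift over the whole trajectory, whereas RTE and RRE isolate the frame-to-frame tracking quality.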

III-B Performance Assessment of 3D Dense Reconstruction

Four keyframes in the SCARED dataset and two cases in our ex-vivo data were utilized to quantitatively evaluate the precision of the 3D structure and the corresponding poses. Considering that acquiring the ground truth of the tissue’s 3D model during surgery is currently impractical because of clinical regulations, we tested our method only qualitatively on the in-vivo DaVinci robotic surgery dataset.

Quantitative Evaluation on ex-vivo Data: As shown in Fig. 5(a), the obtained 3D tissue models usually contain millions of points, which provide rich details of the surface texture. Furthermore, a surgical simulator is established to render color images from the estimated camera pose and the 3D structure. We compared the rendered images with the corresponding input images, and the results are presented in Fig. 5(b). Both the reconstructed 3D tissue surfaces and the textures of their re-projections match those observed in the input images. For the quantitative results on the reconstruction, we compared SL-Points and SR-Points, which refer to the numbers of points in the surface geometry calculated by using structured light and our stereo reconstruction method, respectively. As shown in Table I, the RMSE is no larger than 1.71 mm on all testing sets, which to a certain extent demonstrates the high accuracy of our reconstruction framework.

SL-Points () 1.08 1.18 1.02 0.84 1.71 0.70
SR-Points () 0.76 1.26 1.41 1.21 2.47 1.59
RMSE (mm) 1.027 1.308 1.339 0.714 1.705 1.220
TABLE I: Quantitative evaluation of the 3D structure

We simultaneously estimate the laparoscope pose in surfel fusion. Since the precision of the camera pose estimation can also affect the accuracy of our reconstruction outcomes, we hence validated the poses by comparing the calculated results with the ground truth camera poses using ATE, RTE and RRE metrics. Fig. 6 shows the camera trajectories for each dataset and the quantitative comparisons, and the result illustrates that the estimated camera pose matches closely with the ground truth poses, thereby proving the effectiveness of the proposed reconstruction framework.

The holistic stereo reconstruction framework was further run on our platform (UR5 robot and Karl Storz laparoscope). The average computational time to process one frame is 83.3 ms (12 fps), suggesting that our method is efficient and runs in real time.

Qualitative Evaluation on in-vivo Data: We collected four video clips from our in-vivo surgery dataset to evaluate the performance of the reconstruction method in a real-world surgical environment. Each clip lasts for around 1 second (20 to 60 frames) with fast camera movements and includes complex tissue surfaces, small tissue deformation, and instrument motion. As shown in Fig. 7, although the laparoscope moved quickly and the surgical scene was complicated with slight deformations, a plausible 3D point cloud and smooth laparoscope poses can be estimated, which qualitatively supports the accuracy of the proposed method.

Fig. 7: Qualitative evaluation results on four in-vivo datasets. For each data, the first column is the example of input frames, and the other columns are different views of the reconstructed 3D point cloud and the estimated camera poses.

III-C Performance of Laparoscope Localization for Navigation

To evaluate our visual localization method, three laparoscopic movement types that are common in laparoscopic surgery, namely zoom-in, zoom-out, and follow, together with a random camera motion, were adopted to generate test data. Typical examples of these three kinds of motion are shown in Fig. 8. We first collected validation data on our platform and then sampled images from SCARED, whereas the other images in the dataset were used for map building. Furthermore, some new views were created from the reconstructed 3D structure of the tissue surface using our simulator. In total, ten groups of data, each with 100 to 200 frames, were generated to validate the performance of the laparoscope motion estimation for the four types of camera motion.

Fig. 8: Example of localization dataset. The first column is the laparoscope motion. The other columns are corresponding frames.

In Fig. 9, we show typical examples of the comparison between the estimated pose and the ground-truth pose. For each type of motion, the black wireframe refers to the starting point of the camera movement, whereas the red and blue wireframes denote the ground-truth camera poses and those computed by our visual localization module, respectively. These figures show that the estimated poses are qualitatively similar to the ground truths in both rotation and translation.

Fig. 9: Example of comparison between estimated pose and ground truth pose (red for the ground truth and blue for the estimation). (a). zoom-in; (b). zoom-out; (c). follow; (d) random.

Given that the ground-truth camera poses are available for each data group, we can quantitatively evaluate the accuracy of the calculated laparoscope pose. In addition, MapNet [4], an end-to-end network-based camera pose estimation method, was used for comparison with our approach. Table II reports the translation and rotation errors of the camera pose estimation. The average errors of our method in translation and rotation were only 2.17 mm and 2.23 °, showing that our localization method can track the camera in real laparoscopic views and simulated new views. However, MapNet lacks the localization ability in new scenes. Therefore, our visual localization module has the preliminary ability to track the laparoscope in complicated surgical scenes with only images as input.

Motion | MapNet [4] | Ours
zoom-in | 15.712 mm, 26.677 ° | 2.398 mm, 2.321 °
zoom-out | 16.028 mm, 25.249 ° | 2.205 mm, 2.463 °
follow | 10.539 mm, 17.614 ° | 2.866 mm, 2.744 °
random | 0.588 mm, 2.439 ° | 1.194 mm, 1.374 °
Average | 10.717 mm, 17.994 ° | 2.166 mm, 2.226 °
TABLE II: Translation and rotation errors for different motion types.

IV Conclusions

In this paper, we propose an efficient learning-driven framework that achieves image-only 3D reconstruction of surgical scenes and preliminary laparoscopic localization. A fine-tuned learning-based stereo estimation network and a dense visual reconstruction algorithm are proposed to recover the 3D structure of the tissue surface. In addition, a visual localization module that incorporates our reconstructed 3D structure is presented to achieve coarse-to-fine laparoscopic tracking using only images as input. The proposed reconstruction method for complicated surgical scenes can run at 12 fps. We also evaluate our framework qualitatively and quantitatively on three datasets to demonstrate its accuracy and efficiency.

This work assumes a surgical scene with small deformation for the reconstruction and localization framework. In the future, we will apply our stereo dense reconstruction and camera localization framework to ENT surgery.


  • [1] M. Allan, J. Mcleod, C. Wang, J. C. Rosenthal, Z. Hu, N. Gard, P. Eisert, K. X. Fu, T. Zeffiro, W. Xia, et al. (2021) Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133. Cited by: §I, §III-A.
  • [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §II-C.
  • [3] T. Bergen and T. Wittenberg (2014) Stitching and surface reconstruction from endoscopic image sequences: a review of applications and methods. IEEE journal of biomedical and health informatics 20 (1), pp. 304–321. Cited by: §I.
  • [4] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz (2018) Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §III-C, TABLE II.
  • [5] P. Chang, D. Stoyanov, A. J. Davison, et al. (2013) Real-time dense stereo reconstruction using convex optimisation with a cost-volume for image-guided robotic surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 42–49. Cited by: §I.
  • [6] L. Chen, W. Tang, N. W. John, T. R. Wan, and J. J. Zhang (2018) SLAM-based dense surface reconstruction in monocular minimally invasive surgery and its application to augmented reality. Computer methods and programs in biomedicine 158, pp. 135–146. Cited by: §I.
  • [7] P. Edwards, D. Psychogyios, S. Speidel, L. Maier-Hein, D. Stoyanov, et al. (2020) SERV-ct: a disparity dataset from ct for validation of endoscopic 3d reconstruction. arXiv preprint arXiv:2012.11779. Cited by: §II-A.
  • [8] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279. Cited by: §II-A.
  • [9] Y. Hayashi, K. Misawa, M. Oda, D. J. Hawkes, and K. Mori (2016) Clinical application of a surgical navigation system based on virtual laparoscopy in laparoscopic gastrectomy for gastric cancer. International journal of computer assisted radiology and surgery 11 (5), pp. 827–836. Cited by: §I.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II-C.
  • [11] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb (2013) Real-time 3d reconstruction in dynamic scenes using point-based fusion. In 2013 International Conference on 3D Vision-3DV 2013, pp. 1–8. Cited by: §II-B.
  • [12] S. J. Lee, D. Kim, S. S. Hwang, and D. Lee (2021) Local to global: efficient visual localization for a monocular camera. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2231–2240. Cited by: §I.
  • [13] S. Leonard, A. Sinha, A. Reiter, M. Ishii, G. L. Gallia, R. H. Taylor, and G. D. Hager (2018) Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE transactions on medical imaging 37 (10), pp. 2185–2195. Cited by: §I.
  • [14] Z. Li, X. Liu, N. Drenkow, A. Ding, F. X. Creighton, R. H. Taylor, and M. Unberath (2020) Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2011.02910. Cited by: §I.
  • [15] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2811–2820. Cited by: §I.
  • [16] Y. Long, Z. Li, C. H. Yee, C. F. Ng, R. H. Taylor, M. Unberath, and Q. Dou (2021) E-dssr: efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception. arXiv preprint arXiv:2107.00229. Cited by: §II-B.
  • [17] N. Mahmoud, T. Collins, A. Hostettler, L. Soler, C. Doignon, and J. M. M. Montiel (2018) Live tracking and dense reconstruction for handheld monocular endoscopy. IEEE transactions on medical imaging 38 (1), pp. 79–89. Cited by: §I, §I.
  • [18] L. Maier-Hein, P. Mountney, A. Bartoli, H. Elhawary, D. Elson, A. Groch, A. Kolb, M. Rodrigues, J. Sorger, S. Speidel, et al. (2013) Optical techniques for 3d surface reconstruction in computer-assisted laparoscopic surgery. Medical image analysis 17 (8), pp. 974–996. Cited by: §I.
  • [19] A. Marmol, A. Banach, and T. Peynot (2019) Dense-arthroslam: dense intra-articular 3-d reconstruction with robust localization prior for arthroscopy. IEEE Robotics and Automation Letters 4 (2), pp. 918–925. Cited by: §I.
  • [20] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics 31 (5), pp. 1147–1163. Cited by: §II-C.
  • [21] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: §I.
  • [22] P. Sarlin, F. Debraine, M. Dymczyk, R. Siegwart, and C. Cadena (2018) Leveraging deep visual descriptors for hierarchical efficient localization. In Conference on Robot Learning, pp. 456–465. Cited by: §I, §II-C.
  • [23] D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47 (1), pp. 7–42. Cited by: §I.
  • [24] D. Scharstein and R. Szeliski (2003) High-accuracy stereo depth maps using structured light. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 1, pp. I–I. Cited by: §III-A.
  • [25] D. Stoyanov, M. V. Scarzanella, P. Pratt, and G. Yang (2010) Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 275–282. Cited by: §I.
  • [26] C. Sui, K. He, C. Lyu, Z. Wang, and Y. Liu (2020) Active stereo 3-d surface reconstruction using multistep matching. IEEE Transactions on Automation Science and Engineering 17 (4), pp. 2130–2144. Cited by: §III-A.
  • [27] J. Wang, H. Suenaga, K. Hoshi, L. Yang, E. Kobayashi, I. Sakuma, and H. Liao (2014) Augmented reality navigation with automatic marker-free image registration using 3-d image overlay for dental surgery. IEEE transactions on biomedical engineering 61 (4), pp. 1295–1304. Cited by: §I.
  • [28] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison (2015) ElasticFusion: dense slam without a pose graph. Cited by: §II-B.
  • [29] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §II-A.
  • [30] H. Zhou and J. Jagadeesan (2019) Real-time dense reconstruction of tissue surface from stereo optical video. IEEE transactions on medical imaging 39 (2), pp. 400–412. Cited by: §I.