3D Semantic Mapping from Arthroscopy using Out-of-distribution Pose and Depth and In-distribution Segmentation Training

06/10/2021 ∙ by Yaqub Jonmohamadi, et al. ∙ The University of Adelaide

Minimally invasive surgery (MIS) has many documented advantages, but the surgeon's limited visual contact with the scene can be problematic. Hence, systems that can help surgeons navigate, such as a method that can produce a 3D semantic map, can compensate for this limitation. In theory, we can borrow 3D semantic mapping techniques developed for robotics, but this requires finding solutions to the following challenges in MIS: 1) semantic segmentation, 2) depth estimation, and 3) pose estimation. In this paper, we propose the first 3D semantic mapping system from knee arthroscopy that solves the three challenges above. Using out-of-distribution non-human datasets, where pose could be labeled, we jointly train depth+pose estimators using self-supervised and supervised losses. Using an in-distribution human knee dataset, we train a fully-supervised semantic segmentation system to label arthroscopic image pixels into femur, ACL, and meniscus. Taking testing images from human knees, we combine the results from these two systems to automatically create 3D semantic maps of the human knee. The result of this work opens the pathway to the generation of intraoperative 3D semantic mapping, registration with pre-operative data, and robotic-assisted arthroscopy.




1 Introduction

Minimally invasive surgery (MIS) is a surgical procedure where the operation is conducted via a few incision holes. It is favorable over open surgery due to its clinical benefits such as small scars, lower chances of bleeding and infection, and shorter recovery time. However, MIS forces the surgeon to lose direct eye contact with the scene and, consequently, to rely on endoscopic video for the whole surgery. The limited field of view (FoV) and 2D nature of endoscopic images often leave surgeons unable to identify tissue structures, so they resort to visually surveying the scene by moving the camera around. In arthroscopy, this happens repeatedly during surgery, which can prolong the operation time and lead to unintentional damage to critical tissue. According to a survey on knee arthroscopy [10], about 50% of the surgeons admitted to damaging a knee once every 10 operations.

Computer vision could assist surgeons by augmenting the endoscopic image with a 3D semantic map of the scene, color-coded to represent the different anatomical structures and surgical tools. To produce a 3D semantic map in arthroscopy, we need to solve the following challenges: semantic segmentation, depth estimation, and pose estimation. In other types of MIS, like laparoscopy and sinus endoscopy, techniques such as simultaneous localization and mapping (SLAM) [8] and structure from motion (SfM) [12] have been applied successfully. However, such techniques fail in arthroscopy due to poor texture, lack of photometric constancy across frames, and their assumptions regarding the camera motion.

In this paper, we introduce the first method to produce 3D semantic maps from arthroscopy. To achieve this goal, we create two datasets: 1) out-of-distribution (OOD) datasets containing non-human knees that have camera pose annotation, and 2) an in-distribution (ID) dataset containing human knees that have semantic segmentation annotation of femur, anterior cruciate ligament (ACL), and meniscus. We train a system with the OOD datasets to estimate depth+pose using a self-supervised view synthesis loss plus a supervised pose loss. We also train a method to produce semantic segmentation using the ID dataset in [11]. We then combine the pose, depth, and semantic segmentation of both systems and use the method in [29] to produce 3D semantic maps of testing images from the ID dataset. Quantitative results of the pose estimation and qualitative visual results from the 3D semantic maps suggest that our approach can be reliably used for mapping human knees, even though part of the training was based on OOD training sets. To the best of our knowledge, this is the first method that can estimate the depth and pose from arthroscopy and the first to create 3D semantic maps in clinical endoscopy.

2 Related Work

Deep learning has shown impressive results in complex computer vision tasks such as segmentation, depth perception, and pose estimation [26, 7, 30]. These approaches work well on feature-rich datasets like road scenes but perform poorly in environments such as medical endoscopy, as shown in [24]. This is because of poor texture information and the lack of photometric constancy between frames in endoscopy due to the joint motion of the camera and light source [14]. Recently, the depth and pose estimation methods above have been adapted for colonoscopy [4, 3, 17] and sinus endoscopy [14, 15, 16]. Compared to the original work on self-supervised estimation of depth and pose shown in [26, 7, 30], a key aspect of these methods is the incorporation of supervision for depth+pose estimation. For example, [3, 17, 14, 15, 16] used structure from motion (SfM) [25] to create sparse depth frames from the training images and used them to supervise the depth+pose training. Arthroscopy images have little texture information due to the smooth bone surfaces. Furthermore, over- and under-illumination occur frequently in arthroscopy and affect the approaches above [2]. As a result, feature-tracking-based techniques such as SfM cannot create reliable feature maps in arthroscopy, as has been shown in [18].

Hence, we advocate the use of pose annotation acquired from images of non-human environments to supervise the training of depth+pose using a self-supervised+supervised loss function. We also trained a novel supervised model for semantic segmentation with the method in [1], which extends the semantic segmentation in [11] with multi-spectral frame reconstruction [20]. Because the biological compositions of the tissue types, namely bone, ACL, and meniscus, are intrinsically different, the RGB arthroscopic frames are transformed into 36 spectral bands as a preprocessing step, covering wavelengths from 380 to 740 nm at 10 nm intervals. A segmentation network extracts spatial characteristics at these 36 spectral bands and subsequently learns the location along with its label.

3 Methods

The aim of the depth+pose network, for a given target image at time $t$, $I_t$, and source frames, $I_{t'}$, is to estimate the pixel-level depth $D_t$ and the ego motion $T_{t \to t'} = (\mathbf{r}, \mathbf{t})$, where $\mathbf{r}$ and $\mathbf{t}$ refer to the 6 degrees of freedom, rotation and translation, in Euler coordinates. We achieve this by training the depth+pose network on the self-supervised plus supervised objectives. In our case, with a stereo endoscope, the source images $I_{t'}$ are the left image at time $t+1$ and the right image at time $t$, while the target image is the left image at time $t$.

3.0.1 Self-supervised objective

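The self-supervised objective minimizes a photometric reprojection error between the synthesized target image, $I_{t' \to t}$, and the target image, $I_t$, as shown in [6, 30], together with an edge-aware smoothing term as shown in [28, 7]:

$$L_{self} = \mu L_p + \lambda L_s,$$

where $L_p$ is built from the pixel-level photometric reprojection error $pe$ shown in [6], which consists of a structural similarity (SSIM) term [27] and an $L_1$ loss:

$$pe(I_t, I_{t' \to t}) = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_t, I_{t' \to t})\right) + (1 - \alpha)\,\lVert I_t - I_{t' \to t} \rVert_1,$$

where $\alpha$ is the weight that balances the two terms. Similar to [7], the minimum reprojection error is used to reduce the effect of pixels that are not visible in some of the source images compared with the target image due to ego motion or occlusion:

$$L_p = \min_{t'} pe(I_t, I_{t' \to t}).$$

The minimum reprojection loss is particularly helpful in reducing the edge artifacts of the depth. The auto-masking term $\mu$ is a binary mask that rejects pixels whose appearance does not change between frames, such as static scenes and objects moving at the same velocity and orientation as the camera [7]:

$$\mu = \left[\, \min_{t'} pe(I_t, I_{t' \to t}) < \min_{t'} pe(I_t, I_{t'}) \,\right],$$

where $[\cdot]$ is the Iverson bracket. Similar to [6], the weighted edge-aware term is used to regularize the depth in low-texture areas:

$$L_s = \lvert \partial_x D_t \rvert\, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y D_t \rvert\, e^{-\lvert \partial_y I_t \rvert},$$

where $D_t$ is the estimated depth, $\partial$ refers to the gradient function, and the weight $\lambda$ is 1e-3.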
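As a rough illustration of how the minimum reprojection error and the auto-masking term interact, the NumPy sketch below follows the formulation of [7]. It is not the implementation used in this work: the function names, the 3×3 box-filter SSIM, the SSIM constants, and the value `alpha=0.85` are assumptions taken from common practice, the images are assumed to be grayscale arrays in [0, 1], and the warped source images are assumed to have been synthesized already from the predicted depth and pose.

```python
import numpy as np

def box_filter3(x):
    """3x3 mean filter with replicate padding; x has shape (H, W)."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM between two grayscale images in [0, 1]."""
    mu_a, mu_b = box_filter3(a), box_filter3(b)
    var_a = box_filter3(a * a) - mu_a ** 2
    var_b = box_filter3(b * b) - mu_b ** 2
    cov = box_filter3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_error(target, other, alpha=0.85):
    """pe = alpha/2 * (1 - SSIM) + (1 - alpha) * |target - other|, per pixel."""
    ssim_term = np.clip((1.0 - ssim_map(target, other)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(target - other)
    return alpha * ssim_term + (1.0 - alpha) * l1_term

def masked_min_reprojection(target, warped_sources, raw_sources):
    """Return (mean masked loss, auto-mask) for one target frame."""
    # Minimum over source frames, taken independently at every pixel.
    pe_warped = np.min([photometric_error(target, w) for w in warped_sources], axis=0)
    pe_raw = np.min([photometric_error(target, s) for s in raw_sources], axis=0)
    # Keep only pixels where warping explains the target better than the raw source.
    mask = pe_warped < pe_raw
    loss = (mask * pe_warped).sum() / max(mask.sum(), 1)
    return loss, mask
```

When a raw source frame already matches the target (a static view), the mask is empty and the frame contributes no photometric loss, which is the behaviour the auto-masking term is meant to provide.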

3.0.2 Supervised objective

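The supervised objective is to minimize the error between the ego motion estimated by the pose network, $(\hat{\mathbf{r}}, \hat{\mathbf{t}})$, and the groundtruth relative camera pose, $(\mathbf{r}, \mathbf{t})$.

One term of this objective is calculated on the normalized translations, i.e., $\hat{\mathbf{t}} / \lVert \hat{\mathbf{t}} \rVert$ and $\mathbf{t} / \lVert \mathbf{t} \rVert$, and similarly on the normalized rotations, $\hat{\mathbf{r}} / \lVert \hat{\mathbf{r}} \rVert$ and $\mathbf{r} / \lVert \mathbf{r} \rVert$. This is because the relative displacement from frame to frame can change substantially within endoscopic sequences, with some frames having more than 20 times the change in translation or angle of other frames. Without the normalized term, the network performs poorly for the frames with small changes in motion.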
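The role of the normalized term can be illustrated with a small NumPy sketch. The function names, the choice of an L1 error, the `eps` guard, and applying the per-axis weights to both translation terms are assumptions made for illustration; the text does not specify these details.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale a 3-vector to unit length (eps avoids division by zero)."""
    return v / (np.linalg.norm(v) + eps)

def pose_loss(t_pred, r_pred, t_gt, r_gt, axis_weights=(0.5, 0.5, 1.0)):
    """Return (raw error, normalized error) for one relative pose.

    t_* are translations and r_* are Euler-angle rotations, all 3-vectors.
    axis_weights is the hypothetical per-axis weighting of the translation error.
    """
    w = np.asarray(axis_weights)
    raw_t = np.sum(w * np.abs(t_pred - t_gt))
    raw_r = np.sum(np.abs(r_pred - r_gt))
    norm_t = np.sum(w * np.abs(unit(t_pred) - unit(t_gt)))
    norm_r = np.sum(np.abs(unit(r_pred) - unit(r_gt)))
    return raw_t + raw_r, norm_t + norm_r

# A frame with very small motion whose predicted direction is wrong:
raw, normalized = pose_loss(np.array([0.0, 0.01, 0.0]), np.array([0.1, 0.0, 0.0]),
                            np.array([0.01, 0.0, 0.0]), np.array([0.1, 0.0, 0.0]))
```

In this example the raw error is tiny because the motion itself is tiny, while the normalized error is large because the direction is wrong, which is consistent with the observation that small-motion frames are poorly constrained without the normalized term.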

The final loss for training the depth+pose network in this work combines the self-supervised and supervised objectives. Since most of the variation in the camera pose is along the x and y axes of the translation, a weighting of [0.5, 0.5, 1] was applied to the translation loss. Fig. 1 shows the pipeline for training the depth+pose and segmentation networks.

Figure 1: The training pipeline for the segmentation, depth and pose estimation. The upper network performs semantic segmentation in a supervised manner. The second and third networks are the depth+pose networks being trained jointly using the supervised+self-supervised approach.

The method in [29] was used to fuse depth frames and create the 3D maps in chunks. Since in actual arthroscopy the endoscope movement is limited to the areas surrounding the incision holes, the sequences used for fusion are typically 3 to 8 seconds long (roughly 75 to 200 frames at the 25 fps recording rate), which corresponds to a sweep in translation that covers the part of the knee accessible via the incision hole.
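To give a sense of what fusing depth frames into a volume involves, the sketch below integrates one depth frame into a truncated signed distance function (TSDF) grid using a plain weighted running average. This is a deliberately simpler scheme than the TV-L1 range image integration of [29], and the function name, argument layout, and pinhole camera model are assumptions for illustration only.

```python
import numpy as np

def integrate(tsdf, weight, voxel_xyz, depth, K, cam_from_world, trunc):
    """Fuse one depth frame into a flat TSDF array by weighted running average.

    tsdf, weight   : (N,) float arrays, updated in place
    voxel_xyz      : (N, 3) voxel centres in world coordinates
    depth          : (H, W) depth map in the same metric unit as voxel_xyz
    K              : (3, 3) pinhole intrinsics
    cam_from_world : (4, 4) rigid transform for this frame
    trunc          : truncation distance
    """
    h, w = depth.shape
    # Move voxel centres into the camera frame and project them.
    pts = voxel_xyz @ cam_from_world[:3, :3].T + cam_from_world[:3, 3]
    z = pts[:, 2]
    in_front = z > 1e-6
    zs = np.where(in_front, z, 1.0)          # safe divisor for voxels behind the camera
    u = np.round(K[0, 0] * pts[:, 0] / zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[:, 1] / zs + K[1, 2]).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    sdf = d - z                               # positive in front of the surface
    valid &= (d > 0) & (sdf > -trunc)         # skip voxels far behind the surface
    new = np.clip(sdf / trunc, -1.0, 1.0)
    tsdf[valid] = (tsdf[valid] * weight[valid] + new[valid]) / (weight[valid] + 1.0)
    weight[valid] += 1.0
```

Calling the function once per frame of a chunk, with the depth map and camera pose predicted for that frame, accumulates the chunk into a single volume; a colour or semantic label volume could be averaged with the same weights.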

4 Experimental setup

4.1 Depth+pose training

Training and validation data were recorded using a stereo endoscope with 384×384 resolution, 1.52 mm baseline, 87.5° FoV, and 25 fps. Stereo images were rectified and downsampled to 256×256. The groundtruth poses of the camera tip were recorded by attaching an NDI magnetic sensor.
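For orientation, the camera parameters above can be turned into approximate pinhole quantities. The sketch assumes an ideal, distortion-free pinhole model and that the quoted FoV spans the full image width; neither assumption is stated in the text, and the function names and the disparity value are hypothetical.

```python
import math

def focal_length_px(width_px, fov_deg):
    """Pinhole focal length in pixels for a field of view spanning width_px."""
    return (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def depth_from_disparity(disparity_px, focal_px, baseline_mm):
    """Depth in mm of a point seen with the given disparity in a rectified pair."""
    return focal_px * baseline_mm / disparity_px

f_full = focal_length_px(384, 87.5)    # at the recorded 384x384 resolution
f_small = focal_length_px(256, 87.5)   # after downsampling to 256x256

# Depth implied by a hypothetical 10-pixel disparity at the downsampled resolution.
z_mm = depth_from_disparity(10.0, f_small, 1.52)
```

Because the baseline is only 1.52 mm, depth resolution from stereo disparity degrades quickly with distance, which is one reason the stereo pair is used here as an additional view for view synthesis rather than for direct triangulation.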

Liu et al. [13] showed that training a self-supervised depth network directly on arthroscopy videos fails due to the poor texture, and that pretraining on texture-rich frames is crucial. Therefore, for the training, a 3D printed model of the knee was placed in a water tank and recorded while the magnetic sensor was attached to the camera tip to provide the groundtruth camera poses. The images of the 3D printed knee provide rich texture and are ideal for the training phase. A second dataset was recorded from a sheep joint with the corresponding groundtruth camera poses. Similar to the human knee, the animal joint video frames suffer from the poor texture problem. During training and validation, the images from the 3D printed knee and the animal joint were used together.

In total, 28500 frames from the 3D printed knee and 10000 frames from the animal experiment were recorded, of which 25000 from the 3D printed knee and 8800 from the animal experiment were used for training and the rest for validation. The Absolute Trajectory Error (ATE) was used to quantify the error between the network-estimated and the groundtruth poses.
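One common way of computing an ATE is sketched below in NumPy. The alignment step (matching the starting points and solving for a single least-squares scale) is an assumption borrowed from typical monocular pose evaluation; the text does not state which alignment, if any, was applied, and the function name is hypothetical.

```python
import numpy as np

def absolute_trajectory_error(gt_xyz, pred_xyz):
    """RMSE between two (N, 3) camera-position tracks after simple alignment.

    Both tracks are shifted to start at the origin, and the predicted track is
    rescaled by the least-squares scale factor before the error is measured.
    """
    gt = np.asarray(gt_xyz, dtype=float)
    pred = np.asarray(pred_xyz, dtype=float)
    gt0 = gt - gt[0]
    pred0 = pred - pred[0]
    scale = np.sum(gt0 * pred0) / max(np.sum(pred0 ** 2), 1e-12)
    err = scale * pred0 - gt0
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```

With this definition, a prediction that differs from the groundtruth only by a constant offset and a global scale has zero error, so the metric measures the shape of the trajectory rather than its absolute placement.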

The testing data was obtained from a cadaveric experiment in which multiple sequences were recorded from a left knee. About 9 sequences were recorded, with lengths varying from a few seconds to a few minutes. This resulted in approximately 12000 images. No groundtruth camera pose was available for these sequences.

The depth network architecture was similar to Disp-Net [19], but also uses skip connections [22] from the encoder's activation blocks, leading to higher resolution details [6], with sigmoids at the output. For the pose network, only the encoder was used. ResNet50 [9] was used as the encoder for both depth and pose networks, with weights pretrained on ImageNet [23]. Training augmentations were applied with 50% probability: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1 [7]. The model was trained for 30 epochs using the Adam [5] optimizer with a batch size of 18. Since most of the variation in the camera pose is along the x and y axes of the translation, a weighting of [0.5, 0.5, 1] was applied to the translation loss.
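A minimal NumPy sketch of the photometric part of such an augmentation is given below. It covers brightness, contrast, and saturation only; hue jitter is omitted for brevity, and the function name and the exact way each factor is applied are assumptions rather than the implementation used in this work.

```python
import numpy as np

def color_jitter(img, rng, p=0.5, brightness=0.2, contrast=0.2, saturation=0.2):
    """Randomly jitter an (H, W, 3) RGB image in [0, 1] with probability p."""
    if rng.random() >= p:
        return img                                   # leave the image untouched
    out = img * rng.uniform(1 - brightness, 1 + brightness)
    mean = out.mean()
    out = (out - mean) * rng.uniform(1 - contrast, 1 + contrast) + mean
    gray = out.mean(axis=2, keepdims=True)           # per-pixel intensity
    out = (out - gray) * rng.uniform(1 - saturation, 1 + saturation) + gray
    return np.clip(out, 0.0, 1.0)
```

For view-synthesis training, the same random factors would normally be shared by the target and all source frames of a sample, so that the photometric error is not dominated by the augmentation itself.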

4.2 Segmentation training

Data from four cadaveric experiments were used for training and testing. Data from the last experiment were also used as the test data for the depth+pose networks [11]. For training, 2868 images from the first experiment and 1524 from the last experiment (two sequences among nine) were used. The remaining images from the last experiment (3460 frames) were used for testing along with the other three cadaver experiments. We test on two sets: i) high-quality images, and ii) all remaining cadaver datasets, excluding saturated and bad frames. The accuracy of the proposed method can be improved if a high-quality imaging system and sufficient information about the irregular geometrical structures of the knee are provided; more details are available in [1]. The training data was augmented 6 times using shift and rotation (with angles of 90°, 180°, and 270°), vertical and horizontal flips, and brightness changes.
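The geometric part of this augmentation can be sketched as follows. The key point is that every transform is applied identically to the image and its label mask. The function name is hypothetical, and the shift and brightness augmentations mentioned above are left out.

```python
import numpy as np

def geometric_augmentations(image, mask):
    """Yield (image, mask) pairs for the rotations and flips named in the text.

    image : (H, W, C) array, mask : (H, W) label array. The same transform is
    applied to both so that labels stay aligned with pixels.
    """
    for k in (1, 2, 3):                                   # 90, 180, 270 degrees
        yield np.rot90(image, k, axes=(0, 1)), np.rot90(mask, k, axes=(0, 1))
    yield np.flip(image, axis=0), np.flip(mask, axis=0)   # vertical flip
    yield np.flip(image, axis=1), np.flip(mask, axis=1)   # horizontal flip
```

Rotations by multiples of 90° and axis flips only permute pixels, so no interpolation is needed and the label values in the mask are preserved exactly.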

The model in [1] is a U-Net in which each contraction layer contains two successive convolution layers with a 3×3 kernel. The spatial context map is downsampled by a max pooling operation with pool size 2×2. 'Same' padding is used so that the input and output images have the same resolution. A kernel initializer is used to set the initial weights of the convolution layers. The dilation rate is set to 2, which provides a wider field of view so that it can avoid adjacent pixels having the same reflectance. Softmax is used at the final layer. The categorical cross-entropy loss function and the stochastic gradient descent optimizer are used for training. TensorFlow [21] was used to implement all the models.
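Two of the ingredients above can be made concrete with a short NumPy sketch: the spatial extent covered by a dilated kernel, and the softmax plus categorical cross-entropy used at the output. The function names and the `eps` clipping value are assumptions, and the four-class example in the usage note is hypothetical.

```python
import numpy as np

def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)

def softmax(logits):
    """Softmax over the last axis, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Mean per-pixel cross-entropy between one-hot labels and probabilities."""
    log_p = np.log(np.clip(probs, eps, 1.0))
    return float(-np.mean(np.sum(one_hot * log_p, axis=-1)))
```

A 3×3 kernel with a dilation rate of 2 spans a 5×5 window while still using nine weights. For a hypothetical four-class problem, a network that outputs identical logits for every class yields a cross-entropy of log(4), which is a useful reference value when reading training curves.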

5 Results

The training and validation losses of the depth+pose networks are shown in Fig. 2. For comparison between the supervised+self-supervised depth+pose estimation using images at time t+1 and the stereo pair, we included the plot for the self-supervised counterpart as well as supervised+self-supervised using image at time t+1 only, i.e., mono supervised+self-supervised. In this way, it is possible to evaluate the impact of the pose supervision and stereo versus mono scenario.

[Figure 2 panels: validation loss (translation), validation loss (angle), validation photometric reprojection loss, qualitative result]
Figure 2: The validation losses are shown in subfigure (a) for translation, (b) for the angle, and (c) for photometric reprojection loss. The qualitative results are showing the actual camera poses for sheep joint in (d). The first row of (d) is the groundtruth, the middle row is the corresponding network predictions (supervised+self-supervised stereo) and the third row is the ATE. The left column is the translation and the right column is the angle (rotation).

According to the plots on the validation data in Fig. 2(c), the photometric reprojection loss (which is an indication of the combined accuracy of camera pose and depth estimation) is lowest for the self-supervised network. It is closely followed by the supervised+self-supervised stereo network. The supervised+self-supervised mono network has the poorest photometric reprojection loss. On the other hand, the supervised+self-supervised networks outperformed the self-supervised network on camera pose estimation for the validation data, as shown in Fig. 2(a) and (b). These results indicate that the supervised+self-supervised network using the stereo pair and time t+1 as reference images has higher accuracy in depth and slightly higher accuracy in camera pose estimation than the supervised+self-supervised mono network. Hence, this model was considered for 3D mapping of the scene.

The actual pose network predictions on the validation data (animal experiment) are shown in Fig. 2(d). The groundtruth is shown in the first row, the prediction in the second row, and the corresponding ATE in the third row. Overall, changes in camera rotation proved harder for the network to predict than changes in translation.

Fig. 3 shows sample 3D maps obtained by fusing chunks of arthroscope frames from a human cadaver knee. The number of frames used to create each map is indicated alongside it. The corresponding camera translations are shown in the plots on the right side of each map. For every map, the semantic map is also provided, with cartilage (femur and tibia) in green, meniscus in red, and ACL in blue. Cyan refers to other structures, typically floating fat and skin. The segmentations appear correct except for a minor error in the third map, where the segmentation network falsely detects meniscus (red) on the right-hand side of the map.

Figure 3: Sample 3D maps on the test data from an actual human knee. Once the networks are trained, the arthroscopic frames are provided to them as input, and their outputs of depth and camera pose are given as inputs to the TSDF function [29] to create the extended map. Either the image or the corresponding semantic label can be provided as the third input to the TSDF function. The number of frames used to create each map is indicated with the map. Since most of the variation in camera pose is due to translation, only the translation is shown in the figure. The knee models on the left-hand side of the figure, marked with dark squares, show the approximate locations of the camera with respect to the actual knee.
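The way depth, camera pose, and semantic labels come together can be sketched by back-projecting a single frame into a coloured point set. This is only an illustration of the geometry, not the fusion method of [29]: the integer label ids, the function name, the pinhole intrinsics `K`, and the camera-to-world pose convention are all assumptions.

```python
import numpy as np

# Colour coding described in the text; the integer label ids are hypothetical.
LABEL_COLORS = np.array([[0, 255, 255],   # 0: other structures (cyan)
                         [0, 255, 0],     # 1: cartilage (green)
                         [255, 0, 0],     # 2: meniscus (red)
                         [0, 0, 255]],    # 3: ACL (blue)
                        dtype=np.uint8)

def semantic_points(depth, labels, K, world_from_cam):
    """Back-project a depth map into world space and colour each point by label.

    depth          : (H, W) depth map
    labels         : (H, W) integer label map indexing LABEL_COLORS
    K              : (3, 3) pinhole intrinsics
    world_from_cam : (4, 4) camera-to-world rigid transform
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                       # pixel rows and columns
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    world = cam @ world_from_cam[:3, :3].T + world_from_cam[:3, 3]
    colors = LABEL_COLORS[labels.reshape(-1)]
    return world, colors
```

Concatenating the outputs over the frames of a chunk gives a crude semantic point cloud; a volumetric fusion step is what turns such per-frame points into the smooth maps shown in the figure.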

6 Conclusion

In this work, we presented for the first time a pipeline to perform 3D semantic mapping in arthroscopy. To the best of our knowledge, this has not been done in any medical endoscopy before. To achieve this, we used deep learning approaches for semantic segmentation, depth perception, and camera pose estimation. The proposed domain adaptive approach produced superior accuracy in camera pose estimation and comparable depth accuracy in comparison with the self-supervised counterpart. Furthermore, we used a segmentation tool to semantically segment the images into cartilage, ACL, and meniscus. The segmentation approach utilizes the multi-spectral properties of surgical tissue in the images rather than merely the geometrical cues, as was shown in [11, 13].


Supported by AISRF53820 and Australian Research Council through grants DP180103232 and FT190100525.


  • [1] S. Ali, Y. Jonmohamadi, J. Roberts, R. Crawford, G. Carneiro, and A. K. Pandey (2021) Arthroscopic multi-spectral scene segmentation using deep learning. arXiv preprint arXiv:2001.05566. Cited by: §2, §4.2, §4.2.
  • [2] S. Ali, Y. Jonmohamadi, Y. Takeda, J. Roberts, R. Crawford, and A. K. Pandey (2020) Supervised Scene Illumination Control in Stereo Arthroscopes for Robot Assisted Minimally Invasive Surgery. IEEE Sensors Journal 4 (10), pp. 11577–11587. Note: Publisher: IEEE Cited by: §2.
  • [3] G. Bae, I. Budvytis, C. Yeung, and R. Cipolla (2020) Deep Multi-view Stereo for Dense 3D Reconstruction from Monocular Endoscopic Video. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 774–783. Cited by: §2.
  • [4] R. J. Chen, T. L. Bobrow, T. Athey, F. Mahmood, and N. J. Durr (2019) Slam endoscopy enhanced by adversarial depth prediction. arXiv preprint arXiv:1907.00283. Cited by: §2.
  • [5] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [6] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §3.0.1, §3.0.1, §4.1.
  • [7] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE international conference on computer vision, pp. 3828–3838. Cited by: §2, §3.0.1, §3.0.1, §4.1.
  • [8] O. G. Grasa, E. Bernal, S. Casado, I. Gil, and J. M. M. Montiel (2013) Visual SLAM for handheld monocular endoscope. IEEE transactions on medical imaging 33 (1), pp. 135–146. Note: Publisher: IEEE Cited by: §1.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [10] A. Jaiprakash, W. B. O’Callaghan, S. L. Whitehouse, A. Pandey, L. Wu, J. Roberts, and R. W. Crawford (2017) Orthopaedic surgeon attitudes towards current limitations and the potential for robotic and technological innovation in arthroscopic surgery. Journal of Orthopaedic Surgery 25 (1), pp. 2309499016684993. Note: Publisher: SAGE Publications Sage UK: London, England Cited by: §1.
  • [11] Y. Jonmohamadi, Y. Takeda, F. Liu, F. Sasazawa, G. Maicas, R. Crawford, J. Roberts, A. K. Pandey, and G. Carneiro (2020) Automatic segmentation of multiple structures in knee arthroscopy using deep learning. IEEE Access 8, pp. 51853–51861. Note: Publisher: IEEE Cited by: §1, §2, §4.2, §6.
  • [12] S. Leonard, A. Sinha, A. Reiter, M. Ishii, G. L. Gallia, R. H. Taylor, and G. D. Hager (2018) Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE transactions on medical imaging 37 (10), pp. 2185–2195. Note: Publisher: IEEE Cited by: §1.
  • [13] F. Liu, Y. Jonmohamadi, G. Maicas, A. K. Pandey, and G. Carneiro (2020) Self-supervised Depth Estimation to Regularise Semantic Segmentation in Knee Arthroscopy. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 594–603. Cited by: §4.1, §6.
  • [14] X. Liu, A. Sinha, M. Ishii, G. D. Hager, A. Reiter, R. H. Taylor, and M. Unberath (2019) Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE transactions on medical imaging 39 (5), pp. 1438–1447. Note: Publisher: IEEE Cited by: §2.
  • [15] X. Liu, A. Sinha, M. Unberath, M. Ishii, G. Hager, R. H. Taylor, and A. Reiter (2018) Self-supervised Learning for Dense Depth Estimation in Monocular Endoscopy. arXiv, pp. arXiv–1806. Cited by: §2.
  • [16] X. Liu, M. Stiber, J. Huang, M. Ishii, G. D. Hager, R. H. Taylor, and M. Unberath (2020) Reconstructing Sinus Anatomy from Endoscopic Video–Towards a Radiation-free Approach for Quantitative Longitudinal Assessment. arXiv preprint arXiv:2003.08502. Cited by: §2.
  • [17] R. Ma, R. Wang, S. Pizer, J. Rosenman, S. K. McGill, and J. Frahm (2019) Real-time 3d reconstruction of colonoscopic surfaces for determining missing regions. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 573–582. Cited by: §2.
  • [18] A. Marmol, A. Banach, and T. Peynot (2019) Dense-arthroSLAM: Dense intra-articular 3-D reconstruction with robust localization prior for arthroscopy. IEEE Robotics and Automation Letters 4 (2), pp. 918–925. Note: Publisher: IEEE Cited by: §2.
  • [19] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016-06) A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 4040–4048. External Links: ISBN 9781467388511, Link, Document Cited by: §4.1.
  • [20] H. Otsu, M. Yamamoto, and T. Hachisuka (2018) Reproducing spectral reflectances from tristimulus colours. Computer Graphics Forum. External Links: ISSN 1467-8659, Document Cited by: §2.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [22] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.1.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Note: Publisher: Springer Cited by: §4.1.
  • [24] L. Sharan, L. Burger, G. Kostiuchik, I. Wolf, M. Karck, R. De Simone, and S. Engelhardt (2020) Domain gap in adapting self-supervised depth estimation methods for stereo-endoscopy. Current Directions in Biomedical Engineering 6 (1). Note: Publisher: De Gruyter Cited by: §2.
  • [25] S. Ullman (1979) The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences 203 (1153), pp. 405–426. Note: Publisher: The Royal Society London Cited by: §2.
  • [26] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki (2017) Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804. Cited by: §2.
  • [27] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Note: Publisher: IEEE Cited by: §3.0.1.
  • [28] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia (2018) Lego: Learning edge with geometry all at once by watching videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 225–234. Cited by: §3.0.1.
  • [29] C. Zach, T. Pock, and H. Bischof (2007) A globally optimal algorithm for robust tv-l 1 range image integration. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §1, §3.0.2, Figure 3.
  • [30] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2, §3.0.1.