Knee arthroscopy is a minimally invasive surgery (MIS) conducted via small incisions that reduce surgical trauma and post-operative recovery time. Despite these advantages, arthroscopy has some drawbacks, namely: limited access and loss of direct eye contact with the surgical scene, a limited field of view (FoV) of the arthroscope, tissues too close to the camera (e.g., 10 mm away) being only partially visible in the camera FoV, diminished hand-eye coordination, and prolonged learning curves and training periods. In this scenario, surgeons can only confidently identify the femur due to its distinctive shape, while other structures, such as the meniscus, tibia, and anterior cruciate ligament (ACL), remain challenging to recognise. This limitation increases operating time and may lead to unintentional tissue damage due to untracked camera movements. The automatic segmentation of these tissues has the potential to help surgeons by providing contextual awareness of the surgical scene, reducing surgery time, and shortening the learning curve.
Deep learning (DL) semantic segmentation has been intensively studied by the computer vision community [11, 1, 10, 16, 2]. For arthroscopy, we are aware of just one method that produces automatic semantic segmentation of knee structures [8]. These semantic segmentation approaches tend to be prone to overfitting, depending on the data set available for training. As a consequence, there is increasing interest in regularisation methods, such as those based on multi-task learning (MTL) [17]. For instance, fusing semantic segmentation and depth estimation has been shown to be an effective approach [4], but it requires manual annotation for training both the segmentation and the depth tasks. Considering that obtaining depth ground truth for knee arthroscopy is challenging, self-supervised techniques such as [5, 6] are highly favourable because they do not require ground truth depth. A similar approach has been successfully explored in robotic surgery [21], but not for knee arthroscopy. Moreover, self-supervised depth estimation techniques have recently been combined with semantic segmentation for training regularisation in non-medical imaging approaches [3, 15]. Nevertheless, these approaches rely on data sets that contain stereo images captured from street or indoor scenes, where visual objects are far from the camera and richly textured, and where images have few recording issues such as overexposure and focus problems. In contrast, knee arthroscopy images generally suffer from under- or overexposure and focus problems, and the visual objects are very close to the camera and poorly textured, as shown in Fig. 1.
In this paper, we present an MTL approach for jointly estimating semantic segmentation and depth, where our aim is to use self-supervised depth estimation from stereo images to regularise the training of semantic segmentation for knee arthroscopy. Contrary to [15], which uses outdoor scenes, we tackle the segmentation of challenging arthroscopy images (Fig. 1). To this end, we pre-train our model on images of routine objects that do not show any of the issues displayed in Fig. 1. Then, we fine-tune our model with an MTL loss formed by the fully supervised semantic segmentation loss and the self-supervised depth estimation loss, as shown in Fig. 2. Using a data set containing 3868 arthroscopic images (with semantic segmentation annotations), 2000 stereo pairs captured during five cadaveric experiments, and 2150 stereo image pairs of routine objects, we demonstrate that our method achieves higher accuracy in semantic segmentation (for the visual classes Femur, Meniscus, Tibia, and ACL) than state-of-the-art pure semantic segmentation methods.
2 Proposed Method
2.1 Data Sets
We use three data sets: 1) the pre-training depth estimation data set, which consists of stereo pairs of left and right images, indexed by the out-of-the-knee scene and by the frame within that scene; 2) the fine-tuning depth estimation data set and 3) the fine-tuning semantic segmentation data set, both indexed by human knee and by the frame within that knee. In these data sets, colour images are defined over the image lattice, and each image in the semantic segmentation data set has a corresponding semantic annotation defined over the same lattice.
2.2 Data Set Acquisition
Table 1: Annotation details for the semantic segmentation data set (columns: Femur, ACL, Tibia, Meniscus, Number of images).
The arthroscopy images were acquired with a monocular Stryker endoscope (4.0 mm diameter) and a custom-built stereo arthroscope that uses two muC103A cameras and a white LED for illumination (see Fig. 3). The Stryker endoscope has a FoV of 30 degrees, and the custom-built camera has a FoV of 87.5 degrees. Stryker images were cropped and then down-sampled. Two clinicians performed the semantic segmentation annotations for the classes femur, ACL, tibia and meniscus on 3868 images taken from four cadavers (for one of the cadavers we used images from both knees, giving five knees in total); see the annotation details in Tab. 1.
We also collected 2000 stereo pairs captured during these five cadaveric experiments. The data set of routine objects contains 2050 stereo image pairs used for pre-training the depth estimator and 100 stereo image pairs used to validate the depth output (see an example of this type of image in Fig. 2). To fine-tune the depth estimation method, we sample one frame every two seconds from the original arthroscopy stereo camera video.
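As an illustration of the frame-sampling step, the sketch below computes which frame indices would be kept when taking one frame every two seconds. The function name and the frame rate (`fps`) are assumptions made for the example; the frame rate of the stereo camera is not stated here.

```python
def sample_frame_indices(num_frames, fps, interval_s=2.0):
    """Return the indices of frames taken every `interval_s` seconds.

    `fps` is a hypothetical frame rate; it is not given in the text.
    """
    if num_frames < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("num_frames must be >= 0; fps and interval_s must be > 0")
    # Number of frames between two consecutive samples (at least one).
    step = max(1, round(fps * interval_s))
    return list(range(0, num_frames, step))


# A 10-second clip at an assumed 30 frames per second.
indices = sample_frame_indices(num_frames=300, fps=30)
print(indices)
```

Sampling sparsely in time reduces the number of near-duplicate stereo pairs that consecutive video frames would otherwise contribute.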
Note that there is no disparity ground truth available for any of the data sets above, so we cannot estimate the performance of the depth estimator.
2.3 Model for Semantic Segmentation and Self-supervised Depth Estimation
The goal of our proposed network is to simultaneously estimate semantic segmentation and depth from a single image. Motivated by [8], the model backbone is the U-net++ [23], which predicts semantic segmentation and depth at four different levels, with features shared between the two tasks, as shown in Fig. 2.
In the model depicted in Fig. 4, each module consists of blocks of convolutional layers, each with its own input and weights. One index denotes the down-sampling layer, and a second index denotes the convolution layer of the dense block along the same skip connection (horizontally in the model). Within these modules (Eq. 1), an up-sampling layer (using bilinear interpolation) and a concatenation layer combine the incoming features with the disparity map, which is defined only for some modules (otherwise it is empty), as described in Eq. 3. The input image enters the model at the first encoder module. Each encoder convolution module (white nodes in Fig. 2) consists of a 3×3 filter followed by max pooling, and each decoder convolution module (green nodes in Fig. 2) comprises bilinear up-sampling with scale factor 2, followed by two layers of 3×3 filters.
The semantic segmentation is obtained (Eq. 2) from a convolutional layer, with its own parameters, that outputs the estimated semantic segmentation for each convolutional layer of the dense block. In particular, this layer is formed by a 1×1 convolution filter followed by pixel-wise softmax activation. The left and right disparity maps are obtained (Eq. 3) from a convolutional layer, with its own parameters, that outputs the estimated left and right disparity maps at the resolution of the corresponding down-sampling layer, defined over the image lattice at that layer. The disparity nodes consist of a 3×3 convolution filter with sigmoid activation.
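The connectivity and output activations described above can be sketched as follows. This is a minimal, dependency-free illustration, not the implementation used in this work: `unetpp_inputs` lists the inputs of a node in the standard U-Net++ grid [23] and leaves out the additional disparity input, whose exact form is not reproduced here. The function names, the tuple tags, and `depth=5` (one encoder column plus four decoder levels) are assumptions.

```python
import math


def unetpp_inputs(i, j, depth=5):
    """List the sources feeding node (i, j) of a standard U-Net++ grid.

    i: down-sampling level (0 = full resolution).
    j: position along the dense skip pathway (0 = encoder node).
    """
    if not (0 <= i < depth and 0 <= j < depth - i):
        raise ValueError("node lies outside the U-Net++ grid")
    if j == 0:
        # Encoder node: fed by the image or by the encoder node one level up.
        return [("image",)] if i == 0 else [("down", i - 1, 0)]
    # Dense skip connections from every earlier node on the same level ...
    sources = [("skip", i, k) for k in range(j)]
    # ... plus the up-sampled output of the node one level below.
    sources.append(("up", i + 1, j - 1))
    return sources


def pixel_softmax(scores):
    """Softmax over the class scores of one pixel (segmentation output)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    """Sigmoid activation (disparity output), written to avoid overflow."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


print(unetpp_inputs(0, 2))              # inputs of a top-level decoder node
print(pixel_softmax([2.0, 0.1, -1.0]))  # class probabilities for one pixel
print(sigmoid(0.5))                     # one disparity value in (0, 1)
```

In a real network the softmax would follow a 1×1 convolution and the sigmoid a 3×3 convolution, applied at every pixel of the corresponding feature map.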
Training for supervised semantic segmentation, for a particular image and its annotation, uses the average of the semantic segmentation results from the intermediate layers in (2) and minimises a loss function (Eq. 4) that combines two terms: the pixel-wise cross-entropy loss, computed between the annotation and the average of the estimated semantic segmentations, and the Dice loss [12], with a fixed hyper-parameter. Inference for semantic segmentation is based solely on the segmentation result from the last layer in (2).
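A minimal sketch of such a combined loss is given below, assuming the predictions of the intermediate layers are averaged before the loss is computed. The additive form, the parameter `dice_weight` and its default value, and the particular soft-Dice variant are assumptions for illustration; they are not taken from Eq. 4.

```python
import math


def average_predictions(level_probs):
    """Average per-pixel class probabilities over the intermediate outputs.

    level_probs[l][p][c] is the probability of class c at pixel p from level l.
    """
    n_levels = len(level_probs)
    n_pixels = len(level_probs[0])
    n_classes = len(level_probs[0][0])
    return [
        [sum(level_probs[l][p][c] for l in range(n_levels)) / n_levels
         for c in range(n_classes)]
        for p in range(n_pixels)
    ]


def cross_entropy(probs, labels, eps=1e-7):
    """Mean pixel-wise cross entropy against integer class labels."""
    total = sum(math.log(max(probs[p][y], eps)) for p, y in enumerate(labels))
    return -total / len(labels)


def dice_loss(probs, labels, eps=1e-7):
    """One minus the mean soft Dice over classes (a common variant)."""
    n_classes = len(probs[0])
    dice = 0.0
    for c in range(n_classes):
        inter = sum(probs[p][c] for p, y in enumerate(labels) if y == c)
        pred_sum = sum(pixel[c] for pixel in probs)
        true_sum = sum(1 for y in labels if y == c)
        dice += (2.0 * inter + eps) / (pred_sum + true_sum + eps)
    return 1.0 - dice / n_classes


def segmentation_loss(level_probs, labels, dice_weight=1.0):
    """Cross entropy plus weighted Dice loss on the averaged prediction.

    dice_weight is a hypothetical weight, not the value used in the paper.
    """
    avg = average_predictions(level_probs)
    return cross_entropy(avg, labels) + dice_weight * dice_loss(avg, labels)


# Two pixels, two classes, predictions from two intermediate levels.
levels = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
print(segmentation_loss(levels, labels=[0, 1]))
```

Cross entropy scores every pixel independently, whereas the Dice term measures region overlap per class, which makes it less sensitive to classes that cover few pixels.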
The self-supervised depth estimation training [6] uses rectified stereo image pairs to predict the disparity maps that match the left-to-right and right-to-left images. The loss to be minimised (Eq. 5) combines three terms, each computed for both the left and the right image at every resolution. The first term is an appearance matching loss based on the structural similarity index [20], normalised by the size of the image lattice at each resolution; it compares the left image with its reconstruction, obtained by re-sampling the right image using the left disparity map, and the term for the right image is defined similarly. The second term minimises the norm of the difference between the left disparity map and the transformed right-to-left disparity map, and similarly for the right disparity map. The third term penalises large disparity changes in smooth regions of the image, while allowing large transitions in the disparity maps where there are large image changes; it is again defined similarly for the right image. Inference for depth estimation relies on the result at the finest scale.
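For reference, the three loss terms described above take the following form in the left-right consistency formulation of Godard et al. [6], which this training procedure follows. The notation is an assumption introduced here, not the notation of Eq. 5: $I^{l}$ is the left image, $\tilde{I}^{l}$ its reconstruction from the right image, $d^{l}$ and $d^{r}$ are the left and right disparity maps, $N$ is the number of pixels at the current resolution, and $\alpha$, $\alpha_{ap}$, $\alpha_{lr}$, $\alpha_{ds}$ are weights whose values are not stated here.

```latex
\begin{align}
\ell_{ap}^{l} &= \frac{1}{N}\sum_{i,j}
  \alpha\,\frac{1-\mathrm{SSIM}\bigl(I^{l}_{ij},\tilde{I}^{l}_{ij}\bigr)}{2}
  + (1-\alpha)\,\bigl\lVert I^{l}_{ij}-\tilde{I}^{l}_{ij}\bigr\rVert_{1}, \\
\ell_{lr}^{l} &= \frac{1}{N}\sum_{i,j}
  \bigl| d^{l}_{ij} - d^{r}_{i,\,j+d^{l}_{ij}} \bigr|, \\
\ell_{ds}^{l} &= \frac{1}{N}\sum_{i,j}
  \bigl|\partial_x d^{l}_{ij}\bigr|\,e^{-\lVert \partial_x I^{l}_{ij}\rVert}
  + \bigl|\partial_y d^{l}_{ij}\bigr|\,e^{-\lVert \partial_y I^{l}_{ij}\rVert}, \\
\ell_{depth} &= \alpha_{ap}\bigl(\ell_{ap}^{l}+\ell_{ap}^{r}\bigr)
  + \alpha_{lr}\bigl(\ell_{lr}^{l}+\ell_{lr}^{r}\bigr)
  + \alpha_{ds}\bigl(\ell_{ds}^{l}+\ell_{ds}^{r}\bigr).
\end{align}
```

The right-image terms are obtained by swapping the roles of the left and right images, and in [6] the per-resolution losses are summed over all output scales.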
Model pre-training is done with the pre-training data set by minimising the depth estimation loss (5), where we learn the model parameters in (1) and the disparity module parameters in (3). After pre-training, we add the segmentation layers in (2) and perform end-to-end training of all model parameters on the fine-tuning data sets by minimising the sum of the losses in (4) and (5).
3 Experiments and Results
We implement our model in PyTorch [13]. The encoder of the model is a ResNet50 [7]. Pre-training takes 200 epochs with batch size 32; the initial learning rate is halved at 80 and 120 epochs, and we use the Adam [9] optimizer. Data augmentation includes random horizontal and vertical flipping, and random gamma in [0.8, 1.2], brightness in [0.5, 2.0], and colour shifts in [0.8, 1.2], each sampled from a uniform distribution. For fine-tuning of segmentation and depth using arthroscopic images, we use the pre-trained encoder and re-initialise the decoder. The training takes 120 epochs with batch size 12. We use polynomial learning rate decay [22] together with weight decay. The data augmentation for segmentation includes horizontal and vertical flipping, random brightness and contrast changes, and non-rigid transformations, including elastic transformation (which was particularly important to avoid over-fitting the training set); the depth data augmentation is the same as in the pre-training stage. At inference time, the network takes 50 ms to process a single test image and output the segmentation mask and depth.
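The two learning rate schedules and the photometric augmentation ranges can be sketched in a few lines. The base learning rate, the polynomial power of 0.9, the flip probability of 0.5, and the per-channel colour shift are assumptions made for this example; only the milestone epochs, epoch counts and sampling ranges come from the text.

```python
import random


def step_halving_lr(epoch, base_lr, milestones=(80, 120)):
    """Pre-training schedule: halve the learning rate at each milestone."""
    return base_lr * 0.5 ** sum(epoch >= m for m in milestones)


def poly_lr(epoch, base_lr, max_epochs=120, power=0.9):
    """Fine-tuning schedule: polynomial decay towards zero.

    power=0.9 is a commonly used value, assumed here.
    """
    return base_lr * (1.0 - epoch / max_epochs) ** power


def sample_photometric_params(rng=random):
    """Draw one set of augmentation parameters from uniform distributions."""
    return {
        "gamma": rng.uniform(0.8, 1.2),
        "brightness": rng.uniform(0.5, 2.0),
        # One multiplicative shift per colour channel (assumed).
        "colour_shift": [rng.uniform(0.8, 1.2) for _ in range(3)],
        # Flip probabilities of 0.5 are assumed.
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
    }


base_lr = 1e-4  # hypothetical value; the learning rate is not given here
for epoch in (0, 80, 120, 199):
    print(epoch, step_halving_lr(epoch, base_lr))
for epoch in (0, 60, 119):
    print(epoch, poly_lr(epoch, base_lr))
print(sample_photometric_params(random.Random(0)))
```

Applying the same sampled photometric parameters to both images of a stereo pair keeps the pair photometrically consistent, which the appearance matching loss relies on.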
We assess the performance of our method using the Dice coefficient computed on the testing set in a leave-one-out cross-validation experiment (i.e., we train with four knees and test with the remaining one from Tab. 1). In Fig. 5 we show the mean and standard deviation of the Dice results over each test set and each anatomy, and the final average over all sets and anatomies. We compare our proposed method (labelled as Ours) against the pure semantic segmentation model Unet++ [23, 8], our method without the pre-training stage (labelled as Ours w/o pretrain), and the joint semantic segmentation and depth estimation method designed for computer vision applications by Ramirez et al. [15]. The results indicate that the mean Dice of our method is significantly better than that of Unet++ (Wilcoxon signed-rank test, p-value < 0.05), indicating that the use of depth indeed improves the segmentation result of a pure segmentation method [23, 8]. In fact, our method produces significant gains in the segmentation of the ACL (arguably the most challenging anatomy in the experiment). Our method with pre-training is better than the one without pre-training, but not significantly so (p-value > 0.05). Interestingly, even though our method without pre-training is better than the pure segmentation approach, it still cannot produce accurate segmentation for the ACL. Finally, our method is slightly better than the method by Ramirez et al. [15], indicating that both methods are competitive. Figure 6 shows a few segmentation and depth estimation results. Note that we cannot validate the depth estimation because no ground truth is available for it.
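The evaluation protocol can be illustrated with a short sketch of the Dice coefficient and the leave-one-out splits over knees. The label encoding and the knee identifiers below are hypothetical, and label maps are flattened to lists for simplicity.

```python
def dice_coefficient(pred, target, cls):
    """Dice = 2|P ∩ T| / (|P| + |T|) for one class over flat label lists."""
    p = [x == cls for x in pred]
    t = [x == cls for x in target]
    intersection = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Undefined when the class is absent from both prediction and target.
    return 2.0 * intersection / total if total else float("nan")


def leave_one_out_splits(knee_ids):
    """Yield (train_ids, test_id) pairs; each knee is held out once."""
    for held_out in knee_ids:
        yield [k for k in knee_ids if k != held_out], held_out


# Hypothetical label encoding: 0 = background, 1 = femur, 2 = ACL.
pred = [1, 1, 0, 2]
target = [1, 0, 0, 2]
print(dice_coefficient(pred, target, cls=1))

# Hypothetical identifiers for the five knees.
for train_ids, test_id in leave_one_out_splits(["k1", "k2", "k3", "k4", "k5"]):
    print(test_id, train_ids)
```

Holding out all images of a knee at once keeps frames of the same anatomy from appearing in both the training and the testing set.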
4 Conclusion
In this paper, we proposed a method to improve semantic segmentation using self-supervised depth estimation for arthroscopic images. Our network architecture is end-to-end trainable and does not require depth annotation. We also showed that using arthroscopic images of routine objects to pre-train the model can mitigate the challenging image conditions presented by this problem. By using geometry information, the model provides a slight improvement in semantic segmentation accuracy. Future work will focus on improving the segmentation accuracy for the non-femoral anatomies.
We acknowledge several technical discussions with Ravi Garg and Adrian Johnston that influenced this paper. This work was supported by the Australia India Strategic Research Fund (Project AISRF53820) and in part by the Australian Research Council under Grant DP180103232. The cadaver studies are covered by the Queensland University of Technology Ethics Approval under project 1400000856.
References
[1] (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
[2] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
[3] Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2624–2632.
[4] (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658.
[5] (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European Conference on Computer Vision, pp. 740–756.
[6] (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279.
[7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[8] (2020) Automatic segmentation of multiple structures in knee arthroscopy using deep learning. IEEE Access 8, pp. 51853–51861.
[9] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[10] (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925–1934.
[11] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[12] V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
[13] (2017) Automatic differentiation in pytorch.
[14] (2015) Evidence-based surgical training in orthopaedics: how many arthroscopies of the knee are needed to achieve consultant level performance?. The Bone & Joint Journal 97 (10), pp. 1309–1315.
[15] (2018) Geometry meets semantics for semi-supervised monocular depth estimation. In Asian Conference on Computer Vision, pp. 298–313.
[16] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[17] (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
[18] (2017) Arthroscopic surgery for degenerative knee arthritis and meniscal tears: a clinical practice guideline. BMJ 357, pp. j1982.
[19] (2012) Advanced stereoscopic projection technology significantly improves novice performance of minimally invasive surgical skills. Surgical Endoscopy 26 (6), pp. 1522–1527.
[20] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[21] (2017) Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260.
[22] (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
[23] (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11.