Unsupervised Surgical Instrument Segmentation via Anchor Generation and Semantic Diffusion

08/27/2020 ∙ by Daochang Liu, et al.

Surgical instrument segmentation is a key component in developing context-aware operating rooms. Existing works on this task rely heavily on supervision from a large amount of labeled data, which involves laborious and expensive human effort. In contrast, this paper develops a more affordable unsupervised approach. To train our model, we first generate anchors as pseudo labels for instruments and background tissues respectively by fusing coarse handcrafted cues. Then a semantic diffusion loss is proposed to resolve the ambiguity in the generated anchors via the feature correlation between adjacent video frames. In experiments on the binary instrument segmentation task of the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset, the proposed method achieves 0.71 IoU and 0.81 Dice score without using a single manual annotation, which shows the potential of unsupervised learning for surgical tool segmentation.




Code Repositories


Code for 'Unsupervised Surgical Instrument Segmentation via Anchor Generation and Semantic Diffusion' (MICCAI 2020)


1 Introduction

Instrument segmentation in minimally invasive surgery is fundamental for various advanced computer-aided intervention techniques such as automatic surgical skill assessment and intra-operative guidance systems [2]. Given its importance, surgical instrument segmentation has witnessed remarkable progress from early traditional methods [25, 4, 20] to recent approaches using deep learning [7, 12, 23, 13, 9, 15, 18, 10, 11, 14, 27]. However, such success is largely built upon supervised learning from a large amount of annotated data, which are very expensive and time-consuming to collect in the medical field, especially for the segmentation task on video data. Besides, the generalization ability of supervised methods is almost inevitably hindered by the domain gaps in real-world scenarios across different hospitals and procedure types.

In the literature, several attempts have been made to handle the lack of manual annotations [22, 26, 5, 16, 11]. Image-level annotations of tool presence were utilized in [16, 26] to train neural networks in a weakly-supervised manner. Jin et al. [11] propagated the ground truth across neighboring video frames using motion flow for semi-supervised learning, while Ross et al. [22] reduced the number of necessary labeled images by employing re-colorization as a pre-training task. Recently, a self-supervised approach was introduced to generate labels using the kinematic signal in robot-assisted surgery [5]. Compared to prior works, this study goes a step further and essentially eliminates the demand for manual annotations or external signals by proposing an unsupervised method for the binary segmentation of surgical tools. Unsupervised learning has been successfully investigated in other surgical domains such as surgical workflow analysis [3] and surgical motion prediction [6], which suggests that it is also feasible for instrument segmentation.

Figure 1: Framework overview

Our method, which includes anchor generation and semantic diffusion, learns from general prior knowledge about surgical tools. For anchor generation, we present a new perspective on training neural networks in the absence of labeled data, i.e., generating reliable pseudo labels from simple cues. A diverse collection of cues, including color, objectness, and location cues, is fused into positive anchors and negative anchors, which correspond to pixels highly likely to be instruments and backgrounds respectively. Although individual cues are coarse and biased, our fusion process leverages the complementary information in these cues and thereby suppresses the noise. The segmentation model is then trained on these anchors. However, since the anchors only cover a small portion of image pixels, a semantic diffusion loss is designed to propagate supervisory signals from anchor pixels to the remaining pixels, for which it is ambiguous whether they belong to instruments. The core idea of this loss is to exploit the temporal correlation in surgery videos. Specifically, adjacent video frames should share similar semantic representations in both instrument and background regions.

In the experiments on the EndoVis 2017 dataset [2], the proposed method achieves encouraging results (0.71 IoU and 0.81 Dice) on the binary segmentation task without using a single manual annotation, indicating its potential to reduce cost in clinical applications. Our method can also be easily extended to the semi-supervised setting, where it obtains performance comparable to the state-of-the-art. In addition, experiments on the ISIC 2016 dataset [8] demonstrate that the proposed model is flexible enough to be applied in other domains such as skin lesion segmentation. In summary, our contributions are three-fold: 1) an unsupervised approach for binary segmentation of surgical tools; 2) a training strategy that generates anchor supervision from coarse cues; 3) a semantic diffusion loss that exploits the inter-frame semantic similarity.

2 Method

As illustrated in Fig. 1, our unsupervised framework (by unsupervised, we mean that no manual annotation of surgical instruments is used) consists of two parts: 1) generating anchors to provide initial training supervision, and 2) augmenting the supervision by a semantic diffusion loss. Our framework is elaborated as follows.

2.1 Anchor Generation

In conventional supervised methods, human knowledge is passed to the segmentation model through annotating large-scale databases. To be free of annotation, we instead encode the knowledge about surgical instruments into hand-designed cues and generate pseudo labels for training. The selection of such cues should adhere to two principles, i.e., simplicity and diversity. Simplicity prevents the intensive effort of data annotation from merely being transferred to the design of cues, while diversity enriches the information that we can draw from different cues. Based on these principles, three cues are computed: color, objectness and location. Given a video frame $x_t \in \mathbb{R}^{H \times W \times 3}$ at time $t$ with height $H$ and width $W$, probability maps $c_{color}(x_t)$, $c_{obj}(x_t)$, $c_{loc}(x_t) \in [0, 1]^{H \times W}$ are extracted according to the three cues respectively, which are fused as pseudo labels later.

Color. Color is an obvious visual characteristic to distinguish instruments from the surrounding backgrounds. Surgical instruments are mostly of grayish and plain colors, while the background tissues tend to be reddish and highly saturated. Therefore, we multiply the inverted A channel in the LAB space, i.e., one minus the channel, and the inverted S channel in the HSV space to yield the probability map $c_{color}(x_t)$.
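As an illustration of the color cue, the following is a minimal per-pixel sketch in plain Python. It is not the authors' implementation. The sRGB-to-LAB conversion uses the standard D65 formulas, and the rescaling of the A channel to [0, 1] via (a + 128) / 255 is an assumption made here, since the text does not state how the channel is normalized. The function names are hypothetical.

```python
import colorsys


def _srgb_to_linear(c):
    # Undo the sRGB gamma curve for one channel value in [0, 1].
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    # Piecewise cube-root function from the CIE LAB definition.
    d = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > d ** 3 else t / (3.0 * d * d) + 4.0 / 29.0


def lab_a_normalized(r, g, b):
    """A channel of CIE LAB (D65 white point), rescaled to [0, 1].

    The rescaling (a + 128) / 255 is an assumption of this sketch.
    """
    rl, gl, bl = (_srgb_to_linear(v) for v in (r, g, b))
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    a = 500.0 * (_lab_f(x / 0.95047) - _lab_f(y))
    return min(max((a + 128.0) / 255.0, 0.0), 1.0)


def color_cue(r, g, b):
    """Inverted A channel times inverted S channel for one RGB pixel in [0, 1]."""
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return (1.0 - lab_a_normalized(r, g, b)) * (1.0 - s)


gray_tool = color_cue(0.6, 0.6, 0.6)    # grayish, unsaturated pixel
red_tissue = color_cue(0.8, 0.1, 0.1)   # reddish, saturated pixel
```

Under these assumptions the grayish pixel receives a clearly higher probability than the reddish one, which matches the intuition described above.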

Objectness. Another cue can be derived from the extent to which an image region is object-like. Surgical instruments often have well-defined boundaries, while the background elements are scattered and fail to form a concrete shape. In detail, the objectness map $c_{obj}(x_t)$ is retrieved using a class-agnostic object detector [1]. Although this detector originally targets everyday scenes, we find that it also gives rich information in abdominal views.

Location. The third cue is based on the pixel location of instruments on the screen. Instead of a fixed location prior, an adaptive and video-specific location probability map is obtained by averaging the color maps across the whole video: $c_{loc}(x_t) = \frac{1}{T} \sum_{t=1}^{T} c_{color}(x_t)$, where $T$ is the video length. The location map roughly highlights the image areas where instruments frequently appear in this video.
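The averaging step can be sketched as follows, assuming each per-frame color map is available as an H x W nested list of probabilities. The function name and the toy data are hypothetical.

```python
def location_cue(color_maps):
    """Average per-frame color maps (each an H x W nested list) over a video."""
    if not color_maps:
        raise ValueError("need at least one frame")
    num_frames = len(color_maps)
    height, width = len(color_maps[0]), len(color_maps[0][0])
    return [
        [sum(frame[i][j] for frame in color_maps) / num_frames for j in range(width)]
        for i in range(height)
    ]


# Toy video with two 2 x 2 color maps.
video = [
    [[0.9, 0.1], [0.2, 0.0]],
    [[0.7, 0.3], [0.2, 0.0]],
]
loc = location_cue(video)
```

Because the same map is shared by every frame of a video, it only needs to be computed once per video.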

Anchor Generation. As shown in Fig. 1, the resultant cue maps are very coarse and noisy. Therefore, anchors are generated from these cues to suppress the noise. Concretely, the positive anchor $a_{pos}(x_t) \in [0, 1]^{H \times W}$ (values lie between 0 and 1, both inclusive) is defined as the element-wise product of all cues: $a_{pos}(x_t) = c_{color}(x_t) \odot c_{obj}(x_t) \odot c_{loc}(x_t)$, which captures the confident instrument regions that satisfy all the cues. Similarly, the negative anchor $a_{neg}(x_t) \in [0, 1]^{H \times W}$ is defined as the element-wise product of all inverted cues: $a_{neg}(x_t) = (1 - c_{color}(x_t)) \odot (1 - c_{obj}(x_t)) \odot (1 - c_{loc}(x_t))$, which captures the confident background regions that satisfy none of the cues. As shown in Fig. 1, false responses are considerably reduced in the generated anchors.
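The fusion described above amounts to two element-wise products. A minimal sketch, assuming the three cue maps are H x W nested lists with values in [0, 1] (the function name is hypothetical):

```python
def generate_anchors(c_color, c_obj, c_loc):
    """Fuse three cue maps into a positive and a negative anchor map."""
    a_pos, a_neg = [], []
    for row_c, row_o, row_l in zip(c_color, c_obj, c_loc):
        # Positive anchor: product of the cues (high only if all cues agree).
        a_pos.append([c * o * l for c, o, l in zip(row_c, row_o, row_l)])
        # Negative anchor: product of the inverted cues (high only if no cue fires).
        a_neg.append([(1 - c) * (1 - o) * (1 - l)
                      for c, o, l in zip(row_c, row_o, row_l)])
    return a_pos, a_neg


# One row with three pixels: confident tool, conflicting cues, confident background.
a_pos, a_neg = generate_anchors([[0.9, 0.9, 0.1]], [[0.9, 0.1, 0.1]], [[0.9, 0.5, 0.1]])
```

For the middle pixel, where the cues disagree, both anchors are close to zero, so it belongs to neither anchor and is left ambiguous.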

Anchor Loss. The anchors are then regarded as pseudo labels to train the segmentation network, a vanilla U-Net [21] in this paper. We propose an anchor loss to encourage network activation on the positive anchor and inhibit activation on the negative anchor:


where $p(x_t) \in [0, 1]^{H \times W}$ denotes the prediction map from the network and $i$ is the pixel index. The loss is computed for each pixel and averaged over the whole image. Compared to the standard binary cross-entropy, this anchor loss only imposes supervision on the pixels that are confidently instruments or backgrounds, which keeps the network from being disrupted by the noisy cues. However, the anchors only amount to a minority of image pixels. On the remaining ambiguous pixels outside the anchors, the network is not supervised and its behavior is undefined. This problem is tackled by the following semantic diffusion loss.
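The display equation of the anchor loss is not preserved in this copy of the text. The sketch below shows one form that is consistent with the description: a cross-entropy-style term in which the positive anchor weights the instrument term and the negative anchor weights the background term, averaged over all pixels. The logarithmic form and the `eps` clamp are assumptions of this sketch, not a statement of the original formula.

```python
import math


def anchor_loss(pred, a_pos, a_neg, eps=1e-7):
    """Anchor-weighted cross-entropy averaged over all pixels (assumed form)."""
    total, count = 0.0, 0
    for row_p, row_pos, row_neg in zip(pred, a_pos, a_neg):
        for p, w_pos, w_neg in zip(row_p, row_pos, row_neg):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            # Encourage activation where a_pos is high, inhibit it where a_neg is high.
            total += -w_pos * math.log(p) - w_neg * math.log(1.0 - p)
            count += 1
    return total / count


unsupervised_pixel = anchor_loss([[0.3]], [[0.0]], [[0.0]])
agrees_with_anchor = anchor_loss([[0.9]], [[1.0]], [[0.0]])
contradicts_anchor = anchor_loss([[0.1]], [[1.0]], [[0.0]])
```

A pixel that lies in neither anchor contributes nothing to the loss, which is the behavior that motivates the semantic diffusion loss.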

2.2 Semantic Diffusion

Apart from the cues mentioned above, temporal coherence is another natural source of knowledge for unsupervised learning in sequential data. We argue that the instruments in adjacent video frames usually share similar semantics, termed inter-frame instrument-instrument similarity. This temporal similarity is assumed to be stronger than the semantic similarity between the instrument and the background within a single frame, i.e., the intra-frame instrument-background similarity. To this end, the semantic feature maps from a pre-trained convolutional neural network (CNN) are first aggregated within the instrument and background regions respectively using the prediction map:


where $F(x_t) \in \mathbb{R}^{H \times W \times C}$ represents the CNN feature maps of frame $x_t$, $F(x_t)_i$ denotes the features at pixel $i$, $C$ is the channel number, and $f_{pos}(x_t)$ and $f_{neg}(x_t)$ are the aggregated features for the instrument and the background correspondingly. Then, given two adjacent frames $x_a$ and $x_b$, a semantic diffusion loss in a quadruplet form is proposed to constrain the inter-frame instrument-instrument similarity to be higher than the intra-frame instrument-background similarities by a margin:



Here $\phi(\cdot, \cdot)$ denotes the cosine similarity between two features and $\delta_{pos}$ is a hyperparameter controlling the margin. Likewise, another semantic diffusion loss can be formulated to enforce the inter-frame background-background similarity:


Lastly, the anchor loss and the semantic diffusion loss are optimized collectively:


Driven by the semantic diffusion loss, the initial signals on the confident anchor pixels are propagated to the remaining ambiguous pixels. Our network benefits from such augmented supervision and outputs accurate and complete segmentation. Note that the semantic diffusion loss is not restricted to adjacent frames and can also be imposed on any image pair exhibiting inter-image similarity.
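The display equations of this subsection are not preserved in this copy, so the following is only a sketch of one quadruplet hinge form that matches the verbal description: features are aggregated with the prediction map as weights, and the inter-frame similarity of the same region type is pushed above both intra-frame instrument-background similarities by a margin. The exact combination of terms, the margin value of 0.2, and all function names are assumptions of this sketch.

```python
import math


def aggregate(features, weights):
    """Weighted average of per-pixel feature vectors.

    features: list of N vectors of length C (one per pixel).
    weights:  list of N weights, e.g. p_i for instruments or 1 - p_i for background.
    """
    n, c = len(features), len(features[0])
    return [sum(w * f[k] for f, w in zip(features, weights)) / n for k in range(c)]


def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def diffusion_loss(f_same_a, f_other_a, f_same_b, f_other_b, margin):
    """Assumed quadruplet hinge: inter-frame similarity vs. two intra-frame ones."""
    inter = cosine(f_same_a, f_same_b)
    intra = cosine(f_same_a, f_other_a) + cosine(f_same_b, f_other_b)
    return max(intra - 2.0 * inter + margin, 0.0)


# Toy frame: two instrument-like pixels and two background-like pixels.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]


def loss_for(pred, margin=0.2):
    f_pos = aggregate(features, pred)
    f_neg = aggregate(features, [1.0 - p for p in pred])
    # Both frames share the toy features here; (pos, neg) gives the instrument
    # version, swapping the roles would give the background version.
    return diffusion_loss(f_pos, f_neg, f_pos, f_neg, margin)


good = loss_for([0.9, 0.9, 0.1, 0.1])  # prediction separates the two regions
bad = loss_for([0.5, 0.5, 0.5, 0.5])   # prediction is uninformative
```

Since cosine similarity is scale-invariant, the normalization used in the aggregation does not affect this loss. An uninformative prediction makes the instrument and background features identical and is penalized by the full margin, while a prediction that separates the regions incurs no loss.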

3 Experiment

Dataset. Our method is evaluated on the dataset of the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge [2] (EndoVis 2017), which consists of 10 abdominal porcine procedures videotaped by the da Vinci Xi systems. Our work focuses on the binary instrument segmentation task, where each frame is separated into instruments and backgrounds. As our method is unsupervised, we do not use any annotations during the training process. Note that the ground truth of the test set is still held out by the challenge organizer.

Setup. Experiments are carried out in two different settings. 1) Train Test (TT): This setting is common for supervised methods, where learning and inference are performed on two different sets of data. In this setting, we follow the previous convention [11] and conduct 4-fold cross-validation on the released 8 training videos of EndoVis 2017, with the same splits as prior works. Our method can attain real-time online inference speed in this setting. 2) Single Stage (SS): This is a specific setting for our unsupervised method. Since learning involves no annotation, we can directly perform learning and inference on the same set of data, i.e., the released training set of EndoVis 2017. In application scenarios, the model needs to be re-trained when new unseen data arrives; therefore, this setting is more suitable for offline batch analysis. Following previous work [11], we use intersection-over-union (IoU) and the Dice coefficient to measure our performance.
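For reference, the two metrics can be computed for a single frame as sketched below, with binary masks given as nested lists of 0/1. Returning 1.0 when both masks are empty is a convention chosen for this sketch, not something specified in the text.

```python
def iou_and_dice(pred_mask, gt_mask):
    """IoU and Dice coefficient for two binary masks (nested lists of 0/1)."""
    inter = union = pred_sum = gt_sum = 0
    for row_p, row_g in zip(pred_mask, gt_mask):
        for p, g in zip(row_p, row_g):
            inter += p & g
            union += p | g
            pred_sum += p
            gt_sum += g
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_sum + gt_sum) if (pred_sum + gt_sum) else 1.0
    return iou, dice


iou, dice = iou_and_dice([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```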

Implementation Details. We extract the semantic feature maps from the layer of the VGG16 [24] pre-trained on ImageNet, which are interpolated to the same size as the prediction map. The VGG16 extractor is frozen when training our U-Net. The margin factors and are set as and . The prediction map is thresholded into the final segmentation mask using the Otsu algorithm [17]. Our implementation uses official pre-trained CNN models and parameters in PyTorch [19]. Code will be released to offer all details.
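To make the thresholding step concrete, here is a minimal sketch of Otsu's method applied to a flattened probability map with values in [0, 1]. The bin count of 256 and the function name are choices made for this sketch; in practice an existing library implementation would normally be used.

```python
def otsu_threshold(values, bins=256):
    """Threshold in [0, 1] that maximizes the between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(bins - 1):
        w0 += hist[t]          # number of values in bins 0..t
        sum0 += t * hist[t]
        w1 = total - w0        # number of values in bins t+1..
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, t
    return (best_bin + 1) / bins  # values at or above this are foreground


probs = [0.10, 0.12, 0.15, 0.80, 0.85, 0.90]
thr = otsu_threshold(probs)
mask = [p >= thr for p in probs]
```

Because the threshold is chosen per map, the same procedure adapts to maps with very different intensity levels.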

3.1 Results on EndoVis 2017

Results on EndoVis 2017 are reported in Table 1. We first assess the network trained with only the anchor loss based on our cues, which gives the baseline performance. After adding the semantic diffusion losses, especially the background semantic diffusion loss, the performance improves markedly, from 49.47% to 70.56% IoU. This result supports our assumption that adjacent video frames are similar to each other in both foreground and background regions. Since the background area is relatively more similar between video frames, Table 1 shows that the background semantic diffusion loss brings more improvement than the instrument semantic diffusion loss.

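| Training loss | IoU (%) | Dice (%) |
| --- | --- | --- |
| Anchor loss only | 49.47 | 64.21 |
| Anchor loss + instrument semantic diffusion loss | 50.78 | 65.16 |
| Anchor loss + background semantic diffusion loss | 67.26 | 78.94 |
| Anchor loss + both semantic diffusion losses | 70.56 | 81.15 |

Table 1: Results of the binary segmentation task from EndoVis 2017. Experimental results in the setting SS are reported.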

3.2 The Choice of Cues

Different combinations of cues are examined to study their effects on the network performance. Here we run the Otsu thresholding algorithm [17] not only on the network prediction map but also on the corresponding positive anchor and the inverted negative anchor to generate segmentation masks. The Otsu algorithm adapts to the disparate intensity levels of the probability maps. The resultant masks are then evaluated against the ground truth. As shown in Table 2, the best network performance comes from the combination of all three cues, because more kinds of cues provide extra information from different aspects. Meanwhile, a single cue may produce good results on the anchors but may not help the final network prediction, because a single cue can contain a lot of noise that needs to be filtered out by fusion with other cues. Also, different kinds of cues have varying effects on the network performance. For example, the two-cue combinations in Table 2 differ widely in network performance (43.74% to 63.27% IoU), which indicates that some cues matter more than others.

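| Cues used | IoU (%), positive anchor | IoU (%), inverted negative anchor | IoU (%), network prediction | Dice (%), positive anchor | Dice (%), inverted negative anchor | Dice (%), network prediction |
| --- | --- | --- | --- | --- | --- | --- |
| One cue | 55.60 | 55.60 | 45.27 | 69.21 | 69.21 | 60.09 |
| One cue | 16.01 | 16.01 | 14.23 | 26.57 | 26.57 | 23.57 |
| One cue | 16.90 | 16.90 | 21.32 | 28.11 | 28.11 | 33.97 |
| Two cues | 20.28 | 18.93 | 47.39 | 32.48 | 30.99 | 62.51 |
| Two cues | 41.44 | 19.21 | 43.74 | 57.00 | 31.46 | 59.30 |
| Two cues | 38.69 | 22.09 | 63.27 | 53.70 | 35.19 | 75.56 |
| All three cues | 38.64 | 18.53 | 70.56 | 53.73 | 30.56 | 81.15 |

Table 2: The choice of cues (Setting SS)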

3.3 Compared to Supervised Methods

At present, unsupervised instrument segmentation is still little explored, and there are few methods with which ours can be directly compared. Therefore, to provide an indirect reference, our method is adapted to the semi-supervised and fully-supervised settings and compared with previous supervised methods in Table 3. When fully-supervised, we substitute the anchors with the ground truth on all the frames. Since our contribution is not in the network architecture and we do not use special modules such as attention beyond the U-Net, our fully-supervised performance is close to that of some earlier works and can be thought of as an upper bound of our unsupervised solution. When semi-supervised, the anchors are replaced with the ground truth on 50% of the frames in the same periodic way as in [11]. Our method is competitive with the state-of-the-art in the semi-supervised setting. Lastly, the last two rows show that, without using any annotation, our method reaches 67.85% IoU in the setting TT and 70.56% IoU in the setting SS. More data is exploited for learning in the setting SS than in the setting TT, which explains why the setting SS has better results.

| Supervision | Method | Setting | IoU (%) | Dice (%) |
| --- | --- | --- | --- | --- |
| 100% | U-Net [21] | TT | 75.44 ± 18.18 | 84.37 ± 14.58 |
| 100% | Ours | TT | 81.55 ± 14.52 | 88.83 ± 11.50 |
| 100% | TernausNet [23] | TT | 83.60 ± 15.83 | 90.01 ± 12.50 |
| 100% | MF-TAPNet [11] | TT | 87.56 ± 16.24 | 93.37 ± 12.93 |
| 50% | Semi-MF-TAPNet [11] | TT | 80.03 ± 16.87 | 88.07 ± 13.15 |
| 50% | Ours | TT | 80.33 ± 14.69 | 87.94 ± 11.53 |
| 0% | Ours | TT | 67.85 ± 15.94 | 79.42 ± 13.59 |
| 0% | Ours | SS | 70.56 ± 16.09 | 81.15 ± 13.79 |

Table 3: Comparison with supervised methods (mean ± std). Results of prior works are quoted from [11]. Not all the existing fully-supervised methods are listed due to limited space. Our network architecture is the vanilla U-Net.

3.4 Qualitative Result

In this section, some visual results from our method are shown. Firstly, as seen in Fig. 2, the three cues are very coarse, e.g., the background can still be found on the color cue maps. By fusing the noisy cues, the anchors become purer, although they remain very sparse. Then, via the semantic diffusion loss, which augments the signals on the anchors, the network recovers the instrument regions missing from the anchors, as shown in the success cases in Fig. 2a and Fig. 2b. Our method also performs well in some difficult cases with complicated scenes and inconsistent lighting. However, there are still some failure cases, such as the special probe (Fig. 2e) that is not regarded as an instrument in the ground truth. The dimmed light and the dark organ (Fig. 2d) can also reduce the reliability of the cues. A video demo is attached in the supplementary material.

Figure 2: Visual results for success and failure cases.

3.5 Extension to Other Domain

An exploratory experiment is conducted on the skin lesion segmentation task of the ISIC 2016 benchmark [8] to inspect whether our model can be transferred to other domains. We follow the official train-test split. Due to the dramatic color variations of lesions, the color cue is excluded. The location cue is set as a fixed 2D Gaussian center prior since ISIC is not a video dataset. In view of the background similarity shared by most images, we sample random image pairs for semantic diffusion. The flexibility of our model is provisionally supported by the results in Table 4. Specific cues for skin lesions can be designed in the future for better results.
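A fixed Gaussian center prior of the kind mentioned above could look like the following sketch. The text does not give the width of the Gaussian, so the `sigma_ratio` parameter (standard deviation as a fraction of the image size) is a hypothetical choice, as is the function name.

```python
import math


def gaussian_center_prior(height, width, sigma_ratio=0.25):
    """H x W map that equals 1 at the image center and decays towards the borders."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = sigma_ratio * height, sigma_ratio * width
    return [
        [math.exp(-0.5 * (((i - cy) / sy) ** 2 + ((j - cx) / sx) ** 2))
         for j in range(width)]
        for i in range(height)
    ]


prior = gaussian_center_prior(5, 5)
```

The resulting map can be used in place of the video-specific location cue when the data consists of independent images.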

| Supervision | Method | Setting | IoU (%) | Dice (%) |
| --- | --- | --- | --- | --- |
| 100% | Challenge Winner [8] | TT | 84.3 | 91.0 |
| 100% | Ours | TT | 83.6 | 90.3 |
| 50% | Ours | TT | 81.1 | 88.6 |
| 0% | Ours | TT | 63.3 | 74.9 |
| 0% | Ours | SS | 64.4 | 75.7 |

Table 4: Results on the skin lesion segmentation task of ISIC 2016

4 Conclusion and Future Work

This work proposes an unsupervised surgical instrument segmentation method via anchor generation and semantic diffusion, whose efficacy and flexibility are validated by empirical results. The current framework is still limited to binary segmentation. In future work, multiple class-specific anchors could be generated for multi-class segmentation, while additional grouping strategies could be incorporated as post-processing to support instance or part segmentation.


This work was partially supported by MOST-2018AAA0102004 and the Natural Science Foundation of China under contracts 61572042, 61527804, 61625201. We also acknowledge the Clinical Medicine Plus X-Young Scholars Project and the High-Performance Computing Platform of Peking University for providing computational resources. We thank Boshuo Wang for making the video demo.


  • [1] B. Alexe, T. Deselaers, and V. Ferrari (2012) Measuring the objectness of image windows. IEEE TPAMI 34 (11), pp. 2189–2202. Cited by: §2.1.
  • [2] M. Allan, A. Shvets, T. Kurmann, Z. Zhang, R. Duggal, Y. Su, N. Rieke, I. Laina, N. Kalavakonda, S. Bodenstedt, et al. (2019) 2017 robotic instrument segmentation challenge. In arXiv:1902.06426, Cited by: §1, §1, §3.
  • [3] S. Bodenstedt, M. Wagner, D. Katić, P. Mietkowski, B. Mayer, H. Kenngott, B. Müller-Stich, R. Dillmann, and S. Speidel (2017) Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv:1702.03684. Cited by: §1.
  • [4] D. Bouget, R. Benenson, M. Omran, L. Riffaud, B. Schiele, and P. Jannin (2015) Detecting surgical tools by modelling local appearance and global shape. IEEE Transactions on Medical Imaging 34 (12), pp. 2603–2617. Cited by: §1.
  • [5] C. da Costa Rocha, N. Padoy, and B. Rosa (2019) Self-supervised surgical tool segmentation using kinematic information. In ICRA, Cited by: §1.
  • [6] R. DiPietro and G. D. Hager (2018) Unsupervised learning for surgical motion by learning to predict the future. In MICCAI, Cited by: §1.
  • [7] L. C. García-Peraza-Herrera, W. Li, L. Fidon, C. Gruijthuijsen, A. Devreker, G. Attilakos, J. Deprest, E. Vander Poorten, D. Stoyanov, T. Vercauteren, et al. (2017) Toolnet: holistically-nested real-time segmentation of robotic surgical tools. In IROS, Cited by: §1.
  • [8] D. Gutman, N. C. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, and A. Halpern (2016) Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). arXiv:1605.01397. Cited by: §1, §3.5, Table 4.
  • [9] S. K. Hasan and C. A. Linte (2019) U-NetPlus: a modified encoder-decoder u-net architecture for semantic and instance segmentation of surgical instruments from laparoscopic images. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Cited by: §1.
  • [10] M. Islam, Y. Li, and H. Ren (2019) Learning where to look while tracking instruments in robot-assisted surgery. In MICCAI, Cited by: §1.
  • [11] Y. Jin, K. Cheng, Q. Dou, and P. Heng (2019) Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In MICCAI, Cited by: §1, §1, §3.3, Table 3, §3.
  • [12] I. Laina, N. Rieke, C. Rupprecht, J. P. Vizcaíno, A. Eslami, F. Tombari, and N. Navab (2017) Concurrent segmentation and localization for tracking of surgical instruments. In MICCAI, Cited by: §1.
  • [13] F. Milletari, N. Rieke, M. Baust, M. Esposito, and N. Navab (2018) CFCM: segmentation via coarse to fine context memory. In MICCAI, Cited by: §1.
  • [14] Z. Ni, G. Bian, G. Wang, X. Zhou, Z. Hou, X. Xie, Z. Li, and Y. Wang (2020) BARNet: bilinear attention network with adaptive receptive field for surgical instrument segmentation. arXiv:2001.07093. Cited by: §1.
  • [15] Z. Ni, G. Bian, X. Xie, Z. Hou, X. Zhou, and Y. Zhou (2019) RASNet: segmentation for tracking surgical instruments in surgical videos using refined attention segmentation network. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Cited by: §1.
  • [16] C. I. Nwoye, D. Mutter, J. Marescaux, and N. Padoy (2019) Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos. IJCARS 14 (6), pp. 1059–1067. Cited by: §1.
  • [17] N. Otsu (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9 (1), pp. 62–66. Cited by: §3.2, §3.
  • [18] D. Pakhomov, V. Premachandran, M. Allan, M. Azizian, and N. Navab (2019) Deep residual learning for instrument segmentation in robotic surgery. In International Workshop on Machine Learning in Medical Imaging, Cited by: §1.
  • [19] A. Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Cited by: §3.
  • [20] N. Rieke, D. J. Tan, C. A. di San Filippo, F. Tombari, M. Alsheakhali, V. Belagiannis, A. Eslami, and N. Navab (2016) Real-time localization of articulated surgical instruments in retinal microsurgery. Medical Image Analysis 34, pp. 82–100. Cited by: §1.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §2.1, Table 3.
  • [22] T. Ross, D. Zimmerer, A. Vemuri, F. Isensee, M. Wiesenfarth, S. Bodenstedt, F. Both, P. Kessler, M. Wagner, B. Müller, et al. (2018) Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. IJCARS 13 (6), pp. 925–933. Cited by: §1.
  • [23] A. A. Shvets, A. Rakhlin, A. A. Kalinin, and V. I. Iglovikov (2018) Automatic instrument segmentation in robot-assisted surgery using deep learning. In IEEE International Conference on Machine Learning and Applications (ICMLA), Cited by: §1, Table 3.
  • [24] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §3.
  • [25] S. Speidel, E. Kuhn, S. Bodenstedt, S. Röhl, H. Kenngott, B. Müller-Stich, and R. Dillmann (2014) Visual tracking of da vinci instruments for laparoscopic surgery. In Medical Imaging 2014: Image-Guided Procedures, Robotic Interventions, and Modeling, Cited by: §1.
  • [26] A. Vardazaryan, D. Mutter, J. Marescaux, and N. Padoy (2018) Weakly-supervised learning for tool localization in laparoscopic videos. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Cited by: §1.
  • [27] Y. Yamazaki, S. Kanaji, T. Matsuda, T. Oshikiri, T. Nakamura, S. Suzuki, Y. Hiasa, Y. Otake, Y. Sato, and Y. Kakeji (2020) Automated surgical instrument detection from laparoscopic gastrectomy video images using an open source convolutional neural network platform. Journal of the American College of Surgeons. Cited by: §1.