Deep Attention Based Semi-Supervised 2D-Pose Estimation for Surgical Instruments

12/10/2019 ∙ by Mert Kayhan, et al. ∙ 10

For many practical problems and applications, it is not feasible to create a vast and accurately labeled dataset, which restricts the application of deep learning in many areas. Semi-supervised learning algorithms aim to improve performance by also leveraging unlabeled data. This is very valuable for the 2D-pose estimation task, where data labeling requires substantial time and is subject to noise. This work investigates whether semi-supervised learning techniques can reach a performance level that justifies using them during training. To this end, a lightweight network architecture is introduced, and the mean teacher, virtual adversarial training and pseudo-labeling algorithms are evaluated on 2D-pose estimation for surgical instruments. To make the pseudo-labeling algorithm applicable, we propose a novel confidence measure, total variation. Experimental results show that semi-supervised learning drastically improves performance on unseen geometries while maintaining high accuracy for seen geometries. On the RMIT benchmark, our lightweight architecture outperforms the state of the art with supervised learning. On the EndoVis benchmark, the pseudo-labeling algorithm improves on the supervised baseline and achieves new state-of-the-art performance.




1 Introduction

It has been shown that deep learning algorithms can achieve human-level or super-human-level performance on a variety of tasks by utilizing large amounts of labeled data. However, these achievements come at a cost: creating these massive annotated datasets usually requires a great deal of time, sometimes also expertise, and is prone to human error. For many practical problems and applications, it is not feasible to create such a vast and accurately labeled dataset, which restricts the application of deep learning in many areas.

A possible solution to this problem may be semi-supervised learning (SSL). Unlike supervised learning algorithms, which require all the examples to be labeled, SSL algorithms can improve performance by also leveraging unlabeled data. SSL algorithms generally enable the learning system to learn the structure of the data.

This work investigates whether the need for labels can be reduced by using semi-supervised learning in the 2D-pose estimation setting. To the best of our knowledge, there has not yet been any investigation of the usage and performance of SSL for surgical instrument tracking, where data labeling requires substantial time and the amount of unlabeled data is therefore large compared to the labeled data. However, this setting poses some fundamental challenges. In particular, for 2D-pose estimation there is no established method to measure the confidence of the network outputs. This is a major obstacle for the pseudo-labeling method, where a confidence threshold is used to select samples for which the network is certain of the answer. This study introduces total variation as a confidence measure for the 2D-pose estimation task to enable the usage of pseudo-labeling.

In this work, we apply 2D-pose estimation to surgical instruments. For this purpose, a lightweight deep attention based network architecture is proposed. On this architecture, three SSL algorithms are investigated: mean teacher, virtual adversarial training and pseudo-labeling. Detailed experimental analysis is conducted on the single-instrument Retinal Microsurgery Instrument Tracking (RMIT) dataset and the multi-instrument EndoVis challenge dataset. As there is no unlabeled data for the RMIT dataset, the hyperparameter search is done using supervised learning. For this dataset, the proposed network architecture achieves superior performance compared to the state of the art. For the EndoVis dataset, supervised learning is taken as the baseline and the SSL algorithms are benchmarked against it; here the pseudo-labeling algorithm outperforms the previous state-of-the-art results.

2 Related Work

2.1 Operations Requiring Surgical Tools

Retinal microsurgery is a very challenging field for surgeons. In a typical vitreoretinal surgery, the surgeon has to manipulate retinal layers that are very delicate and less than 10 µm thick [12]. A surgical precision on the order of tens of microns is required for this operation. Furthermore, the resistance applied by the retinal tissue to the instruments is exceedingly small [12], which limits the haptic feedback. Therefore, it is very difficult to estimate the precise location of the instruments. However, knowing exactly where the instruments are can provide vital information that helps avoid injuries inside the eye, e.g. a broken blood vessel.

Another category of surgery that can benefit from knowing the exact instrument location is robotic laparoscopic surgery. Laparoscopy is a surgical procedure which examines the organs inside the abdomen to check for signs of disease [1]. During laparoscopic surgery, small incisions are made in the wall of the abdomen and a laparoscope (a thin, lighted tube) is inserted into one of the incisions. During robotic laparoscopic surgery, surgeons receive visual information about the instruments from the cameras embedded in the robotic device [24]. Utilizing this information, the robotic master handles are used to move the robot to the desired position. Since the surgeons are limited to the visual information collected by a rod-like instrument in which the left and right channels are closely embedded, estimating the depth and precise locations of the instruments is very challenging. Therefore, real-time knowledge of the instruments' position with respect to anatomical structures is a key component for improving the assistive or autonomous capabilities of surgical robots [7].

2.2 Approaches for Surgical Tool Pose Estimation

Recent developments in computer vision have resulted in advanced approaches for vision-based tracking of surgical tools. The work prior to the deep-learning era relies on handcrafted features, such as Haar wavelets [25], gradient [17, 32] or color features [34, 22]. These approaches are not robust enough for real-life scenarios due to the strong illumination changes and motion blur that occur during surgeries.

With the surge of deep learning, the focus has shifted towards instrument localization and/or segmentation through CNNs. However, most of these approaches focus only on segmentation of the image, localization of keypoints on the instrument tip or bounding box detection [14, 20, 19, 10, 9]. The method proposed by Laina, Rieke et al. [14] focuses on the interdependency between instrument segmentation and tip localization. This is the first attempt to combine these two tasks into one pipeline. By jointly optimizing for these two objectives, they improve the state of the art by a clear margin. The reported network runtime for this work is 56 ms on an Nvidia TITAN X. The major shortcoming of this work is that it cannot represent the full pose of the instrument or include articulation. In response to these challenges, Du et al. [7] provide the first work on articulated pose estimation for surgical instruments. They base their approach on the methods proposed in [3, 4], which consist of two stages. First, joints and joint connections are segmented, and then these are refined to produce the final output heatmaps. These heatmaps represent the confidence of the network about the presence of a joint or joint connection at any given pixel location. The final pose of the instrument is inferred using bipartite graph matching after non-maximum suppression as a post-processing step. They report a network runtime of 24 ms and a post-processing runtime of 89 ms on an Nvidia TITAN X GPU. Although their approach provides good generalization performance, the biggest remaining challenge is achieving real-time performance while maintaining low localization error.

3 Methodology

In this section, we first give the details of the network architecture. Then we explain the proposed confidence measure, total variation, which is needed for the pseudo-labeling algorithm. Finally, we describe the training details.

3.1 Network Architecture

For surgical tool pose estimation, a modified U-Net [18] architecture is used, where each joint location is found via a separate heatmap output channel. Our architecture makes intensive use of the attention mechanism. Accordingly, we have named it DAU-Net, referring to Deep Attention based U-Net. DAU-Net diverges from U-Net in the following regards: the downsampling operation is applied 3 times; the ReLU activation function is replaced with the RLReLU activation [30]; a 2D attention module is added to the upsampling blocks at each concatenation point; and group normalization [29] is used before each activation function in the main network, whereas it is omitted in the attention module. The final output maps are generated using a 1x1 convolution to scale the output channels to the number of joints and joint associations of interest.

The final model that is used for all experiments is illustrated in Figure 1. The details of downsample and attention based upsample blocks are also illustrated in Fig. 2 (a) and Fig. 2 (b), respectively. Skip connections are applied from downsample block (before maxpooling) to attention based upsample blocks after deconvolution.

Figure 1: The modified U-Net architecture which is used in all experiments. For visual clarity, the downsample, attention based upsample and attention blocks are illustrated in Fig. 2 (a), Fig. 2 (b) and Fig. 3, respectively.

3.1.1 2D Attention Mechanism for Pose Estimation

Girshick et al. [11] have shown that by cropping relevant locations from feature maps, we can detect bounding boxes and classify the corresponding object. The biggest drawback of this method is that we need bounding box annotations to learn the correct answers.

Figure 2: Architectures of the downsample (a) and attention based upsample (b) blocks. Each convolution is followed by group normalization and RLReLU activation. The deconvolution layer is followed by RLReLU activation before concatenation. The multiplication node stands for elementwise multiplication of the attention map with the feature maps after concatenation. At Downsample 3, maxpooling is not used.

To eliminate the need for bounding box annotations, 2D attention module turns on/off elements in the feature maps. The turn on/off effect is achieved by elementwise multiplication after sigmoid activation. In other words, for each element in the feature map, the attention mechanism tries to decide if this element contains information about the joints and/or connections between joints. This leads to a drastic reduction in search space for the network because only relevant elements are propagated further. The applied attention architecture is depicted in Fig. 3. A visualization of the learned attention maps and the corresponding images can be seen in Fig. 4. As can be seen, the attention mechanism successfully concentrates on the important parts of the input image.

Figure 3: Attention mechanism applied in Fig. 2, where the multiplication node refers to elementwise multiplication. The output of the attention mechanism has 1 channel with the spatial size of the input volume, and it is broadcast across channels during elementwise multiplication.
Figure 4: Visualization of the attention mechanism for multi-instrument case.
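The gating step described above can be sketched in a few lines. This is only a minimal NumPy illustration of the sigmoid-and-multiply operation, not the authors' TensorFlow implementation; the function names and the way the attention logits are obtained are assumptions, since the layers that produce them are shown only in Fig. 3.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, maps logits to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def apply_attention(features, attention_logits):
    """Gate feature maps with a single-channel attention map.

    features:         array of shape (H, W, C)
    attention_logits: array of shape (H, W, 1), hypothetical output of the
                      attention branch before the sigmoid
    """
    gate = sigmoid(attention_logits)   # close to 0 = "off", close to 1 = "on"
    return features * gate             # broadcast across the C channels

# Toy example: only the element at (1, 2) is switched on.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
logits = np.full((4, 4, 1), -20.0)
logits[1, 2, 0] = 20.0
gated = apply_attention(features, logits)
```

Because the gate has a single channel, every feature channel at a given spatial location is scaled by the same factor, which is what the broadcasting in Fig. 3 refers to.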

3.2 Post-processing

For single instrument localization only the joint probability maps are predicted, whereas for multi-instrument localization the connection probability maps are predicted as well. The following procedure is used to retrieve the final joint locations.

For single instrument detection, a Gaussian filter is applied to the output, and then, for each channel of the output, the pixel location that contains the maximum value is found.

For multiple instrument detection, a Gaussian filter is applied to the joint probability maps, followed by thresholded non-maximum suppression to retrieve the joint candidates. Then, if the total variation measure of the output maps is below a certain threshold, a high-boost filter is applied to the connection probability maps. Finally, the line integral [4, 7] is utilized to find joint pairs and the instrument is parsed.
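The single-instrument branch of this procedure can be sketched as follows. This is a hedged NumPy illustration, not the authors' code: the smoothing width `sigma` is not stated in the text and is a placeholder, and the helper names are hypothetical. The multi-instrument steps (non-maximum suppression, high-boost filtering, line integral matching) are not covered.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with zero padding at the borders.

    Assumes both image dimensions are at least as long as the kernel.
    """
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def locate_joints(heatmaps, sigma=2.0):
    """Return one (row, col) location per output channel.

    heatmaps: array of shape (H, W, J), one channel per joint.
    sigma:    assumed smoothing width; not given in the source.
    """
    joints = []
    for j in range(heatmaps.shape[-1]):
        smoothed = gaussian_blur(heatmaps[..., j], sigma)
        row, col = np.unravel_index(np.argmax(smoothed), smoothed.shape)
        joints.append((int(row), int(col)))
    return joints

# Toy example with two joints.
heatmaps = np.zeros((32, 40, 2))
heatmaps[10, 25, 0] = 1.0
heatmaps[20, 5, 1] = 1.0
joints = locate_joints(heatmaps)
```

Smoothing before taking the maximum makes the argmax less sensitive to isolated noisy pixels in the network output.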

3.3 Total Variation as a Confidence Measure for Pose Estimation

In mathematics, total variation is a measure that describes the local and global structure of functions [21]. Furthermore, in the context of image processing, it is often assumed that signals with high total variation have excessive detail. Following this notion, this study proposes total variation of probability maps as a way of assessing the confidence of the inferred pose estimates.

Formally, the anisotropic version of total variation for multi-channel images [2, 21] can be written as

$$TV(y) = \sum_{i,j,k} \left( \left| y_{i+1,j,k} - y_{i,j,k} \right| + \left| y_{i,j+1,k} - y_{i,j,k} \right| \right),$$

where $y_{i,j,k}$ denotes the value of the image $y$ at row $i$, column $j$ and channel $k$. As can be seen in this formulation, total variation is the sum of the local discrete gradients in the x and y directions. In other words, images with high total variation have large value differences between neighboring pixels. This is often assumed to be noise and irrelevant information, and therefore total variation denoising [28] has been proposed to eliminate the noise from images. However, in the context of CNN based 2D pose estimation, the global structure of the output maps matches the instrument location because the MSE objective is minimized during training. Exploiting this information, the total variation of the output maps can be used to evaluate their local properties. As can be seen in the autoencoder literature, two images may have low MSE but look quite different because MSE does not necessarily address the sharpness of the image. In this study, the sharpness of the output maps is evaluated using total variation. In other words, if an output map has low total variation, this translates to a flat output distribution, which represents a low confidence prediction. Thus, the total variation measure can be used as a post-processing step to evaluate the quality of predictions and, if necessary, to drive a decision mechanism that determines the need for further processing. Furthermore, this measure complements the pseudo-labeling method for pose estimation because that method requires a confidence threshold to be used effectively.
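As a rough illustration of the measure described above, the following NumPy sketch computes the anisotropic total variation of a stack of output maps and compares it against a threshold. The function names and the threshold used in the example are placeholders chosen for illustration, not values taken from the paper's implementation.

```python
import numpy as np

def total_variation(maps):
    """Anisotropic total variation of output maps with shape (H, W, C).

    Sums the absolute differences between vertically and horizontally
    neighboring pixels over all channels.
    """
    dy = np.abs(maps[1:, :, :] - maps[:-1, :, :])
    dx = np.abs(maps[:, 1:, :] - maps[:, :-1, :])
    return float(dy.sum() + dx.sum())

def is_confident(maps, threshold):
    """Treat a prediction as confident if its total variation exceeds the threshold."""
    return total_variation(maps) > threshold

# A flat map (low confidence) versus a map with a single sharp peak.
flat = np.full((8, 8, 1), 0.1)
peaked = np.zeros((8, 8, 1))
peaked[4, 4, 0] = 1.0

tv_flat = total_variation(flat)      # no neighboring differences at all
tv_peaked = total_variation(peaked)  # four unit-sized jumps around the peak
```

A flat map has zero total variation no matter how large its values are, which is why the measure captures sharpness rather than overall activation level.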

Figure 5: An example target map for End-Shaft joint pair [7], where, (a) is an illustration of the groundtruth annotation, (b) represents the connection probability map and, (c) and (d) represent the respective joint probability maps.

3.4 Training Details

Learning: Throughout training, the Adam solver is used with default parameters [13]. Training lasts 50k iterations. Following Du et al. [7], the input resolution is set to 288x384 pixels and 256x320 pixels for the RMIT and EndoVis datasets, respectively. DAU-Net kernels are initialized from a truncated Gaussian distribution, and kernels in the attention module are initialized using Xavier initialization. Target labels are created as heatmaps, where each joint annotation corresponds to a 2D Gaussian density map centred at the labelled point location and the annotation for a joint association corresponds to a Gaussian distribution along the joint pair center line. Following Du et al., a standard deviation of 20 pixels is used for the Gaussian distributions. An example target label can be seen in Figure 5.
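One way to generate such target maps is sketched below in NumPy. This is an assumption-laden illustration: the peak value is normalized to 1 rather than to a proper probability density, the connection map uses the distance to the line segment between the two joints, and the function names are hypothetical. The source does not specify these details.

```python
import numpy as np

def joint_heatmap(height, width, center_xy, sigma=20.0):
    """2D Gaussian map centred at an annotated joint, peak value 1 (assumed)."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

def connection_heatmap(height, width, p0_xy, p1_xy, sigma=20.0):
    """Gaussian falloff with the distance to the segment joining two joints."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    p0 = np.asarray(p0_xy, dtype=float)
    p1 = np.asarray(p1_xy, dtype=float)
    seg = p1 - p0
    seg_len_sq = max(float(seg @ seg), 1e-12)
    # Projection of every pixel onto the segment, clamped to its end points.
    t = ((xs - p0[0]) * seg[0] + (ys - p0[1]) * seg[1]) / seg_len_sq
    t = np.clip(t, 0.0, 1.0)
    nearest_x = p0[0] + t * seg[0]
    nearest_y = p0[1] + t * seg[1]
    dist_sq = (xs - nearest_x) ** 2 + (ys - nearest_y) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

# EndoVis-sized example, reading 256x320 as height x width (an assumption).
joint_map = joint_heatmap(256, 320, center_xy=(100, 50))
conn_map = connection_heatmap(256, 320, p0_xy=(100, 50), p1_xy=(200, 50))
```

With a standard deviation of 20 pixels, a pixel 20 pixels away from the joint or center line receives a target value of exp(-0.5), roughly 0.61 of the peak.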


Regularization: The network is regularized using dropout [23] with a dropping rate of 30% and 10% for RMIT and EndoVis, respectively. Also, noisy labels are applied by sampling a random variable uniformly between -0.01 and 0.01 and adding it to each pixel of the target heatmap.

Augmentation: Since both datasets contain very limited data, heavy data augmentation is used to avoid overfitting. For the RMIT dataset, random flipping, random translation (5 px), random rotation (10 degrees), Gaussian noise, random brightness, random contrast, random saturation, histogram equalization, random blurring, pepper noise, salt noise, speckle noise and random erasing [33] are used. For the EndoVis dataset, random flipping, random translation (5 px), random rotation (20 degrees) and random swapping are used.

Figure 6: Frames resulting from random swapping. To make the result easier to interpret visually, the parts that come from different images are shown in BGR and RGB color formats, respectively.

Random Swapping Data Augmentation: EndoVis dataset contains very limited annotated samples. Furthermore, a large fraction of these samples consist of frames where only a single instrument is visible. This makes it very difficult to learn models with good generalization across single and multi-instrument cases. To deal with this issue, Random Swapping is introduced as a data augmentation strategy.

Inspired by [5, 8], Random Swapping is a method that uses keypoint annotations to generate semantically meaningful mixtures of images and to simulate occlusion. In this study, the clasper annotations are used to split a frame into 2 parts. Afterwards, another image is sampled from the training set and is again split according to its clasper annotation. Finally, two cropped parts from these two training images are fused together. If the sum of the crop sizes does not correspond to the original frame size, the final image is either zero-padded in the middle or cropped from the edges. The same operations are performed on the target heatmaps to generate labels for training. An illustration of the output images can be seen in Figure 6.
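The fusion step can be sketched as below. The description leaves several details open, so this NumPy sketch rests on explicit assumptions: the split is a vertical cut at a column index derived from the clasper annotation, the left part of the first image is combined with the right part of the second, and any excess width is cropped evenly from both outer edges. The function name and arguments are hypothetical.

```python
import numpy as np

def random_swap(img_a, split_a, img_b, split_b):
    """Fuse columns [0, split_a) of img_a with columns [split_b, W) of img_b.

    Both inputs have shape (H, W) or (H, W, C). The output has the same
    width W: a gap is zero-padded in the middle, an excess is cropped
    from the outer edges (assumed to be split evenly between both sides).
    """
    height, width = img_a.shape[:2]
    left = img_a[:, :split_a]
    right = img_b[:, split_b:]
    gap = width - (left.shape[1] + right.shape[1])
    if gap >= 0:
        pad = np.zeros((height, gap) + img_a.shape[2:], dtype=img_a.dtype)
        return np.concatenate([left, pad, right], axis=1)
    fused = np.concatenate([left, right], axis=1)
    start = (-gap) // 2
    return fused[:, start:start + width]

# Toy frames filled with constant values so the origin of each column is visible.
frame_a = np.ones((4, 10, 3))
frame_b = np.full((4, 10, 3), 2.0)
padded = random_swap(frame_a, 4, frame_b, 7)   # 4 + 3 columns, 3 zero columns between
cropped = random_swap(frame_a, 6, frame_b, 2)  # 6 + 8 columns, 4 columns cropped
```

Calling the same function with the same split indices on the target heatmaps keeps images and labels aligned.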



The whole training setup and network are implemented using TensorFlow. For reproducibility of results, we make our code publicly available.


4 Experiments

In this section, we present the results of our experiments on two publicly available datasets: RMIT and EndoVis. First, the RMIT dataset is used to develop the deep attention U-Net (DAU-Net) model. Since the RMIT dataset does not contain any unlabeled samples, we do not investigate semi-supervised learning on this dataset. Next, the EndoVis dataset is utilized to evaluate the performance of the developed network architecture. Furthermore, the unlabeled training data is used to evaluate the effectiveness of the mean teacher, virtual adversarial training and pseudo-labeling algorithms.

4.1 Datasets

RMIT Dataset: The Retinal Microsurgery Instrument Tracking (RMIT) dataset consists of three surgical sequences recorded during in vivo retinal microsurgery, where only a single instrument is visible during recording. The original frames extracted from the videos have a resolution of 640 x 480 pixels. Following Du et al. [7], the dataset is split into training and test sets, where the training set consists of the first half of each sequence and the rest of the data is used for testing. A detailed distribution of the data can be seen in Table 1. For most of the frames, 4 keypoints (tip1 - tip2 - shaft - end) are annotated. An example annotation can be seen in Figure 7.

EndoVis Dataset: The EndoVis Challenge dataset is a multi-instrument dataset that contains 6 video sequences from endoscopic surgeries where, in a fraction of the sequences, 2 instruments are present in the frame. The training set consists of four 45-second ex vivo video sequences of surgeries, whereas the test set consists of four 15-second video sequences which are complementary to the training set as well as two additional 1-minute recorded interventions. A detailed distribution of the data can be seen in Table 1. The frame resolution for each of the videos is 720 x 576 pixels. Since the sparse annotations proposed by Du et al. [7] are used, the entire training set is used for training, as done by Du et al., which differs from the leave-one-surgery-out training strategy required by the original challenge. For semi-supervised learning, the unlabeled training data is used as well. Du et al. construct a high quality multi-joint annotation which consists of Left Clasper, Right Clasper, Head, Shaft and End joint positions. An example annotation can be seen in Figure 7.

| | EndoVis Training | EndoVis Testing | RMIT Training | RMIT Testing |
| --- | --- | --- | --- | --- |
| Seq 1 | 210 / 1107 | 80 / 370 | 201 / 201 | 201 / 201 |
| Seq 2 | 240 / 1125 | 76 / 375 | 111 / 111 | 111 / 111 |
| Seq 3 | 252 / 1124 | 76 / 375 | 265 / 271 | 266 / 276 |
| Seq 4 | 238 / 1123 | 76 / 375 | | |
| Seq 5 | | 301 / 1500 | | |
| Seq 6 | | 301 / 1500 | | |
| Total | 940 / 4479 | 910 / 4495 | 577 / 583 | 578 / 588 |
Table 1: The distribution of the data across different sequences for RMIT and Endovis datasets. Each row contains number of labeled images / number of total images for corresponding sequence. It should be noted that Sequence 5 and 6 are only present in the test set for Endovis Dataset.
Figure 7: Example training labels for RMIT (a) and Endovis (b) datasets. For RMIT dataset, the tips (cyan, blue), shaft (green) and end (red) of the instrument are annotated. For Endovis dataset, the claspers (red, blue), head (green), shaft (yellow) and end (cyan) joints are annotated.

4.2 Results Using RMIT Dataset

For the experiments shown in Table 2, the network is trained on the groundtruth bounding boxes to enable faster experimentation and to simulate an object detection based localization system. Bounding boxes are extracted using 3-point annotation as shown in [6]. Since this work is not published, readers are referred to the Supplementary Material (Bounding Box Generation from 3-point Annotation) for detailed information on this method. A resolution of 128x128 is used for all experiments. Here the pixel error corresponds to the mean absolute error because the groundtruth bounding boxes are used for training, which eliminates the possibility of false detection.

Pixel Error Rates (MAE)

| Configuration | Shaft | Tip 1 | Tip 2 | Avr |
| --- | --- | --- | --- | --- |
| U-Net + augmentation | 5.9 | 7.12 | 5.57 | 6.2 |
| U-Net + augmentation + attention | 3.79 | 6.67 | 4.72 | 5.06 |
| U-Net + heavy augmentation + attention | 3.73 | 5.98 | 4.12 | 4.61 |
| U-Net + heavy augmentation + attention + L2 regularization | 4.72 | 6.63 | 4.98 | 5.44 |
| U-Net + heavy augmentation + attention + dropout | 3.81 | 5.71 | 4.4 | 4.64 |
| U-Net + heavy augmentation + attention + dropout + noisy labels | 3.19 | 5.59 | 4.23 | 4.34 |
| U-Net + heavy augmentation + attention + dropout + noisy labels + lrelu | 3.1 | 6.59 | 4.49 | 4.73 |
| U-Net + heavy augmentation + attention + dropout + noisy labels + rlrelu | 2.8 | 4.82 | 4.17 | 3.93 |

Table 2: A compact summary of the results obtained by varying one component at a time to find the right architecture and training pipeline for RMIT dataset. Augmentation refers to only geometric transformations while heavy augmentation also includes color space transformations.

First, a vanilla U-Net is trained using only the geometric augmentations. The network produces very coarse output maps, which lead to high pixel error. In response, the attention mechanism is introduced to help the network concentrate on the important parts of the image. Following this modification, the network trains considerably faster and produces more fine-grained outputs. However, the network is also highly prone to overfitting, and therefore more data augmentation is introduced. Even with the additional augmentation the network overfits, so regularization is added in the following experiments. Dropout yields superior performance compared to L2 regularization, and therefore dropout with a 30% drop rate is used in the following experiments. Because increasing the drop rate does not improve generalization performance, other ways of regularizing the network are investigated. The combination of heavy data augmentation and noise injection to the labels simulates new data points more convincingly and leads to better generalization performance. The structure of the injected noise is described in Subsection 3.4. Finally, the activation functions are investigated to see if generalization performance can be improved. The RLReLU activation function [30] improves the generalization performance further. All in all, increasing the input and network level stochasticity (random data augmentation, dropout and RLReLU) improves the generalization performance drastically. This model is used throughout this study and is denoted by DAU-Net-base_feature_maps-depth, which corresponds to DAU-Net-32-3 in this case.

To measure the effectiveness of the designed system, the detection threshold is set to 15 pixels on the original frame and DAU-Net-32-3 is compared with the state of the art for the 3-point annotation system. As can be seen in Table 12, this system yields results comparable with the state of the art while using fewer parameters (530k) than their refinement network. However, it should be noted that Du et al. use 4-point annotation and an end-to-end learning system to achieve these results. For a fairer comparison, the proposed system is scaled up and trained on 4-point annotations in an end-to-end manner. Apart from increasing the number of trainable parameters, no other modifications are made, and the same input resolution as Du et al. is used. As can be seen in Table 3, DAU-Net-64-3 improves on the state of the art while using fewer parameters (2.1M) for a detection threshold of 15 pixels on the original frame. It should be noted that in Tables 12 and 3, the pixel error does not correspond to the mean absolute error but to the root mean squared error computed for the detected joints.

Precision / Recall / Pixel error (RMSE)

| Joint | DAU-Net-64-3 | Du et al. [7] |
| --- | --- | --- |
| Tip1 | 96 / 96 / 4.44 | 99.13 / 99.13 / 5.26 |
| Tip2 | 98.3 / 98.3 / 5.13 | 97.58 / 97.58 / 4.61 |
| Shaft | 99.5 / 99.5 / 4.01 | 94.12 / 94.12 / 4.93 |
| End | 92.4 / 92.4 / 5.68 | 86.51 / 86.51 / 4.68 |
| Avr | 96.6 / 96.6 / 4.82 | 94.3 / 94.3 / 4.87 |
Table 3: Comparison of DAU-Net-64-3 with the state of the art for 4-point annotation and end to end training on RMIT dataset.
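The thresholded metrics used in this comparison can be illustrated with a small sketch. It assumes one predicted location per annotated frame and per joint, counts a prediction as a detection when it lies within the pixel threshold of the ground truth, and computes the RMSE over detected joints only. Under that assumption precision and recall coincide, which would be consistent with the identical precision and recall values in Table 3. The function name and the sample coordinates are made up for illustration.

```python
import numpy as np

def evaluate_joint(pred_xy, gt_xy, threshold=15.0):
    """Precision, recall and RMSE over detected joints for one joint type.

    pred_xy, gt_xy: arrays of shape (N, 2) in original-frame pixels,
                    one prediction per annotated frame (assumption).
    """
    pred = np.asarray(pred_xy, dtype=float)
    gt = np.asarray(gt_xy, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    detected = dist <= threshold
    true_positives = int(detected.sum())
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    if true_positives == 0:
        return precision, recall, float("nan")
    rmse = float(np.sqrt(np.mean(dist[detected] ** 2)))
    return precision, recall, rmse

# Four frames with distances of 5, 0, 30 and 6 pixels to the ground truth.
gt = [[0, 0], [10, 10], [50, 50], [100, 100]]
pred = [[3, 4], [10, 10], [50, 80], [100, 106]]
precision, recall, rmse = evaluate_joint(pred, gt, threshold=15.0)
```

In this toy case three of the four predictions fall within 15 pixels, so the 30-pixel miss lowers precision and recall but does not enter the RMSE.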

4.3 Results Using Endovis Dataset

Experiments (Test set loss (MSE))

| Experiment | Avr. Loss |
| --- | --- |
| 30% dropout | 0.002337 |
| Dilated Conv | 0.003029 |
| 10% dropout | 0.002357 |
| Random Swap | 0.002288 |
| Elastic Disp. | 0.002302 |
Table 4: A compact summary of the results obtained by varying one component at a time to find the right amount of regularization and data augmentation strategies for EndoVis dataset.

For all the experiments given in Table 4, DAU-Net-64-3 is used because it was shown to deliver very accurate pose estimates for single instrument cases. The main idea of these experiments is to test the performance on multi-instrument cases (Table 4) and to measure the effectiveness of semi-supervised learning in the 2D-pose estimation setting (Table 8). Since finding the exact poses of multiple instruments requires a post-processing procedure based on thresholded non-maximum suppression and graph matching, the test set loss is compared to find models with better performance.

First, the network is trained with the exact setup from the previous section. However, it is observed that the generalization performance is not very good. At the beginning it is speculated that this is caused by the larger receptive field requirement for the EndoVis dataset. Therefore, dilated convolutions with dilation rate 2 are introduced. As can be seen in the table, this did not improve the performance. After analysing the output maps, it is observed that network produces flat outputs to minimize MSE which is interpreted as underfitting. In response to that, dropout rate is reduced to 10%. Furthermore, the color space augmentations are removed because in EndoVis the lighting does not vary between sequences. Next, random swapping data augmentation is introduced to generate more data. As can be seen, introduction of random swapping reduced the test set error below 0.0023. Afterwards, to see the effectiveness of random swap, it is removed from the augmentation pipeline and elastic displacement is introduced. However, this model performs slightly worse.

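Sequence 1-4 (Seen instruments)

| Joint | Precision | Recall | F1-score | Pixel error |
| --- | --- | --- | --- | --- |
| Left Clasper | 95.6 | 100 | 97.8 | 4.44 |
| Right Clasper | 99.7 | 100 | 99.9 | 2.83 |
| Head | 99.7 | 100 | 99.9 | 4.23 |
| Shaft | 100 | 100 | 100 | 2.86 |
| End | 100 | 100 | 100 | 5.93 |
| Avr | 99.1 | 100 | 99.5 | 4.06 |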
Table 5: Performance of the supervised baseline on the seen instruments after post-processing.
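Sequence 5-6 (Unseen instruments)

| Joint | Precision | Recall | F1-score | Pixel error |
| --- | --- | --- | --- | --- |
| Left Clasper | 58.0 | 83.5 | 68.5 | 8.13 |
| Right Clasper | 90.7 | 63.5 | 74.7 | 5.85 |
| Head | 95.1 | 65.1 | 77.3 | 4.92 |
| Shaft | 99.4 | 66.1 | 79.4 | 8.11 |
| End | 91.9 | 56.1 | 69.7 | 7.13 |
| Avr | 87.0 | 68.7 | 73.9 | 6.83 |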
Table 6: Performance of the supervised baseline on the unseen instruments after post-processing.

Table 5 shows the precision, recall, f1 scores and the RMSE of the network for the detected parts. As can be seen, the network delivers very accurate pose estimates for seen instruments. However, as shown in Table 6, the network has difficulty extrapolating to an unknown geometry. Except for the left clasper, it can be seen that the detected joints are mostly within the 20 pixel threshold, whereas for left clasper, detections are not very accurate. After analysing the output maps, it is seen that the network produces low confidence predictions which get thresholded away. It is speculated that this is the main reason for the low recall for most of the joints. To counter that, our proposed total variation confidence measure is utilized. More information about this method can be found in Subsection 3.3. Using the steps given in Subsection 3.2, an improvement from 73.9 to 76.1 in average f1 score is observed. A detailed report of the results with this new post-processing pipeline can be seen in Table 7.
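For readers who want to check the reported scores, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes the Left Clasper entries of Tables 5 and 6 from their precision and recall values; the function name is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Left Clasper, seen instruments (Table 5): precision 95.6, recall 100.
f1_seen = round(f1_score(95.6, 100.0), 1)

# Left Clasper, unseen instruments (Table 6): precision 58.0, recall 83.5.
f1_unseen = round(f1_score(58.0, 83.5), 1)
```

Both values round to the F1-scores listed in the tables (97.8 and 68.5), and the second case shows how low precision pulls the score down even when recall is comparatively high.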

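Sequence 5-6 (Unseen instruments)

| Joint | Precision | Recall | F1-score | Pixel error |
| --- | --- | --- | --- | --- |
| Left Clasper | 61.6 | 90.2 | 73.2 | 7.89 |
| Right Clasper | 86.0 | 75.8 | 74.7 | 6.31 |
| Head | 83.4 | 67.9 | 74.9 | 5.32 |
| Shaft | 94.8 | 71.5 | 81.5 | 8.25 |
| End | 90.3 | 64.0 | 74.9 | 7.70 |
| Avr | 82.8 | 72.2 | 76.1 | 7.09 |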
Table 7: Performance of the supervised baseline on the unseen instruments with the modified post-processing which utilizes total variation measure.
Experiments (Test set loss (MSE))

| SSL Algorithm | Avr. Loss |
| --- | --- |
| VAT (noise magnitude = 1) | 0.002319 |
| VAT (noise magnitude = 0.1) | 0.002295 |
| VAT (noise magnitude = 10) | 0.002361 |
| Pseudo-labeling | 0.002335 |
| Mean teacher (max. consistency coefficient = 1) | 0.002428 |
| Mean teacher (max. consistency coefficient = 0.1) | 0.002335 |

Table 8: An overview of the experiments and the respective test set losses. Test set loss is used only to find the better performing hyperparameters, not to compare algorithms.

After establishing the right data augmentation strategies and post-processing pipeline, the unlabeled training data is utilized in a semi-supervised learning context to see if further performance improvement is possible. To this end, the mean teacher [26], pseudo-labeling [31] and VAT [15] algorithms are implemented and evaluated. For pseudo-labeling, a sample is accepted only if the total variation of its output maps is above 1000 for multi-instrument cases and above 400 for single-instrument cases. For the mean teacher algorithm, a coefficient of 0.95 is used for the exponential moving average (EMA). For VAT, the distance metric used to compute the virtual adversarial loss is MSE. In Table 8, the value in parentheses is the maximum consistency coefficient for the mean teacher algorithm and the magnitude of the virtual adversarial noise for VAT. The maximum consistency coefficient is reached after 20k iterations for the mean teacher algorithm, whereas there is no ramp-up for VAT, as in the original paper [15]. The sigmoid schedule used by Oliver et al. [16] determines the value of the consistency coefficient throughout training. Based on Table 8, three candidates with lower test set loss are selected for post-processing to enable a thorough comparison of the algorithms.
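The two mean teacher ingredients mentioned here, the EMA of the weights and the ramp-up of the consistency coefficient, can be sketched in plain Python. The exact schedule is not spelled out in the text, so the `exp(-5 * (1 - t)^2)` form below, which is common in the mean teacher literature, is an assumption. The value 0.95 is interpreted as the decay applied to the teacher weights, which is also an assumption, and the function names are illustrative.

```python
import math

def sigmoid_rampup(step, rampup_steps=20_000):
    """Sigmoid-shaped ramp-up from exp(-5) at step 0 to 1 at rampup_steps.

    Assumed form exp(-5 * (1 - t)^2); the source only names a
    "sigmoid schedule" and the 20k-iteration ramp-up length.
    """
    if rampup_steps <= 0:
        return 1.0
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

def consistency_weight(step, max_coefficient, rampup_steps=20_000):
    """Consistency coefficient at a given training step."""
    return max_coefficient * sigmoid_rampup(step, rampup_steps)

def ema_update(teacher_weights, student_weights, decay=0.95):
    """Move each teacher weight towards the corresponding student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

# Coefficient at the start, halfway through, and after the ramp-up.
weights = [consistency_weight(s, max_coefficient=0.1) for s in (0, 10_000, 20_000)]

# One EMA step for a toy weight vector.
teacher = ema_update([1.0, 0.0], [0.0, 1.0])
```

The ramp-up keeps the consistency term small early in training, when the teacher's predictions are still unreliable, and lets it reach its maximum once both networks have stabilized.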

Sequence 1-4 (F1-score / Pixel error (RMSE))
VAT Pseudo-labeling Mean Teacher
Left Clasper 95.6 / 4.48 96.1 / 3.97 95.8 / 5.34
Right Clasper 98.8 / 2.94 97.9 / 2.22 95.6 / 6.99
Head 99.7 / 3.43 100 / 3.54 97.1 / 3.58
Shaft 100 / 3.61 100 / 3.28 96.0 / 2.90
End 100 / 5.24 99.9 / 6.05 99.1 / 5.23
Avr 98.8 / 3.94 98.8 / 3.81 96.7 / 4.81
Table 9: An exhaustive comparison of different semi-supervised learning algorithms for seen instruments.
Sequence 5-6 (F1-score / Pixel error (RMSE))
VAT Pseudo-labeling Mean Teacher
Left Clasper 74.8 / 9.43 77.8 / 7.57 80.5 / 8.87
Right Clasper 82.5 / 6.15 68.8 / 6.39 71.0 / 6.28
Head 72.0 / 4.50 81.3 / 4.89 71.5 / 5.15
Shaft 81.4 / 7.98 87.2 / 8.24 88.9 / 9.48
End 83.0 / 8.25 82.3 / 8.40 85.6 / 8.66
Avr 78.7 / 7.26 79.5 / 7.10 79.5 / 7.69
Table 10: An exhaustive comparison of different semi-supervised learning algorithms for unseen instruments.

As can be seen in Tables 9 and 10, semi-supervised learning methods consistently improve the ability to extrapolate to unseen geometries while maintaining high accuracy for seen instruments. The network trained with the pseudo-labeling method is more consistent across seen and unseen instruments, and therefore this model is selected as the final semi-supervised model. Tables 13 and 14 provide detailed results for the pseudo-labeling method for all sequences. Furthermore, Table 11 compares the supervised baseline, the final semi-supervised model and the state of the art in terms of F1-score and root mean squared pixel error. Following Du et al. [7], a threshold of 20 pixels on the original frame is used for all experiments. As can be seen in this table, semi-supervised learning improves on the supervised baseline in average F1-score (84.2 vs. 82.0) and pixel error (6.35 vs. 6.39). In addition, it improves on the state of the art in average F1-score and pixel error while using only 2.1M trainable parameters, which demonstrates the usefulness of semi-supervised learning and the strength of the designed network architecture. The runtime of DAU-Net-64-3 is measured as 35 ms on an Nvidia TITAN X GPU.
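To make the role of the 20-pixel threshold concrete, the following sketch computes precision, recall, F1-score and RMSE for a single joint. It is an illustration under stated assumptions rather than the evaluation protocol of Du et al. [7]: `evaluate_joint` is a hypothetical helper, a prediction farther than the threshold from its ground truth is counted as both a false positive and a false negative, and the RMSE is taken over true positives only.

```python
import math


def evaluate_joint(predictions, ground_truth, threshold=20.0):
    """Score one joint; each list holds (x, y) or None, one entry per frame."""
    tp = fp = fn = 0
    squared_errors = []
    for pred, gt in zip(predictions, ground_truth):
        if pred is None and gt is None:
            continue  # nothing to detect, nothing detected
        if pred is None:
            fn += 1
        elif gt is None:
            fp += 1
        else:
            dist = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
            if dist <= threshold:
                tp += 1
                squared_errors.append(dist ** 2)
            else:
                # Assumed convention: a far miss is a wrong detection
                # and a missed joint at the same time.
                fp += 1
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    rmse = math.sqrt(sum(squared_errors) / tp) if tp else float("nan")
    return precision, recall, f1, rmse
```

Coordinates are assumed to be in pixels on the original frame, matching the way the threshold is stated in the text.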

An example of the virtual adversarial noise for the EndoVis dataset is visualized in Figure 8. As can be seen, noise is added to pixels where reflective surfaces are present. This is to be expected: most parts of the instruments have no defining texture, but they do reflect light. VAT therefore tries to fool the network into treating reflective surfaces as instruments.
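How such a noise pattern arises can be illustrated with a toy version of the VAT perturbation step with MSE as the distance. The sketch below is an assumption-laden simplification: it follows the power-iteration approximation of Miyato et al. [15], but replaces backpropagation with central finite differences so that it runs in plain Python, and `model`, `xi`, `h` and the function names are hypothetical choices for illustration.

```python
import math
import random


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v)) or 1.0  # guard against a zero vector
    return [c / norm for c in v]


def virtual_adversarial_noise(model, x, eps=1.0, xi=1e-3, n_power=1, seed=0, h=1e-5):
    """Approximate the perturbation of norm eps that changes model(x) the most."""
    rng = random.Random(seed)
    clean = model(x)
    d = normalize([rng.gauss(0.0, 1.0) for _ in x])  # random start direction

    def distance(r):
        return mse(clean, model([xv + rv for xv, rv in zip(x, r)]))

    for _ in range(n_power):
        r = [xi * c for c in d]
        grad = []
        for k in range(len(x)):
            plus, minus = r[:], r[:]
            plus[k] += h
            minus[k] -= h
            # Finite-difference stand-in for the gradient of the distance w.r.t. r.
            grad.append((distance(plus) - distance(minus)) / (2.0 * h))
        d = normalize(grad)
    return [eps * c for c in d]
```

For a model that is far more sensitive to one input than to the others, the returned noise concentrates on that input, which mirrors the way the noise in Figure 8 concentrates on reflective pixels.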

All sequences (F1-score / Pixel error (RMSE))
Supervised Pseudo-labeling Du et al. [7]
Left Clasper 80.6 / 6.67 82.8 / 6.63 86.4 / 5.03
Right Clasper 81.7 / 5.58 76.2 / 5.39 85.7 / 5.40
Head 80.9 / 5.19 85.6 / 4.56 76.3 / 6.55
Shaft 85.2 / 7.25 90.1 / 7.32 91.0 / 8.63
End 81.4 / 7.26 86.4 / 7.84 77.3 / 9.17
Avr 82.0 / 6.39 84.2 / 6.35 83.3 / 6.96
Table 11: A comparison of f1 score and root mean squared pixel error for the supervised baseline, selected semi-supervised model and the state of the art.
Figure 8: Visualization of virtual adversarial noise on the EndoVis dataset. On the left side the original image is presented and on the right side the virtual adversarial noise is illustrated.

5 Conclusion

This study evaluates semi-supervised learning for 2D-pose estimation of surgical instruments, a task in which data labeling is prone to human error and requires a substantial time investment.

Overall, the attention mechanism is observed to improve performance drastically and to eliminate the need for a 2-stage pipeline consisting of detection and refinement. Furthermore, semi-supervised learning is shown to improve performance for unseen instruments while maintaining high accuracy for seen ones. More specifically, the combination of pseudo-labeling and total variation is more consistent and easier to use, whereas the VAT and mean teacher algorithms require an extensive hyperparameter search and additional computational overhead during training. The introduced confidence measure, total variation, also proves useful in several respects: our experiments indicate that using it as a post-processing step and/or as part of the pseudo-labeling algorithm can yield a substantial performance improvement.


References

  • [1] Laparoscopy. Accessed: 2019-08-20.
  • [2] Total variation. Accessed: 2019-09-05.
  • [3] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
  • [4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.
  • [5] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • [6] Luca Dombetzki. Deep learning for tool detection and tracking in microsurgery. Bachelor’s Thesis, 2018.
  • [7] Xiaofei Du, Thomas Kurmann, Ping-Lin Chang, Maximilian Allan, Sebastien Ourselin, Raphael Sznitman, John D Kelly, and Danail Stoyanov. Articulated multi-instrument 2-d pose estimation using fully convolutional networks. IEEE transactions on medical imaging, 37(5):1276–1287, 2018.
  • [8] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pages 364–380, 2018.
  • [9] Luis C García-Peraza-Herrera, Wenqi Li, Lucas Fidon, Caspar Gruijthuijsen, Alain Devreker, George Attilakos, Jan Deprest, Emmanuel Vander Poorten, Danail Stoyanov, Tom Vercauteren, et al. Toolnet: holistically-nested real-time segmentation of robotic surgical tools. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5717–5722. IEEE, 2017.
  • [10] Luis C García-Peraza-Herrera, Wenqi Li, Caspar Gruijthuijsen, Alain Devreker, George Attilakos, Jan Deprest, Emmanuel Vander Poorten, Danail Stoyanov, Tom Vercauteren, and Sébastien Ourselin. Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. In International Workshop on Computer-Assisted and Robotic Endoscopy, pages 84–95. Springer, 2016.
  • [11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] Puneet K Gupta, Patrick S Jensen, and Eugene de Juan. Surgical forces and tactile perception during retinal microsurgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 1218–1225. Springer, 1999.
  • [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [14] Iro Laina, Nicola Rieke, Christian Rupprecht, Josué Page Vizcaíno, Abouzar Eslami, Federico Tombari, and Nassir Navab. Concurrent segmentation and localization for tracking of surgical instruments. In International conference on medical image computing and computer-assisted intervention, pages 664–672. Springer, 2017.
  • [15] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • [16] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.
  • [17] Nicola Rieke, David Joseph Tan, Chiara Amat di San Filippo, Federico Tombari, Mohamed Alsheakhali, Vasileios Belagiannis, Abouzar Eslami, and Nassir Navab. Real-time localization of articulated surgical instruments in retinal microsurgery. Medical image analysis, 34:82–100, 2016.
  • [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [19] Manish Sahu, Anirban Mukhopadhyay, Angelika Szengel, and Stefan Zachow. Addressing multi-label imbalance problem of surgical tool detection using cnn. International journal of computer assisted radiology and surgery, 12(6):1013–1020, 2017.
  • [20] Duygu Sarikaya, Jason J Corso, and Khurshid A Guru. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE transactions on medical imaging, 36(7):1542–1549, 2017.
  • [21] Otmar Scherzer, Markus Grasmair, Harald Grossauer, Markus Haltmeier, and Frank Lenzen. Variational methods in imaging. Springer, 2009.
  • [22] Stefanie Speidel, Julia Benzko, Sebastian Krappe, Gunther Sudra, Pedram Azad, Beat Peter Müller-Stich, Carsten Gutt, and Rüdiger Dillmann. Automatic classification of minimally invasive instruments based on endoscopic image sequences. In Medical Imaging 2009: Visualization, Image-Guided Procedures, and Modeling, volume 7261, page 72610A. International Society for Optics and Photonics, 2009.
  • [23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [24] Gyung Tak Sung and Inderbir S Gill. Robotic laparoscopic surgery: a comparison of the da vinci and zeus systems. Urology, 58(6):893–898, 2001.
  • [25] Raphael Sznitman, Rogerio Richa, Russell H Taylor, Bruno Jedynak, and Gregory D Hager. Unified detection and tracking of instruments during retinal microsurgery. IEEE transactions on pattern analysis and machine intelligence, 35(5):1263–1273, 2012.
  • [26] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
  • [27] Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
  • [28] Curtis R Vogel and Mary E Oman. Iterative methods for total variation denoising. SIAM Journal on Scientific Computing, 17(1):227–238, 1996.
  • [29] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [30] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
  • [31] I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
  • [32] Menglong Ye, Lin Zhang, Stamatia Giannarou, and Guang-Zhong Yang. Real-time 3d tracking of articulated tools for robotic surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 386–394. Springer, 2016.
  • [33] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
  • [34] Jiawei Zhou and Shahram Payandeh. Visual tracking of laparoscopic instruments. Journal of Automation and Control Engineering, 2(3), 2014.

Supplementary Material

Bounding Box Generation from 3-point Annotation

In the context of retinal microsurgery, accurate localization of the tips and the shaft of the instrument is considered more valuable than accurate localization of the end joint, because these 3 joints are the closest to the retina during surgery. One can therefore argue that detecting and/or localizing the end joint is redundant for real-world applications. Based on this argument, [6] uses the formulation given below to compute tight bounding boxes around the tips and the shaft of the instruments

where a bounding box is defined by the coordinates (x_min, y_min) and (x_max, y_max), which correspond to two opposite vertices of the bounding box. For a given joint set, these vertices are computed by finding the minimum and the maximum over all the x and y coordinates. Furthermore, a scaling variable is used to determine the width and the height of the bounding box. The scaling variable can be computed using

where = 1 is used for all experiments.
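The min/max construction described above can be sketched as follows. This is only an illustration of the idea: the function name, the `pad` parameter and the way `scale` enlarges the box are assumptions made for this example, and they do not reproduce the scaling formula of [6].

```python
def bounding_box(joints, scale=1.0, pad=20.0):
    """Axis-aligned box around a set of (x, y) joints.

    `pad` (in pixels) and the rule margin = scale * pad are hypothetical
    choices for illustration, not the formulation used in [6].
    """
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    margin = scale * pad
    x_min, y_min = min(xs) - margin, min(ys) - margin
    x_max, y_max = max(xs) + margin, max(ys) + margin
    return x_min, y_min, x_max, y_max
```

For example, `bounding_box([(10, 40), (30, 20), (50, 60)], scale=1.0, pad=5.0)` encloses the two tips and the shaft with a 5-pixel margin on each side.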

Tables and Results

Precision / Recall / Pixel error (RMSE)
DAU-Net-32-3 Du et al. [7]
Tip1 95.3 / 95.3 / 4.97 99.13 / 99.13 / 5.26
Tip2 97.9 / 97.9 / 4.78 97.58 / 97.58 / 4.61
Shaft 100 / 100 / 3.83 94.12 / 94.12 / 4.93
Avr 97.7 / 97.7 / 4.53 96.9 / 96.9 / 4.93
Table 12: Comparison of DAU-Net-32-3 with the state of the art for 3-point annotation on RMIT dataset.
Sequence 1-4 (Seen instruments)
Precision Recall F1-score Pixel error
Left Clasper 92.5 100 96.1 3.97
Right Clasper 95.9 100 97.9 2.22
Head 100 100 100 3.54
Shaft 100 100 100 3.28
End 100 99.7 99.9 6.05
Avr 97.7 99.9 98.8 3.81
Table 13: Detailed results for the pseudo-labeling algorithm for seen instruments.
Sequence 5-6 (Unseen instruments)
Precision Recall F1-score Pixel error
Left Clasper 63.8 99.6 77.8 7.57
Right Clasper 66.8 71.0 68.8 6.39
Head 86.2 76.9 81.3 4.89
Shaft 97.3 79.0 87.2 8.24
End 88.5 77.0 82.3 8.40
Avr 80.5 80.7 79.5 7.10
Table 14: Detailed results for the pseudo-labeling algorithm for unseen instruments.
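As a quick consistency check on Table 14, the per-joint F1-scores can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The snippet below is a small sketch with the table values copied in by hand; `f1_score` and `rows` are names chosen for this example, and small deviations are expected because the table entries are rounded to one decimal.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent here)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per joint, copied from Table 14.
rows = {
    "Left Clasper": (63.8, 99.6, 77.8),
    "Right Clasper": (66.8, 71.0, 68.8),
    "Head": (86.2, 76.9, 81.3),
    "Shaft": (97.3, 79.0, 87.2),
    "End": (88.5, 77.0, 82.3),
}

for joint, (precision, recall, reported) in rows.items():
    print(f"{joint}: recomputed {f1_score(precision, recall):.2f}, reported {reported}")

# Mean of the reported per-joint F1-scores, for comparison with the "Avr" row.
mean_f1 = sum(reported for _, _, reported in rows.values()) / len(rows)
print(f"Mean of per-joint F1: {mean_f1:.2f}")
```

The recomputed values agree with the table to within rounding. The mean of the per-joint F1-scores matches the reported average of 79.5, which suggests that the "Avr" row averages the per-joint scores rather than taking the harmonic mean of the average precision and recall.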