Reducing Annotating Load: Active Learning with Synthetic Images in Surgical Instrument Segmentation

by   Haonan Peng, et al.
University of Washington

Accurate instrument segmentation in endoscopic vision of robot-assisted surgery is challenging due to reflection on the instruments and frequent contact with tissue. Deep neural networks (DNNs) show competitive performance and have been favored in recent years. However, the hunger of DNNs for labeled data creates a huge annotation workload. Motivated by alleviating this workload, we propose a general, embeddable method to decrease the usage of labeled real images, using actively generated synthetic images. In each active learning iteration, the most informative unlabeled images are first queried by active learning and then labeled. Next, synthetic images are generated based on these selected images. The instruments and backgrounds are cropped out and randomly combined with each other, with blending and fusion near the boundary. The effectiveness of the proposed method is validated on 2 sinus surgery datasets and 1 intra-abdominal surgery dataset. The results indicate a considerable improvement in performance, especially when the budget for annotation is small. The effectiveness of different types of synthetic images, blending methods, and external backgrounds is also studied. All the code is open-sourced at:




I Introduction

Minimally invasive surgery (MIS) has seen rapid development in recent years in applications such as intra-abdominal surgery and otolaryngology, and can improve patient outcomes and recovery [sayari2019review], [peters2018review]. In MIS, endoscopes provide real-time vision of the surgical site. One of the most important components of understanding endoscopic surgical images is the segmentation of instruments, and much recent research applies deep learning to this task [maier2017surgical], [shvets2018automatic]. However, the scarcity and cost of labeled data remain a major challenge for many learning-based methods, especially in medical practice, where data are limited and sometimes only trained experts can annotate images with high quality [cheplygina2019not], [yang2017suggestive].

Recently, using synthetic data to alleviate the annotation workload has drawn more attention, especially image-to-image translation techniques such as generative adversarial networks (GANs) [wang2021review]. Synthetic images generated from simulation have accurate labels without manual work, but the domain gap between synthetic and real datasets can reduce the accuracy of a model trained on synthetic images when it is evaluated on real images. Thus, domain randomization or domain adaptation is usually needed to address this gap [tobin2017domain], [kouw2019review]. Although their performance is competitive, properly setting up GAN-based simulation environments for each application still requires a considerable amount of work.

In contrast, generating synthetic images by cutting and pasting generalizes to many segmentation tasks. For endoscopic sinus surgery, the reflections on metallic instruments, as well as blur, liquids, and occlusion on the tissue-instrument boundary, make it even harder to perform image-to-image translation [lin2020lc]. An alternative to generating synthetic images from simulation is copying and pasting real object images onto real background images. This has been shown to be an efficient way to generate synthetic images without concern about the domain gap [dwibedi2017cut]. Compared with simulation, however, this method requires a certain amount of labeled data.

Another popular method to reduce the usage of real labeled data is active learning (AL) [budd2021survey], which goes back as far as 1988 [angluin1988queries]. When combined with deep learning, it achieves faster convergence and increased performance with fewer data. AL uses a query criterion to select the most uncertain or informative samples from an unlabeled dataset and asks for their annotation [gorriz2017cost]. This is suitable for surgical instrument segmentation because there are plenty of unlabeled videos but annotation is expensive [qin2020towards].

In this work, we develop a copy-and-paste method to generate synthetic images, combined with active learning. We use active learning to choose the most informative unlabeled images to annotate, and use the copy-and-paste method to generate synthetic images which can ‘make best use of’ the selected real images. We then show experimentally that a segmentation model trained with the synthetic images and a smaller number of selected real images has competitive performance compared to models trained on fully labeled real datasets. Three open-source datasets are used in the experiments: the two parts of the UW-Sinus-Surgery-C/L Dataset [qin2020towards] and the EndoVis 2017 Dataset [allan20192017].

II Related Work

Fig. 1: Workflow of the system

With the vigorous development of deep learning in recent years, cutting-edge performance in surgical instrument segmentation is achieved by deep convolutional neural networks [qin2020towards], [islam2019learning], [kalinin2020medical]. However, the hunger of deep models for large amounts of labeled data draws attention to reducing the annotation workload, especially in medical image segmentation [rajotte2021reducing], [fujita2020deep]. Using synthetic images is an intuitive approach. Among approaches for generating accurate, labeled synthetic images, copying the target object from one image and pasting it onto another image is feasible and relatively easy to implement. Dwibedi et al. propose a ‘cut’ and ‘paste’ method to generate synthetic images and train neural networks for kitchen object detection. Their results show that simply copying and pasting can produce artifacts, such as aliasing of boundaries, which decrease learning performance. By improving the blending, synthetic data combined with real data can reach competitive performance [dwibedi2017cut]. Ghiasi et al. studied copy-paste data augmentation for instance segmentation and described a simple copy-pasting mechanism that improved the performance of strong baselines and could also be combined with semi-supervised methods such as self-training. Unlike [dwibedi2017cut], their study shows that simple pasting without any blending performs similarly to blended pasting [ghiasi2021simple]. Remez et al. present an object instance segmentation method with weakly supervised cut-and-paste adversarial learning. Detection boxes and Faster R-CNN features are passed into a mask generator, which outputs segmentation masks that can be used to cut and paste the object into a new image location. A discriminator then tries to distinguish between real and synthetic images and selects for training the ones that can improve learning performance [remez2018learning].

GANs [goodfellow2020generative] have also been used to generate synthetic medical images [singh2021medical], [yoo2020generative]. GANs train two networks simultaneously, a generator and a discriminator: the generator is trained to generate synthetic images that can fool the discriminator, while the discriminator is trained to distinguish real images from synthetic images. Recent implementations include image-to-image translation from simulated images to real images [colleoni2021robotic], and from cadaver images to live images [lin2020lc].

Active learning is dedicated to selecting and labeling the most informative training images, so that near-optimal performance is reached with the fewest annotations (human effort) [tajbakhsh2020embracing], [kim2020active]. Typically, in active learning, unlabeled images are selected by criteria such as maximum entropy or least confidence [holub2008entropy], [roels2019cost], [schein2007active]. In some cases, however, these criteria do not outperform random selection [yang2017suggestive], [belharbi2021deep]. Thus, more advanced criteria such as Bayesian active learning by disagreement (BALD) have been proposed [houlsby2011bayesian]. In BALD, Bayesian networks can be obtained by applying Monte Carlo dropout (MC dropout) to the network to generate a population of modified networks. The BALD criterion combines a high overall uncertainty with a term that increases the weight on disagreement among the population. Saidu et al. presented a study on semantic segmentation of prostate medical images with active learning, in which the BALD criterion outperformed maximum entropy, especially when the budget for annotation is small [gal2017deep]. Tran et al. proposed Bayesian generative active deep learning, which combines active learning and GAN data augmentation. The evaluation on image classification tasks suggests that the combined method outperforms each method alone.

In our proposed method, copy-and-paste synthetic images are generated based on informative real images selected by AL. Segmentation models trained on these images outperform models trained only on AL-chosen real images or on randomly generated synthetic images. In synthetic images, fusion near the boundary of the instrument reduces the artifacts caused by copy-and-paste and thus improves the performance of segmentation near the boundary. By modifying several parameters, the proposed method is easily generalized to three different datasets.

III Methods

Fig. 2: Generation of a synthetic image. Note that the color-coded image is only for visualization: the solid green line indicates the outline of the instrument, the yellow area comes from the background image and the blue area comes from the instrument image. A transition area can be found around the boundary of the instrument on the synthetic image.

III-A System Workflow

Fig. 1 shows the workflow of the entire system (details below). The overall objective of the proposed system is to use active learning to choose the most informative samples from the unlabeled pool of the database and ask for their annotation. Synthetic images are then generated and added to the labeled pool together with the real images, to ‘make best use of’ the real images identified as most informative. The goal is to train a segmentation model on fewer labeled real images while keeping competitive performance.

The database (described in detail in IV-A) consists of a labeled pool and an unlabeled pool of images. For semantic segmentation of endoscopic robotic surgery videos, unlabeled videos and images are not difficult to obtain. Some of the datasets contain background images in which no surgical tools are present. Compared with labeling instrument masks, manually selecting the background-only images to form the background pool is not costly.

Initially, some real images are randomly chosen and moved from the unlabeled pool to the labeled pool with annotation. Synthetic images are first generated using the images in the labeled pool and are then added to the labeled pool. Next, a segmentation model is trained using the labeled pool and then makes predictions on the images of the unlabeled pool. Then, uncertainty estimation is applied on the predictions and the most informative images are queried by the BALD active learning criterion, asking for annotation. If the dataset does not have background images or has few background images, background inpainting of instrument pixels in selected labeled real images is performed and the generated backgrounds are added to the background pool. Synthetic images are then generated based on the newly labeled images. For each real image, there are two types of synthetic images. Type-1 synthetic images have the same surgical tool as the original real image, and the background is randomly selected from the background pool. Type-2 synthetic images have the same background as the original real image (background inpainting is applied to the original real image to remove the original tool), and the surgical tool is randomly selected from the labeled pool.
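The pairing rule for the two types of synthetic images can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `plan_synthetic_pairs`, the string identifiers, and the choice to exclude the real image itself when drawing a Type-2 instrument are all assumptions made for the example.

```python
import random

def plan_synthetic_pairs(real_id, labeled_ids, background_ids,
                         n_type1, n_type2, rng):
    """Return (instrument_source, background_source) pairs for one real image."""
    pairs = []
    for _ in range(n_type1):
        # Type-1: same instrument as the real image, random background from the pool.
        pairs.append((real_id, ("pool", rng.choice(background_ids))))
    # Assumption: a Type-2 image draws its instrument from a different labeled image.
    other_ids = [i for i in labeled_ids if i != real_id]
    if n_type2 and not other_ids:
        raise ValueError("Type-2 needs at least one other labeled image")
    for _ in range(n_type2):
        # Type-2: inpainted background of the real image, random labeled instrument.
        pairs.append((rng.choice(other_ids), ("inpainted", real_id)))
    return pairs
```

Each returned pair names where the instrument and the background of one synthetic image come from; the actual cropping, transformation and blending happen afterwards.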

After the generation, the newly labeled real images and the synthetic images are added to the labeled pool and then the segmentation model is trained again to start a new iteration, until the labeling budget or the desired performance is reached. Budget in this paper is defined as the fraction of training images which are given annotation, compared to the total number of training images.
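The iteration described in this section can be summarized as a pool-based loop. The sketch below is only an outline under stated assumptions: `train`, `score` and `synthesize` are hypothetical callables standing in for model training, the informativeness criterion and synthetic image generation, and the near-even split of the queried half of the budget across rounds is an illustrative choice.

```python
import random

def run_active_learning(unlabeled, budget, n_rounds, train, score, synthesize, seed=0):
    """Pool-based loop: random initial half, then top-scoring queries per round."""
    rng = random.Random(seed)
    unlabeled = list(unlabeled)
    rng.shuffle(unlabeled)
    n_init = budget // 2
    real = unlabeled[:n_init]            # initial random images, sent for annotation
    unlabeled = unlabeled[n_init:]
    pool = real + synthesize(real)       # labeled pool = real + synthetic images
    model = train(pool)
    for round_idx in range(n_rounds):
        remaining = budget - len(real)
        k = remaining // (n_rounds - round_idx)   # near-even split of what is left
        # Rank the unlabeled images by informativeness under the current model.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        queried, unlabeled = ranked[:k], ranked[k:]
        real = real + queried
        pool = pool + queried + synthesize(queried)
        model = train(pool)              # retrain to start the next iteration
    return model, real, pool
```

The loop stops when the budget of real images is used up; a stopping rule based on reaching a desired performance could be added in the same place.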

III-B Generation of Synthetic Image

One synthetic image is generated from a real labeled image containing a surgical tool, and from a background image, either a real background or an inpainting background. The overall goal is to copy the tool from the instrument image and paste it on the background image, with resizing, movement and fusion. Fig. 2 shows the workflow of the generation of synthetic images.

The procedure starts with two images: $I_t$, a labeled real image that includes an instrument, and $I_b$, a pure background. The instrument image also has a mask $M$, a binary matrix with the same size as $I_t$ in which the instrument pixels are 1 and all other pixels are 0. Resizing, movement and rotation are first applied to the instrument image and the mask:

$$I_t' = T(I_t;\, s, w, h, \theta), \qquad M' = T(M;\, s, w, h, \theta)$$

where $T$ is the transformation operator, and $s$, $w$, $h$ and $\theta$ are the factor of resizing, the movement in width and height, and the angle of rotation, respectively. These operations are applied sequentially. Then, a binary dilation is applied to the new mask $M'$, so that in the dilated mask $M_d$ the region of the instrument is larger than in the true mask $M'$:

$$M_d = M' \oplus K_d = \bigcup_{b \in K_d} M'_b$$

where $K_d$ is a $k_d \times k_d$ matrix, $k_d$ is the dilation kernel size, and $M'_b$ is the translation of $M'$ by $b$. After dilation, the fusion mask $M_f$ is generated by applying an average blur or a Gaussian blur [young1995recursive] to the mask $M_d$:

$$M_f = B_{avg}(M_d;\, k_f) \tag{4}$$

$$M_f = B_{gau}(M_d;\, k_f, \sigma_f) \tag{5}$$

where $B_{avg}$ and $B_{gau}$ are the operators of average blurring and Gaussian blurring, respectively, $k_f$ is the kernel size and $\sigma_f$ is the standard deviation. Two different fusion and blending methods are used because, according to [dwibedi2017cut], the artifacts from the copy-and-paste operation may decrease performance when the model is trained on the synthetic images. In particular, having two otherwise identical synthetic images that differ only in the blending method in the training set can prevent the model from learning blending artifacts and improves performance on real images, as Fig. 3 shows. We study the effectiveness of this idea on our images in IV-E.

Fig. 3: Multi-blending: the images on the left and right have the same size and position of instrument and background, but different blending methods (average fusion on the left and Gaussian fusion on the right) and parameters.
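The construction of the fusion mask (binary dilation followed by a blur) can be sketched with plain NumPy. This is a minimal illustration under assumptions: the helper names are hypothetical, kernels are square and assumed odd, borders are zero-padded, and only the average-blur variant is shown.

```python
import numpy as np

def dilate(mask, k):
    """Binary dilation with a k x k square kernel (k odd), via shifted ORs."""
    r = k // 2
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), r, mode="constant")
    out = np.zeros((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def average_blur(img, k):
    """Box filter with a k x k kernel (k odd) and zero padding."""
    r = k // 2
    h, w = img.shape
    padded = np.pad(img.astype(float), r, mode="constant")
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def fusion_mask(mask, dilation_k=15, blur_k=11):
    """Soft mask in [0, 1]: 1 well inside the instrument, fading to 0 outside."""
    return average_blur(dilate(mask, dilation_k), blur_k)
```

Because the mask is dilated before blurring, the soft transition lies mostly outside the true instrument boundary, so instrument pixels keep a weight close to 1.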

In the UW-Sinus-Surgery-C/L Dataset, it is visually apparent that instruments acquire coloration from the background through a diffuse reflection, as shown in Fig. 4. To capture this effect in our synthetic images, color and brightness adjustment is applied first to narrow the gap of the color style of the instrument and the background, for each channel of the adjusted image:


where is the factor of color adjustment, is the factor of brightness adjustment, and are the same channel of the background image and instrument image, respectively.

Fig. 4: Example images from the UW-Sinus-Surgery-C/L Dataset, where the appearance of the instrument differs due to reflection.

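After the adjustment of color and brightness, the instrument is blended onto the background by:

$$I_c = M_f \odot I_a + (\mathbf{1} - M_f) \odot I_b$$

where $\odot$ indicates element-wise multiplication, $\mathbf{1}$ is a matrix of ones with the same size as the fusion mask $M_f$, $I_a$ is the adjusted instrument image and $I_b$ is the background image. After the instrument and the background are combined, to imitate endoscopic vision, we apply a weak Gaussian blur and trim the border to restore the outline, which finalizes the synthetic image $I_s$ and the corresponding mask $M_s$. For the UW-Sinus-Surgery-C/L Dataset:

$$I_s = C\big(B_{gau}(I_c;\, k_g, \sigma_g);\, c, r\big), \qquad M_s = C(M';\, c, r)$$

where $C$ is the circle trimming operator, $c$ and $r$ are the center and radius of the trimming circle, respectively, $B_{gau}$ is the operator of Gaussian blur with kernel size $k_g$ and standard deviation $\sigma_g$, and $M'$ is the transformed instrument mask. Similarly, for the EndoVis 2017 dataset, because the field of view is rectangular:

$$I_s = R\big(B_{gau}(I_c;\, k_g, \sigma_g);\, w_t, w_b, w_l, w_r\big), \qquad M_s = R(M';\, w_t, w_b, w_l, w_r)$$

where $R$ is the rectangular trimming operator and $w_t$, $w_b$, $w_l$, $w_r$ are the trimming widths at the top, bottom, left and right, respectively. Fig. 5 shows some examples of Type-1 and Type-2 synthetic images.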

Fig. 5: Original real images (left), Type-1 synthetic images (center) and Type-2 synthetic images (right). Type-1 has the same instrument but a different background, and type-2 has the same background but a different instrument.
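The blending and trimming steps can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the circle center is assumed to be given as (row, column), trimmed pixels are assumed to be set to zero, and the weak Gaussian blur applied before trimming is omitted for brevity.

```python
import numpy as np

def blend(instrument, background, fusion):
    """Per-pixel convex combination: fusion * instrument + (1 - fusion) * background."""
    alpha = fusion[..., None] if instrument.ndim == 3 else fusion
    return alpha * instrument + (1.0 - alpha) * background

def circle_trim(image, center, radius):
    """Zero every pixel farther than `radius` from `center` = (row, col)."""
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    inside = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    keep = inside[..., None] if image.ndim == 3 else inside
    return image * keep

def rect_trim(image, top, bottom, left, right):
    """Zero a border of the given widths on each side of the image."""
    out = image.copy()
    h, w = image.shape[:2]
    out[:top] = 0
    out[h - bottom:] = 0
    out[:, :left] = 0
    out[:, w - right:] = 0
    return out
```

The same trimming function would be applied to the mask, so that the label stays aligned with the visible part of the synthetic image.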

III-C Inpainting of Backgrounds

As introduced in III-B, a synthetic image can be generated from a labeled instrument image and a background image. However, it is not always possible to find background images in every dataset. Thus, for those datasets without backgrounds, image inpainting is performed to generate backgrounds from labeled instrument images.

Fig. 6: Generation of inpainting backgrounds

Fig. 6 shows the procedure of background inpainting. The inpainting of the background image is similar to the generation of synthetic images. The difference is that, for the inpainting of a background, an area of background is blended over the instrument pixels, instead of an instrument being blended over a background. There are 2 types of inpainting: self-inpainting and external inpainting. First, self-inpainting can be performed by flipping or rotating the original image (12), provided that the flipped or rotated mask does not overlap with the original mask (13):

$$I_{in} = (\mathbf{1} - M_f) \odot I_t + M_f \odot I_t^{fr} \tag{12}$$

$$M \cap M^{fr} = \emptyset \tag{13}$$

where $I_t$ is the image including the instrument, $I_{in}$ is the inpainting background with the instrument removed, $M_f$ is the fusion mask generated by the method described in III-B, $M$ is the original instrument mask, and $I_t^{fr}$ and $M^{fr}$ are the flipped or rotated image and mask. Flipping can be applied vertically or horizontally, and rotation can be applied by 90°, 180° and 270°.

However, sometimes the masks $M$ and $M^{fr}$ always overlap regardless of flipping and rotation. In this case, external inpainting can be performed by randomly selecting another background from the background pool as a source for the pixels covering the instrument:

$$I_{in} = (\mathbf{1} - M_f) \odot I_t + M_f \odot I_b$$

where $I_b$ is the background (original or inpainting) from the background pool and the other variables are the same as in self-inpainting. Fig. 7 shows the comparison between original (real) backgrounds and inpainting (synthetic) backgrounds.

Fig. 7: Example of inpainting backgrounds (top) and real backgrounds (bottom).
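Self-inpainting with a fallback to external inpainting can be sketched as follows. The function names are hypothetical, the order in which flips and rotations are tried is an assumption, and quarter-turn rotations are only attempted for square images so that the shapes match.

```python
import numpy as np

def candidate_views(image, mask):
    """Yield flipped and rotated (image, mask) pairs with the original shape."""
    yield image[::-1], mask[::-1]                    # vertical flip
    yield image[:, ::-1], mask[:, ::-1]              # horizontal flip
    yield np.rot90(image, 2), np.rot90(mask, 2)      # 180 degree rotation
    if mask.shape[0] == mask.shape[1]:               # quarter turns need a square image
        yield np.rot90(image, 1), np.rot90(mask, 1)
        yield np.rot90(image, 3), np.rot90(mask, 3)

def self_inpaint(image, mask, fusion):
    """Cover the instrument with the image's own flipped or rotated background.

    Returns None when every candidate mask overlaps the original mask.
    """
    alpha = fusion[..., None] if image.ndim == 3 else fusion
    for view, view_mask in candidate_views(image, mask):
        if not np.any((mask > 0) & (view_mask > 0)):
            return (1.0 - alpha) * image + alpha * view
    return None

def external_inpaint(image, fusion, external_background):
    """Cover the instrument with pixels from another background image."""
    alpha = fusion[..., None] if image.ndim == 3 else fusion
    return (1.0 - alpha) * image + alpha * external_background
```

A caller would try `self_inpaint` first and use `external_inpaint` with a randomly drawn pool background only when it returns `None`.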

III-D Active Learning

Active learning (AL) is a feasible way to reduce the annotation load while keeping competitive performance for a model trained on limited data. The proposed system uses the pool-based AL method shown in Fig. 1, in which there are a limited number of labeled images and plenty of unlabeled images. The active learning iteration is introduced in III-A, and BALD is used as the criterion to query unlabeled images.

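The BALD criterion chooses the images that are expected to maximise the mutual information between predictions and the model posterior. To perform BALD, Monte Carlo (MC) dropout, a frequently used stochastic regularization technique, is applied during training and inference.

$$\mathbb{I}[y, \omega \mid x, D] = \mathbb{H}[y \mid x, D] - \mathbb{E}_{p(\omega \mid D)}\big[\mathbb{H}[y \mid x, \omega]\big]$$

where $x$ is the input image and $y$ is the output label, $\mathbb{H}[y \mid x, D]$ and $\mathbb{H}[y \mid x, \omega]$ are the entropy [shannon1948mathematical] of the prediction and of the distribution $p(y \mid x, \omega)$, respectively, $D$ is the labeled training data, and $\omega$ are the model weights. The first term is the entropy of the prediction averaged over the sampled models, so it is high for images with high overall uncertainty. The second term is the average entropy of the individual sampled models; subtracting it acts as a penalty, so that images on which the models disagree are kept and images on which many models are individually unconfident are dropped.

To be more specific, the BALD criterion for semantic segmentation is implemented for each pixel $i$ as:

$$U_i = -\sum_{c=1}^{N_c} \bar{p}_{i,c} \log \bar{p}_{i,c} + \frac{1}{N_m} \sum_{m=1}^{N_m} \sum_{c=1}^{N_c} p^{m}_{i,c} \log p^{m}_{i,c}, \qquad \bar{p}_{i,c} = \frac{1}{N_m} \sum_{m=1}^{N_m} p^{m}_{i,c}$$

where $N_c$ is the number of classes, $N_m$ is the number of committee members (models trained and inferred with MC dropout), and $p^{m}_{i,c}$ is the softmax probability of class $c$ at pixel $i$ given by member $m$.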
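A per-pixel BALD score can be computed from a stack of stochastic forward passes as in the sketch below. The function names, the array layout `(members, height, width, classes)` and the averaging of the per-pixel map into one image-level score are assumptions made for illustration; the stochastic passes themselves (for example with dropout kept active) are assumed to be produced elsewhere.

```python
import numpy as np

def bald_map(probs, eps=1e-12):
    """Per-pixel BALD score from softmax outputs of shape (T, H, W, C)."""
    mean_p = probs.mean(axis=0)                                        # (H, W, C)
    # Entropy of the averaged prediction: overall uncertainty.
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)    # (H, W)
    # Average entropy of the individual members: uncertainty they all share.
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy

def bald_score(probs):
    """Image-level score; averaging over pixels is an illustrative choice."""
    return float(bald_map(probs).mean())
```

The score is close to zero both when all members agree confidently and when all members are equally unsure, and it is largest when members give confident but conflicting predictions.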

IV Experiments and Results

IV-A Datasets

All the images in the datasets were manually labeled by experts (surgeons and surgical residents). However, ground truth masks are provided only when the images are in the test set or are marked as ‘labeled’. The labeled images can be used for initial training or queried by AL.

The UW-Sinus-Surgery-C/L Dataset [qin2020towards] contains two parts: the live dataset (Sinus-Live) and the cadaver dataset (Sinus-Cadaver). The live dataset is collected from the videos of 3 live surgeries on 3 patients. The duration of the videos is around 2.5 hours in total, with an image resolution of 1920×1280 and a frame rate of 30 fps. 3955 images from the first two videos are used as the training set. 696 background images are generated by subsampling the videos at 3 Hz and manually selecting background-only frames. These selected backgrounds are provided to the system when external backgrounds are asked for. Manually choosing backgrounds is not as costly as manually annotating the images: three subjects without a surgical background were asked to select 700 backgrounds out of 20000 images, and the time spent was 25 min, 17 min and 31 min, respectively. 703 hand-labeled images from the third video are used as the test set. All the images are resized and center cropped to 240×240.

To demonstrate generalization, the images in the test set and the training set come from different videos of different surgical procedures. Because fewer than 15 real images are used in some of our experiments, meaningless images (such as pure black or white images caused by over-exposure or blocking) were manually removed from the training set. However, the meaningless images are not removed from the test set, to ensure that the performance is fairly evaluated.

In the Cadaver dataset, the sinus surgery video dataset is built similarly, collected from 10 surgeries on 5 cadaver specimens. The training set is generated from the first 7 videos and has 2908 images. The test set is generated from the remaining 3 videos and has 1437 images. 597 backgrounds are chosen from the training set videos. The images are also resized and center cropped to 240×240. Because the condition of each cadaver specimen differs, the overall appearance of the images can differ. However, none of the recorded videos shows considerable similarity to real surgeries: humans have no difficulty visually distinguishing cadaver videos from live videos.

The EndoVis 2017 Robotic Instrument Segmentation Dataset is from one of the sub-challenges of MICCAI 2017 [allan20192017]. The images are derived from 10 sequences of abdominal porcine procedures recorded using da Vinci Xi robotic endoscopic surgery systems, when significant instrument motion can be observed. The instruments used include Large Needle Driver, Prograsp Forceps, Monopolar Curved Scissors, Cadiere Forceps, Bipolar Forceps, Vessel Sealer and an ultrasound probe. 300 images are collected per video at 1 Hz, and repetitive images caused by non-moving instruments are manually removed. The selected frames were labeled by a segmentation team at Intuitive Surgical. Although the videos were recorded by a stereo camera, only left-eye images are labeled. We used 900 images (225 per video) with labels, from videos 1-4, as our training set. 900 images and labels from videos 5-8 are used as our test set. Due to the long training time of active learning, the images are resized to 427×240 to reduce computational time.

IV-B Training Details

There are two main groups of parameters, parameters related to the generation of synthetic images, and parameters related to model training and active learning.

The segmentation model used in this paper has the same structure as [qin2020towards]. This is a modified DeepLabv3+ [deeplabv3plus2018] encoder-decoder model with MobileNet [howard2017mobilenets] as the feature extractor. To fit the active learning iterations, the learning rate is increased and the number of training iterations is decreased significantly to accelerate training, at the cost of slightly compromised accuracy. Adam [kingma2014adam] is used as the optimizer, with exponential decay rates of 0.9 and 0.999 for the 1st and 2nd order moment estimates, respectively. The batch size is set to 16. Because we use different budgets of the training set in the experiments, for all the experiments in this paper the number of training iterations is set to be equivalent to 5 epochs on the 100% training set (real images) plus 20 basic epochs, to ensure convergence when a higher proportion of the training set is used. The initial learning rate is set to 0.001 and an exponential decay strategy is applied. The lightweight MobileNet backbone is pretrained on ImageNet. Image augmentation was also applied to the training data for better generalization, including hue, brightness, saturation, contrast, flipping, rotation, zooming and zero-padding. The model is trained on a local NVIDIA GTX Titan X GPU and remote NVIDIA Tesla P100 GPUs from Google Colaboratory.

We tested active learning performance by applying a variable budget of real images. To start, all the images in the dataset were in the unlabeled set and the labels were hidden. For example, if the real-image budget was 394 images (10% of the Live dataset), then 197 (half) real images are randomly chosen first, and the other 197 images are chosen by the BALD criterion in 3 iterations. Once chosen from the unlabeled set, the real images are moved to the labeled set and their labels are revealed. Synthetic images are generated together with their labels, so they need no annotation.

The main parameters of synthetic images generation were empirically chosen and are shown in Table 1. These parameters apply to all the tests unless stated otherwise.

| Parameter | Value (Live / Cadaver / EndoVis) | Description |
|---|---|---|
| (Type-1, Type-2) synthesis per query | (2, 0) / (2, 0) / (0, 1) | A Type-1 synthetic image has the same instrument as the original real image. A Type-2 synthetic image has the same background as the original image. |
| Multi-blending | No / No / Yes | Multi-blending generates two otherwise identical synthetic images that differ only in the blending method, to minimize the effect of artifacts. |
| External backgrounds | Yes / Yes / No | |
| Background inpainting | No / No / Yes | |
| Factor of resizing | [0.9, 1.2] | The ratio of the new size to the original |
| Movement in width | [-0.1, 0.1] / [-0.1, 0.1] / [-0.05, 0.05] | Movements are given as proportions of the image size. For example, w=0.1 on an image with a width of 100 pixels results in a movement of 10 pixels. |
| Movement in height | [-0.1, 0.1] / [-0.1, 0.1] / [-0.05, 0.05] | |
| Angle of rotation | [-30, 30] | In degrees |
| Dilation kernel size | 15 / 15 / 15 | Used to build the fusion mask |
| Fusion blur kernel size | [10, 15] / [5, 10] / [10, 15] | Used to build the fusion mask |
| Factor of color adjustment | [0.4, 1.0] | 1.0 means that the color of the instrument from the original image is not adjusted. Smaller means stronger adjustment. |
| Factor of brightness adjustment | [0.9, 1.3] | The larger the brighter; 1.0 means that the brightness is not adjusted. |
| Center of circle trimming | ([115, 125], [115, 125]) | In pixels |
| Radius of circle trimming | [150, 170] | In pixels |
| Width of rectangular trimming (top, bottom, left, right) | ([6, 9], [6, 9], [71, 74], [71, 74]) | In pixels |
| Kernel size and standard deviation of Gaussian blur | (3, 3) | |

* [a, b] means that the parameter is randomly chosen from the range a to b.
TABLE I: Parameters of the generation of synthetic images

IV-C Evaluation Metrics

Two main evaluation metrics are used in this paper, the Dice similarity coefficient (DSC) and intersection over union (IoU) [taha2015metrics], which are defined as

$$DSC(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad IoU(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the set of foreground pixels of the prediction, $G$ is the corresponding ground truth, and $|\cdot|$ is the counting operation. To study the effect of blending and fusion on the performance of segmentation near the boundary, IoU near boundary (IoUNB) is also used as an additional metric:

$$IoU_{NB}(P, G) = \frac{|(P \cap G) \cap N|}{|(P \cup G) \cap N|}$$

where $N$ denotes the near-boundary binary mask, a band region 20 pixels wide around the instruments’ boundary. The mean values of these three metrics are calculated over each test, denoted as mDSC, mIoU and mIoUNB.
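The three metrics can be computed on binary masks as in the following sketch. The function names are hypothetical, and the way the near-boundary band is built (pixels within `half_width` of the ground-truth boundary on either side, using a square neighborhood) is an assumption, since the exact construction is not given here.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    total = pred.sum() + gt.sum()
    return 2.0 * (pred & gt).sum() / total if total else 1.0

def iou(pred, gt, region=None):
    """Intersection over union, optionally restricted to a boolean region."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if region is not None:
        pred, gt = pred & region, gt & region
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 1.0

def _dilate(mask, r):
    """Binary dilation of a boolean mask with a (2r+1) x (2r+1) square."""
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant")
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_band(gt, half_width):
    """Pixels within half_width of the ground-truth boundary, on either side."""
    gt = gt.astype(bool)
    eroded = ~_dilate(~gt, half_width)   # erosion; image borders are not eroded
    return _dilate(gt, half_width) & ~eroded
```

IoU near the boundary is then `iou(pred, gt, region=boundary_band(gt, half_width))`, and the mean over a test set gives mDSC, mIoU and mIoUNB.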

IV-D Experiment 1: Usage of real images

We compare the proposed method with baseline results, which are obtained when 100% of the real images in the training set are labeled and used to train the segmentation model. Additionally, we study the performance when different annotation budgets are used. Budgets are set as proportions of the total real images in the training set, from 1% to 100%. For example, 1% (39), 5% (197) and 10% (395) images were used from the Sinus-Live training dataset (3955 images in total), and 1% (29), 5% (145) and 10% (290) images were used from the Sinus-Cadaver training dataset (2908 images in total). However, training on the EndoVis 2017 dataset did not converge at the 1% or 2% budget because of too few images. Thus, the evaluation of this dataset started at 5%. For each budget, 4 tests were performed: 1) randomly chosen training images without synthetic images, 2) BALD implemented, with half chosen by BALD and the other half chosen randomly, without synthetic images, 3) randomly chosen images with generated synthetic images, 4) BALD implemented with generated synthetic images.

Fig. 8: Evaluation (mDSC) of the model with different budgets of real images, with and without active learning (BALD) and synthetic images (SYN). Note that the horizontal axes are nonlinear.

The evaluation results of mDSC are given in Fig. 8, and more details of mIoU and mIoUNB are shown in Table 4 in Appendix. Each entry is the average of 5 repeated tests with different global random seeds to reduce the influence of randomness, which applies to all the experiments in this paper. From Fig. 8, it can be seen that by generating synthetic images with active learning, the performance of segmentation is significantly improved when the number of real labeled images (budget) is small. Compared with randomly chosen images and no synthetic images, the average improvement of the proposed method in mDSC with small budgets (less than 10% of the training set) is 5.31%, 8.15% and 3.41% on the Sinus-Live, Sinus-Cadaver and EndoVis 2017 datasets, respectively. Specifically, the improvements on the Sinus-Live and Sinus-Cadaver datasets with only a 1% budget are 6.67% and 12.20%. Overall, as the labeled image budget increases, the improvement becomes smaller. However, a considerable improvement can still be achieved when the budget is 100% of the training set and BALD active learning is not applicable. When the budget is 100%, generation of additional synthetic images still results in an improvement of 2.29%, 0.86% and 1.54% on the three datasets, respectively. Compared with the results on the sinus datasets, the average improvement on the EndoVis 2017 dataset is not as significant, but the result of the proposed method begins to outperform the baseline result at only 10% usage of the training set.

Table IV in the Appendix shows similar trends in mIoU and in mIoU near the boundary (mIoUNB). The average improvements of the proposed method with small budgets (no more than 10%) are significant in mIoU (7.11, 10.02 and 4.93 percentage points on the three datasets) as well as in mIoUNB (5.22, 7.42 and 6.39 points). Considerable improvements in mIoU can still be seen when the budget is 100% of the training set: 3.04, 1.33 and 2.14 points. However, for mIoUNB, only the result on the Sinus-Live dataset shows an improvement (4.52 points), while the results on the other two datasets are close to the baseline results.

IV-E Experiment 2: Number and type of synthetic images

As introduced in III-B, there are 2 types of synthetic images. For each chosen and labeled real image, Type-1 synthetic images have the same instrument and Type-2 synthetic images have the same background as the real image. Multi-blending is also reported in [dwibedi2017cut] to avoid the decrease in performance caused by artifacts near the boundary in synthetic images. This experiment is therefore performed to study the effectiveness of the 2 types of synthetic images and of multi-blending. To better compare the results, the experiments are separated into 3 groups. Each group tests the same number of generated images per BALD query. For example, in Table II, tests in Group 2 use 8 synthetic images for each queried real image (either 4 Type-1 and 4 Type-2 images without multi-blending, or 2 of each with multi-blending). Consequently, tests within a group have the same number of training iterations, so the model is trained for the same number of steps in each case. To ensure convergence of training, instead of keeping the number of training iterations fixed across groups, the number of training epochs in Groups 2 and 3 is the same as in Group 1, so that Groups 2 and 3 are trained for four and six times as many iterations as Group 1, respectively. All the other parameters are set as in Table I, and the annotation budget is fixed at 10% of the training set.
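The distinction between the two types can be written down as a pairing rule over instrument and background sources. The sketch below is only an illustration under assumptions: the function name and the `(instrument_id, background_id)` representation are hypothetical, counts are taken as whole numbers, and the partner image is drawn uniformly at random from the other labeled images, which may differ from the sampling used in the paper.

```python
import random


def plan_synthetic(real_id, all_ids, n_type1, n_type2, rng):
    """Return (instrument_id, background_id) pairs for one labeled real image."""
    others = [i for i in all_ids if i != real_id]
    # Type-1: keep the instrument of the real image, swap in another background.
    type1 = [(real_id, bg) for bg in rng.choices(others, k=n_type1)]
    # Type-2: keep the background of the real image, swap in another instrument.
    type2 = [(inst, real_id) for inst in rng.choices(others, k=n_type2)]
    return type1 + type2


if __name__ == "__main__":
    rng = random.Random(0)
    print(plan_synthetic(3, list(range(6)), n_type1=2, n_type2=1, rng=rng))
```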

The results are shown in Table II. The Type-1 and Type-2 parameters give the number of Type-1 and Type-2 synthetic images generated for each labeled real image selected by the active learning mechanism. A “multi-blending” value of 1 means that each synthetic image is generated once and multi-blending is not applied. A multi-blending value of 2 means that for each synthetic image blended by average fusion (4), there is another similar synthetic image blended by Gaussian fusion (5). In Group 1, for the two sinus surgery datasets, the best mDSC and mIoU were achieved with 2 Type-1 synthetic images, while the best mIoUNB was achieved with 1 Type-2 synthetic image with multi-blending on the Sinus-Live dataset, and with 1 Type-1 synthetic image with multi-blending on the Sinus-Cadaver dataset. For the EndoVis 2017 dataset, 1 Type-2 synthetic image with multi-blending gives the best mDSC and mIoU, and 1 Type-1 synthetic image with multi-blending gives the best mIoUNB.

In Group 2, although the number of training steps is increased significantly compared to Group 1, performance decreases on the Sinus-Cadaver and EndoVis 2017 datasets. Within the group, there is no considerable difference between results with and without multi-blending on the Sinus-Live and EndoVis 2017 datasets. However, multi-blending increases performance significantly on the Sinus-Cadaver dataset. A similar trend can be observed in Group 3: although no considerable difference is seen in mDSC and mIoU on the Sinus-Live and EndoVis 2017 datasets, multi-blending improves mIoUNB by around 1 percentage point.

| Group | Type-1 | Type-2 | M-blend | Live mDSC | Live mIoU | Live mIoUNB | Cadaver mDSC | Cadaver mIoU | Cadaver mIoUNB | EndoVis mDSC | EndoVis mIoU | EndoVis mIoUNB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | No synthetic image | | | 71.70 | 64.19 | 55.47 | 69.35 | 61.81 | 52.76 | 79.55 | 68.06 | 67.88 |
| 1 | 1 | 1 | 1 | 74.74 | 68.24 | 59.34 | 73.32 | 66.89 | 56.15 | 81.94 | 71.77 | 71.76 |
| 1 | 0.5 | 0.5 | 2 | 74.64 | 68.09 | 60.54 | 74.30 | 67.80 | 58.31 | 81.47 | 71.19 | 71.14 |
| 1 | 2 | 0 | 1 | 76.97 | 70.47 | 61.65 | 76.59 | 69.93 | 59.22 | 81.37 | 71.14 | 72.62 |
| 1 | 1 | 0 | 2 | 74.75 | 68.50 | 60.16 | 75.29 | 68.45 | 59.24 | 81.34 | 71.35 | 74.24 |
| 1 | 0 | 2 | 1 | 75.07 | 68.67 | 60.63 | 71.90 | 65.77 | 55.69 | 82.25 | 72.03 | 71.60 |
| 1 | 0 | 1 | 2 | 76.80 | 70.38 | 63.23 | 72.85 | 66.44 | 56.34 | 82.43 | 72.46 | 73.31 |
| 2 | 4 | 4 | 1 | 78.03 | 72.23 | 62.96 | 70.86 | 65.33 | 54.84 | 80.90 | 70.48 | 68.50 |
| 2 | 2 | 2 | 2 | 77.57 | 71.89 | 62.91 | 76.09 | 70.32 | 59.67 | 80.96 | 70.39 | 68.57 |
| 3 | 6 | 6 | 1 | 77.42 | 71.73 | 61.95 | 69.90 | 64.46 | 54.21 | 80.43 | 69.77 | 66.96 |
| 3 | 3 | 3 | 2 | 77.91 | 72.15 | 62.79 | 74.90 | 69.45 | 58.91 | 80.95 | 70.29 | 67.61 |

Type-1, Type-2 and M-blend: synthetic images per real image. All other columns: performance (%).
*The bold font indicates the best performance in each group. The Type-1 and Type-2 parameters give the number of Type-1 and Type-2 synthetic images generated for each labeled real image selected by the active learning mechanism. An M-blend value of 1 means that each synthetic image is generated once and multi-blending is not applied. An M-blend value of 2 means that for each synthetic image blended by average fusion (4), there is another similar synthetic image blended by Gaussian fusion (5).
TABLE II: Segmentation Result with Different Types of Synthetic Images and Multi-blending
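The grouping in Table II can be checked with a few lines of arithmetic. The formula used here, (Type-1 + Type-2) × M-blend synthetic images per labeled real image, is inferred from the table and the description of multi-blending rather than stated explicitly in the text; the names are illustrative.

```python
# (Type-1, Type-2, M-blend) settings copied from Table II, keyed by group
TABLE_II_SETTINGS = {
    1: [(1, 1, 1), (0.5, 0.5, 2), (2, 0, 1), (1, 0, 2), (0, 2, 1), (0, 1, 2)],
    2: [(4, 4, 1), (2, 2, 2)],
    3: [(6, 6, 1), (3, 3, 2)],
}


def synthetic_per_real(type1, type2, m_blend):
    """Synthetic images generated per labeled real image (assumed formula)."""
    return (type1 + type2) * m_blend


if __name__ == "__main__":
    for group, settings in TABLE_II_SETTINGS.items():
        print(group, [synthetic_per_real(*s) for s in settings])
```

Under this reading, every test in Groups 1, 2 and 3 uses 2, 8 and 12 synthetic images per real image, respectively, which matches the four-fold and six-fold increase over Group 1 mentioned in the text.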

IV-F Experiment 3: Strength of Fusion and Blending

As introduced in III-B, the fusion and blending at the border between the instrument and the background of a synthetic image are controlled by 2 parameters, the dilation kernel size d and the fusion kernel size k. Generally speaking, a larger d means that a larger area around the instrument in the original real image is blended onto the new synthetic image, and a larger k results in a wider transition area near the fusion border. Fig. 9 shows an example of 3 images generated with weak, medium and strong fusion.

Fig. 9: Synthetic images generated with weak fusion d=15, k=10 (left, less instrument background is retained), medium fusion d=40, k=30 (middle) and strong fusion d=60, k=45 (right, more instrument background is retained).

Thus, to study the effectiveness of fusion and blending, different combinations of d and k are evaluated while all the other parameters are held fixed according to Table I. The results are shown in Table III. Groups 0, 1, 2 and 3 correspond to no fusion, weak fusion, medium fusion and strong fusion, respectively. The difference in performance is around 1-3 percentage points, not as large as in the previous experiment. Most of the best results are found in Group 1. However, for the Sinus-Live and EndoVis 2017 datasets, the best mIoUNB results are found in Group 3.

| Group | Dilation Kernel | Fusion Kernel | Live mDSC | Live mIoU | Live mIoUNB | Cadaver mDSC | Cadaver mIoU | Cadaver mIoUNB | EndoVis mDSC | EndoVis mIoU | EndoVis mIoUNB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No fusion | | 74.03 | 67.39 | 59.27 | 72.77 | 65.97 | 57.38 | 82.31 | 72.26 | 72.98 |
| 1 | 15 | [5, 10] | 73.85 | 66.94 | 59.83 | 74.90 | 68.16 | 58.02 | 82.82 | 72.81 | 72.26 |
| 1 | 15 | [10, 15] | 74.96 | 68.44 | 59.87 | 74.48 | 67.59 | 57.39 | 82.81 | 72.94 | 73.25 |
| 1 | 15 | [15, 20] | 74.75 | 67.88 | 58.34 | 74.69 | 67.78 | 57.58 | 82.24 | 72.11 | 72.16 |
| 2 | 40 | [20, 30] | 74.54 | 66.68 | 58.78 | 72.96 | 65.45 | 55.73 | 82.55 | 72.53 | 72.81 |
| 2 | 40 | [30, 40] | 74.11 | 67.12 | 59.93 | 72.59 | 65.44 | 56.77 | 82.71 | 72.76 | 72.93 |
| 2 | 40 | [40, 50] | 73.13 | 66.16 | 57.30 | 74.45 | 67.46 | 57.62 | 81.98 | 71.71 | 71.46 |
| 3 | 60 | [30, 45] | 74.89 | 68.09 | 60.09 | 73.82 | 66.79 | 57.53 | 82.33 | 72.34 | 72.36 |
| 3 | 60 | [45, 60] | 74.71 | 67.65 | 59.11 | 72.82 | 65.64 | 56.03 | 82.63 | 72.55 | 72.51 |
| 3 | 60 | [60, 75] | 74.72 | 67.50 | 58.48 | 72.12 | 65.50 | 57.28 | 82.06 | 71.90 | 73.30 |

All metric columns: performance (%).
* 1) The bold font indicates the best performance in the column. 2) [a, b] means that the parameter is randomly chosen from the range a to b.
TABLE III: Segmentation Result with Different Fusion and Blending
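The roles of the dilation kernel size d and the fusion kernel size k can be illustrated on a single image row. The sketch below is a simplified 1-D stand-in, not the fusion defined in III-B: it dilates a binary instrument mask with a window controlled by d, then feathers the edge with a moving average controlled by k to obtain a blending weight. All function names are hypothetical, and the box filter is a swapped-in choice used only to show how the two parameters act.

```python
def dilate_1d(mask, d):
    """Grow a 0/1 mask: a pixel becomes 1 if any pixel within d // 2 is 1."""
    r = d // 2
    n = len(mask)
    return [1 if any(mask[max(0, i - r):min(n, i + r + 1)]) else 0
            for i in range(n)]


def box_smooth_1d(values, k):
    """Moving average over a window of half-width k // 2 (clipped at the ends)."""
    r = k // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - r):min(n, i + r + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def blend_scanline(instrument_row, background_row, mask_row, d, k):
    """Composite one row; returns (blended_row, alpha_row)."""
    alpha = box_smooth_1d(dilate_1d(mask_row, d), k)
    blended = [a * s + (1.0 - a) * b
               for a, s, b in zip(alpha, instrument_row, background_row)]
    return blended, alpha


if __name__ == "__main__":
    # Toy row: the instrument occupies pixels 9..11 of a 21-pixel scanline.
    mask = [1 if 9 <= i <= 11 else 0 for i in range(21)]
    src = [1.0] * 21  # row of the image that contains the instrument
    bg = [0.0] * 21   # row of the new background
    for d, k in ((3, 3), (7, 7)):
        _, alpha = blend_scanline(src, bg, mask, d, k)
        print(d, k, [round(a, 2) for a in alpha])
```

A larger d widens the region copied from the source image, and a larger k widens the band in which the weight lies strictly between 0 and 1.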

IV-G Experiment 4: External backgrounds

Backgrounds are critical for generating synthetic images in the proposed method, and we obtained them in two ways. First, we used image inpainting (III-C); second, we manually identified “external” background images (without any instruments present) from the video frames in the database. This experiment studied whether providing these external backgrounds improves the segmentation result. The tests were separated into 4 groups according to the budget of real images (1% means that the budget of real images is equivalent to 1% of the training set), as shown in Table V in the Appendix. In each group, there are 4 sub-tests. The baseline test uses only the real images to train the segmentation model, without synthetic images. The remaining 3 sub-tests all use synthetic images (BALD implemented) and differ only in how the backgrounds are provided. For ‘No-Yes’ tests, all the backgrounds used to generate synthetic images are inpainting backgrounds from real images with instruments. For ‘Yes-No’ tests, all backgrounds are external real background images and no inpainting backgrounds are generated. For ‘Yes-Yes’ tests, both external backgrounds and generated inpainting backgrounds are provided. All the other parameters were set according to Table I. Because no external backgrounds (frames without instruments) could be found in the EndoVis 2017 dataset, only the Sinus-Live and Sinus-Cadaver datasets were used in this experiment.
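The three background settings amount to choosing which sources feed the pool of backgrounds used for synthesis. A minimal sketch, assuming the first word of each label refers to external backgrounds and the second to inpainting backgrounds (as the descriptions above imply); the function and dictionary names are hypothetical.

```python
# Setting label -> (use external backgrounds, use inpainting backgrounds)
BACKGROUND_SETTINGS = {
    "No-Yes": (False, True),
    "Yes-No": (True, False),
    "Yes-Yes": (True, True),
}


def background_pool(external, inpainted, setting):
    """Collect the backgrounds available for generating synthetic images."""
    use_external, use_inpainting = BACKGROUND_SETTINGS[setting]
    pool = []
    if use_external:
        pool.extend(external)
    if use_inpainting:
        pool.extend(inpainted)
    if not pool:
        raise ValueError("at least one background source is required")
    return pool


if __name__ == "__main__":
    external = ["ext_frame_0", "ext_frame_1"]  # frames without instruments
    inpainted = ["inpaint_0"]                  # instruments removed by inpainting
    for setting in BACKGROUND_SETTINGS:
        print(setting, background_pool(external, inpainted, setting))
```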

The results in Fig. 10 (mDSC) and Table V in the Appendix show that including external backgrounds improves performance significantly when an extremely small number of real images is used (1% of the training set). For the Sinus-Live dataset, compared with no external backgrounds, the improvement of the best result (with or without inpainting backgrounds) is 2.77, 2.89 and 1.93 percentage points in mDSC, mIoU and mIoUNB, respectively. For the Sinus-Cadaver dataset, the improvement is 4.02 (mDSC), 4.86 (mIoU) and 1.74 (mIoUNB). However, when a large number of real images is used, the improvement is smaller. For the Sinus-Live dataset, when 100% of the labeled real-image training set was used, the improvement is 0.52 (mDSC), 0.70 (mIoU) and 0.60 (mIoUNB). For the Sinus-Cadaver dataset, although an improvement can still be seen in mDSC and mIoU with external backgrounds, performance decreased in mIoUNB (-0.99).

Fig. 10: Segmentation Result (mDSC) with Different Backgrounds

V Discussion

In this paper, we study the use of selectively generated synthetic images to improve the performance of surgical instrument segmentation. The proposed method can also be easily generalized to object localization and classification. Although in this paper the effectiveness of the proposed method is validated on surgical scenes, we believe that it can also be applied to other cases such as visual object detection in self-driving cars.

The result of Experiment 1 indicates that the proposed method improves segmentation performance significantly, especially when few real labeled images are used. With a 10% budget of real images combined with actively generated synthetic images, segmentation performance is comparable to using 50% of the real images without synthetic images, cutting manual labeling effort by 80%. The performance of the proposed method is comparable with the baseline result (using 100% of the hand-labeled training set) when using only 50%, 75% and 10% of the hand-labeled data on the Sinus-Live, Sinus-Cadaver and EndoVis 2017 datasets, respectively.

For the sinus surgery datasets, the Type-1 synthetic images (having the same instrument as the real image but a different background) slightly outperformed Type-2 synthetic images (having different instruments combined with the same background). Although using two different image blending methods (“multi-blending”) did not substantially improve overall performance, it did improve segmentation near the boundary. However, increasing the number of synthetic images did not always help: too many synthetic images can decrease performance.

When combining (fusing) instrument and background, details such as the radius of a blend applied along the instrument boundary can have a substantial effect on segmentation training. Among different fusion strengths (Table III), weaker fusion (a small area around the instrument is blended onto the synthetic images, but with a steep transition) gave slightly better results. The subjective realism of synthetic images for humans varied from image to image, but medium-fusion and strong-fusion synthetic images appear to be harder to distinguish from real images because of weaker artifacts near the border.

We varied the method of generating background images between inpainting the space occupied by instruments, and using video frames which contained no instruments originally (“external backgrounds”). Using external backgrounds improved the performance significantly when the annotating budget was extremely small, without adding much workload. However, as the budget increased, no considerable difference was observed by using external backgrounds. Because manually selecting background-only frames from surgical videos requires no expertise and is not as time consuming as labeling the instruments, especially when different parts of the instruments have different labels, adding external backgrounds is an efficient way to improve segmentation with small numbers of labeled images.

VI Conclusions and Future Work

Motivated by alleviating the experts’ annotation workload for challenging instrument segmentation in endoscopic images, we propose the use of actively generated synthetic images to reduce the need for labeled real images while maintaining comparable performance. The idea of actively generated synthetic images is to select the most informative unlabeled images, annotate them, and then generate synthetic images derived from the selected real images. Thus, a more diverse training set is formed from labeled real images and synthetic images, which results in a considerable improvement in performance compared with using real images only, especially when the budget for annotating “new” images is small. To sum up, the proposed method combines active learning and the generation of synthetic images to reduce the use of real labeled images, and can be flexibly applied to different segmentation models and datasets, with different active learning criteria.

In the future, we plan to study in principle how synthetic images help with performance. We also hope to explore the relationship between realism to humans and effectiveness to artificial intelligence, in other words, to study whether the most subjectively realistic synthetic images give the best performance on training segmentation models.


VII Appendix

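| Budget | BALD | SYN | Sinus-Live mDSC | Sinus-Live mIoU | Sinus-Live mIoUNB | Sinus-Cadaver mDSC | Sinus-Cadaver mIoU | Sinus-Cadaver mIoUNB | EndoVis 2017 mDSC | EndoVis 2017 mIoU | EndoVis 2017 mIoUNB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% | No | No | 59.69 | 50.99 | 46.37 | 53.71 | 43.57 | 40.18 | - | - | - |
| 1% | Yes | No | 60.18 | 52.06 | 46.65 | 54.85 | 45.34 | 42.12 | - | - | - |
| 1% | No | Yes | 64.76 | 57.39 | 49.72 | 56.93 | 49.37 | 43.40 | - | - | - |
| 1% | Yes | Yes | 66.36 | 59.52 | 53.26 | 65.91 | 58.51 | 50.82 | - | - | - |
| 2% | No | No | 62.86 | 54.00 | 50.02 | 61.70 | 52.53 | 46.48 | - | - | - |
| 2% | Yes | No | 63.48 | 54.95 | 50.66 | 61.64 | 55.03 | 47.94 | - | - | - |
| 2% | No | Yes | 68.51 | 61.74 | 54.55 | 64.61 | 57.54 | 49.89 | - | - | - |
| 2% | Yes | Yes | 69.54 | 62.59 | 54.57 | 68.92 | 61.82 | 53.25 | - | - | - |
| 5% | No | No | 69.31 | 61.83 | 54.00 | 68.97 | 60.57 | 53.04 | 76.89 | 64.50 | 62.44 |
| 5% | Yes | No | 70.72 | 63.07 | 57.42 | 72.40 | 64.33 | 56.18 | 77.48 | 65.30 | 63.72 |
| 5% | No | Yes | 71.32 | 64.53 | 57.09 | 73.62 | 66.58 | 57.57 | 81.01 | 70.73 | 70.58 |
| 5% | Yes | Yes | 72.41 | 65.86 | 56.96 | 75.96 | 69.12 | 59.09 | 81.56 | 71.09 | 71.28 |
| 10% | No | No | 70.75 | 63.19 | 55.18 | 70.41 | 62.63 | 53.02 | 80.29 | 69.19 | 69.37 |
| 10% | Yes | No | 70.89 | 63.18 | 54.85 | 73.67 | 66.04 | 56.52 | 81.90 | 71.63 | 73.47 |
| 10% | No | Yes | 74.90 | 68.15 | 60.38 | 75.42 | 68.81 | 58.86 | 81.73 | 71.54 | 70.71 |
| 10% | Yes | Yes | 75.53 | 70.47 | 61.65 | 76.59 | 69.93 | 59.22 | 82.44 | 72.46 | 73.31 |
| 20% | No | No | 74.05 | 66.59 | 58.19 | 73.45 | 65.70 | 55.58 | 80.90 | 70.10 | 71.16 |
| 20% | Yes | No | 76.08 | 68.89 | 61.70 | 74.18 | 66.72 | 57.67 | 82.78 | 72.86 | 74.39 |
| 20% | No | Yes | 77.01 | 70.19 | 62.00 | 76.68 | 69.61 | 59.85 | 82.37 | 72.22 | 71.41 |
| 20% | Yes | Yes | 76.78 | 70.43 | 61.82 | 76.58 | 70.27 | 60.73 | 83.05 | 73.25 | 73.20 |
| 50% | No | No | 77.93 | 71.26 | 62.83 | 75.36 | 68.31 | 58.18 | 81.70 | 71.04 | 71.40 |
| 50% | Yes | No | 78.30 | 71.73 | 64.28 | 77.97 | 70.97 | 61.17 | 83.02 | 73.13 | 74.52 |
| 50% | No | Yes | 80.69 | 74.89 | 66.22 | 78.33 | 71.89 | 61.33 | 83.52 | 73.80 | 73.08 |
| 50% | Yes | Yes | 80.66 | 74.87 | 66.30 | 78.64 | 72.49 | 62.49 | 83.01 | 73.32 | 73.45 |
| 75% | No | No | 78.94 | 72.47 | 64.42 | 75.18 | 67.94 | 58.23 | 82.27 | 72.08 | 73.72 |
| 75% | Yes | No | 80.29 | 73.93 | 65.49 | 78.20 | 71.26 | 61.97 | 83.64 | 74.21 | 76.91 |
| 75% | No | Yes | 81.74 | 76.11 | 68.62 | 79.20 | 73.01 | 63.02 | 83.64 | 74.17 | 73.58 |
| 75% | Yes | Yes | 81.96 | 76.24 | 67.30 | 79.72 | 73.68 | 63.31 | 83.57 | 74.04 | 74.04 |
| 100% | N/A | No | 81.35 | 75.14 | 66.34 | 79.42 | 72.50 | 62.61 | 82.50 | 72.52 | 74.28 |
| 100% | N/A | Yes | 83.64 | 78.18 | 70.86 | 80.28 | 73.83 | 62.84 | 84.04 | 74.66 | 74.21 |

All metric columns: performance (%). BALD: images selected with BALD active learning (not applicable at the 100% budget); SYN: synthetic images generated.
*Training on the EndoVis 2017 dataset does not converge when using 1% and 2% of the training set, because there are too few images.
The bold font indicates the best performance in each budget.
TABLE IV: Segmentation Performance with Different Budgets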
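
| Budget | External Background | Background Inpainting | Live mDSC | Live mIoU | Live mIoUNB | Cadaver mDSC | Cadaver mIoU | Cadaver mIoUNB |
|---|---|---|---|---|---|---|---|---|
| 1% | No synthetic image | | 60.18 | 52.06 | 46.65 | 54.85 | 45.34 | 42.12 |
| 1% | | | 63.94 | 56.63 | 51.33 | 61.89 | 53.65 | 49.27 |
| 1% | | | 66.36 | 59.52 | 53.26 | 65.91 | 58.51 | 50.82 |
| 1% | | | 66.71 | 59.33 | 53.14 | 65.32 | 57.62 | 51.01 |
| 10% | No synthetic image | | 70.89 | 63.18 | 54.85 | 73.67 | 66.04 | 56.52 |
| 10% | | | 75.40 | 68.90 | 62.76 | 76.34 | 69.71 | 60.90 |
| 10% | | | 75.53 | 70.47 | 61.65 | 76.59 | 69.93 | 59.22 |
| 10% | | | 75.91 | 69.37 | 61.34 | 78.42 | 72.05 | 62.49 |
| 50% | No synthetic image | | 78.30 | 71.73 | 64.28 | 77.97 | 70.97 | 61.17 |
| 50% | | | 79.81 | 74.15 | 67.79 | 78.31 | 72.40 | 63.57 |
| 50% | | | 80.66 | 74.87 | 66.30 | 78.64 | 72.49 | 62.49 |
| 50% | | | 80.99 | 75.21 | 67.91 | 80.48 | 74.43 | 64.59 |
| 100% | No synthetic image | | 81.35 | 75.14 | 66.34 | 79.42 | 72.50 | 62.61 |
| 100% | | | 83.12 | 77.48 | 70.26 | 79.48 | 73.37 | 64.45 |
| 100% | | | 83.64 | 78.18 | 70.86 | 80.28 | 73.83 | 62.84 |
| 100% | | | 82.72 | 77.18 | 69.11 | 80.51 | 74.46 | 63.46 |

All metric columns: performance (%).
The bold font indicates the best performance in each budget.
TABLE V: Segmentation Result with Different Backgrounds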