Generating large labeled data sets for laparoscopic image processing tasks using unpaired image-to-image translation

07/05/2019 ∙ by Micha Pfeiffer, et al.

In the medical domain, the lack of large training data sets and benchmarks is often a limiting factor for training deep neural networks. In contrast to expensive manual labeling, computer simulations can generate large and fully labeled data sets with a minimum of manual effort. However, models that are trained on simulated data usually do not translate well to real scenarios. To bridge the domain gap between simulated and real laparoscopic images, we exploit recent advances in unpaired image-to-image translation. We extend an image-to-image translation method to generate a diverse multitude of realistic-looking synthetic images based on images from a simple laparoscopy simulation. By incorporating means to ensure that the image content is preserved during the translation process, we ensure that the labels given for the simulated images remain valid for their realistic-looking translations. This way, we are able to generate a large, fully labeled synthetic data set of laparoscopic images with realistic appearance. We show that this data set can be used to train models for the task of liver segmentation of laparoscopic images. We achieve average dice scores of up to 0.89 in some patients without manually labeling a single laparoscopic image and show that using our synthetic data to pre-train models can greatly improve their performance. The synthetic data set will be made publicly available, fully labeled with segmentation maps, depth maps, normal maps, and positions of tools and camera.




1 Introduction

With the increase in computing power, there is an obvious trend towards training larger and deeper networks. However, in the medical domain, the lack of large data sets is a strong limiting factor [10]. The difficulty of recording real patient data in an operating room, legal restrictions on sharing and the great expense of manual labeling by experts make it near impossible to generate large training benchmarks. This work focuses on the example of the segmentation of laparoscopic videos, where deep networks can achieve high accuracies, but sometimes fail to generalize to new patients due to the lack of more labeled data [4]. A solution to this problem could be the usage of synthetic training data. In computer simulations, large amounts of fully labeled data can be created automatically. The main issue here is that models trained on synthetic data usually do not generalize well to real data, due to the domain gap between the two.

Instead, we propose to use image-to-image translation techniques to translate images from the domain of simulated images in which labels are known (domain $A$), to the domain of real images in which we want to train our model (domain $B$). Recent advances in image translation make it possible to do this even if the data is unpaired, i.e. no direct mapping between samples in one domain and samples in the other domain exists [13]. Additionally, multi-modal image-to-image translation [6, 8] enables us to control the style of the translation result, which can be utilized to increase the diversity in the final data set. In the present work, domain $A$ consists of images from very simple laparoscopic 3D computer simulations while domain $B$ is the domain of images from real laparoscopy video feeds. In order to use the translated data for training, care must be taken that a) the translated images look realistic enough to bridge the domain gap and b) the labels remain valid. This is especially difficult in laparoscopic images, since image content can change drastically between different viewpoints and between patients. To achieve our goal, we build upon several methods:

1.0.1 Unpaired translation

The CycleGAN [13] has made it possible to translate images between two unpaired domains by use of a cycle consistency loss and adversarial losses. A generator network $G_{A \to B}$ translates images from the simulated domain $A$ to the real domain $B$, which a discriminator network $D_B$ tries to differentiate from real images in $B$. At the same time, generator $G_{B \to A}$ and discriminator $D_A$ use the same method to translate images from $B$ to $A$. The cycle consistency states that an image $a \in A$ translated to $B$ and back to $A$ must match the original image, i.e. $G_{B \to A}(G_{A \to B}(a)) \approx a$ (and symmetrically for an image $b \in B$). This method can only learn a one-to-one mapping (uni-modal), meaning each input image will generate exactly one output.
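The cycle consistency idea can be illustrated with a minimal sketch. The "generators" below are hypothetical toy functions acting on flat pixel lists, not the networks used in this work. The L1 distance is the reconstruction penalty used by CycleGAN [13].

```python
from typing import Callable, List

Image = List[float]  # toy stand-in: a flat list of pixel intensities


def l1_distance(x: Image, y: Image) -> float:
    """Mean absolute difference between two images of equal size."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


def cycle_consistency_loss(
    a: Image,
    g_ab: Callable[[Image], Image],
    g_ba: Callable[[Image], Image],
) -> float:
    """L1 distance between an image a and its round trip g_ba(g_ab(a))."""
    return l1_distance(g_ba(g_ab(a)), a)


# Hypothetical toy generators: brighten going A -> B, darken going B -> A.
def toy_g_ab(img: Image) -> Image:
    return [p + 0.1 for p in img]


def toy_g_ba(img: Image) -> Image:
    return [p - 0.1 for p in img]


def lossy_g_ba(img: Image) -> Image:
    # Ignores its input, so the round trip cannot recover the original.
    return [0.5 for _ in img]
```

With the matching toy pair the round trip reproduces the input and the loss is close to zero, while the lossy back-translation is penalized. During training the symmetric term for images from the other domain would be added in the same way.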

1.0.2 Multi-Modal translation

The key idea behind multi-modal image translation is the separation of an image’s content from its style. The assumption is that the content between domains remains the same, while the style is domain-specific (texture, lighting). An encoder first extracts a style-code and a content-code from the source image and a generator then uses this content-code together with a style-code from the target domain to create the image in the target domain [6, 8]. The opposite direction works analogously. A cycle loss and various reconstruction losses bind the networks together.

1.0.3 Label-preserving translation

SPIGAN [9] proposes to train an additional network which tries to predict the depth map from the translated image, arguing that this preserves image structure. In our experience, this bears the risk of co-adaptation between the networks. AugGAN [5] and GANTruth [1] bind the generators to the image structure via weight-sharing with segmentation networks. However, AugGAN requires segmentation labels to be known for both domains and GANTruth requires a pre-trained segmentation network in the target domain. Our goal is to not use labels during the translation process, simplifying the training procedure.

1.0.4 Contribution

In this work, we show how both the goal of realism as well as the preservation of label accuracy during translation can be achieved. First, we build an extension to the MUNIT framework which is asymmetrical and does not require the simulated domain to have multiple styles, speeding up the process of creating the simulated data. Next, we incorporate an additional multi-scale structural similarity loss [12] and show that it helps to preserve image content and structure despite large changes in camera viewpoint. Additionally, we show how the addition of noise in the encoders can help avoid mode collapse - where multiple images map to a similar output - and steganography. To validate the approach, we show that pre-training a segmentation model on the synthetic data can increase segmentation scores.

As part of this work, we translate 100 000 images to the real domain $B$ (see Fig. 1). This data set, fully labeled with segmentation maps, depth maps and further labels, as well as the code will be made publicly available, with possible applications ranging from pre-training to benchmarking.

2 Methods

Unpaired multi-modal image-to-image translation can produce convincing results, but has mostly been tested on scenarios where the content stays similar in all images across both domains (such as faces to faces or mountains to mountains) [6, 8]. In laparoscopy, viewpoints can change and structures - such as the gallbladder or abdominal wall - move into and out of the view. Incorporating this into our data set is necessary, as we want it to be very diverse. However, the mismatch in domain distributions can lead to many wrongly added details, such as a gallbladder where there should only be liver, or fat tissue replacing liver tissue. The following describes our extensions to the MUNIT architecture which enable us to deal with these issues, namely adding a structure-preserving loss, simplifying the encoder and using noise to avoid co-adaptation of the networks. The resulting training process is outlined in Fig. 2.

Figure 2: Architecture based on MUNIT [6]. An image $a$ randomly drawn from the simulated domain $A$ is translated to the real domain $B$ and back to $A$, where a cycle loss ensures that $a$ is reconstructed correctly. The same is done in the opposite direction for images drawn from $B$. Various reconstruction losses ensure that the generators and encoders work as expected (please see [6] for more details). During the translation process, images from $A$ are encoded to a single latent content code, while images from $B$ are split into two latent codes: content and style. Unlike MUNIT, we do not have a style in $A$, which simplifies the creation of the rendered images. Furthermore, we add noise to all encoders to prevent the hiding of information and add the MS-SSIM loss between source images and their translations.

2.1 Architecture

Multi-Scale Structural Similarity (MS-SSIM) loss: Unpaired translation networks often invent details in their output. This is likely due to two reasons: 1) Some structures and some viewpoints occur more in one of the two domains than in the other. For example, the simulated domain $A$ contains more close-ups of the liver due to the random placement of the camera. The discriminator will discourage these images, resulting in the generator inventing structures like an additional gallbladder. 2) Generative models are susceptible to mode collapse. We add a multi-scale structural similarity [12] loss between an image from $A$ and its translation to the real domain $B$ (and similarly in the other direction). The loss works on the image brightness (average over the channels), which ensures that brighter regions (such as the gallbladder) remain brighter and darker regions remain dark while at the same time not penalizing style-dependent changes in hue.
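The following sketch illustrates how a brightness-based multi-scale similarity loss behaves. It is a simplified stand-in rather than the implementation used in this work. It makes three assumptions: global image statistics replace the local windows of [12], all scales are weighted equally, and negative contrast-structure terms are clamped to zero. All function names are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[float]]                      # brightness values in [0, 1]
RGB = List[List[Tuple[float, float, float]]]  # rows of (r, g, b) pixels


def brightness(img: RGB) -> Gray:
    """Average over the colour channels, so pure hue changes are ignored."""
    return [[sum(px) / 3.0 for px in row] for row in img]


def ssim_terms(x: Gray, y: Gray, c1: float = 0.01 ** 2,
               c2: float = 0.03 ** 2) -> Tuple[float, float]:
    """Luminance and contrast-structure terms from global statistics."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) ** 2 for v in xs) / n
    vy = sum((v - my) ** 2 for v in ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / n
    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (vx + vy + c2)
    return luminance, contrast_structure


def downsample(x: Gray) -> Gray:
    """Halve the resolution with 2x2 average pooling."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]


def ms_ssim(x: Gray, y: Gray, scales: int = 3) -> float:
    """Simplified multi-scale SSIM with equal weights per scale."""
    if min(len(x), len(x[0])) < 2 ** (scales - 1):
        raise ValueError("image too small for the requested number of scales")
    score = 1.0
    for scale in range(scales):
        luminance, cs = ssim_terms(x, y)
        score *= max(cs, 0.0)      # simplification: clamp negative terms
        if scale == scales - 1:
            score *= luminance     # luminance only at the coarsest scale
        else:
            x, y = downsample(x), downsample(y)
    return score


def ms_ssim_loss(source: RGB, translated: RGB, scales: int = 3) -> float:
    """Loss is 0 when the brightness structure is fully preserved."""
    return 1.0 - ms_ssim(brightness(source), brightness(translated), scales)
```

In this sketch, a translation that only permutes colour channels leaves the loss near zero, whereas swapping bright and dark regions drives it towards one.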

Noise against steganography: GANs have been shown to be very effective at hiding information in their output images [2]. Since the two generators are trained jointly to fulfill the cycle consistency, each generator can learn to hide details of its input image in its translation which are useful for the other generator during reconstruction. This is problematic when a generator is given a real image to translate, since these hidden details are not present in this case. To circumvent this effect, we add Gaussian noise to the input of each translation network.
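As a rough illustration, adding zero-mean Gaussian noise to a network input could look like the sketch below. The function name and the noise level `sigma` are assumptions made for this example. The source does not state the noise level that was used.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    pixels: List[float],
    sigma: float = 0.05,                  # hypothetical noise level
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return a copy of the input with zero-mean Gaussian noise added."""
    if rng is None:
        rng = random.Random()
    return [p + rng.gauss(0.0, sigma) for p in pixels]
```

The intuition is that a low-amplitude signal hidden in an image is disturbed by the noise before the next network sees it, so relying on such a signal no longer pays off during training.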

Asymmetrical style: One of our aims is to reduce the amount of manual work required to generate data. In this spirit, we want to translate from a simple and easy to set up domain to a very complex domain and let the computer do the bulk of the work automatically. We remove the part of the simulated-domain encoder which extracts the style, as well as the style injection from the generator which produces simulated-domain images. As a result, our setup becomes asymmetrical and we do not need to worry about creating multiple textures or lighting styles in the simulated domain $A$, simplifying the simulation process. During training, both the style extracted by the real-domain encoder as well as randomly drawn style vectors are used when translating from $A$ to the real domain $B$. In this way, the network can later translate images either using a random style or the style taken from a real image.

2.2 Translation data

To train our translation networks, we use two unpaired data sets, which both contain images with livers, gallbladders, tools, fat and abdominal wall (see Fig. 3).

Figure 3: Sample images from the two domains. Both contain similar objects, but no pairing information is known, and the distribution of content does not necessarily match.

Rendered data set - Domain $A$: We create six synthetic laparoscopic 3D-scenes using the liver and gallbladder surface meshes extracted from CT scans of six patients (3D-IRCADb 01 data set, IRCAD, France). We add meshes which represent fat tissue, ligament and the inflated abdominal wall. Each tissue type is assigned a distinctive texture with small random details. We randomly place the camera together with a light source (representing the laparoscope) and tools. In this way, we render 2000 images from random perspectives for each patient, resulting in 12 000 synthetic images. To increase the diversity in our translated results, we repeat the process for four additional patients where no gallbladder is present, resulting in scenes similar to liver staging procedures. The images from all ten patients together make up our extended rendered data set.

Real data set - Domain $B$: The real images are taken from the 80 videos of the Cholec80 data set (videos of 80 laparoscopic cholecystectomies) [11]. We first identify parts of the videos in which the gallbladder is still intact and then extract frames at five frames per second. We separate the resulting images into a training data set (75 patients, roughly 74 000 images) and a segmentation data set (5 patients). We manually segment the liver in 196 images of the segmentation data set (at a rate of one frame every five seconds).

2.3 Experiments

We train the translation networks for 375 000 iterations. Afterwards, we translate all images from the extended rendered data set, using five randomly drawn style vectors for each image, resulting in 100 000 images which we call the synthetic data set.
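The generation step can be sketched as a simple loop. The sketch assumes a hypothetical `translate(image, style)` callable standing in for the trained generator. It also assumes that style vectors are drawn from a standard normal distribution with eight dimensions, as is common in MUNIT-style setups but not stated here. Each translated image keeps the label of its rendered source, which is what makes the resulting data set fully labeled.

```python
import random
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

ImageT = TypeVar("ImageT")
LabelT = TypeVar("LabelT")
StyleVector = List[float]


def draw_style(dim: int, rng: random.Random) -> StyleVector:
    """Draw a random style vector (assumed standard normal prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def translate_dataset(
    labeled_images: Iterable[Tuple[ImageT, LabelT]],
    translate: Callable[[ImageT, StyleVector], ImageT],
    styles_per_image: int = 5,
    style_dim: int = 8,        # assumption, not taken from the source
    seed: int = 0,
) -> Iterator[Tuple[ImageT, LabelT]]:
    """Yield several styled translations per image, reusing its label."""
    rng = random.Random(seed)
    for image, label in labeled_images:
        for _ in range(styles_per_image):
            style = draw_style(style_dim, rng)
            yield translate(image, style), label
```

With ten patients, 2000 rendered images per patient and five styles per image, this loop yields 10 × 2000 × 5 = 100 000 labeled images.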

Evaluating the image quality quantitatively is difficult. Instead, we validate the usefulness of the synthetic data set by using it as training data for a segmentation task: As a baseline, we first train a TernausNet-11 [7] on the real Cholec80 validation data set in a leave-one-patient-out cross-validation (five models trained, each time one patient is left out of the training data to be used for testing). We then train the same network only on the synthetic data and validate it on all five patients of the real segmentation data set. Furthermore, we test how the performance changes if the network which is already trained on the synthetic data is fine-tuned on the real data in the same cross-validation as before. The experiments are repeated for a TernausNet which has previously been pre-trained on the ImageNet data set [3].
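A leave-one-patient-out split can be sketched as follows. The patient identifiers and frame names in the usage example are hypothetical placeholders, and the function name is an assumption made for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_patient_out(
    frames_by_patient: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_patient, train_frames, test_frames) for each fold."""
    patients = sorted(frames_by_patient)
    for held_out in patients:
        # Training data: all frames of every patient except the held-out one.
        train = [
            frame
            for patient in patients
            if patient != held_out
            for frame in frames_by_patient[patient]
        ]
        test = list(frames_by_patient[held_out])
        yield held_out, train, test


# Usage with placeholder data: five patients give five folds.
example = {"P%d" % n: ["P%d_frame%d" % (n, k) for k in range(3)]
           for n in range(76, 81)}
folds = list(leave_one_patient_out(example))
```

Splitting by patient rather than by frame avoids near-identical neighbouring video frames ending up in both the training and the test set.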


To see how our synthetic data helps in the adaptation to a wider diversity of images, we evaluate the pre-initialized TernausNets on images from 13 liver staging sequences, in which a total of roughly 2000 images are segmented [4].

3 Results

Using the MS-SSIM loss can greatly improve the preservation of image structure, as shown qualitatively in Fig. 4 and helps in the correct usage of textures: The correct assignment of texture to the various organs can be clearly seen and close-up shots of the liver surface result in highly detailed liver texture translations (more translation results in the supplementary materials).

Figure 4: Qualitative results for the MS-SSIM loss. During translation of images from either domain, the networks trained without the MS-SSIM loss tend to remove or add detail. In contrast, the networks which are trained with an MS-SSIM loss preserve structures in both directions.

Training data                  P76   P77   P78   P79   P80   Staging procedures
Real                           0.50  0.68  0.42  0.52  0.56  -
Synthetic                      0.73  0.70  0.13  0.74  0.76  -
Synthetic + Real               0.74  0.72  0.40  0.64  0.61  -
ImageNet + Real                0.80  0.81  0.48  0.86  0.83  0.25
ImageNet + Synthetic           0.89  0.80  0.12  0.80  0.85  0.61
ImageNet + Synthetic + Real    0.92  0.83  0.64  0.89  0.91  0.77

Table 1: Median dice scores for the real segmentation data set (patients P76 to P80 from the Cholec80 data) and for the 13 staging procedures. In all cases where the real segmentation data is part of the training data, the reported results are from a leave-one-patient-out cross-validation (except for the staging procedures, where all five patients were used). In patient P78 most of the visible liver region is covered in ligament and fat tissue. Median scores for the 13 staging procedures increase considerably by using the synthetic data for pre-training. An additional improvement is achieved by pre-training on the ImageNet data.

Training on our synthetic data shows considerable improvements over training only on the real data (Table 1). When using the synthetic data for pre-training, the median dice score improved by an average of 16 percent (no ImageNet pre-training) and 11 percent (with ImageNet pre-training).
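For reference, the dice score between a predicted and a ground-truth binary mask, and the median over a set of frames, can be computed as in the sketch below. Returning 1.0 when both masks are empty is a convention chosen for this example, not something specified in the text.

```python
from statistics import median
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # binary mask: 1 = liver, 0 = background


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|)."""
    intersection = sum(
        p & t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # convention assumed here: two empty masks agree
    return 2.0 * intersection / total


def median_dice(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Median dice score over (prediction, ground truth) pairs."""
    return median(dice_score(pred, target) for pred, target in pairs)
```

A score of 1 means perfect overlap and 0 means no overlap. Using the median over frames reduces the influence of a few badly segmented images on the reported value.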

When the network was tested on the 13 staging procedures [4] containing data that had not been seen at all during training, the mean dice score using only real training data was 0.25, and improved to 0.77 when the network was pre-trained with the synthetic data.

4 Discussion

In this work, we have shown that consistent translation results can be achieved despite having a large change in content and viewpoints.

The translated results alone can be used to achieve reasonably good scores on a segmentation task without labeling a single image. When pre-training a network with our synthetic data, we can demonstrate an increase in performance, compared with only using real data. We also show that the training data can help a network in generalizing to new situations.

Unpaired image-to-image translation is proving to be a very powerful tool in the generation of training data. Since the domain of surgical data science still mostly lacks large benchmarks and open data sets, it could greatly benefit from further development in this field.


  • [1] Bujwid, S., Martí, M., Azizpour, H., Pieropan, A.: Gantruth - an unpaired image-to-image translation method for driving scenarios (11 2018)
  • [2] Chu, C., Zhmoginov, A., Sandler, M.: Cyclegan, a master of steganography (2017)
  • [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)
  • [4] Gibson, E., Robu, M.R., Thompson, S., Edwards, P.E., Schneider, C., Gurusamy, K., Davidson, B., Hawkes, D.J., Barratt, D.C., Clarkson, M.J.: Deep residual networks for automatic segmentation of laparoscopic videos of the liver (2017)
  • [5] Huang, S.W., Lin, C.T., Chen, S.P., Wu, Y.Y., Hsu, P.H., Lai, S.H.: Auggan: Cross domain adaptation with gan-based data augmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 731–744. Springer International Publishing, Cham (2018)

  • [6] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: The European Conference on Computer Vision (ECCV) (September 2018)
  • [7] Iglovikov, V.I., Shvets, A.A.: Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. CoRR abs/1801.05746 (2018)
  • [8] Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: The European Conference on Computer Vision (ECCV) (September 2018)
  • [9] Lee, K.H., Ros, G., Li, J., Gaidon, A.: SPIGAN: Privileged adversarial learning from simulation. In: International Conference on Learning Representations (2019)
  • [10] Maier-Hein, L., Vedula, S.S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.: Surgical data science for next-generation interventions. Nature Biomedical Engineering 1(9), 691 (2017)
  • [11] Twinanda, A., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36 (02 2016)
  • [12] Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. vol. 2, pp. 1398–1402 (Nov 2003)
  • [13] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on (2017)