Supervised segmentation in medical image analysis has matured to a point where methodological improvements often yield only marginal gains in accuracy. Such methods depend on a large annotated training corpus, in which pixelwise labels have to be provided by medical experts. In practice, such data are expensive to gather. In contrast, weakly labelled data, such as images known to show a certain disease, are easily obtainable, since they are created on a daily basis in medical practice. We want to take advantage of these data for pathology segmentation by providing a means of finding the difference between the healthy and pathological data distributions. We present a weakly supervised framework capable of pixelwise segmentation as well as generating samples from the pathology distribution.
Our idea is inspired by CycleGAN, a recently proposed solution for unpaired image-to-image translation, where the combination of domain-specific generative adversarial networks (GANs) and so-called cycle consistency allows for robust translation. We call our adaptation PathoGAN and make the following contributions: we formulate a model capable of segmentation based on a single label per training sample; we simultaneously train two generative models able to generate inpaintings at a localized region of interest to transform an image from one domain to the other; and we are able to sample healthy brains as well as possible pathologies for a given brain. Furthermore, our method enforces domain-specific information to be encoded outside of the image, which suppresses, to some degree, the adversarial “noise” common to CycleGAN.
We show the performance of our implementation on 2d slices of the training data of the Brain Tumor Segmentation Challenge 2017 [8, 2] and compare our segmentation performance to a supervised segmentation technique.
CycleGAN has previously been used for segmentation by transferring to another target modality for which segmentation maps are available, or to generate training data from cheaply produced synthetic labelmaps. Using a Wasserstein GAN, another method directly estimates an additive visual attribution map. To our knowledge, there has not been a method that jointly learns to segment a single medical imaging modality using only image-level labels and to generate new data using GANs for both healthy and pathological cases.
2.1 Problem Statement
We assume two image domains $\mathcal{H}$ and $\mathcal{P}$, where the former contains only images of healthy subjects and the latter consists of images showing a specific pathology. We seek to approximate the functions $f_{\mathcal{HP}}$ and $f_{\mathcal{PH}}$ that perform the mappings $\mathcal{H} \to \mathcal{P}$ and $\mathcal{P} \to \mathcal{H}$, where $f_{\mathcal{HP}}(h, z_{\mathcal{P}}) = p'$ and $f_{\mathcal{PH}}(p, z_{\mathcal{H}}) = h'$. Vectors $z_{\mathcal{P}}$ and $z_{\mathcal{H}}$ encode the missing target image information (e.g. the pathology).
In the remaining paper, we use $\mathcal{A}$ and $\mathcal{B}$ as placeholders for $\mathcal{H}$ and $\mathcal{P}$ or $\mathcal{P}$ and $\mathcal{H}$ to overcome redundancy due to symmetrical components. We encourage the model to produce results such that the mappings are realistic (2), bijective (3), specific (4) and such that only the affected part in the image is modified (5):
2.2 Model Topology
To fulfill Eqs. (2–5), we adopt the main setup and objective from CycleGAN: we employ two discriminators, $D_{\mathcal{A}}$ and $D_{\mathcal{B}}$, together with generators $G_{\mathcal{AB}}$ and $G_{\mathcal{BA}}$, to perform the translation from domain $\mathcal{A}$ to $\mathcal{B}$ and vice versa, formulating two generative adversarial networks (GANs). In both directions, the respective discriminator is trained to distinguish real images from the output of the generator, whereas the generator is incentivized to generate samples that fool the discriminator.
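This two-sided adversarial setup can be sketched as follows. The NumPy sketch below assumes least-squares GAN losses (as used in the CycleGAN reference implementation); the function names and toy values are ours, for illustration only:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """The discriminator should score real images as 1 and generated ones as 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def generator_loss(d_fake):
    """The generator is rewarded when the discriminator scores its output as real."""
    return np.mean((d_fake - 1.0) ** 2)

# Toy discriminator outputs for a batch of 4 images:
d_real = np.array([0.9, 0.8, 1.0, 0.95])
d_fake = np.array([0.1, 0.2, 0.0, 0.05])
print(discriminator_loss(d_real, d_fake))  # small: D separates the domains well
print(generator_loss(d_fake))              # large: G does not fool D yet
```

The same pair of losses is instantiated once per translation direction, with $D_{\mathcal{B}}$ judging the outputs of $G_{\mathcal{AB}}$ and vice versa.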
2.2.1 Residual Generator
In order to segment pathologies, we seek to modify only a certain part of the image. In contrast to CycleGAN, we model the transformation from one domain to the other as a residual or inpainting which is exchanged with part of the original image. We achieve this by letting generator $G_{\mathcal{AB}}$ directly estimate $c+1$ featuremaps, where $c$ is the number of image channels used. We obtain labelmap $m$ and inpaintings $r$, activating $m$ with a sigmoid and each inpainting channel with a tanh activation:
where $m \in [0,1]^{h \times w}$ and $r \in [-1,1]^{c \times h \times w}$. With noise level $\sigma$, we turn the labelmap logits into samples from a Gaussian centered on them using the reparameterization trick. This yields reliable values of $m$ only for large absolute logits, forcing $m$ to be close to binary and intensity information to be encoded in the inpaintings. We set $\sigma$ to zero during testing. From $m$ and $r$ we compute the translated result $b = m \odot r + (1 - m) \odot a$, supposedly in domain $\mathcal{B}$ now.
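A minimal NumPy sketch of this residual translation; the variable names (`x` for the input image, `m_logits` for the labelmap logits, `r` for the inpainting) are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def translate(x, m_logits, r, sigma=1.0):
    """Blend the tanh-activated inpainting r into x where the labelmap is active."""
    # Reparameterization trick: noise on the logits means the sigmoid only
    # saturates reliably for large |m_logits|, pushing the labelmap towards
    # binary values. sigma is set to zero at test time.
    noisy = m_logits + sigma * rng.standard_normal(m_logits.shape)
    m = sigmoid(noisy)
    return m * np.tanh(r) + (1.0 - m) * x, m

x = rng.standard_normal((4, 64, 64))          # toy 4-channel 2d slice
healthy_logits = np.full((1, 64, 64), -30.0)  # labelmap ~0 everywhere
y, m = translate(x, healthy_logits, rng.standard_normal((4, 64, 64)))
print(np.allclose(y, x, atol=1e-6))           # image passes through unchanged
```

With strongly positive logits, the inpainting replaces the corresponding image region instead.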
In the following, we detail the computation of for the two possible translation directions. Both translation pathways are visualized in Fig. 1.
To map from healthy to pathological data, we estimate $z_{\mathcal{P}}$ (and thus $m$ and $r$) using a variational autoencoder (VAE). First, we employ two encoders to encode anatomical information around and inside the pathological region:
where $h$ is our healthy image, and $m$ and $p$ are the labelmap and pathological image of the previous transformation. If $m$ and $p$ are not available because $h$ is a real healthy image, we simply sample $z_{\mathcal{P}} \sim \mathcal{N}(0, I)$. Finally, a decoder is applied to $z_{\mathcal{H}}$ and $z_{\mathcal{P}}$:
To generate healthy samples from pathological images, we use a generator directly on the input, as introduced in CycleGAN, and estimate $m$ and $r$ directly. Here, we omit the encoding $z$, since the location and appearance of the missing healthy tissue can be inferred from the pathological image itself. We also omit an encoding bottleneck, due to possible information loss and less accurate segmentation.
To train this model, a number of different loss terms are necessary. In the following, we explain the individual components, using $\tilde{\cdot}$ and $\hat{\cdot}$ to denote results from the translated and reconstructed images respectively (e.g. mapping $a \in \mathcal{A}$ into $\mathcal{B}$ results in $\tilde{b}$, translating it back results in $\hat{a}$). We use weights $\lambda$ to balance the contribution of the different loss terms.
2.3.2 Variational Autoencoder
A variational autoencoder (VAE) is trained by minimizing the KL-divergence between the distribution of the encoding $z$ and an assumed prior distribution $\mathcal{N}(0, I)$, together with the expected reconstruction error $-\mathbb{E}_{q(z|x)}[\log p(x|z)]$, where $x$ is the data. In contrast to a classical VAE, we use two distinct encoding vectors $z_{\mathcal{H}}$ and $z_{\mathcal{P}}$, encoding the healthy and the pathological part, and produce two separate results, the labelmap $m$ and the inpainting $r$. We directly calculate the KL-divergence for our two encodings:
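For a diagonal Gaussian encoder and a standard normal prior, this KL term has the well-known closed form $\frac{1}{2}\sum_i(\sigma_i^2 + \mu_i^2 - 1 - \log\sigma_i^2)$, applied to each of the two encodings. A small NumPy sketch (variable names are ours; the encoding length of 256 follows the appendix):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# An encoding of length 256 that matches the prior exactly has zero KL cost:
mu = np.zeros(256)
log_var = np.zeros(256)
print(kl_to_standard_normal(mu, log_var))  # -> 0.0
```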
For the expected reconstruction error, we assume that $m$ and $r$ approximately follow a Bernoulli and a Gaussian distribution, respectively. We selectively penalize the responsible encoding by using separate loss functions for the residual region and the rest. Unfortunately, we only ever have access to the ground truth of one of these regions, since we do not use paired data. We solve this by evaluating each term at the point in the network where the respective ground truth is available, to calculate the approximation of the marginal likelihood lower bound:
where $\tilde{r}$ denotes the inpainting used to translate the original image to domain $\mathcal{B}$ and $N$ is the total number of pixels. $\hat{r}$ is the inpainting produced when translating an already translated image that originated from $\mathcal{A}$ back to that domain. Similarly, $\tilde{m}$ and $\hat{m}$ denote the respective labelmaps:
where $\tilde{m}$ and $\hat{m}$ are considered constant during optimization. Finally, we use the labelmap produced by the other generator, responsible for the opposite transformation, as ground truth for $m$, which we consider constant in this term:
To restrict the solution space of our model, we use this loss for both translation directions.
2.3.3 Identity Loss
We apply an identity loss on the labelmap $m$ which results from feeding $G_{\mathcal{AB}}$ with the wrong input $b \in \mathcal{B}$. In this case, $G_{\mathcal{AB}}$ should not change anything, since the input is already in the desired domain $\mathcal{B}$:
2.3.4 Relevancy Loss
By now, we have defined all necessary constraints for a successful translation between the image domains. The remaining constraints restrict the location and amount of change. Fulfilling Eq. (5), we want to entice labelmap $m$ to be set only at locations with a large difference between inpainting $r$ and image $a$, and we penalize large labelmaps in general:
In order not to reward exaggerated pathology inpaintings, we consider $r$ constant in this expression.
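The intent of this relevancy term can be sketched as follows. The concrete penalties and the weight `lam_size` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def relevancy_loss(m, r, x, lam_size=0.1):
    """Penalize labelmap activity where the inpainting changes little, plus a size term."""
    diff = np.abs(r - x)                            # r is treated as constant here
    loc = np.mean(m * np.maximum(0.0, 1.0 - diff))  # active label, small change: penalized
    size = np.mean(m)                               # discourage large labelmaps overall
    return loc + lam_size * size

# An empty labelmap incurs no penalty at all:
empty = np.zeros((64, 64))
print(relevancy_loss(empty, np.zeros((64, 64)), np.zeros((64, 64))))  # -> 0.0
```

Treating `r` as constant means the gradient cannot cheat by exaggerating the inpainting to justify a large labelmap.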
2.3.5 Full PathoGAN Objective
Combining all loss terms for direction $\mathcal{A}$ to $\mathcal{B}$ as $\mathcal{L}_{\mathcal{AB}}$, we can finally define:
We include all training patients of BraTS 2017 and normalize each brain scan to zero mean and unit variance, excluding zero voxels, and clip the resulting intensities to a fixed range. We select all slices from 60 to 100. In order to create two distinct datasets, we rely on the manual segmentations: we label slices without pathology as healthy, slices with more than 20 pixels segmented as pathological, and discard the rest. For training, we select 1 500 unaffected and 6 000 pathological slices from a total of 1 755 and 9 413, respectively. We would like to stress that the manual segmentations were only used to create the two image domains, but not for the actual training.
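The described per-scan normalization might look like the following NumPy sketch; the clip range of ±5 is an illustrative assumption:

```python
import numpy as np

def normalize_scan(scan, clip=5.0):
    """Zero-mean, unit-variance normalization over nonzero (brain) voxels, then clipping."""
    brain = scan[scan != 0]                    # exclude zero background voxels
    out = (scan - brain.mean()) / brain.std()  # statistics from brain voxels only
    return np.clip(out, -clip, clip)

rng = np.random.default_rng(0)
scan = rng.normal(100.0, 10.0, (60, 64, 64))   # toy volume, effectively all nonzero
normed = normalize_scan(scan)
print(abs(normed.mean()) < 1e-6, abs(normed.std() - 1.0) < 1e-3)
```

Computing the statistics over nonzero voxels only prevents the large zero background from shifting the brain intensities.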
Since the BraTS evaluation is volumetric, comparing performance is difficult; we therefore also train a supervised segmentation technique on our data. We chose MDGRU, a multi-dimensional recurrent neural network, for this task, due to code availability and consistent state-of-the-art performance on different datasets.
(Fig. 2: the probability maps from PathoGAN and from MDGRU for whole tumor.)
We train PathoGAN for 119 epochs using batches of 4. Our implementation is based on https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix. We trained MDGRU, using the implementation at https://github.com/zubata88/mdgru, as defined in the original publication, with batches of 4 and 27 500 iterations. During training of both PathoGAN and MDGRU, we use weak data augmentation (see Appendix B). Table 1 shows the results on the pathological training and test data. On the left, Fig. 2 shows an exemplary sample from the test set, with generated inpaintings and translation result. On the right, we provide the generated labelmap together with the manual label for “whole tumor” and the segmentation computed with MDGRU. Details on the used architecture and training procedure, as well as further qualitative samples, are given in the supplementary material.
| Dice (in %) | HD95 (in pixels) | AVD (in pixels) | Dice PP (in %) |
The results in Fig. 2 indicate that our relative weighting of the two inpainting reconstruction losses leads to better reconstruction inside the tumor region than outside. The labelmaps of the supervised method and ours in Fig. 2 show great agreement, and both are relatively close to the gold standard. As the 95th-percentile and average Hausdorff measures in Table 1 show, there are some outliers in our proposed method, due to its weakly supervised nature. The Dice score is about 10% smaller in both the per-slice and the per-patient case, which is expected given that no pixelwise labels are provided. It is important to remember that we segment with the sole criterion of not being part of the healthy distribution, which can differ from the subjective criteria used to manually segment the data. The increase in accuracy and decrease in standard deviation in the per-patient case for both methods is most likely caused by the inferior segmentation performance on slices showing little pathology. The per-patient Dice of the supervised method is in the range of the top methods of BraTS 2017. Although not directly comparable, this suggests that we can use the computed supervised scores as a good reference for our results.
We have only scratched the surface of the possible applications of our proposed formulation. Future work will include training on unaffected samples from subjects that are actually healthy. Furthermore, the model architecture could be drastically simplified by using one discriminator for both directions, allowing for larger generator networks, and by using multiple discriminators at different scales to find inpaintings that are not just locally but also globally consistent with the image. The restriction to slices is unfortunate but necessary due to memory requirements; a generalization of our approach to volumetric data would make it feasible for more real clinical applications.
We presented a new generative pathology segmentation model capable of handling a plethora of tasks. First and foremost, we presented a weakly supervised segmentation method for pathologies in 2d medical images, where it is only known whether an image is affected by the pathology. Furthermore, we were able to sample from both our healthy and our pathology model. We showed qualitatively and quantitatively that we are able to produce compelling results, motivating further research towards actual clinical applications of PathoGAN.
-  Andermatt, S., Pezold, S., Cattin, P.: Multi-dimensional Gated Recurrent Units for the Segmentation of Biomedical 3D-Data. In: International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. pp. 142–151. Springer (2016)
-  Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C.: Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Nature Scientific Data (2017), [in press]
-  Baumgartner, C.F., Koch, L.M., Tezcan, K.C., Ang, J.X., Konukoglu, E.: Visual Feature Attribution Using Wasserstein GANs. arXiv preprint arXiv:1711.08998 (2017)
-  Chu, C., Zhmoginov, A., Sandler, M.: CycleGAN, a Master of Steganography. arXiv preprint arXiv:1712.02950 (2017)
-  Fu, C., Lee, S., Ho, D.J., Han, S., Salama, P., Dunn, K.W., Delp, E.J.: Fluorescence Microscopy Image Segmentation Using Convolutional Neural Network with Generative Adversarial Networks. arXiv preprint arXiv:1801.07198 (2018)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
-  Kingma, D.P., Welling, M.: Stochastic Gradient VB and the Variational Auto-Encoder. In: Second International Conference on Learning Representations (2014)
-  Menze, B.H., Jakab, A., Bauer, S., et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging 34(10), 1993–2024 (2015)
-  Tsaftaris, S.A.: Adversarial Image Synthesis for Unpaired Multi-Modal Cardiac Data. In: Simulation and Synthesis in Medical Imaging: Second International Workshop, SASHIMI 2017, Held in Conjunction with MICCAI 2017. p. 3. Springer (2017)
-  Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2223–2232 (2017)
Appendix A Network structure
We use the same network architecture as in CycleGAN for the discriminator.
The generator consists of two encoder networks and a decoder network. Using a similar notation to the one used in the CycleGAN paper, our encoder networks are defined as:
cfs1-k denotes an f×f convolution with stride 1 and k filters, combined with instance norm and exponential linear unit (ELU). A downsampling operation consisting of a stride-2 convolution paired with an instance norm and a following ELU is coded as dk. Cfs1-k is the same as cfs1-k but without the instance norm and activation. Q2F describes reshaping the 2d featuremaps to a flat 1d vector. lk describes a fully connected layer with k output units, where a trailing t denotes a following tanh activation. z stands for the encoding size of 256, and i denotes the smallest image size before reshaping, in our case 15. The output of the last layer is split in half to produce the mean and log variance vectors of length 256 each.
Similarly, the decoder network is defined as:
We denote an additional ELU activation with a trailing e. F2Q describes the reshaping operation from a vector to one square-shaped featuremap. uk means a transposed convolution followed by an instance norm and an ELU. R stands for a ResNet block consisting of a residual computation with a 3×3 convolution layer with padding, an instance norm, an ELU and a second 3×3 convolution layer with padding, whose output is added to the input. r stands for the number of input channels. For both generators, one featuremap is activated with a sigmoid activation function and the rest with the tanh activation function.
Appendix B Training Details: Data Augmentation
We augment our training data by randomly mirroring the samples and applying a random rotation and a random scaling, with the rotation angle and scaling factor each sampled from a fixed range. Furthermore, we apply random deformations with a grid spacing of 128, sampling the grid deformation vectors from a fixed distribution.
Appendix C Additional Visual Results
Additional visual results were randomly selected with the command ls testfolder | shuf | tail -n 2 and are shown in Figs. 3 to 22. The first row in each figure shows the available sequences (FLAIR, T1CE, T1, T2) and the manual segmentation; the second row shows the respective generated inpaintings and probability map; and the last row shows, on the left, the four final translations for each sequence as well as the probability map generated with the supervised MDGRU reference method.