I Introduction
Accurate image segmentation is important for many clinical applications that rely on medical images. In the recent years, deep neural networks have been successful in yielding high segmentation performance at the expense of requiring large amount of annotated training data. Obtaining many annotated examples is difficult for medical images since getting clinical experts to annotate a large number of segmentation masks, which require perpixel annotations, is an expensive and timeconsuming process. Hence, it is not a preferable solution in clinical settings. At the heart of this issue lies the fundamental gap between generalization performance of humans and current Deep Learning (DL) methods. While humans can generalize well for image segmentation after observing very few examples, even one or two examples seem to suffice in some applications, this is not the case with the current DL algorithms.In this work, we focus on algorithmic approaches aiming to close the mentioned gap for medical image segmentation.
First we contemplate the question: why do we need large annotated data for the deep neural networks to obtain high segmentation accuracy? We hypothesize that for segmentation task, when trained with large number of annotated samples, the target relationship learned by the neural networks between images and segmentation masks is robust to the variations in shape and intensity characteristics of the target and surrounding structures. Algorithms trained on a small number of annotated samples may not be exposed to a sufficient amount of variations, and consequently, perform poorly on unseen test images that contain variations not observed during training. These variations in shape arise due to anatomical variations in the population and variations in intensity characteristics are due to differences in (i) tissue properties and its composition, and (ii) image acquisition and scanner protocols, especially in Magnetic Resonance Imaging (MRI).
To address the need for a large number of annotated data to achieve high segmentation performance with DL methods, in this work we propose a taskdriven and semisupervised data augmentation method (shown in Fig. 1). The method is based on learning generative models that can be used to sample deformation fields and additive intensity transformations. Segmentation cost is included during training of these models for the synthesis of imagelabel pairs, which incorporate the taskdriven nature. Semisupervised nature, on the other hand, is incorporated by including unlabeled data in the training through an adversarial term, which help generators synthesize diverse set of shape and intensity variations present in the population, even in scenarios where the number of labeled examples are extremely low.
The proposed approach in essence aims to optimize the augmentation task and can be intuitively understood by drawing analogies with existing augmentation methods. For example, if we consider random elastic deformations proposed in [simard2003best, ronneberger2015u], the augmentation is based on a deformation model with a few number of parameters like Gaussian kernel size and width. These parameters can be seen as hyperparameters and their values can be optimized by training separate segmentation networks for a number of combinations, and selecting the combination that yields the best performance on validation images. Our approach uses a much more flexible transformation model using networks and optimizes its weights using both labeled and unlabeled training images. Its only hyperparameter is the number of training iterations which is determined based on performance on validation images. Thus, both the training and validation images play a crucial role. In our experiments, we evaluated the proposed approach using three different publicly available datasets of cardiac, prostate and pancreas. We present comparisons with existing alternatives as well as an ablation study empirically analyzing the proposed approach.
A preliminary version of this work has been presented at the conference on Information Processing in Medical Imaging [kcipmi19]. In this extended version we additionally:

analyze quantitatively how each term of regularization loss, namely adversarial loss and large deviation loss components affects the performance gains obtained in the proposed method (Refer Results in Sec 5.A).

investigate the benefit of optimizing the generator jointly with the segmentation network as compared to independent optimization of generator w.r.t segmentation network (Refer Results in Sec 5.B).

examine the generality of the proposed method by evaluating it on two more datasets, namely prostate and pancreas.

compare with a larger set of related methods including selftraining with conditional random fields and imagelevel adversarial training.
Ia Related work
We broadly classify the relevant literature into two categories in light of the proposed method in this work:
Data Augmentation: Data Augmentation is a simple technique to enlarge the training set based on generating synthetic imagelabel pairs. The idea is to transform the images in such a way that the label remains the same, or the transformation is welldefined for both the image and the label. The popular approaches are random affine transformations [cirecsan2011high], random elastic transformations [simard2003best, ronneberger2015u] and random contrast transformations [hong2017convolutional, perez2018data]. These methods are very simple to implement and empirically have been shown to reduce overfitting and improve performance on unseen examples. Several recent works trained Generative Adversarial Networks(GANs) [goodfellow2014generative], using the existing labeled dataset to generate realistic imagelabel pairs. The idea has been applied to various analysis tasks [salehinejad2018generalization, frid2018synthetic, guibas2017synthetic, hou2017unsupervised, wu2018conditional, liu2018pixel, sixt2018rendergan] including MR image segmentation [costa2018end, shin2018medical, bowles2018gan]. MixUp [zhang2017mixup]
is different from the above approaches in the fashion that the generated data is not realistic in nature. Here, the additional data is obtained by linearly interpolating available images and corresponding labels, respectively. Despite the unrealistic nature, it seems to improve the performance of the neural networks on standard benchmark datasets
[zhang2017mixup] and also on medical image segmentation [eaton2018improving] in limited annotation setting. All of the above mentioned methods have parameters that control the generation process. These are either set by experience or learned to generate realistic samples based on the available labeled examples, which itself often requires large number of training samples. None of the methods optimize the parameters with respect to the task performance nor leverage unlabeled data. The proposed approach optimizes the parameters of the generator to get the best segmentation performance and leverages unlabeled data during the optimization.The closest work to ours was proposed in [wang2018low]. In a metalearning setup, the authors proposed a fully supervised approach that incorporated the classification performance in learning the generator so to generate augmented images that are optimal for the task. Although it is a fewshot learning method, we still need a large number of different classes samples to train the generator. Here, no modeling assumptions are considered in the generator setup, and the augmented images are synthesized directly. While we instead incorporate domain knowledge and model the generator to output deformation and intensity transformations to generate the synthetic images. Due to these assumptions of modeling transformations as well as leveraging the unlabeled data in the generation process, we do not need a large number of labeled examples during training. In the proposed method, we can readily obtain the synthetic mask using the generated transformation which cannot be obtained with the method [wang2018low]. We show in the results (Section V and Fig. 2(a)) that the semisupervised framework of our method yields significant improvements over using only the labeled examples with taskspecific loss as in [wang2018low].
Semisupervised learning (SSL) methods leverage unlabeled data to accompany limited annotated data during training. The underlying idea is to regularize training by leveraging the unlabeled data and avoid overfitting. We divide SSL works into 3 subcategories: (i) selftraining, (ii) adversarial training, and (iii) learned registration based approaches.
Selftraining approaches [yarowsky1995unsupervised, li2005setred]
are based on the concept of iteratively retraining an alreadytrained network on the estimated labels (pseudolabels) of unlabeled data as ground truths. In
[bai2017semi] authors show an improvement in the segmentation performance on cardiac MRIs using a selftraining approach along with a Conditional Random Field (CRF) model postprocessing intermediate predictions before retraining. However, it has also been shown that the selftraining approaches can suffer if the initial predictions are erroneous on the unlabeled data [chapelle2009semi, zhu2009introduction].Adversarial training and GANs have been used in the semisupervised setting both using the generator as the segmentation network and the discriminator in a regularization term [zhang2017deep], and viceversa [souly2017semi]. In limited annotation setting, we compare and illustrate that the proposed method outperforms the earlier stated semisupervised approaches.
Alternatively, in a concurrent work [zhao2019data], the authors proposed a oneshot data augmentation approach that learns registration between unlabeled image and labeled image where two independent models are trained to learn spatial and appearance transformations for the registration. Later, both learned models are used to generate augmented imagelabel pairs which are used to train a segmentation network. As in the other augmentation works, the generator of this model is also not optimized to yield the best task performance. Furthermore, the approach relies on image registration, which, on one hand, is a difficult task for nonbrain anatomy, and on the other hand, leads to taskirrelevant background structures substantially influence the augmentation process.
Lastly, with weaklysupervised learning the issue of expensive and timeconsuming pixelwise annotations is addressed using weaker labels during training such as scribbles [can2018scribble] and imagewide labels [andermatt2018pathology], which are different approaches that can be complementary to dataaugmentation.
Ii Methods
Let be a set of training images and be the set of corresponding ground truth segmentation labels, be a segmentation network and
be its trainable parameters. In the supervised learning setting, a loss function
is defined as the disagreement between the labels predicted by the network for the set of input images and ground truth labels . The objective is to minimize the loss function with respect to the parameters as can be stated as in Eq. 1.(1) 
When data augmentation is used, the minimization becomes
(2) 
where and denote the sets of generated images and corresponding labels, which can be generated using the transformations mentioned previously. In this minimization, the effective training set is formed of the augmented sets and . When modelbased transformations are used for generating the augmentation sets and , such as geometric or contrast transformations, each augmented image and the corresponding label are created by a conditional generator function that takes as input an existing labeled pair and applies a random transformation with fixed parameters. We can represent this with the following notation, as in Eq. 3.
(3) 
where are the imagelabel pair sampled from the set of labeled examples, is the transformation function, is the random component of the transformation and are the parameters of the transformation. For instance, for the random elastic deformations proposed in [ronneberger2015u], would be the deformation model,
would include the grid spacing between anchor points and standard deviation of the distribution of displacement vectors at each anchor point, while
would correspond to a random draw of displacement vectors. The same transformation would be applied to both and to generate the augmented pairs. As described in the introduction, in the modelbased transformations, parameters are often predefined and their number is kept low.When GANs are used for generation, a neural network is used as the generator as , and its parameters are determined by optimizing Eq. 4 as per the adversarial learning framework [goodfellow2014generative], using an additional discriminator network with its own set of parameters . (The GAN can also be a conditional image generator defined as and the discussion still holds.)
(4) 
The generator is input only with a random draw from the distribution to output an imagelabel pair . The labeled pairs are still used but during the training. The discriminator is optimized to distinguish between generated pairs from and real pairs , while the generator is optimized to produce such that can not differentiate between generated and real pairs. This forces the generated set to be as ”realistic” as possible.
In both modelbased and GANbased approaches, generators would be predefined or trained in advanced without considering the task, and during training random draws would be sampled from the generator models to create and .
Iia SemiSupervised and TaskDriven Data Augmentation
In this work, we propose to generate augmented imagelabel pairs that are optimized for the segmentation task(Fig. 1). We achieve this by optimizing the generator function, similar to the GANbased approach, but with a crucial difference, we integrate the task loss and leverage unlabeled images in the process. The proposed model optimizes Eq. 5 instead of Eq. 4.
(5) 
where denotes a set of unlabeled images. Furthermore, we would like to be able to optimize the parameters with limited number of labeled examples. To this end, we integrate domain knowledge into the generation process, similar to the modelbased approaches, but allowing a network parameterization of the transformation models to increase flexibility while remaining trainable. We define two conditional generators to model shape and intensity variations using deformation fields generator (nonaffine spatial transformations) and intensity fields generator, respectively. Crucially, both models are constructed such that the segmentation mask corresponding to an augmented image is obtained by applying the same transformation to the input image mask .
With this optimization, we want to incorporate two sets of ideas with the two lossterms in Eq. 5. The first term ensures that the model generates set of augmented pairs such that they are helpful for the minimization of segmentation loss, which is the taskdriven nature of the approach. The second term is a regularization term built on adversarial loss and integrates a preference for larger transformations leveraging the unlabeled images in the generation process, which is the semisupervised component of the method.
IiA1 Deformation Field Generator
The deformation field generator is trained to output a deformation field transformation. The conditional generator , a network with parameters , takes as input a labeled image and a
vector randomly drawn from a unit Gaussian distribution to produce a dense perpixel deformation field
. Later, the input image and its corresponding label (in onehot encoding form) are warped using bilinear interpolation based on the generated deformation field
v to produce the augmented imagelabel pair and , respectively. The augmented image and label sets are denoted by and , and each sample pair is defined using the following: , , where denotes warping operation.IiA2 Additive Intensity Field Generator
Similar to , here the generator is trained to output an additive intensity mask transformation. The generator , again a network with parameters , takes as input a labeled image from and a vector randomly drawn from a unit Gaussian distribution to output an additive intensity mask . Then is added to the input image to obtain the augmented image , and its corresponding segmentation mask remains unchanged as the initial mask . The augmented imagelabel set is denoted by and one sample pair is defined using the following: , .
IiA3 Regularization Loss
The regularization term is defined as in Eq. 6 for both conditional generators.
(6) 
The first term is the following standard adversarial loss
(7) 
where , denotes the discriminator network and its weights. This first term incorporates the semisupervised nature of the model by including the set of unlabeled images in the optimization. It allows samples in the unlabeled set that show different shape and intensity variations than the labeled examples guide the optimization of the generators. The second term in Eq. 6 is the Large Deviation (LD) term that embeds a preference for large transformations. The LD term prevents the generator from producing very small deformations and intensity fields, which would satisfy the adversarial loss as well as lead to lower segmentation loss in the cost given in Eq. 5. The definition of depends on the generator type. We define the two terms for the deformation and intensity field generators as and . The negative signs ensures that minimizing the LD term maximizes the norms. The LD term forces the generator to produce large transformations while the adversarial loss tries to constrain them. Optimizing for both terms yield augmented images that differ substantially from the input labeled images for both the generators.
Finally, the weighting terms and balances the effect of the two terms on the optimization.
IiA4 Optimization Sequence
Before the optimization of the conditional generators, the segmentation network (
) in the proposed setup is pretrained on only the labeled data for a few epochs as per Eq.
1 and later optimized with both the labeled and generated data from the conditional generators as per Eq. 2. This is done to ensure that produces reasonable segmentation masks for the labeled data when the optimization of the generators begins. Following the pretraining of , both generative models for deformation and intensity fields are trained separately by minimizing the cost given in Eq. 5 with corresponding regularization terms. Note that this minimization trains over , the generators and the discriminators. This implies training of the set of networks and independently for deformation and intensity fields generation, respectively. The optimization sequence is run for a fixed number of iterations, and the segmentation network is evaluated on the validation images using the Dice’s similarity coefficient (DSC) at every iteration. Once the optimization is complete, we fix both the generators and with the parameters that yielded the best mean DSC over the validation images. Then, the segmentation network is optimized using Eq. 2 once again from a random initialization. The data used for this final training comprises of both the original labeled sets, and , and the augmented sets, and sampled from the trained generators or or both (where we perform two back to back transformations, e.g. is applied over the deformed image obtained from ).The validation images play a crucial role in the defined optimization as they determine the parameters of the generator model chosen to generate the augmented data that is later used for the independent optimization of the segmentation network.
Iii Dataset and Network details
Iiia Dataset details
Cardiac Dataset: This is a publicly available dataset hosted as part of MICCAI’17 ACDC challenge [bernard2018deep] ^{1}^{1}1https://www.creatis.insalyon.fr/Challenge/acdc. It comprises of 100 subjects’ shortaxis cardiac cineMR images captured using 1.5T and 3T scanners. The inplane resolution ranges from 0.70x0.70mm to 1.92x1.92mm and throughplane resolution ranges from 5mm to 10mm. The segmentation masks are provided for left ventricle (LV), myocardium (Myo) and right ventricle (RV) for both endsystole (ES) and enddiastole (ED) phases of each subject. This dataset is divided into 5 subgroups (details in [bernard2018deep] ^{2}^{2}2https://www.creatis.insalyon.fr/Challenge/acdc), each comprising of 20 subjects, respectively. For our experiments we only used the ES images.
Prostate Dataset: This is a public dataset made available as part of MICCAI’18 medical segmentation decathlon challenge ^{3}^{3}3http://medicaldecathlon.com/index.html. It comprises of 48 subjects T2 weighted MR scans of prostate. The inplane resolution ranges from 0.60x0.60mm to 0.75x0.75mm and throughplane resolution ranges from 2.99mm to 4mm. Segmentation masks comprise of two adjoint regions: peripheral zone (PZ) and central gland (CG).
Pancreas Dataset: This dataset also is from medical decathlon MICCAI’18 challenge ^{4}^{4}4http://medicaldecathlon.com/index.html. It comprises of 282 subjects CT scans. Segmentation masks available comprise of labels with large (background), medium (pancreas) and small (tumor) structures. The inplane resolution ranges from 0.6x0.6mm to 0.97x0.97mm and throughplane resolution ranges from 0.7mm to 7.5mm. In this work, we create two labels for segmentation task, where label 1 denotes the foreground which was created by merging the labels of pancreas and tumor, and label 0 denotes the background label.
IiiB Preprocessing
N4 [tustison2010n4itk] bias correction was performed on the cardiac and prostate datasets. The below preprocessing was applied to all images of all the datasets. (i) intensity normalization: all the volumes were normalized using minmax normalization according to: , where and denote the and intensity percentiles of the 3D volume. (ii) resampling: all 2D image slices and their corresponding label maps were resampled to a fixed inplane resolution
using bilinear and nearestneighbour interpolation, respectively and then cropped or padded with zeros to a fixed size of
. The resolution and fixed size were chosen empirically for each dataset, and the values were: (a) cardiac dataset: 1.367x1.367mm and 224x224, (b) prostate dataset: 0.6x0.6mm and 224x224, (c) pancreas dataset: 0.8x0.8mm and 320x320.IiiC Network Architecture
There are 3 types of networks in the proposed method (see Fig. 1): a segmentation network , a discriminator network and a generator network . We describe the architectures of these networks below. and use the same architecture except at the last layer, which are used to model the deformation field and the intensity mask, respectively.
Generator: Generator takes an image and a randomly drawn vector of dimension 100 as input. Both inputs are initially passed through 2 subnetworks namely and . comprises of 2 convolutional layers and comprises of a fullyconnected layer, followed by reshaping of output into downsampled image dimensions. Then a set of bilinear upsampling and convolutional layers are applied consecutively to output feature maps of image dimensions. The resulting outputs of both the subnetworks, which are of same dimensions, are then concatenated and passed through a common subnetwork , which consists of 4 convolutional layers where the last layer is different for and . The final convolution for yields two feature maps that correspond to dense perpixel deformation field v, while for , it outputs a single feature map that corresponds to the intensity mask . The final layer of uses
activation to restrict the range of values in the intensity mask. All the convolutional layers except the final layers are followed by batch normalization layers and ReLU activation. All convolutional layers’ kernels are 3x3 except the final layers’ which are 1x1.
Discriminator: has an architecture similar to the DCGAN [radford2015unsupervised]
, which comprises of five convolutional layers each with a kernel size of 5x5 and a stride of 2. The convolutions are followed by batch normalization layers and leaky ReLU activations with the leak value of the negative slope set to 0.2. After the convolutional layers, the output is reshaped and passed through three fullyconnected layers where the final layer has an output size of 2 that predicts the probability of the input being real or fake.
Segmentation Network: We chose the architecture of segmentation network similar to UNet [ronneberger2015u]
. It comprises of encoding and decoding paths. The encoder comprises of four convolution blocks where each block has two 3x3 convolutions followed by a 2x2 maxpool layer. The decoder comprises of four convolution blocks where each block comprises of the concatenation of features from the corresponding level of the encoder, followed by two 3x3 convolutions and bilinear upsampling by a factor of 2. Except for the last layer, all the layers have batch normalization and ReLU activation.
IiiD Training Details
The segmentation loss () used is weighted cross entropy. We empirically set the weights of background pixels as 0.1 and foreground pixels as 0.9 since the number of pixels belonging to the foreground are fewer in quantity and are of primary interest for the segmentation task at hand. For the datasets with more than one foreground label, we divided the foreground weight of 0.9 equally among all the labels. With this rationale, the weights of the output labels are as follows: (a) 0.1 for background and 0.3 for each of the three foreground classes of the cardiac dataset, (b) 0.1 for background and 0.45 for each of the two foreground classes of the prostate dataset, and (c) 0.1 for background and 0.9 for one foreground class of the pancreas dataset.
We split the data into training pool, validation, unlabeled and test sets. We empirically set as 1 to match the magnitude of adversarial loss to the segmentation loss. To determine parameter, we randomly sampled one 3D volume from the training pool of cardiac dataset (the sampled volume is one possibility of all training combinations used in full analysis) and trained the network, and evaluated performance on the validation images for three values of : . The value of for yielded the best validation performance for this experiment. So, we used this set values () for all future experiments on all datasets. Owing to the computationally expensive nature of the proposed method, we did not perform an exhaustive grid search on many combinations of weights of the loss terms (, ) on the validation set. But if one has enough computational resources, one could do a grid search of these hyperparameters for each dataset to potentially obtain higher performance gains.
The batch size and the total number of iterations are set as 20 and 10000, respectively, based on the evolution of the training curves. For all the networks, while training, the iteration where the model yields the best performance on the validation images is saved for the evaluation. AdamOptimizer [kingma2014adam] is used for the optimization of all the networks with learning rate of and default beta values (, ).
Iv Experiments
We evaluated the proposed method on three datasets: cardiac, prostate and pancreas. For each dataset, we split the data into 4 sets: labeled training (), unlabeled training (), test () and validation (). The size of each set is denoted by followed by a subscript indicating the set. The validation set consists of two 3D volumes (=2) for all datasets. For the cardiac, prostate and pancreas datasets, the number of 3D volumes (, ) for unlabeled and test sets are (25, 20), (20, 13) and (25, 20) respectively. , and sets are selected randomly apriori and fixed for all experiments. The remaining 3D volumes constitute the training pool . Note that the entire training pool is never utilized for training. Rather a small number of training images () is sampled from for each experiment. As the interest of this work is to analyze the performance in the limited annotation setting, we set or . Each experiment is run five times with different 3D training volumes. Further, to account for the variations in the random initialization and convergence of the networks, each of the five experiments is run three times. Thus, overall, we have 15 runs for each experiment. Since the cardiac dataset has five subgroups of images (see Sec. IIIA), we ensure that each set contains an equal number of images from each subgroup, and, when is run five times, each time the 3D volume is selected from a different subgroup.
Segmentation performances of the following models were compared:
No data augmentation (no aug): is trained without any data augmentation.
Affine data augmentation (Aff):
is trained with data augmentation comprising of affine transformations such as: (a) scaling (random scale factor is chosen from a uniform distribution with min and max value as 0.9 and 1.1), (b) flipping along xaxis, (c) rotation (randomly a value is chosen between 15 and +15 degrees and another type of rotation that is multiple of 45 degrees (defined as 45 deg*N where N is randomly chosen between 0 and 8)). For each slice in the batch, we apply one of the above random transformation 80% of the time and 20% of the time we use the image as is.
For all the subsequent data augmentation methods, MixUp and semisupervised methods, we include the random affine data augmentations by default as described above. For training of , half of each batch was composed random affine augmentation and the remaining half was chosen from the specific augmentation technique.
Random elastic deformations (RD): Random elastic augmented images are created as stated in [ronneberger2015u], where a deformation field is created using a matrix of size 3x3x2. Each element of this matrix is sampled from a Gaussian distribution with a mean of 0 and standard deviation of 10 and is then resampled to image dimensions using bicubic interpolation. (Refer appendix for results with different sigma & kernel sizes)
Random contrast and brightness fluctuations [hong2017convolutional, perez2018data] (RI): These augmented images are created with the help of contrast adjustment step () and brightness adjustment step (), where c and b are sampled uniformly from [0.8,1.2] and [0.1,0.1], respectively and denotes mean of 2D image. (Refer appendix for more combinations of c and b evaluated)
Deformation field transformations (GD): The deformation field generator trained with the proposed method is used to generate the augmented data i.e., .
Intensity field transformations (GI): The intensity field generator trained with the proposed method is used to generate the augmented data i.e., .
Both deformation and intensity field transformations (GD+GI): Augmented data included both and , obtained from the generators and , respectively. We also generated additional images which have both the deformation and intensity transformations. These are obtained by applying intensity transformation using generator on the deformation field transformed images . The augmented data consists of all the images generated .
MixUp [zhang2017mixup] (Mixup): The augmentation sets and consist of the linear combination of available labeled images using the Mixup formulation as stated in Eq. 8 [zhang2017mixup].
(8) 
where
is sampled from beta distribution Beta
with and . The value of 0.2 yielded the best results for the datasets considered. controls the ratio of mixing of two imagelabel pairs , which are randomly sampled from the labeled image set.Mixup over deformation and intensity field transformations (GD+GI+Mixup): These set of images are obtained by applying Mixup over all possible pairs of available images: original data (), their affine transformations and the generated images using deformation and intensity field generators
Adversarial Training (Adv_tr): For comparison, we investigate previously proposed adversarial training methods with the discriminator trained to operate on: (i) the image level discrimination [zhang2017deep] and (ii) the pixel level discrimination [souly2017semi] in a semisupervised (SSL) setting.
Selftraining (Self_tr): The selftraining based method as proposed in [bai2017semi] is evaluated on the datasets.
Ablation Studies: In addition to the aforementioned comparisons, we carried out additional ablation studies as described below. These studies were done only on the cardiac dataset owing to lack of computational resources, as each experiment for each ablation study and dataset requires 15 runs.
A. Effects of adversarial () and large deviation () loss terms of regularization loss on segmentation performance: We investigate the effect of each term of regularization on the performance of the proposed method. Different values of and are considered to examine how much each term impacts the performance. The training case of is similar to earlier work [wang2018low].
B. Independent optimization of the generator and the segmentation networks: Here, we optimize both and without the segmentation loss similar to [shin2018medical, bowles2018gan]. Later, augmented data created from these optimized generators are used for the independent training of the segmentation network. This experiment reveals the value of the segmentation loss.
C. Varying the number of unlabeled data: We used different number of unlabeled volumes in the training of the generators. The number of 3D volumes studied () were: 5, 10, 20 and 25.
D. Varying the number of labeled 3D volumes used in training: The number of 3D volumes considered () were: 1, 3, 5, 10, 15, 40. This experiment is done to study if any improvement in Dice score is obtained when a large number of annotated volumes are available for training.
E. Different set of train, validation, test and unlabeled 3D volumes: We randomly sample another training, validation, test, and unlabeled image sets from the cardiac dataset different from the earlier sets and rerun learning deformation and intensity field transformations for this new set for one 3D training volume case (). This experiment is done to analyze if the proposed method overfits to a specific dataset split or generalizes for any split.
F. No validation images: Lastly, we report the performance observed when we do not use any validation images in the training of the generators. In this case, the training is stopped after running the model for some predefined number of iterations and these model parameters are used for generating images for augmentation.
Evaluation: Dice’s similarity coefficient (DSC) is used to evaluate the segmentation performance of each method. The performance reported is obtained on number of test images for the structures of each dataset as stated earlier.
V Results and Discussion
Figures ( 1(a)to1(c)),( 1(d)to1(e)) and 1(f) present the quantitative results of the experiments on the cardiac, prostate and pancreas datasets, respectively. The mean DSC and standard deviation values over all the runs are reported on the top of each boxplot in these figures. We observe that the proposed method of augmentation (i.e.,GD+GI) outperforms the other data augmentation and semisupervised learning methods considered here. For qualitative inspection, we present some examples of visual results in Fig. 4. In the rest of this section, we present further analysis of the experimental results.
As expected, the lowest performance was observed when no data augmentation is used. Employing affine augmentation alone provided a substantial boost in performance. Adding random elastic deformations or random intensity fluctuations on top of affine augmentation yielded further improvements.
Using the proposed learned deformation fields(GD) for augmentation yielded higher performance compared to random elastic deformations(RD). This results suggest that the proposed approach to learn a deformation field generator, by optimizing the segmentation accuracy along with a regularization term that leverages unlabeled examples, provided augmented examples more useful for obtaining high segmentation performance than random deformations. Some samples of the generated deformed images are illustrated in the Fig. 5. Surprisingly, we observed that the generated images did not always have realistic anatomical shapes. This is contrary to the popular belief, but generating realistic images may not be necessary nor optimal to obtain the best segmentation network.
Similar to the deformation case, the proposed additive intensity mask(GI) based augmentations performed better than random intensity fluctuations(RI). Here as well, the result suggest benefits of optimizing the intensity transformation generator using the proposed approach. Fig. 5 illustrates that the images generated from the learned intensity transformation generator are not necessarily realistic.
Since both the generators and are modeled to encapsulate different characteristics of the entire population, using both the augmentations is expected to produce higher performance gains over using only one of the augmentation separately. In our experiments, we indeed observed a substantial improvement in DSC when both are used together.
To our surprise, we observed that the Mixup augmentation yielded substantial performance gain over the affine transformations and random elastic deformations. This improvement was despite the augmented images being unrealistic. We attribute the performance improvement to the fact that Mixup creates soft probability maps for augmented images, which has been hypothesized to assist the optimization by providing additional information for training samples [hinton2015distilling]. Applying Mixup over the augmented data obtained from the trained generators and yielded further marginal improvements, which suggests both approaches have complementary benefits.
With selftraining, we observed an improvement in DSC over the affine augmentations as illustrated in Fig. 2. This has been welldocumented in SSL literature [yarowsky1995unsupervised, li2005setred] that retraining the neural network with the estimated predictions of the unlabeled data can assist in improving the segmentation performance [bai2017semi] in limited annotation setting. Although this yields some improvement over the affine augmentations, it did not outperform the proposed augmentations (GD+GI) except for the structure central gland of the prostate dataset.
The semisupervised adversarial training [souly2017semi, zhang2017deep] provided marginal performance gains over the baseline with affine augmentation (Results of [souly2017semi] that uses imagelevel adversarial training are not reported as the GAN training did not converge to reasonable performance for the case of one labeled volume). This observation is not surprising as it has also been shown for other tasks in [oliver2018realistic], SSL training may yield minimal performance gains when affine augmentations are included in the training.
In the Appendix, we provide additional analysis plots such as: (a) the performance improvement seen per test subject averaged over all the runs for each dataset in Fig. 9. We observed that for the majority of test subjects, the proposed method performs better or equal to random augmentations. (b) A sample of generated deformation and additive intensity fields obtained from the trained generators on the cardiac dataset are provided in Fig. 10 and Fig. 11, respectively.
Despite getting performance improvements using the proposed method for the three datasets evaluated in this work, it is important to note that it is a computationally expensive method to deploy on any new dataset. This is because one would need to search for the optimal hyperparameters (, ) for the loss terms.
Ablation studies:
A. Effects of adversarial () and large deviation () loss terms of regularization loss on segmentation performance: In this experiment, we analyze the effect of each term of the regularization loss on the performance by varying the values of and in the training of the generators and .
Fig. 2(a) presents the quantitative results of the analysis on the cardiac dataset for GD+GI augmentations. We observed the least performance gain over baseline when we disabled the whole regularization with , the setup similar to the work in [wang2018low]. This setup yielded performance similar to the case when both random deformations and intensity fluctuations were leveraged for augmentation.
The performance gain we observe when we enable only the adversarial loss in the regularization, i.e. (), can be attributed to enforcing the model to match the distribution of generated images to that of unlabeled images. This matching propels the generator to synthesize examples showing the diverse set of shape and intensity variations present in the unlabeled data.
We observed a deterioration in performance when only largedeviation loss is enabled on deformation and intensity fields (). This setup encourages the generators to explicitly produce larger deformation and intensity fields without any control from the adversarial term. Transformations output from these generators yields very unrealistic samples, in the most extreme case moving all the foreground pixels out of the frame. Such synthetic examples, naturally, are not useful for the segmentation task.
We observed the highest performance boost when both terms were enabled(). This suggests the value of the combination of the terms and their complementary behavior. The largedeviation loss, when used in addition to the adversarial loss, prevents the network from simply replicating the training data with generating very small transformations. Instead, it compels the generator to produce larger fields as long as they satisfy discriminator’s objective. Effectively, producing augmented imagelabel pairs that are very different from the labeled imagelabel pairs. Adversarial loss on the other hand, contains the effects of the largedeviation loss by not allowing models to generate extreme transformations.
B. Independent optimization of the generator and the segmentation networks: One of the biggest claims of the proposed method is the benefit of using the segmentation cost function for learning the generators. This leads to a joint optimization of the segmentation network’s parameters along with the generator’s. Here, we examine the impact on segmentation performance when the generators are optimized without the segmentation loss, only using the regularization term with adversarial and largedeviation terms. Results shown in Table I show that when the segmentation loss is included in the learning of the generators, i.e., joint optimization leads to much higher DSC values than independent optimization. Thus, empirically proving our point that taskdriven optimization is better than the traditional independent optimization, which was the approach taken in earlier works [shin2018medical, bowles2018gan] to generate the augmented data independent of the downstream task.
methods of  

optimization  RV  Myo  LV  RV  Myo  LV 
independent  0.527  0.553  0.719  0.793  0.797  0.909 
(0.266)  (0.218)  (0.23)  (0.187)  (0.097)  (0.09)  
joint  0.651  0.710  0.834  0.832  0.823  0.922 
(0.23)  (0.157)  (0.171)  (0.148)  (0.076)  (0.072) 
C. Varying number of unlabeled images: We investigate how varying the number of unlabeled 3D volumes for the training of the proposed method can influence the segmentation performance. This experiment is illustrated in Fig. 2(b) for the cardiac dataset. We observe that the improvements in segmentation accuracy do not change significantly while varying the number of unlabeled 3D volumes used.
D. Varying number of labeled images: Here, we investigated how the performance gap between affine, random, and proposed augmentations varies as we increase the number of training volumes involved in the training. We observe that the performance gap between the augmentation approaches reduces as we increase the number of labeled training volumes as shown in Fig. 2(c). This is expected since as the labeled examples increase, the network sees larger number cases and gains robustness to variations present in these images. For the case of 40 3D training volumes, we see that the performance of affine augmentations is almost similar to the proposed augmentations. One striking observation is that with the proposed model, segmentation accuracy using 10 labeled examples is similar to using 40 examples using random augmentations.
E. Different set of train, validation, test, and unlabeled 3D volumes: Lastly, we show that the results hold for any randomly chosen dataset split, and does not overfit to a specific set of validation images as illustrated in Table II.
Methods  RV  Myo  LV 

RD+RI  0.506 (0.213)  0.647 (0.159)  0.819 (0.149) 
GD+GI  0.683 (0.212)  0.693 (0.139)  0.842 (0.136) 
F. No validation images: Additionally, we also present results for the case when no validation images were used in the training in Fig. 8 in the appendix. Here, the chosen model parameters are obtained after training the generators for predefined number of iterations. Surprisingly, we observe that the performance obtained without validation images does not vary much w.r.t the performance obtained using validation images.
Vi Conclusion
In the clinical setting, deployment of successful deep learning algorithms for medical image analysis is limited due to the difficulty of assembling largescale annotated datasets. In this work, we proposed a semisupervised taskdriven data augmentation method to tackle the issue of obtaining robust segmentation in limited data setting for training. To achieve this we proposed two novel contributions: (i) taskdriven based optimization where the generation of the augmentation data is optimal for the segmentation performance, and (ii) semisupervised nature is induced by using the unlabeled data in the generative modeling setup, where we design two conditional generative models to output transformations that capture two factors of variations: shape and intensity characteristics present in the population. Using three publicly available datasets, we demonstrated the proposed method for segmenting the cardiac, prostate and pancreas using limited annotated examples, reporting substantial performance gains over existing methods. Surprisingly, the augmented images generated via the proposed taskdriven approach were not necessarily realistic yet yielded improved segmentation performance, questioning the validity of the assumption that generating realistic examples is the optimal way.
Vii Acknowledgements
The presented work is partially funded by Swiss Data Science Center (DeepMicroIA) and the Clinical Research Priority Program Grant on Artificial Intelligence in Oncological Imaging Network from University of Zurich. We thank Nvidia for their GPU donation.
Comments
There are no comments yet.