Semi-Supervised and Task-Driven Data Augmentation

02/11/2019 ∙ by Krishna Chaitanya, et al.

Supervised deep learning methods for segmentation require large amounts of labelled training data, without which they are prone to overfitting and generalize poorly to unseen images. In practice, obtaining a large number of annotations from clinical experts is expensive and time-consuming. One way to address the scarcity of annotated examples is data augmentation using random spatial and intensity transformations. Recently, it has been proposed to use generative models to synthesize realistic training examples, complementing the random augmentation. So far, these methods have yielded limited gains over the random augmentation. However, there is potential to improve the approach by (i) explicitly modeling deformation fields (non-affine spatial transformations) and intensity transformations and (ii) leveraging unlabelled data during the generative process. With this motivation, we propose a novel task-driven data augmentation method in which, to synthesize new training examples, a generative network explicitly models and applies deformation fields and additive intensity masks on existing labelled data, modeling shape and intensity variations, respectively. Crucially, the generative model is optimized to be conducive to the task, in this case segmentation, and constrained to match the distribution of images observed from labelled and unlabelled samples. Furthermore, explicit modeling of deformation fields allows synthesizing segmentation masks and images in exact correspondence by simply applying the generated transformation to an input image and the corresponding annotation. Our experiments on cardiac magnetic resonance images (MRI) showed that, for the task of segmentation in small training data scenarios, the proposed method substantially outperforms conventional augmentation techniques.




1 Introduction

This article has been accepted at the 26th international conference on Information Processing in Medical Imaging (IPMI) 2019.

Precise segmentation of anatomical structures is crucial for several clinical applications. Recent advances in deep neural networks have yielded automatic segmentation algorithms with unprecedented accuracy. However, such methods rely heavily on large annotated training datasets. In this work, we consider the problem of medical image segmentation in the setting of small training datasets.

Let us first consider the question: why is a large training dataset necessary for the success of deep learning methods? One hypothesis is that a large training dataset exposes a neural network to sufficient variations in factors such as shape, intensity and texture, thereby allowing it to learn a robust image-to-segmentation-mask mapping. In medical images, such variations may arise from subject-specific shape differences in anatomy or lesions. Image intensity and contrast characteristics may differ substantially according to the image acquisition protocol, or even between scanners for the same acquisition protocol. When the training dataset is small, deep learning methods are susceptible to faring poorly on unseen test images, either because they fail to capture such variations or because the test images appear to have been drawn from a distribution different from that of the training images.

We conjecture that one way to train a segmentation network more robustly on a small training dataset could be to incorporate into the training dataset the intensity and anatomical shape variations observed in a large pool of unlabelled images. Specifically, we propose to generate synthetic image-label pairs by learning generative models of deformation fields and intensity transformations that map the available labelled training images to the distribution of the entire pool of available images, labelled as well as unlabelled. Additionally, we explicitly encourage the synthesized image-label pairs to be conducive to the task at hand. We carried out an extensive evaluation of the proposed method, in which it showed substantial improvements over existing data augmentation as well as semi-supervised learning techniques for segmentation of cardiac MRIs.

Related work: Due to the high cost of obtaining a large number of expert annotations, robust training of machine learning methods in the small training dataset setting has been widely studied in the literature. Focusing on the methods that are most relevant to the proposed method, we broadly classify the related works into two categories:

Data augmentation is a technique wherein the training dataset is enlarged with artificially synthesized image-label pairs. The main idea is to transform training images in such a way that the corresponding labels are either unchanged or get transformed in the same way. Some commonly used data augmentation methods are affine transformations [6] (such as translation, rotation, scaling, flipping, cropping, etc.) and random elastic deformations [17, 15]. Leveraging recent advances in generative image modelling [9], several works proposed to map randomly sampled vectors from a simple distribution to realistic image-label pairs as augmented data for medical image segmentation problems [4, 7, 16]. Such methods are typically trained on already labelled data, with the objective of interpolating within the training dataset. In an alternative direction, [19] proposed to synthesize data for augmentation by simply linearly interpolating the available images and the corresponding labels. Surprisingly, despite employing clearly unrealistic images, this method led to substantial improvements in medical image segmentation [8] when the available training dataset is very small. None of these data augmentation methods use unlabelled images that may be more readily available and all of them, except for those based on generative models, are hand-crafted rather than optimized based on data.

Semi-supervised learning (SSL) methods are another class of techniques that are suitable in the setting of learning with small labelled training datasets. The main idea of these methods is to regularize the learning process by employing unlabelled images. Approaches based on self-training [2] alternately train a network with labelled images, estimate labels for the unlabelled images using the network, and update the network with both the available true image-label pairs and the estimated labels for the unlabelled images. [20] propose an SSL method based on adversarial learning, where the joint distribution of unlabelled image-estimated label pairs is matched to that of the true labelled image-label pairs. Interestingly, [13] show that many SSL methods fail to provide substantial gains over a supervised baseline that is trained with data augmentation and regularization.

Weakly-supervised learning tackles the issue of expensive pixel-wise annotations by training on weaker labels, such as scribbles [5] and image-wide labels [1]. Finally, other regularization methods that do not necessarily leverage unlabelled images may also aid in preventing over-fitting to small training datasets.

2 Methods

In a supervised learning setup, an objective function $L_S$ that measures the discrepancy between ground truth labels, $Y_L$, and predictions of a network $S$ on training images, $X_L$, is minimized with respect to a set of learnable parameters $w_S$ of the network, i.e.

$$\min_{w_S} \; L_S(\{X_L, Y_L\}) \tag{1}$$

When data augmentation is employed in the supervised learning setup, Eq. 2 is minimized with respect to $w_S$.

$$\min_{w_S} \; L_S(\{X_L, Y_L\} \cup \{X_G, Y_G\}) \tag{2}$$

Here, $X_G$ and $Y_G$ refer to generated images and labels obtained by affine or elastic transformations of $X_L$, $Y_L$ or by using methods such as Mixup [19]. The set $\{X_L, Y_L\} \cup \{X_G, Y_G\}$ is referred to as the augmented training set. In augmentation methods based on generative models [4], the parameters of $S$ are still optimized according to Eq. 2, but the generative process for $X_G$ and $Y_G$ involves two other networks: a generator $G$ and a discriminator $D$. The corresponding parameters $w_G$ and $w_D$ are estimated according to the generative adversarial learning (GAN) [9] framework by optimizing:

$$\min_{w_G} \max_{w_D} \; \mathbb{E}_{x, y \sim p(x_L, y_L)}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3}$$

$G$ takes as input a vector $z$ sampled from a known distribution $p_z(z)$ and maps that to a {$X_G$, $Y_G$} pair. $D$ is optimized to distinguish between outputs of $G$ and real {$X_L$, $Y_L$} pairs, while $G$ is optimized to generate {$X_G$, $Y_G$} pairs such that $D$ responds to them similarly as to {$X_L$, $Y_L$}. Thus, {$X_G$, $Y_G$} are encouraged to be “realistic” in the sense that they cannot be distinguished by $D$.

2.1 Semi-Supervised and Task-Driven Data Augmentation

Instead of solving the optimization given in Eq. 3 for generating the augmentation image-label pairs {$X_G$, $Y_G$}, we propose solving Eq. 4:

$$\min_{w_G} \min_{w_S} \; L_S(\{X_L, Y_L\} \cup \{X_G, Y_G\}) + L_{reg, w_G}(\{X_L\} \cup \{X_{UL}\}) \tag{4}$$

This incorporates two ideas. The first term dictates that {$X_G$, $Y_G$} be such that they are beneficial for minimizing the segmentation loss $L_S$. Secondly, note that $L_{reg, w_G}$ depends not only on the labelled images {$X_L$} (as in Eq. 3), but also on the unlabelled images {$X_{UL}$}. It is a regularization term based on an adversarial loss, which incorporates information about the image distribution that can be extracted from both {$X_L$} and {$X_{UL}$}. This is achieved by synthesizing $X_G$ as $G_C(X_L, z)$, where $G_C$ denotes a conditional generative model. $G_C$ is modelled in two different ways: one for deformation fields (non-affine spatial transformations) and one for intensity transformations. In both cases, the formulation is such that as a certain labelled image is mapped to an augmentation image, the mapping to obtain the corresponding augmentation label readily follows.

2.1.1 Deformation Field Generator

The deformation field generator, $G_C = G_V$, is trained to create samples from the distribution of deformation fields that can potentially map elements from the labelled set $X_L$ to those in the combined set $X_L \cup X_{UL}$ of labelled and unlabelled images. $G_V$ takes as input an image from $X_L$ and a vector $z$, sampled from a unit Gaussian distribution, and outputs a dense per-pixel deformation field, v. The input image and its corresponding label (in 1-hot encoding) are warped using bilinear interpolation according to v to produce $X_{G_V}$ and $Y_{G_V}$ respectively.
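The warping step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `warp_bilinear` and `warp_pair`, the (dy, dx) offset convention and the clamping of sampling positions at the image border are all assumptions made for illustration. What it shows is that applying one field to the image and to every 1-hot channel keeps image and label in exact correspondence.

```python
import numpy as np

def warp_bilinear(channel, v):
    """Warp a 2-D array with a dense per-pixel deformation field.

    channel: (H, W) array. v: (H, W, 2) array of (dy, dx) offsets, so that
    output[i, j] samples the input at (i + dy, j + dx). Sampling positions
    are clamped to the image border (an assumption of this sketch).
    """
    h, w = channel.shape
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(ii + v[..., 0], 0, h - 1)
    x = np.clip(jj + v[..., 1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = y - y0  # fractional part along rows
    wx = x - x0  # fractional part along columns
    top = (1 - wx) * channel[y0, x0] + wx * channel[y0, x1]
    bottom = (1 - wx) * channel[y1, x0] + wx * channel[y1, x1]
    return (1 - wy) * top + wy * bottom

def warp_pair(image, label, v, num_classes):
    """Apply the same field to an image and to its 1-hot encoded label."""
    one_hot = np.eye(num_classes)[label]  # (H, W, C)
    warped_image = warp_bilinear(image, v)
    warped_one_hot = np.stack(
        [warp_bilinear(one_hot[..., c], v) for c in range(num_classes)],
        axis=-1,
    )
    return warped_image, warped_one_hot
```

Because the bilinear weights sum to one, the warped 1-hot label remains a valid per-pixel probability vector, which is one reason warping the 1-hot encoding rather than the integer label map is convenient.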

2.1.2 Additive Intensity Field Generator

The intensity field generator, $G_C = G_I$, is trained to draw random samples from the distribution of additive intensity fields that can potentially map elements from the labelled set $X_L$ to those in the combined set $X_L \cup X_{UL}$. $G_I$ takes as input an element of $X_L$ and a noise vector and outputs an intensity mask, $\Delta I$. $\Delta I$ is added to the input image to give the transformed image $X_{G_I}$, while its segmentation mask $Y_{G_I}$ remains the same as that of the input image.
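As a small illustration of the additive formulation, the sketch below adds a bounded mask to an image and passes the label through untouched. The function name `apply_intensity_mask` is hypothetical, and the tanh squashing of a raw mask stands in for the generator's final activation; it is a sketch under these assumptions rather than the authors' code.

```python
import numpy as np

def apply_intensity_mask(image, label, raw_mask):
    """Add a tanh-capped intensity mask to an image; the label is unchanged."""
    delta_i = np.tanh(raw_mask)  # each value lies strictly inside (-1, 1)
    return image + delta_i, label
```

Since the transformation does not move any pixel, the original segmentation mask is reused as the label of the augmented image.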

Figure 3: Modules for task-driven and semi-supervised data augmentation: (a) deformation field cGAN, (b) additive intensity field cGAN.

2.1.3 Regularization Loss

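For both the conditional generators, the regularization term, $L_{reg, w_G}$, in Eq. 4 is formulated as in Eq. 5. The corresponding discriminator networks are trained to minimize the usual adversarial objective (Eq. 6).

$$L_{reg, w_G} = \lambda_{adv} \, \mathbb{E}_{x_L \sim p(x_L),\, z \sim p_z(z)}[\log(1 - D_C(G_C(x_L, z)))] + \lambda_{big} \, L_{G_C, big} \tag{5}$$

$$\min_{w_D} \; -\mathbb{E}_{x \sim p(x)}[\log D_C(x)] - \mathbb{E}_{x_L \sim p(x_L),\, z \sim p_z(z)}[\log(1 - D_C(G_C(x_L, z)))] \tag{6}$$

Here, $p(x)$ denotes the distribution of images in the combined set of labelled and unlabelled images. The generated images are obtained as $G_V(x_L, z) = v \circ x_L$ and $G_I(x_L, z) = x_L + \Delta I$, where $\circ$ denotes a bilinear warping operation. In our experiments, we observe that with only the adversarial loss term in $L_{reg, w_G}$, the generators tend to create only the identity mapping. So, we introduce the $L_{G_C, big}$ term to incentivize non-trivial transformations. We formulate $L_{G_V, big}$ and $L_{G_I, big}$ as $-\|v\|_1$ and $-\|\Delta I\|_1$ respectively.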
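A rough numerical sketch of the generator-side regularization loss may help to see how the two terms interact. It assumes that the adversarial term is the mean of log(1 - D(G(x, z))) over a batch and that the term rewarding non-trivial transformations is the negative L1 norm of the generated field or mask; the function name, the `eps` constant and the use of a sum for the L1 norm are assumptions of this illustration, and the weights are left to the caller.

```python
import numpy as np

def generator_reg_loss(d_fake_prob, transform, lambda_adv, lambda_big, eps=1e-8):
    """lambda_adv * mean(log(1 - D(G(x, z)))) + lambda_big * (-||transform||_1).

    d_fake_prob: discriminator probabilities that generated images are real.
    transform:   generated deformation field or intensity mask.
    """
    adv = np.mean(np.log(1.0 - d_fake_prob + eps))
    big = -np.sum(np.abs(transform))  # larger transformations lower the loss
    return lambda_adv * adv + lambda_big * big
```

With the second weight set to zero, an all-zero transformation incurs no penalty relative to any other, which matches the observation that the adversarial term alone does not discourage the identity mapping.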

2.1.4 Optimization Sequence

The method starts by learning the optimal data augmentation for the segmentation task. Thus, all networks, i.e. the segmentation network $S$, the conditional generator $G_C$ and its discriminator $D_C$, are optimized according to Eq. 4. The generative models for the deformation fields and the intensity fields are trained separately. Once this is complete, both generators, $G_V$ (deformation fields) and $G_I$ (intensity fields), are fixed and the parameters of $S$ are re-initialized. Now, $S$ is trained again according to Eq. 2, using the original labelled training data and augmentation data generated using the trained $G_V$ or $G_I$ or both.

3 Dataset and Network details

3.1 Dataset Details

We used a publicly available dataset hosted as part of the MICCAI’17 ACDC challenge [3]. It comprises short-axis cardiac cine-MRIs of 100 subjects from 5 groups: 20 normal controls and 20 each with 4 different cardiac abnormalities. The in-plane and through-plane resolutions of the images range from 0.70x0.70mm to 1.92x1.92mm and 5mm to 10mm respectively. Expert annotations are provided for left ventricle (LV), myocardium (Myo) and right ventricle (RV) for both end-systole (ES) and end-diastole (ED) phases of each subject. For our experiments, we only used the ES images.

3.2 Pre-processing

We apply the following pre-processing steps to all images of the dataset: (i) bias correction using the N4 [18] algorithm, (ii) normalization of each 3d image by linearly re-scaling the intensities between two percentile values of the bias-corrected 3d image, (iii) re-sampling of each slice of each 3d image and the corresponding labels to an in-plane resolution of 1.367x1.367mm using bi-linear and nearest neighbour interpolation respectively, followed by cropping or padding to a fixed size of 224x224.
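Steps (ii) and the crop-or-pad part of (iii) can be sketched in NumPy as follows. The percentile pair is passed in as parameters because this sketch does not assume particular values; the function names, the centring of the crop and the zero padding are likewise assumptions for illustration, not details confirmed by the text.

```python
import numpy as np

def rescale_by_percentiles(volume, p_low, p_high):
    """Linearly re-scale intensities so the two percentiles map to 0 and 1."""
    lo, hi = np.percentile(volume, [p_low, p_high])
    return (volume - lo) / (hi - lo)

def crop_or_pad(slice_2d, size=224):
    """Center-crop or zero-pad a 2-D slice to size x size."""
    out = np.zeros((size, size), dtype=slice_2d.dtype)
    h, w = slice_2d.shape
    ch, cw = min(h, size), min(w, size)          # extent that is copied
    sy, sx = (h - ch) // 2, (w - cw) // 2        # top-left corner in source
    dy, dx = (size - ch) // 2, (size - cw) // 2  # top-left corner in output
    out[dy:dy + ch, dx:dx + cw] = slice_2d[sy:sy + ch, sx:sx + cw]
    return out
```

Values outside the chosen percentile range fall outside [0, 1] after re-scaling; whether to clip them is a further design choice that this sketch leaves open.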

3.3 Network Architectures

There are three types of networks in the proposed method (see Fig. 3): a segmentation network $S$, a generator network $G$ and a discriminator network $D$. In this sub-section, we describe their architectures. Except for the last layer of $G$, the same architecture is used for the $G$ and $D$ networks used for modelling both the deformation fields and the intensity transformations.

Generator: $G$ takes as input an image from $X_L$ and a noise vector $z$ of dimension 100, which are both first passed through separate sub-networks, $G_{subnet,X}$ and $G_{subnet,z}$. $G_{subnet,X}$ consists of 2 convolutional layers, while $G_{subnet,z}$ consists of a fully-connected layer, followed by reshaping of the output, followed by 5 convolutional layers, interleaved with bilinear upsampling layers. The outputs of the two sub-networks are of the same dimensions. They are concatenated and passed through a common sub-network, $G_{subnet,common}$, consisting of 4 convolutional layers, the last of which is different for the deformation field generator $G_V$ and the intensity field generator $G_I$. The final convolutional layer for $G_V$ outputs two feature maps corresponding to the 2-dimensional deformation field v, while that for $G_I$ outputs a single feature map corresponding to the intensity mask $\Delta I$. The final layer of $G_I$ employs the tanh activation to cap the range of the intensity mask. All other layers use the ReLU activation. All convolutional layers have 3x3 kernels, except for the final ones in both $G_V$ and $G_I$, and are followed by batch-normalization layers before the activation.

Discriminator: $D$ consists of 5 convolutional layers with kernel size of 5x5 and stride 2. The convolutions are followed by batch normalization layers and leaky ReLU activations with the negative slope of the leak set to 0.2. After the convolutional layers, the output is reshaped and passed through 3 fully-connected layers, with the final layer having an output size of 2.
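To make the discriminator's down-sampling concrete, the short calculation below traces the spatial size of a 224x224 input through five 5x5 convolutions with stride 2. The padding of 2 pixels per side is an assumption of this sketch (the text does not state the padding scheme), and `conv_output_size` is a hypothetical helper implementing the standard convolution size formula.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
sizes = []
for _ in range(5):
    size = conv_output_size(size, kernel=5, stride=2, padding=2)
    sizes.append(size)

print(sizes)  # [112, 56, 28, 14, 7]
```

Under this padding assumption, each layer halves the resolution, so the feature map that is reshaped before the fully-connected layers would be 7x7.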

Segmentation Network: We use a U-net [15] like architecture for $S$. It has an encoding and a decoding path. In the encoder, there are 4 convolutional blocks, each consisting of 2 3x3 convolutions, followed by a max-pooling layer. The decoder consists of 4 convolutional blocks, each made of a concatenation with the corresponding features of the encoder, followed by 2 3x3 convolutions, followed by bi-linear upsampling with factor 2. Batch normalization and ReLU activation are employed in all layers, except the last one.

3.4 Training Details

Weighted cross-entropy is used as the segmentation loss, $L_S$. We empirically set the weights of the 4 output labels to 0.1 (background) and 0.3 (each of the 3 foreground labels). The background loss is considered while learning the augmentations, but not while learning the segmentation task alone. The weights of the two terms of the regularization loss, $\lambda_{adv}$ (adversarial term) and $\lambda_{big}$ (term incentivizing non-trivial transformations), are set empirically, with $\lambda_{adv}$ set to 1. The batch size is set to 20 and each training is run for 10000 iterations. The model parameters that provide the best dice score on the validation set are chosen for evaluation. The Adam optimizer is used for all networks, with the same initial learning rate and momentum parameters for each.
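A minimal NumPy version of a class-weighted cross-entropy with the stated weights (0.1 for background, 0.3 for each foreground label) is sketched below. The assumption that channel 0 is the background, the averaging over pixels and the `eps` constant are choices of this illustration rather than details given in the text.

```python
import numpy as np

# Assumption: channel 0 is background, channels 1-3 are the foreground labels.
CLASS_WEIGHTS = np.array([0.1, 0.3, 0.3, 0.3])

def weighted_cross_entropy(probs, one_hot, weights=CLASS_WEIGHTS, eps=1e-8):
    """Mean over pixels of -sum_c w_c * y_c * log(p_c).

    probs, one_hot: arrays of shape (..., C) holding predicted class
    probabilities and 1-hot ground truth labels.
    """
    per_pixel = -np.sum(weights * one_hot * np.log(probs + eps), axis=-1)
    return per_pixel.mean()
```

Dropping the background contribution, as described for training the segmentation task alone, would correspond to passing a weight vector whose first entry is 0.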

4 Experiments

We divide the dataset into test ($X_{ts}$), validation ($X_{vl}$), labelled training ($X_L$) and unlabelled training ($X_{UL}$) sets which consist of 20, 2, NL and 25 3d images respectively. As we are interested in the few labelled training images scenario, we run all our experiments in two settings: with NL set to 1 and 3. $X_{ts}$, $X_{vl}$ and $X_{UL}$ are selected randomly a-priori and fixed for all experiments. $X_{ts}$ and $X_{UL}$ are chosen such that they consist of an equal number of images from each group (see Sec. 3.1) of the dataset. A separate set of 10 images (2 from each group), $X_{L,total}$, is selected randomly. Each experiment is run 5 times with $X_L$ as NL images randomly selected from $X_{L,total}$. When NL is 3, it is ensured that the images in $X_L$ come from different groups. Further, each of the 5 runs with different $X_L$ is run thrice in order to account for variations in convergence of the networks. Thus, overall, we have 15 runs for each experiment.

The following experiments were done thrice for each choice of $X_L$:

  • No data augmentation (Augnone): The segmentation network $S$ is trained without data augmentation.

  • Affine data augmentation (AugA): $S$ is trained with data augmentation comprising affine transformations. These consist of rotation (by an angle randomly chosen between -15deg and +15deg), scaling (with a factor chosen uniformly at random between 0.9 and 1.1), another possible rotation by a multiple of 45deg (angle=45deg*N, where N is randomly chosen between 0 and 8), and flipping along the x-axis. For each slice in a batch, a random number between 0 and 5 is uniformly sampled and, accordingly, the slice is either left as it is or transformed by one of the 4 stated transformations.

    For all the following data augmentation methods, affine transformations are first applied to each training batch (of batch size bs) as explained above. The batch used for training consists of half of these images along with bs/2 augmentation images obtained according to the particular augmentation method.

  • Random elastic deformations (AugA,RD): Elastic augmentations are modelled as in [15], where a deformation field is created by sampling each element of a 3x3x2 matrix from a Gaussian distribution with mean 0 and standard deviation 10 and upscaling it to the image dimensions using bi-cubic interpolation.

  • Random contrast and brightness fluctuations [11, 14] (AugA,RI): This comprises an image contrast adjustment step with factor c, followed by a brightness adjustment step with offset b. We sample c and b uniformly in [0.8,1.2] and [-0.1,0.1] respectively.

  • Deformation field transformations (AugA,GD): Augmentation data is generated from the trained deformation field generator $G_V$.

  • Intensity field transformations (AugA,GI): Augmentation data is generated from the trained intensity field generator $G_I$.

  • Both deformation and intensity field transformations (AugA,GD,GI): In this experiment, we sample data from $G_V$ and $G_I$ to obtain transformed images $X_{G_V}$ and $X_{G_I}$ respectively. We also get an additional set of images, $X_{G_{VI}}$, which contain both deformation and intensity transformations. These are obtained by conditioning $G_I$ on spatially transformed images $X_{G_V}$. The augmentation data comprises all these images, $\{X_{G_V}, X_{G_I}, X_{G_{VI}}\}$.

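  • Mixup [19] (AugA,Mixup): Augmentation data ($\{X_{Mixup}, Y_{Mixup}\}$) is generated using the original annotated images and their linear combinations using the Mixup formulation as stated in Eq. 7.

    $$X_{Mixup} = \lambda x_{L_i} + (1 - \lambda) x_{L_j}, \qquad Y_{Mixup} = \lambda y_{L_i} + (1 - \lambda) y_{L_j} \tag{7}$$

    $\lambda$ is sampled from the beta distribution Beta($\alpha$, $\alpha$) with $\alpha > 0$, so that $\lambda \in [0, 1]$; it controls the ratio in which to mix the image-label pairs $(x_{L_i}, y_{L_i})$ and $(x_{L_j}, y_{L_j})$, selected randomly from the set of labelled training images.

  • Mixup over deformation and intensity field transformations (AugA,GD,GI,Mixup): Mixup is applied over different pairs of available images: original data ($X_L$), their affine transformations and the images generated using the deformation and intensity field generators, $\{X_{G_V}, X_{G_I}, X_{G_{VI}}\}$.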

  • Adversarial Training (Adv Tr): Here, we investigate the benefit of the method proposed in [20] on our dataset (explained in Sec. 1), in both supervised (SL) [12] and semi-supervised (SSL) [20] settings.
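The Mixup augmentation in the list above amounts to a convex combination of two image-label pairs. A minimal sketch is given below; the function name `mixup_pair` is hypothetical, labels are assumed to be 1-hot encoded, and the value of `alpha` is left to the caller because no particular value is assumed here.

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Mix two image-label pairs with a ratio drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x_i + (1.0 - lam) * x_j
    y_mix = lam * y_i + (1.0 - lam) * y_j  # soft labels that still sum to 1
    return x_mix, y_mix, lam
```

A call might look like `mixup_pair(x_a, y_a, x_b, y_b, alpha=0.4, rng=np.random.default_rng(0))`, where the value 0.4 is purely illustrative. The mixed label is a soft probability map rather than a hard segmentation, so the loss must accept non-binary targets.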

Evaluation: The segmentation performance of each method is evaluated using the Dice similarity coefficient (DSC) over 20 test subjects for three foreground structures: left ventricle (LV), myocardium (Myo) and right ventricle (RV).
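For reference, the Dice similarity coefficient of one structure can be computed from integer label maps as in the sketch below. The function name `dice_score` and the convention of returning 1.0 when the structure is absent from both prediction and ground truth are assumptions of this illustration.

```python
import numpy as np

def dice_score(pred, target, label):
    """Dice similarity coefficient 2|P & T| / (|P| + |T|) for one label."""
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # assumption: absent in both maps counts as a perfect match
    return 2.0 * np.logical_and(p, t).sum() / denom
```

For a multi-structure evaluation, this function would be called once per foreground label and the results averaged over test subjects.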

5 Results and Discussion

Table 1 presents quantitative results of our experiments. The reported numbers are the mean dice scores over the 15 runs for each experiment, as described in Sec. 4. It can be observed that the proposed method provides substantial improvements over other data augmentation methods as well as the semi-supervised adversarial learning method, especially in the case where only 1 3D volume is used for training. The improvements can also be visually observed in Fig. 4. In the rest of this section, we discuss the results of specific experiments.

Perhaps unsurprisingly, the lowest performance occurs when neither data augmentation nor semi-supervised training is used. Data augmentation with affine transformations already provides remarkable gains in performance. Both random elastic deformations and random intensity fluctuations further improve accuracy.

The proposed augmentations based on learned deformation fields improve performance as compared to random elastic augmentations. These results show the benefit of encouraging the deformations to span the geometric variations present in the entire population (labelled as well as unlabelled images), while still generating images that are conducive to the training of the segmentation network. Some examples of the generated deformed images are shown in Fig. 5. Interestingly, the anatomical shapes in these images are not always realistic. While this may appear to be counter-intuitive, perhaps preserving realistic shapes of anatomical structures is not essential to obtaining the best segmentation neural network.

Similar observations can be made about the proposed augmentations based on learned additive intensity masks as compared to random intensity fluctuations. Again, the improvements may be attributed to encouraging the intensity transformations to span the intensity statistics present in the population, while being beneficial for the segmentation task. Qualitatively, also as before, the generated intensity masks (Fig. 5) do not necessarily lead to realistic images.

As both the deformation field generator $G_V$ and the intensity field generator $G_I$ are designed to capture different characteristics of the entire dataset, using both the augmentations together may be expected to provide a higher benefit than employing either one in isolation. Indeed, we observe a substantial improvement in dice scores with our experiments.

As an additional experiment, we investigated the effect of excluding the regularization term from the training of the generators, $G_V$ and $G_I$. While the resulting augmentations still resulted in better performance than random deformations or intensity fluctuations, their benefits were smaller than those of the augmentations that were trained with the regularization. This shows that although the adversarial loss does not ensure the generation of realistic images, it is still advantageous to include unlabelled images in the learning of the augmentations.

Augmentations obtained from the Mixup [19] method also lead to a substantial improvement in performance as compared to using affine transformations, random elastic transformations or random intensity fluctuations. Interestingly, this benefit also occurs despite the augmented images being not realistic looking at all. One reason for this behaviour might be that the Mixup augmentation method provides soft probability labels for the augmented images - such soft targets have been hypothesized to aid optimization by providing more task information per training sample [10]. Even so, Mixup can only generate augmented images that are linear combinations of the available labelled images and it is not immediately clear how to extend this method to use unlabelled images. Finally, we see that Mixup provides a marginal improvement when applied over the original images together with the augmentations obtained from the trained deformation and intensity field generators. This demonstrates the complementary benefits of the two approaches.

While semi-supervised adversarial learning provides improvement in performance as compared to training with no data augmentation, these benefits are only as much as those obtained with simple affine augmentation. This observation seems to be in line with works such as [13].

Method Number of 3D training volumes used
Augnone 0.259 0.291 0.446 0.589 0.631 0.805
AugA 0.373 0.484 0.644 0.733 0.744 0.885
AugA,RD 0.397 0.503 0.663 0.756 0.763 0.897
AugA,GD() 0.394 0.756
AugA,RI 0.429 0.554 0.742 0.744 0.759 0.896
AugA,GI() 0.912
AugA,GI() 0.893
AugA,Mixup [19] 0.581 0.599 0.774 0.818 0.791 0.915
AugA,GD,GI,Mixup 0.679 0.713 0.849 0.844 0.825 0.924
Adv Tr SL [12] 0.417 0.507 0.698 0.731 0.753 0.891
Adv Tr SSL [20] 0.409 0.506 0.692 0.692 0.719 0.874

Table 1: Average Dice score (DSC) results over 15 runs of 20 test subjects for the proposed method and relevant works. The three markers in the table denote statistical significance over AugA,RD, AugA,RI and AugA,Mixup respectively (Wilcoxon signed rank test with threshold p value of 0.05).

Figure 4: Qualitative comparison of the proposed method with other approaches: (a) input image, (b) ground truth, (c) AugA, (d) AugA,RD, (e) Adv Tr SL [12], (f) AugA,Mixup [19], (g) AugA,GD,GI

Figure 5: Generated augmentation images from the deformation field generator (top) and the intensity field generator (bottom). Each row shows an input image followed by images generated from it by the respective generator.

6 Conclusion

One of the challenging requirements for the success of deep learning methods in medical image analysis problems is that of assembling large-scale annotated datasets. In this work, we propose a semi-supervised and task-driven data augmentation approach to tackle the problem of robust image segmentation in the setting of training datasets consisting of as few as 1 to 3 labelled 3d images. This is achieved via two novel contributions: (i) learning conditional generative models of mappings between labelled and unlabelled images, in a formulation that also readily enables the corresponding segmentation annotations to be appropriately mapped and (ii) guiding these generative models with task-dependent losses. In the small labelled data setting, for the task of segmenting cardiac MRIs, we show that the proposed augmentation method substantially outperforms the conventional data augmentation techniques. Interestingly, we observe that in order to obtain improved segmentation performance, the generated augmentation images do not necessarily need to be visually hyper-realistic.