1 Description of purpose
The availability of training data in medical image segmentation problems is often very limited. In such cases, convolutional neural networks (CNNs) tend to overfit due to over-parameterization and a lack of feature generalization to variations in shape and appearance. In order to address generalization, one has to find models that generate features equivariant or invariant under different transformations of the input. Equivariance of the feature maps generated by CNNs to certain transformations can be obtained by using group convolutions [cohen16], where different orientations of the feature maps are learnt by kernels with shared weights. However, the resulting equivariance is restricted to linear and symmetric transformations. In order to reach generalization across a larger group of transformations, one has to rely on data augmentation.
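To make the equivariance property concrete, the following sketch checks that an ordinary convolution commutes with (circular) translations; group convolutions [cohen16] extend this commutation to rotations and reflections. The image and kernel here are arbitrary toy data, not anything from the paper.

```python
import numpy as np
from scipy.ndimage import correlate

# Equivariance of plain convolution to translation:
# conv(translate(x)) == translate(conv(x)) for circular shifts.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernel = rng.random((3, 3))

conv_then_shift = np.roll(correlate(image, kernel, mode='wrap'), 2, axis=0)
shift_then_conv = correlate(np.roll(image, 2, axis=0), kernel, mode='wrap')

assert np.allclose(conv_then_shift, shift_then_conv)
```

The same check fails for, e.g., a rotation by an arbitrary angle, which is why broader invariances must come from either group convolutions or augmentation.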
Data augmentation is commonly achieved by applying transformations that generate warped versions of the available training data. The choice of transformation in the literature so far has been fairly arbitrary, often restricted to rotations, translations, reflections, and very small nonlinear deformations [roth2015, NIPS2015_5854, hauberg2016]. Some degree of learning the right kind of transformations needed to improve network performance was introduced in [NIPS2015_5854]. Hauberg et al. [hauberg2016] proposed a diffeomorphic registration approach where the distribution of transformations is learnt by explicitly constructing it from pairwise registrations of similar images. Once the distribution is constructed, new transformations are sampled and applied to the training data to obtain data for augmentation. However, the variations captured by the probability density function (pdf) of transformations are strongly dependent on the choice of image pairs. One may circumvent this dependency by registering all possible image pairs in the dataset; however, this is computationally expensive, and images that cannot be plausibly registered may induce transformations that are not meaningful.

In order to obtain a model that automatically produces transformations capturing the shape variations in the training data, we propose Probabilistic Augmentation of Data using Diffeomorphic Image Transformation (PADDIT). PADDIT involves an unsupervised approach to learn shape variations that naturally appear in the training dataset. This is done by first constructing an unbiased template image that represents the central tendency of shapes in the training dataset. We sample, using a Hamiltonian Monte Carlo (HMC) scheme [duane1987], transformations that warp the training images to the generated mean template. The sampled transformations are used to perturb the training data, which is then used for augmentation. We use CNNs to segment T1/FLAIR brain magnetic resonance images (MRI) for white matter hyperintensities. We show that PADDIT outperforms CNNs trained with either no data augmentation or limited augmentation (using random B-spline transformations).
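The perturbation step, warping each training image and its label with a sampled velocity field, can be sketched as below. This is a simplified illustration, not the paper's implementation: the velocity field is applied as a single scaled displacement rather than through a full exponential-map integration, and `warp`/`augment` are hypothetical helper names.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, velocity, t=1.0, order=3):
    """Warp an image by a stationary velocity field integrated to time t.
    A single scaled-displacement step stands in for exp(t*v) here.
    order=3: cubic interpolation (images); order=0: nearest neighbor (labels)."""
    grid = np.indices(image.shape).astype(float)
    coords = grid + t * velocity
    return map_coordinates(image, coords, order=order, mode='nearest')

def augment(image, label, velocity, rng):
    """Deform an image/label pair with the same sampled transformation."""
    t = rng.uniform(0.0, 1.0)  # random integration time in [0, 1]
    return warp(image, velocity, t, order=3), warp(label, velocity, t, order=0)

# Toy 2-D example with a small random velocity field.
rng = np.random.default_rng(2)
image = rng.random((32, 32))
label = (image > 0.5).astype(np.int32)
velocity = rng.normal(scale=0.5, size=(2, 32, 32))
aug_image, aug_label = augment(image, label, velocity, rng)
assert aug_image.shape == image.shape
assert set(np.unique(aug_label)) <= {0, 1}  # nearest neighbor keeps labels discrete
```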
2 Methods
Probabilistic Bayesian models for template estimation in registration were introduced in [zhang2013], albeit using a different class of transformations. In short, the method views image registration as a maximum a posteriori (MAP) problem where the similarity between two images $(I_1, I_2)$ is the likelihood. The transformations $\exp(v)$ (the Lie group exponential of a time-constant velocity field $v$) are regularized by a prior in the form of a norm attached to the velocity field. Formally, registration is the minimization of the energy

$$E(v) = \frac{1}{2\sigma^2}\,\big\|I_1 \circ \exp(v)^{-1} - I_2\big\|^2 + \frac{1}{2}\|v\|_V^2. \quad (1)$$
The norm on the vector field is generally induced by a differential operator. However, we directly choose a kernel inducing a reproducing kernel Hilbert space (RKHS) to parameterize the velocity field [Pai:2016dz]. Given a finite set of kernels, the regularization takes the form

$$\|v\|_V^2 = \sum_{i,j} a_i^\top K(x_i, x_j)\, a_j,$$

where $a_i$ are the vectors attached to each spatial kernel, and $x_i$ is the spatial position of each kernel $i$. Using the distance metric between two images (the minimum of (1)), one can formulate template estimation as a Fréchet mean estimation problem. In other words, given a set of images (or observations) $I_1, \dots, I_N$, the atlas $\hat{I}$ is the minimizer of the sum-of-squared-distances function

$$\hat{I} = \operatorname*{arg\,min}_{I} \sum_{k=1}^{N} d(I, I_k)^2.$$
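A minimal sketch of this finite-dimensional regularization, assuming a compactly supported Wendland $C^2$ kernel $\phi(r) = (1-r)_+^4(4r+1)$; the kernel scale and toy grid below are illustrative, not the paper's settings:

```python
import numpy as np

def wendland_c2(r):
    """Compactly supported Wendland C^2 kernel: phi(r) = (1-r)_+^4 (4r+1)."""
    return np.clip(1.0 - r, 0.0, None) ** 4 * (4.0 * r + 1.0)

def rkhs_norm_sq(centers, momenta, scale=10.0):
    """||v||_V^2 = sum_ij a_i^T K(x_i, x_j) a_j for a scalar kernel K."""
    diff = centers[:, None, :] - centers[None, :, :]
    K = wendland_c2(np.linalg.norm(diff, axis=-1) / scale)  # (n, n) Gram matrix
    return float(np.einsum('id,ij,jd->', momenta, K, momenta))

# Toy example: kernels at random 3-D positions with random attached vectors.
rng = np.random.default_rng(1)
centers = rng.uniform(0, 32, size=(20, 3))
momenta = rng.normal(size=(20, 3))
norm_sq = rkhs_norm_sq(centers, momenta)
assert norm_sq > 0.0  # a positive-definite kernel yields a positive norm
```

The same Gram matrix $K$ doubles as the covariance of the Gaussian prior on the velocity coefficients, which is what links the regularizer to the probabilistic model below.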
Since (1) is viewed as a MAP problem, the velocity fields are considered latent variables, i.e., $v_k \sim \mathcal{N}(0, K)$, a normal distribution with zero mean and covariance $K$ derived from a kernel function. In the presence of latent variables, the template estimation is posed as an expectation-maximization (EM) problem. Further, for simplicity, we assume i.i.d. Gaussian noise at each voxel, with a likelihood term (for each observation $I_k$) given by

$$p(I_k \mid v_k; \theta) = \frac{1}{(2\pi\sigma^2)^{M/2}} \exp\!\left(-\frac{\|I_k \circ \exp(v_k)^{-1} - \hat{I}\|^2}{2\sigma^2}\right), \quad (2)$$

where $\theta = \{\hat{I}, \sigma^2\}$ are the parameters to be estimated via MAP; $\sigma^2$ is the noise variance, $\hat{I}$ is the mean template, and $M$ is the number of voxels. Each observation can be viewed as a random variation around the mean $\hat{I}$. The prior on the velocity field may be defined in terms of the norm as

$$p(v_k) \propto \exp\!\left(-\tfrac{1}{2}\|v_k\|_V^2\right).$$

Estimating the parameters involves marginalizing the posterior over the latent variables, which is computationally intractable due to the dimensionality of $v$. To solve this, Hamiltonian Monte Carlo (HMC) [neal2011] is employed to sample velocity fields for the marginalization. The posterior distribution, from which $S$ samples are drawn, is

$$p(v_k \mid I_k; \theta) \propto p(I_k \mid v_k; \theta)\, p(v_k). \quad (3)$$
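The sampler itself is standard HMC [duane1987, neal2011]: simulate Hamiltonian dynamics with leapfrog steps, then accept or reject with a Metropolis test. The sketch below runs it on a toy two-dimensional Gaussian target standing in for the much higher-dimensional posterior (3); the step size and trajectory length are illustrative.

```python
import numpy as np

def hmc_sample(log_prob, grad_log_prob, x0, n_samples=2000,
               step=0.1, n_leapfrog=20, seed=0):
    """Basic HMC: leapfrog dynamics plus a Metropolis accept/reject step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)                 # fresh momentum each iteration
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step * grad_log_prob(x_new)   # half step for momentum
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new                    # full step for position
            p_new += step * grad_log_prob(x_new)     # full step for momentum
        x_new += step * p_new
        p_new += 0.5 * step * grad_log_prob(x_new)   # final half step
        h_old = -log_prob(x) + 0.5 * p @ p           # Hamiltonian before/after
        h_new = -log_prob(x_new) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(min(0.0, h_old - h_new)):
            x = x_new                                # Metropolis acceptance
        samples.append(x.copy())
    return np.array(samples)

# Toy target: a 2-D standard normal standing in for the posterior (3).
log_prob = lambda x: -0.5 * x @ x
grad_log_prob = lambda x: -x
samples = hmc_sample(log_prob, grad_log_prob, np.zeros(2))
assert abs(samples[500:].mean()) < 0.2   # empirical mean near the true mean 0
assert 0.7 < samples[500:].std() < 1.3   # spread near the true std 1
```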
The sampled velocity fields ($S$ per image) are used in an EM algorithm to estimate an optimal $\theta$. A single-scale Wendland kernel [Pai:2016dz] is used to parameterize the velocity field and to construct the covariance matrix for regularization. Once a template is estimated, the posterior distribution is sampled for a set of velocity fields for each training image. To induce more variation, the velocity fields are integrated up to a random time $t \in [0, 1]$. The training samples are deformed using cubic interpolation for the images and nearest-neighbor interpolation for the labels to create the new set of synthetic data. The input to the deep-learning network (for one image $I_k$ as an example) is of the form

$$\{(I_k^{(j)}, L_k^{(j)})\}_{j=0}^{n},$$

where $n$ is the number of augmentations and $L_k$ is the label of input image $I_k$. Note that the label is a segmentation assigning a class to each voxel and is transformed with the same transformation accordingly.

3 Experiments and Results
We considered CNNs based on a U-net architecture in our experiments. To evaluate the proposed method, the performance of CNNs trained with data augmentation using PADDIT was compared to training without augmentation and to training with augmentation using deformations based on random B-splines; we call the latter the baseline. The above-mentioned strategies were applied to white matter hyperintensities (WMH) segmentation from FLAIR and T1 MRI scans. To this end, we use the training dataset from the 2017 WMH segmentation MICCAI challenge (http://wmh.isi.uu.nl). The set is composed of T1/FLAIR MRI scans and manual annotations for WMH from 60 subjects. The dataset was split into training (30), validation (5), and testing (10) sets. For each method, two deformed versions of each training case were created, i.e., the training set size was tripled.
The random deformations for the baseline were obtained by using a deformation field defined on a grid with Cp control points per dimension and B-spline interpolation. The size of the deformation was controlled by adding Gaussian noise with zero mean and standard deviation Sd. We evaluated the impact of the Cp and Sd hyperparameters; specifically, we tried Cp ∈ {4, 8, 16} and Sd ∈ {2, 4, 6}.

Figure 1 shows examples of the deformed versions of a FLAIR scan from one subject in the training dataset. As can be observed, both methods generated new shapes for WMH regions. It is worth noting, however, that the images produced by PADDIT look more realistic, without drastic alterations to the brain. In contrast, those obtained using random B-spline deformations exhibit aberrations in cortical and ventricular structures, depending on the size of the deformation used.
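A random B-spline deformation of this kind can be sketched as follows: draw Gaussian noise on a coarse control-point grid and upsample it with cubic spline interpolation into a dense displacement field. This is an assumed, minimal reading of the baseline, not its exact implementation.

```python
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def random_bspline_deform(image, cp=8, sd=4.0, rng=None, order=3):
    """Warp `image` with random displacements: zero-mean Gaussian noise with
    std `sd` on a cp-per-dimension control-point grid, upsampled with cubic
    spline interpolation (a sketch of the baseline augmentation)."""
    rng = rng if rng is not None else np.random.default_rng()
    field = []
    for _ in image.shape:  # one displacement component per axis
        coarse = rng.normal(0.0, sd, size=(cp,) * image.ndim)
        factors = [s / c for s, c in zip(image.shape, coarse.shape)]
        field.append(zoom(coarse, factors, order=3))
    coords = np.indices(image.shape).astype(float) + np.array(field)
    return map_coordinates(image, coords, order=order, mode='nearest')

rng = np.random.default_rng(3)
image = rng.random((32, 32))
warped = random_bspline_deform(image, cp=4, sd=2.0, rng=rng)
assert warped.shape == image.shape
```

Larger Sd produces larger displacements, which is how the baseline's deformation magnitude is controlled.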
[Figure 1: A FLAIR scan and deformed versions of it generated with random B-spline deformations (Cp ∈ {4, 8, 16}; Sd ∈ {2, 4, 6}) and with two PADDIT samples.]
Figure 2 shows the Dice performance at each epoch on the validation and testing sets. It is worth noting that PADDIT achieved higher accuracy than training with random B-spline deformations as well as training without augmentation. Also, random B-spline deformations did not provide a consistent improvement over training without data augmentation.
For the final assessment of PADDIT, the validation data was used for early stopping. The final evaluation of each method was carried out on the testing set using the network configuration at the epoch where it showed the highest accuracy on the validation set. The best configuration for random deformations (see Table 3) used Cp = 4 and Sd = 6; for PADDIT, the control points were placed every 8 voxels. Results for the evaluation on the testing set are summarized in Table 3. Our proposed method PADDIT achieved higher Dice accuracy compared to the network trained without data augmentation and compared to the baseline data augmentation approach (best configuration); both differences were statistically significant.
Method             Dice (mean)   Dice (std)
No data aug.       0.66321       0.24829
Rd Cp: 4,  Sd: 2   0.6628        0.2260
Rd Cp: 4,  Sd: 4   0.6347        0.2466
Rd Cp: 4,  Sd: 6   0.6661        0.2274
Rd Cp: 8,  Sd: 2   0.6452        0.2347
Rd Cp: 8,  Sd: 4   0.6438        0.2403
Rd Cp: 8,  Sd: 6   0.6566        0.2327
Rd Cp: 16, Sd: 2   0.6358        0.2457
Rd Cp: 16, Sd: 4   0.6587        0.2238
Rd Cp: 16, Sd: 6   0.6535        0.2341
PADDIT             0.6813        0.2185
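For reference, the Dice score and a paired significance test can be computed as below. The per-subject scores are invented for illustration; the paper reports only the aggregate values above and does not state which paired test was used (a paired t-test is one common choice).

```python
import numpy as np
from scipy import stats

def dice(pred, truth, eps=1e-8):
    """Dice overlap between binary segmentations: 2|A ∩ B| / (|A| + |B|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + eps)

# Sanity check: a segmentation compared with itself has Dice 1.
seg = np.zeros((8, 8), dtype=bool)
seg[2:6, 2:6] = True
assert abs(dice(seg, seg) - 1.0) < 1e-6

# Hypothetical per-subject Dice scores for the 10 test subjects.
base = np.array([0.61, 0.58, 0.67, 0.70, 0.55, 0.64, 0.69, 0.60, 0.66, 0.63])
paddit = base + np.array([0.03, 0.01, 0.04, 0.02, 0.05, 0.02, 0.01, 0.03, 0.04, 0.02])
t_stat, p_value = stats.ttest_rel(paddit, base)
assert p_value < 0.05  # consistent per-subject gains reach significance
```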
4 New or breakthrough work to be presented
Even though several configurations of random transformations generated realistic-looking images, they were not necessarily useful in CNN training. On the other hand, the best configuration of random transformations generated images that were not necessarily biologically plausible. We hypothesize that such noisy data may help the optimization find better minima. However, one has to be careful in choosing the configuration of transformations, since other configurations with a higher magnitude of deformation had a negative effect on training. In the case of PADDIT, one need not worry much about the transformation configuration, since the method learns the transformations needed to capture the shape variations in the dataset. Hence, the resulting synthetic images were both realistic and useful for CNN training.
5 Conclusion
The proposed probabilistic augmentation approach, PADDIT, proved to be an effective way to enlarge the training set by generating new training images that improve the segmentation performance of CNN-based approaches. From the results it is evident that the network trained with PADDIT performed statistically significantly better than the networks trained with either no data augmentation or random B-spline-based augmentation.
This work has not been submitted for publication or presentation elsewhere.
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820. We would like to thank both Microsoft and NVIDIA for providing computational resources on the Azure platform for this project.