Deep learning techniques are increasingly popular for image analysis but often depend on large quantities of labeled data. For medical images, this problem is even more pronounced: data acquisition is administratively and technically more complex, data sharing is more restricted, and annotator expertise is scarce.
To address the quantification of biomarkers (e.g. the number or volume of lesions), many methods first optimize a segmentation problem and then derive the target quantity with simpler methods. These approaches require expensive voxel-wise annotations. In this work, we circumvent the segmentation problem by optimizing our method to directly regress the target quantity [2, 3, 4]. We therefore need only a single label per image instead of voxel-wise annotations. Our main contribution is that we push this limit even further by proposing a data augmentation method that reduces the number of training images required to optimize the regressors. The proposed method exploits the fact that the original global labels represent a countable quantity. Its principle is to combine real training samples to construct many more virtual training samples. During training, our model takes as input random sets of images and is optimized to predict a single label for each set: the sum of the labels of all images in the set. This is motivated by the idea that adding a large quantity of virtual samples with weaker labels may reduce over-fitting to the training samples and improve generalization to unseen data. During training, in order for the network to process several images simultaneously, we replicate the full network as many times as there are images in the set. The parameters of the replicated networks are shared and their predictions are summed. Once the model is trained, the replicated networks can be removed, and the original architecture can be used for prediction.
We evaluate our approach on two different tasks: quantification of enlarged perivascular spaces (PVS) in the basal ganglia, and quantification of white matter hyperintensities (WMH) in the brain. For the PVS, we evaluate our method on 1977 visually scored MRI scans from a single scanner. For the WMH, we use the training dataset from the WMH challenge. This dataset contains 60 MRI scans from three different centers.
1.1 Related Work
Data augmentation and other regularizers.
Simard et al. define a manifold representation for training samples, and propose to compute transformations corresponding to vectors in the tangent plane. Simple data augmentations such as rotation, translation and flipping are special cases of such transformations, and are often used to reduce over-fitting of deep learning algorithms, as they are simple to implement and can be very effective. The authors of U-net, probably the most used segmentation network in medical imaging, stress that random elastic deformations of training samples significantly improved the performance of their model. For medical images, these transformations can mimic the possible movements of biological tissues. Generative adversarial networks (GANs) can also be used to generate samples for the network during training, and hence reduce over-fitting. Perez et al. compare different data augmentation methods and propose to learn the augmentations that best improve the performance of the model. To learn augmentations, the authors include in their network an augmentation branch taking as input other images of the same class as the input image. This branch is optimized to generate images similar to the input image. The authors explore different similarity metrics describing content or style.
Recently, data augmentation methods using combinations of training samples have been published. Zhang et al. proposed to construct virtual training samples by computing a linear combination of pairs of real training samples. The corresponding one-hot labels are summed with the same coefficients. The authors evaluated their method on classification datasets from computer vision and on a speech dataset, and demonstrated that it improves the generalization of state-of-the-art neural networks such as DenseNet-BC-190 or ResNet-101. In some experiments, the authors achieved better results using both their method and Dropout than using Dropout alone. Simultaneously, Inoue et al. reached similar conclusions by averaging pairs of training samples, after standard data augmentations including random extractions and flipping. Tokozume et al. also used similar methods to improve performance for sound classification, and validated their method on three datasets. However, for grayscale volumetric inputs, summing image intensity values could overlay the target structures, confuse discriminative shapes, and thus harm the performance of the network. For instance, one of the most discriminative features of PVS is their tubular shape; averaging several close PVS could create a new shape, which would probably not, and should not, be detected as a PVS by the network. With our method, training samples can be combined without overlaying the intensity values. The other difference with the above-mentioned approaches is that our method is not designed for classification, but for regression of global labels, such as a volume or count in an image. With the proposed combination of samples, our method computes plausible augmentations.
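The mixup scheme of Zhang et al. described above can be sketched in a few lines of numpy; the function name and the Beta-distributed mixing coefficient follow their paper, but this is an illustrative sketch rather than their implementation:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Create a virtual sample as a convex combination of two real samples.

    The mixing coefficient lam is drawn from a Beta(alpha, alpha)
    distribution, as in Zhang et al.'s mixup.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2  # blend the image intensities
    y = lam * y1 + (1.0 - lam) * y2  # blend the (one-hot) labels
    return x, y
```

For volumetric grayscale inputs, it is exactly this intensity blending that can overlay target structures, which motivates our set-based combination instead.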
Note that for these methods, the amount of regularization introduced can be modulated by at least one parameter, for instance the degree of rotation applied to the input image, or the percentage of neurons dropped in Dropout. With most methods, the more regularization is applied, the longer the optimization takes until convergence. Introducing too much regularization can make the optimization unreasonably long, and force stopping before convergence of the validation loss, which leads to a sub-optimal performance. Choosing the right amount of regularization is therefore a key factor. In the proposed method, the regularization effect can be controlled by varying the average number of samples used to create combinations.
Biomarker extraction with regression networks.
Recently, methods employing neural network regressors trained with global labels have been proposed for biomarker extraction [2, 3, 4]. Cole et al. predict brain age from 3D images of grey matter density computed from MRI scans. Dubost et al. predict the number of enlarged perivascular spaces in a 3D region of interest in the brain extracted from an MRI scan. And Espinosa et al. predict Agatston scores to quantify coronary artery calcifications from 3D non-contrast, non-ECG-gated chest CT scans. These networks needed hundreds of training samples to be trained accurately. With our method, we achieve the level of interrater agreement using only 25 training scans, which could for instance require only one to two hours of labeling from an expert rater.
The principle of our data augmentation method is to create many new (and weaker) training samples by combining existing ones (see Figure 1). In the remainder, the original samples are called real samples, and the newly created samples are called virtual samples.
2.1 Proposed Data Augmentation.
During training, the model is not optimized on single real samples $x_i$ with label $y_i$, but on sets $S$ of random samples with label $y_S = \sum_{x_i \in S} y_i$, with $y_i$ the label of sample $x_i$. These sets $S$ with labels $y_S$ are the virtual samples. Consequently, the loss function is computed directly on these virtual samples and no longer on the individual real samples $x_i$. This approach is designed for labels describing a quantitative element in the samples, such as a volume or count in an image.
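The construction of a virtual sample, a set of images labeled with the sum of the individual labels, can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def make_virtual_sample(images, labels, indices):
    """Group real samples into one virtual sample.

    The virtual sample is the set of selected images; its label is the
    sum of the individual labels, which is meaningful for countable
    quantities such as lesion counts or volumes.
    """
    x_set = [images[i] for i in indices]
    y_set = sum(labels[i] for i in indices)
    return x_set, y_set
```

For example, grouping two scans with 3 and 2 lesions yields a virtual sample whose label is 5 lesions, without any blending of image intensities.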
To create the sets $S$, the samples $x_i$ are drawn without replacement from the training set at each epoch. To create more combinations of samples, and to allow the model to use the real samples for its optimization, the size of the sets $S$ can randomly vary in $\{1, ..., n\}$ during training, with $n$ the maximum set size.
If the training set contains $N$ samples, our method can create a combinatorially large number of different virtual samples, because (1) the order of the samples in $S$ has no effect on the optimization, (2) the samples are drawn without replacement within an epoch, and (3) the size of the sets can vary.
Difference between the proposed method and mini-batch stochastic gradient descent (SGD).
In mini-batch SGD, the model is also optimized on sets of random samples, but the loss function is computed individually for each sample of the batch, and then summed (averaged). In the proposed method, the predictions are first summed, and the loss function is then computed a single time. For non-linear loss functions, this is not equivalent: $\sum_{i} L(f(x_i), y_i) \neq L(\sum_{i} f(x_i), \sum_{i} y_i)$, with $f(x_i)$ the model's prediction for sample $x_i$.
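A small numeric check makes this non-equivalence concrete: with a squared-error loss and two toy samples, summing per-sample losses differs from computing one loss on the summed predictions and labels.

```python
# Demonstrates that, for a non-linear loss such as the squared error,
# summing per-sample losses != one loss on summed predictions/labels.
preds = [2.0, 5.0]    # model predictions f(x_i) for two samples
labels = [3.0, 3.0]   # labels y_i

# Mini-batch SGD style: compute the loss per sample, then sum.
loss_per_sample = sum((p - y) ** 2 for p, y in zip(preds, labels))  # 1 + 4 = 5

# Proposed method: sum predictions and labels first, then one loss.
loss_on_set = (sum(preds) - sum(labels)) ** 2  # (7 - 6)^2 = 1

assert loss_per_sample != loss_on_set
```

Here the set-level loss tolerates errors that cancel out across the set, which is precisely what makes the virtual labels weaker than the individual ones.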
2.2 Implementation.

There are at least two possible implementations of the proposed method. The first consists of modifying the computation of the loss function across samples in a mini-batch, and providing mini-batches of random size. Alternatively, the model's architecture can be adapted to receive the set of images. Depending on one's deep learning library, one of these implementations may be easier to realize. We opted for the second approach, and present it below in more detail. We optimize a regression neural network with a 3D image as input, and a global label representing a volume or count as output.
The architecture of the base regressor is a small 3D convolutional network in which skip connections concatenate feature maps instead of summing them. It is both simple (196,418 parameters) and flexible, allowing fast prototyping. There is no activation function after the last layer. The output can therefore span $\mathbb{R}$, and the network is optimized with the mean squared error (MSE). We call this regression network $f$, such that $\hat{y} = f(x)$, with $x$ the input image.
To process several images simultaneously, we replicate the regressor $f$ $n$ times during training (Figure 2, right). This results in $n$ different branches (or "heads", hence the name "Hydranet") which receive the corresponding images $x_1, ..., x_n$. The weights of each head are shared, such that each branch computes the same function $f$. A new network $F$ is constructed by summing the predictions of the individual branches: $F(x_1, ..., x_n) = \sum_{i=1}^{n} f(x_i)$.
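Because the heads share one set of weights, the multi-input network $F$ is just the base regressor $f$ applied to each image and summed. A toy sketch with a linear $f$ standing in for the CNN (purely illustrative):

```python
import numpy as np

def f(x, w):
    """Toy base regressor: a linear map standing in for the 3D CNN."""
    return float(np.dot(w, x.ravel()))

def F(image_set, w):
    """Hydranet-style multi-head network: every head applies the same
    shared-weight regressor f, and the head predictions are summed."""
    return sum(f(x, w) for x in image_set)
```

Since the weights are shared, $F$ over a set equals the sum of $f$ over its elements, so at prediction time the heads can be discarded and the single-branch $f$ used directly.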
In Section 2.1, we explained that it is desirable for the size of the sets $S$ to vary randomly in $\{1, ..., n\}$ during training. To allow this, each element of $S$ has a probability $p$ of being a black image of zero intensities only (Figure 1, right column). In this way, virtual samples effectively containing fewer than $n$ real images become possible.
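This black-image mechanism can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def randomize_set_size(image_set, p=0.1, rng=None):
    """Replace each image with a black (all-zero) image with probability p.

    A zero image contributes (approximately) nothing to the summed
    prediction, so this effectively varies the number of real samples
    in the set between 0 and n without changing the architecture.
    """
    rng = np.random.default_rng() if rng is None else rng
    return [np.zeros_like(x) if rng.random() < p else x for x in image_set]
```

In the full method, the label of the virtual sample is the sum of the labels of the remaining real images only, so a blanked-out image also contributes zero to the target.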
Finally, for this implementation, the batch size has to be a multiple of the number of heads $n$. In our experiments, we chose $n = 4$.
In this section, we provide details about the datasets and experimental settings, and present results on both tasks: estimation of number of PVS in the basal ganglia, and estimation of WMH volume. We compare the optimization and performance of our method to that of more standard methods (base regressor and dropout) and for different sizes of training set.
3.1 Datasets

The PVS dataset contains 2017 PD-weighted scans, from 2017 subjects, acquired on a 1.5T GE scanner. The original voxel size is … and the images are … voxels wide. The scans were visually scored by an expert rater who counted the PVS in the basal ganglia in a single slice: the one showing the anterior commissure. The WMH dataset is the training set of the MICCAI 2017 WMH challenge. It contains 2D multi-slice FLAIR-w and 3D T1-w scans of 60 participants from 3 centers and 3 vendors: 20 scans from Amsterdam (GE), 20 from Utrecht (Philips) and 20 from Singapore (Siemens). For simplicity, we only used the FLAIR-w scans. Although the ground truths of the challenge are pixel-wise, we only used the number of WMH voxels as ground truth during training.
3.2 Experimental Settings
For the regression of PVS in the basal ganglia, we first create a mask of the basal ganglia from the 3D PD scans using the subcortical segmentation algorithm from FreeSurfer. Both masks and images are then registered to MNI space using Elastix, with mutual information as similarity measure and default settings. Before applying the mask, we smooth its border with a Gaussian kernel (standard deviation of 2 voxels) and crop the images around the center of mass of the segmented basal ganglia. After cropping, the images are … voxels wide. For the WMH dataset, we only crop each image around its center of mass, weighted by the voxel intensities. The size of the cropped image is the same for all 3 centers: … voxels wide. For both tasks, the intensities are then rescaled between 0 and 1. See Figure 3 for examples of preprocessed scans.
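The intensity-weighted crop and the rescaling described above can be sketched as follows; the crop size and function names are illustrative, and boundary handling is a simplifying assumption:

```python
import numpy as np

def crop_around_center_of_mass(volume, size):
    """Crop a box of the given size around the intensity-weighted
    center of mass of a 3D volume."""
    coords = np.indices(volume.shape)
    total = volume.sum()
    center = [int(round(float((c * volume).sum()) / total)) for c in coords]
    slices = []
    for c, s, dim in zip(center, size, volume.shape):
        start = min(max(c - s // 2, 0), dim - s)  # keep the crop inside the volume
        slices.append(slice(start, start + s))
    return volume[tuple(slices)]

def rescale_intensities(volume):
    """Rescale intensities linearly to the range [0, 1]."""
    v = volume.astype(float)
    return (v - v.min()) / (v.max() - v.min())
```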
Training of the networks.
For all methods, images are padded with zeros for the convolutions. The weights of the networks are initialized from a Gaussian distribution with mean 0.0 and standard deviation 0.05. During training, the images are randomly augmented on-the-fly with standard methods. The possible augmentations are flipping, 3D rotations from -0.2 to 0.2 radians, and random translations from -2 to 2 voxels. This standard data augmentation is exactly the same for the proposed method and the base regressor. Adadelta is used as optimizer, with Keras' default parameters. For a fair comparison, both the proposed method and the base regressor are trained with the same batch size. The architecture of the network for the proposed method then has four branches ($n = 4$). During an epoch, the proposed method receives combinations of the $N$ training samples, with $N$ the total number of training images; during the same epoch, the base regressor simply receives the $N$ images separately (in batches). For the proposed method, the probability $p$ of drawing a black image (Sec. 2.1) was set to 0.1. In some experiments we also included a Dropout layer after each convolution and after the global pooling layer. If not mentioned, no Dropout was included. The code is written in Keras with TensorFlow as backend, and the experiments were run on an Nvidia GeForce GTX 1070 GPU.
For all experiments, the datasets are split into training and testing sets. Within the training set, a small subset, the validation set, is used to evaluate over-fitting. For the PVS dataset, we experiment with training set sizes varying between 12 and 25 scans. The validation set always contains the same 5 scans. All methods were evaluated on the same held-out test set of 1977 scans. For the WMH dataset, we split the set into 30 training scans and 30 testing scans. Six scans from the training set are used as validation scans. In both cases, the dataset is randomly (uniformly) split into training and testing sets. For the PVS dataset, once the dataset has been split into 30 training scans and 1977 testing scans, we manually sample scans to keep a pseudo-uniform distribution of the lesion count when decreasing the number of training scans.
To compare the automated predictions to visual scoring (for PVS) or volumes (for WMH), we use two evaluation metrics: the mean squared error (MSE), and the intraclass correlation coefficient (ICC).
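Both metrics are straightforward to compute; the sketch below uses the one-way, single-measure ICC as one common formulation, since the paper does not state which ICC variant was used:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between reference scores and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def icc_oneway(y_true, y_pred):
    """One-way random, single-measure ICC, i.e. ICC(1,1), treating the
    predictions and the reference scores as two raters.

    Assumption: the paper does not specify the ICC variant; this is one
    common choice."""
    ratings = np.stack([np.asarray(y_true, float), np.asarray(y_pred, float)], axis=1)
    n, k = ratings.shape
    target_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-target and within-target mean squares from one-way ANOVA.
    msb = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
    msw = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    return float((msb - msw) / (msb + (k - 1) * msw))
```

Perfect agreement yields an ICC of 1, while the MSE is 0; the two metrics penalize disagreement differently, which is why both are reported.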
Enlarged Perivascular Spaces (PVS).
Figure 4 compares the proposed method to the base regressor on the PVS dataset, for an increasing number of training samples. Their performance is also compared to the average interrater agreement computed for the same problem and reported by Dubost et al. With fewer than 30 training samples, only the proposed method reaches a performance similar to the interrater agreement, and it significantly outperforms the base regressor in ICC (Williams' test, p-value 0.001) when averaging the predictions of the methods across the four points of their learning curve. We also notice that, for all training set sizes, the proposed method always reaches a better MSE than the conventional methods. Note that all methods were optimized for MSE (not ICC).
White Matter Hyperintensities (WMH).
We conducted three series of experiments, training five neural networks in total (Table 1). When using small training sets, the proposed method outperforms the base network, whether optimized for the MSE or for the mean absolute error (MAE). With larger training sets, the difference in performance shrinks, and the base regressor performs slightly better on the ICC.
| Method | Training scans | Testing scans | Loss | Performance (ICC) |
| --- | --- | --- | --- | --- |
| Base Network | 30 | 30 | MSE | 0.79 ± 0.12 |
| Proposed Method | 30 | 30 | MSE | 0.84 ± 0.02 |
| Base Network | 30 | 30 | MAE | 0.78 |
| Base Network | 40 | 20 | MSE | 0.89 |
With the proposed data augmentation method, we could reach the inter-rater agreement performance on PVS quantification reported by Dubost et al.  with only 25 training scans, and without pretraining. The proposed method outperforms the base regressor on both datasets.
Dubost et al. also regressed the number of PVS in the basal ganglia with a neural network. We achieve a similar result (0.73 ICC) while training on 25 scans instead of 1000. Note that the authors reported interrater agreements with ICCs of 0.70 and 0.68, and an intrarater ICC of 0.80 on their dataset. In our case, even the base regressor reaches a good performance with only 40 training samples (ICC of 0.68). For a similar performance, Dubost et al. needed more than 200 training samples. We believe that this is due to the use of global pooling in our network architecture (instead of fully connected layers), which might have a regularization effect.
Training on such small datasets could drastically ease the use of deep learning regression methods in medical image analysis. For PVS, for instance, creating global labels for fewer than 30 scans would generally take only one to two hours for an expert human rater. This could make it affordable for different centers to create their own in-house training sets.
With our data augmentation strategy, the models are sometimes optimized with larger labels than those occurring in real images. There is some controversy on whether only realistic transformations should be used during training. Dubost et al. showed on PVS quantification that although unrealistic transformations could lengthen training, they did not harm the performance of the model.
We noticed that the more variability there is in the combinations of training samples, the longer the convergence takes. This variability can be partly estimated via the number of possible combinations as described in Section 2.1, and increases with the number of training samples $N$, with the maximum size $n$ of the combination sets, and with the average size of the combination sets (as it can randomly vary in $\{1, ..., n\}$). In our implementation, the average size of the combination sets can be regulated by $p$, the probability of inputting a black image into a given branch of the network's architecture. This variability seems to act as a degree of the regularization effect, similarly to e.g. the percentage of dropped neurons in Dropout. The regularization effect could also be increased by drawing samples with replacement when creating combinations. In this work we opted for drawing samples without replacement for simplicity of implementation. Drawing with replacement increases the number of possible combinations, and could be beneficial for small training sets.
Zhang et al. also proposed to combine training samples as a data augmentation method. Contrary to our approach, they evaluated their method on 2D images and used a large batch size. In their experiments, combining more than two images does not bring any improvement. With the proposed method, training with combinations of four images brought improvement over using only pairs of images, probably due to the small size of our training set: the smaller the training set, the more regularization is needed. We did not experiment with values of $n$ larger than 4.
On both the PVS and WMH datasets, using Dropout actually made the results worse when training on very little data, even with low dropout rates such as 0.3.
In Section 2.2, we mentioned two possible implementations of the proposed method: (1) changing the computation of the loss over mini-batches, and (2) replicating the architecture of the network. In this work we used the second approach, as it was simpler to implement with our library (Keras). However, with this approach, all samples in a combination have to be processed by the network simultaneously, which can cause GPU memory overload for large 3D images or large values of $n$. The first approach does not suffer from this overload, as the samples can be loaded successively, while only the individual scalar predictions are kept in GPU memory. For large 3D images, we would consequently recommend the first approach.
We proposed a data-augmentation approach for neural network regressors trained with global labels, such as volumes or counts. The objective of the method is to train such regressors with small datasets. The method consists of combining training samples to create a larger training set. During training, samples are randomly grouped into sets, and their labels are summed. Using this method, we could train a neural network and reach inter-rater agreement performance on brain lesion quantification with only 25 training scans, with a single global label per scan, and without pretraining.
-  Zhang, H., Cisse, M., Dauphin, Y.N. and Lopez-Paz, D., 2017. mixup: Beyond empirical risk minimization. ICLR 2018.
-  Cole, J.H., Poudel, R.P., Tsagkrasoulis, D., Caan, M.W., Steves, C., Spector, T.D. and Montana, G., 2017. Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. NeuroImage, 163, pp.115-124.
-  Dubost, F., Adams, H., Bortsova, G., Ikram, M.A., Niessen, W., Vernooij, M. and de Bruijne, M., 2019. 3D Regression Neural Network for the Quantification of Enlarged Perivascular Spaces in Brain MRI. Medical Image Analysis.
-  Espinosa, C.C., González G., Washko, G.R., Cazorla, M., Estépar, R.S.J., 2018. Automated Agatston score computation in non-ECG gated CT scans using deep learning. SPIE.
-  http://wmh.isi.uu.nl/
-  Desikan, R.S., Ségonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T. and Albert, M.S., 2006. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31(3), pp.968-980.
-  Zeiler, M.D., 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
-  Klein, S., Staring, M., Murphy, K., Viergever, M.A. and Pluim, J.P.W., 2010. elastix: a toolbox for intensity-based medical image registration. IEEE Transactions on Medical Imaging, 29(1), pp.196-205.
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp.1929-1958.
-  He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
-  Inoue, H., 2018. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929.
-  Simard, P.Y., LeCun, Y.A., Denker, J.S. and Victorri, B., 1998. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade (pp. 239-274). Springer, Berlin, Heidelberg.
-  Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, pp.321-357.
-  Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
-  Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
-  LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp.2278-2324.
-  Perez, L. and Wang, J., 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).
-  Sixt, L., Wild, B. and Landgraf, T., 2018. Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI, 5, p.66.
-  Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J. and Ré, C., 2017. Learning to compose domain-specific transformations for data augmentation. In Advances in neural information processing systems (pp. 3236-3246).
-  Tokozume, Y., Ushiku, Y. and Harada, T., 2017. Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282.
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. and Berg, A.C., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp.211-252.
-  Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K.Q., 2017, July. Densely connected convolutional networks. In CVPR (Vol. 1, No. 2, p. 3).