Improved inter-scanner MS lesion segmentation by adversarial training on longitudinal data

02/03/2020 ∙ by Mattias Billast, et al.

The evaluation of white matter lesion progression is an important biomarker in the follow-up of MS patients and plays a crucial role when deciding the course of treatment. Current automated lesion segmentation algorithms are susceptible to variability in image characteristics related to MRI scanner or protocol differences. We propose a model that improves the consistency of MS lesion segmentations in inter-scanner studies. First, we train a CNN base model to approximate the performance of icobrain, an FDA-approved clinically available lesion segmentation software. A discriminator model is then trained to predict if two lesion segmentations are based on scans acquired using the same scanner type or not, achieving 78% accuracy. Finally, the base model and the discriminator are trained adversarially on multi-scanner longitudinal data to improve the inter-scanner consistency of the base model. The performance of the models is evaluated on an unseen dataset containing manual delineations. The inter-scanner variability is evaluated on test-retest data, where the adversarial network produces improved results over the base model and the FDA-approved solution.




1 Introduction

Multiple sclerosis (MS) is an autoimmune disorder characterized by a demyelination process which results in neuroaxonal degeneration and the appearance of lesions in the brain. The most prevalent type of lesions appear hyperintense on T2-weighted (T2w) magnetic resonance (MR) images and their quantification is an important biomarker for the diagnosis and follow-up of the disease [3].

Over the years, methods for automated lesion segmentation have been developed. Several approaches model the distribution of intensities of healthy brain tissue and define outliers to these distributions as lesions [15, 7]. Others are either atlas-based [11] or data-driven (supervised) [2, 13] classifiers. For a detailed overview of recent methods, refer to [3].

Figure 1: MRI scans from one patient in three 3T scanners (left to right: Philips Achieva, Siemens Skyra and GE Discovery MR750w). Automated lesion segmentations in green.

Lesion segmentation is particularly interesting for patient follow-up, where data from two or more time-points is available for one patient. Some approaches try to improve segmentation consistency by analysing intensity differences over time [6]. Although these methods achieve good performance in controlled settings, they remain sensitive to changes in image characteristics related to scanner type and protocol. In a test-retest multi-scanner study [1], scanner type was observed to have an effect on MS lesion volume. These findings are supported by [12], where scanner-related biases were found even when using a harmonized protocol across scanners from the same vendor. Fig. 1 illustrates such an effect.

Few works have addressed the inter-scanner variability issue in the context of lesion segmentation. Recent approaches attempt to increase the generalization of CNN-based methods to unseen MR scanner types through domain adaptation [8] or transfer learning [4, 14] techniques. Nevertheless, these methods share the common downside that they require a training step to adapt to new unseen domains (scanner types and protocols). The consistency of the delineations in longitudinal settings is also not considered. One way to incorporate consistency information into this type of data-driven solution would be to train it on a dataset containing intra- and inter-scanner repetitions for the same patient, acquired within a short period of time. However, in practice this type of test-retest dataset is almost impossible to acquire at a large scale, due to time and cost considerations.

In this work we present a novel approach to improve the consistency of lesion segmentation in multi-scanner studies by capturing inter-scanner differences from lesion delineations. Given the shortage of test-retest data, we propose instead to use longitudinal inter-scanner data to train a cross-sectional method. We start by training a base model on a multi-scanner dataset to achieve performance comparable to an existing lesion segmentation software [7]. We then design a discriminator to identify whether two segmentations were generated from images that originate from the same scanner or not. The assumption is that the natural temporal variation in lesion shape can be distinguished from the variation caused by the different scanners. These networks are then combined and trained until the base model produces segmentations that are similar enough to fool the discriminator. We hypothesize that through this training scheme the model will become invariant to scanner differences, thus imposing consistency on the baseline CNN. Finally, we evaluate the accuracy on a dataset with manual lesion segmentations and the reproducibility on a multi-scanner test-retest dataset.

2 Methods

We start by building a lesion segmentation base model based on a deep convolutional neural network (CNN) architecture [9] that approximates the performance of icobrain, an FDA-approved segmentation software. icobrain is an Expectation-Maximization (EM) model that uses the distribution of healthy brain tissue to detect lesions as outliers while also using prior knowledge of the location and appearance of lesions [7]. We refer to it as the EM-model.

2.0.1 Base Model

The base model is based on the DeepMedic architecture [9]. It is composed of multiple pathways that process different scales of the original image simultaneously. This is achieved by downsampling the original image at different rates before dividing it into input patches, which allows the model to combine the high resolution of the original image with the broader context of a downsampled image to make a more accurate prediction. In our implementation we used three pathways, each with its own downsampling factor and input patch size (the patch sizes are given in Fig. 2). Each pathway is composed of ten convolutional layers, each followed by a PReLU activation: the first five layers have 32 filters with kernel size (3,3,1) and the last five layers have 48 filters with kernel size (3,3,3). The feature maps from the second and third pathways are then upsampled to the same dimensions as those of the first pathway and concatenated. This is followed by dropout, two fully connected layers and a sigmoid function, returning a probability map. The values of the output probability map that are above a certain threshold are classified as lesions. The threshold used throughout this article is 0.4. The architecture is represented in Fig. 2. The loss function of the base model is given by

$$\mathcal{L}_{B} = \mathrm{BCE}\big(Y, B(X)\big),$$

where BCE is the binary cross-entropy, X is the concatenation of the T1- and FLAIR MR images, Y is the corresponding lesion segmentation label and B() the output of the base model.
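The two operations applied to the base model output, a binary cross-entropy against the label and a threshold at 0.4, can be illustrated with a minimal NumPy sketch. The function names and the toy 2x2 arrays are hypothetical and only serve to make the computation concrete; this is not the training code used by the authors.

```python
import numpy as np

LESION_THRESHOLD = 0.4  # threshold stated in the text


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean voxel-wise binary cross-entropy between a label and a probability map."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0); eps is an arbitrary numerical-safety constant.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_voxel = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_voxel.mean())


def binarize(prob_map, threshold=LESION_THRESHOLD):
    """Voxels with probability strictly above the threshold are labelled lesion."""
    return np.asarray(prob_map) > threshold


# Toy example (made-up numbers, not data from the study).
prob_map = np.array([[0.05, 0.35], [0.45, 0.90]])
label = np.array([[0, 0], [1, 1]])
loss = binary_cross_entropy(label, prob_map)
mask = binarize(prob_map)
```

In this toy case the two voxels with probabilities 0.45 and 0.90 end up in the lesion mask, while 0.35 falls below the threshold.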

Figure 2: Architecture of the base model that describes the patch sizes of the different pathways and the overall structure.

2.0.2 Discriminator

The discriminator is reduced to one pathway with six convolutional layers, since additional pathways with subsampling resulted in only a marginal increase in performance. The first two layers have 32 filters with kernel size (3,3,1) and the following layers have 48 filters with kernel size (3,3,3). As input it takes two label patches (the patch sizes are given in Fig. 3) and generates a voxel-wise prediction of whether the two labels are derived from images acquired using the same scanner. The architecture is represented in Fig. 3.

Figure 3: Architecture of the discriminator that describes the in- and output sizes of the patches and the overall structure.

The loss function of the discriminator is given by

$$\mathcal{L}_{D} = \mathrm{BCE}\big(y, D(B(X_{t_1}), B(X_{t_2}))\big),$$

where BCE is the binary cross-entropy, $y$ is the ground truth indicator variable (0 or 1) indicating whether two time points were acquired on the same scanner or not, $X_{t_1}$ and $X_{t_2}$ are images at different time points, and B() and D() are respectively the outputs of the base model and the discriminator.

2.0.3 Adversarial Model

After training, the discriminator was combined adversarially with the base model, as introduced in [5]. The adversarial model consists of two base model blocks (B) and one discriminator (D) (Fig. 4). In our particular case, the pre-trained weights of the discriminator are frozen and only the weights of the base model are fine-tuned. Adversarial training uses the frozen discriminator to reduce the inter-scanner variability of the base model by maximizing the loss function of the discriminator. This is equivalent to minimizing the following loss function:
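Read literally, maximizing the discriminator loss is the same as minimizing its negative. A minimal way to write this, assuming the discriminator loss is a binary cross-entropy (the objective named in Section 2.0.4), is sketched below. The symbols $\mathcal{L}_{adv}$, $\mathcal{L}_{D}$, $y$, $X_{t_1}$ and $X_{t_2}$ are illustrative names, not notation confirmed by the source:

```latex
\mathcal{L}_{adv}
  = -\,\mathcal{L}_{D}
  = -\,\mathrm{BCE}\big(y,\; D(B(X_{t_1}), B(X_{t_2}))\big)
```

Here $y$ is the same-scanner indicator and $X_{t_1}$, $X_{t_2}$ are the images of one patient at two time points. Because the discriminator weights are frozen, the gradients of this term only update the base model $B$.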


The loss function of the adversarial network then consists of two terms: one associated with the lesion segmentation labels, and one related to the output image of the discriminator:
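Such a two-term loss can be sketched as follows, assuming a simple weighted sum. The weight $\lambda$ and the names $\mathcal{L}_{total}$, $\mathcal{L}_{seg}$, $\mathcal{L}_{adv}$, $Y_{t_1}$ and $Y_{t_2}$ are hypothetical; the source only states that one term relates to the segmentation labels and the other to the discriminator output:

```latex
\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda\, \mathcal{L}_{adv},
\qquad
\mathcal{L}_{seg} = \mathrm{BCE}\big(Y_{t_1}, B(X_{t_1})\big)
                  + \mathrm{BCE}\big(Y_{t_2}, B(X_{t_2})\big)
```

In this sketch $\mathcal{L}_{adv}$ denotes the discriminator-related term, and the segmentation term is summed over the two base model blocks. Lowering $\lambda$ reduces the influence of the discriminator term relative to the segmentation term.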


The purpose of the discriminator-related term is to ensure that the base model is updated such that the discriminator can no longer distinguish between segmentations that are based on same- or different-scanner studies. We hypothesize that, as a result, the base model learns to map scans from different scanners to a consistent lesion segmentation.

Figure 4: Adversarial network that combines the base model and the discriminator to reduce the inter-scanner variability.

2.0.4 Model training

Both the base model and the discriminator were trained using the binary cross-entropy objective function and optimized using mini-batch gradient descent with Nesterov momentum. Separate initial learning rates were used for the base model and for the discriminator, and these were decreased at regular intervals until convergence. The adversarial network was trained with its own initial learning rate. All models were trained using an NVIDIA P100. The networks are implemented using the Keras and DeepVoxNet [10] frameworks.
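The optimizer described above, gradient descent with Nesterov momentum and a learning rate lowered at regular intervals, can be sketched in plain NumPy. All hyperparameter values here (learning rate, momentum, decay factor and interval) are placeholders chosen for the toy problem, not the values used in the study, and the quadratic objective merely stands in for a network loss.

```python
import numpy as np


def nesterov_sgd(grad_fn, w0, lr0=0.1, momentum=0.9,
                 decay=0.5, decay_every=50, steps=200):
    """Gradient descent with Nesterov momentum and step-wise learning-rate decay."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for step in range(steps):
        # Lower the learning rate at regular intervals.
        lr = lr0 * decay ** (step // decay_every)
        # Nesterov: evaluate the gradient at the look-ahead position.
        grad = grad_fn(w + momentum * velocity)
        velocity = momentum * velocity - lr * grad
        w = w + velocity
    return w


TARGET = np.array([1.0, -2.0])


def quadratic_grad(w):
    """Gradient of the toy objective ||w - TARGET||^2."""
    return 2.0 * (w - TARGET)


w_final = nesterov_sgd(quadratic_grad, w0=[0.0, 0.0])
```

In Keras, the same behaviour is usually obtained by configuring the built-in SGD optimizer with its Nesterov option and a learning-rate schedule, rather than by writing the update by hand.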

3 Data and preprocessing

Four different datasets were available: two for training and two for testing the performance of the models. Since manual delineations were not available for three of the datasets, automated segmentations were obtained using the EM-model described in the previous section. All automated delineations were validated by a human expert. Each study in the datasets contains T1w and FLAIR MR images from MS patients.

Cross-sectional dataset: 208 independent studies from several centers. The base model is trained on this dataset.
Longitudinal dataset: 576 multi-center, multi-scanner studies with approved-quality MR scans, comprising multiple studies from 215 unique patients at different time points. For training the adversarial model and the discriminator, only studies less than 2 years apart were used, to minimize the effect of the natural evolution of lesions over time and capture the differences between scanners. Most of the data met this criterion, since most patients have a follow-up scan every 6 months to one year. The discriminator and adversarial model are trained on this dataset.
Manual segmentations: 20 studies with manual lesion delineations by experts.
Test-retest dataset: 10 MS patients. Each patient was scanned twice in three 3T scanners: Philips Achieva, Siemens Skyra and GE Discovery MR750w [7].
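The selection of longitudinal study pairs less than two years apart, together with the same-scanner label needed by the discriminator, can be sketched as below. The record fields (`id`, `patient`, `date`, `scanner`), the 730-day cutoff and the example studies are all hypothetical; the source does not describe how the pairs were actually built.

```python
from datetime import date
from itertools import combinations

MAX_INTERVAL_DAYS = 2 * 365  # "less than 2 years"; the day-count convention is an assumption


def select_pairs(studies, max_days=MAX_INTERVAL_DAYS):
    """Return (id_a, id_b, same_scanner) for same-patient studies within max_days."""
    by_patient = {}
    for study in studies:
        by_patient.setdefault(study["patient"], []).append(study)

    pairs = []
    for patient_studies in by_patient.values():
        ordered = sorted(patient_studies, key=lambda s: s["date"])
        for a, b in combinations(ordered, 2):
            if (b["date"] - a["date"]).days < max_days:
                same_scanner = int(a["scanner"] == b["scanner"])
                pairs.append((a["id"], b["id"], same_scanner))
    return pairs


# Made-up example records, not data from the study.
studies = [
    {"id": "s1", "patient": "p1", "date": date(2015, 1, 10), "scanner": "A"},
    {"id": "s2", "patient": "p1", "date": date(2015, 9, 1), "scanner": "B"},
    {"id": "s3", "patient": "p1", "date": date(2018, 3, 1), "scanner": "A"},
    {"id": "s4", "patient": "p2", "date": date(2016, 5, 5), "scanner": "C"},
    {"id": "s5", "patient": "p2", "date": date(2017, 5, 5), "scanner": "C"},
]
pairs = select_pairs(studies)
```

Only the pairs (s1, s2) and (s4, s5) survive the interval filter in this toy example; the first is a different-scanner pair and the second a same-scanner pair.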

All the data was registered to Montreal Neurological Institute (MNI) space and intensities were normalized to zero mean and unit standard deviation. Ten studies from each training dataset were randomly selected to use as validation during the training process. The data was additionally augmented by randomly flipping individual samples around the x-axis.
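The intensity normalization and flip augmentation can be sketched with NumPy as follows. This assumes that the first array axis corresponds to x and that "flipping around the x-axis" means mirroring along that axis; both are assumptions, as is the flip probability. Registration to MNI space is not covered by this sketch.

```python
import numpy as np


def zscore(volume):
    """Normalize intensities to zero mean and unit standard deviation."""
    volume = np.asarray(volume, dtype=float)
    return (volume - volume.mean()) / volume.std()


def random_flip_x(image, label, rng, p=0.5):
    """With probability p, mirror image and label together along axis 0."""
    if rng.random() < p:
        return np.flip(image, axis=0), np.flip(label, axis=0)
    return image, label


rng = np.random.default_rng(0)
image = zscore(rng.normal(loc=100.0, scale=15.0, size=(4, 4, 4)))
label = (image > 1.0).astype(np.uint8)

# p=1.0 forces the flip so the effect is visible in this example.
flipped_image, flipped_label = random_flip_x(image, label, rng, p=1.0)
```

Flipping the image and its label with the same random decision keeps the two aligned, which is the property the augmentation depends on.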

4 Results

The models were evaluated on the manual segmentations and the test-retest datasets described in Section 3 and compared to the EM-model. The main results are summarized in Table 1. For the manual segmentations dataset, results are described in terms of Dice score, Precision and Recall. For the test-retest dataset, we are mainly interested in evaluating the reproducibility in the inter-scanner cases. Since there is no ground truth, we report the metrics between different time points for the same patient. Aside from the total lesion volume (LV), we additionally quantify the absolute differences in lesion volume. The results in this table were calculated with a lesion threshold value of 0.4. Fig. 5 depicts the distribution of these absolute differences for both inter-scanner and intra-scanner cases of the test-retest dataset.
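The metrics named above follow standard definitions, sketched here with NumPy. The function names, the voxel volume and the toy masks are hypothetical, and returning NaN for an empty denominator is a choice made for this sketch.

```python
import numpy as np


def overlap_metrics(pred, truth):
    """Dice, precision and recall between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())
    fp = int(np.logical_and(pred, ~truth).sum())
    fn = int(np.logical_and(~pred, truth).sum())

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


def lesion_volume(mask, voxel_volume=1.0):
    """Total lesion volume: number of lesion voxels times the volume of one voxel."""
    return float(np.asarray(mask, dtype=bool).sum() * voxel_volume)


def abs_volume_difference(mask_a, mask_b, voxel_volume=1.0):
    """Absolute difference in lesion volume between two scans of one patient."""
    return abs(lesion_volume(mask_a, voxel_volume) - lesion_volume(mask_b, voxel_volume))


# Toy masks (made-up values).
metrics = overlap_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
delta = abs_volume_difference([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0], voxel_volume=0.001)
```

For the test-retest data there is no ground truth, so only the volume-based quantities apply there, computed between two scans of the same patient.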

Base model

For the manual segmentations dataset, results are comparable to the EM-model. In the test-retest validation, the inter-scanner absolute lesion volume difference is larger for the base model, which indicates that the model is sensitive to inter-scanner variability.


Discriminator

The discriminator is validated on a balanced sample of the test-retest dataset, so that there is the same number of inter- and intra-scanner examples. It achieves an accuracy of 78% when the decision is based on the average probability value over the lesion voxels only.
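Turning the voxel-wise discriminator output into one decision per pair can be sketched as follows. Taking the union of the two segmentations as "the lesion voxels" and using 0.5 as the decision cutoff are assumptions of this sketch, and the example values are made up.

```python
import numpy as np


def same_scanner_score(disc_prob, seg_a, seg_b):
    """Average discriminator probability over lesion voxels (union of both masks)."""
    lesion = np.logical_or(np.asarray(seg_a, dtype=bool), np.asarray(seg_b, dtype=bool))
    if not lesion.any():
        return None  # no lesion voxels: the pair cannot be scored this way
    return float(np.asarray(disc_prob, dtype=float)[lesion].mean())


def accuracy(scores, same_scanner_truth, cutoff=0.5):
    """Fraction of pairs whose thresholded score matches the true label."""
    decisions = [score > cutoff for score in scores]
    correct = sum(d == bool(t) for d, t in zip(decisions, same_scanner_truth))
    return correct / len(decisions)


# Each entry: (discriminator output, segmentation A, segmentation B, same-scanner label).
pairs = [
    (np.array([0.9, 0.8, 0.1, 0.2]), np.array([1, 0, 0, 0]), np.array([1, 1, 0, 0]), 1),
    (np.array([0.3, 0.2, 0.9, 0.9]), np.array([1, 1, 0, 0]), np.array([0, 1, 0, 0]), 0),
    (np.array([0.6, 0.7, 0.1, 0.1]), np.array([0, 1, 0, 0]), np.array([1, 1, 0, 0]), 0),
    (np.array([0.4, 0.8, 0.9, 0.1]), np.array([0, 1, 1, 0]), np.array([0, 1, 1, 0]), 1),
]
scores = [same_scanner_score(p, a, b) for p, a, b, _ in pairs]
acc = accuracy(scores, [truth for *_, truth in pairs])
```

Restricting the average to lesion voxels keeps the large background region, where both segmentations trivially agree, from dominating the score.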

Adversarial model

On the manual segmentations dataset, again referring to Table 1, the adversarial model achieves a slightly lower but still competitive performance when compared to the EM-model.

Regarding the test-retest dataset, the adversarial model produces lower inter-scanner absolute lesion volume differences than both the base model and the EM-model (Wilcoxon signed-rank test in each case). This indicates that the adversarial model produces segmentations that are less sensitive to inter-scanner variation than both the base model and the EM-model.
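The Wilcoxon signed-rank test compares paired measurements, here the per-pair volume differences of two models, through the ranks of their differences. The sketch below computes only the rank sums in plain Python, with made-up numbers; obtaining a p-value requires the reference distribution, for which a statistics library such as SciPy (`scipy.stats.wilcoxon`) is the usual choice.

```python
def signed_rank_sums(x, y):
    """Return (W_plus, W_minus): rank sums of positive and negative paired differences."""
    # Zero differences are discarded, as in the classical form of the test.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))

    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired values for two models (not results from the study).
model_a = [12, 8, 25, 19, 4]
model_b = [10, 11, 15, 13, 9]
w_plus, w_minus = signed_rank_sums(model_a, model_b)
```

A large imbalance between the two rank sums indicates that one model tends to give systematically larger values than the other across pairs.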

The mean and standard deviation for the EM-model are almost twice as large as those of the adversarial model. Taking into account the boxplots in Fig. 5, this is partly explained by the fact that the distribution has a positive skew and additionally by three large outliers, which inflate the mean values.

This is evidence that the EM-model has larger variability and lower reproducibility than the adversarial model, while the average predicted lesion volume is similar for the EM- and adversarial models. Fig. 6 shows an example of the different lesion segmentations on the different scanners with the three models.

Table 1: Mean performance metrics for the different models on two test sets: manual segmentations (Dice, Precision, Recall) and test-retest (LV and absolute differences between individual lesion volumes). For the latter, only inter-scanner studies are considered.

Figure 5: Absolute intra- and inter-scanner difference in lesion volume, calculated on the test-retest dataset with three different models.

Figure 6: Lesion segmentation results for one patient in three 3T scanners. Top: EM-model; Middle: base model; Bottom: adversarial model. Adversarial model results appear more consistent, while maintaining physiological meaning.

5 Discussion and future work

We presented a novel approach to improve the consistency of inter-scanner MS lesion segmentations by using adversarial training on a longitudinal dataset. The proposed solution shows improvements in terms of reproducibility when compared to a base CNN model and to an FDA-approved segmentation method based on an EM approach. The key ingredient in the model is the discriminator, which predicts with 78% accuracy on unseen data whether two lesion segmentations are based on MRI scans acquired using the same scanner. This is a promising result, since this is not a standard problem.

When evaluated on an unseen dataset of cross-sectional data, the model’s performance approximates the EM-model, but decreases slightly after the adversarial training. This indicates a trade-off between performance and reproducibility. One concern was that this would be connected to an under-segmentation due to the consistency constraint learned during the adversarial training. However, evaluating the average predicted lesion volume on a separate test-retest dataset shows no indication of under-segmentation when compared to the EM-model.

Both the adversarial network and the discriminator were trained on longitudinal inter-scanner data. This is not ideal, since MS can have an unpredictable evolution over time, and as such it becomes difficult to distinguish between differences caused by hardware and the natural progression of the disease. We attempt to mitigate this effect by selecting studies no more than two years apart, but better and more reliable performance could be achieved if the model were trained on a large dataset with the same characteristics as the test-retest dataset described in Section 3. However, large datasets of that type do not exist and would require a considerable effort to collect, both for patients and in terms of logistics. As such, using longitudinal inter-scanner data is a cost-efficient compromise that shows interesting results.

Performance could also be improved by using higher quality images and unbiased segmentations at training time. This would allow for a stronger comparison to other methods in the literature and to manual delineations. At present, it is to be expected that our model achieves results comparable to those of the method used to obtain the segmentations it was trained on.

Aside from these compromises, some improvements can still be made in future work. Namely, during the training and testing stages of the adversarial network, images could be affinely registered to each other instead of to one common atlas space. We would expect this to increase the overlap metrics. In addition, the overlap metrics were observed to decrease slightly for the adversarial network with longer training, so the weight of the discriminator term in the loss function could be tuned (lowered) to achieve more efficient training and better overlap.

Finally, instead of only freezing the weights of the discriminator to improve the base model, the weights of the base model can also be frozen in a next step to improve the discriminator, so that the base model and discriminator are trained in an iterative process until there are no more performance gains.

Apart from the various optimizations to the model, it would be interesting to apply the same adversarial training to other lesion types, such as the ones resulting from vascular dementia or traumatic brain injuries.


References

  • [1] V. Biberacher et al. (2016) Intra- and interscanner variability of magnetic resonance imaging based volumetry in multiple sclerosis. NeuroImage 142, pp. 188–197.
  • [2] T. Brosch et al. (2016) Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE Transactions on Medical Imaging 35 (5), pp. 1229–1239.
  • [3] A. Carass et al. (2017) Longitudinal multiple sclerosis lesion segmentation: resource and challenge. NeuroImage 148, pp. 77–102.
  • [4] M. Ghafoorian et al. (2017) Transfer learning for domain adaptation in MRI: application in brain lesion segmentation. In Medical Image Computing and Computer Assisted Intervention − MICCAI 2017, pp. 516–524.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks.
  • [6] S. Jain et al. (2017) Unsupervised framework for consistent longitudinal MS lesion segmentation. Vol. 10081, pp. 208–219.
  • [7] S. Jain et al. (2015) Automatic segmentation and volumetry of multiple sclerosis brain lesions from MR images. NeuroImage: Clinical 8, pp. 367–375.
  • [8] K. Kamnitsas et al. (2016) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. arXiv:1612.08894.
  • [9] K. Kamnitsas et al. (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36, pp. 61–78.
  • [10] D. Robben, J. Bertels, S. Willems, D. Vandermeulen, F. Maes, and P. Suetens (2018) DeepVoxNet: voxel-wise prediction for 3D images.
  • [11] N. Shiee et al. (2010) A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage 49 (2), pp. 1524–1535.
  • [12] R. T. Shinohara et al. (2017) Volumetric analysis from a harmonized multisite brain MRI study of a single subject with multiple sclerosis. American Journal of Neuroradiology 38 (8), pp. 1501–1509.
  • [13] S. Valverde et al. (2017) Improving automated multiple sclerosis lesion segmentation with a cascaded 3D convolutional neural network approach. NeuroImage 155, pp. 159–168.
  • [14] S. Valverde et al. (2019) One-shot domain adaptation in multiple sclerosis lesion segmentation using convolutional neural networks. NeuroImage: Clinical 21, pp. 101638.
  • [15] K. Van Leemput et al. (2001) Automated segmentation of multiple sclerosis lesions by model outlier detection. IEEE Transactions on Medical Imaging 20 (8), pp. 677–688.