Spatio-temporal Learning from Longitudinal Data for Multiple Sclerosis Lesion Segmentation

04/07/2020 ∙ by Stefan Denner, et al. ∙ Technische Universität München

Segmentation of Multiple Sclerosis (MS) lesions in longitudinal brain MR scans is performed to monitor the progression of MS lesions. To improve segmentation, we exploit spatio-temporal cues in longitudinal data. To that end, we propose two approaches. The first is a longitudinal segmentation architecture built on early fusion of longitudinal data. The second, complementary to the longitudinal architecture, is a novel multi-task learning approach that defines an auxiliary self-supervised task of deformable registration between two time-points to guide the neural network toward learning from spatio-temporal changes. We show the effectiveness of our methods on two datasets: an in-house dataset comprising 70 patients with one follow-up study each, and the ISBI longitudinal MS lesion segmentation challenge dataset, which has 19 patients with three to five follow-up studies each. Our results show that spatio-temporal information in longitudinal data is a beneficial cue for improving segmentation. Code is publicly available.




1 Introduction

Multiple Sclerosis (MS) is a potentially disabling neurological disease of the central nervous system, characterized by damage to the myelin sheaths of nerve fibers (demyelination). The affected regions appear as focal lesions in the white matter [18], and Magnetic Resonance Imaging (MRI) is used to visualize and detect the lesions [8]. Because MS is a chronic disease, longitudinal MRI patient studies are conducted to monitor its progression. Accurate lesion segmentation in the MRI scans is important to quantitatively assess response to treatment [17] and future disease-related disability progression [19]. However, manual segmentation of MS lesions in MRI volumes is time-consuming and prone to errors and intra/inter-observer variability [6].

Several works have proposed automatic methods for MS lesion segmentation in MRI scans [10, 1, 20, 11, 21, 2]. Valverde et al. [20] proposed a cascade of two 3D patch-wise convolutional networks, where the first network provides candidate voxels to the second network for final lesion prediction. Hashemi et al. [11] introduced an asymmetric loss, similar to the Tversky index, which is supposed to tackle the problem of high class imbalance in MS lesion segmentation by achieving a better trade-off between precision and recall. Whereas the previous two approaches worked with 3D input, Aslani et al. [2] proposed a 2.5D slice-based multimodality approach, where they use a single branch for each modality. They trained their network with slices from all plane orientations (axial, coronal, sagittal). During inference, they merge those 2D binary predictions to a single lesion segmentation volume by applying a majority vote. Zhang et al. [21] also proposed a 2.5D slice-based approach, but they concatenated all modalities instead of processing them in multiple branches. In contrast, they utilize a separate model for each plane orientation.

However, none of these works use data from multiple time-points. The work of Birenbaum et al. [5] is the only method that processes longitudinal data. They proposed a siamese architecture in which input patches from two time-points are given to separate encoders that share weights; the encoders' outputs are then concatenated and fed into a subsequent CNN that predicts the class of the pixel of interest. Their work sets the direction for using longitudinal data and opens up opportunities for future work. However, it does not extensively investigate the potential of the information in longitudinal data. Specifically, the proposed late fusion of features does not properly take advantage of learning from structural changes.

In this paper, we propose two complementary approaches that use the spatio-temporal information available in longitudinal MR scans to improve the segmentation of MS lesions. Our intuition is that the structural changes of MS lesions through time are valuable cues for the model to detect these lesions. The first approach, which serves as our baseline longitudinal method, concatenates the multimodal inputs from two time-points and gives the result to the network as input. The early fusion of inputs, as opposed to the late fusion proposed by Birenbaum et al. [5], allows the encoder to properly capture the differences between the inputs from two time-points. The second approach is complementary to our baseline longitudinal architecture and further enforces the use of structural changes through time. We propose a multitask learning framework that adds an auxiliary deformable registration task to the segmentation model. The two tasks share the same encoder, but each task is assigned its own decoder. We evaluate our approaches on two longitudinal multimodal MS lesion datasets: an in-house dataset including 70 patients with one follow-up study for each patient, and the publicly available ISBI longitudinal MS lesion segmentation challenge dataset [6], which includes 19 patients (5 train and 14 test), each with three to five follow-up studies.

2 Methodology

Figure 1: Our proposed methods: (a) Longitudinal Network: longitudinal scans are concatenated and given to the segmentation model to implicitly use the structural differences (b) Multitask Longitudinal Network: The network is trained with an auxiliary task of deformable registration between two longitudinal scans, to explicitly guide the network toward using spatio-temporal changes.

This section describes our approaches for incorporating spatio-temporal features into the learning pipeline of a neural network. Our intuition is to use the structural changes of MS lesions between the longitudinal scans for detecting these lesions. Note that the aim is not to model how the lesions deform or change, but to find what has changed and to use that information to improve segmentation. To this end, we propose two complementary approaches that allow the use of structural change information to improve segmentation.

2.1 Longitudinal Architecture

We adopt a 2.5D approach [16, 2] for segmentation of 3D MR volumes, where for each voxel, segmentation is done on the three orthogonal slices crossing the voxel of interest. The prediction from the corresponding pixel in each view is combined via majority voting to determine the final prediction for the voxel. To segment a given slice, we use a fully convolutional and densely connected neural network (Tiramisu) [12]. The network receives a slice from any of the three orthogonal views and outputs a segmentation mask. To account for different modalities (T1-w, T2-w, FLAIR, PD), we stack the corresponding slices from all modalities and feed them to the network.
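The 2.5D scheme described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `segment_slice` is a hypothetical stand-in for the trained 2D network (here a plain intensity threshold), and all function names are chosen for this example.

```python
import numpy as np


def segment_slice(slice_2d):
    """Hypothetical stand-in for the 2D network: threshold intensities at 0.5."""
    return (slice_2d > 0.5).astype(np.uint8)


def segment_along_axis(volume, axis):
    """Segment every 2D slice taken perpendicular to `axis` and restack them."""
    slices = np.moveaxis(volume, axis, 0)            # slices[i] is one 2D slice
    masks = np.stack([segment_slice(s) for s in slices], axis=0)
    return np.moveaxis(masks, 0, axis)               # restore the volume layout


def majority_vote(masks):
    """Voxel-wise majority vote over a list of equally shaped binary masks."""
    votes = np.sum(np.stack(masks), axis=0)
    return (2 * votes > len(masks)).astype(np.uint8)


def segment_volume_2p5d(volume):
    """Predict from the three orthogonal views and combine them by voting."""
    return majority_vote([segment_along_axis(volume, a) for a in range(3)])
```

With three views, a voxel is labeled as lesion when at least two of the per-view predictions agree.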

To use the structural changes between the two time-points, we give the concatenated scans of the two time-points as input to the segmentation network (Fig. 1.a). This early fusion of inputs allows the network filters to capture minute structural changes at all layers leading to the bottleneck, as opposed to the late fusion of Birenbaum et al. [5], where high-level representations from each time-point are concatenated. The effectiveness of early fusion for learning structural differences is further supported by similar architectural choices in the design of deformable registration networks [3, 4]. Note that the two inputs do not have to be consecutive time-points, as we are not trying to model a temporal change, but only use structural changes as cues.
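As a rough illustration of early fusion, the sketch below stacks the modality slices of two time-points along the channel axis before they reach a single encoder. The array layout `(modalities, height, width)` and the names `early_fusion_input`, `baseline`, and `followup` are assumptions made for this example.

```python
import numpy as np


def early_fusion_input(baseline, followup):
    """Stack the modality slices of two time-points along the channel axis.

    Each argument has shape (M, H, W): M modalities of one 2D slice.
    The result has shape (2 * M, H, W) and is fed to a single encoder,
    so even the first convolution sees both time-points at once.
    """
    if baseline.shape != followup.shape:
        raise ValueError("both time-points need the same modalities and size")
    return np.concatenate([baseline, followup], axis=0)


# Four modalities (e.g. T1-w, T2-w, FLAIR, PD) of a 16 x 16 slice.
baseline = np.zeros((4, 16, 16))
followup = np.ones((4, 16, 16))
fused = early_fusion_input(baseline, followup)
print(fused.shape)  # prints (8, 16, 16)
```

In a late-fusion siamese setup, by contrast, each time-point would pass through its own encoder copy and only the resulting feature maps would be concatenated.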

2.2 Multitask Learning with Deformable Registration

In this section, we describe how we augment the segmentation task with an auxiliary deformable registration task. The intuition is to explicitly use the structural change information between the two longitudinal scans. In longitudinal scans, only specific structures such as MS lesions change substantially. Deformable registration learns a deformation field between two instances. We therefore propose augmenting our baseline longitudinal segmentation model with a deformable registration loss. We hypothesize that this further guides the network toward using structural differences between the inputs from two different time-points. Note that the longitudinal scans are already rigidly registered in the pre-processing step; therefore, the deformation field reflects the structural differences of lesions.

The resulting network (Fig. 1.b) consists of a shared encoder followed by two decoders that generate the task-specific outputs. One head of the network generates the segmentation mask and the other the deformation field map. The encoder-decoder architecture is that of Tiramisu [12], and the two decoders for registration and segmentation are architecturally equivalent. The deformable registration task is trained without supervision or ground-truth registration data. Instead, it is trained in a self-supervised manner by reconstructing one scan from the other, which adds generic information to the network. The multitask loss is defined as:

$$\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{reg} \qquad (1)$$

where $\mathcal{L}_{seg}$ is the segmentation loss and $\mathcal{L}_{reg}$ is the registration loss.


A common pitfall in multitask learning is an imbalance between tasks, which leads to under-performance of multitask learning compared to single-task learning. Solving this requires normalizing the loss functions or the gradient flow [7]. Here we use the same type of loss function for both tasks. Specifically, we use MSE loss, which is applicable to both registration and segmentation problems. We use a CNN-based deformable registration method similar to VoxelMorph [4], but adapted to 2D inputs and using the Tiramisu architecture (Fig. 1.b). The registration loss ($\mathcal{L}_{reg}$) is defined as:

$$\mathcal{L}_{reg} = \mathcal{L}_{sim}(I_{t}, I_{t'} \circ \Phi) + \mathcal{L}_{smooth}(\Phi) \qquad (2)$$

where $\Phi$ is the deformation field between inputs $I_{t}$ and $I_{t'}$ ($t$ and $t'$ denote the time-points). $I_{t'} \circ \Phi$ is the warping of $I_{t'}$ by $\Phi$, and $\mathcal{L}_{sim}$ is the loss imposing similarity between $I_{t}$ and the warped version of $I_{t'}$. $\mathcal{L}_{smooth}$ is a regularization term that encourages $\Phi$ to be smooth. We use MSE loss for $\mathcal{L}_{sim}$, and for the smoothness term $\mathcal{L}_{smooth}$, similar to [4], we use a diffusion regularizer on the spatial gradients of the displacement field.
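The following NumPy sketch shows one way such a loss could be computed for a single 2D slice: an MSE similarity term between one scan and the warped other scan, plus a diffusion regularizer on the displacement field, added to an MSE segmentation loss. It is an illustration under assumptions, not the authors' code: the bilinear warp with border clamping, the `(row, column)` displacement layout, the forward-difference gradients, and the `smooth_weight` parameter are all choices made for this example.

```python
import numpy as np


def warp_bilinear(image, flow):
    """Warp a 2D image with a dense displacement field (bilinear sampling).

    image: (H, W). flow: (2, H, W), flow[0] = row and flow[1] = column
    displacement in pixels. Output pixel (y, x) samples the image at
    (y + flow[0, y, x], x + flow[1, y, x]), clamped to the image border.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.clip(ys + flow[0], 0, h - 1)
    x = np.clip(xs + flow[1], 0, w - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def diffusion_regularizer(flow):
    """Mean squared forward difference of the displacement field."""
    dy = np.diff(flow, axis=1)
    dx = np.diff(flow, axis=2)
    return float(np.mean(dy ** 2) + np.mean(dx ** 2))


def registration_loss(fixed, moving, flow, smooth_weight=1.0):
    """MSE between `fixed` and the warped `moving`, plus the smoothness term.

    `smooth_weight` is an assumption of this sketch, not a value from the text.
    """
    warped = warp_bilinear(moving, flow)
    similarity = float(np.mean((fixed - warped) ** 2))
    return similarity + smooth_weight * diffusion_regularizer(flow)


def multitask_loss(seg_pred, seg_target, fixed, moving, flow):
    """Sum of the MSE segmentation loss and the registration loss."""
    seg_loss = float(np.mean((seg_pred - seg_target) ** 2))
    return seg_loss + registration_loss(fixed, moving, flow)
```

Because both terms are mean squared errors over the same slice, they live on a comparable scale, which is the motivation given above for using the same loss type for both tasks.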

3 Experiment Setup

3.1 Datasets

The first dataset is the publicly available ISBI Longitudinal MS Lesion Segmentation Challenge dataset [6]. This dataset contains 3T MR images over time. The dataset consists of 5 subjects in training set and 14 subjects in test set with 3 to 5 follow-up images per subject. For each time point, T1-w MPRAGE, FLAIR, T2-weighted, and PD images were acquired. MS lesions were manually annotated by two expert raters.

The second dataset is an in-house clinical dataset [9, 13]. The dataset consists of 1.5T and 3T MR images. Follow-up images of 70 MS patients were acquired. Images are 3D T1-w MRI scans and 3D FLAIR scans with approximately 1 mm isotropic resolution. MS lesions were manually annotated by an expert rater. Data of 40 patients were used as a training set (30 train, 10 validation). The remaining data of 30 patients were used as an independent test set.

3.2 Implementation Details

The encoders and decoders of our architectures are based on FC-DenseNet-57 [12]. We used the Adam optimizer with AMSGrad [15] and a learning rate of 1e-4. We used a single model for all plane orientations. Since our approaches are 2.5D, we applied a majority vote on the probability output predictions over all orientations. PyTorch 1.4 [14] is used for the neural network implementation.

3.3 Evaluation Metrics

For the ISBI challenge dataset, the segmentation volumes were uploaded to the official ISBI challenge website, which calculates an overall performance score based on Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), Pearson's correlation coefficient of the lesion volumes (VC), Lesion-wise True Positive Rate (LTPR), and Lesion-wise False Positive Rate (LFPR), as in Eq. 3.

$$\text{ISBI Score} = \frac{\text{DSC}}{8} + \frac{\text{PPV}}{8} + \frac{1 - \text{LFPR}}{4} + \frac{\text{LTPR}}{4} + \frac{\text{VC}}{4} \qquad (3)$$
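For orientation, the voxel-wise metrics named here can be computed from binary masks as in the sketch below. The DSC and PPV definitions are the standard ones; the relative volume difference and the handling of empty masks are assumptions of this example, and the lesion-wise rates (LTPR, LFPR) are omitted because they additionally require connected-component analysis.

```python
import numpy as np


def dice_coefficient(pred, target):
    """DSC = 2 |P and T| / (|P| + |T|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0                      # assumed convention: both masks empty
    return 2.0 * np.logical_and(pred, target).sum() / denom


def positive_predictive_value(pred, target):
    """PPV = TP / (TP + FP): fraction of predicted voxels that are lesion."""
    pred, target = pred.astype(bool), target.astype(bool)
    predicted = pred.sum()
    if predicted == 0:
        return 0.0                      # assumed convention: nothing predicted
    return np.logical_and(pred, target).sum() / predicted


def volume_difference(pred, target):
    """Assumed definition: |V_pred - V_target| / V_target (needs V_target > 0)."""
    v_pred, v_target = float(pred.sum()), float(target.sum())
    return abs(v_pred - v_target) / v_target


pred = np.array([1, 1, 1, 0])
target = np.array([1, 1, 0, 0])
print(dice_coefficient(pred, target))   # prints 0.8
```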


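For the in-house dataset, DSC, PPV, LTPR, LFPR, and volume difference (VD) are used for evaluation. Moreover, to consider the overall effect of the metrics, we define an Overall Score in a similar fashion to the ISBI score as follows:

$$\text{Overall Score} = \frac{\text{DSC}}{8} + \frac{\text{PPV}}{8} + \frac{1 - \text{LFPR}}{4} + \frac{\text{LTPR}}{4} + \frac{1 - \text{VD}}{4} \qquad (4)$$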
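A small worked example may help here. The sketch assumes the Overall Score uses the ISBI-style weights (1/8 for DSC and PPV, 1/4 for the other terms) with 1 − VD in place of the correlation term; this weighting is an assumption of the example, although it reproduces the last value in each row of Table 2 from the five values preceding it.

```python
def overall_score(dsc, ppv, ltpr, lfpr, vd):
    """Weighted combination of the five metrics.

    LFPR and VD enter as (1 - value) because lower is better for both.
    The weights are assumed, mirroring the ISBI challenge score.
    """
    return dsc / 8 + ppv / 8 + (1 - lfpr) / 4 + ltpr / 4 + (1 - vd) / 4


# Metric values of two Table 2 rows, in the order DSC, PPV, LTPR, LFPR, VD.
multitask = overall_score(0.695, 0.771, 0.680, 0.212, 0.221)
static = overall_score(0.684, 0.762, 0.647, 0.250, 0.247)
print(round(multitask, 3), round(static, 3))  # prints 0.745 0.718
```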


3.4 Method Comparisons

In the following, we clarify the compared methods:

Static Network: Similar to Zhang et al. [21], our static model is based on FC-DenseNet-57 and uses only a single time-point.

Longitudinal Siamese Network: Longitudinal siamese architecture of Birenbaum et al. [5]. The reported score in [5] is used.

Longitudinal Siamese Network (Tiramisu): We implement the longitudinal siamese model [5] with an FC-DenseNet-57 [12]. As in [5], late fusion is used.

Longitudinal Network (ours): Our proposed longitudinal model (section 2.1).

Multitask Longitudinal Network (ours): Our multitask network (section 2.2).

Longitudinal Network with Pretraining (ours): As an ablation study regarding the use of registration information, we implement Longitudinal Network with pretraining. The network architecture is exactly the same as Longitudinal Network. The segmentation model is pre-trained using registration loss.

4 Results and Discussion

Figure 2: Visualisation of MS lesion’s structural change in two longitudinal MR FLAIR scans. Each row presents data from one patient. (a) is the scan from the first time-point, (b) is the scan from the follow-up study, (c) visualizes ground truth of MS lesions on the follow-up image, (d) shows the predicted segmentation mask of Multitask Longitudinal Network on the follow-up image, (e) represents the predicted displacement field between the two scans using the registration module of our multitask method.

Method                                         ISBI score
Multitask Longitudinal Network (ours)          91.97
Longitudinal Network with Pretraining (ours)   91.96
Longitudinal Network (ours)                    92.12
Static Network                                 91.54
Longitudinal Siamese Network (Tiramisu) [5]    91.52
Longitudinal Siamese Network [5]               90.07
Table 1: Comparison of different approaches on ISBI MS lesion segmentation challenge dataset. Our methods are marked "(ours)".

Method                                         DSC     PPV     LTPR    LFPR    VD      Overall Score
Multitask Longitudinal Network (ours)          0.695   0.771   0.680   0.212   0.221   0.745
Longitudinal Network with Pretraining (ours)   0.692   0.777   0.660   0.205   0.232   0.739
Longitudinal Network (ours)                    0.694   0.752   0.654   0.227   0.227   0.731
Static Network                                 0.684   0.762   0.647   0.250   0.247   0.718
Longitudinal Siamese Network (Tiramisu) [5]    0.684   0.777   0.614   0.194   0.245   0.726
Table 2: Comparison of different approaches on the clinical dataset. Our methods are marked "(ours)". For LFPR and VD, lower is better.

To illustrate the behavior of our Multitask Longitudinal Network, we visualize the segmentation mask and the displacement field in Fig. 2. The displacement field shows what has changed. In Fig. 2, the colors in the displacement field encode the direction of the field at each point, and the brightness signifies the magnitude of the displacement. As can be seen, the areas corresponding to MS lesions have high brightness, indicating that the deformable registration model has captured the change of MS lesions.

To verify the effectiveness of the proposed methods, comparative experiments were conducted on both the ISBI challenge dataset and our in-house clinical dataset. We compare our methods with Static Network and Longitudinal Siamese Network [5]. Table 1 shows the results of the different approaches on the ISBI challenge dataset. As shown in the table, the proposed Longitudinal Network and Multitask Longitudinal Network achieve a higher ISBI score than both Static Network and Longitudinal Siamese Network [5]. Because the original Longitudinal Siamese Network [5] used a shallow network architecture, we also compare with Longitudinal Siamese Network (Tiramisu). Our implementation of Longitudinal Siamese Network (Tiramisu) performs better than the original Longitudinal Siamese Network [5]. All our longitudinal methods perform better than both Longitudinal Siamese networks.

Our longitudinal methods perform better on the ISBI test set; however, it is noteworthy that our models outperformed the others on the validation dataset by a larger margin. Multitask Longitudinal Network achieved the best performance with a DSC of 0.8163, followed by Longitudinal Network with Pretraining, Longitudinal Network, and Static Network with DSCs of 0.8147, 0.8046, and 0.7843, respectively. The performance improvements on the validation set do not fully transfer to the test set because of the very limited size of the training set (train and validation), which consists of only 5 patients, whereas the test set consists of 14 patients. Hence, results on a bigger dataset, such as our in-house dataset, are more reliable.

To further evaluate our methods on a larger clinical dataset, we compare them on the in-house dataset. Results for the different methods are shown in Table 2, which indicates that Multitask Longitudinal Network achieves the best DSC, LTPR, VD, and Overall Score; we attribute this to the explicit modeling of spatio-temporal features in the network’s architecture. In section 2.1 we stated that early fusion of inputs allows the network to capture the structural differences between the inputs better than late fusion. The comparison between our Longitudinal Network and Longitudinal Siamese Network (Tiramisu), which differ only in how they fuse the inputs, serves as an ablation experiment for our claim that early fusion performs better.

As an ablation study on the longitudinal methods, we also explore different approaches to using deformable registration information on our in-house dataset. We pre-train the Longitudinal Network with the deformable registration task and fine-tune it with segmentation.

As shown in Table 2, exploiting the spatio-temporal information from deformable registration through pre-training improves the Overall Score over Longitudinal Network (0.739 vs. 0.731). Moreover, by effectively utilizing spatio-temporal features within a multitask learning framework, our Multitask Longitudinal Network achieves the best Overall Score (0.745).

Comparison with                                Mean difference ± standard error   95% CI             p-value
Static Network                                 0.0117 ± 0.0038                    [0.0040, 0.0193]   0.0035
Longitudinal Siamese Network (Tiramisu) [5]    0.0115 ± 0.0040                    [0.0033, 0.0196]   0.0065
Table 3: Statistical significance analysis of performance improvements by paired t-test. Our Multitask Longitudinal Network is compared with Static Network and Longitudinal Siamese Network, respectively.

To verify that the performance improvements are statistically significant, we conducted further analysis of the models with a paired t-test. The paired t-test provides a statistical evaluation of the performance differences between models. Table 3 shows the results of the statistical significance analysis for differences in DSC. We first compared Multitask Longitudinal Network with the Static Network. The performance improvement was statistically significant with p = 0.0035. The performance difference between Multitask Longitudinal Network and Longitudinal Siamese Network (Tiramisu) [5] was also statistically significant (p = 0.0065).
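The quantities in such an analysis (mean difference, standard error, t statistic, confidence interval) can be computed with the standard library alone, as sketched below. The per-patient DSC values are made up for illustration and are not data from this study; `paired_t_summary` and its `t_critical` argument are names chosen for this example.

```python
import math
from statistics import mean, stdev


def paired_t_summary(scores_a, scores_b, t_critical):
    """Paired t-test summary for per-patient scores of two models.

    `t_critical` is the two-sided critical value of Student's t for
    n - 1 degrees of freedom (for example 2.776 for n = 5 at 95%).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per patient per model")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))  # sample std over sqrt(n)
    half_width = t_critical * std_err
    return {
        "mean_diff": mean_diff,
        "std_err": std_err,
        "t": mean_diff / std_err,
        "ci": (mean_diff - half_width, mean_diff + half_width),
    }


# Hypothetical per-patient DSC values for two models on five patients.
model_a = [0.72, 0.70, 0.68, 0.75, 0.71]
model_b = [0.70, 0.69, 0.66, 0.72, 0.70]
summary = paired_t_summary(model_a, model_b, t_critical=2.776)
```

A confidence interval that excludes zero corresponds to a significant difference at the chosen level. For the p-value itself, a statistics package such as SciPy (`scipy.stats.ttest_rel`) would typically be used.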

5 Conclusion

In this work, we proposed using spatio-temporal information in longitudinal brain MR data to improve the segmentation of MS lesions. To that end, we proposed two approaches based on early fusion of longitudinal input data and a novel multitask formulation, where an auxiliary unsupervised deformable registration task is adopted. We evaluated our approaches on two longitudinal MS lesion datasets and showed that incorporating spatio-temporal information into segmentation models improves the segmentation performance. In future work, our proposed methodology can be extended to other longitudinal medical studies to improve segmentation.


References

  • [1] S. Andermatt, S. Pezold, and P. C. Cattin (2017) Automated segmentation of multiple sclerosis lesions using multi-dimensional gated recurrent units. In International MICCAI Brainlesion Workshop, pp. 31–42.
  • [2] S. Aslani, M. Dayan, L. Storelli, M. Filippi, V. Murino, M. A. Rocca, and D. Sona (2019) Multi-branch convolutional neural network for multiple sclerosis lesion segmentation. NeuroImage 196, pp. 1–15.
  • [3] G. Balakrishnan, A. Zhao, M. R. Sabuncu, A. V. Dalca, and J. Guttag (2018) An Unsupervised Learning Model for Deformable Medical Image Registration. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. arXiv:1802.02604, ISBN 9781538664209, ISSN 10636919.
  • [4] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca (2019) VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging. arXiv:1809.05231, ISSN 1558254X.
  • [5] A. Birenbaum and H. Greenspan (2016) Longitudinal multiple sclerosis lesion segmentation using multi-view convolutional neural networks. In Deep Learning and Data Labeling for Medical Applications, pp. 58–67.
  • [6] A. Carass, S. Roy, A. Jog, J. L. Cuzzocreo, E. Magrath, A. Gherman, J. Button, J. Nguyen, F. Prados, C. H. Sudre, et al. (2017) Longitudinal multiple sclerosis lesion segmentation: resource and challenge. NeuroImage 148, pp. 77–102.
  • [7] Z. Chen, V. Badrinarayanan, C. Y. Lee, and A. Rabinovich (2018) GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In 35th International Conference on Machine Learning, ICML 2018. arXiv:1711.02257, ISBN 9781510867963.
  • [8] A. Compston and A. Coles (2008) Multiple sclerosis. ISSN 01406736.
  • [9] A. Galimzianova, F. Pernuš, B. Likar, and Ž. Špiclin (2016) Stratified mixture modeling for segmentation of white-matter lesions in brain mr images. NeuroImage 124, pp. 1031–1043.
  • [10] M. Ghafoorian and B. Platel (2015) Convolutional neural networks for ms lesion segmentation, method description of diag team. Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, pp. 1–2.
  • [11] S. R. Hashemi, S. S. M. Salehi, D. Erdogmus, S. P. Prabhu, S. K. Warfield, and A. Gholipour (2018) Asymmetric loss functions and deep densely-connected networks for highly-imbalanced medical image segmentation: application to multiple sclerosis lesion detection. IEEE Access 7, pp. 1721–1735.
  • [12] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19.
  • [13] Ž. Lesjak, A. Galimzianova, A. Koren, M. Lukin, F. Pernuš, B. Likar, and Ž. Špiclin (2018) A novel public mr image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16 (1), pp. 51–63.
  • [14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • [15] S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. International Conference on Learning Representations (ICLR).
  • [16] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers (2014) A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). arXiv:1406.2639, ISBN 9783319104034, ISSN 16113349.
  • [17] M. Stangel, I. K. Penner, B. A. Kallmann, C. Lukas, and B. C. Kieseier (2015) Towards the implementation of ‘no evidence of disease activity’ in multiple sclerosis treatment: the multiple sclerosis decision model. Therapeutic Advances in Neurological Disorders 8 (1), pp. 3–13.
  • [18] L. Steinman (1996) Multiple sclerosis: A coordinated immunological attack against myelin in the central nervous system. ISSN 00928674.
  • [19] T. Uher, M. Vaneckova, L. Sobisek, M. Tyblova, Z. Seidl, J. Krasensky, D. Ramasamy, R. Zivadinov, E. Havrdova, T. Kalincik, and D. Horakova (2017) Combining clinical and magnetic resonance imaging markers enhances prediction of 12-year disability in multiple sclerosis. Multiple Sclerosis 23 (1), pp. 51–61.
  • [20] S. Valverde, M. Cabezas, E. Roura, S. González-Villà, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira, A. Oliver, and X. Lladó (2017) Improving automated multiple sclerosis lesion segmentation with a cascaded 3d convolutional neural network approach. NeuroImage 155, pp. 159–168.
  • [21] H. Zhang, A. M. Valcarcel, R. Bakshi, R. Chu, F. Bagnato, R. T. Shinohara, K. Hett, and I. Oguz (2019) Multiple sclerosis lesion segmentation with tiramisu and 2.5 d stacked slices. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 338–346.

6 Supplementary Material

Method                                              DSC     PPV     LTPR    LFPR    VD      Overall Score
Multitask Longitudinal Network                      0.695   0.771   0.680   0.212   0.221   0.745
Static Network                                      0.684   0.762   0.647   0.250   0.247   0.718
Static Network (Zhang et al. [21])                  0.684   0.761   0.604   0.223   0.263   0.710
Static Network (Asymmetric Dice Loss as in [11])    0.690   0.648   0.752   0.346   0.336   0.685
Table 4: Comparison with state-of-the-art static networks [11, 21] on the in-house clinical dataset. [21] uses a separate model for each plane orientation and stacks three consecutive 2D slices as input, whereas Static Network (used in our paper) has one model for all plane orientations and takes only a single 2D slice as input. Note that the Static Network used in our paper is slightly better than [21]. Moreover, the proposed Multitask Longitudinal Network outperforms [11, 21] and our Static Network.

Method                                              DSC     PPV     LTPR    LFPR    VD      Overall Score
Static Network (All plane orientations)             0.684   0.762   0.647   0.250   0.247   0.718
Static Network (Single plane orientation)           0.686   0.734   0.634   0.265   0.259   0.705
Table 5: Comparison of our Static Network trained with all three plane orientations (axial, coronal, sagittal) and Static Network (Single plane orientation), where three Static Network models are each trained with one plane orientation. For the final output, a majority vote is used in both cases. Experiments are conducted on the clinical dataset.