MS Lesion Segmentation: Revisiting Weighting Mechanisms for Federated Learning

by   Dongnan Liu, et al.

Federated learning (FL) has been widely employed for medical image analysis to facilitate multi-client collaborative learning without sharing raw data. Despite great success, FL's performance is limited for multiple sclerosis (MS) lesion segmentation tasks, due to variance in lesion characteristics imparted by different scanners and acquisition parameters. In this work, we propose the first FL MS lesion segmentation framework via two effective re-weighting mechanisms. Specifically, a learnable weight is assigned to each local node during the aggregation process, based on its segmentation performance. In addition, the segmentation loss function in each client is also re-weighted according to the lesion volume of its training data. Comparison experiments on two FL MS segmentation scenarios using public and clinical datasets have demonstrated the effectiveness of the proposed method, which significantly outperforms other FL methods. Furthermore, the segmentation performance of FL incorporating our proposed aggregation mechanism can exceed centralised training with all the raw data. The extensive evaluation also indicated the superiority of our method when estimating brain volume differences after lesion inpainting.



I Introduction

Multiple sclerosis (MS) is a chronic inflammatory and degenerative disease of the central nervous system, characterized by the appearance of focal lesions in the white and gray matter that topographically correlate with an individual patient’s neurological symptoms and signs. Globally there are an estimated million people with MS and, after trauma, the disease constitutes the most common cause of neurological disability in young adults [8]. Lesion characteristics, such as number and volume, are principal imaging metrics for both MS clinical trials and monitoring of the disease in clinical practice [39]. To this end, automatic and accurate MS lesion segmentation in Magnetic Resonance (MR) imaging can critically enhance both MS research and patient management [53, 13, 3, 29].

With recent advances in AI-enhanced computer vision techniques, deep learning-based methods have been widely used for lesion segmentation and can achieve results close to human experts. Despite this, there remain significant challenges in the current methods [30, 10]. In particular, MR images from different scanners present different data distributions, which incur a performance drop when validating off-the-shelf models trained at a single client with images from another. On the other hand, data sharing between multiple clients is not always possible due to privacy, legal and ethical concerns.

Fig. 1: Evidence of the variance on appearance and lesion volume in multi-client studies. The top images are examples of 2D slices from each client in the study. The bottom graphs are the violin and box plots for the lesion volume to brain volume ratio distributions per client for all the subjects in this FL study.

To address this dilemma, federated learning (FL) techniques, in which training is decentralized, were proposed [32, 25]. At the beginning of the FL process, each participating client is first assigned an initialized model. Next, these models are trained using the local data in each client. After several training iterations, each client is required to share its private model weights with a central server, which aggregates these local weights and distributes them back to each client. Initialized by the updated weights from the server, the model in each client continues its local training for another round of the FL process. By enriching the knowledge learned in each local model without sharing the raw data, FL methods have also been widely employed for multi-client medical image analysis [24, 16, 27, 40]. Note that throughout the paper, we use the term ‘client’ to represent the data in each distinct scanner or clinical center.
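
The round structure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation used in this work: the names `local_update`, `average_weights` and `fedavg_round` are hypothetical, model weights are plain dictionaries of floats, and local training is replaced by a toy step that moves each weight towards the mean of the client's data.

```python
from typing import Dict, List

Weights = Dict[str, float]


def local_update(weights: Weights, data: List[float], lr: float = 0.5) -> Weights:
    """Toy stand-in for local training: move each weight towards the data mean."""
    target = sum(data) / len(data)
    return {name: w + lr * (target - w) for name, w in weights.items()}


def average_weights(client_weights: List[Weights]) -> Weights:
    """Server step: element-wise mean of the clients' weights (plain averaging)."""
    n = len(client_weights)
    return {name: sum(cw[name] for cw in client_weights) / n
            for name in client_weights[0]}


def fedavg_round(global_weights: Weights, client_data: List[List[float]]) -> Weights:
    """One FL round: broadcast, train locally, aggregate on the server."""
    local = [local_update(dict(global_weights), data) for data in client_data]
    return average_weights(local)


global_weights = {"w": 0.0}
clients = [[1.0, 1.0], [3.0, 3.0]]  # raw data stays with each client
for _ in range(3):
    global_weights = fedavg_round(global_weights, clients)
```

Only the weight dictionaries cross the client boundary in this sketch; the lists in `clients` are read solely inside `local_update`.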

However, current FL methods are suboptimal for multi-client MS lesion segmentation. First, during aggregation, the central server averages the model parameters from all the local clients, assuming each local model has the same importance and performance. For MS lesion segmentation, the datasets from multiple clients, their data distribution and the lesion morphology and signal characteristics can vary greatly [20, 1], which can lead to divergence of the private local models, thereby conferring distinct segmentation characteristics when they are aggregated in the central server. By fusing a model with inferior segmentation performance to others with superior ability, the segmentation performance for the entire updated model may be compromised. Second, differences in the clinical distribution of patients can impact lesion burden, size and morphology at a client level, generating significant inter-site variance in multi-client studies, as shown in Fig. 1. As explored in [36, 43], a model trained on a dataset with smaller lesions will usually present a lower performance due to the lack of lesion samples for training. However, the task loss functions in each client are optimized with the same importance in previous FL methods [32, 24, 25], which would induce the inferior performance of the center model on the clients with smaller lesion sizes, and further influence the overall FL segmentation accuracy.

To solve the aforementioned issues, we propose a Federated MS lesion segmentation framework based on two dynamic Re-Weighting mechanisms (FedMSRW). During the model aggregation process, the model parameters from each client are assigned a weight based on their segmentation abilities during local training, including the segmentation performance and confidence. Models with higher ability are assigned a higher weight and vice versa. To solve the lesion volume imbalance across different clients, we propose to re-weight the task loss function in each client based on the average case-wise lesion volume ratio, i.e., the ratio of lesion volume to the brain volume, of the training data for that client. Motivated by [43], where more attention should be paid to smaller lesion objects during model training, the weights for the overall loss functions in clients with a smaller lesion volume are enlarged, and vice versa.

The major contributions of this work are summarized as follows:

  • To the best of our knowledge, this work is the first application of privacy-preserving FL methods to the task of MS lesion segmentation and, in particular, to multi-client MS datasets.

  • We propose uncertainty-aware re-weighting mechanisms during the central model aggregation process to prevent the negative influence of the inferior local models.

  • We further propose to re-weight the segmentation loss functions in each local client/center based on its local lesion volume ratio, addressing the impact of client-specific lesion variance in the multi-client MS datasets.

  • We have conducted extensive experiments in two FL MS lesion segmentation scenarios using both public and real-world clinical MS datasets. Our FedMSRW method outperforms typical FL methods significantly.

Fig. 2: Detailed framework of our FedMSRW method. The segmentation-ability measure used to calculate the weighting factors during model aggregation is defined in Equation 4. The details of the segmentation task re-weighting are in Equation 6.

II Related Work

II-A MS Lesion Segmentation

In classical MS lesion segmentation methods, the brain tissues (e.g., WM, GM, CSF) are first segmented from the raw MR images via statistical methods, e.g., the Expectation-Maximization (EM) algorithm [4, 2] or Gaussian Mixture Modeling [12, 21]. Then, lesions are detected as outliers based on the tissue masks [48, 4, 2, 12, 21]. With the advent of deep learning-based medical image computing [42, 41], deep learning models that learn representative features via convolutional modules have been widely employed for automatic MS lesion segmentation, achieving competitive performance [14, 51, 35, 31, 18, 30, 3, 47, 33].

In clinical practice, the data distribution of brain MRI varies across MRI scanners due to variance in image geometry, resolution, tissue intensity and contrast conferred by differences in hardware (scanner and coil) and acquisition protocols [20, 1, 46, 11]. These domain differences limit the performance of supervised learning methods when applied to images from new scanners [30, 20, 1]. This phenomenon is referred to as the domain shift issue, which exists in various medical image analysis applications for multiple datasets from different resources (e.g., modalities, sites) [6, 26, 49]. Recently, cross-domain MS lesion segmentation methods have been further explored to enhance the models’ generalization ability. In particular, the domain differences are alleviated by inducing the model to generate scanner-invariant features [20, 1], learning from synthetic images that follow the distribution of the target scanners [37], and cross-scanner data harmonization [11]. A crucial prerequisite of these methods is that all the data from multiple scanners should be fed into the framework simultaneously. However, sharing clinical data across sites raises privacy issues, which limits the practical application of these methods in large collaborative studies [24, 16].

II-B Federated Learning

Federated learning (FL) provides a decentralized solution for multi-client collaborative learning without raw data sharing. To ensure privacy preservation, a central server is established to collect, from each local client, their model weights, gradients, and features. Next, an aggregation process is conducted to update the central model, which is then broadcast back to each local client. Originally, FedAvg [32] performed the aggregation by averaging the weights from all local clients. FedAvg [32] was later extended by FedProx [23], which adds regularization to stabilize the models’ performance.

FL methods have also been utilized for multi-client medical image analysis. In [24] and [16], each local model is incorporated with an adversarial domain discriminator to alleviate the inter-client distribution bias. However, the intermediate features in each local client are required to be shared across clients. Despite these privacy-preserving strategies, distributing features still incurs the risk of data leakage. To solve this problem, FedBN [25] has been proposed for domain adaptive FL by only processing the parameters outside the batch normalization layers of each local model. In addition, FedDG enhances the generalization ability of the FL framework on unseen datasets via Fourier transform-based image synthesis and episodic learning strategies. Furthermore, [15] proposed a distillation-based FL method without sharing the model parameters, which further enhances data safety. Although these methods are effective in many medical imaging scenarios, they have not considered the weighting strategies for the global aggregation and local training, which are crucial for FL MS segmentation. The method that is most related to ours is [40], which re-weights each local model’s training based on the loss value changes. However, its dynamic weighting strategy is sensitive to hyperparameter selection and therefore lacks robustness. In contrast, our proposed re-weighting mechanisms at the global and local levels are effective and simple, without auxiliary hyperparameters.

III Methods

In this section, we first present the overview of our proposed FedMSRW method. Next, the re-weighting mechanisms during the central aggregation and local training are respectively illustrated in detail. Finally, we introduce the training and inference details of the FedMSRW method.

III-A Overview

We consider a set of MS lesion segmentation datasets from different clients, each consisting of MR images and the corresponding lesion annotations. In each client, the local model is optimized over its own parameters by minimizing a segmentation loss on the local data (Equation 1), where the segmentation loss is the soft Dice loss function for probabilistic binary segmentations [34] (Equation 2).
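
For illustration, a soft Dice loss of this kind can be written as follows. The sketch assumes the squared-denominator form commonly used for soft Dice, with flat Python sequences standing in for 3D probability maps and label volumes; the function name `soft_dice_loss` and the smoothing constant `eps` are assumptions introduced here and are not taken from the original formulation.

```python
from typing import Sequence


def soft_dice_loss(probs: Sequence[float], labels: Sequence[float],
                   eps: float = 1e-6) -> float:
    """Return 1 - soft Dice between predicted probabilities and binary labels.

    probs  : predicted lesion probabilities, one per voxel, in [0, 1]
    labels : ground-truth lesion labels, one per voxel, 0 or 1
    eps    : small constant (an assumption) that avoids division by zero
    """
    intersection = sum(p * g for p, g in zip(probs, labels))
    denominator = sum(p * p for p in probs) + sum(g * g for g in labels)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

A perfect prediction gives a loss of zero, and a prediction with no overlap gives a loss close to one, so minimizing this loss pushes probability mass onto the annotated lesion voxels.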


Due to the data distribution differences in multi-client MR images, we build our proposed FedMSRW on FedBN [25], which tackles the domain bias issues in FL processes and only requires sharing of the model parameters. Based on the assumption that the parameters of the normalization layers in deep learning models represent the domain-specific information [5, 17], FedBN prevents the central model from domain shift by aggregating the parameters in the convolutional layers, while ignoring those in the batch normalization layers. Specifically, the parameters of each local model can be split into two groups: the parameters of all the batch normalization layers, and those of the remaining layers. After collecting the local weights, the central server aggregates the parameters of the remaining layers by averaging them over all clients with equal importance (Equation 3).

On receiving the updated weights from the server, each local model is then initialized with the aggregated parameters of the non-normalization layers together with its own batch normalization parameters, for the next round of local segmentation training. The detailed framework is shown in Fig. 2.
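
The aggregation and re-initialization steps can be sketched as below. This is a minimal illustration, not the FedBN reference implementation: parameters are dictionaries mapping names to flat lists of floats, and the helper `is_bn_param`, which treats any parameter whose name contains `"bn"` as a batch normalization parameter, is a naming convention assumed purely for this example.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def is_bn_param(name: str) -> bool:
    """Assumed convention: batch-norm parameters have 'bn' in their name."""
    return "bn" in name


def fedbn_aggregate(client_params: List[Params]) -> Params:
    """Average every non-batch-norm parameter across clients with equal weights."""
    n = len(client_params)
    shared: Params = {}
    for name in client_params[0]:
        if is_bn_param(name):
            continue  # batch-norm parameters stay private to each client
        columns = zip(*(cp[name] for cp in client_params))
        shared[name] = [sum(col) / n for col in columns]
    return shared


def load_shared(local: Params, shared: Params) -> Params:
    """Next-round initialization: shared weights plus the client's own BN weights."""
    return {name: shared.get(name, values) for name, values in local.items()}
```

For two clients holding `{"conv.weight": [1.0, 2.0], "bn.weight": [10.0]}` and `{"conv.weight": [3.0, 4.0], "bn.weight": [20.0]}`, the server returns only the averaged `conv.weight`, and each client keeps its own `bn.weight` after `load_shared`.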

III-B Central Aggregation Re-weighting based on the Models’ Segmentation

Due to distinct, client-specific characteristics of both the MRI data and the MS lesions, the difficulty of lesion segmentation tasks differs across clients. As a result, the segmentation ability of the various local models differs after each round of local training. According to Equation 3, both the low-performance and high-performance models are assigned equal importance during the aggregation process at the central server. This is suboptimal since the local models with inferior segmentation ability influence the updated model from the server and further limit collaborative knowledge learning in FL. A trivial solution to this problem is to adjust the number of training samples for each client, as indicated in [32]. However, there is no simple, non-biased sample selection mechanism to alleviate the negative effects of the models with inferior performance. Additionally, selecting auxiliary hyperparameters manually in FL would limit the model’s robustness.

To this end, we propose an aggregation re-weighting mechanism based on the segmentation performance of each local model during the training process in its client. For each training iteration in a client, the segmentation ability for probabilistic lesion segmentation is measured from the model's predictions on the input data and the corresponding labels (Equation 4).

As indicated in Equation 4, the first term represents the model’s confidence in the predicted lesion segmentation. Since the MS lesion region of interest occupies only a tiny fraction (average around ) of the whole brain volume, the confidence value within the true positive lesion regions better reflects the model’s lesion prediction certainty than traditional methods that measure the model’s confidence based on the entropy of the whole prediction map. Additionally, a second term in Equation 4, which reflects the model’s segmentation performance, is included. Finally, the average of this measure over all the local training iterations indicates the segmentation ability of the local model. Taking this measure into account, the central aggregation process in Equation 3 is re-formulated so that each local model is weighted by its segmentation ability (Equation 5).
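
One way to read this description is sketched below. The exact forms of Equations 4 and 5 are not reproduced here, so the code rests on stated assumptions: the confidence term is taken as the mean predicted probability over annotated lesion voxels, the performance term as a soft Dice score, and the aggregation weights as the abilities normalized to sum to one. The names `segmentation_ability` and `weighted_aggregate`, the constant `eps`, and the `"bn"` naming convention are all hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def segmentation_ability(probs: Sequence[float], labels: Sequence[float],
                         eps: float = 1e-6) -> float:
    """Assumed score: lesion-region confidence plus a soft Dice term."""
    intersection = sum(p * g for p, g in zip(probs, labels))
    # Mean predicted probability inside the annotated lesion region.
    confidence = intersection / (sum(labels) + eps)
    # Soft Dice between the probability map and the labels.
    dice = (2.0 * intersection + eps) / (
        sum(p * p for p in probs) + sum(g * g for g in labels) + eps)
    return confidence + dice


def weighted_aggregate(client_params: List[Params],
                       abilities: Sequence[float]) -> Params:
    """Average non-batch-norm parameters, weighting each client by its ability."""
    total = sum(abilities)
    weights = [a / total for a in abilities]  # normalized to sum to one
    shared: Params = {}
    for name in client_params[0]:
        if "bn" in name:  # assumed naming convention for batch-norm parameters
            continue
        columns = zip(*(cp[name] for cp in client_params))
        shared[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return shared
```

In practice the per-iteration scores of a client would be averaged over all local iterations of a round before being passed to `weighted_aggregate`, so that a single unusually easy or hard batch does not dominate the weight.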

FedAvg 58.64 7.01
FedBN 82.01 59.56
DWA 58.13 54.12 16.43
Ours 62.76 71.15 63.56 72.60 78.13 67.39 64.97 13.66 17.98
TABLE I: The comparison experiments between our proposed method and others on the first FL MS lesion segmentation scenario, using MICCAI MSSEG16 dataset.
FedAvg 52.70 60.37 26.63 5.94
FedProx 68.83 66.96 63.98
DWA 57.70
Ours 50.90 52.41 53.66 64.22 69.48 56.90 58.61 62.31 45.14 35.73 40.25 29.78
TABLE II: The comparison experiments between our proposed method and others on the second FL MS lesion segmentation scenario.
Client    Scanner                Cases
Scenario 1
C1        Siemens Verio 3T       5
C2        Siemens Aera 1.5T      5
C3        Philips Ingenia 3T     5
Scenario 2
C1        GE Discovery 3T        54
C2        Philips Ingenia 3T     21
C3        Siemens Skyra 3T       30
C4        Siemens Magnetom 3T    30
TABLE III: Details on the scanners for the datasets used in our experiments.

III-C Local Optimization Re-weighting based on the Lesion Volume

Another challenge in FL MS lesion segmentation tasks is the heterogeneity of lesion size across different clients. As indicated in [36, 43], lesions with smaller sizes should be assigned a larger weight during model training. To this end, we further propose to re-weight the segmentation loss functions in each client defined in Equation 1 based on the lesion volume.

In each round of local training in a client, we first calculate the average lesion volume ratio of all the data samples used for training. Specifically, the lesion ratio in each training patch is the ratio of the lesion volume to the brain volume. Compared with only counting the voxel number of lesions, the lesion volume ratio can avoid inaccurate estimations when the proportions of the brain volume in some specific training patches are small. Next, this average is accumulated with the average lesion volume ratio from the previous round. As the number of rounds increases, the accumulated ratio can represent the true lesion volume ratio for the data used during the model training process in each client. In the following round of local training, the segmentation loss in Equation 1 is then re-weighted according to the accumulated ratio (Equation 6).
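
The bookkeeping described above can be illustrated as follows. The exact re-weighting factor of Equation 6 is not reproduced here; as a loudly stated assumption, the sketch accumulates the ratio as a running mean over rounds and uses weights proportional to the inverse of the accumulated ratio, scaled to average one across clients, so that clients with smaller lesions receive larger loss weights. The names `patch_lesion_ratio`, `accumulate_ratio` and `loss_weights` are hypothetical.

```python
from typing import List, Sequence


def patch_lesion_ratio(lesion_mask: Sequence[int], brain_mask: Sequence[int]) -> float:
    """Lesion voxels divided by brain voxels within one training patch."""
    brain = sum(brain_mask)
    return sum(lesion_mask) / brain if brain else 0.0


def accumulate_ratio(previous: float, current: float, round_index: int) -> float:
    """Running mean of the per-round average ratio (round_index starts at 1)."""
    return previous + (current - previous) / round_index


def loss_weights(ratios: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Assumed inverse-ratio weights, scaled to average 1 across clients."""
    inverse = [1.0 / (r + eps) for r in ratios]
    scale = len(inverse) / sum(inverse)
    return [v * scale for v in inverse]
```

With accumulated ratios of 0.01 and 0.03 for two clients, this scheme gives loss weights of roughly 1.5 and 0.5. Dividing by the brain voxels rather than the patch size is what keeps a patch that is mostly background from understating the lesion load.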

Algorithm 1 Algorithm for the proposed FedMSRW method
Input: MS lesion segmentation datasets from the clients; in each client, a CNN model with its own parameters. P: the number of FL rounds. Q: the number of local training iterations in each round.
for each of the P FL rounds do
    for each client do
        Initialize the local model with the updated global model.
        Obtain the accumulated lesion volume ratio for the client.
        Optimize the local model via Eq. 6 for Q iterations.
        Obtain the measure of segmentation ability for the local model by Eq. 4.
    end for
    Aggregate the local models in the central server via Eq. 5.
    Calculate the re-weighting factors in Eq. 6.
end for
return the trained models

III-D Training and Inference Details

The overall training algorithm of our proposed FedMSRW method is indicated in Algorithm 1. In each local client, the lesion segmentation task is trained with a 3D U-Net [7]. During training, we employ the SGD optimizer with a momentum of , a weight decay of , and a learning rate of . After every training iterations, the local models are sent to the central server for aggregation. During inference, the model in each client is constructed by the central aggregated convolutional weights and the client-private batch normalization weights. Our experiment is implemented with PyTorch [38] on 4 RTX 6000 GPU devices.

IV Experiments

Fig. 3: Qualitative results of the compared FL methods. Lesion masks are overlaid on the original image. The top four rows are the visualization for Scenario 1, and the bottom four rows are for Scenario 2.

IV-A Dataset Description

IV-A1 Scenario 1

First, we conducted experiments on the MSSEG16 MS lesion segmentation challenge from MICCAI [9]. Since the testing images are not publicly available, our experiments were conducted on the training set, with 5 cases from each of 3 different scanners, as indicated in Table III. Due to the small number of cases, all experiments were performed in a 2-fold cross-validation manner. At each iteration, 3D patches of size were randomly cropped from the original FLAIR images, with random flipping and rotation augmentations.

IV-A2 Scenario 2

To further indicate the effectiveness of our proposed framework on the FL MS lesion segmentation tasks in a practical clinical scenario, we conducted experiments using in-house and public multi-scanner MS datasets from 4 different scanners. Specifically, the data from C1, C2, and C3 is obtained from de-identified images derived from patients with relapsing and remitting MS, recruited at the Brain and Mind Centre, University of Sydney (Sydney, Australia). All lesion masks were annotated semi-automatically by at least two expert neuroimaging analysts at the Sydney Neuroimaging Analysis Centre (Sydney, Australia). To further increase the diversity of the multi-client MS data, we included a public dataset from a new site [22], in addition to the private data from different scanners. All cases were defaced to protect patient privacy. All experiments under these settings were conducted in a 3-fold cross-validation manner. During training, the patches were randomly cropped from the original MRI data, with the augmentations of flipping and rotations.

IV-B Evaluation Metrics

To evaluate the segmentation performance of our proposed method, we first employed the case-level and voxel-level Dice coefficient, defined as:

$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

where TP, FP, and FN indicate the number of true positive, false positive, and false negative voxel predictions, respectively. The case-wise Dice score (C-Dice) was obtained by the average Dice score for all cases, and the voxel-wise Dice score (V-Dice) calculated via the accumulated predictions of all the cases. Additionally, we also evaluated the performance based on the true positive rate (TPR) and false positive rate (FPR) at the voxel level:
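
The difference between the case-wise and voxel-wise scores can be made concrete with a short sketch. Binary masks are represented as flat sequences of 0 and 1, and all function names are hypothetical. The TPR below is the usual TP / (TP + FN). For the FPR, the sketch assumes the share of predicted lesion voxels that are false positives, FP / (TP + FP), a convention often used in lesion segmentation; this is an assumption and may not match the definition used for the tables.

```python
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[int], Sequence[int]]  # (binary prediction, binary ground truth)


def confusion(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positive, false positive and false negative voxels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def dice(tp: int, fp: int, fn: int) -> float:
    """Dice coefficient from voxel counts (1.0 when both masks are empty)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def pooled_counts(cases: List[Case]) -> Tuple[int, int, int]:
    """Accumulate TP, FP and FN over all cases."""
    counts = [confusion(p, t) for p, t in cases]
    return (sum(c[0] for c in counts), sum(c[1] for c in counts),
            sum(c[2] for c in counts))


def case_dice(cases: List[Case]) -> float:
    """C-Dice: mean of the per-case Dice scores."""
    return sum(dice(*confusion(p, t)) for p, t in cases) / len(cases)


def voxel_dice(cases: List[Case]) -> float:
    """V-Dice: Dice computed once on the counts pooled over all cases."""
    return dice(*pooled_counts(cases))


def voxel_tpr(cases: List[Case]) -> float:
    tp, _, fn = pooled_counts(cases)
    return tp / (tp + fn) if tp + fn else 0.0


def voxel_fpr(cases: List[Case]) -> float:
    """ASSUMED definition: FP / (TP + FP)."""
    tp, fp, _ = pooled_counts(cases)
    return fp / (tp + fp) if tp + fp else 0.0
```

Because V-Dice pools the counts, cases with a large lesion load dominate it, whereas C-Dice gives every case the same influence regardless of lesion volume.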


IV-C Comparison Experiments

In this section, we present the experimental results in comparison with typical FL methods, including 1) FedAvg [32], a fundamental FL method by central aggregation via averaging of model weights; 2) FedProx [23], an FL framework introducing an auxiliary regularization mechanism in each client to stabilize learning, 3) FedBN [25], an FL framework which can alleviate the cross-site data distribution bias by ignoring parameters in the normalization layers during aggregation, and 4) DWA [40], a dynamic re-weighting mechanism for the central model aggregation process based on the changes of the loss functions in each client. For a fair comparison, we re-implement the DWA on the same FL baseline as our proposed FedMSRW method, i.e., FedBN. We also report the results by training within each local client (Single), and joint training with the raw data from all clients (Central). We maintained the same data split on the N-fold cross-validation for all methods. The experimental results under two FL MS segmentation scenarios are shown in Table I, and Table II, respectively.

Scenario 1 Scenario 2
C-Dice V-Dice V-TPR V-FPR C-Dice V-Dice V-TPR V-FPR
FedBN 59.56
+ RW-LT 58.74
Ours 63.56 67.39 16.46 53.66 62.31 29.78
TABLE IV: Details of the ablation studies in our experiments. ‘+ RW-CA’ and ‘+ RW-LT’ indicate the FedBN baseline combined with the proposed central aggregation and local training re-weighting mechanisms, respectively.

In the first scenario, the MS lesion segmentation experiments were conducted on images from different scanners (from different clinical sites). As shown in Table I, the performance of the typical FedAvg and FedProx methods is worse than that of the models trained solely with the data in each specific client. For the multi-client MS lesion segmentation dataset, the data distributions for each client are distinct, reflecting variance in hardware and image acquisition protocols. This results in domain bias issues when optimizing the aggregated model on each local client. For the MS lesion segmentation task, the foreground objects (i.e., lesions) are almost always small and numerous, with a heterogeneous spatial distribution. Consequently, the domain shifts lead to inaccurate segmentation for the FedAvg and FedProx methods. By preserving the domain-specific batch normalization in each client, FedBN can alleviate the issue and improve on the locally trained models. By further accounting for inter-client label bias and the distinct model performance, our proposed method outperformed FedBN and DWA, based on the Dice score at both the case and voxel levels. Our proposed FL method also outperformed centralized training (improved average Dice score and competitive voxel-wise Dice score), which requires each client to share its raw data.

In the second scenario, FL methods were evaluated on the in-house and public datasets. First, real-world clinical MS datasets from three different scanners were employed. To further increase the diversity of the FL setting, we included an auxiliary public dataset from another new scanner [22]. The experimental results are presented in Table II. We observed a similar phenomenon as in the first scenario, namely that cross-client distribution bias in multi-client MS datasets degrades the collaborative performance of FedAvg and FedProx, while FedBN achieves much better performance by alleviating the domain bias. However, incorporating DWA into the FedBN baseline incurred a severe performance drop. The relatively larger dataset used from each client in the second scenario, which exaggerates client-specific differences in data distribution, may explain this observation. Conversely, FedMSRW, which further considers task-specific factors such as cross-client lesion ratios and the distinct MS lesion segmentation ability of each local model, outperformed FedBN under all metrics. Fig. 3 illustrates a visual comparison of FedMSRW with other methods, which further indicates the segmentation performance of our method.

Scenario 1 Scenario 2
C-Dice V-Dice V-TPR V-FPR C-Dice V-Dice V-TPR V-FPR
FedBN 59.56
Ours-ent 63.70 15.83
Ours-vol 62.26
Ours 67.39 53.66 62.31 29.78
TABLE V: Results on the effectiveness of our proposed FedMSRW under different model designs.

IV-D Ablation Studies

In contrast to typical FL benchmark tasks, which assume the annotations for each client are in the same distribution space [25], the MS lesion segmentation task is confounded by substantial inter-client lesion heterogeneity. For specific clients whose MR images generally contain smaller lesions with more noise, it is more challenging for a 3D U-Net to segment lesions accurately. Consequently, the effectiveness of FedBN is still limited, because it ignores the bias of the labelling space in MS lesion segmentation tasks. To solve this problem, we propose a re-evaluation of the weighting mechanism for the central aggregation (RW-CA) process and local training (RW-LT) process. As shown in Table IV, solely employing the RW-CA or RW-LT mechanism yields an unstable performance gain. In Scenario 1, the RW-LT module marginally improves the Dice score, but it incurs a large performance drop in the second scenario. A similar phenomenon has been observed in [40], namely that re-weighting the training loss functions in each client generates unstable FL performance. The RW-CA module introduces a slight performance drop based on the voxel-wise Dice score in the first scenario, while improving the segmentation accuracy under other metrics. Conversely, in the proposed FedMSRW framework, jointly incorporating the two re-weighting mechanisms consistently improves the FedBN method by a large margin, indicating the effectiveness and robustness of our method on the FL MS segmentation tasks.

Grey Matter Difference (%) White Matter Difference (%)
FedAvg 0.0419 0.0212
DWA 0.1594
Ours 0.3670 0.2054 0.4819 0.2785 0.2715
TABLE VI: The experimental results on the brain tissue differences comparison (MSSEG dataset for Scenario 1).
Grey Matter Difference (%) White Matter Difference (%)
FedBN 0.2211
Ours 0.1069 0.0325 0.1274 0.3188 0.1470 0.0761 0.1696 0.3771 0.2370
TABLE VII: The experimental results on the brain tissue differences comparison on Scenario 2.

IV-E Different Model Design Strategies

In this section, we present further experiments on the different design selections of the proposed re-weighting mechanism at the local and global levels. These experiments were conducted on both scenarios and the results are shown in Table V.

First, we replace the model’s segmentation confidence in Equation 4 with the entropy map of the whole segmentation predictions (‘Ours-ent’ in Table V), following typical uncertainty learning methods in medical image segmentation [28, 50]. Equation 4 is then re-formulated with this entropy-based term, and each local model in the central aggregation process in Equation 5 is weighted accordingly. Due to the severe imbalance of MS lesions in brain MRI from clinical practice, utilizing entropy maps yields inaccurate representations of the model’s segmentation confidence, and further degrades the FL segmentation performance in Scenario 2. Although the ‘Ours-ent’ method achieves a slight performance gain in Scenario 1, we still select the global-level re-weighting mechanism based on the mask probability as defined in Equation 5, due to the consistent performance gain.
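
The effect of lesion imbalance on an entropy-based confidence can be seen in a small numerical sketch. It is an illustration under assumptions, not the formulation evaluated above: the entropy measure is taken as the mean binary entropy over all voxels, the mask-based measure as the mean predicted probability over annotated lesion voxels, and the toy volume has 990 background and 10 lesion voxels. The function names are hypothetical.

```python
import math
from typing import Sequence


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_entropy(probs: Sequence[float]) -> float:
    """Average entropy over the whole prediction map."""
    return sum(binary_entropy(p) for p in probs) / len(probs)


def lesion_confidence(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Average predicted probability inside the annotated lesion region."""
    lesion = sum(labels)
    return sum(p * g for p, g in zip(probs, labels)) / lesion if lesion else 0.0


labels = [0] * 990 + [1] * 10
confident = [0.01] * 990 + [0.90] * 10  # sure about the lesion voxels
hesitant = [0.01] * 990 + [0.55] * 10   # barely above chance on the lesion voxels

entropy_gap = abs(mean_entropy(confident) - mean_entropy(hesitant))
confidence_gap = (lesion_confidence(confident, labels)
                  - lesion_confidence(hesitant, labels))
```

The two models differ by 0.35 in lesion-region confidence, yet their whole-map mean entropies differ by less than 0.01, because the background voxels dominate the average. An aggregation weight derived from the whole-map entropy would therefore barely separate the two models.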

In addition, we conducted experiments in which lesion volume was employed for local-level re-weighting on the task learning, referred to as the ‘Ours-vol’ method in Table V. Specifically, the volume ratio in Equation 6 is replaced by the total number of lesion voxels. Due to the inaccurate estimation of the true MS lesion distributions in brain MRI patches for model training, the ‘Ours-vol’ method degrades the segmentation accuracy under voxel-level Dice in Scenario 1 and under all metrics in Scenario 2. For both the ‘Ours-ent’ and ‘Ours-vol’ selections, we notice that although they can improve the FedBN baseline in Scenario 1, they incur a severe performance drop in Scenario 2. We think the reasons for this phenomenon are twofold: 1) each client in Scenario 2 has more data than those in Scenario 1; 2) the multi-client MS dataset in Scenario 2 is constructed from various datasets from in-house scanners and public resources, which brings larger differences in the cross-client data distributions.

IV-F Evaluation on Brain Volumetric Analysis

In addition to the lesion segmentation accuracy, we evaluated the impact of FL methods for lesion segmentation and inpainting on brain volumetric analysis, an important application in MS clinical trials and clinical practice. Essentially, the presence of white matter MS lesions, which have an intensity approximating grey matter, leads to tissue misclassifications in brain volumetric analyses on T1 MR images. Lesion inpainting is therefore a routine pre-processing approach to remove the impact of lesions [44] on brain tissue segmentation. However, the lesion inpainting methods are affected by the quality of the lesion segmentation masks. Thus, we compared brain volumetric analyses on T1 images inpainted using different segmentation approaches as an auxiliary analysis to further support the evaluation of the proposed method.

The pipeline in this section is based on our previous work [44]. First, the intensity inhomogeneity in the T1-weighted brain MR images of each subject was corrected using the N3 bias correction method from FreeSurfer [45]. Then, FLAIR and T1 images were co-registered using FSL-FLIRT [19]. We applied the lesion inpainting method to the T1 images separately, using the corresponding MS lesion masks from the ground truth and from each compared method. Finally, FSL-FAST [52] was applied to segment the grey matter (GM) and white matter (WM) from the inpainted brain images.
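The order of the steps above can be sketched as a list of command lines. This is a hedged outline rather than the authors' scripts: the file naming, the registration direction (FLAIR to T1), the rigid `-dof 6` setting, the three-class FAST configuration, and the `inpaint_lesions` command (a hypothetical placeholder for the learned inpainting model of [44]) are all assumptions. The options shown for `mri_nu_correct.mni`, `flirt`, and `fast` should be checked against the installed FreeSurfer and FSL versions.

```python
def build_pipeline(t1, flair, lesion_mask, out_prefix):
    """Return the command lines for one subject, in execution order.
    Nothing is executed here; pass each list to subprocess.run if desired."""
    t1_nu = f"{out_prefix}_t1_nu.nii.gz"
    flair_reg = f"{out_prefix}_flair_in_t1.nii.gz"
    t1_inpainted = f"{out_prefix}_t1_inpainted.nii.gz"
    return [
        # 1) N3 bias field correction of the T1 image (FreeSurfer).
        ["mri_nu_correct.mni", "--i", t1, "--o", t1_nu],
        # 2) Co-registration with FSL-FLIRT (direction and dof are assumptions).
        ["flirt", "-in", flair, "-ref", t1_nu, "-out", flair_reg,
         "-omat", f"{out_prefix}_flair2t1.mat", "-dof", "6"],
        # 3) Lesion inpainting; `inpaint_lesions` is a hypothetical command.
        ["inpaint_lesions", "--t1", t1_nu, "--mask", lesion_mask,
         "--out", t1_inpainted],
        # 4) Tissue segmentation of the inpainted T1 with FSL-FAST.
        ["fast", "-t", "1", "-n", "3", "-o", f"{out_prefix}_fast", t1_inpainted],
    ]

commands = build_pipeline("t1.nii.gz", "flair.nii.gz", "mask.nii.gz", "sub01")
```

Running the same function once per lesion mask (ground truth and each compared method) yields one tissue segmentation per method, which is what the volumetric comparison requires.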

The brain volumetric analysis results for both scenarios are shown in Tables VI and VII, which report the grey matter and white matter percentage differences with respect to the whole brain tissue volume. Although other methods achieve the best WM or GM percentage difference on some specific clients, the proposed method outperformed all of them on average across all clients in both scenarios. These auxiliary experimental results further support that the proposed method is capable of providing robust MS lesion masks that improve the performance of downstream image analysis tasks.
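One way to compute such percentage differences is sketched below. The exact definition used for Tables VI and VII is not given in this section, so the sketch makes several assumptions: tissue labels follow the common three-class FAST convention for T1 images (1 = CSF, 2 = GM, 3 = WM), the whole brain tissue volume is taken as GM plus WM, and the difference is the absolute gap between the tissue percentages obtained with a method's mask and with the ground-truth mask. The function names are illustrative.

```python
GM_LABEL, WM_LABEL = 2, 3  # assumed label convention (1 = CSF)

def tissue_percentages(labels):
    """GM and WM volume as a percentage of brain tissue (GM + WM).
    `labels` is a flattened tissue label map."""
    gm = sum(1 for v in labels if v == GM_LABEL)
    wm = sum(1 for v in labels if v == WM_LABEL)
    brain = gm + wm
    if brain == 0:
        raise ValueError("label map contains no GM or WM voxels")
    return 100.0 * gm / brain, 100.0 * wm / brain

def percentage_difference(method_labels, reference_labels):
    """Absolute GM and WM percentage differences between the segmentation
    of a method-inpainted image and that of the reference-inpainted image."""
    gm_m, wm_m = tissue_percentages(method_labels)
    gm_r, wm_r = tissue_percentages(reference_labels)
    return abs(gm_m - gm_r), abs(wm_m - wm_r)

# Toy example: one voxel flips from GM to WM after imperfect inpainting.
reference = [2, 2, 3, 3]
method = [2, 3, 3, 3]
gm_diff, wm_diff = percentage_difference(method, reference)
```

Because voxel counts are normalised by the tissue volume, the measure does not depend on voxel size as long as both label maps share the same grid.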

V Conclusion

In this work, we propose a novel framework for FL MS lesion segmentation incorporating task-specific re-weighting mechanisms. Due to substantial inter-client variance in MS lesion data compared with typical FL settings, we observed limitations of previously described weighting mechanisms for central aggregation and local training. To this end, we first propose to re-weight the model aggregation process based on the segmentation ability of each local model, which alleviates the negative influence of local models with inferior segmentation ability on the central model. Considering the variance in lesion size distribution amongst clients, we further propose to re-weight the loss function in each local client based on the lesion volume ratio, avoiding model bias due to cross-client label differences. Extensive experiments in two FL MS lesion segmentation scenarios indicated the superiority of our proposed re-weighting mechanisms compared with typical FL methods. In addition, brain volumetric analysis demonstrated the effectiveness of our proposed FL framework in practical research and clinical applications. The demand for privacy-preserving FL in clinical scenarios heightens the imperative to refine existing approaches. FedMSRW is an important methodological advance for analysing heterogeneous multi-client imaging datasets with FL.


  • [1] A. Ackaouy, N. Courty, E. Vallee, O. Commowick, C. Barillot, and F. Galassi (2020) Unsupervised domain adaptation with optimal transport in multi-site segmentation of multiple sclerosis lesions from MRI data. Frontiers in Computational Neuroscience 14. Cited by: §I, §II-A.
  • [2] J. Beaumont, O. Commowick, and C. Barillot (2016) Automatic multiple sclerosis lesion segmentation from intensity-normalized multi-channel MRI. In Proceedings of the 1st MICCAI Challenge on Multiple Sclerosis Lesions Segmentation Challenge Using a Data Management and Processing Infrastructure - MICCAI-MSSEG, Cited by: §II-A.
  • [3] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam (2016) Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE transactions on medical imaging 35 (5), pp. 1229–1239. Cited by: §I, §II-A.
  • [4] L. Catanese, O. Commowick, and C. Barillot (2015) Automatic graph cut segmentation of multiple sclerosis lesions. In ISBI Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, Cited by: §II-A.
  • [5] W. Chang, T. You, S. Seo, S. Kwak, and B. Han (2019) Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7354–7362. Cited by: §III-A.
  • [6] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng (2020) Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE transactions on medical imaging 39 (7), pp. 2494–2505. Cited by: §II-A.
  • [7] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Cited by: §III-D.
  • [8] A. J. Coles, D. Compston, K. W. Selmaj, S. L. Lake, S. Moran, D. H. Margolin, K. Norris, and P. Tandon (2008) Alemtuzumab vs. interferon beta-1a in early multiple sclerosis. N Engl J Med 359 (17), pp. 1786–1801. Cited by: §I.
  • [9] O. Commowick, A. Istace, M. Kain, B. Laurent, F. Leray, M. Simon, S. C. Pop, P. Girard, R. Ameli, J. Ferré, et al. (2018) Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Scientific reports 8 (1), pp. 1–17. Cited by: §IV-A1.
  • [10] A. Danelakis, T. Theoharis, and D. A. Verganelakis (2018) Survey of automated multiple sclerosis lesion segmentation techniques on magnetic resonance imaging. Computerized Medical Imaging and Graphics 70, pp. 83–100. Cited by: §I.
  • [11] B. E. Dewey, C. Zhao, J. C. Reinhold, A. Carass, K. C. Fitzgerald, E. S. Sotirchos, S. Saidha, J. Oh, D. L. Pham, P. A. Calabresi, et al. (2019) DeepHarmony: a deep learning approach to contrast harmonization across scanner changes. Magnetic resonance imaging 64, pp. 160–170. Cited by: §II-A.
  • [12] S. Doyle, F. Forbes, and M. Dojat (2016) Automatic multiple sclerosis lesion segmentation with p-locus. In Proceedings of the 1st MICCAI Challenge on Multiple Sclerosis Lesions Segmentation Challenge Using a Data Management and Processing Infrastructure - MICCAI-MSSEG, pp. 17–21. Cited by: §II-A.
  • [13] C. Elliott, D. L. Arnold, D. L. Collins, and T. Arbel (2013) Temporally consistent probabilistic detection of new multiple sclerosis lesions in brain mri. IEEE transactions on medical imaging 32 (8), pp. 1490–1503. Cited by: §I.
  • [14] M. Ghafoorian, N. Karssemeijer, T. Heskes, M. Bergkamp, J. Wissink, J. Obels, K. Keizer, F. de Leeuw, B. van Ginneken, E. Marchiori, et al. (2017) Deep multi-scale location-aware 3D convolutional neural networks for automated detection of lacunes of presumed vascular origin. NeuroImage: Clinical 14, pp. 391–399. Cited by: §II-A.
  • [15] X. Gong, A. Sharma, S. Karanam, Z. Wu, T. Chen, D. Doermann, and A. Innanje (2021) Ensemble attention distillation for privacy-preserving federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15076–15086. Cited by: §II-B.
  • [16] P. Guo, P. Wang, J. Zhou, S. Jiang, and V. M. Patel (2021) Multi-institutional collaborations for improving deep learning-based magnetic resonance image reconstruction using federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §I, §II-A, §II-B.
  • [17] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189. Cited by: §III-A.
  • [18] F. Isensee, P. F. Jaeger, S. A. A. Kohl, and K. H. Maier-Hein (2021) NnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature: Methods. Cited by: §II-A.
  • [19] M. Jenkinson, P. Bannister, M. Brady, and S. Smith (2002) Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage 17 (2), pp. 825–841. Cited by: §IV-F.
  • [20] K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert, et al. (2017) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pp. 597–609. Cited by: §I, §II-A.
  • [21] J. Knight and A. Khademi (2016) MS lesion segmentation using FLAIR MRI only. Proceedings of the 1st MICCAI Challenge on Multiple Sclerosis Lesions Segmentation Challenge Using a Data Management and Processing Infrastructure-MICCAI-MSSEG, pp. 21–28. Cited by: §II-A.
  • [22] Ž. Lesjak, A. Galimzianova, A. Koren, M. Lukin, F. Pernuš, B. Likar, and Ž. Špiclin (2018) A novel public mr image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16 (1), pp. 51–63. Cited by: §IV-A2, §IV-C.
  • [23] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, pp. 429–450. Cited by: §II-B, §IV-C.
  • [24] X. Li, Y. Gu, N. Dvornek, L. H. Staib, P. Ventola, and J. S. Duncan (2020) Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: abide results. Medical Image Analysis 65, pp. 101765. Cited by: §I, §I, §II-A, §II-B.
  • [25] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou (2021) FedBN: federated learning on non-IID features via local batch normalization. In International Conference on Learning Representations, Cited by: §I, §I, §II-B, §III-A, §IV-C, §IV-D.
  • [26] D. Liu, D. Zhang, Y. Song, F. Zhang, L. O’Donnell, H. Huang, M. Chen, and W. Cai (2020) Pdam: a panoptic-level feature alignment framework for unsupervised domain adaptive instance segmentation in microscopy images. IEEE Transactions on Medical Imaging 40 (1), pp. 154–165. Cited by: §II-A.
  • [27] Q. Liu, C. Chen, J. Qin, Q. Dou, and P. Heng (2021) FedDG: federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1013–1023. Cited by: §I, §II-B.
  • [28] X. Liu, F. Xing, C. Yang, G. El Fakhri, and J. Woo (2021) Adapting off-the-shelf source segmenter for target medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 549–559. Cited by: §IV-E.
  • [29] X. Lladó, A. Oliver, M. Cabezas, J. Freixenet, J. C. Vilanova, A. Quiles, L. Valls, L. Ramió-Torrentà, and À. Rovira (2012) Segmentation of multiple sclerosis lesions in brain mri: a review of automated approaches. Information Sciences 186 (1), pp. 164–185. Cited by: §I.
  • [30] Y. Ma, C. Zhang, M. Cabezas, Y. Song, Z. Tang, D. Liu, W. Cai, M. Barnett, and C. Wang (2022) Multiple sclerosis lesion analysis in brain magnetic resonance images: techniques and clinical applications. IEEE Journal of Biomedical and Health Informatics. Cited by: §I, §II-A, §II-A.
  • [31] R. McKinley, R. Wepfer, L. Grunder, F. Aschwanden, T. Fischer, C. Friedli, R. Muri, C. Rummel, R. Verma, C. Weisstanner, et al. (2020) Automatic detection of lesion load change in multiple sclerosis using convolutional neural networks with segmentation confidence. NeuroImage: Clinical 25, pp. 102104. Cited by: §II-A.
  • [32] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §I, §I, §II-B, §III-B, §IV-C.
  • [33] R. Mehta, T. Christinck, T. Nair, A. Bussy, S. Premasiri, M. Costantino, M. Chakravarty, D. L. Arnold, Y. Gal, and T. Arbel (2021) Propagating uncertainty across cascaded medical imaging tasks for improved deep learning inference. IEEE Transactions on Medical Imaging. Cited by: §II-A.
  • [34] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pp. 565–571. Cited by: §III-A.
  • [35] T. Nair, D. Precup, D. L. Arnold, and T. Arbel (2020) Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical image analysis 59, pp. 101557. Cited by: §II-A.
  • [36] B. Nichyporuk, J. Szeto, D. Arnold, and T. Arbel (2021) Optimizing operating points for high performance lesion detection and segmentation using lesion size reweighting. In Medical Imaging with Deep Learning, Cited by: §I, §III-C.
  • [37] J. A. Palladino, D. F. Slezak, and E. Ferrante (2020) Unsupervised domain adaptation via cyclegan for white matter hyperintensity segmentation in multicenter mr images. In 16th International Symposium on Medical Information Processing and Analysis, Vol. 11583, pp. 1158302. Cited by: §II-A.
  • [38] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. NeurIPS 2017 Autodiff Workshop. Cited by: §III-D.
  • [39] G. Pontillo, S. Tommasin, R. Cuocolo, M. Petracca, N. Petsas, L. Ugga, A. Carotenuto, C. Pozzilli, R. Iodice, R. Lanzillo, M. Quarantelli, V. Brescia Morra, E. Tedeschi, P. Pantano, and S. Cocozza (2021) A combined radiomics and machine learning approach to overcome the clinicoradiologic paradox in multiple sclerosis. American Journal of Neuroradiology In press. Cited by: §I.
  • [40] C. Shen, P. Wang, H. R. Roth, D. Yang, D. Xu, M. Oda, W. Wang, C. Fuh, P. Chen, K. Liu, et al. (2021) Multi-task federated learning for heterogeneous pancreas segmentation. In Clinical Image-Based Procedures, Distributed and Collaborative Learning, Artificial Intelligence for Combating COVID-19 and Secure and Privacy-Preserving Machine Learning, pp. 101–110. Cited by: §I, §II-B, §IV-C, §IV-D.
  • [41] D. Shen, G. Wu, and H. Suk (2017) Deep learning in medical image analysis. Annual review of biomedical engineering 19, pp. 221–248. Cited by: §II-A.
  • [42] H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers (2016) Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §II-A.
  • [43] B. Shirokikh, A. Shevtsov, A. Kurmukov, A. Dalechina, E. Krivov, V. Kostjuchenko, A. Golanov, and M. Belyaev (2020) Universal loss reweighting to balance lesion size inequality in 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 523–532. Cited by: §I, §I, §III-C.
  • [44] Z. Tang, M. Cabezas, D. Liu, M. Barnett, W. Cai, and C. Wang (2021) LG-Net: lesion gate network for multiple sclerosis lesion inpainting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 660–669. Cited by: §IV-F, §IV-F.
  • [45] N. Tustison and J. Gee (2009) N4ITK: nick’s N3 ITK implementation for mri bias field correction. Insight Journal 9. Cited by: §IV-F.
  • [46] S. Valverde, M. Salem, M. Cabezas, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira, J. Salvi, A. Oliver, and X. Lladó (2019) One-shot domain adaptation in multiple sclerosis lesion segmentation using convolutional neural networks. NeuroImage: Clinical 21, pp. 101638. Cited by: §II-A.
  • [47] S. Valverde, M. Cabezas, E. Roura, S. González-Villà, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira, A. Oliver, and X. Lladó (2017) Improving automated multiple sclerosis lesion segmentation with a cascaded 3d convolutional neural network approach. NeuroImage 155, pp. 159–168. Cited by: §II-A.
  • [48] K. Van Leemput, F. Maes, D. Vandermeulen, A. Colchester, and P. Suetens (2001) Automated segmentation of multiple sclerosis lesions by model outlier detection. IEEE transactions on medical imaging 20 (8), pp. 677–688. Cited by: §II-A.
  • [49] G. Xu, C. Liu, J. Liu, Z. Ding, F. Shi, M. Guo, W. Zhao, X. Li, Y. Wei, Y. Gao, et al. (2021) Cross-site severity assessment of covid-19 from ct images via domain adaptation. IEEE Transactions on Medical Imaging 41 (1), pp. 88–102. Cited by: §II-A.
  • [50] L. Yu, S. Wang, X. Li, C. Fu, and P. Heng (2019) Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 605–613. Cited by: §IV-E.
  • [51] C. Zhang, Y. Song, S. Liu, S. Lill, C. Wang, Z. Tang, Y. You, Y. Gao, A. Klistorner, M. Barnett, et al. (2018) MS-GAN: gan-based semantic segmentation of multiple sclerosis lesions in brain magnetic resonance imaging. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pp. 39–46. Cited by: §II-A.
  • [52] Y. Zhang, J. M. Brady, and S. Smith (2000) Hidden markov random field model for segmentation of brain MR image. Medical Imaging 2000: Image Processing 3979, pp. 1126–1137. Cited by: §IV-F.
  • [53] A. P. Zijdenbos, R. Forghani, and A. C. Evans (2002) Automatic “pipeline” analysis of 3-d mri data for clinical trials: application to multiple sclerosis. IEEE transactions on medical imaging 21 (10), pp. 1280–1291. Cited by: §I.