Multi-site fMRI Analysis Using Privacy-preserving Federated Learning and Domain Adaptation: ABIDE Results

01/16/2020 ∙ by Xiaoxiao Li, et al. ∙ Yale University

Deep learning models have shown their advantage in many different tasks, including neuroimage analysis. However, to effectively train a high-quality deep learning model, the aggregation of a significant amount of patient information is required. The time and cost for acquisition and annotation in assembling, for example, large fMRI datasets make it difficult to acquire large numbers at a single site. However, due to the need to protect the privacy of patient data, it is hard to assemble a central database from multiple institutions. Federated learning allows for population-level models to be trained without centralizing entities' data by transmitting the global model to local entities, training the model locally, and then averaging the gradients or weights in the global model. However, some studies suggest that private information can be recovered from the model gradients or weights. In this work, we address the problem of multi-site fMRI classification with a privacy-preserving strategy. To solve the problem, we propose a federated learning approach, where a decentralized iterative optimization algorithm is implemented and shared local model weights are altered by a randomization mechanism. Considering the systemic differences of fMRI distributions from different sites, we further propose two domain adaptation methods in this federated learning formulation. We investigate various practical aspects of federated model optimization and compare federated learning with alternative training strategies. Overall, our results demonstrate that it is promising to utilize multi-site data without data sharing to boost neuroimage analysis performance and find reliable disease-related biomarkers. Our proposed pipeline can be generalized to other privacy-sensitive medical data analysis problems.




I Introduction

Data has “non-rivalrous” value, a term from the economics literature [weimer2017policy], meaning that it can be utilized by multiple parties at a time to create additional data products or services. Pooling data together will have synergistic effects. For example, developing a deep neural network for image recognition tasks requires a large training set of tagged images, such as ImageNet [deng2009imagenet], which consists of publicly available, manually annotated information. However, similar data at scale tend not to be available in healthcare, resulting in a lack of generalizability and accuracy for models and concerns regarding the reproducibility of results. Sharing large amounts of medical data is essential for precision medicine, with one important example being functional MRI (fMRI) data related to certain neurological diseases or disorders. The time and cost for acquisition and annotation in gathering large fMRI datasets make it difficult to recruit large numbers of subjects at a single site. Deep learning models have shown their advantage in fMRI analysis [suk2016state, shen2017deep]. Without assembling data from a number of different locations, the typically limited amount of data available from a single site becomes an obstacle to building an accurate deep learning model for neuroimage analysis.

However, there are many concerns regarding medical data sharing. For example, patients might be concerned about sharing their medical data, due to the risk that it will be shared with employers or used for future health insurance decision-making if their data are stored and accessed by multiple users, even when deidentified [roski2014creating]. There are questions about whether deidentified data are truly anonymous. From a legal point of view, data sharing is regulated by different federal and state laws. The power of regulation might vary due to the content of the data, its identifiability, and the context of its use [rosenbaum2005assessing]. Many governmental agencies have their own privacy and data-sharing policies [policy2003cdc]. In addition, health systems are concerned that competitors will be able to use their data when they compete for customers. Providers worry that if their health statistics are publicly available, they will lose patients or be sanctioned if they cannot assess their performance [heitmueller2014developing].

To tackle the data-sharing problem, federated learning [li2019federated] was introduced to protect privacy by using training data distributed among multiple parties. Instead of transferring data directly to a centralized data warehouse for building machine learning models, in a federated learning setup, each party retains its data and performs decentralized computing. Hence, federated learning addresses privacy concerns and encourages multi-institution collaboration.


Fig. 1: fMRI distribution of different sites

Another problem in utilizing data from different parties is domain shift. Diverse domains of data are common because institutions can have very different methods of data generation and collection. The scanners used in different institutions may be from different manufacturers, may be calibrated differently and may have different acquisition protocols specified. For example, in data from the Autism Brain Imaging Data Exchange (ABIDE I) [di2014autism], the University of Utah School of Medicine (USM) site used a 3T Siemens TrioTim MR scanner, the New York University (NYU) site used a 3T Siemens Allegra MR scanner, while the University of Michigan (UM) site used a 3T GE Signa MR scanner. Also, the instructions given to each subject were different at different sites. The USM site told participants to "Keep your eyes open and remain awake, letting thoughts pass through your mind without focusing on any particular mental activity", while participants at the UM site looked at a fixation cross in the middle of the screen. Participants at the NYU site were asked to look at a white cross-hair against a black background that was projected on a screen, but some participants' eyes were closed during scanning. Figure 1 shows the heterogeneous fMRI data distribution of the NYU and USM sites, even though both sites used scanners from the same manufacturer. One of the challenges of imaging studies of brain disorders is to replicate findings across sites. Federated learning, together with domain adaptation methods, has the potential to extract reliable, robust neural patterns from brain imaging data of patients having different psychiatric disorders.

Our contributions are summarized as follows:

  1. We formulate a new privacy-preserving pipeline for multi-site fMRI analysis and investigate various practical aspects of the federated model’s communication frequency and privacy-preserving mechanisms.

  2. To the best of our knowledge, we investigate domain adaptation in federated learning for medical image analysis for the first time. Domain shift due to heterogeneous data distribution is a challenging issue when utilizing medical images from different institutions.

  3. We propose to evaluate performance based on the biomarkers detected by the model, in addition to direct assessment of accuracy metrics.

Paper structure: In Section II, we summarize related work about federated learning and unsupervised domain adaptation, the two techniques we focus on in this paper. In Section III, we introduce the methods used for our study. Specifically, in Section III-A, we propose the privacy-preserving federated learning setup for multi-site fMRI analysis; in Section III-B, we propose two domain adaptation methods to boost federated learning performance; and in Section III-C, we propose the biomarker detection and evaluation methods. The experiments, results, and evaluation methods are presented in Section IV. We conclude the paper in Section V.

II Related Work

II-A Federated Learning

Generally, federated learning can be achieved by two approaches: 1) training individual models and constructing a meta-model from the sub-models, and 2) using encryption techniques to allow safe communications between different parties [yang2019federated]. In this way, the details of the data are not disclosed between parties. In this paper, we focus on the first approach, which has been studied in [dean2012large, shokri2015privacy, mcmahan2016].

Obtaining sufficient data is a major challenge in the field of medical imaging. Apart from data collection, the labeling of medical image data, which requires expert knowledge, can also be addressed by collaboration between institutions. However, there are many potential legal and technical issues when sharing medical data to a centralized location, especially among international institutions. In the medical imaging field, multi-institutional deep learning without sharing patient data was first investigated in [sheller2018multi]. Later, another work [li2019privacy] empirically studied privacy-preserving issues using a sparse vector technique and investigated model weight sharing schemes for imbalanced data. We note that the randomization mechanism for privacy protection and domain adaptation issues have not been studied in federated learning for medical images. We address these two issues in our study.

II-B Domain Adaptation

Domain Adaptation aims to transfer the knowledge learned from a source domain to a target domain. Then, a model trained over a data set from a source domain is further refined to adapt to a data set from a different target domain. Unsupervised domain adaptation methods have been extensively studied [gholami2018unsupervised, zhao2019multi, hoffman2018algorithms, long2015learning, ganin2014unsupervised, tzeng2017adversarial, zhu2017unpaired, long2018conditional]. However, these efforts cannot meet the requirements of federated settings: data are stored locally and cannot be shared, which hinders adaptive approaches in mainstream domains because they require access to source and target data [tzeng2014deep, long2017deep, ghifary2016deep, sun2016deep, ganin2014unsupervised, tzeng2017adversarial]. Federated domain adaptation has been recently proposed [peng2019federated, peterson2019private]. In our study, we investigate adopting those two federated domain adaptation methods in our multi-source and multi-target federated learning domain adaptation problem.

III Methods

III-A Basic privacy-preserving federated learning setup

In this section, we formulate multi-site fMRI analysis without data sharing in a federated learning framework. Then we introduce the randomized mechanism for privacy protection. Finally, we show the details of training such a privacy-preserving federated learning network step by step.

III-A1 Problem definition


Fig. 2: The simplified example of privacy-preserving federated learning strategy for fMRI analysis.

Let matrix $D_i$ denote the data held by data owner site $i$. Define $N$ sites $\{\mathcal{F}_1, \dots, \mathcal{F}_N\}$, all of whom wish to train a deep learning model by consolidating their respective data $\{D_1, \dots, D_N\}$. For medical imaging problems, the data size at each site is usually too limited to train a good deep learning model. A conventional method is to put all data together and use $D = D_1 \cup \dots \cup D_N$ to train a model $\mathcal{M}_{\mathrm{mix}}$. At the same time, some data sets may also contain label data. We denote the feature space as $\mathcal{X}$, the label space as $\mathcal{Y}$, and the sample ID space as $\mathcal{I}$. The feature $\mathcal{X}$, label $\mathcal{Y}$ and sample IDs $\mathcal{I}$ constitute the complete training dataset $(\mathcal{I}, \mathcal{X}, \mathcal{Y})$. In our multi-site fMRI classification scenario, $D$ is fMRI data and $\mathcal{F}$ is the institution owning private fMRI data; $\mathcal{X}$ is the extracted fMRI feature and label $\mathcal{Y}$ can be the diagnosis or phenotype we want to predict. In this setting, data sets share the same feature space but differ in samples. For example, different sites have different subjects, but the features are all fMRI signals extracted from the same preprocessing pipeline. Therefore, we can summarize the data distribution as:

$$\mathcal{X}_i = \mathcal{X}_j, \quad \mathcal{Y}_i = \mathcal{Y}_j, \quad \mathcal{I}_i \neq \mathcal{I}_j, \quad \forall D_i, D_j,\ i \neq j,$$

which belongs to the horizontal federated learning category where different data sets have large overlap on features while they have small overlap on samples [yang2019federated].

In this scenario, due to regulation and other issues, each medical institution will not share data with the other parties. A federated learning system is a learning process where the data owners collaboratively train a model $\mathcal{M}_{\mathrm{fed}}$, in which no data owner exposes its data to others. In our problem setting, assume there is a central server for computing (not for data storage). All the different medical institutions (sites) use the same deep learning architecture for the same task. Each institution trains the deep learning model in-house and uploads the model weight information to a central server at a particular frequency during training. The shared weights are perturbed by additive random noise to protect the data from leakage through model inversion. Once the central server receives all the weights, it aggregates them and sends the updated weights to each institution. The simplified pipeline is depicted in Figure 2.

III-A2 Privacy-preserving decentralized training

The simplified federated learning framework is depicted in Figure 2, which contains two key steps in decentralized optimization: 1) local update, and 2) communicating to a global server. The detailed training procedure is presented in Algorithm 1, where the objective function is cross-entropy loss:

$$\ell_{ce} = -\left[ y \log p + (1 - y) \log (1 - p) \right], \tag{2}$$

where $y \in \{0, 1\}$ is the label and $p$ is the model output, which estimates the probability of the positive label ($y = 1$) given an input.

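Input: 1. $X = \{X_1, \dots, X_N\}$, fMRI data from $N$ institutions/sites; 2. $f = \{f_{w_1}, \dots, f_{w_N}\}$, local models within $N$ sites, where $w_i$ is local model weights; 3. $Y = \{Y_1, \dots, Y_N\}$, fMRI labels; 4. $\xi$, noise generator; 5. $K$, number of optimization iterations; 6. $\tau$, global model updating pace; 7. $\mathcal{M}(\cdot)$, privacy-preserving mechanism (explained in the following section); 8. $\mathrm{Opt}(\cdot)$, optimizer returning updated model weights w.r.t. objective function $\ell$.

$\{w_1, \dots, w_N\} \leftarrow$ initialize local model
$t \leftarrow 0$ ▷ initialize pace counter
for $k = 1$ to $K$ do
    for $i = 1$ to $N$ do
        $w_i \leftarrow \mathrm{Opt}(\ell(f_{w_i}(X_i), Y_i))$ ▷ local update
    end for
    $t \leftarrow t + 1$
    if $t \bmod \tau = 0$ then ▷ models communicate
        $w \leftarrow \frac{1}{N} \sum_{i=1}^{N} \mathcal{M}(w_i)$ ▷ update global model per $\tau$ steps
        for $i = 1$ to $N$ do
            $w_i \leftarrow w$ ▷ deploy weights to local model
        end for
    end if
end for

Return: global model $f_w$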

Algorithm 1 Privacy-preserving federated learning for multi-site fMRI analysis
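The two key steps of Algorithm 1 (local update, then periodic noisy averaging) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the logistic-regression model, the synthetic two-site data, and the names `local_step`, `federated_train`, `pace` and `noise_std` are all hypothetical stand-ins for the MLP, the fMRI features, the updating pace and the randomized mechanism.

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of logistic regression on one site's private data."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)
    return w - lr * grad

def federated_train(sites, dim, steps=100, pace=5, noise_std=0.01, seed=0):
    """sites: list of (X, y) pairs, one per institution (hypothetical toy data)."""
    rng = np.random.default_rng(seed)
    local_w = [np.zeros(dim) for _ in sites]
    global_w = np.zeros(dim)
    for t in range(1, steps + 1):
        # 1) local update: each site trains only on its own data
        local_w = [local_step(w, X, y) for w, (X, y) in zip(local_w, sites)]
        # 2) every `pace` steps: share noise-perturbed weights and average them
        if t % pace == 0:
            noisy = [w + rng.normal(0.0, noise_std, size=dim) for w in local_w]
            global_w = np.mean(noisy, axis=0)
            # deploy the averaged weights back to every site
            local_w = [global_w.copy() for _ in sites]
    return global_w

# Toy demo: two sites whose features follow slightly shifted distributions.
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])
sites = []
for shift in (0.0, 0.3):
    X = rng.normal(shift, 1.0, size=(200, 3))
    sites.append((X, (X @ true_w > 0).astype(float)))

w = federated_train(sites, dim=3)
acc = np.mean([np.mean((X @ w > 0) == (y > 0.5)) for X, y in sites])
```

Only the perturbed weight vectors cross site boundaries in this sketch; the raw `(X, y)` pairs are read solely inside `local_step`.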

III-A3 Randomized mechanism for privacy protection

Differential privacy [dwork2014algorithmic, dwork2006calibrating] is a popular approach to privacy-preserving machine learning [shokri2015privacy] and establishes a strong standard for privacy guarantees for algorithms on aggregated databases. Informally, differential privacy provides a bound, $\epsilon$, guaranteeing that an attacker can learn virtually nothing more about an individual than they would learn if that individual were absent from the dataset, as the individual's sensitive information is almost irrelevant to the outputs of the model. The bound $\epsilon$ represents the degree of privacy preference that can be controlled by each party. Much research has tried to protect differential privacy at the data level when a model is learned in a centralized manner [shokri2015privacy, abadi2016deep]. To protect the data from inversion attacks, such as inferring data from model weights, a differential privacy-preserving randomized mechanism can be incorporated into the learning process. Given a deterministic real-valued function $h$, $h$'s sensitivity $S_h$ is defined as the maximum of the absolute distance $|h(D) - h(D')|$, where $\|D - D'\|_1 = 1$, meaning that there is only one data point difference between $D$ and $D'$ [dwork2014algorithmic] (Definition 3.1). In our case $h$ computes the weight parameters in the deep learning model. Introducing “noise” in the training process (inputs, parameters, or outputs) through a randomized mechanism $\mathcal{M}$ can limit the granularity of information shared and ensure $\epsilon$-differential privacy [dwork2006calibrating] (Definition 1), relaxed to $(\epsilon, \delta)$-differential privacy [dwork2006our]: for any set of outputs $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta,$$

where the additional additive term $\delta$ is the probability of $\epsilon$-differential privacy being broken. Here, we introduce two approaches: 1) Gaussian mechanism, and 2) Laplace mechanism, which can enjoy good privacy guarantees [chaudhuri2019capacity] by adding noise to the shared weights.

Gaussian Mechanism

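The Gaussian mechanism adds $\mathcal{N}(0, S_h^2 \sigma^2)$ noise with mean 0 and standard deviation $S_h \sigma$ to a function $h$ with global sensitivity $S_h$. The perturbed output will satisfy $(\epsilon, \delta)$-differential privacy if $\sigma \ge \sqrt{2 \ln(1.25/\delta)}/\epsilon$ and $\epsilon \in (0, 1)$ [dwork2014algorithmic] (Theorem 3.22). Hereby, we linked the Gaussian noise parameter $\sigma$ to the privacy parameters $\epsilon$ and $\delta$.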

Laplace Mechanism

The Laplace distribution centered at 0 with scale $b$ is the distribution with probability density function:

$$\mathrm{Lap}(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right),$$

and the variance of the Laplace distribution is $2b^2$. The Laplace mechanism adds $\mathrm{Lap}(S_h/\epsilon)$ noise to a function $h$ with global sensitivity $S_h$ and preserves $\epsilon$-differential privacy. Hereby, we linked the Laplace noise parameter $b$ to the privacy parameter $\epsilon$.

In our case, the mapping function $h$ is a deep learning model and it is not tractable to compute the sensitivity $S_h$. For simplicity of discussion, the sensitivity is assumed to be 1. From the mechanisms described above, we can control the noise parameters to meet a given privacy requirement, as the noise parameters are linked to the privacy parameters as shown above.
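As a rough illustration of how the noise parameters follow from the privacy parameters, the sketch below calibrates both mechanisms for a weight vector. It assumes the condition of Theorem 3.22 in [dwork2014algorithmic] for the Gaussian case and a sensitivity of 1, as in the text; the function names are hypothetical and chosen for this example only.

```python
import numpy as np

def gaussian_sigma(epsilon, delta):
    """Noise multiplier sigma >= sqrt(2 ln(1.25/delta)) / epsilon, for 0 < epsilon < 1."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    return np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def gaussian_mechanism(weights, epsilon, delta, sensitivity=1.0, rng=None):
    """Add N(0, (sensitivity * sigma)^2) noise to every shared weight."""
    rng = np.random.default_rng() if rng is None else rng
    std = sensitivity * gaussian_sigma(epsilon, delta)
    return weights + rng.normal(0.0, std, size=weights.shape)

def laplace_mechanism(weights, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale b = sensitivity / epsilon to every shared weight."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return weights + rng.laplace(0.0, scale, size=weights.shape)

# Example: epsilon = 0.5 gives Laplace scale b = 2, hence noise variance 2 * b^2 = 8.
demo_rng = np.random.default_rng(0)
zeros = np.zeros(200_000)
laplace_noise = laplace_mechanism(zeros, epsilon=0.5, rng=demo_rng)
gaussian_noise = gaussian_mechanism(zeros, epsilon=0.5, delta=1e-5, rng=demo_rng)
```

Smaller values of `epsilon` (stronger privacy) translate directly into larger noise scales in both functions, which is the trade-off explored in the experiments.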

III-B Boosting multi-site learning with domain adaptation


(a) MoE strategy.


(b) Adversarial alignment strategy
Fig. 3: Domain adaptation strategies for our proposed federated learning setup.

Although federated learning is promising for better privacy and efficiency, there is the additional issue that the data at each site likely have different distributions, leading to domain shift between the sites [quionero2009dataset]. The main hypothesis here is that domain adaptation techniques can improve accuracy in a federated learning setting, and that holds even when noise is added for privacy-preserving. In this subsection, we investigate two domain adaptation methods: 1) Mixture of Experts (MoE), adaptation near the output layer, and 2) Adversarial domain alignment, adaptation on the data knowledge representation level.

III-B1 Mixture of Experts (MoE) domain adaptation

Mixture of Experts (MoE) [masoudnia2014mixture, shazeer2017outrageously, wang2018deep] is an approach to conditionally combine experts to process each input. In deep learning, experts are deep learning models. An MoE layer for feed-forward neural networks is a trainable gating network that dynamically assigns gated weights to combine multiple networks. Then, all parts of the big model, which contains all expert models and the MoE layer, are trained jointly by back-propagation.

Mixing the outputs of a collaboratively-learned general model and a domain expert was proposed for domain adaptation [peterson2019private]. Each participating party has an independent set of labeled training examples that they wish to keep private, drawn from a party-specific domain distribution. These users collaborate to build a general model for the task but maintain private, domain-adapted expert models. The final predictor is a weighted average of the outputs from the general and private models. These weights are learned using a MoE architecture [masoudnia2014mixture], so the entire model can be trained with gradient descent. More specifically, given an input data $x$, the output of the global model $f_w$ is $\hat{y}_G = f_w(x)$. In the binary classification setting, the output is the model’s predicted probability for the positive class. As shown in Figure 3(a), we train another local model $f_{u_i}$ at site $i$ in the meantime, which is defined as a private model. The private model can have a different architecture from $f_w$ and it does not communicate with the global model. The output of the private model is $\hat{y}_P = f_{u_i}(x)$. $f_{u_i}$ is trained using the regular deep learning setting, without including privacy-related noise.

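The final output that entity $i$ uses to label data is

$$\hat{y}_i = \alpha_i(x)\, \hat{y}_G + (1 - \alpha_i(x))\, \hat{y}_P.$$

The weight $\alpha_i(x)$ is called a gating function in the MoE, and we use a non-linear layer to compute $\alpha_i(x) = \sigma(a_i^{\top} x + b_i)$, where $\sigma$ is the sigmoid function, and $a_i$ and $b_i$ are weights learned by end-to-end training together with the federated learning architecture.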
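The gated combination can be illustrated with a small NumPy sketch. It assumes a single linear gating layer followed by a sigmoid; the names `moe_output`, `gate_w` and `gate_b` are hypothetical, and the global and private predictions are passed in as plain arrays rather than computed by real networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_output(x, y_global, y_private, gate_w, gate_b):
    """Blend global and private predictions with an input-dependent gate.

    x: (n_samples, n_features); y_global, y_private: (n_samples,) probabilities.
    """
    alpha = sigmoid(x @ gate_w + gate_b)  # one gate value in (0, 1) per sample
    return alpha * y_global + (1.0 - alpha) * y_private, alpha

# With a zero gate the output is the plain average of both experts.
x = np.zeros((3, 5))
y_g = np.array([0.9, 0.2, 0.6])
y_p = np.array([0.1, 0.8, 0.4])
blended, alpha = moe_output(x, y_g, y_p, gate_w=np.zeros(5), gate_b=0.0)
# A strongly positive bias pushes the gate towards the global model.
mostly_global, _ = moe_output(x, y_g, y_p, gate_w=np.zeros(5), gate_b=50.0)
```

Because the gate depends on the input, a site can lean on its private model for samples that look site-specific and on the global model elsewhere.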

III-B2 Adversarial domain alignment

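Input: 1. $X = \{X_1, \dots, X_N\}$, fMRI data from $N$ institutions/sites; 2. $G = \{G_{\theta_1}, \dots, G_{\theta_N}\}$, local feature generators within $N$ sites, where $\theta_i$ is the generator’s parameters of site $i$; 3. $C = \{C_{\phi_1}, \dots, C_{\phi_N}\}$, local classifiers within $N$ sites, where $\phi_i$ is the classifier’s parameters of site $i$; 4. $\mathcal{D} = \{\mathcal{D}_{\psi_1}, \dots, \mathcal{D}_{\psi_N}\}$, discriminators from embedded features, where $\psi_i$ is the discriminator parameters that identify the data from site $i$; 5. $Y = \{Y_1, \dots, Y_N\}$, fMRI labels (HC or ASD); 6. $\mathcal{M}(\cdot)$, noise generator; 7. $K$, number of optimization iterations; 8. $\tau$, global model updating pace; 9. $f_w$, global model.

Initialize parameters $\{\theta_i, \phi_i, \psi_i\}_{i=1}^{N}$
$t \leftarrow 0$ ▷ initialize pace counter
for $k = 1$ to $K$ do
    for $i = 1$ to $N$ do
        Sample mini-batch from source site $X_i$ and target site $X_j$, $j \neq i$
        Compute gradient with cross-entropy classification loss (Eq. 2) to update $G_{\theta_i}$ and $C_{\phi_i}$
        Domain Alignment:
        Update $\mathcal{D}_{\psi_i}$ and $G_{\theta_i}$ with Eq. 7 and Eq. 8 respectively to align the domain distribution
    end for
    $t \leftarrow t + 1$
    if $t \bmod \tau = 0$ then ▷ models communicate
        $w \leftarrow \frac{1}{N} \sum_{i=1}^{N} \mathcal{M}(w_i)$, with $w_i = (\theta_i, \phi_i)$ ▷ update global model per $\tau$ steps
        for $i = 1$ to $N$ do
            $w_i \leftarrow w$ ▷ deploy weights to local model
        end for
    end if
end for

Return: global model $f_w$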

Algorithm 2 Federated Adversarial Domain Alignment

In the federated setting, the data are locally stored in a privacy-preserving manner. For the domain adaptation problem, we have multiple source domains and want to generalize the domains into a common space of target data. Due to the data sharing limitation of federated learning, we cannot train a single model that has access to the source domain and target domain simultaneously. To address this issue, we employed federated adversarial alignment [peng2019federated], which introduces two modules (a domain-specific local feature extractor, and a global discriminator) in the classification networks and divides optimization into two independent steps. Using this method (Figure 3(b)), for source site $s$, we train a local feature extractor, $G_s$. For the target site $t$, we train a local feature generator $G_t$. For each source-target domain pair, we train an adversarial domain discriminator $\mathcal{D}_s$ to align the distributions. First, the domain discriminator $\mathcal{D}_s$ is trained to identify which domain the features come from, then the feature generators are trained to confuse the discriminator $\mathcal{D}_s$. In this setting, the discriminator only gets access to the noise-covered output features of $G_s$ and $G_t$, without accessing the original data. Given the source domain data $X^s$ and target data $X^t$, the objective for discriminating the source domain from the others is defined as:

$$\mathcal{L}_{adv\mathcal{D}_s}(X^s, X^t, G_s, G_t) = -\mathbb{E}_{x^s \sim X^s}\left[\log \mathcal{D}_s(G_s(x^s))\right] - \mathbb{E}_{x^t \sim X^t}\left[\log\left(1 - \mathcal{D}_s(G_t(x^t))\right)\right]. \tag{7}$$

In the second step, $\mathcal{L}_{adv\mathcal{D}_s}$ remains unchanged, but $\mathcal{L}_{advG}$ is updated with the following objective:

$$\mathcal{L}_{advG}(X^s, X^t, \mathcal{D}_s) = -\mathbb{E}_{x^s \sim X^s}\left[\log \mathcal{D}_s(G_s(x^s))\right] - \mathbb{E}_{x^t \sim X^t}\left[\log \mathcal{D}_s(G_t(x^t))\right]. \tag{8}$$

By end-to-end training of the federated learning model with the alignment module, we can minimize the discrepancy between the source and target domains. The implementation details are described in Algorithm 2.
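To make the two-step alignment concrete, the sketch below evaluates a discriminator objective and a generator objective on discriminator outputs. It assumes the loss forms used in federated adversarial alignment [peng2019federated]; the function names are hypothetical, and the inputs are plain arrays of discriminator probabilities rather than outputs of trained networks.

```python
import numpy as np

def discriminator_loss(d_src, d_tgt, eps=1e-12):
    """Step 1: reward the discriminator for scoring source features high
    and target features low. Inputs are probabilities in (0, 1)."""
    return -np.mean(np.log(d_src + eps)) - np.mean(np.log(1.0 - d_tgt + eps))

def generator_loss(d_src, d_tgt, eps=1e-12):
    """Step 2: reward the generators when target features are also scored
    as 'source', i.e. when the discriminator is confused."""
    return -np.mean(np.log(d_src + eps)) - np.mean(np.log(d_tgt + eps))

# A confused discriminator outputs 0.5 everywhere; a sharp one separates the domains.
confused = (np.full(8, 0.5), np.full(8, 0.5))
sharp = (np.full(8, 0.99), np.full(8, 0.01))

d_confused, d_sharp = discriminator_loss(*confused), discriminator_loss(*sharp)
g_confused, g_sharp = generator_loss(*confused), generator_loss(*sharp)
```

The discriminator loss is lowest when the domains are easy to tell apart, whereas the generator loss is lowest when target features pass as source features, which is what drives the two feature distributions together.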

III-C Evaluate model by interpreting biomarkers

The primary goal of psychiatric neuroimaging research is to identify objective and repeatable biomarkers that may inform the disease [heinsfeld2018identification]. Finding the biomarkers associated with ASD is extremely helpful in understanding the underlying roots of the disorder and can lead to earlier diagnosis and more targeted treatment. Alteration in brain functional connectivity is expected to provide potential biomarkers for classifying or predicting brain disorders [du2018classification]. Deep learning methods are promising tools for investigating the reliability of patterns of brain function across large and heterogeneous data sets [varoquaux2014machine].

We hypothesized that reliable biomarkers could be detected from a reliable model. The guided gradient-based explanation method [simonyan2013deep, springenberg2014striving] is perhaps the most straightforward approach for interpreting the importance of input features, and its advantage is that it is easy to compute. By calculating the derivative of the output w.r.t. the model input and then applying a norm, a score can be obtained. The gradient-based score can be used to indicate the relative importance of an input feature, since it represents the change in input space that corresponds to the maximal positive rate of change in the model output:

$$g_k = \left| \frac{\partial y^c}{\partial x_k} \right|,$$

where $c$ is the correct class of the input, $y^c$ is the score for class $c$ before the softmax layer, and $x_k$ is the $k$th feature of the input. $g_k$ can indicate the importance of feature $k$ for classifying an input as class $c$. We use this method to interpret the important features (ROIs) as biomarkers.
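For a one-hidden-layer MLP, the gradient-based score can be written in closed form, as in the sketch below. It assumes a ReLU hidden layer and takes the absolute value of the derivative of the pre-softmax class score with respect to each input feature; the function names and the tiny random network are hypothetical.

```python
import numpy as np

def mlp_scores(x, W1, b1, W2, b2):
    """Pre-softmax class scores of a one-hidden-layer ReLU MLP."""
    h = np.maximum(0.0, W1 @ x + b1)
    return W2 @ h + b2

def saliency(x, c, W1, b1, W2, b2):
    """|d score_c / d x_k| for every input feature k."""
    active = (W1 @ x + b1) > 0          # ReLU units that pass gradient
    grad = (W2[c] * active) @ W1        # chain rule through the hidden layer
    return np.abs(grad)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=6)

scores = saliency(x, 1, W1, b1, W2, b2)
ranking = np.argsort(scores)[::-1]      # most important feature first

# Central finite differences as an independent check of the analytic gradient.
h = 1e-6
numeric = np.array([
    (mlp_scores(x + h * e, W1, b1, W2, b2)[1]
     - mlp_scores(x - h * e, W1, b1, W2, b2)[1]) / (2 * h)
    for e in np.eye(6)
])
```

In the connectivity setting each input feature is one ROI pair, so a ranking like `ranking` would still need to be mapped back to ROIs before being read as biomarkers.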

Given the important biomarkers, first, we propose to examine their consistency, i.e., whether the biomarkers are replicable across different datasets. Second, we should examine whether the biomarkers are meaningful. For the relatively important features selected, such as the features with the top-ranked importance scores, we can "decode" them to associated functional keywords based on prior knowledge and compute the correlation score $s_a^c$ for the keyword $a$ with the biomarkers in class $c$. The informative biomarkers of the inputs in the different classes should have different functional representations, which means we expect large $|s_a^{c_1} - s_a^{c_2}|$ for the informative biomarkers, where $c_1 \neq c_2$. The larger the difference, the more representative and informative the biomarkers.

IV Experiments and Results

IV-A Data

IV-A1 Participants

The study was carried out using resting-state fMRI (rs-fMRI) data from the Autism Brain Imaging Data Exchange dataset (ABIDE I preprocessed, [di2014autism]). ABIDE is a consortium that provides previously collected rs-fMRI data from ASD subjects and matched controls for the purpose of data sharing in the scientific community. However, in reality, collecting data in a consortium like ABIDE is not easy, as strict agreements need to be reached by different parties. Therefore, although the data were shared in ABIDE, we studied the multi-site data from the federated learning perspective. To ensure the deep learning model could be trained on a single site, we downloaded the Regions of Interest (ROI) fMRI series of the four largest sites (UM1, NYU, USM, UCLA1) from the preprocessed ABIDE dataset, processed with the Configurable Pipeline for the Analysis of Connectomes (CPAC), band-pass filtering (0.01 - 0.1 Hz), no global signal regression, and parcellation by the Harvard-Oxford (HO) atlas. Skipping subjects lacking filename, we downloaded 106, 175, 72, 71 subjects from UM1, NYU, USM, UCLA1, respectively. HO parcellated each brain into 111 ROIs. Since some subjects did not contain complete ROIs, we removed the incomplete data, resulting in 88, 167, 52, 63 subjects for UM1, NYU, USM, UCLA1, respectively. Due to a lack of sufficient data, we used sliding windows (with window size 32 and stride 1) to truncate raw time sequences of fMRI. After removing incomplete subjects, the compositions of the four sites are shown in Table I. We denoted UM for UM1 and UCLA for UCLA1. We summarized the phenotype information of the subjects under our study in Table II.

                    NYU    UM     USM    UCLA
Total Subject       167    88     52     63
ASD Subject         73     43     33     37
HC Subject          94     45     19     26
ASD Percentage      44%    49%    63%    59%
fMRI Frames         176    296    236    116
Overlapping Trunc   145    265    205    85
TABLE I: Data summary of the dataset used in our study
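The "Overlapping Trunc" row follows from the window size 32 and stride 1 given in the text: a series of F frames yields F - 32 + 1 windows. A minimal sketch, assuming the time series is stored as a (frames, ROIs) array and using the hypothetical helper name `sliding_windows`:

```python
import numpy as np

def sliding_windows(ts, window=32, stride=1):
    """Cut a (frames, rois) time series into overlapping windows.

    Returns an array of shape (n_windows, window, rois).
    """
    n = (ts.shape[0] - window) // stride + 1
    return np.stack([ts[i * stride : i * stride + window] for i in range(n)])

# Frame counts per site from Table I; 111 ROIs under the HO atlas.
frames = {"NYU": 176, "UM": 296, "USM": 236, "UCLA": 116}
counts = {site: sliding_windows(np.zeros((f, 111))).shape[0] for site, f in frames.items()}
```

Each window becomes one training instance, so a single subject contributes tens to hundreds of instances depending on the site.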

Group  Site  Age         ADOS        IQ            Sex
ASD    UM    12.4 (2.2)  -           102.8 (18.8)  M 36, F 7
ASD    USM   22.9 (7.3)  12.6 (3.0)  99.8 (16.4)   M 33, F 0
ASD    NYU   14.7 (7.1)  11.5 (4.1)  107.4 (16.5)  M 65, F 8
ASD    UCLA  13.0 (2.7)  10.4 (3.6)  103.5 (13.5)  M 31, F 6
HC     UM    14.1 (3.4)  -           106.7 (9.6)   M 32, F 13
HC     USM   20.8 (8.2)  -           117.1 (14.4)  M 19, F 0
HC     NYU   15.2 (5.9)  -           112.6 (13.5)  M 69, F 25
HC     UCLA  13.4 (2.3)  -           104.9 (10.4)  M 22, F 4
Values are reported as mean (std). M: Male, F: Female. ADOS score: "-" means information not available.

TABLE II: Data phenotype summary.

IV-A2 Data preprocessing

The task we performed on the ABIDE datasets was to classify subjects as autism spectrum disorder (ASD) or healthy control (HC). We used the mean time sequences of ROIs to compute the correlation matrix as functional connectivity. The functional connectivity provided an index of the level of co-activation of brain regions based on the time series of rs-fMRI brain imaging data. Each element of the correlation matrix was calculated using the Pearson correlation coefficient, which ranged from -1 to 1: values close to 1 indicated that the time series were highly correlated and values close to -1 indicated that the time series were anti-correlated. Then, we applied the Fisher transformation on the correlation matrices to emphasize the strong correlations. As the correlation matrices were symmetric, we only kept the upper triangle of the matrices and flattened the triangle values to vectors, with the purpose of using them as the inputs of multilayer perceptron (MLP) classifiers. The number of resultant features was defined by $R(R-1)/2$, where $R$ was the number of ROIs. Under the HO atlas (111 ROIs), the procedure resulted in 6105 features.
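The feature construction described above (Pearson correlation, Fisher transformation, upper triangle) can be sketched as follows. This is an illustrative sketch only: the helper name `connectivity_features` is hypothetical, the input is random data standing in for one 32-frame window, and the clipping of correlations before the Fisher transformation is an added assumption to keep values finite.

```python
import numpy as np

def connectivity_features(ts):
    """ts: (frames, rois) window -> flattened Fisher-z upper-triangle vector."""
    corr = np.corrcoef(ts, rowvar=False)            # (rois, rois) Pearson matrix
    iu = np.triu_indices(corr.shape[0], k=1)        # strictly upper triangle
    r = np.clip(corr[iu], -0.999999, 0.999999)      # assumption: avoid infinite z-values
    return np.arctanh(r)                            # Fisher transformation

rng = np.random.default_rng(0)
window = rng.normal(size=(32, 111))                 # one window, 111 HO ROIs
features = connectivity_features(window)
```

With 111 ROIs the vector has 111 x 110 / 2 = 6105 entries, matching the input size of the MLP.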

IV-B Federated training setup and hyper-parameters discussion

A multi-layer perceptron (MLP) 6105-16-2 (corresponding to 6105 nodes for the input (first) layer, 16 nodes for the hidden layer, and 2 nodes for the output layer) was used for classification. The outputs of the MLPs were the probabilities of the given input being classified as each class. We used cross-entropy as the objective function. We performed 5-fold cross-validation (subject-wise splitting), and each entry of the input vectors was normalized by the training set mean and standard deviation (std) within each site. As we performed overlapping truncation for data augmentation in data processing, we used the majority voting method to evaluate the final classification performance. For example, we augmented $n$ input instances for a single subject, and if more than $n/2$ instances were classified as ASD, then we assigned the 'ASD' label to the subject. Adam optimization was applied with an initial learning rate of 1e-5, which was reduced by 1/2 every 20 epochs, and training stopped at the 50th epoch. In each epoch, we performed local updates multiple times instead of once based on communication pace $\tau$. We set the total steps of each epoch as 60, and the batch size of each site was the number of training data over 60.
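The subject-level majority vote over window-level predictions can be sketched as below. The helper name `subject_label` and the 0/1 encoding (1 for ASD) are assumptions made for this example; a tie is resolved as HC because the text requires more than half of the instances to be classified as ASD.

```python
import numpy as np

def subject_label(window_preds):
    """window_preds: 0/1 predictions (1 = ASD) for all windows of one subject."""
    window_preds = np.asarray(window_preds)
    return "ASD" if window_preds.sum() > len(window_preds) / 2 else "HC"

label_a = subject_label([1, 1, 0])      # 2 of 3 windows say ASD
label_b = subject_label([1, 0, 0])      # 1 of 3 windows says ASD
label_tie = subject_label([1, 0])       # exactly half, not "more than" half
```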

First, we investigated the effects of changing communication pace $\tau$ on classification accuracy, as communication between models would be costly. To select the best communication pace $\tau$, we did not apply any noise on the shared weights in the experiment. As the results in Figure 4 show, there was no significant difference between the accuracies when $\tau$ varied from 5 to 30.


Fig. 4: Investigate communication pace vs accuracy

Then, we investigated adding the randomization mechanism to shared weights to protect the data from inversion attacks, such as inferring data from local model weights. Here we tested the Gaussian and Laplace mechanisms, which corresponded to L2 and L1 sensitivity. Institutions may want to specify the level of privacy they want to preserve, which would be reflected in the noise levels. For the Gaussian mechanism experiment, we generated Gaussian noise $\mathcal{N}(0, \alpha^2 s^2)$ and added it to local model weights, where $s$ is the standard deviation of the local model weights and $\alpha$ is the noise level. We varied $\alpha$ from 0.001 to 1. For the Laplace mechanism experiment, we generated Laplace noise $\mathrm{Lap}(\alpha s)$ and added it to local model weights, where $\alpha s$ was the scale parameter, and $s$ was the standard deviation of the local model weights. We varied $\alpha$ from 0.001 to 1. As the results in Figure 5 and Figure 6 show, there was a trade-off between model performance and noise level (privacy-preserving level). When the noise level was too high, corresponding to high privacy-preserving levels, the models failed in the classification task.


Fig. 5: Investigate Gaussian mechanism vs accuracy


Fig. 6: Investigate Laplace mechanism vs accuracy

IV-C Comparisons with different strategies


Fig. 7: Different classification strategies

To demonstrate that the proposed federated learning framework in Algorithm 1 (Fed) could improve multi-site fMRI classification, we compared the proposed methods with four alternative, non-federated strategies: 1) training and testing within a single site (Single); 2) training using one site and testing on another site (Cross); 3) collecting multi-site data together for training (Mix); and 4) creating an ensemble model using the models from different sites (Ensemble). The Ensemble method combines the Single model trained within the site and the Cross models trained by other sites by averaging the outputs. Single and Cross preserve data privacy but cannot incorporate data from multiple sites. Mix can make use of all the data from different sites but cannot preserve data privacy. Mix was expected to perform better than Fed, as it used more data information. Due to the data size limitation in training deep learning models in Single, Cross and Ensemble, we changed the MLP architecture to 6105-8-2, while all the other training settings and data splitting settings were the same as described in Section IV-B.

Considering that the data distribution was heterogeneous across sites, we also used the domain adaptation methods introduced in Section III-B to boost the classification performance of Fed. For the combination of federated training and MoE (Fed-MoE), we trained a private classifier simultaneously with the federated architecture. As in Single, we used 6105-8-2 MLPs as the private models. The gate function was implemented by an MLP with two fully-connected (FC) layers, 6105-8-1, and a sigmoid non-linearity layer. For the combination of federated training and adversarial alignment (Fed-Align), we used four discriminators to discriminate whether the data came from the source domain. We treated the first two layers of the federated MLPs, 6105-16, as a feature generator G, with a different generator at each site. The input of the classifier C was a 16-dim vector. The global model was the concatenation of G and C, and only the G and C weights of the local models were shared with the global model. For the whole network training, the setup was the same as for training a Fed model, except that we started to propagate the adversarial loss (Eq. 7) after training the G-C part for 5 epochs.

Figure 7 explains how data were used for training and testing in the different classification strategies. All the implemented model architectures are shown in the Appendix. The comparison results are shown in Table III. In Cross, we denoted the site used for training with the prefix 'tr' (for example, trNYU). As the testing data were all the other whole sites, there was no standard deviation (std) to report. Also, we ignored the performance on the site used for training. The other results were reported in the 'mean (std)' format. We highlighted the best accuracy in Table III by comparing the mean accuracy only. Cross results were probably better than Single results because more data were included in training (no data splitting). For example, Single at the UCLA site trained only on the training split of the UCLA data, whereas Cross trained on the USM site and tested on the UCLA site used all USM instances for training. Ensemble results were not good, probably because the ensemble method could not make use of the decisions made by different models and counter-productively weakened the prediction power. The mean accuracy of Fed was higher than the best Cross learning case for each single site. In addition, Fed was significantly better than Single by a two-sample t-test for each site. We also observed that Fed-MoE and Fed-Align significantly improved accuracy on the NYU site compared with Fed. The accuracy on the UM site using Fed-Align was significantly better than the accuracy using Fed. On the UCLA site, Fed-MoE showed potential to improve the classification results compared with Fed. Using domain adaptation methods did not improve the performance on the USM site, which was probably caused by the data distribution of the USM site. We examine this hypothesis in the following discussion.

Strategy    NYU            UM             USM            UCLA
trNYU       -              0.716          0.673          0.682
trUM        0.611          -              0.712          0.682
trUSM       0.641          0.625          -              0.730
trUCLA      0.575          0.648          0.750          -
Single      0.601(0.064)   0.648(0.065)   0.695(0.108)   0.571(0.100)
Ensemble    0.611(0.012)   0.638(0.054)   0.654(0.088)   0.634(0.064)
Fed         0.647(0.049)   0.728(0.073)   0.849(0.124)   0.712(0.075)
Fed-MoE     0.671(0.082)   0.728(0.083)   0.809(0.098)   0.744(0.130)
Fed-Align   0.676(0.071)   0.751(0.053)   0.829(0.091)   0.712(0.089)
Mix         0.671(0.035)   0.740(0.063)   0.829(0.137)   0.710(0.128)

TABLE III: Results of using different training strategies

IV-D Evaluate model from interpretation perspective

We tried to understand the model mechanism by interpreting how each model made a particular decision and how the adaptation methods affected the decision-making process.

IV-D1 Aligned feature embedding

We used t-SNE [maaten2008visualizing] to visualize the latent space embedded by the first fully connected layer of our federated learning model without (Figure 8(a)) and with (Figure 8(b)) adversarial domain alignment. We found that the alignment method overall improved domain adaptation. In Figure 8(a), we also noticed that the features of the USM site (blue crosses) were mixed with those of other domains. We assumed that this could be the reason why adversarial domain alignment did not improve federated learning accuracy for the USM site.


(a) Embedded latent features from 4 sites without alignment.


(b) Embedded latent features from 4 sites with alignment.
Fig. 8: t-SNE visualization of latent space.

IV-D2 MoE gating value

The core of MoE was to mix the outputs of a collaboratively-learned global model and a private model in each site. Over time, a site's gate function learned whether to trust the global model or the private model more for a given input. The private model needed to perform well only on the subset of data points for which the global model failed, while the global model still benefited from sharing the data product (model weights) but received weaker updates on these hard "private" data points. This meant that users with unusual domains had a smaller effect on the global model, which might increase their ability to generalize [ji2019learning]. We show the gating value associated with the federated global model for each testing data point in Figure 9. We noticed that the gating values were almost uniformly distributed between 0 and 1, which meant the MoE layer functioned as an intermediary coordinating the decisions of the private and global models, except that the gating values of the USM site were skewed toward 0s and 1s. This provided evidence for why Fed-MoE did not perform better than Fed on the USM site.
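The role of the gating value can be made concrete with a small sketch. It assumes the mixed prediction is a convex combination in which the gating value weights the global model's output and its complement weights the private model's output, with the gating value produced by a sigmoid. The function names and the example numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def moe_mix(global_probs, private_probs, gate_logit):
    """Mix global and private model outputs for one input.

    The gating value g in (0, 1) is the weight given to the global model;
    (1 - g) is the weight given to the private model.
    """
    g = sigmoid(gate_logit)
    mixed = [g * pg + (1.0 - g) * pp for pg, pp in zip(global_probs, private_probs)]
    return g, mixed


if __name__ == "__main__":
    global_out = [0.8, 0.2]   # hypothetical global model output
    private_out = [0.4, 0.6]  # hypothetical private model output
    for logit in (-4.0, 0.0, 4.0):
        g, mixed = moe_mix(global_out, private_out, logit)
        print(f"gate={g:.3f} mixed={[round(m, 3) for m in mixed]}")
```

A gating value near 0.5 blends the two models, whereas values close to 0 or 1 effectively select a single model, so a histogram skewed toward 0s and 1s indicates that little mixing takes place.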


Fig. 9: The histogram of MoE gated values assigned to federated global model.

IV-D3 Neural patterns: connectivity in the autistic brain














(a) Biomarkers using Fed strategy - view 1.


(b) Biomarkers using Single strategy - view 1.


(c) Biomarkers using Fed strategy - view 2.


(d) Biomarkers using Single strategy - view 2.
Fig. 10: Interpreting brain biomarkers associated with identifying HC from the federated learning model (Fed) and from using single-site data for training (Single). The colors stand for the relative importance scores of the ROIs, and the values are denoted on the color bar. The names of the strategies and sites are denoted in the upper-left corner of each subfigure. Each row shows the results of the NYU, UM, USM and UCLA sites from top to bottom.














(a) Biomarkers using Fed strategy - view 1.


(b) Biomarkers using Single strategy - view 1.


(c) Biomarkers using Fed strategy - view 2.


(d) Biomarkers using Single strategy - view 2.
Fig. 11: Interpreting brain biomarkers associated with identifying ASD from the federated learning model (Fed) and from using single-site data for training (Single). The colors stand for the relative importance scores of the ROIs, and the values are denoted on the color bar. The names of the strategies and sites are denoted in the upper-left corner of each subfigure. Each row shows the results of the NYU, UM, USM and UCLA sites from top to bottom.

Strategy  Group      Semantic  Comprehension  Social   Attention  Memory  Reward
Fed       HC          0.054     0.096          0.099    0.088      0.009   -0.078
          ASD        -0.048    -0.035         -0.081    0.007      0.031    0.017
          |HC-ASD|    0.102     0.131          0.180    0.081      0.022    0.095
Single    HC          0.050     0.043          0.069    0.053      0.022   -0.062
          ASD        -0.029    -0.005         -0.094    0.005      0.041    0.010
          |HC-ASD|    0.079     0.048          0.163    0.048      0.019    0.072

|HC-ASD| is the absolute difference between the scores of the HC and ASD groups.

TABLE IV: Correlations between the detected biomarkers and functional keywords maps decoded by Neurosynth.

Whether informative and replicable biomarkers can be interpreted is another dimension along which to evaluate a deep learning model, apart from accuracy-related metrics. Here, we used the guided back-propagation method (Eq. 9) to interpret feature importance for the Fed and Single models separately. The input features were the functional connectivity values between brain ROIs. First, we calculated the guided back-propagation gradient of each input feature for each testing point. To obtain an ROI-level evaluation, we built a symmetric gradient matrix in which each entry is the gradient of the functional connectivity between a pair of ROIs. We summed this matrix over its columns, resulting in a 111-dim vector of importance scores for the 111 ROIs, and normalized the vector to bound its values. We averaged the results over all the test data points in each site. The ROIs with the top 10 importance scores for each of the two classes, together with the normalized importance scores of the ROIs, were plotted for HC (Figure 10) and ASD (Figure 11). Fed detected replicable and robust biomarkers across the 4 sites, while the biomarkers detected by Single differed across sites. Further, we listed the correlations between the biomarkers and the functional keyword maps decoded by Neurosynth [yarkoni2011large] in Table IV. The biomarkers detected by Fed were more distinguishable than those of Single, as the differences between the correlation values for the HC and ASD groups were larger for Fed than for Single (see the absolute differences between the HC and ASD scores in Table IV). Therefore, we considered the biomarkers detected by Fed to be more informative. We could infer from Table IV that semantic, comprehension, social and attention-related functional connectivity was more salient in the HC group, while memory and reward-related functional connectivity was more salient in the ASD group. Hence, the biomarkers detected by Fed were more replicable and informative. The names of the biomarkers of each group detected by Fed and Single are listed in the Appendix.
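The ROI-level aggregation can be sketched as follows. Two details are assumptions made for illustration only: the 6105 connectivity features are taken to be the upper triangle of the 111 x 111 connectivity matrix in row-major order (111 x 110 / 2 = 6105), and the normalization divides by the largest absolute score. The function names are hypothetical.

```python
def roi_importance(conn_grads, n_rois=111):
    """Aggregate per-connection gradients into normalized per-ROI scores."""
    expected = n_rois * (n_rois - 1) // 2
    if len(conn_grads) != expected:
        raise ValueError(f"expected {expected} connectivity gradients")

    # Rebuild the symmetric gradient matrix (assumed upper-triangle ordering).
    mat = [[0.0] * n_rois for _ in range(n_rois)]
    k = 0
    for i in range(n_rois):
        for j in range(i + 1, n_rois):
            mat[i][j] = mat[j][i] = conn_grads[k]
            k += 1

    # The matrix is symmetric, so summing each row equals summing over columns.
    scores = [sum(row) for row in mat]

    # Assumed normalization: divide by the largest absolute score.
    peak = max(abs(s) for s in scores)
    if peak == 0:
        return scores
    return [s / peak for s in scores]


def top_k_rois(scores, k=10):
    """Return the indices of the k ROIs with the highest scores."""
    return sorted(range(len(scores)), key=lambda r: scores[r], reverse=True)[:k]


if __name__ == "__main__":
    grads = [0.001 * (i % 7) for i in range(6105)]  # placeholder gradients
    scores = roi_importance(grads)
    print(len(scores), top_k_rois(scores))
```

Per-site results would then be obtained by averaging these score vectors over the test points of each site before selecting the top 10 ROIs.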

IV-E Limitation and discussion

Although our empirical investigation showed that the communication pace, which controls how often the local and global models exchange weight information, did not affect the classification performance, we could not conclude that the pace parameter was irrelevant. A more extensive range of pace values should be examined according to the application. Also, we used practical approaches to investigate privacy-preserving mechanisms. However, the sensitivity of the mapping function, the deep learning classifier in our case, was difficult to estimate. Hence, we did not give an explicit privacy bound. A recent study [zhu2019deep] also demonstrated that Gaussian and Laplace noise above a certain scale can be a good defense against reconstruction attacks. Depending on the specific application and dataset, a suitable noise level can also be estimated empirically from the attacking perspective. In our experiments, we used the averaging method to combine the models' outputs for Ensemble. To achieve better performance for Ensemble, more advanced ensemble methods could be exploited, such as gradient tree boosting, stacking, and forests of randomized trees [zhou2012ensemble]. We evaluated the biomarkers at the ROI level; the functional connectivity could also be used as a biomarker. More advanced deep learning models can be explored as well. To show strong direct associations between the biomarkers and disease diagnosis or treatment outcome prediction, downstream tasks such as regression to ADOS scores using the biomarkers are worth exploring. We found that domain adaptation methods were not always a beneficial addition to the federated model. One could first examine the distribution of the latent features of the different data owners and then decide whether to adopt our proposed domain adaptation methods.

V Conclusion

In this work, we have presented a privacy-preserving federated learning framework for multi-site fMRI analysis. We have investigated the communication pace and the privacy-preserving randomized mechanisms for the problem of using brain functional connectivity to classify ASD and HC. To overcome the domain shift issue, we have proposed two strategies: MoE and adversarial domain alignment to boost federated learning model performance. We have also evaluated the deep learning model for neuroimaging from the biomarker detection perspective.

Our results have demonstrated the advantage of using a federated framework to utilize multi-site data without data sharing compared to alternative methods. We have shown that federated learning performance can potentially be boosted by adding domain adaptation and have discussed the conditions under which it is beneficial. In addition, the proposed federated learning model has revealed possible brain biomarkers for identifying ASD. Our work also has broader implications for other disease areas, particularly rare diseases with fewer patients. In these situations, utilizing data across multiple sites is critical for drawing meaningful conclusions.

Our approach brings new hope for accelerating deep learning applications in the field of medical imaging, where data isolation and the emphasis on data privacy have become challenges. It can establish a unified model for multiple medical institutions while protecting local data, allowing medical institutions to work together with the required data security.

VI Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

VII Acknowledgements

Data collection and sharing for this project was funded by the Autism Brain Imaging Data Exchange dataset (ABIDE) [di2014autism]. Parts of this research were supported by the National Institutes of Health (NIH) [R01NS035193, R01MH100028].



Architecture of the models

We provide the detailed model architecture for each strategy used in our study. For each fully connected (FC) layer, we provide the input and output dimensions. For dropout (Dropout) layers, we provide the probability of an element being zeroed. We denote batch normalization layers as BN, ReLU layers as ReLU and softmax layers as Softmax.

Models for Single, Cross and Ensemble are shown in Table V.

MLPs
Layer   Configuration
1       Dropout (0.5), FC (6105, 8), ReLU, BN
2       Dropout (0.5), FC (8, 2), ReLU, BN, Softmax

TABLE V: Model architecture for ABIDE rs-fMRI classification task under Single, Cross and Ensemble strategies.

Models for Fed and Mix are shown in Table VI.

MLPs
Layer   Configuration
1       Dropout (0.5), FC (6105, 16), ReLU, BN
2       Dropout (0.5), FC (16, 2), ReLU, BN, Softmax

TABLE VI: Model architecture for ABIDE rs-fMRI classification task under Fed and Mix strategies.

Models for the Fed-MoE strategy are shown in Table VII.

Private Model
Layer   Configuration
1       Dropout (0.5), FC (6105,8), ReLU, BN
2       Dropout (0.5), FC (8,2), ReLU, BN

Global Model
Layer   Configuration
1       Dropout (0.5), FC (6105,16), ReLU, BN
2       Dropout (0.5), FC (16,2), ReLU, BN

MoE
Layer   Configuration
1       FC (2,1), Sigmoid

TABLE VII: Model architecture for ABIDE rs-fMRI classification task under Fed-MoE strategy.

Models for the Fed-Align strategy are shown in Table VIII.

Feature Generator
Layer   Configuration
1       Dropout (0.5), FC (6105, 16), ReLU, BN

Domain Discriminator
Layer   Configuration
1       FC (6105, 8), ReLU
2       FC (8, 1), sigmoid

Classifier
Layer   Configuration
1       Dropout (0.5), FC (16, 2), ReLU, BN, Softmax

TABLE VIII: Model architecture for ABIDE rs-fMRI classification task under Fed-Align strategy.

Names of the biomarkers

We list the top 10 important ROIs (plotted in Figure 10 and Figure 11) in descending order of importance. Each group below contains one list per site.

1. HC biomarkers detected by Fed:

  • Left Central Opercular Cortex, Right Precuneous Cortex, Right Inferior Frontal Gyrus, Right Middle Temporal Gyrus, Right Occipital Pole, Left Middle Temporal Gyrus, Right Inferior Temporal Gyrus, Right Supramarginal Gyrus, Right Angular Gyrus, Left Frontal Operculum Cortex

  • Left Central Opercular Cortex, Right Precuneous Cortex, Right Inferior Frontal Gyrus, Right Middle Temporal Gyrus, Right Occipital Pole, Left Middle Temporal Gyrus, Right Supramarginal Gyrus, Right Inferior Temporal Gyrus, Right Angular Gyrus, Left Frontal Operculum Cortex

  • Left Central Opercular Cortex, Right Precuneous Cortex, Right Inferior Frontal Gyrus, Right Middle Temporal Gyrus, Right Occipital Pole, Left Middle Temporal Gyrus, Right Angular Gyrus, Left Frontal Operculum Cortex, Right Supramarginal Gyrus, Right Inferior Temporal Gyrus

  • Right Temporal Occipital Fusiform Cortex, Right Angular Gyrus, Left Occipital Pole, Right Middle Temporal Gyrus, Left Cingulate Gyrus, Left Frontal Medial Cortex, Right Paracingulate Gyrus, Left Temporal Pole, Left Middle Temporal Gyrus, Right Superior Temporal Gyrus

2. HC biomarkers detected by Single:

  • Right Angular Gyrus, Left Occipital Pole, Right Temporal Occipital Fusiform Cortex, Left Temporal Pole, Right Middle Temporal Gyrus, Left Postcentral Gyrus, Right Inferior Temporal Gyrus, Left Frontal Pole, Left Supramarginal Gyrus, Left Temporal Occipital Fusiform Cortex

  • Right Temporal Occipital Fusiform Cortex, Right Angular Gyrus, Right Paracingulate Gyrus, Right Superior Temporal Gyrus, Left Temporal Pole, Left Central Opercular Cortex, Left Frontal Medial Cortex, Right Supramarginal Gyrus, Left Superior Parietal Lobule, Right Superior Parietal Lobule

  • Right Angular Gyrus, Right Temporal Occipital Fusiform Cortex, Right Superior Parietal Lobule, Left Occipital Pole, Left Temporal Pole, Right Middle Temporal Gyrus, Right Paracingulate Gyrus, Right Lateral Occipital Cortex, Left Angular Gyrus, Left Hippocampus

  • Right Temporal Occipital Fusiform Cortex, Right Angular Gyrus, Left Occipital Pole, Right Middle Temporal Gyrus, Left Cingulate Gyrus, Left Frontal Medial Cortex, Right Paracingulate Gyrus, Left Temporal Pole, Left Middle Temporal Gyrus, Right Superior Temporal Gyrus

3. ASD biomarkers detected by Fed:

  • Left Accumbens, Left Parahippocampal Gyrus, Right Thalamus, Right Heschl's Gyrus (includes H1 and H2), Right Pallidum, Left Middle Frontal Gyrus, Right Precentral Gyrus, Right Parahippocampal Gyrus, Left Cuneal Cortex, Left Temporal Fusiform Cortex

  • Left Accumbens, Left Parahippocampal Gyrus, Right Thalamus, Right Heschl's Gyrus (includes H1 and H2), Right Pallidum, Right Parahippocampal Gyrus, Left Middle Frontal Gyrus, Right Precentral Gyrus, Left Cuneal Cortex, Left Temporal Fusiform Cortex

  • Left Accumbens, Left Parahippocampal Gyrus, Right Thalamus, Right Pallidum, Right Heschl's Gyrus (includes H1 and H2), Left Middle Frontal Gyrus, Right Parahippocampal Gyrus, Right Precentral Gyrus, Left Temporal Fusiform Cortex, Left Cuneal Cortex

  • Left Accumbens, Left Parahippocampal Gyrus, Right Thalamus, Right Heschl's Gyrus (includes H1 and H2), Right Pallidum, Left Middle Frontal Gyrus, Right Parahippocampal Gyrus, Right Precentral Gyrus, Left Cuneal Cortex, Left Temporal Fusiform Cortex

4. ASD biomarkers detected by Single:

  • Left Cuneal Cortex, Right Central Opercular Cortex, Left Putamen, Left Thalamus, Left Supracalcarine Cortex, Left Superior Temporal Gyrus, Left Parahippocampal Gyrus, Left Middle Frontal Gyrus, Right Thalamus, Left Accumbens

  • Left Supracalcarine Cortex, Left Accumbens, Right Precentral Gyrus, Left Subcallosal Cortex, Left Cuneal Cortex, Left Lateral Occipital Cortex, Left Inferior Frontal Gyrus, Left Hippocampus, Right Temporal Pole, Right Pallidum

  • Left Cuneal Cortex, Left Putamen, Left Superior Temporal Gyrus, Right Precentral Gyrus, Right Temporal Pole, Right Inferior Temporal Gyrus, Left Caudate, Right Pallidum, Left Lingual Gyrus, Left Precentral Gyrus

  • Left Putamen, Left Cuneal Cortex, Left Superior Temporal Gyrus, Right Inferior Temporal Gyrus, Left Lingual Gyrus, Left Caudate, Left Precentral Gyrus, Left Lateral Occipital Cortex, Right Precentral Gyrus, Right Central Opercular Cortex