Deep learning techniques have demonstrated state-of-the-art performances in a wide range of computer vision and automatic clinical tasks, such as classification of natural images, detection and diagnosis of cancer, and clinical prediction [1, 2, 3, 4, 5]. However, the advancement of deep learning techniques is heavily dependent on the amount and diversity of the data in the training dataset. Many deep learning models are currently trained using data from few centers and generally do not perform well in new data or in clinical practice. Cohort sizes at single institution or even in public data repositories such as The Cancer Imaging Archive (TCIA) are often small, especially for rarer diseases or for certain patient populations (e.g., molecular variants in gliomas) [6, 7, 8]. Aggregating data from multiple institutions is not always feasible due to regulatory, technical and patient privacy concerns. Federated learning, where computations are performed locally at each institution without sharing data, is promising for accessing large, representative data to train robust deep learning models that have greater generalizability and lower risk of model bias.
, such as Federated stochastic gradient descent (FedSGD)[9, 10], Federated averaging algorithm (FedAVG) , and Cyclical weight transfer (CWT) . Despite the promising progress, existing methods generally do not account for the presence of data heterogeneity among institutions. Most federated learning methods usually assume independent and identically distributed (IID) data among institutions, which is unlikely to hold in real federated learning settings in healthcare. Few recent studies have explored the performance of federated learning methods on non-IID data partitions that have label distribution skew [14, 15, 16]. For example, Hsieh et al.  conducted a series of experimental studies to show the impact of label distribution skew on federated learning methods. However, in addition to label distribution skew, the data at different institutions are usually heterogeneous in other ways, such as data quantity skew and imaging acquisition skew. For example, large academic university hospitals generally have substantially larger datasets than small community hospitals, and they differ in equipment vendors and imaging equipment parameters.
In this paper, we present an experimental study on two medical image classification tasks, to investigate the impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew, and imaging acquisition skew (e.g., different hospitals may use different imaging equipment vendors and acquisition protocols). Contributions of this paper are summarized as follows.
1) We present the first thorough research to study the impact of a taxonomy of data heterogeneity regimes on several widely used federated learning methods with medical image data. Our study provides valuable guidance for the deployment of federated learning in real clinical applications.
2) We show that the performance of the federated learning methods in our study degrades with the increasing degrees of data heterogeneity, and the rate of decrease in performance is determined by the degree of deviation from homogenous distributions.
3) We propose a variety of optimization strategies to mitigate the performance loss for quantity skew and label distribution skew, including weighted average strategy for data quantity skew, and weighted loss strategy for label distribution skew.
4) We study the influence of the Batch Normalization (BN) for FedAVG, we show that averaging the mean and variance of BN across institutions during FedAVG training is a simple and flexible alternative to mitigate skew-induced performance loss of BN.
2 Study Materials and Experimental Setup
2.1 Study Methods
We employed three popular federated learning methods in our study: two parallel federated learning methods (FedSGD [9, 10] and FedAVG ), and CWT . We used the model trained with the centrally hosted data as a baseline method, termed as centrally hosted.
FedSGD [9, 10], involves frequent transferring of gradients from individual institutions to a central server, computing weight updates using institutional gradients at the central server, and transferring updated weights back to individual institutions.
FedAVG , involves frequent transferring of weights from individual institutions to a central parameter server, averaging weight computation at the central server, and transferring averaged weights back to individual institutions.
CWT , involves training for a fixed number of iterations at one institution, and cyclically transferring weights to the next training institution until model convergence.
ADNI Dataset provides a longitudinal multi-institutional observation study on Alzheimer’s disease patients, mild cognitive impairment subjects, and healthy elders controls. For simplicity, we only use PET scans from (18F)-fluorode-oxyglucose PET to predict the summary standard uptake value ratios (SUVRs) status. Specifically, we discretized the continuous SUVRs into 4 classes, 2 classes of equal size for SUVR values below the 1.11 cutoff , and 2 classes for the SUVR values above. We utilized a total of 2603 florbetapir PET scans (from 1235 subjects) in our study. We randomly divided the scans into 3 subsets: 1896 scans for training, 314 scans for validation, and the remaining for testing. The scans for the same subject were divided into same subset.
Retina Dataset  consists of 17,563 pairs of right and left color digital retinal fundus images. Each image provided by this dataset includes a rating on a scale of 0 to 4 according to the presence of diabetic retinopathy. Following the setting in 
, the labels were binarized to Healthy (scale 0) and Diseased (scale 2, 3 or 4) in our study, while the mild diabetic retinopathy images (scale 1) were excluded. Additionally, only left eye images were used to remove the confusion from using multiple images for the same patient. We utilized a total of 6000 subjects for training, 3000 for validation, and 3000 for testing.
2.3 Experimental Setup
We used a full communication setting for FedSGD, i.e., update the gradients between individual institutions and central server every iteration. For FedAVG, we set the number of training passes on each local institution to , where is the quantity of training samples in local institution and is the local minibatch size. CWT involves training the same fixed number of iterations on each institutions. Here we set the number of iterations to , where is the quantity of training samples of centrally hosted data, is the number of institutions involving in federated learning.
We applied ResNet34 
as our baseline DNN architecture. All the methods were implemented in Pytorch and optimized with SGD. The training parameters (such as learning rate and minibatch size) were tuned to make sure the baseline centrally hosted achieved best performance. For a fair comparison, we then used these training parameters in all the federated learning methods we compared. All methods were trained with enough epochs until model convergence.
3 Experiments and Results
In this section, we study the impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew and imaging acquisition skew. We then present several mitigation strategies to overcome performance drops from data heterogeneity.
3.1 Quantity Skew
Mean and standard deviation test accuracies were obtained with 4 runs. We use the same setting for the following experiments.. The performance drop rate of data partitions with quantity skew from homogenous data split 1 is shown when it is larger than 1%.
Quantity skew is one common way that causes data to deviate from a homogenous distribution. Large academic university hospitals generally have substantially larger datasets than small community hospitals. We created 4 sets of data partitions with variable sample sizes across simulated institutions to study the impact of quantity skew on federated learning methods (see supplementary file for detailed data partitions). Each data partition consists of 4 stimulated institutions and each institution shares the same feature distribution and label distribution. We use sample standard deviation (STD) of the sample size across institutions to measure the degree of quantity skew.
Fig. 2 shows that all the federated learning methods achieve comparable performance to the centrally hosted baseline in homogenous data split 1. However, the performance of FedSGD and CWT degrade with the increasing degree of skew, e.g, 7.0% and 8.9% drop rates of FedSGD and CWT on ADNI dataset with split 4. All the institutions at FedSGD contributed equally on the gradient update at each training iteration, ignoring the impact of quantity skew. Different from FedSGD, FedAVG averages the model weights after one epoch train on the whole local institutional dataset, without reusing the small dataset for weights update, thus works well on quantity skew. We then introduced a weighted average strategy for FedSGD, termed as FedSGD+WP, where gradients from each institutions were not treated equally but proportional to the institutional training sample size. Similarly, we applied proportional training sample sizes  to CWT for addressing quantity skew, termed as CWT+WP. Shown in Fig. 2, with the proposed weighted average strategy and proportional training sample sizes strategy, both FedSGD and CWT achieve promising performance on data partitions with quantity skew.
3.2 Label Distribution Skew
Our study on quantity skew assumes homogenous label distribution across institutions, which is not always true in real applications. In fact, label distribution may vary across institutions even when they share the same label annotations. For example, lupus is much more common in Black, Asian people than White people. We created 4 sets of data partitions with label distribution skew (same data quantity) by controlling the fraction of non-IID data. We used the mean Kolmogorov-Smirnov (KS) statistic between every two institutions to measure the degree of label distribution skew. Specifically, KS=0 indicates homogenous label distributions, and 1 indicates totally different label distributions across institutions. See supplementary file for detailed data partitions and its corresponding KS value.
shows that all the compared federated learning methods are vulnerable to label distribution skew. The performance drop rates on split 4 (with KS = 0.90) of ADNI dataset reaches to 26.0%, 23.0%, and 32.0% of FedSGD, FedAVG, and CWT, respectively. One common approach for addressing class imbalance in standard DNNs is to introduce weighted factors into loss function. Similarly, we also applied a class weighted cross-entropy loss (WL) to tackle the label distribution skew in federated learning methods:
where is the number of categories, is the ground-truth binary indicator with 1 when class label is the correct category for sample and 0 else, and
is the model prediction probability that samplehas class label , is the weighted factor for class label and is defined as:
where indicates the number of samples with class label . WL is applied on both the training stage and the cross-validation model selection stage for better convergence.
WL mitigates the performance drops on data partitions with certain degrees of label distribution skew (see Fig. 2). For example, the performance drop rate of CWT+WL is increased from 32.0%, and 12.0% to 17.0% and 3.9% on split 4 of ADNI and Retina dataset, respectively. However, the improvement on parallel federated learning (FedSGD and FedAVG) with large degree of skew is limited. FedAVG+WL even works worse than FedAVG on split 4 of ADNI dataset, e.g., a 28.0% drop rate compared to original 23%. Note that WL has also been described in , but they only applied them to CWT for 2-label classification tasks. Here, we extend it to general federated learning approaches and multi-label classification tasks.
To investigate the worse performance of FedAVG+WL, we show the prediction accuracy of each institutional model on each institutional test dataset333Same label distribution for local institutional testing and training dataset. (see Fig. 3(a)). It is worth noting that the institutional model performs even worse on its own testing dataset than the models from other institutions, e.g., worse diagonal performance of row 1 and row 3 in Fig. 3(a).
This then raises the question: why the institutional models after model average differ a lot? Actually, the models across institutions only differ in the BN layer after model averaging on central server. Thus, BN is the main factor that causes the discrepancy between models across institutions. BN is a widely used technique aiming to stabilize and accelerate the DNN training. It normalizes each layer’s inputs with minibatch mean and variance during training, and an estimated global mean and variance is used during testing. In standard implementation, FedAVG and FedSGD do not average the estimated mean and variance of each layer, which then causes the discrepancy between models across institutions. To this end, we modified the averaging setting of BN, also averaged the estimated mean and variance across institutions during federal training. Fig.3 shows results with averaged BN settings. FedAVG+WL+BN generates superior performance than the standard FedAVG+WL setting, e.g., a 16.0% percentage increase is shown on split 4 of ADNI dataset.  also indicates that DNNs with BN is vulnerable to label distribution skew, they show that group normalization  avoids the skew-induced accuracy loss of BN. Our experiments are complementary to theirs and provide a simple and flexible alternative to help mitigate the skew-induced performance loss of BN in federated learning settings.
|(a) Split 4 of ADNI||(b) ADNI||(c)Retina|
3.3 Imaging Acquisition Skew
In medical domain, there has been a longstanding debate about applying medical imaging standardization throughout the imaging industry [21, 22]. Modern X-Ray, MR imaging, and PET units allow for wide variations in imaging acquisition settings, which may result in significant differences in the generated images across institutions, even for the same underlying disease.
In this section, we show the impact of imaging acquisition skew on federated learning methods with two sets of experiments. We first simulated three sets of data partitions (split 1 - split 3), and then generated a real data partition on ADNI dataset according to scanner vendors (split 4).
Split 2 (Simulated partition with resolution skew): Decreasing image resolution in Institution 1 to 4 of split 1 with a factor of 4, 3, 2, 1, respectively.
(Simulated partition with signal-to-noise-ratio (SNR) skew): Degrading images in split 1 with various types of noises and blurring, such as Gaussian, Speckle, Poisson noise and motion blur. Institution 1: Gaussian noise, Institution 2: motion blur, Institution 3: mixture of noise/blurring but dominated by gaussian noise, Institution 4: mixture of noise/blurring but dominated by motion blur. Examples of synthetic gaussian noise and motion blur are shown in Fig.4.
Split 4 (Real partition according to scanner vendors): ADNI dataset were acquired from 7 manufacturers, such as GE Healthcare, Philips Healthcare, Siemens Healthcare and etc. We resplit the dataset into 4 institutions according to the scanner vendors. Least quantity skew and label distribution skew across institutions are ensured in this new split.
|Orig img.||Orig img. + Gaussian||Orig img.||Orig. img + motion blur|
Experimental results in Fig. 5
indicate that, in addition to the label distribution skew, the imaging acquisition skew is also a critical hurdle that prevents the deployment of federated learning in real applications. Strategies such as super-resolution, image denoising and histogram matching may be applied to deal with the imaging acquisition skew, which will remains to be undertaken in our future work.
In this paper, we conduct extensive experiments to study the impact of data heterogeneity on federated learning methods. Our study covers three widely used federated learning methods, a taxonomy of data heterogeneity regimes, data heterogeneity with different degrees of skewness. We show that the federated learning methods in our study are vulnerable to data partitions with a high degree of skew. We then present several optimization strategies to overcome the performance loss from data heterogeneity.
Extensive experiments demonstrate that: 1) the proposed weighted average for FedSGD can recover performance loss from introducing quantity skew; 2) weighted loss helps mitigate the performance loss from introducing label distribution skew; 3) averaging the mean and variance of BN across institutions in FedAVG training is an attractive alternative to mitigate skew-induced performance loss of BN. We anticipate that our detailed analysis provided herein will provide guidance for the deployment of federated learning in real clinical applications, and that our findings will provide useful hints towards the construction of better federated learning methods.
-  Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, 1097–1105 (2012).
-  Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine 24, 1559–1567 (2018).
-  Qu, L., Zhang, Y., Wang, S., Yap, P.-T. & Shen, D. Synthesized 7t mri from 3t mri via deep learning in spatial and wavelet domains. Medical Image Analysis 101663 (2020).
-  Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceedings of the National Academy of Sciences 115, E2970–E2979 (2018).
-  Wang, G., Ye, J. C. & De Man, B. Deep learning for tomographic image reconstruction. Nature Machine Intelligence 2, 737–748 (2020).
-  Clark, K. et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging 26, 1045–1057 (2013).
-  Qu, L., Wang, S., Yap, P.-T. & Shen, D. Wavelet-based semi-supervised adversarial learning for synthesizing realistic 7t from 3t mri. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 786–794 (Springer, 2019).
-  Kaushal, A., Altman, R. & Langlotz, C. Geographic distribution of us cohorts used to train deep learning algorithms. Jama 324, 1212–1213 (2020).
-  Su, H. & Chen, H. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239 (2015).
-  McMahan, H. B., Moore, E., Ramage, D., Hampson, S. et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
-  Lin, Y., Han, S., Mao, H., Wang, Y. & Dally, W. J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017).
-  Chang, K. et al. Distributed deep learning networks among institutions for medical imaging. Journal of the American Medical Informatics Association 25, 945–954 (2018).
-  Vepakomma, P., Gupta, O., Swedish, T. & Raskar, R. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564 (2018).
-  Hsu, T.-M. H., Qi, H. & Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335 (2019).
-  Hsieh, K., Phanishayee, A., Mutlu, O. & Gibbons, P. B. The non-iid data quagmire of decentralized machine learning. arXiv preprint arXiv:1910.00189 (2019).
-  Balachandar, N., Chang, K., Kalpathy-Cramer, J. & Rubin, D. L. Accounting for data variability in multi-institutional distributed deep learning for medical imaging. Journal of the American Medical Informatics Association (2020).
-  Kaggle. Diabetic retinopathy detection. https://www.kaggle.com/c/diabetic-retinopathy-detection (2017).
-  Landau, S. M. et al. Amyloid deposition, hypometabolism, and longitudinal cognitive decline. Annals of neurology 72, 578–586 (2012).
He, K., Zhang, X., Ren,
S. & Sun, J.
Deep residual learning for image recognition.
Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
-  Wu, Y. & He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19 (2018).
-  Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: images are more than pictures, they are data. Radiology 278, 563–577 (2016).
-  Lambin, P. et al. Radiomics: the bridge between medical imaging and personalized medicine. Nature reviews Clinical oncology 14, 749 (2017).
We detail our simulated data partitions in this supplementary file. Table 1 shows the simulated data partitions with quantity skew. Fig. 6 and Fig. 7 show the simulated data partitions with label distribution skew on ADNI and Retina dataset, respectively. Fig. 8 shows an example of PET cases imaged with different scanner vendors in ADNI dataset. These PET images differ in either resolution, contrast, or intensity distributions.