An Experimental Study of Data Heterogeneity in Federated Learning Methods for Medical Imaging

07/18/2021 ∙ by Liangqiong Qu, et al. ∙ 14

Federated learning enables multiple institutions to collaboratively train machine learning models on their local data in a privacy-preserving way. However, its distributed nature often leads to significant heterogeneity in data distributions across institutions. In this paper, we investigate the deleterious impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew, and imaging acquisition skew. We show that the performance degrades with the increasing degrees of data heterogeneity. We present several mitigation strategies to overcome performance drops from data heterogeneity, including weighted average for data quantity skew, weighted loss and batch normalization averaging for label distribution skew. The proposed optimizations to federated learning methods improve their capability of handling heterogeneity across institutions, which provides valuable guidance for the deployment of federated learning in real clinical applications.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 3

page 4

page 5

page 6

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep learning techniques have demonstrated state-of-the-art performances in a wide range of computer vision and automatic clinical tasks, such as classification of natural images, detection and diagnosis of cancer, and clinical prediction [1, 2, 3, 4, 5]. However, the advancement of deep learning techniques is heavily dependent on the amount and diversity of the data in the training dataset. Many deep learning models are currently trained using data from few centers and generally do not perform well in new data or in clinical practice. Cohort sizes at single institution or even in public data repositories such as The Cancer Imaging Archive (TCIA) are often small, especially for rarer diseases or for certain patient populations (e.g., molecular variants in gliomas) [6, 7, 8]. Aggregating data from multiple institutions is not always feasible due to regulatory, technical and patient privacy concerns. Federated learning, where computations are performed locally at each institution without sharing data, is promising for accessing large, representative data to train robust deep learning models that have greater generalizability and lower risk of model bias.

Numerous federated learning methods have emerged in past decades [9, 10, 11, 12, 13]

, such as Federated stochastic gradient descent (FedSGD)

[9, 10], Federated averaging algorithm (FedAVG) [10], and Cyclical weight transfer (CWT) [12]. Despite the promising progress, existing methods generally do not account for the presence of data heterogeneity among institutions. Most federated learning methods usually assume independent and identically distributed (IID) data among institutions, which is unlikely to hold in real federated learning settings in healthcare. Few recent studies have explored the performance of federated learning methods on non-IID data partitions that have label distribution skew [14, 15, 16]. For example, Hsieh et al. [15] conducted a series of experimental studies to show the impact of label distribution skew on federated learning methods. However, in addition to label distribution skew, the data at different institutions are usually heterogeneous in other ways, such as data quantity skew and imaging acquisition skew. For example, large academic university hospitals generally have substantially larger datasets than small community hospitals, and they differ in equipment vendors and imaging equipment parameters.

In this paper, we present an experimental study on two medical image classification tasks, to investigate the impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew, and imaging acquisition skew (e.g., different hospitals may use different imaging equipment vendors and acquisition protocols). Contributions of this paper are summarized as follows.

1) We present the first thorough research to study the impact of a taxonomy of data heterogeneity regimes on several widely used federated learning methods with medical image data. Our study provides valuable guidance for the deployment of federated learning in real clinical applications.

2) We show that the performance of the federated learning methods in our study degrades with the increasing degrees of data heterogeneity, and the rate of decrease in performance is determined by the degree of deviation from homogenous distributions.

3) We propose a variety of optimization strategies to mitigate the performance loss for quantity skew and label distribution skew, including weighted average strategy for data quantity skew, and weighted loss strategy for label distribution skew.

4) We study the influence of the Batch Normalization (BN) for FedAVG, we show that averaging the mean and variance of BN across institutions during FedAVG training is a simple and flexible alternative to mitigate skew-induced performance loss of BN.

2 Study Materials and Experimental Setup

2.1 Study Methods

We employed three popular federated learning methods in our study: two parallel federated learning methods (FedSGD [9, 10] and FedAVG [10]), and CWT [12]. We used the model trained with the centrally hosted data as a baseline method, termed as centrally hosted.

FedSGD [9, 10], involves frequent transferring of gradients from individual institutions to a central server, computing weight updates using institutional gradients at the central server, and transferring updated weights back to individual institutions.

FedAVG [10], involves frequent transferring of weights from individual institutions to a central parameter server, averaging weight computation at the central server, and transferring averaged weights back to individual institutions.

CWT [12], involves training for a fixed number of iterations at one institution, and cyclically transferring weights to the next training institution until model convergence.

2.2 Dataset

We evaluated on Alzheimer’s Disease Neuroimaging Initiative (ADNI) Dataset 111http://adni.loni.usc.edu/ and Diabetic Retinopathy (Retina) Dataset [17]:

ADNI Dataset provides a longitudinal multi-institutional observation study on Alzheimer’s disease patients, mild cognitive impairment subjects, and healthy elders controls. For simplicity, we only use PET scans from (18F)-fluorode-oxyglucose PET to predict the summary standard uptake value ratios (SUVRs) status. Specifically, we discretized the continuous SUVRs into 4 classes, 2 classes of equal size for SUVR values below the 1.11 cutoff [18], and 2 classes for the SUVR values above. We utilized a total of 2603 florbetapir PET scans (from 1235 subjects) in our study. We randomly divided the scans into 3 subsets: 1896 scans for training, 314 scans for validation, and the remaining for testing. The scans for the same subject were divided into same subset.

Retina Dataset [17] consists of 17,563 pairs of right and left color digital retinal fundus images. Each image provided by this dataset includes a rating on a scale of 0 to 4 according to the presence of diabetic retinopathy. Following the setting in [12]

, the labels were binarized to Healthy (scale 0) and Diseased (scale 2, 3 or 4) in our study, while the mild diabetic retinopathy images (scale 1) were excluded. Additionally, only left eye images were used to remove the confusion from using multiple images for the same patient. We utilized a total of 6000 subjects for training, 3000 for validation, and 3000 for testing.

2.3 Experimental Setup

We used a full communication setting for FedSGD, i.e., update the gradients between individual institutions and central server every iteration. For FedAVG, we set the number of training passes on each local institution to , where is the quantity of training samples in local institution and is the local minibatch size. CWT involves training the same fixed number of iterations on each institutions. Here we set the number of iterations to , where is the quantity of training samples of centrally hosted data, is the number of institutions involving in federated learning.

We applied ResNet34 [19]

as our baseline DNN architecture. All the methods were implemented in Pytorch and optimized with SGD. The training parameters (such as learning rate and minibatch size) were tuned to make sure the baseline centrally hosted achieved best performance. For a fair comparison, we then used these training parameters in all the federated learning methods we compared. All methods were trained with enough epochs until model convergence.

3 Experiments and Results

In this section, we study the impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew and imaging acquisition skew. We then present several mitigation strategies to overcome performance drops from data heterogeneity.

3.1 Quantity Skew

ADNI Retina
Figure 1: Test accuracy on data partitions with quantity skew222

Mean and standard deviation test accuracies were obtained with 4 runs. We use the same setting for the following experiments.

. The performance drop rate of data partitions with quantity skew from homogenous data split 1 is shown when it is larger than 1%.

Quantity skew is one common way that causes data to deviate from a homogenous distribution. Large academic university hospitals generally have substantially larger datasets than small community hospitals. We created 4 sets of data partitions with variable sample sizes across simulated institutions to study the impact of quantity skew on federated learning methods (see supplementary file for detailed data partitions). Each data partition consists of 4 stimulated institutions and each institution shares the same feature distribution and label distribution. We use sample standard deviation (STD) of the sample size across institutions to measure the degree of quantity skew.

Fig. 2 shows that all the federated learning methods achieve comparable performance to the centrally hosted baseline in homogenous data split 1. However, the performance of FedSGD and CWT degrade with the increasing degree of skew, e.g, 7.0% and 8.9% drop rates of FedSGD and CWT on ADNI dataset with split 4. All the institutions at FedSGD contributed equally on the gradient update at each training iteration, ignoring the impact of quantity skew. Different from FedSGD, FedAVG averages the model weights after one epoch train on the whole local institutional dataset, without reusing the small dataset for weights update, thus works well on quantity skew. We then introduced a weighted average strategy for FedSGD, termed as FedSGD+WP, where gradients from each institutions were not treated equally but proportional to the institutional training sample size. Similarly, we applied proportional training sample sizes [16] to CWT for addressing quantity skew, termed as CWT+WP. Shown in Fig. 2, with the proposed weighted average strategy and proportional training sample sizes strategy, both FedSGD and CWT achieve promising performance on data partitions with quantity skew.

3.2 Label Distribution Skew

Our study on quantity skew assumes homogenous label distribution across institutions, which is not always true in real applications. In fact, label distribution may vary across institutions even when they share the same label annotations. For example, lupus is much more common in Black, Asian people than White people. We created 4 sets of data partitions with label distribution skew (same data quantity) by controlling the fraction of non-IID data. We used the mean Kolmogorov-Smirnov (KS) statistic between every two institutions to measure the degree of label distribution skew. Specifically, KS=0 indicates homogenous label distributions, and 1 indicates totally different label distributions across institutions. See supplementary file for detailed data partitions and its corresponding KS value.

ADNI Retina
Figure 2: Test accuracy on data partitions with label distribution skew. The performance drop rate of data partitions with label distribution skew from homogenous data split 1 is shown when it is larger than 1%.

Fig. 2

shows that all the compared federated learning methods are vulnerable to label distribution skew. The performance drop rates on split 4 (with KS = 0.90) of ADNI dataset reaches to 26.0%, 23.0%, and 32.0% of FedSGD, FedAVG, and CWT, respectively. One common approach for addressing class imbalance in standard DNNs is to introduce weighted factors into loss function. Similarly, we also applied a class weighted cross-entropy loss (WL) to tackle the label distribution skew in federated learning methods:

(1)

where is the number of categories, is the ground-truth binary indicator with 1 when class label is the correct category for sample and 0 else, and

is the model prediction probability that sample

has class label , is the weighted factor for class label and is defined as:

(2)

where indicates the number of samples with class label . WL is applied on both the training stage and the cross-validation model selection stage for better convergence.

WL mitigates the performance drops on data partitions with certain degrees of label distribution skew (see Fig. 2). For example, the performance drop rate of CWT+WL is increased from 32.0%, and 12.0% to 17.0% and 3.9% on split 4 of ADNI and Retina dataset, respectively. However, the improvement on parallel federated learning (FedSGD and FedAVG) with large degree of skew is limited. FedAVG+WL even works worse than FedAVG on split 4 of ADNI dataset, e.g., a 28.0% drop rate compared to original 23%. Note that WL has also been described in [16], but they only applied them to CWT for 2-label classification tasks. Here, we extend it to general federated learning approaches and multi-label classification tasks.

To investigate the worse performance of FedAVG+WL, we show the prediction accuracy of each institutional model on each institutional test dataset333Same label distribution for local institutional testing and training dataset. (see Fig. 3(a)). It is worth noting that the institutional model performs even worse on its own testing dataset than the models from other institutions, e.g., worse diagonal performance of row 1 and row 3 in Fig. 3(a).

This then raises the question: why the institutional models after model average differ a lot? Actually, the models across institutions only differ in the BN layer after model averaging on central server. Thus, BN is the main factor that causes the discrepancy between models across institutions. BN is a widely used technique aiming to stabilize and accelerate the DNN training. It normalizes each layer’s inputs with minibatch mean and variance during training, and an estimated global mean and variance is used during testing. In standard implementation, FedAVG and FedSGD do not average the estimated mean and variance of each layer, which then causes the discrepancy between models across institutions. To this end, we modified the averaging setting of BN, also averaged the estimated mean and variance across institutions during federal training. Fig. 

3 shows results with averaged BN settings. FedAVG+WL+BN generates superior performance than the standard FedAVG+WL setting, e.g., a 16.0% percentage increase is shown on split 4 of ADNI dataset. [15] also indicates that DNNs with BN is vulnerable to label distribution skew, they show that group normalization [20] avoids the skew-induced accuracy loss of BN. Our experiments are complementary to theirs and provide a simple and flexible alternative to help mitigate the skew-induced performance loss of BN in federated learning settings.

(a) Split 4 of ADNI (b) ADNI (c)Retina
Figure 3: (a): Prediction accuracy of each institutional model on each institutional test dataset of split 4 on ADNI datasest. (b) and (c): Performance analysis with different averaging settings of BN in FedAVG on ADNI and Retina dataset, respectively. FedAVG+WL+BN helps mitigate the skew-induced accuracy loss of BN by averaging the estimated mean and variance of BN.

3.3 Imaging Acquisition Skew

In medical domain, there has been a longstanding debate about applying medical imaging standardization throughout the imaging industry [21, 22]. Modern X-Ray, MR imaging, and PET units allow for wide variations in imaging acquisition settings, which may result in significant differences in the generated images across institutions, even for the same underlying disease.

In this section, we show the impact of imaging acquisition skew on federated learning methods with two sets of experiments. We first simulated three sets of data partitions (split 1 - split 3), and then generated a real data partition on ADNI dataset according to scanner vendors (split 4).

Split 1 (Simulated IID partition): Same as split 1 in Sec. 3.1 and Sec. 3.2.

Split 2 (Simulated partition with resolution skew): Decreasing image resolution in Institution 1 to 4 of split 1 with a factor of 4, 3, 2, 1, respectively.

Split 3

(Simulated partition with signal-to-noise-ratio (SNR) skew): Degrading images in split 1 with various types of noises and blurring, such as Gaussian, Speckle, Poisson noise and motion blur. Institution 1: Gaussian noise, Institution 2: motion blur, Institution 3: mixture of noise/blurring but dominated by gaussian noise, Institution 4: mixture of noise/blurring but dominated by motion blur. Examples of synthetic gaussian noise and motion blur are shown in Fig. 

4.

Split 4 (Real partition according to scanner vendors): ADNI dataset were acquired from 7 manufacturers, such as GE Healthcare, Philips Healthcare, Siemens Healthcare and etc. We resplit the dataset into 4 institutions according to the scanner vendors. Least quantity skew and label distribution skew across institutions are ensured in this new split.

Orig img. Orig img. + Gaussian Orig img. Orig. img + motion blur
Figure 4: Examples of images with synthetic gaussian noise and motion blur.
ADNI Retina
Figure 5: Test accuracy on data partitions with imaging acquisition skew. The performance decrease rate compared to the baseline centrally hosted is also shown.

Experimental results in Fig. 5

indicate that, in addition to the label distribution skew, the imaging acquisition skew is also a critical hurdle that prevents the deployment of federated learning in real applications. Strategies such as super-resolution, image denoising and histogram matching may be applied to deal with the imaging acquisition skew, which will remains to be undertaken in our future work.

4 Conclusion

In this paper, we conduct extensive experiments to study the impact of data heterogeneity on federated learning methods. Our study covers three widely used federated learning methods, a taxonomy of data heterogeneity regimes, data heterogeneity with different degrees of skewness. We show that the federated learning methods in our study are vulnerable to data partitions with a high degree of skew. We then present several optimization strategies to overcome the performance loss from data heterogeneity.

Extensive experiments demonstrate that: 1) the proposed weighted average for FedSGD can recover performance loss from introducing quantity skew; 2) weighted loss helps mitigate the performance loss from introducing label distribution skew; 3) averaging the mean and variance of BN across institutions in FedAVG training is an attractive alternative to mitigate skew-induced performance loss of BN. We anticipate that our detailed analysis provided herein will provide guidance for the deployment of federated learning in real clinical applications, and that our findings will provide useful hints towards the construction of better federated learning methods.

References

Supplementary Material

We detail our simulated data partitions in this supplementary file. Table  1 shows the simulated data partitions with quantity skew. Fig. 6 and Fig. 7 show the simulated data partitions with label distribution skew on ADNI and Retina dataset, respectively. Fig. 8 shows an example of PET cases imaged with different scanner vendors in ADNI dataset. These PET images differ in either resolution, contrast, or intensity distributions.

Splits Inst1 Inst2 Inst3 Inst4 STD
Split 1 474 474 474 474 0
Split 2 299 317 385 895 283.1
Split 3 113 211 579 993 399.9
Split 4 66 111 282 1437 648.7
(a) Partitions on ADNI
Splits Inst1 Inst2 Inst2 Inst2 STD
Split 1 1500 1500 1500 1500 0
Split 2 750 960 1800 2490 800.8
Split 3 315 850 1750 3085 1211
Split 4 208 350 889 4553 2056
(b) Partitions on Retina
Table 1: Data partitions with quantity skew. The number of training samples in each institution is shown. STD is sample standard deviation of the training sample size across institutions.
Figure 6: Data partitions on ADNI dataset with label distribution skew. Large Kolmogorov-Smirnov (KS) indicates higher degree of label distribution skew.
Figure 7: Data partitions on Retina dataset with label distribution skew. Large KS indicates higher degree of label distribution skew.
Philips Siemens CPS GEMS MiE
Figure 8: PET cases imaged with different scanner vendors in ADNI dataset.