Training effective machine learning (ML) models in medical imaging is a data-driven problem: model utility typically depends directly on the quantity and quality of the data available during training. In recent works (Sheller2019; Sheller2020), federated learning (FL) has been proposed to allow the utilisation of multi-site clinical datasets, enabling and encouraging collaboration between data owners to obtain larger pools of high-quality, diverse and representative data while avoiding direct data sharing. While FL circumvents centralised data pooling, it is not a privacy-enhancing technology (PT), as it does not provide the federation with any formal notion of privacy with regard to the patient data it holds. This can leave the model vulnerable to catastrophic privacy breaches through attacks such as model inversion (dlg; idlg; geiping2020) or membership inference (mia) during model training by malicious parties inside or outside the federation. It is therefore imperative that collaborative learning not only benefits model utility through a richer data pool, but also provides formal privacy guarantees to the participating parties, for example through the utilisation of PTs such as differential privacy (DP) or encrypted algorithm training.
However, the application of PTs often comes at the cost of decreased model utility (the privacy-utility trade-off), e.g. due to the addition of noise to the training process (dp_impact_accuracy). Minimising this trade-off is a complex yet fundamental problem, and has so far not been conclusively investigated in the area of medical imaging.
In the present work, we perform federated medical image segmentation under image-level differential privacy guarantees. Through detailed experiments, we demonstrate that the appropriate choice of architecture and training technique can enable excellent model performance despite rigorous privacy guarantees while minimising network overhead and computational demands.
2 Background and Related Work
We use the term Federated Learning (FL, (konevcny2016federated)) to denote a collaborative learning protocol in which a number of parties (nodes/workers) jointly train a neural network. Training occurs in rounds, during which the central server sends the model to the nodes, local training occurs for a number of iterations, and the updated models are then aggregated by Federated (Gradient) Averaging (mcmahan2017communication). We assume a cross-silo topology, based on a small number of centres with relatively large datasets. Moreover, we assume homogeneous node compute resources and constant node availability.
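As a minimal illustration of the aggregation step, federated averaging weights each node's parameters by its local sample count. The sketch below is plain Python over flat parameter lists; the function name and list-of-floats representation are our own simplification, not the API of any particular FL framework:

```python
def federated_average(node_params, node_sizes):
    """Aggregate per-node parameter vectors by sample-size-weighted averaging.

    node_params: list of parameter vectors (one flat list of floats per node)
    node_sizes:  number of local training samples held by each node
    """
    total = sum(node_sizes)
    n_params = len(node_params[0])
    averaged = [0.0] * n_params
    for params, size in zip(node_params, node_sizes):
        weight = size / total
        for i, p in enumerate(params):
            averaged[i] += weight * p
    return averaged

# Three simulated nodes with equal data shares, as in our federation:
global_params = federated_average(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [10, 10, 10]
)
```

With equal node sizes this reduces to a plain mean of the node parameters; with unequal sizes, larger datasets contribute proportionally more to the global model.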
We use the following definition of Differential Privacy (DP) by (dwork2014algorithmic): For some randomised algorithm (=mechanism) $M$, all subsets $S$ of its image, a sensitive dataset $D$ and its neighbouring dataset $D'$, whereby $D$ and $D'$ differ by at most one record, we say that $M$ is $(\varepsilon, \delta)$-differentially private (DP) if, for a (typically small) constant $\varepsilon$ and $\delta \geq 0$:
$$\Pr[M(D) \in S] \leq e^{\varepsilon} \Pr[M(D') \in S] + \delta.$$
The probability is taken over the randomness of the algorithm $M$. DP is, therefore, an attribute of an algorithm that makes it approximately invariant to the exclusion or inclusion of a data point. It quantifies an individual's contribution to the final outcome of the computation. When the difference between the contributions of multiple participants is minimal, it is not possible to reliably determine the presence or absence of an individual.
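As a concrete, illustrative instance of this definition (not part of our training pipeline), the Gaussian mechanism releases a query result with noise calibrated to the query's $\ell_2$-sensitivity, using the classical calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is valid for $\varepsilon < 1$:

```python
import math
import random

def gaussian_mechanism(query_result, sensitivity, epsilon, delta):
    """Release a scalar query result with (epsilon, delta)-DP via Gaussian noise.

    sensitivity: the l2-sensitivity of the query, i.e. the maximum change in
    its output when one record is added to or removed from the dataset.
    The sigma calibration below is the classical analysis, valid for epsilon < 1.
    """
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return query_result + random.gauss(0.0, sigma), sigma

# Example: a counting query (sensitivity 1) under a strong privacy budget.
noisy_count, sigma = gaussian_mechanism(42.0, sensitivity=1.0, epsilon=0.5, delta=1e-5)
```

Smaller $\varepsilon$ (stronger privacy) directly increases $\sigma$, making any single record's influence on the released value statistically indistinguishable.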
Sheller and colleagues have demonstrated the utilisation of FL in the context of brain tumour segmentation (Sheller2019; Sheller2020). However, neither work utilises PTs to provide privacy guarantees to the included patients. Li et al. (li2019privacy) also showcase brain tumour segmentation using FL. The Sparse Vector Technique utilised in their study, however, only provides privacy guarantees to the model's parameters and not to the dataset records, which is not a meaningful notion of privacy. In comparison, DP-SGD, utilised in our study, provides guarantees to each individual patient, providing the federation with a pragmatic, information-theoretic privacy-preserving solution instead. Fay et al. (fay2020decentralized) utilise the Private Aggregation of Teacher Ensembles for brain tumour segmentation. This technique was originally developed for classification tasks and imposes strong assumptions on the learning process. Consequently, the authors could not demonstrate reasonable privacy guarantees in their study while still experiencing a steep utility penalty. Yang et al. (Yang2021) utilised FL for COVID-19 lesion segmentation in computed tomography (CT) but did not employ PTs. Lastly, Sarma et al. (Sarma2021) demonstrate FL segmentation of prostate volumes in magnetic resonance imaging (MRI), but also did not employ any PTs.
As the survey of previous works above shows, although several studies have dealt with the topic of FL for medical image segmentation, our work is, to the best of our knowledge, the first to utilise differentially private stochastic gradient descent (DP-SGD) in addition to FL, both providing stringent image-level privacy guarantees and maintaining high model utility in the setting of medical image segmentation. We summarise our main contributions below:
We present an in-depth study on the application of DP to medical image segmentation by successfully training several segmentation model architectures in the federated setting. Contrary to previous works, our implementation of differentially private training provides strict, provable guarantees with respect to each individual image, which allows the federation to obtain a meaningful measure of privacy.
Our models trained with FL achieve comparable segmentation performance to centrally trained models while suffering only mild privacy-utility trade-offs.
We demonstrate the first successful model inversion attack on semantic segmentation architectures, leading to the full reconstruction of input images for certain models. We thus provide evidence that, despite prevailing literature opinion, FL in itself is an insufficient technique for protecting patient privacy. We then empirically show that, consistent with its theoretical privacy guarantees, the addition of DP to model training completely thwarts such privacy-centred attacks.
3.1 Federated Training
For FL experimentation, we simulated a scenario in which three hospitals (=workers/nodes) collaboratively train the neural network, coordinated by a central server. For this, we split the dataset randomly by patient across the three servers, maintaining an equal number of patients per server. We conducted all experimentation using the PriMIA framework (primia), a generic open-source software package for privacy-preserving and federated deep learning on medical imaging, which we adapted to semantic medical image segmentation.
3.2 Differentially Private Training
For deep neural network training, we utilised DP-stochastic gradient descent (DP-SGD) (abadi2016deep), which extends DP guarantees to gradient-based optimisation by clipping the $\ell_2$-norm of the per-sample gradients of each minibatch to a specific value and adding Gaussian noise of predetermined magnitude to the averaged minibatch gradients before performing an optimisation step. We utilise the Rényi Differential Privacy Accountant (mironov2019renyi), an extension of the moments accountant technique by Abadi et al., for the privacy analysis of this algorithm, i.e. the calculation of the privacy loss at each individual site in terms of $(\varepsilon, \delta)$. We note that the reported privacy guarantees are record-level, and not patient-level, guarantees. Moreover, we regarded all datasets used as public for the purposes of experimentation such as hyperparameter searches. DP training was performed under three privacy regimes, shown in Table 1. In the following, we will refer to these as the low, medium and high privacy regimes. Finally, we note that the utilisation of Batch Normalisation layers is incompatible with DP training, as the running statistics of the layers are maintained non-privately. Hence, we deactivated the running statistics collection for Batch Normalisation layers, effectively converting them to Instance/Channel Normalisation layers (dai2019channel), which are DP-compatible.
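The per-sample clipping and noising step of DP-SGD can be sketched as follows. This is a plain-Python sketch over flat gradient lists; the clip norm and noise multiplier values shown are illustrative, not the ones used in our experiments:

```python
import math
import random

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier):
    """One DP-SGD gradient sanitisation step (after abadi2016deep, sketched).

    1. Clip each per-sample gradient to l2-norm <= clip_norm.
    2. Average the clipped gradients over the minibatch.
    3. Add Gaussian noise with std noise_multiplier * clip_norm / batch_size.
    """
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clip
        for i, g in enumerate(grad):
            summed[i] += g * scale
    sigma = noise_multiplier * clip_norm / batch_size
    return [s / batch_size + random.gauss(0.0, sigma) for s in summed]

noisy_grad = dp_sgd_step([[3.0, 4.0], [0.3, 0.4]], clip_norm=1.0, noise_multiplier=1.1)
```

Clipping bounds each sample's (and hence each image's) influence on the update, which is what allows the accountant to convert the noise magnitude into a formal $(\varepsilon, \delta)$ guarantee per record.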
3.3 Gradient-based Model Inversion
In order to empirically verify whether reconstruction of training data is possible by an Honest-but-Curious (HbC) adversary with white-box model access (evans2017pragmatic) who participates in the training protocol, we performed a model inversion attack utilising the gradients shared during Federated Averaging. We employ an approach similar to the Deep Leakage from Gradients (DLG) (dlg; idlg; geiping2020) attack to infer the data that generated the update. The attack approximates the input image through gradient descent-based perturbation of a randomly initialised noise matrix, guided by a differentiable similarity metric between its induced gradient and the gradient captured during model training. Unlike the original attack, which was designed for image reconstruction in a classification context given a model update and the corresponding label, we provide the gradient update and a segmentation mask, which are used to reconstruct the input. We observed that the adversary's use of the victim's segmentation mask greatly improves the results of the reconstruction. For more complex models, the attack was not successful without access to the segmentation mask, which we consider a limitation of this method.
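The intuition behind such gradient leakage can be illustrated without deep learning machinery: for a single linear layer and a single training sample, the weight gradient of a squared-error loss is a scalar multiple of the input, so the input's direction is recoverable exactly from the shared update. The sketch below is our own toy example in plain Python, not the DLG attack itself:

```python
import math

def linear_gradient(weights, x, y):
    """Weight gradient of 0.5 * (w.x - y)^2 for one sample: (w.x - y) * x."""
    residual = sum(w * xi for w, xi in zip(weights, x)) - y
    return [residual * xi for xi in x]

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

# A "victim" computes a gradient on a private input x and shares it.
x_private = [0.2, 0.8, 0.5]
grad_shared = linear_gradient([0.1, -0.3, 0.2], x_private, 1.0)

# The shared gradient is collinear with the private input.
similarity = abs(cosine_similarity(grad_shared, x_private))
```

For deep networks this direct proportionality no longer holds, which is why DLG-style attacks instead optimise a candidate input until its induced gradient matches the captured one under a similarity metric such as the cosine similarity above.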
3.4 Model Architectures
All experiments were carried out using U-Net-like architectures (ronneberger2015u) with modifications detailed in (Yakubovskiy2019), introducing newer architectures as backbones to the encoder portion of the U-Net. Since network input/output overhead is a critical bottleneck for federated learning, we focused on architectures which provide a good balance between network size and performance on established benchmarks. Hence, we included MobileNet V2 (sandler2018mobilenetv2) and ResNet-18 (he2016deep) as backbones. Moreover, we considered MoNet (knolle2020efficient), a novel lightweight U-Net-like architecture with extremely few parameters specifically optimised for FL. Lastly, for comparability to the original U-Net, we utilised an eleven-layer VGG architecture with Batch Normalisation (simonyan2014very) (VGG11 BN). An overview of the number of parameters (i.e. model size) and Multiply/Accumulate operations (MACs) can be found in Table 2. We note that MoNet relies on large receptive field dilated (atrous) convolutions which introduce a substantial number of operations despite the small network size. Moreover, the MobileNet-V2 architecture is optimised for CPU performance (orsic2019defense). We therefore carried out timing experiments both on CPU and GPU (compare Table 4).
3.5 Hyperparameter Optimisation
We performed hyperparameter optimisation on non-private baseline models in a decentralised manner over the entire federation to obtain suitable values for the learning rate, the beta parameters of the Adam optimiser, and the translation, rotation and scale values for image augmentation. This corresponds to local hyperparameter optimisation prior to DP training, avoiding repeated dataset interaction which would consume the privacy budget. For FL, we additionally optimised the synchronisation rate, the number of distinct image augmentations, and whether the model parameters were weighted by the number of samples on each worker during aggregation. The settings found through optimisation were used to train the corresponding models with DP.
All models were trained and evaluated on the MSD Liver segmentation task (simpson2019large). The dataset was split into a 63% training set, 7% validation set and 30% held-out testing set. All architectures were pre-trained on a pancreatic CT segmentation dataset from our institution to improve convergence speed and offer equal starting conditions to all models.
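A patient-level split into these proportions can be sketched as follows (plain Python; the 63/7/30 fractions are those stated above, while the patient identifiers and seed are placeholders):

```python
import random

def split_patients(patient_ids, fractions=(0.63, 0.07, 0.30), seed=42):
    """Shuffle patient IDs and split them into train/validation/test lists.

    Splitting by patient (rather than by image) prevents images of the same
    patient from leaking across the train/test boundary.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train, val = ids[:n_train], ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_patients(range(100))
```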
4.1 Segmentation results
Model evaluation results on the test set are listed in Table 3. We observed that models trained with FL are able to achieve performance on par with locally trained models. In line with previous results (abadi2016deep), all models witnessed performance deterioration due to the utilisation of DP. Surprisingly, we found the performance penalty to be especially high for the ResNet-18 architecture, which did not converge at all in the FL setting with DP. Moreover, we consistently found the VGG-11 BN architecture to perform best. This finding suggests that a high parameter count is not the sole determinant of performance deterioration due to DP, as suggested in previous work (papernot2019making). Instead, both the collaborative training setting and the architecture itself seem to contribute to this phenomenon. However, these findings await detailed future analysis. Moreover, the high variance of the per-image Dice score distributions for the ResNet-18 and MobileNet V2 backbones suggests a disparate effect of DP on different images (see Figure 1), with the MoNet and VGG-11 BN architectures maintaining more consistent performance in the privately trained local setting. Exemplary segmentation results are shown in Figure 2.
In Table 4, we compare the time required per epoch of training for the various models. We found the optimised, widely-used architectures to train more efficiently. In comparison, MoNet trades off computational efficiency for a much smaller size to alleviate network input/output constraints.
Table 4: Mean and standard deviation [s] for one epoch with equal settings for all models. CPU: Central Processing Unit; GPU: Graphics Processing Unit.
4.2 Model Inversion Attack
In this study we employ a novel gradient-based model inversion attack based on the work of Geiping et al. (geiping2020). In our setting, the adversary starts with a randomly initialised image-mask pair and a captured model update. The optimisation task executed by the adversary involves perturbing the image to such an extent that it (along with a corresponding segmentation mask) produces a gradient update similar to the one the adversary captured. We utilise cosine similarity as the cost function in this optimisation process. The procedure is repeated until either the loss starts diverging or the final iteration is reached.
We note that for our method the attacker is assumed to have access to a segmentation mask corresponding to the sensitive image. We consider this a limitation of our approach and note that for smaller networks, the utilisation of a randomly initialised segmentation mask can provide the adversary with a suitable reconstruction result. However, for the purposes of medical imaging, larger models such as VGG-11 or MoNet require the segmentation mask in order to allow the attack to converge and produce an acceptable reconstruction result.
Gradient-based model inversion attacks were performed in two settings: initially, gradients captured from FL training without DP were attacked. We then evaluated the attacks on the same architectures with the addition of DP. As presented in Figure 3, non-private models sharing unprotected updates during training risk having their data reconstructed in full. In comparison, models trained with the addition of DP yield no usable information to the inversion attack, regardless of architecture. In general, architectural complexity seems to inhibit attack success to a greater degree than shown in previous work on classification (geiping2020).
5 Discussion and Conclusion
To the best of our knowledge, this is the first work to demonstrate DP-SGD-based collaborative model training in the context of semantic medical image segmentation. Our main conclusion is that the provision of rigorous privacy guarantees is possible in the FL setting while maintaining high model utility.

One promising research direction we highlight is the investigation of the relationship between privacy-oriented modes of training and the selection of an optimal model architecture. We found that larger model architectures can be more robust to noise addition, whereas lightweight models can be advantageous in non-private FL settings. This highlights a promising research area that considers the application of privacy-preserving mechanisms in task-specific deployments in order to better tailor defence mechanisms to learning tasks with optimal utility preservation. Additionally, we outline a requirement for investigations into pragmatic applications of DP, in order to allow privately trained models to be interpretable by machine learning researchers and to facilitate the widespread utilisation of private collaborative model training.

Our novel model inversion attack in the unprotected FL setting resulted in a catastrophic privacy breach while only utilising a segmentation mask and a shared model update. This highlights that FL alone is an insufficient privacy preservation mechanism in collaborative learning and should instead be regarded as a method for preserving data ownership/governance which facilitates controlled data access. As our method requires possession of a segmentation mask by the adversary, future work will include a natural relaxation of this requirement and the study of robust, scalable model inversion attacks. We note that, as in other model inversion implementations, supporting large batches of images is a non-trivial task; we therefore also outline this area as a direction for future work.
We expect that our work will stimulate further research on privacy-preserving machine learning, essential to large scale, multi-site medical imaging analysis, in order to allow collaborative model training while mitigating associated privacy risks.
Georgios Kaissis received funding from the Technical University of Munich, School of Medicine Clinician Scientist Programme (KKF), project reference H14. Dmitrii Usynin received funding from the Technical University of Munich/Imperial College London Joint Academy for Doctoral Studies. This research was supported by the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare. The funders played no role in the design of the study, the preparation of the manuscript or the decision to publish. The liver segmentation dataset is described in and made available at https://arxiv.org/pdf/1902.09063.
The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.
The authors declare no conflicts of interest.