Collaborative Learning From Distributed Data With Differentially Private Synthetic Twin Data

08/09/2023
by Lukas Prediger, et al.

Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population-level statistics, but pooling the sensitive data sets is not possible. We propose a framework in which each party shares a differentially private synthetic twin of its data. We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank. We find that parties engaging in collaborative learning via shared synthetic data obtain more accurate estimates of target statistics than parties using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analyses for those groups. Based on our results, we conclude that sharing synthetic twins is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Distributed sensitive data is often a bottleneck in biomedical research, and our study shows that this bottleneck can be alleviated with privacy-preserving collaborative learning methods.
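The workflow described above can be illustrated with a minimal sketch. The abstract does not specify how the synthetic twins are generated, so the sketch swaps in a deliberately simple stand-in: each party releases Laplace-noised first and second moments of a clipped one-dimensional variable and samples a Gaussian twin from them. The function names, the clipping bounds, the even budget split, and the assumption that the data set size is public are all hypothetical choices made for illustration, not the authors' method.

```python
import numpy as np

def dp_synthetic_twin(data, lo, hi, epsilon, n_synth, rng):
    """Illustrative epsilon-DP synthetic twin of 1-D data (not the paper's generator)."""
    # Clip to known bounds and shift to [0, width] so sensitivities are bounded.
    x = np.clip(np.asarray(data, dtype=float), lo, hi) - lo
    n = len(x)  # assumption: the data set size is treated as public
    width = hi - lo
    eps_each = epsilon / 2.0  # assumption: budget split evenly over two queries
    # Replacing one record changes the mean by at most width / n
    # and the mean of squares by at most width**2 / n.
    noisy_m1 = x.mean() + rng.laplace(scale=(width / n) / eps_each)
    noisy_m2 = (x ** 2).mean() + rng.laplace(scale=(width ** 2 / n) / eps_each)
    var = max(noisy_m2 - noisy_m1 ** 2, 1e-12)  # guard against negative variance
    # Sampling from the noisy model is post-processing and costs no extra budget.
    synth = rng.normal(noisy_m1, np.sqrt(var), size=n_synth) + lo
    return np.clip(synth, lo, hi)

def combined_mean(local_data, shared_twins):
    """Estimate a mean from local real data pooled with other parties' twins."""
    return float(np.mean(np.concatenate([local_data, *shared_twins])))

# Usage: five hypothetical parties, each holding 200 records.
rng = np.random.default_rng(0)
parties = [rng.normal(50.0, 10.0, size=200) for _ in range(5)]
twins = [dp_synthetic_twin(p, 0.0, 100.0, 1.0, 200, rng) for p in parties[1:]]
estimate = combined_mean(parties[0], twins)
```

Here the first party keeps its real records and only ever sees the other parties' synthetic twins, which mirrors the sharing pattern the abstract describes. A realistic generator would model multivariate health data rather than a single Gaussian variable.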


