Surrogate-Assisted Federated Learning of high dimensional Electronic Health Record Data

02/09/2023
by   Yue Liu, et al.

Surrogate variables in electronic health records (EHR) play an important role in biomedical studies because chart-reviewed gold standard labels are scarce or absent, so supervised methods that use only labeled data perform poorly. Meanwhile, synthesizing multi-site EHR data is crucial for powerful and generalizable statistical learning, but it runs into the privacy constraint that individual-level data may not be transferred from the local sites, known as DataSHIELD. In this paper, we develop a novel approach named SASH, for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. SASH leverages sizable unlabeled data from multiple local sites, with EHR surrogates predictive of the response, to assist the training with labeled data and substantially improve statistical efficiency. It first extracts a preliminary supervised estimator to realize convex training of a regularized single index model for the surrogate at each local site, and then aggregates the fitted local models for accurate learning of the target outcome model. It protects individual-level information from the local sites through summary-statistics-based data aggregation. We show that, under mild conditions, our method attains substantially lower estimation error rates than the supervised or local semi-supervised (SS) methods, and is asymptotically equivalent to the ideal individual patient data pooled estimator (IPD), which is available only in the absence of privacy constraints. Through simulation studies, we demonstrate that SASH outperforms all existing supervised or SS federated approaches and performs close to IPD. Finally, we apply our method to develop a high dimensional genetic risk model for type II diabetes using large-scale biobank data sets from UK Biobank and Mass General Brigham, where only a small fraction of subjects from the latter have been labeled via chart review.
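
The idea of summary-statistics-based aggregation can be illustrated with a deliberately simplified sketch. This is not the SASH estimator: it uses plain low-dimensional least squares with no regularization, no surrogates and no single index model, and every function name and simulation setting below is hypothetical. It only shows that when each site shares the summaries X'X and X'y instead of patient rows, the aggregated fit coincides with the fit on pooled individual-level data.

```python
import random


def local_summary(X, y):
    """Compute (X'X, X'y) at one site; individual rows never leave the site."""
    p = len(X[0])
    xtx = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    return xtx, xty


def aggregate(summaries):
    """Add up the per-site summaries at the coordinating center."""
    p = len(summaries[0][1])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for site_xtx, site_xty in summaries:
        for j in range(p):
            xty[j] += site_xty[j]
            for k in range(p):
                xtx[j][k] += site_xtx[j][k]
    return xtx, xty


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def simulate_site(n, beta, rng):
    """Hypothetical site data: Gaussian covariates, linear outcome, small noise."""
    X = [[rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, 0.1) for row in X]
    return X, y


rng = random.Random(0)
beta_true = [1.0, -2.0, 0.5]  # assumed coefficients, for illustration only
sites = [simulate_site(n, beta_true, rng) for n in (50, 80, 120)]

# Federated fit: only the summary statistics are shared across sites.
beta_fed = solve(*aggregate([local_summary(X, y) for X, y in sites]))

# Reference fit on pooled individual-level data (not allowed under the privacy constraint).
X_all = [row for X, _ in sites for row in X]
y_all = [yi for _, y in sites for yi in y]
beta_pooled = solve(*local_summary(X_all, y_all))
```

In this toy setting the two fits agree up to floating-point error because the least squares normal equations depend on the data only through these sums. The high-dimensional, surrogate-assisted setting of the paper is harder, and there the abstract claims asymptotic rather than exact equivalence to IPD.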


research
03/26/2020

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Electronic Health Records (EHR) data, a rich source for biomedical resea...
research
05/04/2021

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Risk modeling with EHR data is challenging due to a lack of direct obser...
research
09/12/2021

FedTriNet: A Pseudo Labeling Method with Three Players for Federated Semi-supervised Learning

Federated Learning has shown great potentials for the distributed data u...
research
10/24/2021

Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response

A notable challenge of leveraging Electronic Health Records (EHR) for tr...
research
09/12/2022

Semi-supervised Triply Robust Inductive Transfer Learning

In this work, we propose a semi-supervised triply robust inductive trans...
research
02/24/2019

High Dimensional Restrictive Federated Model Selection with multi-objective Bayesian Optimization over shifted distributions

A novel machine learning optimization process coined Restrictive Federat...
research
09/02/2017

When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, ℓ_2-consistency and Neuroscience Applications

Many studies in biomedical and health sciences involve small sample size...
