When can Multi-Site Datasets be Pooled for Regression? Hypothesis Tests, ℓ_2-consistency and Neuroscience Applications

09/02/2017
by Hao Henry Zhou, et al.

Many studies in biomedical and health sciences involve small sample sizes due to logistical or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning to address distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it does not) -- independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, for both classical and high-dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and show how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer's disease studies, we present empirical results showing that in regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.

research
12/17/2018

Likelihood Ratio Test in Multivariate Linear Regression: from Low to High Dimension

Multivariate linear regressions are widely used statistical tools in man...
research
04/03/2020

A Note on Double Pooling Tests

We present double pooling, a simple, easy-to-implement variation on test...
research
05/04/2022

Validating Approximate Slope Homogeneity in Large Panels

Statistical inference for large data panels is omnipresent in modern eco...
research
10/24/2020

Shared Space Transfer Learning for analyzing multi-site fMRI data

Multi-voxel pattern analysis (MVPA) learns predictive models from task-b...
research
02/09/2023

Surrogate-Assisted Federated Learning of high dimensional Electronic Health Record Data

Surrogate variables in electronic health records (EHR) play an important...
research
03/29/2022

Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets

Pooling multiple neuroimaging datasets across institutions often enables...
research
08/13/2020

An estimator for predictive regression: reliable inference for financial economics

Estimating linear regression using least squares and reporting robust st...
