Selective Inference with Distributed Data
Large datasets are often spread across many machines that compute in parallel and communicate with a central machine through short messages. In this paper, we consider a sparse regression setting and develop a new procedure for selective inference with distributed data. Many distributed procedures exist for point estimation in the sparse setting, but few options exist for estimating uncertainties or conducting hypothesis tests in models based on the estimated sparsity. In our setting, each machine solves a sparse generalized linear regression and communicates its selected set of predictors to the central machine, which then forms a generalized linear model with the selected predictors. How do we conduct selective inference for the selected regression coefficients? Is it possible to reuse the distributed data, in an aggregated form, for selective inference? Our proposed procedure bases approximately valid selective inference on an asymptotic likelihood. It requires only aggregated, relatively low-dimensional information from each machine; this information is merged at the central machine to construct selective inference. Our procedure is also broadly applicable as a solution to the p-value lottery problem that arises when models are selected on random splits of data.
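The communication pattern described above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes squared-error loss (the Gaussian special case of a generalized linear model), a lasso fitted by coordinate descent as the local selection step, a union rule for merging the selected sets, and Gram-type summaries as the aggregated information. The function names, the penalty level `lam`, and the simulated data are all hypothetical. The sketch stops before any inference, since a valid analysis would still need the selective correction that the paper develops.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_select(X, y, lam, n_iter=100):
    """Indices with nonzero coefficients for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual (beta_j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return {j for j in range(p) if beta[j] != 0.0}

def simulate_machine(n, p, beta_true, rng):
    """Hypothetical local dataset: Gaussian design, linear signal, unit noise."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(X[i][j] * beta_true[j] for j in range(p)) + rng.gauss(0, 1)
         for i in range(n)]
    return X, y

def local_summaries(X, y, E):
    """Low-dimensional statistics restricted to E: X_E^T X_E and X_E^T y."""
    n = len(X)
    gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in E] for a in E]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in E]
    return gram, xty

rng = random.Random(0)
p, n_local, n_machines = 10, 200, 3
beta_true = [2.0, -1.5] + [0.0] * (p - 2)
machines = [simulate_machine(n_local, p, beta_true, rng) for _ in range(n_machines)]

# Step 1: each machine selects predictors and sends only the index set.
selected_sets = [lasso_select(X, y, lam=0.2) for X, y in machines]

# Step 2: the central machine merges the selections (union is an assumed rule).
E = sorted(set().union(*selected_sets))

# Step 3: each machine sends |E| x |E| and |E|-length summaries; the centre sums them.
summaries = [local_summaries(X, y, E) for X, y in machines]
k = len(E)
gram = [[sum(s[0][a][b] for s in summaries) for b in range(k)] for a in range(k)]
xty = [sum(s[1][a] for s in summaries) for a in range(k)]

print("selected sets per machine:", selected_sets)
print("merged model E:", E)
```

Each message in this sketch has size that depends on the number of selected predictors rather than on the local sample size, which is the sense in which the communicated information is low-dimensional. Fitting the merged model from `gram` and `xty` and reporting classical intervals would ignore the fact that `E` was chosen using the same data.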