Optimal Sampling for Generalized Linear Models under Measurement Constraints

07/17/2019
by   Tao Zhang, et al.

Suppose we are using a generalized linear model to predict a scalar outcome Y given a covariate vector X. We consider two related problems and propose a methodology for both. In the first problem, every data point in a large dataset has both Y and X known, but we wish to use a subset of the data to limit computational costs. In the second problem, sometimes called "measurement constraints," Y is expensive to measure and is initially available for only a small portion of the data. The goal is to select another subset of the data where Y will also be measured. We focus on the more challenging but less well-studied measurement constraint problem. A popular approach for the first problem is sampling. However, most existing sampling algorithms require that Y be measured at all data points, so they cannot be used under measurement constraints. We propose an optimal sampling procedure for massive datasets under measurement constraints (OSUMC). We show consistency and asymptotic normality of estimators from a general class of sampling procedures. An optimal oracle sampling procedure is derived, and a two-step algorithm is proposed to approximate the oracle procedure. Numerical results demonstrate the advantages of OSUMC over existing sampling methods.
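
The two-step idea can be illustrated with a minimal sketch for logistic regression. This is not the authors' algorithm: the abstract does not state the sampling probabilities, so the score below (a function of X and a pilot estimate only) is an assumed, illustrative choice, and the names `fit_logistic`, `two_step_sample` and `measure_y` are hypothetical. The sketch only shows the general pattern: fit a pilot model on a small uniform sample, compute sampling probabilities that never look at unmeasured Y, measure Y on the sampled points, and fit an inverse-probability-weighted estimator.

```python
import numpy as np


def fit_logistic(X, y, weights=None, iters=50, ridge=1e-8):
    """Weighted logistic-regression MLE via Newton's method."""
    n, d = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        # Tiny ridge term keeps the Hessian invertible.
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(d)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.linalg.norm(step) < 1e-10:
            break
    return beta


def two_step_sample(X, measure_y, pilot_size, budget, rng):
    """Pilot fit, then X-only sampling probabilities, then weighted fit.

    measure_y(idx) is a hypothetical callback that returns Y for the
    requested rows; it stands in for the expensive measurement.
    """
    n = X.shape[0]
    # Step 1: uniform pilot sample and pilot estimate.
    pilot_idx = rng.choice(n, size=pilot_size, replace=False)
    beta_pilot = fit_logistic(X[pilot_idx], measure_y(pilot_idx))

    # Step 2: sampling probabilities from X and the pilot estimate only.
    # ASSUMPTION: this particular score is illustrative, not taken from the source.
    p = 1.0 / (1.0 + np.exp(-(X @ beta_pilot)))
    v = p * (1.0 - p)
    info = (X * v[:, None]).T @ X / n
    scores = np.sqrt(v) * np.linalg.norm(np.linalg.solve(info, X.T), axis=0)
    probs = scores / scores.sum()

    # Sample with replacement, measure Y there, and reweight by 1 / (n * prob).
    idx = rng.choice(n, size=budget, replace=True, p=probs)
    beta_hat = fit_logistic(X[idx], measure_y(idx), weights=1.0 / (n * probs[idx]))
    return beta_hat, probs


# Simulated usage: Y exists for every row but is only revealed through measure_y.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 1.0])
y_hidden = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat, probs = two_step_sample(
    X, lambda idx: y_hidden[idx], pilot_size=500, budget=2000, rng=rng
)
print(beta_hat)
```

The inverse-probability weights are what keep the estimator targeting the full-data fit even though rows are sampled unevenly; sampling with replacement is used here only for simplicity.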


research
10/08/2022

Unweighted estimation based on optimal sample under measurement constraints

To tackle massive data, subsampling is a practical approach to select th...
research
08/12/2022

A sub-sampling algorithm preventing outliers

Nowadays, in many different fields, massive data are available and for s...
research
10/18/2019

Sampling strategy and statistical analysis for radioactive waste characterization

This paper describes the methodology we have developed to define a sampl...
research
01/09/2016

On Computationally Tractable Selection of Experiments in Measurement-Constrained Regression Models

We derive computationally tractable methods to select a small subset of ...
research
03/18/2021

Optimal soil sampling design based on the maxvol algorithm

Spatial soil sampling is an integral part of a soil survey aimed at crea...
research
05/10/2021

Budget-limited distribution learning in multifidelity problems

Multifidelity methods are widely used for statistical estimation of quan...
