Markov Subsampling Based on Huber Criterion

12/12/2021
by Tieliang Gong, et al.

Subsampling is an important technique for tackling the computational challenges posed by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to the samples that appear to have a large impact. When the noise level is high, such procedures tend to pick many outliers and thus often perform unsatisfactorily in practice. To address this issue, we design a new Markov subsampling strategy based on the Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as a refined working dataset for efficient processing. HMS is built upon a Metropolis-Hastings procedure, in which the inclusion probability of each sampling unit is determined using the Huber criterion to prevent overscoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent, with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large-scale simulations and real data examples.
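The abstract does not spell out the acceptance rule, so the following is only a minimal sketch of how a Metropolis-Hastings style subsampler driven by a Huber loss might look, not the authors' algorithm. It assumes a linear model, a pilot estimate `beta_pilot` supplied by the caller (for example, from a fit on a small uniform subsample), uniform proposals, and an acceptance probability of min(1, exp(loss_current - loss_candidate)). The names `huber_loss` and `huber_markov_subsample`, the threshold `delta=1.345`, and the choice to record only accepted moves are all hypothetical.

```python
import numpy as np


def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))


def huber_markov_subsample(X, y, n_sub, beta_pilot, delta=1.345, rng=None):
    """Draw n_sub indices (with possible repeats) by a Metropolis-Hastings
    style chain whose acceptance ratio uses the Huber loss of each point.

    Assumption (not from the paper): uniform proposals and acceptance
    probability min(1, exp(loss_current - loss_candidate)).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Huber loss of every observation under the pilot estimate.
    losses = huber_loss(y - X @ beta_pilot, delta)

    current = rng.integers(n)  # uniform starting point
    chosen = [current]
    while len(chosen) < n_sub:
        cand = rng.integers(n)  # uniform proposal
        diff = losses[current] - losses[cand]
        # Candidates with a smaller loss are always accepted; candidates
        # with a larger loss are accepted with probability exp(diff) < 1.
        accept_prob = 1.0 if diff >= 0 else np.exp(diff)
        if rng.random() < accept_prob:
            current = cand
            chosen.append(current)  # record accepted moves only
    return np.array(chosen)
```

Because the Huber loss grows only linearly for large residuals, gross outliers still receive a small acceptance probability under this rule, while moderately noisy points are not penalized as sharply as they would be under a squared loss. The returned indices can then be used to fit the model on `X[idx], y[idx]`.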

research
07/29/2022

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In today's modern era of Big data, computationally efficient and scalabl...
research
03/10/2021

A cautionary note on the Hanurav-Vijayan sampling algorithm

We consider the Hanurav-Vijayan sampling design, which is the default me...
research
03/02/2018

Gradient-based Sampling: An Adaptive Importance Sampling for Least-squares

In modern data analysis, random sampling is an efficient and widely-used...
research
08/12/2022

A sub-sampling algorithm preventing outliers

Nowadays, in many different fields, massive data are available and for s...
research
11/18/2016

Robust and Scalable Column/Row Sampling from Corrupted Big Data

Conventional sampling techniques fall short of drawing descriptive sketc...
research
05/05/2015

On the Feasibility of Distributed Kernel Regression for Big Data

In modern scientific research, massive datasets with huge numbers of obs...
research
01/22/2018

Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

Data collection is a fundamental problem in the scenario of big data, wh...
