Support Estimation with Sampling Artifacts and Errors

06/14/2020
by Eli Chien, et al.

The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most existing work in this domain has focused on settings that assume perfectly accurate sampling, an assumption that seldom holds in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors, where each sample is assumed to arise from a Poisson repeat channel that simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-CoV-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individuals' results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods compared to those obtained through suitable modifications of state-of-the-art noiseless support estimation methods.
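To make the sampling model concrete, the following is a minimal Python sketch of a Poisson repeat channel. It assumes that each clean sample is independently replaced by a Poisson-distributed number of copies, with zero copies meaning the sample is deleted. The function names, the rate parameter `lam`, the uniform source distribution, and the sample sizes are all illustrative assumptions and are not taken from the paper. The sketch does not implement the proposed estimator; it only shows how repetitions and deletions distort the observed counts that a noiseless method would rely on.

```python
import math
import random
from collections import Counter


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Suitable for small lam only; chosen here to avoid third-party dependencies.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_repeat_channel(samples, lam, rng):
    """Replace each sample by Poisson(lam) copies; zero copies is a deletion."""
    observed = []
    for s in samples:
        observed.extend([s] * poisson_draw(lam, rng))
    return observed


# Hypothetical setup: uniform distribution over 1000 symbols, 500 clean draws.
rng = random.Random(0)
support_size = 1000
clean = [rng.randrange(support_size) for _ in range(500)]
noisy = poisson_repeat_channel(clean, lam=1.0, rng=rng)

# Fingerprints: how many symbols were seen exactly once, twice, and so on.
fingerprint_clean = Counter(Counter(clean).values())
fingerprint_noisy = Counter(Counter(noisy).values())

print("distinct symbols, clean:", len(set(clean)))
print("distinct symbols, noisy:", len(set(noisy)))
print("seen exactly once, clean vs noisy:",
      fingerprint_clean[1], fingerprint_noisy[1])
```

Because deletions remove some symbols entirely and repetitions inflate the multiplicities of others, the noisy fingerprint differs from the clean one even though both come from the same underlying distribution, which is why an estimator designed for noiseless sampling needs to be adapted to this setting.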

research
01/22/2019

Support Estimation via Regularized and Weighted Chebyshev Approximations

We introduce a new framework for estimating the support size of an unkno...
research
07/26/2018

Tackling 3D ToF Artifacts Through Learning and the FLAT Dataset

Scene motion, multiple reflections, and sensor noise introduce artifacts...
research
05/30/2022

T-Wise Presence Condition Coverage and Sampling for Configurable Systems

Sampling techniques, such as t-wise interaction sampling are used to ena...
research
09/24/2021

Sample Efficient Model Evaluation

Labelling data is a major practical bottleneck in training and testing c...
research
09/09/2023

Correcting sampling biases via importance reweighting for spatial modeling

In machine learning models, the estimation of errors is often complex du...
research
10/25/2021

Algorithms for the Communication of Samples

We consider the problem of reverse channel coding, that is, how to simul...
