Support Estimation via Regularized and Weighted Chebyshev Approximations

by Chien et al.

We introduce a new framework for estimating the support size of an unknown distribution that improves upon known approximation-based techniques. Our main contributions are a rigorous new weighted Chebyshev polynomial approximation method and the introduction of regularization terms into the problem formulation that provably improve the performance of state-of-the-art approximation-based approaches. We present both theoretical and computer simulation results that illustrate the utility and performance improvements of our method. The theoretical analysis relies on jointly optimizing the bias and variance components of the risk, and on combining new weighted minimax polynomial approximation techniques with discretized semi-infinite programming solvers. This setting allows the estimation problem to be cast as a linear program (LP) with a small number of variables and constraints that may be solved as efficiently as the original Chebyshev approximation-based problem. The described approach also applies to the support coverage and entropy estimation problems. Our newly developed technique is tested on synthetic data and used to estimate the number of bacterial species in the human gut. On synthetic datasets, we observed up to five-fold improvements in the value of the worst-case risk. For the bioinformatics application, metagenomic data from the NIH Human Gut and the American Gut Microbiome were combined and processed to obtain lists of bacterial taxonomies. These were subsequently used to compute the bacterial species histograms and to estimate the number of bacterial species in the human gut at roughly 2350, with the species being represented by trillions of cells.
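As a rough illustration of the histogram step mentioned above, the sketch below computes a species histogram from a list of observations and applies a generic linear estimator to it. It assumes that "species histogram" means the number of species observed exactly j times, which the abstract does not spell out. The names `fingerprint`, `linear_estimate`, and `plug_in` are hypothetical, and the plug-in coefficients simply count the observed species; they are not the regularized weighted Chebyshev coefficients of the paper.

```python
from collections import Counter


def fingerprint(samples):
    """Return a mapping j -> number of distinct species seen exactly j times.

    Assumption: this is what the abstract calls the species histogram.
    """
    species_counts = Counter(samples)          # species -> times observed
    return Counter(species_counts.values())    # times observed -> species


def linear_estimate(fp, coeff):
    """Generic linear estimator: sum over j of coeff(j) * fp[j]."""
    return sum(coeff(j) * h for j, h in fp.items())


def plug_in(j):
    """Hypothetical baseline coefficients: count every observed species once."""
    return 1.0 if j >= 1 else 0.0


# Toy data: A is seen 3 times, B twice, C and D once each.
samples = ["A", "B", "A", "C", "A", "B", "D"]
fp = fingerprint(samples)
observed = linear_estimate(fp, plug_in)
```

A polynomial-approximation-based estimator would keep the same linear form and replace `plug_in` with coefficients chosen to account for species that were never observed.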




