Support Estimation via Regularized and Weighted Chebyshev Approximations
We introduce a new framework for estimating the support size of an unknown distribution which improves upon known approximation-based techniques. Our main contributions are a rigorous new weighted Chebyshev polynomial approximation method and regularization terms, added to the problem formulation, that provably improve the performance of state-of-the-art approximation-based approaches. We present both theoretical and computer simulation results that illustrate the utility and performance improvements of our method. The theoretical analysis relies on jointly optimizing the bias and variance components of the risk, and on combining new weighted minimax polynomial approximation techniques with discretized semi-infinite programming solvers. This setting allows the estimation problem to be cast as a linear program (LP) with a small number of variables and constraints that may be solved as efficiently as the original Chebyshev approximation-based problem. The described approach also applies to the support coverage and entropy estimation problems. Our newly developed technique is tested on synthetic data and used to estimate the number of bacterial species in the human gut. On synthetic datasets, we observed up to five-fold improvements in the value of the worst-case risk. For the bioinformatics application, metagenomic data from the NIH Human Gut and the American Gut Microbiome were combined and processed to obtain lists of bacterial taxonomies. These were subsequently used to compute the bacterial species histograms and to estimate the number of bacterial species in the human gut at roughly 2350, with the species being represented by trillions of cells.
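To make the LP formulation concrete, the following is a minimal sketch of a discretized weighted minimax polynomial fit, assuming NumPy and SciPy are available. It is not the estimator from the paper: the function name `weighted_minimax_fit`, the target `f`, the weight `w`, the interval, the degree, and the grid size are all illustrative assumptions, and the regularization terms mentioned above are omitted. The sketch only shows how replacing the continuum of constraints with a finite grid turns the weighted minimax problem into an LP with degree + 2 variables.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from scipy.optimize import linprog


def weighted_minimax_fit(f, w, a, b, degree, n_grid=201):
    """Discretized weighted minimax fit on [a, b], posed as an LP.

    Minimizes t subject to |w(x_i) * (p(x_i) - f(x_i))| <= t on a grid,
    where p is expressed in the Chebyshev basis. All arguments are
    hypothetical placeholders, not the paper's choices.
    """
    x = np.linspace(a, b, n_grid)
    u = (2.0 * x - (a + b)) / (b - a)      # map the grid to [-1, 1]
    V = C.chebvander(u, degree)            # shape (n_grid, degree + 1)
    fx, wx = f(x), w(x)
    WV = wx[:, None] * V
    ones = np.ones((n_grid, 1))

    # Variables are (c_0, ..., c_degree, t). Two inequalities per grid point:
    #   w * (V c - f) <= t   and   -w * (V c - f) <= t
    A_ub = np.vstack([np.hstack([WV, -ones]), np.hstack([-WV, -ones])])
    b_ub = np.concatenate([wx * fx, -wx * fx])

    cost = np.zeros(degree + 2)
    cost[-1] = 1.0                         # objective: minimize t
    # Coefficients are free; linprog would otherwise default to c >= 0.
    bounds = [(None, None)] * (degree + 1) + [(0.0, None)]

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:-1], res.x[-1]           # Chebyshev coefficients, max error


# Sanity check on a classical case: the best degree-1 uniform approximation
# of x^2 on [-1, 1] is the constant 1/2, with maximum error 1/2.
coeffs, err = weighted_minimax_fit(
    lambda x: x**2, lambda x: np.ones_like(x), -1.0, 1.0, degree=1
)
print(coeffs, err)
```

With a unit weight the sketch should recover a maximum error of about 0.5 on this test case. Passing a non-constant `w` changes which parts of the interval dominate the worst-case error, which is the general idea behind a weighted approximation; the specific weights and constraints used in the paper are not reproduced here.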