
Learning-based Support Estimation in Sublinear Time

by Talya Eden, et al.

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log^2(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^(1-Θ(1/log(1/ε))). We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
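To make the role of the frequency predictor concrete, here is a minimal Python sketch of a much simpler setting than the one the abstract describes. It is not the paper's algorithm: it assumes a hypothetical predictor that returns the exact frequency of each element, whereas the paper handles predictors that are only correct up to a constant factor. With exact frequencies, a uniformly sampled element x satisfies E[1/f(x)] = D/n, where D is the number of distinct elements, so n times the sample mean of 1/f(x) is an unbiased estimate of D. The names `estimate_support` and `predictor` are illustrative assumptions.

```python
import random
from collections import Counter


def estimate_support(data, predictor, sample_size, rng):
    """Estimate the number of distinct elements in `data` from a random sample.

    Assumption (not from the paper): `predictor(x)` returns the exact number
    of occurrences of x in `data`. Each distinct element with frequency f is
    drawn with probability f/n and contributes 1/f, so the expected
    contribution of one draw is D/n.
    """
    n = len(data)
    # Sample uniformly with replacement from the data set.
    sample = [data[rng.randrange(n)] for _ in range(sample_size)]
    inverse_freq_mean = sum(1.0 / predictor(x) for x in sample) / sample_size
    return n * inverse_freq_mean


rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(20000)]  # synthetic data set
counts = Counter(data)                              # stands in for a perfect predictor
true_support = len(counts)
estimate = estimate_support(data, counts.__getitem__, 2000, rng)
print(true_support, round(estimate))
```

In this idealized setting the estimate concentrates quickly because every sampled element carries exact information about how much of the data set it accounts for. The harder question addressed in the abstract is what remains achievable when the predictor is only approximately correct.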




An Improved Algorithm for Dynamic Set Cover

We consider the minimum set cover problem in a dynamic setting. Here, we...

Dynamic Set Cover: Improved Amortized and Worst-Case Update Time

In the dynamic minimum set cover problem, a challenge is to minimize the...

Fast Multivariate Log-Concave Density Estimation

We present a computational approach to log-concave density estimation. T...

Sublinear Algorithms and Lower Bounds for Metric TSP Cost Estimation

We consider the problem of designing sublinear time algorithms for estim...

Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization

The problem of learning threshold functions is a fundamental one in mach...

Revisiting the Set Cover Conjecture

In the Set Cover problem, the input is a ground set of n elements and a ...

Support Estimation with Sampling Artifacts and Errors

The problem of estimating the support of a distribution is of great impo...