Learning-based Support Estimation in Sublinear Time

by   Talya Eden, et al.

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log^2(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^(1 - Θ(1/log(1/ε))). We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
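To make the role of the predictor concrete, here is a minimal sketch of the idealized case in which the predictor returns the exact count of each element. It is not the algorithm from the paper, which handles predictors that are only correct up to a constant factor. The names `estimate_support`, `predicted_freq` and `demo` are hypothetical and chosen for illustration. The idea: an element with count f is drawn with probability f/n, so the expected value of 1/f over a uniform sample equals (number of distinct elements)/n.

```python
import random
from collections import Counter


def estimate_support(sample, predicted_freq, n):
    """Estimate the number of distinct elements in a data set of size n.

    sample: elements drawn uniformly at random (with replacement) from the data set.
    predicted_freq: callable returning the count of an element in the data set
        (assumed exact here; this is an idealization, not the paper's setting).
    """
    # An element with count f is sampled with probability f / n, so
    # E[1 / f] = (#distinct elements) / n. Average 1 / f and scale by n.
    return n * sum(1.0 / predicted_freq(x) for x in sample) / len(sample)


def demo(seed=0):
    rng = random.Random(seed)
    # Toy data set: element i appears i + 1 times, for i = 0..99 (100 distinct elements).
    data = [i for i in range(100) for _ in range(i + 1)]
    counts = Counter(data)  # stands in for a perfect predictor
    sample = [rng.choice(data) for _ in range(2000)]
    return len(counts), estimate_support(sample, counts.__getitem__, len(data))


if __name__ == "__main__":
    true_support, estimate = demo()
    print(f"true support: {true_support}, estimate: {estimate:.1f}")
```

With an exact predictor the estimator is unbiased, and its accuracy depends only on the sample size. The harder question addressed in the abstract is what happens when the predicted frequencies are off by a constant factor.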




