Learning-based Support Estimation in Sublinear Time

06/15/2021
by Talya Eden, et al.

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade has resulted in algorithms that estimate the support up to ±εn from a sample of size O(log^2(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, which rules out further improvements to the sample complexity in this setting. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^(1-Θ(1/log(1/ε))). We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
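To give some intuition for why a frequency predictor helps, here is a minimal Python sketch. It is not the algorithm from the paper; it is a textbook-style inverse-frequency estimator, shown under the simplifying assumption that the predictor returns the exact number of occurrences of an element. In that case, if x is drawn uniformly from the n data items, the expected value of n / count(x) equals the number of distinct elements. The names `estimate_support` and `predict_count` are hypothetical and chosen only for this illustration.

```python
import random
from collections import Counter


def estimate_support(data, predict_count, sample_size, rng):
    """Estimate the number of distinct elements in `data`.

    Illustrative sketch only (not the paper's method): draws `sample_size`
    items uniformly with replacement and averages n / predicted_count.
    `predict_count` is a hypothetical oracle returning the (approximate)
    number of occurrences of an element in `data`.
    """
    n = len(data)
    total = 0.0
    for _ in range(sample_size):
        x = data[rng.randrange(n)]
        # Clamp to 1 so that a bad prediction cannot cause division by zero.
        total += 1.0 / max(1.0, float(predict_count(x)))
    return n * total / sample_size


# Toy data set: 20,000 items drawn from 1,000 possible values.
rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(20000)]

# Exact counts stand in for the learned predictor (an assumption).
counts = Counter(data)
true_support = len(counts)

estimate = estimate_support(data, lambda x: counts[x], 2000, rng)
print(f"true support: {true_support}, estimate: {estimate:.1f}")
```

With an exact oracle this simple estimator is unbiased. With a predictor that is only correct up to a constant factor, it would be off by up to that same factor, which is why the paper's algorithms need a more careful design to reach ±εn error.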



02/25/2020

An Improved Algorithm for Dynamic Set Cover

We consider the minimum set cover problem in a dynamic setting. Here, we...
02/25/2020

Dynamic Set Cover: Improved Amortized and Worst-Case Update Time

In the dynamic minimum set cover problem, a challenge is to minimize the...
11/22/2019

Privately Learning Thresholds: Closing the Exponential Gap

We study the sample complexity of learning threshold functions under the...
05/18/2018

Fast Multivariate Log-Concave Density Estimation

We present a computational approach to log-concave density estimation. T...
06/09/2020

Sublinear Algorithms and Lower Bounds for Metric TSP Cost Estimation

We consider the problem of designing sublinear time algorithms for estim...
11/21/2017

Revisiting the Set Cover Conjecture

In the Set Cover problem, the input is a ground set of n elements and a ...
09/15/2020

Evaluating representations by the complexity of learning low-loss predictors

We consider the problem of evaluating representations of data for use in...