Learning-based Support Estimation in Sublinear Time

06/15/2021
by Talya Eden, et al.

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem arises in many applications, including biology, genomics, computer systems, and linguistics. A line of research spanning the last decade has resulted in algorithms that estimate the support up to ±εn from a sample of size O(log^2(1/ε) · n/log n), where n is the data set size. Unfortunately, this bound is known to be tight, which rules out further improvements to the sample complexity of the problem in this setting. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/ε) · n^(1-Θ(1/log(1/ε))). We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators of Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
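To make the role of the frequency predictor concrete, here is a minimal sketch of a naive predictor-based estimator. It is not the algorithm proposed in the paper. It relies only on the identity that the number of distinct elements equals the sum, over all n items in the data set, of 1/count(item), so n times the sample average of 1/predicted_count is an estimate of the support size. The names `estimate_support`, `exact_predictor`, and `off_by_two_predictor`, and the toy data, are assumptions made for illustration.

```python
import random
from collections import Counter


def estimate_support(sample, n, predict_count):
    """Naive estimate: n * mean(1 / predicted count) over the sampled items.

    sample        -- items drawn uniformly at random from the data set
    n             -- total number of items in the data set
    predict_count -- hypothetical predictor: item -> estimated number of occurrences
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    return n * sum(1.0 / predict_count(x) for x in sample) / len(sample)


# Toy data set (assumption): 100 distinct values, each appearing 100 times.
data = [i % 100 for i in range(10_000)]
true_counts = Counter(data)

rng = random.Random(0)
sample = [rng.choice(data) for _ in range(200)]


def exact_predictor(x):
    # A perfect predictor returns the true count.
    return true_counts[x]


def off_by_two_predictor(x):
    # A predictor that overestimates every count by a factor of 2.
    return 2 * true_counts[x]


exact_estimate = estimate_support(sample, len(data), exact_predictor)
skewed_estimate = estimate_support(sample, len(data), off_by_two_predictor)
```

With a perfect predictor this sketch recovers the true support size on the toy data, but a predictor that is off by a constant factor shifts the naive estimate by that same factor. That gap is why a predictor that is only correct up to a constant factor needs a more careful algorithm, such as the one analyzed in the paper, to reach ±εn accuracy.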

research
08/01/2023

Nearly Optimal Dynamic Set Cover: Breaking the Quadratic-in-f Time Barrier

The dynamic set cover problem has been subject to extensive research sin...
research
02/25/2020

Dynamic Set Cover: Improved Amortized and Worst-Case Update Time

In the dynamic minimum set cover problem, a challenge is to minimize the...
research
04/04/2023

Set Covering with Our Eyes Wide Shut

In the stochastic set cover problem (Grandoni et al., FOCS '08), we are ...
research
05/18/2018

Fast Multivariate Log-Concave Density Estimation

We present a computational approach to log-concave density estimation. T...
research
07/10/2023

Improved Diversity Maximization Algorithms for Matching and Pseudoforest

In this work we consider the diversity maximization problem, where given...
research
06/08/2023

Analysis of Knuth's Sampling Algorithm D and D'

In this research paper, we address the Distinct Elements estimation prob...
research
10/31/2022

Improved Learning-augmented Algorithms for k-means and k-medians Clustering

We consider the problem of clustering in the learning-augmented setting,...
