Streaming Algorithms for Support-Aware Histograms

07/18/2022
by Justin Y. Chen, et al.

Histograms, i.e., piecewise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the L_p norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, this error measure has a drawback: it treats the approximation errors of all items in the same way, irrespective of whether the mass of an item is important for the downstream application that uses the approximation. As a result, even relatively simple distributions cannot be approximated by succinct histograms without incurring large error. In this paper, we address this issue by adapting the definition of approximation so that only the errors of the items that belong to the support of the distribution are considered. Under this definition, we develop efficient 1-pass and 2-pass streaming algorithms that compute near-optimal histograms in sub-linear space. We also present lower bounds on the space complexity of this problem. Surprisingly, under this notion of error, there is an exponential gap in the space complexity of 1-pass and 2-pass streaming algorithms. Finally, we demonstrate the utility of our algorithms on a collection of real and synthetic data sets.
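
To make the difference between the two error measures concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm or its exact definition: the abstract does not give formulas, so the sketch assumes the error is the sum of p-th powers of per-item differences and that the "support" is the set of items with nonzero mass. The function names (`expand`, `lp_error`, `support_aware_error`) and the toy distribution are hypothetical.

```python
from typing import List, Sequence, Tuple

# A bucket is (start, end, value): the histogram equals `value`
# on every item in the inclusive range start..end.
Bucket = Tuple[int, int, float]


def expand(buckets: Sequence[Bucket], n: int) -> List[float]:
    """Expand a piecewise constant histogram into one value per item in [0, n)."""
    h = [0.0] * n
    for start, end, value in buckets:
        for i in range(start, end + 1):
            h[i] = value
    return h


def lp_error(x: Sequence[float], h: Sequence[float], p: float = 1.0) -> float:
    """Sum of |x_i - h_i|^p over ALL items in the domain."""
    return sum(abs(a - b) ** p for a, b in zip(x, h))


def support_aware_error(x: Sequence[float], h: Sequence[float], p: float = 1.0) -> float:
    """Same sum, restricted to the support (assumed here: items with x_i != 0)."""
    return sum(abs(a - b) ** p for a, b in zip(x, h) if a != 0)


# Toy distribution: mass 5 on every other item, 0 elsewhere.
x = [5, 0, 5, 0, 5, 0, 5, 0]

# A single-bucket histogram with constant value 5 over the whole domain.
h = expand([(0, 7, 5.0)], len(x))

print(lp_error(x, h))             # 20.0: each of the four empty items contributes 5
print(support_aware_error(x, h))  # 0.0: every item in the support is matched exactly
```

In this toy case one bucket incurs error 20 under the all-items measure, because it is charged for the empty items, but error 0 once only items in the support are counted. This is the kind of simple distribution the abstract refers to when it says succinct histograms can incur large error under the traditional measure.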

research
06/15/2022

Streaming Algorithms for Ellipsoidal Approximation of Convex Polytopes

We give efficient deterministic one-pass streaming algorithms for findin...
research
09/25/2019

Streaming PTAS for Binary ℓ_0-Low Rank Approximation

We give a 3-pass, polylog-space streaming PTAS for the constrained binar...
research
04/11/2021

Graph Streaming Lower Bounds for Parameter Estimation and Property Testing via a Streaming XOR Lemma

We study space-pass tradeoffs in graph streaming algorithms for paramete...
research
11/20/2019

Streaming Frequent Items with Timestamps and Detecting Large Neighborhoods in Graph Streams

Detecting frequent items is a fundamental problem in data streaming rese...
research
07/15/2021

An Efficient Semi-Streaming PTAS for Tournament Feedback ArcSet with Few Passes

We present the first semi-streaming PTAS for the minimum feedback arc se...
research
02/20/2018

Sublinear Algorithms for MAXCUT and Correlation Clustering

We study sublinear algorithms for two fundamental graph problems, MAXCUT...
research
02/18/2020

How to Solve Fair k-Center in Massive Data Models

Fueled by massive data, important decision making is being automated wit...