SALSA: Self-Adjusting Lean Streaming Analytics

02/24/2021
by   Ran Ben Basat, et al.
0

Counters are the fundamental building block of many data sketching schemes, which hash items to a small number of counters and account for collisions to provide good approximations for frequencies and other measures. Most existing methods rely on fixed-size counters, which may be wasteful in terms of space, as counters must be large enough to eliminate any risk of overflow. Instead, some solutions use small, fixed-size counters that may overflow into secondary structures. This paper takes a different approach. We propose a simple and general method called SALSA for dynamic re-sizing of counters and show its effectiveness. SALSA starts with small counters, and overflowing counters simply merge with their neighbors. SALSA can thereby allow more counters for a given space, expanding them as necessary to represent large numbers. Our evaluation demonstrates that, at the cost of a small overhead for its merging logic, SALSA significantly improves the accuracy of popular schemes (such as Count-Min Sketch and Count Sketch) over a variety of tasks. Our code is released as open-source [1].

READ FULL TEXT

page 4

page 7

08/14/2019

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

The frequencies of the elements in a data stream are an important statis...
11/06/2021

Frequency Estimation with One-Sided Error

Frequency estimation is one of the most fundamental problems in streamin...
04/27/2018

Buffered Count-Min Sketch on SSD: Theory and Experiments

Frequency estimation data structures such as the count-min sketch (CMS) ...
11/09/2018

Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions

The Count-Min sketch is an important and well-studied data summarization...
03/29/2022

Analysis of Count-Min sketch under conservative update

Count-Min sketch is a hash-based data structure to represent a dynamical...
07/04/2022

Learning state machines via efficient hashing of future traces

State machines are popular models to model and visualize discrete system...
07/15/2020

FetchSGD: Communication-Efficient Federated Learning with Sketching

Existing approaches to federated learning suffer from a communication bo...