
(Learned) Frequency Estimation Algorithms under Zipfian Distribution
The frequencies of the elements in a data stream are an important statis...

Buffered Count-Min Sketch on SSD: Theory and Experiments
Frequency estimation data structures such as the count-min sketch (CMS) ...

Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions
The Count-Min sketch is an important and well-studied data summarization...

Composite Hashing for Data Stream Sketches
In rapid and massive data streams, it is often not possible to estimate ...

Graphical Model Sketch
Structured high-cardinality data arises in many domains, and poses a maj...

Efficient Tensor Contraction via Fast Count Sketch
Sketching uses randomized hash functions for dimensionality reduction an...

An Econometric View of Algorithmic Subsampling
Datasets that are terabytes in size are increasingly common, but compute...

SALSA: Self-Adjusting Lean Streaming Analytics
Counters are the fundamental building block of many data sketching schemes, which hash items to a small number of counters and account for collisions to provide good approximations for frequencies and other measures. Most existing methods rely on fixed-size counters, which may be wasteful in terms of space, as counters must be large enough to eliminate any risk of overflow. Instead, some solutions use small, fixed-size counters that may overflow into secondary structures. This paper takes a different approach. We propose a simple and general method called SALSA for dynamic resizing of counters and show its effectiveness. SALSA starts with small counters, and overflowing counters simply merge with their neighbors. SALSA can thereby allow more counters for a given space, expanding them as necessary to represent large numbers. Our evaluation demonstrates that, at the cost of a small overhead for its merging logic, SALSA significantly improves the accuracy of popular schemes (such as Count-Min Sketch and Count Sketch) over a variety of tasks. Our code is released as open-source [1].
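
The idea of small counters that merge with their neighbors on overflow can be illustrated with a minimal Python sketch of a single counter row. This is an illustration under stated assumptions, not the paper's implementation: the class name `MergingCounterRow`, the per-slot `level` list (a stand-in for whatever compact merge encoding a real implementation would use), the power-of-two aligned merging, and the choice to sum the absorbed counters are all hypothetical choices made here for clarity.

```python
class MergingCounterRow:
    """One row of small counters that merge with an aligned neighbor on overflow.

    Hypothetical simplification: `level` is kept as a plain list, whereas a
    space-efficient implementation would encode merges far more compactly.
    """

    def __init__(self, num_slots: int, base_bits: int = 8):
        # A power-of-two row length guarantees every aligned block fits.
        if num_slots <= 0 or num_slots & (num_slots - 1):
            raise ValueError("num_slots must be a positive power of two")
        self.base_bits = base_bits
        self.values = [0] * num_slots  # value lives in the first slot of each block
        self.level = [0] * num_slots   # slot i belongs to a block of 2**level[i] slots

    def _capacity(self, level: int) -> int:
        # A block of 2**level slots has base_bits * 2**level bits (assumption:
        # no bits are reserved for merge bookkeeping).
        return (1 << (self.base_bits << level)) - 1

    def query(self, slot: int) -> int:
        size = 1 << self.level[slot]
        return self.values[slot - slot % size]

    def add(self, slot: int, amount: int = 1) -> None:
        level = self.level[slot]
        size = 1 << level
        start = slot - slot % size
        total = self.values[start] + amount
        while total > self._capacity(level):
            if size == len(self.values):
                raise OverflowError("row is fully merged and still overflows")
            # Grow into the aligned block of twice the size, absorbing the
            # neighboring counters by summing them (assumed merge rule).
            self.values[start] = 0
            level += 1
            size *= 2
            start = slot - slot % size
            total += sum(self.values[start:start + size])
            for i in range(start, start + size):
                self.values[i] = 0
                self.level[i] = level
        self.values[start] = total


row = MergingCounterRow(num_slots=4, base_bits=8)
row.add(1, 10)
row.add(0, 255)   # slot 0 is now at the 8-bit maximum
row.add(0, 1)     # overflow: slots 0 and 1 merge into one 16-bit counter
print(row.query(0), row.query(1), row.query(2))  # prints "266 266 0"
```

In a Count-Min style sketch, each hashed row could be backed by a structure like this one. Summing on merge keeps every reported value at least as large as the true count for that slot, at the cost of extra collisions among the merged slots.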