CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions

05/26/2023
by Benjamin Coleman, et al.

Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. With the increasing size of web-scale data, such applications often require compression techniques that support fast O(1) random lookup of individual parameters directly on the compressed data (i.e., without blockwise decompression in RAM). While the community has proposed a number of succinct data structures that support queries over compressed representations, these approaches do not fully leverage the low-entropy structure prevalent in real-world workloads to reduce space. Inspired by recent advances in static function construction techniques, we propose a space-efficient representation of immutable key-value data, called CARAMEL, specifically designed for the case where the values are multi-sets. By carefully combining multiple compressed static functions, CARAMEL occupies space proportional to the data entropy with low memory overheads and minimal lookup costs. We demonstrate 1.25-16x compression on practical lookup tasks drawn from real-world systems, improving upon established techniques, including a production-grade read-only database widely used for development within Amazon.com.
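To make "space proportional to the data entropy" concrete, the following minimal sketch compares the zeroth-order empirical entropy of a skewed value column with the cost of a fixed-width code. It is not CARAMEL's construction and does not come from the paper: the helper names and the toy `values` list are hypothetical, and the sketch only illustrates the per-value bit budget that an entropy-proportional structure aims for.

```python
import math
from collections import Counter


def empirical_entropy_bits(values):
    """Zeroth-order empirical entropy of `values`, in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fixed_width_bits(values):
    """Bits per value when every distinct value gets an equal-length code."""
    distinct = len(set(values))
    return max(1, math.ceil(math.log2(distinct)))


# Hypothetical low-entropy column: one value dominates the table.
values = [0] * 12 + [1] * 2 + [2] + [3]

entropy = empirical_entropy_bits(values)
width = fixed_width_bits(values)
print(f"fixed-width code: {width} bits/value")
print(f"empirical entropy: {entropy:.3f} bits/value")
```

The gap between the two numbers is the slack that a fixed-width table leaves on the table when the value distribution is skewed; the more skewed the column, the larger the gap.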

research
11/19/2019

Improved Compressed String Dictionaries

We introduce a new family of compressed data structures to efficiently s...
research
06/04/2021

Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash

A minimal perfect hash function f for a set S of n keys is a bijective f...
research
08/07/2023

Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

In the last decades, the necessity to process massive amounts of textual...
research
02/12/2019

Compressed Range Minimum Queries

Given a string S of n integers in [0,σ), a range minimum query RMQ(i, j)...
research
04/06/2020

Indexing Highly Repetitive String Collections

Two decades ago, a breakthrough in indexing string collections made it p...
research
06/14/2019

Dynamic Path-Decomposed Tries

A keyword dictionary is an associative array whose keys are strings. Rec...
research
08/07/2023

A General Framework for Progressive Data Compression and Retrieval

In scientific simulations, observations, and experiments, the cost of tr...
