RStore: A Distributed Multi-version Document Store

02/21/2018
by   Souvik Bhattacherjee, et al.

We address the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over them, including retrieving full or partial versions and evolution histories for specific keys. We motivate the increasing need for such a system in a variety of application domains, carefully explore the design space for building it and the various storage-computation-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. We propose a novel system architecture that satisfies the key desiderata for such a system and offers simple tuning knobs that allow adapting to a specific data and query workload. Our system is intended to act as a layer on top of a distributed key-value store that houses the raw data as well as any indexes. We design novel off-line storage layout algorithms for efficiently partitioning the data to minimize the storage costs while keeping the retrieval costs low. We also present an online algorithm to handle new versions being added to the system. Using extensive experiments on large datasets, we demonstrate that our system operates at the scale required in most practical scenarios and often outperforms standard baselines, including a delta-based storage engine, by orders of magnitude.
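To make the three query types named in the abstract concrete (full version, partial version, and evolution history of a key), here is a minimal Python sketch of a versioning layer over a key-value store. It is an illustration under stated assumptions, not RStore's actual design: the class `VersionedStore`, its methods, the per-version manifest, and the in-memory dict standing in for the distributed key-value store are all hypothetical, and the paper's partitioning and layout algorithms are not modeled.

```python
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Toy multi-version document store (hypothetical; not the RStore design)."""

    def __init__(self) -> None:
        # Stand-in for the distributed key-value store: (doc key, record id) -> document.
        self.kv: Dict[Tuple[str, int], dict] = {}
        # Per-version manifest: version id -> {doc key: record id}.
        self.versions: Dict[int, Dict[str, int]] = {}
        self._next_record = 0

    def commit(self, version: int, parent: Optional[int],
               changes: Dict[str, Optional[dict]]) -> None:
        """Create `version` from `parent`; a value of None deletes that key."""
        manifest = dict(self.versions[parent]) if parent is not None else {}
        for key, doc in changes.items():
            if doc is None:
                manifest.pop(key, None)
            else:
                # Only changed documents are written; unchanged ones are shared
                # with the parent version through the copied manifest.
                rid = self._next_record
                self._next_record += 1
                self.kv[(key, rid)] = doc
                manifest[key] = rid
        self.versions[version] = manifest

    def get_version(self, version: int,
                    keys: Optional[List[str]] = None) -> Dict[str, dict]:
        """Full version if `keys` is None, otherwise a partial version."""
        manifest = self.versions[version]
        wanted = list(manifest) if keys is None else [k for k in keys if k in manifest]
        return {k: self.kv[(k, manifest[k])] for k in wanted}

    def history(self, key: str) -> List[Tuple[int, dict]]:
        """Distinct records of `key`, listed in increasing version-id order."""
        seen = set()
        out: List[Tuple[int, dict]] = []
        for v in sorted(self.versions):
            rid = self.versions[v].get(key)
            if rid is not None and rid not in seen:
                seen.add(rid)
                out.append((v, self.kv[(key, rid)]))
        return out


store = VersionedStore()
store.commit(1, None, {"a": {"x": 1}, "b": {"y": 2}})
store.commit(2, 1, {"a": {"x": 3}})   # "b" is shared with version 1, not copied
store.commit(3, 2, {"b": None})       # delete "b" in version 3
```

In this sketch each distinct document is stored once and shared across the versions that contain it, which is the simplest form of compact storage. How records are grouped into key-value pairs, the trade-off the abstract refers to, is deliberately left out.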


research
02/21/2018

Managing and Querying Multi-versioned Documents using a Distributed Key-Value Store

We address the problem of compactly storing a large number of versions (...
research
02/11/2019

CPOI: A Compact Method to Archive Versioned RDF Triple-Sets

Large amounts of RDF/S data are produced and published lately, and sever...
research
11/22/2021

Columnar Formats for Schemaless LSM-based Document Stores

In the last decade, document store database systems have gained more tra...
research
01/23/2022

SToN: A New Fundamental Trade-off for Distributed Data Storage Systems

Locating data efficiently is a key process in every distributed data sto...
research
04/09/2019

Cold Storage Data Archives: More Than Just a Bunch of Tapes

The abundance of available sensor and derived data from large scientific...
research
02/16/2018

PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies

The rapidly growing size of RDF graphs in recent years necessitates dist...
research
04/19/2016

Improving Raw Image Storage Efficiency by Exploiting Similarity

To improve the temporal and spatial storage efficiency, researchers have...
