Robust and Scalable Content-and-Structure Indexing (Extended Version)

09/12/2022
by Kevin Wellenzohn, et al.

Content-and-Structure (CAS) queries are a frequent class of queries on semi-structured hierarchical data: they filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To obtain an index that is robust against queries with varying selectivities, we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store the interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, the world's largest publicly available source code archive.
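To make the notion of a CAS query concrete, the following is a minimal sketch that evaluates one by a plain linear scan. It is not the RSCAS index and does not use dynamic interleaving; it only shows what such a query selects. The function names, the `//` (descendant axis) and `*` (wildcard) pattern syntax, and the sample paths and values are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# A data item is a composite key: (path in the hierarchy, attribute value).
Item = Tuple[str, int]


def path_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a hypothetical path pattern into a regular expression.

    '//' is treated as a descendant axis (zero or more intermediate
    path segments); '*' matches any characters within one segment.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("//", i):
            parts.append(r"/(?:[^/]+/)*")
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def cas_query(items: Iterable[Item], path_pattern: str,
              lo: int, hi: int) -> List[Item]:
    """Return items whose path matches the pattern (structure predicate)
    and whose value lies in [lo, hi] (content predicate)."""
    rx = path_pattern_to_regex(path_pattern)
    return [(p, v) for p, v in items if lo <= v <= hi and rx.fullmatch(p)]


if __name__ == "__main__":
    # Made-up sample data: (file path, some numeric attribute).
    sample = [
        ("/src/kernel/sched.c", 100),
        ("/src/kernel/mm/page.c", 250),
        ("/src/docs/readme.md", 120),
        ("/lib/util.c", 150),
    ]
    # All .c files anywhere below /src with a value between 50 and 200.
    print(cas_query(sample, "/src//*.c", 50, 200))
```

A scan like this touches every item regardless of how selective the path or value predicate is. The abstract's point is that an index over interleaved path and value keys should stay efficient whichever of the two predicates is the more selective one.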

research
06/09/2020

Dynamic Interleaving of Content and Structure for Robust Indexing of Semi-Structured Hierarchical Data (Extended Version)

We propose a robust index for semi-structured hierarchical data that sup...
research
10/25/2019

Overlay Indexes: Efficiently Supporting Aggregate Range Queries and Authenticated Data Structures in Off-the-Shelf Databases

Commercial off-the-shelf DataBase Management Systems (DBMSes) are highly...
research
08/31/2021

Hierarchical Bitmap Indexing for Range and Membership Queries on Multidimensional Arrays

Traditional indexing techniques commonly employed in database systems...
research
10/01/2013

Hopping over Big Data: Accelerating Ad-hoc OLAP Queries with Grasshopper Algorithms

This paper presents a family of algorithms for fast subset filtering wit...
research
11/11/2022

Efficient Immediate-Access Dynamic Indexing

In a dynamic retrieval system, documents must be ingested as they arrive...
research
03/02/2023

RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing

Data management on GPUs has become increasingly relevant due to a tremen...
research
03/18/2020

PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries

Range aggregate queries find frequent application in data analytics. In ...
