Rumble: data independence when data is in a mess

10/25/2019
by Stefan Irimescu, et al.

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can run efficiently on Spark to query billions of objects, scaling to at least the terabyte range. The JSONiq code is concise in comparison to Spark's host languages, and it seamlessly supports the nested, heterogeneous datasets that Spark SQL does not. The ability to process this kind of input, which is common in practice, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss over Spark SQL for structured data (and occasionally even a gain), and that there is a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent datasets, in the same way that SQL can be leveraged for structured datasets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested datasets as it does for highly structured tables.
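
To make the idea of mapping FLWOR clauses to a chain of transformations more concrete, here is a minimal sketch in plain Python. It is not Rumble's implementation and does not use Spark: each clause is modeled as a function over a stream of variable bindings, roughly in the way a filter or group-by transformation would be chained on a distributed collection. The sample data, the function names, and the JSONiq query in the comment are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Illustrative JSONiq query that the pipeline below imitates (an assumption,
# not taken from the paper):
#   for $o in <collection of JSON objects>
#   where $o.country eq "CH"
#   group by $c := $o.city
#   return { "city": $c, "count": count($o) }

# Hypothetical heterogeneous input: fields may be absent or differently typed.
LINES = [
    '{"city": "Zurich", "country": "CH", "tags": ["a", "b"]}',
    '{"city": "Zurich", "country": "CH"}',
    '{"city": "Bern", "country": "CH", "tags": "single"}',
    '{"city": "Paris", "country": "FR"}',
    '{"country": "CH"}',
]

def for_clause(lines):
    # Bind each parsed object to the variable "o", one tuple per object.
    for line in lines:
        yield {"o": json.loads(line)}

def where_clause(tuples):
    # Keep tuples whose object has country "CH"; a missing key never matches.
    return (t for t in tuples if t["o"].get("country") == "CH")

def group_by_clause(tuples):
    # Group on the city value; None stands in for an absent field here.
    groups = defaultdict(list)
    for t in tuples:
        groups[t["o"].get("city")].append(t["o"])
    for key, members in groups.items():
        yield {"c": key, "o": members}

def return_clause(tuples):
    # Build one output object per group.
    return [{"city": t["c"], "count": len(t["o"])} for t in tuples]

result = return_clause(group_by_clause(where_clause(for_clause(LINES))))
print(result)
```

Note that the objects with a missing `city` or a non-array `tags` value pass through without any schema declaration, which is the kind of heterogeneity the abstract refers to. In a Spark setting the same clause-by-clause structure would be expressed with distributed transformations rather than Python generators.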

