Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

Distributed data processing ecosystems are widespread, and their components are highly specialized, which makes efficient interoperability an urgent need. Recently, the community adopted Apache Arrow as a format mediator that provides an efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers, enabling practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on par with or more performant than native Spark.


