AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

08/19/2019
by   Phanwadee Sinthong, et al.
0

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity issues for "normal" data scientists. This paper introduces AFrame, a new scalable data analysis package powered by a Big Data management system that extends the data scientists' familiar DataFrame operations to efficiently operate on managed data at scale. AFrame is implemented as a layer on top of Apache AsterixDB, transparently scaling out the execution of DataFrame operations and machine learning model invocation through a parallel, shared-nothing big data management system. AFrame incrementally constructs SQL++ queries and leverages AsterixDB's semistructured data management facilities, user-defined function support, and live data ingestion support. In order to evaluate the proposed approach, this paper also introduces an extensible micro-benchmark for use in evaluating DataFrame performance in both single-node and distributed settings via a collection of representative analytic operations. This paper presents the architecture of AFrame, describes the underlying capabilities of AsterixDB that efficiently support modern data analytic operations, and utilizes the proposed benchmark to evaluate and compare the performance and support for large-scale data analyses provided by alternative DataFrame libraries.

READ FULL TEXT

page 1

page 9

page 10

page 11

page 12

research
10/12/2020

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

In the last few years, the field of data science has been growing rapidl...
research
02/26/2019

Workflow-Driven Distributed Machine Learning in CHASE-CI: A Cognitive Hardware and Software Ecosystem Community Infrastructure

The advances in data, computing and networking over the last two decades...
research
04/15/2017

Big Universe, Big Data: Machine Learning and Image Analysis for Astronomy

Astrophysics and cosmology are rich with data. The advent of wide-area d...
research
02/08/2018

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks t...
research
10/20/2019

Micro-level Modularity of Computaion-intensive Programs in Big Data Platforms: A Case Study with Image Data

With the rapid advancement of Big Data platforms such as Hadoop, Spark, ...
research
04/20/2021

ds-array: A Distributed Data Structure for Large Scale Machine Learning

Machine learning has proved to be a useful tool for extracting knowledge...
research
02/21/2019

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

Big Data today is being generated at an unprecedented rate from various ...

Please sign up or login with your details

Forgot password? Click here to reset