A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

08/23/2017
by Benjamin S. Baumer, et al.

Many interesting data sets available on the Internet are of a medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints, they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer review and thus disrupt the forward progress of science. We propose a predictable and pipeable hub-and-spoke framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality.


