On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research

by Pradeeban Kathiravelu, et al.

Scientific research requires accessing, analyzing, and sharing data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. Bootstrapping this process is inefficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still does so eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, because the entire data is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration: a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. This way, hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach to enhance the hybrid ETL, with selective data integration driven by user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Obidos, and evaluate it in the context of data sharing for medical research. Obidos outperforms both the eager ETL and lazy ETL approaches for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.
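
The contrast between eager loading and query-driven incremental loading can be illustrated with a minimal Python sketch. This is not the Obidos implementation: the names `Source`, `HybridRepository`, and `eager_load` are hypothetical, and in-memory dictionaries stand in for remote data sources and for the integrated data repository. The sketch assumes that metadata is a list of collection names per source and that a user query names one source and one collection.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical remote data source: collection name -> list of records."""
    name: str
    collections: dict
    reads: int = 0  # number of collection fetches served by this source

    def list_collections(self):
        # Stands in for a metadata request to the source.
        return list(self.collections)

    def fetch(self, collection):
        # Stands in for a (costly) data transfer from the source.
        self.reads += 1
        return list(self.collections[collection])


@dataclass
class HybridRepository:
    """Illustrative repository that loads metadata and data only when queried."""
    sources: list
    metadata: dict = field(default_factory=dict)  # source name -> collection names
    data: dict = field(default_factory=dict)      # (source, collection) -> records

    def query(self, source_name, collection):
        source = next(s for s in self.sources if s.name == source_name)
        # Load this source's metadata the first time a query touches it.
        if source_name not in self.metadata:
            self.metadata[source_name] = source.list_collections()
        if collection not in self.metadata[source_name]:
            raise KeyError(collection)
        # Load the data once; later queries are served from the repository.
        key = (source_name, collection)
        if key not in self.data:
            self.data[key] = source.fetch(collection)
        return self.data[key]


def eager_load(sources):
    """Eager ETL baseline: copy every collection from every source up front."""
    return {(s.name, c): s.fetch(c) for s in sources for c in s.list_collections()}
```

In this sketch, repeating `repo.query("imaging", "ct")` contacts the source only once, and sources that no query touches are never read, whereas `eager_load` fetches every collection before the first query can run.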



An Approach to Handle Big Data Warehouse Evolution

One of the purposes of Big Data systems is to support analysis of data g...

Turning the information-sharing dial: efficient inference from different data sources

A fundamental aspect of statistics is the integration of data from diffe...

Harmonise and integrate heterogeneous areal data with the R package arealDB

Areal data is a common data type to store information such as biodiversi...

Toward a view-based data cleaning architecture

Big data analysis has become an active area of study with the growth of ...

Burgeoning Data Repository Systems, Characteristics and Development Strategies: Insights of Natural Resources and Environmental Scientists

Nowadays, we have the emergence and abundance of many different data rep...

A Hierarchical Approach to exploiting Multiple Datasets from TalkBank

TalkBank is an online database that facilitates the sharing of linguisti...

Popularity Driven Data Integration

More and more, with the growing focus on large scale analytics, we are c...
