Benchmarking Data Lakes Featuring Structured and Unstructured Data with DLBench

10/04/2021
by Pegdwendé Sawadogo, et al.

In the last few years, the data lake concept has become popular for data storage and analysis, and several design alternatives have been proposed for building data lake systems. However, these proposals are difficult to evaluate because there are no commonly shared criteria for comparing data lake systems. We therefore introduce DLBench, a benchmark to evaluate and compare data lake implementations that support textual and/or tabular content. More concretely, we propose a data model made of both textual and raw tabular documents, a workload model composed of a set of various tasks, and a set of performance-based metrics, all relevant to the context of data lakes. As a proof of concept, we use DLBench to evaluate an open-source data lake system we previously developed.

research
01/22/2019

CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using OMOP Common Data Model

Background: Widespread adoption of electronic health records (EHRs) has ...
research
04/21/2022

Why we should respect analysis results as data

The development and approval of new treatments generates large volumes o...
research
05/18/2021

rx-anon – A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Traditional approaches for data anonymization consider relational data a...
research
08/24/2022

Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines

Models for bankruptcy prediction are useful in several real-world scenar...
research
07/09/2020

Enhancing spatial and textual analysis with EUPEG: an extensible and unified platform for evaluating geoparsers

A rich amount of geographic information exists in unstructured texts, su...
research
08/12/2021

TextBenDS: a generic Textual data Benchmark for Distributed Systems

Extracting top-k keywords and documents using weighting schemes are popu...
research
11/12/2021

Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

Data is of high quality if it is fit for its intended use. The quality o...
