Technical Report on Data Integration and Preparation

03/02/2021
by El Kindi Rezig, et al.

AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of an AI workflow, developers often spend the first few weeks (sometimes months) in a phase we refer to as data conditioning. This phase typically includes tasks such as figuring out how to prepare data for analytics, dealing with inconsistencies in the dataset, and determining which algorithm (or set of algorithms) is best suited for the application. Larger, faster, and messier datasets, such as those from Internet of Things sensors, medical devices, or autonomous vehicles, only amplify these issues. These challenges, often referred to as the three Vs of Big Data (volume, velocity, variety), require low-level tools for data management, preparation, and integration. In most applications, data can come from structured and/or unstructured sources and often includes inconsistencies, formatting differences, and a lack of ground-truth labels. In this report, we highlight a number of tools that can be used to simplify the data integration and preparation steps. Specifically, we cover data integration tools and techniques, a deep dive into an exemplar data integration tool, and a deep dive into the evolving field of knowledge graphs. Finally, we provide readers with a list of practical steps and considerations that they can use to simplify the data integration challenge. The goal of this report is to give readers a view of the state of the art, as well as practical tips that data creators can use to make data integration more seamless.
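To make the "inconsistencies and formatting differences" mentioned above concrete, the following is a minimal sketch of one data conditioning step: mapping two sources with different schemas and formats onto a shared schema, then merging them. It is not taken from the report. The sensor records, field names, and the helpers `normalize_a`, `normalize_b`, and `integrate` are all hypothetical and exist only for illustration.

```python
from datetime import datetime

# Hypothetical records from two sources that describe overlapping sensors.
# Field names, ID casing, number types, and date formats all differ.
SOURCE_A = [
    {"sensor_id": "S-001", "reading": "21.5", "taken": "2021-03-02"},
    {"sensor_id": "S-002", "reading": "19.0", "taken": "2021-03-02"},
]
SOURCE_B = [
    {"id": " s-001 ", "temp_c": 21.5, "date": "03/02/2021"},
    {"id": "S-003", "temp_c": 18.2, "date": "03/02/2021"},
]


def normalize_a(rec):
    """Map a source-A record onto the shared schema (assumed ISO dates)."""
    return {
        "sensor_id": rec["sensor_id"].strip().upper(),
        "temp_c": float(rec["reading"]),
        "date": datetime.strptime(rec["taken"], "%Y-%m-%d").date(),
    }


def normalize_b(rec):
    """Map a source-B record onto the shared schema (assumed MM/DD/YYYY dates)."""
    return {
        "sensor_id": rec["id"].strip().upper(),
        "temp_c": float(rec["temp_c"]),
        "date": datetime.strptime(rec["date"], "%m/%d/%Y").date(),
    }


def integrate(source_a, source_b):
    """Normalize both sources and keep one record per (sensor, date) key."""
    merged = {}
    records = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
    for rec in records:
        key = (rec["sensor_id"], rec["date"])
        # The first record seen for a key wins; a real pipeline would need
        # an explicit conflict-resolution policy here.
        merged.setdefault(key, rec)
    return sorted(merged.values(), key=lambda r: r["sensor_id"])


unified = integrate(SOURCE_A, SOURCE_B)
for row in unified:
    print(row)
```

Even in this toy case, most of the code handles formatting rather than analysis, which is the point the abstract makes about the time spent on data conditioning.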


research
04/11/2019

The Role of Big Data Analytics in Industrial Internet of Things

Big data production in industrial Internet of Things (IIoT) is evident d...
research
09/16/2019

Preliminary Exploration on Digital Twin for Power Systems: Challenges, Framework, and Applications

Digital twin (DT) is one of the most promising enabling technologies for...
research
09/29/2017

Toward a System Building Agenda for Data Integration

In this paper we argue that the data management community should devote ...
research
05/03/2019

Dataspace architecture and manage its components class projection

Big Data technology is described. Big data is a popular term used to des...
research
07/23/2021

On data lake architectures and metadata management

Over the past two decades, we have witnessed an exponential increase of ...
research
01/15/2019

AI Pipeline - bringing AI to you. End-to-end integration of data, algorithms and deployment tools

Next generation of embedded Information and Communication Technology (IC...
research
01/27/2021

Alaska: A Flexible Benchmark for Data Integration Tasks

Data integration is a long-standing interest of the data management comm...
