Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation

Past. Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest data management problems. Unfortunately, it is still the most time consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions are not keeping up with the ever changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, instead of machine cost, needed for providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understanding features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep the pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruits about the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.

READ FULL TEXT
research
11/23/2017

A Deep Relevance Matching Model for Ad-hoc Retrieval

In recent years, deep neural networks have led to exciting breakthroughs...
research
09/22/2022

EventNet: Detecting Events in EEG

Neurologists are often looking for various "events of interest" when ana...
research
05/08/2017

Machine Learning with World Knowledge: The Position and Survey

Machine learning has become pervasive in multiple domains, impacting a w...
research
05/24/2023

A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem

Automatic Term Recognition is used to extract domain-specific terms that...
research
10/21/2020

A Level-wise Taxonomic Perspective on Automated Machine Learning to Date and Beyond: Challenges and Opportunities

Automated machine learning (AutoML) is essentially automating the proces...
research
03/12/2017

Why we have switched from building full-fledged taxonomies to simply detecting hypernymy relations

The study of taxonomies and hypernymy relations has been extensive on th...

Please sign up or login with your details

Forgot password? Click here to reset