Sherlock: A Deep Learning Approach to Semantic Data Type Detection

05/25/2019
by Madelon Hulsebos, et al.

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F_1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
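
For readers unfamiliar with the reported metric, the following is a minimal sketch of a support-weighted F_1 score, assuming the usual definition: compute F_1 per semantic type, then average the per-type scores with weights proportional to the number of true columns of each type. The function name `support_weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    support = Counter(y_true)  # number of true columns per semantic type
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # true columns of this type that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by class support
    return score


# Toy example with made-up column labels.
y_true = ["city", "city", "name", "age"]
y_pred = ["city", "name", "name", "age"]
print(support_weighted_f1(y_true, y_pred))
```

Because frequent types carry more weight, this score can look strong even when rare types are detected poorly, which is worth keeping in mind when reading a single aggregate figure.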

research
06/24/2021

DCoM: A Deep Column Mapper for Semantic Data Type Detection

Detection of semantic data types is a very crucial task in data science ...
research
07/24/2023

Comprehending Semantic Types in JSON Data with Graph Neural Networks

Semantic types are a more powerful and detailed way of describing data t...
research
11/14/2019

Sato: Contextual Semantic Type Detection in Tables

Detecting the semantic types of data columns in relational tables is imp...
research
10/31/2017

Extracting Syntactic Patterns from Databases

Many database columns contain string or numerical data that conforms to ...
research
12/15/2020

Semantic Annotation for Tabular Data

Detecting semantic concept of columns in tabular data is of particular i...
research
02/04/2021

RECol: Reconstruction Error Columns for Outlier Detection

Detecting outliers or anomalies is a common data analysis task. As a sub...
research
07/05/2020

DrugDBEmbed : Semantic Queries on Relational Database using Supervised Column Encodings

Traditional relational databases contain a lot of latent semantic inform...