Deep Learning to Jointly Schema Match, Impute, and Transform Databases

06/22/2022
by   Sandhya Tripathi, et al.
0

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.

READ FULL TEXT
research
11/28/2018

FADL:Federated-Autonomous Deep Learning for Distributed Electronic Health Record

Electronic health record (EHR) data is collected by individual instituti...
research
04/28/2022

Cumulative Stay-time Representation for Electronic Health Records in Medical Event Time Prediction

We address the problem of predicting when a disease will develop, i.e., ...
research
02/03/2014

Principled Graph Matching Algorithms for Integrating Multiple Data Sources

This paper explores combinatorial optimization for problems of max-weigh...
research
04/30/2023

Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records

In the era of big data, there is an increasing need for healthcare provi...
research
09/27/2022

Identifying and Extracting Football Features from Real-World Media Sources using Only Synthetic Training Data

Real-world images used for training machine learning algorithms are ofte...
research
05/08/2021

Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study

Multi-Task Learning has emerged as a methodology in which multiple tasks...
research
03/24/2019

Truly Batch Apprenticeship Learning with Deep Successor Features

We introduce a novel apprenticeship learning algorithm to learn an exper...

Please sign up or login with your details

Forgot password? Click here to reset