A Primer on the Data Cleaning Pipeline

07/25/2023
by Rebecca C. Steorts, et al.

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, that is, the merging of multiple data sources, have also grown. Specifically, the science of the “data cleaning pipeline” contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on “cleaned data.” This article provides a review of this emerging field, introducing technical terminology and commonly used methods.

research
05/14/2018

Crowdbreaks: Tracking Health Trends using Public Social Media Data and Crowdsourcing

In the past decade, tracking health trends using social media data has s...
research
10/08/2021

A New Data Integration Framework for Covid-19 Social Media Information

The Covid-19 pandemic presents a serious threat to people's health, resu...
research
11/10/2020

On the State of Social Media Data for Mental Health Research

Data-driven methods for mental health treatment and surveillance have be...
research
03/14/2022

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

It's challenging to design reward functions for complex, real-world task...
research
01/18/2018

Citation Analysis of Innovative ICT and Advances of Governance (2008-2017)

This paper opens by introducing the Internet Plus Government (IPG), a ne...
research
06/25/2023

Machine Learning and Consumer Data

The digital revolution has led to the digitization of human behavior, cr...
research
12/20/2017

Linking Administrative Data: An Evolutionary Schema

Statistics New Zealand (Stats NZ) has committed unreservedly to an admin...
