Detecting Errors in Numerical Data via any Regression Model

05/26/2023
by Hang Zhou, et al.

Noise plagues many numerical datasets: recorded values may fail to match the true underlying values for reasons such as faulty sensors, data entry or processing mistakes, and imperfect human estimates. Here we consider estimating which values in a numerical column are incorrect. We present a model-agnostic approach that can utilize any regressor (i.e., any statistical or machine learning model) fit to predict values in this column from the other variables in the dataset. By accounting for various uncertainties, our approach distinguishes genuine anomalies from natural data fluctuations, conditioned on the available information in the dataset. We establish theoretical guarantees for our method and show that other approaches, such as conformal inference, struggle to detect errors. We also contribute a new error detection benchmark comprising 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and in additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
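To make the general idea concrete, here is a minimal sketch of residual-based error scoring with a pluggable regressor. It is not the method from the paper, which additionally accounts for several sources of uncertainty; it only illustrates the model-agnostic setup under simplifying assumptions. The names `fit_line`, `out_of_fold_predictions`, and `error_scores` are hypothetical, the one-feature least-squares fit stands in for "any regressor", and the data are synthetic.

```python
import random
import statistics

def fit_line(xs, ys):
    """Hypothetical stand-in regressor: ordinary least squares on one feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def out_of_fold_predictions(xs, ys, fit, n_folds=5, seed=0):
    """Predict each ys[i] with a model trained on folds that exclude row i."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(ys)
    for k in range(n_folds):
        held_out = set(idx[k::n_folds])
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = model(xs[i])
    return preds

def error_scores(ys, preds):
    """Robust z-scores of residuals: larger means more likely erroneous."""
    resid = [y - p for y, p in zip(ys, preds)]
    med = statistics.median(resid)
    mad = statistics.median(abs(r - med) for r in resid)
    scale = 1.4826 * mad if mad > 0 else 1.0  # MAD -> std under normality
    return [abs(r - med) / scale for r in resid]

# Synthetic data (an assumption for illustration): a noisy line with
# three deliberately corrupted target values.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
corrupted = [5, 50, 120]
for i in corrupted:
    ys[i] += 25.0

scores = error_scores(ys, out_of_fold_predictions(xs, ys, fit_line))
ranked = sorted(range(len(ys)), key=lambda i: scores[i], reverse=True)
flagged = sorted(ranked[:3])  # indices of the three most suspicious values
print(flagged)
```

Any callable with the same `fit(xs, ys) -> predict` shape could replace `fit_line`. Using out-of-fold predictions keeps a corrupted value from being scored by a model that was trained on it.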

research
08/12/2015

RCR: Robust Compound Regression for Robust Estimation of Errors-in-Variables Model

The errors-in-variables (EIV) regression model, being more realistic by ...
research
01/05/2023

Appropriate use of parametric and nonparametric methods in estimating regression models with various shapes of errors

In this paper, a practical estimation method for a regression model is p...
research
05/16/2023

A Comparative Study of Methods for Estimating Conditional Shapley Values and When to Use Them

Shapley values originated in cooperative game theory but are extensively...
research
10/09/2019

Estimating regression errors without ground truth values

Regression analysis is a standard supervised machine learning method use...
research
02/04/2021

RECol: Reconstruction Error Columns for Outlier Detection

Detecting outliers or anomalies is a common data analysis task. As a sub...
research
02/21/2023

Multi-Target Tobit Models for Completing Water Quality Data

Monitoring microbiological behaviors in water is crucial to manage publi...
