RESTORE: Automated Regression Testing for Datasets

03/08/2019
by Lei Zhang, et al.

In data mining, the data in various business cases (e.g., sales, marketing, and demography) gets refreshed periodically. During the refresh, the old dataset is replaced by a new one. Confirming the quality of the new dataset can be challenging because changes are inevitable. How do analysts distinguish reasonable real-world changes from errors related to data capture or data transformation? While some of the errors are easy to spot, others are more subtle. To detect such errors, an analyst typically has to examine the data manually and assess whether the data produced are "believable". Due to the scale of the data, such examination is tedious and laborious. Thus, to save the analyst's time, it is important to detect these errors automatically. However, both the literature and industry still lack methods to assess the difference between old and new versions of a dataset during the refresh process. In this paper, we present a comprehensive set of tests for the detection of abnormalities in a refreshed dataset, based on the information obtained from a previous vintage of the dataset. We implement these tests in an automated test harness made available as an open-source package, called RESTORE, for the R language. The harness accepts flat or hierarchical numeric datasets. We also present a validation case study, where we apply our test harness to hierarchical demographic datasets. The results of the study and feedback from data scientists using the package suggest that RESTORE enables fast and efficient detection of errors in the data and decreases the cost of testing.
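To make the idea of comparing a refreshed dataset against its previous vintage concrete, here is a minimal sketch in Python. It is not RESTORE's API or one of the tests from the paper. The function name `compare_vintages`, the `max_rel_change` threshold, and the sample region values are all hypothetical, chosen only to illustrate the kind of check an automated harness could run on a flat numeric dataset.

```python
from typing import Dict, List, Tuple


def compare_vintages(
    old: Dict[str, float],
    new: Dict[str, float],
    max_rel_change: float = 0.25,  # hypothetical tolerance, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return (key, reason) pairs for entries that look suspicious after a refresh."""
    findings: List[Tuple[str, str]] = []
    # Walk over every key seen in either vintage, in a stable order.
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            findings.append((key, "missing in new vintage"))
        elif key not in old:
            findings.append((key, "not present in old vintage"))
        else:
            before, after = old[key], new[key]
            if before == 0:
                # Relative change is undefined for a zero baseline.
                if after != 0:
                    findings.append((key, "changed from zero"))
                continue
            rel = abs(after - before) / abs(before)
            if rel > max_rel_change:
                findings.append(
                    (key, f"relative change {rel:.0%} exceeds {max_rel_change:.0%}")
                )
    return findings


# Illustrative values only: one small change, one large drop, one lost key, one new key.
old_vintage = {"region_a": 1000.0, "region_b": 500.0, "region_c": 200.0}
new_vintage = {"region_a": 1030.0, "region_b": 50.0, "region_d": 10.0}

for key, reason in compare_vintages(old_vintage, new_vintage):
    print(f"{key}: {reason}")
```

A fixed relative threshold is the simplest possible rule. It flags the large drop and the structural differences while letting the small change pass, but it cannot by itself tell a genuine real-world shift from a capture error, which is the harder problem the abstract describes.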


