Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

11/24/2022
by Sérgio Jesus, et al.

Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data – which is prevalent in many high-stakes domains – has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available, privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both the performance and the fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.

research
05/13/2021

An Empirical Comparison of Bias Reduction Methods on Real-World Problems in High-Stakes Policy Settings

Applications of machine learning (ML) to high-stakes policy settings – s...
research
10/27/2021

Towards Realistic Single-Task Continuous Learning Research for NER

There is an increasing interest in continuous learning (CL), as data pri...
research
03/23/2021

Promoting Fairness through Hyperparameter Optimization

Considerable research effort has been guided towards algorithmic fairnes...
research
11/20/2022

Are Out-of-Distribution Detection Methods Reliable?

This paper establishes a novel evaluation framework for assessing the pe...
research
12/05/2019

Generative Synthesis of Insurance Datasets

One of the impediments in advancing actuarial research and developing op...
research
04/29/2021

Privacy-Preserving Portrait Matting

Recently, there has been an increasing concern about the privacy issue r...
research
09/07/2023

TSGBench: Time Series Generation Benchmark

Synthetic Time Series Generation (TSG) is crucial in a range of applicat...
