FDB: Fraud Dataset Benchmark

08/30/2022
by Prince Grover, et al.

Standardized datasets and benchmarks have spurred innovation in computer vision, natural language processing, multi-modal, and tabular settings. Compared with other well-researched fields, fraud detection differs in several ways: high class imbalance, diverse feature types, frequently changing fraud patterns, and the adversarial nature of the problem. Because of these differences, modeling approaches designed for other classification tasks may not work well for fraud detection. We introduce the Fraud Dataset Benchmark (FDB), a compilation of publicly available datasets for fraud detection. FDB comprises a variety of fraud-related tasks, ranging from identifying fraudulent card-not-present transactions, detecting bot attacks, classifying malicious URLs, and predicting loan risk to content moderation. FDB's Python-based library provides a consistent API for data loading with standardized training and testing splits. For reference, we also provide baseline evaluations of different modeling approaches on FDB. Given the increasing popularity of Automated Machine Learning (AutoML) for research and business problems, we used AutoML frameworks for our baseline evaluations. For fraud prevention, organizations that operate with limited resources and lack ML expertise often hire a team of investigators and rely on blocklists and manual rules, all of which are inefficient and do not scale well. Such organizations can benefit from AutoML solutions that are easy to deploy in production and meet the bar for fraud prevention requirements. We hope that FDB helps in the development of customized fraud detection techniques tailored to different fraud modus operandi (MOs), as well as in the improvement of AutoML systems that work well for all datasets in the benchmark.
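To make the idea of a standardized training and testing split concrete, here is a minimal sketch in Python. It is not FDB's API: the names `Event`, `standardized_split`, and `fraud_rate` are hypothetical, the data is a toy example, and splitting by timestamp is an assumption chosen because fraud patterns change over time. The point is only that a deterministic split lets every modeling approach be scored on exactly the same held-out rows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical record type; not part of the FDB library."""
    timestamp: int              # event time, e.g. seconds since epoch
    features: Dict[str, float]  # numeric features only, for simplicity
    is_fraud: int               # 1 = fraud, 0 = legitimate


def standardized_split(
    events: List[Event], test_fraction: float = 0.2
) -> Tuple[List[Event], List[Event]]:
    """Deterministic split: the most recent events become the test set.

    Assumption: ordering by time (rather than sampling at random) so that
    the test set reflects newer behavior than the training set.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    ordered = sorted(events, key=lambda e: e.timestamp)
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def fraud_rate(events: List[Event]) -> float:
    """Share of events labeled as fraud; a quick check on class imbalance."""
    return sum(e.is_fraud for e in events) / len(events) if events else 0.0


# Toy data: 100 events, every tenth one labeled as fraud.
events = [
    Event(timestamp=t, features={"amount": float(10 * t)}, is_fraud=int(t % 10 == 0))
    for t in range(1, 101)
]
train, test = standardized_split(events)
print(len(train), len(test), fraud_rate(train), fraud_rate(test))
```

Because the split depends only on the data and `test_fraction`, two people running it on the same dataset get identical train and test sets, which is the property a benchmark needs for comparable baselines.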


