Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings

10/25/2022
by Ajay Chawda, et al.

In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent insurance claims for automotive repairs. The data belongs to the broader category of auditing data, which also includes journals and network intrusion data. Insurance claim data differ distinctly from other auditing data (such as network intrusion data) in their high number of categorical attributes. We tackle the common problem of missing benchmark datasets for anomaly detection: datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. Therefore, a large dataset is created for this purpose and referred to as the Vehicle Claims (VC) dataset. The dataset is evaluated on shallow and deep learning methods. Due to the introduction of categorical attributes, we encounter the challenge of encoding them for the large dataset. As One Hot encoding of a high-cardinality dataset invokes the "curse of dimensionality", we experiment with GEL encoding and an embedding layer for representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation, and contrastive learning approaches across Label, One Hot, and GEL encoding and an embedding layer for handling categorical values.
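
To illustrate why One Hot encoding becomes costly for high-cardinality attributes, the following minimal sketch compares the feature width produced by Label and One Hot encoding. It is an illustration only, not the authors' code: the column names, values, and helper functions are hypothetical and are not taken from the Vehicle Claims dataset, and GEL encoding and the embedding layer are not shown.

```python
def label_encode(values):
    """Map each distinct category to an integer code (one column per attribute)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Map each value to a 0/1 vector with one position per distinct category."""
    codes, categories = label_encode(values)
    width = len(categories)
    vectors = [[1 if j == code else 0 for j in range(width)] for code in codes]
    return vectors, categories


# Hypothetical toy columns (assumption for illustration only).
rows = {
    "maker": ["audi", "bmw", "audi", "ford"],
    "color": ["red", "blue", "red", "red"],
}

# Label encoding keeps one integer column per attribute.
label_width = len(rows)

# One Hot encoding adds one column per distinct category, so the width
# grows with the total cardinality of all categorical attributes.
one_hot_width = sum(len(set(column)) for column in rows.values())

print("label width:", label_width)
print("one-hot width:", one_hot_width)
```

With only two attributes the gap is small, but the One Hot width is the sum of the cardinalities, so an attribute with thousands of distinct values adds thousands of columns on its own, while Label encoding still adds one.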

research
10/04/2022

Representing missing values through polar encoding

We propose polar encoding, a representation of categorical and numerical...
research
01/27/2022

Fairness implications of encoding protected categorical attributes

Protected attributes are often presented as categorical features that ne...
research
05/29/2020

Quasi-orthonormal Encoding for Machine Learning Applications

Most machine learning models, especially artificial neural networks, req...
research
08/27/2020

The Impact of Discretization Method on the Detection of Six Types of Anomalies in Datasets

Anomaly detection is the process of identifying cases, or groups of case...
research
09/09/2022

Explanation Method for Anomaly Detection on Mixed Numerical and Categorical Spaces

Most proposals in the anomaly detection field focus exclusively on the d...
research
12/22/2021

Evaluating categorical encoding methods on a real credit card fraud detection database

Correctly dealing with categorical data in a supervised learning context...
research
09/08/2022

Stochastic gradient descent with gradient estimator for categorical features

Categorical data are present in key areas such as health or supply chain...
