Classifying extremely imbalanced data sets

11/29/2010
by Markward Britsch, et al.

Imbalanced data sets containing far more background than signal instances are very common in particle physics and will also be characteristic of the upcoming analyses of LHC data. Following up on the work presented at ACAT 2008, we apply the multivariate technique presented there (a rule-growing algorithm with the meta-methods bagging and instance weighting) to much more imbalanced data sets, in particular a selection of D0 decays without the use of particle identification. It turns out that the quality of the result depends strongly on the number of background instances used for training. We discuss methods to exploit this dependence in order to improve the results significantly, and, more generally, how to handle and reduce the size of large training sets without loss of result quality. We also comment on how to take statistical fluctuations into account in receiver operating characteristic (ROC) curves when comparing classifier methods.
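The abstract does not say how statistical fluctuations in ROC curves are handled. As an illustration only, the sketch below assumes one common approach that is not necessarily the authors': a stratified bootstrap of the area under the ROC curve (AUC). The function names `auc` and `bootstrap_auc`, the toy Gaussian scores, and the sample sizes are all hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (signal, background) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one signal and one background instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, n_resamples=200, seed=0):
    """Mean and standard deviation of the AUC over bootstrap resamples.

    Assumption: signal and background are resampled separately
    (stratified), so the rare signal class never vanishes from a resample.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    values = []
    for _ in range(n_resamples):
        p = [rng.choice(pos) for _ in pos]
        n = [rng.choice(neg) for _ in neg]
        values.append(auc(p + n, [1] * len(p) + [0] * len(n)))
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var ** 0.5


# Toy imbalanced sample (hypothetical): 20 signal, 400 background scores.
rng = random.Random(42)
toy_scores = [rng.gauss(1.0, 1.0) for _ in range(20)] + \
             [rng.gauss(0.0, 1.0) for _ in range(400)]
toy_labels = [1] * 20 + [0] * 400

mean_auc, std_auc = bootstrap_auc(toy_scores, toy_labels)
print(f"AUC = {mean_auc:.3f} +/- {std_auc:.3f}")
```

Under these assumptions, the difference between two classifiers' AUC values would be treated as meaningful only if it is clearly larger than the bootstrap spread. With few signal instances that spread can be sizeable.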

