HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques

03/29/2022
by Angelos Chatzimparmpas, et al.

Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among the diverse techniques proposed to address this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness: the need to manage unsafe or potentially noisy instances that are more likely to be misclassified and that serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness, mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling or oversampling a specific class, we allow users to find and sample easy-to-classify and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide on all of these parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance on a separate test set. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we assess the usefulness of our system based on feedback received from ML experts.
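To make the notion of instance hardness concrete, the following minimal sketch scores each training instance by the fraction of its k nearest neighbors that carry a different label (often called k-disagreeing neighbors). This is an illustrative assumption, not the measure used by HardVis: the abstract does not specify how hardness or instance types are computed. The function names, the toy data, and the 0.5 threshold are all hypothetical.

```python
from math import dist


def k_disagreeing_neighbors(X, y, k=3):
    """Hardness per instance: fraction of its k nearest neighbors with a different label.

    Assumption for illustration only; the abstract does not name a hardness measure.
    """
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Euclidean distance to every other instance, nearest first.
        neighbors = sorted(
            (dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        disagree = sum(1 for _, j in neighbors if y[j] != yi)
        scores.append(disagree / k)
    return scores


def split_by_hardness(scores, threshold=0.5):
    """Split instance indices into (easy, hard) using a hypothetical threshold."""
    easy = [i for i, s in enumerate(scores) if s < threshold]
    hard = [i for i, s in enumerate(scores) if s >= threshold]
    return easy, hard


# Toy data: two tight clusters, plus one class-1 point placed inside the class-0 cluster.
X = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # class 0 cluster
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),  # class 1 cluster
    (0.05, 0.05),                                    # class 1 point among class 0
]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

scores = k_disagreeing_neighbors(X, y, k=3)
easy, hard = split_by_hardness(scores)
```

In this toy setting, only the class-1 point surrounded by class-0 neighbors ends up in `hard`. A sampling step could then treat the two groups differently, for example by reviewing hard instances as removal candidates or oversampling only around easy ones, rather than sampling a whole class uniformly.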


