PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

09/29/2021
by Pedro Yuri Arbs Paiva, et al.

Building successful Machine Learning (ML) systems requires both high-quality data and well-tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on that dataset be revealed? Our new tool, PyHard, employs a methodology known as Instance Space Analysis (ISA) to produce a hardness embedding of a dataset, relating the predictive performance of multiple ML models to estimated instance hardness meta-features. This space is built so that observations are distributed linearly according to how hard they are to classify. The user can interact visually with the embedding in multiple ways and obtain useful insights about the data and about algorithmic performance at the level of individual observations. Using a COVID prognosis dataset, we show how this analysis supported the identification of pockets of hard observations that challenge ML models and are therefore worth closer inspection, as well as the delineation of regions of strength and weakness of ML models.
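
To make the notion of an "instance hardness meta-feature" concrete, the sketch below computes one widely used measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbors that carry a different class label. This is an illustration only, not PyHard's API or the authors' implementation; the function name `k_disagreeing_neighbors`, the choice of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """Hypothetical helper: kDN hardness score for every instance.

    For each instance, returns the fraction of its k nearest neighbors
    (Euclidean distance, assumed here) whose label differs from its own.
    Scores near 1.0 suggest a hard instance, scores near 0.0 an easy one.
    """
    n = len(X)
    if n < 2 or k < 1:
        raise ValueError("need at least two instances and k >= 1")
    scores = []
    for i in range(n):
        # Distances from instance i to every other instance.
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        neighbors = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        scores.append(disagree / len(neighbors))
    return scores


# Toy data (made up): two well-separated clusters, plus one class-1 point
# placed in the middle of the class-0 cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
for point, label, score in zip(X, y, hardness):
    print(point, label, round(score, 2))
```

In this toy example the class-1 point sitting inside the class-0 cluster receives the highest score, since all of its nearest neighbors disagree with its label. A hardness embedding of the kind described in the abstract would combine several such meta-features with the per-instance performance of multiple models; that projection step is not shown here.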


research
12/04/2022

Characterizing instance hardness in classification and regression problems

Some recent pieces of work in the Machine Learning (ML) literature have ...
research
07/28/2023

YOLOv8 for Defect Inspection of Hexagonal Directed Self-Assembly Patterns: A Data-Centric Approach

Shrinking pattern dimensions leads to an increased variety of defect typ...
research
03/29/2022

HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques

Despite the tremendous advances in machine learning (ML), training with ...
research
12/15/2021

Fix your Models by Fixing your Datasets

The quality of underlying training data is very crucial for building per...
research
02/24/2017

Mapping Patent Classifications: Portfolio and Statistical Analysis, and the Comparison of Strengths and Weaknesses

The Cooperative Patent Classifications (CPC) jointly developed by the Eu...
research
09/04/2023

Which algorithm to select in sports timetabling?

Any sports competition needs a timetable, specifying when and where team...
research
10/24/2022

Hardness in Markov Decision Processes: Theory and Practice

Meticulously analysing the empirical strengths and weaknesses of reinfor...
