Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces

06/27/2023
by Szymon Mazurek, et al.

In this paper, we examine dataset quality assessment in machine learning classification tasks. Using nine distinct datasets, each crafted for classification tasks of varying complexity, we illustrate the impact of dataset quality on model training and performance. We further introduce two additional datasets designed to represent specific data conditions: one maximizing entropy and the other demonstrating high redundancy. Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality in achieving high-performing machine learning models. To aid researchers and practitioners, we propose a comprehensive framework for dataset quality assessment that can help evaluate whether the dataset at hand is sufficient and of the required quality for a specific task. This research offers insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
