Class Density and Dataset Quality in High-Dimensional, Unstructured Data

02/08/2022
by Adam Byerly, et al.

We provide a definition for class density that can be used to measure the aggregate similarity of the samples within each class of a high-dimensional, unstructured dataset. We then put forth several candidate methods for calculating class density and analyze the correlation between the values each method produces and the corresponding individual class test accuracies achieved on a trained model. Additionally, we propose a definition for dataset quality for high-dimensional, unstructured data and show that datasets that met a certain quality threshold (experimentally demonstrated to be > 10 for the datasets studied) were candidates for eliding redundant data based on the individual class densities.
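To make the idea of correlating a per-class density with per-class test accuracy concrete, here is a minimal sketch. The abstract does not say which candidate methods the authors use, so the measure below (mean pairwise cosine similarity within a class) is an illustrative stand-in rather than the paper's definition. The function names `class_density` and `density_accuracy_correlation` are hypothetical.

```python
import numpy as np


def class_density(samples: np.ndarray) -> float:
    """Mean pairwise cosine similarity among the samples of one class.

    ASSUMPTION: an illustrative stand-in for "aggregate similarity";
    it is not one of the candidate methods from the paper.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to compare")
    # Flatten each sample (e.g. an image) into a single vector.
    flat = samples.reshape(n, -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    # Average over off-diagonal entries only (exclude self-similarity).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def density_accuracy_correlation(densities, accuracies) -> float:
    """Pearson correlation between per-class densities and accuracies."""
    return float(np.corrcoef(densities, accuracies)[0, 1])
```

Under these assumptions, one would compute `class_density` once per class on the training samples and pass the resulting list, together with the per-class test accuracies of a trained model, to `density_accuracy_correlation`.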


research
02/20/2013

High-Dimensional Probability Estimation with Deep Density Models

One of the fundamental problems in machine learning is the estimation of...
research
12/04/2019

Sub-linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data

Kernel density estimation is a simple and effective method that lies at ...
research
01/05/2021

Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Machine learning has been proven to be effective in various application ...
research
08/02/2022

On Good 2-Query Locally Testable Codes from Sheaves on High Dimensional Expanders

We expose a strong connection between good 2-query locally testable code...
research
09/04/2016

High Dimensional Human Guided Machine Learning

Have you ever looked at a machine learning classification model and thou...
research
05/31/2023

Representer Point Selection for Explaining Regularized High-dimensional Models

We introduce a novel class of sample-based explanations we term high-dim...
research
06/24/2016

Multipartite Ranking-Selection of Low-Dimensional Instances by Supervised Projection to High-Dimensional Space

Pruning of redundant or irrelevant instances of data is a key to every s...
