Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

01/05/2021
by Hyeongmin Cho, et al.

Machine learning has proven effective in various application areas, such as object and speech recognition on mobile systems. Since the availability of large training data is critical to machine learning success, many datasets are being published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, few practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms, based on random projections and bootstrapping, to compute our quality measures with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
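The abstract does not define the proposed measures, so the following is only a rough sketch of the general idea of combining random projections with bootstrapping to score class separability cheaply. It is not the authors' algorithm: the function name `projected_scatter_ratio`, its parameters, and the use of a Fisher-style between-class to within-class scatter ratio are all assumptions made for illustration.

```python
import random


def projected_scatter_ratio(X, y, n_projections=20, n_bootstrap=10,
                            sample_size=200, seed=0):
    """Hypothetical separability score (NOT the paper's measure).

    Averages a Fisher-style between/within scatter ratio over random
    1-D projections of bootstrap samples. X is a list of equal-length
    feature lists, y is a list of class labels.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    ratios = []
    for _ in range(n_bootstrap):
        # Bootstrap: draw row indices with replacement.
        idx = [rng.randrange(n) for _ in range(min(sample_size, n))]
        for _ in range(n_projections):
            # Random Gaussian direction; the ratio below is invariant to
            # the scale of w, so no normalization is needed.
            w = [rng.gauss(0.0, 1.0) for _ in range(d)]
            # Project each sampled point onto w, grouped by class label.
            groups = {}
            for i in idx:
                z = sum(wj * xj for wj, xj in zip(w, X[i]))
                groups.setdefault(y[i], []).append(z)
            all_z = [z for zs in groups.values() for z in zs]
            overall = sum(all_z) / len(all_z)
            between = 0.0
            within = 0.0
            for zs in groups.values():
                m = sum(zs) / len(zs)
                between += len(zs) * (m - overall) ** 2
                within += sum((z - m) ** 2 for z in zs)
            if within > 0:
                ratios.append(between / within)
    return sum(ratios) / len(ratios) if ratios else float("nan")
```

Under these assumptions, a larger score suggests classes that are easier to separate along typical random directions. Each projection costs time linear in the sample size and dimension, which is the kind of saving that projection-based approaches aim for on high-dimensional data.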


research
04/25/2023

VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data

An end-to-end machine learning (ML) lifecycle consists of many iterative...
research
08/10/2020

Measures of Complexity for Large Scale Image Datasets

Large scale image datasets are a growing trend in the field of machine l...
research
10/28/2016

SOL: A Library for Scalable Online Learning Algorithms

SOL is an open-source library for scalable online learning algorithms, a...
research
02/08/2022

Class Density and Dataset Quality in High-Dimensional, Unstructured Data

We provide a definition for class density that can be used to measure th...
research
07/05/2021

An Analytical Survey on Recent Trends in High Dimensional Data Visualization

Data visualization is the process by which data of any size or dimension...
research
04/23/2019

Block-distributed Gradient Boosted Trees

The Gradient Boosted Tree (GBT) algorithm is one of the most popular mac...
research
10/10/2017

Statistical Methods and Workflow for Analyzing Human Metabolomics Data

High-throughput metabolomics investigations, when conducted in large hum...
