Making Early Predictions of the Accuracy of Machine Learning Applications

12/05/2012
by J. E. Smith, et al.

The accuracy of machine learning systems is a widely studied research topic. Established techniques such as cross-validation predict the accuracy on unseen data of the classifier produced by applying a given learning method to a given training data set. However, they do not predict whether incurring the cost of obtaining more data and undergoing further training will lead to higher accuracy. In this paper we investigate techniques for making such early predictions. We note that when a machine learning algorithm is presented with a training set, the classifier produced, and hence its error, will depend on the characteristics of the algorithm, on the size of the training set, and also on its specific composition. In particular we hypothesise that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behaviour may be predictable. We test our hypothesis by building models that, given a measurement taken from the classifier created from a limited number of samples, predict the values that would be measured from the classifier produced when the full data set is presented. We create separate models for bias, variance and total error. Our models are built from the results of applying ten different machine learning algorithms to a range of data sets, and tested with "unseen" algorithms and data sets. We analyse the results for various numbers of initial training samples and total data set sizes. Results show that our predictions are very highly correlated with the values observed after undertaking the extra training. Finally we consider the more complex case where an ensemble of heterogeneous classifiers is trained, and show how we can accurately estimate an upper bound on the accuracy achievable after further training.
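The central idea, decomposing a classifier's observed error into bias and variance components and tracking each as the training set grows, can be sketched concretely. Below is a minimal illustration, assuming scikit-learn and a Domingos-style 0-1 loss decomposition (main-prediction bias plus variance over bootstrap-trained classifiers); the function name, dataset, and choice of decomposition are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: estimating bias and variance components of a
# classifier's 0-1 loss by training on repeated bootstrap samples.
# The decomposition (main-prediction bias / variance, after
# Domingos 2000) is an illustrative assumption; note that for 0-1
# loss, bias + variance does not sum exactly to total error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bias_variance_01(model_factory, X_train, y_train, X_test, y_test,
                     n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = np.empty((n_rounds, len(X_test)), dtype=int)
    for r in range(n_rounds):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        clf = model_factory().fit(X_train[idx], y_train[idx])
        preds[r] = clf.predict(X_test)
    # Main prediction: majority vote over rounds for each test point.
    main = np.apply_along_axis(lambda p: np.bincount(p).argmax(), 0, preds)
    bias = np.mean(main != y_test)                 # main prediction is wrong
    variance = np.mean(preds != main[None, :])     # disagreement with main
    total = np.mean(preds != y_test[None, :])      # average 0-1 loss
    return bias, variance, total

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
for m in (100, 400, 1000):                         # growing "early" subsets
    b, v, t = bias_variance_01(DecisionTreeClassifier,
                               X_tr[:m], y_tr[:m], X_te, y_te)
    print(f"n={m:4d}  bias={b:.3f}  variance={v:.3f}  error={t:.3f}")
```

The (bias, variance, error) readings taken at small sample sizes could then serve as inputs to regression models that predict the values measured at the full data set size, which is the role played by the paper's separate bias, variance and total-error models.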

Related research:

- Recommending Training Set Sizes for Classification (02/16/2021): Based on a comprehensive study of 20 established data sets, we recommend...
- Data Separability for Neural Network Classifiers and the Development of a Separability Index (05/27/2020): In machine learning, the performance of a classifier depends on both the...
- Modeling Generalization in Machine Learning: A Methodological and Computational Study (06/28/2020): As machine learning becomes more and more available to the general publi...
- An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects (07/07/2022): When training a machine learning classifier on data where one of the cla...
- Representational Multiplicity Should Be Exposed, Not Eliminated (06/17/2022): It is prevalent and well-observed, but poorly understood, that two machi...
- Some Theory For Practical Classifier Validation (10/09/2015): We compare and contrast two approaches to validating a trained classifie...
- Data-Centric Machine Learning in the Legal Domain (01/17/2022): Machine learning research typically starts with a fixed data set created...
