A critical look at the current train/test split in machine learning

06/08/2021
by Jimin Tan, et al.

The randomized or cross-validated split of training and testing sets has been the gold standard of machine learning for decades. These split protocols rest on two assumptions: (i) the dataset is fixed and static, so that different machine learning algorithms or models can be evaluated against it; and (ii) a complete set of annotated data is available to researchers or industrial practitioners. In this article, we take a closer, critical look at the split protocol itself and point out its weaknesses and limitations, especially for industrial applications. In many real-world problems, assumption (ii) does not hold. For instance, in interdisciplinary applications such as drug discovery, annotating data often requires real lab experiments, which are costly in both time and money. In other words, it can be very difficult or even impossible to satisfy assumption (ii). We assess this problem, revisit the paradigm of active learning, and investigate its potential for solving problems under unconventional train/test split protocols. We further propose a new adaptive active learning architecture (AAL) that involves an adaptation policy, in contrast to traditional active learning, which only unidirectionally adds data points to the training pool. We justify our points primarily through an extensive investigation of an interdisciplinary drug-protein binding problem. We additionally evaluate AAL on more conventional machine learning benchmark datasets such as CIFAR-10 to demonstrate the generalizability and efficacy of the new framework.
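For readers unfamiliar with the traditional setting the abstract contrasts against, the following is a minimal sketch of pool-based active learning in which labeled points are only ever added to the training pool. It is not the AAL architecture proposed in the paper, whose adaptation policy is not described here. The nearest-centroid model, the least-confidence query rule, the synthetic two-blob data, and all function names (`fit_centroids`, `predict_proba`, `active_learning_loop`) are illustrative assumptions.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy "model" (an assumption): one mean vector per class.
    # Requires at least one labeled example of every class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    # Softmax over negative squared distances to each class centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def active_learning_loop(X_pool, oracle, init_idx, n_rounds, batch_size, n_classes):
    # `oracle(i)` stands in for the costly annotation step (e.g. a lab experiment).
    labeled = [int(i) for i in init_idx]
    labels = {i: oracle(i) for i in labeled}
    for _ in range(n_rounds):
        X_l = X_pool[labeled]
        y_l = np.array([labels[i] for i in labeled])
        centroids = fit_centroids(X_l, y_l, n_classes)
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if len(unlabeled) == 0:
            break
        proba = predict_proba(centroids, X_pool[unlabeled])
        # Least-confidence sampling: query the points the model is least sure about.
        uncertainty = 1.0 - proba.max(axis=1)
        query = unlabeled[np.argsort(-uncertainty)[:batch_size]]
        for i in query:
            # Unidirectional growth: points are added, never removed.
            labels[int(i)] = oracle(int(i))
            labeled.append(int(i))
    return labeled, labels

# Synthetic two-class pool (illustrative data, not from the paper).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)

# Start with one labeled point per class, then run 5 rounds of 4 queries each.
labeled, labels = active_learning_loop(
    X, lambda i: int(y_true[i]), init_idx=[0, 50],
    n_rounds=5, batch_size=4, n_classes=2,
)
```

In this sketch only 22 of the 100 pool points are ever annotated, which illustrates why the approach is attractive when assumption (ii) fails and each label is expensive.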


research
10/16/2020

ALdataset: a benchmark for pool-based active learning

Active learning (AL) is a subfield of machine learning (ML) in which a l...
research
07/27/2022

ALBench: A Framework for Evaluating Active Learning in Object Detection

Active learning is an important technology for automated machine learnin...
research
06/25/2022

Making Look-Ahead Active Learning Strategies Feasible with Neural Tangent Kernels

We propose a new method for approximating active learning acquisition st...
research
10/27/2019

Prediction stability as a criterion in active learning

Recent breakthroughs made by deep learning rely heavily on large number ...
research
11/09/2021

An Interactive Visualization Tool for Understanding Active Learning

Despite recent progress in artificial intelligence and machine learning,...
research
07/15/2020

Experimental Design for Bathymetry Editing

We describe an application of machine learning to a real-world computer ...
research
08/13/2019

Icebreaker: Element-wise Active Information Acquisition with Bayesian Deep Latent Gaussian Model

In this paper we introduce the ice-start problem, i.e., the challenge of...
