DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

08/20/2023
by   Peng Li, et al.

Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, and it often requires domain expertise. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing, but they often use a restricted search space of data preprocessing pipelines, which limits the potential performance gains, and they are often too slow because they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
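To make the idea of relaxing a discrete search space concrete, the following is a minimal sketch and not DiffPrep's actual implementation. It assumes a softmax-weighted mixture over candidate operators, a common relaxation in differentiable search; the abstract does not specify the exact parameterization. The three candidate operators and the names `relaxed_transform` and `discretize` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical candidate operators for one preprocessing step (illustrative only).
def identity(x):
    return x

def standardize(x):
    # Zero mean, unit variance per column; epsilon guards against constant columns.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def min_max(x):
    # Rescale each column to roughly [0, 1].
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + 1e-8)

CANDIDATES = [identity, standardize, min_max]

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()

def relaxed_transform(x, logits):
    """Continuous relaxation: a softmax-weighted mixture of all candidates.

    Picking exactly one operator is a discrete choice; the mixture is a
    smooth function of `logits`, so it could be tuned by gradient descent.
    """
    weights = softmax(logits)
    outputs = np.stack([op(x) for op in CANDIDATES])  # (n_ops, n_rows, n_cols)
    return np.tensordot(weights, outputs, axes=1)     # (n_rows, n_cols)

def discretize(logits):
    """After the search, keep only the highest-weighted operator."""
    return CANDIDATES[int(np.argmax(logits))]

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
mixed = relaxed_transform(x, np.zeros(3))  # equal weight on every operator
```

In a full pipeline search, the logits would be updated with an autodiff framework against a validation loss while the model weights are updated against the training loss; this sketch covers only the relaxation step.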

research
01/26/2021

Incremental Search Space Construction for Machine Learning Pipeline Synthesis

Automated machine learning (AutoML) aims for constructing machine learni...
research
02/28/2023

Towards Personalized Preprocessing Pipeline Search

Feature preprocessing, which transforms raw input features into numerica...
research
08/21/2017

nuts-flow/ml: data pre-processing for deep learning

Data preprocessing is a fundamental part of any machine learning applica...
research
11/27/2021

AutoTSC: Optimization Algorithm to Automatically Solve the Time Series Classification Problem

Nowadays Automated Machine Learning, abbreviated AutoML, is recognize...
research
11/27/2021

TPOT-SH: a Faster Optimization Algorithm to Solve the AutoML Problem on Large Datasets

Data are omnipresent nowadays and contain knowledge and patterns that...
research
06/15/2018

Automated Image Data Preprocessing with Deep Reinforcement Learning

Data preparation, i.e. the process of transforming raw data into a forma...
research
04/26/2023

AutoCure: Automated Tabular Data Curation Technique for ML Pipelines

Machine learning algorithms have become increasingly prevalent in multip...
