SubStrat: A Subset-Based Strategy for Faster AutoML

06/07/2022
by Teddy Lazebnik, et al.

Automated machine learning (AutoML) frameworks have become important tools in the data scientist's arsenal, as they dramatically reduce the manual work devoted to constructing ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically comprising feature engineering, model selection, and hyperparameter tuning steps) and output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, so the overall AutoML running time becomes increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset and, finally, refines the resulting pipeline by executing a restricted, much shorter AutoML process on the large dataset. Our experimental results, performed on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than 2% average loss in the accuracy of the resulting ML pipeline.
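The three-step flow described in the abstract (find a representative subset, run AutoML on it, then refine on the full data) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the subset is chosen over rows only, the "particular characteristic" is stood in for by the vector of per-column means, the genetic operators are deliberately simple, and `automl` and `refine` are hypothetical callables standing in for an AutoML tool such as Auto-Sklearn or TPOT.

```python
import random
from statistics import mean


def column_means(rows):
    """Stand-in dataset characteristic (an assumption): per-column means."""
    return [mean(col) for col in zip(*rows)]


def fitness(data, idx, target):
    """Higher is better: negative L1 distance between the subset's
    characteristic and the full dataset's characteristic."""
    sub = column_means([data[i] for i in idx])
    return -sum(abs(a - b) for a, b in zip(sub, target))


def find_subset(data, size, generations=30, pop_size=20, seed=0):
    """Simple genetic search for `size` row indices whose characteristic
    is close to that of the full data."""
    rng = random.Random(seed)
    n = len(data)
    target = column_means(data)
    # Each individual is a list of distinct row indices.
    population = [rng.sample(range(n), size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda idx: fitness(data, idx, target), reverse=True)
        survivors = population[: pop_size // 2]  # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw the child from the union of both parents.
            child = rng.sample(list(set(a) | set(b)), size)
            # Mutation: occasionally swap one index for an unused row.
            if rng.random() < 0.3:
                unused = [i for i in range(n) if i not in child]
                if unused:
                    child[rng.randrange(size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda idx: fitness(data, idx, target))


def substrat_sketch(data, automl, refine, subset_size):
    """Hypothetical wrapper: `automl(rows)` returns a pipeline and
    `refine(pipeline, rows)` runs a restricted search on the full data."""
    idx = find_subset(data, subset_size)
    subset = [data[i] for i in idx]
    pipeline = automl(subset)      # full AutoML search, small data
    return refine(pipeline, data)  # restricted, shorter search, full data
```

Because the fitter half of each generation is carried over unchanged, the best subset never gets worse from one generation to the next. In a real setting, the two callables would wrap the chosen AutoML framework, and the characteristic would be whichever measure the full paper specifies.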

research
11/21/2020

AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation

Automated machine learning pipeline (ML) composition and optimisation ai...
research
04/17/2023

eTOP: Early Termination of Pipelines for Faster Training of AutoML Systems

Recent advancements in software and hardware technologies have enabled t...
research
02/18/2022

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

Automatic machine learning, or AutoML, holds the promise of truly democr...
research
07/04/2022

DiffML: End-to-end Differentiable ML Pipelines

In this paper, we present our vision of differentiable ML pipelines call...
research
06/29/2023

AutoML in Heavily Constrained Applications

Optimizing a machine learning pipeline for a task at hand requires caref...
research
04/26/2023

AutoCure: Automated Tabular Data Curation Technique for ML Pipelines

Machine learning algorithms have become increasingly prevalent in multip...
research
08/28/2023

Towards Evolution Capabilities in Data Pipelines

Evolutionary change over time in the context of data pipelines is certai...