OpenFE: Automated Feature Generation beyond Expert-level Performance

11/22/2022
by Tianping Zhang, et al.

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving learning performance on tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features; and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two well-known Kaggle competitions, each with thousands of participating data science teams. In one of the competitions, features generated by OpenFE, combined with a simple baseline model, beat 99.3% of the data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting. The code is available at https://github.com/ZhangTP1996/OpenFE.
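To make the two components in the abstract more concrete, here is a minimal sketch of the successive-halving idea in plain Python. It is not OpenFE's implementation: the function names (`abs_corr`, `successive_halving`), the `keep` and `n_blocks` parameters, and the synthetic data are all hypothetical. The scoring rule (absolute correlation between a candidate feature and the residual of a base prediction) is a simple stand-in for the feature boosting step, which the abstract describes only as estimating the incremental performance of a candidate. The feature importance attribution stage is omitted.

```python
import math
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def successive_halving(candidates, residuals, keep=1, n_blocks=8, seed=0):
    """Halve the candidate pool each round while doubling the rows used to score.

    candidates: dict mapping a feature name to its list of values (one per row)
    residuals:  target minus base prediction, one per row (stand-in score target)
    keep:       stop once at most this many candidates survive (assumed parameter)
    """
    rng = random.Random(seed)
    rows = list(range(len(residuals)))
    rng.shuffle(rows)
    alive = list(candidates)
    n_rows = max(2, len(rows) // n_blocks)
    while len(alive) > keep:
        idx = rows[:n_rows]
        res = [residuals[i] for i in idx]
        scores = {
            name: abs_corr([candidates[name][i] for i in idx], res)
            for name in alive
        }
        # Keep the better-scoring half, then score survivors on more rows.
        alive.sort(key=scores.get, reverse=True)
        alive = alive[: max(keep, len(alive) // 2)]
        n_rows = min(len(rows), n_rows * 2)
    return alive


# Synthetic demo: the target depends on the product of two base features.
rng = random.Random(42)
x1 = [rng.uniform(-1, 1) for _ in range(400)]
x2 = [rng.uniform(-1, 1) for _ in range(400)]
y = [a * b + rng.gauss(0, 0.05) for a, b in zip(x1, x2)]

# Assumed base prediction: the target mean (a deliberately weak base model).
base = sum(y) / len(y)
residuals = [t - base for t in y]

candidates = {
    "x1+x2": [a + b for a, b in zip(x1, x2)],
    "x1-x2": [a - b for a, b in zip(x1, x2)],
    "x1*x2": [a * b for a, b in zip(x1, x2)],
    "x1**2": [a * a for a in x1],
}
survivors = successive_halving(candidates, residuals, keep=1)
print(survivors)
```

The point of the design is cost control: weak candidates are discarded after being scored on a small slice of the data, so only the promising ones are ever evaluated on many rows.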


