GPT for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering

05/05/2023
by Noah Hollmann, et al.

As the field of automated machine learning (AutoML) advances, it becomes increasingly important to incorporate domain knowledge into these systems. We present an approach for doing so by harnessing the power of large language models (LLMs). Specifically, we introduce Context-Aware Automated Feature Engineering (CAAFE), a feature engineering method for tabular datasets that uses an LLM to generate additional semantically meaningful features based on the dataset descriptions. The method produces both Python code for creating new features and explanations of the utility of the generated features. Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets, ties on 2, and loses on 1, boosting mean ROC AUC performance from 0.798 to 0.822 across all datasets. On the evaluated datasets, this improvement is similar to the average improvement achieved by using a random forest (AUC 0.782) instead of logistic regression (AUC 0.754). Furthermore, our method offers valuable insights into the rationale behind the generated features by providing a textual explanation for each one. CAAFE paves the way for more extensive (semi-)automation in data science tasks and emphasizes the significance of context-aware solutions that can extend the scope of AutoML systems. For reproducibility, we release our code and a simple demo.
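The comparison in the abstract can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted above; the variable names are illustrative and are not taken from the released code.

```python
# Mean ROC AUC figures quoted in the abstract.
auc_without_caafe = 0.798
auc_with_caafe = 0.822
auc_logistic_regression = 0.754
auc_random_forest = 0.782

# Per-dataset outcome counts quoted in the abstract.
wins, ties, losses = 11, 2, 1

# Gain from adding CAAFE features vs. gain from switching model family.
caafe_gain = round(auc_with_caafe - auc_without_caafe, 3)
model_gain = round(auc_random_forest - auc_logistic_regression, 3)
n_datasets = wins + ties + losses

print(f"Datasets evaluated: {n_datasets}")
print(f"Gain from CAAFE features: {caafe_gain:.3f}")
print(f"Gain from random forest over logistic regression: {model_gain:.3f}")
```

The two gains come out at 0.024 and 0.028 ROC AUC, which is the sense in which the abstract calls the improvements similar.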

research
05/12/2021

Automating Data Science: Prospects and Challenges

Given the complexity of typical data science projects and the associated...
research
11/22/2022

OpenFE: Automated Feature Generation beyond Expert-level Performance

The goal of automated feature generation is to liberate machine learning...
research
05/16/2022

A Survey on Semantics in Automated Data Science

Data Scientists leverage common sense reasoning and domain knowledge to ...
research
09/12/2019

Augmented Data Science: Towards Industrialization and Democratization of Data Science

Conversion of raw data into insights and knowledge requires substantial ...
research
10/07/2020

CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails

Targeted phishing emails are on the rise and facilitate the theft of bil...
research
07/06/2022

An Overview on Designs and Applications of Context-Aware Automation Systems

Automation systems are increasingly being used in dynamic and various op...
research
09/14/2023

SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions

Before applying data analytics or machine learning to a data set, a vita...
