Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

09/18/2023
by Jonas Golde, et al.

Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to "generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment." The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.
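
The teacher-student workflow described in the abstract can be illustrated with a minimal, self-contained Python sketch. It does not use Fabricator's actual API, which is not shown here. The names `fake_teacher`, `generate_dataset`, `train_student`, and `predict` are hypothetical, the "teacher" is a stub that returns canned sentences in place of a real LLM call, and the "student" is a tiny word-count classifier chosen only to keep the example runnable without dependencies.

```python
import math
from collections import Counter

# Task description sent to the teacher for each label (illustrative wording).
PROMPT_TEMPLATE = "Generate a movie review with {label} overall sentiment."

# ASSUMPTION: canned outputs standing in for what a real teacher LLM would return.
_CANNED = {
    "positive": [
        "a wonderful and moving film",
        "great acting and a brilliant script",
        "I loved every minute",
    ],
    "negative": [
        "a dull and boring mess",
        "terrible acting and a weak script",
        "I hated every minute",
    ],
}


def fake_teacher(prompt: str, label: str, i: int) -> str:
    """Hypothetical stub for an LLM call; a real teacher would use `prompt`."""
    return _CANNED[label][i % len(_CANNED[label])]


def generate_dataset(labels, n_per_label):
    """Prompt the teacher once per example and attach the requested label."""
    dataset = []
    for label in labels:
        prompt = PROMPT_TEMPLATE.format(label=label)
        for i in range(n_per_label):
            text = fake_teacher(prompt, label, i)
            dataset.append({"text": text, "label": label})
    return dataset


def train_student(dataset):
    """Train a toy student: per-label word counts (Naive Bayes without priors)."""
    model = {}
    for example in dataset:
        words = example["text"].lower().split()
        model.setdefault(example["label"], Counter()).update(words)
    return model


def predict(model, text):
    """Return the label with the highest add-one-smoothed log-likelihood."""
    vocab = {word for counts in model.values() for word in counts}

    def score(label):
        counts = model[label]
        total = sum(counts.values())
        return sum(
            math.log((counts[word] + 1) / (total + len(vocab)))
            for word in text.lower().split()
        )

    return max(model, key=score)


# Generate labeled data with the stub teacher, then train and query the student.
dataset = generate_dataset(["positive", "negative"], n_per_label=5)
student = train_student(dataset)
print(predict(student, "a brilliant film"))
```

In a real experiment, the stub would be replaced by calls to an actual LLM and the student by a trainable model such as a fine-tuned transformer; the structure of the loop (prompt, collect, label, train) is the part this sketch is meant to convey.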

research
02/16/2022

ZeroGen: Efficient Zero-shot Learning via Dataset Generation

There is a growing interest in dataset generation recently due to the su...
research
04/19/2021

skweak: Weak Supervision Made Easy for NLP

We present skweak, a versatile, Python-based software toolkit enabling N...
research
05/02/2020

Exploring and Predicting Transferability across NLP Tasks

Recent advances in NLP demonstrate the effectiveness of training large-s...
research
10/05/2020

TextAttack: Lessons learned in designing Python frameworks for NLP

TextAttack is an open-source Python toolkit for adversarial attacks, adv...
research
06/17/2021

pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks

Extracting opinions from texts has gathered a lot of interest in the las...
research
03/02/2021

A Data-Centric Framework for Composable NLP Workflows

Empirical natural language processing (NLP) systems in application domai...
research
07/20/2022

Large Scale Radio Frequency Signal Classification

Existing datasets used to train deep learning models for narrowband radi...
