LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction

04/17/2023
by Abdullatif Köksal, et al.

Instruction tuning enables language models to generalize more effectively and better follow user intent. However, obtaining instruction data can be costly and challenging. Prior work relies on expensive human annotation, crowd-sourced datasets with alignment issues, or noisy examples generated via LLMs. We introduce the LongForm dataset, created by pairing English corpus examples with augmented instructions. We select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for the given documents via LLMs. This approach yields a cheaper and cleaner instruction-tuning dataset that is also well suited to long text generation. We finetune T5, OPT, and LLaMA models on our dataset and show that even smaller LongForm models generalize well to text generation tasks. Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin. Finally, our models can effectively follow and answer multilingual instructions; we demonstrate this for news generation. We publicly release our data and models: https://github.com/akoksal/LongForm.
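The pipeline the abstract describes (sample human-written documents from a corpus, then ask an LLM which instruction each document would answer) can be illustrated with a minimal sketch. The prompt wording, the gpt2 stand-in generator, and the make_instruction helper below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of corpus extraction + instruction generation.
# Assumptions: the prompt template and the gpt2 placeholder model are
# illustrative only; the paper uses its own LLM prompting for this step.
from datasets import load_dataset
from transformers import pipeline

# Any instruction-capable LLM could be substituted here; gpt2 is a small
# placeholder so the sketch runs end to end.
generator = pipeline("text-generation", model="gpt2")

# Hypothetical "reverse" prompt: present the document as an output and
# ask the model which instruction would have produced it.
PROMPT = (
    "Instruction: X\n"
    "Output: {document}\n"
    "What kind of instruction could this output be the answer to?\n"
    "X:"
)

def make_instruction(document: str) -> str:
    """Generate an instruction whose answer would be `document`."""
    prompt = PROMPT.format(document=document[:1000])  # keep prompts short
    out = generator(prompt, max_new_tokens=64, do_sample=True)
    # The pipeline echoes the prompt, so strip it off.
    return out[0]["generated_text"][len(prompt):].strip()

# Stream a handful of human-written C4 documents and pair each one with
# a generated instruction, yielding (instruction, output) training pairs.
corpus = load_dataset("allenai/c4", "en", split="train", streaming=True)
pairs = []
for i, example in enumerate(corpus):
    if i >= 10:
        break
    pairs.append({"instruction": make_instruction(example["text"]),
                  "output": example["text"]})
```

Pairs of this form would then serve as supervised data for finetuning a model such as T5, OPT, or LLaMA on instruction-to-output generation.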

Related research

M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning (06/07/2023)
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records (08/27/2023)
Thinking Like an Annotator: Generation of Dataset Labeling Instructions (06/24/2023)
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (05/23/2023)
Self-QA: Unsupervised Knowledge Guided Language Model Alignment (05/19/2023)
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models (08/24/2023)
Self-Alignment with Instruction Backtranslation (08/11/2023)
