Data-Juicer: A One-Stop Data Processing System for Large Language Models

09/05/2023
by   Daoyuan Chen, et al.

The rapid evolution of Large Language Models (LLMs) has underscored the importance of massive, diverse, and high-quality data. Despite this, existing open-source tools for LLM data processing remain limited and mostly tailored to specific datasets, with an emphasis on the reproducibility of released data over adaptability and usability, inhibiting potential applications. In response, we propose a one-stop, powerful yet flexible and user-friendly LLM data processing system named Data-Juicer. Our system offers over 50 built-in versatile operators and pluggable tools, which synergize modularity, composability, and extensibility dedicated to diverse LLM data processing needs. By incorporating visualized and automatic evaluation capabilities, Data-Juicer enables a timely feedback loop to accelerate data processing and gain data insights. To enhance usability, Data-Juicer provides out-of-the-box components for users with various backgrounds, and fruitful data recipes for LLM pre-training and post-tuning use. Further, we employ multi-facet system optimization and seamlessly integrate Data-Juicer with both LLM and distributed computing ecosystems, to enable efficient and scalable data processing. Empirical validation of the generated data recipes reveals considerable improvements in LLaMA performance for various pre-training and post-tuning cases, demonstrating up to 7.45% [...] 16 LLM benchmarks and 16.25% [...]. The system's efficiency and scalability are also validated, supported by up to 88.7% [...] and CPU usage respectively, and 7.91x processing acceleration when utilizing distributed computing ecosystems. Our system, data recipes, and multiple tutorial demos are released, calling for broader research centered on LLM data.
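To make the notion of composable operators and data recipes concrete, the following is a minimal Python sketch. It is not Data-Juicer's actual API: the operator names (`normalize_whitespace`, `length_filter`, `exact_deduplicate`), the `run_recipe` helper, and the `{"text": ...}` sample format are all hypothetical, chosen only to illustrate how small, independent operators might be chained into an ordered recipe.

```python
from typing import Callable, Dict, List

# Hypothetical sample format and operator signature (assumptions for
# illustration; not taken from Data-Juicer).
Sample = Dict[str, str]
Operator = Callable[[List[Sample]], List[Sample]]


def normalize_whitespace(samples: List[Sample]) -> List[Sample]:
    """Collapse runs of whitespace in each sample's text into single spaces."""
    return [{**s, "text": " ".join(s["text"].split())} for s in samples]


def length_filter(min_chars: int) -> Operator:
    """Build an operator that keeps samples with at least `min_chars` characters."""
    def op(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if len(s["text"]) >= min_chars]
    return op


def exact_deduplicate(samples: List[Sample]) -> List[Sample]:
    """Drop samples whose text exactly matches an earlier sample."""
    seen = set()
    kept = []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept


def run_recipe(samples: List[Sample], recipe: List[Operator]) -> List[Sample]:
    """Apply each operator in order; the output of one feeds the next."""
    for op in recipe:
        samples = op(samples)
    return samples


# A "recipe" here is just an ordered list of operators.
RECIPE: List[Operator] = [normalize_whitespace, length_filter(10), exact_deduplicate]

raw = [
    {"text": "Large   language models need clean data."},
    {"text": "Large language models need clean data."},
    {"text": "too short"},
]
cleaned = run_recipe(raw, RECIPE)
print(cleaned)
```

Because every operator shares one signature, operators can be reordered, removed, or replaced without touching the others, which is the kind of modularity and extensibility the abstract describes.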

