Chinese Open Instruction Generalist: A Preliminary Release

04/17/2023
by Ge Zhang, et al.

Instruction tuning is widely recognized as a key technique for building generalist language models, and it has attracted the attention of researchers and the public since the release of InstructGPT <cit.> and ChatGPT[<https://chat.openai.com/>]. Despite impressive progress in English-oriented large-scale language models (LLMs), it remains under-explored whether English-based foundation LLMs, given well-designed instruction tuning, can perform as well on multilingual tasks as on English tasks, and how the corpora needed for such tuning can be constructed. To address this gap, we propose the Chinese Open Instruction Generalist (COIG) project as an attempt to create a Chinese instruction dataset using methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to ensure high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting COIG corpora are available on Hugging Face[<https://huggingface.co/datasets/BAAI/COIG>] and GitHub[<https://github.com/BAAI-Zlab/COIG>], and will be continuously updated.


research
05/04/2023

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

This project focuses on enhancing open-source large language models thro...
research
05/19/2023

InstructIE: A Chinese Instruction-based Information Extraction Dataset

We introduce a new Information Extraction (IE) task dubbed Instruction-b...
research
04/17/2023

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolution...
research
01/16/2023

PromptShots at the FinNLP-2022 ERAI Tasks: Pairwise Comparison and Unsupervised Ranking

This report describes our PromptShots submissions to a shared task on Ev...
research
05/24/2023

PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology

As advances in large language models (LLMs) and multimodal techniques co...
research
08/14/2023

#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

Foundation language models obtain the instruction-following ability thro...
research
06/28/2023

On the Exploitability of Instruction Tuning

Instruction tuning is an effective technique to align large language mod...
