GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

07/26/2023
by   Yaxin Fan, et al.

Grammatical error correction aims to correct ungrammatical sentences automatically. Recent work has demonstrated the excellent capabilities of closed-source Large Language Models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduce GrammarGPT, an open-source LLM, as a preliminary exploration of that potential for native Chinese grammatical error correction (CGEC). The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collect ungrammatical sentences from publicly available websites and correct them manually. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors. We ultimately construct about 1k parallel data samples and use them to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT significantly outperforms the existing state-of-the-art (SOTA) system. Although the model has 20x more parameters than the SOTA baseline, the amount of data required for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs on native CGEC. Our GrammarGPT ranks 3rd on NLPCC2023 SharedTask1, demonstrating the effectiveness of our approach. The code and data are available at <https://github.com/FreedomIntelligence/GrammarGPT>.
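
As a rough illustration of the data side of this recipe, the sketch below wraps parallel (ungrammatical, corrected) sentence pairs as instruction-tuning records. It is a minimal sketch under stated assumptions, not the authors' pipeline: the instruction wording, the record field names, the helper names `to_instruction_record` and `build_dataset`, and the example sentence pair are all hypothetical and do not come from the paper or its dataset.

```python
import json

# Hypothetical instruction text; the abstract does not give the actual prompt.
INSTRUCTION = "Correct the grammatical errors in the following Chinese sentence."


def to_instruction_record(ungrammatical: str, corrected: str) -> dict:
    """Wrap one parallel pair as an instruction-tuning example.

    The field names (instruction/input/output) are an assumed layout,
    not a format confirmed by the source.
    """
    return {
        "instruction": INSTRUCTION,
        "input": ungrammatical,
        "output": corrected,
    }


def build_dataset(pairs):
    """Convert (ungrammatical, corrected) pairs to records, skipping empty sides."""
    return [
        to_instruction_record(src.strip(), tgt.strip())
        for src, tgt in pairs
        if src.strip() and tgt.strip()
    ]


# Illustrative pair only (a textbook-style missing-subject error),
# not taken from the GrammarGPT data. The second pair is dropped as incomplete.
EXAMPLE_PAIRS = [
    ("通过这次活动，使我明白了很多道理。", "这次活动使我明白了很多道理。"),
    ("", "空的源句会被跳过。"),
]

records = build_dataset(EXAMPLE_PAIRS)

# One JSON object per line, keeping Chinese characters readable.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Each line of `jsonl` could then be fed to whatever fine-tuning tooling is in use; the actual prompt template and training setup should be taken from the linked repository rather than from this sketch.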

research
10/22/2022

FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction

Grammatical Error Correction (GEC) has been broadly applied in automatic...
research
11/09/2017

Toward perfect reads

We propose a new method to correct short reads using de Bruijn graphs, a...
research
03/18/2022

Towards Lithuanian grammatical error correction

Everyone wants to write beautiful and correct text, yet the lack of lang...
research
06/23/2022

Mining Error Templates for Grammatical Error Correction

Some grammatical error correction (GEC) systems incorporate hand-crafted...
research
04/20/2022

ColorCode: A Bayesian Approach to Augmentative and Alternative Communication with Two Buttons

Many people with severely limited muscle control can only communicate th...
research
07/31/2022

Chinese grammatical error correction based on knowledge distillation

In view of the poor robustness of existing Chinese grammatical error cor...
research
08/14/2023

#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

Foundation language models obtain the instruction-following ability thro...
