A Large-Scale Chinese Short-Text Conversation Dataset

by Yida Wang, et al.

Advances in neural dialogue generation models have shown promising results in modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that comes in a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of the dataset is ensured by a rigorous data cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models, trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All models and datasets are available at https://github.com/thu-coai/CDial-GPT.
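As a rough illustration of the two-stage idea described above (rule-based filtering followed by a learned quality classifier), the following minimal sketch may help. It is not the authors' pipeline: the specific rules (URL removal, repeated-character check, length bounds), the names `passes_rules`, `clean`, and `toy_score`, and the 0.5 threshold are all assumptions made for illustration, and `toy_score` is only a stand-in for a classifier trained on annotated dialogue pairs.

```python
import re
from typing import Callable, Iterable, List, Tuple

# One dialogue is a tuple of utterances (hypothetical representation).
Dialogue = Tuple[str, ...]

# Hypothetical rules, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"(.)\1{5,}")  # same character 6 or more times in a row


def passes_rules(dialogue: Dialogue, min_len: int = 1, max_len: int = 100) -> bool:
    """Stage 1: cheap rule-based checks applied to every utterance."""
    if len(dialogue) < 2:  # need at least a post and a response
        return False
    for utterance in dialogue:
        text = utterance.strip()
        if not (min_len <= len(text) <= max_len):
            return False
        if URL_RE.search(text) or REPEAT_RE.search(text):
            return False
    return True


def clean(
    dialogues: Iterable[Dialogue],
    score_fn: Callable[[Dialogue], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Stage 2: keep dialogues that pass the rules and score at or above threshold."""
    return [d for d in dialogues if passes_rules(d) and score_fn(d) >= threshold]


def toy_score(dialogue: Dialogue) -> float:
    """Stand-in for a trained classifier: reject responses that copy the previous turn."""
    return 0.0 if dialogue[-1] == dialogue[-2] else 1.0


sample = [
    ("你好", "你好呀"),
    ("看这里 http://example.com", "好的"),
    ("哈哈哈哈哈哈哈哈", "嗯"),
    ("在吗", "在吗"),
    ("只有一句",),
]
cleaned = clean(sample, toy_score)
```

Running the rules before the classifier keeps the expensive scoring step away from dialogues that are obviously noisy. In a real setting `score_fn` would wrap a trained model rather than the toy heuristic used here.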



EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

Although pre-trained language models have remarkably enhanced the genera...

CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI

Human language expression is based on the subjective construal of the si...

Fine-Grained Sentence Functions for Short-Text Conversation

Sentence function is an important linguistic feature referring to a user...

LIDA: Lightweight Interactive Dialogue Annotator

Dialogue systems have the potential to change how people interact with m...

A Manually Annotated Chinese Corpus for Non-task-oriented Dialogue Systems

This paper presents a large-scale corpus for non-task-oriented dialogue ...

Paint4Poem: A Dataset for Artistic Visualization of Classical Chinese Poems

In this work we propose a new task: artistic visualization of classical ...

PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation

To explore the limit of dialogue generation pre-training, we present the...

Code Repositories


A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
