A Large-Scale Chinese Short-Text Conversation Dataset

by Yida Wang, et al.

Advances in neural dialogue generation models have shown promising results in modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that comes in a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of the dataset is ensured by a rigorous data cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All models and datasets are available at https://github.com/thu-coai/CDial-GPT.
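
The two-stage cleaning idea described in the abstract (rules first, then a learned classifier) can be illustrated with a minimal sketch. This is not the authors' pipeline: the specific rules (length bounds, URL removal, repeated-character check), the thresholds, and the names `passes_rules`, `clean_dialogues`, and `score_fn` are all assumptions made for illustration, and the classifier is represented only by a hypothetical scoring callable.

```python
import re
from typing import Callable, List, Tuple

Dialogue = Tuple[str, str]  # (post, response)

# Hypothetical rules; the abstract does not list the actual ones.
URL_PATTERN = re.compile(r"https?://\S+")
REPEAT_PATTERN = re.compile(r"(.)\1{5,}")  # same character 6+ times in a row


def passes_rules(utterance: str, min_len: int = 2, max_len: int = 100) -> bool:
    """Rule stage: reject utterances that are too short/long, contain a URL,
    or contain a long run of one repeated character (assumed thresholds)."""
    text = utterance.strip()
    if not (min_len <= len(text) <= max_len):
        return False
    if URL_PATTERN.search(text):
        return False
    if REPEAT_PATTERN.search(text):
        return False
    return True


def clean_dialogues(
    dialogues: List[Dialogue],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Keep a pair only if both sides pass the rules and the (hypothetical)
    classifier score for the pair reaches the threshold."""
    kept = []
    for post, response in dialogues:
        if not (passes_rules(post) and passes_rules(response)):
            continue
        if score_fn(post, response) < threshold:
            continue
        kept.append((post, response))
    return kept
```

In practice `score_fn` would wrap a classifier trained on annotated pairs; here any callable returning a score in [0, 1] can stand in for it.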



EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

Although pre-trained language models have remarkably enhanced the genera...

CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI

Human language expression is based on the subjective construal of the si...

Fine-Grained Sentence Functions for Short-Text Conversation

Sentence function is an important linguistic feature referring to a user...

LIDA: Lightweight Interactive Dialogue Annotator

Dialogue systems have the potential to change how people interact with m...

Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Natural language dialogue systems raise great attention recently. As man...

I run as fast as a rabbit, can you? A Multilingual Simile Dialogue Dataset

A simile is a figure of speech that compares two different things (calle...

DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

We introduce a data set called DCH-2, which contains 4,390 real customer...

Code Repositories


A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models
