A Large-Scale Chinese Short-Text Conversation Dataset

08/10/2020
by   Yida Wang, et al.

Advances in neural dialogue generation models have shown promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.
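The abstract describes a two-stage cleaning pipeline: hand-written rules followed by a quality classifier trained on the 110K annotated pairs. The paper's concrete rules and classifier are not given here, so the sketch below is only illustrative; the pattern list, length limits, threshold, and the `quality_clf` interface are assumptions, not the authors' implementation.

```python
import re

# Illustrative rule set: the actual LCCC rules (e.g. removing ads, toxic
# content, and generic replies) are not listed in the abstract, so these
# patterns and length limits are assumptions.
AD_PATTERN = re.compile(r"(加微信|代购|优惠券)")  # hypothetical ad keywords
MIN_LEN, MAX_LEN = 2, 100

def rule_filter(post: str, response: str) -> bool:
    """Return True if the (post, response) pair passes the rule-based checks."""
    for utterance in (post, response):
        if not (MIN_LEN <= len(utterance) <= MAX_LEN):
            return False
        if AD_PATTERN.search(utterance):
            return False
    return True

def clean_corpus(pairs, quality_clf, threshold=0.5):
    """Two-stage cleaning: rules first, then a learned quality classifier.

    `quality_clf` stands in for the classifier trained on the 110K manually
    annotated dialogue pairs; it is assumed to return a quality score in
    [0, 1] for a (post, response) pair.
    """
    kept = []
    for post, response in pairs:
        if not rule_filter(post, response):
            continue
        if quality_clf(post, response) >= threshold:
            kept.append((post, response))
    return kept
```

In this sketch the rules cheaply discard obviously bad pairs before the more expensive classifier is applied, which mirrors the rules-then-classifier ordering described in the abstract.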

Related Research

08/03/2021
EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
Although pre-trained language models have remarkably enhanced the genera...

05/29/2022
CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI
Human language expression is based on the subjective construal of the si...

07/24/2019
Fine-Grained Sentence Functions for Short-Text Conversation
Sentence function is an important linguistic feature referring to a user...

11/05/2019
LIDA: Lightweight Interactive Dialogue Annotator
Dialogue systems have the potential to change how people interact with m...

05/15/2018
A Manually Annotated Chinese Corpus for Non-task-oriented Dialogue Systems
This paper presents a large-scale corpus for non-task-oriented dialogue ...

09/23/2021
Paint4Poem: A Dataset for Artistic Visualization of Classical Chinese Poems
In this work we propose a new task: artistic visualization of classical ...

09/20/2021
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
To explore the limit of dialogue generation pre-training, we present the...

Code Repositories

CDial-GPT

A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models


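The CDial-GPT repository releases the pre-trained dialogue models mentioned in the abstract. A minimal loading sketch with the transformers library is shown below; the model identifier `thu-coai/CDial-GPT_LCCC-large` and the plain single-turn input format are assumptions, so consult the repository README for the exact checkpoint names and the expected speaker-token formatting.

```python
# Minimal sketch, assuming the LCCC-large checkpoint is published on the
# Hugging Face hub under the identifier below; the exact name and the
# dialogue input format should be taken from the CDial-GPT README.
from transformers import BertTokenizer, OpenAIGPTLMHeadModel

MODEL_ID = "thu-coai/CDial-GPT_LCCC-large"  # assumed identifier

tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
model = OpenAIGPTLMHeadModel.from_pretrained(MODEL_ID)

# Encode a single-turn Chinese post and sample a response.
post = "今天天气怎么样？"
input_ids = tokenizer.encode(post, return_tensors="pt")
output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The repository itself interleaves speaker tokens between turns and provides its own decoding scripts, so this snippet is only a starting point for loading the released checkpoints.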