A Large-Scale Chinese Short-Text Conversation Dataset

08/10/2020
by Yida Wang, et al.

Advances in neural dialogue generation models have shown promising results in modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to obtain. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which comes in a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of the dataset is ensured by a rigorous data cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models, trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.
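The abstract describes the cleaning pipeline only at a high level: rules first, then a learned classifier. A minimal sketch of that two-stage shape could look like the following. Everything in it is an assumption made for illustration: the specific rules (minimum turn count, length cap, URL filter), the names `passes_rules` and `clean`, the `max_chars` and `threshold` values, and the `score` callable standing in for the trained classifier. None of these are taken from the paper or from the CDial-GPT repository.

```python
import re
from typing import Callable, Iterable, List

# One dialogue is a list of utterances in turn order.
Dialogue = List[str]

# Hypothetical rule: treat any utterance containing a URL as noise.
URL_RE = re.compile(r"https?://\S+")


def passes_rules(dialogue: Dialogue, max_chars: int = 100) -> bool:
    """Rule stage (illustrative heuristics, not the paper's actual rules)."""
    if len(dialogue) < 2:  # need at least a post and a response
        return False
    for utterance in dialogue:
        text = utterance.strip()
        if not text or len(text) > max_chars:  # empty or overly long turn
            return False
        if URL_RE.search(text):  # link-bearing turn
            return False
    return True


def clean(
    dialogues: Iterable[Dialogue],
    score: Callable[[Dialogue], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Keep dialogues that pass the rules and that the classifier accepts.

    `score` is a stand-in for a trained quality classifier: it should
    return a value in [0, 1], higher meaning cleaner.
    """
    kept = []
    for dialogue in dialogues:
        if not passes_rules(dialogue):
            continue  # cheap rules run first, so the classifier sees less data
        if score(dialogue) >= threshold:
            kept.append(dialogue)
    return kept
```

Running the cheap rule stage before the classifier is a common design choice for large corpora, since it avoids scoring dialogues that would be discarded anyway; whether LCCC orders its stages this way is not stated in the abstract.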


research
08/03/2021

EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

Although pre-trained language models have remarkably enhanced the genera...
research
05/29/2022

CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI

Human language expression is based on the subjective construal of the si...
research
07/24/2019

Fine-Grained Sentence Functions for Short-Text Conversation

Sentence function is an important linguistic feature referring to a user...
research
11/05/2019

LIDA: Lightweight Interactive Dialogue Annotator

Dialogue systems have the potential to change how people interact with m...
research
09/28/2020

Pchatbot: A Large-Scale Dataset for Personalized Chatbot

Natural language dialogue systems raise great attention recently. As man...
research
06/09/2023

I run as fast as a rabbit, can you? A Multilingual Simile Dialogue Dataset

A simile is a figure of speech that compares two different things (calle...
research
04/18/2021

DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels

We introduce a data set called DCH-2, which contains 4,390 real customer...
