The Gutenberg Dialogue Dataset

04/27/2020
by Richard Csaky, et al.

Large datasets are essential for many NLP tasks. Current publicly available open-domain dialogue datasets force a trade-off between size and quality (e.g. DailyDialog vs. OpenSubtitles). We aim to close this gap by building a high-quality dataset consisting of 14.8M utterances in English. We extract and process dialogues from publicly available online books. We present a detailed description of our pipeline and heuristics, along with an error analysis of the extracted dialogues. Training on our data yields better response quality in both zero-shot and finetuning settings than training on the larger but much noisier OpenSubtitles dataset. Researchers can easily build their own versions of the dataset by adjusting various trade-off parameters. The code can be extended to further languages with limited effort (https://github.com/ricsinaruto/gutenberg-dialog).
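
To make the idea of extracting dialogues from book text concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function name `extract_dialogues`, the use of straight double quotes as the utterance delimiter, and the `max_gap` threshold of 150 characters are all assumptions chosen for illustration. It treats each quoted span as one utterance and starts a new dialogue whenever two consecutive utterances are separated by more than `max_gap` characters of narration.

```python
import re


def extract_dialogues(text, max_gap=150):
    """Group double-quoted utterances into dialogues.

    Hypothetical heuristic: consecutive quoted spans belong to the same
    dialogue if at most `max_gap` characters of narration separate them.
    Dialogues with a single utterance are discarded.
    """
    dialogues = []
    current = []
    last_end = None
    for match in re.finditer(r'"([^"]+)"', text):
        # A long stretch of narration closes the current dialogue.
        if last_end is not None and match.start() - last_end > max_gap:
            if len(current) > 1:
                dialogues.append(current)
            current = []
        current.append(match.group(1).strip())
        last_end = match.end()
    # Flush the final dialogue if it has at least two utterances.
    if len(current) > 1:
        dialogues.append(current)
    return dialogues
```

A threshold like `max_gap` is one example of the kind of trade-off parameter the abstract mentions: a smaller value would likely give cleaner but fewer dialogues, while a larger value would keep more data at the cost of merging unrelated exchanges.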
