Re^3Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for Long-Turn Open-Domain Dialogue Pre-training

05/04/2023
by   Jiaxin Wen, et al.

Large-scale open-domain dialogue data crawled from public social media has greatly improved the performance of dialogue models. However, long-turn dialogues remain scarce: most dialogue sessions in existing corpora have fewer than three turns. To alleviate this issue, we propose the Retrieve, Reorganize and Rescale framework (Re^3Dial), which can automatically construct a billion-scale long-turn dialogue corpus from existing short-turn dialogue data. Re^3Dial first trains an Unsupervised Dense Session Retriever (UDSR) to capture semantic and discourse relationships within multi-turn dialogues, so that it can retrieve relevant and coherent sessions. It then reorganizes the short-turn dialogues into long-turn sessions by recursively retrieving and selecting consecutive sessions with our proposed diversity sampling strategy. Extensive evaluations on multiple multi-turn dialogue benchmarks demonstrate that Re^3Dial consistently and significantly improves the dialogue model's ability to utilize long-term context for modeling multi-turn dialogues across different pre-training settings. Finally, we build a toolkit for efficiently rescaling dialogue corpora with Re^3Dial, which enables us to construct a corpus containing 1B Chinese dialogue sessions with 11.3 turns on average (5X longer than the original EVA corpus). We will release our UDSR model, toolkit, and data for public use.
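
The retrieve-and-reorganize loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bag-of-words cosine similarity stands in for the learned UDSR retriever, and "diversity sampling" is approximated by drawing uniformly from the top-k candidates instead of always taking the best match. The names `embed`, `cosine`, `build_long_session`, and the `top_k` parameter are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a dense retriever)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_long_session(seed, pool, target_turns, top_k=2, rng=None):
    """Grow a long session by repeatedly retrieving and appending short ones.

    seed: list of utterances to start from.
    pool: list of short sessions (each a list of utterances).
    top_k: how many top-ranked candidates to sample from (hypothetical knob).
    """
    rng = rng or random.Random(0)
    session = list(seed)
    remaining = [s for s in pool if s is not seed]
    while len(session) < target_turns and remaining:
        # Use the session built so far as the retrieval query.
        query = embed(" ".join(session))
        ranked = sorted(
            remaining,
            key=lambda s: cosine(query, embed(" ".join(s))),
            reverse=True,
        )
        # Sample among the top-k rather than always taking the best match,
        # so that different seeds do not all converge on the same sessions.
        chosen = rng.choice(ranked[:top_k])
        session.extend(chosen)
        remaining.remove(chosen)
    return session


if __name__ == "__main__":
    short_sessions = [
        ["do you like cats", "yes cats are great"],
        ["cats sleep a lot", "mine sleeps all day"],
        ["what about dogs", "dogs need walks"],
        ["walks are good exercise", "i walk daily"],
    ]
    long_session = build_long_session(short_sessions[0], short_sessions, target_turns=6)
    for turn in long_session:
        print(turn)
```

In this sketch each short session is used at most once per long session, and the loop stops once the target number of turns is reached or the pool is exhausted. How the actual framework forms queries, scores candidates, and samples among them is described in the full paper rather than in the abstract.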


research
08/03/2021

EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

Although pre-trained language models have remarkably enhanced the genera...
research
09/15/2017

Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue

The use of irony and sarcasm in social media allows us to study them at ...
research
08/09/2022

Positively transitioned sentiment dialogue corpus for developing emotion-affective open-domain chatbots

In this paper, we describe a data enhancement method for developing Emil...
research
03/03/2020

Transfer Learning for Context-Aware Spoken Language Understanding

Spoken language understanding (SLU) is a key component of task-oriented ...
research
03/19/2022

Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension

Comprehending a dialogue requires a model to capture diverse kinds of ke...
research
11/22/2019

The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

Human conversations in real scenarios are complicated and building a hum...
research
11/30/2022

ConvLab-3: A Flexible Dialogue System Toolkit Based on a Unified Data Format

Diverse data formats and ontologies of task-oriented dialogue (TOD) data...
