DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

07/19/2023
by Jianguo Zhang, et al.

Despite advances in conversational AI, language models still struggle to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it a rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the license of each dataset and design domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset-based and task-based research, as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are made publicly accessible at https://github.com/salesforce/DialogStudio

