Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation

01/31/2022
by   Olga Majewska, et al.

Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, the potential of this technology is not fully realised, as current datasets for multilingual ToD, both for modular and end-to-end modelling, suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations, we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These outlines in turn guide the target-language annotators in writing a dialogue by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for training and evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that COD prevents the over-inflated performance typically observed with prior translation-based ToD datasets.
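
To make the outline idea concrete, the following is a minimal sketch of how an abstract dialogue schema (one intent plus slots per turn) might be mapped into natural language instructions for an annotator. It is an illustration only, not the authors' pipeline: the `TurnSchema` class, the `TEMPLATES` wording, the intent names, and the slot names are all hypothetical assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TurnSchema:
    """Hypothetical abstract description of one dialogue turn."""
    speaker: str                      # e.g. "USER" or "SYSTEM"
    intent: str                       # e.g. "INFORM" or "REQUEST"
    slots: Dict[str, str] = field(default_factory=dict)


# Assumed instruction templates; the real outline wording is not given in the abstract.
TEMPLATES = {
    "INFORM": "Tell the other speaker that {slots}.",
    "REQUEST": "Ask the other speaker for {slots}.",
}


def render_slots(intent: str, slots: Dict[str, str]) -> str:
    """Turn slot names (and values, for INFORM) into a readable phrase."""
    if intent == "REQUEST":
        # Requested slots have no value yet, so only the names are listed.
        return ", ".join(name.replace("_", " ") for name in slots)
    return ", ".join(
        f"the {name.replace('_', ' ')} is {value}" for name, value in slots.items()
    )


def turn_to_outline(turn: TurnSchema) -> str:
    """Map one schema turn to a single outline instruction."""
    template = TEMPLATES[turn.intent]
    return f"{turn.speaker}: " + template.format(
        slots=render_slots(turn.intent, turn.slots)
    )


def dialogue_to_outline(turns: List[TurnSchema]) -> List[str]:
    """Map a whole schema dialogue to a list of outline instructions."""
    return [turn_to_outline(turn) for turn in turns]
```

An annotator working in the target language would then write each utterance from its instruction rather than translating an existing English sentence, which is the property the abstract credits for more natural data.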


research
01/27/2021

An Empirical Study of Cross-Lingual Transferability in Generative Dialogue State Tracker

There has been a rapid development in data-driven task-oriented dialogue...
research
07/26/2023

Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Creating high-quality annotated data for task-oriented dialog (ToD) is k...
research
06/05/2021

BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling

Task-oriented dialogue (ToD) benchmarks provide an important avenue to m...
research
04/29/2020

End-to-End Slot Alignment and Recognition for Cross-Lingual NLU

Natural language understanding in the context of goal oriented dialog sy...
research
02/18/2023

Zero and Few-Shot Localization of Task-Oriented Dialogue Agents with a Distilled Representation

Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-s...
research
12/23/2021

Investigating Effect of Dialogue History in Multilingual Task Oriented Dialogue Systems

While the English virtual assistants have achieved exciting performance ...
research
08/02/2022

Multilingual Coreference Resolution in Multiparty Dialogue

Existing multiparty dialogue datasets for coreference resolution are nas...
