Building and curating conversational corpora for diversity-aware language science and technology

03/07/2022
by Andreas Liesenfeld, et al.

We present a pipeline and tools for building a maximally natural data set of conversational interaction covering 66 languages and varieties from 32 phyla. We describe the curation and compilation process that moves from diverse language documentation corpora to a unified format, and we introduce an open-source tool, "convo-parse", that supports quality control and assessment of conversational data. We conclude with two case studies showing how diverse data sets can inform interactional linguistics and speech recognition technology, and thus help broaden the empirical foundations of the language sciences and technologies of the future.
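
To make the idea of a unified format with quality control more concrete, here is a minimal Python sketch. It is not the convo-parse tool or its API: the `Utterance` record, its field names, and the functions `quality_report` and `transition_offsets` are hypothetical and chosen only for illustration. The sketch assumes each utterance carries a speaker label, begin and end times in milliseconds, and a transcription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One row of a hypothetical unified format (field names are assumptions)."""
    source: str    # corpus the utterance was taken from
    language: str  # language or variety label
    speaker: str   # speaker identifier within the conversation
    begin_ms: int  # start time in milliseconds
    end_ms: int    # end time in milliseconds
    text: str      # transcription


def quality_report(utterances):
    """Count a few basic problems that a curator might want to inspect."""
    return {
        "utterances": len(utterances),
        "speakers": len({u.speaker for u in utterances}),
        # Rows whose transcription is empty or only whitespace.
        "empty_text": sum(1 for u in utterances if not u.text.strip()),
        # Rows whose end time is not after their begin time.
        "bad_timing": sum(1 for u in utterances if u.end_ms <= u.begin_ms),
    }


def transition_offsets(utterances):
    """Gap (positive) or overlap (negative) in ms at each speaker change."""
    ordered = sorted(utterances, key=lambda u: u.begin_ms)
    return [
        nxt.begin_ms - prev.end_ms
        for prev, nxt in zip(ordered, ordered[1:])
        if prev.speaker != nxt.speaker
    ]


# Invented demo data, not drawn from any corpus.
demo = [
    Utterance("demo", "xx", "A", 0, 1000, "hello"),
    Utterance("demo", "xx", "B", 1200, 2000, "hi"),
    Utterance("demo", "xx", "B", 2100, 2100, ""),
]

print(quality_report(demo))
print(transition_offsets(demo))
```

Keeping every source corpus in one record type like this is what would let the same checks run unchanged across languages. The actual checks and format used by the authors may differ.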
