SQuId: Measuring Speech Naturalness in Many Languages

10/12/2022
by Thibault Sellam, et al.

Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task and model, and show that SQuId outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of our design decisions, e.g., model size, pre-training diversity, and language rebalancing, with several ablation experiments.
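The recipe the abstract describes (fine-tuning one multilingual model on pooled naturalness ratings from many locales) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the toy Transformer stands in for a large pretrained speech encoder such as w2v-BERT, the NaturalnessPredictor class and all dimensions, data, and hyperparameters are placeholders, and the loss is a generic regression objective rather than the paper's exact training setup.

    # Minimal sketch of a SQuId-style naturalness predictor:
    # pretrained speech encoder + mean pooling + scalar regression head.
    import torch
    import torch.nn as nn

    class NaturalnessPredictor(nn.Module):
        def __init__(self, encoder: nn.Module, hidden_dim: int):
            super().__init__()
            self.encoder = encoder              # stands in for a pretrained multilingual encoder
            self.head = nn.Linear(hidden_dim, 1)  # predicts a single MOS-like rating

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, hidden_dim) frame-level speech features
            h = self.encoder(frames)            # contextualize frames
            pooled = h.mean(dim=1)              # average over time -> one vector per utterance
            return self.head(pooled).squeeze(-1)

    # Toy stand-in encoder; in practice this would be a large pretrained model.
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    model = NaturalnessPredictor(encoder, hidden_dim=256)

    # Fine-tune on (utterance, rating) pairs from many locales with a regression loss.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(8, 100, 256)   # dummy batch: 8 utterances, 100 frames each
    y = torch.rand(8) * 4 + 1      # dummy human ratings on a 1-5 scale
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

The key point the paper makes is about the training data rather than the architecture: a single such model fine-tuned on ratings pooled across many locales consistently beats training one model per locale.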

