SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

07/27/2022
by Artem Ploujnikov, et al.

End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss (that improves disambiguation), exploits curriculum learning (that gradually switches from word-level to sentence-level G2P), and integrates word embeddings from BERT (for further performance improvement). Moreover, the model inherits the best practices in speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.

Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
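To make the weighted homograph loss idea concrete, here is a minimal sketch of a multi-task objective that adds a separately weighted penalty at the positions of a homograph word. The function names, the simple per-step negative-log-likelihood stand-ins, and the weight values are assumptions for illustration; SoundChoice's actual formulation is defined in the paper and its accompanying recipe.

```python
# Illustrative sketch only: the names, the dict-based probability
# representation, and the default weights below are assumptions,
# not SoundChoice's actual implementation.
import math

def nll(probs, targets):
    """Mean negative log-likelihood over a sequence of per-step
    probability distributions (dicts mapping phoneme -> probability)."""
    return -sum(math.log(step[t]) for step, t in zip(probs, targets)) / len(targets)

def homograph_loss(probs, targets, homograph_positions):
    """Extra NLL computed only at the steps belonging to a homograph
    word, so ambiguous words get a dedicated penalty."""
    if not homograph_positions:
        return 0.0
    return -sum(math.log(probs[i][targets[i]])
                for i in homograph_positions) / len(homograph_positions)

def total_loss(probs, targets, homograph_positions,
               ctc_loss=0.0, ctc_weight=0.4, homograph_weight=2.0):
    """Multi-task objective: attention NLL + weighted CTC term
    + up-weighted homograph term (weights are illustrative)."""
    return (nll(probs, targets)
            + ctc_weight * ctc_loss
            + homograph_weight * homograph_loss(probs, targets,
                                                homograph_positions))
```

Up-weighting the homograph term means that a mispronounced homograph (e.g., "read" as /r iy d/ vs. /r eh d/) costs more than an ordinary phoneme error, which is the intuition behind improved disambiguation.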
