Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis

04/26/2021
by Kosuke Futamata, et al.

We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) with explicit features extracted from a BiLSTM with linguistic features. In conventional BiLSTM-based methods, word representations and/or sentence representations are used as independent components. The proposed method takes both representations into account to extract latent semantics that previous methods cannot capture. Objective evaluation shows that the proposed method achieves an absolute improvement of 3.2 points in F1 score over conventional BiLSTM-based methods using linguistic features. Moreover, a perceptual listening test shows that a TTS system using the proposed method achieves a mean opinion score of 4.39 for prosody naturalness, which is highly competitive with the score of 4.37 for speech synthesized with ground-truth phrase breaks.
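To make the objective metric concrete, here is a minimal sketch of how an F1 score for phrase break prediction could be computed. The abstract does not describe the evaluation protocol, so the setup here is an assumption: each token carries a binary label (1 = a phrase break follows, 0 = no break), and F1 is computed for the break class only. The function name `break_f1` and the toy label sequences are hypothetical.

```python
from typing import Sequence


def break_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the break class (label 1) over aligned per-token labels.

    Assumption: binary labels, one per token; not necessarily the
    labeling scheme used in the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    if tp == 0:
        # No correctly predicted breaks: define F1 as 0 to avoid division by zero.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels, not data from the paper).
gold = [0, 1, 0, 0, 1, 0, 1]
pred = [0, 1, 0, 1, 1, 0, 0]
score = break_f1(gold, pred)
print(f"F1 = {100 * score:.1f} points")
```

Under this convention, an "absolute improvement of 3.2 points" means the difference between two such scores, each expressed on a 0 to 100 scale.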

research
01/24/2022

Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end

Although end-to-end text-to-speech (TTS) models can generate natural spe...
research
04/09/2023

An investigation of speaker independent phrase break models in End-to-End TTS systems

This paper presents our work on phrase break prediction in the context o...
research
11/11/2019

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

In Mandarin text-to-speech (TTS) system, the front-end text processing m...
research
02/27/2023

Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Pause insertion, also known as phrase break prediction and phrasing, is ...
research
07/26/2021

Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

In this paper, we propose an effective method to synthesize speaker-spec...
research
08/27/2018

IIIDYT at IEST 2018: Implicit Emotion Classification With Deep Contextualized Word Representations

In this paper we describe our system designed for the WASSA 2018 Implici...
research
06/04/2022

Atypical lexical abbreviations identification in Russian medical texts

Abbreviation is a method of word formation that aims to construct the sh...
