Pitchtron: Towards audiobook generation from ordinary people's voices

05/21/2020
by Sunghee Jung, et al.

In this paper, we explore prosody transfer for audiobook generation under a realistic condition: the training database consists mostly of plain audio from multiple ordinary speakers, while the reference audio given at inference comes from professional speakers and is richer in prosody than the training data. Specifically, we explore transferring Korean dialects and emotive speech even though the training set is composed mostly of standard, neutral Korean. We found that under this setting the original global style token (GST) method produces undesirable glitches in pitch, energy, and pause length. To address this issue, we propose two models, hard pitchtron and soft pitchtron, and release the toolkit and corpus that we have developed. Hard pitchtron uses pitch as an input to the decoder, while soft pitchtron uses pitch as an input to the prosody encoder. We verify the effectiveness of the proposed models with objective and subjective tests. The AXY score over GST is 2.01 for hard pitchtron and 1.14 for soft pitchtron.
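The architectural distinction described above can be sketched minimally: hard pitchtron injects the frame-level pitch track directly into the decoder input, while soft pitchtron first summarizes pitch through a prosody encoder into a global style vector. The sketch below illustrates only the conditioning paths under assumed shapes; the variable names, dimensions, and the stand-in one-layer prosody encoder are illustrative and not taken from the authors' released toolkit.

```python
import numpy as np

rng = np.random.default_rng(0)

T_dec, d_text, d_style = 50, 8, 4
text_enc = rng.normal(size=(T_dec, d_text))  # upsampled text encodings, one per decoder frame
pitch = rng.normal(size=(T_dec, 1))          # frame-level pitch (e.g. log-F0) from the reference

# Hard pitchtron (sketch): pitch is concatenated to every decoder input frame,
# so the decoder is explicitly conditioned on the reference pitch contour.
hard_decoder_input = np.concatenate([text_enc, pitch], axis=-1)   # (T_dec, d_text + 1)

# Soft pitchtron (sketch): pitch feeds a prosody encoder that compresses it
# into a single global style vector, broadcast to all decoder frames.
W_style = rng.normal(size=(1, d_style))                           # stand-in encoder weights
style = np.tanh(pitch.mean(axis=0, keepdims=True) @ W_style)      # (1, d_style)
soft_decoder_input = np.concatenate(
    [text_enc, np.repeat(style, T_dec, axis=0)], axis=-1)         # (T_dec, d_text + d_style)
```

The key difference visible here is granularity: the hard variant preserves the full frame-by-frame contour, while the soft variant lets the prosody encoder decide what of the pitch signal to keep, which is what makes it more robust to mismatched reference audio.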

