Zero resource speech synthesis using transcripts derived from perceptual acoustic units

06/08/2020
by Karthik Pandia D S, et al.

Zero resource speech synthesis is the task of building vocabulary-independent speech synthesis systems when transcriptions are not available for the training data. It is therefore necessary to convert the training data into a sequence of fundamental acoustic units that can be used for synthesis at test time. This paper attempts to discover and model perceptual acoustic units consisting of steady-state and transient regions in speech. The transients roughly correspond to consonant-vowel (CV) and vowel-consonant (VC) units, while the steady-state regions correspond to sonorants and fricatives. The speech signal is first preprocessed by segmenting it into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected-components-based graph clustering technique. The clustered CVC segments are initialized such that the onsets (CV) and decays (VC) correspond to transients, while the rhymes correspond to steady states. Following this initialization, the units are allowed to reorganize over the continuous speech into a final set of acoustic units (AUs) in an HMM-GMM framework. The AU sequences thus obtained are used to train the synthesis models. The performance of the proposed approach is evaluated on the ZeroSpeech 2019 challenge database. Subjective and objective scores show that reasonably good-quality synthesis with low-bit-rate encoding can be achieved using the proposed AUs.
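To make the discovery pipeline more concrete, the sketch below illustrates two of the steps described in the abstract: cutting speech into CVC-like units at valleys of a short-term energy contour, and grouping the resulting segments with connected-components graph clustering. This is a minimal illustration under stated assumptions, not the paper's implementation: the frame sizes, the threshold tau, and the stand-in random distance matrix are illustrative choices; the paper would compare segment-level acoustic features and then refine the units in an HMM-GMM framework, which is omitted here.

```python
# Minimal sketch of two steps from the abstract:
# (1) segment speech into CVC-like units at local minima of a
#     short-term energy contour, and
# (2) cluster the segments via connected components on a thresholded
#     distance graph.
# Frame sizes, tau, and the distance matrix are illustrative assumptions.
import numpy as np
from scipy.signal import argrelmin


def short_term_energy(signal, frame_len=400, hop=160):
    """Frame-wise log energy of a mono speech signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + frame_len] for i in range(n_frames)]
    )
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)


def segment_at_energy_valleys(energy, order=5):
    """Cut the utterance into CVC-like units at valleys of the energy contour."""
    valleys = argrelmin(energy, order=order)[0]
    bounds = [0, *valleys.tolist(), len(energy)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


def connected_components(dist, tau):
    """Label segments: connect pairs closer than tau, return component ids."""
    n = dist.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] >= 0:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:                       # depth-first flood fill
            u = stack.pop()
            for v in range(n):
                if labels[v] < 0 and dist[u, v] < tau:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)         # stand-in for 1 s of 16 kHz speech
    segments = segment_at_energy_valleys(short_term_energy(x))
    # Stand-in symmetric distances; the paper would compare acoustic
    # features of the segments (e.g. via a DTW-style distance).
    d = rng.random((len(segments), len(segments)))
    d = (d + d.T) / 2
    np.fill_diagonal(d, 0.0)
    labels = connected_components(d, tau=0.1)
    print(len(segments), "segments ->", labels.max() + 1, "clusters")
```

The appeal of connected-components clustering here is that it requires no preset number of clusters: the inventory size of the discovered acoustic units falls out of the distance threshold, which is what lets the bit rate of the resulting encoding be traded off against synthesis quality.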

