Learning Speech Representation From Contrastive Token-Acoustic Pretraining

09/01/2023
by Chunyu Qiang, et al.

For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content should be emphasized, while paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from excessive redundancy and dimension explosion. Contrastive learning is well suited to modeling intermediate representations across two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR. To address these issues, we propose a method named "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phonemes and speech into a joint multimodal space, learning how to connect phonemes and speech at the frame level. The CTAP model is trained on 210k speech and phoneme-text pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing.
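To make the frame-level idea concrete, below is a minimal NumPy sketch of a symmetric, InfoNCE-style contrastive objective over time-aligned phoneme and speech embeddings, where matching frame indices are positives and all other frames in the utterance are negatives. The function name, the temperature value, and the NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def frame_level_contrastive_loss(phone_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over time-aligned frames.

    phone_emb, speech_emb: (T, d) arrays of frame-level embeddings from the
    phoneme and speech encoders (illustrative names, not the paper's code).
    Frames at the same index are positives; all other frames are negatives.
    """
    # L2-normalize so dot products become cosine similarities
    p = phone_emb / np.linalg.norm(phone_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = p @ s.T / temperature  # (T, T) frame-to-frame similarity matrix

    def nll_diag(m):
        # Negative log-likelihood of the diagonal (positive pairs)
        # under a softmax over each row, computed stably.
        m = m - m.max(axis=1, keepdims=True)
        log_prob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the phoneme-to-speech and speech-to-phoneme directions
    return 0.5 * (nll_diag(logits) + nll_diag(logits.T))
```

Because the loss is computed per frame rather than per utterance, minimizing it pulls each phoneme frame toward its aligned speech frame while pushing it away from the rest, which is the property the abstract contrasts with global, classification-oriented audio contrastive methods.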


Related research

- 03/06/2023 · FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model
  Neural text-to-speech (TTS) generally consists of cascaded architecture ...
- 10/30/2020 · Joint Masked CPC and CTC Training for ASR
  Self-supervised learning (SSL) has shown promise in learning representat...
- 07/23/2023 · SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces
  Numerous examples in the literature proved that deep learning models hav...
- 05/16/2023 · Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion
  Nowadays, recognition-synthesis-based methods have been quite popular wi...
- 09/04/2023 · Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
  Mapping two modalities, speech and text, into a shared representation sp...
- 02/22/2023 · Contrastive Representation Learning for Acoustic Parameter Estimation
  A study is presented in which a contrastive learning approach is used to...
- 03/24/2022 · Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion
  Non-parallel data voice conversion (VC) have achieved considerable break...
