Disentanglement in a GAN for Unconditional Speech Synthesis

07/04/2023
by Matthew Baas, et al.

Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) – a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it to the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
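The idea of probabilistically skipping discriminator updates can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch training step, not the authors' implementation (see the linked repository for that): names such as skip_prob are assumptions, and the non-saturating logistic losses are the standard StyleGAN-style choice used here only for illustration. With probability skip_prob the discriminator update is simply not performed on a given step, slowing the discriminator relative to the generator.

    import torch
    import torch.nn.functional as F

    def train_step(G, D, g_opt, d_opt, z_dim, real, skip_prob):
        # One GAN training step; the discriminator update is skipped
        # with probability skip_prob (illustrative sketch only).
        n = real.size(0)

        # Discriminator update, randomly skipped so D does not overpower G.
        if torch.rand(1).item() >= skip_prob:
            z = torch.randn(n, z_dim, device=real.device)
            fake = G(z).detach()
            # Non-saturating logistic discriminator loss.
            d_loss = F.softplus(D(fake)).mean() + F.softplus(-D(real)).mean()
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()

        # Generator update runs on every step.
        z = torch.randn(n, z_dim, device=real.device)
        g_loss = F.softplus(-D(G(z))).mean()
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
        return g_loss.item()

In the fixed-probability form above the core idea is already visible; in a full setup one would expect skip_prob to be driven by the same overfitting heuristic that adaptive discriminator augmentation uses, though the paper should be consulted for the exact rule.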


Related research

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models (10/11/2022)
We propose AudioStyleGAN (ASGAN), a new generative adversarial network (...

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis (01/30/2023)
Synthesizing high-fidelity complex images from text is challenging. Base...

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (05/12/2020)
generative network for text-to-speech synthesis with control over speec...

PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping (11/08/2022)
Previous generative adversarial network (GAN)-based neural vocoders are ...

Disentangling Semantic-to-visual Confusion for Zero-shot Learning (06/16/2021)
Using generative models to synthesize visual features from semantic dist...

Latent Space Explorations of Singing Voice Synthesis using DDSP (03/12/2021)
Machine learning based singing voice models require large datasets and l...

Speaker diarization using latent space clustering in generative adversarial network (10/24/2019)
In this work, we propose deep latent space clustering for speaker diariz...
