DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

04/06/2023
by   Amit Kumar Singh Yadav, et al.
0

Tools to generate high quality synthetic speech signal that is perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make. This limits the interpretability of these approaches. In this paper, we propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two staged trained variational autoencoder that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map to highlight the spectrogram regions that discriminate synthetic and bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98 unknown speech synthesizers. We also visualize the representation obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate bona fide and synthetic speech from each of the synthesizers.

READ FULL TEXT

page 4

page 6

page 8

research
06/06/2022

UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

In this paper, we propose a novel unsupervised text-to-speech (UTTS) fra...
research
02/02/2021

Evaluating the Interpretability of Generative Models by Interactive Reconstruction

For machine learning models to be most useful in numerous sociotechnical...
research
12/17/2021

Disentangled representations: towards interpretation of sex determination from hip bone

By highlighting the regions of the input image that contribute the most ...
research
08/28/2022

Towards Disentangled Speech Representations

The careful construction of audio representations has become a dominant ...
research
10/13/2021

DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding

Conventional vocoders are commonly used as analysis tools to provide int...
research
04/17/2019

Learning Interpretable Disentangled Representations using Adversarial VAEs

Learning Interpretable representation in medical applications is becomin...
research
11/11/2021

Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset

Current authentication and trusted systems depend on classical and biome...

Please sign up or login with your details

Forgot password? Click here to reset