Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

10/31/2022
by Luigi Attorresi, et al.

The rapid spread of media content synthesis technology and the potentially damaging impact of audio and video deepfakes on people's lives have created a need for systems that detect these forgeries automatically. In this work we present a novel approach to synthetic speech detection that combines two high-level semantic properties of the human voice. On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted with a state-of-the-art method for automatic speaker verification. On the other side, voice prosody, understood as variations in rhythm, pitch, or accent in speech, is extracted through a specialized encoder. We show that feeding the combination of these two embeddings to a supervised binary classifier allows the detection of deepfake speech generated with both Text-to-Speech and Voice Conversion techniques. Our results show improvements over the considered baselines, good generalization across multiple datasets, and robustness to audio compression.
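
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract names neither the encoders nor the classifier, so everything below is an assumption made for illustration: the embedding sizes (`SPEAKER_DIM`, `PROSODY_DIM`), the random vectors standing in for real speaker and prosody embeddings, the label convention (1 for synthetic, 0 for real), and the use of plain logistic regression as the supervised binary classifier.

```python
import math
import random

# Hypothetical embedding sizes; the abstract does not state the real ones.
SPEAKER_DIM = 8
PROSODY_DIM = 4


def fuse(speaker_emb, prosody_emb):
    """Concatenate a speaker embedding and a prosody embedding."""
    return list(speaker_emb) + list(prosody_emb)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Estimated probability that feature vector x is synthetic speech."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(features, labels, epochs=50, lr=0.1):
    """Fit logistic regression with per-sample gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def toy_embedding(dim, shift, rng):
    """Placeholder for a real encoder output: Gaussian noise around `shift`."""
    return [rng.gauss(shift, 1.0) for _ in range(dim)]


# Build a toy dataset. Assumed labels: 0 = real speech, 1 = synthetic speech.
rng = random.Random(0)
features, labels = [], []
for _ in range(100):
    for label, shift in ((0, -1.0), (1, 1.0)):
        speaker = toy_embedding(SPEAKER_DIM, shift, rng)
        prosody = toy_embedding(PROSODY_DIM, shift, rng)
        features.append(fuse(speaker, prosody))
        labels.append(label)

w, b = train_logistic(features, labels)
predictions = [1 if score(w, b, x) >= 0.5 else 0 for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy on toy data: {accuracy:.3f}")
```

The toy classes are well separated by construction, so the accuracy printed here says nothing about real detection performance. In practice the two placeholder vectors would be replaced by the outputs of a speaker verification model and a prosody encoder computed on the same utterance.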

research
06/04/2018

Voice Imitating Text-to-Speech Neural Networks

We propose a neural text-to-speech (TTS) model that can imitate a new sp...
research
09/28/2022

Deepfake audio detection by speaker verification

Thanks to recent advances in deep learning, sophisticated generation too...
research
08/24/2023

WavMark: Watermarking for Audio Generation

Recent breakthroughs in zero-shot voice synthesis have enabled imitating...
research
08/08/2022

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Non-parallel many-to-many voice conversion remains an interesting but ch...
research
12/14/2022

Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Human speech can be characterized by different components, including sem...
research
06/09/2023

Speaker Embeddings as Individuality Proxy for Voice Stress Detection

Since the mental states of the speaker modulate speech, stress introduce...
research
07/24/2021

Significance of Speaker Embeddings and Temporal Context for Depression Detection

Depression detection from speech has attracted a lot of attention in rec...
