MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

09/28/2022
by Andrés Vasco-Carofilis, et al.

A recent trend in speech processing is the use of embeddings produced by machine learning models trained on a specific task with large datasets. Because these models carry over the knowledge they have already acquired, they can be reused in new tasks for which little data is available. This paper proposes a pipeline to create a new model, Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), that generates robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder with deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate MeWEHV on three tasks: speaker, language, and accent identification. For speaker identification, we use the VoxCeleb1 dataset and present YouSpeakers204, a new, publicly available dataset for English speaker identification that contains 19,607 audio clips from 204 speakers across six accents. This gives other researchers a well-balanced dataset for building models that are robust to multiple accents. For language identification, we use the VoxForge and Common Language datasets. Finally, for accent identification, we use the Latin American Spanish Corpora (LASC) and Common Voice datasets. Our approach significantly improves the performance of state-of-the-art models on all tested datasets, at a low additional computational cost.
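
The abstract does not say how the two feature streams are combined, so the following is only a minimal sketch under stated assumptions: each branch is assumed to emit a sequence of frame-level vectors, each sequence is mean-pooled over time, and the two pooled vectors are concatenated. The names `mean_pool` and `fuse_embeddings`, the pooling step, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from statistics import fmean
from typing import List, Sequence

# Frame-level features with shape (time steps, feature dimension).
Frames = Sequence[Sequence[float]]


def mean_pool(frames: Frames) -> List[float]:
    """Average frame-level features over time into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one frame")
    # zip(*frames) iterates over feature dimensions instead of time steps.
    return [fmean(column) for column in zip(*frames)]


def fuse_embeddings(wave_frames: Frames, mel_frames: Frames) -> List[float]:
    """Concatenate the pooled outputs of both branches (assumed fusion rule)."""
    return mean_pool(wave_frames) + mean_pool(mel_frames)


# Toy stand-ins for the two branches. In a real system these would come from
# a pre-trained waveform encoder and from a CNN applied to MFCCs.
wave_frames = [[1.0, 3.0], [3.0, 5.0]]            # 2 frames, 2 dimensions
mel_frames = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]   # 2 frames, 3 dimensions

utterance_embedding = fuse_embeddings(wave_frames, mel_frames)
print(utterance_embedding)  # [2.0, 4.0, 1.0, 3.0, 5.0]
```

The fused vector could then feed a small classifier for speaker, language, or accent identification. Pooling before concatenation means the two branches may run at different frame rates and still yield one fixed-size vector, although other fusion schemes are equally plausible.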

