Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

08/06/2023
by Sudarsana Reddy Kadiri, et al.

Prior studies on the automatic classification of voice quality have mainly used the acoustic speech signal as input. Recently, a few studies have jointly used speech and neck surface accelerometer (NSA) signals as inputs, extracting MFCCs and glottal source features. This study examines simultaneously recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT), with an SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models in feature extraction is compared between glottal source waveforms and raw signal waveforms, for both speech and NSA inputs. Glottal source waveforms are estimated from both speech and NSA signals using two signal processing methods: quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF). The study has three main goals: (1) to examine whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector); (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features; and (3) to evaluate whether the deep learning-based CNN classifier improves classification accuracy over the SVM classifier. The results revealed that the NSA input yielded better classification performance than the speech signal. Among the features, the pre-trained model-based features achieved better classification accuracies than the conventional features, for both speech and NSA inputs. The HuBERT features were also found to outperform the wav2vec2-BASE and wav2vec2-LARGE features.
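As a rough illustration of one of the glottal source estimation methods mentioned above, the following is a minimal NumPy sketch of zero frequency filtering (ZFF): the signal is differenced, passed through a cascade of two zero-frequency resonators (ideal integrators), and the resulting slowly varying trend is removed by repeated local mean subtraction over a window of roughly one to two pitch periods. The window length and implementation details here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def zero_frequency_filter(x, fs, win_ms=10.0):
    """Sketch of zero frequency filtering (ZFF) of a 1-D signal x at rate fs.

    NOTE: a simplified illustration; the trend-removal window (win_ms,
    ideally 1-2 pitch periods) is an assumed parameter.
    """
    # Difference the signal to remove any DC offset / low-frequency bias.
    s = np.diff(x, prepend=x[0]).astype(np.float64)
    # Cascade of two zero-frequency resonators (ideal integrators).
    y = np.cumsum(np.cumsum(s))
    # Trend removal: subtract the local mean twice over a short window.
    half = max(1, int(win_ms * 1e-3 * fs))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(2):
        trend = np.convolve(y, kernel, mode="same")
        y = y - trend
    return y
```

Positive-going zero crossings of the ZFF output approximate the glottal closure instants (epochs); for a clean 100 Hz tone the output oscillates at the same rate, so counting those crossings recovers the fundamental frequency.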
