Comparison of Speech Representations for the MOS Prediction System

06/28/2022
by Aki Kunikoshi, et al.
Automatic methods for predicting listeners' Mean Opinion Score (MOS) have been studied as a way to assure the quality of Text-to-Speech systems. Many previous studies focus on architectural advances (e.g. MBNet, LDNet) that capture the relation between spectral features and MOS more effectively, and they have achieved high accuracy. However, which representation is optimal in terms of generalization capability remains largely unknown. To this end, we compare the performance of Self-Supervised Learning (SSL) features obtained with the wav2vec framework to that of spectral features such as the magnitude spectrogram and the mel-spectrogram. Moreover, we propose combining the SSL features with features that we believe retain information essential to automatic MOS prediction, so that the two compensate for each other's drawbacks. We conduct comprehensive experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges. We found that the wav2vec feature set generalized best, even though the given ground truth was not always reliable. Furthermore, we found that the combined feature sets performed best overall, and we analyzed how they bridge the gap between the spectral and wav2vec feature sets.
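The abstract does not say how the SSL and spectral features are combined, so the following is only a minimal sketch under stated assumptions: the combination is plain frame-wise concatenation, the frame counts are aligned by truncation, and the SSL features are a random placeholder rather than real wav2vec outputs. The function names, the FFT size, the hop length, and the 768-dimensional placeholder are all hypothetical choices made for illustration.

```python
import numpy as np


def magnitude_spectrogram(wave, n_fft=512, hop=320):
    """Frame the signal, apply a Hann window, and take |rFFT| of each frame."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)


def combine_features(ssl_feats, spec_feats):
    """Assumed combination: truncate to the shorter sequence, then concatenate
    along the feature axis. The paper's actual method may differ."""
    n = min(len(ssl_feats), len(spec_feats))
    return np.concatenate([ssl_feats[:n], spec_feats[:n]], axis=1)


rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)      # 1 s of placeholder audio at 16 kHz
spec = magnitude_spectrogram(wave)     # spectral features
ssl = rng.standard_normal((49, 768))   # placeholder standing in for SSL features
combined = combine_features(ssl, spec)
print(spec.shape, combined.shape)
```

In a real pipeline the placeholder array would be replaced by features extracted from a pretrained SSL model, and the hop length would be chosen to match that model's frame rate so that little or no truncation is needed.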

research
06/30/2022

FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition

Self-supervised learning representations (SSLR) have resulted in robust ...
research
06/11/2023

Mandarin Electrolaryngeal Speech Voice Conversion using Cross-domain Features

Patients who have had their entire larynx removed, including the vocal f...
research
10/18/2021

LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

An effective approach to automatically predict the subjective rating for...
research
11/03/2021

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

The goal of voice conversion is to transform source speech into a target...
research
04/05/2022

UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

We present the UTokyo-SaruLab mean opinion score (MOS) prediction system...
research
04/07/2021

Utilizing Self-supervised Representations for MOS Prediction

Speech quality assessment has been a critical issue in speech processing...
research
06/24/2022

Speech Quality Assessment through MOS using Non-Matching References

Human judgments obtained through Mean Opinion Scores (MOS) are the most ...