MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network

02/27/2021
by   Yichong Leng, et al.
Mean opinion score (MOS) is a popular subjective metric for assessing the quality of synthesized speech, and usually requires multiple human judges to evaluate each speech utterance. To reduce the labor cost of MOS tests, multiple methods have been proposed to predict MOS scores automatically. To our knowledge, all previous works used only the average of the scores from different judges for each utterance as the training target and discarded the individual judges' scores, which does not fully exploit the precious MOS training data. In this paper, we propose MBNet, a MOS predictor with a mean subnet and a bias subnet that better utilizes every judge score in MOS datasets. The mean subnet predicts the mean score of each utterance, as in previous works, and the bias subnet predicts the bias score (the difference between the mean score and each individual judge score) and captures the personal preferences of individual judges. Experiments show that, compared with the MOSNet baseline that leverages only the mean score for training, MBNet improves the system-level Spearman's rank correlation coefficient (SRCC) by 2.9 dataset and 6.7
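To make the two training targets concrete, the following is a minimal sketch of how a mean target and per-judge bias targets could be derived from the raw scores of one utterance. It is an illustration, not the authors' code: the function name `mean_and_bias_targets`, the judge identifiers, and the example scores are hypothetical, and the sign convention (bias = judge score minus mean score, so that mean plus bias recovers the judge's score) is an assumption, since the abstract only says "difference".

```python
from statistics import mean


def mean_and_bias_targets(judge_scores):
    """Derive training targets for one utterance (illustrative sketch).

    judge_scores: dict mapping a judge id to that judge's score.
    Returns (utterance_mean, biases), where biases maps each judge id to
    (judge score - utterance mean). The sign convention is an assumption.
    """
    utterance_mean = mean(list(judge_scores.values()))
    biases = {judge: score - utterance_mean
              for judge, score in judge_scores.items()}
    return utterance_mean, biases


# Hypothetical scores from three judges for a single utterance.
scores = {"judge_a": 4, "judge_b": 3, "judge_c": 5}
utterance_mean, biases = mean_and_bias_targets(scores)

# Under this convention, mean + bias reconstructs each individual score.
reconstructed = {judge: utterance_mean + b for judge, b in biases.items()}
```

Under this convention the biases of an utterance sum to zero, so the mean target carries the utterance-level quality and the bias targets carry only each judge's deviation from it.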

research
06/18/2023

MOSPC: MOS Prediction Based on Pairwise Comparison

As a subjective metric to evaluate the quality of synthesized speech, Me...
research
10/18/2021

LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

An effective approach to automatically predict the subjective rating for...
research
04/20/2021

Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets

The ground truth used for training image, video, or speech quality predi...
research
08/18/2023

Generative Machine Listener

We show how a neural network can be trained on individual intrusive list...
research
11/28/2016

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

Developers of text-to-speech synthesizers (TTS) often make use of human ...
research
01/17/2023

MooseNet: A trainable metric for synthesized speech with plda backend

We present MooseNet, a trainable speech metric that predicts listeners' ...
research
10/13/2021

Considering user agreement in learning to predict the aesthetic quality

How to robustly rank the aesthetic quality of given images has been a lo...
