
MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network

by Yichong Leng, et al.

Mean opinion score (MOS) is a popular subjective metric for assessing the quality of synthesized speech, and it usually requires multiple human judges to rate each utterance. To reduce the labor cost of MOS tests, several methods have been proposed to predict MOS automatically. To our knowledge, all previous works used only the average of the scores from different judges as the training target for each utterance and discarded the individual judge scores, which does not fully exploit the precious MOS training data. In this paper, we propose MBNet, a MOS predictor with a mean subnet and a bias subnet that makes better use of every judge score in MOS datasets. The mean subnet predicts the mean score of each utterance, as in previous works, while the bias subnet predicts the bias score (the difference between the mean score and each individual judge score) and captures the personal preferences of individual judges. Experiments show that, compared with the MOSNet baseline, which uses only the mean score for training, MBNet improves the system-level Spearman's rank correlation coefficient (SRCC) by 2.9 dataset and 6.7
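To make the two training targets described in the abstract concrete, here is a minimal sketch of how they might be derived from raw per-judge ratings. The toy data, the `build_targets` helper, and the sign convention (bias = judge score minus utterance mean, so that mean plus bias recovers the judge's score) are assumptions for illustration only; the abstract does not specify them, and this is not the authors' code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings: (utterance_id, judge_id, score on a 1-5 scale).
ratings = [
    ("utt1", "judgeA", 4),
    ("utt1", "judgeB", 2),
    ("utt1", "judgeC", 3),
    ("utt2", "judgeA", 5),
    ("utt2", "judgeB", 4),
]


def build_targets(ratings):
    """Return (mean_targets, bias_targets) for a list of rating triples.

    mean_targets maps utterance_id -> average score over its judges
    (the kind of target a mean subnet would be trained on).
    bias_targets maps (utterance_id, judge_id) -> judge score minus the
    utterance mean (assumed sign convention for a bias subnet target).
    """
    scores_by_utt = defaultdict(list)
    for utt, _judge, score in ratings:
        scores_by_utt[utt].append(score)

    mean_targets = {utt: mean(scores) for utt, scores in scores_by_utt.items()}
    bias_targets = {
        (utt, judge): score - mean_targets[utt]
        for utt, judge, score in ratings
    }
    return mean_targets, bias_targets


mean_targets, bias_targets = build_targets(ratings)
```

Under this convention the biases for each utterance sum to zero, and a judge who consistently rates above or below the mean shows up as a consistently positive or negative bias, which is the kind of personal preference the bias subnet is meant to capture.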




LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

An effective approach to automatically predict the subjective rating for...

Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets

The ground truth used for training image, video, or speech quality predi...

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

Developers of text-to-speech synthesizers (TTS) often make use of human ...

MooseNet: A trainable metric for synthesized speech with plda backend

We present MooseNet, a trainable speech metric that predicts listeners' ...

Considering user agreement in learning to predict the aesthetic quality

How to robustly rank the aesthetic quality of given images has been a lo...

DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores

Mean opinion score (MOS) is a typical subjective evaluation metric for s...

Improving pronunciation assessment via ordinal regression with anchored reference samples

Sentence level pronunciation assessment is important for Computer Assist...