Automatic synthetic speech quality assessment is attractive owing to its ability to replace the reliable but costly subjective evaluation process. Conventional objective measures designed for telephone speech not only require a clean reference speech but also fail to align with human ratings for speech synthesis methods beyond speech codecs. Therefore, non-intrusive statistical quality prediction models have received increasing attention in recent years. They are typically trained on the results of a large-scale crowd-sourced listening test, such as the Blizzard Challenge (BC) or the Voice Conversion Challenge (VCC) [13, 9, 17], which contain speech samples and their corresponding subjective scores. Early works conditioned simple statistical models such as linear regression on carefully designed hand-crafted features, while recent works use deep neural networks (DNNs) to extract rich feature representations from raw inputs such as the magnitude spectrum, resulting in high correlations with human ratings at both the utterance level and the system level.
Most previous works trained the mean opinion score (MOS) prediction model on utterance-level scores. Specifically, given an input speech sample, the model was trained to predict the arithmetic mean of the ratings from different listeners (in some literature, a listener is also referred to as a judge). As pointed out in previous work, one serious problem with this training strategy is data scarcity. DNN-based models require a large amount of data, but due to budget constraints, only a limited number of samples from each system are rated. As a result, the number of per-utterance scores can be too small for DNN models. Researchers have tried to address this problem by pretraining on an artificially distorted dataset or by utilizing self-supervised speech representations (S3Rs) trained on large-scale unlabeled datasets [14].
A more straightforward approach is to leverage all individual ratings of each sample in the dataset. This is called listener-dependent (LD) modeling, and it has been studied in the context of speech emotion recognition [1]. In addition to enlarging the data size, another advantage of LD modeling is that the prediction can be modeled more accurately by taking the preferences of individual listeners into account. In the field of MOS prediction, a recent study proposed the so-called mean-bias network (MBNet) [7, 14], which consists of a mean subnet that predicts the utterance-level score of each utterance and a bias subnet that predicts the bias (defined as the difference between the mean score and the listener score). During inference, given an input speech sample, the bias subnet is discarded and only the mean subnet is used to make the prediction.
In this work, we propose a unified framework, LDNet, that summarizes recent advances in LD modeling. LDNet directly learns to predict the LD score given the input speech and the listener ID. We also propose two new inference methods. The all listeners inference averages simulated decisions from all listeners in the training set and is shown to be more stable than the mean net inference. The mean listener inference relies on a learned virtual mean listener for fast prediction. We also suggest a more lightweight yet efficient model architecture design. We conducted systematic experiments on two datasets: the VCC2018 benchmark and a newly collected dataset. Experimental results demonstrate the effectiveness of our system and shed light on LD modeling in MOS prediction.
We first introduce the problem formulation of MOS prediction modeling, following previous work. Assume we have access to a MOS dataset containing a number of speech samples. Each sample has several LD scores, each given by a listener drawn at random from the listener pool. The mean score of a sample is the average of its LD scores. Note that the same listener may have rated several different samples. The number of ratings per sample is usually smaller than the total number of listeners in the dataset, due to the budget constraint when collecting it.
A representative work in MOS prediction for synthetic speech is MOSNet [8], which is depicted in the leftmost subfigure of Figure 1. MOSNet aims at finding a model that predicts the subjective rating of a given speech sample. MOSNet training involves minimizing a mean loss (using a criterion like MSE) between the model prediction and the mean score of each sample.
During inference, given an input speech sample, the trained model is directly used to make the prediction.
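As a minimal sketch of the training target described above, the snippet below averages the listener ratings of each sample and computes an MSE against utterance-level predictions. The function names (`mean_score`, `mean_loss`) and the toy numbers are illustrative assumptions and are not taken from the MOSNet implementation.

```python
from statistics import fmean

def mean_score(ratings):
    """Arithmetic mean of the listener ratings of one sample."""
    return fmean(ratings)

def mean_loss(predictions, ratings_per_sample):
    """MSE between utterance-level predictions and per-sample mean scores."""
    targets = [mean_score(r) for r in ratings_per_sample]
    return fmean((p - t) ** 2 for p, t in zip(predictions, targets))

# Toy data: two samples, each rated by four listeners on a 1-5 scale.
ratings = [[3, 4, 4, 5], [1, 2, 2, 3]]
predictions = [3.5, 2.5]  # hypothetical model outputs
loss = mean_loss(predictions, ratings)
```

Note that the four ratings of each sample collapse into a single target here, which is the source of the data scarcity discussed in the introduction.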
MBNet consists of two subnets, a MeanNet and a BiasNet, as shown in the second leftmost subfigure of Figure 1. Similar to MOSNet, the MeanNet aims to predict the mean score, while the BiasNet takes not only the speech sample but also the listener ID as input to predict the bias score. Adding the mean score and the bias score gives the predicted LD score, and the bias loss for MBNet is computed between this prediction and the listener's actual rating.
The final loss of MBNet is defined as a weighted sum of the mean loss and the bias loss.
During inference, since we aim to assess the average subjective rating of the query sample, the BiasNet is discarded and only the MeanNet is used to make the prediction, similar to Equation 2. We refer to this inference mode as the MeanNet inference.
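The MBNet objective described above can be sketched for a single rating as follows. This is an illustration under stated assumptions: squared error stands in for the loss criterion, the weight is placed on the bias term, and all names and numbers are hypothetical rather than taken from the MBNet code.

```python
def squared_error(pred, target):
    return (pred - target) ** 2

def mbnet_loss(mean_pred, bias_pred, mean_target, listener_target, bias_weight):
    """Weighted sum of the mean loss and the bias loss for one rating.

    mean_pred: MeanNet output for the sample
    bias_pred: BiasNet output for the (sample, listener) pair
    """
    ld_pred = mean_pred + bias_pred  # LD score = mean score + bias score
    loss_mean = squared_error(mean_pred, mean_target)
    loss_bias = squared_error(ld_pred, listener_target)
    # Assumption: the weight multiplies the bias term.
    return loss_mean + bias_weight * loss_bias

# Toy example with hypothetical values.
loss = mbnet_loss(mean_pred=3.0, bias_pred=0.5,
                  mean_target=3.5, listener_target=4.0, bias_weight=4.0)
```

At inference time only `mean_pred` would be used, so in this sketch `bias_pred` influences the result only through training.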
2.3 Inefficient design of MBNet
Although MBNet was shown to be effective [7, 14], we argue that its design is somewhat inefficient. In the original MBNet, the MeanNet and the BiasNet both take the speech as input, which raises several problems. First, there are certain criteria and standards that are invariant to listeners when rating a speech sample. In other words, the two subnets should share some common behaviors. Second, from the inference point of view, if only the MeanNet is used, then the BiasNet and the subsequent bias loss act only as a regularizer. However, since the BiasNet also takes the speech as input, it is necessary but inefficient to make the BiasNet large. In a nutshell, it is worthwhile to redesign the model so that it learns listener-dependent and listener-independent feature representations.
3 Listener-dependent network (LDNet)
We first present a more general formulation of LD modeling. Consider the model structure depicted in the second rightmost subfigure of Figure 1. From the input/output perspective, we define only a single model that produces the LD score given the speech and the listener ID as input. We name our formulation LDNet, in contrast to MBNet, since we do not explicitly define two submodules to predict the mean and bias scores. During training, the model is trained only to minimize an LD loss between the predicted LD score and the corresponding listener's rating.
Note that LDNet can be viewed as a generalization of MBNet, as the MBNet outputs the LD score by adding the outputs of the MeanNet and the BiasNet.
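To make the LD loss concrete, the sketch below treats every individual (sample, listener, score) rating as one training example. The toy model, which adds a listener offset to a per-sample base quality, is a hypothetical stand-in for a trained network, and MSE is assumed as the criterion.

```python
from statistics import fmean

def ld_loss(model, ratings):
    """MSE over individual (sample, listener, score) ratings."""
    return fmean((model(x, l) - s) ** 2 for x, l, s in ratings)

# Hypothetical stand-in for an LD model: base quality plus listener offset.
base_quality = {"utt1": 3.0, "utt2": 4.0}
listener_offset = {"A": 0.5, "B": -0.5}

def toy_model(sample, listener):
    return base_quality[sample] + listener_offset[listener]

# Three ratings from two listeners over two samples.
ratings = [("utt1", "A", 4.0), ("utt1", "B", 2.5), ("utt2", "A", 4.5)]
loss = ld_loss(toy_model, ratings)
```

Compared with training on mean scores, the number of training targets grows from the number of samples to the number of ratings.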
3.1 All listeners inference
We then show how to perform inference with only the LDNet. Inspired by [1], we propose the all listeners inference, which simulates the decision of each training listener and averages over them.
An obvious advantage of the all listeners inference is its flexibility. Unlike the MeanNet inference defined in Equation 2, which requires an explicit network to produce the mean score, the all listeners inference mode only requires the model to be able to produce LD scores w.r.t. training listeners. That is to say, all listeners inference also applies to MBNet.
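The all listeners inference can be sketched in a few lines: query the model once per training listener and average the results. The toy model and its offsets are hypothetical; any callable that returns an LD score for a (sample, listener) pair would fit, which is why the same procedure also applies to MBNet.

```python
from statistics import fmean

def all_listeners_inference(model, sample, training_listeners):
    """Average the simulated LD scores of every training listener."""
    return fmean(model(sample, listener) for listener in training_listeners)

# Toy stand-in for a trained LD model (hypothetical numbers).
listener_offset = {"A": 0.5, "B": -0.25, "C": 0.5}

def toy_model(sample_quality, listener):
    return sample_quality + listener_offset[listener]

# Iterating over the dict yields the listener IDs A, B and C.
prediction = all_listeners_inference(toy_model, 3.0, listener_offset)
```

In this naive form the cost grows linearly with the number of training listeners, since each listener requires its own call to the model.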
3.2 Listener independent and dependent model decomposition
Following the argument in Section 2.3, we propose to decompose the model into an encoder that learns listener-independent (LI) features and a decoder that fuses the listener information to generate LD features, as depicted in Figure 1. Formally, the encoder maps the input speech to LI features, and the decoder maps these features together with the listener ID to the LD score.
The division between the encoder and the decoder is simply where the listener ID is injected. That is to say, the BiasNet in MBNet can in fact be factorized in the same fashion. However, in MBNet, the encoder was a single 2D convolutional (conv2d) layer, while the decoder was much deeper. We argue that if the listener preference only adds a shift to the mean score (as the name "bias" suggests), then the decoder should be kept simple, leaving most of the representational power to the encoder, as done in [1]. We present how we achieve this in our model design in later sections.
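The decomposition can be illustrated with a deliberately tiny sketch in which the encoder never sees the listener ID and the decoder only applies a listener-specific shift. The frame averaging, the scalar listener embeddings, and the function names are assumptions made for illustration; the actual encoder and decoder are neural networks.

```python
def encoder(frames):
    """Listener-independent features: here, simply the mean of each frame."""
    return [sum(f) / len(f) for f in frames]

# Hypothetical learned listener embeddings, reduced to scalars.
listener_embedding = {"A": 0.5, "B": -0.5}

def decoder(li_features, listener_id):
    """Fuse listener information; a deliberately simple additive shift."""
    shift = listener_embedding[listener_id]
    return [h + shift for h in li_features]

def ldnet(frames, listener_id):
    li_features = encoder(frames)  # does not depend on the listener
    frame_scores = decoder(li_features, listener_id)
    return sum(frame_scores) / len(frame_scores)  # average pooling over frames

frames = [[2.0, 4.0], [3.0, 5.0]]  # toy two-frame input
score_a = ldnet(frames, "A")
score_b = ldnet(frames, "B")
```

Because `encoder` is independent of the listener, its output could be computed once per utterance and reused for every listener.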
3.3 Utilizing the mean score with a MeanNet
We can utilize the mean scores with a MeanNet, as depicted in the rightmost subfigure of Figure 1. We refer to this model as LDNet-MN. Instead of taking the speech as input, the MeanNet here takes the LI features extracted by the encoder to predict the mean score. The motivation is to help the encoder extract LI features, since the mean score is LI by our assumption. The resulting multitask learning (MTL) loss is obtained by rewriting Equation 1 so that the mean score is predicted by the MeanNet from the LI features, and the loss of LDNet-MN is a weighted sum of the MTL loss and the LD loss.
LDNet-MN and MBNet might appear to have a similar structure and objective, but a fundamental difference is that the MTL loss propagates back to not only the MeanNet but also the encoder, while the mean loss in Equation 4 only affects the MeanNet. In addition, similar to the principle described in Section 3.2, we designed the MTL head to be as simple as possible.
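A rough sketch of the LDNet-MN objective follows. Both heads read the same LI features, so in a real network the gradient of either loss term would reach the encoder. The pooling heads, the squared-error criterion, and the assignment of the two weights to the two terms are assumptions made for illustration only.

```python
def average(values):
    return sum(values) / len(values)

def mean_head(li_features):
    """Hypothetical MeanNet head: pools the LI features into a mean score."""
    return average(li_features)

def decoder(li_features, listener_shift):
    """Hypothetical decoder: shifts the LI features by a listener offset."""
    return average([h + listener_shift for h in li_features])

def ldnet_mn_loss(li_features, listener_shift, mean_target, listener_target,
                  mtl_weight, ld_weight):
    """Weighted sum of the MTL loss and the LD loss for one rating."""
    loss_mtl = (mean_head(li_features) - mean_target) ** 2
    loss_ld = (decoder(li_features, listener_shift) - listener_target) ** 2
    return mtl_weight * loss_mtl + ld_weight * loss_ld

# Toy example with hypothetical values.
loss = ldnet_mn_loss([3.0, 4.0], listener_shift=0.5,
                     mean_target=4.0, listener_target=4.5,
                     mtl_weight=1.0, ld_weight=4.0)
```

The contrast with MBNet lies in the input of the mean head: it consumes encoder features rather than the raw speech, so the mean score supervises the shared encoder.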
3.4 Utilizing the mean score with a mean listener
One shortcoming of the all listeners inference is that it requires running multiple forward passes and then averaging the results. A workaround is to use a matrix representation so that only one forward pass is run, at the cost of extra memory consumption. Alternatively, we can extend the training set by adding a virtual "mean listener" (ML). Formally, each sample now has one additional LD score, namely its mean score, and the listener ID that corresponds to the mean scores of all speech samples is the mean listener. We can then train an LDNet with the extended training set, and we denote this variant as LDNet-ML. Note that we did not assign a different weight (or use techniques like oversampling) to the mean listener when updating the model. At test time, in addition to the all listeners inference, LDNet-ML provides an efficient mean listener inference mode, which simply uses the mean listener ID to run one forward pass.
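Extending the training set with a virtual mean listener amounts to appending one extra rating per sample. The sketch below assumes ratings are stored as (sample, listener, score) tuples; the reserved ID `MEAN_LISTENER` and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

MEAN_LISTENER = "mean_listener"  # hypothetical ID reserved for the virtual listener

def add_mean_listener(ratings):
    """Append one (sample, MEAN_LISTENER, mean score) entry per sample."""
    by_sample = defaultdict(list)
    for sample, _listener, score in ratings:
        by_sample[sample].append(score)
    extra = [(s, MEAN_LISTENER, fmean(v)) for s, v in by_sample.items()]
    return list(ratings) + extra

ratings = [("utt1", "A", 4), ("utt1", "B", 2), ("utt2", "A", 5)]
extended = add_mean_listener(ratings)
```

The added entries are treated like any other rating during training, without reweighting or oversampling. At test time, a single query with `MEAN_LISTENER` as the listener ID would replace the loop over all training listeners.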
4 Experimental settings
VCC2018. This dataset contains 20580 speech samples, each rated by 4 listeners. A total of 270 listeners were recruited, and each listener rated 226 samples on average. We followed an open-source MBNet implementation (https://github.com/sky1456723/Pytorch-MBNet/) and used a random split of 13580/3000/4000 for train/valid/test. Note that all listeners are seen listeners during validation and testing.
BVCC. This is a newly collected large-scale MOS dataset, containing samples from past BCs and VCCs as well as state-of-the-art TTS systems implemented in ESPnet [15, 4]. There are 7106 samples, each rated by 8 listeners. In total there are 304 listeners, each rating 187 samples. A carefully curated rule was used to create a 4974/1066/1066 train/valid/test data split. There are 288 listeners in the training set, and there are 8 unseen listeners in the valid and test sets, with some overlap with the training listeners. For more details about how the split was created, please refer to the dataset's original publication.
We will open-source our codebase in the near future (https://github.com/unilight/LDNet).
Baselines. We used the pretrained model in the official MOSNet implementation (https://github.com/lochenchou/MOSNet). For MBNet, we used the unofficial, open-source implementation mentioned in the previous subsection. Note that we provide our self-implemented MBNet results because the data split of the MBNet paper was not specified by the authors, so their results are not directly comparable with our LDNet results.
Model details. The input of the models is the magnitude spectrum, which was also used in previous work. All models output frame-level scores, and simple average pooling was used to generate the utterance-level scores. For the LDNets, we tried three encoder variants. To align with MBNet, we first used the MeanNet structure in MBNet, which is composed of conv2d blocks with dropout and batch normalization layers. We then tried two efficient but powerful conv2d-based architectures, MobileNetV2 [12] and MobileNetV3 [5]. We used the implementations provided by torchvision (https://github.com/pytorch/vision/tree/main/torchvision/models) and refer readers to the original papers for more details. The base structure of the decoder and the MeanNet (in LDNet-MN) is a single-layer feed-forward network (FFN) followed by a projection, and we additionally experimented with a BLSTM-based RNN decoder. Figure 2 illustrates the LDNet model architecture.
Training details. MBNet-style models were trained with the Adam optimizer, and MobileNetV2 and V3 models were trained using RMSprop. With V2, the learning rate (LR) was decayed by a factor of 0.9 every 5k steps, and with V3, the LR was decayed by a factor of 0.97 every 1k steps. The loss weights in Equations 4 and 8 were set to 1 and 4, respectively. We also used techniques from recent papers that we found helpful in our experiments. The clipped MSE was used to prevent the model from overfitting. Repetitive padding was found to be better than zero padding with a masked loss. Range clipping was an effective inductive bias for constraining the range of the network output.
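The three training techniques mentioned above can be sketched as follows. These are common formulations written from the descriptions, not the exact definitions used in the cited papers: the threshold `tau`, the tanh-based squashing, and the 1 to 5 score range are assumptions.

```python
import math

def clipped_mse(pred, target, tau=0.5):
    """Squared error that is zeroed when the error is already within tau."""
    err = abs(pred - target)
    return err ** 2 if err > tau else 0.0

def repetitive_pad(frames, target_len):
    """Repeat a non-empty frame sequence cyclically up to target_len frames."""
    return [frames[i % len(frames)] for i in range(target_len)]

def range_clip(raw_output, low=1.0, high=5.0):
    """Squash an unbounded output into [low, high] with tanh."""
    mid, half = (low + high) / 2, (high - low) / 2
    return mid + half * math.tanh(raw_output)

padded = repetitive_pad([1, 2, 3], 7)
```

With repetitive padding every frame in a batch carries real speech content, so no mask is needed when the frame-level scores are averaged.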
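| Model | Config (Enc./Dec./MeanNet) | Model size | Mode | VCC2018 utt. MSE | VCC2018 utt. LCC | VCC2018 utt. SRCC | VCC2018 sys. MSE | VCC2018 sys. LCC | VCC2018 sys. SRCC | BVCC utt. MSE | BVCC utt. LCC | BVCC utt. SRCC | BVCC sys. MSE | BVCC sys. LCC | BVCC sys. SRCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MOSNet | Numbers from [8] | - | MN | 0.538 | 0.642 | 0.589 | 0.084 | 0.957 | 0.888 | - | - | - | - | - | - |
| MOSNet | Numbers from [7] | - | MN | 0.465 | 0.638 | 0.611 | 0.047 | 0.964 | 0.922 | - | - | - | - | - | - |
| MOSNet | Model from [8] | - | MN | - | - | - | - | - | - | 0.816 | 0.294 | 0.263 | 0.563 | 0.261 | 0.266 |
| MBNet | Numbers from [7] | - | MN | 0.426 | 0.680 | 0.647 | 0.029 | 0.977 | 0.949 | - | - | - | - | - | - |
| (a) MBNet | Self implementation | 1.38M | MN | 0.955 | 0.658 | 0.630 | 0.549 | 0.978 | 0.957 | 0.669 | 0.757 | 0.765 | 0.522 | 0.854 | 0.860 |

Note: "utt." and "sys." denote the utterance level and the system level, and MN denotes MeanNet inference. The results in the MBNet row whose numbers are taken from the original paper are not directly comparable with the rows below, since it is unclear what data split the authors used. We suggest that readers compare results from (a) to (g).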
5 Experimental results
For all self-implemented models, we used 3 different random seeds and report the averaged metrics, including MSE, linear correlation coefficient (LCC), and Spearman's rank correlation coefficient (SRCC), at both the utterance level and the system level. The MBNet and LDNets were trained for 50k and 100k steps, respectively, and following [14], model selection was based on system-level SRCC.
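For readers unfamiliar with the two evaluation levels, the sketch below computes MSE, LCC and SRCC with the standard library only. System-level scores are obtained here by averaging utterance-level scores within each system, which is the usual convention and an assumption about this paper's exact procedure; the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean

def mse(pred, true):
    return fmean((p - t) ** 2 for p, t in zip(pred, true))

def lcc(x, y):
    """Pearson linear correlation coefficient (undefined for constant input)."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return lcc(ranks(x), ranks(y))

def system_level(systems, scores):
    """Average utterance-level scores within each system (sorted by system ID)."""
    groups = defaultdict(list)
    for system, score in zip(systems, scores):
        groups[system].append(score)
    return [fmean(groups[s]) for s in sorted(groups)]

# Toy data: three systems with two utterances each.
systems = ["s1", "s1", "s2", "s2", "s3", "s3"]
true = [2.0, 3.0, 3.0, 4.0, 4.5, 4.5]
pred = [2.5, 2.5, 3.5, 3.0, 4.0, 5.0]
utt_mse = mse(pred, true)
sys_true, sys_pred = system_level(systems, true), system_level(systems, pred)
sys_srcc = srcc(sys_pred, sys_true)
```

System-level SRCC only asks whether the systems are ranked in the right order, which is why it can be high even when utterance-level errors are sizable.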
Our main experimental results are shown in Table 1. We summarize our observations into the following points.
5.1 Advantage of all listeners inference
As mentioned in Section 3.1, all listeners inference can be applied to any model that produces LD scores w.r.t. training listeners, including MBNet. In row (a), compared to MeanNet inference, all listeners inference greatly reduced both utterance-level and system-level MSE and provided a slight system-level improvement. This shows that all listeners inference can reduce the variance of the prediction and better capture the relationship between different systems.
5.2 Impact of encoder design
We then investigated the impact of the encoder design by comparing rows (b), (c) and (d), which used the MBNet-style encoder, MobileNetV2 and MobileNetV3, respectively. On VCC2018, we observed a stable system-level improvement as the encoder advanced. On BVCC, although the MobileNetV3 encoder gave the lowest system-level MSE, its LCC and SRCC were slightly lower than those of the MobileNetV2 encoder. Considering that MobileNetV3 used 0.25M fewer model parameters and that it empirically trained faster than MobileNetV2, we fixed the encoder to MobileNetV3 in the succeeding experiments.
5.3 Impact of decoder design
The influence of a simpler decoder design was inspected by removing the RNN layer in model (d) to form a simple FFN decoder. The resulting model (e) had comparable system-level performance, a smaller model size and an empirically shorter training time. This result is consistent with our argument in Section 3.2 that the decoder can be made as simple as possible as long as the encoder is strong enough.
5.4 Effectiveness of LDNet-MN
LDNet-MN shares a similar structure with MBNet and was expected to bring improvements by utilizing the mean score. However, by comparing rows (d) and (f), we observed no improvement but only degradation at both the utterance level and the system level. We also tried an RNN-based MeanNet but still observed no improvement.
5.5 Effectiveness of LDNet-ML
We finally examined LDNet-ML, which utilizes the mean score in the listener embedding space. Comparing rows (e) and (g), both with all listeners inference, a slight degradation was observed on VCC2018, while a substantial improvement was observed on BVCC. Interestingly, when switching to mean listener inference, further improvements in system-level SRCC were obtained on both datasets. These results suggest that LDNet-ML is a better way to utilize the mean score than LDNet-MN. Also, since VCC2018 has 4 ratings per sample and BVCC has 8, the mean score in BVCC is considered more reliable, resulting in more significant improvements brought by LDNet-ML.
6 Conclusions and future works
In this work, we integrated recent advances in LD modeling for MOS prediction. The resulting model, LDNet, is equipped with an advanced model structure and several inference options for efficient prediction. Evaluation results justified the design of the proposed components and showed that our system outperformed the MBNet baselines. Results also showed that LDNet-ML is the best way to utilize the mean scores, and its advantage is even more pronounced when more ratings per sample are available. In the future, as the proposed techniques are flexible, we plan to combine them with existing effective methods, such as time-domain modeling [6, 16] and S3R learning [14].
The authors would like to thank the organizers of the Blizzard Challenges and Voice Conversion Challenges for providing the data. This work was partly supported by JSPS KAKENHI Grant Number 21J20920 and JST CREST Grant Number JPMJCR19A3, Japan. This work was also supported by JST CREST grants JPMJCR18A6 and JPMJCR20D3, and by MEXT KAKENHI grants 21K11951 and 21K19808.
[1] (2021) Speech Emotion Recognition based on Listener-dependent Emotion Perception Models. APSIPA Trans. on Signal and Information Processing 10, pp. e6.
[2] Generalization ability of MOS prediction networks. Submitted to ICASSP 2022.
[3] (2021) How do voices from past speech synthesis challenges compare today? In Proc. SSW, pp. 183–188.
[4] (2020) ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit. In Proc. ICASSP, pp. 7654–7658.
[5] (2019) Searching for MobileNetV3. In Proc. ICCV, pp. 1314–1324.
[6] A Deep Learning-Based Time-Domain Approach for Non-Intrusive Speech Quality Assessment. In Proc. APSIPA, pp. 477–481.
[7] (2021) MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network. In Proc. ICASSP, pp. 391–395.
[8] (2019) MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proc. Interspeech, pp. 1541–1545.
[9] (2018) The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. In Proc. Odyssey, pp. 195–202.
[10] (2020) Deep Learning Based Assessment of Synthetic Speech Naturalness. In Proc. Interspeech, pp. 1748–1752.
[11] (2012) Towards perceptual quality modeling of synthesized audiobooks-Blizzard Challenge 2012. In Blizzard Challenge Workshop.
[12] (2018) MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. CVPR, pp. 4510–4520.
[13] (2016) The voice conversion challenge 2016. In Proc. Interspeech, pp. 1632–1636.
[14] (2021) Utilizing Self-supervised Representations for MOS Prediction. In Proc. Interspeech, pp. 2781–2785.
[15] (2018) ESPnet: End-to-End Speech Processing Toolkit. In Proc. Interspeech, pp. 2207–2211.
[16] (2021) MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment. In Proc. Interspeech, pp. 2142–2146.
[17] (2020) Voice Conversion Challenge 2020 - Intra-lingual semi-parallel and cross-lingual voice conversion -. In Proc. Joint Workshop for the BC and VCC 2020, pp. 80–98.
[18] (2020) The Blizzard Challenge 2020. In Proc. Joint Workshop for the BC and VCC 2020, pp. 1–18.