1 Introduction
Automatic synthetic speech quality assessment is attractive owing to its ability to replace the reliable but costly subjective evaluation process. Conventional objective measures designed for telephone speech not only require a clean reference but also fail to align with human ratings for the wider varieties of synthetic speech beyond coded speech. Therefore, non-intrusive statistical quality prediction models have received increasing attention in recent years. They are typically trained on large-scale crowdsourced listening tests like the Blizzard Challenge (BC) [18] or the Voice Conversion Challenge (VCC) [13, 9, 17], which contain speech samples and their corresponding subjective scores. Early works tried to condition simple statistical models like linear regression on carefully designed handcrafted features [11], while recent works use deep neural networks (DNNs) to extract rich feature representations from raw inputs like the magnitude spectrum, resulting in high correlations at both the utterance level and the system level
[8].

Most previous works trained the mean opinion score (MOS) prediction model on utterance-level scores. Specifically, given an input speech sample, the model was trained to predict the arithmetic mean of the ratings from several different listeners (in some literature, the term "listener" is also referred to as "judge"). As pointed out in [7], one serious problem raised by such a training strategy is data scarcity. Although DNN-based models require a large amount of data, due to budget constraints, only a limited number of samples from each system are rated. As a result, the number of per-utterance scores can be too small for DNN models. Researchers have tried to address this problem by pretraining on an artificially distorted dataset [10] or by utilizing self-supervised speech representations (S3Rs) trained on large-scale unlabeled datasets [14].
A more straightforward approach is to leverage all ratings w.r.t. each sample in the dataset. This is called listener-dependent (LD) modeling, and has been studied in the context of speech emotion recognition [1]. In addition to enlarging the data size, another advantage of LD modeling is more accurate prediction, since the preferences of individual listeners are taken into account. In the field of MOS prediction, a recent study proposed the so-called mean-bias network (MBNet) [7, 14], which consists of a mean subnet that predicts the utterance-level score of each utterance and a bias subnet that predicts the bias (defined as the difference between the mean score and the listener score). During inference, given an input speech sample, the bias subnet is discarded and only the mean subnet is used to make the prediction.
In this work, we propose a unified framework, LDNet, that summarizes recent advances in LD modeling. LDNet directly learns to predict the LD score given the input speech and the listener ID. We also propose two new inference methods: the all listeners inference averages simulated decisions from all listeners in the training set, and is shown to be more stable than the mean net inference, while the mean listener inference mode relies on a learned virtual mean listener for fast prediction. We also suggest a more lightweight yet efficient model architecture. We conducted systematic experiments on two datasets: the VCC2018 benchmark and a newly collected dataset [3]. Experimental results demonstrate the effectiveness of our system and shed light on a deeper understanding of LD modeling in MOS prediction.
2 Preliminary
2.1 Formulation
We first introduce the problem formulation of MOS prediction modeling as described in [7]. Assume we have access to a MOS dataset $D$ containing $N$ speech samples $\{x_1, \dots, x_N\}$. Each sample $x_n$ has LD scores $\{s_{n,i}\}$ rated by a set of random listeners $\mathcal{L}_n$. We can further denote the mean score as $\bar{s}_n = \frac{1}{|\mathcal{L}_n|} \sum_{i \in \mathcal{L}_n} s_{n,i}$. Note that the same listener may have rated several different samples. In total, there are $L$ distinct listeners in $D$, and usually $|\mathcal{L}_n| \ll L$ due to the budget constraint when collecting $D$.
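To make this notation concrete, the following toy sketch (hypothetical sample and listener IDs, numpy only) computes each sample's mean score from its LD ratings:

```python
import numpy as np

# Hypothetical MOS dataset: keys are sample IDs; values map
# listener ID -> that listener's rating of the sample.
ld_scores = {
    "sample_0": {"listener_a": 4.0, "listener_b": 3.0, "listener_c": 5.0},
    "sample_1": {"listener_a": 2.0, "listener_d": 3.0},
}

# The mean score of a sample is the arithmetic mean over its listener set.
mean_scores = {
    sid: float(np.mean(list(ratings.values())))
    for sid, ratings in ld_scores.items()
}
# mean_scores: sample_0 -> 4.0, sample_1 -> 2.5
```

Note that listener_a rated both samples, mirroring the fact that the same listener may rate several different samples.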
A representative work in MOS prediction for synthetic speech is MOSNet [8], which is depicted in the leftmost subfigure in Figure 1. MOSNet aims at finding a model $f$ that predicts the subjective rating of a given speech sample. The MOSNet training involves minimizing a mean loss (using a criterion like MSE) w.r.t. the mean score of each sample:
$$\mathcal{L}_{\text{mean}} = \frac{1}{N} \sum_{n=1}^{N} \left( \bar{s}_n - f(x_n) \right)^2 \qquad (1)$$
During inference, given an input speech sample $x$, the trained model is directly used to make the prediction:
$$\hat{s} = f(x) \qquad (2)$$
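The training objective and the inference step above can be sketched numerically as follows (numpy; a toy linear map stands in for the DNN, and all values are illustrative):

```python
import numpy as np

def mean_loss(model, samples, mean_scores):
    """MSE between predicted scores and ground-truth mean scores (Eq. 1)."""
    preds = np.array([model(x) for x in samples])
    return float(np.mean((mean_scores - preds) ** 2))

# Toy stand-in for the DNN: a fixed linear map from a feature vector to a score.
w = np.array([0.5, 1.0])
model = lambda x: float(np.dot(w, x))

samples = [np.array([2.0, 1.0]), np.array([4.0, 0.5])]
mean_scores = np.array([2.0, 3.0])

loss = mean_loss(model, samples, mean_scores)  # ((2-2)^2 + (3-2.5)^2)/2 = 0.125
prediction = model(samples[0])                 # inference (Eq. 2): 2.0
```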
2.2 MBNet
MBNet consists of two subnets: a MeanNet $f_M$ and a BiasNet $f_B$, as shown in the second subfigure from the left in Figure 1. Similar to MOSNet, the MeanNet aims to predict the mean score, while the BiasNet takes not only the speech sample but also the listener ID as input to predict the bias score. By combining the mean score and the bias score, we obtain the LD score and the corresponding bias loss for MBNet as follows:
$$\hat{s}_{n,i} = f_M(x_n) + f_B(x_n, i), \qquad \mathcal{L}_{\text{bias}} = \frac{1}{\sum_{n} |\mathcal{L}_n|} \sum_{n=1}^{N} \sum_{i \in \mathcal{L}_n} \left( s_{n,i} - \hat{s}_{n,i} \right)^2 \qquad (3)$$
The final loss of MBNet is defined as a weighted sum of the mean loss and the bias loss:
$$\mathcal{L}_{\text{MBNet}} = \mathcal{L}_{\text{mean}} + \lambda \, \mathcal{L}_{\text{bias}} \qquad (4)$$
During inference, since we aim to assess the average subjective rating of the query sample, the BiasNet is discarded and only the MeanNet is used to make the prediction, similar to Equation 2. We refer to this inference mode as the MeanNet inference.
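The MBNet score composition and its two losses can be sketched like this (numpy; the MeanNet and BiasNet are toy callables, the listener bias table and the loss weight are hypothetical):

```python
import numpy as np

# Toy stand-ins: MeanNet maps features to a score; BiasNet adds a
# listener-specific offset (a lookup table instead of a real DNN).
mean_net = lambda x: float(np.mean(x)) + 3.0
listener_bias = {"a": +0.5, "b": -0.5}
bias_net = lambda x, i: listener_bias[i]

def ld_score(x, i):
    # MBNet LD score = mean prediction + listener bias (Eq. 3).
    return mean_net(x) + bias_net(x, i)

x = np.array([0.0, 1.0])          # mean_net(x) = 3.5
ratings = {"a": 4.0, "b": 3.5}    # LD scores from two listeners

bias_loss = np.mean([(s - ld_score(x, i)) ** 2 for i, s in ratings.items()])
mean_loss = (np.mean(list(ratings.values())) - mean_net(x)) ** 2
total = mean_loss + 4.0 * bias_loss   # weighted sum (Eq. 4); weight illustrative
```

At inference, only `mean_net` would be called, which is exactly why a large BiasNet is wasted effort.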
2.3 Inefficient design of MBNet
Although MBNet was shown to be effective [7, 14], we argue that the MBNet design is somewhat inefficient. In the original MBNet, the MeanNet and the BiasNet both take the speech as input, which raises several problems. First, there are certain criteria and standards that are invariant to listeners when rating a speech sample; in other words, the two subnets should share some common behaviors. Second, from the inference point of view, if only the MeanNet is used, then the BiasNet and the subsequent bias loss act only as a regularizer. However, since the BiasNet also takes the speech as input, it has to be large enough to process speech, which is wasteful given that it is discarded at inference time. In a nutshell, it is worthwhile to redesign the model in order to learn listener-dependent and listener-independent feature representations separately.
3 Listener-dependent network (LDNet)
We first present a more general formulation of LD modeling. Consider the model structure depicted in the second subfigure from the right in Figure 1. From the input/output perspective, we define only one single model $f$ that produces the LD score given the speech and the listener ID as input. We name our formulation LDNet in contrast to MBNet, since we do not explicitly define two submodules to predict the mean and bias scores. During training, the model is trained only to minimize an LD loss:
$$\mathcal{L}_{\text{LD}} = \frac{1}{\sum_{n} |\mathcal{L}_n|} \sum_{n=1}^{N} \sum_{i \in \mathcal{L}_n} \left( s_{n,i} - f(x_n, i) \right)^2 \qquad (5)$$
Note that LDNet can be viewed as a generalization of MBNet, as the MBNet outputs the LD score by adding the outputs of the MeanNet and the BiasNet.
3.1 All listeners inference
We now show how to perform inference with only the LDNet. Inspired by [1], we propose the all listeners inference, which simulates the decisions of each training listener and averages over them:
$$\hat{s} = \frac{1}{L} \sum_{i=1}^{L} f(x, i) \qquad (6)$$
An obvious advantage of the all listeners inference is its flexibility. Unlike the MeanNet inference defined in Equation 2, which requires an explicit network to produce the mean score, the all listeners inference mode only requires the model to be able to produce LD scores w.r.t. training listeners. That is to say, all listeners inference also applies to MBNet.
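A minimal sketch of the all listeners inference: the LD model is queried once per training listener and the simulated decisions are averaged (numpy; the model and the listener offset table are toy assumptions):

```python
import numpy as np

# Toy LD model: a base score from the speech plus a per-listener offset.
offsets = {"a": 0.4, "b": -0.2, "c": 0.1}   # 3 training listeners
ld_model = lambda x, i: float(np.mean(x)) + offsets[i]

def all_listeners_inference(x, listeners):
    # Average the simulated decisions of every training listener (Eq. 6).
    return float(np.mean([ld_model(x, i) for i in listeners]))

x = np.array([3.0, 4.0])                    # mean(x) = 3.5
score = all_listeners_inference(x, offsets) # 3.5 + (0.4 - 0.2 + 0.1)/3 = 3.6
```

Averaging over many simulated listeners is what smooths out individual listener preferences and reduces prediction variance.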
3.2 Listener-independent and listener-dependent model decomposition
As we argued in Section 2.3, we propose to decompose the model into an encoder that learns listener-independent (LI) features and a decoder that fuses in the listener information to generate LD features, as depicted in Figure 1. Formally, we can write:
$$f(x_n, i) = \text{Decoder}(\text{Encoder}(x_n), i) \qquad (7)$$
The division between the encoder and the decoder is simply the point where the listener ID is injected. That is to say, the BiasNet in MBNet can in fact be factorized in the same fashion. However, in MBNet, the encoder is a single convolutional 2D (conv2d) layer, while the decoder is much deeper. We argue that if the listener preference only adds a shift to the mean score (as the name "bias" suggests), then the decoder should be made simple, leaving most of the representational power to the encoder, as done in [1]. We describe how we achieve this in our model design in later sections.
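The decomposition in Equation 7 can be sketched as an encoder producing LI features and a deliberately shallow decoder fusing a listener embedding (numpy; all dimensions, matrices, and the additive fusion are illustrative assumptions):

```python
import numpy as np

# Encoder: does the heavy lifting, mapping speech features to
# listener-independent (LI) features (here a tiny linear map + ReLU).
W_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, -1.0]])
encoder = lambda x: np.maximum(W_enc @ x, 0.0)

# Listener embedding table (2 training listeners, embedding dim 3).
emb = np.array([[0.1, 0.0, 0.0],
                [-0.1, 0.0, 0.0]])

# Decoder: deliberately shallow -- additive fusion of the listener
# embedding followed by a linear projection to a scalar score.
w_dec = np.array([1.0, 1.0, 1.0])
decoder = lambda h, i: float(w_dec @ (h + emb[i]))

def ldnet(x, listener_id):
    # f(x, i) = Decoder(Encoder(x), i)
    return decoder(encoder(x), listener_id)

x = np.array([2.0, 1.0])   # encoder(x) = [2, 1, 1], shared by all listeners
s0 = ldnet(x, 0)           # (2 + 0.1) + 1 + 1 = 4.1
s1 = ldnet(x, 1)           # (2 - 0.1) + 1 + 1 = 3.9
```

The encoder output is computed once per utterance and is identical for every listener; only the cheap decoder call differs, which is the efficiency argument of this section.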
3.3 Utilizing the mean score with a MeanNet
We can utilize the mean scores with a MeanNet, as depicted in the rightmost subfigure in Figure 1. We refer to this model as LDNet-MN. Instead of taking the input speech directly, the MeanNet here takes the LI features extracted by the encoder to predict the mean score. The motivation is to help the encoder extract LI features, since the mean score is LI by our assumption. This multi-task learning (MTL) loss can be derived by rewriting Equation 1:

$$\mathcal{L}_{\text{MTL}} = \frac{\alpha}{N} \sum_{n=1}^{N} \left( \bar{s}_n - \text{MeanNet}(\text{Encoder}(x_n)) \right)^2 \qquad (8)$$
and the loss of LDNet-MN can be written as:
$$\mathcal{L}_{\text{LDNet-MN}} = \mathcal{L}_{\text{LD}} + \mathcal{L}_{\text{MTL}} \qquad (9)$$
LDNet-MN and MBNet might appear to have a similar structure and objective, but a fundamental difference is that the MTL loss propagates back not only to the MeanNet but also to the encoder, while the mean loss in Equation 4 only affects the MeanNet. In addition, following the principle described in Section 3.2, we designed the MTL head to be as simple as possible.
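The LDNet-MN objective can be sketched by evaluating both loss terms on shared encoder features (numpy; the encoder, decoder, MTL head, ratings, and the loss weight are all toy assumptions):

```python
import numpy as np

encoder = lambda x: np.maximum(x, 0.0)               # toy LI feature extractor
offsets = {"a": 0.5, "b": -0.5}                      # toy listener-dependent decoder
decoder = lambda h, i: float(np.mean(h)) + offsets[i]
mean_net = lambda h: float(np.mean(h))               # small MTL head on LI features

x = np.array([3.0, 4.0])
ratings = {"a": 4.2, "b": 3.0}
mean_score = float(np.mean(list(ratings.values())))  # 3.6

h = encoder(x)                                       # shared LI features
ld_loss = np.mean([(s - decoder(h, i)) ** 2 for i, s in ratings.items()])
# The MTL head reads the encoder output, so in a real framework this term
# back-propagates into the encoder as well (unlike the MBNet mean loss).
mtl_loss = 4.0 * (mean_score - mean_net(h)) ** 2     # weight 4.0 is illustrative
total = ld_loss + mtl_loss
```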
3.4 Utilizing the mean score with a mean listener
One shortcoming of the all listeners inference is that it requires running multiple forward passes and averaging the results. A workaround is to use a matrix representation so as to run only one forward pass, at the cost of extra memory consumption. Alternatively, we can extend the training set by adding a virtual "mean listener" (ML). Formally, each sample now has $|\mathcal{L}_n| + 1$ LD scores, and the listener ID that corresponds to the mean score of each speech sample is the mean listener. We can then train an LDNet with the extended training set, and we denote such a variant as LDNet-ML. Note that we did not assign a different weight (or use techniques like oversampling) to the mean listener when updating the model. At test time, in addition to the all listeners inference, LDNet-ML provides an efficient mean listener inference mode: simply use the mean listener ID to run one forward pass.
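The mean-listener extension is a plain dataset transformation: every sample gains one extra rating, attributed to a virtual listener ID, whose value is the sample's mean score (pure Python with numpy; the sentinel ID name is an illustrative choice):

```python
import numpy as np

MEAN_LISTENER = "__mean__"   # virtual listener ID (illustrative name)

def add_mean_listener(ld_scores):
    """Extend each sample's ratings with a virtual mean-listener rating."""
    extended = {}
    for sid, ratings in ld_scores.items():
        ratings = dict(ratings)  # copy, so the original dataset is untouched
        ratings[MEAN_LISTENER] = float(np.mean(list(ratings.values())))
        extended[sid] = ratings
    return extended

data = {"u1": {"a": 4.0, "b": 3.0},
        "u2": {"a": 5.0, "c": 4.0, "d": 3.0}}
ext = add_mean_listener(data)
# ext["u1"][MEAN_LISTENER] == 3.5, ext["u2"][MEAN_LISTENER] == 4.0
```

Mean listener inference then amounts to a single forward pass with `MEAN_LISTENER` as the listener ID, instead of one pass per training listener.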
4 Experimental settings
4.1 Datasets
VCC2018 [9]
This dataset contains 20580 speech samples, where each sample was rated by 4 listeners. A total of 270 listeners were recruited, and each listener rated 226 samples on average. We followed an open-source MBNet implementation (https://github.com/sky1456723/Pytorch-MBNet) and used a random split of 13580/3000/4000 for train/valid/test. Note that all listeners are seen listeners during validation and testing.

BVCC [3] This is a newly collected large-scale MOS dataset, containing samples from the past BCs and VCCs as well as state-of-the-art TTS systems implemented in ESPnet [15, 4]. There are 7106 samples, with each sample rated by 8 listeners. In total there are 304 listeners, with each listener rating 187 samples. A carefully curated rule was used to create a 4974/1066/1066 train/valid/test data split. There are 288 listeners in the training set; the valid and test sets contain 8 listeners unseen during training, along with some listeners that overlap with the training set. For more details about the split, please refer to [2].
4.2 Implementation
We will open-source our codebase in the near future (https://github.com/unilight/LDNet).
Baselines. We used the pretrained model from the official MOSNet implementation (https://github.com/lochenchou/MOSNet), and the unofficial, open-source implementation of MBNet mentioned in the previous subsection. Note that we report our self-implemented MBNet results because the data split of the MBNet paper was not specified by the authors, so their results are not directly comparable with our LDNet results.
Model details. The input of the models is the magnitude spectrum, which was also used in [8]. All models output frame-level scores, and simple average pooling is used to generate the utterance-level scores. For the LDNets, we tried three encoder variants. To align with MBNet, we first used the MeanNet structure from MBNet, which is composed of conv2d blocks with dropout and batch normalization layers. We then tried two efficient yet powerful conv2d-based architectures, MobileNetV2 [12] and MobileNetV3 [5]. We used the implementations provided by torchvision (https://github.com/pytorch/vision/tree/main/torchvision/models) and refer the readers to the original papers for more details. The base structure of the decoder and the MeanNet (in LDNet-MN) is a single-layered feed-forward network (FFN) followed by a projection layer, and we additionally experimented with a BLSTM-based RNN decoder. Figure 2 illustrates the LDNet model architecture.

Training details. MBNet-style models were trained with the Adam optimizer, and MobileNetV2 and MobileNetV3 models were trained using RMSprop. With V2 the learning rate (LR) was decayed by 0.9 every 5k steps, and with V3 the LR was decayed by 0.97 every 1k steps. The weights $\lambda$ and $\alpha$ in Equations 4 and 8 were set to 1 and 4, respectively. We also used techniques from recent papers which we found helpful in our experiments. The clipped MSE [7] was used to prevent the model from overfitting. Repetitive padding [7] was found to be better than zero padding with a masked loss. Range clipping [14] was an effective inductive bias to constrain the range of the network output.

Table 1. Results on VCC2018 and BVCC. Each result cell lists MSE / LCC / SRCC. "MN" denotes MeanNet inference, "All" all listeners inference, and "ML" mean listener inference.

| Model | Config (Enc./Dec./MeanNet) | Size | Mode | VCC2018 utterance | VCC2018 system | BVCC utterance | BVCC system |
|---|---|---|---|---|---|---|---|
| MOSNet | Numbers from [8] | - | MN | 0.538 / 0.642 / 0.589 | 0.084 / 0.957 / 0.888 | - | - |
| MOSNet | Numbers from [7] | - | MN | 0.465 / 0.638 / 0.611 | 0.047 / 0.964 / 0.922 | - | - |
| MOSNet | Model from [8] | - | MN | - | - | 0.816 / 0.294 / 0.263 | 0.563 / 0.261 / 0.266 |
| MBNet† | Numbers from [7] | - | MN | 0.426 / 0.680 / 0.647 | 0.029 / 0.977 / 0.949 | - | - |
| (a) MBNet | Self implementation | 1.38M | MN | 0.955 / 0.658 / 0.630 | 0.549 / 0.978 / 0.957 | 0.669 / 0.757 / 0.765 | 0.522 / 0.854 / 0.860 |
| | | | All | 0.615 / 0.656 / 0.627 | 0.154 / 0.980 / 0.966 | 0.492 / 0.758 / 0.765 | 0.271 / 0.856 / 0.860 |
| (b) LDNet | MBNet-style/RNN/- | 1.18M | All | 0.465 / 0.650 / 0.617 | 0.040 / 0.973 / 0.955 | 0.397 / 0.740 / 0.734 | 0.189 / 0.856 / 0.855 |
| (c) LDNet | MobileV2/RNN/- | 1.73M | All | 0.461 / 0.646 / 0.603 | 0.037 / 0.984 / 0.958 | 0.328 / 0.793 / 0.791 | 0.179 / 0.878 / 0.876 |
| (d) LDNet | MobileV3/RNN/- | 1.48M | All | 0.432 / 0.676 / 0.641 | 0.020 / 0.989 / 0.976 | 0.324 / 0.794 / 0.790 | 0.174 / 0.876 / 0.871 |
| (e) LDNet | MobileV3/FFN/- | 0.96M | All | 0.457 / 0.661 / 0.621 | 0.013 / 0.988 / 0.976 | 0.333 / 0.788 / 0.784 | 0.173 / 0.876 / 0.870 |
| (f) LDNet-MN | MobileV3/RNN/FFN | 1.49M | All | 0.437 / 0.671 / 0.635 | 0.023 / 0.987 / 0.971 | 0.324 / 0.794 / 0.791 | 0.187 / 0.869 / 0.868 |
| (g) LDNet-ML | MobileV3/FFN/- | 0.96M | All | 0.463 / 0.653 / 0.617 | 0.024 / 0.983 / 0.975 | 0.316 / 0.795 / 0.794 | 0.157 / 0.881 / 0.881 |
| | | | ML | 0.479 / 0.648 / 0.613 | 0.021 / 0.983 / 0.979 | 0.333 / 0.795 / 0.794 | 0.169 / 0.885 / 0.886 |

†: The results in this row are not directly comparable with the rows below, since it is unclear what data split the authors used. We suggest that readers compare results from rows (a) to (g).
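The exact formulations of the clipped MSE [7] and range clipping [14] follow the cited papers; the sketch below (numpy, with an illustrative tolerance `tau`) only conveys the idea: the per-sample loss is zeroed once the prediction is already within a tolerance of the target, and outputs are clipped into the valid MOS range [1, 5]:

```python
import numpy as np

def clipped_mse(pred, target, tau=0.25):
    # Per-sample squared error, zeroed when the prediction is already
    # within tau of the target (one common form of a clipped MSE;
    # tau = 0.25 is an illustrative choice, not the papers' value).
    err = (pred - target) ** 2
    return np.where(np.abs(pred - target) <= tau, 0.0, err)

def range_clip(score, lo=1.0, hi=5.0):
    # Constrain the network output to the valid MOS range.
    return np.clip(score, lo, hi)

preds = np.array([3.1, 4.9, 6.0])
targets = np.array([3.0, 4.0, 5.0])
losses = clipped_mse(preds, targets)   # [0.0, 0.81, 1.0]
clipped = range_clip(preds)            # [3.1, 4.9, 5.0]
```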
5 Experimental results
For all self-implemented models, we used 3 different random seeds and report the averaged metrics, including MSE, linear correlation coefficient (LCC), and Spearman's rank correlation coefficient (SRCC), at both the utterance level and the system level. The MBNet and LDNet models were trained for 50k and 100k steps, respectively, and following [14], model selection was based on system-level SRCC.
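The three metrics can be sketched in numpy as follows (a minimal sketch; SRCC is computed here as the Pearson correlation of ranks, which matches Spearman's definition only when there are no ties):

```python
import numpy as np

def lcc(a, b):
    # Linear (Pearson) correlation coefficient.
    return float(np.corrcoef(a, b)[0, 1])

def srcc(a, b):
    # Spearman's rank correlation: Pearson correlation of the ranks.
    # (Valid when there are no ties; scipy.stats.spearmanr handles ties.)
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return lcc(rank(a), rank(b))

true = np.array([3.0, 4.0, 2.0, 5.0])     # e.g. system-level mean scores
pred = np.array([2.8, 4.4, 2.1, 4.9])     # hypothetical model predictions
mse = float(np.mean((true - pred) ** 2))  # 0.055
# Predictions that order the systems correctly give an SRCC of 1,
# even when the absolute values (and hence MSE and LCC) are imperfect.
```

The same functions apply at the utterance level (over per-utterance scores) and the system level (over per-system averages).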
Our main experimental results are shown in Table 1. We summarize our observations into the following points.
5.1 Advantage of all listeners inference
As mentioned in Section 3.1, all listeners inference can be applied to any model that produces LD scores w.r.t. training listeners, including MBNet. In row (a), compared to MeanNet inference, all listeners inference greatly reduced both utterance-level and system-level MSE, and provided a slight system-level correlation improvement. This shows that all listeners inference can reduce the variance of the prediction and better capture the relationship between different systems.
5.2 Impact of encoder design
We then investigate the impact of the encoder design by comparing rows (b), (c), and (d), which used the MBNet-style encoder, MobileNetV2, and MobileNetV3, respectively. On VCC2018, we observe a steady system-level improvement as the encoder advances. On BVCC, although the MobileNetV3 encoder gave the lowest system-level MSE, its LCC and SRCC were slightly lower than those of the MobileNetV2 encoder. Considering that MobileNetV3 uses 0.25M fewer model parameters and that its training was empirically faster than MobileNetV2's, we fixed the encoder to MobileNetV3 in subsequent experiments.
5.3 Impact of decoder design
The influence of a simpler decoder design was inspected by removing the RNN layer in model (d) to form a simple FFN decoder. The resulting model (e) had comparable system-level performance, a smaller model size, and empirically faster training. This result is consistent with our argument in Section 3.2 that the decoder can be made as simple as possible, as long as the encoder is strong enough.
5.4 Effectiveness of LDNet-MN
LDNet-MN shares a similar structure with MBNet and was expected to bring improvement by utilizing the mean score. However, comparing rows (d) and (f), we observed no improvement but only degradation at both the utterance level and the system level. We also tried an RNN-based MeanNet but still observed no improvement.
5.5 Effectiveness of LDNet-ML
We finally examined LDNet-ML, which utilizes the mean score in the listener embedding space. Comparing rows (e) and (g), both using all listeners inference, a slight degradation was observed on VCC2018, while a substantial improvement was observed on BVCC. Interestingly, switching to mean listener inference brought further improvements in system-level SRCC on both datasets. These results suggest that LDNet-ML is a better way to utilize the mean score than LDNet-MN. Also, since VCC2018 has 4 ratings per sample while BVCC has 8, the mean score in BVCC is more reliable, which explains the more significant improvements brought by LDNet-ML there.
6 Conclusions and future work
In this work, we integrated recent advances in LD modeling for MOS prediction. The resulting model, LDNet, is equipped with an advanced model structure and several inference options for efficient prediction. Evaluation results justified the design of the proposed components and showed that our system outperforms the MBNet baselines. Results also showed that LDNet-ML is the best way to utilize the mean scores, and its advantage is even more pronounced when more ratings per sample are available. In the future, as the proposed techniques are flexible, we plan to combine them with existing effective methods, such as time-domain modeling [6, 16] and S3R learning [14].
7 Acknowledgements
The authors would like to thank the organizers of the Blizzard Challenges and Voice Conversion Challenges for providing the data. This work was partly supported by JSPS KAKENHI Grant Number 21J20920 and JST CREST Grant Number JPMJCR19A3, Japan. This work was also supported by JST CREST grants JPMJCR18A6 and JPMJCR20D3, and by MEXT KAKENHI grants 21K11951 and 21K19808.
References
[1] (2021) Speech Emotion Recognition Based on Listener-dependent Emotion Perception Models. APSIPA Trans. on Signal and Information Processing 10, pp. e6.
[2] Generalization Ability of MOS Prediction Networks. In Submitted to ICASSP 2022.
[3] (2021) How Do Voices from Past Speech Synthesis Challenges Compare Today? In Proc. SSW, pp. 183–188.
[4] (2020) ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit. In Proc. ICASSP, pp. 7654–7658.
[5] (2019) Searching for MobileNetV3. In Proc. ICCV, pp. 1314–1324.
[6] (2020) A Deep Learning-Based Time-Domain Approach for Non-Intrusive Speech Quality Assessment. In Proc. APSIPA, pp. 477–481.
[7] (2021) MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network. In Proc. ICASSP, pp. 391–395.
[8] (2019) MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proc. Interspeech, pp. 1541–1545.
[9] (2018) The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. In Proc. Odyssey, pp. 195–202.
[10] (2020) Deep Learning Based Assessment of Synthetic Speech Naturalness. In Proc. Interspeech, pp. 1748–1752.
[11] (2012) Towards Perceptual Quality Modeling of Synthesized Audiobooks - Blizzard Challenge 2012. In Blizzard Challenge Workshop.
[12] (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proc. CVPR, pp. 4510–4520.
[13] (2016) The Voice Conversion Challenge 2016. In Proc. Interspeech, pp. 1632–1636.
[14] (2021) Utilizing Self-supervised Representations for MOS Prediction. In Proc. Interspeech, pp. 2781–2785.
[15] (2018) ESPnet: End-to-End Speech Processing Toolkit. In Proc. Interspeech, pp. 2207–2211.
[16] (2021) MetricNet: Towards Improved Modeling for Non-Intrusive Speech Quality Assessment. In Proc. Interspeech, pp. 2142–2146.
[17] (2020) Voice Conversion Challenge 2020: Intra-lingual Semi-parallel and Cross-lingual Voice Conversion. In Proc. Joint Workshop for the BC and VCC 2020, pp. 80–98.
[18] (2020) The Blizzard Challenge 2020. In Proc. Joint Workshop for the BC and VCC 2020, pp. 1–18.