The speech generated in voice conversion (VC) tasks remains challenging to evaluate effectively. In most studies, both objective and subjective evaluation results are reported to compare the performance of VC systems. For objective evaluation [Das2020], measurements borrowed from the speaker recognition task are usually used. For subjective evaluation, a listening test is usually conducted. Compared with objective evaluation, subjective evaluation incurs more time and cost. Moreover, to reach unbiased results, a large number of subjective tests must be carried out [wester2015we]. However, since the target users of VC are humans, the subjective evaluation results are more important than their objective counterparts. In our previous study [lo2019mosnet], we proposed MOSNet, which can predict the mean opinion score (MOS) of human subjective ratings of speech quality and naturalness. MOSNet is formed by a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) architecture. The results of the large-scale human evaluation of Voice Conversion Challenge 2018 (VCC2018) demonstrated that MOSNet achieves a high correlation with human MOS ratings at the system level and a fair correlation at the utterance level.
In our previous study [lo2019mosnet], we also slightly modified MOSNet to predict similarity scores. The preliminary results showed that the predicted similarity scores were fairly correlated with human similarity ratings. In this work, to further improve similarity prediction, we propose a novel assessment model called SVSNet, which has two features: (1) To characterize speech signals more accurately, SVSNet directly takes the speech waveform as input. (2) SVSNet adopts a co-attention mechanism to deal with length mismatch and switching of paired utterances. For (1), although hand-crafted features are widely used in many speech processing tasks, such as speaker verification (SV) [snyder2018x, variani2014deep], VC [hsu2017voice, hsu2016voice], speech synthesis [li2019neural, shen2018natural], and speech enhancement [pandey2019new], we believe that the raw waveform contains the most complete information for similarity prediction for two reasons. First, most hand-crafted features ignore the phase information, although many studies have shown that phase can provide useful information [loweimi2015source, loweimi2017robust, loweimi2018robust]. Second, computing hand-crafted features requires prior knowledge about feature extraction specifications, such as window size, shift length, and feature dimension. Improper specifications can lead to ineffective features, which can result in poor prediction performance. For (2), since our goal is to predict the similarity score for a pair of utterances, we need to handle the asymmetry issue, which may arise in two situations. First, the two utterances may have different lengths. Second, when the order of the paired utterances is switched, SVSNet should output the same prediction score. To solve the asymmetry problem, we design a special co-attention mechanism. Our experimental results on the VCC2018 [lorenzo2018voice] and VCC2020 [vcc2020] datasets show that SVSNet can predict the similarity score of a VC system quite accurately. To the best of our knowledge, this is the first deep learning-based model for similarity assessment in VC tasks.
II Related Works
II-A Neural Evaluation Metrics
Conventional evaluation metrics are generally derived on the basis of signal processing and human auditory theories. For example, perceptual evaluation of speech quality (PESQ) [rix2001perceptual] and short-time objective intelligibility (STOI) [taal2010short] are commonly used to evaluate the quality and intelligibility of processed speech. The normalized covariance measure (NCM) [ma2009objective] and its extensions [chen2010analysis, chen2012contributions] have been shown to be effective in measuring the intelligibility of normal speech and vocoded speech. In addition, some parametric distances are often used to measure the difference between paired voices, such as speech distortion index (SDI) [sdi], mel-cepstral distance (MCD) [mcd], cepstrum distance (Cep) [cep], segmental signal-to-noise ratio (SSNR) improvement [ssnr], and scale-invariant source-to-noise ratio (SI-SNR) [luo2019conv]. Several studies have indicated that these objective evaluation metrics may not truly reflect human perception [mcd]. Therefore, subjective listening evaluations are usually reported in speech generation studies. Unbiased subjective results, however, require a large number of tests, covering a wide range of listeners (gender, age, and hearing ability) and test samples, which makes listening tests challenging in terms of time and cost.
To address the above issues, several neural evaluation metrics have been proposed. For speech enhancement tasks, Quality-Net [qualitynet], DNSMOS [dnsmos], and STOI-Net [stoinet] were proposed as non-intrusive tools for measuring speech quality and intelligibility. For VC, MOSNet [lo2019mosnet] and MBNet [leng2021mbnet] were proposed to measure the naturalness of converted speech. Mittag and Möller [mittag2020deep] proposed a model for the text-to-speech synthesis task on the TU Berlin / Kiel University database. To the best of our knowledge, no one has previously established neural evaluation metrics for the similarity assessment of VC tasks, which is the main focus of this study.
II-B Similarity Prediction
The similarity prediction task resembles an SV task, which aims to determine whether the input speech was spoken by a claimed speaker. In most SV systems, the test utterance and the enrollment utterance are first converted into embedding vectors by a neural network (NN) model, and then a similarity score between the two embedding vectors is calculated with a distance measurement function, such as cosine distance, or with another NN model [snyder2018x, zeng2021attention, zhang2019seq2seq]. The major difference between the two tasks is that the SV task uses two class labels (same speaker or different speakers), while SVSNet uses multi-class labels based on human judgements.
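The embedding-based SV scoring described above can be sketched as follows. This is a minimal illustration rather than the scoring used by any cited system: the function names and the decision threshold of 0.5 are hypothetical, and the embeddings are assumed to be plain lists of floats already produced by some NN model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enroll_emb, test_emb, threshold=0.5):
    """Two-class SV decision (same speaker or not).

    The threshold value is a hypothetical placeholder; real systems tune it
    on held-out data.
    """
    return cosine_similarity(enroll_emb, test_emb) >= threshold
```

In contrast to this binary decision, the similarity prediction task targets graded human ratings, so a single threshold on an embedding distance is not sufficient.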
II-C Waveform Modeling
Recently, several approaches have been proposed to incorporate waveform modeling into speech processing tasks, such as speech recognition [parcollet2020e2e], speech enhancement [pascual2017segan, shifas2020non], speech separation [zeghidour2020wavesplit, luo2019conv], speech vocoding [yamamoto2020parallel, wavenet, kumar2019melgan], and SV [jung2019rawnet, jung2018complete]. The main idea of these waveform modeling methods is that traditional hand-crafted feature extraction techniques can be replaced by NN models learned in a data-driven manner. To model speech waveforms effectively, a dilated architecture has been proposed to increase the receptive field with the same number of model parameters [wavenet]. Meanwhile, SincNet [ravanelli2018speaker] processes the raw waveform with a set of parameterizable band-pass filters, where only the low and high cutoff frequencies of the band-pass filters are learned. Learning data-dependent and task-dependent filters provides greater flexibility than fixed feature processing procedures. The effectiveness of SincNet has been demonstrated in several studies [ravanelli2018speaker, liu2020multichannel, ravanelli2019pytorch].
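To make the idea of a band-pass filter defined only by its two cutoff frequencies concrete, the following sketch builds one such filter as the difference of two windowed sinc low-pass filters. It is an illustration under stated assumptions, not the SincNet implementation: the kernel size of 101, the Hamming window, the 16,000 Hz sample rate default, and all function names are chosen here for the example, and the cutoffs are fixed numbers rather than learned parameters.

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=101, sample_rate=16000):
    """Band-pass FIR taps determined only by the low and high cutoffs.

    kernel_size and the Hamming window are assumptions of this sketch.
    """
    f1 = f_low_hz / sample_rate   # normalized cutoffs in cycles/sample
    f2 = f_high_hz / sample_rate
    half = (kernel_size - 1) // 2
    taps = []
    for i in range(kernel_size):
        n = i - half
        # Difference of two ideal low-pass responses gives a band-pass.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps

def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)
```

For example, `sinc_bandpass(300, 3400)` passes a 1,850 Hz tone almost unchanged and strongly attenuates a 6,000 Hz tone; in a learnable front end, only the two cutoff values per filter would be updated during training.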
III Proposed SVSNet
Figure 1(a) shows the SVSNet architecture. The encoder (E) module, which is shared by the two inputs, encodes the waveforms of the test and reference utterances into frame-wise representations. Unlike the attention module in [zhang2019seq2seq], which only aligns the test utterance with the enrollment utterance in one direction, the “Co-attention” module aligns the two representations in both directions to maintain symmetry. Then, two distances, one for each alignment direction and each measured between the representation of one utterance and the aligned representation of the other, are computed by the “Distance” module and used by the “Prediction” module to calculate the final similarity score. We study two types of prediction modules: regression-based and classification-based. Their outputs are a continuous score and a score-level category, respectively.
Fig. 1: (a) The architecture of SVSNet. E, CAT, Dis, and Pred blocks denote the Encoder, Co-attention, Distance, and Prediction modules, respectively. (b) The Encoder module. (c) The rSWC module. DConv denotes a dilated convolutional layer, and RConv denotes a normal convolutional layer followed by a ReLU layer.
Figure 1(b) shows the architecture of the encoder in SVSNet. First, the input waveform is processed by SincNet, which contains learnable band-pass filters, to decompose the input signal into subband signals. The subband signals are then processed by four stacked residual-skipped-WaveNet convolution (rSWC) layers and a BLSTM layer. Fig. 1(c) shows the rSWC layer. The core of each rSWC layer consists of convolutional layers with dilation sizes of 1, 2, 4, 8, 16, 32, and 64, followed by a gated tanh unit (GTU) [gtu]. In addition, a max-pooling layer with a stride of 3 downsamples the feature sequence. As shown in Fig. 1(a), given the test utterance and the reference utterance, the encoder outputs their respective frame-wise representations.
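Two ingredients of the encoder can be illustrated numerically: how stacked dilated convolutions enlarge the receptive field, and how a gated tanh unit combines a filter branch with a gate branch. The sketch below is an assumption-laden illustration, not the rSWC layer itself: the kernel size of 2 is not stated in the text and is chosen here only for the example, and the function names are hypothetical.

```python
import math

def receptive_field(dilations, kernel_size=2):
    """Receptive field (in input frames) of stacked dilated 1-D convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d frames.
    kernel_size=2 is an assumption of this sketch.
    """
    return 1 + (kernel_size - 1) * sum(dilations)

def gated_tanh_unit(filter_out, gate_out):
    """Element-wise GTU: tanh(filter branch) * sigmoid(gate branch)."""
    return [math.tanh(f) * (1.0 / (1.0 + math.exp(-g)))
            for f, g in zip(filter_out, gate_out)]

def downsampled_length(num_frames, stride=3):
    """Output length of non-overlapping max pooling with the given stride."""
    return num_frames // stride
```

Under the assumed kernel size, the dilation sizes 1, 2, 4, 8, 16, 32, and 64 give `receptive_field([1, 2, 4, 8, 16, 32, 64]) == 128`, that is, an exponentially growing context with a linearly growing number of layers.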
III-B Co-attention Module
A critical requirement of similarity prediction is symmetry. That is, when the input order is switched, the model should predict the same similarity score. To meet this requirement, we design a novel co-attention module that aligns the representation of each input with that of the other input.
We use the scaled dot-product attention mechanism [vaswani2017attention]. With the co-attention module, two pairs of aligned representation sequences are obtained, each pairing the representation of one utterance with the aligned representation of the other. These pairs are then fed to the distance calculation module.
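A two-direction alignment of this kind can be sketched with plain scaled dot-product attention. The code below is a minimal illustration under stated assumptions, not the module used in SVSNet: the names `attention`, `co_attention`, `rep_test`, and `rep_ref` are hypothetical, and no learned projection matrices are included. Representations are lists of frames, each frame a list of floats.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out

def co_attention(rep_test, rep_ref):
    """Align each representation to the other, in both directions.

    Returns two pairs whose members have equal length, even when the two
    inputs have different numbers of frames.
    """
    aligned_ref = attention(rep_test, rep_ref, rep_ref)    # length of test
    aligned_test = attention(rep_ref, rep_test, rep_test)  # length of ref
    return (rep_test, aligned_ref), (rep_ref, aligned_test)
```

Swapping the two inputs only swaps the two returned pairs, so any later step that treats the pairs symmetrically yields an order-independent score. The attention output also has the length of the query, which handles the length mismatch between utterances.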
III-C Distance Calculation and Prediction Modules
We extend the attentive pooling used in SV [Okabe2018] to our work. For each pair of aligned representation sequences, we average the representations of each utterance over time to obtain the utterance embeddings and compute the 1-norm distance in each dimension between the two means.
Then, the two distances are fed to the prediction module to obtain the similarity score.
The prediction module comprises two linear layers, a rectified linear unit (ReLU) activation function, and an output activation function. The number of nodes of the final linear output layer is 1 for the regression model and 2 or 4 for the classification model. The output activation function is an identity function for the regression model and a softmax function for the classification model. Finally, the prediction module obtains the final score from the outputs computed for the two distances.
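The distance and prediction steps can be sketched end to end as follows. Several details are assumptions made only for this illustration and are not recoverable from the text: the ReLU is placed between the two linear layers, the two per-direction outputs are combined by averaging, and all names and the weight layout (`params` as `(w1, b1, w2, b2)`) are hypothetical.

```python
import math

def mean_over_time(seq):
    """Average a sequence of frames into one utterance embedding."""
    n = len(seq)
    return [sum(frame[j] for frame in seq) / n for j in range(len(seq[0]))]

def l1_distance_per_dim(seq_a, seq_b):
    """Per-dimension absolute difference between the two time averages."""
    return [abs(a - b)
            for a, b in zip(mean_over_time(seq_a), mean_over_time(seq_b))]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(distance, params, classification=False):
    """Linear -> ReLU -> linear -> identity (regression) or softmax."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(distance, w1, b1)]
    out = linear(hidden, w2, b2)
    return softmax(out) if classification else out

def final_score(pair_test, pair_ref, params, classification=False):
    """Combine the two directional outputs; averaging is an assumption."""
    s_t = predict(l1_distance_per_dim(*pair_test), params, classification)
    s_r = predict(l1_distance_per_dim(*pair_ref), params, classification)
    return [(a + b) / 2 for a, b in zip(s_t, s_r)]
```

Because the combination treats the two directions identically, `final_score(p1, p2, params)` equals `final_score(p2, p1, params)`, which is the symmetry property the co-attention design aims for.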
III-D Model Training
SVSNet is trained on a set of reference-test utterance pairs with corresponding human-labeled similarity scores. We implemented two versions of SVSNet by using two types of prediction modules: regression and classification. The corresponding SVSNet models are termed SVSNet(R) and SVSNet(C), respectively. Given the ground-truth and predicted similarity scores, the mean squared error (MSE) loss is used to train SVSNet(R), and the cross entropy (CE) loss is used to train SVSNet(C).
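The two training objectives can be written out in a few lines. This is a minimal sketch for illustration: the function names are hypothetical, and mapping a score label in 1..K to class index label - 1 is an assumption of this example.

```python
import math

def mse_loss(predicted, target):
    """Mean squared error over a batch of scalar scores (regression)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def cross_entropy_loss(class_probs, label):
    """Cross entropy for one example (classification).

    class_probs is a softmax output; label is a score in 1..K, mapped to
    index label - 1 (an assumption of this sketch).
    """
    return -math.log(class_probs[label - 1])
```

For instance, `mse_loss([1.0, 2.0], [1.0, 4.0])` is 2.0, and `cross_entropy_loss` approaches 0 as the probability assigned to the labeled class approaches 1.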
IV-A Experimental Setup
Since 2016, the VC challenge (VCC) has been held three times. The task is to modify an audio waveform so that it sounds as if it was from a specific target speaker other than the source speaker. In each challenge, a large-scale crowdsourced human perception evaluation was conducted to test the quality and similarity of the converted utterances. In VCC2018, there were 23 VC systems. A similarity evaluation was conducted on 20,996 converted-natural utterance pairs, and 30,864 speaker similarity assessments were obtained. Each pair was evaluated by 1 to 7 subjects, with a score ranging from 1 (same speaker) to 4 (different speakers). The detailed description of the corpus, listeners and evaluation methods can be found in [lorenzo2018voice]. In this study, the dataset was divided into 24,864 pairs for training and 6,000 pairs for testing.
We used MOSNet [lo2019mosnet] as the baseline. MOSNet was originally proposed for quality assessment, but a modified version was used for similarity assessment. As with SVSNet, the regression and classification variants are termed MOSNet(R) and MOSNet(C), respectively. Performance was evaluated in terms of accuracy (ACC), linear correlation coefficient (LCC) [pearson1920notes], Spearman’s rank correlation coefficient (SRCC) [spearman1961proof], and MSE at both the utterance and system levels. The utterance-level evaluation was calculated from the predicted score and the corresponding ground-truth score for each pair of utterances. The system-level evaluation was calculated on the basis of the average predicted score and the corresponding average ground-truth score for each system. When treating similarity prediction as a classification problem, we considered two designs: 2-class classification and 4-class classification. For 4-class classification, the original labels were used as the ground truth. For 2-class classification, the ground-truth scores 1 and 2 were merged into label 1 (same speaker), and the scores 3 and 4 were merged into label 2 (different speakers). When treating similarity prediction as a regression task and evaluating performance on the basis of ACC, the outputs of SVSNet(R) and MOSNet(R) were rounded to the nearest integer and clipped to the valid label range (i.e., 1 or 2 in the 2-class case).
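The evaluation protocol above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' evaluation script: all function names are hypothetical, ties in SRCC are handled with average ranks, and rounding uses floor(x + 0.5) (an assumption, chosen to avoid Python's round-half-to-even behavior).

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def system_level(system_ids, scores):
    """Average utterance-level scores per system."""
    grouped = defaultdict(list)
    for sid, s in zip(system_ids, scores):
        grouped[sid].append(s)
    return {sid: sum(v) / len(v) for sid, v in grouped.items()}

def merge_to_two_classes(score):
    """Scores 1 and 2 -> label 1 (same); scores 3 and 4 -> label 2."""
    return 1 if score <= 2 else 2

def regression_to_label(value, num_classes):
    """Round to the nearest integer, then clip to 1..num_classes."""
    return min(max(int(math.floor(value + 0.5)), 1), num_classes)
```

System-level LCC, SRCC, and MSE are then obtained by applying the same metric functions to the per-system averages of the predicted and ground-truth scores instead of to the individual utterance pairs.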
Since two different sampling rates (22,050 and 16,000 Hz) were used in the VCC2018 dataset, we downsampled all utterances to 16,000 Hz. For the encoder, the number of output channels of SincNet, the output size of the WaveNet convolutional layers, and the hidden size of the BLSTM were 64, 64, and 256, respectively. The hidden size of the linear layers in the distance module was 128, and the output size was 1, 2, or 4 for the scalar output, 2-class output, and 4-class output, respectively. We used the Adam optimizer to train the model. The learning rate, $\beta_1$, and $\beta_2$ were 1e-4, 0.5, and 0.999, respectively. The batch size was set to 5. The model parameters were initialized with Xavier uniform initialization.
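For reference, the settings listed above can be gathered into a single configuration object. This is only a convenience sketch: the class name, field names, and the helper `resampling_ratio` are hypothetical, the values 0.5 and 0.999 are read here as Adam's first and second moment decay rates, and the text does not state which resampling method was used.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as stated in the text; field names are hypothetical."""
    sample_rate_hz: int = 16000
    sincnet_channels: int = 64
    wavenet_channels: int = 64
    blstm_hidden: int = 256
    linear_hidden: int = 128
    learning_rate: float = 1e-4
    adam_beta1: float = 0.5     # read as Adam's first-moment decay rate
    adam_beta2: float = 0.999   # read as Adam's second-moment decay rate
    batch_size: int = 5
    init: str = "xavier_uniform"

def resampling_ratio(source_hz, target_hz=16000):
    """Reduced up/down factors for rational-rate resampling."""
    ratio = Fraction(target_hz, source_hz)
    return ratio.numerator, ratio.denominator
```

For the 22,050 Hz recordings, `resampling_ratio(22050)` gives `(320, 441)`, that is, the up and down factors a polyphase resampler would need to reach 16,000 Hz.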
IV-B Experimental Results
First, we compare SVSNet with the baseline MOSNet. Tables I and II report the results of 2-class similarity prediction via regression and classification, respectively. From the tables, we can see that SVSNet consistently outperforms MOSNet in all evaluation metrics, but the improvement is relatively small in the utterance-level prediction. We can also note that both SVSNet and MOSNet perform better in the regression mode than in the classification mode.
Next, we study the effect of waveform processing. For a fair comparison, we replaced the SincNet in SVSNet(R) and SVSNet(C) with an ordinary convolutional layer with a kernel size of 1, which has the same number of parameters as SincNet, to obtain convolution-based variants of the two models. We performed 4-class similarity prediction tests in both regression and classification modes. The results are shown in Tables III and IV, respectively. In all metrics, the SincNet-based SVSNet(R) and SVSNet(C) are consistently better than their convolution-based counterparts.
Then, we investigate whether regression or classification is more suitable for similarity prediction. Comparing Table I with Table II shows that both SVSNet and MOSNet perform better in the scalar regression mode than in the classification mode, except for ACC. Comparing Table III with Table IV reveals similar trends. The results indicate that regression is more suitable for the similarity prediction task than classification.
We also compare the results of 2-class and 4-class predictions. From Tables I and III, SVSNet(R) performs better in terms of ACC and MSE under the 2-class condition than under the 4-class condition. This is reasonable, because when the number of label classes is increased from 2 to 4, the prediction difficulty increases accordingly. On the other hand, finer labels (4-class) enable the model to output smoother prediction scores. Thus, SVSNet(R) yields better LCC and SRCC scores under the 4-class condition than under the 2-class condition, as shown in Tables I and III. In VCC2018, each converted-natural utterance pair was manually labeled with a score ranging from 1 (same speaker) to 4 (different speakers). From Tables III and IV, the high LCC (0.966 via regression and 0.941 via classification) and SRCC (0.910 via regression and 0.871 via classification) scores indicate that the ranking of the 23 submitted systems predicted by SVSNet is very close to that of the human evaluation.
Voice Conversion Challenge 2020 (VCC2020), the edition following VCC2018, includes two tasks, namely intra-language VC and cross-language VC. The intra-language task consists of 16 source-target speaker pairs, and the cross-language task consists of 24 source-target speaker pairs. Each pair contained 5 converted utterances, and each converted utterance was evaluated by 12 subjects (for intra-language) or 8 subjects (for cross-language). It is worth noting that in the listening test, given a converted utterance, the reference utterance was always the same. Therefore, different evaluation scores might be given to the same test-reference listening pair. In our experiments, we ignored this and simply used all pairs and their corresponding score labels to train the model, which increases the diversity of the training data and thereby enhances model robustness. In the evaluation phase, we used the average score of each pair to calculate the results. There were 31 submitted systems for the intra-language task and 28 submitted systems for the cross-language task. To investigate the effects of corpus mismatch, we adopted the VCC2018 training dataset used in [Das2020] as the training data and the full VCC2020 dataset as the test data. Note that most systems used conventional vocoders in VCC2018 but neural vocoders in VCC2020. Thus, the corpus mismatch is quite significant. The prediction results are shown in Table V. From the table, the scores of both SVSNet and MOSNet are lower than those reported earlier due to corpus mismatch, while SVSNet still outperforms MOSNet. Following Das et al. [Das2020], we tested the performance of another prediction model formed by a cosine similarity measure based on x-vectors reduced to 128 dimensions by linear discriminant analysis (LDA). The results show that, with an extra and massive dataset for pretraining, the x-vector system outperforms both SVSNet and MOSNet. Finally, we constructed a fusion system that concatenates the 1-norm distance between the two x-vectors to the right-hand side of Eq. 2. From Table V, the fusion model yields further improvements over the SVSNet and x-vector systems.
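The fusion step can be pictured as a simple feature concatenation before the prediction layers. The sketch below is an assumption-based illustration, not the authors' fusion system: it takes the 1-norm distance between the two x-vectors per dimension (mirroring the per-dimension distance used for the learned representations), and the names `fuse` and `learned_distance` are hypothetical.

```python
def xvector_l1_distance(xvec_a, xvec_b):
    """Per-dimension absolute difference between two x-vectors (assumed form)."""
    return [abs(a - b) for a, b in zip(xvec_a, xvec_b)]

def fuse(learned_distance, xvec_a, xvec_b):
    """Concatenate the learned distance with the x-vector distance."""
    return list(learned_distance) + xvector_l1_distance(xvec_a, xvec_b)
```

With this layout, the first linear layer of the prediction module would need an input size equal to the sum of the two feature lengths; for example, `fuse([0.5], [1.0, 2.0], [0.0, 4.0])` returns `[0.5, 1.0, 2.0]`.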
Finally, Fig. 2 shows the system-level 4-class prediction results of SVSNet(R) and SVSNet(C) for VCC2018 and VCC2020. From the figure, we can see that SVSNet achieves good prediction performance on both VCC2018 and VCC2020.
In this paper, we have proposed SVSNet, an end-to-end neural similarity assessment model. Experiments on the large-scale human perception evaluation results of VCC2018 and VCC2020 show that SVSNet, benefiting from SincNet and the residual-skipped-WaveNet architecture, performs better than the previous model MOSNet in terms of linear correlation coefficient (LCC), Spearman’s rank correlation coefficient (SRCC), and mean squared error (MSE). We also found that directly using the waveform as input, without discarding the phase information, improves the prediction ability of our model. In the future, we plan to draw on the theory of human perception to design a perception-based objective function and build a more robust mean opinion score (MOS) and similarity prediction model.