## 1 Introduction

We are observing an ever-increasing use of automatic speaker verification (ASV) systems in our everyday lives. An essential step for verification is to disentangle the speaker information from each spoken utterance; decisions are then made based on speaker similarity. Through decades of development, three representative frameworks have emerged in this research area. (i) Extending from joint factor analysis [kenny2007joint, kenny2007speaker], which models speaker and channel subspaces separately, i-vector based speaker embedding jointly models speaker and channel variations, paired with a speaker-discriminative back-end for decisions. Such systems include Gaussian mixture model (GMM) i-vector [dehak2010front, kenny2012small, prince2007probabilistic, garcia2011analysis] and deep neural network (DNN) i-vector systems [lei2014novel]. (ii) Benefiting from the powerful discrimination ability of DNNs, DNN-based speaker embedding extracts speaker-discriminative representations for each utterance, and achieves state-of-the-art performance on short-utterance evaluation conditions. These systems include d-vectors [variani2014deep] and x-vectors [snyder2017deep, snyder2018x]. (iii) With the development of end-to-end techniques, much research has focused on constructing ASV systems in an end-to-end manner [zhang2016end, heigold2016end, snyder2016deep], which directly learns a mapping from enrollment and testing utterance pairs to verification scores, resulting in a compact structure and comparably good performance.

A challenging issue for ASV system development is the mismatch between the training and evaluation data, such as speaker population mismatch and variations in channel and environmental background. The speaker populations used for training and evaluation commonly have no overlap, especially in practical applications; overcoming this mismatch requires the extracted speaker representations to generalize well on unseen speaker data. Channel and environment variations mostly arise in practical applications where the training and evaluation data are collected from different types of recorders and environments. These mismatches likewise place high demands on the model's generalizability to unseen data.

To address this issue, previous efforts [wang2018unsupervised, bhattacharya2019adapting, tu2019variational] have applied adversarial training to remove channel and environment variations from utterance embeddings. This is achieved by adding an adversarial penalty on domain-related information in the embedding vectors during extractor training. This approach has proven effective in alleviating the effects of channel and environmental mismatches. However, it does not consider the speaker population mismatch, which can also degrade system performance. In this work, we aim to improve the system's generalizability across these kinds of mismatches in a unified manner. Inspired by previous work based on Bayesian learning [xie2019blhuc, hu2019bayesian, yu2019comparative], we focus on the DNN x-vector system and apply Bayesian neural networks (BNNs) to improve its generalization ability.

The Bayesian learning approach has been shown to be effective in improving the generalization ability of discriminatively trained DNN systems. In the machine learning community, similar work has been conducted to incorporate Bayesian learning into DNN systems. Barber et al. [barber1998ensemble] proposed an efficient variational inference strategy for BNNs. Blundell et al. [blundell2015weight] proposed a novel backpropagation-compatible algorithm for learning the posterior distribution of network parameters. In the speech area, previous work has applied BNNs to speech recognition [graves2011practical, hu2019bayesian, xie2019blhuc, hu2019lf]. In particular, Xie et al. [xie2019blhuc] proposed Bayesian learning of hidden unit contributions (BLHUC) for speaker adaptation; BLHUC models the uncertainty of speaker-dependent parameters and improves speech recognition performance, especially when given very limited speaker adaptation data. Other work has applied Bayesian techniques to language modelling [chien2015bayesian, yu2019comparative].

In a DNN x-vector system, the parameters of traditional time delay neural network (TDNN) layers estimated via the maximum likelihood strategy are deterministic and tend to overfit when training data is limited or when there is a large mismatch between the training and evaluation data. In the case of speaker population mismatch, the overfitted model parameters may produce speaker representations that follow a spiked distribution towards the training speaker identities, which tends not to generalize well on unseen speaker data. To address this issue, BNNs can help smooth the distributions of speaker representations for better generalization on unseen speaker data.

The cases of channel and environmental mismatch are similar. For instance, under channel mismatch, the overfitted model parameters may partially rely on channel information to classify speakers, since different speakers in the training data were recorded with different devices. When generalizing to channel-mismatched evaluation data, the original channel-speaker relationship is broken, and the learned reliance on channel information could lead to misclassification. To alleviate this issue, BNNs make deterministic parameters probabilistic via a posterior distribution. This parameter distribution modeling can reduce the risk of overfitting to channel information by smoothing the parameters to cover additional possible values that do not rely on channel information for speaker classification.

The above issues motivate this work to incorporate BNNs into the x-vector system by replacing the TDNN layers. We adopt an efficient variational inference based approach to approximate the parameter posterior distribution. The effectiveness of Bayesian learning is investigated on both short- and long-utterance in-domain evaluations, as well as on an out-of-domain evaluation that involves larger channel and environment mismatches. To the best of our knowledge, this is the first work that applies the Bayesian learning technique to speaker verification systems.

Our experiments are based on Voxceleb1 (for short-utterance condition) and NIST Speaker Recognition Evaluation (SRE) 10 (for long-utterance condition) datasets.

## 2 Baseline: DNN x-vector system

A DNN x-vector system [snyder2018x] consists of two parts: a front-end used for extracting utterance-level speaker embeddings and a verification scoring back-end. The front-end compresses speech utterances of different lengths into fixed-dimension speaker-related embeddings (x-vectors). Based on these embeddings, different scoring schemes can be used for judging whether two utterances belong to the same person or not. In this work, we focus on improving the front-end, and choose different back-ends for the performance evaluation.

The x-vector extractor is a neural network trained via a speaker discrimination task; its architecture is shown in Fig. 1. It consists of frame-level and utterance-level extractors. At the frame level, several time delay neural network (TDNN) layers model the time-invariant characteristics of acoustic features. The statistics pooling layer then aggregates all the frame-level outputs from the last TDNN layer and computes their mean and standard deviation. The mean and standard deviation are concatenated and propagated through several fully connected utterance-level layers, i.e. embedding layers, and finally the softmax output layer. The cross-entropy between one-hot speaker labels and the softmax outputs is used as the loss function during the training stage. In the testing stage, given the acoustic features of an utterance, the embedding layer output is extracted as the x-vector. Since the network is trained in a speaker-discriminative manner, the extracted x-vectors are expected to contain only speaker-related information. In practice, however, as investigated in [raj2019probing], x-vectors still contain other speaker-unrelated information, such as channel, transcription and utterance-length information. This information can affect verification performance, especially on mismatched evaluation data.

## 3 Bayesian neural network

### 3.1 Weight uncertainty

Traditional neural networks learn a set of deterministic parameters to fit with the training data via the maximum likelihood estimation, and then make inference based on these fixed parameters in the testing stage. This estimation may lead to overconfident parameters when the training data is limited or when there exists a mismatch between the training and evaluation data.

To alleviate this issue, Bayesian neural networks learn the parameters' posterior distribution instead. The posterior distribution based on the training data models weight uncertainty and theoretically enables an infinite number of possible model parameters to fit the data. This weight uncertainty modeling can smooth model parameters and help the model generalize well on unseen data. During the testing stage, the model computes the output $\hat{y}$ given the input $\hat{x}$ by taking an expectation over the weight posterior distribution $p(w \mid \mathcal{D})$, as shown in Eq. 1:

$$p(\hat{y} \mid \hat{x}, \mathcal{D}) = \mathbb{E}_{p(w \mid \mathcal{D})}\big[\, p(\hat{y} \mid \hat{x}, w) \,\big] = \int p(\hat{y} \mid \hat{x}, w)\, p(w \mid \mathcal{D})\, dw \tag{1}$$

The estimation of the weight posterior distribution is an essential procedure when training a Bayesian neural network. However, direct estimation is intractable for neural networks of any practical size, since the space of possible weight values is continuous and extremely high-dimensional. The variational approximation [barber1998ensemble] is therefore commonly adopted to estimate the weight posterior distribution.
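As a toy illustration of the expectation in Eq. 1, the sketch below approximates the predictive output by Monte Carlo averaging over weight samples. The one-weight logistic model and its Gaussian posterior are illustrative assumptions, not part of the actual system:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predictive(x, w_mean, w_std, num_samples=20000, seed=0):
    """Approximate p(y=1 | x) = E_{w ~ posterior}[sigmoid(w * x)]
    for a toy one-weight logistic model whose weight posterior is
    assumed (for illustration) to be N(w_mean, w_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(w_mean, w_std)   # sample a weight from the posterior
        total += sigmoid(w * x)        # prediction under this weight sample
    return total / num_samples         # Monte Carlo average of predictions

# A deterministic point estimate w = 2.0 is very confident; averaging
# over an uncertain posterior pulls the prediction toward 0.5.
point = sigmoid(2.0 * 3.0)
bayes = predictive(3.0, w_mean=2.0, w_std=1.5)
```

This is the sense in which weight uncertainty smooths predictions: the Bayesian average is less overconfident than any single parameter setting.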

### 3.2 Variational approximation for Bayesian learning

The variational approximation estimates a set of parameters $\theta$ for a distribution $q_\theta(w)$ to approximate the posterior distribution $p(w \mid \mathcal{D})$. This is achieved by minimizing the Kullback-Leibler (KL) divergence between these two distributions, as shown in Eq. 2:

$$
\begin{aligned}
\theta^{*} &= \arg\min_{\theta}\; \mathrm{KL}\big(q_\theta(w)\,\|\,p(w \mid \mathcal{D})\big) && (2)\\
&= \arg\min_{\theta} \int q_\theta(w)\,\log\frac{q_\theta(w)\,p(\mathcal{D})}{p(\mathcal{D} \mid w)\,p(w)}\,dw && (3)\\
&= \arg\min_{\theta} \int q_\theta(w)\,\log\frac{q_\theta(w)}{p(\mathcal{D} \mid w)\,p(w)}\,dw && (4)\\
&= \arg\min_{\theta} \left[ \int q_\theta(w)\,\log\frac{q_\theta(w)}{p(w)}\,dw - \int q_\theta(w)\,\log p(\mathcal{D} \mid w)\,dw \right] && (5)\\
&= \arg\min_{\theta} \left[ \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big) - \mathbb{E}_{q_\theta(w)}\big[\log p(\mathcal{D} \mid w)\big] \right] && (6)
\end{aligned}
$$

From Eq. 2 to 4, we apply Bayes' rule and drop the constant term $\log p(\mathcal{D})$, which does not affect the minimization over $\theta$. Equations 4 to 6 show that this minimization objective decomposes into two parts: 1) the KL divergence between the approximation distribution $q_\theta(w)$ and the prior distribution $p(w)$ on the weights, and 2) the expectation of the log likelihood of the training data over the approximation distribution $q_\theta(w)$. Eq. 6 is used as the loss function to be minimized during the training process.

As commonly adopted in [barber1998ensemble, blundell2015weight], we assume that the variational approximation $q_\theta(w)$ follows a diagonal Gaussian distribution with a parameter set $\theta = \{\mu, \rho\}$. $\mu$ is the mean of the diagonal Gaussian distribution, while $\rho$ generates the diagonal Gaussian standard deviation by $\sigma = \log(1 + \exp(\rho))$. The prior distribution $p(w)$ is also assumed to be a diagonal Gaussian distribution, with a parameter set $\{\mu^{p}, \sigma^{p}\}$. Unlike $\theta$, which is updated during the training stage, $\{\mu^{p}, \sigma^{p}\}$ is a set of predetermined fixed parameters. Under the Gaussian distribution assumptions, the first part in Eq. 6 has a closed-form result that can be computed directly:

$$\mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big) = \sum_{i=1}^{N} \left[\, \log\frac{\sigma_i^{p}}{\sigma_i} + \frac{\sigma_i^{2} + (\mu_i - \mu_i^{p})^{2}}{2\,(\sigma_i^{p})^{2}} - \frac{1}{2} \,\right] \tag{7}$$

where $N$ denotes the number of entries in the weight matrix, and $\mu_i$, $\sigma_i$, $\mu_i^{p}$ and $\sigma_i^{p}$ are the $i$-th entries of $\mu$, $\sigma$, $\mu^{p}$ and $\sigma^{p}$, respectively.
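The closed-form KL term between two diagonal Gaussians can be implemented directly; a minimal sketch, assuming the variational and prior parameters are given as per-entry lists:

```python
import math

def kl_diag_gauss(mu, sigma, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over all
    weight entries, following the standard closed-form expression."""
    kl = 0.0
    for m, s, mp, sp in zip(mu, sigma, mu_p, sigma_p):
        kl += math.log(sp / s) + (s ** 2 + (m - mp) ** 2) / (2 * sp ** 2) - 0.5
    return kl

# Identical distributions give zero divergence ...
same = kl_diag_gauss([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])
# ... and moving the variational mean away from the prior increases it.
shifted = kl_diag_gauss([1.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])
```

During training this term acts as a regularizer, pulling the variational parameters toward the fixed prior.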

While the integration in the second part cannot be computed directly, Monte Carlo sampling is commonly applied to approximate it, as shown in Eq. 8:

$$\mathbb{E}_{q_\theta(w)}\big[\log p(\mathcal{D} \mid w)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p\big(\mathcal{D} \mid w^{(s)}\big), \qquad w^{(s)} = \mu + \sigma \odot \epsilon^{(s)}, \quad \epsilon^{(s)} \sim \mathcal{N}(0, I) \tag{8}$$

where $S$ is the number of samples, $w^{(s)}$ is the $s$-th sample from the distribution $q_\theta(w)$, and $\odot$ denotes element-wise multiplication. As the equation shows, $w^{(s)}$ is sampled by scaling and shifting a random signal $\epsilon^{(s)}$ drawn from the unit Gaussian distribution.
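The sampling step of Eq. 8 can be sketched as follows; the two-entry weight vector and its parameter values are illustrative, and a real layer would vectorize this loop:

```python
import math
import random

def softplus(rho):
    # sigma = log(1 + exp(rho)) keeps every standard deviation positive.
    return math.log(1.0 + math.exp(rho))

def sample_weights(mu, rho, rng):
    """One reparameterized draw w = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]

rng = random.Random(0)
mu, rho = [0.5, -1.0], [0.0, 0.0]   # sigma = softplus(0) = log 2 per entry
draws = [sample_weights(mu, rho, rng) for _ in range(20000)]

# The randomness lives entirely in eps, so the empirical mean of the
# samples recovers mu while mu and rho stay differentiable.
mean0 = sum(d[0] for d in draws) / len(draws)
```

This scale-and-shift form is exactly what makes the expectation trainable by backpropagation.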

Finally, the loss function is derived as:

$$\mathcal{L}(\theta) = \mathrm{KL}\big(q_\theta(w) \,\|\, p(w)\big) - \frac{1}{S} \sum_{s=1}^{S} \log p\big(\mathcal{D} \mid w^{(s)}\big) \tag{9}$$

The gradients with respect to the parameters $\theta = \{\mu, \rho\}$ can be derived through the reparameterization $w = \mu + \sigma \odot \epsilon$ as:

$$\frac{\partial \mathcal{L}}{\partial \mu_i} = g_{w_i} \tag{10}$$

$$\frac{\partial \mathcal{L}}{\partial \rho_i} = g_{w_i} \cdot \frac{\epsilon_i}{1 + \exp(-\rho_i)} \tag{11}$$

where $g_{w_i}$ is the standard gradient of the loss function with respect to $w_i$ (the $i$-th entry of the weight matrix).
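The chain-rule factor in Eq. 11 follows from differentiating w_i = mu_i + log(1 + exp(rho_i)) * eps_i with respect to rho_i. The finite-difference sketch below checks this factor numerically on illustrative values:

```python
import math

def w_of(mu, rho, eps):
    # Reparameterized weight entry: w = mu + log(1 + exp(rho)) * eps.
    return mu + math.log(1.0 + math.exp(rho)) * eps

def dw_drho(rho, eps):
    # Analytic chain-rule factor from Eq. 11: eps / (1 + exp(-rho)).
    return eps / (1.0 + math.exp(-rho))

# Central finite difference agrees with the analytic factor, so any
# upstream gradient g_w propagates to rho exactly as Eq. 11 states.
mu, rho, eps, h = 0.3, -0.7, 1.2, 1e-6
numeric = (w_of(mu, rho + h, eps) - w_of(mu, rho - h, eps)) / (2 * h)
analytic = dw_drho(rho, eps)
```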

**Table 1:** The architecture of the x-vector extractor.

| Layer | Layer Context | Input × Output |
|---|---|---|
| frame1 | [t−2, t+2] | 120 × 512 |
| frame2 | {t−2, t, t+2} | 1536 × 512 |
| frame3 | {t−3, t, t+3} | 1536 × 512 |
| frame4 | {t} | 512 × 512 |
| frame5 | {t} | 512 × 1500 |
| stats pooling | [0, T) | 1500 × 3000 |
| segment6 | {0} | 3000 × 512 |
| segment7 | {0} | 512 × 512 |
| softmax | {0} | 512 × N |

$N$ in the softmax layer is the size of the speaker population during training. The $t$ in the frame-level layers represents the current frame, and $T$ represents the total number of frames in an utterance. x-vectors are extracted from segment6, before the nonlinearity.

**Table 2:** Results of the in-domain evaluation.

| Training set | Evaluation set | System | Scoring back-end | x-vector extractor | EER (%) | DCF₀.₀₁ / DCF₀.₀₀₁ |
|---|---|---|---|---|---|---|
| Voxceleb1 | Voxceleb1 | (1) | cosine | baseline | 9.58 | 0.6899 |
| | | (2) | | proposed | 9.30 | 0.6508 |
| | | (3) | | fusion | 8.64 | 0.6423 |
| | | (4) | PLDA | baseline | 6.68 | 0.6023 |
| | | (5) | | proposed | 6.52 | 0.5423 |
| | | (6) | | fusion | 6.35 | 0.5487 |
| NIST SRE10 | NIST SRE10 | (7) | cosine | baseline | 5.61 | 0.6830 |
| | | (8) | | proposed | 5.52 | 0.6555 |
| | | (9) | | fusion | 5.47 | 0.6502 |
| | | (10) | PLDA | baseline | 3.29 | 0.3926 |
| | | (11) | | proposed | 3.19 | 0.3835 |
| | | (12) | | fusion | 3.17 | 0.3840 |

## 4 Experimental setup

In order to evaluate the effectiveness of Bayesian learning for speaker verification in both short- and long-utterance conditions, we perform experiments on two datasets. For the short-utterance condition, we consider the Voxceleb1 [nagrani2017voxceleb] dataset, where the recordings are short clips of human speech. In total there are 148,642 utterances from 1,251 celebrities. We follow the configuration in [nagrani2017voxceleb], where 4,874 utterances from 40 speakers are reserved for evaluation, and the remaining utterances are used for training the x-vector systems (the baseline and BNN-based system) and the back-end model parameters.

For the long-utterance condition, the core test of the NIST speaker recognition evaluation 10 (SRE10) [martin2010nist] is used for evaluation, where the recordings are long-duration telephone speech. The training data used in this condition includes SREs 04 through 08, Switchboard2 Phases 1, 2 and 3, and Switchboard Cellular. The Switchboard portion is commonly used to increase the training data variety [snyder2017deep, snyder2018x]. In total there are around 65,000 recordings from 6,500 speakers. All of this training data is used for training the x-vector systems (the baseline and BNN-based system), while only the SRE parts are used for training the scoring back-ends. During the training stage, since the utterances are very long, GPU memory limitations force a tradeoff between minibatch size and maximum training example length. We randomly cut each recording into chunks of length from 2 s to 10 s (200 to 1000 frames) and use a minibatch size of 48. After this procedure, the average number of training chunks per speaker is around 220. No data augmentation is applied in any experiment.
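The chunking procedure above can be sketched as follows. The chunk bounds (200 to 1000 frames) come from the text, while the helper itself and its policy of dropping a too-short trailing remainder are illustrative assumptions:

```python
import random

def cut_into_chunks(num_frames, min_len=200, max_len=1000, seed=0):
    """Randomly cut a recording (given as a frame count) into contiguous
    chunks of min_len to max_len frames; a trailing remainder shorter
    than min_len is dropped."""
    rng = random.Random(seed)
    chunks, start = [], 0
    while num_frames - start >= min_len:
        # Clip the drawn length so the chunk never runs past the end.
        length = min(rng.randint(min_len, max_len), num_frames - start)
        chunks.append((start, start + length))
        start += length
    return chunks

# A ~4-minute recording at 100 frames/second (10 ms frame step).
chunks = cut_into_chunks(24000)
```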

Mel-frequency cepstral coefficients (MFCCs) are adopted as acoustic features in all the experiments. Before extracting MFCCs, pre-emphasis with a coefficient of 0.97 is applied. A Hamming window with a size of 25 ms and a step size of 10 ms is used for framing, and 30 cepstral coefficients are kept. The extracted MFCCs are mean-normalized over a sliding window of up to 3 seconds, and voice activity detection (VAD) [snyder2017deep] filters out nonspeech frames.

The configuration of the baseline x-vector extractor is consistent with [snyder2018x], as shown in Table 1. After propagating through several frame-level layers and one statistics pooling layer, the outputs from the first segment-level layer (segment6 in the table) before the nonlinearity are extracted as x-vectors. We adopt the stochastic gradient descent (SGD) optimizer during the training stage. To make a fair comparison, the Bayesian x-vector system is configured with the same architecture as the baseline system, except that the first TDNN layer is replaced by a BNN layer with the same number of units. We also attempted to replace other layers with a BNN layer, but experimental results show that operating on other layers gives slightly worse performance than operating on the first layer. This may indicate that operating on the layer closest to the input features is more effective and has more impact on improving the model's generalization ability. The choice of prior distribution has an impact on model convergence and training efficiency. In this work, we set the prior distribution based on the baseline model parameters, similar to the strategy in [hu2019bayesian]. The x-vector systems (the baseline and BNN-based system) are implemented in PyTorch [NEURIPS2019_9015], while the other parts, including data preparation, feature extraction and training the scoring back-ends, are implemented with the Kaldi toolkit [Povey_ASRU2011].

To evaluate the generalization benefits that Bayesian learning brings under different degrees of mismatch, we design two kinds of evaluation experiments: in-domain evaluation and out-of-domain evaluation. The training and testing stages use the same dataset in the in-domain evaluation, and different datasets in the out-of-domain evaluation. For both evaluations, we perform experiments on two datasets, i.e. Voxceleb1 for the short-utterance condition and NIST SRE10 for the long-utterance condition. Two kinds of scoring back-ends are adopted in our experiments: cosine and probabilistic linear discriminant analysis (PLDA) back-ends. Before back-end scoring, speaker embeddings are projected into a lower-dimensional space via linear discriminant analysis (LDA). Following the default settings of the Kaldi toolkit [Povey_ASRU2011], the LDA dimension is set to 150 and 200 for cosine and PLDA scoring, respectively.
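The cosine back-end reduces to a similarity score between two (LDA-projected) embeddings; a minimal sketch with toy vectors standing in for real x-vectors:

```python
import math

def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings; a higher score
    suggests the two utterances are more likely from the same speaker."""
    dot = sum(a * b for a, b in zip(emb1, emb2))
    norm1 = math.sqrt(sum(a * a for a in emb1))
    norm2 = math.sqrt(sum(b * b for b in emb2))
    return dot / (norm1 * norm2)

# Parallel embeddings score 1.0; orthogonal embeddings score 0.0.
same = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
diff = cosine_score([1.0, 0.0], [0.0, 1.0])
```

In the actual systems a decision threshold on this score (or on the PLDA log-likelihood ratio) yields the accept/reject verdict.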

The evaluation metrics adopted in this work are the commonly used equal error rate (EER) and minimum detection cost function (min-DCF). For the min-DCF metric, we consider the prior probability of target trials to be 0.01 on Voxceleb1 (denoted as DCF₀.₀₁). For the NIST SRE10 dataset, the target trial partitions are much smaller, and thus we consider the prior probability to be 0.001 (denoted as DCF₀.₀₀₁). Intuitively, considering that the baseline system can operate with high confidence on the characteristics common to the training and evaluation data, while the BNN-based system can generalize well on the mismatched characteristics, we also design a fusion system for performance comparison. The fusion is performed by averaging the verification scores from the two systems.

**Table 3:** Results of the out-of-domain evaluation.

| Training set | Evaluation set | System | Scoring back-end | x-vector extractor | EER (%) | DCF₀.₀₁ / DCF₀.₀₀₁ |
|---|---|---|---|---|---|---|
| Voxceleb1 | NIST SRE10 | (1) | cosine | baseline | 10.78 | 0.8650 |
| | | (2) | | proposed | 10.38 | 0.8633 |
| | | (3) | | fusion | 10.15 | 0.8428 |
| | | (4) | PLDA | baseline | 8.31 | 0.8646 |
| | | (5) | | proposed | 7.84 | 0.8541 |
| | | (6) | | fusion | 7.71 | 0.8378 |
| NIST SRE10 | Voxceleb1 | (7) | cosine | baseline | 15.30 | 0.8101 |
| | | (8) | | proposed | 14.85 | 0.8164 |
| | | (9) | | fusion | 14.14 | 0.7913 |
| | | (10) | PLDA | baseline | 11.27 | 0.7636 |
| | | (11) | | proposed | 10.91 | 0.7555 |
| | | (12) | | fusion | 10.68 | 0.7461 |

## 5 Experiment results

### 5.1 In-domain evaluation

In this section, we perform in-domain evaluation on two datasets: Voxceleb1 for the short-utterance condition and NIST SRE10 for the long-utterance condition. The corresponding performance is shown in Table 2. From the table, we observe that EERs consistently decrease after incorporating Bayesian learning, in both short- and long-utterance conditions. In each condition, we consider the average relative EER decrease across the cosine and PLDA back-ends. In the short-utterance condition, the average relative EER decrease from the Bayesian x-vector system is 2.66%, and the fusion system achieves a further average relative EER decrease of 7.24%. For the long-utterance condition, the average relative EER decrease is 2.32% for the Bayesian x-vector system and 3.08% for the fusion system.

Our experiment results show that systems in the short-utterance condition could benefit more from the Bayesian learning when compared with the long-utterance condition. One possible explanation is that, in the short-utterance condition, systems may be heavily affected by speaker-unrelated information, such as channel and phonetic information, which may bring larger mismatches in the testing stage. With the uncertainty modeling of model parameters, the Bayesian learning could bring extra benefits to alleviate these larger mismatches and improve the performance. Similar results could be observed in detection cost function (DCF) metrics as shown in the last column of Table 2. Fig. 2 illustrates the detection error trade-off (DET) curves of systems with the cosine back-end (Systems 1, 2 and 3 in Table 2). It shows that the proposed Bayesian system outperforms the baseline for all operating points, and the fusion system could achieve further improvements due to the complementary advantages of the baseline and the Bayesian system.

### 5.2 Out-of-domain evaluation

The out-of-domain evaluation is performed in this section, as shown in Table 3. The model trained on Voxceleb1 (Systems 1 to 6) will be evaluated on NIST SRE10, and vice versa. System performance usually drops significantly due to the larger mismatch between the training and evaluation data, so this evaluation has a higher demand for the system’s generalization ability.

From Table 3, we observe that systems benefit more from the generalization power of Bayesian learning. We again consider the average relative EER decrease across the cosine and PLDA scoring back-ends. In the experiments evaluated on NIST SRE10, the average relative EER decrease is 4.69% and 6.53% for the Bayesian system and the fusion system, respectively. For the experiments on the Voxceleb1 dataset, the average relative EER decrease is 3.07% for the Bayesian x-vector system, and the fusion system achieves a further average relative EER decrease of 6.41%. The larger relative EER decreases compared with the in-domain evaluation suggest that Bayesian learning is more beneficial when a larger mismatch exists between the training and evaluation data. This phenomenon is similar to the observations in [hu2019lf], where the improvement on the out-of-domain dataset (with larger mismatch) is larger than that on the in-domain dataset. The last column in Table 3 shows the corresponding DCF performance, where we observe consistent improvements from Bayesian learning and the fusion system. Similar to the observations in Fig. 2, the DET curves in Fig. 3 show consistent improvements from Bayesian learning and the fusion model at all operating points.

## 6 Conclusion

In this work, we incorporate the BNN technique into the DNN x-vector system to improve the model's generalization ability. BNN layers embedded in the x-vector extractor make the extracted speaker embedding (the Bayesian x-vector) generalize better on unseen data. Our experimental results show that the DNN x-vector benefits from Bayesian learning in both short- and long-utterance conditions, and that the fusion system achieves further performance improvements. Moreover, we observe that systems benefit more from Bayesian learning in the out-of-domain evaluation. In particular, in the out-of-domain evaluation on the NIST SRE10 dataset, the average relative EER decrease across cosine and PLDA scoring is around 4.69% and 6.53% for the Bayesian system and the fusion system, respectively. This suggests that Bayesian learning is more beneficial when a larger mismatch exists between the training and evaluation data. Future research will focus on incorporating Bayesian learning into end-to-end speaker verification systems.

## 7 Acknowledgements

This work is partially supported by a grant from the HKSAR Government’s Research Grants Council General Research Fund (reference number 14208718).
