Speaker verification has long been a fundamental task in speech processing. A speaker verification system verifies an identity claim made by a test speaker, and decides to accept or reject the claim. It can be either text-dependent or text-independent based on its input speech materials: the former constrains a speaker to pronounce a prescribed text, while the latter does not constrain the speech contents. This paper studies text-independent speaker verification. A text-independent speaker verification system generally contains two components. The first component is a front-end, which extracts a feature vector from a speaker utterance by some density estimator. The second component is a back-end, which builds speaker models and measures the similarity of two speaker models by a classifier.
An early speaker verification front-end is feature averaging, which learns a feature vector from a speaker utterance by averaging the frame-level acoustic features [1]. The method requires long speech utterances to reach stable speech statistics. Another class of front-ends estimates the density of speech frames by statistical models. Early approaches of this kind build a model, e.g. vector quantization [2] or a Gaussian mixture model (GMM) [3, 4], for each speaker. These approaches are inefficient when the number of speakers is large. To alleviate this problem, in [5], Reynolds et al. proposed the GMM-based universal background model (GMM-UBM), which builds a single GMM from the pool of all training speakers. GMM-UBM has been a fundamental method of speaker verification in recent years. To deal with speaker and channel variability, many approaches were proposed along with GMM-UBM, where factor analysis [6] is among the effective ones. It first extracts high-dimensional supervectors of utterances, which are their zero-th and first-order statistics produced from GMM-UBM, and then reduces the supervectors to low-dimensional identity vectors (i-vectors) by factor analysis. The above combination of GMM-UBM and i-vectors is the GMM/i-vector front-end.
Recently, deep neural network (DNN) based front-ends have received much attention [7, 8, 9]. In [7], Sarkar et al. used a DNN to extract frame-level bottleneck features that were then used as the input of GMM-UBM. In [9], Lei et al. took a DNN trained for a different task, e.g. speech recognition, to generate the posterior probabilities of speech frames, which is a supervised alternative to GMM-UBM, and then used factor analysis to extract i-vectors from the DNN based UBM. The method is denoted as the DNN/i-vector front-end. To demonstrate the advantages of the DNN/i-vector front-end, its DNN acoustic model needs to be trained with additional data. In [8], Variani et al. trained a DNN classifier to map frame-level features in a given context to the corresponding speaker identity target, and extracted a feature vector, referred to as a deep vector or “d-vector”, from a speaker utterance by averaging the activations derived from the last DNN hidden layer. The method is known as the d-vector front-end.
After feature extraction by a front-end, a speaker verification back-end builds speaker models for classification. It generally contains two stages: a development stage and a test stage. The development stage builds a speaker space from development data, where each speaker acts like a coordinate axis of the space. The test stage gets the enrollment and test speaker models of a trial from the speaker space, and then evaluates the similarity of the two models by a classifier.
We summarize some back-ends as follows. In [5], Reynolds et al. first built a speaker space by adapting the GMM-UBM to many speaker-dependent GMMs by maximum a posteriori estimation in the development stage, and then verified the identity of a test speaker by a likelihood ratio test. Later on, in [12], Campbell et al.
trained a support vector machine classifier to distinguish true speakers from imposter speakers, with nuisance attribute projection compensating for session variability. In [6], Dehak et al. proposed to learn a speaker space by within-class covariance normalization (WCCN) [16] or linear discriminant analysis (LDA) and then applied the cosine similarity scoring as the classifier. In [13], Kenny proposed to extract speaker models from an i-vector based front-end or LDA and then used probabilistic LDA (PLDA) as the classifier. Besides, in [14], Snyder et al. proposed an end-to-end method to train a DNN based front-end and a PLDA-like back-end jointly.
In this paper, we propose a linear regression (LR) based back-end. LR is a traditional statistical regression model that minimizes the mean squared error between the target and its estimate, and has a closed-form solution. In the development stage of the back-end, we apply LR to learn a speaker space, where the targets of the LR model are the ground-truth indicator vectors of the speaker utterances. In the enrollment and test stages, we first extract the enrollment and test speaker models of a trial from the speaker space, and then evaluate the similarity of the two models by the cosine similarity scoring. The overall back-end is denoted as the LR+cosine back-end. To evaluate its effectiveness, we propose three speaker verification systems which combine the LR+cosine back-end with the GMM/i-vector, DNN/i-vector, and d-vector front-ends, respectively.
We have conducted extensive experiments on the NIST 2006 speaker recognition evaluation (SRE) and NIST 2008 SRE data sets. We have compared the LR+cosine back-end with the cosine similarity scoring (cosine), WCCN with the cosine similarity scoring (WCCN+cosine), LDA with the cosine similarity scoring (LDA+cosine), and LDA with the PLDA scoring (LDA+PLDA) back-ends. Our experimental results show that the proposed method outperforms the comparison methods, and the experimental conclusion is consistent across different lengths of enrollment speech.
II Linear regression for speaker verification
The procedure of any of the three speaker verification systems is as follows. The front-end extracts a feature vector $\mathbf{x}$ from an utterance $\{\mathbf{o}_1, \ldots, \mathbf{o}_T\}$, where $T$ denotes the number of frames of the utterance. Then, the LR-based back-end first gets the speaker model $\mathbf{m}$ from $\mathbf{x}$ by the LR model, and then verifies the identity of the speaker by the cosine similarity scoring.
II-A Linear regression based back-end
Suppose a labeled development corpus processed by a front-end is given by $\{\mathbf{x}_{s,h}, l_{s,h}\}$ with $s = 1, \ldots, S$ and $h = 1, \ldots, H_s$, where $S$ is the number of speakers, $H_s$ is the number of utterances of the $s$-th speaker, $\mathbf{x}_{s,h}$ is the feature vector of a speaker utterance produced from a front-end, and $l_{s,h}$ is the ground-truth label of the utterance representing the identification of the speaker, $l_{s,h} \in \{1, \ldots, S\}$. Suppose $l_{s,h} = s$, then we change $l_{s,h}$ to an $S$-dimensional indicator vector $\mathbf{y}_{s,h}$, which is a binary code with the $s$-th dimension set to $1$ and the other dimensions set to $0$. As a result, we can rewrite the labeled corpus as $\{\mathbf{x}_{s,h}, \mathbf{y}_{s,h}\}$. We fit the corpus to a LR model:
$$\mathbf{y} = \mathbf{W}^T \mathbf{x} + \mathbf{e},$$
where $\mathbf{W}$ is the LR model and $\mathbf{e}$ is the estimation error. Minimizing the squared estimation error over the development corpus derives the following closed-form solution:
$$\mathbf{W} = (\mathbf{X} \mathbf{X}^T)^{-1} \mathbf{X} \mathbf{Y}^T,$$
where the columns of $\mathbf{X}$ and $\mathbf{Y}$ collect the feature vectors $\mathbf{x}_{s,h}$ and indicator vectors $\mathbf{y}_{s,h}$ of all development utterances, respectively.
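As a minimal numerical sketch of this training step (toy sizes and random data; the variable names are illustrative, not from the paper), the closed-form solution can be computed with numpy, using a least-squares solver instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

S, d, H = 4, 10, 5                 # toy sizes: S speakers, d-dim features, H utterances each
X = rng.normal(size=(d, S * H))    # columns are development feature vectors x
labels = np.repeat(np.arange(S), H)
Y = np.eye(S)[:, labels]           # columns are S-dimensional indicator vectors y

# Closed-form least-squares solution of y = W^T x + e, i.e.
# W = (X X^T)^{-1} X Y^T, solved via lstsq rather than an explicit inverse.
W, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)   # shape (d, S)

Z = W.T @ X                        # transformed features z = W^T x for all utterances
```

Each column of `Z` is the new feature of one development utterance; at test time the same `W` transforms unseen utterance vectors.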
In the enrollment and test stages, we apply the LR model to extract a new feature $\mathbf{z}$ from $\mathbf{x}$ by the following equation:
$$\mathbf{z} = \mathbf{W}^T \mathbf{x}.$$
The speaker model $\mathbf{m}$ is given by:
$$\mathbf{m} = \frac{1}{H} \sum_{h=1}^{H} \mathbf{z}_h,$$
where $H$ is the number of utterances of the speaker.
Finally, we employ a classifier to measure the similarity of two speaker models $\mathbf{m}_1$ and $\mathbf{m}_2$. Although many classifiers could be applied, we use the simple and effective cosine similarity scoring, based on experimental conclusions in previous work. The cosine similarity of the two models is calculated by:
$$\mathrm{score}(\mathbf{m}_1, \mathbf{m}_2) = \frac{\mathbf{m}_1^T \mathbf{m}_2}{\|\mathbf{m}_1\| \, \|\mathbf{m}_2\|},$$
which is compared with a decision threshold $\theta$. If the score is larger than $\theta$, then the two models are judged as from the same speaker; otherwise, they are from different speakers.
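The enrollment averaging and scoring steps above can be sketched as follows (toy vectors; the threshold value here is arbitrary and would be tuned on development data):

```python
import numpy as np

def speaker_model(Z):
    """Speaker model m: average of the LR-transformed features z (columns of Z)."""
    return Z.mean(axis=1)

def cosine_score(m1, m2):
    """Cosine similarity of two speaker models."""
    return float(m1 @ m2 / (np.linalg.norm(m1) * np.linalg.norm(m2)))

rng = np.random.default_rng(1)
enroll = speaker_model(rng.normal(size=(8, 3)))  # model from 3 enrollment utterances
test = rng.normal(size=8)                        # single test utterance feature
theta = 0.5                                      # decision threshold (arbitrary here)
accept = cosine_score(enroll, test) > theta      # True -> identity claim accepted
```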
II-B1 GMM/i-vector front-end
The GMM/i-vector front-end contains a GMM-UBM [5, 15], which is a speaker- and channel-independent GMM trained from the pool of all speech frames of the development data, and a total variability matrix $\mathbf{T}$ [6] that encompasses both speaker- and channel-variability. Suppose the GMM-UBM contains $C$ Gaussian mixture components, and suppose we have an utterance of $T$ frames $\{\mathbf{o}_1, \ldots, \mathbf{o}_T\}$ where $\mathbf{o}_t$ is a $D$-dimensional acoustic feature. The zero-th order and centralized first-order Baum-Welch statistics of the utterance extracted from the $c$-th component of the GMM-UBM are:
$$N_c = \sum_{t=1}^{T} P(c \mid \mathbf{o}_t), \qquad \mathbf{F}_c = \sum_{t=1}^{T} P(c \mid \mathbf{o}_t)(\mathbf{o}_t - \boldsymbol{\mu}_c),$$
where $\boldsymbol{\mu}_c$ is the mean of the $c$-th component of the GMM-UBM. If we define $\mathbf{N}$ as a $CD \times CD$-dimensional diagonal matrix whose diagonal blocks are $N_c \mathbf{I}$ for $c = 1, \ldots, C$, $\mathbf{F} = [\mathbf{F}_1^T, \ldots, \mathbf{F}_C^T]^T$ as a supervector, and $\boldsymbol{\Sigma}$ as a $CD \times CD$-dimensional diagonal covariance matrix estimated during factor analysis training [6], then we obtain the i-vector $\mathbf{w}$ by:
$$\mathbf{w} = (\mathbf{I} + \mathbf{T}^T \boldsymbol{\Sigma}^{-1} \mathbf{N} \mathbf{T})^{-1} \mathbf{T}^T \boldsymbol{\Sigma}^{-1} \mathbf{F},$$
where $\mathbf{I}$ is the identity matrix, and the total variability matrix $\mathbf{T}$ is invariant across utterances.
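Under the definitions above, the statistics and the i-vector can be sketched with numpy (toy dimensions, random stand-in posteriors, and $\boldsymbol{\Sigma} = \mathbf{I}$ for simplicity; in practice the posteriors come from the GMM-UBM and $\mathbf{T}$, $\boldsymbol{\Sigma}$ from factor analysis training):

```python
import numpy as np

rng = np.random.default_rng(2)
C, D, R, T = 8, 6, 4, 50          # mixtures, feature dim, i-vector dim, frames (toy)

O = rng.normal(size=(T, D))       # acoustic features o_t, one row per frame
mu = rng.normal(size=(C, D))      # UBM component means mu_c
post = rng.random(size=(T, C))
post /= post.sum(axis=1, keepdims=True)   # stand-in for P(c | o_t), rows sum to 1

N_c = post.sum(axis=0)                    # zero-th order Baum-Welch statistics
F_c = post.T @ O - N_c[:, None] * mu      # centralized first-order statistics

N = np.kron(np.diag(N_c), np.eye(D))      # CD x CD block-diagonal matrix N
F = F_c.reshape(-1)                       # CD-dimensional supervector F
Tv = rng.normal(size=(C * D, R))          # total variability matrix (random stand-in)
Sigma_inv = np.eye(C * D)                 # Sigma = I for this sketch

# i-vector: w = (I + T^T Sigma^{-1} N T)^{-1} T^T Sigma^{-1} F
w = np.linalg.solve(np.eye(R) + Tv.T @ Sigma_inv @ N @ Tv,
                    Tv.T @ Sigma_inv @ F)
```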
II-B2 DNN/i-vector front-end
The difference between the DNN/i-vector front-end [9] and the GMM/i-vector front-end is that the GMM-UBM in the DNN/i-vector front-end is estimated with a DNN acoustic model trained for automatic speech recognition. Specifically, the DNN acoustic model is used to estimate the senone posteriors of acoustic features, where a senone models the tied states of a set of triphones that are close in the acoustic space. If we model the posterior distribution of a senone by a Gaussian mixture component of the GMM-UBM, then we can use the senone posteriors to train the GMM-UBM in the following way.
Suppose the development corpus contains $U$ utterances, and the $u$-th utterance has $T_u$ frames $\{\mathbf{o}_{u,1}, \ldots, \mathbf{o}_{u,T_u}\}$. The parameters of the GMM-UBM are estimated by:
$$\pi_c = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_{u,t}^{(c)}}{\sum_{u=1}^{U} T_u}, \qquad \boldsymbol{\mu}_c = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_{u,t}^{(c)} \mathbf{o}_{u,t}}{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_{u,t}^{(c)}}, \qquad \boldsymbol{\Sigma}_c = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_{u,t}^{(c)} \mathbf{o}_{u,t} \mathbf{o}_{u,t}^T}{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_{u,t}^{(c)}} - \boldsymbol{\mu}_c \boldsymbol{\mu}_c^T,$$
where $\gamma_{u,t}^{(c)}$ represents the alignment of $\mathbf{o}_{u,t}$, which is the posterior of the $c$-th senone produced by the DNN acoustic model, and $\pi_c$ and $\boldsymbol{\Sigma}_c$ are the prior and covariance of the $c$-th mixture component, respectively.
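A sketch of this soft re-estimation with numpy (random stand-in posteriors; real ones would come from the DNN acoustic model, and the frames of all utterances are pooled into one matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
C, D, T = 5, 4, 200               # senones/mixtures, feature dim, pooled frames (toy)

O = rng.normal(size=(T, D))       # pooled development frames
gamma = rng.random(size=(T, C))
gamma /= gamma.sum(axis=1, keepdims=True)   # stand-in DNN senone posteriors

occ = gamma.sum(axis=0)                     # soft occupancy count per component
pi = occ / T                                # mixture priors
mu = (gamma.T @ O) / occ[:, None]           # posterior-weighted means
var = (gamma.T @ O**2) / occ[:, None] - mu**2   # diagonal covariances E[o^2] - mu^2
```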
The DNN acoustic model is trained in a supervised mode, where the ground-truth labels of the speech frames are the alignments produced by a hidden-Markov-model-GMM (HMM-GMM) speech recognition system. It usually adopts a contextual window with a window size of $2\tau + 1$ to expand the input from $\mathbf{o}_t$ to $[\mathbf{o}_{t-\tau}^T, \ldots, \mathbf{o}_{t+\tau}^T]^T$, where $\tau$ is the half-window length.
II-B3 D-vector front-end
The d-vector front-end [8] averages the frame-level features of an utterance produced from the top hidden layer of a DNN classifier to obtain an utterance-level d-vector. The DNN is trained to minimize the classification error of speech frames, where the ground-truth label of a speech frame is the indicator vector of the speaker that the speech frame belongs to. The DNN adopts a large contextual window with a window size of $2\tau + 1$ to expand its input acoustic feature from $\mathbf{o}_t$ to $[\mathbf{o}_{t-\tau}^T, \ldots, \mathbf{o}_{t+\tau}^T]^T$, which is important in improving the effectiveness and robustness of the d-vector front-end.
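The context-window expansion used by both DNN front-ends and the d-vector averaging can be sketched as follows (padding edge frames by repetition is one common choice and an assumption here, not stated in the paper):

```python
import numpy as np

def expand_context(O, tau):
    """Stack each frame with its tau left and tau right neighbors, turning a
    T x D feature matrix into T x (2*tau + 1)*D (edges padded by repetition)."""
    T, D = O.shape
    padded = np.concatenate([np.repeat(O[:1], tau, axis=0), O,
                             np.repeat(O[-1:], tau, axis=0)])
    return np.stack([padded[t:t + 2 * tau + 1].reshape(-1) for t in range(T)])

def d_vector(hidden):
    """d-vector: average of the top-hidden-layer activations over all frames."""
    return hidden.mean(axis=0)

O = np.random.default_rng(4).normal(size=(100, 99))  # 99-dim features, 100 frames
X = expand_context(O, tau=20)                        # 41-frame window -> 4059 dims
```

With tau=20 and 99-dimensional features this reproduces the 4059-dimensional d-vector input of Section III-B, and tau=3 gives the 693 dimensions of the DNN/i-vector front-end.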
In this section, we first present the databases and evaluation metrics in Section III-A, then the experimental setup in Section III-B, and finally the experimental results in Sections III-C and III-D.
III-A Databases and evaluation metrics
We took the 8conv condition of the NIST 2006 speaker recognition evaluation (SRE) database as the development set, and the 8conv condition of the NIST 2008 SRE database for enrollment and test. The 8conv condition of NIST 2006 SRE contains 402 female speakers and 298 male speakers. The 8conv condition of NIST 2008 SRE contains 395 female speakers and 240 male speakers. Each speaker has 8 conversations. A speaker utterance in a conversation was about 1 to 2 minutes long after removing the silence segments by voice activity detection (VAD), where we took its ASR transcript as its VAD label. We split all speech signals into 15-second segments.
To illustrate the global performance of the proposed method in terms of detection error tradeoff (DET) curves, we built an initial test condition as follows. We selected the first 150 seconds of speech of the first conversation of a speaker as the enrollment data of the speaker, and split the last 30 seconds of speech of the 6-th conversation of the speaker into two test segments, with each segment as an individual test. We took each speaker as a claimant with the remaining speakers acting as imposters, and rotated through the tests of all speakers. We conducted the experiment on females and males respectively. The numbers of claimant and imposter trials are summarized in Table II. The closer the DET curve approaches the origin, the better the performance is.
To investigate how the performance of the proposed method varies with the length of the enrollment speech, we conducted experiments in six test conditions described in Table I. Specifically, for each speaker in the 8conv condition of the NIST 2008 SRE, we first randomly picked 2 segments from a randomly selected conversation, with each segment as an individual test; then, we randomly selected $K$ of the 15-second segments from the remaining 7 conversations as the enrollment data of the speaker, where we set $K$ to 1, 2, 3, 5, 10, and 15 for the six test conditions respectively. For a given test condition, we built the claimant and imposter trials in the same way as the initial test condition. Therefore, the numbers of trials are the same as those in Table II. Because the enrollment and test speech of a trial was selected randomly, we ran the experiments on each test condition 100 times and reported the average results so as to prevent biased conclusions. We used equal error rate (EER), minimum detection cost function (DCF) with SRE’08 parameters (DCF08), and minimum DCF with SRE’10 parameters (DCF10) as the evaluation metrics. The smaller the EER or DCF is, the better the performance is.
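As a sketch of how EER can be computed from trial scores (a simple threshold-sweep approximation, not the official NIST scoring tool):

```python
import numpy as np

def eer(target_scores, imposter_scores):
    """Approximate equal error rate: sweep the decision threshold over all
    observed scores and return the smallest max(false-rejection rate,
    false-acceptance rate), which the true EER sits at or below."""
    best = 1.0
    for th in np.concatenate([target_scores, imposter_scores]):
        frr = np.mean(target_scores < th)     # claimants wrongly rejected
        far = np.mean(imposter_scores >= th)  # imposters wrongly accepted
        best = min(best, max(frr, far))
    return best

tgt = np.array([0.9, 0.8, 0.7, 0.4])   # claimant (true) trial scores
imp = np.array([0.5, 0.3, 0.2, 0.1])   # imposter trial scores
rate = eer(tgt, imp)
```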
III-B Experimental setup
III-B1 Acoustic features
We set the frame length to 25 ms and the frame shift to 10 ms. We extracted 19-dimensional mel-frequency cepstral coefficients (MFCC), 13-dimensional relative spectral filtered perceptual linear predictive cepstral coefficients (RASTA-PLP), and 1-dimensional log energy, as well as their delta and delta-delta coefficients from each frame, which produced a 99-dimensional acoustic feature per frame.
For the GMM/i-vector front-end, we used gender-dependent UBMs containing 2048 Gaussian mixtures and 400 total factors defined by the total variability matrix $\mathbf{T}$. We followed the MSR identity toolbox for the implementation of the GMM/i-vector front-end.
For the DNN/i-vector front-end, we trained a DNN acoustic model on the Switchboard-1 database. The alignments of the frames for the DNN training, which contained 8730 senones, were generated by an HMM-GMM speech recognition system implemented in the Kaldi pipeline. The half-window length of the DNN was set to 3, which expanded the acoustic features to 693 dimensions. As a result, the DNN acoustic model used the 693-dimensional feature as the input and its corresponding 8730-dimensional alignment as the ground-truth label. The DNN has hidden layers consisting of rectified linear units, and softmax output units. It was optimized by the minimum cross-entropy criterion with backpropagation and stochastic gradient descent. The momentum was set to 0.5 for the first 10 epochs, and to 0.9 for the remaining epochs. The dropout rate of the hidden units was set to 0.2. We used the posterior probabilities of the development data produced by the DNN acoustic model to train gender-dependent UBMs. Because many senones have small posterior probabilities, we truncated the UBMs from 8730 Gaussian mixtures to 3096 Gaussian mixtures by discarding the mixtures that have small zero-th order Baum-Welch statistics. We used 400 total factors to generate the i-vectors.
For the d-vector front-end, we trained gender-dependent DNNs on the development data, where the two DNNs share the same parameter setting as follows. The half-window length was set to 20, which expanded the acoustic feature to 4059 dimensions. Each DNN has hidden layers of 400 rectified linear units each. The output dimensions of the two DNNs are 395 for the females and 240 for the males, respectively. All other parameters were set to the same values as those in the DNN/i-vector front-end.
III-B3 Back-ends in comparison
We compared the LR+cosine back-end with the following back-ends:
Cosine similarity scoring (cosine): The cosine back-end evaluates the cosine similarity of two speaker models directly, where the speaker model is simply an average of the utterance-level feature vectors of the speaker produced from a front-end.

WCCN+cosine: WCCN [16] learns a linear transform that normalizes the within-class covariance of the development data to the identity matrix; the cosine similarity scoring is then applied to the transformed speaker models.
LDA+cosine: LDA is a supervised dimensionality reduction method. Dehak et al. [6] applied LDA to the cosine similarity scoring. Here we set the output dimension of LDA to 200 in all evaluations, which is a common experimental setting in the literature.
LDA+PLDA: The PLDA classifier was first introduced to speaker verification by Kenny in [13]. LDA is usually used as a feature extractor for PLDA. We set the output dimension of LDA to 200 in all evaluations.
We report the comparison results in the initial test condition in Figs. 1 to 6 respectively. From the figures, we observe that the proposed method outperforms the comparison methods significantly when the GMM/i-vector or DNN/i-vector front-end is used (Figs. 1, 2, 4 and 5), and outperforms the comparison methods slightly when the d-vector front-end is used (Figs. 3 and 6).
To prevent a biased conclusion that the proposed method happens to have some advantage in the initial test condition, we ran a comparison in the 6 test conditions described in Table I, where each test condition has 100 independent implementations randomly generated from the NIST 2008 SRE database. We report the average results on the male and female parts of the implementations in Tables III and IV respectively. From the tables, we observe that the proposed LR+cosine back-end outperforms the comparison methods when the enrollment speech is longer than 15 seconds, and is comparable to LDA+PLDA when the enrollment speech is 15 seconds long, given any of the three front-ends.
We further report the relative EER improvement of the proposed method over the best comparison method, defined as $(\mathrm{EER}_{\mathrm{cmp}} - \mathrm{EER}_{\mathrm{prop}}) / \mathrm{EER}_{\mathrm{cmp}}$, with $\mathrm{EER}_{\mathrm{prop}}$ and $\mathrm{EER}_{\mathrm{cmp}}$ denoted as the EERs of the proposed method and the best comparison method respectively. From the figures, we observe the following phenomena. (i) The relative improvement gets larger as the enrollment speech gets longer. An exception is that, when the DNN/i-vector is used as the front-end, the relative improvement does not always increase for the females. This is caused by the fast performance improvement of the cosine similarity scoring when the enrollment speech gets longer. (ii) The highest relative improvement happens with the GMM/i-vector front-end, which reaches 44.3% for the females in the 225"-15" test condition and 33.0% for the males in the 150"-15" test condition.
We also plot the soft decision scores produced from the LR+cosine and LDA+PLDA back-ends for the females in Fig. 9, where we have normalized the decision scores to a range in which the mean values of the decision scores of the imposter and true trials are zero and one respectively. From the figure, we observe that the scores produced by LR+cosine have smaller within-class variances and smaller overlaps than those produced by LDA+PLDA.
III-D Effects of back-ends in fusion systems
Fusing the decision scores produced from multiple base methods is an effective way for further improving the performance of the base methods. This subsection studies the approach of averaging the soft decision scores produced from the systems that use GMM/i-vector and DNN/i-vector as the front-ends, respectively. Figures 10 and 11 show the DET curves of the fusion systems with different back-ends on the initial test condition. Tables V and VI list the comparison results of the fusion systems on the 6 test conditions defined in Table I. From the figures and tables, we observe the same experimental phenomena as those in Section III-C, which supports the effectiveness of the LR+cosine back-end in the fusion systems.
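The score-level fusion studied here is plain trial-by-trial averaging, which can be sketched as (equal weights, matching the averaging described above; the toy scores are illustrative):

```python
import numpy as np

def fuse_scores(scores_a, scores_b):
    """Fuse two systems by averaging their soft decision scores trial-by-trial."""
    return (np.asarray(scores_a, dtype=float) + np.asarray(scores_b, dtype=float)) / 2.0

# Toy example: scores from the GMM/i-vector and DNN/i-vector based systems
fused = fuse_scores([0.8, 0.1, 0.6], [0.6, 0.3, 0.2])
```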
Note that we have also evaluated the fusion systems that fuse the GMM/i-vector, DNN/i-vector, and d-vector front-ends together. The experimental conclusions are similar to the above.
In this paper, we have presented a speaker verification back-end based on linear regression. Linear regression is a simple linear model that minimizes the mean squared estimation error between the target and its estimate with a closed-form solution, where the targets for our speaker verification problem are defined as the ground-truth indicator vectors of utterances. The proposed LR+cosine back-end first learns speaker models by the LR model, and then applies the cosine similarity scoring to evaluate the similarity of a pair of speaker models. We have further proposed three LR-based speaker verification systems by combining the LR+cosine back-end with the GMM/i-vector, DNN/i-vector, and d-vector front-ends respectively. We have conducted extensive experiments on the NIST 2006 SRE and NIST 2008 SRE data sets, where we used the 8conv condition of the NIST 2006 SRE for development and the 8conv condition of the NIST 2008 SRE for enrollment and test. To prevent a biased experimental conclusion on a particular evaluation environment, the experiments were carried out with different lengths of enrollment speech covering a range from 15 seconds to 225 seconds and repeated 100 times. The experimental results show that the proposed LR+cosine back-end outperforms several common back-ends including the cosine, WCCN+cosine, LDA+cosine, and LDA+PLDA back-ends in most cases in terms of DET curves, EER, DCF08, and DCF10.
This work was supported in part by the Natural Science Foundation of China under Grant No. 61671381.
-  J. Markel, B. Oshika, and A. Gray, “Long-term feature averaging for speaker recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 4, pp. 330–337, 1977.
-  F. K. Soong, A. E. Rosenberg, B.-H. Juang, and L. R. Rabiner, “Report: A vector quantization approach to speaker recognition,” AT&T Tech. J., vol. 66, no. 2, pp. 14–26, 1987.
-  D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Commun., vol. 17, no. 1, pp. 91–108, 1995.
-  D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech, Audio Process., vol. 3, no. 1, pp. 72–83, 1995.
-  D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, no. 1-3, pp. 19–41, 2000.
-  N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788–798, 2011.
-  A. K. Sarkar, C.-T. Do, V.-B. Le, and C. Barras, “Combination of cepstral and phonetically discriminative features for speaker verification,” IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1040–1044, 2014.
-  E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 4052–4056.
-  Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 1695–1699.
-  F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Process. Lett., vol. 22, no. 10, pp. 1671–1675, 2015.
-  W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, 2006.
-  W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2006, pp. 1–4.
-  P. Kenny, “Bayesian speaker verification with heavy-tailed priors.” in Proc. Odyssey, 2010, pp. 14–23.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Proc. IEEE Spoken Lang. Tech. Workshop, 2016, pp. 165–170.
-  P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 980–988, 2008.
-  A. O. Hatch, S. S. Kajarekar, and A. Stolcke, “Within-class covariance normalization for SVM-based speaker recognition,” in Proc. Interspeech, 2006, pp. 1874–1877.