1 Introduction
As opposed to text-independent speaker verification, where the speech content is unconstrained, text-dependent speaker verification systems are more favorable for security applications, since they achieve higher accuracy on short-duration sessions [1, 2].
Previous methods for text-dependent speaker verification can be grouped into two categories. The first category is based on the traditional state-of-the-art GMM-UBM or i-vector approach, which may not work well in this case [3, 1, 4]. In the second category, deep models are ported to speaker verification: a deep neural network (DNN) is used to estimate the frame posterior probabilities [5]; a DNN serves as a feature extractor for the utterance representation [6]; Matejka et al. [7] have shown that bottleneck DNN features (BN) concatenated with other acoustic features outperform the plain DNN method for text-dependent speaker verification; and multi-task learning jointly learns both speaker identity and text information [8].

This paper builds on two works. One is that of Chen et al. [8], in which the j-vector was introduced as a more compact representation for text-dependent utterances. The other is that of Chen et al. [9], in which the state-of-the-art joint Bayesian analysis was proposed to model two facial images jointly with an appropriate prior that considers intra- and extra-personal variations over the image pairs. However, the standard joint Bayesian model considers only a single label, whereas in practice the extracted features are often associated with several labels, for example when a multi-task learned network is used as the feature extractor for the j-vector [8]. Since j-vectors carry different kinds of labels, the text latent variable no longer depends on the speaker label, but rather on a separate text label. This means that for j-vectors there are two latent variables, one related to the speaker and one to the text, of equal importance, and each variable is tied across all samples sharing the corresponding label.
In order to better model the j-vector, we propose a generalization of the standard joint Bayesian model [9] called Double Joint Bayesian (DoJoBa)¹, which can explicitly and jointly model the multi-view information in the samples, such as a certain individual saying some text content. The relationship between DoJoBa and the standard joint Bayesian model is analogous to that between joint factor analysis and factor analysis. DoJoBa is also related to the work of Shi et al. [10], in which a joint PLDA was proposed for j-vector verification. One of the most important advantages of DoJoBa over joint PLDA is that DoJoBa can learn the appropriate dimensionality (i.e. the number of columns) of the low-rank speaker and phrase subspaces without user tuning.

¹For the "double joint" term, the first "joint" refers to modeling the multi-view information jointly, e.g. text and identity in the j-vector, while the second "joint" refers to the joint distribution of two features, e.g. the target and test j-vectors.

The remainder of this paper is organized as follows: Section 2 reviews the standard j-vector/joint Bayesian system. Section 3 describes the DoJoBa approach. Detailed experimental results and comparisons are presented in Section 4, and the work is summarized in Section 5.
2 Baseline j-vector/joint Bayesian model
The standard j-vector [8] and the joint Bayesian model [9] are used as the baseline in this work. This section gives a brief review of this baseline.
2.1 J-vector extraction
Chen et al. [8] proposed training a DNN to classify both speaker and phrase by minimizing a total loss consisting of the sum of two cross-entropy losses, one related to the speaker label and the other to the text label. Once training is completed, the output layers are removed, and the rest of the network is used to extract speaker-phrase joint features. Each frame of an utterance is forward propagated through the network, and the output activations of all frames are averaged to form an utterance-level feature called the j-vector. The enrollment speaker models are formed by averaging the j-vectors of the enrollment recordings.
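As a concrete sketch of this extraction and enrollment step (the toy one-layer "network" and all names below are illustrative, not the authors' implementation):

```python
import numpy as np

def extract_jvector(frames, hidden_forward):
    """j-vector = average of last-hidden-layer activations over all frames.

    frames:         (T, D) array, one row of stacked acoustic features per frame
    hidden_forward: function mapping one (D,) frame to its last hidden activation
    """
    acts = np.stack([hidden_forward(f) for f in frames])  # (T, H)
    return acts.mean(axis=0)                              # utterance-level j-vector

def enroll(jvectors):
    """Enrollment speaker model = average of the enrollment j-vectors."""
    return np.mean(jvectors, axis=0)

# toy usage with a random one-layer 'network'
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
jv = extract_jvector(rng.standard_normal((10, 4)), lambda f: np.tanh(f @ W))
model = enroll([jv, jv, jv])
```

Averaging over frames removes the dependence on utterance length, so enrollment and test utterances of different durations map to vectors of the same dimension.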
2.2 The joint Bayesian model
For the backend, the state-of-the-art joint Bayesian model [9] is employed as the classifier for speaker verification. For simplicity of notation, a joint Bayesian model with only a single speaker label is used here as an example. The joint Bayesian model describes data generation by

$x = u + \varepsilon,$

where $x$ is a (mean-subtracted) j-vector, and the identity variable $u$ and the within-class variable $\varepsilon$ are defined to be Gaussian with diagonal covariances $\Sigma_u$ and $\Sigma_\varepsilon$, respectively.
The parameters $\theta = \{\Sigma_u, \Sigma_\varepsilon\}$ of this joint Bayesian model can be estimated with the Expectation Maximization (EM) algorithm [11, 9]. With the learned joint Bayesian model, given a test j-vector $x_t$ and an enrolled model $x_e$, the likelihood ratio score is

$l(x_e, x_t) = \log \frac{P(x_e, x_t \mid H_s)}{P(x_e, x_t \mid H_d)},$

where $H_s$ is the hypothesis that the two j-vectors share the same identity variable and $H_d$ the hypothesis that they do not.

This standard joint Bayesian model cannot properly deal with j-vectors that belong to a certain speaker and a certain phrase at the same time. For j-vectors, the latent variable $u$ has to be interpreted as a joint variable carrying both speaker and phrase information, i.e. it depends on both a speaker identity and a phrase label. In this work we separate $u$ into two independent latent variables, one related to the speaker identity and the other to the phrase. This intuitive idea leads to the following DoJoBa.
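The two-hypothesis score can be evaluated in closed form: under the same-identity hypothesis the stacked pair has cross-covariance equal to the identity covariance, and under the different-identity hypothesis the cross-covariance is zero. A minimal scoring sketch, assuming mean-subtracted vectors and diagonal covariances passed as 1-D variance vectors (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def jb_llr(x1, x2, su, se):
    """Joint Bayesian log-likelihood ratio for two mean-subtracted j-vectors.

    su, se: diagonal covariances Sigma_u, Sigma_eps as 1-D variance vectors.
    Same-identity hypothesis: the pair shares one u, so the stacked vector
    has cross-covariance diag(su); different-identity: cross-covariance 0.
    """
    d = len(x1)
    pair = np.concatenate([x1, x2])
    D = np.diag(su + se)                       # marginal covariance per vector
    def logpdf(cross):
        C = np.diag(cross)
        return mvn(mean=np.zeros(2 * d),
                   cov=np.block([[D, C], [C, D]])).logpdf(pair)
    return logpdf(su) - logpdf(np.zeros(d))

x = np.array([1.0, -0.5])
same = jb_llr(x, x, np.ones(2), np.ones(2))    # a well-matched pair
diff = jb_llr(x, -x, np.ones(2), np.ones(2))   # an opposed pair
```

As expected, a matched pair scores higher than an opposed one, since the same-identity covariance rewards positively correlated pairs.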
3 Double joint Bayesian model
In this section, we propose an effective model that describes the j-vector as the result of a generative process incorporating both intra-speaker/phrase and inter-speaker/phrase variation.
3.1 Generative model
We assume that the training data is obtained from $I$ speakers saying $J$ phrases, each with $K$ sessions. We denote the j-vector of the $k$-th session of the $i$-th speaker saying the $j$-th phrase by $x_{ijk}$. We model text-dependent feature generation by the process

$x_{ijk} = m + u_i + v_j + \varepsilon_{ijk}. \qquad (1)$

The model comprises two parts: (1) the signal component $m + u_i + v_j$, which depends only on the speaker and the phrase, rather than on the particular feature vector (i.e. there is no dependence on $k$); (2) the noise component $\varepsilon_{ijk}$, which is different for every feature vector of the speaker/phrase and represents within-speaker/phrase noise. The term $m$ represents the overall mean of the training vectors. Remaining unexplained variation is captured by the residual noise term $\varepsilon_{ijk}$, which is defined to be Gaussian with diagonal covariance $\Sigma_\varepsilon$. The latent variables $u_i$ and $v_j$ are defined to be Gaussian with diagonal covariances $\Sigma_u$ and $\Sigma_v$, respectively, and are particularly important in real applications, as they represent the identity of the speaker and the content of the text, respectively.
Formally, the model can be described in terms of conditional probabilities:

$P(x_{ijk} \mid u_i, v_j) = \mathcal{N}(x_{ijk} \mid m + u_i + v_j, \Sigma_\varepsilon),$

$P(u_i) = \mathcal{N}(u_i \mid 0, \Sigma_u), \qquad P(v_j) = \mathcal{N}(v_j \mid 0, \Sigma_v),$

where $\mathcal{N}(x \mid \mu, \Sigma)$ denotes a Gaussian in $x$ with mean $\mu$ and covariance $\Sigma$. It is worth noticing that the mathematical relationship between DoJoBa and the joint Bayesian model [9] is analogous (though not identical) to that between joint PLDA [10] and PLDA [12]. Compared to joint PLDA, DoJoBa allows the data to determine the appropriate dimensionality of the low-rank speaker and text subspaces for maximal discrimination, as opposed to requiring heuristic manual selection.
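A short sampler makes the generative process of Eq. (1) concrete: the speaker variable is tied across all sessions and phrases of a speaker, and the phrase variable across all speakers and sessions of a phrase. This is a sketch under the model assumptions above, with diagonal covariances passed as 1-D variance vectors:

```python
import numpy as np

def sample_jvectors(m, su, sv, se, I, J, K, seed=0):
    """Draw x[i,j,k] = m + u[i] + v[j] + eps[i,j,k] per the DoJoBa model.

    u[i] ~ N(0, diag(su)) is tied across all sessions of speaker i;
    v[j] ~ N(0, diag(sv)) is tied across all sessions of phrase j;
    eps  ~ N(0, diag(se)) is drawn independently per session.
    """
    rng = np.random.default_rng(seed)
    d = len(m)
    u = rng.normal(0.0, np.sqrt(su), size=(I, d))           # speaker variables
    v = rng.normal(0.0, np.sqrt(sv), size=(J, d))           # phrase variables
    eps = rng.normal(0.0, np.sqrt(se), size=(I, J, K, d))   # session noise
    return m + u[:, None, None, :] + v[None, :, None, :] + eps

X = sample_jvectors(np.zeros(3), np.ones(3), np.ones(3), 0.1 * np.ones(3),
                    I=4, J=2, K=5)
```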
Let $X = \{x_{ijk}\}$, $U = \{u_i\}$, $V = \{v_j\}$, and $\theta = \{m, \Sigma_u, \Sigma_v, \Sigma_\varepsilon\}$. In order to maximize the likelihood of the data set $X$ with respect to the parameters $\theta$, the classical EM algorithm [11] is employed.
3.2 EM formulation
The auxiliary function for EM is

$Q(\theta \mid \theta^{\mathrm{old}}) = \mathbb{E}_{U, V \mid X, \theta^{\mathrm{old}}}\left[\ln P(X, U, V \mid \theta)\right].$

By maximizing this auxiliary function, we obtain the following EM formulation.
E-step: we need to calculate the expectations $\mathbb{E}[u_i]$, $\mathbb{E}[v_j]$, $\mathbb{E}[u_i u_i^T]$, $\mathbb{E}[v_j v_j^T]$, and $\mathbb{E}[u_i v_j^T]$. Since the model is linear-Gaussian, the posterior over the latent variables is Gaussian; conditioning on the current estimate of $V$, we have

$\mathbb{E}[u_i] = L_i^{-1} \Sigma_\varepsilon^{-1} \sum_{j,k} \left(x_{ijk} - m - \mathbb{E}[v_j]\right) \qquad (2)$

and

$\mathbb{E}[u_i u_i^T] = L_i^{-1} + \mathbb{E}[u_i]\,\mathbb{E}[u_i]^T. \qquad (3)$

Similar equations hold for $\mathbb{E}[v_j]$ and $\mathbb{E}[v_j v_j^T]$. For $\mathbb{E}[u_i v_j^T]$, we have

$\mathbb{E}[u_i v_j^T] = \mathbb{E}[u_i]\,\mathbb{E}[v_j]^T, \qquad (4)$

where $L_i = \Sigma_u^{-1} + JK\,\Sigma_\varepsilon^{-1}$ and $M_j = \Sigma_v^{-1} + IK\,\Sigma_\varepsilon^{-1}$ are the posterior precisions of $u_i$ and $v_j$, respectively.
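The speaker-side update of the E-step equations (2)-(3) can be sketched for diagonal covariances, where every dimension decouples and the posterior precision reduces to a vector. This is a mean-field-style pass that holds the current phrase estimates fixed; all names are illustrative:

```python
import numpy as np

def update_speaker_posteriors(X, m, su, se, Ev):
    """Posterior mean and second moment of each speaker variable u_i given E[v].

    X:  (I, J, K, d) j-vectors;  su, se: diagonal variances (d,);
    Ev: (J, d) current estimates of the phrase variables.
    With diagonal covariances the posterior precision L_i = 1/su + J*K/se
    is itself a (d,) vector.
    """
    I, J, K, d = X.shape
    cov = 1.0 / (1.0 / su + (J * K) / se)                    # posterior covariance
    resid = (X - m - Ev[None, :, None, :]).sum(axis=(1, 2))  # (I, d) residual sums
    Eu = cov * resid / se                                    # posterior mean E[u_i]
    Euu = cov + Eu ** 2                                      # diagonal of E[u u^T]
    return Eu, Euu

# noise-free sanity check: with tiny residual variance, E[u_i] recovers u_i
rng = np.random.default_rng(1)
u = rng.normal(size=(2, 3))
v = rng.normal(size=(4, 3))
X = u[:, None, None, :] + v[None, :, None, :] + np.zeros((2, 4, 5, 3))
Eu, _ = update_speaker_posteriors(X, 0.0, np.ones(3), 1e-8 * np.ones(3), v)
```

The phrase-side update is symmetric, swapping the roles of $u$ and $v$ and summing residuals over speakers and sessions instead.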
3.3 Likelihood Ratio Scores
We treat verification as a hypothesis testing problem with the null hypothesis $H_s$ that the two j-vectors share the same speaker and phrase variables $u$ and $v$, and the alternative hypothesis $H_d$ that they do not. The alternative comprises three cases: different speaker variables $u$ with the same phrase variable $v$ (model $H_{d_1}$), the same $u$ with different $v$ (model $H_{d_2}$), and different $u$ with different $v$ (model $H_{d_3}$). Given a test j-vector $x_t$ and an enrolled j-vector $x_e$, and letting the prior probabilities of the models $H_{d_1}$, $H_{d_2}$, $H_{d_3}$ be $p_1$, $p_2$, $p_3$, the likelihood ratio score is

$l(x_e, x_t) = \log \frac{P(x_e, x_t \mid H_s)}{p_1 P(x_e, x_t \mid H_{d_1}) + p_2 P(x_e, x_t \mid H_{d_2}) + p_3 P(x_e, x_t \mid H_{d_3})},$

where each marginal likelihood $P(x_e, x_t \mid \cdot)$ is a zero-mean Gaussian over the stacked pair whose cross-covariance retains $\Sigma_u$ if the speaker variable is shared and $\Sigma_v$ if the phrase variable is shared.
Notice that, like the standard joint Bayesian model [9], we do not compute point estimates of the hidden variables. Instead we compute the probability that the two multi-label vectors share the same hidden variables, regardless of the actual values of those variables.
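This scoring rule can be sketched for diagonal covariances: the pair covariance keeps $\Sigma_u$ in the cross-block when the speaker is shared and $\Sigma_v$ when the phrase is shared. The uniform priors and all names below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def dojoba_score(xe, xt, m, su, sv, se, priors=(1/3, 1/3, 1/3)):
    """log P(pair|Hs) minus the log of the prior-weighted alternative mixture.

    su, sv, se: diagonal variances of u, v, eps given as 1-D vectors.
    """
    d = len(m)
    pair = np.concatenate([xe - m, xt - m])
    D = np.diag(su + sv + se)            # marginal covariance of one j-vector
    def ll(cross):                       # pair log-likelihood for a cross-cov
        C = np.diag(cross)
        return mvn(mean=np.zeros(2 * d),
                   cov=np.block([[D, C], [C, D]])).logpdf(pair)
    num = ll(su + sv)                                # Hs: share both u and v
    alts = [ll(sv), ll(su), ll(np.zeros(d))]         # Hd1, Hd2, Hd3
    p1, p2, p3 = priors
    return num - np.log(p1 * np.exp(alts[0]) + p2 * np.exp(alts[1])
                        + p3 * np.exp(alts[2]))

x = np.array([1.0, 0.5])
tgt = dojoba_score(x, x, np.zeros(2), np.ones(2), np.ones(2), np.ones(2))
imp = dojoba_score(x, -x, np.zeros(2), np.ones(2), np.ones(2), np.ones(2))
```

For high-dimensional j-vectors one would evaluate the mixture in the log domain (e.g. with a log-sum-exp) to avoid underflow; the direct form above is kept for readability.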
4 Experiments
In this section, we describe the experimental setup and results for the proposed method on the public RSR2015 English corpus [1] and on our internal Huiting202 Chinese Mandarin database collected by Huiting Technology (http://huitingtech.com/).
4.1 Experimental setup
The RSR2015 corpus [1], released by I2R, is used to evaluate the performance of different speaker verification systems. In this work we follow the setup of [13]: Part I of RSR2015 is used for testing DoJoBa. The background and development data of RSR2015 Part I are merged as new background data to train the j-vector extractor.
Our internal gender-balanced Huiting202 database is designed for local applications. It contains 202 speakers reading 20 different phrases, with 20 sessions per phrase. All speech files are sampled at 16 kHz. 132 randomly selected speakers are used for training the background multi-task learned DNN, and the remaining 70 speakers are used for enrollment and evaluation.
In this work, 39-dimensional Mel-frequency cepstral coefficients (MFCC; 13 static including log energy, plus 13 Δ and 13 ΔΔ) are extracted and normalized using utterance-level mean and variance normalization. The network input is the stacked normalized MFCCs of 11 frames (5 frames on each side of the current frame). The DNN has 6 hidden layers (with sigmoid activation) of 2048 nodes each. During the background model development stage, the DNN is trained by pre-training with Restricted Boltzmann Machines (RBM) and fine-tuning with SGD under the cross-entropy criterion. Once the DNN is trained, j-vectors can be extracted during the enrollment and evaluation stages.
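The 11-frame input splicing can be sketched as follows; repeating the first/last frame at utterance edges is one common convention and an assumption here, since the paper does not specify the padding scheme:

```python
import numpy as np

def stack_context(feats, left=5, right=5):
    """Splice each frame with its neighbors: 11 x 39 = 429-dim DNN input.

    feats: (T, 39) normalized MFCC matrix. Utterance edges are handled by
    repeating the first/last frame (an assumed convention).
    """
    T, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    width = left + right + 1
    return np.stack([padded[t:t + width].reshape(-1) for t in range(T)])

mfcc = np.random.default_rng(2).standard_normal((100, 39))
inputs = stack_context(mfcc)   # one 429-dim row per original frame
```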
4.2 Results and discussion
Four systems are evaluated and compared across the above conditions:

- j-vector: the standard j-vector system with cosine similarity.
- joint Bayesian: the j-vector system with the classic joint Bayesian model of [9].
- jPLDA: the joint PLDA system described in [10] with j-vectors.
- DoJoBa: the double joint Bayesian system described in Section 3 with j-vectors.
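For reference, the cosine-similarity baseline scores a trial as follows (an illustrative sketch):

```python
import numpy as np

def cosine_score(model, test):
    """Cosine similarity between an enrolled j-vector model and a test j-vector."""
    return float(model @ test / (np.linalg.norm(model) * np.linalg.norm(test)))

s = cosine_score(np.array([1.0, 0.0]), np.array([1.0, 1.0]))  # 45-degree pair
```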
During evaluation, a speaker is enrolled with 3 utterances of the same phrase. The task concerns both the phrase content and the speaker identity. Non-target trials are of three types: an impostor pronouncing wrong lexical content (impostor wrong, IW); a target speaker pronouncing wrong lexical content (target wrong, TW); and an impostor pronouncing correct lexical content (impostor correct, IC).
The joint Bayesian, jPLDA, and DoJoBa models are trained on the j-vectors. The class defined in these models is the multi-task label of both speaker and phrase. For each test session the j-vector is extracted using the same process, and the log-likelihoods from joint Bayesian, jPLDA, and DoJoBa are used to distinguish among the models. The number of principal components is set to 100 and the joint Bayesian model is estimated with 10 iterations; for a fair comparison, the speaker and phrase subspace dimensions of jPLDA and DoJoBa are both set to 100, and the jPLDA and DoJoBa models are likewise trained with 10 iterations.
Tables 1 and 2 compare the performance of all the above systems in terms of equal error rate (EER) for the three types of non-target trials. DoJoBa is clearly superior to standard joint Bayesian and jPLDA, regardless of the test database. Since the DoJoBa system can exploit both the identity and the lexical information in the j-vector, it consistently outperforms the standard joint Bayesian system.
Table 1: EER (%) for the three non-target trial types on RSR2015 Part I.

EER(%)   j-vector   joint Bayesian   jPLDA   DoJoBa
IW       0.95       0.02             0.02    0.02
TW       3.14       0.03             0.06    0.02
IC       7.86       3.61             3.12    2.97
Total    1.45       0.46             0.40    0.37
Table 2: EER (%) for the three non-target trial types on Huiting202.

EER(%)   j-vector   joint Bayesian   jPLDA   DoJoBa
IW       0.86       0.10             0.13    0.08
TW       6.71       0.04             0.07    0.04
IC       4.57       2.52             2.37    2.13
Total    1.37       0.45             0.36    0.31
5 Conclusions
In this paper we have proposed double joint Bayesian (DoJoBa) analysis for j-vector verification. DoJoBa is related to the joint Bayesian model, and can be thought of as joint Bayesian with multiple probability distributions attached to the features. The most important advantage of DoJoBa compared to joint Bayesian is that multiple kinds of information can be explicitly modeled and exploited from the samples to improve verification performance; compared to jPLDA, DoJoBa can determine the latent dimensionality without tuning. The reported results show that DoJoBa provides a significant reduction in error rates over conventional systems in terms of EER.
References
 [1] Anthony Larcher, Kong Aik Lee, Bin Ma, and Haizhou Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56–77, 2014.
 [2] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119.
 [3] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, and Md Jahangir Alam, "JFA-based front ends for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1705–1709.
 [4] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [5] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1695–1699.
 [6] Ehsan Variani, Xin Lei, Erik McDermott, and Ignacio Lopez Moreno, "Deep neural networks for small footprint text-dependent speaker verification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056.
 [7] Hossein Zeinali, Hossein Sameti, Lukas Burget, Jan Cernocky, Nooshin Maghsoodi, and Pavel Matejka, "i-vector/HMM based text-dependent speaker verification system for RedDots challenge," in INTERSPEECH, 2016.
 [8] Nanxin Chen, Yanmin Qian, and Kai Yu, "Multi-task learning for text-dependent speaker verification," in INTERSPEECH, 2015.
 [9] Dong Chen, Xudong Cao, David Wipf, Fang Wen, and Jian Sun, "An efficient joint formulation for Bayesian face verification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 32–46, 2017.
 [10] Ziqiang Shi, Liu Liu, Mengjiao Wang, and Rujie Liu, "Multi-view (joint) probability linear discrimination analysis for j-vector based text-dependent speaker verification," in ASRU, 2017.
 [11] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
 [12] Y. Jiang, K. A. Lee, Z. Tang, B. Ma, A. Larcher, and H. Li, "PLDA modeling in i-vector and supervector space for speaker verification," in ACM International Conference on Multimedia, Singapore, 2012, pp. 882–891.
 [13] Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.