As opposed to text-independent speaker verification, where the speech content is unconstrained, text-dependent speaker verification systems are better suited to security applications because they achieve higher accuracy on short-duration sessions [1, 2].
Previous methods for text-dependent speaker verification can be grouped into two categories. The first category is based on the traditional state-of-the-art GMM-UBM or i-vector approach, which may not work well in this case [3, 1, 4]. The second category is based on deep neural networks (DNNs): the phonetically-aware DNN scheme of [5]; a DNN used as a feature extractor for the utterance representation; bottleneck (BN) DNN features concatenated with other acoustic features, which Matejka et al. have shown to outperform the DNN method for text-dependent speaker verification; and multi-task learning, which jointly learns both speaker identity and text information.
This paper builds on two works. The first is that of Chen et al. [8], in which the j-vector was introduced as a more compact representation for text-dependent utterances. The second is that of Chen et al. [9], in which the state-of-the-art joint Bayesian analysis was proposed to model two facial images jointly with an appropriate prior that considers intra- and extra-personal variations over the image pairs. However, the standard joint Bayesian model considers only a single label, whereas in practice the extracted features are often associated with several labels, for example when a multi-task learned network is used as the feature extractor to obtain the j-vector. Since a j-vector carries different kinds of labels, the text latent variable no longer depends only on the current label but rather on a separate text label. This means that for the j-vector there are two latent variables, one related to the speaker and one to the text, which are of equal importance, and both variables are tied across all samples that share a certain label.
In order to better model the j-vector, we propose a generalization of the standard joint Bayesian [9] called Double Joint Bayesian (DoJoBa), which can explicitly and jointly model the multi-view information in the samples, such as a certain individual saying some text content. In the term "double joint", the first "joint" refers to modeling the multi-view information jointly, e.g. text and identity in the j-vector, while the second "joint" refers to the joint distribution of two features, e.g. target and test j-vectors. The relationship between DoJoBa and standard joint Bayesian is analogous to that between joint factor analysis and factor analysis. DoJoBa is also related to the work of Shi et al. [10], in which a joint PLDA is proposed for j-vector verification. One of the most important advantages of DoJoBa over joint PLDA is that DoJoBa can learn the appropriate dimensionality (the number of columns) of the low-rank speaker and phrase subspaces without user tuning.
The remainder of this paper is organized as follows: Section 2 reviews the standard j-vector/joint Bayesian system. Section 3 describes the DoJoBa approach. The detailed experimental results and comparisons are presented in Section 4 and the whole work is summarized in Section 5.
2 Baseline j-vector/joint Bayesian model
2.1 J-vector extraction
Chen et al. [8] proposed a method to train a DNN to classify both speaker and phrase by minimizing a total loss function consisting of a sum of two cross-entropy losses, one related to the speaker label and the other to the text label. Once training is completed, the output layer is removed, and the rest of the neural network is used to extract speaker-phrase joint features. Each frame of an utterance is forward propagated through the network, and the output activations of all the frames are averaged to form an utterance-level feature called the j-vector. The enrollment speaker models are formed by averaging the j-vectors corresponding to the enrollment recordings.
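The extraction steps above can be illustrated with a minimal sketch. It assumes the frame-level network outputs are already available as plain Python lists; the function names (`multi_task_loss`, `extract_jvector`, `enroll`) and the toy numbers are hypothetical and not taken from the original system.

```python
import math


def multi_task_loss(speaker_probs, phrase_probs, speaker_label, phrase_label):
    """Total loss for one frame: sum of two cross-entropy terms.

    speaker_probs / phrase_probs are the softmax outputs of the two
    classification heads (assumed to be given).
    """
    speaker_ce = -math.log(speaker_probs[speaker_label])
    phrase_ce = -math.log(phrase_probs[phrase_label])
    return speaker_ce + phrase_ce


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / count for d in range(dim)]


def extract_jvector(frame_activations):
    """Average the last-hidden-layer activations over all frames of an utterance."""
    return mean_vector(frame_activations)


def enroll(jvectors):
    """Form a speaker model by averaging the enrollment j-vectors."""
    return mean_vector(jvectors)


# Toy data: two utterances with two frames each and 2-dimensional activations.
jv_a = extract_jvector([[1.0, 2.0], [3.0, 4.0]])
jv_b = extract_jvector([[0.0, 0.0], [2.0, 2.0]])
model = enroll([jv_a, jv_b])
loss = multi_task_loss([0.5, 0.5], [0.25, 0.75], speaker_label=0, phrase_label=1)
print(jv_a, jv_b, model, loss)
```

In a real system the activations would have the width of the last hidden layer rather than two dimensions.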
2.2 The joint Bayesian model
For the back-end, the state-of-the-art joint Bayesian model [9] is employed as a classifier for speaker verification. For simplicity of notation, the joint Bayesian model with only a single speaker label is used here as an example. Joint Bayesian models data generation using the following equation:

$$x = \mu + \epsilon,$$

where $x$ is a certain j-vector, and $\mu$ and $\epsilon$ are defined to be Gaussian with diagonal covariance $\Sigma_\mu$ and $\Sigma_\epsilon$ respectively.

The parameters $\theta = \{\Sigma_\mu, \Sigma_\epsilon\}$ of this joint Bayesian model can be estimated using the Expectation Maximization (EM) algorithm [11, 9]. With the learned joint Bayesian model, given a test j-vector $x_t$ and an enrolled model $x_s$, the likelihood ratio score is

$$l(x_t, x_s) = \frac{P(x_t, x_s \mid \text{same speaker})}{P(x_t, x_s \mid \text{different speakers})}.$$

The standard joint Bayesian model cannot properly deal with a j-vector that belongs to a certain speaker and a certain phrase at the same time. For the j-vector, the joint Bayesian latent variable $\mu$ would have to be defined as a joint variable covering both speaker and phrase information, that is, a latent variable that depends on both a speaker identity and a phrase label. In this work we instead separate $\mu$ into two independent latent variables, one related to the speaker identity and the other to the phrase. This intuitive idea results in the following DoJoBa.
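As an illustration of the single-label scoring rule, the sketch below computes a log likelihood ratio under the assumption that a mean-centered j-vector is the sum of an identity variable and a noise variable, both zero-mean Gaussian with diagonal covariance. With diagonal covariances the pair of vectors factorizes into independent two-dimensional Gaussians, one per feature dimension. The function names and the toy variances are hypothetical.

```python
import math


def log_pair_density(x, y, var, cov):
    """Log density of a zero-mean bivariate Gaussian with equal variances
    `var` on both coordinates and covariance `cov` between them."""
    det = var * var - cov * cov
    quad = (var * x * x - 2.0 * cov * x * y + var * y * y) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def joint_bayesian_llr(x_t, x_s, sigma_mu, sigma_eps):
    """Log likelihood ratio of 'same speaker' against 'different speakers'.

    sigma_mu and sigma_eps hold the diagonal entries of the identity and
    noise covariances. Under the same-speaker hypothesis the two vectors
    share the identity variable, so their covariance is sigma_mu; under
    the different-speaker hypothesis they are independent.
    """
    llr = 0.0
    for xt, xs, s_mu, s_eps in zip(x_t, x_s, sigma_mu, sigma_eps):
        total = s_mu + s_eps
        llr += log_pair_density(xt, xs, total, s_mu)
        llr -= log_pair_density(xt, xs, total, 0.0)
    return llr


sigma_mu = [1.0, 1.0]
sigma_eps = [0.1, 0.1]
close_pair = joint_bayesian_llr([1.0, -1.0], [0.9, -1.1], sigma_mu, sigma_eps)
far_pair = joint_bayesian_llr([1.0, -1.0], [-1.0, 1.0], sigma_mu, sigma_eps)
print(close_pair, far_pair)
```

Vectors that lie close together relative to the noise variance receive a positive score, while vectors on opposite sides of the origin receive a negative one.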
3 Double joint Bayesian model
In this section, we propose an effective model to describe the j-vector as resulting from a generative model which incorporates both intra-speaker/phrase and inter-speaker/phrase variation.
3.1 Generative model
We assume that the training data is obtained from $I$ speakers saying $J$ phrases, each with $K$ sessions. We denote the j-vector of the $k$'th session of the $i$'th speaker saying the $j$'th phrase by $x_{ijk}$. We model the text-dependent feature generation by the process

$$x_{ijk} = \mu + u_i + v_j + \epsilon_{ijk}.$$

The model comprises two parts: (1) the signal component $\mu + u_i + v_j$, which depends only on the speaker and phrase rather than on the particular feature vector (i.e. there is no dependence on $k$); and (2) the noise component $\epsilon_{ijk}$, which is different for every feature vector of the speaker/phrase and represents within-speaker/phrase noise. The term $\mu$ represents the overall mean of the training vectors. Remaining unexplained data variation is captured by the residual noise term $\epsilon_{ijk}$, which is defined to be Gaussian with diagonal covariance $\Sigma_\epsilon$. The latent variables $u_i$ and $v_j$ are defined to be Gaussian with diagonal covariance $\Sigma_u$ and $\Sigma_v$ respectively; they are particularly important in real applications, as they represent the identity of the speaker and the content of the text respectively.

Formally, the model can be described in terms of conditional probabilities

$$P(x_{ijk} \mid u_i, v_j, \theta) = \mathcal{N}(x_{ijk} \mid \mu + u_i + v_j, \Sigma_\epsilon),$$
$$P(u_i) = \mathcal{N}(u_i \mid 0, \Sigma_u),$$
$$P(v_j) = \mathcal{N}(v_j \mid 0, \Sigma_v),$$

where $\mathcal{N}(x \mid m, \Sigma)$ represents a Gaussian in $x$ with mean $m$ and covariance $\Sigma$. It is worth noting that the mathematical relationship between DoJoBa and joint Bayesian [9] is analogous (though not identical) to that between joint PLDA [10] and PLDA. Compared to joint PLDA, DoJoBa allows the data to determine the appropriate dimensionality of the low-rank speaker and text subspaces for maximal discrimination, as opposed to requiring heuristic manual selection.

Let $X = \{x_{ijk}\}$, $U = \{u_i\}$, and $V = \{v_j\}$. In order to maximize the likelihood of the data set $X$ with respect to the parameters $\theta = \{\mu, \Sigma_u, \Sigma_v, \Sigma_\epsilon\}$, the classical EM algorithm [11] is employed.
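To make the generative process concrete, the following sketch draws synthetic vectors as the sum of an overall mean, one latent vector per speaker, one latent vector per phrase, and per-session noise, all Gaussian with diagonal covariance. The function name `sample_dojoba`, the argument names, and the toy sizes are hypothetical and serve only as an illustration.

```python
import random


def sample_dojoba(num_speakers, num_phrases, num_sessions,
                  mu, sigma_u, sigma_v, sigma_eps, seed=0):
    """Draw synthetic vectors x[i, j, k] = mu + u[i] + v[j] + eps[i, j, k].

    sigma_u, sigma_v and sigma_eps hold the diagonal variances of the
    speaker, phrase and noise variables.
    """
    rng = random.Random(seed)

    def draw(variances):
        # One zero-mean Gaussian sample per dimension.
        return [rng.gauss(0.0, var ** 0.5) for var in variances]

    speakers = [draw(sigma_u) for _ in range(num_speakers)]
    phrases = [draw(sigma_v) for _ in range(num_phrases)]
    data = {}
    for i, u in enumerate(speakers):
        for j, v in enumerate(phrases):
            for k in range(num_sessions):
                eps = draw(sigma_eps)
                data[(i, j, k)] = [m + a + b + e
                                   for m, a, b, e in zip(mu, u, v, eps)]
    return data, speakers, phrases


data, speakers, phrases = sample_dojoba(
    3, 2, 4, mu=[0.5, -0.5], sigma_u=[1.0, 2.0],
    sigma_v=[0.5, 0.5], sigma_eps=[0.01, 0.01])
print(len(data))
```

Because the speaker and phrase variables are drawn once and reused, all sessions of one speaker share the same speaker vector and all utterances of one phrase share the same phrase vector, which is the tying described above.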
3.2 EM formulation
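The auxiliary function for EM is

$$Q(\theta \mid \theta_{old}) = E_{U, V \mid X, \theta_{old}} \left[ \log P(X, U, V \mid \theta) \right],$$

where $X$ denotes the training j-vectors and $U$ and $V$ denote the speaker and phrase latent variables.
By maximizing the auxiliary function, we obtain the following EM formulation.
E step: we need to calculate the posterior expectations of the latent variables $u_i$ and $v_j$ and of their second-order moments $u_i u_i^T$, $v_j v_j^T$, and $u_i v_j^T$. The equations for $v_j$ are almost the same as those for $u_i$.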
3.3 Likelihood Ratio Scores
We treat verification as a hypothesis testing problem with the null hypothesis $H_0$, where two j-vectors have the same speaker and phrase variables $u_i$ and $v_j$, and the alternative hypothesis $H_1$, where they do not. The alternative covers three cases: different underlying $u_i$ variables with the same $v_j$ variable (model $M_1$), the same $u_i$ variable with different $v_j$ variables (model $M_2$), or different underlying $u_i$ variables with different $v_j$ variables (model $M_3$). Given a test j-vector $x_t$ and an enrolled j-vector $x_s$, and letting the prior probabilities of the models $M_1$, $M_2$, $M_3$ be $p_1$, $p_2$, $p_3$, the likelihood ratio score is

$$l(x_t, x_s) = \frac{P(x_t, x_s \mid M_0)}{p_1 P(x_t, x_s \mid M_1) + p_2 P(x_t, x_s \mid M_2) + p_3 P(x_t, x_s \mid M_3)},$$

where $M_0$ denotes the model under the null hypothesis.
Notice that, as in the standard joint Bayesian model [9], we do not calculate a point estimate of the hidden variables. Instead we compute the probability that the two multi-label vectors have the same hidden variables, regardless of what these latent variables actually are.
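The scoring rule can be sketched as follows, under the assumption that a j-vector is the sum of a mean, a speaker variable, a phrase variable, and noise, all Gaussian with diagonal covariance. Under that assumption the two vectors are jointly Gaussian per dimension, and the four hypotheses differ only in the cross-covariance: speaker plus phrase variance when both are shared, phrase variance when only the phrase is shared, speaker variance when only the speaker is shared, and zero otherwise. The function names and the equal default priors are hypothetical choices.

```python
import math


def log_pair_density(x, y, var, cov):
    """Log density of a zero-mean bivariate Gaussian with equal variances
    `var` on both coordinates and covariance `cov` between them."""
    det = var * var - cov * cov
    quad = (var * x * x - 2.0 * cov * x * y + var * y * y) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_joint(x_t, x_s, mu, variances, covariances):
    """Sum of per-dimension log densities of the mean-centered pair."""
    return sum(log_pair_density(a - m, b - m, var, cov)
               for a, b, m, var, cov in zip(x_t, x_s, mu, variances, covariances))


def dojoba_llr(x_t, x_s, mu, sigma_u, sigma_v, sigma_eps,
               priors=(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)):
    """Log likelihood ratio of 'same speaker and phrase' against a prior-weighted
    mixture of the three alternatives. The priors must be positive."""
    total = [u + v + e for u, v, e in zip(sigma_u, sigma_v, sigma_eps)]
    both_shared = [u + v for u, v in zip(sigma_u, sigma_v)]
    nothing_shared = [0.0] * len(total)
    log_m0 = log_joint(x_t, x_s, mu, total, both_shared)
    log_alternatives = [
        log_joint(x_t, x_s, mu, total, sigma_v),         # different speaker, same phrase
        log_joint(x_t, x_s, mu, total, sigma_u),         # same speaker, different phrase
        log_joint(x_t, x_s, mu, total, nothing_shared),  # different speaker and phrase
    ]
    weighted = [math.log(p) + value for p, value in zip(priors, log_alternatives)]
    peak = max(weighted)  # log-sum-exp for numerical stability
    log_denominator = peak + math.log(sum(math.exp(w - peak) for w in weighted))
    return log_m0 - log_denominator


mu = [0.0, 0.0]
sigma_u, sigma_v, sigma_eps = [1.0, 1.0], [1.0, 1.0], [0.1, 0.1]
target_score = dojoba_llr([2.0, -2.0], [1.9, -2.1], mu, sigma_u, sigma_v, sigma_eps)
nontarget_score = dojoba_llr([2.0, -2.0], [-2.0, 2.0], mu, sigma_u, sigma_v, sigma_eps)
print(target_score, nontarget_score)
```

No point estimate of the latent variables is computed; they are integrated out analytically through the covariance structure of each hypothesis.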
4 Experiments
In this section, we describe the experimental setup and results for the proposed method on the public RSR2015 English corpus [1] and our internal Huiting202 Chinese Mandarin database collected by Huiting Technology (http://huitingtech.com/).
4.1 Experimental setup
The RSR2015 corpus [1], released by I2R, is used to evaluate the performance of different speaker verification systems. In this work, Part I of RSR2015 is used for testing DoJoBa. The background and development data of RSR2015 Part I are merged into a new background set used to train the j-vector extractor.
Our internal gender-balanced Huiting202 database is designed for local applications. It contains 202 speakers reading 20 different phrases, with 20 sessions per phrase. All speech files are sampled at 16 kHz. 132 randomly selected speakers are used for training the background multi-task learned DNN, and the remaining 70 speakers are used for enrollment and evaluation.
In this work, 39-dimensional Mel-frequency cepstral coefficients (MFCC: 13 static coefficients including the log energy + 13 $\Delta$ + 13 $\Delta\Delta$) are extracted and normalized using utterance-level mean and variance normalization. The input is a stack of normalized MFCCs from 11 frames (5 frames on each side of the current frame). The DNN has 6 hidden layers (with sigmoid activation function) of 2048 nodes each. During the background model development stage, the DNN is pre-trained with Restricted Boltzmann Machines (RBM) and then fine-tuned with SGD using the cross-entropy criterion. Once the DNN is trained, the j-vector can be extracted during the enrollment and evaluation stages.
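The input preparation described above (utterance-level mean and variance normalization followed by stacking 11 frames) can be sketched as below. The handling of utterance edges is not specified in the text, so repeating the first and last frame is an assumption, as are the function names and the toy features.

```python
import math


def cmvn(frames):
    """Utterance-level mean and variance normalization, per dimension."""
    count = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dim)]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / count) or 1.0
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def stack_context(frames, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following frames.

    Assumption: frames beyond the utterance boundary are replaced by the
    first or last frame.
    """
    count = len(frames)
    stacked = []
    for t in range(count):
        window = []
        for offset in range(-left, right + 1):
            index = min(max(t + offset, 0), count - 1)
            window.extend(frames[index])
        stacked.append(window)
    return stacked


# Toy utterance: 20 frames of 39-dimensional features.
frames = [[float(t + d) for d in range(39)] for t in range(20)]
normalized = cmvn(frames)
stacked = stack_context(normalized)
print(len(stacked), len(stacked[0]))
```

With 39-dimensional features and an 11-frame window, each network input has 39 × 11 = 429 values.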
4.2 Results and discussion
Four systems are evaluated and compared under the above conditions:
During evaluation, a speaker is enrolled with 3 utterances of the same phrase. The task concerns both the phrase content and the speaker identity. Nontarget trials are of three types: an impostor pronouncing wrong lexical content (impostor wrong, IW); a target speaker pronouncing wrong lexical content (target wrong, TW); and an impostor pronouncing correct lexical content (impostor correct, IC).
The joint Bayesian, jPLDA, and DoJoBa models are trained using the j-vectors. The class defined in the models is the multi-task label of both the speaker and the phrase. For each test session the j-vector is extracted using the same process, and the log likelihoods from joint Bayesian, jPLDA, and DoJoBa are then used to distinguish among the different models. The number of principal components is set to 100 and the joint Bayesian model is then estimated with 10 iterations; for a fair comparison, the speaker and phrase subspace dimensions of jPLDA and DoJoBa are both set to 100, and the jPLDA and DoJoBa models are also trained with 10 iterations.
Tables 1 and 2 compare the performance of all the above-mentioned systems in terms of equal error rate (EER) for the three types of nontarget trials. DoJoBa outperforms the standard joint Bayesian and jPLDA on both test databases. Since the DoJoBa system can exploit both the identity and the lexical information in the j-vector, it consistently performs better than the standard joint Bayesian system.
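For reference, the EER metric used in these comparisons can be computed with a simple threshold sweep, sketched below. The function name and the toy scores are hypothetical, and this plain sweep is only one of several common ways to locate the operating point where false acceptance and false rejection rates meet.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep every observed score as a threshold and return
    the average of the two error rates where they are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    eer = None
    for threshold in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (false_accept + false_reject) / 2.0
    return eer


# Toy scores; one EER is reported per nontarget trial type.
targets = [1.0, 2.0, 3.0, 4.0]
nontargets_by_type = {
    "IW": [-3.0, -2.0, -1.0, 0.0],
    "TW": [-1.0, 0.0, 0.5, 0.9],
    "IC": [0.0, 0.5, 1.5, 2.5],
}
eer_by_type = {name: equal_error_rate(targets, scores)
               for name, scores in nontargets_by_type.items()}
print(eer_by_type)
```

Evaluating each nontarget type against the same target trials mirrors the per-condition reporting used for the IW, TW, and IC columns.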
5 Conclusions
In this paper we have proposed a double joint Bayesian (DoJoBa) analysis for j-vector verification. DoJoBa is related to the joint Bayesian model and can be thought of as joint Bayesian with multiple probability distributions attached to the features. The most important advantage of DoJoBa over joint Bayesian is that multiple kinds of information can be explicitly modeled and exploited to improve the verification performance; compared to jPLDA, DoJoBa can determine the latent dimension without tuning. The reported results show that DoJoBa reduces the EER relative to the conventional systems.
References
[1] Anthony Larcher, Kong Aik Lee, Bin Ma, and Haizhou Li, “Text-dependent speaker verification: Classifiers, databases and RSR2015,” Speech Communication, vol. 60, pp. 56–77, 2014.
[2] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119.
[3] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, and Md Jahangir Alam, “JFA-based front ends for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1705–1709.
[4] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[5] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1695–1699.
[6] Ehsan Variani, Xin Lei, Erik McDermott, and Ignacio Lopez Moreno, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056.
[7] Hossein Zeinali, Hossein Sameti, Lukas Burget, Jan Cernocky, Nooshin Maghsoodi, and Pavel Matejka, “i-vector/HMM based text-dependent speaker verification system for RedDots challenge,” in INTERSPEECH, 2016.
[8] Nanxin Chen, Yanmin Qian, and Kai Yu, “Multi-task learning for text-dependent speaker verification,” in INTERSPEECH, 2015.
[9] Dong Chen, Xudong Cao, David Wipf, Fang Wen, and Jian Sun, “An efficient joint formulation for Bayesian face verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 32–46, 2017.
[10] Ziqiang Shi, Liu Liu, Mengjiao Wang, and Rujie Liu, “Multi-view (joint) probability linear discrimination analysis for j-vector based text dependent speaker verification,” in ASRU, 2017.
[11] A. P. Dempster, “Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion),” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.
[12] Y. Jiang, K. A. Lee, Z. Tang, B. Ma, A. Larcher, and H. Li, “PLDA modeling in i-vector and supervector space for speaker verification,” in ACM International Conference on Multimedia, Singapore, November 2012, pp. 882–891.
[13] Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015.