I Introduction
During the last several years, i-vectors [1] have become the dominant approach to text-independent Speaker Verification (SV). In i-vector based systems, utterances of arbitrary duration are mapped onto a low-dimensional subspace modelling both speaker and channel variability, which is estimated in an unsupervised way. The backend classifier that is usually employed is a Probabilistic Linear Discriminant Analysis (PLDA) model, which performs a linear disentanglement of the two dominant types of variability and enables the evaluation of likelihood ratios [2, 3]. The i-vector/PLDA approach, when trained and evaluated on large text-independent datasets (such as those provided by NIST [4]), has shown a remarkable consistency over the years in attaining state-of-the-art performance. More recently, neural architectures (e.g. x-vectors [5]) have managed to outperform i-vectors in most text-independent SV benchmarks, by employing recent advances in deep learning and aggressive data augmentation [6].
In parallel, the great potential of voice biometrics in commercial applications and forensics has increased the need for methods yielding state-of-the-art results with utterances of short duration. However, a straightforward application of the i-vector/PLDA model to short utterances has proven to be an inadequate solution [7]. When utterances become shorter, variations due to differences in phonetic content can no longer be averaged out, as happens with utterances of long duration (e.g. 1 min). There have been several efforts to propagate the i-vector uncertainty to the PLDA model, but they were only partially successful and the results were inconsistent across datasets [8, 7, 9].
Due to the moderate performance of i-vectors in this particular setting, text-dependent SV started attracting much attention. Text-dependent SV reduces the phonetic variations of short utterances by constraining their vocabulary to either (a) a fixed phrase, (b) a set of predefined phrases, or (c) random sequences of words coming from a specific domain, such as digits. The first two approaches yield superior performance in general, due to the matched order of acoustic events between training and runtime utterances, which prevents random and hard-to-model coarticulation effects from appearing. On the other hand, when speakers utter a predefined passphrase, the system becomes vulnerable to spoofing attacks (e.g. replay attacks), which have become a major threat to speaker recognition systems [10]. Text-prompted SV with random sequences of words from a specific domain is less vulnerable to replay attacks (yet not immune to attacks created by Text-To-Speech and Voice Conversion systems^{1} [11]) and it is employed as a means to perform liveness detection.
^{1} In this case, creating a random sequence of words from prerecorded audio is more difficult due to the coarticulation effects of words on each other, but not impossible.
In this paper, we primarily work with RSR2015 part III, aiming at enhancing the i-vector paradigm in text-prompted speaker recognition [12, 13]. One of our main motivations is to develop a method for utilizing the i-vector uncertainty tailored to text-dependent and text-prompted SV. We show that by introducing the concept of average uncertainty, a simple and effective linear digit-specific transform can be derived, which can compensate for the i-vector uncertainty without the computational burden in training and evaluation introduced by other uncertainty propagation methods [7, 14]. Building on top of our previous framework on text-dependent SV with fixed phrases [15, 16] and on the text-prompted case [17], we use digit-specific HMMs and i-vector extractors and we report extensive experimentation with respect to the frontend features (including bottleneck features), channel compensation, uncertainty-aware transforms, and backend approaches. To the best of our knowledge, the results we report are the best published on the challenging RSR2015 part III and constitute a strong baseline for newer deep learning methods (e.g. [18, 19]).
The rest of this paper is organized as follows. In Section II, we provide a detailed review of the proposed approaches to text-dependent SV and i-vector uncertainty modelling. In Section III, our digit-specific subsystems are explained. In Section IV we discuss different methods for performing uncertainty-aware channel compensation. The description of the dataset, experimental setup and results are given in Sections V and VI. Finally, in Section VII we provide a brief conclusion of our work and directions for future work.
II Related Work
In this section we present and discuss some of the recent approaches that are related to our work, with emphasis on those involving text-dependent speaker verification and uncertainty modelling.
II-A Text-dependent speaker verification
There are several interesting approaches to text-dependent SV that have been proposed over the last few years. In [20], the authors examine a passphrase-based system which is evaluated on the (proprietary) Wells Fargo text-dependent dataset. NIST datasets were also used for UBM training to overcome the small development set constraint. In [21], experiments are conducted on the same dataset and the authors propose the use of a separate Gaussian Mixture Model (GMM) mean supervector for each digit, adapted from a common UBM. Extracted supervectors undergo Nuisance Attribute Projection (NAP) and are passed to a Support Vector Machine (SVM) classifier to compute the scores. The authors show that their method outperforms the one proposed in [20] on the same dataset.
In [22, 23, 24], the authors propose a Joint Factor Analysis (JFA) approach to address the problem of SV with random digit strings, using RSR2015 part III for training and testing. JFA is employed as a feature extractor, built on top of a tied-mixture model, i.e. an HMM with shared Gaussians and digit-specific sets of weights. The tied-mixture model serves for segmenting utterances into digits, as well as for collecting digit-specific Baum-Welch statistics for JFA modelling. JFA features are either local (i.e. one per digit) or global (i.e. a single one per recording), and in the former case each local feature in the test utterance is scored against the corresponding ones in the enrolment utterances. In [23], JFA features are passed to a joint density backend (an alternative to PLDA), while in [22] the i-vector mechanics are used to incorporate the uncertainty in the backend. Finally, in [14], the authors apply the same uncertainty-aware backend to individual Gaussian mixture components, resulting in a 20% error rate reduction on RSR2015 part III.
In the aforementioned approaches, a digit-independent or adapted UBM is employed, spanning the whole acoustic space. However, obtaining a robust estimate of a JFA speaker vector (i.e. y-vector) using merely the tiny amount of information contained in a single digit (as happens with the local JFA features) proved to be very hard, causing subspace methods to yield inferior results compared to supervector-size features (i.e. z-vector). To address this data scarcity problem, a new scheme for using i-vectors in text-prompted SV is introduced in [17], where word-specific UBMs and i-vector extractors are employed. These UBMs and i-vector extractors are of small size (64-component and 175-dimensional, respectively) as they cover only the phonetic content of each individual word. Following a similar approach, in [15, 16] it is shown that i-vectors, when extracted using phrase-specific UBMs and i-vector extractors, yield superior performance compared to JFA frontend features.
II-B Modelling i-vector uncertainty
The use of i-vector uncertainty in the backend may yield notable improvement in SV with short utterances, and several methods for making use of it have been proposed. In [8, 7] the authors introduce a modified version of PLDA for propagating the i-vector uncertainty to the PLDA model and they derive an EM algorithm for PLDA training using utterances of arbitrary durations. Similarly, the use of the i-vector uncertainty in PLDA is investigated in [9], taking into account only the uncertainty in the test utterances (i.e. assuming long training and enrollment utterances). The authors in [25, 26] speed up the uncertainty propagation method by grouping i-vectors together based on their reliability and by finding a representative posterior covariance matrix for each group. In [27], the authors incorporate the uncertainty associated with frontend features into the i-vector extraction framework. Finally, in [28], an extension of uncertainty decoding using simplified PLDA scoring and modified imputation is proposed. The authors also employ the uncertainty decoding technique in Linear Discriminant Analysis (LDA) in [29].
III Digit-specific HMMs, i-vector extractors and scoring
In this section we present our HMM-based UBM, which we use to extract Baum-Welch statistics, and the method for using these statistics to train digit-dependent i-vector extractors. The scheme extends our previous method developed on the Persian months dataset [17]. In [17] we proposed a simple but effective scheme based on a separate i-vector extractor for each word with a common i-vector pipeline. In this paper, the above scheme is adapted to random digit strings and enhanced by exploring different methods for modelling uncertainty and combining it with channel compensation.
Note that in RSR2015 part III, the sequence of digits in each utterance is assumed to be known and therefore can be used during training and evaluation [12], but our methods can be extended to settings where the digit sequences should be estimated by an ASR system.
III-A Digit-specific HMMs
It is generally agreed that HMMs are a more natural solution to text-dependent SV than GMMs [30]. The partition of the Gaussians into HMM states permits us to capture the speaker characteristics over segments corresponding to phrases and words, rather than over merely spectral areas, as happens with a UBM. The HMM corresponding to digit $d$ is parametrized by a collection of state-specific GMM emission distributions $\{\lambda_{d,s}\}$ and a transition probability matrix $A_d$. We index HMM states by $s$ and Gaussian components of the GMM corresponding to HMM state $s$ by $g$.
We initialize a collection of digit-dependent HMMs, each having $S_d$ states and $C_{d,s}$ Gaussian components in each state $s$. We use the subscripts to indicate their dependence on the digit $d$ and state $s$, respectively, although in practice we use a fixed number of $S$ and $C$. The overall number of Gaussian components of the digit-specific HMM is $S \times C$. HMM training and segmentation into digits is performed using Viterbi training (i.e. a single alignment of frames to HMM states is considered), by concatenating the corresponding digit-dependent HMMs and utilizing the “left-to-right, no skips” structure. Therefore, the concatenated HMM corresponding to an utterance with $L$ digits has $L \times S$ states and $L \times S \times C$ Gaussian components overall.
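To make the sizes concrete, the following is a minimal sketch of the “left-to-right, no skips” topology obtained by chaining digit HMMs. The function name `concatenated_topology`, the self-loop probability of 0.5 and the absorbing final state are illustrative assumptions and are not taken from the paper. The 8 states and 8 components per state are the values reported in the experimental setup, and 5 digits is the length of an RSR2015 part III test utterance.

```python
import numpy as np

def concatenated_topology(n_digits, n_states, self_loop=0.5):
    """Transition matrix of a left-to-right, no-skips HMM built by chaining digit HMMs.

    self_loop is a hypothetical initial self-transition probability.
    """
    total = n_digits * n_states
    A = np.zeros((total, total))
    for i in range(total - 1):
        A[i, i] = self_loop            # stay in the current state
        A[i, i + 1] = 1.0 - self_loop  # or advance to the next state (no skips)
    A[-1, -1] = 1.0                    # assumption: the last state only loops
    return A

# A 5-digit utterance with 8 states per digit and 8 components per state.
A = concatenated_topology(n_digits=5, n_states=8)
n_gaussians = A.shape[0] * 8
```

Each row of `A` has at most two non-zero entries, which is what keeps a single Viterbi alignment over the concatenated model cheap.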
Once the HMMs are trained, we jointly perform (a) segmentation of each utterance into digits, and (b) segmentation of each frame sequence assigned to a digit $d$ into digit-specific HMM states $s$. We apply Viterbi-based forced alignment and hence obtain a hard assignment of frames to HMM states. Then, given the estimated alignment $\hat{s}_t$ of frame $o_t$ to a digit-specific HMM state, the frame posteriors corresponding to the GMM components of that state are computed as follows,
$$\gamma_t(s, g) = \delta(s, \hat{s}_t)\, \frac{w_{s,g}\, \mathcal{N}(o_t; \mu_{s,g}, \Sigma_{s,g})}{\sum_{g'=1}^{C} w_{s,g'}\, \mathcal{N}(o_t; \mu_{s,g'}, \Sigma_{s,g'})} \tag{1}$$
where $\delta(\cdot, \cdot)$ is the Kronecker delta function, and $\mathcal{N}(\cdot; \mu, \Sigma)$ is the probability density function (PDF) of the multivariate normal distribution. Note that in $\gamma_t(s, g)$ the dependence on the digit $d$ is kept implicit.
The frame posteriors, together with the corresponding Gaussian components, are used to extract zero and first order centralized statistics, $N_c$ and $F_c$, which are computed by the following equations
$$N_c = \sum_{t=1}^{T} \gamma_t(c) \tag{2}$$
$$F_c = \sum_{t=1}^{T} \gamma_t(c)\,(o_t - \mu_c) \tag{3}$$
In the above equations, $T$ is the number of frames of the utterance, $c$ is the index of the mixture component of the digit-specific mixture model, $o_t$ is the frame at time $t$ and $\gamma_t(c)$ is the posterior probability that the $t$-th frame has been emitted by the $c$-th component. Note that once the frame posteriors are calculated, the HMM structure is no longer required for extracting Baum-Welch statistics. Therefore, $c$ is used for indexing components of the flattened HMM, i.e. a GMM corresponding to the concatenated state-specific GMMs, having $S \times C$ Gaussian components overall and weights rescaled so that they sum up to 1. The flattened HMM plays the role of the UBM in text-independent speaker recognition.
III-B Digit Dependent i-vector Extractor
Due to the use of digit-specific HMMs as UBMs for collecting Baum-Welch statistics, all the following structures should also be digit-specific. This includes i-vector extractors, transforms applied to i-vectors, as well as trainable backends (e.g. PLDA).
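As an illustration of the digit-specific Baum-Welch statistics referred to above, the sketch below accumulates zero-order and centralized first-order statistics from a hard Viterbi alignment. It assumes diagonal-covariance Gaussians and within-state mixture weights; the function and argument names are hypothetical and do not come from the authors' code.

```python
import numpy as np

def gaussian_logpdf_diag(x, means, variances):
    """Log N(x; mean_c, diag(var_c)) for every component c; x is (D,), means/variances are (C, D)."""
    diff = x - means
    return -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                   + np.sum(diff ** 2 / variances, axis=1))

def baum_welch_stats(frames, state_of_frame, weights, means, variances, state_of_comp):
    """Zero-order (N) and centralized first-order (F) statistics over the flattened HMM.

    frames:           (T, D) features of one digit segment
    state_of_frame:   (T,)   hard Viterbi state label of each frame
    weights:          (C,)   within-state mixture weights
    means, variances: (C, D) diagonal Gaussians (assumption)
    state_of_comp:    (C,)   HMM state that owns each component
    """
    n_comp, dim = means.shape
    N = np.zeros(n_comp)
    F = np.zeros((n_comp, dim))
    for x, s in zip(frames, state_of_frame):
        log_post = np.log(weights) + gaussian_logpdf_diag(x, means, variances)
        # Kronecker delta: components outside the aligned state get zero posterior.
        log_post[state_of_comp != s] = -np.inf
        post = np.exp(log_post - np.max(log_post))
        post /= post.sum()
        N += post
        F += post[:, None] * (x - means)
    return N, F

# Toy example: 2 states with 2 components each, 5 frames of dimension 3.
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 3))
variances = np.ones((4, 3))
weights = np.full(4, 0.5)
state_of_comp = np.array([0, 0, 1, 1])
frames = rng.normal(size=(5, 3))
state_of_frame = np.array([0, 0, 0, 1, 1])
N, F = baum_welch_stats(frames, state_of_frame, weights, means, variances, state_of_comp)
```

Because the alignment is hard, the occupancies of the components of a state sum to the number of frames aligned to that state.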
The supervector $s$ of an utterance associated with a digit $d$ is assumed to be generated from the following equation
$$s = m_d + \mathbf{T}_d\, y \tag{4}$$
where $\mathbf{T}_d$ is a low-rank matrix representing the subspace spanning the dominant variability in the supervector space, and $m_d$ is the supervector corresponding to the digit-specific flattened HMM. Moreover, $y$ is a latent variable with standard normal distribution as a prior. Given the Baum-Welch statistics of an utterance, the posterior distribution of $y$ is normal with mean $E[y]$ and covariance matrix $\mathrm{cov}(y)$ estimated as follows
$$E[y] = \mathrm{cov}(y)\, \mathbf{T}_d^{T} \Sigma_d^{-1} F \tag{5}$$
$$\mathrm{cov}(y) = \left(I + \mathbf{T}_d^{T} \Sigma_d^{-1} N\, \mathbf{T}_d\right)^{-1} \tag{6}$$
where $N$ and $F$ are the zero and centralized first order statistics (using the means of the corresponding digit-specific HMM), arranged as a block-diagonal matrix and a supervector respectively, and $\Sigma_d$ is a block-diagonal covariance matrix obtained from the corresponding digit-specific HMM.
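The posterior described above follows the standard i-vector formulas, which can be sketched as below. This is a minimal illustration under the assumption of diagonal UBM covariances; `ivector_posterior`, `T_mat` and the toy dimensions are hypothetical names and values, not the authors' implementation.

```python
import numpy as np

def ivector_posterior(N, F, T_mat, variances):
    """Posterior mean and covariance of the latent variable given Baum-Welch statistics.

    N:         (C,)      zero-order statistics
    F:         (C, D)    centralized first-order statistics
    T_mat:     (C*D, R)  low-rank subspace matrix
    variances: (C, D)    diagonal UBM covariances (assumption)
    """
    n_comp, dim = F.shape
    rank = T_mat.shape[1]
    inv_var = (1.0 / variances).reshape(n_comp * dim)
    occupancy = np.repeat(N, dim)  # N_c repeated for each feature dimension
    precision = np.eye(rank) + T_mat.T @ (T_mat * (occupancy * inv_var)[:, None])
    cov = np.linalg.inv(precision)
    mean = cov @ (T_mat.T @ (inv_var * F.reshape(n_comp * dim)))
    return mean, cov

rng = np.random.default_rng(0)
C, D, R = 4, 3, 2
T_mat = rng.normal(size=(C * D, R))
variances = np.ones((C, D))
N = np.array([2.0, 1.0, 0.0, 3.0])
F = rng.normal(size=(C, D))
mean, cov = ivector_posterior(N, F, T_mat, variances)

# With no frames at all the posterior falls back to the standard normal prior.
mean0, cov0 = ivector_posterior(np.zeros(C), np.zeros((C, D)), T_mat, variances)
```

The posterior covariance shrinks as the occupancies grow, which is why short segments such as single digits carry high uncertainty.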
III-C Digit-specific scoring
After extracting digit-specific i-vectors and applying a set of transforms (to be discussed in Sect. IV), scoring is also implemented in a digit-specific fashion, i.e.
$$\mathrm{score} = \frac{1}{|D|} \sum_{d \in D} S\left(\bar{y}^{e}_{d}, y^{t}_{d}; \theta_d\right) \tag{7}$$
where superscripts $e$ and $t$ indicate enrollment and test respectively, $D$ is the set of digits appearing in the test utterance of the trial ($|D| = 5$ in RSR2015 part III), and $S(\cdot, \cdot; \theta_d)$ is a similarity measure on the i-vector space (e.g. cosine similarity or a PLDA-based log-likelihood ratio, among others), which is a function of parameters $\theta_d$ (e.g. transforms for channel compensation, PLDA parameters) and can also include score normalization. Finally, we use $\bar{y}^{e}_{d}$ to denote the averaged enrollment i-vectors of the digit $d$, since there might be more than one i-vector of the same digit on the enrollment side (e.g. three in RSR2015 part III).
This scoring rule is identical to the “local” approach proposed in [24]. The rationale is to break down the utterances into segments of limited phonetic content (e.g. words, digits) in order to suppress the phonetic variability between enrollment and test segments. A caveat is that certain segments of the enrollment utterances are not used in each trial, as the test utterance may not contain all the words appearing in the enrollment. In the case of RSR2015 part III, about 50% of the enrolment frames are used in each trial, since $|D| = 5$.
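The digit-wise scoring rule can be sketched as follows, using plain cosine similarity as the similarity measure and omitting transforms and score normalization. The dictionary-based data layout and the function names are assumptions made for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def digit_specific_score(enroll, test):
    """Average, over the digits of the test utterance, of the similarity between
    the test i-vector and the mean enrollment i-vector of the same digit.

    enroll: dict mapping digit -> list of enrollment i-vectors
    test:   dict mapping digit -> test i-vector
    """
    scores = []
    for digit, y_test in test.items():
        y_enroll = np.mean(enroll[digit], axis=0)  # average enrollment i-vectors
        scores.append(cosine(y_enroll, y_test))
    return float(np.mean(scores))

enroll = {
    "1": [np.array([1.0, 0.0]), np.array([1.0, 0.0])],
    "2": [np.array([0.0, 1.0])],
    "3": [np.array([1.0, 1.0])],
}
test = {"1": np.array([2.0, 0.0]), "2": np.array([1.0, 0.0])}
score = digit_specific_score(enroll, test)
```

Digit "3" is enrolled but absent from the test utterance, so its enrollment i-vector is simply not used in this trial, which mirrors the caveat stated above.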
III-D Differences between the proposed method and tied-mixture models
Despite certain similarities between our method and the one in [22, 23, 24], the two methods are substantially different. Aside from differences in (a) subspace modelling (i-vectors vs. JFA features), (b) linear transforms applied to i- or y-vectors, and (c) backends (cosine distance vs. joint-density models), there are differences in the way frames are assigned to Gaussian components. We propose digit-specific HMMs of $S \times C$ Gaussian components each (the number of states times the number of components per state), without sharing them between digits or states, while in the tied-mixture approach all Gaussian components are shared between digits, with the weights being the only digit-specific set of parameters. As a result, digit-specific i-vectors (or y-vectors [24]) using tied-mixture models are extracted over highly sparse Baum-Welch statistics, and are therefore characterized by high posterior uncertainty. Moreover, in the tied-mixture model approach the HMM structure is merely employed for segmenting utterances into digits, while we propose digit-specific HMMs to segment each digit into subword units. As a result, the Gaussian components are localized in the joint temporal and spectral domain, while in the tied-mixture approach they are merely localized in the spectral domain, via a standard UBM.
IV i-vector uncertainty and channel compensation
Due to the unsupervised way the i-vector extractors are trained, the i-vector space contains both speaker and session variability. Since only speaker information is useful to verify a speaker, a strategy for removing undesirable session effects is required. In parallel, in short-duration SV the problem of increased uncertainty should also be addressed. To this end, we propose three methods for channel and uncertainty compensation, which are explained in this section. Fig. 1 illustrates the block diagram of the whole system, where all the examined compensation methods are depicted. In this figure, based on the selected method for uncertainty and channel compensation, one of the parallel switches is activated.
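IV-A Between- and within-class covariance and uncertainty
It is well known that the total variability covariance matrix $S_{tot}$ can be decomposed into between-class and within-class covariance matrices, $S_b$ and $S_w$, as follows
$$S_{tot} = S_b + S_w \tag{8}$$
$$S_b = \frac{1}{n} \sum_{s} n_s\, (\bar{y}_s - \bar{y})(\bar{y}_s - \bar{y})^T \tag{9}$$
$$S_w = \frac{1}{n} \sum_{s} \sum_{i \in s} (y_i - \bar{y}_s)(y_i - \bar{y}_s)^T \tag{10}$$
where $\bar{y}$ is the global mean and $\bar{y}_s$ is the mean of the $n_s$ i-vectors of speaker $s$.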
However, by defining $S_{tot}$ in the above way we are essentially treating i-vectors as point estimates. In order to take into account the uncertainty in the i-vector estimates, we should redefine the total variability as follows
$$S^{u}_{tot} = \frac{1}{n}\sum_{i=1}^{n} E\left[(y_i - \bar{y})(y_i - \bar{y})^T\right] = \frac{1}{n}\sum_{i=1}^{n} \left[(E[y_i] - \bar{y})(E[y_i] - \bar{y})^T + \mathrm{cov}(y_i)\right] = S_{tot} + S_u,$$
where $S_u = \frac{1}{n}\sum_{i=1}^{n} \mathrm{cov}(y_i)$ is the average uncertainty of the i-vectors and $n$ is the overall number of i-vectors. The uncertainty in estimating $\bar{y}$ is negligible as it is equal to $S_u / n$. It is interesting to note that $S^{u}_{tot}$ is used in the i-vector extractor and in JFA during the minimum divergence estimation, where the latent variables are transformed in such a way that $S^{u}_{tot} = I$. In other words, the covariance of the aggregated posterior is set equal to the covariance of the prior distribution by transforming the i-vector extractor accordingly [1]. The principal components of $S_u$ correspond to the directions with the highest uncertainty.
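On the other hand, when dealing with short utterances, $S_u$ becomes comparable to $S_{tot}$ and it would be interesting to make use of it when performing channel compensation. We should moreover note that by decomposing $S^{u}_{tot}$ into expected within- and between-class covariance
$$S^{u}_{tot} = S^{u}_{b} + S^{u}_{w} \tag{11}$$
we may consider $S_u$ as being part of the within-class covariance, i.e.
$$S^{u}_{w} \approx S_w + S_u \tag{12}$$
and
$$S^{u}_{b} \approx S_b \tag{13}$$
This is due to the fact that the uncertainty contained in $S_b$ is smaller compared to $S_u$ by a factor of $\bar{n}$, where $\bar{n}$ is the average number of i-vectors per speaker, since
$$\mathrm{cov}(\bar{y}_s) = \frac{1}{n_s^2} \sum_{i \in s} \mathrm{cov}(y_i) = \frac{1}{n_s} S_{u,s} \tag{14}$$
where $S_{u,s}$ is the average uncertainty of i-vectors of speaker $s$.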
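The quantities discussed in this subsection can be computed from posterior means and covariances as in the sketch below. It uses the usual occupancy-weighted definitions of the between- and within-class covariances and treats the average uncertainty as part of the within-class term, which is the approximation argued for above. All names are hypothetical.

```python
import numpy as np

def covariance_decomposition(means, covs, speakers):
    """Total, between-class, within-class covariance and average uncertainty.

    means:    (n, R)    posterior means of the i-vectors
    covs:     (n, R, R) posterior covariances of the i-vectors
    speakers: (n,)      speaker label of each i-vector
    """
    n, rank = means.shape
    y_bar = means.mean(axis=0)
    centered = means - y_bar
    S_tot = centered.T @ centered / n
    S_u = covs.mean(axis=0)  # average uncertainty
    S_b = np.zeros((rank, rank))
    S_w = np.zeros((rank, rank))
    for s in np.unique(speakers):
        ys = means[speakers == s]
        mu_s = ys.mean(axis=0)
        d = (mu_s - y_bar)[:, None]
        S_b += len(ys) * (d @ d.T)
        S_w += (ys - mu_s).T @ (ys - mu_s)
    return S_tot, S_b / n, S_w / n, S_u

rng = np.random.default_rng(0)
means = rng.normal(size=(12, 3))
covs = np.stack([0.1 * np.eye(3)] * 12)
speakers = np.repeat(np.arange(4), 3)
S_tot, S_b, S_w, S_u = covariance_decomposition(means, covs, speakers)

S_tot_u = S_tot + S_u  # total variability with uncertainty
S_w_u = S_w + S_u      # uncertainty assigned to the within-class term
```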
IV-B Digit dependent Uncertainty and Channel Compensation
We examine here our three different proposed approaches, as well as regularized LDA, for applying session and uncertainty compensation. In all cases, the transformed vectors are obtained as $\hat{y} = V^{T} y$, where $V$ is the projection matrix of the corresponding method.
IV-B1 Uncertain LDA
LDA is a standard technique to compensate for inter-session variability by finding a set of speaker-discriminant non-orthogonal directions and projecting the i-vectors onto the subspace they define [1]. LDA minimizes the within-class variability while maximizing the between-class variability. Using the expectations of these matrices, the objective function of LDA becomes
$$J(V) = \mathrm{tr}\left\{ \left(V^{T} S^{u}_{w} V\right)^{-1} \left(V^{T} S^{u}_{b} V\right) \right\} \tag{15}$$
where $V$ is the projection matrix, and $S^{u}_{w}$ and $S^{u}_{b}$ denote the expected within- and between-class covariance matrices. By solving the above equation using generalized eigenvalue decomposition, uncertainty-aware channel compensation can be applied to i-vectors [29].
IV-B2 Digit dependent uncertain WCCN
Within-Class Covariance Normalization (WCCN) is a popular technique for channel compensation that uses the Cholesky decomposition of the inverse within-class covariance matrix (10) to project the input features. In speaker recognition, it is typically used before applying length normalization or cosine distance scoring [1]. The uncertain version of WCCN is as follows
$$V V^{T} = \left(S_w + S_u\right)^{-1} \tag{16}$$
where $V$ is the projection matrix, $S_w$ is the within-class covariance matrix and $S_u$ is the average uncertainty of the i-vectors.
IV-B3 Digit Dependent Uncertainty Normalization
Finally, we propose a novel technique which we call Uncertainty Normalization. In this case, we use only the average uncertainty and we ignore the clustering structure of i-vectors into speakers. It is an unsupervised method and therefore it does not require multiple recordings per speaker. The rationale is to project the i-vectors onto a space that downscales directions exhibiting high uncertainty, since their estimates are less reliable. Similarly to uncertain WCCN, it is defined as follows
$$V V^{T} = S_u^{-1} \tag{17}$$
where $V$ is the projection matrix and $S_u$ is the average uncertainty of the i-vectors.
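Both uncertain WCCN and Uncertainty Normalization can be sketched as whitening transforms. The sketch assumes that the projection is the Cholesky factor of the inverse of the matrix being normalized (as in standard WCCN) and that vectors are transformed as the transpose of the projection times the i-vector. Both are assumptions about notation, and the function names are hypothetical.

```python
import numpy as np

def whitening_projection(S):
    """Return V such that V V^T = S^{-1}, hence V^T S V = I."""
    return np.linalg.cholesky(np.linalg.inv(S))

def uncertain_wccn(S_w, S_u):
    """WCCN in which the average uncertainty is added to the within-class covariance."""
    return whitening_projection(S_w + S_u)

def uncertainty_normalization(S_u):
    """Unsupervised variant that uses only the average uncertainty."""
    return whitening_projection(S_u)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
S_u = A @ A.T + np.eye(3)  # toy symmetric positive definite matrices
S_w = B @ B.T + np.eye(3)

V_un = uncertainty_normalization(S_u)
V_uw = uncertain_wccn(S_w, S_u)
y = rng.normal(size=3)
y_hat = V_un.T @ y         # transformed i-vector
```

Directions in which the average uncertainty is large are scaled down by `V_un`, which matches the stated rationale of trusting unreliable directions less.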
IV-B4 Regularized LDA
LDA has the constraint of reducing the dimensionality to at most $K - 1$, where $K$ is the number of classes. Yet, in RSR2015 the number of training speakers is smaller than the i-vector dimension. To overcome this limitation and avoid dimensionality reduction we add a simple regularization term to $S_b$. The regularized version of LDA yields better results than standard LDA in the text-independent task too [31]. In our experiments, we combine Regularized LDA with Uncertain WCCN and Uncertainty Normalization.
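A minimal sketch of regularized LDA is given below. The exact regularizer is not preserved in the text, so the sketch assumes a scaled identity added to the between-class covariance, which is one simple way to lift the rank limit. The value of `alpha` and the function name are hypothetical. The generalized eigenvalue problem is solved by whitening with the Cholesky factor of the within-class covariance.

```python
import numpy as np

def regularized_lda(S_b, S_w, alpha=0.01):
    """Full-rank LDA projection from a regularized between-class covariance.

    alpha * I is an assumed regularization term; the paper's exact choice is not shown here.
    """
    rank = S_w.shape[0]
    S_b_reg = S_b + alpha * np.eye(rank)
    L = np.linalg.cholesky(S_w)              # S_w = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_b_reg @ L_inv.T
    M = 0.5 * (M + M.T)                      # remove numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]        # most discriminant directions first
    V = L_inv.T @ eigvecs[:, order]          # generalized eigenvectors of (S_b_reg, S_w)
    return V, eigvals[order]

rng = np.random.default_rng(0)
R = 5
B = rng.normal(size=(R, 2))
S_b = B @ B.T                                # rank-2 between-class covariance
W = rng.normal(size=(R, R))
S_w = W @ W.T + np.eye(R)
V, eigvals = regularized_lda(S_b, S_w, alpha=0.01)
```

Without the regularizer only two directions would carry non-zero eigenvalues in this toy case; with it, all five columns of `V` are usable and no dimensionality reduction is forced.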
V Experimental Setup
V-A Datasets
We used the RSR2015 part III dataset for almost all our experiments. In this dataset, there are 157 male and 143 female speakers, divided into three disjoint speaker subsets: background, development and evaluation, of about 100 speakers each. Each speaker model is enrolled with three 10-digit utterances, recorded with the same handset, while each speaker contributes 3 different speaker models. Test utterances contain a quasi-random string of 5 digits, one out of 52 unique strings. Six commercial mobile devices were used for the recordings, which took place in a typical office environment. All utterances are in English, while speakers are balanced in such a way that they form a representative sample of the Singaporean population [12, 24].
Apart from RSR2015 part III, two clean parts of the 16 kHz LibriSpeech dataset (namely Train-Clean-100 and Train-Clean-360 [32]) are used for training a DNN model and performing experiments with Bottleneck (BN) features. The dataset contains English speech which is automatically aligned and segmented.
V-B Baseline and state-of-the-art
As a baseline method, we refer to the experiments performed by CRIM [24], where both subspace and supervector domain methods are investigated. For fair comparison, we used the same setup as in [22, 23, 24] and our baseline results are copied from the reference paper. The number of trials can be found in Table I.
To the best of our knowledge, the current state-of-the-art in RSR2015 part III is the model presented in [33]. The proposed system makes use of a DNN trained either on Fisher data or on RSR2015. Two main approaches are examined, namely DNN posteriors with MFCC features and tandem features, i.e. bottleneck features concatenated with MFCCs.
V-C Features
We use 60-dimensional PLP or MFCC features, extracted using HTK with a similar configuration: 25 ms Hamming-windowed frames with 15 ms overlap. For each utterance, the features are normalized using Cepstral Mean and Variance Normalization (CMVN). A separate silence model is used for performing supervised Voice Activity Detection (VAD). Silent frames are removed after applying Viterbi alignment.
In addition to the cepstral features, a set of experiments is performed to examine the effectiveness of bottleneck (BN) and tandem features in the text-prompted task. To this end, a neural network is trained following the stacked architecture described and evaluated in [34, 35]. Based on the results reported in [35], this architecture exhibits very good performance in text-dependent SV. The output layer (softmax) has about 9000 senones, its input spans a context of 30 frames around the current frame, and it is trained using cross-entropy loss. Finally, the 80-dimensional BN features are concatenated to the cepstral features and used as input features to the i-vector pipeline.
V-D Model dimensions and gender dependence
Digit-specific HMMs with 8 states and 8 components per state are used as UBM, while the i-vector dimensionality is set to 300. Gender-independent UBMs and i-vector extractors are trained using only the background set of RSR2015. The background set is also used for training gender-dependent LDA transforms as well as for score normalization. LDA and score normalization are applied in a digit-dependent manner. The MSR open-source toolbox was used as a base for developing our code [36].
V-E Scoring method
In our proposed system we use score-normalized cosine distance. As Eq. (7) shows, for each digit-dependent test i-vector we extract, its cosine similarity with the average of the corresponding digit-dependent i-vectors from the enrolment utterances of the speaker is computed, and the total score of the utterance is evaluated as the average score [17]. It is worth mentioning that the proposed verification system uses a simple scoring method, while other uncertainty-aware approaches typically require more complicated and computationally demanding methods, such as PLDA with uncertainty propagation [7].
V-F Score and Length Normalization
Score normalization is essential when cosine distance scoring is employed [1]. After experimenting with several score normalization methods, we found that S-Norm yields the best performance. Therefore, for all the reported experiments and unless explicitly stated otherwise, S-Norm is applied in a gender- and digit-dependent manner, using the training set for collecting the cohort set of speakers.
Although implicit in cosine distance scoring, length normalization helps towards obtaining more Gaussian-like distributions [37]. It is therefore useful to apply it before LDA (and after uncertainty normalization), as the latter assumes Gaussian distributed class means and class-conditional observations.
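The two normalization steps can be sketched as below. The S-Norm function assumes the common symmetric formulation, in which the raw score is standardized with the cohort statistics of both the enrollment side and the test side and the two results are averaged. The cohort score arrays and the function names are hypothetical inputs for the example.

```python
import numpy as np

def length_normalize(y):
    """Scale each vector to unit Euclidean norm."""
    y = np.asarray(y, dtype=float)
    return y / np.linalg.norm(y, axis=-1, keepdims=True)

def s_norm(raw_score, enroll_vs_cohort, test_vs_cohort):
    """Symmetric score normalization (assumed formulation).

    enroll_vs_cohort: scores of the enrollment model against the cohort
    test_vs_cohort:   scores of the test segment against the cohort
    """
    z_enroll = (raw_score - np.mean(enroll_vs_cohort)) / np.std(enroll_vs_cohort)
    z_test = (raw_score - np.mean(test_vs_cohort)) / np.std(test_vs_cohort)
    return 0.5 * (z_enroll + z_test)

unit = length_normalize(np.array([3.0, 4.0]))
normalized = s_norm(2.0, np.array([0.0, 2.0]), np.array([1.0, 3.0]))
```

In a gender- and digit-dependent setup, the cohort scores would be computed separately for each gender and digit, using cohort i-vectors of the same digit.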
VI Results
The evaluation metrics we report to assess the performance of the proposed methods are the Equal Error Rate (EER) and the Detection Cost Functions (DCFs) defined for NIST-SRE08 and NIST-SRE10, namely the old Normalized DCF and the new Normalized DCF.
VI-A Baseline, state-of-the-art and our methods
Table II shows the comparison between the proposed methods and several flavors of the baseline system. We select the best single system on this dataset from [24] and fusion results of single systems with different combinations. The y-vector and z-vector are JFA features with and without speaker subspace, respectively.
We also report results using speaker embeddings (x-vectors [5]), which define the state-of-the-art in text-independent speaker recognition. The model attains state-of-the-art results on the Speakers In-The-Wild benchmark (namely 2.32% EER on Eval Core [38]). The x-vector architecture is trained using a large dataset with more than 7K speakers (VoxCeleb 1 and 2 [39, 40]), compared to the 97 speakers used to train the i-vector extractors. All results reported are derived using an identical evaluation setup, ensuring a fair comparison.
In addition, Fig. 2 shows the DET curves of some selected systems from Table II for female speakers. In the third and fourth sections of this table we report results for the systems with PLP and MFCC features. We observe that for both genders, MFCCs outperform PLPs in almost all experiments. Based on these results, the system with MFCC features is considered the best single system. Moreover, score-level fusion results of the two systems are given in the fifth section of Table II.
VI-B Uncertainty Normalization, Channel Compensation and Score Normalization
As Table II shows, the proposed uncertainty normalization method attains the best results. Hence, it is worth further analyzing its performance, e.g. by deactivating channel compensation (i.e. Regularized LDA) and score normalization.
In Table III, we report results using several such combinations, as well as an experiment with PLDA. First of all, we observe that the contribution of Regularized LDA is rather minor compared to uncertainty normalization. This result is rather surprising; it shows that state-of-the-art performance can be attained even without explicit channel modelling, i.e. without the need to collect multiple training recordings per speaker coming from different channels, sessions or handsets. We mention again that in RSR2015 part III the enrolment utterances for a given speaker come from a single handset, which is different from the ones used in the test utterances [12].
Finally, we examine the effectiveness of Gaussian PLDA as a backend. To this end, we train digit-dependent PLDA models using the RSR2015 part III training set. After experimentation, we found that the combination of uncertainty normalization, regularized LDA and a number of speaker factors equal to 50 yields the best performance, while S-Norm does not yield any further gains. However, the performance of even the best PLDA configuration is clearly inferior to that attained by cosine distance. We believe that the failure of PLDA is due to the small number of training speakers in RSR2015, which prevents us from robustly estimating the speaker subspace.
VI-C Comparison with x-vector
The embedding extractor is implemented using the standard Kaldi recipe, and it is trained on VoxCeleb 1 and 2 (containing more than 7K speakers) [39]. The PLDA model used for evaluating LLRs is also trained on VoxCeleb, while we also report results where the RSR2015 training set is employed for PLDA training or adaptation. For enrolling the speakers, three utterances are concatenated and a single x-vector is extracted, while each evaluation utterance is likewise represented by a single x-vector. The results in the third row of Table II show that the best performance is attained by training PLDA on VoxCeleb without any adaptation. However, our proposed method performs notably better than this. To improve the performance of x-vectors, recently proposed methods for applying domain adaptation to the x-vector extractor (e.g. using Generative Adversarial Networks [41, 42]) are worth exploring, in order to reduce the mismatch in channel and accent between VoxCeleb and RSR2015.
VI-D Using bottleneck features
Neural approaches using DNNs trained for ASR have resulted in significant improvements in SV, especially in text-independent SV, where the text is unknown and DNNs help towards assigning frames to ASR recognition units (e.g. senones) [43]. Some recent works apply DNNs to text-dependent SV and report notable improvements [15, 35]. Hence, it is worth examining the performance of bottleneck and tandem features extracted from a DNN in the text-prompted case. Since the best performance among uncertainty and channel compensation methods is attained by uncertainty normalization followed by regularized LDA, we report the results using only this method. Table IV shows the results obtained by the 80-dimensional bottleneck feature vector, by its concatenation with the MFCC feature vector (i.e. tandem features) and by their fusion with other cepstral features. The results show that although the performance of bottleneck features without any fusion is poor, fusing tandem features with other cepstral features yields significant improvements. The reason for this degradation could be the randomness in the digit sequence compared to the fixed sequences of other text-dependent tasks, as well as the fact that we did not use in-domain RSR2015 data to fine-tune the network. It is also apparent from the results that tandem features yield a more notable improvement for female speakers.
In Table V we examine the performance of our best single-feature model (i.e. with MFCC) by varying the number of states per HMM. As we observe, the performance is rather insensitive to the number of states, with only a slight advantage for a setting other than the one we adopt. However, we choose to use 8 states (Section V-D) for the rest of the experiments, since the differences are minor and the algorithm becomes less computationally and memory demanding.
VI-E The effect of length normalization
It is generally agreed that applying length normalization before LDA improves its performance. In order to reexamine its positive effect we perform an experiment to compare the performance of length normalization followed by LDA and LDA without length normalization. The system with MFCC features and uncertainty normalization is used as the single system in this experiment. Table VI shows that although cosine similarity scoring applies length normalization implicitly, applying length normalization before LDA and after uncertainty compensation is beneficial. As discussed above, length normalization makes vectors more normally distributed, which is in line with the Gaussian assumptions of LDA.
VI-F Results on phrases using the RedDots corpus
Although we developed our method primarily for words as recognition units, we can evaluate it on short phrases in a similar way. For experimentation on phrases, RSR2015 part I used to be a standard option; however, it is now considered too easy a corpus [7]. RedDots is more challenging in terms of channel variability, mostly due to (a) the longer time intervals between successive recordings of the same speaker, and (b) the higher levels of background noise [44].
The main caveat of RedDots is the lack of a training set, due to the small number of participants (49 males and 13 females, with only 35 males and 6 females having target trials). This shortcoming prevents us from evaluating our method on the whole set of RedDots phrases, since training utterances of the evaluation phrases are compulsory in order to train our models. Nevertheless, two of the RedDots phrases (namely the 33rd and the 34th) are also contained in RSR2015 Part I, enabling us to train our models on the corresponding training utterances of RSR2015.
In Table VII we report the performance on maleonly trials, as the number of female speakers is too small for drawing any conclusion. The results are averaged over the 33rd and 34th RedDots phrases. Our focus is on the ImpostorCorrect results, i.e. with nontarget trials containing the correct phrase, since we believe ASRbased methods are better suited to determining whether or not the uttered phrases match the prompted ones. The results show that uncertainty normalization is more effective than regularized LDA, attaining drastic relative improvements at low false acceptance operating points (63% in and 45% in ) and a 30% relative improvement in terms of EER. These improvements are attained without any channel compensation, i.e. without requiring repetitions of the same phrase from each training speaker. Finally, by combining uncertainty normalization with regularized LDA a further small improvement is attained.
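For readers less familiar with the metrics, the following sketch shows one simple way to compute an EER from trial scores and to express a relative improvement. It is an illustration only: the function names and the toy scores are hypothetical, the threshold sweep is a basic approximation rather than the evaluation tool used for Table VII, and the numbers are not taken from our results.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (as a fraction) by sweeping over observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a nontarget trial scored at or above it.
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer

def relative_improvement(baseline, improved):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Toy scores, for illustration only.
targets = [2.0, 1.5, 1.0, 0.2]
nontargets = [0.5, 0.1, -0.3, -1.0]
eer = equal_error_rate(targets, nontargets)

# A hypothetical error rate dropping from 0.10 to 0.07.
gain = relative_improvement(0.10, 0.07)
```

The same `relative_improvement` formula applies to any error metric where lower is better, including detection costs at low false acceptance operating points.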
VI-G Discussion
The comparison of our results with the tiedmixture model approach (Table II) shows that our single system outperforms the baseline by a large margin. In fact, its performance is superior not only to all single baseline systems, but also to the fusion of all systems in [24]. Furthermore, the use of ivectors rather than supervector size features (vectors) makes our methods significantly faster. Additionally, memory requirements for each speaker are considerably lower than those of the baseline. Moreover, our system attains higher performance compared to the current stateoftheart (which is based on DNNs [33]), even when a single system is used and without training on any external dataset (the Fisher dataset is used to train the tandem feature system in [33]).
In terms of uncertainty and channel compensation methods, uncertainty normalization followed by length normalization and LDA is the most effective combination (Table IV). The results in Table II are consistent across features (MFCC and PLP) and genders, while the experiments on RedDots reaffirm the effectiveness of the proposed sequence of transforms, yielding drastic improvements especially in the low false alarm area (Table VII). In terms of frontend features, MFCC perform consistently better than PLP in both genders, while bottleneck features seem to be only marginally effective, and only when fused with MFCC. Bottleneck features perform very well in textindependent speaker recognition, especially when used as a means to assign frames into UBM components [45] [46]. However, there is a severe mismatch between the way frames are assigned in textindependent speaker recognition and our proposed HMMbased method. For example, the large context window used in the former does not necessarily provide the finegrained temporal localization required in order to segment each digit into states. More recent endtoend methods may be more effective ways of using DNNs for textdependent speaker recognition than bottleneck features (e.g. [47], [48]), with the caveat that they require large amounts of indomain data, which are not available in RSR2015 part III.
VI-H Scaling up to larger vocabulary
In cases where a larger vocabulary can be employed the proposed method may suffer from data fragmentation. The number of overall training examples should scale linearly with the number of words, as no parameter sharing is assumed between the wordspecific models. In such cases, introducing parameter sharing between the models (especially between the several wordspecific HMMs and ivector extractors) should be considered. Although such a setting is beyond the scope of this work, one may start with a typical largesize UBM/ivector system (e.g. with 2048 Gaussian components) trained on textindependent datasets or on the available indomain dataset. Then for each word in the vocabulary, the most dominant Gaussian components should be selected and their means should possibly be reestimated e.g. via meanonly MAP adaptation, with the remaining components being removed. Wordspecific ivector extractors on top of the wordspecific UBM can then be derived by (a) keeping only those rows corresponding to the most dominant Gaussian components, and (b) refining the matrix by applying e.g. minimum divergence training (i.e. without reestimating the subspace). One may also consider starting from a higher dimensional ivector extractor (e.g. 600) and selecting the most dominant dimensions for each wordspecific extractor.
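The component selection outlined above could be prototyped along the following lines. This is a sketch of an untested idea, not a method evaluated in this paper. Several choices are assumptions made only for illustration: dominance is measured by the zeroth-order statistics (occupancy) of a word's training frames, the relevance factor of 16 is a common default for MAP adaptation, the extractor matrix is laid out as one block of rows per Gaussian component, and all function names are hypothetical. The minimum divergence refinement step is omitted.

```python
import numpy as np

def select_dominant_components(occupancy, n_keep):
    """Indices of the n_keep components with the largest occupancy."""
    occupancy = np.asarray(occupancy, dtype=float)
    return np.sort(np.argsort(occupancy)[::-1][:n_keep])

def map_adapt_means(ubm_means, occupancy, first_order, relevance=16.0):
    """Mean-only MAP adaptation of the UBM means towards the word data.

    ubm_means, first_order: (C, F); occupancy: (C,).
    """
    alpha = (occupancy / (occupancy + relevance))[:, None]
    data_means = first_order / np.maximum(occupancy, 1e-10)[:, None]
    return alpha * data_means + (1.0 - alpha) * ubm_means

def slice_extractor(t_matrix, keep, feat_dim):
    """Keep the rows of T, shape (C*F, R), belonging to selected components."""
    n_comp = t_matrix.shape[0] // feat_dim
    blocks = t_matrix.reshape(n_comp, feat_dim, -1)
    return blocks[keep].reshape(len(keep) * feat_dim, -1)

# Toy sizes: C=8 components, F=3 feature dims, R=4 ivector dims.
occupancy = np.array([5.0, 120.0, 0.5, 80.0, 2.0, 60.0, 1.0, 0.1])
ubm_means = np.zeros((8, 3))
first_order = occupancy[:, None] * np.ones((8, 3))  # word data mean is 1.0
t_matrix = np.arange(8 * 3 * 4, dtype=float).reshape(24, 4)

keep = select_dominant_components(occupancy, n_keep=3)
word_means = map_adapt_means(ubm_means, occupancy, first_order)[keep]
t_word = slice_extractor(t_matrix, keep, feat_dim=3)
```

Components with little occupancy are barely moved by the MAP update, which is consistent with discarding them from the wordspecific model.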
VII Conclusions and Future Work
In this paper, we developed a system for textprompted speaker verification using random digit strings. The core of the system comprises a set of digitspecific HMMs, which we employ in order to perform segmentation of utterances into digits, alignment of frames to HMM states and extraction of BaumWelch statistics. On top of these HMMs, digitspecific ivector extractors are trained, enabling us to compare digitspecific ivectors that appear in both enrolment and test utterances using simple cosine distance scoring with score normalization. Furthermore, we investigated three different methods for compensating channel and uncertainty and we concluded that the novel uncertainty normalization technique followed by LDA yields consistently superior performance. The proposed system outperforms the baseline by a large margin and yields superior performance compared to the current stateoftheart, which is based on DNNs.
We also examined the use of bottleneck features and different types of cepstral features. The experiments showed that although the performance of cepstral features is superior to that of bottleneck features, fusing tandem features with other cepstral features leads to further notable improvement. Our final set of experiments was conducted on whole phrases. To this end, the challenging RedDots corpus was used [44]. The results we reported reaffirm the effectiveness of uncertainty normalization, yielding an impressive 63% relative improvement in terms of .
For future work, we are interested in fitting certain elements of the proposed approach to endtoend neural architectures. Recently emerged approaches in textindependent speaker recognition combine endtoend deep learning methods with implicit modeling of acoustic units via multihead attention and learnable dictionaries or with mimicking the ivector/PLDA framework [18, 49, 50]. We expect that the proposed method will contribute to this research direction, by demonstrating the potential of digitspecific HMMs and ivector extractors.
Finally, we should note that the channel and uncertainty compensation approaches examined here may also be applicable to speaker embeddings. Modeling the uncertainty in xvectors is less straightforward compared to ivectors. However, recent advances in Bayesian deep learning demonstrate that model averaging via dropout is a means of quantifying the uncertainty of extracted representations [51]. As a result, uncertainty normalization may also be relevant to neural representations, such as xvectors.
VIII Acknowledgements
Themos Stafylakis is funded by the European Commission program Horizon 2020, under grant agreement no. 706668 (Talking Heads).
References
 [1] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Frontend factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [2] S. Ioffe, “Probabilistic linear discriminant analysis,” in Proc. Computer Vision–ECCV 2006. New York, NY, USA: Springer, 2006, pp. 531–542.
 [3] M. Senoussaoui, P. Kenny, P. Dumouchel, and F. Castaldo, “Wellcalibrated heavy tailed Bayesian speaker verification for microphone speech,” in Proc. ICASSP. IEEE, 2011, pp. 4824–4827.
 [4] G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, “The NIST speaker recognition evaluation–overview, methodology, systems, results, perspective,” Speech Communication, vol. 31, no. 2, pp. 225–254, 2000.
 [5] D. Snyder, D. GarciaRomero, G. Sell, D. Povey, and S. Khudanpur, “Xvectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.
 [6] D. Snyder, D. GarciaRomero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multispeaker conversations using xvectors,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.
 [7] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, “Textdependent speaker recognition using PLDA with uncertainty propagation,” in Proc. Interspeech, 2013, pp. 3684–3688.
 [8] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, “PLDA for speaker verification with utterances of arbitrary duration,” Proc. ICASSP, pp. 7649–7653, 2013.
 [9] S. Cumani, O. Plchot, and P. Laface, “On the use of ivector posterior distributions in probabilistic linear discriminant analysis,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 4, pp. 846–857, 2014.
 [10] N. Evans, M. Sahidullah, J. Yamagishi, M. Todisco, K. A. Lee, H. Delgado, T. Kinnunen et al., “The 2nd automatic speaker verification spoofing and countermeasures challenge (asvspoof 2017) database,” 2017.
 [11] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “Asvspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.
 [12] A. Larcher, K. A. Lee, B. Ma, and H. Li, “Textdependent speaker verification: Classifiers, databases and RSR2015,” Speech Communication, vol. 60, pp. 56–77, 2014.
 [13] A. Larcher, K. A. Lee, and B. Ma, “The RSR2015: Database for textdependent speaker verification using multiple passphrases,” in Proc. Interspeech, 2012.
 [14] P. Kenny, T. Stafylakis, J. Alam, V. Gupta, and M. Kockmann, “Uncertainty modeling without subspace methods for textdependent speaker recognition,” in Proc. OdysseyThe Speaker and Language Recognition Workshop, 2016, pp. 16–23.
 [15] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot, “Deep neural networks and hidden Markov models in ivectorbased textdependent speaker verification,” in Proc. OdysseyThe Speaker and Language Recognition Workshop, 2016, pp. 24–30.
 [16] H. Zeinali, H. Sameti, and L. Burget, “HMMbased phraseindependent ivector extractor for textdependent speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1421–1435, 2017.
 [17] H. Zeinali, E. Kalantari, H. Sameti, and H. Hadian, “Telephony textprompted speaker verification using ivector representation,” in Proc. ICASSP. IEEE, 2015, pp. 4839–4843.
 [18] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Selfattentive speaker embeddings for textindependent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018.
 [19] Z. Huang, S. Wang, and K. Yu, “Angular softmax for shortduration textindependent speaker verification,” Proc. Interspeech 2018, pp. 3623–3627, 2018.
 [20] H. Aronowitz, “Text dependent speaker verification using a small development set,” in Proc. OdysseyThe Speaker and Language Recognition Workshop, 2012, pp. 312–316.
 [21] S. Novoselov, T. Pekhovsky, A. Shulipa, and A. Sholokhov, “Textdependent GMMJFA system for password based speaker verification,” in Proc. ICASSP. IEEE, 2014, pp. 729–737.
 [22] P. Kenny, T. Stafylakis, J. Alam, and M. Kockmann, “An ivector backend for speaker verification,” in Proc. Interspeech, 2015, pp. 2307–2310.
 [23] T. Stafylakis, P. Kenny, J. Alam, and M. Kockmann, “JFA for Speaker Recognition with Random Digit Strings,” in Proc. Interspeech, 2015.
 [24] T. Stafylakis, J. Alam, and P. Kenny, “Text dependent speaker recognition with random digit strings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1194–1203, 2016.
 [25] W.w. Lin, M.W. Mak, and J.T. Chien, “Fast scoring for plda with uncertainty propagation via ivector grouping,” Computer Speech & Language, vol. 45, pp. 503–515, 2017.
 [26] W. Lin and M.W. Mak, “Fast scoring for plda with uncertainty propagation,” in Proc. OdysseyThe Speaker and Language Recognition Workshop, 2016, pp. 31–38.
 [27] D. Ribas, E. Vincent, and J. R. Calvo, “Uncertainty propagation for noise robust speaker recognition: the case of NISTSRE,” in Proc. Interspeech, 2015.
 [28] R. Saeidi and P. Alku, “Accounting for uncertainty of ivectors in speaker recognition using uncertainty propagation and modified imputation,” in Proc. Interspeech, 2015.
 [29] R. Saeidi, R. Astudillo, and D. Kolossa, “Uncertain LDA: Including observation uncertainties in discriminative transforms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, pp. 1479–1488, 2015.
 [30] T. Stafylakis, P. Kenny, M. J. Alam, and M. Kockmann, “Speaker and channel factors in textdependent speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 65–78, 2016.
 [31] H. Zeinali, H. Sameti, and N. Maghsoodi, “SUT Submission for NIST 2016 Speaker Recognition Evaluation: Description and Analysis,” in Proc. ROCLING, 2017.
 [32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP. IEEE, 2015, pp. 5206–5210.
 [33] J. Zhong, W. Hu, F. Soong, and H. Meng, “Dnn ivector speaker verification with short, textconstrained test utterances,” in Proc. Interspeech, 2017, pp. 1507–1511.
 [34] M. Karafiát, F. Grézl, K. Veselỳ, M. Hannemann, I. Szőke, and J. Černockỳ, “But 2014 babel system: Analysis of adaptation in nn based systems,” in Proc. Interspeech, 2014.
 [35] H. Zeinali, H. Sameti, L. Burget et al., “Textdependent speaker verification based on ivectors, neural networks and hidden markov models,” Computer Speech & Language, vol. 46, pp. 53–71, 2017.
 [36] S. O. Sadjadi, M. Slaney, and L. Heck, “MSR identity toolbox v1. 0: A MATLAB toolbox for speakerrecognition research,” Speech and Language Processing Technical Committee Newsletter, vol. 1, no. 4, 2013.
 [37] D. GarciaRomero and C. Y. EspyWilson, “Analysis of ivector length normalization in speaker recognition systems,” in Proc. Interspeech, 2011, pp. 249–252.
 [38] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakers in the wild (sitw) speaker recognition database,” in Proc. Interspeech, 2016, pp. 818–822.
 [39] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a largescale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
 [40] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
 [41] J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, and O. Plchot, “Speaker verification using endtoend adversarial language adaptation,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6006–6010.
 [42] P. S. Nidadavolu, J. Villalba, and N. Dehak, “Cyclegans for domain adaptation of acoustic features for speaker recognition,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6206–6210.
 [43] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phoneticallyaware deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 1695–1699.
 [44] K. A. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. v. Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma et al., “The reddots data collection for speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [45] A. LozanoDiez, A. Silnova, P. Matejka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. GonzalezRodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey, vol. 2016, 2016, pp. 352–357.
 [46] T. Fu, Y. Qian, Y. Liu, and K. Yu, “Tandem deep features for textdependent speaker verification,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [47] S.X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “Endtoend attention based textdependent speaker verification,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 171–178.
 [48] F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attentionbased models for textdependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017.
 [49] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in endtoend speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018.
 [50] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, “Endtoend dnn based speaker recognition inspired by ivector and plda,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4874–4878.
 [51] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1050–1059.