1 Introduction
The i-vector based framework has defined the state-of-the-art for text-independent speaker recognition. The i-vectors are extracted from either a Gaussian mixture model (GMM) based system (Dehak et al., 2011) or a deep neural network (DNN) based system (Lei et al., 2014), and for the backend, probabilistic linear discriminant analysis (PLDA) (Prince et al., 2007) has been widely used. The i-vector/PLDA system performs well if long (e.g. more than 30 s) enrollment and test utterances are available, but the performance degrades rapidly when only limited data are available (Kanagasundaram et al., 2011). To address this issue, a range of techniques has been studied targeting different aspects of this problem (Poddar et al., 2017; Das and Prasanna, 2017).
There have been a number of methods to model the variation of short-utterance i-vectors. In Cumani (2014, 2015), a full posterior distribution PLDA (FP-PLDA) is proposed to exploit the covariance of the i-vector distribution, which improves the standard Gaussian PLDA (G-PLDA) model by accounting for the uncertainty of i-vector extraction. In Hasan et al. (2013), the effect of short-utterance i-vectors on system performance was analyzed, and the duration variability was modeled as additive noise in the i-vector space. The work in Kanagasundaram et al. (2014) introduces a short-utterance variance normalization technique and a short-utterance variance modeling approach at the i-vector feature level; the technique makes use of the covariance matrices of long and short i-vectors for normalization.
Alternatively, several approaches have been proposed that leverage phonetic information to perform content matching. The work in Li et al. (2016) proposes a GMM-based subregion framework where speaker models are trained for each subregion defined by phonemes; test utterances are then scored with the subregion models. In Chen et al. (2016), the authors use local session variability vectors estimated from certain phonetic components instead of computing the i-vector from the whole utterance. Phonetic classes are obtained by clustering similar senones (groups of triphones with similar acoustic properties) that are estimated from the posterior probabilities of a DNN trained for phone state classification. Another approach was proposed in Scheffer and Lei (2014), which matches the zeroth-order statistics of test and enrollment utterances using the posteriors of each phone state before computing the i-vectors.
In addition, a few studies have focused on the role of feature extraction and score calibration. In Guo et al. (2016, 2017a), the authors proposed several different methods (DNN and linear regression models) to estimate speaker-specific subglottal acoustic features, which are more stationary compared to MFCCs, largely phoneme independent, and can alleviate the phoneme mismatch between training and testing utterances. In addition, Hasan et al. (2013) proposes a Quality Measure Function (QMF), a score-calibration mechanism that compensates for the duration mismatch in trial scores.
Recently, several approaches have been proposed which use deep neural networks to learn speaker embeddings from short utterances. In Snyder et al. (2017), the authors use a neural network trained to discriminate between a large number of speakers to generate fixed-dimensional speaker embeddings, which are then used for PLDA scoring. In Zhang and Koishida (2017), the authors propose an end-to-end system which directly learns a speaker-discriminative embedding using a triplet loss function and an Inception Net. Both methods show improvement over GMM-based i-vector systems.
A few recent papers have focused on i-vector mapping, which maps a short-utterance i-vector to its long version. In Kheder et al. (2016, 2018), the authors proposed a probabilistic approach in which a GMM-based joint model between long and short utterance i-vectors was trained, and a minimum mean square error (MMSE) estimator was applied to transform a short i-vector to its long version. Since the GMM-based mapping function is actually a weighted sum of linear functions, our previous research (Guo et al., 2017b) demonstrated that a proposed nonlinear mapping using convolutional neural networks (CNNs) outperforms the GMM-based linear mapping methods across different conditions. The CNN-based mapping methods use unsupervised learning to regularize the supervised regression model, and result in significant performance improvement.
This paper is an extension of our aforementioned work in Guo et al. (2017b), where we investigated neural network based nonlinear methods for i-vector mapping. Here, we first compare and analyze the performance of both GMM- and DNN-based i-vector systems on short-utterance evaluation tasks. Based on the results, which show that Ivector_DNN systems outperform Ivector_GMM systems across durations, we then investigate our proposed nonlinear i-vector mapping methods using Ivector_DNN systems. Two novel DNN-based i-vector mapping methods are proposed and compared. They both model the joint representation of short and long utterance i-vectors by making use of an autoencoder.
The first method trains an autoencoder using concatenated short and long utterance i-vectors; the pre-trained weights are then used for fine-tuning on the supervised regression task, which directly maps short-utterance i-vectors to their long versions. By learning a joint embedding of short and long utterance i-vectors, the pre-trained autoencoder helps initialize the weights in a desirable basin of the loss landscape for the supervised training. Such pre-training proves to be useful especially when the training dataset is not large. Similar ideas of pre-training have been studied by Hinton et al. (2006) and Erhan et al. (2010).
The second method jointly trains the supervised regression model with an autoencoder that reconstructs the short-utterance i-vector itself. The autoencoder here plays the role of a regularizer, which is important when the training dataset is not large and the dimensions of the input and output are relatively high. The fact that an autoencoder loss helps prevent overfitting has been observed in the machine learning literature. For example, in Rasmus et al. (2015) and Zhang et al. (2016), a supervised neural network is augmented with decoding pathways for reconstruction, and it is shown that the reconstruction loss helps improve the performance of supervised tasks. More recently, the CapsNet paper (Sabour et al., 2017) introduces a decoder that plays a critical role in achieving state-of-the-art performance on a classification task.
We further discuss several key factors of the proposed DNN mapping models in detail, including pre-training iterations, regularization weights, and encoder depth. The best model provides a 26.47% relative improvement. We also show that by adding phoneme information as additional input, we can achieve further mapping improvements (28.43%). We apply the proposed mapping methods to different durations of evaluation utterances to represent real-life situations, and the results show their effectiveness across all conditions. The mapping results for both Ivector_GMM and Ivector_DNN systems are compared, and show significant improvement for both systems. Finally, in order to show the generalization of the proposed methods, we apply the best models validated on the SRE10 dataset (Martin and Greenberg, 2010) to the Speakers In The Wild (SITW) dataset (McLaren et al., 2015), which also yields considerable improvement (23.12%).
This paper is structured as follows. Section 2 describes the state-of-the-art i-vector/PLDA speaker verification systems. Section 3 analyzes the effect of utterance duration on i-vectors and introduces the proposed DNN-based i-vector mapping methods in detail. Section 4 presents the experimental setup. Experimental results and analysis of the proposed techniques are presented in Section 5. Section 6 discusses mapping effects, and finally, in Section 7, major conclusions are presented.
2 I-vector based speaker verification systems
As mentioned earlier, the state-of-the-art text-independent speaker verification system is based on the i-vector framework. In these systems, a universal background model (UBM) is used to collect sufficient statistics for i-vector extraction, and a PLDA backend is adopted to obtain the similarity scores between i-vectors. There are two different ways to model a UBM: using GMMs trained in an unsupervised fashion, or using a DNN trained as a senone classifier. Therefore, we introduce both the Ivector_GMM and Ivector_DNN systems as well as PLDA modeling.
2.1 Ivector_GMM system
The i-vector representation is based on the total variability modeling concept, which assumes that speaker- and channel-dependent variabilities reside in a low-dimensional subspace, represented by the total variability matrix T. Mathematically, the speaker- and channel-dependent GMM supervector M can be modeled as:

    M = m + Tw    (1)

where m is the speaker- and channel-independent supervector, T is a rectangular matrix of low rank, and w is a random vector called the i-vector, which has a standard normal distribution N(0, I).

In order to learn the total variability subspace, the Baum-Welch statistics need to be computed for a given utterance u, which are defined as:

    N_c(u) = Σ_t P(c | x_t, Ω)    (2)

    F_c(u) = Σ_t P(c | x_t, Ω) x_t    (3)

where N_c(u) and F_c(u) represent the zeroth- and first-order statistics, x_t is the feature sample at time index t, Ω represents the UBM of C mixture components, c is the Gaussian index, and P(c | x_t, Ω) corresponds to the posterior of mixture component c generating the vector x_t.
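The accumulation of these statistics can be sketched in a few lines of numpy; the following is a minimal illustration assuming a diagonal-covariance UBM (function and variable names are ours, not from any toolkit):

```python
import numpy as np

def baum_welch_stats(X, weights, means, covs):
    """Zeroth- and first-order Baum-Welch statistics of one utterance,
    assuming a diagonal-covariance UBM.

    X: (T, D) frames; weights: (C,); means, covs: (C, D).
    Returns N of shape (C,) (Eq. (2)) and F of shape (C, D) (Eq. (3)).
    """
    T, C = X.shape[0], weights.shape[0]
    log_p = np.empty((T, C))
    for c in range(C):
        diff = X - means[c]
        # log w_c + log N(x_t; mu_c, Sigma_c) for every frame t
        log_p[:, c] = (np.log(weights[c])
                       - 0.5 * np.sum(np.log(2.0 * np.pi * covs[c]))
                       - 0.5 * np.sum(diff ** 2 / covs[c], axis=1))
    # posterior P(c | x_t, Omega) via log-sum-exp normalization
    gamma = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])
    N = gamma.sum(axis=0)   # zeroth-order statistics
    F = gamma.T @ X         # first-order statistics
    return N, F
```

Since the posteriors of each frame sum to one over components, the zeroth-order statistics sum to the number of frames, a useful sanity check.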
2.2 Ivector_DNN system
As mentioned in the previous section, for an Ivector_GMM system, the posterior of mixture component c generating the vector x_t is computed with a GMM acoustic model Ω trained in an unsupervised fashion (i.e. with no phonetic labels):

    P(c | x_t, Ω) = w_c N(x_t; μ_c, Σ_c) / Σ_{c'} w_{c'} N(x_t; μ_{c'}, Σ_{c'})    (4)

However, recently, inspired by the success of DNN acoustic models in automatic speech recognition (ASR), Lei et al. (2014) proposed a method which uses DNN senone (cluster of context-dependent triphones) posteriors to replace the GMM posteriors illustrated in Eq. (4), which leads to significant improvement in speaker verification. In this case, Ω represents the trained DNN model for senone classification.
The senone posterior approach uses ASR features to compute the class soft alignments and the standard speaker verification features for sufficient statistics estimation. Once the sufficient statistics are accumulated, the training procedure is the same as in the previous section. In this paper, we use a state-of-the-art time delay neural network (TDNN), as in Peddinti et al. (2015), to train the ASR acoustic model.
2.3 PLDA modeling
PLDA is a generative model of i-vector distributions for speaker verification. In this paper, we use a simplified variant of PLDA, termed G-PLDA (Kenny et al., 2013), which is widely used by researchers. A standard G-PLDA assumes that the i-vector w is represented by:

    w = μ + Φβ + ε    (5)

where μ is the mean of the i-vectors, Φ defines the between-speaker subspace, and the latent variable β represents the speaker identity and is assumed to have a standard normal distribution. The residual term ε represents the within-speaker variability, which is normally distributed with zero mean and full covariance Σ.

PLDA-based i-vector system scoring is calculated using the log likelihood ratio (LLR) between target and test i-vectors, denoted as w_target and w_test. The likelihood ratio can be calculated as follows:

    score = log [ p(w_target, w_test | H_s) / ( p(w_target | H_d) p(w_test | H_d) ) ]    (6)

where H_s and H_d denote the hypotheses that the two i-vectors represent the same speaker and different speakers, respectively.
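The LLR of the two-covariance G-PLDA model has a closed form, since both hypotheses yield zero-mean Gaussian likelihoods of the mean-subtracted i-vectors. The sketch below is a simplified numpy version (names are ours; production toolkits use more efficient formulations of the same ratio):

```python
import numpy as np

def log_gauss(x, cov):
    """log N(x; 0, cov) for a zero-mean multivariate Gaussian."""
    d = x.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + x @ np.linalg.solve(cov, x))

def plda_llr(w1, w2, Phi, Sigma):
    """Two-covariance G-PLDA log-likelihood ratio in the style of Eq. (6).

    Under H_s the two (mean-subtracted) i-vectors share one latent speaker
    factor, so they are jointly Gaussian with cross-covariance B = Phi Phi^T;
    under H_d they are independent, each with covariance B + Sigma.
    """
    B = Phi @ Phi.T
    tot = B + Sigma
    joint = np.block([[tot, B], [B, tot]])
    num = log_gauss(np.concatenate([w1, w2]), joint)
    den = log_gauss(w1, tot) + log_gauss(w2, tot)
    return float(num - den)
```

As expected, a pair of identical i-vectors scores higher than a pair pointing in opposite directions.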
3 Short-utterance speaker verification
Table 1: Mean variance σ_m of long and short utterance i-vectors.

|  | long utterance | short utterance |
| --- | --- | --- |
| mean variance (σ_m) | 283 | 493 |
3.1 The effect of utterance durations on i-vectors
Full-length i-vectors have relatively smaller variations compared with i-vectors extracted from short utterances (Poddar et al., 2017), because i-vectors of short utterances can vary considerably with changes in phonetic content. In order to show the variation changes between long and short utterance i-vectors, we first calculate the average diagonal covariance (denoted as σ_s) of the i-vectors across all utterances of a given speaker s, and then calculate the mean (denoted as σ_m) of these covariances over all speakers. σ_s and σ_m are defined in Eqs. (7)-(8) as:

    σ_s = tr( (1/n_s) Σ_{i=1}^{n_s} (w_i^s − μ^s)(w_i^s − μ^s)^T )    (7)

    σ_m = (1/S) Σ_{s=1}^{S} σ_s    (8)

where μ^s corresponds to the mean of the i-vectors belonging to speaker s, n_s represents the total number of utterances for speaker s, tr(·) represents the trace operation, and S is the total number of speakers.
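Eqs. (7)-(8) amount to averaging, over speakers, the trace of each speaker's i-vector covariance; a small numpy helper makes this concrete (names are illustrative):

```python
import numpy as np

def mean_ivector_variance(ivectors_by_speaker):
    """sigma_m of Eq. (8): the average over speakers of the trace of
    each speaker's i-vector covariance (Eq. (7)).

    ivectors_by_speaker: list of (n_s, D) arrays, one array per speaker.
    """
    traces = []
    for W in ivectors_by_speaker:
        mu = W.mean(axis=0)
        var_diag = ((W - mu) ** 2).mean(axis=0)  # diagonal of covariance
        traces.append(var_diag.sum())            # trace, Eq. (7)
    return float(np.mean(traces))                # Eq. (8)
```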
In order to compare σ_m for long and short utterance i-vectors, we choose around 4000 speakers with multiple long utterances (more than 2 min duration and 100 s of active speech) from the SRE and Switchboard (SWB) datasets (in total around 40,000 long utterances) and truncate each long utterance into multiple 5 s or 10 s short utterances. We plot the distribution of active-speech length (utterance length after voice activity detection) across these 40,000 long utterances in Fig. 1. The i-vectors are extracted for each short and long utterance using the Ivector_DNN system, and Table 1 shows the mean variance σ_m across all speakers calculated from long and short utterance i-vectors individually. The values in Table 1 indicate that short-utterance i-vectors have larger variation compared to long-utterance i-vectors.
3.2 DNN-based i-vector mapping
In order to alleviate possible phoneme mismatch in text-independent short utterances, we propose several methods to map short-utterance i-vectors to their long versions. This is a many-to-one mapping, through which we aim to restore the missing information in the short-utterance i-vectors and reduce their variance.
In this section, we introduce and compare several novel DNN-based i-vector mapping methods. Our pilot experiments indicate that if we train a supervised DNN to learn this mapping directly, similar to the approaches in Bousquet and Rouvier (2017), the improvement is not significant, due to overfitting to the training dataset. To solve this problem, we propose two different methods which both model the joint representation of short and long utterance i-vectors using an autoencoder. The decoder reconstructs the original input representation and forces the encoded embedding to learn a hidden space which represents both short and long utterance i-vectors, and thus can lead to better generalization. The first is a two-stage method: an autoencoder is first used to train a bottleneck representation of both long and short utterance i-vectors, and the pre-trained weights are then used to perform a supervised fine-tuning of the model, which maps the short-utterance i-vector to its long version directly. The second is a single-stage method: the supervised regression model is jointly trained with an autoencoder that reconstructs the short i-vector, and the final loss to optimize is a weighted sum of the supervised regression loss and the reconstruction loss. In the following subsections, we introduce these two methods in detail.
3.2.1 DNN1 (two-stage method): pre-training and fine-tuning
In order to find a good initialization of the supervised DNN model, we first train a joint representation of both short and long utterance i-vectors using an autoencoder. We concatenate the short i-vector x_s and its long version x_l into z = [x_s; x_l]; the concatenated vector z is then used to train an autoencoder with some specific constraints. The autoencoder learns the joint hidden representation of both short and long i-vectors, which leads to a good initialization of the second-stage supervised fine-tuning. The autoencoder consists of an encoder and a decoder, as illustrated in Fig. 2. The encoder function f(z) learns a hidden representation of the input vector z, and the decoder function g(·) produces a reconstruction. The learning process minimizes the loss function L(z, g(f(z))). In order to learn a more useful representation, we add a restriction on the autoencoder: the hidden representation is constrained to have a relatively small dimension, in order to learn the most salient features of the training data.
For the encoder function f, we consider options ranging from several fully-connected layers to stacked residual blocks (He et al., 2016), in order to investigate the effect of encoder depth. Each residual block has two fully-connected layers with a shortcut connection, as shown in Fig. 3. By using residual blocks, we are able to train a very deep neural network without adding extra parameters. A deep encoder may help learn better hidden representations. For the decoder function g, we use a single fully-connected layer with a linear regression layer, since this is enough to approximate the mapping from the learned hidden representation to the output vector. For the loss function, we use the mean square error criterion, L = ||g(f(z)) − z||².
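A residual block of the kind used in the encoder (Fig. 3) is simply two fully-connected layers with the input added back via the shortcut connection; a minimal numpy forward pass, independent of our actual Tensorflow implementation (names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """One residual block: two fully-connected layers, with the input x
    added back through the shortcut before the final nonlinearity."""
    h = relu(x @ W1 + b1)           # first fully-connected layer
    return relu(x + (h @ W2 + b2))  # second layer plus shortcut
```

Because the block computes x plus a learned correction, zero-initialized weights make it an identity (up to the ReLU), which is what allows stacking many blocks without degrading training.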
Once the autoencoder is trained, we use the trained DNN structure and weights to initialize the supervised mapping. We then optimize the loss between the predicted long i-vector and the real long i-vector x_l, as shown in Fig. 2. We denote this method as DNN1.
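The two-stage procedure can be sketched as follows. This toy numpy snippet shows only the forward losses of the two stages and one plausible way of reusing the pre-trained encoder weights; the actual model uses deeper nonlinear encoders trained with Adam, and all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4  # toy i-vector and bottleneck dimensions (600-dim i-vectors in the paper)

# Stage 1: autoencoder over the concatenated pair z = [x_s; x_l].
W_enc = rng.standard_normal((2 * D, H)) * 0.1   # encoder weights f(.)
W_dec = rng.standard_normal((H, 2 * D)) * 0.1   # decoder weights g(.)

def stage1_loss(x_s, x_l):
    """Reconstruction MSE of the concatenated short/long pair."""
    z = np.concatenate([x_s, x_l])
    h = np.maximum(z @ W_enc, 0.0)        # hidden joint representation
    return float(np.mean((h @ W_dec - z) ** 2))

# Stage 2: transfer the pre-trained encoder rows that see x_s to
# initialise the supervised short-to-long mapper, then fine-tune.
W_map = W_enc[:D].copy()                   # transferred encoder weights
W_out = rng.standard_normal((H, D)) * 0.1  # fresh regression layer

def stage2_loss(x_s, x_l):
    """Supervised MSE between predicted and real long i-vectors."""
    h = np.maximum(x_s @ W_map, 0.0)
    return float(np.mean((h @ W_out - x_l) ** 2))
```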
3.2.2 DNN2 (single-stage method): semi-supervised training
The two-stage method described in the previous section needs to first train a joint representation using the autoencoder and then perform fine-tuning to train the supervised mapping. In this section, we introduce a unified semi-supervised framework based on our previous work (Guo et al., 2017b), which jointly trains the supervised mapping with an autoencoder that minimizes the reconstruction error. The joint framework is motivated by the fact that by sharing the hidden representations between supervised and unsupervised tasks, the network generalizes better; it also avoids the two-stage training procedure and speeds up training. This method is denoted as DNN2.
We adopt the same autoencoder framework as in the previous section, with an encoder and a decoder, but the input to the encoder here is the short-utterance i-vector x_s. The output from the encoder is connected to a linear regression layer to predict the long-utterance i-vector x_l, and it is also used to reconstruct the short-utterance i-vector itself by feeding it into a decoder, which gives rise to the autoencoder structure. The entire framework is shown in Fig. 4.
We define a new objective function to jointly train the network. Let ŷ and x̂ denote the outputs of the supervised regression model and the autoencoder, respectively. We define the objective function, which combines the loss of the regression model and that of the autoencoder in a weighted fashion, as:

    L(θ_r, θ_ae) = (1 − α) L_r(θ_r) + α L_ae(θ_ae)    (9)

where L_r is the loss of the regression model, defined as

    L_r = || ŷ − x_l ||²    (10)

and L_ae is the loss of the autoencoder, defined as:

    L_ae = || x̂ − x_s ||²    (11)
Moreover, θ_r and θ_ae are the parameters of the regression model and the autoencoder, respectively, which are jointly trained and share the weights of the encoder layers. α is a scalar weight which determines how much the reconstruction error is used to regularize the supervised learning. The reconstruction loss of the autoencoder forces the hidden vector generated by the encoder to reconstruct the short-utterance i-vector x_s in addition to predicting the target long-utterance i-vector x_l, and helps prevent the hidden vector from overfitting to x_l. For testing, we only use the output ŷ from the regression model as the mapped i-vector.
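The combined objective is straightforward to express; below is a minimal sketch of the weighted loss in Eqs. (9)-(11). The weighting convention shown, (1 − α) on the regression term and α on the reconstruction term, is one reasonable reading of the weighted sum, with α = 0.5 giving the equal weights used in the initial experiments:

```python
import numpy as np

def joint_loss(y_pred, x_long, x_recon, x_short, alpha=0.5):
    """Semi-supervised objective of Eq. (9).

    y_pred:  output of the regression branch (predicted long i-vector)
    x_long:  target long-utterance i-vector
    x_recon: output of the decoder branch (reconstructed short i-vector)
    x_short: input short-utterance i-vector
    alpha:   scalar weight on the reconstruction term
    """
    l_reg = np.mean((y_pred - x_long) ** 2)    # regression loss, Eq. (10)
    l_ae = np.mean((x_recon - x_short) ** 2)   # reconstruction loss, Eq. (11)
    return float((1.0 - alpha) * l_reg + alpha * l_ae)
```

Setting α to 0 or 1 recovers the pure regression or pure autoencoder objective, which is a convenient way to check the weighting.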

Table 2: Training data for the baseline systems.

|  | Ivector_GMM | Ivector_DNN |
| --- | --- | --- |
| UBM (3472) | Switchboard, NIST 04, 05, 06, 08 | Fisher English |
| T (600) | Switchboard, NIST 04, 05, 06, 08 | Switchboard, NIST 04, 05, 06, 08 |
| PLDA | NIST 04, 05, 06, 08 | NIST 04, 05, 06, 08 |
3.2.3 Adding phoneme information
The variance of short utterances is mainly due to phonetic differences. In order to help the neural network learn this nonlinear mapping, for a given utterance we extract the senone posteriors for each frame and calculate the mean posterior across frames as a phoneme vector, which is then appended to the short-utterance i-vector as input (Fig. 5). The training procedure still follows the proposed joint modeling methods (DNN1 or DNN2). The phoneme vectors are expected to help normalize the short-utterance i-vector and provide extra information for the mapping. The phoneme vector p is defined as:

    p = (1/T) Σ_{t=1}^{T} γ_t    (12)

where γ_t is the vector of senone posteriors at frame t and T is the number of frames. The posteriors are generated from the TDNN-based senone classifier defined in Section 2.2.
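Eq. (12) is a simple per-utterance average of the frame-level senone posteriors; a one-line numpy version, including the optional up-scaling factor described in Section 4.2 (names are illustrative):

```python
import numpy as np

def phoneme_vector(posteriors, scale=1.0):
    """Eq. (12): average frame-level senone posteriors into one vector.

    posteriors: (T, C) matrix of per-frame senone posteriors gamma_t
    scale:      optional up-scaling factor (500 in Section 4.2), applied
                before concatenation with the short-utterance i-vector
    """
    return scale * posteriors.mean(axis=0)
```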
Table 3: EER and minDCF of the two baseline systems under the full-length and truncated short-length conditions.

| System | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| --- | --- | --- | --- | --- |
| Full-length condition | | | | |
| Ivector_GMM | 2.2 | 0.011/0.043 | 1.7 | 0.008/0.036 |
| Ivector_DNN | 1.4 (36.36%) | 0.005/0.022 | 0.8 (52.94%) | 0.003/0.017 |
| 10 s-10 s condition | | | | |
| Ivector_GMM | 13.8 | 0.063/0.097 | 13.3 | 0.057/0.099 |
| Ivector_DNN | 12.2 (11.59%) | 0.054/0.093 | 10.2 (23.31%) | 0.048/0.095 |
| 5 s-5 s condition | | | | |
| Ivector_GMM | 21.7 | 0.083/0.099 | 20.4 | 0.080/0.100 |
| Ivector_DNN | 19.9 (8.29%) | 0.078/0.099 | 17.0 (16.67%) | 0.072/0.100 |
4 Experimental setup
4.1 I-vector baseline systems
We evaluate our techniques using state-of-the-art GMM- and DNN-based i-vector/G-PLDA systems built with the Kaldi toolkit (Povey et al., 2011).
4.1.1 Configurations of Ivector_GMM system
For the Ivector_GMM system, the first 20 MFCC coefficients (discarding the zeroth coefficient) and their first- and second-order derivatives are extracted from the detected speech segments after energy-based voice activity detection (VAD). A 20 ms Hamming window, a 10 ms frame shift, and a 23-channel filterbank are used. A universal background model with 3472 Gaussian components is trained, in order to have a fair comparison with the Ivector_DNN system, whose DNN has 3472 outputs. Initial training consists of four iterations of EM using diagonal covariance matrices, followed by an additional four iterations with full-covariance matrices. The total variability subspace with low rank (600) is trained for five iterations of EM. The backend training consists of i-vector mean subtraction and length normalization, followed by PLDA scoring.
The UBM and i-vector extractor training data consist of male and female utterances from the SWB and NIST SRE datasets. The SWB data contain 1000 speakers and 8905 utterances from SWB 2 Phase II. The SRE dataset consists of 3805 speakers and 36,614 utterances from SRE 04, 05, 06, and 08. The PLDA backends are trained only on the SRE data. The dataset information is summarized in Table 2.
4.1.2 Configurations of Ivector_DNN system
For the Ivector_DNN system, a TDNN is trained using about 1800 hours of the English portion of Fisher (Cieri et al., 2004). In the TDNN acoustic modeling system, a narrow temporal context is provided to the first layer, and the context width increases for the subsequent hidden layers, which enables higher levels of the network to learn greater temporal relationships. The features are 40 mel-filterbank features with a frame length of 25 ms. Cepstral mean subtraction is performed over a window of 6 s. The TDNN has six layers and a splicing configuration similar to those described in Peddinti et al. (2015). In total, the DNN has a left context of 13 and a right context of 9. The hidden layers use the p-norm (where p = 2) activation function (Zhang et al., 2014), with an input dimension of 350 and an output dimension of 3500. The softmax output layer computes posteriors for 3472 triphone states, the same as the number of components in the Ivector_GMM system. No fMLLR or i-vectors are used for speaker adaptation.
The trained TDNN is used to create a UBM which directly models phonetic content. A supervised GMM with full covariance is created first to initialize the i-vector extractor based on TDNN posteriors and speaker recognition features. Training the T matrix also requires TDNN posteriors and speaker recognition features. During i-vector extraction, the only difference between this and the standard GMM-based system is the model used to compute posteriors. In the Ivector_GMM system, speaker recognition features are selected using frame-level VAD; however, in order to maintain the correct temporal context, we cannot remove frames from the TDNN input features. Instead, the VAD results are used to filter out posteriors corresponding to non-speech frames.
4.1.3 Evaluation databases
We first evaluate our systems on condition 5 (extended task) of SRE10 (Martin and Greenberg, 2010). The test consists of conversational telephone speech in enrollment and test utterances. There are 416,119 trials, over 98% of which are non-target comparisons. Among all trials, 236,781 are for female speakers and 179,338 are for male speakers. For short-utterance speaker verification tasks, we extract short utterances containing 10 s and 5 s of speech (after VAD) from condition 5 (extended task). We train the PLDA and evaluate the trials in a gender-dependent way.
Moreover, in order to validate our proposed methods under real conditions and demonstrate the models' generalization, we use SITW, a recently published speech database (McLaren et al., 2015). The SITW speech data were collected from open-source media channels with considerable mismatch in terms of audio conditions. We designed an arbitrary-length short-utterance task using the SITW dataset to represent real-life conditions. We show the evaluation results using the best-performing models validated on the SRE10 dataset.
4.2 I-vector mapping training
In order to train the i-vector mapping models, we selected 39,754 long utterances, each having more than 100 s of speech after VAD, from the development dataset. For each long utterance, we used a 5 s or 10 s window to truncate the utterance, with a shift step of half the window size (2.5 s or 5 s). We applied this procedure to all long utterances, and in the end we obtained 1.2M 10 s utterances and 2.4M 5 s utterances. Each short-utterance i-vector, together with its corresponding long-utterance i-vector, is used as a training pair for the DNN-based mapping models. We train the mapping models for each gender separately and evaluate them in a gender-dependent way.
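The truncation scheme (a window of 5 s or 10 s with a shift of half the window) can be sketched as follows, working in abstract time units (an illustrative helper, not part of our pipeline):

```python
def truncate_windows(n_frames, win, shift=None):
    """Enumerate (start, end) spans for cutting one long utterance into
    overlapping short segments; by default the shift is half the window,
    as in our setup. Units (frames or seconds) are illustrative."""
    if shift is None:
        shift = win // 2
    spans = []
    start = 0
    while start + win <= n_frames:
        spans.append((start, start + win))
        start += shift
    return spans
```

With the half-window shift, each long utterance of length n yields roughly 2n/win overlapping segments, which is why the 5 s windows produce about twice as many training pairs as the 10 s windows.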
For the two proposed DNN-based mapping models, we use the same encoder and decoder configurations. For the encoder, we first use two fully-connected layers: the first layer has 1200 hidden nodes and the second layer, a bottleneck layer, has 600 hidden nodes (1.44M parameters in total). In order to investigate the depth of the encoder, we also design a deep structure with two residual blocks and a bottleneck layer, for a total of 5 layers. Each residual block (as defined in Section 3.2.1) has two fully-connected layers with 1200 hidden nodes, and the bottleneck layer has 600 hidden nodes (5.76M parameters in total). For the decoder, we always use one fully-connected layer (1200 hidden nodes) with a linear output layer (1.44M parameters in total).
In order to add phoneme information for i-vector mapping, phoneme vectors are generated for each utterance by taking the average of the senone posteriors across frames. Since the phoneme vectors have a different value range compared with i-vectors, their effect on training the mapping would be de-emphasized. Therefore, we scale up the phoneme vector values by a factor of 500 in order to match the range of i-vector values. The up-scaled phoneme vector is then concatenated with the short-utterance i-vector for i-vector mapping.
All neural networks are trained using the Adam optimization strategy (Kingma and Ba, 2014) with the mean square error criterion and an exponentially decaying learning rate starting from 0.001. The networks are initialized with the Xavier initializer (Glorot and Bengio, 2010), which performs better than the Gaussian initializer, as shown in Guo et al. (2017b). The ReLU activation function is used for all layers. For each layer, before passing the tensors to the nonlinearity, a batch normalization layer (Ioffe and Szegedy, 2015) is applied to normalize the tensors and speed up convergence. For the combined loss of DNN2, we set equal weights (α = 0.5) for the regression and autoencoder losses in the initial experiments. A shuffling mechanism is applied at each epoch. The Tensorflow toolkit (Abadi et al., 2016) is used for neural network training.

Table 4: Results of the baseline and short-utterance compensation methods on the 10 s-10 s condition.

| Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| --- | --- | --- | --- | --- |
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| matched-length PLDA | 11.3 (7.38%) | 0.052/0.093 | 9.4 (7.84%) | 0.043/0.095 |
| LDA 150 | 11.6 (5.00%) | 0.052/0.093 | 9.8 (3.92%) | 0.047/0.093 |
| DNN direct mapping | 10.5 (13.93%) | 0.054/0.096 | 9.7 (4.90%) | 0.047/0.093 |
| DNN1 mapping | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.090 |
| DNN2 mapping | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.089 |
5 Evaluation results and analysis
5.1 I-vector baseline systems
In this section, we present and compare two baseline systems, an Ivector_GMM system and an Ivector_DNN system, on the standard NIST SRE10 full-length condition and the truncated 10 s-10 s and 5 s-5 s conditions.
Table 3 shows the equal error rate (EER) and minimum detection cost function (minDCF) of the two baseline systems under the full-length and truncated short-length evaluation conditions. Both DCF08 and DCF10 (defined in the NIST 2008 and 2010 evaluation plans) are shown in the table. From the table, we can observe that the Ivector_DNN system gives significant improvement under the full-length condition compared with the Ivector_GMM system, achieving a maximum of 52.94% relative improvement for the male condition, which is consistent with previously reported results (Snyder et al., 2015). This is mainly because the DNN model provides phonetically-aware class alignments, which can better model speakers. The good performance is also due to the strong TDNN-based senone classifier, which makes the alignments more accurate and robust. When both systems are evaluated on the truncated 10 s-10 s and 5 s-5 s conditions, performance degrades significantly compared with the full-length condition. The main reason is that when the evaluation utterances are shorter, there is significant phonetic mismatch between utterances. However, the Ivector_DNN system still outperforms the Ivector_GMM system by 8%-24%, even though the improvement is not as large as in the full-length condition. From the table, we can also observe that the improvement is more significant for male speakers across all conditions. This may be because phoneme classification is more accurate for male speakers, which could lead to better phoneme-aware speaker modeling.
5.2 I-vector mapping results
In this section, we show and discuss the performance of the proposed algorithms when only short utterances are available for evaluation. Since Table 3 shows better performance for the Ivector_DNN systems, we mainly use the Ivector_DNN system to investigate the mapping methods. We first show the results for the 10 s-10 s condition.
Previous work (Kheder et al., 2016; Guo et al., 2017b) highlights the importance of duration matching in PLDA model training. For instance, when the PLDA is trained on long utterances and evaluated on short utterances, speaker verification performance degrades compared to a PLDA trained on matched-length short utterances. Therefore, we show baseline results not only for the PLDA trained using the regular SRE development utterances, but also for the PLDA trained using truncated matched-length short utterances.
For further baseline comparisons, we first apply dimensionality reduction on i-vectors using linear discriminant analysis (LDA), reducing the dimension from 600 to 150. This value was selected according to the results of previous research (Cumani, 2016). LDA maximizes inter-speaker variability and minimizes intra-speaker variability. We train the LDA transformation matrix using the SRE development dataset, then perform dimensionality reduction for all development utterances and train a new PLDA model. For evaluation, all i-vectors are first subjected to dimensionality reduction, and the new PLDA model is then used to obtain similarity scores. To compare with another short-utterance compensation technique, we evaluate the i-vector mapping method proposed in Bousquet and Rouvier (2017), which uses DNNs to train a direct mapping from short-utterance i-vectors to the corresponding long versions. Similar to Bousquet and Rouvier (2017), we also add some long-utterance i-vectors as input for regularization purposes.
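The LDA step can be illustrated as follows: estimate the within- and between-speaker scatter matrices and keep the leading generalized eigenvectors. This is a simplified numpy sketch (the small ridge term on the within-speaker scatter is our addition, for numerical stability):

```python
import numpy as np

def lda_matrix(X, y, dim=150):
    """Fit an LDA projection: leading eigenvectors of Sw^-1 Sb.

    X: (N, D) i-vectors; y: (N,) speaker labels.
    Returns a (D, dim) projection matrix.
    """
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-speaker scatter
        diff = (mc - mu)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)    # between-speaker scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-vals.real)
    return vecs[:, order[:dim]].real
```

The retained directions are those along which speakers are well separated relative to their internal spread, which is exactly the inter- vs intra-speaker variability trade-off described above.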
Table 5. EER and minDCF of the shallow (3-layer) and deep (6-layer + residual block) encoders under the 10 s-10 s condition.

| Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| DNN1 mapping (3 layer) | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.090 |
| DNN2 mapping (3 layer) | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.089 |
| DNN1 mapping (6 layer + residual block) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| DNN2 mapping (6 layer + residual block) | 9.3 (23.77%) | 0.047/0.091 | 7.6 (25.49%) | 0.038/0.089 |
For our proposed DNN mapping methods, we first show the mapping results for both DNN1 and DNN2 with three hidden layers. Note that for mapped i-vectors, we use the same PLDA as the baseline system to obtain similarity scores. We further investigate the effect of the number of pre-training iterations for DNN1, the weight of the reconstruction loss for DNN2, and the depth of the encoder; compare the results for different durations; and investigate the effect of additional phoneme information. We also compare the mapping results for both the Ivector_GMM and Ivector_DNN systems. Finally, we test the generalization of the trained models on the SITW dataset.
Table 4 presents the results for the regular PLDA training condition (baseline), the matched-length PLDA condition, the LDA dimensionality reduction method, the DNN-based direct mapping method, the DNN-based two-stage method (DNN1) and the DNN-based single-stage method (DNN2, with reconstruction weight 0.5). We observe that matched-length PLDA training gives a considerable improvement over non-matched PLDA training (baseline), which is consistent with previous work. When the PLDA is trained on short-utterance i-vectors, the system can capture the variance of short-utterance i-vectors. Using LDA for dimensionality reduction also yields some improvement, since it reduces the variance of the i-vectors. DNN-based direct mapping gives a larger EER improvement for female speakers (13.93%) than for male speakers (5%), which may be because more training data is available for female speakers, making overfitting less severe. In the last two rows, we show the performance of our proposed DNN-based mapping methods on short-utterance i-vectors. The results show that both methods yield significant improvements over the baseline in both EER and minDCF, and they also outperform the other short-utterance compensation methods by a large margin. DNN1 and DNN2 have comparable performance, which demonstrates the importance of learning a joint representation of short- and long-utterance i-vectors. The proposed methods outperform the baseline system by 22.13% for female speakers and by 24.51% for male speakers. One advantage of DNN2 is that its unified framework avoids the two-stage training procedure, which speeds up training.
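The single-stage objective behind DNN2 (a weighted combination of an unsupervised reconstruction term and a supervised short-to-long regression term) can be illustrated with a minimal numpy sketch. The `encode`, `decode` and `regress` callables below are hypothetical stand-ins for the network's sub-modules, not the paper's actual architecture.

```python
import numpy as np

def joint_mapping_loss(short_iv, long_iv, encode, decode, regress, alpha=0.5):
    """Single-stage (DNN2-style) objective: alpha weights an autoencoder
    reconstruction loss on the short i-vector, and (1 - alpha) weights the
    regression loss toward the paired long-utterance i-vector."""
    hidden = encode(short_iv)
    recon_loss = np.mean((decode(hidden) - short_iv) ** 2)    # unsupervised term
    regress_loss = np.mean((regress(hidden) - long_iv) ** 2)  # supervised term
    return alpha * recon_loss + (1 - alpha) * regress_loss
```

Training would minimize this combined loss over all short/long i-vector pairs; the sweep over the reconstruction weight in Section 5.2.2 corresponds to varying `alpha`.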
Table 6. Effect of adding phoneme information to the best DNN mapping (10 s-10 s condition).

| Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| DNN mapping (best) + phoneme info | 8.9 (27.05%) | 0.046/0.090 | 7.3 (28.43%) | 0.037/0.090 |
5.2.1 Effect of pre-training for DNN1
In this section, we show how the first-stage pre-training influences the second-stage mapping training for DNN1. We investigated numbers of first-stage pre-training iterations ranging from 10,000 to 50,000. Interestingly, when the number of pre-training iterations is small, the second-stage fine-tuning overfits the data, but when the number of iterations is large, the fine-tuning results are not optimal. In the end, 25,000 iterations provided a good initialization for the second-stage fine-tuning. This indicates that the number of iterations of unsupervised training does influence the second-stage supervised training.
5.2.2 Effect of the reconstruction loss for DNN2
In this section, we investigate the impact of the weight of the reconstruction loss in DNN2. We set the reconstruction weight to values in {0.1, 0.2, 0.5, 0.8, 0.9}. Since the regression loss receives the complementary weight, the larger the reconstruction weight, the less weight is assigned to the regression loss. Fig. 6 shows the EER for female speakers as a function of the weight assigned to the reconstruction loss. The reconstruction loss is clearly important for this joint learning framework: it forces the network to learn the original representations of the short utterances, which regularizes the regression task and improves generalization. The optimal reconstruction weight is 0.8, which indicates that the reconstruction loss is even more important than the regression loss for this task. Hence, unsupervised learning appears to be crucial for speaker recognition tasks.
Table 7. Baseline and best DNN mapping results for the 10 s-10 s, 5 s-5 s and mixed-duration conditions.

| Condition | Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| 10 s-10 s | baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| 10 s-10 s | DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| 5 s-5 s | baseline | 19.9 | 0.078/0.099 | 17.0 | 0.072/0.100 |
| 5 s-5 s | DNN mapping (best) | 14.8 (25.62%) | 0.067/0.099 | 13.5 (20.59%) | 0.061/0.100 |
| mix | baseline | 17.8 | 0.068/0.097 | 14.4 | 0.061/0.100 |
| mix | DNN mapping (best) | 13.2 (25.84%) | 0.061/0.097 | 11.8 (18.06%) | 0.053/0.096 |
5.2.3 Effect of encoder depth
The depth of a neural network has been shown to be important for its performance; adding layers makes the network more powerful at modeling data. Therefore, as discussed in Section 4.2, we compare a shallow (2-layer) and a deep (5-layer) encoder for both DNN1 and DNN2. It is well known that training a deep model suffers from vanishing/exploding gradients and can easily become stuck in poor local minima. We therefore use two methods to alleviate these problems. First, as stated in Section 4.2, we use a normalized initialization (Xavier initialization) and a batch normalization layer to normalize the intermediate hidden outputs. Second, we apply residual learning, which uses several residual blocks (defined in Section 3.2.1) with no extra parameters compared with regular fully-connected layers. The residual blocks ease the information flow between layers and enable smooth forward/backward propagation, which makes it feasible to train deep networks. To our knowledge, this is one of the first studies to investigate the effect of residual networks on autoencoders and unsupervised learning. Here, for the deep encoder, we use 2 residual blocks and 1 fully-connected bottleneck layer (5 layers in total). For the decoder, we use a single hidden layer with a linear regression output layer.
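A fully-connected residual block of the kind described above can be sketched as follows. This is a simplified numpy illustration (batch normalization is omitted, and the weight matrices are hypothetical), showing only the identity-shortcut structure that eases gradient flow.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two fully-connected weight layers plus an identity shortcut:
    the block learns a residual F(x) and outputs relu(F(x) + x),
    adding no parameters beyond the two plain layers."""
    out = relu(x @ w1)   # first layer + nonlinearity
    out = out @ w2       # second layer (linear)
    return relu(out + x)  # identity shortcut keeps gradients flowing
```

When the residual weights are near zero, the block reduces to (approximately) the identity, which is what makes stacking many such blocks trainable.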
From Table 5, we can observe that the deep encoder does yield improvements over the shallow encoder. In particular, for DNN1 the residual network gives a 25.41% relative improvement for female speakers and a 26.47% relative improvement for male speakers. The results indicate that learning a good joint representation of short- and long-utterance i-vectors is very beneficial for this supervised mapping task, and the deep encoder helps learn a better bottleneck joint embedding. The deep encoder can also decrease the amount of training data needed to model the nonlinear function, which further alleviates the overfitting problem. To show the effect of the residual shortcuts, we performed experiments using a deep encoder without shortcut connections, and this system performed even worse than the shallow encoder. Therefore, residual blocks with shortcut connections are crucial for deep neural network training, since they alleviate the hard optimization problems of deep networks.
5.2.4 Effect of adding phoneme information
In this section, we show the results of adding a phoneme vector (the mean of the phoneme posteriors across frames) to the short-utterance i-vectors when learning the mapping. We investigate the effect of adding phoneme information on top of the best-performing DNN-mapping structures. From Table 6, we can observe that when the phoneme vector is added, the EER further improves to 8.9% for female speakers and 7.3% for male speakers over the previous best DNN-mapping results, achieving the best results for this task. These results support the hypothesis that adding a phoneme vector helps the neural network reduce the variance of short-utterance i-vectors, leading to better and more generalizable mapping results. In Section 5.4, we also show the effect of adding phoneme vectors to GMM i-vectors.
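Forming the phoneme vector and the augmented network input is straightforward; a minimal sketch is given below, assuming a frame-by-phoneme posterior matrix is available from the phone classifier (the shapes here are illustrative).

```python
import numpy as np

def phoneme_vector(posteriors):
    """Average frame-level phoneme posteriors into one utterance-level
    vector. `posteriors` has shape (num_frames, num_phonemes), each row
    summing to one."""
    return np.asarray(posteriors, dtype=float).mean(axis=0)

def mapping_input(short_ivector, posteriors):
    """Concatenate the short-utterance i-vector with its phoneme vector to
    form the augmented input to the mapping network."""
    return np.concatenate([np.asarray(short_ivector, dtype=float),
                           phoneme_vector(posteriors)])
```

The mapping network then consumes this concatenated vector instead of the bare short-utterance i-vector.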
5.3 Results with different durations
In this section, the results for different durations of evaluation utterances are listed. Table 7 shows the baseline and the best mapping results for the 10 s-10 s, 5 s-5 s and mixed-duration conditions. From the table, we can observe that the proposed methods give significant improvements for both the 10 s-10 s and 5 s-5 s conditions, which indicates that the proposed method generalizes to different durations. In real applications, however, the duration of short utterances cannot be controlled; we therefore train the mapping using i-vectors generated from mixed 10 s and 5 s utterances and also show the results on a mixed-duration evaluation task (a mix of 5 s and 10 s). From Table 7, we can see that the baseline results for the mixed condition fall between the EERs of the 10 s-10 s and 5 s-5 s evaluation tasks. The proposed mapping algorithms can model i-vectors extracted from various durations, and thus give consistent improvements, as shown in the table.
Table 8. Mapping results for the Ivector_GMM and Ivector_DNN systems (10 s-10 s condition).

| System | Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| Ivector_GMM | baseline | 13.8 | 0.063/0.097 | 13.3 | 0.057/0.099 |
| Ivector_GMM | DNN mapping (best) | 11.0 (20.29%) | 0.054/0.095 | 10.6 (20.30%) | 0.051/0.096 |
| Ivector_GMM | DNN mapping (best) + phoneme info | 10.4 (24.64%) | 0.053/0.094 | 9.6 (27.82%) | 0.048/0.096 |
| Ivector_DNN | baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| Ivector_DNN | DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| Ivector_DNN | DNN mapping (best) + phoneme info | 8.9 (27.05%) | 0.046/0.090 | 7.3 (28.43%) | 0.037/0.090 |
5.4 Comparison of mapping results for both Ivector_GMM and Ivector_DNN systems
In the previous sections, we only showed the mapping experiments for the Ivector_DNN system; in this section, we show the mapping results for the Ivector_GMM system. In Section 5.1, we showed that for the baseline results the Ivector_DNN system outperforms the Ivector_GMM system, but it is also interesting to compare the results after mapping. From Table 8 we observe that the proposed mapping methods give significant improvements for both systems. After mapping, the Ivector_DNN system still outperforms the Ivector_GMM system, and its superiority is even more pronounced. We also compare the mapping results when phoneme vectors are added. The table shows that the effect of adding phoneme information is more significant for GMM i-vectors, achieving as much as a 10% relative improvement over the best DNN mapping baseline. The reason is that DNN i-vectors already contain some phoneme information, while GMM i-vectors have no explicit phoneme representation; GMM i-vectors can therefore benefit more from the added phoneme vectors. Finally, we summarize the baseline and the best mapping results for both systems in Fig. 7, where DET (Detection Error Tradeoff) curves are presented for both female and male speakers. The figures indicate that the proposed mapping algorithms give significant improvements over the baseline across all operating points.
Table 9. Performance on the SITW database for arbitrary-duration short utterances.

| Method | Female EER (Rel Imp) | Female DCF08/DCF10 | Male EER (Rel Imp) | Male DCF08/DCF10 |
| baseline | 17.3 | 0.061/0.089 | 12.0 | 0.046/0.083 |
| DNN mapping (best models from SRE10) | 13.3 (23.12%) | 0.050/0.086 | 9.4 (21.67%) | 0.039/0.078 |
5.5 Performance on the SITW database
In the previous experiments, we showed the performance of our proposed DNN-mapping methods on NIST data. In this subsection, we apply our technique to the recently published SITW database, which contains real-world audio files collected from open-source media channels under considerable mismatch conditions. In order to generate a large number of random-duration short utterances, we first combined the dev and eval datasets and then selected 207 utterances from relatively clean conditions. We truncated each of the 207 utterances into several non-overlapping short utterances with durations of 5 s, 3.5 s and 2.5 s (including both speech and non-speech portions). In the end, a total of 1836 utterances was generated. We plot the distribution of active speech length across these 1836 utterances in Fig. 8. From the figure, we can observe that the active speech length varies between 1 s and 5 s across these short utterances. We can therefore use these short utterances to design trials that represent real-world conditions (arbitrary-length short utterances). In total, we designed 664,672 trials for our arbitrary-length short-utterance speaker verification task.
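The truncation step above can be sketched as a simple segmentation of each utterance into fixed-length, non-overlapping pieces; a minimal illustration is shown below (the sample counts are hypothetical, and the leftover tail shorter than one segment is dropped).

```python
def truncate_utterance(num_samples, sample_rate, segment_seconds):
    """Split an utterance of `num_samples` audio samples into non-overlapping
    segments of `segment_seconds`, returning (start, end) sample indices.
    Any leftover tail shorter than a full segment is discarded."""
    seg = int(segment_seconds * sample_rate)
    return [(start, start + seg)
            for start in range(0, num_samples - seg + 1, seg)]
```

Running this per source utterance with segment lengths of 5 s, 3.5 s and 2.5 s yields the pool of short segments from which trials are built.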
For each short utterance, we first downsampled the audio to an 8 kHz sampling rate and then extracted i-vectors using the previously trained Ivector_DNN system introduced in Section 4.1. For PLDA scoring, we use the same PLDA as in Section 4.1, trained on the SRE dataset. For i-vector mapping, we apply the best models validated on the SRE10 dataset (5 s condition) to the SITW dataset. The EER and minDCF results are shown in Table 9. From the table, we can observe that the best models validated on the SRE10 dataset generalize well to the SITW dataset, giving a 23.12% relative EER improvement for female speakers and a 21.67% relative improvement for male speakers. The results also indicate that the proposed methods can be used in real-life conditions, such as smart-home and forensics-related applications.
6 Mapping effects
In order to investigate the effect of the proposed i-vector mapping algorithms, we first calculate the average squared Euclidean distance between short- and long-utterance i-vector pairs on the SRE10 evaluation dataset before and after mapping. The average squared Euclidean distance $d$ between short- and long-utterance i-vectors is defined as follows:

$$ d = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{D} \left( w_{s,i}^{(n)} - w_{l,i}^{(n)} \right)^2 \qquad (13) $$

where $w_{s}^{(n)}$ and $w_{l}^{(n)}$ represent the $n$-th short-utterance and long-utterance i-vector respectively, $D$ is the dimension of the i-vectors, and $N$ is the number of short and long i-vector pairs.
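The distance in Eq. (13) is straightforward to compute; a minimal numpy sketch over paired i-vector matrices is shown below (the arrays here are illustrative).

```python
import numpy as np

def avg_squared_distance(short_ivs, long_ivs):
    """Average squared Euclidean distance between paired short- and
    long-utterance i-vectors (Eq. 13): sum over the D dimensions,
    then mean over the N pairs."""
    short_ivs = np.asarray(short_ivs, dtype=float)
    long_ivs = np.asarray(long_ivs, dtype=float)
    return np.mean(np.sum((short_ivs - long_ivs) ** 2, axis=1))
```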
We compare the values of $d$ for the 10 s and 5 s short-utterance i-vectors and for the mapped 10 s and 5 s short-utterance i-vectors, for female and male speakers, in Table 10. From the table, we observe that after mapping, the mapped short-utterance i-vectors have a considerably smaller $d$ than the ones before mapping. After mapping, $d$ in the 10 s condition is smaller than in the 5 s condition.
Moreover, we calculate and compare the J-ratio (Fukunaga, 1990) of the short-utterance i-vectors from SRE10 before and after mapping in Table 11; this measure quantifies speaker separability. Given i-vectors from $M$ speakers, the J-ratio can be computed using Eqs. (14)-(16):

$$ S_w = \frac{1}{M} \sum_{k=1}^{M} \Sigma_k \qquad (14) $$

$$ S_b = \frac{1}{M} \sum_{k=1}^{M} (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^{T} \qquad (15) $$

$$ J = \mathrm{tr}\left( S_w^{-1} S_b \right) \qquad (16) $$

where $S_w$ is the within-class scatter matrix, $S_b$ is the between-class scatter matrix, $\mu_k$ is the mean i-vector of the $k$-th speaker, $\bar{\mu}$ is the mean of all $\mu_k$'s, and $\Sigma_k$ is the covariance matrix of the $k$-th speaker (note that a higher J-ratio means better separation).
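A compact numpy sketch of Eqs. (14)-(16) follows; it assumes labeled i-vectors as a matrix and uses a pseudo-inverse for numerical safety (the data shapes are illustrative).

```python
import numpy as np

def j_ratio(ivectors, labels):
    """J-ratio (Eqs. 14-16): trace of S_w^{-1} S_b, where S_w averages the
    per-speaker covariance matrices and S_b is the scatter of the speaker
    means around their global mean. Higher values mean better separation."""
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    dim = ivectors.shape[1]
    means = np.array([ivectors[labels == s].mean(axis=0) for s in speakers])
    global_mean = means.mean(axis=0)
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for spk, mu in zip(speakers, means):
        rows = ivectors[labels == spk]
        s_w += np.cov(rows, rowvar=False, bias=True)  # per-speaker covariance
        diff = (mu - global_mean)[:, None]
        s_b += diff @ diff.T
    s_w /= len(speakers)
    s_b /= len(speakers)
    return np.trace(np.linalg.pinv(s_w) @ s_b)
```

Well-separated speaker clusters produce a larger J-ratio than overlapping ones, which is the behavior Table 11 relies on.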
From Table 11, we can observe that the mapped i-vectors have considerably higher J-ratios than the original short-utterance i-vectors for both the 5 s and 10 s conditions. These results indicate that the proposed DNN-based mapping methods generalize well to unseen speakers and utterances, and improve the speaker-separation ability of the i-vectors.
Table 10. Average squared Euclidean distance $d$ between short- and long-utterance i-vector pairs before and after mapping.

| | 10 s original | 10 s mapped | 5 s original | 5 s mapped |
| female | 558.3 | 306.8 | 618.8 | 352.1 |
| male | 493.2 | 308.8 | 556.1 | 346.5 |
Table 11. J-ratio of short-utterance i-vectors before and after mapping.

| | 10 s original | 10 s mapped | 5 s original | 5 s mapped |
| female | 87.96 | 92.97 | 82.73 | 85.18 |
| male | 85.23 | 90.25 | 80.41 | 84.39 |
7 Conclusions
In this paper, we showed how the performance of both GMM-based and DNN-based i-vector speaker verification systems degrades rapidly as the duration of the evaluation utterances decreases. The paper explained and analyzed the reasons for this degradation and proposed several DNN-based techniques to train a nonlinear mapping from short-utterance i-vectors to their long-utterance counterparts, in order to improve short-utterance evaluation performance.
Two DNN-based mapping methods (DNN1 and DNN2) were proposed, and both model the joint representation of short-utterance and long-utterance i-vectors. For DNN1, an autoencoder is first trained on concatenated short- and long-utterance i-vectors in order to learn a joint hidden representation, and the pre-trained DNN is then fine-tuned with a supervised mapping from short to long i-vectors. DNN2 adopts a unified structure that jointly trains the supervised regression task with an autoencoder, since the autoencoder can directly regularize the nonlinear mapping between short and long utterances. The unified structure simplifies the training procedure and can also learn a more general nonlinear function.
Both DNN1 and DNN2 result in significant improvements over the short-utterance evaluation baseline for both male and female speakers, and they also outperform other short-utterance compensation techniques by a large margin. A t-test (p < 0.001) indicates that all the improvements are statistically significant. We studied several key factors of the DNN models and conclude the following: 1) for the two-stage trained DNN model (DNN1), the number of iterations of unsupervised training in the first stage is important for the second-stage supervised training; 2) for the semi-supervised trained DNN model (DNN2), unsupervised training plays a more important role than supervised training in the speaker verification task; 3) by increasing the depth of the neural networks using residual blocks, we can alleviate the hard optimization problem of deep neural networks and obtain an improvement over a shallow network, especially for DNN1; 4) adding phoneme information aids in learning the nonlinear mapping and provides a further performance improvement, and the effect is more significant for GMM i-vectors; 5) the proposed DNN-based mapping methods work well for short utterances of different and mixed durations; 6) the proposed models improve both the Ivector_GMM and Ivector_DNN systems, and after mapping, the Ivector_DNN system still performs better than the Ivector_GMM system; and 7) the best-validated models from SRE10 generalize well to the SITW dataset and give significant improvements for arbitrary-length short utterances.

References
 Dehak et al. (2011) Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 19(4), pp. 788–798.
 Lei et al. (2011) Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network, ICASSP 2014, pp. 1695–1699.
 Kanagasundaram et al. (2011) Kanagasundaram, A., Vogt, R., Dean, D., et al., 2011. I-vector based speaker recognition on short utterances, in Proc. of Interspeech 2011, pp. 2341–2344.
 Poddar et al. (2017) Poddar, A., Sahidullah, M., Saha, G., 2017. Speaker verification with short utterances: a review of challenges, trends and opportunities, IET Biometrics, 2018, 7(2), pp. 91–101.
 Das and Prasanna (2017) Das, R. K., Prasanna, S. M., 2017. Speaker verification from short utterance perspective: a review, IETE Technical Review, 2017, pp. 1–19.
 Prince et al. (2007) Prince, S. J., Elder, J. H., 2007. Probabilistic linear discriminant analysis for inferences about identity, ICCV 2007, pp. 1–8.
 Cumani (2014) Cumani, S., Plchot, O., Laface, P., 2014. On the use of i-vector posterior distributions in probabilistic linear discriminant analysis, IEEE Transactions on Audio, Speech, and Language Processing, 22(4), pp. 846–857.
 Cumani (2015) Cumani, S., 2015. Fast scoring of full posterior PLDA models, IEEE Transactions on Audio, Speech, and Language Processing, 23(11), pp. 2036–2045.
 Cumani (2016) Cumani, S., Laface, P., 2016. I-vector transformation and scaling for PLDA based speaker recognition, in Proc. of Odyssey 2016, pp. 39–46.
 Hasan et al. (2013) Hasan, T., Saeidi, R., Hansen, J. H., et al., 2013. Duration mismatch compensation for i-vector based speaker recognition systems, ICASSP 2013, pp. 7663–7667.
 Kanagasundaram et al. (2014) Kanagasundaram, A., Dean, D., Sridharan, S., et al., 2014. Improving short utterance i-vector speaker verification using utterance variance modeling and compensation techniques, Speech Communication, 59, pp. 69–82.
 Li et al. (2016) Li, L., Wang, D., Zhang, C., Zheng, T. Z., 2016. Improving short utterance speaker recognition by modeling speech unit classes, IEEE Transactions on Audio, Speech, and Language Processing, 24(6), pp. 1129–1139.
 Chen et al. (2016) Chen, L., Lee, K. A., Chng, E. S., Ma, B., Li, H., Dai, L. R., 2016. Content-aware local variability vector for speaker verification with short utterance, ICASSP 2016, pp. 5485–5489.
 Scheffer and Lei (2014) Scheffer, N., Lei, Y., 2014. Content matching for short duration speaker recognition, in Proc. of Interspeech 2014, pp. 1317–1321.
 Snyder et al. (2017) Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. Deep neural network embeddings for text-independent speaker verification, in Proc. of Interspeech 2017, pp. 999–1003.
 Zhang and Koishida (2017) Zhang, C., Koishida, K., 2017. End-to-end text-independent speaker verification with triplet loss on short utterances, in Proc. of Interspeech 2017, pp. 1487–1491.
 Guo et al. (2016) Guo, J., Yeung, G., Muralidharan, D., Arsikere, H., Afshan, A., Alwan, A., 2016. Speaker verification using short utterances with DNN-based estimation of subglottal acoustic features, in Proc. of Interspeech 2016, pp. 2219–2222.
 Guo et al. (2017a) Guo, J., Yang, R., Arsikere, H., Alwan, A., 2017. Robust speaker identification via fusion of subglottal resonances and cepstral features, the Journal of the Acoustical Society of America, 141(4), EL, pp. 420–426.
 Kheder et al. (2016) Kheder, W. B., Matrouf, D., Ajili, M., Bonastre, J. F., 2016. Probabilistic approach using joint long and short session iVectors modeling to deal with short utterances for speaker recognition, in Proc. of Interspeech 2016, pp. 1830–1834.
 Kheder et al. (2018) Kheder, W. B., Matrouf, D., Ajili, M., Bonastre, J. F., 2018. A unified joint model to deal with nuisance variabilities in the iVector space, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), pp. 633–645.
 Ma et al. (2017) Ma, J., Sethu, V., Ambikairajah, E., Lee, K. A., 2017. Incorporating local acoustic variability information into short duration speaker verification, in Proc. of Interspeech 2017, pp. 1502–1506.
 Bhattacharya et al. (2017) Bhattacharya, G., Alam, J., Kenny, P., 2017. Deep speaker embeddings for short-duration speaker verification, in Proc. of Interspeech 2017, pp. 1517–1521.
 He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 2016, pp. 770–778.
 Guo et al. (2017b) Guo, J., Nookala, U. A., Alwan, A., 2017. CNN-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances, in Proc. of Interspeech 2017, pp. 3712–3716.
 Hinton et al. (2006) Hinton, G. E., Osindero, S., Teh, Y. W., 2006. A fast learning algorithm for deep belief nets, Neural Computation, 18(7), pp. 1527–1554.
 Erhan et al. (2010) Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb), pp. 625–660.
 Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T., 2015. Semi-supervised learning with ladder networks, Advances in Neural Information Processing Systems 2015, pp. 3546–3554.
 Zhang et al. (2016) Zhang, Y., Lee, K., Lee, H., 2016. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, International Conference on Machine Learning 2016, pp. 612–621.
 Sabour et al. (2017) Sabour, S., Frosst, N., Hinton, G. E., 2017. Dynamic routing between capsules, Advances in Neural Information Processing Systems 2017, pp. 3859–3869.
 Bousquet and Rouvier (2017) Bousquet, P. M., Rouvier, M., 2017. Duration mismatch compensation using four-covariance model and deep neural network for speaker verification, in Proc. of Interspeech 2017, pp. 1547–1551.
 Kenny et al. (2013) Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., Dumouchel, P., 2013. PLDA for speaker verification with utterances of arbitrary duration, ICASSP 2013, pp. 7649–7653.
 Cieri et al. (2004) Cieri, C., Miller, D., Walker, K., 2004. The Fisher Corpus: a resource for the next generations of speech-to-text, in LREC, Vol. 4, pp. 69–71.
 Peddinti et al. (2015) Peddinti, V., Povey, D., Khudanpur, S., 2015. A time delay neural network architecture for efficient modeling of long temporal contexts, in Proc. of Interspeech 2015, pp. 3214–3218.
 Povey et al. (2011) Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Silovsky, J., 2011. The Kaldi speech recognition toolkit, in IEEE Workshop on Automatic Speech Recognition and Understanding 2011.
 Zhang et al. (2014) Zhang, X., Trmal, J., Povey, D., Khudanpur, S., 2014. Improving deep neural network acoustic models using generalized maxout networks, ICASSP 2014, pp. 215–219.
 Martin and Greenberg (2010) Martin, A. F., Greenberg, C. S., 2010. The NIST 2010 speaker recognition evaluation, in Proc. of Interspeech 2010.
 McLaren et al. (2015) McLaren, M., Lawson, A., Ferrer, L., Castan, D., Graciarena, M., 2015. The speakers in the wild speaker recognition challenge plan.
 Kingma et al. (2014) Kingma, D., Ba, J., 2014. Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
 Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning, pp. 448–456.
 Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Ghemawat, S., 2016. TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467.
 Snyder et al. (2015) Snyder, D., Garcia-Romero, D., Povey, D., 2015. Time delay deep neural network-based universal background models for speaker recognition, in IEEE Workshop on Automatic Speech Recognition and Understanding 2015, pp. 92–97.
 Fukunaga (1990) Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition. Academic Press.