Deep neural network based i-vector mapping for speaker verification using short utterances

10/16/2018 ∙ by Jinxi Guo, et al. ∙ Snap Inc.

Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector based systems have become the standard in speaker verification applications, but they are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model training methods for i-vector modeling using full-length and short utterance evaluation tasks. The two methods are Gaussian mixture model (GMM) based and deep neural network (DNN) based methods. The results indicate that the I-vector_DNN system outperforms the I-vector_GMM system under various durations. However, the performance of both systems degrades significantly as the duration of the utterances decreases. To address this issue, we propose two novel nonlinear mapping methods which train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vector can restore missing information and reduce the variance of the original short-utterance i-vectors. Both proposed methods model the joint representation of short and long utterance i-vectors by using an autoencoder. Experimental results using the NIST SRE 2010 dataset show that both methods provide significant improvement and result in a maximum of 28.43% relative improvement in equal error rate (EER) over a baseline system, when using a deep encoder with residual blocks and adding an additional phoneme vector. When further testing the best-validated models of SRE10 on the Speaker In The Wild dataset, the methods result in a 23.12% relative improvement under arbitrary-duration short-utterance conditions.




1 Introduction

The i-vector based framework has defined the state-of-the-art for text-independent speaker recognition. The i-vectors are extracted from either a Gaussian mixture model (GMM) based (Dehak et al., 2011) or a deep neural network (DNN) based system (Lei et al., 2011), and for the backend, probabilistic linear discriminant analysis (PLDA) (Prince et al., 2007) has been widely used. The i-vector/PLDA system performs well if long (e.g. more than 30 s) enrollment and test utterances are available, but the performance degrades rapidly when only limited data are available (Kanagasundaram et al., 2011). To address this issue, a range of techniques have been studied, each targeting a different aspect of the problem (Poddar et al., 2017; Das and Prasanna, 2017).

A number of methods have been proposed to model the variation of short utterance i-vectors. In Cumani (2014, 2015), a Full Posterior Distribution PLDA (FP-PLDA) is proposed to exploit the covariance of the i-vector distribution, which improves the standard Gaussian PLDA (G-PLDA) model by accounting for the uncertainty of i-vector extraction. In Hasan et al. (2013), the effect of short utterance i-vectors on system performance was analyzed, and the duration variability was modeled as additive noise in the i-vector space. The work in Kanagasundaram et al. (2014) introduces a short utterance variance normalization technique and a short utterance variance modeling approach at the i-vector feature level; the technique makes use of the covariance matrices of long and short i-vectors for normalization.

Alternatively, several approaches have been proposed that leverage phonetic information to perform content matching. The work in Li et al. (2016) proposes a GMM based subregion framework where speaker models are trained for each subregion defined by phonemes. Test utterances are then scored with subregion models. In Chen et al. (2016), the authors use the local session variability vectors estimated from certain phonetic components instead of computing the i-vector from the whole utterance. Phonetic classes are obtained by clustering similar senones (groups of triphones with similar acoustic properties) that are estimated from posterior probabilities of a DNN trained for phone state classification. Another approach was proposed in Scheffer and Lei (2014), which matches the zero-order statistics of test and enrollment utterances using posteriors of each phone state before computing the i-vectors.

In addition, a few studies have focused on the role of feature extraction and score calibration. In Guo et al. (2016, 2017a), the authors proposed several different methods (DNN and linear regression models) to estimate speaker-specific subglottal acoustic features, which are more stationary compared to MFCCs, largely phoneme independent, and can alleviate the phoneme mismatch between training and testing utterances. In addition, Hasan et al. (2013) proposes a Quality Measure Function (QMF), which is a score-calibration mechanism that compensates for the duration mismatch in trial scores.

Recently, several approaches have been proposed which use deep neural networks to learn speaker embeddings from short utterances. In Snyder et al. (2017), the authors use a neural network, which is trained to discriminate between a large number of speakers, to generate fixed-dimensional speaker embeddings, and the speaker embeddings are used for PLDA scoring. In Zhang and Koishida (2017), the authors propose an end-to-end system which directly learns a speaker discriminative embedding using a triplet loss function and an Inception Net. Both methods show improvement over GMM-based i-vector systems.

A few recent papers have focused on i-vector mapping, which maps the short utterance i-vector to its long version. In Kheder et al. (2016, 2018), the authors proposed a probabilistic approach, in which a GMM-based joint model between long and short utterance i-vectors was trained, and a minimum mean square error (MMSE) estimator was applied to transform a short i-vector to its long version. Since the GMM-based mapping function is actually a weighted sum of linear functions, our previous research (Guo et al., 2017b) demonstrates that a proposed non-linear mapping using convolutional neural networks (CNNs) outperforms the GMM-based linear mapping methods across different conditions. The CNN-based mapping methods use unsupervised learning to regularize the supervised regression model, and result in significant performance improvement.

This paper is an extension of our aforementioned work in Guo et al. (2017b), where we investigate neural network based non-linear mapping methods for i-vector mapping. Here, we first compare and analyze the performance of both GMM- and DNN-based i-vector systems with short-utterance evaluation tasks. Based on the results, which show that I-vector_DNN systems outperform I-vector_GMM systems across durations, we investigate our proposed non-linear i-vector mapping methods using I-vector_DNN systems. Two novel DNN-based i-vector mapping methods are proposed and compared. They both model the joint representation of short and long utterance i-vectors by making use of an autoencoder.

The first method trains an autoencoder using concatenated short and long utterance i-vectors, and the pre-trained weights are then fine-tuned for the supervised regression task, which directly maps short-utterance i-vectors to long-utterance i-vectors. By learning a joint embedding of short and long utterance i-vectors, the pre-trained autoencoder can help to initialize the weights at a desirable basin of the landscape of the loss function for the supervised training. Such pre-training proves to be useful especially when the training dataset is not large enough. Similar ideas of pre-training have been studied by Hinton et al. (2006) and Erhan et al. (2010).

The second method jointly trains the supervised regression model with an autoencoder to reconstruct the short-utterance i-vector itself. The autoencoder here plays the role of a regularizer, which is important when the training dataset is not large enough and the dimensions of the input and output are relatively high. The fact that the autoencoder loss helps prevent overfitting has been observed in the machine learning literature. For example, in Rasmus et al. (2015) and Zhang et al. (2016), a supervised neural network is augmented with decoding pathways for reconstruction, and it is shown that the reconstruction loss helps improve the performance of supervised tasks. More recently, a paper on CapsNet (Sabour et al., 2017) introduces a decoder that plays a critical role in achieving state-of-the-art performance on a classification task.

We further discuss several key factors of the proposed DNN mapping models in detail, including pre-training iterations, regularization weights and encoder depth. The best model provides more than 26.47% relative improvement. We also show that by adding phoneme information as an additional input, we can achieve further mapping improvements (28.43%). We apply the proposed mapping methods to different durations of evaluation utterances to represent real-life situations, and the results show their effectiveness across all conditions. The mapping results for both I-vector_GMM and I-vector_DNN systems are compared, and show significant improvement for both systems. Finally, in order to show the generalization of the proposed methods, we apply the best-validated models of the SRE10 (Martin and Greenberg, 2010) dataset to the Speaker In The Wild (SITW) dataset (McLaren et al., 2015), which also shows considerable improvement (23.12%).

This paper is structured as follows. Section 2 describes the state-of-the-art i-vector/PLDA speaker verification systems. Section 3 analyzes the effect of utterance duration on i-vectors and introduces the proposed DNN-based i-vector mapping methods in detail. Section 4 presents the experimental set-up. Experimental results and analysis of the proposed techniques are presented in Section 5. Section 6 discusses mapping effects, and finally, in Section 7, major conclusions are presented.

2 I-vector based speaker verification systems

As mentioned earlier, the state-of-the-art text-independent speaker verification system is based on the i-vector framework. In these systems, a universal background model (UBM) is used to collect sufficient statistics for i-vector extraction, and a PLDA backend is adopted to obtain the similarity scores between i-vectors. There are two different ways to model a UBM: using unsupervised-trained GMMs or using a DNN trained as a senone classifier. Therefore, we will introduce both the I-vector_GMM and I-vector_DNN systems as well as PLDA modeling.

2.1 I-vector_GMM system

The i-vector representation is based on the total variability modeling concept, which assumes that speaker- and channel-dependent variabilities reside in a low-dimensional subspace, represented by the total variability matrix $T$. Mathematically, the speaker- and channel-dependent GMM supervector $M$ can be modeled as:

$$M = m + Tw \quad (1)$$

where $m$ is the speaker- and channel-independent supervector, $T$ is a rectangular matrix of low rank and $w$ is a random vector called the i-vector, which has a standard normal distribution $\mathcal{N}(0, I)$.

In order to learn the total variability subspace, the Baum-Welch statistics need to be computed for a given utterance, which are defined as:

$$N_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega) \quad (2)$$

$$F_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega)\, y_t \quad (3)$$

where $N_c$ and $F_c$ represent the zeroth- and first-order statistics, $y_t$ is the feature sample at time index $t$, $L$ is the number of frames in the utterance, $\Omega$ represents the UBM of $C$ mixture components, $c = 1, \ldots, C$ is the Gaussian index, and $P(c \mid y_t, \Omega)$ corresponds to the posterior of mixture component $c$ generating the vector $y_t$.
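As a minimal sketch of the statistics described above, the snippet below accumulates zeroth- and first-order Baum-Welch statistics from a matrix of frame-level posteriors. The function name `baum_welch_stats` and the toy inputs are illustrative assumptions, not part of the original system (which relies on Kaldi).

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics for one utterance.

    posteriors: (frames, C) array, row t holds the posterior of each
                mixture component given frame t.
    features:   (frames, D) array of acoustic feature vectors.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    zeroth = posteriors.sum(axis=0)   # (C,): soft frame count per component
    first = posteriors.T @ features   # (C, D): posterior-weighted feature sums
    return zeroth, first

# Toy example (hypothetical values): 3 frames, 2 components, 2-dim features.
post = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
feats = np.array([[2.0, 0.0], [4.0, 2.0], [0.0, 6.0]])
N, F = baum_welch_stats(post, feats)
```

The same accumulation applies whether the posteriors come from a GMM or from a senone classifier; only the source of `posteriors` changes.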

2.2 I-vector_DNN system

As mentioned in the previous section, for an I-vector_GMM system, the posterior of mixture component $c$ generating the vector $y_t$ is computed with a GMM acoustic model $\Omega$ trained in an unsupervised fashion (i.e. with no phonetic labels).

$$P(c \mid y_t, \Omega) \;\rightarrow\; P(c \mid y_t, \Theta) \quad (4)$$

However, recently, inspired by the success of DNN acoustic models in automatic speech recognition (ASR), Lei et al. (2011) proposed a method which uses DNN senone (cluster of context-dependent triphones) posteriors to replace the GMM posteriors, as illustrated in Eq. 4, which leads to significant improvement in speaker verification. $\Theta$ represents the trained DNN model for senone classification.

The senone posterior approach uses ASR features to compute the class soft alignment and the standard speaker verification features for sufficient statistic estimation. Once sufficient statistics are accumulated, the training procedure is the same as in the previous section. In this paper, we use a state-of-the-art time delay neural network (TDNN) as in Peddinti et al. (2015) to train the ASR acoustic model.

2.3 PLDA modeling

PLDA is a generative model of i-vector distributions for speaker verification. In this paper, we use a simplified variant of PLDA, termed G-PLDA (Kenny et al., 2013), which is widely used by researchers. A standard G-PLDA assumes that the i-vector $w$ is represented by:

$$w = \mu + \Phi\beta + \epsilon \quad (5)$$

where $\mu$ is the mean of i-vectors, $\Phi$ defines the between-speaker subspace, and the latent variable $\beta$ represents the speaker identity and is assumed to have a standard normal distribution. The residual term $\epsilon$ represents the within-speaker variability, which is normally distributed with zero mean and full covariance $\Sigma$.

PLDA based i-vector system scoring is calculated using the log likelihood ratio (LLR) between a target and a test i-vector, denoted as $w_{target}$ and $w_{test}$. The likelihood ratio can be calculated as follows:

$$\mathrm{LLR} = \log \frac{p(w_{target}, w_{test} \mid H_1)}{p(w_{target} \mid H_0)\, p(w_{test} \mid H_0)} \quad (6)$$

where $H_1$ and $H_0$ denote the hypotheses that the two i-vectors represent the same speaker and different speakers, respectively.
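The log likelihood ratio can be evaluated directly from the Gaussian assumptions of G-PLDA. The sketch below does this in the most literal way: under the same-speaker hypothesis the two i-vectors share a latent identity, so their joint covariance has an off-diagonal block equal to the between-speaker covariance, while under the different-speaker hypothesis they are independent. The names `gplda_llr`, `Phi`, `Sigma` and the toy parameters are assumptions made for illustration; this is not the Kaldi scoring code used in the experiments, and practical implementations use a more efficient closed form.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal distribution."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def gplda_llr(w1, w2, mu, Phi, Sigma):
    """LLR of 'same speaker' versus 'different speakers' for two i-vectors."""
    between = Phi @ Phi.T        # between-speaker covariance
    total = between + Sigma      # marginal covariance of a single i-vector
    # Same speaker: the shared identity variable correlates the pair.
    joint_cov = np.block([[total, between], [between, total]])
    same = gaussian_logpdf(np.concatenate([w1, w2]),
                           np.concatenate([mu, mu]), joint_cov)
    # Different speakers: the two i-vectors are independent.
    diff = gaussian_logpdf(w1, mu, total) + gaussian_logpdf(w2, mu, total)
    return same - diff

# Toy 2-dimensional model (hypothetical parameters).
mu = np.zeros(2)
Phi = np.array([[1.0], [0.0]])   # one speaker factor along the first axis
Sigma = 0.1 * np.eye(2)
a, b, c = np.array([2.0, 0.0]), np.array([2.0, 0.1]), np.array([-2.0, 0.0])
llr_same = gplda_llr(a, b, mu, Phi, Sigma)   # close along the speaker axis
llr_diff = gplda_llr(a, c, mu, Phi, Sigma)   # far apart along the speaker axis
```

In this toy model, pairs that agree along the speaker direction receive a positive score and pairs that disagree receive a negative one.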

3 Short-utterance speaker verification

| | Long utterance | Short utterance |
|---|---|---|
| Mean variance | 283 | 493 |

Table 1: Mean variance of long and short utterances (from SRE and Switchboard dataset)

3.1 The effect of utterance durations on i-vectors

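Full-length i-vectors have relatively smaller variations compared with i-vectors extracted from short utterances (Poddar et al., 2017), because i-vectors of short utterances can vary considerably with changes in phonetic content. In order to show the variation changes between long and short utterance i-vectors, we first calculate the average diagonal covariance (denoted as $\sigma_s$) of i-vectors across all utterances of a given speaker and then calculate the mean (denoted as $\sigma_{mean}$) of the covariances over all speakers. $\sigma_s$ and $\sigma_{mean}$ are defined in Eqs. 7-8 as:

$$\sigma_s = \frac{1}{N_s}\,\mathrm{Tr}\!\left(\sum_{n=1}^{N_s} (w_n^s - \bar{w}_s)(w_n^s - \bar{w}_s)^{\top}\right) \quad (7)$$

$$\sigma_{mean} = \frac{1}{S}\sum_{s=1}^{S} \sigma_s \quad (8)$$

where $w_n^s$ is the i-vector of the $n$-th utterance of speaker $s$, and $\bar{w}_s$ corresponds to the mean of the i-vectors belonging to speaker $s$. $N_s$ represents the total number of utterances for speaker $s$, $\mathrm{Tr}(\cdot)$ represents the trace operation, and $S$ is the total number of speakers.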

In order to compare the mean variance for long and short utterance i-vectors, we choose around 4000 speakers with multiple long utterances (more than 2 min in duration and 100 s of active speech) from the SRE and Switchboard (SWB) datasets (in total around 40000 long utterances) and truncate each long utterance into multiple 5-10 s short utterances. We plot the distribution of active-speech length (utterance length after voice activity detection) across these 40000 long utterances in Fig. 1. The i-vectors are extracted for each short and long utterance using the I-vector_DNN system, and Table 1 shows the mean variance across all speakers calculated from long and short utterance i-vectors individually. The mean variances in Table 1 indicate that short-utterance i-vectors have larger variation than long-utterance i-vectors.
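The per-speaker variance and its mean over speakers can be sketched as follows. The trace of a covariance matrix equals the mean squared distance of the i-vectors from their speaker mean, so no explicit covariance matrix is needed. The normalization by the number of utterances (rather than that number minus one) and the toy data are assumptions for illustration.

```python
import numpy as np

def speaker_variance(ivectors):
    """Trace of the covariance of one speaker's i-vectors, shape (N_s, D)."""
    x = np.asarray(ivectors, dtype=float)
    centered = x - x.mean(axis=0)
    # Tr((1/N) * sum (w - mean)(w - mean)^T) = mean squared distance to the mean
    return float((centered ** 2).sum() / len(x))

def mean_variance(ivectors_by_speaker):
    """Average the per-speaker variance over all speakers."""
    return float(np.mean([speaker_variance(v)
                          for v in ivectors_by_speaker.values()]))

# Toy data (hypothetical): two speakers with 2-dimensional "i-vectors".
toy = {
    "spk1": [[0.0, 0.0], [2.0, 0.0]],
    "spk2": [[1.0, 1.0], [1.0, 3.0], [1.0, 5.0]],
}
overall = mean_variance(toy)
```

Running the same function once on long-utterance i-vectors and once on short-utterance i-vectors yields the two quantities being compared.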

Figure 1: Distribution of active speech length of the selected 40000 long utterances.

3.2 DNN-based i-vector mapping

In order to alleviate possible phoneme mismatch in text-independent short utterances, we propose several methods to map short-utterance i-vectors to their long version. This mapping is a many-to-one mapping, from which we want to restore the missing information from the short-utterance i-vectors and reduce their variance.

In this section, we will introduce and compare several novel DNN-based i-vector mapping methods. Our pilot experiments indicate that, if we train a supervised DNN to learn this mapping directly, which is similar to the approaches in Bousquet and Rouvier (2017), the improvement is not significant, due to over-fitting to the training dataset. In order to solve this problem, we propose two different methods which both model the joint representation of short and long utterance i-vectors by using an autoencoder. The decoder reconstructs the original input representation and forces the encoded embedding to learn a hidden space which represents both short and long utterance i-vectors and thus can lead to better generalization. The first is a two-stage method: an autoencoder is first used to train a bottleneck representation of both long and short utterance i-vectors, and the pre-trained weights are then used to perform supervised fine-tuning of the model, which maps the short-utterance i-vector to its long version directly. The second is a single-stage method: the supervised regression model is trained jointly with an autoencoder that reconstructs the short i-vector. The final loss to optimize is a weighted sum of the supervised regression loss and the reconstruction loss. In the following subsections, we will introduce these two methods in detail.

3.2.1 DNN1 (two-stage method): pre-training and fine-tuning

In order to find a good initialization of the supervised DNN model, we first train a joint representation of both short and long utterance i-vectors using an autoencoder. We first concatenate the short i-vector $w_s$ and its long version $w_l$ into $z$, then the concatenated vector $z$ is used to train an autoencoder with some specific constraints. The autoencoder learns the joint hidden representation of both short and long i-vectors, which leads to good initialization of the second-stage supervised fine-tuning. The autoencoder consists of an encoder and a decoder as illustrated in Fig. 2. The encoder function $h = f(z)$ learns a hidden representation of the input vector $z$, and the decoder function $\hat{z} = g(h)$ produces a reconstruction. The learning process is described as minimizing the loss function $L(z, g(f(z)))$. In order to learn a more useful representation, we add a restriction on the autoencoder: we constrain the hidden representation $h$ to have a relatively small dimension in order to learn the most salient features of the training data.

Figure 2: DNN1: two-stage training of i-vector mapping. The left schema corresponds to the first-stage pre-training. A short-utterance i-vector $w_s$ and a corresponding long-utterance i-vector $w_l$ are first concatenated into $z$. Then $z$ is fed into an encoder to generate the joint embedding $h$. $h$ is passed to the decoder to generate the reconstructed $\hat{z}$, which is expected to be a concatenation of a reconstructed $\hat{w}_s$ and $\hat{w}_l$. The right schema corresponds to the second-stage fine-tuning. The weights pre-trained in the first stage are used to initialize the supervised regression model from $w_s$ to $w_l$. After training, the estimated i-vector $\hat{w}_l$ is used for evaluation.

Figure 3: Residual block. An input $x$ is first passed into two hidden layers to get $F(x)$ and it also goes through a short-cut connection, which skips the hidden layers and directly comes to the output. The final output of the residual block is a summation of $F(x)$ and $x$.

Figure 4: DNN2: single-stage training of i-vector mapping. A short-utterance i-vector $w_s$ is passed to an encoder, and the output of the encoder is first used to generate the estimated long-utterance i-vector $\hat{w}_l$ and is also fed into a decoder to generate the reconstructed short-utterance i-vector $\hat{w}_s$. The two tasks are optimized jointly.

For the encoder function $f(\cdot)$, we adopt options ranging from several fully-connected layers to stacked residual blocks (He et al., 2016), in order to investigate the effect of encoder depth. Each residual block has two fully-connected layers with a short-cut connection as shown in Fig. 3. By using residual blocks, we are able to train a very deep neural network without adding extra parameters. A deep encoder may help learn better hidden representations. For the decoder function $g(\cdot)$, we use a single fully connected layer with a linear regression layer, since it is enough to approximate the mapping from the learned hidden representation to the output vector. For the loss function, we use the mean square error criterion, which is $\|z - \hat{z}\|_2^2$.
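A residual block of the kind described above can be sketched as a plain forward pass. The sketch assumes a ReLU between the two fully-connected layers and omits batch normalization for brevity; the function name and weight shapes are illustrative only. The short-cut is an identity connection, which is why it adds no parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Forward pass of a two-layer residual block: output = F(x) + x.

    x:  (batch, D) input
    W1: (D, D), b1: (D,)  first fully-connected layer
    W2: (D, D), b2: (D,)  second fully-connected layer
    """
    hidden = relu(x @ W1 + b1)   # first hidden layer
    fx = hidden @ W2 + b2        # second hidden layer gives F(x)
    return fx + x                # identity short-cut added to F(x)

# Toy usage with random weights (dimension 4 instead of 1200).
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
b1, b2 = np.zeros(4), np.zeros(4)
y = residual_block(x, W1, b1, W2, b2)
```

Because the block only has to learn the residual F(x), setting the second layer to zero reduces it to the identity mapping, which is one common explanation for why stacks of such blocks remain trainable at greater depth.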

Once the autoencoder is trained, we use the trained DNN structure and weights to initialize the supervised mapping. We optimize the loss between the predicted long i-vector $\hat{w}_l$ and the real long i-vector $w_l$ as shown in Fig. 2. We denote this method as DNN1.

3.2.2 DNN2 (single-stage method): semi-supervised training

The two-stage method described in the previous section first trains a joint representation using the autoencoder and then performs fine-tuning to train the supervised mapping. In this section, we introduce another unified semi-supervised framework based on our previous work (Guo et al., 2017b), which jointly trains the supervised mapping with an autoencoder to minimize the reconstruction error. The joint framework is motivated by the fact that, by sharing the hidden representations among supervised and unsupervised tasks, the network generalizes better; it also avoids the two-stage training procedure and speeds up training. This method is denoted as DNN2.

We adopt the same autoencoder framework as in the previous section, which has an encoder and a decoder, but the input to the encoder here is the short-utterance i-vector $w_s$. The output from the encoder is connected to a linear regression layer to predict the long-utterance i-vector $w_l$, and it is also used to reconstruct the short-utterance i-vector $w_s$ itself by feeding it into a decoder, which gives rise to the autoencoder structure. The entire framework is shown in Fig. 4.

We define a new objective function to jointly train the network. Let us use $\hat{w}_l$ and $\hat{w}_s$ to represent the output from the supervised regression model and the autoencoder, respectively. We can define the objective loss function $L_{total}$, which combines the loss from the regression model and the autoencoder in a weighted fashion, as:

$$L_{total}(w_s, w_l; \theta_r, \theta_a) = (1 - \alpha)\, L_r(w_s, w_l; \theta_r) + \alpha\, L_a(w_s; \theta_a) \quad (9)$$

where $L_r$ is the loss of the regression model defined as

$$L_r(w_s, w_l; \theta_r) = \| w_l - \hat{w}_l \|_2^2 \quad (10)$$

and $L_a$ is the loss of the autoencoder defined as:

$$L_a(w_s; \theta_a) = \| w_s - \hat{w}_s \|_2^2 \quad (11)$$

Moreover, $\theta_r$ and $\theta_a$ are parameters of the regression model and autoencoder respectively, which are jointly trained and share the weights of the encoder layer.

$\alpha$ is a scalar weight, which determines how much the reconstruction error is used to regularize the supervised learning. The reconstruction loss of the autoencoder $L_a$ forces the hidden vector generated from the encoder to reconstruct the short-utterance i-vector $w_s$ in addition to predicting the target long-utterance i-vector $w_l$, and helps prevent the hidden vector from over-fitting $w_l$. For testing, we only use the output from the regression model $\hat{w}_l$ as the mapped i-vector.
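The joint objective can be sketched as a small function over batches of i-vectors. The sketch assumes the weighted sum takes the form (1 - alpha) times the regression loss plus alpha times the reconstruction loss, which is consistent with the equal-weight setting of 0.5 used in the initial experiments; the function name and toy values are illustrative.

```python
import numpy as np

def joint_loss(w_long, w_long_hat, w_short, w_short_hat, alpha=0.5):
    """Weighted sum of regression and reconstruction squared errors.

    All inputs are (batch, D) arrays. alpha weights the autoencoder
    (reconstruction) term; (1 - alpha) weights the regression term.
    Returns (total, regression_loss, reconstruction_loss).
    """
    l_r = np.mean(np.sum((w_long - w_long_hat) ** 2, axis=1))
    l_a = np.mean(np.sum((w_short - w_short_hat) ** 2, axis=1))
    return (1.0 - alpha) * l_r + alpha * l_a, l_r, l_a

# Toy batch of one example (hypothetical values).
w_long = np.array([[1.0, 0.0]])
w_long_hat = np.array([[0.0, 0.0]])    # regression output
w_short = np.array([[0.0, 2.0]])
w_short_hat = np.array([[0.0, 0.0]])   # decoder output
total, l_r, l_a = joint_loss(w_long, w_long_hat, w_short, w_short_hat)
```

Setting `alpha` to 0 recovers purely supervised regression, while larger values make the reconstruction term act as a stronger regularizer.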

Figure 5: I-vector mapping with additional phoneme information. A short-utterance i-vector $w_s$ is concatenated with a phoneme vector $p$ to generate the estimated long-utterance i-vector $\hat{w}_l$.

| | I-vector_GMM | I-vector_DNN |
|---|---|---|
| UBM (3472) | Switchboard, NIST 04, 05, 06, 08 | Fisher English |
| T (600) | Switchboard, NIST 04, 05, 06, 08 | Switchboard, NIST 04, 05, 06, 08 |
| PLDA | NIST 04, 05, 06, 08 | NIST 04, 05, 06, 08 |

Table 2: Datasets used for developing I-vector_GMM and I-vector_DNN systems

3.2.3 Adding phoneme information

The variance of short utterances is mainly due to phonetic differences. In order to help the neural network learn this non-linear mapping, for a given utterance, we extract the senone posteriors for each frame and calculate the mean posterior across frames as a phoneme vector, which is then appended to a short utterance i-vector as input (Fig. 5). The training procedure still follows the proposed joint modeling methods (DNN1 or DNN2). The phoneme vectors are expected to help normalize the short-utterance i-vector, and provide extra information for this mapping. The phoneme vector $p$ is defined as:

$$p_c = \frac{1}{L}\sum_{t=1}^{L} P(c \mid y_t, \Theta), \quad c = 1, \ldots, C \quad (12)$$

where $p_c$ is the $c$-th element of $p$ and $L$ is the number of frames in the utterance. The posterior $P(c \mid y_t, \Theta)$ is generated from the TDNN-based senone classifier, which was defined in Section 2.2.

| Condition | System | Female EER, % (Rel Imp) | Female DCF08/DCF10 | Male EER, % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|---|
| Full-length | I-vector_GMM | 2.2 | 0.011/0.043 | 1.7 | 0.008/0.036 |
| Full-length | I-vector_DNN | 1.4 (36.36%) | 0.005/0.022 | 0.8 (52.94%) | 0.003/0.017 |
| 10 s-10 s | I-vector_GMM | 13.8 | 0.063/0.097 | 13.3 | 0.057/0.099 |
| 10 s-10 s | I-vector_DNN | 12.2 (11.59%) | 0.054/0.093 | 10.2 (23.31%) | 0.048/0.095 |
| 5 s-5 s | I-vector_GMM | 21.7 | 0.083/0.099 | 20.4 | 0.080/0.100 |
| 5 s-5 s | I-vector_DNN | 19.9 (8.29%) | 0.078/0.099 | 17.0 (16.67%) | 0.072/0.100 |

Table 3: Baseline results for I-vector_GMM and I-vector_DNN systems under full-length and short-length utterance conditions, reported in terms of EER, relative improvement (Rel Imp) and minDCF.

4 Experimental set-up

4.1 I-vector baseline systems

We evaluate our techniques using the state-of-the-art GMM- and DNN-based i-vector/G-PLDA systems using the Kaldi toolkit (Povey et al., 2011).

4.1.1 Configurations of I-vector_GMM system

For the I-vector_GMM system, the first 20 MFCC coefficients (discarding the zeroth coefficient) and their first and second order derivatives are extracted from the detected speech segments after an energy-based voice activity detection (VAD). A 20 ms Hamming window, a 10 ms frame shift, and a 23-channel filterbank are used. Universal background models with 3472 Gaussian components are trained, in order to have a fair comparison with the I-vector_DNN system, whose DNN has 3472 outputs. Initial training consists of four iterations of EM using a diagonal covariance matrix and then an additional four iterations with a full-covariance matrix. The total variability subspace with low rank (600) is trained for five iterations of EM. The backend training consists of i-vector mean subtraction and length normalization, followed by PLDA scoring.
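The backend pre-processing mentioned above (mean subtraction followed by length normalization) can be sketched in a few lines. The sketch scales each i-vector to unit Euclidean norm, which is an assumption; some toolkits scale to a different fixed norm instead. The function name and toy vectors are illustrative.

```python
import numpy as np

def center_and_length_normalize(ivectors, mean):
    """Subtract a global mean, then scale every i-vector to unit length."""
    centered = np.asarray(ivectors, dtype=float) - mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, 1e-12)   # guard against zero vectors

# Toy example (hypothetical 2-dimensional "i-vectors").
ivecs = np.array([[3.0, 4.0], [1.0, 0.0], [-1.0, 2.0]])
train_mean = ivecs.mean(axis=0)   # in practice estimated on training data
normalized = center_and_length_normalize(ivecs, train_mean)
```

Length normalization makes the i-vector distribution closer to the Gaussian assumption of the PLDA backend, which is the usual motivation for applying it before scoring.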

The UBM and i-vector extractor training data consist of male and female utterances from the SWB and NIST SRE datasets. The SWB data contains 1000 speakers and 8905 utterances of SWB 2 Phases II. The SRE dataset consists of 3805 speakers and 36614 utterances from SRE 04, 05, 06, 08. The PLDA backends are trained only on the SRE data. The dataset information is summarized in Table 2.

4.1.2 Configurations of I-vector_DNN system

For the I-vector_DNN system, a TDNN is trained using about 1,800 hours of the English portion of Fisher (Cieri et al., 2004). In the TDNN acoustic modeling system, a narrow temporal context is provided to the first layer and context width increases for the subsequent hidden layers, which enables higher levels of the network to learn greater temporal relationships. The features are 40 mel-filterbank features with a frame-length of 25 ms. Cepstral mean subtraction is performed over a window of 6 s. The TDNN has six layers, and a splicing configuration similar to those described in Peddinti et al. (2015). In total, the DNN has a left-context of 13 and a right-context of 9. The hidden layers use the p-norm (where p = 2) activation function (Zhang et al., 2014), an input dimension of 350, and an output dimension of 3500. The softmax output layer computes posteriors for 3472 triphone states, which is the same as the number of components for the I-vector_GMM system. No fMLLR or i-vectors are used for speaker adaptation.

The trained TDNN is used to create a UBM which directly models phonetic content. A supervised GMM with full covariance is created first to initialize the i-vector extractor based on TDNN posteriors and speaker recognition features. Training the total variability matrix T also requires TDNN posteriors and speaker recognition features. During i-vector extraction, the only difference between this and the standard GMM-based systems is the model used to compute posteriors. In the I-vector_GMM system, speaker recognition features are selected using a frame-level VAD. However, in order to maintain the correct temporal context, we cannot remove frames from the TDNN input features. Instead, the VAD results are used to filter out posteriors corresponding to non-speech frames.
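The filtering step described above can be illustrated with a short sketch: posteriors are computed on all frames so that the network sees the full temporal context, and only afterwards are the non-speech frames dropped using the VAD decisions. The function name, the boolean-mask representation of the VAD output and the toy arrays are assumptions for illustration.

```python
import numpy as np

def drop_nonspeech(posteriors, features, vad_mask):
    """Keep only frames marked as speech, after posteriors were computed.

    posteriors: (frames, C) senone posteriors computed on ALL frames
    features:   (frames, D) speaker recognition features
    vad_mask:   (frames,) truthy for speech frames
    """
    mask = np.asarray(vad_mask, dtype=bool)
    return np.asarray(posteriors)[mask], np.asarray(features)[mask]

# Toy example: 4 frames, the second one is non-speech.
post = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9]])
feats = np.arange(8.0).reshape(4, 2)
vad = [1, 0, 1, 1]
speech_post, speech_feats = drop_nonspeech(post, feats, vad)
```

The filtered posteriors and features stay frame-aligned, so they can be passed together to the sufficient-statistics accumulation.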

4.1.3 Evaluation databases

We first evaluate our systems on condition 5 (extended task) of SRE10 (Martin and Greenberg, 2010). The test consists of conversational telephone speech in enrollment and test utterances. There are 416119 trials, over 98% of which are nontarget comparisons. Among all trials, 236781 trials are for female speakers and 179338 trials are for male speakers. For short-utterance speaker verification tasks, we extracted short utterances which contain 10 s and 5 s speech (after VAD) from condition 5 (extended task). We train the PLDA and evaluate the trials in a gender-dependent way.

Moreover, in order to validate our proposed methods in real conditions and demonstrate the models' generalization, we use SITW, a recently published speech database (McLaren et al., 2015). The SITW speech data was collected from open-source media channels with considerable mismatch in terms of audio conditions. We designed an arbitrary-length short-utterance task using the SITW dataset to represent real-life conditions. We show the evaluation results using the best-performing models validated on the SRE10 dataset.

4.2 I-vector mapping training

In order to train the i-vector mapping model, we selected 39754 long utterances, each having more than 100 s of speech after VAD, from the development dataset. For each long utterance, we used a 5 s or 10 s window to truncate the utterance, with a shift step of half the window size (2.5 s or 5 s). We applied this procedure to all long utterances, and in the end we obtained 1.2M 10 s utterances and 2.4M 5 s utterances. Each short-utterance i-vector, together with its corresponding long-utterance i-vector, is used as a training pair for the DNN-based mapping models. We train the mapping models for each gender separately and evaluate the models in a gender-dependent way.
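The truncation scheme can be sketched as a sliding window with 50% overlap. The sketch assumes that a trailing segment shorter than the window is discarded, which the text does not state; the function name is illustrative.

```python
def window_bounds(speech_len_s, window_s):
    """Return (start, end) times in seconds of overlapping short segments.

    The shift between consecutive windows is half the window size.
    Trailing audio shorter than a full window is dropped (an assumption).
    """
    shift = window_s / 2.0
    bounds = []
    start = 0.0
    while start + window_s <= speech_len_s:
        bounds.append((start, start + window_s))
        start += shift
    return bounds

# A 100 s utterance yields 19 segments of 10 s and 39 segments of 5 s.
segments_10s = window_bounds(100.0, 10.0)
segments_5s = window_bounds(100.0, 5.0)
```

The roughly two-to-one ratio between the number of 5 s and 10 s segments in this example mirrors the ratio between the 2.4M and 1.2M training utterances reported above.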

For the proposed two DNN-based mapping models, we use the same encoder and decoder configurations. For the encoder, we first use two fully-connected layers. The first layer has 1200 hidden nodes and the second layer has 600 hidden nodes which is a bottleneck layer (1.44M parameters in total). In order to investigate the depth of the encoder, we design a deep structure with two residual blocks and a bottleneck layer, in a total of 5 layers. Each residual block (as defined in Section 3.2.1) has two fully connected layers with 1200 hidden nodes and the bottleneck layer has 600 hidden nodes (5.76M parameters in total). For the decoder, we always use one fully-connected layer (1200 hidden nodes) with a linear output layer (1.44M parameters in total).

In order to add phoneme information for i-vector mapping, phoneme vectors are generated for each utterance by taking the average of the senone posteriors across frames. Since the phoneme vectors have a much smaller value range than the i-vectors, their effect on training the mapping would otherwise be de-emphasized. Therefore, we scale up the phoneme vector values by a factor of 500, in order to match the range of i-vector values. The up-scaled phoneme vector is then concatenated with the short-utterance i-vector for i-vector mapping.
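The construction of the mapping input can be sketched as follows: average the frame-level senone posteriors, scale the result by 500, and append it to the short-utterance i-vector. The function names and the tiny toy dimensions are illustrative assumptions; in the described system the i-vector has 600 dimensions and the posterior vector has 3472.

```python
import numpy as np

def phoneme_vector(senone_posteriors):
    """Mean senone posterior across frames; input shape (frames, C)."""
    return np.asarray(senone_posteriors, dtype=float).mean(axis=0)

def mapping_input(short_ivector, senone_posteriors, scale=500.0):
    """Append the up-scaled phoneme vector to the short-utterance i-vector."""
    p = scale * phoneme_vector(senone_posteriors)
    return np.concatenate([np.asarray(short_ivector, dtype=float), p])

# Toy example: 2 frames, 2 senones, 2-dimensional i-vector.
post = np.array([[0.2, 0.8], [0.6, 0.4]])
ivec = np.array([1.0, -1.0])
x = mapping_input(ivec, post)
```

Since posteriors sum to one over thousands of senones, individual entries of the mean posterior are very small, which is why some rescaling is needed before concatenation.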

All neural networks are trained using the Adam optimization strategy (Kingma et al., 2014) with the mean square error criterion and an exponentially decaying learning rate starting from 0.001. The networks are initialized with the Xavier initializer (Glorot and Bengio, 2010), which is better than the Gaussian initializer as shown in Guo et al. (2017b). The ReLU activation function is used for all layers. For each layer, before passing the tensors to the nonlinearity function, a batch normalization layer (Ioffe and Szegedy, 2015) is applied to normalize the tensors and speed up the convergence. For the combined loss of DNN2, we set equal weights ($\alpha = 0.5$) for both the regression and autoencoder losses in initial experiments. The training data are shuffled at each epoch. The TensorFlow toolkit (Abadi et al., 2016) is used for neural network training.

| Method | Female EER, % (Rel Imp) | Female DCF08/DCF10 | Male EER, % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| matched length PLDA | 11.3 (7.38%) | 0.052/0.093 | 9.4 (7.84%) | 0.043/0.095 |
| LDA 150 | 11.6 (5.00%) | 0.052/0.093 | 9.8 (3.92%) | 0.047/0.093 |
| DNN direct mapping | 10.5 (13.93%) | 0.054/0.096 | 9.7 (4.90%) | 0.047/0.093 |
| DNN1 mapping | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.090 |
| DNN2 mapping | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.089 |

Table 4: Results for baseline (I-vector_DNN), matched-length PLDA training, LDA dimension reduction, DNN direct mapping and proposed DNN mapping in the 10 s-10 s condition.

5 Evaluation results and analysis

5.1 I-vector baseline systems

In this section, we present and compare two baseline systems, an I-vector_GMM system and an I-vector_DNN system, on the standard NIST SRE 10 full-length condition and on the truncated 10 s-10 s and 5 s-5 s conditions.

Table 3 shows the equal error rate (EER) and minimum detection cost function (minDCF) of the two baseline systems under the full-length evaluation condition and the truncated short-length evaluation conditions. Both DCF08 and DCF10 (defined in the NIST 2008 and 2010 evaluation plans) are shown in the table. From the table, we can observe that the I-vector_DNN system gives a significant improvement under the full-length condition compared with the I-vector_GMM system, achieving a maximum relative improvement of 52.94% for the male condition, which is consistent with previously reported results (Snyder et al., 2015). This is mainly because the DNN model provides phonetically-aware class alignments, which can better model speakers. The good performance is also due to the strong TDNN-based senone classifier, which makes the alignments more accurate and robust. When both systems are evaluated on the truncated 10 s-10 s and 5 s-5 s conditions, performance degrades significantly compared with the full-length condition. The main reason is that when the evaluation utterances are shorter, there is significant phonetic mismatch between utterances. Nevertheless, the I-vector_DNN system still outperforms the I-vector_GMM system by 8%-24%, although the improvement is not as large as in the full-length condition. From the table, we can also observe that the improvement is more significant for male speakers across all conditions. This may be because phoneme classification is more accurate for male speakers, which could lead to better phoneme-aware speaker modeling.
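For readers less familiar with the metric, the EER is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how it can be estimated from trial scores; the function name, the threshold sweep over observed scores, and the toy scores are assumptions made for illustration and do not reproduce the scoring tools used in the experiments.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores and
    picking the point where false-accept and false-reject rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy similarity scores (hypothetical): higher means "same speaker".
target = [0.9, 0.8, 0.7, 0.3]
nontarget = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(target, nontarget)
print(f"EER = {100 * eer:.1f}%")
```

On real evaluations with hundreds of thousands of trials, a sorted cumulative sweep is preferable to this quadratic loop, but the definition is the same.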

5.2 I-vector mapping results

In this section, we show and discuss the performance of the proposed algorithms when only short utterances are available for evaluation. Since Table 3 shows better performance for the I-vector_DNN system, we mainly use this system to investigate the mapping methods. We first show the results on the 10 s-10 s condition.

Previous work (Kheder et al., 2016; Guo et al., 2017b) highlights the importance of duration matching in PLDA model training. For instance, when the PLDA is trained using long utterances and evaluated on short utterances, speaker verification performance degrades compared to a PLDA trained using matched-length short utterances. Therefore, we show baseline results not only for the PLDA trained using the regular SRE development utterances, but also for the PLDA trained using truncated matched-length short utterances.

As a further baseline comparison, we first apply dimensionality reduction to the i-vectors using linear discriminant analysis (LDA), reducing their dimension from 600 to 150. This value was selected according to the results of previous research (Cumani, 2016). LDA maximizes inter-speaker variability while minimizing intra-speaker variability. We train the LDA transformation matrix using the SRE development dataset, then perform the dimension reduction for all development utterances and train a new PLDA model. For evaluation, all i-vectors are first subjected to dimensionality reduction, and the new PLDA model is then used to obtain similarity scores. To compare with another short-utterance compensation technique, we evaluate the i-vector mapping method proposed in Bousquet and Rouvier (2017), which uses DNNs to train a direct mapping from short-utterance i-vectors to the corresponding long version. Similar to Bousquet and Rouvier (2017), we also add some long-utterance i-vectors as input for regularization purposes.

| System | Female EER % (Rel Imp) | Female DCF08/DCF10 | Male EER % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| DNN1 mapping (3 layer) | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.090 |
| DNN2 mapping (3 layer) | 9.5 (22.13%) | 0.047/0.091 | 7.7 (24.51%) | 0.039/0.089 |
| DNN1 mapping (6 layer + residual block) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| DNN2 mapping (6 layer + residual block) | 9.3 (23.77%) | 0.047/0.091 | 7.6 (25.49%) | 0.038/0.089 |

Table 5: DNN-based mapping results using DNNs with different depths in the 10 s-10 s condition.

For our proposed DNN mapping methods, we first show the mapping results for both DNN1 and DNN2 with three hidden layers. Note that for mapped i-vectors, we use the same PLDA as the baseline system to obtain similarity scores. We further investigate the effect of the number of pre-training iterations for DNN1, the weight of the reconstruction loss for DNN2, and the depth of the encoder; compare the results for different durations; and investigate the effect of additional phoneme information. We also compare the mapping results for the I-vector_GMM and I-vector_DNN systems. Finally, we test the generalization of the trained models on the SITW dataset.

Table 4 presents the results for the regular PLDA training condition (baseline), the matched-length PLDA condition, the LDA dimensionality reduction method, the DNN-based direct mapping method, the DNN-based two-stage method (DNN1) and the DNN-based single-stage method (DNN2, with a reconstruction-loss weight of 0.5). We observe that matched-length PLDA training gives a considerable improvement over non-matched PLDA training (baseline), which is consistent with previous work. When the PLDA is trained using short-utterance i-vectors, the system can capture the variance of short-utterance i-vectors. Using LDA for dimensionality reduction also results in some improvement, since it reduces the variance of the i-vectors. DNN-based direct mapping gives more improvement for female speakers (13.93%) than for male speakers (4.90%) in terms of EER; this may be because more training data is available for female speakers, so the over-fitting problem is less severe for them. In the last two rows, we show the performance of our proposed DNN-based mapping methods on short-utterance i-vectors. Both result in significant improvements over the baseline for the EER and minDCF metrics, and they also outperform the other short-utterance compensation methods by a large margin. DNN1 and DNN2 have comparable performance, which suggests the importance of learning a joint representation of short- and long-utterance i-vectors. The proposed methods outperform the baseline by 22.13% for female speakers and by 24.51% for male speakers. One advantage of DNN2 is that its unified framework avoids the two-stage training procedure, which speeds up training.
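The relative improvements quoted in the tables follow from the EER values as (baseline minus system) divided by baseline. A short check, using the DNN1/DNN2 row of Table 4 (the helper name is an assumption for illustration):

```python
def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


female = relative_improvement(12.2, 9.5)  # female baseline vs. DNN mapping
male = relative_improvement(10.2, 7.7)    # male baseline vs. DNN mapping
print(f"{female:.2f}% {male:.2f}%")  # prints "22.13% 24.51%"
```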

| System | Female EER % (Rel Imp) | Female DCF08/DCF10 | Male EER % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|
| baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| DNN mapping (best) + phoneme info | 8.9 (27.05%) | 0.046/0.090 | 7.3 (28.43%) | 0.037/0.090 |

Table 6: DNN-based mapping results with additional phoneme information in the 10 s-10 s condition.

5.2.1 Effect of pre-training for DNN1

In this section, we show how first-stage pre-training influences the second-stage mapping training for DNN1. We varied the number of first-stage pre-training iterations from 10000 to 50000. Interestingly, when the number of pre-training iterations is small, the second-stage fine-tuning over-fits the data, but when it is large, the fine-tuning results are not optimal. In the end, about 25000 iterations gave a good initialization for second-stage fine-tuning. This indicates that the number of iterations of unsupervised training does influence the second-stage supervised training.

Figure 6: EER as a function of the reconstruction-loss weight for DNN2.

5.2.2 Effect of reconstruction loss for DNN2

In this section, we investigate the impact of the weight of the reconstruction loss in DNN2. We set the reconstruction weight to each of {0.1, 0.2, 0.5, 0.8, 0.9}. Since the regression loss receives the remaining weight, the larger the reconstruction weight is, the less weight is assigned to the regression loss. Fig. 6 shows the EER for female speakers as a function of the weight assigned to the reconstruction loss. The reconstruction loss is clearly important for this joint learning framework: it forces the network to learn the original representations of short utterances, which regularizes the regression task and helps it generalize better. The optimal reconstruction weight is 0.8, which indicates that the reconstruction loss is even more important than the regression loss for this task. Hence, unsupervised learning appears to be crucial for a speaker recognition task.
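The weighting discussed in this section can be sketched as a convex combination of two mean square errors. This is an illustration under stated assumptions rather than the authors' code: it assumes the regression weight is one minus the reconstruction weight, that the regression target is the long-utterance i-vector, and that the reconstruction target is the short-utterance input; all names are hypothetical.

```python
def mse(prediction, target):
    """Mean square error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def combined_loss(pred_long, long_ivector, recon_short, short_ivector,
                  recon_weight=0.5):
    """Weighted sum of the regression loss (short -> long mapping) and the
    autoencoder reconstruction loss (short -> short)."""
    regression = mse(pred_long, long_ivector)
    reconstruction = mse(recon_short, short_ivector)
    return (1.0 - recon_weight) * regression + recon_weight * reconstruction


# Toy 2-dimensional vectors: regression MSE is 2.0, reconstruction MSE is 1.0.
loss_equal = combined_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], [1.0, 1.0])
loss_recon_heavy = combined_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], [1.0, 1.0],
                                 recon_weight=0.8)
print(loss_equal, loss_recon_heavy)
```

With a weight of 0.8, errors in reconstructing the short-utterance i-vector dominate the objective, which is the setting reported as optimal above.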

| Condition | System | Female EER % (Rel Imp) | Female DCF08/DCF10 | Male EER % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|---|
| 10 s-10 s | baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| 10 s-10 s | DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| 5 s-5 s | baseline | 19.9 | 0.078/0.099 | 17.0 | 0.072/0.100 |
| 5 s-5 s | DNN mapping (best) | 14.8 (25.62%) | 0.067/0.099 | 13.5 (20.59%) | 0.061/0.100 |
| mixed | baseline | 17.8 | 0.068/0.097 | 14.4 | 0.061/0.100 |
| mixed | DNN mapping (best) | 13.2 (25.84%) | 0.061/0.097 | 11.8 (18.06%) | 0.053/0.096 |

Table 7: DNN-based mapping results with different utterance durations.

5.2.3 Effect of encoder depth

The depth of a neural network has been shown to be important for its performance, since adding more layers increases the network's capacity to model the data. Therefore, as discussed in Section 4.2, we compare a shallow (2-layer) and a deep (5-layer) encoder for both DNN1 and DNN2. It is well known that training a deep model suffers from vanishing/exploding gradient problems and can easily get stuck in local minima. We use two methods to alleviate this problem. Firstly, as stated in Section 4.2, we use a normalized initialization (Xavier initialization) and a batch normalization layer to normalize the intermediate hidden output. Secondly, we apply residual learning, which uses several residual blocks (defined in Section 3.2.1) with no extra parameters compared with regular fully-connected layers. The residual blocks ease the flow of information between layers and enable smooth forward/backward propagation, which makes it feasible to train deep networks. To our knowledge, this is one of the first studies to investigate the effect of residual networks for auto-encoders and unsupervised learning. Here, for the deep encoder, we use 2 residual blocks and 1 fully-connected bottleneck layer (5 layers in total). For the decoder, we use a single hidden layer with a linear regression output layer.
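To make the shortcut idea concrete, the sketch below implements a residual block of two fully-connected layers with an identity shortcut in plain Python. It is a simplified illustration, not the definition from Section 3.2.1: batch normalization is omitted, the ReLU after the addition follows the common formulation of He et al. (2016), and all names and the toy width of 3 are assumptions.

```python
def dense(x, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Two fully-connected layers plus an identity shortcut.
    The shortcut adds no parameters; input and output widths must match."""
    hidden = relu(dense(x, w1, b1))
    out = dense(hidden, w2, b2)
    return relu([o + xi for o, xi in zip(out, x)])


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
no_bias = [0.0] * 3

# With all-zero weights the block still passes the input through the shortcut.
passthrough = residual_block(x, zeros, no_bias, zeros, no_bias)
stacked = residual_block(x, identity, no_bias, identity, no_bias)
print(passthrough, stacked)
```

The zero-weight case shows why such blocks are easier to optimize: even an uninformative pair of layers leaves the signal from the previous layer intact.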

From Table 5, we can observe that a deep encoder does result in improvements compared with a shallow encoder. For DNN1 in particular, the residual network gives a 25.41% relative improvement for female speakers and a 26.47% relative improvement for male speakers. The results indicate that learning a good joint representation of short- and long-utterance i-vectors is very beneficial for this supervised mapping task, and that the deep encoder can help learn a better bottleneck joint embedding. The deep encoder can also decrease the amount of training data needed to model the non-linear function, which alleviates the over-fitting problem. To show the effect of the residual short-cuts, we performed experiments using a deep encoder without short-cut connections, and that system performed even worse than the shallow encoder. Residual blocks with short-cut connections are therefore crucial for training the deep networks used here, since they alleviate the hard optimization problems of deep networks.

5.2.4 Effect of adding phoneme information

In this section, we show the results when a phoneme vector (the mean of the phoneme posteriors across frames) is added to the short-utterance i-vectors to learn the mapping. We investigate the effect of adding phoneme information using the best-performing DNN-mapping structures. From Table 6, we can observe that when the phoneme vector is added, the EER further improves to 8.9% for female speakers and 7.3% for male speakers relative to the previous best DNN-mapping results. These are the best results for this task. The results support the hypothesis that adding a phoneme vector can help the neural network reduce the variance of short-utterance i-vectors, which leads to better and more generalizable mapping results. In Section 5.4, we will also show the effect of adding phoneme vectors to GMM-i-vectors.

5.3 Results with different durations

In this section, the results for different durations of evaluation utterances are listed. Table 7 shows the baseline and the best mapping results for the 10 s-10 s, 5 s-5 s and mixed-duration conditions. From the table, we can observe that the proposed methods give significant improvements for both the 10 s-10 s and 5 s-5 s conditions, which indicates that the proposed method generalizes to different durations. In real applications, however, the duration of short utterances cannot be controlled. Therefore, we also train the mapping using i-vectors generated from mixed 10 s and 5 s utterances and show the results on a mixed-duration evaluation task (a mix of 5 s and 10 s). From Table 7, we can see that the baseline EERs for the mixed condition lie between those of the 10 s-10 s and 5 s-5 s evaluation tasks. The proposed mapping algorithms can model i-vectors extracted from various durations, and thus give consistent improvement as shown in the table.

| Front-end | System | Female EER % (Rel Imp) | Female DCF08/DCF10 | Male EER % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|---|
| I-vector_GMM | baseline | 13.8 | 0.063/0.097 | 13.3 | 0.057/0.099 |
| I-vector_GMM | DNN mapping (best) | 11.0 (20.29%) | 0.054/0.095 | 10.6 (20.30%) | 0.051/0.096 |
| I-vector_GMM | DNN mapping (best) + phoneme info | 10.4 (24.64%) | 0.053/0.094 | 9.6 (27.82%) | 0.048/0.096 |
| I-vector_DNN | baseline | 12.2 | 0.054/0.093 | 10.2 | 0.048/0.095 |
| I-vector_DNN | DNN mapping (best) | 9.1 (25.41%) | 0.046/0.091 | 7.5 (26.47%) | 0.038/0.089 |
| I-vector_DNN | DNN mapping (best) + phoneme info | 8.9 (27.05%) | 0.046/0.090 | 7.3 (28.43%) | 0.037/0.090 |

Table 8: Results for I-vector_GMM and I-vector_DNN systems in the 10 s-10 s conditions.
(a) female speakers
(b) male speakers
Figure 7: DET curves for the mapping results of I-vector_GMM and I-vector_DNN systems under 10 s-10 s conditions. Left figure corresponds to female speakers and right one corresponds to male speakers.

5.4 Comparison of mapping results for both I-vector_GMM and I-vector_DNN systems

In the previous sections, we only showed the mapping experiments for the I-vector_DNN system; in this section, we show the mapping results for the I-vector_GMM system as well. In Section 5.1, we showed that the baseline I-vector_DNN system outperforms the I-vector_GMM system, but it is also interesting to compare the results after mapping. From Table 8, we observe that the proposed mapping methods give significant improvement for both systems. After mapping, the I-vector_DNN systems still outperform the I-vector_GMM systems, and the superiority of the I-vector_DNN systems is even more pronounced. We also compare the mapping results when adding phoneme vectors. The table shows that the effect of adding phoneme information is more significant for GMM-i-vectors, where it achieves as much as a 10% relative improvement over the best DNN mapping result. The reason is that DNN-i-vectors already contain some phoneme information, while GMM-i-vectors do not have a clear phoneme representation; GMM-i-vectors can therefore benefit more from adding phoneme vectors. Finally, we summarize the baseline and the best mapping results for both systems in Fig. 7. The DET (Detection Error Trade-off) curves are presented for both female and male speakers. The figures indicate that the proposed mapping algorithms give significant improvement over the baseline across all operating points.

| Condition | System | Female EER % (Rel Imp) | Female DCF08/DCF10 | Male EER % (Rel Imp) | Male DCF08/DCF10 |
|---|---|---|---|---|---|
| Arbitrary durations | baseline | 17.3 | 0.061/0.089 | 12.0 | 0.046/0.083 |
| Arbitrary durations | DNN mapping (best models from SRE10) | 13.3 (23.12%) | 0.050/0.086 | 9.4 (21.67%) | 0.039/0.078 |

Table 9: DNN-based mapping results on SITW using arbitrary durations of short utterances.

5.5 Performance on the SITW database

In the previous experiments, we showed the performance of our proposed DNN-mapping methods on NIST data. In this subsection, we apply our technique to the recently published SITW database, which contains real-world audio files collected from open-source media channels with considerable mismatch conditions. To generate a large number of random-duration short utterances, we first combined the dev and eval datasets and then selected 207 utterances recorded in relatively clean conditions. We truncated each of the 207 utterances into several non-overlapping short utterances with durations of 5 s, 3.5 s and 2.5 s (including both speech and non-speech portions). In the end, a total of 1836 utterances was generated. We plot the distribution of active speech length across these 1836 utterances in Fig. 8. From the figure, we can observe that the active speech length varies between 1 s and 5 s across those short utterances. We can therefore use these short utterances to design trials that represent real-world conditions (arbitrary-length short utterances). In total, we designed 664672 trials for our arbitrary-length short-utterance speaker verification task.
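One way to produce non-overlapping segments of 5 s, 3.5 s and 2.5 s from a longer recording is sketched below. The exact cutting scheme is not specified in the text, so the cyclic order of durations, the stopping rule, and the function name are assumptions made for illustration only. Times are handled in integer milliseconds to avoid floating-point drift.

```python
def cut_segments(total_ms, segment_lengths_ms=(5000, 3500, 2500)):
    """Return consecutive, non-overlapping (start, end) pairs in milliseconds.
    Segment lengths are cycled in order; cutting stops as soon as the next
    segment in the cycle no longer fits in the remaining audio."""
    segments = []
    start = 0
    index = 0
    while True:
        length = segment_lengths_ms[index % len(segment_lengths_ms)]
        if start + length > total_ms:
            break
        segments.append((start, start + length))
        start += length
        index += 1
    return segments


print(cut_segments(12000))
```

Because each segment contains both speech and non-speech portions, the active speech length inside a segment is generally shorter than its nominal duration.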

For each short utterance, we first down-sampled the audio files to an 8 kHz sampling rate, and then extracted the i-vectors using the previously trained I-vector_DNN system introduced in Section 4.1. For PLDA scoring, we use the same PLDA as in Section 4.1, which is trained using the SRE dataset. For i-vector mapping, we apply the best-validated models on the SRE10 dataset (5 s condition) to the SITW dataset. The resulting EERs and minDCFs are shown in Table 9. From the table, we can observe that the best models validated on the SRE10 dataset generalize well to the SITW dataset, giving a 23.12% relative improvement in EER for female speakers and a 21.67% relative improvement for male speakers. The results also indicate that the proposed methods can be used in real-life conditions, such as smart home and forensic-related applications.

Figure 8: Distribution of active speech length of truncated short utterances in the SITW database.

6 Mapping effects

To investigate the effect of the proposed i-vector mapping algorithms, we first calculate the average squared Euclidean distance between short- and long-utterance i-vector pairs on the SRE10 evaluation dataset before and after mapping. The average mean square Euclidean distance between short- and long-utterance i-vectors is defined as follows:


where and represent the short-utterance and long-utterance i-vector respectively, is the length of i-vectors and is number of short and long i-vector pairs.

We compare the distance values for 10 s and 5 s short-utterance i-vectors, and for the mapped 10 s and 5 s short-utterance i-vectors, for female and male speakers in Table 10. From the table, we observe that the mapped short-utterance i-vectors have considerably smaller distances than the ones before mapping. After mapping, the distance in the 10 s condition is smaller than in the 5 s condition.
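The distance measure can be sketched as the squared Euclidean distance of each short/long pair, averaged over all pairs. Whether the original definition additionally normalizes by the i-vector length is not assumed here, so the sketch exposes that choice as a hypothetical `per_dimension` flag; the function name and toy vectors are likewise illustrative.

```python
def mean_squared_distance(short_ivectors, long_ivectors, per_dimension=False):
    """Average squared Euclidean distance over paired i-vectors.
    If per_dimension is True, each pair's distance is also divided by the
    i-vector length (an optional normalization, assumed for illustration)."""
    total = 0.0
    for short, long_ in zip(short_ivectors, long_ivectors):
        squared = sum((s - l) ** 2 for s, l in zip(short, long_))
        if per_dimension:
            squared /= len(short)
        total += squared
    return total / len(short_ivectors)


# Two toy pairs with squared distances 5.0 and 25.0.
short_toy = [[1.0, 2.0], [0.0, 0.0]]
long_toy = [[0.0, 0.0], [3.0, 4.0]]
print(mean_squared_distance(short_toy, long_toy))
print(mean_squared_distance(short_toy, long_toy, per_dimension=True))
```

Applying the same function to mapped i-vectors in place of `short_ivectors` gives the "mapped" quantity that the text compares against the original.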

Moreover, we calculate and compare the J-ratio (Fukunaga, 1990) of the short-utterance i-vectors from SRE10 before and after mapping in Table 11, which measures the ability of speaker separation. Given i-vectors for speakers, the J-ratio can be computed using Eqs.14-16:


where is the within-class scatter matrix, is the between-class scatter matrix, is the mean i-vector for the speaker, is the mean of all s, and is the covariance matrix for the speaker (note that a higher J-Ratio means better separation).
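A form of the J-ratio that is consistent with the definitions above is the commonly used scatter-matrix formulation shown below. The symbol names are assumptions chosen here for illustration: $N$ is the number of speakers, $\bar{w}_i$ the mean i-vector of speaker $i$, $\bar{w}$ the mean of all $\bar{w}_i$, and $R_i$ the covariance matrix of speaker $i$.

```latex
\begin{align}
S_w &= \frac{1}{N} \sum_{i=1}^{N} R_i \\
S_b &= \frac{1}{N} \sum_{i=1}^{N} (\bar{w}_i - \bar{w})(\bar{w}_i - \bar{w})^{\top} \\
J   &= \mathrm{Tr}\left( (S_b + S_w)^{-1} S_b \right)
\end{align}
```

In this formulation $J$ is bounded above by the i-vector dimension, which is compatible with the values below 600 reported in Table 11.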

From Table 11, we can observe that the mapped i-vectors have considerably higher J-ratios compared with original short-utterance i-vectors for both 5 s and 10 s conditions.

These results indicate that the proposed DNN-based mapping methods can generalize well to unseen speakers and utterances, and improve the speaker separation ability of i-vectors.

| Gender | 10 s original | 10 s mapped | 5 s original | 5 s mapped |
|---|---|---|---|---|
| female | 558.3 | 306.8 | 618.8 | 352.1 |
| male | 493.2 | 308.8 | 556.1 | 346.5 |

Table 10: Square Euclidean distance between short and long utterance i-vector pairs from SRE10 before and after mapping.

| Gender | 10 s original | 10 s mapped | 5 s original | 5 s mapped |
|---|---|---|---|---|
| female | 87.96 | 92.97 | 82.73 | 85.18 |
| male | 85.23 | 90.25 | 80.41 | 84.39 |

Table 11: J-ratio for short-utterance i-vectors from SRE10 before and after mapping.

7 Conclusions

In this paper, we show how the performance of both GMM- and DNN-based i-vector speaker verification systems degrades rapidly as the duration of the evaluation utterances decreases. This paper explains and analyzes the reasons for the degradation and proposes several DNN-based techniques to train a non-linear mapping from short-utterance i-vectors to their long version, in order to improve the short-utterance evaluation performance.

Two DNN-based mapping methods (DNN1 and DNN2) are proposed, and they both model the joint representations of short-utterance and long-utterance i-vectors. For DNN1, an auto-encoder is first trained using concatenated short- and long-utterance i-vectors in order to learn a joint hidden representation, and then the pre-trained DNN is fine-tuned by a supervised mapping from short to long i-vectors. DNN2 adopts a unified structure, which jointly trains the supervised regression task with an auto-encoder, since auto-encoders can directly regularize the non-linear mapping between short and long utterances. The unified structure simplifies the training procedure and can also learn a generalized non-linear function.

Both DNN1 and DNN2 result in significant improvement over the short-utterance evaluation baseline for both male and female speakers, and they also outperform other short-utterance compensation techniques by a large margin. A t-test (p<0.001) indicates that all the improvements are statistically significant. We study several key factors of DNN models and conclude the following: 1) for the two-stage trained DNN model (DNN1), the number of iterations for unsupervised training in the first stage is important for second-stage supervised training; 2) for the semi-supervised trained DNN model (DNN2), unsupervised training plays a more important role than supervised training in a speaker verification task; 3) by increasing the depth of the neural networks using residual blocks, we can alleviate the hard optimization problem of deep neural networks and obtain an improvement compared with a shallow network, especially for DNN1; 4) adding phoneme information can aid in learning the non-linear mapping and provide further performance improvement, and the effect is more significant for GMM i-vectors; 5) the proposed DNN-based mapping methods work well for short utterances with different and mixed durations; 6) the proposed models improve both I-vector_GMM and I-vector_DNN systems, and after mapping, an I-vector_DNN system still performs better than an I-vector_GMM system; and 7) the best-validated models of SRE10 generalize well to the SITW dataset and give significant improvement for arbitrary-length short utterances.


  • Dehak et al. (2011) Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., Ouellet, P. 2011. Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 19(4), pp. 788–798.
  • Lei et al. (2011) Lei, Y., Scheffer, N., Ferrer, L., McLaren, M. 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network, ICASSP 2014, pp. 1695–1699.
  • Kanagasundaram et al. (2011) Kanagasundaram, A., Vogt, R., Dean, D., et al, 2011. I-vector based speaker recognition on short utterances, in Proc. of Interspeech 2011, pp. 2341–2344.
  • Poddar et al. (2017) Poddar, A., Sahidullah, M., Saha, G., 2017. Speaker verification with short utterances: a review of challenges, trends and opportunities, IET Biometrics, 2018, 7(2), pp. 91–101.
  • Das and Prasanna (2017) Das, R. K., Prasanna, S. M., 2017. Speaker verification from short utterance perspective: a review, IETE Technical Review, 2017, pp. 1–19.
  • Prince et al. (2007) Prince, S. J., Elder, J. H., 2007. Probabilistic linear discriminant analysis for inferences about identity, ICCV 2007, pp. 1–8.
  • Cumani (2014) Cumani, S., Plchot, O., Laface, P., 2014. On the use of i-vector posterior distributions in probabilistic linear discriminant analysis, IEEE Transactions on Audio, Speech and Language Processing, 22(4), pp. 846–857.
  • Cumani (2015) Cumani, S., 2015. Fast scoring of full posterior PLDA models, IEEE Transactions on Audio, Speech and Language Processing, 23(11), pp. 2036–2045.
  • Cumani (2016) Cumani, S., Laface, P, 2016. I-vector transformation and scaling for PLDA based speaker recognition, In Proc. of Odyssey 2016, pp. 39–46.
  • Hasan et al. (2013) Hasan, T., Saeidi, R., Hansen, J. H., et al, 2013. Duration mismatch compensation for i-vector based speaker recognition systems, ICASSP 2013, pp. 7663–7667.
  • Kanagasundaram et al. (2014) Kanagasundaram, A., Dean, D., Sridharan, S., et al, 2014. Improving short utterance i-vector speaker verification using utterance variance modeling and compensation techniques, Speech Communication, 59:69–82.
  • Li et al. (2016) Li, L., Wang, D., Zhang, C., Zheng, T. Z., 2016. Improving short utterance speaker recognition by modeling speech unit classes, IEEE Transactions on Audio, Speech, and Language Processing, 24(6), pp. 1129–1139.
  • Chen et al. (2016) Chen, L., Lee, K. A., Chng, E. S., Ma, B., Li, H., Dai, L. R., 2016. Content-aware local variability vector for speaker verification with short utterance, ICASSP 2016, pp. 5485–5489.
  • Scheffer and Lei (2014) Scheffer, N. and Lei, Y., 2014. Content matching for short duration speaker recognition, in Proc. of Interspeech 2014, pp. 1317–1321.
  • Snyder et al. (2017) Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S. 2017. Deep neural network embeddings for text-independent speaker verification, in Proc. of Interspeech 2017, pp. 999–1003.
  • Zhang and Koishida (2017) Zhang, C., Koishida, K., 2017. End-to-end text-independent speaker verification with triplet loss on short utterances, in Proc. of Interspeech 2017, pp. 1487–1491
  • Guo et al. (2016) Guo, J., Yeung, G., Muralidharan, D., Arsikere, H., Afshan, A., Alwan, A., 2016. Speaker verification using short utterances with DNN-Based estimation of subglottal acoustic features, in Proc. of Interspeech 2016, pp. 2219–2222.
  • Guo et al. (2017a) Guo, J., Yang, R., Arsikere, H., Alwan, A., 2017. Robust speaker identification via fusion of subglottal resonances and cepstral features, the Journal of the Acoustical Society of America, 141(4), EL, pp. 420–426.
  • Kheder et al. (2016) Kheder, W. B., Matrouf, D., Ajili, M., Bonastre, J. F., 2016. Probabilistic approach using joint long and short session i-Vectors modeling to deal with short utterances for speaker recognition, in Proc. of Interspeech 2016, pp. 1830–1834.
  • Kheder et al. (2018) Kheder, W. B., Matrouf, D., Ajili, M., Bonastre, J. F., 2018. A unified joint model to deal with nuisance variabilities in the i-Vector space, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3), pp. 633–645.
  • Ma et al. (2017) Ma, J., Sethu, V., Ambikairajah, E., Lee, K. A. 2017. Incorporating local acoustic variability information into short duration speaker verification, in Proc. of Interspeech 2017, pp. 1502–1506.
  • Bhattacharya et al. (2017) Bhattacharya, G., Alam, J., Kenny, P. 2017. Deep Speaker embeddings for short-Duration speaker verification, in Proc. of Interspeech 2017, pp. 1517–1521.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J. 2016. Deep residual learning for image recognition, in Proc. of the IEEE conference on computer vision and pattern recognition 2016, pp. 770–778.
  • Guo et al. (2017b) Guo, J., Nookala, U. A., Alwan, A. 2017. CNN-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances, in Proc. of Interspeech 2017, pp. 3712–3716.
  • Hinton et al. (2006) Hinton, G. E., Osindero, S., Teh, Y. W. 2006. A fast learning algorithm for deep belief nets, Neural computation 2006, 18(7), pp 1527–1554.
  • Erhan et al. (2010) Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., Bengio, S. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 2010, 11(Feb), pp. 625–660.
  • Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T. 2015. Semi-supervised learning with ladder networks, Advances in Neural Information Processing Systems 2015, pp. 3546–3554.
  • Zhang et al. (2016) Zhang, Y., Lee, K., Lee, H. 2016 Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, International Conference on Machine Learning 2016, pp 612–621.
  • Sabour et al. (2017) Sabour, S., Frosst, N., Hinton, G. E. 2017. Dynamic routing between capsules, Advances in Neural Information Processing Systems 2017, pp 3859–3869.
  • Bousquet and Rouvier (2017) Bousquet, P. M., Rouvier, M. 2017. Duration mismatch compensation using four-covariance model and deep neural network for speaker verification., in Proc. of Interspeech 2017, pp. 1547–1551.
  • Kenny et al. (2013) Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., Dumouchel, P. 2013. PLDA for speaker verification with utterances of arbitrary duration, ICASSP 2013, pp.  7649–7653
  • Cieri et al. (2004) Cieri, C., Miller, D., Walker, K. 2004. The Fisher Corpus: a resource for the next generations of speech-to-Text, in LREC Vol. 4, pp. 69–71.
  • Peddinti et al. (2015) Peddinti, V., Povey, D., Khudanpur, S. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts, in Proc. of Interspeech 2015, pp. 3214–3218.
  • Povey et al. (2011) Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Silovsky, J. 2011. The Kaldi speech recognition toolkit, in IEEE workshop on automatic speech recognition and understanding 2011.
  • Zhang et al. (2014) Zhang, X., Trmal, J., Povey, D., Khudanpur, S. 2014. Improving deep neural network acoustic models using generalized maxout networks, ICASSP 2014, pp. 215–219.
  • Martin and Greenberg (2010) Martin, A. F., Greenberg, C. S. 2010. The NIST 2010 speaker recognition evaluation, in Proc. of Interspeech 2010.
  • McLaren et al. (2015) McLaren, M., Lawson, A., Ferrer, L., Castan, D., Graciarena, M. 2015. The speakers in the wild speaker recognition challenge plan.
  • Kingma et al. (2014) Kingma, D., Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Glorot and Bengio (2010) Glorot, X., Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks, in Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256.
  • Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning. pp. 448–456.
  • Abadi et al. (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Ghemawat, S. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
  • Snyder et al. (2015) Snyder, D., Garcia-Romero, D., Povey, D. 2015. Time delay deep neural network-based universal background models for speaker recognition, in IEEE workshop on automatic speech recognition and understanding 2015, pp. 92–97.
  • Fukunaga (1990) Fukunaga, K. 1990. Introduction to statistical pattern recognition. Academic Press, 1990