1 Introduction
The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) [1] have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR) [2, 3, 4, 5, 6, 7, 8, 9]. Two general methods of applying DNNs to the SR and LR tasks have been shown to be effective. The first or “direct” method uses a DNN trained as a classifier for the intended recognition task: the DNN is trained to discriminate between speakers for SR [5] or languages for LR [4]. The second or “indirect” method uses a DNN trained for a different purpose to extract data that is then used to train a secondary classifier for the intended recognition task. Applications of the indirect method have used a DNN trained for ASR to extract frame-level features [2, 3, 10], accumulate a multinomial vector [7] or accumulate multi-modal statistics [6, 8] that were then used to train an i-vector system [11, 12].
The unified DNN approach described in this work uses two of the indirect methods described above. The first indirect method (“bottleneck”) uses frame-level features extracted from a DNN with a special bottleneck layer [13] and the second indirect method (“DNN-posterior”) uses posteriors extracted from a DNN to accumulate multi-modal statistics [6]. The features and the statistics from both indirect methods are then used to train four different i-vector systems: one for each task (SR and LR) and each method (bottleneck and DNN-posterior). A key point in the unified approach is that a single DNN is used for all four of these i-vector systems. Additionally, we will examine the feasibility of using a single i-vector extractor for both SR and LR.
2 I-vector classifier for SR and LR
Over the past 5 years, state-of-the-art SR and LR performance has been achieved using i-vector based systems [11]. In addition to using an i-vector classifier as a baseline approach for our experiments, we will also show how phonetic-knowledge rich DNN feature representations and posteriors can be incorporated into the i-vector classifier framework, providing significant performance improvements. In this section we provide a high-level description of the i-vector approach (for a detailed description see, for example, [11, 14]).
In Figure 1 we show a simplified block diagram of i-vector extraction and scoring. An audio segment is first processed to find the locations of speech in the audio (speech activity detection) and to extract acoustic features that convey speaker/language information. Typically, 20-dimensional mel-frequency cepstral coefficients (MFCC) and their derivatives are used for SR and 56-dimensional static cepstra plus shifted-delta cepstra (SDC) are used for LR, both analyzed at 100 feature vectors/second. Using a Universal Background Model (UBM), essentially a speaker/language-independent Gaussian mixture model (GMM), the per-mixture posterior probability of each feature vector (“GMM-posterior”) is computed and used, along with the feature vectors in the segment, to accumulate zeroth, first, and second order sufficient statistics (SS). These SSs are then transformed into a low-dimensional i-vector representation (typically 400-600 dimensions) using a total variability matrix, $T$. The i-vector is whitened by subtracting a global mean, $m$, and scaling by the inverse square root of a global covariance matrix, $W$, and is then normalized to unit length [14]. Finally, a score between a model and test i-vector is computed. The simplest scoring function is the cosine distance between the i-vector representing a speaker/language model (the average of the i-vectors from the speaker’s/language’s training segments) and the i-vector representing the test segment. The current state-of-the-art scoring function, called Probabilistic Linear Discriminant Analysis (PLDA) [14], requires a within-class matrix $\Sigma_{wc}$, characterizing how i-vectors from a single speaker/language vary, and an across-class matrix $\Sigma_{ac}$, characterizing how i-vectors between different speakers/languages vary.
Collectively, the UBM, $T$, $W$, $m$, $\Sigma_{wc}$, and $\Sigma_{ac}$ are known as the system’s hyperparameters and must be estimated before a system can enroll and/or score any data. The UBM, $T$, $W$, and $m$ represent general feature distributions and the total variance of statistics and i-vectors, so unlabeled data from the desired audio domain (e.g., telephone, microphone, etc.) can be used to estimate them. The $\Sigma_{wc}$ and $\Sigma_{ac}$ matrices, however, each require a large collection of labeled data for training. For SR, $\Sigma_{wc}$ and $\Sigma_{ac}$ typically require thousands of speakers, each of whom contributes tens of samples to the data set. For LR, the enrollment samples from each desired language, which typically comprise hundreds of samples from many different speakers, can be used to estimate $\Sigma_{wc}$ and $\Sigma_{ac}$.
By far the most computationally expensive part of an i-vector system is extracting the i-vectors themselves. An efficient approach for performing both SR and LR on the same data is to use the same i-vectors. This may be possible if both systems use the same feature extraction, UBM, and $T$ matrix. There may be some tradeoff in performance, however, since the UBM, $T$ matrix, and signal processing will not be specialized for SR or LR.
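The post-processing and simplest scoring steps described in this section (whitening, length normalization, model averaging and cosine scoring) can be sketched in a few lines of Python. This is only an illustrative sketch, not the system used in the experiments: the function names are hypothetical, and the global covariance is assumed to be diagonal so that its inverse square root is an element-wise operation.

```python
import math

def whiten_and_normalize(ivector, mean, variances):
    """Subtract the global mean, scale by the inverse square root of a
    DIAGONAL covariance (an assumption made here for brevity), and
    normalize the result to unit length."""
    centered = [(x - m) / math.sqrt(v)
                for x, m, v in zip(ivector, mean, variances)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def average_model(ivectors):
    """Model i-vector: average of the i-vectors of the training segments."""
    count = len(ivectors)
    return [sum(column) / count for column in zip(*ivectors)]

def cosine_score(model, test):
    """Cosine of the angle between the model and test i-vectors."""
    dot = sum(a * b for a, b in zip(model, test))
    norm_model = math.sqrt(sum(a * a for a in model))
    norm_test = math.sqrt(sum(b * b for b in test))
    return dot / (norm_model * norm_test)

# Toy 3-dimensional example (real i-vectors have hundreds of dimensions).
mean = [0.0, 0.0, 0.0]
variances = [1.0, 4.0, 9.0]
train = [whiten_and_normalize(v, mean, variances)
         for v in ([1.0, 2.0, 3.0], [1.0, 2.0, 0.0])]
test = whiten_and_normalize([1.0, 1.0, 1.0], mean, variances)
print(cosine_score(average_model(train), test))
```

A full covariance matrix would require an eigendecomposition or Cholesky factorization for the whitening step, and PLDA scoring would replace `cosine_score` in a state-of-the-art system.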
3 Deep Neural Network Classifier for Speech Applications
3.1 DNN architecture
A DNN, like a multi-layer perceptron (MLP), consists of an input layer, several hidden layers and an output layer. Each layer has a fixed number of nodes and each sequential pair of layers is fully connected with a weight matrix. The activations of nodes on a given layer are computed by transforming the output of the previous layer with the weight matrix: $a^{(i)} = M^{(i)} x^{(i-1)}$. The output of a given layer is then computed by applying an “activation function”: $x^{(i)} = h^{(i)}(a^{(i)})$ (see Figure 2). Commonly used activation functions include the sigmoid, the hyperbolic tangent, rectified linear units and even a simple linear transformation. Note that if all the activation functions in the network are linear then the stacked matrices reduce to a single matrix multiply.
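As a minimal sketch of the layer computation just described, the following Python code applies a weight matrix to the previous layer's output and then an activation function, and checks numerically that two stacked linear layers behave like a single matrix multiply. The function names and the toy matrices are assumptions for illustration, and bias terms are omitted to match the description above.

```python
import math

def affine(weights, x):
    """Activation of a layer: weight matrix (list of rows) times input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def sigmoid(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def linear(a):
    return list(a)

def forward(layers, x):
    """layers is a list of (weight_matrix, activation_function) pairs."""
    for weights, activation in layers:
        x = activation(affine(weights, x))
    return x

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Two stacked linear layers collapse to one matrix (second times first).
m1 = [[1.0, 2.0], [3.0, 4.0]]
m2 = [[0.0, 1.0], [1.0, 1.0]]
x = [1.0, 1.0]
stacked = forward([(m1, linear), (m2, linear)], x)
collapsed = affine(matmul(m2, m1), x)
print(stacked, collapsed)
```

Replacing `linear` with `sigmoid` in either layer breaks this equivalence, which is why non-linear activations are needed for depth to add modeling power.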
The type of activation function used for the output layer depends on what the DNN is used for. If the DNN is trained for regression, the output activation function is linear and the objective function is the mean squared error between the output and some target data. If the DNN is trained as a classifier, then the output activation function is the softmax and the objective function is the cross entropy between the output and the true class labels. For a classifier, each output node of the DNN corresponds to a class and the output is an estimate of the posterior probability of the class given the input data.
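The two output configurations can be illustrated with a short Python sketch of the softmax, the cross entropy for a single labeled example, and the mean squared error. The function names are hypothetical; subtracting the maximum activation before exponentiating is a standard numerical-stability measure and does not change the result.

```python
import math

def softmax(activations):
    """Convert output-layer activations into class posterior estimates."""
    shift = max(activations)  # for numerical stability only
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(posteriors, true_class):
    """Cross entropy between the posteriors and a one-hot class label."""
    return -math.log(posteriors[true_class])

def mean_squared_error(output, target):
    """Regression objective for a linear output layer."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)

posteriors = softmax([2.0, 1.0, 0.1])
print(posteriors, cross_entropy(posteriors, 0))
```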
3.2 DNN Training for ASR
DNN classifiers can be used as acoustic models in ASR systems to compute the posterior probability of a sub-phonetic unit (a “senone”) given an acoustic observation. Observations, or feature vectors, are extracted from speech data at a fixed sample rate using a spectral technique such as filterbank analysis, MFCC, or perceptual linear prediction (PLP) coefficients. Decoding is performed using a hidden Markov model (HMM) and the DNN to find the most likely sequence of senones given the feature vectors (this requires using Bayes’ rule to convert the DNN posteriors to likelihoods). Training the DNN requires a significant amount of manually transcribed speech data [1]. The senone labels are derived from the transcriptions using a phonetic dictionary and a state-of-the-art GMM/HMM ASR system. Generally speaking, a refined set of phonotactic units aligned using a high-performing ASR system is required to train a high-performing DNN system [1].
DNN training is essentially the same as traditional MLP training. The most common approach uses stochastic gradient descent (SGD) with a mini-batch for updating the DNN parameters throughout a training pass or “epoch”. The back-propagation algorithm is used to estimate the gradient of the objective function with respect to the DNN parameters for each mini-batch. Initializing the DNN is critical, but it has been shown that a random initialization is adequate for speech applications where there is a substantial amount of data [15]. A held-out validation data set is used to estimate the error rate after each training epoch. The SGD algorithm uses a heuristic learning rate parameter that is adjusted in accordance with a scheduling algorithm which monitors the validation error rate at each epoch. Training ceases when the error rate can no longer be reduced.
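The Bayes' rule conversion mentioned above for decoding can be sketched as follows. Since the likelihood of a frame given a senone is proportional to the senone posterior divided by the senone prior, a decoder can work with "scaled" log-likelihoods. The sketch assumes that the priors are estimated from senone label counts in the training alignments, which is a common choice but is not stated in the text; the function names and the flooring constant are likewise hypothetical.

```python
import math

def estimate_priors(label_counts):
    """Senone priors from label counts (assumed to come from alignments)."""
    total = sum(label_counts)
    return [count / total for count in label_counts]

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log p(x|s) up to a constant: log p(s|x) - log p(s).
    Values are floored to avoid taking the log of zero."""
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]

priors = estimate_priors([100, 300])
print(scaled_log_likelihoods([0.5, 0.5], priors))
```

In this toy example both senones have the same posterior, but the rarer senone receives the larger scaled likelihood because its prior is smaller.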
In the past, training neural networks with more than 2 hidden layers proved to be problematic. Recent advances in fast and affordable computing hardware, optimization software and initialization techniques have made it possible to train much deeper networks. A typical DNN for ASR will have 5 or more hidden layers, each with the same number of nodes, typically between 500 and 3,000 [1]. The number of output senones varies from a few hundred to tens of thousands [15].
3.3 DNN bottleneck features
A DNN can also be used as a means of extracting features for use by a secondary classifier, including another DNN [16]. This is accomplished by sampling the activation of one of the DNN’s hidden layers and using this as a feature vector. For some classifiers the dimensionality of the hidden layer is too high and some sort of feature reduction, such as LDA or PCA, is necessary. In [13], a dimension-reducing linear transformation is optimized as part of the DNN training by using a special bottleneck hidden layer that has fewer nodes (see Figure 2). The bottleneck layer uses a linear activation so that it behaves very much like an LDA or PCA transformation on the activation of the previous layer. The bottleneck DNN used in this work is the same system described in [13]. In theory any layer can be used as a bottleneck layer, but in our work we have chosen to use the second to last layer with the hope that the output posterior prediction will not be too adversely affected by the loss of information at the bottleneck.
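Extracting bottleneck features amounts to running the forward pass only as far as the bottleneck layer and reading out its linear activation. The Python sketch below illustrates this on a toy network; the function names, the layer sizes and the weights are assumptions for illustration, and bias terms are omitted.

```python
import math

def affine(weights, x):
    """Weight matrix (list of rows) times input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def sigmoid(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def bottleneck_features(hidden_layers, bottleneck_index, frame):
    """Run the hidden layers in order and return the activation of the
    bottleneck layer. All other hidden layers are assumed to use a
    sigmoid; the bottleneck is linear, so its output is its activation."""
    x = frame
    for index, weights in enumerate(hidden_layers):
        a = affine(weights, x)
        if index == bottleneck_index:
            return a
        x = sigmoid(a)
    raise ValueError("bottleneck_index is outside the network")

# Toy network: 3 inputs -> 4 sigmoid nodes -> 2-node linear bottleneck.
layers = [
    [[0.1, 0.2, 0.3]] * 4,
    [[1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 0.0, 0.0]],
]
print(bottleneck_features(layers, 1, [1.0, 2.0, 3.0]))
```

The layers after the bottleneck are needed only during training, so they can be skipped entirely when the network is used as a feature extractor.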
3.4 DNN stats extraction for an i-vector system
A typical i-vector system uses zeroth, first and second order statistics generated using a GMM. Statistics are accumulated by first estimating the posterior of each GMM component density for a frame (the “occupancy”) and using these posteriors as weights for accumulating the statistics for each component of the mixture distribution. The zeroth order statistics are the total occupancies for an utterance across all GMM components and the first order statistics are the occupancy-weighted sums of the feature vectors for each component. The i-vector is then computed using a dimension-reducing transformation that is non-linear with respect to the zeroth order statistics.
An alternate approach to extracting statistics has been proposed in [6]. Statistics are accumulated in the same way as for the GMM, but class posteriors from the DNN are used in place of GMM component posteriors. Once the statistics have been accumulated, the i-vector extraction is performed in the same way as it is from the GMM-based statistics. This approach has been shown to give significant gains for both SR and LR [6, 7, 17].
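The accumulation step is the same whichever model supplies the posteriors, which the following Python sketch tries to make concrete. It computes zeroth and first order statistics from per-frame posteriors and feature vectors; second order statistics are omitted for brevity, and the function name and toy data are assumptions for illustration.

```python
def accumulate_stats(posteriors, features):
    """posteriors: one list of class/component posteriors per frame.
    features:   one feature vector per frame.
    Returns (zeroth, first): per-class total occupancy and per-class
    posterior-weighted sum of the feature vectors."""
    num_classes = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_classes
    first = [[0.0] * dim for _ in range(num_classes)]
    for gamma, x in zip(posteriors, features):
        for c, weight in enumerate(gamma):
            zeroth[c] += weight
            for d, value in enumerate(x):
                first[c][d] += weight * value
    return zeroth, first

# The posteriors may come from a GMM UBM or from the DNN output layer.
frame_posteriors = [[1.0, 0.0], [0.5, 0.5]]
frame_features = [[2.0, 4.0], [6.0, 8.0]]
print(accumulate_stats(frame_posteriors, frame_features))
```

Note that the posteriors and the feature vectors do not have to come from the same front-end, which is what allows DNN posteriors to be paired with MFCC, SDC or bottleneck features.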
4 Experiment setup
4.1 Corpora
Three different corpora are used in our experiments. The DNN itself is trained using a 100 hour subset of Switchboard 1 [18], as defined in the example system distributed with Kaldi [19]. The SR systems were trained and evaluated using the 2013 Domain Adaptation Challenge (DAC13) data [20]. The LR systems were evaluated on the NIST 2011 Language Recognition Evaluation (LRE11) data [21]. Details on the LR training and development data can be found in [22].
4.2 System configuration
4.2.1 Commonalities
All systems use the same speech activity segmentation generated using a GMM-based speech activity detector (GMM SAD). The i-vector system uses MAP and PPCA to estimate the total variability matrix $T$. Scoring is performed using PLDA [14]. With the exception of the input features or multi-modal statistics, the i-vector systems are identical and use a 2048 component GMM UBM and a 600 dimensional i-vector subspace. All LR systems use the discriminative backend described in [22].
4.2.2 Baseline systems
The frontend feature extraction for the baseline LR system uses 7 static cepstra appended with 49 SDC. Unlike the frontend described in [22], vocal tract length normalization (VTLN) and feature domain nuisance attribute projection (fNAP) are not used. The frontend for the baseline SR system uses 20 MFCCs, including C0, and their first derivatives for a total of 40 features.
4.2.3 DNN system
The DNN was trained using 4,199 state cluster (“senone”) target labels generated using the Kaldi Switchboard 1 “tri4a” example system [19]. The DNN frontend uses 13 Gaussianized PLP coefficients and their first and second order derivatives (39 features) stacked over a 21 frame window (10 frames to either side of the center frame) for a total of 819 input features. The GMM SAD segmentation is applied to the stacked features.
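The input dimensionality quoted above follows from 13 coefficients times 3 (statics plus first and second derivatives) times 21 stacked frames. The Python sketch below checks this arithmetic and shows one way to stack a context window around each frame. How the edges of an utterance are handled is not stated in the text; repeating the first and last frames is an assumption made here, and the function name is hypothetical.

```python
NUM_PLP = 13
FEATURES_PER_FRAME = NUM_PLP * 3      # statics + 1st + 2nd derivatives = 39
CONTEXT = 10                          # frames on either side of the center
WINDOW = 2 * CONTEXT + 1              # 21 frames
INPUT_DIM = FEATURES_PER_FRAME * WINDOW

def stack_frames(frames, context=CONTEXT):
    """Concatenate each frame with its `context` neighbours on both sides.
    Edge frames are padded by repeating the first/last frame (assumption)."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        stacked.append(window)
    return stacked

# 30 dummy frames whose values equal their frame index.
frames = [[float(t)] * FEATURES_PER_FRAME for t in range(30)]
stacked = stack_frames(frames)
print(INPUT_DIM, len(stacked), len(stacked[0]))
```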
The DNN has 7 hidden layers of 1024 nodes each, with the exception of the 6th layer, the bottleneck layer, which has 64 nodes. All hidden layers use a sigmoid activation function with the exception of the 6th layer, which is linear [13]. The DNN training is performed on an NVIDIA Tesla K40 GPU using custom software developed at MIT/CSAIL.
5 Experiment Results
5.1 Speaker recognition experiments
Two sets of experiments were run on the DAC13 corpora: “in-domain” and “out-of-domain”. For both sets of experiments, the UBM and $T$ hyperparameters are trained on Switchboard (SWB) data. The other hyperparameters (the whitening parameters $W$ and $m$ and the PLDA matrices $\Sigma_{wc}$ and $\Sigma_{ac}$) are trained on 2004-2008 speaker recognition evaluation (SRE) data for the in-domain experiments and SWB data for the out-of-domain experiments (see [20] for more details). Tables 1 and 2 summarize the results for the in-domain and out-of-domain experiments, with the first row of each table corresponding to the baseline system. While the DNN-posterior technique with MFCCs gives a significant gain over the baseline system for both sets of experiments, as also reported in [6] and [17], an even greater gain is realized using bottleneck features with a GMM. Unfortunately, using both bottleneck features and DNN-posteriors degrades performance relative to bottleneck features with a GMM.
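Table 1: In-domain DAC13 results.
| Features | Posteriors | EER (%) | DCF*1000 |
|---|---|---|---|
| MFCC | GMM | 2.71 | 0.404 |
| MFCC | DNN | 2.27 | 0.336 |
| Bottleneck | GMM | 2.00 | 0.269 |
| Bottleneck | DNN | 2.79 | 0.388 |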
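Table 2: Out-of-domain DAC13 results.
| Features | Posteriors | EER (%) | DCF*1000 |
|---|---|---|---|
| MFCC | GMM | 6.18 | 0.642 |
| MFCC | DNN | 3.27 | 0.427 |
| Bottleneck | GMM | 2.79 | 0.342 |
| Bottleneck | DNN | 3.97 | 0.454 |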
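The size of the gains in Tables 1 and 2 is easier to judge as a relative reduction with respect to the baseline. The short Python calculation below, using a hypothetical helper function and the values from the two tables, reproduces the percentages quoted in the conclusions for bottleneck features with a GMM.

```python
def relative_reduction(baseline, new):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - new) / baseline

# (baseline MFCC/GMM, bottleneck/GMM) pairs taken from Tables 1 and 2.
results = {
    "in-domain EER": (2.71, 2.00),
    "in-domain DCF": (0.404, 0.269),
    "out-of-domain EER": (6.18, 2.79),
    "out-of-domain DCF": (0.642, 0.342),
}
for name, (baseline, new) in results.items():
    print(f"{name}: {relative_reduction(baseline, new):.1f}% reduction")
```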
5.2 Language recognition experiments
The experiments run on the LRE11 task are summarized in Table 3, with the first row corresponding to the baseline system and the last row corresponding to a fusion of 5 “post-evaluation” systems (see [22] for details). Bottleneck features with GMM-posteriors outperform the other system configurations, including the 5-system fusion. Interestingly, bottleneck features with DNN-posteriors show more of an improvement over the baseline system than in the speaker recognition experiments.
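Table 3: LRE11 results by test duration.
| Features | Posteriors | 30s | 10s | 3s |
|---|---|---|---|---|
| SDC | GMM | 5.26 | 10.7 | 20.9 |
| SDC | DNN | 4.00 | 8.21 | 19.5 |
| Bottleneck | GMM | 2.76 | 6.55 | 15.9 |
| Bottleneck | DNN | 3.79 | 7.71 | 18.2 |
| 5-way fusion | | 3.27 | 6.67 | 17.1 |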
5.3 Cross-task i-vector Extraction
Table 4 shows the performance on the DAC13 and LRE11 tasks when extracting i-vectors using parameters from one of the two systems. As expected, there is a degradation in performance for the mismatched task, but the degradation is smaller on the DAC13 SR task using the LRE11 LR hyperparameters (EER rises from 2.00% to 2.68%) than on the LRE11 task using the DAC13 hyperparameters (6.12 versus 2.76 at 30s). These results motivate further research in developing a unified i-vector extraction system for both SR and LR by careful UBM/$T$ training data selection.
Table 4: Cross-task i-vector extraction results.
| UBM/$T$ | DAC13 in-domain | LRE11 30s |
|---|---|---|
| DAC13 | 2.00% EER / 0.269 DCF | 6.12 |
| LRE11 | 2.68% EER / 0.368 DCF | 2.76 |
6 Conclusions
This paper has presented a DNN bottleneck feature extractor that is effective for both speaker and language recognition and produces significant performance gains over state-of-the-art MFCC/SDC i-vector approaches as well as more recent DNN-posterior approaches. For the speaker recognition DAC13 task, the new DNN bottleneck features decreased in-domain EER by 26% and DCF by 33%, and out-of-domain EER by 55% and DCF by 47%. The out-of-domain results are particularly interesting since no in-domain data was used for DNN training or hyperparameter adaptation. On LRE11, the same bottleneck features decreased EERs at 30s, 10s, and 3s test durations by 48%, 39%, and 24%, respectively, and even outperformed a 5-system fusion of acoustic and phonetic based recognizers. A final set of experiments demonstrated that it may be possible to use a common i-vector extractor for a unified speaker and language recognition system. Although not presented here, it was also observed that recognizers using the new DNN bottleneck features produced much better calibrated scores as measured by CLLR metrics.
The DNN bottleneck features, in essence, are the learned feature representation from which the DNN posteriors are derived. Experimentally, it appears that using the learned feature representation is better than using just the output posteriors with SR or LR features, but combining the DNN bottleneck features and DNN posteriors degrades performance. This may be because we are able to train a better suited posterior estimator (UBM) with data more matched to the task data. Since we are working with new features, future research will examine whether there are more effective classifiers to apply than ivectors. Other future research will explore the sensitivity of the bottleneck features to the DNN’s configuration, and training data quality and quantity.
Acknowledgments
The authors would like to thank Patrick Cardinal, Yu Zhang and Ekapol Chuangsuwanich at MIT CSAIL for sharing their DNN expertise and GPU optimized DNN training software.
References
 [1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, pp. 82–97, November 2012.
 [2] Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, “I-vector representation based on bottleneck features for language identification,” IEEE Electronics Letters, pp. 1569–1580, 2013.
 [3] P. Matejka, L. Zhang, T. Ng, H. S. Mallidi, O. Glembek, J. Ma, and B. Zhang, “Neural network bottleneck features for language identification,” in Proc. of IEEE Odyssey, 2014, pp. 299–304.
 [4] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic language identification using deep neural networks,” in Proc. of ICASSP, 2014, pp. 5374–5378.
 [5] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking speaker identification using bottleneck features of DNN,” in Proc. of Interspeech, 2013, pp. 3661–3664.
 [6] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. of ICASSP, 2014, pp. 1714–1718.
 [7] Y. Lei, L. Ferrer, A. Lawson, M. McLaren, and N. Scheffer, “Application of convolutional neural networks to language identification in noisy conditions,” in Proc. of IEEE Odyssey, 2014, pp. 287–292.
 [8] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep neural networks for extracting Baum-Welch statistics for speaker recognition,” in Proc. of IEEE Odyssey, 2014, pp. 293–298.
 [9] O. Ghahabi and J. Hernando, “I-vector modeling with deep belief networks for multi-session speaker recognition,” in Proc. of IEEE Odyssey, 2014, pp. 305–310.
 [10] A. K. Sarkar, C.-T. Do, V.-B. Le, and C. Barras, “Combination of cepstral and phonetically discriminative features for speaker verification,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1040–1044, Sept. 2014.
 [11] N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
 [12] N. Dehak, P. Torres-Carrasquillo, D. Reynolds, and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in Proc. of Interspeech, 2011, pp. 857–860.
 [13] Y. Zhang, E. Chuangsuwanich, and J. Glass, “Extracting deep neural network bottleneck features using low-rank matrix factorization,” in Proc. of ICASSP, 2014, pp. 185–189.
 [14] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proc. of Interspeech, 2011, pp. 249–252.
 [15] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Proc. of ICASSP, 2013.
 [16] K. Vesely, M. Karafiat, and F. Grezl, “Convolutive bottleneck network features for LVCSR,” in Proc. of IEEE ASRU, 2011, pp. 42–47.
 [17] D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain adaptation challenge using deep neural networks,” in Proc. of IEEE SLT Workshop, 2014.
 [18] J. Godfrey, E. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Proc. of ICASSP, 1992, pp. 517–520.
 [19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. of IEEE ASRU, 2011.
 [20] S. H. Shum, D. A. Reynolds, D. Garcia-Romero, and A. McCree, “Unsupervised clustering approaches for domain adaptation in speaker recognition systems,” in Proc. of IEEE Odyssey, 2014, pp. 265–272.
 [21] “The 2011 NIST language recognition evaluation plan,” 2011.
 [22] E. Singer, P. Torres-Carrasquillo, D. Reynolds, A. McCree, F. Richardson, N. Dehak, and D. Sturim, “The MITLL NIST LRE 2011 language recognition system,” in Proc. of IEEE Odyssey, 2011, pp. 209–215.