that deep neural networks derived from acoustic models in ASR can be used as universal background models (UBMs) to provide phoneme posteriors as well as bottleneck features. While these have shown better performance than conventional UBMs based on Gaussian mixture models (GMMs), they have the drawback of language dependency and also require expensive phonetic transcriptions for training .
More recently, DNNs have been shown to be useful for extracting speaker-discriminative feature vectors independently of the i-vector framework. Given a certain amount of training data, such approaches lead to better results, particularly for short-duration utterances. DNNs have been shown to achieve better accuracy than i-vectors. Statistics pooling has also been employed to aggregate frame-level speaker representations into an utterance-level representation, i.e., a speaker embedding (x-vector), with a fixed number of dimensions regardless of the length of the input utterance.
More recent studies, conducted from a different perspective, have incorporated attention mechanisms, which had produced significant improvements in machine translation. In speaker recognition, an importance measure is computed by a small attention network that works as part of the speaker embedding network, and it is used in the pooling layer to calculate the weighted mean of frame-level feature vectors. This approach has been applied to text-dependent and text-independent speaker recognition, including fixed-duration and variable-duration settings.
The attention mechanism is a powerful technique which offers a way to obtain an even more discriminative utterance-level representation by explicitly selecting frame-level features that better represent speaker characteristics. Its remarkable advantage is that the attention model is automatically trained as a part of the deep speaker embedding network, in accord with a single objective function. Without attaching any additional labels, such as which frames are important, the attention model is optimized together with its parent network just so as to minimize speaker identification errors. This configuration of the attention mechanism suggests that the weight computed by the attention model is tightly bound to the frame-level features produced by the speaker embedding network.
Thus, a question arises: What does the attention model actually learn? It may not learn a general importance of frames but, rather, learn something specific to frame-level features which the coupled speaker embedding network produces. Even if it works well with the coupled speaker embedding network, it may not work well with other networks or conventional i-vector extractors, for which the i-vector paradigm continues to have its own advantages under some practical conditions, including relatively long speech . This paper attempts to give an answer to the above question through a series of experiments and demonstrates that the attention model coupled with a deep speaker embedding network can work with other networks and even with i-vector extractors. To the best of our knowledge, no study has yet been done on deep speaker embedding from such a perspective.
The remainder of this paper is organized as follows: Section 2 describes a conventional method for extracting deep speaker embedding with an attention mechanism. Section 3 presents fundamental formulae for the i-vector framework and how the attention weights from a deep speaker embedding network can be applied to it. The experimental setup and results are presented in Section 4. Section 5 summarizes our work.
2 Deep speaker embedding
Speaker embeddings are low-dimensional representations of speech utterances that capture speaker characteristics for recognition tasks. Presented below is a brief description of deep speaker embedding obtained with a DNN having either a non-attentive or an attentive statistics pooling layer.
2.1 Embedding via statistics pooling
A conventional DNN for extracting an utterance-level speaker representation consists of three modules, as shown in Figure 1.
The first module is a frame-level feature extractor. The input to this module is a sequence of acoustic features, e.g., Mel-frequency cepstral coefficients (MFCCs) or filter-bank coefficients. Taking a relatively short-term context of acoustic features into account, this module outputs frame-level features. Any type of neural network is applicable for the extractor, e.g., a Time-Delay Neural Network (TDNN), Convolutional Neural Network (CNN), LSTM, or Gated Recurrent Unit (GRU).
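The second module is a pooling layer that converts variable-length frame-level features into a fixed-dimensional vector. The most standard type of pooling layer obtains the mean vector $\boldsymbol{\mu}$ of all frame-level features $\mathbf{h}_t$ ($t = 1, \dots, T$):

$$\boldsymbol{\mu} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{h}_t, \tag{1}$$

where $T$ indicates the number of frames. In addition to the mean, the second-order statistics, i.e., the standard deviation vector $\boldsymbol{\sigma}$, have been used as well:

$$\boldsymbol{\sigma} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \mathbf{h}_t \odot \mathbf{h}_t - \boldsymbol{\mu} \odot \boldsymbol{\mu}}, \tag{2}$$

where $\odot$ represents the Hadamard product.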
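The pooling step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `statistics_pooling` and the random stand-in features are hypothetical, and the clipping of tiny negative variances is a numerical safeguard that the text does not mention.

```python
import numpy as np

def statistics_pooling(h):
    """Pool frame-level features h of shape (T, D) into one (2*D,) vector."""
    mu = h.mean(axis=0)                              # mean over frames
    second_moment = (h * h).mean(axis=0)             # Hadamard square, averaged over frames
    var = np.maximum(second_moment - mu * mu, 0.0)   # clip round-off negatives (assumption)
    sigma = np.sqrt(var)                             # standard deviation vector
    return np.concatenate([mu, sigma])

# Random stand-ins for frame-level features of two utterances of different length.
rng = np.random.default_rng(0)
short_utt = rng.standard_normal((120, 8))
long_utt = rng.standard_normal((900, 8))
pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)
```

Both utterances map to vectors of the same size, which is the property that lets the following fully-connected layers operate on utterances of any length.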
The third module produces utterance-level representations, for which a number of fully-connected hidden layers are stacked. One of these hidden layers is often designed to have a smaller number of units (i.e., to be a bottleneck layer), which forces the information brought from the preceding layer into a low-dimensional representation. The output is a softmax layer in which each output node corresponds to one speaker ID. For training, we employ back-propagation with cross-entropy loss. We can then use bottleneck features in the third module as utterance-level representations. Some studies refrain from using softmax layers and achieve end-to-end neural networks by using contrastive loss or triplet loss. Probabilistic linear discriminant analysis (PLDA) can also be used for measuring the distance between two utterances.
2.2 Attentive speaker embedding
It is often the case that frame-level features of some frames are more unique and important for discriminating speakers than others in a given utterance. Recent studies  have applied attention mechanisms to speaker recognition for the purpose of frame selection by automatically calculating the importance of each frame.
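As shown in Figure 2, an attention model works in conjunction with the original DNN and calculates a scalar score $e_t$ for each frame-level feature $\mathbf{h}_t$:

$$e_t = \mathbf{v}^{\top} f(\mathbf{W} \mathbf{h}_t + \mathbf{b}) + k, \tag{3}$$

where $f(\cdot)$ is a non-linear activation function. The score is normalized over all $T$ frames by a softmax function so that the weights sum to one:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}. \tag{4}$$

The normalized score $\alpha_t$ is then used as the weight in the pooling layer to calculate the following weighted mean and standard deviation vectors:

$$\tilde{\boldsymbol{\mu}} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t, \tag{5}$$

$$\tilde{\boldsymbol{\sigma}} = \sqrt{\sum_{t=1}^{T} \alpha_t \mathbf{h}_t \odot \mathbf{h}_t - \tilde{\boldsymbol{\mu}} \odot \tilde{\boldsymbol{\mu}}}, \tag{6}$$

where $\odot$ is the Hadamard product. In this way, the utterance-level representations in the form of weighted statistics focus on important frames and hence become more speaker discriminative. Notice that, if we set $\alpha_t = 1/T$ for all frames, attentive pooling in Eqs. (5) and (6) falls back to the non-attentive (i.e., equal weight) pooling in Eqs. (1) and (2). Previous results have shown considerable performance improvement in speaker verification tasks using attentive weights produced by an attention model in the pooling layer.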
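As an illustration of attentive pooling, the sketch below scores each frame with a one-hidden-layer network, normalizes the scores with a softmax, and computes the weighted mean and standard deviation. The parameter names (`W`, `b`, `v`, `k`), the ReLU activation, and the random parameters are assumptions made for this example; features are stored as rows, so the hidden layer is computed as `h @ W`.

```python
import numpy as np

def attention_weights(h, W, b, v, k):
    """Softmax-normalized weight for each of the T frames in h (shape (T, D))."""
    hidden = np.maximum(h @ W + b, 0.0)   # ReLU hidden layer, shape (T, H)
    e = hidden @ v + k                    # one scalar score per frame, shape (T,)
    w = np.exp(e - e.max())               # subtract the max for numerical stability
    return w / w.sum()

def attentive_statistics_pooling(h, alpha):
    """Weighted mean and standard deviation of frame-level features."""
    mu = alpha @ h
    var = np.maximum(alpha @ (h * h) - mu * mu, 0.0)  # clip round-off negatives
    return np.concatenate([mu, np.sqrt(var)])

rng = np.random.default_rng(0)
T, D, H = 50, 6, 4                        # frames, feature dim, hidden nodes
h = rng.standard_normal((T, D))
W, b = rng.standard_normal((D, H)), rng.standard_normal(H)
v, k = rng.standard_normal(H), 0.1
alpha = attention_weights(h, W, b, v, k)
pooled = attentive_statistics_pooling(h, alpha)

# Equal weights recover plain (non-attentive) statistics pooling.
pooled_uniform = attentive_statistics_pooling(h, np.full(T, 1.0 / T))
```

In a real system the attention parameters are trained jointly with the embedding network rather than drawn at random; the snippet only shows the forward computation.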
3 What an Attention Model Learns
3.1 Motivation and objectives
In the attentive speaker embedding described in Subsection 2.2, the attention model is trained as a part of the speaker embedding network in accord with a single objective function so as to maximize the speaker-discriminative power. It is reasonable to assume that the weights produced by the attention model are tightly bound to the frame-level features which the speaker embedding network produces. The question is whether the weights and frame-level features are still able to play their roles when they are decoupled from the DNN. To the best of our knowledge, there have been few studies on such a perspective toward better understanding of deep speaker embedding.
The use of an attention mechanism in deep speaker embedding (x-vector) extraction is largely driven by the statistics pooling layer. The aim is to find an optimal set of weights for each utterance such that higher weights are assigned to frames which are more unique and important than others in the statistics pooling operation. It is intuitive to conjecture that frames receiving higher weights correspond to certain phonetic classes (e.g., vowels) which are more effective or useful for discriminating among speakers. Another line of thought has suggested that the attention weight might be associated with just simple speech versus non-speech classes. Figure 3 shows a scatter plot of the attention weights (y-axis) against the speech (versus non-speech) class posteriors, expressed as log odds (x-axis), for one utterance drawn from the SRE’16 corpus. The speech class posterior is estimated using an LSTM neural network, where the non-speech class encompasses laughter, unclear voices, noise-like (noise, sigh, lip smack, cough and breath) phenomena, and silence. See Section 4 for details regarding datasets and our experimental setup. A simple analysis gives a normalized correlation coefficient of 0.37. This weak correlation suggests that the attention weights relate to more than just speech/non-speech detection.
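The kind of analysis described above could be reproduced along the following lines. This is a sketch under assumptions: the "normalized correlation coefficient" is taken to be the Pearson coefficient, and the posteriors and attention weights are synthetic stand-ins, so the resulting value says nothing about the figure reported in the text.

```python
import numpy as np

def log_odds(p, eps=1e-8):
    """Log odds of a posterior probability, clipped away from 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic stand-ins: per-frame speech posteriors and loosely related attention weights.
rng = np.random.default_rng(0)
speech_posterior = rng.uniform(0.01, 0.99, size=500)
scores = 0.3 * log_odds(speech_posterior) + rng.standard_normal(500)
attention = np.exp(scores) / np.exp(scores).sum()

r = pearson_correlation(log_odds(speech_posterior), attention)
```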
In this study, we set out to consider various aspects of the attention weights mentioned above. We present the results of three evaluations: (1) Using attentive frame-level features alone without attention weights, (2) Applying attention weights to non-attentive frame-level features from another deep speaker embedding network, and (3) Applying attention weights to statistics for i-vector extraction. In addition, we also try (4) Combining soft voice activity detection (VAD) as another kind of attention mechanism for better speaker recognition accuracy. Among those, (3) is especially new since the i-vector framework is quite different from a deep speaker embedding framework, such as that with x-vectors. Details in this regard are presented in the next subsection.
3.2 I-vector extraction with attention weights
3.2.1 i-vector extraction
The i-vector framework has been a standard in speaker recognition over the last decade . In spite of the increasing research on DNN-based methods, the i-vector framework continues to have its own advantage for some conditions, including relatively long speech .
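An i-vector is a low-dimensional vector in the total variability space, onto which factor analysis projects an utterance. It is assumed that a GMM-supervector $\mathbf{M}$, corresponding to an utterance, can be modeled as

$$\mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w}, \tag{7}$$

where $\mathbf{m}$ is the speaker- and channel-independent supervector typically taken from a universal background model (UBM), and the total variability matrix (TVM) $\mathbf{T}$ is a rectangular matrix of low rank. An i-vector is the posterior mean of the latent variable $\mathbf{w}$ in Eq. (7).

The i-vector for a given utterance $u$ can be obtained using the following equation:

$$\mathbf{w} = \left(\mathbf{I} + \mathbf{T}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{N}(u) \mathbf{T}\right)^{-1} \mathbf{T}^{\top} \boldsymbol{\Sigma}^{-1} \tilde{\mathbf{F}}(u), \tag{8}$$

where $\boldsymbol{\Sigma}$ is the block-diagonal covariance matrix of supervectors obtained from the UBM. This equation uses two types of statistics w.r.t. the utterance, $\mathbf{N}(u)$ and $\tilde{\mathbf{F}}(u)$. When a GMM is used as a UBM, for example, these statistics on mixture component $c$ of the UBM are written as follows:

$$N_c(u) = \sum_{t=1}^{L} \gamma_t(c), \tag{9}$$

$$\tilde{\mathbf{F}}_c(u) = \sum_{t=1}^{L} \gamma_t(c) \left(\mathbf{x}_t - \mathbf{m}_c\right), \tag{10}$$

where $\mathbf{x}_t$ is the acoustic feature at the $t$-th frame of utterance $u$ with $L$ frames, $\gamma_t(c)$ corresponds to the posterior probability of mixture component $c$ for acoustic feature $\mathbf{x}_t$, and $\mathbf{m}_c$ is the mean of mixture component $c$.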
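The extraction step can be sketched as follows: accumulate the zeroth-order and centered first-order statistics per mixture component, then solve the linear system that gives the posterior mean of the latent variable. The sketch assumes diagonal UBM covariances (a special case of the block-diagonal matrix mentioned above); all function names, dimensions, and random parameters are hypothetical.

```python
import numpy as np

def baum_welch_stats(x, gamma, ubm_means):
    """x: (L, F) features, gamma: (L, C) posteriors, ubm_means: (C, F)."""
    n = gamma.sum(axis=0)                          # zeroth-order statistics, (C,)
    f = gamma.T @ x - n[:, None] * ubm_means       # centered first-order statistics, (C, F)
    return n, f

def extract_ivector(n, f, tv_matrix, ubm_vars):
    """tv_matrix: (C*F, R) total variability matrix, ubm_vars: (C, F) diagonal covariances."""
    C, F = f.shape
    R = tv_matrix.shape[1]
    inv_sigma = 1.0 / ubm_vars.reshape(C * F)      # diagonal of the inverse covariance
    n_expanded = np.repeat(n, F)                   # diagonal of N(u), one copy of n_c per dimension
    tt_sinv = tv_matrix.T * inv_sigma              # T^T Sigma^{-1}, shape (R, C*F)
    precision = np.eye(R) + (tt_sinv * n_expanded) @ tv_matrix
    return np.linalg.solve(precision, tt_sinv @ f.reshape(C * F))

rng = np.random.default_rng(0)
L, F, C, R = 200, 3, 4, 5                          # frames, feature dim, mixtures, i-vector dim
x = rng.standard_normal((L, F))
gamma = rng.dirichlet(np.ones(C), size=L)          # each row sums to one
ubm_means = rng.standard_normal((C, F))
ubm_vars = rng.uniform(0.5, 1.5, size=(C, F))
tvm = 0.1 * rng.standard_normal((C * F, R))

n, f = baum_welch_stats(x, gamma, ubm_means)
ivector = extract_ivector(n, f, tvm, ubm_vars)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit matrix inverse, which is a common numerical choice rather than something prescribed by the text.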
3.2.2 Extended i-vector extraction with attention weights
As noted in Subsection 2.2, it is often the case that some frames are more unique and important for discriminating speakers than others in a given utterance. Applying the attention model to an x-vector extraction network has been shown to improve speaker verification performance, which indicates that the attention weights are able to represent the importance of deep frame-level features. Under the assumption that x-vectors fairly represent a speech utterance, the attention weights should be general in representing the importance of frames and independent of the feature representation. In other words, the importance of frames is independent of the representation, i.e., deep speaker embedding (x-vector) or i-vector. For this reason, we propose applying the attention weights trained with an x-vector network to i-vector extraction in order to emphasize more important frames.
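The attention weights $\alpha_t$ in Eq. (4) can be seamlessly incorporated into the formulation of i-vector extraction. We extend the framework of standard i-vector extraction by incorporating attention weights into the statistics of Eqs. (9) and (10) as follows:

$$N_c(u) = L \sum_{t=1}^{L} \alpha_t \gamma_t(c), \tag{11}$$

$$\tilde{\mathbf{F}}_c(u) = L \sum_{t=1}^{L} \alpha_t \gamma_t(c) \left(\mathbf{x}_t - \mathbf{m}_c\right), \tag{12}$$

where $\gamma_t(c)$, $\mathbf{x}_t$, and $\mathbf{m}_c$ are as in Eqs. (9) and (10), and $L$ is the number of frames in the utterance, i.e., the frame count over which the attention weights are normalized. The scale factor $L$ ensures that Eq. (8) is kept the same for the new i-vector extraction. Notice that Eqs. (11) and (12) reduce to Eqs. (9) and (10), respectively, by using an equal weight $\alpha_t = 1/L$ for all frames.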
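A minimal sketch of the attention-weighted statistics follows. It assumes the scale factor is the number of frames, which is the choice that makes equal weights reproduce the standard statistics; the function name and the random data are hypothetical.

```python
import numpy as np

def weighted_baum_welch_stats(x, gamma, ubm_means, alpha):
    """x: (L, F) features, gamma: (L, C) posteriors, alpha: (L,) attention weights."""
    num_frames = x.shape[0]
    scaled = num_frames * alpha[:, None] * gamma     # frame count as scale factor (assumption)
    n = scaled.sum(axis=0)                           # weighted zeroth-order statistics
    f = scaled.T @ x - n[:, None] * ubm_means        # weighted centered first-order statistics
    return n, f

rng = np.random.default_rng(0)
num_frames, feat_dim, num_mix = 100, 3, 4
x = rng.standard_normal((num_frames, feat_dim))
gamma = rng.dirichlet(np.ones(num_mix), size=num_frames)
ubm_means = rng.standard_normal((num_mix, feat_dim))

raw = rng.standard_normal(num_frames)
alpha = np.exp(raw) / np.exp(raw).sum()              # stand-in attention weights
n_att, f_att = weighted_baum_welch_stats(x, gamma, ubm_means, alpha)

# Equal weights give back the unweighted statistics.
uniform = np.full(num_frames, 1.0 / num_frames)
n_uni, f_uni = weighted_baum_welch_stats(x, gamma, ubm_means, uniform)
```

Because only the statistics change, the i-vector solve itself can stay exactly as in the standard extractor.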
4 Experiments
We have evaluated the performance of speaker embedding on a speaker verification task in the NIST 2016 Speaker Recognition Evaluation (SRE’16). In the experiments, we followed the fixed condition in which only the designated data are used for system training. We used English-language telephone recordings from SRE’04–’10, Switchboard, and Fisher for training of all our systems. The evaluation set consists of 1,986,728 trials taken from Call My Net telephone conversations spoken in Cantonese and Tagalog.
In addition to equal error rate (EER), results are reported w.r.t. the official performance metric for SRE’16, i.e., equalized $C_{\mathrm{primary}}$, the average detection cost function at two operating points. More precisely, we use the minimum cost ($\min C_{\mathrm{primary}}$), which indicates the best achievable performance without considering the issue of score calibration.
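The two metrics can be illustrated with a simple threshold sweep. This sketch rests on several assumptions: the operating points (target priors of 0.01 and 0.005 with unit costs) reflect our reading of the SRE’16 evaluation plan, the per-point minima are simply averaged, and the equalization over trial partitions is omitted. The scores are synthetic, and the quadratic-time sweep is meant for clarity, not for millions of trials.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates at every candidate threshold."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Each observed score is a candidate threshold; +inf rejects every trial.
    thresholds = np.append(np.sort(np.concatenate([tar, non])), np.inf)
    p_miss = np.array([(tar < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    return p_miss, p_fa

def equal_error_rate(p_miss, p_fa):
    """Error rate at the threshold where miss and false-alarm rates are closest."""
    idx = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[idx] + p_fa[idx])

def min_normalized_dcf(p_miss, p_fa, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the cost of the best trivial system."""
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))

rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 500)       # synthetic target-trial scores
non = rng.normal(0.0, 1.0, 2000)      # synthetic non-target-trial scores
p_miss, p_fa = error_rates(tar, non)
eer = equal_error_rate(p_miss, p_fa)
operating_points = (0.01, 0.005)      # assumed target priors
min_cost = float(np.mean([min_normalized_dcf(p_miss, p_fa, p) for p in operating_points]))
```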
4.1 Investigation of decoupled attention weights and frame-level features
We first investigated decoupled attention weights and frame-level features extracted from a deep attentive speaker embedding network (x-vector).
4.1.1 Experimental settings
We used 20-dimensional MFCCs for every 10ms. Sliding mean normalization with a 3-second window and energy-based voice activity detection (VAD) were then applied, in that order.
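The front-end steps above can be sketched as follows. With a 10 ms frame shift, a 3-second window corresponds to 300 frames. The centered window that is truncated at utterance edges, and the simple mean-plus-offset rule for the energy-based VAD, are assumptions made for this illustration; the actual toolkit settings are not given in the text.

```python
import numpy as np

def sliding_mean_normalize(feats, window=300):
    """Subtract a centered sliding-window mean from each frame (window in frames)."""
    num_frames = feats.shape[0]
    # Prefix sums allow each window mean to be computed without a loop.
    csum = np.vstack([np.zeros((1, feats.shape[1])), np.cumsum(feats, axis=0)])
    t = np.arange(num_frames)
    start = np.clip(t - window // 2, 0, num_frames)
    end = np.clip(t - window // 2 + window, 0, num_frames)
    means = (csum[end] - csum[start]) / (end - start)[:, None]
    return feats - means

def energy_vad(log_energy, offset=0.0):
    """Hypothetical energy-based VAD: keep frames above the utterance mean plus an offset."""
    return log_energy > log_energy.mean() + offset

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((1000, 20)) + 5.0   # stand-in for 20-dimensional MFCCs
log_energy = rng.standard_normal(1000)         # stand-in for per-frame log energy

normalized = sliding_mean_normalize(mfcc)      # normalization first ...
speech = energy_vad(log_energy)
kept = normalized[speech]                      # ... then frame selection
```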
A 5-layer TDNN with ReLU followed by batch normalization was used for extracting frame-level features. The number of hidden nodes in each hidden layer was 512. The dimension of a frame-level feature for pooling was 1500. Each frame-level feature was generated from a 15-frame context of acoustic feature vectors.
The pooling layer aggregates frame-level features to produce the mean and standard deviation, followed by 2 fully-connected layers with ReLU activation functions, batch normalization, and a softmax output layer. The 512-dimensional bottleneck features from the first fully-connected layer were used as speaker embeddings. For the attention model, we used ReLU followed by batch normalization as the activation function in Eq. (3). The number of hidden nodes was 64.
We compared four systems with two pooling techniques to evaluate the coupled and decoupled attention weights and frame-level features, as shown in Table 1: (S1) frame-level features from a non-attentive network were aggregated without an attention mechanism, (S2) frame-level features from an attentive network were aggregated with an attention mechanism, (S3) attention weights in S2 were applied to frame-level features in S1, and (S4) frame-level features in S2 were used in non-attentive pooling. Note that S3 and S4 contain a mismatch between frame-level features and attention weights.
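The four configurations differ only in which frame-level features and which weights enter the pooling layer, as the sketch below illustrates. The feature matrices and attention weights are random stand-ins, and the function and variable names are hypothetical. How the pooled statistics are passed on to the fully-connected layers in the mismatched systems is not shown here.

```python
import numpy as np

def pool(features, weights=None):
    """Weighted mean and standard deviation; equal weights when none are given."""
    num_frames = features.shape[0]
    alpha = np.full(num_frames, 1.0 / num_frames) if weights is None else weights
    mu = alpha @ features
    sigma = np.sqrt(np.maximum(alpha @ features**2 - mu**2, 0.0))
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
num_frames, dim = 40, 4
feats_nonatt = rng.standard_normal((num_frames, dim))  # stand-in: non-attentive network features
feats_att = rng.standard_normal((num_frames, dim))     # stand-in: attentive network features
raw = rng.standard_normal(num_frames)
att_weights = np.exp(raw) / np.exp(raw).sum()          # stand-in: weights from the attentive network

systems = {
    "S1": pool(feats_nonatt),               # matched, non-attentive
    "S2": pool(feats_att, att_weights),     # matched, attentive
    "S3": pool(feats_nonatt, att_weights),  # attention weights on non-attentive features
    "S4": pool(feats_att),                  # attentive features without their weights
}
```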
Mean subtraction, whitening, and length normalization  were applied to the speaker embedding as pre-processing steps prior to PLDA scoring, and likelihood scores were then computed using a PLDA model with a speaker space of 512 dimensions.
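The pre-processing chain before PLDA scoring can be sketched as below. The eigendecomposition-based whitening transform, the variance floor `eps`, and the random training embeddings are assumptions for illustration; PLDA scoring itself is not shown.

```python
import numpy as np

def fit_whitener(train_embeddings, eps=1e-10):
    """Estimate the mean and a whitening transform from training embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue).
    whitening = eigvecs / np.sqrt(np.maximum(eigvals, eps))
    return mean, whitening

def preprocess(embeddings, mean, whitening):
    """Mean subtraction, whitening, then length normalization to unit norm."""
    z = (embeddings - mean) @ whitening
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
dim = 5
mixing = rng.standard_normal((dim, dim))
train = rng.standard_normal((2000, dim)) @ mixing + 3.0   # correlated, non-zero-mean stand-ins
mean, whitening = fit_whitener(train)
processed = preprocess(train[:10], mean, whitening)
```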
4.1.2 Experimental results and analyses
Experimental results for systems S1–S4 are shown in Table 2. A comparison of S1 and S2 shows that applying an attention mechanism in conjunction with the original DNN improved deep speaker embedding-based speaker verification performance, with a 3.2% reduction in EER and a 2.3% reduction in the minimum cost ($\min C_{\mathrm{primary}}$). These results are consistent with previously reported results.
S3 applied the attention weights from the attentive model to the frame-level features from the non-attentive speaker embedding network; it outperformed S1 and even achieved results comparable to those of S2. S4 extracted frame-level features from the attentive speaker embedding network and then applied non-attentive pooling by discarding the simultaneously trained attention model. Surprisingly, its performance was much worse. The comparison of the four systems indicates that: (1) attention weights derived from an attentive speaker embedding network can be used with frame-level features from a non-attentive network, and (2) decoupling the attention model from an attentive embedding network is detrimental.
4.2 I-vector extraction with attention weights
In accord with the results obtained in Subsection 4.1, we examined a new combination of embeddings and attention weights.
4.2.1 Experimental settings
The baseline i-vector system and our proposed system use 20-dimensional MFCCs for every 10ms, the same as with deep speaker embedding systems. Their delta and delta-delta features were appended to form 60-dimensional acoustic features. Sliding mean normalization with a 3-second window and energy-based VAD were then applied in the same way as was done with deep speaker embedding systems. An i-vector of 400 dimensions was then extracted from the acoustic feature vectors, using a 2048-mixture UBM and a total variability matrix (TVM). Mean subtraction, whitening, and length normalization  were applied to the i-vector in the same way as was done with deep speaker embedding systems, and similarity was then evaluated using a PLDA model with a speaker space of 400 dimensions. For our proposed i-vector extraction with attention weights, the weights were extracted from S2, as described in Subsection 4.1.
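Appending delta and delta-delta features triples the feature dimension from 20 to 60. The sketch below uses the common regression formula for deltas; the regression half-width of 2 frames and the edge padding are assumptions, since the text does not state how the deltas were computed.

```python
import numpy as np

def deltas(feats, n=2):
    """Regression-based delta features with a half-width of n frames."""
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")   # repeat edge frames
    denom = 2.0 * sum(i * i for i in range(1, n + 1))
    out = np.zeros_like(feats, dtype=float)
    for i in range(1, n + 1):
        # Difference between the frame i steps ahead and the frame i steps behind.
        out += i * (padded[n + i : n + i + num_frames] - padded[n - i : n - i + num_frames])
    return out / denom

def add_deltas(mfcc):
    """Stack static, delta, and delta-delta features along the feature axis."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.hstack([mfcc, d1, d2])

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 20))   # stand-in for 20-dimensional MFCCs
acoustic = add_deltas(mfcc)             # 60-dimensional acoustic features
```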
4.2.2 Experimental results and analyses
Table 3 shows the results of the i-vector systems. S5 represents the conventional i-vector baseline. S6 is the proposed i-vector extraction with the attention weights derived from deep speaker embeddings. With the proposed method, the i-vector-based system was improved by a 6.6% reduction in EER and a 3.5% reduction in the minimum cost ($\min C_{\mathrm{primary}}$). The attention weights derived from an attention model in a deep speaker embedding network (S2) improved not only the matched x-vectors but also the i-vectors, which the attention model had never seen. This interesting result suggests that the weights from the attention model trained with deep speaker embedding are able to represent the importance of frames regardless of the type of feature representation, i-vector or x-vector.
Note that the i-vector and x-vector paradigms have their own advantages under different conditions and with different measures. In this paper, we do not compare across x-vector and i-vector systems.
|System||EER (%)||min $C_{\mathrm{primary}}$|
|S6: with attention||12.18||0.797|
4.3 Combination of attention mechanisms and soft VAD
Previous work has shown that using voice posteriors as the weights (soft VAD) in i-vector extraction improves speaker recognition performance. We tried combining attention weights with voice posteriors: we replaced the attention weight with the product of the attention weight and the voice posterior in Eqs. (5) and (6) for deep speaker embeddings, and in Eqs. (11) and (12) for i-vector extraction.
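The combination amounts to an element-wise product of two per-frame quantities, as sketched below. Whether the product is renormalized to sum to one is not stated in the text; the `renormalize` option is an assumption added so that the result can be used directly as a set of pooling weights. The function name and the toy values are hypothetical.

```python
import numpy as np

def combine_with_soft_vad(alpha, voice_posterior, renormalize=True):
    """Per-frame product of attention weight and voice posterior."""
    combined = alpha * voice_posterior
    if renormalize:                       # assumption: keep the weights summing to one
        combined = combined / combined.sum()
    return combined

alpha = np.full(5, 0.2)                                  # equal attention weights over 5 frames
voice_posterior = np.array([1.0, 1.0, 0.0, 0.5, 0.5])    # frame 2 is judged non-speech
combined = combine_with_soft_vad(alpha, voice_posterior)
```

Frames with a voice posterior of zero receive no weight, regardless of what the attention model assigns to them.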
4.3.1 Experimental settings
We used a previously reported soft VAD, for which an LSTM (Long Short-Term Memory) neural network was trained for voice posterior estimation. A subset of the Fisher corpus which consists of only the transcribed segments including noise was utilized as the training data for the VAD. Five classes were assigned as the LSTM output: voice, laughter, noise, unclear voice, and silence. Training was implemented using the nnet3 neural network library in Kaldi’s official repository. The acoustic features were 40-dimensional MFCCs extracted from a frame of 25ms width at every 10ms.
4.3.2 Experimental results and analyses
Results are shown in Table 4. From S1 and S5, we see that applying the voice posterior as a weight in pooling not only improves the performance of the i-vector system (S5) but also works in the deep speaker embedding system (S1). From S2 and S6, we see that applying soft VAD to an attentive model further improved performance.
All the experiments described so far were done with systems trained on English telephone recordings from SRE’04–’10, Switchboard, and Fisher. A certain amount of domain difference exists between the training data and the evaluation data, including language and channel differences. In our last experiment, we applied Kaldi’s unsupervised domain adaptation to adapt the PLDA in systems S1–S6 using the SRE’16 unlabeled development data, which included 2,274 Cantonese and Tagalog utterances. Here, we refer to pre-adaptation systems as out-of-domain systems and to post-adaptation systems as in-domain systems.
Table 5 shows the performance of the adapted in-domain systems, which can be compared with the results in Table 4. Adaptation improved performance considerably, and the trends observed in the out-of-domain systems remained the same in the in-domain systems. We achieved our best results by applying domain adaptation, attention weights, and soft VAD together.
5 Summary
This paper has presented an experimental investigation of deep speaker embedding with an attention mechanism. Interesting results include: (1) attention weights derived from an attentive speaker embedding network can be used with frame-level features from a non-attentive network, and (2) decoupling the attention model from an attentive embedding network is detrimental. Inspired by these findings, we have also proposed the application of attention weights from a deep speaker embedding network to another type of speaker embedding: the i-vector. Experimental results have shown a 9.0% reduction in EER and a 3.8% reduction in the minimum cost ($\min C_{\mathrm{primary}}$), which suggests that the attention weights can represent the importance of frames regardless of the feature representations of the frames. This indicates the possibility of an extension to other speaker embeddings in the future. Finally, we have shown that combining soft VAD with an attention weight further improves performance in deep speaker embedding and i-vector systems, by 6.6% and 1.6%, respectively.
-  D. Yu, F. Seide, and G. Li, “Conversational speech transcription using context-dependent deep neural networks,” in the 29th International Conference on Machine Learning (ICML), 2012, pp. 1–2.
-  G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” in IEEE Signal Processing Magazine, 2012, vol. 29, pp. 82–97.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in , 2015.
-  Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 1695–1699.
-  M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep neural network approaches to speaker recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4814–4818.
-  J. Chien and C. Hsu, “Variational manifold learning for speaker recognition,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2017, pp. 4935–4939.
-  D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Spoken Language Technology workshop (SLT), 2016.
-  C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
-  N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” in IEEE Transactions on Audio, Speech, and Language Processing, 2011, pp. 788–798.
-  H. Zheng, S. Zhang, and W. Liu, “Exploring robustness of DNN/RNN for extracting speaker Baum-Welch statistics in mismatched conditions,” in INTERSPEECH, 2015, pp. 1161–1165.
-  Y. Tian, M. Cai, L. He, W. Zhang, and J. Liu, “Improving deep neural networks based speaker verification using unlabeled data,” in Interspeech, 2016, pp. 1863–1867.
-  A. Nagrani, J. Chung, and A. Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” in Interspeech, 2017, pp. 2616–2620.
-  D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
-  G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Interspeech, 2017, pp. 1517–1521.
-  F. Chowdhury, Q. Wang, I. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017.
-  K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Interspeech, 2018.
-  Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” in Interspeech, 2018.
-  C. Raffel and D. Ellis, “Feed-forward networks with attention can solve some long-term memory problems,” arXiv preprint arXiv:1512.08756, 2015.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
-  N. Brummer, A. Silnova, L. Burget, and T. Stafylakis, “Gaussian meta-embeddings for efficient scoring of a heavy-tailed plda model,” in Odyssey 2018 The Speaker and Language Recognition Workshop, 2018.
-  S. Ioffe, “Probabilistic linear discriminant analysis,” in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 531–542.
-  S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE 11th International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
-  “NIST 2016 speaker recognition evaluation plan,” Available: https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016.
-  H. Yamamoto, K. Okabe, and T. Koshinaka, “Robust i-vector extraction tightly coupled with voice activity detection using deep neural networks,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pp. 600–604.
-  N. Brümmer, “Measuring, refining and calibrating speaker and language information extracted from speech,” Ph.D. thesis, University of Stellenbosch, Stellenbosch, South Africa, 2010.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding (ASRU). IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
-  D. Garcia-Romero and C. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Interspeech, 2011, pp. 249–252.