Generating a compact representation that distinguishes speakers has been an attractive topic and is widely used in related studies, such as speaker identification, verification [19, 14, 11], detection, segmentation [7, 23], and speaker-dependent speech enhancement [2, 6].
To extract a general representation, Dehak et al. 
defined a “total variability space” containing the speaker and channel variabilities simultaneously, and then extracted speaker factors by decomposing the feature space into subspaces corresponding to the sound factors, including speaker and channel effects. With the rapid development of deep learning, several architectures using deep neural networks (DNNs) have been developed for general speaker representation [22, 20]. Variani et al.  introduced the d-vector approach, using a DNN and averaging the activations of the last hidden layer over all frame-level features. Snyder et al.  used a five-layer DNN that takes a small temporal context into account, followed by statistics pooling. To further improve embedding generation, attention mechanisms have also been used in recent studies [24, 26]. Wang et al.  used an attentive x-vector, where a self-attention layer was added before the statistics pooling layer to weight each frame.
However, there is still room for improvement in how the importance of different parts of the input utterance is highlighted. To address this issue, a hierarchical attention mechanism is employed in this paper. This is inspired by Yang’s work  on document classification, which argued that not all parts of a document are equally relevant for answering a query, and therefore applied attention models to both word-level and sentence-level feature vectors via a hierarchical network. In the proposed approach, an utterance can be viewed as a document, and its divided segments and acoustic frames are treated as sentences and words, respectively. An attention mechanism is then applied hierarchically at both the frame level and the segment level. The utterance embedding is constructed by first building representations of segments from frames and then aggregating those into an utterance representation. This hierarchical attention network (HAN) offers a way to obtain a discriminative utterance-level embedding by explicitly weighting target-relevant features.
2 Model Architecture
Figure 1 shows the architecture of the hierarchical attention network. The network consists of several parts: a frame-level encoder and attention layer, a segment-level encoder and attention layer, and two fully connected layers. Given input acoustic frame vectors, the proposed model generates an utterance-level representation, on which a classifier is trained to perform speaker identification. The details of each part are described in the following subsections.
2.1 Frame-Level Encoder and Attention
Assume that an utterance is divided into L segments s_1, …, s_L with a fixed-length window, and that each segment contains T d-dimensional acoustic frame vectors x_t, t = 1, …, T.
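This splitting step can be sketched as follows; a minimal NumPy illustration in which the function name and the non-overlapping split are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def split_into_segments(frames, seg_len):
    """Split a (num_frames, dim) array of acoustic frame vectors into
    equal-length, non-overlapping segments of seg_len frames each,
    dropping any trailing remainder."""
    num_segments = frames.shape[0] // seg_len
    return frames[: num_segments * seg_len].reshape(num_segments, seg_len, -1)

# e.g. 100 frames of 20-dimensional features split into segments of 10 frames
frames = np.random.randn(100, 20)
segments = split_into_segments(frames, 10)
print(segments.shape)  # (10, 10, 20)
```

Each of the resulting (seg_len, dim) slices would then be fed to the frame-level encoder independently.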
In the frame-level encoder, a one-dimensional CNN is used, followed by a bidirectional GRU , in order to capture contextual information from both directions of the acoustic frame sequence.
The output h_t of the frame-level encoder summarizes the information of the segment centred around frame t.
In the frame-level attention layer, a two-layer MLP is first used to convert h_t
into a hidden representation u_t = tanh(W h_t + b), from which a normalised importance weight is computed via a softmax function:

alpha_t = exp(u_t' u_w) / sum_k exp(u_k' u_w)

where W, b and u_w are the parameters of the two-layer MLP. These parameters are shared across all segments. The weighted outputs of the frame-level encoder are computed as o_t = alpha_t h_t.
Following , statistics pooling is applied to the weighted outputs o_1, …, o_T to compute their mean vector (mu) and standard deviation vector (sigma) over t. A segment vector is then obtained by concatenating the two vectors: s = [mu; sigma].
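The frame-level attention and statistics pooling described above can be sketched as follows; this is a minimal NumPy illustration under the assumption (stated in the text) that the MLP scores each frame, the scores are softmax-normalised, and the mean and standard deviation of the weighted outputs are concatenated:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_stats_pooling(h, W, b, u):
    """Frame-level attention followed by statistics pooling.
    h: (T, d) frame-level encoder outputs for one segment.
    W, b, u: parameters of the two-layer scoring MLP, shared across segments.
    Returns the segment vector [mean; std] of the weighted outputs."""
    scores = np.tanh(h @ W + b) @ u      # hidden representation -> one score per frame
    alpha = softmax(scores)              # normalised importance weights
    weighted = alpha[:, None] * h        # weight each frame's encoder output
    mu = weighted.mean(axis=0)           # mean over frames
    sigma = weighted.std(axis=0)         # standard deviation over frames
    return np.concatenate([mu, sigma])   # 2d-dimensional segment vector

rng = np.random.default_rng(0)
h = rng.standard_normal((50, 64))        # 50 frames, 64-dim encoder outputs
W = rng.standard_normal((64, 32))
b = rng.standard_normal(32)
u = rng.standard_normal(32)
seg_vec = attentive_stats_pooling(h, W, b, u)
print(seg_vec.shape)  # (128,)
```

The segment-level attention in the next subsection reuses the same scoring structure with its own parameters, so the doubled dimensionality from concatenating mean and standard deviation carries through to the utterance level.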
2.2 Segment Level Encoder and Attention
For the segment-level encoder and attention, the same steps introduced at the frame level are followed, except that the bidirectional GRU layer is omitted, since omitting it considerably accelerates training when processing a large number of samples.
[Table 1: Dataset | Type | #Speakers | Size (hours) | #Utterances (1s) | #Utterances (3s)]
The weighted output of the segment-level attention layer can then be computed as follows:

beta_i = exp(v_i' u_s) / sum_k exp(v_k' u_s),  with v_i = tanh(W_s s_i + b_s)

where W_s, b_s and u_s are the parameters of the two-layer MLP used at the segment level. A vector is then generated by applying statistics pooling over all weighted segments beta_i s_i.
The final speaker identity classifier is constructed using a two-layer MLP with the pooled utterance vector as its input. As shown in Figure 1, the final utterance embedding is taken from the output of the first fully connected layer.
3.1 Datasets

Three datasets, NIST SRE 2008 Part 1 (SRE08), CallHome American English Speech (CHE), and Switchboard Cellular Part 1 (SWBC), are used in this paper to train the proposed model and to evaluate utterance-embedding performance. SRE08 denotes the 2008 NIST speaker recognition evaluation test set , which contains multilingual telephone speech and English interview speech. In this work, Part 1 of SRE08, containing about 640 hours of speech from 1336 distinct speakers, is selected for our experiments.
SWBC contains 130 hours of telephone speech from 254 speakers (129 male and 125 female) under various environmental conditions (indoors, outdoors and moving vehicles). The stereo speech signals are split into two mono channels, and both are used in the experiments. CHE contains 120 telephone conversations between native English speakers. Among all of the calls, 90 were placed to various locations outside North America. In this dataset, only speech from the left channel is used, as the speaker labels for the right channel are unavailable. In our experiments, SRE08 is used to train the proposed model, and utterance-level embeddings are then generated on CHE and SWBC.
3.2 Experiment Setup
In this work, after removing unvoiced signals using an energy-based VAD , fixed-length sliding windows (one second or three seconds) with a half-window shift are employed to divide the speech streams into short segments. Each segment is viewed as an independent utterance. The total numbers of utterances for the three datasets are listed in Table 1. Each utterance is then split into 10 equal-length fragments without overlap. Each fragment is further segmented into frames using a 25ms sliding window with a 10ms hop. All frames are converted into 20-dimensional MFCC feature vectors. Similar to , to build the hierarchical structure, each utterance, fragment and frame vector obtained here is viewed as a document, sentence and word, respectively.
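The windowing pipeline above can be sketched as follows; a minimal NumPy illustration in which the 16 kHz sample rate is an assumption and the VAD and MFCC extraction steps (which require an external speech library) are omitted:

```python
import numpy as np

def sliding_windows(x, win, hop):
    """All full windows of length win taken with step hop over a 1-D signal."""
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

sr = 16000                                # assumed sample rate
speech = np.random.randn(10 * sr)         # 10 s stand-in for voiced speech after VAD
utterances = sliding_windows(speech, win=sr, hop=sr // 2)  # 1 s windows, half-size shift
fragments = utterances[0].reshape(10, -1)                  # 10 equal-length fragments
frames = sliding_windows(fragments[0], win=int(0.025 * sr), hop=int(0.010 * sr))
print(utterances.shape[0], fragments.shape, frames.shape)  # 19 (10, 1600) (8, 400)
```

Each 400-sample frame would then be converted into a 20-dimensional MFCC vector before being fed to the frame-level encoder.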
To evaluate the utterance-level embeddings, speaker identification and verification are conducted using the embeddings generated on CHE and SWBC. Instead of operating directly on the embeddings, a PLDA back-end  is applied to reduce their dimension to 300.
Both the SWBC and CHE datasets are randomly split into training and test data with a 9:1 ratio for speaker identification. For the speaker verification task, SWBC provides 50 speakers in the enrolment set and 120 speakers in the evaluation set, with 10 utterances per speaker. In CHE, there are 30 speakers in the enrolment set and 60 speakers in the evaluation set, each with 10 utterances.
In order to compare the proposed approach with other speaker embedding systems, two baselines are built using methods developed in previous studies. The first baseline (“X-vectors”) is based on a TDNN architecture , which is widely used for speaker recognition and is effective for speaker-embedding extraction. The second baseline (“X-vectors+Attention”) combines a global attention mechanism with a TDNN [24, 26]. For evaluation, the correct prediction rate (prediction accuracy) is reported for the speaker identification task, and the equal error rate (EER) is reported for the speaker verification task. Moreover, to show the quality of the learned utterance-level embeddings, t-SNE  is used to visualise their distributions after projection into a 2-dimensional space.
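The EER metric used here is the operating point where the false-accept and false-reject rates are equal. A minimal NumPy sketch (the function name and the simple threshold sweep are illustrative; toolkit implementations typically interpolate between ROC points):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER of verification scores: sweep thresholds and return the point
    where the false-accept and false-reject rates cross (up to the
    resolution of the observed scores).
    scores: similarity per trial; labels: 1 = same speaker, 0 = different."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_far, best_frr, best_gap = 1.0, 0.0, float("inf")
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[~labels])   # non-target trials accepted
        frr = np.mean(~accept[labels])   # target trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_far, best_frr, best_gap = far, frr, gap
    return (best_far + best_frr) / 2

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))  # ~0.333
```

In this toy example, at threshold 0.7 one of the three non-target trials is accepted and one of the three target trials is rejected, so FAR = FRR = 1/3.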
4 Results

Table 3 shows the prediction accuracies on the SRE08 test data using the proposed approach and the two baselines. Two utterance lengths, 1 second and 3 seconds, are used in the experiments. H-vectors achieve higher accuracy than both baselines for either input length. With 1-second input utterances, the accuracy of H-vectors reaches 94.5%, a 4.4% improvement over X-vectors and a 2.4% improvement over X-vectors+Attention. With 3-second input utterances, the accuracy reaches 98.5%, about 3% better than X-vectors and about 2% better than X-vectors+Attention. The proposed approach is thus more robust than the two baselines when the processed utterances are short. In addition, the accuracies obtained using 3-second utterances are better than those using 1-second utterances, as longer utterances contain more information relevant to a target speaker than short ones.
[Table 3: Utterance Length | Model | Accuracy %]
[Table 4: Utterance Length | Model | Accuracy % | EER %]
[Table 5: Utterance Length | Model | Accuracy % | EER %]
To evaluate the quality of the embeddings extracted using the proposed approach, two additional datasets are employed in our experiments. Table 4 and Table 5 show the identification accuracy and verification EER when using the embeddings extracted on the SWBC and CHE datasets, respectively. On both datasets, H-vectors consistently outperform the two baselines for both one-second and three-second utterances.
Since the model is trained on SRE08, the identification performance on its test data is clearly better than on the other two datasets. As the SWBC dataset covers a wide range of environmental conditions (indoors, outdoors and moving vehicles), both its identification and verification performances are somewhat worse than those obtained on the CHE dataset.
To further assess the quality of the extracted utterance-level embeddings, t-SNE  is used to visualise their distribution by projecting the high-dimensional vectors into a 2D space. From the SWBC dataset, 10 speakers are selected and 500 one-second segments are randomly sampled for each speaker. Figures 2(a), (b) and (c) show the distributions of the selected samples using X-vectors, X-vectors+Attention and H-vectors, respectively. Each colour represents a distinct speaker, each point represents an utterance, and the black marks denote the centre of each speaker class. In Figure 2(a), showing the embeddings obtained with X-vectors, some samples from different speakers are not well discriminated, as there are overlaps between speaker classes. Owing to the attention mechanism in X-vectors+Attention, Figure 2(b) shows a better sample distribution than Figure 2(a); however, some samples of the speaker labelled in blue are still not well clustered. In Figure 2(c), the embeddings obtained with H-vectors show better separation than both baselines.
5 Conclusion And Future Work
In this paper, a hierarchical attention network was proposed for utterance-level embedding extraction. Inspired by the hierarchical structure of a document built from words and sentences, each utterance is viewed as a document, and its segments and frame vectors are treated as sentences and words, respectively. The use of attention mechanisms at the frame and segment levels provides a way to search for target-relevant information both locally and globally, and thus yields better utterance-level embeddings, reflected in better performance on speaker identification and verification tasks using the extracted embeddings. Moreover, the obtained utterance-level embeddings are more discriminative than X-vectors and X-vectors+Attention.
In future work, different kinds of acoustic features, such as filter-bank and Mel-spectrogram features, will be investigated and tested on larger datasets, such as VoxCeleb1 and VoxCeleb2.
-  Canavan, A., Graff, D., and Zipperlen, G. CallHome American English Speech. https://catalog.ldc.upenn.edu/LDC97S42, 2001.
-  Chuang, F.-K., Wang, S.-S., Hung, J.-w., Tsao, Y., and Fang, S.-H. Speaker-aware deep denoising autoencoder with embedded speaker identity for speech enhancement. Proc. Interspeech 2019 (2019), 3173–3177.
-  Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014 (2014).
-  Graff, D., Walker, K., and Miller, D. Switchboard Cellular Part 1 audio. https://catalog.ldc.upenn.edu/LDC2001S13, 2001.
-  Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing (2010).
-  Gao, T., Du, J., Xu, L., Liu, C., Dai, L.-R., and Lee, C.-H. A unified speaker-dependent speech separation and enhancement system based on deep neural networks. In 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP) (2015), IEEE, pp. 687–691.
-  Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., and McCree, A. Speaker diarization using deep neural network embeddings. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017), IEEE, pp. 4930–4934.
-  NIST Multimodal Information Group. 2008 NIST speaker recognition evaluation training set part 1. https://catalog.ldc.upenn.edu/LDC2011S05, 2011.
-  Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (2015), pp. 448–456.
-  Kingma, D. P., and Ba, J. L. Adam: A method for stochastic optimization.
-  Le, N., and Odobez, J.-M. Robust and discriminative speaker embedding via intra-class distance variance regularization. In Interspeech (2018), pp. 2257–2261.
-  Maaten, L. v. d., and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
-  McLaren, M., Castan, D., Nandwana, M. K., Ferrer, L., and Yilmaz, E. How to train your speaker embeddings extractor.
-  Novoselov, S., Shulipa, A., Kremnev, I., Kozlov, A., and Shchemelinin, V. On deep speaker embeddings for text-independent speaker recognition. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop (2018), pp. 378–385.
-  Pan, Y., Mirheidari, B., Reuber, M., Venneri, A., Blackburn, D., and Christensen, H. Automatic hierarchical attention neural network for detecting ad. Proc. Interspeech 2019 (2019), 4105–4109.
-  Pang, J. Spectrum energy based voice activity detection. In 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC) (2017), IEEE, pp. 1–5.
-  Park, H., Cho, S., Park, K., Kim, N., and Park, J. Training utterance-level embedding networks for speaker identification and verification. In Interspeech (2018), pp. 3563–3567.
-  Salmun, I., Opher, I., and Lapidot, I. On the use of plda i-vector scoring for clustering short segments. In Odyssey (2016), pp. 407–414.
-  Snyder, D., Garcia-Romero, D., Povey, D., and Khudanpur, S. Deep neural network embeddings for text-independent speaker verification. In Interspeech (2017), pp. 999–1003.
-  Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In ICASSP (2018), IEEE.
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research (2014).
-  Variani, E., Lei, X., McDermott, E., Moreno, I. L., and Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In ICASSP (2014), IEEE.
-  Wang, Q., Downey, C., Wan, L., Mansfield, P. A., and Moreno, I. L. Speaker diarization with lstm. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), IEEE, pp. 5239–5243.
-  Wang, Q., Okabe, K., Lee, K. A., Yamamoto, H., and Koshinaka, T. Attention mechanism in speaker recognition: What does it learn in deep speaker embedding? In 2018 IEEE Spoken Language Technology Workshop (SLT) (2018), IEEE.
-  Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies (2016).
-  Zhu, Y., Ko, T., Snyder, D., Mak, B., and Povey, D. Self-attentive speaker embeddings for text-independent speaker verification. In Interspeech (2018).