1 Introduction
Speaker diarisation finds “who spoke when” in a multi-speaker audio stream. This involves segmenting the audio into speaker-homogeneous intervals, followed by clustering those intervals into groups that should each correspond to a single speaker. A recent trend is for clustering systems to first convert variable-length audio segments into fixed-length vectors, referred to as embeddings, and to perform clustering over such vectors. More broadly, the use of embeddings has become widespread in many speech and language processing tasks.
Traditionally, i-vectors, produced using factor analysis in the total variability space, have been used as the embeddings for speech segments [2, 3, 19, 20]. Recently, d-vectors based on the outputs of an intermediate layer of a deep neural network (DNN) trained on a speaker classification task have shown superior performance over i-vectors in a range of tasks [4, 5, 6, 7, 8, 10, 21, 22]. Recent d-vector extraction systems have investigated different DNN architectures, such as deep feed-forward models [4, 6, 24, 26], versions of recurrent neural networks (RNNs) [7, 25], and convolutional recurrent neural networks [8], and each architecture tends to produce embeddings with different strengths and weaknesses. It is therefore natural to exploit the complementarity among different embeddings by combining them, with the aim of improved performance and robustness. For instance, a method that concatenates a d-vector and an i-vector was studied in [16].
In this paper, a generic framework that combines different embedding vectors via an attention mechanism is proposed. Extending the idea of d-vector extraction using a single temporal attention model [22, 24], the proposed attention mechanism combines outputs both across time and across embeddings from different systems, and therefore has a 2D attentive structure. In particular, instead of dynamically estimating a single set of weights, a “multi-head” self-attentive layer [9, 11] is used for combination throughout the paper. The annotation vectors found in the self-attentive layer are computed from the embedding vectors themselves and define a linear combination of those embeddings; multiple annotation vectors are used to extract a more diverse combination result. The training objective function for the self-attentive layer includes a penalty term that encourages the annotation vectors to be diverse. This paper modifies the penalty term so that the multiple annotation vectors produce not only distinct spiky distributions for local attention focus, but also smooth distributions that reveal overall trends.
Two types of DNN models are studied as example d-vector systems involved in the combination, namely a feed-forward time delay neural network (TDNN) system [23] and a high order recurrent neural network (HORNN) system [12]. Two alternative structures are proposed to implement the 2D attention mechanism: first, the simultaneous attention approach, where the annotation matrix produced by a single attention model is learned across all vectors extracted at each time point by each system; and second, the consecutive attention approach, where a separate attention across time is performed inside each system ahead of the final attention across systems. Speaker embeddings generated using this 2D attentive mechanism are named c-vectors. Experimental results on speaker clustering show that the modified penalty term improves d-vector extraction by a clear margin, and that c-vectors with both 2D attentive structures outperform the individual d-vector systems, with the consecutive structure giving the lowest error rate.
2 Self-attentive Structure
An important step of d-vector generation is the combination of the embedding vectors extracted at each time step in a window of several seconds. As these embedding vectors may have different levels of speaker-discriminative ability, they should be combined with different weights. Therefore, a self-attentive layer is introduced to achieve a dynamic linear combination, where an annotation matrix $\mathbf{A}$, computed from the input vectors, provides the combination weights. Each column of the annotation matrix is an annotation vector, which gives a set of scaling factors that weight the importance of each input before the inputs are summed. Specifically, if the $T$ input vectors within a window form a matrix $\mathbf{H} \in \mathbb{R}^{T \times n}$, where $n$ is the dimension of each vector, the annotation matrix can be calculated using Eq. (1) and applied to the inputs as in Eq. (2):
$$\mathbf{A} = \mathrm{Softmax}\left(\tanh(\mathbf{H}\mathbf{W}_1)\,\mathbf{W}_2\right) \tag{1}$$
$$\mathbf{E} = \mathbf{A}^{\mathsf{T}}\mathbf{H} \tag{2}$$
where $\mathbf{E}$ ($h \times n$) is the output and $\mathbf{A}$ ($T \times h$) is the $h$-head annotation matrix. $\mathbf{A}$ is generated by passing the input matrix $\mathbf{H}$ through two fully-connected layers with weight matrices $\mathbf{W}_1$ and $\mathbf{W}_2$ respectively, and the Softmax is performed column-wise to ensure that each annotation vector sums to one. When a multi-head self-attentive layer is used (i.e. $h > 1$), to encourage different heads to extract dissimilar information, the penalty term in Eq. (3) is added to the cross-entropy loss function during training.
$$P = \mu \left\| \mathbf{A}^{\mathsf{T}}\mathbf{A} - \mathbf{I} \right\|_F^2 \tag{3}$$
where $\mathbf{a}_i$, the $i$-th column of $\mathbf{A}$, is the $i$-th annotation vector, $\mathbf{I}$ is the identity matrix and $\|\cdot\|_F$ denotes the Frobenius norm. The degree of influence of this term is adjusted by $\mu$. As all terms in Eq. (3) are non-negative, minimising the cross terms, $\mathbf{a}_i^{\mathsf{T}}\mathbf{a}_j$ ($i \neq j$), encourages the annotation vectors to be orthogonal, while minimising the diagonal terms, $(\mathbf{a}_i^{\mathsf{T}}\mathbf{a}_i - 1)^2$, encourages the annotation vectors to have fewer non-zero terms, ideally being one-hot vectors. This is because the elements of each annotation vector are non-negative and sum to one, so that $\mathbf{a}_i^{\mathsf{T}}\mathbf{a}_i \leq 1$ with equality only for a one-hot vector. Therefore, under the effect of the penalty term, the annotation vectors will put large weights on different but very few inputs. However, whether each annotation vector is spiky or smooth can also be controlled; see Sec. 3.3. In the rest of the paper, the output from the self-attentive structure in a single system is denoted by $\mathbf{E}$, as in [11], while that incorporating model combination is denoted by $\mathbf{C}$, for c-vectors. For clarity, the output of an $h$-head self-attentive layer is expressed as:
$$\mathbf{E} = \mathrm{SelfAtt}_h(\mathbf{H}) \tag{4}$$
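The layer and penalty described by Eqs. (1) to (3) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' HTK implementation: the shapes follow the text (inputs of size T x n, h heads), while the hidden width `d_a`, the random weights and the function names are hypothetical choices made for the example.

```python
import numpy as np

def softmax_columns(z):
    """Softmax over axis 0, so that every column (annotation vector) sums to one."""
    e = np.exp(z - z.max(axis=0, keepdims=True))  # subtract the column max for stability
    return e / e.sum(axis=0, keepdims=True)

def self_attentive_layer(H, W1, W2):
    """Return (A, E) for inputs H of shape (T, n), in the spirit of Eqs. (1) and (2)."""
    A = softmax_columns(np.tanh(H @ W1) @ W2)  # (T, h) annotation matrix
    E = A.T @ H                                # (h, n): one weighted sum per head
    return A, E

def penalty(A, mu=1.0):
    """Penalty of Eq. (3): mu * ||A^T A - I||_F^2."""
    h = A.shape[1]
    return mu * np.linalg.norm(A.T @ A - np.eye(h), ord="fro") ** 2

rng = np.random.default_rng(0)
T, n, d_a, h = 200, 128, 64, 5             # d_a (hidden width) is an assumed value
H = rng.standard_normal((T, n))            # stand-in for frame-level embeddings
W1 = rng.standard_normal((n, d_a)) * 0.1   # random stand-ins for trained weights
W2 = rng.standard_normal((d_a, h)) * 0.1
A, E = self_attentive_layer(H, W1, W2)

# Distinct one-hot annotation vectors give A^T A = I and hence zero penalty.
A_one_hot = np.eye(T)[:, :h]
print(penalty(A), penalty(A_one_hot))
```

With untrained weights the annotation vectors are close to uniform, so the penalty is large; it vanishes only when the heads are distinct one-hot vectors, which is the behaviour that Sec. 3.3 relaxes.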
3 2D Self-attentive Topologies
In this section, the two kinds of 2D self-attentive structure are introduced, and the modification to the penalty term is explained.
3.1 Simultaneous Combination Architecture
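One natural way of performing the 2D attention is to combine all the embeddings extracted from each frame by each network simultaneously using one annotation matrix. This is achieved by extending the row dimension of the matrix $\mathbf{H}$ in Eq. (1) from $T$ to $KT$, where $K$ is the number of systems to be combined, as shown in Eq. (5).
$$\mathbf{C} = \mathrm{SelfAtt}_h\left(\left[\mathbf{H}_1^{\mathsf{T}}, \ldots, \mathbf{H}_K^{\mathsf{T}}\right]^{\mathsf{T}}\right) \tag{5}$$
where $\mathbf{H}_k$ ($T \times n$) contains the frame-level outputs of the $k$-th system and $\mathrm{SelfAtt}_h(\cdot)$ denotes the $h$-head self-attentive layer of Eqs. (1) and (2). The c-vector of this combination is the concatenation of the embeddings, each generated using one annotation vector, as shown in Fig. 1. This structure is able not only to reflect the importance of each frame in terms of its ability to distinguish speakers, but also to weight the two network outputs frame by frame.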
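As a rough illustration of the simultaneous structure, the sketch below stacks the frame-level outputs of K systems along the row (time) axis and applies a single multi-head attention to the stacked matrix. It assumes that all systems produce n-dimensional outputs for the same T frames; the weights are random stand-ins and the names `simultaneous_combination` and `d_a` are hypothetical.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax: each annotation vector sums to one."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def simultaneous_combination(H_list, W1, W2):
    """One annotation matrix over all frames of all systems."""
    H = np.concatenate(H_list, axis=0)         # (K*T, n): row dimension extended to K*T
    A = softmax_columns(np.tanh(H @ W1) @ W2)  # (K*T, h)
    E = A.T @ H                                # (h, n): one embedding per head
    return A, E.reshape(-1)                    # concatenate the h embeddings

rng = np.random.default_rng(0)
T, n, d_a, h, K = 200, 128, 64, 5, 2           # illustrative sizes; d_a is assumed
H_list = [rng.standard_normal((T, n)) for _ in range(K)]
W1 = rng.standard_normal((n, d_a)) * 0.1
W2 = rng.standard_normal((d_a, h)) * 0.1

A, c_vector = simultaneous_combination(H_list, W1, W2)

# Total weight that each head assigns to each system (rows: systems, columns: heads).
system_share = A.reshape(K, T, h).sum(axis=1)
print(c_vector.shape, system_share.round(3))
```

Because the softmax runs over all K*T rows at once, each head distributes a single unit of weight over both frames and systems, which is what allows the network outputs to be weighted frame by frame.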
3.2 Consecutive Combination Architecture
The proposed alternative is consecutive combination, where self-attentive combination is first performed across time for each system with separate annotation matrices, and another self-attentive layer is then applied across all the systems, as illustrated in Fig. 2. As the self-attentive layers for each system can be designed differently, this combination retains more of the individuality of each system while introducing more flexibility.
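In particular, it is interesting to investigate the following two types of second-stage combination. First, model combination could be performed on the multi-head output, where all heads in the d-vector of a system share the same annotation vector, as shown in Eq. (6).
$$\mathbf{C} = \mathrm{SelfAtt}_1\left(\left[\mathrm{vec}(\mathbf{E}_1), \ldots, \mathrm{vec}(\mathbf{E}_K)\right]^{\mathsf{T}}\right) \tag{6}$$
where $\mathbf{E}_k$ is the multi-head output generated from the $k$-th individual system, $\mathrm{vec}(\cdot)$ concatenates its heads into a single vector, and $K$ is the number of systems. Secondly, it could also be performed at the head level, where different heads from the same system can be assigned different weights, as shown in Eq. (7). This relaxes the constraint of the previous combination method that the number of heads has to be equal for each system.
$$\mathbf{C} = \mathrm{SelfAtt}_h\left(\left[\mathbf{E}_1^{\mathsf{T}}, \ldots, \mathbf{E}_K^{\mathsf{T}}\right]^{\mathsf{T}}\right) \tag{7}$$
As the aspects of speaker characteristic information encapsulated in the d-vectors for different systems may be ordered differently, a direct weighted average of the output vectors may be inappropriate. Therefore, a fully-connected (FC) layer that transforms the output of each system is introduced before the model combination, as shown in Eq. (8).
$$\mathbf{C} = \mathrm{SelfAtt}_h\left(\left[\mathrm{FC}_1(\mathbf{E}_1)^{\mathsf{T}}, \ldots, \mathrm{FC}_K(\mathbf{E}_K)^{\mathsf{T}}\right]^{\mathsf{T}}\right) \tag{8}$$
Further extending the method above, instead of using self-attention for model combination, the embeddings from different systems are concatenated and passed through an FC layer that performs the transformation and the combination together, as shown in Eq. (9).
$$\mathbf{C} = \mathrm{FC}\left(\left[\mathrm{vec}(\mathbf{E}_1)^{\mathsf{T}}, \ldots, \mathrm{vec}(\mathbf{E}_K)^{\mathsf{T}}\right]^{\mathsf{T}}\right) \tag{9}$$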
A similar approach was proposed in [16] for speaker verification tasks, where an i-vector is fused with an RNN output by direct concatenation. In contrast, the proposed c-vector method allows joint training of the complete model and, by including the FC layer, the order of the elements in e.g. an i-vector can also be altered.
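The two second-stage options can be sketched as follows. This is one possible reading of the description above rather than a reproduction of the authors' equations: in the first type, each system's multi-head output is flattened into one row so that a single annotation vector assigns one weight per system; in the second type, every head is a separate row and receives its own weight. The sizes, the random weights and the helper names are assumptions made for the example.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax: each annotation vector sums to one."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_att(X, W1, W2):
    """Multi-head self-attention over the rows of X; returns (A, A^T X)."""
    A = softmax_columns(np.tanh(X @ W1) @ W2)
    return A, A.T @ X

rng = np.random.default_rng(0)
T, n, d_a, h, K = 200, 128, 64, 5, 2   # illustrative sizes; d_a is assumed

def rand(*shape):
    """Random stand-in for a trained weight matrix."""
    return rng.standard_normal(shape) * 0.1

# Stage 1: a separate attention across time inside each system.
H = [rng.standard_normal((T, n)) for _ in range(K)]
E = [self_att(H_k, rand(n, d_a), rand(d_a, h))[1] for H_k in H]   # each (h, n)

# Stage 2, first type: one weight per system, shared by all heads of that system.
X1 = np.stack([E_k.reshape(-1) for E_k in E])           # (K, h*n)
A1, C1 = self_att(X1, rand(h * n, d_a), rand(d_a, 1))   # A1: (K, 1), C1: (1, h*n)

# Stage 2, second type: every head of every system is weighted individually.
X2 = np.concatenate(E, axis=0)                          # (K*h, n)
A2, C2 = self_att(X2, rand(n, d_a), rand(d_a, h))       # A2: (K*h, h), C2: (h, n)

print(A1.ravel(), C1.shape, C2.shape)
```

In the second type the rows of `X2` need not come in equal numbers from each system, which reflects the relaxed constraint on the number of heads mentioned above.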
3.3 Penalty Term Modification
The penalty term for multi-head self-attention in Eq. (3) was originally designed for sentence embeddings [9], where it makes each annotation vector focus on as few words as possible while encouraging different annotation vectors to be estimated. Such a setting is not necessarily transferable to our task, since the minimum value of the penalty term can be reached only when all the annotation vectors become different one-hot vectors. Therefore, the following modified penalty term is proposed:
$$P = \mu \left\| \mathbf{A}^{\mathsf{T}}\mathbf{A} - \mathbf{\Lambda} \right\|_F^2 \tag{10}$$
where $\mathbf{\Lambda}$ is a diagonal matrix replacing $\mathbf{I}$ in Eq. (3). The diagonal values $\lambda_i$ control the smoothness of the annotation vectors. We term annotation vectors that focus on only a few input vectors “spiky”, while “smooth” annotation vectors reflect the general trends of importance. For a single annotation vector $\mathbf{a}_i$, the penalty term becomes $\mu(\mathbf{a}_i^{\mathsf{T}}\mathbf{a}_i - \lambda_i)^2$, a quadratic function of $\lambda_i$. For annotation vectors taking the one-hot form and more evenly distributed forms, the variation of the penalty term $P$ against $\lambda_i$ is plotted in Fig. 3.
As the dashed line moves toward the left, which represents decreasing $\lambda_i$, the lowest point moves from the curve of the one-hot vector to the curves of progressively more even distributions. Hence the annotation vector that gives the minimum value of the penalty term shifts away from the one-hot vector and eventually reaches the evenly distributed vector $[1/T, \ldots, 1/T]^{\mathsf{T}}$. Therefore, by varying the value of $\lambda_i$ between $1/T$, which is the squared norm of the uniform vector, and 1, which is the squared norm of a one-hot vector, the smoothness of the weights can be controlled. The multi-head system can use this modified penalty term to give a couple of smooth annotation vectors while keeping the rest spiky as before, and these settings can also vary between systems according to their characteristics.
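The shift of the minimiser can be checked numerically for a single annotation vector. The toy example below assumes that the single-vector penalty takes the form mu * (a^T a - lambda)^2 and uses three hand-picked candidate vectors with T = 4; the candidates and the function names are illustrative assumptions, not values from the paper.

```python
def penalty(a, lam, mu=1.0):
    """Single-vector penalty mu * (a^T a - lam)^2 (assumed form)."""
    sq_norm = sum(x * x for x in a)
    return mu * (sq_norm - lam) ** 2

T = 4
candidates = {
    "one-hot": [1.0, 0.0, 0.0, 0.0],   # a^T a = 1
    "two-way": [0.5, 0.5, 0.0, 0.0],   # a^T a = 1/2
    "uniform": [1.0 / T] * T,          # a^T a = 1/T
}

def best_candidate(lam):
    """Name of the candidate annotation vector with the smallest penalty."""
    return min(candidates, key=lambda name: penalty(candidates[name], lam))

for lam in (1.0, 0.5, 1.0 / T):
    print(f"lambda = {lam:.2f}: minimum at the {best_candidate(lam)} vector")
```

Decreasing lambda from 1 towards 1/T moves the preferred vector from the one-hot form to the uniform form, which is the mechanism used to obtain smooth heads alongside spiky ones.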
4 Experimental Setup
4.1 Data Preparation
All of the models were implemented using an extended version of HTK [27], and were trained and tested on the AMI corpus [18], which contains group meetings recorded at four different sites. The full training set, which contains 135 meetings with 149 speakers, was used; it was further split into 90% for model training and a 10% cross-validation set for hyperparameter tuning. For evaluation, instead of using the full dev and eval sets, we use the meetings recorded at IDIAP, Edinburgh and Brno, which are the sets frequently used for the evaluation of speaker diarisation [8, 21], and which are more consistent with our observations on other datasets. The partition of the dataset is shown in Table 1.
Table 1: Partition of the AMI data used in the experiments.
Set    Meetings  Speakers
Train  135       149
Dev    14        17 (4 seen in Train)
Eval   12        12 (0 seen in Train)
During both training and testing, the system input is 40-d log-mel filter bank features (25 ms frame size, 10 ms frame increment) extracted from Multiple Distant Microphone (MDM) data after beamforming using BeamformIt [28].
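Table 2: Number of parameters and SER on the dev and eval sets for the DiarTK baseline, the individual d-vector systems and the c-vector combinations.
System               #Params.  Dev     Eval
DiarTK               N/A       23.62%  23.31%
d-vector TDNN        1.76M     13.40%  14.75%
d-vector HORNN       0.29M     13.40%  15.97%
c-vector Simult.     2.03M     12.73%  16.28%
c-vector Consec. 1   2.46M     13.18%  13.53%
c-vector Consec. 2   2.07M     12.22%  12.99%
c-vector Consec. FC  2.87M     12.75%  15.00%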
4.2 Model Specification
The two DNN systems used as examples for combination in this paper are the TDNN and the HORNN. The TDNN structure resembles the one used in the x-vector extraction system [10], except that the statistical pooling layer is replaced with a self-attentive layer as described in [11]. The HORNN here uses ReLU activation functions, and adds connections from both the previous hidden state and the hidden state from 4 time steps earlier to the current RNN input. This provides more direct access to the long-term memory, which alleviates the vanishing gradient problem with far fewer parameters than the LSTM structure.
To extract window-level d-vectors, a 2-second sliding window was applied with a 1-second overlap between adjacent windows. A two-layer HORNN was used with a state output dimension of 256 and a projection dimension of 128. For the TDNN, the original 512-dimensional system in [11] was used. In order to have similarly performing systems, the HORNN uses fewer parameters than the TDNN, as the former uses its parameters more efficiently. Both network outputs are then reduced to 128-d vectors, using the fifth layer of the TDNN and an additional fully-connected layer for the HORNN, before being fed into the self-attentive layer. The simultaneous combination learns a set of 5 annotation vectors. The consecutive combination has 5 heads for each system, and the second combination stage uses a single head and 5 heads for the first and second types of attentive combination respectively.
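The windowing arithmetic can be made concrete with a small helper. With a 10 ms frame increment, a 2-second window covers 200 frames and a 1-second overlap corresponds to a shift of 100 frames. The handling of segments shorter than one window, and of trailing frames that do not fill a window, is not specified in the text; the choices below (one short window, trailing frames dropped) are assumptions, as is the function name.

```python
def sliding_windows(num_frames, win_sec=2.0, overlap_sec=1.0, frame_shift_sec=0.01):
    """Return (start, end) frame indices of the windows over one segment.

    Assumptions (not stated in the paper): a segment shorter than one window
    yields a single short window, and trailing frames that do not fill a
    complete window are dropped.
    """
    win = round(win_sec / frame_shift_sec)                    # 200 frames
    shift = round((win_sec - overlap_sec) / frame_shift_sec)  # 100 frames
    if num_frames <= win:
        return [(0, num_frames)]
    return [(s, s + win) for s in range(0, num_frames - win + 1, shift)]

print(sliding_windows(450))   # a 4.5-second segment
print(sliding_windows(150))   # a 1.5-second segment
```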
After the combination stages, a bottleneck layer maps the multi-head c-vector output down to a 128-dimensional representation space, and the output of this layer is used as the c-vector for clustering. The two individual networks are initialised with frame-level pre-training, and are then jointly trained in the combination networks. Furthermore, instead of using the normal softmax at the output layer, we adopt the “A-softmax” function [15] in the
case to provide better discrimination in the angular aspect. This further helps the clustering process as it constructs the affinity matrix using cosine distances.
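To show why angular discrimination matters for clustering, the sketch below builds an affinity matrix from pairwise cosine similarities of window-level embeddings. It covers only this first step; the refinement and thresholding operations of the spectral clustering method in [7] are not reproduced, and the function name and toy vectors are hypothetical.

```python
import numpy as np

def cosine_affinity(X, eps=1e-12):
    """Pairwise cosine similarity between the rows of X (N embeddings of size d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)   # length-normalise, guarding against zero vectors
    return Xn @ Xn.T                  # (N, N) symmetric affinity matrix

# Three toy embeddings: the first two point in the same direction.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
S = cosine_affinity(X)
print(S)
```

Only the direction of each embedding affects the affinity, so a training criterion that separates speakers by angle is well matched to this measure.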
4.3 Diarisation Pipeline
As the main focus of the paper is on the use of new speaker embeddings in clustering, the experiments reported here, similar to e.g. [21, 29, 30], use the AMI manual segments and report only the speaker error rate (SER); there is no missed or false alarm speech. As in training, a 2-second sliding window with a 1-second overlap is applied to the segments, and c-vectors are extracted by forward propagation to the bottleneck layer. These window-level embeddings are then clustered using the spectral clustering method proposed in [7]. The threshold value used in the affinity matrix pre-processing stage is tuned for each system separately on the dev set, and then applied to the eval set. Scoring uses the setup from the NIST RT evaluations with a 0.25-second collar.
4.4 Baseline Systems
The first baseline used DiarTK [14] to perform bottom-up agglomerative clustering based on the information bottleneck principle [13, 17]. It used 19-d MFCCs, and the maximum window length in DiarTK was set to 2 seconds. The values of β and the NMI threshold were set to 10 and 0.3 respectively. The second baseline uses the statistical pooling layer [10] in the two example DNNs, which calculates the mean and standard deviation across the frames, instead of the self-attentive layer.
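For comparison with the self-attentive layer, the pooling used in the second baseline can be sketched as below. Every frame receives the same weight, and the window is summarised by the per-dimension mean and standard deviation. Concatenating the two statistics and using the population standard deviation are assumptions made for this example; the function name is hypothetical.

```python
import numpy as np

def statistics_pooling(H):
    """Map frame-level vectors H (T, n) to one vector of size 2n: [mean, std]."""
    mean = H.mean(axis=0)
    std = H.std(axis=0)   # population standard deviation (ddof=0), an assumed choice
    return np.concatenate([mean, std])

H = np.array([[1.0, 2.0],
              [3.0, 6.0]])            # two frames, two dimensions
pooled = statistics_pooling(H)
print(pooled)
```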
5 Results
The reductions in SER obtained for TDNN and HORNN by using the modified penalty term are shown in Table 3.
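Table 3: SER of the individual d-vector systems using mean and standard deviation pooling, and using self-attention trained with the original and the modified penalty terms.
System  Dataset  Mean + std. deviation  Attention (original)  Attention (modified)
HORNN   Dev      21.00%                 16.72%                13.40%
HORNN   Eval     23.70%                 20.55%                15.97%
TDNN    Dev      17.46%                 15.02%                13.40%
TDNN    Eval     19.22%                 14.95%                14.75%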
Compared to the d-vector using mean and standard deviation, there were reductions in SER using the self-attentive layer with the original penalty term. The modified penalty term gives a further relative reduction in SER of 6% for the TDNN and 21% for the HORNN system. The effect of changing the diagonal values in the penalty term can be seen in Fig. 4, where each curve represents one annotation vector across the 200 frames of a selected window and corresponds to a specific diagonal value. One of the curves provides two spikes, at the 150th and 190th frames respectively, while another reflects the general trend of which regions of frames are more important.
The results of using different 2D attentive combinations are shown in Table 2. (The models were also tested on the full dev and eval sets, where improvements were also found using the 2D attentive combinations and the FC layer combination.) Even though the HORNN system has far fewer parameters than the TDNN, the two provide similar performance on the dev set, where optimised system-specific threshold values were used. Table 2 shows that all combinations achieve improvements on the dev set, where the clustering threshold is optimised. The performance on the eval set is rather more variable, but both types of consecutive combination show their superiority over the individual d-vector systems. In particular, the second method of consecutive model combination achieves consistent relative reductions in SER of 9% and 12% on the dev and eval sets respectively, and a 10% overall relative reduction, which is the best performance. This represents a 46% reduction in SER over the DiarTK baseline.
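The relative reductions quoted above can be checked directly from the SER values in Table 2. The short script below recomputes them per data set; how the dev and eval sets were pooled to obtain the overall figures is not stated in the text, so only per-set values are computed here.

```python
def relative_reduction(baseline, system):
    """Relative reduction in percent when going from baseline to system."""
    return 100.0 * (baseline - system) / baseline

# SER values in percent, copied from Table 2.
ser = {
    "DiarTK": {"dev": 23.62, "eval": 23.31},
    "d-vector TDNN": {"dev": 13.40, "eval": 14.75},
    "c-vector Consec. 2": {"dev": 12.22, "eval": 12.99},
}

for subset in ("dev", "eval"):
    best = ser["c-vector Consec. 2"][subset]
    vs_tdnn = relative_reduction(ser["d-vector TDNN"][subset], best)
    vs_diartk = relative_reduction(ser["DiarTK"][subset], best)
    print(f"{subset}: {vs_tdnn:.1f}% vs TDNN, {vs_diartk:.1f}% vs DiarTK")
```

The per-set reductions relative to the TDNN d-vector are about 8.8% and 11.9%, matching the rounded 9% and 12%, and the per-set reductions relative to DiarTK (about 48.3% and 44.3%) bracket the reported 46%.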
6 Conclusions
In this paper, a novel embedding extraction approach for diarisation using a 2D self-attentive structure has been proposed. Both simultaneous combination and consecutive combination approaches were analysed. A modified penalty term was also introduced, which provided more diversity to the multi-head weight vectors in the self-attentive layer. Taking the TDNN and HORNN as an example of two complementary systems, the proposed models were evaluated in experiments using the AMI corpus. Experimental results showed a relative reduction in diarisation speaker error rate of 21% for a HORNN model and 6% for a TDNN model by including the modification to the penalty term. In addition, a further relative reduction in SER of 10% was obtained using the 2D consecutive combination method.
7 References
[2] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, & P. Ouellet, “Front-end factor analysis for speaker verification”, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, pp. 788–798, 2011.
[3] G. Sell & D. Garcia-Romero, “Diarization resegmentation in the factor analysis subspace”, Proc. ICASSP, Brisbane, 2015.
[4] E. Variani, X. Lei, E. McDermott, I.L. Moreno, & J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification”, Proc. ICASSP, Florence, 2014.
[5] G. Heigold, I. Moreno, S. Bengio, & N. Shazeer, “End-to-end text-dependent speaker verification”, Proc. ICASSP, Shanghai, 2016.
[6] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, & A. McCree, “Speaker diarization using deep neural network embeddings”, Proc. ICASSP, New Orleans, 2017.
[7] Q. Wang, C. Downey, L. Wan, P.A. Mansfield, & I. Lopez Moreno, “Speaker diarization with LSTM”, Proc. ICASSP, Calgary, 2018.
[8] P. Cyrta, T. Trzcinski, & W. Stokowiec, “Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings”, arXiv.org, 1708.02840, 2017.
[9] Z. Lin, M. Feng, C.N. dos Santos, M. Yu, B. Xiang, B. Zhou, & Y. Bengio, “A structured self-attentive sentence embedding”, Proc. ICLR, Toulon, 2017.
[10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, & S. Khudanpur, “X-vectors: robust DNN embeddings for speaker recognition”, Proc. ICASSP, Calgary, 2018.
[11] Y. Zhu, T. Ko, D. Snyder, B. Mak, & D. Povey, “Self-attentive speaker embeddings for text-independent speaker verification”, Proc. Interspeech, Hyderabad, 2018.
[12] C. Zhang & P.C. Woodland, “High order recurrent neural networks for acoustic modeling”, Proc. ICASSP, Calgary, 2018.
[13] D. Vijayasenan, F. Valente, & H. Bourlard, “An information theoretic approach to speaker diarization of meeting data”, IEEE Transactions on Audio, Speech and Language Processing, vol. 17, pp. 1382–1393, 2009.
[14] D. Vijayasenan & F. Valente, “DiarTk: An open source toolkit for research in multistream speaker diarization and its application to meetings recordings”, Proc. Interspeech, Portland, 2012.
[15] Z. Huang, S. Wang, & K. Yu, “Angular softmax for short-duration text-independent speaker verification”, Proc. Interspeech, Hyderabad, 2018.
[16] G. Bhattacharya, J. Alam, V. Gupta, & P. Kenny, “Deeply fused speaker embeddings for text-independent speaker verification”, Proc. Interspeech, Hyderabad, 2018.
[17] N. Dawalatabad, S. Madikeri, C.C. Sekhar, & H.A. Murthy, “Two-pass IB based speaker diarization system using meeting-specific ANN based features”, Proc. Interspeech, San Francisco, 2016.
[18] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, & P. Wellner, “The AMI meeting corpus: A pre-announcement”, Proc. MLMI, Bethesda, 2006.
[19] S.H. Shum, N. Dehak, R. Dehak, & J.R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach”, IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 2015–2028, 2013.
[20] G. Sell & D. Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration”, Proc. SLT, California and Nevada, 2014.
[21] S.H. Yella & A. Stolcke, “A comparison of neural network feature transforms for speaker diarization”, Proc. Interspeech, Dresden, 2015.
[22] F.A.R.R. Chowdhury, Q. Wang, I.L. Moreno, & L. Wan, “Attention-based models for text-dependent speaker verification”, Proc. ICASSP, Calgary, 2018.
[23] V. Peddinti, D. Povey, & S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts”, Proc. Interspeech, Dresden, 2015.
[24] S. Zhang, Z. Chen, Y. Zhao, J. Li, & Y. Gong, “End-to-End attention based text-dependent speaker verification”, Proc. SLT, San Diego, 2016.
[25] C. Zhang, C. Yu, & J.H.L. Hansen, “An investigation of deep-learning frameworks for speaker verification anti-spoofing”, IEEE Journal of Selected Topics in Signal Processing, vol. 11, pp. 684–694, 2017.
[26] L. Li, Y. Chen, Y. Shi, Z. Tang, & D. Wang, “Deep speaker feature learning for text-independent speaker verification”, Proc. Interspeech, Stockholm, 2017.
[27] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, A. Ragni, V. Valtchev, P. Woodland, & C. Zhang, The HTK Book (for HTK version 3.5), Cambridge University Engineering Department, 2015.
[28] X. Anguera, C. Wooters, & J. Hernando, “Acoustic beamforming for speaker diarization of meetings”, IEEE Transactions on Audio, Speech and Language Processing, vol. 6, pp. 2011–2022, 2007.
[29] S.H. Yella, A. Stolcke, & M. Slaney, “Artificial neural network features for speaker diarization”, Proc. ICASSP, Florence, 2014.
[30] S.H. Yella & A. Stolcke, “A comparison of neural network feature transforms for speaker diarization”, Proc. Interspeech, Dresden, 2015.