1 Introduction
The speaker diarization task consists of establishing “who spoke when” in a given audio recording [25, 1]. Despite having been investigated for decades, diarization is still an unsolved problem in many real scenarios, as highlighted by the recent DIHARD I and DIHARD II challenges [22].
Typically, speaker diarization is addressed by integrating several different components: voice activity detection, speaker change detection, feature extraction and clustering. Most research works in the literature focus on extracting highly discriminative feature vectors. The first example in this direction is i-vectors [15, 6], which represent a given utterance with a single fixed-dimensional feature vector. The recent rise of neural paradigms has led to the introduction of a variety of approaches to extract so-called speaker embeddings. These are typically derived from the outputs of the inner layers of a neural network trained on a speaker classification task [5]. The most popular embeddings are d-vectors [26] and x-vectors [11, 24]. Conversely, not much progress has been made with regard to clustering. In most approaches, this stage is still based on Agglomerative Hierarchical Clustering (AHC) [14] in combination with Probabilistic Linear Discriminant Analysis (PLDA) scoring [23]. Recently, spectral clustering [20, 27] and variational Bayesian clustering [7, 8] have been introduced, showing promising results. Alternatives to PLDA scoring have also been proposed, using neural networks that learn how to score two speech segments [4], siamese networks [29] or BiLSTMs [16]. Nevertheless, clustering remains unsupervised and heavily dependent on fine-tuned hyper-parameters (e.g. thresholds to stop clustering).
Recently, efforts have been made to formulate clustering in a supervised learning framework [2, 10, 28]. Supervised clustering is attractive because it can be optimized directly on the diarization metrics, or used to learn context-dependent parameters. Additionally, supervision makes it possible to improve performance by learning from the increasing amount of data at our disposal. For example, [2] tackles the diarization problem as a classification task, while [10] uses a permutation-invariant loss and a clustering loss to dynamically identify speakers. Both [2] and [10] assume that the number of speakers is known a priori, or at least bounded. This assumption is removed in the UIS-RNN [28]: a fully supervised approach which handles an unbounded number of speakers using an online generative process. Speaker distributions are modelled with multiple instances of a parameter-sharing Recurrent Neural Network (RNN). A further, strong advantage of [28] over traditional clustering algorithms is that decoding is performed online using beam search [18]. Although online diarization had already been explored, using both unsupervised [12, 17, 9] and supervised [2] paradigms, the UIS-RNN stands out in terms of performance, outperforming the previous offline state of the art on telephone data.
Although these are very interesting results, an online system that works well across multiple domains remains an open problem. As a matter of fact, diarization systems presented in the literature appear to work relatively well on domains with a low number of speakers and no overlapping speech, such as telephone data, while performance tends to deteriorate in more challenging contexts such as meetings or dinner parties.
In this paper we present an evolution of the UIS-RNN [28] which substantially improves its performance. First, we introduce a new loss function for training the RNN that models speakers: it provides faster convergence, encourages the network to find deeper minima, and generalizes better on the evaluation set. Second, we propose a semantically grounded formulation for the unseen-speaker intervention probability that is easy to calculate and improves performance in inference. In addition, we train on fixed-length speech segments and let the neural network aggregate embeddings, removing the constraint on speaker change information in inference. Finally, we shed light on the performance of the proposed method with respect to the original UIS-RNN in a multi-domain scenario through extensive testing on the DIHARD datasets. We also make our results reproducible, since we use a publicly available embedding extractor and fully disclose our code.¹

¹ The first author performed this work as an intern at PerVoice and Fondazione Bruno Kessler. The implementation of this paper is available at: https://github.com/DonkeyShot21/uisrnnsml
2 Proposed approach
Given a set of embeddings $X = (x_1, \dots, x_T)$ and the related speaker labels $Y = (y_1, \dots, y_T)$, where $T$ is the total number of observations, we can cast the diarization problem in a probabilistic framework, looking for the sequence of speaker labels that maximizes the joint probability:
$$Y^\ast = \arg\max_{Y} \; p(X, Y) \qquad (1)$$
If we model eq. 1 as an online generative problem as in [28], introducing hidden speaker change indicators $Z = (z_2, \dots, z_T)$, we can rewrite the joint probability as:
$$p(X, Y, Z) = p(x_1, y_1) \prod_{t=2}^{T} \underbrace{p(x_t \mid x_{[t-1]}, y_{[t]})}_{\text{sequence generation}} \; \underbrace{p(y_t \mid z_t, y_{[t-1]})}_{\text{speaker assignment}} \; \underbrace{p(z_t \mid z_{[t-1]})}_{\text{speaker change}} \qquad (2)$$
where $z_t$ is a hidden binary indicator of speaker change and $x_{[t]}$ denotes all observations up to $t$ included. In the original definition of the UIS-RNN [28], the speaker change term of eq. 2 is modelled by a coin flipping process whose only parameter is $p_0$, the transition probability. The speaker assignment term is implemented as a distance dependent Chinese Restaurant Process (ddCRP) [3], a Bayesian non-parametric process that governs how speakers interleave in the time domain. Finally, the sequence generation part of eq. 2 is modelled using an RNN, specifically a Gated Recurrent Unit (GRU), that parametrizes the distribution of the embeddings, assumed Gaussian, as follows:
$$x_t \mid x_{[t-1]}, y_{[t]} \sim \mathcal{N}\left(\mu_t, \sigma^2 I\right) \qquad (3)$$
where $\mu_t$ is the averaged output of the neural network with parameters $\theta$, instantiated for speaker $y_t$.
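For concreteness, the following PyTorch sketch shows one way the sequence generation component described above can be realized: a shared GRU whose per-step outputs are averaged to obtain $\mu_t$ (eq. 3). The class name, layer sizes and interface are ours for illustration, not taken from [28].

```python
import torch
import torch.nn as nn

class SpeakerRNN(nn.Module):
    """GRU that predicts the mean of the next embedding of one speaker.

    One logical instance is kept per speaker; all instances share the
    same parameters, following the UIS-RNN formulation.
    """

    def __init__(self, emb_dim: int = 200, hidden: int = 200):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, emb_dim), embeddings of a single speaker.
        out, _ = self.gru(x)        # (batch, seq_len, hidden)
        out = self.fc(out)          # per-step network outputs
        # mu_t is the average of the outputs up to step t (eq. 3).
        cum = torch.cumsum(out, dim=1)
        steps = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return cum / steps          # (batch, seq_len, emb_dim) = mu
```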
2.1 Original UIS-RNN training
Given a dataset $\mathcal{D} = \{(X_n, Y_n)\}_{n=1}^{N}$, including sequences of embeddings and their related labels, the optimal set of network parameters $\theta^\ast$ can be obtained by minimizing the following negative log-likelihood [28]:
$$\theta^\ast = \arg\min_{\theta} \; - \sum_{n=1}^{N} \ln p(X_n, Y_n \mid \theta) \qquad (4)$$
Using the model in eq. 3, eq. 4 can be reformulated in a Mean Squared Error (MSE) fashion [13]:
$$\mathcal{L}_{\mathrm{MSE}}(\theta) = \sum_{s \in \Omega} \sum_{t=1}^{T_s - 1} \left\| x_{t+1}^{s} - \mu_{t}^{s} \right\|^2 \qquad (5)$$
Given $K$ speakers and $P$ permutations applied to the data for augmentation purposes, $\Omega$ is a set of $K \cdot P$ single-speaker sequences, where each sequence $s$ is obtained by concatenating a random permutation of the embeddings generated by the $k$-th speaker. $T_s$ and $x_t^s$ are respectively the length and the $t$-th embedding of sequence $s$, and $\mu_t^s$ is the network prediction after observing the first $t$ embeddings of $s$.
Note that, since the sequences are shuffled, the network cannot learn any causal relationship between observations, nor how to predict the next embedding exactly. Essentially, the network is trained to generate samples from an auxiliary distribution whose expectation equals that of the speaker's embedding distribution. Therefore, the network will learn to predict the mean of the distribution of the embeddings. Figure 1 graphically describes the training presented in [28] and implemented in [13].
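A minimal sketch of this training scheme, assuming the `SpeakerRNN` module above: $\Omega$ is built by permuting and concatenating per-speaker embeddings, and the loss of eq. 5 compares the running prediction with the next observed embedding. Helper names are hypothetical; the actual training loop (batching, optimizer steps) is omitted.

```python
import torch

def make_sequences(embs_by_speaker, permutations: int = 10):
    """Build Omega: for each speaker, concatenate random permutations
    of that speaker's embeddings (data augmentation)."""
    omega = []
    for embs in embs_by_speaker:                   # embs: (n_k, emb_dim)
        for _ in range(permutations):
            perm = torch.randperm(embs.size(0))
            omega.append(embs[perm].unsqueeze(0))  # (1, n_k, emb_dim)
    return omega

def mse_loss(model, seq):
    """Original UIS-RNN loss (eq. 5) for one single-speaker sequence."""
    mu = model(seq[:, :-1])     # mu_t after observing x_1 .. x_t
    target = seq[:, 1:]         # next observed embedding x_{t+1}
    return ((target - mu) ** 2).sum()
```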
2.2 Sample Mean Loss training: UIS-RNN-SML
In this section we propose a modified loss that relies on more accurate targets for the network output. Rather than adjusting the network by comparing the mean of its outputs with the next observed embedding, we define an MSE loss with respect to the actual mean of the embeddings of a given speaker. This amounts to defining a predictor of the mean of the embedding distribution, having seen only a small sample of it. More formally, we replace the MSE loss in eq. 5 with:
$$\mathcal{L}(\theta) = \sum_{s \in \Omega} \sum_{t=1}^{T_s - 1} \left\| \mathbb{E}\left[ d(s) \right] - \mu_t^s \right\|^2 \qquad (6)$$
where $d(\cdot)$ is a function that maps each sequence index $s$ to the embedding distribution of the speaker who generated the observations in $s$.
In practice, the actual probability distribution of the embeddings is not available. In addition, given the limited amount of labelled data, using the bare mean over the sequence would lead to overfitting. Therefore, we build the ground truth for the network by estimating the mean over a collection of unseen samples that we draw randomly with replacement from the permuted sequence itself. In formulas, given a generic sequence $s = (x_1^s, \dots, x_{T_s}^s)$ and a subset $\Lambda_t^s \subset x_{(t,T_s]}^s$ of $M$ randomly sampled embeddings, we estimate the mean of the embeddings as:

$$\bar{x}_t^s = \frac{1}{M} \sum_{x \in \Lambda_t^s} x$$

Eq. 6 is then rewritten, leading to our Sample Mean Loss (SML) definition:

$$\mathcal{L}_{\mathrm{SML}}(\theta) = \sum_{s \in \Omega} \sum_{t=1}^{T_s - 1} \left\| \bar{x}_t^s - \mu_t^s \right\|^2 \qquad (7)$$

where we denote the ordered set of unseen embeddings $(x_{t+1}^s, \dots, x_{T_s}^s)$ as $x_{(t,T_s]}^s$. Figure 2 depicts the proposed training approach for a generic sequence $s$.
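Under the same assumptions as the previous sketch, the SML only changes the target: instead of the single next embedding, the mean of $M$ embeddings drawn with replacement from the unseen tail of the sequence. A minimal sketch:

```python
import torch

def sml_loss(model, seq, M: int = 3):
    """Sample Mean Loss (eq. 7) for one permuted single-speaker sequence.

    For every step t, the target is the mean of M embeddings drawn with
    replacement from the not-yet-seen part of the sequence (x_{t+1}..x_T).
    """
    T = seq.size(1)
    mu = model(seq[:, :-1])                      # (1, T-1, emb_dim)
    targets = []
    for t in range(T - 1):
        idx = torch.randint(t + 1, T, (M,))      # sample with replacement
        targets.append(seq[0, idx].mean(dim=0))  # sample mean x_bar_t
    targets = torch.stack(targets).unsqueeze(0)  # (1, T-1, emb_dim)
    return ((targets - mu) ** 2).sum()
```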
2.3 New speaker probability
One of the most interesting advantages of the UIS-RNN [28] over other supervised methods, like [10], is its ability to model an unbounded number of speakers. This is achieved using a ddCRP model [3] that assigns the probability of switching back to a previously seen speaker proportionally to the number of turns of that speaker, and accounts for the probability of a new speaker joining the conversation. Assuming speakers are numbered in order of appearance starting from 1, we let:
$$p(y_t = k \mid z_t = 1, y_{[t-1]}) = \frac{N_{k,t-1}}{\sum_{k'=1}^{K_{t-1}} N_{k',t-1} + \alpha} \qquad (8)$$

$$p(y_t = K_{t-1} + 1 \mid z_t = 1, y_{[t-1]}) = \frac{\alpha}{\sum_{k'=1}^{K_{t-1}} N_{k',t-1} + \alpha} \qquad (9)$$
where $N_{k,t-1}$ is the number of blocks of contiguous utterances of speaker $k$ observed up to time $t-1$, and $K_{t-1}$ is the number of speakers seen so far. The probability of switching to a new speaker is controlled by the parameter $\alpha$, which is critical for the correct functioning of the whole framework: large values of $\alpha$ force the model to overestimate the number of speakers, instantiating several networks; conversely, small values limit the number of speakers by merging clusters.
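Eqs. 8 and 9 translate directly into a small routine that, at a speaker change point, returns the assignment distribution over the known speakers plus a new one. A sketch, with illustrative names:

```python
import numpy as np

def assignment_probs(block_counts, alpha):
    """Speaker-assignment distribution at a change point (eqs. 8-9).

    block_counts[k] = N_k: number of blocks of contiguous utterances
    of speaker k seen so far.  Returns the probabilities of switching
    to each known speaker plus, as last element, to a new speaker.
    """
    counts = np.asarray(block_counts, dtype=float)
    denom = counts.sum() + alpha
    return np.append(counts / denom, alpha / denom)

# Example: two known speakers with 3 and 1 blocks, alpha = 0.5
# -> [0.667, 0.222, 0.111]; the last entry is the new-speaker probability.
print(assignment_probs([3, 1], alpha=0.5))
```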
With respect to the estimation performed in [28], we propose the following analytical formulation for $\alpha$:
$$\alpha = \frac{\hat{p}}{1 - \hat{p}} \, \bar{N} \qquad (10)$$

where $\hat{p}$ is the empirical probability, measured on the training data, that a speaker change introduces a previously unseen speaker, and $\bar{N}$ is the average value of $\sum_{k'} N_{k',t-1}$ observed at the change points.
This formulation has the advantage that it can be derived from eq. 9, by equating the new-speaker probability to its empirical estimate and solving for $\alpha$; it is therefore semantically coherent with the role of the parameter. In addition, the value of the parameter is estimated directly from the data, independently of any error metric or heuristic.
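Assuming the derivation above, $\alpha$ can be estimated with a single pass over the training labels: count how often a change point introduces an unseen speaker, average the block counts at change points, and solve eq. 9 for $\alpha$. The following sketch reflects our reading of eq. 10; the exact estimator may differ in details:

```python
import numpy as np

def estimate_alpha(label_sequences):
    """Estimate alpha from training label sequences (cf. eq. 10).

    At every speaker change we record whether a new speaker appears and
    how many blocks have been observed so far; alpha is then solved from
    the empirical new-speaker rate via eq. 9.
    """
    new_events, changes, blocks_at_change = 0, 0, []
    for labels in label_sequences:
        seen, n_blocks = {labels[0]}, 1
        for prev, cur in zip(labels, labels[1:]):
            if cur != prev:                     # speaker change point
                changes += 1
                blocks_at_change.append(n_blocks)
                if cur not in seen:
                    new_events += 1
                    seen.add(cur)
                n_blocks += 1
    p_new = new_events / changes
    n_bar = float(np.mean(blocks_at_change))
    return p_new * n_bar / (1.0 - p_new)        # solve eq. 9 for alpha
```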
3 Experiments and results
3.1 Dataset
We train and evaluate our method on the data used in the DIHARD II challenge [22]. The challenge features two audio input conditions: single channel and multi-channel. We focus on single channel data with reference Speech Activity Detection (SAD), as per track 1 of the competition. The dataset is divided into two subsets, development and evaluation, each consisting of selections of 5–10 minute audio files sampled from 11 different conversational domains, for a total of approximately 2 hours of audio.
Using stratified hold-out, we further split the development set into a training set (80%) and a validation set (20%). We also randomize the hold-out procedure, so that every experiment uses a different data partitioning. Stratification is performed over the set of domains, according to their frequency in the whole development set.
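A possible implementation of this split, using scikit-learn (our choice for illustration; the paper does not prescribe a tool):

```python
from sklearn.model_selection import train_test_split

def split_development(file_ids, domains, seed):
    """80/20 stratified hold-out over conversational domains.

    A different seed per experiment yields a different partitioning,
    while stratification keeps the domain frequencies of the full
    development set in both subsets.
    """
    train_ids, val_ids = train_test_split(
        file_ids, test_size=0.2, stratify=domains, random_state=seed)
    return train_ids, val_ids
```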
Although the proposed approach does not handle cases where multiple speakers are active simultaneously, we do not exclude overlapping speech segments from the training material. In fact, we observed that treating multi-speaker segments as an additional, separate speaker slightly improves performance.
3.2 Experimental setup
As speaker embeddings for our supervised diarization system we use x-vectors [24], relying on the pre-trained models available in the Kaldi diarization recipe [21]. X-vectors of dimension 512 are extracted from non-overlapping 1-second speech segments and are subsequently reduced to dimension 200 with Principal Component Analysis (PCA) before being fed to the model.
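For illustration, the dimensionality reduction step could look as follows, assuming the 512-dimensional x-vectors have already been extracted with the Kaldi recipe; the use of scikit-learn here is our assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_xvectors(xvecs_train: np.ndarray, xvecs_test: np.ndarray):
    """Fit a 512 -> 200 PCA on training x-vectors and apply it everywhere.

    xvecs_*: (n_segments, 512) arrays of x-vectors extracted from
    non-overlapping 1-second segments (extraction itself not shown).
    """
    pca = PCA(n_components=200)
    pca.fit(xvecs_train)
    return pca.transform(xvecs_train), pca.transform(xvecs_test)
```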
As for the sequence generation component, our network resembles the architecture presented in [28]. However, since we are using different features, namely x-vectors computed on fixed-length segments instead of d-vectors extracted from ground-truth speaker segments, we explored several configurations, varying the sizes of the layers. We found that reasonable results are obtained using one recurrent layer and one fully connected layer with 200 units each.
The other two parameters, $p_0$ and $\alpha$, for the speaker change and the speaker assignment components respectively, are estimated using their analytical formulations: for the transition probability $p_0$ we apply the same formula as in [28], while for $\alpha$ we use eq. 10. We also explored search-based techniques for hyper-parameter optimization, such as grid search and line search, but found that they do not provide noticeable improvements in performance. Furthermore, the variance of the observations $\sigma^2$ is optimized during training using Adam, as in [28]. Apart from the SML loss, two more regularization losses help the model converge [13]: the first is a simple L2 loss on the parameters of the GRU; the second uses an inverse gamma distribution to regularize the value of $\sigma^2$, which would otherwise diverge to very large values.

In inference we use beam search with a fixed beam size. Unlike in [28], in our dataset we cannot consider the number of speakers to be bounded, which makes inference expensive.
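To make the decoding procedure concrete, the following heavily simplified sketch scores hypotheses with the Gaussian likelihood of eq. 3 combined with the priors of eqs. 8–9, pruning to a fixed beam width. It is our own illustration, not the decoder of [28]: it omits the speaker-change probability $p_0$ and recomputes predictions from scratch instead of caching per-speaker GRU states; `speaker_mean` is a hypothetical stand-in for the trained network (returning, e.g., a zero vector for an empty prefix).

```python
import heapq
import numpy as np

def beam_decode(X, speaker_mean, alpha, sigma2, beam=10):
    """Simplified online beam search over speaker label sequences.

    X: (T, emb_dim) array of embeddings.  `speaker_mean(prefix)` returns
    the predicted mean of the next embedding of a speaker, given the
    embeddings already assigned to that speaker.
    """
    hyps = [(0.0, [], {})]                  # (neg log prob, labels, block counts)
    for t, x in enumerate(X):
        cand = []
        for score, labels, blocks in hyps:
            K = len(blocks)
            denom = sum(blocks.values()) + alpha
            for k in range(K + 1):          # K known speakers + 1 new speaker
                prior = blocks[k] / denom if k < K else alpha / denom
                prefix = [X[i] for i, y in enumerate(labels) if y == k]
                mu = speaker_mean(prefix)
                ll = -np.sum((x - mu) ** 2) / (2 * sigma2)   # eq. 3 (log, up to const)
                nb = dict(blocks)
                if not labels or labels[-1] != k:            # new block starts
                    nb[k] = nb.get(k, 0) + 1
                cand.append((score - ll - np.log(prior + 1e-12),
                             labels + [k], nb))
        hyps = heapq.nsmallest(beam, cand, key=lambda h: h[0])  # prune
    return min(hyps, key=lambda h: h[0])[1]
```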
Networks are trained several times using the Adam optimizer, and the best model is selected by measuring the Diarization Error Rate (DER) on the validation set, using a smaller beam width to reduce the computational cost.
DER is measured using dscore [19], the official scoring tool of the DIHARD II competition, which does not apply any forgiveness collar and also scores overlapped speech segments. However, since none of the methods under evaluation handles overlapped speech, we also report performance without overlap.
3.3 Results
Method                        DER    DER (no overlap)
Cum. mean + beam search       34.0   26.7
UIS-RNN [28][13]              30.9   23.4
UIS-RNN + eq. 10              30.3   22.8
UIS-RNN-SML + eq. 10          27.3   19.4
PLDA + AHC [22] (offline)     26.1   17.7

Table 1: DER (%) on the evaluation set, with and without overlapping speech.
Table 1 reports the performance of our proposed UIS-RNN-SML, based on SML and on the $\alpha$ estimation of eq. 10, in comparison with two online baselines. The first is a naïve implementation in which the GRU is replaced by a simple cumulative mean of the embeddings ("Cum. mean + beam search" in Table 1). This naïve baseline helps highlight the contribution of the neural network, disentangling it from the other components of the framework. The second is the original UIS-RNN [28], using the implementation provided in [13]. To give an idea of how difficult the task is, we also report the offline baseline provided in the DIHARD II challenge [22], which performs diarization by scoring the x-vectors with PLDA [23] and clustering with AHC [14].
The naïve implementation based on cumulative mean with beam search is outperformed by the UIS-RNN by a large margin, both with and without overlapping speech segments. This confirms that the simple mean of a partial sequence of embeddings does not properly model the speaker, and that the neural network makes an active contribution. A further small but significant DER reduction with respect to the original implementation, both with and without overlap, is provided by estimating $\alpha$ with eq. 10 (third row in Table 1).
Finally, a larger leap in performance is achieved by replacing the original loss function with the proposed SML. Note that the UIS-RNN-SML achieves performance similar to the offline baseline used in DIHARD II [22], even though online clustering algorithms usually perform significantly worse than offline ones. The performance improvement is due to the regularizing effect introduced by the SML in training. We observed that, keeping learning rate and batch size fixed, training with SML is much less noisy than with the original loss: the more accurate supervision given by the sample mean results in better gradients, which in turn helps convergence to deeper minima. The stabilizing effect of the SML is evident in Fig. 4, where we report the variance of the means of the speaker clusters generated by the network during training. Models trained with eq. 7 exhibit less output variance compared to those trained with eq. 5. This behaviour turns out to be very beneficial in the decoding phase, where the means of the clusters should not change dramatically while the sequence unfolds.
For a better understanding of the behaviour of our proposed method, Fig. 3 reports the DER for each domain in the dataset. Our method is better than the original UIS-RNN in all the most challenging domains, except for “socio field” and “child”, where our performance is basically aligned with the other methods. We observe a small performance deterioration on “audiobooks”. This occurs because the UIS-RNN-SML, predicting the mean more accurately, produces slightly smaller values for the cluster variance $\sigma^2$. Although this is beneficial in most cases, it can marginally reduce performance in domains with a very low number of speakers. This disadvantage could be partially alleviated by defining context-dependent $\sigma^2$ and $\alpha$.
Finally, we evaluate the impact of the number of samples $M$ used to estimate the mean of the distribution. Fig. 5 shows the DER on the whole evaluation set for different values of $M$. On these data, small values of $M$ provide the lowest DER, with values from 2 to 4 producing very similar results. Unsurprisingly, performance degrades for larger values of $M$: the sample mean approximates the real mean too tightly, leading to overfitting. Note that the case $M = 1$ would be equivalent to the UIS-RNN, except that observations are sampled with replacement. This alone gives a considerable improvement (27.83% against 30.3%), because outliers of the speaker clusters are less likely to be observed by the network as targets during training, reducing the overall variance of the output.
4 Conclusions
In this paper we presented an evolution of a supervised speaker diarization system in which the clustering module is replaced by a trainable model, the unbounded interleaved-state RNN. Specifically, we proposed a modified loss function that encourages the neural network to model speakers more accurately. In addition, we introduced a semantically grounded formulation for the estimation of the parameter that controls the speaker assignment probability. We evaluated the proposed online diarization approach on the DIHARD II multi-domain data, showing through extensive experiments that it outperforms the original UIS-RNN formulation. Finally, we fully disclose our code and trained models to make our results reproducible.
References
[1] (2012) Speaker diarization: a review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20 (2), pp. 356–370.
[2] (2017) Deep learning approaches for online speaker diarization.
[3] (2011) Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 12 (Aug), pp. 2461–2488.
[4] (2005) Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 539–546.
[5] (2018) VoxCeleb2: deep speaker recognition. In INTERSPEECH.
[6] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
[7] (2018) Speaker diarization based on Bayesian HMM with eigenvoice priors. In Speaker Odyssey, pp. 147–154.
[8] (2019) Bayesian HMM based x-vector clustering for speaker diarization. In INTERSPEECH, pp. 346–350.
[9] (2017) Developing on-line speaker diarization system. In INTERSPEECH.
[10] (2019) End-to-end neural speaker diarization with permutation-free objectives. In INTERSPEECH.
[11] (2017) Speaker diarization using deep neural network embeddings. In International Conference on Acoustics, Speech and Signal Processing, pp. 4930–4934.
[12] (2010) GMM-UBM based open-set online speaker diarization. In Eleventh Annual Conference of the International Speech Communication Association.
[13] Official library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm. https://github.com/google/uisrnn (accessed Oct. 2019).
[14] (2008) Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing 16 (8), pp. 1590–1601.
[15] (2008) A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 16 (5), pp. 980–988.
[16] (2019) LSTM based similarity measurement with spectral clustering for speaker diarization. In INTERSPEECH, pp. 366–370.
[17] (2018) Links: a high-dimensional online clustering method. arXiv preprint arXiv:1801.10123.
[18] (1977) Speech understanding systems: report of a steering committee. Artificial Intelligence 9 (3), pp. 307–316.
[19] dscore, official scoring tool for DIHARD II. https://github.com/nryant/dscore (accessed Oct. 2019).
[20] (2006) A spectral clustering approach to speaker diarization. In Ninth International Conference on Spoken Language Processing.
[21] (2011) The Kaldi speech recognition toolkit. In Workshop on Automatic Speech Recognition and Understanding.
[22] (2019) The Second DIHARD Diarization Challenge: dataset, task, and baselines. In INTERSPEECH, pp. 978–982.
[23] (2014) Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In Spoken Language Technology Workshop, pp. 413–417.
[24] (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In INTERSPEECH, pp. 2808–2812.
[25] (2006) An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing 14 (5), pp. 1557–1565.
[26] (2014) Deep neural networks for small footprint text-dependent speaker verification. In International Conference on Acoustics, Speech and Signal Processing, pp. 4052–4056.
[27] (2018) Speaker diarization with LSTM. In International Conference on Acoustics, Speech and Signal Processing, pp. 5239–5243.
[28] (2019) Fully supervised speaker diarization. In International Conference on Acoustics, Speech and Signal Processing, pp. 6301–6305.
[29] (2018) Self-attentive speaker embeddings for text-independent speaker verification. In INTERSPEECH, pp. 3573–3577.