Speaker diarization aims at answering the question “who spoke when”, effectively partitioning an audio sequence into segments with a particular speaker identity. Most dependable diarization approaches consist of a cascade of several steps [1, 10]: voice activity detection to discard non-speech regions, speaker embedding [19, 28] to obtain discriminative speaker representations, and clustering [10, 18, 17] to group speech segments by speaker identity. The main limitation of this family of multi-stage approaches relates to how they handle overlapped speech (which is known to be one of the main sources of errors): either they simply ignore the problem or they address it a posteriori as a final post-processing step based on a dedicated overlapped speech detection module [21, 5, 14, 3].
A new family of approaches has recently emerged, rethinking speaker diarization completely. Dubbed end-to-end diarization (EEND), the main idea of this approach is to train a single neural network, in a permutation-invariant manner, that ingests the audio recording and directly produces an overlap-aware diarization output [12, 11]. We propose to meet half-way between multi-stage and overlap-aware end-to-end diarization and design a multi-stage pipeline where overlapped speech is a first-class citizen in every single step: from segmentation to incremental clustering. In particular, our first contribution (discussed in Section 2.2.1) is a modified version of the statistics pooling layer (initially introduced in the x-vector architecture) that gives less weight to frames where the initial segmentation step predicts simultaneous speakers.
Despite being competitive with multi-stage approaches, the main limitation of the overlap-aware end-to-end approaches is the strong assumption that the number of speakers is upper bounded or even known a priori. While reasonable for some particular use cases (e.g. one-to-one phone conversations), this assumption does not hold in many other situations (e.g. physical meetings or conference calls). One solution to this problem is to augment end-to-end approaches with mechanisms to automatically estimate the number of speakers. For instance, EEND-EDA [13] extends EEND [12, 11] with a recurrent Encoder-Decoder network to generate a variable number of Attractors, which are similar to speaker centroids. Multi-stage approaches usually do not suffer from this limitation as they rely on a clustering step for which a growing number of techniques exist to accurately estimate the number of speakers [22]. We propose to combine the best of both worlds [15] by first applying the end-to-end approach on audio chunks small enough to reasonably estimate an upper bound on the local number of speakers and only then applying global constrained clustering on top of the resulting local speakers. As discussed in Section 2.2.2, we say that clustering is constrained because cannot-link constraints are inferred from the output of the local end-to-end diarization. The main difference between this work and [15] is that we target low-latency online speaker diarization while they address offline speaker diarization.
This work relies heavily on the speaker segmentation model introduced in [3] and summarized in Section 2.1 for convenience. However, the two works address very different problems with radically different constraints. While [3] performs local offline speaker diarization of extremely short 5s chunks of audio, this work addresses online speaker diarization of (possibly infinite) audio streams. Hence, this work extends [3] with a mechanism to track speakers over the duration of a conversation, with a latency much lower than 5s and real-time processing.
Low-latency online speaker diarization differs from its offline counterpart in several ways. While the latter assumes that the whole audio sequence is available at once (and hence can rely on multiple passes over the whole sequence to output its final prediction), the former ingests a possibly infinite audio stream and can only afford a short delay between when it receives a buffer of audio and when it outputs the corresponding prediction (without the option to correct it afterwards). These additional constraints prevent state-of-the-art multi-stage approaches like VBx [16] from being used in that setting as they rely heavily on the possibility of passing several times over the audio sequence. EEND-like approaches are not suitable either because they expect large chunks of audio (30 seconds or more), leading to prohibitively high latency. One notable exception is FlexSTB [31], which astutely relies on an adaptive internal buffer to both simulate large audio chunks and support low (1s) latency.
A comprehensive set of experiments on AMI, DIHARD II, DIHARD III and VoxConverse datasets is reported and discussed in Section 4, where FlexSTB and a state-of-the-art offline approach based on VBx [16] serve as baseline and topline, respectively. In particular, we show how the latency of the proposed approach can easily be adjusted (without retraining) between 500ms and 5s to match the requirements of a particular use case.
2 Overlap-aware online diarization
As depicted in Figure 1, we propose to address online speaker diarization as the iterative interplay between two main steps: segmentation and incremental clustering. Every few hundred milliseconds (500ms in our case), the segmentation module first performs a fine-grained overlap-aware diarization of a 5s rolling buffer. This local diarization is then ingested by the incremental clustering module that relies on speaker embeddings to map local speakers to the appropriate global speakers (or create new ones), before updating its own internal state.
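The rolling-buffer loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `rolling_buffers`, the 16 kHz default sample rate and the sample-by-sample interface are assumptions made for this example only.

```python
from collections import deque
from typing import Iterable, Iterator, List


def rolling_buffers(
    samples: Iterable[float],
    sample_rate: int = 16000,  # assumed sample rate, not stated in the text
    duration: float = 5.0,     # buffer duration in seconds
    step: float = 0.5,         # period between two buffer updates in seconds
) -> Iterator[List[float]]:
    """Yield the content of a fixed-size rolling buffer every `step` seconds."""
    size = int(duration * sample_rate)
    hop = int(step * sample_rate)
    buffer = deque(maxlen=size)  # oldest samples are dropped automatically
    since_last_output = 0
    for sample in samples:
        buffer.append(sample)
        since_last_output += 1
        # Wait until the buffer is full, then emit one snapshot per hop.
        if len(buffer) == size and since_last_output >= hop:
            since_last_output = 0
            yield list(buffer)
```

Each yielded snapshot would then be passed to the segmentation step and on to incremental clustering; consecutive snapshots share all but the last `step` seconds of audio.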
2.1 Segmentation
The segmentation step is the direct application of the end-to-end speaker segmentation neural network introduced in [3], used to obtain a fine-grained local speaker diarization. As depicted in Figure 1, it ingests the 5s audio rolling buffer and outputs, for every output frame, one activity probability for each of a fixed number of speakers, chosen as the estimated maximum number of different speakers that a 5s chunk may contain. Speakers whose activity probability exceeds a tunable threshold at least once during the chunk constitute the set of local speakers. Inactive speakers are simply discarded.
Active speaker probabilities are then passed unchanged (i.e. with continuous values between 0 and 1) to the incremental clustering step. In particular, it means that overlapping speech (i.e. when two or more speakers have high probabilities simultaneously) is handled from the very beginning of the pipeline. This is in contrast with most dependable speaker diarization approaches that handle overlapping speech as a post-processing step [17, 3]. This early detection of overlapping speech will prove very useful for the incremental clustering.
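As a rough illustration of this step, the sketch below selects local speakers from a frame-by-speaker probability matrix and flags overlapped frames. The names `active_speakers`, `overlapped_frames` and `tau_active` are hypothetical and introduced for this example only.

```python
from typing import List, Sequence


def active_speakers(probs: Sequence[Sequence[float]], tau_active: float) -> List[int]:
    """Return indices of speakers whose probability exceeds the threshold at least once.

    probs[f][k] is the activity probability of speaker k at frame f.
    """
    if not probs:
        return []
    num_speakers = len(probs[0])
    return [
        k
        for k in range(num_speakers)
        if max(frame[k] for frame in probs) > tau_active
    ]


def overlapped_frames(probs: Sequence[Sequence[float]], tau_active: float) -> List[int]:
    """Return indices of frames where two or more speakers exceed the threshold."""
    return [
        f
        for f, frame in enumerate(probs)
        if sum(1 for p in frame if p > tau_active) >= 2
    ]
```

Note that only the selection of speakers is binary here: the probabilities of the selected speakers are kept as continuous values for the following steps.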
2.2 Incremental clustering
Because the segmentation model is trained in a permutation-invariant manner and applied locally to the rolling buffer, one cannot guarantee that one particular speaker consistently activates the same index over time. Figure 2 illustrates this limitation for two states of the rolling buffer: despite being only 500ms apart from each other and therefore having most of their audio content in common, notice how both active speakers are swapped. This section describes how we use incremental clustering to circumvent this limitation by tracking speakers (and detecting new ones) over the whole duration of the audio stream.
2.2.1 Segmentation-driven speaker embedding
Like most recent speaker diarization systems, we rely on neural speaker embeddings to represent and compare speakers. Our model is based on the canonical x-vector TDNN-based architecture, with the difference that the statistics pooling layer [28] is modified to return, for each active speaker, the concatenation of a weighted mean and a weighted standard deviation of the frame-level outputs of the last TDNN layer (with one weight per frame), instead of the regular mean and standard deviation. One straightforward option is to use the speaker activity probability directly as the frame weight, so that the final (pooled) speaker embedding mostly relies on frames where the segmentation model is confident that the speaker is active. This generates exactly one embedding per active speaker in the current buffer, even when their speech is split into multiple speech turns (e.g. the red speaker in the lower row of Figure 2).
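Weighted statistics pooling can be sketched as below. This is a minimal pure-Python version under stated assumptions: the variance is normalized by the sum of the weights, which is one common convention and may differ from the exact normalization used in the paper, and the function name `weighted_stats_pooling` is ours.

```python
import math
from typing import List, Sequence


def weighted_stats_pooling(
    frames: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Concatenate the weighted mean and weighted standard deviation over frames.

    frames[f] is the feature vector of frame f, weights[f] its non-negative weight.
    """
    total = sum(weights) + eps  # eps avoids a division by zero
    dim = len(frames[0])
    mean = [
        sum(w * x[d] for w, x in zip(weights, frames)) / total
        for d in range(dim)
    ]
    # Assumption: variance normalized by the sum of weights (biased estimate).
    var = [
        sum(w * (x[d] - mean[d]) ** 2 for w, x in zip(weights, frames)) / total
        for d in range(dim)
    ]
    std = [math.sqrt(v) for v in var]
    return mean + std
```

Passing one speaker's activity probabilities as `weights` gives the straightforward option described above: frames where that speaker is unlikely to be active contribute little to the pooled statistics.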
Furthermore, as summarized in [3], the segmentation model is also very good at detecting overlapped speech regions (where two or more speakers are active simultaneously). Therefore, another option is to make the speaker embedding focus on frames where the model is confident that the speaker is the only active speaker (Equation 2). The effect of this transformation is illustrated in Figure 3. The weights are designed to weigh down frames where two or more speakers are active, and an exponent further weighs down frames where the segmentation model is not quite confident about the activity of a speaker. Embeddings extracted with this weighing scheme are called overlap-aware speaker embeddings in the rest of the paper.
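To make the two effects concrete, here is one possible weighting with both properties. It is an assumed form written for illustration and should not be read as the paper's Equation 2: it multiplies each probability by a sharpened softmax across speakers (penalizing overlap) and raises the result to a power (penalizing low confidence). The names `beta` and `gamma` and their default values are placeholders.

```python
import math
from typing import List, Sequence


def overlap_aware_weights(
    frame_probs: Sequence[float],
    beta: float = 10.0,   # placeholder sharpness of the softmax across speakers
    gamma: float = 3.0,   # placeholder exponent penalizing low confidence
) -> List[float]:
    """Return one weight per speaker for a single frame (hypothetical scheme)."""
    peak = max(frame_probs)
    # Numerically stable softmax over speakers.
    exps = [math.exp(beta * (p - peak)) for p in frame_probs]
    norm = sum(exps)
    return [(p * e / norm) ** gamma for p, e in zip(frame_probs, exps)]
```

With these placeholder values, a speaker who is confidently alone gets a weight close to 1, while the same speaker gets a much smaller weight when a second speaker is equally active or when their own probability is only 0.5.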
2.2.2 Constrained incremental clustering
Given the initial content of the rolling buffer, the segmentation and embedding steps are combined to extract one embedding for each active speaker in the first 5s of the audio stream. These speaker embeddings are stacked to form the initial centroid matrix $C$ with shape $N \times D$, where $N$ is the number of active speakers so far and $D$ is the dimension of the speaker embedding.
Every few hundred milliseconds (e.g. 500ms), the rolling buffer is updated, and the segmentation and speaker embedding steps are combined to extract one embedding for each of the locally active speakers. Those speaker embeddings are then compared to the current state of the centroid matrix to find the optimal mapping between local and global speakers. Denoting $d_{c,i}$ the distance between centroid $c$ and local speaker embedding $i$, one option is to assign the $i$th local speaker to the closest centroid:
$$m(i) = \operatorname*{arg\,min}_{c} \; d_{c,i} \tag{3}$$
Yet, this simple option does not take full advantage of the output of the segmentation model, as two local speakers might end up being assigned to the same centroid. This would contradict the output of the segmentation model, which already chose to discriminate between local speakers. Therefore, we add the constraint that any two local speakers cannot be assigned to the same centroid, while keeping the objective of minimizing the overall distance between local speakers and their assigned centroids:
$$m^{*} = \operatorname*{arg\,min}_{m \in \mathcal{M}} \sum_{i} d_{m(i),i} \tag{4}$$
where $\mathcal{M}$ is the set of mapping functions between local speakers and centroids with the following property:
$$\forall\, i \neq j, \quad m(i) \neq m(j) \tag{5}$$
In practice, this optimal mapping is obtained by applying the Hungarian algorithm on the speaker-to-centroid distance matrix, and can be seen as an incremental clustering step with cannot-link constraints.
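The difference between the unconstrained and the constrained assignment can be illustrated with a small sketch. For brevity it replaces the Hungarian algorithm with an exhaustive search over injective mappings, which returns the same optimum but is only practical for a handful of speakers. Function names are ours, cosine distance is assumed, and the case where local speakers outnumber centroids is deliberately left out.

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def distance_matrix(centroids: Sequence[Vector], embeddings: Sequence[Vector]) -> List[List[float]]:
    """dist[c][i] = distance between centroid c and local speaker embedding i."""
    return [[cosine_distance(c, e) for e in embeddings] for c in centroids]


def closest_centroid(centroids: Sequence[Vector], embeddings: Sequence[Vector]) -> Dict[int, int]:
    """Unconstrained option: each local speaker goes to its nearest centroid."""
    dist = distance_matrix(centroids, embeddings)
    return {
        i: min(range(len(centroids)), key=lambda c: dist[c][i])
        for i in range(len(embeddings))
    }


def constrained_mapping(centroids: Sequence[Vector], embeddings: Sequence[Vector]) -> Dict[int, int]:
    """Constrained option: distinct centroids, minimal total distance (brute force)."""
    if len(embeddings) > len(centroids):
        raise ValueError("more local speakers than centroids: not covered by this sketch")
    dist = distance_matrix(centroids, embeddings)
    best, best_cost = None, math.inf
    # Each permutation assigns a different centroid to every local speaker.
    for chosen in permutations(range(len(centroids)), len(embeddings)):
        cost = sum(dist[c][i] for i, c in enumerate(chosen))
        if cost < best_cost:
            best, best_cost = chosen, cost
    return dict(enumerate(best))
```

With centroids `[[1, 0], [0, 1]]` and two local embeddings `[0.9, 0.1]` and `[0.8, 0.2]`, both local speakers are closest to the first centroid, so the unconstrained rule merges them, whereas the constrained mapping keeps them apart.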
2.2.3 Detecting new speakers and updating centroids
Once the optimal mapping is determined, for any given local speaker and their local embedding:
- if the distance between this embedding and its assigned centroid exceeds a tunable threshold, they are marked as a new speaker (i.e. it is the first time they are active since the beginning of the audio stream) and their embedding is appended to the pool of centroids;
- otherwise, they are marked as a returning speaker, and their embedding is used to update the corresponding centroid.
Because of the weighing scheme described in Section 2.2.1, the quality of a speaker embedding is expected to be positively correlated with the estimated duration during which the local speaker is active. Therefore, we propose to only update a centroid when this duration is long enough, i.e. when it reaches a minimum duration below which a speaker embedding is considered too noisy to help refine the centroid. The update rule (Equation 6) assumes that speaker embeddings are unit-normalized and optimized for cosine similarity.
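A possible shape for this bookkeeping is sketched below. Several points are assumptions made for illustration: a speaker is treated as new when the distance to the assigned centroid exceeds a threshold, centroids are updated by adding the new embedding (a natural choice for unit-normalized embeddings compared with cosine similarity, but not confirmed by the text above), and the names `delta_new` and `rho_update` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def detect_and_update(
    centroids: Sequence[Sequence[float]],
    embeddings: Sequence[Sequence[float]],
    mapping: Dict[int, int],        # local speaker index -> assigned centroid index
    distances: Sequence[float],     # distance of each local speaker to its centroid
    durations: Sequence[float],     # estimated active duration of each local speaker
    delta_new: float,               # hypothetical new-speaker distance threshold
    rho_update: float,              # hypothetical minimum duration for an update
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Return the updated centroids and the final local-to-global mapping."""
    updated = [list(c) for c in centroids]  # work on a copy
    final: Dict[int, int] = {}
    for i, c in mapping.items():
        if distances[i] > delta_new:
            # New speaker: their embedding becomes a new centroid.
            updated.append(list(embeddings[i]))
            final[i] = len(updated) - 1
        else:
            # Returning speaker: refine the centroid only if enough speech is available.
            final[i] = c
            if durations[i] >= rho_update:
                # Assumed additive update of the centroid.
                updated[c] = [a + b for a, b in zip(updated[c], embeddings[i])]
    return updated, final
```

Skipping the update for short segments keeps noisy embeddings from drifting a centroid, at the cost of learning more slowly from speakers who only produce brief turns.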
2.3 Adjusting the latency
Even though the whole buffer is used to extract embeddings and assign local speakers to an existing (or new) cluster, only the (active) speaker activity probabilities in the rightmost part of the buffer are output: the duration of this part effectively controls the latency of the whole system.
The lowest possible latency corresponds to the period between two consecutive updates of the rolling buffer (500ms in our case). In this configuration, the rightmost parts of two consecutive buffer states do not overlap. Therefore, they are simply concatenated and frame-level speaker activity probabilities are passed through a final thresholding step: a local speaker is marked as active at a given frame if their activity probability at that frame exceeds the threshold.
The careful reader might have noticed that, at the very beginning of the audio stream, the initial buffer must be filled entirely before a first output can be provided, effectively leading to a much larger latency of 5s, an order of magnitude larger than the promised 500ms. However, once this initial warm-up period has passed, the latency is indeed 500ms. If having a low latency from the very beginning of the stream is critical, one can simply left-pad the initial incomplete buffer with zeros.
Figure 4 shows that, for cases where longer latency is permitted, several positions of the rolling buffer can be combined in an ensemble-like manner to obtain a more robust output. In practice, for a given frame, the final speaker activity probabilities are computed as the average of the speaker activity probabilities obtained from each buffer position that covers this frame.
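The averaging over buffer positions can be sketched in frame units as follows. This is an illustration under simplifying assumptions: it handles a single speaker already mapped to a global identity, the latency is a multiple of the hop, and the name `aggregate_output` is ours.

```python
from typing import List, Sequence


def aggregate_output(recent: Sequence[Sequence[float]], hop: int, latency: int) -> List[float]:
    """Average overlapping buffer predictions for the next `hop` output frames.

    recent  : per-frame probabilities of the most recent buffers (oldest first),
              consecutive buffers being shifted by `hop` frames.
    latency : allowed latency in frames (multiple of `hop`, at most the buffer size).
    The output region starts `latency` frames before the end of the latest buffer.
    """
    size = len(recent[-1])
    if latency < hop or latency > size:
        raise ValueError("latency must lie between the hop and the buffer size")
    n_views = latency // hop          # number of buffer positions covering the region
    views = recent[-n_views:]         # fewer views are available during warm-up
    output = []
    for j in range(hop):
        # In the k-th most recent buffer, the region sits k hops further to the right.
        values = [
            buf[size - latency + k * hop + j]
            for k, buf in enumerate(reversed(views))
        ]
        output.append(sum(values) / len(values))
    return output
```

With a 5s buffer updated every 500ms, a latency of 500ms uses a single view of each frame, while a latency of 5s averages ten views, which is where the ensemble effect comes from.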
3 Experiments
3.1 Datasets
We ran experiments on three different datasets covering a wide range of domains and numbers of speakers.
DIHARD III [27, 25] does not provide a training set. Therefore, we split its development set into two parts: 192 files used as training set, and the remaining 62 files used as a smaller development set. The latter is simply referred to as the development set in the rest of the paper. When defining this split (shared at huggingface.co/pyannote/segmentation), we made sure that the 11 domains were equally distributed between both subsets. The test set is kept unchanged. We also report performance on DIHARD II [24] for comparison with FlexSTB [31].
VoxConverse does not provide a proper training set either [7]. Therefore, we also split its development set into two parts: the first 144 files (abjxc to qouur, in alphabetical order) constitute the training set, leaving the remaining 72 files (qppll to zyffh) for the actual development set. Furthermore, multiple versions of the VoxConverse test set have been circulating: we rely on version 0.0.2 available at github.com/joonson/voxconverse.
3.2 Implementation details
We use the pretrained segmentation model available at hf.co/pyannote/segmentation, which was trained on the composite training set made of the union of the AMI, DIHARD III, and VoxConverse training sets. It ingests 5 second audio chunks and outputs one prediction every 16ms for each speaker. More details about the training process can be found in [3]. The speaker embedding model is based on the canonical x-vector TDNN-based architecture [28], but with filter banks replaced by trainable SincNet features [23]. It was trained with additive angular margin loss [9] using chunks of variable duration (from 2 to 5 seconds) drawn from VoxCeleb [20, 8], augmented with reverberation based on impulse responses from EchoThief and [30], and additive background noise from MUSAN [29]. Its equal error rate on the VoxCeleb 1 test set is measured using cosine distance only. We share the pretrained model and more details about the training process at hf.co/pyannote/embedding. The weights used in the statistics pooling layer of the overlap-aware speaker embeddings depend on two parameters. Their values were not optimized with the rest of the hyper-parameters. Instead, we handpicked them based on examples like the one in Figure 3.
3.3 Experimental protocol
While the same pretrained segmentation and embedding models were used for all three datasets, we rely on their respective development sets to optimize the three hyper-parameters specifically for each dataset. More precisely, we use the pyannote.pipeline optimization toolkit that relies on a tree-structured Parzen estimator algorithm [2] to minimize the overall diarization error rate (DER), computed with pyannote.metrics [4] without any forgiveness collar and including overlapped speech regions. To ensure a fair comparison between different approaches, the optimization process is applied to each of them independently. In other words, every (row, dataset) entry in Table 1 results from one dedicated optimization process. This includes the offline topline, the proposed online approach and its ablative variants, but excludes both FlexSTB (as we unfortunately did not have access to its implementation) and experiments on DIHARD II (where we use the hyper-parameters tuned for DIHARD III).
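For reference, the diarization error rate is conventionally defined as the sum of three error durations divided by the total duration of reference speech. The formula below is the standard definition rather than one quoted from this paper, and the symbol names are ours.

```latex
\mathrm{DER} =
  \frac{T_{\text{false alarm}} + T_{\text{missed detection}} + T_{\text{speaker confusion}}}
       {T_{\text{reference speech}}}
```

Scoring overlapped regions means that a frame with two reference speakers counts twice in the denominator, so a system that outputs only one speaker there incurs missed detection for the other.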
4 Results and discussion
Table 1 summarizes the whole set of experiments.
Offline vs. online. We start by reporting the performance of a strong offline topline that consists of VBx [16] followed by the overlap-aware resegmentation step introduced in [3]. Because this latter resegmentation step relies on the exact same pretrained segmentation model as our proposed approach, most of the reported decrease in performance is caused by speaker confusion errors, on DIHARD III, AMI and VoxConverse alike. Incremental clustering still has a long way to go to be on par with offline multi-pass clustering.
Overlap-aware speaker embedding. The first ablative experiment shows that the overlap-aware weighing scheme introduced in Equation 2 brings a relative performance improvement of 10% on AMI, 18% on VoxConverse and 1% on DIHARD III. Given that they respectively contain 17%, 3%, and 11% of overlapped speech, there is still room for improvement on this particular aspect. In particular, while we handcrafted this weighing scheme, it should be possible to train the segmentation and speaker embedding models jointly for the latter to fully take advantage of the former’s capability at detecting and separating simultaneous speakers.
Overlap-aware speaker segmentation. In a second ablative experiment, we replace the segmentation model by an oracle that provides perfect binary (i.e. 0 or 1) overlap-aware segmentation. As expected, missed detection is where most of the difference occurs (caused by overlapped speech), while speaker confusion only marginally improves. The community has yet to solve the problem of overlapped speech detection.
Adjustable latency. Figure 5 shows how the performance of our online approach evolves as we decrease the allowed latency from 5s to 500ms. Speaker confusion error rate consistently increases as the latency decreases, while false alarm and missed detection remain constant. This can be explained by the ensemble-like aggregation process described in Section 2.3 that combines more views of the same problem as the allowed latency increases. Note that we kept the hyper-parameters optimized for a latency of 5s and still get reasonable performance for lower latencies. However, it is also possible to re-optimize the hyper-parameters for a specific latency. This is what we did for the 1s setting reported in Table 1 for comparison with FlexSTB [31]. Not only do we get better overall performance, but our approach also has the advantage of a lower memory footprint, as it never ingests nor runs inference on more than 5s of audio at a time (compared to 100s for FlexSTB) and keeps a single vector per speaker in memory (compared to 100s of acoustic features and per-speaker scores in FlexSTB). Furthermore, at one of the evaluated latencies, our approach reaches the same performance as the official offline baseline [26] of the DIHARD III challenge.
Figure 6 compares the performance over time of our online system against the offline topline [3]. While the performance of the latter remains somewhat constant, the former gets better as conversations unfold, almost bridging the gap after 5 minutes of conversation. As new information becomes available, our system learns better speaker centroids, hence decreasing speaker confusion error.
While very long conversations can become rather expensive (if not impossible) to process with most offline models, our system can handle daylong audio streams at a practically constant memory cost, while getting better and better.
We share an open-source implementation of this work, as well as expected outputs in RTTM format, at this address to facilitate future comparisons: github.com/juanmc2005/StreamingSpeakerDiarization.
Real time. Computation time for one step of the rolling buffer is 165ms on an Intel Cascade Lake 6248 CPU (20 cores at 2.5GHz) or 50ms on an Nvidia Tesla V100 SXM2 GPU. This is suitable for real-time applications, as the rolling buffer can be processed before its next update (every 500ms).
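As a quick check of the margin, the ratio between processing time and update period (often called the real-time factor, a term not used in the text above) can be computed from the reported figures:

```latex
\text{CPU: } \frac{165\,\text{ms}}{500\,\text{ms}} = 0.33
\qquad
\text{GPU: } \frac{50\,\text{ms}}{500\,\text{ms}} = 0.10
```

Both ratios are well below 1, so each buffer position is processed before the next one arrives.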
5 Conclusion
We have proposed an overlap-aware online speaker diarization system combining the end-to-end local segmentation of a 5 second long rolling buffer with incremental clustering. Apart from handling overlapping speech at every stage, our system benefits from an adjustable latency between 500ms and 5s. We show that our system outperforms FlexSTB [31] with a lower memory consumption, and that it is capable of bridging the gap to offline performance as conversations unfold. This last advantage may make it preferable to an offline system when recordings are long and resources are limited.
References
[1] (2012) Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20 (2), pp. 356–370.
[2] (2011) Algorithms for Hyper-Parameter Optimization. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), Vol. 24.
[3] (2021) End-to-end speaker segmentation for overlap-aware resegmentation. In Proc. Interspeech 2021.
[4] (2017) pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems. In Proc. Interspeech 2017, pp. 3587–3591.
[5] (2020) Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection. In Proc. ICASSP 2020.
[6] (2007) Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation 41 (2), pp. 181–190.
[7] (2020) Spot the Conversation: Speaker Diarisation in the Wild. In Proc. Interspeech 2020, pp. 299–303.
[8] (2018) VoxCeleb2: Deep Speaker Recognition. In Proc. Interspeech 2018, pp. 1086–1090.
[9] ArcFace: Additive Angular Margin Loss for Deep Face Recognition. pp. 4685–4694.
[10] (2018) BUT System for DIHARD Speech Diarization Challenge 2018. In Proc. Interspeech 2018, pp. 2798–2802.
[11] End-to-End Neural Speaker Diarization with Self-Attention. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 296–303.
[12] (2019) End-to-End Neural Speaker Diarization with Permutation-Free Objectives. In Proc. Interspeech 2019, pp. 4300–4304.
[13] (2020) End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. In Proc. Interspeech 2020, pp. 269–273.
[14] (2021) End-to-end speaker diarization as post-processing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[15] (2021) Integrating end-to-end neural and clustering-based diarization: getting the best of both worlds. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7198–7202.
[16] (2022) Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks. Computer Speech & Language 71, pp. 101254.
[17] (2020) BUT System for the Second DIHARD Speech Diarization Challenge. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6529–6533.
[18] LSTM Based Similarity Measurement with Spectral Clustering for Speaker Diarization. In Proc. Interspeech 2019, pp. 366–370.
[19] (2015) Integrating online i-vector extractor with information bottleneck based speaker diarization system. In Proc. Interspeech 2015.
[20] (2017) VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proc. Interspeech 2017, pp. 2616–2620.
[21] (2007) Efficient use of overlap information in speaker diarization. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pp. 683–686.
[22] (2019) Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap. IEEE Signal Processing Letters.
[23] (2018) Speaker Recognition from Raw Waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
[24] (2019) The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines. In Proc. Interspeech 2019, pp. 978–982.
[25] (2020) Third DIHARD Challenge Evaluation Plan. arXiv preprint arXiv:2006.05815.
[26] (2020) The Third DIHARD Diarization Challenge. arXiv preprint arXiv:2012.01477.
[27] (2020) The Third DIHARD Diarization Challenge. arXiv preprint arXiv:2012.01477.
[28] (2018) X-Vectors: Robust DNN Embeddings for Speaker Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
[29] (2015) MUSAN: A Music, Speech, and Noise Corpus. arXiv preprint arXiv:1510.08484.
[30] (2016) Statistics of natural reverberation enable perceptual separation of sound and space. Proceedings of the National Academy of Sciences 113 (48), pp. E7856–E7865.
[31] (2021) Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers. arXiv preprint arXiv:2101.08473.