Diarisation using location tracking with agglomerative clustering

09/22/2021
by Jeremy H. M. Wong, et al.

Previous works have shown that spatial location information can be complementary to speaker embeddings for a speaker diarisation task. However, the models used often assume that speakers are fairly stationary throughout a meeting. This paper proposes to relax this assumption, by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework. Kalman filters, which track the locations of speakers, are used to compute log-likelihood ratios that contribute to the cluster affinity computations for the AHC merging and stopping decisions. Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task, compared to methods that do not use location information or that make stationarity assumptions.



1 Introduction

Speaker diarisation aims to cluster together segments of audio that are uttered by the same speaker. This is useful in a rich meeting transcription task, where both the identity of a speaker and the words being said need to be hypothesised. Spectral [17] and k-means [24] clustering can be used for diarisation, after first estimating the number of clusters, by, for example, finding the maximum gap in a chosen statistic [28, 19]. Agglomerative Hierarchical Clustering (AHC) [25, 10] instead jointly estimates the cluster assignments and the number of clusters. The Hidden Markov Model (HMM) can also be used, either to compute merging scores within AHC [1], or on its own after having estimated the number of clusters [4, 13]. These methods often rely solely on features in the form of speaker embeddings, such as i-vectors [3], x-vectors [26], and d-vectors [29]. The speaker embeddings are intended to express information that is useful in discriminating between different speakers.

When multi-channel audio is available, it is possible to estimate the instantaneous location from which the sound originated. This information may be complementary to the speaker embeddings in the diarisation task. Previous works have investigated using time-delay-of-arrival [18, 30] and Sound Source Localisation (SSL) information [31], together with speaker embeddings in HMM clustering. There is also a diversity of methods to count and localise multiple speakers, without using speaker embeddings [16, 15, 20].

Speakers may move over the duration of a meeting. Explicitly modelling this movement may aid in diarisation. In multi-face tracking, Kalman filters are often used to track face movements from visual information [23, 6]. When multi-channel audio is available, acoustic location information has been shown to be complementary to visual information for face movement tracking [7]. The LOCATA challenge [5] has helped to spur the development of audio-only location tracking methods. Several of these approaches also rely on Kalman filters, to track the locations of a single [2, 21] or multiple [22] audio sources.

This paper proposes to perform diarisation, while modelling the movements of multiple speakers. It builds upon the works in [18, 30, 31], by tracking the movements of speakers, rather than assuming that speakers are stationary. It also extends upon the audio-only tracking methods, such as in [16, 15, 20, 22], by using both location information and speaker embeddings in diarisation. Diarisation is performed using AHC. Speaker movement is modelled as the likelihood of a sequence of instantaneous locations, computed using a Kalman filter. This is used together with a speaker embedding affinity score in the AHC cluster merging and stopping criteria.

2 von Mises Kalman filter tracking

The Kalman filter [11] can be used to model movement through location tracking. Using the Markov assumptions, the Kalman filter computes the likelihood of an observation sequence as

p(y_{1:T}) = ∫ p(x_1) p(y_1|x_1) ∏_{t=2}^{T} p(x_t|x_{t-1}) p(y_t|x_t) dx_{1:T}    (1)

where t is the frame index, T is the total number of frames, and y_t is an observed instantaneous location feature, whose possible forms are discussed in Section 2.1. In this paper, the hidden state, x_t, represents the estimated location of the speaker. In the future, it may be beneficial to also investigate modelling the velocity and higher time derivatives in the hidden state, as is often done in face tracking [23, 6].

In this paper, the speaker location is expressed as the horizontal angle around a microphone array. This is a continuous variable that is bounded within [-π, π) radians, with a periodic boundary condition. Previous works have satisfied these properties using von Mises and wrapped normal density functions [12]. The transition likelihood used in this paper is a von Mises density function,

p(x_t|x_{t-1}) = exp[κ_x cos(x_t - x_{t-1})] / [2π I_0(κ_x)]    (2)

where I_0 is the modified Bessel function of the first kind of order 0, and the concentration parameter, κ_x, expresses how tightly the density function is concentrated about the mean of x_{t-1}. A higher concentration yields a lower likelihood for rapidly changing sequences. The initial state likelihood, p(x_1), is set to a uniform density function, as there is complete uncertainty about where the speaker is before any observation is made.
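As a concrete illustration, the von Mises density in (2) can be evaluated directly. This is a minimal sketch; the function name `von_mises_pdf` and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(x, mu, kappa):
    """von Mises density over angles in radians, as in (2) with mean mu."""
    return np.exp(kappa * np.cos(x - mu)) / (2.0 * np.pi * i0(kappa))

# A higher concentration peaks the density more tightly about the mean,
# making large frame-to-frame jumps less likely.
grid = np.linspace(-np.pi, np.pi, 100000, endpoint=False)
mass = np.sum(von_mises_pdf(grid, 0.3, 4.0)) * (2.0 * np.pi / len(grid))  # ~1.0
```

The Riemann sum over the circle confirms that the density is properly normalised.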

2.1 Observation feature

Two forms of observed features are considered, namely the scalar Direction-Of-Arrival (DOA) and the full SSL vector. The SSL vector, s_t, is a categorical distribution, with each dimension representing the probability that the sound originated from the respective angular bin around the microphone array,

s_{tk} = P(b_t = k)    (3)

where k is the angular bin index and b_t is the angular bin from which the audio forming feature s_t may have originated. The SSL is computed using a complex angular central Gaussian model [9], as is described in [32]. The DOA, d_t, is computed as the mode of the SSL,

d_t = θ_{k*}, where k* = argmax_k s_{tk}    (4)

and θ_k is the angle in radians of the kth bin. Instead of the mode, it is also possible to estimate the DOA as the circular mean of the SSL, which can be computed using (10). However, initial tests did not suggest any significant difference between the performances of either form of DOA.
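The mode in (4) and the circular-mean alternative mentioned above can be sketched as follows; `doa_mode` and `doa_circular_mean` are hypothetical helper names, and the SSL vector is assumed to be a normalised histogram over the bin angles.

```python
import numpy as np

def doa_mode(ssl, angles):
    """DOA as the mode of the SSL, as in (4)."""
    return angles[np.argmax(ssl)]

def doa_circular_mean(ssl, angles):
    """DOA as the circular mean of the SSL, as in (10)."""
    return np.angle(np.sum(ssl * np.exp(1j * angles)))

# An SSL sharply peaked at one bin gives the same answer from both estimators.
angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)
ssl = np.zeros(360)
ssl[90] = 1.0
```

For broad or multi-modal SSL vectors the two estimators diverge, which is where the paper's observation of no significant difference becomes an empirical rather than a trivial statement.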

When using the DOA as the observed feature, the Kalman filter is used to compute p(d_{1:T}), by substituting the placeholder, y_t, with d_t in (1). The emission likelihood is chosen to be a von Mises density function,

p(d_t|x_t) = exp[κ_d cos(d_t - x_t)] / [2π I_0(κ_d)]    (5)

where the concentration parameter, κ_d, expresses the random noise in the observation.

The DOA only represents an estimate of the instantaneous angle of the speaker. However, interactions with the environment, noise, and the limited spatial resolution of the microphone array geometry may result in uncertainty in this estimation. The full SSL vector may express information about this uncertainty. The likelihood of an SSL sequence, p(s_{1:T}), can be computed by substituting y_t with s_t in (1), and using an emission likelihood in the form of a continuous categorical density function [8],

p(s_t|x_t) = C(λ_t) ∏_{k=1}^{K} λ_{tk}^{s_{tk}}    (6)

where K is the number of angular bins, C(λ_t) is the normalisation constant defined in [8], and the continuous categorical bin probabilities are computed as a discretised von Mises distribution about the mean of x_t,

λ_{tk} = exp[κ_s cos(θ_k - x_t)] / Σ_{k'=1}^{K} exp[κ_s cos(θ_{k'} - x_t)]    (7)

The continuous categorical emission likelihood can be interpreted as follows: if multiple samples are drawn when given the same x_t, then (6) computes the likelihood of observing each angular bin, k, at a fraction of s_{tk}, out of all of the samples. If a Dirichlet density function is used instead, by swapping the places of λ_{tk} and s_{tk} in (6), then the interpretation will be different, and the simplification of (8) will no longer be applicable. By taking the logarithm of (6), the emission log-likelihood can be seen to be a Kullback-Leibler (KL)-divergence between two categorical distributions, one being the observation, s_t, and the other being a prediction of the angular distribution, λ_t. That is to say, when the model predicts the angular state to be x_t, the model also predicts that the observed SSL should be similar to λ_t. The emission log-likelihood then measures a similarity score between the observed SSL, s_t, and the predicted SSL, λ_t. The work in [31] also uses an emission log-likelihood in the form of a KL-divergence between SSL vectors, for HMM diarisation.

However, the exact form of (6) presents a challenge, as the normalisation term C(λ_t) is difficult to compute in a numerically stable manner [8]. The approximation is therefore made that C(λ_t) is independent of x_t, thereby allowing C(λ_t) to be ignored when computing log-likelihood ratios for the AHC affinity scores, as will be described in Section 3. By ignoring C(λ_t) and substituting in (7), (6) can be simplified to

p(s_t|x_t) ∝ exp[κ̃_t cos(μ̃_t - x_t)] / Σ_{k'=1}^{K} exp[κ_s cos(θ_{k'} - x_t)]    (8)

where

κ̃_t = κ_s √[(Σ_{k=1}^{K} s_{tk} cos θ_k)² + (Σ_{k=1}^{K} s_{tk} sin θ_k)²]    (9)

and

μ̃_t = atan2(Σ_{k=1}^{K} s_{tk} sin θ_k, Σ_{k=1}^{K} s_{tk} cos θ_k)    (10)

The form of (8) is reminiscent of a von Mises density function. Therefore, by choosing to use a combination of a continuous categorical density function and a discretised von Mises distribution for the SSL emission likelihood in (6) and (7), the effective emission likelihood will also look similar to a von Mises density function, and at each frame can be completely summarised by its equivalent concentration, κ̃_t, and circular mean, μ̃_t. The concentration, κ̃_t, may weigh the contribution of each frame to the total log-likelihood proportionally to the sharpness of the SSL distribution.
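The per-frame summary in (9) and (10) amounts to taking the resultant vector of the SSL histogram. A sketch, with `ssl_summary` as an illustrative name and κ_s an assumed model parameter:

```python
import numpy as np

def ssl_summary(ssl, angles, kappa_s):
    """Equivalent concentration (9) and circular mean (10) of one SSL frame."""
    z = np.sum(ssl * np.exp(1j * angles))  # resultant vector of the SSL histogram
    return kappa_s * np.abs(z), np.angle(z)

angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)
sharp = np.zeros(360)
sharp[0] = 1.0                    # all mass in one bin
flat = np.full(360, 1.0 / 360.0)  # completely uncertain SSL
kappa_sharp, _ = ssl_summary(sharp, angles, 10.0)  # recovers kappa_s itself
kappa_flat, _ = ssl_summary(flat, angles, 10.0)    # near zero
```

A sharp SSL yields a resultant of length near 1, so the frame keeps the full concentration κ_s; a flat SSL yields a resultant near 0, so that frame contributes almost nothing, which is exactly the weighting behaviour described above.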

Figure 1: Denominator term of discretised von Mises distribution (7)

However, unlike a von Mises density function, the denominator in (7) and (8) depends on x_t, which is inconvenient for the forward recursion integral described in Section 2.2. Figure 1 plots the denominator, Σ_{k'=1}^{K} exp[κ_s cos(θ_{k'} - x_t)], over a variety of κ_s and K values. It can be seen that the denominator is approximately independent of x_t, except when the concentration, κ_s, is large at the same time as the number of angular bins, K, is small. The experiments in this paper do not operate in such a regime. Therefore, it may be reasonable to approximate the denominator as being independent of x_t. This allows the SSL emission likelihood to be expressed as

p(s_t|x_t) ≈ exp[κ̃_t cos(μ̃_t - x_t)] / [2π I_0(κ̃_t)]    (11)

which has the form of a von Mises density function over x_t.

2.2 Likelihood computation

The Kalman filter can be used to compute the log-likelihood of a DOA or SSL observation sequence, log p(y_{1:T}). This log-likelihood can be used as a score in the AHC merging and stopping criteria, as will be discussed in Section 3. The likelihood can be computed using the Kalman filter forward recursion,

p(x_t|y_{1:t}) = p(y_t|x_t) ∫ p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1} / p(y_t|y_{1:t-1})    (12)

This can be broken down into the prediction step,

p(x_t|y_{1:t-1}) = ∫ p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1}    (13)

and the update step,

p(x_t|y_{1:t}) = p(y_t|x_t) p(x_t|y_{1:t-1}) / p(y_t|y_{1:t-1})    (14)

In this paper, the transition likelihood in (2) and the emission likelihoods in (5) and (11) all have the form of von Mises density functions in terms of the random variable x_t. The prediction step in (13) is a convolution operation. Unfortunately, the von Mises density function is not closed under convolution, but instead the result takes a form described in [14]. However, it has been shown that the result of the convolution can be closely approximated by a von Mises density function [27], thereby allowing (13) to be expressed as

p(x_t|y_{1:t-1}) ≈ exp[κ̂_t cos(μ̂_t - x_t)] / [2π I_0(κ̂_t)]    (15)

The prediction concentration is

κ̂_t = A⁻¹(A(κ_x) A(κ_{t-1}))    (16)

and the prediction mean is

μ̂_t = μ_{t-1}    (17)

where A⁻¹ is the functional inverse of

A(κ) = I_1(κ) / I_0(κ)    (18)

which can be solved for using the Newton-Raphson root-finding algorithm, and μ_{t-1} and κ_{t-1} are the parameters of the update-step von Mises density function from the previous frame. In the prediction step, the concentration is broadened from the previous frame, through the inclusion of A(κ_x) in (16). The mean in (17) does not change, because higher temporal derivatives of the angle are not modelled in the hidden state.

The von Mises density function is closed under multiplication. Therefore, by using the approximate prediction in (15), the update step of (14) becomes

p(x_t|y_{1:t}) ≈ exp[κ_t cos(μ_t - x_t)] / [2π I_0(κ_t)]    (19)

When using DOA observations with an emission likelihood of (5), the update concentration is

κ_t = √[κ̂_t² + κ_d² + 2 κ̂_t κ_d cos(d_t - μ̂_t)]    (20)

and the update mean is

μ_t = atan2(κ̂_t sin μ̂_t + κ_d sin d_t, κ̂_t cos μ̂_t + κ_d cos d_t)    (21)

When instead using SSL observations with an emission likelihood of (11), the update concentration is

κ_t = √[κ̂_t² + κ̃_t² + 2 κ̂_t κ̃_t cos(μ̃_t - μ̂_t)]    (22)

and the update mean is

μ_t = atan2(κ̂_t sin μ̂_t + κ̃_t sin μ̃_t, κ̂_t cos μ̂_t + κ̃_t cos μ̃_t)    (23)

The mean updates of (21) and (23) are weighted circular averages between the prediction mean and the observation. This serves to bring μ_t closer to the current observation.
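The update equations (19)-(21) follow from multiplying two von Mises densities and reading off the resultant of the two concentration-weighted unit vectors. A sketch with illustrative names, where `obs_mu`/`obs_kappa` stand for either the DOA pair (d_t, κ_d) or the SSL pair (μ̃_t, κ̃_t):

```python
import numpy as np

def update(mu_pred, kappa_pred, obs_mu, obs_kappa):
    """Update step (19)-(23): the product of two von Mises densities is again
    von Mises-shaped; summing the concentration-weighted unit vectors gives
    the posterior mean and concentration."""
    c = kappa_pred * np.cos(mu_pred) + obs_kappa * np.cos(obs_mu)
    s = kappa_pred * np.sin(mu_pred) + obs_kappa * np.sin(obs_mu)
    return np.arctan2(s, c), np.hypot(c, s)

# When prediction and observation agree, concentrations add and the mean is kept.
mu_post, kappa_post = update(0.0, 2.0, 0.0, 3.0)
```

When the two disagree, the posterior mean lands between them, weighted by the concentrations, which is the weighted circular average described above.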

After having computed the prediction density function through the forward recursion, the log-likelihood of the observation sequence can then be computed as

log p(y_{1:T}) = Σ_{t=1}^{T} log ∫ p(y_t|x_t) p(x_t|y_{1:t-1}) dx_t    (24)

When computing the log-likelihood, there is no need to preserve the von Mises form, as only a point estimate is needed. Therefore, the exact convolution [14] can be used, which for DOA observations is

p(d_t|d_{1:t-1}) = I_0(√[κ_d² + κ̂_t² + 2 κ_d κ̂_t cos(d_t - μ̂_t)]) / [2π I_0(κ_d) I_0(κ̂_t)]    (25)

When using SSL observations, the exact emission likelihood of (6) is difficult to use, because of the numerical instability of the normalisation term. Therefore, the experiments in this paper approximate the SSL observation sequence log-likelihood using the same form as (25), with κ_d and d_t being substituted with κ̃_t and μ̃_t respectively. This approximation again ignores the denominator terms in both (6) and (7), and re-normalises (11) over x_t.
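The per-frame term of (24), using the exact convolution (25), is best computed in log-space for stability. A sketch, where `i0e` (the exponentially scaled Bessel function) is used to avoid overflow at large concentrations; the function names are illustrative.

```python
import numpy as np
from scipy.special import i0e

def log_i0(x):
    """Numerically stable log I0(x), via i0e(x) = exp(-x) * I0(x)."""
    return np.log(i0e(x)) + x

def frame_log_likelihood(d, mu_pred, kappa_pred, kappa_d):
    """log p(d_t | d_{1:t-1}) from the exact vM convolution in (25)."""
    r = np.sqrt(kappa_d ** 2 + kappa_pred ** 2
                + 2.0 * kappa_d * kappa_pred * np.cos(d - mu_pred))
    return log_i0(r) - np.log(2.0 * np.pi) - log_i0(kappa_d) - log_i0(kappa_pred)

# Sanity check: (25) is a proper density over the circle, so it integrates to 1.
grid = np.linspace(-np.pi, np.pi, 100000, endpoint=False)
mass = np.sum(np.exp(frame_log_likelihood(grid, 0.5, 2.0, 3.0))) * (2.0 * np.pi / len(grid))
```

Summing these per-frame terms over the frames that have observations gives the total sequence log-likelihood of (24).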

A speaker may have discontiguous regions of speech, since a speaker may not speak continuously throughout a whole meeting. As such, when modelling a speaker’s movement with a Kalman filter, there may be frames for which the speaker has no DOA or SSL observation. This is analogous to an occlusion in the visual face tracking task. For such frames, y_t is undefined and the emission likelihood is simply set to p(y_t|x_t) = 1. The update step for these frames simplifies from (14) to

p(x_t|y_{1:t}) = p(x_t|y_{1:t-1})    (26)

When computing the conditional log-likelihood for such frames, using p(y_t|x_t) = 1 in (24) yields log p(y_t|y_{1:t-1}) = 0. Thus these frames, without observations, do not contribute to the total observation sequence log-likelihood.

In the setup used in this paper, diarisation is performed after speech separation, to handle overlapped speech. As such, it is possible to encounter situations where the observation sequence may have frames from multiple separated channels that overlap in time. Let y_t^(1) and y_t^(2) be observations at frame index t from channels 1 and 2 respectively. During the update step in (14) and the likelihood computation in (24), the emission likelihood p(y_t|x_t) is simply substituted with p(y_t^(1)|x_t) p(y_t^(2)|x_t), where it is assumed that the parallel observations of y_t^(1) and y_t^(2) are conditionally independent of each other when given x_t. As a reminder, the product of multiple von Mises density functions also has the form of a von Mises density function.

2.3 Parameter estimation

The Kalman filter has two parameters, κ_x and κ_d. These express the dynamic ranges of the speaker’s movement speed and the random noise in the observation respectively. One possible method of estimating them is to maximise the log-likelihood of the observation sequence,

{κ_x, κ_d} = argmax_{κ_x, κ_d} log p(d_{1:T})    (27)

using the Expectation-Maximisation (EM) algorithm.

The E-step requires the computation of the state posteriors,

p(x_t|y_{1:T}) ∝ p(x_t|y_{1:t}) p(y_{t+1:T}|x_t)    (28)

which is a product between the forward and backward density functions. The backward density function can be expressed as

p(y_{t+1:T}|x_t) = ∫ p(x_{t+1}|x_t) p(y_{t+1}|x_{t+1}) p(y_{t+2:T}|x_{t+1}) dx_{t+1}    (29)

This can be computed using the product and approximate convolution of von Mises density functions, described in Section 2.2. An analogous expression can also be derived for the joint state posterior, p(x_{t-1}, x_t|y_{1:T}), which is omitted here for brevity. For frames that do not have an observation of the speaker’s location, p(y_{t+1}|x_{t+1}) = 1 in (29).

The M-step update for κ_x with DOA observations is

A(κ_x^(i+1)) = 1/(T-1) Σ_{t=2}^{T} ∫∫ cos(x_t - x_{t-1}) p(x_{t-1}, x_t|d_{1:T}) dx_{t-1} dx_t    (30)

and that for κ_d is

A(κ_d^(i+1)) = 1/T Σ_{t=1}^{T} ∫ cos(d_t - x_t) p(x_t|d_{1:T}) dx_t    (31)

where i is the EM iteration index. Monte Carlo approximations can be used to compute the integrals in (30) and (31).

Estimating the parameters by maximising the likelihood only makes sense when using DOA observations. When using SSL observations, the normalising terms in (6) and (7) are dependent on the parameters being optimised. Therefore a naive maximum likelihood optimisation with the previously described SSL observation approximations, of omitting the normalising terms, may not converge. In the experiments presented in this paper, κ_x and κ_d were optimised for DOA observations, and the same parameter values were then also used for SSL observations.

The approximate convolution used in the forward and backward recursions yields E-step posteriors and M-step updates that are also approximate. Therefore, there is no guarantee that the observation sequence log-likelihood will not worsen at each iteration. Furthermore, there is no guarantee that the parameters will converge to locally optimal values. However, the convolution approximation is needed to allow the density functions to remain closed within the von Mises family, and simplify the mathematics for the recursions. In the future, it may be interesting to investigate parameter estimation methods with fewer approximations.

3 Agglomerative hierarchical clustering

Speaker diarisation can be performed using AHC. The aim is to cluster together segments that belong to the same speaker. AHC begins by treating each segment as a separate cluster. At each iteration, the two clusters with the highest affinity score are merged in a greedy manner. The merging iterations continue until the maximum remaining affinity falls below a threshold. The work in [1] uses the Bayesian Information Criterion (BIC) as the affinity and computes the observation sequence likelihoods for the clusters through a HMM with Gaussian mixture model emission likelihoods. The BIC allows the model complexity to remain constant through all AHC iterations, thereby alleviating any favouritism toward having more speakers. However, the stopping threshold for the BIC can be difficult to tune robustly in practice. Furthermore, each AHC merger iteration requires an optimisation of the HMM to be run using the EM algorithm, which can be computationally expensive.

The setup in this paper instead uses a simpler AHC formulation that follows [31], where the affinity is computed as a cosine similarity between the speaker embedding centroids of the clusters. Given two clusters, i and j, with speaker embedding centroids ē_i and ē_j, normalised to unit length, the affinity is

a_emb(i, j) = ē_iᵀ ē_j    (32)

This measures a similarity between the speaker embeddings of the two clusters. These embeddings are often extracted using a model that is trained on a speaker identification or speaker verification task. The embeddings are therefore expected to express characteristics of the audio that are useful in discriminating one speaker from another. This affinity does not consider the locations or movements of the speakers.
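The speaker-embedding affinity in (32) can be sketched as below; `embedding_affinity` is an illustrative name, and the centroid is assumed to be the mean of the cluster's embeddings before length-normalisation.

```python
import numpy as np

def embedding_affinity(embs_i, embs_j):
    """Cosine similarity between the speaker-embedding centroids of two
    clusters, as in (32); each argument is an array of embedding vectors."""
    ci = np.mean(embs_i, axis=0)
    cj = np.mean(embs_j, axis=0)
    return float(ci @ cj / (np.linalg.norm(ci) * np.linalg.norm(cj)))

same = embedding_affinity([[1.0, 0.0]], [[1.0, 0.0]])        # identical speakers
orthogonal = embedding_affinity([[1.0, 0.0]], [[0.0, 1.0]])  # dissimilar speakers
```

The score lives in [-1, 1], a point that matters later when this affinity is interpolated with location-based scores of a different dynamic range.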

In order to use location information, it is proposed in [31] to use a similarity score that is a KL-divergence between SSL vectors, within a HMM diarisation framework. This paper proposes as a baseline that this score can also be used with AHC, by computing the SSL contribution to the affinity as

a_SSL(i, j) = -[KL(s̄_i ∥ s̄_j) + KL(s̄_j ∥ s̄_i)]    (33)

where s̄_i is the SSL centroid of cluster i. Unlike in [31], the symmetric KL-divergence is used here. However, this affinity averages the SSL vectors over time when computing the centroid, and may therefore not explicitly model the movement of the speakers.
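The symmetric-KL baseline affinity (33) might look as follows; `ssl_affinity` is an illustrative name, and the small epsilon for numerical safety is an assumption not specified by the paper.

```python
import numpy as np

def ssl_affinity(ssls_i, ssls_j, eps=1e-10):
    """Negative symmetric KL-divergence between cluster SSL centroids, as in
    (33); 0 means identical angular distributions, more negative = less similar."""
    p = np.mean(ssls_i, axis=0) + eps
    q = np.mean(ssls_j, axis=0) + eps
    p, q = p / p.sum(), q / q.sum()
    return -float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

identical = ssl_affinity([[0.7, 0.2, 0.1]], [[0.7, 0.2, 0.1]])
different = ssl_affinity([[0.9, 0.05, 0.05]], [[0.05, 0.05, 0.9]])
```

Note the range (-∞, 0], quite unlike the cosine similarity of (32); averaging the SSL vectors into a single centroid is also what discards the temporal movement information.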

This paper proposes to model speaker movements within AHC by computing an affinity based on the log-likelihood ratio between merged clusters and separated clusters, where the log-likelihood is computed using the Kalman filter,

a_KF(i, j) = [log p(y^(i∪j)) - log p(y^(i)) - log p(y^(j))] / (N_i + N_j)    (34)

where y^(i) denotes the observation sequence spanning from the first to the last frame index of cluster i, y^(i∪j) spans the merged cluster, and N_i is the number of frames that have a DOA or SSL observation in cluster i. The normalisation by the number of frames with observations is necessary, as the log-likelihood from (24) scales linearly with the number of frames with observations. In the future, it may be interesting to explore BIC equivalents for this affinity, to take into account the model complexity.

The affinities a_SSL and a_KF express the cluster similarity based on the speakers’ angular locations, while a_emb expresses the cluster similarity based on the characteristics of the audio that are useful in discriminating one speaker from another. The location affinities alone may not be sufficient for the clustering task, as it is possible that multiple speakers may overlap in their angular locations, or even in their movements. However, an affinity based on location may be complementary to one based on the speaker’s acoustic discriminative characteristics. Therefore, it may be useful to interpolate a_emb together with either a_SSL or a_KF.
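The greedy AHC loop that consumes these affinities can be sketched generically. This is a minimal illustration, not the paper's implementation: `affinity` stands for any callable scoring two clusters, such as an interpolation of the speaker-embedding score with a location score.

```python
import numpy as np

def ahc(segments, affinity, threshold):
    """Greedy agglomerative hierarchical clustering: repeatedly merge the
    highest-affinity pair of clusters until the best remaining affinity
    falls below the stopping threshold."""
    clusters = [[s] for s in segments]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = affinity(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # stopping criterion: no sufficiently similar pair remains
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy example: 1-D "segments" clustered by closeness of their means.
result = ahc([0.0, 0.1, 5.0, 5.1],
             lambda a, b: -abs(np.mean(a) - np.mean(b)),
             threshold=-1.0)
```

In this toy run the two nearby pairs merge and the loop then stops, leaving two clusters; in the paper the same loop jointly determines the cluster assignments and the number of speakers.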

4 Meeting transcription setup

A rich meeting transcription task was used to evaluate the proposed approach. The setup followed that described in [32, 31]. Audio from a microphone array was beamformed and separated into multiple channels, where it was assumed that there were no concurrent speakers within each channel. On each channel, voice activity detection and speech recognition were run. The non-silence segments were then further split into speaker-pure segments, using speaker change detection at word boundaries. Each of these segments might contain one or more words. Diarisation was then run to cluster and tag the resulting segments from all channels together. For each segment, a speaker embedding was extracted using the model described in [33]. DOA and SSL features were also extracted, as described in Section 2.1. AHC was used to cluster the segments from the same speaker together, using a combination of one or more of the affinity measures described in Section 3. Finally, the Hungarian algorithm was used to tag the clusters, by finding an optimal mapping between the clusters and the enrolled speakers, based on the similarity of the speaker embeddings. The AHC affinity a_SSL in (33) was computed using a single SSL vector per cluster that represents the centroid, while a_KF in (34) used a sequence of DOA or SSL features with a duration and shift of 0.4s.

Feature type dev speaker-attributed WER (%)
DOA 22.71
SSL 22.46
Table 1: Comparison of Kalman filter location feature types

5 Experiments

Experiments were performed on audio collected from internal Microsoft meetings, lasting up to 1 hour each, with an average of 7 active participants each. The dev set comprised 51 meetings totalling 23 hours, while the eval set comprised 60 meetings totalling 35 hours. The speaker embeddings had 128 dimensions, while the SSL vectors had 360 dimensions. The AHC stopping criterion and affinity interpolation weights were tuned on the dev set using parameter sweeps. The Kalman filter parameters were optimised on the dev set using the EM algorithm, as is described in Section 2.3. The performance was measured using the speaker-attributed Word Error Rate (WER) [32]. This first computed the WER separately for each speaker, by comparing the hypothesis to the reference for that speaker, then the WERs were averaged over all speakers. The speaker-attributed WER expresses a combination of the speech recognition and diarisation performances, both of which are important for the rich meeting transcription task.

Table 1 compares computing the log-likelihood ratio using either DOA or SSL features in the Kalman filter, on the dev set. In both cases, the location tracking affinity a_KF in (34) was interpolated with the speaker embedding affinity a_emb in (32). The results suggest that using the full SSL vectors as the observed location features in the Kalman filter performs better than using DOA features. Each SSL vector is summarised by a circular mean angle in (10) and an equivalent concentration in (9). The concentration may contain information about the certainty of the instantaneous location estimation, which is not expressed in the DOA features. This weighs the contribution of each frame to the total sequence log-likelihood. The remaining experiments used the SSL features in the Kalman filter.

Speaker-attributed WER (%)
Test set  Affinity        stationary  moving  average
dev       a_emb           22.61       27.19   25.42
          a_emb + a_SSL   22.05       27.55   25.43
          a_emb + a_KF    21.10       23.32   22.46
eval      a_emb           25.32       20.39   23.65
          a_emb + a_SSL   24.73       19.83   23.06
          a_emb + a_KF    23.65       20.15   22.40
Table 2: Usefulness of modelling movement for diarisation

The use of the Kalman filter log-likelihood ratio, a_KF in (34), within the affinity for AHC can be compared against the baselines of using the speaker embedding affinity a_emb in (32) and the KL-divergence-based instantaneous location affinity a_SSL in (33). The meetings were categorised into those with and without moving speakers. A meeting was considered to contain moving speakers if, for at least one speaker in the meeting, it was possible to find two disjoint angular arcs, each of at least a minimum width in radians, where that speaker spent at least 30s of active speech in each of the two angular regions not covered by these two arcs, based on manually transcribed location information from video data. The performances of the various AHC affinities on the stationary and moving meetings are shown in Table 2. Interpolating a_SSL with a_emb improves the performance for both stationary and moving meetings on the eval set, and the stationary meetings on the dev set, compared against using a_emb alone. This agrees with [31] in suggesting that location information may be complementary to the speaker embeddings for the clustering task. Initial tests suggested that it may be difficult to robustly tune the AHC stopping criterion and the affinity interpolation weights between a_emb and a_SSL, as the cosine similarity in a_emb has a dynamic range between -1 and 1, while that for the KL-divergence in a_SSL is between -∞ and 0. The Kalman filter log-likelihood ratio a_KF outperforms a_SSL on the moving meetings in the dev set, but not the eval set. This may again suggest that it can be difficult to robustly tune the hyper-parameters for the interpolated affinity to generalise well to new data, since a_emb and a_KF again have different dynamic ranges. Despite this, a_KF consistently yields improvements for the moving meetings compared to using a_emb alone. In both datasets, a_KF is also able to yield consistent improvements over a_SSL when averaged over all meetings, thereby suggesting that there may be a benefit to explicitly modelling speaker movements when performing diarisation.

6 Conclusion

This paper has presented an approach to explicitly model the spatial movements of speakers while performing diarisation. The movements are modelled through location tracking, using a Kalman filter with von Mises density functions as the transition and emission likelihoods. This Kalman filter is used to compute log-likelihood ratios between different cluster merging hypotheses in AHC. The results suggest that explicitly modelling the movements of speakers may provide information that is complementary to the speaker embeddings for the diarisation task.

References

  • [1] J. Ajmera and C. Wooters (2003-11) A robust speaker clustering algorithm. In ASRU, St. Thomas, US Virgin Islands, pp. 411–416. Cited by: §1, §3.
  • [2] D. Bechler, M. Grimm, and K. Kroschel (2003-05) Speaker tracking with a microphone array using Kalman filtering. Advances in Radio Science 1, pp. 113–117. Cited by: §1.
  • [3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011-05) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798. Cited by: §1.
  • [4] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Černocký (2019-09) Bayesian HMM based x-vector clustering for speaker diarization. In Interspeech, Graz, Austria, pp. 346–350. Cited by: §1.
  • [5] C. Evers, H. W. Löllmann, H. Mellmann, A. Schmidt, H. Barfuss, P. A. Naylor, and W. Kellermann (2020-04) The LOCATA challenge: acoustic source localization and tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1620–1643. Cited by: §1.
  • [6] J. Foytik, P. Sankaran, and V. Asari (2011) Tracking and recognizing multiple faces using Kalman filter and ModularPCA. Procedia Computer Science 6, pp. 256–261. Cited by: §1, §2.
  • [7] I. D. Gebru, S. Ba, G. Evangelidis, and R. Horaud (2015-12) Tracking the active speaker based on a joint audio-visual observation model. In ICCVW, Santiago, Chile, pp. 702–708. Cited by: §1.
  • [8] E. Gordon-Rodriguez, G. Loaiza-Ganem, and J. P. Cunningham (2020-07) The continuous categorical: a novel simplex-valued exponential family. In ICML, pp. 3637–3647. Cited by: §2.1, §2.1.
  • [9] N. Ito, S. Araki, and T. Nakatani (2016-08) Complex angular central Gaussian mixture model for directional statistics in mask-based microphone array signal processing. In EUSIPCO, Budapest, Hungary, pp. 1153–1157. Cited by: §2.1.
  • [10] H. Jin, F. Kubala, and R. Schwartz (1997-02) Automatic speaker clustering. In DARPA Speech Recognition Workshop, Chantilly, USA. Cited by: §1.
  • [11] R. E. Kalman (1960-03) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45. Cited by: §2.
  • [12] G. Kurz, I. Gilitschenski, and U. D. Hanebeck (2016-03) Recursive Bayesian filtering in circular state spaces. IEEE Aerospace and Electronic Systems Magazine 31 (3), pp. 70–87. Cited by: §2.
  • [13] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, A. Silnova, O. Plchot, O. Novotný, H. Zeinali, and J. Rohdin (2020-05) BUT system for the second DIHARD speech diarization challenge. In ICASSP, Barcelona, Spain, pp. 6529–6533. Cited by: §1.
  • [14] K. V. Mardia and P. E. Jupp (1999-01) Directional statistics. John Wiley and Sons. Cited by: §2.2, §2.2.
  • [15] J. McDonough, K. Kumatani, T. Arakawa, K. Yamamoto, and B. Raj (2013-05) Speaker tracking with spherical microphone arrays. In ICASSP, Vancouver, Canada, pp. 3981–3985. Cited by: §1, §1.
  • [16] M. Murase, S. Yamamoto, J.-M. Valin, K. Nakadai, K. Yamada, K. Komatani, T. Ogata, and H. G. Okuno (2005-09) Multiple moving speaker tracking by microphone array on mobile robot. In Interspeech, Lisbon, Portugal, pp. 249–252. Cited by: §1, §1.
  • [17] H. Ning, M. Liu, H. Tang, and T. Huang (2006-09) A spectral clustering approach to speaker diarization. In ICSLP, Pittsburgh, USA, pp. 2178–2181. Cited by: §1.
  • [18] J. M. Pardo, X. Anguera, and C. Wooters (2007-09) Speaker diarization for multiple-distant-microphone meetings using several sources of information. IEEE Transactions on Computers 56 (9), pp. 1212–1224. Cited by: §1, §1.
  • [19] T. J. Park, K. J. Han, M. Kumar, and S. Narayanan (2019-12) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381–385. Cited by: §1.
  • [20] A. Plinge and G. A. Fink (2014-05) Multi-speaker tracking using multiple distributed microphone arrays. In ICASSP, Florence, Italy, pp. 614–618. Cited by: §1, §1.
  • [21] D. Salvati, C. Drioli, and G. L. Foresti (2018-09) Localization and tracking of an acoustic source using a diagonal unloading beamforming and a Kalman filter. In LOCATA Challenge Workshop, Tokyo, Japan. Cited by: §1.
  • [22] C. Segura, A. Abad, J. Hernando, and C. Nadeu (2007-05) Multispeaker localization and tracking in intelligent environments. In CLEAR2007 and RT2007, Baltimore, USA, pp. 82–90. Cited by: §1, §1.
  • [23] Z. Shaik and V. Asari (2007-10) A robust method for multiple face tracking using Kalman filter. In AIPR, Washington DC, USA, pp. 125–130. Cited by: §1, §2.
  • [24] S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, and J. Glass (2011-08) Exploiting intra-conversation variability for speaker diarization. In Interspeech, Florence, Italy, pp. 945–948. Cited by: §1.
  • [25] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern (1997-02) Automatic segmentation, classification and clustering of broadcast news audio. In DARPA Speech Recognition Workshop, Chantilly, USA, pp. 97–99. Cited by: §1.
  • [26] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018-04) X-vectors: robust DNN embeddings for speaker recognition. In ICASSP, Calgary, Canada, pp. 5329–5333. Cited by: §1.
  • [27] M. A. Stephens (1963-12) Random walk on a circle. Biometrika 50 (3-4), pp. 385–390. Cited by: §2.2.
  • [28] R. Tibshirani, G. Walther, and T. Hastie (2002-01) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63 (2), pp. 411–423. Cited by: §1.
  • [29] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez (2014-05) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP, Florence, Italy, pp. 4052–4056. Cited by: §1.
  • [30] D. Vijayasenan and F. Valente (2012-03) Speaker diarization of meetings based on large TDOA feature vectors. In ICASSP, Kyoto, Japan, pp. 4173–4176. Cited by: §1, §1.
  • [31] J. H. M. Wong, X. Xiao, and Y. Gong (2021-06) Hidden Markov model diarisation with speaker location information. In ICASSP, Toronto, Canada, pp. 7158–7162. Cited by: §1, §1, §2.1, §3, §3, §4, §5.
  • [32] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, and T. Zhou (2019-12) Advances in online audio-visual meeting transcription. In ASRU, Singapore, pp. 276–283. Cited by: §2.1, §4, §5.
  • [33] T. Zhou, Y. Zhao, and J. Wu (2021-01) ResNeXt and Res2Net structures for speaker verification. In SLT, Shenzhen, China, pp. 301–307. Cited by: §4.