Joint speaker diarisation and tracking in switching state-space model

by Jeremy H. M. Wong, et al.

Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations from which the sounds originate can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeting. This paper relaxes this assumption by proposing to explicitly track the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.








1 Introduction

Speaker diarisation is the task of clustering segments of audio that are uttered by the same speaker. This can be used with speech recognition to provide rich transcriptions of audio, expressing both words and speaker identities. The task of diarisation can be broken down into counting the number of clusters and clustering the audio segments. By treating these sub-tasks separately, the number of clusters can first be estimated by finding the maximum gap in a chosen statistic [33, 25], then the segments can be clustered using either k-means [31] or spectral clustering. Alternatively, both sub-tasks can be performed in unison in the Agglomerative Hierarchical Clustering (AHC) framework [32, 16]. This iteratively performs greedy merging of clusters based on a measured affinity, until a stopping criterion is reached. A Hidden Markov Model (HMM) can capture information about the temporal nature of speech, which may be useful for diarisation. The HMM can either be used within AHC in the computation of the affinities [1], or on its own after being given an upper bound on the number of clusters [4, 19].

Diarisation is often performed using only speaker embeddings, which are extracted using models that are trained to discriminate between speakers through a speaker identification or speaker verification task. Information about the locations of the speakers may be complementary to the speaker embeddings. Such location information is available when using a microphone array. In the HMM framework, location information can be incorporated by using the speaker embeddings together with either time-delay-of-arrival [24, 34] or Sound Source Localisation (SSL) [35] features as the observations. In these works, the HMM state only encodes information about the identities of the speakers, and does not keep track of where each speaker is at each point in time. This therefore may not explicitly model the movements of speakers, and may assume that speakers are fairly stationary throughout a meeting.

In the vision domain, multi-face tracking can be achieved using separate Kalman filters to track the movements of each face [30, 7]. When using a microphone array, localisation information from the audio has been shown to be complementary to visual information for face tracking [9]. In the audio-only scenario, challenges such as LOCATA [6] help to spur the development of audio localisation and tracking methods. Several of these methods also rely on Kalman or particle filtering techniques, to track the locations of a single [2, 28, 20] or multiple [29] audio sources. When tracking multiple audio sources, multi-target extensions of probabilistic data association provide a framework to estimate which observations belong to each of the targets being tracked [10]. However, when used with multiple speakers [22, 21, 26], these tracking methods often only rely on location information, and not speaker embeddings.

This paper proposes to track speaker movements jointly with performing diarisation, while also using speaker embeddings. It is hoped that explicitly modelling the movements of speakers may be beneficial to the diarisation task. A switching state-space model [11] is proposed, which performs joint modelling through a hidden state that encodes both the active speaker identity and the current locations of each of the speakers. This model is implemented using a particle filter framework to accommodate the forms of transition and emission likelihoods that are used. The model is referred to as the Switching State-space Particle Filter (SSPF).

2 Joint clustering and location tracking

The HMM that is used for diarisation often encodes the current active speaker as the hidden state. In the work in [35], the HMM computes the observation sequence likelihood as

$$p\left(\mathbf{x}_{1:T}, \mathbf{s}_{1:T}\right) = \sum_{c_{1:T}} \prod_{t=1}^{T} P\left(c_t \mid c_{t-1}\right) p\left(\mathbf{x}_t \mid c_t\right) p\left(\mathbf{s}_t \mid c_t\right) \quad (1)$$

where $\mathbf{x}_t$ and $\mathbf{s}_t$ are the speaker embedding and SSL features respectively at frame $t$, $T$ is the number of frames, and $c_t$ is the discrete hidden state that encodes the active speaker identity. The initial state probability is omitted here for brevity. In this formulation, it is not possible to infer where each speaker is at each point in time. Thus, the model does not explicitly capture the movements of speakers.

In order to track speaker movements, this paper proposes to encode the current active speaker identity as well as the current locations of all of the speakers in the hidden state. Furthermore, multiple concurrent active speakers are allowed, to accommodate overlapping speech. Speech separation is applied to the microphone array audio, forming $C$ channels, each without concurrent speakers. The SSPF simultaneously models all $C$ channels. In contrast, [35] merges the channels into a single stream. The SSPF hidden state is defined as

$$z_t = \left\{ c_t^{(1)}, \ldots, c_t^{(C)}, \omega_t^{(1)}, \ldots, \omega_t^{(N)} \right\} \quad (2)$$

where $c_t^{(l)}$ is a discrete variable representing the active speaker at frame $t$ in channel $l$, $\omega_t^{(n)}$ represents the angular location in radians around the microphone array at frame $t$ for speaker $n$, and $N$ is the number of speakers. Using the same Markov assumptions as (1), the observation sequence likelihood is computed as

$$p\left(\mathbf{x}_{1:T}^{(1:C)}, \mathbf{o}_{1:T}^{(1:C)}\right) = \sum_{c_{1:T}^{(1:C)}} \int \prod_{t=1}^{T} p\left(z_t \mid z_{t-1}\right) p\left(\mathbf{x}_t^{(1:C)} \mid z_t\right) p\left(\mathbf{o}_t^{(1:C)} \mid z_t\right) \, d\omega_{1:T}^{(1:N)} \quad (3)$$

where $\mathbf{o}_t^{(l)}$ is used as a placeholder to represent a location-based observation feature that can take several possible forms. Here, diarisation is performed after speech separation, and thus each frame has $C$ unmixed observations, $\mathbf{x}_t^{(1:C)}$ and $\mathbf{o}_t^{(1:C)}$.

The transition probability is factorised for each state entity,

$$p\left(z_t \mid z_{t-1}\right) = \prod_{l=1}^{C} P\left(c_t^{(l)} \mid c_{t-1}^{(l)}\right) \prod_{n=1}^{N} p\left(\omega_t^{(n)} \mid \omega_{t-1}^{(n)}\right) \quad (4)$$

This assumes that each separate $c_t^{(l)}$ and $\omega_t^{(n)}$ propagate independently over time. The speaker transition probability, $P\left(c_t^{(l)} \mid c_{t-1}^{(l)}\right)$, is an $N \times N$ matrix that is shared across all channels. The angular location transition likelihood is chosen to be a von Mises density function that is shared across all speakers,

$$p\left(\omega_t^{(n)} \mid \omega_{t-1}^{(n)}\right) = \frac{\exp\left[\kappa_\omega \cos\left(\omega_t^{(n)} - \omega_{t-1}^{(n)}\right)\right]}{2\pi I_0\left(\kappa_\omega\right)} \quad (5)$$

where the concentration, $\kappa_\omega$, expresses how fast speakers tend to move, and $I_0$ is the modified Bessel function of the first kind with order 0. The von Mises density function is chosen to abide by $\omega_t^{(n)}$ being bounded by $[0, 2\pi)$ with a periodic boundary condition.
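A minimal numerical sketch of this von Mises transition, assuming the $[0, 2\pi)$ wrap-around and a single concentration value; the function names are illustrative:

```python
import numpy as np

def von_mises_pdf(omega_t, omega_prev, kappa):
    """Von Mises transition density over angular speaker locations.

    The cosine of the angular difference makes the density periodic,
    so locations that wrap around 2*pi are handled correctly.
    np.i0 is the modified Bessel function of the first kind, order 0.
    """
    return np.exp(kappa * np.cos(omega_t - omega_prev)) / (2 * np.pi * np.i0(kappa))

def sample_transition(omega_prev, kappa, rng):
    """Propagate a speaker location one frame forward, wrapped to [0, 2*pi)."""
    return rng.vonmises(omega_prev, kappa) % (2 * np.pi)
```

The density integrates to one over a full period, and a larger concentration keeps the sampled location closer to the previous one, i.e. a slower-moving speaker.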

The initial state likelihood, which is omitted in (3) for brevity, is similarly factorised into each of the separate state entities,

$$p\left(z_1\right) = \prod_{l=1}^{C} P\left(c_1^{(l)}\right) \prod_{n=1}^{N} p\left(\omega_1^{(n)}\right) \quad (6)$$

Both the active speaker initial state probability, $P\left(c_1^{(l)}\right)$, and the initial location likelihood, $p\left(\omega_1^{(n)}\right)$, are set to be uniform, because the model has no information about the identity of the active speaker or the locations of the speakers before any observation is made.

Similarly, the speaker embedding emission likelihood is also factorised into separate channels,

$$p\left(\mathbf{x}_t^{(1:C)} \mid z_t\right) = \prod_{l=1}^{C} p\left(\mathbf{x}_t^{(l)} \mid c_t^{(l)}\right) \quad (7)$$

which makes the assumption that the emissions of the channels are independent of each other when given the state. Similarly to [35], the emission likelihood for each channel is chosen to be a von Mises-Fisher density function,

$$p\left(\mathbf{x}_t^{(l)} \mid c_t^{(l)} = n\right) = \frac{\kappa_x^{D/2 - 1}}{(2\pi)^{D/2} I_{D/2 - 1}\left(\kappa_x\right)} \exp\left(\kappa_x \boldsymbol{\mu}_n^\top \mathbf{x}_t^{(l)}\right) \quad (8)$$

where $D$ is the speaker embedding dimension, $\boldsymbol{\mu}_n$ represents the embedding centroid for speaker $n$, and $\kappa_x$ is the concentration. The log-likelihood is, up to an additive constant, a scaled cosine similarity between $\mathbf{x}_t^{(l)}$ and $\boldsymbol{\mu}_n$.
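As a sketch, assuming unit-normalised embeddings and dropping the normalisation constant (which is shared across speakers and so does not affect the most likely speaker), the per-channel log-likelihood reduces to a scaled dot product:

```python
import numpy as np

def vmf_log_likelihood(x, centroids, kappa):
    """Unnormalised von Mises-Fisher log-likelihood for each speaker.

    With a unit-normalised embedding x and centroids mu_n, the
    log-likelihood kappa * mu_n . x is a scaled cosine similarity,
    so the most likely speaker is the one with the closest centroid.
    """
    x = x / np.linalg.norm(x)
    mus = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return kappa * mus @ x
```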

The location emission likelihood is also factorised per-channel,

$$p\left(\mathbf{o}_t^{(1:C)} \mid z_t\right) = \prod_{l=1}^{C} p\left(\mathbf{o}_t^{(l)} \mid z_t\right) \quad (9)$$

which again makes the assumption that the observed locations in each channel are independent of each other when given the current state. Two forms of location features are considered. The first is the SSL vector, $\mathbf{s}_t^{(l)}$, which represents a categorical distribution, where each dimension, $s_{t,b}^{(l)}$, expresses the probability that the sound had originated from each angular bin around the microphone array, where $b$ is the angular bin index. This is computed using a complex angular central Gaussian model [15], as is described in [36]. The second form of location feature is the Direction-Of-Arrival (DOA), $d_t^{(l)}$, which is computed as the mode of the SSL,

$$d_t^{(l)} = \theta_{b^*}, \qquad b^* = \operatorname*{argmax}_b \, s_{t,b}^{(l)} \quad (11)$$

where $\theta_b$ is the angle in radians of the $b$th bin. An alternative is to compute the DOA as the circular mean of the SSL, similarly to (17), instead of the mode, but initial tests did not suggest any significant performance difference between the two choices.
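The two DOA choices can be sketched as follows; the bin angles and SSL vector here are illustrative:

```python
import numpy as np

def doa_mode(ssl, thetas):
    """DOA as the angle of the most probable SSL bin (the mode)."""
    return thetas[np.argmax(ssl)]

def doa_circular_mean(ssl, thetas):
    """DOA as the circular mean of the SSL, treating it as a
    categorical distribution over bin angles. The resultant complex
    vector handles the periodic wrap-around at 2*pi correctly."""
    z = np.sum(ssl * np.exp(1j * thetas))
    return np.angle(z) % (2 * np.pi)
```

For a sharply peaked SSL the two estimates coincide; they differ mainly when the SSL mass is spread over several bins.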

When using the DOA as the observed location feature, $\mathbf{o}_t^{(l)}$ is substituted with $d_t^{(l)}$ in (9), and the location emission likelihood for each channel can be computed as a von Mises density function,

$$p\left(d_t^{(l)} \mid z_t\right) = \frac{\exp\left[\kappa_d \cos\left(d_t^{(l)} - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]}{2\pi I_0\left(\kappa_d\right)} \quad (12)$$

where the concentration, $\kappa_d$, expresses the observation noise. This measures a similarity between the observed location, $d_t^{(l)}$, and the predicted location of the speaker that is estimated to be active on the channel, $\omega_t^{\left(c_t^{(l)}\right)}$.

Alternatively, the full SSL vector can be used as the observed location feature, by substituting $\mathbf{o}_t^{(l)}$ with $\mathbf{s}_t^{(l)}$ in (9). For this feature, the location emission likelihood for each channel is computed using a continuous categorical density function [13],

$$p\left(\mathbf{s}_t^{(l)} \mid z_t\right) = C\left(\boldsymbol{\lambda}_t\right) \prod_{b=1}^{B} \lambda_{t,b}^{s_{t,b}^{(l)}} \quad (13)$$

where $B$ is the number of discrete angular bins and $C\left(\boldsymbol{\lambda}_t\right)$ is the normalisation term defined in [13]. The continuous categorical bin probabilities are computed as a discretised von Mises distribution about a mean that represents the predicted location, $\omega_t^{\left(c_t^{(l)}\right)}$, of the current active speaker in the channel,

$$\lambda_{t,b} = \frac{\exp\left[\kappa_s \cos\left(\theta_b - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]}{\sum_{b'=1}^{B} \exp\left[\kappa_s \cos\left(\theta_{b'} - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]} \quad (14)$$

The equivalent log-likelihood of (13) is a KL-divergence between the predicted SSL, $\boldsymbol{\lambda}_t$, and the measured SSL, $\mathbf{s}_t^{(l)}$, both of which represent discrete categorical distributions. Substituting (14) into (13), and using $\sum_b s_{t,b}^{(l)} = 1$, yields

$$p\left(\mathbf{s}_t^{(l)} \mid z_t\right) = C\left(\boldsymbol{\lambda}_t\right) \frac{\exp\left[\kappa_s \sum_{b=1}^{B} s_{t,b}^{(l)} \cos\left(\theta_b - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]}{\sum_{b'=1}^{B} \exp\left[\kappa_s \cos\left(\theta_{b'} - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]} \quad (15)$$

Defining an equivalent concentration and mean through the circular resultant of the SSL,

$$\tilde{\kappa}_t \, e^{i\tilde{\omega}_t} = \kappa_s \sum_{b=1}^{B} s_{t,b}^{(l)} \, e^{i\theta_b} \quad (17)$$

the numerator of (15) can be rewritten as $\exp\left[\tilde{\kappa}_t \cos\left(\tilde{\omega}_t - \omega_t^{\left(c_t^{(l)}\right)}\right)\right]$. This suggests that with the choice of location emission likelihood of (13) and (14), the SSL, $\mathbf{s}_t^{(l)}$, at each frame can be completely characterised by an equivalent concentration, $\tilde{\kappa}_t$, and mean, $\tilde{\omega}_t$. The concentration may weigh the contribution of each frame to the total log-likelihood proportionally to the sharpness of the SSL.
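The equivalent concentration and mean can be computed from the SSL's circular resultant; a small sketch, assuming uniformly spaced bin angles:

```python
import numpy as np

def ssl_summary(ssl, thetas, kappa_s):
    """Summarise an SSL vector by an equivalent concentration and mean.

    The resultant vector of the SSL over the bin angles gives the
    circular mean (its angle) and, scaled by kappa_s, the equivalent
    concentration (its magnitude): a sharper SSL gives a larger
    resultant, so sharper frames contribute more to the log-likelihood.
    """
    z = np.sum(ssl * np.exp(1j * thetas))
    kappa_eq = kappa_s * np.abs(z)
    omega_eq = np.angle(z) % (2 * np.pi)
    return kappa_eq, omega_eq
```

A one-hot SSL gives the maximum equivalent concentration, while a uniform SSL gives a concentration near zero, i.e. an uninformative frame.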

Figure 1: Denominator term of discretised von Mises distribution (14)

However, the normalisation term, $C\left(\boldsymbol{\lambda}_t\right)$, is difficult to compute in a numerically stable manner [13]. As such, it is ignored in this paper. Furthermore, the term in the denominator of (15) is also ignored. Figure 1 plots this denominator term as a function of $\omega_t^{\left(c_t^{(l)}\right)}$, for various values of $\kappa_s$ and $B$. The plots suggest that it is approximately independent of $\omega_t^{\left(c_t^{(l)}\right)}$, except when both the concentration, $\kappa_s$, is large and the number of angular bins, $B$, is small. The setup in this paper does not operate in this regime. Thus it seems reasonable to omit this term. Therefore, the location emission likelihood is computed as

$$p\left(\mathbf{s}_t^{(l)} \mid z_t\right) \approx \exp\left[\tilde{\kappa}_t \cos\left(\tilde{\omega}_t - \omega_t^{\left(c_t^{(l)}\right)}\right)\right] \quad (18)$$

which looks similar in form to a von Mises density function.

With separated channels, there may not be concurrent active speakers at every frame. If frame $t$ in channel $l$ does not have an observation, then the emission likelihoods are set to $p\left(\mathbf{x}_t^{(l)} \mid z_t\right) = 1$ and $p\left(\mathbf{o}_t^{(l)} \mid z_t\right) = 1$ for this frame and channel, so that the missing observation does not affect the likelihood.

Figure 2: Joint modelling of discrete speaker turns (squares) and continuous locations (circles) using a switching state-space model

The joint speaker turn and location tracking model is illustrated graphically in Figure 2. This is reminiscent of the switching state-space model proposed in [11]. The discrete chains that express the current active speakers switch the outputs between the continuous chains that track the locations of each of the speakers, to generate the observed locations. The parameters of the model are the speaker embedding centroids, $\boldsymbol{\mu}_n$, the speaker transition probabilities, $P\left(c_t^{(l)} \mid c_{t-1}^{(l)}\right)$, and the concentrations, $\kappa_\omega$, $\kappa_x$, and $\kappa_d$ or $\kappa_s$. The concentrations are estimated using parameter sweeps on the dev data, while the speaker embedding centroids and speaker transition probabilities are maximum likelihood estimates from the hypothesised clusters from an initial AHC run. Uniform smoothing is interpolated into the speaker transition probabilities to improve generalisation.

3 Particle filter implementation

Performing clustering by decoding the model requires computing the forward recursion of

$$p\left(z_t \mid \mathbf{y}_{1:t}\right) \propto p\left(\mathbf{y}_t \mid z_t\right) \int p\left(z_t \mid z_{t-1}\right) p\left(z_{t-1} \mid \mathbf{y}_{1:t-1}\right) \, dz_{t-1} \quad (19)$$

where $\mathbf{y}_t$ is used to concisely represent the pair of observations, $\left\{\mathbf{x}_t^{(1:C)}, \mathbf{o}_t^{(1:C)}\right\}$. The choice of emission and transition likelihoods in Section 2 are not closed under the multiplication and convolution operations in (19). This makes it difficult to implement the model exactly, in a manner analogous to a Kalman filter or HMM. In this paper, the model is implemented as a particle filter [12]. This does not require the likelihoods to be closed under the forward pass operations, and instead performs a Monte Carlo simulation of the propagation of density functions along the forward pass.

The sequential importance resampling algorithm [27] is used. At each frame in the forward pass, the prediction step samples particles from either the initial state likelihood, $p\left(z_1\right)$, for the first frame, or from the transition likelihood, $p\left(z_t \mid z_{t-1}\right)$, at subsequent frames, when given the particles or resampled particles from the previous frame. The factorised forms of (4) and (6) allow each state entity to be sampled separately. The collection of particles represents an approximation of the prediction likelihood,

$$p\left(z_t \mid \mathbf{y}_{1:t-1}\right) \approx \sum_{k=1}^{K} \breve{w}_{t-1}^{(k)} \, \delta\left(z_t - z_t^{(k)}\right) \quad (20)$$

where $z_t^{(k)}$ is the $k$th particle, $K$ is the number of particles, $\breve{w}_{t-1}^{(k)}$ are the importance weights after resampling from the previous frame, and the Dirac delta function is defined as

$$\delta\left(z\right) = \begin{cases} 1, & z = 0 \\ 0, & \text{otherwise} \end{cases} \quad (21)$$

After sampling the particles, the update step then computes the importance sampling weights as

$$w_t^{(k)} \propto \breve{w}_{t-1}^{(k)} \, p\left(\mathbf{y}_t \mid z_t^{(k)}\right), \qquad \sum_{k=1}^{K} w_t^{(k)} = 1 \quad (22)$$

The collection of particles and importance weights now approximate the update likelihood,

$$p\left(z_t \mid \mathbf{y}_{1:t}\right) \approx \sum_{k=1}^{K} w_t^{(k)} \, \delta\left(z_t - z_t^{(k)}\right) \quad (23)$$
Often, sequential Monte Carlo simulation methods suffer from the importance weights attenuating to zero for many particles as the forward pass progresses. This is because the importance weights are computed recursively as a product of previous importance weights in (22). This may make it difficult to effectively explore the support of the state space. Resampling [27] aims to alleviate this, at the expense of an increase in the variance of the estimates. A new collection of resampled particles is sampled with replacement from the original particles, with each original particle being resampled with a probability equal to its importance weight, $w_t^{(k)}$. The systematic method [3] is used in this paper to perform resampling. After resampling, the new resampled importance weights are set uniformly, $\breve{w}_t^{(k)} = 1/K$. Resampling is only performed at a frame if the effective sample size [18], $K_{\text{eff}} = 1 / \sum_{k=1}^{K} \left(w_t^{(k)}\right)^2$, falls below a threshold.

4 Decoding

Clustering can be performed by decoding the model. Only the active speakers, $c_t^{(l)}$, are of interest to the diarisation task, while the speaker locations, $\omega_t^{(n)}$, can be marginalised over. One approach to estimate the active speaker sequence is to use a Viterbi-style decoding,

$$\hat{c}_{1:T}^{(1:C)} = \operatorname*{argmax}_{c_{1:T}^{(1:C)}} \int p\left(c_{1:T}^{(1:C)}, \omega_{1:T}^{(1:N)} \mid \mathbf{y}_{1:T}\right) d\omega_{1:T}^{(1:N)} \quad (24)$$

However, it may not be trivial to develop an efficient algorithm for this when the hidden state contains continuous variables. Furthermore, in the diarisation setup used in this paper, the objective is to hypothesise a speaker identity for each word, which may not be perfectly matched with finding the most likely sequence.

Decoding is instead performed by first computing the per-frame speaker state posteriors, marginalising over the location states,

$$P\left(c_t^{(l)} \mid \mathbf{y}_{1:T}\right) \quad (25)$$

The speaker for each word is then estimated by choosing the most probable speaker from the aggregated speaker state posteriors over the frames within the word. Aggregation of the state posteriors can be done either as a sum,

$$\hat{c}_j = \operatorname*{argmax}_{n} \sum_{t=\tau_j}^{\upsilon_j} P\left(c_t^{(l_j)} = n \mid \mathbf{y}_{1:T}\right) \quad (26)$$

a product,

$$\hat{c}_j = \operatorname*{argmax}_{n} \prod_{t=\tau_j}^{\upsilon_j} P\left(c_t^{(l_j)} = n \mid \mathbf{y}_{1:T}\right) \quad (27)$$

or majority voting,

$$\hat{c}_j = \operatorname*{argmax}_{n} \sum_{t=\tau_j}^{\upsilon_j} \delta\left(n, \operatorname*{argmax}_{n'} P\left(c_t^{(l_j)} = n' \mid \mathbf{y}_{1:T}\right)\right) \quad (28)$$

where $\hat{c}_j$ is the speaker identity of the $j$th hypothesised word, $\tau_j$ and $\upsilon_j$ are the start and end frame indexes of the word respectively, $l_j$ is the channel on which the word is detected, the Kronecker delta function is defined as

$$\delta\left(n, n'\right) = \begin{cases} 1, & n = n' \\ 0, & \text{otherwise} \end{cases} \quad (29)$$

and $P\left(c_t^{(l)} \mid \mathbf{y}_{1:T}\right)$ is computed by marginalising over the other channels in (25). The product combination in (27) is most closely related to a maximum probability interpretation, as the probability for the speaker of a word should be computed as a joint probability of the same speaker over all of the frames within the word.
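The three aggregation rules can be sketched as follows, given the per-frame posteriors for the frames within one word (the array layout is an assumption):

```python
import numpy as np

def word_speaker(posteriors, method="sum"):
    """Assign a speaker to a word from its per-frame posteriors.

    posteriors: array of shape (frames_in_word, N_speakers), each row a
    posterior P(c_t = n | y) over speakers for one frame of the word.
    """
    if method == "sum":
        scores = posteriors.sum(axis=0)
    elif method == "product":
        # Summing logs is equivalent to the product and avoids underflow
        scores = np.log(posteriors + 1e-30).sum(axis=0)
    elif method == "majority":
        votes = posteriors.argmax(axis=1)
        scores = np.bincount(votes, minlength=posteriors.shape[1])
    else:
        raise ValueError(method)
    return int(np.argmax(scores))
```

The rules can disagree: a single very confident frame can dominate the sum, while majority voting counts each frame equally.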

The state posterior in (25) can be estimated through Forward Filtering-Backward Smoothing (FFBS) [5],

$$p\left(z_t \mid \mathbf{y}_{1:T}\right) \approx \sum_{k=1}^{K} w_{t|T}^{(k)} \, \delta\left(z_t - z_t^{(k)}\right) \quad (30)$$

where the backward recursion computes the backward importance weights as

$$w_{t|T}^{(k)} = w_t^{(k)} \sum_{k'=1}^{K} w_{t+1|T}^{(k')} \frac{p\left(z_{t+1}^{(k')} \mid z_t^{(k)}\right)}{\sum_{k''=1}^{K} w_t^{(k'')} \, p\left(z_{t+1}^{(k')} \mid z_t^{(k'')}\right)} \quad (31)$$

In this paper, an exact computation of the backward importance weights in (31) is used, which has a computational cost that scales as $\mathcal{O}\left(TK^2\right)$. This can be expensive when using many particles. Many particles may be required to sufficiently explore the state space. A kernel density approximation [8, 14] can be used to speed up the computation to scale as $\mathcal{O}\left(TK\log K\right)$, but this requires that the transition likelihoods represent monotonic kernels [17], which may limit the form of the allowed active speaker transition probabilities, $P\left(c_t^{(l)} \mid c_{t-1}^{(l)}\right)$, to matrices with a probability attenuating monotonically away from the diagonal. As opposed to this, the forward recursion has a computational cost that scales as $\mathcal{O}\left(TK\right)$. Therefore, the computational cost can be reduced by decoding using only the forward pass, by replacing $\mathbf{y}_{1:T}$ with $\mathbf{y}_{1:t}$ in the conditional dependencies in (25), (26), (27), and (28). However, this foregoes information from the future context when making the decoding decisions.
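The exact backward recursion can be sketched for a one-dimensional state with a generic transition density passed in; the quadratic cost comes from the pairwise transition matrix at each frame:

```python
import numpy as np

def backward_smooth(weights, particles, trans_pdf):
    """Exact FFBS backward pass over filtered particle weights.

    weights:   (T, K) filtered importance weights, each row normalised
    particles: (T, K) particle states
    trans_pdf: function (z_next, z_prev) -> transition density, applied
               elementwise to broadcastable arrays
    Cost is O(T K^2) because of the K x K transition matrix per frame.
    """
    T, K = weights.shape
    smoothed = np.empty_like(weights)
    smoothed[-1] = weights[-1]
    for t in range(T - 2, -1, -1):
        # trans[k', k] = p(z_{t+1}^{(k')} | z_t^{(k)})
        trans = trans_pdf(particles[t + 1][:, None], particles[t][None, :])
        denom = trans @ weights[t]                       # shape (K,)
        smoothed[t] = weights[t] * ((smoothed[t + 1] / denom) @ trans)
    return smoothed
```

As a sanity check, an uninformative (constant) transition density leaves the filtered weights unchanged, since the future then carries no information about the past.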

An alternative method to reduce the computational cost is to uniformly sub-sample the particles after the forward pass, when performing the backward pass. The exploration of the state space in the FFBS algorithm is primarily achieved during the sampling of particles in the prediction step of the forward pass. Therefore, having a large number of particles is more important for the forward pass than the backward pass.

Decoding for diarisation is done per word. Thus, it seems reasonable to restrict the state transitions to only allow speaker changes at the word boundaries. In the forward pass, this can be achieved by setting $P\left(c_t^{(l)} \mid c_{t-1}^{(l)}\right)$ to the identity matrix when sampling in the prediction step at frames that are not at word boundaries. In the backward pass, the same restricted speaker transition probabilities can be used to compute the backward importance weights in (31).

5 Meeting transcription setup

The proposed approach was evaluated on a rich meeting transcription task, with the setup that was initially described in [36], and used again in [35]. Audio from a microphone array was separated into multiple channels, with the assumption that there were no concurrent speakers within each channel. Voice activity detection and speech recognition were run on each channel. Speaker change detection was used to find segments with speaker purity, by applying a threshold to the cosine similarity of the speaker embeddings computed using the model described in [37]. AHC was then used to cluster together all of the segments from all of the channels that belonged to the same speaker, by greedily merging clusters with the highest speaker embedding cosine similarity, until the maximum similarity fell below a threshold. The Hungarian algorithm was then used to find the optimal mapping between the AHC hypothesised clusters and the enrolled speakers. These tagged AHC clusters were used to initialise the parameters of either a HMM or SSPF model, which then refined the clusters. The maximum number of active speakers, $N$, was set equal to the number of AHC clusters. As in [35], the HMM parameters here were fine-tuned for each meeting using expectation-maximisation. The SSPF parameters were not modified after initialisation. In [35], Hungarian speaker tagging was performed after HMM clustering. However, in this paper, HMM or SSPF clustering was performed after Hungarian tagging, to isolate the experimental trends associated with the HMM and SSPF methods, and ignore the trends due to the interactions between clustering and tagging. Following [35], the HMM here also used a segment of one or more words as a frame. A uniform time segmentation may be essential to effectively model the temporal movements of speakers in the SSPF. As such, the SSPF used frames with a duration and shift of 0.4s.

6 Experiments

Audio data was collected from internal Microsoft meetings, with an average of 7 active participants per meeting, lasting up to 1 hour each. The dev set comprised 51 meetings making up 23 hours, while the eval set comprised 60 meetings making up 35 hours. The model described in [37] was used to extract 128-dimensional d-vector speaker embeddings. The dimension of the SSL vectors was 360. The baseline HMM used SSL vectors that were downsampled to 18 dimensions, as this yielded improvements in initial tests. The SSPF used the full 360-dimensional SSL vectors, to retain the spatial resolution for accurate location tracking. The speaker-attributed Word Error Rate (WER) [36] was used to measure the performance. This was computed by measuring the WER separately for each speaker, then averaging the WERs over all speakers. The speaker-attributed WER assesses both the speaker diarisation and speech recognition performances together, which are both important for the rich meeting transcription task.
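The structure of the metric can be sketched as follows; this is an illustrative outline only, not the exact scoring pipeline of [36], which also handles speaker mapping and word timing:

```python
def wer(ref_words, hyp_words):
    """Word error rate via the Levenshtein edit distance."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

def speaker_attributed_wer(ref_by_spk, hyp_by_spk):
    """Measure the WER separately for each reference speaker against the
    words attributed to that speaker, then average over speakers."""
    wers = [wer(ref, hyp_by_spk.get(spk, [])) for spk, ref in ref_by_spk.items()]
    return sum(wers) / len(wers)
```

Because the metric averages per-speaker WERs, a diarisation error that attributes words to the wrong speaker is penalised even when the words themselves are recognised correctly.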

(a) Number of forward particles
(b) Number of backward particles sub-sampled from 20000 forward particles
Figure 3: Performance on the dev set with various numbers of forward and sub-sampled backward particles

The first experiment assesses the influence of the number of particles. This primarily affects the exploration of the state space during the forward pass. As is explained in Section 4, the particles can be sub-sampled during the backward pass to reduce the computational cost. Decoding of the SSPF can be done using only a forward pass or by using both the forward and backward passes. Figure 3(a) assesses the impact on the dev set of the number of particles in the forward pass, by decoding using only the forward pass. DOA features were used with sum aggregation, without state transition restrictions. The speaker-attributed WER can be seen to degrade when fewer than 1000 particles are used. It may not be possible to effectively explore the support of the state space with so few particles. In the remaining experiments, 20000 particles were used in the forward pass to ensure adequate exploration of the state space. Going beyond 20000 particles required more than the available CPU memory, as this implementation was not optimised for memory efficiency.

Decoding using only the forward pass ignores information about the future context. Such information can be utilised by performing decoding using FFBS. In the backward pass, the computational cost can be reduced by sub-sampling the particles from the 20000 in the forward pass. Figure 3(b) assesses how the number of sub-sampled particles used in the backward pass affects the performance on the dev set. The speaker-attributed WER improves as more sub-sampled particles are used. As a comparison between the two passes, a forward pass with 20000 particles yielded a speaker-attributed WER of 17.56%, while a backward pass with 5000 sub-sampled particles yielded 17.64%. It is a reasonable guess that the performance of the backward pass may eventually surpass that of the forward pass when given sufficient sub-sampled particles. However, going beyond 5000 sub-sampled particles required infeasible computation times in the current implementation. Unless otherwise stated, the remaining experiments perform decoding using only the forward pass.

The next experiment investigates the benefit of tracking the speaker locations for the diarisation task. The SSPF model can use only speaker embeddings, by setting the location concentrations, $\kappa_d$ and $\kappa_s$, to zero, which makes the location emission likelihoods uniform. Speaker location tracking can be jointly performed with diarisation within the SSPF model, by using location features in the form of either the DOA with an emission likelihood of (12), or the SSL with an emission likelihood of (18). A comparison of these features on the dev set is shown in Table 1. The results suggest that both DOA and SSL features may yield small gains over using only d-vectors, thereby suggesting that jointly performing speaker tracking with clustering may aid in the diarisation task. The results also agree with [35] in suggesting that location features may be complementary to speaker embeddings for diarisation. SSL features do not show any significant gain over DOA features. In the remaining experiments, the SSL features were used.

Observations        dev speaker-attributed WER (%)
d-vector            17.65
d-vector + DOA      17.56
d-vector + SSL      17.55
Table 1: Location observation feature type

Aggregation method  dev speaker-attributed WER (%)
sum                 17.55
product             17.56
majority voting     17.56
Table 2: Posterior aggregation methods

As is described in Section 4, the speaker for each word can be chosen by aggregating the per-frame state posteriors within each word using either a sum, product, or majority voting. Table 2 assesses these aggregation techniques on the dev set. There does not seem to be any significant difference between the performances of these three aggregation methods.

Restrict in forward  Restrict in backward  dev speaker-attributed WER (%): forward  backward
no                   no                    17.55                                    17.72
yes                  no                    17.60                                    17.78
yes                  yes                   17.65                                    17.77
Table 3: Restricting speaker transitions to word boundaries

Section 4 also describes the possibility of restricting the speaker transitions, such that speaker changes are only allowed at word boundaries. This restriction can be applied in either or both of the forward and backward passes. Table 3 assesses these restrictions on the dev set. Here, the backward pass used only 1000 sub-sampled particles for faster experimentation. The results suggest that there may not be any significant gain yielded by enforcing these restrictions. The 17.60% and 17.65% forward pass speaker-attributed WERs for when speaker transitions are restricted in the forward pass differ because of the stochasticity of the SSPF model.

Test set  Model  Speaker-attributed WER (%)
                 stationary  moving  average
dev       HMM    16.59       18.19   17.53
dev       SSPF   16.68       18.14   17.55
eval      HMM    19.45       15.26   16.02
eval      SSPF   19.54       15.17   16.00
Table 4: Effect of explicitly modelling movement

Table 4 compares the SSPF against the baseline HMM from [35], on both the dev and eval sets. Here the meetings were categorised into those with and without speaker movements. A meeting was considered to have movement if that meeting had at least one speaker, such that it was possible to find two disjoint angular arcs of at least radians each, and that speaker spent at least 30s of active speech in each of the two remaining regions that were not covered by these two arcs, based on manually transcribed location information from video data. The results suggest that the SSPF may improve the speaker-attributed WER performance over the HMM for meetings that have movement. Although the improvements may be small for each of the dev and eval sets, the improvements are consistent across both data sets. However, the SSPF seems to degrade the performance of stationary meetings compared to the HMM. If speakers are fairly stationary throughout a meeting, then their static location information may be particularly useful for the diarisation task. This scenario may fit particularly well with the assumptions of the HMM, which does not explicitly model temporal changes in the speaker locations. It is shown in [35] that expectation-maximisation fine-tuning of the initial state and state transition probabilities on the current test meeting yields improvements for the HMM. It is difficult to perform per-meeting fine-tuning of the analogous parameters in the SSPF in a computationally feasible manner, and these parameters were instead only initialised from the AHC hypothesis. Despite this, the SSPF is able to perform comparably with the HMM on average.

Figure 4: Example prediction of a speaker’s location. Blue crosses represent the DOA observations, while the heat map shows the weighted distribution of the particles, where darker means higher probability

An advantage of the SSPF over the HMM is that the SSPF can yield the estimated locations of each of the speakers as they move, through the duration of the meeting. An example of such a predicted location trace after the forward pass is illustrated in Figure 4. The location estimation continues, even when the speaker is silent. The particles express growing uncertainty about the speaker’s location, as the duration of silence increases. This predicted location information may be useful to downstream tasks.

7 Conclusion

This paper has proposed a framework to jointly perform diarisation and speaker location tracking. A switching state-space model is implemented as a particle filter, with discrete chains that represent speaker turns, which are used to switch between continuous chains that express speaker locations. This model is shown to perform comparably with a previously proposed HMM diarisation approach that models static speaker locations.