1 Introduction
Speaker diarisation aims to cluster together segments of audio that are uttered by the same speaker. This is useful in a rich meeting transcription task, where both the identity of a speaker and the words being said need to be hypothesised. Spectral [17] and k-means [24] clustering can be used for diarisation, after first estimating the number of clusters by, for example, finding the maximum gap in a chosen statistic [28, 19]. Agglomerative Hierarchical Clustering (AHC) [25, 10] instead jointly estimates the cluster assignments and number of clusters. The Hidden Markov Model (HMM) can also be used, either to compute merging scores within AHC [1], or on its own after having estimated the number of clusters [4, 13]. These methods often rely solely on features in the form of speaker embeddings, such as [3], [26], and [29]. The speaker embeddings are intended to express information that is useful in discriminating between different speakers.
When multichannel audio is available, it is possible to estimate the instantaneous location from which the sound originated. This information may be complementary to the speaker embeddings in the diarisation task. Previous works have investigated using time-delay-of-arrival [18, 30] and Sound Source Localisation (SSL) information [31], together with speaker embeddings in HMM clustering. There is also a diversity of methods to count and localise multiple speakers without using speaker embeddings [16, 15, 20].
Speakers may move over the duration of a meeting. Explicitly modelling this movement may aid in diarisation. In multi-face tracking, Kalman filters are often used to track face movements from visual information [23, 6]. When multichannel audio is available, acoustic location information has been shown to be complementary to visual information for face movement tracking [7]. The LOCATA challenge [5] has helped to spur the development of audio-only location tracking methods. Several of these approaches also rely on Kalman filters, to track the locations of a single [2, 21] or multiple [22] audio sources.
This paper proposes to perform diarisation while modelling the movements of multiple speakers. It builds upon the works in [18, 30, 31], by tracking the movements of speakers, rather than assuming that speakers are stationary. It also extends upon the audio-only tracking methods, such as in [16, 15, 20, 22], by using both location information and speaker embeddings in diarisation. Diarisation is performed using AHC. Speaker movement is modelled as the likelihood of a sequence of instantaneous locations, computed using a Kalman filter. This is used together with a speaker embedding affinity score in the AHC cluster merging and stopping criteria.
2 von Mises Kalman filter tracking
The Kalman filter [11] can be used to model movement through location tracking. Using the Markov assumptions, the Kalman filter computes the likelihood of an observation sequence as

$$p(y_{1:T}) = \int p(x_1)\,p(y_1|x_1) \prod_{t=2}^{T} p(x_t|x_{t-1})\,p(y_t|x_t)\,dx_{1:T} \qquad (1)$$

where $t$ is the frame index, $T$ is the total number of frames, and $y_t$ is an observed instantaneous location feature, whose possible forms are discussed in Section 2.1. In this paper, the hidden state, $x_t$, represents the estimated location of the speaker. In the future, it may be beneficial to also investigate modelling the velocity and higher time derivatives in the hidden state, as is often done in face tracking [23, 6].
In this paper, the speaker location is expressed as the horizontal angle around a microphone array. This is a continuous variable that is bounded within $[-\pi, \pi)$ in radians, with a periodic boundary condition. Previous works have satisfied these properties using von Mises and wrapped normal density functions [12]. The transition likelihood used in this paper is a von Mises density function,

$$p(x_t|x_{t-1}) = \frac{\exp\left[\kappa_x \cos\left(x_t - x_{t-1}\right)\right]}{2\pi I_0(\kappa_x)} \qquad (2)$$

where $I_0$ is the modified Bessel function of the first kind of order 0 and the concentration parameter, $\kappa_x$, expresses how tightly the density function is concentrated about the mean of $x_{t-1}$. A higher concentration yields a lower likelihood for protean sequences. The initial state likelihood, $p(x_1)$, is set to a uniform density function, as there is complete uncertainty of where the speaker is before any observation is made.
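As a concrete check of (2), the von Mises density can be evaluated numerically. The sketch below (using NumPy and SciPy, with illustrative parameter values) verifies that the density normalises over the circle and that a larger concentration places more mass near the mean.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(x, mean, kappa):
    # von Mises density over angles in radians, as in the transition likelihood (2)
    return np.exp(kappa * np.cos(x - mean)) / (2.0 * np.pi * i0(kappa))

angles = np.linspace(-np.pi, np.pi, 10000, endpoint=False)
dx = angles[1] - angles[0]
loose = von_mises_pdf(angles, 0.0, 0.5)  # broad density: fast movement is plausible
tight = von_mises_pdf(angles, 0.0, 5.0)  # concentrated density: penalises large jumps
print(round(loose.sum() * dx, 3), round(tight.sum() * dx, 3))  # each integrates to ~1
```

A higher concentration raises the density at the mean and suppresses it elsewhere, which is what makes rapidly changing sequences less likely under (2).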
2.1 Observation feature
Two forms of observed features are considered, namely the scalar Direction-Of-Arrival (DOA) and the full SSL vector. The SSL vector, $s_t$, is a categorical distribution, with each dimension representing the probability that the sound had originated from the respective angular bin around the microphone array,

$$s_{tb} = P(b_t = b) \qquad (3)$$

where $b$ is the angular bin index and $b_t$ is the angular bin from which the audio forming feature $s_t$ may have originated. The SSL is computed using a complex angular central Gaussian model [9], as is described in [32]. The DOA, $d_t$, is computed as the mode of the SSL,

$$d_t = \theta_{\hat{b}_t}, \quad \text{where} \quad \hat{b}_t = \underset{b}{\arg\max}\; s_{tb} \qquad (4)$$

and $\theta_b$ is the angle in radians of the $b$th bin. Instead of the mode, it is also possible to estimate the DOA as the circular mean of the SSL, which can be computed using (10). However, initial tests did not suggest any significant difference between the performances of either form of DOA.
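To make (4) and the circular-mean alternative concrete, the sketch below builds a synthetic SSL vector (the peaked shape and its parameters are illustrative, not the model of [9]) and extracts both forms of DOA.

```python
import numpy as np

def doa_mode(ssl, bin_angles):
    # DOA as the mode of the SSL, following (4)
    return bin_angles[np.argmax(ssl)]

def doa_circular_mean(ssl, bin_angles):
    # DOA as the circular mean of the SSL, cf. (10)
    return np.angle(np.sum(ssl * np.exp(1j * bin_angles)))

B = 360  # one angular bin per degree
bin_angles = np.linspace(-np.pi, np.pi, B, endpoint=False)
ssl = np.exp(5.0 * np.cos(bin_angles - 1.0))  # synthetic SSL peaked near 1 rad
ssl /= ssl.sum()                              # normalise to a categorical distribution
print(doa_mode(ssl, bin_angles), doa_circular_mean(ssl, bin_angles))
```

For a sharply peaked SSL the two estimates nearly coincide, which is consistent with the initial tests reported above; they differ more for multi-modal or noisy SSL vectors.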
When using the DOA as the observed feature, the Kalman filter is used to compute $p(d_{1:T})$, by substituting the placeholder, $y_t$, with $d_t$ in (1). The emission likelihood is chosen to be a von Mises density function,

$$p(d_t|x_t) = \frac{\exp\left[\kappa_d \cos\left(d_t - x_t\right)\right]}{2\pi I_0(\kappa_d)} \qquad (5)$$

where the concentration parameter, $\kappa_d$, expresses the random noise in the observation.
The DOA only represents an estimate of the instantaneous angle of the speaker. However, interactions with the environment, noise, and the limited spatial resolution of the microphone array geometry may result in uncertainty in this estimation. The full SSL vector may express information about this uncertainty. The likelihood of an SSL sequence, $p(s_{1:T})$, can be computed by substituting $y_t$ with $s_t$ in (1), and using an emission likelihood in the form of a continuous categorical density function [8],

$$p(s_t|x_t) = C(\lambda_t) \prod_{b=1}^{B} \lambda_{tb}^{s_{tb}} \qquad (6)$$

where $B$ is the number of angular bins, $C(\lambda_t)$ is the normalisation constant defined in [8], and the continuous categorical bin probabilities are computed as a discretised von Mises distribution about the mean of $x_t$,

$$\lambda_{tb} = \frac{\exp\left[\kappa_s \cos\left(\theta_b - x_t\right)\right]}{\sum_{b'=1}^{B} \exp\left[\kappa_s \cos\left(\theta_{b'} - x_t\right)\right]} \qquad (7)$$

The continuous categorical emission likelihood can be interpreted as follows: if multiple samples are drawn when given the same $x_t$, then (6) computes the likelihood of observing each angular bin, $b$, at a fraction of $s_{tb}$, out of all of the samples. If a Dirichlet density function is used instead, by swapping the places of $\lambda_{tb}$ and $s_{tb}$ in (6), then the interpretation will be different, and the simplification of (8) will no longer be applicable. By taking the logarithm of (6), the emission log-likelihood can be seen to be a Kullback-Leibler (KL) divergence between two categorical distributions, one being the observation, $s_t$, and the other being a prediction of the angular distribution, $\lambda_t$. That is to say, when the model predicts the angular state to be $x_t$, the model also predicts that the observed SSL should be similar to $\lambda_t$. The emission log-likelihood then measures a similarity score between the observed SSL, $s_t$, and the predicted SSL, $\lambda_t$. The work in [31] also uses an emission log-likelihood in the form of a KL-divergence between SSL vectors, for HMM diarisation.
However, the exact form of (6) presents a challenge, as the normalisation term of $C(\lambda_t)$ is difficult to compute in a numerically stable manner [8]. The approximation is therefore made that $C(\lambda_t)$ is independent of $x_t$, thereby allowing $C(\lambda_t)$ to be ignored when computing log-likelihood ratios for the AHC affinity scores, as will be described in Section 3. By ignoring $C(\lambda_t)$ and substituting in (7), (6) can be simplified to

$$p(s_t|x_t) \propto \frac{\exp\left[\tilde{\kappa}_t \cos\left(\tilde{\mu}_t - x_t\right)\right]}{\sum_{b'=1}^{B} \exp\left[\kappa_s \cos\left(\theta_{b'} - x_t\right)\right]} \qquad (8)$$

where

$$\tilde{\kappa}_t = \kappa_s \left| \sum_{b=1}^{B} s_{tb} e^{i\theta_b} \right| \qquad (9)$$

and

$$\tilde{\mu}_t = \arg \sum_{b=1}^{B} s_{tb} e^{i\theta_b} \qquad (10)$$

The form of (8) is reminiscent of a von Mises density function. Therefore, by choosing to use a combination of a continuous categorical density function and a discretised von Mises distribution for the SSL emission likelihood in (6) and (7), the effective emission likelihood will also look similar to a von Mises density function, and at each frame can be completely summarised by its equivalent concentration, $\tilde{\kappa}_t$, and circular mean, $\tilde{\mu}_t$. The concentration, $\tilde{\kappa}_t$, may weigh the contribution of each frame to the total log-likelihood proportionally to the sharpness of the SSL distribution.
However, unlike a von Mises density function, the denominator in (7) and (8) depends on $x_t$, which is inconvenient for the forward recursion integral described in Section 2.2. Figure 1 plots the denominator, $\sum_{b'} \exp[\kappa_s \cos(\theta_{b'} - x_t)]$, over a variety of $\kappa_s$ and $B$ values. It can be seen that the denominator is approximately independent of $x_t$, except when the concentration, $\kappa_s$, is large at the same time as the number of angular bins, $B$, is small. The experiments in this paper do not operate in such a regime. Therefore, it may be reasonable to approximate the denominator as being independent of $x_t$. This allows the SSL emission likelihood to be expressed as

$$p(s_t|x_t) \approx \frac{\exp\left[\tilde{\kappa}_t \cos\left(\tilde{\mu}_t - x_t\right)\right]}{2\pi I_0(\tilde{\kappa}_t)} \qquad (11)$$

which has the form of a von Mises density function over $x_t$.
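The equivalent concentration and circular mean of (9) and (10) can both be read off the resultant vector of the SSL. The sketch below (with a synthetic SSL and an illustrative $\kappa_s$) shows that a sharp SSL keeps most of the concentration, while a flat SSL collapses it toward zero.

```python
import numpy as np

def ssl_summary(ssl, bin_angles, kappa_s):
    # Resultant vector of the SSL: its argument is the circular mean (10),
    # and kappa_s times its magnitude is the equivalent concentration (9).
    r = np.sum(ssl * np.exp(1j * bin_angles))
    return kappa_s * np.abs(r), np.angle(r)

B, kappa_s = 360, 10.0
bin_angles = np.linspace(-np.pi, np.pi, B, endpoint=False)
sharp = np.exp(20.0 * np.cos(bin_angles - 0.5))
sharp /= sharp.sum()        # confident SSL, peaked near 0.5 rad
flat = np.ones(B) / B       # completely uncertain SSL
k_sharp, mu_sharp = ssl_summary(sharp, bin_angles, kappa_s)
k_flat, _ = ssl_summary(flat, bin_angles, kappa_s)
print(k_sharp > k_flat)  # sharper SSL -> larger effective concentration
```

This is the mechanism by which $\tilde{\kappa}_t$ down-weights uncertain frames: a flat SSL contributes an almost uniform emission likelihood, and so carries little weight in the total log-likelihood.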
2.2 Likelihood computation
The Kalman filter can be used to compute the log-likelihood of a DOA or SSL observation sequence, $p(y_{1:T})$. This log-likelihood can be used as a score in the AHC merging and stopping criteria, as will be discussed in Section 3. The likelihood can be computed using the Kalman filter forward recursion,

$$p(x_t|y_{1:t}) \propto p(y_t|x_t) \int p(x_t|x_{t-1})\,p(x_{t-1}|y_{1:t-1})\,dx_{t-1} \qquad (12)$$

This can be broken down into the prediction step,

$$p(x_t|y_{1:t-1}) = \int p(x_t|x_{t-1})\,p(x_{t-1}|y_{1:t-1})\,dx_{t-1} \qquad (13)$$

and the update step,

$$p(x_t|y_{1:t}) \propto p(y_t|x_t)\,p(x_t|y_{1:t-1}) \qquad (14)$$

In this paper, the transition likelihood in (2), and emission likelihoods in (5) and (11), all have the form of von Mises density functions in terms of the random variable $x_t$. The prediction step in (13) is a convolution operation. Unfortunately, the von Mises density function is not closed under convolution, but instead the result takes a form described in [14]. However, it has been shown that the result of the convolution can be closely approximated by a von Mises density function [27], thereby allowing (13) to be expressed as

$$p(x_t|y_{1:t-1}) \approx \frac{\exp\left[\kappa_{t|t-1} \cos\left(\mu_{t|t-1} - x_t\right)\right]}{2\pi I_0(\kappa_{t|t-1})} \qquad (15)$$
The prediction concentration is

$$\kappa_{t|t-1} = A^{-1}\left(A(\kappa_{t-1|t-1})\,A(\kappa_x)\right) \qquad (16)$$

and the prediction mean is

$$\mu_{t|t-1} = \mu_{t-1|t-1} \qquad (17)$$

where $A^{-1}$ is the functional inverse of

$$A(\kappa) = \frac{I_1(\kappa)}{I_0(\kappa)} \qquad (18)$$

which can be solved for using the Newton-Raphson root finding algorithm, and both $\mu_{t-1|t-1}$ and $\kappa_{t-1|t-1}$ are the parameters of the update step von Mises density function from the previous frame. In the prediction step, the concentration is broadened from the previous frame, through the inclusion of $A(\kappa_x)$ in (16). The mean in (17) does not change, because higher temporal derivatives of the angle are not modelled in the hidden state.
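The Bessel function ratio of (18) and its Newton-Raphson inverse can be implemented directly. In the sketch below, the starting guess and iteration count are illustrative choices; it also demonstrates the prediction-step broadening of (16).

```python
import numpy as np
from scipy.special import i0, i1

def A(kappa):
    # Ratio of modified Bessel functions of the first kind, A(kappa) = I1/I0, as in (18)
    return i1(kappa) / i0(kappa)

def A_inv(r, n_iters=50):
    # Newton-Raphson inverse of A, using A'(kappa) = 1 - A(kappa)/kappa - A(kappa)^2
    kappa = r * (2.0 - r * r) / (1.0 - r * r)  # common initial guess for kappa
    for _ in range(n_iters):
        a = A(kappa)
        kappa -= (a - r) / (1.0 - a / kappa - a * a)
    return kappa

# Prediction step of (16): convolving with the transition noise lowers the concentration.
kappa_prev, kappa_x = 8.0, 8.0  # illustrative values
kappa_pred = A_inv(A(kappa_prev) * A(kappa_x))
print(kappa_pred < kappa_prev)  # the prediction is broader than the previous update
```

Since $A(\kappa) < 1$ for finite $\kappa$, the product inside $A^{-1}$ in (16) is always smaller than $A(\kappa_{t-1|t-1})$, so the prediction concentration always decreases, reflecting the growing uncertainty before the next observation arrives.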
The von Mises density function is closed under multiplication. Therefore, by using the approximate prediction in (15), the update step of (14) becomes

$$p(x_t|y_{1:t}) = \frac{\exp\left[\kappa_{t|t} \cos\left(\mu_{t|t} - x_t\right)\right]}{2\pi I_0(\kappa_{t|t})} \qquad (19)$$

When using DOA observations with an emission likelihood of (5), the update concentration is

$$\kappa_{t|t} = \sqrt{\kappa_{t|t-1}^2 + \kappa_d^2 + 2\kappa_{t|t-1}\kappa_d \cos\left(d_t - \mu_{t|t-1}\right)} \qquad (20)$$

and the update mean is

$$\mu_{t|t} = \mathrm{atan2}\left(\kappa_{t|t-1}\sin\mu_{t|t-1} + \kappa_d\sin d_t,\; \kappa_{t|t-1}\cos\mu_{t|t-1} + \kappa_d\cos d_t\right) \qquad (21)$$

When instead using SSL observations with an emission likelihood of (11), the update concentration is

$$\kappa_{t|t} = \sqrt{\kappa_{t|t-1}^2 + \tilde{\kappa}_t^2 + 2\kappa_{t|t-1}\tilde{\kappa}_t \cos\left(\tilde{\mu}_t - \mu_{t|t-1}\right)} \qquad (22)$$

and the update mean is

$$\mu_{t|t} = \mathrm{atan2}\left(\kappa_{t|t-1}\sin\mu_{t|t-1} + \tilde{\kappa}_t\sin\tilde{\mu}_t,\; \kappa_{t|t-1}\cos\mu_{t|t-1} + \tilde{\kappa}_t\cos\tilde{\mu}_t\right) \qquad (23)$$

The mean updates of (21) and (23) are weighted circular averages between the prediction mean and the observation. This serves to bring $\mu_{t|t}$ closer to the current observation.
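Because the resultant vectors of von Mises densities add under multiplication, the updates of (20)–(23) can be computed compactly in complex arithmetic. A small sketch with illustrative values:

```python
import numpy as np

def vm_product(mu1, kappa1, mu2, kappa2):
    # Product of two von Mises densities: the resultant vectors add, which
    # gives the update mean and concentration of (20)-(23) in one expression.
    z = kappa1 * np.exp(1j * mu1) + kappa2 * np.exp(1j * mu2)
    return np.angle(z), np.abs(z)

mu_pred, kappa_pred = 0.0, 4.0   # prediction from (15), illustrative values
d_t, kappa_d = 0.5, 8.0          # DOA observation and emission concentration
mu_upd, kappa_upd = vm_product(mu_pred, kappa_pred, d_t, kappa_d)
# The updated mean is a weighted circular average, pulled toward the more
# concentrated of the prediction and the observation.
print(mu_pred < mu_upd < d_t)
```

Expanding $|z|$ for this sketch recovers exactly the square-root form of (20), and $\arg z$ recovers the atan2 form of (21).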
After having computed the prediction density function through the forward recursion, the log-likelihood of the observation sequence can then be computed as

$$\log p(y_{1:T}) = \sum_{t=1}^{T} \log p(y_t|y_{1:t-1}) = \sum_{t=1}^{T} \log \int p(y_t|x_t)\,p(x_t|y_{1:t-1})\,dx_t \qquad (24)$$

When computing the log-likelihood, there is no need to preserve the von Mises form, as only a point estimate is needed. Therefore, the exact convolution [14] can be used, which for DOA observations is

$$p(d_t|d_{1:t-1}) = \frac{I_0\left(\sqrt{\kappa_{t|t-1}^2 + \kappa_d^2 + 2\kappa_{t|t-1}\kappa_d \cos\left(d_t - \mu_{t|t-1}\right)}\right)}{2\pi I_0(\kappa_{t|t-1})\,I_0(\kappa_d)} \qquad (25)$$

When using SSL observations, the exact emission likelihood of (6) is difficult to use, because of the numerical instability of the normalisation term. Therefore, the experiments in this paper approximate the SSL observation sequence log-likelihood using the same form as (25), with $d_t$ and $\kappa_d$ being substituted with $\tilde{\mu}_t$ and $\tilde{\kappa}_t$ respectively. This approximation again ignores the denominator terms in both (6) and (7), and renormalises (11) over $\tilde{\mu}_t$.
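The per-frame terms of (24), using the exact convolution of (25), can be sketched as follows (the parameter values are illustrative). Observations near the predicted mean score higher than those far from it.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def frame_log_lik(d_t, mu_pred, kappa_pred, kappa_d):
    # log p(d_t | d_{1:t-1}) for DOA observations, following (25)
    k = np.sqrt(kappa_pred ** 2 + kappa_d ** 2
                + 2.0 * kappa_pred * kappa_d * np.cos(d_t - mu_pred))
    return np.log(i0(k)) - np.log(2.0 * np.pi * i0(kappa_pred) * i0(kappa_d))

near = frame_log_lik(0.1, 0.0, 4.0, 8.0)  # observation close to the prediction
far = frame_log_lik(2.0, 0.0, 4.0, 8.0)   # observation far from the prediction
print(near > far)
```

Summing these per-frame terms over a segment gives the sequence log-likelihood used by the AHC affinity in Section 3.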
A speaker may have discontiguous regions of speech, since a speaker may not speak continuously throughout a whole meeting. As such, when modelling a speaker's movement with a Kalman filter, there may be frames for which the speaker has no DOA or SSL observation. This is analogous to an occlusion in the visual face tracking task. For such frames, $y_t$ is undefined and the emission likelihood is simply set to $p(y_t|x_t) = 1$. The update step for these frames simplifies from (14) to

$$p(x_t|y_{1:t}) = p(x_t|y_{1:t-1}) \qquad (26)$$

When computing the conditional log-likelihood for such frames, using $p(y_t|x_t) = 1$ in (24) yields $\log p(y_t|y_{1:t-1}) = 0$. Thus these frames, without observations, do not contribute to the total observation sequence log-likelihood.
In the setup used in this paper, diarisation is performed after speech separation, to handle overlapped speech. As such, it is possible to encounter situations where the observation sequence may have frames from multiple separated channels that overlap in time. Let $y_t^{(1)}$ and $y_t^{(2)}$ be observations at frame index $t$ from channels 1 and 2 respectively. During the update step in (14) and the likelihood computation in (24), the emission likelihood of $p(y_t|x_t)$ is simply substituted with $p(y_t^{(1)}|x_t)\,p(y_t^{(2)}|x_t)$, where it is assumed that the parallel observations of $y_t^{(1)}$ and $y_t^{(2)}$ are conditionally independent of each other when given $x_t$. As a reminder, the product of multiple von Mises density functions also has the form of a von Mises density function.
2.3 Parameter estimation
The Kalman filter has two parameters, $\kappa_x$ and $\kappa_d$. These express the dynamic ranges of the speaker's movement speed and the random noise in the observation respectively. One possible method of estimating them is through maximising the log-likelihood of the observation sequence,

$$\hat{\kappa}_x, \hat{\kappa}_d = \underset{\kappa_x, \kappa_d}{\arg\max}\; \log p(d_{1:T}) \qquad (27)$$

using the Expectation-Maximisation (EM) algorithm.
The E-step requires the computation of the state posteriors,

$$p(x_t|y_{1:T}) \propto p(x_t|y_{1:t})\,p(y_{t+1:T}|x_t) \qquad (28)$$

which is a product between the forward and backward density functions. The backward density function can be expressed as

$$p(y_{t:T}|x_{t-1}) = \int p(x_t|x_{t-1})\,p(y_t|x_t)\,p(y_{t+1:T}|x_t)\,dx_t \qquad (29)$$

This can be computed using the product and approximate convolution of von Mises density functions, described in Section 2.2. An analogous expression can also be derived for the joint state posterior, $p(x_{t-1}, x_t|y_{1:T})$, which is omitted here for brevity. For frames that do not have an observation of the speaker's location, $p(y_t|x_t) = 1$ in (29).
The M-step update for $\kappa_x$ with DOA observations is

$$\kappa_x^{(i+1)} = A^{-1}\left(\frac{1}{T-1} \sum_{t=2}^{T} \iint \cos\left(x_t - x_{t-1}\right) p(x_{t-1}, x_t|d_{1:T})\,dx_{t-1}\,dx_t\right) \qquad (30)$$

and that for $\kappa_d$ is

$$\kappa_d^{(i+1)} = A^{-1}\left(\frac{1}{T} \sum_{t=1}^{T} \int \cos\left(d_t - x_t\right) p(x_t|d_{1:T})\,dx_t\right) \qquad (31)$$

where $i$ is the EM iteration index. Monte Carlo approximations can be used to compute the integrals in (30) and (31).
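The integral in (31) can be approximated by sampling from the state posteriors. The sketch below uses stand-in von Mises posteriors and made-up DOA values purely to illustrate the Monte Carlo step; these are not values from the paper.

```python
import numpy as np
from scipy.special import i0, i1

def A(kappa):
    # A(kappa) = I1(kappa) / I0(kappa), as in (18)
    return i1(kappa) / i0(kappa)

def A_inv(r, n_iters=50):
    # Newton-Raphson inverse of A
    kappa = r * (2.0 - r * r) / (1.0 - r * r)
    for _ in range(n_iters):
        a = A(kappa)
        kappa -= (a - r) / (1.0 - a / kappa - a * a)
    return kappa

rng = np.random.default_rng(0)
doas = np.array([0.12, 0.18, 0.17, 0.28])        # made-up DOA observations
post_means = np.array([0.10, 0.20, 0.15, 0.30])  # stand-ins for p(x_t | d_{1:T})
post_kappas = np.full(4, 20.0)

# Monte Carlo estimate of the expectation of cos(d_t - x_t) under the state
# posteriors, followed by the A-inverse, as in the kappa_d update of (31).
samples = rng.vonmises(post_means[:, None], post_kappas[:, None], size=(4, 20000))
r = np.mean(np.cos(doas[:, None] - samples))
kappa_d_new = A_inv(r)
print(0.0 < r < 1.0 and kappa_d_new > 0.0)
```

The update for $\kappa_x$ in (30) follows the same pattern, but samples pairs of consecutive states from the joint posterior instead.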
Estimating the parameters by maximising the likelihood only makes sense when using DOA observations. When using SSL observations, the normalising terms in (6) and (7) are dependent on $x_t$. Therefore a naive maximum likelihood optimisation with the previously described SSL observation approximations, of omitting the normalising terms, may not converge. In the experiments presented in this paper, $\kappa_x$ and $\kappa_d$ were optimised for DOA observations, and the same parameter values were then also used for SSL observations.
The approximate convolution used in the forward and backward recursions yields E-step posteriors and M-step updates that are also approximate. Therefore, there is no guarantee that the observation sequence log-likelihood will not worsen at each iteration. Furthermore, there is no guarantee that the parameters will converge to locally optimal values. However, the convolution approximation is needed to allow the density functions to remain closed within the von Mises family, and to simplify the mathematics for the recursions. In the future, it may be interesting to investigate parameter estimation methods with fewer approximations.
3 Agglomerative hierarchical clustering
Speaker diarisation can be performed using AHC. The aim is to cluster together segments that belong to the same speaker. AHC begins by treating each segment as a separate cluster. At each iteration, the two clusters with the highest affinity score are merged in a greedy manner. The merging iterations continue until the maximum remaining affinity falls below a threshold. Work in [1] uses the Bayesian Information Criterion (BIC) as the affinity and computes the observation sequence likelihoods for the clusters through an HMM with Gaussian mixture model emission likelihoods. The BIC allows the model complexity to remain constant through all AHC iterations, thereby alleviating any favouritism toward having more speakers. However, the stopping threshold for the BIC can be difficult to tune robustly in practice. Furthermore, each AHC merging iteration requires an optimisation of the HMM to be run using the EM algorithm, which can be computationally expensive.
The setup in this paper instead uses a simpler AHC formulation that follows [31], where the affinity is computed as a cosine similarity between the speaker embedding centroids of the clusters. Given two clusters, $i$ and $j$, with speaker embedding centroids $\bar{v}_i$ and $\bar{v}_j$, computed over unit-length speaker embeddings, the affinity is

$$a^{\mathrm{emb}}_{ij} = \frac{\bar{v}_i^{\top} \bar{v}_j}{\|\bar{v}_i\| \|\bar{v}_j\|} \qquad (32)$$

This measures a similarity between the speaker embeddings of the two clusters. These embeddings are often extracted using a model that is trained on a speaker identification or speaker verification task. The embeddings are therefore expected to express characteristics of the audio that are useful in discriminating one speaker from another. This affinity does not consider the locations or movements of the speakers.
In order to use location information, it is proposed in [31] to use a similarity score that is a KL-divergence between SSL vectors, within an HMM diarisation framework. This paper proposes as a baseline that this score can also be used with AHC, by computing the SSL contribution to the affinity as

$$a^{\mathrm{SSL}}_{ij} = -\frac{1}{2}\left[\mathrm{KL}\left(\bar{s}_i \,\|\, \bar{s}_j\right) + \mathrm{KL}\left(\bar{s}_j \,\|\, \bar{s}_i\right)\right] \qquad (33)$$

where $\bar{s}_i$ is the SSL centroid of cluster $i$. Unlike in [31], the symmetric KL-divergence is used here. However, this affinity averages the SSL vectors over time when computing the centroid, and may therefore not explicitly model the movement of the speakers.
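A minimal sketch of a (33)-style affinity, assuming the SSL centroids are normalised categorical vectors; the small epsilon guarding against zero bins is an implementation choice, not from the paper.

```python
import numpy as np

def ssl_affinity(ssl_i, ssl_j, eps=1e-12):
    # Negated symmetric KL-divergence between two SSL centroids, cf. (33):
    # identical centroids score 0; dissimilar ones score increasingly negative.
    p = ssl_i + eps
    q = ssl_j + eps
    return -0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

bins = np.linspace(-np.pi, np.pi, 360, endpoint=False)
same = np.exp(5.0 * np.cos(bins - 0.5))
same /= same.sum()                       # centroid peaked near 0.5 rad
other = np.exp(5.0 * np.cos(bins + 2.0))
other /= other.sum()                     # centroid peaked near -2.0 rad
print(ssl_affinity(same, same), ssl_affinity(same, other) < 0.0)
```

Negating the divergence turns it into an affinity: the score is highest (zero) for clusters whose SSL centroids coincide, matching the merge-the-most-similar convention of AHC.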
This paper proposes to model speaker movements within AHC by computing an affinity based on the log-likelihood ratio between merged clusters and separated clusters, where the log-likelihood is computed using the Kalman filter,

$$a^{\mathrm{KF}}_{ij} = \frac{\log p\left(y_{\tau^{\mathrm{first}}_{i \cup j} : \tau^{\mathrm{last}}_{i \cup j}}\right)}{T_i + T_j} - \frac{\log p\left(y_{\tau^{\mathrm{first}}_{i} : \tau^{\mathrm{last}}_{i}}\right)}{T_i} - \frac{\log p\left(y_{\tau^{\mathrm{first}}_{j} : \tau^{\mathrm{last}}_{j}}\right)}{T_j} \qquad (34)$$

where $T_i$ is the number of frames that have a DOA or SSL observation in cluster $i$, and $\tau^{\mathrm{first}}_{i}$ and $\tau^{\mathrm{last}}_{i}$ represent the first and last frame indexes of cluster $i$ respectively. The normalisation by the number of frames with observations is necessary, as the log-likelihood from (24) scales linearly with the number of frames with observations. In the future, it may be interesting to explore BIC equivalents for this affinity, to take into account the model complexity.
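The greedy AHC procedure described at the start of this section can be sketched generically over any affinity function. The toy affinity below is the cosine-similarity form of (32) on made-up two-dimensional embeddings (real speaker embeddings have 128 dimensions), and the threshold is an illustrative value.

```python
import numpy as np

def ahc(n_segments, affinity, threshold):
    # Greedy AHC: repeatedly merge the highest-affinity cluster pair, and stop
    # when the best remaining affinity falls below the stopping threshold.
    clusters = [[i] for i in range(n_segments)]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = affinity(clusters[a], clusters[b])
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])

def cosine_affinity(ci, cj):
    # Cosine similarity between cluster embedding centroids, as in (32)
    u, v = emb[ci].mean(axis=0), emb[cj].mean(axis=0)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(sorted(sorted(c) for c in ahc(4, cosine_affinity, 0.5)))
```

The same loop accepts an interpolation of $a^{\mathrm{emb}}$ with $a^{\mathrm{SSL}}$ or $a^{\mathrm{KF}}$ as the affinity function; only the scoring changes, not the merging logic.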
The affinities of $a^{\mathrm{SSL}}$ and $a^{\mathrm{KF}}$ express the cluster similarity based on the speakers' angular locations, while $a^{\mathrm{emb}}$ expresses the cluster similarity based on the characteristics of the audio that are useful in discriminating one speaker from another. The location affinities alone may not be sufficient for the clustering task, as it is possible that multiple speakers may overlap in their angular locations, or even in their movements. However, an affinity based on location may be complementary to one based on the speaker's acoustic discriminative characteristics. Therefore, it may be useful to interpolate together $a^{\mathrm{emb}}$ and either $a^{\mathrm{SSL}}$ or $a^{\mathrm{KF}}$.
4 Meeting transcription setup
A rich meeting transcription task was used to evaluate the proposed approach. The setup followed that described in [32, 31]. Audio from a microphone array was beamformed and separated into multiple channels, where it was assumed that there were no concurrent speakers within each channel. On each channel, voice activity detection and speech recognition were run. The non-silence segments were then further split into segments with speaker purity, using speaker change detection at word boundaries. Each of these segments might contain one or more words. Diarisation was then run to cluster and tag the resulting segments from all channels together. For each segment, a speaker embedding was extracted using the model described in [33]. DOA and SSL features were also extracted, as is described in Section 2.1. AHC was used to cluster the segments from the same speaker together, using a combination of one or more of the affinity measures described in Section 3. Finally, the Hungarian algorithm was used to tag the clusters, by finding an optimal mapping between the clusters and the enrolled speakers, based on the similarity of the speaker embeddings. The AHC affinity of $a^{\mathrm{SSL}}$ in (33) was computed using a single SSL vector per cluster that represents the centroid, while $a^{\mathrm{KF}}$ in (34) used a sequence of DOA or SSL features with a duration and shift of 0.4s.
Table 1: Speaker-attributed WER on the dev set, using either DOA or SSL features as the Kalman filter observations.

Feature type | dev speaker-attributed WER (%)
DOA          | 22.71
SSL          | 22.46
5 Experiments
Experiments were performed on audio collected from internal Microsoft meetings, lasting up to 1 hour each, with an average of 7 active participants each. The dev set comprised 51 meetings totalling 23 hours, while the eval set comprised 60 meetings totalling 35 hours. The speaker embeddings had 128 dimensions, while the SSL vectors had 360 dimensions. The AHC stopping criterion and affinity interpolation weights were tuned on the dev set using parameter sweeps. The Kalman filter parameters were optimised on the dev set using the EM algorithm, as is described in Section 2.3. The performance was measured using the speaker-attributed Word Error Rate (WER) [32]. This first computes the WER separately for each speaker, by comparing the hypothesis to the reference for that speaker, and then averages the WERs over all speakers. The speaker-attributed WER expresses a combination of the speech recognition and diarisation performances, both of which are important for the rich meeting transcription task.
Table 1 compares computing the loglikelihood ratio using either DOA or SSL features in the Kalman filter, on the dev set. In both cases, the location tracking affinity of in (34) was interpolated with the speaker embedding affinity of in (32). The results suggest that using the full SSL vectors as the observed location features in the Kalman filter performs better than using DOA features. Each SSL vector is summarised by a mean angle in (10) and a concentration in (9). The concentration may contain information about the certainty of the instantaneous location estimation, which is not expressed in the DOA features. This weighs the contribution of each frame to the total sequence loglikelihood. The remaining experiment used the SSL features in the Kalman filter.
Table 2: Speaker-attributed WER (%) on meetings with stationary and moving speakers.

Test set | Affinity                              | stationary | moving | average
dev      | $a^{\mathrm{emb}}$                    | 22.61      | 27.19  | 25.42
dev      | $a^{\mathrm{emb}} + a^{\mathrm{SSL}}$ | 22.05      | 27.55  | 25.43
dev      | $a^{\mathrm{emb}} + a^{\mathrm{KF}}$  | 21.10      | 23.32  | 22.46
eval     | $a^{\mathrm{emb}}$                    | 25.32      | 20.39  | 23.65
eval     | $a^{\mathrm{emb}} + a^{\mathrm{SSL}}$ | 24.73      | 19.83  | 23.06
eval     | $a^{\mathrm{emb}} + a^{\mathrm{KF}}$  | 23.65      | 20.15  | 22.40
The use of the Kalman filter log-likelihood ratio, $a^{\mathrm{KF}}$ in (34), within the affinity for AHC can be compared against the baselines of using speaker embeddings with $a^{\mathrm{emb}}$ in (32) and the KL-divergence based instantaneous location affinity of $a^{\mathrm{SSL}}$ in (33). The meetings were categorised into those with and without moving speakers. A meeting was considered to contain moving speakers if, for at least one speaker in the meeting, it was possible to find two disjoint angular arcs of a minimum width, where that speaker spent at least 30s of active speech in each of the two angular regions not covered by these two arcs, based on manually transcribed location information from video data. The performances of the various AHC affinities on the stationary and moving meetings are shown in Table 2. Interpolating $a^{\mathrm{SSL}}$ with $a^{\mathrm{emb}}$ improves the performance for both stationary and moving meetings on the eval set, and the stationary meetings on the dev set, compared against using $a^{\mathrm{emb}}$ alone. This agrees with [31] in suggesting that location information may be complementary to the speaker embeddings for the clustering task. Initial tests suggested that it may be difficult to robustly tune the AHC stopping criterion and the affinity interpolation weights between $a^{\mathrm{emb}}$ and $a^{\mathrm{SSL}}$, as the cosine similarity in $a^{\mathrm{emb}}$ has a dynamic range between $-1$ and $1$, while that for the KL-divergence in $a^{\mathrm{SSL}}$ is between $-\infty$ and $0$. The Kalman filter log-likelihood ratio in $a^{\mathrm{KF}}$ outperforms $a^{\mathrm{SSL}}$ on the moving meetings in the dev set, but not the eval set. This may again suggest that it can be difficult to robustly tune the hyperparameters for the interpolated affinity to generalise well to new data, since $a^{\mathrm{emb}}$ and $a^{\mathrm{KF}}$ again have different dynamic ranges. Despite this, $a^{\mathrm{KF}}$ consistently yields improvements for the moving meetings compared to using $a^{\mathrm{emb}}$ alone. In both datasets, $a^{\mathrm{KF}}$ is also able to yield consistent improvements over $a^{\mathrm{emb}}$ when averaged over all meetings, thereby suggesting that there may be a benefit to explicitly modelling speaker movements when performing diarisation.
6 Conclusion
This paper has presented an approach to explicitly model the spatial movements of speakers while performing diarisation. The movements are modelled through location tracking, using a Kalman filter with von Mises density functions as the transition and emission likelihoods. This Kalman filter is used to compute log-likelihood ratios between different cluster merging hypotheses in AHC. The results suggest that explicitly modelling the movements of speakers may provide information that is complementary to the speaker embeddings for the diarisation task.
References
 [1] (2003-11) A robust speaker clustering algorithm. In ASRU, St. Thomas, US Virgin Islands, pp. 411–416.
 [2] (2003-05) Speaker tracking with a microphone array using Kalman filtering. Advances in Radio Science 1, pp. 113–117.
 [3] (2011-05) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
 [4] (2019-09) Bayesian HMM based x-vector clustering for speaker diarization. In Interspeech, Graz, Austria, pp. 346–350.
 [5] (2020-04) The LOCATA challenge: acoustic source localization and tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1620–1643.
 [6] (2011) Tracking and recognizing multiple faces using Kalman filter and modular PCA. Procedia Computer Science 6, pp. 256–261.
 [7] (2015-12) Tracking the active speaker based on a joint audio-visual observation model. In ICCVW, Santiago, Chile, pp. 702–708.
 [8] (2020-07) The continuous categorical: a novel simplex-valued exponential family. In ICML, pp. 3637–3647.
 [9] (2016-08) Complex angular central Gaussian mixture model for directional statistics in mask-based microphone array signal processing. In EUSIPCO, Budapest, Hungary, pp. 1153–1157.
 [10] (1997-02) Automatic speaker clustering. In DARPA Speech Recognition Workshop, Chantilly, USA.
 [11] (1960-03) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45.
 [12] (2016-03) Recursive Bayesian filtering in circular state spaces. IEEE Aerospace and Electronic Systems Magazine 31 (3), pp. 70–87.
 [13] (2020-05) BUT system for the second DIHARD speech diarization challenge. In ICASSP, Barcelona, Spain, pp. 6529–6533.
 [14] (1999-01) Directional statistics. John Wiley and Sons.
 [15] (2013-05) Speaker tracking with spherical microphone arrays. In ICASSP, Vancouver, Canada, pp. 3981–3985.
 [16] (2005-09) Multiple moving speaker tracking by microphone array on mobile robot. In Interspeech, Lisbon, Portugal, pp. 249–252.
 [17] (2006-09) A spectral clustering approach to speaker diarization. In ICSLP, Pittsburgh, USA, pp. 2178–2181.
 [18] (2007-09) Speaker diarization for multiple-distant-microphone meetings using several sources of information. IEEE Transactions on Computers 56 (9), pp. 1212–1224.
 [19] (2019-12) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381–385.
 [20] (2014-05) Multi-speaker tracking using multiple distributed microphone arrays. In ICASSP, Florence, Italy, pp. 614–618.
 [21] (2018-09) Localization and tracking of an acoustic source using a diagonal unloading beamforming and a Kalman filter. In LOCATA Challenge Workshop, Tokyo, Japan.
 [22] (2007-05) Multi-speaker localization and tracking in intelligent environments. In CLEAR 2007 and RT 2007, Baltimore, USA, pp. 82–90.
 [23] (2007-10) A robust method for multiple face tracking using Kalman filter. In AIPR, Washington DC, USA, pp. 125–130.
 [24] (2011-08) Exploiting intra-conversation variability for speaker diarization. In Interspeech, Florence, Italy, pp. 945–948.
 [25] (1997-02) Automatic segmentation, classification and clustering of broadcast news audio. In DARPA Speech Recognition Workshop, Chantilly, USA, pp. 97–99.
 [26] (2018-04) X-vectors: robust DNN embeddings for speaker recognition. In ICASSP, Calgary, Canada, pp. 5329–5333.
 [27] (1963-12) Random walk on a circle. Biometrika 50 (3–4), pp. 385–390.
 [28] (2002-01) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63 (2), pp. 411–423.
 [29] (2014-05) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP, Florence, Italy, pp. 4052–4056.
 [30] (2012-03) Speaker diarization of meetings based on large TDOA feature vectors. In ICASSP, Kyoto, Japan, pp. 4173–4176.
 [31] (2021-06) Hidden Markov model diarisation with speaker location information. In ICASSP, Toronto, Canada, pp. 7158–7162.
 [32] (2019-12) Advances in online audio-visual meeting transcription. In ASRU, Singapore, pp. 276–283.
 [33] (2021-01) ResNeXt and Res2Net structures for speaker verification. In SLT, Shenzhen, China, pp. 301–307.