Lightweight and Optimized Sound Source Localization and Tracking Methods for Open and Closed Microphone Array Configurations

12/01/2018 ∙ by Francois Grondin, et al. ∙ Université de Sherbrooke

Human-robot interaction in natural settings requires filtering out the different sources of sounds from the environment. Such ability usually involves the use of microphone arrays to localize, track and separate sound sources online. Multi-microphone signal processing techniques can improve robustness to noise, but the processing cost increases with the number of microphones used, limiting response time and widespread use on different types of mobile robots. Since sound source localization methods are the most expensive in terms of computing resources, as they involve scanning a large 3D space, minimizing the amount of computations required would facilitate their implementation and use on robots. The robot's shape also brings constraints on the microphone array geometry and configurations. In addition, sound source localization methods usually return noisy features that need to be smoothed and filtered by tracking the sound sources. This paper presents a novel sound source localization method, called SRP-PHAT-HSDA, that scans space with coarse and fine resolution grids to reduce the number of memory lookups. A microphone directivity model is used to reduce the number of directions to scan and ignore non-significant pairs of microphones. A configuration method is also introduced to automatically set parameters that are normally empirically tuned according to the shape of the microphone array. For sound source tracking, this paper presents a modified 3D Kalman (M3K) method capable of simultaneously tracking in 3D the directions of sound sources. Using a 16-microphone array and low cost hardware, results show that SRP-PHAT-HSDA and M3K perform at least as well as other sound source localization and tracking methods while using up to 4 and 30 times less computing resources, respectively.




1 Introduction

Distant Speech Recognition (DSR) occurs when speech is acquired with one or more microphones placed away from the mouth of the speaker, making recognition difficult because of background noise, overlapping speech from other speakers, and reverberation Woelfel and McDonough (2009); Kumatari et al. (2012). DSR is necessary for enabling verbal interactions without the use of intrusive body- or head-mounted devices, but recognizing distant speech robustly remains a challenge Vacher et al. (2015). Microphone arrays make it possible to capture sounds for DSR Kumatari et al. (2012) in human-robot interaction (HRI). This requires installing multiple microphones on the robot platform, and processing the perceived distant speech by filtering out noise from fans and actuators on the robot and non-stationary background sound sources in reverberant environments, fast enough to support live interactions. This process usually relies first on localizing and tracking the perceived sound sources, to then be able to separate them Grondin et al. (2013) for specific processing such as speech recognition Brodeur et al. (2016); Fréchette et al. (2012). Using sound source localization and tracking methods that are robust to noise and have a low computational cost is important Hoshiba et al. (2017), as these are usually the first steps in engaging speech-based HRI. Natural speech HRI requires the robot to be able to detect speech commands in noisy environments, while avoiding false detections.

Figure 1: Block diagram of sound source localization and tracking

As illustrated by Fig. 1, sound source localization (SSL) and sound source tracking (SST) are done in sequence. For each frame, SSL uses the signals captured by the M-microphone array and generates potential sources, where each potential source consists of a direction of arrival (DoA) in Cartesian coordinates and a steered-beamformer energy level. SSL methods provide noisy observations of the DoAs of sound sources, caused by the sporadic activity of sound sources (e.g., the sparsity of speech) combined with the presence of multiple competing sound sources. SST then uses the potential sources and returns tracked sources, filtering out this noise to provide a smooth trajectory for each sound source. Improved SSL and SST capabilities can be directly associated with the number of microphones used, which influences processing requirements Valin et al. (2007).
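The SSL/SST interface described above can be sketched with minimal data structures; the class and function names below are hypothetical, chosen only to illustrate the frame-by-frame flow from potential sources to tracked sources:

```python
from dataclasses import dataclass

@dataclass
class PotentialSource:
    """Noisy SSL observation for one frame: a DoA on the unit sphere
    plus the steered-beamformer energy at that direction."""
    doa: tuple  # unit vector (x, y, z)
    energy: float

@dataclass
class TrackedSource:
    """Smoothed SST output: a filtered DoA with a persistent identity."""
    source_id: int
    doa: tuple

def ssl_sst_pipeline(frames, ssl, sst):
    """For each frame, SSL yields potential sources; SST filters them
    into smooth tracked trajectories."""
    for frame in frames:
        potentials = ssl(frame)    # list of PotentialSource
        tracked = sst(potentials)  # list of TrackedSource
        yield tracked
```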

In this process, SSL is the most expensive in terms of computation, and a variety of SSL algorithms exists. Rascon et al. Rascon et al. (2015) present a lightweight SSL method that uses little memory and CPU resources, but it is limited to three microphones and scans the DoA of sound sources only in 2D. Nesta and Omologo Nesta and Omologo (2012) describe a generalized state coherence transform to perform SSL, which is particularly effective when multiple sound sources are present; however, this method relies on independent component analysis (ICA), which takes many seconds to converge. Drude et al. Drude et al. (2015) use a kernel function that relies on both phase and level differences, at the cost of increasing the computational load. Loesch and Yang Loesch and Yang (2010) introduce a localization method based on time-frequency sparseness, which remains sensitive to high reverberation levels. Multiple Signal Classification based on Standard Eigenvalue Decomposition (SEVD-MUSIC) makes SSL robust to additive noise Nakadai et al. (2010). SEVD-MUSIC, initially used for narrowband signals Schmidt (1986), has been adapted for broadband sound sources such as speech Ishi et al. (2009), and is robust to noise as long as the latter is less powerful than the signals to be localized. Multiple Signal Classification based on Generalized Eigenvalue Decomposition (GEVD-MUSIC) Nakamura et al. (2011) has been introduced to cope with this issue, but at the price of additional computations. Multiple Signal Classification based on Generalized Singular Value Decomposition (GSVD-MUSIC) reduces the computational load of GEVD-MUSIC and improves localization accuracy Nakamura et al. (2012), but it still relies on the eigendecomposition of a matrix. Other methods take advantage of specific array geometries (linear, circular or spherical) to improve robustness and reduce computational load Danès and Bonnal (2010); Pavlidi et al. (2012); Rafaely et al. (2010). Even though interesting properties arise from these geometries, such configurations are less practical for a mobile robot because of the physical constraints introduced by its specific shape.

SSL can also be performed using a Steered Response Power with Phase Transform (SRP-PHAT), usually computed from weighted Generalized Cross-Correlation with Phase Transform (GCC-PHAT) values at each pair of microphones Grondin et al. (2013); Valin et al. (2007). SRP-PHAT requires fewer computations than MUSIC-based methods, but scanning the 3D space still involves a significant amount of computations for a large number of microphones. Stochastic region contraction Do et al. (2007), hierarchical search Zotkin and Duraiswami (2004); Do and Silverman (2009); Nunes et al. (2014) and vectorization Lee and Kalker (2010) have also been studied to speed up scanning with SRP-PHAT, but they usually limit the search to a 2D surface and a single source. Marti et al. Marti et al. (2013) also propose a recursive search over coarse and fine grids. This method divides the space into rectangular volumes, and maps points from the fine grid to only one point on the coarse grid. It however neglects microphone directivity, and it uses an averaging window over the GCC values, which may reduce the contribution of a peak during the coarse scan when neighboring values are negative.

SST methods can be categorized into four types:

  • Viterbi search. Anguera et al. Anguera et al. (2007) propose a post-processing Viterbi method to track a sound source over time. This method introduces a significant latency when used online, making it appropriate only for offline processing. Tracking is also performed on discrete states, which restrains the direction of the tracked source to a fixed grid.

  • Sequential Monte Carlo (SMC) filtering. The SMC method, also called particle filtering, performs low-latency tracking of a single sound source Williamson and Ward (2002); Ward et al. (2003); Vermaak and Blake (2001). Valin et al. Grondin et al. (2013); Valin et al. (2007) adapt the SMC method to track multiple sound sources. This method consists of sampling the space with a finite number of particles to model the non-Gaussian state distribution. SMC allows tracking with continuous trajectories, but it requires a significant amount of computations and is nondeterministic because it uses randomly generated particles.

  • Kalman filtering. Rascon et al. Rascon et al. (2015) propose a lightweight method that relies on Kalman filters, which allows tracking with continuous trajectories and considerably reduces the amount of computations. This method is however limited to DoAs in spherical coordinates, using elevation and azimuth, which generates distortion as the azimuth resolution changes with elevation; it also introduces azimuth wrapping. Marković et al. Marković et al. (2016a) present an extended Kalman filter on Lie groups (LG-EKF) to perform directional tracking with an 8-microphone array. LG-EKF solves the azimuth wrapping phenomenon, but it limits the tracking to a 2D circle, and is therefore unsuitable for tracking sources on a 3D spherical surface.

  • Joint probabilistic data association filter (JPDA). Marković et al. Marković et al. (2016b) introduce this tracking method for the 3D spherical surface, relying on a Bayesian von Mises-Fisher estimator. This approach however requires prior knowledge of the number of active sources, and it neglects the motion model of each tracked source, which leads to switched or merged trajectories when two sources cross each other.

To improve SSL and SST, this paper introduces a SRP-PHAT method referred to as SRP-PHAT-HSDA, for Hierarchical Search with Directivity model and Automatic calibration, and a tracking method based on a modified 3D Kalman filter (M3K) using Cartesian coordinates. SRP-PHAT-HSDA scans the 3D space over a coarse resolution grid, and then refines the search over a specific area. It includes a Time Difference of Arrival (TDOA) uncertainty model to optimize the scan accuracy at various grid resolution levels, for both open and closed microphone array configurations. A microphone directivity model is also used to reduce the number of directions to scan and to ignore non-significant pairs of microphones. M3K replaces the SMC filters used in Valin et al. Valin et al. (2007) by Kalman filters, and introduces three new concepts: 1) normalization of the states to restrict the space to a unit sphere; 2) derivation of a closed-form expression for the likelihood of a coherent source to speed up computations; and 3) weighted updates of the Kalman mean vector and covariance matrix for simultaneous tracking of sound sources. These modifications provide efficient tracking of multiple sound sources, make the method convenient for low-cost embedded hardware as it requires fewer computations than the SMC method, and solve the distortion and wrapping introduced by Kalman filtering with spherical coordinates.

The paper is organized as follows. First, Section 2 characterizes the computing requirements of SRP-PHAT in comparison to SEVD-MUSIC, to justify and situate the improvements brought by SRP-PHAT-HSDA. Sections 3 and 4 then describe SRP-PHAT-HSDA and M3K, respectively. Section 5 presents the experimental setup involving 8- and 16-microphone circular and closed cubic arrays on a mobile robot, implementing the SSL and SST methods on a Raspberry Pi 3. Section 6 presents the results obtained from experiments comparing SRP-PHAT with SRP-PHAT-HSDA, and M3K with SMC. Finally, Section 7 concludes this paper with final remarks and future work.

2 Computing Requirements of SRP-PHAT versus SEVD-MUSIC

SSL is usually divided into two tasks: 1) estimation of the TDOAs, and 2) DoA search over the 3D space around the microphone array. The main difference between SRP-PHAT and SEVD-MUSIC lies in Task 1: SRP-PHAT relies on the Generalized Cross-Correlation with Phase Transform (GCC-PHAT), while SEVD-MUSIC uses Standard Eigenvalue Decomposition (SEVD). The intent here is to demonstrate which method is the most efficient for Task 1, and then, using this method, how Task 2 can be further improved to reduce computing needs.

Both methods first capture synchronously the acoustic signals from the M microphones in the array. These signals are divided into frames of N samples, spaced by \Delta N samples, and multiplied by the sine window w[n]:

x_i^l[n] = w[n] \, x_i[n + l \Delta N], \quad w[n] = \sin(\pi n / N)

with l, i and n representing the frame, microphone and sample indexes, respectively. The methods then compute the Short-Time Fourier Transform (STFT) with an N-sample real Fast Fourier Transform (FFT), where X_i^l[k] stands for the spectrum at each frequency bin k, and j stands for the complex number \sqrt{-1}:

X_i^l[k] = \sum_{n=0}^{N-1} x_i^l[n] \, e^{-j 2 \pi k n / N}
The sine window allows reconstruction in the time domain with a 50% frame overlap, and thus the same STFT results can be used for both the localization and separation steps, which reduces the total amount of computations. SRP-PHAT relies on the GCC-PHAT, computed for each pair of microphones i and j (where i < j). The Inverse Fast Fourier Transform (IFFT) provides an efficient computation of the GCC-PHAT, given that the time delay is an integer:

r_{ij}^l[n] = \frac{1}{N} \sum_{k=0}^{N-1} \frac{X_i^l[k] \, (X_j^l[k])^*}{|X_i^l[k]| \, |X_j^l[k]|} \, e^{j 2 \pi k n / N}
The IFFT complexity depends on the number of samples per frame N, which is usually a power of 2. The order of complexity for a real IFFT is O(N log N). With M(M−1)/2 pairs of microphones, the SRP-PHAT computing complexity reaches O(M²N log N).

SEVD-MUSIC relies on the eigenvalue decomposition of the cross-correlation matrix. The correlation matrix R[k] is defined as follows, where E\{\cdot\} and (\cdot)^H stand for the expectation and Hermitian operators, respectively:

R[k] = E\{ x^l[k] \, (x^l[k])^H \}
The vector x^l[k] concatenates the spectra of all M microphones for each frame l and frequency bin k (where (\cdot)^T stands for the transpose):

x^l[k] = [ X_1^l[k], X_2^l[k], \dots, X_M^l[k] ]^T
In practice, the correlation matrix is usually computed at each frame l with an estimator that sums the vectors x^l[k] over time (a window of W frames) for each frequency bin k:

\hat{R}^l[k] = \sum_{p=0}^{W-1} x^{l-p}[k] \, (x^{l-p}[k])^H
The SEVD-MUSIC complexity depends on the size M of the matrix \hat{R}^l[k], and is O(M³) Holmes et al. (2007). This operation is performed at each frequency bin k, for a total of N/2 + 1 bins, which leads to an overall complexity of O(M³N).

To better express the computing requirements of both methods, Table 1 presents simulation results of the time required to process one frame, with various values of N and M, on a Raspberry Pi 3. SRP-PHAT computes the real N-sample FFTs using the FFTW C library Frigo and Johnson (1998), and SEVD-MUSIC evaluates the SEVD of the M × M matrices using the Eigen C++ library Guennebaud and Jacob (2014). Some methods (e.g., Nakadai et al. (2010); Nakamura et al. (2011, 2012)) compute the SEVD only in the lower frequency range (where speech is usually observed) to reduce the computational load. However, this discards some useful spectral information in the higher frequencies (in speech fricative sounds for instance), which are considered with the SRP-PHAT method. To ensure a fair comparison, both methods treat the whole spectral range. For a fixed frame size N, the processing time of SEVD-MUSIC grows faster with the number of microphones than that of SRP-PHAT, yet by a factor less than the expected O(M³) complexity, which is probably explained by the truncated computation of the SEVD for small eigenvalues. Similarly, for a fixed number of microphones M, the processing time of both methods grows with the frame size N. Overall, SRP-PHAT requires from 206 to 525 times less computing time than the SEVD-MUSIC method, depending on N and M. This suggests that SRP-PHAT is more suitable for online processing, as it performs Task 1 effectively. It is therefore desirable to use SRP-PHAT for Task 1, and to optimize Task 2 to obtain an efficient SSL method.

(228) (245) (225) (206)
(303) (321) (290) (260)
(389) (424) (400) (350)
(469) (525) (477) (524)
Table 1: Processing time in sec/frame for {SRP-PHAT, SEVD-MUSIC} and the ratio between both (in parentheses)
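The complexity gap above can be reproduced with a small benchmark sketch, an illustration rather than the paper's implementation: per frame, SRP-PHAT computes one real IFFT per microphone pair, while SEVD-MUSIC computes one M × M eigendecomposition per frequency bin. NumPy stands in for the FFTW and Eigen libraries here.

```python
import time
import numpy as np

def srp_phat_task1(X):
    """GCC-PHAT for all microphone pairs via real IFFTs.
    X: (M, K) complex spectra, with K = N/2 + 1 frequency bins."""
    M = X.shape[0]
    gcc = []
    for i in range(M):
        for j in range(i + 1, M):
            cross = X[i] * np.conj(X[j])
            cross /= np.abs(cross) + 1e-12   # phase transform (unit magnitude)
            gcc.append(np.fft.irfft(cross))  # O(N log N) per pair
    return np.array(gcc)

def sevd_task1(R):
    """Eigendecomposition of the M x M correlation matrix at each bin.
    R: (K, M, M) Hermitian matrices; O(M^3) per bin."""
    return [np.linalg.eigh(R[k]) for k in range(R.shape[0])]

M, N = 16, 1024
K = N // 2 + 1
rng = np.random.default_rng(0)
X = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
R = np.einsum('ik,jk->kij', X, np.conj(X))  # rank-1 estimate per bin

t0 = time.perf_counter(); srp_phat_task1(X); t1 = time.perf_counter()
sevd_task1(R); t2 = time.perf_counter()
print(f"GCC-PHAT: {t1 - t0:.4f} s, SEVD: {t2 - t1:.4f} s")
```

Timings vary with hardware, but the FFT-based path is consistently much faster than the per-bin eigendecompositions, in line with Table 1.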


3 SRP-PHAT-HSDA

To understand how SRP-PHAT-HSDA works, let us start by explaining SRP-PHAT, and then explain the particularities added by SRP-PHAT-HSDA. Figure 2 illustrates the SRP-PHAT-HSDA method, which uses the microphone signals to localize potential sources. The Microphone Directivity module and the MSW Automatic Calibration module are used at initialization, and provide parameters to perform the optimized GCC-PHAT, Maximum Sliding Window (MSW) filtering and Hierarchical Search online.

Figure 2: Block diagram of SRP-PHAT-HSDA

The underlying mechanism of SRP-PHAT is to search for potential sources in each frame over a discrete space Grondin et al. (2013); DiBiase et al. (2001). For each potential source, the computed GCC-PHAT frames are filtered using a Maximum Sliding Window (MSW). The sum of the filtered GCC-PHAT frames over all pairs of microphones provides the acoustic energy for each direction of the discrete space, and the direction with the maximum energy corresponds to a potential source. Once a potential source is obtained, its contribution is removed from the GCC-PHAT frames by setting the amplitude of the corresponding TDOA to zero, and the space is scanned again. This process is repeated until the DoAs and energy levels of all potential sources are generated.
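This repeated scan can be sketched as follows, assuming precomputed integer TDOA lookup tables and MSW-filtered GCC-PHAT frames (array names are illustrative, not the paper's notation):

```python
import numpy as np

def srp_phat_scan(gcc, tdoa, n_sources):
    """Repeated scan of a discrete DoA grid.
    gcc:  (P, N) MSW-filtered GCC-PHAT frames, one row per microphone pair
    tdoa: (P, D) integer TDOAs (in samples) of each pair for each of the
          D directions on the discrete sphere
    Returns (direction index, energy) of n_sources potential sources."""
    gcc = gcc.copy()
    P, N = gcc.shape
    sources = []
    for _ in range(n_sources):
        # acoustic energy of each direction: sum over pairs of the GCC
        # values at the TDOAs that direction would produce
        energy = gcc[np.arange(P)[:, None], tdoa % N].sum(axis=0)
        best = int(np.argmax(energy))
        sources.append((best, float(energy[best])))
        # remove the contribution of this source before rescanning
        gcc[np.arange(P), tdoa[:, best] % N] = 0.0
    return sources
```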

A discrete unit sphere provides the potential DoAs for sound sources. As in Grondin et al. (2013) and Valin et al. (2007), a regular convex icosahedron made of 12 points defines the initial discrete space, which is then refined recursively until the desired spatial resolution is obtained. Figure 3 shows the regular icosahedron and two subsequent refinement levels.

Figure 3: Discrete unit spheres

Each point on the discrete sphere corresponds to a unit vector, and the set of all such vectors defines the scan space, whose number of points depends on the resolution level L (a level-L refinement yields 10·4^L + 2 points). In the SRP-PHAT method proposed in Valin et al. (2007), the scan space is refined four times (L = 4) to generate 2562 points.
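The recursive refinement can be sketched with the standard icosahedron subdivision scheme (assumed here to match the one used in the paper): each level splits every triangle into four and reprojects the new vertices onto the unit sphere.

```python
import numpy as np

def icosphere(levels):
    """Recursively subdivide a regular icosahedron into a discrete unit
    sphere with 10 * 4**levels + 2 points."""
    t = (1.0 + 5 ** 0.5) / 2.0
    verts = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
             (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
             (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    verts = [np.array(v) / np.linalg.norm(v) for v in verts]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    for _ in range(levels):
        cache, new_faces = {}, []
        def midpoint(a, b):
            # each shared edge produces exactly one new vertex
            key = (min(a, b), max(a, b))
            if key not in cache:
                m = verts[a] + verts[b]
                verts.append(m / np.linalg.norm(m))  # reproject on sphere
                cache[key] = len(verts) - 1
            return cache[key]
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return np.array(verts), faces
```

With `levels = 4` this produces the 2562 directions mentioned above.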

To further reduce SRP-PHAT computations, and maintain a high localization accuracy regardless of the microphone array shape, SRP-PHAT-HSDA adds the following elements:

  • Microphone Directivity (MD): When the number of microphones M increases, the computational load grows with the O(M²) number of microphone pairs. The proposed method assumes that microphones have a directivity pattern, which introduces constraints that reduce the space to be scanned and the number of pairs of microphones to use, and in turn decreases the amount of computations.

  • Maximum Sliding Window Automatic Calibration (MSWAC): TDOA estimation is influenced by the uncertainty in the speed of sound, in the microphone positions (which may be difficult to measure precisely for microphone arrays of complex geometry), and in the scan grid discretization, all of which should be modeled. The MSW size can be tuned manually to maximize localization accuracy, but this remains a time-consuming task that has to be repeated for each new microphone array geometry. The TDOA uncertainty model addresses this challenge by automatically tuning the MSW size to maximize localization accuracy.

  • Hierarchical Search (HS): Searching for potential sources involves scanning the 3D space according to a grid with a specific resolution. Finer resolution means better precision but higher computation. To reduce computations, a solution is to first do a scan with a grid at coarse resolution to identify a potential sound source, and then do another scan with a grid with a fine resolution using the location found during the first scan to pinpoint a more accurate direction.

3.1 Microphone Directivity

In a microphone array, microphones are usually assumed to be omnidirectional, i.e., acquiring signals with equal gain from all directions. In practice however, microphones on a robot platform are often mounted on a rigid body, which may block the direct propagation path between a sound source and a microphone. The attenuation is mostly due to diffraction, and changes as a function of frequency. Since the exact diffraction model is not available, the proposed model relies on simpler assumptions: 1) there is a unit gain for sound sources with a direct propagation path, and 2) the gain is null when the path is blocked by the robot body. As the signal-to-noise ratio (SNR) is generally unknown for the blocked microphones, it is safer to assume a low SNR, and setting the gain to zero prevents noise from being injected in the observations. Moreover, the gain is set constant for all frequencies, and a smooth transition band connects the unit and null gain regions. This transition band prevents abrupt changes in gain when the sound source position varies. Figure 4 introduces the angle θ, as defined by (7), between the direction of a sound source and the orientation of the microphone, modeled by a unit vector.

Figure 4: Microphone directivity angle as a function of microphone orientation and source direction

Figure 5 illustrates the logistic function that models the gain as a function of the angle θ, given in (8). This function is parametrized by two angles: the largest angle at which the gain is still one, and the smallest angle at which the gain is null. The region between both angles can be viewed as a transition band.

Figure 5: Microphone gain response

To make SSL more robust to reverberation, the scan space can also be restricted to a specific region. For instance, the scan space is limited to the hemisphere that points to the ceiling to ignore reflections from the floor. A unit vector stands for the orientation of the scan space.

Since microphone directivity introduces some constraints on the scan space, the spatial gains of microphones i and j need to be large enough for a source located in a given direction to excite both microphones. The gain also needs to be large enough for this direction to be part of the scan space. A binary mask models this condition, where a constant stands for the minimal gain value:


When the mask is zero, the value of the corresponding sample in the GCC-PHAT frame is negligible and can be ignored. When all pairs of microphones are uncorrelated for a given direction, that direction can simply be ignored:


Similarly, the GCC-PHAT between microphones i and j needs to be computed only when these microphones are excited simultaneously for at least one scanned direction:


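The directivity gain and pair masks described in this subsection can be sketched as follows; the logistic slope, the parameter names and the minimal-gain threshold are illustrative assumptions, not the paper's values:

```python
import numpy as np

def directivity_gain(u, d, theta_one, theta_zero):
    """Gain of a microphone with orientation d (unit vector) for a source
    in unit direction u: close to 1 below theta_one, close to 0 above
    theta_zero, with a logistic transition band in between (radians)."""
    theta = np.arccos(np.clip(np.dot(u, d), -1.0, 1.0))
    mid = 0.5 * (theta_one + theta_zero)        # center of transition band
    slope = 10.0 / max(theta_zero - theta_one, 1e-6)
    return 1.0 / (1.0 + np.exp(slope * (theta - mid)))

def pair_masks(dirs, orientations, theta_one, theta_zero, g_min=0.5):
    """Binary mask per (pair, direction): True when both microphones of
    the pair see the direction with gain above the minimal value g_min."""
    gains = np.array([[directivity_gain(u, d, theta_one, theta_zero)
                       for u in dirs] for d in orientations])  # (M, D)
    M = len(orientations)
    masks = {}
    for i in range(M):
        for j in range(i + 1, M):
            masks[(i, j)] = (gains[i] >= g_min) & (gains[j] >= g_min)
    return masks
```

A pair whose mask is False for every direction can then be skipped entirely when computing GCC-PHAT, which is the computational saving described above.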
3.2 MSW Automatic Calibration

The TDOA between two microphones i and j is denoted by τ_ij. Under the far-field assumption, the TDOA is set according to (12), where u represents the normalized direction of the sound source, f_S stands for the sample rate (in samples/sec), c for the speed of sound (in m/s), and p_i and p_j for the microphone positions. Note that the TDOA is usually given in seconds, but is provided in samples here since all processing is performed on discrete-time signals.

\tau_{ij} = \frac{f_S}{c} \, (p_i - p_j) \cdot u
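Assuming the far-field model above, the TDOA in samples can be computed directly (the argument names and default sample rate are hypothetical):

```python
import numpy as np

def tdoa_far_field(p_i, p_j, u, fs=16000.0, c=343.0):
    """Far-field TDOA (in samples) between microphones at positions p_i
    and p_j (meters) for a source in unit direction u:
    tau_ij = (fs / c) * (p_i - p_j) . u"""
    return fs / c * np.dot(np.asarray(p_i) - np.asarray(p_j), u)

# two mics 10 cm apart on the x axis, source broadside (y axis): zero delay
print(tdoa_far_field([0.05, 0, 0], [-0.05, 0, 0], [0, 1, 0]))  # 0.0
# source along the x axis: maximum delay of fs * 0.1 / c samples
print(tdoa_far_field([0.05, 0, 0], [-0.05, 0, 0], [1, 0, 0]))
```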
The speed of sound varies according to air temperature, humidity and pressure. These parameters usually lie within a known range in a room (and even outdoors), but it remains difficult to calculate the exact speed of sound. In SRP-PHAT-HSDA, the speed of sound is therefore modeled as a normally distributed random variable, with a known mean and standard deviation. The exact position of each microphone is also modeled by a trivariate normal distribution, with a mean vector and a covariance matrix.

The first step consists in solving for the ratio between the sample rate and the speed of sound (in samples/m). To make calculations easier, a normally distributed random variable is introduced for the speed of sound. The resulting expression can be linearized, given that the standard deviation of the speed of sound is small compared to its mean. Expanding this function as a Taylor series, a first-order approximation holds, and the ratio is itself a normally distributed random variable with a known mean and standard deviation.

The second step consists in solving the projection of the distance between both microphones, represented by the random variables of their positions, onto the deterministic unit direction vector. The difference between the two microphone positions is a random variable with a trivariate normal distribution. The position uncertainty is usually significantly smaller than the distance between both microphones, so the projection onto the unit direction vector also follows a normal distribution. The TDOA then involves the product of two normal random variables: the samples-per-meter ratio obtained in the first step and the projected distance. Writing each factor as its mean plus a fluctuation term scaled by an independent standard normal variable, and noting that the relative standard deviations of both factors are small, the cross term between the two fluctuations is negligible and the following approximation holds: the TDOA random variable exhibits a normal distribution whose mean is the product of the two means, and whose variance combines the variances of both factors.
This models the TDOA estimation uncertainty, and is used to configure the MSW size. In practice, the GCC-PHAT based on the FFT generates frames with discrete indexes, and therefore the estimated TDOA value for each discrete direction can be rounded to the closest integer if no interpolation is performed. To cope with the disparity between this rounded TDOA and the observation of the TDOA random variable, a MSW filters each GCC-PHAT frame for all pairs of microphones, replacing each sample by the maximum over a window of samples around it (the frame index is omitted here for clarity):
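A direct (non-optimized) sketch of the MSW, assuming a symmetric window of 2·half_width + 1 samples:

```python
import numpy as np

def msw_filter(gcc, half_width):
    """Maximum Sliding Window: each sample of the GCC-PHAT frame is
    replaced by the maximum over a window of 2*half_width + 1 samples,
    so a correlation peak is still captured when the true TDOA deviates
    by a few samples from the rounded estimate."""
    n = len(gcc)
    out = np.empty(n)
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n, k + half_width + 1)
        out[k] = gcc[lo:hi].max()
    return out
```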
Figure 6 illustrates how the partial area under the probability density function (PDF) of the TDOA random variable corresponds to the probability that the MSW captures the TDOA value.

Figure 6: MSW and PDF of the TDOA random variable

The area under the curve corresponds to the integral of the PDF over the interval spanned by the MSW, and is obtained from the cumulative distribution function of the normal distribution evaluated at the window bounds.

A discrete integration over space is used to estimate the probability that the MSW captures a source in a direction neighboring the scanned direction. An octagon estimates the discretized surface, refined with a number of recursive iterations. The radius of the octagon corresponds to the distance between the scanned direction and its closest neighbor. Figure 7 shows octagons for increasing numbers of refinement iterations:

Figure 7: Discrete octagons

For all directions in the scan set, the probability that the MSW captures sources in neighboring directions is estimated as follows:


For a given discrete direction, the probability that the discrete point is captured for all pairs of microphones is estimated with the following expression:

The objective is to maximize this probability for all directions and discrete points, while keeping the MSW window size as small as possible to preserve the localization accuracy. To achieve this, Algorithm 1 increments the window sizes progressively until the threshold is reached. This calibration is performed once at initialization.

for all pairs of microphones do
     Compute the capture probability of each direction with (27)
end for
Compute the overall capture probability for all directions and points with (28)
while the minimum probability is below the threshold do
     Increase the MSW size of the limiting pair
     Update the probabilities for all directions and points
end while
Algorithm 1 MSW Automatic Calibration – Offline
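Algorithm 1 can be sketched as a simple greedy loop; the callback wrapping (27)-(28) and the maximum window size below are hypothetical, introduced only to make the sketch self-contained:

```python
def calibrate_msw(pairs, capture_prob, threshold=0.95, max_half_width=16):
    """Grow each pair's MSW half-width until the worst-case capture
    probability over all scan directions reaches the threshold.
    capture_prob(pair, half_width) -> worst-case probability for that
    pair (a hypothetical callback standing in for (27)-(28))."""
    widths = {pair: 0 for pair in pairs}
    for pair in pairs:
        while (capture_prob(pair, widths[pair]) < threshold
               and widths[pair] < max_half_width):
            widths[pair] += 1  # enlarge the window for this pair
    return widths
```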

3.3 Hierarchical Search

Hierarchical search involves two discrete grids: one with a coarse resolution and the other with a fine resolution. A matching matrix provides a means to connect the coarse and fine resolution grids at initialization; both grids are then used to perform the hierarchical search.

Algorithm 2 first performs a scan using the coarse resolution grid, and then a second scan over a region of the fine resolution grid to improve accuracy. For each pair of microphones, the GCC-PHAT frames are filtered by the MSWs computed for the coarse and fine resolution grids, respectively. To consider the microphone directivity in the scanning process, the GCC-PHAT result for each pair and direction is summed only when the corresponding binary mask is set to 1. The energy levels are normalized by the number of active pairs for each direction, and a small constant is added to the denominator to avoid division by zero.

The coarse scan returns the index of the maximum on the coarse grid, and the fine scan then searches all fine-grid points matched to this coarse direction. The fine-grid point with the maximum value corresponds to the potential source, and scanning returns its DoA and the corresponding energy level.

The proposed Hierarchical Search involves, on average, far fewer directions to scan than a fixed grid with the same resolution, since the fine scan is restricted to the region selected by the coarse scan.

for each direction on the coarse grid do
     Reset the energy and the number of active pairs
     for all pairs of microphones do
         if the coarse-grid mask of the pair is active then
             Add the MSW-filtered GCC-PHAT value and increment the number of active pairs
         end if
     end for
     Normalize the energy by the number of active pairs
end for
Find the coarse-grid direction with the maximum energy
for each direction on the fine grid do
     Reset the energy and the number of active pairs
     if the direction is matched to the best coarse-grid direction then
         for all pairs of microphones do
             if the fine-grid mask of the pair is active then
                 Add the MSW-filtered GCC-PHAT value and increment the number of active pairs
             end if
         end for
         Normalize the energy by the number of active pairs
     end if
end for
Algorithm 2 Hierarchical Search Scanning – Online
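A compact sketch of the two-level scan of Algorithm 2, assuming the directivity masks are already folded into the TDOA tables and that the matching structure maps each fine direction to its coarse directions (array names and the dict-based matching are illustrative):

```python
import numpy as np

def hierarchical_scan(gcc_coarse, gcc_fine, tdoa_coarse, tdoa_fine, match):
    """Two-level scan: find the best coarse direction, then rescan only
    the fine directions matched to it. gcc_* are (P, N) MSW-filtered
    GCC-PHAT frames, tdoa_* map (pair, direction) to sample delays, and
    match maps each fine index to its set of coarse indexes."""
    P, N = gcc_coarse.shape
    rows = np.arange(P)[:, None]
    coarse_energy = gcc_coarse[rows, tdoa_coarse % N].sum(axis=0)
    best_coarse = int(np.argmax(coarse_energy))
    # fine scan restricted to the directions matched to best_coarse
    candidates = [v for v in range(tdoa_fine.shape[1])
                  if best_coarse in match[v]]
    fine_energy = gcc_fine[rows, tdoa_fine[:, candidates] % N].sum(axis=0)
    best = candidates[int(np.argmax(fine_energy))]
    return best, float(fine_energy.max())
```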

The matching matrix that connects the coarse and fine resolution spaces needs to be generated offline, prior to the online hierarchical search. This matrix connects each direction of the fine resolution grid to many directions of the coarse resolution grid.

The similarity between a direction in the coarse grid and a direction in the fine grid is given by the length of the intersection between the TDOA subsets associated with each direction:




These subsets depend on the window sizes of the MSWs, computed for the coarse and fine resolution grids, for each pair of microphones. Each direction of the fine resolution grid is mapped to the most similar directions of the coarse resolution grid, derived using Algorithm 3.

for each direction on the fine grid do
     for each direction on the coarse grid do
         Reset the similarity score
         for all pairs of microphones do
             if the TDOA intervals of both directions intersect then
                 Increase the similarity score by the length of the intersection
             end if
         end for
     end for
     Keep the most similar coarse-grid directions in the matching matrix
end for
Algorithm 3 Hierarchical Search Matching – Offline

4 Modified 3D Kalman Filters for SST

Figure 8 illustrates the nine steps of the M3K method, where Normalization (Step B), Likelihood (Step D) and Update (Step H) are introduced to include Kalman filtering and replace the particle filtering in the multiple sound source tracking method presented by Valin et al. Valin et al. (2007). First, the new states of each tracked source are predicted (Step A) and normalized (Step B) in relation to the search space. Second, the potential sources are assigned (Step C) to either a source currently tracked, a new source, or a false detection (Steps D, E, F). Third, the method adds (Step G) new sources to be tracked if needed, and removes inactive sources previously tracked. Fourth, the states of each tracked source are updated (Step H) with the relevant observations, and the direction of each tracked source is finally estimated (Step I) from the Gaussian distributions.

Figure 8: Tracking simultaneous sound sources using M3K

Before presenting these nine steps in more detail in the following subsections, let us first define the Kalman filter model used in M3K. A Kalman filter estimates recursively the state of each source and provides the estimated source direction. Normally distributed random variables model the 3D direction (x, y and z) and 3D velocity (ẋ, ẏ and ż) of each source:


The random vector concatenates these positions and velocities:


The Kalman model assumes the state evolves over time according to the following linear model:

x_l = F x_{l-1} + B u_l + w_l

where the matrix F stands for the state transition model, B represents the control-input matrix, u_l is the control vector and w_l models the process noise. In the matrix F, the expression \Delta T = \Delta N / f_S denotes the time interval (in seconds) between two successive frames, with \Delta N being the hop size in samples between two frames, and f_S the sample rate in samples per second:

F = \begin{bmatrix} I_{3 \times 3} & \Delta T \, I_{3 \times 3} \\ 0_{3 \times 3} & I_{3 \times 3} \end{bmatrix}
With M3K, there is no control input, and therefore the term $\mathbf{B}\mathbf{u}_{l}$ in (35) is ignored. The process noise follows a multivariate normal distribution, where $\mathbf{w}_{l} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$. In M3K, the process noise lies in the velocity states and is parametrized with the variance $\sigma_Q^2$:

$$\mathbf{Q} = \sigma_Q^2 \begin{bmatrix} \mathbf{0}_{3\times 3} & \mathbf{0}_{3\times 3} \\ \mathbf{0}_{3\times 3} & \mathbf{I}_{3\times 3} \end{bmatrix}$$

In this method, the parameter $\sigma_Q^2$ is set to a constant value, but it could also depend on $\Delta T$, as uncertainty increases for larger values of $\Delta T$.
The observations are represented by random variables in the $x$-, $y$- and $z$-directions, obtained from the states:

$$\mathbf{z}_{l} = \mathbf{H}\mathbf{x}_{l} + \mathbf{v}_{l}$$

The matrix $\mathbf{H}$ stands for the observation model:

$$\mathbf{H} = \begin{bmatrix} \mathbf{I}_{3\times 3} & \mathbf{0}_{3\times 3} \end{bmatrix}$$

The expression $\mathbf{v}_{l} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$ models the observation noise, where the diagonal covariance matrix $\mathbf{R}$ is defined as:

$$\mathbf{R} = \sigma_R^2\,\mathbf{I}_{3\times 3}$$
M3K therefore requires only two manually tuned parameters, $\sigma_Q$ and $\sigma_R$, which influence the tracking sensitivity and the inertia of each tracked source.
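To make the model concrete, the matrices above can be assembled as follows. This is a minimal sketch using NumPy; the hop size, sample rate and noise variances are illustrative placeholders, not values prescribed by the paper:

```python
import numpy as np

def make_kalman_model(hop_size, sample_rate, sigma_Q, sigma_R):
    """Build the constant matrices of a constant-velocity Kalman model."""
    dT = hop_size / sample_rate          # time between frames, in seconds
    I3 = np.eye(3)
    Z3 = np.zeros((3, 3))
    # State transition: the direction integrates the velocity over dT
    F = np.block([[I3, dT * I3],
                  [Z3, I3]])
    # Observation model: only the 3D direction is observed
    H = np.hstack([I3, Z3])
    # Process noise acts on the velocity states only
    Q = sigma_Q**2 * np.block([[Z3, Z3],
                               [Z3, I3]])
    # Isotropic observation noise on the three direction components
    R = sigma_R**2 * I3
    return F, H, Q, R

F, H, Q, R = make_kalman_model(hop_size=512, sample_rate=16000,
                               sigma_Q=0.01, sigma_R=0.05)
```

With this factorization, changing the hop size or sample rate only affects $\Delta T$ inside the state transition matrix.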

4.1 Prediction (Step A)

The vector $\boldsymbol{\mu}_{l-1|l-1}^{i}$ is the posterior mean of the state, and the matrix $\mathbf{P}_{l-1|l-1}^{i}$ stands for the state error posterior covariance matrix of each tracked source $i$. The tracking method predicts new states (also referred to as prior states) for each sound source $i$. The predicted mean vector $\boldsymbol{\mu}_{l|l-1}^{i}$ and covariance matrix $\mathbf{P}_{l|l-1}^{i}$ are obtained as follows:

$$\boldsymbol{\mu}_{l|l-1}^{i} = \mathbf{F}\boldsymbol{\mu}_{l-1|l-1}^{i}$$

$$\mathbf{P}_{l|l-1}^{i} = \mathbf{F}\mathbf{P}_{l-1|l-1}^{i}\mathbf{F}^{T} + \mathbf{Q}$$
Each prediction step increases the state uncertainty, which is then reduced in the Update step (Step H) when the tracked source is associated with a relevant observation.
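The prediction step translates directly into code. The sketch below uses NumPy with illustrative values for the time interval, the process noise and the initial state; the function and variable names are my own:

```python
import numpy as np

def predict(mu, P, F, Q):
    """Kalman prediction: propagate the state mean and covariance one frame."""
    mu_prior = F @ mu              # predicted (prior) mean
    P_prior = F @ P @ F.T + Q      # predicted (prior) covariance
    return mu_prior, P_prior

# Toy example: dT = 0.1 s, a source drifting along the x axis
dT = 0.1
F = np.block([[np.eye(3), dT * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])
Q = 0.01**2 * np.block([[np.zeros((3, 3)), np.zeros((3, 3))],
                        [np.zeros((3, 3)), np.eye(3)]])
mu = np.array([1.0, 0.0, 0.0, 0.5, 0.0, 0.0])
mu_prior, P_prior = predict(mu, np.eye(6), F, Q)
```

Since the process noise covariance is positive semi-definite, the total variance (trace of the covariance) can only grow at this step; only the Update step shrinks it again.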

4.2 Normalization (Step B)

The observation $\boldsymbol{\psi}$ stands for the sound direction of the potential source, which constitutes a unit vector. A normalization constraint is therefore introduced and generates a new state mean vector $\hat{\boldsymbol{\mu}}_{l|l-1}^{i}$, for which the predicted direction lies on the unit sphere and the velocity is tangential to the sphere surface:

$$\hat{\mathbf{s}} = \frac{\mathbf{s}}{\|\mathbf{s}\|}, \qquad \hat{\dot{\mathbf{s}}} = \dot{\mathbf{s}} - (\dot{\mathbf{s}} \cdot \hat{\mathbf{s}})\,\hat{\mathbf{s}}$$

where $\mathbf{s}$ and $\dot{\mathbf{s}}$ denote the direction and velocity components of the predicted mean $\boldsymbol{\mu}_{l|l-1}^{i}$, and $\hat{\mathbf{s}}$ and $\hat{\dot{\mathbf{s}}}$ the corresponding components of $\hat{\boldsymbol{\mu}}_{l|l-1}^{i}$.
This manipulation violates the Kalman filter assumptions, which state that all processes are Gaussian and that the system is linear Julier and Uhlmann (1997). In practice however, Kalman filtering remains effective, as this normalization introduces only a slight perturbation in direction and velocity, which makes the nonlinearity negligible.

During the normalization process, the covariance matrix $\mathbf{P}_{l|l-1}^{i}$ remains unchanged. The Update step (Step H) ensures that the radial component of this matrix stays as small as possible, such that the PDF (Probability Density Function) lies mostly on the unit sphere surface.
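One plausible realization of this normalization constraint (direction rescaled to unit length, radial velocity component removed) can be sketched as follows; the helper name is hypothetical:

```python
import numpy as np

def normalize_state(mu):
    """Project a predicted state mean back onto the unit sphere:
    the direction gets unit norm and the velocity keeps only the
    component tangential to the sphere surface."""
    s, v = mu[:3], mu[3:]
    s_unit = s / np.linalg.norm(s)
    v_tan = v - np.dot(v, s_unit) * s_unit   # remove the radial part
    return np.concatenate([s_unit, v_tan])

# A state that drifted off the sphere, with a partly radial velocity
mu_hat = normalize_state(np.array([2.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
```

After this projection the direction has unit norm and is orthogonal to the velocity, which is exactly the geometric constraint described above.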

4.3 Assignment (Step C)

Assuming that $N$ sources are tracked, the function $g(v)$ assigns the potential source at index $v$ to either a false detection ($g(v) = -2$), a new source ($g(v) = -1$), or a previously tracked source (from $g(v) = 1$ to $g(v) = N$):

$$g(v) \in \{-2, -1, 1, 2, \dots, N\}$$

There are $V$ potential sources that can each be assigned to $N+2$ values, which leads to $(N+2)^{V}$ possible assignments. The assignment vector $\mathbf{g}$ concatenates the assignment functions for the $V$ potential sources:

$$\mathbf{g} = [\,g(1)\ \ g(2)\ \ \cdots\ \ g(V)\,]$$
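The combinatorial growth of this assignment space can be illustrated with a small sketch that enumerates every assignment vector; the label values used here (-2 for a false detection, -1 for a new source) are one possible encoding, not necessarily the paper's:

```python
from itertools import product

def all_assignments(num_potential, num_tracked):
    """Enumerate every assignment vector: each potential source maps to a
    false detection (-2), a new source (-1), or a tracked source (1..N)."""
    values = [-2, -1] + list(range(1, num_tracked + 1))
    return list(product(values, repeat=num_potential))

# 2 potential sources and 3 tracked sources give (3 + 2)^2 = 25 vectors
assignments = all_assignments(num_potential=2, num_tracked=3)
```

Because the count grows exponentially in the number of potential sources, practical trackers prune unlikely assignments rather than evaluating all of them.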
4.4 Likelihood (Step D)

The energy level $E$ gives significant information regarding sound source activity. When a sound source is active ($A$), which means the source emits a sound, a Gaussian distribution models the energy level and the PDF is given by:

$$p(E\,|\,A) = \frac{1}{\sigma_{A}\sqrt{2\pi}}\exp\left(-\frac{(E-\mu_{A})^{2}}{2\sigma_{A}^{2}}\right)$$

where $\mu_{A}$ and $\sigma_{A}$ stand for the mean and standard deviation of the normal distribution, respectively (for simplicity, we omit $\mu_{A}$ and $\sigma_{A}$ from the left-hand side of the equation).

When the source is inactive ($I$), the energy level is modeled with the same distribution but different parameters:

$$p(E\,|\,I) = \frac{1}{\sigma_{I}\sqrt{2\pi}}\exp\left(-\frac{(E-\mu_{I})^{2}}{2\sigma_{I}^{2}}\right)$$

where $\mu_{I}$ and $\sigma_{I}$ also represent the mean and standard deviation of the normal distribution, respectively. Typically the standard deviation $\sigma_{I}$ is similar to $\sigma_{A}$, but the mean $\mu_{I}$ is smaller than $\mu_{A}$. Moreover, the probability that the potential source $\boldsymbol{\psi}$ is generated by the tracked source $i$ is obtained with the following volume integral:

$$P(\boldsymbol{\psi}\,|\,\mathcal{C}) = \iiint_{V_{\boldsymbol{\psi}}} p(\mathbf{z}\,|\,\mathcal{C})\,d\mathbf{z}$$
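The two energy models differ only in their parameters, as a short sketch illustrates; the means and standard deviations below are arbitrary illustrative values:

```python
import math

def gauss_pdf(E, mean, std):
    """Univariate normal PDF used to model the frame energy level."""
    return math.exp(-0.5 * ((E - mean) / std)**2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical parameters: active sources have a higher mean energy
p_active = gauss_pdf(0.8, mean=0.9, std=0.2)     # p(E | active)
p_inactive = gauss_pdf(0.8, mean=0.3, std=0.2)   # p(E | inactive)
```

Comparing the two likelihoods for an observed energy level is what lets the tracker decide whether a source is currently emitting sound.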
The symbol $\mathcal{C}$ stands for a coherent source, which describes a sound source located at a specific direction in space. As modeled by the Kalman filter, the probability $p(\mathbf{z}\,|\,\mathcal{C})$ follows this normal distribution:

$$p(\mathbf{z}\,|\,\mathcal{C}) = \mathcal{N}\!\left(\mathbf{z};\ \mathbf{H}\boldsymbol{\mu}_{l|l-1}^{i},\ \mathbf{H}\mathbf{P}_{l|l-1}^{i}\mathbf{H}^{T} + \mathbf{R}\right)$$
Given the normalized tracked source direction $\hat{\boldsymbol{\mu}}_{l|l-1}^{i}$, the following expression represents the PDF when the potential source $\boldsymbol{\psi}$ is observed:

$$p(\boldsymbol{\psi}\,|\,\mathcal{C}) = \frac{1}{\sqrt{(2\pi)^{3}\,|\boldsymbol{\Sigma}|}}\exp\left(-\frac{1}{2}\,(\boldsymbol{\psi}-\mathbf{H}\hat{\boldsymbol{\mu}}_{l|l-1}^{i})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\psi}-\mathbf{H}\hat{\boldsymbol{\mu}}_{l|l-1}^{i})\right)$$

where $\boldsymbol{\Sigma} = \mathbf{H}\mathbf{P}_{l|l-1}^{i}\mathbf{H}^{T} + \mathbf{R}$. Note that swapping the mean and the random variable leads to the same PDF (the expression $|\cdot|$ stands for the matrix determinant):