Description of algorithms for Ben-Gurion University Submission to the LOCATA challenge

12/12/2018 ∙ by Lior Madmoni, et al. ∙ Ben-Gurion University of the Negev

This paper summarizes the methods used to localize the sources recorded for the LOCalization And TrAcking (LOCATA) challenge. The tasks with stationary sources and arrays were considered, i.e., tasks 1 and 2 of the challenge, which were recorded with the Nao robot array and the Eigenmike array. For both arrays, direction of arrival (DOA) estimation was performed with measurements in the short-time Fourier transform domain, using direct-path dominance (DPD) based tests, which aim to identify time-frequency (TF) bins dominated by the direct sound. For the recordings with Nao, a DPD test applied directly to the microphone signals was used. For the Eigenmike recordings, a DPD-based test designed for plane-wave density measurements in the spherical harmonics domain was used. After acquiring DOA estimates from the TF bins that passed the DPD tests, k-means clustering was performed to assign a final DOA estimate to each speaker.




1 DOA estimation with the Nao robot array

This section describes the method used for direction of arrival (DOA) estimation in tasks 1 and 2 with the Nao robot array.

In this paper, the spherical coordinate system described in [1] is used, denoted by $(r, \theta, \phi)$, where $r$ is the distance from the origin, and $\theta$ and $\phi$ are the elevation and azimuth angles, respectively. Consider an array of $M$ omni-directional microphones, representing the array mounted on Nao, and let $\{\mathbf{r}_m \equiv (r_m, \theta_m, \phi_m)\}_{m=1}^{M}$ denote the microphone positions, arranged according to the configuration used in the LOCATA challenge for Nao [1]. In addition, consider a sound field comprised of $Q$ far-field sources arriving from directions $\{\Psi_q \equiv (\theta_q, \phi_q)\}_{q=1}^{Q}$. These sources can represent the direct sound from speakers in a room and the reflections due to objects and room boundaries. In this case, the sound pressure measured by the array can be described in the short-time Fourier transform (STFT) domain as [2]

$$\mathbf{p}(\tau,\omega) = \mathbf{V}(\omega,\boldsymbol{\psi})\,\mathbf{s}(\tau,\omega) + \mathbf{n}(\tau,\omega), \tag{1}$$

where $\mathbf{p}(\tau,\omega) = [p_1(\tau,\omega), \ldots, p_M(\tau,\omega)]^{T}$ is an $M \times 1$ vector holding the recorded sound pressure, $\mathbf{s}(\tau,\omega)$ is a $Q \times 1$ vector holding the source signal amplitudes, $\mathbf{V}(\omega,\boldsymbol{\psi})$ is an $M \times Q$ matrix holding the steering vectors between each source and microphone, with $\boldsymbol{\psi} = [\Psi_1, \ldots, \Psi_Q]^{T}$ denoting the DOAs of the sources, $\mathbf{n}(\tau,\omega)$ is an $M \times 1$ vector holding the noise components, $\tau$ and $\omega$ are the time and frequency indices, respectively, and $(\cdot)^{T}$ denotes the transpose operator.

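The signals recorded by the Nao robot array were transformed to the STFT domain using a Hanning window of 512 samples (32 ms) with an overlap of 50%. A focusing process was then applied to the measured pressure vector in order to remove the frequency dependence of the steering matrices across adjacent frequency indices. The purpose of the focusing process is to enable frequency smoothing while preserving the spatial information. The focusing was performed by multiplying the sound pressure vector at each frequency index, $\omega$, with a focusing transformation $\mathbf{T}(\omega,\omega_0)$ that satisfies, for the steering matrix $\mathbf{V}(\omega,\boldsymbol{\psi})$ of (1),

$$\mathbf{T}(\omega,\omega_0)\,\mathbf{V}(\omega,\boldsymbol{\psi}) = \mathbf{V}(\omega_0,\boldsymbol{\psi}),$$

where $\omega_0$ is the center frequency in the frequency-smoothing range. The focusing transformations were computed in advance according to [3] using spherical harmonics (SH) order of . With ideal focusing, the cross-spectrum matrix of the focused sound pressure can be written as [3]

$$\mathbf{S}_{\tilde{\mathbf{p}}}(\tau,\omega) = E\!\left[\tilde{\mathbf{p}}(\tau,\omega)\,\tilde{\mathbf{p}}^{H}(\tau,\omega)\right] = \mathbf{V}(\omega_0,\boldsymbol{\psi})\,\mathbf{S}_{\mathbf{s}}(\tau,\omega)\,\mathbf{V}^{H}(\omega_0,\boldsymbol{\psi}) + \mathbf{S}_{\tilde{\mathbf{n}}}(\tau,\omega),$$

where $\tilde{\mathbf{p}}(\tau,\omega) = \mathbf{T}(\omega,\omega_0)\,\mathbf{p}(\tau,\omega)$, $\mathbf{S}_{\mathbf{s}}(\tau,\omega) = E[\mathbf{s}(\tau,\omega)\,\mathbf{s}^{H}(\tau,\omega)]$, $\mathbf{S}_{\tilde{\mathbf{n}}}(\tau,\omega) = E[\tilde{\mathbf{n}}(\tau,\omega)\,\tilde{\mathbf{n}}^{H}(\tau,\omega)]$ with $\tilde{\mathbf{n}}(\tau,\omega) = \mathbf{T}(\omega,\omega_0)\,\mathbf{n}(\tau,\omega)$, and $(\cdot)^{H}$ is the Hermitian operator. In practice, averaging across time frames is used to approximate the expectation. Frequency smoothing is then applied to $\mathbf{S}_{\tilde{\mathbf{p}}}(\tau,\omega)$ by averaging across frequency bins. Denoting the smoothed variables by an overline, i.e. $\bar{\mathbf{S}}_{\tilde{\mathbf{p}}}$, the smoothed focused cross-spectrum matrix can be written as

$$\bar{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau,\omega_0) = \mathbf{V}(\omega_0,\boldsymbol{\psi})\,\bar{\mathbf{S}}_{\mathbf{s}}(\tau,\omega_0)\,\mathbf{V}^{H}(\omega_0,\boldsymbol{\psi}) + \bar{\mathbf{S}}_{\tilde{\mathbf{n}}}(\tau,\omega_0).$$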


The purpose of the frequency-smoothing operation is to restore the rank of the source cross-spectrum matrix, $\mathbf{S}_{\mathbf{s}}$, which is singular when coherent sources, such as reflections, are present. After applying focusing and frequency smoothing, the effective rank [4] of the smoothed focused cross-spectrum matrix, $\bar{\mathbf{S}}_{\tilde{\mathbf{p}}}$, reflects the number of sources, and the noise subspace can be correctly estimated [3]. Time-frequency (TF) bins in which the direct path is dominant are identified in a manner similar to the direct-path dominance (DPD) test [5]:

$$\mathcal{A}_{\mathrm{DPD}} = \left\{ (\tau,\omega_0) : \frac{\lambda_1(\tau,\omega_0)}{\lambda_2(\tau,\omega_0)} \geq \mathcal{TH} \right\},$$

where $\lambda_1$ and $\lambda_2$ are the largest and the second largest eigenvalues of $\bar{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau,\omega_0)$, and $\mathcal{TH}$ is the test threshold, chosen independently for each recording to ensure that 5% of all available bins pass the test. Then, MUSIC with a one-dimensional signal subspace was applied to each of the bins in $\mathcal{A}_{\mathrm{DPD}}$. The noise subspace was estimated by the singular value decomposition of $\bar{\mathbf{S}}_{\tilde{\mathbf{p}}}(\tau,\omega_0)$.
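
The STFT analysis and the eigenvalue-ratio test described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `stft_hann` and `dpd_select` are hypothetical, the 16 kHz sampling rate is only implied by "512 samples (32 ms)", the focusing transformations of [3] are not reproduced, and the cross-spectrum matrices are toy diagonal matrices built for the example.

```python
import numpy as np

def stft_hann(x, n_win=512, overlap=0.5):
    """STFT of a 1-D signal with a Hann window; returns (n_frames, n_win // 2 + 1)."""
    hop = int(round(n_win * (1.0 - overlap)))   # 256 samples for 50% overlap
    window = np.hanning(n_win)
    n_frames = 1 + (len(x) - n_win) // hop      # keep only frames that fit entirely
    frames = np.stack([x[i * hop:i * hop + n_win] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def dpd_select(S, pass_fraction=0.05):
    """Eigenvalue-ratio test on Hermitian matrices S of shape (n_bins, M, M).

    Returns a boolean mask of the bins that pass and the threshold used.
    The threshold is the (1 - pass_fraction) quantile of the ratios, so that
    roughly pass_fraction of the bins pass (assumed selection rule).
    """
    eigvals = np.linalg.eigvalsh(S)             # ascending order for each bin
    ratio = eigvals[:, -1] / np.maximum(eigvals[:, -2], 1e-12)
    threshold = np.quantile(ratio, 1.0 - pass_fraction)
    return ratio >= threshold, threshold

# STFT of a 1 kHz tone sampled at 16 kHz (assumed rate: 512 samples / 32 ms)
fs = 16000
x = np.sin(2 * np.pi * 1000.0 * np.arange(fs) / fs)
X = stft_hann(x)
peak_bin = int(np.argmax(np.abs(X).mean(axis=0)))

# Toy cross-spectrum matrices: 5 of 100 bins have a single dominant eigenvalue
direct = np.diag([1.0, 0.01, 0.01, 0.01])
reverberant = np.diag([1.0, 0.5, 0.01, 0.01])
S = np.stack([direct] * 5 + [reverberant] * 95).astype(complex)
mask, threshold = dpd_select(S)
```

Choosing the threshold as a quantile of the observed ratios is one simple way to let a fixed fraction of bins pass for each recording; with many tied ratios, slightly more than the requested fraction can pass.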


Next, k-means clustering was performed on the DOA estimates from the bins that passed the test. For task 1, a single speaker was present, so k-means clustering was performed with a single cluster. For task 2, the number of clusters was set to the number of sources, which was estimated for each recording by examining the scatter of DOA estimates on an azimuth-elevation grid, and was therefore assumed to be known a priori. This was done in order to focus on the performance of the DOA estimation process rather than on source number estimation. Finally, since the sources in tasks 1 and 2 are known to be stationary, each final DOA estimate was associated with a unique source identifier for all timestamps, regardless of the source's activity.
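
The clustering stage can be illustrated with a small k-means (Lloyd's algorithm) sketch. Several choices here are assumptions that are not stated in the text: clustering is done on unit vectors rather than on raw angles (to avoid azimuth wrap-around), the elevation is treated as a polar angle measured from the positive z-axis, a deterministic farthest-point initialization is used, and the name `kmeans_doa` is hypothetical.

```python
import numpy as np

def doa_to_unit(theta, phi):
    """Unit vectors for polar angle theta (from +z) and azimuth phi, in radians."""
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)

def unit_to_doa(v):
    """Inverse of doa_to_unit; vectors are normalized first."""
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return np.arccos(np.clip(v[..., 2], -1.0, 1.0)), np.arctan2(v[..., 1], v[..., 0])

def assign(x, centers):
    """Index of the nearest center for every point."""
    dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans_doa(theta, phi, k, n_iter=50):
    """Cluster DOA estimates into k groups; returns center angles and labels."""
    x = doa_to_unit(np.asarray(theta, dtype=float), np.asarray(phi, dtype=float))
    # Deterministic farthest-point initialization (assumed, for repeatability)
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(x, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    labels = assign(x, centers)
    theta_c, phi_c = unit_to_doa(centers)
    return theta_c, phi_c, labels

# Two synthetic groups of DOA estimates around azimuths 30 and -60 degrees
offsets = np.linspace(-2.0, 2.0, 21)
theta = np.radians(np.concatenate([90 + 0.5 * offsets, 90 - 0.5 * offsets]))
phi = np.radians(np.concatenate([30 + offsets, -60 + offsets]))
theta_c, phi_c, labels = kmeans_doa(theta, phi, k=2)
```

With `k=1`, as in task 1, the procedure reduces to averaging the unit vectors of all DOA estimates and converting the mean direction back to angles.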

2 DOA estimation with the Eigenmike array

This section describes the method used for DOA estimation in tasks 1 and 2 with the Eigenmike array.

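The sound pressure model described in (1) can be used with $r_m = r$ for all $m$, and with the same STFT parameters, such that it now describes a spherical array. This formulation facilitates the processing of signals in the SH domain [6, 7, 8], which was performed up to SH order of . Following that, plane-wave decomposition was performed, leading to [9]:

$$\mathbf{a}_{nm}(\tau,\omega) = \mathbf{Y}^{H}(\boldsymbol{\psi})\,\mathbf{s}(\tau,\omega) + \tilde{\mathbf{n}}(\tau,\omega),$$

where $\mathbf{a}_{nm}(\tau,\omega)$ is a vector holding the recorded plane wave density (PWD) coefficients in the SH domain, and $\mathbf{Y}^{H}(\boldsymbol{\psi}) = [\mathbf{y}^{*}(\Psi_1), \ldots, \mathbf{y}^{*}(\Psi_Q)]$ is the steering matrix in this domain, with its columns holding the SH functions $Y_n^m(\cdot)$ of order $n$ and degree $m$. These functions are assumed to be order limited to $N$, which usually holds when both $kr \leq N$ and $(N+1)^2 \leq M$ [10, 8], where $k$ is the wavenumber. The noise components in this domain are described by the vector $\tilde{\mathbf{n}}(\tau,\omega)$, and $(\cdot)^{*}$ denotes the complex conjugate. In this challenge, the plane-wave decomposition was performed in a manner similar to the R-PWD method described in [11] (equation (2.27)).

Next, the local TF correlation matrices are computed for every TF bin by [5]:

$$\tilde{\mathbf{R}}_{a}(\tau,\omega) = \frac{1}{J_{\tau} J_{\omega}} \sum_{j_{\tau}=0}^{J_{\tau}-1} \sum_{j_{\omega}=0}^{J_{\omega}-1} \mathbf{a}_{nm}(\tau - j_{\tau}, \omega - j_{\omega})\,\mathbf{a}_{nm}^{H}(\tau - j_{\tau}, \omega - j_{\omega}), \tag{6}$$

where $J_{\tau}$ and $J_{\omega}$ are the number of time and frequency bins for the averaging, respectively. The values that were chosen for this array are and . Notice in (6) that, in this domain, frequency smoothing is performed directly, without focusing matrices [12].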
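
The local averaging of outer products described above can be sketched as follows. The function name `local_tf_correlation`, the window sizes `J_t=2` and `J_f=3`, and the toy input are assumptions made for illustration only (the values used in the challenge are not reproduced here). Whether the averaging window extends forward or backward from a given bin is also an assumption; in this sketch, output index `(i, j)` covers frames `i..i+J_t-1` and frequencies `j..j+J_f-1`.

```python
import numpy as np

def local_tf_correlation(A, J_t, J_f):
    """Local TF correlation matrices by averaging outer products.

    A has shape (n_frames, n_freqs, K): K coefficients per TF bin.
    Returns an array of shape (n_frames - J_t + 1, n_freqs - J_f + 1, K, K).
    """
    outer = A[..., :, None] * np.conj(A[..., None, :])   # a a^H for every bin
    n_t = A.shape[0] - J_t + 1
    n_f = A.shape[1] - J_f + 1
    R = np.zeros((n_t, n_f) + outer.shape[2:], dtype=complex)
    for jt in range(J_t):
        for jf in range(J_f):
            R += outer[jt:jt + n_t, jf:jf + n_f]
    return R / (J_t * J_f)

# Toy input: the same coefficient vector in every bin gives rank-one matrices
a = np.array([1.0, 1j, -0.5, 0.25 + 0.5j])
A = np.tile(a, (6, 8, 1))
R = local_tf_correlation(A, J_t=2, J_f=3)
rank = np.linalg.matrix_rank(R[0, 0])
```

When a single plane wave dominates all bins in the window, the averaged matrix stays close to rank one, which is the property that the subsequent test examines.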

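The direct-path dominance test based on enhanced decomposition of the direct sound (DPD-EDS) is designed for PWD measurements in the SH domain, and it uses the local TF correlation matrix $\tilde{\mathbf{R}}_{a}(\tau,\omega)$, as in (6). With the aim of identifying TF bins dominated by the direct sound, it was shown in [13] that, under some conditions, the dominant eigenvector of $\tilde{\mathbf{R}}_{a}(\tau,\omega)$, denoted by $\mathbf{u}_1$, may approximately satisfy

$$\mathbf{u}_1 \propto \mathbf{y}^{*}(\Psi_0), \tag{7}$$

where $\mathbf{y}(\Psi) = [Y_0^0(\Psi), \ldots, Y_N^N(\Psi)]^{T}$ holds the SH functions up to order $N$, and $\Psi_0$ is the direction of the direct sound in the TF bin. Motivated by (7), a bin dominated by the direct sound can be identified by examining $\mathbf{u}_1$ and measuring to what extent it represents a single plane wave. In this challenge, this was performed with the following MUSIC-based measure:

$$\mathrm{EDS}(\tau,\omega) = \max_{\Psi} \frac{1}{\left\| \mathbf{P}^{\perp}_{\mathbf{u}_1}\,\mathbf{y}^{*}(\Psi) \right\|^{2}}, \tag{8}$$

where $\mathbf{P}^{\perp}_{\mathbf{u}_1} = \mathbf{I} - \mathbf{u}_1 \mathbf{u}_1^{H}$ is the projection into the subspace which is orthogonal to $\mathbf{u}_1$. Next, the following DPD-EDS test was performed:

$$\mathcal{A}_{\mathrm{EDS}} = \left\{ (\tau,\omega) : \mathrm{EDS}(\tau,\omega) \geq \mathcal{TH}_{\mathrm{EDS}} \right\},$$

where $\mathcal{TH}_{\mathrm{EDS}}$ is the test threshold, which should hold , and in this challenge was chosen for each recording separately, to ensure that of all available bins pass the test.

Similarly to the previous section, a DOA estimate from each TF bin is given by the argument that maximizes the measure in (8),

$$\hat{\Psi}(\tau,\omega) = \arg\max_{\Psi} \frac{1}{\left\| \mathbf{P}^{\perp}_{\mathbf{u}_1}\,\mathbf{y}^{*}(\Psi) \right\|^{2}},$$

already computed in (8). For further information on the DPD-EDS test, the reader is referred to [13, 14]. The final DOA estimates were produced as described for the Nao robot array in the previous section, using k-means clustering.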
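
The idea of measuring how well the dominant eigenvector matches a single plane wave, and of taking the maximizing direction as the DOA estimate, can be sketched as below. This is an illustration under stated assumptions, not the authors' code: SH functions are written out by hand only up to order 1 to keep the example short, the elevation is treated as a polar angle from the positive z-axis, the search runs over a 5-degree grid, a small `eps` regularizes the division, and the names `sh_order1` and `projection_measure` are hypothetical.

```python
import numpy as np

def sh_order1(theta, phi):
    """Complex SH up to order 1, stacked as (n, m) = (0,0), (1,-1), (1,0), (1,1)."""
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    y00 = np.full(theta.shape, np.sqrt(1.0 / (4.0 * np.pi)), dtype=complex)
    y1m1 = np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(theta) * np.exp(-1j * phi)
    y10 = np.sqrt(3.0 / (4.0 * np.pi)) * np.cos(theta) + 0j
    y11 = -np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(theta) * np.exp(1j * phi)
    return np.stack([y00, y1m1, y10, y11], axis=-1)

def projection_measure(u1, theta_grid, phi_grid, eps=1e-12):
    """Maximum of 1 / ||P_perp y*(Psi)||^2 over a grid, and the maximizing direction."""
    u1 = u1 / np.linalg.norm(u1)
    Y = np.conj(sh_order1(theta_grid, phi_grid))    # rows: conjugated SH vectors
    inner = Y @ np.conj(u1)                         # u1^H y*(Psi) for each direction
    # ||(I - u1 u1^H) y||^2 = ||y||^2 - |u1^H y|^2 for a unit-norm u1
    residual = np.sum(np.abs(Y) ** 2, axis=1) - np.abs(inner) ** 2
    spectrum = 1.0 / (np.maximum(residual, 0.0) + eps)
    best = int(np.argmax(spectrum))
    return spectrum[best], theta_grid[best], phi_grid[best]

# Search grid with 5-degree resolution (assumed)
theta_deg, phi_deg = np.meshgrid(np.arange(10, 171, 5),
                                 np.arange(-180, 180, 5), indexing="ij")
theta_grid = np.radians(theta_deg.ravel())
phi_grid = np.radians(phi_deg.ravel())

# Dominant eigenvector of an ideal single plane wave from (60, 45) degrees,
# with an arbitrary phase factor
u1 = np.exp(0.7j) * np.conj(sh_order1(np.radians(60.0), np.radians(45.0)))
value, theta_hat, phi_hat = projection_measure(u1, theta_grid, phi_grid)
```

For an ideal single plane wave the residual vanishes at the true direction, so the measure is limited only by `eps`; in reverberant bins the eigenvector departs from a single SH vector and the maximum is much smaller, which is what a threshold on this measure exploits.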

For most recordings, an analysis frequency range of Hz was employed, with the exception of several recordings where the frequency range was reduced to Hz, which seemed to yield denser clusters of DOA estimates. When the development data of the Eigenmike recordings was analyzed, a relatively constant bias of in the azimuth angle, and in the elevation angle, relative to the ground truth data, was present. Hence, this bias was subtracted from the final DOA estimates that were calculated with the evaluation data, for all recordings.


  • [1] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “The LOCATA challenge data corpus for acoustic source localization and tracking,” in IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), 2018.
  • [2] H. L. Van Trees, Optimum array processing: Part IV of detection, estimation and modulation theory.   Wiley Online Library, 2002, vol. 1.
  • [3] H. Beit-On and B. Rafaely, “Speaker localization using the direct-path dominance test for arbitrary arrays,” in Proceedings of the International Conference On The Science Of Electrical Engineering (ICSEE 2018), accepted for publication.
  • [4] O. Roy and M. Vetterli, “The effective rank: A measure of effective dimensionality,” in Signal Processing Conference, 2007 15th European.   IEEE, 2007, pp. 606–610.
  • [5] O. Nadiri and B. Rafaely, “Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1494–1505, 2014.
  • [6] J. Meyer and G. Elko, “A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2.   IEEE, 2002, pp. II–1781.
  • [7] T. D. Abhayapala and D. B. Ward, “Theory and design of high order sound field microphones using spherical microphone array,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2.   IEEE, 2002, pp. II–1949.
  • [8] B. Rafaely, Fundamentals of spherical array processing.   Springer, 2015, vol. 8.
  • [9] D. Khaykin and B. Rafaely, “Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA’09.   IEEE, 2009, pp. 221–224.
  • [10] D. B. Ward and T. D. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on speech and audio processing, vol. 9, no. 6, pp. 697–707, 2001.
  • [11] D. L. Alon and B. Rafaely, “Spatial decomposition by spherical array processing,” in Parametric Time-frequency Domain Spatial Audio, V. Pulkki, S. Delikaris-Manias, and A. Politis, Eds.   John Wiley & Sons, 2017.
  • [12] D. Khaykin and B. Rafaely, “Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009. WASPAA’09.   IEEE, 2009, pp. 221–224.
  • [13] L. Madmoni and B. Rafaely, “Direction of arrival estimation for reverberant speech based on enhanced decomposition of the direct sound,” IEEE Journal of Selected Topics in Signal Processing, pp. 1–1, 2018.
  • [14] ——, “Improved direct-path dominance test for speaker localization in reverberant environments,” in Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Sept 2018, pp. 2424–2428.