Robot audition aims to provide robots with hearing capabilities to interact efficiently with people in everyday environments. Sound source localization (SSL) is a typical task that consists of localizing the direction of arrival (DOA) of a target source using a microphone array. This task is challenging as the robot usually generates a significant amount of noise (fans, actuators, etc.) and the target sound source is corrupted by reverberation. SSL often relies on Multiple Signal Classification (MUSIC) and Steered-Response Power Phase Transform (SRP-PHAT) methods.
MUSIC is a localization method based on Standard Eigenvalue Decomposition (SEVD-MUSIC) that was initially used for narrowband signals, and then adapted to broadband signals like speech. However, SEVD-MUSIC assumes the speech signal is more powerful than noise at each frequency bin in the spectrogram, which is usually not the case. To cope with this limitation, Nakamura et al. introduced the MUSIC based on Generalized Eigenvalue Decomposition (GEVD-MUSIC) method [5, 6, 7]. This method solves the limitation of SEVD-MUSIC, but also introduces some localization errors because the transform provides a noise subspace with correlated bases. To deal with this issue, a variant of GEVD-MUSIC, named MUSIC based on Generalized Singular Value Decomposition (GSVD-MUSIC), enforces orthogonality between the noise subspace bases and thus improves the DOA estimation accuracy. However, all MUSIC-based methods rely on online eigenvalue or singular value decompositions that are computationally expensive, and make on-board real-time processing challenging.
SRP-PHAT is built on the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) between each pair of microphones. GCC-PHAT is often computed with the Inverse Fast Fourier Transform (IFFT) to speed up computation, at the cost of discretizing Time Difference of Arrival (TDOA) values, which reduces localization accuracy. SRP-PHAT usually scans a discretized 3-D space and returns the most likely DOA [11, 12, 13, 14, 15, 16]. This scanning process often involves a significant number of memory lookups, which creates a bottleneck and increases execution time. To reduce the number of lookups, a hierarchical search is proposed to speed up the space scan, but this method still relies on discrete TDOAs. We therefore recently proposed the Singular Value Decomposition with Phase Transform (SVD-PHAT) method, which avoids TDOA discretization and significantly reduces computing time. However, like SRP-PHAT, SVD-PHAT remains sensitive to additive noise. To cope with this limitation, time-frequency (TF) masks can be generated to improve robustness to stationary noise [19, 20]. Stationary noise is often estimated with techniques like Minima Controlled Recursive Averaging (MCRA) and Histogram-based Recursive Level Estimation (HRLE), or recorded offline prior to testing if the robot’s environment is static. Pertilä et al. also propose a method that generates TF masks using convolutional neural networks for non-stationary noise sources. However, these TF masks ignore noise spatial coherence, which carries useful information for robust localization, and is in fact exploited by GSVD-MUSIC.
In this paper, we propose a variant of the SVD-PHAT method, called Difference SVD-PHAT (DSVD-PHAT), that performs correlation matrix subtraction, which takes noise spatial coherence into account while preserving the low complexity of the original SVD-PHAT. Section II reviews the state-of-the-art GSVD-MUSIC method, and Section III introduces the proposed DSVD-PHAT method. Section IV describes the experimental setup on a Baxter robot, and Section V compares results from GSVD-MUSIC and the proposed DSVD-PHAT approach.
GSVD-MUSIC relies on the Time Difference of Arrival (TDOA) between each microphone and a reference in space. The TDOA (in sec) stands for the propagation delay for the signal emitted by the sound source DOA (where stands for the -norm) to reach microphone with respect to the origin. For discrete-time signals, the TDOA is usually expressed in terms of samples, as shown in (1), where stands for the speed of sound in air (in m/sec), and is the sample rate (in samples/sec). The operator stands for the dot product.
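As a rough illustration of the sample-domain TDOA described above, the sketch below projects a microphone position onto a unit DOA vector and scales the result by the ratio of the sample rate to the speed of sound. The function name, the sign convention, and the numeric values (16 kHz, 343 m/sec, a microphone 10 cm from the origin) are assumptions chosen for illustration; they are not taken from (1) or from Table I.

```python
import math

def tdoa_samples(mic_pos, doa, fs=16000.0, c=343.0):
    """TDOA (in samples) of a microphone relative to the origin.

    mic_pos: microphone position in metres (x, y, z).
    doa: direction of arrival; it is normalised to unit length here.
    Assumed sign convention: positive when the microphone lies towards the source.
    """
    norm = math.sqrt(sum(v * v for v in doa))
    unit = [v / norm for v in doa]
    projection = sum(r * s for r, s in zip(mic_pos, unit))  # dot product
    return (fs / c) * projection

# Hypothetical microphone on the x axis, source along the x axis.
tau = tdoa_samples((0.10, 0.0, 0.0), (1.0, 0.0, 0.0))
print(tau)
```

A microphone whose position is orthogonal to the DOA yields a TDOA of zero, which is a quick sanity check on the geometry.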
The expression stands for the Short-Time Fourier Transform (STFT) coefficient of microphone , at frequency bin and frame , where and stand for the frame and hop sizes in samples, respectively. The STFT values are concatenated in the vector, as shown in (2).
GSVD-MUSIC uses a steering vector for each potential DOA :
The correlation matrix of the vector at each frequency bin can be estimated at each frame using the following recursive approximation, where the parameter is the adaptive rate:
where stands for the Hermitian operator.
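The recursive estimate can be sketched as an exponential smoothing of the outer product of the STFT vector with its Hermitian transpose. The exact form of (4) is not reproduced here, so the update below, where the adaptive rate weights the newest frame, is an assumption, as are the function name and the toy two-microphone frames.

```python
def update_correlation(R_prev, x, alpha):
    """One recursive update: R = (1 - alpha) * R_prev + alpha * x x^H.

    R_prev: M x M matrix as nested lists of complex numbers.
    x: list of M complex STFT coefficients for one frequency bin.
    alpha: adaptive rate in (0, 1]; assumed to weight the newest frame.
    """
    M = len(x)
    return [
        [(1.0 - alpha) * R_prev[i][j] + alpha * x[i] * x[j].conjugate()
         for j in range(M)]
        for i in range(M)
    ]

# Hypothetical STFT vectors for two microphones over two frames.
M = 2
R = [[0j] * M for _ in range(M)]
frames = [[1 + 1j, 2 - 1j], [0.5j, 1 + 0j]]
for x in frames:
    R = update_correlation(R, x, alpha=0.1)
```

Each update preserves the Hermitian symmetry of the estimate, so only the upper triangle needs to be stored in practice.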
The GSVD-MUSIC method performs a generalized singular value decomposition with respect to the noise correlation matrix (which can be estimated as in (4) during silence periods or precomputed offline if the test environment is known):
where the diagonal matrix holds the singular values in descending order (), and and are the left and right singular vectors and , respectively:
This method projects the steering vector in the noise subspace, spanned by the singular vectors (when there is only one target source). The inverse of the projections for each frequency bin is summed over the full spectrum (which may also be restricted to a more specific range of frequency bins ):
The sound source DOA then corresponds to , where:
GSVD-MUSIC involves singular value decompositions of matrices per frame, as shown in (5), which is challenging from a computing point of view for real-time applications. Moreover, it also involves computing (9) for potential sources, which also implies a significant amount of computations. The proposed DSVD-PHAT aims to reduce the amount of computations, while preserving a similar robustness to noise.
DSVD-PHAT relies on the TDOA between each pair of microphones and (as opposed to (1), where the TDOA is between a microphone and the origin), which leads to the following expression, for a total of pairs:
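The pairwise formulation can be sketched by enumerating every microphone pair and projecting the difference of their positions onto the DOA. For an array of M microphones this gives M(M-1)/2 pairs. The sign convention, the function name, the sample rate, the speed of sound, and the four-microphone geometry below are all illustrative assumptions (the geometry is not the array of Table II).

```python
import math
from itertools import combinations

def pairwise_tdoas(mics, doa, fs=16000.0, c=343.0):
    """Return {(i, j): TDOA in samples} for every pair with i < j.

    Assumed convention: tau_ij = (fs / c) * (r_j - r_i) . s,
    where s is the unit-norm DOA vector.
    """
    norm = math.sqrt(sum(v * v for v in doa))
    s = [v / norm for v in doa]
    out = {}
    for i, j in combinations(range(len(mics)), 2):
        diff = [b - a for a, b in zip(mics[i], mics[j])]
        out[(i, j)] = (fs / c) * sum(d * u for d, u in zip(diff, s))
    return out

# Hypothetical planar array (metres); the source is directly above it.
mics = [(0.05, 0.0, 0.0), (-0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.0, -0.05, 0.0)]
taus = pairwise_tdoas(mics, (0.0, 0.0, 1.0))
print(len(taus))
```

With the source on the array normal, every pairwise TDOA is zero, since all microphones are equidistant from a far-field source in that direction.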
Since noise and speech sources are independent, it is reasonable to assume that the clean speech correlation matrix can be estimated from the difference between the noisy speech and the noise correlation matrices at each frame , as proposed in :
The normalized cross-spectra in DSVD-PHAT at each frequency bin are thus obtained as follows, where refers to the element in the th row and th column:
Note how DSVD-PHAT differs from the original SVD-PHAT, as the latter directly uses the noisy correlation matrix (i.e. replaces in (12)).
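The two steps above (subtracting the noise correlation matrix, then applying the phase transform to each cross term) can be sketched as follows for one frequency bin. The function name, the small constant that guards against division by zero, and the toy 2 x 2 matrices are assumptions made for illustration only.

```python
def phat_cross_spectra(R_noisy, R_noise, eps=1e-12):
    """Normalised cross-spectra from the difference of two correlation matrices.

    For each pair i < j, the (i, j) entry of (R_noisy - R_noise) is divided
    by its magnitude so that only the phase is kept (phase transform).
    eps is an assumed guard against a zero-magnitude entry.
    """
    M = len(R_noisy)
    out = {}
    for i in range(M):
        for j in range(i + 1, M):
            diff = R_noisy[i][j] - R_noise[i][j]
            out[(i, j)] = diff / (abs(diff) + eps)
    return out

# Hypothetical correlation matrices for a two-microphone array.
R_noisy = [[2 + 0j, 1 + 1j], [1 - 1j, 2 + 0j]]
R_noise = [[1 + 0j, 0.5j], [-0.5j, 1 + 0j]]
X = phat_cross_spectra(R_noisy, R_noise)
```

Passing a zero matrix as `R_noise` reduces this sketch to the behavior described for the original SVD-PHAT, which normalises the noisy correlation matrix directly.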
We then define the vector to concatenate all normalized cross-spectra introduced in (13):
The matrix holds all the SRP-PHAT coefficients :
The vector stores the SRP-PHAT energy for all potential DOAs, where extracts the real part of the expression:
The sound source DOA corresponds to , where:
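A brute-force version of this scan can be sketched for a single microphone pair: each candidate DOA contributes a row of complex exponentials built from its TDOA, and the energy is the real part of the product with the normalised cross-spectrum. The exponent sign convention, the frame size, and the candidate TDOAs are assumptions; a full implementation would concatenate all pairs and frequency bins.

```python
import cmath
import math

N = 16                               # hypothetical frame size in samples
bins = range(N // 2 + 1)             # non-negative frequency bins
candidate_tdoas = [-2.0, 0.0, 3.0]   # one hypothetical TDOA per candidate DOA
true_tdoa = 3.0

# Synthetic normalised cross-spectrum of a source with the true TDOA
# (assumed convention: phase decreases with frequency for a positive delay).
X = [cmath.exp(-2j * math.pi * k * true_tdoa / N) for k in bins]

# One row of steering coefficients per candidate DOA.
W = [[cmath.exp(2j * math.pi * k * tau / N) for k in bins]
     for tau in candidate_tdoas]

# Energy per candidate: real part of the row-by-vector product.
y = [sum(w * x for w, x in zip(row, X)).real for row in W]
best = max(range(len(y)), key=y.__getitem__)
print(best, candidate_tdoas[best])
```

The cost of this scan grows with the number of candidates times the number of pairs and bins, which is the expense the low-rank projection is meant to avoid.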
Computing for all values of is expensive, and therefore SVD-PHAT provides a more efficient way of finding . The Singular Value Decomposition is first performed on the matrix, where , and :
The parameter (where ) satisfies the condition in (19), which ensures accurate reconstruction of , where is a user-defined small value that stands for the tolerable reconstruction error. The operator represents the trace of the matrix.
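One way to pick the subspace dimension is to keep the smallest number of singular values whose squared sum reaches a fraction of the total. The sketch below assumes the trace condition amounts to comparing the energy of the retained singular values against (1 - delta) times the full energy; this reading of (19), the function name, and the sample singular values are assumptions.

```python
def choose_rank(singular_values, delta):
    """Smallest K whose retained energy is at least (1 - delta) of the total.

    singular_values: non-negative values sorted in descending order.
    delta: tolerable relative reconstruction error (user-defined, small).
    """
    total = sum(s * s for s in singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s * s
        if running >= (1.0 - delta) * total:
            return k
    return len(singular_values)

# Hypothetical singular values with a fast decay.
K = choose_rank([10.0, 5.0, 1.0, 0.1], delta=0.01)
print(K)
```

A faster decay of the singular values yields a smaller K, which directly reduces the cost of the online projection.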
The vector results from the projection of the observations in the K-dimensions subspace:
Similarly, the matrix holds a set of vectors :
The optimization in (17) can then be converted to a nearest neighbor problem:
where and . A k-d tree then solves this nearest neighbor search problem efficiently. The corresponding amplitude for the optimal DOA at index corresponds to:
where stands for the -th row of .
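The link between the maximisation and the nearest neighbor search can be checked numerically: for unit-norm vectors, the squared distance equals 2 minus twice the inner product, so the closest vector is also the one with the largest projection. The sketch below uses real three-dimensional vectors and a linear scan as a stand-in for the k-d tree; the vectors, the function names, and the unit-norm normalisation are assumptions for illustration.

```python
import math

def normalise(v):
    """Scale a real vector to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def nearest_index(points, query):
    """Linear-scan nearest neighbour (stand-in for a k-d tree query)."""
    return min(range(len(points)), key=lambda q: math.dist(points[q], query))

# Hypothetical projected vectors (one per candidate DOA) and observation.
D = [normalise(v) for v in ([1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.5, 0.5, 1.0])]
z = normalise([0.1, 0.9, 0.4])

by_distance = nearest_index(D, z)
by_projection = max(range(len(D)),
                    key=lambda q: sum(a * b for a, b in zip(D[q], z)))
print(by_distance, by_projection)
```

A k-d tree returns the same index as the linear scan while visiting only a fraction of the candidates, which is where the speed-up comes from.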
Both GSVD-MUSIC and DSVD-PHAT rely on singular value decompositions, but DSVD-PHAT computes them offline. The online processing only involves the projection in (20) and the k-d tree search, which is appealing for real-time processing.
IV Experimental Setup
To compare both methods with a wide range of conditions, we perform simulations to evaluate numerous room configurations and signal-to-noise ratios (SNRs). Noise from Baxter’s fans is therefore recorded and then mixed with male and female speech utterances from the TIMIT dataset , convolved with simulated Room Impulse Responses (RIRs) and amplified with various gains. The room impulse response (RIR) corresponds to the impulse response obtained with the image method  between the microphone array and the target sound sources, both positioned randomly in a m x m x m room. For each pair of SNR and room reverberation time RT60, we generate RIRs and use the same number of speech sources picked randomly from the TIMIT dataset.
The parameters for the experiments are summarized in Table I. The sample rate captures all the frequency content of speech, and the speed of sound corresponds to typical indoor conditions. The frame size analyzes segments of 16 msec, and the hop size provides a 50% overlap. The potential DOAs are represented by equidistant points on a unit half-sphere generated recursively from a tetrahedron, for a total of points, as in . The smoothing parameter provides a context of roughly msecs to estimate the correlation matrices, which captures multiple phonemes. The parameter is set to the value found in , which ensures a good accuracy. For this array configuration, the dimensionality of the subspace corresponds to with .
Table II lists the positions of the ReSpeaker array microphones (in cm) with respect to the center of the array.
In all experiments, the noise correlation matrix comes from the offline recording of the robot’s fans. This ensures we compare both methods independently of the performance of the online background noise estimation method.
To get some intuition about SSL with GSVD-MUSIC and DSVD-PHAT, we first analyze an example of a speech utterance with an SNR of dB and a reverberation level of RT60 = msecs, shown in Fig. 5. The spectrogram in Fig. (a) displays the speech signal, corrupted by some stationary noise between Hz and Hz. Fig. (b) shows the DOAs obtained from GSVD-MUSIC, with the true DOA represented by straight lines. This example demonstrates that, in this specific case, GSVD-MUSIC estimates many DOAs that differ from the theoretical DOA. Similarly, Fig. (c) displays the DOAs obtained from DSVD-PHAT for the same noisy signal. Here the estimated DOAs are closer to the theoretical DOA.
It is also convenient to define the expression to denote the angle difference between the estimated DOA at frame (obtained using GSVD-MUSIC or DSVD-PHAT), and the theoretical DOA extracted from the simulated room parameters:
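The angle difference between two DOAs can be sketched as the arccosine of the normalised dot product of the two direction vectors. The paper's exact expression is not reproduced here, so this formulation, the function name, and the clamping step that absorbs floating-point rounding are assumptions.

```python
import math

def angle_error_deg(estimated, reference):
    """Angle in degrees between an estimated and a reference DOA vector."""
    dot = sum(a * b for a, b in zip(estimated, reference))
    norm = (math.sqrt(sum(a * a for a in estimated))
            * math.sqrt(sum(b * b for b in reference)))
    # Clamp to [-1, 1] so rounding errors never push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

# Hypothetical DOAs: one along x, one along y.
print(angle_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```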
Let us define the margin , which corresponds to the DOA error tolerance for a localized source to be considered a valid DOA. In this section, we arbitrarily set the tolerance to , which corresponds to . Expression takes a value of when the localized sound source is within the range, or otherwise:
Similarly, the expression corresponds to the observation amplitude ( for GSVD-MUSIC from (9), and from (23) for DSVD-PHAT). This metric is relevant as it is often assumed that the confidence in the DOA depends on the associated amplitude of [16, 17]. Therefore, a DOA is considered as a positive when the amplitude equals or exceeds the fixed threshold , and as a negative otherwise:
Fig. 10 illustrates the angle difference of the DOAs estimated previously with both methods, and also displays the associated amplitudes. Note that for DSVD-PHAT in particular, the amplitude goes down when the value of gets outside the acceptable range, which suggests that a well-tuned could discriminate between accurate and inaccurate estimated DOAs.
To measure the performance of both methods, we vary the value of and compute the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). A TP occurs when the amplitude is greater than or equal to the threshold, and the measured DOA falls within the acceptable range of the theoretical DOA:
Similarly, a TN happens when a DOA out of the acceptable range is rejected as its associated amplitude is below the fixed threshold:
Finally, FP and FN occur when an erroneous DOA is picked and when a valid DOA is rejected, respectively:
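The four outcomes can be counted per frame from two decisions: whether the angle error falls within the tolerance, and whether the amplitude reaches the threshold. In the sketch below, the function name, the inclusive comparison on the tolerance, and all numeric values are assumptions for illustration.

```python
def confusion_counts(errors_deg, amplitudes, tolerance_deg, threshold):
    """Count (TP, TN, FP, FN) over frames.

    A frame is 'valid' when its angle error is within the tolerance
    (inclusive bound assumed) and 'accepted' when its amplitude is
    greater than or equal to the threshold.
    """
    tp = tn = fp = fn = 0
    for err, amp in zip(errors_deg, amplitudes):
        valid = err <= tolerance_deg
        accepted = amp >= threshold
        if accepted and valid:
            tp += 1
        elif accepted and not valid:
            fp += 1
        elif not accepted and valid:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# Hypothetical per-frame angle errors (degrees) and amplitudes.
counts = confusion_counts([2.0, 30.0, 5.0, 40.0], [0.9, 0.8, 0.2, 0.1],
                          tolerance_deg=10.0, threshold=0.5)
print(counts)
```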
Fig. 11 shows both ROC curves with GSVD-MUSIC and DSVD-PHAT for the previous example. In this case, the DSVD-PHAT surpasses the GSVD-MUSIC results as the Area Under the Curve (AUC) is clearly closer to .
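An ROC curve of this kind can be traced by sweeping the amplitude threshold over the observed amplitudes and recording the false positive and true positive rates at each step; the AUC then follows from the trapezoidal rule. The sketch below is a minimal version under the assumption that both valid and invalid frames are present; the function name and sample values are hypothetical.

```python
def roc_auc(valid, amplitudes):
    """Area under the ROC curve obtained by sweeping the amplitude threshold.

    valid: list of bools, True when the DOA is within the tolerance.
    amplitudes: confidence value of each frame.
    Assumes at least one valid and one invalid frame.
    """
    pos = sum(valid)
    neg = len(valid) - pos
    points = [(0.0, 0.0)]  # threshold above every amplitude
    for t in sorted(set(amplitudes), reverse=True):
        tp = sum(1 for v, a in zip(valid, amplitudes) if a >= t and v)
        fp = sum(1 for v, a in zip(valid, amplitudes) if a >= t and not v)
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration of TPR over FPR.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

# Hypothetical frames where high amplitudes coincide with valid DOAs.
auc = roc_auc([True, True, False, False], [0.9, 0.7, 0.4, 0.2])
print(auc)
```

An amplitude that separates valid from invalid DOAs perfectly gives an AUC of 1, while an amplitude unrelated to validity gives a value near 0.5.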
Table III shows the AUC results for SNRs dB and RT60 msecs. In general, GSVD-MUSIC generates higher AUC values for cases when the SNR is below dB. However, DSVD-PHAT still provides AUC values close to those of GSVD-MUSIC, which demonstrates that the proposed method also allows accurate DOA estimation under reverberant and noisy conditions. Moreover, the proposed DSVD-PHAT approach provides better results for all scenarios where the SNR is greater than or equal to dB, at all reverberation levels.
Table III: AUC for GSVD-MUSIC and DSVD-PHAT. Columns: SNR (dB), RT60 (msec), GSVD-MUSIC, DSVD-PHAT.
Both methods are compared in terms of execution time per frame. These methods run in the MATLAB environment, and their implementation relies mostly on vectorization to speed up processing. The hardware used consists of an Intel Xeon CPU E5-1620 clocked at 3.70 GHz. Table IV shows the average execution time per frame. This demonstrates the significant efficiency gain with DSVD-PHAT, which avoids the expensive online SVD computations, as it runs approximately times faster than GSVD-MUSIC. In this experiment, with msecs between each frame, GSVD-MUSIC requires roughly of the available computing resources to run in real time, whereas DSVD-PHAT easily meets real-time requirements by using only of the computing power.
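The real-time load quoted above is the ratio of the average execution time per frame to the time between frames. With a 16 msec frame and a 50% overlap, frames arrive every 8 msec. The execution times in the sketch below are hypothetical placeholders, not the measurements of Table IV, and the function name is an assumption.

```python
def realtime_load(exec_time_per_frame_s, hop_duration_s):
    """Fraction of one core needed to keep up with the incoming frames.

    A value above 1.0 means the method cannot run in real time on this core.
    """
    return exec_time_per_frame_s / hop_duration_s

hop_duration_s = 0.016 * 0.5   # 16 msec frame with 50% overlap

# Hypothetical per-frame execution times for a slow and a fast method.
slow = realtime_load(0.004, hop_duration_s)
fast = realtime_load(0.0004, hop_duration_s)
print(slow, fast, slow / fast)
```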
This paper introduces a variant of the SVD-PHAT method to improve noise robustness. Results demonstrate that the proposed method performs similarly to the state-of-the-art GSVD-MUSIC technique, but runs approximately times faster. This makes DSVD-PHAT appealing for localization on robots with limited on-board computing power.
-  H. G. Okuno, T. Ogata, K. Komatani, and K. Nakadai, “Computational auditory scene analysis and its application to robot audition,” in Proceedings of the International Conference on Informatics Research for Development of Knowledge Society Infrastructure. IEEE, 2004, pp. 73–80.
-  G. Ince, K. Nakadai, T. Rodemann, H. Tsujino, and J.-I. Imura, “Robust ego noise suppression of a robot,” in Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 2010, pp. 62–71.
-  R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276–280, 1986.
-  C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, pp. 2027–2032.
-  K. Nakamura, K. Nakadai, F. Asano, Y. Hasegawa, and H. Tsujino, “Intelligent sound source localization for dynamic environments,” in Proceedings of the IEEE/RSJ international conference on Intelligent Robots and Systems. IEEE, 2009, pp. 664–669.
-  K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 143–148.
-  K. Nakadai, G. Ince, K. Nakamura, and H. Nakajima, “Robot audition for dynamic environments,” in Proceedings of the IEEE International Conference on Signal Processing, Communication and Computing. IEEE, 2012, pp. 125–130.
-  K. Nakamura, K. Nakadai, and G. Ince, “Real-time super-resolution sound source localization for robots,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 694–699.
-  T. Ohata, K. Nakamura, T. Mizumoto, T. Taiki, and K. Nakadai, “Improvement in outdoor sound source detection using a quadrotor-embedded microphone array,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1902–1907.
-  M. Brandstein and H. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1. IEEE, 1997, pp. 375–378.
-  J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, “Robust sound source localization using a microphone array on a mobile robot,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2. IEEE, 2003, pp. 1228–1233.
-  J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound sources for mobile robot using a frequency-domain steered beamformer approach,” in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 1. IEEE, 2004, pp. 1033–1038.
-  J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 4. IEEE, 2006, pp. 841–844.
-  ——, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007.
-  A. Badali, J.-M. Valin, F. Michaud, and P. Aarabi, “Evaluating real-time audio localization algorithms for artificial audition in robotics,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 2033–2038.
-  F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013.
-  F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019.
-  F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signals Processing, 2019.
-  F. Grondin and F. Michaud, “Time difference of arrival estimation based on binary frequency mask for sound source localization on mobile robots,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2015, pp. 6149–6154.
-  ——, “Noise mask for TDOA sound source localization of speech on mobile robots in noisy environments,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2016, pp. 4530–4535.
-  I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE signal processing letters, vol. 9, no. 1, pp. 12–15, 2002.
-  H. Nakajima, G. Ince, K. Nakadai, and Y. Hasegawa, “An easily-configurable robot audition system using histogram-based recursive level estimation,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 958–963.
-  P. Pertilä and E. Cakir, “Robust direction estimation with convolutional neural networks based steered response power,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 6125–6129.
-  T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5210–5214.
-  V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech communication, vol. 9, no. 4, pp. 351–356, 1990.
-  J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
-  K. Nakadai, T. Takahashi, H. G. Okuno, H. Nakajima, Y. Hasegawa, and H. Tsujino, “Design and implementation of robot audition system’hark’—open source software for listening to three simultaneous speakers,” Advanced Robotics, vol. 24, no. 5-6, pp. 739–761, 2010.