I. Introduction
Robot audition aims to provide robots with hearing capabilities so they can interact efficiently with people in everyday environments [1]. Sound source localization (SSL) is a typical task that consists of estimating the direction of arrival (DOA) of a target source using a microphone array. This task is challenging, as the robot usually generates a significant amount of noise (fans, actuators, etc.) [2] and the target sound source is corrupted by reverberation. SSL often relies on the Multiple Signal Classification (MUSIC) and Steered Response Power with Phase Transform (SRP-PHAT) methods.
MUSIC is a localization method based on Standard Eigenvalue Decomposition (SEVD-MUSIC) that was initially used for narrowband signals [3], and then adapted to broadband signals like speech [4]. However, SEVD-MUSIC assumes the speech signal is more powerful than the noise at each frequency bin in the spectrogram, which is usually not the case. To cope with this limitation, Nakamura et al. introduced MUSIC based on Generalized Eigenvalue Decomposition (GEVD-MUSIC) [5, 6, 7]. This method solves the limitation of SEVD-MUSIC, but also introduces some localization errors because the transform provides a noise subspace with correlated bases. To deal with this issue, a variant of GEVD-MUSIC, named MUSIC based on Generalized Singular Value Decomposition (GSVD-MUSIC), enforces orthogonality between the noise subspace bases and thus improves the DOA estimation accuracy [8].
However, all MUSIC-based methods rely on online eigenvalue or singular value decompositions, which are computationally expensive and make onboard real-time processing challenging [9].

SRP-PHAT is built on the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) between each pair of microphones [10]. GCC-PHAT is often computed with the Inverse Fast Fourier Transform (IFFT) to speed up computation, at the cost of discretizing the Time Difference of Arrival (TDOA) values, which reduces localization accuracy. SRP-PHAT usually scans a discretized 3-D space and returns the most likely DOA
[11, 12, 13, 14, 15, 16]. This scanning process often involves a significant number of memory lookups, which creates a bottleneck and increases execution time. To reduce the number of lookups, a hierarchical search has been proposed to speed up the space scan, but this method still relies on discrete TDOA values [17]. We therefore recently proposed the Singular Value Decomposition with Phase Transform (SVD-PHAT) method, which avoids TDOA discretization and significantly reduces computing time [18]. However, like SRP-PHAT, SVD-PHAT remains sensitive to additive noise. To cope with this limitation, time-frequency (TF) masks can be generated to improve robustness to stationary noise [19, 20]. Stationary noise is often estimated with techniques like Minima Controlled Recursive Averaging (MCRA) [21] and Histogram-based Recursive Level Estimation (HRLE) [22], or recorded offline prior to testing if the robot's environment is static. Pertilä et al. also propose a method that generates TF masks using convolutional neural networks for nonstationary noise sources [23]. However, these TF masks ignore noise spatial coherence, which carries useful insights for robust localization and is in fact exploited by GSVD-MUSIC.

In this paper, we propose a variant of the SVD-PHAT method, called Difference SVD-PHAT (DSVD-PHAT), that performs correlation matrix subtraction, which accounts for noise spatial coherence while preserving the low complexity of the original SVD-PHAT. Section II reviews the state-of-the-art GSVD-MUSIC method, and Section III introduces the proposed DSVD-PHAT method. Section IV describes the experimental setup on a Baxter robot, and Section V compares results from GSVD-MUSIC and the proposed DSVD-PHAT approach.
II. GSVD-MUSIC
GSVD-MUSIC relies on the Time Difference of Arrival (TDOA) between each microphone and a reference point in space. The TDOA (in sec) stands for the propagation delay for the signal emitted by a sound source at DOA $\mathbf{u}$ (where $\|\mathbf{u}\| = 1$, with $\|\cdot\|$ standing for the norm) to reach microphone $m \in \{1, \dots, M\}$ with respect to the origin. For discrete-time signals, the TDOA $\tau_m$ is usually expressed in terms of samples, as shown in (1), where $c$ stands for the speed of sound in air (in m/sec), $f_S$ is the sample rate (in samples/sec), and $\mathbf{p}_m$ is the position of microphone $m$ (in m). The operator $\cdot$ stands for the dot product.
$\tau_m = \dfrac{f_S}{c}\, \mathbf{u} \cdot \mathbf{p}_m \qquad (1)$
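As a concrete illustration of (1), the sketch below maps a candidate DOA to per-microphone delays in samples. The microphone layout, sample rate and speed of sound here are assumptions for the example, not the values from Tables I and II.

```python
import numpy as np

# Hypothetical square 4-microphone layout (in m); the actual ReSpeaker
# geometry is given later in Table II.
mic_positions = np.array([
    [ 0.032,  0.000, 0.0],
    [ 0.000,  0.032, 0.0],
    [-0.032,  0.000, 0.0],
    [ 0.000, -0.032, 0.0],
])

def tdoa_samples(u, positions, fs=16000.0, c=343.0):
    """TDOA of each microphone w.r.t. the origin, in samples, as in (1):
    tau_m = (f_S / c) * (u . p_m)."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)            # the DOA is a unit vector
    return (fs / c) * (positions @ u)

tau = tdoa_samples([1.0, 0.0, 0.0], mic_positions)
```

For a far-field source along the x axis, the two microphones on that axis get delays of equal magnitude and opposite signs, while the two on the y axis get zero delay.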
The expression $X_m[k, l]$ stands for the Short-Time Fourier Transform (STFT) coefficient of microphone $m$, at frequency bin $k$ and frame $l$, where $N$ and $\Delta N$ stand for the frame and hop sizes in samples, respectively. The STFT values are concatenated in the vector $\mathbf{x}[k, l] \in \mathbb{C}^{M \times 1}$, as shown in (2).

$\mathbf{x}[k, l] = \begin{bmatrix} X_1[k, l] & X_2[k, l] & \dots & X_M[k, l] \end{bmatrix}^T \qquad (2)$
GSVD-MUSIC uses a steering vector $\mathbf{a}_q[k]$ for each potential DOA $\mathbf{u}_q$, with $q \in \{1, \dots, Q\}$:

$\mathbf{a}_q[k] = \dfrac{1}{\sqrt{M}} \begin{bmatrix} e^{-j 2\pi k \tau_{q,1}/N} & \dots & e^{-j 2\pi k \tau_{q,M}/N} \end{bmatrix}^T \qquad (3)$

where $j = \sqrt{-1}$.
The correlation matrix $\mathbf{R}_{xx}[k, l]$ of the vector $\mathbf{x}[k, l]$ at each frequency bin can be estimated at each frame using the following recursive approximation, where the parameter $\alpha \in [0, 1]$ is the adaptive rate:

$\mathbf{R}_{xx}[k, l] = (1 - \alpha)\, \mathbf{R}_{xx}[k, l-1] + \alpha\, \mathbf{x}[k, l]\, \mathbf{x}[k, l]^H \qquad (4)$

where $\{\cdot\}^H$ stands for the Hermitian operator.
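The recursion in (4) can be sketched as follows; the value of `alpha` here is a placeholder, not the one used in the experiments.

```python
import numpy as np

def update_correlation(R_prev, x, alpha=0.05):
    """One step of the recursive estimate in (4):
    R_xx[k,l] = (1 - alpha) R_xx[k,l-1] + alpha x[k,l] x[k,l]^H."""
    return (1.0 - alpha) * R_prev + alpha * np.outer(x, np.conj(x))

# Feed one STFT vector per frame, for a given frequency bin k.
M = 4
rng = np.random.default_rng(0)
R = np.zeros((M, M), dtype=complex)
for _ in range(200):
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    R = update_correlation(R, x)
```

The estimate stays Hermitian by construction, a property that both the decomposition in (5) and the subtraction in (12) rely on.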
The GSVD-MUSIC method performs a generalized singular value decomposition of $\mathbf{R}_{xx}[k, l]$ with respect to the noise correlation matrix $\mathbf{R}_{NN}[k]$ (which can be estimated as in (4) during silence periods, or precomputed offline if the test environment is known):

$\mathbf{R}_{NN}[k]^{-1}\, \mathbf{R}_{xx}[k, l] = \mathbf{U}[k, l]\, \mathbf{\Sigma}[k, l]\, \mathbf{V}[k, l]^H \qquad (5)$

where the diagonal matrix $\mathbf{\Sigma}[k, l]$ holds the singular values in descending order ($\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_M$), and $\mathbf{U}[k, l]$ and $\mathbf{V}[k, l]$ hold the left and right singular vectors $\mathbf{u}_m[k, l]$ and $\mathbf{v}_m[k, l]$, respectively:

$\mathbf{\Sigma}[k, l] = \mathrm{diag}\left(\sigma_1[k, l], \sigma_2[k, l], \dots, \sigma_M[k, l]\right) \qquad (6)$

$\mathbf{U}[k, l] = \begin{bmatrix} \mathbf{u}_1[k, l] & \mathbf{u}_2[k, l] & \dots & \mathbf{u}_M[k, l] \end{bmatrix} \qquad (7)$

$\mathbf{V}[k, l] = \begin{bmatrix} \mathbf{v}_1[k, l] & \mathbf{v}_2[k, l] & \dots & \mathbf{v}_M[k, l] \end{bmatrix} \qquad (8)$
This method projects the steering vector in the noise subspace, spanned by the singular vectors $\{\mathbf{u}_2[k, l], \dots, \mathbf{u}_M[k, l]\}$ (when there is only one target source). The inverse of the projections for each frequency bin is summed over the full spectrum (which may also be restricted to a more specific range of frequency bins [8]):

$P_q[l] = \sum_{k} \dfrac{1}{\sum_{m=2}^{M} \left| \mathbf{a}_q[k]^H\, \mathbf{u}_m[k, l] \right|} \qquad (9)$
The sound source DOA then corresponds to $\mathbf{u}_{q_{max}}$, where:

$q_{max} = \operatorname{argmax}_{q}\, P_q[l] \qquad (10)$
GSVD-MUSIC involves a singular value decomposition of an $M \times M$ matrix at every frequency bin of each frame, as shown in (5), which is challenging from a computing point of view for real-time applications. Moreover, it also involves computing (9) for all $Q$ potential sources, which implies a significant amount of computation. The proposed DSVD-PHAT aims to reduce the amount of computation, while preserving a similar robustness to noise.
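A minimal sketch of the pipeline in (5)-(10), assuming precomputed correlation matrices and steering vectors; the per-bin decomposition inside the loop is exactly the online cost discussed above. All array shapes are assumptions for illustration.

```python
import numpy as np

def gsvd_music_spectrum(R_xx, R_NN, steering):
    """Sketch of (5)-(9) for a single target source.
    R_xx, R_NN: (K, M, M) correlation matrices per frequency bin.
    steering:   (Q, K, M) steering vectors a_q[k].
    Returns P_q summed over all bins, as in (9)."""
    K_bins, M, _ = R_xx.shape
    Q = steering.shape[0]
    P = np.zeros(Q)
    for k in range(K_bins):
        # (5): decomposition of R_NN^{-1} R_xx (one SVD per bin, per frame)
        U, s, Vh = np.linalg.svd(np.linalg.solve(R_NN[k], R_xx[k]))
        E = U[:, 1:]                                   # noise subspace
        proj = np.abs(steering[:, k, :].conj() @ E)    # |a_q^H u_m|, m >= 2
        P += 1.0 / np.maximum(proj.sum(axis=1), 1e-12)
    return P

# (10): the DOA estimate is the argmax over the Q candidate directions.
```

A steering vector aligned with the dominant generalized singular vector has a near-zero projection onto the noise subspace, so its inverse projection, and hence $P_q$, peaks at the source direction.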
III. DSVD-PHAT
DSVD-PHAT relies on the TDOA between each pair of microphones $i$ and $j$ (as opposed to (1), where the TDOA is between a microphone and the origin), which leads to the following expression, for a total of $P = M(M-1)/2$ pairs:

$\tau_{i,j} = \dfrac{f_S}{c}\, \mathbf{u} \cdot (\mathbf{p}_i - \mathbf{p}_j) \qquad (11)$
Since the noise and speech sources are independent, it is reasonable to assume that the clean speech correlation matrix $\mathbf{R}_{SS}[k, l]$ can be estimated from the difference between the noisy speech and the noise correlation matrices at each frame $l$, as proposed in [24]:

$\mathbf{R}_{SS}[k, l] = \mathbf{R}_{xx}[k, l] - \mathbf{R}_{NN}[k] \qquad (12)$
The normalized cross-spectra in DSVD-PHAT at each frequency bin are thus obtained as follows, where $(\cdot)_{i,j}$ refers to the element in the $i$-th row and $j$-th column:

$X_{i,j}[k, l] = \dfrac{\left(\mathbf{R}_{SS}[k, l]\right)_{i,j}}{\left| \left(\mathbf{R}_{SS}[k, l]\right)_{i,j} \right|} \qquad (13)$
Note how DSVD-PHAT differs from the original SVD-PHAT, as the latter directly uses the noisy correlation matrix (i.e. $\mathbf{R}_{xx}[k, l]$ replaces $\mathbf{R}_{SS}[k, l]$ in (13)).
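The subtraction in (12) followed by the phase normalization in (13) can be sketched as below, for one frequency bin. The `eps` guard against division by zero is an implementation detail, not part of the method.

```python
import numpy as np

def normalized_cross_spectra(R_xx, R_NN, eps=1e-12):
    """Sketch of (12)-(13) for one frequency bin: subtract the noise
    correlation matrix, then keep only the phase of each cross-spectrum."""
    R_ss = R_xx - R_NN                        # (12)
    M = R_ss.shape[0]
    i, j = np.triu_indices(M, k=1)            # the P = M(M-1)/2 mic pairs
    cross = R_ss[i, j]
    return cross / np.maximum(np.abs(cross), eps)    # (13)
```

The output is one unit-magnitude complex value per microphone pair, which is what gives the method its phase-transform (PHAT) character.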
We then define the vector $\mathbf{x}[l] \in \mathbb{C}^{P(N/2+1) \times 1}$ to concatenate all normalized cross-spectra introduced in (13), over all microphone pairs and frequency bins:

$\mathbf{x}[l] = \begin{bmatrix} X_{1,2}[0, l] & X_{1,2}[1, l] & \dots & X_{M-1,M}[N/2, l] \end{bmatrix}^T \qquad (14)$
The matrix $\mathbf{W} \in \mathbb{C}^{Q \times P(N/2+1)}$ holds all the SRP-PHAT coefficients, one row per potential DOA $\mathbf{u}_q$:

$\left(\mathbf{W}\right)_{q,(i,j,k)} = e^{j 2\pi k \tau^q_{i,j} / N} \qquad (15)$
The vector $\mathbf{y}[l] \in \mathbb{R}^{Q \times 1}$ stores the SRP-PHAT energy for all potential DOAs, where $\Re\{\cdot\}$ extracts the real part of the expression:

$\mathbf{y}[l] = \Re\left\{ \mathbf{W}\, \mathbf{x}[l] \right\} \qquad (16)$
The sound source DOA corresponds to $\mathbf{u}_{q_{max}}$, where:

$q_{max} = \operatorname{argmax}_{q}\, \left(\mathbf{y}[l]\right)_q \qquad (17)$
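Equations (15)-(17) amount to one complex matrix-vector product followed by an argmax. The sketch below builds a hypothetical $\mathbf{W}$ from pairwise TDOAs; the exponent sign is chosen so that $\mathbf{W}$ compensates the phase of the observation vector, as in (15).

```python
import numpy as np

def build_W(taus, K, N):
    """Hypothetical construction of (15): one row per candidate DOA q,
    flattening the pair and bin axes to match the observation vector x[l].
    taus: (Q, P) pairwise TDOAs (in samples) for each candidate DOA."""
    k = np.arange(K)
    W = np.exp(2j * np.pi * k[None, None, :] * taus[:, :, None] / N)
    return W.reshape(taus.shape[0], -1)

def srp_phat_scan(W, x):
    """(16)-(17): y = Re{W x}; the DOA estimate is the argmax entry."""
    y = np.real(W @ x)
    return int(np.argmax(y)), y
```

When the observation matches a candidate row exactly, every pair/bin term contributes 1 and the energy reaches its maximum of $P(N/2+1)$ for that row.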
Computing $\mathbf{y}[l]$ for all values of $q$ is expensive, and therefore SVD-PHAT provides a more efficient way of finding $q_{max}$. The Singular Value Decomposition is first performed on the $\mathbf{W}$ matrix, where $\mathbf{U} \in \mathbb{C}^{Q \times K}$, $\mathbf{\Sigma} \in \mathbb{R}^{K \times K}$, and $\mathbf{V} \in \mathbb{C}^{P(N/2+1) \times K}$:

$\mathbf{W} \approx \mathbf{U}\, \mathbf{\Sigma}\, \mathbf{V}^H \qquad (18)$
The parameter $K$ (where $K \ll P(N/2+1)$) satisfies the condition in (19), which ensures an accurate reconstruction of $\mathbf{W}$, where $\delta$ is a user-defined small value that stands for the tolerable reconstruction error. The operator $\mathrm{tr}\{\cdot\}$ represents the trace of the matrix.

$\mathrm{tr}\left\{\mathbf{\Sigma}^2\right\} \geq (1 - \delta)\, \mathrm{tr}\left\{\mathbf{W}\mathbf{W}^H\right\} \qquad (19)$
The vector $\mathbf{z}[l] \in \mathbb{C}^{K \times 1}$ results from the projection of the observations in the $K$-dimensional subspace:

$\mathbf{z}[l] = \mathbf{V}^H\, \mathbf{x}[l] \qquad (20)$
Similarly, the matrix $\mathbf{D} = \mathbf{U}\mathbf{\Sigma} \in \mathbb{C}^{Q \times K}$ holds a set of vectors $\mathbf{d}_q$, one per potential DOA:

$\mathbf{D} = \begin{bmatrix} \mathbf{d}_1 & \mathbf{d}_2 & \dots & \mathbf{d}_Q \end{bmatrix}^T \qquad (21)$
The optimization in (17) can then be converted to a nearest neighbor problem:

$q_{max} = \operatorname{argmin}_{q}\, \left\| \hat{\mathbf{d}}_q - \hat{\mathbf{z}}[l] \right\| \qquad (22)$

where $\hat{\mathbf{d}}_q = \mathbf{d}_q / \|\mathbf{d}_q\|$ and $\hat{\mathbf{z}}[l] = \mathbf{z}[l] / \|\mathbf{z}[l]\|$. A k-d tree then solves this nearest neighbor search problem efficiently. The corresponding amplitude for the optimal DOA at index $q_{max}$ corresponds to:
$y_{q_{max}}[l] = \Re\left\{ \mathbf{w}_{q_{max}}\, \mathbf{x}[l] \right\} \qquad (23)$

where $\mathbf{w}_{q_{max}}$ stands for the $q_{max}$-th row of $\mathbf{W}$.
Both GSVD-MUSIC and DSVD-PHAT rely on singular value decompositions, but DSVD-PHAT computes them offline. The online processing only involves the projection in (20) and the k-d tree search, which is appealing for real-time processing.
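The offline/online split can be sketched as follows with NumPy and SciPy's `cKDTree`. Conjugating the rows of $\mathbf{D}$ is an implementation detail of this sketch, so that Euclidean distance in the tree tracks the correlation in (16); the normalization mirrors (22) and assumes, as a simplification, that all rows of $\mathbf{W}$ have similar norms.

```python
import numpy as np
from scipy.spatial import cKDTree

def svd_phat_offline(W, delta=0.01):
    """Offline step, sketching (18), (19) and (21)."""
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    K = int(np.searchsorted(energy, 1.0 - delta)) + 1  # smallest rank for (19)
    D = np.conj(U[:, :K] * s[:K])   # conjugated so ||d - z|| tracks Re{W x}
    D_hat = D / np.linalg.norm(D, axis=1, keepdims=True)
    # cKDTree handles real coordinates, so stack real and imaginary parts
    tree = cKDTree(np.hstack([D_hat.real, D_hat.imag]))
    return tree, Vh[:K], K

def svd_phat_online(tree, VhK, x):
    """Online step, sketching (20) and (22): project, normalize, search."""
    z = VhK @ x                      # (20)
    z_hat = z / np.linalg.norm(z)
    _, q = tree.query(np.hstack([z_hat.real, z_hat.imag]))
    return int(q)
```

All the heavy linear algebra runs once offline; each frame then costs one small projection plus a logarithmic-time tree query, which is the source of the speed-up reported later.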
IV. Experimental Setup
The GSVD-MUSIC and DSVD-PHAT methods are evaluated on a Baxter robot, equipped with a 4-microphone ReSpeaker array (http://seeedstudio.io) mounted on its head, as shown in Fig. 1.
To compare both methods over a wide range of conditions, we perform simulations to evaluate numerous room configurations and signal-to-noise ratios (SNRs). Noise from Baxter's fans is recorded and then mixed with male and female speech utterances from the TIMIT dataset [25], convolved with simulated Room Impulse Responses (RIRs) and amplified with various gains. Each RIR is obtained with the image method [26] between the microphone array and the target sound source, both positioned randomly in the room. For each pair of SNR and room reverberation time RT60, we generate RIRs and use the same number of speech sources picked randomly from the TIMIT dataset.
The parameters for the experiments are summarized in Table I. The sample rate captures all the frequency content of speech, and the speed of sound corresponds to typical indoor conditions. The frame size analyzes segments of 16 msecs, and the hop size provides a 50% overlap. The potential DOAs are represented by equidistant points on a unit half-sphere generated recursively from a tetrahedron, as in [17]. The smoothing parameter $\alpha$ provides enough temporal context to capture multiple phonemes when estimating the correlation matrices. The parameter $\delta$ is set to the value found in [18], which ensures good accuracy and determines the subspace dimensionality $K$ for this array configuration.
Table II lists the positions of the ReSpeaker array microphones (in cm) w.r.t. the center of the array.
In all experiments, the noise correlation matrix comes from the offline recording of the robot’s fans. This ensures we compare both methods independently of the performance of the online background noise estimation method.
V. Results
To get some intuition about SSL with GSVD-MUSIC and DSVD-PHAT, we first analyze an example of a speech utterance at a given SNR and reverberation level RT60, shown in Fig. 5. The spectrogram in Fig. 5(a) displays the speech signal, corrupted by stationary noise concentrated in a specific frequency band. Fig. 5(b) shows the DOAs obtained from GSVD-MUSIC, with the true DOA represented by straight lines. This example demonstrates that, in this specific case, GSVD-MUSIC estimates many DOAs that differ from the theoretical DOA. Similarly, Fig. 5(c) displays the DOAs obtained from DSVD-PHAT for the same noisy signal. Here the estimated DOAs are closer to the theoretical DOA.
It is also convenient to define the angle difference $\Delta\theta[l]$ between the estimated DOA $\mathbf{u}_{q_{max}}[l]$ at frame $l$ (obtained using GSVD-MUSIC or DSVD-PHAT) and the theoretical DOA $\mathbf{u}_{th}$ extracted from the simulated room parameters:

$\Delta\theta[l] = \arccos\left( \mathbf{u}_{q_{max}}[l] \cdot \mathbf{u}_{th} \right) \qquad (24)$
Let us define the margin $\theta_{max}$ that corresponds to the DOA error tolerance for a localized source to be considered a valid DOA; in this section, we define this tolerance arbitrarily. The expression $T[l]$ takes a value of 1 when the localized sound source is within this margin, or 0 otherwise:

$T[l] = \begin{cases} 1 & \Delta\theta[l] \leq \theta_{max} \\ 0 & \text{otherwise} \end{cases} \qquad (25)$
Similarly, the expression $A[l]$ corresponds to the observation amplitude ($P_{q_{max}}[l]$ for GSVD-MUSIC from (9), and $y_{q_{max}}[l]$ from (23) for DSVD-PHAT). This metric is relevant, as it is often assumed that the confidence in the DOA depends on the associated amplitude [16, 17]. Therefore, a DOA is considered a positive when the amplitude equals or exceeds a fixed threshold $\lambda$, and a negative otherwise:

$P[l] = \begin{cases} 1 & A[l] \geq \lambda \\ 0 & \text{otherwise} \end{cases} \qquad (26)$
Fig. 10 illustrates the angle difference of the DOAs estimated previously with both methods, and also displays the associated amplitudes. Note that for DSVD-PHAT in particular, the amplitude goes down when $\Delta\theta[l]$ gets outside the acceptable range, which suggests that a well-tuned threshold $\lambda$ could discriminate between accurate and inaccurate estimated DOAs.
To measure the performance of both methods, we vary the value of $\lambda$ and compute the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). A TP occurs when the amplitude is greater than or equal to the threshold, and the measured DOA falls within the acceptable range of the theoretical DOA:

$TP = \sum_{l} P[l]\, T[l] \qquad (27)$
Similarly, a TN happens when a DOA out of the acceptable range is rejected, as its associated amplitude falls below the threshold:

$TN = \sum_{l} \left(1 - P[l]\right)\left(1 - T[l]\right) \qquad (28)$
Finally, FP and FN occur when an erroneous DOA is accepted and when a valid DOA is rejected, respectively:

$FP = \sum_{l} P[l]\left(1 - T[l]\right) \qquad (29)$

$FN = \sum_{l} \left(1 - P[l]\right) T[l] \qquad (30)$
The True Positive Rate (TPR) and False Positive Rate (FPR) then correspond to (31) and (32), respectively, and are used to build the ROC curve.

$TPR = \dfrac{TP}{TP + FN} \qquad (31)$

$FPR = \dfrac{FP}{FP + TN} \qquad (32)$
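Equations (25)-(32) reduce to boolean counting per frame; a sketch, where the margin and threshold values are hypothetical inputs:

```python
import numpy as np

def roc_point(amplitudes, angle_errors, threshold, margin):
    """Compute one (FPR, TPR) operating point from (25)-(32).
    amplitudes:   per-frame amplitude A[l] (from (9) or (23)).
    angle_errors: per-frame angle difference from (24), same units as margin."""
    pos = np.asarray(amplitudes) >= threshold     # (26)
    valid = np.asarray(angle_errors) <= margin    # (25)
    TP = np.sum(pos & valid)                      # (27)
    TN = np.sum(~pos & ~valid)                    # (28)
    FP = np.sum(pos & ~valid)                     # (29)
    FN = np.sum(~pos & valid)                     # (30)
    TPR = TP / max(TP + FN, 1)                    # (31)
    FPR = FP / max(FP + TN, 1)                    # (32)
    return FPR, TPR
```

Sweeping `threshold` over the observed amplitude range traces the ROC curve; the Area Under the Curve then summarizes it in a single number.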
Fig. 11 shows the ROC curves for GSVD-MUSIC and DSVD-PHAT for the previous example. In this case, DSVD-PHAT surpasses the GSVD-MUSIC results, as its Area Under the Curve (AUC) is clearly closer to 1.
Table III shows the AUC results over the tested SNRs and RT60 values. In general, GSVD-MUSIC generates higher AUC values at the lowest SNRs. However, DSVD-PHAT still provides AUC values close to those of GSVD-MUSIC, which demonstrates that the proposed method also allows accurate DOA estimation under reverberant and noisy conditions. Moreover, the proposed DSVD-PHAT approach provides better results at the higher SNRs, at all reverberation levels.
SNR (dB)  RT60 (msec)  GSVD-MUSIC  DSVD-PHAT
Both methods are also compared in terms of execution time per frame. Both run in the MATLAB environment, and their implementations rely mostly on vectorization to speed up processing. The hardware consists of an Intel Xeon CPU E5-1620 clocked at 3.70 GHz. Table IV shows the average execution time per frame. This demonstrates the significant efficiency gain of DSVD-PHAT, which avoids the expensive online SVD computations: it runs approximately 250 times faster than GSVD-MUSIC. In this experiment, with 8 msecs between each frame (16-msec frames with 50% overlap), GSVD-MUSIC requires roughly 290% of the available computing resources to achieve real-time, whereas DSVD-PHAT easily meets real-time requirements by using only about 1% of the computing power.
Method  GSVD-MUSIC  DSVD-PHAT
Time (msecs)  23.3  0.093
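The real-time figures above follow directly from Table IV: with 16-msec frames and 50% overlap, a new frame arrives every 8 msecs, which sets the computing budget per frame.

```python
# Derived from Table IV and the STFT settings (16-msec frames, 50% overlap).
frame_interval_ms = 8.0
gsvd_music_ms = 23.3
dsvd_phat_ms = 0.093

speedup = gsvd_music_ms / dsvd_phat_ms           # ratio between the methods
gsvd_load = gsvd_music_ms / frame_interval_ms    # fraction of the time budget
dsvd_load = dsvd_phat_ms / frame_interval_ms
```

A load above 1.0 means the method cannot keep up with the incoming frames on a single core.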
VI. Conclusion
This paper introduces a variant of the SVD-PHAT method that improves noise robustness. Results demonstrate that the proposed method performs similarly to the state-of-the-art GSVD-MUSIC technique, but runs approximately 250 times faster. This makes DSVD-PHAT appealing for localization on robots with limited onboard computing power.
In future work, we will investigate multiple sound source localization with the proposed DSVD-PHAT method. Moreover, DSVD-PHAT could be incorporated into existing SSL frameworks such as HARK (http://hark.jp) [27] and ODAS (http://odas.io) [17].
References
 [1] H. G. Okuno, T. Ogata, K. Komatani, and K. Nakadai, “Computational auditory scene analysis and its application to robot audition,” in Proceedings of the International Conference on Informatics Research for Development of Knowledge Society Infrastructure. IEEE, 2004, pp. 73–80.
 [2] G. Ince, K. Nakadai, T. Rodemann, H. Tsujino, and J.-I. Imura, “Robust ego noise suppression of a robot,” in Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 2010, pp. 62–71.
 [3] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
 [4] C. Ishi, O. Chatot, H. Ishiguro, and N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, pp. 2027–2032.
 [5] K. Nakamura, K. Nakadai, F. Asano, Y. Hasegawa, and H. Tsujino, “Intelligent sound source localization for dynamic environments,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 664–669.
 [6] K. Nakamura, K. Nakadai, F. Asano, and G. Ince, “Intelligent sound source localization and its application to multimodal human tracking,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 143–148.
 [7] K. Nakadai, G. Ince, K. Nakamura, and H. Nakajima, “Robot audition for dynamic environments,” in Proceedings of the IEEE International Conference on Signal Processing, Communication and Computing. IEEE, 2012, pp. 125–130.

 [8] K. Nakamura, K. Nakadai, and G. Ince, “Real-time super-resolution sound source localization for robots,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 694–699.
 [9] T. Ohata, K. Nakamura, T. Mizumoto, T. Taiki, and K. Nakadai, “Improvement in outdoor sound source detection using a quadrotor-embedded microphone array,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 1902–1907.
 [10] M. Brandstein and H. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. 1. IEEE, 1997, pp. 375–378.
 [11] J.M. Valin, F. Michaud, J. Rouat, and D. Létourneau, “Robust sound source localization using a microphone array on a mobile robot,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2. IEEE, 2003, pp. 1228–1233.
 [12] J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat, “Localization of simultaneous moving sound sources for mobile robot using a frequency-domain steered beamformer approach,” in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 1. IEEE, 2004, pp. 1033–1038.
 [13] J.-M. Valin, F. Michaud, and J. Rouat, “Robust 3D localization and tracking of sound sources using beamforming and particle filtering,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4. IEEE, 2006, pp. 841–844.
 [14] ——, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216–228, 2007.
 [15] A. Badali, J.-M. Valin, F. Michaud, and P. Aarabi, “Evaluating real-time audio localization algorithms for artificial audition in robotics,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 2033–2038.
 [16] F. Grondin, D. Létourneau, F. Ferland, V. Rousseau, and F. Michaud, “The ManyEars open framework,” Autonomous Robots, vol. 34, no. 3, pp. 217–232, 2013.
 [17] F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, vol. 113, pp. 63–80, 2019.
 [18] F. Grondin and J. Glass, “SVD-PHAT: A fast sound source localization method,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
 [19] F. Grondin and F. Michaud, “Time difference of arrival estimation based on binary frequency mask for sound source localization on mobile robots,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2015, pp. 6149–6154.
 [20] ——, “Noise mask for TDOA sound source localization of speech on mobile robots in noisy environments,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2016, pp. 4530–4535.
 [21] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12–15, 2002.
 [22] H. Nakajima, G. Ince, K. Nakadai, and Y. Hasegawa, “An easily-configurable robot audition system using histogram-based recursive level estimation,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 958–963.
 [23] P. Pertilä and E. Cakir, “Robust direction estimation with convolutional neural networks based steered response power,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 6125–6129.
 [24] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using timefrequency masks for online/offline ASR in noise,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5210–5214.
 [25] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.
 [26] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
 [27] K. Nakadai, T. Takahashi, H. G. Okuno, H. Nakajima, Y. Hasegawa, and H. Tsujino, “Design and implementation of robot audition system HARK: Open source software for listening to three simultaneous speakers,” Advanced Robotics, vol. 24, no. 5-6, pp. 739–761, 2010.