Small autonomous unmanned vehicles (e.g., quadcopters and ground robots) have revolutionized civilian and military missions by providing a platform for observation and by permitting access to locations that are too dangerous, too difficult, or too costly to send humans. These small vehicles have proven remarkably capable in many applications, such as surveying and mapping, precision agriculture, search and rescue, traffic surveillance, and infrastructure monitoring, to name just a few.
The sensing capability of unmanned vehicles has been enabled by various sensors, such as RGB cameras, infrared cameras, LiDARs, RADARs, and ultrasound sensors. However, these mainstream sensors are subject to either lighting conditions or line-of-sight requirements. On the other end of the spectrum, sound sensors can overcome line-of-sight constraints and, thanks to their omnidirectional nature, provide a more efficient way for unmanned vehicles to acquire situational awareness.
Among the sensing tasks for unmanned vehicles, localization is of utmost significance. While vision-based localization techniques have been developed based on cameras, sound source localization (SSL) has been achieved using microphone arrays with different numbers (e.g., 2, 4, 8, 16) of microphones. Although it has been reported that localization accuracy improves as the number of microphones increases [2, 3], this comes at the price of algorithm complexity and hardware cost, especially the expense of the analog-to-digital converters (ADCs), which is proportional to the number of microphone channels. Moreover, arrays with a particular structure (e.g., linear, cubical, circular, etc.) are difficult to control, mount, and maneuver, which makes them unsuitable for use on small vehicles.
Humans and many other animals can locate sound sources with decent accuracy and responsiveness using their two ears, aided by head rotations to avoid ambiguity (i.e., the cone of confusion). Recently, SSL techniques based on a self-rotating bi-microphone array have been reported in the literature [5, 6, 7, 8, 9]. Single-sound-source localization (SSSL) techniques have been well studied using different numbers of microphones, while many reported multi-sound-source localization (MSSL) techniques require large microphone arrays with specific structures, preventing them from being mounted on small robots. Pioneering work on MSSL assumed the number of sources to be known beforehand [10, 11]. Some of these approaches [12, 13, 14] are based on sparse component analysis (SCA), which requires the sources to be W-disjoint orthogonal (i.e., in some time-frequency components, at most one source is active), thereby making them unsuitable for reverberant environments. Pavlidi et al. and Loesch et al. presented SCA-based methods to count and localize multiple sound sources, but these require one sound source to be dominant over the others in a time-frequency zone. Clustering methods have also been used to conduct MSSL [13, 12, 14]. Catalbas et al. presented an approach for MSSL that deploys four microphones at the corners of a room and requires the sound sources to be present within its boundary. The technique was limited to localizing sound orientations in the two-dimensional plane using K-medoids clustering, and the number of sound sources was calculated using the exhaustive elbow method, which is intuitive but computationally expensive. Traa et al. presented an approach that converts the time delay between the microphones into the frequency domain so as to model the phase differences in each frequency bin of the short-time Fourier transform. Due to the linear relationship between phase difference and frequency, the data were then clustered using random sample consensus (RANSAC). In our previous work [9, 8], we developed an SSSL technique based on an extended Kalman filter and an MSSL technique based on a cross-correlation approach, which was computationally expensive.
The contributions of this paper include two novel MSSL approaches for identifying the number of sound sources as well as localizing them in a three-dimensional (3D) environment. The rotation of the bi-microphone array generates an interaural time difference (ITD) signal whose data points form multiple discontinuous sinusoidal waveforms. In the first approach, a novel mapping mechanism is developed to convert the acquired ITD signal to an orientation domain. An unsupervised classification is then conducted using density-based spatial clustering of applications with noise (DBSCAN). DBSCAN is one of the most popular nonlinear clustering techniques. It can discover arbitrarily shaped clusters of densely grouped points in a data set and outperforms other clustering methods in the literature [21, 22].
The second novel approach for MSSL performs a sinusoidal regression on the ITD signal using a RANSAC-based method. Each of the sine waves in the ITD signal corresponds to a single sound source. The data points associated with each sine wave are separated by performing a repeated sinusoidal regression using RANSAC. After a model is fitted in an iteration, the associated data points are removed from the ITD signal before the next iteration starts. The azimuth and elevation angles of the sound source are then determined for each fitted model. Finally, a threshold on the number of qualifying data points is selected to determine the number of sound sources.
Both simulations and experiments were conducted to test the two proposed approaches. The results show that both approaches are capable of correctly identifying the number of sound sources and their 3D orientations in terms of azimuth and elevation angles. However, the RANSAC-based approach outperforms the DBSCAN-based approach in identifying the number of sound sources, while the DBSCAN-based approach outperforms the RANSAC-based approach in localization accuracy.
The rest of the paper is organized as follows. In Section II, the mathematical model for the ITD signal generated by the self-rotating microphone array is presented. In Section III, the mapping mechanism for regression is presented. Section IV presents the localization algorithm using DBSCAN clustering, and Section V presents the RANSAC-based localization algorithm. Simulation and experimental results are presented and discussed in Section VI, and Section VII concludes the paper.
II-A Interaural Time Difference (ITD)
Consider a single stationary sound source and two spatially separated microphones placed in an environment. Let $x_1(t)$ and $x_2(t)$ be the sound signals captured by the microphones in the presence of noise, which are given by

$x_1(t) = s(t) + n_1(t),$
$x_2(t) = \alpha\, s(t + \tau) + n_2(t),$

where $s(t)$ is the sound signal, $n_1(t)$ and $n_2(t)$ are real and jointly stationary random processes, $\tau$ denotes the time difference of arrival at the two microphones, and $\alpha$ is the signal attenuation factor due to the different traveling distances. It is commonly assumed that $s(t)$ changes slowly and is uncorrelated with the noises $n_1(t)$ and $n_2(t)$. Figure 1 shows the process of ITD estimation between the signals $x_1(t)$ and $x_2(t)$, where the blocks ahead of the correlator could be scaling functions or pre-filters, which eliminate or reduce the effect of background noise and reverberation using various techniques [26, 27, 28, 29].
The cross-correlation function of $x_1(t)$ and $x_2(t)$ is given by

$R_{x_1 x_2}(\tau) = E\big[x_1(t)\, x_2(t - \tau)\big],$

where $E[\cdot]$ represents the expectation operator. The time difference of $x_1(t)$ and $x_2(t)$, i.e., the ITD, is given by

$\tau_{\mathrm{ITD}} = \arg\max_{\tau} R_{x_1 x_2}(\tau).$

The distance difference of the sound signal traveling to the two microphones is given by

$\Delta d = c\, \tau_{\mathrm{ITD}},$

where $c$ is the sound speed, usually selected to be 345 m/s on the Earth's surface.
For simplicity, the signal $\tau_{\mathrm{ITD}}$ is referred to as the ITD in this context. The ITD is the only cue used in this paper for source counting and localization, and it is generated without using any of the scaling functions or pre-filters mentioned above.
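As an illustration of the plain cross-correlation estimator described above (with no pre-filters applied), the following numpy sketch recovers a known sample delay between two channels; the signal content, sampling rate, and delay are arbitrary choices for the example:

```python
import numpy as np

def estimate_itd(x1, x2, fs):
    """Estimate the interaural time difference between two microphone
    signals as the cross-correlation peak; positive when x2 lags x1."""
    corr = np.correlate(x1, x2, mode="full")
    # Index (len(x2) - 1) of the "full" output corresponds to zero lag.
    delay_samples = (len(x2) - 1) - int(np.argmax(corr))
    return delay_samples / fs

# Synthetic check: the second channel is the first delayed by 5 samples.
fs = 8000
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
x2 = np.concatenate([np.zeros(5), s[:-5]])
itd = estimate_itd(s, x2, fs)
```

In practice the peak would be picked from the generalized cross-correlation of windowed, possibly pre-filtered signals; this sketch only shows the lag convention.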
II-B Mathematical Model for the ITD Signal
Before discussing the multi-source ITD signal collected by the self-rotating bi-microphone array, the single source ITD signal is first modeled. In this paper, the location of a single sound source is defined in a spherical coordinate frame, whose origin is assumed to coincide with the center of a ground robot.
As shown in Figures 2 and 3, the left and right microphones collect the acoustic signal generated by the sound source $S$. Let $O$ be the center of the robot as well as of the bi-microphone array. The sound source location is represented by $(r, \theta, \phi)$, where $r$ is the distance between the source and the center of the robot, i.e., the length of segment $OS$, $\theta$ is the elevation angle defined as the angle between $OS$ and the horizontal plane, and $\phi \in [0, 2\pi)$ is the azimuth angle defined as the angle measured clockwise from the robot heading vector to the horizontal projection of $OS$. Letting the unit vector $\mathbf{n}$ be the orientation (heading) of the microphone array, $\phi_r$ be the angle between the robot heading and $\mathbf{n}$, and $\phi_s$ be the angle between $\mathbf{n}$ and the horizontal projection of $OS$, both following a right-hand rotation rule, we have

$\phi = \phi_r + \phi_s. \quad (3)$
To avoid the cone of confusion in SSL, the binaural microphone array needs to be rotated with a nonzero angular velocity. Without loss of generality, in this paper we assume a clockwise rotation of the microphone array on the horizontal plane while the robot itself does not rotate throughout the entire estimation process, which implies that the azimuth angle of a stationary source is constant.
The initial heading of the microphone array is configured to coincide with the heading of the robot, which implies that $\phi_r(0) = 0$. As the microphone array rotates clockwise with a constant angular velocity $\omega$, we have $\phi_r(t) = \omega t$, and due to Equation (3) we have

$\phi_s(t) = \phi - \omega t. \quad (4)$

The resulting time-varying ITD due to Equation (4) is then given by

$\tau(t) = \frac{d}{c} \cos\theta \, \sin\big(\phi_s(t)\big), \quad (5)$

where $d$ is the distance between the two microphones. Because the microphone array rotates on the horizontal plane, $\theta$ does not change during the rotation for a stationary sound source. The resulting $\tau(t)$ is a sinusoidal signal with the amplitude $\frac{d}{c}\cos\theta$, which implies that

$\tau(t) = \frac{d}{c} \cos\theta \, \sin(\phi - \omega t). \quad (6)$

It can be seen from Equation (6) that the phase angle of $\tau(t)$ is the azimuth angle $\phi$ of the sound source. Therefore, the localization of a stationary sound source equates to the identification of the characteristics (i.e., the amplitude and phase angle) of the sinusoidal signal $\tau(t)$.
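Once the amplitude and phase of the sinusoid are recovered, the orientation follows directly. A minimal sketch, assuming the amplitude-elevation relation above (amplitude $\frac{d}{c}\cos\theta$) and illustrative values for the microphone spacing `d` and sound speed `c`:

```python
import math

def orientation_from_sine(A, phi0, d=0.2, c=345.0):
    """Recover the elevation angle from the ITD amplitude A = (d/c)*cos(theta);
    the phase phi0 of the fitted sinusoid is taken as the azimuth estimate."""
    cos_theta = min(1.0, A * c / d)   # clamp against noise-induced overshoot
    theta = math.acos(cos_theta)
    return theta, phi0

# A source at elevation 0.6 rad produces an ITD amplitude of (d/c)*cos(0.6).
theta, phi = orientation_from_sine((0.2 / 345.0) * math.cos(0.6), 1.2)
```

The clamp matters in practice: measurement noise can push the estimated amplitude slightly above the physical maximum $d/c$, which would otherwise make `acos` fail.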
The ITD signal collected for multiple sound sources (as shown in Figure 11) consists of multiple discontinuous sinusoidal waveforms, each corresponding to a single sound source and satisfying the amplitude-elevation and phase-azimuth relationships mentioned above.
III Model for Mapping and Sinusoidal Regression
The signal $\tau(t)$ in Equation (6) is sinusoidal, with an amplitude determined by the elevation angle and a phase angle that corresponds to the azimuth angle of the sound source. Since the frequency $\omega$ of $\tau(t)$ is the known rotational speed of the microphone array, the localization task (i.e., identifying $\theta$ and $\phi$) reduces to estimating the amplitude and phase angle of $\tau(t)$. Consider a general form of $\tau(t)$ expressed as

$\tau(t) = a \sin(\omega t) + b \cos(\omega t), \quad (7)$

where $a = A\cos\phi_0$ and $b = A\sin\phi_0$, and we have

$A = \sqrt{a^2 + b^2}, \qquad \phi_0 = \operatorname{atan2}(b, a).$

Consider two data points, $(t_1, \tau_1)$ and $(t_2, \tau_2)$, collected at two distinct time instants $t_1$ and $t_2$, respectively; we then have

$\begin{bmatrix} \sin(\omega t_1) & \cos(\omega t_1) \\ \sin(\omega t_2) & \cos(\omega t_2) \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix},$

which is solvable for $a$ and $b$ whenever $\omega(t_1 - t_2) \neq k\pi$, where $k$ is an integer.
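The two-point fit described above amounts to solving a 2x2 linear system for the known frequency $\omega$; a minimal sketch:

```python
import math

def fit_sine_two_points(t1, z1, t2, z2, omega):
    """Solve z = a*sin(omega*t) + b*cos(omega*t) from two samples and
    return the amplitude A and phase phi0 of z = A*sin(omega*t + phi0)."""
    s1, c1 = math.sin(omega * t1), math.cos(omega * t1)
    s2, c2 = math.sin(omega * t2), math.cos(omega * t2)
    det = s1 * c2 - c1 * s2          # nonzero unless omega*(t1 - t2) = k*pi
    a = (z1 * c2 - c1 * z2) / det    # Cramer's rule for the 2x2 system
    b = (s1 * z2 - z1 * s2) / det
    # a = A*cos(phi0) and b = A*sin(phi0)
    return math.hypot(a, b), math.atan2(b, a)

# Two noise-free samples of 1.5*sin(2t + 0.7) recover (A, phi0) = (1.5, 0.7).
A, phi0 = fit_sine_two_points(0.1, 1.5 * math.sin(2 * 0.1 + 0.7),
                              0.9, 1.5 * math.sin(2 * 0.9 + 0.7), 2.0)
```
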
IV DBSCAN-Based MSSL
In the DBSCAN algorithm, a random point from the data set is considered a core cluster point when at least $minPts$ points (including itself) exist within a distance $\epsilon$ of it (the epsilon ball). The cluster is then extended by checking all of the other points against the $\epsilon$ and $minPts$ criteria, thereby letting the cluster grow. A new arbitrary point is then chosen and the process is repeated. A point that is not part of any cluster and has fewer than $minPts$ points in its epsilon ball is considered a "noise point". The DBSCAN technique is well suited to applications with noise and performs better than the K-means method, which requires prior knowledge of the number and approximate centroid locations of the clusters and can also fail in the presence of noisy data points.
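The core-point expansion just described can be sketched in a few lines; this is an illustrative, unoptimized 2-D implementation, not the paper's Algorithm 1:

```python
def dbscan(points, eps, min_pts):
    """Minimal 2-D DBSCAN: returns one cluster label per point, -1 = noise."""
    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (qx - px) ** 2 + (qy - py) ** 2 <= eps * eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:        # not a core point: provisional noise
            labels[i] = -1
            continue
        labels[i] = cluster          # start a new cluster at core point i
        seeds = list(nb)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:      # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is also a core point: keep expanding
                seeds.extend(nb_j)
        cluster += 1
    return labels

# Two dense groups of mapped orientation points plus one isolated outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

The number of distinct non-noise labels gives the source count, which is how the clustering output is used in the proposed technique.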
The DBSCAN-based MSSL technique consists of two stages. In the first stage, the data points of the ITD signal are mapped to the orientation domain. The data set consisting of all the data points in the multi-source ITD signal contains not only inliers but also outliers, which produce undesired mapped locations. When the number of inliers is significantly greater than that of the outliers, highly dense clusters form after a number of iterations. In the second stage, these clusters are detected using the DBSCAN technique by carefully selecting the parameters $minPts$ and $\epsilon$. The number of clusters corresponds to the number of sound sources, and the centroids of these clusters represent the locations of the sound sources.
The complete DBSCAN-based MSSL algorithm is described in Algorithm 1. Two points in the data set are selected randomly and mapped into the orientation domain by calculating the angles $\theta$ and $\phi$ using Equations (6), (13), (7) and (14). A set of these mapped points is then created, and the process of cluster detection starts: a point in the set is randomly chosen and is determined to be either a core cluster point or a noise point by checking the density-reachability criterion under the $\epsilon$-$minPts$ condition. The time complexity of Algorithm 1 grows with the number of iterations used for mapping and clustering, which needs to be selected large enough for the algorithm to work efficiently.
V RANSAC-Based MSSL
The RANSAC algorithm is able to identify the parameters of a mathematical model from a data set that may contain a significantly large number of outliers. The input to the RANSAC algorithm includes a set of data, a parameterized model, and a confidence parameter. In each iteration, a subset of the original data is randomly selected and used to fit the predefined parameterized model. All other data points in the original data set are then tested against the fitted model, and a point is determined to be an inlier of the fitted model if its distance to the model is below a threshold. The process is repeated by selecting another random subset of the data. After a fixed number of iterations, the parameters of the best-fitting estimated model (i.e., the one with the maximum number of inliers) are selected.
The RANSAC-based MSSL method is described in Algorithm 2. It can be seen from Equation (6) that the signal generated by the self-rotating bi-microphone array is sinusoidal. Two points from the ITD signal are selected randomly, and a sine wave with the given frequency (i.e., the angular speed of the rotation) is generated. The inlier count is the number of points whose distance to the fitted sine wave is less than the threshold for a point to be considered an inlier. The points that belong to the fitted model according to this condition are then removed from the data set. This procedure is repeated, with the model parameters updated every time the number of inliers exceeds that of the previous iterations, until either all the points in the data set have been examined or the maximum number of iterations is completed. The time complexity of Algorithm 2 grows with the number of samples in the ITD signal; after the first few iterations, most of the data points have been removed, so the remaining set is very small compared to the original. The number of iterations should be chosen large enough to ensure a high probability that at least one of the sets of randomly selected points contains no outlier.
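The fit-then-peel loop can be sketched as follows. This is an illustrative reimplementation under stated assumptions rather than the paper's Algorithm 2: the inlier threshold `delta`, the acceptance bound `min_inliers`, and the iteration budget are hypothetical parameters:

```python
import math
import random

def ransac_sines(t, z, omega, delta, min_inliers, n_iter=2000, seed=1):
    """Repeatedly fit A*sin(omega*t + phi) via RANSAC, peel off the inliers
    of each accepted model, and stop when no model gathers enough points."""
    data = list(zip(t, z))
    rng = random.Random(seed)
    models = []
    while len(data) >= 2:
        best, best_in = None, []
        for _ in range(n_iter):
            (t1, z1), (t2, z2) = rng.sample(data, 2)
            s1, c1 = math.sin(omega * t1), math.cos(omega * t1)
            s2, c2 = math.sin(omega * t2), math.cos(omega * t2)
            det = s1 * c2 - c1 * s2
            if abs(det) < 1e-9:       # degenerate pair, resample
                continue
            a = (z1 * c2 - c1 * z2) / det   # exact sine through both points
            b = (s1 * z2 - z1 * s2) / det
            inl = [(ti, zi) for ti, zi in data
                   if abs(a * math.sin(omega * ti)
                          + b * math.cos(omega * ti) - zi) < delta]
            if len(inl) > len(best_in):
                best, best_in = (math.hypot(a, b), math.atan2(b, a)), inl
        if best is None or len(best_in) < min_inliers:
            break
        models.append(best)                  # (amplitude, phase)
        data = [p for p in data if p not in best_in]
    return models

# Interleaved samples from two sinusoids, mimicking two sources with
# different elevations (amplitudes 2.0 and 1.0) and azimuths (phases).
t = [0.15 * i for i in range(80)]
z = [2.0 * math.sin(ti + 0.5) if i % 2 == 0 else math.sin(ti + 2.0)
     for i, ti in enumerate(t)]
models = ransac_sines(t, z, omega=1.0, delta=0.05, min_inliers=20)
```

Removing each accepted model's inliers before refitting is what lets a single-sinusoid model handle a multi-source ITD signal.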
The number of sound sources is determined by carefully selecting a threshold, as shown in Figure 5. The confidence in the presence of a sound source depends on its inlier count. The source with the maximum inlier count is considered qualified with 100% confidence, and the confidence values for the other sources are calculated relative to it. A source with a confidence value below the threshold is considered noise and does not qualify as a sound source. Very weak sound signals will have few or no data points in the ITD signal and will be discarded.
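The relative-confidence counting rule can be sketched as below; the threshold value used here is illustrative, not the one selected in the paper:

```python
def count_sources(inlier_counts, threshold=0.35):
    """Source counting by relative confidence: the model with the most
    inliers has confidence 1.0; models below the threshold are noise."""
    if not inlier_counts:
        return 0
    top = max(inlier_counts)
    return sum(1 for n in inlier_counts if n / top >= threshold)
```

For example, with inlier counts [120, 95, 80, 12], the last model has relative confidence 0.1 and is rejected, so three sources are reported.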
VI Simulation and Experimental Results
Table I: Specifications of the simulated environment

| Parameter | Value |
| --- | --- |
| Dimension | 20 m × 20 m × 20 m |
| Reflection coefficient (walls, floor and ceiling) | — |
| Sound speed | 345 m/s |
| Static pressure | 29.92 mmHg |
| Relative humidity | 38 % |
The Audio Array Toolbox is used to establish an emulated rectangular room using the image method described in . The robot was placed at the origin of the room. The sound sources and the microphones are assumed omnidirectional, and the attenuation of the sound is calculated per the specifications in Table I. A number of recorded speech signals available at were used as sound sources to test the technique. Different numbers of sound sources were placed at various azimuth and elevation angles at a fixed distance of 5 m, and the ITD signal was recorded by the rotating bi-microphone array, whose microphone spacing approximates the distance between the ears of a human. The sound sources were separated by a minimum angular spacing in both azimuth and elevation, and the ITD value was calculated and recorded at a fixed angular interval of the rotation. Noise was added to this ITD signal in the simulations to account for sensor noise. These simulations were run on a high-performance cluster named Joker.
Numerous experiments were also conducted using a robotic platform in an indoor environment, as shown in Figure 6. Figure 7 shows the impulse response of the room. The sound sources were kept at distances of a few meters from the center of the robot.
Table II: Parameters of the proposed algorithms for simulations and experiments.
Tables III and IV: Mean absolute error (MAE) of the estimates in simulations and experiments for different numbers of sources.
Figure 8 shows four sound sources in the simulation placed at different azimuth and elevation angles at a fixed distance from the robot at the origin. Figure 9 shows the estimation of the parameters $a$ and $b$ for every two points chosen from the multi-source signal at each iteration. The parameters used for the RANSAC-based and DBSCAN-based algorithms are listed in Table II. The value of $\epsilon$ was chosen such that all sound sources are assumed to be separated by a minimum angular distance from each other. Tables III and IV show the simulation and experimental results of localization with the number of sound sources varying from one to four.
Monte Carlo simulation runs were performed using the two proposed approaches, respectively, with the specifications given in Table I. The simulations were run with multiple sources, and the results of the source counting are listed in Table V.
The clustering result using DBSCAN is shown in Figure 10.
Figure 11 shows the simulation result of a sample run in which three sources were detected at distinct orientations, with the RANSAC-based algorithm estimating their locations accordingly. Since the ITD signal is noisy, any point sufficiently close to one of the fitted sine waves was chosen by the RANSAC algorithm to lie on the ITD. The inlier threshold can be chosen depending on how close the sound sources may be to each other and on the signal-to-noise ratio (SNR) of the measured signal. For a source to be considered a qualified sound source, a confidence threshold was selected, with different values used in simulation and in the experiments.
As shown in Figure 12, the average error of orientation localization with the DBSCAN-based algorithm is smaller than that of the RANSAC-based algorithm, which, however, generates comparatively more accurate results for source counting, as shown in Figure 13. In both simulations and experiments, the error of elevation angle estimation was found to be large for sources kept close to zero elevation, which coincides with the conclusion in . The performance of the localization and source counting using both techniques improves significantly as the number of rotations of the bi-microphone array increases. The sound sources are assumed to be active during the rotation of the bi-microphone array, with possible pauses, such as in the case of speech signals.
VII Conclusion

Two novel techniques are presented for small autonomous unmanned vehicles (SAUVs) to perform multi-sound-source localization (MSSL) using a self-rotating bi-microphone array. The DBSCAN-based MSSL approach iteratively maps randomly chosen points in the ITD signal to the orientation domain, producing a data set for clustering. These clusters are detected using density-based spatial clustering of applications with noise (DBSCAN). The number of clusters gives the number of sound sources, and the centroids of these clusters determine the locations of the sound sources. The second proposed technique uses random sample consensus (RANSAC) to iteratively estimate the parameters of a model from two randomly chosen data points of the ITD signal, and then uses a threshold to decide among the qualifying sound sources. The simulation and experimental results show the effectiveness of both approaches in identifying the number and the orientations of the sound sources.
-  J. Borenstein, H. Everett, and L. Feng, Navigating mobile robots: systems and techniques. A K Peters Ltd., 1996.
-  D. V. Rabinkin, “Optimum sensor placement for microphone arrays,” Ph.D. dissertation, RUTGERS The State University of New Jersey - New Brunswick, 1998.
-  M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications. Springer Science & Business Media, 2013.
-  H. Wallach, “On sound localization,” The Journal of the Acoustical Society of America, vol. 10, no. 4, pp. 270–274, 1939.
-  S. Lee, Y. Park, and Y.-s. Park, “Three-dimensional sound source localization using inter-channel time difference trajectory,” International Journal of Advanced Robotic Systems, vol. 12, no. 12, p. 171, 2015.
-  A. A. Handzel and P. Krishnaprasad, “Biomimetic sound-source localization,” IEEE Sensors Journal, vol. 2, no. 6, pp. 607–616, 2002.
-  G. H. Eriksen, “Visualization tools and graphical methods for source localization and signal separation,” Master’s thesis, University of Oslo, Department of Informatics, 2006.
-  X. Zhong, W. Yost, and L. Sun, “Dynamic binaural sound source localization with ITD cues: Human listeners,” The Journal of the Acoustical Society of America, vol. 137, no. 4, pp. 2376–2376, 2015.
-  D. Gala, N. Lindsay, and L. Sun, “Three-dimensional sound source localization for unmanned ground vehicles with a self-rotational two-microphone array,” in Proceedings of the 5th International Conference of Control, Dynamic Systems, and Robotics (CDSR’18), June 2018.
-  L. Sun and Q. Cheng, “Indoor multiple sound source localization using a novel data selection scheme,” in 48th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2014, pp. 1–6.
-  X. Zhong, L. Sun, and W. Yost, “Active binaural localization of multiple sound sources,” Robotics and Autonomous Systems, vol. 85, pp. 83–92, 2016.
-  C. Blandin, A. Ozerov, and E. Vincent, “Multi-source TDOA estimation in reverberant audio using angular spectra and clustering,” Signal Processing, vol. 92, no. 8, pp. 1950–1960, 2012.
-  M. Swartling, B. Sällberg, and N. Grbić, “Source localization for multiple speech sources using low complexity non-parametric source separation and clustering,” Signal Processing, vol. 91, no. 8, pp. 1781–1788, 2011.
-  T. Dong, Y. Lei, and J. Yang, “An algorithm for underdetermined mixing matrix estimation,” Neurocomputing, vol. 104, pp. 26–34, 2013.
-  O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
-  D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193–2206, 2013.
-  B. Loesch and B. Yang, “Source number estimation and clustering for underdetermined blind source separation,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2008.
-  M. C. Catalbas and S. Dobrisek, “3D moving sound source localization via conventional microphones,” Elektronika ir Elektrotechnika, vol. 23, no. 4, pp. 63–69, 2017.
-  J. Traa and P. Smaragdis, “Blind multi-channel source separation by circular-linear statistical modeling of phase differences,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 4320–4324.
-  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.
-  C. D. Raj, “Comparison of K means K medoids DBSCAN algorithms using DNA microarray dataset,” International Journal of Computational and Applied Mathematics (IJCAM), 2017.
-  N. Farmani, L. Sun, and D. J. Pack, “A scalable multitarget tracking system for cooperative unmanned aerial vehicles,” IEEE Transactions on Aerospace and Electronic Systems, vol. 53, no. 4, pp. 1947–1961, Aug 2017.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
-  C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Aug 1976.
-  M. Azaria and D. Hertz, “Time delay estimation by generalized cross correlation methods,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 280–285, 1984.
-  P. Naylor and N. D. Gaubitch, Speech dereverberation. Springer Science & Business Media, 2010.
-  A. Spriet, L. Van Deun, K. Eftaxiadis, J. Laneau, M. Moonen, B. Van Dijk, A. Van Wieringen, and J. Wouters, “Speech understanding in background noise with the two-microphone adaptive beamformer beam in the nucleus freedom cochlear implant system,” Ear and hearing, vol. 28, no. 1, pp. 62–72, 2007.
-  D. R. Gala, A. Vasoya, and V. M. Misra, “Speech enhancement combining spectral subtraction and beamforming techniques for microphone array,” in Proceedings of the International Conference and Workshop on Emerging Trends in Technology (ICWET), 2010, pp. 163–166.
-  D. R. Gala and V. M. Misra, “SNR improvement with speech enhancement techniques,” in Proceedings of the International Conference and Workshop on Emerging Trends in Technology (ICWET). ACM, 2011, pp. 163–166.
-  “International Organization for Standardization (ISO), British, European and International Standards (BSEN), Noise emitted by machinery and equipment – Rules for the drafting and presentation of a noise test code,” 12001: 1997 Acoustics.
-  K. D. Donohue, “Audio array toolbox,” [Online] Available: http://vis.uky.edu/distributed-audio-lab/about/ , 2017, Dec 22.
-  J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
-  K. D. Donohue, “Audio systems lab experimental data - single-track single-speaker speech,” [Online] Available: http://web.engr.uky.edu/~donohue/audio/Data/audioexpdata.htm , 2018, Feb 10.
-  New Mexico State University, “High performance cluster, Joker,” [Online] Available: https://hpc.nmsu.edu/ , 2017, Feb 10.