Sound Event Detection (SED) is an important machine listening task, which aims to automatically recognize, label, and estimate the position in time of sound events in a continuous audio signal. It is a popular research topic because of its many real-world applications, such as home care, surveillance, environmental monitoring and urban traffic control, to name just a few. The successful Detection and Classification of Acoustic Scenes and Events (DCASE) challenges [19, 20] now provide the community with datasets and baselines for a number of tasks related to SED. However, most of the effort so far has concentrated on the classification and detection of sound events in time only, with little work on the robust localization of sound events in space.
Early approaches for SED are strongly inspired by speech recognition systems, using mel-frequency cepstral coefficients (MFCCs) with Gaussian Mixture Models (GMMs) combined with Hidden Markov Models (HMMs) [13, 10]. Methods based on dictionary learning, mainly Non-negative Matrix Factorization (NMF), are also considered prominent solutions for the SED task [6, 17, 9]. Deep learning methods have since been applied to this task [24, 33]. The prevailing architectures used for SED are Convolutional Neural Networks (CNNs), which are particularly successful in computer vision tasks. Other common approaches try to model time relations in the audio signal by using recurrent neural networks (RNNs). Both can be combined in a Convolutional Recurrent Neural Network (CRNN), which achieves state-of-the-art results on several machine listening tasks [16, 2, 1].
On the other hand, sound source localization (SSL) refers to estimating the direction of arrival (DOA) of multiple sound sources. There are two popular categories of SSL methods: 1) high-resolution and 2) steered-response techniques. High-resolution methods include Multiple Signal Classification (MUSIC) and Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT). These approaches, although initially designed for narrowband signals, can be adapted to broadband signals such as speech [15, 23, 28, 4, 7]. Alternatively, the Steered-Response Power Phase Transform (SRP-PHAT) robustly estimates the direction of arrival of speech and other broadband sources. SRP-PHAT relies on the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) between each pair of microphones of a microphone array. It is therefore convenient to estimate the time difference of arrival (TDOA) values for each pair, and combine these results to estimate the direction of arrival for a source [11, 31, 29, 30, 12], which is the approach we choose for this challenge.
In this paper, we propose a system for sound event localization and detection (SELD), which we submitted to Task 3 of the DCASE 2019 Challenge. Motivated by previously reported results, we propose a CRNN architecture that uses both the spectrogram and GCC-PHAT features to perform SED and estimate TDOA. However, since TDOA and SED have different cost functions, we believe they are distinct tasks with different optimal solutions, and we therefore propose to use a separate neural network for each of these two tasks. The results are then combined to generate a final SED decision and estimate the DOA.
2 Sound Event Localization and Detection
The goal of sound event localization and detection (SELD) is to output all instances of the sound events in a recording, with their respective onset and offset times and their spatial locations in azimuth and elevation angles, given a multichannel audio input. An example of such a setup is provided in Task 3 of the DCASE 2019 Challenge. Our system uses the TAU Spatial Sound Events - Microphone Array dataset, which provides four-channel directional microphone recordings from a tetrahedral array configuration. A detailed description of the dataset and the recording procedure is available in the dataset publication. In our approach, we propose to predict events and TDOAs for each pair of microphones, which leads to a total of six pairs.
3 Proposed method
We propose a method based on a combination of two convolutional recurrent neural networks (CRNNs) that share a similar front-end architecture. The first network, referred to here as the SED network, is trained to detect, label and estimate the onsets and offsets of sound events from a pair of microphones. The second network, referred to here as the TDOA network, estimates the TDOA for each pair of microphones and each class of sound events. The SED results of all pairs are then combined and a threshold is applied to make a final decision regarding sound detection for each class. The TDOAs are also combined across all pairs of microphones and a DOA is generated for each class. To obtain a DOA from the TDOA values, each potential DOA is assigned a set of target TDOAs, which are found during an initial calibration procedure. Figure 1 shows the overall architecture of the proposed system. The following subsections describe each building block of the system in detail.
The search space around the microphone array is discretized into a finite set of indexed DOAs. Each DOA is associated with an azimuth and an elevation, which correspond to the discrete angles used when recording the DCASE dataset. The array has four microphones, which gives six microphone pairs. Each DOA also corresponds to a vector of TDOA values, with one value per microphone pair. Each TDOA is bounded in magnitude by a maximum TDOA and takes one of a fixed number of discrete values. Assuming free-field propagation of sound, the microphone array geometry and the speed of sound provide enough information to estimate the TDOA values of each DOA. However, the free-field assumption becomes inaccurate when dealing with a closed microphone array (e.g. when microphones are installed around a filled support), and thus calibration based on the recorded signals is needed and is performed offline.
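Under the free-field assumption described above, the target TDOAs follow directly from geometry. The sketch below is only an illustration of that idea: the microphone coordinates, the speed of sound value, the sign convention and all function names are assumptions, not values taken from the dataset or from the proposed system.

```python
import itertools
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper

# Hypothetical tetrahedral layout in metres; the real coordinates will differ.
MICS = [
    (0.03, 0.03, 0.03),
    (0.03, -0.03, -0.03),
    (-0.03, 0.03, -0.03),
    (-0.03, -0.03, 0.03),
]


def doa_unit_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing from the array toward a far-field source."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def free_field_tdoa(mic_i, mic_j, azimuth_deg, elevation_deg, c=SPEED_OF_SOUND):
    """TDOA in seconds for a plane wave, assuming free-field propagation.

    Positive when the wavefront reaches microphone i before microphone j
    (this sign convention is an assumption of the sketch).
    """
    u = doa_unit_vector(azimuth_deg, elevation_deg)
    baseline = [a - b for a, b in zip(mic_i, mic_j)]
    # The path-length difference is the projection of the baseline on the DOA.
    return sum(b * ui for b, ui in zip(baseline, u)) / c


def tdoa_vector(azimuth_deg, elevation_deg, mics=MICS):
    """One TDOA per microphone pair (six pairs for four microphones)."""
    pairs = itertools.combinations(range(len(mics)), 2)
    return [free_field_tdoa(mics[i], mics[j], azimuth_deg, elevation_deg)
            for i, j in pairs]
```

For a closed array these geometric values are only a starting point, which is why the calibration from recorded signals is needed.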
Each microphone signal is represented by its Short-Time Fourier Transform (STFT) coefficients, which are indexed by frame, microphone and frequency bin. Only a subset of the frequency bins is used, so that the spectral content spans a limited frequency interval that depends on the frame size and the sample rate. For each microphone pair and each DOA, the complex cross-spectrum is computed from the STFT coefficients of the two microphones, one of which is complex-conjugated, over the set of frames where a single source is active at that DOA. The Generalized Cross-Correlation with Phase Transform (GCC-PHAT) is then computed from this cross-spectrum, and the TDOA value for the pair and DOA is estimated from the GCC-PHAT.
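The chain from STFT coefficients to a TDOA estimate can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the cross-spectrum is accumulated by summing over frames, the TDOA is taken as the integer lag that maximizes the GCC-PHAT, and the function names, the `max_lag` bound and the small constant `eps` are hypothetical choices rather than the authors' implementation.

```python
import numpy as np


def cross_spectrum(stft_i, stft_j):
    """Sum X_i * conj(X_j) over frames; inputs have shape (frames, bins)."""
    return np.sum(stft_i * np.conj(stft_j), axis=0)


def gcc_phat(cross, n_fft, eps=1e-12):
    """GCC-PHAT from a one-sided cross-spectrum with n_fft // 2 + 1 bins.

    Returns the correlation for lags -n_fft/2 ... n_fft/2 - 1 (n_fft even).
    """
    phat = cross / (np.abs(cross) + eps)   # keep the phase, drop the magnitude
    correlation = np.fft.irfft(phat, n=n_fft)
    return np.fft.fftshift(correlation)    # put lag 0 in the middle


def estimate_tdoa(stft_i, stft_j, n_fft, max_lag):
    """Lag in samples, within +/- max_lag, that maximizes the GCC-PHAT.

    With this convention the result is the delay of signal i relative to
    signal j: it is negative when signal j is the delayed one.
    """
    correlation = gcc_phat(cross_spectrum(stft_i, stft_j), n_fft)
    lags = np.arange(-(n_fft // 2), n_fft // 2)
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(correlation[keep])])
```

Restricting the search to `max_lag` mirrors the idea that a TDOA cannot exceed the maximum value allowed by the array geometry.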
Since there is a limited number of sound events per DOA in the training dataset, the estimated TDOAs can be noisy. To cope with this limitation, we apply a polynomial fitting method, with the polynomial order found empirically. For each discrete elevation angle, the TDOAs associated with the discrete azimuths vary smoothly. Therefore, for each pair and elevation, we concatenate the estimated TDOAs three times to create a signal that spans three periods of the azimuth range and avoids the discontinuities observed at the boundaries of the initial range. A first polynomial fitting is then performed, and the outliers are removed prior to performing a second fitting, which finally provides the estimated TDOA for each DOA and each pair.
Figure 2 shows an example of the proposed method and how it deals effectively with outliers. Note that once the polynomial coefficients are obtained, the TDOAs are only evaluated in the region of interest, which is the initial azimuth range.
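The two-pass fitting described above can be sketched with NumPy. The polynomial order, the outlier rule (residual larger than a multiple of the residual standard deviation) and the function name are assumptions made for this illustration; the sketch also uses a Chebyshev basis, which represents the same polynomial as a power basis but is numerically better behaved.

```python
import numpy as np


def fit_tdoas(azimuths_deg, tdoas, order=16, outlier_factor=2.0):
    """Smooth the TDOAs of one microphone pair at one elevation."""
    az = np.asarray(azimuths_deg, dtype=float)
    td = np.asarray(tdoas, dtype=float)

    # Repeat the data over three periods so the fit sees no wrap-around edge
    # inside the original azimuth range.
    az3 = np.concatenate([az - 360.0, az, az + 360.0])
    td3 = np.tile(td, 3)

    # First fit on all points.
    first = np.polynomial.Chebyshev.fit(az3, td3, order)
    residual = td3 - first(az3)

    # Discard points that sit far from the first fit (assumed outlier rule).
    keep = np.abs(residual) <= outlier_factor * np.std(residual)

    # Second fit on the remaining points.
    second = np.polynomial.Chebyshev.fit(az3[keep], td3[keep], order)

    # Evaluate only in the region of interest: the original azimuths.
    return second(az)
```

Evaluating only the middle copy keeps the result away from the ends of the tripled signal, where polynomial fits are least reliable.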
3.2 Neural network architecture
The main building blocks of our system are two CRNNs that share a similar front-end architecture, as shown in Fig. 3.
Each network consists of two branches. The first is a series of convolutional layers (CNN) that process the log amplitude and phase of the instantaneous complex cross-spectrum (as in (1)) between the two microphones of a pair. In parallel, GCC-PHAT features (as in (2), but for a single frame) are fed into a branch that consists of two feed-forward layers. The outputs of the two branches are concatenated and passed to a Bidirectional Gated Recurrent Unit (Bi-GRU) layer. The resulting vector is considered a task-dependent embedding of the input data. The embedding is passed to two feed-forward layers, followed by an activation function that depends on the task of the network.
The SED network is trained in a supervised manner using SED labels, i.e. information about the onset, offset and label of each sound event. As the SED task can be framed as a multi-label classification of time frames, we use binary cross-entropy as the loss function of the network. A sigmoid activation function outputs a probability between 0 and 1 for each class and each time frame.
The TDOA network is trained on TDOA labels for each pair of microphones. The problem of TDOA estimation is formulated as a regression. Hence, the Mean Squared Error (MSE) loss is used to train the network. Similarly to the SED network, this network consists of CNN and GRU layers followed by an activation function, in this case a hyperbolic tangent (tanh) that is scaled to match the bounded range in which the TDOA value lies. Note that the TDOA is only estimated over segments (i.e. audio samples for a given time interval) where the corresponding sound event is active according to the reference labels, as proposed in previous work.
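The two output stages and their losses can be illustrated in plain Python. This is a sketch under stated assumptions, not the networks themselves: the function names, the scaling of tanh by a maximum TDOA `tau_max`, the clipping constant `eps` and the use of a boolean activity mask to restrict the MSE to active frames are illustrative choices.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over (frame, class) entries."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def scaled_tanh(x, tau_max):
    """Map a raw output to a TDOA within [-tau_max, tau_max]."""
    return tau_max * math.tanh(x)


def masked_mse(predictions, targets, active):
    """MSE over the entries where the reference event is active."""
    errors = [(p - t) ** 2 for p, t, a in zip(predictions, targets, active) if a]
    return sum(errors) / len(errors) if errors else 0.0
```

The mask means that inactive frames contribute nothing to the regression loss, which matches the idea of estimating the TDOA only where the reference labels mark the event as active.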
Both networks are trained separately on all pairs of microphones, using fixed-duration segments selected randomly from the training dataset, and the Adam optimizer with a fixed learning rate and batch size. Training is stopped when no further improvement is observed on the validation set, that is, after 120,000 segments for one network and 160,000 segments for the other.
3.3 Event detection
The SED network returns a value for each pair of microphones and each class. These values are summed over all pairs for each class and normalized by the number of pairs, which yields an average score per class. An event from a given class is then considered to be detected at a given frame if this average score exceeds a class-specific threshold.
A post-filter finally ensures that each sound event lasts a minimum number of frames, to avoid false detections of sporadic events. For evaluation purposes, the event activity is usually defined for a given segment, which holds a set of frames. The estimated event is then said to be active in the segment if at least one frame within the segment indicates that the event is active.
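The detection steps above (averaging over pairs, thresholding, minimum-duration filtering and segment-level activity) can be sketched for a single class as follows. The function names and the data layout (`pair_scores[p][t]` for pair `p` and frame `t`) are assumptions made for this illustration.

```python
def average_over_pairs(pair_scores):
    """Average the per-frame scores of one class over all microphone pairs."""
    n_pairs = len(pair_scores)
    return [sum(frame) / n_pairs for frame in zip(*pair_scores)]


def apply_threshold(scores, threshold):
    """Frame-level decisions for one class with its own threshold."""
    return [s > threshold for s in scores]


def drop_short_events(active, min_frames):
    """Clear runs of active frames that are shorter than min_frames."""
    result = list(active)
    start = None
    # The trailing False acts as a sentinel that closes a run at the end.
    for t, flag in enumerate(list(active) + [False]):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            if t - start < min_frames:
                for k in range(start, t):
                    result[k] = False
            start = None
    return result


def segment_activity(active, frames_per_segment):
    """A segment is active if at least one of its frames is active."""
    return [any(active[s:s + frames_per_segment])
            for s in range(0, len(active), frames_per_segment)]
```

For example, with two pairs, a threshold of 0.5 and a minimum duration of two frames, an isolated single-frame detection is removed before the segment-level activity is computed.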
3.4 DOA estimation
Similarly to the SED network, the TDOA network returns an estimated TDOA for each class and pair of microphones at each frame. For each candidate DOA, the estimated TDOAs are compared to the theoretical values obtained from polynomial fitting during the calibration step. A Gaussian kernel with a fixed variance then generates a value close to its maximum when both TDOAs are close to each other, whereas this value goes to zero when the difference increases. All DOAs are scanned for each class, and the one that returns the maximum sum corresponds to the estimated DOA index. The estimated DOAs for all classes are then concatenated.
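The DOA search described above can be sketched as a scan over candidate DOAs. In this illustration the kernel is an unnormalized Gaussian and the sum runs over microphone pairs; both choices, as well as the function names and the layout of `targets` (one list of per-pair TDOAs per candidate DOA), are assumptions.

```python
import math


def doa_score(estimated_tdoas, target_tdoas, sigma):
    """Sum of Gaussian kernel values over microphone pairs."""
    return sum(
        math.exp(-((est - tgt) ** 2) / (2.0 * sigma ** 2))
        for est, tgt in zip(estimated_tdoas, target_tdoas)
    )


def estimate_doa_index(estimated_tdoas, targets, sigma):
    """Index of the candidate DOA whose target TDOAs best match the estimates."""
    scores = [doa_score(estimated_tdoas, tgt, sigma) for tgt in targets]
    return max(range(len(targets)), key=lambda q: scores[q])
```

A small `sigma` makes the search selective but sensitive to TDOA estimation errors, whereas a large `sigma` is more tolerant but blurs neighbouring DOAs together.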
The proposed system is evaluated on the DCASE 2019 development dataset. This set is divided into 4 cross-validation splits of 100 one-minute recordings each. Table 1 lists the parameters used in the experiments. The sample rate and the number of microphones match the DCASE dataset parameters. The frame size corresponds to 43 ms, which allows a good trade-off between time and frequency resolutions. The hop size sets the spacing between frames, which corresponds to the hop length used for evaluation in the actual challenge. The range of frequency bins is set to cover frequencies up to 12 kHz (excluding the DC component), which is where most of the sound event energy lies. The minimum number of frames is chosen to enforce a minimum sound event duration, and the standard deviation of the Gaussian kernel is found empirically to provide a good DOA resolution. The maximum value for a TDOA is set such that it includes all possible TDOA values for the actual array geometry. Finally, the neural network hyperparameters are found empirically from the performance observed on the validation set. Also note that the event thresholds are found empirically by scanning a range of values and selecting the thresholds that lead to the best event detection metrics on the validation set.
To evaluate the performance of the system, events are defined for segments of 1 sec. For each segment, the number of true positives is the number of correctly estimated events with respect to the reference event activity. Similarly, the number of false negatives and the number of false positives are obtained by comparing the estimated and reference event activities, and the total number of active events is also computed. Substitutions, deletions and insertions are then defined from these counts.
This leads to the error rate (ER) and F1-score (F) metrics.
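The segment-based metrics can be sketched as follows, assuming the standard definitions used for polyphonic sound event detection: per segment, substitutions are min(FN, FP), deletions are max(0, FN - FP) and insertions are max(0, FP - FN); the error rate divides their totals by the total number of reference events, and the F-score is 2TP / (2TP + FP + FN). The function names and the data layout (one list of per-class booleans per segment) are illustrative assumptions.

```python
def segment_counts(reference, estimate):
    """True positives, false negatives, false positives and reference count."""
    tp = sum(1 for r, e in zip(reference, estimate) if r and e)
    fn = sum(1 for r, e in zip(reference, estimate) if r and not e)
    fp = sum(1 for r, e in zip(reference, estimate) if e and not r)
    n_ref = sum(1 for r in reference if r)
    return tp, fn, fp, n_ref


def error_rate_and_f_score(reference_segments, estimated_segments):
    """Segment-based error rate (ER) and F-score over all segments."""
    tp_sum = fn_sum = fp_sum = n_sum = 0
    subs = dels = ins = 0
    for reference, estimate in zip(reference_segments, estimated_segments):
        tp, fn, fp, n_ref = segment_counts(reference, estimate)
        tp_sum, fn_sum, fp_sum, n_sum = tp_sum + tp, fn_sum + fn, fp_sum + fp, n_sum + n_ref
        subs += min(fn, fp)          # a miss paired with a false alarm
        dels += max(0, fn - fp)      # remaining misses
        ins += max(0, fp - fn)       # remaining false alarms
    er = (subs + dels + ins) / n_sum if n_sum else 0.0
    denom = 2 * tp_sum + fp_sum + fn_sum
    f_score = 2 * tp_sum / denom if denom else 0.0
    return er, f_score
```

With three classes and three segments, a reference of `[[1,0,0],[1,1,0],[0,0,0]]` against an estimate of `[[1,0,0],[1,0,1],[0,1,0]]` gives one substitution and one insertion over three reference events.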
The DOA metrics consist of the DOA error (DOAE) and frame recall (FR). The DOAE is computed from the estimated and reference DOAs, which are matched with the Hungarian algorithm, and it is normalized by the number of estimated events. The pair-wise cost between an individual predicted DOA and a reference DOA depends on the azimuths and elevations of the estimated and reference DOAs.
Finally, the frame recall relies on the number of reference events, the number of estimated events and an indicator function, which outputs one if its condition is met and zero otherwise.
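The DOA metrics can be sketched as follows. Several points are assumptions of this illustration: the pair-wise cost is taken to be the central angle between the two directions, the frame recall is taken to be the fraction of frames where the number of estimated events equals the number of reference events, and the Hungarian algorithm is replaced by a brute-force search over matchings, which returns the same optimal cost for the small numbers of simultaneous events considered here. All function names are hypothetical.

```python
import itertools
import math


def central_angle_deg(doa_a, doa_b):
    """Angle in degrees between two DOAs given as (azimuth, elevation) in degrees."""
    az_a, el_a = (math.radians(v) for v in doa_a)
    az_b, el_b = (math.radians(v) for v in doa_b)
    cos_angle = (math.sin(el_a) * math.sin(el_b)
                 + math.cos(el_a) * math.cos(el_b) * math.cos(az_b - az_a))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def assignment_cost(estimated, reference):
    """Smallest total cost over one-to-one matchings (brute-force stand-in)."""
    small, large = sorted((estimated, reference), key=len)
    best = math.inf
    for candidate in itertools.permutations(large, len(small)):
        cost = sum(central_angle_deg(a, b) for a, b in zip(small, candidate))
        best = min(best, cost)
    return best


def doa_error(estimated_frames, reference_frames):
    """Total matching cost divided by the total number of estimated events."""
    total = sum(assignment_cost(e, r)
                for e, r in zip(estimated_frames, reference_frames))
    n_estimated = sum(len(e) for e in estimated_frames)
    return total / n_estimated if n_estimated else 0.0


def frame_recall(estimated_frames, reference_frames):
    """Fraction of frames with the same number of estimated and reference events."""
    hits = sum(len(e) == len(r)
               for e, r in zip(estimated_frames, reference_frames))
    return hits / len(reference_frames) if reference_frames else 0.0
```

The brute-force search grows factorially with the number of simultaneous events, so a proper Hungarian implementation is preferable beyond a handful of overlapping sources.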
Table 2 summarizes the results for the baseline and the proposed method. The proposed system outperforms the baseline on all metrics, and particularly improves the accuracy of the estimated DOA.
Performance in terms of Error Rate (ER, lower is better), F-score (F, higher is better), Direction of Arrival Error (DOA, lower is better) and Frame Recall (FR, higher is better).
In this paper, we propose a system that detects sound events and estimates their TDOA for each pair of microphones, and then combines these results to detect sound events and estimate their DOA with a four-microphone array. The proposed method outperforms the DCASE 2019 baseline system.
In future work, additional neural network architectures should be investigated for SELD. Moreover, making the system work online (by using unidirectional GRU layers for instance) would make the method appealing for real-world applications.
[1] (2018) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. J. Sel. Topics Signal Process. 13, pp. 34–48.
[2] (2018) Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In Proc. IEEE EUSIPCO, pp. 1462–1466.
[3] (2019) A multi-room reverberant dataset for sound event localization and detection. Submitted to DCASE Workshop.
[4] (2007) Broadband variations of the MUSIC high-resolution method for sound source localization in robotics. In Proc. IEEE/RSJ IROS, pp. 2009–2014.
[5] (2019) Polyphonic sound event detection and localization using a two-stage strategy. arXiv preprint arXiv:1905.00268.
[6] (2011) Spectral vs. spectro-temporal features for acoustic event detection. In Proc. IEEE WASPAA, pp. 69–72.
[7] (2010) Information-theoretic detection of broadband sources in a coherent beamspace MUSIC scheme. In Proc. IEEE/RSJ IROS, pp. 1976–1981.
[8] (2001) Robust localization in reverberant rooms. In Microphone Arrays, pp. 157–180.
[9] (2013) Sound event detection using non-negative dictionaries learned from annotated overlapping events. In Proc. IEEE WASPAA.
[10] (2013) Sound event detection for office live and office synthetic AASP challenge. In Proc. IEEE AASP DCASE, pp. 1–3.
[11] (2013) The manyears open framework. Autonomous Robots 34 (3), pp. 217–232.
[12] (2019) Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations. Rob. Auton. Syst. 113, pp. 63–80.
[13] (2013) Context-dependent sound event detection. EURASIP J. Audio, Spee. 2013 (1), pp. 1–13.
[14] (2018) Domestic activities classification based on CNN using shuffling and mixing data augmentation. Technical report, DCASE2018 Challenge.
[15] (2009) Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments. In Proc. IEEE/RSJ IROS, pp. 2027–2032.
[16] (2018) Mean teacher convolution system for DCASE 2018 task 4. Technical report, DCASE2018 Challenge.
[17] (2016) Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries. In Proc. DCASE Workshop, pp. 45–49.
[18] (2014) Detection and localization of selected acoustic events in acoustic field for smart surveillance applications. Multimed. Tools Appl. 68 (1), pp. 5–21.
[19] (2018) Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Trans. Audio, Speech, Language Process. 26 (2), pp. 379–393.
[20] (2017) DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proc. DCASE Workshop.
[21] (2016) Metrics for polyphonic sound event detection. Applied Sciences 6 (6), pp. 162.
[22] (2008) A real-time siren detector to improve safety of guide in traffic environment. In Proc. EUSIPCO.
[23] A real-time super resolution robot audition system that improves the robustness of simultaneous speech recognition. Adv. Robotics 27 (12), pp. 933–945.
[24] (2016) Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proc. IEEE ICASSP, pp. 6440–6444.
[25] (1986) Estimation of signal parameters via rotational invariance techniques - ESPRIT. In Proc. IEEE MILCOM.
[26] (1986) Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34 (3), pp. 276–280.
[27] (2016) Bird detection in audio: A survey and a challenge. In Proc. IEEE MLSP.
[28] (2005) EB-ESPRIT: 2D localization of multiple wideband acoustic sources using eigen-beams. In Proc. IEEE ICASSP, pp. 89–92.
[29] (2004) Localization of simultaneous moving sound source for mobile robot using a frequency-domain steered beamformer approach. In Proc. IEEE ICRA, pp. 1033–1038.
[30] (2006) Robust 3D localization and tracking of sound sources using beamforming and particle filtering. In Proc. IEEE ICASSP, pp. 841–844.
[31] (2007) Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Rob. Auton. Syst. 55 (3), pp. 216–228.
[32] (2009) Audio event detection for in-home care. In Proc. ICA, pp. 618–620.
[33] (2018) Large-scale weakly supervised audio classification using gated convolutional neural network. In Proc. IEEE ICASSP, pp. 121–125.