SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection

Sound event localization and detection (SELD) consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone arrays (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, using SALSA features in the FOA format increased the F1 score and localization recall compared to using multichannel log-mel spectrograms with intensity vectors, and the gains were even larger for the MIC format compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra. Our ensemble model trained on SALSA features ranked second in the team category of the SELD task in the 2021 DCASE Challenge.


I Introduction

Sound event localization and detection (SELD) has many applications in urban sound sensing [33], wildlife monitoring [37], surveillance [10], autonomous driving, and robotics [39]. SELD is an emerging research field that unifies the tasks of sound event detection (SED) and direction-of-arrival estimation (DOAE) by jointly recognizing the sound classes, and estimating the directions of arrival (DOA), the onsets, and the offsets of detected sound events [1]. Because of a need for source localization, SELD typically requires multichannel audio inputs from a microphone array, which has several formats in current use, such as first-order ambisonics (FOA) and far-field microphone array (MIC).

I-A Existing methods

Approach | Format | Input Features | Network Architecture | Output
Adavanne et al. [1] | FOA/MIC | Magnitude & phase spectrograms | End-to-end CRNN | class-wise
Cao et al. [7] | FOA/MIC | Log-mel spectrograms, GCC-PHAT | Two-stage CRNNs | class-wise
Nguyen et al. [20] | FOA | Log-mel spectrograms, directional SS histograms | Sequence matching CRNN | track-wise
Xue et al. [42] | MIC | Log-mel spectrograms, IV, pair-wise phase differences | Modified two-stage CRNNs | class-wise
Cao et al. [4] | FOA | Log-mel spectrograms, IV | EINv2 | track-wise
Shimada et al. [35] | FOA | Linear amplitude spectrograms, IPD | CRNN with D3Net | class-wise
Sato et al. [34] | FOA | Complex spectrograms | Invariant CRNN | class-wise
Phan et al. [28] | FOA/MIC | Log-mel spectrograms, IV, GCC-PHAT | CRNN with self-attention | class-wise
Park et al. [25] | FOA | Log-mel spectrograms, IV, harmonic percussive separation | CRNN with feature pyramid | class-wise
Emmanuel et al. [9] | FOA | Constant-Q spectrograms, log-mel spectrograms, IV | Multi-scale network with MHSA | track-wise
Lee et al. [14] | FOA | Log-mel spectrograms, IV | EINv2 with cross-modal attention | track-wise
(Top’19) Kapka et al. [12] | FOA | Log-mel spectrograms, IV | Ensemble of CRNNs | class-wise
(Top’20) Wang et al. [41] | FOA+MIC | Log-mel spectrograms, IV, GCC-PHAT | Ensemble of CRNNs & CNN-TDNNs | class-wise
(Top’21) Shimada et al. [36] | FOA | Linear amplitude spectrograms, IPD, cosIPD, sinIPD | Ensemble of CRNNs & EINv2 | class-wise
Proposed method | FOA/MIC | SALSA: Log-linear spectrograms & normalized eigenvectors | End-to-end CRNN | class-wise

IV and GCC-PHAT features follow the frequency scale (linear, mel, constant-Q) of the spectrograms. TDNN stands for time delay neural networks. IPD stands for interchannel phase differences. Top’YY denotes the top ranked systems for the respective DCASE SELD Challenges.

TABLE I: Comparison of the proposed method with some existing deep learning-based methods for polyphonic SELD.

Over the past few years, there have been many major developments for SELD in the areas of data augmentation, feature engineering, model architectures, and output formats. In 2015, an early monophonic SELD work by Hirvonen [11] formulated SELD as a classification task. In 2018, Adavanne et al. [1] pioneered polyphonic SELD with SELDnet, an end-to-end convolutional recurrent neural network (CRNN) that jointly detects sound events and estimates the corresponding DOAs. In 2019, the SELD task was introduced in the Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). Cao et al. [7] proposed a two-stage strategy that trains separate SED and DOAE models, then uses the SED outputs as masks to select the DOA outputs. Mazzon et al. [15] proposed a spatial augmentation method based on swapping the channels of the FOA format. Xue et al. [42] applied eight fixed beamformers to extract signals from different directions as inputs to a modified two-stage CRNN. Nguyen et al. [20, 21] explored a hybrid approach called the Sequence Matching Network (SMN), which first solves SED and DOAE separately, then matches the SED and DOAE output sequences using a bidirectional gated recurrent unit (BiGRU).

In 2020, moving sound sources were introduced in the DCASE SELD Challenge. Cao et al. [6] proposed the Event-Independent Network (EIN), which uses soft parameter sharing between the SED and DOAE encoder branches and outputs track-wise predictions. An improved version of this network, EINv2, replaced the BiGRUs with multi-head self-attention (MHSA) [4]. Sato et al. [34] designed a CRNN that is invariant to rotation, scale, and time translation for FOA signals. Phan et al. [28] formulated both SED and DOAE as regression problems to improve training convergence. Park et al. [25] experimented with different loss functions to overcome several data imbalance issues. Wang et al. [40] focused on several data augmentation methods to overcome the data sparsity problem in SELD. Shimada et al. [35] unified the SED and DOAE losses into one regression loss using a representation called Activity-Coupled Cartesian Direction of Arrival (ACCDOA), and incorporated a new convolutional neural network (CNN) architecture, D3Net [38], into the CRNN for SELD. In another research direction, Lopez et al. [27] tracked the trajectories of sound sources based on the estimated DOAs, performed spatial filtering on the array signals based on the source trajectories, and finally classified the filtered signals. In 2021, unknown interferences were introduced in the DCASE SELD Challenge. Lee et al. [14] enhanced EINv2 by adding cross-modal attention between the SED and DOAE branches. Emmanuel et al. [9] proposed a multi-scale network with MHSA for SELD. Table I summarizes some notable and state-of-the-art deep learning methods for SELD.

I-B Input features for SELD

In this paper, we focus on input features for SELD. When SELDnet was first introduced, it was trained on multichannel magnitude and phase spectrograms [1]. Subsequently, different features, such as multichannel log-spectrograms and intensity vector (IV) for the FOA format, and generalized cross-correlation with phase transform (GCC-PHAT) for the MIC format in the mel scale were shown to be more effective for SELD [30, 29, 7, 4, 35, 40, 41, 36].

Due to its smaller dimension and stronger emphasis on the lower frequency bands, where most signal content is concentrated, the mel frequency scale has been used more frequently than the linear frequency scale for SELD. However, combining the IV or GCC-PHAT features with mel spectrograms is not trivial, and the implicit DOA information in these features is often compromised. In practice, the IVs are also passed through the mel filters, which merge the DOA cues of several narrow bands into one mel band, making it more difficult to resolve different DOAs in multi-source scenarios. Likewise, in order to stack GCC-PHAT with mel spectrograms, longer time lags of the GCC-PHAT spectra have to be truncated. Since the linear scale has the advantage of preserving the directional information at each frequency band, several works have attempted to use spectrograms, inter-channel phase differences (IPD), and IVs in the linear scale [35] or the constant-Q scale [9]. However, there is a lack of experimental results that directly compare these features across different scales.

Referring to Table I, more SELD algorithms have been developed for the FOA format than for the MIC format, even though the MIC format is more common in practice. The baselines of the three DCASE SELD challenges so far have indicated that using FOA inputs performs slightly better than using MIC inputs [2, 30, 29]. In addition, it is more straightforward to stack IVs with spectrograms in the FOA format than to stack GCC-PHAT with spectrograms. When IVs are stacked with spectrograms, there is a direct frequency correspondence between the IVs and the spectrograms. This frequency correspondence is crucial for networks to associate sound classes with the DOAs of multiple sound events, since the signals of different sound sources are often distributed differently along the frequency dimension. On the other hand, the time-lag dimension of the GCC-PHAT features does not have a locally linear one-to-one mapping with the frequency dimension of the spectrograms. As a result, all of the DOA information is aggregated at the frame level, making it difficult to assign correct DOAs to different sound events. Furthermore, when there are multiple sound sources, GCC-PHAT features are known to be noisy, and the directional cues of IVs at overlapping TF bins are merged. In order to solve SELD more effectively in noisy, reverberant, and multi-source scenarios, a better feature is needed for both audio formats, but especially for the MIC format, where feature engineering has largely been lacking compared to the FOA format.

I-C Our Contributions

We propose a novel feature for SELD called Spatial Cue-Augmented Log-Spectrogram (SALSA) with exact spectrotemporal mapping between the signal power and the source DOA for both FOA and MIC formats. The feature consists of multichannel log-magnitude linear-frequency spectrograms stacked with a normalized version of the principal eigenvector of the spatial covariance matrix (CM) at each TF bin on the spectrograms. The principal eigenvector is normalized such that it represents the inter-channel intensity difference (IID) for the FOA format, or inter-channel phase difference (IPD) for the MIC format.

To further improve the performance, only eigenvectors from approximately single-source (SS) TF bins are included in the features, since the directional cues at these bins are less noisy. A TF bin is considered an SS bin when it contains energy mostly from only one source [23, 18]. We evaluated the effectiveness of the proposed feature on both the FOA and MIC formats using the TAU-NIGENS Spatial Sound Events (TNSSE) 2021 dataset from the DCASE 2021 SELD Challenge. Experimental results showed that the SALSA feature outperformed, for the FOA format, both the mel- and linear-frequency log-magnitude spectrograms with IV, and for the MIC format, the log-magnitude spectrograms with GCC-PHAT.

In addition, SALSA features bridged the performance gap between the FOA and the MIC formats, and achieved the state-of-the-art performance for a single (non-ensemble) model on the TNSSE 2021 development dataset for both formats. Similarly, when evaluated on the TNSSE 2020 dataset, SALSA also achieved the top performance for a single model for both formats on both the development and the evaluation datasets. Our ensemble model trained on an early version of SALSA features ranked second in the team category of the DCASE 2021 SELD challenge [22].

Our paper offers several contributions, as follows:

  1. a novel and effective feature for SELD that works for both FOA and MIC formats,

  2. an improvement to the proposed feature by utilizing signal processing-based methods to select SS TF bins,

  3. a comprehensive analysis of the feature importance of each component of SALSA for SELD, and

  4. an extensive ablation study of different data augmentation methods for the newly proposed SALSA feature, as well as for the log-magnitude spectrograms with IV and GCC-PHAT in both linear- and mel-frequency scales.

The rest of the paper is organized as follows. Section II presents the proposed SALSA features for both the FOA and the MIC formats. Section III briefly describes common SELD features used as benchmarks. Section IV presents the network architecture employed in all of the experiments. Section V elaborates the experimental settings. Section VI presents the experimental results and discussion with extensive ablation study. Finally, we conclude the paper in Section VII.

II Spatial Cue-Augmented Log-Spectrogram Features for SELD

The proposed SALSA features consist of two major components: multichannel log-linear spectrograms and normalized principal eigenvectors. For the rest of this paper, spectrograms refer to multichannel spectrograms unless otherwise stated.

II-A Signal Model

Let $M$ be the number of microphones and $K$ be the number of sound sources. The short-time Fourier transform (STFT) signal observed by an $M$-channel microphone array of arbitrary geometry in the TF domain is given by

$$\mathbf{X}(t, f) = \sum_{k=1}^{K} S_k(t, f)\, \mathbf{h}_k(f, \phi_k, \theta_k) + \mathbf{N}(t, f), \quad (1)$$

where $t$ and $f$ are the time and frequency indices, respectively; $S_k(t, f)$ is the $k$th source signal; $\mathbf{h}_k(f, \phi_k, \theta_k)$ is the frequency-domain steering vector corresponding to the DOA $(\phi_k, \theta_k)$ of the $k$th source, where $\phi_k$ and $\theta_k$ are the azimuth and elevation angles, respectively; and $\mathbf{N}(t, f)$ is the noise vector. For moving sources, $\phi_k$ and $\theta_k$ are functions of time. For brevity, the time variable is omitted in $\phi_k$ and $\theta_k$ in some equations.

II-B Multichannel log-linear spectrograms

The log-linear spectrograms are computed from the complex spectrograms by

$$\mathrm{LinSpec}(t, f, m) = \log \lvert X_m(t, f) \rvert^2, \quad 1 \le t \le T,\ 1 \le f \le F, \quad (2)$$

where $T$ is the number of time frames, $F$ is the number of frequency bins, and $m$ is the channel index.
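To make Eq. (2) concrete, the short sketch below computes multichannel log-linear spectrograms with librosa. The STFT parameters and the synthetic input are placeholder assumptions, not the exact configuration described in Section V-C.

```python
import numpy as np
import librosa

def log_linear_spectrogram(audio, n_fft=512, hop_length=300, eps=1e-8):
    """Multichannel log-linear spectrograms, Eq. (2).

    audio: array of shape (n_channels, n_samples).
    Returns an array of shape (n_channels, n_fft // 2 + 1, n_frames).
    n_fft and hop_length are placeholder values, not the paper's exact settings.
    """
    specs = []
    for channel in audio:
        stft = librosa.stft(channel, n_fft=n_fft, hop_length=hop_length, window="hann")
        specs.append(np.log(np.abs(stft) ** 2 + eps))  # log power per TF bin
    return np.stack(specs, axis=0)

# Example on a synthetic four-channel signal
x = np.random.randn(4, 48000).astype(np.float32)
print(log_linear_spectrogram(x).shape)
```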

II-C Normalized principal eigenvectors

Assuming the signal and noise are zero-mean and uncorrelated, the true CM, $\mathbf{R}(t, f)$, is a linear combination of rank-one outer products of the steering vectors weighted by the signal powers $P_k(t, f)$ of the $k$th source at the TF bin $(t, f)$, that is,

$$\mathbf{R}(t, f) = \sum_{k=1}^{K} P_k(t, f)\, \mathbf{h}_k(f) \mathbf{h}_k^{\mathsf{H}}(f) + \mathbf{R}_{\mathbf{N}}(t, f), \quad (3)$$
$$\mathbf{R}(t, f) \approx P_k(t, f)\, \mathbf{h}_k(f) \mathbf{h}_k^{\mathsf{H}}(f) + \mathbf{R}_{\mathbf{N}}(t, f) \quad \text{at SS TF bins where source } k \text{ dominates}, \quad (4)$$

where $\mathbf{R}_{\mathbf{N}}(t, f)$ is the noise CM and $(\cdot)^{\mathsf{H}}$ denotes the Hermitian transpose. Eq. (4) shows that at SS TF bins, where only one sound source is dominant, the theoretical steering vector can be approximated by the principal eigenvector of the CM [23, 18]. Therefore, we can reliably extract directional cues from the principal eigenvectors at these bins. For TF bins that are not single-source, the directional cues can be set to a predefined default value such as zero. In the next sections, we elaborate on how to normalize the principal eigenvectors to extract the directional cues, which are encoded in the IID for FOA arrays and in the IPD for far-field microphone arrays.
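A minimal numpy sketch of the eigenvector extraction described above. The spatial covariance at each TF bin is estimated here by averaging outer products over a few neighbouring time frames, since a single STFT snapshot would be rank-one; the length of this averaging window is an assumption of the sketch rather than a value taken from the paper.

```python
import numpy as np

def principal_eigenvectors(stft, n_avg=3):
    """stft: complex array of shape (n_channels, n_freq, n_frames).
    Returns unit-norm principal eigenvectors, shape (n_channels, n_freq, n_frames).
    The covariance at each TF bin is averaged over `n_avg` neighbouring frames
    (an assumption of this sketch, not necessarily the paper's exact choice)."""
    n_ch, n_freq, n_frames = stft.shape
    eigvecs = np.zeros_like(stft)
    half = n_avg // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        # Local snapshot matrix per frequency bin: (n_freq, n_channels, n_window)
        X = stft[:, :, lo:hi].transpose(1, 0, 2)
        # Spatial covariance per frequency bin: (n_freq, n_channels, n_channels)
        R = X @ X.conj().transpose(0, 2, 1) / (hi - lo)
        # Hermitian eigendecomposition; eigenvalues come back in ascending order
        w, v = np.linalg.eigh(R)
        eigvecs[:, :, t] = v[:, :, -1].T  # eigenvector of the largest eigenvalue
    return eigvecs
```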

II-C1 Eigenvector-based intensity vector for FOA arrays

FOA arrays have four channels and the directional cues are encoded in the IID. With the channel order (omni, X, Y, Z), a typical steering vector for an FOA array can be defined by

$$\mathbf{h}_{\mathrm{FOA}}(\phi, \theta) = \begin{bmatrix} 1 & \cos\phi \cos\theta & \sin\phi \cos\theta & \sin\theta \end{bmatrix}^{\mathsf T}, \quad (5)$$

where $\phi$ and $\theta$ are the time-dependent azimuth and elevation angles of a sound source with respect to the array, respectively.

We can compute an eigenvector-based intensity vector (EIV) to approximate $\mathbf{h}_{\mathrm{FOA}}$ from the principal eigenvector $\mathbf{v}$ as follows. First, we normalize $\mathbf{v}$ by its first element, which corresponds to the omni-directional channel, then discard the first element to obtain $\tilde{\mathbf{v}}$. Afterwards, we take the real part of $\tilde{\mathbf{v}}$ and normalize it to obtain the unit-norm EIV $\mathbf{v}_{\mathrm{EIV}}$. SALSA features for the FOA format are formed by stacking the four-channel spectrograms with the three-channel EIV. Fig. 1 illustrates the SALSA features of a 16-second audio segment in a multi-source scenario for an FOA array with the EIV cutoff frequency used in our experiments (see Section V-A). The three EIV channels are visually discriminative for sources originating from different directions. The green areas in the EIV channels correspond to zeroed-out TF bins.
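A sketch of the EIV normalization described above, operating on the principal eigenvectors from the previous snippet; the function and variable names are ours, not the authors'.

```python
import numpy as np

def eigenvector_intensity_vector(eigvec, eps=1e-8):
    """eigvec: complex principal eigenvectors of shape (4, n_freq, n_frames)
    for an FOA array with channel order (omni, X, Y, Z).
    Returns the 3-channel EIV of shape (3, n_freq, n_frames)."""
    # Normalize by the omni-directional (first) channel, then drop it
    v = eigvec[1:] / (eigvec[0:1] + eps)
    v = np.real(v)                                   # keep the real part (IID cues)
    norm = np.linalg.norm(v, axis=0, keepdims=True)  # unit norm across the 3 channels
    return v / (norm + eps)

# FOA SALSA: stack the 4-channel log-linear spectrograms with the 3-channel EIV
# salsa_foa = np.concatenate([lin_spec, eigenvector_intensity_vector(eigvecs)], axis=0)
```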

II-C2 Eigenvector-based phase vector for microphone arrays

Fig. 1: SALSA features of a 16-second audio segment of FOA format in a multi-source scenario. The vertical axis represents frequency in kHz.
Fig. 2: SALSA features of a 16-second audio segment of a four-channel microphone array (MIC format) in a multi-source scenario. The vertical axis represents frequency in kHz.

For a far-field microphone array, the directional cues are encoded in the IPD. The steering vector of an $M$-channel far-field array of arbitrary geometry can be modelled by $\mathbf{h}(f) = [h_1(f), \dots, h_M(f)]^{\mathsf T}$, whose elements are given by

$$h_m(f) = \exp\!\left(-j\, \frac{2 \pi f}{c}\, d_m\right), \quad (6)$$

where $j$ is the imaginary unit, $c$ is the speed of sound, and $d_m$ is the distance of arrival, in metres, travelled by a sound source between the $m$th microphone and the reference ($m = 1$) microphone. The distance of arrival is computed by

$$d_m = (\mathbf{p}_m - \mathbf{p}_1)^{\mathsf T} \begin{bmatrix} \cos\phi \cos\theta & \sin\phi \cos\theta & \sin\theta \end{bmatrix}^{\mathsf T} = c\, \tau_m, \quad (7)$$

where $\mathbf{p}_1$ and $\mathbf{p}_m$ are the Cartesian coordinates of the reference and the $m$th microphones, respectively, and $\tau_m$ is the time difference of arrival (TDOA) of the sound source between the $m$th and the reference microphones.

The directional cues of a far-field microphone array (MIC format) can be represented in several forms, such as the relative distance of arrival (RDOA) and the TDOA. In this study, we extract the directional cues in the form of RDOA. One advantage of RDOA is that we do not need to know the exact coordinates of the individual microphones. We can compute an eigenvector-based phase vector (EPV) to approximate the RDOAs from the principal eigenvector $\mathbf{v}$ as follows. First, we normalize $\mathbf{v}$ by its first element, which is chosen arbitrarily as the reference microphone, then discard the first element to obtain $\tilde{\mathbf{v}}$. After that, we take the phase of $\tilde{\mathbf{v}}$ and normalize it by $-2 \pi f / c$ to obtain the EPV $\mathbf{v}_{\mathrm{EPV}}$, whose elements approximate the RDOAs. The SALSA features for the MIC format are formed by stacking the $M$-channel spectrograms with the $(M-1)$-channel EPV. To avoid spatial aliasing, the values of $\mathbf{v}_{\mathrm{EPV}}$ are set to zero for all TF bins above the aliasing frequency. Fig. 2 illustrates the SALSA features of a 16-second audio segment in a multi-source scenario for a four-channel microphone array with the EPV cutoff frequency used in our experiments (see Section V-A). Similar to the FOA counterpart, the three EPV channels are visually discriminative for sources originating from different directions. The green areas in the EPV channels correspond to zeroed-out TF bins that are not SS or are above the aliasing frequency.
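A companion sketch for the MIC format, mirroring the EIV computation; the aliasing cutoff, speed of sound, and frequency vector below are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def eigenvector_phase_vector(eigvec, freqs_hz, f_alias_hz=4000.0, c=343.0, eps=1e-8):
    """eigvec: principal eigenvectors of shape (n_mics, n_freq, n_frames).
    freqs_hz: centre frequency of each STFT bin, shape (n_freq,).
    Returns the (n_mics - 1)-channel EPV approximating the relative distances of
    arrival (in metres); f_alias_hz is an illustrative aliasing cutoff."""
    v = eigvec[1:] / (eigvec[0:1] + eps)            # reference = first microphone
    phase = np.angle(v)                              # inter-channel phase differences
    scale = -2.0 * np.pi * freqs_hz / c              # phase-to-distance factor (cf. Eq. (6))
    safe = np.where(np.abs(scale) < eps, 1.0, scale)
    epv = np.where(np.abs(scale)[None, :, None] < eps, 0.0, phase / safe[None, :, None])
    epv[:, freqs_hz > f_alias_hz, :] = 0.0           # zero out bins above the aliasing cutoff
    return epv
```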

The proposed method to extract spatial cues can also be extended to near-field and baffled microphone arrays, where directional cues are encoded in both IID and IPD. For those arrays, we can approximate their array response model using the far-field model, or we can compute both EIV and EPV as shown in Section II-C1 and Section II-C2, respectively.

II-D Single-source time-frequency bin selection

Fig. 3: Distribution of TF bins that fail magnitude test, pass magnitude test and fail coherence test, pass both tests for FOA and MIC format

The selection of SS TF bins has been shown to be effective for DOAE in noisy, reverberant, and multi-source cases [23, 20, 18]. There are several methods to select SS TF bins [17, 26, 23]. In this paper, we apply two tests to select SS TF bins, namely the magnitude test and the coherence test. The magnitude test aims to select only TF bins that contain signal from foreground sound sources [23]. A TF bin passes the magnitude test if its signal-to-noise ratio (SNR) with respect to an adaptive noise floor is above a noise threshold [23]. The coherence test aims to find TF bins that contain signal from mostly one source [17]. A TF bin passes the coherence test if its direct-to-reverberant ratio (DRR), which is estimated as the ratio between the two largest eigenvalues of the spatial CM, is above a coherence threshold [32]. Fig. 3 shows the distribution of TF bins that fail the magnitude test, pass the magnitude test but fail the coherence test, and pass both tests for the FOA and MIC formats of the TNSSE 2021 development dataset [29]. The lower cutoff frequency is the same for both formats, while the upper cutoff frequencies for the FOA and MIC formats differ (see Section V-A). For both formats, only a small fraction of the TF bins in the passband pass both tests, so the two tests significantly reduce the number of EIVs or EPVs to be computed.
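The two tests can be sketched as follows. The adaptive noise floor is modelled here as a simple recursive minimum tracker, and both thresholds are placeholder values; the paper's exact noise-floor estimator and thresholds are not reproduced.

```python
import numpy as np

def single_source_mask(stft, eigvals, snr_threshold_db=5.0, coherence_threshold=4.0,
                       alpha=0.98):
    """stft: complex array (n_channels, n_freq, n_frames).
    eigvals: eigenvalues of the spatial covariance matrices, shape
             (n_freq, n_frames, n_channels), sorted ascending (e.g. from np.linalg.eigh).
    Returns a boolean mask (n_freq, n_frames) of approximately single-source TF bins.
    The noise-floor estimator and both thresholds are assumptions of this sketch."""
    power = np.mean(np.abs(stft) ** 2, axis=0)                 # (n_freq, n_frames)

    # Magnitude test: SNR against an adaptive (recursively smoothed) noise floor
    noise_floor = np.zeros_like(power)
    noise_floor[:, 0] = power[:, 0]
    for t in range(1, power.shape[1]):
        smoothed = alpha * noise_floor[:, t - 1] + (1 - alpha) * power[:, t]
        noise_floor[:, t] = np.minimum(power[:, t], smoothed)
    snr_db = 10.0 * np.log10(power / (noise_floor + 1e-12) + 1e-12)
    magnitude_ok = snr_db > snr_threshold_db

    # Coherence test: ratio of the two largest eigenvalues of the spatial covariance
    drr = eigvals[..., -1] / (eigvals[..., -2] + 1e-12)        # (n_freq, n_frames)
    coherence_ok = drr > coherence_threshold

    return magnitude_ok & coherence_ok
```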

III Common input features for SELD

Name | Format | Components | # channels
MelSpecIV | FOA | MelSpec + IV | 7
LinSpecIV | FOA | LinSpec + IV | 7
MelSpecGCC | MIC | MelSpec + GCC-PHAT | 10
LinSpecGCC | MIC | LinSpec + GCC-PHAT | 10
SALSA | FOA | LinSpec + EIV | 7
SALSA | MIC | LinSpec + EPV | 7

The numbers of channels are calculated based on four-channel inputs.

TABLE II: Feature names and descriptions

We compare the proposed SALSA features with log-spectrograms and IV for the FOA format, and log-spectrograms and GCC-PHAT for the MIC format, in both mel- and linear-frequency scales; the mel-scale features are the more popular for SELD. The log-mel spectrograms are computed from the complex spectrograms by

$$\mathrm{MelSpec}(t, b, m) = \log \sum_{f} W_{\mathrm{mel}}(b, f)\, \lvert X_m(t, f) \rvert^2, \quad (8)$$

where $b$ is the mel band index and $W_{\mathrm{mel}}(b, f)$ is the mel filter.

III-A Log-spectrograms and IV for FOA format

The four channels of the FOA format consist of the omni-, X-, Y-, and Z-directional components. The IV expresses the intensity differences of the X, Y, and Z components with respect to the omni-directional component, and thus carries the DOA cues [43, 5]. The IV is computed in the TF domain by

$$\mathbf{I}(t, f) = \Re\!\left\{ X_{\mathrm{W}}^{*}(t, f) \begin{bmatrix} X_{\mathrm{X}}(t, f) & X_{\mathrm{Y}}(t, f) & X_{\mathrm{Z}}(t, f) \end{bmatrix}^{\mathsf T} \right\}, \quad (9)$$

where $(\cdot)^{*}$ denotes the complex conjugate. The IV is then normalized [5] to have unit norm. In order to combine the IVs with the multichannel log-mel spectrograms, the IVs are passed through the same set of mel filters used to compute the log-mel spectrograms; we refer to this feature as MelSpecIV. Linear-scale IVs can also be stacked with log-linear spectrograms, and we refer to this feature as LinSpecIV. MelSpecIV and LinSpecIV have $B$ and $F$ frequency bins per channel, respectively, where $B$ is the number of mel filters and $F$ is the number of linear-frequency bins.
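For comparison with the EIV, a minimal sketch of the standard FOA intensity-vector computation in Eq. (9), using the channel order (W, X, Y, Z) described above.

```python
import numpy as np

def foa_intensity_vector(stft, eps=1e-8):
    """stft: complex FOA STFT of shape (4, n_freq, n_frames) with channels (W, X, Y, Z).
    Returns the unit-norm intensity vector of shape (3, n_freq, n_frames)."""
    W = stft[0]
    iv = np.real(np.conj(W)[None] * stft[1:])        # Re{W* . [X, Y, Z]}, Eq. (9)
    norm = np.linalg.norm(iv, axis=0, keepdims=True)
    return iv / (norm + eps)
```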

III-B Log-spectrograms and GCC-PHAT for MIC format

GCC-PHAT is computed for each audio frame and each pair of microphones by [7]

$$\mathrm{GCC}_{pq}(t, \tau) = \mathcal{F}^{-1}\!\left\{ \frac{X_p(t, f)\, X_q^{*}(t, f)}{\lvert X_p(t, f)\, X_q^{*}(t, f) \rvert} \right\}, \quad (10)$$

where $\tau$ is the time lag and $\mathcal{F}^{-1}$ denotes the inverse Fourier transform over frequency. The maximum meaningful time lag of the GCC-PHAT spectrum is $\tau_{\max} = d_{\max} f_s / c$ samples, where $f_s$ is the sampling rate and $d_{\max}$ is the largest distance between two microphones. When the GCC-PHAT features are stacked with mel- or linear-scale spectrograms, the range of time lags retained from the GCC-PHAT spectrum is chosen to match the number of mel bands or linear-frequency bins, respectively. We refer to these two features as MelSpecGCC and LinSpecGCC, respectively. The resulting channel counts are listed in Table II, which summarizes all features of interest in this work.
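A per-frame GCC-PHAT sketch in the same numpy style; the number of retained lags is left as a parameter because the paper ties it to the number of mel bands or frequency bins.

```python
import numpy as np

def gcc_phat(stft, n_lags=64, eps=1e-8):
    """stft: complex array (n_channels, n_freq, n_frames), one-sided STFT.
    Returns GCC-PHAT of shape (n_pairs, n_lags, n_frames), centred on lag 0."""
    n_ch, n_freq, n_frames = stft.shape
    n_fft = 2 * (n_freq - 1)
    feats = []
    for p in range(n_ch):
        for q in range(p + 1, n_ch):
            cross = stft[p] * np.conj(stft[q])                 # cross-spectrum per TF bin
            cross /= np.abs(cross) + eps                        # PHAT weighting
            cc = np.fft.irfft(cross, n=n_fft, axis=0)           # back to the lag domain
            cc = np.concatenate([cc[-n_lags // 2:], cc[:n_lags // 2]], axis=0)  # centre lags
            feats.append(cc)
    return np.stack(feats, axis=0)
```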

IV Network Architecture and Pipeline

Figure 4 shows the SELD network architecture that is used for all the experiments in this paper. The network consists of a CNN based on ResNet22 for audio tagging [13], a two-layer BiGRU, and fully connected (FC) layers. The network can be adapted to the different input features in Table II by setting the number of input channels in the first convolutional layer to that of the input features. During inference, sound classes whose probabilities are above the SED threshold are considered active, and the DOAs corresponding to these classes are selected accordingly.

IV-A Loss function

We use the class-wise output format for SELD, in which SED is formulated as a multilabel multiclass classification and DOAE as a three-dimensional Cartesian regression. The loss function is given by

$$\mathcal{L}\big(\hat{\mathbf{Y}}, \mathbf{Y}\big) = \lambda_{\mathrm{SED}}\, \mathcal{L}_{\mathrm{SED}}\big(\hat{\mathbf{Y}}_{\mathrm{SED}}, \mathbf{Y}_{\mathrm{SED}}\big) + \lambda_{\mathrm{DOA}}\, \mathcal{L}_{\mathrm{DOA}}\big(\hat{\mathbf{Y}}_{\mathrm{DOA}}, \mathbf{Y}_{\mathrm{DOA}}\big), \quad (11)$$

where $\hat{\mathbf{Y}}$ and $\mathbf{Y}$ are the SELD prediction and target tensors, respectively; $\hat{\mathbf{Y}}_{\mathrm{SED}}, \mathbf{Y}_{\mathrm{SED}} \in [0, 1]^{T_{\mathrm{out}} \times C}$ are the SED prediction and target tensors, respectively; $\hat{\mathbf{Y}}_{\mathrm{DOA}}, \mathbf{Y}_{\mathrm{DOA}} \in \mathbb{R}^{T_{\mathrm{out}} \times 3C}$ are the DOA prediction and target tensors, respectively; $T_{\mathrm{out}}$ is the number of output frames; $C$ is the number of target sound classes; and $\lambda_{\mathrm{SED}}$ and $\lambda_{\mathrm{DOA}}$ are the loss weights for SED and DOAE (see Section V-C). The DOA loss is only computed for the active classes in each frame.
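A sketch of such a class-wise SELD loss in PyTorch, assuming binary cross-entropy for SED and a masked mean-squared error for the Cartesian DOA regression; the loss weights, the per-class (x, y, z) layout, and the masking details are assumptions consistent with the description above, not a copy of the authors' implementation.

```python
import torch
import torch.nn.functional as F

def seld_loss(sed_pred, sed_target, doa_pred, doa_target, w_sed=0.3, w_doa=0.7):
    """Class-wise SELD loss: BCE for SED plus a masked MSE for Cartesian DOA regression.

    sed_pred, sed_target: (batch, T, C) class probabilities / binary activity labels.
    doa_pred, doa_target: (batch, T, 3C), assumed laid out as (x, y, z) per class.
    The loss weights here are placeholders, not the paper's exact values.
    """
    sed_loss = F.binary_cross_entropy(sed_pred, sed_target)

    # Only penalize DOA errors for classes that are active in a given frame
    mask = sed_target.repeat_interleave(3, dim=-1)             # (batch, T, 3C)
    doa_loss = ((doa_pred - doa_target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

    return w_sed * sed_loss + w_doa * doa_loss
```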

IV-B Feature normalization

The four features MelSpecIV, LinSpecIV, MelSpecGCC, and LinSpecGCC are globally normalized to zero mean and unit standard deviation per channel [3]. For the SALSA features, only the spectrogram channels are normalized in the same way.

IV-C Data augmentation

To tackle the problem of small datasets in SELD, we investigate the effectiveness of three data augmentation techniques for all features listed in Table II: channel swapping (CS) [15, 40], random cutout (RC) [44, 24], and frequency shifting (FS). All the three augmentation techniques can be performed in the STFT domain on the fly during training. Only CS changes the ground truth, while RC and FS do not alter the ground truth. Each training sample has an independent chance to be augmented by each of the three techniques.

In CS, the channels of the FOA [15] and MIC [40] formats can each be swapped in a number of predefined ways that correspond to rotations and reflections of the sound scene. The IV, GCC-PHAT, EIV, EPV, and target labels are altered accordingly when channels are swapped. The CS augmentation technique greatly increases the variation of DOAs in the dataset.

In RC, we either apply random cutout [44] or TF masking via SpecAugment [24] on all the channels of the input features. Random cutout produces a rectangular mask on the spectrograms while SpecAugment produces a cross-shaped mask. For the LinSpec and MelSpec channels, the value of the mask is set to a random value within these channels’ value range. For the IV, GCC-PHAT, EIV and EPV channels, the value of the mask is set to zero. All the channels share the same mask. The RC technique aims to improve network redundancy.

We also introduce FS as a new data augmentation technique for SELD. FS in the frequency domain is similar to pitch shifting in the time domain [33]. We randomly shift all the channels of the input features up or down along the frequency dimension by up to a small number of bands. For the MelSpecGCC and LinSpecGCC features, the GCC-PHAT channels are not shifted. The FS augmentation technique increases the variation of the frequency patterns of sound events. A sketch of the RC and FS augmentations is given below.
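A sketch of the RC and FS augmentations on a stacked feature tensor of shape (channels, frequency, time). The maximum shift and mask sizes are illustrative, and CS is omitted because its valid swap patterns depend on the array geometry and the label convention.

```python
import numpy as np

def frequency_shift(feature, max_shift=8, rng=np.random):
    """Shift all channels of a (n_channels, n_freq, n_frames) feature up or down along
    the frequency axis by a random number of bins; vacated bins are zero-filled.
    max_shift is an illustrative value, not the paper's exact setting."""
    shift = rng.randint(-max_shift, max_shift + 1)
    out = np.zeros_like(feature)
    if shift > 0:
        out[:, shift:, :] = feature[:, :-shift, :]
    elif shift < 0:
        out[:, :shift, :] = feature[:, -shift:, :]
    else:
        out = feature.copy()
    return out

def random_cutout(feature, max_freq=32, max_time=64, fill=0.0, rng=np.random):
    """Zero out a random rectangular TF region shared across all channels.
    The paper fills spectrogram channels with a random value and spatial channels with
    zero; a single fill value is used here for brevity."""
    _, n_freq, n_frames = feature.shape
    fh = rng.randint(1, max_freq + 1)
    tw = rng.randint(1, max_time + 1)
    f0 = rng.randint(0, max(1, n_freq - fh))
    t0 = rng.randint(0, max(1, n_frames - tw))
    out = feature.copy()
    out[:, f0:f0 + fh, t0:t0 + tw] = fill
    return out
```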

Fig. 4: Block diagram of the SELD network, which is a CRNN. This network can be adapted for different input features such as SALSA, MelSpecIV, MelSpecGCC, etc. by changing the number of input channels in the first convolutional layer of the network.

V Experimental Settings

V-A Dataset

The main dataset used in the majority of our experiments is the TNSSE 2021 dataset [29]. Since this dataset is relatively new, we also use the TNSSE 2020 dataset [30] to compare our models with state-of-the-art methods. The development subset of each TNSSE dataset is divided into train, validation, and test splits of one-minute audio recordings, and each dataset also includes a separate evaluation subset of one-minute recordings. Unless otherwise stated, the validation split was used for model selection while the test split was used for evaluation. Table III summarizes some key characteristics of the two datasets; both cover the full azimuth range and a limited range of elevations.

Characteristics | 2020 | 2021
Channel format | FOA, MIC | FOA, MIC
Moving sources | Yes | Yes
Ambiance noise | Yes | Yes
Reverberation | Yes | Yes
Unknown interferences | No | Yes
Maximum degree of polyphony | 2 | 3
Number of target sound classes | 14 | 12
TABLE III: Characteristics of TNSSE 2020 and 2021 datasets

Both TNSSE datasets were recorded using a 32-microphone Eigenmike spherical array. The 32-channel signals were converted into the FOA format, whose array response is approximately frequency-independent up to several kilohertz; we therefore compute the EIV for SALSA features only up to this limit. Out of the 32 microphones, four microphones that form a tetrahedron are used for the MIC format. Since the radius of the spherical array determines the spatial aliasing frequency, we computed the EPV for the MIC format only up to this aliasing frequency. Even though the microphones are mounted on an acoustically hard spherical baffle, we found that the far-field array model in Section II-C2 is sufficient to extract the spatial cues for the MIC format.

V-B Evaluation

To evaluate SELD performance, we used the official evaluation metrics [31] introduced in the 2021 DCASE Challenge as our default metrics. A sound event is considered a correct detection only if it has a correct class prediction and its estimated DOA is within a threshold of 20° of the DOA ground truth, which is the most commonly used value. The DOAE metrics are class-dependent, that is, the detected sound class has to be correct in order for the corresponding localization predictions to count. Since some state-of-the-art SELD systems only reported the 2020 version of the DCASE evaluation metrics [16], we also used these metrics in some experiments for fair comparison.

Both the 2020 and 2021 SELD evaluation metrics consist of four metrics: the location-dependent error rate (ER) and F1 score (F) for SED, and the class-dependent localization error (LE) and localization recall (LR) for DOAE. We also computed an aggregated SELD error metric, which was used as the ranking metric for the 2019 and 2020 DCASE Challenges, as follows:

$$\mathcal{E}_{\mathrm{SELD}} = \frac{1}{4}\left[ \mathrm{ER} + (1 - \mathrm{F}) + \frac{\mathrm{LE}}{180^{\circ}} + (1 - \mathrm{LR}) \right]. \quad (12)$$

$\mathcal{E}_{\mathrm{SELD}}$ was used for model and hyperparameter selection. A good SELD system should have a low ER, a high F, a low LE, a high LR, and a low aggregated error $\mathcal{E}_{\mathrm{SELD}}$.
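The aggregation in Eq. (12) is simple enough to express directly; the helper below just mirrors the formula, and the example checks it against a row of Table V.

```python
def aggregated_seld_error(er, f, le_deg, lr):
    """Aggregate the four DCASE metrics into a single SELD error, Eq. (12)."""
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0

# Check against the FOA SALSA row of Table V (ER=0.404, F=0.724, LE=12.5°, LR=0.727)
print(round(aggregated_seld_error(0.404, 0.724, 12.5, 0.727), 3))
# 0.256 (Table V reports 0.255, computed from unrounded metric values)
```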

V-C Hyperparameters

We used a Hann window for the STFT; the sampling rate, window length, hop length, number of FFT points, and number of mel bands together determine the input frame rate of all features. Since the model temporally downsamples the input, we temporally upsampled the final outputs to match the label frame rate. To reduce the feature dimensions and speed up training, we linearly compressed the high-frequency bands of the linear-scale features by averaging groups of consecutive bands into single bands. Unless stated otherwise, fixed-length audio chunks were used for model training. The Adam optimizer was used for all training, with fixed loss weights for SED and DOAE, and the learning rate was linearly decreased over the last epochs of training. A fixed threshold was used to binarize active-class predictions in the SED outputs.
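As an illustration of the frequency-band compression mentioned above, the sketch below averages groups of consecutive high-frequency bins; the cut-over bin and the compression factor are placeholders, since the exact values are configuration-dependent.

```python
import numpy as np

def compress_high_bands(feature, start_bin=200, factor=4):
    """Average groups of `factor` consecutive frequency bins above `start_bin`.
    feature: (n_channels, n_freq, n_frames). Both parameters are placeholders."""
    low = feature[:, :start_bin, :]
    high = feature[:, start_bin:, :]
    n_groups = high.shape[1] // factor
    high = high[:, :n_groups * factor, :]
    high = high.reshape(feature.shape[0], n_groups, factor, -1).mean(axis=2)
    return np.concatenate([low, high], axis=1)
```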

VI Results and Discussion

We performed a series of experiments to compare the performance of each input feature with and without data augmentation. Afterwards, the effect of data augmentation on each feature was examined in detail. We analyzed the effects of the magnitude and coherence tests on the performance of SELD systems running on SALSA features. Next, we studied the feature importance of the LinSpec, EIV, and EPV components that constitute SALSA features. In addition, the effect of different segment lengths on SALSA performance was investigated. For the MIC format, we examined the effect of spatial aliasing on the SELD performance with SALSA features. Finally, we compared the performance of models trained on the proposed SALSA features with several state-of-the-art SELD systems on both the 2020 and 2021 TNSSE datasets.

VI-A Comparison between SALSA and other SELD features

Feature (format) Data Aug. ER↓ F↑ LE(°)↓ LR↑ SELD↓
MelSpecIV (FOA) None 0.555 0.584 15.9 0.625 0.358
LinSpecIV (FOA) None 0.527 0.609 15.6 0.642 0.341
FOA SALSA None 0.543 0.592 15.4 0.627 0.352
MelSpecGCC (MIC) None 0.660 0.455 21.1 0.521 0.450
LinSpecGCC (MIC) None 0.622 0.506 19.6 0.583 0.410
MIC SALSA None 0.528 0.601 15.9 0.644 0.343
TABLE IV: Baseline SELD performances of different features without data augmentation.
Feature (format) Data Aug. ER↓ F↑ LE(°)↓ LR↑ SELD↓
MelSpecIV (FOA) CS + FS 0.444 0.686 11.8 0.686 0.284
LinSpecIV (FOA) CS + FS + RC 0.410 0.710 10.5 0.702 0.264
FOA SALSA CS + FS 0.404 0.724 12.5 0.727 0.255
MelSpecGCC (MIC) CS + FS + RC 0.507 0.614 17.9 0.679 0.328
LinSpecGCC (MIC) CS + FS + RC 0.514 0.606 17.8 0.676 0.333
MIC SALSA CS + FS + RC 0.408 0.715 12.6 0.728 0.259
TABLE V: SELD performances of different features with the best combination of data augmentation techniques.

Table IV shows benchmark performances of all considered features without data augmentation. Linear-scale features (LinSpec-based) appear to perform better than their mel-scale counterparts (MelSpec-based) for both audio formats. For the ‘traditional’ features, the performance gap between the FOA and MIC formats is large, with both IV-based features outperforming GCC-based features. Without data augmentation, FOA SALSA performed better than MelSpecIV but slightly worse than LinSpecIV, while MIC SALSA performed much better than both GCC-based features.

Table V shows the performance of all features with their respective best combinations of the three data augmentation techniques investigated. For the FOA format, the experimental results again showed that linear-scale features achieved better performance than mel-scale features. For the MIC format, the mel-scale features performed slightly better than the linear-scale features. The large performance gap between the FOA and MIC formats remained with data augmentation applied: IV-based features significantly outperformed GCC-based features across all the evaluation metrics. The proposed SALSA features achieved the best performance for both the FOA and MIC formats, scoring the highest F1 score and localization recall, and the lowest error rate and SELD error, among all features investigated. A higher localization recall is expected to come with a somewhat higher localization error, since more events are localized; accordingly, SALSA has a higher localization error than LinSpecIV (12.5° versus 10.5°) along with a higher localization recall. SALSA outperformed both GCC-based features by a large margin. Compared to MelSpecGCC, the SALSA feature substantially reduced the error rate (0.507 to 0.408), increased the F1 score (0.614 to 0.715), reduced the localization error (17.9° to 12.6°), and increased the localization recall (0.679 to 0.728); the overall SELD error was reduced from 0.328 to 0.259.

The performance gap between the IV- and GCC-based features, together with the similar performance of SALSA for both array formats, indicates that an exact TF mapping between the signal power and the directional cues, as in SALSA, MelSpecIV, and LinSpecIV, is much better for SELD than simply stacking spectrograms and GCC-PHAT spectra as in MelSpecGCC and LinSpecGCC. This exact TF mapping also facilitates the learning of CNNs, as the filters can more conveniently learn the multichannel local patterns in the image-like input features. Most importantly, the results show that the extracted spatial cues in SALSA features are effective for both the FOA and MIC formats; SALSA can therefore be considered a unified SELD feature regardless of the array format. The performance gains of models trained with SALSA features in both Table IV and Table V indicate that SALSA is a very effective feature for deep learning-based SELD.

VI-B Effect of data augmentation

We report the effect of different data augmentation techniques on each feature in Table VI. The experimental results clearly demonstrate that CS significantly improved the performance for all features across all metrics: on average, the error rate and localization error decreased while the F1 score and localization recall increased. CS reduced the aggregated error metric for every feature, with the larger reductions observed for the MIC features, namely MelSpecGCC, LinSpecGCC, and MIC SALSA.

When FS was used together with CS, the performance improved further for all features. Compared to CS alone, the combination of CS and FS on average further reduced the error rate and localization error, and further increased the F1 score and localization recall, lowering the aggregated SELD error for every feature (Table VI). These results show that varying the SED and DOA patterns through FS and CS helped the models learn more effectively.

When RC was used together with CS and FS, the performance was further improved for LinSpecIV and all MIC features, but not for MelSpecIV and FOA SALSA. For subsequent experiments, the best combination of data augmentation techniques for each feature, i.e., the combination with the lowest aggregated SELD error in Table VI, is used.

Data Aug. ER↓ F↑ LE(°)↓ LR↑ SELD↓
MelSpecIV
None 0.555 0.584 15.9 0.625 0.358
CS 0.472 0.655 12.0 0.653 0.308
CS+FS 0.444 0.686 11.8 0.686 0.284
CS+FS+RC 0.440 0.683 10.2 0.668 0.286
LinSpecIV
None 0.527 0.609 15.6 0.642 0.341
CS 0.459 0.669 12.3 0.678 0.295
CS+FS 0.423 0.700 10.8 0.692 0.273
CS+FS+RC 0.410 0.710 10.5 0.702 0.264
FOA SALSA
None 0.543 0.592 15.4 0.627 0.352
CS 0.462 0.655 14.9 0.666 0.306
CS+FS 0.404 0.724 12.5 0.727 0.255
CS+FS+RC 0.413 0.713 11.5 0.713 0.263
MelSpecGCC
None 0.660 0.455 21.1 0.521 0.450
CS 0.552 0.556 18.1 0.583 0.378
CS+FS 0.507 0.609 17.0 0.646 0.337
CS+FS+RC 0.507 0.614 17.9 0.679 0.328
LinSpecGCC
None 0.622 0.506 19.6 0.583 0.410
CS 0.532 0.589 18.6 0.658 0.347
CS+FS 0.514 0.604 17.7 0.666 0.336
CS+FS+RC 0.514 0.606 17.8 0.676 0.333
MIC SALSA
None 0.528 0.601 15.9 0.644 0.343
CS 0.447 0.675 13.7 0.683 0.291
CS+FS 0.431 0.696 12.3 0.709 0.274
CS+FS+RC 0.408 0.715 12.6 0.728 0.259
TABLE VI: Performance of MelSpecIV, LinSpecIV, MelSpecGCC, LinSpecGCC, and SALSA with different data augmentation.

VI-C Effect of magnitude and coherence tests

Test ER↓ F↑ LE(°)↓ LR↑ SELD↓
FOA SALSA
None 0.418 0.706 12.0 0.710 0.267
Magnitude 0.434 0.698 11.9 0.701 0.275
Magnitude + Coherence 0.404 0.724 12.5 0.727 0.255
MIC SALSA
None 0.414 0.701 12.1 0.700 0.270
Magnitude 0.407 0.716 12.3 0.721 0.260
Magnitude + Coherence 0.408 0.715 12.6 0.728 0.259
TABLE VII: Effect of magnitude and coherence tests on SALSA features.

Table VII shows the effect of the magnitude and coherence tests on the performance of models trained on SALSA features. Fig. 3 indicates that a substantial proportion of all TF bins are removed by the magnitude test, and that the coherence test removes an additional portion of the remaining bins. These tests aim to include only approximately SS TF bins with reliable directional cues. The magnitude test alone improved the performance of the MIC format but not the FOA format. On the other hand, using both the magnitude and coherence tests significantly improved the performance of the FOA format. Overall, when both tests were applied to compute SALSA features, the performance improved compared with applying no test. For subsequent experiments, both tests were applied to compute SALSA features.

VI-D Feature importance

Components ER↓ F↑ LE(°)↓ LR↑ SELD↓
FOA SALSA
LinSpec 0.835 0.123 87.2 0.608 0.647
EIV 0.577 0.557 14.1 0.571 0.382
Mono-SALSA 0.421 0.705 12.8 0.723 0.266
SALSA 0.404 0.724 12.5 0.727 0.255
MIC SALSA
LinSpec 0.506 0.616 18.1 0.698 0.323
EPV 0.629 0.502 17.4 0.547 0.419
Mono-SALSA 0.443 0.680 14.7 0.710 0.284
SALSA 0.408 0.715 12.6 0.728 0.259
TABLE VIII: Feature importance of FOA and MIC SALSA.

Table VIII reports the feature importance of each component of the SALSA feature: the multichannel log-linear spectrogram LinSpec, as well as the spatial features EIV and EPV for the FOA and MIC formats, respectively. Mono-SALSA is an ablation feature formed by stacking the log-linear spectrogram of only the first microphone with the corresponding spatial features. For both formats, SALSA achieved the best performance, followed by Mono-SALSA.

For the FOA format, LinSpec alone could not meaningfully estimate DOAs. One possible reason is that the spatial cues of the FOA format are encoded in the signed amplitude differences between channels, whereas LinSpec retains only the unsigned magnitude differences. This sign ambiguity causes confusion between the input features and the target labels, so the model trained on the LinSpec feature failed to detect the correct DOAs. On the other hand, the model trained on only the EIV feature performed reasonably well: the EIV feature preserves some coarse spatiotemporal patterns of each sound class (see Fig. 1), so the model was able to distinguish different sound classes. The SALSA feature significantly outperformed its constituent features, LinSpec and EIV. In the absence of the X, Y, and Z channels of the linear spectrograms, Mono-SALSA performed slightly worse than SALSA on the SED metrics but similarly on the DOAE metrics. These results suggest that the main contribution of the X, Y, and Z channels of the linear spectrograms is to help distinguish different sound classes.

For the MIC format, LinSpec alone performed reasonably well for SELD. Referring to Section V-A, the MIC format of the DCASE SELD dataset is not a true far-field array but rather a baffled microphone array, where some spatial cues are also encoded in the magnitude differences between microphones. Therefore, the model trained on the LinSpec feature is able not only to classify sound sources but also to estimate DOAs. The EPV feature alone returned lower SELD performance than the EIV feature of the FOA format, likely because the EPV is computed with a much lower upper cutoff frequency (the spatial aliasing frequency) than the EIV. The model trained on only the EPV also has the highest error rate and the lowest localization recall among all ablation models of the MIC format. The SALSA feature significantly outperformed its individual feature components across all metrics. The performance gap between SALSA and Mono-SALSA is larger for the MIC format than for the FOA format, likely because the spatial cues are also encoded in the magnitudes of the different input channels and the EPV is zeroed out above the upper cutoff frequency. Therefore, the multichannel nature of the spectrograms plays an important role in both sound class recognition and DOA estimation.

VI-E Effect of spatial aliasing on SELD for microphone array

Cutoff frequency ER↓ F↑ LE(°)↓ LR↑ SELD↓
Aliasing frequency from microphone arc length 0.403 0.714 12.5 0.707 0.261
Aliasing frequency from array radius (default) 0.408 0.715 12.6 0.728 0.259
No aliasing cutoff 0.425 0.698 12.8 0.720 0.270
TABLE IX: Effect of spatial aliasing on SALSA feature of MIC format.

For narrowband signals, spatial aliasing occurs at high frequency bins where half of the signal wavelength is less than the distance between two microphones. To investigate the effect of spatial aliasing when SALSA features are used for the MIC format, we report the performance of SALSA with different upper cutoff frequencies in Table IX. The upper cutoff frequencies were computed using the spatial aliasing formula for narrowband signals, $f_{\mathrm{alias}} = c / (2 d_{\max})$, where $d_{\max}$ is the maximum distance between any two microphones in the array. The investigated values of $d_{\max}$ are the arc length between any two microphones and the radius of the Eigenmike array, which correspond to the two aliasing frequencies in Table IX. In addition, we also tested a high cutoff frequency to investigate the case where spatial aliasing is ignored. Table IX shows that the two aliasing-based cutoff frequencies result in similar performances. One possible reason is that spatial aliasing might not occur in all of the microphone pairs beyond the stricter cutoff for some DOAs. On the other hand, with the highest cutoff frequency, spatial aliasing occurs in too many high-frequency bins, resulting in slightly lower performance than the radius-based cutoff. Overall, the impact of spatial aliasing appears to be mild, with the model trained using the looser, radius-based aliasing frequency achieving the best SELD error. This result agrees with the finding in [8], where broadband signals were shown not to suffer from spatial aliasing unless they contain strong harmonic components.

VI-F Effect of segment length for training

Length ER↓ F↑ LE(°)↓ LR↑ SELD↓
FOA SALSA
Shorter segments 0.468 0.658 13.4 0.646 0.310
Default segments 0.404 0.724 12.5 0.727 0.255
Longer segments 0.414 0.717 11.8 0.720 0.261
MIC SALSA
Shorter segments 0.449 0.664 14.1 0.678 0.297
Default segments 0.408 0.715 12.6 0.728 0.259
Longer segments 0.413 0.714 12.7 0.730 0.260
TABLE X: Effect of segment length during training on SELD performance using SALSA.

Different sound events often have different durations, so the segment length used during training may affect model performance; the sound event lengths in the TNSSE 2021 dataset vary considerably. We present the SELD performance of models trained with different input segment lengths in Table X. Models trained with the default segment length significantly outperformed models trained with shorter segments for both the FOA and MIC formats. However, further increasing the segment length did not improve the overall performance. Thus, it appears that the model requires a certain minimum sequence length to learn the temporal dependencies sufficiently, although this temporal context does not need to be very long, since the model likely relies more on recent frames than on older frames.

VI-G Comparisons with state-of-the-art methods for SELD

We compared models trained with the proposed SALSA features against state-of-the-art (SOTA) methods on three datasets: the test and evaluation splits of the TNSSE 2020 dataset [30] and the test split of the TNSSE 2021 dataset [29]. Some of the SOTA methods used single models while others used ensembles. We used the same single-model SELD network described in Section IV to train all of our reported models, i.e., no ensembling was used. To further improve the performance of our models, we applied test-time augmentation (TTA) during inference [35]. TTA swaps the channels of the SALSA features in a manner similar to the CS augmentation technique employed during training; the estimated DOA outputs are then rotated back to the original axes and averaged to produce the final results. During inference, whole one-minute features were passed into the models without being split into smaller chunks. Since the SOTA results on the TNSSE 2020 dataset were reported using the 2020 SELD evaluation metrics, we evaluated our models using both the 2020 and 2021 metrics: the former for fair comparison with past works, and the latter for ease of comparison with future works.

VI-G1 Performance on the test split of the TNSSE 2020 dataset

System Format ER↓ F↑ LE(°)↓ LR↑
2020 Metrics
DCASE baseline [30] FOA 0.72 0.374 22.8 0.607
Shimada et al. [35] w/o TTA FOA 0.36 0.730 10.2 0.791
Shimada et al. [35] w/ TTA FOA 0.32 0.768 7.9 0.805
Wang et al. [41] FOA+MIC 0.29 0.764 9.4 0.828
(’20 #1) Wang et al. [41] FOA+MIC 0.260 0.800 7.4 0.847
FOA SALSA w/o TTA FOA 0.338 0.748 7.9 0.784
MIC SALSA w/o TTA MIC 0.379 0.717 8.2 0.762
FOA SALSA w/ TTA FOA 0.318 0.761 7.4 0.797
MIC SALSA w/ TTA MIC 0.341 0.741 7.8 0.783
2021 Metrics
FOA SALSA w/o TTA FOA 0.344 0.755 8.1 0.755
MIC SALSA w/o TTA MIC 0.383 0.727 8.3 0.738
FOA SALSA w/ TTA FOA 0.323 0.768 7.5 0.763
MIC SALSA w/ TTA MIC 0.342 0.749 7.9 0.744
Some of the listed systems are ensemble models.
TABLE XI: SELD performances of SOTA systems and SALSA-based models on test split of the TNSSE 2020 dataset.

Table XI shows the performance on the test split of the TNSSE 2020 dataset of SOTA systems and of our SALSA models for both the FOA and MIC formats. The FOA SALSA models performed slightly better than their MIC counterparts. TTA consistently improved both the location-dependent SED metrics and the localization metrics. The single model by Wang et al. [41] used both the FOA and MIC data as input features and achieved the best error rate and localization recall among the single models; however, it is unrealistic to have both FOA and MIC data available in real-life applications. Our FOA SALSA model outperformed the DCASE baseline [30] by a large margin, and performed better than [35] in terms of error rate, F1 score, and localization error. Our FOA SALSA model with TTA also performed on par with the TTA version of [35]. On average, the 2021 evaluation metrics return similar error rate, F1 score, and localization error values to the 2020 metrics, but a stricter (lower) localization recall.

VI-G2 Performance on the evaluation split of the TNSSE 2020 dataset

System Format ER↓ F↑ LE(°)↓ LR↑
2020 Metrics
DCASE’21 baseline [30] MIC 0.69 0.413 23.1 0.624
Cao et al. [4] FOA 0.233 0.832 6.8 0.861
(’20 #2) Nguyen et al. [19] FOA 0.23 0.820 9.3 0.900
(’20 #1) Wang et al. [41] FOA+MIC 0.200 0.849 6.0 0.885
FOA SALSA w/o TTA FOA 0.237 0.823 6.9 0.858
MIC SALSA w/o TTA MIC 0.227 0.836 6.7 0.869
FOA SALSA w/ TTA FOA 0.219 0.840 6.5 0.869
MIC SALSA w/ TTA MIC 0.202 0.854 6.0 0.884
2021 Metrics
FOA SALSA w/o TTA FOA 0.244 0.830 7.0 0.831
MIC SALSA w/o TTA MIC 0.234 0.842 6.7 0.849
FOA SALSA w/ TTA FOA 0.225 0.844 6.6 0.838
MIC SALSA w/ TTA MIC 0.208 0.858 6.0 0.856
Some of the listed systems are ensemble models.
TABLE XII: SELD performances of SOTA systems and SALSA-based models on evaluation split of TNSSE 2020 dataset.

Table XII shows the performance on the evaluation split of the TNSSE 2020 dataset of SOTA systems and of our SALSA models for both the FOA and MIC formats. Our models were trained using all audio clips from the development split of the TNSSE 2020 dataset. Interestingly, when more data were available for training, models trained on MIC SALSA features performed better than models trained on FOA SALSA features across all metrics. The FOA SALSA model is competitive with [4], while the MIC SALSA model performed slightly better. The MIC SALSA model with TTA achieved comparable performance to the top ensemble model [41] from the 2020 DCASE Challenge, with a similar error rate, localization error, and localization recall, and a slightly higher F1 score. The 2021 metrics again returned similar error rate, F1 score, and localization error values, and a stricter localization recall, compared to the 2020 metrics.

VI-G3 Performance on the test split of the TNSSE 2021 dataset

System Format ER↓ F↑ LE(°)↓ LR↑
2021 Metrics
DCASE baseline [29] FOA 0.73 0.307 24.5 0.448
(’21 #1) Shimada et al. [36] FOA 0.43 0.699 11.1 0.732
(’21 #4) Lee et al. [14] FOA 0.46 0.609 14.4 0.733
FOA SALSA w/o TTA FOA 0.404 0.724 12.5 0.727
MIC SALSA w/o TTA MIC 0.408 0.715 12.6 0.728
FOA SALSA w/ TTA FOA 0.376 0.744 11.1 0.722
MIC SALSA w/ TTA MIC 0.376 0.735 11.2 0.722
(’21 #2) Nguyen et al. [22] FOA 0.370 0.737 11.2 0.741
Some of the listed systems are ensemble models.
TABLE XIII: SELD performances of SOTA systems and SALSA-based models on test split of TNSSE 2021 dataset.

Table XIII shows the performance on the test split of the TNSSE 2021 dataset of SOTA systems and of our SALSA models for both the FOA and MIC formats. The FOA SALSA models performed similarly to the MIC SALSA models, with a slightly higher F1 score. TTA significantly improved their error rate, F1 score, and localization error, but not the localization recall. The models trained on SALSA features of both formats outperformed the DCASE baseline by a large margin, and performed better than the highest-ranked system of the 2021 DCASE Challenge [36] in terms of error rate and F1 score. With TTA, the models trained on SALSA features achieved a much better error rate and F1 score, a similar localization error, and a slightly lower localization recall compared to [36]. An ensemble model trained on a variant of our proposed SALSA features [22] officially ranked second in the team category of the SELD task in the 2021 DCASE Challenge. The SALSA variant in [22] included an additional channel for the estimated DRR at each TF bin.

Compared to the TNSSE 2020 dataset, the TNSSE 2021 dataset is more challenging since it has more overlapping sound events and unknown directional interferences. Overall, the performances of models listed in Table XIII are lower than those of the models listed in Table XI across all metrics.

The results in Tables XI to XIII consistently show that the proposed SALSA features for both the FOA and MIC formats are very effective for SELD. Simple CRNN models trained on SALSA features surpassed or performed comparably to many SOTA systems, both single models and ensembles, on different datasets across all evaluation metrics.

VI-H Qualitative evaluation

Fig. 5: Visualization of ground truth and predicted azimuth for test clip fold6_room2_mix041 of the TNSSE 2021 dataset. Legend lists the ground truth events in chronological order. Sound classes are color-coded.

Fig. 5 shows the ground truth and predicted azimuth angles for a one-minute audio clip from the test set of the TNSSE 2021 dataset. The angles were predicted by a CRNN model trained on FOA SALSA features. Overall, the trajectories of the predicted events were smooth and followed the ground truths closely. The model was able to correctly detect the sound classes and estimate the DOAs across different numbers of overlapping sound sources (up to three). An unknown interference was briefly misclassified as a piano event (purple) and an alarm event (pink). Since we used the class-wise output format, which predicts at most one DOA per class per frame, the model only predicted one crash event when two crash events overlapped.

VII Conclusion

In conclusion, we proposed a novel and effective feature for polyphonic SELD named Spatial cue-Augmented Log-SpectrogrAm (SALSA), which consists of multichannel log-spectrograms and the normalized principal eigenvector of the spatial CM at each TF bin of the spectrograms. Two key characteristics contribute to the effectiveness of the proposed feature. Firstly, SALSA spectrotemporally aligns the signal power and the source directional cues, which aids in resolving overlapping sound sources; this locally linear alignment works well with CNNs, whose filters learn multichannel local patterns in the image-like input features. Secondly, SALSA includes helpful directional cues extracted from the principal eigenvectors of the spatial CMs. Depending on the array type, where the directional cues may be encoded as inter-channel amplitude and/or phase differences, the principal eigenvectors can be easily normalized to extract these cues. Therefore, SALSA features are versatile and can be used with different microphone array formats, such as FOA and MIC.

The proposed SALSA features can be further enhanced by incorporating signal processing-based methods such as the magnitude and coherence tests to select more reliable directional cues and improve SELD performance. In addition, for microphone arrays, spatial aliasing has little effect on the performance of models trained on SALSA. More importantly, the training segment length must be sufficiently long for the model to capture the temporal dependencies in the data.

In addition, data augmentation techniques such as channel swapping, frequency shifting, and random cutout can be readily applied to SALSA on the fly during training. These data augmentation techniques mitigated the problem of small datasets and significantly improved the performance of models trained on SALSA features. Simple CRNN models trained on the SALSA features achieved similar or even better SELD performance than many complicated state-of-the-art systems on the TNSSE 2020 and 2021 datasets.
