Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3Net

by Kazuki Shimada, et al.

This report describes our systems submitted to DCASE2020 task 3: Sound Event Localization and Detection (SELD). We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines their results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonic (FOA) signals, and a multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.




1 Introduction

Sound event localization and detection (SELD) is the task of identifying both the direction of arrival (DOA) and the type of a sound. A number of methods have tackled this challenging problem by decomposing it into several subtasks: estimation of the number of sources, DOA estimation, and sound event detection (SED). Although this simplifies the SELD problem and could therefore improve the performance of each subtask, it also increases system complexity and computational cost. Here, we consider both single-stage and two-stage systems. The single-stage system solves both the SED and SEL tasks simultaneously using an activity-coupled Cartesian DOA vector (ACCDOA) representation. ACCDOA assigns the activity of a sound event to the length of the corresponding Cartesian DOA vector. The two-stage system, on the other hand, first handles SED as a frame-wise classification problem and then combines the result with the DOA estimation.

2 System

In this section, we first give an overview of our two systems, namely, the ACCDOA (single-stage) system and the two-stage system. Then we explain the parts of our pipelines: the features, data augmentation, network architecture, and loss function.

2.1 System overview

A schematic flow of the ACCDOA system is shown in Fig. 1. Two data augmentation techniques are applied to the input signals prior to feature extraction, while one data augmentation technique that exploits multichannel information in the feature domain is applied after feature extraction. Finally, the network outputs frame-wise ACCDOA vectors for 14 sound events. The magnitude of each vector corresponds to the probability of the sound event being active, while the direction of the vector points toward the location of the source. The model using the ACCDOA representation is trained to minimize the distance between the estimated and the target coordinates.
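
The coupling of activity and direction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `accdoa_target` and `accdoa_decode`, the azimuth/elevation convention, and the 0.5 threshold on the vector length are assumptions made for the example.

```python
import numpy as np

def accdoa_target(activity, azimuth, elevation):
    """Build ACCDOA targets: unit DOA vectors scaled by event activity.

    activity:           (frames, classes) array of 0/1 activity flags
    azimuth, elevation: (frames, classes) arrays in radians
    returns:            (frames, classes, 3) array of Cartesian vectors
    """
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    doa = np.stack([x, y, z], axis=-1)
    # An inactive event collapses to the zero vector.
    return activity[..., None] * doa

def accdoa_decode(vectors, threshold=0.5):
    """Recover activity and direction from ACCDOA vectors.

    threshold is a hypothetical cut-off on the vector length.
    """
    norm = np.linalg.norm(vectors, axis=-1)
    active = norm > threshold
    azimuth = np.arctan2(vectors[..., 1], vectors[..., 0])
    sin_elevation = vectors[..., 2] / np.maximum(norm, 1e-8)
    elevation = np.arcsin(np.clip(sin_elevation, -1.0, 1.0))
    return active, azimuth, elevation
```

Because a single vector carries both pieces of information, one regression output per class is enough and no separate activity branch is needed.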

Figure 1: Illustration of our ACCDOA system.

The two-stage system is inspired by the work in [1]. It is characterized by three steps: training only the SED branch, transferring part of the network parameters from the SED branch to the DOA estimation branch, and then training the DOA estimation branch. The data augmentation techniques, the features, and the network used in this system are exactly the same as those in the ACCDOA system.

2.2 Feature

Multichannel amplitude spectrograms and inter-channel phase differences (IPDs) are used as frame-wise features. Here, the IPD, $\mathrm{IPD}(t, f, p, q) = \angle x_{t,f,p} - \angle x_{t,f,q}$, is computed from the short-time Fourier transform (STFT) coefficients $x_{t,f,p}$ and $x_{t,f,q}$, where $t$, $f$, $p$, and $q$ denote the time frame, the frequency bin, the microphone channel $p$, and the channel $q$, respectively. We fix $p$ as the reference channel to compute relative IPDs between it and all the other channels, $q \neq p$. Since the input consists of four-channel signals, we can extract four amplitude spectrograms and three IPDs.
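
As a rough sketch of this feature extraction, the snippet below derives the amplitude spectrograms and the relative IPDs from a precomputed multichannel STFT. The function name `amplitude_and_ipd`, the zero-based reference index `ref=0`, and the wrapping of the IPDs into [0, 2π) are assumptions for illustration only.

```python
import numpy as np

def amplitude_and_ipd(stft, ref=0):
    """Compute frame-wise features from a complex multichannel STFT.

    stft: complex array of shape (channels, frames, bins)
    ref:  index of the (assumed) reference channel
    returns amplitude (channels, frames, bins) and ipd (channels - 1, frames, bins)
    """
    amplitude = np.abs(stft)
    phase = np.angle(stft)
    others = [q for q in range(stft.shape[0]) if q != ref]
    # Phase of the reference channel minus phase of every other channel.
    ipd = phase[ref][None, :, :] - phase[others]
    # Assumption: wrap so that equivalent phase differences coincide.
    ipd = np.mod(ipd, 2.0 * np.pi)
    return amplitude, ipd
```

With a four-channel input this yields four amplitude spectrograms and three IPD maps, matching the counts given above.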

2.3 Data augmentation

To promote the generalization of the model, we exploit the following data augmentation techniques during the training.

  • EMDA: As described in [10, 8], we apply the equalized mixture data augmentation (EMDA) method where up to two audio events are mixed with random amplitudes, delays, and the modulation of frequency characteristics, i.e. equalization.

  • Rotation: We also adopt the spatial augmentation method in [4]. It rotates the training data represented in the first-order Ambisonic (FOA) format and allows us to increase the number of DOA labels without losing the physical relationships between steering vectors and observations. We consider eight rotation patterns for the azimuth and elevation.

  • Multichannel SpecAugment: We propose a multichannel version of SpecAugment [6, 14]. In addition to the time-frequency hard masking schemes applied to amplitude spectrograms, we extend the masking to the channel dimension. The target channel for the channel masking is chosen from the $C$ input channels, where $C$ denotes the number of microphone channels. For the IPD features, instead of multiplying the original values by a mask value, we replace them with random values sampled from a uniform distribution ranging from 0 to $2\pi$.
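
The channel-masking part of the last item could look roughly like the sketch below. It is not the authors' code: the function name `mask_channel`, the use of channel 0 as the IPD reference, and the rule for deciding which IPD maps are affected by the masked channel are all assumptions.

```python
import numpy as np

def mask_channel(amplitude, ipd, rng):
    """Mask one randomly chosen channel (illustrative sketch).

    amplitude: (C, frames, bins) amplitude spectrograms
    ipd:       (C - 1, frames, bins) IPDs relative to channel 0 (assumed)
    rng:       numpy random Generator
    """
    amplitude = amplitude.copy()
    ipd = ipd.copy()
    target = int(rng.integers(amplitude.shape[0]))
    # Hard mask: the amplitude of the target channel is set to zero.
    amplitude[target] = 0.0
    if target == 0:
        affected = slice(None)  # every IPD involves the reference channel
    else:
        affected = slice(target - 1, target)
    # IPDs are replaced with uniform random phases rather than multiplied by zero.
    ipd[affected] = rng.uniform(0.0, 2.0 * np.pi, size=ipd[affected].shape)
    return amplitude, ipd, target
```

Replacing phases with random values avoids presenting the network with a phase difference of zero, which would otherwise look like a valid spatial cue.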


2.4 Network architecture

As the network architecture, we adopt the MMDenseLSTM architecture, which has shown state-of-the-art performance in music source separation [9]. The adaptation to the SELD problem includes four modifications. First, we omit the dense blocks in the up-sampling path because high frame-rate prediction is not necessary for the SELD problem. Second, we replace the LSTM cells with GRU cells, which exist only in the bottleneck part. Third, batch normalization is replaced with network deconvolution [13]. Finally, we employ dilated convolutions in the dense blocks, since covering a large input field has been shown to be effective [8]. In each dense block, the dilation factor of the initial convolution is set to one and doubles every time the next convolution is applied, as in WaveNet [11]. We call this architecture RD3Net; it is illustrated in Figure 2.
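
To see why doubling the dilation factor enlarges the input field quickly, the small calculation below follows the sequential path through one block. The kernel size of 3 and the helper names are assumptions for illustration, and the dense skip connections are ignored.

```python
def dilation_factors(num_layers):
    """Dilation of each convolution in a block: 1, 2, 4, ..."""
    return [2 ** i for i in range(num_layers)]

def receptive_field(num_layers, kernel_size=3, dilated=True):
    """Receptive field along one axis after stacking the convolutions."""
    field = 1
    for d in dilation_factors(num_layers):
        # Each layer widens the field by (kernel_size - 1) times its dilation.
        field += (kernel_size - 1) * (d if dilated else 1)
    return field

print(dilation_factors(4))                  # [1, 2, 4, 8]
print(receptive_field(4))                   # 31
print(receptive_field(4, dilated=False))    # 9
```

Under these assumptions, four dilated layers cover 31 input positions per axis, compared with 9 for the same stack without dilation, at an identical parameter count.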

Figure 2: Illustration of RD3Net architecture.

2.5 Loss function

In the ACCDOA system, we solve the multi-output regression with a mean square error (MSE) loss. In the other system using the two-stage training scheme, we use a binary cross entropy (BCE) for the SED classification task and a masked MSE for the DOA regression task [1]. The latter is based on an MSE masked with the ground truth activations of each class, hence not contributing to the training when the corresponding sound event is not active.
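
A masked MSE of the kind described for the two-stage system might be written as below. This is a sketch under assumptions: the name `masked_mse` and the normalization by the number of active coordinates are choices made here, not details taken from the report or from [1].

```python
import numpy as np

def masked_mse(pred_doa, target_doa, activity):
    """MSE over DOA coordinates, counted only where an event is active.

    pred_doa, target_doa: (frames, classes, 3) Cartesian DOA estimates
    activity:             (frames, classes) ground-truth 0/1 activations
    """
    mask = activity[..., None]
    squared_error = mask * (pred_doa - target_doa) ** 2
    # Assumed normalization: average over active coordinates only.
    num_active = mask.sum() * pred_doa.shape[-1]
    return squared_error.sum() / max(num_active, 1.0)
```

Frames and classes with no active event contribute nothing to the loss, so the DOA branch is never penalized for whatever it outputs there. In the ACCDOA system no mask is needed, because inactive events already have the zero vector as their regression target.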

2.6 Post-processing

During inference, we split the 60-second inputs into shorter segments with overlap, process each segment, and average the results of the overlapped frames. To further improve the performance, we apply post-processing with the following procedure: rotating the FOA data, estimating the ACCDOA vectors, rotating the vectors back, and averaging the vectors over the different rotation patterns. As in Section 2.3, we consider eight rotations.
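
The averaging of overlapped frames can be sketched as follows. The function name `overlap_average` and its argument layout are hypothetical; the network that produces each segment output is outside the scope of the example.

```python
import numpy as np

def overlap_average(segment_outputs, starts, total_frames):
    """Average frame-wise outputs of overlapping segments.

    segment_outputs: list of (segment_frames, dims) arrays
    starts:          first frame index of each segment in the full recording
    total_frames:    number of frames in the full recording
    """
    dims = segment_outputs[0].shape[1]
    summed = np.zeros((total_frames, dims))
    counts = np.zeros((total_frames, 1))
    for out, start in zip(segment_outputs, starts):
        end = start + out.shape[0]
        summed[start:end] += out
        counts[start:end] += 1.0
    # Frames covered by no segment keep a zero output.
    return summed / np.maximum(counts, 1.0)
```

The same accumulate-and-divide pattern applies to the rotation-based post-processing, once each rotated estimate has been rotated back to the original coordinate frame.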

2.7 Model ensemble

| Name | # of models | Base systems |
|---|---|---|
| Ensemble 1 | 4 | ACCDOA w/ RD3Net ×3; Two-stage w/ RD3Net |
| Ensemble 2 | 5 | ACCDOA w/ RD3Net ×3; Two-stage w/ RD3Net, CRNN |
| Ensemble 3 | 5 | ACCDOA w/ RD3Net ×3; Two-stage w/ RD3Net ×2 |

Table 1: Ensemble configuration.

| Submission label | System | Val. recall (%) | Val. error rate | Val. F-score (%) | Test recall (%) | Test error rate | Test F-score (%) |
|---|---|---|---|---|---|---|---|
| - | Baseline FOA | 62.0 | 0.72 | 37.7 | 60.7 | 0.72 | 37.4 |
| - | ACCDOA w/ CRNN | 79.9 | 0.34 | 77.2 | 73.8 | 0.40 | 70.5 |
| - | Two-stage w/ RD3Net | 86.2 | 0.29 | 80.4 | 78.2 | 0.38 | 73.0 |
| Shimada_SONY_task3_1 | ACCDOA w/ RD3Net | 87.0 | 0.24 | 84.4 | 80.5 | 0.32 | 76.8 |
| Shimada_SONY_task3_2 | Ensemble 1 | 90.0 | 0.20 | 87.6 | 82.9 | 0.29 | 79.4 |
| Shimada_SONY_task3_3 | Ensemble 2 | 90.6 | 0.18 | 88.0 | 83.7 | 0.28 | 79.9 |
| Shimada_SONY_task3_4 | Ensemble 3 | 90.6 | 0.18 | 88.0 | 83.5 | 0.29 | 80.0 |

Table 2: SELD performance of our systems evaluated using joint localization/detection metrics for the development set.

It is well known that averaging the outputs of several models trained under different conditions, such as initial parameters, stopping iteration, input features, and model architectures, often improves on the individual models. Here, we perform the model ensemble by taking a weighted average of the outputs. A weight is assigned to each class and model; thus the dimension of the weights is $K \times M$, where $K$ is the number of classes and $M$ is the number of models. The weights are estimated by stochastic gradient descent on the validation set using the same MSE criterion as the ACCDOA system.
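
The weighted averaging with one weight per class and model can be sketched as below. For brevity the example uses plain full-batch gradient descent instead of the stochastic variant mentioned above; the function name `fit_ensemble_weights`, the uniform initialization, the learning rate, and the step count are all assumptions.

```python
import numpy as np

def fit_ensemble_weights(outputs, target, lr=0.1, steps=500):
    """Fit per-class, per-model ensemble weights by minimizing the MSE.

    outputs: (models, frames, classes, 3) ACCDOA estimates of each model
    target:  (frames, classes, 3) reference ACCDOA vectors
    returns: weights of shape (classes, models)
    """
    num_models, _, num_classes, _ = outputs.shape
    weights = np.full((num_classes, num_models), 1.0 / num_models)
    for _ in range(steps):
        # Weighted sum over models, one weight per (class, model) pair.
        combined = np.einsum("km,mfkd->fkd", weights, outputs)
        error = combined - target
        # Gradient of the mean squared error with respect to each weight.
        grad = 2.0 * np.einsum("fkd,mfkd->km", error, outputs) / error.size
        weights -= lr * grad
    return weights
```

Starting from uniform weights means the first iterate equals a plain average, so any decrease in validation MSE is a gain over simple averaging.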

The systems used for the ensemble are shown in Table 1. Here, CRNN means the convolutional recurrent neural network architecture used in [1]. Some of the models use PCEN [3] with and without a mel filter, cosIPDs, and sinIPDs [12] as input features, instead of the amplitude spectrograms and IPDs.

3 Experimental evaluation

In this section, we show the experimental results of our systems on the development dataset.

| Recordings | Test recall (%) | Test error rate | Test F-score (%) |
|---|---|---|---|
| Without polyphony | 83.1 | 0.25 | 81.3 |
| With polyphony | 79.0 | 0.36 | 74.3 |

Table 3: SELD performance without and with polyphony for the development set.

3.1 Experimental settings

We evaluated our approach on the development set of TAU Spatial Sound Events 2020 - Ambisonic using the suggested setup [7]. In this setup, four metrics were used for the evaluation [5]. The first was the localization error $LE_{\mathrm{CD}}$, which expresses the average angular distance between predictions and references of the same class. The second was a simple localization recall metric, which expresses the true positive rate of how many of these localization estimates were detected in a class out of the total number of class instances. The other two metrics were the location-dependent error rate ($ER_{20^\circ}$) and F-score ($F_{20^\circ}$), where predictions were considered true positives only when the distance from the reference was less than $20^\circ$.
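
The angular distance underlying the localization error and the location-dependent metrics can be computed as in the sketch below. The helper names are hypothetical, the threshold is passed in by the caller, and the full metric computation (class matching, averaging, error-rate bookkeeping) is deliberately left out.

```python
import math

def angular_distance_deg(u, v):
    """Angle in degrees between two Cartesian direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def is_true_positive(pred, ref, same_class, threshold_deg):
    """Location-dependent match: right class and close enough in angle."""
    return same_class and angular_distance_deg(pred, ref) < threshold_deg
```

Because the vectors are normalized inside the function, an ACCDOA output whose length differs from one can be passed in directly.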

The sampling frequency was set to 24 kHz. The STFT was applied with a frame length of 20 ms and a frame hop of 10 ms. The input to the networks was 1,024 frames long. During inference, the frame shift length was set to 20 frames. We used a batch size of 32. Each training sample was generated on the fly [2]. The learning rate was set to 0.001 and multiplied by 0.9 every 20,000 iterations. We used the Adam optimizer with weight decay.
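
For concreteness, the settings above translate into the following numbers. The staircase form of the learning-rate decay is an interpretation of the description, and the constant and function names are made up for the example.

```python
SAMPLE_RATE = 24_000                      # Hz
FRAME_LENGTH = SAMPLE_RATE * 20 // 1000   # 20 ms window in samples
FRAME_HOP = SAMPLE_RATE * 10 // 1000      # 10 ms hop in samples

def learning_rate(iteration, base=0.001, factor=0.9, step=20_000):
    """Step decay: multiply the rate by `factor` every `step` iterations."""
    return base * factor ** (iteration // step)

print(FRAME_LENGTH, FRAME_HOP)   # 480 240
print(learning_rate(0))          # 0.001
```

Under this reading the rate stays at 0.001 for the first 20,000 iterations, then drops to 0.0009, then to 0.00081, and so on.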

All final submitted systems were trained on folds 3, 4, 5, and 6 of the dataset, except "Shimada_SONY_task3_4", where one of the two-stage models in the ensemble was trained on folds 1, 3, 4, 5, and 6. Fold 2 was always used as the validation set.

Figure 3: Visualization of SELD output for ACCDOA w/ RD3Net.

3.2 Experimental results

Table 2 shows the performance of our systems on the development set. As shown in the table, our systems outperformed the baseline on every metric by a large margin. We compared the performance of RD3Net with that of the CRNN used in [1], both within the ACCDOA system. The results show significant improvements over the CRNN in all metrics, demonstrating the advantage of RD3Net. We also compared the performance of the ACCDOA system and the two-stage system. In the testing split, the localization recall of the ACCDOA system was 2.3 points higher than that of the two-stage system, while its location-dependent F-score was 3.8 points higher. This suggests that the ACCDOA system is more effective in location-aware detection. The model ensemble improved on the single model by 3.2 points in the testing split. Table 3 shows the performance of the ACCDOA system "Shimada_SONY_task3_1" on recordings without and with polyphony. We observed that the performance on recordings without polyphony was better than that on recordings with polyphony.

An example of the proposed ACCDOA system output from the test split is visualized in Fig. 3. Each event class is represented by a unique color. We can observe that our system successfully performs joint detection, localization, and tracking of dynamic sources in the recording.

4 Conclusion

We presented our approach to DCASE2020 task 3, Sound Event Localization and Detection. Our systems use the ACCDOA representation to solve both SED and SEL tasks in a unified manner. Moreover, we proposed an efficient network architecture called RD3Net. Our systems showed superior performance over the baselines with a single model. Furthermore, we observed further improvement with an ensemble of the ACCDOA and two-stage systems.

5 Acknowledgement

We would like to thank Yuichiro Koyama for the useful discussions on ACCDOA.


References

  • [1] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley (2019) Polyphonic sound event detection and localization using a two-stage strategy. arXiv preprint arXiv:1905.00268.
  • [2] H. Erdogan and T. Yoshioka (2018) Investigations on data augmentation and loss functions for deep learning based speech-background separation. In Proc. of Interspeech, pp. 3499–3503.
  • [3] V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello (2018) Per-channel energy normalization: why and how. IEEE Signal Processing Letters 26 (1), pp. 39–43.
  • [4] L. Mazzon, M. Yasuda, Y. Koizumi, and N. Harada (2019) Sound event localization and detection using FOA domain spatial augmentation. In Proc. of DCASE workshop.
  • [5] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen (2019) Joint measurement of localization and detection of sound events. In Proc. of IEEE WASPAA.
  • [6] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. of Interspeech, pp. 2613–2617.
  • [7] A. Politis, S. Adavanne, and T. Virtanen (2020) A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. arXiv preprint arXiv:2006.01919.
  • [8] N. Takahashi, M. Gygli, and L. Van Gool (2017) AENet: learning deep audio features for video analysis. IEEE Trans. on Multimedia 20, pp. 513–524.
  • [9] N. Takahashi, N. Goswami, and Y. Mitsufuji (2018) MMDenseLSTM: an efficient combination of convolutional and recurrent neural networks for audio source separation. In Proc. IWAENC.
  • [10] N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool (2016) Deep convolutional neural networks and data augmentation for acoustic event detection. In Proc. Interspeech.
  • [11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • [12] Z. Wang, J. Le Roux, and J. R. Hershey (2018) Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. of IEEE ICASSP, pp. 1–5.
  • [13] C. Ye, M. Evanusa, H. He, A. Mitrokhin, T. Goldstein, J. A. Yorke, C. Fermuller, and Y. Aloimonos (2020) Network deconvolution. In Proc. ICLR.
  • [14] J. Zhang, W. Ding, and L. He (2019) Data augmentation and prior knowledge-based regularization for sound event localization and detection. In Proc. of DCASE workshop.