1 Introduction
Sound event localization and detection (SELD) is the task of identifying both the direction of arrival (DOA) and the type of a sound event. A number of methods have tackled this challenging problem by decomposing it into several subtasks: estimating the number of sources, DOA estimation, and sound event detection (SED). Although this simplifies the SELD problem and therefore could improve the performance of each subtask, it also increases system complexity and computational cost. Here, we consider both a single-stage and a two-stage system. The single-stage system solves the SED and localization tasks simultaneously using an activity-coupled Cartesian DOA vector (ACCDOA) representation, which assigns a sound event activity to the length of the corresponding Cartesian DOA vector. The two-stage system, on the other hand, first handles SED as a frame-wise classification problem and then combines it with DOA estimation.
2 System
In this section, we first give an overview of our two systems, namely, the ACCDOA (single-stage) system and the two-stage system. Then we explain the parts of our pipelines: the features, data augmentation, network architecture, and loss function.
2.1 System overview
A schematic flow of the ACCDOA system is shown in Fig. 1. Two data augmentation techniques are applied to the input signals prior to the feature extraction, while another data augmentation technique exploiting multichannel information in the feature domain is performed after the feature extraction. Finally, the network outputs frame-wise ACCDOA vectors for 14 sound event classes. The magnitude of each vector corresponds to the probability that the sound event is active, while its direction points toward the location of the source. The model using the ACCDOA representation is trained to minimize the distance between the estimated and the target coordinates.
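As a rough illustration of the ACCDOA representation (a minimal sketch, not the authors' implementation; the tensor layout and the 0.5 activity threshold are assumptions), the following snippet builds frame-wise ACCDOA targets from activity and DOA labels and decodes activity and DOA back from estimated vectors by thresholding the vector norm.

```python
import numpy as np

def encode_accdoa(activity, doa_unit_vectors):
    """Build ACCDOA targets.

    activity:         (T, C) binary event activity per frame and class.
    doa_unit_vectors: (T, C, 3) unit Cartesian DOA vectors per frame and class.
    Returns (T, C, 3) targets: the DOA vector scaled by the activity,
    so the vector length encodes whether the event is active.
    """
    return activity[..., None] * doa_unit_vectors

def decode_accdoa(accdoa, threshold=0.5):
    """Decode activity and DOA from estimated ACCDOA vectors of shape (T, C, 3).

    An event is considered active when the vector norm exceeds the threshold;
    the DOA is the direction of the vector.
    """
    norm = np.linalg.norm(accdoa, axis=-1)             # (T, C)
    activity = norm > threshold
    doa = accdoa / np.maximum(norm[..., None], 1e-8)   # unit DOA vectors
    return activity, doa
```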
The two-stage system is inspired by the work in [1]. It is characterized by three steps: training only the SED branch, transferring a part of the network parameters from the SED branch to the DOA estimation branch, and training the DOA estimation branch. The data augmentation techniques, features, and network used in this system are exactly the same as those in the ACCDOA system.
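The parameter transfer can be sketched as follows; the module names (encoder, sed_head, doa_head) and the freezing choice are placeholders rather than the actual implementation in [1].

```python
import torch.nn as nn

class TwoBranchModel(nn.Module):
    """Placeholder model with a shared encoder and separate SED / DOA branches."""
    def __init__(self, encoder: nn.Module, sed_head: nn.Module, doa_head: nn.Module):
        super().__init__()
        self.encoder = encoder    # convolutional front end shared by both branches
        self.sed_head = sed_head  # frame-wise SED classification branch
        self.doa_head = doa_head  # frame-wise DOA regression branch

def transfer_sed_parameters(sed_model: TwoBranchModel, doa_model: TwoBranchModel) -> None:
    # Stage 1 trains only the SED branch. Stage 2 copies the shared encoder
    # parameters into the DOA model before training the DOA branch.
    doa_model.encoder.load_state_dict(sed_model.encoder.state_dict())
    for p in doa_model.encoder.parameters():
        p.requires_grad = False  # optionally keep the transferred part fixed
```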
2.2 Feature
Multichannel amplitude spectrograms and inter-channel phase differences (IPDs) are used as frame-wise features. Here, the IPD is computed from the short-time Fourier transform (STFT) coefficients $x_{t,f,m}$ and $x_{t,f,m_{\mathrm{ref}}}$ as

$$\mathrm{IPD}_{t,f,m} = \angle\!\left(\frac{x_{t,f,m}}{x_{t,f,m_{\mathrm{ref}}}}\right),$$

where $t$, $f$, $m$, and $m_{\mathrm{ref}}$ denote the time frame, the frequency bin, the microphone channel, and the reference channel, respectively. We fix $m_{\mathrm{ref}}$ and compute relative IPDs between all the other channels $m \neq m_{\mathrm{ref}}$. Since the input consists of four-channel signals, we extract four amplitude spectrograms and three IPDs.
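A minimal sketch of this feature extraction is given below, assuming a four-channel FOA waveform at 24 kHz and the 20 ms / 10 ms STFT configuration from Section 3.1; it is not the authors' exact pipeline.

```python
import numpy as np
import librosa

def extract_features(audio, n_fft=480, hop=240, ref_ch=0):
    """Compute amplitude spectrograms and IPDs from a (channels, samples) array.

    Returns amplitude spectrograms for all channels and the IPD of every
    non-reference channel relative to the reference channel.
    """
    stft = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in audio])
    amplitude = np.abs(stft)                                   # (M, F, T)
    # IPD_{t,f,m} = angle(x_{t,f,m} / x_{t,f,ref})
    ipd = np.angle(stft * np.conj(stft[ref_ch:ref_ch + 1]))    # (M, F, T)
    ipd = np.delete(ipd, ref_ch, axis=0)                       # drop the reference channel
    return amplitude, ipd
```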
2.3 Data augmentation
To promote the generalization of the model, we exploit the following data augmentation techniques during training.
- Rotation: We also adopt the spatial augmentation method in [4]. It rotates training data represented in the first-order Ambisonics (FOA) format and allows us to increase the number of DOA labels without losing the physical relationships between steering vectors and observations. We consider eight rotation patterns of the azimuth $\phi$ and the elevation $\theta$.
- Multichannel SpecAugment: We propose a multichannel version of SpecAugment [6, 14]. In addition to the time-frequency hard-masking schemes applied to the amplitude spectrograms, we extend the masking to the channel dimension. The target channel for the channel masking is chosen from $\{1, \dots, M\}$, where $M$ denotes the number of microphone channels. For the IPD features, instead of multiplying the original values by a mask value, the original values are replaced with random values sampled from a uniform distribution ranging from 0 to $2\pi$. A minimal sketch of the channel-dimension masking is given after this list.
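The sketch below covers only the channel-dimension part of the proposed multichannel SpecAugment; the conventional time and frequency masks are omitted, and the number of masks is an assumption.

```python
import numpy as np

def multichannel_specaugment(amp, ipd, num_masks=1, rng=np.random):
    """Channel masking for multichannel features.

    amp: (M, F, T) amplitude spectrograms, ipd: (M-1, F, T) IPD features.
    Amplitude channels are hard-masked with zeros; IPD channels are replaced
    with random phases drawn uniformly from [0, 2*pi).
    """
    amp, ipd = amp.copy(), ipd.copy()
    for _ in range(num_masks):
        m = rng.randint(amp.shape[0])     # target channel for amplitude masking
        amp[m] = 0.0
        m = rng.randint(ipd.shape[0])     # target channel for IPD replacement
        ipd[m] = rng.uniform(0.0, 2 * np.pi, size=ipd[m].shape)
    return amp, ipd
```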
2.4 Network architecture
As the network architecture, we adopt the MMDenseLSTM architecture, which has shown state-of-the-art performance in music source separation [9]. The adaptation to the SELD problem includes four modifications. First, we omit the dense blocks in the up-sampling path because high frame-rate prediction is not necessary for the SELD problem. Second, we replace the LSTM cells, which exist only in the bottleneck part, with GRU cells. Third, the batch normalization is replaced with network deconvolution [13]. Finally, we employ dilated convolutions in the dense blocks, as covering a large input field has been shown to be effective [8]. In each dense block, the dilation factor of the initial convolution is set to one and doubles every time the next convolution is applied, as in WaveNet [11]. We call this architecture RD3Net; it is illustrated in Fig. 2.
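The doubling dilation inside a dense block can be sketched as follows; layer counts, channel sizes, and the normalization are placeholders, not the actual RD3Net configuration.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Dense block whose dilation factor starts at 1 and doubles per layer."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for i in range(num_layers):
            dilation = 2 ** i  # 1, 2, 4, ... as in the WaveNet-style scheme
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity
        return x
```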
2.5 Loss function
In the ACCDOA system, we solve the multi-output regression with a mean square error (MSE) loss. In the two-stage system, we use a binary cross-entropy (BCE) loss for the SED classification task and a masked MSE for the DOA regression task [1]. The latter is an MSE masked with the ground-truth activations of each class, so frames in which the corresponding sound event is not active do not contribute to the training.
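The two regression losses can be sketched as follows; shapes and reduction details are assumptions, not the exact implementation.

```python
import torch

def accdoa_mse_loss(pred, target):
    # Plain MSE between estimated and target ACCDOA vectors of shape (B, T, C, 3).
    return torch.mean((pred - target) ** 2)

def masked_doa_mse_loss(doa_pred, doa_target, activity):
    """Masked MSE for the two-stage DOA branch.

    activity: (B, T, C) ground-truth event activity; frames where the event is
    inactive do not contribute to the loss.
    """
    mask = activity[..., None]                       # broadcast over the xyz axis
    squared_error = mask * (doa_pred - doa_target) ** 2
    return squared_error.sum() / mask.sum().clamp(min=1.0)
```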
2.6 Post-processing
During inference, we split the 60-second inputs into shorter segments with overlap, process each segment, and average the results of the overlapping frames. To further improve the performance, we apply a post-processing step with the following procedure: rotating the FOA data, estimating the ACCDOA vectors, rotating the vectors back, and averaging the vectors over the different rotation patterns. As in Section 2.3, we consider eight rotations.
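A minimal sketch of the overlapped-segment inference is given below, using the 1,024-frame segment length and 20-frame shift from Section 3.1; the rotation-based test-time augmentation is omitted here.

```python
import numpy as np

def infer_with_overlap(features, model, seg_len=1024, hop=20):
    """Split a recording into overlapping segments, run the model on each,
    and average the estimates of overlapping frames.

    features: (T, D) frame-wise features of the full recording.
    model:    callable mapping a (seg_len, D) segment to (seg_len, C, 3) ACCDOA vectors.
    """
    T = features.shape[0]
    starts = list(range(0, max(T - seg_len, 0) + 1, hop))
    if starts[-1] + seg_len < T:
        starts.append(T - seg_len)  # make sure the tail of the recording is covered
    out_sum, counts = None, np.zeros(T)
    for start in starts:
        seg_out = np.asarray(model(features[start:start + seg_len]))
        if out_sum is None:
            out_sum = np.zeros((T,) + seg_out.shape[1:])
        out_sum[start:start + seg_len] += seg_out
        counts[start:start + seg_len] += 1
    return out_sum / np.maximum(counts, 1)[:, None, None]
```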
2.7 Model ensemble
Table 1: Systems used for the model ensemble.

| Name | # of models | Base systems |
|---|---|---|
| Ensemble 1 | 4 | ACCDOA w/ RD3Net ×3, Two-stage w/ RD3Net ×1 |
| Ensemble 2 | 5 | ACCDOA w/ RD3Net ×3, Two-stage w/ RD3Net ×1, Two-stage w/ CRNN ×1 |
| Ensemble 3 | 5 | ACCDOA w/ RD3Net ×3, Two-stage w/ RD3Net ×2 |
Table 2: Performance of our systems on the development set (validation and testing splits).

| Submission label | System | Val LR_CD (%) | Val ER_20° | Val F_20° (%) | Test LR_CD (%) | Test ER_20° | Test F_20° (%) |
|---|---|---|---|---|---|---|---|
| - | Baseline FOA | 62.0 | 0.72 | 37.7 | 60.7 | 0.72 | 37.4 |
| - | ACCDOA w/ CRNN | 79.9 | 0.34 | 77.2 | 73.8 | 0.40 | 70.5 |
| - | Two-stage w/ RD3Net | 86.2 | 0.29 | 80.4 | 78.2 | 0.38 | 73.0 |
| Shimada_SONY_task3_1 | ACCDOA w/ RD3Net | 87.0 | 0.24 | 84.4 | 80.5 | 0.32 | 76.8 |
| Shimada_SONY_task3_2 | Ensemble 1 | 90.0 | 0.20 | 87.6 | 82.9 | 0.29 | 79.4 |
| Shimada_SONY_task3_3 | Ensemble 2 | 90.6 | 0.18 | 88.0 | 83.7 | 0.28 | 79.9 |
| Shimada_SONY_task3_4 | Ensemble 3 | 90.6 | 0.18 | 88.0 | 83.5 | 0.29 | 80.0 |
It is well known that averaging the outputs of several models trained under different conditions, such as initial parameters, stopping iteration, input features, and model architectures, often improves over the individual models. Here, we perform the model ensemble by averaging the outputs with weights. The weights are assigned per class and per model, so the dimension of the weights is $C \times M$, where $C$ is the number of classes and $M$ is the number of models. The weights are estimated by stochastic gradient descent on the validation set using the same MSE criterion as in the ACCDOA system.
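A minimal sketch of fitting the per-class, per-model weights with SGD on the validation set is shown below; shapes, learning rate, and initialization are assumptions.

```python
import torch

def learn_ensemble_weights(model_outputs, targets, steps=1000, lr=0.01):
    """Fit weights of shape (C, M): one weight per class and per model.

    model_outputs: (M, N, T, C, 3) ACCDOA outputs of the M models on the validation set.
    targets:       (N, T, C, 3) ACCDOA targets of the validation set.
    """
    M, _, _, C, _ = model_outputs.shape
    weights = torch.full((C, M), 1.0 / M, requires_grad=True)
    optimizer = torch.optim.SGD([weights], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        w = weights.t().reshape(M, 1, 1, C, 1)         # broadcast over batch and time
        ensembled = (w * model_outputs).sum(dim=0)     # weighted combination of models
        loss = torch.mean((ensembled - targets) ** 2)  # MSE against ACCDOA targets
        loss.backward()
        optimizer.step()
    return weights.detach()
```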
The systems used for the ensemble are shown in Table 1. Here, CRNN denotes the convolutional recurrent neural network architecture used in [1]. Some of the models use PCEN [3] with and without mel filters, cosIPDs, and sinIPDs [12] as input features, instead of the amplitude spectrograms and IPDs.

3 Experimental evaluation
In this section, we show the experimental results of our systems on the development dataset.
Table 3: Performance of the ACCDOA system with RD3Net ("Shimada_SONY_task3_1") on the testing split, for recordings without and with polyphony.

| ACCDOA w/ RD3Net | LR_CD (%) | ER_20° | F_20° (%) |
|---|---|---|---|
| Without polyphony | 83.1 | 0.25 | 81.3 |
| With polyphony | 79.0 | 0.36 | 74.3 |
3.1 Experimental settings
We evaluated our approach on the development set of TAU Spatial Sound Events 2020 - Ambisonic using the suggested setup [7]. In this setup, four metrics were used for the evaluation [5]. The first was the class-dependent localization error $LE_{CD}$, which expresses the average angular distance between predictions and references of the same class. The second was a simple localization recall $LR_{CD}$, which expresses how many of the localization estimates of a class were detected out of the total number of instances of that class. The other two metrics were the location-dependent error rate ($ER_{20^{\circ}}$) and F-score ($F_{20^{\circ}}$), where predictions were considered true positives only when the angular distance from the reference was less than $20^{\circ}$.

The sampling frequency was set to 24 kHz. The STFT was applied with a 20 ms frame length and a 10 ms frame hop. The frame length of the input to the networks was 1,024 frames. During inference, the frame shift was set to 20 frames. We used a batch size of 32, and each training sample was generated on the fly [2]. The learning rate was set to 0.001 and multiplied by 0.9 every 20,000 iterations. We used the Adam optimizer with weight decay.
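For reference, a minimal sketch of the angular distance underlying the location-dependent metrics and the 20° threshold check is given below; this is an illustration, not the official evaluation code.

```python
import numpy as np

def angular_distance_deg(u, v):
    """Angular distance in degrees between two Cartesian DOA vectors."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos_sim = np.clip(np.dot(u, v), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos_sim))

# A correctly classified prediction counts as a true positive for ER_20 and F_20
# only if it lies within 20 degrees of the reference DOA.
is_true_positive = angular_distance_deg(np.array([1.0, 0.0, 0.0]),
                                        np.array([0.97, 0.26, 0.0])) < 20.0
```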
All final submitted systems were trained on folds 3, 4, 5, and 6 of the dataset, except for "Shimada_SONY_task3_4", where one of the two-stage models in the ensemble was trained on folds 1, 3, 4, 5, and 6. Fold 2 was always used as the validation set.
3.2 Experimental results
Table 2 shows the performance of our systems on the development set. As shown in the table, our systems outperformed the baseline on every metric by a large margin. We compared RD3Net with the CRNN used in [1] under the ACCDOA system. The results show significant improvements over the CRNN in all metrics, demonstrating the advantage of RD3Net. We also compared the ACCDOA system with the two-stage system. In the testing split, the localization recall of the ACCDOA system was 2.3 points higher than that of the two-stage system, while its F-score improved by 3.8 points. This suggests that the ACCDOA system is more effective for location-aware detection. The model ensemble further improved the F-score by 3.2 points over the single model in the testing split. Table 3 shows the performance of the ACCDOA system "Shimada_SONY_task3_1" on recordings without and with polyphony. We observed that the performance on recordings without polyphony is better than on those with polyphony.
An example of the proposed ACCDOA system's output on the testing split is visualized in Fig. 3, where each event class is represented by a unique color. We can observe that our system successfully performs joint detection, localization, and tracking of dynamic sources in the recording.
4 Conclusion
We presented our approach to DCASE2020 Task 3, Sound Event Localization and Detection. Our systems use the ACCDOA representation to solve the SED and localization tasks in a unified manner. Moreover, we proposed an efficient network architecture called RD3Net. Our systems showed superior performance over the baseline even with a single model, and we observed further improvement with an ensemble of the ACCDOA and two-stage systems.
5 Acknowledgement
We would like to thank Yuichiro Koyama for the useful discussions on ACCDOA.
References
- [1] (2019) Polyphonic sound event detection and localization using a two-stage strategy. arXiv preprint arXiv:1905.00268.
- [2] (2018) Investigations on data augmentation and loss functions for deep learning based speech-background separation. In Proc. of Interspeech, pp. 3499–3503.
- [3] (2018) Per-channel energy normalization: why and how. IEEE Signal Processing Letters 26 (1), pp. 39–43.
- [4] (2019) Sound event localization and detection using FOA domain spatial augmentation. In Proc. of DCASE Workshop.
- [5] (2019) Joint measurement of localization and detection of sound events. In Proc. of IEEE WASPAA.
- [6] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. of Interspeech, pp. 2613–2617.
- [7] (2020) A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. arXiv preprint arXiv:2006.01919.
- [8] (2017) AENet: learning deep audio features for video analysis. IEEE Trans. on Multimedia 20, pp. 513–524.
- [9] (2018) MMDenseLSTM: an efficient combination of convolutional and recurrent neural networks for audio source separation. In Proc. of IWAENC.
- [10] (2016) Deep convolutional neural networks and data augmentation for acoustic event detection. In Proc. of Interspeech.
- [11] (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
- [12] (2018) Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. of IEEE ICASSP, pp. 1–5.
- [13] (2020) Network deconvolution. In Proc. of ICLR.
- [14] (2019) Data augmentation and prior knowledge-based regularization for sound event localization and detection. In Proc. of DCASE Workshop.