Sound event detection (SED) is the task of describing, from an audio recording, what happens and when each sound event occurs. This is something that we, as humans, do rather naturally to obtain information about what is happening around us. However, reproducing this with a machine is not trivial, as the SED algorithm needs to cope with several problems, including audio signal degradation due to additive noise or overlapping events. Indeed, in real-world scenarios, the recordings provided to SED systems contain not only target sound events, but also sound events that can be considered as “noise” or “interference.” In addition, several target sound events can occur simultaneously.
In the past, the problem of overlapping sound events has been tackled from the classifier point of view. This can be done by training the SED system as a multilabel system, in which case the most energetic sound events are usually detected more accurately than the rest [3, 4]. Other approaches tried to deal more explicitly with this problem by using a set of binary classifiers, by applying factorization techniques to the input of the classifier [6, 7], or by exploiting spatial information when available. The additive noise problem is usually addressed by training SED systems on noisy signals. This may be effective to some degree when the noise level is low, but much less so when the noise level increases.
Sound separation (SS) seems like a natural candidate to solve these two issues. SS systems are trained to predict the constituent sources directly from mixtures. Thus, sound separation can both decrease the level of interfering noise and enable an SED system to detect quieter events in overlapping acoustic mixtures. Until recently, SS has been mainly applied to specific classes of signals, such as speech or music. However, recent work has shown that sound separation can also be applied to sounds of arbitrary classes, a task known as “universal sound separation” [9, 10, 11].
In this paper, we propose to use a universal SS algorithm [9, 10] as a pre-processing step for the DCASE 2020 SED baseline. We investigate the impact of the data used to train the SS on the SED performance. We also explore different ways to combine the separated sound sources at different stages of SED.
2 Problem and baselines description
We aim to solve a problem similar to that of DCASE 2019 Task 4. Systems are expected to produce strongly labeled outputs (i.e. detect sound events with a start time, end time, and sound class label), but are provided with weakly labeled data for training (i.e. sound recordings whose labels indicate only the presence or absence of a sound event, without any timing information). Multiple events can be present in each audio recording, including overlapping target sound events and potentially non-target sound events. Previous studies have shown that the presence of additional sound events can drastically decrease the SED performance.
2.1 Sound event detection baseline
The SED baseline system uses a mean-teacher model, which is a combination of two models with the same architecture: a student model and a teacher model. The student model is the final model used at inference time, while the teacher model is aimed at helping the student model during training; its weights are an exponential moving average of the student model’s weights. A more detailed description can be found in Turpault and Serizel.
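The relationship between the teacher and student weights can be illustrated with a minimal sketch. The function name `ema_update`, the dictionary layout of the weights, and the decay value are assumptions made for illustration; they are not taken from the baseline code.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight towards the student weight by one EMA step.

    `decay` is a hypothetical value; the baseline may use a different one.
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

# Toy weights: the teacher starts at zero and follows the student slowly.
student = {"w": np.array([1.0, 2.0])}
teacher = {"w": np.array([0.0, 0.0])}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher["w"])
```

In this sketch the teacher is never updated by gradient descent; it only tracks a smoothed version of the student.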
2.2 Sound separation baseline
The SS baseline builds on the universal sound separation approach of [9, 10], which employs a convolutional masking network using STFT analysis and synthesis. The training loss is the negative stabilized signal-to-noise ratio (SNR) with a soft threshold. Going beyond previous work, the model in this paper is able to handle a variable number of sources by using different loss functions for active and inactive reference sources, which encourages the model to output only as many non-zero sources as exist in the mixture. Additional source slots are encouraged to be all-zero.
3 Sound event detection and separation
3.1 Sound separation for sound event detection
Overlapping sound events are typically more difficult to detect than isolated ones. SS can be used for SED by first separating the component sounds in a mixed signal and then applying SED on each of the separated tracks. The decisions obtained on separated signals may be more accurate than those obtained on the mixed signal. On the other hand, separation of sounds is not a trivial problem and may introduce artifacts, which in turn may make SED harder. It is therefore necessary to investigate SS and SED jointly.
3.2 Sound event detection on separated sources
In the approaches described here, SS provides several audio clips that contain information related to the sound sources composing the original (mixture) clip. Each of these new audio clips (separated sound sources) is used together with the mixture clip within the SED. We compare three different approaches to integrate the information from these audio clips at different levels of the model.
3.2.1 Early integration
This approach is similar to the SED baseline except that all the audio clips (mixture and separated sound sources) are concatenated as input channels to form a new tensor (Figure 1(a)). The first channel always contains the mixture clip, while the separated sound source clips are provided in no particular order. The model is trained like the SED baseline using the annotations of the mixture clip.
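As a minimal sketch of how such an input tensor could be assembled, the example below stacks time-frequency features along a new channel axis. The feature shape (628 frames by 128 bins), the number of separated sources, and the function name `early_integration_input` are assumptions chosen for illustration.

```python
import numpy as np

def early_integration_input(mixture_feat, source_feats):
    """Stack the mixture and separated-source features as input channels.

    mixture_feat: array of shape (frames, bins)
    source_feats: list of arrays with the same shape as mixture_feat
    Returns an array of shape (1 + n_sources, frames, bins),
    with the mixture always in channel 0.
    """
    return np.stack([mixture_feat] + list(source_feats), axis=0)

# Hypothetical feature sizes; four separated sources in arbitrary order.
mix = np.zeros((628, 128))
sources = [np.full((628, 128), float(i)) for i in range(1, 5)]
x = early_integration_input(mix, sources)
print(x.shape)  # (5, 628, 128)
```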
3.2.2 Middle integration
We re-use the CNN block from the SED baseline to extract embeddings from the mixture clip and the separated sound source clips (Figure 1(b)). The embeddings are concatenated along the feature axis and fed into a fully connected layer before training a new RNN classifier within a mean-teacher framework.
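The fusion step can be sketched as follows, assuming the CNN embeddings are already available as arrays of shape (frames, dim). All sizes, the random weights, and the function name `middle_integration` are hypothetical; the sketch only shows the concatenation along the feature axis followed by a fully connected projection.

```python
import numpy as np

def middle_integration(embeddings, weight, bias):
    """Concatenate per-clip embeddings along the feature axis and project them.

    embeddings: list of arrays of shape (frames, dim), mixture first
    weight: array of shape (n_clips * dim, out_dim)
    bias: array of shape (out_dim,)
    """
    stacked = np.concatenate(embeddings, axis=-1)  # (frames, n_clips * dim)
    return stacked @ weight + bias                 # (frames, out_dim)

# Hypothetical sizes: 1 mixture + 4 separated sources, 128-dim embeddings.
rng = np.random.default_rng(0)
frames, dim, n_clips, out_dim = 157, 128, 5, 128
embeddings = [rng.standard_normal((frames, dim)) for _ in range(n_clips)]
weight = 0.01 * rng.standard_normal((n_clips * dim, out_dim))
bias = np.zeros(out_dim)
fused = middle_integration(embeddings, weight, bias)
print(fused.shape)  # (157, 128)
```

The fused sequence would then be passed to the recurrent classifier in place of the single-clip embedding.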
3.2.3 Late integration
For this approach, we apply the SED baseline to the mixture clip and to the separated sound source clips (Figure 1(c)). The SED outputs for each of these clips are the raw outputs of the classifier for each of the sound classes. The combined raw output (before thresholding and post-processing) for each class is obtained as follows:
where the two combined terms are the raw classifier outputs for the sound class obtained on the mixture clip and on the separated sound source clips, respectively, and the combination is controlled by the sound source/mixture combination weight. The classifier output for the sound class on the separated sound source clips is itself obtained from the raw classifier outputs on each individual separated sound source as follows:
where the aggregation runs over the separated sound source clips obtained from the SS, uses the raw classifier output for the sound class obtained on each separated sound source clip, and is controlled by the sound sources combination weight.
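One aggregation that matches the behaviour described for the two combination weights (approaching max pooling as a weight grows) is a generalized power mean. The sketch below assumes this form for both steps; the function names and the symbols `p` (sound sources combination weight) and `q` (sound source/mixture combination weight) are illustrative assumptions and should not be read as the exact formula used in the experiments.

```python
import numpy as np

def power_mean(values, exponent, axis=0):
    """Generalized mean: exponent=1 is the average, large exponents approach max."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** exponent, axis=axis) ** (1.0 / exponent)

def late_integration(mix_out, source_outs, p=1.0, q=1.0):
    """Combine raw classifier outputs in [0, 1].

    mix_out: (frames, classes) outputs on the mixture clip
    source_outs: (n_sources, frames, classes) outputs on the separated clips
    p: assumed sources combination weight; q: assumed source/mixture weight
    """
    sources_combined = power_mean(source_outs, p, axis=0)
    return power_mean(np.stack([mix_out, sources_combined]), q, axis=0)

mix_out = np.array([[0.2]])
source_outs = np.array([[[0.1]], [[0.9]]])
averaged = late_integration(mix_out, source_outs, p=1.0, q=1.0)
near_max = power_mean(source_outs, 200.0, axis=0)
print(averaged, near_max)
```

With `p = 1` the sources are simply averaged, while a large `p` lets the most confident source dominate.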
4 Baselines setup and dataset
4.1 DESED dataset
The dataset used for the SED experiments is DESED (https://project.inria.fr/desed/), a flexible dataset for SED in domestic environments composed of 10-second audio clips that are recorded or synthesized [13, 4]. The recorded soundscapes are taken from AudioSet. The synthetic soundscapes are generated using Scaper. The foreground events are obtained from FSD50k [17, 18]. The background textures are obtained from the SINS dataset (activity class “other”) and the TUT scenes 2016 development dataset.
The dataset includes a synthetic validation set simulated from isolated events different from those in the training set (SYN_VAL), as well as a validation set and a public evaluation set composed of recorded clips (REC_VAL and REC_EVAL), which are used to adjust the hyper-parameters and to evaluate the SED, respectively.
4.2 FUSS dataset
The Free Universal Sound Separation (FUSS) dataset (https://github.com/google-research/sound-separation/tree/master/datasets/fuss) is intended for experimenting with universal sound separation, and is used as training data for the SS system. Audio data is sourced from freesound.org. Using labels from FSD50k, gathered through the Freesound Annotator, these source files have been screened such that they likely contain only a single type of sound. Labels are not provided for these source files, and thus the goal is to separate sources without using class information. To create reverberant mixtures, 10-second clips of sources are convolved with simulated room impulse responses. Each 10-second mixture contains between 1 and 4 sources. Source files longer than 10 seconds are considered “background” sources. Every mixture contains one background source, which is active for the entire duration.
4.3 Sound event detection baseline
The SED baseline (https://github.com/turpaultn/dcase20_task4/) architecture and parameters are described extensively in Turpault et al. The performance obtained with this baseline on DESED is presented in Table 1.
4.4 Sound separation baseline
The SS system is trained on 16 kHz audio (https://github.com/google-research/sound-separation/tree/master/models/dcase2020_fuss_baseline). The input to the SS network is the magnitude of the STFT, computed with a window size of 32 ms and a hop of 8 ms. These magnitudes are processed by an improved time-domain convolutional network (TDCN++) [9, 10], which is similar to Conv-TasNet. Like Conv-TasNet, the TDCN++ consists of four repeats of 8 residual dilated convolution blocks, where within each repeat the dilation of block $\ell$ is $2^\ell$ for $\ell = 0, \dots, 7$. The main differences between the TDCN++ and Conv-TasNet are (1) bin-wise normalization instead of global layer normalization, which averages only over basis frames instead of frames and frequency bins, (2) trainable scalar scale parameters multiplied after each dense layer, which are initialized with
, and (3) additional residual connections between blocks, with connection pattern, , , , , .
This TDCN++ network predicts four masks that have the same shape as the input STFT. Each mask is multiplied with the complex input STFT, and a source waveform is computed by applying the inverse STFT. A weighted mixture consistency projection layer is then applied to the separated waveforms so that they are consistent with the input mixture waveform; the per-source weights are predicted by an additional dense layer using the penultimate output of the TDCN++.
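A weighted mixture consistency projection can be sketched as redistributing the residual between the mixture and the sum of the separated waveforms across the sources. The sketch below assumes per-source weights that sum to one and are simply passed in as an argument (in the system described above they are predicted by a dense layer); the function name and the default uniform weights are assumptions.

```python
import numpy as np

def mixture_consistency(sources, mixture, weights=None):
    """Project separated waveforms so that they sum to the mixture.

    sources: (n_sources, samples) separated waveforms
    mixture: (samples,) input mixture waveform
    weights: (n_sources,) per-source weights summing to 1 (uniform if None)
    """
    n_sources = sources.shape[0]
    if weights is None:
        weights = np.full(n_sources, 1.0 / n_sources)
    residual = mixture - sources.sum(axis=0)
    return sources + weights[:, None] * residual[None, :]

sources = np.array([[1.0, 2.0, 3.0],
                    [0.5, 0.5, 0.5]])
mixture = np.array([2.0, 3.0, 4.0])
projected = mixture_consistency(sources, mixture)
print(projected.sum(axis=0))  # matches the mixture
```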
To separate mixtures with variable numbers of sources, different loss functions are used for active and inactive reference sources. For active reference sources (i.e. non-zero reference source signals), the soft threshold for SNR is 30 dB, equivalent to the error power being below the reference power by 30 dB. For inactive reference sources (i.e. all-zero reference source signals), the soft threshold is 20 dB measured relative to the mixture power, so gradients are clipped when the error power is 20 dB below the mixture power. Thus, for an $N$-source mixture, an $M$-output model with $M \geq N$ should output $N$ non-zero sources and $M - N$ all-zero sources.
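The two cases can be illustrated with a minimal sketch of a soft-thresholded negative SNR. The exact form of the loss, the normalization by the mixture power in the inactive case, and the function name are assumptions; only the 30 dB and 20 dB thresholds come from the description above.

```python
import numpy as np

def thresholded_neg_snr(ref, est, mix, snr_max_active=30.0,
                        snr_max_inactive=20.0, eps=1e-8):
    """Negative SNR in dB with a soft threshold (assumed form).

    Active reference: error power is floored at ref_power * 10^(-30/10).
    Inactive reference: estimate power is floored at mix_power * 10^(-20/10).
    """
    ref_power = np.sum(ref ** 2)
    if ref_power > 0:  # active reference source
        tau = 10.0 ** (-snr_max_active / 10.0)
        err_power = np.sum((ref - est) ** 2)
        return 10.0 * np.log10((err_power + tau * ref_power + eps)
                               / (ref_power + eps))
    # inactive (all-zero) reference source
    tau = 10.0 ** (-snr_max_inactive / 10.0)
    mix_power = np.sum(mix ** 2)
    return 10.0 * np.log10((np.sum(est ** 2) + tau * mix_power + eps)
                           / (mix_power + eps))

t = np.arange(16000) / 16000.0
source = np.sin(2 * np.pi * 440.0 * t)
silent = np.zeros_like(source)
mixture = source + silent
loss_active = thresholded_neg_snr(source, source, mixture)    # perfect estimate
loss_inactive = thresholded_neg_snr(silent, silent, mixture)  # all-zero output
print(loss_active, loss_inactive)
```

In this form the loss saturates near -30 dB for a perfectly estimated active source and near -20 dB for a correctly silent output slot, so further improvements beyond the threshold contribute little gradient.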
4.5 Evaluation metrics
SS systems are evaluated in terms of scale-invariant SNR (SI-SNR). Since FUSS mixtures can contain one to four sources, we report two scores to summarize performance: multi-source SI-SNR improvement (MSi), which measures the separation quality on mixtures with two or more sources, and single-source SI-SNR (1S), which measures the separation model’s ability to reconstruct single-source inputs.
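For reference, SI-SNR and its improvement over the unprocessed mixture can be sketched as below. This is a minimal version under the usual definition (projection of the estimate onto the reference); it omits mean removal, and the function names are assumptions rather than the evaluation code used for the reported scores.

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-invariant SNR in dB between a reference and an estimate."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref      # part of the estimate explained by the reference
    noise = est - target      # everything else
    return 10.0 * np.log10((np.sum(target ** 2) + eps)
                           / (np.sum(noise ** 2) + eps))

def si_snr_improvement(ref, est, mix):
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(ref, est) - si_snr(ref, mix)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
ref = np.sin(2 * np.pi * 440.0 * t)
interference = rng.standard_normal(16000)
mix = ref + interference
est = ref + 0.1 * interference  # a hypothetical separated estimate
print(si_snr(ref, est), si_snr_improvement(ref, est, mix))
```

Rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SNR from plain SNR.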
Table 2: SS and SED performance for FUSS-trained SS models on the FUSS test set and REC_VAL: MSi (multi-source SI-SNR improvement) and 1S (single-source SI-SNR). Confidence intervals: 1.2 (F1-score) and 0.015 (PSDS).
Table 3: SS tasks constructed from DESED and FUSS data.
|Task|SS training data|
|DmFm|DESED mix, dry FUSS mix|
|BgFgFm|DESED bg, DESED fg mix, dry FUSS mix|
|PIT|DESED bg, dry FUSS mix, 5 DESED fg sources|
|Classwise|DESED bg, 10 DESED classes, dry FUSS mix|
|GroupPIT|DESED bg, 5 DESED fg sources, 4 dry FUSS srcs|
|SS training|SI-SNRi (dB)|Early integration|Middle integration|Late integration|Late integration|
SED systems are evaluated according to an event-based F1-score with a 200 ms collar on the onsets and a collar on the offsets that is the greater of 200 ms and 20% of the sound event’s length. The overall F1-score is the unweighted average of the class-wise F1-scores (macro-average). F1-scores are computed at a single operating point (decision threshold of 0.5) using the sed_eval library.
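The collar rule can be made concrete with a small sketch. This is not the sed_eval API; the helper names and the (onset, offset, label) tuple layout are assumptions used only to illustrate the matching condition for a single reference and estimated event.

```python
def offset_collar(onset, offset, t_collar=0.2, percentage=0.2):
    """Offset tolerance in seconds: the greater of 200 ms and 20% of the length."""
    return max(t_collar, percentage * (offset - onset))

def event_matches(ref, est, t_collar=0.2, percentage=0.2):
    """Check one estimated event against one reference event.

    ref, est: tuples (onset_seconds, offset_seconds, label)
    """
    same_label = ref[2] == est[2]
    onset_ok = abs(ref[0] - est[0]) <= t_collar
    offset_ok = abs(ref[1] - est[1]) <= offset_collar(ref[0], ref[1],
                                                      t_collar, percentage)
    return same_label and onset_ok and offset_ok

# A 5 s event tolerates a 1 s offset error; a 0.5 s event only 200 ms.
print(offset_collar(0.0, 5.0), offset_collar(0.0, 0.5))
print(event_matches((1.0, 6.0, "Dog"), (1.1, 6.8, "Dog")))
```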
SED systems are also evaluated with polyphonic sound detection scores (PSDS). PSDS are computed using 50 operating points (linearly distributed from 0.01 to 0.99) with the following parameters: detection tolerance parameter (), ground truth intersection parameter (), cross-trigger tolerance parameter (), maximum false positive rate (). The weight on the cross-trigger cost is set to and the weight on the class instability cost is set to .
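The set of operating points can be generated as in the sketch below, which thresholds a toy matrix of frame-level scores at each of the 50 decision thresholds. The scores and array layout are hypothetical; computing PSDS itself from the resulting detections is left to a dedicated evaluation library.

```python
import numpy as np

# 50 decision thresholds linearly distributed from 0.01 to 0.99.
thresholds = np.linspace(0.01, 0.99, 50)

# Toy raw classifier outputs with shape (frames, classes).
scores = np.array([[0.05, 0.60],
                   [0.40, 0.95]])

# One binary decision matrix per operating point: (50, frames, classes).
detections = scores[None, :, :] >= thresholds[:, None, None]
print(detections.shape)  # (50, 2, 2)
```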
Table 2 displays SS and SED performance on the FUSS test set and REC_VAL. For SS, we perform a full cross-evaluation between the dry and reverberant versions. From this we can see that the reverberant FUSS-trained model achieves the best separation scores across both dry and reverberant conditions. However, the dry FUSS-trained separation model yields the best SED performance in terms of both F1 and PSDS. This may be due to the synthetic room impulse responses used to create reverberant FUSS being mismatched to the real data in REC_VAL. Thus, we opt to use the dry version of FUSS in the following experiments.
Besides training SS systems on FUSS, we also constructed a number of tasks consisting of data from both DESED and FUSS, described in Table 3. Some tasks are trained with permutation-invariant training (PIT) or groupwise PIT.
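Permutation-invariant training scores the estimated sources against the references under every possible assignment and keeps the best one. The following sketch illustrates this with a brute-force search and a mean squared error; the loss function, the function names, and the toy signals are assumptions, and the SS systems above use an SNR-based loss instead.

```python
import itertools
import numpy as np

def mse(ref, est):
    """Mean squared error between two waveforms."""
    return float(np.mean((ref - est) ** 2))

def pit_loss(refs, ests, loss_fn=mse):
    """Return the lowest average loss over all source permutations.

    refs, ests: arrays of shape (n_sources, samples)
    The returned permutation maps reference i to estimate perm[i].
    """
    n_sources = refs.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_sources)):
        loss = np.mean([loss_fn(refs[i], ests[j]) for i, j in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
refs = rng.standard_normal((3, 100))
ests = refs[[2, 0, 1]]  # perfect estimates, but in shuffled order
best_loss, best_perm = pit_loss(refs, ests)
print(best_loss, best_perm)
```

A groupwise variant would restrict the search to permutations within predefined groups of outputs rather than across all of them.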
Table 4 reports the results of evaluating these models on the BgFgFm task. For models with more than three outputs, the sources for corresponding classes are summed together. For example, sources 1 through 5 are summed together for the PIT and GroupPIT models, and sources 0 through 9 for the Classwise model, to produce the separated estimate of the DESED foreground mixture. The BgFgFm-trained SS model achieves the best SS scores, since it is matched to the task. This model also achieves the highest F1 score on the SYN_VAL set, although the difference is not statistically significant. On REC_VAL, however, the Classwise model achieves the best F1 score. Note that the dry FUSS SS model achieves the overall best F1 and PSDS scores of 39.2 and 0.574 in Table 2. This suggests that the DESED+FUSS-trained SS models do not generalize as well, since they are trained on more specific synthetic data than the FUSS-trained models.
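The summation of output sources into a single foreground estimate can be sketched as below. The index range follows the PIT and GroupPIT example in the text; the array layout, the number of outputs, and the function name are assumptions.

```python
import numpy as np

def foreground_estimate(separated, first, last):
    """Sum separated sources with indices first..last (inclusive).

    separated: array of shape (n_outputs, samples)
    """
    return separated[first:last + 1].sum(axis=0)

# Hypothetical 10-output model with constant toy waveforms.
separated = np.ones((10, 16000))
fg = foreground_estimate(separated, 1, 5)  # sources 1 through 5
print(fg[:3])
```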
Figure 2 displays the impact of the two late integration combination weights on the SED performance. Intuitively, when the SS model aims at separating sources that correspond to target sound events, the sound sources combination weight should be high so that the source aggregation is close to max pooling across sources. This is what can be observed in Fig. 2(a) for the PIT model. For the FUSS-trained SS, the separated sources do not correspond to target sources, and the integration works better for low values of this weight. This, however, is not confirmed on REC_VAL (Fig. 2(b)). This could be due to the mismatch between training and test conditions for the SS, leading to sound sources that are not properly separated.
The SED performance as a function of the sound source/mixture combination weight is presented in Figure 2(c). A high value for this weight means focusing only on the mixture or on the separated sounds, and leads to degraded performance for all the SS models. The best performance is then obtained with the FUSS-trained SS and and (40.7% F1-score and 0.570 PSDS on REC_EVAL).
In this paper, we proposed to use an SS algorithm as a pre-processing step for an SED system applied to complex mixtures including non-target events and background noise. We also proposed to retrain the generic SS on task-specific datasets. The combination has been shown to have the potential to improve SED performance, in particular when using late integration to combine the predictions obtained from the separated sources. However, the benefits remain limited, most probably because of the mismatch between the SS training conditions and the SED test conditions.
We would like to thank the other organizers of DCASE 2020 task 4: Daniel P. W. Ellis and Ankit Parag Shah.
- [1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational analysis of sound scenes and events. Springer, 2018.
- [2] E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, “Detection of overlapping acoustic events using a temporally-constrained probabilistic model,” in ICASSP. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01255074
- [3] J. Salamon and J. P. Bello, “Feature learning with deep scattering for urban sound analysis,” in 2015 23rd European Signal Processing Conference (EUSIPCO). IEEE, 2015, pp. 724–728.
- [4] R. Serizel, N. Turpault, A. Shah, and J. Salamon, “Sound event detection in synthetic domestic environments,” in Proc. ICASSP, 2020.
- [5] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132.
- [6] E. Benetos, G. Lafay, M. Lagrange, and M. D. Plumbley, “Detection of overlapping acoustic events using a temporally-constrained probabilistic model,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6450–6454.
- [7] V. Bisot, S. Essid, and G. Richard, “Overlapping sound event detection with supervised nonnegative matrix factorization,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 31–35.
- [8] S. Adavanne, A. Politis, and T. Virtanen, “Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–7.
- [9] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey, “Universal sound separation,” in Proc. WASPAA, 2019.
- [10] E. Tzinis, S. Wisdom, J. R. Hershey, A. Jansen, and D. P. Ellis, “Improving universal sound separation using sound classification,” in Proc. ICASSP, 2020.
- [11] M. Olvera, E. Vincent, R. Serizel, and G. Gasso, “Foreground-background ambient sound scene separation,” May 2020, working paper or preprint. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02567542
- [12] N. Turpault and R. Serizel, “Training sound event detection on a heterogeneous dataset,” 2020, working paper or preprint.
- [13] N. Turpault, R. Serizel, A. Parag Shah, and J. Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” in Proc. DCASE Workshop, 2019.
- [14] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised sound separation using mixtures of mixtures,” arXiv preprint arXiv:2006.12701, 2020.
- [15] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017.
- [16] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: A library for soundscape synthesis and augmentation,” in Proc. WASPAA, 2017, pp. 344–348.
- [17] F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in Proc. ACM, 2013, pp. 411–412.
- [18] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50k: an open dataset of human-labeled sound events,” in arXiv, 2020.
- [19] G. Dekkers, S. Lauwereins, B. Thoen, M. W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, and P. Karsmakers, “The SINS database for detection of daily activities in a home environment using an acoustic sensor network,” in Proc. DCASE Workshop, November 2017, pp. 32–36.
- [20] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132. [Online]. Available: http://ieeexplore.ieee.org/document/7760424/
- [21] S. Wisdom, H. Erdogan, D. P. W. Ellis, R. Serizel, N. Turpault, E. Fonseca, J. Salamon, P. Seetharaman, and J. R. Hershey, “What’s all the FUSS about free universal sound separation data?” In preparation, 2020.
- [22] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, “Freesound datasets: a platform for the creation of open audio datasets,” in Proc. ISMIR, 2017, pp. 486–493.
- [23] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- [24] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in Proc. ICASSP, 2019.
- [25] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 626–630.
- [26] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, May 2016.
- [27] C. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulovic, “A framework for the robust evaluation of sound event detection,” in Proc. ICASSP, 2020.
- [28] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 241–245.