Sound event detection (SED) is the task of detecting sound event labels and their onsets/offsets in an audio recording, where a sound event indicates a type of sound such as “people talking” or “bird singing” [1]. SED plays an important role in realizing various applications of artificial intelligence for sound, such as automatic life-logging, machine monitoring, automatic surveillance, media retrieval, and biomonitoring systems [2, 3, 4, 5, 6, 7, 8].
Table 1: Average duration of one sound event instance.

| Sound event | Duration | Sound event | Duration |
|---|---|---|---|
| (object) banging | 0.78 s | drawer | 0.80 s |
| (object) impact | 0.35 s | fan | 29.99 s |
| (object) rustling | 2.24 s | glass jingling | 0.80 s |
| (object) snapping | 0.46 s | keyboard typing | 0.21 s |
| (object) squeaking | 0.74 s | large vehicle | 14.68 s |
| bird singing | 7.63 s | mouse clicking | 0.14 s |
| brakes squeaking | 1.65 s | mouse wheeling | 0.16 s |
| breathing | 0.43 s | people talking | 4.09 s |
| car | 6.88 s | people walking | 6.63 s |
| children | 6.87 s | washing dishes | 4.15 s |
| cupboard | 0.65 s | water tap running | 5.92 s |
| cutlery | 0.74 s | wind blowing | 6.09 s |
Many methods of SED based on neural networks, such as the convolutional neural network (CNN) [9], recurrent neural network (RNN) [10], and convolutional recurrent neural network (CRNN) [11], have been proposed. In these methods, an audio clip is segmented into short time frames (e.g., 40 ms frame length with a 20 ms shift), and each segmented frame is regarded as one data sample when training and evaluating a sound event model. As shown in Fig. 1, each sound event instance has a different frame length, and the frame length of each instance varies depending on the sound event class. Table 1 and Fig. 2 respectively show the average duration of one sound event instance and the total number of frames of each sound event in the datasets used for the evaluation experiments discussed in Section 4 (TUT Sound Events 2016 and 2017, and TUT Acoustic Scenes 2016, development sets [12, 13]). In these datasets, the number of frames of the sound event “mouse clicking,” which has an average duration of 0.14 s, is 1,163, whereas that of the sound event “fan,” which has an average duration of 29.99 s, is 116,837. Thus, the difference in time duration between sound events causes a serious data imbalance problem in SED.

There are some conventional SED methods for imbalanced data [14, 15]. For instance, Chen and Jin proposed a method for detecting rare sound events using data augmentation [14]. Wang et al. proposed a method for few-shot sound event detection based on metric learning [15]. However, the data imbalance problem caused by the difference in time duration between sound event classes has not been investigated in these works.
In this paper, we address imbalanced-data SED caused by the difference in time duration between sound events using a duration robust loss function. In a preliminary experiment, we found that a simple classwise reweighting of the loss function on the basis of the inverse frequency of sound event occurrences (the details are described in Sec. 2) is not effective for SED using extremely imbalanced data. In the proposed method, we instead apply another classwise reweighting of the loss function in accordance with the ease/difficulty of model training. As shown in Fig. 3, this is motivated by the observation that many sound events of long duration (e.g., “fan” or “car”) are stationary sounds, which have less variation in their acoustic features, so their model training is easy. We demonstrate that the proposed reweighting of the loss function can prevent the sound events of long duration from dominating the model training and improve the performance of SED using a seriously imbalanced dataset.
The rest of this paper is organized as follows. In Sec. 2, we introduce the conventional SED method, and in Sec. 3, we propose an SED method based on the duration robust loss function. In Sec. 4, we discuss the evaluation of SED performance on an imbalanced training dataset. In Sec. 5, we conclude this paper.
2 Conventional Method
Let us consider the training dataset $\mathcal{D} = \{(\mathbf{X}_{1}, \mathbf{Z}_{1}), \dots, (\mathbf{X}_{N}, \mathbf{Z}_{N})\}$, where $\mathbf{X}_{n}$ is an acoustic feature of sound clip $n$ and $\mathbf{Z}_{n} = (\mathbf{z}_{n,1}, \dots, \mathbf{z}_{n,T})$ is the corresponding sound event label. Here, $\mathbf{z}_{n,t} = (z_{n,t,1}, \dots, z_{n,t,M}) \in \{0, 1\}^{M}$ indicates a multi-hot vector of time frame $t$ in sound clip $n$ over the $M$ sound event classes. The goal of SED is to predict the sound event labels in an unknown sound $\mathbf{X}$ using

$$\hat{z}_{t,m} = \begin{cases} 1 & \text{if}\ \ \sigma\big(f(\mathbf{X}; \boldsymbol{\Theta})\big)_{t,m} \geq \phi \\ 0 & \text{otherwise}, \end{cases} \tag{1}$$

where $f$, $\boldsymbol{\Theta}$, and $\phi$ are the model, the model parameter trained using $\mathcal{D}$, and the detection threshold, respectively. In conventional SED, the mel-band energy and mel-frequency cepstral coefficients (MFCCs) are often used as the acoustic feature $\mathbf{X}$. As the model $f$, a CNN-, RNN-, or CRNN-based neural network is applied. The model parameter $\boldsymbol{\Theta}$ is estimated using the following binary cross-entropy (BCE) loss function and the backpropagation technique:

$$E(\boldsymbol{\Theta}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \Big\{ z_{t,m} \log \sigma(y_{t,m}) + (1 - z_{t,m}) \log \big(1 - \sigma(y_{t,m})\big) \Big\}, \tag{2}$$

where $\sigma(\cdot)$ and $y_{t,m}$ are the sigmoid function and the output of the network for sound event $m$ in time frame $t$, respectively. $z_{t,m}$ is the target label, which is 1 if sound event $m$ is active in time frame $t$ and 0 otherwise. Note that we omit the sound clip index $n$ to simplify the equation; $E(\boldsymbol{\Theta})$ is actually calculated by summing the binary cross-entropy over the time frames of all sound clips. Since the frame length of sound events varies considerably depending on the event class, the model parameter estimation using Eq. (2) leads to the data imbalance problem. As a result, sound events of long duration overwhelm the model training and those of short duration are likely to be downweighted.
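To make the frame-wise formulation concrete, below is a minimal sketch of Eq. (2), assuming a PyTorch implementation (the paper does not specify a framework; the tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def framewise_bce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (2): frame-wise binary cross-entropy for multi-label SED.

    logits:  (batch, T, M) raw network outputs y_{t,m}
    targets: (batch, T, M) multi-hot labels z_{t,m} in {0, 1}
    """
    # binary_cross_entropy_with_logits applies the sigmoid internally,
    # which is numerically more stable than sigmoid followed by log.
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Sum over time frames and event classes, average over clips in the batch.
    return loss.sum(dim=(1, 2)).mean()
```

With a 10 s clip and a 20 ms shift, each clip contributes about 500 frame terms per class to this sum, which is how long-duration events come to dominate the total loss.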
[Table 2: Fscore for each of the 25 sound events obtained using the inverse frequency loss and the proposed duration robust loss (two settings of the weighting factor $\gamma$).]
One simple idea to address the imbalanced-data problem is classwise reweighting of the loss function in accordance with the inverse frequency of sound event occurrences as follows:

$$E(\boldsymbol{\Theta}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \frac{C}{N_{m}} \Big\{ z_{t,m} \log \sigma(y_{t,m}) + (1 - z_{t,m}) \log \big(1 - \sigma(y_{t,m})\big) \Big\}. \tag{3}$$

Here, $N_{m}$ and $C$ are the number of frames of sound event $m$ in a sound clip and a constant number, respectively. However, in a preliminary experiment, we confirmed that the simple classwise reweighting using $C / N_{m}$ is not effective for SED.
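For illustration, here is a sketch of the inverse-frequency reweighting of Eq. (3) under the same PyTorch assumption; `frame_counts` (holding the per-class counts $N_m$) and `C` are illustrative names:

```python
import torch
import torch.nn.functional as F

def inverse_frequency_bce(logits, targets, frame_counts, C=1.0):
    """Eq. (3): BCE reweighted by the inverse frequency C / N_m per class.

    frame_counts: (M,) tensor of active-frame counts N_m for each event class.
    """
    class_weights = C / frame_counts.float().clamp(min=1.0)  # avoid divide-by-zero
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Broadcast the (M,) class weights over the (batch, T, M) loss tensor.
    return (class_weights * loss).sum(dim=(1, 2)).mean()
```

With counts as skewed as those in Fig. 2 (over 100,000 frames for “fan” versus about 1,000 for “mouse clicking”), these weights span roughly two orders of magnitude, which is one plausible reason this simple reweighting is ineffective in practice.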
3 Proposed Method
In this paper, we propose a duration robust loss function for SED, whereby training can be focused on sound events of short duration. In the proposed method, we focus on the relationship between the sound event duration and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., “fan” and “car”) are stationary sounds, which have less variation in their acoustic features, so their model training is easy. Meanwhile, some sound events of short duration (e.g., “(object) impact” and “keyboard typing”) have more than one audio pattern, such as attack, decay, and release parts. Therefore, the ease/difficulty of model training is important information for controlling the training weight, and it provides a more direct way of controlling the loss contribution than the occurrence frequency.
To control the training weight of sound events in accordance with the ease/difficulty of model training, we add the factors $(1 - \sigma(y_{t,m}))^{\gamma}$ and $\sigma(y_{t,m})^{\gamma}$ to the BCE loss as follows:

$$E(\boldsymbol{\Theta}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \Big\{ \big(1 - \sigma(y_{t,m})\big)^{\gamma} \, z_{t,m} \log \sigma(y_{t,m}) + \sigma(y_{t,m})^{\gamma} \, (1 - z_{t,m}) \log \big(1 - \sigma(y_{t,m})\big) \Big\}, \tag{4}$$

where $\gamma \geq 0$ is the weighting parameter that controls the focusing weight. When sound event $m$ is active in time frame $t$ but the network output $\sigma(y_{t,m})$ is a small value, the reweighting factor $(1 - \sigma(y_{t,m}))^{\gamma}$ does not greatly affect the loss, whereas if the network output is a large value, the reweighting factor approaches zero and the loss is downweighted. Thus, the loss function focuses the model training on the sound event classes that are difficult to train.
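A minimal sketch of Eq. (4) under the same PyTorch assumption; setting `gamma=0` recovers the plain BCE of Eq. (2), which is a convenient sanity check:

```python
import torch

def duration_robust_loss(logits, targets, gamma=2.0):
    """Eq. (4): BCE with focal-style modulating factors.

    Easy frames (confident, correct predictions) are down-weighted by
    (1 - p)^gamma or p^gamma, so long stationary events, which quickly
    become easy to classify, stop dominating the total loss.
    """
    p = torch.sigmoid(logits)                      # sigma(y_{t,m})
    eps = 1e-7                                     # numerical safety for log
    pos = (1.0 - p) ** gamma * targets * torch.log(p.clamp(min=eps))
    neg = p ** gamma * (1.0 - targets) * torch.log((1.0 - p).clamp(min=eps))
    return -(pos + neg).sum(dim=(1, 2)).mean()
```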
The concept of the duration robust loss function is similar to that of the focal loss [16] in object detection. In object detection, there is also a serious imbalance problem: background samples tend to have a large number of pixels with similar patterns, whereas foreground samples are likely to have a relatively small number of pixels.
Table 3: Experimental conditions.

| Parameter | Value |
|---|---|
| Acoustic feature | Log mel-band energy (64 dim.) |
| Frame length / shift | 40 ms / 20 ms |
| Length of sound clip | 10 s |
| Network structure | 3 CNN layers + 1 BiGRU layer + 1 fully connected layer |
| # channels of CNN layers | 128, 128, 128 |
| Filter size | 3×3, 3×3, 3×3 |
| Pooling size | 1×8, 1×4, 1×2 (max pooling) |
| # units in GRU layer | 32 |
| # units in fully conn. layer | 32 |
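Below is a minimal sketch of a CRNN matching the configuration in Table 3, again assuming PyTorch; details the table does not specify (padding, activations, which axis is pooled) are assumptions:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """3 CNN layers -> 1 bidirectional GRU layer -> 1 fully connected layer."""

    def __init__(self, n_mels=64, n_classes=25):
        super().__init__()
        layers, in_ch = [], 1
        for pool in (8, 4, 2):                     # frequency pooling: 64 -> 1
            layers += [
                nn.Conv2d(in_ch, 128, kernel_size=3, padding=1),  # 3x3 filters
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, pool)),  # pool frequency axis only
            ]
            in_ch = 128
        self.cnn = nn.Sequential(*layers)
        self.gru = nn.GRU(128, 32, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(64, n_classes)         # 2 x 32 BiGRU outputs

    def forward(self, x):
        # x: (batch, T, n_mels) log mel-band energies
        h = self.cnn(x.unsqueeze(1))               # (batch, 128, T, 1)
        h = h.squeeze(3).transpose(1, 2)           # (batch, T, 128)
        h, _ = self.gru(h)                         # (batch, T, 64)
        return self.fc(h)                          # frame-wise logits y_{t,m}
```

Pooling only along the frequency axis preserves the 20 ms frame resolution needed for frame-wise onset/offset prediction.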
Table 4: Average Fscores of SED using each loss function.

| Loss function | Macro-Fscore | Micro-Fscore |
|---|---|---|
| Inverse frequency loss | 1.72% | 7.44% |
| Duration robust loss () | 9.88% | 31.40% |
| Duration robust loss () | 11.16% | 32.20% |
4 Evaluation Experiments
4.1 Experimental Conditions
We evaluated the performance of SED using the proposed duration robust loss function. As the evaluation dataset, we constructed a dataset composed of parts of the TUT Sound Events 2016 and 2017 development sets and the TUT Acoustic Scenes 2016 development set [12, 13]. We selected a total of 192 min of sound clips including the 25 types of sound event listed in Fig. 2. As shown in Fig. 2, the number of time frames is seriously imbalanced between sound events; e.g., the sound event “fan” makes up a total of more than 100,000 frames in the dataset, whereas the sound event “mouse clicking” accounts for fewer than 3,500 frames. Note that these datasets were recorded not for a rare sound event detection task but for the analysis of real-life sounds; thus, the analysis of seriously imbalanced data is a general problem in SED.
As the acoustic feature, we used the 64-dimensional log mel-band energy, which was calculated for each 40 ms frame with a 20 ms shift. The acoustic feature was fed to a neural network with 3 CNN layers, 1 bidirectional gated recurrent unit (GRU) layer, and 1 fully connected layer, which was used in the baseline system of DCASE2018 challenge task 4 [17]. The performance of sound event detection was evaluated using the segment-based macro- and micro-Fscores [18]. Other experimental conditions are listed in Table 3.
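For reference, here is a sketch of this feature extraction, assuming librosa and a 44.1 kHz sampling rate (the sampling rate is not stated in the paper):

```python
import librosa
import numpy as np

def log_mel_band_energy(wav_path, sr=44100, n_mels=64):
    """64-dim log mel-band energy, 40 ms frames with a 20 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.040 * sr)                 # 40 ms frame length
    hop = int(0.020 * sr)                 # 20 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-10).T          # (T, 64); log compresses dynamic range
```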
4.2 Experimental Results
Table 4 shows the average performances of SED using the BCE loss, the BCE loss with inverse frequency class reweighting (referred to as the inverse frequency loss), and the proposed duration robust loss. For each loss, we conducted the evaluation experiment 10 times with random initial values of the model parameters. The results show that the proposed SED method improves the macro- and micro-Fscores by 3.15 and 4.37 percentage points, respectively, compared with SED using the BCE loss function. Because the macro-Fscore weights all sound event classes equally, it is sensitive to performance on sound event classes that have a small number of frames; the results therefore indicate that the proposed duration robust loss function is effective for sound events of short duration. On the other hand, the micro-Fscore is dominated by sound event classes with a large number of frames; thus, the experimental results also indicate that the proposed method can balance the detection of sound events of both short and long durations.
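To illustrate how the two averages weight classes differently, here is a frame-level sketch using scikit-learn with random stand-in data; note that the paper's evaluation is segment-based [18], so this is a simplification:

```python
import numpy as np
from sklearn.metrics import f1_score

# pred, ref: (n_frames, n_classes) binary matrices of frame-wise activity
rng = np.random.default_rng(0)
pred = rng.integers(0, 2, size=(1000, 25))
ref = rng.integers(0, 2, size=(1000, 25))

# Macro: F1 per class, then an unweighted mean -> rare classes count equally.
macro = f1_score(ref, pred, average="macro", zero_division=0)
# Micro: pool all frame decisions first -> frequent classes dominate.
micro = f1_score(ref, pred, average="micro", zero_division=0)
print(f"macro-F1 = {macro:.4f}, micro-F1 = {micro:.4f}")
```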
To investigate the details of the SED performance, we show the detection results for each sound event in Table 2. The results show that the proposed method improves both the Fscore and error rate for many sound events. For instance, the sound events “(object) impact,” “(object) rustling,” “dishes,” and “keyboard typing,” which have short durations (0.2–1.5 s), are detected more precisely by the proposed method. Similarly, the detection performance for the sound events “bird singing,” “car,” “fan,” and “large vehicle,” which have long durations, is also improved. On the other hand, several sound events with very short durations (e.g., “mouse wheeling”) cannot be detected even by the proposed method; this must be addressed in future work.
Figs. 4 and 5 show the average macro- and micro-Fscores, respectively, for various weighting factors $\gamma$. The results show that over the examined range of the weighting factor $\gamma$, the proposed method achieves better results than the conventional methods in terms of both the macro- and micro-Fscores. Thus, the experimental results show that the duration robust loss leads to stable performance over various weighting factors $\gamma$.
5 Conclusion
In this paper, we proposed an SED method using the duration robust loss function, which can focus training on sound events of short duration. In the proposed method, we assumed that many sound events of long duration are stationary sounds, which have less variation in their acoustic features, so their model training is easy. On the basis of the ease/difficulty of model training, we applied reweighting factors to the BCE loss that focus the model training on the sound event classes for which model training is difficult. Experimental results obtained using a dataset that is seriously imbalanced across sound event classes indicate that the proposed method detects sound events of short duration more precisely.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP19K20304 and the NVIDIA GPU Grant Program.
References
- [1] K. Imoto, “Introduction to acoustic event and scene analysis,” Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.
- [2] K. Imoto, S. Shimauchi, H. Uematsu, and H. Ohmuro, “User activity estimation method based on probabilistic generative model of acoustic event sequence with user activity and its subordinate categories,” Proc. INTERSPEECH, 2013.
- [3] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised detection of anomalous sound based on deep learning and the Neyman–Pearson lemma,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 1, pp. 212–224, 2019.
- [4] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, “DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” arXiv:2006.05822, pp. 1–4, 2020.
- [5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
- [6] Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, “Event-based video retrieval using audio,” Proc. INTERSPEECH, pp. 2085–2088, 2012.
- [7] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling, “Towards the automatic classification of avian flight calls for bioacoustic monitoring,” PLoS One, vol. 11, no. 11, 2016.
- [8] Y. Okamoto, K. Imoto, N. Tsukahara, K. Sueda, R. Yamanishi, and Y. Yamashita, “Crow call detection using gated convolutional recurrent neural network,” Proc. RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 171–174, 2020.
- [9] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
- [10] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. L. Roux, and K. Takeda, “Duration-controlled LSTM for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 11, pp. 2059–2070, 2017.
- [11] E. Çakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
- [12] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
- [13] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 85–92, 2017.
- [14] Y. Chen and H. Jin, “Rare sound event detection using deep learning and data augmentation,” Proc. INTERSPEECH, pp. 619–623, 2019.
- [15] Y. Wang, J. Salamon, N. J. Bryan, and J. P. Bello, “Few-shot sound event detection,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81–85, 2020.
- [16] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” Proc. IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
- [17] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, “Large-scale weakly labeled semi-supervised sound event detection in domestic environments,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 19–23, 2018.
- [18] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, 2016.