Sound Event Detection Using Duration Robust Loss Function

06/27/2020
by Daichi Akiyama, et al.

Many methods of sound event detection (SED) based on machine learning regard a segmented time frame as one data sample for model training. However, the durations of sound events vary greatly depending on the sound event class, e.g., the sound event “fan” has a long duration, while the sound event “mouse clicking” is instantaneous. The difference in duration between sound event classes thus causes a serious data imbalance problem in SED. In this paper, we propose a method for SED using a duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we focus on the relationship between the duration of a sound event and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., the sound event “fan”) are stationary sounds, which have less variation in their acoustic features, so their model training is easy. Meanwhile, some sound events of short duration (e.g., the sound event “object impact”) have more than one audio pattern, such as attack, decay, and release parts. We thus apply a class-wise reweighting to the binary cross-entropy loss function depending on the ease/difficulty of model training. Evaluation experiments conducted using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the detection performance of sound events by 3.15 and 4.37 percentage points in the macro- and micro-Fscores, respectively, compared with a conventional method using the binary cross-entropy loss function.



1 Introduction

Sound event detection (SED) is the task of detecting sound event labels and their onset/offset in an audio recording, where a sound event indicates a type of sound such as “people talking” and “bird singing” [1]. SED plays an important role in realizing various applications using artificial intelligence in sounds, such as automatic life-logging, machine monitoring, automatic surveillance, media retrieval, and biomonitoring systems [2, 3, 4, 5, 6, 7, 8].

Figure 1: Examples of durations of sound event instances and number of data samples

Table 1: Average duration of one sound event instance in the datasets used for the evaluation experiments (TUT Sound Events 2016, 2017, and TUT Acoustic Scenes 2016 development [12, 13])
Sound event          Duration    Sound event          Duration
(object) banging     0.78 s      drawer               0.80 s
(object) impact      0.35 s      fan                  29.99 s
(object) rustling    2.24 s      glass jingling       0.80 s
(object) snapping    0.46 s      keyboard typing      0.21 s
(object) squeaking   0.74 s      large vehicle        14.68 s
bird singing         7.63 s      mouse clicking       0.14 s
brakes squeaking     1.65 s      mouse wheeling       0.16 s
breathing            0.43 s      people talking       4.09 s
car                  6.88 s      people walking       6.63 s
children             6.87 s      washing dishes       4.15 s
cupboard             0.65 s      water tap running    5.92 s
cutlery              0.74 s      wind blowing         6.09 s
dishes               1.24 s

Figure 2: Numbers of frames of sound events in the dataset used for the evaluation experiments

For machine-learning-based SED, many methods using neural networks, such as convolutional neural networks (CNNs) [9], recurrent neural networks (RNNs) [10], and convolutional recurrent neural networks (CRNNs) [11], have been proposed. In these methods, an audio clip is segmented into short time frames (e.g., 40 ms frame length with a 20 ms shift), and each segmented frame is regarded as one data sample when training and evaluating a sound event model. As shown in Fig. 1, each sound event instance has a different frame length, and the frame length of each instance varies depending on the sound event class. Table 1 and Fig. 2 respectively show the average duration of one sound event instance and the total number of frames of each sound event in the datasets used for the evaluation experiments discussed in Section 4 (TUT Sound Events 2016, 2017, and TUT Acoustic Scenes 2016 development [12, 13]). In this dataset, the number of frames of the sound event “mouse clicking,” which has an average length of 0.15 s, is 1,163, while that of the sound event “fan,” which has an average length of 29.99 s, is 116,837. Thus, the difference in time duration between sound events causes a serious data imbalance problem in SED. There are some conventional methods of SED for imbalanced data [14, 15]. For instance, Chen and Jin have proposed a method for detecting rare sound events using data augmentation [14]. Wang et al. have proposed a method for few-shot sound event detection based on metric learning [15]. However, the data imbalance problem caused by the difference in time duration between sound event classes has not been investigated in these works.
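
To make the frame-count imbalance concrete, the following minimal Python sketch (not from the paper) counts how many analysis frames each event class occupies, assuming strong annotations are given as hypothetical (onset, offset, label) tuples and using the 20 ms frame shift assumed in this paper.

```python
# Minimal sketch (not from the paper): counting how many analysis frames each
# event class occupies, given strong labels as hypothetical (onset, offset,
# label) tuples and the 20 ms frame shift assumed in this paper.
from collections import Counter

FRAME_SHIFT = 0.020  # 20 ms hop between frames


def count_active_frames(annotations):
    """Return the number of active frames per event class."""
    counts = Counter()
    for onset, offset, label in annotations:
        # Number of frame hops covered by the annotated segment.
        counts[label] += max(1, int(round((offset - onset) / FRAME_SHIFT)))
    return counts


# Toy example: one 30 s "fan" instance vs. ten 0.14 s "mouse clicking" instances.
toy = [(0.0, 29.99, "fan")] + [(i, i + 0.14, "mouse clicking") for i in range(10)]
print(count_active_frames(toy))  # "fan" dominates the frame-level label counts
```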

In this paper, we address the data imbalance problem in SED caused by the difference in time duration using a duration robust loss function. A preliminary experiment showed that a simple classwise reweighting of the loss function on the basis of the inverse frequency of sound event occurrences (the details are described in Sec. 2) is not effective for SED using extremely imbalanced data. In the proposed method, we instead apply another classwise reweighting of the loss function in accordance with the ease/difficulty of model training. As shown in Fig. 3, this is motivated by the observation that many sound events of long duration (e.g., “fan” or “car”) are stationary sounds, which have less variation in their acoustic features, so their model training is easy. We demonstrate that the proposed reweighting of the loss function can prevent sound events of long duration from dominating the model training and improve the performance of SED on a seriously imbalanced dataset.

The rest of this paper is organized as follows. In Sec. 2, we introduce the conventional SED method, and in Sec. 3, we propose the SED method based on the duration robust loss function. In Sec. 4, we discuss the SED performance evaluation for an imbalanced training dataset. In Sec. 5, we conclude this paper.

2 Conventional Method

Let us consider the training dataset $\mathcal{D} = \{\mathbf{X}_n, \mathbf{Z}_n\}_{n=1}^{N}$, where $\mathbf{X}_n$ is an acoustic feature of sound clip $n$ and $\mathbf{Z}_n = \{\mathbf{z}_{t,n}\}_{t=1}^{T}$ is the sound event label, in which $\mathbf{z}_{t,n} \in \{0, 1\}^{M}$ indicates a multi-hot vector of time frame $t$ in the sound clip over the $M$ sound event classes. The goal of SED is to predict the sound event labels in an unknown sound using

$$\hat{z}_{t,m} = \begin{cases} 1 & \big(\sigma(y_{t,m}) \ge \phi\big) \\ 0 & (\text{otherwise}), \end{cases} \qquad (1)$$

where $y_{t,m}$ is the output of the model $f(\mathbf{X}; \Theta)$ for sound event $m$ in time frame $t$, and $f$, $\Theta$, and $\phi$ are the model, the model parameters trained using $\mathcal{D}$, and the detection threshold, respectively. In conventional SED, the mel-band energy and mel-frequency cepstral coefficients (MFCCs) are often used as the acoustic features $\mathbf{X}$. As the model $f$, a CNN-, RNN-, or CRNN-based neural network is applied. The model parameters $\Theta$ are estimated using the following binary cross-entropy (BCE) loss function $E_{\mathrm{BCE}}(\Theta)$ and the backpropagation technique:

$$E_{\mathrm{BCE}}(\Theta) = -\sum_{t=1}^{T}\sum_{m=1}^{M} \Big\{ z_{t,m} \log \sigma(y_{t,m}) + (1 - z_{t,m}) \log\big(1 - \sigma(y_{t,m})\big) \Big\}, \qquad (2)$$

where $\sigma(\cdot)$ and $y_{t,m}$ are the sigmoid function and the output of the network for sound event $m$ in time frame $t$, respectively. $z_{t,m}$ is the target label in time frame $t$ and is 1 if sound event $m$ is active in time frame $t$ and 0 otherwise. Note that we omit the sound clip index $n$ to simplify the equations; $E_{\mathrm{BCE}}(\Theta)$ is actually calculated by summing the binary cross entropy over the time frames of all sound clips. Since the frame length of sound events varies considerably depending on the event class, the model parameter estimation using Eq. (2) suffers from the data imbalance problem. As a result, sound events of long duration dominate the model training and those of short duration are likely to be downweighted.
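
As an illustration, the following minimal PyTorch sketch (not the authors' code) implements the frame-wise BCE objective of Eq. (2) and the thresholded prediction rule of Eq. (1); the tensor shapes, threshold value, and toy data are assumptions made for the example.

```python
# Minimal PyTorch sketch (not the authors' code) of the frame-wise BCE loss in
# Eq. (2) and the thresholded prediction rule in Eq. (1).
# y_logits: network outputs y_{t,m} before the sigmoid, shape (T, M)
# z:        multi-hot targets z_{t,m} in {0, 1}, shape (T, M)
import torch
import torch.nn.functional as F


def bce_loss(y_logits: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Equivalent to -sum_{t,m} [ z log sigma(y) + (1 - z) log(1 - sigma(y)) ].
    return F.binary_cross_entropy_with_logits(y_logits, z, reduction="sum")


def detect(y_logits: torch.Tensor, phi: float = 0.5) -> torch.Tensor:
    # Eq. (1): an event is declared active wherever sigma(y_{t,m}) >= phi.
    return (torch.sigmoid(y_logits) >= phi).float()


# Toy usage: T = 500 frames (a 10 s clip at a 20 ms shift) and M = 25 classes.
T, M = 500, 25
y_logits = torch.randn(T, M, requires_grad=True)
z = torch.randint(0, 2, (T, M)).float()
loss = bce_loss(y_logits, z)
loss.backward()  # backpropagation as described in Sec. 2
print(loss.item(), detect(y_logits).shape)
```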

Figure 3: Spectrogram and sound event label of long/short duration sounds
Table 2: Sound event detection performance for each event (each cell: Fscore / Error rate)
Event                 BCE loss         Inverse freq. loss   Duration robust loss ()   Duration robust loss ()
(object) banging      0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
(object) impact       1.03% / 1.007    0.02% / 1.000        2.86% / 1.021             7.21% / 1.084
(object) rustling     0.17% / 1.023    0.00% / 1.000        1.20% / 1.045             2.37% / 1.092
(object) snapping     0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.003
(object) squeaking    0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
bird singing          25.57% / 1.240   0.00% / 1.000        25.65% / 1.256            28.41% / 1.224
brakes squeaking      0.67% / 1.000    2.39% / 0.989        4.01% / 0.988             13.35% / 0.960
breathing             0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
car                   52.09% / 0.799   26.89% / 1.010       55.22% / 0.768            54.01% / 0.811
children              0.00% / 1.020    0.00% / 1.000        0.00% / 1.031             0.00% / 1.032
cupboard              0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
cutlery               0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
dishes                0.00% / 1.000    0.00% / 1.000        0.01% / 1.000             1.66% / 0.998
drawer                0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.02% / 1.000
fan                   17.86% / 0.908   0.00% / 1.000        25.44% / 0.861            28.91% / 0.839
glass jingling        0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
keyboard typing       0.01% / 1.000    0.00% / 1.000        0.23% / 1.000             0.49% / 1.000
large vehicle         45.25% / 1.202   10.93% / 0.968       47.67% / 1.190            47.44% / 1.141
mouse clicking        0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.08% / 1.000
mouse wheeling        0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.00% / 1.000
people talking        2.68% / 1.053    0.00% / 1.000        4.40% / 1.091             5.26% / 1.119
people walking        26.32% / 0.907   0.00% / 1.000        32.09% / 0.898            30.34% / 0.907
washing dishes        15.63% / 0.959   0.00% / 1.000        13.97% / 1.003            20.65% / 1.031
water tap running     13.00% / 0.942   2.73% / 0.986        34.16% / 0.801            38.76% / 0.773
wind blowing          0.00% / 1.000    0.00% / 1.000        0.00% / 1.000             0.02% / 1.024

One simple idea to address the imbalanced-data problem is a classwise reweighting of the loss function in accordance with the inverse frequency of sound event occurrences as follows:

$$E_{\mathrm{inv}}(\Theta) = -\sum_{t=1}^{T}\sum_{m=1}^{M} \frac{C}{N_m} \Big\{ z_{t,m} \log \sigma(y_{t,m}) + (1 - z_{t,m}) \log\big(1 - \sigma(y_{t,m})\big) \Big\}. \qquad (3)$$

Here, $N_m$ and $C$ are the number of frames of sound event $m$ in a sound clip and a constant number, respectively. However, in a preliminary experiment, we confirmed that the simple classwise reweighting using $E_{\mathrm{inv}}(\Theta)$ is not effective for SED.
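
A minimal PyTorch sketch of this reweighting is shown below; it assumes the per-class weight takes the form $C/N_m$ as reconstructed in Eq. (3) and applies it to both the active and inactive terms, and the frame counts in the toy usage are illustrative.

```python
# Minimal PyTorch sketch (not the authors' code) of the inverse-frequency
# reweighting of Eq. (3): class m is weighted by C / N_m, where N_m is the
# number of active frames of event m and C is a constant (500 in Table 3).
import torch
import torch.nn.functional as F


def inverse_frequency_loss(y_logits, z, frame_counts, C=500.0):
    # frame_counts: tensor of shape (M,) holding N_m for each event class.
    weights = C / frame_counts.clamp(min=1.0)              # shape (M,)
    per_frame = F.binary_cross_entropy_with_logits(
        y_logits, z, reduction="none")                     # shape (T, M)
    return (weights * per_frame).sum()


# Toy usage: a frequent class (100,000 frames) vs. a rare class (1,000 frames).
T, M = 500, 2
y_logits, z = torch.randn(T, M), torch.randint(0, 2, (T, M)).float()
print(inverse_frequency_loss(y_logits, z, torch.tensor([100000.0, 1000.0])))
```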

3 Proposed Method

In this paper, we propose a duration robust loss function for SED, whereby training can be focused on sound events of short duration. In the proposed method, we focus on the relationship between the sound event duration and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., “fan” and “car”) are stationary sounds, which have less variation in their acoustic features, so their model training is easy. Meanwhile, some sound events of short duration (e.g., “object impact” and “keyboard typing”) have more than one audio pattern, such as attack, decay, and release parts. Therefore, the ease/difficulty of model training is useful information for controlling the training weight and offers a more direct way of controlling the contribution of each class to the loss.

To control the training weight of sound events in accordance with the ease/difficulty of model training, we add the factors $\big(1 - \sigma(y_{t,m})\big)^{\gamma}$ and $\sigma(y_{t,m})^{\gamma}$ to the BCE loss as follows:

$$E_{\mathrm{dur}}(\Theta) = -\sum_{t=1}^{T}\sum_{m=1}^{M} \Big\{ \big(1 - \sigma(y_{t,m})\big)^{\gamma} z_{t,m} \log \sigma(y_{t,m}) + \sigma(y_{t,m})^{\gamma} (1 - z_{t,m}) \log\big(1 - \sigma(y_{t,m})\big) \Big\}, \qquad (4)$$

where $\gamma \ge 0$ is the weighting parameter that controls the focusing weight. When sound event $m$ is active in time frame $t$ but the network output $\sigma(y_{t,m})$ is small, the reweighting factor $\big(1 - \sigma(y_{t,m})\big)^{\gamma}$ is close to one and does not greatly affect the loss, whereas if the network output is large, the reweighting factor approaches zero and the loss is down-weighted. Thus, the loss function focuses the model training on the sound event classes that are difficult to train.
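
The following minimal PyTorch sketch (not the authors' implementation) realizes the loss of Eq. (4) as reconstructed above; setting $\gamma = 0$ recovers the plain BCE loss of Eq. (2).

```python
# Minimal PyTorch sketch (not the authors' code) of the duration robust loss in
# Eq. (4): the active term is scaled by (1 - sigma(y))^gamma and the inactive
# term by sigma(y)^gamma, so frames that are already classified well contribute
# little to the loss.
import torch


def duration_robust_loss(y_logits, z, gamma=1.0, eps=1e-7):
    p = torch.sigmoid(y_logits).clamp(eps, 1.0 - eps)     # sigma(y_{t,m})
    pos = (1.0 - p) ** gamma * z * torch.log(p)           # active frames
    neg = p ** gamma * (1.0 - z) * torch.log(1.0 - p)     # inactive frames
    return -(pos + neg).sum()


# Toy usage: gamma = 0 reduces Eq. (4) to the plain BCE loss of Eq. (2).
T, M = 500, 25
y_logits, z = torch.randn(T, M), torch.randint(0, 2, (T, M)).float()
print(duration_robust_loss(y_logits, z, gamma=1.0),
      duration_robust_loss(y_logits, z, gamma=0.0))
```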

The concept of the duration robust loss function is similar to that of the focal loss [16] in object detection. In object detection, there is also a serious imbalance problem, where background samples tend to have a large number of pixels with similar patterns, whereas foreground samples are likely to have a relatively small number of pixels.

Table 3: Experimental conditions
Acoustic feature               Log mel-band energy (64 dim.)
Frame length / shift           40 ms / 20 ms
Length of sound clip           10 s
Network structure              3 CNN + 1 BiGRU + 1 fully conn. layers
# channels of CNN layers       128, 128, 128
Filter size                    3×3, 3×3, 3×3
Pooling size                   1×8, 1×4, 1×2 (max pooling)
# units in GRU layer           32
# units in fully conn. layer   32
Detection threshold φ          0.5
Constant number C              500

Table 4: Average performance of SED
Method                       Macro-Fscore   Micro-Fscore
BCE loss                     8.01%          27.83%
Inverse frequency loss       1.72%          7.44%
Duration robust loss ()      9.88%          31.40%
Duration robust loss ()      11.16%         32.20%

4 Experiments

4.1 Experimental Conditions

We evaluated the performance of SED using the proposed duration robust loss function. As the evaluation dataset, we constructed a dataset composed of parts of TUT Sound Events 2016 development, TUT Sound Events 2017 development, and TUT Acoustic Scenes 2016 development [12, 13]. We selected a total of 192 min of sound clips including the 25 types of sound event listed in Fig. 2. As shown in Fig. 2, the numbers of time frames of the sound events are seriously imbalanced, e.g., the sound event “fan” makes up a total of more than 100,000 frames in the dataset, while the sound event “mouse clicking” accounts for less than 3,500 frames. Note that these datasets were recorded not for a rare sound event detection task but for the analysis of real-life sounds; thus, the analysis of seriously imbalanced data is a general problem in SED.

As the acoustic feature, we used the 64-dimensional log mel-band energy, which was calculated with a frame length of 40 ms and a frame shift of 20 ms. The acoustic feature was fed to a neural network with 3 CNN layers, 1 bidirectional gated recurrent unit (GRU) layer, and 1 fully connected layer, which was used for the baseline system of DCASE2018 challenge task 4 [17]. The performance of sound event detection was evaluated using the segment-based macro- and micro-Fscores [18]. Other experimental conditions are listed in Table 3.
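
For reference, the following is a minimal PyTorch sketch of a CRNN consistent with the settings in Table 3; it is a reconstruction rather than the DCASE2018 task 4 baseline code, and the frequency-only max-pooling sizes (1×8, 1×4, 1×2) and layer ordering are assumptions.

```python
# Minimal PyTorch sketch of a CRNN consistent with Table 3 (a reconstruction,
# not the DCASE2018 task 4 baseline code): 3 CNN blocks with 128 channels and
# 3x3 filters, frequency-only max pooling of 8, 4, and 2, one bidirectional GRU
# with 32 units, and a fully connected output layer over 25 event classes.
import torch
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=25):
        super().__init__()
        layers, in_ch = [], 1
        for pool in (8, 4, 2):  # pool only along the frequency axis
            layers += [nn.Conv2d(in_ch, 128, kernel_size=3, padding=1),
                       nn.BatchNorm2d(128), nn.ReLU(),
                       nn.MaxPool2d(kernel_size=(1, pool))]
            in_ch = 128
        self.cnn = nn.Sequential(*layers)
        # Assumes n_mels = 64 so that 8 * 4 * 2 pooling collapses frequency to 1.
        self.gru = nn.GRU(128, 32, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(64, n_classes)  # 2 x 32 GRU units -> event logits

    def forward(self, x):                   # x: (batch, frames, n_mels)
        h = self.cnn(x.unsqueeze(1))        # (batch, 128, frames, 1)
        h = h.squeeze(3).transpose(1, 2)    # (batch, frames, 128)
        h, _ = self.gru(h)                  # (batch, frames, 64)
        return self.fc(h)                   # frame-wise logits y_{t,m}


# Toy usage: a 10 s clip at a 20 ms shift gives roughly 500 frames.
print(CRNN()(torch.randn(2, 500, 64)).shape)  # torch.Size([2, 500, 25])
```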

4.2 Experimental Results

Table 4 shows the average performance of SED using the BCE loss, the BCE loss with inverse frequency class reweighting (referred to as the inverse frequency loss), and the proposed duration robust loss. For each loss, we conducted the evaluation experiment 10 times with random initial values for the model parameters. The results show that the proposed SED method improves the macro- and micro-Fscores by 3.15 and 4.37 percentage points, respectively, compared with SED using the BCE loss function. Because the macro-Fscore tends to be weighted towards sound event classes that have a small number of frames, the results indicate that the proposed duration robust loss function is effective for sound events of short duration. On the other hand, the micro-Fscore is likely to be weighted towards sound event classes with a large number of frames; thus, the experimental results also indicate that the proposed method can balance the detection of sound events of both short and long durations.
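
To build intuition for this difference, the following simplified sketch (not the segment-based evaluation of [18]) computes frame-level macro- and micro-Fscores with scikit-learn for a synthetic frequent class and a synthetic rare class; the activity rates and detector behavior are invented for illustration.

```python
# Simplified sketch (not the segment-based evaluation of [18]): frame-level
# macro- and micro-Fscores for a synthetic frequent class ("fan"-like) and a
# synthetic rare class ("mouse clicking"-like).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
T = 10_000
# Ground truth: class 0 is active in ~50% of frames, class 1 in only ~1%.
y_true = np.stack([rng.random(T) < 0.5, rng.random(T) < 0.01], axis=1).astype(int)
# Detector: perfect on the frequent class, misses ~90% of rare-class frames.
y_pred = y_true.copy()
y_pred[:, 1] = y_pred[:, 1] * (rng.random(T) < 0.1)

print("macro:", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("micro:", f1_score(y_true, y_pred, average="micro", zero_division=0))
# The macro score drops sharply because the rare class is weighted equally.
```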

Figure 4: Average macro-Fscores of SED with various weighting factors γ

To investigate the details of the SED performance, we show the detection results for each sound event in Table 2. The results show that the proposed method improves both the Fscore and error rate for many sound events. For instance, the sound events “(object) impact,” “(object) rustling,” “dishes,” and “keyboard typing,” which have short durations (0.2–1.5 s), can be detected more precisely by the proposed method. Similarly, the detection performance for the sound events “bird singing,” “car,” “fan,” and “large vehicle,” which have long durations, is also improved. On the other hand, several sound events with very short durations (e.g., “mouse wheeling”) cannot be detected even by the proposed method; this issue must be addressed in future work.

Figs. 4 and 5 show the average macro- and micro-Fscores, respectively, for various weighting factors γ. The results show that even when the weighting factor γ is varied, the proposed method achieves better results than the conventional methods in terms of both the macro- and micro-Fscores. Thus, the experimental results show that the duration robust loss leads to stable performance over various weighting factors γ.

5 Conclusion

In this paper, we proposed a sound event detection method using the duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we assumed that many sound events of long duration are stationary sounds, which have less variation in their acoustic features, so their model training is easy. On the basis of the ease/difficulty of model training, we applied reweighting factors to the BCE loss, which focus the model training on the sound event classes for which model training is difficult. Experimental results obtained using a dataset whose sound event classes are seriously imbalanced indicate that the proposed method can detect sound events of short duration more precisely.

6 Acknowledgement

This work was supported by JSPS KAKENHI Grant Number JP19K20304 and NVIDIA GPU Grant Program.

Figure 5: Average micro-Fscores of SED with various weighting factors γ

References

  • [1] K. Imoto, “Introduction to acoustic event and scene analysis,” Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.
  • [2] K. Imoto, S. Shimauchi, H. Uematsu, and H. Ohmuro, “User activity estimation method based on probabilistic generative model of acoustic event sequence with user activity and its subordinate categories,” Proc. INTERSPEECH, 2013.
  • [3] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised detection of anomalous sound based on deep learning and the Neyman–Pearson lemma,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 1, pp. 212–224, 2019.
  • [4] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, “DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” arXiv, arXiv:2006.05822, pp. 1–4, 2020.
  • [5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
  • [6] Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, “Event-based video retrieval using audio,” Proc. INTERSPEECH, pp. 2085–2088, 2012.
  • [7] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling, “Towards the automatic classification of avian flight calls for bioacoustic monitoring,” PLoS One, vol. 11, no. 11, 2016.
  • [8] Y. Okamoto, K. Imoto, N. Tsukahara, K. Sueda, R. Yamanishi, and Y. Yamashita, “Crow call detection using gated convolutional recurrent neural network,” Proc. RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP), pp. 171–174, 2020.
  • [9] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
  • [10] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. L. Roux, and K. Takeda, “Duration-controlled LSTM for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 11, pp. 2059–2070, 2017.
  • [11] E. Çakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
  • [12] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
  • [13] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 85–92, 2017.
  • [14] Y. Chen and H. Jin, “Rare sound event detection using deep learning and data augmentation,” Proc. INTERSPEECH, pp. 619–623, 2019.
  • [15] Y. Wang, J. Salamon, N. J. Bryan, and J. P. Bello, “Few-shot sound event detection,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81–85, 2020.
  • [16] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” Proc. IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
  • [17] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, “Large-scale weakly labeled semi-supervised sound event detection in domestic environments,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 19–23, 2018.
  • [18] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, 2016.