Sound Event Detection Using Graph Laplacian Regularization Based on Event Co-occurrence

02/02/2019
by   Keisuke Imoto, et al.

The types of sound events that occur in a situation are limited, and some sound events are likely to co-occur; for instance, "dishes" and "glass jingling." In this paper, we introduce a technique of sound event detection utilizing graph Laplacian regularization taking the sound event co-occurrence into account. To consider the co-occurrence of sound events in a sound event detection system, we first represent sound event occurrences as a graph whose nodes indicate the frequency of event occurrence and whose edges indicate the co-occurrence of sound events. We then utilize this graph structure for sound event modeling, which is optimized under an objective function with a regularization term considering the graph structure. Experimental results obtained using TUT Acoustic Scenes 2016 development and 2017 development datasets indicate that the proposed method improves the detection performance of sound events by 7.9 percentage points compared to that of the conventional CNN-BiGRU-based method in terms of the segment-based F1-score. Moreover, the results show that the proposed method can detect co-occurring sound events more accurately than the conventional method.


1 Introduction

Sound event detection (SED), in which the onsets and offsets of sound events are detected and the types of sounds are identified [1], has significant potential for use in many applications such as monitoring elderly people or infants [2, 3], automatic surveillance [4, 5, 6], and media retrieval [7].

SED typically falls into two categories: monophonic and polyphonic SED. Monophonic SED assumes that multiple sound events do not occur simultaneously; thus, a monophonic SED system detects at most one sound event in each time segment. However, since multiple sound events tend to overlap in time in real-life situations, monophonic SED systems have limited performance in practice. To overcome this limitation, many polyphonic SED systems, which can detect multiple overlapping sound events, have been developed.

One approach to polyphonic SED is to use non-negative matrix factorization (NMF) [8, 9]. In NMF-based SED, a polyphonic sound is decomposed into the product of a basis matrix and an activation matrix, where each basis vector represents a single sound event and the corresponding activation vector indicates when that event is active. SED systems based on neural networks have also been developed [10, 11, 12, 13]. For instance, convolutional neural networks (CNNs) are widely used for SED [10]. More recently, many methods using a recurrent neural network (RNN) or convolutional recurrent neural network (CRNN), which can capture the temporal information of sound events, have been developed [11, 12, 13]. These methods enable successful analysis of overlapping sound events with reasonable performance. However, when the number of types of sound events to be analyzed increases, these approaches require a large training dataset.
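To make the NMF decomposition concrete, the following minimal sketch factorizes a spectrogram with scikit-learn; it is only an illustration of the idea, not the systems of [8, 9], and the variable names and the simple thresholding rule are assumptions.

```python
# Minimal sketch of NMF-based SED (not the method of [8, 9]): a magnitude
# spectrogram V (freq x time) is factorized as V ~= W @ H, where each column
# of W is a spectral basis for one sound event and the corresponding row of H
# is its activation over time.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((64, 500)))   # stand-in spectrogram, 64 bins x 500 frames

n_events = 4                                 # number of event bases (hypothetical)
model = NMF(n_components=n_events, init="nndsvda", max_iter=500)
W = model.fit_transform(V)                   # basis matrix: (64, n_events)
H = model.components_                        # activation matrix: (n_events, 500)

# A simple detection rule: an event is "active" in a frame when its
# activation exceeds a threshold.
threshold = H.mean(axis=1, keepdims=True)
activity = H > threshold                     # boolean (n_events, 500) event-activity map
```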

On the other hand, as shown in Fig. 2, the number of types of sound events occurring in a single situation (acoustic scene) is limited, and some sound events are likely to co-occur. For instance, the sound events “dishes” and “glass jingling” tend to co-occur, and “car” and “brakes squeaking” are also likely to co-occur. By taking this into account in the model training of sound events, we expect to be able to model sound events efficiently and effectively with limited sound data [14, 15]. However, such conventional approaches cannot be directly integrated into state-of-the-art neural-network-based methods. In this paper, we therefore propose a neural-network-based SED approach that considers the co-occurrence of sound events in each sound clip. To consider the co-occurrence of sound events, we introduce graph Laplacian regularization into the objective function of a neural network.

The rest of this paper is organized as follows. In Section 2, a conventional SED approach based on a CRNN is introduced. In Section 3, the proposed approach to SED, in which the co-occurrence of sound events can be considered, is discussed. In Section 4, we report experiments conducted to evaluate the SED performance of the proposed and conventional methods, and in Section 5, we summarize and conclude this paper.

  
Figure 1: Histogram of sound event instances for each acoustic scene
Figure 2: Concept of graph representation of sound event occurrences

2 Conventional Sound Event Detection Based on Recurrent Neural Networks

In this section, we review conventional SED approaches based on neural networks. For polyphonic SED, CNN architectures are often used [10]. In CNN-based SED, the time-frequency representation of the acoustic feature $\mathbf{X} \in \mathbb{R}^{D \times T}$ is fed to a convolutional layer, where $D$ and $T$ are the dimension of the acoustic feature and the number of time frames of the input feature, respectively. This layer convolves the input feature map with two-dimensional filters; then, max pooling is conducted to reduce the dimension of the feature map. The CNN architecture allows feature extraction that is robust against the time and frequency shifts that often occur in SED.

To model time correlations explicitly, an RNN has been applied to SED in some works [11, 12, 13]. In particular, it has been reported that neural networks combining a CNN and a bidirectional gated recurrent unit (BiGRU) successfully detect sound events. In the CNN-BiGRU-based approaches, the acoustic feature $\mathbf{X}$ is also fed to the convolutional layer. The output of the convolutional layer is then concatenated over its $F$ feature maps into a sequence of vectors $\mathbf{x}_t$, which is fed to the BiGRU layer, where $F$ is the number of filters of the convolutional layer. In the BiGRU layer, the output vector $\mathbf{h}_t$ is calculated using the following equations:

$$\mathbf{z}^{(f)}_t = \sigma\big(\mathbf{W}^{(f)}_{z}\mathbf{x}_t + \mathbf{U}^{(f)}_{z}\mathbf{h}^{(f)}_{t-1} + \mathbf{b}^{(f)}_{z}\big) \tag{1}$$
$$\mathbf{r}^{(f)}_t = \sigma\big(\mathbf{W}^{(f)}_{r}\mathbf{x}_t + \mathbf{U}^{(f)}_{r}\mathbf{h}^{(f)}_{t-1} + \mathbf{b}^{(f)}_{r}\big) \tag{2}$$
$$\mathbf{h}^{(f)}_t = \big(\mathbf{1}-\mathbf{z}^{(f)}_t\big)\odot\mathbf{h}^{(f)}_{t-1} + \mathbf{z}^{(f)}_t\odot\tanh\big(\mathbf{W}^{(f)}\mathbf{x}_t + \mathbf{U}^{(f)}\big(\mathbf{r}^{(f)}_t\odot\mathbf{h}^{(f)}_{t-1}\big) + \mathbf{b}^{(f)}\big) \tag{3}$$
$$\mathbf{z}^{(b)}_t = \sigma\big(\mathbf{W}^{(b)}_{z}\mathbf{x}_t + \mathbf{U}^{(b)}_{z}\mathbf{h}^{(b)}_{t+1} + \mathbf{b}^{(b)}_{z}\big) \tag{4}$$
$$\mathbf{r}^{(b)}_t = \sigma\big(\mathbf{W}^{(b)}_{r}\mathbf{x}_t + \mathbf{U}^{(b)}_{r}\mathbf{h}^{(b)}_{t+1} + \mathbf{b}^{(b)}_{r}\big) \tag{5}$$
$$\mathbf{h}^{(b)}_t = \big(\mathbf{1}-\mathbf{z}^{(b)}_t\big)\odot\mathbf{h}^{(b)}_{t+1} + \mathbf{z}^{(b)}_t\odot\tanh\big(\mathbf{W}^{(b)}\mathbf{x}_t + \mathbf{U}^{(b)}\big(\mathbf{r}^{(b)}_t\odot\mathbf{h}^{(b)}_{t+1}\big) + \mathbf{b}^{(b)}\big) \tag{6}$$
$$\mathbf{h}_t = \big[\mathbf{h}^{(f)\top}_t,\ \mathbf{h}^{(b)\top}_t\big]^{\top} \tag{7}$$

where superscripts $(f)$ and $(b)$ indicate the forward and backward networks, respectively. Subscripts $t$, $z$, and $r$ indicate the time index, update gate, and reset gate, respectively. $\mathbf{z}_t$, $\mathbf{r}_t$, $\odot$, and $\sigma(\cdot)$ indicate the update gate vector, reset gate vector, Hadamard product, and sigmoid function, respectively. $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$ are parameter matrices and a bias vector. The BiGRU layer is followed by a fully connected layer, which is the output layer of the network. The final output of the network is calculated as

$$\mathbf{y}_t = \sigma\big(\mathbf{W}^{(o)}\mathbf{h}_t + \mathbf{b}^{(o)}\big). \tag{8}$$

The CNN-BiGRU network is optimized under the following sigmoid cross-entropy objective function $E(\boldsymbol{\Theta})$ using backpropagation through time (BPTT):

$$E(\boldsymbol{\Theta}) = -\sum_{t=1}^{T}\big\{\mathbf{s}_t\cdot\ln\mathbf{y}_t + (\mathbf{1}-\mathbf{s}_t)\cdot\ln(\mathbf{1}-\mathbf{y}_t)\big\}, \tag{9}$$

where $\mathbf{s}_t$ is a target vector indicating whether each sound event is active or nonactive in time frame $t$.
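As a concrete reference, the sketch below assembles a CNN-BiGRU model in PyTorch following the configuration in Table 1 (three 3 × 3 convolutional layers with 128 channels, max pooling, one BiGRU layer with 32 units, and a frame-wise sigmoid output trained with sigmoid cross-entropy). It is a minimal re-implementation for illustration, not the authors' code; the pooling placement (over frequency only) and the layer ordering are assumptions.

```python
# Minimal CNN-BiGRU sketch for polyphonic SED (illustrative, not the authors' code).
# Input:  log mel-band energies of shape (batch, 1, T=500 frames, D=64 mel bins).
# Output: frame-wise event probabilities of shape (batch, T, n_events).
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    def __init__(self, n_mels=64, n_events=25, n_channels=128, gru_units=32):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # 3x3 filters
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, 3)),  # pool over frequency only, keep T
            )
        self.cnn = nn.Sequential(block(1, n_channels),
                                 block(n_channels, n_channels),
                                 block(n_channels, n_channels))
        freq_out = n_mels // 3 // 3 // 3  # frequency bins left after three 1x3 poolings
        self.bigru = nn.GRU(input_size=n_channels * freq_out, hidden_size=gru_units,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * gru_units, n_events)

    def forward(self, x):                      # x: (batch, 1, T, n_mels)
        z = self.cnn(x)                        # (batch, C, T, freq_out)
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, T, C * freq_out)
        h, _ = self.bigru(z)                   # (batch, T, 2 * gru_units), Eqs. (1)-(7)
        return torch.sigmoid(self.fc(h))       # (batch, T, n_events), Eq. (8)

model = CNNBiGRU()
x = torch.randn(2, 1, 500, 64)                 # 2 clips, 500 frames, 64 mel bins
y = model(x)                                   # frame-wise event probabilities
# Sigmoid cross-entropy objective of Eq. (9):
target = torch.randint(0, 2, y.shape).float()
loss = nn.functional.binary_cross_entropy(y, target)
```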

3 Sound Event Detection with Event-Co-occurrence-Based Regularization

3.1 Motivation

Conventional CRNN-based approaches achieve reasonable SED performance when a sufficient amount of training sound data is available. However, since recording and annotating environmental sounds are very time-consuming tasks [1], sufficient training data are often unavailable, and the detection performance of the conventional methods then degrades. To overcome this problem, we propose a new SED method using graph Laplacian regularization.

As shown in Fig. 2, the number of types of sound events occurring in a single situation (acoustic scene) is limited and some sound events co-occur. For instance, the sound events “dishes” and “glass jingling” tend to co-occur, and “car” and “brakes squeaking” are also likely to co-occur. By considering the sound event co-occurrence in the model parameter estimation of a neural network, it is expected that sound events can be modeled efficiently and effectively with limited sound data.

3.2 Sound event detection using graph Laplacian regularization

To consider the co-occurrence of sound events, we introduce the graph representation of sound event occurrences and a graph-based regularization technique for the modeling of sound events.

Suppose that a graph representation of sound event occurrences has $N$ nodes and an adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, as shown in Fig. 2. Here, $N$ is the number of types of sound events. The weights of the nodes on the graph are the frequencies of sound event occurrences, and the weights of the edges indicate how often two sound events co-occur. Then, the graph Laplacian matrix [16] is defined as

$$\mathbf{L} = \mathbf{D} - \mathbf{A}, \tag{10}$$

where $\mathbf{D}$ is a diagonal, so-called degree matrix, whose diagonal elements are defined as $D_{nn} = \sum_{m=1}^{N} A_{nm}$.
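To make the graph construction concrete, the following sketch counts clip-level co-occurrences to build the adjacency matrix (normalized to the range [0, 1], as described in Section 4.1) and forms the graph Laplacian of Eq. (10). The toy annotations and event names are hypothetical.

```python
# Sketch of graph construction for Eq. (10): nodes are sound event types,
# edge weights count how often two events co-occur in the same clip.
import numpy as np

events = ["dishes", "glass jingling", "car", "brakes squeaking"]  # toy event set
clips = [  # hypothetical clip-level annotations (events present in each clip)
    {"dishes", "glass jingling"},
    {"dishes", "glass jingling"},
    {"car", "brakes squeaking"},
    {"car"},
]

N = len(events)
index = {e: i for i, e in enumerate(events)}

# Count co-occurrences over the training clips.
A = np.zeros((N, N))
for clip in clips:
    for e1 in clip:
        for e2 in clip:
            if e1 != e2:
                A[index[e1], index[e2]] += 1

A /= A.max()                   # normalize edge weights to the range [0, 1]
D = np.diag(A.sum(axis=1))     # degree matrix: D_nn = sum_m A_nm
L = D - A                      # graph Laplacian of Eq. (10)
```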

If two sound events tend to co-occur, the two nodes corresponding to these events are connected by an edge with a large weight, and the difference between their occurrence frequencies should therefore be small. Thus, adding the following penalty term to the cost function of the optimization problem enables us to learn a sound event model that takes the sound event co-occurrence into account [17, 18]:

$$\mathcal{R}(\mathbf{f}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} A_{nm} \, (f_n - f_m)^2 = \mathbf{f}^{\top} \mathbf{L} \, \mathbf{f}, \tag{11}$$

where $\mathbf{f} = (f_1, \dots, f_N)^{\top}$ is the vector of sound event occurrence frequencies. By integrating Eq. (11) into Eq. (9), we obtain the following objective function:

$$E'(\boldsymbol{\Theta}) = E(\boldsymbol{\Theta}) + \lambda \, \mathbf{f}^{\top} \mathbf{L} \, \mathbf{f}. \tag{12}$$

By approximating the frequencies of sound event occurrences by the network outputs $\mathbf{y}_t$, the objective function is finally given as

$$E'(\boldsymbol{\Theta}) = -\sum_{t=1}^{T}\big\{\mathbf{s}_t\cdot\ln\mathbf{y}_t + (\mathbf{1}-\mathbf{s}_t)\cdot\ln(\mathbf{1}-\mathbf{y}_t)\big\} + \lambda \sum_{t=1}^{T} \mathbf{y}_t^{\top} \mathbf{L} \, \mathbf{y}_t, \tag{13}$$

where $\lambda$ is the regularization weight. Thus, we can detect sound events appropriately while considering their co-occurrence, even when only limited training data are available for model training.
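The regularized objective of Eq. (13) can be sketched as follows, assuming frame-wise network outputs of shape (batch, T, N) and a Laplacian built as above; the function name and the placeholder value of the regularization weight are assumptions.

```python
# Sketch of the graph-Laplacian-regularized objective of Eq. (13):
# sigmoid cross-entropy plus lambda * y_t^T L y_t summed over frames.
import torch
import torch.nn.functional as F

def glr_loss(y, target, L, lambda_glr=1e-3):
    """y, target: (batch, T, N) event probabilities and 0/1 labels.
    L: (N, N) graph Laplacian. lambda_glr: regularization weight (placeholder value)."""
    bce = F.binary_cross_entropy(y, target)           # Eq. (9)
    # y_t^T L y_t for every frame, via a batched quadratic form.
    quad = torch.einsum("btn,nm,btm->bt", y, L, y)    # (batch, T)
    return bce + lambda_glr * quad.mean()             # Eq. (13), averaged over frames

# Usage with the CNN-BiGRU sketch of Section 2 and the Laplacian built above:
# L_t = torch.as_tensor(L, dtype=torch.float32)
# loss = glr_loss(model(x), target, L_t)
# loss.backward()
```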

Acoustic feature: Log mel-band energy
# dims. of acoustic feature: 64
Frame length / shift: 40 ms / 20 ms
Length of sequence: 500 frames (10 s)
Regularization weight: λ
Network structure of CNN-BiGRU: 3 conv. & 1 BiGRU layers
Filter size in CNN layers: 3 × 3
Pooling in CNN layers: 3 × 1 max pooling
Activation function: ReLU
# channels of CNN layers: 128, 128, 128
# GRU units: 32
# epochs for training: 150
Optimizer: Adam
Thresholding: Adaptive thresholding [19]
Table 1: Experimental conditions
Method             Fold 1           Fold 2           Fold 3           Fold 4           Average
                   F1-score / ER    F1-score / ER    F1-score / ER    F1-score / ER    F1-score / ER
CNN                48.67% / 0.708   31.36% / 0.829   33.11% / 0.813   23.55% / 0.899   34.17% / 0.812
CNN-GRU            51.00% / 0.672   36.64% / 0.795   35.95% / 0.797   34.70% / 0.864   39.57% / 0.782
CNN-BiGRU          53.10% / 0.652   35.10% / 0.807   38.34% / 0.769   38.42% / 0.814   41.24% / 0.761
CNN-BiGRU w/ GLR   55.59% / 0.631   48.28% / 0.742   50.39% / 0.678   42.39% / 0.820   49.16% / 0.718
Table 2: Detection performance of sound events in segment-based metrics (F1-score and error rate, ER)

4 Experiments

4.1 Experimental conditions

To evaluate the performance of the proposed method, we conducted experiments comparing conventional neural-network-based methods and the proposed method. For the experiments, we constructed a sound event dataset composed of parts of the TUT Sound Events 2016 development, TUT Sound Events 2017 development, and TUT Acoustic Scenes 2016 development datasets [20, 21]. From these three datasets, we extracted sound clips of four acoustic scenes, home and residential area (TUT Sound Events 2016), city center (TUT Sound Events 2017), and office (TUT Acoustic Scenes 2016), with a total duration of 192 min of audio. The experimental data include the 25 types of sound events listed in Fig. 2. Because the original TUT Acoustic Scenes 2016 development dataset does not have sound event annotations for the sound clips recorded in the office environment, we annotated them using the same protocol as in [20] and [21]. The experiments were conducted using the four-fold cross-validation setup provided with the TUT Acoustic Scenes 2016 development and 2017 development datasets.

As the input of each system, the 64-dimensional log mel-band energy, calculated for each 40 ms time frame with 50% overlap, was used. The adjacency matrix $\mathbf{A}$ was calculated by counting the number of co-occurring sound events in each sound clip over the training dataset and normalizing the counts to the range from 0 to 1. After obtaining the outputs $\mathbf{y}_t$, active sound events were predicted by an adaptive thresholding technique [19]. The detection performance was evaluated by the F1-score and error rate in segment-based metrics [22], with the segment length set to 40 ms. The other experimental conditions are listed in Table 1.
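For reference, the input features described above (64-dimensional log mel-band energies computed from 40 ms frames with 50% overlap) can be extracted as in the following sketch; the sampling rate and file path are assumptions, and this is not necessarily the exact feature extraction pipeline used by the authors.

```python
# Sketch of the input feature extraction: 64-band log mel energies,
# 40 ms frames with a 20 ms shift (50% overlap).
import librosa
import numpy as np

path = "audio/example.wav"            # hypothetical file path
y, sr = librosa.load(path, sr=44100)  # assumed sampling rate of the recordings

n_fft = int(0.040 * sr)               # 40 ms frame length
hop = int(0.020 * sr)                 # 20 ms frame shift
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=64)
log_mel = librosa.power_to_db(mel)    # (64, T) log mel-band energy

# 10 s input sequences of 500 frames, as in Table 1:
T = 500
segments = [log_mel[:, i:i + T] for i in range(0, log_mel.shape[1] - T + 1, T)]
```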

Figure 3: Annotations and event detection results of sounds recorded in city center. Only sound events occurring in the annotations are described.
Figure 4: Annotations and event detection results of sounds recorded in residential area. Only sound events occurring in the annotations are described.

4.2 Experimental results

Table 2 shows the F1-scores and error rates for CNN, CNN-GRU, CNN-BiGRU, and CNN-BiGRU with graph Laplacian regularization (GLR). The results show that the proposed method improves the SED performance in terms of both the F1-score and the error rate. In this experiment using the TUT Sound Events and TUT Acoustic Scenes datasets, the proposed method improves the average SED performance by 7.9 percentage points over the conventional CNN-BiGRU in terms of the F1-score.

To examine the detection results in more detail, we illustrate examples of annotations and predicted sound events in Figs. 3 and 4. These examples show that the proposed method detects sound events more accurately than the conventional methods and, in particular, that it detects co-occurring sound events more reliably. For instance, the sound events “car” and “brakes squeaking” are both detected when graph Laplacian regularization is adopted, whereas the conventional methods fail to detect the “brakes squeaking” events. Thus, we conclude that graph Laplacian regularization based on the co-occurrence of sound events is a promising technique for SED.

5 Conclusion

In this paper, we proposed a neural-network-based SED method with graph Laplacian regularization based on the co-occurrence of sound events. Unlike conventional CNN- or CNN-BiGRU-based SED methods, the proposed method can detect sound events using prior information on their co-occurrence. This enables sound events to be modeled effectively and efficiently even when there are many types of sound events to model and only limited training data. The experimental results obtained using the TUT Sound Events 2016, TUT Sound Events 2017, and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the SED performance by 7.9 percentage points in terms of the segment-based F1-score. The experimental results also show that the proposed method can detect sound events that tend to co-occur, such as “car” and “brakes squeaking,” more accurately than the conventional methods.

References

  • [1] K. Imoto, “Introduction to acoustic event and scene analysis,” Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.
  • [2] Y. Peng, C. Lin, M. Sun, and K. Tsai, “Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models,” Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 1218–1221, 2009.
  • [3] P. Guyot, J. Pinquier, and R. André-Obrecht, “Water sound recognition based on physical models,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 793–797, 2013.
  • [4] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, “Audio analysis for surveillance applications,” Proc. 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 158–161, 2005.
  • [5] A. Harma, M. F. McKinney, and J. Skowronek, “Automatic surveillance of the acoustic activity in our living environment,” Proc. IEEE International Conference on Multimedia and Expo (ICME), 2005.
  • [6] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
  • [7] Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, “Event-based video retrieval using audio,” Proc. INTERSPEECH, 2012.
  • [8] A. Dessein, A. Cont, and G. Lemaitre, “Real-time detection of overlapping sound events with non-negative matrix factorization,” Matrix Information Geometry, pp. 341–371, 2013, Springer.
  • [9] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, “Acoustic event detection method using semi-supervised non-negative matrix factorization with mixtures of local dictionaries,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 45–49, 2016.
  • [10] S. Hershey et al., “CNN architectures for large-scale audio classification,” Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135, 2017.
  • [11] E. Çakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1291–1303, 2017.
  • [12] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, and K. Takeda, “Duration-controlled LSTM for polyphonic sound event detection,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 11, pp. 2059–2070, 2017.
  • [13] S. Kothinti, K. Imoto, D. Chakrabarty, G. Sell, S. Watanabe, and M. Elhilali, “Joint acoustic and class inference for weakly supervised sound event detection,” Technical report of task 4 of DCASE Challenge 2018, pp. 1–4, 2018.
  • [14] A. Mesaros, T. Heittola, and A. Klapuri, “Latent semantic analysis in sound event detection,” Proc. European Signal Processing Conference (EUSIPCO), pp. 1307–1311, 2011.
  • [15] K. Imoto and N. Ono, “Acoustic topic model for scene analysis with intermittently missing observations,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 2, pp. 367–382, 2019.
  • [16] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013.
  • [17] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, 2011.
  • [18] T. Ichita, S. Kyochi, and K. Imoto, “Audio source separation based on nonnegative matrix factorization with graph harmonic structure,” Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018.
  • [19] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, “Surrey-CVSSP system for DCASE2017 challenge task4,” Technical report of task 4 of DCASE Challenge 2017, pp. 1–3, 2017.
  • [20] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” Proc. European Signal Processing Conference (EUSIPCO), pp. 1128–1132, 2016.
  • [21] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1–8, 2017.
  • [22] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, pp. 1–17, 2016.