Learning How to Listen: A Temporal-Frequential Attention Model for Sound Event Detection

by Yu-Han Shen, et al.
Tsinghua University
NetEase, Inc

In this paper, we propose a temporal-frequential attention model for sound event detection (SED). Our network learns how to listen with two attention models: a temporal attention model and a frequential attention model. The proposed system learns when to listen using the temporal attention model, and where to listen on the frequency axis using the frequential attention model. With these two models, the system pays more attention to important frames or segments and to important frequency components. The proposed method is evaluated on task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge and achieves competitive performance.




1 Introduction

Sound event detection (SED), also known as acoustic event detection (AED), is a popular topic in the field of acoustic signal processing. The aim of SED is to locate the onset and offset times of target sound events present in an audio recording.

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge is an international challenge concerning SED that has been held for several years. In the DCASE 2017 Challenge, the theme of task 2 is “detection of rare sound events” [1]. It provides a dataset [2] and a baseline for rare sound event detection in synthesized recordings. Here, “rare” means that a target sound event (babycry, glassbreak, gunshot) occurs at most once within a 30-second recording. The mean duration of the target sound events is also very short: 2.25 s for babycry, 1.16 s for glassbreak and 1.32 s for gunshot, which leads to a serious data imbalance problem. All audio recordings are annotated with ground-truth labels of event class, onset time and offset time. According to the task description, a separate system should be developed for each of the three target event classes to detect the temporal occurrences of these events [1].

Among the submissions in DCASE 2017, most models are based on deep neural networks. Both of the top 2 teams [3, 4] utilized Convolutional Recurrent Neural Networks (CRNN) as their main architecture. They combined Convolutional Neural Networks (CNN) with Recurrent Neural Networks (RNN) to make frame-level predictions for target events and then adopted post-processing to get the onset and offset times of sound events. In 2018, Kao et al. [5] proposed a Region-based Convolutional Recurrent Neural Network (R-CRNN) to improve on this previous work. In our work, we follow the main architecture of those three models and use a CRNN as the main classifier.

Inspired by the excellent performance of attention models in machine translation [6], image captioning [7], speaker verification [8] and audio tagging [9], we propose an attention model for SED. Currently, most attention models in speech and audio processing concentrate only on the time domain. We propose a temporal-frequential attention model that focuses on important frequency components as well as important frames or segments. Our attention model can learn how to listen by extracting not only temporal information but also spectral information. In addition, we visualize the weights of the attention models to show what our models have actually learnt.

The rest of this paper is organized as follows: in Section 2, we introduce our methods in detail, mainly including feature extraction, baseline and temporal-frequential attention model. The dataset, experiment setup and evaluation metric are illustrated in Section 3. The results and analysis are presented in Section 4. Finally, we conclude our work in Section 5.

2 Methods

2.1 System overview

As shown in Figure 1, our proposed system is a CRNN architecture with a temporal-frequential attention model. The input of our system is a two-dimensional acoustic feature. It is fed into a frequential attention model to produce frequential attention weights. Our system learns to focus on specific frequency components of the audio using those attention weights. The input acoustic feature is multiplied by those attention weights and then passed through the CRNN architecture. Compared with the traditional CRNN [3, 4], we add a temporal attention model to let our system pay different attention to different frames. The temporal attention weights are multiplied element-wise with the outputs of the CRNN. A sigmoid activation is used to get normalized probabilities. Then we apply post-processing to get the final detection outputs.

Figure 1: Illustration of overall system.

2.2 Feature extraction

The acoustic feature used in our work is log filter bank energy (Fbank). The sampling rate of the input audio is 44.1 kHz. To extract the Fbank feature, each audio recording is divided into frames of 40 ms duration with shifts of 20 ms. Then we apply 128 mel-scale filters covering the frequency range 300 to 22050 Hz to the magnitude spectrum of each frame. Finally, we take the logarithm of the filter outputs to get the Fbank feature. The extracted Fbank feature is normalized to zero mean and unit standard deviation before being fed into the neural networks.
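To make the filter bank layout concrete, the following is a minimal sketch of where 128 mel-spaced triangular filters between 300 Hz and 22050 Hz would sit. It assumes the common HTK-style mel formula (2595 log10(1 + f/700)), which the text does not specify, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel conversion (an assumption; the paper does not name a formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_filters=128, f_min=300.0, f_max=22050.0):
    """Return n_filters + 2 frequencies (Hz) equally spaced on the mel scale.

    Triangular filter i would span edges[i] .. edges[i + 2] and peak at edges[i + 1].
    """
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_hi - mel_lo) / (n_filters + 1)
    return [mel_to_hz(mel_lo + i * step) for i in range(n_filters + 2)]


edges = mel_band_edges()
centers = edges[1:-1]  # one center frequency per mel filter
print(len(centers), round(centers[0], 1), round(centers[-1], 1))
```

Because the spacing is uniform in mel rather than in Hz, the filters are narrow at low frequencies and wide at high frequencies.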

2.3 Baseline

We adopt a state-of-the-art CRNN as the baseline. The input is the Fbank feature of a 30-second audio recording. The output of our system gives a binary prediction for each segment, with a time resolution of 80 ms (4 times the input frame shift of 20 ms).
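The framing arithmetic implied by Sections 2.2 and 2.3 can be checked with a few lines. This is only a sketch: it assumes the last frames are padded so that the frame count equals the clip length divided by the frame shift, which the text does not state, and the constant names are hypothetical.

```python
import math

SAMPLE_RATE = 44100   # Hz, from Section 2.2
CLIP_SEC = 30         # length of one recording in seconds
FRAME_SEC = 0.040     # 40 ms frame duration
HOP_SEC = 0.020       # 20 ms frame shift
FRAMES_PER_SEGMENT = 4  # 80 ms output resolution / 20 ms frame shift

frame_len = round(SAMPLE_RATE * FRAME_SEC)   # samples per frame
hop_len = round(SAMPLE_RATE * HOP_SEC)       # samples per frame shift
n_samples = SAMPLE_RATE * CLIP_SEC

# Assumption: the signal is padded so every hop position yields a frame.
n_frames = math.ceil(n_samples / hop_len)
n_segments = n_frames // FRAMES_PER_SEGMENT

print(frame_len, hop_len, n_frames, n_segments)
```

Under this padding assumption the segment count is 375, which agrees with the first dimension of the bi-GRU output given in this section.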

The CRNN architecture consists of three parts: convolutional neural network (CNN), recurrent neural network (RNN) and fully-connected layer. The architecture of our CRNN is similar to that in [5], and it is shown in Figure 2.

The CNN part contains four convolutional layers, and each layer is followed by batch normalization, a ReLU activation unit and a dropout layer. We add two residual connections to improve the performance of the CNN. Max-pooling layers (on both the time axis and the frequency axis) are used to retain the most important information in each feature map. At the end of the CNN, the extracted features over different convolutional channels are stacked along the frequency axis.

The RNN part is a bi-directional gated recurrent unit (bi-GRU) layer. Compared with a uni-directional GRU, a bi-GRU can better extract the temporal structures of sound events. We add the outputs of the forward GRU and the backward GRU to get the final outputs of the bi-GRU. The size of the output of the bi-GRU is (375, U), where U is the number of GRU units.

After the bi-GRU, a single fully-connected layer with sigmoid activation is used to give classification result for each segment (4 frames). The output denotes the presence probabilities of the target event in each segment.

In order to determine the presence of an event, a binary prediction is given for each segment with a constant threshold of 0.5. These predictions are post-processed with a median filter of length 240 ms. Since at most one event would occur in a 30-s audio, we select the longest continuous sequence of positive predictions to get the onset and offset of target events.
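The post-processing chain just described (thresholding, median filtering, then keeping the longest positive run) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the 240 ms filter is taken as 3 segments of 80 ms, and padding the edges by repetition is an assumption.

```python
from statistics import median

SEGMENT_SEC = 0.08   # 80 ms per segment
THRESHOLD = 0.5      # constant decision threshold
MEDIAN_LEN = 3       # 240 ms / 80 ms = 3 segments


def median_filter(binary, length=MEDIAN_LEN):
    """Sliding median over a 0/1 sequence; edges padded by repetition (assumption)."""
    if not binary:
        return []
    half = length // 2
    padded = [binary[0]] * half + list(binary) + [binary[-1]] * half
    return [int(median(padded[i:i + length])) for i in range(len(binary))]


def longest_positive_run(binary):
    """Return (start, end) segment indices (end exclusive) of the longest run of 1s."""
    best, start = None, None
    for i, value in enumerate(list(binary) + [0]):  # sentinel closes a trailing run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def detect_event(probabilities):
    """Map segment probabilities to one (onset_sec, offset_sec) pair, or None."""
    binary = [1 if p > THRESHOLD else 0 for p in probabilities]
    run = longest_positive_run(median_filter(binary))
    if run is None:
        return None
    return (run[0] * SEGMENT_SEC, run[1] * SEGMENT_SEC)


probs = [0.1, 0.9, 0.1, 0.2, 0.8, 0.7, 0.4, 0.9, 0.6, 0.1]
print(detect_event(probs))
```

In the toy input, the isolated positive at index 1 is removed by the median filter and the one-segment dip at index 6 is filled, so a single event spanning segments 4 to 8 remains.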

Figure 2: The architecture of CRNN. The first and second dimensions of convolutional kernels and strides represent the time axis and frequency axis respectively.

2.4 Learning when to listen

As shown in Figure 1, we add a temporal attention model at the end of CNN to enable our system to learn when to listen. This attention model was proposed to ignore irrelevant sounds and focus more on important segments. Unlike the attention model in audio classification [9] that only focuses on positive segments (including events), our temporal attention pays more attention to both positive segments and hard negative segments (only backgrounds, but easily misclassified as events) because they should be further differentiated.

The output of the CNN passes through a fully-connected layer with $N$ hidden units, followed by an activation unit (sigmoid, ReLU, or softmax). Then a global max-pooling on the frequency axis is used to get one weight for each segment. Those attention weights are normalized along the time axis. In our experiments, this normalization has shown great effectiveness because it takes into account the variation of the weight factors along the time axis instead of considering only the current segment. Then we multiply the temporal attention weights with the output of the fully-connected layer after the bi-GRU. A sigmoid function is used to normalize the probabilities to $[0, 1]$. The final output can be computed as follows:

$$a_t = \max_{1 \le i \le N} \sigma\left(W_i^{\top} h_t + b_i\right)$$

$$A_t = \frac{a_t}{\sum_{k=1}^{T} a_k}$$

$$y_t = \mathrm{sigmoid}\left(A_t \cdot z_t\right)$$

where $\sigma$ is an activation function, $h_t$ denotes the output of the CNN, $W_i$ and $b_i$ represent the weights and bias for the $i$-th hidden unit respectively, and $N$ is the number of hidden units in the temporal attention model. $a_t$ is the candidate temporal attention weight, $T$ is the total number of segments in an audio recording, $A_t$ is the normalized temporal attention weight, $z_t$ is the output of the fully-connected layer after the bi-GRU, and $y_t$ is the final output probability.
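The temporal attention steps in this section (fully-connected layer, activation, max-pooling to one weight per segment, normalization over time, multiplication with the recurrent branch output, final sigmoid) can be sketched in plain Python. This is a sketch under assumptions: normalization is taken to be division by the sum over all segments, max-pooling is taken over the hidden units, and the function names and toy values are hypothetical.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_attention(cnn_out, weights, biases, activation=relu, eps=1e-8):
    """Return one normalized attention weight per segment.

    cnn_out: list of T feature vectors (one per segment)
    weights: list of N weight vectors, biases: list of N biases
    """
    candidates = []
    for h_t in cnn_out:
        hidden = [
            activation(sum(w * h for w, h in zip(w_i, h_t)) + b_i)
            for w_i, b_i in zip(weights, biases)
        ]
        candidates.append(max(hidden))      # max-pooling over the N hidden units
    total = sum(candidates) + eps           # eps guards against an all-zero ReLU output
    return [a / total for a in candidates]  # assumed sum normalization over time


def attended_output(attention, fc_out):
    """Scale the per-segment output of the layer after the bi-GRU, then apply sigmoid."""
    return [sigmoid(a * z) for a, z in zip(attention, fc_out)]


# Toy example: T = 3 segments, 2 CNN features, N = 2 hidden units.
cnn_out = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
attention = temporal_attention(cnn_out, weights, biases)
print(attention)
print(attended_output(attention, [3.0, 3.0, 3.0]))
```

Because the weights are normalized over the whole recording, a segment's weight depends on how it compares with every other segment, not only on its own features.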

| Model | Dev. babycry | Dev. glassbreak | Dev. gunshot | Dev. average | Eval. babycry | Eval. glassbreak | Eval. gunshot | Eval. average |
|---|---|---|---|---|---|---|---|---|
| Baseline | 0.14 / 92.6 | 0.04 / 98.0 | 0.19 / 89.6 | 0.12 / 93.4 | 0.31 / 83.4 | 0.08 / 95.9 | 0.26 / 85.5 | 0.22 / 88.3 |
| CRNN+TA | 0.14 / 92.8 | 0.03 / 98.4 | 0.17 / 90.9 | 0.11 / 94.0 | 0.25 / 87.4 | 0.05 / 97.4 | 0.18 / 90.6 | 0.16 / 91.8 |
| Proposed | 0.10 / 95.1 | 0.01 / 99.4 | 0.16 / 91.5 | 0.09 / 95.3 | 0.18 / 91.3 | 0.04 / 98.2 | 0.17 / 90.8 | 0.13 / 93.4 |
| R-CRNN [5] | 0.09 / *** | 0.04 / *** | 0.14 / *** | 0.09 / 95.5 | *** / *** | *** / *** | *** / *** | 0.23 / 87.9 |
| 1d-CRNN [3] | 0.05 / 97.6 | 0.01 / 99.6 | 0.16 / 91.6 | 0.07 / 96.3 | 0.15 / 92.2 | 0.05 / 97.6 | 0.19 / 89.6 | 0.13 / 93.1 |
| CRNN [4] | *** / *** | *** / *** | *** / *** | 0.14 / 92.9 | 0.18 / 90.8 | 0.10 / 94.7 | 0.23 / 87.4 | 0.17 / 91.0 |

Table 1: Performance of proposed models and other methods on the development dataset (Dev.) and evaluation dataset (Eval.). Each cell gives ER / F-score (%). *** indicates that class-wise results are not given in the related paper. We compare the following models: (1) Baseline: our bi-GRU-based CRNN; (2) CRNN+TA: our bi-GRU-based CRNN with temporal attention model; (3) Proposed: our bi-GRU-based CRNN with temporal-frequential attention model; (4) R-CRNN: Region-based CRNN; (5) 1d-CRNN: DCASE 1st place model; (6) CRNN: DCASE 2nd place model.

2.5 Learning where to listen

Apart from the temporal attention model, we propose a frequential attention model. Different sound events have different spectral characteristics, so we assume that frequency components should be weighted differently based on the characteristics of each frame.

The structure of the frequential attention model is similar to that of the temporal attention model. The input Fbank feature goes through a fully-connected layer with $M$ hidden units, followed by an activation function (sigmoid, ReLU, or softmax). Here, $M$ is set to 128 to correspond with the number of mel filters. Then it is normalized along the frequency axis to get frequential attention weights. Finally, an element-wise multiplication is applied between the frequential attention weights and the input Fbank feature before the feature is fed into the CRNN architecture. The weighted feature is computed as follows:

$$f_{t,i} = \sigma\left(V_i^{\top} x_t + d_i\right)$$

$$F_{t,i} = \frac{f_{t,i}}{\sum_{j=1}^{M} f_{t,j}}$$

$$\hat{x}_t = F_t \odot x_t$$

where $\sigma$ is an activation function, $x_t$ is the input acoustic feature, and $V_i$ and $d_i$ represent the weights and bias for the $i$-th hidden unit respectively. $f_{t,i}$ is the candidate frequential attention weight, $F_{t,i}$ is the normalized frequential attention weight, $\odot$ represents element-wise multiplication and $\hat{x}_t$ is the weighted feature.
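For a single frame, the frequential attention described in this section can be sketched as below. The sketch assumes that normalization along the frequency axis means dividing by the sum over all bands, and it uses 3 bands instead of 128 for readability; the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frequential_attention(frame, weights, biases, activation=sigmoid, eps=1e-8):
    """Weight one Fbank frame by per-band attention.

    frame: list of band energies (length = number of mel filters)
    weights: one weight vector per hidden unit (as many units as bands)
    Returns (weighted_frame, normalized_attention).
    """
    candidates = [
        activation(sum(w * x for w, x in zip(w_i, frame)) + b_i)
        for w_i, b_i in zip(weights, biases)
    ]
    total = sum(candidates) + eps                 # assumed sum normalization over bands
    attention = [c / total for c in candidates]
    weighted = [a * x for a, x in zip(attention, frame)]  # element-wise product
    return weighted, attention


# Toy example: bias values chosen so that the lowest band gets the largest weight.
frame = [2.0, 1.0, 4.0]
weights = [[0.0, 0.0, 0.0]] * 3
biases = [2.0, 0.0, -2.0]
weighted, attention = frequential_attention(frame, weights, biases)
print(attention)
print(weighted)
```

The number of hidden units must equal the number of bands so that each band of the input receives exactly one weight.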

3 Experiments

3.1 Dataset

We evaluate the proposed model on DCASE 2017 Challenge task 2 [1]. The task dataset consists of isolated sound events for each target class and recordings of everyday acoustic scenes to serve as background [2]. There are three target event classes: babycry, glassbreak and gunshot. A synthesizer for creating mixtures at different event-to-background ratios is also provided. The dataset comprises a development dataset and an evaluation dataset. The development dataset consists of two parts: a train subset and a test subset. Participants are allowed to use any combination of the provided data for training, and evaluate their models on the test subset of the development dataset. Ranking of submitted systems is based on their performance on the evaluation dataset. Detailed information about this task and dataset is available in [1][2].

We use the synthesizer to generate 3000 mixtures for each class. The event-to-background ratios are -6, 0 and 6 dB, and the event presence probability is set to 0.9 (default value: 0.5) in order to gain more positive samples and mitigate the problem of data imbalance. We use the development test subset to optimize our model and finally evaluate it on the evaluation dataset.

Figure 3: Visualization of attention models. (a) Visualization of temporal attention weights. (b) Visualization of frequential attention weights.

3.2 Experiment setup

Our model is trained using Adam [13] with learning rate 0.001. Due to data imbalance, we use a weighted cross-entropy loss function to reduce deletion error. For each segment $t$, the loss is computed as follows:

$$L_t = -\left[\, w \, g_t \log y_t + (1 - g_t) \log\left(1 - y_t\right) \right]$$

where $y_t$ is the output score of each segment, $g_t$ is the ground-truth label, and $w$ is the loss weight for positive samples. In our experiments, the value of $w$ is 10.
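A per-segment weighted cross-entropy of the kind described here can be written in a few lines. This is a sketch under assumptions: only the positive term is scaled by the weight, how the per-segment values are reduced (sum or mean) is not stated in the text and is left out, and the clipping constant is added purely for numerical safety.

```python
import math

POS_WEIGHT = 10.0  # loss weight for positive samples, as stated in the text


def weighted_cross_entropy(score, label, pos_weight=POS_WEIGHT, eps=1e-7):
    """Binary cross-entropy for one segment with an up-weighted positive term.

    score: predicted probability in [0, 1]; label: 0 or 1.
    """
    p = min(max(score, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))


# A missed positive costs ten times more than an equally confident false alarm.
print(weighted_cross_entropy(0.1, 1))  # positive segment scored low
print(weighted_cross_entropy(0.9, 0))  # negative segment scored high
```

Penalizing missed positives more heavily pushes the model toward fewer deletions, at the cost of tolerating more insertions.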

In order to accelerate training, we adopt a pre-training strategy. We first train the baseline CRNN for 10 epochs and then use the pre-trained CRNN to initialize the weights when training the proposed model. Training is stopped after 200 epochs. The batch size is 64. The number of hidden units in the temporal attention model is 32. The number of GRU units is 32.

Because our work is a 0/1 classification system, we use sigmoid and ReLU activation in attention models. According to experimental results, our system can achieve the best performance with ReLU activation in temporal attention model and sigmoid activation in frequential attention model.

3.3 Metrics

We evaluate our method based on two kinds of event-based metrics: event-based error rate (ER) and event-based F-score. Both metrics are computed as defined in [14], using a collar of 500 ms and considering only the event onset. If the output accurately predicts the presence of the target event and its onset, we denote it as a correct detection. The onset detection is considered accurate only when it is predicted within 500 ms of the actual onset time. ER is the sum of deletion error and insertion error, and F-score is the harmonic mean of precision and recall. We compute these metrics using the sed_eval toolbox [14] provided by the DCASE organizers.
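The onset-only scoring rule can be illustrated with a simplified stand-in. This is not the sed_eval API: it assumes at most one event per recording (as in this task), normalizes deletions plus insertions by the number of reference events, and uses hypothetical function and argument names.

```python
COLLAR_SEC = 0.5  # onset tolerance of 500 ms


def event_based_scores(ref_onsets, pred_onsets, collar=COLLAR_SEC):
    """Return (error_rate, f_score) for a list of recordings.

    Each entry is the event onset in seconds, or None when no event is present
    (reference) or detected (prediction).
    """
    tp = fp = fn = 0
    for ref, pred in zip(ref_onsets, pred_onsets):
        if ref is not None and pred is not None and abs(pred - ref) <= collar:
            tp += 1
        else:
            if pred is not None:
                fp += 1   # insertion: detection with no matching reference onset
            if ref is not None:
                fn += 1   # deletion: reference event that was not matched
    n_ref = tp + fn
    error_rate = (fn + fp) / n_ref if n_ref else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / n_ref if n_ref else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return error_rate, f_score


refs = [None, 5.0, 10.0]
preds = [None, 5.3, 12.0]
print(event_based_scores(refs, preds))
```

Note that a detection whose onset falls outside the collar counts twice, once as an insertion and once as a deletion, which is why ER can exceed the miss rate alone.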

4 Results

4.1 Experimental results

The performances of proposed models and other methods, in terms of ER and F-score, are shown in Table 1. Results show that temporal attention model can improve the performance of bi-GRU based CRNN baseline, and frequential attention model can make further improvement. Compared with baseline, proposed method can improve the performance of all classes on both development dataset and evaluation dataset.

Compared with other state-of-the-art methods, the performance of our model is also competitive. Note that both of the top 2 teams adopt ensemble methods. Lim et al. [3] combined the output probabilities of more than four models with different time steps and different data mixtures to make the final decision. Cakir et al. [4] utilized an ensemble of seven architectures. We achieve comparable results on the development dataset without any model ensemble. Moreover, the average ER increases only slightly, from 0.09 on the development dataset to 0.13 on the evaluation dataset, which suggests that our proposed model generalizes well. On the evaluation dataset, the proposed model matches the lowest average ER (0.13, tied with 1d-CRNN [3]) and achieves the highest average F-score (93.4%) among the compared methods.

4.2 Visualization of attention models

In order to know more about our attention models, we visualize the weights of both temporal attention model and frequential attention model. Presented in Figure 3 is a good example of what our proposed temporal-frequential attention model has actually learnt. Figure 3 (a) and (b) are visualization of temporal attention weights and frequential attention weights respectively.

In Figure 3, (i) is the mel-spectrogram of an audio recording in the evaluation dataset. In this recording, babycry occurs from 23.13 s to 26.16 s with a “bus” background. There is a “beep” sound at around the 9th second. In (ii), the blue line denotes the output probability and the orange line denotes the temporal attention weights. The weight value is larger when the “beep” and “babycry” occur, which agrees with our earlier assumption that the temporal attention model gives more attention to positive segments and hard negative segments. (iii) is the visualization of the frequential attention weights and (iv) is the spectrogram of the weighted feature. The frequential attention weights are larger in the low-frequency area, which means that our frequential attention pays less attention to high-frequency components. This behaves like a low-pass filter, allowing the frequential attention model to ignore some high-frequency noise.

5 Conclusion

In this paper, we proposed a temporal-frequential attention model for sound event detection. The proposed model is tested on DCASE 2017 task 2. Our system achieves the best average F-score on the DCASE evaluation dataset among the compared methods, even without a model ensemble. In addition to sound event detection, our temporal-frequential attention model could be applied to speaker verification, speech recognition and audio tagging in future research.