Sound event detection (SED) is the task of identifying the onsets and offsets of target class activities in general audio signals. A typical SED method takes as input an audio signal and outputs the temporal activity of target classes like “car passing by”, “footsteps”, “people talking”, “gunshot”, etc. [5, 8]. The time resolution of the class activities can vary among methods and datasets, but a resolution of 0.02 s is typically used [5, 8, 18, 10]. Class activities can overlap (polyphonic SED) or not (monophonic SED). SED can be employed in a wide range of applications, like wildlife monitoring and bird activity detection [4, 1], home monitoring [29, 3], autonomous vehicles [31, 21], and surveillance [24, 9].
Current deep-learning-based SED methods can be viewed as a composition of three functions. The first is a feature extractor, usually implemented by convolutional neural network (CNN) blocks (i.e. a CNN followed by a non-linearity, a normalization process, and a sub-sampling process), which provides frequency-shift-invariant features of the input audio signal. The second, implemented by a recurrent neural network (RNN), models long temporal context and inter- and intra-class patterns in the output of the feature extractor. Finally, the third function, an affine transform followed by a sigmoid non-linearity (in the case of polyphonic detection), performs the classification. A widely adopted method that conforms to the above scheme consists of three CNN blocks followed by an RNN and a classifier. This method is termed convolutional recurrent neural network (CRNN) and has been used in a variety of audio processing tasks, like music emotion recognition, sound event detection and localization, bird activity detection [4, 1], and SED.
A typical CRNN has around 3.5 M parameters, and the sequence length of the input audio and the output predictions is 1024 frames. Because an RNN is used, the CRNN method cannot be parallelized (i.e. split between different processing units, e.g. GPUs). The 1024-frame length of the output sequence is long enough to create computational problems in the calculation of the gradient, due to the RNN (e.g. gated recurrent units, GRU, or long short-term memory, LSTM). Reducing the number of parameters of an SED model would make the method fit for systems with restricted resources (e.g. embedded systems) and would decrease training time (resulting in faster experimentation and optimization). Also, removing the RNN would allow the method to be split between different processing units, would make training more efficient, and would allow the number of parameters to be reduced further.
In this paper we propose the replacement of both the CNNs and the RNN. In particular, we propose employing depthwise separable convolutions [26, 11, 14, 7] instead of typical CNNs, resulting in a considerable decrease in the number of parameters of the learned feature extractor. We also propose the replacement of the RNN with dilated convolutions [13, 25, 32]. This still allows modeling long temporal context, but reduces the number of parameters, eliminates the gradient problems due to the usually long employed sequences (e.g. 1024 frames long), and allows for parallelization of the model [20, 12].
Similar approaches have been proposed in the literature and in the code of the YAMNET system, available online at https://github.com/tensorflow/models/tree/master/research/audioset/yamnet. One proposed method uses a series of dilated convolutions as a feature extractor, instead of CNNs; the output of the last dilated convolution is given as input to an RNN, which does not lift any of the shortcomings of using RNNs in SED. YAMNET is based on the VGG architecture, using depthwise separable convolutions. YAMNET has 3.7 M parameters and no specific module for modeling the longer temporal context of the input audio (e.g. an RNN or a dilated convolution).
To evaluate the impact of our proposed changes, we employ a typical method for SED that is based on stacked CNNs and RNNs, and a freely available SED dataset, the TUT-SED Synthetic 2016. Our results show that with our proposed changes we reduce the number of parameters by 85% and the average time per epoch needed for training by 78% (measured on the same GPU), while we increase the frame-wise F1 score by 4.6% and decrease the error rate by 3.8%. The rest of the paper is organized as follows. Section 2 briefly presents the baseline approach and Section 3 presents our proposed method. Section 4 explains the evaluation set-up of our method, and the obtained results are presented in Section 5. Section 6 concludes the paper and proposes future research directions.
2 Baseline approach
The baseline approach accepts as input a series X of T audio feature vectors, each vector having F features, and an associated target output Y corresponding to the activities of C classes. X is given as input to a learnable feature extractor, consisting of cascaded 2D CNN blocks. Each block has a 2D CNN followed by a non-linearity, a normalization process, and a feature sub-sampling process. The output of the feature extractor is given as input to a temporal pattern identification module, which consists of a GRU RNN. This module is followed by a classifier, which is an affine transform followed by a sigmoid non-linearity. The output of the classifier for each of the T feature vectors is the predicted activity of each of the C classes. During inference, the predicted activities are further binarized using a threshold of 0.5.
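The inference-time binarization step described above can be sketched as follows (a minimal illustration with hypothetical values, not the paper's implementation):

```python
import numpy as np

# Hypothetical sketch of the inference-time binarization: the classifier emits
# per-frame, per-class sigmoid activations in [0, 1], which are turned into
# binary activity decisions with a fixed 0.5 threshold.
def binarize_activities(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """logits: array of shape (T, C) with raw affine-transform outputs."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid non-linearity
    return (probs >= threshold).astype(np.int8)

frame_logits = np.array([[2.0, -1.5], [-0.2, 0.3]])  # T=2 frames, C=2 classes (assumed)
print(binarize_activities(frame_logits))  # -> [[1 0]
                                          #     [0 1]]
```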
2.1 Learnable feature extractor based on CNNs
The learnable feature extractor of the baseline approach consists of three CNN blocks, each block having a typical 2D CNN followed by a rectified linear unit (ReLU), a batch normalization process, and a max-pooling operation across the dimension of features. A typical 2D CNN consists of C_out kernels, each of shape C_in × K_h × K_w, and bias vectors, where C_in and C_out are the number of input and output channels of the CNN, and K_h and K_w are the height and width of the kernel of each channel. Each kernel is applied to the input I of the 2D CNN to obtain the output of the 2D CNN as

O_c = b_c + Σ_{c'=1}^{C_in} K_{c,c'} * I_{c'},

where * is the convolution operator with unit stride and zero padding. The above application of the kernels onto I leads to learning and extracting spatial and cross-channel information from the input features, and has a computational complexity of O(C_in · C_out · K_h · K_w · T · F) [11, 14, 7]. Additionally, the number of learnable parameters of the 2D CNN (omitting bias) is C_in · C_out · K_h · K_w. Figure 1 illustrates the above operation.
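As a rough illustration of the counts above, the following sketch (with hypothetical channel and kernel sizes, not the paper's exact hyper-parameters) evaluates the parameter and multiplication counts of a standard 2D convolution:

```python
# Hypothetical sketch: parameter and multiplication counts of a standard
# 2D convolution, following C_in * C_out * K_h * K_w (parameters, bias omitted)
# and C_in * C_out * K_h * K_w * T * F (multiplications over a T x F map).
def conv2d_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    return c_in * c_out * k_h * k_w

def conv2d_mults(c_in: int, c_out: int, k_h: int, k_w: int, t: int, f: int) -> int:
    return conv2d_params(c_in, c_out, k_h, k_w) * t * f

# Example values (assumed, for illustration only)
print(conv2d_params(256, 256, 5, 5))   # -> 1638400
print(conv2d_mults(2, 3, 3, 3, 10, 4)) # cost scales with the T x F map
```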
In each CNN block of the feature extractor, the output of the 2D CNN is followed by ReLU, batch normalization, and max-pooling operations. The output of the max-pooling operation is given as input to the next CNN block. The output of the third CNN block, having C₃ output channels and F₃ features per channel, is reshaped from C₃ × T × F₃ to T × (C₃ · F₃). This reshaped output is given as input to the GRU of the temporal pattern identification module.
2.2 Gated recurrent unit for long temporal context identification
The output features of the learnable feature extractor are likely to include multi-scale contextual information, encoding long temporal patterns and inter- and intra-class activity. To exploit this information, the baseline approach utilizes a temporal pattern identification module, a GRU that takes the reshaped output of the feature extractor as input. The input and output dimensionality of the GRU are the same.
In particular, the GRU takes as input the output of the last CNN block of the baseline approach and processes each row according to the equations in the original GRU paper. The output of the GRU is given as input to the classifier.
2.3 Classifier, loss, and optimization
The classifier gets as input the output of the GRU. It consists of a learnable affine transform with weights shared through time, followed by a sigmoid non-linearity. Its output is the output of the CRNN method, i.e. the predicted class activities.
The feature extractor, the GRU, and the classifier are jointly optimized by minimizing the cross-entropy loss between the predicted and the target class activities.
3 Proposed approach
In our method we replace both the CNN blocks and the RNN with different types of convolutions. We replace the typical CNNs with depthwise separable convolutions, which result in fewer parameters and increased performance [7, 28, 27, 17, 15]. Additionally, we replace the RNN with dilated convolutions, which have fewer parameters, are based on CNNs, and can model long temporal context [13, 25, 32].
Specifically, our method also accepts as input the series X and the associated annotations for the activities of the C classes. X is given as input to a learnable feature extractor, consisting of cascaded 2D depthwise separable CNN (DWS-CNN) blocks. Each block has a 2D CNN based on depthwise separable convolution, followed by a non-linearity, a normalization process, and a feature sub-sampling process. The output of the feature extractor is given as input to a temporal pattern identification module, which consists of a 2D CNN based on dilated convolution (DIL-CNN). The DIL-CNN is followed by a classifier, which is the same classifier as in the baseline approach. The output of the classifier for each of the T feature vectors is the predicted activity of each of the C classes. Similarly to the baseline, during inference the predicted activities are further binarized using a threshold of 0.5.
3.1 Learnable feature extractor based on depthwise separable convolutions
For our feature extractor we employ a factorization of the spatial and cross-channel learning process described by Eq. (2.1). We replace the 2D CNNs of the CRNN method with 2D DWS-CNNs, closely following the DWS-CNNs presented for the MobileNets model and the hyper-parameters used in the CRNN architecture. Instead of using a single kernel in one convolution to learn both spatial and cross-channel information, we apply, in series, two convolutions (i.e. the output of the first is the input to the second) using two different kernels. This factorization technique, termed depthwise separable convolution, has been adopted in a variety of architectures for image processing (like the XCeption, GoogLeNet, Inception, and MobileNets models), and has been shown to increase performance while reducing the number of parameters [7, 28, 27, 17, 15].

Firstly, we apply C_in kernels of shape K_h × K_w, one to each input channel, in order to learn the spatial relationships of features, as

O'_c = K^d_c * I_c, for c = 1, …, C_in.

Then, we utilize C_out kernels of shape C_in × 1 × 1, applying them to O', in order to learn the cross-channel relationships, as

O_c = b_c + Σ_{c'=1}^{C_in} K^p_{c,c'} · O'_{c'}.

The resulting computational complexity and number of parameters (omitting bias), for both processes in Eq. (3.1) and (4), are O(C_in · T · F · (K_h · K_w + C_out)) and C_in · (K_h · K_w + C_out), respectively. Thus, the computational complexity and number of parameters are both reduced by a factor of (C_out · K_h · K_w)/(C_out + K_h · K_w). The process of depthwise separable convolution is illustrated in Figure 2, with the first step in Figure 2a and the second in Figure 2b.
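The parameter reduction can be checked numerically; the following sketch (with assumed channel and kernel sizes, for illustration only) compares a standard convolution against its depthwise separable factorization:

```python
# Hypothetical sketch comparing parameter counts of a standard 2D convolution
# versus its depthwise separable factorization: per-channel spatial (depthwise)
# kernels of shape K_h x K_w, followed by 1x1 cross-channel (pointwise)
# kernels; bias omitted in both counts.
def standard_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    return c_in * c_out * k_h * k_w

def dws_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    depthwise = c_in * k_h * k_w   # one spatial kernel per input channel
    pointwise = c_in * c_out       # 1x1 cross-channel combination
    return depthwise + pointwise

# Example values (assumed): 256 channels, 5x5 kernels
std = standard_params(256, 256, 5, 5)  # -> 1638400
dws = dws_params(256, 256, 5, 5)       # -> 71936
print(std, dws, std / dws)             # ratio matches (C_out*K_h*K_w)/(C_out+K_h*K_w)
```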
Following the baseline approach, we use three blocks of DWS-CNNs, where each block consists of a DWS-CNN followed by a rectified linear unit (ReLU), a batch normalization process, and a max-pooling operation across the dimension of features. The output of the third DWS-CNN block is given as input to the temporal pattern identification module.
3.2 Dilated convolutions
Contrary to the baseline approach, we employ a DIL-CNN in order to exploit the long temporal patterns in the output of the feature extractor. The DIL-CNN is based on 2D dilated convolutions, which are capable of aggregating and learning multi-scale information and have been used previously in image processing tasks [13, 25, 32].

A dilated 2D CNN (DIL-CNN) consists of C_out kernels and bias vectors. Similarly to the typical CNN described in Section 3.1, C_in and C_out are the input and output channels of the DIL-CNN, and K_h and K_w are the height and width of the kernel for each channel. Each kernel is applied to the input I of the DIL-CNN to obtain the output of the DIL-CNN as

O_c(t, f) = b_c + Σ_{c'=1}^{C_in} Σ_{i,j} K_{c,c'}(i, j) · I_{c'}(t + ε_t · i, f + ε_f · j).

The dilation rates, ε_t and ε_f, multiply the indices that are used for accessing elements of I. This allows a scaled aggregation of contextual information at the output of the operation. Practically, this means that the features computed by the DIL-CNN are calculated from a bigger area, resulting in modeling of longer temporal context; the area each output element is calculated from grows in each dimension by a factor equal to the corresponding dilation rate. The process described by Eq. (3.2) is illustrated in Figure 3.
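The growth of the aggregated area can be illustrated along the time dimension: a kernel of size K with dilation ε spans ε·(K − 1) + 1 frames. A minimal sketch (kernel sizes and dilation rates assumed for illustration):

```python
# Hypothetical sketch: effective span of a dilated kernel along one (time)
# dimension. A kernel of size k with dilation rate d covers d*(k-1)+1 frames,
# so the covered context grows linearly with the dilation rate.
def dilated_span(kernel_size: int, dilation: int) -> int:
    return dilation * (kernel_size - 1) + 1

# Example values (assumed): a 5-tap kernel at increasing dilation rates
for d in (1, 10, 100):
    print(d, dilated_span(5, d))  # 1 -> 5 frames, 10 -> 41, 100 -> 401
```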
We use the DIL-CNN to replace the recurrent neural networks that efficiently model long temporal context and inter- and intra-class activities for SED. Specifically, our temporal pattern identification module takes as input the output of the feature extractor and outputs the dilated-convolution features.
Finally, this output is reshaped to T × (C_out · F') and given as input to the classifier of our method, which is the same classifier as in the baseline approach.
4 Evaluation setup
To assess the performance of each of the proposed replacements and their combination, we employ a freely available SED dataset and compare the performance of the CRNN and each of our proposed replacements. The code for all the models and the evaluation process described in this paper is freely available online at https://github.com/dr-costas/dnd-sed.
4.1 Baseline system and models
We employ four different models: Model_base, Model_dws, Model_dil, and Model_dnd. Model_base is our main baseline and consists of three CNN blocks, followed by a GRU, and a linear layer acting as a classifier. Each CNN block consists of a CNN with 256 channels and a square kernel (with unit stride and zero padding), followed by a ReLU, batch normalization, max pooling, and dropout with probability 0.25. The max-pooling operations have kernels and strides that operate on the feature dimension. The GRU has 256 input and output features, and the classifier has 256 input and 16 output features.
For our second model, Model_dws, we replace the CNN blocks of the CRNN with DWS-CNN blocks, so we can assess the benefit of using DWS-CNNs instead of typical 2D CNNs. To minimize the factors that could have an impact on possible differences between our proposed method and the employed baseline, for Model_dws we adopt the same kernel shapes, strides, and padding as in Model_base. The same holds for all hyper-parameters of the max-pooling operations.
For the third model, Model_dil, we replace the GRU in Model_base with the DIL-CNN, so we can assess the benefit of using a DIL-CNN instead of an RNN. Since there are no previous studies using DIL-CNNs as a replacement for RNNs for SED, we opt to keep the same number of channels as in the DWS-CNNs but perform a grid search over the kernel shape and dilation rates of the DIL-CNN. Specifically, we employ four different kernel shapes, denoted with an exponent, e.g. Model_dil^{(3,3)} for the model having a DIL-CNN with a kernel of shape (3, 3). Because we want to assess the effect of using a different time resolution for capturing inter- and intra-event patterns with the DIL-CNN, we apply dilation only on the time dimension and not on the dimension of features. To keep the time dimension intact (i.e. to preserve the sequence length T), we use zero padding on the time dimension, with the amount of padding chosen according to the kernel shape and dilation rate; we use no padding on the feature dimension. It must be noted that when the dilation rate is 1, the DIL-CNN is a typical 2D CNN and, thus, we also assess the effect of replacing the RNN with a typical 2D CNN. We also denote the employed dilation in the exponent.
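For an odd kernel size K and time dilation ε_t, the padding that keeps the time dimension intact is ε_t·(K − 1)/2 per side. A minimal sketch of this rule on a naive 1D dilated convolution (all sizes assumed, for illustration only):

```python
import numpy as np

# Hypothetical sketch: choosing zero padding so a dilated convolution along the
# time axis keeps the sequence length intact. For an odd kernel size k and
# dilation d, "same" padding on each side is d * (k - 1) // 2.
def same_padding(kernel_size: int, dilation: int) -> int:
    return dilation * (kernel_size - 1) // 2

def dilated_conv1d_same(x: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """Naive 1D dilated convolution with 'same' zero padding (illustration only)."""
    k = len(w)
    p = same_padding(k, dilation)
    xp = np.pad(x, p)
    return np.array([sum(w[i] * xp[t + i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.arange(16, dtype=float)                       # T = 16 frames (assumed)
y = dilated_conv1d_same(x, np.ones(3), dilation=4)
print(len(x) == len(y))                              # time dimension kept intact
```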
Finally, Model_dnd is our complete proposed method, where we replace both the CNN blocks and the GRU of Model_base with the DWS-CNN blocks and the DIL-CNN, respectively. For a complete assessment of our proposed method, we follow the same grid search over kernel shapes and dilation rates as we perform for Model_dil.
Table 1: SED performance (F1 and ER) and computational performance (number of parameters, and mean and standard deviation, STD, of the time in seconds required per epoch) for all models. N/A denotes a non-applicable parameterization. Bold-faced elements denote the best reported classification and computational performance.
4.2 Dataset and metrics
We use the TUT-SED Synthetic 2016 dataset, which is freely available online at http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016 and has been employed in multiple previous works on SED [5, 8, 16]. TUT-SED Synthetic 2016 consists of 100 mixtures of around eight minutes length with isolated sound events from 16 classes, namely alarms & sirens, baby crying, bird singing, bus, cat meowing, crowd applause, crowd cheering, dog barking, footsteps, glass smash, gun shot, horse walk, mixer, motorcycle, rain, and thunder. The mixtures are split into training, validation, and testing splits of 60%, 20%, and 20%, respectively. The maximum polyphony of the dataset is 5. From each mixture we extract multiple sequences of feature vectors, consisting of log-mel band energies computed using a Hamming window with 50% overlap. As evaluation metrics we use the F1 score and the error rate (ER), similarly to the original CRNN paper and previous work on SED [5, 8, 16]. Both metrics are calculated on a per-frame basis (i.e. for every time frame). Additionally, we keep a record of the training time per epoch for each model and for all repetitions of the optimization process, by measuring the elapsed time between the start and the end of each epoch.
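The frame-wise metrics can be sketched as follows, using the common SED definitions (F1 from frame-level true/false positives and negatives; ER from per-frame substitutions, deletions, and insertions normalized by the number of active reference labels). The arrays below are assumed toy data, not the paper's results:

```python
import numpy as np

# Hypothetical sketch of the frame-wise metrics: F1 from frame-level TP/FP/FN,
# and ER from per-frame substitutions S, deletions D, and insertions I,
# normalized by the number of active reference labels N.
def framewise_f1_er(y_true: np.ndarray, y_pred: np.ndarray):
    """y_true, y_pred: binary arrays of shape (T, C)."""
    tp = np.logical_and(y_true == 1, y_pred == 1).sum()
    fp = np.logical_and(y_true == 0, y_pred == 1).sum()
    fn = np.logical_and(y_true == 1, y_pred == 0).sum()
    f1 = 2 * tp / (2 * tp + fp + fn)

    fn_t = np.logical_and(y_true == 1, y_pred == 0).sum(axis=1)  # misses per frame
    fp_t = np.logical_and(y_true == 0, y_pred == 1).sum(axis=1)  # false alarms per frame
    s = np.minimum(fn_t, fp_t).sum()      # substitutions
    d = np.maximum(0, fn_t - fp_t).sum()  # deletions
    i = np.maximum(0, fp_t - fn_t).sum()  # insertions
    er = (s + d + i) / y_true.sum()
    return f1, er

y_true = np.array([[1, 0], [1, 1], [0, 0]])  # T=3 frames, C=2 classes (assumed)
y_pred = np.array([[1, 1], [1, 0], [0, 0]])
print(framewise_f1_er(y_true, y_pred))
```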
4.3 Training and testing procedures
We optimize the parameters of all models (under all sets of hyper-parameters) using the training split of the employed dataset, the Adam optimizer with hyper-parameter values as proposed in the original Adam paper, a batch size of 16, and the cross-entropy loss. After one full iteration over the training split (i.e. one epoch), we employ the same loss and measure its value on the validation split. We stop the optimization process if the loss on the validation split does not improve for 30 consecutive epochs, and we keep the values of the parameters of the model from the epoch yielding the lowest validation loss. Finally, we assess the performance of each model using the testing split and the employed metrics (i.e. F1 and ER).
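The early-stopping rule above can be sketched as follows (a minimal illustration with a short, assumed loss sequence; the paper uses a patience of 30 epochs):

```python
# Hypothetical sketch of the early-stopping rule: stop when the validation loss
# has not improved for `patience` consecutive epochs, and keep the parameters
# from the epoch with the lowest validation loss.
def early_stop_best_epoch(val_losses, patience: int = 30) -> int:
    """Return the index of the epoch whose parameters would be kept."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop optimization
    return best_epoch

# Example with a short patience for illustration (assumed losses)
print(early_stop_best_epoch([1.0, 0.8, 0.7, 0.9, 0.95, 0.9], patience=3))  # -> 2
```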
In order to have an objective assessment of the impact of our proposed method, we repeat the optimization 10 times for every model, following the above-described process. Then, we calculate the average and standard deviation of the above-mentioned metrics, i.e. the F1 score and error rate (ER). In addition, we report the number of parameters and the necessary mean training time per epoch, i.e. a full iteration throughout the whole training split. All presented experiments were performed on an NVIDIA Pascal V100 GPU.
5 Results and discussion
Table 1 presents the results from all conducted experiments, organized in two groups. The first, termed SED performance, regards the performance of each model and set of hyper-parameters on the SED task (i.e. F1 and ER). The second group, termed computational performance, considers the number of parameters and the average time necessary for training, for each model and each set of hyper-parameters. The STD of the F1 and ER values is small and omitted for clarity.
The baseline CRNN system, i.e. Model_base, yields better classification performance than only a few of the proposed configurations; in every other case, Model_base yields worse classification performance. This indicates that our proposed changes can, in general, result in better classification performance compared to the baseline system. Regarding the computational performance, it can be seen that there are specific sets of hyper-parameters that result in models with more parameters than Model_base. This increase in the number of parameters, though, is not attributed to the difference between the DIL-CNN of Model_dil and the GRU of Model_base, but to the number of parameters of the classifier. In the case of Model_base, the classifier has shared weights through time, and thus the number of its input features is 256. But in the cases of Model_dil and Model_dnd, the dimensionality of the input to the classifier depends on the number of feature-dimension outputs of the DIL-CNN, which is inversely proportional to the size of its kernel; after reshaping, the number of input features to the classifier can be considerably bigger than the 256 of Model_base. Finally, Model_base needs more time (on average) per epoch than any other model and set of hyper-parameters in Table 1. This clearly indicates that all of the proposed changes have a positive impact on the needed time per epoch, even in the cases where the number of parameters is bigger.
Comparing the impact of each of the changes (i.e. Model_dws versus Model_dil), we can see that adopting DWS-CNNs can significantly increase SED performance, yielding better F1 and ER compared to Model_base and almost all Model_dil configurations. Additionally, Model_dws yields the lowest ER overall, but not the highest F1. Furthermore, Model_dws has significantly fewer parameters than any Model_dil configuration and than Model_base. The decrease in the number of parameters and the increase in performance when using the DWS-CNNs is in accordance with previous studies that adopted them [7, 28, 27, 17, 15]. Focusing on Model_dil, it can be observed that the usage of dilation increases the classification performance. Specifically, for all kernel shapes, the configuration without dilation yields the lowest F1 and highest ER. Also, it is apparent that for the largest employed dilation rate the classification performance decreases.
Finally, when the DWS-CNNs and the DIL-CNN are combined (i.e. Model_dnd), there seems to be a drop in performance (compared to Model_dws) for the (3, 3) and (5, 5) kernel shapes and for all dilation rates. But one Model_dnd configuration yields the highest F1 score and an ER within 0.02 of the best. Additionally, this specific model needs the least average time per epoch and belongs to the group of models with the fewest parameters.
6 Conclusion and future work
In this paper, we proposed the adoption of 2D CNNs based on depthwise separable and dilated convolutions as a replacement for the usual 2D CNNs and RNN layers in typical SED methods. To evaluate our proposed changes, we conducted a series of experiments, assessing each replacement separately and also their combination, using a widely adopted method and a freely available SED dataset. Our results showed that when both DWS-CNNs and DIL-CNNs are used, instead of usual CNNs and RNNs respectively, the resulting method has considerably better classification performance, the number of parameters decreases by 80%, and the average needed training time per epoch decreases by 72%.
Although we conducted a grid search over the hyper-parameters, the proposed method is likely not fully fine-tuned for the task of SED. Further study is needed in order to fine-tune the hyper-parameters and obtain the maximum classification performance for the task of SED.
Part of the computations leading to these results was performed on a TITAN-X GPU donated by NVIDIA. Part of the research leading to these results has received funding from the European Research Council under the European Union’s H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. Stylianos I. Mimilakis is supported in part by the German Research Foundation (AB 675/2-1, MU 2686/11-1). The authors wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.
-  (2017) Stacked convolutional and recurrent neural networks for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 1729–1733. External Links: Cited by: §1, §1.
-  (2019-03) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 34–48. External Links: Cited by: §1.
-  (2017) homeSound: real-time audio event detection based on high performance computing for behaviour and surveillance remote monitoring. Sensors (Basel) 17 (4). Cited by: §1.
-  (2017-Aug.) Convolutional recurrent neural networks for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), Vol. , pp. 1744–1748. External Links: Cited by: §1, §1.
-  (2017-Jun.) Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech and Language Processing 25 (6), pp. 1291–1303. External Links: Cited by: §1, §1, §1, §3.1, §4.2.
-  Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). Cited by: §2.2.
-  (2017-Jul.) Xception: deep learning with depthwise separable convolutions. In , Vol. , pp. 1800–1807. External Links: Cited by: §1, §2.1, §3.1, §3, §5.
-  (2019-Oct.) Language modelling for sound event detection with teacher forcing and scheduled sampling. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Cited by: §1, §1, §2.2, §4.2.
-  A mobile application for sound event detection. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). Cited by: §1.
-  (2019-Oct.) Sound event localization and detection using CRNN on pairs of microphones. In Workshop of Detection and Classification of Acoustic Scenes and Events (DCASE), Cited by: §1.
-  (2018-Sep.) Network decoupling: from regular to depthwise separable convolutions. In British Machine Vision Conference (BMVC), Cited by: §1, §2.1.
Temporal convolutional networks for anomaly detection in time series. Journal of Physics: Conference Series 1213, pp. 042050. External Links: Cited by: §1.
-  (1990) A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, J. Combes, A. Grossmann, and P. Tchamitchian (Eds.), Berlin, Heidelberg, pp. 286–297. External Links: Cited by: §1, §3.2, §3.
-  (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861. Cited by: §1, §2.1.
-  (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861. Cited by: §3.1, §3.1, §3, §5.
-  (2018-09) Using sequential information in polyphonic sound event detection. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Cited by: §4.2.
-  Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 448–456. Cited by: §3.1, §3, §5.
-  (2019-Oct.) Sound source detection, localization and classification using consecutive ensemble of CRNN models. In Workshop of Detection and Classification of Acoustic Scenes and Events (DCASE), Cited by: §1.
-  (2015-05) Adam: a method for stochastic optimization. In 3rd International Conference for Learning Representations, pp. . Cited by: §4.3.
-  (2016) Temporal convolutional networks: a unified approach to action segmentation. In Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou (Eds.), Cham, pp. 47–54. External Links: Cited by: §1.
-  (2017-Sep.) Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input. Technical report DCASE2017 Challenge. Cited by: §1.
-  (2020-05) Sound event detection via dilated convolutional recurrent neural networks. In (accepted, to be presented) IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.
-  (2017-Jul.) Stacked convolutional and recurrent neural networks for music emotion recognition. In 14th Sound & Music Computing Conference (SMC-17), pp. 208–213. Cited by: §1.
-  (2018) Listening for sirens: locating and classifying acoustic alarms in city scenes. ArXiv abs/1810.04989. Cited by: §1.
-  (1992-10) The discrete wavelet transform: wedding the a trous and mallat algorithms. IEEE Transactions on Signal Processing 40 (10), pp. 2464–2482. External Links: Cited by: §1, §3.2, §3.
-  (2014) Rigid-motion scattering for image classification. Ph.D. Thesis, Ecole Polytechnique, CMAP. Cited by: §1, §3.1.
-  (2016-Jun.) Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2818–2826. External Links: Cited by: §3.1, §3, §5.
-  (2015-Jun.) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1–9. External Links: Cited by: §3.1, §3, §5.
-  (2019-Oct.) Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Cited by: §1.
-  TUT-SED Synthetic 2016. Note: http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016. Accessed: 2019-12-10. Cited by: §1.
-  (2017-Sep.) Surrey-CVSSP system for DCASE2017 challenge task4. Technical report DCASE2017 Challenge. Cited by: §1.
-  (2016-05) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), Cited by: §1, §3.2, §3.2, §3.