1 Introduction
Sound event detection (SED) is the task of identifying the onsets and offsets of target class activities in general audio signals [5]. A typical SED method takes as an input an audio signal and outputs the temporal activity of target classes like “car passing by”, “footsteps”, “people talking”, “gunshot”, etc. [5, 8]. The time resolution of the class activities can vary among different methods and datasets, but a resolution of 0.02 s is typically used [5, 8, 18, 10]. Also, class activities can overlap (polyphonic SED) or not (monophonic SED). SED can be employed in a wide range of applications, like wildlife monitoring and bird activity detection [4, 1], home monitoring [29, 3], autonomous vehicles [31, 21], and surveillance [24, 9].
Current deep learning-based SED methods can be viewed as a composition of three functions. The first function is a feature extractor, usually implemented by convolutional neural network (CNN) blocks (i.e. a CNN followed by a non-linearity, and normalization and sub-sampling processes), which provides frequency-shift-invariant features of the input audio signal [5]. The second function, implemented by a recurrent neural network (RNN), models long temporal context and inter- and intra-class patterns in the output of the feature extractor (i.e. the first function) [8]. Finally, the third function, which is an affine transform followed by a sigmoid non-linearity (in the case of polyphonic detection), performs the classification. A widely adopted method conforming to the above-mentioned scheme is described in [5], consisting of three CNN blocks followed by an RNN and a classifier. This method is termed the convolutional recurrent neural network (CRNN) and has been used in a variety of audio processing tasks, like music emotion recognition [23], sound event detection and localization [2], bird activity detection [4, 1], and SED [5]. The typical number of parameters of the CRNN is around 3.5 M, and the sequence length of the input audio and the output predictions is 1024 frames. Because an RNN is used, the CRNN method cannot be parallelized (i.e. split between different processing units, e.g. GPUs). The 1024-time-frame length of the output sequence is long enough to create computational problems in the calculation of the gradient, due to the RNN (e.g. gated recurrent units, GRU, or long short-term memory, LSTM). Reducing the number of parameters of an SED model would make the method suitable for systems with restricted resources (e.g. embedded systems) and would decrease the training time (resulting in faster experimentation and optimization). Also, removing the RNN would allow the method to be split between different processing units, would make training more efficient, and would allow the number of parameters to be reduced further.
In this paper we propose the replacement of both the CNNs and the RNN. In particular, we propose the employment of depthwise separable convolutions [26, 11, 14, 7] instead of typical CNNs, resulting in a considerable decrease in the number of parameters of the learned feature extractor. Then, we also propose the replacement of the RNN with dilated convolutions [13, 25, 32]. This still allows modeling long temporal context, but reduces the number of parameters, eliminates the gradient problems due to the usually long employed sequences (e.g. 1024 frames), and allows for parallelization of the model [20, 12].
Similar approaches have been proposed in [22] and in the code of the YAMNET system, available online (https://github.com/tensorflow/models/tree/master/research/audioset/yamnet). Specifically, the method proposed in [22] uses a series of dilated convolutions as a feature extractor, instead of CNNs. The output of the last dilated convolution is still given as an input to an RNN, so none of the shortcomings of using RNNs in SED are lifted. YAMNET is based on the VGG architecture, using depthwise separable convolutions. YAMNET has 3.7 M parameters and no dedicated module for modeling the longer temporal context in the input audio (e.g. an RNN or a dilated convolution).
To evaluate the impact of our proposed changes, we employ a typical method for SED that is based on stacked CNNs and RNNs [5], and a freely available SED dataset, TUT-SED Synthetic 2016 [30]. Our results show that with our proposed changes we reduce the number of parameters by 85% and the average time per epoch needed for training by 78% (measured on the same GPU), while we increase the frame-wise F_{1} score by 4.6% and decrease the error rate by 3.8%. The rest of the paper is organized as follows. In Section 2 we briefly present the baseline approach and in Section 3 our proposed method. Section 4 explains the evaluation setup and Section 5 presents the obtained results. Section 6 concludes the paper and proposes future research directions.
2 Baseline approach
The baseline approach accepts as an input a sequence X = [x_1, …, x_T] of T audio feature vectors, with each vector x_t having F features, and an associated target output Y that encodes the activities of C classes. X is given as an input to a learnable feature extractor f_e, consisting of cascaded 2D CNN blocks. Each block has a 2D CNN followed by a non-linearity, a normalization process, and a feature sub-sampling process. The output of f_e is given as an input to a temporal pattern identification module f_t, which consists of a GRU RNN. f_t is followed by a classifier f_c, which is an affine transform followed by a sigmoid non-linearity. The output of f_c for each of the T feature vectors is the predicted activity ŷ_t of each of the C classes. During inference, the activities are further binarized using a threshold of 0.5.
2.1 Learnable feature extractor based on CNNs
The learnable feature extractor f_e of the baseline approach consists of three CNN blocks, each block having a typical 2D CNN followed by a rectified linear unit (ReLU), a batch normalization process, and a max-pooling operation across the dimension of features. A typical 2D CNN consists of C_o kernels K_{c_o} (each with C_i channels of height K_h and width K_w) and bias terms b_{c_o}, where C_i and C_o are the numbers of input and output channels of the CNN. Each kernel is applied to the input H^{in} of the 2D CNN to obtain the output H^{out}, as

H^{out}_{c_o} = b_{c_o} + \sum_{c_i=1}^{C_i} K_{c_o, c_i} * H^{in}_{c_i},   (1)

where * is the convolution operator with unit stride and zero padding. The above application of the kernels onto H^{in} leads to learning and extracting spatial and cross-channel information from the input features [11], and has a computational complexity of O(C_i · C_o · K_h · K_w · T · F) [11, 14, 7]. Additionally, the number of learnable parameters of the 2D CNN (omitting bias) is C_i · C_o · K_h · K_w. Figure 1 illustrates the above operation. In each CNN block of the feature extractor, the output of the 2D CNN is followed by the ReLU, batch normalization, and max-pooling operations. The output of the max-pooling operation is given as an input to the next CNN block. The output of the third CNN block, with C_o output channels and F' remaining features, is reshaped to a T × (C_o · F') matrix Φ, which is given as an input to the GRU of f_t.
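As a quick check of the cost expressions for a standard 2D convolution, a few lines of Python evaluate the parameter and multiply-accumulate counts; the 256 channels and 5 × 5 kernel used below are illustrative values, not necessarily the baseline's exact hyperparameters.

```python
# Sanity check of the standard 2D convolution cost formulas given above.
# The 256 channels and the 5x5 kernel are illustrative values only.

def conv2d_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Learnable parameters of a standard 2D convolution (bias omitted)."""
    return c_in * c_out * k_h * k_w

def conv2d_macs(c_in: int, c_out: int, k_h: int, k_w: int, t: int, f: int) -> int:
    """Multiply-accumulates for a T x F output map with unit stride."""
    return c_in * c_out * k_h * k_w * t * f

print(conv2d_params(256, 256, 5, 5))  # 1638400
```

With 256 input and output channels and a 5 × 5 kernel, a single standard convolution already holds over 1.6 M parameters, which is why the feature extractor dominates the baseline's parameter budget.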
2.2 Gated recurrent unit for long temporal context identification
The output features Φ of f_e are likely to include multi-scale contextual information, encoding long temporal patterns and inter- and intra-class activity [8]. To exploit this information, the baseline approach utilizes f_t, a GRU that takes Φ as an input. The input and output dimensionalities of f_t are the same.

In particular, the GRU of f_t takes as an input the output of the last CNN block of the baseline approach and processes each row of Φ according to the equations in the original GRU paper [6]. The output of f_t is given as an input to the classifier f_c.
2.3 Classifier, loss, and optimization
The classifier f_c gets as an input the output of f_t. f_c consists of a learnable affine transform with weights shared across time steps, followed by a sigmoid non-linearity, so that the output of the CRNN method at time step t is

ŷ_t = σ(W h_t + b),   (2)

where h_t is the output of f_t at time step t, W and b are the weights and bias of the affine transform, and σ is the sigmoid function. f_e, f_t, and f_c are jointly optimized by minimizing the cross-entropy loss between the predicted and the target class activities.
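As an illustration, the whole baseline pipeline can be sketched in PyTorch as follows. The channel count, kernel size, and feature-axis pooling factors here are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Minimal PyTorch sketch of the baseline CRNN (Section 2): three CNN
# blocks (f_e), a GRU (f_t), and a time-shared affine classifier with a
# sigmoid (f_c).  Channel count, kernel size, and the feature-axis
# pooling factors are illustrative assumptions.
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, n_features=40, n_classes=16, channels=64):
        super().__init__()
        blocks, c_in = [], 1
        for pool in (5, 4, 2):  # assumed pooling factors; feature axis only
            blocks += [
                nn.Conv2d(c_in, channels, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.BatchNorm2d(channels),
                nn.MaxPool2d(kernel_size=(1, pool)),  # keep the time axis intact
                nn.Dropout(0.25),
            ]
            c_in = channels
        self.feature_extractor = nn.Sequential(*blocks)               # f_e
        f_out = n_features // (5 * 4 * 2)                             # features left after pooling
        self.gru = nn.GRU(channels * f_out, channels, batch_first=True)  # f_t
        self.classifier = nn.Linear(channels, n_classes)              # f_c

    def forward(self, x):                             # x: (batch, 1, T, F)
        h = self.feature_extractor(x)                 # (batch, C, T, F')
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)  # rows become time steps
        h, _ = self.gru(h)
        return torch.sigmoid(self.classifier(h))      # (batch, T, n_classes)

model = CRNNSketch()
y = model(torch.randn(2, 1, 8, 40))
print(y.shape)  # torch.Size([2, 8, 16])
```

The output keeps one prediction vector per input frame, which is what the frame-wise binarization at threshold 0.5 operates on.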
3 Proposed approach
In our method we replace both f_e and f_t with different types of convolutions. We replace the CNNs of f_e with depthwise separable convolutions, which result in a smaller number of parameters and increased performance [7, 28, 27, 17, 15]. Additionally, we replace the GRU of f_t with dilated convolutions, which have a smaller number of parameters, are based on CNNs, and can model long temporal context [13, 25, 32].

Specifically, our method also accepts as an input the sequence X of T feature vectors and the associated annotations for the activities of the C classes. X is given as an input to a learnable feature extractor f'_e, consisting of cascaded 2D depthwise separable CNN (DWS-CNN) blocks. Each block has a 2D CNN based on depthwise separable convolution, followed by a non-linearity, a normalization process, and a feature sub-sampling process. The output of f'_e is given as an input to a temporal pattern identification module f'_t, which consists of a 2D CNN based on dilated convolution (DIL-CNN). f'_t is followed by the same classifier f_c as in the baseline approach. The output of f_c for each of the T feature vectors is the predicted activity ŷ_t of each of the C classes. Similarly to the baseline, during the inference process the activities are further binarized using a threshold of 0.5.
3.1 Learnable feature extractor based on depthwise separable convolutions
Based on [26], for our f'_e we employ a factorization of the spatial and cross-channel learning process described by Eq. (1). We replace the 2D CNNs of the CRNN method with 2D DWS-CNNs, closely following the DWS-CNNs presented for the MobileNets model [15] and the hyperparameters used in the CRNN architecture [5]. Instead of using a single kernel per output channel to learn spatial and cross-channel information jointly, we apply, in series, two convolutions (i.e. the output of the first is the input to the second) using two different kernels. This factorization technique is termed depthwise separable convolution; it has been adopted in a variety of architectures for image processing (like the Xception, GoogLeNet, Inception, and MobileNets models) and has been proven to increase performance while reducing the number of parameters [7, 28, 27, 17, 15].
Firstly, we apply C_i depthwise kernels K^{d}_{c_i} of size K_h × K_w, one to each input channel, in order to learn the spatial relationships of the features, as

H^{d}_{c_i} = K^{d}_{c_i} * H^{in}_{c_i}.   (3)

Then, we utilize C_o point-wise kernels K^{p}_{c_o} of size C_i × 1 × 1 and apply them to H^{d}, in order to learn the cross-channel relationships, as

H^{out}_{c_o} = b_{c_o} + \sum_{c_i=1}^{C_i} K^{p}_{c_o, c_i} * H^{d}_{c_i}.   (4)

The resulting computational complexity and number of parameters (omitting bias), for both processes in Eqs. (3) and (4), are O(C_i · K_h · K_w · T · F + C_i · C_o · T · F) and C_i · K_h · K_w + C_i · C_o, respectively. Thus, the computational complexity [15] and the number of parameters are both reduced by a factor of 1/(1/C_o + 1/(K_h · K_w)). The process of depthwise separable convolution is illustrated in Figure 2, with the first step in Figure 2(a) and the second in Figure 2(b).
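A short sketch makes the reduction concrete by comparing the two parameter-count formulas above; the 256 channels and 5 × 5 kernel are again illustrative values.

```python
# Comparison of the parameter counts derived above: a standard convolution
# (C_i*C_o*K_h*K_w) versus its depthwise separable factorization
# (C_i*K_h*K_w + C_i*C_o).  The 256 channels and 5x5 kernel are
# illustrative values only.

def standard_params(c_in, c_out, k_h, k_w):
    return c_in * c_out * k_h * k_w

def separable_params(c_in, c_out, k_h, k_w):
    depthwise = c_in * k_h * k_w   # one K_h x K_w kernel per input channel
    pointwise = c_in * c_out       # 1x1 kernels mixing the channels
    return depthwise + pointwise

std = standard_params(256, 256, 5, 5)    # 1,638,400
sep = separable_params(256, 256, 5, 5)   # 71,936
print(round(std / sep, 1))               # roughly 22.8x fewer parameters
```

The ratio matches the factor 1/(1/C_o + 1/(K_h · K_w)): with C_o = 256 and a 5 × 5 kernel, the factorized convolution needs about 23 times fewer parameters and multiply-accumulates.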
Following the baseline approach, we use three blocks of DWS-CNNs, where each block consists of a DWS-CNN followed by a rectified linear unit (ReLU), a batch normalization process, and a max-pooling operation across the dimension of features. The output Φ' of the third DWS-CNN block is given as an input to f'_t.
3.2 Dilated convolutions
Contrary to the baseline approach, we employ f'_t in order to exploit the long temporal patterns in Φ'. f'_t is based on 2D dilated convolutions, which are capable of aggregating and learning multi-scale information and have been used previously in image processing tasks [13, 25, 32].
A dilated 2D CNN (DIL-CNN) consists of kernels K and bias terms b. Similarly to the typical CNN described in Section 2.1, C_i and C_o are the numbers of input and output channels of the DIL-CNN, and K_h and K_w are the height and width of the kernel for each channel. Each kernel is applied to the input H^{in} of the DIL-CNN to obtain the output H^{out}, as

H^{out}_{c_o}(t, f) = b_{c_o} + \sum_{c_i=1}^{C_i} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} K_{c_o, c_i}(i, j) · H^{in}_{c_i}(t + ξ_t · i, f + ξ_f · j),   (5)

where ξ_t and ξ_f are the dilation rates for the time and feature dimensions of H^{in}. It should be noted that for ξ_t = ξ_f = 1, Eq. (5) boils down to Eq. (1), i.e. a typical convolution with no dilation.

The dilation rates ξ_t and ξ_f multiply the indices that are used for accessing elements of H^{in}. This allows a scaled aggregation of contextual information at the output of the operation [32]. Practically, this means that the features computed by the DIL-CNN are calculated from a bigger area, resulting in the modelling of longer temporal context: along each dimension, the area covered by a kernel of size K grows to ξ(K − 1) + 1. The process described by Eq. (5) is illustrated in Figure 3.
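The growth of the covered area can be computed directly; a 3-tap kernel is assumed here, together with the dilation rates that appear in the experiments of Table 1.

```python
# Receptive field of a single dilated kernel along the time axis:
# inserting (dilation - 1) zeros between taps stretches a K-tap kernel
# to span dilation*(K - 1) + 1 frames.  The 3-tap kernel is illustrative.

def receptive_field(kernel: int, dilation: int) -> int:
    return dilation * (kernel - 1) + 1

for d in (1, 10, 50, 100):
    print(d, receptive_field(3, d))
# a 3-tap kernel with dilation 50 already spans 101 time frames
```

This is why a single dilated layer can stand in for a recurrent layer here: at the dilation rates considered, one kernel already covers on the order of a hundred time frames.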
We use a DIL-CNN to replace the recurrent neural network that models long temporal context and inter- and intra-class activities for SED. Specifically, our f'_t takes as an input the output Φ' of f'_e and applies to it a dilated convolution, according to Eq. (5), followed by a non-linearity. Finally, the output of f'_t is reshaped to a matrix with one row per time frame and given as an input to the classifier of our method, which is the classifier f_c of the baseline approach.
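A minimal PyTorch sketch of this replacement follows, assuming an illustrative channel count, kernel shape, and dilation rate (not the grid-searched values of Section 4).

```python
# Sketch of replacing the GRU with a time-dilated 2D convolution, as in
# Section 3.2.  Channel count, kernel shape, and dilation rate are
# illustrative; padding in the time dimension is dilation*(K_h - 1)//2,
# so the number of time frames is preserved, while the unpadded feature
# dimension shrinks by K_w - 1.
import torch
import torch.nn as nn

channels, k_h, k_w, dilation = 64, 3, 3, 50
f_t = nn.Conv2d(
    channels, channels,
    kernel_size=(k_h, k_w),
    dilation=(dilation, 1),               # dilate only the time axis
    padding=(dilation * (k_h - 1) // 2, 0),
)

x = torch.randn(1, channels, 128, 8)      # (batch, channels, T, F')
y = f_t(x)
print(y.shape)  # torch.Size([1, 64, 128, 6])
```

Unlike a GRU, this operation has no recurrence, so all time steps are computed in parallel and the layer can be split across devices.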
4 Evaluation setup
To assess the performance of each of the proposed replacements and their combination, we employ a freely available SED dataset and compare the performance of the CRNN with each of our proposed replacements. The code for all the models and the evaluation process described in this paper is freely available online (https://github.com/drcostas/dndsed).
4.1 Baseline system and models
We employ four different models: Model_{base}, Model_{dws}, Model_{dil}, and Model_{dnd}. Model_{base} is our main baseline and consists of three CNN blocks, followed by a GRU and a linear layer acting as a classifier. Each CNN block consists of a CNN with 256 channels and square kernels (with stride and zero padding chosen to preserve the time dimension), followed by a ReLU, a batch normalization, a max-pooling, and a dropout with 0.25 probability. The max-pooling operations sub-sample only the feature dimension. The GRU has 256 input and output features, and the classifier has 256 input and 16 output features. For our second model, Model_{dws}, we replace the CNN blocks of the CRNN with our f'_e, so we can assess the benefit of using DWS-CNNs instead of typical 2D CNNs. To minimize the factors that could have an impact on possible differences between our proposed method and the employed baseline, for our f'_e we adopted the same kernel shapes, strides, and padding as in Model_{base}. The same holds for all hyperparameters of the max-pooling operations.
For the third model, Model_{dil}, we replace the GRU of Model_{base} with our f'_t, so we can assess the benefit of using a DIL-CNN instead of an RNN. Since there are no previous studies using DIL-CNNs as a replacement for RNNs for SED, we opt to keep the same number of channels and perform a grid search over the kernel shape and the dilation rate of f'_t. Specifically, we employ four different kernel shapes, and we denote the employed shape with an exponent on the model name. Because we want to assess the effect of using a different time resolution for capturing inter- and intra-event patterns with the DIL-CNN, we apply dilation only on the time dimension and not on the dimension of features (i.e. ξ_f = 1). To keep the time dimension intact, we use zero padding at the time dimension, equal to ξ_t(K_h − 1)/2 for each kernel shape, and no padding at the feature dimension. It must be noted that when ξ_t = 1, the DIL-CNN is a typical 2D CNN and, thus, we also assess the effect of replacing the RNN with a typical 2D CNN. We also denote the employed dilation in the exponent of the model name.
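The "same-length" padding rule above can be verified with the standard output-length formula for a convolution with unit stride; the (kernel, dilation) pairs below are illustrative, not the exact grid-search values.

```python
# Check that zero padding of d*(k - 1)//2 frames keeps the time dimension
# intact for a dilated kernel with unit stride.  With stride 1, the
# standard output-length formula reduces to L_out = L_in + 2p - d*(k - 1).

def out_length(l_in: int, k: int, d: int, p: int) -> int:
    return l_in + 2 * p - d * (k - 1)

T = 1024
for k, d in [(3, 1), (3, 10), (5, 50), (7, 100)]:  # illustrative pairs
    p = d * (k - 1) // 2
    assert out_length(T, k, d, p) == T
print("time dimension preserved for all tested (kernel, dilation) pairs")
```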
Finally, Model_{dnd} is our complete proposed method, where we replace both the CNN blocks and the GRU of Model_{base} with f'_e and f'_t, respectively. For a complete assessment of our proposed method, we follow the same grid search over kernel shapes and dilation rates as we perform for Model_{dil}.
Table 1: SED performance (F_{1} score and error rate, ER) and computational performance (number of parameters and average time, in seconds, required for an epoch) for all models and parameterizations. N/A denotes a non-applicable parameterization.

| Model | Dilation ξ_t | F_{1} | ER | Parameters | Mean time per epoch (s) |
|---|---|---|---|---|---|
| Model_base | N/A | 0.59 | 0.54 | 3.68M | 49.4 |
| Model_dil (kernel shape 1) | 1 | 0.60 | 0.54 | 3.81M | 14.1 |
| Model_dil (kernel shape 1) | 10 | 0.61 | 0.53 | 3.81M | 14.1 |
| Model_dil (kernel shape 1) | 50 | 0.62 | 0.51 | 3.81M | 14.1 |
| Model_dil (kernel shape 1) | 100 | 0.61 | 0.53 | 3.81M | 14.5 |
| Model_dil (kernel shape 2) | 1 | 0.60 | 0.54 | 3.81M | 20.7 |
| Model_dil (kernel shape 2) | 10 | 0.63 | 0.51 | 3.81M | 18.2 |
| Model_dil (kernel shape 2) | 50 | 0.60 | 0.52 | 3.81M | 18.5 |
| Model_dil (kernel shape 2) | 100 | 0.58 | 0.56 | 3.81M | 18.5 |
| Model_dil (kernel shape 3) | 1 | 0.60 | 0.54 | 3.64M | 12.2 |
| Model_dil (kernel shape 3) | 10 | 0.62 | 0.52 | 3.64M | 12.2 |
| Model_dil (kernel shape 3) | 50 | 0.61 | 0.52 | 3.64M | 12.4 |
| Model_dil (kernel shape 3) | 100 | 0.58 | 0.57 | 3.64M | 12.4 |
| Model_dws | N/A | 0.62 | 0.50 | 0.62M | 46.9 |
| Model_dnd (kernel shape 1) | 1 | 0.59 | 0.54 | 0.74M | 13.0 |
| Model_dnd (kernel shape 1) | 10 | 0.62 | 0.51 | 0.74M | 13.0 |
| Model_dnd (kernel shape 1) | 50 | 0.61 | 0.53 | 0.74M | 13.0 |
| Model_dnd (kernel shape 1) | 100 | 0.60 | 0.53 | 0.74M | 13.4 |
| Model_dnd (kernel shape 2) | 1 | 0.59 | 0.55 | 0.74M | 20.1 |
| Model_dnd (kernel shape 2) | 10 | 0.62 | 0.52 | 0.74M | 17.0 |
| Model_dnd (kernel shape 2) | 50 | 0.62 | 0.52 | 0.74M | 17.4 |
| Model_dnd (kernel shape 2) | 100 | 0.58 | 0.56 | 0.74M | 17.4 |
| Model_dnd (kernel shape 3) | 1 | 0.60 | 0.54 | 0.58M | 11.4 |
| Model_dnd (kernel shape 3) | 10 | 0.63 | 0.50 | 0.58M | 11.1 |
| Model_dnd (kernel shape 3) | 50 | 0.61 | 0.53 | 0.58M | 11.2 |
| Model_dnd (kernel shape 3) | 100 | 0.58 | 0.57 | 0.58M | 11.3 |

4.2 Dataset and metrics
We use the TUT-SED Synthetic 2016 dataset, which is freely available online (http://www.cs.tut.fi/sgn/arg/taslp2017crnnsed/tutsedsynthetic2016) and has been employed in multiple previous works on SED [5, 8, 16]. TUT-SED Synthetic 2016 consists of 100 mixtures of around eight minutes length, created with isolated sound events from 16 classes, namely alarms & sirens, baby crying, bird singing, bus, cat meowing, crowd applause, crowd cheering, dog barking, footsteps, glass smash, gun shot, horse walk, mixer, motorcycle, rain, and thunder. The mixtures are split into training, validation, and testing splits of 60%, 20%, and 20%, respectively. The maximum polyphony of the dataset is 5. From each mixture we extract multiple sequences of log mel-band energy vectors, using a Hamming window with 50% overlap. As the evaluation metrics we use the F_{1} score and the error rate (ER), similarly to the original CRNN paper and previous work on SED [5, 8, 16]. Both metrics are calculated on a per-frame basis. Additionally, we keep a record of the training time per epoch for each model and for all repetitions of the optimization process, by measuring the elapsed time between the start and the end of each epoch.

4.3 Training and testing procedures
We optimize the parameters of all models (under all sets of hyperparameters) using the training split of the employed dataset, the Adam optimizer with the hyperparameter values proposed in the original paper [19], a batch size of 16, and the cross-entropy loss. After one full iteration over the training split (i.e. one epoch), we measure the value of the same loss on the validation split. We stop the optimization process if the loss on the validation split does not improve for 30 consecutive epochs, and we keep the values of the parameters of the model from the epoch yielding the lowest validation loss. Finally, we assess the performance of each model using the testing split and the employed metrics (i.e. F_{1} and ER).
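The stopping criterion can be sketched as follows; `train_epoch` and `validate` are hypothetical callbacks standing in for the actual training and validation code.

```python
# Sketch of the early-stopping criterion described above: stop when the
# validation loss has not improved for `patience` consecutive epochs,
# and keep the parameters from the best epoch.  `train_epoch` and
# `validate` are hypothetical callbacks, not part of the released code.

def fit(train_epoch, validate, patience=30, max_epochs=10_000):
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for _ in range(max_epochs):
        state = train_epoch()          # returns a snapshot of the parameters
        loss = validate()
        if loss < best_loss:
            best_loss, best_state, epochs_without_improvement = loss, state, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_state, best_loss

# toy run: losses improve, then plateau; training stops during the plateau
losses = iter([1.0, 0.8, 0.9, 0.7] + [0.75] * 40)
state, best = fit(lambda: "params", lambda: next(losses), patience=30)
print(best)  # 0.7
```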
In order to have an objective assessment of the impact of our proposed method, we repeat the optimization 10 times for every model, following the process described above. Then, we calculate the average and standard deviation of the above-mentioned metrics, i.e. the F_{1} score and the error rate (ER). In addition, we report the number of parameters and the mean training time per epoch, i.e. the time for a full iteration over the whole training split. All presented experiments were performed on an NVIDIA V100 GPU.
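A minimal sketch of the frame-wise metrics, assuming the standard SED definitions of substitutions, deletions, and insertions per frame (this is an illustrative implementation, not the exact evaluation code used in the experiments):

```python
# Frame-wise evaluation sketch.  Per frame, substitutions S = min(FN, FP),
# deletions D = max(0, FN - FP), and insertions I = max(0, FP - FN) are
# accumulated and normalized by the number of active reference events N;
# F1 is the usual micro-averaged score over all frames and classes.

def framewise_f1_er(ref, pred):
    """ref, pred: lists of per-frame binary class-activity lists."""
    tp = fp_total = fn_total = 0
    s = d = i = n = 0
    for r, p in zip(ref, pred):
        fn = sum(1 for a, b in zip(r, p) if a == 1 and b == 0)
        fp = sum(1 for a, b in zip(r, p) if a == 0 and b == 1)
        tp += sum(1 for a, b in zip(r, p) if a == 1 and b == 1)
        fp_total += fp
        fn_total += fn
        s += min(fn, fp)
        d += max(0, fn - fp)
        i += max(0, fp - fn)
        n += sum(r)
    f1 = 2 * tp / (2 * tp + fp_total + fn_total)
    er = (s + d + i) / n
    return f1, er

f1, er = framewise_f1_er(ref=[[1, 0], [1, 1]], pred=[[1, 1], [0, 1]])
print(round(f1, 3), round(er, 3))  # 0.667 0.667
```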
5 Results and discussion
Table 1 presents the results from all conducted experiments, organized in two groups. The first, termed SED performance, regards the performance of each model and set of hyperparameters on the SED task (i.e. F_{1} and ER). The second group, termed computational performance, considers the number of parameters and the average time necessary for training, for each model and each set of hyperparameters. The STDs of F_{1} and ER are small and are omitted for clarity.
The baseline CRNN system, i.e. Model_{base}, outperforms only a few of the proposed parameterizations in classification. In every other case, Model_{base} yields worse classification performance. This indicates that our proposed changes can, in general, result in better classification performance compared to the baseline system. Regarding the computational performance, it can be seen that there are specific sets of hyperparameters that result in models with more parameters than Model_{base}. Specifically, some parameterizations of Model_{dil}, for all employed dilations, have more parameters than Model_{base}. This increase, though, is not attributed to the difference in the number of parameters between the f'_t of Model_{dil} and the GRU of Model_{base}, but to the number of parameters of the classifier. In the case of Model_{base}, the output of the GRU has 256 features per time step and the classifier shares its weights through time, so the number of its input features is 256. But in the case of Model_{dil}, the dimensionality of the input to the classifier depends on the number of features remaining at the output of f'_t, which is inversely proportional to the size of the kernel of f'_t. After reshaping, the number of input features to the classifier is the product of the channels and the remaining features, which is considerably bigger than in the GRU case. Finally, Model_{base} needs more time (on average) per epoch than any other model and set of hyperparameters in Table 1. This clearly indicates that all of the proposed changes have a positive impact on the needed time per epoch, even in the cases where the number of parameters is bigger.
Comparing the impact of each of the changes (i.e. Model_{dws} versus Model_{dil}), we can see that adopting DWS-CNNs can significantly increase the SED performance, yielding better F_{1} and ER compared to Model_{base} and to most parameterizations of Model_{dil}. Additionally, Model_{dws} yields the lowest ER overall, but not the highest F_{1}. Furthermore, Model_{dws} has 0.62 M parameters, significantly fewer than any Model_{dil} and than Model_{base}. The decrease in the number of parameters and the increase in performance when using DWS-CNNs are in accordance with previous studies that adopted them [7, 28, 27, 17, 15]. Focusing on Model_{dil}, it can be observed that the usage of dilation increases the classification performance. Specifically, for all kernel shapes, ξ_t = 1 (i.e. no dilation) yields the lowest F_{1} and the highest ER, while for the largest employed dilation the classification performance decreases again.
Finally, when both f'_e and f'_t are combined (i.e. Model_{dnd}), there is a drop in performance (compared to Model_{dws}) for the (3, 3) and (5, 5) kernel shapes and for all employed dilations. But for the remaining kernel shape, Model_{dnd} attains the highest F_{1} score overall and an ER within 0.02 of the best. Additionally, this specific model needs the least average time per epoch and belongs to the group of models with the fewest parameters.
6 Conclusion and future work
In this paper, we proposed the adoption of 2D CNNs based on depthwise separable and dilated convolutions as a replacement for the usual 2D CNNs and RNN layers in typical SED methods. To evaluate our proposed changes, we conducted a series of experiments, assessing each replacement separately and also their combination. We used a widely adopted method and a freely available SED dataset. Our results showed that when both DWS-CNNs and DIL-CNNs are used, instead of the usual CNNs and RNNs, respectively, the resulting method has considerably better classification performance, the number of parameters decreases by 80%, and the average needed time (for training) per epoch decreases by 72%.
Although we conducted a grid search over the hyperparameters, the proposed method is likely not fine-tuned for the task of SED. Further study is needed in order to fine-tune the hyperparameters and yield the maximum classification performance for the task of SED.
Acknowledgement
Part of the computations leading to these results was performed on a TITAN X GPU donated by NVIDIA. Part of the research leading to these results has received funding from the European Research Council under the European Union’s H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. Stylianos I. Mimilakis is supported in part by the German Research Foundation (AB 675/2-1, MU 2686/11-1). The authors wish to acknowledge CSC (IT Center for Science), Finland, for computational resources.
References
 [1] (2017) Stacked convolutional and recurrent neural networks for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 1729–1733.
 [2] (2019) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 34–48.
 [3] (2017) homeSound: real-time audio event detection based on high performance computing for behaviour and surveillance remote monitoring. Sensors (Basel) 17 (4).
 [4] (2017) Convolutional recurrent neural networks for bird audio detection. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 1744–1748.
 [5] (2017) Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (6), pp. 1291–1303.
 [6] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
 [7] (2017) Xception: deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807.
 [8] (2019) Language modelling for sound event detection with teacher forcing and scheduled sampling. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).
 [9] (2019) A mobile application for sound event detection. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19).
 [10] (2019) Sound event localization and detection using CRNN on pairs of microphones. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).
 [11] (2018) Network decoupling: from regular to depthwise separable convolutions. In British Machine Vision Conference (BMVC).
 [12] (2019) Temporal convolutional networks for anomaly detection in time series. Journal of Physics: Conference Series 1213, 042050.
 [13] (1990) A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, J. Combes, A. Grossmann, and P. Tchamitchian (Eds.), Berlin, Heidelberg, pp. 286–297.
 [14] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861.
 [15] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861.
 [16] (2018) Using sequential information in polyphonic sound event detection. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC).
 [17] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 448–456.
 [18] (2019) Sound source detection, localization and classification using consecutive ensemble of CRNN models. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).
 [19] (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).
 [20] (2016) Temporal convolutional networks: a unified approach to action segmentation. In Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou (Eds.), Cham, pp. 47–54.
 [21] (2017) Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input. Technical report, DCASE2017 Challenge.
 [22] (2020) Sound event detection via dilated convolutional recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
 [23] (2017) Stacked convolutional and recurrent neural networks for music emotion recognition. In 14th Sound & Music Computing Conference (SMC-17), pp. 208–213.
 [24] (2018) Listening for sirens: locating and classifying acoustic alarms in city scenes. ArXiv abs/1810.04989.
 [25] (1992) The discrete wavelet transform: wedding the à trous and Mallat algorithms. IEEE Transactions on Signal Processing 40 (10), pp. 2464–2482.
 [26] (2014) Rigid-motion scattering for image classification. Ph.D. Thesis, Ecole Polytechnique, CMAP.
 [27] (2016) Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.
 [28] (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9.
 [29] (2019) Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).
 [30] TUT-SED Synthetic 2016. http://www.cs.tut.fi/sgn/arg/taslp2017crnnsed/tutsedsynthetic2016. Accessed: 2019-12-10.
 [31] (2017) Surrey-CVSSP system for DCASE2017 challenge task4. Technical report, DCASE2017 Challenge.
 [32] (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR).