Learning sound representations using trainable COPE feature extractors

01/21/2019 ∙ by Nicola Strisciuglio, et al.

Sound analysis research has mainly been focused on speech and music processing. The deployed methodologies are not suitable for the analysis of sounds with varying background noise, in many cases with very low signal-to-noise ratio (SNR). In this paper, we present a method for the detection of patterns of interest in audio signals. We propose novel trainable feature extractors, which we call COPE (Combination of Peaks of Energy). The structure of a COPE feature extractor is determined using a single prototype sound pattern in an automatic configuration process, which is a type of representation learning. We construct a set of COPE feature extractors, configured on a number of training patterns. Then we take their responses to build feature vectors that we use in combination with a classifier to detect and classify patterns of interest in audio signals. We carried out experiments on four public data sets: the MIVIA audio events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that we achieved (e.g. a recognition rate of 91.71% on the MIVIA road events and 81.25% on the ESC-10 data set) demonstrate the effectiveness of the proposed method and are higher than the ones obtained by other existing approaches. The COPE feature extractors are highly robust to variations of SNR. Real-time performance is achieved even when a large number of feature values is computed.






1 Introduction

Methods and systems for the automatic analysis of people and vehicle behavior, scene understanding, familiar place recognition and human-machine interaction are traditionally based on computer vision techniques. In robotics or public security, for instance, there has been a great effort to equip machines with capabilities for autonomous visual understanding. However, video analysis has some weak points, such as sensitivity to light changes and occlusions, or limitation to the field of view of the camera. Sound is complementary to visual information and can be used to improve the capabilities of machines to deal with the surrounding environment. Furthermore, there are cases in which video analysis cannot be used due to privacy issues (e.g. in public toilets).

In this paper we focus on the automatic learning of representations of sounds that are suitable for pattern recognition, in the context of environmental sound analysis for the detection and classification of audio events. Recently, the interest in automatic analysis of environmental sounds has increased because of various applications in intelligent surveillance and security CroccoSurvey , assistance of elderly people Vacher2010 , monitoring of smart rooms Wang2008 , home and social robotics Maxime2014 , etc.

A large part of sound analysis research in the past years focused on speech recognition Besacier201485 , speaker identification Roy2012 and music classification Fu2011survey . Features and classifiers for voice analysis are established and widely used in practical systems: spectral or cepstral features in combination with classifiers based on Hidden Markov Models or Gaussian Mixture Models. However, state-of-the-art methods for speech and music analysis do not give good results when applied to environmental sounds, which have highly non-stationary characteristics Cowling2003 . Most speech recognition methods assume that speech is based on a phonetic structure, which allows complex words or phrases to be analyzed by splitting them into a series of simple phonemes. In the case of environmental sound there is no underlying phoneme-like structure. Moreover, the human voice has very specific frequency characteristics that are not present in other kinds of sound. For example, events of interest for surveillance applications, such as gun shots or glass breaking, usually have high-frequency components that are not present in speech. For speech recognition and speaker identification the sound source is typically very close to the microphone. This implies that background noise has lower energy than foreground sounds and does not considerably impair the performance of the recognition system. Environmental sound sources can be, instead, at any distance from the microphone. Hence, the background noise can have relatively high energy, resulting in very low or even negative signal-to-noise ratio (SNR).

Existing methods for the detection of audio events, for which we provide an extensive overview in Section 2, are based on the extraction of hand-crafted features from the audio signal. The features extracted from (a part of) the audio signal are submitted to a classification system. The employed features describe stationary and non-stationary properties of the signals Chachada2013 . This approach to pattern recognition requires a feature engineering step that aims at choosing or designing a set of features that describe important characteristics of the sound for the problem at hand. Widely used features are mainly borrowed from the field of speech recognition: responses of log-frequency filters, Mel-frequency cepstral coefficients, and wavelet transform coefficients, among others. The choice of effective features, or of a combination of them, is a critical step in building an effective system and requires considerable domain knowledge.

More recent approaches do not rely on hand-crafted features but rather involve automatic learning of data representations from training samples by using deep learning and convolutional neural networks (CNNs) lecun2015deep . CNNs were originally proposed for visual data analysis, but have also been successfully applied to speech Huang2015 , music processing Oord2013 and sound scene classification Phan2017 . While they achieve very good performance, they require very large amounts of labeled training data, which are not always available.

In this work, we propose trainable feature extractors for sound analysis, which we call COPE (Combination of Peaks of Energy). They are trainable in the sense that their structure is not fixed in advance but is rather learned in an automatic configuration procedure using a single prototype pattern. This automatic configuration of feature extractors is a type of representation learning. It allows a suitable data representation to be constructed automatically for use together with a classifier, and does not require considerable domain expertise. We configure a number of COPE feature extractors on training sounds and use their responses to build a feature vector, which we then employ as input to a classifier. With respect to CopePreliminary2015 , in which we reported preliminary results obtained using COPE feature extractors on sound events with the same SNR, in this work we provide: a) a detailed formulation of the configuration and application steps of COPE features, b) a thorough validation of the performance of a classification system based on COPE features when tested with sounds with different values of SNR, c) an extension of the MIVIA audio events data set that includes null or negative SNR sound events and d) a wide comparison of the proposed method with other existing approaches on four benchmark data sets. Furthermore, in Section 5.4 we discuss the importance of robustness to variations of the background noise and of the SNR of the events of interest for applications of sound event detection. We provide a detailed analysis of the contribution of the COPE features to the improvement of sound event detection and classification performance with respect to existing approaches.

The design of COPE feature extractors was inspired by certain properties of the inner auditory system, which converts the sound pressure waves that reach our ears into neural stimuli on the auditory nerve. In Appendix A we provide some details about the biological mechanisms that inspired the design of the COPE feature extractors.

We validate the effectiveness of the proposed COPE feature extractors by carrying out experiments on the following public benchmark data sets: MIVIA audio events BoawPRL2015 , MIVIA road events ITS2015 , ESC-10 Piczak15 , TU-Dortmund Plinge2014 .

The main contributions of this work are: a) novel COPE trainable feature extractors for representation learning of sounds that are automatically configured on training examples, b) a method for audio event detection that uses the proposed features, c) the release of an extended version of the MIVIA audio events data set with sounds at null and negative SNR.

The rest of the paper is organized as follows. In Section 2 we review related works, while in Section 3 we present the COPE feature extractors and the architecture of the proposed method. We describe the data sets used for the experiments in Section 4. We report the results that we achieved, a comparison with existing methods and an analysis of the sensitivity of the performance of the proposed method with respect to the parameters of the COPE feature extractors in Section 5. We provide a discussion in Section 6 and, finally, draw conclusions in Section 7.

2 Related works

Representation learning has recently received great attention from researchers in pattern recognition, with the aim of constructing reliable features by learning directly from training data. Methods based on deep learning and CNNs were proposed to learn features for several applications: age and gender estimation from facial images LIU201782 , action recognition Deep2014 , person re-identification DING20152993 , hand-written signature verification HAFEMANN2017163 , and also sound analysis Aytar2016 . Other approaches for feature learning focused on sparse dictionary learning Rubinstein09 ; CHEN201751 , learning vector quantization BUNTE20111892 , and on extensions of the bag-of-features approach based on neural networks PASSALIS2017277 or higher-order pooling Koniusz17 .

In the context of audio analysis research, it is common to organize existing works on sound event detection according to the feature sets and classification architectures that they employ. Early methods approached the problems of sound event detection and classification by dividing the audio signal into small, partially overlapping frames and computing a feature vector for each frame. The used features ranged from relatively simple ones (e.g. frame energy, zero-crossing rate, sub-band energy rate) to more complicated ones (e.g. Mel-frequency cepstral coefficients Guo2003 , log-frequency filter banks Nadeu2001 , perceptual linear prediction coefficients Portelo2009 , etc.). The frame-level feature vectors were then used together with a classifier to perform a decision. Classifiers based on Gaussian Mixture Models (GMMs) were largely employed to classify the frames as part of sounds of interest or background clavel05 ; Atrey2006 . To limit the influence of background sounds on the classification performance, One-Class Support Vector Machines were proposed Rabaoui2008 .

Spectro-temporal features based on the spectrogram or other time-frequency representations were also developed Chu2009 ; Dennis13 . Inspired by the way the inner auditory system of humans responds to the frequency of sounds, an auditory image model (AIM) was proposed Patterson1992 . The AIM was used as the basis for improved models called stabilized auditory images (SAI) Lyon2011 . In gammatoneAVSS , event detection was formulated as an object detection problem in a spectrogram-like representation of the sound, and approached by using a cascade of AdaBoost classifiers. The design of hand-crafted features poses limitations on the construction of systems that are robust to varying conditions of the events of interest, and requires considerable domain knowledge.

In order to construct more reliable systems, efforts were made towards automatic learning of features from training data by means of machine learning techniques. Various approaches based on bag of features were proposed for sound event representation and classification Aucouturier07 ; Pancoast12 . A code-book of basic audio features (also called audio words) is learned directly from training samples as a result of a quantization of the feature space by means of various clustering algorithms (e.g. k-Means or fuzzy k-Means). A comparison of hard and soft quantization of audio words was performed in BoawPRL2015 . Other approaches for the construction of a code-book of basic audio words were based on non-negative matrix factorization Giannoulis13 or sparse coding Lu2014 . In the bag-of-features representation, the information about the temporal arrangement of the audio words is lost. This was taken into account in Grzeszick15 and chin2012 , where a feature augmentation and a classifier based on Genetic Motif Discovery were proposed, respectively. Sequences of audio words were also employed in Kumar12 and PhanM15 . The temporal information was described by a pyramidal approach to bag of features in Plinge2014 ; Grzeszick2016 . A method for sound representation learning based on Convolutional Neural Networks (CNNs) was proposed in Piczak15a . Learning features from training samples does not require an engineering effort and allows for the adaptation of the recognition systems to various problems. However, the effectiveness and generalization capabilities of learned features depend on the amount of available training data.

Evaluation of algorithms for audio event detection on public benchmark data sets is a valuable tool for the objective comparison of performance. The great attention dedicated to music and speech analysis led to the publication of several data sets used in scientific challenges for the benchmarking of algorithms. The MIREX challenge series evaluated systems for music information retrieval (MIR) Downie2010 . The CHiME challenge focused on speech analysis in noisy environments CHiME13 . The “Acoustic event detection and classification” task of the CLEAR challenges (2006 and 2007) focused on the detection of sound events related to seminars, such as speech, chair moving, door opening and applause Stiefelhagen2007 . Recently, the DCASE challenge Stowell2015 stimulated the interest of researchers in audio processing for the analysis of environmental sounds, driving attention towards audio event detection, audio event classification and scene classification.

3 Method

In Figure 1, we show an overview of the architecture of the proposed method. The algorithm is divided in two phases: configuration and application.

In the configuration phase (dashed line), the Gammatonegrams (see details in Section 3.1) of prototype training sounds are used to configure a set of COPE feature extractors (see Section 3.2.2). Subsequently, the responses of the set of COPE feature extractors, computed on the sounds in the training set, are employed to construct COPE feature vectors (Figure 1b-d). A multi-class SVM classifier is finally trained on the COPE feature vectors (Figure 1e) to distinguish between the classes of interest for the application at hand.

In the application phase, the previously configured set of COPE feature extractors is applied to extract feature vectors from unknown input sounds, and the multi-class SVM classifier is used to detect and classify the sound events of interest. The implementation of the COPE feature extractors and of the proposed classification architecture is publicly available (the code is available at http://gitlab.com/nicstrisc/COPE).

3.1 Gammatonegram

The traditional and most used time-frequency representation of sounds is the spectrogram, in which the energy distribution over frequencies is computed by dividing the frequency axis into sub-bands of equal bandwidth. In the human auditory system, instead, the resolution in the perception of differences in frequency changes according to the base frequency of the sound. At low frequencies the band-pass filters have a narrower bandwidth than the ones at high frequencies. This implies a higher time resolution of the filters at high frequencies, which can better capture fast variations of the signal. In this work we employ a bank of Gammatone band-pass filters, whose bandwidth increases with increasing central frequency. The functional form of the Gammatone filter is biologically inspired and models the response of the cochlea membrane in the inner ear of the human auditory system patterson1986auditory .

Figure 1: Architecture of the proposed method. The (a) Gammatonegram of the training audio samples is computed in the training phase (dashed arrow), and used to configure a (b) set of COPE feature extractors. The learned features are used in the application phase to (c) process the input sound and (d) construct feature vectors with their responses. A (e) multi-class SVM classifier is, finally, employed to detect events of interest.

The impulse response of a Gammatone filter is the product of a statistical distribution called Gamma and a sinusoidal carrier tone. It is formally defined as:

g(t) = a t^(n-1) e^(-2πbt) cos(2π f_c t + φ),

where f_c is the central frequency of the filter, and φ is its phase. The constant a controls the gain and n is the order of the filter. The parameter b is a decay factor and determines the bandwidth of the band-pass filter. The relation between the central frequency of a Gammatone filter and its bandwidth is given by the Equivalent Rectangular Bandwidth (ERB):

ERB(f_c) = ( (f_c / Q)^p + B_min^p )^(1/p),

where Q is the asymptotic filter quality at high frequencies and B_min is the minimum bandwidth at low frequencies, while p is usually equal to 1 or 2. In Glasberg1990 , the parameters Q, B_min and p were determined by measurements from notched-noise data. In Figure 2(a), we show the impulse responses of two Gammatone filters with a low and a higher central frequency. The filter with the higher central frequency has a larger bandwidth, as can be seen from their frequency responses in Figure 2(b).
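The two formulas above can be sketched in a few lines of Python. The constants in erb() are the notched-noise estimates of Glasberg1990 for p = 1, and the factor 1.019 tying the decay b to the ERB is a common choice in the literature; both are assumptions for illustration rather than values prescribed here, and the function names are our own:

```python
import math

def erb(fc, Q=9.26449, b_min=24.7, p=1):
    """Equivalent Rectangular Bandwidth (Hz) of a Gammatone filter
    centered at fc Hz, in the generalized form ((fc/Q)^p + B_min^p)^(1/p)."""
    return ((fc / Q) ** p + b_min ** p) ** (1.0 / p)

def gammatone_ir(fc, fs, duration=0.025, n=4, a=1.0, phase=0.0):
    """Sampled impulse response a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase),
    with the decay factor b related to the filter bandwidth via the ERB."""
    b = 1.019 * erb(fc)  # common choice relating the decay to the ERB
    out = []
    for k in range(int(duration * fs)):
        t = k / fs
        out.append(a * t ** (n - 1) * math.exp(-2 * math.pi * b * t)
                   * math.cos(2 * math.pi * fc * t + phase))
    return out
```

Note that the bandwidth returned by erb() grows with the central frequency, which is exactly the property that distinguishes the Gammatonegram from the ordinary spectrogram.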

Figure 2: (a) Impulse responses of two Gammatone filters, one with a low and one with a higher central frequency. The dashed lines represent the envelope of the sinusoidal tone. (b) Frequency responses of the filters in (a): the filter with the higher central frequency (dashed line) has a larger bandwidth.

We filter the input signal x(t) with a bank of N_f Gammatone filters. The response s_i(t) of the i-th filter to the input signal is the convolution of the input signal with the impulse response g_i(t):

s_i(t) = x(t) ∗ g_i(t).

We divide the input audio signal into frames of W samples and process every frame with the bank of Gammatone filters, in order to capture the short-time properties of the energy distribution of the sound. Two consecutive frames have W/2 samples in common, which means that they overlap for 50% of their length. This ensures continuity of analysis and that border effects are avoided. Given an input signal with M samples, the number of concerned frames is K = ⌊2M/W⌋ - 1. We finally construct the Gammatonegram of a sound as a matrix G, whose j-th column corresponds to the j-th frame, with j = 1, …, K. The energy value G(i, j) of the i-th frequency channel in the Gammatonegram at the j-th time instant is:

G(i, j) = Σ_{t ∈ frame j} s_i(t)².
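Under the framing scheme just described, the construction of the Gammatonegram can be sketched in plain Python. The frame length W, the 50% overlap and the per-frame energy as a sum of squared filter outputs follow the text; the function names are our own:

```python
def frame_energies(band_signal, W):
    """Energies of 50%-overlapping frames of W samples of one
    Gammatone channel output (illustrative sketch)."""
    hop = W // 2                     # consecutive frames share W/2 samples
    M = len(band_signal)
    K = 2 * M // W - 1               # number of frames for an M-sample signal
    return [sum(x * x for x in band_signal[j * hop: j * hop + W])
            for j in range(K)]

def gammatonegram(band_signals, W):
    """Gammatonegram matrix G: one row per Gammatone channel,
    one column per frame."""
    return [frame_energies(s, W) for s in band_signals]
```

For example, a constant unit signal of 8 samples with W = 4 yields three frames per channel, each with energy 4.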
In Figure 3(a), we show the Gammatonegram representation of a sample scream sound. It is similar to the spectrogram, with the substantial difference that the frequency axis has a logarithmic scale and the bandwidth of the band-pass filters increases linearly with the value of the central frequency.

3.2 COPE features

The configuration and application of COPE feature extractors involve a number of steps that we explain in the remainder of this section. In the application phase, given the Gammatonegram representation of a sound, a COPE feature extractor responds strongly to patterns similar to the one used in the configuration step. It also accounts for some tolerance in the detection of the pattern of interest, thus being robust to distortions due to noise or to varying SNR.

3.2.1 Local energy peaks

The energy peaks (local maxima) in a Gammatonegram are highly robust to additive noise wang03 . This property provides the designed COPE features with an underlying robustness to variations of the SNR of the sounds of interest. We consider a point to be a peak if it has higher energy than the points in its neighborhood. We suppress non-maxima points in the Gammatonegram and obtain an energy peak response map, as follows:

P(t, f) = G(t, f), if G(t, f) ≥ G(t′, f′) for all (t′, f′) in the neighborhood of (t, f); 0, otherwise,

where the extent of the neighborhood in time and frequency determines the region around a time-frequency point in which the local energy is evaluated (in this work we consider 8-connected pixels). We consider the arrangement (hereinafter constellation) of a set of such time-frequency points as a description of the distribution of the energy of a particular sound.
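The non-maxima suppression above can be sketched as a scan over the Gammatonegram matrix that keeps only points dominating their 8-connected neighborhood (a minimal illustration; the function name is our own):

```python
def peak_map(G):
    """Energy peak response map: keep a time-frequency point only if its
    energy is not lower than that of all its 8-connected neighbors."""
    R, C = len(G), len(G[0])
    P = [[0.0] * C for _ in range(R)]
    for i in range(R):
        for j in range(C):
            neigh = [G[i + di][j + dj]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di or dj) and 0 <= i + di < R and 0 <= j + dj < C]
            if all(G[i][j] >= v for v in neigh):
                P[i][j] = G[i][j]
    return P
```

A single dominant value in a 3x3 patch survives the suppression while all its neighbors are set to zero.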

Figure 3: Example of configuration of a COPE feature extractor performed on the (a) Gammatonegram representation of a scream. The (b) energy peaks are extracted and a support (dashed lines) is chosen around a reference point (small circle). The (c) configured feature extractor is composed of only those points whose energy is higher than a fraction of the energy of the reference point.

3.2.2 Configuration of a COPE feature extractor

Given the constellation of energy peaks of a sound and a reference point (in our case the point that corresponds to the highest peak of energy), we determine the structure of a COPE feature extractor in an automatic configuration process. For the configuration, one has to set the support size of the COPE feature extractor, i.e. the size of the time interval around the reference point in which energy peaks are considered.

In Figure 3, we show an example of the configuration process on the scream sound in Figure 3(a). First, we find the position of the local energy peaks and select a reference point (small circle in Figure 3(b)) around which we define the support size of the feature extractor. The support is contained between the two dashed (red) lines in Figure 3(b). We consider the positions of only those peaks that fall within the support of the feature extractor and whose energy is higher than a fraction of the highest peak of energy (Figure 3(c)). Every peak point is represented by a 3-tuple p_i = (Δt_i, f_i, e_i): Δt_i is the temporal offset with respect to the reference point, f_i is its corresponding frequency channel in the Gammatone filterbank and e_i is the value of the energy contained in it.

The configuration process results in a set of tuples that describe the constellation of energy peaks in the Gammatonegram image of a sound. We denote this set of 3-tuples by S = {p_i : i = 1, …, |S|}, where |S| is the number of considered peaks within the support of the filter.
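A minimal sketch of this configuration step, operating on a peak map (rows as frequency channels, columns as time frames) and producing the set of (Δt, f, e) tuples relative to the reference point. The threshold fraction `frac` and the function name are assumptions for illustration:

```python
def configure_cope(P, support, frac=0.1):
    """Configure a COPE model from a peak map P: take the highest peak as
    reference, keep the peaks within the temporal support whose energy
    exceeds a fraction of the reference energy, and store each as a
    (dt, f, e) tuple relative to the reference point."""
    ref_e, ref_f, ref_t = max((P[f][t], f, t)
                              for f in range(len(P))
                              for t in range(len(P[0])))
    model = []
    for f in range(len(P)):
        for t in range(len(P[0])):
            e = P[f][t]
            if e >= frac * ref_e and abs(t - ref_t) <= support // 2:
                model.append((t - ref_t, f, e))
    return model
```

The reference point itself is always retained, with a temporal offset of zero.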

3.2.3 Feature computation

Given a Gammatonegram, we compute the response of a COPE feature extractor as a combination of its weighted and shifted energy peaks. We define the weighted and shifted response of the i-th energy peak of the model as:

ψ_i(t) = max_{(t′, f′)} { P(t + Δt_i + t′, f_i + f′) · G_σ(t′, f′) },

where (t′, f′) ranges over a tolerance region around the expected position of the peak.

The function ψ_i(t) can be seen as a response map of the similarity between the detected energy peak in the input Gammatonegram and the corresponding one in the model. In this work we account only for the position and energy content of the peak points in the constellation. We weight the response with a Gaussian function G_σ(t′, f′), which allows for some tolerance in the expected position of the peak points. This choice is supported by evidence in the auditory system that vibrations of the cochlea membrane due to a sound wave of a certain frequency excite neurons specifically tuned for that frequency and also neighboring neurons. The size of the tolerance region is determined by the standard deviation σ of the function G_σ, which is a parameter of the feature extractor.

The value of a COPE feature is computed with a sliding window that shifts on the Gammatonegram of the input sound. Formally, we define it as the geometric mean of the weighted and shifted energy peak responses ψ_i(t):

r(t) = ( ∏_{i=1,…,|S|} ψ_i(t) )^(1/|S|), thresholded at θ,

where θ is a threshold value. Here, we set θ = 0, so as not to suppress any response. The value of a COPE feature for a sound in an interval delimited by [t_1, t_2] is given by max-pooling of the response r(t) with t ∈ [t_1, t_2]:

c = max_{t ∈ [t_1, t_2]} r(t).


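The application steps above (Gaussian positional tolerance, geometric mean over the model peaks, max-pooling over time) can be sketched as follows. This is an illustrative implementation under simplifying assumptions of our own: the tolerance search is limited to a ±1 bin neighborhood, and σ is shared between the time and frequency axes:

```python
import math

def cope_response(P, model, sigma=1.0):
    """COPE feature value for a peak map P (rows = frequency channels,
    columns = time frames) and a model of (dt, f, e) tuples: at each time
    position, each model peak is matched against P with a Gaussian
    tolerance on its position; the response is the geometric mean of the
    per-peak scores, and the feature value is its maximum over time."""
    n_f, n_t = len(P), len(P[0])
    best = 0.0
    for t in range(n_t):
        scores = []
        for dt, f, _e in model:
            s = 0.0
            for df2 in (-1, 0, 1):          # tolerance in frequency
                for dt2 in (-1, 0, 1):      # tolerance in time
                    ti, fi = t + dt + dt2, f + df2
                    if 0 <= ti < n_t and 0 <= fi < n_f:
                        w = math.exp(-(dt2 * dt2 + df2 * df2)
                                     / (2 * sigma * sigma))
                        s = max(s, P[fi][ti] * w)
            scores.append(s)
        if scores and all(s > 0 for s in scores):
            prod = 1.0
            for s in scores:
                prod *= s
            best = max(best, prod ** (1.0 / len(scores)))  # geometric mean
    return best
```

The geometric mean makes the response vanish whenever any model peak finds no support in the input, which is what makes the feature selective for the configured constellation.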
3.3 COPE feature vector

We configure a set of K COPE feature extractors on training audio samples from different classes. For a given interval of sound, we then construct a feature vector by concatenating the values of the configured COPE features:

v = ( c_1, c_2, …, c_K ).
3.4 Classifier

We use the COPE feature vectors to train a classifier, which assigns an input sound to one of the classes of interest. The COPE feature vectors are not dependent on a specific classifier and thus one can employ them together with any multi-class classifier.

In this work, we employ a multi-class SVM classifier, designed according to a one-vs-all scheme, in which N binary SVM classifiers (where N corresponds to the number of classes) are trained to recognize samples from the classes of interest. We use linear SVMs with soft margin, as they already provide satisfactory results and are easy to train. We set the hyperparameter C for the training of each SVM, which determines the trade-off between the training error and the size of the classification margin (see CortesSVM1995 for reference). We train the i-th SVM (i = 1, …, N) by using as positive examples those of the i-th class and as negative examples those of all the remaining classes. In this scheme, the training of each SVM classifier is an unbalanced problem, as the number of samples from the negative class outnumbers that of the samples from the positive class. We thus employ an implementation of the SVM algorithm that includes a cost-factor by which training errors on positive examples outweigh errors on negative examples (available in the SVMlight library, http://svmlight.joachims.org/ Morik99 ). In this way, the training errors for the positive and the negative examples have the same influence on the overall optimization.

During the test phase, each SVM classifier assigns a score to the sample under test (i.e. a COPE feature vector v that represents the sound to classify). We analyze the SVM scores and assign to the test vector the class that corresponds to the SVM that gives the highest classification score. We assign the sample under test to the reject class (background sound) in case all the scores are negative. Denoting by s_i(v) the score of the i-th SVM, we formally define the classification rule as:

c* = argmax_{i = 1, …, N} s_i(v), if max_i s_i(v) > 0; reject, otherwise.
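The decision rule above amounts to an argmax over the SVM scores with a reject option; a minimal sketch (returning None for the reject/background class is our own convention):

```python
def classify(scores):
    """One-vs-all decision rule: return the index of the SVM with the
    highest score, or None (reject class, background sound) when all
    scores are negative."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] > 0 else None
```

For instance, scores of (-1.0, 2.5, 0.3) yield class 1, while all-negative scores are rejected as background.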
4 Data sets

We carried out experiments on four public data sets, namely the MIVIA audio events BoawPRL2015 , MIVIA road events ITS2015 , ESC-10 Piczak15 and TU-Dortmund Plinge2014 data sets.

4.1 MIVIA audio events

Typical sounds of interest for intelligent surveillance applications are glass breakings, gun shots and screams. In the MIVIA audio events data set, such sounds are superimposed on various background sounds and have different SNRs. This simulates the occurrence of sounds in different environments and at various distances from the microphone. We extended the data set by including cases in which the energy of the sounds of interest is equal to or lower than that of the background sound, thus having null or negative SNR. Adopting the same procedure described in BoawPRL2015 , we created two further versions of the audio events, with null and negative SNR. The final data set (publicly available at http://www.gitlab.com/nicstrisc/COPE) contains a total of 8000 events for each class, divided into 5600 events for training and 2400 events for testing, equally distributed over the considered values of SNR. The audio clips are stored as PCM audio. Hereinafter we refer to glass breakings as GB, to gun shots as GS and to screams as S. We indicate the background sound with BN. In Table 1, we report the details of the composition of the extended data set.

MIVIA audio events data set

         Training set            Test set
         #Events   Duration (s)  #Events   Duration (s)
BN       -         77828.8       -         33382.4
GB       5600      8033.1        2400      3415.6
GS       5600      2511.5        2400      991.3
S        5600      7318.4        2400      3260.5

Table 1: Details of the composition of the MIVIA audio events data set. The total duration of the sounds is expressed in seconds.

4.2 MIVIA road events

The MIVIA road events data set contains car crash and tire skidding events mixed with typical road background sounds such as traffic jams, passing vehicles, crowds, etc. The sound events are superimposed on various road sounds, ranging from very quiet (e.g. country roads) to highly noisy traffic jams (e.g. in the center of a big city) and highways. The sounds of interest are distributed over audio clips of about one minute each, which are organized into four independent folds (each fold contains the same number of events per class) for cross-validation experiments. The audio signals are stored as PCM audio. In the rest of the paper, we refer to car crashes as CC and to tire skiddings as TS.

4.3 Esc-10

The ESC-10 data set is composed of sounds divided into ten classes (dog bark, rain, sea waves, baby cry, clock tick, sneeze, helicopter, chainsaw, rooster, fire crackling), each of them containing the same number of samples. The data set is organized in five independent folds. The average classification accuracy achieved by human listeners is reported in Piczak15 and serves as a reference baseline.

4.4 TU Dortmund

The TU Dortmund data set was recorded in a smart room with a microphone embedded in a table. The data set is composed of sounds from eleven classes (chair, cup, door, keyboard, laptop keys, paper sheets, pouring, rolling, silence, speech, steps), divided into a training and a test set. The sounds of interest are mixed with the background sound of the smart room. A ground truth with the start and end points of the sounds is provided. We constructed a second-observer ground truth, which contains a finer-grained manual segmentation of the events.

5 Experiments

5.1 Performance evaluation

For the MIVIA audio events and the MIVIA road events data sets we adopted the experimental protocol defined in BoawPRL2015 . The performance evaluation is based on the use of a time window that forward-shifts on the audio signal. An event is considered correctly detected if it is detected in at least one of the time windows that overlap with it. Besides the recognition rate and the confusion matrix, we consider two types of error that are important for performance evaluation: the detection of an event of interest when only background sound is present (false positive) and the case in which an event of interest occurs but is not detected (missed detection). In case a false positive is detected in two consecutive time windows, only one error is counted. We measured the performance of the proposed method by computing the recognition rate (RR), false positive rate (FPR), error rate (ER) and missed detection rate (MDR). Moreover, in addition to the receiver operating characteristic (ROC) curve, and in order to assess the overall performance of the proposed method, we compute the Detection Error Trade-off (DET) curve. It is a plot of the trade-off between the false positive rate and the missed detection rate and gives an insight into the performance of a classifier in terms of its errors. In contrast to the ROC curve, in the DET curve the axes are logarithmic, in order to highlight differences between classifiers in the critical operating region. The closer the curve to the point (0, 0), the better the performance of the system.

For the ESC-10 and TU Dortmund data sets, we evaluate the performance for the classification of isolated audio events. This type of evaluation is done according to the structure of these data sets and to make a comparison with the results achieved by other approaches possible. We compute the average recognition rate (RR) and the F-measure F = 2PR / (P + R), where P = TP / (TP + FP) is the precision and R = TP / (TP + FN) is the recall. TP, FP and FN are the number of true positive, false positive and false negative classifications, respectively. In the case of the MIVIA road events and the ESC-10 data sets, we perform cross-validation experiments.
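The precision, recall and F-measure defined above are standard and can be computed directly from the classification counts (a minimal sketch; the function name is our own):

```python
def f_measure(tp, fp, fn):
    """Precision P = TP/(TP+FP), recall R = TP/(TP+FN) and
    F-measure F = 2PR/(P+R) from classification counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```

For example, 8 true positives with 2 false positives and 2 false negatives give P = R = F = 0.8.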

5.2 Results

In Table 2, we report the classification matrix that we obtained on the extended version of the MIVIA audio events data set. The average recognition rate for the three classes is , while the miss detection rate and the error rate are and , respectively. We obtained an FPR equal to , of which are glass breakings, are gun shots and are screams.

Table 2: Classification matrix obtained by the proposed method on the extended MIVIA audio events data set. GB, GS and S indicate the classes in the data set (see Section 4.1), while MDR is the miss detection rate.

In Table 3 we report the classification matrix achieved by the proposed approach on the MIVIA road events data set. The average RR is with a standard deviation of , while the average FPR is with a standard deviation of . The results are in line with the ones achieved on the MIVIA audio events data set. The low standard deviation of the recognition rate is indicative of good generalization capabilities.

The proposed method shows high performance on the ESC-10 and TU Dortmund data sets, which both contain a larger number of classes than the MIVIA data sets, but with a lower number of samples per class. We achieved on the ESC-10 data set (the standard deviation of each measure is in brackets). On the TU Dortmund data set we achieved and . In the following, we compare the achieved results with the ones reported in other works.

Table 3: Average results obtained by the proposed method on the MIVIA road events data set. CC and TS are acronyms for the classes in the data set (see Section 4.2).

5.3 Results comparison

Result comparison - MIVIA audio events data set

[Table 4: numeric results are not recoverable from the source. Upper part (test with positive SNR only): Gammatone Saggese16 , UDWT Saggese16 , SoundNet SoundNet , HRNN Colangelo17 . Lower part (test including null and negative SNR): SoundNet SoundNet .]
Table 4: Comparison of the results with the ones of existing approaches on the MIVIA audio events data set. RR, MDR, ER and FPR refer to the metrics described in Section 5.1.

In Table 4, we report the results that we achieved on the MIVIA audio events data set, compared with those of existing methods. In the upper part of the table, we compare the results achieved by considering the classification of sound events with positive SNR only. In the lower part of the table, we report the results achieved by also including sound events with negative and null SNR in the evaluation.

It is important to clarify that the methods described in BoawPRL2015 ; Saggese16 employ the same multi-class one-vs-all linear SVM classifier as this work. The results that we report using SoundNet features SoundNet were obtained by feeding the features computed at the last convolutional layer of the SoundNet network to the same classifier used in this work.

SoundNet features obtained a recognition rate comparable to the one achieved by the proposed approach, but a considerably higher FPR. The recognition rate achieved by the Hierarchical Recurrent Neural Network (HRNN) classifier proposed in Colangelo17 is slightly higher than the ones we obtained, though the HRNN-based approach has a more complex design and training procedure, and uses a classifier other than an SVM. The values of MDR, ER and FPR are not reported in Colangelo17 .

The performance of the proposed method demonstrates the high robustness of the COPE feature extractors w.r.t. variations of the SNR. Conversely, for the methods proposed in BoawPRL2015 , the performance of the classification systems strongly depends on the SNR of the training sound events. When sounds with only positive SNR are used for training, the recognition rate achieved by the proposed method is considerably higher than the one obtained by the approaches proposed in BoawPRL2015 ; Saggese16 . The performance of the latter methods decreases strongly, with a recognition rate much lower than that of the proposed method, when sounds with negative SNR are included in the model. We provide an extensive analysis of robustness to variations of SNR in Section 5.4.

Table 5: Comparison of the results achieved on the MIVIA road events data set with respect to the methods proposed in ITS2015 ; Carletti13 . RR, MDR, ER and FPR refer to the evaluation metrics described in Section 5.1.
In Table 5, we compare the results we obtained on the MIVIA road events data set with those reported in ITS2015 , where different sets of audio features (BARK, MFCC and a combination of temporal and spectral features) were employed as short-time descriptors of sounds. We obtained an average recognition rate higher than the ones achieved by existing methods, with a lower standard deviation.

Figure 4: Detection Error Trade-off curves achieved by the proposed method (solid line) compared to the curves achieved by existing methods (dashed lines) on the (a) MIVIA audio events and (b) MIVIA road events data sets. (Notice the logarithmic scales.)

In Figures 4(a) and 4(b), we plot the DET curves obtained by our method (solid lines) on the MIVIA audio events and MIVIA road events data sets, respectively, together with those of the methods proposed in BoawPRL2015 and ITS2015 (dashed lines). The curve of our method is closer to the origin than those of the other approaches, confirming its higher performance with respect to existing methods on the concerned data sets.

We compare the results that we achieved on the ESC-10 data set with the ones reported by existing approaches in Table 6. Missing entries indicate values that are not reported in the published papers. The highest recognition rate is achieved by SoundNet SoundNet , which is a deep neural network trained on a very large data set of audio-visual correspondences. Approaches based on CNNs (Piczak15a ; SoundNet ; Hertel16 ; Medhat2017 ) are trained with data augmentation techniques and generally perform better on the ESC-10 data set than the proposed approach, which is instead trained only on the original sounds in the ESC-10 data set.

Result comparison on ESC-10 data set

[Table 6: numeric results are not recoverable from the source. Compared methods: Baseline Piczak15 , Random Forest Piczak15 , Piczak CNN Piczak15a , Conv. Autoencoder SoundNet , Hertel CNN Hertel16 , SoundNet SoundNet , MCLNN Medhat2017 . Columns report RR and F.]

Table 6: Comparison of the results on the ESC-10 data set. In brackets we report the standard deviation of the average performance metrics. RR and F refer to the evaluation metrics described in Section 5.1.

In Table 7, we report the results that we achieved on the TU Dortmund data set together with those reported in Plinge2014 , where a classifier based on bag of features was proposed. Besides the traditional bag of features (BoF) scheme, the authors proposed a pyramidal approach (P-BoF) and the use of super-frames (BoSF) for embedding temporal information about the sequence of features. The results in Table 7 are computed according to the ground truth that we constructed based on a fine segmentation of sounds of interest and that we made publicly available. It is worth noting that the performance results of our method refer to the classification of sound events. For the methods proposed in Plinge2014 the evaluation is performed by considering the classification of sound frames.

Result comparison on TU Dortmund data set

[Table 7: numeric results are not recoverable from the source. Compared methods: BoF Plinge2014 , P-BoF Plinge2014 , BoSF Plinge2014 . Columns report RR, Pr, Re and F.]

Table 7: Comparison of the results on the TU Dortmund data set. The results were computed with respect to the second-observer ground truth that we constructed. RR, Pr, Re and F refer to the evaluation metrics described in Section 5.1.

5.4 Robustness to background noise and SNR variations

Table 8: Analysis and comparison of stability of results w.r.t. varying value of SNR of the events of interest. Details on the training schemes T1 and T2 are provided in Section 5.4. RR, MDR, ER and FPR refer to the evaluation metrics described in Section 5.1.

We carried out a detailed analysis of the performance of the COPE feature extractors on sounds with different levels of SNR. We trained the proposed classifier following two different schemes. For the first training scheme (which we refer to as T1), we included in the training process only sound events with positive SNR. For the second training scheme (which we refer to as T2), we trained the classifier with all the sounds in the data set, including those with null and negative SNR. In Table 8, we report the results that we achieved for the classification of sounds in the MIVIA audio events data set by training the system according to T1 and T2. We tested both trained models on the whole test set of the MIVIA audio events data set (including negative and null SNRs). The proposed method showed stronger robustness to changing SNR w.r.t. the previously published approaches in BoawPRL2015 , especially when samples with null and negative SNR are not included in the training process. This demonstrates the high generalization capability of the proposed COPE feature extractors on sound events corrupted by high-energy noise.

In Figure 5, we plot the ROC curves relative to the performance achieved at the different levels of SNR of the sounds of interest contained in the MIVIA audio events data set. We observed substantial stability of performance when the sounds of interest have positive (even very low) SNR. The high robustness of the COPE feature extractors with respect to variations of the SNR is attributable to the use of the local energy peaks extracted from the Gammatonegram, which are robust to additive noise. The slightly lower results at negative SNR are mainly due to changes of the expected energy peak locations caused by the high energy of the background sounds. In such cases, most of the wrong classifications are due to classification errors rather than to missed detections of sounds of interest.
Figure 5: ROC curves obtained by the proposed method on the MIVIA audio events data set at different SNR values. The arrow indicates increasing values of SNR.

5.5 Sensitivity analysis

We analyzed the sensitivity of the COPE feature extractors with respect to the parameter that regulates the degree of tolerance to changes of the sounds of interest due to background noise or distortion. We used a version of the MIVIA audio events data set specifically built for cross-validation experiments. The data set was released in BoawPRL2015 and was constructed with the same procedure as the original data set, ensuring statistical independence and high variability among the folds. The sound events were divided into independent folds, each of them containing an equal number of events of interest per class at the various SNR versions, as in the original MIVIA events data set. In our analysis, we estimated the variance of the generalization error using the Nadeau-Bengio variance estimator Nadeau2003 , which takes into account the variability of the training and test sets used in cross-validation experiments.

For the configuration of a COPE feature extractor, the user has to choose the size of its support, i.e. the length of the time interval around the reference point in which energy peaks are considered for the configuration (see Section 3.2.2). We experimentally observed that different sizes of the support do not significantly influence the performance of the proposed system on the MIVIA data sets. We report results achieved with a relatively small support, which involves a limited number of energy peaks in the configuration of the feature extractors. One could choose a larger support and achieve similar performance; the drawback is the need to compute and combine the responses of a higher number of energy peaks, which increases the processing time of each feature extractor.

In Table 9, we report the generalization error (ER) and the false positive rate (FPR) as the tolerance parameter varies. The performance of the proposed system is slightly sensitive to varying values of the parameter, mostly when they are kept very low. For higher values, the performance is more stable. A higher tolerance in the detection of the energy peak positions determines a stronger robustness to background noise. It is worth pointing out that too large values of the tolerance might cause a loss in the selectivity and descriptive power of the COPE feature extractors and consequently a decrease of the classification performance.

Table 9: Sensitivity of the COPE feature extractors to various values of the tolerance parameter: the higher its value, the larger the tolerance of the feature extractor to variations of the pattern of interest. For the generalization error (ER) of cross-validation experiments, we report the value of the Nadeau-Bengio estimator of variance, which takes into account the variability in the training and test sets Nadeau2003 .

6 Discussion

The high recognition capabilities of the proposed method are attributable to the trainable character and the versatility of the COPE feature extractors. The concept of trainable filters was previously introduced for visual pattern recognition: COSFIRE filters were proposed for contour detection AzzopardiPetkov2012 , keypoint and object detection COSFIRE , retinal vessel segmentation AzzopardiStrisciuglio ; StrisciuglioVIP15 , curvilinear structure delineation StrisciuglioIWOBI17 ; StrisciuglioCAIP2017 ; StrisciuglioECCV18 , and action recognition StrisciuglioAction2018 . In this work, we extended the concept of trainable feature extractors to sound recognition. It is noteworthy that the proposed COPE feature extractors differ from template matching techniques, which are sensitive to variations with respect to the reference pattern: the tolerance introduced in the application phase also allows for the detection of modified versions of the prototype pattern, due for instance to noise or distortion.

An important advantage of using COPE feature extractors is the possibility of avoiding the process of feature engineering, which is a time-consuming task and requires substantial domain knowledge. In traditional sound recognition approaches, hand-crafted features (e.g. MFCC, spectral and temporal features, Wavelets and so on) are usually chosen and combined together to form a feature vector that describes particular characteristics of the audio signals. On the contrary, the automatic configuration of COPE feature extractors consists in learning data representations directly from the sounds of interest. Manual engineering of features is indeed not required.

Representation learning is typical of recent machine learning methods based on deep and convolutional neural networks, which require large amounts of training data. When large data sets are not available, new synthetic data is generated by transformations of the original training data. In this respect, the COPE algorithm differs from deep and convolutional neural network approaches, as it requires only one prototype pattern to configure a new feature. Moreover, the tolerance introduced in the application phase guarantees, to a certain extent, good generalization properties. Because of their flexibility, COPE feature extractors can thus be employed in various sound processing applications such as music analysis Newton11 ; Neocleous15 or audio fingerprinting Cano05 , among others.

Figure 6: (a) Gammatonegram of a prototype glass breaking sound used for the configuration of a COPE feature extractor. (b) Response of the feature extractor computed on the sound in (a). (c) Time-zoomed response at different SNRs. The response is stable for positive values of SNR and decreases for null or negative SNR values.

The COPE feature extractors are robust to variations of the background noise and of the SNR of the sounds of interest. In Figure 6(a), we show the Gammatonegram of a glass breaking sound with an SNR of 30dB. As an example, we configure a COPE feature extractor on this sound and compute its response on the sound of Figure 6(a), which we show in Figure 6(b). One can observe that the response is maximum at the same point used as reference point in the configuration phase. The response is null when at least one of the expected energy peaks is not present. In Figure 6(c), we show a time-zoomed detail of the response of the feature extractor computed on the same glass breaking event at different values of SNR (from -5dB to 30dB in steps of 5dB). The response remains stable for positive, even very low, values of SNR, and slightly decreases for null or negative SNR values. The stability of the response of the COPE feature extractors explains the high performance on sounds with different values of SNR, as demonstrated by the results that we reported in Section 5.4. The decrease of performance at null and negative SNR is due to the effect of background sounds with energy higher than that of the sounds of interest, which determines strong changes of the positions of the energy peaks with respect to those determined in the configuration. In this respect, the use of other functions in Eq. 6 to evaluate the energy peak similarity can be explored.
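The tolerant combination of energy peaks can be illustrated with a simplified sketch: each expected peak, stored as a (time offset, frequency band) pair relative to the reference point, contributes the maximum Gaussian-weighted peak energy found in a small neighborhood of its expected position, and the contributions are combined with a geometric mean so that the response is null if any expected peak is absent. This is an illustration under our own assumptions (the function names, the Gaussian weighting and the geometric-mean combination are ours); the actual similarity function is the one in Eq. 6.

```python
import numpy as np

def cope_like_response(peak_map, model_peaks, ref_time, sigma=2.0):
    """Combine expected energy peaks with tolerance: for each model peak
    (dt, band), take the maximum Gaussian-weighted value of the peak map
    around the expected time position; combine the contributions with a
    geometric mean, so a single missing peak nullifies the response."""
    contributions = []
    for dt, band in model_peaks:
        t = ref_time + dt
        radius = int(3 * sigma)
        best = 0.0
        for u in range(max(0, t - radius), min(peak_map.shape[1], t + radius + 1)):
            w = np.exp(-((u - t) ** 2) / (2 * sigma ** 2))
            best = max(best, w * peak_map[band, u])
        contributions.append(best)
    return float(np.prod(contributions) ** (1.0 / len(contributions)))
```

A peak that drifts a few frames from its expected position still contributes, with a weight that decays with the distance; this mimics the tolerance to noise and distortion discussed above.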

7 Conclusions and outlook

We proposed a novel method for feature extraction in audio signals based on trainable feature extractors, which we called COPE (Combination of Peaks of Energy). We employed the COPE feature extractors in the task of environmental sound event detection and classification, and tested their robustness to variations of the SNR of the sounds of interest. The results that we achieved on four public data sets (MIVIA audio events, MIVIA road events, ESC-10 and TU Dortmund) are higher than those of many existing approaches and demonstrate the effectiveness of the proposed method.

The design of COPE feature extractors was based on neuro-physiological evidence of the mechanism that translates sound pressure waves into neural stimuli from the cochlea membrane through the Inner Hair Cells (IHC) in the auditory system of mammals. The proposed method can be extended by also including in its processing the implementation of a neuron response inhibition mechanism that prevents the short-time firing of those IHCs that have recently fired Lopez-Poveda2006 . In this view, the computation of the energy peak map would need to account for the energy distribution of the sounds of interest in each frequency band at a larger time scale, instead of performing a local analysis only. The extension of the COPE feature extractor with such inhibition phenomenon can further improve the robustness of the proposed method to changes of background noise and SNR, as only significant energy peaks are to be processed.

Although the current implementation of the COPE feature extractor is rather efficient at computing a COPE feature vector of 200 elements on a 2GHz dual core CPU, its computation can be further sped up. Parallelization approaches can be explored, which compute the values of the COPE features or the local energy peak responses in separate threads. The construction of the COPE feature vector can also be optimized by including in the classification system only those feature extractors that are relevant for the application at hand; a feature selection scheme based on the relevance of the feature values, such as the one described in Strisciuglio2016 , can be employed. The optimization of the number of configured feature extractors and the implementation of parallelization strategies can jointly contribute to the implementation of a real-time system for intelligent audio surveillance on edge embedded systems.
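The thread-level parallelization mentioned above can be sketched with a thread pool that evaluates the configured feature extractors independently. The names below (`cope_feature_vector` and the stand-in extractors) are hypothetical, not the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def cope_feature_vector(extractors, signal, max_workers=4):
    """Compute each COPE feature in a separate worker thread; the feature
    vector preserves the order of the configured extractors."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda extract: extract(signal), extractors))

# Stand-in extractors: each maps a signal to one feature value.
extractors = [lambda s, k=k: k * sum(s) for k in range(3)]
print(cope_feature_vector(extractors, [1.0, 2.0]))  # [0.0, 3.0, 6.0]
```

Since each feature extractor only reads the input signal, the computations are independent and embarrassingly parallel, which is why a simple pool suffices.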

Appendix A Biological motivation

The sound pressure waves that hit our ears are directed to the cochlea membrane in the inner auditory system. Different parts of the cochlea membrane vibrate according to the energy of the frequency components of the sound pressure waves patterson1986auditory . A bank of Gammatone filters was proposed as a model of the cochlea membrane, whose response over time forms a spectrogram-like image called Gammatonegram Patterson1992 . The membrane vibrations stimulate firing of inner hair cells (IHC), which are neurons that lay behind the cochlea. The firing activity of IHCs stimulates various fibers of the auditory nerve over time. We consider the pattern of the IHC firing activity as a descriptor of the input sound.
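The Gammatonegram computation described above can be sketched as filtering with a bank of gammatone filters followed by short-time log-energy computation. The sketch below assumes the standard fourth-order gammatone impulse response and Glasberg-Moore ERB bandwidths; the frame and hop sizes are illustrative choices, not the values used in the paper.

```python
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.025):
    """Impulse response of an n-th order gammatone filter centered at fc (Hz):
    g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), with bandwidth b
    derived from the equivalent rectangular bandwidth (ERB) of the channel."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # Glasberg-Moore ERB
    b = 1.019 * erb
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def gammatonegram(signal, fs, center_freqs, frame=256, hop=128):
    """Filter the signal with each gammatone channel and take the log-energy
    of short frames, yielding a (bands x frames) time-frequency image."""
    rows = []
    for fc in center_freqs:
        y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        energies = [np.sum(y[i:i + frame] ** 2)
                    for i in range(0, len(y) - frame + 1, hop)]
        rows.append(np.log(np.array(energies) + 1e-12))
    return np.array(rows)
```

Each row of the resulting image corresponds to one cochlear channel, and local maxima in this image are the energy peaks that the COPE feature extractors operate on.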

Given a prototype sound, a COPE feature extractor models the pattern of points that describe the IHC firing activity. We consider the points of highest local energy in the Gammatonegram as the locations at which the IHCs fire, and the constellation that they form is a robust representation of the pattern of interest. Hence, a COPE feature extractor is configured by modeling the constellation of the peak points of the Gammatonegram of a prototype sound.
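The configuration process can be sketched as the extraction of the strongest local maxima of a Gammatonegram-like energy map and the storage of their positions relative to a reference point. This is a simplified illustration under our own assumptions (here the reference point is taken as the strongest peak, and only local maxima along time are considered); the function and parameter names are ours.

```python
import numpy as np

def configure_cope(energy, n_peaks=5):
    """Extract the n_peaks highest local energy maxima (over time, per band)
    and store their (time offset, band) positions relative to the position
    of the highest peak, which is used as reference point."""
    peaks = []
    for band in range(energy.shape[0]):
        row = energy[band]
        for t in range(1, len(row) - 1):
            if row[t] > row[t - 1] and row[t] >= row[t + 1]:  # local max in time
                peaks.append((row[t], t, band))
    peaks.sort(reverse=True)          # strongest peaks first
    peaks = peaks[:n_peaks]
    _, t_ref, _ = peaks[0]            # reference point: strongest peak
    return [(t - t_ref, band) for _, t, band in peaks]
```

The returned constellation of (time offset, band) pairs is the model of the prototype sound; at detection time, the feature extractor scores how well this constellation is reproduced in the input.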



  • (1) M. Crocco, M. Cristani, A. Trucco, V. Murino, Audio surveillance: A systematic review, ACM Comput. Surv. 48 (4) (2016) 52:1–52:46. doi:10.1145/2871183.
  • (2) M. Vacher, F. Portet, A. Fleury, N. Noury, Challenges in the processing of audio channels for ambient assisted living, in: IEEE Healthcom, 2010, pp. 330–337.
  • (3) J. C. Wang, H. P. Lee, J. F. Wang, C. B. Lin, Robust environmental sound recognition for home automation, IEEE Trans. Autom. Sci. Eng 5 (1) (2008) 25–31.
  • (4) J. Maxime, X. Alameda-Pineda, L. Girin, R. Horaud, Sound representation and classification benchmark for domestic robots, in: IEEE ICRA, 2014, pp. 6285–6292. doi:10.1109/ICRA.2014.6907786.
  • (5) L. Besacier, E. Barnard, A. Karpov, T. Schultz, Automatic speech recognition for under-resourced languages: A survey, Speech Communication 56 (0) (2014) 85–100. doi:10.1016/j.specom.2013.07.008.
  • (6) A. Roy, M. Magimai-Doss, S. Marcel, A fast parts-based approach to speaker verification using boosted slice classifiers, IEEE Trans. Inf. Forensics Security 7 (1) (2012) 241–254. doi:10.1109/TIFS.2011.2166387.
  • (7) Z. Fu, G. Lu, K. M. Ting, D. Zhang, A survey of audio-based music classification and annotation, IEEE Trans. Multimedia 13 (2) (2011) 303–319.
  • (8) M. Cowling, R. Sitte, Comparison of techniques for environmental sound recognition, Pattern Recogn. Lett. 24 (15) (2003) 2895–2907.
  • (9) S. Chachada, C. C. J. Kuo, Environmental sound recognition: A survey, in: APSIPA, 2013, pp. 1–9. doi:10.1109/APSIPA.2013.6694338.
  • (10) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. doi:10.1038/nature14539.
  • (11) Y. G. Jui-Ting Huang, Jinyu Li, An analysis of convolutional neural networks for speech recognition, in: ICASSP, 2015.
  • (12) A. van den Oord, S. Dieleman, B. Schrauwen, Deep content-based music recommendation, in: NIPS, 2013, pp. 2643–2651.
  • (13) H. Phan, L. Hertel, M. Maass, P. Koch, R. Mazur, A. Mertins, Improved audio scene classification based on label-tree embeddings and convolutional neural networks, IEEE Trans. Audio, Speech, Language Process. 25 (6) (2017) 1278–1290.
  • (14) N. Strisciuglio, M. Vento, N. Petkov, Bio-inspired filters for audio analysis, in: BrainComp 2015, Revised Selected Papers, 2016, pp. 101–115. doi:10.1007/978-3-319-50862-7_8.
  • (15) P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, M. Vento, Reliable detection of audio events in highly noisy environments, Pattern Recogn. Lett. 65 (2015) 22 – 28. doi:10.1016/j.patrec.2015.06.026.
  • (16) P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, M. Vento, Audio surveillance of roads: A system for detecting anomalous sounds, IEEE Trans. Intell. Transp. Syst. 17 (1) (2016) 279–288. doi:10.1109/TITS.2015.2470216.
  • (17) K. J. Piczak, Esc: Dataset for environmental sound classification, in: Proc ACM Int Conf Multimed, MM ’15, 2015, pp. 1015–1018.
  • (18) A. Plinge, R. Grzeszick, G. A. Fink, A bag-of-features approach to acoustic event detection, in: IEEE ICASSP, 2014, pp. 3704–3708.
  • (19) H. Liu, J. Lu, J. Feng, J. Zhou, Group-aware deep feature learning for facial age estimation, Pattern Recognition 66 (2017) 82 – 94.
  • (20) P. Foggia, A. Saggese, N. Strisciuglio, M. Vento, Exploiting the deep learning paradigm for recognizing human actions, in: IEEE AVSS 2014, 2014, pp. 93–98. doi:10.1109/AVSS.2014.6918650.
  • (21) S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 48 (10) (2015) 2993 – 3003.
  • (22) L. G. Hafemann, R. Sabourin, L. S. Oliveira, Learning features for offline handwritten signature verification using deep convolutional neural networks, Pattern Recognition 70 (2017) 163 – 176. doi:10.1016/j.patcog.2017.05.012.
  • (23) Y. Aytar, C. Vondrick, A. Torralba, Soundnet: Learning sound representations from unlabeled video, in: NIPS 2016, 2016.
  • (24) R. Rubinstein, M. Zibulevsky, M. Elad, Double sparsity: Learning sparse dictionaries for sparse signal approximation, IEEE Transactions on Signal Processing 58 (3) (2010) 1553–1564. doi:10.1109/TSP.2009.2036477.
  • (25) Y. Chen, J. Su, Sparse embedded dictionary learning on face recognition, Pattern Recognition 64 (2017) 51 – 59.
  • (26) K. Bunte, M. Biehl, M. F. Jonkman, N. Petkov, Learning effective color features for content based image retrieval in dermatology, Pattern Recognition 44 (9) (2011) 1892 – 1902.
  • (27) N. Passalis, A. Tefas, Neural bag-of-features learning, Pattern Recognition 64 (2017) 277 – 294. doi:10.1016/j.patcog.2016.11.014.
  • (28) P. Koniusz, F. Yan, P. H. Gosselin, K. Mikolajczyk, Higher-order occurrence pooling for bags-of-words: Visual concept detection, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2) (2017) 313–326. doi:10.1109/TPAMI.2016.2545667.
  • (29) G. Guo, S. Z. Li, Content-based audio classification and retrieval by support vector machines, IEEE Trans. Neur. Netw. 14 (1) (2003) 209–215.
  • (30) C. Nadeu, D. Macho, J. Hernando, Time and frequency filtering of filter-bank energies for robust HMM speech recognition, Speech Communication 34 (2001) 93 – 114. doi:10.1016/S0167-6393(00)00048-0.
  • (31) J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad, A. Serralheiro, Non-speech audio event detection, in: IEEE ICASSP, 2009, pp. 1973–1976.
  • (32) C. Clavel, T. Ehrette, G. Richard, Events detection for an audio-based surveillance system, in: ICME, 2005, pp. 1306 –1309. doi:10.1109/ICME.2005.1521669.
  • (33) P. K. Atrey, N. C. Maddage, M. S. Kankanhalli, Audio based event detection for multimedia surveillance, in: IEEE ICASSP, Vol. 5, 2006.
  • (34) A. Rabaoui, M. Davy, S. Rossignol, N. Ellouze, Using one-class svms and wavelets for audio surveillance, IEEE Trans. Inf. Forensics Security 3 (4) (2008) 763–775.
  • (35) S. Chu, S. Narayanan, C. C. J. Kuo, Environmental sound recognition with time-frequency audio features, IEEE Trans. Audio, Speech, Language Process. 17 (6) (2009) 1142–1158. doi:10.1109/TASL.2009.2017438.
  • (36) J. Dennis, H. D. Tran, E. S. Chng, Image feature representation of the subband power distribution for robust sound event classification, IEEE Trans. Audio, Speech, Language Process. 21 (2) (2013) 367–377.
  • (37) R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, M. Allerhand, Complex Sounds and auditory images, in: Auditory Physiology and Perception, 1992, pp. 429–443.
  • (38) R. F. Lyon, J. Ponte, G. Chechik, Sparse coding of auditory features for machine hearing in interference, in: IEEE ICASSP, 2011, pp. 5876–5879.
  • (39) P. Foggia, A. Saggese, N. Strisciuglio, M. Vento, Cascade classifiers trained on gammatonegrams for reliably detecting audio events, in: IEEE AVSS, 2014, pp. 50–55.
  • (40) J.-J. Aucouturier, B. Defreville, F. Pachet, The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music, J Acoust Soc Am 122 (2) (2007) 881–891.
  • (41) S. Pancoast, M. Akbacak, Bag-of-audio-words approach for multimedia event classification., in: INTERSPEECH, 2012, pp. 2105–2108.
  • (42) D. Giannoulis, D. Stowell, E. Benetos, M. Rossignol, M. Lagrange, M. D. Plumbley, A database and challenge for acoustic scene classification and event detection, in: EUSIPCO, 2013, pp. 1–5.
  • (43) X. Lu, Y. Tsao, S. Matsuda, C. Hori, Sparse representation based on a bag of spectral exemplars for acoustic event detection, in: IEEE ICASSP, 2014, pp. 6255–6259. doi:10.1109/ICASSP.2014.6854807.
  • (44) R. Grzeszick, A. Plinge, G. Fink, Temporal acoustic words for online acoustic event detection, in: Pattern Recognition, Vol. 9358 of LNCS, 2015, pp. 142–153.
  • (45) M. Chin, J. Burred, Audio event detection based on layered symbolic sequence representations, in: IEEE ICASSP, 2012, pp. 1953–1956.
  • (46) A. Kumar, P. Dighe, R. Singh, S. Chaudhuri, B. Raj, Audio event detection from acoustic unit occurrence patterns, in: IEEE ICASSP, 2012, pp. 489–492.
  • (47) H. Phan, L. Hertel, M. Maass, R. Mazur, A. Mertins, Audio phrases for audio event recognition, in: EUSIPCO, 2015.
  • (48) R. Grzeszick, A. Plinge, G. A. Fink, Bag-of-features methods for acoustic event detection and classification, IEEE Trans. Audio, Speech, Language Process. 25 (6) (2017) 1242–1252. doi:10.1109/TASLP.2017.2690574.
  • (49) K. J. Piczak, Environmental sound classification with convolutional neural networks, in: IEEE MLSP, 2015, pp. 1–6. doi:10.1109/MLSP.2015.7324337.
  • (50) J. S. Downie, A. F. Ehmann, M. Bay, M. C. Jones, The Music Information Retrieval Evaluation eXchange: Some Observations and Insights, 2010, pp. 93–115.
  • (51) J. Barker, E. Vincent, N. Ma, H. Christensen, P. Green, The PASCAL CHiME speech separation and recognition challenge, Computer Speech and Language 27 (3) (2013) 621 – 633. doi:10.1016/j.csl.2012.10.004.
  • (52) R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, P. Soundararajan, The CLEAR 2006 Evaluation, 2007, pp. 1–44.
  • (53) D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, M. D. Plumbley, Detection and classification of acoustic scenes and events, IEEE Trans. Multimedia 17 (10) (2015) 1733–1746. doi:10.1109/TMM.2015.2428998.
  • (54) R. D. Patterson, B. C. J. Moore, Auditory filters and excitation patterns as representations of frequency resolution, Frequency selectivity in hearing (1986) 123–177.
  • (55) B. R. Glasberg, B. C. Moore, Derivation of auditory filter shapes from notched-noise data, Hearing Research 47 (1–2) (1990) 103 – 138.
  • (56) A. L.-C. Wang, An industrial-strength audio search algorithm, in: ISMIR, 2003.
  • (57) A. Palmer, I. Russell, Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells, Hearing Research 24 (1) (1986) 1 – 15. doi:10.1016/0378-5955(86)90002-X.
  • (58) C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297. doi:10.1007/BF00994018.
  • (59) K. Morik, P. Brockhausen, T. Joachims, Combining statistical learning with a knowledge-based approach – a case study in intensive care monitoring, in: International Conference on Machine Learning (ICML), 1999, pp. 268–277.
  • (60) A. Saggese, N. Strisciuglio, M. Vento, N. Petkov, Time-frequency analysis for audio event detection in real scenarios, in: AVSS, 2016, pp. 438–443. doi:10.1109/AVSS.2016.7738082.
  • (61) Y. Aytar, C. Vondrick, A. Torralba, Soundnet: Learning sound representations from unlabeled video, in: NIPS, 2016, pp. 892–900.
  • (62) F. Colangelo, F. Battisti, M. Carli, A. Neri, F. Calabró, Enhancing audio surveillance with hierarchical recurrent neural networks, in: AVSS, 2017, pp. 1–6. doi:10.1109/AVSS.2017.8078496.
  • (63) V. Carletti, P. Foggia, G. Percannella, A. Saggese, N. Strisciuglio, M. Vento, Audio surveillance using a bag of aural words classifier, in: AVSS, 2013, pp. 81–86. doi:10.1109/AVSS.2013.6636620.
  • (64) L. Hertel, H. Phan, A. Mertins, Comparing time and frequency domain for audio event recognition using deep learning, in: IJCNN, 2016, pp. 3407–3411.
  • (65) F. Medhat, D. Chesmore, J. Robinson, Environmental Sound Recognition Using Masked Conditional Neural Networks, 2017, pp. 373–385.
  • (66) C. Nadeau, Y. Bengio, Inference for the generalization error, Machine Learning 52 (3) (2003) 239–281. doi:10.1023/A:1024068626366.
  • (67) G. Azzopardi, N. Petkov, A CORF computational model of a simple cell that relies on LGN input outperforms the Gabor function model, Biological Cybernetics 106 (3) (2012) 177–189. doi:10.1007/s00422-012-0486-6.
  • (68) G. Azzopardi, N. Petkov, Trainable COSFIRE filters for keypoint detection and pattern recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2) (2013) 490–503. doi:10.1109/TPAMI.2012.106.
  • (69) G. Azzopardi, N. Strisciuglio, M. Vento, N. Petkov, Trainable COSFIRE filters for vessel delineation with application to retinal images, Medical Image Analysis 19 (1) (2015) 46–57. doi:10.1016/j.media.2014.08.002.
  • (70) N. Strisciuglio, G. Azzopardi, M. Vento, N. Petkov, Unsupervised delineation of the vessel tree in retinal fundus images, in: Computational Vision and Medical Image Processing VIPIMAGE 2015, 2015, pp. 149–155.
  • (71) N. Strisciuglio, N. Petkov, Delineation of line patterns in images using B-COSFIRE filters, in: IWOBI, 2017, pp. 1–6. doi:10.1109/IWOBI.2017.7985538.
  • (72) N. Strisciuglio, G. Azzopardi, N. Petkov, Detection of curved lines with B-COSFIRE filters: A case study on crack delineation, in: CAIP, 2017, pp. 108–120. doi:10.1007/978-3-319-64689-3_9.
  • (73) N. Strisciuglio, G. Azzopardi, N. Petkov, Brain-inspired robust delineation operator, in: Computer Vision – ECCV 2018 Workshops, 2019, pp. 555–565.
  • (74) A. Saggese, N. Strisciuglio, M. Vento, N. Petkov, Learning skeleton representations for human action recognition, Pattern Recognition Letters. doi:10.1016/j.patrec.2018.03.005.
  • (75) M. Newton, L. Smith, Biologically-inspired neural coding of sound onset for a musical sound classification task, in: IJCNN, 2011, pp. 1386–1393. doi:10.1109/IJCNN.2011.6033386.
  • (76) A. Neocleous, G. Azzopardi, C. Schizas, N. Petkov, Filter-based approach for ornamentation detection and recognition in singing folk music, in: CAIP, Vol. 9256 of LNCS, 2015, pp. 558–569. doi:10.1007/978-3-319-23192-1_47.
  • (77) P. Cano, E. Batlle, T. Kalker, J. Haitsma, A review of audio fingerprinting, Journal of VLSI signal processing systems for signal, image and video technology 41 (3) (2005) 271–284. doi:10.1007/s11265-005-4151-3.
  • (78) E. A. Lopez-Poveda, A. Eustaquio-Martín, A biophysical model of the inner hair cell: The contribution of potassium currents to peripheral auditory compression, Journal of the Association for Research in Otolaryngology 7 (3) (2006) 218–235. doi:10.1007/s10162-006-0037-8.
  • (79) N. Strisciuglio, G. Azzopardi, M. Vento, N. Petkov, Supervised vessel delineation in retinal fundus images with the automatic selection of B-COSFIRE filters, Mach. Vis. Appl. (2016) 1–13. doi:10.1007/s00138-016-0781-7.