Audio plays a critical role in understanding the environment around us. This makes audio content analysis research important for tasks related to multimedia [1, 2] and human computer interaction [3, 4], to name a couple. However, unlike the field of computer vision, which has a variety of standard publicly available datasets such as ImageNet, audio event/scene analysis lacks comparably large datasets. This makes it difficult to compare different approaches and establish the state of the art. The second iteration of DCASE, occurring in 2016, offers an opportunity to compare approaches on a standard public dataset. This edition includes four different tasks: acoustic scene classification, sound event detection in real and in synthetic audio, and audio tagging.
The state-of-the-art systems of the previous DCASE challenge, for both acoustic scenes [6, 7, 8] and sound event detection [6, 8, 9], attributed their success mainly to features and audio representations rather than to classifiers. Hence, an important aspect of our work is to emphasize classifier exploration along with features. In this paper we present our work on Task 1 and Task 3. We propose a variety of methods for both tasks and obtain significant improvements over the baseline methods.
2 Tasks and Data
The goal of Task 1, Acoustic Scene Classification, is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded (e.g., park, home, office). The TUT Acoustic Scenes 2016 dataset is used for this task. It consists of recordings from various acoustic scenes. For each recording location, a 3-5 minute long audio recording was captured, and the original recordings were then split into 30-second segments for the challenge. There are 15 acoustic scenes in the task.
Task 3, Sound Event Detection in Real Life Recordings, evaluates the performance of sound event detection in multi-source conditions similar to our everyday life. There is no control over the number of overlapping sound events at each time, neither in the training nor in the testing audio data. The TUT Sound Events 2016 dataset is used for Task 3; it consists of recordings from two acoustic scenes: Home and Residential Area. There are 18 selected sound event classes, 11 for Home and 7 for Residential Area.
3 Task 1: Acoustic Scene Classification
From a machine learning perspective, we treat Task 1 as a multi-class classification problem. The first step is to use a suitable method for characterizing acoustic scenes in the audio segments. An effective approach for characterizing audio events is a bag-of-audio-words feature representation, which is usually built over low-level features such as MFCCs. Acoustic scenes, however, are more complex mixtures of different audio events, and a more robust representation is required. To obtain one, we use Gaussian Mixture Models (GMMs): broadly, we employ two GMM-based high-level feature representations of audio scenes. On the classification front we use Support Vector Machines (SVMs) as our primary classifier, both on their own and in combination with other classifiers.
3.1 Feature Representations
Let the $d$-dimensional MFCC vectors for a recording be represented as $\vec{x}_t$, where $t = 1$ to $T$ and $T$ is the total number of MFCC vectors for the recording. The main idea behind both high-level feature representations is to capture the distribution of the MFCC vectors of a recording. We will refer to them as the soft-count histogram ($\vec{h}$) features and the GMM-adaptation ($\vec{m}$) features, with sub-types indicated by appropriate subscripts and superscripts.
The first step in obtaining a high-level fixed-dimensional feature representation for audio segments is to train a GMM on the MFCC vectors of the training data. Let us represent this GMM by $\mathcal{G} = \{w_k, \vec{\mu}_k, \Sigma_k : k = 1,\dots,K\}$, where $w_k$, $\vec{\mu}_k$ and $\Sigma_k$ are the mixture weight, mean and covariance parameters of the $k^{th}$ Gaussian in $\mathcal{G}$. We assume diagonal covariance matrices for all Gaussians, and $\vec{\sigma}_k$ will represent the diagonal vector of $\Sigma_k$. Given the MFCC vectors of a recording, we compute $Pr(k|\vec{x}_t)$, the probabilistic assignment of $\vec{x}_t$ to the $k^{th}$ Gaussian. These soft assignments are added over all $t$ to obtain $n_k$, the total mass of MFCC vectors belonging to the $k^{th}$ Gaussian (Eq. 1). Normalization by $T$ removes the effect of the duration of recordings.

$$n_k = \frac{1}{T}\sum_{t=1}^{T} Pr(k|\vec{x}_t) \quad (1)$$
The soft-count histogram feature, referred to as $\vec{h}$, is $\vec{h} = [n_1, n_2, \dots, n_K]^T$. $\vec{h}$ is a $K$-dimensional feature representation for a given recording, capturing how its MFCC vectors are distributed across the Gaussians in $\mathcal{G}$. $\vec{h}$ is normalized to sum to $1$ before being used for classifier training.
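As an illustrative sketch (using scikit-learn's GaussianMixture as a stand-in for our GMM training, with toy data), the soft-count histogram of a recording can be computed as follows:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def histogram_features(mfccs, gmm):
    """Soft-count histogram: average posterior mass per Gaussian.

    mfccs: (T, d) array of MFCC vectors for one recording.
    gmm:   fitted sklearn GaussianMixture with K diagonal components.
    Returns a K-dimensional vector normalized to sum to 1.
    """
    post = gmm.predict_proba(mfccs)       # (T, K) soft assignments Pr(k|x_t)
    h = post.sum(axis=0) / len(mfccs)     # sum over t, normalize by T
    return h / h.sum()                    # renormalize to sum to 1

# toy usage: fit a small GMM on random "training" MFCCs
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 20))
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train)
h = histogram_features(rng.normal(size=(120, 20)), gmm)
```

The component count and dimensionality here are arbitrary; in practice the GMM is trained once on all training-data MFCCs and reused for every recording.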
The next set of features, also based on the GMM $\mathcal{G}$, tries to capture the actual distribution of the MFCC vectors of a recording. This is done by adapting the parameters of $\mathcal{G}$ to the MFCC vectors of the recording. We employ maximum a posteriori (MAP) estimation for the adaptation. Parameter adaptation for the $k^{th}$ Gaussian follows these steps. First we compute

$$c_k = \sum_{t=1}^{T} Pr(k|\vec{x}_t), \quad E_k(\vec{x}) = \frac{1}{c_k}\sum_{t=1}^{T} Pr(k|\vec{x}_t)\,\vec{x}_t, \quad E_k(\vec{x}^2) = \frac{1}{c_k}\sum_{t=1}^{T} Pr(k|\vec{x}_t)\,\vec{x}_t^2. \quad (2)$$

Finally, the updated means and variances are obtained as

$$\hat{\vec{\mu}}_k = \frac{c_k E_k(\vec{x}) + r\,\vec{\mu}_k}{c_k + r}, \qquad \hat{\vec{\sigma}}_k^2 = \frac{c_k E_k(\vec{x}^2) + r\,(\vec{\sigma}_k^2 + \vec{\mu}_k^2)}{c_k + r} - \hat{\vec{\mu}}_k^2. \quad (3)$$
The relevance factor $r$ controls the effect of the original parameters on the new estimates. We obtain different feature representations using the adapted means ($\hat{\vec{\mu}}_k$) and variances ($\hat{\vec{\sigma}}_k^2$). The first, denoted by $\vec{m}$, is a $K \cdot d$ dimensional feature obtained by concatenating the adapted means for all $k$, that is, $\vec{m} = [\hat{\vec{\mu}}_1^T, \dots, \hat{\vec{\mu}}_K^T]^T$. In the second, the adapted variances are concatenated along with the means to obtain a $2 K d$ dimensional feature, denoted by $\vec{mv}$.
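A minimal sketch of the MAP mean-adaptation step (scikit-learn's GaussianMixture stands in for our GMM, and the relevance factor value here is illustrative, not the one used in our experiments):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapted_means(mfccs, gmm, r=16.0):
    """MAP-adapt the GMM means to one recording and concatenate them.

    mfccs: (T, d) MFCC matrix; gmm: fitted diagonal GaussianMixture;
    r: relevance factor (16 is an assumed, illustrative value).
    Returns a (K*d,) feature vector of adapted means.
    """
    post = gmm.predict_proba(mfccs)              # (T, K) Pr(k|x_t)
    c = post.sum(axis=0)                         # (K,) soft counts
    # posterior-weighted mean of the recording's frames per Gaussian
    Ex = (post.T @ mfccs) / np.maximum(c, 1e-10)[:, None]
    # interpolate between data statistics and the prior (GMM) means
    alpha = (c / (c + r))[:, None]
    mu_hat = alpha * Ex + (1.0 - alpha) * gmm.means_
    return mu_hat.ravel()

rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(300, 10)))
m = map_adapted_means(rng.normal(size=(80, 10)), gmm)
```

The variance adaptation follows the same interpolation pattern; only the mean update is shown here for brevity.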
Once the feature representations for the audio segments have been obtained, Task 1 essentially becomes a multi-class classification problem. Our primary classifiers are SVMs, for which we explore a variety of kernels. For the $\vec{m}$ and $\vec{mv}$ features, we use the Linear Kernel (LK) and the RBF Kernel (RK). For the soft-count histogram features $\vec{h}$, along with LK and RK we explored the following kernels.
Exponential Chi-square Distance Kernel (ECK): the kernel is computed as $K(\vec{x}, \vec{y}) = \exp(-\gamma D(\vec{x}, \vec{y}))$, where $D(\vec{x}, \vec{y}) = \sum_i \frac{(x_i - y_i)^2}{x_i + y_i}$ is the $\chi^2$ distance.
Chi-square Kernel (CK): In this case $K(\vec{x}, \vec{y}) = \sum_i \frac{2\,x_i y_i}{x_i + y_i}$.
Intersection Kernel (IK): $K(\vec{x}, \vec{y}) = \sum_i \min(x_i, y_i)$.
Exponential Hellinger Distance Kernel (EHK): $K(\vec{x}, \vec{y}) = \exp(-\gamma D_H(\vec{x}, \vec{y}))$, where $D_H(\vec{x}, \vec{y}) = \sum_i (\sqrt{x_i} - \sqrt{y_i})^2$.
Hellinger Kernel (HK): $K(\vec{x}, \vec{y}) = \sum_i \sqrt{x_i y_i}$.
The details of these kernels can be found in [13, 14, 15]. For kernels where the $\gamma$ term appears, the optimal value of $\gamma$ can be obtained by cross validation over the training data. However, setting $\gamma$ equal to the inverse of the average distance between training data points works well in general. We use LIBSVM for the SVM implementation.
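As an illustrative sketch on toy data (using scikit-learn's SVC with a precomputed Gram matrix rather than the LIBSVM interface), the ECK with the average-distance heuristic for $\gamma$ can be used like this:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(X, Y):
    """Pairwise chi-square distance D(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i)."""
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-10   # guard against 0/0
    return (num / den).sum(axis=2)

rng = np.random.default_rng(0)
Xtr = rng.dirichlet(np.ones(16), size=40)   # toy normalized histogram features
ytr = rng.integers(0, 2, size=40)
Xte = rng.dirichlet(np.ones(16), size=5)

# gamma = inverse of the average chi-square distance between training points
gamma = 1.0 / chi2_distance(Xtr, Xtr).mean()
K_train = np.exp(-gamma * chi2_distance(Xtr, Xtr))   # ECK Gram matrix
K_test = np.exp(-gamma * chi2_distance(Xte, Xtr))    # test-vs-train kernel
svm = SVC(kernel="precomputed").fit(K_train, ytr)
pred = svm.predict(K_test)
```

Note that the same $\gamma$ must be reused when computing the test-versus-train kernel; the other kernels above drop into the same precomputed-kernel pipeline.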
Finally, we have a classifier fusion step in which we combine the outputs of the different classifiers: each classifier casts a vote for its predicted class, and the final prediction is the class with the maximum number of votes. We call this the Fused Classifier, and we observed that it can give significant improvements for several acoustic scenes.
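The vote-based fusion can be sketched as follows (the scene labels are hypothetical examples):

```python
from collections import Counter

def fuse_predictions(pred_lists):
    """Majority-vote fusion: each element of pred_lists is one classifier's
    predicted labels over the same test set. Ties go to the label seen
    first among the tied candidates (Counter preserves insertion order)."""
    fused = []
    for votes in zip(*pred_lists):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# three classifiers voting on four test segments
preds = [["park", "home", "office", "home"],
         ["park", "office", "office", "home"],
         ["bus",  "home",  "office", "park"]]
print(fuse_predictions(preds))  # ['park', 'home', 'office', 'home']
```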
Our experimental setup, including the fold structure, is the same as the one provided by DCASE. We extracted 20-dimensional MFCC features over short overlapping windows, and augmented them with their delta and acceleration coefficients. For our final feature representations we experimented with different values of the GMM component size $K$; the relevance factor $r$ was kept fixed. Due to space constraints we cannot present fold- and scene-specific results for all cases, hence overall accuracy across all folds is shown. Table 1 shows overall accuracy results for the different cases; the MFCC-GMM baseline method provided in the challenge serves as the reference.
We can observe from Table 1 that the $\vec{h}$ features in general do not perform better than the baseline method for any SVM kernel. However, the adapted $\vec{m}$ and $\vec{mv}$ features clearly outperformed the baseline method; in the best case our method outperformed the baseline by a substantial absolute margin.
Table 2 shows results for the fused classifiers. For the fusion step we did not consider classifiers built over $\vec{h}$, since these are inferior compared to those using the adapted features. We can observe that our proposed method beats the baseline method by a clear absolute margin. Moreover, for scenes such as Park, Train and Library, where the baseline method gives very poor results, we improved the accuracy substantially. We also obtained superior overall accuracy on all folds, which suggests that our proposed method is fairly robust. This is further supported by our overall accuracy on the DCASE evaluation set.
4 Task 3: Sound Event Detection in Real Life Recordings
The detection of sound events in scenes and long recordings has been treated as a multi-class classification problem before in [18, 19, 20], where a classifier is trained on the sound segments. At test time, the classifier outputs segment/frame-level predictions for all classes. In order to follow a similar approach, we first wanted to analyze the features' performance on sound events regardless of the scene. This way, we could build an intuition about performance before moving to the harder scenario of Task 3, where not every segment of the scene corresponds to a labeled sound event.
4.1 Features and Classifiers Optimization
For the features we tried the conventional MFCCs with standard parameters: 12 coefficients plus energy, with deltas and double deltas, for a total of 39 dimensions. Moreover, we explored three features addressing time-frequency acoustic characteristics. The Gabor Filter Bank (GBFB) features have 2D filters arranged by spectral and temporal modulation frequencies in a filter bank. The Separable Gabor Filter Bank (SGBFB) features extract spectro-temporal patterns with two separate 1D Gabor filter banks, a spectral one and a temporal one; this approach reduces the complexity of the spectro-temporal feature extraction and further improves robustness. Both features were extracted with the default parameters of the toolbox (http://www.uni-oldenburg.de/mediphysik-akustik/mediphysik/downloads/gabor-filter-bank-features/) for a total dimension of 1,020 each. The Scatnet features are generated by a scattering architecture which computes invariants to translations, rotations, scaling and deformations, while keeping enough discriminative information. It can be interpreted as a deep convolutional network where, as opposed to standard convolutional networks, the filters are not learned but are scaled and rotated wavelets. The features were extracted with a toolbox (http://www.di.ens.fr/data/software/scatnet/) using 0.25-second segments; the dimensionalities of the three Scatnet components are 2, 84 and 435, for a total of 521. Additionally, we included the mean-and-variance-normalized Stacked features (MFCCs + SGBFB + Scatnet) with PCA, and the normalized Stacked features without PCA. For the PCA we used Scikit-learn, with the full dimensionality of 1,580 as the number of input components; the automatic reduction resulted in 909 dimensions. For all feature types, and to avoid the variability in temporal length, we averaged the vectors across time to end up with one single vector per sound event file.
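The time-averaging step at the end of the paragraph maps variable-length files to fixed-size vectors; a minimal sketch (the 39-dimensional MFCC shape is just an example):

```python
import numpy as np

def pool_features(frame_feats):
    """Collapse a variable-length (T, D) frame-level feature matrix into
    a single D-dimensional vector by averaging across time, giving one
    vector per sound event file regardless of its duration."""
    return np.asarray(frame_feats).mean(axis=0)

# two files of different lengths map to vectors of the same dimension
a = pool_features(np.random.rand(37, 39))    # e.g. 39-dim MFCC + deltas
b = pool_features(np.random.rand(112, 39))
```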
For the classifier exploration we used Tpot, a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. This toolbox (version 4) considers 12 classifiers, such as Decision Tree, Random Forest, Xtreme Gradient Boosting, SVMs, K-Neighbors and Logistic Regression. The main Tpot parameter is the number of generations, which corresponds to the number of iterations carried out to tune the classifier; we set it to 15. The best classifier for each feature type can be seen in Table 3. Interestingly, decision-tree-based algorithms and logistic regression outperformed others like SVMs.
For our experiments, we extracted the 18 sound events from the two scenes using the annotations, and then extracted the different feature types from these isolated sounds. For each feature-type experiment, the sound events' feature files were fed to Tpot in a randomly selected split of 75% training and 25% testing, with no files shared between the sets. We kept the same partitions across our experiments for consistency. Performance was measured in terms of accuracy and is displayed in Table 3.
|Scene + setup||Accuracy (%)||Best classifier|
|Home + G||55.2||Random Forest|
|Home + G + P||55.7||Random Forest|
|Residential + G||57.8||Decision Tree|
|Residential + G + P||56.7||Random Forest|
The features with the best performance were MFCCs, with 67.7%, and thus we kept them for our DCASE evaluation setup. The other features have shown better results than MFCCs on other audio classification tasks, but that was not the case for this particular dataset: Scatnet reached 62.1%, GBFB 52.4%, and SGBFB 61.5%. Moreover, the two normalized stacked features performed almost as well as MFCCs, with 66.68% for the stacked features without PCA and 66.08% with PCA. In principle the stacked versions contain more information about the acoustics and were thus expected to perform better. Nevertheless, they did not outperform MFCCs, which are designed for speech and focus on lower frequencies rather than on a wider frequency range. We cannot draw a fundamental conclusion on the performance of these features for sound event classification, since the amount of data and the number of classes are determining factors.
4.2 Inclusion of Generic Sound Event Class
In the annotated scenes, not every segment of audio corresponds to a labeled sound. Hence, it cannot be assumed that any of our sound event classes is present in every test segment. To handle such out-of-vocabulary segments, we propose a generic sound event class.
For the first experiment we wanted to analyze the impact of the generic class together with the 18 sounds in the multi-class classification setup described in Section 4.1 using MFCCs. To create this class, we used the sound event annotations and trimmed out the unlabeled audio between the labeled segments. Then, we randomly selected from both scenes 60 audio files, which is about the average number of sound event samples per class. To visualize the performance, we included the normalized confusion matrices (CMs) in Figures 1 and 2. The accuracy without the generic class was 67.7% and with the generic class 60.94%. The performance dropped with the inclusion of the new class, but we can also observe that although the generic class shares the background acoustics with the other sound classes, it was not significantly confused with them.
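A sketch of how the unlabeled gaps between annotated events can be located (interval arithmetic only; the actual audio trimming and the random file selection are omitted):

```python
def unlabeled_regions(annotations, total_dur):
    """Complement of the annotated event intervals within a recording.

    annotations: list of (onset, offset) pairs in seconds (may overlap).
    Returns the (start, end) gaps from which 'generic class' audio is cut.
    """
    gaps, cursor = [], 0.0
    for onset, offset in sorted(annotations):
        if onset > cursor:               # silence/unlabeled before this event
            gaps.append((cursor, onset))
        cursor = max(cursor, offset)     # merge overlapping events
    if cursor < total_dur:               # trailing unlabeled audio
        gaps.append((cursor, total_dur))
    return gaps

print(unlabeled_regions([(1.0, 2.5), (2.0, 4.0), (6.0, 7.0)], 10.0))
# [(0.0, 1.0), (4.0, 6.0), (7.0, 10.0)]
```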
The second set of experiments used the DCASE setup of separate scenes and four folds, and utilized the sound events with and without the generic class; this time the generic class had files particular to each scene. The results in Table 4 show the benefit of including the generic class. Moreover, the CMs, not included due to space limitations, had cleaner diagonals. The performance improvements on the DCASE setup can be attributed to the use of fewer sound classes, which reduces class ambiguity; to the generic class being built with same-scene files as opposed to a mix of both scenes; and to the per-scene optimization of the classifier using Tpot.
4.3 Generation of Data Through Perturbation
The scarcity of labeled data per event is a common issue, as discussed in [26, 27]. Annotations are costly, sounds do not occur with the same frequency, and in general it is hard to capture enough variations of the same sound to train robust models. To address this problem, multiple techniques have been explored in the literature, such as perturbation of the audio signal as in [28, 29], where the authors presented multiple types of perturbations resulting in improvements in speech separation. For Task 3 we performed time-based perturbation by speeding up and slowing down the sound event samples. We empirically analyzed multiple combinations of speed values for different events, and concluded that speeding up the original signal by more than 30% resulted in unintelligible audio, while slowing it down by more than 100% produced sounds unlikely to occur in practice. The final range included 13 different speed values plus the original version.
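The speed perturbation can be sketched with simple linear-interpolation resampling (a stand-in for illustration only, not necessarily the tool we used; like naive resampling in general, it shifts pitch together with speed):

```python
import numpy as np

def change_speed(signal, factor):
    """Time-based perturbation: resample a 1-D signal so that it plays
    `factor` times faster (factor > 1) or slower (factor < 1)."""
    n_out = int(round(len(signal) / factor))
    # positions in the original signal to sample the output from
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

x = np.sin(np.linspace(0, 20, 16000))   # toy 1-second event at 16 kHz
fast = change_speed(x, 1.3)             # 30% faster -> shorter signal
slow = change_speed(x, 0.7)             # slower -> longer signal
```

Applying a grid of such factors to every training file yields the 13 perturbed versions added to the training set.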
This set of experiments used the DCASE setup of separate scenes and four folds, and utilized the time-based perturbed audio. For training, we added the 13 perturbed versions of each file to the original files, whereas the test set remained intact. The results can be seen in Table 4, where the performance for Home improved but Residential did not. We nevertheless decided to use perturbation for the DCASE evaluation.
4.4 Sound Event Detection and Submission Systems
For Task 3, we used the DCASE setup of separate scenes and four folds, in a setup similar to the experiments from Table 4. For each scene, we extracted the sound events from the recordings using the annotations from the training set, followed by the extraction of MFCC features. Then we trained the Tpot-optimized multi-class classifier with the event samples. For testing, instead of using sound event files only, we segmented the scene recordings from the test set into consecutive one-second segments. This duration was selected to match the metric schema of the DCASE evaluation, which considers one-second segments. We then extracted audio features from the test segments and evaluated them with the classifier to obtain scores for each trained sound event class. The label corresponding to the highest score was chosen for each segment; the predictions were then written into the DCASE output format and fed to the official scoring scripts (http://www.cs.tut.fi/sgn/arg/dcase2016/sound-event-detection-metrics) along with the ground truth to compute performance.
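The one-second segmentation of the test recordings can be sketched as follows (the sample rate is illustrative):

```python
import numpy as np

def segment_recording(samples, sr, seg_dur=1.0):
    """Split a scene recording into consecutive non-overlapping segments
    of seg_dur seconds, matching the segment-based DCASE evaluation."""
    hop = int(sr * seg_dur)
    return [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]

sr = 16000
rec = np.zeros(sr * 5 + 300)          # ~5 s of toy audio; tail is dropped
segs = segment_recording(rec, sr)
```

Each segment is then featurized and scored exactly like an isolated sound event file.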
We utilized this pipeline for three experiments: without generic class and without perturbation, with generic class and without perturbation, and with generic class and with perturbation. The results using the development-test set are shown in Table 5. The inclusion of the generic class and the perturbation outperformed the baseline method by a significant margin for both the Home and Residential scenes. Our submission consisted of the runs using G and G + P, but using the evaluation set. The evaluation results for the G + P version were a segment-based error rate (SBER) of 0.9613 and an F-score of 33.6%.
|Scene + setup||SBER||F-score (%)|
|Home + G||0.91||23.7|
|Home + G + P||0.9||24.7|
|Residential + G||0.72||45.9|
|Residential + G + P||0.63||52.2|
In this paper we presented different approaches for both acoustic scene classification (Task 1) and sound event detection (Task 3) of the 2016 DCASE challenge. On both tasks we obtained significant improvements over the baseline methods. For Task 1 we observed that the GMM-adaptation features ($\vec{m}$, $\vec{mv}$) performed much better than the soft-count histogram features ($\vec{h}$). Although linear and RBF kernels with the adaptation features can outperform the baseline by a considerable margin on their own, a multiple-classifier system gives further improvements. For Task 3, we tested different features and classifiers and significantly improved over the baseline. Moreover, we explored a way of handling out-of-vocabulary sound segments with a generic class, and the inclusion of perturbed audio to add robustness.
-  H. Cheng, J. Liu, S. Ali, O. Javed, Q. Yu, A. Tamrakar, A. Divakaran, H. S. Sawhney, R. Manmatha, J. Allan et al., “Sri-sarnoff aurora system at trecvid 2012: Multimedia event detection and recounting,” in Proceedings of TRECVID, 2012.
-  L. Jiang, S.-I. Yu, D. Meng, Y. Yang, T. Mitamura, and A. G. Hauptmann, “Fast and accurate content-based semantic search in 100m internet videos,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015.
-  S. Chu, S. Narayanan, C.-C. J. Kuo, and M. Mataric, “Where am i? scene recognition for mobile robots using audio features,” in 2006 IEEE ICME. IEEE, 2006, pp. 885–888.
-  M. Janvier, X. Alameda-Pineda, L. Girin, and R. Horaud, “Sound-Event Recognition with a Companion Humanoid,” in Humanoids 2012 - IEEE International Conference on Humanoid Robotics. Osaka, Japan: IEEE, 2012, pp. 104–111.
-  A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 24th European Signal Processing Conference 2016 (EUSIPCO 2016), Budapest, Hungary, 2016.
-  G. Roma, W. Nogueira, P. Herrera, and R. de Boronat, “Recurrence quantification analysis features for auditory scene classification,” IEEE AASP Challenge:, Tech. Rep, 2013.
-  A. Rakotomamonjy and G. Gasso, “Histogram of gradients of time-frequency representations for audio scene classification,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 142–153, 2015.
-  J. Schröder, N. Moritz, M. R. Schädler, B. Cauchi, K. Adiloglu et al., “On the use of spectro-temporal features for the ieee aasp challenge ‘detection and classification of acoustic scenes and events’,” in 2013 IEEE WASPAA. IEEE, 2013.
-  J. F. Gemmeke, L. Vuegen, P. Karsmakers, B. Vanrumste et al., “An exemplar-based nmf approach to audio event detection,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 2013, pp. 1–4.
-  F.-F. Li and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in IEEE CVPR, vol. 2, 2005.
-  F. Bimbot et al., “A tutorial on text-independent speaker verification,” EURASIP journal on applied signal processing, vol. 2004, pp. 430–451, 2004.
-  J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive study,” International journal of computer vision, vol. 73, no. 2, pp. 213–238, 2007.
-  A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 3, pp. 480–492, 2012.
-  P. Li, G. Samorodnitsky, and J. Hopcroft, “Sign cauchy projections and chi-square kernel,” in Advances in Neural Information Processing Systems, 2013, pp. 2571–2579.
-  C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
-  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” The Journal of Machine Learning Research, 2008.
-  B. Elizalde, M. Ravanelli, K. Ni, D. Borth, and G. Friedland, “Audio-concept features and hidden markov models for multimedia event detection.”
-  B. Elizalde and G. Friedland, “Lost in segmentation: Three approaches for speech/non-speech detection in consumer-produced videos,” in 2013 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2013, pp. 1–6.
-  B. Elizalde, G. Friedland, H. Lei, and A. Divakaran, “There is No Data Like Less Data: Percepts for Video Concept Detection on Consumer-Produced Media,” in ACM International Workshop on Audio and Multimedia Methods for Large-Scale Video Analysis at ACM Multimedia, 2012.
-  M. R. Schädler, B. T. Meyer, and B. Kollmeier, “Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition,” The Journal of the Acoustical Society of America, pp. 4134–4151, 2012.
-  M. R. Schädler and B. Kollmeier, “Separable spectro-temporal gabor filter bank features: Reducing the complexity of robust features for automatic speech recognition,” The Journal of the Acoustical Society of America, 2015.
-  L. Sifre and S. Mallat, “Rotation, scaling and deformation invariant scattering for texture discrimination,” in Proceedings of the IEEE CVPR, 2013, pp. 1233–1240.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, and others., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  R. S. Olson, R. J. Urbanowicz, P. C. Andrews et al., “Automating biomedical data science through tree-based pipeline optimization,” in Proceedings of the 18th European Conference on the Applications of Evolutionary and Bio-inspired Computation, ser. Lecture Notes in Computer Science. Springer-Verlag, 2016.
-  A. Kumar and B. Raj, “Audio event detection using weakly labeled data,” in 24th ACM International Conference on Multimedia. ACM Multimedia, 2016.
-  ——, “Weakly supervised scalable audio content analysis,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016.
-  J. Chen, Y. Wang, and D. Wang, “Noise perturbation improves supervised speech separation,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 83–90.
-  N. Kanda, R. Takeda, and Y. Obuchi, “Elastic spectral distortion for low resource speech recognition with deep neural networks,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 309–314.