General audio tagging with ensembling convolutional neural network and statistical features

by   Kele Xu, et al.

Audio tagging aims to infer descriptive labels from audio clips. It is challenging due to the limited size of available data and noisy labels. In this paper, we describe our solution for the DCASE 2018 Task 2 general audio tagging challenge. The contributions of our solution are: (1) we investigate a variety of convolutional neural network architectures for the audio tagging task; (2) statistical features are applied to capture statistical patterns of the audio features and improve classification performance; (3) ensemble learning combines the outputs of the deep classifiers to exploit their complementary information; (4) a sample re-weight strategy is employed during ensemble training to address the noisy-label problem. Our system achieves a mean average precision (mAP@3) of 0.958, outperforming the baseline system's 0.704, and ranked 1st and 4th out of 558 submissions on the public and private leaderboards of the DCASE 2018 Task 2 challenge. Our code is publicly available.








1 Introduction

Audio tagging is the task of predicting the presence or absence of certain acoustic events in an audio recording, and it has drawn considerable attention in recent years. Audio tagging has wide applications, such as surveillance, monitoring, and health care [1].

Historically, audio tagging has been addressed with various handcrafted features and shallow-architecture classifiers, including Gaussian mixture models (GMMs) [2] and non-negative matrix factorization (NMF) [3]. Recently, deep learning approaches such as convolutional neural networks (CNNs) have achieved state-of-the-art performance on the audio tagging task [4, 5].

DCASE 2018 Task 2 launched a competition for general audio tagging [1] to attract research interest in the problem. However, due to the limited size of the data and noisy labels [1], general audio tagging remains a challenge and falls short in accuracy and robustness. Current general audio tagging systems face several difficulties: (1) There is a large number of event classes in [1] compared with previous audio classification tasks [3, 6]. (2) Class imbalance can make a model emphasize the classes with more training samples and struggle to learn the classes with fewer samples. (3) Data quality varies from class to class; for example, some audio clips in [1] are manually verified while others are not. Designing supervised deep learning algorithms that can learn from data sets with noisy labels is therefore an important problem, especially when the data set is small.

In this paper, we aim to build a scalable ensemble approach that takes noisy labels into account. The proposed method achieves state-of-the-art performance on the DCASE 2018 Task 2 dataset. The contributions of this paper are summarized as follows: (1) A quantitative comparison of different convolutional neural network (CNN) architectures inspired by computer vision is conducted; these CNN architectures are further deployed for ensemble learning. (2) We propose to employ statistical features, including the skewness and kurtosis of frame-wise MFCCs, to improve performance. (3) A scalable ensemble approach is used to exploit the complementary information of different deep architectures and handcrafted features. (4) A sample re-weight strategy is proposed for the ensemble learning to address the noisy-label problem in the dataset.

The paper is organized as follows: Section 2 describes the proposed CNNs, statistical features, ensemble learning, and sample re-weight methods. Section 3 presents experimental results. Section 4 concludes and discusses future work.

2 Methodology

2.1 Convolutional neural networks

CNNs have been successfully applied to many computer vision tasks [7, 8, 9]. Although many works use CNNs for audio tagging [3], few have quantitatively compared different CNN architectures on this task. In this paper, we investigate seven effective CNN architectures from computer vision on the tagging task: VGG [7], Inception [8], ResNet [9], DenseNet [10], ResNeXt [11], SE-ResNeXt [12], and DPN [13].

VGGNet [7] stacks 3×3 convolutional layers on top of each other to increase the depth of a CNN. Inception [8] applies convolution filters of different sizes within the blocks of a network, which act as a "multi-level feature extractor". ResNet [9] introduces residual modules to alleviate the vanishing-gradient problem when training very deep CNNs. DenseNet [10] consists of many dense blocks connected by transition layers to re-use earlier features. ResNeXt [11] is an improvement of ResNet, constructed by repeating a building block that aggregates a set of transformations with the same topology. The Squeeze-and-Excitation (SE) block [12] improves the representational power of a network by explicitly modeling the interdependencies between the channels of its convolutional features; deployed on ResNeXt, it is denoted SE-ResNeXt in this paper. DPN [13] inherits the benefits of both ResNet and DenseNet: it shares common features while maintaining the flexibility to explore new features through its dual-path architecture.

Note that there are two ways to train a deep model: initialize the weights from an ImageNet pre-trained model and fine-tune, or randomly initialize the weights and train the model from scratch.

2.2 Statistical features

Some statistical patterns of the audio representation, for example higher-order statistics such as skewness and kurtosis, cannot be easily learned by deep models. We show that these handcrafted statistical features provide complementary information that can be used to improve classification performance [14].

In our experiments, following the audio pre-processing described in Section 3.2, all audio samples are divided into 1.5-second clips. We compute statistical features on both the raw audio signal and the MFCC features. The statistical features include the mean, variance, variance of the derivative, skewness, and kurtosis. Skewness and kurtosis are defined as

$\mathrm{skew}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right], \qquad \mathrm{kurt}(X) = E\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right],$

where $X$ is the vector under analysis (for example, the raw signal or the MFCC of an audio segment randomly selected from the clip), $\mu$ is the mean, $\sigma$ is the standard deviation, and $E$ is the expectation operator. The statistical features are clip-wise, meaning that the statistical analysis is conducted for each clip.
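The five clip-wise statistics can be computed directly with NumPy and SciPy. The helper name `clip_statistics` is illustrative, not code from the paper:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def clip_statistics(x):
    """Clip-wise statistics of a 1-D signal (raw audio or one MFCC band).

    Returns the five features named in the text: mean, variance,
    variance of the first derivative, skewness, and kurtosis.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.array([
        x.mean(),
        x.var(),
        np.diff(x).var(),           # variance of the derivative
        skew(x),                    # E[((x - mu) / sigma)^3]
        kurtosis(x, fisher=False),  # E[((x - mu) / sigma)^4] (Pearson form)
    ])

# Example: statistics of a synthetic 1.5 s clip at 44.1 kHz
rng = np.random.default_rng(0)
clip = rng.standard_normal(int(1.5 * 44100))
feats = clip_statistics(clip)
print(feats.shape)  # (5,)
```

Note that `scipy.stats.kurtosis` returns the excess kurtosis by default; `fisher=False` gives the Pearson definition above, which is approximately 3 for Gaussian data.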

Sample statistical features are shown in Fig. 1, which plots the kurtosis, root mean square (RMS), and skewness of the audio data. As can be seen from the figure, the kurtosis values vary across categories: for example, Finger_snapping and Scissors have larger kurtosis values than other categories, while Scissors and Writing have lower RMS values. A classifier can benefit from the combination of these statistical features, so an effective way of exploiting these patterns may further boost tagging performance, as demonstrated in the next section.

Figure 1: Sample statistical features for different categories (kurtosis, RMS and skewness values).

In this part, we extract handcrafted statistical features, which are then used by the ensemble learning. Our aim here is not to extract every state-of-the-art handcrafted feature for the classification task, but to demonstrate their use within the ensemble learning framework.

2.3 Ensemble learning

Due to the limited size of the DCASE 2018 Task 2 dataset, a single model easily overfits. Ensembling different models can improve the accuracy and robustness of the classification [15] by using the complementary predictions of the individual models. However, ensemble learning has been under-explored for the audio tagging task: most previous methods simply average the predictions [15]. In this paper, we explore stacked generalization in multiple levels to improve accuracy and robustness on the audio tagging problem. The framework is computationally scalable and has been validated on multiple machine learning tasks [16]. Fig. 2 shows the proposed stacking architecture, which is composed of two levels. Level 1 consists of deep models using different CNN architectures. Level 2 is a shallow-architecture classifier using the meta-features obtained from level 1. As Fig. 2 shows, both the deep-learning-based meta-features and the handcrafted statistical features are used for the ensemble learning in level 2.

Figure 2: Framework of proposed ensemble learning approach.

We randomly split the training data into 5 folds. For the deep models, out-of-fold approaches are used to generate the out-of-fold predictions, and all deep models use the same fold-split configuration during meta-feature creation. For each CNN architecture, we train one model per out-of-fold training split, plus one model on the whole training set to predict probabilities for each sample in the validation set. For each classifier, the predicted probabilities over the 41 classes serve as meta-features; the meta-features of all classifiers are concatenated to form the new training set (see Fig. 2), which is used as the input to level 2.
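The out-of-fold meta-feature construction can be sketched with scikit-learn. This is a toy stand-in: logistic regression replaces the CNNs, and 5 classes replace the real task's 41; `out_of_fold_probas` is an illustrative helper name:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def out_of_fold_probas(model_factory, X, y, n_classes, n_splits=5, seed=0):
    """Out-of-fold class probabilities used as level-1 meta-features.

    Each sample is predicted by a model that never saw it during training,
    mimicking the 5-fold scheme described in the text.
    """
    meta = np.zeros((len(X), n_classes))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model = model_factory()                     # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        meta[val_idx] = model.predict_proba(X[val_idx])
    return meta

# Toy data: 200 clips, 16 features, 5 classes (41 in the real task)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = rng.integers(0, 5, size=200)
meta = out_of_fold_probas(lambda: LogisticRegression(max_iter=1000), X, y, n_classes=5)
print(meta.shape)  # (200, 5)
```

With several level-1 models, the per-model `meta` arrays would be concatenated column-wise to form the level-2 training set.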

For the ensemble learning in level 2, we employ the Gradient Boosting Decision Tree (GBDT) [17]. The reason is that, compared to other approaches such as linear regression and support vector machines, GBDT has provided better classification performance on several public machine learning challenges. GBDT is a tree-based gradient boosting algorithm: by continuously fitting the residuals of the training samples [18], each new tree reduces the error of the previous tree's prediction. This residual-reduction strategy greatly improves the prediction accuracy of the model.

2.4 Sample re-weight

The objective function in ensemble learning is

$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \Omega(f),$

where $l$ is the convex loss function, $\Omega$ is the regularization component (including L1 and L2 regularization), $n$ is the number of samples, $\hat{y}_i$ is the prediction for sample $i$, and $y_i$ is its label. As the amount of audio tagging data is limited, some non-verified samples are employed as training data, which may introduce noisy samples. When training a classifier, such outliers in the training set have a strong negative influence on the trained model. Indeed, designing supervised learning algorithms that can learn from data sets with noisy labels is an important problem, especially when the data set is small.

Here, we introduce a new hyper-parameter $\alpha$ to re-weight the training samples. In more detail, the sample weight of manually verified samples is set to 1.0, while the weight of non-manually-verified samples is set to a constant $\alpha < 1$. The best value of $\alpha$ is found by grid search. The final objective function for ensemble learning can thus be re-written as

$\mathrm{Obj} = \sum_{i=1}^{m+k} w_i \, l(y_i, \hat{y}_i) + \Omega(f), \qquad w_i = \begin{cases} 1.0 & \text{sample } i \text{ manually verified} \\ \alpha & \text{otherwise,} \end{cases}$

where $w_i$ is the sample-wise weight, $m$ is the number of manually verified audio clips, and $k$ is the number of non-verified audio clips.
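The weight vector is straightforward to build; `sample_weights` below is an illustrative helper, and the candidate grid for $\alpha$ follows the values reported in Table 1:

```python
import numpy as np

def sample_weights(verified, alpha):
    """w_i = 1.0 for manually verified clips, w_i = alpha (< 1) otherwise."""
    return np.where(np.asarray(verified, dtype=bool), 1.0, alpha)

# m = 3 verified clips followed by k = 2 non-verified clips, alpha = 0.6
w = sample_weights([True, True, True, False, False], alpha=0.6)
print(w)  # [1.  1.  1.  0.6 0.6]

# Grid-search candidates for alpha, as in Table 1
alphas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```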

3 Experimental Results

3.1 Datasets

The DCASE 2018 Task 2 challenge dataset was provided by Freesound [19]. It contains 18,873 audio files annotated with 41 class labels from Google's AudioSet Ontology [20]; 9,473 audio clips are used for training and 9,400 for validation (of which 1,600 are manually verified). The sound files are uncompressed PCM 16-bit, 44.1 kHz, mono audio with widely varying recording quality and techniques. The duration of the audio samples ranges from 300 ms to 30 s owing to the diversity of the sound categories; the average length is 6.7 seconds. In the training dataset, the number of audio clips per class ranges from 94 to 300.

3.2 Preprocessing

Two kinds of inputs are used to train the deep networks: log-scaled mel-spectrograms (log-mel) and Mel-frequency cepstral coefficients (MFCC) of the audio segment. From the raw signal, 1.5 s audio segments are randomly selected. For the log-mel, we use 64 mel filter banks with a frame width of 80 ms and a frame shift of 10 ms, which yields 150 frames per audio clip. The delta and delta-delta features of the log-mel are then computed with a window size of 9, and the original log-mel features are concatenated with the delta and delta-delta features to form a three-channel input [21]. The MFCC input follows a similar generation procedure, except for its size. To prevent over-fitting, we apply mixup data augmentation [22] with a ratio of 0.2 [23].
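Mixup [22] forms new training examples as random convex combinations of pairs of inputs and their one-hot labels. A minimal NumPy sketch, where the `mixup` helper is illustrative rather than the paper's exact implementation:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup augmentation [22]: convex combination of two training examples.

    alpha is the Beta-distribution parameter (0.2 in this paper);
    y1 and y2 are one-hot label vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

# Two toy log-mel "clips" (64 mel bins x 150 frames) with one-hot labels
rng = np.random.default_rng(0)
x1, x2 = rng.random((64, 150)), rng.random((64, 150))
y1, y2 = np.eye(41)[3], np.eye(41)[17]
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
print(x_mix.shape)  # (64, 150)
```

With a small alpha such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the two originals.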

3.3 Quantitative comparison between different CNN architectures

We use mean average precision (mAP@3) as the evaluation criterion; the mAP@3 performance of the CNN models is shown in Fig. 3. All 1,600 manually verified samples are used for the evaluation. Fig. 3 shows that: (1) with the same architecture, the log-mel features achieve better mAP@3 than MFCC for all CNN architectures; (2) with pre-trained weights, deeper CNN models such as ResNeXt improve the mAP@3 of the tagging task using prior knowledge extracted from visual data, and the combination of log-mel features and deeper models performs best; (3) networks pre-trained on computer vision data outperform the same models trained from scratch, which indicates that the audio dataset might not be large enough to train deep models from scratch.

Figure 3: The mAP@3 obtained using different single CNN model.

3.4 Ablation study for statistical features

To demonstrate the effectiveness of the handcrafted features in the proposed ensemble learning, we provide an ablation study. We first compute the mAP@3 of ensemble learning using only the out-of-fold predictions of the deep models, which serves as the baseline. The handcrafted features are then added under the same hyper-parameter configuration for a quantitative comparison. As can be seen from Table 1, the mAP@3 is consistently higher with the statistical features.

3.5 Ensemble learning with sample re-weight

Out-of-fold predictions from the component models are aggregated to the original file level before being fed into the level-2 model. We implement our approach with the LightGBM Python library [18]. The 'max_depth' parameter of the model is set to 3 and the learning rate to 0.03, which works well in our experiments. In addition, the feature-subsample and sample-subsample values are set to 0.7 to prevent overfitting. Table 1 shows the experimental results for different values of α: with the handcrafted features, the mAP@3 of the classifier is highest with α = 0.6.

α     with TF   without TF
0.0   0.944     0.935
0.2   0.946     0.935
0.4   0.953     0.946
0.6   0.958     0.947
0.8   0.947     0.943
1.0   0.946     0.939

Table 1: The mAP@3 of the tagging task under different ensemble-learning configurations (statistical features abbreviated as TF).
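The level-2 training described above can be sketched as follows. This uses scikit-learn's GradientBoostingClassifier as a stand-in for LightGBM (LightGBM's `LGBMClassifier` takes the same `max_depth`, `learning_rate`, and `sample_weight` arguments, with `subsample`/`colsample_bytree` for the two subsampling rates); the data, class count, and tree count here are toy values, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy meta-features for a 5-class problem (41 classes in the real task)
rng = np.random.default_rng(0)
X_meta = rng.random((300, 10))
y = rng.integers(0, 5, size=300)
verified = rng.random(300) < 0.3      # which clips are manually verified
w = np.where(verified, 1.0, 0.6)      # alpha = 0.6, the best value in Table 1

level2 = GradientBoostingClassifier(
    max_depth=3,          # 'max_depth' from the text
    learning_rate=0.03,   # learning rate from the text
    subsample=0.7,        # sample subsampling, as in the text
    max_features=0.7,     # feature subsampling, as in the text
    n_estimators=50,      # toy value for a quick run
)
level2.fit(X_meta, y, sample_weight=w)  # noisy-label re-weighting
probs = level2.predict_proba(X_meta)
print(probs.shape)  # (300, 5)
```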

4 Conclusion

In this work, we proposed a novel ensemble-learning system employing a variety of CNNs and statistical features for the general-purpose audio tagging task in DCASE 2018. (1) A comparative study of the performance of different state-of-the-art CNN architectures is presented. (2) Statistical features are investigated and demonstrated to be effective for the ensemble learning. (3) The proposed ensemble learning exploits the complementary information of the deep models and the statistical features, yielding superior classification performance. (4) A sample re-weight strategy handles the potential noisy labels of the non-verified annotations in the dataset. Our system ranked 1st and 4th out of 558 submissions on the public and private leaderboards of the DCASE 2018 Task 2 challenge. In future work, we will evaluate the performance of our method on Google AudioSet.

5 Acknowledgment

This work was supported by the National Grand R&D Plan (Grant No. 2016YFB1000101) and partially funded by the National Natural Science Foundation of China (No. 61806214).


  • [1] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
  • [2] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in European Signal Processing Conference. IEEE, 2016, pp. 1128–1132.
  • [3] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events Workshop, 2017.
  • [4] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 131–135.
  • [5] Y. Xu, Q. Huang, W. Wang, P. Foster, S. Sigtia, P. J. Jackson, and M. D. Plumbley, “Unsupervised feature learning based on deep models for environmental audio tagging,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1230–1241, 2017.
  • [6] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, “Chime-home: A dataset for sound source recognition in a domestic environment.” in WASPAA, 2015, pp. 1–5.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [8] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
  • [11] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2017, pp. 5987–5995.
  • [12] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, vol. 7, 2017.
  • [13] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4467–4475.
  • [14] E. Fonseca, R. Gong, and X. Serra, “A simple fusion of deep and shallow learning for acoustic scene classification,” arXiv preprint arXiv:1806.07506, 2018.
  • [15] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
  • [16] L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on.   IEEE, 2012, pp. 2133–2136.
  • [17] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
  • [18] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, 2017, pp. 3146–3154.
  • [19] F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in ACM international conference on Multimedia.   ACM, 2013, pp. 411–412.
  • [20] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2017, pp. 776–780.
  • [21] D. Feng, K. Xu, H. Mi, F. Liao, and Y. Zhou, “Sample dropout for audio scene classification using multi-scale dense connected convolutional neural network,” arXiv preprint arXiv:1806.04422, 2018.
  • [22] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” International Conference on Learning Representations, 2018.
  • [23] K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, and S. Liu, “Mixup-based acoustic scene classification using multi-channel convolutional neural network,” arXiv preprint arXiv:1805.07319, 2018.