Audio tagging is the task of predicting the presence or absence of certain acoustic events in an audio recording, and it has drawn considerable attention in recent years. Audio tagging has wide applications, such as surveillance, monitoring, and health care.
Historically, audio tagging has been addressed with various handcrafted features and shallow-architecture classifiers, including Gaussian mixture models (GMMs) and non-negative matrix factorization (NMF).
Recently, deep learning approaches such as convolutional neural networks (CNNs) have achieved state-of-the-art performance on the audio tagging task [4, 5].
DCASE 2018 Task 2 launched a competition for general-purpose audio tagging to attract research interest in the audio tagging problem. However, due to the limited amount of data and the noisy labels, general audio tagging remains a challenge and falls short in accuracy and robustness. Current general audio tagging systems are confronted with several challenges: (1) There is a large number of event classes compared with previous audio classification tasks [3, 6]. (2) The class-imbalance problem may cause the model to emphasize the classes with more training samples and struggle to learn from the classes with fewer samples. (3) The data quality varies from class to class; for example, some audio clips are manually verified but others are not. Designing supervised deep learning algorithms that can learn from data sets with noisy labels is an important problem, especially when the data set is small.
In this paper, we aim to build a scalable ensemble approach that takes the noisy labels into account. The proposed method achieves state-of-the-art performance on the DCASE 2018 Task 2 dataset. The contributions of this paper are summarized as follows: (1) A quantitative comparison of different convolutional neural network (CNN) architectures inspired by computer vision is conducted; these CNN architectures are further deployed for ensemble learning. (2) We propose to employ statistical features, including the skewness and kurtosis of frame-wise MFCCs, to improve the performance. (3) A scalable ensemble approach is used to exploit the complementary information of different deep architectures and handcrafted features. (4) A sample re-weighting strategy is proposed for the ensemble learning to address the noisy-label problem in the dataset.
The paper is organized as follows: Section 2 describes the proposed CNNs, statistical features, ensemble learning and sample re-weighting methods. Section 3 presents experimental results. Section 4 concludes and outlines future work.
2.1 Convolutional neural networks
CNNs have been successfully applied to many computer vision tasks [7, 8, 9]. Although many works use CNNs for audio tagging, few have investigated a quantitative comparison of different CNN architectures on the audio tagging task. In this paper, we investigate seven effective CNN architectures from computer vision on the tagging task: VGG, Inception, ResNet, DenseNet, ResNeXt, SE-ResNeXt and DPN.
VGGNet consists of stacked 3×3 convolutional layers, which increase the depth of the CNN. Inception applies convolution filters of different sizes within the blocks of a network, which act as a “multi-level feature extractor”. ResNet introduces residual modules to alleviate the vanishing-gradient problem when training very deep CNNs. DenseNet consists of many dense blocks connected by transition layers, which re-utilize features from preceding layers. ResNeXt is an improvement of ResNet, constructed by repeating a building block that aggregates a set of transformations with the same topology. By introducing the Squeeze-and-Excitation (SE) block, a network can improve its representational power by explicitly modeling the interdependencies between the channels of its convolutional features; the SE block can be deployed on ResNeXt, which is denoted as SE-ResNeXt in this paper. DPN inherits the benefits of both ResNet and DenseNet: it shares common features while maintaining the flexibility to explore new features through its dual-path architecture.
It is worth noting that there are two different ways to train a deep model: initializing the weights from an ImageNet pre-trained model and fine-tuning, or randomly initializing the weights and training the model from scratch.
2.2 Statistical features
Some statistical patterns of the audio representation cannot be easily learned by deep models, for example, higher-order statistics such as the skewness and kurtosis. We show that these handcrafted statistical features provide complementary information that can be used to improve the classification performance.
In our experiments, following the audio pre-processing described above, all audio samples are divided into 1.5-second audio clips. We compute statistical features on the raw audio signal and on the MFCC features. The statistical features include the mean, variance, variance of the derivative, skewness and kurtosis. The definitions of skewness and kurtosis are given as follows.
skewness = E[((x − μ)/σ)³], kurtosis = E[((x − μ)/σ)⁴],
where x is the vector (for example, x can be the raw signal or the MFCC of an audio segment randomly selected from the audio clip), μ is the mean, σ is the standard deviation, and E[·] is the expectation operator. The statistical features are clip-wise, meaning that the statistical analysis is conducted for each clip.
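As a concrete illustration, both moments can be computed directly from their definitions. The sketch below is a minimal, pure-Python illustration (not the pipeline code used in our experiments), applied to a plain list of samples:

```python
import math

def skewness(x):
    """Third standardized moment: E[((x - mu) / sigma)^3]."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return sum(((v - mu) / sigma) ** 3 for v in x) / n

def kurtosis(x):
    """Fourth standardized moment: E[((x - mu) / sigma)^4]."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return sum(((v - mu) / sigma) ** 4 for v in x) / n
```

In practice, library routines such as `scipy.stats.skew` and `scipy.stats.kurtosis` compute the same quantities on MFCC frames (note that SciPy reports excess kurtosis by default).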
Sample statistical features are shown in Fig. 1, which plots the kurtosis, root mean square (RMS) and skewness of the audio data. As can be seen from the figure, the kurtosis values vary across categories: for example, Finger_snapping and Scissors have larger kurtosis values than other categories, while Scissors and Writing have lower RMS values than other categories. The classifier could therefore benefit from a combination of these statistical features, and an effective approach that exploits these patterns may further boost the performance of the tagging task, as will be demonstrated in the next section.
In this part, we extract handcrafted statistical features, which will be deployed in the ensemble learning. Our aim here is not to extract all state-of-the-art handcrafted features for the classification task, but to provide a demonstration for the ensemble learning framework.
2.3 Ensemble learning
Due to the limited size of the DCASE 2018 Task 2 dataset, a single model is easily overfitted. Ensembling different models can improve the accuracy and robustness of the classification by using the complementary predictions of the individual models. However, ensemble learning has been under-explored for the audio tagging task; most previous methods simply average the predictions.
In this paper, we explore the use of stacked generalization in multiple levels to improve accuracy and robustness for the audio tagging problem. The framework is computationally scalable and has been tested on multiple machine learning tasks. Fig. 2 shows the proposed stacking architecture used in our task, which is composed of two levels. Level 1 consists of deep models using different CNN architectures. Level 2 is a shallow-architecture classifier using the meta-features obtained from level 1. As Fig. 2 shows, both the deep-learning-based meta-features and the handcrafted statistical features are used for the ensemble learning in level 2.
We randomly split the training data into 5 folds in our experiments. For the deep models, out-of-fold approaches are used to generate the out-of-fold predictions, and all deep models use the same fold-split configuration during meta-feature creation. For each CNN architecture, we train one model on each out-of-fold training set, plus one model trained on the whole training dataset to predict the probabilities for each sample in the validation set. For each classifier, the predicted probabilities for the 41 classes are used as meta-features, which are concatenated to form the new training dataset (as shown in Fig. 2); these meta-features serve as the input for level 2.
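The out-of-fold meta-feature generation described above can be sketched as follows. The fold splitting is generic, while `prior_model` is a deliberately trivial stand-in for a level-1 model (the actual level-1 models are the CNNs described in Section 2.1):

```python
import random

def out_of_fold_predictions(X, y, n_classes, fit_predict, n_folds=5, seed=0):
    """Return one meta-feature row (a class-probability vector) per training
    sample, each produced by a model that never saw that sample."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    meta = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], n_classes)
        for i, p in zip(held_out, preds):
            meta[i] = p
    return meta

def prior_model(X_tr, y_tr, X_te, n_classes):
    """Toy stand-in model: predicts the training-set class frequencies
    for every test sample."""
    counts = [y_tr.count(c) / len(y_tr) for c in range(n_classes)]
    return [counts for _ in X_te]
```

In the full system, the 41-dimensional probability rows from each CNN are concatenated column-wise to form the level-2 training matrix.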
For the ensemble learning in level 2, we employ the Gradient Boosting Decision Tree (GBDT). The reason is that, compared with other approaches such as linear regression and support vector machines, GBDT has provided better classification performance on several public machine learning challenges. GBDT is a tree-based gradient boosting algorithm: by continuously fitting the residuals of the training samples, each new tree reduces the error of the prediction made by the previous trees. This strategy of reducing the residuals greatly improves the prediction accuracy of the model.
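To make the residual-fitting idea concrete, the following minimal sketch boosts one-dimensional regression stumps on the residuals under a squared loss. It illustrates the principle only and is far simpler than the GBDT implementation used in our system:

```python
def fit_stump(x, r):
    """Best threshold split of 1-D inputs x minimizing squared error on residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant stump
        m = sum(r) / len(r)
        return (x[0], m, m)
    return best[1:]

def boost(x, y, n_rounds=20, lr=0.5):
    """Each round fits a stump to the current residuals and adds it,
    scaled by the learning rate."""
    f0 = sum(y) / len(y)
    stumps = []
    pred = [f0] * len(x)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, resid)
        stumps.append((t, lm, rm))
        pred = [pi + lr * (lm if xi <= t else rm) for xi, pi in zip(x, pred)]
    return f0, lr, stumps

def predict(model, xi):
    f0, lr, stumps = model
    return f0 + sum(lr * (lm if xi <= t else rm) for t, lm, rm in stumps)
```

Each added stump removes a fraction (controlled by the learning rate) of the remaining residual, so the training error shrinks round by round.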
2.4 Sample re-weight
The objective function in ensemble learning is: Obj = Σ_{i=1}^{N} l(y_i, ŷ_i) + Ω, where l is the convex loss function, Ω is the regularization term, including L1 and L2 regularization, N is the number of samples, ŷ_i is the prediction for sample i and y_i is its label. As the data size for audio tagging is limited, some non-verified samples are employed as training data, which may introduce noisy samples. When training a classifier, outliers in the training set have a strong negative influence on the trained model. Indeed, designing supervised learning algorithms that can learn from data sets with noisy labels is an important problem, especially when the data set is small.
Here, we propose to introduce a new hyper-parameter α to re-weight the training samples. In more detail, the sample weight of manually verified samples is set to 1.0, while the weight of non-manually-verified samples is set to a constant value α (smaller than 1). The best configuration for α can be found by grid search. Thus, the final objective function for ensemble learning can be re-written as:
Obj = Σ_{i=1}^{M+K} w_i · l(y_i, ŷ_i) + Ω,
where w_i is the sample-wise weight (w_i = 1.0 for verified clips and w_i = α otherwise), M is the number of manually verified audio clips and K is the number of non-verified audio clips.
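The re-weighted objective amounts to scaling each clip's loss term by its weight before summation, as in this minimal sketch (the α value and the toy probabilities below are illustrative only):

```python
import math

def reweighted_loss(probs, verified, alpha):
    """Weighted cross-entropy over a batch of clips.
    probs[i]    : predicted probability of the true class for clip i
    verified[i] : True if the clip's label was manually verified
    alpha       : weight (< 1) applied to non-verified clips."""
    total = 0.0
    for p, v in zip(probs, verified):
        w = 1.0 if v else alpha
        total += -w * math.log(max(p, 1e-15))  # clamp to avoid log(0)
    return total
```

In a GBDT library such as LightGBM, the same effect is obtained by passing per-sample weights when constructing the training set (e.g. the `weight` argument of `lgb.Dataset`).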
3 Experimental Results
The DCASE 2018 Task 2 challenge dataset was provided by Freesound. It contains 18,873 audio files annotated with 41 classes of labels from Google's AudioSet Ontology, of which 9,473 audio clips are used for training and 9,400 for validation (1,600 of these are manually verified). The provided sound files are uncompressed PCM 16-bit, 44.1 kHz, mono audio files with widely varying recording quality and techniques. The duration of the audio samples ranges from 300 ms to 30 s due to the diversity of the sound categories, with an average length of 6.7 seconds. In the training dataset, the number of audio clips per class ranges from 94 to 300.
Two different kinds of input are employed to train the deep networks: log-scaled mel-spectrograms (log-mel) and Mel-frequency cepstral coefficients (MFCCs) of the audio segment. From the raw signal, 1.5 s audio segments are randomly selected. For the log-mel, we use 64 mel filter banks, with a frame width of 80 ms and a frame shift of 10 ms, which results in 150 frames per audio clip. The delta and delta-delta features of the log-mel are then calculated with a window size of 9, and the original log-mel features are concatenated with the delta and delta-delta features to form the input. The MFCC features follow a similar generation procedure, except for the size. To prevent over-fitting, we apply mixup data augmentation with a ratio of 0.2.
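Mixup draws a mixing coefficient from a Beta(α, α) distribution (α = 0.2 in our setting) and forms convex combinations of two training examples and their labels. A minimal sketch, with feature vectors as plain lists and one-hot labels:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mix two training examples: lambda ~ Beta(alpha, alpha), then
    x = lam * x1 + (1 - lam) * x2, and the same convex combination is
    applied to the one-hot labels (Zhang et al., 2018)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

With a small α, the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the two originals; this is a mild regularizer rather than a drastic change to the data.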
3.3 Quantitative comparison between different CNN architectures
We use mean average precision (mAP) as the evaluation criterion; the mAP@3 performance of the CNN models is shown in Fig. 3. All 1,600 manually verified samples are used for the evaluation. Fig. 3 shows that: (1) with the same architecture, the log-mel feature achieves better mAP@3 than MFCC for all CNN architectures; (2) with a pre-trained model, deeper CNN models such as ResNeXt improve the mAP@3 for the tagging task by exploiting prior knowledge extracted from visual data, and the combination of log-mel features and deeper models provides superior performance; (3) CNN models pre-trained on computer vision data perform better than models trained from scratch, indicating that the audio dataset might not be large enough to train deep models from scratch.
3.4 Ablation study for statistical features
To demonstrate the effectiveness of the handcrafted features for the proposed ensemble learning, we provide an ablation study. We first calculate the mAP@3 of ensemble learning using only the out-of-fold predictions from the deep models, which serves as the baseline. The handcrafted features are then added for a quantitative comparison under the same hyper-parameter configuration. As can be seen from Table 1, the obtained mAP@3 is considerably higher with the statistical features.
3.5 Ensemble learning with sample re-weight
Out-of-fold predictions from the component models are aggregated to the original file level before being fed into the level-2 model. We implement our approach with the LightGBM Python library. The ‘max_depth’ parameter of the model is set to 3 and the learning rate to 0.03, which works well in our experiments. In addition, the feature-subsample and sample-subsample values are set to 0.7 to prevent overfitting. Table 1 shows the experimental results for different values of α. As can be seen from the table, with the handcrafted features, the mAP@3 of the classifier is boosted with α set to 0.6.
[Table 1: mAP@3 with and without the statistical features.]
In this work, we proposed a novel ensemble-learning system employing a variety of CNNs and statistical features for the general-purpose audio tagging task in DCASE 2018. (1) A comparative study of the performance of different state-of-the-art CNN architectures was presented. (2) Statistical features were investigated and demonstrated to be effective for the ensemble learning. (3) The proposed ensemble learning exploits the complementary information of deep models and statistical features, yielding superior classification performance. (4) A sample re-weighting strategy was employed to handle the potentially noisy labels of the non-verified annotations in the dataset. Our system ranked 1st and 4th out of 558 submissions on the public and private leaderboards of the DCASE 2018 Task 2 challenge. In future work, we will evaluate the performance of our method on Google AudioSet.
This work was supported by the National Grand R&D Plan (Grant No. 2016YFB1000101). This study was also partially funded by the National Natural Science Foundation of China (No. 61806214).
-  E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
-  A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in European Signal Processing Conference. IEEE, 2016, pp. 1128–1132.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events Workshop, 2017.
-  S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 131–135.
-  Y. Xu, Q. Huang, W. Wang, P. Foster, S. Sigtia, P. J. Jackson, and M. D. Plumbley, “Unsupervised feature learning based on deep models for environmental audio tagging,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1230–1241, 2017.
-  P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, “Chime-home: A dataset for sound source recognition in a domestic environment.” in WASPAA, 2015, pp. 1–5.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 5987–5995.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, vol. 7, 2017.
-  Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4467–4475.
-  E. Fonseca, R. Gong, and X. Serra, “A simple fusion of deep and shallow learning for acoustic scene classification,” arXiv preprint arXiv:1806.07506, 2018.
-  H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
-  L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 2133–2136.
-  J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
-  G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, 2017, pp. 3146–3154.
-  F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in ACM international conference on Multimedia. ACM, 2013, pp. 411–412.
-  J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 776–780.
-  D. Feng, K. Xu, H. Mi, F. Liao, and Y. Zhou, “Sample dropout for audio scene classification using multi-scale dense connected convolutional neural network,” arXiv preprint arXiv:1806.04422, 2018.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” International Conference on Learning Representations, 2018.
-  K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, and S. Liu, “Mixup-based acoustic scene classification using multi-channel convolutional neural network,” arXiv preprint arXiv:1805.07319, 2018.