As smart mobile devices have become widely used in recent years, huge amounts of multimedia recordings are generated and uploaded to the web every day. These recordings, such as music, field sounds, broadcast news, and television shows, contain sounds from a wide variety of sources. The demand for analyzing these sounds is increasing, e.g. for automatic audio tagging, audio segmentation and audio context classification [3, 4]. Driven by technological progress and customer needs, audio processing has been applied in different scenarios, such as urban monitoring, surveillance, health care, music retrieval, and customer video.
For environmental audio tagging, there is a large amount of audio data online, e.g. from YouTube or Freesound, which is labeled with tags. How to utilize these tags, predict them and further add new tags to related audio is a challenge. Environmental audio recordings are more complicated than pure speech or music recordings because of the multiple acoustic sources and incidental background noise, which makes the acoustic modeling more difficult. On the other hand, one acoustic event (or one tag) in environmental audio recordings might occur in several long temporal segments. A compact representation of the contextual information is therefore desirable in the feature domain.
K-means, as an unsupervised clustering method, has been widely used in audio analysis and music retrieval [6, 7]. Cai et al. replaced K-means with a spectral clustering-based scheme to segment and cluster the input stream into audio elements. Sainath et al. derived an audio segmentation method using Extended Baum-Welch (EBW) transformations for estimating the parameters of Gaussian mixtures. Shao et al. proposed using a similarity measure derived from hidden Markov models to cluster segments of audio streams. Xia et al. used Eigenmusic and AdaBoost to separate rehearsal recordings into segments, and an alignment process to organize the segments. The Gaussian mixture model (GMM), a common model, was also used as the official baseline method in DCASE 2016 for audio tagging. Recently, a Support Vector Machine (SVM) based Multiple Instance Learning (MIL) system was also presented for audio tagging and event detection. The details of the GMM and SVM methods are presented in the appendix of this paper. However, these methods cannot make good use of the contextual information or of the potential relationships among different event classes.
Deep learning has also been widely explored in feature learning [19, 20]. These works have demonstrated that data-driven learned features can perform better than expert-designed features. Four unsupervised learning algorithms, K-means clustering, restricted Boltzmann machine (RBM), Gaussian mixtures and auto-encoder, have been explored in image classification. Compared with the RBM, the auto-encoder is a non-probabilistic feature learning paradigm. For the audio tagging task, Mel-frequency Cepstral Coefficients (MFCCs) and Mel-Filter Banks (MBKs) are commonly adopted as the basic features. However, it is not clear whether they are the best choice for audio tagging.
In this paper, we propose a robust deep learning framework for the audio tagging task, focusing mainly on two parts: acoustic modeling and unsupervised feature learning. For the acoustic modeling, we investigate deep models with a shrinking structure, which can be used to reduce the model size and to accelerate the training and test process. Dropout and background noise aware training are also adopted to further improve the tagging performance in the DNN-based framework. Different loss functions and different basic features are also compared for the environmental audio tagging task.
For the feature learning, we propose a symmetric or asymmetric deep de-noising auto-encoder (sDAE or aDAE) based unsupervised method to generate a new feature from the basic features. There are two motivations. The first is that the background noise in environmental audio recordings introduces a mismatch between the training set and the test set, and the new feature learned by the DAE can mitigate the impact of this noise. The second is that a compact representation of the contextual frames is needed because only chunk-level labels are available. The proposed sDAE or aDAE can encode the contextual frames into a compact code, which can be used to train a better classifier.
The rest of the paper is organized as follows. We present our robust DNN-based framework in Section II. The proposed deep DAE-based unsupervised feature learning is presented in Section III. The data description and experimental setup are given in Section IV. We show the related results and discussions in Section V, and finally draw conclusions in Section VI. The appendix introduces the GMM and SVM based methods in detail, which are used as baselines for performance comparison in our study.
II Robust DNN-based audio tagging
A DNN is a non-linear multi-layer model for extracting robust features related to a specific classification or regression task. The objective of the audio tagging task is to perform multi-label classification on audio chunks, i.e. to assign one or more labels to each audio chunk (four seconds long in our experiments). The labels are only available for chunks, not for frames. Multiple events may occur at many particular frames.
II-A DNN-based multi-label classification
Fig. 1 shows the proposed DNN-based audio tagging framework using the shrinking structure, i.e., the hidden layer size is gradually reduced with depth. This structure has been shown to reduce the model size and the training and test time without losing classification accuracy. Furthermore, this structure can serve as a deep PCA to reduce the redundancy and background noise in the audio recordings. With the proposed framework, a large set of features of the chunk is encoded into a tag vector whose elements lie in [0, 1]. Mean squared error (MSE) and binary cross-entropy were adopted and compared as the objective function. As the labels in audio tagging are binary values, binary cross-entropy leads to faster training and better performance than MSE:

$$E_{\mathrm{mse}} = \frac{1}{N}\sum_{n=1}^{N}\left\|\hat{\mathbf{T}}_n\left(\mathbf{X}_{n-\tau}^{n+\tau}, \mathbf{W}, \mathbf{b}\right) - \mathbf{T}_n\right\|_2^2 \quad (1)$$

$$E_{\mathrm{bce}} = -\sum_{n=1}^{N}\sum_{k}\left[T_{n,k}\log\hat{T}_{n,k} + (1 - T_{n,k})\log(1 - \hat{T}_{n,k})\right] \quad (2)$$

$$\hat{\mathbf{T}}_n = \left(1 + \exp(-\mathbf{O})\right)^{-1} \quad (3)$$

where $E_{\mathrm{mse}}$ and $E_{\mathrm{bce}}$ are the mean squared error and binary cross-entropy, $\hat{\mathbf{T}}_n$ and $\mathbf{T}_n$ denote the estimated and reference tag vector at sample index $n$, respectively, with $k$ indexing the tags and $N$ representing the mini-batch size, and $\mathbf{X}_{n-\tau}^{n+\tau}$ being the input audio feature vector where the window size of context is $2\tau+1$. It should be noted that the input window should cover a large set of contextual frames of the chunk, because the reference tags are given at chunk level rather than frame level. The weight and bias parameters to be learned are denoted as $(\mathbf{W}, \mathbf{b})$. The DNN linear output is defined as $\mathbf{O}$ before the Sigmoid activation function is applied.
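The updated estimates of $\mathbf{W}^{\ell}$ and $\mathbf{b}^{\ell}$ in the $\ell$-th layer, with a learning rate $\lambda$, can be computed iteratively as follows:

$$(\mathbf{W}^{\ell}, \mathbf{b}^{\ell}) \leftarrow (\mathbf{W}^{\ell}, \mathbf{b}^{\ell}) - \lambda\frac{\partial E}{\partial(\mathbf{W}^{\ell}, \mathbf{b}^{\ell})}, \quad 1 \le \ell \le L+1 \quad (4)$$

where $E$ is the chosen objective function (MSE or binary cross-entropy), $L$ denotes the total number of hidden layers and the $(L+1)$-th layer represents the output layer.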
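To make the comparison of the two objectives concrete, the following is a minimal sketch in plain Python, not the released implementation. The function names, the toy mini-batch and the clipping constant `eps` are illustrative assumptions; only the Sigmoid output, the MSE and the binary cross-entropy come from the text.

```python
import math

def sigmoid(o):
    """Map a linear output value to a presence probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-o))

def mse(pred, ref):
    """Squared error summed over tags, averaged over the mini-batch."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
        for p_vec, t_vec in zip(pred, ref)
    ) / len(pred)

def bce(pred, ref, eps=1e-12):
    """Binary cross-entropy summed over tags and mini-batch samples."""
    total = 0.0
    for p_vec, t_vec in zip(pred, ref):
        for p, t in zip(p_vec, t_vec):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total

# Hypothetical mini-batch: 2 samples with 3 tags each (the task itself has 7 tags).
linear_out = [[2.0, -1.0, 0.5], [-3.0, 0.0, 4.0]]
reference = [[1, 0, 1], [0, 0, 1]]
predicted = [[sigmoid(o) for o in row] for row in linear_out]

print("MSE:", mse(predicted, reference))
print("BCE:", bce(predicted, reference))
```

With a Sigmoid output, the gradient of the binary cross-entropy with respect to the linear output reduces to the difference between the estimated and reference tags, which is one common explanation for its faster convergence on binary targets.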
During the learning process, the DNN can be regarded as an encoding function, and the audio tags are automatically predicted. Background noise in the audio recordings may lead to a mismatch between the training set and the test set. To address this issue, two additional methods are given below to improve the generalization capability of DNN-based audio tagging. Alternative input features, e.g., MFCC and MBK features, are also compared.
II-B Dropout for the over-fitting problem
Deep learning architectures have a natural tendency towards over-fitting, especially when there is little training data. This audio tagging task has only about four hours of training data, with an imbalanced distribution across tags, e.g., much fewer samples for event class ‘b’ than for other event classes in the DCASE 2016 audio tagging task. Dropout is a simple but effective way to alleviate this problem. In each training iteration, the feature value of every input unit and the activation of every hidden unit are randomly removed with a predefined probability $\rho$ (e.g., $\rho = 0.2$ for the hidden units in our experiments). These random perturbations effectively prevent the DNN from learning spurious dependencies. At the decoding stage, the DNN scales all of the weights involved in the dropout training by $(1-\rho)$, which can be regarded as a model averaging process.
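The two stages of dropout described above can be sketched as follows. This is an illustrative plain-Python example under stated assumptions: the helper names, the toy activations and the toy weight matrix are hypothetical, and a real system would apply the same idea to whole layers inside the training loop.

```python
import random

def dropout_train(activations, drop_prob, rng):
    """Training stage: zero each unit independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

def scale_weights_for_test(weights, drop_prob):
    """Decoding stage: keep every unit, but scale the weights by (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [[w * keep for w in row] for row in weights]

rng = random.Random(0)                      # fixed seed for repeatability
hidden = [0.3, 1.2, 0.0, 2.5, 0.7]          # hypothetical hidden activations
dropped = dropout_train(hidden, drop_prob=0.2, rng=rng)

weights = [[0.5, -1.0], [2.0, 0.25]]        # hypothetical weight matrix
test_weights = scale_weights_for_test(weights, drop_prob=0.2)
print(dropped)
print(test_weights)
```

Scaling at test time keeps the expected input to each unit equal to what it saw during training, which is why the procedure can be read as averaging over the randomly thinned networks.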
II-C Background noise aware training
Different types of background noise in different recording environments could lead to a mismatch between the testing chunks and the training chunks. To alleviate this, we propose a simple background noise aware training (or adaptation) method. To enable this noise awareness, the DNN is fed with the primary audio features augmented with an estimate of the background noise. In this way, the DNN can use additional on-line information about the background noise to better predict the expected tags. The background noise is estimated as follows:

$$\mathbf{V}_n = \left[\mathbf{Y}_{n-\tau}, \ldots, \mathbf{Y}_{n-1}, \mathbf{Y}_n, \mathbf{Y}_{n+1}, \ldots, \mathbf{Y}_{n+\tau}, \hat{\mathbf{Z}}_n\right] \quad (5)$$

$$\hat{\mathbf{Z}}_n = \frac{1}{T}\sum_{t=1}^{T}\mathbf{Y}_t \quad (6)$$

where $\mathbf{Y}_t$ is the audio feature vector of frame $t$, $\mathbf{V}_n$ is the noise-aware DNN input, and the background noise $\hat{\mathbf{Z}}_n$ is fixed over the utterance and estimated using the first $T$ frames. Although this noise estimator is simple, a similar idea was shown to be effective in DNN-based speech enhancement [24, 22].
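As an illustration of how such a noise-aware input could be assembled, the sketch below averages the first few frames of a chunk and appends the result to a context window. It is a simplified example in plain Python: the function names and the toy feature values are assumptions, and frames are indexed from zero here.

```python
def noise_estimate(frames, num_noise_frames):
    """Average the first num_noise_frames feature vectors of the chunk."""
    head = frames[:num_noise_frames]
    dim = len(frames[0])
    return [sum(f[d] for f in head) / len(head) for d in range(dim)]

def noise_aware_input(frames, center, half_context, num_noise_frames):
    """Concatenate the context window around `center` with the noise estimate."""
    window = frames[center - half_context:center + half_context + 1]
    flat = [value for frame in window for value in frame]
    return flat + noise_estimate(frames, num_noise_frames)

# Toy chunk: 20 frames of 3-dimensional features (hypothetical values).
frames = [[float(t), float(t) * 0.5, 1.0] for t in range(20)]
vec = noise_aware_input(frames, center=10, half_context=2, num_noise_frames=6)
print(len(vec), vec[-3:])
```

Because the estimate is computed once per chunk, every context window taken from the same chunk carries the same appended noise vector.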
II-D Alternative input features for audio tagging
Mel-frequency Cepstral Coefficients (MFCCs) have been used in environmental sound source classification [29, 30]. However, some previous work [31, 32] showed that MFCCs are not the best choice because they are sensitive to background noise. Mel-filter bank (MBK) features have already been demonstrated to be better than MFCCs for speech recognition in the DNN framework. However, it is not clear whether this is also the case for the audio tagging task using DNN models. Recent studies in audio classification have also shown that accuracy can be boosted by using features that are learned in an unsupervised manner, with examples in the areas of bioacoustics and music. We will study the potential of such methods for audio tagging and present a DAE-based feature learning method in the following section.
III Proposed deep asymmetric DAE based unsupervised feature learning
MFCCs and MBKs are used as the basic features for training the DNN-based predictor in this work. MFCCs and MBKs are well-designed features derived by experts based on the human auditory perception mechanism. Recently, more supervised and unsupervised feature learning works have demonstrated that data-driven learned features can offer better performance than expert-designed features. The neural network based bottleneck feature in speech recognition is one such type of feature, extracted from the middle layer of a DNN classifier. Compared with the basic features, it yields a significant improvement when fed into a subsequent GMM-HMM (Hidden Markov Model) system. However, for the audio tagging task, the tags are weakly labeled and, being obtained through a voting scheme among multiple annotators, are not always accurate. Furthermore, there are many related audio files without labels on the web. Hence, to use these unlabeled data, we propose a DAE based unsupervised feature learning method.
Specifically, for the environmental audio tagging task, disordered background noise exists in the recordings, which may lead to a mismatch between the training set and the test set. The DAE-based method can mitigate the effect of background noise and focus on more meaningful acoustic event patterns. Another motivation is that a compact representation of the contextual frames is needed, since the labels are given at chunk level rather than frame level.
An unsupervised feature learning algorithm is used to discover features from the unlabeled data. For this purpose, the algorithm takes the dataset as input and outputs a new feature vector. Four unsupervised learning algorithms, K-means clustering, restricted Boltzmann machine (RBM), Gaussian mixtures and auto-encoder, have been explored in image classification. Among them, the RBM and the auto-encoder are widely adopted to obtain new features or to pre-train a deep model. Compared with the RBM, the auto-encoder is a non-probabilistic feature learning paradigm. The auto-encoder explicitly defines a feature extracting function, called the encoder, in a specific parameterized closed form. It also has another closed-form parameterized function, called the decoder. The encoder and decoder functions are denoted as $f_{\theta}$ and $g_{\theta'}$, respectively. To force the hidden layers to discover more robust features, the de-noising auto-encoder introduces a stochastic corruption process applied to the input layer. Dropout is used here to corrupt the input units. Compared with the plain auto-encoder, the DAE can discover more robust features and is prevented from simply learning the identity function. Fig. 2 shows a typical DAE structure with one hidden layer, consisting of an encoder and a decoder. The encoder generates a new feature vector $\mathbf{h}$ from an input $\mathbf{x}$. It is defined as

$$\mathbf{h} = f_{\theta}(\tilde{\mathbf{x}}) = s(\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b}) \quad (7)$$

where $\mathbf{h}$ is the new feature vector (also called the new representation or code) of the input data $\mathbf{x}$ with the corrupted version $\tilde{\mathbf{x}}$, and $s$ is the non-linear activation function. $\mathbf{W}$ and $\mathbf{b}$ denote the weights and bias of the encoder, respectively. The decoder, on the other hand, maps the new representation back to the original feature space, producing a reconstruction $\hat{\mathbf{x}}$:

$$\hat{\mathbf{x}} = g_{\theta'}(\mathbf{h}) = s'(\mathbf{W}'\mathbf{h} + \mathbf{b}') \quad (8)$$

where $\hat{\mathbf{x}}$ is the reconstructed feature, which approximates the input feature, and $s'$ is the non-linear activation function of the decoder. $\mathbf{W}'$ and $\mathbf{b}'$ denote the weights and bias of the decoder. Here $\mathbf{W}$ and $\mathbf{W}'$ are not tied, namely $\mathbf{W}' \neq \mathbf{W}^{\top}$. The parameter sets of the auto-encoder, $\theta = \{\mathbf{W}, \mathbf{b}\}$ and $\theta' = \{\mathbf{W}', \mathbf{b}'\}$, are updated to incur the lowest reconstruction error $L(\mathbf{x}, \hat{\mathbf{x}})$, which is a measure of the distance between the input $\mathbf{x}$ and the output $\hat{\mathbf{x}}$. The general loss function for the de-noising auto-encoder training can be defined as

$$J_{\mathrm{DAE}}(\theta, \theta') = \sum_{\mathbf{x} \in D} L\left(\mathbf{x}, g_{\theta'}(f_{\theta}(\tilde{\mathbf{x}}))\right) \quad (9)$$

where $D$ denotes the training set.
Furthermore, DAEs can be stacked to obtain a deep DAE. The DAE can be viewed as an advanced PCA with non-linear activation functions.
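The encoder, the decoder and the corruption step of a one-hidden-layer DAE can be sketched in a few lines of plain Python. This is only an illustrative forward pass under stated assumptions: the layer sizes, the random initialization range, the ReLU code layer with a linear decoder output, and all function names are choices made for the example, and no training is performed.

```python
import random

def corrupt(x, drop_prob, rng):
    """Stochastic corruption: zero each input unit with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

def affine(weights, bias, vec):
    """Compute W * vec + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def dae_forward(x, enc_w, enc_b, dec_w, dec_b, drop_prob, rng):
    """Corrupt the input, encode it into a code, and decode a reconstruction."""
    x_tilde = corrupt(x, drop_prob, rng)
    code = relu(affine(enc_w, enc_b, x_tilde))   # encoder
    recon = affine(dec_w, dec_b, code)           # decoder (linear output, an assumption)
    return code, recon

def reconstruction_error(x, recon):
    """Squared distance between the clean input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, recon))

rng = random.Random(0)
x = [0.5, -1.0, 2.0, 0.1]                       # hypothetical 4-dimensional input
# Untied weights: the decoder matrix is initialized independently of the encoder.
enc_w = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(2)]
dec_w = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
code, recon = dae_forward(x, enc_w, [0.0] * 2, dec_w, [0.0] * 4, 0.1, rng)
print(len(code), len(recon), reconstruction_error(x, recon))
```

Note that the error is measured against the clean input rather than the corrupted one, which is what forces the code to capture structure that survives the corruption.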
Fig. 3 presents the framework of deep asymmetric DAE (aDAE) based unsupervised feature learning for audio tagging. It is a deep DAE built by stacking simple DAEs with random initialization. To utilize the contextual information, multi-frame MBK features are fed into the deep DAE. A typical DAE has a symmetric structure (sDAE), with an output of the same size as the input. Here, however, the deep DAE is designed to predict only the middle frame feature, because more predictions in the output require more memory in the bottleneck layer. In our practice, the deep DAE generated a larger reconstruction error when multi-frame features were used as the output with a narrow bottleneck layer. This leads to an inaccurate representation of the original feature in the new space. With only the middle frame features in the output, the reconstruction error is smaller. Fig. 4 compares the reconstruction error of the aDAE and the sDAE for the example shown in Section IV. The performance difference between the deep aDAE and the deep sDAE is shown later in Section V. The default size of the bottleneck code is 50 and 200 for aDAE and sDAE, respectively. For sDAE, there is a trade-off when setting the bottleneck code size: it should be small enough to avoid a high input dimension for the back-end DNN classifier, yet large enough to reconstruct the multiple-frame output. Typically, the weights of the encoder and the decoder are tied. Here we set them to be untied to retain more contextual information in the bottleneck codes. More specifically, the number of input frames in the DAE input layer is chosen as seven because a 91-frame expansion will be used in the back-end DNN classifier. In addition, a larger frame expansion in the DAE is more difficult to encode into a fixed bottleneck code.
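The difference between the asymmetric and the symmetric setup lies only in the training target. The sketch below builds (input, target) pairs for both cases from a sequence of frames. It is a plain-Python illustration: the function name and the toy frames are assumptions, while the seven-frame context and the 40-dimensional MBK features follow the experimental setup.

```python
def make_pairs(frames, context=7, asymmetric=True):
    """Build (input, target) pairs for DAE training.

    The input is always the flattened context window. The target is the
    middle frame for the asymmetric DAE and the whole window for the
    symmetric DAE.
    """
    half = context // 2
    pairs = []
    for n in range(half, len(frames) - half):
        window = frames[n - half:n + half + 1]
        flat = [v for frame in window for v in frame]
        target = list(frames[n]) if asymmetric else flat
        pairs.append((flat, target))
    return pairs

# Toy sequence: 10 frames of 40-dimensional features (hypothetical values).
frames = [[float(n)] * 40 for n in range(10)]
adae_pairs = make_pairs(frames, context=7, asymmetric=True)
sdae_pairs = make_pairs(frames, context=7, asymmetric=False)
print(len(adae_pairs[0][0]), len(adae_pairs[0][1]), len(sdae_pairs[0][1]))
```

For seven 40-dimensional frames, the aDAE maps 280 inputs to 40 outputs, whereas the sDAE has to reproduce all 280 values, which is why it needs a wider bottleneck.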
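As the output of the DAE is a real-valued feature, MSE was adopted as the objective function to fine-tune the whole deep DAE model. A stochastic gradient descent algorithm is performed in mini-batches over multiple epochs to improve learning convergence, using the objective

$$E_{\mathrm{mse}} = \frac{1}{N}\sum_{n=1}^{N}\left\|\hat{\mathbf{X}}_n\left(\mathbf{X}_{n-\tau}^{n+\tau}, \mathbf{W}, \mathbf{b}\right) - \mathbf{X}_n\right\|_2^2 \quad (10)$$

where $E_{\mathrm{mse}}$ is the mean squared error, $\hat{\mathbf{X}}_n$ and $\mathbf{X}_n$ denote the reconstructed and input feature vector at sample index $n$, respectively, with $N$ representing the mini-batch size, and $\mathbf{X}_{n-\tau}^{n+\tau}$ being the input audio feature vector where the window size is $2\tau+1$. $(\mathbf{W}, \mathbf{b})$ denote the weight and bias parameters to be learned.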
The activation function of the bottleneck layer is another key point in the proposed deep DAE based unsupervised feature learning framework. Sigmoid is not suitable as the activation function of the code layer, since it compresses the values of the new feature into the range [0, 1], which reduces its representation capability. Hence, a Linear or ReLU activation function is a more suitable choice. In previous work on images, the units of the bottleneck layer (code layer) of the deep auto-encoder were linear, and a perfect reconstruction of the image could be obtained. In this work, both ReLU and Linear activation functions are evaluated for the bottleneck layer when reconstructing the audio features in the deep auto-encoder framework. Note that the units of all other layers also adopt ReLU as the activation function.
In summary, the new feature derived from the bottleneck layer of the deep auto-encoder can be regarded as an optimized feature for three reasons. First, the DAE-learned feature is generated from contextual input frames as a new compact representation. This kind of feature captures temporal structure better than the original feature. Second, the deep DAE based unsupervised feature learning can smooth the disordered background noise in the audio recordings and thereby alleviate the mismatch between the training set and the test set. Finally, with this framework, a large amount of unlabeled data can be utilized and more statistical knowledge of the feature space can be learned.
IV Data description and experimental setup
IV-A DCASE2016 data set for audio tagging
The data that we used for evaluation is the dataset of Task 4 of the DCASE 2016 challenge, which is built on the CHiME-home dataset. The audio recordings were made in a domestic environment. Prominent sound sources in the acoustic environment are two adults and two children, television and electronic gadgets, kitchen appliances, footsteps and knocks produced by human activity, in addition to sound originating from outside the house. The audio data are provided as 4-second chunks at two sampling rates (48 kHz and 16 kHz), with the 48 kHz data in stereo and the 16 kHz data in mono. The 16 kHz recordings were obtained by downsampling the right channel of the 48 kHz recordings. Note that Task 4 of the DCASE 2016 challenge is based on using only the 16 kHz recordings.
For each chunk, multi-label annotations were first obtained from each of 3 annotators. There are 4378 such chunks available, referred to as CHiME-Home-raw; discrepancies between annotators are resolved by conducting a majority vote for each label. The annotations are based on a set of 7 label classes as shown in Table I. A detailed description of the annotation procedure is provided in the description of the CHiME-home dataset.
Table I: Labels used in annotations.
|Label||Description|
|f||Adult female speech|
|m||Adult male speech|
|p||Percussive sounds, e.g. crash, bang, knock, footsteps|
|o||Other identifiable sounds|
To reduce uncertainty about annotations, evaluations are based on considering only those chunks where 2 or more annotators agreed about label presence across label classes. There are 1946 such chunks available, referred to as CHiME-Home-refined. Another 816 refined chunks are kept for the final evaluation set of Task 4 of the DCASE 2016 challenge.
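The per-label majority vote used to merge the annotations can be illustrated as follows. This is a hypothetical sketch in plain Python rather than the dataset's own tooling: it assumes that each annotator's labels for a chunk are stored as a string of single-letter tags, and the function name is chosen for the example.

```python
def majority_vote(annotations):
    """Keep each label that more than half of the annotators marked as present.

    annotations: one string of tag letters per annotator, e.g. ["cm", "c", "cmo"].
    """
    needed = len(annotations) // 2 + 1
    labels = sorted(set("".join(annotations)))
    return "".join(
        label for label in labels
        if sum(label in a for a in annotations) >= needed
    )

# Three annotators: all mark 'c', two mark 'm', only one marks 'o'.
print(majority_vote(["cm", "c", "cmo"]))
```

With three annotators, a label survives the vote when at least two of them marked it, so isolated labels such as the single 'o' above are discarded.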
IV-B Experimental setup
In our experiments, following the original specification of Task 4 of the DCASE 2016 challenge, we use the same five folds of the given development dataset: each fold is used in turn for testing, and the remaining audio recordings are used for training. Table II lists the number of chunks of training and test data used for each fold and also for the final evaluation setup.
To keep the same feature configurations as in the DCASE 2016 baseline system, we pre-process each audio chunk by segmenting it using a 20 ms sliding window with a 10 ms hop size, and converting each segment into 24-dimension MFCCs and 40-dimension MBKs. For each 4-second chunk, 399 frames of MFCCs are obtained. A large frame expansion is used as the input of the DNN. The impact of the number of expanded frames on the performance will be evaluated in the following experiments. Hence the input of the DNN consists of the expanded frames plus the appended background noise vector. All of the input features are normalized to zero mean and unit variance. A first hidden layer with 1000 units and a second with 500 units were used to construct the shrinking structure; 1000 or 500 hidden units are a common choice in DNNs. Seven sigmoid outputs were adopted to predict the seven tags. The learning rate was 0.005. The momentum was set to 0.9. The dropout rates for the input layer and the hidden layers were 0.1 and 0.2, respectively. The mini-batch size was 100. The number of frames $T$ used for the noise estimate in Equation (6) was 6. In addition to the CHiME-Home-refined set with 1946 chunks, the remaining 2432 chunks in the CHiME-Home-raw set without ‘strong agreement’ labels in the development dataset were also added to the DNN training data, considering that the DNN has a better fault-tolerant capability. These 2432 chunks without ‘strong agreement’ labels were also added to the training data for GMM and SVM training.

The deep aDAE or deep sDAE has 5 layers with 3 hidden layers. For aDAE, the input is 7-frame MBKs, and the output is the middle frame MBK. The first and third hidden layers both have 500 hidden units, while the middle layer is the bottleneck layer with 50 units. For sDAE, the output is 7-frame MBKs, and the middle layer is the bottleneck layer with 200 units. The dropout level for the aDAE or sDAE is set to 0.1. The final DAE models are taken at epoch 100.
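The frame count and the DNN input size quoted above follow from simple arithmetic, which the sketch below reproduces. The helper names are illustrative, and the input-size formula assumes that the noise vector has the same dimension as one feature frame and that the 91-frame expansion is combined with 40-dimension MBKs.

```python
def num_frames(duration_ms, win_ms, hop_ms):
    """Number of full analysis windows that fit into a chunk."""
    return (duration_ms - win_ms) // hop_ms + 1

def dnn_input_size(context_frames, feat_dim):
    """Expanded context frames plus one appended background noise vector."""
    return context_frames * feat_dim + feat_dim

# 4-second chunk, 20 ms window, 10 ms hop.
print(num_frames(4000, 20, 10))
# 91-frame expansion of 40-dimension MBKs plus a 40-dimension noise estimate.
print(dnn_input_size(91, 40))
```

Under these assumptions the MBK-based DNN receives 3680 input values per example, which is then reduced by the 1000-unit and 500-unit hidden layers.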
For performance evaluation, we use equal error rate (EER) as the main metric, which is also suggested by the DCASE 2016 audio tagging challenge. EER is defined as the point on the graph of false negative (FN) rate versus false positive (FP) rate at which the two rates are equal. The number of true positives is denoted as TP. EERs are computed individually for each evaluation fold, and we then average the obtained EERs across the five folds to get the final performance. Precision, Recall and F-score are also adopted to evaluate the performance among different systems.
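For readers unfamiliar with the metric, the following sketch approximates the EER of one tag by sweeping a decision threshold over the predicted scores. It is a simple discrete approximation written for illustration, not the official DCASE evaluation code; the function name and the toy scores are assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping the threshold over the observed scores.

    scores: predicted presence probabilities for one tag.
    labels: reference values (1 = tag present, 0 = absent).
    Both classes must be present, otherwise the rates are undefined.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_gap, eer = None, None
    for thr in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        fpr, fnr = fp / neg, fn / pos
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer

print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # separable case
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # one swapped pair
```

With few samples the two rates rarely coincide exactly, so the sketch reports their midpoint at the threshold where they are closest; interpolating between neighbouring thresholds is a common refinement.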
All the source code for this paper and the pre-trained models can be downloaded from GitHub (https://github.com/yongxuUSTC/aDAE_DNN_audio_tagging). The code for the SVM and GMM baselines is also available at the same address.
IV-C Compared methods
For comparison, we also ran two baselines using the GMMs and the SVMs described in the Appendix. For the GMM-based method, the number of mixture components is 8, which is the default configuration of the DCASE 2016 challenge. The sliding window and hop size are the same for the two baselines and our proposed methods. Additionally, we also evaluate the SVM-based method with chunk-level features. The mean and covariance of the MFCCs over the duration of the chunk describe the maximum likelihood Gaussian. Hence these statistics can be unwrapped into a vector and used as a chunk-level feature to train the SVM. To handle audio tagging with the SVM, each audio recording is viewed as a bag. To accelerate computation, we use a linear kernel function in our experiments.
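A chunk-level feature of this kind can be sketched as follows. The example computes the maximum likelihood mean and covariance of the frames and unwraps them into one vector. It is an illustrative plain-Python version: the function name is hypothetical, and keeping only the upper triangle of the symmetric covariance matrix is an assumption made to avoid duplicated entries.

```python
def chunk_level_feature(frames):
    """Unwrap the ML mean and covariance of the frames into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Maximum likelihood covariance: divide by n rather than n - 1.
    cov = [
        [sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in frames) / n
         for j in range(dim)]
        for i in range(dim)
    ]
    upper = [cov[i][j] for i in range(dim) for j in range(i, dim)]
    return mean + upper

# Toy chunk with two 2-dimensional frames (hypothetical values).
print(chunk_level_feature([[1.0, 2.0], [3.0, 6.0]]))
```

For 24-dimension MFCCs this layout gives 24 mean values plus 300 covariance entries per chunk, independent of the number of frames.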
We also compared our methods with the state-of-the-art methods. Lidy-CQT-CNN, Cakir-MFCC-CNN and Yun-MFCC-GMM won the first, second and third prizes in the audio tagging task of the DCASE2016 challenge. The former two methods used convolutional neural networks (CNN) as the classifier. Yun-MFCC-GMM adopted a discriminative training method on GMMs.
V Results and discussions
In this section, the overall evaluations on the development set and the evaluation set of the DCASE 2016 audio tagging task will be firstly presented. Then several evaluations on the parameters of the models will be given.
Table III: EER comparisons on the development set and the evaluation set of the DCASE 2016 audio tagging challenge.
|GMM (DCASE baseline)||0.074||0.225||0.289||0.269||0.290||0.248||0.050||0.206|
|Lidy-CQT-CNN *||-||-||-||-||-||-||-||-|
|GMM (DCASE baseline)||0.117||0.191||0.314||0.326||0.249||0.212||0.056||0.209|
|* Lidy-CQT-CNN did not measure the EER results on the development set.|
V-A Overall evaluations
Table III shows the EER comparisons on seven labels among the proposed aDAE-DNN, sDAE-DNN, the DNN baseline trained on MBKs, the DNN baseline trained on MFCCs, Yun-MFCC-GMM, Cakir-MFCC-CNN, Lidy-CQT-CNN, the SVM trained on chunks, the SVM trained on frames and the GMM methods, evaluated on the development set and the evaluation set of the DCASE 2016 audio tagging challenge. On the development set, it is clear that the proposed DNN-based approaches outperform the SVM and GMM baselines across the five-fold evaluations. The GMM is better than the SVM methods. The SVM performs worse on the audio event ‘b’, for which the imbalanced development set contains fewer training samples than for other audio events. In contrast, the GMM and DNN methods perform better on the audio event ‘b’, with lower EERs. The chunk-level SVM is inferior to the frame-level SVM. This is because audio tagging is a multi-label rather than a single-label classification task, and the statistical mean value in the chunk-level SVM makes the feature indistinct among different labels in the same chunk. Compared with the DNN methods, the SVM and GMM are less effective in utilizing the contextual information and the potential relationships among different tags. Note that the DNN models here are all trained using the binary cross-entropy defined in Eq. (2) as the loss function. Binary cross-entropy is found to be better than the mean squared error for training the audio tagging models, as will be shown in a following subsection. The DNN trained on MBKs outperforms the DNN trained on MFCCs, reducing the EER from 0.151 to 0.135, especially on the percussive sounds (‘p’), e.g. crash, bang, knock and footsteps. This result is consistent with the observations in speech recognition using DNNs. Compared with MBKs, MFCCs lose some information in the discrete cosine transformation (DCT) step. The bottleneck code size for aDAE-DNN and sDAE-DNN here is 50 and 200, respectively.
It is found that aDAE-DNN can reduce the EER from 0.157 to 0.148 compared with the MBK-DNN baseline, especially on tag ‘c’ and tag ‘o’. The aDAE-DNN method is slightly better than the sDAE-DNN because the sDAE-DNN needs a larger bottleneck code to reconstruct the seven-frame output, and this large bottleneck code makes the input dimension of the back-end DNN classifier very high. Lidy-CQT-CNN did not measure the EER on the development set. Our proposed DNN methods perform better than Cakir-MFCC-CNN and Yun-MFCC-GMM.
Fig. 5 shows the box-and-whisker plot of EERs for the GMM baseline, the MBK-DNN baseline and the aDAE-DNN method across the five standard folds on the development set of the DCASE 2016 audio tagging challenge. It can be seen that the aDAE-DNN is consistently better than the MBK-DNN baseline. To test the statistical significance of the difference between the aDAE-DNN and the MBK-DNN baseline, we use the paired-sample t-test tool in MATLAB. The audio tagging task of the DCASE2016 challenge has five standard folds with seven tags. Hence, a 35-dimension vector can be obtained for each method, and the paired-sample t-test can then be used to calculate the p-value. The t-test rejects the null hypothesis at the 1% significance level, i.e. the p-value is below 0.01, which indicates that the improvement is statistically significant. Finally, our proposed method achieves a 38.9% relative EER reduction compared with the GMM baseline of the DCASE 2016 audio tagging challenge on the development set.
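The paired-sample t-test itself is straightforward to reproduce outside MATLAB. The sketch below computes the t statistic for two matched lists of EERs using only the Python standard library. The EER values shown are made up for illustration and are not results from this paper, and the critical value in the comment is the approximate two-sided 1% threshold for 34 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of the paired-sample t-test for two matched lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n))

# Hypothetical per-fold, per-tag EERs for two systems (not the paper's results).
baseline = [0.21, 0.15, 0.18, 0.30, 0.12, 0.09, 0.25]
proposed = [0.19, 0.14, 0.15, 0.28, 0.12, 0.07, 0.22]
t_value = paired_t_statistic(baseline, proposed)
print(t_value)

# With 35 paired values (34 degrees of freedom), |t| above roughly 2.73
# corresponds to rejecting the null hypothesis at the 1% level (two-sided).
```

In the setting described above, each list would hold the 35 EERs of one method (five folds times seven tags), paired by fold and tag.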
Table III also presents EER comparisons on the evaluation set. Note that the final evaluation set was not used for any training, which means that sDAE and aDAE did not use it in training either. It can be seen that our proposed aDAE-DNN achieves state-of-the-art performance. Our MBK-DNN is a strong baseline thanks to several techniques, e.g., dropout, background noise aware training, the shrinking structure and binary cross-entropy. The proposed aDAE-DNN obtains a 5.7% relative improvement compared with the MBK-DNN baseline. sDAE-DNN did not show improvement over the MBK-DNN because sDAE-DNN with a bottleneck code size of 200 cannot reconstruct the unseen evaluation set well, whereas the aDAE-DNN with a bottleneck code size of 50 can. Finally, our proposed methods achieve state-of-the-art performance with 0.148 EER on the evaluation set of the DCASE 2016 audio tagging challenge. Another interesting result is that Yun-MFCC-GMM performs well on tag ‘c’ and tag ‘f’, where high pitch information exists. It would be interesting to fuse their prediction posteriors with ours in future work.
To give a further comparison between the MBK-DNN baseline and the aDAE-DNN method, Table IV shows precision, recall and F-score comparisons evaluated for the seven tags on the final evaluation set of the DCASE2016 audio tagging task. As the DNN prediction lies in [0, 1], a threshold of 0.4 is set to judge whether it is a hit or not. The aDAE-DNN performs better than the MBK-DNN baseline on most of the three measures. One interesting result is that the DNN method can get a quite high score on tag ‘b’ although there are only a few samples of it in the training set.
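The thresholded measures can be computed per tag as in the sketch below. It is a plain-Python illustration: the function name and the toy scores are assumptions, and treating a score equal to the threshold as a hit is also an assumption, since the text only states the threshold value.

```python
def precision_recall_f(scores, labels, threshold=0.4):
    """Precision, recall and F-score for one tag after thresholding the scores."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Hypothetical DNN outputs and reference labels for one tag.
print(precision_recall_f([0.9, 0.5, 0.3, 0.45], [1, 0, 1, 1]))
```

Unlike the EER, these measures depend on the chosen operating point, so they complement rather than replace the threshold-free comparison.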
Fig. 6 shows the spectrograms of the original MBKs and of the MBKs reconstructed by the deep sDAE and the deep aDAE. Both can reconstruct the original MBKs well, while the sDAE gives a smoother reconstruction. There is background noise in the original MBKs, which leads to the mismatch problem mentioned earlier. The sDAE can reduce the background noise shown in the dashed ellipses well, at the risk of losing important spectral information. The aDAE, in contrast, offers a trade-off between background noise smoothing and signal reconstruction. In addition, unlike in the typical setting, the weights of the encoder and decoder in the deep aDAE and the deep sDAE are not tied. In this way, more contextual information is encoded into the bottleneck layer to obtain a compact representation, which is helpful for the audio tagging task given that the reference labels are at chunk level.
V-B Evaluations for the number of contextual frames in the input of the DNN classifier
The reference label information for this audio tagging task is on the utterance-level rather than the frame-level, and the occurring orders and frequencies of the tags are unknown. Hence, it is important to use a large set of the contextual frames in the input of the DNN classifier. However, the dimension of the input layer of the DNN classifier will be too high and the number of training samples would be reduced if the number of the frame expansion is too large. Larger input size will increase the complexity of the DNN model and as a result, some information could be lost during the feed-forward process considering that the hidden unit size is fixed to be 1000 or 500. Fewer training samples will make the training process of DNN unstable considering that the parameters are updated using a stochastic gradient descent algorithm performed in mini-batches.
Fig. 7 shows the EERs for Fold 1 evaluated using different numbers of contextual frames in the input of the DNN classifier. Here the MBKs are used as the input features. Using the 91-frame MBKs as the input gives the lowest EER. As mentioned in the experimental setup, the window size of each frame is 20 ms with a 50% hop size, so a 91-frame expansion means that the input length is about one second. The length of the whole chunk is 4 seconds. This indicates that most of the tags overlap with each other within a chunk. Meanwhile, a 91-frame expansion in the input layer of the DNN is a good trade-off among the contextual information, the input size, and the total number of training samples.
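The arithmetic behind the "about one second" figure, and the way a longer context reduces the number of training samples per chunk, can be checked with a short sketch. The function names are illustrative, and the one-frame shift between consecutive context windows is an assumption that is not stated in this section.

```python
def frames_in_chunk(chunk_ms, window_ms=20, hop_ms=10):
    """Number of full analysis windows that fit in a chunk."""
    return (chunk_ms - window_ms) // hop_ms + 1


def context_duration_ms(num_frames, window_ms=20, hop_ms=10):
    """Time span covered by num_frames overlapping frames."""
    return (num_frames - 1) * hop_ms + window_ms


def num_context_windows(total_frames, context_frames, shift=1):
    """Training samples per chunk (assumption: windows slide by `shift` frames)."""
    if total_frames < context_frames:
        return 0
    return (total_frames - context_frames) // shift + 1


total = frames_in_chunk(4000)          # 4-second chunk, 20 ms window, 50% hop
print(total)                           # 399 frames
print(context_duration_ms(91))         # 920 ms, i.e. about one second
print(num_context_windows(total, 91))  # 309 samples per chunk
print(num_context_windows(total, 199)) # 201 samples per chunk
```

Under these assumptions, roughly doubling the context from 91 to 199 frames cuts the number of samples per chunk by about a third while more than doubling the input dimension.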
V-C Evaluations for different kinds of input features and different types of loss functions
Fig. 8 shows EERs on Fold 1 evaluated using different features, namely MFCCs and MBKs, and different loss functions, namely mean squared error (MSE) and binary cross entropy (BCE). MBKs perform better than MFCCs, as MBKs contain more spectral information than MFCCs. BCE is superior to MSE given that the value of each label is binary, either zero or one. MSE is better suited to fitting real-valued targets.
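One common way to see why BCE suits binary targets is to compare how the two losses penalize a confident but wrong prediction. The sketch below is only an illustration with made-up numbers; it is not an analysis taken from the experiments above, and the clipping constant `eps` is an implementation assumption.

```python
import math


def bce(pred, target, eps=1e-12):
    """Mean binary cross entropy over a list of tag predictions."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def mse(pred, target):
    """Mean squared error over a list of tag predictions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# A tag that is present (target 1) but predicted with probability 0.01.
wrong = ([0.01], [1])
# The same tag predicted with probability 0.99.
right = ([0.99], [1])

print(bce(*wrong), mse(*wrong))
print(bce(*right), mse(*right))
```

For a target in {0, 1}, the squared error of a single output can never exceed 1, whereas the cross entropy grows without bound as the prediction approaches the wrong extreme, so confident mistakes produce a much stronger training signal under BCE.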
V-D Evaluations for different bottleneck sizes of the DAE and comparison with the common auto-encoder
Fig. 9 shows the EERs on Fold 0 evaluated using different de-noising auto-encoder configurations and compared with the MBK-DNN baseline. For the deep sDAE, the bottleneck layer size needs to be set properly. If it is too small, the 7-frame MBKs cannot be reconstructed well; if the bottleneck code is too large, it increases the input size of the DNN classifier. For the deep aDAE, a bottleneck layer with 50 ReLU units is found empirically to be a good choice. The linear unit (denoted as aDAE-Linear50) is worse than the ReLU unit for the new feature representation. Another interesting result is that the performance is almost the same when there is no de-noising operation in the auto-encoder (denoted as aAE-ReLU50). The reason is that the baseline DNN is well trained on MBKs with binary cross-entropy as the loss function.
V-E Evaluations for the size of the training dataset
In the preceding experiments, the ‘CHiME-Home-raw’ dataset was used to train the DNN, GMM and SVM models. Here, to evaluate the performance with different training data sizes, DNNs were trained on either ‘CHiME-Home-raw’ or ‘CHiME-Home-refined’ while keeping the same testing set. MBKs were used as the input features for the DNN classifier.
Table V shows the EERs for Fold 1 across seven tags with the DNNs trained on the ‘CHiME-Home-raw’ set and the ‘CHiME-Home-refined’ set. The DNN trained on the ‘CHiME-Home-raw’ set is clearly better than the DNN trained on the ‘CHiME-Home-refined’ set, although part of the labels of the ‘CHiME-Home-raw’ set are not accurate. This indicates that the DNN has a fault-tolerant capability, which suggests that the tag labels need not be refined at the cost of much annotator effort. The size of the training set is crucial for DNN training. The GMM method, in contrast, is sensitive to inaccurate labels: the increased training data with inaccurate tag labels does not help to improve the performance of the GMMs.
V-F Further discussions on the deep auto-encoder features
Fig. 10 presents the audio spectrogram of the deep aDAE features, which can be regarded as the new non-negative representation or optimized feature of the original MBKs. The units of the bottleneck layer in the deep aDAE are all activated by the ReLU functions as mentioned in Sec. III. Hence, the values of the learned feature are all non-negative, leading to a non-negative representation of the original MBKs. Such a non-negative representation can then be multiplied with the weights in the decoding part of the DAE to obtain the reconstructed MBKs. It is also adopted to replace the MBKs as the input to the DNN classifier to make a better prediction for the tags. The pure blue area at some dimensions in Fig. 10 indicates the zero values in the ReLU activation function.
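The non-negativity argument can be illustrated with a toy forward pass through a ReLU bottleneck followed by an untied linear decoder. This is a single-layer sketch with arbitrary, untrained weights chosen for readability; the deep aDAE described in Sec. III has more layers, and all names and numbers here are hypothetical.

```python
def matvec(W, x, b):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def encode(x, W_enc, b_enc):
    """Bottleneck code: ReLU units, so every entry is >= 0."""
    return relu(matvec(W_enc, x, b_enc))


def decode(code, W_dec, b_dec):
    """Linear reconstruction; W_dec is a separate (untied) matrix."""
    return matvec(W_dec, code, b_dec)


x = [0.5, -1.0, 2.0]                        # toy 3-dimensional input
W_enc = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # 3 -> 2 bottleneck
b_enc = [0.0, 0.0]
W_dec = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]  # 2 -> 3, not W_enc transposed
b_dec = [0.0, 0.0, 0.0]

code = encode(x, W_enc, b_enc)
recon = decode(code, W_dec, b_dec)
print(code)   # first unit is clipped to zero by the ReLU
print(recon)
```

Units whose pre-activation is negative output exactly zero, which corresponds to the uniformly blue regions described for Fig. 10, while the reconstruction itself may still take negative values because the decoder is linear.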
In this paper we have studied the acoustic modeling and feature learning issues in audio tagging. We have proposed a DNN-based approach incorporating unsupervised feature learning to handle audio tagging with weak labels, in the sense that only chunk-level rather than frame-level labels are available. Dropout and background noise aware training were adopted to improve its generalization capacity for new recordings in unseen environments. An unsupervised feature learning method based on a deep asymmetric DAE with untied weights was also proposed to generate a new feature with a non-negative representation. The DAE can generate features that are smoothed against the disordered background noise and also gives a compact representation of the contextual frames. We tested our approach on the dataset of Task 4 of the DCASE 2016 challenge, and obtained significant improvements over the two baselines, namely GMM and SVM. Compared with the official GMM-based baseline system given in the DCASE 2016 challenge, the proposed DNN system reduces the EER from 0.207 to 0.126 on the development set. The proposed unsupervised feature learning method obtains a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. We also obtain state-of-the-art performance with 0.148 EER on the evaluation set compared with the latest results (Yun-MFCC-GMM, Cakir-MFCC-CNN, Lidy-CQT-CNN) from the DCASE 2016 challenge. In future work, we will use convolutional neural networks (CNNs) to extract more robust high-level features for the audio tagging task. Larger datasets, such as the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset and the YouTube-8M dataset, will be considered to further evaluate the proposed algorithms.
Two baseline methods compared in our work are briefly summarized below.
-A Audio Tagging using Gaussian Mixture Models
GMMs are a commonly used generative classifier. A GMM is parametrized by $\Theta = \{\omega_m, \mu_m, \Sigma_m\}$, $m = 1, \dots, M$, where $M$ is the number of mixtures and $\omega_m$ is the weight of the $m$-th mixture component, with $\mu_m$ and $\Sigma_m$ its mean and covariance.
To implement multi-label classification with simple event tags, a binary classifier is built for each audio event class in the training step. For a specific event class, all audio frames in an audio chunk labeled with this event are categorized into a positive class, whereas the remaining features are categorized into a negative class. At the classification stage, given an audio chunk $C_i$, the likelihoods of each audio frame $x_{il}$, $l = 1, \dots, L_i$, are calculated for the two class models, respectively. Given audio event class $j$ and chunk $C_i$, the classification score $S_{ij}$ is obtained as the log-likelihood ratio:
$$S_{ij} = \sum_{l=1}^{L_i} \log f\left(x_{il} \mid \Theta_j^{\mathrm{pos}}\right) - \sum_{l=1}^{L_i} \log f\left(x_{il} \mid \Theta_j^{\mathrm{neg}}\right)$$
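The chunk-level log-likelihood ratio described above can be sketched in a few lines. This is a minimal illustration that assumes diagonal covariances and takes already-trained model parameters as plain lists; the function names and the toy one-component models are hypothetical and are not the DCASE 2016 baseline implementation.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed covariance form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_m w_m N(x; mu_m, Sigma_m), computed with the log-sum-exp trick."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def chunk_score(frames, pos_gmm, neg_gmm):
    """Sum over frames of the positive-vs-negative log-likelihood ratio."""
    return sum(
        gmm_log_likelihood(x, *pos_gmm) - gmm_log_likelihood(x, *neg_gmm)
        for x in frames
    )


# Toy single-component models for one tag: (weights, means, variances).
pos_gmm = ([1.0], [[1.0, 1.0]], [[1.0, 1.0]])
neg_gmm = ([1.0], [[-1.0, -1.0]], [[1.0, 1.0]])

frames = [[1.0, 1.0], [1.0, 1.0]]  # two frames located at the positive mean
score = chunk_score(frames, pos_gmm, neg_gmm)
print(score)  # positive: the chunk looks more like the positive class
```

A positive score favors the presence of the tag in the chunk; sweeping a threshold over such scores is what yields the EER operating point.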
-B Audio Tagging using Multiple Instance SVM
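Multiple instance learning is described in terms of bags $B$. The $j$-th instance in the $i$-th bag, $B_i$, is defined as $x_{ij}$ where $j \in \{1, \dots, l_i\}$, and $l_i$ is the number of instances in $B_i$. $B_i$'s label is $Y_i \in \{-1, 1\}$. If $Y_i = -1$, then $y_{ij} = -1$ for all $j$, where $y_{ij}$ denotes the label of instance $x_{ij}$. If $Y_i = 1$, then at least one instance $x_{ij} \in B_i$ is a positive example of the underlying concept.
As MI-SVM is the bag-level MIL support vector machine to maximize the bag margin, we define the functional margin of a bag with respect to a hyper-plane as:
$$\gamma_i = Y_i \max_{j} \left( \langle \mathbf{w}, x_{ij} \rangle + b \right)$$
Using the above notion, MI-SVM can be defined as:
$$\min_{\mathbf{w}, b, \xi} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \forall i: \; Y_i \max_{j} \left( \langle \mathbf{w}, x_{ij} \rangle + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$
where $\mathbf{w}$ is the weight vector, $b$ is bias, $\xi_i$ is margin violation, and $C$ is a regularization parameter.
Classification with MI-SVM proceeds in two steps. In the first step, the representative instance $\bar{x}_i$ is initialized as the centroid for every positive bag as follows:
$$\bar{x}_i = \frac{1}{l_i} \sum_{j=1}^{l_i} x_{ij}, \quad \forall i: Y_i = 1$$
The second step is an iterative procedure in order to optimize the parameters.
Firstly, $\mathbf{w}$ and $b$ are computed for the data set with positive samples $\{\bar{x}_i : Y_i = 1\}$.
Secondly, we compute
$$f_{ij} = \langle \mathbf{w}, x_{ij} \rangle + b, \quad x_{ij} \in B_i$$
Thirdly, we change $\bar{x}_i$ by
$$\bar{x}_i = x_{ij^*}, \quad j^* = \arg\max_{j} f_{ij}, \quad \forall i: Y_i = 1$$
The iteration in this step will stop when there is no change of $\bar{x}_i$. The optimized parameters will be used for testing.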
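The two-step procedure above can be sketched as a short loop. This is a hedged illustration only: the inner solver is a simple batch subgradient descent on the linear soft-margin objective, used here as a stand-in for whatever SVM solver the baseline actually employs, and all function names, hyper-parameters and toy bags are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Stand-in solver: subgradient descent on 0.5*||w||^2 + C * sum(hinge)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0
        for x, t in zip(X, y):
            if t * (dot(w, x) + b) < 1.0:  # margin violated
                grad_w = [g - C * t * xk for g, xk in zip(grad_w, x)]
                grad_b -= C * t
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def mi_svm(bags, labels, max_iter=20):
    negatives = [x for bag, Y in zip(bags, labels) if Y == -1 for x in bag]
    pos_bags = [bag for bag, Y in zip(bags, labels) if Y == 1]
    # Step 1: initialize each positive bag's representative as its centroid.
    reps = [[sum(col) / len(bag) for col in zip(*bag)] for bag in pos_bags]
    selected = [None] * len(pos_bags)
    for _ in range(max_iter):
        # Step 2a: fit w, b on negative instances plus the representatives.
        X = negatives + reps
        y = [-1] * len(negatives) + [1] * len(reps)
        w, b = train_linear_svm(X, y)
        # Step 2b/2c: per positive bag, pick the instance with the largest output.
        new_selected = [
            max(range(len(bag)), key=lambda j: dot(w, bag[j]) + b)
            for bag in pos_bags
        ]
        if new_selected == selected:  # no change: stop iterating
            break
        selected = new_selected
        reps = [bag[j] for bag, j in zip(pos_bags, selected)]
    return w, b, selected


def predict_bag(bag, w, b):
    """A bag is positive if its highest-scoring instance is positive."""
    return 1 if max(dot(w, x) + b for x in bag) > 0 else -1


bags = [
    [(3.0, 3.0), (-2.0, -2.0)],   # positive bag: one positive instance
    [(2.5, 3.0), (-3.0, -2.5)],   # positive bag: one positive instance
    [(-2.0, -3.0), (-3.0, -2.0)], # negative bag
    [(-2.5, -2.5)],               # negative bag
]
labels = [1, 1, -1, -1]
w, b, selected = mi_svm(bags, labels)
predictions = [predict_bag(bag, w, b) for bag in bags]
print(selected, predictions)
```

In this toy setting the loop settles on the first instance of each positive bag as the representative, which mirrors the idea that only one "witness" instance per positive bag needs to lie on the positive side of the hyper-plane.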
We gratefully acknowledge the critical comments of the anonymous reviewers and the associate editor of this paper.
-  C. Alberti, M. Bacchiani, A. Bezman, C. Chelba, A. Drofa, P. Liao, H. Moreno, T. Power, A. Sahuguet, M. Shugrina, and O. Siohan, “An audio indexing system for election video material,” in Proceedings of International Conference on Audio, Speech and Signal Processing, 2009, pp. 4873–4876.
-  J. Foote, “Automatic audio segmentation using a measure of audio novelty,” in IEEE International Conference on Multimedia and Expo, 2000, pp. 452–455.
-  S. Allegro, M. Büchler, and S. Launer, “Automatic sound classification inspired by auditory scene analysis,” in Consistent and Reliable Acoustic Cues for Sound Analysis, 2001.
-  B. Picart, S. Brognaux, and S. Dupont, “Analysis and automatic recognition of human beatbox sounds: A comparative study,” in Proceedings of International Conference on Audio, Speech and Signal Processing, 2015, pp. 4225–4229.
-  G. Chen and B. Han, “Improve K-means clustering for audio data by exploring a reasonable sampling rate,” in Seventh International Conference on Fuzzy Systems and Knowledge Discovery, 2010, pp. 1639–1642.
-  M. Riley, E. Heinen, and J. Ghosh, “A text retrieval approach to content-based audio retrieval,” in Proceedings of International Conference on Music Information Retrieval, 2008, pp. 1639–1642.
-  X. Shao, C. Xu, and M. Kankanhalli, “Unsupervised classification of music genre using hidden Markov model,” in Proceedings of International Conference on Multimedia and Expo, 2004, pp. 2023–2026.
-  R. Cai, L. Lu, and A. Hanjalic, “Unsupervised content discovery in composite audio,” in Proceedings of International Conference on Multimedia, 2005, pp. 628–637.
-  T. Sainath, D. Kanevsky, and G. Iyengar, “Unsupervised audio segmentation using extended Baum-Welch transformations,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, 2007, pp. 2009–2012.
-  G. Xia, D. Liang, R. Dannenberg, and M. Harvilla, “Segmentation, clustering, and displaying in a personal audio database for musicians,” in 12th International Society for Music Information Retrieval Conference, 2011, pp. 139–144.
-  http://www.cs.tut.fi/sgn/arg/dcase2016/task-audio-tagging.
-  A. Kumar and B. Raj, “Audio event detection using weakly labeled data,” CoRR, vol. abs/1605.02401, 2016. [Online]. Available: http://arxiv.org/abs/1605.02401
-  Y. Petetin, C. Laroche, and A. Mayoue, “Deep neural networks for audio scene recognition,” in 23rd European Signal Processing Conference (EUSIPCO), 2015, pp. 125–129.
-  E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event detection using multi label deep neural networks,” in International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–7.
-  P. Hamel, S. Wood, and D. Eck, “Automatic identification of instrument classes in polyphonic and poly instrument audio,” in Proceedings of the 10th International Society for Music Information Retrieval Conference, 2009, pp. 399–404.
-  S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 6964–6968.
-  K. Choi, G. Fazekas, and M. Sandler, “Automatic tagging using deep convolutional neural networks,” arXiv preprint arXiv:1606.00298, 2016.
-  P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. Plumbley, “CHiME-home: A dataset for sound source recognition in a domestic environment,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2015, pp. 1–5.
-  A. Coates, H. Lee, and A. Y. Ng, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 215–223.
-  Y. Bengio, A. C. Courville, and P. Vincent, “Unsupervised feature learning and deep learning: A review and new perspectives,” CoRR, abs/1206.5538, vol. 1, 2012.
-  G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8609–8613.
-  Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “Dynamic noise aware training for speech enhancement based on deep neural networks.” in INTERSPEECH, 2014, pp. 2670–2674.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
-  S. Zhang, Y. Bao, P. Zhou, H. Jiang, and L. Dai, “Improving deep neural networks for LVCSR using dropout and shrinking structure,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6849–6853.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  P. Zhou and J. Austin, “Learning criteria for training neural network classifiers,” Neural computing & applications, vol. 7, no. 4, pp. 334–342, 1998.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  R. Radhakrishnan, A. Divakaran, and P. Smaragdis, “Audio analysis for surveillance applications,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pp. 158–161.
-  L.-H. Cai, L. Lu, A. Hanjalic, and L.-H. Zhang, “A flexible framework for key audio effects detection and auditory context inference,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1026–1039, 2006.
-  P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in Proceedings of the 11th International Society for Music Information Retrieval Conference, 2010, pp. 339–344.
-  C. V. Cotton and D. P. W. Ellis, “Spectral vs. spectro-temporal features for acoustic event detection,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011, pp. 69–72.
-  M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7398–7402.
-  D. Stowell and M. D. Plumbley, “Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,” PeerJ, vol. 2, p. e488, 2014.
-  Y. Vaizman, B. McFee, and G. Lanckriet, “Codebook-based audio feature representation for music information retrieval,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, pp. 1483–1493, 2014.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
-  S. Molau, M. Pitz, R. Schluter, and H. Ney, “Computing mel-frequency cepstral coefficients on the power spectrum,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2001, pp. 73–76.
-  F. Grézl, M. Karafiát, S. Kontár, and J. Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, 2007, pp. 757–760.
-  H. Christensen, J. Barker, N. Ma, and P. Green, “The CHiME corpus: A resource and a challenge for computational hearing in multisource environments,” in Proceedings of Interspeech, 2010, pp. 1918–1921.
-  A.-R. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
-  K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
-  M. I. Mandel and D. Ellis, “Song-level features and support vector machines for music classification.” in ISMIR, 2005, pp. 594–599.
-  T. Lidy and A. Schindler, “CQT-based convolutional neural networks for audio scene classification and domestic audio tagging,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016).
-  E. Cakır, T. Heittola, and T. Virtanen, “Domestic audio tagging with convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016).
-  S. Yun, S. Kim, S. Moon, J. Cho, and T. Kim, “Discriminative training of GMM parameters for audio scene classification and audio tagging,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016).
-  B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100M: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
-  S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
-  S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in Proceedings of Advances in Neural Information Processing Systems, 2003, pp. 557–584.