Acoustic scene classification (ASC) aims to distinguish between different acoustic scenes. Effective identification of scenes through the analysis of unstructured patterns has many potential applications, such as intelligent monitoring systems and context-aware device design. The acoustic signature of a scene contains multiple overlapping sound sources; as a result, ASC remains a great challenge despite the sustained efforts that have been made.
From the perspective of scene classification, different methods have been tested in the computer vision field, and dramatic progress has been made during the last two decades, especially with improvements in local invariant features (such as SIFT and SURF) and convolutional neural networks. Nevertheless, compared to image-based scene classification, the audio-based approach is still under-explored. State-of-the-art scene classification methods using audio do not yet provide accuracy comparable to that of image-based methods. However, audio is more descriptive and salient than images in some practical situations.
In the past several years, there has been increasing interest in finding more robust and efficient approaches for acoustic scene classification and sound event detection, using both supervised and unsupervised learning methods. Specifically, the first Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, organized in 2013 by the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee, aimed to address the lack of common benchmarking datasets and has stimulated the research community to explore more efficient methods. Since the release of relatively larger labeled datasets, a plethora of efforts have been made on the audio scene classification task.
Recently, deep learning has achieved convincing performance in different fields, including computer vision and speech recognition [10]. A wide range of deep learning architectures has been explored for audio signal processing, for example auto-encoders, convolutional neural networks and recurrent neural networks, and different regularization methods have also been tested for the task.
Most previous attempts applied deep learning by modifying the CNN architecture. In this paper, we aim to improve ASC performance by using a multi-scale DenseNet and a regularization method based on culling training samples. In more detail, the multi-scale DenseNet is employed to extract multi-scale information embedded in the time-frequency representation of the audio signal. Moreover, unlike previous attempts that drop out hidden units during neural network training, we explore low-variance sample dropout, with the goal of culling the “outliers” in the training samples. After the specified samples are removed from the training set, the neural network classifier is trained on the remaining examples to obtain robust models. Using the DCASE 2017 audio scene classification dataset, our experimental evaluation shows that the proposed method can improve the robustness of the classifier.
The paper is organized as follows: Section 2 gives a short summary of related work. Section 3 presents the data used and the experimental setup, while Section 4 describes the multi-scale DenseNet. The approach to culling samples in the training set is given in Section 5. The experimental results are presented in Section 6, while Section 7 concludes this paper.
2 Related Work
ASC is a complicated problem that aims at distinguishing acoustic environments based solely on audio recordings of the scene. Over the last decades, various feature extraction algorithms (for representing the audio scenes) and classification models have been proposed. The most popular baseline is a Gaussian Mixture Model (GMM) or Hidden Markov Model (HMM) using Mel-Frequency Cepstral Coefficients (MFCCs). Shallow-architecture classifiers, such as Support Vector Machines (SVM) and Gradient Boosting Machines (GBM), were also tested for the classification task. Moreover, non-negative matrix factorization (NMF) can be used to extract a subspace representation prior to classification.
Recently, many works have demonstrated that deep neural networks can improve classification accuracy without handcrafted features. In brief, for the ASC task, the main modifications in deep learning-based approaches can be divided into three categories: deep learning using different representations of the audio signal; more sophisticated deep learning architectures; and the application of different regularization methods to train the deep neural network. Well-known deep neural network models include deep belief networks (DBN), auto-encoders, convolutional neural networks (CNN) and recurrent neural networks (RNN). Here, a CNN is selected as the classifier due to its high potential to identify the various patterns of audio signals. Moreover, unlike previous attempts, we explore the use of a multi-scale DenseNet, as additional features may be embedded in different time ranges. Combining features from multiple scales may lead to more salient feature representations for the classification task.
On the other hand, neural networks are vulnerable to outliers in the training set, which can have a strong negative influence on the trained model. As a result, the pre-processing of training samples is a key factor for audio scene classification accuracy, although it is often under-explored in previous studies. In this work, we explore culling low-variance training samples, which can be viewed as noise (or outliers), with the goal of improving accuracy.
3 Audio Scene Classification Datasets and Experimental Setup
For the audio scene classification task, the well-collected datasets include the DCASE 2013 dataset, the DCASE 2016 dataset, the DCASE 2017 dataset, the Rouen dataset and the ESC dataset. We evaluate the proposed method on the TUT audio scene classification 2017 database, as this dataset is large in scale and covers 15 different acoustic scenes.
In more detail, the database consists of stereo recordings collected at a 44.1 kHz sampling rate with 24-bit resolution. The recordings came from various acoustic scenes with distinct recording locations. Each original recording is 3-5 minutes long, and the recordings were split into 10-second segments. All samples are divided into four cross-validation folds. The acoustic scene classes considered in this task were: bus, cafe/restaurant, car, city center, forest path, grocery store, home, lakeside beach, library, metro station, office, residential area, train, tram, and urban park.
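The segmentation described above can be illustrated with a minimal sketch. It assumes non-overlapping 10-second segments and discards any trailing partial segment; both choices, as well as the function name `segment_bounds`, are illustrative assumptions rather than details given in the text.

```python
SAMPLE_RATE = 44_100   # Hz, sampling rate stated for the recordings
SEGMENT_SECONDS = 10   # segment length stated for the dataset


def segment_bounds(num_samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Return (start, end) sample indices of consecutive full-length segments.

    Assumption: segments do not overlap and a trailing partial segment is dropped.
    """
    seg_len = sample_rate * seconds
    return [(start, start + seg_len)
            for start in range(0, num_samples - seg_len + 1, seg_len)]


# A hypothetical 3-minute recording gives 18 segments of 441,000 samples each.
bounds = segment_bounds(3 * 60 * SAMPLE_RATE)
print(len(bounds), bounds[0], bounds[-1])  # 18 (0, 441000) (7497000, 7938000)
```

Under these assumptions, a 5-minute recording would yield 30 segments.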
The input to the different deep neural network architectures can be either the raw audio signal or a time-frequency representation of it. Presently, most audio-related classification and detection systems rely on hand-crafted time-frequency representations of the audio signal. For example, Mel-frequency cepstral coefficients (MFCCs) are widely used in speech recognition. However, MFCCs were developed with inspiration from the human speech production process, which assumes that sounds are produced by a glottal pulse passing through the vocal tract filter. MFCCs discard useful information about the sound, which restricts their ability for recognition and classification. In recent years, Mel filter-bank features have been widely used in speaker recognition. In this paper, Mel filter-bank features are used for our experiments, as Mel filter banks can provide better performance, and the filters of the Mel bank are reweighted to the same height (shown in Figure 1).
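To make the idea of an equal-height Mel filter bank concrete, the following is a minimal pure-Python sketch. It assumes the common mel-scale formula m = 2595 log10(1 + f/700) and interprets "reweighted to the same height" as scaling every triangular filter so that its largest weight equals 1; the function names and the FFT size of 2048 are hypothetical and not taken from the paper.

```python
import math


def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the paper does not state which one it uses).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters over FFT bins, each rescaled to a peak weight of 1."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge/centre frequencies, equally spaced on the mel scale.
    hz_points = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, centre, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = [max(0.0, min((f - lo) / (centre - lo), (hi - f) / (hi - centre)))
               for f in bin_freqs]
        peak = max(row)
        if peak > 0.0:
            row = [w / peak for w in row]  # equal-height reweighting (assumed meaning)
        bank.append(row)
    return bank


bank = mel_filterbank(n_mels=40, n_fft=2048, sample_rate=44_100)
print(len(bank), len(bank[0]))  # 40 1025
```

A log Mel filter-bank energy for one frame would then be the logarithm of the weighted sum of that frame's power spectrum under each filter row.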
4 Multi-Scale DenseNet
Convolutional neural networks (CNNs) have great potential to identify the various salient patterns of audio signals. There are numerous variants of CNN architectures in the literature. However, their basic components have remained very similar since LeNet-5 [21].
Many recent works claim that deeper CNNs can provide better performance, as demonstrated by the progress on the image classification task using the AlexNet, VGGNet and ResNet architectures. Unlike traditional sequential network architectures such as AlexNet, ResNet is a form of architecture that relies on network-in-network blocks. DenseNet is a more recent architecture and a logical extension of ResNet.
In more detail, DenseNet has a fundamental building block, the dense block, which connects each layer to every other layer in a dense manner. There is a direct connection between any two layers in each dense block, and the input to each layer is the concatenation of the outputs of all previous layers. Compared to a conventional CNN, DenseNet not only performs better in image classification, but also makes better use of the original data and loses less feature information. It reinforces feature propagation, supports feature reuse, effectively alleviates the vanishing gradient problem and significantly reduces the number of parameters.
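Because each layer receives the outputs of all previous layers in its block, the number of input channels grows linearly with depth. The sketch below illustrates this bookkeeping only; the growth rate of 32 and the block depth of 4 are hypothetical values chosen for the example, while the 64 input channels correspond to the initial convolution mentioned in the text.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Input channel count seen by each layer, plus the block's output channel count.

    Each layer is assumed to add `growth_rate` new feature maps, which are
    concatenated with everything produced before it in the block.
    """
    layer_inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    block_output = in_channels + num_layers * growth_rate
    return layer_inputs, block_output


# Hypothetical block: 4 layers, growth rate 32, fed by a 64-channel convolution.
layer_inputs, block_output = dense_block_channels(64, num_layers=4, growth_rate=32)
print(layer_inputs, block_output)  # [64, 96, 128, 160] 192
```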
For the audio scene classification task, the input to the DenseNet can be the time-frequency representation of the raw audio signal. In this paper, as mentioned above, the Mel filter-bank energy is selected as the input to the neural network. However, it is widely known that multi-scale information is embedded in the time-frequency representation of the audio signal. Thus, fully automatic extraction of multi-scale features is of great importance for improving classification accuracy. The multi-scale DenseNet architecture is shown in Figure 2.
As shown in Figure 2, the proposed system architecture comprises dense convolutional blocks with direct connections from any layer to all subsequent layers to improve the information flow on 128 × 128 input images. The DenseNet used in our experiments is made up of four dense blocks with unequal numbers of layers.
A convolution with 64 output channels is performed on the input images before the first dense block. For convolutional layers with kernel size 3 × 3, one-pixel padding is applied at each side of the inputs to keep the feature-map size fixed. The layers between two contiguous dense blocks are referred to as transition layers for convolution and pooling, which contain a 1 × 1 convolution and 2 × 2 average pooling. A bottleneck layer, a 1 × 1 convolution placed before each 3 × 3 convolution, is used to reduce the number of input feature maps and improve computational efficiency. At the end of the last dense block, global average pooling and a softmax classifier are employed.
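The effect of these layers on the spatial size of the feature maps can be traced with a small sketch. It assumes that the initial convolution preserves the input size (3 × 3, stride 1, one-pixel padding) and that the 2 × 2 average pooling uses stride 2; neither is stated explicitly in the text, and the helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(input_size=128, num_dense_blocks=4):
    """Feature-map side length at the output of each dense block."""
    sizes = []
    # Initial convolution: assumed 3x3, stride 1, padding 1 (size-preserving).
    size = conv_out(input_size, kernel=3, padding=1)
    for block in range(num_dense_blocks):
        # 1x1 bottleneck and padded 3x3 convolutions inside a block keep the size.
        size = conv_out(conv_out(size, kernel=1), kernel=3, padding=1)
        sizes.append(size)
        if block < num_dense_blocks - 1:
            # Transition layer: 1x1 convolution, then 2x2 average pooling (stride 2 assumed).
            size = conv_out(conv_out(size, kernel=1), kernel=2, stride=2)
    return sizes


print(trace_sizes())  # [128, 64, 32, 16]
```

With four dense blocks there are three transition layers, so under these assumptions the final global average pooling operates on 16 × 16 feature maps.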
The multi-scale dense block is employed to capture the multi-scale features embedded within the Mel filter-bank features. The convolution kernel is longer in the upper layer and can generate more feature maps to extract long-term and medium-term features in the Mel filter-bank energy representation of the audio signal. In the lower layer, the convolution kernel receives all outputs of the previous layers together with the original input. As a result, we can decrease the length of the convolution kernel and encourage the feature maps to capture more short-term features. Since the network structure supports the reuse of feature maps, the multi-scale dense block concatenates all periodic features of different lengths and passes them to the transition layer for further processing. After this block, the model has extracted a certain number of multi-scale features.
5 Culling training samples for convolutional neural network
It is widely known that convolutional neural networks have a natural tendency towards overfitting, especially when the training set is small. The DCASE 2017 audio scene classification dataset is much larger than the datasets used in previous studies. However, compared to the amount of data used to train neural networks for image classification, it is still small. Dropout and data augmentation have proved to be effective regularization approaches for training convolutional neural networks. In more detail, dropout of neurons in the hidden layers is widely used in the literature: the activation of every hidden unit is randomly removed with a predefined probability. Another common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (data augmentation). However, artificially enlarging the dataset may introduce more ambiguous data (which can be considered outliers) on which the model has low confidence and thus high loss. In the audio scene classification task, it is widely known that convolutional neural networks are vulnerable to such ambiguous samples.
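For contrast with the sample-level approach studied in this paper, conventional dropout on hidden activations can be sketched as follows. This is the widely used "inverted dropout" variant, which rescales the surviving activations during training; the function name and the drop probability of 0.5 are illustrative assumptions.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0] * 1000   # hypothetical hidden-layer activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(sum(1 for v in dropped if v == 0.0), "of", len(dropped), "units removed")
```

Each surviving unit is scaled by 1 / (1 - drop_prob), so the expected value of every activation is unchanged and no rescaling is needed at test time.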
In this paper, unlike previous attempts, we aim to reduce the training set by removing outliers. In more detail, we tested two different methods. First, inspired by the silence removal employed in speech recognition, we remove the training samples associated with low-magnitude audio segments. It is well known that silence removal can boost the performance of automatic speech recognition.
Second, we explore culling low-variance training samples directly. As can be seen from Figure 1, flat regions exist in the Mel-bank features of the audio signal. These flat regions may contain little useful information for network training, as the pixel values are almost the same across the whole sample. Our insight is that by removing the low-variance samples, we can make our model more robust without changing either the network architecture or the training method.
In our experiments, we trained the classifier using the multi-scale DenseNet and evaluated every example in the training set; examples whose variance was below a threshold were considered outliers and were removed during majority voting.
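A minimal sketch of low-variance sample culling followed by majority voting is given below. It assumes that each sample is a flattened list of Mel filter-bank values, that the threshold is set implicitly by dropping a fixed fraction of the lowest-variance samples, and that segment-level predictions are combined by a simple vote; the function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def cull_low_variance(samples, drop_ratio):
    """Return the indices of samples kept after dropping the lowest-variance fraction."""
    variances = [pvariance(s) for s in samples]
    n_drop = int(len(samples) * drop_ratio)
    order = sorted(range(len(samples)), key=lambda i: variances[i])
    dropped = set(order[:n_drop])
    return [i for i in range(len(samples)) if i not in dropped]


def majority_vote(labels):
    """Most frequent label among the predictions of the kept segments."""
    return Counter(labels).most_common(1)[0][0]


# Toy data: four flattened "feature maps"; samples 0 and 3 are nearly flat.
samples = [
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.0, 1.0],
    [0.2, 0.8, 0.1, 0.9],
    [0.4, 0.4, 0.5, 0.4],
]
kept = cull_low_variance(samples, drop_ratio=0.5)
print(kept)  # [1, 2]

# Hypothetical per-segment predictions for one recording, after culling.
print(majority_vote(["bus", "tram", "bus"]))  # bus
```

The toy ratio of 0.5 is chosen only so that the effect is visible on four samples; it is not a value used in the experiments.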
6 Experimental results
In our experiments, the average accuracy is calculated using the four pre-defined cross-validation folds. For comparison, baseline results are also given. The baseline is a Gaussian mixture model (GMM) (with 16 Gaussians per class).
In more detail, the baseline system used here consists of 60 MFCC features and a Gaussian mixture model-based classifier. MFCCs were calculated using 40-ms frames with a Hamming window, 50% overlap and 40 mel bands. They include the first 20 coefficients (including the 0th-order coefficient) together with delta and acceleration coefficients calculated using a window length of 9 frames. A GMM with 32 components was trained for each scene class. To train the multi-scale DenseNet, we used Keras with the TensorFlow backend, which can fully utilize GPU resources. CUDA and cuDNN were also used to accelerate the system.
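The framing parameters of the baseline translate into sample counts as sketched below. The sketch assumes the 44.1 kHz sampling rate of the dataset, 10-second segments and no padding at the segment edges; the variable and function names are illustrative.

```python
SAMPLE_RATE = 44_100                       # Hz, sampling rate of the recordings
FRAME_SECONDS = 0.040                      # 40-ms analysis frames
frame_len = round(FRAME_SECONDS * SAMPLE_RATE)  # samples per frame
hop_len = frame_len // 2                   # 50% overlap


def num_frames(num_samples):
    """Number of full frames in a signal, assuming no edge padding."""
    return 1 + (num_samples - frame_len) // hop_len


n_static = 20                  # first 20 MFCCs, including the 0th-order coefficient
feature_dim = n_static * 3     # static + delta + acceleration coefficients

print(frame_len, hop_len)               # 1764 882
print(num_frames(10 * SAMPLE_RATE))     # 499
print(feature_dim)                      # 60
```

The resulting dimension of 60 matches the number of MFCC features quoted for the baseline.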
The audio scene classification accuracies obtained with different sample dropout ratios are given in Table 1. In our experiments, the silence removal-based sample dropout method did not boost performance; thus only the culling of low-variance samples, using the Mel filter-bank energy representation, is reported. Due to the page limitation, only the multi-scale DenseNet-based audio scene classification accuracy is reported for the different sample dropout ratios. As can be seen from the table, sample dropout can improve performance on the classification task, and a low-variance sample dropout ratio of 0.01 provides the best performance. This dropout ratio of 0.01 is employed in the following experiments.
|Sample dropout ratio||Cross-validation||Evaluation|
Table 2 provides a further quantitative comparison between the different methods. In our experiments, both the original DenseNet and the multi-scale DenseNet performed better than the baseline. Without any data augmentation, the cross-validation accuracy of the multi-scale DenseNet is 80.4%, while the original (single-scale) DenseNet achieves 78.3%. With sample dropout, the performance of both the original DenseNet and the multi-scale DenseNet is boosted. The performance of the baseline is also improved with sample dropout.
On the evaluation dataset, sample dropout improved the accuracy of the baseline from 61.0% to 63.4%, the accuracy of the original DenseNet from 68.8% to 69.5%, and the accuracy of the multi-scale DenseNet from 70.5% to 72.5%.
|Method||Cross-validation||Evaluation|
|Multi-scale DenseNet||80.4%||70.5%|
|Baseline (GMM) with sample dropout||76.2%||63.4%|
|DenseNet with sample dropout||80.6%||69.5%|
|Multi-scale DenseNet with sample dropout||83.4%||72.5%|
As can be seen from the table, the results demonstrate that the multi-scale DenseNet performs better than the original single-scale DenseNet. This may imply that additional features can be extracted at multiple scales, which can improve the accuracy of audio scene classification. Moreover, the sample dropout method boosts performance further, which demonstrates the effectiveness of the proposed method.
7 Conclusion
In this paper, we have presented a multi-scale DenseNet-based method for multi-class acoustic scene classification. To summarize, the contributions of this paper are twofold. First, we explore a multi-scale CNN architecture for the classification task. To the best of the authors' knowledge, this is the first attempt to employ a multi-scale DenseNet for the audio scene classification task. Second, we propose a novel sample dropout method, and experiments demonstrate that the proposed sample dropout approach further improves classification performance. In future work, we will conduct a quantitative comparison between different widely used CNN architectures, which can help in designing dedicated architectures for the audio scene classification task. Moreover, transfer learning-based audio scene classification also needs to be explored.
This study was supported by the Strategic Priority Research Programme (17-ZLXD-XX-02-06-02-08). We also thank Qiuqiang Kong from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, for providing helpful suggestions.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos, “Scene classification with semantic fisher vectors,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 2974–2983.
-  D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
-  H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “Cp-jku submissions for dcase-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
-  A. Mesaros, T. Heittola, and T. Virtanen, “Tut database for acoustic scene classification and sound event detection,” in Signal Processing Conference (EUSIPCO), 2016 24th European. IEEE, 2016, pp. 1128–1132.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “Dcase 2017 challenge setup: Tasks, datasets and baseline system,” in DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit,” in Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014, pp. 55–60.
-  J. T. Geiger, B. Schuller, and G. Rigoll, “Large-scale audio feature extraction and svm for acoustic scene classification,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–4.
-  E. Fonseca, R. Gong, D. Bogdanov, O. Slizovskaia, E. Gómez Gutiérrez, and X. Serra, “Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks,” in Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany. Tampere University of Technology, 2017, pp. 37–41.
-  Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems, 2016, pp. 892–900.
-  E. Marchi, D. Tonelli, X. Xu, F. Ringeval, J. Deng, and B. Schuller, “The up system for the 2016 dcase challenge using deep recurrent neural network and multiscale kernel subspace learning,” Detection and Classification of Acoustic Scenes and Events, 2016.
-  S. H. Bae, I. Choi, and N. S. Kim, “Acoustic scene classification using parallel combination of lstm and cnn,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016, pp. 11–15.
-  H. Phan, P. Koch, L. Hertel, M. Maass, R. Mazur, and A. Mertins, “Cnn-lte: a class of 1-x pooling convolutional neural networks on label tree embeddings for audio scene classification,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 136–140.
-  K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, and S. Liu, “Mixup-based acoustic scene classification using multi-channel convolutional neural network,” arXiv preprint arXiv:1805.07319, 2018.
-  A. Rakotomamonjy and G. Gasso, “Histogram of gradients of time-frequency representations for audio scene classification,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 142–153, 2015.
-  K. J. Piczak, “Esc: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 1015–1018.
-  Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard et al., “Learning algorithms for classification: A comparison on handwritten digit recognition,” Neural networks: the statistical mechanics perspective, vol. 261, p. 276, 1995.
-  B. Li, K. Xu, X. Cui, Y. Wang, X. Ai, and Y. Wang, “Multi-scale densenet-based electricity theft detection,” arXiv preprint arXiv:1805.09591, 2018.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
-  G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, “Multi-scale dense convolutional networks for efficient prediction,” arXiv preprint arXiv:1703.09844, 2017.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.