Acoustic scene classification (ASC) is concerned with correctly assigning real-world sounds to one of a given set of environment classes, such as metro station, street traffic, or public square. An acoustic scene contains rich and varied content, which makes accurate scene prediction difficult and thereby an intriguing research problem. ASC has thus been an attractive research field for decades, and the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [7, 8, 9, 4] provides benchmark data and a competitive platform to promote sound scene research and analyses.
In DCASE 2020, Task 1 has two sub-tasks. Task 1a focuses on the robustness of ASC systems. Its goal is to promote research on the device mismatch issue, which is common in ASC applications. The key aim is to design a device-invariant system that can accurately classify audio from ten scenes recorded by different devices, without leveraging any device information in the evaluation stage. Task 1b focuses on the model size of the ASC system. The goal is to build a three-class classifier occupying no more than 500 KB.
We describe our submitted systems for the two sub-tasks of DCASE 2020. For Task 1a, we build a two-stage acoustic scene classification system, which includes a three-class classifier and a ten-class classifier. The final predicted class is based on the score fusion of these two classifiers. Four different convolutional neural network (CNN) based models are used in our two-stage classifier. Moreover, several data augmentation strategies are adopted to reduce the device dependency of our models. A model ensemble of four CNN-based systems provided a significant boost in ASC performance. For Task 1b, we first build a small-size model and then use a quantization method to compress the trained model. In this way, a model can be compressed to a fraction of its original size. In our experiments, the ensemble of two smaller models achieves better evaluation results than a single model of comparable total size.
2 Acoustic Scene Classification System
2.1 Two-Stage Classification Procedure
For Task 1a, we build a two-stage ASC system, which includes two different classifiers and outputs the class of the input audio scene, chosen among ten classes. As shown in Figure 1, the first classifier is a three-class classifier that assigns an input audio scene to one of three main classes: indoor, outdoor, and transportation. The second classifier is a ten-class classifier that assigns a given input audio scene to one of ten basic scene classes: airport, shopping mall, metro station, pedestrian street, public square, street traffic, tram, bus, metro, and park. For each input audio clip, the final scene class is chosen by score fusion of the two classifiers. Let $C_1$ and $C_2$ denote the sets of three main classes and ten classes, respectively, and let $F_1(x)$ and $F_2(x)$ denote the outputs of the first and second classifier for an input $x$, respectively. The final predicted class for the input $x$ is:
$$\mathrm{Class}(x) = \arg\max_{q \in C_2,\; p \in C_1,\; p \supset q} F_1^{p}(x) \cdot F_2^{q}(x),$$
where $p \supset q$ means that $p$ can be thought of as a superset of $q$. For example, the transportation class is a superset of the bus, tram, and metro classes. Therefore, the probability that an input audio clip comes from the public square scene is equal to the product of the probability of the outdoor class given by $F_1$ and the probability of public square given by $F_2$.
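The fusion rule above can be illustrated with a minimal NumPy sketch. The class identifiers, the `PARENT` mapping (which follows the usual DCASE 2020 grouping of the ten scenes into three main classes), and the function name `two_stage_fusion` are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Three main classes and ten scene classes (names are illustrative identifiers).
MAIN = ["indoor", "outdoor", "transportation"]
SCENES = ["airport", "shopping_mall", "metro_station", "street_pedestrian",
          "public_square", "street_traffic", "tram", "bus", "metro", "park"]

# Assumed superset relation: each scene belongs to exactly one main class.
PARENT = {
    "airport": "indoor", "shopping_mall": "indoor", "metro_station": "indoor",
    "street_pedestrian": "outdoor", "public_square": "outdoor",
    "street_traffic": "outdoor", "park": "outdoor",
    "tram": "transportation", "bus": "transportation", "metro": "transportation",
}

def two_stage_fusion(p3, p10):
    """Multiply each scene score by the score of its main class, then take argmax.

    p3:  array of shape (n_clips, 3), output of the three-class classifier.
    p10: array of shape (n_clips, 10), output of the ten-class classifier.
    """
    parent_idx = np.array([MAIN.index(PARENT[s]) for s in SCENES])
    fused = p3[:, parent_idx] * p10          # shape (n_clips, 10)
    return fused.argmax(axis=1), fused

# Toy clip: the ten-class model slightly prefers "metro_station",
# but the three-class model is confident about "transportation".
p10 = np.array([[0.05, 0.05, 0.40, 0.02, 0.02, 0.02, 0.05, 0.02, 0.35, 0.02]])
p3 = np.array([[0.2, 0.1, 0.7]])
pred, fused = two_stage_fusion(p3, p10)
predicted_scene = SCENES[pred[0]]
```

In this toy case the ten-class classifier alone would pick metro station, while the fused score selects metro, because the three-class classifier down-weights all indoor scenes.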
2.2 CNN-based Classifiers
Five CNN-based architectures are used. They differ from one another in specific details concerning (i) time and frequency pooling (sub-sampling) operations, (ii) independent frequency sub-band analysis, (iii) shortcut connections (i.e., residual mapping functions), and (iv) the number of convolutional layers:
FCNN (fully convolutional neural network): FCNN is a VGG-like [13] fully convolutional network. We use 9 stacked convolutional layers with small kernel sizes. Each convolutional layer is followed by a batch normalization operation and a ReLU activation function. Dropout is also used from the fifth to the last convolutional layer to alleviate over-fitting. A max-pooling layer is appended after the second, fourth, and eighth ReLU-based layers. Channel attention is applied to each output channel of the last convolutional layer, followed by a global pooling layer. Finally, a 10-way softmax layer is used to generate the final classification result. In Task 1b, we use an FCNN with a similar architecture but far fewer parameters, and we refer to this model as small-FCNN.
fsFCNN (frequency sub-sampling controlled fully convolutional neural network): Through our experiments on DCASE 2020 Task 1a, we noticed that reducing max-pooling along the frequency axis helped mitigate overfitting; we therefore deployed a neural architecture very similar to the above-mentioned FCNN but with 11 stacked convolutional layers. Moreover, a max-pooling layer is appended after the second and fourth ReLU-based layers. A max-pooling layer instead follows the sixth and eighth ReLU-based layers. Channel attention is applied to each output channel of the last convolutional layer, followed by a global pooling layer. Finally, a 10-way softmax layer is used to generate the final classification result.
fsFCNN-s (frequency sub-sampling controlled fully convolutional neural network with split frequency bands): Each input feature map is first split ('-s' in the model name) into two sub-feature maps along the frequency dimension. If there are $N$ frequency bins, the bins between 0 and $N/2$ are processed by an fsFCNN, and the bins between $N/2$ and $N$ are independently processed by another fsFCNN. The processing happens in parallel up to the ReLU-based layer. The two processed streams are then concatenated and processed by two further convolutional layers. Finally, a global pooling layer and a 10-way softmax are used to obtain the final scene classification decision.
Resnet (17-layer residual network): The Resnet model is a residual network [3]. We use the network structure proposed in [5], which has 17 convolutional layers. There is no frequency sub-sampling throughout the whole network. Each input feature map is divided into two sub-feature maps along the frequency dimension. To be specific, if we have $N$ frequency bins, the first and the second half are processed by two parallel stacks of convolutional layers. Thus, we have a two-stage model structure. Like FCNN, a global pooling layer and a 10-way softmax are used to obtain the final utterance-level prediction. Different from the structure in [5], in our final submission we double the number of filters in each convolutional layer, and we denote this model as Resnet-d.
Mobnet (MobileNet-v2): Mobnet is based on MobileNet-v2 [12]. The key feature of Mobnet is its low complexity despite the high accuracy that can be attained, as demonstrated in [12]. Mobnet uses lightweight depth-wise convolutions to process features in the intermediate expansion layer. We leveraged a relatively small Mobnet to tackle Task 1b.
3 Data Augmentation Strategy
A key element of our submission is the use of several data augmentation strategies. The first four approaches discussed below do not generate extra data, whereas the remaining schemes generate extra training data. In Task 1a, all the following data augmentation methods except channel confusion were used. In Task 1b, mixup, channel confusion, and spectrum augmentation are the only schemes employed.
Mixup: It was proposed in [14] and is often adopted to train ASC models. We use mixup with a fixed value of the parameter alpha. Mixup is performed at the mini-batch level: two data batches, along with their corresponding labels, are randomly mixed in each training step.
Random cropping: It was proposed in [5]. During the training stage, each input feature map is randomly cropped to a fixed, shorter length along the time axis. Because of constraints imposed by dynamic range quantization, random cropping is not applied in Task 1b.
Channel confusion: It is only used in Task 1b. Two channels in the input data are randomly swapped.
Spectrum augmentation: It was proposed in [11] and showed a significant boost in automatic speech recognition accuracy. In our implementation, we carry out spectrum augmentation on each input feature map at the mini-batch level: for each batch in a training step, every feature map is randomly masked along both the time and frequency axes. We set the time and frequency mask parameters to 10% of the respective dimensions.
Spectrum correction: It was described in [10] and demonstrated moderate device adaptation properties. In its original form, spectrum correction aims at transforming a given input spectrum to that of a reference, possibly ideal, device. Different from the original idea, we here propose spectrum correction as a data augmentation technique. To this end, we modify the original procedure as follows: (i) we create a reference device spectrum by averaging the spectra from all training devices except device A; (ii) we correct the spectrum of each training waveform collected with device A to obtain extra data.
Reverberation + Dynamic Range Compression (DRC): It is inspired by the procedure used by the organizers to generate the simulated devices, namely s1-s11. In fact, s1-s11 data is generated by adding reverberation followed by DRC to audio collected with device A.
Pitch shift: For each training waveform, we randomly shift the pitch by an amount drawn from a uniform distribution.
Speed change: For each training waveform, we randomly change the audio speed by a factor drawn from a uniform distribution. If the output waveform is longer than the original one, the extra samples are dropped from the end. If it is shorter, padding is applied until the original input length is reached.
Random noise: For each training waveform, random Gaussian noise is added.
Mix audios: We randomly mix two audio clips from the same acoustic scene class. Because the mixing is device-independent, this data augmentation scheme might help simulate a new "device."
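The four schemes that do not generate extra data (mixup, random cropping, channel confusion, and spectrum augmentation) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: feature batches are shaped (batch, time, frequency, channels), one mixing coefficient is drawn per pair of batches, channel confusion swaps the two channels of a two-channel input with probability 0.5, masked regions are set to zero, and `ASSUMED_ALPHA` and the crop length are placeholder values rather than the settings used in our experiments.

```python
import numpy as np

ASSUMED_ALPHA = 0.4   # placeholder value, not the setting used in the report

def mixup_batches(x1, y1, x2, y2, alpha, rng):
    """Mix two batches (and their one-hot labels) with one Beta-distributed weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def random_crop_time(x, crop_len, rng):
    """Crop every feature map in the batch to crop_len frames at a random offset."""
    start = rng.integers(0, x.shape[1] - crop_len + 1)
    return x[:, start:start + crop_len]

def swap_channels(x, rng):
    """Channel confusion: reverse the channel order of randomly chosen examples."""
    out = x.copy()
    flip = rng.random(len(x)) < 0.5
    out[flip] = out[flip][..., ::-1]
    return out

def mask_time_freq(x, max_frac, rng):
    """Zero one random time stripe and one random frequency stripe per example."""
    out = x.copy()
    n, n_time, n_freq = x.shape[:3]
    for i in range(n):
        t_w = rng.integers(0, int(max_frac * n_time) + 1)   # stripe width in frames
        t0 = rng.integers(0, n_time - t_w + 1)
        out[i, t0:t0 + t_w] = 0.0
        f_w = rng.integers(0, int(max_frac * n_freq) + 1)   # stripe width in bins
        f0 = rng.integers(0, n_freq - f_w + 1)
        out[i, :, f0:f0 + f_w] = 0.0
    return out

# Toy usage on random data standing in for two-channel feature maps.
rng = np.random.default_rng(0)
x1, x2 = rng.random((4, 50, 16, 2)), rng.random((4, 50, 16, 2))
y1, y2 = np.eye(3)[[0, 1, 2, 0]], np.eye(3)[[1, 1, 0, 2]]
x_mix, y_mix = mixup_batches(x1, y1, x2, y2, ASSUMED_ALPHA, rng)
cropped = random_crop_time(x_mix, crop_len=40, rng=rng)
swapped = swap_channels(x1, rng)
masked = mask_time_freq(x1, max_frac=0.10, rng=rng)
```

The mask width is capped at 10% of each dimension, matching the setting described for spectrum augmentation; all other numeric values are arbitrary.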
4 Quantization for Model Compression
In Task 1b, the goal is to keep the system size within 500 kilobytes (KB). A post-training quantization method, which is provided by TensorFlow 2 [1], is used to compress our neural models. Quantization not only reduces model size but also improves hardware accelerator latency with little degradation in classification accuracy. We opted for dynamic range quantization (DRQ) as a compression scheme. In DRQ, neural weights are quantized from floating point to integers with 8-bit precision. Leveraging DRQ, we thus converted our neural architectures from a 32-bit TensorFlow format to an 8-bit TensorFlow Lite format, which compresses each model to a fraction of its original size. According to our experimental evidence, this compression method resulted in only a minor performance drop.
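To make the idea of 8-bit weight quantization concrete, the following NumPy sketch quantizes a float32 weight tensor to int8 with a single symmetric scale and measures the storage saving and the rounding error. It is a simplified illustration, not TensorFlow Lite's actual implementation; in TensorFlow, dynamic range quantization is usually requested through `tf.lite.TFLiteConverter` with `optimizations = [tf.lite.Optimize.DEFAULT]`. The function names and the random weight tensor are assumptions made for the example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

# Random tensor standing in for the kernel of one convolutional layer.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(3, 3, 64, 64)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_ratio = q.nbytes / weights.nbytes              # int8 vs. float32 storage
max_error = float(np.max(np.abs(weights - restored)))
```

The raw weight storage shrinks by a factor of four, and the per-weight error is bounded by half a quantization step; the size of a complete model file also depends on the file format and on which tensors are quantized.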
5 Experimental Setup & Results
5.1 Feature Extraction
DCASE 2020 ASC audio clips have a fixed length of 10 seconds. Log-mel filter bank (LMFB) features were used in our experiments as audio features. The input audio waveform is analyzed with a short-time Fourier transform using a fixed number of FFT points, window size, and frame shift. The librosa [6] library is used to generate the LMFBs, and the HTK formula for the Mel scale is adopted. Due to the different sampling rates in Tasks 1a and 1b, the final spectrogram has 431 time bins in Task 1a and 469 time bins in Task 1b, while the number of frequency bins is 128 in both tasks. Log-mel deltas and delta-deltas without padding were also computed, which reduced the number of time frames to 423 (Task 1a) or 461 (Task 1b). The final input tensor therefore has 423 time frames and 128 frequency bins for Task 1a, and 461 time frames and 128 frequency bins for Task 1b. Before feeding the input features into the CNN classifier, we scaled each value into [0,1].
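The frame counts above can be reproduced with a short NumPy sketch of deltas computed without padding. The 5-frame regression window is an assumption, chosen because applying it twice removes 8 frames and so matches the reduction from 431 to 423; the stacking order, the global min-max scaling, and the random array standing in for a single-channel log-mel spectrogram are likewise illustrative choices.

```python
import numpy as np

def delta_valid(x):
    """5-frame regression delta along the time axis (axis 0), no padding.

    The output is 4 frames shorter than the input.
    """
    return ((x[3:-1] - x[1:-3]) + 2.0 * (x[4:] - x[:-4])) / 10.0

def stack_with_deltas(logmel):
    """Stack log-mel, delta, and delta-delta, trimmed to a common length."""
    d1 = delta_valid(logmel)          # T - 4 frames
    d2 = delta_valid(d1)              # T - 8 frames
    return np.stack([logmel[4:-4], d1[2:-2], d2], axis=-1)

def scale_01(x):
    """Min-max scale all values of a tensor into [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

# Random data standing in for a Task 1a log-mel spectrogram (431 frames, 128 bins).
rng = np.random.default_rng(0)
logmel_1a = rng.normal(size=(431, 128))
features_1a = scale_01(stack_with_deltas(logmel_1a))

# Sanity check of the delta formula: a unit-slope ramp has delta 1 everywhere.
ramp_delta = delta_valid(np.arange(10.0))
```

With 469 input frames, the same procedure yields 461 frames, in line with the Task 1b figure.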
5.2 Model Training
For the train-test split, we adopt the officially recommended split of the development material. In Task 1a, there are 13965 training audio clips and 2970 test audio clips. The training set includes audio from devices A, B, C, and s1-s3. The test set covers data from those six devices and extra data from the unseen devices s4, s5, and s6. Device A data, with over 10K utterances, dominates the training set. In the test set, the number of waveforms from each device is the same. In Task 1b, there are 9185 training waveforms and 4185 test waveforms. All audio clips are from device A. Stochastic gradient descent (SGD) with a cosine-decay-restart learning rate scheduler is used to train our models. The maximum and minimum learning rates are 0.1 and 1e-5, respectively. In our final submission, all development data is used. Since no validation data is then available, we average the outputs of the models obtained when the learning rate is around its minimum value. Keras [2] is used to implement all our CNN-based models.
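A cosine-decay schedule with restarts between the stated maximum (0.1) and minimum (1e-5) learning rates can be sketched in plain Python as below. The cycle lengths (a first cycle of 10 epochs, doubling after every restart) and the total of 70 epochs are assumptions for illustration only; the report does not state them. The last helper lists the epochs at which the learning rate reaches its minimum, which is where model outputs would be collected for averaging.

```python
import math

LR_MAX, LR_MIN = 0.1, 1e-5

def cosine_restart_lr(epoch, first_cycle=10, cycle_mult=2):
    """Learning rate at a given epoch under cosine decay with warm restarts."""
    cycle_len, start = first_cycle, 0
    # Find the cycle that contains this epoch.
    while epoch >= start + cycle_len:
        start += cycle_len
        cycle_len *= cycle_mult
    # progress runs from 0 (restart) to 1 (last epoch of the cycle).
    progress = (epoch - start) / (cycle_len - 1) if cycle_len > 1 else 1.0
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))

def minimum_lr_epochs(total_epochs):
    """Epochs at which the schedule reaches its minimum learning rate."""
    return [e for e in range(total_epochs)
            if math.isclose(cosine_restart_lr(e), LR_MIN)]

schedule = [cosine_restart_lr(e) for e in range(70)]
snapshot_epochs = minimum_lr_epochs(70)
```

Under these assumed cycle lengths, snapshots would be taken at the end of each cycle, and their output scores averaged at test time.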
5.3 Results on Task 1a
Due to space constraints, we report only some of the evaluation results collected on Task 1a in Table 1. In the training set, device A data accounts for around 75%, and devices B, C, and s1-s3 each account for around 5%. Thus, based on the device information of the data, we divide the test set into four subsets, which represent real source data (device A), real target data (devices B and C), seen simulated target data (devices s1-s3), and unseen simulated target data (devices s4-s6).
For comparison, model (0) gives the accuracy of the baseline ASC system provided by the DCASE 2020 organizers. With systems (1) to (3), we investigate the effect of scaling and spectrum augmentation. Scaling features to [0,1] gives a 0.8% absolute improvement, and that gain comes mainly from devices s1-s6. Spectrum augmentation boosts the ASC accuracy from 71.0% to 71.6%, yet accuracy on unseen devices drops significantly. Model (4) doubles the Resnet parameters, which leads to a significant performance increase. Leveraging all data augmentation schemes, we can further improve ASC accuracy up to 74.6%, which is the best result using the Resnet-based models discussed earlier.
Systems (6) to (8) are based on an FCNN architecture. The fsFCNN-s model is not evaluated on this train-test setup, but it is used in our final submission. Moving from Resnet-d to FCNN and leveraging all available training data (original and augmented), ASC accuracy goes from 74.6% to 76.9%, which is the best result using a stand-alone FCNN-based architecture. Comparing models (6) and (7) shows the effect of the reverberation + DRC data augmentation: it provides a large improvement on s1-s6 data, especially on the unseen devices s4-s6. That is expected, since s1-s6 data is generated by adding reverberation and applying DRC to device A data. Models (7), (8), and fsFCNN-s have very similar structures. Nonetheless, when we compared the classification results of those three models per test utterance, we observed a difference of over 20%. Therefore, we keep all three models for the model ensemble.
Model ensembling is known to boost ASC accuracy, and we use a simple non-weighted average score fusion for system combination. From the results for systems (9) and (10), we can see that a two-model combination outperforms any stand-alone system. By increasing the number of models in the ensemble, we improve the ASC accuracy from 77.5% (two models) to 78.1% (three models). Finally, after model ensembling, a three-class classifier is integrated by score fusion using the two-stage fusion scheme discussed in Section 2. The accuracy of the three-class classification system is 93.2%. From systems (11) and (12), we can argue that the proposed two-stage fusion strategy significantly improves ASC accuracy. In a two-model ensemble, the ASC accuracy increases from 77.5% to 80.2%; and further increases from 79.4% to 81.9% with a three-model ensemble.
For our final system submissions, all the development data is used. Due to resource and time limitations, some of our final submissions were not tested on the above-mentioned train-test setup of the development data. The four submissions are: 1) an average ensemble of Resnet-d, Resnet-d with attention, fsFCNN-s, and fsFCNN with attention; 2) an average ensemble of Resnet-d, FCNN, fsFCNN, and fsFCNN-s trained with different data strategies; 3) an average ensemble of all models in 1) and 2); and 4) an average ensemble of all models in 1) and 2) except the Resnet-based models.
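The non-weighted average score fusion, and the per-utterance comparison of two models' predictions, can be sketched as follows. The function names and the toy score matrices (four clips, three classes) are made up for illustration and are not results from our systems.

```python
import numpy as np

def average_fusion(score_list):
    """Non-weighted average of the class scores of several models."""
    return np.mean(np.stack(score_list, axis=0), axis=0)

def prediction_overlap(scores_a, scores_b):
    """Fraction of clips on which two models predict the same class."""
    return float(np.mean(scores_a.argmax(axis=1) == scores_b.argmax(axis=1)))

# Toy scores for 4 clips and 3 classes from two hypothetical models.
model_a = np.array([[0.60, 0.30, 0.10],
                    [0.20, 0.50, 0.30],
                    [0.40, 0.35, 0.25],
                    [0.10, 0.20, 0.70]])
model_b = np.array([[0.50, 0.40, 0.10],
                    [0.30, 0.30, 0.40],
                    [0.20, 0.60, 0.20],
                    [0.20, 0.20, 0.60]])

overlap = prediction_overlap(model_a, model_b)
ensemble_pred = average_fusion([model_a, model_b]).argmax(axis=1)
```

Here the two models agree on only half of the clips, and the averaged scores settle each disagreement in favor of the more confident model, which is the kind of complementarity that motivates keeping structurally similar models in the ensemble.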
5.4 Results on Task 1b
Table 2 shows evaluation results on Task 1b. The Mobnet and small-FCNN models attain an ASC accuracy of 95.2% and 96.4%, respectively. Quantization reduces each model to a fraction of its original size. For Mobnet, we observe a 0.4% performance drop; for small-FCNN, the drop is 0.1%. Next, we carry out model ensembling, but we further reduce the number of Mobnet and small-FCNN parameters before system combination to keep the final combined model size under 500 KB. Therefore, the systems in rows 3 and 4 of Table 2 are slightly different from the models in rows 1 and 2. In Table 2, we present the results of the logistic regression based model ensemble. As expected, the model ensemble achieves better accuracy than a stand-alone system even after quantization.
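One way to realize a logistic regression based ensemble is to fit a multinomial logistic regression on the concatenated class scores of the two models. The NumPy sketch below is a hedged illustration of that idea only: the training procedure (plain gradient descent), the hyper-parameters, the function names, and the synthetic three-class scores are all assumptions, and the fusion model actually used for our submissions may differ.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_lr_fusion(scores_a, scores_b, labels, n_classes, lr=0.5, steps=500):
    """Fit multinomial logistic regression on the concatenated model scores."""
    x = np.hstack([scores_a, scores_b, np.ones((len(labels), 1))])  # bias column
    y = np.eye(n_classes)[labels]
    w = np.zeros((x.shape[1], n_classes))
    for _ in range(steps):
        grad = x.T @ (softmax(x @ w) - y) / len(labels)  # cross-entropy gradient
        w -= lr * grad
    return w

def predict_lr_fusion(w, scores_a, scores_b):
    """Predict fused class indices from the two models' scores."""
    x = np.hstack([scores_a, scores_b, np.ones((len(scores_a), 1))])
    return softmax(x @ w).argmax(axis=1)

def noisy_scores(labels, rng):
    """Synthetic stand-in for one model's softmax outputs."""
    logits = 2.0 * np.eye(3)[labels] + rng.normal(size=(len(labels), 3))
    return softmax(logits)

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
scores_a, scores_b = noisy_scores(labels, rng), noisy_scores(labels, rng)

w = fit_lr_fusion(scores_a, scores_b, labels, n_classes=3)
fused_pred = predict_lr_fusion(w, scores_a, scores_b)
fused_acc = float(np.mean(fused_pred == labels))
```

Compared with a plain average, the learned weights allow the fusion to trust each model differently per class, at the cost of needing labeled data to fit them.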
For our final submission, all the development data is used. Among our four submissions, the first is a single small-FCNN model, and the other three are ensembles: an average ensemble of Mobnet and small-FCNN, a logistic regression based ensemble of Mobnet and small-FCNN, and an ensemble of two small-FCNN models.
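Table 2: Task 1b accuracy in % (model size in parentheses) before and after quantization.
|Model|Before quantization|After quantization|
|Mobnet|95.2 (3.2M)|94.8 (411K)|
|small-FCNN|96.4 (2.8M)|96.3 (357K)|
|Ensemble (row 3)|96.8 (1.8M+1.9M)|96.7 (490K)|
|Ensemble (row 4)|96.5 (1.9M+2.1M)|96.3 (499K)|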
6 Discussion & Conclusion
Although we achieve over 80% and 95% ASC accuracy on Task 1a and Task 1b, respectively, there is still a large prediction difference between systems with very similar performance. For example, the prediction overlap of FCNN and fsFCNN is only 77%, which implies that some audio clips are predicted differently by systems with similar accuracy. By listening to the data, we realized that several audio clips are difficult to classify even for a human listener, because some acoustic scenes are fuzzy by design. To give an idea: voices of people in an indoor area can come from either a shopping mall or an airport, because the same micro-scene, e.g., a small shop, can be found in either.
In this technical report, we have described our submitted systems for DCASE 2020 Tasks 1a and 1b. For Task 1a, a two-stage classification system is designed, which includes both 3-class and 10-class classifiers. The final predicted class is determined by the score fusion of these two classifiers. Different CNN-based classifiers and data augmentation techniques are investigated in our experiments. For Task 1b, small-size FCNN and Mobnet models are used as classifiers. Quantization is applied to compress the ASC model to less than 500 KB. In our evaluation on the development data, we achieve an accuracy of 81.9% on Task 1a and 96.7% on Task 1b.
References
[1] TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] (2015) Keras. https://keras.io.
[3] (2016) Deep residual learning for image recognition. pp. 770–778.
[4] (2020) Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020). Submitted.
[5] (2020) Acoustic scene classification using deep residual networks with late fusion of separated high and low frequency paths. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 141–145.
[6] (2015) Librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Vol. 8.
[7] (2018) Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2), pp. 379–393.
[8] (2017-11) DCASE 2017 challenge setup: tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92.
[9] (2018-11) A multi-device dataset for urban acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 9–13.
[10] (2020) Acoustic scene classification for mismatched recording devices using heated-up softmax and spectrum correction. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 126–130.
[11] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[12] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[13] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[14] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.