1 Introduction
The problem of analyzing an acoustic soundscape and identifying the environment in which a sound was recorded is known as Acoustic Scene Classification (ASC) [1, 2]. The objective is to assign to the input audio stream a semantic label (acoustic scene) that characterizes the type of environment in which it was recorded – for example shopping mall, airport, or street. The problem has been well explored as a single-label classification task [3, 4]. Due to the possible presence of diverse sound events in a sound scene, developing a descriptive representation for ASC is known to be a difficult task [5].
The DCASE Challenges, started in 2013, provide benchmark data for computational sound scene analysis research, including tasks for the detection and classification of acoustic scenes and events, motivating researchers to work further in this area. Looking at the current trend of challenge submissions in the ASC task, it is clear that researchers are moving towards deep learning methods for system development [3, 4, 6], because current hand-crafted features are not sufficient to capture the discerning properties of soundscapes [7]. Over time, data-driven approaches have been taking over conventional methods, which require more expert knowledge for designing and choosing features. Most published systems use a combination of audio descriptors and learning techniques, with a growing inclination towards deep learning [8, 9].
The literature on ASC is vast and much has been done in system design. The earliest works in this field applied methods from speech recognition – for example, features like Mel-frequency cepstral coefficients [10], normalized spectral features, and low-level features [2, 11] such as the zero-crossing rate. The general architecture follows a pipeline of extracting frame-by-frame hand-crafted audio features, or learning them with methods such as matrix decomposition of spectral representations (log mel-spectrograms [12], Constant-Q-transformed spectrograms [13]), and then performing machine-learning-based classification. The final decision is a combination of frame-wise outputs, obtained for example by majority voting or mean probability. Many systems incorporate deep learning, generally by using some kind of time-frequency representation as the input for training deep neural networks (DNNs) or convolutional neural networks (CNNs) [14, 15]. Some methods also exploit ideas from the image processing literature, for example training a classifier on histogram-of-gradient representations of the spectrograms of audio frames [16, 17].
CNNs are used extensively in ASC. Some systems use convolutional layers with big receptive fields (kernels) to capture global correlations in the spectrograms [18, 19], while others use smaller kernels focusing on local spatial information [20, 14]. We aim to create a better understanding of how CNNs can be used to model acoustic scenes, rather than to achieve state-of-the-art results. Our work shows that, depending on the scene class, there is a specific frequency band showing the most activity, hence providing discriminative features for that class; to the authors' knowledge, this has not been considered in earlier studies. We first develop the motivation for using spectrogram crops, which we term sub-spectrograms. Finally, we propose a CNN model, SubSpectralNet, that uses the sub-spectrograms to capture enhanced features, resulting in superior performance over a model with a similar number of parameters that does not incorporate sub-spectrograms (discussed in Section 4). For all experiments, we use the DCASE 2018 ASC development dataset [21], which contains 6122 two-channel 10-second samples for training and 2518 samples for testing, divided into ten acoustic scenes.
The rest of the paper is organized as follows – in Section 2, we develop a basic statistical model for ASC, which serves as the motivation for the design of the proposed CNN architecture. Section 3 discusses the methodology used to develop the CNN model, and Section 4 describes the experiments performed to prove the efficacy of the system. Finally, we conclude the work in Section 5.
2 Statistical Analysis of Spectrograms
Magnitude spectrograms are two-dimensional representations over time and frequency, which are very different from real-life images. In spectrograms, there is a clear variation along the frequency axis. While images have local relationships over both spatial dimensions, spectrograms have definitive local relationships in the time dimension, but not in the frequency dimension. Along the frequency dimension, some types of sounds exhibit local relationships (e.g. sounds with broadband spectra, like noise-like sounds), some have non-local relationships (e.g. harmonic sounds, where there are relationships between non-adjacent frequency bins), and some have no local relationships at all.
We first create a simple mathematical model to gain insight into how CNNs can leverage time-frequency features efficiently. We extract log mel-spectrograms using a 2048-point short-time Fourier transform (STFT) over 40 ms Hamming-windowed frames with 20 ms overlap, transform the result into 200 Mel-scale band energies, and finally take the logarithm of these energies. Next, we perform bin-wise normalization of the sample space and obtain 6122 samples of size (mel-bins × time-indices). We then concatenate all the samples of the same class along the time dimension and average over the temporal direction to obtain ten distributions, one vector of size 200 per class. We observe a clear variation in the class-wise activation of different mel-bins. For more clarity, we perform bin-wise classification of the test samples using the ten 200-dimensional reference mean vectors.
For each test sample, we compute the mean over the temporal direction to get a 200-dimensional vector; the hypothesis is that this vector should have a distribution similar to the mean vector of the corresponding class. Mathematically, we compute the distance to each reference mean vector, and the class for which this distance is minimal should be the correct label.
Instead of computing the distance measure over the whole 200-dimensional vector, we compute separate distances for each mel-bin, because we are interested in analyzing how those bins are activated for specific classes. This is equivalent to having 200 small classifiers. Finally, using these 200 outputs for all the test samples, we create one normalized histogram per class containing the frequencies of correct classifications of the corresponding mel-bins, shown in Fig. 1.
We also calculate the chi-square distance between these histograms to see how similar the class-wise distributions are. For this, we normalize the histograms so that their maximum value is one, and then compute the distance; the smaller the distance, the more confusion exists between the two classes. We aim to obtain a matrix that has some resemblance to a confusion matrix. After obtaining the symmetric matrix of distances between the classes, we normalize it by dividing by its maximum value. Then, we apply the mathematical transform given in Eq. (1), where d(i, j) is the prior distance value at matrix indices i and j, and k is a constant parameter which, when increased, enhances the differences between values in the higher range. We choose k such that the matrix resembles a confusion matrix. Next, we again normalize the matrix by its maximum value and, lastly, subtract the values from one. The output of this is shown in Fig. 2.
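The histogram-comparison procedure above can be sketched as follows. This is a minimal NumPy sketch on toy data; the exact chi-square form, the variable names, and the demo histograms are our assumptions, and the intermediate stretching transform of Eq. (1) is omitted:

```python
import numpy as np

def chi_square(p, q, eps=1e-12):
    # symmetric chi-square distance between two histograms (this exact
    # form is an assumption; several variants exist in the literature)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def similarity_matrix(histograms):
    """histograms: (n_classes, n_bins) counts of correct per-bin
    classifications. Returns a confusion-matrix-like similarity matrix."""
    # normalize each histogram so its maximum value is one
    h = histograms / histograms.max(axis=1, keepdims=True)
    n = len(h)
    d = np.array([[chi_square(h[i], h[j]) for j in range(n)]
                  for i in range(n)])
    d /= d.max()        # normalize by the maximum value
    # (the paper additionally applies the stretching transform of Eq. (1)
    #  and re-normalizes here; that step is omitted in this sketch)
    return 1.0 - d      # subtract from one: larger value = more confusion

rng = np.random.default_rng(1)
H = rng.random((3, 10)) + 0.1   # toy per-bin correct-classification histograms
S = similarity_matrix(H)
```

The resulting matrix is symmetric with ones on the diagonal, so a high off-diagonal value indicates a pair of classes whose mel-bin activation patterns are easily confused.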
We also compute the Kullback-Leibler divergence [22] and the Hellinger distance [23] over these histograms; they result in very similar matrices, which shows that the statistical model is robust. We can clearly see that some classes have higher confusion (for example, "metro_station" and "metro"; "shopping_mall" and "airport"), which resembles the confusion matrices obtained from the baseline model results [21] and from the proposed CNN model (shown in Figure 4).
In the histograms obtained, we observe a definite variation in the activation of mel-bins and sub-bands that is specific to every scene. For example, the "metro" class has more activation in the lower frequency bins, while "bus" has less activation in the mid-frequency bins. For "park" or "street_traffic", nearly all mel-bins are active, and from the DCASE 2018 baseline results [21] we can see that these classes have relatively superior performance. We use these observations to develop SubSpectralNet, which is discussed in the next section.
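The per-bin nearest-mean analysis of this section can be sketched in NumPy as follows. This is a minimal sketch on synthetic data; the array shapes, variable names, and absolute-difference distance are our assumptions:

```python
import numpy as np

def class_mean_vectors(samples, labels, n_classes):
    """Average each sample over time, then average over all samples
    of a class to get one reference mean vector per class."""
    temporal_means = samples.mean(axis=2)            # (n_samples, n_bins)
    return np.stack([temporal_means[labels == c].mean(axis=0)
                     for c in range(n_classes)])     # (n_classes, n_bins)

def binwise_classify(sample, references):
    """Treat every mel-bin as a tiny classifier: for each bin, pick the
    class whose reference value is closest to the sample's bin mean."""
    v = sample.mean(axis=1)                          # (n_bins,)
    return np.abs(references - v).argmin(axis=0)     # predicted class per bin

# synthetic demo: 2 classes, 4 mel-bins, 50 time frames
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4, 50))
y = np.repeat([0, 1], 10)
X[y == 1, 0] += 5.0                                  # class 1 is louder in bin 0
refs = class_mean_vectors(X, y, 2)
preds = binwise_classify(X[-1], refs)                # last sample is from class 1
```

Counting, per class, how often each bin's tiny classifier is correct over a test set yields exactly the per-class histograms of Fig. 1.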
3 Designing SubSpectralNet
We start from the DCASE 2018 baseline system for the ASC task and gradually develop the proposed network. The baseline system is based on a CNN: mel-band energies with 40 mel-bins are extracted for every sample using a 2048-point STFT with a 40-millisecond frame size and 50% overlap, and the samples are further normalized. These samples are passed to a CNN consisting of two convolutional layers with same-padding, having 32 and 64 kernels respectively, each followed by batch normalization and ReLU activation. After each conv-layer, a max-pooling layer is used to decrease the size of the feature space, and a dropout rate of 30% is applied to prevent overfitting. Finally, a fully connected (FC) layer with 100 neurons is applied over the flattened output, which is further connected to the output (softmax) layer.
We train DCASE 2018 baseline models on different channels of the audio dataset – the left channel, the right channel, the average-mono channel and, lastly, both channels fed to the CNN model. The best results are obtained using both channels, which is expected, as binaural input gives more information on the prominent sound events in soundscapes, for example a car passing by in "street_traffic".
3.1 Incorporating Sub-Spectrograms
From the analysis in Section 2, we infer that using bigger convolutional kernels over spectrograms is not a good idea, because they tend to combine global context and lose the local time-frequency information. We perform an experiment (discussed in Section 4) in which we gradually increase the size of the kernels in the first conv-layer of the baseline system; the accuracy decreases as the kernel size increases. Spectrograms have a definite variation along the frequency dimension. Using smaller convolutional kernels over complete spectrograms works well, because CNNs are very powerful in fitting these receptive fields to the variances in the data. But the fact that spectrograms exhibit these variations can be exploited further.
Building upon this idea, we propose SubSpectralNet, whose architecture is shown in Figure 3. SubSpectralNet creates horizontal slices of the spectrogram and trains separate CNNs on these sub-spectrograms, finally acquiring band-level relations in the spectrograms to classify the input using diversified information.
We extract the log mel-energy spectrograms for the samples and perform bin-wise normalization. To create sub-spectrograms, we design a new layer (which we term the SubSpectral layer) that splits the spectrogram into several horizontal crops. It takes three inputs: the input spectrogram of dimension (C × M × T), with C, M and T being the number of channels, mel-bins and time-indices respectively; the sub-spectrogram size S; and the mel-bin hop size (vertical hop) H. This results in N frequency-time sub-spectrograms of dimension (C × S × T) for every sample, where N = ⌊(M − S)/H⌋ + 1.
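The slicing performed by the SubSpectral layer can be sketched as follows; this is a minimal NumPy sketch, and the function and variable names are ours:

```python
import numpy as np

def sub_spectrograms(spec, sub_size, hop):
    """Split a (channels, mel_bins, time) spectrogram into horizontal
    crops of `sub_size` mel-bins taken every `hop` bins."""
    n_bins = spec.shape[1]
    n_crops = (n_bins - sub_size) // hop + 1     # N = floor((M - S) / H) + 1
    return [spec[:, i * hop : i * hop + sub_size, :] for i in range(n_crops)]

spec = np.zeros((2, 40, 500))                    # stereo, 40 mel-bins, 500 frames
crops = sub_spectrograms(spec, sub_size=20, hop=10)
# 40 mel-bins with size 20 and hop 10 yield 3 overlapping sub-spectrograms
```

With a hop smaller than the sub-spectrogram size, adjacent crops overlap, so neighbouring bands are seen by more than one sub-network.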
3.2 SubSpectralNet – Architecture Details
The two-channel sub-spectrograms are independently passed to two conv-layers with same-padding, having 32 and 64 kernels respectively. After each conv-layer, there is a batch normalization layer, a ReLU activation, a max-pooling layer, and finally a dropout of 30%. After the second pooling stage, we flatten the output and add an FC layer with 32 neurons, ReLU activation and 30% dropout, followed by a softmax layer. We call these the sub-classifiers of the SubSpectralNet. We do not remove these softmax outputs from the final network, because they force the sub-networks to learn to classify the sample based on only a part of the spectrogram. We keep most parameters the same as in the DCASE 2018 baseline model for a fair comparison. We believe that sub-spectrograms could also be incorporated into more complex architectures [24, 25, 26] to surpass the state of the art in ASC performance.
To capture the global correlation (or decorrelation) between frequency bands, we concatenate the FC (ReLU) layers of the sub-networks and train a DNN with fully connected hidden layers on top. We term this the global classification sub-network.
All cross-entropy errors from the global classifier and the sub-classifiers are backpropagated simultaneously to train a single network. The sub-classifiers learn to classify using specific bands of the spectrograms, while the global classifier combines this information and learns discerning features at the band level. This training method results in improved performance and faster convergence of the model with minimal additional complexity [24].
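The joint training of sub-classifiers and the global classifier can be sketched with the Keras functional API roughly as follows. This is a sketch under assumptions: the 7×7 kernel, the pooling sizes, the number of sub-spectrograms, the hidden-layer size of the global sub-network, and all names are our placeholders, not necessarily the paper's exact configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_subspectralnet(n_subs=3, sub_bins=20, frames=500,
                         channels=2, n_classes=10):
    inputs, sub_outs, sub_feats = [], [], []
    for i in range(n_subs):
        inp = keras.Input(shape=(sub_bins, frames, channels), name=f"sub{i}")
        x = layers.Conv2D(32, (7, 7), padding="same")(inp)   # kernel size assumed
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D((5, 5))(x)                   # pool sizes assumed
        x = layers.Dropout(0.3)(x)
        x = layers.Conv2D(64, (7, 7), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D((4, 100))(x)
        x = layers.Dropout(0.3)(x)
        f = layers.Dense(32, activation="relu")(layers.Flatten()(x))
        f = layers.Dropout(0.3)(f)
        sub_feats.append(f)
        # sub-classifier head: softmax over only this band's features
        sub_outs.append(layers.Dense(n_classes, activation="softmax",
                                     name=f"subcls{i}")(f))
        inputs.append(inp)
    # global classification sub-network over the concatenated band features
    g = layers.Concatenate()(sub_feats)
    g = layers.Dense(64, activation="relu")(g)               # hidden size assumed
    global_out = layers.Dense(n_classes, activation="softmax", name="global")(g)
    model = keras.Model(inputs, sub_outs + [global_out])
    # one cross-entropy per softmax head; Keras sums them, so all errors
    # are backpropagated simultaneously through a single network
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Compiling with one categorical cross-entropy per output makes Keras sum the losses, so the sub-classifier and global errors are indeed backpropagated together, as described above.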
We create confusion matrices (shown in Fig. 4) from the outputs of these sub-classifiers and of the global classification model discussed in Section 4. We observe that the statistical motivation given in Section 2 fits the results well. For example, for the "airport" class, the statistical distribution indicates that lower frequencies are more effective for classification, and indeed the low-band sub-classifier shows better results for this class in the confusion matrix. For the "bus" class, the mid-band sub-classifier shows relatively better results. For most classes, the global classifier achieves better results than any sub-classifier. It is interesting to note that for some classes, like "public_square" or "tram", a sub-classifier performs better than the global classifier, which could mean that using the complete spectrogram adds outliers, and it is better to use a specific band of the spectrogram in such cases.
4 Experiments
We demonstrate the potential of SubSpectralNet on the DCASE 2018 development public dataset (Task 1A) and compare the results with the DCASE 2018 baseline. We use the dcase_util toolbox [27] to extract features from the DCASE 2018 dataset. We implement SubSpectralNet in Keras with the TensorFlow backend, and the experiments are performed on an NVIDIA Titan Xp GPU with 12 GB RAM. We train all models three times for 200 epochs and report the average best accuracy, using Adam as the optimizer. The following experiments are performed in this work.
We train DCASE 2018 baseline models on different channels of the audio dataset, and the test accuracies achieved are 63.24% (left channel), 61.83% (right channel), 64.91% (average-mono channel) and 65.66% (stereo channels). We also train the DCASE 2018 baseline model with increasing kernel sizes in the first CNN layer; the corresponding test accuracies are 65.66%, 65.23%, 65.08% and 62.80%, respectively. This shows that bigger receptive fields tend to combine information on a larger scale, losing local spatial information. As a result, we choose the smallest kernel size with stereo input for SubSpectralNet.
We train a SubSpectralNet model on log mel-energy spectrograms with 40 mel-bins, a sub-spectrogram size of 20 mel-bins and a mel-bin hop size of 10 (331K model parameters). The resulting accuracy is 72.18%. The confusion matrices computed for this model are shown in Figure 4. We also plot test accuracy versus training epoch, comparing the performance of the DCASE 2018 baseline (two-channel model) and this SubSpectralNet model, shown in Figure 5. It can be seen that SubSpectralNet (global classifier) converges relatively faster and to a superior test accuracy.
To demonstrate the importance of the sub-classifiers, we train a SubSpectralNet model excluding the sub-classifier (softmax-layer) backpropagation, using only the global classification sub-network (330K model parameters). This achieves an accuracy of 68.79%, compared to 72.18% with the sub-classifiers, which verifies their significance. More experiments with 40-mel-bin log mel-energy spectrograms are shown in Figure 6 (a).
The number of parameters of a CNN is one of the major criteria for comparing two models. The DCASE 2018 baseline model on 40 mel-bins has 117K parameters (using two-channel input), while the SubSpectralNet model with a sub-spectrogram size of 20 and a mel-bin hop size of 10 has 331K parameters. To prove the efficacy of SubSpectralNet, we modify the DCASE 2018 baseline model by doubling the number of kernels in both conv-layers (to 64 and 128). This model, with 434K parameters, achieves 66.79% accuracy, which is 5.39% lower than the accuracy of the proposed model. This supports the claim that fitting separate kernels (training separate CNNs) over separate bands of spectrograms learns more salient features than directly training a CNN on full spectrograms.
Considering that 200-mel-bin spectrograms achieve better performance than spectrograms with fewer mel-bins [19], we train a DCASE 2018 baseline model on 200 mel-bins; the accuracy achieved is 71.94%. It is interesting to note that SubSpectralNet with only 40 mel-bins achieves comparable, slightly superior accuracy. We also train various SubSpectralNets on 200 mel-bins, with results shown in Fig. 6 (b) and (c). The best accuracy achieved is 74.08%, using a sub-spectrogram size of 30 and a mel-bin hop size of 10, an overall increase of +14% over the DCASE 2018 baseline [21].
5 Conclusions
In this paper, we introduce a novel approach for using spectrograms in convolutional neural networks in the context of acoustic scene classification. First, we show through the statistical analysis of Sec. 2 that specific bands of mel-spectrograms carry more discriminative information than other bands, and that these bands are specific to every soundscape. Based on these inferences, we propose SubSpectralNets, in which we first design a new convolutional layer that splits the time-frequency features into sub-spectrograms and then merges the band-level features at a later stage for global classification. The effectiveness of SubSpectralNet is demonstrated by a relative improvement of +14% in accuracy over the DCASE 2018 baseline model.
SubSpectralNets also have some limitations, including the fact that for some classes the sub-classifiers perform better than the global classifier. Also, in the current model, parameters like the sub-spectrogram size and the mel-bin hop size have to be specified manually. One way to address this could be to use the statistical analysis to choose the most appropriate parameters. In future work, we plan to further improve the performance of this network, for example by incorporating well-founded CNN architectures like Squeeze-and-Excitation networks [25] and Densely Connected Convolutional Networks [26].
References
 [1] Tuomas Virtanen, Mark D Plumbley, and Dan Ellis, Computational analysis of sound scenes and events, Springer, 2018.
 [2] Antti J Eronen, Vesa T Peltonen, Juha T Tuomi, Anssi P Klapuri, Seppo Fagerlund, Timo Sorsa, Gaëtan Lorho, and Jyri Huopaniemi, “Audio-based context recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2006.
 [3] Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
 [4] Annamaria Mesaros, Toni Heittola, Emmanouil Benetos, Peter Foster, Mathieu Lagrange, Tuomas Virtanen, and Mark D Plumbley, “Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 2, pp. 379–393, 2018.

 [5] Yifang Yin, Rajiv Ratn Shah, and Roger Zimmermann, “Learning and fusing multimodal deep features for acoustic scene categorization,” in 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018, pp. 1892–1900.
 [6] Tuomas Virtanen, Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Emmanuel Vincent, Emmanouil Benetos, and Benjamin Martinez Elizalde, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Tampere University of Technology, Laboratory of Signal Processing, 2017.
 [7] Mathieu Lagrange, Grégoire Lafay, Boris Defreville, and Jean-Julien Aucouturier, “The bag-of-frames approach: a not so sufficient model for urban soundscapes,” The Journal of the Acoustical Society of America, vol. 138, no. 5, pp. EL487–EL492, 2015.
 [8] Victor Bisot, Romain Serizel, Slim Essid, and Gaël Richard, “Feature learning with matrix factorization applied to acoustic scene classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1216–1229, 2017.
 [9] Huy Phan, Lars Hertel, Marco Maass, Philipp Koch, and Alfred Mertins, “Label tree embeddings for acoustic scene classification,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 486–490.

 [10] Jean-Julien Aucouturier, Boris Defreville, and Francois Pachet, “The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music,” The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881–891, 2007.
 [11] Jurgen T Geiger, Bjorn Schuller, and Gerhard Rigoll, “Large-scale audio feature extraction and SVM for acoustic scene classification,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–4.
 [12] Benjamin Cauchi, Mathieu Lagrange, Nicolas Misdariis, and Arshia Cont, “Saliency-based modeling of acoustic scenes using sparse non-negative matrix factorization,” in Image Analysis for Multimedia Interactive Services (WIAMIS), 2013 14th International Workshop on. IEEE, 2013, pp. 1–4.
 [13] Victor Bisot, Romain Serizel, Slim Essid, and Gaël Richard, “Acoustic scene classification with matrix factorization for unsupervised feature learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 6445–6449.
 [14] Michele Valenti, Aleksandr Diment, Giambattista Parascandolo, Stefano Squartini, and Tuomas Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2016), Budapest, Hungary, 2016.
 [15] Karol J Piczak, “Environmental sound classification with convolutional neural networks,” in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on. IEEE, 2015, pp. 1–6.
 [16] Victor Bisot, Slim Essid, and Gaël Richard, “Hog and subband power distribution image features for acoustic scene classification,” in Signal Processing Conference (EUSIPCO), 2015 23rd European. IEEE, 2015, pp. 719–723.
 [17] Alain Rakotomamonjy and Gilles Gasso, “Histogram of gradients of time-frequency representations for audio scene classification,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 1, pp. 142–153, 2015.
 [18] Karol J Piczak, “Environmental sound classification with convolutional neural networks,” in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on. IEEE, 2015, pp. 1–6.
 [19] Karol J Piczak, “The details that matter: Frequency resolution of spectrograms in acoustic scene classification,” Detection and Classification of Acoustic Scenes and Events, 2017.
 [20] Yoonchang Han, Jeongsoo Park, and Kyogu Lee, “Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification,” the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1–5, 2017.
 [21] Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen, “A multidevice dataset for urban acoustic scene classification,” Tech. Rep., DCASE2018 Challenge, September 2018.
 [22] Solomon Kullback and Richard A Leibler, “On information and sufficiency,” The annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951.

 [23] Rudolf Beran, “Minimum Hellinger distance estimates for parametric models,” The Annals of Statistics, vol. 5, no. 3, pp. 445–463, 1977.
 [24] Sai Samarth R Phaye, Apoorva Sikka, Abhinav Dhall, and Deepti Bathula, “Multi-level dense capsule networks,” in Asian Conference on Computer Vision, 2018, accepted, https://arxiv.org/abs/1805.04001.

 [25] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
 [26] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
 [27] Toni Heittola, “dcase_util: utilities for detection and classification of acoustic scenes,” 2018, https://dcase-repo.github.io/dcase_util/index.html.