Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

06/22/2017 ∙ by M. Huzaifah, et al.

Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Visual displays of an audio signal, through various time-frequency representations such as spectrograms, offer a rich representation of the temporal and spectral structure of the original signal. In this letter, we compare several popular signal processing methods for obtaining such representations, namely the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform (CQT) and the continuous Wavelet transform (CWT), and assess their impact on the classification performance of CNNs on two environmental sound datasets. This study supports the hypothesis that time-frequency representations are valuable for learning useful features for sound classification. Moreover, the actual transformation used is shown to impact classification accuracy, with the Mel-scaled STFT slightly outperforming the other methods discussed and outperforming baseline MFCC features to a large degree. Additionally, we observe that the optimal window size during transformation depends on the characteristics of the audio signal, and that, architecturally, 2D convolution yielded better results than 1D in most cases.







I Introduction

While not receiving as much attention from the scientific community as speech processing tasks, environmental sound recognition nonetheless contributes to important applications in surveillance [1], robotics [2] and home automation [3], among others. In comparison to standard speech, environmental sounds are often more chaotic and noise-like, without the underlying phonetic structure that has been successfully modeled by traditional machine learning methods like the hidden Markov model (HMM). Recent work in this field has shown two distinctive developments: the utilization of deep neural networks (DNNs), in particular the convolutional neural network (CNN), as a classifier and feature extractor, and the use of the time-frequency representation of an audio signal, known as the spectrogram, as input.

CNN-based models were first adopted for speech recognition systems by Abdel-Hamid et al. [4] for the TIMIT phone recognition task. This model was later improved architecturally in [5], with added consideration to kernel size, pooling, network size and regularization, while other large-scale speech tasks were also carried out in [6][7][8] using CNNs. More recently, Piczak [9] and Salamon [10] showed that a basic CNN could generally outperform existing methods for environmental sound classification provided sufficient data.

To achieve desirable results, the classifier has to be paired with an appropriate input representation. Conventional choices were largely hand-crafted features such as Mel-frequency cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients, which were previously state-of-the-art when used with Gaussian mixture model (GMM)-based HMMs. However, such cepstral features became less popular with deep learning algorithms as it was no longer essential for feature maps to be sufficiently de-correlated [11][12]. Conversely, the strength of CNNs lies in their ability to learn localized patterns through weight-sharing and pooling [13], patterns present in the spectro-temporal features of spectrograms.

In the domain of environmental sound, it has been noted that time-frequency representations are especially useful as learning features [14][15][16][17][18] due to the non-stationary and dynamic nature of the sounds. To extract these spectro-temporal features, a range of signal processing techniques have been proposed. A survey on environmental sound recognition by Chachada and Kuo [19] covers several methods, including sparse-representation-based techniques such as matching pursuit, power-spectrum-based techniques to obtain variants of the spectrogram, and several wavelet-based approaches. Another comparative study [20] investigated the performance of methods such as the short-time Fourier transform (STFT), fast Wavelet transform (FWT) and continuous Wavelet transform (CWT) against stationary features like the aforementioned MFCC and PLP. The authors classified the extracted features using conventional machine learning techniques, including GMM-HMM, support vector machines (SVMs) and shallow artificial neural nets.

This letter builds upon previous comparative studies by focusing on the specifics of a CNN model as opposed to more traditional classifiers. We investigate four common approaches to obtain the time-frequency representation, namely the short-time Fourier transform (STFT) with both linear and Mel scales, the constant-Q transform (CQT) and the continuous Wavelet transform (CWT), while addressing additional considerations like window size. The impact of the different approaches is evaluated in comparison to baseline MFCC features on two publicly available environmental sound datasets (ESC-50, UrbanSound8K) through the classification performance of several CNN variants.

II Experimental Methodology

II-A Datasets

The ESC-50 dataset [21] comprises 2000 short (5-second) environmental recordings split equally among 50 classes. Classes were derived from 5 major groups: animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises. The small number of samples coupled with a relatively large number of classes made this quite a demanding dataset for traditional classification methods. Previous work by Piczak [9] showed that a deep learning approach through CNNs markedly improved classification performance, yielding an accuracy of 64.5%. A recent paper [22] further improved this result to 74.2% using a deeper pre-trained network.

UrbanSound8K [23] is a collection of 8732 short (4 seconds or less) audio clips taken from field recordings. The dataset is divided into 10 distinct classes of urban sounds: air conditioner, car horn, children playing, dog barking, drilling, engine idling, gun shot, jackhammer, siren and street music. Unlike ESC-50, the classes were not completely balanced, with the car horn, gun shot and siren sounds having fewer examples. The current state-of-the-art on this dataset [10] achieved a mean classification accuracy of 79.0%.

II-B Pre-processing

Proper pre-processing of the raw data was a major focus, to make comparisons between the different transformations as fair as possible. Four main time-frequency representations were extracted in addition to MFCCs: a) linear-scaled STFT spectrogram, b) Mel-scaled STFT spectrogram, c) CQT spectrogram, d) CWT scalogram, and e) MFCC cepstrogram.

Firstly, all audio clips in both datasets were standardized by padding/clipping to a 4-second duration and resampled at 22050 Hz. Unlike [9] and [10], whole clips were used for the subsequent transformations, including periods of silence and without additional augmentation. For the STFT [24], the discrete Fourier transform with a sliding Hann window $w[n]$ was applied to overlapping segments of the signal:

$$\mathrm{STFT}\{x[n]\}(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-j\omega n}$$

Varying the window length results in a trade-off between frequency and time resolution. Both wideband (short-window) and narrowband (long-window) transforms were used to probe this effect; the hop size was fixed in both cases.

Spectrograms are defined as the squared magnitude of the STFT, giving the power of the sound for a particular frequency and time in the third dimension. The values were converted to a logarithmic scale (decibels) then normalized to [-1,1] generating a single-channel greyscale image (Fig.1). The frequency bins were either spaced linearly or mapped onto the Mel scale with 512 or 128 Mel bands for wideband and narrowband respectively.
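As a concrete sketch of the pipeline just described, the log-power spectrogram normalized to [-1, 1] can be computed with plain NumPy. The window length, hop size and dB floor below are illustrative values, not the paper's settings:

```python
import numpy as np

def stft_spectrogram(x, n_fft=1024, hop=512):
    """Power spectrogram via a sliding Hann window, converted to
    decibels and normalized to [-1, 1]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Squared magnitude of the DFT of each windowed segment
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    db = 10.0 * np.log10(power + 1e-10)                   # decibel scale
    db = 2 * (db - db.min()) / (db.max() - db.min()) - 1  # -> [-1, 1]
    return db.T                                           # (freq bins, frames)

sr = 22050
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))      # 1 s of a 440 Hz tone
```

The resulting single-channel array can then be treated as a greyscale image, as described above.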

The same general procedure was applied to the other transforms. The CQT [25] is a bank of filters corresponding to tonal spacing, where each filter is equivalent to a subdivision of an octave, with central frequencies given by $f_k = f_{\min} \cdot 2^{k/b}$. Here $f_k$ denotes the frequency of the $k$-th spectral component and $b$ the number of filters per octave. As the name suggests, the Q value, which is the ratio of central frequency to bandwidth, should be constant:

$$Q = \frac{f_k}{\Delta f_k} = \left(2^{1/b} - 1\right)^{-1}$$

Like the STFT, wideband and narrowband versions of the CQT coefficients were extracted using

$$X^{\mathrm{CQ}}[k] = \frac{1}{N_k} \sum_{n=0}^{N_k-1} x[n]\, w_{N_k}[n]\, e^{-j 2\pi Q n / N_k}$$

where $N_k = \lceil Q\, f_s / f_k \rceil$ is the frequency-dependent window length for sampling rate $f_s$, and $w_{N_k}$ is a window of that length.
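These relations can be sketched directly (and deliberately slowly) in NumPy for a single analysis frame. The minimum frequency, number of bins and bins per octave below are illustrative choices, not the paper's parameters:

```python
import numpy as np

def naive_cqt(x, sr, f_min=55.0, bins_per_octave=12, n_bins=48):
    """Direct constant-Q transform of one frame: f_k = f_min * 2^(k/b),
    Q = 1 / (2^(1/b) - 1), with a Hann window of length N_k per bin."""
    b = bins_per_octave
    Q = 1.0 / (2.0 ** (1.0 / b) - 1.0)
    coeffs = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / b)
        N_k = int(np.ceil(Q * sr / f_k))   # window shrinks as f_k grows
        n = np.arange(N_k)
        w = np.hanning(N_k)
        coeffs[k] = np.sum(x[:N_k] * w * np.exp(-2j * np.pi * Q * n / N_k)) / N_k
    return coeffs

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)            # 220 Hz: two octaves above f_min
mags = np.abs(naive_cqt(x, sr))
```

The magnitude peak lands on the bin whose central frequency matches the tone, two octaves (24 semitone bins) above f_min. In practice, efficient FFT-based implementations such as librosa's are used instead of this direct form.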


Instead of decomposing a signal into sinusoids like the Fourier transform, the CWT uses basis functions that are localized in both real and Fourier space, i.e. the time and frequency domains. Here the CWT was specified with 256 frequency bins and a Morlet mother wavelet, which has been used in previous audio recognition studies [14][20]:

$$X_w(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$$

where $a$ is the scale, $b$ the translation and $\psi$ the mother wavelet. The CWT analogue of the spectrogram was obtained by computing the squared magnitude of the resultant wavelet coefficients [26]. Since the CWT allows for arbitrary time-frequency resolution limited only by the sampling rate, only one transformation was carried out, corresponding to narrowband dimensions.
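A naive NumPy sketch of this scalogram computation with a real Morlet wavelet follows; the scale range and the center frequency of 5 rad/sample at unit scale are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def morlet_scalogram(x, scales):
    """Squared coefficients of a real-Morlet CWT via direct convolution.
    Each row corresponds to one scale (coarser scales = lower frequencies)."""
    out = np.zeros((len(scales), len(x)))
    for i, a in enumerate(scales):
        # Real Morlet mother wavelet, dilated by scale a, L2-normalized
        t = np.arange(-4 * a, 4 * a + 1) / a
        psi = np.exp(-t ** 2 / 2) * np.cos(5 * t) / np.sqrt(a)
        coef = np.convolve(x, psi, mode='same')
        out[i] = coef ** 2                 # scalogram = squared coefficients
    return out

sr = 22050
t = np.arange(2048) / sr
x = np.sin(2 * np.pi * 440 * t)
scal = morlet_scalogram(x, scales=np.arange(4, 64))
```

Energy concentrates at the scale whose pseudo-frequency matches the 440 Hz tone. pyWavelets (`pywt.cwt` with the `'morl'` wavelet) provides an optimized equivalent.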

Finally, MFCCs were obtained using the standard procedure and arranged as a cepstrogram. The coefficients were also normalized to [-1,1] but were not log-scaled.

To keep the number of input feature maps (area of spectrogram) comparable, the images were further downscaled to 37×50 pixels for CWT, MFCC and all narrowband spectrograms, and 154×12 pixels for wideband spectrograms, resulting in 1850 and 1848 input parameters respectively. Downscaling was done with PIL using Lanczos resampling for optimal results. The added effect of the resizing was a significant speed-up in training times without sacrificing much accuracy.
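The downscaling step can be sketched with Pillow; the source array size below is an illustrative stand-in for a real spectrogram, and note that PIL's `resize` takes (width, height) while NumPy arrays are (rows, columns):

```python
import numpy as np
from PIL import Image

# A random array stands in for a narrowband spectrogram (freq bins x frames);
# 37x50 matches the narrowband target dimensions described above.
spec = np.random.rand(512, 173).astype(np.float32)
img = Image.fromarray(spec, mode='F')                 # 32-bit float image
small = img.resize((50, 37), resample=Image.LANCZOS)  # (width, height)
resized = np.asarray(small)                           # back to (37, 50)
```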

Audio processing was mostly carried out using librosa [27] with the exception of CWT for which pyWavelets [28] was used.

Fig. 1: Examples of time-frequency spectrogram-like representations used as input extracted from the same ESC-50 sample (Handsaw 5-253094-c). Left, from far left: wideband linear-STFT, Mel-STFT, CQT. Right, clockwise from top right: narrowband Mel-STFT, CWT, MFCC, CQT, linear-STFT.

Ii-C Network Architecture and Evaluation

Shallower and deeper variations of a CNN were implemented, informed by popular image recognition models and the CNNs in [9][10]. The overall architecture is illustrated in Fig.2.

Fig. 2: Network architecture of the deeper (Conv-5) and shallower (Conv-3) CNN models. 3×3 and M×3 sized filters were used in the convolutional layers. While containing more layers, Conv-5 had fewer parameters in each layer compared to Conv-3's single convolutional layer.

Two types of convolutional filters were considered: a 3×3 square filter, and an M×3 rectangular filter spanning all M frequency bins that essentially forces a one-dimensional convolution over time. As opposed to natural images, where both axes contain spatial information, the two axes of a spectrogram are not symmetric. Translation invariance may hold over time in a spectrogram image, but it was not immediately evident whether pitch invariance would hold over the full frequency spectrum. For instance, [10] used a small filter, [9] used one spanning just short of all the frequency bins, while [29] implemented both types with varying success. Our results show that filter performance is partly affected by the scale or transformation used.
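The geometric consequence of the two filter shapes can be illustrated with a valid-mode 2D correlation (the spectrogram dimensions here are illustrative): an M×3 filter spanning all frequency bins collapses the frequency axis, leaving a one-dimensional slide over time, while a 3×3 filter slides over both axes.

```python
import numpy as np
from scipy.signal import correlate2d

M, T = 128, 173                 # frequency bins x time frames (illustrative)
spec = np.random.rand(M, T)

square = np.random.rand(3, 3)   # 3x3: slides over both frequency and time
rect = np.random.rand(M, 3)     # Mx3: spans all bins -> 1D over time only

out2d = correlate2d(spec, square, mode='valid')   # shape (M-2, T-2)
out1d = correlate2d(spec, rect, mode='valid')     # shape (1, T-2)
```

The single-row output of the rectangular filter is exactly the 1D-convolution-over-time behavior described above.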

The convolutional layers were interspersed with rectified linear unit (ReLU) and max pooling layers, with stride sizes equal to the pooling dimensions. Conv-3 employed more aggressive max pooling than Conv-5. Dropout [30] was utilized during training after the first convolutional and fully-connected layers, in addition to L2-regularization on all weight layers, to reduce overfitting.

Training was performed using Adam optimization [31] with a batch size of 100 and cross-entropy as the loss function. Both datasets came prearranged into non-overlapping folds, and all models were evaluated using 5-fold (ESC-50) and 10-fold (UrbanSound8K) cross-validation, with a single fold held out as a test set for each round of validation while training on the remainder. Models were trained for 200 epochs on ESC-50 and 100 epochs on UrbanSound8K. The order of samples in the training and test sets was randomly shuffled after each training epoch. The reported results are median values of the test classification accuracy from the best training epoch across 4 separate cross-validation runs for ESC-50 and 2 separate runs for UrbanSound8K. All network parameters were kept constant for each fold and run, although the weights were randomly initialized each time following a truncated normal distribution. The network was implemented in Python with TensorFlow [32].
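The fold rotation described above can be sketched as follows; `train_fn` and `eval_fn` are hypothetical stand-ins for the CNN training and evaluation routines, and the toy data below only demonstrates the hold-one-fold-out mechanics:

```python
import numpy as np

def cross_validate(samples, fold_ids, train_fn, eval_fn):
    """Hold each prearranged fold out as the test set in turn, train on
    the remaining folds, and report the median accuracy over folds."""
    accuracies = []
    for fold in np.unique(fold_ids):
        test_mask = fold_ids == fold
        model = train_fn(samples[~test_mask])
        accuracies.append(eval_fn(model, samples[test_mask]))
    return np.median(accuracies)

# Toy demonstration: 5 folds of 2 samples each
samples = np.arange(10, dtype=float)
fold_ids = np.repeat(np.arange(5), 2)
result = cross_validate(samples, fold_ids,
                        train_fn=lambda train: train.mean(),
                        eval_fn=lambda model, test: float(test.mean()))
```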


III Results

                   Linear-STFT              Mel-STFT                 CQT                  CWT         MFCC
                   WB          NB           WB          NB           WB          NB
ESC-50
Conv-5, M×3    44.50±2.00  46.62±2.25   46.25±2.00  48.00±1.63   42.00±2.37  42.62±1.50   38.25±1.50  30.50±1.50
Conv-5, 3×3    49.25±0.75  50.00±1.88   50.87±2.50  53.75±1.75   46.87±1.13  48.62±2.00   40.50±2.13  36.62±2.13
Conv-3, M×3    52.12±1.12  55.12±1.88   56.37±1.63  56.25±1.75   54.37±2.25  53.50±1.87   46.50±1.63  35.25±2.75
Conv-3, 3×3    55.00±1.37  53.00±1.62   54.00±1.25  55.00±1.63   51.75±1.25  51.62±2.25   46.62±1.87  35.00±0.75
UrbanSound8K
Conv-5, M×3    61.19±4.81  63.44±3.39   62.22±5.19  64.97±3.69   62.87±3.25  63.12±3.25   56.90±2.10  59.23±3.24
Conv-5, 3×3    67.94±4.22  62.83±4.73   69.59±4.19  65.31±2.19   69.25±4.69  64.33±3.60   61.56±1.80  57.15±1.81
Conv-3, M×3    68.81±4.50  66.72±2.72   70.69±4.06  68.29±3.00   70.94±4.06  67.06±3.12   64.00±2.17  64.87±2.17
Conv-3, 3×3    70.94±2.94  68.19±3.25   74.66±3.39  71.25±1.85   73.03±3.56  68.31±2.35   64.75±1.44  62.81±4.03

TABLE I: Median and median absolute deviation of accuracies (%) for ESC-50 and UrbanSound8K. WB and NB denote the wideband and narrowband variants of each transform; CWT and MFCC have a single variant.

III-A Impact of time-frequency representation

Median classification accuracies and their corresponding median absolute deviations for all experimental cases are presented in Table I. (The median was used as a statistical measure, as opposed to the mean, for its robustness, especially against outliers. Deviations were chiefly the result of the cross-validation protocol, with each fold being a distinct test set, but also arose from other stochastic factors such as the random initialization of weights and batching during optimization. For this reason, the whole k-fold cross-validation process was repeated more than once.) It was immediately evident that classification with spectral representations as input outperformed traditional MFCC features. Other than the models with rectangular filters on the UrbanSound8K dataset, for which MFCCs returned a better accuracy than CWT, this was consistent throughout. In fact, time-frequency representations bettered MFCCs by a relatively wide margin, up to 15-20 percentage points in some cases. Even so, the confusion matrices in Fig. 3 indicate that similar sets of classes were misclassified using MFCC and narrowband spectrogram features, albeit to a greater degree for the former. This suggests that the features provided by both transformations were closely related, although spectral features were ultimately more discriminatory.
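The reported statistics are simply the median and median absolute deviation over runs; with hypothetical accuracy values (not the paper's data) the computation is:

```python
import numpy as np

# Hypothetical per-run accuracies (%) for one configuration
accs = np.array([52.0, 55.5, 53.25, 56.0, 49.75, 54.5, 55.0, 53.0])
median = np.median(accs)
mad = np.median(np.abs(accs - median))   # median absolute deviation
```

Unlike the standard deviation, the MAD is unaffected by a single outlying run, which motivates its use here.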

Fig. 3: Confusion matrices for the best performing Conv-3 model with 3×3 filter on UrbanSound8K with different inputs: wideband (top left) and narrowband (top right) Mel-STFT spectrogram, and MFCC cepstrogram (left). The wideband confusion matrix displays a similar distribution across classes as its analogues in [9] and [10].

Among the spectral transformations under consideration, it was observed that linear-STFT, Mel-STFT and CQT performed comparably on both datasets. On the other hand, CWT results were lower and closer to MFCC, especially for UrbanSound8K. The top performers for each model variation were determined by means of an ANOVA and a post-hoc Tukey test. (There were multiple top performers when the null hypothesis could not be rejected between pairs of transformations; the significance level used was 0.05.)
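The first stage of that significance-testing procedure can be sketched with SciPy's one-way ANOVA; the fold accuracies below are hypothetical, and the post-hoc Tukey HSD step that locates the differing pairs is left out:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical fold accuracies (%) for three transformations
mel  = np.array([55.1, 56.4, 54.9, 56.3, 55.8])
cqt  = np.array([54.4, 56.2, 55.0, 55.7, 54.1])
mfcc = np.array([35.3, 36.1, 34.9, 35.8, 35.2])

# Null hypothesis: all group means are equal; a p-value below 0.05
# indicates at least one transformation differs significantly.
f_stat, p_value = f_oneway(mel, cqt, mfcc)
```

When the omnibus test rejects the null, a pairwise post-hoc comparison (Tukey HSD) identifies which transformations share the top spot.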


III-B Effect of CNN architecture and filter size

Overall, the shallower Conv-3 model tended to yield better accuracies than Conv-5 regardless of input. A probable explanation is the diminishing returns of the deeper model due to significant overfitting. While it simplified the experimental methodology, using whole audio clips as input inevitably resulted in fewer training examples with less variation, impairing the generalizability of the models. This can be seen when comparing the training and test curves (Fig. 4), which tended to be closer together for Conv-3 than for Conv-5.

Fig. 4: ESC-50 training (solid line) and test (dashed line) accuracy over epochs using Conv-5 (left) and Conv-3 (right) models with a narrowband Mel-STFT spectrogram as input.

Further, 2D convolution generally gave better results than 1D, with the notable anomaly of Conv-3 for ESC-50. Again, this could be partly attributed to overfitting of the bigger M×3 filters, as illustrated in Fig. 4, although it cannot be discounted that some invariant properties of the frequencies were captured by the smaller filter. Theoretically, Mel-STFT and especially CQT were expected to be more tolerant to pitch shifts, the former being based on a logarithmic perceptual scale of pitch and the latter preserving harmonic structure despite changes to the fundamental frequency. This may have been the case, judging by the accuracy gains moving from M×3 to 3×3 filters, although other transforms also displayed similar improvements. We posit that CQT may be even more beneficial for music analysis, where timbre signatures are more evident in the instrumentation, than for environmental sounds.

III-C Wideband vs Narrowband

The benefit of wideband over narrowband transforms was not consistent across the two datasets. This may be indicative of a disparity in the types of environmental sounds present in each. More interestingly perhaps, comparing the confusion matrices reveals that each specialized in discriminating certain classes of sound. The wideband Mel-STFT fared poorly for short temporal sounds like "drilling" or "jackhammer" and for droning sounds like "air conditioner", but excelled at classes with high frequency variations such as "children playing". For the narrowband transform, having better frequency resolution but poorer temporal resolution, the inverse was true.

IV Conclusion

The main objective of this paper was to perform a comparative study of different commonly used time-frequency representations and to evaluate their impact on the CNN classification performance of environmental sound data. Mel-STFT spectrograms were shown to be consistently good performers across the variations tested, although linear-STFT and CQT also did well on some models. Generally, all time-frequency representations produced better accuracies than baseline MFCC features, corroborating previous studies. The effectiveness of using a wide or narrow window during the transformation, where applicable, was determined to be class dependent; further insight into the audio characteristics of the datasets would help in determining which variation would be more advantageous. Finally, we considered the capability of both 2D convolution and 1D convolution over time by changing the filter size. 2D convolution worked better in most cases, with the exception of the shallower model on the ESC-50 dataset. Nonetheless, the best approach may lie in between, such as using a variable-sized filter, to properly trade off invariance to pitch against discriminative power.


The author would like to thank Haifa Beji and Lonce Wyse for their input and feedback.


  • [1] R. Radhakrishnan, A. Divakaran, and A. Smaragdis, “Audio analysis for surveillance applications,” in Applications of Signal Processing to Audio and Acoustics, 2005. IEEE Workshop on.   IEEE, 2005, pp. 158–161.
  • [2] N. Yamakawa, T. Takahashi, T. Kitahara, T. Ogata, and H. Okuno, “Environmental sound recognition for robot audition using matching-pursuit,” Modern Approaches in Applied Intelligence, pp. 1–10, 2011.
  • [3] J.-C. Wang, H.-P. Lee, J.-F. Wang, and C.-B. Lin, “Robust environmental sound recognition for home automation,” IEEE transactions on automation science and engineering, vol. 5, no. 1, pp. 25–31, 2008.
  • [4] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • [5] L. Deng, O. Abdel-Hamid, and D. Yu, “A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 6669–6673.
  • [6] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A.-r. Mohamed, G. Dahl, and B. Ramabhadran, “Deep convolutional neural networks for large-scale speech tasks,” Neural Networks, vol. 64, pp. 39–48, 2015.
  • [7] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8614–8618.
  • [8] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems, 2009, pp. 1096–1104.
  • [9] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on.   IEEE, 2015, pp. 1–6.
  • [10] J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
  • [11] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at microsoft,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8604–8608.
  • [12] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8599–8603.
  • [13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [14] M. C. Orr, D. S. Pham, B. Lithgow, and R. Mahony, “Speech perception based algorithm for the separation of overlapping speech signal,” in Intelligent Information Systems Conference, The Seventh Australian and New Zealand 2001.   IEEE, 2001, pp. 341–344.
  • [15] B. Ghoraani and S. Krishnan, “Time–frequency matrix feature extraction and classification of environmental audio signals,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2197–2209, 2011.
  • [16] P. Khunarsal, C. Lursinsap, and T. Raicharoen, “Very short time environmental sound classification based on spectrogram pattern matching,” Information Sciences, vol. 243, pp. 57–74, 2013.
  • [17] S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition with time–frequency audio features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, 2009.
  • [18] J. Dennis, H. D. Tran, and H. Li, “Spectrogram image feature for sound event classification in mismatched conditions,” IEEE Signal Processing Letters, vol. 18, no. 2, pp. 130–133, 2011.
  • [19] S. Chachada and C.-C. J. Kuo, “Environmental sound recognition: A survey,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific.   IEEE, 2013, pp. 1–9.
  • [20] M. Cowling and R. Sitte, “Comparison of techniques for environmental sound recognition,” Pattern recognition letters, vol. 24, no. 15, pp. 2895–2907, 2003.
  • [21] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd ACM International Conference on Multimedia.   ACM, 2015, pp. 1015–1018.
  • [22] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” in Advances in Neural Information Processing Systems, 2016, pp. 892–900.
  • [23] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 1041–1044.
  • [24] J. B. Allen and L. R. Rabiner, “A unified approach to short-time-fourier analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, November 1977.
  • [25] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of America, 1991.
  • [26] O. Rioul and M. Vetterli, “Wavelets and signal processing,” IEEE signal processing magazine, vol. 8, no. 4, pp. 14–38, 1991.
  • [27] B. McFee, M. McVicar, C. Raffel, D. Liang, O. Nieto, J. Moore, D. Ellis, D. Repetto, P. Viktorin, and J. F. Santos, “librosa: v0.5.0,” 2017.
  • [28] F. Wasilewski, “PyWavelets: Discrete wavelet transform in Python,” 2010.
  • [29] N. Anand and P. Verma, “Convoluted feelings convolutional and recurrent nets for detecting emotion from audio data,” in Technical Report.   Stanford University, 2015.
  • [30] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [32] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.