Neural network methods such as convolutional neural networks (CNNs) have been used for audio classification and have achieved state-of-the-art performance [9, 5, 26]. State-of-the-art audio classification models are generally large and built from complicated modules, which makes them computationally inefficient in terms of, e.g., the number of floating point operations (FLOPs) and running memory. However, in many real-world scenarios, audio classification models need to be deployed on resource-constrained platforms such as mobile devices.
There has been increasing interest in building efficient audio neural networks in the literature. Existing methods can generally be divided into three categories. The first is to utilize model compression techniques such as pruning [21, 20]. The second is to transfer the knowledge from a large-scale pre-trained model to a small model via knowledge distillation [2, 11, 1]. The last one is to directly exploit efficient networks for audio classification, such as MobileNets [9, 10]. In summary, these methods mainly focus on reducing model size. However, the computational cost (e.g., FLOPs) of an audio neural network is determined not only by the size of the model but also, to a large extent, by the size of the input features.
Existing audio neural networks usually take a mel-spectrogram as input, which may be temporally redundant. For example, the pattern of a siren audio clip is highly repetitive in the spectrogram, as shown in Figure 1. In principle, if one can remove the redundancy in the input mel-spectrogram, the computational cost can be significantly reduced. However, reducing the input feature size for audio neural networks has received little attention in the literature, especially in terms of improving their computation efficiency.
In this paper, we propose a family of simple pooling front-ends (SimPFs) for improving the computation efficiency of audio neural networks. SimPFs utilize simple non-parametric pooling methods (e.g., max pooling) to eliminate the temporally redundant information in the input mel-spectrogram. The simple pooling operation on an input mel-spectrogram achieves a substantial improvement in computation efficiency for audio neural networks. To evaluate the effectiveness of SimPFs, we conduct extensive experiments on four audio classification datasets: DCASE19 acoustic scene classification, ESC-50 environmental sound classification, Google SpeechCommands keyword spotting, and AudioSet audio tagging. We demonstrate that SimPFs can reduce the FLOPs of off-the-shelf audio neural networks by more than half, with negligibly degraded or even improved classification performance. For example, on DCASE19 acoustic scene classification, SimPF can reduce the FLOPs by 75% while improving the classification accuracy by approximately 1.2%. Our proposed SimPFs are simple to implement and can be integrated into any audio neural network at a negligible computation cost. The code of our proposed method is available on GitHub (https://github.com/liuxubo717/SimPFs).
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed SimPFs for efficient audio classification. Section 4 presents the experimental settings and the evaluation results. Conclusions and future directions are given in Section 5.
2 Related work
Our work relates to several works in the literature: efficient audio classification, feature reduction for audio classification, and audio front-ends. We will discuss each of these as follows.
2.1 Efficient audio classification
Efficient audio classification for on-device applications has attracted increasing attention in recent years. Singh et al. [21, 20, 22] proposed pruning methods to eliminate redundancy in audio convolutional neural networks for acoustic scene classification, reducing FLOPs by approximately 25% at a 1% reduction in accuracy. Knowledge distillation methods [2, 11, 1] have been used for efficient audio classification by transferring the knowledge of large teacher models to small on-device student models. Efficient models such as MobileNets have been proposed for visual applications on mobile devices. Kong et al. have adapted MobileNet for audio tagging, demonstrating its potential to improve computational efficiency for audio classification. Unlike these methods, which focus on reducing model size, our proposed SimPF aims to reduce the size of the input features.
2.2 Feature reduction for audio classification
Feature reduction methods such as principal component analysis (PCA) have been widely investigated for audio classification with classical machine learning methods such as discriminative support vector machines (SVMs) [3, 28]. The most relevant work to SimPFs in the literature is , where max and average pooling operations are applied to sparse acoustic features to improve the performance of SVM-based audio classification, especially in a noisy environment. In contrast to this method, SimPFs are designed to improve the efficiency of audio neural networks, whose computational cost is highly dependent on the size of the input features. In addition, the effectiveness of SimPFs is extensively evaluated on various audio classification benchmarks.
2.3 Audio Front-ends
Audio front-ends have been studied as an alternative to mel-filterbanks for audio classification over the last decade. For example, the trainable front-ends SincNet and LEAF have been proposed for learning audio features from the waveform. These front-ends perform better than traditional mel-filterbanks on various audio classification tasks. Unlike existing work on learnable front-ends, SimPFs are non-parametric and built on top of the mel-spectrogram widely used by audio neural networks. Our motivation for designing SimPFs is not to learn a replacement for the mel-spectrogram, but to eliminate temporal redundancy in mel-spectrograms. This redundancy significantly impacts the efficiency of audio neural networks but is often ignored by audio researchers.
|Model (CNN10)||DCASE19||ESC-50||SpeechCommands|
|Front-end||Compression Factor||Compression Factor||Compression Factor|
3 Proposed Method
The mel-spectrogram is widely used as an input feature for neural network-based audio classification. Given an audio signal $x$, its mel-spectrogram is a two-dimensional time-frequency representation denoted as $X \in \mathbb{R}^{T \times F}$, where $T$ and $F$ represent the number of time frames and the dimension of the spectral feature, respectively. An audio neural network takes $X$ as the input and predicts the category $y$ of the input audio:

$$y = f_{\theta}(X),$$

where $f_{\theta}$ stands for the model parameterized by $\theta$. Generally, the computation cost of the neural network depends on both the size of the parameter $\theta$ and the size of the input $X$.

In this work, we propose to use simple non-parametric pooling methods to eliminate the temporal redundancy in the input mel-spectrogram. SimPFs can significantly improve the computational efficiency of audio neural networks without any bells and whistles. Formally, SimPFs take a mel-spectrogram $X$ as input and output a compressed time-frequency representation $C \in \mathbb{R}^{kT \times F}$, where $k$ is the compression coefficient in the time domain and $1/k$ should be a positive integer. We introduce a family of SimPFs that uses five pooling methods.

SimPF (Max) We apply a 2D max pooling with kernel size $(1/k, 1)$ over an input mel-spectrogram $X$. The output $C$ is described as follows:

$$C(t, f) = \max_{n \in \{t/k, \dots, (t+1)/k - 1\}} X(n, f),$$

where $t \in \{0, \dots, kT - 1\}$ and $f \in \{0, \dots, F - 1\}$.

SimPF (Avg) Similar to SimPF (Max), we apply a 2D average pooling with kernel size $(1/k, 1)$ over an input mel-spectrogram $X$. Formally, the output $C$ is described as:

$$C(t, f) = k \sum_{n = t/k}^{(t+1)/k - 1} X(n, f).$$

SimPF (Avg-Max) In this case, we add the outputs of SimPF (Max) and SimPF (Avg), denoted $C_{\mathrm{max}}$ and $C_{\mathrm{avg}}$, which is defined as:

$$C(t, f) = C_{\mathrm{max}}(t, f) + C_{\mathrm{avg}}(t, f).$$

SimPF (Spectral) We adapt the spectral pooling method proposed in . Concretely, the Discrete Fourier Transform (DFT) $\tilde{X}$ of the input mel-spectrogram $X$ is computed by:

$$\tilde{X}(u, v) = \sum_{t=0}^{T-1} \sum_{f=0}^{F-1} X(t, f)\, e^{-2\pi i \left( \frac{ut}{T} + \frac{vf}{F} \right)},$$

and the zero frequency is shifted to the center of $\tilde{X}$. Then, a bounding box of size $kT \times F$ crops $\tilde{X}$ around its center to produce $\tilde{X}_{\mathrm{crop}}$. The output $C$ is obtained by applying the inverse DFT to $\tilde{X}_{\mathrm{crop}}$:

$$C = \mathrm{DFT}^{-1}\left(\tilde{X}_{\mathrm{crop}}\right).$$

SimPF (Uniform) We uniformly sample one spectral frame every $1/k$ frames. The output $C$ is calculated by:

$$C(t, f) = X(t/k, f).$$
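The four non-spectral front-ends described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names, the list-of-lists spectrogram layout, and the `window` argument (the number of consecutive frames merged into one, i.e. the reciprocal of the compression coefficient, so 2 for a 50% compression factor) are assumptions made for this example.

```python
from typing import Callable, List

Spectrogram = List[List[float]]  # T time frames, each holding F mel bins


def _pool_time(spec: Spectrogram, window: int,
               reduce: Callable[[List[float]], float]) -> Spectrogram:
    """Merge every `window` consecutive frames into one frame, bin by bin."""
    if window < 1 or len(spec) % window != 0:
        raise ValueError("window must be >= 1 and divide the number of frames")
    n_bins = len(spec[0])
    return [
        [reduce([spec[t + i][f] for i in range(window)]) for f in range(n_bins)]
        for t in range(0, len(spec), window)
    ]


def simpf_max(spec: Spectrogram, window: int) -> Spectrogram:
    return _pool_time(spec, window, max)


def simpf_avg(spec: Spectrogram, window: int) -> Spectrogram:
    return _pool_time(spec, window, lambda values: sum(values) / len(values))


def simpf_avg_max(spec: Spectrogram, window: int) -> Spectrogram:
    """Element-wise sum of the max-pooled and average-pooled outputs."""
    pooled_max, pooled_avg = simpf_max(spec, window), simpf_avg(spec, window)
    return [[m + a for m, a in zip(row_m, row_a)]
            for row_m, row_a in zip(pooled_max, pooled_avg)]


def simpf_uniform(spec: Spectrogram, window: int) -> Spectrogram:
    """Keep the first frame of every group of `window` frames."""
    if window < 1 or len(spec) % window != 0:
        raise ValueError("window must be >= 1 and divide the number of frames")
    return [list(frame) for frame in spec[::window]]


# Toy spectrogram: 4 frames x 2 mel bins, compressed with a 50% factor.
toy = [[1, 8], [3, 6], [5, 4], [7, 2]]
print(simpf_max(toy, 2))      # [[3, 8], [7, 4]]
print(simpf_avg(toy, 2))      # [[2.0, 7.0], [6.0, 3.0]]
print(simpf_avg_max(toy, 2))  # [[5.0, 15.0], [13.0, 7.0]]
print(simpf_uniform(toy, 2))  # [[1, 8], [5, 4]]
```

All four variants leave the mel axis untouched and only shorten the time axis, which is what lets them sit in front of an unmodified classification network.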
We visualize the mel-spectrogram of a siren audio clip and the compressed spectrograms using different SimPFs with 50% compression factor in Figure 1. Intuitively, we can observe that even though the resolution of the spectrogram is compressed by half, the pattern of the siren remains similar in the spectrogram, which indicates high redundancy in the siren spectrogram.
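The spectral variant can be illustrated with a small pure-Python sketch. It rests on two assumptions that are not stated in the text: first, because the crop keeps the full mel axis, the 2D transform is replaced by an equivalent 1D DFT along time for each mel bin; second, the inverse transform is divided by the original length so that signal amplitudes are preserved. The function names are hypothetical, and the naive DFT is only meant for tiny inputs.

```python
import cmath
from typing import List


def spectral_pool_time(signal: List[float], out_len: int) -> List[float]:
    """Shorten one mel bin's time series to `out_len` samples by DFT cropping."""
    n = len(signal)
    if not 1 <= out_len <= n:
        raise ValueError("out_len must be between 1 and len(signal)")
    # Centered frequencies kept by the crop, e.g. out_len=4 keeps -2, -1, 0, 1.
    kept = range(-(out_len // 2), out_len - out_len // 2)
    # Forward DFT, evaluated only at the kept (low) frequencies.
    coeffs = {
        u: sum(x * cmath.exp(-2j * cmath.pi * u * t / n)
               for t, x in enumerate(signal))
        for u in kept
    }
    # Inverse DFT on the shorter grid. Dividing by n rather than out_len is an
    # assumption of this sketch; it keeps a constant input at the same level.
    return [
        (sum(c * cmath.exp(2j * cmath.pi * u * t / out_len)
             for u, c in coeffs.items()) / n).real
        for t in range(out_len)
    ]


def simpf_spectral(spec: List[List[float]], k: float) -> List[List[float]]:
    """Apply spectral pooling along time to every mel bin of a T x F input."""
    out_len = round(len(spec) * k)
    columns = [spectral_pool_time([frame[f] for frame in spec], out_len)
               for f in range(len(spec[0]))]
    return [list(frame) for frame in zip(*columns)]


# An 8-frame, 2-bin toy input compressed with a 50% factor gives 4 frames.
pooled = simpf_spectral([[1.0, 2.0]] * 8, 0.5)
print(len(pooled), len(pooled[0]))
```

Unlike max or average pooling, this acts as a low-pass filter along time: slowly varying patterns pass through almost unchanged, while fast fluctuations above the cropped band are discarded.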
4 Experiments and Results
4.1 Datasets
DCASE 2019 Task 1  is an acoustic scene classification task, with a development set consisting of -second audio clips from acoustic scenes such as airport and metro station. In the development set, and audio clips are used for training and validation, respectively. We will refer to this dataset as DCASE19.
ESC-50  consists of five-second environmental audio clips. ESC-50 is a balanced dataset with sound categories, including animal sounds, natural soundscapes, human sounds (non-speech), and ambient noises. Each sound class has audio clips. The dataset is pre-divided into five folds for cross-validation.
SpeechCommands  contains K speech utterances from various speakers. Each utterance is one second long and belongs to one of classes corresponding to a speech command such as “Go”, “Stop”, “Left”, and “Down”. We divided the datasets by a ratio of :: for training, validation, and testing, respectively.
AudioSet  is a large-scale audio dataset with sound classes in total. The audio clips are sourced from YouTube videos. The training set consists of audio clips. The evaluation set has
test clips. We convert all audio clips to monophonic and pad the audio clips to ten seconds with silence if they are shorter than ten seconds.
4.2 Experiment setup
Baseline systems We evaluate our proposed approach using several off-the-shelf audio classification methods proposed in . For the ESC-50, DCASE19, and SpeechCommands datasets, we use two baseline models, CNN10 and MobileNetV2. On AudioSet, we conduct the experiments on CNN14 and MobileNetV2. CNN10 and CNN14 are both large-scale audio neural networks, whereas MobileNetV2 is designed for low complexity, with fewer multiply-add operations and fewer parameters, and is hence suitable for on-device scenarios. We train all the models from scratch.
Implementation details We load the audio clips using the sampling rate as provided in the original dataset. The audio clip is converted to -dimensional log mel-spectrogram by the short-time Fourier transform with a window size of samples, a hop size of samples, and a Hanning window. The baseline audio classification networks are optimized with the Adam optimizer with the learning rate . The batch size is set to
and the number of epochs is, except for AudioSet where we run epochs. Following , random SpecAugment  is used for data augmentation.
Evaluation metrics Following , we use accuracy as the evaluation metric on the ESC-50, DCASE19, and SpeechCommands datasets. For the AudioSet dataset, we use mean average precision (mAP) to evaluate the performance of audio tagging.
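For readers less familiar with the two metrics, the sketch below shows how accuracy and mAP are commonly computed. It is a minimal illustration, not the evaluation code used in the experiments: the function names are hypothetical, and average precision is taken as the mean of the precision values at the rank of each positive clip, which is one standard definition.

```python
from typing import List, Sequence


def accuracy(targets: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of clips whose predicted class equals the target class."""
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank taken at the rank of every positive clip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix: List[List[int]],
                           score_matrix: List[List[float]]) -> float:
    """Average the per-class AP over all classes (rows = clips, cols = classes)."""
    n_classes = len(label_matrix[0])
    per_class = [
        average_precision([row[c] for row in label_matrix],
                          [row[c] for row in score_matrix])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Four clips, two tags (multi-label, as in audio tagging).
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
scores = [[0.9, 0.2], [0.8, 0.9], [0.7, 0.3], [0.1, 0.8]]
print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))    # 0.75
print(mean_average_precision(labels, scores))  # (5/6 + 1) / 2
```

Accuracy suits the single-label datasets, while mAP ranks clips per class and so remains meaningful when a clip can carry several tags at once.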
|Models||Baseline||SimPF (50%)||SimPF (25%)|
|Model (MobileNetV2)||DCASE19||ESC-50||SpeechCommands|
|Front-end||Compression Factor||Compression Factor||Compression Factor|
|Model (CNN14)||AudioSet|
|Model (MobileNetV2)||AudioSet|
4.3 Evaluation results and analysis
4.3.1 Computation cost analysis (FLOPs)
We analyze the FLOPs reduction achieved by our SimPFs at compression coefficients of 50% and 25% for three baseline systems. Table 2 shows the FLOPs required by each model to infer a 10-second audio clip with a sampling rate of 44 kHz. For our three baseline models, the compression ratio on the input spectrogram is roughly equivalent to the FLOPs reduction ratio. We therefore refer to the spectrogram compression ratio as the FLOPs reduction ratio in the later experiment analysis.
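The near-linear relation between input length and FLOPs can be seen from the cost of a single convolutional layer. The sketch below is a back-of-the-envelope illustration under stated assumptions: a stride-1, same-padded 2D convolution, two FLOPs per multiply-add, and hypothetical layer sizes (1000 frames, 64 mel bins, 1 input and 64 output channels) that are not taken from the paper.

```python
def conv2d_flops(time_frames: int, mel_bins: int, in_channels: int,
                 out_channels: int, kernel_size: int = 3) -> int:
    """FLOPs of one stride-1, same-padded 2D convolution (2 per multiply-add)."""
    multiply_adds = (time_frames * mel_bins * out_channels
                     * in_channels * kernel_size * kernel_size)
    return 2 * multiply_adds


# Hypothetical first layer: 1000 frames x 64 mel bins, 1 -> 64 channels.
baseline = conv2d_flops(1000, 64, 1, 64)
for factor in (0.5, 0.25):
    compressed = conv2d_flops(int(1000 * factor), 64, 1, 64)
    print(f"compression factor {factor:.0%}: "
          f"{compressed / baseline:.2f} of baseline FLOPs")
```

Every convolutional layer scales with the number of time frames in this way, whereas parts of the network that do not depend on the input length (for example, a classifier applied after global pooling) do not, which is consistent with the match being approximate rather than exact.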
On DCASE19, for the CNN10 model, we evaluate the effectiveness of all our proposed SimPFs with three compression factors: 50%, 25%, and 10% (as introduced in Section 3). The experimental results are shown in the left column of Table 1. Overall, SimPF (Spectral) performs best among all SimPF candidates on the three compression coefficient settings. Even though MobileNetV2 is smaller than CNN10, we find that, at the 50% and 25% settings, SimPF (Spectral) reduces the FLOPs by roughly 50% and 75% while still improving the classification accuracy by 1.7% and 1.2%, respectively. Even when the FLOPs are reduced by 90%, the classification accuracy only drops by 0.01%. For MobileNetV2, we evaluate the performance of two representative candidates, SimPF (Avg-Max) and SimPF (Spectral), with three compression factors: 50%, 25%, and 10%. We find similar trends to the CNN10 model, as shown in the left column of Table 2. SimPF (Avg-Max) improves the accuracy by 1.2% on the 25% setting, and SimPF (Spectral) only sacrifices 0.3% accuracy on the 10% setting. Experimental results on CNN10 and MobileNetV2 demonstrate the superior performance of our proposed SimPFs for acoustic scene classification, and also indicate that the acoustic scene data contains highly redundant information.
On ESC-50, for the CNN10 model, we evaluate the performance of all our proposed SimPF methods with two compression factors: 50% and 25%. The experimental results are shown in the middle column of Table 1. On the 50% setting, the best candidate, SimPF (Max), improves the accuracy by 0.3%. On the 25% setting, the best candidate, SimPF (Avg), reduces the accuracy by 1.9%. For the MobileNetV2 model, we evaluate the performance of two representative candidates, SimPF (Avg-Max) and SimPF (Spectral), with two compression factors, 50% and 25%, as shown in the middle column of Table 2. SimPF (Spectral) performs better in both compression coefficient settings. Specifically, SimPF (Spectral) improves the classification accuracy by 0.6% on the 50% setting, and slightly decreases the accuracy by 0.7% on the 25% setting. The performance gain of SimPFs on ESC-50 is not as large as that on DCASE19, but the trade-off between accuracy and FLOPs is still favorable.
On SpeechCommands, for the CNN10 model, we evaluate all our proposed SimPFs with two compression factors: 50% and 25%. The results are shown in the right column of Table 1. On the 50% setting, SimPF (Avg), SimPF (Avg-Max), SimPF (Max), and SimPF (Uniform) achieve performance equivalent to the baseline system. On the 25% setting, the best two candidates, SimPF (Avg) and SimPF (Spectral), reduce the accuracy by only 0.7%. For the MobileNetV2 model, we evaluate the performance of two representative candidates, SimPF (Avg-Max) and SimPF (Spectral), with a compression factor of 50% (the 25% setting is not available, as 25% of a one-second speech clip is too short for MobileNetV2 to process), as shown in the right column of Table 2. The best candidate, SimPF (Avg-Max), decreases the accuracy by 1.3%. Evaluation results on SpeechCommands show that our proposed method is useful for short-utterance speech data.
On AudioSet, we evaluate the performance of two representative candidates, SimPF (Avg-Max) and SimPF (Spectral), with two compression factors, 50% and 25%, for the CNN14 model, as shown in Table 3. On the 50% setting, SimPF (Spectral) only reduces the mAP by 0.8%; on the 25% setting, SimPF (Avg-Max) reduces the mAP by 3.2%. We obtained similar results for the MobileNetV2 model, as shown in Table 4. On the 50% setting, SimPF (Spectral) only reduces the mAP by 0.4%; on the 25% setting, SimPF (Avg-Max) reduces the mAP by 2.1%. SimPFs can thus reduce the computation cost by roughly 50% with an mAP drop of less than 1%. Even though tagging AudioSet is a more challenging task than classification on the other datasets, SimPFs achieve a promising trade-off between computation cost and mAP.
5 Conclusion
In this paper, we have presented a family of simple pooling front-ends (SimPFs) for efficient audio classification. SimPFs utilize non-parametric pooling methods (e.g., max pooling) to eliminate the temporally redundant information in the input mel-spectrogram. SimPFs achieve a substantial improvement in computation efficiency for off-the-shelf audio neural networks with negligible degradation or considerable improvement in classification performance on four audio datasets. In future work, we will study parametric pooling audio front-ends to adaptively reduce audio spectrogram redundancy.
This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 “AI for Sound”, a Newton Institutional Links Award from the British Council (Grant number 623805725), and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
References
[1] (2022) Temporal knowledge distillation for on-device audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 486–490.
[2] (2020) Distilling the knowledge of BERT for sequence-to-sequence ASR. arXiv preprint:2008.03822.
[3] Large-scale audio feature extraction and SVM for acoustic scene classification. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4.
[4] (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780.
[5] (2021) PSLA: improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 3292–3306.
[6] (2019-03) TAU Urban Acoustic Scenes 2019, Development dataset.
[7] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint:1704.04861.
[8] (2006) A generic audio classification and segmentation approach for multimedia indexing and retrieval. IEEE Transactions on Audio, Speech, and Language Processing 14 (3), pp. 1062–1081.
[9] PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894.
[10] (2020) Channel compression: rethinking information redundancy among channels in CNN architecture. IEEE Access 8, pp. 147265–147274.
[11] (2017) Knowledge distillation for small-footprint highway networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4820–4824.
[12] (2021) Sound event detection: a tutorial. IEEE Signal Processing Magazine 38 (5), pp. 67–83.
[13] (2018) A multi-device dataset for urban acoustic scene classification. In Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 9–13.
[14] (2020) SpecAugment on large scale datasets. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6879–6883.
[15] (2009) . In IEEE International Conference on Multimedia, pp. 1218–1221.
[16] (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018.
[17] (2005) Audio analysis for surveillance applications. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 158–161.
[18] (2018) Speaker recognition from raw waveform with SincNet. In IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
[19] (2015) Spectral representations for convolutional neural networks. Advances in Neural Information Processing Systems 28.
[20] (2022) Low-complexity CNNs for acoustic scene classification. arXiv preprint:2208.01555.
[21] (2022) A passive similarity based CNN filter pruning for efficient acoustic scene classification. arXiv preprint:2203.15751.
[22] (2020) SVD-based redundancy removal in 1-D CNNs for acoustic scene classification. Pattern Recognition Letters 131, pp. 383–389.
[23] (2022) Deep neural decision forest for acoustic scene classification. arXiv preprint:2203.03436.
[24] (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv preprint:1804.03209.
[25] (2022) Continual learning for on-device environmental sound classification. arXiv preprint:2207.07429.
[26] (2018) Large-scale weakly supervised audio classification using gated convolutional neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 121–125.
[27] (2021) LEAF: a learnable frontend for audio classification. arXiv:2101.08596.
[28] (2013) Dictionary learning based sparse coefficients for audio classification with max and average pooling. Digital Signal Processing 23 (3), pp. 960–970.