Acoustic scene classification (ASC) is the task of classifying the sound scene, such as “airport”, “train station”, or “urban park”, in which a user is located. ASC is an important research field that plays a key role in applications such as context-awareness and surveillance [valenti2016dcase, radhakrishnan2005audio, chu2009environmental]. Detection and Classification of Acoustic Scenes and Events (DCASE) [dcase2021web] is an annual challenge that attracts attention to the field. Among the tasks in the DCASE2021 challenge, we target TASK1A: Low-Complexity Acoustic Scene Classification with Multiple Devices [dcase_task1A, dcase_dataset].
TASK1A classifies ten different audio scenes from 12 European cities using four real and 11 simulated devices. This year, the task becomes more challenging because an ASC model needs to simultaneously solve two problems that arise in real applications. First, data is collected from multiple devices, and the number of samples per device is unbalanced. Therefore, the proposed system needs to solve the domain imbalance problem while generalizing to different devices. Second, TASK1A restricts the model size and therefore requires an efficient network design.
In recent years, a number of studies have proposed more efficient and higher-performing ASC models.
Most of them are based on convolutional neural networks (CNNs) using residual networks and ensembles [task1a2020best_cnn, receptivefield, task1a2020_2nd_cnn, task1a2019best]. The top-performing models in the previous TASK1A utilize multiple CNNs in a single model with parallel connections [task1a2020best_cnn, task1a2020_2nd_cnn]. For the generalization of the model, [receptivefield, phaye2019subspectralnet] show that adjusting the receptive field size in CNN-based designs has a regularization effect. However, these works use models of several MB, so it remains challenging to satisfy the low model complexity required by this year's TASK1A. In addition, when using the previous methods, we found an accuracy drop of up to 20% on the unseen devices compared to the device with sufficient training data. In this work, we propose methods to improve generalization to unseen devices while maintaining performance in lightweight models. First, we introduce a network architecture for ASC that utilizes broadcasted residual learning [bcresnet]. Based on this architecture, we achieve higher accuracy while reducing the model size to a third of the baseline [receptivefield]. Next, we propose a novel normalization method, Residual Normalization (ResNorm), which improves generalization to unseen devices. ResNorm maintains classification accuracy while minimizing the influence of the different frequency responses of devices by normalizing frequency bands in the residual path. Finally, we describe model compression that combines pruning and quantization to satisfy the model complexity of the task, while maintaining performance using knowledge distillation.
This work is an expanded version of the challenge technical report submission [Kim2021b]. The rest of the paper is organized as follows. Section 2 describes the network architecture, Residual Normalization, and model compression methods. Section 3 shows the experimental results and analysis. Finally, we conclude the work in Section 4.
2 Proposed Method
This section introduces an efficient model design for device-imbalanced acoustic scene classification. First, we present a modified version of the Broadcasting-residual network [bcresnet] for the acoustic scene domain. Next, we propose Residual Normalization for generalization on a device-imbalanced dataset. Finally, we describe how to obtain a compressed version of the proposed system.
2.1 Network Architecture
To design a low-complexity network in terms of the number of parameters, we use a Broadcasting-residual network (BC-ResNet) [bcresnet] which uses 1D and 2D CNN features together for better efficiency. While the BC-ResNet targets human voice, we aim to classify the audio scenes.
To adapt to the differences in input domains, we make two modifications to the network, i.e., we limit the receptive field and use max-pool instead of dilation.
[Table 1: BC-ResNet-ASC architecture. Recoverable rows: conv2d 5x5, stride 2 (first layer); conv2d 1x1 with num class outputs (last layer).]
The proposed architecture, BC-ResNet-ASC, is shown in Table 1. The model has a 5x5 convolution at the front with a (2, 2) stride for downsampling, followed by BC-ResBlocks [bcresnet]. The authors of [receptivefield] show that the size of the receptive field can regularize CNN-based ASC models. We change the depth of the network and use max-pool to control the size of the receptive field. With a total of 9 BC-ResBlocks and two max-pool layers, the receptive field size is 109x109. We also apply the last 1x1 convolution before global average pooling, so that the model classifies each receptive field separately and ensembles the results by averaging. The original BC-ResNets use dilation in the temporal dimension to obtain a larger receptive field while maintaining temporal resolution across the network. We observe that time resolution does not need to be fully kept in the audio scene domain, so instead of dilation, we insert max-pool layers in the middle of the network.
In this work, we use BC-ResNet-ASC-1 and BC-ResNet-ASC-8, whose base numbers of channels in Table 1 are 10 and 80, respectively. Table 2 compares our BC-ResNet-ASC-8 with two baselines: CP-ResNet [receptivefield], a residual-network-based ASC model with limited receptive field size, and the original BC-ResNet-8 with 4 Subspectral Normalization [ssn] groups. As shown in Table 2, BC-ResNet-ASC-8 records a Top-1 test accuracy of 69.5% with only one-third of the parameters of CP-ResNet, which shows 67.8% accuracy. Moreover, with these modifications, BC-ResNet-ASC-8 outperforms the original BC-ResNet-8 by a 1% margin.
2.2 Residual Normalization
Instance normalization (IN) [instancenorm] is a representative approach to reducing unnecessary domain gaps for better domain generalization [batchinstancenorm] or domain style transfer [adain, jung2020arbitrary] in the image domain. While domain differences can be captured by channel mean and variance in the image domain, we observe that differences between audio devices are revealed along the frequency dimension, as shown in Figure 1. To obtain features that generalize across audio devices, we use instance normalization by frequency (FreqIN):

$$\mathrm{FreqIN}(x) = \frac{x - \mu_{nf}}{\sqrt{\sigma_{nf}^{2} + \epsilon}}, \tag{1}$$

$$\mu_{nf} = \frac{1}{CT}\sum_{c=1}^{C}\sum_{t=1}^{T} x_{ncft}, \qquad \sigma_{nf}^{2} = \frac{1}{CT}\sum_{c=1}^{C}\sum_{t=1}^{T}\left(x_{ncft} - \mu_{nf}\right)^{2}. \tag{2}$$

Here, $\mu_{nf}$ and $\sigma_{nf}$ are the mean and standard deviation of the input feature $x \in \mathbb{R}^{N \times C \times F \times T}$, computed for each sample and each frequency bin, where $N$, $C$, $F$, and $T$ denote the batch size, number of channels, frequency dimension, and time dimension, respectively. $\epsilon$ is a small number added to avoid division by zero.
[Table 2: Comparison of network architectures. Columns: Network Architecture, #Param, Top-1 Acc. (%).]
Direct use of IN can discard information that is useful for classification but is contained in the domain information. To compensate for the information loss due to FreqIN, we add an identity shortcut path multiplied by a hyperparameter $\lambda$. We call the resulting normalization method Residual Normalization (ResNorm):

$$\mathrm{ResNorm}(x) = \lambda \cdot x + \mathrm{FreqIN}(x). \tag{3}$$
We apply ResNorm to the input features and at the end of every stage in Table 1. There are a total of five ResNorm modules in the network.
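As an illustration of the definitions above, FreqIN and ResNorm can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in our experiments: the function names `freq_in` and `res_norm`, the default `eps=1e-5`, the toy tensor sizes, and the value `lam=0.5` are all assumptions chosen for the example, and `lam` stands for the shortcut hyperparameter of ResNorm.

```python
import numpy as np


def freq_in(x, eps=1e-5):
    """Instance normalization by frequency for a feature of shape (N, C, F, T)."""
    # Mean and variance are computed over the channel and time axes,
    # separately for every sample n and every frequency bin f.
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def res_norm(x, lam, eps=1e-5):
    """Identity shortcut scaled by lam, added to the FreqIN output."""
    return lam * x + freq_in(x, eps)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 16))               # toy feature, (N, C, F, T)
gain = rng.uniform(0.5, 2.0, size=(1, 1, 8, 1))  # toy per-frequency device response

y_clean = freq_in(x)
y_device = freq_in(x * gain)
print("max FreqIN difference under a per-frequency gain:",
      float(np.abs(y_clean - y_device).max()))

out = res_norm(x, lam=0.5)  # lam=0.5 is an arbitrary illustrative value
print("ResNorm output shape:", out.shape)
```

In this toy setting, a fixed per-frequency gain (a crude stand-in for a device frequency response) is almost entirely removed by FreqIN, while the shortcut term in `res_norm` reintroduces a scaled copy of the unnormalized feature.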
Table 3: Top-1 test accuracy (%) per device and overall.
|Method||#Param||A||B||C||S1||S2||S3||S4||S5||S6||Overall|
|BC-ResNet-ASC-1 (Baseline)||8.1k||73.1||61.2||65.3||58.2||57.3||66.2||51.5||51.5||46.3||58.9 ± 0.8|
|BC-ResNet-ASC-1 + Global FreqNorm||8.1k||73.9||60.9||65.5||60.2||57.9||67.9||50.2||54.3||49.4||60.0 ± 0.9|
|BC-ResNet-ASC-1 + Fixed PCEN||8.1k||68.0||60.4||57.2||64.0||63.0||66.2||62.3||61.8||56.5||62.2 ± 0.8|
|BC-ResNet-ASC-1 + ResNorm||8.1k||76.4||65.1||68.3||66.0||62.2||69.7||63.0||63.0||58.3||65.8 ± 0.7|
|w/o ResNorm in Network||8.1k||75.1||68.9||67.0||66.0||63.9||69.3||63.4||66.9||63.6||67.1 ± 0.8|
|w/o Shortcut||8.1k||68.2||62.1||58.6||64.2||65.3||66.3||65.1||63.8||61.3||63.9 ± 0.7|
|BC-ResNet-ASC-8 + ResNorm||315k||81.3||74.4||74.2||75.6||73.1||78.6||73.0||74.0||72.7||75.2 ± 0.4|
|w/o ResNorm in Network||315k||80.8||73.7||73.0||74.0||72.9||77.8||73.3||72.1||71.0||74.3 ± 0.3|
|w/o Shortcut||315k||78.3||73.5||69.1||73.8||72.9||75.6||72.2||72.5||71.0||73.2 ± 0.3|
2.3 Model Compression
To compress the proposed model, we utilize three model compression schemes: pruning, quantization, and knowledge distillation.
Pruning. Pruning removes unimportant weights or channels according to an importance criterion. In this work, we choose the magnitude-based one-shot unstructured pruning scheme used in [NEURIPS2020_eb1e7832]. After training, we conduct unstructured pruning on all convolution layers and then train further to recover the pruned model’s performance.
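The pruning step can be illustrated with a small NumPy sketch of magnitude-based one-shot unstructured pruning. It is only a sketch under stated assumptions: the helper name `magnitude_prune` is hypothetical, the ratio is applied to a single weight tensor (whether the threshold is chosen per layer or globally is not specified here), and the toy tensor shape is arbitrary.

```python
import numpy as np


def magnitude_prune(weights, ratio):
    """Zero out the `ratio` fraction of weights with the smallest magnitude.

    Returns the pruned copy and the boolean mask of kept weights.
    """
    flat = np.abs(weights).ravel()
    k = int(ratio * flat.size)  # number of weights to remove
    mask = np.ones(flat.size, dtype=bool)
    if k > 0:
        # Indices of the k smallest magnitudes (order within them is irrelevant).
        drop = np.argpartition(flat, k - 1)[:k]
        mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))  # toy convolution kernel
pruned, mask = magnitude_prune(w, ratio=0.89)
print("kept weights:", int(mask.sum()), "of", w.size)
```

During the additional training phase, the same mask would typically be reapplied after every update so that pruned weights stay at zero.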
Quantization. Quantization maps continuous values to a smaller set of discrete values. We quantize all of our models with quantization-aware training (QAT) with symmetric quantization [NEURIPS2020_eb1e7832]. We combine the pruning and quantization methods: during the additional training phase after pruning, we quantize the important weights that were not pruned. We quantize all convolution layers to 8 bits and use half-precision representation for the other weights.
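The mapping performed by symmetric 8-bit quantization can be sketched as follows. This shows only the quantize/dequantize mapping and not quantization-aware training itself; the function names and the choice of a single per-tensor scale derived from the maximum absolute value are assumptions made for the example.

```python
import numpy as np


def quantize_symmetric_int8(w):
    """Map a float tensor to int8 with one symmetric per-tensor scale."""
    qmax = 127  # largest magnitude used in the signed 8-bit range
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float64) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(16,))
w[::4] = 0.0  # stand-in for weights removed by pruning
q, scale = quantize_symmetric_int8(w)
err = np.max(np.abs(dequantize(q, scale) - w))
print("scale:", scale, "max abs error:", err)
```

Because the grid is symmetric around zero, pruned weights remain exactly zero after quantization, which is why the two schemes combine without extra bookkeeping in this sketch.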
Knowledge Distillation. Knowledge distillation (KD) trains a lightweight model using the outputs of a pre-trained teacher network. In general, model compression schemes such as pruning and quantization decrease performance as they reduce model complexity. To enhance the performance of the compressed model, we use a KD loss [kim2021feature] with the pre-trained model as the teacher network.
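For readers unfamiliar with distillation, the sketch below shows the classic soft-target loss (temperature-scaled KL divergence between teacher and student outputs). It is a generic stand-in and not necessarily the loss of [kim2021feature]; the function names, the temperature value, and the toy logits are assumptions made for illustration.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    log_p_s = log_softmax(student_logits / temperature)
    log_p_t = log_softmax(teacher_logits / temperature)
    p_t = np.exp(log_p_t)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * kl.mean()


teacher = np.array([[2.0, 0.5, -1.0]])  # toy teacher logits
student = np.array([[0.1, 0.2, 0.3]])   # toy student logits
print("KD loss:", kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's softened distribution and positive otherwise.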
3.1 Experimental Setup
Datasets. We evaluate the proposed method on the TAU Urban Acoustic Scenes 2020 Mobile, development dataset [dcase_dataset]. The dataset consists of a total of 23,040 audio segments recorded in 12 European cities in 10 different acoustic scenes using 3 real devices (A, B, and C) and 6 simulated devices (S1-S6). The 10 acoustic scenes are “airport”, “shopping mall”, “metro station”, “pedestrian street”, “public square”, “street with traffic”, “park”, and travelling by “tram”, “bus”, and “metro”. Audio segments from B and C are recorded simultaneously with device A, but are not perfectly synchronized. Simulated devices S1-S6 generate data using randomly selected audio segments from real device A. Each segment is 10 sec long and the sampling rate is 48kHz. [dcase_dataset] divides the dataset into training and test sets of 13,962 and 2,970 segments, respectively. In the training data, device A has 10,215 samples while B, C, and S1-S3 have about 750 samples each, which means the data is device-imbalanced. Devices S4-S6 remain unseen in training. In the test data, all devices from A to S6 have 330 segments each.
Implementation Details. We downsample the audio to 16kHz and use 256-dimensional log Mel spectrograms as input features, with a window length of 130ms and a frameshift of 30ms. During training, we augment the data to obtain a more generalized model. In the time dimension, we randomly roll each input feature by -1.5 to 1.5 sec, and the out-of-range part is wrapped around to the opposite side. We also use Mixup [mixup] and Specaugment [specaugment] with two frequency masks and two temporal masks with mask parameters of 40 and 80, respectively, and without time warping. We use Specaugment only for the large model, BC-ResNet-ASC-8. In BC-ResNet-ASC, we use Subspectral Normalization [ssn] as indicated in [bcresnet] with 4 sub-bands and a dropout rate of 0.1. We train the models for 100 epochs using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, weight decay of 0.001, and mini-batch size of 64. The learning rate increases linearly from 0 to 0.06 over the first five epochs as a warmup [warmup] and then decays to zero with cosine annealing [cosine_schedule] for the rest of the training. We use a fixed value of the hyperparameter $\lambda$ for ResNorm in all experiments. Due to the absence of a validation split in the development dataset, we report the numbers obtained with early stopping.
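Two of the training details above, the circular time roll and the warmup plus cosine learning-rate schedule, can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the schedule is evaluated per epoch (possibly fractional) rather than per optimizer step, the 1.5 sec range is converted to frames using the 30ms frameshift, and the toy feature size is arbitrary.

```python
import math

import numpy as np


def random_time_roll(feature, max_shift_frames, rng):
    """Circularly shift a (F, T) feature along time by a random offset."""
    shift = int(rng.integers(-max_shift_frames, max_shift_frames + 1))
    # np.roll wraps the out-of-range part around to the opposite side.
    return np.roll(feature, shift, axis=-1)


def learning_rate(epoch, total_epochs=100, warmup_epochs=5, peak_lr=0.06):
    """Linear warmup from 0 to peak_lr, then cosine annealing down to 0."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


max_shift = round(1.5 / 0.03)  # 1.5 sec expressed in 30ms frames
rng = np.random.default_rng(0)
feature = rng.normal(size=(256, 300))  # toy log Mel feature, (F, T)
rolled = random_time_roll(feature, max_shift, rng)

for epoch in (0, 2.5, 5, 50, 100):
    print(epoch, learning_rate(epoch))
```

The roll keeps every frame of the segment and only changes where the segment starts, which is why the out-of-range part reappears on the opposite side.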
Baselines. We compare our method with other methods and conduct ablation studies: 1) Global FreqNorm, which normalizes data by the global mean and variance of each frequency bin; 2) fixed per-channel energy normalization (PCEN) [pcen], an automatic-gain-control-based dynamic compression that we use instead of the log Mel spectrogram; 3) w/o ResNorm in Network, which uses the ResNorm module only at the input and not in the middle of the network; and 4) w/o shortcut, a special case of ResNorm with $\lambda = 0$ in Equation 3, which reduces to FreqIN.
3.2 Residual Normalization
We conduct the experiments using BC-ResNet-ASC-1 and BC-ResNet-ASC-8, and the overall results are in Table 3. The task has multi-device inputs that are imbalanced, with device A dominant. As a result, the baseline, BC-ResNet-ASC-1, shows higher accuracy on device A than on the other seen devices, B, C, S1, S2, and S3. Furthermore, the accuracies on the unseen devices, S4, S5, and S6, are even lower, which implies that the model does not generalize well to multiple devices, especially unseen ones. When we use global normalization along the frequency dimension, the result shows 60.0% accuracy, an improvement of about 1% over the baseline, but we still observe poor domain generalization. We also try PCEN, a normalized feature, instead of the log Mel spectrogram. PCEN shows improvements for unseen devices, but the performance on device A degrades due to its normalization. The proposed ResNorm uses FreqIN to obtain domain-invariant features while not losing the useful class-discriminative information, thanks to the identity shortcut connection. ‘BC-ResNet-ASC-1 + ResNorm’ shows a large improvement of 6.9 percentage points over the baseline and records 65.8% test accuracy. ResNorm improves performance not only for unseen devices but also for all seen devices.
We conduct ablation studies on the components of ResNorm. First, we use the ResNorm module only as a preprocessing module and not in the middle of the network (‘w/o ResNorm in Network’). For the small model, BC-ResNet-ASC-1, ‘w/o ResNorm in Network’ shows better performance, 67.1%, whereas for the larger model, BC-ResNet-ASC-8, it shows a performance degradation of about 1%. Because ResNorm has a regularization effect, it is expected that the module can degrade the performance of a small network. We expect that the normalization strength can be controlled by the hyperparameter $\lambda$ in Equation 3 to adapt to networks of various sizes. In this work, we use a fixed $\lambda$ and leave its automatic update as future work. Second, ‘w/o shortcut’ shows the result when $\lambda = 0$ in ResNorm, which is equal to FreqIN in Equation 1. Our design motivation is that the shortcut path keeps the information useful for classification. The results show that FreqIN records relatively lower accuracy on seen devices than ResNorm. In particular, the margins on device A are 8.2% and 3.0% for BC-ResNet-ASC-1 and BC-ResNet-ASC-8, respectively.
3.3 Model Compression
Simultaneously, we distill the knowledge of the pre-trained teacher network (‘Vanilla’ model) into the compressed model to enhance its performance, and achieve a 0.2% improvement in test accuracy. In detail, we prune the convolution layers of the model with an 89% pruning ratio relative to the Vanilla model and quantize all convolution layers of the compressed model to 8 bits. Other layers are quantized to 16 bits. The resulting ‘Compressed’ model has 33K 8-bit nonzero parameters for convolution layers and 15K 16-bit parameters for normalization, resulting in 61.5kB, and shows 75.3% test accuracy, which is 1% lower than the Vanilla model. We use an ensemble of two compressed models in the DCASE 2021 challenge, task 1A.
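The reported model size follows directly from the parameter counts and bit widths. The short calculation below reproduces it, assuming that the rounded counts 33K and 15K stand for 33,000 and 15,000 parameters and that kB is counted as 1024 bytes; the variable names are chosen for this example.

```python
conv_nonzero_params = 33_000  # nonzero convolution weights, 8 bits each
other_params = 15_000         # remaining parameters, 16 bits each

total_bytes = conv_nonzero_params * 1 + other_params * 2
size_kib = total_bytes / 1024

print(f"{total_bytes} bytes = {size_kib:.1f} kB")
# prints "63000 bytes = 61.5 kB"
```

Only nonzero convolution weights are counted, so the 89% pruning ratio contributes to the size reduction as much as the reduced bit width does.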
4 Conclusion
In this work, we design a system to achieve two goals: 1) an efficient design in terms of the number of parameters and 2) adaptation to a device-imbalanced dataset. To design an efficient acoustic scene classification model, we suggest a modified version of the Broadcasting-residual network [bcresnet] that limits the receptive field and uses max-pool. We compress the model further by utilizing three model compression schemes: pruning, quantization, and knowledge distillation. Moreover, we propose a frequency-wise normalization method, named Residual Normalization, which uses instance normalization by frequency and a shortcut connection to generalize to multiple devices while not losing discriminative information. Our system achieves 76.3% test accuracy on the TAU Urban Acoustic Scenes 2020 Mobile, development dataset with 315k parameters, and the compressed version achieves 75.3% test accuracy with 89% pruning, 8-bit quantization, and knowledge distillation. Residual Normalization has a hyperparameter that controls the regularization strength of the module. We leave the automatic update of this hyperparameter as future work.