Acoustic scene classification (ASC) is the task of recognizing scenes based on environmental sounds. Recent studies on environmental audio analysis are led by the detection and classification of acoustic scenes and events (DCASE) community, which covers a variety of tasks such as ASC, sound event detection, and audio captioning [mesaros2017detection, mesaros2018multi, mesaros2019acoustic, heittola2020acoustic]. In the DCASE 2020 challenge, ASC is divided into two subtasks: 1) subtask A for generalization across different devices and 2) subtask B, the newly released task, for a low-complexity solution in terms of the model size. Subtask A aims to classify a given audio clip recorded by multiple devices into one of the ten pre-defined acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with a medium level of traffic, tram, bus, metro, and park. Subtask B aims to classify a given audio clip into one of three higher-level classes: outdoor, indoor, and transportation. Hereafter, classification into the classes of subtask B is referred to as 3-class classification, and classification into the classes of subtask A as 10-class classification.
Recent studies on ASC can be largely divided into two strands: data preprocessing and modeling. In data preprocessing, most studies exploit deltas and delta-deltas for performance improvements [suh2020designing, hu2020device, gao2020acoustic, liu2020acoustic, koutini2020cp]. Various data augmentation strategies such as mixup [zhang2017mixup], SpecAugment [park2019specaugment], pitch shift, speed change, and adding random noise have also been explored. Many studies have split the input into several frequency bands and used the separated data as inputs [phaye2019subspectralnet, suh2020designing, wang2020acoustic]. Feature investigation and temporal division have also been explored [mun2017deep, pham2021robust, sakashita2018acoustic].
In modeling, ResNet and convolutional neural network (CNN) architectures are the most widely used in ASC. Teacher-student networks can be exploited to ease frequently misclassified classes [heo2019acoustic, jung2020knowledge]. Adjusting the kernel size to obtain a proper receptive field size has also been explored [koutini2020cp]. Attention, which is widely used in other domains, has been applied in various ways [ren2019attention, phan2019spatio]. Because DCASE deals with various tasks, utilizing the representations of other tasks and training a general-purpose network have also been investigated [shim2020audio, jung2020acoustic, kim2020audio, fonseca2018general, kong2018dcase, jung2020dcasenet]. In the latest DCASE 2020 challenge, top-ranking systems in ASC mostly focused on data preprocessing techniques.
In this paper, we propose two modeling techniques that are independent of data preprocessing: an attentive max feature map (AMFM) and joint learning of the concrete classes of subtask A and the abstract classes of subtask B. First, we find that although attention improves the performance, it excessively loses information owing to its exclusive nature. Our hypothesis is that a fraction of the de-emphasized information might further help the discriminative power of a feature if adequately addressed. We find that the max feature map (MFM), a technique that adopts a competitive scheme via an element-wise max operation [wu2018light], can be used from this perspective. MFM is also effective for ASC, as studied in previous works [shim2020audio, shim2020capturing, lee2021cnn]. Specifically, we design a new technique combining attention and MFM and refer to it as the AMFM, where attention emphasizes the most informative region and the MFM prevents excessive information loss using a relative max operation. Inspired by the MFM operation, AMFM compares the feature maps before and after the attention to mitigate the excessive information loss.
Furthermore, we propose the joint learning of the 10-class task of subtask A and the 3-class task of subtask B to improve the performance of subtask A. Many studies related to DCASE explore the use of auxiliary tasks and general-purpose networks [deshmukh2020multi, shim2020audio, jung2020acoustic, kim2020audio, fonseca2018general, kong2018dcase, jung2020dcasenet]. Hu et al. [hu2020device, hu2020two] utilized the prediction of the classifier of subtask B to improve the performance of subtask A; however, the two classifiers were trained completely separately. In this study, since the labels of the two subtasks differ in the degree of abstraction, we assume that training with the two labels together would be helpful. Therefore, we propose the joint learning of the two subtasks and explore various architectures and the balancing between the two subtasks. To the best of our knowledge, this is the first approach to simultaneously train an ASC system using the labels of both subtask A and subtask B. The proposed system has relatively low complexity and achieves a performance comparable to that of a state-of-the-art system without complex data preprocessing. The experimental results show that both generalization to multiple devices and low complexity can be achieved in a single system.
2 Proposed method
2.1 Attentive max feature map
MFM was originally proposed for image classification where noisy labels are dominant [wu2018light]. Wu et al. [wu2018light] argued that a rectified linear unit (ReLU) activation separates noisy and informative signals using a threshold (or bias), which might lead to information loss, especially for the first several convolution layers. To overcome this issue, MFM was proposed to replace the non-linear activation function with a competitive scheme via an element-wise max operation. The max operation of MFM selects the optimal feature at each location learned by different filters and helps separate noisy and informative signals. Previous works demonstrate that MFM is also effective for ASC [shim2020audio, shim2020capturing, lee2021cnn].
The MFM operation can be written as follows. Let $x \in \mathbb{R}^{C \times T \times F}$ be an output feature map of a convolution layer, where $C$, $T$, and $F$ refer to the number of output channels, time-domain frames, and frequency bins, respectively. $x$ is split into two feature maps, $x_1$ and $x_2$, where $x_1, x_2 \in \mathbb{R}^{\frac{C}{2} \times T \times F}$. The MFM-applied feature map $y$ is obtained by applying $y = \max(x_1, x_2)$ element-wise. In the left portion of Figure 1, the pink box includes the conventional MFM operation.
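The split-and-max step described above can be sketched in a few lines. This is a minimal illustration in NumPy rather than the authors' PyTorch code; it assumes a channel-first layout (channels, time, frequency) and that the split takes the first and second halves of the channel axis. The function name `max_feature_map` and the toy shapes are hypothetical.

```python
import numpy as np


def max_feature_map(feature_map: np.ndarray) -> np.ndarray:
    """Element-wise max over the two halves of the channel axis.

    feature_map: array of shape (channels, time, freq); channels must be even.
    Returns an array of shape (channels // 2, time, freq).
    """
    channels = feature_map.shape[0]
    if channels % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    # Assumption: the two competing maps are the first and second channel halves.
    first_half, second_half = np.split(feature_map, 2, axis=0)
    return np.maximum(first_half, second_half)


# Toy example: 4 channels, 3 time frames, 2 frequency bins.
rng = np.random.default_rng(0)
conv_output = rng.standard_normal((4, 3, 2))
mfm_output = max_feature_map(conv_output)
print(mfm_output.shape)  # (2, 3, 2)
```

The output has half as many channels as the input, so a layer followed by MFM needs twice as many convolution filters as the desired output width.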
The attention mechanism has proven its effectiveness in ASC [ren2019attention, phan2019spatio, liu2020acoustic]. It can highlight important information and enrich the representation. Among various attention mechanisms, the convolutional block attention module (CBAM) combines channel and spatial attention modules and has the advantage of seamless integration regardless of the architecture. Our previous work confirmed that applying the CBAM to the MFM also improves the performance [shim2020capturing]. However, by visualizing the feature map, we empirically find that excessive information loss occurs when attention is applied, with only narrow parts being emphasized (Figure 2-(b)). We hypothesize that a fraction of the de-emphasized information might help the discriminative power of a feature.
Accordingly, we propose applying the attention mechanism in a competitive manner, referred to as AMFM, to mitigate the abovementioned excessive deletion. Inspired by the MFM operation, AMFM compares two feature maps, those before and after attention, element-wise. Denoting the MFM-applied feature map by $y$ and its attention-applied counterpart by $y_{att}$, the AMFM can be written as $\max(y, y_{att})$, applied element-wise. We compose an AMFM block involving MFM and AMFM, as illustrated in Figure 1, where MFM and AMFM are the pink and green boxes in the left portion, respectively. AMFM is not only applicable to MFM; it can also be used in any architecture to which attention is applied. In this study, its effectiveness is validated only with MFM, and we plan to apply AMFM to other architectures in future work. Figure 2 shows that AMFM can attenuate exaggerated attention and select salient representations by comparing the two feature maps.
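The comparison of the feature maps before and after attention can be sketched as follows. This is only an illustration under stated assumptions: the attention here is a placeholder sigmoid channel gate, not CBAM, and the code uses NumPy rather than the authors' PyTorch implementation. The names `toy_channel_attention` and `attentive_max_feature_map` are hypothetical.

```python
from typing import Callable

import numpy as np


def toy_channel_attention(feature_map: np.ndarray) -> np.ndarray:
    """Placeholder attention (NOT CBAM): scale each channel by a sigmoid gate
    computed from its global average over time and frequency."""
    channel_mean = feature_map.mean(axis=(1, 2), keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-channel_mean))  # values in (0, 1)
    return feature_map * gate


def attentive_max_feature_map(
    mfm_output: np.ndarray,
    attention_fn: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Element-wise max of the feature map before and after attention."""
    attended = attention_fn(mfm_output)
    return np.maximum(mfm_output, attended)


rng = np.random.default_rng(0)
mfm_output = rng.standard_normal((2, 3, 2))  # (channels, time, freq)
amfm_output = attentive_max_feature_map(mfm_output, toy_channel_attention)
print(amfm_output.shape)  # (2, 3, 2)
```

Because the result is an element-wise maximum, every output value is at least as large as both the pre-attention and post-attention values, which is how activations that attention would have suppressed can survive.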
2.2 Joint learning
A few studies have considered the relationship between the two subtasks of ASC. Hu et al. [hu2020device] utilized joint prediction, in which the classifiers for the two subtasks are trained completely separately and the final prediction is calculated using the score fusion of the two classifiers. However, many studies have shown that training related tasks simultaneously helps improve the performance [zamir2018taskonomy, dwivedi2019representation]. Since the labels of the two subtasks differ in the degree of abstraction, we assume that training with the two labels together would be helpful. Therefore, we propose the joint training of the two subtasks.
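The score fusion used in joint prediction can be illustrated with a small sketch. The fusion rule below (multiplying each fine-class probability by the probability of its parent coarse class and renormalizing) is an assumption made for illustration; the text only states that the scores of the two classifiers are fused. The function name, the four-scene toy setup, and all probabilities are hypothetical.

```python
def fuse_scores(fine_probs, coarse_probs, parent_index):
    """Weight each fine-class probability by its parent coarse-class
    probability, then renormalize so the fused scores sum to 1."""
    fused = [p * coarse_probs[parent_index[i]] for i, p in enumerate(fine_probs)]
    total = sum(fused)
    return [score / total for score in fused]


# Toy setup: four scenes, three coarse classes
# (0 = indoor, 1 = outdoor, 2 = transportation).
scenes = ["airport", "park", "bus", "tram"]
parent_index = [0, 1, 2, 2]
fine_probs = [0.40, 0.10, 0.30, 0.20]  # fine classifier prefers "airport"
coarse_probs = [0.20, 0.10, 0.70]      # coarse classifier prefers transportation

fused = fuse_scores(fine_probs, coarse_probs, parent_index)
best = scenes[max(range(len(fused)), key=fused.__getitem__)]
print(best)  # bus
```

In this toy case the coarse classifier overrides the fine classifier's first choice, which is the kind of correction joint prediction is meant to provide.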
In the experiments, we compare four different methods and vary the weights between the two tasks for the joint learning. First, we adopt the pre-training method: the system is first trained for the 3-class classification and is then fine-tuned for the 10-class classification. Second, conventional multi-task learning (MTL) [caruana1997multitask] is applied for the joint learning of the two subtasks. In this case, the network is designed to learn a shared representation and to separate the tasks only in the last layer. Third, the extended MTL with additional fine-tuning of each task is exploited. The extended MTL allocates additional layer(s) to each task after the last hidden layer that the two tasks share. Lastly, we investigate the sequential MTL, which orders the training of the two tasks according to their hierarchical relationship. In the conventional and extended MTL methods, the classification layers of the two subtasks are separated only at the end of the system. In contrast, in the sequential MTL, the layer classifying one subtask is connected to the classification layer of the other subtask. We also compare the order of the 3-class and 10-class classifications and find that it is more effective to place the 3-class classification first, followed by the 10-class classification. This is in line with deep learning structures that learn abstract tasks in the earlier layers and specific tasks in the later layers [maninis2019attentive]. Unlike the other MTL structures, the sequential MTL is premised on the hierarchical relationship between the two tasks.
For further improvement, we apply the joint prediction [hu2020device]. In terms of adjusting the weight between the two tasks, we explore both intuitive and methodological approaches: a grid search and GradNorm [chen2018gradnorm].
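The grid search over task weights can be pictured as a weighted sum of the two task losses. The sketch below assumes a cross-entropy loss for each task and an unnormalized weighted sum, neither of which is stated in the text; the function names and the example logits are hypothetical. The default weights follow the 3-class : 10-class ratio notation used in the experiments.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def joint_loss(logits_3, target_3, logits_10, target_10, w_3=1.0, w_10=5.0):
    """Weighted sum of the 3-class and 10-class losses (assumed form)."""
    loss_3 = cross_entropy(logits_3, target_3)
    loss_10 = cross_entropy(logits_10, target_10)
    return w_3 * loss_3 + w_10 * loss_10


# One made-up example evaluated over a grid of 3-class : 10-class ratios.
logits_3 = [2.0, 0.5, -1.0]
logits_10 = [0.1] * 10
grid = {w_10: joint_loss(logits_3, 0, logits_10, 3, 1.0, w_10) for w_10 in (1, 2, 3, 4, 5)}
for w_10, loss in grid.items():
    print(f"1 : {w_10} -> {loss:.4f}")
```

GradNorm replaces the fixed weights with values adapted during training; that procedure is not shown here.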
3 Experimental settings and results
3.1 Experimental settings
For all experiments, the DCASE 2020 task 1 subtask A (1-A) dataset is used. A total of 13,965 and 2,970 audio clips are used for training and validation, respectively. We do not use the DCASE 2020 task 1 subtask B (1-B) audio clips but use only the corresponding labels for the 3-class classification. The DCASE 2020 task 1-A dataset consists of audio clips collected from three real devices (A, B, and C) and six simulated devices (s1-s6). Only devices A-C and s1-s3 are used in the training set; s4-s6 are not available in the training phase. Each audio clip has a duration of 10 s with a 44.1 kHz sampling rate and 24-bit resolution.
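Deriving the 3-class labels from the 10-class labels amounts to a fixed lookup. The sketch below follows the scene grouping of the DCASE 2020 task description as we understand it (it is not spelled out in the text above and should be checked against the dataset metadata); the label strings, the class ordering, and the helper name `coarse_label` are illustrative assumptions.

```python
# Assumed grouping of the ten scenes into the three coarse classes.
SCENE_TO_COARSE = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "street_pedestrian": "outdoor",
    "public_square": "outdoor",
    "street_traffic": "outdoor",
    "park": "outdoor",
    "tram": "transportation",
    "bus": "transportation",
    "metro": "transportation",
}
COARSE_CLASSES = ["indoor", "outdoor", "transportation"]  # arbitrary index order


def coarse_label(scene: str) -> int:
    """Return the 3-class index for a 10-class scene label."""
    return COARSE_CLASSES.index(SCENE_TO_COARSE[scene])


clip_scenes = ["park", "metro", "airport"]
coarse_indices = [coarse_label(scene) for scene in clip_scenes]
print(coarse_indices)  # [1, 2, 0]
```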
For each audio clip, 256-dimensional Mel-spectrograms are extracted. A short-time Fourier transform with 2048 FFT points is applied, using a 40 ms window and a 20 ms hop length. Mixup [zhang2017mixup] and SpecAugment [park2019specaugment] are exploited for data augmentation. Other data preprocessing techniques such as applying the logarithm, deltas, double-deltas, and sub-band frequency separation are not used in this work. All models are implemented with PyTorch, a deep learning library in Python. The SGD optimizer with a momentum of 0.9 is used. The initial learning rate is set to 0.001 and scheduled with stochastic gradient descent with warm restarts. The batch size and the number of epochs are set to 24 and 800, respectively. The architecture details are similar to those in our previous work [shim2020capturing]. In the blocks added for the MTL, the parameters of the AMFM block are the same as in the other blocks. The last hidden layer for each task has 100 nodes and is followed by the output layer for the corresponding subtask.
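As a quick check of the feature extraction settings, the window and hop lengths can be converted to samples and the resulting number of frames per clip estimated. The frame count below assumes no padding at the clip edges, which is an assumption; a centered, padded STFT would give a slightly different count. The variable names are illustrative.

```python
SAMPLE_RATE = 44_100  # Hz
CLIP_SECONDS = 10
WINDOW_MS, HOP_MS = 40, 20
N_FFT = 2048
N_MELS = 256

window_samples = SAMPLE_RATE * WINDOW_MS // 1000  # 1764
hop_samples = SAMPLE_RATE * HOP_MS // 1000        # 882
clip_samples = SAMPLE_RATE * CLIP_SECONDS         # 441000

# Frames per clip without edge padding (assumption).
n_frames = 1 + (clip_samples - window_samples) // hop_samples

# One-sided spectrum size for a real-valued input of length N_FFT.
n_freq_bins = N_FFT // 2 + 1

print(window_samples, hop_samples, n_frames, n_freq_bins)  # 1764 882 499 1025
```

Since the 1764-sample window is shorter than the 2048 FFT points, each frame would be zero-padded before the transform in a typical implementation, and the 1025 linear-frequency bins are then mapped onto the 256 Mel bands.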
|CNN w/ LeakyReLU||X||69.6|
3.2 Result analysis
3.2.1 Attentive max feature map
Table 1 describes the results of comparing the effects of applying attention and DNN structures: CNN, MFM, and AMFM. When attention is applied to the CNN with Leaky ReLU, the performance decreased, but when attention is applied to MFM, the performance significantly improved. When the proposed AMFM is applied, an additional performance improvement is achieved compared to the MFM with an accuracy of 70.8%. Figure 2-(c) illustrates the AMFM result, which can emphasize important information while preventing information loss. The joint learning experiments are conducted using the AMFM structure which shows the best result.
|System||Joint||# Params||Acc (%)|
|Separated Classifier [hu2020device]||O||1.5M||69.4|
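|System||Weight ratio (3-class : 10-class)||Acc (%)|
|Proposed method||1 : 1||70.3|
|Proposed method||1 : 2||69.6|
|Proposed method||1 : 3||70.3|
|Proposed method||1 : 4||70.7|
|Proposed method||1 : 5||71.3|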
3.2.2 Joint learning
Table 2 shows the comparison of various joint learning strategies. Pre-training was explored first among the methods. Pre-training requires two training steps, training for the 3-class task followed by fine-tuning for the 10-class task, and tuning the learning rate and selecting the model are time-consuming. Therefore, we did not adjust other variables further for this method. With the conventional MTL, there is no performance improvement. Using the extended MTL, which has additional layers for each task after the last hidden layer that the two tasks share, the best result without joint prediction was achieved with fewer parameters. The sequential MTL without joint prediction showed a slight improvement in performance compared to the 70.8% of AMFM. Although joint prediction showed its effectiveness in [hu2020device], the performance improvement could not be confirmed in our experiments on joint learning strategies applied to the AMFM structure. The experimental results imply that the conventional MTL may not be able to learn each task sufficiently because each task output has only one classification layer, and that the sequential MTL has a high dependence between the two tasks because they share many hidden layers. This suggests that even if the tasks are related, performance can decrease when excessive interference occurs, as reported in [maninis2019attentive]. In contrast, the extended MTL additionally learns about the individual tasks with more hidden layers; thus, it can reduce task interference and yields the best performance.
Table 3 indicates the results of adjusting the weight ratio between the two subtasks for the joint learning. We adjust the weight ratio between the two tasks using both manual and automatic approaches. The best result is achieved when the ratio is 1 : 5 for the abstract and specific labels.
Table 4 shows the accuracy of the 3-class classification. MTL is well known to be effective when related tasks are jointly trained, but it is difficult to show whether two tasks are related. When the relevance between tasks is low, MTL often causes a performance degradation compared to training each task individually. Therefore, to verify the relationship between the 3-class and 10-class classifications, we also investigate the performance of the 3-class classification, although our goal is to improve the 10-class classification task. The accuracy of the DCASE 2020 baseline of subtask B is 88% and the average of the submitted systems is 87.3% [heittola2020acoustic]. In our experiment, when training with the AMFM structure without joint learning, the accuracy is 91.3% for 3-class and 70.8% for 10-class. When joint learning is applied, we obtain the best accuracy of 92.2% for 3-class and 71.3% for 10-class; the improvement on both tasks supports the view that the two subtasks are related.
3.2.3 Comparison with the recent studies
Table 5 presents a comparison with the top five state-of-the-art systems, without the application of ensemble. Our proposed system demonstrates comparable performance with state-of-the-art systems without complex data preprocessing techniques. In addition, our method achieves such performance with a low-complexity architecture. Although it is outside the scope of this study, we expect that applying more data preprocessing methods could lead to further improvements in the future.
|System||Acc (%)||# Params|
|DCASE2020 Baseline [heittola2020acoustic]||54.1||5M|
|Suh et al. [suh2020designing]||73.7||13M|
|Hu et al. [hu2020device]||74.6||-|
|Gao et al. [gao2020acoustic]||71.8||4M|
|Liu et al. [liu2020acoustic]||72.1||3M|
|Koutini et al. [koutini2020cp]||71.8||225M|
In this paper, we proposed the AMFM technique and joint learning, motivated by information loss and the abstraction level of classes, respectively. First, we proposed the AMFM, which selects salient features in a competitive manner and thereby avoids information loss. Second, we proposed the joint learning of the 10-class classification of subtask A and the 3-class classification of subtask B. Experimental results demonstrated the effectiveness of our proposed methods. Although our work aimed to improve the performance of subtask A, we achieved performance comparable to that of state-of-the-art systems with a low complexity comparable to that targeted in subtask B. Various data preprocessing techniques could be complementary to our proposed method, and we plan to utilize both for further improvement in future work.