
Capturing scattered discriminative information using a deep architecture in acoustic scene classification

by   Hye-Jin Shim, et al.

Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues. However, these details are less noticeable and are easily removed by conventional non-linear activations (e.g. ReLU). Furthermore, making design choices to emphasize trivial details can easily lead to overfitting if the system is not sufficiently generalized. In this study, based on an analysis of the ASC task's characteristics, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem. We adopt a max feature map method to replace conventional non-linear activations in a deep neural network; it applies an element-wise comparison between different filters of a convolution layer's output. Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power. Various experiments are conducted using the detection and classification of acoustic scenes and events 2020 task1-a dataset to validate the proposed methods. Our results show that the proposed system consistently outperforms the baseline, where the single best performing system has an accuracy of 70.4%, compared to 65.1% for the baseline.





1 Introduction

The detection and classification of acoustic scenes and events (DCASE) community has been hosting multiple challenges to utilize sound event information generated in everyday environments and by physical events [23, 19, 20]. DCASE challenges provide not only datasets for various audio-related tasks, but also a platform to compare and analyze the proposed systems. Among the many kinds of tasks covered in DCASE challenges, acoustic scene classification (ASC) is a multi-class classification task that classifies an input recording into one of the predefined scenes.

In the process of developing an ASC system, two major issues have been widely explored in recent research literature. One is the generalization of the system in domain mismatch conditions that could arise from different recording devices [3, 24, 12]. More specifically, if an ASC system does not generalize to unknown devices, its performance on those devices degrades in the test phase. Another critical issue is the occurrence of frequently misclassified classes (e.g. shopping mall - airport, tram - metro) [7, 11]. These pairs of classes share many acoustic characteristics. Trivial details can be decisive clues for accurate classification; however, focusing on such details easily leads to a trade-off that degrades generalization. In particular, due to the characteristics of the ASC task (see Section 2), discriminative information is scattered throughout the recording. However, widely used convolutional neural network (CNN)-based models that exploit the ReLU activation function make feature representations sparse, as ReLU discards negative values.


Figure 1: t-SNE visualization results of embeddings. (a) Devices. (b) Scenes.

To investigate the aforementioned problems, we present a visualization of the representation vectors (i.e. embeddings, codes) of the baseline using the t-SNE algorithm [18], depicted in Figure 1. Here, (a) and (b) show the plotted embeddings where different colors denote different device and scene labels, respectively. Figure 1-(a) shows that the devices do not form noticeable clusters, indicating good generalization across devices. However, it can be seen in Figure 1-(b) that the scenes are not separated by clear decision boundaries. Therefore, based on this analysis, we focus on reducing the confusion between frequently misclassified classes.

In this study, we explore several methods to reduce the removal of information and overfitting based on the characteristics of the ASC task; this analysis is presented in Section 2. First, instead of common CNNs, we utilize a light CNN (LCNN) architecture [27]. LCNN adopts a max feature map (MFM) operation instead of non-linear activation functions such as ReLU or tanh, and it demonstrates state-of-the-art performance in spoofing detection for automatic speaker verification (i.e. audio spoofing detection) [15, 13]. Second, to mitigate overfitting, data augmentation and attention-based deep architectural modules are explored. Two data augmentation techniques, mix-up and SpecAugment, are investigated [29, 22]. The convolutional block attention module (CBAM) and squeeze-and-excitation (SE) networks are studied for enhancing the discriminative power using a minimal number of additional parameters [8, 26].

2 Characteristics of ASC

In this section, we present an analysis of the characteristics of the ASC task. We assume that the discriminative information for the ASC task included in an audio recording is scattered. Sound cues could occur either consistently or occasionally. For example, consistently occurring sound cues, such as a low degree of reverberation and the sound of the wind, imply an outdoor location. Various sound events such as the chirping of birds and the barking of dogs are also important cues, but they are impulsive and short, and they may only occur in some of the recordings that are labeled as “parks”. Therefore, important cues can have multiple characteristics; they are not concentrated in specific parts of the data, and they occur irregularly. In our analysis, gathering the scattered information that resides in an input recording is of interest.

In tasks such as speaker and image classification, the target information in data is relatively clear. As speaker classification utilizes human voice to identify speaker identity, the discriminative information is concentrated in human voice rather than in non-speech segments. Therefore, many studies on speaker classification attempt to remove non-speech segments using techniques such as voice activity detection (VAD). Similarly, many tasks in the image domains adopt various methods to focus only on the target object. Because of the differences in these tasks such as speaker and image classification versus the ASC task, we argue that different modeling approaches should be considered.

Audio spoofing detection is a task that shares similar characteristics with the ASC task considered in our analysis. Audio spoofing detection makes a binary decision on whether an input utterance is spoofed. In this task, discriminative information is more scattered because distortions occur across the entire audio file during the spoofing process. Therefore, non-speech segments are also important because the distortion is not limited to the speech segments. Previous studies also show that VAD could eliminate useful information [14, 2]. Given these characteristics, LCNN has been demonstrated to be particularly effective in audio spoofing detection because it discards less information [15, 13]. With a common CNN, relatively less informative parts (i.e. negative values) could be removed by the ReLU activation, making the representation sparse (as illustrated in Figure 2-(a)). This phenomenon has been reported in [27] to occur especially in the first few convolution layers.

We hypothesize that this phenomenon would apply to an ASC system too because the ASC task has commonalities in that important information is scattered across the data, similar to audio spoofing detection. To mitigate the problem of sparse representation in an ASC task, we propose to utilize MFM operation included in the LCNN architecture. As MFM operation selects feature maps with an element-wise competitive relationship, trivial information can be retained if the value is relatively high. Furthermore, focusing on trivial details could also lead to overfitting. Hence, in this study, we aim to adopt regularization methods, while introducing a minimum number of additional parameters and retaining the discriminative power of the system by applying state-of-the-art deep architecture modules.

Figure 2: Comparison of (a) the ReLU activation function (left) and (b) MFM (right). Orange, green, and white indicate negative, positive, and zero values, respectively. ReLU removes all negative values, while MFM keeps the element-wise maximum based on a comparative relationship.

3 Proposed framework



| Layer | Filter / Stride | Output shape |
|---|---|---|
| Conv_1 | 7 × 3 / 1 × 1 | l × 124 × 64 |
| MFM_1 | - | l × 124 × 32 |
| MaxPool_1 | 2 × 2 / 2 × 2 | (l / 2) × 62 × 32 |
| Conv_2a | 1 × 1 / 1 × 1 | (l / 2) × 62 × 64 |
| MFM_2a | - | (l / 2) × 62 × 32 |
| BatchNorm_2a | - | (l / 2) × 62 × 32 |
| Conv_2 | 3 × 3 / 1 × 1 | (l / 2) × 62 × 96 |
| MFM_2 | - | (l / 2) × 62 × 48 |
| CBAM_2 | - | (l / 2) × 62 × 48 |
| MaxPool_2 | 2 × 2 / 2 × 2 | (l / 4) × 31 × 48 |
| BatchNorm_2 | - | (l / 4) × 31 × 48 |
| Conv_3a | 1 × 1 / 1 × 1 | (l / 4) × 31 × 96 |
| MFM_3a | - | (l / 4) × 31 × 48 |
| BatchNorm_3a | - | (l / 4) × 31 × 48 |
| Conv_3 | 3 × 3 / 1 × 1 | (l / 4) × 31 × 128 |
| MFM_3 | - | (l / 4) × 31 × 64 |
| CBAM_3 | - | (l / 4) × 31 × 64 |
| MaxPool_3 | 2 × 2 / 2 × 2 | (l / 8) × 16 × 64 |
| Conv_4a | 1 × 1 / 1 × 1 | (l / 8) × 16 × 128 |
| MFM_4a | - | (l / 8) × 16 × 64 |
| BatchNorm_4a | - | (l / 8) × 16 × 64 |
| Conv_4 | 3 × 3 / 1 × 1 | (l / 8) × 16 × 64 |
| MFM_4 | - | (l / 8) × 16 × 32 |
| CBAM_4 | - | (l / 8) × 16 × 32 |
| BatchNorm_4 | - | (l / 8) × 16 × 32 |
| Conv_5a | 1 × 1 / 1 × 1 | (l / 8) × 16 × 64 |
| MFM_5a | - | (l / 8) × 16 × 32 |
| BatchNorm_5a | - | (l / 8) × 16 × 32 |
| Conv_5 | 3 × 3 / 1 × 1 | (l / 8) × 16 × 64 |
| MFM_5 | - | (l / 8) × 16 × 32 |
| CBAM_5 | - | (l / 8) × 16 × 32 |
| MaxPool_5 | 2 × 2 / 2 × 2 | (l / 16) × 8 × 32 |
| FC_1 | - | 160 |
| MFM_FC1 | - | 80 |
| FC_2 | - | 10 |

Table 1: The LCNN architecture. The numbers in the output shape column refer to the frame (time), frequency, and the number of kernels. MFM, MaxPool and FC indicate max feature map, max pooling layer and fully-connected layer, respectively.
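The frequency and channel columns of Table 1 can be retraced with a few lines of arithmetic. The sketch below assumes that each 2 × 2 max pooling rounds the frequency dimension up (inferred from the 31 to 16 step in the table) and that every MFM halves the channel count; the helper names `pool_out` and `mfm_out` are illustrative and not taken from the authors' code.

```python
import math


def pool_out(size: int) -> int:
    """Size after 2x2 max pooling with stride 2, rounding up (assumed)."""
    return math.ceil(size / 2)


def mfm_out(channels: int) -> int:
    """MFM keeps the element-wise maximum of two channel halves."""
    return channels // 2


# Frequency dimension after Conv_1, then after MaxPool_1, 2, 3 and 5.
freq_trace = [124]
for _ in range(4):
    freq_trace.append(pool_out(freq_trace[-1]))

# Channels produced by Conv_1, Conv_2, Conv_3 and Conv_4, then after MFM.
conv_channels = [64, 96, 128, 64]
mfm_channels = [mfm_out(c) for c in conv_channels]

print(freq_trace)    # [124, 62, 31, 16, 8]
print(mfm_channels)  # [32, 48, 64, 32]
```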

3.1 LCNN

LCNN is a deep learning architecture, initially designed for face recognition with noisy labels [27]. Its main feature is a novel operation referred to as max feature map (MFM) that replaces the non-linear activation function of a deep neural network (DNN). The MFM operation extends the concept of maxout activation [4] and adopts a competitive scheme between filters of a given feature map. In this study, we introduce the MFM operation to the ASC task; to the best of our knowledge, this is the first report on such an implementation. Our choice is based on two assumptions. First, we hypothesize that with MFM, scattered discriminative information is more likely to be retained throughout an input feature map than with the widely used ReLU non-linearity, which discards negative values. Second, we note that MFM operations demonstrate state-of-the-art performance in audio spoofing detection, a task that shares common properties with ASC.

The implementation of an MFM operation can be denoted as follows. Let $a$ be a given feature map derived through a convolution layer, $a \in \mathbb{R}^{K \times T \times F}$, where $K$, $T$, and $F$ refer to the number of output channels, time domain frames, and frequency bins, respectively. We split $a$ into two feature maps, $a_1$ and $a_2$, $a_1, a_2 \in \mathbb{R}^{(K/2) \times T \times F}$. The MFM applied feature map is obtained by $\max(a_1, a_2)$, element-wise. Figure 2-(b) illustrates the MFM operation.
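As an illustration of the operation described above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a (channels, frames, frequency bins) array layout and splits the channels into a first and a second contiguous half; the function names and the toy input are hypothetical. ReLU is included only for comparison.

```python
import numpy as np


def max_feature_map(a: np.ndarray) -> np.ndarray:
    """Split a (K, T, F) feature map into two halves along the channel
    axis and keep the element-wise maximum, giving (K // 2, T, F)."""
    k = a.shape[0]
    if k % 2 != 0:
        raise ValueError("the number of channels must be even")
    a1, a2 = a[: k // 2], a[k // 2:]
    return np.maximum(a1, a2)


def relu(a: np.ndarray) -> np.ndarray:
    """Conventional ReLU: every negative value becomes zero."""
    return np.maximum(a, 0.0)


# Toy feature map with K=2 channels, T=1 frame, F=2 frequency bins.
a = np.array([[[-3.0, 2.0]],
              [[-1.0, -5.0]]])

mfm = max_feature_map(a)
rectified = relu(a)

print(mfm.tolist())        # [[[-1.0, 2.0]]]
print(rectified.tolist())  # [[[0.0, 2.0]], [[0.0, 0.0]]]
```

In this toy case the negative value -1 survives MFM because it is the larger of the two competing entries, whereas ReLU sets every negative entry to zero.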

Specifically, our design of LCNN is similar to that of [15], with some modifications. The architecture of [15] is itself a modified version of the original LCNN [27] that applies additional batch normalization after a max pooling layer. Table 1 provides details of the architecture of the proposed system that adopts an LCNN. Conv_a, MFM_a, BatchNorm_a, Conv, MFM, and CBAM can be seen as a block, and 4 blocks are implemented to contain an adequate number of parameters. The number of blocks is determined based on comparative experiments.

3.2 Regularization and deep architecture modules

With limited labelled data and recent DNNs with many parameters, overfitting easily occurs in DNN-based ASC systems [20, 11, 21, 29, 22]. To account for overfitting, our design choices include data augmentation methods and deep architecture modules for generalization purposes with enhanced model capacity. For regularization, we adopt two data augmentation methods: mix-up [29] and SpecAugment [22]. Let $x_i$ and $x_j$ be two audio recordings that belong to class $y_i$ and $y_j$, respectively, where $y$ is a one-hot vector. A mix-up operation creates an augmented audio recording with a corresponding soft-label using two different recordings. Formally, an augmented audio recording can be denoted as the following:

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $\lambda$ is a random variable drawn from $\mathrm{Beta}(\alpha, \alpha)$, $\alpha \in (0, \infty)$, and $\lambda$ is a real value between 0 and 1. With a rather simple implementation, mix-up is widely adopted for the ASC task in the literature.
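The mix-up operation can be sketched in a few lines. The example below assumes that the mixing coefficient is drawn from a Beta distribution with equal shape parameters, as in the original mix-up paper [29]; the value `alpha=0.2`, the feature shape, and all names are illustrative assumptions and not the settings used in this study.

```python
import numpy as np


def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Blend two examples and their one-hot labels with one coefficient."""
    lam = rng.beta(alpha, alpha)           # mixing coefficient in [0, 1]
    x_hat = lam * x_i + (1.0 - lam) * x_j  # augmented input
    y_hat = lam * y_i + (1.0 - lam) * y_j  # corresponding soft label
    return x_hat, y_hat, lam


rng = np.random.default_rng(0)

# Two toy inputs (250 frames x 128 Mel bins) from classes 3 and 7 of 10.
x_i, x_j = np.ones((250, 128)), np.zeros((250, 128))
y_i, y_j = np.eye(10)[3], np.eye(10)[7]

x_hat, y_hat, lam = mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=rng)

print(x_hat.shape)                # (250, 128)
print(y_hat[3] + y_hat[7])        # the soft label still sums to one
```

Because the same coefficient is used for the input and the label, the soft label records how much each source recording contributes to the augmented example.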

We also adopt SpecAugment [22], which was first proposed for robust speech recognition and masks certain regions of a two-dimensional input feature (e.g. spectrogram, Mel-filterbank energy). Among the three methods proposed in [22], we adopt frequency masking and time masking. Let $X = \{x_{t,f}\}$, $X \in \mathbb{R}^{T \times F}$, be a Mel-filterbank energy feature extracted from an input audio recording, where $T$ and $F$ are the number of frames and Mel-frequency bins, respectively, and $t$ and $f$ are indices for $T$ and $F$, respectively. To apply time masking, we randomly select $t_s$ and $t_e$, $t_s \le t_e$, where $s$ and $e$ are indices for start and end, and then mask $x_{t_s:t_e,\,\cdot}$ with 0. To apply frequency masking, we randomly select $f_s$ and $f_e$, $f_s \le f_e$, and then mask $x_{\cdot,\,f_s:f_e}$ with 0. In this study, we sequentially apply SpecAugment and mix-up for better generalization.
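Time and frequency masking can be sketched as follows. This is a minimal illustration under stated assumptions: the mask width is drawn first and the start index second, the maximum widths `max_time_width` and `max_freq_width` are hypothetical parameters (the study does not state its values here), and the end index is exclusive.

```python
import numpy as np


def mask_time_and_frequency(x, max_time_width, max_freq_width, rng):
    """Zero one random block of frames and one random block of Mel bins.

    x has shape (T, F). Returns the masked copy and the chosen
    (start, end) ranges, where end is exclusive.
    """
    n_frames, n_bins = x.shape
    out = x.copy()

    # Time masking: choose a width, then a start that keeps the block inside.
    t_width = int(rng.integers(0, max_time_width + 1))
    t_s = int(rng.integers(0, n_frames - t_width + 1))
    out[t_s:t_s + t_width, :] = 0.0

    # Frequency masking: same procedure along the Mel-bin axis.
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_s = int(rng.integers(0, n_bins - f_width + 1))
    out[:, f_s:f_s + f_width] = 0.0

    return out, (t_s, t_s + t_width), (f_s, f_s + f_width)


rng = np.random.default_rng(0)
x = np.ones((250, 128))  # toy Mel-filterbank energy feature
masked, (t_s, t_e), (f_s, f_e) = mask_time_and_frequency(x, 50, 16, rng)

print(masked.shape)  # (250, 128)
```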

To increase model capacity while introducing a minimal number of additional parameters to the model, we investigate two recent deep architecture modules: SE [8] and CBAM [26]. SE focuses on the relationship between different channels of a given feature map. SE first squeezes the input feature map via a global average pooling layer to derive a channel descriptor that includes the global spatial (time and frequency in ASC) context. Then, using a minimal number of additional parameters, SE re-calibrates channel-wise dependencies via an excitation step. Specifically, the excitation step adopts two fully-connected layers that take the derived channel descriptor as input and output a re-calibrated channel descriptor. SE transforms the given feature map by multiplying it with the re-calibrated channel descriptor, where each value in the channel descriptor is broadcast to conduct element-wise multiplication with each filter of the feature map. In our experiments that incorporate the SE module, we apply SE to the output of each residual block, following the methods reported in the literature. Further details regarding the SE module can be found in [8].
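The squeeze and excitation steps can be illustrated with a small NumPy sketch. It follows the general design of [8] (a ReLU after the first fully-connected layer and a sigmoid after the second); the reduction ratio `r`, the random weights, and the function names are assumptions for illustration and do not reflect the trained system.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def squeeze_excitation(x, w1, b1, w2, b2):
    """Re-weight the channels of a feature map x with shape (C, T, F)."""
    descriptor = x.mean(axis=(1, 2))                # squeeze: (C,)
    hidden = np.maximum(w1 @ descriptor + b1, 0.0)  # first FC + ReLU: (C // r,)
    scale = sigmoid(w2 @ hidden + b2)               # second FC + sigmoid: (C,)
    return x * scale[:, None, None]                 # broadcast over time and frequency


rng = np.random.default_rng(0)
C, T, F, r = 8, 5, 4, 4  # r is a hypothetical reduction ratio

x = rng.standard_normal((C, T, F))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)

y = squeeze_excitation(x, w1, b1, w2, b2)
print(y.shape)  # (8, 5, 4)
```

The two fully-connected layers add only 2 · C · (C / r) weights plus biases per module, which is why the parameter overhead stays small.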

CBAM is a deep architecture module that sequentially applies channel attention and spatial attention. To derive a channel attention map, CBAM applies global max and average pooling operations over the spatial domain. It then uses two fully-connected layers. Channel attention is applied by element-wise multiplication of the input feature map with the channel attention map, where each value of the channel attention map is broadcast to fit the spatial domain. To derive a spatial attention map, CBAM applies the same two pooling operations (max and average) along the channel domain and then adopts a convolution layer. Spatial attention is also applied by element-wise multiplication of the feature map after channel attention with the derived spatial attention map. In our experiments using the CBAM module, we apply it to the output of each residual block following the literature. Further details regarding the CBAM module can be found in [26].
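The two attention steps can be sketched in NumPy as below. The sketch follows the general design of [26]: a shared two-layer network for both pooled channel descriptors, followed by a convolution over the stacked channel-wise average and max maps. The 3 × 3 kernel, the zero padding, the random weights, and all names are illustrative assumptions; the convolution is written as an explicit loop for readability, not speed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """x: (C, T, F) -> channel weights of shape (C,)."""
    avg_desc = x.mean(axis=(1, 2))
    max_desc = x.max(axis=(1, 2))

    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))


def spatial_attention(x, kernel):
    """x: (C, T, F), kernel: (2, k, k) with odd k -> weights of shape (T, F)."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, T, F)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    n_t, n_f = x.shape[1:]
    response = np.zeros((n_t, n_f))
    for t in range(n_t):
        for f in range(n_f):
            response[t, f] = np.sum(padded[:, t:t + k, f:f + k] * kernel)
    return sigmoid(response)


def cbam(x, w1, w2, kernel):
    """Channel attention first, then spatial attention on the result."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]


rng = np.random.default_rng(0)
C, T, F, r = 8, 6, 5, 4  # r is a hypothetical reduction ratio
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
kernel = rng.standard_normal((2, 3, 3))

y = cbam(x, w1, w2, kernel)
print(y.shape)  # (8, 6, 5)
```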

4 Experiments

4.1 Dataset

We use the DCASE2020 task1-a dataset for all our experiments. It includes 23,040 audio recordings sampled at 44.1 kHz with 24-bit resolution, where each recording has a duration of 10 s. The dataset contains audio recordings from three real devices (A, B, and C) and six augmented devices (S1-S6). Unless explicitly mentioned, all performances in this paper are reported using the official DCASE2020 fold 1 configuration, which assigns 13,965 recordings as the training set and 2,970 recordings as the test set.

4.2 Experimental configurations

Mel-spectrograms with 128 Mel-filterbanks are used for all experiments where the number of FFT bins, window length, and shift size are set to 2,048, 40 ms, and 20 ms, respectively. During the training phase, we randomly select 250 consecutive frames (5 s) instead of using the whole recording. In the test phase, an audio recording is split into three overlapping sub-recordings (i.e. 0–5 s, 2.5–7.5 s, and 5–10 s), and the mean of the output layer is used to perform classification. This technique has been reported to mitigate overfitting in previous works [7, 10].
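The cropping and score-averaging procedure can be sketched as follows. The example assumes that a 10 s recording yields 500 frames at a 20 ms shift (the exact count depends on padding), and `toy_model` is a hypothetical stand-in for the trained network that maps a segment to class scores.

```python
import numpy as np

SEGMENT_FRAMES = 250  # 5 s at a 20 ms shift
HOP_FRAMES = 125      # 2.5 s between the starts of test segments


def random_crop(features, rng):
    """Training: pick 250 consecutive frames at a random position."""
    start = int(rng.integers(0, features.shape[0] - SEGMENT_FRAMES + 1))
    return features[start:start + SEGMENT_FRAMES]


def segment_starts(n_frames):
    """Testing: start frames of the overlapping sub-recordings."""
    return list(range(0, n_frames - SEGMENT_FRAMES + 1, HOP_FRAMES))


def predict_recording(features, model):
    """Average the output-layer scores over all test segments."""
    outputs = [model(features[s:s + SEGMENT_FRAMES])
               for s in segment_starts(features.shape[0])]
    return np.mean(outputs, axis=0)


def toy_model(segment):
    """Hypothetical two-class scorer based on the mean energy."""
    m = float(segment.mean())
    return np.array([1.0 - m, m])


# Toy recording: silent first half, constant energy in the second half.
features = np.zeros((500, 128))
features[250:] = 1.0

crop = random_crop(features, np.random.default_rng(0))
scores = predict_recording(features, toy_model)

print(segment_starts(500))  # [0, 125, 250]
print(scores.tolist())      # [0.5, 0.5]
```

The three start frames correspond to the 0–5 s, 2.5–7.5 s, and 5–10 s sub-recordings.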

We use an SGD optimizer with a batch size of 24. The initial learning rate is set to 0.001 and scheduled using stochastic gradient descent with warm restarts [16]. For a single system, we train the DNN in an end-to-end fashion. For the ensemble system, support vector machine (SVM) classifiers are employed. Further technical details are provided in our technical report [25] to facilitate the reproduction of the conducted experiments.
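For reference, the learning rate schedule with warm restarts can be written as a cosine annealing that resets at the start of every cycle, as proposed in the warm restart paper [16]. Only the initial learning rate of 0.001 comes from this section; the cycle length, the cycle multiplier, and the minimum learning rate in the sketch below are assumptions chosen for illustration.

```python
import math


def sgdr_learning_rate(epoch, lr_max=0.001, lr_min=0.0,
                       cycle_len=10, cycle_mult=1):
    """Cosine-annealed learning rate that restarts after each cycle.

    epoch may be fractional; cycle_len and cycle_mult (an integer >= 1)
    are assumed values, not settings reported in this study.
    """
    t_cur, t_i = epoch, cycle_len
    while t_cur >= t_i:  # move to the cycle that contains this epoch
        t_cur -= t_i
        t_i *= cycle_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


schedule = [sgdr_learning_rate(e) for e in range(21)]

print(schedule[0])   # 0.001 at the start of the first cycle
print(schedule[10])  # 0.001 again right after the first restart
```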


5 Result analysis

| System | Acc (%) |
|---|---|
| DCASE2019 baseline [19] | 46.5 |
| DCASE2020 baseline [6] | 54.1 |
| Ours-baseline | 65.3 |

Table 2: Baseline comparison with other systems. Classification accuracies are reported using the DCASE2020 fold1 configuration.

Table 2 compares the baseline of this study with the two official baselines of the DCASE community. The DCASE2019 baseline takes log Mel-spectrograms as input and uses convolution and fully-connected layers. The DCASE2020 baseline takes L3 embeddings [1] extracted from another DNN as input and uses fully-connected layers for classification. Our baseline uses Mel-spectrograms as inputs and consists of convolution, batch normalization [9], and Leaky ReLU [17] layers with residual connections [5], where an SE module follows each residual block (the model architecture and the accuracies per device and scene are presented in our technical report for the DCASE2020 challenge). The results show that our baseline outperforms the DCASE2020 baseline by over 10 percentage points in classification accuracy.

| System | Config | Acc (%) |
|---|---|---|
| ResNet | - | 65.1 |
| ResNet (baseline) | mix-up | 65.3 |
| ResNet | SpecAug | 66.7 |
| ResNet | mix-up+SpecAug | 67.3 |
| LCNN | - | 67.1 |
| LCNN | mix-up | 68.4 |
| LCNN | SpecAug | 69.2 |
| LCNN | mix-up+SpecAug | 69.4 |
| LCNN | SE | 68.0 |
| LCNN | mix-up+SpecAug+SE | 69.8 |
| LCNN (submitted) | mix-up+SpecAug+CBAM | 70.4 |

Table 3: Effect analysis of LCNN, data augmentation, and deep architecture modules.

Table 3 describes the effectiveness of the proposed approaches using LCNN, SE, and CBAM. It also compares the effects of using the mix-up and/or SpecAugment data augmentation methods. First, comparing the system architectures without any data augmentation or deep architecture modules, ResNet and LCNN achieve accuracies of 65.1% and 67.1%, respectively. To optimize the LCNN system, we also adjust the number of blocks and find that the original LCNN with 4 blocks achieves the best performance. Second, we validate the effectiveness of data augmentation. The results show that mix-up and SpecAugment are both effective and that combining the two methods is the best choice. Third, we apply the deep architecture modules of SE and CBAM. From the results of the experiment, we observe that CBAM is slightly better than SE.

Table 4 compares the most frequently misclassified pairs of classes, taken from the confusion matrices of the baseline and the proposed system. Due to limited space, we omit the entire confusion matrix and show only the top-5 frequently misclassified pairs. Except for the pair of shopping mall and street pedestrian, the number of misclassifications is reduced. There are several improvements for other misclassified pairs as well, but even within the top-5, the total number of misclassifications decreased by 66, from 462 to 396 (roughly 14% of the baseline total, or 17% of the proposed system's total).
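The totals in Table 4 can be recomputed directly from the per-pair counts. The short script below copies the counts from the table; the relative reduction depends on which total is used as the denominator, so both variants are shown. Variable names are illustrative.

```python
# (baseline, proposed) confusion counts for the top-5 pairs in Table 4.
confusions = {
    "Metro - Tram": (114, 81),
    "Shopping - Airport": (107, 101),
    "Shopping - Metrost": (84, 56),
    "Shopping - Streetped": (83, 88),
    "Publicsquare - Streetped": (74, 70),
}

# Positive values mean fewer confusions in the proposed system.
reductions = {pair: b - p for pair, (b, p) in confusions.items()}

total_baseline = sum(b for b, _ in confusions.values())
total_proposed = sum(p for _, p in confusions.values())
total_reduction = total_baseline - total_proposed

relative_to_baseline = 100.0 * total_reduction / total_baseline
relative_to_proposed = 100.0 * total_reduction / total_proposed

print(total_baseline, total_proposed, total_reduction)  # 462 396 66
print(round(relative_to_baseline, 1))  # 14.3
print(round(relative_to_proposed, 1))  # 16.7
```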

Table 5 shows the performance of the proposed systems submitted for the DCASE2020 challenge task1-a. Our method comprises 4-fold cross-validation and applies a score-sum ensemble. For the ensemble, an SVM classifier with a radial basis function kernel is used. Ours-LCNN shows the result of our submitted LCNN system, which outperforms the baseline by over 15 percentage points using less than one-fifth the number of parameters. Further, using a score-sum ensemble with another ASC system based on audio tagging (also submitted to the DCASE2020 workshop; the authors will add a citation if accepted), the classification accuracy increased to 71.7%.

| Class | Baseline | Proposed | Reduction |
|---|---|---|---|
| Metro - Tram | 114 | 81 | 33 |
| Shopping - Airport | 107 | 101 | 6 |
| Shopping - Metrost | 84 | 56 | 28 |
| Shopping - Streetped | 83 | 88 | -5 |
| Publicsquare - Streetped | 74 | 70 | 4 |
| Total | 462 | 396 | 66 |

Table 4: Comparison of the number of misclassifications for frequently confused pairs of acoustic scenes between the baseline and the proposed system. Reduction refers to the decrease in the number of confusions between the two classes.

| System | # Param | Acc (%) |
|---|---|---|
| DCASE2020 baseline [6] | 5M | 51.4 |
| Ours-LCNN | 0.9M | 68.5 |
| Ours-LCNN+tagging | 1.6M | 71.7 |

Table 5: Results of our submitted systems for the DCASE2020 challenge task1-a.

6 Conclusion

In this paper, we assumed that the information that enables classification between different scenes with similar characteristics is scattered throughout the recording in the ASC task. For example, a shopping mall and an airport share the characteristic that, being indoors, they are reverberant and filled with a babble of voices. Therefore, trivial details could be important cues to distinguish the two classes. Based on this hypothesis, we proposed a method that is expected to capture this discriminative information better. We applied two deep architecture modules, LCNN and CBAM, and two data augmentation methods, mix-up and SpecAugment. The proposed method helped to improve the system performance with less computation and fewer additional parameters. We achieved an accuracy of 70.4% using the single best performing system, compared to 65.1% for the baseline.


  • [1] J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello (2019) Look, listen and learn more: design choices for deep audio embeddings. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pp. 3852–3856.
  • [2] H. Dinkel, Y. Qian, and K. Yu (2017) Small-footprint convolutional neural network for spoofing detection. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3086–3091.
  • [3] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen (2018) Unsupervised adversarial domain adaptation for acoustic scene classification. arXiv preprint arXiv:1808.05777.
  • [4] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. arXiv preprint arXiv:1302.4389.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [6] T. Heittola, A. Mesaros, and T. Virtanen (2020) Acoustic scene classification in DCASE 2020 challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020). Submitted.
  • [7] H. Heo, J. Jung, H. Shim, and H. Yu (2019) Acoustic scene classification using teacher-student learning with soft-labels. Proc. Interspeech 2019, pp. 614–618.
  • [8] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
  • [10] J. Jung, H. Heo, H. Shim, and H. Yu (2018) DNN based multi-level feature ensemble for acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 113–117.
  • [11] J. Jung, H. Heo, H. Shim, and H. Yu (2019) Distilling the knowledge of specialist deep neural networks in acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, pp. 114–118.
  • [12] M. Kosmider (2019) Calibrating neural networks for secondary recording devices. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, NY, USA, pp. 25–26.
  • [13] C. Lai, N. Chen, J. Villalba, and N. Dehak (2019) ASSERT: anti-spoofing with squeeze-excitation and residual networks. arXiv preprint arXiv:1904.01120.
  • [14] I. Lapidot and J. Bonastre (2019) Effects of waveform PMF on anti-spoofing detection. Proc. Interspeech 2019, pp. 2853–2857.
  • [15] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov (2019) STC antispoofing systems for the ASVspoof2019 challenge. arXiv preprint arXiv:1904.05576.
  • [16] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  • [17] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
  • [18] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  • [19] M. Mandel, J. Salamon, and D. P. W. Ellis (2019) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, NY, USA.
  • [20] M. D. McDonnell and W. Gao (2020) Acoustic scene classification using deep residual networks with late fusion of separated high and low frequency paths. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 141–145.
  • [21] S. Mun, S. Park, D. K. Han, and H. Ko (2017) Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane. Proc. DCASE, pp. 93–97.
  • [22] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pp. 2613–2617.
  • [23] M. D. Plumbley, C. Kroos, J. P. Bello, G. Richard, D. P. W. Ellis, and A. Mesaros (2018) Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Tampere University of Technology, Laboratory of Signal Processing.
  • [24] P. Primus and D. Eitelsebner (2019) Acoustic scene classification with mismatched recording devices. Tech. Rep., DCASE2019 Challenge.
  • [25] H. Shim, J. Kim, J. Jung, and H. Yu (2020) Audio tagging and deep architectures for acoustic scene classification: UOS submission for the DCASE 2020 challenge. Technical report, DCASE2020 Challenge.
  • [26] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19.
  • [27] X. Wu, R. He, Z. Sun, and T. Tan (2018) A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896.
  • [28] X. Wu, R. He, and Z. Sun (2015) A lightened CNN for deep face representation. arXiv preprint arXiv:1511.02683.
  • [29] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.