The task of acoustic event (or scene) classification (AEC) is to annotate given audio streams with their semantic categories. It is an important step for audio content analysis, audio information retrieval [1, 2, 3], and speech applications that integrate with automatic speech recognition (ASR). In ASR, acoustic and language models can be applied based on a clear hierarchical structure of speech (e.g., phone, syllable, word). However, because acoustic events have no phone-like or word-like units to connect low-level acoustic features to high-level semantic labels, the model pipelines that are successful in ASR are not suitable for AEC. Most conventional AEC algorithms follow a two-step framework: feature extraction and classifier design. In feature extraction, the idea of bag of frames (BoF) is often applied [4, 5]. Based on the BoF, acoustic events are represented as histogram distributions of basic frame features. To capture long temporal information in event representation, the bag of acoustic words (BoAW) model has been proposed, in which acoustic events are represented as histogram distributions of basic temporal-frequency spectral patches, or acoustic words [6, 7, 8, 9]. To take the temporal correlation between acoustic frames or words into consideration, the hidden Markov model (HMM) has also been used in modeling. In these conventional algorithms, feature extraction and classifier modeling are designed independently and are not optimized in a joint learning framework.
With the successful application of supervised deep learning (SDL) frameworks to image and speech processing and recognition, SDL frameworks have also been applied to AEC. Their advantage is that they can automatically learn discriminative features and classifiers in a joint learning framework. Many models with various network architectures have been proposed within the SDL frameworks. The convolutional neural network (CNN) model can explore temporal- and/or frequency-shift invariant features for AEC [11, 12, 13, 14]. With long short-term memory (LSTM) units or gated recurrent units (GRU), the recurrent neural network (RNN) can be efficiently trained for AEC. Models that combine the advantages of the CNN and RNN have also been proposed, e.g., the convolutional recurrent neural network (CRNN) model, in which the CNN is used to explore invariant features while the RNN models the temporal structure in classification [16, 17].
Among these deep models, the CNN is an indispensable building block for designing a state-of-the-art system. In a typical deep learning pipeline for AEC, multiple CNN layers are used for feature extraction. By stacking multiple CNN layers, hierarchical multi-scale features are explored. Features extracted from top layers encode long-range feature dependency, since top-layer convolutions correspond to large receptive fields. Therefore, features in the final representation layer encode global and abstract features that are suitable for AEC. Although feature representations from top layers are robust or invariant to various local variations, the local representations from bottom layers, which are important for discriminating different acoustic events, may be smoothed out in the final representation. Therefore, discriminative features from local regions should be efficiently propagated to the final representation for classification tasks. Based on this consideration, several classification models that integrate multi-scale features or multi-scale decision scores have been proposed [18, 19]. Most of these works, rather than using the hierarchical scale features extracted from stacked CNN layers, feed multi-scale spectral patches into stacked CNN layers to extract multi-scale features or multi-scale scores. In our opinion, it is unnecessary to pass multi-scale spectral patches through the model repeatedly, because the redundant calculations increase the computational complexity.
Besides integrating multi-scale hierarchical features for classification, attention modeling integrated into deep CNN (DCNN) models has also been proposed for AEC, for example, attention- and localization-based deep models [20, 21, 22, 23] and attention pooling algorithms [24, 25, 26]. These studies were initially inspired by the success of attention models in machine translation and image processing [27, 28, 29]. However, most of these algorithms use either temporal or spatial attention to localize the important time or time-frequency regions, and pool features along the temporal or frequency dimension. The final feature representation fed to the classifier is still based on features from the last convolutional block, which has a large receptive field. Several advantages of feature maps in deep neural networks have not been explored. For example, in a DCNN, features are explored and represented in spatial (time-frequency), multi-channel (number of feature maps), and hierarchical (multi-scale) structures, and these features represent different aspects of acoustic events.
In summary, the semantic meaning of sound events may lie at different levels of abstraction, i.e., discriminative features of sound events are distributed over multiple scales with local and global dependency. In this paper, we propose a progressive multi-scale attention (MSA) model for AEC. The MSA model dynamically weights features in consecutive scales based on their importance, and aggregates multi-scale features for AEC. Although algorithms using multi-scale features and attention have been proposed [30, 31], in those algorithms the multi-scale features were concatenated or used as inputs to another network for discriminative feature extraction. In our model, the small-scale features are progressively and adaptively propagated to top layers for discriminative feature extraction. In functional effect, our proposed MSA model is similar to the residual convolutional network (ResCNN), which progressively passes small-scale features via skip connections from bottom to top layers [32, 33]. However, the ResCNN uses fixed and implicit scale feature weighting, and the initial motivation for its skip connections is to deal with the vanishing gradient problem in training deep networks. In our MSA model, we explicitly use adaptive scale weighting for feature propagation from bottom to top layers, so that the final feature representation encodes local and global discriminative features over a wide range of scales, which is expected to improve performance on the AEC task. Our contributions are summarized as follows:
(1) We explicitly formulate the importance of features in consecutive scales from stacked CNN layers with an attention model, and adaptively propagate small-scale features to top layers. The final representation encodes discriminative features from a wide range of scales, weighted by their importance for AEC.
(2) We are the first to reveal the relationship between our MSA and ResCNN models based on mathematical formulations. The ResCNN model can be regarded as a special case of implicit, fixed-scale feature propagation in feature extraction. In addition, depending on the design of the scale attention function, temporal and spatial context dependency can be taken into consideration. In this sense, conventional temporal and/or spatial attention models can be regarded as special cases of the proposed MSA model.
(3) We propose a system for AEC based on the MSA model, and evaluate its performance while exploring several factors that affect it.
The remainder of the paper is organized as follows. Section II introduces the proposed framework with attention models, and explains its connection to the ResCNN with mathematical formulations. Section III presents AEC experiments and results based on the proposed framework, and analyzes the contribution of the attention models in detail. Discussion and conclusions are given in Section IV.
II Progressive scale attention for AEC
Since discriminative features of sound events are encoded in multi-scale patterns, we need to address two problems: (1) how to extract multi-scale features from acoustic signals; and (2) how to weight and integrate these multi-scale features based on their importance for class discrimination. In this section, we first explain multi-scale feature extraction in the DCNN, and then introduce our proposed MSA model and the design of the AEC framework based on it.
II-A Multi-scale feature representations in a deep convolutional network
Intuitively, the multi-scale patterns of acoustic events are distributed over various types of temporal-frequency patterns. Therefore, we define temporal-frequency spectral patches as instances of acoustic events. Based on this definition, acoustic events can be described as bags of spectral patches. In addition, to account for instances of different sizes, the bags can be composed of multi-scale spectral patches. With this concept, DCNN-based acoustic feature extraction can be explained as a process of multi-scale feature detection. For convenience in explaining our proposed architecture, we first describe the formulation of the DCNN. The DCNN is composed of multiple processing blocks, and each block consists of an affine transform (convolution), a nonlinear activation, and feature pooling:
$$\mathbf{h}_l = \mathrm{MP}\big(f(\mathbf{W}_l * \mathbf{h}_{l-1} + \mathbf{b}_l)\big), \qquad (1)$$
where $l$ is the layer index, $\mathbf{W}_l *$ is a linear convolution with 2-dimensional (2D) kernels (convolution along both the time and frequency dimensions), $\mathbf{b}_l$ is a bias, $f(\cdot)$ is the nonlinear activation, and $\mathrm{MP}(\cdot)$ is a max-pooling (MP) operation. The output of the $l$-th layer, $\mathbf{h}_l \in \mathbb{R}^{C \times H \times W}$, is a 3-dimensional (3D) tensor, where $C$, $H$, and $W$ are the number of CNN filters (or filter channels) and the height and width of the feature maps, respectively. The hierarchical feature structure corresponding to multi-scale spectral patches is encoded in these tensors. An explanation is shown in Fig. 1.
In this figure, there are two layers of convolutional blocks (function modules are defined as in Eq. 1). The first input layer, “Scale 0”, corresponds to the raw spectral space. The convolutional filters in each layer can be regarded as spectral patch detectors. Each “pixel” in the output space of a convolutional block is a representation of a spectral patch at a particular scale (with a particular receptive field size). For example, as shown in Fig. 1, each pixel in “Scale 1” or “Scale 2” represents a spectral patch of the corresponding receptive field size in the original spectral space. These spectral patches can be regarded as instances of acoustic events, and classification frameworks based on bags of multiple instances can be applied. In most frameworks, an averaged feature representation obtained from the top layer is used for classification [34, 35, 36, 37, 38, 39, 40]. Local small-scale representations in lower layers that encode class-discriminative information may therefore be smoothed out. It is better to take advantage of the rich representations in a DCNN model for class discrimination, i.e., to propagate multi-scale representations from bottom to top layers.
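The growth of the receptive field with depth can be made concrete with a small calculation. The sketch below is illustrative only: it assumes stride-1 convolutions with 3×3 kernels followed by 2×2 max-pooling with stride 2 (the block configuration reported later in the implementation details), and the function name `receptive_fields` is hypothetical. The sizes it returns are not claimed to be those drawn in Fig. 1.

```python
def receptive_fields(num_blocks, kernel=3, pool=2):
    """Per-axis receptive-field size after each conv + max-pool block.

    Assumes stride-1 convolution and non-overlapping pooling (stride == pool).
    """
    rf, jump = 1, 1                  # receptive field and cumulative stride at the input
    sizes = []
    for _ in range(num_blocks):
        rf += (kernel - 1) * jump    # stride-1 convolution widens the field
        rf += (pool - 1) * jump      # the pooling window widens it further
        jump *= pool                 # pooling stride multiplies the step between pixels
        sizes.append(rf)
    return sizes


if __name__ == "__main__":
    print(receptive_fields(3))
```

Under these assumptions, one "pixel" after the third block summarizes a much larger spectral patch than one after the first block, which is the sense in which deeper layers correspond to larger scales.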
II-B Scale attention in consecutive layers
In multi-layer CNNs, multi-scale features are extracted and propagated through consecutive layers, and their importance can be measured by a scale attention model. A direct realization of this scale attention is to adaptively weight multi-scale features in consecutive layers, as illustrated in Fig. 2. In this figure, each convolution block (“CNN block”) is composed of a linear convolutional operator and a nonlinear transform. The output of the layer at scale $s+1$ has a larger receptive field than the output of the preceding layer at scale $s$. Each “pixel” in the layer at scale $s+1$ can be regarded as a “smoothed” version of a “patch” in the preceding layer at scale $s$. To keep discriminative structure across consecutive layers, the output should be a combination of the two feature spaces weighted by their importance (discrimination ability in classification tasks). For convenience of illustration and explanation, Fig. 2 shows the attentive weighting of the two consecutive scales in only one filter channel. In Fig. 2, the transform between the two consecutive scales $s$ and $s+1$ is defined as:
$$\mathbf{x}^{s+1} = T(\mathbf{x}^{s}), \qquad (2)$$
where $T(\cdot)$ is the feature transform between scale $s$ and scale $s+1$, i.e., the transform function of the convolution block shown in Fig. 2. At each “pixel” position, the attended output is represented as:
$$\hat{x}_i = \alpha_i x_i^{s+1} + (1 - \alpha_i) x_i^{s}, \qquad (3)$$
where $\alpha_i$ is the attention weight at the $i$-th position (at the “pixel” level) between scales $s$ and $s+1$, constrained as $0 \le \alpha_i \le 1$. Eq. 3 shows that the output at each position is a weighted sum of the features at the two consecutive scales according to their importance. Eq. 3 can be rewritten as:
$$\hat{x}_i = x_i^{s} + \alpha_i \big(x_i^{s+1} - x_i^{s}\big). \qquad (4)$$
From Eq. 4, we can see that the term $x_i^{s+1} - x_i^{s}$ is exactly the residual branch used in the residual network (ResNet) [32, 33]. In the following, we illustrate our idea through its connection to the ResNet, one of the most widely used state-of-the-art model frameworks in deep learning.
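The step from the weighted-sum form to the residual form can be checked numerically. The following sketch assumes that the attended output is a position-wise convex combination of the features at the two consecutive scales, with a weight in [0, 1] on the larger-scale feature, and that both feature maps share the same grid. The function names and the random test data are hypothetical.

```python
import numpy as np


def attend_scales(x_s, x_s1, alpha):
    """Weighted-sum form: position-wise convex combination of two scales."""
    return alpha * x_s1 + (1.0 - alpha) * x_s


def attend_residual(x_s, x_s1, alpha):
    """Residual form: identity path plus an attention-weighted difference."""
    return x_s + alpha * (x_s1 - x_s)


rng = np.random.default_rng(0)
x_s = rng.normal(size=(4, 5))      # features at the smaller scale (one channel)
x_s1 = rng.normal(size=(4, 5))     # features at the larger scale, same grid (assumed)
alpha = rng.uniform(size=(4, 5))   # position-wise attention weights in [0, 1]

out_sum = attend_scales(x_s, x_s1, alpha)
out_res = attend_residual(x_s, x_s1, alpha)
print(np.allclose(out_sum, out_res))
```

The two functions are algebraically identical; the second merely makes the difference between the two scales explicit as the quantity that the attention weight scales.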
II-C Residual convolutional neural network (ResCNN)
The original purpose of the ResNet architecture [32, 33] is to deal with the vanishing gradient problem when a deep network is optimized with gradient back-propagation (BP) algorithms. With skip connections, the gradient used for updating model parameters can be efficiently back-propagated to the lower layers of a deep network, so the parameters of both top and bottom layers can be well trained. Beyond this original motivation, several studies have revealed that the benefit of the ResNet is not only efficient training of a deep network; it also improves model generalization owing to the ensemble and stochastic depth properties of the model structure [42, 43]. When convolutional blocks are used in the ResNet, small-scale features can be carried from bottom to top layers through skip connections. There are two types of ResNet blocks with convolution operators (ResCNN), as shown in Fig. 3: one with an identity connection (panel (a)), and the other with a convolutional connection that makes the input and output compatible in feature dimensions (panel (b)). In this figure, the batch normalization (BN) and nonlinear activation function included in the convolution block are omitted for convenience of explanation. The transform in this ResCNN block is:
$$\mathbf{y} = \mathbf{x} + F(\mathbf{x}), \qquad (5)$$
where $\mathbf{x}$ is the input of the ResCNN block, $\mathbf{y}$ is the output, and $F(\cdot)$ is the transform function of the residual branch.
II-D Relationship between multi-scale attention and ResCNN
From Eq. 5, we can see that the ResCNN tries to learn the residual transform function as:
$$F(\mathbf{x}) = \mathbf{y} - \mathbf{x}. \qquad (6)$$
From this equation, we can see that the conventional ResCNN can be explained as a special case of the proposed multi-scale attention model with a fixed, input-independent attention weight, i.e., the conventional ResCNN implicitly performs multi-scale feature processing by propagating small-scale features through the skip connections. The transform function of the residual branch, $F(\cdot)$, contains a convolution operation; because the features at the smaller scale are added directly to the output, the residual branch tries to learn more about the large-scale features at the next scale. Based on this analysis, we assume that the benefit of the ResCNN is that it efficiently propagates multi-scale features in the feature representation, which helps to integrate local and global feature dependency for classification. Our multi-scale attention model can be implemented with the ResCNN structure as a backbone by adding attention blocks on the residual branches (as shown in panel (b) of Fig. 4).
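The relationship can be illustrated with a toy block. The sketch below assumes that the attention weight multiplies the residual branch output position-wise, consistent with placing attention blocks on the residual branches. The `branch`, `adaptive`, and `constant` callables are hypothetical stand-ins, not the convolutional modules of the paper.

```python
import numpy as np


def residual_block(x, branch):
    """Plain residual block: identity path plus residual branch."""
    return x + branch(x)


def attended_residual_block(x, branch, attention):
    """Residual block whose branch output is scaled position-wise by attention(x)."""
    return x + attention(x) * branch(x)


branch = lambda x: np.tanh(0.5 * x)             # toy stand-in for the convolutional branch
adaptive = lambda x: 1.0 / (1.0 + np.exp(-x))   # input-dependent weights in (0, 1)
constant = lambda x: np.ones_like(x)            # fixed weights, independent of the input

x = np.linspace(-2.0, 2.0, 12).reshape(3, 4)
y_plain = residual_block(x, branch)
y_fixed = attended_residual_block(x, branch, constant)
y_adapt = attended_residual_block(x, branch, adaptive)
print(np.allclose(y_plain, y_fixed), np.allclose(y_plain, y_adapt))
```

Under this assumed form, a constant weight of one reproduces the plain residual block, whereas an input-dependent weight changes how much of the branch output reaches the block output at each position.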
II-E Design of the multi-scale attention function as a feed-forward attention network
There are many strategies for estimating the attention weight $\alpha_i$ used in Eq. 7, for example, feed-forward attention, feed-back attention, and local and global attention [27, 28, 45]. In our study, as a general form, a feed-forward attention network is designed. The position-wise attention weight for features between scales $s$ and $s+1$ is defined as:
$$\alpha_i = \sigma\big(g(x_i^{s}, c(x_i^{s}); \theta_a)\big). \qquad (8)$$
In this equation, $g(\cdot)$ is a network function that takes $x_i^{s}$ and its context $c(x_i^{s})$ as inputs, and $\theta_a$ denotes the attention network parameters. In our study, $g(\cdot)$ is implemented as a convolutional neural network, so the context information of $x_i^{s}$ can be easily integrated via convolutional operators. $\sigma(\cdot)$ is the logistic sigmoid function, which constrains the attention value to the range $[0, 1]$. Considering the different scale resolutions in attention weighting, two types of implementations are proposed: in one, the attention network keeps the same resolution as the residual branch (panel (a) of Fig. 5); in the other, the attention network uses down- and up-sampling (panel (b) of Fig. 5). In our experiments, we investigate different down- and up-sampling methods. From Eq. 8, we can see that the scale attention in our proposal is an adaptive weighting process that depends on the current input and its context. It differs from the ResCNN, which uses a fixed summation of features from consecutive scales. In addition, this scale attention is a general and unified form of attention modeling, since temporal and spatial information can be integrated in the calculation as context information.
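A feed-forward attention weight of this kind can be sketched with a single convolution followed by a sigmoid. The example below is a minimal illustration, not the attention network of the paper: it assumes one convolutional layer with an odd, square kernel that mixes all channels into one attention map, zero padding at the borders, and hypothetical names (`attention_map`, `weight`, `bias`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_map(x, weight, bias=0.0):
    """Position-wise attention from a (C, H, W) feature tensor.

    weight has shape (C, k, k) with odd k; k = 1 uses no spatial context,
    k > 1 lets each weight depend on the neighbouring positions as well.
    Returns an (H, W) map with values in (0, 1).
    """
    C, H, W = x.shape
    k = weight.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # zero-pad the spatial borders
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]            # position (i, j) and its context
            logits[i, j] = np.sum(patch * weight) + bias
    return sigmoid(logits)


if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(3, 4, 5))
    kernel = np.random.default_rng(1).normal(size=(3, 3, 3))
    print(attention_map(feats, kernel).shape)
```

The explicit loops keep the dependence on the local context visible; a practical implementation would use a framework convolution instead.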
II-F The proposed framework for AEC based on multi-scale attention model
This framework follows the state-of-the-art pipeline for AEC tasks, in which invariant feature extraction and temporal context integration are modeled with a CNN and an RNN, respectively [16, 17]. The framework has two modules: one for hierarchical scale feature processing, and the other for temporal integration and score aggregation. The feature processing blocks in the first module can be realized by CNN blocks, residual CNN blocks, or our proposed scale attention blocks. The explored features are aggregated by a pooling operation to give classification scores in the final stage. The design of the two modules is introduced in detail in the following subsections.
II-F1 Implementation of scale attention block
Based on our analysis, the scale attention can be formulated as a position-wise attention on the residual branch, with the ResCNN used as a backbone. The implementation of the scale attention block is shown in Fig. 7. By stacking several scale attention blocks in the feature process module (Fig. 6), we obtain a refined multi-scale feature representation of spectral patches, which is used in the next stage for AEC.
II-F2 Temporal integration and score aggregation
After obtaining importance weighted features from the multi-scale attention model, the temporal integration and score aggregation module is designed as shown in Fig. 8.
In this figure, the input is the representation produced by the feature process module (the output of the last feature extraction module in panel (a) of Fig. 6) after average pooling along the frequency axis (shown as an operator in the figure). A bi-directional recurrent network (with BGRU units) is applied to further process the sequence. A fully connected (FC) layer with softmax activation is stacked on each output of the BGRU. The output is a probability vector. The process in this block is formulated as follows:
$$\mathbf{p}_t = \mathrm{softmax}(\mathbf{W}\mathbf{h}_t + \mathbf{b}) \in \mathbb{R}^{K}, \qquad \mathbf{p} = G(\mathbf{p}_1, \ldots, \mathbf{p}_T), \qquad (9)$$
where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias of the FC layer, respectively, $\mathbf{h}_t$ is the output of the $t$-th step of the BGRU layer, and $K$ is the number of classes. The final bag-level probability vector $\mathbf{p}$ is obtained with an aggregation function $G(\cdot)$. In this paper, average pooling is used as the aggregation function.
Suppose the estimated class probability vector is a function of the input acoustic spectrum and the neural network parameters:
$$\hat{\mathbf{y}}_n = f(\mathbf{x}_n^{0}; \Theta) \qquad (10)$$
(estimated from Eq. 9), where $\mathbf{x}_n^{0}$ is an input acoustic sample (raw input spectrum at scale 0), $n = 1, \ldots, N$ is the sample index ($N$ is the total number of samples), and $\Theta$ is the network parameter set. The learning is based on minimizing a loss function defined as the cross-entropy (CE) between the predicted and true targets:
$$L_{CE}(\Theta) = -\sum_{n=1}^{N} \mathbf{y}_n^{T} \log \hat{\mathbf{y}}_n, \qquad (11)$$
where $\mathbf{y}_n^{T}$ is the transpose of the true target label vector $\mathbf{y}_n$. In most studies, a parameter regularization term $R(\Theta)$ is added to the objective function defined in Eq. 11 with a trade-off weighting coefficient.
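The score aggregation and the loss can be sketched in a few lines. The example assumes that a frame-level softmax is applied to an affine transform of each recurrent output, that the clip-level vector is the average of the frame-level vectors, and that the cross-entropy is computed against a one-hot target. The names `clip_probability` and `cross_entropy`, the array shapes, and the small constant `eps` are illustrative choices.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def clip_probability(h, W, b):
    """h: (T, D) recurrent outputs, W: (D, K), b: (K,). Returns a (K,) clip-level vector."""
    frame_probs = softmax(h @ W + b)          # one probability vector per time step
    return frame_probs.mean(axis=0)           # average pooling over time


def cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between a predicted vector p and a one-hot target y."""
    return -float(np.sum(y * np.log(p + eps)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(7, 4))
    p = clip_probability(h, rng.normal(size=(4, 3)), np.zeros(3))
    print(p, cross_entropy(p, np.array([0.0, 1.0, 0.0])))
```

Because each frame-level vector sums to one, their average is again a valid probability vector, so the clip-level output can be passed directly to the cross-entropy.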
II-G A further discussion on the implementation of the attention network
In subsections II-E and II-F, we explained the scale attention as a position-wise attention on the residual branch. In this subsection, we look further into its design details with regard to temporal and spatial attention in feature weighting. Intuitively, as a sequential perception problem, perceptual importance can be allocated to some temporal frames or segments, i.e., the attention variable is a 1D time-dependent variable. In this study, however, we defined acoustic event patterns as representations of temporal-frequency patches. Therefore, the output of the attention function could be a 2D spatially dependent variable. Furthermore, in the 2D CNN framework, the output of each layer is a 3D tensor, i.e., the $l$-th layer output has a feature channel index $c$ in addition to the two spatial dimensions. In this situation, the CNN filters are regarded as 2D instance detectors, and each feature channel can be associated with a 2D attention weight matrix. In this sense, the output of the attention model should be a 3D channel-spatially dependent variable. Theoretically, a 1D (time) attention model can be regarded as a special case of a 2D (spatial) attention model with an average along the frequency dimension, while a 2D attention model is a special case of a 3D (channel-spatial) attention model with an average along the channel dimension. Although models with high representation capacity should outperform those with low representation capacity, in real applications we need to consider the effect of model complexity. In the following, we show how to estimate the attention functions for each of these situations.
We suppose that an acoustic event is finally represented as a 3D tensor $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the channel dimension and the two spatial dimensions of height and width, respectively. In each channel (corresponding to index $c$), there is an attention matrix that weights each “pixel”. For a 3D attention model, i.e., the cross-channel spatial attention model (CC_SAM_3D) shown in Fig. 9, the attention map function for the $c$-th channel is designed as:
$$\mathbf{A}_c = \sigma\big(g_c(\mathbf{X}; \theta_a)\big), \qquad (12)$$
where $\mathbf{A}_c$ is a 2D attention matrix (the consecutive scale indexes are omitted for ease of explanation), and $g_c(\cdot)$ is the attention network transform (before the sigmoid activation) for the $c$-th channel. In the 3D attention model, different filter channels might attend to the same local features of spectral patches. To constrain each filter channel to focus on different local features, an attention map orthogonality constraint is added in optimization. Correspondingly, the attention orthogonality loss function is defined as:
$$L_{orth} = \big\| (\mathbf{M}^{T}\mathbf{M}) \odot (\mathbf{1} - \mathbf{I}) \big\|_F^2, \qquad (13)$$
where $\odot$ is the element-wise multiplication operator, $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{M}^{T}$ is the transpose of $\mathbf{M}$, $\mathbf{M}^{T}\mathbf{M}$ is a square matrix, each column of $\mathbf{M}$ is the flattened attention matrix $\mathbf{A}_c$ of a filter channel $c$, $\mathbf{1}$ is an all-one square matrix, and $\mathbf{I}$ is an identity matrix. The loss defined in Eq. 13 is an orthogonality constraint on the attention, i.e., each filter channel tries to capture uncorrelated attention components. The final optimization is based on minimizing the following objective function:
$$L = L_{CE} + \lambda_1 R(\Theta) + \lambda_2 L_{orth}, \qquad (14)$$
where $L_{CE}$ is defined as in Eq. 11, $R(\Theta)$ is the parameter regularization term, and $\lambda_1$ and $\lambda_2$ are two regularization parameters. Based on this 3D attention model, special cases of attention models can be obtained as follows.
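An orthogonality penalty of this kind can be sketched as follows. The example assumes that the penalty is the squared Frobenius norm of the off-diagonal entries of the Gram matrix of the flattened attention maps. Whether the norm is squared, and the exact normalization, are assumptions, and the function name is hypothetical.

```python
import numpy as np


def attention_orthogonality_loss(att):
    """att: (C, H, W) attention maps, one per filter channel.

    Builds M with one flattened map per column, forms the (C, C) Gram matrix
    M^T M, masks out its diagonal, and returns the squared Frobenius norm of
    what remains.
    """
    C = att.shape[0]
    M = att.reshape(C, -1).T                 # shape (H*W, C): columns are flattened maps
    gram = M.T @ M                           # pairwise inner products between channels
    off_diagonal = np.ones((C, C)) - np.eye(C)
    return float(np.sum((gram * off_diagonal) ** 2))


if __name__ == "__main__":
    disjoint = np.zeros((2, 2, 2))
    disjoint[0, 0, 0] = 1.0                  # channel 0 attends to one position
    disjoint[1, 1, 1] = 1.0                  # channel 1 attends to a different one
    print(attention_orthogonality_loss(disjoint))
    print(attention_orthogonality_loss(np.ones((2, 2, 2))))
```

Maps with disjoint support give a zero penalty, while identical maps give a large one, which matches the stated aim of making channels attend to different local features.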
II-G1 Channel-wise spatial attention model
Unlike the cross-channel spatial attention model, a channel-wise 2D spatial attention matrix (a pseudo-3D model, denoted as CW_SAM_2.5D in this paper) is estimated independently for each feature channel. There are two types of designs, depending on whether the model parameters are shared across channels. The design is shown in Fig. 10.
In this figure, there are no cross connections between different feature channels in the attention network. Because each channel provides a different input, a different attention map is obtained for each channel.
II-G2 Spatial attention model
As acoustic event patterns are encoded in time-frequency spectral patches, a 2D spatial attention model (SAM_2D) can be applied naturally. Unlike the pseudo-3D attention model, all channels share a single attention matrix. The design is shown in Fig. 11.
In this design, each spectral patch extracted at a spatial location (a “pixel” in the output space of the CNN layer) is represented as a vector that collects its representations across all filter channels. If spatial context is taken into consideration, the features from the spatial context are concatenated.
II-G3 Temporal attention model
As a temporal sequence, spectral patches share the same importance if they are extracted from the same time stamp in the original spectral space. This is the most intuitive application of an attention model in AEC. In the design, based on the CC_SAM_3D, CW_SAM_2.5D, and SAM_2D attention models, we obtain a temporal attention weight at each temporal location by averaging along the frequency dimension. Correspondingly, there are several designs for the temporal attention model, for example, the cross-channel temporal attention model (CC_TAM_1D), the channel-wise temporal attention model (CW_TAM_1D), and a single temporal attention map (TAM_1D) shared by all channels. In this paper, only results based on TAM_1D are given.
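The reduction from channel-spatial to spatial to temporal attention can be sketched with simple averages. The example assumes an attention tensor laid out as (channel, frequency, time); this axis order and the function names are illustrative assumptions.

```python
import numpy as np


def to_spatial(att_3d):
    """(C, F, T) channel-spatial attention -> (F, T) map shared by all channels."""
    return att_3d.mean(axis=0)


def to_temporal(att_2d):
    """(F, T) spatial attention -> (T,) weight shared by all frequencies."""
    return att_2d.mean(axis=0)


def apply_temporal(features, att_1d):
    """Weight a (C, F, T) feature tensor with one attention value per time step."""
    return features * att_1d[None, None, :]


if __name__ == "__main__":
    att = np.random.default_rng(0).uniform(size=(2, 3, 4))
    temporal = to_temporal(to_spatial(att))
    weighted = apply_temporal(np.ones((2, 3, 4)), temporal)
    print(temporal.shape, weighted.shape)
```

Averaging removes one degree of freedom at a time, so each reduced model has fewer attention values to estimate than the one it is derived from.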
The attention function can be implemented as a fully connected (FC) neural network or a locally connected (LC) convolutional neural network. Owing to the flexibility of the convolutional neural network, an LC-based convolutional neural network was implemented for the attention network models in this study. In the attention model with an independence assumption for spectral patches, a $1 \times 1$ convolution kernel is used. With a dependency assumption, the kernel size is set to a given spatial neighborhood size, i.e., the attention for one patch is estimated from the patches that surround it spatially.
From the analysis above, we can see that the algorithms in most studies using attention models for AEC, for example, the temporal attention model or the multiple instance attention model [20, 22, 25, 24], can be regarded as special cases of our scale attention model. To the best of our knowledge, we are the first to reveal the relationship between the scale attention model and the residual CNN, and to generalize the multi-scale attention model for AEC.
Two data corpora are adopted to test the proposed framework: the UrbanSound8K corpus for acoustic event and scene classification [11, 46], and the task 4 data set of DCASE 2017 (DCASE2017_T4) for large-scale weakly supervised acoustic event detection in smart car environments. UrbanSound8K consists of 8732 sound clips (less than 4 seconds each) from 10 classes labeled at the clip level: air conditioner (AC), car horn (horn), children playing (child), dog bark (dog), drilling (drill), engine idling (engine), gun shot (gun), jackhammer (hammer), siren, and street music (music). All sounds are organized into 10 folds. In our study, each of the 10 folds is selected in turn as the test set, and the remaining 9 folds serve as training and validation sets (the validation set is one of the 9 folds, selected in turn, with the rest used for training, following previous work). The final evaluation is based on the average performance over the 10 test folds. Since the sounds were recorded and collected from crowd-sourced material, and labels were given only at the clip level without accurate start and end time stamps, classification of these sounds is difficult and realistic. The DCASE2017_T4 data set is a subset of Google AudioSet. It consists of 17 types of sound events that occur in car or vehicle environments. Each sample is a clip with a duration of around 10 seconds or less. In total, there are 51172 clips for training, 488 clips for testing (used as a validation set for selecting good models for the final evaluation), and 1103 clips for final evaluation. As this is a weak-label classification task, only clip-level tag information is used.
In our study, we first take the UrbanSound8K corpus for detailed analysis and model comparisons since many studies carried out research work based on deep architectures to test their algorithms on this data set [11, 12, 46], and then we carry out experiments on the DCASE2017_T4 corpus to further verify the proposed framework.
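The fold rotation described for UrbanSound8K can be sketched as below. The text states only that one of the remaining nine folds is used for validation; the specific rule used here (the fold after the test fold, wrapping around) is an assumption for illustration, and the function name is hypothetical.

```python
def fold_splits(num_folds=10):
    """Enumerate train/validation/test fold assignments for cross-validation."""
    splits = []
    for test_fold in range(1, num_folds + 1):
        val_fold = test_fold % num_folds + 1    # next fold, wrapping around (assumed rule)
        train_folds = [f for f in range(1, num_folds + 1)
                       if f not in (test_fold, val_fold)]
        splits.append({"test": test_fold, "val": val_fold, "train": train_folds})
    return splits


if __name__ == "__main__":
    for split in fold_splits():
        print(split)
```

Each fold is used exactly once as the test set, and the reported accuracy is then the average over the ten test folds.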
III-A Implementation details
The raw input feature to the model is the log-compressed Mel-spectrogram (MSP). For MSP extraction, all sounds were down-sampled to a 16 kHz sampling rate. A 512-point windowed FFT with a 256-point shift was used for frame-based power spectrum extraction, and 60 Mel filter bands were used for the MSP representation. As our models process spectral patches by convolution, a 2D MSP feature matrix for each event clip was used as input to the models.
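The feature extraction described above can be sketched with NumPy alone. The 16 kHz rate, 512-point FFT, 256-point shift, and 60 Mel bands follow the text; the Hann window, the HTK-style Mel formula, the triangular filters without normalization, and the natural logarithm with a small floor are assumptions, since the text does not specify them.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=60, n_fft=512, sr=16000):
    """Triangular filters equally spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=60, eps=1e-10):
    """Return a (n_mels, n_frames) log-Mel matrix; the signal must hold at least n_fft samples."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)            # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # frame-based power spectrum
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + eps).T


if __name__ == "__main__":
    t = np.arange(4 * 16000) / 16000.0
    print(log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t)).shape)
```

With these settings, a 4-second clip yields a 60 × 249 matrix, which is the kind of 2D input the convolutional blocks operate on.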
Many models based on deep learning algorithms have been proposed to improve the performance of AEC tasks [11, 12, 49, 50, 51]. Among these models, the DCNN-based models perform consistently well owing to their strong power in temporal-frequency invariant feature extraction. With temporal context modeling by an RNN layer (with either LSTM or GRU units), the CRNN gives state-of-the-art performance in most studies related to audio classification and detection [16, 17]. In our study, we implemented the deep CRNN (DCRNN) model as a baseline for comparison. In addition, the ResCNN-based model is also implemented as an implicit multi-scale feature integration model. Our scale attention models were built with the ResCNN as the backbone. Besides the choice of model architecture, many methods or techniques could be applied to improve AEC performance, for example, data augmentation, optimization algorithms, transfer learning, cross-modal learning, or rover of many sub-systems. However, comparing results obtained with different methods is sometimes unfair, or does not help in understanding the underlying problems, because many tricks are needed to tune a model from many aspects. In our study, with a very typical architecture and learning algorithm, we focus only on whether integrating the multi-scale attention model improves discriminative feature extraction and thus performance. For a fair comparison, all processing procedures follow the same pipeline, as shown in Fig. 6. For testing different models, only the feature process blocks are replaced in a plug-and-play manner. In the CNN blocks for local feature extraction, each block includes one linear CNN layer (with 256 filter kernels of size 3×3) followed by a BN operation (for improving convergence speed and model generalization), a nonlinear activation (ReLU in this study), and a max-pooling operation (with strides of 2×2), as defined in Eq. 1. After feature extraction in the DCNN, a bi-directional RNN layer is applied to the output of the last CNN block for temporal context modeling, followed by a fully connected layer and a softmax layer for classification. In the RNN layer, 128 GRU nodes were used (giving 256-dimensional event feature vectors), as shown in Fig. 8. In the implementation of the ResCNN, the identity residual block shown in panel (a) of Fig. 3 is used. In the residual block, each CNN block except the last consists of a linear CNN layer, a BN operation, and a ReLU activation. The last CNN block has only a linear CNN layer, and the BN and nonlinear activation are applied after the identity addition. Moreover, in each residual block, the output dimension is 256 and the bottleneck compression ratio is 4. Our scale attention model uses the ResCNN as a backbone, with each block modified following the design in Fig. 7. In model training, one regularization coefficient of the objective function (Eq. 14) was fixed, and the other was decided experimentally. A mini-batch size of 32 was used, and the Adam stochastic optimization algorithm was applied for model optimization.
III-B Experiments on UrbanSound8K corpus
Before carrying out experiments for final evaluations, we first did experiments on the UrbanSound8K corpus to confirm several factors that affect the performance.
III-B1 Baseline models
In order to select good baseline models, we tried several configurations of “DCRNN” with different numbers of CNN blocks; the results are shown in Table I.
|DCRNN (1 CNN block)||74.8 (4.60)||72.1 (4.95)|
|DCRNN (2 CNN blocks)||78.2 (5.10)||74.8 (5.34)|
|DCRNN (3 CNN blocks)||79.4 (5.52)||76.2 (6.01)|
|DCRNN (4 CNN blocks)||79.1 (5.54)||75.1 (5.51)|
|PiczakCNN ([12])||-||73.7 (-)|
|SB-CNN ([11])||-||73.0 (-)|
Performance of baseline models (classification accuracy (standard deviation)) (%)
The results in this table are the classification accuracies and their standard deviations summarized over the 10 test folds. Comparing the results on the test set, we can see that the model with one convolutional block does not have enough capacity, which results in underfitting. As the number of CNN blocks increases up to three, the performance improves consistently. However, adding convolutional blocks beyond three brings no further benefit on this AEC task. Compared with other studies on this corpus, for example, the DCNN models in [12, 11], the “DCRNN” model with three feature processing blocks is a reasonable baseline model.
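The "accuracy (standard deviation)" entries reported over the 10 folds can be produced by a summary of the following kind. This is a minimal sketch: the per-fold accuracies are made-up placeholder values, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Format per-fold accuracies (%) as 'mean (std)'.

    The population standard deviation is an assumption; the sample
    standard deviation (statistics.stdev) is an equally plausible choice.
    """
    mean = statistics.mean(fold_accuracies)
    std = statistics.pstdev(fold_accuracies)
    return f"{mean:.1f} ({std:.2f})"


# Hypothetical per-fold test accuracies (not taken from the paper).
example_folds = [71.2, 80.5, 69.8, 77.3, 83.1, 74.0, 79.6, 72.4, 81.0, 76.1]
print(summarize_folds(example_folds))
```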
In the baseline models with different numbers of CNN blocks, the features used for acoustic event discrimination can be regarded as coming from different scales. For a detailed analysis, we show the confusion matrices of acoustic events for the models with two and three CNN blocks in Fig. 12.
From these confusion matrices, we can see that although the overall performance with three CNN blocks (panel (b)) is better than that with two CNN blocks (panel (a)), the improvement comes at the cost of performance degradation on some specific event categories. For example, as shown in Fig. 12, the accuracies of recognizing the events “AC”, “hammer”, “engine”, and “drill” decrease (from panel (a) to panel (b)). Moreover, although the confusion from “horn” to “music” is reduced when features are represented at a large scale, the confusion from “AC” to “music” is increased. For further investigation, we show in Table II how the performance changes when the number of CNN blocks is changed.
|Layers||1 → 2||2 → 3||3 → 4|
In this table, the columns with “Corrected”, “Degraded”, and “Kept” are defined as:
In this definition, the count for “Corrected” is the number of samples correctly predicted by model M2 while wrongly predicted by model M1. The other definitions follow similar meanings. Based on these definitions, we can see what percentages of clips are corrected, degraded, and kept in recognition when the number of CNN layers is changed (e.g., from model M1 to model M2). From Table II, we can see that in the baseline model, as the number of CNN blocks increases, the number of corrected clips tends to be larger than the number of degraded clips. However, the tendency is reversed when the model changes from three CNN blocks to four. In summary, most conventional models that stack several CNN layers try to find an optimal scale of feature representation at which the number of corrected patterns is larger than the number of degraded patterns.
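The bookkeeping behind the “Corrected”, “Degraded”, and “Kept” percentages can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the labels and predictions are toy values, and “Kept” is assumed here to mean that the correctness of the prediction is unchanged between the two models (both correct or both wrong).

```python
import numpy as np


def transition_rates(y_true, pred_m1, pred_m2):
    """Percentages of clips corrected, degraded, and kept from model M1 to M2."""
    y_true, pred_m1, pred_m2 = map(np.asarray, (y_true, pred_m1, pred_m2))
    ok1 = pred_m1 == y_true          # clips M1 classifies correctly
    ok2 = pred_m2 == y_true          # clips M2 classifies correctly
    n = len(y_true)
    corrected = np.sum(~ok1 & ok2) / n * 100   # wrong in M1, correct in M2
    degraded = np.sum(ok1 & ~ok2) / n * 100    # correct in M1, wrong in M2
    kept = np.sum(ok1 == ok2) / n * 100        # assumed: correctness unchanged
    return float(corrected), float(degraded), float(kept)


# Toy example with four clips (values are illustrative only).
labels = [0, 1, 2, 3]
m1 = [0, 1, 0, 0]
m2 = [0, 0, 2, 0]
print(transition_rates(labels, m1, m2))
```

With such counts, a deeper model improves the overall accuracy only when the corrected percentage exceeds the degraded percentage, which is the comparison made in the text.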
III-B2 ResCNN and scale attention models
In this subsection, we test the performance of the ResCNN model and of our scale attention models, in which the ResCNN is used as a backbone. In all these models, three feature processing blocks are adopted based on the initial analysis of the baseline models. Following the same processing pipeline in Fig. 6, we replace the CNN blocks with residual CNN blocks and scale attention CNN blocks in the feature processing module. In a common usage of ResNet [32, 33], the number of neural nodes increases from bottom to top layers; for example, the output nodes of the residual blocks can be arranged as 64-256-512 in a ResNet with three residual blocks. However, our empirical experiments showed that a plain usage of residual blocks with 256-256-256 obtained slightly better results. Moreover, for a fair and consistent comparison with the use of CNN blocks for feature extraction, only the identity convolutional blocks were used in our implementations. Our scale attention model follows the same configuration as the ResCNN, with the addition of attention map orthogonal regularization (based on our empirical experiments in Section III-C). In addition, in the implementation of the scale attention model, according to Fig. 5
, attention networks without and with down-up sampling are designed. Three down-sampling methods are examined, i.e., CNN strided convolution, average-pooling, and max-pooling. All of these down-sampling operations use strides of 2 by 2. Bilinear interpolation was used for up-sampling. The performance of these models is shown in Table III.
|ResCNN||80.5 (5.56)||77.5 (5.69)|
|CC_SAM_3D (No sampling)||81.9 (5.75)||79.0 (6.01)|
|CC_SAM_3D (CNN down-up)||81.7 (4.76)||79.2 (4.34)|
|CC_SAM_3D (AP down-up)||81.3 (5.34)||78.8 (5.63)|
|CC_SAM_3D (MP down-up)||81.9 (5.19)||79.4 (6.29)|
|CW_SAM_2.5D||81.5 (5.24)||79.2 (5.60)|
|CW_SAM_2.5D (shared)||81.8 (5.49)||78.8 (5.77)|
|SAM_2D||80.9 (5.19)||78.6 (5.50)|
|TAM_1D||80.7 (5.40)||78.2 (5.59)|
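The down-up sampling path examined in the table above (2 by 2 max-pooling followed by bilinear up-sampling back to the original resolution) can be sketched in NumPy as follows. This is a minimal illustration on a single 2D map, not the authors' implementation: the function names are hypothetical, the corner-aligned interpolation grid is an assumption, and the attention network that would operate on the down-sampled map is omitted.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on a (T, F) map; T and F must be even."""
    t, f = x.shape
    return x.reshape(t // 2, 2, f // 2, 2).max(axis=(1, 3))


def bilinear_upsample(x, out_shape):
    """Bilinear interpolation of a (t, f) map to out_shape (corners aligned)."""
    t, f = x.shape
    rows = np.linspace(0, t - 1, out_shape[0])
    cols = np.linspace(0, f - 1, out_shape[1])
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, t - 1)
    c1 = np.minimum(c0 + 1, f - 1)
    wr = (rows - r0)[:, None]        # vertical interpolation weights
    wc = (cols - c0)[None, :]        # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - wc) + x[r0][:, c1] * wc
    bottom = x[r1][:, c0] * (1 - wc) + x[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr


# Toy 4x4 feature map: down-sample, then restore the original resolution.
feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
restored = bilinear_upsample(pooled, feature_map.shape)
print(pooled.shape, restored.shape)
```

Estimating attention on the pooled map gives each attention weight a larger spatial context, and the up-sampling step restores a map that can be applied element-wise to the original features.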
Comparing the results in Tables I and III, we can see that the ResCNN significantly improved the performance compared with the model using CNN blocks without scale feature skip connections. Our scale attention models further improved the performance over the ResCNN model. Moreover, attention models with down-up sampling slightly improved the performance, except when average-pooling was used for down-sampling. Based on these results, in later experiments and analysis, all scale attention models are implemented with the max-pooling based down-sampling method.
As we discussed in Section II-G, with different definitions of attention, the 3D attention model can be generalized to pseudo-3D (channel-wise), 2D (spatial), and 1D (temporal) attention models with decreasing model complexity. The results are shown in Table IV. From the results, we can see that applying a different attention map to each feature channel is better than applying the same attention map to all feature channels. In the case of the channel-wise attention model CW_SAM_2.5D (implemented as depth-wise convolution), there is a slight decrease in performance compared with the cross-channel attention model (3D), while the decrease is larger when model parameters are shared in CW_SAM_2.5D (shared). Moreover, comparing the results of CW_SAM_2.5D, SAM_2D, and TAM_1D, we can conclude that both channel and spatial factors should be taken into consideration in attention weight estimation.
III-C Effect of orthogonal regularization in attention map estimation
In the estimation of attention maps for multiple channels, we added an orthogonal regularization in order to push the attention to focus on different discriminative features in learning. We carried out experiments with different values of the regularization parameter to examine its effect. The results are shown in Table V.
|0.0001||81.9 (5.86)||78.1 (5.10)|
|0.001||81.7 (5.75)||78.9 (6.10)|
|0.01||81.6 (5.70)||79.0 (5.49)|
|0.1||81.7 (5.19)||79.4 (6.29)|
|1||81.4 (4.86)||79.0 (5.26)|
|5||81.0 (5.82)||78.4 (5.65)|
From these results, we can see that adding orthogonal regularization for cross-channel attention map estimation is important for improving the performance. Among the tried regularization parameters, a value of 0.1 gave the highest accuracy in the last column of Table V (79.4%). These results also confirm that constraining attention maps to focus on different features (channels) helps to improve discriminative feature extraction.
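To make the idea of an orthogonal regularization on attention maps concrete, the following NumPy sketch computes one common form of such a penalty: the squared Frobenius distance between the Gram matrix of the normalized, flattened attention maps and the identity. This form is an assumption chosen for illustration; the exact expression in Eq. 14 is not reproduced here and may differ, and the function name and the weight `lam` are hypothetical.

```python
import numpy as np


def orthogonal_penalty(attention_maps):
    """Penalty that is zero when the per-channel attention maps are orthogonal.

    attention_maps: array of shape (C, T, F), one map per feature channel.
    Assumed form: || A A^T - I ||_F^2 with L2-normalized rows of A.
    """
    c = attention_maps.shape[0]
    flat = attention_maps.reshape(c, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.maximum(norms, 1e-12)   # guard against all-zero maps
    gram = flat @ flat.T                      # pairwise cosine similarities
    return float(np.sum((gram - np.eye(c)) ** 2))


# Two channels attending to the same region are penalized ...
same = np.array([[[1.0, 0.0]], [[1.0, 0.0]]])
# ... while two channels attending to disjoint regions are not.
disjoint = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
print(orthogonal_penalty(same), orthogonal_penalty(disjoint))

# The penalty would be added to the classification loss with a weight:
lam = 0.1                                     # hypothetical regularization weight
total_loss = 1.234 + lam * orthogonal_penalty(same)   # 1.234 is a placeholder loss
```

A larger weight pushes the channels toward attending to different time-frequency regions, at the risk of over-constraining the maps, which is the trade-off examined by varying the regularization parameter.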
III-D Model complexity
In order to compare the complexity of the different models, we show the parameter size of each model in Table VI.
|Models||number of parameters|
|DCRNN (3 CNN blocks)||1.484 M|
|DCRNN (4 CNN blocks)||2.075 M|
|CW_SAM_2.5D (shared)||0.513 M|
Checking the results in tables I and II with reference to the model parameter sizes, we can see that skip connections for propagating multi-scale features to top layers are important. Increasing model capacity does not always increase model performance, but it increases the performance significantly if the model includes scale attention modeling. This confirms that the performance improvement is due to the good model structure with attention models.
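As a sanity check on the parameter sizes listed in the table above, the cost of one additional CNN block can be estimated by hand. The sketch below assumes a 3 by 3 convolution with bias, 256 input and 256 output channels, and a BN layer with one trainable scale and one trainable shift per channel; these counting conventions and the function name are assumptions, not details confirmed by the paper.

```python
def cnn_block_params(in_channels, out_channels=256, kernel=(3, 3)):
    """Trainable parameters of one conv + BN block (ReLU and pooling have none)."""
    conv_weights = kernel[0] * kernel[1] * in_channels * out_channels
    conv_bias = out_channels
    bn_scale_shift = 2 * out_channels
    return conv_weights + conv_bias + bn_scale_shift


extra = cnn_block_params(256)
print(f"one extra block: {extra} parameters ({extra / 1e6:.3f} M)")
print(f"table difference: {2.075 - 1.484:.3f} M")
```

Under these assumptions one extra block adds 590,592 parameters, about 0.591 M, which agrees with the difference between the 4-block and 3-block DCRNN entries (2.075 M and 1.484 M) if the rest of the network is unchanged.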
III-E Experiments on DCASE2017_T4 corpus
In this section, we carried out experiments on the DCASE2017_T4 data set, focusing on the acoustic event tagging task (task A). Different from the experiments on the UrbanSound8K corpus, recall, precision, and F1 values are used as evaluation metrics. Based on the analysis of the experimental settings for the UrbanSound8K corpus, and with reference to the top-1 model on task A of the DCASE2017_T4 data set, we slightly modified the pipeline of the framework in Fig. 6: two BGRU layers were applied instead of the single BGRU layer used for the UrbanSound8K corpus. In addition, in the implementation of the ResCNN, we tried models with several residual blocks. The results are shown in Table VII. From this table, we can see that propagating multi-scale features with skip connections is important in discriminative feature extraction. The proposed scale attention model with the ResCNN as a backbone further improved the performance and obtained state-of-the-art results.
|DCASE2017 Baseline ||7.8||17.5||10.9|
|Gated_CRNN _MFCC ||51.7||52.5||52.1|
|ResCNN (2 Blocks)||52.0||56.4||54.1|
|ResCNN (3 Blocks)||54.2||57.3||55.7|
|ResCNN (4 Blocks)||53.7||60.2||56.8|
|CC_SAM_3D (3 Blocks)||56.7||60.1||58.3|
|DCASE2017 Baseline ||15.0||23.1||18.2|
|Gated_CRNN _MFCC ||51.7||52.5||52.1|
|ResCNN (2 Blocks)||54.1||58.3||56.1|
|ResCNN (3 Blocks)||58.0||60.7||59.3|
|ResCNN (4 Blocks)||56.2||62.1||59.0|
|CC_SAM_3D (3 Blocks)||58.1||62.9||60.4|
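For reference, the F1 value is the harmonic mean of precision and recall. The short sketch below recomputes it for two rows of the table above, assuming that the three numeric columns are precision, recall, and F1 in that order; this ordering is inferred from the arithmetic rather than stated in the table, and the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ResCNN (3 Blocks), first group of rows: 54.2 and 57.3 are read from the table.
print(round(f1_score(54.2, 57.3), 1))
# DCASE2017 Baseline, second group of rows: 15.0 and 23.1 are read from the table.
print(round(f1_score(15.0, 23.1), 1))
```

Values recomputed from the rounded table entries may differ from the reported F1 in the last digit, since the reported values were presumably computed from unrounded precision and recall.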
IV Discussion and conclusion
The CNN is widely used as a feature processing block for robust and invariant feature extraction in AEC. With a preliminary analysis of the DCNN-based model for AEC, we showed that conventional classification models are in fact based on features from a certain large-scale space, and that representations from such a scale space improve the discrimination of most acoustic patterns at the cost of degradation for some others. It is therefore better to explore multi-scale discriminative features from a wide range of scales. Although multi-scale audio processing is most directly associated with multi-scale time-frequency resolutions, as used in [54, 55], the concept of multi-scale used in our paper is different: it refers to the multiple scales of spectral patches processed in a DCNN model. As acoustic event patterns are encoded in spectral patches, a CNN is well suited to exploring discriminative features from spectral patches of different sizes for AEC. By stacking multiple CNN layers in feature extraction, multi-scale features corresponding to multi-scale spectral patches are obtained. Different from the conventional usage of the DCNN-based model for AEC, we proposed a scale attention model for efficient propagation of multi-scale features from bottom to top layers. With an attentive weighting of features in consecutive layers, discriminative features from different scales can be propagated to the final feature representation. With the formulations in Eqs. (4), (5), (6), and (7), we showed that the proposed multi-scale attention can be regarded as a scale attention on the residual block of the ResCNN, and that the conventional ResCNN can be regarded as a special case of the proposed multi-scale attention model. In addition, either temporal or spatial context information can be used in estimating the attention weights. In this sense, the proposed multi-scale attention model can be regarded as a generalization of both temporal and spatial attention models.
We carried out experiments to test the proposed framework based on the scale attention model, and confirmed that adaptively and progressively incorporating multi-scale features with attention performs better than using only a fixed scale propagation in feature extraction (as used in the ResCNN). Moreover, our experiments show that explicitly incorporating spatial attention performs better than incorporating only temporal attention, and that weighting each feature channel differently performs better than applying the same attention to all feature channels. However, incorporating more feature information with a large increase in model parameter size does not necessarily improve the performance. How to explicitly incorporate much more information with little increase in model parameter size is one direction of our future work. Moreover, we will further investigate the factors involved in the design of the multi-scale attention blocks, for example, how to design the residual branch shown in panel (b) of Fig. 3 and how to include context information with either local or global attention.
-  D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. Plumbley, “Detection and Classification of Acoustic Scenes and Events: an IEEE AASP Challenge,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013.
-  T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, “Context-dependent sound event detection,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 1, pp. 1-13, 2013.
-  X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, T. S. Huang, “Real-world acoustic event detection,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1543-1551, 2010.
-  A. Plinge, R. Grzeszick, and G. Fink, “A Bag-of-Features approach to acoustic event detection,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP)., pp. 3732-3736, 2014.
-  R. Grzeszick, A. Plinge, G. A. Fink, “Bag-of-Features Methods for Acoustic Event Detection and Classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(6), pages 1242-1252, June 2017.
-  X. Lu, Y. Tsao, S. Matsuda, C. Hori, “Sparse representation based on a bag of spectral exemplars for acoustic event detection,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP), pp. 6255-6259, 2014.
-  C. V. Cotton, D. P. W. Ellis, “Spectral vs. spectro-temporal features for acoustic event detection,” in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 69-72, 2010.
-  A. Temko, C. Nadeu, and J. I. Biel, “Acoustic Event Detection: SVM-Based System and Evaluation Setup in CLEAR’07,” Multimodal Technologies for Perception of Humans, pp. 354-363, 2008.
-  Z. Huang, Y. Cheng, K. Li, V. Hautamaki, C. Lee, “A Blind Segmentation Approach to Acoustic Event Detection Based on I-Vector,” in Proc. of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2282-2286, 2013.
-  C. Zieger, “An HMM based system for acoustic event detection,” Multimodal Technologies for Perception of Humans, pp. 338-344, 2008.
-  J. Salamon, and J. P. Bello, “Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification,” IEEE Signal processing letters, Vol. 24, No. 3, pp. 279-283, 2017.
-  K. J. Piczak, “Environmental sound classification with convolutional neural networks,” IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), 12 November 2015.
-  A. Gorin, N. Makhazhanov, and N. Shmyrev, “DCASE 2016 sound event detection system based on convolutional neural network,” Tech. Rep., DCASE2016 Challenge, 2016.
-  S. Hochreiter, J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, 9 (8), pp. 1735-1780, 1997.
-  K. Cho, B. Merrienboer, D. Bahdanau, Y. Bengio, “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,” the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
-  E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection,” IEEE/ACM Trans. Audio, Speech and Language Processing, 25(6), pp. 1291-1303, 2017.
-  K. Choi, G. Fazekas, M. Sandler, K. Cho, “Convolutional Recurrent Neural Networks for Music Classification,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP), pp. 2392-2396, 2017.
-  D. Lee, S. Lee, Y. Han, K. Lee, “Ensemble of Convolutional Neural Networks for Weakly-Supervised Sound Event Detection using Multiple Scale Input,” Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
-  Y. Guo, M. Xu, J. Wu, Y. Wang, K. Hoashi, “Multi-scale convolutional recurrent neural network with ensemble method for weakly labeled sound event detection,” Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
-  Q. Kong, Y. Xu, W. Wang, MD. Plumbley, “A joint detection-classification model for audio tagging of weakly labelled data,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP), pp. 641-645, 2017.
-  Q. Kong, Y. Xu, W. Wang, MD. Plumbley, “Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 4, pp. 777-787, 2019.
-  Y. Xu, Q. Kong, Q. Huang, W. Wang, MD. Plumbley, “Attention and Localization based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging,” in Proc. of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3083-3087, 2017.
-  Y. Xu, Q. Kong, W. Wang, MD. Plumbley, “Large-scale weakly supervised audio classification using gated convolutional neural network,” ICASSP, 2018.
-  M. Ilse, J. M. Tomczak, M. Welling, “Attention-based Deep Multiple Instance Learning,” arXiv:1802.04712, 2018.
-  X. Lu, P. Shen, S. Li, Y. Tsao, H. Kawai, “Temporal Attentive Pooling for Acoustic Event Detection,” in Proc. of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1354-1357, 2018.
-  A. Kumar, B. Raj, “Audio Event Detection using Weakly Labeled Data,” in Proc. of the ACM on Multimedia Conference, pp. 1038-1047, 2016.
-  D. Bahdanau, K. Cho, Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2014.
-  M. Luong, H. Pham, C. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1412-1421, 2015.
-  W. Chan, N. Jaitly, Q. Le, O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP), pp. 4960-4964, 2016.
-  S. Chou, J. Jang, Y. Yang, “Learning to Recognize Transient Sound Events using Attentional Supervision,” in Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), 2018.
-  C. Yu, K.Barsim, Q. Kong, B. Yang, “Multi-level Attention Model for Weakly Supervised Audio Classification,” arXiv:1803.02353.
-  K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks,” European Conference on Computer Vision, 2016.
-  T. W. Su, J. Y. Liu, Y. H. Yang, “Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Process (ICASSP), pp. 791-795, 2017.
-  J. Y. Liu, Y. H. Yang, “Event Localization in Music Auto-tagging,” in Proc. of ACM on Multimedia Conference, pp. 1048-1057, 2016.
-  O. Maron, A. L. Ratan, “Multiple-Instance Learning for Natural Scene Classification,” in Proc. 15th Int. Conf. on Machine Learning, pp. 341-349, 1998.
-  X. Wang, Y. Yan, P. Tang, X. Bai, W. Liu, “Revisiting Multiple Instance Neural Networks,” Pattern Recognition, No. 74, pp. 15-24, 2018.
-  J. Ramon, L. De Raedt, “Multi instance neural networks,” in Proc. of the ICML-2000 workshop on attribute-value and relational learning, pp. 53-60, 2000.
-  A. Kumar, B. Raj, “Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data,” in NIPS Workshop on Machine Learning for Audio, 2017.
-  S. Tseng, J. Li, Y. Wang, F. Metze, J. Szurley, S. Das, “Multiple instance deep learning for weakly supervised small-footprint audio event detection,” in Proc. of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3279-3283, Sep. 2018
-  D. Rumelhart, G. Hinton, R. Williams, “Learning representations by back-propagating errors,” Nature, 323 (6088), 533-536, 1986.
-  A. Veit, M. Wilber, S. Belongie, “Residual Networks Behave Like Ensembles of Relatively Shallow Networks,” Advances in Neural Information Processing Systems, 2016.
-  G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, “Deep Networks with Stochastic Depth,” European Conference on Computer Vision, 2016
-  S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. of the 32nd International Conference on Machine Learning (ICML), pp. 448-456, 2015.
-  C. Raffel, and D. Ellis, “Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems,” arXiv preprint arXiv:1512.08756, 2015.
-  J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound Research,” the 22nd ACM International Conference on Multimedia, Orlando USA, Nov. 2014.
-  J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” ICASSP, 2017.
-  A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” DCASE Workshop, 2017.
-  DCASE (detection and classification of acoustic scenes and events) 2016 challenge, http://www.cs.tut.fi/sgn/arg/dcase2016/index
-  Y. Aytar, C. Vondrick, A. Torralba, “SoundNet: Learning Sound Representations from Unlabeled Video,” In Proceeding of the 30th International Conference on Neural Information Processing Systems, pp. 892-900, 2016.
-  S. Li, Y. Yao, J. Hu, G. Liu, X. Yao, and J. Hu, “An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition,” Appl. Sci. 2018, 8, 1152, doi: 10.3390/app8071152
-  Diederik P. Kingma, Jimmy Ba, “Adam: A Method for Stochastic Optimization,” the 3rd International Conference on Learning Representations (ICLR), 2014.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Zhenyao Zhu, Jesse H. Engel, Awni Hannun, “Learning Multiscale Features Directly From Waveforms,” arXiv:1603.09509.
-  B. Zhu, C. Wang, F. Liu, J. Lei, Z. Lu, Y. Peng, “Learning Environmental Sounds with Multi-scale Convolutional Neural Network,” arXiv:1803.10219.