Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events
In this paper, we propose a new strategy for acoustic scene classification (ASC), namely recognizing acoustic scenes through identifying distinct sound events. This differs from existing strategies, which focus on characterizing the global acoustical distributions of audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations of the sound events inside a scene. We also propose a MIL neural network model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal-scale and multi-modal natures of sound events respectively. The experiments were conducted on the official development set of DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4%. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones.
Acoustic scene classification (ASC) refers to the task of categorizing real-life audio recordings into one of a set of environment classes (office, bus, etc.). Potential applications of ASC include context-aware devices and robotic navigation.
An acoustic scene (e.g., office) consists of a stream of sound events (e.g., people talking, a pen dropping), where each sound event is associated with one or more sound sources that produce it. Early works on ASC took a two-stage strategy, which recognized sound events before recognizing acoustic scenes. For instance, a hierarchical probabilistic model was presented for key audio effect (sound event) detection and auditory context (acoustic scene) inference; in other work, audio events were detected using a supervised classifier, with each audio context then represented using a histogram of sound events.
The two-stage strategy relies on a manually predefined sound event set and detailed sound event annotations (onsets and offsets), which requires huge human effort and is only feasible for small-scale datasets. More significantly, real-life acoustic scene recordings usually involve a great number of overlapping sound events with ambiguous temporal boundaries, which makes annotating sound events almost impossible.
The aforementioned limitations have led most research efforts toward a one-stage strategy, for which only the recording-level label is needed. Typically, acoustic scenes are described by their global acoustical distributions or by the temporal evolution of short-term audio features. For instance, low-level or middle-level descriptors are extracted, generally followed by statistical models or sequence-learning models that summarize the information and make final decisions. Some high-level feature extraction methods that are popular in the speaker recognition field have also been investigated and proved quite effective for ASC, such as the well-known i-vectors and x-vectors.
More recently, another widely adopted one-stage strategy is to utilize deep learning models that map an audio recording directly to its scene label. In particular, Convolutional Neural Network (CNN) based methods have demonstrated great potential. Equally, several novel Recurrent Neural Network (RNN) based architectures have been proposed to model the temporal evolution of short-term audio features. Novel data augmentation methods for ASC have also been proposed to handle the problem of data scarcity.
Alongside these developments, listening tests have been conducted to understand the perceptual processes humans use for ASC. In particular, it has been observed that human recognition of soundscapes is guided by identifying prominent sound events, so it is clearly important to focus the recognition process on distinct sound events. This finding accords with our intuition: for instance, identifying the sound of birds singing would immediately differentiate a park scene from an office scene, something not as easily recognized from global acoustical distributions.
Inspired by these psychological findings, in this paper we integrate this strategy into the design of computational algorithms for ASC, and investigate to what extent acoustic scenes can be recognized by identifying their distinct sound events. To avoid manually annotating sound events, we treat this as a weakly supervised learning problem and formulate it in a multi-instance learning (MIL) framework. As shown in Fig. 1, the main idea of our MIL-based ASC is to map audio into a bag-of-instances representation and identify distinct sound events for each scene by detecting positive instances.
We summarize our contributions as follows. First, we propose a new strategy for ASC and formulate the problem of identifying distinct sound events for ASC in a Multi-Instance Learning (MIL) framework. Second, we propose a novel deep learning-based MIL model with specially designed modules that model the multi-temporal-scale and multi-modal natures of the sound events in our daily acoustic environment.
MIL is a popular framework for solving weakly supervised learning problems. In MIL, labels are attached to a set of instances, called a bag, rather than to individual instances within the bag. MIL has been widely applied to areas such as audio event detection (AED) and bird sound classification. An attention-based CNN model has also been used for ASC and interpreted in a MIL framework. However, that interpretation falls under the embedded-space paradigm of MIL, which does not have the ability to identify distinct sound events as our method does. For more details of MIL and its applications, please see the literature.
For each audio recording, a log mel-spectrogram is extracted and denoted as $S_b$. The bag-of-instances representation of the spectrogram is denoted as $B_b = \{x_{b,1}, \dots, x_{b,K}\}$, where $x_{b,i}$ is a high-level vector representation of the sound events in a segment of the audio recording, $b$ and $i$ are the indices of the bag and the instance respectively, and $K$ denotes the number of instances in the bag. The original MIL and its standard multi-instance (SMI) hypothesis were proposed in the context of binary classification. Here, for multi-class ASC, we treat each class independently; the SMI hypothesis can then be naturally described as follows:
Here, $c$ is the index of each class and $C$ is the number of classes. $Y_{b,c}$ and $y_{b,i,c}$ are the one-hot labels for the bag and the instances respectively, where 1 means positive and 0 means negative. The SMI hypothesis is important since it determines the relation between the bag and the instances. By adopting the SMI hypothesis, positive instances of one scene cannot appear in other scenes; therefore, the positive instances of one scene represent the distinct sound events of that scene.
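With $Y_{b,c}$ the bag-level label of bag $b$ for class $c$, $y_{b,i,c}$ the corresponding instance-level labels, and $K$ the number of instances, the standard SMI hypothesis can be written compactly as follows (a reconstruction of the standard formulation, with symbols chosen here for illustration):

```latex
Y_{b,c} =
\begin{cases}
1, & \text{if } \exists\, i \in \{1,\dots,K\} \ \text{such that}\ y_{b,i,c} = 1,\\
0, & \text{otherwise,}
\end{cases}
\qquad \text{equivalently} \qquad
Y_{b,c} = \max_{i}\, y_{b,i,c}.
```

That is, a bag is positive for a class exactly when at least one of its instances is positive for that class.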
It is worth noting that, for our task, instance labels are not available during training. Moreover, following MIL convention, we consistently use capital letters for bag-level symbols and lower-case letters for instance-level symbols.
The MIL-based ASC model can be broken into three parts: 1) an instance generator, which maps an input log mel-spectrogram into instance vectors; 2) a group of distinct instance detectors, which map each instance vector to its label prediction score; and 3) a prediction aggregator, which aggregates instance-wise predictions into a bag-level prediction. A symbolic representation of the complete MIL model is shown as follows:
The CNN module is designed to map the input log mel-spectrogram to a bag-of-instances representation. The input dimensions are ordered as (feature, time). The CNN module begins with three convolutional blocks, each consisting of two stacked 2D convolutional layers followed by a strided (2, 2) max-pooling layer, where the number of filters is doubled for each subsequent block (32, 64, 128) and all filters are of dimension (3, 3). This is followed by a single 2D convolutional layer with 256 full-height (5, 1) filters, colored pink in Fig. 2. A batch normalization layer and a ReLU nonlinearity are applied to the output of every convolutional layer, as well as to the input of the network (i.e., the spectrogram). Finally, the resulting 3D tensor is reshaped into the bag-of-instances representation (256, 62). Each instance vector $x_{b,i}$ (a 256-dimensional vector) can be considered a high-level representation of sound events.
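The instance generator described above can be sketched in PyTorch as follows. The layer sizes follow the paper; the exact padding and layer ordering are assumptions, and the class name is hypothetical:

```python
import torch
import torch.nn as nn

class InstanceGenerator(nn.Module):
    """Sketch of the CNN instance generator: log mel-spectrogram -> bag of instances.
    Filter counts and kernel sizes follow the paper; padding details are assumptions."""
    def __init__(self):
        super().__init__()
        self.input_bn = nn.BatchNorm2d(1)          # normalize the input spectrogram
        layers = []
        in_ch = 1
        for out_ch in (32, 64, 128):               # three blocks, filters doubled each time
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 2)),  # strided (2, 2) max pooling
            ]
            in_ch = out_ch
        # full-height (5, 1) conv collapses the remaining 5 frequency bins
        layers += [nn.Conv2d(128, 256, kernel_size=(5, 1)),
                   nn.BatchNorm2d(256), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                  # x: (batch, 1, 40 mels, 500 frames)
        h = self.net(self.input_bn(x))     # -> (batch, 256, 1, 62)
        return h.squeeze(2)                # bag of 62 instance vectors, each 256-dim

bag = InstanceGenerator()(torch.randn(4, 1, 40, 500))
print(bag.shape)  # torch.Size([4, 256, 62])
```

Three (2, 2) poolings reduce the (40, 500) input to (5, 62), and the full-height (5, 1) convolution then collapses the frequency axis, yielding 62 instances of dimension 256.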
For the group of distinct instance detectors, one independent detector for each class is applied, as shown below:
Each detector is composed of an affine transformation followed by a sigmoid activation function. One way to interpret Eq. (3) is that there is one 'template' of a distinct sound event for each class, and a large $y_{b,i,c}$ indicates that $x_{b,i}$ is very likely a distinct instance (event) of scene $c$. Eq. (3) can easily be implemented by a 1D convolutional layer with $C$ (i.e., the number of classes) filters of size 1, followed by a sigmoid activation function. Since there is a single detector (SD) for each scene, we refer to this module as the SD module.
As shown in Eq. (4), for the prediction aggregator we chose the max-pooling function to aggregate instance-wise predictions into bag-level predictions, which is consistent with the SMI hypothesis. Other pooling functions may also be applicable, as long as they do not conflict with the SMI hypothesis.
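The SD module and the max-pooling aggregator together can be sketched as below. The class and variable names are hypothetical; the 1×1 convolution and sigmoid follow the paper's description of Eq. (3), and the max over instances implements Eq. (4):

```python
import torch
import torch.nn as nn

class SingleDetector(nn.Module):
    """Sketch of the SD module plus max-pooling aggregator.
    One affine 'template' per scene, implemented as a size-1 1D convolution."""
    def __init__(self, instance_dim=256, num_classes=10):
        super().__init__()
        # C filters of size 1: an independent affine detector per class
        self.detectors = nn.Conv1d(instance_dim, num_classes, kernel_size=1)

    def forward(self, bag):                      # bag: (batch, 256, K instances)
        p = torch.sigmoid(self.detectors(bag))   # instance scores: (batch, C, K)
        return p.max(dim=-1).values              # SMI-consistent max pooling -> (batch, C)

scores = SingleDetector()(torch.randn(4, 256, 62))
print(scores.shape)  # torch.Size([4, 10])
```

Max pooling respects the SMI hypothesis because the bag score for a class is high exactly when at least one instance scores high for it.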
When training the MIL-based models, we gather audio samples from class $c$ as the positive bags for class $c$, whereas audio samples from the other classes are collected as the negative bags for class $c$. Therefore, for each class, the number of negative bags is $C-1$ times the number of positive bags. To counter this imbalance, we apply a weighted binary cross entropy for each class, where the positive weight is set to $C-1$. The total loss introduced by a sample is the sum of the weighted binary cross entropy losses over all classes:
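A minimal sketch of this loss is given below, assuming (as reconstructed above) that the positive weight is $C-1$; the function and variable names are hypothetical:

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(bag_pred, labels, num_classes=10):
    """Sketch of the per-class weighted binary cross entropy.
    bag_pred: (batch, C) sigmoid outputs; labels: (batch,) integer scene indices.
    Positive bags are C-1 times rarer than negatives, so positives get weight C-1."""
    targets = F.one_hot(labels, num_classes).float()
    pos_weight = float(num_classes - 1)
    loss = -(pos_weight * targets * torch.log(bag_pred + 1e-7)
             + (1 - targets) * torch.log(1 - bag_pred + 1e-7))
    return loss.sum(dim=1).mean()   # sum over classes, average over the batch

loss = weighted_bce_loss(torch.rand(4, 10).clamp(1e-3, 1 - 1e-3),
                         torch.tensor([0, 3, 5, 9]))
print(float(loss) > 0)  # True
```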
It is worth noting that the bag-level prediction vector is not a normalized posterior probability over all the classes; that is, the predictions do not necessarily sum to 1. Instead, each node is an independent posterior of detecting distinct sound events for the corresponding class. During testing, the label with the highest posterior is assigned to the test recording.
At this point, we would like to explain why the instance detectors in Eq. (3) detect distinct instances for each class. Suppose two similar detectors were learned for two different scenes $c$ and $c'$. Then there must be instances that co-activate both labels $c$ and $c'$. This contradicts the fact that a positive bag of scene $c$ must be a negative bag of scene $c'$ (for $c' \neq c$). Thus the detector for each scene must find a pattern distinct to that scene.
For the bag-of-instances representations generated by the previously described CNN-MIL model, each instance vector is (indirectly) connected to all the frequency bins of the input spectrogram. Meanwhile, each instance vector covers only a limited (about 36 frames) temporal receptive field (TRF). Therefore, to cover both transient sound event patterns (e.g., birds singing) and long-lasting sound events (e.g., an engine idling), a multi-temporal scale (MTS) module is proposed to improve over the CNN-MIL model.
As shown in Fig. 3, dilated convolution is adopted to exponentially increase the TRF of each instance vector. The MTS module consists of three stacked 1D dilated convolutional layers, each with a filter size of 3 and a stride of 1, with the dilation rate increasing across layers. In this way, the TRF of the last layer is seven times the TRF of the input layer. Batch normalization and ReLU are applied after each dilated convolution layer, and proper zero padding is added to keep the 'time' axis of the feature maps fixed. Finally, the four feature maps are concatenated over the 'feature' dimension, and a 1D convolution with 256 filters of size 1 combines them. This module can be employed right after the instance generator of the CNN-MIL model.
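The MTS module can be sketched as follows. The dilation rates (1, 2, 4) are illustrative assumptions, as are the class and variable names; the overall structure (three stacked dilated 1D convolutions, concatenation of the four feature maps, and a size-1 combining convolution) follows the paper:

```python
import torch
import torch.nn as nn

class MTSModule(nn.Module):
    """Sketch of the multi-temporal-scale (MTS) module.
    Each dilated layer widens the temporal receptive field of the instances."""
    def __init__(self, dim=256, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        for d in dilations:
            # padding = dilation keeps the 'time' axis fixed for kernel size 3
            self.layers.append(nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, stride=1, dilation=d, padding=d),
                nn.BatchNorm1d(dim), nn.ReLU()))
        # size-1 conv combines the four concatenated feature maps back to 256 channels
        self.mix = nn.Conv1d(len(dilations) * dim + dim, dim, kernel_size=1)

    def forward(self, bag):                 # bag: (batch, 256, K instances)
        maps = [bag]
        for layer in self.layers:
            maps.append(layer(maps[-1]))    # stack: each layer sees a wider context
        return self.mix(torch.cat(maps, dim=1))

out = MTSModule()(torch.randn(4, 256, 62))
print(out.shape)  # torch.Size([4, 256, 62])
```

Because the time axis is preserved, the module can be dropped in between the instance generator and the detectors without changing the number of instances.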
As described in the introduction, acoustic scenes usually consist of multiple sound events. Since instance vectors are high-level representations of sound events, the distribution of instance vectors inside a bag is inevitably multi-modal. Thus, there might be multiple distinct sound events for each scene. In the CNN-MIL model, only one 'template' is learned per scene, which runs contrary to the multi-modality of the sound events. Therefore, we further propose to use multiple distinct instance detectors for each scene instead of a single one, inspired by the previously proposed sub-concepts layer.
As shown in Eq. (6), we allow the model to learn at most $M$ detectors for each scene, where $M$ is a hyper-parameter set by preliminary experiments. The max-pooling function is then used to aggregate evidence from the detectors, meaning a distinct sound event for a scene is said to be identified if any of that scene's detectors finds a match. Finally, we apply a softmax layer to normalize the evidence over the scene labels, so that if one instance is identified as a distinct sound event of one scene, it cannot simultaneously be a distinct sound event of other scenes. This multi-detector (MD) module can replace the single-detector (SD) module of Eq. (3).
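The MD module can be sketched as below. The names are hypothetical and the choice of four detectors per scene is an illustrative assumption; the structure (per-scene detector groups, max over a scene's detectors, softmax over scene labels, then max pooling over instances) follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDetector(nn.Module):
    """Sketch of the MD module: up to M detectors per scene."""
    def __init__(self, instance_dim=256, num_classes=10, num_detectors=4):
        super().__init__()
        self.C, self.M = num_classes, num_detectors
        # C * M affine detectors implemented as one size-1 1D convolution
        self.detectors = nn.Conv1d(instance_dim, num_classes * num_detectors, 1)

    def forward(self, bag):                          # bag: (batch, 256, K instances)
        e = self.detectors(bag)                      # (batch, C*M, K)
        e = e.view(bag.size(0), self.C, self.M, -1)  # (batch, C, M, K)
        e = e.max(dim=2).values                      # best-matching detector per scene
        p = F.softmax(e, dim=1)                      # normalize evidence over scenes
        return p.max(dim=-1).values                  # max pooling over instances -> (batch, C)

probs = MultiDetector()(torch.randn(4, 256, 62))
print(probs.shape)  # torch.Size([4, 10])
```

The per-instance softmax enforces the exclusivity constraint: an instance claimed as distinct for one scene cannot simultaneously carry high evidence for another.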
For our experiments, we used the development set of DCASE2018 Task1 Subtask B, which is the largest freely available dataset for ASC. Materials from device A (high-quality) are utilized, consisting of single-channel audio with a sampling rate of 44.1 kHz. The dataset covers ten acoustic scene classes, where each scene has 864 segments of 10 seconds in length, resulting in a total of 24 hours of audio. The default official partition into training and testing folds is adopted.
For the input features, we follow the configuration of the official baseline of the DCASE2018 challenge. A log mel-spectrogram is first extracted from each audio waveform, with a frame length of 40 ms, a 50% hop size, and 40 mel bands. A feature map of shape (40, 500) is therefore generated for each audio sample and fed into the proposed models. Models are trained using the Adam optimizer with a batch size of 256 and an initial learning rate of 0.001. We decay the learning rate by a factor of 0.5 when the validation accuracy does not improve for 3 consecutive epochs, which contributes marginally to performance. We train the models for 50 epochs and report the results with the highest accuracy. The models are implemented in PyTorch, and we have made our code publicly available at https://github.com/hackerekcah/distinct-events-asc.git.
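The optimizer and learning-rate schedule described above can be sketched with PyTorch's built-in scheduler. The model here is a stand-in, and mapping "does not improve for 3 consecutive epochs" to `patience=3` is an assumption:

```python
import torch
import torch.nn as nn

# Stand-in model; the real models are the CNN-MIL variants from the paper.
model = nn.Linear(256, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3)  # halve LR on accuracy plateau

for epoch in range(5):          # the paper trains for 50 epochs, batch size 256
    val_accuracy = 0.6          # placeholder for the real validation metric
    scheduler.step(val_accuracy)  # step once per epoch with validation accuracy

print(optimizer.param_groups[0]['lr'])
```

With a flat validation metric, the scheduler halves the learning rate once the plateau exceeds the patience.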
The hyper-parameter $M$ in Eq. (6) controls the maximum number of distinct sound events that can be detected for each scene. To examine how it affects performance, we replace the SD module in the CNN-MIL model with the proposed MD module and gradually increase the value of $M$ from 2 to 10. The results are plotted in Fig. 4, from which it can be seen that increasing $M$ does not necessarily improve performance; the model achieves its highest accuracy at an intermediate value. We speculate that with a large $M$, the model may simply learn duplicate sound event detectors. Thus, in the following section, we use the best-performing value of $M$ for the MD module.
Table 1 presents the performance of our proposed models. All models were trained and tested 10 times with varying random seeds, and the mean and standard deviation over these 10 independent trials are reported. For comparison, we include results from the official baseline as well as the best-performing single model (as opposed to fusion-based methods) we could find in the literature; these results are taken directly from the reference papers.
| Model | MTS | MD | Accuracy (%) |
|---|---|---|---|
| Baseline | - | - | 58.9 (0.8) |
| Modified Xception | - | - | 76.9 |
| 1⃝ CNN-MIL | | | 64.2 (1.1) |
| 2⃝ CNN-MTS-MIL | ✓ | | 65.4 (0.7) |
| 3⃝ CNN-MD-MIL | | ✓ | 66.5 (0.8) |
| 4⃝ CNN-MTS-MD-MIL | ✓ | ✓ | 68.3 (0.9) |
As can be seen, although our proposed models have not yet reached the state of the art, all of them improve over the official baseline by a large margin. In addition, to evaluate the proposed MTS and MD modules, the four proposed models form an ablation study of the two modules. Comparing the model pairs (1⃝ vs 2⃝) and (3⃝ vs 4⃝), we can see that the multi-temporal scale (MTS) module improves the results to a minor extent. Comparing the pairs (1⃝ vs 3⃝) and (2⃝ vs 4⃝), we can see that the MD module moderately improves performance in both cases, which suggests that allowing the detection of multiple distinct sound events is important for ASC. Finally, combining the two modules achieves the highest accuracy (68.3%) among our proposed models.
Further insight into the proposed MIL-based ASC system can be obtained by analyzing the confusion matrix. As shown in Fig. 5, the worst case is when the system predicts 'airport' instead of 'shopping mall'. This can happen when the distinct sound event detectors learned for 'airport' during training are matched in 'shopping mall' recordings during testing. In addition, confusions are observed between scenes with similar prominent events, such as 'metro', 'tram' and 'bus'. We expect that this confusion could be reduced by combining evidence from previous strategies.
A number of factors could be investigated to further improve performance. For example, the prediction aggregator has been shown to significantly affect the performance of MIL models for sound event detection; it remains to be seen how it would affect our models. Furthermore, CNN embeddings pretrained on large-scale sound event datasets could be utilized to guide the instance generator. Moreover, an interesting and perhaps valuable by-product of the MIL model is the instance-level predictions; this information might be further exploited to better infer bag labels.
In this paper, we presented a new strategy for ASC, which recognizes acoustic scenes by identifying distinct sound events. Distinct sound events are not predefined by the user, instead, they are identified implicitly by using an MIL framework. We show that reasonable results can be achieved by using the proposed CNN-MIL model. Furthermore, we show that the proposed MTS and MD modules consistently improve the basic CNN-MIL model, highlighting that modeling the multi-temporal scale and multi-modal nature of sound events is important. Additionally, the proposed modules are not restricted to ASC and may be applied to other related tasks, such as sound event detection and bird sound detection. Finally, this study also provides an opportunity for future combinations of this strategy with previous ones.
This research is supported by the National Natural Science Foundation of China under grant No. U1736210.
J.-J. Aucouturier, B. Defreville, and F. Pachet, “The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music,” The Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881–891, 2007.
T. Zhang, K. Zhang, and J. Wu, “Temporal transformer networks for acoustic scene classification,” in Proc. Interspeech, 2018, pp. 1349–1353.
S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37, Lille, France, Jul 2015, pp. 448–456.