This paper investigates the classification of the Audio Set dataset. Audio Set is a large-scale multi instance learning (MIL) dataset of sound clips. In MIL, a bag consists of several instances, and a bag is labelled positive if one or more of its instances is positive. Audio Set is a MIL dataset because an audio clip is labelled positive for a class if at least one frame contains the corresponding class. We tackle this MIL problem using an attention model and explain this attention model from a novel probabilistic perspective. We define a probability space on each bag. Each instance in a bag has a trainable probability measure for a class. The classification of a bag is then the expectation of the classification of the instances in the bag with respect to the learned probability measure. Experimental results show that our proposed attention model, modeled by a fully connected deep neural network, obtains an mAP of 0.327 on the Audio Set dataset, outperforming Google's baseline of 0.314 and a recurrent neural network of 0.325.
Analysis of environmental sounds has been a popular topic with potential uses in many applications, such as public security surveillance, smart homes, smart cars and health care monitoring. Audio classification has also attracted significant research effort due to the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [1, 2]. Several tasks have been defined for audio classification, including acoustic scene classification, sound event detection and audio tagging [3, 4]. However, the data sets used in these challenges are relatively small. Recently, Google released an ontology and a human-labeled large-scale data set for audio events, namely Audio Set. Audio Set consists of an expanding ontology of 527 sound event classes and a collection of over 2 million human-labeled 10-second sound clips drawn from YouTube videos.
Audio Set is defined for tasks such as audio tagging. The objective of audio tagging is to perform multi-label classification on fixed-length audio chunks (i.e. assigning zero or more labels to each audio chunk) without predicting the precise boundaries of acoustic events. This task was first proposed in the DCASE2016 challenge. Deep neural networks (DNNs) and convolutional recurrent neural networks (CRNNs) have been used for predicting the occurring audio tags. Neural networks with an attention scheme were first proposed in our previous work for the audio tagging task, which provides the ability to localize the related audio events. Gated convolutional neural networks have also been applied in the “Large-scale weakly supervised sound event detection for smart cars” task of the DCASE2017 challenge (http://www.cs.tut.fi/sgn/arg/dcase2017/), where our system achieved 1st place in the audio tagging sub-task. However, the audio tagging data set used in the DCASE2017 challenge is just a small subset of Google Audio Set: it has only 17 audio event classes, compared with 527 classes in Google Audio Set.
In this paper, we propose to use an attention model for audio tagging on Google Audio Set, which shows better performance than Google's baseline. This work makes two main contributions. First, we conduct and explore large-scale audio tagging on Google Audio Set. Second, we explain the attention model from a probability perspective. The attention scheme is also similar to a feature selection process that picks out the related features while suppressing the unrelated background noise. It is achieved by a weighted sum over frames, where the attention values are automatically learned by the neural network model.
Multi instance learning (MIL) is a variation on supervised learning, where each learning example contains a bag of instances. In MIL, a positive bag contains at least one positive instance, whereas a negative bag contains no positive instances. Each audio clip in Audio Set contains several feature vectors. An audio clip is labelled positive for a class if at least one feature vector belongs to the corresponding class.
A multi instance dataset consists of many pairs $\{(B_n, y_n)\}_{n=1}^{N}$, where $N$ is the number of training pairs. Each bag consists of several instances, $B_n = \{x_{n1}, \dots, x_{nT}\}$, where $x_{nt}$ is an instance in a bag and $T$ is the number of instances in each bag. We denote $y_n$ as the label of the $n$-th bag. In Audio Set classification, a bag is a collection of features from an audio clip. Each instance $x_{nt} \in \mathbb{R}^{M}$ is a feature, where $M$ is the dimension of the feature. The label of a bag is $y_n \in \{0, 1\}^{K}$, where $K$ is the number of audio classes and $0$ and $1$ represent the negative and positive label, respectively. For a specific class $k$, when the label of the $n$-th bag is $y_{nk} = 1$, there exists $x_{nt} \in B_n$ such that $x_{nt}$ is positive. Otherwise, if $y_{nk} = 0$, every $x_{nt} \in B_n$ is negative. Assuming we have a classifier $f$ on each instance, we want to obtain a classifier $F$ on each bag. There are several ways to obtain a bag-level classifier from an instance-level classifier, described as follows.
The collective assumption states that all instances in a bag contribute equally and independently to the bag's label. Under this assumption, the bag-level classifier $F$ is obtained from the instance-level classifier $f$ by using the (normalized) sum as the aggregation rule:
$$F(B) = \frac{1}{|B|} \sum_{x \in B} f(x) \tag{1}$$
The collective assumption is simple, but it implies that every instance inherits the label of its bag. This is not the same as the MIL assumption, under which a positive bag needs only one positive instance.
The maximum selection states that the prediction of a bag is the maximum classification value over the instances in the bag:
$$F(B) = \max_{x \in B} f(x) \tag{2}$$
Maximum selection corresponds to a global max pooling layer in a convolutional neural network. Maximum selection performs well in audio tagging but is sometimes inefficient in training, because the gradient is computed only from the single instance with the highest classification value in each bag.
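As a minimal sketch of the collective and maximum selection rules described above, the following pure-Python snippet pools instance-level predictions for one class into a bag-level prediction. The collective rule is written here as an average, which is how the text later refers to Equation (1). The function names and the example values are illustrative assumptions, not taken from the released code.

```python
from typing import Sequence


def average_pooling(instance_probs: Sequence[float]) -> float:
    """Collective assumption: every instance contributes equally."""
    return sum(instance_probs) / len(instance_probs)


def max_pooling(instance_probs: Sequence[float]) -> float:
    """Maximum selection: only the highest-scoring instance decides."""
    return max(instance_probs)


# Hypothetical per-second predictions for one class in a 10-instance bag;
# the event is audible in a single instance only.
probs = [0.05, 0.10, 0.05, 0.90, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05]

avg_score = average_pooling(probs)  # diluted by the nine quiet instances
max_score = max_pooling(probs)      # driven entirely by the loud instance
print(f"average pooling: {avg_score:.2f}, max pooling: {max_score:.2f}")
```

In this toy bag the average is pulled down by the instances that do not contain the event, while the maximum depends on one instance only, which is also the only instance that would receive a gradient during training.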
The weighted collective assumption is a generalization of the collective assumption, where a different weight $w(x)$ is allowed for each instance $x$:
$$F(B) = \frac{1}{\sum_{x \in B} w(x)} \sum_{x \in B} w(x) f(x) \tag{3}$$
The weighted collective assumption asserts that each instance contributes independently but not necessarily equally to the label of a bag. This is achieved by incorporating a weight function $w$ into the collective assumption. Equation (3) has the same form as our joint detection-classification (JDC) model and our attention model proposed for audio tagging and sound event detection. The difference is that the work in [14, 6] models both the weight function $w$ and the instance classifier $f$ using neural networks.
Although Equation (3) has been used in many previous works [9, 14, 6], the explanation for this equation has not been clearly presented. In this paper we explain the attention model in Equation (3) from a probabilistic perspective, which helps to guide the selection of the weight function $w$ and the instance classifier $f$ in Equation (3).
The instances in a bag should contribute differently to the classification of the bag. In MIL, a bag is labelled positive if at least one instance in the bag is positive. To handle this, the positive instances should be attended to and the negative instances should be ignored. We first assign a measure on each instance $x \in \Omega$, where $\Omega$ is a set lying in, for example, a Euclidean space. To assign the measure on each instance $x$, we introduce the measure space.
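Definition 1. Let $\Omega$ be a set and $\mathcal{F}$ a Borel field of subsets of $\Omega$. A measure $\mu(\cdot)$ on $\Omega$ is a numerically valued set function with domain $\mathcal{F}$, satisfying the following axioms:
1. $\mu(E) \ge 0$ for every $E \in \mathcal{F}$.
2. If $\{E_j\}$ is a countable collection of disjoint sets in $\mathcal{F}$, then $\mu\left(\bigcup_j E_j\right) = \sum_j \mu(E_j)$.
We then call the triple $(\Omega, \mathcal{F}, \mu)$ a measure space.
In addition, if we have:
3. $\mu(\Omega) = 1$,
then we call the triple $(\Omega, \mathcal{F}, \mu)$ a probability space.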
When classifying a bag, different instances contribute differently. We define a probability space for each bag $B_n$ and each class $k$. As $B_n \subset \Omega$, we may define a probability space $(B_n, \mathcal{F}_n, p_k)$ on $B_n$, where $\mathcal{F}_n$ is the Borel field of the set $B_n$. The probability measure $p_k$ on $B_n$ satisfies $p_k(B_n) = 1$, so Definition 1, Axiom 3 is satisfied. We call $(B_n, \mathcal{F}_n, p_k)$ a probability space for the $k$-th class. For an instance $x \in B_n$, the closer $p_k(x)$ is to 1, the more this instance is attended; the closer $p_k(x)$ is to 0, the less it is attended.
Assume that, for the $k$-th class, the classification prediction and the probability measure on each instance $x \in B_n$ are $f_k(x)$ and $p_k(x)$, respectively. To obtain the classification result on the bag $B_n$, we take the expectation of the classification result $f_k$ with respect to the probability measure $p_k$:
$$F_k(B_n) = \sum_{x \in B_n} f_k(x)\, p_k(x) \tag{4}$$
where $f_k(x)$ is a random variable. Equation (4) expresses that the instances $x \in B_n$ contribute differently to the classification of the bag $B_n$. The probability measure $p_k(x)$ controls how much an instance $x$ is attended: a large $p_k(x)$ means the instance is attended, and a small $p_k(x)$ means it is ignored.
For each class $k$, a mapping $f_k : \Omega \to [0, 1]$ is used to model the presence probability of the $k$-th class for an instance $x$. On the other hand, modeling the probability measure $p_k$ is difficult because of the constraint that the probabilities of the instances in a bag must sum to 1:
$$\sum_{x \in B_n} p_k(x) = 1 \tag{5}$$
So instead of modeling $p_k$ directly, we start from modeling a measure $v_k$ in the measure space, because a measure does not need to satisfy Definition 1, Axiom 3. To model $v_k$, we use a non-negative mapping $v_k$ on $\Omega$, where $v_k(x) \ge 0$. Then for each bag $B_n$ and each instance $x \in B_n$, we may define the probability measure of the instance for the $k$-th class as:
$$p_k(x) = \frac{v_k(x)}{v_k(B_n)} \tag{6}$$
where $v_k(x)$ and $v_k(B_n)$ are the measures of $x$ and $B_n$, respectively. From Definition 1, Axiom 2, $v_k(B_n)$ can be calculated by $v_k(B_n) = \sum_{x \in B_n} v_k(x)$, so the constraint in Equation (5) is satisfied. After modeling $f_k$ and $v_k$, the prediction of the $k$-th class can be obtained by using Equation (4). The framework of the attention model is shown in Fig. 1.
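The construction above can be sketched in a few lines of pure Python: a non-negative measure is normalised by the measure of the whole bag to give a probability measure, and the bag-level prediction is the expectation of the instance predictions under it. The function name `attention_pooling` and the numbers are illustrative assumptions for a single class.

```python
from typing import Sequence


def attention_pooling(f: Sequence[float], v: Sequence[float]) -> float:
    """Bag-level prediction as the expectation of f under p = v / sum(v)."""
    total = sum(v)  # measure of the whole bag, by additivity (Axiom 2)
    if total <= 0:
        raise ValueError("the bag must have positive total measure")
    p = [v_t / total for v_t in v]  # probability measure; sums to 1
    return sum(f_t * p_t for f_t, p_t in zip(f, p))


# Hypothetical instance predictions f and non-negative measures v for one class.
f = [0.1, 0.9, 0.2, 0.1]
v = [0.5, 8.0, 1.0, 0.5]

bag_prediction = attention_pooling(f, v)
print(round(bag_prediction, 3))
```

Because the normalised weights are non-negative and sum to 1, the bag prediction always lies between the smallest and largest instance prediction. Giving every instance the same measure recovers plain average pooling.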
The Audio Set dataset is highly unbalanced. Some classes have tens of thousands of samples, while other classes contain only hundreds of samples. We therefore propose a mini-batch balancing strategy, where the occurrence frequency of training samples of the different classes in a mini-batch is kept the same.
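The text does not spell out how the balancing is implemented, so the following is only one plausible sketch under stated assumptions: classes are visited in a round-robin order and one clip is drawn at random from the clips tagged with the current class. The helper names `build_class_index` and `balanced_batch` and the toy data are hypothetical, and multi-label co-occurrence is ignored.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def build_class_index(labels: Sequence[Sequence[int]]) -> Dict[int, List[int]]:
    """Map each class id to the indices of the clips tagged with it."""
    index: Dict[int, List[int]] = defaultdict(list)
    for clip_idx, clip_classes in enumerate(labels):
        for k in clip_classes:
            index[k].append(clip_idx)
    return dict(index)


def balanced_batch(index: Dict[int, List[int]], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Draw a batch by cycling through classes, one random clip per class."""
    classes = sorted(index)
    rng.shuffle(classes)
    batch = []
    for i in range(batch_size):
        k = classes[i % len(classes)]  # each class is visited equally often
        batch.append(rng.choice(index[k]))
    return batch


# Hypothetical toy data: class 0 is common, class 1 is rare.
labels = [[0]] * 98 + [[1]] * 2
index = build_class_index(labels)
batch = balanced_batch(index, batch_size=10, rng=random.Random(0))
n_rare = sum(1 for i in batch if 1 in labels[i])
print(f"rare-class clips in batch: {n_rare} of {len(batch)}")
```

With uniform sampling over clips, the rare class would appear in about 2% of the batch; with this scheme it fills half of it, at the cost of repeating the rare clips more often.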
We experiment on the Audio Set dataset. Audio Set contains over 2 million 10-second audio clips extracted from YouTube videos. Audio Set consists of 527 classes of audio with a hierarchical structure. The original waveforms of the 2 million audio clips are not published. Instead, we use the published bottleneck feature vectors extracted from the embedding layer representation of a deep CNN trained on the YouTube-100M dataset. The bottleneck feature vectors are extracted at one feature per second, that is, there are 10 features in a 10-second audio clip. The bottleneck feature vectors are then post-processed by principal component analysis (PCA) to remove the correlations, and only the first 128 PCA coefficients are kept.
The source code of this system is available at https://github.com/qiuqiangkong/ICASSP2018_audioset. We apply a simple fully connected deep neural network to verify the effectiveness of the proposed attention model. We first apply fully connected layers on the input feature vectors to extract a high-level representation. We call this mapping the embedded mapping and denote it as $f_{\mathrm{emb}}$. We call $h = f_{\mathrm{emb}}(x)$ the embedded instance. The embedded mapping $f_{\mathrm{emb}}$ is modeled by three fully connected layers, with 500 hidden units in each layer, followed by ReLU non-linearity and a dropout rate of 0.2 to reduce the risk of over-fitting. These configurations are chosen empirically. Then we model the classifier $f$ and the measure $v$ on each embedded instance $h$ by the following equation:
$$f(h) = \sigma(W_1 h + b_1), \qquad v(h) = \phi(W_2 h + b_2) \tag{7}$$
where $\sigma$ is the sigmoid non-linearity. The sigmoid non-linearity ensures that the probability $f(h)$ is between 0 and 1. The non-linearity $\phi$ can be any non-negative function, and we investigate ReLU, sigmoid and softmax functions in our experiment.
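Then we may obtain the probability measure $p_k(h)$ in the $n$-th bag by:
$$p_k(h) = \frac{v_k(h)}{\sum_{h' \in B_n} v_k(h')} \tag{8}$$
Finally the prediction of the $k$-th event in bag $B_n$ is obtained by using Equation (4).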
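A forward pass of this architecture might look like the NumPy sketch below. It assumes the layer sizes stated in the text (128-dimensional inputs, three hidden layers of 500 units, 527 classes), while the random initialisation, the parameter layout, the function names and the small `eps` that guards against an all-zero ReLU measure are assumptions of this sketch rather than details of the released code. Dropout is omitted, as it would be at inference time.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0.0)


def init_params(rng, in_dim=128, hidden=500, n_classes=527):
    """Random weights for three embedding layers plus two output heads."""
    dims = [in_dim, hidden, hidden, hidden]
    emb = [(rng.normal(0, 0.05, (a, b)), np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
    cla = (rng.normal(0, 0.05, (hidden, n_classes)), np.zeros(n_classes))
    att = (rng.normal(0, 0.05, (hidden, n_classes)), np.zeros(n_classes))
    return {"emb": emb, "cla": cla, "att": att}


def forward(x, params, phi=sigmoid, eps=1e-7):
    """x: (batch, T, in_dim) -> bag-level predictions of shape (batch, K)."""
    h = x
    for w, b in params["emb"]:  # embedded mapping
        h = relu(h @ w + b)
    f = sigmoid(h @ params["cla"][0] + params["cla"][1])    # instance classifier
    v = phi(h @ params["att"][0] + params["att"][1]) + eps  # non-negative measure
    p = v / v.sum(axis=1, keepdims=True)                    # normalise over instances
    return (f * p).sum(axis=1)                              # expectation of f under p


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.normal(size=(4, 10, 128))  # 4 hypothetical clips, 10 instances each
y = forward(x, params)
print(y.shape)
```

Passing `phi=relu` switches the measure to the ReLU variant. A softmax variant is left out here because the text does not say along which axis the softmax is taken.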
We evaluate using mean average precision (mAP), area under the curve (AUC) and d-prime. These values are computed for each of the 527 classes and averaged across the 527 classes to obtain the final mAP, AUC and d-prime. Higher mAP, AUC and d-prime indicate better performance.
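For illustration, the per-class quantities can be computed as in the sketch below. Average precision is taken as the mean precision at the rank of each positive clip, and d-prime is derived from AUC with the usual equal-variance Gaussian conversion, $d' = \sqrt{2}\,\Phi^{-1}(\mathrm{AUC})$. Whether the authors convert per-class AUC values or the averaged AUC is not stated here, so the function names and the toy scores should be read as assumptions.

```python
import math
from statistics import NormalDist
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision needs at least one positive")
    return sum(precisions) / len(precisions)


def d_prime(auc: float) -> float:
    """d-prime from AUC, assuming equal-variance Gaussian score distributions."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)


# Hypothetical scores for one class over five clips (no tied scores).
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(y_true, y_score)  # (1/1 + 2/3) / 2
print(f"AP = {ap:.3f}, d-prime at AUC 0.965 = {d_prime(0.965):.2f}")
```

The mAP is then the mean of `average_precision` over all classes. An AUC of 0.5 (chance level) maps to a d-prime of 0.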
Table 1 shows the results with and without the data balancing strategy, using the collective assumption in Equation (1). The data balancing strategy is described in Section 3.5. Using the balancing strategy performs better than not using it on all of mAP, AUC and d-prime.
Table 2 shows the results of modeling the measure function using different non-negative functions, including ReLU, sigmoid and softmax. The softmax function performs slightly better than the sigmoid function (mAP 0.327 vs. 0.326), and both perform better than ReLU (mAP 0.306).
Table 3 shows the comparison of different pooling strategies. Average pooling and max pooling along the time axis are described in Equations (1) and (2), respectively. The Google baseline uses a simple fully connected DNN. Table 3 shows that an RNN with global average pooling (mAP 0.325) performs better than the Google baseline (mAP 0.314). The DNN with softmax attention (mAP 0.327) achieves better performance than both the Google baseline and the RNN.
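Table 2: Results of modeling the measure function with different non-negative functions.

| Model | mAP | AUC | d-prime |
|---|---|---|---|
| DNN ReLU attention | 0.306 | 0.961 | 2.500 |
| DNN sigmoid attention | 0.326 | 0.964 | 2.547 |
| DNN softmax attention | 0.327 | 0.965 | 2.558 |

Table 3: Comparison of different pooling strategies.

| Model | mAP | AUC | d-prime |
|---|---|---|---|
| DNN max pooling | 0.284 | 0.958 | 2.442 |
| DNN avg. pooling | 0.296 | 0.960 | 2.473 |
| RNN avg. pooling | 0.325 | 0.960 | 2.480 |
| DNN softmax attention | 0.327 | 0.965 | 2.558 |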
In this paper, an attention model for audio classification is explained from a probability perspective. Both the classifier and the probability measure on each instance are modeled by a neural network. We apply a fully connected neural network with this attention model to Audio Set and achieve an mAP of 0.327 and an AUC of 0.965, outperforming the Google baseline and a recurrent neural network. In the future, we will further explore modeling the probability measure using different non-negative functions.
This research is supported by EPSRC grant EP/N014111/1 “Making Sense of Sounds” and Research Scholarship from the China Scholarship Council (CSC).