1 Introduction
Finely-labelled data are crucial to the success of supervised and semi-supervised classification algorithms. In particular, common deep learning approaches (Bishop, 1995; LeCun et al., 2015) typically require a great number of data samples to train effectively (Krizhevsky et al., 2012; Rolnick et al., 2017). In this work, a sample refers to a section taken from a larger time-series dataset of audio. These datasets often lack large quantities of fine labels, as producing them is extremely costly (requiring exact marking of start and stop times). The common distribution of data in these domains is such that there are small quantities of expertly (finely) labelled data, large quantities of weakly (coarsely) labelled data, and a large volume of unlabelled data. Here, weak labels indicate that one or more events are present in a sample, but contain no information about the event frequency or the exact location of the occurrence(s) (illustrated in Section 2.2, Fig. 2). Our goal is therefore to improve classification performance in domains with datasets of variable label quality.
Our key contribution is as follows. We propose a framework that combines the strengths of both traditional algorithms and deep learning methods to perform multi-resolution Bayesian bootstrapping. We generate probabilistic pseudo-fine labels from weak labels, which can then be used to train a neural network. For the label refinement from weak to fine we use a Kernel Density Estimator (KDE).
The remainder of the paper is organised as follows. Section 2 discusses the structure of the framework as well as the baseline classifiers we test against. Section 3 describes the datasets we use and the details of the experiments carried out. Section 4 presents the experimental results, while Section 5 concludes.
2 Methodology
2.1 Framework Overview
Our framework is separated into an inner and an outer classifier in cascade, as in Figure 1. For the inner classifier we extract features from the finely and weakly-labelled audio data, using the two-sample Kolmogorov-Smirnov test to select features of a log-mel spectrogram (Section 2.2). We train our inner classifier, the Gaussian KDE (Section 2.3), on the finely-labelled data and predict on the weakly-labelled data.
For the outer classifier we extract the feature vectors from an unlabelled audio dataset using the log-mel spectrogram. We then train our outer classifier, a CNN (Section 2.4), on the finely-labelled data and the resulting pseudo-finely labelled data output by the Gaussian KDE. The details of our data and problem are found in Section 3. Code will be made available at https://github.com/HumBugMosquito/weak_labelling.
2.2 Feature Extraction and Selection
The CNN uses the log-mel spectrogram (as in Fig. 2), as it has recently become the gold standard in feature representation for audio data (Hayashi et al., 2017; Kong et al., 2019).
The input signal is divided into second-long windows and we compute log-mel filterbank features for each. Thus, for a given duration of audio input, the feature extraction method produces a corresponding matrix of log-mel features.
The two-sample Kolmogorov-Smirnov (KS) test (Ivanov et al., 2012) is a non-parametric test for the equality of continuous, one-dimensional probability distributions that can be used to compare two samples. The measure of similarity is provided by the Kolmogorov-Smirnov statistic, which quantifies a distance between the empirical distribution functions of the two samples. We use the KS test to select a subset of the log-mel features that are maximally different between the two classes to feed into the classifiers, choosing the features with the largest KS statistics. Fig. 2 illustrates that this process of finding maximally different feature pairs correctly chooses frequencies of interest. For example, if the noise file is concentrated in high frequencies (as in Fig. 2), the KS feature selection process chooses lower harmonics of the training signal (a mosquito flying tone) as features to feed to the algorithms. Conversely, for low-frequency dominated noise, higher audible harmonics of the event signal are identified.
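The selection step above can be sketched as follows, using `scipy.stats.ks_2samp`. The function name, feature shapes and toy data are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import ks_2samp

def select_ks_features(event_feats, noise_feats, k=8):
    """Rank log-mel bands by the two-sample KS statistic between classes.

    event_feats, noise_feats: arrays of shape (n_frames, n_bands), one row
    per spectrogram frame. Returns the indices of the k bands whose
    empirical distributions differ most between the two classes.
    """
    n_bands = event_feats.shape[1]
    stats = np.array([
        ks_2samp(event_feats[:, b], noise_feats[:, b]).statistic
        for b in range(n_bands)
    ])
    # Largest KS statistic = most dissimilar between classes.
    return np.argsort(stats)[::-1][:k]

# Toy example: band 3 carries simulated "event tone" energy, the rest is
# shared background noise, so band 3 should be ranked first.
rng = np.random.default_rng(0)
event = rng.normal(0.0, 1.0, size=(500, 10))
noise = rng.normal(0.0, 1.0, size=(500, 10))
event[:, 3] += 3.0
selected = select_ks_features(event, noise, k=2)
```

In the paper's setting, the two inputs would be frames drawn from finely-labelled event and noise audio respectively.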
2.3 Gaussian Kernel Density Estimation
Kernel density estimation (KDE), or Parzen estimation (Scott, 2015; Parzen et al., 1962), is a non-parametric method for estimating a $D$-dimensional probability density function $p(\mathbf{x} \mid c)$ from a finite sample $\mathbf{x}_1, \ldots, \mathbf{x}_N$ by convolving the empirical density function with a kernel function. We then use Bayes' theorem to calculate the posterior over class $c$:

$$P(c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c)\,P(c)}{\sum_{c'} p(\mathbf{x} \mid c')\,P(c')}, \qquad (1)$$

where $p(\mathbf{x} \mid c)$ is the KDE per class $c$, with $c = 1$ representing the event class and $c = 0$ the noise class (i.e. non-event).
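A minimal one-dimensional sketch of the per-class KDE and the Bayes posterior, using `scipy.stats.gaussian_kde`; the toy class-conditional distributions and the equal priors are illustrative assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_posterior(x_query, x_event, x_noise, prior_event=0.5):
    """Posterior P(event | x) via per-class Gaussian KDEs and Bayes' theorem."""
    p_event = gaussian_kde(x_event)(x_query)   # p(x | c = 1), event class
    p_noise = gaussian_kde(x_noise)(x_query)   # p(x | c = 0), noise class
    num = p_event * prior_event
    den = num + p_noise * (1.0 - prior_event)
    return num / den

# Toy 1-D features: events centred at +2, noise centred at -2.
rng = np.random.default_rng(1)
x_event = rng.normal(2.0, 0.5, 300)   # finely-labelled event samples
x_noise = rng.normal(-2.0, 0.5, 300)  # finely-labelled noise samples
post = kde_posterior(np.array([2.0, -2.0]), x_event, x_noise)
```

Here `post[0]` is close to 1 and `post[1]` close to 0; in the framework, these per-point posteriors evaluated over weakly-labelled audio are the probabilistic pseudo-fine labels.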
2.4 Convolutional Neural Network
Given our scarce-data environment, we use a CNN with dropout (Srivastava et al., 2014). Our proposed architecture, given in Fig. 3, consists of an input layer connected sequentially to a single convolutional layer and a fully connected layer. The CNN is trained with SGD (Bottou et al., 2010), and all activations are ReLUs. We use this particular architecture due to constraints in data size (Kiskin et al., 2018), and therefore have few layers and fewer parameters to learn.
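The architecture can be sketched as below (in PyTorch, for illustration). The filter count, kernel size, input dimensions and dropout rate are assumptions, since the exact values belong to Fig. 3 and are not restated here:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Single convolutional layer + fully connected layer, ReLU activations,
    with dropout. All layer sizes are illustrative placeholders."""

    def __init__(self, n_mels=64, n_frames=20, p_drop=0.5):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(16 * n_mels * n_frames, 2)  # event vs noise

    def forward(self, x):             # x: (batch, 1, n_mels, n_frames)
        h = torch.relu(self.conv(x))  # single conv layer, ReLU activation
        h = self.drop(h.flatten(1))   # dropout before the dense layer
        return self.fc(h)             # class logits

model = SmallCNN()
logits = model(torch.randn(4, 1, 64, 20))  # a batch of 4 log-mel patches
```

Training would then minimise a cross-entropy loss with SGD over the finely and pseudo-finely labelled patches.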
2.5 Traditional Classifier Baselines
We compare our inner classifier, the Gaussian KDE, with more traditional classifiers that are widely used in machine learning: Linear Discriminant Analysis (LDA), Gaussian Naïve Bayes (GNB), support vector machines using a radial basis function kernel (RBFSVM), random forests (RF) and a multilayer perceptron (MLP).
3 Experiments
3.1 Description of Datasets
The most common scenario in which mixed-quality labels are found is crowdsourcing tasks (Cartwright et al., 2019; Deng et al., 2009; Lin et al., 2014), or any challenge where data collection is expensive. The HumBug project (Zooniverse, 2019; Li et al., 2017) utilises crowdsourcing, forming the main motivation for this research as well as the basis for our signal (the overall goal of HumBug is real-time mosquito species recognition to identify potential malaria vectors in order to deliver intervention solutions effectively). The event signal consists of a stationary real mosquito audio recording of short duration. The noise file is a non-stationary section of ambient forest sound. The event signal is placed randomly throughout the noise file at varying signal-to-noise ratios (SNRs), to create quantifiable data for prototyping algorithms. There is a class imbalance of event to noise in the finely-labelled data, and this is propagated to the weakly-labelled and unlabelled datasets. We include a small amount of expert, finely-labelled data, a larger quantity of weakly-labelled data, and a further set of unlabelled test data. To report performance metrics, we create synthetic labels at a fine temporal resolution for the finely-labelled data, and at a resolution on the order of seconds for the weakly-labelled data. This weak-label duration is chosen so as to give the labeller temporal context when classifying audio as an event or non-event. As the listener is presented with randomly sampled (or actively sampled (Houlsby et al., 2011; Naghshvar et al., 2012)) sections of audio data, a much shorter section would make the task of tuning into each new example very difficult due to the absence of a reference signal.
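The signal-injection step described above can be sketched as follows. The sample rate, durations and SNR definition (mean-power ratio in dB) are illustrative assumptions:

```python
import numpy as np

def inject_event(noise, event, snr_db, rng):
    """Place `event` at a random offset in `noise`, scaled to a target SNR.

    Returns the mixed waveform and the (start, stop) sample indices of the
    injected event, which serve as the synthetic fine labels.
    """
    # Scale the event so that 10 * log10(P_event / P_noise) == snr_db.
    p_noise = np.mean(noise ** 2)
    p_event = np.mean(event ** 2)
    gain = np.sqrt(p_noise / p_event * 10 ** (snr_db / 10.0))
    start = rng.integers(0, len(noise) - len(event))
    mixed = noise.copy()
    mixed[start:start + len(event)] += gain * event
    return mixed, (start, start + len(event))

rng = np.random.default_rng(2)
noise = rng.normal(0, 1, 16000)                             # 1 s at 16 kHz (toy)
event = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)   # 0.1 s tone (toy)
mixed, (lo, hi) = inject_event(noise, event, snr_db=0.0, rng=rng)
```

Varying `snr_db` and the random offset over repeated runs yields the quantifiable prototyping data the experiments use.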
3.2 Experimental Design
We evaluate our inner model against the baseline classifiers in two experiments, and finally test the overall performance of the framework utilising the outputs of the various inner classifiers. We make the assumption that the weak labels are accurate. Therefore, all the classifiers predict over the coarsely class-labelled data only. Additionally, the priors used in Eq. 1 for our Gaussian KDE model are set to be equal, $P(c = 0) = P(c = 1) = 0.5$. This reflects that, since the audio sample is weakly labelled, each data point is equally likely to belong to either fine class.
Generative models such as the Gaussian KDE obtain a performance boost from the additional information provided by the coarse-class data, as this allows them to better model the class distribution. Conversely, discriminative models such as the SVM, RF and MLP suffer in performance, because the decision boundary they create overfits to the class data points due to the increased class imbalance. We therefore train the LDA, GNB, SVM, RF and MLP on the finely-labelled data only, whereas the Gaussian KDE is trained on both the finely-labelled data and the coarse-class data.
4 Classification Performance
For each SNR we run repeated iterations, varying the time location of the injected signals as well as the random seed of the algorithms. After applying median filtering, we obtain the results in Fig. 4. The F-score gradually increases, as expected, from the threshold of detection to more audible SNRs. The decay of performance at the lower SNRs can be partially accounted for by the two-sample KS test used for feature selection failing to choose features of interest due to the increased noise floor. We observe a significant benefit to using the Gaussian KDE, which, when combined with temporal averaging, helps recover the dynamic nature of the signal (namely that there is correlation between neighbouring values in the time series).
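The temporal averaging can be sketched with a short median filter over the per-frame probabilities; the probabilities, window length and threshold below are toy assumptions:

```python
import numpy as np
from scipy.signal import medfilt

# Noisy per-frame event probabilities: isolated flips inside a true event run.
probs = np.array([0.1, 0.1, 0.9, 0.2, 0.9, 0.9, 0.8, 0.1, 0.9, 0.1, 0.1])

# Median filtering exploits the temporal correlation of the signal:
# isolated outlier frames are replaced by their neighbours' consensus.
smoothed = medfilt(probs, kernel_size=3)  # window length is a tuning choice
labels = (smoothed > 0.5).astype(int)
```

The lone dip at frame 3 and the lone spike at frame 8 are both corrected, leaving one contiguous detected event.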
Fig. 4 also shows that the Gaussian KDE produces better-calibrated probabilities than the other baseline classifiers. This is demonstrated by applying rejection (Bishop, 1995; Hanczar and Dougherty, 2008) in addition to the median filtering: output probabilities falling within a rejection window around 0.5 are discarded. The Gaussian KDE improves significantly in performance, especially at the lower SNRs; however, it should be noted that the F-score is evaluated on the data remaining after rejection. The Gaussian KDE rejects a large proportion of the data at lower SNRs, showing that its probabilities sit at either extreme only when the model is confident in its predictions.
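Rejection can be sketched as discarding predictions whose probability falls inside a symmetric window around 0.5. The window width and toy values below are assumptions, since the paper's window size is not restated in this sketch:

```python
import numpy as np

def reject_uncertain(probs, labels, width=0.8):
    """Keep only predictions whose probability lies outside a symmetric
    rejection window around 0.5 of total width `width` (assumed value)."""
    lo, hi = 0.5 - width / 2, 0.5 + width / 2
    keep = (probs <= lo) | (probs >= hi)
    # Metrics such as the F-score are then computed on the retained subset.
    return probs[keep], labels[keep], keep.mean()

probs = np.array([0.02, 0.45, 0.55, 0.97, 0.30, 0.99])
labels = np.array([0, 0, 1, 1, 0, 1])
kept_p, kept_y, frac = reject_uncertain(probs, labels)
# 0.45, 0.55 and 0.30 fall inside the window and are rejected.
```

Reporting `frac`, the fraction retained, alongside the F-score makes explicit how much data the classifier declined to label.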
The final experiment tests the overall framework, with input to the CNN from pseudo-finely labelled data with median filtering and rejection applied. Table 1 shows that using the framework in conjunction with any of the inner classifiers' outputs outperforms a regular CNN trained on the coarse data. Furthermore, training the CNN on the output of the Gaussian KDE significantly improves detection of events over the best baseline system, the CNN(GNB). We also show that using the strongest inner classifier (KDE) alone results in vastly lower precision and recall scores compared to the bootstrapping approach advocated here, which gains a further improvement in F-score by incorporating the CNN into the pipeline with the KDE.

Table 1: Classification performance, quoted ± one standard deviation, over repeated iterations at a fixed SNR.

Classifier   | F-score       | Precision     | Recall
CNN(KDE)     | 0.729 ± 0.034 | 0.719 ± 0.029 | 0.744 ± 0.031
CNN(MLP)     |               |               |
CNN(RF)      |               |               |
CNN(SVM)     |               |               |
CNN(GNB)     |               |               |
CNN(LDA)     |               |               |
CNN(Coarse)  |               |               |
KDE          |               |               |

5 Conclusions & Further Work
5.1 Conclusions
This paper proposes a novel framework that utilises a Gaussian KDE to super-resolve weakly-labelled data, which is then fed into a CNN to predict over unlabelled data. Our framework is evaluated on synthetic data and achieves a substantial improvement in F-score over the best baseline system. We thus highlight the value that label super-resolution provides in domains with only small quantities of finely-labelled data, a problem that is only sparsely addressed in the literature to date.
5.2 Further Work
To leverage the probabilistic labels output by the inner classifier, a suitable candidate for the outer classifier is a loss-calibrated Bayesian neural network (LCBNN), which combines the benefits of deep learning with principled Bayesian probability theory (Cobb et al., 2018). Due to computational limitations, optimisation of the hyperparameters was infeasible; future work plans to use Bayesian optimisation (Snoek et al., 2012) for this tuning.
Finally, following the promising results of this paper, the next step is application to real datasets.
References
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186.

Cartwright, M., Dove, G., Méndez, A.E.M. and Bello, J.P. (2019). Crowdsourcing multi-label audio annotation tasks with citizen scientists. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems.

Cobb, A.D., Roberts, S.J. and Gal, Y. (2018). Loss-calibrated approximate inference in Bayesian neural networks. arXiv preprint arXiv:1805.03901.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

Hanczar, B. and Dougherty, E.R. (2008). Classification with reject option in gene expression data. Bioinformatics, 24(17):1889–1895.

Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J. and Takeda, K. (2017). BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic sound event detection. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 766–770.

Houlsby, N., Huszár, F., Ghahramani, Z. and Lengyel, M. (2011). Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.

Ivanov, A. and Riccardi, G. (2012). Kolmogorov-Smirnov test for feature selection in emotion recognition from speech. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5125–5128.

Kiskin, I., Zilli, D., Li, Y., Sinka, M., Willis, K. and Roberts, S. (2018). Bioacoustic detection with wavelet-conditioned convolutional neural networks. Neural Computing and Applications, pp. 1–13.

Kong, Q., Xu, Y., Sobieraj, I., Wang, W. and Plumbley, M.D. (2019). Sound event detection and time–frequency segmentation from weakly labelled data. IEEE/ACM Transactions on Audio, Speech and Language Processing, 27(4):777–787.

Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

Li, Y., Zilli, D., Chan, H., Kiskin, I., Sinka, M., Roberts, S. and Willis, K. (2017). Mosquito detection with low-cost smartphones: data acquisition for malaria research. arXiv preprint arXiv:1711.06346.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755.

Naghshvar, M., Javidi, T. and Chaudhuri, K. (2012). Noisy Bayesian active learning. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1626–1633.

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076.

Rolnick, D., Veit, A., Belongie, S. and Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.

Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.

Snoek, J., Larochelle, H. and Adams, R.P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Zooniverse (2019). HumBug Zooniverse project page. https://www.zooniverse.org/projects/yli/humbug. Accessed: 2019-04-29.