Unsupervised Discriminative Learning of Sounds for Audio Event Classification

05/19/2021
by   Sascha Hornauer, et al.

Recent progress in network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet. While this process allows knowledge transfer across different domains, training a model on large-scale visual datasets is time consuming. On several audio event classification benchmarks, we show a fast and effective alternative that pre-trains the model unsupervised, only on audio data and yet delivers on-par performance with ImageNet pre-training. Furthermore, we show that our discriminative audio learning can be used to transfer knowledge across audio datasets and optionally include ImageNet pre-training.

1 Introduction

Deep learning for audio event detection and classification benefits from large datasets. Despite virtually unlimited access to audio data (e.g. YouTube, Freesound, etc.), labeling audio events is labor intensive and noisy due to ambiguous start and end times and the short duration of some audio events.

On the other hand, because the most commonly used audio features (i.e. spectrograms) resemble images, it is possible to benefit from advances in the image and video domain. Recent work shows improved performance when pre-training models on pretext tasks such as image classification or video-based prediction [cramer2019look, yang2020telling].

However, image/video pre-training is very time consuming due to the size of visual data. Moreover, for every architectural change this expensive pre-training needs to be repeated. Finally, it limits network design, since the feature extractor has to be able to process image data, which is not necessarily suitable for an audio task.

Furthermore, for audio applications on embedded devices, such as voice assistants, it is desirable to improve model performance on the edge over time through fine-tuning on newly recorded data. The need for on-device computation may stem from privacy concerns, i.e. avoiding sending users' audio data over the Internet, or from missing network connectivity. As a result, task performance needs to improve quickly, with few epochs and little data, on the device. Large network models, as used for image data, may be too computationally expensive for many devices.

Figure 1: Embedding of spectrograms into the feature space with ESResNet. Stereo channels are stacked for illustration. First the network learns to embed similar spectrograms close together on a hypersphere. Second, sound event classification is trained, starting with the pre-trained network.

We present a pre-training method that accelerates network fine-tuning on the task of sound classification while being itself fast, efficient and versatile. Compared with the state-of-the-art approach ESResNet [guzhov2020esresnet], we show how our method achieves competitive results and significantly outperforms training from scratch. On one benchmark we even outperform state-of-the-art pre-training in early epochs.

We are faster at pre-training because we need only three audio datasets, which combined are a fraction of the size of ImageNet. We focus on achieving fast results within very few epochs for edge computation and do not aim to surpass state-of-the-art performance on sound classification after extensive training. For a fair comparison, we apply our method to the ESResNet codebase.

Figure 2: Discriminative Learning of Sounds (DLS) for Audio Event Classification. We compare the same network, pre-trained either on our proxy-task, ImageNet image classification or not at all. ImageNet pre-training takes several days on common GPUs. DLS can train unsupervised, only on sound data and within a few hours. When fine-tuning on Audio Event Detection, DLS stays on par with ImageNet pre-training over many epochs. It constitutes an efficient alternative, especially useful for edge computing or to accelerate the design phase of novel network architectures.

With unsupervised training on a pretext task, using only audio data, we also avoid the need for labels. By using Non-Parametric Instance-level Discrimination (NPID) [wu2018unsupervised] to train ESResNet on audio datasets we learn features beneficial for downstream audio classification tasks, illustrated in fig. 1. This allows us to integrate all data seamlessly across datasets and after deployment to train on novel unlabeled audio data. We call this approach Unsupervised Discriminative Learning of Sounds (DLS) and give an overview in fig. 2.

2 Related Work

Audio Event Detection. Audio event detection was largely improved in the last decade by leveraging newly available datasets [7100934]. We use ESC-50, its subset ESC-10, UrbanSound8K and the DCASE 2013 scene classification dataset (SCD) (e.g. evaluated on in [cramer2019look]). Others are AudioSet [gemmeke2017audio], which holds predominantly music and speech examples, and the newer DCASE datasets [politis2020dataset]. These were not evaluated since we focus on a comparison with a specific recent state-of-the-art method that does not use them.

Unsupervised representation learning. Unsupervised approaches in the vision domain are improving in great strides towards their supervised counterparts. MoCo [he2020momentum], SimCLR [chen2020simple] and NPID [wu2018improving] have been shown to produce valuable features for downstream tasks. We base our contribution on NPID and extend it to the audio domain.

Audio representation learning. Compared to traditional hand-crafted features, unsupervised training can lead to more robust and compact audio representations. [lee2009unsupervised] applied deep belief networks to learn audio representations for speech and music. Generative methods have been explored in [meyer2017unsupervised, chung2016audio, xu2017unsupervised] using variants of autoencoders. Audio representation learning has also been studied for speech [hannun2014deep] and music [thomee2016yfcc100m].

Knowledge transfer from vision. Transfer learning from visual tasks and exploiting audio-visual correspondences have been explored in the past. [harwath2016unsupervised] learned associations between free-form audio, i.e. spoken sentences, and related images. By simply predicting whether parts of an audio and a video clip correspond, [arandjelovic2017look] was able to learn good audio and visual representations. This was later extended to localize sounding objects in a scene [arandjelovic1712objects], and the approach was adapted for audio classification [cramer2019look].

Concurrent work, which we use as the basis for our evaluation, improves downstream task performance by directly fine-tuning ImageNet pre-trained visual models. By mapping spectrograms to the format of a color image and using a pre-trained ResNet-50, it achieves state-of-the-art results on audio event detection [guzhov2020esresnet]. However, such methods rely on a large network to leverage visual input for training. We present a method which can be used to pre-train networks optimized for the audio domain and for low-resource settings.

3 Discriminative Learning of Sounds (DLS)

Here we show how to pre-train ESResNet using DLS on four datasets and fine-tune on each dataset. We use the best-performing ESResNet variant with attention for all experiments and pre-training steps.

Audio Datasets. Most sounds in the datasets are audio events recorded in natural environments, such as glass breaking or dog barking. DCASE2013 also includes longer recordings, such as riding in a bus. The datasets differ in clip length and number of classes, as summarized in table 1. Some files contain events in only a fraction of their length, such as a single dog bark, while others fill the entire standardized length, such as kids playing (see fig. 3).

          US8K      ESC-50    ESC-10    DCASE 2013
Events    8732      2000      400       200
Classes   10        50        10        10
Length    4 s       5 s       5 s       30 s
Fold      1 of 10   1 of 5    1 of 5    1 of 2

Table 1: Dataset setup. Classification is hardest on ESC-50 because it has the least data per class. We tune hyperparameters on one fold and use all data for unsupervised pre-training.

Spectrogram Network Input. Power spectrograms, created with the Short-Time Fourier Transform (STFT), are the input for all stages of training, following the method from [guzhov2020esresnet]. We follow this method exactly to compare on equal grounds. Magnitude and phase are squared separately and the results are added. Spectrograms are divided into three equal-sized parts to separate the higher, middle and lower frequencies. Finally, the three parts are concatenated along a new channel dimension to create color images from the spectrograms to be processed by the network.
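
To make this preprocessing concrete, the sketch below shows one way to implement it in PyTorch. The STFT parameters, the log scaling and the use of the real and imaginary STFT components for the power spectrogram are illustrative assumptions, not the exact settings of the ESResNet codebase.

```python
import torch

def spectrogram_to_3channel(waveform, n_fft=2048, hop_length=512):
    """Sketch: power spectrogram of a mono waveform, split into three
    frequency bands that are stacked as 'color' channels.

    n_fft, hop_length and the log scaling are illustrative assumptions,
    not the exact settings of the ESResNet codebase.
    """
    window = torch.hann_window(n_fft)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    # Power spectrogram from the complex STFT (one common formulation).
    power = stft.real ** 2 + stft.imag ** 2          # (freq, time)
    power = torch.log(power + 1e-10)                 # compress dynamic range

    # Split the frequency axis into three equal parts: low, mid, high ...
    band = power.shape[0] // 3
    low, mid, high = power[:band], power[band:2 * band], power[2 * band:3 * band]
    # ... and stack them along a new channel dimension, image-like.
    return torch.stack([low, mid, high], dim=0)      # (3, freq // 3, time)
```

The resulting three-channel tensor can then be fed to an image backbone such as ResNet-50.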

Figure 3: Spectrograms after pre-processing. Frequency ranges are split into three and concatenated as color channels. For illustration stereo channels are separated into rows. The leftmost two columns show children playing, followed by car horns. Large appearance changes exist within some classes.

4 Network Pre-Training.

The pretext task of instance discrimination generates features useful for the downstream task of audio classification. The network's task is to assign a unique id to every spectrogram. Applied to images, this creates a feature space where visually similar images are grouped. However, visual similarity is not an explicitly defined criterion in the loss function, but a by-product of the pressure on the network to structure and differentiate between all images of the dataset.

We follow [wu2018unsupervised] and use spectrograms as input. We train ESResNet with weights $\theta$ to map a spectrogram $x_i$ with the assigned id $i$ in the training set to a latent output vector $v_i = f_\theta(x_i)$. The vector is L2 normalized, so $\|v_i\| = 1$, and updates its entry in a memory bank $V = \{v_1, \dots, v_n\}$ in each training iteration, which keeps track of the features of all spectrograms. This allows efficient calculation of a non-parametric softmax to get the assigned instance id probabilities. The probability of a spectrogram $x$ mapped to vector $v$ having training set id $i$ is defined by:

$$P(i \mid v) = \frac{\exp(v_i^\top v / \tau)}{\sum_{j=1}^{n} \exp(v_j^\top v / \tau)} \qquad (1)$$

where $\tau$ is the softmax temperature parameter. Stochastic gradient descent is used to minimize the negative log-likelihood $J(\theta) = -\sum_{i=1}^{n} \log P(i \mid f_\theta(x_i))$ during training.
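
As an illustration of eq. (1), the following is a minimal PyTorch sketch of the instance-discrimination loss with a memory bank. For readability it computes the softmax over all $n$ instances; the actual training approximates this sum with noise-contrastive estimation using NCE-K negative samples, and the momentum value and all names here are our own assumptions rather than parts of the NPID or ESResNet codebases.

```python
import torch
import torch.nn.functional as F

class InstanceDiscriminationLoss(torch.nn.Module):
    """Sketch of the non-parametric softmax of eq. (1) with a memory bank.

    The full softmax over all instances and the momentum value are
    simplifying assumptions; NPID uses an NCE approximation in practice.
    """

    def __init__(self, n_instances, dim=128, temperature=0.4, momentum=0.5):
        super().__init__()
        self.t = temperature
        self.m = momentum
        # Memory bank with one L2-normalized entry per training spectrogram.
        self.register_buffer("bank", F.normalize(torch.randn(n_instances, dim), dim=1))

    def forward(self, features, ids):
        # features: (B, dim) network outputs; ids: (B,) instance indices.
        v = F.normalize(features, dim=1)
        logits = v @ self.bank.t() / self.t      # similarity to every bank entry
        loss = F.cross_entropy(logits, ids)      # -log P(i | v), eq. (1)
        with torch.no_grad():                    # momentum update of the bank
            new = self.m * self.bank[ids] + (1.0 - self.m) * v
            self.bank[ids] = F.normalize(new, dim=1)
        return loss
```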

The final feature mapping has the advantage that similar spectrograms are mapped close together in the feature space. This makes it possible to group sounds with similar characteristics, as sketched in fig. 1. While this is what enables our performance on the downstream task, larger variance within a class than between classes is an issue and limits the approach. Examples of very different car horns can be seen in fig. 3.

Evaluation of Unsupervised Learning.

We first trained and evaluated on each dataset separately, according to the official training/evaluation folds (see tab. 1). Every instance of the evaluation set is mapped into the feature space, and we check its label alignment with its weighted K=5 nearest neighbors in the training data feature space. We tuned hyperparameters separately for each dataset but found the optimal configuration was almost the same for all. Therefore, we use the average parameters: embedding dimension=128, NCE-K=64 and NCE-T=0.4, which are the number of negative samples and the softmax temperature of the Noise Contrastive Estimation in [wu2018unsupervised]. We trained for 200 epochs with a batch size of 64.
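
The weighted nearest-neighbor check can be sketched as follows; the exact neighbor weighting of our evaluation is an assumption here (exponential of cosine similarity over a temperature), and features are assumed to be L2-normalized.

```python
import torch

def weighted_knn_accuracy(train_feats, train_labels, eval_feats, eval_labels,
                          k=5, temperature=0.4):
    """Sketch of a weighted K-nearest-neighbor evaluation of the embedding."""
    sims = eval_feats @ train_feats.t()              # cosine similarities
    top_sims, top_idx = sims.topk(k, dim=1)          # K nearest training items
    weights = torch.exp(top_sims / temperature)      # similarity-based weights
    neighbor_labels = train_labels[top_idx]          # (N_eval, K)

    n_classes = int(train_labels.max().item()) + 1
    votes = torch.zeros(eval_feats.size(0), n_classes)
    votes.scatter_add_(1, neighbor_labels, weights)  # weighted class votes
    preds = votes.argmax(dim=1)
    return (preds == eval_labels).float().mean().item()
```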

In a second phase, we combine all available audio datasets and folds into one large training set, without an evaluation or test split. We then train the embedding with the fixed hyperparameters found in phase one and no longer evaluate performance in this step. However, we observe that the training loss has sufficiently converged by epoch 200. The result of this phase is the pre-trained ESResNet.
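
Because labels are not needed in this phase, merging the datasets is straightforward, e.g. with PyTorch's ConcatDataset. In the sketch below the datasets are stand-ins (random tensors with illustrative shapes), not the real loaders.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the pre-processed spectrogram datasets (shapes are illustrative);
# since DLS needs no labels, every fold of every dataset can join the pool.
esc50 = TensorDataset(torch.randn(2000, 3, 85, 256))
us8k = TensorDataset(torch.randn(8732, 3, 85, 256))
dcase2013 = TensorDataset(torch.randn(200, 3, 85, 256))

pretrain_pool = ConcatDataset([esc50, us8k, dcase2013])
pretrain_loader = DataLoader(pretrain_pool, batch_size=64, shuffle=True)
```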

5 Network Fine-Tuning.

The pre-trained ESResNet is fine-tuned and evaluated on each dataset individually. The final classification layer is resized to the number of classes of the dataset we fine-tune on, which means it needs to be retrained. We found that fine-tuning all layers yields faster performance gains than freezing them and only fine-tuning the classifier. Results are shown in fig. 4.
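
A minimal sketch of this setup, assuming a ResNet-style model that exposes its classifier head as `.fc` (the actual ESResNet attribute may differ):

```python
import torch.nn as nn

def prepare_for_finetuning(pretrained_model, num_classes):
    """Sketch: replace the classifier head and keep all layers trainable."""
    in_features = pretrained_model.fc.in_features
    # New head sized for the target dataset; it is trained from scratch.
    pretrained_model.fc = nn.Linear(in_features, num_classes)
    # Fine-tuning all layers gave faster gains than training only the head.
    for p in pretrained_model.parameters():
        p.requires_grad = True
    return pretrained_model
```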

Comparison with ImageNet Data. We compare pre-training by image classification on ImageNet with DLS on the four audio datasets, DCASE2013 (SCD), ESC-50/ESC-10 and UrbanSound8K, which together provide 9.3 GB of sound data. ImageNet consists of over 200 GB of data, with 14,197,122 images. Training a classification task on ImageNet with ResNet-50, which is also the backbone of the ESResNet architecture, can take almost two weeks on a single GPU [you2018imagenet]. Even if we assume ImageNet training happens on a setup similar to ours (4x GeForce GTX 1080 Ti), the training time on ImageNet would still be at least 1-2 days. Training DLS with ESResNet for 200 epochs takes 4 hours.

Figure 4: Evaluation accuracy over epochs on a) ESC10, b) DCASE2013, c) ESC50 and d) UrbanSounds8k. Shaded areas show the standard deviation of the averaged n-fold cross-validation results. The average accuracy with DLS pre-training is on par with ImageNet pre-training and significantly better than training from scratch. Even though we use only sound data, a fraction of the size of ImageNet, we reach a similar performance gain. In d) DLS pre-training shows on-par results with ImageNet pre-training. In c) DLS pre-training outperforms training from scratch over the first 18 epochs. In b) DLS pre-training even outperforms ImageNet pre-training for the first 11 epochs.

6 Experimental Results

We evaluated the performance of ESResNet with attention with DLS pre-training, ImageNet pre-training and training from scratch. The architectures are identical in all three cases [guzhov2020esresnet].

DCASE2013 and ESC10. Training on small datasets benefits greatly from pre-training, as can be seen from the performance gain on ESC10 and DCASE2013 (fig. 4). On ESC10, DLS pre-training slightly outperforms ImageNet pre-training in the first three epochs, rising to 80% accuracy while training from scratch remains under 40%. On DCASE2013, DLS pre-training is best for 11 epochs, rising towards 50% accuracy and continuing on par with ImageNet pre-training.

ESC50 and UrbanSounds8k. On the larger datasets, the accuracy gain in the first epochs is similar for DLS and ImageNet pre-training. However, DLS was trained in significantly less time and on less data. On ESC50, it significantly outperforms training from scratch and rises to over 60% accuracy within the first 15 epochs. UrbanSounds8k is the largest dataset, and both pre-training methods show a similar performance gain per epoch. Each epoch contains more sound files than in any other dataset, so the methods are already close when they are evaluated after the first epoch. DLS still outperforms training from scratch for the first 5-6 epochs.

Densenet as backbone. We also compared using DenseNet as the backbone and omit the graphs for brevity. Results show a slower performance gain over the first 15 epochs before leveling off at similar accuracy. All trends are similar, though on DCASE2013 all performance gains are very close before epoch 30.

7 Discussion

We presented an unsupervised pre-training method which enables fast fine-tuning when training networks for audio event classification. While using only a comparatively small amount of sound data, we show that we can match the early performance gains of the same network pre-trained on ImageNet. We validate our results on three commonly used datasets for audio classification. Our approach enables network training on devices with limited computing resources, e.g. for continuous improvement by incorporating newly collected data. Because pre-training with DLS is unsupervised, novel unlabeled data can be integrated with existing data seamlessly.

References