Self-supervised audio representation learning for mobile devices

by Marco Tagliasacchi et al.

We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular technique used to learn word embeddings, and aim at reconstructing a temporal spectrogram slice from past and future slices or, alternatively, at reconstructing the context of surrounding slices from the current slice. We focus our evaluation on small encoder architectures, which can be potentially run on mobile devices during both inference (re-using a common learned representation across multiple downstream tasks) and training (capturing the true data distribution without compromising users' privacy when combined with federated learning). We evaluate the quality of the embeddings produced by the self-supervised learning models, and show that they can be re-used for a variety of downstream tasks, and for some tasks even approach the performance of fully supervised models of similar size.





1 Introduction

Thanks to advances in supervised audio learning, it is now possible to train models that successfully perform different tasks, including audio annotation Hershey et al. (2017), music recognition Arcas et al. (2017), automatic speech recognition Chan et al. (2016) and speaker identification Matejka et al. (2016). Such supervised models can also be deployed on mobile devices by applying network pruning and quantization techniques Howard et al. (2017); Sandler et al. (2018); Frankle and Carbin (2019).

Despite the indisputable success, this approach suffers from three main shortcomings. First, it requires collecting large annotated datasets specific to each task to be solved. Second, separate models are typically trained for each task, making it difficult to reuse computational resources when multiple such models are deployed on a mobile device. Third, inference is performed on device, but model training is still done on the server side using datasets representing surrogate distributions, which might potentially differ from the true data distribution.

Unsupervised learning attempts to overcome these limitations, by making it possible to learn from widely available unlabelled datasets and by learning general purpose representations that can be reused for different downstream tasks. In addition, unsupervised learning lends itself to on-device deployment, where no explicit labeling of the data is available. Therefore, by leveraging recent advances in federated learning Bonawitz et al. (2019), it might be possible to distribute the training process across numerous devices, thus training models directly on the true data distribution, while fully preserving users’ privacy.

In the area of unsupervised learning, self-supervised learning has emerged as an attractive approach. In a nutshell, an auxiliary task is formulated based on the available unlabelled data and a fully supervised model is trained to solve such a task. The key idea is that, by solving the auxiliary task, the model is also learning some general purpose representations in a lower dimensional embedding space. Therefore, the embedding encoder, i.e., the portion of the model architecture mapping the input data to the embedding space, can be reused as a feature extractor for different downstream tasks.

Figure 1: Overview of the proposed self-supervised learning tasks.

One of the earliest successes of self-supervised learning was obtained in the context of language models, where Word2Vec is used to map one-hot-encoded words to word embeddings Mikolov et al. (2013). Word2Vec can be formulated in two variants: i) continuous bag-of-words (CBoW), or ii) skip-gram. In the former, the model predicts the current word based on the context of surrounding words. In the latter, the model predicts surrounding words given the current word. Recently, a similar approach has been proposed to map speech to fixed-dimensional embeddings Chung and Glass (2018); Chung et al. (2018b). The Speech2Vec architecture consists of an RNN encoder-decoder which can handle variable-length inputs and outputs.
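To make the two Word2Vec variants concrete, here is a minimal sketch of how CBoW and skip-gram pair inputs and targets over a token sequence. The helper name `context_pairs` is ours, for illustration only; it is not part of any Word2Vec implementation.

```python
def context_pairs(tokens, window=2, skip_gram=False):
    """Generate (input, target) training pairs from a token sequence.

    CBoW:      input = list of surrounding context tokens, target = current token.
    Skip-gram: input = current token, target = each surrounding token in turn.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if skip_gram:
            pairs.extend((tok, c) for c in context)  # one pair per context token
        else:
            pairs.append((context, tok))             # whole context predicts target
    return pairs
```

For example, `context_pairs(["the", "cat", "sat"], window=1)` yields `(["the", "sat"], "cat")` as its middle CBoW pair, while the skip-gram variant reverses those roles.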

In this paper we explore self-supervised learning of audio representations, using a small model architecture which can be potentially deployed on mobile devices during both training and inference. We posit that contextual temporal information can be exploited in the case of general audio signals without resorting to any form of explicit supervision. We argue that solving properly designed tasks that involve the temporal context requires extracting some sort of high level semantic information from the underlying raw data, thus leading to reusable embeddings. In this respect, this paper makes the following main contributions:

  • We propose Audio2Vec, a self-supervised learning task that is inspired by Word2Vec, but applied to audio spectrograms. In the CBoW formulation (Figure 1a) the auxiliary task consists of reconstructing a temporal slice of pre-determined duration from a number of past and future slices. In the skip-gram formulation (Figure 1b) the roles of the target and surrounding slices are reversed.

  • We propose TemporalGap, a self-supervised learning task that consists of estimating the distance in time between any two pairs of audio segments extracted at random from a longer audio clip (Figure 1c).

  • We quantitatively evaluate the quality of the embeddings produced by the feature encoders obtained by solving the aforementioned self-supervised tasks. To this end we consider a wide variety of downstream tasks, including speech, music detection, speaker identification and language identification, among others. Our results show that all self-supervised models are able to partially bridge the accuracy gap with fully supervised models.

  • We focus our evaluation on small encoder architectures, which can be suitably deployed on mobile devices Howard et al. (2017); Sandler et al. (2018); Frankle and Carbin (2019). During inference, a common embedding encoder makes it possible to share computational resources across different tasks. During training, it enables capturing the true data distribution when used together with federated learning Bonawitz et al. (2019).

The rest of this paper is organized as follows. Section 2 discusses the most relevant literature related to our work. Section 3 presents the proposed methods, which are evaluated in Section 4. Conclusions and future work are given in Section 5.

2 Related work

The work presented in this paper is related to several different areas that have received attention in the recent literature. In particular, learning representations has been explored for different modalities.

Learning audio representations: Unsupervised feature learning can lead to more compact and descriptive representations than traditional handcrafted features, e.g., MFCCs. For example, Lee et al. (2009) adopt convolutional deep belief networks to learn audio representations, applicable to both speech and music related tasks. More recently, different autoencoder architectures have been explored, e.g., denoising autoencoders Xu et al. (2017), convolutional LSTM autoencoders Meyer et al. (2017) and sequence-to-sequence autoencoders Chung et al. (2016). A self-supervised version of the triplet loss is proposed in Jansen et al. (2018). In the absence of labels, the authors create anchor-positive pairs by adding noise, shifting in time and/or frequency, and sampling temporal neighbors. When tested on AudioSet Gemmeke et al. (2017), the self-supervised embeddings partially bridge the gap between a simple log spectrogram baseline and a fully supervised classifier.

Learning visual representations: Several auxiliary tasks have been explored to learn image representations, e.g., predicting the relative position of a pair of patches extracted from the same image Doersch et al. (2015), re-ordering image patches and solving jigsaw puzzles Noroozi and Favaro (2016), or asking the model to discriminate between a patch and a transformed version of it Dosovitskiy et al. (2016). In some cases, solving seemingly simple tasks can lead to very powerful representations, for example detecting image rotations Gidaris et al. (2018). In other cases, representations can be learned as a by-product of solving useful tasks, e.g., image colorization Zhang et al. (2016) and image inpainting Pathak et al. (2016). The latter is to some extent similar to our work, since the CBoW version of Audio2Vec can be seen as a form of inpainting in the spectrogram domain. The representations learned by different self-supervised learning tasks can also be combined to obtain a single representation, as presented in Doersch and Zisserman (2017). In the case of video, it is possible to exploit the temporal dimension to learn visual representations by asking a model to determine whether frames are in the correct temporal order Misra et al. (2016); Fernando et al. (2017), to infer motion by observing a static image Pathak et al. (2017), or to detect whether a video is playing forwards or backwards Wei et al. (2018).

Learning multimodal representations: Several papers have recently investigated learning audio representations exploiting the correlation with other modalities, e.g., text Chung et al. (2018a), images Owens et al. (2018) and videos Owens and Efros (2018); Gao et al. (2018); Arandjelović and Zisserman (2018); Korbar et al. (2018); Cramer et al. (2019).

Contextual predictions: After the seminal work on Word2Vec, contextual prediction has been successfully explored as a means of learning representations in other modalities, e.g., image inpainting Pathak et al. (2016), symbolic music prediction Bretan et al. (2017), and speech Chung and Glass (2018). Recently, van den Oord et al. (2019) proposed contrastive predictive coding, i.e., predicting future samples directly in the embedding space, reporting promising results also on audio-based tasks. Our work is most closely related to this strand of research, in that we evaluate contextual prediction for general audio-based tasks, but we put particular emphasis on learning models that can be deployed on device.

3 Methods

Let x denote an audio clip of n samples in the time domain and X the corresponding real-valued log-mel spectrogram, which consists of T temporal frames and F frequency bins. Let X_i denote a slice of the spectrogram X, starting at frame i and spanning N temporal frames, and z = Enc(X_i) a d-dimensional embedding computed by processing the input spectrogram with an encoder Enc(·), whose architecture is detailed in Section 4. Using this notation, in the following we describe the proposed self-supervised learning models.

Audio2Vec (CBoW): The first self-supervised learning task that we propose, Audio2Vec, comes in two variants. In the CBoW variant, we first select a target slice at random, together with a set of surrounding slices used for prediction. Each of the predictor slices is processed by the same encoder, which maps its input into a fixed-dimensional embedding. These embeddings are then concatenated and fed into a decoder, mimicking the same architecture as the encoder, which computes a reconstruction of the target slice. More specifically, let X_0 be a slice selected at random from X. Then, a set of past slices (X_{-P}, ..., X_{-1}) and future slices (X_{1}, ..., X_{P}) are extracted from the same audio clip. The temporal location of slice X_p is equal to i_p = i_0 + p(N + G), i.e., we consider non-overlapping slices of size N, with an extra gap of G temporal frames between any two consecutive slices. The gap is introduced to prevent the self-supervised model from exploiting the leakage between adjacent STFT temporal frames as a shortcut to solve the task. Each slice is processed by the same encoder to obtain z_p = Enc(X_p). Then, a vector z = [z_{-P}, ..., z_{-1}, z_{1}, ..., z_{P}] is obtained by concatenating the embeddings of the predictor slices and fed into a convolutional decoder to obtain a reconstruction X̂_0 = Dec(z). Note that the architecture of the decoder is obtained by reversing the order of the layers in the encoder and replacing max-pooling with nearest-neighbor upsampling. The overall encoder-decoder architecture is trained end-to-end by minimizing the mean-square error loss function L = ||X_0 - X̂_0||^2.

Audio2Vec (skip-gram): The skip-gram variant of Audio2Vec uses a similar architecture. In this case we compute the embedding of the middle slice, z_0 = Enc(X_0), and then let the decoder reconstruct the surrounding slices, i.e., (X_{-P}, ..., X_{-1}, X_{1}, ..., X_{P}). The decoder is identical to the one used by the CBoW variant, except for one important difference: the last convolutional layer has 2P output channels, one for each of the slices to be reconstructed. The loss function minimizes the average mean-square error computed across the 2P reconstructed slices.
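As an illustration, the following sketch (a hypothetical helper with toy values for N, G and P, not the paper's actual pipeline) samples the target and predictor slices shared by both Audio2Vec variants; CBoW encodes the predictors and reconstructs the target, while skip-gram reverses those roles.

```python
import numpy as np

def sample_slices(spectrogram, num_frames=6, gap=2, num_pred=2, rng=None):
    """Sample a target slice plus `num_pred` past and future predictor slices.

    Slices are non-overlapping, `num_frames` frames each, separated by `gap`
    frames so the model cannot exploit STFT leakage between adjacent frames.
    Returns (target, predictors); swapping the roles of the two outputs turns
    the CBoW task into the skip-gram task.
    """
    rng = rng or np.random.default_rng()
    T = spectrogram.shape[0]
    stride = num_frames + gap
    # Total extent covered by 2*num_pred + 1 slices plus the gaps between them.
    span = (2 * num_pred + 1) * num_frames + 2 * num_pred * gap
    start = int(rng.integers(0, T - span + 1))       # onset of leftmost slice
    starts = [start + k * stride for k in range(2 * num_pred + 1)]
    slices = [spectrogram[s:s + num_frames] for s in starts]
    target = slices[num_pred]                        # middle slice
    predictors = slices[:num_pred] + slices[num_pred + 1:]
    return target, predictors
```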

Temporal gap: For the TemporalGap task, we ask the model to estimate the absolute value of the distance in time between two slices sampled at random from the same audio clip. More specifically, we sample the ground-truth temporal gap Δ from a uniform distribution, i.e., Δ ~ U(0, T - N), where N and T are the lengths (in time frames) of the slices and the original sample, respectively, and define the normalized temporal gap as δ = Δ / (T - N) ∈ [0, 1]. Then, we extract two slices X_a and X_b such that |i_a - i_b| = Δ. Note that we do not impose a temporal order between the two slices. We concatenate the two embedding representations in a single 2d-dimensional vector z = [z_a, z_b] and feed this vector into a fully connected feed-forward network with a single hidden layer of size 64 that produces the scalar output δ̂. We train the model end-to-end so as to minimize the cross-entropy loss L = CE(δ, δ̂) between the ground-truth and the predicted gap. In our experiments, we found this loss preferable to the mean-square error ||δ - δ̂||^2, presumably because it gives more weight to errors when the ground-truth gap is small.
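The TemporalGap sampling step can be sketched as follows (a toy helper of our own, not the paper's code): draw a gap uniformly, extract two slices at that distance, and use the normalized gap as the regression target.

```python
import numpy as np

def sample_temporal_gap(spectrogram, num_frames=6, rng=None):
    """Sample two slices and the normalized temporal gap between their onsets.

    The ground-truth gap (in frames) is drawn uniformly, then two slices at
    exactly that distance are extracted; no temporal order is imposed.
    """
    rng = rng or np.random.default_rng()
    T = spectrogram.shape[0]
    delta = int(rng.integers(0, T - num_frames + 1))      # gap in frames
    i = int(rng.integers(0, T - num_frames - delta + 1))  # onset of first slice
    j = i + delta                                         # onset of second slice
    norm_gap = delta / (T - num_frames)                   # training target in [0, 1]
    return spectrogram[i:i + num_frames], spectrogram[j:j + num_frames], norm_gap
```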

4 Experiments

We compare the quality of the embeddings produced by different self-supervised learning methods according to two different evaluation measures: i) the accuracy of a fully supervised logistic regression model trained using the embeddings and the corresponding labels as inputs; ii) the accuracy of a non-parametric nearest neighbors model that works directly in the embedding space. In the following we describe the datasets used in our evaluation campaign and the baselines to which we compare our results.

Datasets: We use AudioSet Gemmeke et al. (2017) to train all the self-supervised learning tasks, unless stated otherwise. AudioSet contains excerpts of 10 seconds from the soundtracks of YouTube videos. Although the dataset is annotated with labels of more than 500 classes, we discard them in our study. Note that each AudioSet sample can be potentially reused multiple times during training, each time extracting a different target slice (together with surrounding slices) uniformly at random.

We use six publicly available datasets to evaluate a variety of downstream tasks, covering both speech and non-speech related tasks. We use the Speech Commands dataset Warden (2018) to evaluate keyword spotting on 35 distinct keywords. LibriSpeech Panayotov et al. (2015) contains audio books read by 251 different speakers. We use the 100 hours training set to evaluate a speaker identification task. The Spoken Language Identification dataset Oponowicz (2018) contains samples that belong to three different languages: English, Spanish and German, while the MUSAN dataset Snyder et al. (2015) distinguishes across three classes, namely music, speech and noise. Finally, we use two datasets released in the context of the recent DCASE2018 Challenge, Bird Audio Detection Stowell et al. (2018) and TUT Urban Acoustic Scenes 2018 Mesaros et al. (2018), which contains labeled audio samples from 10 different urban environments. To the best of the authors’ knowledge, this is the first paper that comprehensively evaluates the quality of self-supervised learning for audio on such a wide variety of tasks.

Since each dataset is characterized by samples of different durations, during training we preprocess the downstream datasets by extracting equal-length slices uniformly at random from the original sample and assigning the corresponding label to all of the extracted slices. We consider input samples with a duration of 975 ms, so as to match the size of the temporal slices used when training the self-supervised tasks. During evaluation, we apply a sliding window of size N (with a fixed hop size), so as to obtain one or more predictions for each input sample, depending on its length. In order to aggregate such predictions and produce a single output for each sample, we apply a simple naive-Bayes classifier.
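The paper does not spell out the naive-Bayes aggregation; one common reading, sketched below under that assumption, treats the windows as conditionally independent given the class and sums per-window log-probabilities.

```python
import numpy as np

def aggregate_windows(window_probs):
    """Fuse per-window class probabilities into one clip-level prediction.

    Under a naive-Bayes (conditional independence) assumption, the clip-level
    log-score of each class is the sum of per-window log-probabilities.
    `window_probs` has shape (num_windows, num_classes).
    """
    # Small epsilon keeps log() finite when a window assigns zero probability.
    log_scores = np.log(np.asarray(window_probs) + 1e-12).sum(axis=0)
    return int(np.argmax(log_scores))
```

For instance, three windows with class-1 probabilities 0.4, 0.3 and 0.8 still vote class 1 overall, because the product 0.4 * 0.3 * 0.8 exceeds 0.6 * 0.7 * 0.2.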

Encoder architecture: In our work we consistently use the same audio frontend, which processes input sequences sampled at 16 kHz, with a window size of 25 ms and a hop size of 10 ms to compute the short-time Fourier transform (STFT), and then computes F mel-spaced frequency bins in the range 60–7800 Hz. For the encoder Enc, we use a convolutional neural network, whose architecture is described in Table 1. Due to its limited size (approximately 125k parameters), it can be potentially deployed on a mobile device and run in an energy-efficient way by exploiting streaming convolutions. Each convolutional layer consists of a series of two convolutions, one along the time axis and one along the frequency axis, in parallel to a pointwise 1x1 convolution acting as a residual connection. All activation functions are ReLUs and batch normalization is used in all convolutional layers. Each layer is followed by max-pooling, which reduces the time-frequency dimensions by a factor of two at each layer. Finally, a global max-pooling layer produces a fixed-dimensional vector, which is further processed by a fully-connected layer to get the d-dimensional embeddings. By default, we set N = 96 (corresponding to 975 ms) and d = 128, thus reducing the dimensionality of the raw audio samples by a factor of about 122.
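The slice length in STFT frames and the overall compression factor follow directly from the frontend parameters; a quick arithmetic check, assuming the 128-dimensional embeddings used in the evaluation:

```python
def num_stft_frames(duration_ms, window_ms=25, hop_ms=10):
    """Number of frames produced by a sliding STFT analysis window."""
    return (duration_ms - window_ms) // hop_ms + 1

slice_frames = num_stft_frames(975)   # 96 temporal frames for a 975 ms slice
raw_samples = int(0.975 * 16_000)     # 15600 raw audio samples at 16 kHz
reduction = raw_samples / 128         # ~122x compression into the embedding
```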

Layer           Num. params   FLOPs
Input layer     -             -
Conv. layer 1   0.2k          2.9M
Conv. layer 2   1k            4M
Conv. layer 3   5k            4M
Conv. layer 4   20k           3.9M
Conv. layer 5   82k           3.9M
FC layer        16k           33k
Total           125k          18.7M

Table 1: Encoder architecture: number of parameters and FLOPs per layer.

Models: We compare different self-supervised models trained on AudioSet: Audio2Vec, in its two variants, CBoW and skip-gram, and TemporalGap. For Audio2Vec we use P slices on each side of the target, and a gap of G temporal frames between consecutive slices. We also include the TripletLoss method proposed in Jansen et al. (2018) in our evaluation. More specifically, positive/negative pairs are obtained by extracting a slice from, respectively, the same or a different original sample. In addition, we also train an AutoEncoder sharing the same encoder and decoder architectures as Audio2Vec. We tested different variants, including denoising and variational autoencoders, but we did not observe significant differences with respect to the default autoencoder. When evaluating the accuracy on downstream tasks, we extract the portion of the model corresponding to the encoder and use it to map input log-mel spectrograms to d-dimensional embeddings.

We compare our results to two different fully supervised baselines based on a simple logistic regression model: i) the Spectrogram model receives directly the (flattened) spectrogram features as input; ii) the Untrained model computes the embeddings with the very same encoder architecture described in Section 3, but using randomly initialized weights.

Since each task is characterized by a different number of target classes and a different intrinsic difficulty, we compare the accuracy to the level attained by task-specific fully supervised models (Supervised), each using the same encoder, but trained end-to-end on each of the labeled downstream datasets. In addition, we also trained a MultiHead model, in which a single shared encoder is composed with a different fully connected layer for each downstream task. This provides an upper bound on the best performance we could expect, as it uses the same architecture as when using the self-supervised embeddings, but leverages the in-domain labeled data for end-to-end training.

All models are trained with stochastic gradient descent and the Adam optimizer with default hyperparameters; one learning rate was used for Audio2Vec, AutoEncoder, and all the supervised models, and a different one for TemporalGap and TripletLoss. We use a mini-batch size of 256 and stop training after approximately 2 days (on five Tesla V100 GPUs), thus iterating over between 1.3 and 1.8 million mini-batches. In most cases, the accuracy on downstream tasks saturated after 500k mini-batches.

Figure 2: (a): Training loss for the Audio2Vec (skip-gram) model and corresponding accuracy on the MUSAN downstream dataset. (b) Accuracy on MUSAN datasets for all models under evaluation.
Model         SPC          LSP          TUT          MUS          BSD          LID
Spectrogram   0.16 ± .01   0.28 ± .04   0.97 ± .01   0.74 ± .01   0.36 ± .03   0.65 ± .02
              (+0%)        (+0%)        (+0%)        (+0%)        (+0%)        (+0%)
Untrained     0.16 ± .01   0.48 ± .04   0.54 ± .02   0.93 ± .00   0.57 ± .03   0.70 ± .02
              (-1%)        (+33%)       (-1338%)     (+77%)       (+35%)       (+31%)
AutoEncoder   0.28 ± .01   0.64 ± .04   0.99 ± .00   0.94 ± .00   0.59 ± .03   0.69 ± .02
              (+21%)       (+56%)       (+55%)       (+81%)       (+38%)       (+27%)
A2V(CBoW)     0.30 ± .01   0.57 ± .04   0.99 ± .00   0.98 ± .00   0.66 ± .03   0.71 ± .01
              (+23%)       (+47%)       (+82%)       (+97%)       (+50%)       (+40%)
A2V(SG)       0.28 ± .01   0.55 ± .04   1.00 ± .00   0.98 ± .00   0.67 ± .03   0.69 ± .02
              (+21%)       (+44%)       (+85%)       (+98%)       (+52%)       (+28%)
TemporalGap   0.23 ± .01   0.45 ± .04   0.97 ± .01   0.97 ± .00   0.63 ± .03   0.71 ± .01
              (+12%)       (+27%)       (+11%)       (+92%)       (+44%)       (+44%)
TripletLoss   0.18 ± .01   0.62 ± .04   1.00 ± .00   0.97 ± .00   0.73 ± .03   0.73 ± .01
              (+3%)        (+55%)       (+96%)       (+95%)       (+61%)       (+55%)
MultiHead     0.72 ± .01   0.82 ± .03   1.00 ± .00   0.98 ± .00   0.94 ± .02   0.78 ± .01
              (+95%)       (+88%)       (+99%)       (+95%)       (+96%)       (+90%)
Supervised    0.75 ± .01   0.90 ± .03   1.00 ± .00   0.99 ± .00   0.97 ± .01   0.79 ± .01
              (+100%)      (+100%)      (+100%)      (+100%)      (+100%)      (+100%)
Table 2: Accuracy on downstream tasks (and fraction of accuracy recovered wrt. the baselines). Downstream tasks: SPC: Speech Commands, LSP: LibriSpeech, TUT: TUT Urban Acoustic Scenes 2018, MUS: MUSAN, BSD: Bird Audio Detection, LID: Spoken Language Identification. In bold the highest accuracy attained by self-supervised models for each task.

Main results: In our results we report the prediction accuracy on the eval set of each of the six datasets. During training we monitor both the loss of the self-supervised task and the accuracy on each of the downstream tasks. As an illustration, Figure 2(a) shows that the accuracy on the MUSAN downstream task increases as the reconstruction loss of Audio2Vec (skip-gram) decreases, and both tend to saturate after approximately 300k iterations. For the same dataset, Figure 2(b) shows that all self-supervised methods attain a level of accuracy that is in-between the baselines and the fully supervised benchmarks, with Audio2Vec (skip-gram) outperforming the other models on this task. We repeated the evaluation on all downstream tasks and show the results in Table 2. We report the level of accuracy, with 95% confidence intervals capturing the uncertainty due to the finite size of the evaluation datasets. In brackets we also report the accuracy normalized between 0% (Spectrogram) and 100% (Supervised). We observe that the proposed self-supervised learning models are able to recover between 11% and 98% of the accuracy of the Supervised model. Generally, Audio2Vec (skip-gram) and TripletLoss seem to outperform the other self-supervised models. The best results are obtained on MUSAN and LibriSpeech, presumably because these tasks require capturing relatively stationary spectral characteristics of the inputs. Conversely, all self-supervised models achieve relatively poor performance on the Speech Commands dataset. This might be explained by the fact that for this dataset it is particularly important to recognize the non-stationary variation of the spectral features along the temporal dimension, which does not seem to be captured by the embeddings generated by the self-supervised models. Note that the different self-supervised models might be capturing different characteristics of the underlying audio data. Therefore, it might be possible to merge the different representations, as recently proposed in Pascual et al. (2019) for the case of speech embeddings.
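The bracketed percentages in Table 2 can be reproduced from the raw accuracies; a short sketch, using the Untrained entry on MUSAN as an example (the match is up to rounding of the reported accuracies):

```python
def normalized_accuracy(acc, spectrogram_acc, supervised_acc):
    """Fraction of the Spectrogram-to-Supervised accuracy gap recovered."""
    return (acc - spectrogram_acc) / (supervised_acc - spectrogram_acc)

# Untrained on MUSAN: 0.93, vs. baseline 0.74 and Supervised 0.99.
recovered = normalized_accuracy(0.93, 0.74, 0.99)   # ~0.76, i.e. roughly +77%
```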

We repeated a similar evaluation working directly in the embedding space, by training a simple k-nearest-neighbour classifier (k = 10) on each dataset. More specifically, we extracted 975 ms samples at random from the original audio clips (10000 samples for training and 2000 samples for evaluation, for each dataset), and mapped each sample to a 128-dimensional embedding. The classifier computes Euclidean distances directly in the embedding space. For the Spectrogram baseline, we first perform dimensionality reduction by applying a random projection matrix sampled from a Gaussian distribution to map the flattened spectrogram to a 128-dimensional space. Table 3 reports the results, showing that also in this case the proposed self-supervised models recover between 10% and 99% of the accuracy of the Supervised model. This demonstrates that salient representations of the underlying audio data are indeed captured directly in the embedding space.

Model         SPC          LSP          TUT          MUS          BSD          LID
Spectrogram   0.02 ± .01   0.39 ± .02   0.00 ± .00   0.10 ± .01   0.11 ± .01   0.49 ± .02
              (+0%)        (+0%)        (+0%)        (+0%)        (+0%)        (+0%)
Untrained     0.08 ± .01   0.38 ± .02   0.04 ± .01   0.87 ± .01   0.41 ± .02   0.68 ± .02
              (+9%)        (-3%)        (+3%)        (+88%)       (+41%)       (+77%)
AutoEncoder   0.24 ± .02   0.44 ± .02   0.03 ± .01   0.68 ± .02   0.52 ± .02   0.67 ± .02
              (+30%)       (+20%)       (+3%)        (+67%)       (+55%)       (+70%)
A2V(CBoW)     0.14 ± .02   0.43 ± .02   0.10 ± .01   0.94 ± .01   0.52 ± .02   0.69 ± .02
              (+17%)       (+16%)       (+10%)       (+96%)       (+55%)       (+81%)
A2V(SG)       0.12 ± .01   0.43 ± .02   0.26 ± .02   0.96 ± .01   0.60 ± .02   0.70 ± .02
              (+14%)       (+15%)       (+27%)       (+99%)       (+67%)       (+84%)
TemporalGap   0.10 ± .01   0.37 ± .02   0.35 ± .02   0.92 ± .01   0.55 ± .02   0.70 ± .02
              (+11%)       (-10%)       (+36%)       (+93%)       (+60%)       (+84%)
TripletLoss   0.09 ± .01   0.25 ± .02   0.69 ± .02   0.96 ± .01   0.70 ± .02   0.72 ± .02
              (+10%)       (-59%)       (+71%)       (+99%)       (+80%)       (+91%)
MultiHead     0.69 ± .02   0.52 ± .02   0.86 ± .02   0.95 ± .01   0.75 ± .02   0.75 ± .02
              (+91%)       (+52%)       (+89%)       (+97%)       (+88%)       (+102%)
Supervised    0.76 ± .02   0.63 ± .02   0.97 ± .01   0.97 ± .01   0.84 ± .02   0.74 ± .02
              (+100%)      (+100%)      (+100%)      (+100%)      (+100%)      (+100%)
Table 3: Accuracy on kNN classification (and fraction of accuracy recovered wrt. the baselines).
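A minimal sketch of this nearest-neighbour evaluation (the helper is ours; the real evaluation uses 10000 training and 2000 evaluation embeddings per dataset):

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=10):
    """Classify a query embedding by majority vote over its k nearest
    training embeddings, using Euclidean distance in the embedding space."""
    dists = np.linalg.norm(np.asarray(train_emb) - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]                   # indices of k closest
    votes = np.bincount(np.asarray(train_labels)[nearest])
    return int(np.argmax(votes))
```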

Impact of training dataset: All the results reported so far use the AudioSet dataset to train the self-supervised models. AudioSet contains a wide variety of audio clips, including music, speech, ambient noise, acoustic events, etc. In order to evaluate the impact of the choice of the dataset, we repeated self-supervised training using LibriSpeech (discarding the speaker labels). We chose LibriSpeech because the original samples are sufficiently long to support our self-supervised tasks and because it contains audio content different from AudioSet (i.e., speech only). Table 4 reports how the evaluation results shown in Table 2 change when training the self-supervised models on LibriSpeech instead of AudioSet. In most cases, we observe a decrease in the level of accuracy on downstream tasks, especially for TemporalGap and TripletLoss, suggesting that a richer content variety in the training set is preferable when learning general-purpose audio representations.

Model         SPC      LSP      TUT      MUS      BSD      LID
AutoEncoder   0.27     0.65     0.96     0.87     0.56     0.67
              (-3%)    (+1%)    (-3%)    (-7%)    (-5%)    (-2%)
A2V(CBoW)     0.26     0.55     0.99     0.96     0.65     0.70
              (-13%)   (-3%)    (+0%)    (-2%)    (-1%)    (-1%)
A2V(SG)       0.23     0.65     0.99     0.97     0.66     0.71
              (-17%)   (+18%)   (-1%)    (-1%)    (-1%)    (+2%)
TemporalGap   0.18     0.55     0.93     0.94     0.59     0.64
              (-21%)   (+22%)   (-4%)    (-3%)    (-6%)    (-9%)
TripletLoss   0.10     0.34     1.00     0.93     0.56     0.65
              (-44%)   (-45%)   (+0%)    (-4%)    (-23%)   (-10%)
Table 4: Accuracy obtained when training self-supervised models on LibriSpeech (and the relative difference with respect to training on AudioSet). Negative values indicate a decrease in the level of accuracy.

Encoder fine-tuning: So far we considered the case in which the encoder is shared completely across different tasks, and only the last layer is allowed to learn task-specific parameters. It is interesting to observe what happens when we relax this assumption, allowing one or more of the deepest layers of the encoder to be retrained. Figure 3 shows the trade-off between the level of accuracy and the number of task-specific parameters for two datasets, Speech Commands and TUT Urban Acoustic Scenes 2018, for which Audio2Vec (skip-gram) was only able to partially bridge the accuracy gap with respect to the Supervised model. The left-most (blue) point corresponds to the accuracy already reported in Table 2. Note that in this case the number of task-specific parameters is equal to (d + 1)C, where C is the number of classes (equal to, respectively, 35 and 10 for these datasets). The second (orange) point from the left corresponds to retraining the fully-connected layer, while the remaining points correspond to also retraining up to and including the fifth and the fourth convolutional layer. Generally, retraining the last two layers is needed to recover most of the accuracy of the fully supervised model. Note that, although the last two layers account for approximately 80% of the parameters, they only contribute 20% of the FLOPs, which is particularly useful when deploying on mobile devices.

Figure 3: Accuracy obtained when retraining the last layers of the Audio2Vec (skip-gram) encoder on (a) Speech Commands and (b) TUT Urban Acoustic Scenes 2018.
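The task-specific parameter counts can be sanity-checked; a sketch assuming 128-dimensional embeddings and a logistic-regression head with one weight vector and one bias per class (the (d + 1)C formula above):

```python
def logistic_head_params(embedding_dim, num_classes):
    """Parameters of a per-task logistic-regression head: (d + 1) * C,
    i.e. a d x C weight matrix plus one bias per class."""
    return (embedding_dim + 1) * num_classes

speech_commands_head = logistic_head_params(128, 35)  # 35 keyword classes
tut_head = logistic_head_params(128, 10)              # 10 urban scene classes
```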

Impact of encoder architecture size: Although the focus of this paper is on encoder architectures that can be deployed on mobile devices, the proposed self-supervised methods are general and can also be applied to larger models. Therefore, we repeated our evaluation after increasing the size of the encoder architecture described in Table 1. Namely, we increased the number of channels in each convolutional layer by a factor of 4, and we increased the number of outputs of the last fully connected layer to obtain 256-dimensional embeddings. Table 5 shows that the accuracy on downstream tasks increases, and Audio2Vec (skip-gram) achieves the highest accuracy on almost all datasets.

Model         SPC      LSP      TUT      MUS      BSD      LID
AutoEncoder   0.35     0.62     1.00     0.96     0.65     0.70
              (+24%)   (-3%)    (+1%)    (+2%)    (+10%)   (+1%)
A2V(SG)       0.46     0.81     1.00     0.99     0.78     0.76
              (+64%)   (+47%)   (+0%)    (+1%)    (+16%)   (+10%)
TemporalGap   0.37     0.77     1.00     0.98     0.73     0.74
              (+60%)   (+71%)   (+3%)    (+1%)    (+15%)   (+4%)
TripletLoss   0.30     0.73     1.00     0.99     0.81     0.76
              (+66%)   (+17%)   (+0%)    (+2%)    (+10%)   (+4%)
Table 5: Accuracy obtained when using a larger encoder architecture (relative change wrt. Table 2).

5 Conclusion

In this paper we present self-supervised learning methods that exploit the temporal context in audio clips. Our results show that both Audio2Vec and TemporalGap are able to produce representations that can be re-used for different downstream tasks, without having access to labelled datasets during training. We based our experiments on small encoder architectures, which can be potentially deployed on mobile devices. This is motivated by the fact that, in our future work, we will investigate training self-supervised models directly on device in a distributed fashion, by taking advantage of federated learning. Another interesting direction is merging representations learned by different self-supervised models, as recently proposed in Pascual et al. (2019) for the case of speech embeddings.