In recent years, self-supervised learning (SSL) has proven to be an effective method for representation learning without labeled supervision, not only in the language [5] and visual [4] domains but also in the audio domain. Within audio, SSL for speech tasks has shown extremely promising results. Self-supervised learning for non-speech audio tasks, however, has been explored to a lesser extent. In several cases, non-speech audio tasks have been studied in the context of multimodal self-supervised learning, but an extensive evaluation of SSL on non-speech, audio-only tasks remains to be seen.
A number of self-supervised representation learning frameworks have been proposed, of which contrastive learning methods have been found to be particularly effective in both the image and speech domains. Contrastive learning aims to pull semantically similar samples closer together and push dissimilar samples apart in the feature space. Wav2vec 2.0 [3] is one such contrastive learning framework, which has been demonstrated to be highly effective in speech-related tasks [29, 6]. Wav2vec 2.0 uses transformers to build contextualized representations of the speech sequence, but transformers mainly model global dependencies. Convolutional neural networks (CNNs), in contrast, capture local associations well, but obtaining global information from them requires additional layers or parameters. Conformers [13], on the other hand, capture both local and global dependencies while requiring fewer parameters than convolutional models. Prior works [36, 24] have shown that conformers can outperform transformer and CNN models on both speech and non-speech tasks.
In this work, we combine the conformer architecture with contrastive learning to learn representations for non-speech audio in a self-supervised manner. We replace the transformer in wav2vec 2.0 with a conformer in order to capture both global and local information in the audio signal.
Unlike prior self-supervised works on speech representation learning (e.g. [3]), which learn directly from raw audio waveforms, we use logmel spectrograms as the fundamental representation of the audio signal. Besides their proven efficacy in different audio tasks, logmels are much more compact and lead to better memory and compute efficiency. The network is pre-trained on a large-scale unlabeled dataset of 67,000 hours to learn audio representations, and then fine-tuned on several audio tasks to show the generalization capabilities of the proposed SSL framework. The contributions of this work are the following: (1) We propose a conformer-based SSL framework for general-purpose audio representation learning applicable to many audio tasks, which can reduce the need for labeled data by two-thirds; (2) On the AudioSet [10] benchmark, our fine-tuned representations achieve a new state-of-the-art (SOTA) mean average precision (mAP) of 0.415 for self-supervised learning with only audio, outperforming the previous best score of 0.329; (3) Our method can surpass or match the performance of previous systems pre-trained in a supervised way on 3 out of 4 classification tasks: acoustic scenes, music and human actions; (4) We show the effect of the design choices during pre-training and identify the main parameters to tune to avoid overfitting during fine-tuning.
2 Self-Supervised Conformer Training
2.1 Upstream Data
The data for pre-training was selected from de-identified audio tracks of publicly shared Facebook user videos. We ran an in-house version of TALNet [34] on the audio tracks of 440 million user videos up to five minutes long, and obtained estimated probabilities of the 527 types of acoustic events in the AudioSet ontology. Then, for each event type, we selected the 9,500 videos with the highest probabilities (some videos might be selected for multiple event types). This resulted in a dataset with 3.9M videos totaling 67k hours, in which all the types of acoustic events are approximately evenly distributed.
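The per-event selection described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name `select_balanced_videos` and the dense probability matrix are assumptions, and a real run over hundreds of millions of videos would need a streaming top-k instead of a full sort.

```python
import numpy as np

def select_balanced_videos(probs: np.ndarray, per_class: int) -> np.ndarray:
    """Return the sorted, de-duplicated indices of the selected videos.

    probs has shape (num_videos, num_classes): the tagger's estimated
    probability of each event type for each video. For every event type
    we keep the `per_class` videos with the highest probability; a video
    chosen for several event types is counted once.
    """
    if per_class < 1:
        raise ValueError("per_class must be at least 1")
    per_class = min(per_class, probs.shape[0])
    # argsort is ascending, so the last rows hold the top-scoring videos.
    top_per_class = np.argsort(probs, axis=0)[-per_class:, :]
    return np.unique(top_per_class)

# Toy example: 3 videos, 2 event types, keep the single best video per type.
toy_probs = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.1, 0.9]])
selected = select_balanced_videos(toy_probs, per_class=1)
```

Because the union is de-duplicated, the final dataset is smaller than the number of event types times the per-type quota, which is consistent with 527 × 9,500 selections yielding 3.9M distinct videos.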
Frontend. We use 64-dimensional logmel spectrograms, extracted with a window size of 64 ms and a hop size of 20 ms, as the input.
Encoder. As shown in Figure 1, the logmel spectrograms are fed into two types of encoders: feature and context encoders. In the feature encoder, the time stacking layer stacks a fixed number of consecutive frames into a single frame; then a linear layer converts the stacked spectrogram into latent representation vectors. Before feeding the latent representation to the context encoder, we mask a proportion of it, similar to the original wav2vec 2.0 framework. The context encoder consists of a linear layer, multiple conformer blocks [13], and another linear layer at the end. We experiment with two context encoder architectures of differing sizes:
conformer small (cf_S): 12 conformer blocks with 256-D encoder embeddings and 8 attention heads (18.4M parameters).
conformer large (cf_L): 12 conformer blocks with 768-D encoder embeddings and 12 attention heads (88.1M parameters).
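The time stacking and masking steps of the feature encoder can be sketched as below. The stacking factor `k`, the uniform choice of masked steps and the shared `mask_vector` are all assumptions made for illustration; the linear projection between the two steps is omitted.

```python
import numpy as np

def stack_frames(logmel: np.ndarray, k: int) -> np.ndarray:
    """Stack every k consecutive frames: (T, F) -> (T // k, k * F).

    Trailing frames that do not fill a complete group are dropped.
    """
    num_frames, _ = logmel.shape
    num_stacked = num_frames // k
    return logmel[: num_stacked * k].reshape(num_stacked, -1)

def mask_latents(latents, mask_prob, mask_vector, rng):
    """Replace a proportion of the time steps with a shared mask vector.

    Returns the masked latents and a boolean mask over time steps.
    Assumption: masked steps are drawn uniformly without replacement.
    """
    num_steps = latents.shape[0]
    num_masked = int(round(mask_prob * num_steps))
    masked_idx = rng.choice(num_steps, size=num_masked, replace=False)
    mask = np.zeros(num_steps, dtype=bool)
    mask[masked_idx] = True
    masked = latents.copy()
    masked[mask] = mask_vector
    return masked, mask

rng = np.random.default_rng(0)
logmel = rng.standard_normal((500, 64))   # 10 s of audio at a 20 ms hop
stacked = stack_frames(logmel, k=4)       # k=4 is a hypothetical value
masked, mask = mask_latents(stacked, 0.3, np.zeros(stacked.shape[1]), rng)
```

Stacking shortens the sequence seen by the context encoder by a factor of `k`, which reduces the cost of self-attention at the price of a coarser time resolution.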
Both encoders use a feed-forward network (FFN) dimension of 1024, a dropout of 0.1, and a kernel size of 31 in the first conformer block and 15 in the rest. We further explore different design choices in Section 5.1.2. Additionally, following [22], we remove the original relative positional encoding and reuse the existing convolution module for positional encoding by swapping the order of the convolution and multi-head self-attention modules, which speeds up both training and inference.
Contrastive Loss. We learn audio representations by solving a contrastive task: identify the true latent representation for a masked time step within a set of candidates, which includes the true representation and a number of distractors. Similar to the original wav2vec 2.0, the distractors are uniformly sampled from other masked time steps of the same audio snippet. Given the context network output centered over a masked time step, the contrastive loss is computed from the similarity between that output and each candidate, where the similarity measure is the cosine similarity.
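A minimal sketch of such a loss for a single masked time step is given below, assuming the wav2vec 2.0 form: the negative log-softmax of the true candidate's similarity among all candidates, i.e. $-\log \frac{\exp(\mathrm{sim}(c_t, z_t)/\kappa)}{\sum_{\tilde{z}} \exp(\mathrm{sim}(c_t, \tilde{z})/\kappa)}$. The symbols, the temperature `temperature` (κ) and its default value are assumptions; they are not recoverable from the text here.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity along the last axis, with broadcasting."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return np.sum(a_unit * b_unit, axis=-1)

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """Contrastive loss for one masked time step.

    context:      (D,)   context network output at the masked step
    true_latent:  (D,)   true latent representation at that step
    distractors:  (K, D) latents sampled from other masked steps
    """
    # Candidate 0 is the true latent; the rest are distractors.
    candidates = np.concatenate([true_latent[None, :], distractors], axis=0)
    logits = cosine_similarity(context[None, :], candidates) / temperature
    logits = logits - logits.max()            # numerical stability
    log_prob_true = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_true
```

When every candidate is identical, the loss reduces to the log of the number of candidates, which is the chance level; a context output aligned with the true latent and dissimilar from the distractors drives the loss towards zero.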
For all pre-training experiments, we use the Adam [17] optimizer. We warm up the learning rate linearly over the first 10k updates to its peak value, and then decay it linearly to zero by 300k updates. We regularize the model using weight decay with a decay factor of 0.01. For the contrastive loss, we sample a fixed number of distractors and mask 30% of the latent representations.
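The warmup-then-decay schedule can be written as a small function. The step counts come from the text; `peak_lr` is left as a parameter because no peak value is given here, and the function name is hypothetical.

```python
def pretrain_lr(step: int, peak_lr: float,
                warmup_steps: int = 10_000,
                total_steps: int = 300_000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return 0.0
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With these defaults the learning rate is at half its peak both at update 5,000 (during warmup) and at update 155,000 (midway through the decay).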
3 Downstream Tasks
We evaluate the generalizability of the learnt representations on sound event detection (SED) and other non-speech audio tasks. Within SED, we evaluate on the AudioSet dataset separately in Sec. 5, because it is much larger than the data used in all the other downstream tasks and has been used in several prior works as well.
3.1 Sound Event Classification
AudioSet [10]: AudioSet contains 2 million 10-second audio clips from YouTube videos, labeled using an ontology of 527 sound classes. Each clip may have multiple labels. The Balanced training set of AudioSet is a subset with at least 59 clips per class. The balanced training, full training, and evaluation sets contain 22k, 2M, and 20k samples, respectively. For AudioSet, we use the larger model (cf_L).
ESC-50 [26]: ESC-50 consists of 2,000 5-second audio clips belonging to 50 environmental sound categories. It contains 40 samples for each category, and comes divided into 5 folds for cross validation.
FSDKaggle2019 [9]: FSDKaggle2019 is a multi-label dataset consisting of 29,266 audio files annotated with 80 labels from the AudioSet ontology. It further consists of a curated set and a noisy set: the former is annotated by humans but the labels may be incomplete; the latter contains many wrong labels.
3.2 Other Audio Tasks: Scenes, Music, and Human Actions
Acoustic scene classification: We use the dataset from Task 1a of the 2019 DCASE challenge . It consists of 9,185 and 4,185 segments in the training and the test sets respectively, belonging to 10 acoustic scene classes such as “airport”, “park”, and “metro station”. Music tagging: This is a multi-label classification problem where each recording can have one genre label and multiple instrument labels. For this task, we use the MagnaTagATune  dataset which consists of 25,863 music clips, each clip lasting 29 seconds. We follow the most common 12:1:3 split of training, validation, and evaluation data, and only use the 50 most popular tags out of the 188. Human action classification: This task involves recognizing human actions such as “dancing” and “playing guitar” in video recordings. The Kinetics700 dataset  is a collection of 650k 10-second video clips covering 700 human action classes. This problem has primarily been addressed using images in a uni-  or multimodal  approach; so far there is only one audio-only solution .
|Pre-train + fine-tune||0.276||0.305||0.337||0.356||0.415|
After the self-supervised pre-training, the conformer model can generate an embedding vector (the context representation) for each frame of the input audio. The downstream tasks, however, require predictions over the entire input audio. For AudioSet, we first add a linear classification layer with the sigmoid activation to predict the frame-level probabilities of each event type, then add a linear softmax pooling layer [34] to pool them into global probabilities. For all other downstream tasks, we first average-pool the embeddings across all the frames, then stack a linear classification layer to make predictions. During fine-tuning, we allow the weights of the conformer to update. We find this is critical to make the SSL scheme work. This is different from some existing works on transfer learning, where a shallow classifier is stacked on top of an encoder with frozen weights.
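The two pooling heads can be sketched as follows. Linear softmax pooling weights each frame-level probability by itself, so the clip-level probability is the sum of squared frame probabilities divided by their sum. The function names and the small `eps` guard against division by zero are assumptions of this sketch.

```python
import numpy as np

def linear_softmax_pool(frame_probs: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Pool frame-level probabilities (T, C) into clip-level ones (C,).

    Each frame is weighted by its own probability: sum(p^2) / sum(p).
    """
    return (frame_probs ** 2).sum(axis=0) / (frame_probs.sum(axis=0) + eps)

def average_pool(embeddings: np.ndarray) -> np.ndarray:
    """Average frame embeddings (T, D) into one clip embedding (D,)."""
    return embeddings.mean(axis=0)

# One event active in half of the frames, another never active.
frame_probs = np.array([[1.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 0.0],
                        [0.0, 0.0]])
clip_probs = linear_softmax_pool(frame_probs)
```

In the toy example a plain average would report 0.5 for the first event, whereas linear softmax pooling reports a value close to 1, since confident frames dominate the weighting. This suits events that occupy only part of a clip.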
4.1 Data Processing
To deal with the highly skewed label distribution in Full AudioSet, we apply data balancing to ensure that training sees the different sound classes at approximately equal frequencies. We also apply the following data augmentations, in this order, to all downstream tasks:
Temporal jittering: Prior to extracting logmel features, we shift the waveform by a random amount between -200 and 200 samples. The maximum shift (200 samples) is equal to half the frame length.
SpecAugment [25]: For each 10 s video, we mask out the logmel features of one random temporal interval of up to 2 s. We do not mask out any frequency bins because doing so does not work well with Mixup.
Mixup [35]: For a minibatch of videos, we shuffle the instances of the batch of logmel features to obtain a second batch, and then mix the two batches using a mixing coefficient. The mixing coefficient is sampled from a beta distribution, and a bound on its value is enforced. The labels of the mixed batch are inherited from the original batch; we do not mix them with the labels of the shuffled batch.
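The three augmentations can be sketched as below. Several details are assumptions, not taken from the text: the circular shift in `jitter`, the zero fill value for masked frames, the beta parameter `alpha`, and the rule that keeps the mixing coefficient at 0.5 or above so that the original clip dominates (which is what makes inheriting its labels reasonable).

```python
import numpy as np

def jitter(waveform, max_shift, rng):
    """Shift the waveform by a random number of samples in [-max_shift, max_shift]."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(waveform, shift)          # assumption: circular shift

def time_mask(logmel, max_width, rng, fill=0.0):
    """Mask one random temporal interval of at most max_width frames."""
    num_frames = logmel.shape[0]
    width = int(rng.integers(0, min(max_width, num_frames) + 1))
    start = int(rng.integers(0, num_frames - width + 1))
    masked = logmel.copy()
    masked[start:start + width] = fill
    return masked

def mixup_inputs(batch, alpha, rng):
    """Mix each instance with a shuffled partner; labels are left untouched."""
    lam = float(rng.beta(alpha, alpha))
    lam = max(lam, 1.0 - lam)                # assumption: original dominates
    partner = batch[rng.permutation(batch.shape[0])]
    return lam * batch + (1.0 - lam) * partner, lam

rng = np.random.default_rng(0)
wave = jitter(np.arange(16_000, dtype=float), max_shift=200, rng=rng)
feats = time_mask(np.ones((500, 64)), max_width=100, rng=rng)   # 2 s at a 20 ms hop
mixed, lam = mixup_inputs(np.ones((8, 500, 64)), alpha=1.0, rng=rng)
```

Jittering acts on the waveform, so it has to run before feature extraction, while the temporal mask and Mixup operate on the logmel features.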
4.2 Consistency Loss
In addition to the binary cross-entropy (BCE) loss, we employ a consistency loss as a regularizer for training. For each minibatch, we apply the data augmentation twice with different random numbers, and compute the symmetric KL divergence between the model’s predicted event probabilities on the two versions of data. This KL divergence is multiplied by 2 and added to the BCE loss.
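A sketch of the combined objective is shown below. It treats each event probability as an independent Bernoulli variable (consistent with sigmoid outputs), takes the symmetric KL divergence as the sum of the two directed divergences averaged over clips and classes, and averages the BCE over the two augmented views. All three choices are assumptions; the text only fixes the weight of 2.

```python
import numpy as np

def _clip(p, eps=1e-7):
    return np.clip(p, eps, 1.0 - eps)

def bernoulli_kl(p, q):
    """Element-wise KL(p || q) between Bernoulli distributions."""
    p, q = _clip(p), _clip(q)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def symmetric_kl(p, q):
    """Mean of KL(p || q) + KL(q || p) over all clips and classes."""
    return float(np.mean(bernoulli_kl(p, q) + bernoulli_kl(q, p)))

def bce(p, y):
    """Mean binary cross-entropy between predictions p and labels y."""
    p = _clip(p)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def training_loss(p_view1, p_view2, labels, consistency_weight=2.0):
    """BCE averaged over the two views plus the weighted consistency term."""
    supervised = 0.5 * (bce(p_view1, labels) + bce(p_view2, labels))
    return supervised + consistency_weight * symmetric_kl(p_view1, p_view2)
```

The consistency term is zero when the two views yield identical predictions and grows as they diverge, so it penalizes sensitivity to the random augmentation independently of the labels.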
4.3 Hyperparameter Tuning
We find that the fine-tuning process is prone to overfitting, but it can be avoided by carefully tuning the learning rate, the learning rate schedule, and the batch size. We utilize a three-stage learning rate scheduler  for all downstream datasets. The three stages are: warmup, hold, and exponential decay, and they typically last for 30%, 30%, and 40% of all the updates. While high-resource datasets perform best with larger batches, smaller datasets prefer smaller batches. For example, we use a batch size of 640 and 64 for Full AudioSet and MagnaTagATune, respectively. We also regularize the model more for low-resource datasets by adding dropout to the output layer of the pre-trained conformer. We sweep different combinations of the parameters to find the one that optimizes validation performance.
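A three-stage schedule of this kind can be sketched as follows. The 30/30/40 split comes from the text; the starting scale `init_scale`, the final scale `final_scale` and the function name are hypothetical values chosen only to make the example concrete.

```python
import math

def tri_stage_lr(step, total_steps, peak_lr,
                 init_scale=0.01, final_scale=0.05,
                 phase_ratio=(0.3, 0.3, 0.4)):
    """Warmup, hold, then exponential decay of the learning rate."""
    warmup_steps = round(phase_ratio[0] * total_steps)
    hold_steps = round(phase_ratio[1] * total_steps)
    decay_steps = max(total_steps - warmup_steps - hold_steps, 1)

    if step < warmup_steps:
        # Stage 1: linear warmup from init_scale * peak_lr to peak_lr.
        init_lr = init_scale * peak_lr
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        # Stage 2: hold at the peak.
        return peak_lr
    # Stage 3: exponential decay, reaching final_scale * peak_lr at the end.
    elapsed = min(step - warmup_steps - hold_steps, decay_steps)
    return peak_lr * math.exp(math.log(final_scale) * elapsed / decay_steps)
```

For a 100-update run this warms up over updates 0 to 29, holds the peak over updates 30 to 59, and decays over the remaining 40 updates.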
|Self-supervised||Multi-Format ||w + l||✓||0.329|
|C ||l + v||✓||0.285|
|Multimodal||MMV ||l + v + t||✓||0.309|
|Self-supervised||VATT ||l + v + t||✓||✓||0.394|
|w + l + v||✓||0.424|
|Supervised||PSLA (w/o ensemble) ||w + l||✓||0.444|
|AST (w/o ensemble) ||l||✓||0.459|
|No pre-training||WEANet-SUSTAIN ||l||0.398|
|(From scratch)||PANNs ||w + l||0.431|
|Task||Dataset||# Training Samples||Metric||SS Conformer: cf_S||SS Conformer: cf_L||Prior Works: SS + shallow||Prior Works: S + shallow||Prior Works: S + fine-tune|
|Sound Events||ESC50||1,600||Accuracy||80.7||88.0||86.3 ||94.1 ||94.7 |
|Acoustic Scenes||DCASE2019 Task 1a||9,185||Accuracy||72.4||76.1||✗||68.0 ||✗|
|Music Tagging||MagnaTagATune||15,247||MAUC||90.2||91.2||89.3 ||91.5 ||✗|
|Human Actions||Kinetics700||504,443||Top-1 Accuracy||20.4||23.5||✗||18.0 ||✗|
5.1 Evaluation on AudioSet
5.1.1 Effect of Pre-training
First, we evaluate the merit of pre-training in our proposed SSL schema in Table 1. On both the Balanced and the Full training sets, we compare the performance of the cf_L model trained from scratch vs pre-trained then fine-tuned. The model trained from scratch performs 50% and 12% worse than the pre-trained and fine-tuned version on Balanced and Full AudioSet, respectively. This emphasizes the necessity of pre-training, especially for smaller downstream datasets. The merit of pre-training can also be quantified by the reduced need for labeled data. By varying the amount of fine-tuning data from 20k to 1.9M audio clips, we find that pre-training and fine-tuning with 600k clips approximately matches the performance of training from scratch with 1.9M clips. That is, pre-training with 67k hours of unlabeled data reduces the need for labeled data by two-thirds.
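The two-thirds figure follows directly from the clip counts quoted above, as this short check shows (the function name is hypothetical):

```python
def labeled_data_reduction(clips_with_pretraining: int,
                           clips_from_scratch: int) -> float:
    """Fraction of labeled clips saved at matched performance."""
    return 1.0 - clips_with_pretraining / clips_from_scratch

reduction = labeled_data_reduction(600_000, 1_900_000)
print(f"{reduction:.1%}")  # prints 68.4%
```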
5.1.2 Ablation Studies
We perform detailed ablation studies to investigate the effect of the amount of pre-training data and of the design choices regarding the model structure, on both the Full and Balanced training sets.
Amount of pre-training data. As shown in Fig. 2a, the performance on Full AudioSet stays relatively unchanged as we vary the amount of pre-training data from 6k to 60k hours. In other words, we can reduce the need for labeled data by two-thirds even with only 6k hours of unlabeled data. On Balanced AudioSet, however, the performance increases significantly and monotonically with every bit of extra pre-training data. This demonstrates that a sufficient amount of unlabeled data can make up for the scarcity of labeled data.
Number of conformer blocks. We vary the depth of the model from 8 conformer blocks (58.9M parameters) to 20 conformer blocks (146.6M parameters), keeping all other hyperparameters identical to the large model. Fig. 2b shows that the largest performance gain occurs when going from 8 to 12 blocks. Beyond that, the improvement is minimal (if any) for Full AudioSet, but still moderate for Balanced AudioSet from 16 to 20 blocks.
Number of attention heads. Attention heads allow varied levels of focus on different parts of the sequence, producing better predictions than a single weighted average. We vary the number of attention heads from 4 to 24 in our large model, using the same number of heads in all layers. As shown in Fig. 2c, there is a slight improvement from 4 to 12 heads, but the performance drops once the number of attention heads exceeds 12 for both Full and Balanced AudioSet.
In general, the pre-training data size and design choices have a larger impact when labeled data for the downstream task is limited.
5.1.3 Comparison with Prior Works
Table 2 lists the Full AudioSet test performance of our system and some prior works. Depending on the type of data used for pre-training, we divide the works into four categories: self-supervised pre-training with audio only, self-supervised pre-training with multimodal data, supervised pre-training, and no pre-training. Not all comparisons are apples to apples: some works pre-train on AudioSet, and others (including ours) fine-tune the weights of the entire network, both of which may give them an advantage. Still, we report our performance with 6k hours of pre-training data without any augmentation, in order to match previous works that pre-train on AudioSet.
We give a quick overview of the previous works using audio-only self-supervised pre-training. [31] uses contrastive predictive coding (CPC) to single out the representation of a future step at a given distance, but also adds a nonlinear learnable similarity metric and adversarial perturbation. All the others [14, 15, 33, 8] learn encoders such that a pair of audio clips sampled within a certain temporal proximity have similar representations; the differences lie in how the pairs are presented to the encoder and in the loss function. [14] and [15] present both clips as spectrograms, and use the BCE loss and the triplet loss, respectively. [33] presents one clip as a waveform and the other as a spectrogram, and uses a one-vs-many contrastive loss. [8] separates a mixed signal into two channels, forms a pair using one of the channels and the original signal, and combines the BCE and the one-vs-many contrastive losses. The current state-of-the-art performance on AudioSet using audio-only SSL is held by [33], with an mAP of 0.329. Our method outperforms it by a large margin of 25% relative, achieving an mAP of 0.411. However, it should be pointed out that we update the encoder weights during fine-tuning, while all the existing works in this category, including [33], use a shallow classifier on top of frozen encoder weights.
On the multimodal self-supervised learning front, most methods use audio and visual signals to learn relationships between them [14, 2, 1, 32]. In addition to audio and video, some works [2, 1] also make use of textual information, but perform worse than our framework with only audio. Our audio-only learning method is competitive even with the best multimodal SSL work on AudioSet [32], falling short by only 3% relative.
Supervised pre-training on ImageNet [12, 11] outperforms our proposed SSL framework, but it requires a large amount of labeled data for pre-training. However, we notice that some systems trained from scratch [18, 30] achieve exceedingly high performance. This may indicate that CRNNs are better suited for SED than conformers.
5.2 Evaluation on Other Downstream Tasks
Table 3 summarizes the results for all the other downstream tasks. We compare the performance of both our conformer models (cf_S and cf_L) with prior works of transfer learning using either self-supervised or supervised pre-training. Most of these prior works use shallow classifiers; only  fine-tunes the entire network as we do. To our knowledge, no self-supervised prior work exists for FSDKaggle2019, DCASE2019, or Kinetics700. For ESC50 and MagnaTagATune, we outperform baselines of self-supervised pre-training with our cf_L architecture. For music tagging, we nearly match the baseline system  pre-trained in a supervised fashion; for the classification of acoustic scenes and human actions, cf_L (cf_S) outperforms supervised pre-training baselines by 11.9% (6.5%) and 30.5% (13.3%), respectively. However, if we compare against supervised pre-training baselines for SED tasks, our best cf_L architecture performs 7.6% (ESC50) and 19.1% (FSDKaggle curated) worse. This significant disparity can be attributed to the fact that the baseline systems were pre-trained on Full AudioSet, which has a substantial overlap in the label space with both ESC50 and FSDKaggle, providing the baseline systems with an edge.
In this paper, we have proposed and evaluated a conformer-based self-supervised framework for learning representations for non-speech audio. The self-supervised pre-training can reduce the need for labeled data by two-thirds. On the widely known AudioSet, our framework produces an mAP of 0.415, outperforming the audio-only self-supervised learning SOTA of 0.329; it also achieves performance comparable to supervised pre-trained baselines on several other downstream non-speech tasks. Our ablation studies show the significance of critical design choices such as the model depth.
- [1] (2021) VATT: transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178.
- [2] (2020) Self-supervised multimodal versatile networks. NeurIPS 2 (6), pp. 7.
- [3] (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. In NeurIPS.
- [4] (2020) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607.
- [5] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- [6] (2020) Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185.
- [7] (2021) A large-scale study on unsupervised spatiotemporal representation learning. In CVPR, pp. 3299–3309.
- [8] (2021) Self-supervised learning from automatically separated sound scenes. arXiv preprint arXiv:2105.02132.
- [9] (2019) Audio tagging with noisy labels and minimal supervision. arXiv preprint arXiv:1906.02975.
- [10] (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, pp. 776–780.
- [11] (2021) AST: audio spectrogram transformer. arXiv preprint arXiv:2104.01778.
- [12] (2021) PSLA: improving audio event classification with pretraining, sampling, labeling, and aggregation. arXiv preprint arXiv:2102.01243.
- [13] (2020) Conformer: convolution-augmented transformer for speech recognition. In Interspeech, pp. 5036–5040.
- [14] (2020) Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision. In ICASSP, pp. 121–125.
- [15] (2018) Unsupervised learning of semantic audio representations. In ICASSP, pp. 126–130.
- [16] (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
- [17] (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- [18] (2020) PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM TASLP 28, pp. 2880–2894.
- [19] (2020) A sequential self teaching approach for improving generalization in sound event recognition. In ICML, pp. 5447–5457.
- [20] (2021) Do sound event representations generalize to other audio tasks? A case study in audio transfer learning. In Interspeech, pp. 1214–1218.
- [21] (2009) Evaluation of algorithms using games: the case of music tagging. In ISMIR, pp. 387–392.
- [22] (2021) A better and faster end-to-end model for streaming ASR. In ICASSP, pp. 5634–5638.
- [23] (2018) A multi-device dataset for urban acoustic scene classification. arXiv preprint arXiv:1807.09840.
- [24] (2020) Convolution-augmented transformer for semisupervised sound event detection. In DCASE, pp. 100–104.
- [25] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Interspeech.
- [26] (2015) ESC: dataset for environmental sound classification. In ACM Int. Conf. on Multimedia, pp. 1015–1018.
- [27] (2021) Learning transferable visual models from natural language supervision. In ICML.
- [28] (2021) Contrastive learning of musical representations. arXiv preprint arXiv:2103.09410.
- [29] (2021) Improved language identification through cross-lingual self-supervised learning. arXiv preprint arXiv:2107.04082.
- [30] (2021) ERANNs: efficient residual audio neural networks for audio pattern recognition. arXiv preprint arXiv:2106.01621.
- [31] (2020) Contrastive predictive coding of audio with an adversary. In Interspeech, pp. 826–830.
- [32] (2021) Multimodal self-supervised learning of general audio representations. arXiv preprint arXiv:2104.12807.
- [33] (2021) Multi-format contrastive learning of audio representations. arXiv preprint arXiv:2103.06508.
- [34] (2019) A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP, pp. 31–35.
- [35] (2017) mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
- [36] (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.