Pre-training representations in an unsupervised way, with subsequent fine-tuning on labelled data, has become the standard approach for extending the performance of deep architectures to applications where annotations are scarce, such as understanding medical images, recognizing speech from under-resourced languages [27, 8], or solving specific language inference tasks. Among unsupervised training schemes, self-supervised learning focuses on designing a proxy training objective that requires no annotation, such that the representations incidentally learned generalize well to the task of interest, limiting the amount of labeled data needed for fine-tuning. Such “pretext” tasks, a term coined by Doersch et al., include learning to colorize an artificially gray-scaled image, inpainting removed patches, multi-modal data comparisons, or recognizing by what angle an original image was rotated. Other approaches to self-supervision include classification of original images after data augmentation, and clustering. Many of these methods, however, such as gray-scaling and rotation, cannot be applied to spectrograms for use in audio processing. There has also been recent work using contrastive learning for audio representation learning, which requires both the raw waveform and the log-mel spectrogram. Transformers have also been used for unsupervised learning on audio. Our method is attractive for its simple architecture, its relatively small change to the loss, and its semantically meaningful pretext task.
In this work, we consider the pretext task of reordering patches of the spectrogram of an audio signal, first proposed for images, the analogue of solving a jigsaw puzzle. In this setting, we first split the input into patches and shuffle them by applying a random permutation. We train a neural network to predict which permutation was applied, taking the shuffled patches as inputs. After pre-training with this pretext task, we use the inner representations learned by the neural network as input features to a low-capacity model (see Figure 1) trained for supervised classification (the downstream task). Recent work has used transfer learning on audio to great effect, applying self-supervision to latent vectors, while our work explores the use of a pretext task directly on the input data. Permutations provide a promising avenue for this type of self-supervised learning, as they are conceptually general enough to be applied across a large range of modalities, unlike colorization or rotations, which are specific to images. The encouraging results of this approach
when transferring learned image features for object detection and image retrieval inspire us to take this method a step forward. However, including permutations in an end-to-end differentiable pipeline is challenging, as permutations are a discontinuous operation. This issue has been circumvented by casting the problem as classification over a fixed subset of permutations. Given that the number of possible permutations of n patches is n!, this approach cannot scale to the full set of permutations, even when n is moderately small. Alternatively, Sinkhorn normalization has been used to produce doubly stochastic matrices as approximations of the true permutation matrix. While this method can be integrated into an end-to-end pipeline, each forward pass through the model relies on an iterative Sinkhorn algorithm, each iteration of which adds computational cost, and which requires choosing a stopping criterion.
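To make the scaling issue concrete, the number of distinct permutations grows factorially with the number of patches; a quick check with the Python standard library:

```python
from math import factorial

# The classification approach needs one class per permutation of n patches.
for n in (3, 5, 10, 15):
    print(f"{n} patches -> {factorial(n):,} permutations")
# 10 patches already require 3,628,800 classes, and 15 patches over a
# trillion, which rules out a fixed classification head over all of them.
```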
In this work, we leverage recent advances in differentiable ranking [3, 5] to integrate permutations into end-to-end neural training. This allows us to solve the permutation inversion task for the entire set of permutations, removing a bottleneck that was heretofore sidestepped in ways that limit downstream performance. Moreover, we demonstrate for the first time the effectiveness of permutations as a pretext task on audio, with minimal modality-specific adjustments. In particular, we improve instrument classification and pitch estimation of musical notes by learning to reorder spectrogram frames over the time and frequency axes.
II-A General methodology
In this section, we present a self-supervised pretext task that predicts the permutation applied to patches of an input. We do so in a manner that allows us to use all possible permutations as targets during training. This pretext task is performed for pre-training, and the internal representation learned by the pretext neural network can then be transferred and used on secondary downstream tasks – see Figure 1.
During pre-training, for each audio sequence, we split its spectrogram into n patches. These patches can either be vertical (stacks of time frames), horizontal (stacks of frequency bands), or cut along both axes (stacks of time-frequency bins). We then permute these patches randomly and stack them in a tensor (see Figure 1), which is paired with the applied permutation, encoded as a label of size n (see Section II-B for details on permutation encoding).
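As an illustration, the slicing and shuffling step can be sketched as follows (a NumPy toy version with hypothetical shapes; the function name `make_pretext_example` is ours, not from the paper):

```python
import numpy as np

def make_pretext_example(spectrogram, n_patches, rng):
    """Slice a spectrogram into horizontal bands, shuffle them, and
    return the shuffled stack together with its rank label."""
    # Split along the frequency axis into n_patches equal bands.
    patches = np.array(np.split(spectrogram, n_patches, axis=0))
    perm = rng.permutation(n_patches)   # permutation applied to the patches
    shuffled = patches[perm]            # tensor of stacked, shuffled patches
    label = perm + 1                    # rank label: original (1-based)
    return shuffled, label              # position of each shuffled patch

rng = np.random.default_rng(0)
spec = rng.normal(size=(80, 100))       # toy spectrogram: 80 bins x 100 frames
shuffled, label = make_pretext_example(spec, 10, rng)
# Sorting the shuffled patches by their label restores the original input.
restored = np.concatenate(shuffled[np.argsort(label)], axis=0)
```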
We pre-train the weights of a neural network to invert this permutation, using a differentiable ranking operator. This operator and other details of pre-training are described in Section II-B; the network and data processing are described in Section II-C. After pre-training, the network weights are used to generate embeddings at an intermediate layer. These representations can then be used in a downstream task, as inputs to a low-capacity classifier (see Figure 1).
We mostly evaluate our methods by improvements in downstream classification. However, the reordering task can be of interest in itself, as in learning-to-rank problems, and we also report generalization performance on this task.
II-B Differentiable ranking methodology
Our methodology for representation learning relies on the ability to incorporate ordering or ranking operations in an end-to-end differentiable pipeline. This is achieved by using a convenient encoding of the permutations of n objects, together with differentiable operators that approximate ranking.
For each permutation, the label y is the vector of ranks, or relative order, of the elements (e.g. y = (1, 2, …, n) for the identity permutation). For the model prediction, the last two layers consist of a vector of score values θ, mapped to network outputs Φ(θ) by a differentiable ranking operator Φ. These operators map any vector of scores to a point in the convex hull of permutation encodings in dimension n (e.g., of the 24 orderings of (1, 2, 3, 4) over 4 elements), akin to a softmax operator in a classification setting. We consider here two differentiable ranking operations, one using stochastic perturbations and one using regularization. In any case, embedding the permutations in this manner (rather than, e.g., as permutation matrices or classes) puts more emphasis on the relative positions of the elements, and allows us to penalize small index differences less.
These tools ensure that our model is end-to-end differentiable, and enable the use of all permutations in training. This is unlike models where reordering is reduced to classification, assigning a fixed set of permutations to one-hot vectors. That approach is obviously limited: representing all the permutations requires in principle n! classes, which quickly becomes unmanageable, even for small values of n.
Pre-training the network parameters requires a loss function between the scores θ and the label y. For the version of Φ based on stochastic perturbations, we use the associated Fenchel–Young loss (“Perturbed F-Y” in empirical results), which acts directly on θ and is written here as L_FY(θ; y). Its gradients, given by ∇θ L_FY(θ; y) = Φ(θ) − y, are easy to compute. For the regularized version of Φ, we use the squared loss ||Φ(θ) − y||² (“Fast Soft Ranking” in empirical results).
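To make the two ingredients concrete, here is a toy NumPy sketch of ranking with stochastic perturbations and the resulting Fenchel–Young gradient (a Monte Carlo illustration under assumed Gaussian noise, not the optimized implementations cited above):

```python
import numpy as np

def hard_ranks(theta):
    """Ranks of theta: 1 for the smallest score, n for the largest."""
    return np.argsort(np.argsort(theta)) + 1.0

def perturbed_ranks(theta, sigma=0.1, n_samples=1000, seed=0):
    """Soft ranks: expected hard ranks under Gaussian perturbations of theta.
    The result lies in the convex hull of rank vectors."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_samples, theta.size))
    return np.mean([hard_ranks(theta + sigma * z) for z in noise], axis=0)

def perturbed_fy_gradient(theta, y, **kwargs):
    """Gradient of the perturbed Fenchel-Young loss: soft ranks minus targets."""
    return perturbed_ranks(theta, **kwargs) - y

theta = np.array([0.1, 2.0, 1.0])   # scores output by the network
y = np.array([1.0, 3.0, 2.0])       # target ranks
grad = perturbed_fy_gradient(theta, y)
# The scores are already ordered like y, so the gradient is close to zero.
```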
II-C Implementation and architecture
When constructing the self-supervised task, we slice inputs into patches. This slicing is controlled by two variables, determining respectively the number of columns and rows used. For images, square patches are used. On spectrograms, this choice is conceptually richer: slicing along the rows yields frequency bands, slicing along the columns yields time frames, and slicing along both axes yields time-frequency patches; see Figure 1 for an illustration.
For the reordering pretext task, we use a Context Free Network (CFN). This network uses an AlexNet backbone that processes each patch individually while sharing weights across all patches, as shown in Figure 1. By processing each patch independently, but with shared weights, the network cannot rely on global structure. After the shared backbone, the patches are passed together through two fully connected layers. The output layer represents the predicted ranks of the input permutation.
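The weight sharing of the CFN can be sketched as follows (a schematic NumPy stand-in with one linear-plus-ReLU layer in place of the actual AlexNet backbone; all dimensions are illustrative):

```python
import numpy as np

def cfn_scores(patches, w_shared, w_joint):
    """CFN forward pass: one shared backbone applied to every patch,
    then a joint fully connected layer over the concatenated features."""
    # The SAME weights w_shared encode every patch, so no single patch
    # embedding can depend on global (cross-patch) structure.
    feats = [np.maximum(0.0, w_shared @ p.ravel()) for p in patches]
    joint = np.concatenate(feats)    # patches interact only from here on
    return w_joint @ joint           # one score per patch, fed to the
                                     # differentiable ranking operator

rng = np.random.default_rng(0)
patches = rng.normal(size=(10, 8, 100))         # 10 shuffled patches
w_shared = rng.normal(size=(16, 800)) * 0.05    # shared patch encoder
w_joint = rng.normal(size=(10, 160)) * 0.05     # joint scoring layer
scores = cfn_scores(patches, w_shared, w_joint)
```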
In the downstream tasks, we use 3-layer multi-layer perceptrons (MLPs) trained on embeddings extracted at the first aggregate fully connected layer of the pretext network (whose weights are frozen during this stage). The MLP’s output layer is task-dependent. For a regression downstream task, the output of the MLP is a single scalar, and the downstream model is trained to minimize the mean squared error. For classification, the output of the MLP is a softmax over the class logits, and we train the downstream model by minimizing the cross-entropy loss.
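The two task-dependent objectives can be written compactly (a NumPy sketch; the function names are ours):

```python
import numpy as np

def mse_loss(prediction, target):
    """Regression downstream task (e.g. pitch): scalar output, squared error."""
    return float((prediction - target) ** 2)

def cross_entropy_loss(logits, label):
    """Classification downstream task: softmax over class logits,
    negative log-likelihood of the true class (computed stably)."""
    logits = np.asarray(logits, dtype=float)
    log_probs = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    return float(-log_probs[label])
```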
We demonstrate the effectiveness of permutation-based pre-training as measured by test performance on instrument classification and pitch estimation tasks. All experiments are carried out in TensorFlow and run on a single P100 GPU. We also report performance on the pre-training task using partial ranks: the proportion of patches ranked in the correct position.
III-A Experimental setup
The NSynth dataset offers about 300,000 audio samples of musical notes, each with a unique pitch, timbre, and envelope, recorded from 1,006 different instruments. The recordings, sampled at 16 kHz, are 4 seconds long and can be used for 3 downstream tasks: predicting the instrument itself (1,006 classes), predicting the instrument family (11 classes), and predicting the pitch of the note (128 values). We formulate pitch estimation as a regression task and report the mean squared error (MSE). The input representation is a log-compressed spectrogram, computed over 25 ms windows with a 10 ms stride and 513 bins. The 2D structure of the spectrogram allows us to use a 2D convolutional neural network, as is done with images. We train our CFN with an AlexNet backbone on the pre-training task of predicting applied permutations for 1000 epochs, over mini-batches of size 32, with the Adam optimizer. We then evaluate downstream generalization performance over the 3 NSynth tasks, replacing the last layers of the network with a task-specific 3-layer MLP and replacing the random permutation with the identity.
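As a reference point for these front-end parameters (16 kHz sampling, 25 ms windows, 10 ms stride, 513 bins, i.e. a 1024-point FFT), a minimal log-compressed spectrogram can be sketched in NumPy; the Hann window and log1p compression are our assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

def log_spectrogram(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=1024):
    win = int(sr * win_ms / 1000)       # 400 samples per 25 ms window
    hop = int(sr * hop_ms / 1000)       # 160 samples per 10 ms stride
    window = np.hanning(win)
    frames = np.stack([signal[i:i + win] * window
                       for i in range(0, len(signal) - win + 1, hop)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))  # 513 frequency bins
    return np.log1p(mag).T              # (frequency, time), log-compressed

sig = np.random.default_rng(0).normal(size=4 * 16000)   # a 4-second clip
S = log_spectrogram(sig)                # 513 bins, as in the experiments
```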
We compare the different variants of our method (number and nature of the patches) with 2 baseline alternatives: i) training the downstream head on an untrained encoder (Random Embedding), and ii) solving the same pre-training task over a finite set of 100 permutations (Fixed Permutation). We also compare different losses for training the permutation pretext task: a) cross-entropy (XE) when learning over 100 permutations, b) the MSE loss (MSE), c) soft ranking via perturbations (Perturbed F-Y), and d) soft ranking (Fast Soft Ranking).
III-B Empirical results
First, we compare the different methods across several downstream data regimes and report the results in Figure 2. Here, all pre-training models slice the spectrogram over the frequency axis, as it gives the best performance (see Section III-C for an ablation study). Additionally, the experiments in this figure use 10 patches for all methods and 1000 permutations for the XE method. We observe that in the low-data regime our method strongly outperforms an end-to-end fully supervised model. Moreover, this advantage is maintained when increasing the number of training examples in the downstream tasks. We also observe that pre-training is particularly impactful for pitch estimation, which aligns with prior results.
[Table I: Instr. Family (ACC), Instr. Label (ACC), and Partial Ranks Accuracy per method.]
We report in Table I the results for the three downstream tasks, using 1000 training examples. These experiments are run using 10 frequency bands, which corresponds to 10! ≈ 3.6M permutations. In this context, casting permutation inversion as a classification problem can exploit only a small proportion of the possible permutations, as it would otherwise amount to 3.6M classes. Our method, on the other hand, scales very well in this setting. We first observe that random embeddings perform poorly, but they do represent a good baseline, as the difference in performance with the other methods illustrates what is gained from pre-training. Second, this baseline is significantly outperformed when using a fixed set of permutations and a classification loss. We then observe that even with a mean squared error loss, performance on the downstream task is comparable to or better than the fixed-permutation method, and we show that using a ranking loss further increases performance. Furthermore, Figure 3 shows the effect of the number of frequency bands on downstream performance. As the number of permutations grows, overall performance on the downstream task increases, providing better representations, up to 9-10 patches. These results tend to confirm that i) permutation is an interesting pretext task, ii) considering all possible permutations helps build better representations, and iii) a ranking loss is the right choice of loss for such tasks. Finally, we report in the last row of Table I performance on the pretext task. Good performance on the downstream task is often connected to good performance on the pretext task. Here, we measure performance by the ability of the CFN to reorder the shuffled inputs, reporting the proportion of items ranked in the correct position.
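The partial-ranks metric reported in the last row of Table I can be computed as follows (a NumPy sketch; the helper name is ours):

```python
import numpy as np

def partial_ranks_accuracy(scores, true_ranks):
    """Proportion of patches whose predicted rank matches the true rank."""
    predicted = np.argsort(np.argsort(scores)) + 1   # scores -> hard ranks
    return float(np.mean(predicted == np.asarray(true_ranks)))

# Perfectly ordered scores recover every rank; a swapped pair does not.
full = partial_ranks_accuracy([0.2, 0.9, 0.5], [1, 3, 2])   # 1.0
part = partial_ranks_accuracy([0.9, 0.2, 0.5], [1, 3, 2])   # one of three correct
```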
[Table II: Instr. Family (ACC) and Instr. Label (ACC) per slicing strategy.]
III-C Time-frequency structure and permutations
Unlike images, the horizontal and vertical dimensions of a spectrogram are semantically different, representing time and frequency respectively. While prior work on images only exploited square patches, experimenting with audio allows exploring permutations over frequency bands (horizontal patches), time frames (vertical patches), or square time-frequency patches, and comparing the resulting downstream performance. Table II reports a comparison between these three settings. Overall, shuffling along the frequency axis is the best pre-training strategy. These results illustrate a specificity of the dataset: our inputs are single notes, many of them having a harmonic structure. In this context, learning the relations between frequency bands is meaningful both for recognizing which instrument is playing and for identifying which note (pitch) is being played. This also explains the poor performance of slicing along the time axis. Pitch is a time-independent characteristic, so the time structure is not relevant for this task. Moreover, musical notes have an easily identifiable time structure (fast attack and slow decay), which may make the task of reordering time frames trivial. We hypothesize that signals with a richer, non-stationary time structure, such as speech, would benefit more from shuffling time frames.
We present a general pre-training method that uses permutations to learn high-quality representations from spectrograms, and improves the downstream performance of audio classification and regression tasks on musical notes. We demonstrate that our method outperforms previous permutation learning schemes by incorporating fully differentiable ranking as a pretext loss, enabling us to take advantage of all permutations, instead of a small fixed set. In particular, we show significant improvements in low data regimes.
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
- (2020) Learning with differentiable perturbed optimizers. arXiv preprint arXiv:2002.08676.
- (2020) Learning with Fenchel–Young losses. Journal of Machine Learning Research 21 (35), pp. 1–69.
- (2020) Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871.
- (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
- (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
- (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6861–6871.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
- (2017) Neural audio synthesis of musical notes with WaveNet autoencoders.
- (2020) SPICE: self-supervised pitch estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1118–1128.
- (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
- (2019) Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883.
- (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
- (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423.
- (2011) Learning to rank for information retrieval. Springer Science & Business Media.
- (2018) Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665.
- (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
- (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
- (2017) CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225.
- (2020) Multi-task self-supervised learning for robust speech recognition. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6989–6993.
- (2020) Unsupervised pretraining transfers well across languages. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7414–7418.
- (2020) Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7620–7630.
- (2017) DeepPermNet: visual permutation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3949–3957.
- (2019) Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175.
- (2020) Multi-format contrastive learning of audio representations. NeurIPS Workshops (Self-Supervised Learning for Speech and Audio Processing).