Self-Supervised Learning of Audio Representations from Permutations with Differentiable Ranking

by Andrew N. Carr, et al.

Self-supervised pre-training using so-called "pretext" tasks has recently shown impressive performance across a wide range of modalities. In this work, we advance self-supervised learning from permutations, by pre-training a model to reorder shuffled parts of the spectrogram of an audio signal, to improve downstream classification performance. We make two main contributions. First, we overcome the main challenges of integrating permutation inversions into an end-to-end training scheme, using recent advances in differentiable ranking. This was heretofore sidestepped by casting the reordering task as classification, fundamentally reducing the space of permutations that can be exploited. Our experiments validate that learning from all possible permutations improves the quality of the pre-trained representations over using a limited, fixed set. Second, we show that inverting permutations is a meaningful pretext task for learning audio representations in an unsupervised fashion. In particular, we improve instrument classification and pitch estimation of musical notes by reordering spectrogram patches in the time-frequency space.



I Introduction

Fig. 1: Our framework. [Bottom] Patches are generated and permuted on the fly. The network is pre-trained on the associated permutation. Dotted layers indicate weight sharing across input patches. Embeddings used for downstream tasks are extracted by removing the network’s last few layers. [Top] Downstream training is achieved by freezing the weights of the network up to the embeddings (in blue), and training different shallow classifiers for a variety of tasks. Further, permutations as a self-supervised technique can handle a variety of slicing methods with no significant changes to the network architecture.

Pre-training representations in an unsupervised way, with subsequent fine-tuning on labeled data, has become the standard way to extend the performance of deep architectures to applications where annotations are scarce, such as understanding medical images [25], recognizing speech from under-resourced languages [27, 8], or solving specific language inference tasks [10]. Among unsupervised training schemes, self-supervised learning focuses on designing a proxy training objective that requires no annotation, such that the representations incidentally learned will generalize well to the task of interest, limiting the amount of labeled data needed for fine-tuning. Such “pretext” tasks, a term coined by Doersch et al. [11], include learning to colorize an artificially gray-scaled image [18], inpainting removed patches [24], multi-modal data comparisons [26], or recognizing by what angle an original image was rotated [14]. Other approaches for self-supervision include classification of original images after data augmentation [7] and clustering [6]. Many of these methods, however, such as colorization and rotation prediction, cannot be applied to spectrograms for use in audio processing. There has also been recent work using contrastive learning for audio representation learning [31], which requires both the raw waveform and the log-mel spectrogram. Transformers have also been used [20] for unsupervised learning on audio. Our method is beneficial because of its simple architecture, the relatively easy change to the loss, and its semantically meaningful pretext task.

In this work, we consider the pretext task of reordering patches of the spectrogram of an audio signal, first proposed for images in [23], the analogue of solving a jigsaw puzzle. In this setting, we first split the input into patches and shuffle them by applying a random permutation. We train a neural network to predict which permutation was applied, taking the shuffled patches as inputs. After pre-training with this pretext task, we use the inner representations learned by the neural network as input features to a low-capacity model (see Figure 1) trained for supervised classification (the downstream task). Recent work has used transfer learning on audio to great effect [2], using self-supervision on the latent vectors, while our work explores the use of a pretext task directly on the input data. Permutations provide a promising avenue for this type of self-supervised learning, as they are conceptually general enough to be applied across a large range of modalities, unlike colorization [18] or rotations [14], which are specific to images. The encouraging results of [23] when transferring learned image features for object detection and image retrieval inspire us to take this method a step further. However, including permutations in an end-to-end differentiable pipeline is challenging, as permutations are a discontinuous operation. This issue is circumvented in [23] by casting the problem as classification over a fixed subset of permutations. Given that the number of possible permutations of $n$ patches is $n!$, this approach cannot scale to the full set of permutations, even when $n$ is moderately small. Alternatively, [29] use Sinkhorn normalization to produce doubly stochastic matrices as approximations of the true permutation matrix. While this method can be integrated into an end-to-end pipeline, each forward pass through the model relies on an iterative Sinkhorn algorithm, of which each iteration has an $O(n^2)$ cost and which requires choosing a stopping criterion.

In this work, we leverage recent advances in differentiable ranking [3, 5] to integrate permutations into end-to-end neural training. This allows us to solve the permutation inversion task for the entire set of permutations, removing a bottleneck that was heretofore sidestepped in ways that limit downstream performance. Moreover, we demonstrate for the first time the effectiveness of permutations as a pretext task on audio with minimal modality-specific adjustments. In particular, we improve instrument classification and pitch estimation of musical notes by learning to reorder spectrogram frames over the time and frequency axes.

The rest of the paper is organized as follows. In Section II we present the problem formulation and methods. In Section III we demonstrate the effectiveness of our system on instrument classification and pitch estimation of musical notes.

II Methods

II-A General methodology

In this section, we present a self-supervised pretext task that predicts the permutation applied to patches of an input. We do so in a manner that allows us to use all possible permutations as targets during training. This pretext task is performed for pre-training, and the internal representation learned by the pretext neural network can be transferred and used on secondary downstream tasks (see Figure 1).

During pre-training, for each audio sequence, we split its spectrogram into $n$ patches. These patches can either be vertical (stacks of time frames), horizontal (stacks of frequency bands), or on both axes (stacks of time-frequency bins). We then permute these patches randomly, and stack them in a tensor of dimension (see Figure 1), which is paired with the applied permutation as a label of size $n$ (see Section II-B for details on permutation encoding).
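As an illustration of this data preparation, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a spectrogram stored as a (frequency, time) array, slices it into frequency bands, drops leftover bins so that all bands have equal size, and uses 1-based ranks as the label. The function name `make_pretext_example` and these conventions are hypothetical.

```python
import numpy as np

def make_pretext_example(spectrogram, n_patches, rng):
    """Slice a (freq, time) spectrogram into frequency bands, shuffle them,
    and return the stacked shuffled patches with their rank label."""
    n_freq, _ = spectrogram.shape
    band = n_freq // n_patches
    # Assumption: drop trailing bins so the frequency axis divides evenly.
    trimmed = spectrogram[: band * n_patches]
    # patches[k] is the k-th frequency band, shape (band, n_time).
    patches = np.stack(np.split(trimmed, n_patches, axis=0))
    perm = rng.permutation(n_patches)
    shuffled = patches[perm]          # shuffled[i] == patches[perm[i]]
    ranks = perm + 1                  # 1-based rank of each shuffled patch
    return shuffled, ranks

rng = np.random.default_rng(0)
spec = rng.random((513, 398))         # stand-in for a real log-spectrogram
x, y = make_pretext_example(spec, n_patches=10, rng=rng)

# Sorting the shuffled patches by their rank label undoes the permutation.
restored = x[np.argsort(y)]
```

Sorting by the label recovers the original band order, which is the inversion the network is asked to learn from the patch contents alone.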

We pre-train the weights of a neural network to invert this permutation, using a differentiable ranking operator $y^*_\varepsilon$. This operator and other details of pre-training are described in Section II-B; the network and data processing are described in Section II-C. After pre-training, the network weights are used to generate embeddings at an intermediate layer. These representations can be used in a downstream task, as input to a low-capacity classifier (see Figure 1).

We mostly evaluate our methods by improvements in downstream classification. However, the reordering task can be of interest in itself, as in learning-to-rank problems [21], and we also report generalization performance in this task.

II-B Differentiable ranking methodology

Our methodology for representation learning relies on the ability to incorporate ordering or ranking operations in an end-to-end differentiable pipeline. This is achieved by using a convenient encoding of the permutations of $n$ objects in $\mathbb{R}^n$, and differentiable operators that approximate ranking.

For each permutation, the label $y$ is a vector of ranks, or relative order, of the $n$ elements (e.g. $y = (1, 2, \dots, n)$ for the identity permutation). For the model prediction, the last two layers consist of a vector of score values $\theta \in \mathbb{R}^n$ and network outputs $y^*_\varepsilon(\theta) \in \mathbb{R}^n$, computed with a differentiable ranking operator $y^*_\varepsilon$.

These operations map any vector of $n$ values to a point in the convex hull of permutation encodings in dimension $n$ (e.g. over 4 elements), akin to a softmax operator in a classification setting. We consider here two differentiable ranking operations, using either stochastic perturbations [3] or regularization [5]. In either case, embedding the permutations in this manner (rather than as, e.g., permutation matrices or classes) puts more emphasis on the relative position of the elements, and enables us to penalize small index differences less heavily.
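To make the convex-hull picture concrete, here is a hedged sketch of a perturbation-based soft ranking in the spirit of [3]: hard ranks are averaged over noisy copies of the scores. The Gaussian noise, the Monte Carlo sample count, the value of `epsilon`, and the 1-based rank convention are illustrative assumptions, and the gradient estimation needed for training is not shown.

```python
import numpy as np

def hard_ranks(theta):
    """1-based ranks along the last axis: the smallest score gets rank 1."""
    return np.argsort(np.argsort(theta, axis=-1), axis=-1) + 1

def perturbed_ranks(theta, epsilon=0.5, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[hard_ranks(theta + epsilon * Z)], Z Gaussian."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_samples, theta.shape[0]))
    return hard_ranks(theta + epsilon * noise).mean(axis=0)

theta = np.array([0.3, -1.2, 2.0, 0.4])
soft = perturbed_ranks(theta)
```

Every sampled rank vector sums to 1 + 2 + 3 + 4 = 10, so the average does too, and each coordinate stays between 1 and 4. This is what places `soft` inside the convex hull of the permutation encodings, while nearby scores such as 0.3 and 0.4 receive nearly equal soft ranks.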

These tools ensure that our model is end-to-end differentiable, and enable us to use all permutations in training. This is unlike the models of [23] and [19], where reordering is reduced to classification, assigning a set of permutations to one-hot vectors. This approach is inherently limited: representing all the permutations requires in principle $n!$ classes, which quickly becomes unmanageable, even for small values of $n$.

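Pre-training the network parameters $w$ requires a loss function between the network output $y^*_\varepsilon(\theta)$ and the label $y$, where $\theta$ denotes the vector of scores. For the version of $y^*_\varepsilon$ based on stochastic perturbations, we use the associated Fenchel–Young loss (“Perturbed F-Y” in empirical results) [4], which acts directly on $\theta$ and is written here as $L_{\mathrm{FY}}(\theta; y)$. Its gradients, given by $\nabla_\theta L_{\mathrm{FY}}(\theta; y) = y^*_\varepsilon(\theta) - y$, are easy to compute. For the regularized version of [5], we use $\frac{1}{2} \| y^*_\varepsilon(\theta) - y \|^2$ (“Fast Soft Ranking” in empirical results).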

We opt for these two losses for their good theoretical properties and complexity. Other choices [29, 22, 9, 30, 28, 15] are also possible, potentially with higher computational cost, or regions with zero gradient.
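A useful property of Fenchel-Young losses built on perturbed optimizers [3, 4] is that the gradient with respect to the scores is the difference between the soft prediction and the target. The sketch below is a toy example rather than the authors' training code: it applies plain gradient descent to a free score vector using that gradient. The step size, noise scale, sample count, and 1-based rank convention are arbitrary assumptions.

```python
import numpy as np

def perturbed_ranks(theta, epsilon, noise):
    """Average 1-based hard ranks of theta + epsilon * noise over samples."""
    perturbed = theta + epsilon * noise
    ranks = np.argsort(np.argsort(perturbed, axis=-1), axis=-1) + 1
    return ranks.mean(axis=0)

def fy_gradient(theta, y, epsilon, noise):
    """Gradient of the Fenchel-Young loss w.r.t. the scores:
    soft prediction minus target ranks."""
    return perturbed_ranks(theta, epsilon, noise) - y

rng = np.random.default_rng(0)
y = np.array([3.0, 1.0, 4.0, 2.0])    # target ranks (hypothetical label)
theta = np.zeros(4)                   # free scores, standing in for a network

for _ in range(200):
    noise = rng.standard_normal((256, 4))
    theta -= 0.1 * fy_gradient(theta, y, epsilon=0.5, noise=noise)

# Hard ranks of the learned scores, to compare against the target.
final_ranks = np.argsort(np.argsort(theta)) + 1
```

No derivative of the ranking operation itself is needed here, which is one reason this loss is convenient in practice. In the full model the same gradient would be backpropagated through the network that produces the scores.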

II-C Implementation and architecture


When constructing the self-supervised task, we slice inputs into patches. This slicing is controlled by two variables $n_x$ and $n_y$, determining respectively the number of columns and rows used. In [23], square patches are used for images. On spectrograms, this choice is conceptually richer, as using $n_x = 1$ means slicing frequency bands, while using $n_y = 1$ means slicing along the time axis. Using both $n_x > 1$ and $n_y > 1$ allows slicing along both axes; see Figure 1 for an illustration.
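The three slicing modes can be sketched with a single helper. This is an illustrative NumPy version under stated assumptions: the spectrogram is a (frequency, time) array, so rows index frequency and columns index time, leftover rows and columns are dropped, and the names `slice_grid`, `n_cols`, and `n_rows` are hypothetical.

```python
import numpy as np

def slice_grid(spectrogram, n_cols, n_rows):
    """Cut a (freq, time) array into n_rows * n_cols patches (row-major)."""
    n_freq, n_time = spectrogram.shape
    h, w = n_freq // n_rows, n_time // n_cols
    # Assumption: trim so both axes divide evenly into patches.
    trimmed = spectrogram[: h * n_rows, : w * n_cols]
    # (n_rows, h, n_cols, w) -> (n_rows, n_cols, h, w) -> (n_patches, h, w)
    grid = trimmed.reshape(n_rows, h, n_cols, w).swapaxes(1, 2)
    return grid.reshape(n_rows * n_cols, h, w)

spec = np.arange(48).reshape(6, 8)                # toy 6-bin, 8-frame input
bands = slice_grid(spec, n_cols=1, n_rows=3)      # frequency bands
frames = slice_grid(spec, n_cols=4, n_rows=1)     # time frames
tiles = slice_grid(spec, n_cols=2, n_rows=3)      # time-frequency patches
```

With one column the patches are full-width horizontal strips (frequency bands), and with one row they are full-height vertical strips (time frames), so the rest of the pipeline only sees a stack of patches whatever the slicing.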

Pre-training task

For the reordering pretext task, we use a Context Free Network (CFN) from [23]. This network uses an AlexNet [17] backbone which processes each patch individually while sharing weights across all patches as shown in Figure 1. By processing each patch independently, but with shared weights, the network cannot rely on global structure. After the shared backbone, the patches are passed together through two fully connected layers. The output layer represents the predicted ranks of the input permutation.
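The data flow of such a network can be sketched as follows. This is a deliberately simplified stand-in, not the CFN of [23]: a single dense ReLU layer replaces the AlexNet backbone, and all layer sizes and weight scales are hypothetical. It only illustrates weight sharing across patches followed by two fully connected layers on the concatenated features.

```python
import numpy as np

def cfn_forward(patches, backbone_w, fc1_w, fc2_w):
    """patches: (n, h, w). Returns (embedding, one score per patch)."""
    n = patches.shape[0]
    flat = patches.reshape(n, -1)                   # (n, h*w)
    # The same backbone weights are applied to every patch (weight sharing).
    features = np.maximum(flat @ backbone_w, 0.0)   # (n, d)
    joint = features.reshape(-1)                    # concatenation, (n*d,)
    embedding = np.maximum(joint @ fc1_w, 0.0)      # first aggregate FC layer
    scores = embedding @ fc2_w                      # (n,), fed to the ranking op
    return embedding, scores

rng = np.random.default_rng(0)
n, h, w, d, e = 10, 51, 40, 16, 32                  # hypothetical sizes
patches = rng.standard_normal((n, h, w))
backbone_w = rng.normal(0.0, 0.01, (h * w, d))
fc1_w = rng.normal(0.0, 0.1, (n * d, e))
fc2_w = rng.normal(0.0, 0.1, (e, n))
embedding, scores = cfn_forward(patches, backbone_w, fc1_w, fc2_w)
```

Because each patch is encoded before any mixing happens, information about the relative position of patches can only be combined in the fully connected layers.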

Downstream task

In the downstream tasks we use 3-layer multi-layer perceptrons (MLP) trained on embeddings extracted at the first aggregate fully connected layer of the pretext network (whose weights are frozen during this part). The MLP’s output layer is task-dependent. For a regression downstream task, the output of the MLP is a single scalar and the downstream model is trained to minimize mean-squared error. For classification, the output of the MLP is a softmax over the class logits, and we train the downstream model by minimizing the cross-entropy loss.
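The two downstream heads differ only in their output layer and loss, as the following sketch shows. It assumes that "3-layer" means three weight layers with ReLU activations, and uses hypothetical embedding and hidden sizes with random stand-in embeddings. Training of the MLP weights is omitted.

```python
import numpy as np

def init_mlp(sizes, rng):
    """One (weight, bias) pair per layer."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, layers):
    """Two ReLU hidden layers followed by a linear output layer."""
    h = x
    for w, b in layers[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = layers[-1]
    return h @ w + b

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def mean_squared_error(pred, target):
    return float(np.mean((pred.squeeze(-1) - target) ** 2))

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 64))            # stand-in for frozen embeddings
family_head = init_mlp([64, 32, 32, 11], rng) # 11 instrument families
pitch_head = init_mlp([64, 32, 32, 1], rng)   # scalar pitch regression

family_logits = mlp_forward(emb, family_head)
pitch_pred = mlp_forward(emb, pitch_head)
ce = cross_entropy(family_logits, rng.integers(0, 11, size=8))
err = mean_squared_error(pitch_pred, rng.uniform(0.0, 127.0, size=8))
```

Since the encoder is frozen, only the small head is fitted, so downstream performance mostly reflects the quality of the embeddings.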

Fig. 2: Performance of our permutation-based pretraining over 3 audio tasks when varying the number of data points in the downstream task. Higher is better for Instrument Family and Instrument Label while lower is better in Pitch prediction.

III Experiments

We demonstrate the effectiveness of permutation-based pre-training as measured by the test accuracy in instrument classification and pitch estimation tasks. All experiments are carried out in TensorFlow [1] and run on a single P100 GPU. We also report the performance on the pre-training task using partial ranks, i.e. the proportion of patches ranked in the correct position.
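The partial ranks metric can be computed as in the sketch below. It assumes that predicted ranks are obtained by hard-ranking the network's scores and that ranks are 1-based; the function name and the toy values are hypothetical.

```python
import numpy as np

def partial_rank_accuracy(scores, labels):
    """Fraction of patches whose predicted rank equals the label rank."""
    predicted = np.argsort(np.argsort(scores, axis=-1), axis=-1) + 1
    return float(np.mean(predicted == labels))

# Two toy examples with 4 patches each.
scores = np.array([[0.1, 2.3, 1.0, -0.5],
                   [0.9, 0.2, 0.4, 0.7]])
labels = np.array([[2, 4, 3, 1],
                   [4, 2, 1, 3]])
acc = partial_rank_accuracy(scores, labels)   # 6 of 8 positions match: 0.75
```

Unlike exact-match accuracy over whole permutations, this metric gives partial credit: the second example has two patches swapped and still contributes two correct positions.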

III-A Experimental setup.

The NSynth dataset [12] offers about 300,000 audio samples of musical notes, each with a unique pitch, timbre, and envelope, recorded from 1,006 different instruments. The recordings, sampled at 16 kHz, are 4 seconds long and can be used for 3 downstream tasks: predicting the instrument itself (1,006 classes), predicting the instrument family (11 classes), and predicting the pitch of the note (128 values). We formulate pitch estimation as a regression task and report the mean squared error (MSE). The input representation is a log-compressed spectrogram, computed over 25 ms windows with a 10 ms stride and 513 frequency bins. The 2D structure of the spectrogram allows us to use a 2D convolutional neural network, as is done with images. We train our CFN with an AlexNet backbone on the pre-training task of predicting applied permutations for 1000 epochs, over mini-batches of size 32 and with an Adam optimizer [16] with a learning rate of . We then evaluate the downstream generalization performance over the 3 NSynth tasks, by replacing the last layers of the network with a task-specific 3-layer MLP and replacing the random permutation with the identity.
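The stated front-end parameters translate into concrete array sizes, as the following sketch works out. At 16 kHz, a 25 ms window is 400 samples and a 10 ms stride is 160 samples, and 513 bins correspond to a 1024-point real FFT. The Hann window, the absence of padding, and `log1p` as the log compression are assumptions, since the text does not specify them.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE * 25 // 1000    # 25 ms -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000       # 10 ms -> 160 samples
N_FFT = 1024                         # real FFT gives 1024 // 2 + 1 = 513 bins

def log_spectrogram(waveform):
    """Return a (513, n_frames) log-compressed magnitude spectrogram."""
    n_frames = 1 + (len(waveform) - WINDOW) // HOP
    # Index matrix selecting each overlapping frame of the waveform.
    idx = HOP * np.arange(n_frames)[:, None] + np.arange(WINDOW)[None, :]
    frames = waveform[idx] * np.hanning(WINDOW)
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    return np.log1p(magnitude).T

rng = np.random.default_rng(0)
note = rng.standard_normal(4 * SAMPLE_RATE)   # stand-in for a 4-second note
spec = log_spectrogram(note)                  # 513 bins by 398 frames
```

Under these assumptions a 4-second note gives 1 + (64000 - 400) // 160 = 398 frames, so slicing into 10 frequency bands yields patches of roughly 51 bins by 398 frames.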

We compare the different variants of our method (number and nature of the patches) with 2 baseline alternatives: i) training the downstream head on an untrained encoder (Random Embedding) and ii) solving the same pre-training task but using instead a finite set of 100 permutations as proposed by [23] (Fixed Permutation). We also compare different losses to train the permutation pretext task: a) cross entropy (XE) when learning over 100 permutations, b) MSE loss (MSE), c) soft ranking via perturbations (Perturbed F-Y) and d) soft ranking (Fast Soft Ranking).

III-B Empirical results.

First, we compare the different methods across several downstream data regimes and report the results in Figure 2. Here, all pre-training models slice the spectrogram over the frequency axis, as it gives the best performance (see Section III-C for an ablation study). Additionally, the experiments in this figure use 10 patches for all methods and 1000 permutations for the XE method. We observe that in the low data regime our method strongly outperforms an end-to-end fully supervised model. Moreover, this advantage is maintained when increasing the number of training examples in the downstream tasks. We also observe that pre-training is particularly impactful for pitch estimation, which aligns with the results of [13].

Fig. 3: Performance on the downstream tasks, as a function of the number of frequency patches used for pre-training.

Task | Random Embedding | Fixed Permutation | Fast Soft Ranking | Perturbed F-Y | MSE
Instr. Family (ACC)
Instr. Label (ACC)
Pitch (MSE)
Partial Ranks Accuracy - -
TABLE I: Performance on three downstream tasks with 1000 downstream data points taken from NSynth. ACC stands for accuracy, and MSE for mean squared error.

We report in Table I the results for the three downstream tasks, using 1000 training examples. Those experiments are run using 10 frequency bands, which corresponds to $10! \approx 3.6$M permutations. In this context, casting permutation inversion as a classification problem is limited to exploiting only a small proportion of possible permutations, as it would otherwise amount to 3.6M classes. On the other hand, our method scales very well in this setting. We first observe that random embeddings perform poorly but provide a useful baseline, as the difference in performance with other methods illustrates what is gained from pre-training. Second, the baseline is significantly outperformed when using a fixed set of permutations and a classification loss, as in [23]. We then observe that even with a mean squared error loss, the performance on the downstream task is comparable to or better than the fixed permutation method, and we show that using a ranking loss further increases the performance. Furthermore, Fig. 3 shows the effect of the number of frequency bands on the downstream performance. As the number of permutations grows, the overall performance on the downstream task increases, providing better representations, up to 9-10 patches. These results tend to confirm that i) permutation is an interesting pretext task, ii) considering all possible permutations helps build better representations and iii) a ranking loss is the right choice of loss for such tasks. We then report in the last row of Table I the performance on the pretext task. Good performance on the downstream task is often connected to good performance on the pretext task. Here, we measure performance by the ability of the CFN to reorder the shuffled inputs, reporting the proportion of items ranked in the correct position.
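The scale gap between a fixed set and the full set of permutations is easy to check. The short computation below compares the fixed-set sizes mentioned in the experiments (100 and 1000 permutations) with the number of orderings of 10 patches; the variable names are illustrative.

```python
import math

n_patches = 10
n_permutations = math.factorial(n_patches)    # 10! = 3,628,800

# Share of all orderings covered by a fixed classification set.
coverage = {size: size / n_permutations for size in (100, 1000)}

# Growth of the label space with the number of patches.
growth = {n: math.factorial(n) for n in range(2, 11)}
```

Even the larger fixed set covers well under 0.1% of the possible orderings, whereas a rank-vector output of size 10 represents all of them without adding any classes.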

Task | Frequency | Time | Time-Frequency
Instr. Family (ACC)
Instr. Label (ACC)
Pitch (MSE)
TABLE II: Slicing biases on NSynth downstream tasks.

III-C Time-frequency structure and permutations

Unlike images, the horizontal and vertical dimensions of a spectrogram are semantically different, respectively representing time and frequency. While [23] only exploited square patches, experimenting with audio allows exploring permutations over frequency bands (horizontal patches), time frames (vertical patches) or square time-frequency patches, and comparing the resulting downstream performance. Table II reports a comparison between these three settings. Overall, shuffling along the frequency axis is the best pre-training strategy. These results illustrate a specificity of the dataset: our inputs are single notes, many of them having a harmonic structure. In this context, learning the relation between frequency bands is meaningful for recognizing both which instrument is playing and which note (pitch) is being played. This also explains the poor performance of slicing along the time axis. Pitch is a time-independent characteristic, so the time structure is not relevant for this task. Moreover, musical notes have an easily identifiable time structure (fast attack and slow decay), which may make the task of reordering time frames trivial. We hypothesize that signals with a richer, non-stationary time structure, such as speech, would benefit more from shuffling time frames.

IV Conclusion

We present a general pre-training method that uses permutations to learn high-quality representations from spectrograms, and improves the downstream performance of audio classification and regression tasks on musical notes. We demonstrate that our method outperforms previous permutation learning schemes by incorporating fully differentiable ranking as a pretext loss, enabling us to take advantage of all permutations, instead of a small fixed set. In particular, we show significant improvements in low data regimes.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283.
  • [2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
  • [3] Q. Berthet, M. Blondel, O. Teboul, M. Cuturi, J. Vert, and F. Bach (2020) Learning with differentiable perturbed optimizers. arXiv preprint arXiv:2002.08676.
  • [4] M. Blondel, A. F. Martins, and V. Niculae (2020) Learning with fenchel-young losses. Journal of Machine Learning Research 21 (35), pp. 1–69.
  • [5] M. Blondel, O. Teboul, Q. Berthet, and J. Djolonga (2020) Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
  • [8] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
  • [9] M. Cuturi, O. Teboul, and J. Vert (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6861–6871.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [11] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430.
  • [12] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi (2017) Neural audio synthesis of musical notes with wavenet autoencoders. arXiv preprint arXiv:1704.01279.
  • [13] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović (2020) SPICE: self-supervised pitch estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1118–1128.
  • [14] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • [15] A. Grover, E. Wang, A. Zweig, and S. Ermon (2019) Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  • [18] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883.
  • [19] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
  • [20] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423.
  • [21] T. Liu (2011) Learning to rank for information retrieval. Springer Science & Business Media.
  • [22] G. Mena, D. Belanger, S. Linderman, and J. Snoek (2018) Learning latent permutations with gumbel-sinkhorn networks. arXiv preprint arXiv:1802.08665.
  • [23] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
  • [24] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544.
  • [25] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.
  • [26] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio (2020) Multi-task self-supervised learning for robust speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6989–6993.
  • [27] M. Rivière, A. Joulin, P. Mazaré, and E. Dupoux (2020) Unsupervised pretraining transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7414–7418.
  • [28] M. Rolínek, V. Musil, A. Paulus, M. Vlastelica, C. Michaelis, and G. Martius (2020) Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7620–7630.
  • [29] R. Santa Cruz, B. Fernando, A. Cherian, and S. Gould (2017) Deeppermnet: visual permutation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3949–3957.
  • [30] M. Vlastelica, A. Paulus, V. Musil, G. Martius, and M. Rolínek (2019) Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175.
  • [31] L. Wang and A. van den Oord (2020) Multi-format contrastive learning of audio representations. NeurIPS Workshops (Self-Supervised Learning for Speech and Audio Processing).