I. Introduction
Pretraining representations in an unsupervised way, with subsequent fine-tuning on labelled data, has become the standard approach to extend the performance of deep architectures to applications where annotations are scarce, such as understanding medical images [25], recognizing speech from under-resourced languages [27, 8], or solving specific language inference tasks [10]. Among unsupervised training schemes, self-supervised learning focuses on designing a proxy training objective that requires no annotation, such that the representations learned incidentally along the way generalize well to the task of interest, limiting the amount of labeled data needed for fine-tuning. Such “pretext” tasks, a term coined by Doersch et al. [11], include learning to colorize an artificially grayscaled image [18], inpainting removed patches [24], multimodal data comparisons [26], or recognizing by what angle an original image was rotated [14]. Other approaches to self-supervision include classification of original images after data augmentation [7] and clustering [6]. Many of these methods, however, such as colorization and rotation, cannot be applied to spectrograms for use in audio processing. There has also been recent work using contrastive learning for audio representation learning [31], which requires both the raw waveform and the log-mel spectrogram, and Transformers have also been used for unsupervised learning on audio [20]. Our method is appealing for its simple architecture, the relatively small change to the loss, and its semantically meaningful pretext task.
In this work, we consider the pretext task of reordering patches of the spectrogram of an audio signal, the analogue of solving a jigsaw puzzle, first proposed for images in [23]. In this setting, we first split the input into patches and shuffle them by applying a random permutation. We train a neural network to predict which permutation was applied, taking the shuffled patches as inputs. After pretraining on this pretext task, we use the inner representations learned by the neural network as input features to a low-capacity model (see Figure 1) trained for supervised classification (the downstream task). Recent work has used transfer learning on audio to great effect [2], applying self-supervision to latent vectors, while our work explores the use of a pretext task directly on the input data. Permutations provide a promising avenue for this type of self-supervised learning, as they are conceptually general enough to be applied across a large range of modalities, unlike colorization [18] or rotations [14], which are specific to images.
The encouraging results of [23] when transferring learned image features for object detection and image retrieval inspire us to take this method a step further. However, incorporating permutations into an end-to-end differentiable pipeline is challenging, as permutations are a discontinuous operation. This issue is circumvented in [23] by casting the problem as classification over a fixed subset of permutations. Given that the number of possible permutations of n patches is n!, this approach cannot scale to the full set of permutations, even when n is moderately small. Alternatively, [29] use Sinkhorn normalization to produce doubly stochastic matrices as approximations of the true permutation matrix. While this method can be integrated into an end-to-end pipeline, each forward pass through the model relies on an iterative Sinkhorn algorithm, whose iterations add computational cost and which requires choosing a stopping criterion.
In this work, we leverage recent advances in differentiable ranking [3, 5] to integrate permutations into end-to-end neural training. This allows us to solve the permutation inversion task for the entire set of permutations, removing a bottleneck that was previously sidestepped in ways that limit downstream performance. Moreover, we demonstrate for the first time the effectiveness of permutations as a pretext task on audio, with minimal modality-specific adjustments. In particular, we improve instrument classification and pitch estimation of musical notes by learning to reorder spectrogram frames along the time and frequency axes.
II. Methods
II-A. General methodology
In this section, we present a self-supervised pretext task that predicts the permutation applied to patches of an input. We do so in a manner that allows us to use all possible permutations as targets during training. This pretext task is used for pretraining, and the internal representations learned by the pretext neural network can then be transferred and used on secondary downstream tasks – see Figure 1.
During pretraining, for each audio sequence, we split its spectrogram into n patches. These patches can either be vertical (stacks of time frames), horizontal (stacks of frequency bands), or cut along both axes (stacks of time-frequency bins). We then permute these patches randomly and stack them in a tensor (see Figure 1), which is paired with the applied permutation, encoded as a label of size n (see Section II-B for details on permutation encoding). We pretrain the weights of a neural network to invert this permutation, using a differentiable ranking operator. This operator, and other details of pretraining, are described in Section II-B; the network and data processing are described in Section II-C. After pretraining, the network weights are used to generate embeddings at an intermediate layer. These representations can be used in a downstream task, as input to a low-capacity classifier (see Figure 1).
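For concreteness, a minimal NumPy sketch of this construction is shown below; the function name, the 1-indexed rank convention, and the toy patch shapes are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def make_pretext_example(patches, rng=None):
    """Shuffle pre-sliced patches and return (shuffled_patches, rank_label).

    `patches` has shape (n, ...): n spectrogram patches in their natural order.
    """
    rng = rng or np.random.default_rng()
    n = len(patches)
    perm = rng.permutation(n)                 # permutation applied to the patches
    shuffled = patches[perm]
    # Label: original position (rank, 1-indexed) of each shuffled patch.
    # For the identity permutation this is (1, 2, ..., n); sorting the shuffled
    # patches by these ranks restores the original order.
    ranks = (perm + 1).astype(np.float32)
    return shuffled, ranks

patches = np.random.randn(10, 52, 400)        # toy stack of 10 frequency bands
shuffled, label = make_pretext_example(patches)
print(shuffled.shape, label)
```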
We mostly evaluate our methods by improvements in downstream classification. However, the reordering task can be of interest in itself, as in learning-to-rank problems [21], and we also report generalization performance on this task.
II-B. Differentiable ranking methodology
Our methodology for representation learning relies on the ability to incorporate ordering or ranking operations in an end-to-end differentiable pipeline. This is achieved by using a convenient encoding of permutations of n objects as vectors of n ranks, together with differentiable operators that approximate ranking.
For each permutation, the label y is the vector of ranks, or relative order, of the elements (e.g., (1, 2, ..., n) for the identity permutation). For the model prediction, the last two layers consist of a vector of score values θ and the network output, obtained by applying a differentiable ranking operator to θ.
These operations map any vector of n values to a point in the convex hull of the rank vectors in dimension n (e.g., for n = 4, the convex hull of the 24 permutations of (1, 2, 3, 4)), playing a role akin to that of the softmax operator in a classification setting. We consider here two differentiable ranking operations, one using stochastic perturbations [3] and one using regularization [5]. In either case, embedding the permutations in this manner (rather than, e.g., as permutation matrices or classes) puts more emphasis on the relative positions of the elements and penalizes small index differences less than large ones.
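As an illustration of the perturbation-based operator, the Monte Carlo sketch below computes expected ranks under Gaussian noise on the scores. It is a simplified stand-in for the estimators of [3], with arbitrary noise scale and sample count.

```python
import numpy as np

def hard_ranks(theta):
    """Rank encoding of a score vector (1 = smallest entry)."""
    return np.argsort(np.argsort(theta)) + 1.0

def soft_ranks_perturbed(theta, sigma=0.1, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[ranks(theta + sigma * Z)] with Z ~ N(0, I).

    The result lies in the convex hull of the n! rank vectors instead of
    being a single hard permutation, which is what makes it usable inside
    a differentiable pipeline.
    """
    rng = np.random.default_rng(seed)
    noisy = theta[None, :] + sigma * rng.normal(size=(n_samples, theta.size))
    return np.mean([hard_ranks(row) for row in noisy], axis=0)

theta = np.array([0.3, -1.2, 2.0, 0.1])        # toy score vector (n = 4)
print(hard_ranks(theta))                       # [3. 1. 4. 2.]
print(soft_ranks_perturbed(theta, sigma=0.5))  # fractional ranks close to above
```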
These differentiable ranking tools ensure that our model is end-to-end differentiable and enable us to use all permutations during training. This is unlike the models of [23] and [19], where reordering is reduced to classification by assigning a fixed set of permutations to one-hot vectors. This approach is inherently limited: representing all permutations requires in principle n! classes, which quickly becomes unmanageable even for small values of n.
Pretraining the network parameters requires a loss function between the predicted ranks and the true ranks y. For the operator based on stochastic perturbations, we use the associated Fenchel–Young loss (“Perturbed FY” in empirical results) [4], which acts directly on the scores θ. Its gradient, given by the difference between the predicted soft ranks of θ and the target ranks y, is easy to compute. For the regularized operator of [5], we instead minimize the squared distance between the predicted soft ranks of θ and y (“Fast Soft Ranking” in empirical results).
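A sketch of how the two pretext losses could look under the conventions above, reusing the Monte Carlo soft ranks as a stand-in for the exact operators of [3, 5]; the exact form, scaling, and sign conventions in the paper may differ.

```python
import numpy as np

def soft_ranks(theta, sigma=0.1, n_samples=1000, seed=0):
    """Monte Carlo soft ranks, as in the previous sketch."""
    rng = np.random.default_rng(seed)
    noisy = theta[None, :] + sigma * rng.normal(size=(n_samples, theta.size))
    return (np.argsort(np.argsort(noisy, axis=1), axis=1) + 1.0).mean(axis=0)

def perturbed_fy_gradient(theta, target_ranks):
    """Gradient of the perturbed Fenchel-Young loss w.r.t. the scores:
    predicted soft ranks minus target ranks (up to convention)."""
    return soft_ranks(theta) - target_ranks

def fast_soft_ranking_loss(theta, target_ranks):
    """Squared distance between differentiable ranks and target ranks."""
    diff = soft_ranks(theta) - target_ranks
    return 0.5 * float(diff @ diff)

theta = np.array([0.3, -1.2, 2.0, 0.1])   # scores produced by the network
y = np.array([3.0, 1.0, 4.0, 2.0])        # true rank label of the permutation
print(perturbed_fy_gradient(theta, y), fast_soft_ranking_loss(theta, y))
```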
II-C. Implementation and architecture

Data processing
When constructing the self-supervised task, we slice the inputs into patches. This slicing is controlled by two variables, determining respectively the number of columns and the number of rows of the patch grid. In [23], square patches are used for images. On spectrograms, this choice is conceptually richer: using a single column of patches means slicing into frequency bands, using a single row means slicing along the time axis, and using several rows and columns slices along both axes; see Figure 1 for an illustration.
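A minimal sketch of the three slicing modes is shown below; the function name and the assumption that the spectrogram dimensions divide evenly into the requested grid are ours.

```python
import numpy as np

def slice_spectrogram(spec, n_rows, n_cols):
    """Split a (freq, time) spectrogram into an n_rows x n_cols grid of patches.

    n_rows > 1, n_cols = 1: horizontal patches (frequency bands)
    n_rows = 1, n_cols > 1: vertical patches (time frames)
    n_rows > 1, n_cols > 1: time-frequency patches
    """
    patches = []
    for band in np.split(spec, n_rows, axis=0):         # slice the frequency axis
        patches.extend(np.split(band, n_cols, axis=1))  # then the time axis
    return patches

spec = np.random.randn(520, 400)             # toy (freq, time) spectrogram
print(len(slice_spectrogram(spec, 10, 1)))   # 10 frequency bands
print(len(slice_spectrogram(spec, 1, 10)))   # 10 time frames
print(len(slice_spectrogram(spec, 4, 4)))    # 16 time-frequency patches
```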
Pretraining task
For the reordering pretext task, we use a Context-Free Network (CFN) [23]. This network uses an AlexNet [17] backbone that processes each patch individually while sharing weights across all patches, as shown in Figure 1. By processing each patch independently, but with shared weights, the network cannot rely on global structure. After the shared backbone, the patches are passed together through two fully connected layers. The output layer represents the predicted ranks of the input permutation.
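A minimal tf.keras sketch of this architecture is given below; the small convolutional backbone is only a stand-in for AlexNet, and the layer sizes and patch shape are illustrative assumptions.

```python
import tensorflow as tf

def build_cfn(n_patches=10, patch_shape=(52, 400, 1), embed_dim=256):
    """Context-Free Network sketch: shared per-patch backbone + joint FC head."""
    backbone = tf.keras.Sequential([           # stand-in for the AlexNet backbone
        tf.keras.layers.InputLayer(input_shape=patch_shape),
        tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(embed_dim, activation="relu"),
    ])
    inputs = tf.keras.Input(shape=(n_patches,) + patch_shape)
    # The same backbone (shared weights) processes every patch independently,
    # so no single patch embedding can rely on the global structure.
    per_patch = tf.keras.layers.TimeDistributed(backbone)(inputs)
    flat = tf.keras.layers.Flatten()(per_patch)
    # First aggregate fully connected layer: its activations are the embeddings
    # transferred to the downstream tasks.
    embedding = tf.keras.layers.Dense(512, activation="relu", name="embedding")(flat)
    ranks = tf.keras.layers.Dense(n_patches, name="predicted_ranks")(embedding)
    return tf.keras.Model(inputs, ranks)

model = build_cfn()
model.summary()
```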
Downstream task
In the downstream tasks, we use 3-layer multi-layer perceptrons (MLPs) trained on embeddings extracted at the first aggregate fully connected layer of the pretext network (whose weights are frozen at this stage). The MLP’s output layer is task-dependent. For a regression downstream task, the output of the MLP is a single scalar and the downstream model is trained to minimize the mean squared error. For classification, the output of the MLP is a softmax over the class logits, and we train the downstream model by minimizing the cross-entropy loss.
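A possible shape for this downstream head (tf.keras sketch; hidden sizes, names, and the embedding dimension are arbitrary assumptions):

```python
import tensorflow as tf

def build_downstream_mlp(embedding_dim=512, n_classes=None):
    """3-layer MLP on frozen pretext embeddings; regression if n_classes is None."""
    inputs = tf.keras.Input(shape=(embedding_dim,))
    x = tf.keras.layers.Dense(256, activation="relu")(inputs)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    if n_classes is None:                      # e.g. pitch estimation
        outputs = tf.keras.layers.Dense(1)(x)
        loss = "mse"
    else:                                      # e.g. instrument classification
        outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
        loss = "sparse_categorical_crossentropy"
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model
```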
III. Experiments
We demonstrate the effectiveness of permutation-based pretraining, as measured by downstream test performance on instrument classification and pitch estimation tasks. All experiments are carried out in TensorFlow [1] and run on a single P100 GPU. We also report performance on the pretraining task itself using partial ranks accuracy, the proportion of patches ranked in the correct position.
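This metric can be computed as follows (a small sketch with hypothetical names; the network scores are rounded back to a hard permutation via sorting before comparison):

```python
import numpy as np

def partial_ranks_accuracy(predicted_scores, true_ranks):
    """Fraction of patches placed at their correct position.

    predicted_scores, true_ranks: arrays of shape (batch, n_patches).
    """
    pred_ranks = np.argsort(np.argsort(predicted_scores, axis=1), axis=1) + 1
    return float(np.mean(pred_ranks == true_ranks))

scores = np.array([[0.9, 0.1, 2.3], [1.0, 3.0, 2.0]])
labels = np.array([[2, 1, 3], [1, 3, 2]])
print(partial_ranks_accuracy(scores, labels))  # 1.0: all patches correctly placed
```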
III-A. Experimental setup

The NSynth dataset [12] offers about 300,000 audio samples of musical notes, each with a unique pitch, timbre, and envelope, recorded from 1,006 different instruments. The recordings, sampled at 16 kHz, are 4 seconds long and are used for 3 downstream tasks: predicting the instrument itself (1,006 classes), predicting the instrument family (11 classes), and predicting the pitch of the note (128 values). We formulate pitch estimation as a regression task and report the mean squared error (MSE). The input representation is a log-compressed spectrogram, computed over 25 ms windows with a 10 ms stride and 513 frequency bins. The 2D structure of the spectrogram allows us to use a 2D convolutional neural network, as is done with images.
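For reference, such features can be computed along the following lines (tf.signal sketch; at 16 kHz, a 25 ms window is 400 samples and a 10 ms stride is 160 samples, and a 1024-point FFT yields 513 bins; the exact log compression used in the paper is an assumption).

```python
import tensorflow as tf

def log_spectrogram(waveform, sample_rate=16000):
    """Log-compressed magnitude spectrogram: 25 ms windows, 10 ms stride, 513 bins."""
    frame_length = int(0.025 * sample_rate)   # 400 samples
    frame_step = int(0.010 * sample_rate)     # 160 samples
    stft = tf.signal.stft(waveform, frame_length=frame_length,
                          frame_step=frame_step, fft_length=1024)
    return tf.math.log(tf.abs(stft) + 1e-6)   # shape: (n_frames, 513)

wav = tf.random.normal([4 * 16000])           # a 4-second NSynth-like signal
print(log_spectrogram(wav).shape)
```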
We train our CFN with an AlexNet backbone on the pretext task of predicting the applied permutations for 1,000 epochs, over mini-batches of size 32, using the Adam optimizer [16]. We then evaluate the downstream generalization performance on the 3 NSynth tasks by replacing the last layers of the network with a task-specific 3-layer MLP and replacing the random permutation with the identity. We compare the different variants of our method (number and nature of the patches) with 2 baseline alternatives: i) training the downstream head on an untrained encoder (Random Embedding) and ii) solving the same pretraining task but using instead a finite set of 100 permutations, as proposed by [23] (Fixed Permutation). We also compare different losses for training the permutation pretext task: a) cross-entropy (XE) when learning over 100 permutations, b) mean squared error (MSE), c) soft ranking via perturbations (Perturbed FY), and d) soft ranking (Fast Soft Ranking).
III-B. Empirical results
First, we compare the different methods across several downstream data regimes and report the results in Figure 2. Here, all pretraining models slice the spectrogram over the frequency axis, as this gives the best performance (see Section III-C for an ablation study). Additionally, the experiments in this figure use 10 patches for all methods and 1,000 permutations for the XE method. We observe that in the low-data regime our method strongly outperforms an end-to-end fully supervised model. Moreover, this advantage is maintained when increasing the number of training examples in the downstream tasks. We also observe that pretraining is particularly impactful for pitch estimation, which aligns with the results of [13].
TABLE I
Task | Random Embedding | Fixed Permutation | Fast Soft Ranking | Perturbed FY | MSE
Instr. Family (ACC)
Instr. Label (ACC)
Pitch (MSE)
Partial Ranks Accuracy
We report in Table I the results for the three downstream tasks, using 1,000 training examples. Those experiments are run using 10 frequency bands, which corresponds to 10! = 3,628,800 possible permutations. In this context, casting permutation inversion as a classification problem can only exploit a small proportion of the possible permutations, as it would otherwise amount to 3.6M classes. Our method, on the other hand, scales very well in this setting. We first observe that random embeddings perform poorly but represent a useful baseline, as the difference in performance with the other methods illustrates what is gained from pretraining. Second, this baseline is significantly outperformed when using a fixed set of permutations and a classification loss, as in [23]. We then observe that even with a mean squared error loss, the performance on the downstream tasks is comparable to or better than that of the fixed-permutation method, and using a ranking loss further increases performance. Furthermore, Figure 3 shows the effect of the number of frequency bands on the downstream performance. As the number of permutations grows, the overall performance on the downstream tasks increases, indicating better representations, up to 9 to 10 patches. These results tend to confirm that i) permutation inversion is an interesting pretext task, ii) considering all possible permutations helps build better representations, and iii) a ranking loss is the right choice for this task. We also report, in the last row of Table I, performance on the pretext task. Good performance on the downstream task is often connected to good performance on the pretext task. Here, we measure pretext performance by the ability of the CFN to reorder the shuffled inputs, reporting the proportion of patches ranked in the correct position.
TABLE II
Task | Frequency | Time | Time-Frequency
Instr. Family (ACC)
Instr. Label (ACC)
Pitch (MSE)
III-C. Time-frequency structure and permutations
Unlike images, the horizontal and vertical dimensions of a spectrogram are semantically different, respectively representing time and frequency. While [23] only exploited square patches, experimenting with audio allows exploring permutations over frequency bands (horizontal patches), time frames (vertical patches), or square time-frequency patches, and comparing the resulting downstream performance. Table II reports a comparison between these three settings. Overall, shuffling along the frequency axis is the best pretraining strategy. These results illustrate a specificity of the dataset: our inputs are single notes, many of them with a harmonic structure. In this context, learning the relations between frequency bands is meaningful both for recognizing which instrument is playing and for identifying which note (pitch) is being played. This also explains the poor performance of slicing along the time axis. Pitch is a time-independent characteristic, so the time structure is not relevant for this task. Moreover, musical notes have an easily identifiable time structure (fast attack and slow decay), which may make the task of reordering time frames trivial. We hypothesize that signals with a richer, non-stationary time structure, such as speech, would benefit more from shuffling time frames.
IV. Conclusion
We present a general pretraining method that uses permutations to learn high-quality representations from spectrograms, and improves the downstream performance of audio classification and regression tasks on musical notes. We demonstrate that our method outperforms previous permutation-learning schemes by incorporating fully differentiable ranking as a pretext loss, enabling us to take advantage of all permutations instead of a small fixed set. In particular, we show significant improvements in low-data regimes.
References

[1] (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
[2] (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
[3] (2020) Learning with differentiable perturbed optimizers. arXiv preprint arXiv:2002.08676.
[4] (2020) Learning with Fenchel-Young losses. Journal of Machine Learning Research 21 (35), pp. 1–69.
[5] (2020) Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871.
[6] (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
[7] (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
[8] (2020) Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
[9] (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6861–6871.
[10] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[11] (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
[12] (2017) Neural audio synthesis of musical notes with WaveNet autoencoders. arXiv preprint arXiv:1704.01279.
[13] (2020) SPICE: self-supervised pitch estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1118–1128.
[14] (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
[15] (2019) Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.
[16] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[17] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[18] (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883.
[19] (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
[20] (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423.
[21] (2011) Learning to rank for information retrieval. Springer Science & Business Media.
[22] (2018) Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665.
[23] (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
[24] (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
[25] (2017) CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225.
[26] (2020) Multi-task self-supervised learning for robust speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6989–6993.
[27] (2020) Unsupervised pretraining transfers well across languages. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7414–7418.
[28] (2020) Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7620–7630.
[29] (2017) DeepPermNet: visual permutation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3949–3957.
[30] (2019) Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175.
[31] (2020) Multi-format contrastive learning of audio representations. NeurIPS Workshops (Self-Supervised Learning for Speech and Audio Processing).