Are Nearby Neighbors Relatives?: Diagnosing Deep Music Embedding Spaces

by Jaehun Kim, et al.

Deep neural networks have frequently been used to directly learn representations useful for a given task from raw input data. In terms of overall performance metrics, machine learning solutions employing deep representations frequently have been reported to greatly outperform those using hand-crafted feature representations. At the same time, they may pick up on aspects that are predominant in the data, yet not actually meaningful or interpretable. In this paper, we therefore propose a systematic way to diagnose the trustworthiness of deep music representations, considering musical semantics. The underlying assumption is that in case a deep representation is to be trusted, distance consistency between known related points should be maintained both in the input audio space and corresponding latent deep space. We generate known related points through semantically meaningful transformations, both considering imperceptible and graver transformations. Then, we examine within- and between-space distance consistencies, both considering audio space and latent embedded space, the latter either being a result of a conventional feature extractor or a deep encoder. We illustrate how our method, as a complement to task-specific performance, provides interpretable insight into what a network may have captured from training data signals.




1 Introduction

Music audio is a complex signal. Frequencies in the signal usually belong to multiple pitches, which are organized harmonically and rhythmically, and often originate from multiple acoustic sources in the presence of noise. When solving tasks in the Music Information Retrieval (MIR) field, within this noisy signal, the optimal subset of information needs to be found that leads to quantifiable and musical descriptors. Commonly, this process is handled by pipelines exploiting a wide range of signal processing and machine learning algorithms. Beyond the use of hand-crafted music representations, which are informed by human domain knowledge, as an alternative, deep music representations have emerged, that are trained by employing deep neural networks (DNNs) and massive amounts of training data observations. Such deep representations are usually reported to outperform hand-crafted representations [1, 2, 3, 4].

Figure 1: Simplified example illustrating distance assumption within a space. Circles without a cross indicate music clips. Yellow circles with crosses refer to hardly perceptible transformations of the yellow original clip. The top-right transformation, marked with a red outer circle, actually is closer to another original clip (green) than to its own original (yellow), which violates the assumption it should be closest to its original, and hence may be seen as an error-inducing transformation under a nearest-neighbor scheme.

At the same time, the performance of MIR systems may be vulnerable to subtle input manipulation. The addition of small noise may lead to unexpected random behavior, regardless of whether traditional or deep models are used [5, 6, 7, 8]. In a similar line of thought, in the broader deep learning (DL) community, increasing attention is given to adversarial examples that are barely distinguishable from original samples, but greatly impact a network’s performance [9, 8].

So far, the sensitivity of representations with respect to subtle input changes has mostly been tested in relation to dedicated machine learning tasks (e.g. object recognition, music genre classification), and examined by investigating whether these input changes cause performance drops. When purely considering the question of whether relevant input signal information can automatically be encoded into a representation, and to what extent the representation can be deemed ‘reliable’, the learned representation should in principle be general and useful to different types of tasks. Therefore, in this work, we will not focus on the performance obtained by using a learned representation for certain machine learning tasks, but rather on a systematic way to verify assumptions on distance relationships between several representation spaces: the audio space and the learned space.

Inspired by [5], we will also investigate the effect of musical and acoustic transformations of audio input signals, in combination with an arbitrary encoder of the input signal, which may be either a conventional feature extractor or a deep learning-based encoder. In doing this, we make the following major assumptions, whose rationale is illustrated in Figure 1:

  (i) if a small, humanly imperceptible transformation is introduced, the distance between the original and transformed signal should be very small, both in the audio and encoded space.

  (ii) however, if a more grave transformation is introduced, the distance between the original and transformed signal should be larger, both in the audio and encoded space.

  (iii) since an encoder obtained under a machine learning framework will have optimized its encoding behavior based on the task and the data, the relational structure of signals will be morphed with respect to them as well.

To examine the above assumptions, we seek to answer the following research questions:

  RQ 1. Do assumptions (i) and (ii) hold for conventional and deep learning-based encoders?

  RQ 2. Does assumption (iii) hold for music-related tasks and corresponding datasets, especially when deep learning is applied?

With this work, we intend to offer directions towards a complementary evaluation method for deep machine learning pipelines, that focuses on space diagnosis rather than the troubleshooting of pipeline output. Our intention is that this will yield the researcher additional insight into the reliability and potential semantic sensitivities of deep learned spaces.

In the remainder of this paper, we first describe our approach, including the details of the learning setup (Section 2) and the methodology to assess distance consistency (Section 3), followed by the experimental setup (Section 4). Further, we report the results of our experiments (Section 5). Afterwards, we discuss the results and conclude this work (Section 6).

2 Learning

To diagnose a deep music representation space, such a space should first exist. For this, one needs to find a learnable deep encoder $f_\theta$ that transforms the input audio representation $x$ to a latent vector $z = f_\theta(x)$, while taking into account the desired output for a given learning task. The learning of $f_\theta$ can be done by adjusting the parametrization $\theta$ to optimize the objective function, which should be defined in accordance with the given task.

2.1 Tasks

We consider representations learned for four different tasks. By doing this, we take a broad range of problems into account that are particularly common in the MIR field: Autoencoder (AE), music auto-tagging (AT), predominant instrument recognition (IR), and finally singing voice separation (VS). AE is a representative task for unsupervised learning using DNNs, and AT is a popular supervised learning task in the MIR field [3, 10, 11, 12, 13, 14]. AT is a multi-label classification problem, in which individual labels are not always mutually exclusive and often highly inter-correlated. As such, it can be seen as a more challenging problem than IR, which is a single-label classification problem. Furthermore, IR labels involve instruments, which can be seen as more objective and taxonomically stable labels than e.g. genres or moods. Finally, VS is a task that can be formulated as a regression problem, which learns a mask to segregate a certain region of interest out of a given signal mixture.

2.1.1 Autoencoder

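The objective of an Autoencoder is to find a set of encoder and decoder functions, minimizing the reconstruction error given by:

$$\mathcal{L}^{AE} = \frac{1}{|X|} \sum_{x \in X} \lVert x - \hat{x} \rVert_2^2 \qquad (1)$$

Here, $\hat{x} = g_\phi(f_\theta(x))$ is the output of the decoding function $g$ parameterized by $\phi$, and $X$ is the given set of training samples.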

2.1.2 Music Auto-Tagging

The typical approach to music auto-tagging using DNNs is to consider the problem as a multi-label classification problem, for which the objective is to minimize the binary cross-entropy of each music tag $k$, which is expressed as follows:

$$\mathcal{L}^{AT} = -\sum_{k} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right] \qquad (2)$$

where $y_k$ is the binary label that indicates whether tag $k$ is related to the input audio signal, and $\hat{y}_k$ is the output of function $h_\psi$, the prediction layer with sigmoid nonlinearity that transforms the deep representation $z$ into the prediction $\hat{y}$, parameterized by $\psi$. The optimal functions $f$ and $h$ are found by adjusting $\theta$ and $\psi$ such that (2) is minimized.
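As a minimal sketch of the binary cross-entropy objective described above, the snippet below computes the loss for a single clip from raw prediction-layer scores. The function names and the clamping constant `eps` are illustrative assumptions, not taken from the original implementation.

```python
import math

def sigmoid(v):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def binary_cross_entropy(labels, scores, eps=1e-12):
    """Sum of per-tag binary cross-entropy terms for one clip.

    labels: list of 0/1 tag indicators.
    scores: list of raw (pre-sigmoid) outputs of the prediction layer.
    eps:    assumed clamping constant that keeps log() finite.
    """
    total = 0.0
    for y, s in zip(labels, scores):
        p = min(max(sigmoid(s), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Three tags: two present, one absent.
example_loss = binary_cross_entropy([1, 0, 1], [2.0, -1.0, 0.0])
```

Because every tag contributes its own term, several tags can be active for the same clip, which is what makes this a multi-label formulation.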

2.1.3 Predominant Musical Instrument Recognition

The learning of the IR task can be formulated as a single-label, multi-class classification, which often aims at minimizing the following objective function, especially in the context of neural network learning:

$$\mathcal{L}^{IR} = -\sum_{k} y_k \log \hat{y}_k \qquad (3)$$

where $k$ is an instrument label. In general, the learning process is the same as for (2), except that the output posterior distribution $\hat{y}$ is derived as a categorical distribution by a transformation such as the softmax function $\hat{y}_k = \exp(o_k) / \sum_{j} \exp(o_j)$, where $o$ is the output of $h$.
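The single-label case can be sketched in the same spirit: a softmax turns the prediction-layer outputs into a categorical distribution, and the loss is the negative log-probability of the true instrument. This is an illustrative sketch; the names are hypothetical and the max-subtraction is a standard numerical-stability trick rather than something stated in the text.

```python
import math

def softmax(logits):
    """Convert raw outputs into a categorical distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target_index, logits):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])

probs = softmax([1.0, 2.0, 3.0])
```

In contrast to the multi-label tagging loss, the probabilities here compete: raising one class necessarily lowers the others.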

2.1.4 Singing Voice Separation

There are multiple ways to set up an objective function for a source separation task. It can be achieved by simply applying the reconstruction error of (1) between the output of the network and the desired isolated signal, or, as introduced in [15], one can learn a mask that segments the target component from the mixture as follows:

$$\mathcal{L}^{VS} = \lVert x^{s} - x \odot \hat{m} \rVert \qquad (4)$$

where $x^{s}$ is the raw-level representation of the isolated signal, which serves as the regression target, $x$ is the representation of the original input mixture, and $\odot$ refers to element-wise multiplication. $\hat{m}$ is the mask inferred by the network, of which the elements are bounded in the range $[0, 1]$ such that they can be used for the separation of the target source. Note that both the input $x$ and the estimated target source $x \odot \hat{m}$ are magnitude spectra, so we use the original phase of the input to reconstruct a time-domain signal.
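The masking step, together with the reuse of the mixture phase, can be illustrated on a toy complex spectrogram. This is only a sketch under assumed data layouts (nested lists of complex STFT values and of mask values); it is not the authors' code.

```python
import cmath

def apply_mask(mixture_stft, mask):
    """Scale each mixture magnitude by its mask value and keep the mixture phase.

    mixture_stft: 2-D list of complex STFT coefficients of the mixture.
    mask:         2-D list of values in [0, 1] with the same shape.
    """
    separated = []
    for stft_row, mask_row in zip(mixture_stft, mask):
        out_row = []
        for coeff, m in zip(stft_row, mask_row):
            magnitude, phase = cmath.polar(coeff)
            out_row.append(cmath.rect(m * magnitude, phase))
        separated.append(out_row)
    return separated

toy = apply_mask([[3 + 4j, 1j]], [[0.5, 0.0]])
```

The resulting complex coefficients could then be passed to an inverse STFT to obtain a time-domain estimate of the target source.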

2.2 Network Architectures

The architecture of a DNN determines the overall structure of the network, which defines the details of the desired patterns to be captured by the learning process [16]. In other words, it reflects the way in which a network should interpret a given input data representation. In this work, we use a VGG-like architecture, one of the most popular and general architectures frequently employed in the MIR field.

[Table 1 body: a "Layers" column listing six stacked blocks, each followed by BN and ReLU, five of them explicitly convolutional (Conv), next to an "Output shape" column; filter sizes and output shapes are missing.]
Table 1: Employed network architectures. A decoder is constructed reversing the layers: convolution (Conv) and fully-connected (FC) layers are transposed, and pooling layers repeat the maximum input values in the pooling window.

The VGG-like architecture is a Convolutional Neural Network (CNN) architecture introduced by [17, 18], which employs tiny rectangular filters. Successes of VGG-like architectures have been reported not only for computer vision tasks, but also in various MIR fields [3, 8]. The detailed architecture design used in our work can be found in Table 1.

2.3 Architecture and Learning Details

For both architectures, we used Rectified Linear Units (ReLU) for the nonlinearity, and Batch Normalization (BN) in every convolutional and fully-connected layer for fast training and regularization [20]. We use ADAM [21] as the optimization algorithm during training, with the learning rate fixed across all models. We trained models with respect to their objective function, which requires different optimization strategies. Nonetheless, we kept all other factors identical, except the number of epochs per task, which inherently depends on the dataset and the task. The termination point of the training is set manually, where the validation loss either reaches a plateau or starts to increase. More specifically, training was stopped at a task-specific epoch for each of the AE, AT, IR and VS tasks.

Figure 2: Network architecture used in this work. The left half of the model is the encoder pipeline, whose architecture is kept the same across all the tasks of our experiments. The pink vertical bar represents the latent space of the encoder, in which all the measures we propose are tested. The right half of the diagram refers to the four different prediction pipelines with respect to the tasks. The top block describes the decoder and the error function of the AE task (where, for simplicity, detailed illustrations of the decoder are omitted). The second and third blocks represent the AT and IR tasks, respectively. Here, the smaller pink bar represents the terminal layer for the prediction of the posterior distribution for music tags or musical instruments. Finally, the lowest block describes the mask predictor, the prediction process and the way the error function is calculated. This architecture also includes the skip-connections from each convolution block of the encoder, which are the key characteristic of the U-Net [22].

3 Measuring Distance Consistency

In this work, among the set of potential representation spaces $\mathcal{S}$, we consider two representation spaces of interest: the audio input space $\mathcal{A}$ and the latent embedding space $\mathcal{Z}$. $\mathcal{M}$ indicates the set of different models that can be considered within a given space. For all relevant spaces, we will assess space reliability by examining the distance consistency with respect to a set of transformations $\mathcal{T}$.

In Section 3.1, we describe how distance consistency will be measured. Section 3.2 will discuss the distance measures that will be used, while Section 3.3 discusses what transformations will be adopted in our experiments.

3.1 Distance Consistency

For distance consistency, we will compute within-space consistency and between-space consistency.

3.1.1 Within-Space Consistency

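We first obtain the transformed samples $\tilde{x} = \tau(x)$, with $\tau \in \mathcal{T}$, from all $x$ belonging to the test set $X_{test}$. Afterwards, we calculate the error function $\xi$ of each transformed sample as follows:

$$\xi(\tilde{x}) = \begin{cases} 1 & \text{if } \min_{x' \in X^{m} \setminus \{x\}} d(\tilde{x}, x') < d(\tilde{x}, x) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

Here, we see whether a transformed sample is closer to any other original sample than its own original sample. $x$ is a representation of a single music excerpt and $X^{m}$ is the set of all the points on the given space determined by the model $m \in \mathcal{M}$. All neural network encoders belong to the set of models we tested in this work. $d$ is a distance function belonging to the set of distance measures considered for the given space.

As $\xi$ indicates how unreliable the space is at the clip level, the within-space consistency can be defined as the complement of its mean:

$$C_W = 1 - \bar{\xi} \qquad (6)$$

where $\bar{\xi}$ is the mean of the array of faults $\xi(\tilde{x})$ over all transformed samples.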
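The within-space measure described above amounts to a nearest-neighbor check followed by averaging. A minimal sketch, assuming points are plain coordinate tuples and that `originals[i]` and `transformed[i]` form a pair (both conventions are illustrative assumptions):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def within_space_consistency(originals, transformed, dist=euclidean):
    """Return (consistency, errors) for paired original/transformed points.

    A transformed point counts as an error (1) when some other original
    is strictly closer to it than its own original.
    """
    errors = []
    for i, t in enumerate(transformed):
        own = dist(t, originals[i])
        others = [dist(t, o) for j, o in enumerate(originals) if j != i]
        errors.append(1 if others and min(others) < own else 0)
    return 1.0 - sum(errors) / len(errors), errors

originals = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
transformed = [(0.5, 0.0), (4.0, 0.0), (0.0, 9.0)]
consistency, errors = within_space_consistency(originals, transformed)
```

In this toy case the second transformed point has drifted closer to the first original than to its own, so one of three clips is counted as a fault. Any other distance function with the same call signature can be passed as `dist`.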

3.1.2 Between-Space Consistency

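To measure consistency between the associated spaces, one can measure how they are correlated. The distances between a transformed point and its original sample will be used as characteristic information to make comparisons between spaces. As mentioned above, we consider two specific spaces: the audio input space $\mathcal{A}$ and the embedding space $\mathcal{Z}$. Consequently, we can calculate the correlation of distances for the points belonging to each subset of spaces as follows:

$$C_B^{\rho} = \rho(\delta^{\mathcal{A}}, \delta^{\mathcal{Z}}) \qquad (7)$$

where $\rho$ is Spearman’s rank correlation, and $\delta$ refers to the array of distances $d(\tilde{x}, x)$ between each transformed sample and its original in the respective space.

On the other hand, one can also simply measure the agreement between distances using the binary accuracy between the fault arrays $\xi^{\mathcal{A}}$ and $\xi^{\mathcal{Z}}$, which is given by:

$$C_B^{acc} = \mathrm{acc}(\xi^{\mathcal{A}}, \xi^{\mathcal{Z}}) \qquad (8)$$

where $\mathrm{acc}$ denotes the binary accuracy function.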
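Both between-space measures can be sketched with the standard library alone. The Spearman correlation below is computed as the Pearson correlation of average ranks (so ties are handled), and the binary accuracy is the fraction of clips on which the two fault arrays agree. The function names are illustrative, and the sketch assumes neither input is constant (otherwise the correlation is undefined).

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation between two distance arrays."""
    return pearson(average_ranks(a), average_ranks(b))

def binary_accuracy(faults_a, faults_b):
    """Fraction of positions where two 0/1 fault arrays agree."""
    return sum(1 for p, q in zip(faults_a, faults_b) if p == q) / len(faults_a)
```

A rank correlation only asks whether the two spaces order the transformed clips similarly, so it is insensitive to the very different scales of audio-domain and embedding-domain distances.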

3.2 Distance Measures

The main assessment of this work is based on distance comparisons between original clip fragments and their transformations, both in audio and embedded space. To the best of our knowledge, few general methods have been developed to calculate the distance between raw audio representations of music signals directly. Therefore, we choose to calculate the distance between audio samples using time-frequency representations as a proxy for the true distance between the music signals. More specifically, we use Mel Frequency Cepstral Coefficients (MFCCs) with 25 coefficients, dropping the first coefficient when the actual distance is calculated. We employ two distance measures in the audio domain:

  • Dynamic Time Warping (DTW) is a well-known dynamic programming method for calculating similarities between time series. For our experiments, we use the FastDTW implementation [23].

  • Similarity Matrix Profile (SiMPle) [24] measures the similarity between two given music recordings using a similarity join. We take the median of the profile array as the overall distance between two audio signals.
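For intuition about the first measure, the sketch below implements the plain quadratic-time DTW recurrence. Note that this is not FastDTW, which is a faster approximation of the same quantity; the function name and the scalar toy sequences are illustrative assumptions (in the setting above each element would be an MFCC frame and `dist` a vector distance).

```python
def dtw_distance(seq_a, seq_b, dist):
    """Classic dynamic-programming DTW cost between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat an element of seq_b
                                 cost[i][j - 1],      # repeat an element of seq_a
                                 cost[i - 1][j - 1])  # advance both sequences
    return cost[n][m]

def abs_diff(p, q):
    """Toy local distance for scalar sequences."""
    return abs(p - q)

stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3], abs_diff)
```

The stretched toy sequence aligns at zero cost, which illustrates why a warping-based distance can be expected to tolerate tempo changes.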

For deep embedding space, since any deep representation of input is encoded as a fixed length vector in our models, we adopted two general distance measures for vectors: Euclidean distance and cosine distance.
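The two vector distances can be written in a few lines; the helper names below are illustrative, and the cosine variant assumes non-zero vectors.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

The practical difference is that cosine distance ignores vector length: `[1, 0]` and `[2, 0]` are one unit apart in Euclidean terms but have zero cosine distance.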

3.3 Transformations

In this subsection, we describe the details of the transformations we employed in our experiment. In all cases, we will consider a range from very small, humanly imperceptible transformations, up to transformations within the same category that should be large enough to become humanly noticeable. While it is not trivial to set an upper bound for the transformation magnitudes, at which a transformed sample may be recognized as a ‘different’ song from the original, we introduce a reasonable range of magnitudes, such that we can investigate the overall robustness of our target encoders as transformations become more grave. The selected range for each transformation is illustrated in Figure 3.

Figure 3: The selected range of magnitudes with respect to the transformations. Each row indicates a transformation category; each dot represents the selected magnitudes. We selected relatively more points in the range in which transformations should have small effect, except for the case of MP3 compression. Here, we tested all the possible transformations (kb/s levels) as supported by the compression software we employed. The red vertical lines indicate the position of the original sample with respect to the transformation magnitudes. For TS and PS, these consider no transformation; for PN, EN and MP, they consider the transformation magnitude that will be closest to the original sample.
  • Noise: As a randomized transformation, we applied both pink noise (PN) and environmental noise (EN) transformations. More specifically, for EN, we used noise recorded in a bar, as collected from freesound. The test range of the magnitude, expressed in terms of Signal to Noise Ratio, spans from -15 dB to 30 dB, with denser sampling for high Signal to Noise Ratios (which are situations in which transformed signals should be very close to the original signal) [25]. This strategy is also adopted for the rest of the transformations.

  • Tempo Shift: We applied a tempo shift (TS), transforming a signal to a new tempo, ranging from 30% to 150% of the original tempo. Therefore, we both slow down and speed up the signal. Close to the original tempo, we employed a step size of 2%, as a -2% and 2% tempo change has been considered as an irrelevant slowdown or speedup in previous work [5]. We employed an implementation using a phase vocoder and resampling algorithm.

  • Pitch Shift: We also employed a pitch shift (PS), changing the pitch of a signal, making it lower or higher. Close to the original pitch, we consider transformation steps (in cents) that are 50% smaller than the error bound considered in the MIREX challenge of multiple fundamental frequency estimation & tracking [26]. Beyond a difference of 1 semitone with respect to the original, we use whole-tone interval steps.

  • Compression: For compression (MP), we simply compress the original audio sample using an MP3 encoder, taking all kb/s compression rates as provided by the FFmpeg software [27].
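Two of the magnitude scales used above can be made concrete with small helpers: scaling a noise signal so that the mixture reaches a target Signal to Noise Ratio, and converting a pitch shift in cents to a frequency ratio. This is a sketch only; the function names are hypothetical, the signals are plain lists of samples of equal length, and the uniform random noise merely stands in for the pink and environmental noise used in the experiments.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the result has the requested SNR in dB."""
    gain = math.sqrt(mean_power(signal) / (mean_power(noise) * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]

def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift given in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)

rng = random.Random(0)
tone = [math.sin(0.1 * i) for i in range(1000)]
stand_in_noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(tone, stand_in_noise, 30.0)
```

At 30 dB the added noise carries one thousandth of the signal power, which is why that end of the range is treated as the near-original condition.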

For the rest of the paper, for brevity, we use OG as the acronym of the original samples.

4 Experiment

4.1 Audio Pre-processing

For the input time-frequency representation to the DNNs, we use the dB-scale magnitude STFT matrix. For the calculation, the audio was resampled at 22,050 Hz. The window and overlap size are 1,024 and 256 respectively. This leads to a frequency-axis dimensionality of 513, only taking positive frequencies into account. The standardization over the frequency axis is applied by taking the mean and the standard deviation of all magnitude spectra in the training set.
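The frequency-axis size and the per-bin standardization can be sketched as follows. The sketch assumes that the FFT length equals the 1,024-sample window and that statistics are computed independently per frequency bin; the helper names and the small constant `eps` are illustrative assumptions.

```python
import math

def frequency_bins(n_fft):
    """Number of non-negative frequency bins of a real-input FFT."""
    return n_fft // 2 + 1

def fit_bin_statistics(frames):
    """Per-bin mean and standard deviation over a list of magnitude spectra."""
    n, n_bins = len(frames), len(frames[0])
    means = [sum(f[b] for f in frames) / n for b in range(n_bins)]
    stds = [math.sqrt(sum((f[b] - means[b]) ** 2 for f in frames) / n)
            for b in range(n_bins)]
    return means, stds

def standardize(frame, means, stds, eps=1e-8):
    """Z-score one spectrum with training-set statistics (eps avoids division by zero)."""
    return [(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)]

training_frames = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
means, stds = fit_bin_statistics(training_frames)
```

The statistics are fitted on training data only and then reused unchanged for validation and test excerpts.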

Also, we use short excerpts of the original input audio track, each spanning approximately 2 seconds in the setup we used. Each batch of excerpts is randomly cropped from 24 randomly chosen music clips before being served to the training loop.

When applying the transformations, it turned out that some of the libraries we used did not only apply the transformation, but also changed the loudness of the transformed signal. To mitigate this, and only consider the actual transformation of interest, we applied a loudness normalization based on the EBU-R 128 loudness measure [28]. More specifically, we calculated the mean loudness of the original sample, and then ensured that transformed audio samples would have equal mean loudness to their original.
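The idea of matching the transformed signal to the loudness of its original can be illustrated with a much simpler level measure. Note the substitution: the sketch below uses plain RMS level in dB as a stand-in, whereas EBU-R 128 loudness additionally involves frequency weighting and gating, which are not modeled here. The function names are hypothetical and both signals are assumed to be non-silent.

```python
import math

def rms_db(samples):
    """RMS level in dB; a simplified stand-in for an EBU-R 128 loudness value."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)

def match_level(original, transformed):
    """Scale the transformed signal so its level equals that of the original."""
    gain_db = rms_db(original) - rms_db(transformed)
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s for s in transformed]

reference = [0.5, -0.5, 0.5, -0.5]
quieter = [0.1, -0.1, 0.1, -0.1]
matched = match_level(reference, quieter)
```

With a proper loudness meter, only the measurement function would change; the gain computation stays the same.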

4.2 Baseline

Beyond deep encoders, we also consider a conventional feature extractor: MFCCs, as also used in [10]. The MFCC extractor can also be seen as an encoder, that projects raw audio measurements into a latent embedding space, where the projection was hand-crafted by humans to be perceptually meaningful.

We first calculate the first- and second-order time derivatives of the given MFCCs and then take the mean and standard deviation over the time axis, for the original and its derivatives. Finally, we concatenate all statistics into one vector. Using the 25 coefficients excluding the first coefficient, we obtain the MFCC-based representation $z^{MFCC}$ for all the points in $X$. For the AT task, we trained a dedicated prediction layer for auto-tagging, with the same objective as Eq. 2, while $z$ is substituted by $z^{MFCC}$.
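The baseline embedding described above can be sketched on a matrix of MFCC frames. Two simplifications are assumed here: derivatives are taken as plain frame-to-frame differences (audio toolkits usually use a smoothed regression instead), and the population standard deviation is used. Under these assumptions, 24 retained coefficients times three streams times two statistics gives a 144-dimensional vector.

```python
import statistics

def delta(frames):
    """First-order time difference, with a zero frame prepended to keep the length."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def summarize(frames):
    """Per-coefficient mean and standard deviation over time, concatenated."""
    columns = [[f[k] for f in frames] for k in range(len(frames[0]))]
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means + stds

def mfcc_embedding(mfcc_frames):
    """Statistics of MFCCs and their first and second differences as one vector."""
    trimmed = [f[1:] for f in mfcc_frames]  # drop the first coefficient
    d1 = delta(trimmed)
    d2 = delta(d1)
    return summarize(trimmed) + summarize(d1) + summarize(d2)

# Toy input: 10 frames of 25 coefficients each.
toy_frames = [[float(t + k) for k in range(25)] for t in range(10)]
embedding = mfcc_embedding(toy_frames)
```

Because the statistics collapse the time axis, every clip maps to a vector of the same length regardless of its duration.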

4.3 Dataset

We use a subset of the Million Song Dataset (MSD) [29] both for training and testing of the AT and AE tasks. The number of training samples is 71,512, bootstrapped from an original subset of 102,161 samples. For the test set, we used 1,000 excerpts randomly sampled from 1,000 preview clips which are not used at training time. As suggested in [3], we used the top social tags.

As for the IR task, we choose to use the training set of the IRMAS dataset [30], which contains 6,705 audio clips of 3-second polyphonic mixtures of music audio, from more than 2,000 songs. The predominant instrument of each short excerpt is labeled. As excerpts may have been clipped from a single song multiple times, we split the dataset into training, validation and test sets at the song level, to avoid unwanted bleeding among splits.
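A song-level split can be sketched by shuffling song identifiers rather than clips, and then assigning every clip to the split of its song. The split ratios, the seed, and the `(clip_id, song_id)` data layout below are hypothetical; the text does not state the proportions that were used.

```python
import random

def split_by_song(clips, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split (clip_id, song_id) pairs so that no song spans two splits.

    ratios: assumed train/validation/test proportions (illustrative only).
    """
    songs = sorted({song for _, song in clips})
    random.Random(seed).shuffle(songs)
    n_train = int(ratios[0] * len(songs))
    n_valid = int(ratios[1] * len(songs))
    groups = {
        "train": set(songs[:n_train]),
        "valid": set(songs[n_train:n_train + n_valid]),
        "test": set(songs[n_train + n_valid:]),
    }
    return {name: [clip for clip, song in clips if song in members]
            for name, members in groups.items()}

# Toy data: 30 clips, three clips per song.
toy_clips = [(f"clip{i}", f"song{i // 3}") for i in range(30)]
splits = split_by_song(toy_clips)
```

Splitting at the clip level instead would let excerpts of the same recording appear in both training and test data, which tends to inflate test scores.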

Finally, for VS, we employed the MUSDB18 dataset [31]. This dataset is developed for musical blind source separation tasks, and has been used in public benchmarking challenges [32]. The dataset consists of 150 unique full-length songs, both with mixtures and isolated sources of selected instrument groups: vocals, bass, drums and other. Originally, the dataset is split into a training and test set; we split the training set into a training and validation set (with a 7:3 ratio), to secure validation monitoring capability.

Note that since we use different datasets with respect to the tasks, the measurements we investigate will also depend on the datasets and tasks. However, across tasks, we always use the same encoder architecture, such that comparisons between tasks can still validly be made.

4.4 Performance Measures

As introduced in Section 3, we use distance consistency measures as the primary evaluation criterion of our work. Next to this, we also measure the performance per employed learning task. For the AE task, the Mean Square Error (MSE) is used as a measure of reconstruction error. For the AT task, we apply a measure derived from the popular Area Under ROC Curve (AUC): more specifically, we apply a clip-wise AUC, averaging the AUC measure over clips. As for the IR task, we choose to use accuracy. Finally, as for the VS task, we choose to use the Signal to Distortion Ratio (SDR), which is one of the evaluation measures used in the original benchmarking campaign. For this, we employ the public software as released by the benchmark organizers. Beyond SDR, this software suite can also calculate 3 more evaluation measures (Image to Spatial distortion Ratio (ISR), Source to Interference Ratio (SIR), Sources to Artifacts Ratios (SAR)). In this study, we choose to only employ SDR: the other metrics consider spatial distortion, which is irrelevant to our experimental setup, in which we only use mono sources.

5 Results

In the following subsections, we present the major analysis results for task-specific performance, within-space consistency, and finally, between-space consistency. Shared conclusions and discussions following from our observations will be presented in Section 6.

5.1 Task-Specific Performance

Figure 4: Task-specific performance results. Blue and yellow curves indicate the performance of different encoders for each task, over the range of magnitudes with respect to the transformations. The performance of original samples is indicated as dotted horizontal lines. For the remainder of the paper, including this figure, all the confidence intervals are computed with 1,000 bootstraps at the 95% level.
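As an illustration of how such an interval can be obtained, the following sketch computes a percentile bootstrap confidence interval for a mean. The percentile variant, the fixed seed and the function name are assumptions; the text only states the number of bootstraps and the confidence level.

```python
import random

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int((1.0 - level) / 2.0 * n_boot)]
    upper = means[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

scores = [1.0, 2.0, 3.0, 4.0, 5.0] * 20
ci_low, ci_high = bootstrap_ci(scores)
```

The same routine applies to any per-clip statistic, for example the 0/1 fault indicators behind a consistency value.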

To analyze task-specific performance, we ran predictions for the original samples in the test set, as well as their transformations, using all transformations with all the magnitudes we selected. The overall results, grouped by transformation, task and encoder, are illustrated in Figure 4. For the most part, we observe similar degradation patterns within the same transformation type. For instance, in the presence of PN and EN transformations, performance decreases in a characteristic non-linear fashion as more noise is added. The exception seems to be the AE task, which shows somewhat unique trends with a more distinct difference between encoders. In particular, when EN is introduced, performance increases with the severity of the transformation. This is likely to be caused by the fact that the environmental noise that we employed is semantically irrelevant for the other tasks, thus causing a degradation in performance. However, because the AE task just reconstructs the given input audio regardless of the semantic context, and the environmental noise that we use is likely not as complex as music or pink noise, the overall reconstruction gets better.

To better understand the effect of transformations, we fitted a Generalized Additive Model (GAM) on the data, using as predictors the main effects of the task, encoder and transformation, along with their two-factor interactions. Because the relationship between performance and transformation magnitude is very characteristic in each case, we included an additional spline term to smooth the effect of the magnitude for every combination of transformation, task and encoder. In addition, and given the clear heterogeneity of distributions across tasks, we standardized performance scores using the within-task mean and standard deviation scores. Furthermore, MSE scores in the AE task are reversed, so that higher scores imply better performance. The analysis model explains most of the variability.
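The score preparation described above (sign reversal for error-type scores, then within-task standardization) can be sketched as follows. The dictionary layout, the function name and the use of the population standard deviation are assumptions made for illustration; each task is assumed to have non-constant scores.

```python
import statistics

def standardize_scores(scores_by_task, reversed_tasks=("AE",)):
    """Z-score performance values within each task.

    Scores of tasks in reversed_tasks are negated first, so that a lower
    error (e.g. MSE) becomes a higher standardized score.
    """
    result = {}
    for task, scores in scores_by_task.items():
        sign = -1.0 if task in reversed_tasks else 1.0
        signed = [sign * s for s in scores]
        mean = statistics.mean(signed)
        std = statistics.pstdev(signed)
        result[task] = [(s - mean) / std for s in signed]
    return result

standardized = standardize_scores({
    "AE": [0.1, 0.2, 0.3],   # MSE: lower is better
    "IR": [0.6, 0.7, 0.8],   # accuracy: higher is better
})
```

After this step, scores from tasks with very different native scales can enter a single model on comparable footing.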

An Analysis of Variance (ANOVA) using the marginalized effects clearly reveals that the largest effect is due to the encoders, as evidenced by Figure 4. Indeed, the VGG-like network and MFCCs differ markedly in estimated mean performance, expressed in standardized units. The second largest effect is the interaction between transformation and task, mainly because of the VS task. Comparing the VGG-like and MFCC encoders on the same task, the largest performance differences appear in the AE task, with VS showing the smallest differences. This suggests that MFCCs lose a substantial amount of information required for reconstruction, while a neural network is capable of maintaining sufficient information to do a reconstruction task. The smallest performance differences in the VS task mostly relate to the performance of the VGG-like encoder, which shows substantial performance degradation in response to the transformations. Figure 5 shows the estimated mean performance.

Figure 5: Estimated marginal mean of standardized performance by encoders and tasks, with 95% confidence intervals. Blue points and brown points indicate the performance of MFCC and VGG-like, respectively.

5.2 Within-Space Consistency

In terms of within-space consistency, we first examine the original audio space $\mathcal{A}$. As depicted in Figure 6, both the DTW and SiMPle measures show very high consistency for small transformations. As transformations have higher magnitude, as expected, the consistency decreases, but at different rates, depending on the transformation. The clear exception is the TS transformation, where both measures, and in particular DTW, are highly robust to the magnitude of the shift. This result implies that the explicit consideration of temporal dynamics by both measures can be beneficial.

Figure 6: Within-space consistency by transformation on the audio space $\mathcal{A}$. Each curve indicates the within-space consistency.

With respect to the within-space consistency of the latent space, Figures 7 and 8 depict the results for both the Euclidean and cosine distance measures. In general, the trends are similar to those found in Figure 6. For analysis, we fitted a similar GAM model, including the main effect of the transformation and task, their interaction, and a smoother for the magnitude of each transformation within each task. When modeling consistency with respect to Euclidean distance, an ANOVA on this analysis model shows very similar effects due to transformation and due to tasks, with a smaller effect of the interaction. In particular, the model confirms the observation from the plots that the MFCC encoder has significantly higher consistency than the others. For the VGG-like cases, AT shows the highest consistency, followed by IR, VS and lastly by AE. As Figure 8 shows, all these differences are statistically significant.

A similar model to analyze consistency with respect to the cosine distance yielded very similar results. However, the effect of the task was larger than the effect of the transformation, indicating that the cosine distance is slightly more robust to transformations than the Euclidean distance.

Figure 7: Within-space consistency by transformation on the latent space $\mathcal{Z}$. Each curve indicates the within-space consistency by task and transformation. The gray curves indicate the within-space consistency on the audio space $\mathcal{A}$, taken as a weak upper bound for the consistency in the latent space. Confidence intervals are drawn at the 95% level. Points indicate individual observations from different trials.
Figure 8: Estimated marginal mean within-space consistency in the latent domain. Confidence intervals are at the 95% level.

To investigate observed effects more intuitively, we visualize in Figure 9 the original dataset samples and their smallest transformations, which should be barely perceptible or imperceptible to human ears [5, 8, 26], in a 2-dimensional space, using t-SNE [33]. (Footnote: The smallest transformations are cents in PS, in TS, 30 dB in PN and EN, and 192 kb/s in MP.) In MFCC space (Figure 9), the distributions of colored points, corresponding to each of the transformation categories, are virtually identical to those of the original points. This matches our assumption that very subtle transformations, which humans will not easily recognize, should stay very close to the original points. Therefore, if the latent embedded space had high consistency with respect to the audio space, the distribution of colored points should be virtually identical to the distribution of original points. However, this is certainly not the case for neural networks, especially for tasks such as AE and VS (see Figure 9). For instance, in the AE task every transformation visibly causes clusters that do not cover the full space. This suggests that the model may recognize transformations as important features, characterizing a subset of the overall problem space.

Figure 9: Scatter plot of encoded representations and their transformations for baseline MFCC and encoders with respect to the tasks we investigated. For all panes, black points indicate original audio samples in the encoded space, and the colored, overlaid points indicate the embeddings of transformations according to the indicated category.

5.3 Between-Space Consistency

Next, we discuss between-space consistency according to and , as discussed in Section 3.1.2. As in the previous section, we first provide a visualization of the relationship between transformations and consistency, and then employ the same GAM model to analyze individual effects. The analysis will be presented for all pairs of distance measures and between-space consistency measures, which results in 4 models for and another 4 models for . As in the within-space consistency analysis, we treat the MFCC and the VGG-like networks from different learning tasks as independent ‘encoders’ into a latent embedded space.

5.3.1 Accuracy:

The between-space consistency, according to the criterion, is plotted in the upper plots of Figure 10. Comparing this plot to the within-space consistency plots for (Figure 6) and (Figure 8), one trend is striking: when within-space consistency in and becomes substantially low, the between-space consistency becomes high. This can be interpreted: when grave transformations are applied, the within-space consistencies in both and space will converge to , and comparing the two spaces, this behavior is consistent.
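The effect described above can be reproduced on toy data. The sketch below assumes that the accuracy-style between-space consistency is the rate at which the "nearest original is the source" indicator agrees between the audio space and the latent space; this reading, the helper names, and the distance values are assumptions for illustration only.

```python
def nearest_is_source(distances):
    """distances[i][j]: distance from transformed item i to original j.

    Returns, per transformed item, whether its nearest original is its source.
    """
    return [
        min(range(len(row)), key=row.__getitem__) == i
        for i, row in enumerate(distances)
    ]


def agreement(flags_a, flags_b):
    """Fraction of items on which the two indicator lists agree."""
    matches = sum(a == b for a, b in zip(flags_a, flags_b))
    return matches / len(flags_a)


# Hypothetical distances after a grave transformation: in both spaces, every
# transformed item ends up nearest to some other item's original.
audio_dists = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.9], [0.3, 0.6, 0.5]]
latent_dists = [[0.8, 0.9, 0.1], [0.9, 0.8, 0.2], [0.2, 0.9, 0.7]]

audio_flags = nearest_is_source(audio_dists)
latent_flags = nearest_is_source(latent_dists)

within_audio = sum(audio_flags) / len(audio_flags)
within_latent = sum(latent_flags) / len(latent_flags)
between_acc = agreement(audio_flags, latent_flags)
```

Here both within-space scores are zero, yet the agreement between the two spaces is perfect, which is why a high value of this criterion is best read together with the within-space results.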

A first model to analyze the between-space consistency with respect to the SiMPle and cosine measures () reveals that the largest effect is that of the task/encoder (), followed by the effect of the transformation (). The left plot of the first row in Figure 11 confirms that the estimated consistency of the MFCC encoder () is significantly higher than that of the VGG-like alternatives, which range between and . In fact, the relative order is the same as observed in the within-space case: MFCC is followed by AT, IR, VS, and finally AE.

We separately analyzed the data with respect to the other three combinations of measures, and found very similar results. The largest effect is due to the task/encoder, followed by the transformation; the effect of the interaction is considerably smaller. As the first two rows of Figure 11 show, the same results are observed in all four cases, with statistically significant differences among tasks.

Figure 10: (top) and (bottom) between-space consistency by transformation and magnitude. Each curve indicates the between-space consistency with respect to the task. Confidence intervals are drawn at the 95% level. Points indicate individual observations from different trials.

5.3.2 Correlation:

The bottom plots in Figure 10 show the results for between-space consistency measured with . It can be clearly seen that MFCC preserves the consistency between spaces much better than VGG-like encoders, and in general, all encoders are quite robust to the magnitude of the perturbations.

Figure 11: Estimated marginal means for between-space consistency by encoder . The first and second rows are for and the third and fourth rows are for . Confidence intervals are at the 95% level.

Analyzing the data again using a GAM model confirms these observations. For instance, when analyzing consistency with respect to the DTW and Euclidean measures (), the largest effect is by far that of the task/encoder (), with the transformation and interaction effects being two orders of magnitude smaller. This is because of the clear superiority of MFCC, with an estimated consistency of , followed by AE (), IR (), VS () and finally AT () (see right plot of the fourth row in Figure 11).

As before, we separately analyzed the data with respect to the other three combinations of measures, and found very similar results. As the last two rows of Figure 11 show, the same qualitative observations can be made in all four cases, with statistically significant differences among tasks. Notably, the superiority of MFCC is even clearer when employing the Euclidean distance. Finally, another visible difference is that the relative order of VGG-like networks is reversed with respect to , with AE being the most consistent, followed by VS, IR, and finally AT.
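To make the correlation-based criterion more concrete, the following sketch compares the same set of distances measured in two spaces by their rank correlation. Using Spearman's rank correlation here is an assumption (the text above only speaks of correlation), and the function names and numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical distances from one transformed item to four originals.
audio_d = [0.1, 0.4, 0.5, 0.9]
latent_same_order = [1.0, 2.0, 10.0, 50.0]  # different scale, same ordering
latent_reversed = [9.0, 7.0, 3.0, 1.0]      # ordering flipped

rho_same = spearman(audio_d, latent_same_order)
rho_reversed = spearman(audio_d, latent_reversed)
```

A rank-based measure is insensitive to the scale of the latent distances and only asks whether the ordering of neighbors is preserved, which is what makes it a natural complement to the accuracy-style criterion.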

5.4 Sensitivity to Imperceptible Transformations

5.4.1 Task-Specific Performance

In this subsection, we focus on the special cases of transformations with a magnitude such that they are hardly perceived by humans [5, 8, 26]. As the first row of Figure 12 shows, performance is degraded even with such small transformations, confirming the findings from [5]. In particular, the VS task shows more variability among transformations compared to other tasks. Between transformations, the PS cases show relatively higher degradation.

Figure 12: Distribution of performance, within-space consistency, and between-space consistency on the minimum transformations. The points are individual observations with respect to the transformation types. For PS and TS, we distinguish the direction of the transformation (+: pitch/tempo up, -: pitch/tempo down). The first row indicates the task-specific performance, the second row depicts the within-space consistency , and the third and fourth rows show the between-space consistency and , respectively. The performance is standardized per task, and the sign of AE performance is flipped, similarly to our analysis models.

5.4.2 Within-Space Consistency

The second row of Figure 12 illustrates the within-space consistency on the space when considering these smallest transformations. As before, there is no substantial difference between the distance metrics. In general, the MFCC, AT, and IR encoders/tasks are relatively robust to these small transformations, with their median consistencies close to 1. However, encoders trained on the VS and AE tasks show undesirably high sensitivity to these small transformations. In this case, the effect of the PS transformations is even clearer, causing considerable variance for most of the tasks. The exception is AE, which is more uniformly spread in the first place.

5.4.3 Between-Space Consistency

Finally, the between-space consistencies on the minimum transformations are depicted in the last two rows of Figure 12. First, we see no significant differences between pairs of distance measures. When focusing on , the plots highly resemble those from 5.4.2, which can be expected, because the within-space consistency on is approximately 1 for all these transformations, as illustrated in Figure 6. On the other hand, when focusing on , the last row of Figure 12 shows that even such small transformations already result in large inconsistencies between spaces when employing neural network representations.

6 Discussion and Conclusion

6.1 Effect of the Encoder

For most of our experiments, the largest differences are found between encoders. As is well known, the VGG-like deep neural network shows significantly better task-specific performance in comparison to the MFCC encoder. However, when considering distance consistency, MFCC is shown to be the most consistent encoder in all cases, with neural network approaches performing substantially worse in this respect. This suggests that, in case a task requires robustness to potential musical/acoustical deviations in the audio input space, it may be preferable to employ MFCCs rather than neural network encoders.

6.2 Effect of the Learning Task

Considering the neural networks, our results show that the choice of learning task is the most important factor affecting consistency. For instance, a VGG-like network trained on the AE task seems to preserve the relative distances among samples (high ), but individual transformed samples will fall closer to originals other than the one the transformation was applied to (low ). On the other hand, a task like AT yields high consistency in the neighborhood of corresponding original samples (high ), but does not preserve the general structure of the audio space (low ). This means that, in terms of global distance structure, a network trained on a low-level task like AE is more consistent than a network trained on a high-level task like AT, because the resulting latent space is less morphed and more closely resembles the original audio space. In fact, in our results we see that the semantic high-levelness of the task (AT > IR > VS > AE) is positively correlated with , while negatively correlated with .

Figure 13: on the original samples, including all the possible distance pairs between audio and latent domain.

To further confirm this observation, we also computed the between-space consistency only on the set of original samples. The results, in Figure 13, are very similar to those in the last two rows of Figures 11 and 12. This suggests that in general, the global distance structure of an embedded latent space with respect to the original samples generalizes over the vicinity of those originals, at least for the transformations that we employed.

Figure 14: 2-dimensional scatter plot using t-SNE. Each point represents a 2-second chunk of an audio mixture signal, encoded by a VS-specialized encoder. In the left plot, points are colored by the loudness of the isolated vocal signal for the given mixture signal: red indicates higher loudness, and blue indicates lower loudness. In the right plot, the same chunks are colored by the song each chunk belongs to. The chunks are randomly sampled from the MUSDB18 dataset.

Considering that AE is an unsupervised learning task, and its objective is merely to embed an original data point into a low-dimensional latent space by minimizing the reconstruction error, the odds are lower that data points will cluster according to more semantic criteria, as implicitly encoded in supervised learning tasks. In contrast, the VS task should morph the latent space such that input clips with similar degrees of “vocalness” fall close together, as is indeed shown in Figure 14. As the task becomes more complex and high-level, such as with AT, this clustering effect will become more multi-faceted and complex, potentially morphing the latent space with respect to the semantic space that is used as the source of supervision.

6.3 Effect of the Transformation

Across almost all experimental results, significant differences between transformation categories are observed. On the one hand, this supports the findings from [5, 8], which show the vulnerability of MIR systems to small audio transformations. On the other hand, this also implies that different types of transformations have different effects on the latent space, as depicted in Figure 7.

6.4 Are Nearby Neighbors Relatives?

As depicted in Figure 7, substantial inconsistencies emerge in when compared to . Clearly, these inconsistencies are not desirable, especially when the transformations we applied are not supposed to have noticeable effects. However, as our consistency investigations showed, the MFCC baseline encoder behaves surprisingly well in terms of consistency, evidencing that hand-crafted features should not always be considered as inferior to deep representations.

While in a conventional audio feature extraction pipeline, important salient data patterns may not be captured due to accidental human omission, our experimental results indicate that DNN representations may be unexpectedly unreliable. In the deep music embedding space, ‘known relatives’ in the audio space may suddenly become faraway pairs. That a representation has certain unexpected inconsistencies should be carefully studied and taken into account, especially given the increasing interest in applying transfer learning using DNN representations, not only in the MIR field. For example, if a system needs to feed degraded audio inputs to a pre-trained DNN (as may be done in music identification tasks), there is no guarantee that a transformed input will be embedded at a position similar to that of its original version in the latent space, even if humans can barely recognize the difference between the two.

6.5 Towards Reliable Deep Music Embeddings

In this work, we proposed to use several distance consistency-based criteria, in order to assess whether representations in various spaces can be deemed as consistent. We see this as a complementary means of diagnosis beyond task-related performance criteria, when aiming to learn more general and robust deep representations. More specifically, we investigated whether deep latent spaces are consistent in terms of distance structure, when smaller and larger transformations on raw audio are introduced (RQ 1). Next to this, we investigated how various types of learning tasks used to train deep encoders impact the consistencies (RQ 2).

Consequently, we conducted an experiment employing 4 MIR tasks, and considering deep encoders versus a conventional hand-crafted MFCC encoder, to measure the consistency for different scenarios. Our findings can be summarized as follows:

  RQ 1. Compared to the MFCC baseline, all DNN encoders show lower consistency, both in terms of within-space consistency and between-space consistency, especially when transformations grow from imperceptibly small to larger, more perceptible ones.

  RQ 2. Considering learning tasks, the high-levelness of a task is correlated with the consistency of the resulting encoder. For instance, an AT-specialized encoder, which needs to deal with a semantically high-level task, yields the highest within-space consistency, but the lowest between-space consistency. On the other hand, an AE-specialized encoder, which deals with a semantically low-level task, shows opposite trends.

To realize a fully robust assessment framework, there still are a number of aspects to be investigated. First of all, more in-depth study is required considering different magnitudes in the transformations, and their possible comparability. While we applied different magnitudes for each transformation, we decided not to comparatively consider the magnitude ranges in the analysis at this moment. This was done because we do not have any exact means to compare the perceptual effect of different magnitudes, which will be crucial to regularize between transformations.

Furthermore, similar analysis techniques can be applied to more diverse settings of DNNs, including different architectures, different levels of regularization, and so on. Also, as suggested in [8, 9], the same measurement and analysis techniques can be used for adversarial examples generated from the DNN itself, as another important means of studying a DNN’s reliability.

Moreover, based on the observations from our study, it may be possible to develop countermeasures for maintaining high consistency of a model, while yielding high task-specific performance. For instance, it can be effective if, during learning, a network is directly supervised to treat transformations in similar ways as their original versions in the latent space. This can be implemented as an auxiliary objective to the main objective of the learning procedure, or by directly introducing the transformed examples as data augmentation.
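One possible shape of such an auxiliary objective is sketched below: a penalty on the squared distance between the embedding of each original and the embedding of its transformed version, added to the task loss with a weight. The penalty form, the `weight` parameter, and all names are hypothetical choices made for illustration; this countermeasure was not implemented or evaluated in this work.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def consistency_penalty(original_embeddings, transformed_embeddings):
    """Mean squared distance between matched original/transformed embeddings."""
    pairs = list(zip(original_embeddings, transformed_embeddings))
    return sum(squared_distance(o, t) for o, t in pairs) / len(pairs)


def total_loss(task_loss, original_embeddings, transformed_embeddings, weight=0.1):
    """Task loss plus a weighted consistency penalty (hypothetical weighting)."""
    penalty = consistency_penalty(original_embeddings, transformed_embeddings)
    return task_loss + weight * penalty


# Hypothetical embeddings for two originals and two kinds of transformed copies.
z_orig = [[0.0, 1.0], [1.0, 0.0]]
z_close = [[0.0, 1.0], [1.0, 0.0]]  # transformation leaves embeddings unchanged
z_far = [[1.0, 1.0], [1.0, 2.0]]    # transformation moves embeddings away

loss_close = total_loss(0.5, z_orig, z_close)
loss_far = total_loss(0.5, z_orig, z_far)
```

With the same task loss, the encoder whose transformed inputs drift away from their originals is assigned the higher total loss, so a training procedure minimizing it would be pushed towards keeping known relatives close.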

Finally, we believe that our work can be a step forward towards a practical framework for more interpretable deep learning models, in the sense that we suggest a less task-dependent measure for evaluating a deep representation, that still is based on known semantic relationships in the original item space.


This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.


  • [1] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems 26 NIPS, pages 2643–2651, Lake Tahoe, NV, USA, December 2013.
  • [2] Eric J. Humphrey, Juan Pablo Bello, and Yann LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 403–408, October 2012.
  • [3] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 2392–2396, March 2017.
  • [4] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In Latent Variable Analysis and Signal Separation - 13th International Conference, LVA/ICA, Proceedings, pages 258–266, Grenoble, France, February 2017.
  • [5] Bob L. Sturm. A simple method to determine if a music information retrieval system is a "horse". IEEE Trans. Multimedia, 16(6):1636–1644, 2014.
  • [6] Francisco Rodríguez-Algarra, Bob L. Sturm, and Hugo Maruri-Aguilar. Analysing scattering-based music content analysis systems: Where’s the music? In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, pages 344–350, August 2016.
  • [7] Bob L. Sturm. The "horse" inside: Seeking causes behind the behaviors of music content analysis systems. Computers in Entertainment, 14(2):3:1–3:32, 2016.
  • [8] Corey Kereliuk, Bob L. Sturm, and Jan Larsen. Deep learning and music adversaries. IEEE Trans. Multimedia, 17(11):2059–2071, 2015.
  • [9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings, May 2015.
  • [10] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, pages 141–149, Suzhou, China, October 2017.
  • [11] Jongpil Lee, Taejun Kim, Jiyoung Park, and Juhan Nam. Raw waveform-based audio classification using sample-level CNN architectures. CoRR, abs/1712.00866, 2017.
  • [12] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In 14th Sound and Music Computing Conference, SMC, Espoo, Finland, July 2017.
  • [13] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 6964–6968, Florence, Italy, May 2014. IEEE.
  • [14] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. Samplecnn: End-to-end deep convolutional neural networks using very small filters for music classification. Applied Sciences, 8(1), 2018.
  • [15] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep u-net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, pages 745–751, October 2017.
  • [16] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, volume 60, pages 84–90, New York, NY, USA, May 2017. ACM.
  • [18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 2015.
  • [19] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 807–814, Haifa, Israel, June 2010. Omnipress.
  • [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pages 448–456, Lille, France, July 2015. JMLR, Inc.
  • [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, May 2015.
  • [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI - 18th International Conference, Proceedings, Part III, pages 234–241, October 2015.
  • [23] Stan Salvador and Philip Chan. Fastdtw: Toward accurate dynamic time warping in linear time and space. In 3rd International Workshop on Mining Temporal and Sequential Data (TDM-04). Citeseer, 2004.
  • [24] Diego Furtado Silva, Chin-Chia Michael Yeh, Gustavo E. A. P. A. Batista, and Eamonn J. Keogh. Simple: Assessing music similarity using subsequences joins. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, pages 23–29, August 2016.
  • [25] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
  • [26] Justin Salamon and Julián Urbano. Current challenges in the evaluation of predominant melody extraction algorithms. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 289–294, October 2012.
  • [27] Suramya Tomar. Converting video formats with ffmpeg. Linux Journal, 2006(146):10, 2006.
  • [28] EBU. Loudness normalisation and permitted maximum level of audio signals. 2010.
  • [29] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR, pages 591–596, Miami, FL, USA, October 2011. University of Miami.
  • [30] Juan J. Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 559–564, October 2012.
  • [31] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.
  • [32] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In LVA/ICA, volume 10891 of Lecture Notes in Computer Science, pages 293–305. Springer, 2018.
  • [33] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(November):2579–2605, 2008.