1 Introduction
Music audio is a complex signal. Frequencies in the signal usually belong to multiple pitches, which are organized harmonically and rhythmically, and often originate from multiple acoustic sources in the presence of noise. When solving tasks in the Music Information Retrieval (MIR) field, the optimal subset of information within this noisy signal needs to be found that leads to quantifiable and musical descriptors. Commonly, this process is handled by pipelines exploiting a wide range of signal processing and machine learning algorithms. Beyond handcrafted music representations, which are informed by human domain knowledge, deep music representations have emerged as an alternative; these are trained by employing deep neural networks (DNNs) and massive amounts of training observations. Such deep representations are usually reported to outperform handcrafted representations [1, 2, 3, 4].
At the same time, the performance of MIR systems may be vulnerable to subtle input manipulations. The addition of small amounts of noise may lead to unexpected, seemingly random behavior, regardless of whether traditional or deep models are used [5, 6, 7, 8]. In a similar line of thought, in the broader deep learning (DL) community, increasing attention is given to adversarial examples that are barely distinguishable from original samples, but greatly impact a network's performance [9, 8].
So far, the sensitivity of representations with respect to subtle input changes has mostly been tested in relation to dedicated machine learning tasks (e.g. object recognition, music genre classification), and examined by investigating whether these input changes cause performance drops. However, when purely considering the questions of whether relevant input signal information can automatically be encoded into a representation, and to what extent the representation can be deemed 'reliable', the learned representation should in principle be general and useful for different types of tasks. Therefore, in this work, we will not focus on performance obtained by using a learned representation for certain machine learning tasks, but rather on a systematic way to verify assumptions on distance relationships between several representation spaces: the audio space and the learned space.
Inspired by [5], we will also investigate the effect of musical and acoustic transformations of audio input signals, in combination with an arbitrary encoder of the input signal, which may either be a conventional feature extractor or a deep learning-based encoder. In doing this, we make the following major assumptions, the rationale of which can be found in Figure 1:

(i) if a small, humanly imperceptible transformation is introduced, the distance between the original and transformed signal should be very small, both in the audio and encoded space;

(ii) however, if a more grave transformation is introduced, the distance between the original and transformed signal should be larger, both in the audio and encoded space;

(iii) since an encoder obtained under a machine learning framework will have optimized its encoding behavior based on the task and the data, the relational structure of signals will be morphed with respect to them as well.
To examine the above assumptions, we seek to answer the following research questions:

RQ 1. Do assumptions (i) and (ii) hold for conventional and deep learning-based encoders?

RQ 2. Does assumption (iii) hold for music-related tasks and corresponding datasets, especially when deep learning is applied?
With this work, we intend to offer directions towards a complementary evaluation method for deep machine learning pipelines that focuses on space diagnosis, rather than the troubleshooting of pipeline output. Our intention is that this will yield the researcher additional insight into the reliability and potential semantic sensitivities of deep learned spaces.
In the remainder of this paper, we first describe our approach, including the details of the learning setup (Section 2) and the methodology to assess distance consistency (Section 3), followed by the experimental setup (Section 4). Further, we report the results from our experiments (Section 5). Afterwards, we discuss the results and conclude this work (Section 6).
2 Learning
To diagnose a deep music representation space, such a space should first exist. For this, one needs to find a learnable deep encoder f that transforms the input audio representation x into a latent vector z = f(x), while taking into account the desired output for a given learning task. The learning of f can be done by adjusting its parametrization θ to optimize an objective function, which should be defined in accordance with the given task.

2.1 Tasks
We consider representations learned for four different tasks. By doing this, we take a broad range of problems into account that are particularly common in the MIR field: autoencoding (AE), music auto-tagging (AT), predominant instrument recognition (IR), and finally singing voice separation (VS). AE is a representative task for unsupervised learning using DNNs, and AT is a popular supervised learning task in the MIR field [3, 10, 11, 12, 13, 14]. AT is a multi-label classification problem, in which individual labels are not always mutually exclusive and often highly inter-correlated. As such, it can be seen as a more challenging problem than IR, which is a single-label classification problem. Furthermore, IR labels involve instruments, which can be seen as more objective and taxonomically stable labels than e.g. genres or moods. Finally, VS is a task that can be formulated as a regression problem, which learns a mask to segregate a certain region of interest out of a given signal mixture.

2.1.1 Autoencoder
The objective of an autoencoder is to find a set of encoder and decoder functions, minimizing the reconstruction error given by:

    L_AE(θ, φ) = Σ_{x ∈ X} ‖x − g_φ(f_θ(x))‖²    (1)

Here, g_φ(f_θ(x)) is the output of the decoding function g parameterized by φ, and X is the given set of training samples.
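As an illustration, the reconstruction objective can be sketched in a few lines of numpy. The toy linear encoder/decoder below merely stands in for the deep networks f and g, and all function names are ours:

```python
import numpy as np

def ae_loss(X, encode, decode):
    """Sum of squared reconstruction errors over a batch (cf. Eq. 1)."""
    Z = encode(X)          # latent vectors z = f(x)
    X_hat = decode(Z)      # reconstructions g(f(x))
    return float(np.sum((X - X_hat) ** 2))

# Toy linear stand-ins for the deep encoder/decoder.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))      # 8-dim input, 3-dim latent
encode = lambda X: X @ W         # plays the role of f_theta
decode = lambda Z: Z @ W.T       # plays the role of g_phi
X = rng.normal(size=(4, 8))
loss = ae_loss(X, encode, decode)
```

A lossless pair of functions drives the objective to zero, which is a quick sanity check on any implementation of (1).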
2.1.2 Music Auto-Tagging
The typical approach to music auto-tagging using DNNs is to consider the problem as a multi-label classification problem, for which the objective is to minimize the binary cross-entropy over each music tag t, which is expressed as follows:

    L_AT(θ, φ) = − Σ_{x ∈ X} Σ_{t=1}^{T} [ y_t log ŷ_t + (1 − y_t) log(1 − ŷ_t) ]    (2)

where y_t is the binary label that indicates whether tag t is related to the input audio signal, and ŷ_t = σ(h_φ(f_θ(x)))_t is the output of the function h, the prediction layer with sigmoid nonlinearity σ that transforms the deep representation z = f_θ(x) into the prediction ŷ, parameterized by φ. The optimal functions f and h are found by adjusting θ and φ such that (2) is minimized.

2.1.3 Predominant Musical Instrument Recognition
The learning of the IR task can be formulated as single-label, multi-class classification, which often aims at minimizing the following objective function, especially in the context of neural network learning:

    L_IR(θ, φ) = − Σ_{x ∈ X} Σ_{c=1}^{C} y_c log ŷ_c    (3)

where y_c is the instrument label. The learning process is analogous to that of (2), except that the output posterior distribution ŷ is derived as a categorical distribution by a transformation such as the softmax function, applied to the output of h_φ(f_θ(x)).
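The two classification objectives differ only in their output nonlinearity and label structure; a minimal numpy sketch, with function names of our own choosing and toy arrays that only illustrate the shapes involved:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Multi-label objective (cf. Eq. 2): sigmoid outputs, one loss per tag."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_tag = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(per_tag.sum(axis=1).mean())

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(labels, logits):
    """Single-label objective (cf. Eq. 3): softmax output, one class per clip."""
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

# Two clips, three tags (AT) / three instrument classes (IR).
y_tags = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
good_probs = np.array([[0.9, 0.1, 0.8], [0.2, 0.1, 0.9]])
bad_probs = 1.0 - good_probs
labels = np.array([0, 2])
good_logits = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
```

Predictions that agree with the labels yield a strictly lower loss under both objectives, which is the property both tasks exploit during training.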
2.1.4 Singing Voice Separation
There are multiple ways to set up an objective function for a source separation task. It can be achieved by directly applying a reconstruction loss such as (1) between the output of the network and the desired isolated signal, or, as introduced in [15], one can learn a mask that segregates the target component from the mixture as follows:

    L_VS(θ) = Σ_{x ∈ X} ‖y − M_θ(x) ⊙ x‖²    (4)

where y is the raw-level representation of the isolated signal, which serves as the regression target, x is the representation of the original input mixture, and ⊙ refers to element-wise multiplication. M_θ(x) is the mask inferred by the network, of which the elements are bounded in the range [0, 1] such that they can be used for the separation of the target source. Note that both the input x and the estimated target source are magnitude spectra, so we use the original phase of the input to reconstruct a time-domain signal.

2.2 Network Architectures
The architecture of a DNN determines the overall structure of the network, which defines the details of the desired patterns to be captured by the learning process [16]. In other words, it reflects the way in which a network should interpret a given input data representation. In this work, we use a VGG-like architecture, one of the most popular and general architectures frequently employed in the MIR field.
Table 1. The VGG-like architecture used in this work.

Layers               | Output shape
---------------------|-------------
Input                |
Conv, BN, ReLU       |
MaxPooling           |
Conv, BN, ReLU       |
MaxPooling           |
Conv, BN, ReLU       |
MaxPooling           |
Conv, BN, ReLU       |
MaxPooling           |
Conv, BN, ReLU       |
MaxPooling           |
Conv, BN, ReLU       |
MaxPooling           |
GlobalAveragePooling |
The VGG-like architecture is a Convolutional Neural Network (CNN) architecture introduced by [17, 18], which employs tiny rectangular filters. Successes of VGG-like architectures have not only been reported for computer vision tasks, but also in various MIR fields [3, 8]. The detailed architecture design used in our work can be found in Table 1.

2.3 Architecture and Learning Details
For both architectures, we used Rectified Linear Units (ReLU) [19] for the nonlinearity, and Batch Normalization (BN) in every convolutional and fully-connected layer for fast training and regularization [20]. We use ADAM [21] as the optimization algorithm during training, where the learning rate is set to the same value across all models. We trained the models with respect to their objective functions, which require different optimization strategies. Nonetheless, we kept all other factors fixed except the number of epochs per task, which inherently depends on the dataset and the task. The termination point of the training is set manually, where the validation loss either reaches a plateau or starts to increase. More specifically, we stopped the training at different epochs for the AE, AT, IR, and VS tasks, respectively.

3 Measuring Distance Consistency
In this work, among the set of potential representation spaces, we consider two representation spaces of interest: the audio input space A and the latent embedding space Z. Within each space, a set of different models (encoders) can be considered. For all relevant spaces, we will assess space reliability by examining the distance consistency with respect to a set of transformations T.
In Section 3.1, we describe how distance consistency will be measured. Section 3.2 will discuss the distance measures that will be used, while Section 3.3 discusses what transformations will be adopted in our experiments.
3.1 Distance Consistency
For distance consistency, we will compute within-space consistency and between-space consistency.
3.1.1 Within-Space Consistency
We first obtain the transformed samples τ(x) for all original samples x belonging to the test set X. Afterwards, we calculate an error function for each transformed sample as follows:

    ε(x) = 1 if min_{x′ ∈ X, x′ ≠ x} d(τ(x), x′) < d(τ(x), x), and 0 otherwise    (5)

Here, we check whether a transformed sample is closer to any other original sample than to its own original sample. x is the representation of a single music excerpt, and X is the set of all points in the given space determined by the encoder. All neural network encoders belong to the set of models we tested in this work. d is a distance function belonging to the set of distance measures considered for the given space.

As ε indicates how unreliable the space is at the clip level, the within-space consistency can be defined as the complement of the mean fault rate:

    C_within = 1 − (1/|X|) Σ_{x ∈ X} ε(x)    (6)

where ε(x) over all test samples forms the array of faults.
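The fault indicator and its complement can be sketched directly in numpy; the function names and the toy 2-D points below are ours, standing in for encoded music excerpts:

```python
import numpy as np

def fault(points, transformed, i, dist):
    """Cf. Eq. 5: is transformed sample i closer to some *other* original
    than to its own original?"""
    d_own = dist(transformed[i], points[i])
    d_others = [dist(transformed[i], points[j])
                for j in range(len(points)) if j != i]
    return int(min(d_others) < d_own)

def within_consistency(points, transformed, dist):
    """Cf. Eq. 6: complement of the mean fault rate."""
    faults = [fault(points, transformed, i, dist) for i in range(len(points))]
    return 1.0 - float(np.mean(faults))

euclid = lambda a, b: float(np.linalg.norm(a - b))
X = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_small = X + 0.1   # tiny perturbation: each point stays nearest its own original
```

With the tiny perturbation every sample keeps its own original as nearest neighbor (consistency 1), while a mapping that sends each sample onto a different original drives the consistency to 0.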
3.1.2 Between-Space Consistency
To measure consistency between the associated spaces, one can measure how they are correlated. The distances between a transformed point and its original sample will be used as characteristic information to make comparisons between spaces. As mentioned above, we consider two specific spaces: the audio input space A and the embedding space Z. Consequently, we can calculate the correlation of distances for the points belonging to each pair of spaces as follows:

    C_ρ = r_s(D_A, D_Z)    (7)

where r_s is Spearman's rank correlation, and D_S refers to the distance array [d(τ(x), x)]_{x ∈ X} computed in space S.

On the other hand, one can also simply measure the agreement between the spaces using the binary accuracy between the fault arrays ε_A and ε_Z, which is given by:

    C_a = acc(ε_A, ε_Z)    (8)

where acc denotes the binary accuracy function.
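Both between-space criteria are simple to compute once the per-clip distances and fault indicators are available; a self-contained sketch (our own names, Spearman implemented by ranking, assuming distinct distance values, i.e. no tie correction):

```python
import numpy as np

def rank(a):
    """Rank positions of the values in `a` (0 = smallest)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    ra, rb = rank(np.asarray(a, float)), rank(np.asarray(b, float))
    return float(np.corrcoef(ra, rb)[0, 1])

def between_consistency_corr(d_audio, d_embed):
    """Cf. Eq. 7: correlation of transformed-to-original distances."""
    return spearman(d_audio, d_embed)

def between_consistency_acc(faults_audio, faults_embed):
    """Cf. Eq. 8: agreement of the binary fault indicators."""
    fa, fe = np.asarray(faults_audio), np.asarray(faults_embed)
    return float(np.mean(fa == fe))
```

Note the complementary character of the two criteria: the correlation compares the global ordering of distances, while the accuracy compares per-clip nearest-neighbor faults.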
3.2 Distance Measures
The main assessment of this work is based on distance comparisons between original clip fragments and their transformations, both in audio and embedded space. To the best of our knowledge, few general methods have been developed to calculate the distance between raw audio representations of music signals directly. Therefore, we choose to calculate the distance between audio samples using time-frequency representations as a proxy of the true distance between the music signals. More specifically, we use Mel Frequency Cepstral Coefficients (MFCCs) with 25 coefficients, dropping the first coefficient when the actual distance is calculated. Eventually, we employ two distance measures in the audio domain:

Dynamic Time Warping (DTW) is a well-known dynamic programming method for calculating similarities between time series. For our experiments, we use the FastDTW implementation [23].

SiMPle computes the similarity matrix profile between the subsequences of two time series, and has been applied to music similarity assessment.

For the deep embedding space, since any deep representation of the input is encoded as a fixed-length vector in our models, we adopted two general distance measures for vectors: Euclidean distance and cosine distance.
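The two vector distances are standard; for reference, the sketch below also includes an exact quadratic-time DTW, of which FastDTW [23] is an approximation (all names are ours):

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; invariant to the scale of either vector."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dtw(s, t, dist=lambda a, b: abs(a - b)):
    """Exact O(len(s)*len(t)) dynamic time warping distance."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(s[i - 1], t[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Because DTW may repeat frames while aligning, a time-stretched copy of a sequence can still reach distance zero, which is consistent with the robustness to tempo shifts reported in Section 5.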
3.3 Transformations
In this subsection, we describe the details of the transformations we employed in our experiment. In all cases, we will consider a range from very small, humanly imperceptible transformations, up to transformations within the same category that should be large enough to become humanly noticeable. While it is not trivial to set an upper bound for the transformation magnitudes, at which a transformed sample may be recognized as a 'different' song from the original, we introduce a reasonable range of magnitudes, such that we can investigate the overall robustness of our target encoders as transformations become graver. The selected range for each transformation is illustrated in Figure 3.

Noise: As a randomized transformation, we applied both pink noise (PN) and environmental noise (EN) transformations. More specifically, for EN, we used noise recorded in a bar, as collected from freesound (https://freesound.org). The test range of the magnitude, expressed in terms of Signal-to-Noise Ratio (SNR), spans from 15dB to 30dB, with denser sampling for high SNRs (which are situations in which transformed signals should be very close to the original signal) [25]. This strategy is also adopted for the rest of the transformations.

Tempo Shift: We applied a tempo shift (TS), transforming a signal to a new tempo, ranging from 30% to 150% of the original tempo. Therefore, we both slow down and speed up the signal. Close to the original tempo, we employed a step size of 2%, as a ±2% tempo change has been considered an irrelevant slowdown or speedup in previous work [5]. We employed an implementation using a phase vocoder and resampling algorithm (https://breakfastquay.com/rubberband/).

Pitch Shift: We also employed a pitch shift (PS), changing the pitch of a signal, making it lower or higher. Close to the original pitch, we consider transformation steps of a number of cents that is 50% smaller than the error bound considered in the MIREX challenge of multiple fundamental frequency estimation & tracking [26]. Beyond a difference of 1 semitone with respect to the original, we use whole-tone interval steps.

Compression: For compression (MP), we simply compress the original audio sample using an MP3 encoder, taking all kb/s compression rates as provided by the FFmpeg software [27].
For the rest of the paper, for brevity, we use OG as the acronym of the original samples.
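The SNR-controlled noise mixing used for the PN and EN transformations can be sketched as follows; this is a simplified stand-in with white noise in place of pink or bar noise, and the function name is ours:

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested Signal-to-Noise Ratio."""
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power: p_sig / 10^(snr_db / 10)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s, 440 Hz tone
noise = rng.normal(size=sig.shape)  # white-noise stand-in
mixed = add_noise_at_snr(sig, noise, snr_db=30.0)
```

At 30dB SNR, the resulting noise power is three orders of magnitude below the signal power, matching the 'hardly perceptible' end of our magnitude range.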
4 Experiment
4.1 Audio Preprocessing
For the input time-frequency representation to the DNNs, we use the dB-scale magnitude STFT matrix. For the calculation, the audio was resampled at 22,050 Hz. The window and overlap sizes are 1,024 and 256 samples respectively. This leads to a dimensionality of the frequency axis of 513, only taking positive frequencies into account. Standardization over the frequency axis is applied by taking the mean and the standard deviation of all magnitude spectra in the training set.
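A minimal numpy sketch of this input representation, interpreting the 256 as the hop size and using a Hann window (both assumptions of ours):

```python
import numpy as np

def db_magnitude_stft(x, win=1024, hop=256, eps=1e-10):
    """dB-scale magnitude STFT; positive frequencies only, so the
    frequency axis has win // 2 + 1 = 513 bins."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(mag + eps)

x = np.random.default_rng(0).normal(size=22050)  # ~1 s at 22,050 Hz
S = db_magnitude_stft(x)                         # (frames, 513)
```

One second of audio at this sampling rate yields 83 frames of 513 frequency bins.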
Also, we use short excerpts of the original input audio track, which yield approximately 2 seconds per excerpt in the setup we used. Each batch of excerpts is randomly cropped from 24 randomly chosen music clips before being served to the training loop.
When applying the transformations, it turned out that some of the libraries we used did not only apply the transformation, but also changed the loudness of the transformed signal. To mitigate this, and only consider the actual transformation of interest, we applied a loudness normalization based on the EBU R128 loudness measure [28]. More specifically, we calculated the mean loudness of the original sample, and then ensured that transformed audio samples would have equal mean loudness to their original.
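The matching step can be illustrated with a crude RMS-based sketch; note this is a stand-in of ours for proper EBU R128 loudness measurement, which additionally applies perceptual weighting and gating:

```python
import numpy as np

def match_rms(transformed, reference):
    """Scale `transformed` so its RMS level matches `reference`.
    (A simplified stand-in for EBU R128 loudness matching [28].)"""
    rms_ref = np.sqrt(np.mean(reference ** 2))
    rms_tr = np.sqrt(np.mean(transformed ** 2))
    return transformed * (rms_ref / rms_tr)

original = np.array([0.1, -0.1, 0.1, -0.1])
louder = np.array([0.5, -0.5, 0.25, -0.25])   # e.g. a library's output
normalized = match_rms(louder, original)
```

After scaling, original and transformed signals differ only by the transformation itself, not by level.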
4.2 Baseline
Beyond deep encoders, we also consider a conventional feature extractor: MFCCs, as also used in [10]. The MFCC extractor can also be seen as an encoder that projects raw audio measurements into a latent embedding space, where the projection was handcrafted by humans to be perceptually meaningful.

We first calculate the first- and second-order time derivatives of the given MFCCs, and then take the mean and standard deviation over the time axis, for the original coefficients and their derivatives. Finally, we concatenate all statistics into one vector. Using the 25 coefficients excluding the first coefficient, we obtain one such vector for every point in the test set. For the AT task, we trained a dedicated prediction layer for auto-tagging, with the same objective as Eq. (2), where the deep representation is substituted by the MFCC-based vector.
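The summarization step can be sketched as follows; under our reading, 24 retained coefficients times three streams (original plus two derivatives) times two statistics gives a 144-dimensional vector:

```python
import numpy as np

def summarize_mfcc(mfcc):
    """Baseline feature: mean and std over time of the MFCCs and their
    first/second time derivatives, concatenated. With the first coefficient
    dropped upstream, 24 coeffs * 3 streams * 2 stats = 144 dims."""
    d1 = np.diff(mfcc, n=1, axis=1)
    d2 = np.diff(mfcc, n=2, axis=1)
    feats = []
    for m in (mfcc, d1, d2):
        feats.append(m.mean(axis=1))
        feats.append(m.std(axis=1))
    return np.concatenate(feats)

mfcc = np.random.default_rng(0).normal(size=(24, 100))  # (coefficients, frames)
vec = summarize_mfcc(mfcc)
```

The fixed-length output makes this baseline directly comparable to the deep encoders' embedding vectors.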
4.3 Dataset
We use a subset of the Million Song Dataset (MSD) [29] both for training and testing of the AT and AE tasks. The number of training samples is 71,512, bootstrapped from an original subset of 102,161 samples. For the test set, we used 1,000 excerpts randomly sampled from 1,000 preview clips which are not used at training time. As suggested in [3], we used the top social tags.
As for the IR task, we choose to use the training set of the IRMAS dataset [30], which contains 6,705 audio clips of 3second polyphonic mixtures of music audio, from more than 2,000 songs. The predominant instrument of each short excerpt is labeled. As excerpts may have been clipped from a single song multiple times, we split the dataset into training, validation and test sets at the song level, to avoid unwanted bleeding among splits.
Finally, for VS, we employed the MUSDB18 dataset [31]. This dataset is developed for musical blind source separation tasks, and has been used in public benchmarking challenges [32]. The dataset consists of 150 unique fulllength songs, both with mixtures and isolated sources of selected instrument groups: vocals, bass, drums and other. Originally, the dataset is split into a training and test set; we split the training set into a training and validation set (with a 7:3 ratio), to secure validation monitoring capability.
Note that since we use different datasets with respect to the tasks, the measurements we investigate will also depend on the datasets and tasks. However, across tasks, we always use the same encoder architecture, such that comparisons between tasks can still validly be made.
4.4 Performance Measures
As introduced in Section 3, we use distance consistency measures as the primary evaluation criterion of our work. Next to this, we also measure the performance per employed learning task. For the AE task, the Mean Square Error (MSE) is used as a measure of reconstruction error. For the AT task, we apply a measure derived from the popular Area Under the ROC Curve (AUC): more specifically, we average the AUC measure over clips. As for the IR task, we choose to use accuracy. Finally, for the VS task, we choose to use the Signal-to-Distortion Ratio (SDR), which is one of the evaluation measures used in the original benchmarking campaign. For this, we employ the public software as released by the benchmark organizers. While beyond SDR, this software suite can also calculate 3 more evaluation measures (Image to Spatial distortion Ratio (ISR), Source to Interference Ratio (SIR), and Sources to Artifacts Ratio (SAR)), in this study we choose to only employ SDR: the other metrics consider spatial distortion, which is irrelevant to our experimental setup, in which we only use mono sources.
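For intuition, a toy SDR can be written in a few lines; this simplified version (our own naming) treats the entire residual as distortion, whereas the official benchmark toolkit decomposes the error into the ISR/SIR/SAR components mentioned above:

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-10):
    """Simplified Signal-to-Distortion Ratio in dB: ratio of reference
    energy to residual energy, with eps guarding against division by zero."""
    err = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) /
                           (np.sum(err ** 2) + eps))

rng = np.random.default_rng(0)
ref = np.sin(2 * np.pi * np.arange(1000) / 100)   # clean target source
noisy = ref + 0.01 * rng.normal(size=ref.shape)   # imperfect estimate
```

Higher is better: a perfect estimate yields a very large SDR, and any added distortion lowers it.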
5 Results
In the following subsections, we present the major analysis results for taskspecific performance, withinspace consistency, and finally, betweenspace consistency. Shared conclusions and discussions following from our observations will be presented in Section 6.
5.1 Task-Specific Performance
To analyze task-specific performance, we ran predictions for the original samples in the test set, as well as their transformations under all transformation types and all the magnitudes we selected. The overall results, grouped by transformation, task and encoder, are illustrated in Figure 4. For most parts, we observe similar degradation patterns within the same transformation type. For instance, in the presence of PN and EN transformations, performance decreases in a characteristic non-linear fashion as more noise is added. The exception seems to be the AE task, which shows somewhat unique trends with a more distinct difference between encoders. In particular, when EN is introduced, performance increases with the severity of the transformation. This is likely caused by the fact that the environmental noise that we employed is semantically irrelevant for the other tasks, thus causing a degradation in their performance. However, because the AE task just reconstructs the given input audio regardless of the semantic context, and the environmental noise that we use is likely not as complex as music or pink noise, the overall reconstruction gets better.
To better understand the effect of transformations, we fitted a Generalized Additive Model (GAM) on the data, using as predictors the main effects of the task, encoder and transformation, along with their two-factor interactions. Because the relationship between performance and transformation magnitude is very characteristic in each case, we included an additional spline term to smooth the effect of the magnitude for every combination of transformation, task and encoder. In addition, given the clear heterogeneity of distributions across tasks, we standardized performance scores using the within-task mean and standard deviation. Furthermore, MSE scores in the AE task are reversed, so that higher scores imply better performance. The analysis model explains most of the variability.

An Analysis of Variance (ANOVA) using the marginalized effects clearly reveals that the largest effect is due to the encoders, as evidenced by Figure 4. Indeed, the VGG-like network attains a considerably higher estimated mean performance than MFCCs in standardized units. The second largest effect is the interaction between transformation and task, mainly because of the VS task. Comparing the VGG-like and MFCC encoders on the same task, the largest performance differences appear in the AE task, with VS showing the smallest differences. This suggests that MFCCs lose a substantial amount of information required for reconstruction, while a neural network is capable of maintaining sufficient information to do a reconstruction task. The smallest performance differences in the VS task mostly relate to the performance of the VGG-like encoder, which shows substantial performance degradation in response to the transformations. Figure 5 shows the estimated mean performance.

5.2 Within-Space Consistency
In terms of within-space consistency, we first examine the original audio space. As depicted in Figure 6, both the DTW and SiMPle measures show very high consistency for small transformations. As transformations reach higher magnitudes, as expected, the consistency decreases, but at different rates, depending on the transformation. The clear exception is the TS transformation, to which both measures, and in particular DTW, are highly robust regardless of the magnitude of the shift. This result implies that both measures' explicit consideration of temporal dynamics can be beneficial.
With respect to the within-space consistency of the latent space, Figures 7 and 8 depict the results for both the Euclidean and cosine distance measures. In general, the trends are similar to those found in Figure 6. For analysis, we fitted a similar GAM model, including the main effects of transformation and task, their interaction, and a smoother for the magnitude of each transformation within each task. When modeling consistency with respect to the Euclidean distance, an ANOVA shows very similar effects due to the transformation and the task, with a smaller effect of the interaction. In particular, the model confirms the observation from the plots that the MFCC encoder has significantly higher consistency than the others. For the VGG-like cases, AT shows the highest consistency, followed by IR, VS, and lastly AE. As Figure 8 shows, all these differences are statistically significant.
A similar model to analyze consistency with respect to the cosine distance yielded very similar results. However, the effect of the task was larger than the effect of the transformation, indicating that the cosine distance is slightly more robust to transformations than the Euclidean distance.
To investigate the observed effects more intuitively, we visualize in Figure 9 the original dataset samples and their smallest transformations, which should be hardly perceptible to imperceptible to human ears [5, 8, 26] (the smallest transformations being the smallest steps in PS and TS, 30dB SNR in PN and EN, and 192 kb/s in MP), in a 2-dimensional space, using t-SNE [33]. In MFCC space (Figure 9), the distributions of colored points, corresponding to each of the transformation categories, are virtually identical to those of the original points. This matches our assumption that very subtle transformations, which humans will not easily recognize, should stay very close to the original points. Therefore, if the hidden latent embedded space had high consistency with respect to the audio space, the distribution of colored points should be virtually identical to the distribution of original points. However, this is certainly not the case for the neural networks, especially for tasks such as AE and VS (see Figure 9). For instance, in the AE task every transformation visibly causes clusters that do not cover the full space. This suggests that the model may recognize transformations as important features, characterizing a subset of the overall problem space.
5.3 Between-Space Consistency
Next, we discuss between-space consistency according to the accuracy and correlation criteria introduced in Section 3.1.2. As in the previous section, we first provide a visualization of the relationship between transformations and consistency, and then employ the same GAM model to analyze individual effects. The analysis will be presented for all pairs of distance measures and between-space consistency measures, which results in 4 models for the accuracy criterion and another 4 models for the correlation criterion. As in the within-space consistency analysis, we treat the MFCC extractor and the VGG-like networks from the different learning tasks as independent 'encoders' into a latent embedded space.
5.3.1 Accuracy
The between-space consistency, according to the accuracy criterion, is plotted in the upper plots of Figure 10. Comparing these plots to the within-space consistency plots for the audio space (Figure 6) and the latent space (Figure 8), one trend is striking: when within-space consistency becomes substantially low in both spaces, the between-space consistency becomes high. This can be interpreted as follows: when grave transformations are applied, the within-space consistencies in both the audio and the latent space will converge to 0, and in the comparison of the two spaces, this behavior is consistent.
A first model to analyze the between-space consistency with respect to the SiMPle and cosine measures reveals that the largest effect is that of the task/encoder, followed by the effect of the transformation. The left plot of the first row in Figure 11 confirms that the estimated consistency of the MFCC encoder is significantly higher than that of the VGG-like alternatives. In fact, the relative order is the same as observed in the within-space case: MFCC is followed by AT, IR, VS, and finally AE.
We separately analyzed the data with respect to the other three combinations of measures, and found very similar results. The largest effect is due to the task/encoder, followed by the transformation; the effect of the interaction is considerably smaller. As the first rows of Figure 11 show, the same results are observed in all four cases, with statistically significant differences among tasks.
5.3.2 Correlation
The bottom plots in Figure 10 show the results for between-space consistency measured with the correlation criterion. It can clearly be seen that MFCC preserves the consistency between spaces much better than the VGG-like encoders, and in general, all encoders are quite robust to the magnitude of the perturbations.
Analyzing the data again using a GAM model confirms these observations. For instance, when analyzing consistency with respect to the DTW and Euclidean measures, the largest effect is by far that of the task/encoder, with the transformation and interaction effects being two orders of magnitude smaller. This is because of the clear superiority of MFCC, followed by AE, IR, VS and finally AT (see the right plot of the fourth row in Figure 11).
As before, we separately analyzed the data with respect to the other three combinations of measures, and found very similar results. As the last two rows of Figure 11 show, the same qualitative observations can be made in all four cases, with statistically significant differences among tasks. Noticeably, the superiority of MFCC is even clearer when employing the Euclidean distance. Finally, another visible difference is that the relative order of the VGG-like networks is reversed with respect to the accuracy criterion, with AE being the most consistent, followed by VS, IR, and finally AT.
5.4 Sensitivity to Imperceptible Transformations
5.4.1 Task-Specific Performance
In this subsection, we focus on the special cases of transformations with a magnitude such that they are hardly perceived by humans [5, 8, 26]. As the first row of Figure 12 shows, performance is degraded even with such small transformations, confirming the findings from [5]. In particular, the VS task shows more variability among transformations compared to other tasks. Between transformations, the PS cases show relatively higher degradation.
5.4.2 Within-Space Consistency
The second row of Figure 12 illustrates the within-space consistency in the latent space when considering these smallest transformations. As before, there is no substantial difference between the distance metrics. In general, the MFCC, AT, and IR encoders/tasks are relatively robust to these small transformations, with their median consistencies close to 1. However, encoders trained on the VS and AE tasks show undesirably high sensitivity to these small transformations. In this case, the effect of the PS transformations is even clearer, causing considerable variance for most of the tasks. The exception is AE, which is more uniformly spread in the first place.
5.4.3 Between-Space Consistency
Finally, the between-space consistencies for the minimum transformations are depicted in the last two rows of Figure 12. First, we see no significant differences between pairs of distance measures. When focusing on the accuracy criterion, the plots highly resemble those from Section 5.4.2, which can be expected, because the within-space consistency in the audio space is approximately 1 for all these transformations, as illustrated in Figure 6. On the other hand, when focusing on the correlation criterion, the last row of Figure 12 shows that even such small transformations already result in large inconsistencies between spaces when employing neural network representations.
6 Discussion and Conclusion
6.1 Effect of the Encoder
For most of our experiments, the largest differences are found between encoders. As is well-known, the VGG-like deep neural network shows significantly better task-specific performance than the MFCC encoder. However, when considering distance consistency, MFCC is shown to be the most consistent encoder in all cases, with the neural network approaches performing substantially worse in this respect. This suggests that, when a task requires robustness to potential musical or acoustical deviations in the audio input space, it may be preferable to employ MFCCs rather than neural network encoders.
6.2 Effect of the Learning Task
Considering the neural networks, our results show that the choice of learning task is the most important factor affecting consistency. For instance, a VGG-like network trained on the AE task seems to preserve the relative distances among samples (high between-space consistency), but individual transformed samples will fall closer to originals that were not the actual original the transformation was applied to (low within-space consistency). On the other hand, a task like AT yields high consistency in the neighborhood of corresponding original samples (high within-space consistency), but does not preserve the general structure of the audio space (low between-space consistency). This means that a network trained on a low-level task like AE is more consistent than a network trained on a high-level task like AT, in the sense that the resulting latent space is less morphed and more closely resembles the original audio space. In fact, in our results we see that the semantic high-levelness of the task (AT > IR > VS > AE) is positively correlated with within-space consistency, while negatively correlated with between-space consistency.
To further confirm this observation, we also computed the between-space consistency only on the set of original samples. The results, in Figure 13, are very similar to those in the last two rows of Figures 11 and 12. This suggests that, in general, the global distance structure of an embedded latent space with respect to the original samples generalizes over the vicinity of those originals, at least for the transformations that we employed.
Considering that AE is an unsupervised learning task, whose objective is merely to embed an original data point into a low-dimensional latent space by minimizing the reconstruction error, the odds are lower that data points will cluster according to more semantic criteria, as implicitly encoded in supervised learning tasks. In contrast, the VS task should morph the latent space such that input clips with similar degrees of “vocalness” fall close together, as indeed is shown in Figure 14. As the task becomes more complex and high-level, such as with AT, this clustering effect will become more multifaceted, potentially morphing the latent space with respect to the semantic space that is used as the source of supervision.
6.3 Effect of the Transformation
Across almost all experimental results, significant differences between transformation categories are observed. On the one hand, this supports the findings from [5, 8], which show the vulnerability of MIR systems to small audio transformations. On the other hand, this also implies that different types of transformations have different effects on the latent space, as depicted in Figure 7.
6.4 Are Nearby Neighbors Relatives?
As depicted in Figure 7, substantial inconsistencies emerge in the learned latent spaces when compared to the audio space. Clearly, these inconsistencies are not desirable, especially when the transformations we applied are not supposed to have noticeable effects. However, as our consistency investigations showed, the MFCC baseline encoder behaves surprisingly well in terms of consistency, evidencing that handcrafted features should not always be considered inferior to deep representations.
While in a conventional audio feature extraction pipeline, important salient data patterns may be missed due to accidental human omission, our experimental results indicate that DNN representations may be unexpectedly unreliable: in the deep music embedding space, ‘known relatives’ in the audio space may suddenly become faraway pairs. Such unexpected inconsistencies of a representation should be carefully studied and taken into account, especially given the increasing interest in applying transfer learning using DNN representations, not only in the MIR field. For example, if a system needs to feed degraded audio inputs to a pretrained DNN (as may happen, e.g., in music identification tasks), then even if humans can barely recognize the difference between these inputs and their original form, there is no guarantee that a transformed input will be embedded at a position similar to that of its original version in the latent space.
6.5 Towards Reliable Deep Music Embeddings
In this work, we proposed to use several distance-consistency-based criteria in order to assess whether representations in various spaces can be deemed consistent. We see this as a complementary means of diagnosis beyond task-related performance criteria, when aiming to learn more general and robust deep representations. More specifically, we investigated whether deep latent spaces are consistent in terms of distance structure when smaller and larger transformations of the raw audio are introduced (RQ 1). Next to this, we investigated how the various types of learning tasks used to train deep encoders impact these consistencies (RQ 2).
Consequently, we conducted an experiment employing 4 MIR tasks, considering deep encoders versus a conventional handcrafted MFCC encoder, to measure the consistency in different scenarios. Our findings can be summarized as follows:

RQ 1. Compared to the MFCC baseline, all DNN encoders show lower consistency, both in terms of within-space consistency and between-space consistency, especially when transformations grow from imperceptibly small to larger, more perceptible ones.

RQ 2. Considering learning tasks, the high-levelness of a task is correlated with the consistency of the resulting encoder. For instance, an AT-specialized encoder, which deals with a semantically high-level task, yields the highest within-space consistency, but the lowest between-space consistency. On the other hand, an AE-specialized encoder, which deals with a semantically low-level task, shows the opposite trend.
To realize a fully robust assessment framework, there are still a number of aspects to be investigated. First of all, a more in-depth study is required of the different magnitudes of the transformations and their comparability. While we applied different magnitudes for each transformation, we decided not to comparatively consider the magnitude ranges in the present analysis: we do not have exact means to compare the perceptual effect of different magnitudes, which would be crucial for normalizing across transformations.
Furthermore, similar analysis techniques can be applied to more diverse DNN settings, including different architectures, different levels of regularization, and so on. Also, as suggested in [8, 9], the same measurement and analysis techniques can be used for adversarial examples generated from the DNN itself, as another important means of studying a DNN's reliability.
Moreover, based on the observations from our study, it may be possible to develop countermeasures that maintain high consistency of a model while still yielding high task-specific performance. For instance, it may be effective to directly supervise a network, during learning, to embed transformed examples close to their original versions in the latent space. This can be implemented as an auxiliary objective alongside the main learning objective, or by directly introducing the transformed examples as data augmentation.
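A minimal sketch of such an auxiliary objective (the function names and the weighting knob `lam` are hypothetical, not from the paper) adds a penalty on the embedding distance between each clip and its transformed version to the main task loss:

```python
import numpy as np

def total_loss(task_loss, emb_orig, emb_trans, lam=0.5):
    """Main task loss plus a consistency penalty: the mean squared
    Euclidean distance between each clip's embedding and the embedding
    of its transformed version, weighted by the hypothetical knob lam."""
    penalty = np.mean(np.sum((emb_orig - emb_trans) ** 2, axis=1))
    return task_loss + lam * penalty

# Toy numbers: two clips, embeddings one unit apart in each of 3 dims,
# so the per-clip squared distance is 3.0 and the penalty is 3.0.
emb_o = np.zeros((2, 3))
emb_t = np.ones((2, 3))
print(total_loss(1.0, emb_o, emb_t))   # 1.0 + 0.5 * 3.0 = 2.5
```

In a real training loop the penalty would be computed per minibatch on (original, transformed) pairs produced by the same kinds of transformations studied above, pushing the encoder toward high within-space consistency by construction.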
Finally, we believe that our work is a step towards a practical framework for more interpretable deep learning models, in the sense that we suggest a less task-dependent measure for evaluating a deep representation that is still grounded in known semantic relationships in the original item space.
Acknowledgments
This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.
References
 [1] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems 26, NIPS, pages 2643–2651, Lake Tahoe, NV, USA, December 2013.
 [2] Eric J. Humphrey, Juan Pablo Bello, and Yann LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 403–408, October 2012.

 [3] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 2392–2396, March 2017.
 [4] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In Latent Variable Analysis and Signal Separation - 13th International Conference, LVA/ICA, Proceedings, pages 258–266, Grenoble, France, February 2017.
 [5] Bob L. Sturm. A simple method to determine if a music information retrieval system is a "horse". IEEE Trans. Multimedia, 16(6):1636–1644, 2014.
 [6] Francisco Rodríguez-Algarra, Bob L. Sturm, and Hugo Maruri-Aguilar. Analysing scattering-based music content analysis systems: Where’s the music? In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, pages 344–350, August 2016.
 [7] Bob L. Sturm. The "horse" inside: Seeking causes behind the behaviors of music content analysis systems. Computers in Entertainment, 14(2):3:1–3:32, 2016.
 [8] Corey Kereliuk, Bob L. Sturm, and Jan Larsen. Deep learning and music adversaries. IEEE Trans. Multimedia, 17(11):2059–2071, 2015.
 [9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings, May 2015.
 [10] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, pages 141–149, Suzhou, China, October 2017.
 [11] Jongpil Lee, Taejun Kim, Jiyoung Park, and Juhan Nam. Raw waveform-based audio classification using sample-level CNN architectures. CoRR, abs/1712.00866, 2017.
 [12] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In 14th Sound and Music Computing Conference, SMC, Espoo, Finland, July 2017.
 [13] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 6964–6968, Florence, Italy, May 2014. IEEE.
 [14] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification. Applied Sciences, 8(1), 2018.
 [15] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, pages 745–751, October 2017.
 [16] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.
 [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, May 2017.
 [18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, May 2015.

 [19] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML, pages 807–814, Haifa, Israel, June 2010. Omnipress.
 [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pages 448–456, Lille, France, July 2015. JMLR, Inc.
 [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, May 2015.
 [22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI - 18th International Conference, Proceedings, Part III, pages 234–241, October 2015.
 [23] Stan Salvador and Philip Chan. FastDTW: Toward accurate dynamic time warping in linear time and space. In 3rd International Workshop on Mining Temporal and Sequential Data (TDM04). Citeseer, 2004.
 [24] Diego Furtado Silva, Chin-Chia Michael Yeh, Gustavo E. A. P. A. Batista, and Eamonn J. Keogh. SiMPle: Assessing music similarity using subsequences joins. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, pages 23–29, August 2016.
 [25] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
 [26] Justin Salamon and Julián Urbano. Current challenges in the evaluation of predominant melody extraction algorithms. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 289–294, October 2012.
 [27] Suramya Tomar. Converting video formats with ffmpeg. Linux Journal, 2006(146):10, 2006.
 [28] EBU. Loudness normalisation and permitted maximum level of audio signals. EBU Recommendation R 128, 2010.
 [29] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR, pages 591–596, Miami, FL, USA, October 2011. University of Miami.
 [30] Juan J. Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 559–564, October 2012.
 [31] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.
 [32] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In LVA/ICA, volume 10891 of Lecture Notes in Computer Science, pages 293–305. Springer, 2018.
 [33] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.