Perceiving Music Quality with GANs

06/11/2020 · by Agrin Hilmkil, et al.

Assessing the perceptual quality of musical audio signals usually requires a clean reference signal of unaltered content, hindering applications where a reference is unavailable, such as music generation. We propose training a generative adversarial network on a music library and using its discriminator as a measure of the perceived quality of music. This method is unsupervised, needs no access to degraded material, and can be tuned for various domains of music. Finally, the method is shown to have a statistically significant correlation with human ratings of music.






1 Introduction

Quality assessment (QA) of musical audio is commonly performed by comparing the audio to a clean reference signal [3], which is typically available when evaluating audio compression or when measuring degradation of music playback. When a reference is unavailable, e.g. when developing audio generation algorithms, subjective listening tests are employed, which are time-consuming and hard to reproduce. A no-reference quality assessment (NRQA) measure would speed up algorithm development and promote reproducible research, yet has remained elusive for music. Fréchet Audio Distance (FAD) [5] was recently proposed for scoring the audio quality of many generated songs, but cannot assess individual songs.

Figure 1: Randomly generated spectrograms produced by the GAN.

In this work we approach NRQA through generative modelling. Generative adversarial networks (GANs) have recently shown promise in modelling music [4], and variations of the discriminator loss have been observed to correlate with the perceived quality of generated content [1]. We therefore propose a GAN-based method for NRQA. By collecting human quality ratings on degraded music excerpts, we explore whether a discriminator correlates with human perception of music quality and could serve as an NRQA measure.

2 Method

Audio was obtained from Epidemic Sound, an online service with professionally produced, high-quality music. It was split into train/test sets with 1716/65 songs.

We used a time-frequency representation in order to adopt existing GAN architectures, as in [4]. The signals were sampled at 16 kHz. We applied a short-time Fourier transform (STFT) with 75%-overlapping Hann windows of 2048 samples each. Magnitudes were mel-filtered and log-scaled.
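The preprocessing described above can be sketched as follows. This is a minimal reconstruction from the stated parameters (16 kHz sampling, 2048-sample Hann windows, 75% overlap, mel filtering, log scaling); the number of mel bands (here 128) and the log epsilon are assumptions, as the paper does not specify them.

```python
import numpy as np
from scipy.signal import stft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank mapping linear FFT bins to mel bands."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, center, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (fft_freqs - lo) / (center - lo)
        fall = (hi - fft_freqs) / (hi - center)
        fb[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=2048, n_mels=128):
    """STFT with 75%-overlapping Hann windows, then mel filtering and log scaling."""
    hop = n_fft // 4  # 75% overlap
    # scipy's stft uses a Hann window by default
    _, _, spec = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(spec)                      # (n_fft // 2 + 1, n_frames)
    mel_fb = mel_filterbank(sr, n_fft, n_mels)
    return np.log(mel_fb @ mag + 1e-6)      # epsilon avoids log(0)
```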

The GAN architecture used is BigGAN [2] without additional losses. We set the channel width multiplier to . The input noise is  dimensions. Minibatch size was  samples per GPU tower (we used 4× Titan X Pascal GPUs with 12 GB VRAM each). Learning rate was  for the generator and  for the discriminator. The discriminator was updated twice for each generator step. Our proposed NRQA measure is the discriminator activation score, bounded between 0.0 and 1.0.
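A bounded score in [0, 1] as described can be obtained by passing the discriminator's final logit through a sigmoid. This is a sketch of that readout only; `discriminator_logit` stands in for the output of the trained BigGAN discriminator on a log-mel spectrogram, and the exact activation used is an assumption consistent with the stated 0.0–1.0 bounds.

```python
import numpy as np

def nrqa_score(discriminator_logit):
    """Map a raw discriminator logit to a quality score in (0, 1).

    Higher scores mean the excerpt looks more like the clean
    training distribution to the discriminator.
    """
    return 1.0 / (1.0 + np.exp(-discriminator_logit))
```

Usage would then be, e.g., `nrqa_score(discriminator(log_mel_spectrogram(audio)))` for a single excerpt.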

To compare with human perception, we produced a dataset of quality-rated music excerpts. From the test-set songs we extracted excerpts of four seconds each. Some excerpts were degraded with a random intensity of waveshaping distortion, lowpass filtering, master limiting, or additive noise. The final dataset was balanced across degradation types, including non-degraded excerpts. Using Amazon Mechanical Turk (AMT), 488 human participants were asked to rate 10 excerpts each with the question “How do you rate the audio quality of this music segment?” on a five-point scale. The median rating per excerpt was compared to the GAN discriminator score for that excerpt.
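Three of the four degradation types above can be sketched as below (master limiting is omitted for brevity). The intensity ranges are illustrative assumptions; the paper does not report the exact parameters used.

```python
import numpy as np
from scipy.signal import butter, lfilter

rng = np.random.default_rng(0)

def waveshape(audio, drive):
    """Waveshaping distortion via tanh soft clipping."""
    return np.tanh(drive * audio) / np.tanh(drive)

def lowpass(audio, cutoff_hz, sr=16000, order=4):
    """Butterworth lowpass filter at the given cutoff frequency."""
    b, a = butter(order, cutoff_hz / (sr / 2.0), btype="low")
    return lfilter(b, a, audio)

def add_noise(audio, snr_db):
    """Additive white noise at a given signal-to-noise ratio."""
    sig_power = np.mean(audio ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    return audio + rng.normal(0.0, np.sqrt(noise_power), audio.shape)

def degrade(audio):
    """Apply one randomly chosen degradation with random intensity."""
    choice = rng.integers(3)
    if choice == 0:
        return waveshape(audio, drive=rng.uniform(2.0, 20.0))
    if choice == 1:
        return lowpass(audio, cutoff_hz=rng.uniform(500.0, 4000.0))
    return add_noise(audio, snr_db=rng.uniform(5.0, 30.0))
```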

3 Results and discussion

Figure 2: The distribution of discriminator scores for each median human rating per excerpt.

As a precaution against mode collapse, we inspect randomly generated content from the generator (Figure 1). Figure 2 shows that the median discriminator score increases monotonically with the ratings, suggesting that the method may be particularly suitable for ranking collections of data. The Spearman correlation between human ratings and our method is 0.426 and highly statistically significant, so a GAN discriminator could be used as a perceptually informed NRQA measure of music.
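A rank correlation of this kind can be computed directly with SciPy. The ratings and scores below are toy illustrative values, not the paper's data; `spearmanr` handles the tied ranks that arise from a five-point scale.

```python
from scipy.stats import spearmanr

# Hypothetical median human ratings (1-5) and discriminator scores
ratings = [1, 2, 2, 3, 4, 4, 5, 5]
scores = [0.11, 0.20, 0.18, 0.40, 0.55, 0.52, 0.80, 0.83]

# rho is the Spearman rank correlation; p_value tests rho != 0
rho, p_value = spearmanr(ratings, scores)
```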

Interestingly, the discriminator performs this task without access to any type of degradation during training, and without a clean reference at test time.

While training discriminative models for NRQA may be feasible, designing training examples would require extensive domain knowledge of acoustics and of the audio artifacts that might be encountered at test time. Our method circumvents this, as it only requires examples of high-quality musical audio.

Finally, while FAD [5] is unable to score individual songs, we plan to compare our method's performance against it when scoring large collections of music. In particular, we will explore their use for scoring generated music and track how the measures evolve during training of the generating algorithm.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 214–223, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [2] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 2019.
  • [3] D. Campbell, E. Jones, and M. Glavin. Audio quality assessment techniques: A review, and recent developments. Signal Processing, 89(8):1489–1500, 2009.
  • [4] C. Donahue, J. McAuley, and M. Puckette. Adversarial audio synthesis. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 2019.
  • [5] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms. In Proceedings of Interspeech 2019, pages 2350–2354, 2019.