Disentanglement learning has been an influential field of study for controllable music generation with variational autoencoders (VAEs). A number of previous studies have applied supervised disentanglement learning techniques to several semantic attributes, such as rhythm and pitch range [Pati2019LatentAttributes], note density and contour [Pati2020Attribute-basedAuto-encoders], arousal [Tan2020MusicModelling], style [Hung2019MusicalRepresentations], and genre [Brunner2018MIDI-VAE:Transfer], with varying degrees of success. However, learning to simultaneously manipulate multiple attributes, in particular, remains a difficult task to both achieve and objectively evaluate [Pati2021IsGeneration] due to the limitations of current metrics.
One major issue with popular disentanglement metrics [Do2020TheoryRepresentations], such as the mutual information gap (MIG) [Chen2018IsolatingAutoencoders], separate attribute predictability (SAP) [Kumar2018VariationalObservations], and modularity [Ridgeway2018LearningLoss], is that they were designed for independent generative factors rather than real-world semantic attributes. As semantic attributes related to music are often highly interdependent, these metrics do not accurately reflect the 'quality' of a latent representation regularized for multiple interdependent attributes: information inherently shared between attributes is penalized in the same way as information shared due to undesired entanglement.
In this work, we propose a dependency-aware metric based on mutual information (MI) to act as a drop-in replacement for MIG. Preliminary experiments were carried out to demonstrate the benefits of the proposed metric over MIG.
2 Proposed Metrics
Consider a set of attributes $\mathcal{A} = \{a_i\}_{i=1}^{n}$ and a latent vector $\mathbf{z} \in \mathbb{R}^{m}$ with $m \geq n$. Without loss of generality, for $i \leq n$, we assume that the dimension $z_i$ is regularized for the attribute $a_i$. The remaining $m - n$ dimensions are unregularized. $H(\cdot)$ denotes entropy, while $I(\cdot\,;\cdot)$ denotes mutual information.
MIG was proposed in [Chen2018IsolatingAutoencoders] to measure the degree of disentanglement in a latent space. The idea behind MIG is to measure, for each attribute, the normalized difference between the mutual information of the attribute with its most informative latent dimension and that with its second-most informative latent dimension. Mathematically, MIG is given by
\[
\mathrm{MIG} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{H(a_i)} \left( I(a_i; z_i) - \max_{j \neq i} I(a_i; z_j) \right),
\]
where $j \in \{1, \dots, m\}$. It is reasonable to assume $z_i = \arg\max_{z_j} I(a_i; z_j)$ in a supervised setting; otherwise, MIG takes negative values to indicate regularization failure. The normalization is given by $H(a_i) = I(a_i; a_i)$, which would be the maximum possible difference in MI between a latent dimension coding perfectly for $a_i$, i.e., $I(a_i; z_i) = H(a_i)$, and the second-most informative dimension containing no information about $a_i$, i.e., $\max_{j \neq i} I(a_i; z_j) = 0$. As such, MIG is bounded above by one.
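For intuition, MIG can be estimated from samples by binning each variable and using plug-in (histogram) estimates of entropy and MI. The sketch below is purely illustrative and is not the estimator used in this work; all function names are hypothetical, and the discrete plug-in estimates differ from the differential-entropy treatment discussed later:

```python
import numpy as np

def entropy(x, bins=20):
    # Plug-in entropy estimate (in nats) of a binned continuous variable.
    p, _ = np.histogram(x, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_info(x, y, bins=20):
    # Plug-in MI estimate: I(X; Y) = H(X) + H(Y) - H(X, Y).
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(px) + h(py) - h(pxy)

def mig(attrs, latents, bins=20):
    # attrs: (N, n) attribute samples; latents: (N, m) latent samples.
    # Assumes latent dimension i is regularized for attribute i.
    n = attrs.shape[1]
    total = 0.0
    for i in range(n):
        mis = np.array([mutual_info(attrs[:, i], latents[:, j], bins)
                        for j in range(latents.shape[1])])
        gap = mis[i] - np.max(np.delete(mis, i))
        total += gap / entropy(attrs[:, i], bins)
    return total / n
```

With a latent dimension that copies the attribute exactly and a second, independent noise dimension, this estimator returns a value close to one, as expected.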
However, given the interdependence of semantic attributes, if $I(a_i; a_k) > 0$ for some $k \neq i$, the ideal value of the difference is no longer $H(a_i)$, since
\[
\max_{j \neq i} I(a_i; z_j) \geq I(a_i; z_k) > 0
\]
whenever $z_k$ successfully encodes $a_k$.
For regularized latent dimensions, we consider a pair of inherently entangled attributes $(a_i, a_k)$, $k \neq i$, i.e., $I(a_i; a_k) > 0$. Under the ideal case where $z_k$ is fully informative [Do2020TheoryRepresentations] about $a_k$, i.e., $I(a_k; z_k) = H(a_k)$, we have, by the data processing inequality,
\[
I(a_i; z_k) \leq I(a_i; a_k).
\]
Moreover, in the ideal case, $a_k$ and $z_k$ also have an invertible mapping between each other; this means that $I(a_i; z_k) = I(a_i; a_k)$. Hence, in the ideal case, the difference is given by
\[
I(a_i; z_i) - \max_{j \neq i} I(a_i; z_j) = H(a_i) - I(a_i; a_k) = H(a_i \mid a_k).
\]
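The identity used above, $H(a_i \mid a_k) = H(a_i) - I(a_i; a_k)$, can be checked numerically on a toy discrete joint distribution (an illustrative sketch, not taken from this work):

```python
import numpy as np

# A toy joint pmf over two dependent binary attributes.
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])

p_a = p_joint.sum(axis=1)  # marginal of a
p_b = p_joint.sum(axis=0)  # marginal of b

def h(p):
    # Shannon entropy in nats, ignoring zero-probability cells.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_a = h(p_a)
H_ab = h(p_joint.ravel())       # joint entropy H(a, b)
H_a_given_b = H_ab - h(p_b)     # chain rule: H(a | b) = H(a, b) - H(b)
I_ab = H_a + h(p_b) - H_ab      # I(a; b) = H(a) + H(b) - H(a, b)

# The DMIG normalizer: H(a | b) == H(a) - I(a; b)
assert np.isclose(H_a_given_b, H_a - I_ab)
```

Because the two attributes are dependent here, $H(a \mid b)$ is strictly smaller than $H(a)$, which is exactly the gap in the MIG normalization that the next section addresses.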
As such, we extend the definition of the mutual information gap to the dependency-aware mutual information gap (DMIG) as follows:
\[
\mathrm{DMIG} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{H(a_i \mid a_{k_i})} \left( I(a_i; z_i) - \max_{j \neq i} I(a_i; z_j) \right),
\]
where $a_{k_i}$ is the attribute regularizing the second-most informative latent dimension for $a_i$, with $H(a_i \mid a_{k_i}) := H(a_i)$ if that dimension is unregularized. DMIG remains faithful to the core idea of MIG but modifies the normalization to properly account for inter-attribute dependencies. When $a_i$ and $a_{k_i}$ are independent, $H(a_i \mid a_{k_i}) = H(a_i)$ and DMIG reduces to vanilla MIG.
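Following the same binned plug-in approach (again a hypothetical sketch, not this work's implementation), DMIG replaces the normalizer $H(a_i)$ with $H(a_i \mid a_k) = H(a_i) - I(a_i; a_k)$ whenever the second-most informative dimension is itself regularized for another attribute:

```python
import numpy as np

def _mi(x, y, bins=20):
    # Plug-in MI estimate from a 2-D histogram, in nats.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(px) + h(py) - h(pxy)

def _entropy(x, bins=20):
    p, _ = np.histogram(x, bins=bins)
    p = p / p.sum()
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def dmig(attrs, latents, bins=20):
    # attrs: (N, n); latents: (N, m); dimension i regularized for attribute i.
    n = attrs.shape[1]
    total = 0.0
    for i in range(n):
        mis = np.array([_mi(attrs[:, i], latents[:, j], bins)
                        for j in range(latents.shape[1])])
        j2 = int(np.argmax(np.delete(mis, i)))
        j2 = j2 if j2 < i else j2 + 1  # undo the index shift from np.delete
        gap = mis[i] - mis[j2]
        if j2 < n:
            # Second-most informative dim is regularized for attribute j2:
            # normalize by H(a_i | a_j2) = H(a_i) - I(a_i; a_j2).
            norm = _entropy(attrs[:, i], bins) - _mi(attrs[:, i], attrs[:, j2], bins)
        else:
            # Unregularized dimension: fall back to the MIG normalizer.
            norm = _entropy(attrs[:, i], bins)
        total += gap / norm
    return total / n
```

For two correlated attributes each perfectly encoded by its own dimension, the gap and the dependency-aware normalizer coincide, so DMIG evaluates to one even though vanilla MIG would be far below one.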
Note that, in the case of continuous random variables, differential entropy can be negative, unlike discrete Shannon entropy. This is particularly evident with conditional differential entropy and may result in DMIG values of magnitude greater than unity whenever $H(a_i \mid a_{k_i})$ is small or negative.
3 Experiments
To illustrate the key features of the dependency-aware metric, we evaluate the latent space of a VAE model trained to reconstruct raw musical audio while being regularized for two highly correlated attributes. Full experimental details are provided in the supplementary materials at https://github.com/karnwatcharasupat/dependency-aware-mi-metrics.
3.1 Data and model
We use the NSynth dataset [Engel2017NeuralAutoencoders], a large-scale dataset of musical notes played by various instruments with diverse timbral qualities. The dataset provides 4-second snippets sampled at 16 kHz. From the raw audio provided by NSynth, we extract two semantic attributes, namely brightness and depth, using the AudioCommons Timbral Model [Pearce2019DeliverableContent]. Since both the brightness and depth features are heavily influenced by the spectral distribution of the sound [Pearce2017DeliverableContent], they are strongly correlated.
We trained a convolutional VAE model to reconstruct the log-magnitude spectrogram of the audio and obtained the reconstructed time-domain audio using a phase-bypass reconstruction. The model is trained using the attribute-regularized $\beta$-VAE loss function [Kingma2014Auto-encodingBayes, Higgins2017, Pati2020Attribute-basedAuto-encoders]
\[
\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{D}_{\text{KL}} + \gamma \mathcal{L}_{\text{AR}},
\]
where $\mathcal{L}_{\text{recon}}$ is the reconstruction loss implemented via the mean square error on the log-magnitude spectrograms, $\mathcal{D}_{\text{KL}}$ is the KL divergence term with a standard normal prior, and $\mathcal{L}_{\text{AR}}$ is the AR-VAE regularization from [Pati2020Attribute-basedAuto-encoders]. The hyperparameter values used are listed in the supplementary materials.
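For concreteness, the sketch below shows a minimal numpy version of the pairwise attribute regularizer in the spirit of AR-VAE [Pati2020Attribute-basedAuto-encoders], under our reading of the method; in the actual model this would operate on framework tensors inside the training loop, and the `delta` steepness parameter is a tunable hyperparameter:

```python
import numpy as np

def ar_regularizer(z_r, a, delta=10.0):
    """Attribute-regularization loss for one latent dimension (AR-VAE style).

    z_r : (B,) values of the regularized latent dimension over a batch
    a   : (B,) corresponding attribute values
    Encourages the latent dimension to increase monotonically with a.
    """
    dz = z_r[:, None] - z_r[None, :]  # pairwise latent differences
    da = a[:, None] - a[None, :]      # pairwise attribute differences
    # Penalize disagreement between latent and attribute orderings:
    # tanh(delta * dz) should match sign(da) for every pair in the batch.
    return np.mean(np.abs(np.tanh(delta * dz) - np.sign(da)))
```

In the full loss, one such term is computed per regularized dimension and weighted by $\gamma$; a dimension that orders the batch consistently with its attribute incurs a small penalty, while an anti-correlated dimension incurs a large one.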
3.2 Results
Figure 1 plots the MIG, DMIG, and Spearman correlation coefficient (SCC) of the attributes (brightness and depth) with respect to their respective regularized latent dimensions on the validation set over the course of training. Due to the high correlation between brightness and depth, for most of the training, the most and second-most informative latent dimensions in MIG/DMIG are the regularized ones that encode the attributes.
As seen from Figure 1(a), the MIG values are generally very low relative to the maximum of one, despite the SCC indicating successful encoding of the attribute information into the latent dimensions. This is due to the high mutual information between brightness and depth, which results in a very low true bound for MIG. On the other hand, Figure 1(b) shows that DMIG reflects the quality of the latent space more clearly as it encodes the attributes; the rapid improvement in SCC mostly occurred before DMIG reached one (dotted line). In Figure 1(c), the highly linear relationship between MIG and DMIG further demonstrates that DMIG is simply MIG renormalized to better reflect the dependencies between the semantic attributes coded by the model. Admittedly, the peculiarities of conditional differential entropy and the practical computation of mutual information and entropy estimates [scikit-learn] contribute to a DMIG range that is much larger than that of vanilla MIG. We will work to resolve this limitation in future work.
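The SCC reported above can be computed as the Pearson correlation of ranks; a minimal tie-free sketch (a hypothetical helper, not this work's implementation, which would typically use a library routine):

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation: Pearson correlation of the ranks.
    # Assumes no ties (double-argsort ranking does not average tied ranks).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because it depends only on ranks, this coefficient is one for any strictly monotone encoding of an attribute, which is why it serves here as a normalization-free check of encoding quality alongside MIG and DMIG.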
4 Conclusion
In this work, we proposed a dependency-aware extension to a popular disentanglement metric, the mutual information gap (MIG), to better account for the inter-attribute dependencies often observed in real-world datasets. Key features of the proposed dependency-aware MIG were demonstrated via an experiment on an audio dataset with highly correlated timbral attributes.
K. N. Watcharasupat acknowledges the support from the CN Yang Scholars Programme, Nanyang Technological University, Singapore.