Variational auto-encoders for audio
In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively. For reconstruction, latent variables of timbre and pitch are sampled from the corresponding mixture components and concatenated as the input to a decoder. We show the efficacy of the model through latent space visualization, and a quantitative analysis indicates the discriminability of these spaces, even with a limited number of instrument labels for training. The model allows for controllable synthesis of selected instrument sounds by sampling from the latent spaces. To evaluate this, we trained instrument and pitch classifiers using the original labeled data. These classifiers achieve high accuracy when tested on our synthesized sounds, which verifies the model's capability for controllable and realistic timbre and pitch synthesis. Our model also enables timbre transfer between multiple instruments with a single autoencoder architecture, which is evaluated by measuring the shift in the posterior of instrument classification. Our in-depth evaluation confirms the model's ability to successfully disentangle timbre and pitch.
A disentangled feature representation is defined as having disjoint subsets of feature dimensions that are only sensitive to changes in corresponding factors of variation in the observed data [27, 2, 32]. Deep generative models [19, 25, 13, 33] have been exploited to learn disentangled representations in different domains. In the visual domain, studies have focused on learning independent representations for data generative factors such as identity and azimuth [5, 26, 14]. In natural language generation, efforts have been made to generate texts with controlled sentiment [18, 36, 10]. In the speech domain, we have witnessed successful attempts at controllable speech synthesis by disentangling factors such as speaker identity, speed of speech, emotion, and noise level [15, 35, 17]. There has been relatively little research on learning disentangled representations for music. In this paper, we disentangle the pitch and timbre of musical instrument sound recordings.
Pitch and timbre are essential properties of musical sounds. Given that one pitch can be played by different instruments, we assume the two can be separated. From the perspective of music analysis, disentangled representations of pitch and timbre can be regarded as timbre- and pitch-invariant features which could be exploited for downstream tasks [29, 30]. From the synthesis point of view, disentangled representations enable the generation of musical notes with identical pitches (timbres) but different timbres (pitches). Recently, Hung et al. presented the first attempt to learn disentangled representations of pitch and timbre for synthesized music, using frame-level instrument and pitch labels with encoder-decoder networks. Even though the authors managed to change instrumentation to some extent without affecting pitch structure, the approach was restrictive, as it worked with MIDI-synthesized audio and relied on clean frame-level labels, which are scarce. Disentangled representations allow for several applications, including music style transfer. Brunner et al. proposed a model based on variational autoencoders (VAEs) to generate music with controllable attributes. While genre was factorized by an auxiliary classifier, other musical properties remained entangled. Besides the aforementioned MIDI-based models, research on audio has focused on translating between different domains of instrumentation [7, 28, 3, 20]. None of these works, however, has addressed learning disentangled latent variables of both pitch and timbre.
This research distinguishes itself by disentangling instrument sounds into distinct sets of latent variables (i.e., pitch and timbre), with a framework based on Gaussian Mixture VAEs (GMVAEs). We model the generative process of an isolated musical note by independently sampling pitch and timbre (instrument) categorical variables. Note that the two factors are in fact dependent, in the sense that the range of pitch is instrument-dependent; however, we verify the model's capability to disentangle them under this simplified assumption of independence. Conditioned on these categorical variables, Gaussian-distributed latent variables are then sampled that characterize the variation in the sampled pitch and instrument, respectively. Finally, the data are generated conditioned on the two latent variables. We favor the proposed framework over vanilla VAEs [8, 9] for its more flexible latent distribution compared to a standard Gaussian. In addition, it allows for unsupervised or semi-supervised clustering, which can learn interpretable mixture components and the corresponding Gaussian parameters. More importantly, such a framework facilitates the applications in this research: controllable synthesis of instrument sounds, and many-to-many transfer of instrument timbres. Our proposed framework differs from previous studies on timbre transfer in that we achieve transfer between multiple instruments without training a domain-specific decoder for each instrument, and we infer both the pitch and timbre latent variables without requiring categorical conditions on source pitch and instrument. We evaluate our model by visualizing both the latent space and the synthesized spectrograms, and by examining the classification F-scores of classifiers trained in an end-to-end fashion. The results confirm the model's ability to learn disentangled pitch and timbre representations.
The rest of the paper is organized as follows: in Section 2, we discuss the proposed framework, and Section 3 describes the dataset and experimental setup. Experiments and results are reported in Section 4. We conclude our work and provide future directions in Section 5.
In this section, we briefly describe VAEs and GMVAEs, and elaborate on the proposed framework and architecture.
Variational autoencoders (VAEs) are unsupervised generative models that combine latent variable models and deep learning. We denote the observed data and the latent variables by x and z, respectively. A graphical model, corresponding to the generative process p(x|z)p(z), is trained by maximizing the lower bound of the log marginal likelihood log p(x). The intractable posterior p(z|x) is approximated by introducing a variational distribution q(z|x) parameterized with neural networks. In regular VAEs, a common choice for the prior distribution p(z) is an isotropic Gaussian, which encourages each dimension of the latent variables to capture an independent factor of variation from the data, and results in a disentangled representation. Such a unimodal prior, however, does not allow for multi-modal representations. GMVAEs [6, 24, 22] extend the prior to a mixture of Gaussians, and assume the observed data are generated by first determining the mode from which they were generated, which corresponds to learning a graphical model p(x|z)p(z|y)p(y). This introduces a categorical variable y and a variational distribution q(y|x), which infers the classes of the data. This enables semi-supervised learning and unsupervised clustering [6, 22] in deep generative models. In the speech domain, Hsu et al. used two mixture distributions to separately model supervised speaker attributes and unsupervised utterance attributes, which allowed for extra flexibility in conditional speech generation. We build upon this idea to learn separate latent distributions representing the pitch and timbre of musical instrument sounds. More importantly, to facilitate downstream creative applications such as controllable synthesis and instrument timbre transfer in music, we propose to model supervised pitch representations and semi-supervised timbre representations, with labels of pitch and instrument identity. As such, the mixture components in the latent spaces of pitch and timbre can be clearly interpreted as the classes, i.e., pitch and instrument identity.
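To make the inference of classes under a mixture prior concrete, the following numpy sketch computes the responsibilities q(y|z) for a latent code under a toy Gaussian mixture with a uniform prior over components; the means and dimensionality are purely illustrative, not values from the paper.

```python
import numpy as np

def log_gaussian(z, mu, sigma):
    # Log-density of a diagonal-covariance Gaussian, summed over latent dimensions.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (z - mu) ** 2 / (2 * sigma ** 2), axis=-1)

def responsibilities(z, means, sigma):
    # q(y|z) ∝ p(z|y) p(y), with a uniform prior p(y) over mixture components.
    log_p = np.array([log_gaussian(z, mu, sigma) for mu in means])
    log_p -= log_p.max()          # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

# Two toy 2-D mixture components with hypothetical means.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
q_y = responsibilities(np.array([4.8, 5.1]), means, sigma=1.0)
# A latent code close to the second mean is assigned to the second component.
```

In the model, an analogous posterior over instrument classes is what enables semi-supervised clustering of timbre.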
The latent variables of pitch and timbre for an isolated musical note are denoted as z_p (pitch code) and z_t (timbre code), respectively. To represent Gaussian mixture latent distributions, two categorical variables are introduced: an M-way categorical variable y_p for pitch, where M is the number of recorded pitches in the dataset, and a K-way categorical variable y_t for timbre, where K is the number of instrument classes. We consider y_p to be observed (fully supervised), which assumes the availability of pitch labels during training, and is reasonable as we model isolated instrument sounds in this research. For y_t, we investigate both unsupervised and semi-supervised learning, i.e., using varying numbers of instrument labels for training. It is shown in Section 4 that our model can efficiently leverage a limited number of labels. Without loss of generality, we denote y_t as unobserved (unsupervised) as in prior work. The joint probability of x, z_p, z_t, y_p, and y_t is written as:

p(x, z_p, z_t, y_p, y_t) = p(x | z_p, z_t) p(z_p | y_p) p(y_p) p(z_t | y_t) p(y_t).
The prior p(y_t) is uniform-distributed, i.e., we do not assume to know the instrument distribution in the dataset. Both the conditional distributions p(z_p | y_p) and p(z_t | y_t) are assumed to be diagonal-covariance Gaussians with learnable means and constant variances. This amounts to both the marginal priors p(z_p) and p(z_t) being Gaussian mixture models (GMMs) with diagonal covariances. Ideally, each mixture component in the former (pitch space) uniquely represents one pitch, while each component in the latter (timbre space) is interpreted as an instrument identity. As we will see in Section 4.1, however, moderate supervision is essential to learn a timbre space that groups instruments perfectly. For creative applications such as the synthesis and timbre transfer of instrument sounds, the proposed model has numerous merits: 1) the learnt representations are not restricted to be unimodal, which offers a more discriminative timbre space than regular VAEs (Sections 4.1 and 4.2); 2) direct and intuitive sampling from the pitch and timbre spaces allows for consistent and controllable synthesis of instrument sounds, as the Gaussian parameters of each interpretable mixture component are readily available after training (Section 4.3); and 3) simple arithmetic manipulations between the means of mixture components facilitate many-to-many transfer between instrument timbres (Section 4.4). For the training objective, we closely follow the derivation in prior work and train the model by maximizing the evidence lower bound (ELBO):

ELBO(x; y_p) = E_{q(z_p|x) q(z_t|x)}[log p(x | z_p, z_t)] − KL(q(z_p | x) || p(z_p | y_p)) − E_{q(y_t|x)}[KL(q(z_t | x) || p(z_t | y_t))] − KL(q(y_t | x) || p(y_t)),
where p(x | z_p, z_t), q(z_p | x), and q(z_t | x) are parameterized with neural networks, referred to as the decoder, pitch encoder, and timbre encoder, respectively. (A common alternative to a pitch encoder is conditioning the model with categorical pitch labels, so that one does not have to train a pitch encoder [7, 3]. This, however, requires the pitch of the inputs to be known a priori for tasks such as timbre transfer, and also prevents the model from extracting pitch features for downstream tasks. By training this extra encoder, we also demonstrate how one can extend the model to learn multiple interpretable latent variables.) Instead of using another neural network, we approximate q(y_t | x) by E_{q(z_t|x)}[p(y_t | z_t)]. Readers interested in the detailed derivation are referred to Appendix A of Hsu et al.
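The KL terms between the diagonal-Gaussian posteriors and the mixture-component priors have a closed form. The sketch below (numpy, with illustrative parameter values rather than the paper's) computes KL(N(mu_q, diag(sigma_q²)) || N(mu_p, diag(sigma_p²))), as used for the q(z_p|x) and q(z_t|x) terms of the ELBO.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    # summed over latent dimensions.
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
                  - 0.5)

# Hypothetical posterior parameters and a mixture-component prior.
mu_q, sigma_q = np.array([0.5, -1.0]), np.array([0.1, 0.2])
mu_p, sigma_p = np.zeros(2), np.ones(2)
kl = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
# The KL is zero when posterior and prior coincide, and positive otherwise.
```

Because the component variances are held constant in the model, only the means of the priors are learned through this term.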
Our model is composed of a shared decoder and separate encoders for pitch and timbre, as illustrated in fig:architecture. Specifically, we reshape the T-by-F spectrogram to have F channels, each of which is a T-by-1 vector, where T and F refer to the numbers of time frames and frequency bins. Each encoder contains two one-dimensional convolutional layers, each with 512 filters, and a fully connected layer with 512 units. A Gaussian parametric layer follows and outputs two vectors per latent space, representing the mean and log variance. z_p and z_t are sampled from the Gaussian layer with the reparameterization trick, which enables stochastic gradient descent, and are then concatenated as the input for the decoder to reconstruct. The architecture of the decoder is symmetric to that of the encoders. Batch normalization followed by the ReLU activation function is used for every layer except the Gaussian and output layers. We use the tanh activation function for the output layer, as we normalize the data within [−1, 1].
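The sampling step with the reparameterization trick can be sketched as follows (numpy; the 16-dimensional latent size is illustrative, not a value stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); sampling is expressed as a
    # deterministic function of (mu, log_var), so gradients can flow through.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical Gaussian-layer outputs for 16-dimensional pitch and timbre codes.
mu, log_var = np.zeros(16), np.full(16, -2.0)
z_p = reparameterize(mu, log_var, rng)
z_t = reparameterize(mu, log_var, rng)
decoder_input = np.concatenate([z_p, z_t])  # concatenated codes fed to the decoder
```

Writing the noise as an external input is what allows backpropagation through the otherwise stochastic sampling operation.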
[Table (tab:classification): F-scores of instrument classification and pitch classification.]
In this section, we describe the experimental setup, including details of the dataset, input representations, and model configurations.
Inspired by Esling et al., we use a subset of Studio-On-Line (SOL), a database of instrument note recordings (access to the dataset was obtained upon request). The dataset contains 12 instruments, with sample counts in parentheses: piano (Pno, 246), violin (Vn, 138), cello (Vc, 147), English horn (Ehn, 128), French horn (Fhn, 214), tenor trombone (Trtb, 63), trumpet (Trop, 194), saxophone (Sax, 99), bassoon (Bn, 251), clarinet (Clr, 180), flute (Fl, 118) and oboe (Ob, 107), for 1,885 samples in total. All recordings are resampled to 22,050 Hz, and only the first 500 ms segment of each recording is considered. We extract Mel-spectrograms with 256 filterbanks, derived from the power magnitude spectrum of the short-time Fourier transform (STFT). To compute the STFT, we use a Hann window with a window size of 92 ms and a hop size of 11 ms. As a result, the input representation is a 43-by-256 Mel-spectrogram. The dataset is split into a training (90%) and a validation (10%) set, each containing the same distribution of instruments. The magnitude of the Mel-spectrogram is scaled logarithmically, and the minimum and maximum values in the training set are used to normalize the magnitude within [−1, 1] in a corpus-wide fashion, preserving differences in dynamics across recordings.
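The corpus-wide normalization described above can be sketched as follows (numpy; the random array stands in for the log-scaled Mel-spectrograms, and the shapes match the 43-by-256 input described in the text):

```python
import numpy as np

def normalize(log_mel, train_min, train_max):
    # Map log-magnitudes into [-1, 1] using corpus-wide (training-set) statistics,
    # so that differences in dynamics across recordings are preserved.
    return 2.0 * (log_mel - train_min) / (train_max - train_min) - 1.0

# Hypothetical stand-in for a batch of log-scaled Mel-spectrograms (10 x 43 x 256).
rng = np.random.default_rng(1)
train = np.log1p(rng.random((10, 43, 256)))
t_min, t_max = train.min(), train.max()
x = normalize(train[0], t_min, t_max)  # values lie within [-1, 1]
```

Using a single min/max pair for the whole corpus, rather than per-sample statistics, is what keeps loudness differences between recordings intact after normalization.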
In order to train the GMMs in both the pitch and timbre spaces, we initialize the means of the mixture components using Xavier initialization. We set constant, rather than trainable, standard deviations for the pitch and timbre spaces. For the pitch space, we fix a relatively small standard deviation for all mixture components, as each mixture component represents a single pitch, and we do not expect a large variance over recordings that play the same pitch. For the timbre space, we fix a larger standard deviation for all mixture components, which captures the timbre variation within each mixture component, i.e., instrument identity. The numbers of mixture components are M and K, equivalent to the numbers of pitch and instrument classes, respectively. For all experiments, a batch size of 128 is used; model parameters are initialized with Xavier initialization and trained using the Adam optimizer with a fixed learning rate.
In addition to the proposed model (MGMVAE), we consider a baseline (MVAE) that substitutes the timbre space with an isotropic Gaussian, as in regular VAEs. Training such a model amounts to optimizing the ELBO with the last two terms replaced with −KL(q(z_t | x) || p(z_t)), where p(z_t) = N(0, I). The experimental results in Sections 4.1 and 4.2 show that MGMVAE learns a more discriminative and disentangled timbre space than MVAE.
We exploit a moderate number of instrument labels to learn a timbre space in which the clusters clearly represent instrument identity. Similar to Kingma et al., in the semi-supervised training of MGMVAE, we guide the inference of instrument labels by leveraging limited amounts of supervision. This is done by adding a loss term which measures the cross-entropy between the inferred and true instrument labels. Because we do not infer y_t in MVAE, we use z_t to train an auxiliary classifier to predict y_t. It has two 128-unit fully connected layers, and is jointly optimized with MVAE. We consider varying numbers of instrument labels N = 0 (unsupervised), 25, 50, 75, and 100% (fully supervised) of the total number. We randomly sample the labeled subset and let its label distribution match the distribution of instruments in the dataset.
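The extra supervision term can be sketched as a cross-entropy between the inferred instrument posteriors and the true labels, applied only to the labeled subset (numpy; the posteriors, labels, and mask below are toy values, not model outputs):

```python
import numpy as np

def labeled_cross_entropy(q_y, labels, labeled_mask):
    # Cross-entropy between the inferred q(y_t|x) and the true instrument
    # labels, averaged over the labeled subset only (semi-supervised training).
    idx = np.where(labeled_mask)[0]
    return -np.mean(np.log(q_y[idx, labels[idx]] + 1e-12))

# Toy inferred posteriors over 2 instrument classes for 3 samples.
q_y = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.5]])
labels = np.array([0, 1, 0])
mask = np.array([True, True, False])  # only the first two samples are labeled
loss = labeled_cross_entropy(q_y, labels, mask)
```

Unlabeled samples simply contribute nothing to this term; they are trained through the ELBO alone.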
The experiments and the results are presented in this section. We first visualize the timbre space, and quantitatively evaluate the disentangled representations. We then demonstrate the applications of controllable synthesis and many-to-many timbre transfer. Finally, we identify the particular latent dimension that is sensitive to the distribution of the spectral centroid, which allows for finer timbre controls.
fig:tsne visualizes the timbre space using t-distributed stochastic neighbor embedding (t-SNE), a technique that projects vectors from a high- to a low-dimensional space. We first observe that MGMVAE learns a Gaussian-mixture-distributed timbre space, with the means of mixture components marked as crosses in the figure. Second, thanks to the pitch encoder, which accounts for pitch variations, both MVAE and MGMVAE are able to form clusters of instrument identity even without being trained with instrument labels (the leftmost column). We observe that the wind family (e.g., saxophone, clarinet, and flute) forms an ambiguous cluster. This ambiguity remains in MVAE even with increased N, while it is less present in the MGMVAE latent space, owing to the multi-modal prior distribution. As we will confirm in Section 4.2, MGMVAE outperforms MVAE in learning a more discriminative and disentangled timbre space. Note that in MGMVAE, y_t is assumed to be uniformly distributed over the 12 instrument classes, i.e., the mixture components are equally weighted. As a result, instruments with larger within-class variances (e.g., bassoon and trumpet) are assigned to more than one cluster when N = 0. In future work we aim to improve the performance of unsupervised clustering of instruments.
A disentangled pitch (timbre) representation should be discriminative for pitch (instrument identity), and at the same time non-informative about instrument identity (pitch). Therefore, we evaluate z_p and z_t by means of classification. We train linear classifiers, each with one fully connected layer, to map z_p and z_t to both pitch and instrument labels. For comparison, we train an end-to-end convolutional neural network (CNN), whose architecture is the same as the encoder and serves as a strong baseline, to map the original input Mel-spectrograms to either pitch or instrument labels.
tab:classification shows the results. The CNN achieves high F-scores on both instrument and pitch classification; note that N is the percentage of instrument labels used for supervision, and we always use all pitch labels to train the models, which is reasonable as we model isolated notes in this work. In instrument classification, using z_t as the feature representation outperforms z_p by a large margin, as expected. Specifically, in both models, the z_t learned with unsupervised learning (N = 0) is already discriminative enough to predict instruments with linear classifiers. While the F-score of MGMVAE improves with increased N, that of MVAE does not. Moreover, the linear classifier trained with z_t outperforms the CNN for larger N. The timbre space of MGMVAE displays the most discriminative power among the models. We attribute the non-trivial F-scores of instrument classification attained by z_p to the fact that the piano covers all possible pitches in the dataset, while other instruments account for a smaller pitch range; as a result, notes whose pitches were only recorded by the piano can be correctly classified from z_p alone. Future work can decorrelate particular pitches and instruments through data augmentation and adversarial training. In pitch classification, z_p outperforms z_t as expected, and both models achieve comparable results. More importantly, MGMVAE performs better than MVAE in terms of disentanglement, as its z_t results in lower F-scores when predicting pitch with increased N.
As shown in fig:tsne, MGMVAE learns a timbre space whose mixture components are clearly interpreted as instrument identities when trained with moderate supervision. Meanwhile, the mixture components in the pitch space represent pitches. As the Gaussian parameters are readily available after training, we can achieve controllable sound synthesis by sampling z_p and z_t. To synthesize a target pitch y_p and instrument y_t, we first sample z_p and z_t from the corresponding mixture components, with the standard deviations scaled by a multiplier k, which serves to examine the effect of sampling latent variables that deviate from the modes. The decoder then synthesizes the Mel-spectrogram by consuming the concatenation of z_p and z_t. For evaluation, the CNNs (trained on the original dataset) are used to test whether the synthesized spectrograms are still recognized as the desired instrument and pitch. High F-scores therefore indicate high controllability of the model in sound synthesis. We use the sound samples in the validation set as the targets to synthesize, and repeat the sampling 30 times for each target.
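The sampling step for controllable synthesis can be sketched as follows (numpy; the means, standard deviations, and 16-dimensional latent size are illustrative placeholders, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_code(mu, sigma, k, rng):
    # Draw a latent code around a chosen mixture-component mean; the multiplier
    # k scales the standard deviation to probe samples away from the mode.
    return mu + k * sigma * rng.standard_normal(mu.shape)

# Hypothetical component means for the target pitch and instrument.
mu_pitch, mu_timbre = np.zeros(16), np.full(16, 1.0)
z_p = sample_code(mu_pitch, sigma=0.1, k=1.0, rng=rng)
z_t = sample_code(mu_timbre, sigma=0.5, k=2.0, rng=rng)
decoder_input = np.concatenate([z_p, z_t])  # decoded into a Mel-spectrogram
```

Larger k trades consistency for diversity: samples stray further from the component mean and are less reliably recognized as the intended pitch and instrument.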
The F-scores for pitch and instrument classification are reported in fig:controllable. We first note that increasing k degrades classification performance. This is expected, as a sample synthesized from a latent variable far from the mean of its mixture component deviates more from the intended instrument or pitch distribution. The fact that the CNN was trained on original samples but tested on synthesized ones also contributes to the lower performance. Second, increasing N improves instrument classification performance. Finally, the high F-scores across all k's for N > 0 indicate accurate and consistent synthesis of sounds with the intended pitches and instruments, even with a timbre space trained using a limited number of instrument labels. This implies that MGMVAE efficiently exploits the instrument labels and learns a discriminative mixture distribution of timbre, consistent with the visualization in fig:tsne (bottom row). We do not explore the timbre space resulting from unsupervised learning (N = 0) in this experiment, as the instrument identity of each mixture component is not directly available. We can, however, infer the instrument identity of each mixture component by sampling and synthesis, and expect reasonably good performance for controllable synthesis if the clustering of instruments shown in the bottom left of fig:tsne is improved. This will be explored in future work.
In this experiment, we demonstrate many-to-many transfer of instrument timbre. In Mor et al., a domain-specific decoder was trained for each target instrument. To achieve timbre transfer with a single encoder-decoder architecture, Bitton et al. proposed to use a conditional layer which takes both instrument and pitch labels as inputs. Our model, in contrast, infers z_p and z_t, and uses only a single joint decoder. As illustrated in fig:timbretransfer, timbre transfer is achieved by shifting the timbre code by the difference between the means of the target and source mixture components and decoding it together with the unchanged pitch code, i.e., transferring timbre while keeping pitch unchanged. Once again, we rely on the trained CNNs in tab:classification for evaluation. More specifically, we examine the posterior shift in the instrument prediction of the CNN before and after transferring from source to target instruments. For simplicity, the most frequent instruments of the four families (i.e., French horn, piano, cello, and bassoon) are selected as representatives, and we perform timbre transfer using the samples in the validation set as sources. For example, consider Fhn as the source and Pno as the target, as shown in fig:timbretransfer. We modify the timbre code of the i-th Fhn sample as z_t,i + α(μ_Pno − μ_Fhn), where μ_Pno and μ_Fhn are the means of the corresponding mixture components and α ∈ [0, 1] controls the degree of transfer. We decode as described earlier and report the posterior of the CNN's instrument prediction averaged over the N_Fhn source samples.
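The mean-shift operation above can be sketched in a few lines (numpy; the component means and the 16-dimensional latent size are hypothetical stand-ins):

```python
import numpy as np

def transfer_timbre(z_t, mu_src, mu_tgt, alpha=1.0):
    # Shift a timbre code from the source component toward the target one;
    # alpha interpolates between the original (0) and fully transferred (1) code.
    return z_t + alpha * (mu_tgt - mu_src)

mu_fhn, mu_pno = np.zeros(16), np.full(16, 2.0)  # hypothetical component means
z_t = mu_fhn + 0.1                               # a toy French horn timbre code
z_transferred = transfer_timbre(z_t, mu_fhn, mu_pno, alpha=1.0)
# The code keeps its offset from the mean, but now lies around the piano mean.
```

Because the shift preserves the code's offset from its component mean, within-instrument variation of the source sample is carried over to the target timbre.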
For simplicity, in fig:transfer, we report the results of three representative source-target pairs. Each subfigure refers to a source-target pair, and represents the averaged posterior shift of the CNN's instrument classification with varying α. For all pairs, the biggest posterior shift (hence the prediction change) happens when α = 1. This also applies to the rest of the possible instrument pairs not shown in the figure. Meanwhile, using pitch classification, we examine whether the pitches remain the same before and after timbre transfer, taking the original pitch labels as ground truth. We find that, except when the source is piano, all source-target pairs attain a perfect F-score for pitch. This confirms the ability of the model to successfully perform many-to-many timbre transfer. A special case arises when piano is the source: the F-scores before transfer, and after transfer to French horn, to cello, and to bassoon, are 0.958, 0.750, 0.791, and 0.791, respectively. As described in Section 4.2, the lower F-scores can be attributed to the fact that the range of the piano is much larger than that of the target instruments, so the classifier fails on synthesized samples with unseen combinations of pitch and instrument; another possible reason is that the model falls short of generalization. Nevertheless, this only happens in some cases where the source is piano; as demonstrated in fig:spec_transfer, the model is able to transfer Pno G6 to cello (the first row), an example of generalizing to an out-of-range pitch for the target instrument. In the first and third rows, high-frequency components appear with increased α, and the energy distributes over the segment without decay. The model, however, falls short in generalizing to the higher pitch Pno C7 (the second row), where the energy remains focused at the onset and the high-frequency components are smeared.
In the future, we could improve the model's generalizability by performing data augmentation and adversarial training.
A diagonal-covariance Gaussian prior encourages the model to learn disentangled latent dimensions. This applies to all mixture components in our model. In particular, we identify a latent dimension of the timbre code that correlates with the spectral centroid. We modify the 13th dimension of z_t of each sound sample in the validation set by adding an offset δ, using the same set of offsets for all instruments, and then synthesize the spectrograms, for which we calculate the spectral centroid. fig:spec_centroid shows the distributions of the spectral centroid before and after the modifications. A two-tailed t-test indicates significant differences between the original and modified samples for all instruments. As demonstrated in fig:spec_centroid_spec, increasing δ reduces the energy of high-frequency components and results in lower spectral centroid values. In future research, we will further investigate disentangling specific acoustic features for finer control of sound synthesis beyond pitch and instrument.
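For reference, the spectral centroid of a magnitude spectrum is simply its magnitude-weighted mean frequency. The sketch below (numpy; the synthetic spectra are illustrative) shows how shifting energy toward higher frequencies raises the centroid, which is the quantity tracked in fig:spec_centroid:

```python
import numpy as np

def spectral_centroid(mag, freqs):
    # Magnitude-weighted mean frequency of a single spectral frame.
    return np.sum(freqs * mag) / np.sum(mag)

freqs = np.linspace(0, 11025, 256)               # bin centre frequencies in Hz
low = np.exp(-((freqs - 1000.0) / 500.0) ** 2)   # energy centred around 1 kHz
high = np.exp(-((freqs - 5000.0) / 500.0) ** 2)  # energy centred around 5 kHz
# A spectrum with energy at higher frequencies has a higher spectral centroid.
```

In the experiment, the centroid is computed per synthesized spectrogram frame and compared before and after modifying the 13th latent dimension.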
We have proposed a framework based on GMVAEs to learn disentangled timbre and pitch representations of musical instrument sounds, as verified by our experiments. We demonstrated its applicability in controllable sound synthesis and many-to-many timbre transfer. In future work, we plan to conduct listening tests for a more comprehensive evaluation of these applications, and to further disentangle both low-level (e.g., acoustic features) and high-level (e.g., playing techniques) sound attributes, enabling finer control of synthesized timbres. By combining supervised and unsupervised learning in a deep generative model, the framework can be easily adapted to learn interpretable mixtures such as singer identity, music style, or emotion, facilitating music representation learning and creative applications.
We would like to thank the anonymous reviewers for their constructive reviews. This work is supported by a Singapore International Graduate Award (SINGA) provided by the Agency for Science, Technology and Research (A*STAR), under reference number SING-2018-01-1270.