Timbre is a central element in musical expression and sound perception, which can be seen as a set of spectral properties that allows us to distinguish instruments played at the same pitch and velocity. The synthesis of musical timbre has been studied by analyzing the feature relationships between instruments. A disentangled representation of pitch and timbre has been proposed, which allows generating musical notes with instrument control. Perceptual timbre relationships have also been explicitly modeled, so that latent timbre synthesis could be iteratively mapped to target acoustic variations. However, neither technique is evaluated in the signal domain, and acoustic properties remain entangled. A timbre-invariant representation of variable-length waveforms has been learned to perform unsupervised translation of an instrument performance into another, which we refer to as timbre transfer. However, such a representation is not interpretable and does not offer any controls besides selecting a target instrument class.
This paper introduces a generative model trained on an individual timbre domain, which allows variable-length timbre transfer of diverse audio sources and sound synthesis with direct acoustic descriptor control. This auto-encoder, with a discrete latent space that is disentangled from loudness, learns the feature quantization of a given timbre distribution. Latent features are decoded into short-term spectral coefficients of a filter applied to overlapping frames of a noise excitation. This subtractive synthesis technique does not constrain the types and lengths of signals that can be processed. We perform timbre transfer by encoding any input signal into this discrete representation. The matched series of latent features is inverted into a signal which corresponds to the trained timbre domain. Since the model has learned an approximate decomposition of a timbre into a set of short-term spectral features, we can individually decode each latent vector and compute the corresponding acoustic properties. This provides a direct mapping for descriptor-based synthesis: a descriptor target can be matched with a series of latent features and decoded into a signal with the desired auditory property.
Our timbre transfer experiments apply to orchestral instruments and singing voice. We pretrain an instrument classifier and evaluate transfer with the accuracy predicted for a model translating all other instruments into the trained timbre domain. We also measure the distances between the input and output fundamental frequency and loudness. These distances amount to the error of a model at preserving the source pitch and loudness independently from transforming the timbre. Finally, we perform timbre transfer from vocal imitations to instruments as an example of voice-driven synthesis. Whereas many sound ideas are hard to describe with musical parameters, which require expert knowledge, human voice control can be an intuitive medium, for instance by mimicking moods, objects or actions that are then translated into musical sounds.
2 State of the art
2.1 Generative Modeling
Generative neural networks aim to model a given set of observations $x \in \mathbb{R}^{d_x}$ in order to consistently produce novel samples $\hat{x}$. To this end, we introduce latent variables $z \in \mathbb{R}^{d_z}$ defined in a lower-dimensional space ($d_z \ll d_x$). These latent features form a simpler representation from which the data can be generated. An unsupervised approach to learn these variables is the auto-encoder. A deterministic encoder $E_\phi$ maps observations to latent codes $z = E_\phi(x)$ that are fed to the decoder $D_\theta$, which in turn reconstructs the input as $\hat{x} = D_\theta(z)$. Their parameters $\{\phi, \theta\}$ jointly optimize some reconstruction loss
$$\mathcal{L}_{rec}(x, \hat{x}) = \mathcal{L}_{rec}\big(x, D_\theta(E_\phi(x))\big) \tag{1}$$
As this approach explicitly performs dimensionality reduction, these latent variables can extract the most salient features in the dataset. Hence, they also facilitate generation over high-dimensional distributions. However, in this deterministic auto-encoder setting there is no guarantee that latent inference on unseen data produces meaningful codes for the decoder. In other words, these latent projections are usually scattered apart from those of the training observations, and the decoder may fail to reconstruct anything consistent outside its training domain.
Regularized auto-encoders tackle this problem by introducing constraints over the distribution of latent codes and generation mechanism. To do so, the Variational Auto-Encoder (VAE) sets a probabilistic framework by optimizing a variational approximation $q_\phi(z|x)$ of the encoder distribution given a continuous prior $p(z)$ over latent variables. The model is trained with a Kullback-Leibler (KL) divergence regularizer added to a reconstruction cost
$$\mathcal{L}_{VAE} = \mathcal{L}_{rec}(x, \hat{x}) + D_{KL}\big[q_\phi(z|x) \,\|\, p(z)\big] \tag{2}$$
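As a minimal sketch of this regularized objective, the snippet below evaluates the KL term in closed form. It assumes a diagonal Gaussian approximate posterior and a standard normal prior, which is the usual VAE choice but is not specified in the text above; the function names `gaussian_kl` and `vae_loss` are hypothetical, and the reconstruction cost is passed in as a plain number.

```python
import math

def gaussian_kl(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats, summed over dimensions.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_loss(reconstruction_cost, mu, log_var):
    """Reconstruction cost plus the KL regularizer."""
    return reconstruction_cost + gaussian_kl(mu, log_var)

# The regularizer vanishes when the posterior equals the prior ...
kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])
# ... and grows as the posterior mean moves away from the origin.
kl_shifted = gaussian_kl([1.0, 0.0], [0.0, 0.0])
total = vae_loss(2.0, [1.0, 0.0], [0.0, 0.0])
```

When the KL term is driven to zero for every input, the latent code carries no information about the observation, which is the degenerate case referred to as posterior collapse.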
VAEs provide several desirable features such as their interpolation quality, generalization power from small datasets, and the ease of both high-level visualization and sampling. However, they tend to produce less detailed low-level features (blurriness effect), and the regularization can degenerate into an uninformative latent representation (a phenomenon known as posterior collapse).
The Vector-Quantized VAE (VQ-VAE) addresses these issues by learning a discrete latent representation, defined as a codebook $\{e_k\}_{k=1}^{K}$ with a fixed number $K$ of latent vectors $e_k \in \mathbb{R}^{d_z}$. Hence, the output $z_e = E_\phi(x)$ of the deterministic encoder is matched to its nearest embedding code
$$z_q = e_j, \quad j = \arg\min_{k \in \{1, \dots, K\}} \| z_e - e_k \|_2 \tag{3}$$
which is passed to the decoder, so that it optimizes generation solely using the current codebook state. In addition to the latent dimensionality reduction, the amount of information compression is set by the size of the discrete embedding. Assuming a uniform prior distribution over the embedding, the amount of information encoded in the representation corresponds to a constant KL divergence of $\log K$. Since that hyperparameter is not optimized, the VQ-VAE alleviates posterior collapse. The representation is optimized with a codebook update loss which matches the selected code to the encoder features
$$\mathcal{L}_{codebook} = \| \mathrm{sg}[z_e] - z_q \|_2^2$$
where $\mathrm{sg}[\cdot]$ denotes a stop gradient operation, bypassing the variable in the back-propagation. Symmetrically, the encoder commitment to the selected code is applied as a loss
$$\mathcal{L}_{commit} = \| z_e - \mathrm{sg}[z_q] \|_2^2$$
in order to bound its outputs and stabilize the training. The complete objective with commitment cost $\beta$ is then
$$\mathcal{L}_{VQ} = \mathcal{L}_{rec}(x, \hat{x}) + \mathcal{L}_{codebook} + \beta \, \mathcal{L}_{commit}$$
Because of the argmin operator (nearest-neighbor selection), Eqn (3) is not differentiable and the encoder cannot be directly optimized. However, as shown in fig:VQVAE, this issue is circumvented by simply copying the gradient from $z_q$ to $z_e$ (straight-through approximation) and back-propagating this information in the encoder unaltered with respect to the quantization output. The VQ-VAE achieves sharper reconstructions than those of the probabilistic VAE, and its discrete latent representation was successfully applied to speech for unsupervised acoustic unit discovery. In that work, it was shown that the quantized codebook could extract high-level interpretable audio features that strongly correlate to phonemes, with applications for voice conversion. Inference is performed by quantizing every continuous encoder output with the learned latent codebook. Consequently, the decoder is bound to reconstruct the input given this discrete latent space, whose degrees of freedom can be adjusted with the codebook size. This reconstruction with latent quantization may be seen as a transfer when matching any out-of-domain inputs with a set of features learned from a given dataset.
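The quantization step and the two latent losses can be illustrated with a small sketch in plain Python. The toy codebook, the encoder output and the commitment strength `beta=0.25` are illustrative assumptions, not values from this paper, and the helper names `quantize` and `vq_losses` are hypothetical. The stop-gradient and the straight-through gradient copy only matter under automatic differentiation, so they are not modelled here.

```python
import math

def quantize(z_e, codebook):
    """Return (index, code) of the codebook vector nearest to z_e (Euclidean)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return index, codebook[index]

def vq_losses(z_e, z_q, beta):
    """Codebook update and (weighted) commitment terms.

    Both terms share the same squared distance; in a framework with
    automatic differentiation they differ only in which argument is
    wrapped in a stop-gradient.
    """
    sq = sum((a - b) ** 2 for a, b in zip(z_e, z_q))
    return sq, beta * sq

# Toy example: K = 3 codes in a 2-dimensional latent space (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.8, 1.3]
index, z_q = quantize(z_e, codebook)
codebook_loss, commit_loss = vq_losses(z_e, z_q, beta=0.25)

# With a uniform prior over the codebook, the KL term is the constant log K.
info_nats = math.log(len(codebook))
```

Any encoder output, including one computed from out-of-domain audio, is replaced by one of the `K` learned codes, which is the property that the transfer interpretation relies on.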
2.2 Raw Waveform Modeling
The first methods for neural waveform synthesis relied on auto-regressive sample predictions, as in the reference WaveNet model. It achieves high-fidelity sound synthesis, at the cost of a heavy architecture that is inherently slow to train and sample from. In more recent developments, waveform models have exploited digital signal processing knowledge, providing efficient solutions that achieve competitive audio quality. This results in more interpretable and lighter architectures, which consequently require less data to train on. A sinusoids plus stochastic decomposition is first used in the Neural Source Filter (NSF) model. It generates speech from acoustic features and the estimated fundamental frequency $f_0$, which are used as conditioning information for the synthesis modules. These are a sinusoidal source controlled by the $f_0$, a Gaussian noise source and two separate temporal filters to process each of them. The generated signals are adaptively mixed in order to render both voiced (periodic) and unvoiced (aperiodic) speech components. More specific to musical sound synthesis, the Differentiable Digital Signal Processing (DDSP) model implements a similar decomposition with an additive synthesizer conditioned with $f_0$, summed with a subtractive noise synthesizer, both controlled by the decoder. It predicts the harmonic amplitudes and the frequency-domain coefficients $H_l$ to generate the filtered audio from non-overlapping frames of noise $n_l$ as $y_l = \mathrm{DFT}^{-1}\big(H_l \cdot \mathrm{DFT}(n_l)\big)$, with $\mathrm{DFT}$ the Discrete Fourier Transform and $\mathrm{DFT}^{-1}$ its inverse. This model offers promising results and an interesting modularity that disentangles harmonic, stochastic as well as reverberation features. However, it is mainly tailored for harmonic sounds and does not allow end-to-end training as it relies on an external $f_0$ estimator.
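The frequency-domain noise filtering described above can be sketched for a single frame as follows. This is an illustration in plain Python with a naive DFT rather than the DDSP implementation; the helper names are hypothetical, and the filter is assumed to be real-valued and mirrored around the Nyquist bin so that the filtered frame stays real.

```python
import cmath
import random

def dft(x):
    """Naive discrete Fourier transform of a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def symmetric_filter(half):
    """Mirror coefficients for bins 0..n/2 into a full-length real filter."""
    return half + half[-2:0:-1]

def filter_noise_frame(noise, H):
    """Multiply the noise spectrum by H and return the real time-domain frame."""
    filtered = [h * s for h, s in zip(H, dft(noise))]
    return [v.real for v in idft(filtered)]

random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(8)]
all_pass = symmetric_filter([1.0] * 5)                 # leaves the noise unchanged
dc_only = symmetric_filter([1.0, 0.0, 0.0, 0.0, 0.0])  # keeps only the mean value
y_all = filter_noise_frame(noise, all_pass)
y_dc = filter_noise_frame(noise, dc_only)
```

Shaping the coefficients between these two extremes colours the noise, which is the sense in which the synthesis is subtractive.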
The two aforementioned models train on a multi-scale Short-Time Fourier Transform (STFT) reconstruction objective, which is computed for several resolutions. The distance between spectrogram magnitudes is an efficient criterion for optimizing waveform reconstruction as it provides a structured time-frequency representation. However, since the phase is discarded, it may fail at evaluating certain acoustic errors. Based on human ratings to evaluate just-noticeable distortions, a differentiable audio metric has been proposed in order to assess artifacts at the threshold of perception. Listeners were asked whether pairs of audio were exactly the same, with one element being altered by varying strengths of additive noise, reverberation or equalization. This dataset provides pairs of waveforms along with binary ratings, on which a convolutional neural network learns a differentiable loss. A deep feature distance $\mathcal{L}_{deep}$ is trained by forwarding each audio $x$ (clean) and $\tilde{x}$ (altered) into the network. Considering $L$ layers and the $l$-th convolution activations $F_l$, it computes
$$\mathcal{L}_{deep}(x, \tilde{x}) = \sum_{l=1}^{L} \frac{1}{W_l} \big\| w_l \odot \big(F_l(x) - F_l(\tilde{x})\big) \big\|_1 \tag{8}$$
with $w_l$ a learnable weight for each of the $C_l$ channels of width $W_l$. Given this deep feature distance, a low-capacity classifier is trained to infer human ratings of noticeable dissimilarity. In this setting, the network must efficiently model such just-noticeable differences in order to allow an accurate prediction. Once trained, this distance can be used as a differentiable audio loss. It was shown to improve the performance of speech enhancement systems and may be added as an additional reconstruction objective.
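To make the structure of this distance concrete, the sketch below evaluates a weighted, width-normalized L1 difference between layer activations. The activations and channel weights are toy numbers standing in for the outputs of a trained network, and the function name `deep_feature_distance` is hypothetical; the exact normalization used by the original metric should be checked against its reference implementation.

```python
def deep_feature_distance(acts_clean, acts_altered, weights):
    """Sum over layers of the channel-weighted L1 difference, divided by layer width.

    acts_*  : list of layers, each a list of channels, each a list of activations
    weights : list of layers, each a list with one weight per channel
    """
    total = 0.0
    for F_x, F_y, w in zip(acts_clean, acts_altered, weights):
        width = len(F_x[0])
        layer_sum = 0.0
        for chan_x, chan_y, w_c in zip(F_x, F_y, w):
            layer_sum += sum(abs(w_c * (a - b)) for a, b in zip(chan_x, chan_y))
        total += layer_sum / width
    return total

# Two toy layers: 2 channels of width 2, then 1 channel of width 4 (assumed values).
clean = [[[1.0, 2.0], [0.0, 1.0]], [[0.0, 0.0, 0.0, 0.0]]]
altered = [[[1.0, 0.0], [1.0, 1.0]], [[1.0, 1.0, 1.0, 1.0]]]
weights = [[0.5, 2.0], [1.0]]

distance = deep_feature_distance(clean, altered, weights)
self_distance = deep_feature_distance(clean, clean, weights)
```

Dividing by the layer width keeps long, shallow activations from dominating short, deep ones, while the per-channel weights let training emphasize the channels that best predict the human ratings.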
2.3 Musical Timbre Transfer
The task of musical timbre transfer is to convert the identity of one sound into another, e.g. between two instruments, while preserving independent features such as pitch and loudness. A model based on Gaussian mixture VAEs learns a representation that disentangles these features inside instrument sounds. It offers interesting visualizations and generative controls. However, it is restricted to processing individual notes of limited duration from spectrogram magnitudes. As a result, synthesis occurs with an inversion latency and is not evaluated in the signal domain.
In this work we focus instead on unsupervised transfer for variable-length waveforms, such as recorded music performances. The Universal Music Translation Network proposes an architecture for multi-domain transfers, using a shared encoder paired with domain-specific decoders. The generalization of the learned representation to many domains is achieved with a latent confusion objective, which uses an adversarial classifier to enforce the domain-invariance of latent codes. The task is solved in the waveform domain by relying on multiple WaveNet models. For that reason, both training and synthesis are slow and computationally intensive. Although it allows high-quality auto-encoding with domain selection, its latent representation does not offer further generative controls. On the other hand, more expressive and lightweight synthesis models can perform timbre manipulations with additional constraints. The DDSP model was applied to single-domain transfer with independent control over pitch and loudness, but with the limitations of its amortized inference.
3 Vector-Quantized Model for Timbre
In this paper, we introduce a waveform auto-encoder for learning a discrete representation of an individual timbre that can be used for sound transfer and descriptor-based synthesis. We merge the VQ-VAE approach with a decoder that performs subtractive noise filtering with a disentangled gain prediction. As the model is unsupervised, it can train on diverse music performance recordings and can also process non-musical audio such as vocal imitations. The resulting latent representation decomposes spectral timbre properties, while being invariant to loudness. The model performs timbre transfer by encoding any audio source into the loudness-invariant feature quantization, which is inverted to the learned timbre. The discrete latent space can be mapped to acoustic descriptors. This allows us to order series of latent features according to a descriptor target and offers meaningful synthesis controls.
3.1 Model Overview
We define an individual timbre through a corpus of audio files recorded for a target sound domain, for instance isolated or solo performances of a given instrument. A dataset of successive overlapping signal windows is constructed by slicing input waveforms $x$ of given duration into series $\{x_1, \dots, x_T\}$. The encoder projects each of the $T$ windows of length $W$ into a continuous latent code $z_{e,i} \in \mathbb{R}^{d_z}$, while reducing the dimensionality as $d_z \ll W$. A quantization estimator selects a vector $z_{q,i}$ in the discrete embedding that is the closest match to $z_{e,i}$. The decoder predicts filtering coefficients $H_i \in \mathbb{R}^{F}$ that are applied to spectral frames $N_i$ of a noise excitation, with $F$ the number of frequency bins. In order to disentangle loudness from the latent timbre embedding, the encoder predicts an additional scalar gain $g_i$. This architecture is depicted in fig:architecture and the output time frames are filtered as
$$\hat{X}_i = g_i \, H_i \odot N_i$$
The reconstruction is done by inversion of $\{\hat{X}_i\}$ into $\hat{x}$. This overlap-add uses the same stride as the encoder and the noise spectrum, and it can be performed for variable-length signals.
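The filtering and overlap-add steps can be sketched as follows, in plain Python with a naive DFT. This is an illustration under stated assumptions rather than the implementation of the model: frames are analysed with a periodic Hann window, complex spectra are used instead of concatenated real and imaginary bins, and the sizes `W = 8` and `S = 2`, the gains and the flat filters are toy values. The function name `subtractive_synthesis` is hypothetical.

```python
import cmath
import math
import random

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def subtractive_synthesis(noise, filters, gains, W, S):
    """Filter overlapping noise frames by g_i * H_i and overlap-add the result."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / W) for n in range(W)]
    output = [0.0] * len(noise)
    for i, (H, g) in enumerate(zip(filters, gains)):
        start = i * S
        frame = [noise[start + n] * window[n] for n in range(W)]
        filtered = [g * h * s for h, s in zip(H, dft(frame))]
        for n, value in enumerate(idft_real(filtered)):
            output[start + n] += value
    return output

W, S, length = 8, 2, 32                  # toy window size, stride and signal length
n_frames = (length - W) // S + 1
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(length)]
flat = [[1.0] * W for _ in range(n_frames)]   # all-pass filters

# Hann windows at 75% overlap sum to 2, so a gain of 0.5 restores the noise.
out_half = subtractive_synthesis(noise, flat, [0.5] * n_frames, W, S)
out_silent = subtractive_synthesis(noise, flat, [0.0] * n_frames, W, S)
```

Because the gain multiplies the whole frame, loudness can be changed without touching the filter shape, which is what allows the latent codes to stay loudness-invariant. The loop runs over as many frames as the signal provides, so the same code handles any signal length.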
3.2 Encoding Modules
The first layer of the encoder slices the input waveform into overlapping windows with a convolution of stride $S$ and Hanning kernel of size $W$, set as a power of 2. Every individual window is passed into a stack of downsampling convolutions with stride 2. One output layer predicts the latent features $z_{e,i}$ and another infers the scalar gains $g_i$. The latent features are projected into the discrete embedding, yielding the quantization codes $z_{q,i}$ sent to the decoder.
3.3 Decoding Modules
Subtractive synthesis is performed by filtering an excitation with flat energy distribution. A uniform noise signal of the same length as $x$ is converted into complex spectrum frames $N_i$. We use a convolution with a stride $S$ and kernels of size $W$ corresponding to the Fourier basis. The first half of the bins is the real part and the other is the imaginary part. The series of quantized features is processed by the decoder, which predicts the series of filtering coefficients $\{H_i\}$. The decoder is composed of an input stack of linear layers, a Recurrent Neural Network (RNN) and an output stack of linear layers. The predicted filters are scaled with the disentangled gains as $g_i H_i$ and applied to the noise spectrum. Synthesis from the filtered frames is done by overlap-add. We use a transposed convolution of stride $S$ and kernels of size $W$ corresponding to the inverse Fourier basis. Such use of convolutional neural networks for time-frequency analysis and synthesis has previously been detailed for both music information retrieval and source separation tasks.
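The idea of computing spectral frames with a strided convolution whose kernels form the Fourier basis can be sketched as below. The layout is an assumption made for illustration: for a window of size `W`, the kernels are the cosine rows for bins 0 to W/2 followed by the negated sine rows for the same bins, so that the first half of the outputs is the real part and the second half the imaginary part. The function names and toy sizes are hypothetical.

```python
import math

def fourier_kernels(W):
    """Real (cosine) kernels followed by imaginary (negated sine) kernels."""
    n_bins = W // 2 + 1
    real = [[math.cos(2.0 * math.pi * k * n / W) for n in range(W)]
            for k in range(n_bins)]
    imag = [[-math.sin(2.0 * math.pi * k * n / W) for n in range(W)]
            for k in range(n_bins)]
    return real + imag

def strided_conv(signal, kernels, stride):
    """Valid 1D convolution (correlation) of the signal with each kernel."""
    W = len(kernels[0])
    n_frames = (len(signal) - W) // stride + 1
    return [[sum(kern[n] * signal[t * stride + n] for n in range(W))
             for kern in kernels]
            for t in range(n_frames)]

W, stride = 8, 4
kernels = fourier_kernels(W)

# A cosine that completes exactly two cycles per window excites real bin 2 only.
signal = [math.cos(math.pi * n / 2.0) for n in range(16)]
frames = strided_conv(signal, kernels, stride)
```

Expressing the transform as a convolution means the kernels could later be replaced by another frequency scale or made trainable without changing the surrounding model. The inverse transform follows the same pattern with a transposed convolution.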
3.4 Model Objectives
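Our proposed model jointly optimizes waveform reconstruction and vector quantization with encoder commitment and codebook update losses. In order to evaluate the reconstruction, we use a multi-scale spectrogram loss $\mathcal{L}_{STFT}$ over the magnitudes of several STFT resolutions and the deep feature distance $\mathcal{L}_{deep}$ defined in Eqn (8). The different loss contributions are scaled by hyperparameters $\lambda$ for reconstruction terms and $\beta$ for latent optimization, as
$$\mathcal{L} = \lambda_{STFT} \, \mathcal{L}_{STFT} + \lambda_{deep} \, \mathcal{L}_{deep} + \mathcal{L}_{codebook} + \beta \, \mathcal{L}_{commit}$$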
[Table tab:bench. Columns: scores, classification accuracy, DTW $f_0$, DTW loudness, LSD.]
In order to learn the individual timbre of instruments, we rely on multitrack recordings of music performances from two datasets, namely URMP and Phenicx. They both provide isolated audio for bassoon, cello, clarinet, double-bass, flute, horn, oboe, trumpet, viola and violin.
To learn the singing voice timbre representation, we use the recordings from the VocalSet database, which provides 9 female and 11 male singers individually performing several techniques and pitches. We discard the noisiest techniques (breathy, inhaled, lip-trill, trillo, vocal fry) and merge all others in the same timbre domain.
To experiment with voice-controlled sound synthesis, we use some examples of the VocalSketch database, which were given as source inputs to models pretrained on instruments. Vocal imitations were not used as training data, but as crowd-sourced examples of untrained human voices expressing some diverse sound concepts.
4.2 Perceptual Audio Loss
Using the dataset of just-noticeable audio differences and human ratings, we re-implemented the deep feature distance in PyTorch (code and pretrained parameters are provided at https://github.com/adrienchaton/PerceptualAudio_Pytorch). To use this loss as a reconstruction objective for music performances recorded at various volumes, we apply a random gain to the training audio pairs so that the learned distance is invariant to audio levels. As this criterion was trained for several perturbations including additive noises and reverberation, the model optimizes additional acoustic cues to generate audio signals that are consistent with the training dataset. We observe that vocal imitations recorded in uncontrolled conditions can be transferred into musical sounds which do not exhibit the input noise found in VocalSketch.
4.3 Training Details
All audio examples are first downsampled to 22 kHz in mono format. The subsets corresponding to each individual timbre, either instrumental or singing voice, are split into training and test (15%) data. We remove silences and concatenate the trimmed audio. Segments of 1.5 seconds are randomly sampled in the training data and collated into mini-batches of size 20 for training the VQ-VAE. We optimize the model for 150,000 iterations with the Adam optimizer and a learning rate of 2e-4.
The model is defined with window size , stride and which corresponds to the real and imaginary parts of the halved complex spectrum. The encoder has 7 downsampling convolutions of stride 2, with increasing output channel dimension from 32 to 256 and kernel size 13. One output layer maps to latent features of size and another pair of linear layers outputs the scalar gains. The vector quantization space is a codebook of size
. The decoder has two blocks of 4 linear layers with a constant hidden dimension of 768 that are interleaved with intermediate Gated Recurrent Units of the same feature size. The output of the decoder is a linear layer that produces filtering coefficients which are passed into a sigmoid activation and log1p compression. The convolutions and are initialized as the linear STFT and its inverse; future experiments could include using different frequency scales or training their kernels. The multi-scale spectrogram reconstruction is computed for STFTs with a hop ratio of 0.25 and window sizes of [128, 256, 512, 1024, 2048]. We adjust the strengths in order to balance the initial gradient magnitudes of each objective, accordingly and . The latent loss uses an encoder commitment strength of .
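The time and frequency resolutions implied by this multi-scale STFT configuration can be worked out with a few lines of arithmetic. The sketch assumes a sampling rate of exactly 22050 Hz (the text only states 22 kHz) and frames computed without padding; both are assumptions, as are the helper and constant names.

```python
SAMPLE_RATE = 22050       # assumption: "22 kHz" is taken to mean 22050 Hz
SEGMENT_SECONDS = 1.5     # training segment duration stated in the text
HOP_RATIO = 0.25
WINDOW_SIZES = [128, 256, 512, 1024, 2048]

segment_samples = int(SAMPLE_RATE * SEGMENT_SECONDS)

def stft_resolution(window_size):
    """Hop size, frame count and bin spacing for one STFT scale."""
    hop = int(window_size * HOP_RATIO)
    n_frames = 1 + (segment_samples - window_size) // hop   # assumes no padding
    bin_hz = SAMPLE_RATE / window_size
    return {"window": window_size, "hop": hop, "frames": n_frames, "bin_hz": bin_hz}

resolutions = [stft_resolution(w) for w in WINDOW_SIZES]
for r in resolutions:
    print(r)
```

Small windows give many frames with coarse frequency bins, and large windows give few frames with fine bins, so summing the spectrogram distances over all scales penalizes both temporal and spectral errors.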
4.4 Classifier Model
In order to evaluate the timbre transfer task, we train a reference classifier on the 10 target instruments. We adapt the baseline proposed with VocalSet to perform short-term predictions rather than predicting a single label per file. Our classifier predicts a label for every non-overlapping frame of 4096 samples, which amounts to a context of about 185 ms. The model was trained with pitch-shifting data augmentation and achieves 85% frame-level accuracy on the test set at predicting the correct instrument label.
The performance of our VQ-VAE is quantitatively compared against a baseline deterministic auto-encoder without vector quantization. Since its latent space is continuous, the disentangled gain prediction did not improve the baseline and is removed as well. Besides that, it shares the same encoder and decoder architectures and only optimizes reconstruction costs. We compare the models in terms of spectrogram reconstruction quality in the learned timbre domain and transfer quality from other sources.
5.1 Comparative Model Evaluation
The test set reconstruction quality of the models is evaluated by comparing the spectrogram magnitudes of the input and output waveforms using the Log-Spectral Distance (LSD). The instrument timbre transfer accuracy is evaluated by auto-encoding every other instrument subset from URMP and Phenicx (besides the trained target) and every singing excerpt from VocalSet, then predicting the instrument label of the synthesized audio with the pretrained reference classifier. The accuracy is reported with respect to the target instrument and should be maximized. In addition, the source $f_0$ and loudness curves are compared with those of the transferred audio. We use the Dynamic Time Warping (DTW) distance to measure how well the model preserves pitch and loudness independently from transferring timbre. The DTW score is normalized across audio excerpts by scaling the time series to unit range and averaging over the lengths of the DTW paths. For the model trained on singing voice, we transfer audio from all the instrument subsets and only report the average DTW distances.
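A normalized DTW score of this kind can be sketched in plain Python as follows. The details are assumptions made for illustration: each series is scaled to unit range independently, the local cost is the absolute difference, and the accumulated cost is divided by the length of the optimal warping path. The function names are hypothetical and the input curves are toy values rather than extracted $f_0$ or loudness tracks.

```python
def unit_range(series):
    """Min-max scale a series to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]

def normalized_dtw(a, b):
    """DTW cost between unit-range series, averaged over the warping path length."""
    a, b = unit_range(a), unit_range(b)
    n, m = len(a), len(b)
    # cell = (accumulated cost, path length) of the best path ending at (i, j)
    table = [[(float("inf"), 0)] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(a[i] - b[j])
            if i == 0 and j == 0:
                table[i][j] = (d, 1)
                continue
            candidates = []
            if i > 0:
                candidates.append(table[i - 1][j])
            if j > 0:
                candidates.append(table[i][j - 1])
            if i > 0 and j > 0:
                candidates.append(table[i - 1][j - 1])
            best = min(candidates)   # lowest cost first, shortest path on ties
            table[i][j] = (best[0] + d, best[1] + 1)
    total, path_length = table[-1][-1]
    return total / path_length

same_contour = normalized_dtw([100.0, 200.0, 300.0],
                              [100.0, 100.0, 200.0, 200.0, 300.0, 300.0])
inverted = normalized_dtw([0.0, 1.0], [1.0, 0.0])
```

A time-stretched copy of the same contour scores zero, since the warping absorbs the stretch, whereas an inverted contour scores the maximum of one. Note that independent scaling also hides a constant offset between the two curves, so a different scaling choice would be needed to penalize transposition.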
As detailed in tab:bench, the discrete representation of the VQ-VAE consistently improves the unsupervised timbre transfer accuracy in comparison with the baseline auto-encoder. For inference on other source domains, our proposed model solely uses a fixed basis of latent features learned from the spectral distribution of a given timbre. As a result, the quantization enforces audio transfer of the target timbre properties. We also observe that the disentangled gain prediction tends to improve the reconstruction of loudness, as shown by a lower average DTW distance for the VQ-VAE model. However, we did not constrain the model to rely on an explicit estimate of the fundamental frequency. Since it is not disentangled from the representation, we observe that quantization comes at the expense of a less accurate reconstruction of the pitch than for the continuous baseline model. Notably, in the VQ-VAE this property is bound to the trained instrument tessitura. The overall reconstruction quality in the target timbre, assessed with the test set LSD, is similar for both auto-encoders.
Besides the quantitative evaluation of the discrete representation against the baseline auto-encoder, we note two additional benefits of feature quantization. When processing out-of-domain audio of lower quality, such as vocal imitations recorded in uncontrolled conditions, the transfer ability is paired with denoising. Indeed, acoustically inconsistent features are discarded in the latent projection to a trained domain such as musical studio recordings. This facilitates the use of timbre transfer from diverse recording environments, such as for voice-controlled synthesis. Moreover, we show that learning a discrete latent representation enables a direct mapping to acoustic descriptors as another means of high-level synthesis control.
5.2 Descriptor-Based Timbre Synthesis
In comparison with the baseline auto-encoder, the VQ-VAE decoder optimizes generation solely based on a discrete latent codebook. We introduce a mapping method for controllable sound synthesis (detailed in the supplementary material at https://adrienchaton.github.io/VQ-VAE-timbre/). Each embedding vector $e_k$ with $k \in \{1, \dots, K\}$ approximately corresponds to a short-term timbre feature and a spectral filter $H_k$. Given that the decoder has an RNN, some temporal relationships are introduced in the overlap-add subtractive synthesis. We decode a series that repeats an individual feature $e_k$ and compute the average acoustic descriptor value $d_k$. After analyzing every latent vector, we obtain the mapping $\{e_k \mapsto d_k\}_{k=1}^{K}$.
We can perform acoustic descriptor-based synthesis from a target $\{d^\ast_t\}$ of any length $T$ by selecting the nearest values in the discrete mapping and decoding the corresponding series of latent features $\{z^\ast_{q,t}\}$. The mark $\ast$ is used here to denote the nearest embedding elements to the descriptor target, whereas in Eqn (3) the selection of $z_q$ is done by matching with the encoder output. Using such a mapping, we show in fig:centroid that we can control a VQ-VAE model of violin with an increasing centroid target. The decoded audio has a consistent spectrogram and synthesized centroid. We also observe that the acoustically ordered series of latent features corresponds to an unordered traversal of the discrete embedding. In other words, the index positions in the quantization space do not correlate with acoustic similarities, which are only provided by our proposed mapping method.
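The mapping and selection steps can be sketched as follows for the spectral centroid. Since the trained decoder is not available here, each code is represented by a toy magnitude spectrum standing in for the audio decoded from that code; the sampling rate, the spectra, the target values and the function names are all illustrative assumptions.

```python
def spectral_centroid(magnitudes, sample_rate):
    """Magnitude-weighted mean frequency, for bins spanning 0 Hz to Nyquist."""
    n_bins = len(magnitudes)
    freqs = [k * sample_rate / (2 * (n_bins - 1)) for k in range(n_bins)]
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

def build_mapping(decoded_spectra, sample_rate):
    """One descriptor value per codebook index."""
    return [spectral_centroid(s, sample_rate) for s in decoded_spectra]

def select_codes(target, mapping):
    """For each target value, pick the index whose descriptor is nearest."""
    return [min(range(len(mapping)), key=lambda k: abs(mapping[k] - t))
            for t in target]

# Toy stand-ins for the spectra decoded from three codes (assumed values).
decoded_spectra = [
    [0.0, 0.0, 0.0, 0.0, 1.0],   # code 0: energy at the top bin
    [1.0, 0.0, 0.0, 0.0, 0.0],   # code 1: energy at the bottom bin
    [0.0, 1.0, 0.0, 1.0, 0.0],   # code 2: energy spread around the middle
]
mapping = build_mapping(decoded_spectra, sample_rate=8000)

# An increasing centroid target selects the codes in a non-monotonic index order.
indices = select_codes([100.0, 1800.0, 3900.0], mapping)
```

The selected index sequence is not sorted even though the target rises steadily, which mirrors the observation that codebook positions carry no acoustic ordering on their own. The resulting indices would then be looked up in the codebook and passed to the decoder.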
This analysis can be performed for other acoustic descriptors and other instrument representations. In fig:vn_vc, we depict the control of the VQ-VAE model by a target defined either with fundamental frequency for the violin or with bandwidth for the cello. Our proposed model does not rely on $f_0$ conditioning in order to process diverse audio sources, such as vocal imitations without pitch. However, we show that the fundamental frequency can be controlled by mapping the unsupervised representation. Our proposed method yields an approximate decomposition of the acoustic properties of an individual timbre, which allows high-level and direct controls for sound synthesis.
We have introduced a raw waveform auto-encoder to learn a discrete representation of an individual timbre that is disentangled from loudness. It can be used for unsupervised transfer of musical instrument performances and singing voice. The model generates audio by subtractive sound synthesis, a technique which restricts neither the types of signals nor the duration that can be processed. The spectral distribution of a timbre is quantized with a set of short-term latent features that are decoded into noise filtering coefficients. This discrete representation can be mapped to acoustic properties in order to perform direct descriptor-based synthesis. Some descriptor targets can be matched with latent features that are decoded into signals with the desired auditory qualities. For instance, the unsupervised model can be controlled with the fundamental frequency. In addition, we experiment with transferring vocal imitations into an instrument timbre as an example of voice-controlled sound synthesis. Audio samples are provided in the supporting GitHub page (https://adrienchaton.github.io/VQ-VAE-timbre/).
This work was done under a Japan Society for the Promotion of Science (JSPS) short-term fellowship. We thank the JSPS and The University of Tokyo for their outstanding support.
-  (2015) Vocalsketch: vocally imitating audio concepts. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 43–46. Cited by: §4.1.
-  (2019) NnAudio: a pytorch audio processing tool using 1d convolution neural networks. In ISMIR - Late Breaking Demo, Cited by: §3.3.
-  Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, pp. 2041–2053. Cited by: §2.1.
-  (2020) DDSP: differentiable digital signal processing. International Conference on Learning Representations (ICLR). Cited by: §2.2.
-  (2018) Generative timbre spaces with variational audio synthesis. In Proceedings of the International Conference on Digital Audio Effects (DAFx), Cited by: §1.
-  (2014) Auto-encoding variational bayes. International Conference on Learning Representations (ICLR). Cited by: §2.1.
-  (2019) Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Transactions on Multimedia 21 (2), pp. 522–535. Cited by: §4.1.
-  (2019) Understanding posterior collapse in generative latent variable models. In International Conference on Learning Representations (ICLR), Cited by: §2.1.
-  (2019) Learning disentangled representations of timbre and pitch for musical instrument sounds using gaussian mixture variational autoencoders. In Proceedings of the 20th International Society for Music Information Retrieval Conference, pp. 746–753. Cited by: §1, §2.3.
-  (2020) A differentiable perceptual audio metric learned from just noticeable differences. arXiv:2001.04460. Cited by: §2.2, §4.2.
-  (2013) Timbre as a structuring force in music. In Proceedings of Meetings on Acoustics ICA, Vol. 19, pp. 035050. Cited by: §1.
-  (2019) A universal music translation network. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.3.
-  (2020) Filterbank design for end-to-end speech separation. In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing, Cited by: §3.3.
-  (2008) Anechoic recording system for symphony orchestra. Acta Acustica united with Acustica 94 (6), pp. 856–865. Cited by: §4.1.
-  (1990) Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal 14 (4), pp. 12–24. Cited by: §2.2.
-  (2016) WaveNet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, pp. 125. Cited by: §2.2.
-  (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems 30, pp. 6306–6315. Cited by: §2.1.
-  (2019) Neural source-filter waveform models for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 402–415. Cited by: §2.2.
-  (2018) Vocalset: a singing voice dataset. In Proceedings of the 19th International Society for Music Information Retrieval Conference, pp. 468–474. Cited by: §4.1, §4.4.
-  Retrieving sounds by vocal imitation recognition. IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. Cited by: §1.