Musical synthesis, most commonly, is the process of generating musical audio with given control parameters such as instrument type and note sequences over time. The primary difference between synthesis engines is the way in which timbre is modeled and controlled. In general, it is difficult to design a synthesizer that both has dynamic and intuitive timbre control and is able to span a wide range of timbres; most synthesizers change timbres by having presets for different instrument classes or have a very limited space of timbre transformations available for a single instrument type.
In this paper, we present a flexible music synthesizer named Mel2Mel, which uses a learned, non-linear instrument embedding as timbre control parameters in conjunction with a learned synthesis engine based on WaveNet. Because the model has to learn the timbre (any information not specified in the note sequence) to successfully reconstruct the audio, the embedding space spans various aspects of timbre such as the spectral and temporal envelopes of notes. This learned synthesis engine allows for flexible timbre control, and in particular, timbre morphing between instruments, as demonstrated in our interactive web demo (https://neural-music-synthesis.github.io).
1.1 Timbre Control in Musical Synthesis
Methods for music synthesis are based on a variety of techniques such as FM synthesis, subtractive synthesis, physical modeling, sample-based synthesis, and granular synthesis [1]. The method of controlling timbre and the level of flexibility depend on the parameters of the exact method used, but in general, there is a trade-off between flexible timbre control over synthetic sounds (e.g. FM or subtractive synthesis) and limited timbre control over more “realistic” sounds (e.g. sample-based or granular synthesis). Our work is aimed at achieving the best of both worlds: flexibly controlling a variety of realistic-sounding timbres.
1.2 Timbre Morphing
‘Morphing’ of a sound can be generally described as making a perceptually gradual transition between two or more sounds [2]. A common approach is to use a synthesis model and define sound morphing as a numerical interpolation of the model parameters. Sinusoidal models can directly interpolate between the energy proportions of the partials [3, 4]. Other models use parameters characterizing the spectral envelope [5, 6] or psychoacoustic features for a perceptually linear transition [7]. A limitation of these approaches is that morphing can only be applied within the range of timbres covered by a certain synthesis model, whose expressiveness or parameter set may be limited. To overcome this, we employ a data-driven approach for music synthesis that is generalizable to all timbres in the dataset.
1.3 Timbre Spaces and Embeddings
Timbre is often modeled using a timbre space [8], in which similar timbres lie closer than dissimilar timbres. In psychoacoustics, multidimensional scaling (MDS) is used to obtain a timbre space which preserves the timbral dissimilarities measured in perceptual experiments [9, 10]. Meanwhile, in music content analysis, timbre similarity is measured using computed features such as the Mel-frequency cepstral coefficients (MFCCs) [11], descriptors of the spectral envelope [12], or hidden-layer weights of a neural network trained to distinguish different timbres [13]. A recent method [14] used a variational autoencoder to obtain a timbre space; unlike the above embeddings, the method is able to generate monophonic audio for a particular timbre embedding, but it does not consider the temporal evolution of notes such as attacks and decays. In our work, we generate a timbre embedding as a byproduct of polyphonic synthesis, which can utilize both spectral and temporal aspects of timbres.
1.4 Neural Audio Synthesis using WaveNet
WaveNet [16] is a generative audio synthesis model that is able to produce realistic human speech. WaveNet achieves this by learning an autoregressive distribution which predicts the next audio sample from the previous samples in its receptive field using a series of dilated convolutions. Tacotron [17] and Deep Voice [18] are WaveNet-based text-to-speech models which first predict a Mel spectrogram from text and use it to condition a WaveNet vocoder.
There are also a few notable applications of WaveNet to music, including NSynth [19], an autoencoder architecture which separately encodes monophonic pitch with learned timbral features, and the universal music translation network [20], which uses a denoising autoencoder architecture that can extract and translate between musical styles while preserving the melody. In [21], a WaveNet is used for music synthesis conditioned directly on note sequences, while only supporting piano sounds. Our model is similarly built around WaveNet for its synthesis capability, while using a learned embedding space to flexibly control the timbre of polyphonic music.
The neural network shown in Figure 1, dubbed Mel2Mel, concerns the task of synthesizing music corresponding to given note sequences and timbre. The note sequences are supplied as a MIDI file and converted to a piano roll representation, which contains the note timings and the corresponding note velocities for each of the 88 piano keys. We use a fixed step size in time, and the piano roll representation is encoded as a matrix by quantizing the note timings to the nearest time step. The input to the neural network is a concatenation of two 88-dimensional piano roll representations, one for onsets and one for frames, comprising 176 dimensions in total:

$$X^{\text{onset}}_{p,t} = \mathbb{1}\{\text{note } p \text{ has an onset at time step } t\} \cdot v_{p,t}, \qquad X^{\text{frame}}_{p,t} = \mathbb{1}\{\text{note } p \text{ is active at time step } t\} \cdot v_{p,t},$$

where $\mathbb{1}$ is the indicator function, and $v_{p,t}$ denotes the MIDI velocity scaled to $[0, 1]$. This input representation is inspired by [22], which showed that jointly training on onsets and frames performs better than using frame information only; similarly, we want the network to maximally utilize the onsets, which have the most relevant information on the attack sounds, while still receiving the frame information. Another reason for using both onsets and frames is that, because of the time quantization, repeated notes become indistinguishable using $X^{\text{frame}}$ alone when an offset is too close to the subsequent onset.
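The input encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `piano_roll_input`, the note-tuple format, and the mapping of the 88 keys to MIDI pitches 21 to 108 are all choices made here for the example.

```python
import numpy as np

N_KEYS = 88        # piano keys
LOWEST_PITCH = 21  # MIDI number of the lowest piano key (A0); assumed mapping


def piano_roll_input(notes, n_steps, step_sec):
    """Return a (176, n_steps) array: 88 onset rows stacked on 88 frame rows.

    `notes` is a list of (midi_pitch, onset_sec, offset_sec, velocity) tuples.
    """
    onsets = np.zeros((N_KEYS, n_steps), dtype=np.float32)
    frames = np.zeros((N_KEYS, n_steps), dtype=np.float32)
    for pitch, onset_sec, offset_sec, velocity in notes:
        key = pitch - LOWEST_PITCH
        start = int(round(onset_sec / step_sec))   # quantize to the nearest step
        end = max(int(round(offset_sec / step_sec)), start + 1)
        if not 0 <= key < N_KEYS or start >= n_steps:
            continue                                # note falls outside the roll
        v = velocity / 127.0                        # scale MIDI velocity to [0, 1]
        onsets[key, start] = v
        frames[key, start:min(end, n_steps)] = v
    return np.concatenate([onsets, frames], axis=0)


# Two repeated A4 notes with no gap between the offset and the next onset.
step = 128 / 16000  # 8 ms step
x = piano_roll_input([(69, 0.0, 0.4, 100), (69, 0.4, 0.8, 100)], 100, step)
key = 69 - LOWEST_PITCH
print(x.shape)                            # (176, 100)
print(np.count_nonzero(x[key]))           # 2 onsets
print(np.count_nonzero(x[N_KEYS + key]))  # 100 active frames, no visible gap
```

In this example the frame rows alone look like a single sustained note, while the onset rows still show two separate attacks, which is the ambiguity that motivates feeding both representations.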
The input goes through a linear 1x1 convolution layer, which is essentially a time-distributed fully connected layer, followed by a FiLM layer (described in the following subsection), which takes the timbre embedding vector and transforms the features accordingly. After a bidirectional LSTM layer and another FiLM layer for timbre conditioning, another linear 1x1 convolution layer produces the Mel spectrogram prediction. The resulting Mel spectrogram is then fed to a WaveNet vocoder to produce the music; Mel spectrograms compactly convey sufficient information for audio synthesis and have been successfully used for conditioning WaveNet [17, 18]. The use of a bidirectional LSTM is justified because Mel spectrograms are computed using a window larger than the step size, so each frame depends on audio both before and after its time step and the mapping is non-causal. The only nonlinearities in the network are in the LSTM, and there are no time-domain convolutions except in WaveNet.
2.1 Timbre Conditioning using FiLM Layers
We can think of this architecture as a multi-step process that shapes a piano roll into its Mel spectrogram, by applying the appropriate timbre given as side information. A FiLM layer [23] is a suitable choice for this task, because it can represent such an action of shaping using an affine transformation of intermediate-layer features. For each instrument to model, its timbre is represented in an embedding vector $\mathbf{t}$, implemented as a learned matrix multiplication on one-hot encoded instrument labels. A FiLM layer learns functions $f$ and $h$, which are simply linear layers mapping the timbre embedding to $\gamma = f(\mathbf{t})$ and $\beta = h(\mathbf{t})$. The affine transformation, or FiLM-ing, of intermediate-layer features $\mathbf{x}$ is then applied using feature-wise operations:

$$\mathrm{FiLM}(\mathbf{x}) = \gamma \odot \mathbf{x} + \beta$$
At a higher level, the affine transformations learned by the FiLM layers are nonlinearly transformed by the recurrent and convolutional layers to respectively form temporal and spectral envelopes, which are two important aspects that characterize instrumental timbre. Using the first FiLM layer is essential because the recurrent layer needs to take timbre-dependent input to apply the temporal dynamics according to the timbre, and the second FiLM layer can apply an additional spectral envelope on the recurrent layer’s output.
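The timbre conditioning described in this subsection can be illustrated with a small NumPy sketch. It assumes random stand-ins for the learned parameters and linear layers with a bias term; the variable names (`embedding`, `W_gamma`, `W_beta`, `film`) are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instruments, emb_dim, n_channels, n_steps = 10, 2, 256, 32

# Random stand-ins for parameters that would be learned during training.
embedding = rng.normal(size=(n_instruments, emb_dim))  # one row per instrument
W_gamma = rng.normal(size=(n_channels, emb_dim))
b_gamma = np.ones(n_channels)
W_beta = rng.normal(size=(n_channels, emb_dim))
b_beta = np.zeros(n_channels)


def film(x, t):
    """Scale and shift features x of shape (channels, time) given embedding t."""
    gamma = W_gamma @ t + b_gamma  # linear layer producing per-channel scales
    beta = W_beta @ t + b_beta     # linear layer producing per-channel shifts
    return gamma[:, None] * x + beta[:, None]


# Multiplying a one-hot label by the embedding matrix selects one row of it.
one_hot = np.eye(n_instruments)[3]
t = one_hot @ embedding

x = rng.normal(size=(n_channels, n_steps))  # intermediate-layer features
y = film(x, t)
print(y.shape)  # (256, 32)
```

The same scale and shift are applied at every time step, so the layer changes how each feature channel is weighted for a given instrument without introducing any temporal dependency of its own.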
2.2 Model Details
A sampling rate of 16 kHz and μ-law encoding with 256 quantization levels are used for all audio, as in the original WaveNet paper [16]. The predicted Mel spectrograms are defined using 80 area-normalized triangular filters distributed evenly between zero and the Nyquist frequency in the Mel scale. An STFT window length of 1,024 samples and a step size of 128 samples are used, which translate to 64 milliseconds and 8 milliseconds, respectively. Unless specified otherwise, we use 256 channels in all hidden layers and a two-dimensional embedding space for timbre conditioning.
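As a rough illustration of the audio front end, the sketch below checks the frame timing and implements the standard μ-law companding formula with 256 levels. The function names and the rounding convention are assumptions made for this example and are not taken from the vocoder implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW, HOP = 1024, 128
print(1000 * WINDOW / SAMPLE_RATE, 1000 * HOP / SAMPLE_RATE)  # 64.0 8.0 (ms)


def mu_law_encode(audio, levels=256):
    """Map audio in [-1, 1] to integer codes in [0, levels - 1]."""
    mu = levels - 1
    audio = np.clip(audio, -1.0, 1.0)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int64)


def mu_law_decode(codes, levels=256):
    """Invert mu_law_encode up to the quantization error."""
    mu = levels - 1
    companded = 2 * codes.astype(np.float64) / mu - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu


signal = np.linspace(-1.0, 1.0, 1001)
codes = mu_law_encode(signal)
error = np.abs(mu_law_decode(codes) - signal)
print(codes.min(), codes.max())  # 0 255
```

The reconstruction error is smallest near zero amplitude and grows toward full scale, which is the intended behavior of logarithmic companding.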
For Mel spectrogram prediction, an Adam optimizer with an initial learning rate of 0.002 is used, and the learning rate is halved every 40,000 iterations. The model is trained for 100,000 iterations, where each iteration takes a mini-batch of 128 sequences of length 65,536 samples, or 4.096 seconds. Three different loss functions are used and compared; for linear-scale ground-truth and predicted Mel spectrograms:
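The equations that belong here did not survive in this copy of the text. A plausible form, inferred only from the loss names used in the evaluation (abs MSE, log-abs MSE, and tanh-log-abs MSE) and from the statement that Equation 3 is the soft-thresholding loss, is sketched below. The symbols $S$ (ground truth) and $\hat{S}$ (prediction), the ordering of Equations 1 and 2, and the scale $\alpha$ inside the tanh are assumptions.

```latex
\begin{align}
\mathcal{L}_{\text{abs}} &= \mathbb{E}\,\bigl\| S - \hat{S} \bigr\|_2^2 \tag{1} \\
\mathcal{L}_{\text{log-abs}} &= \mathbb{E}\,\bigl\| \log S - \log \hat{S} \bigr\|_2^2 \tag{2} \\
\mathcal{L}_{\text{tanh-log-abs}} &= \mathbb{E}\,\bigl\| \tanh(\alpha \log S) - \tanh(\alpha \log \hat{S}) \bigr\|_2^2 \tag{3}
\end{align}
```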
All logarithms above are natural, and the spectrogram magnitudes are clipped at -100 dB. Prepending tanh gives a soft-thresholding effect, where errors in the low-energy ranges are penalized less than errors close to 0 dB.
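To make the soft-thresholding effect concrete, the following sketch implements three losses matching the names abs MSE, log-abs MSE, and tanh-log-abs MSE, together with the step-wise learning rate schedule. It is written under assumptions: the -100 dB floor is interpreted as 20·log10 of magnitude, and the scale `ALPHA` inside the tanh is a hypothetical value, since the text does not state one.

```python
import numpy as np

FLOOR = 10.0 ** (-100 / 20)  # -100 dB as a magnitude ratio (assumed convention)
ALPHA = 0.25                 # hypothetical scale inside tanh; not given in the text


def _log(s):
    return np.log(np.maximum(s, FLOOR))  # natural log with clipping


def abs_mse(s_true, s_pred):
    return float(np.mean((s_true - s_pred) ** 2))


def log_abs_mse(s_true, s_pred):
    return float(np.mean((_log(s_true) - _log(s_pred)) ** 2))


def tanh_log_abs_mse(s_true, s_pred, alpha=ALPHA):
    diff = np.tanh(alpha * _log(s_true)) - np.tanh(alpha * _log(s_pred))
    return float(np.mean(diff ** 2))


def learning_rate(iteration, initial=0.002, halve_every=40_000):
    """Initial rate halved every `halve_every` iterations."""
    return initial * 0.5 ** (iteration // halve_every)


def from_db(db):
    return np.array([10.0 ** (db / 20)])


# The same 6 dB error, once in a quiet region and once near 0 dB.
quiet = (from_db(-80), from_db(-74))
loud = (from_db(-6), from_db(0))
print(tanh_log_abs_mse(*quiet) < tanh_log_abs_mse(*loud))  # True
```

The log-abs loss assigns the same penalty to both cases, whereas the tanh variant compresses the quiet region, so the error near 0 dB dominates.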
For the WaveNet vocoder, we used nv-wavenet (https://github.com/NVIDIA/nv-wavenet), a real-time open-source implementation of autoregressive WaveNet by NVIDIA. This implementation limits the residual channel size to 64 and the skip channels to 256, because of the GPU memory capacity. A 20-layer WaveNet model was trained with a maximum dilation of 512, and the Mel spectrogram input is upsampled using two transposed convolution layers of window sizes 16 and 32 with strides of 8 and 16, respectively. An Adam optimizer with an initial learning rate of 0.001 is used, and the learning rate is halved every 100,000 iterations, for one million iterations in total. Each iteration takes a mini-batch of 4 sequences of length 16,384 samples, i.e. 1.024 seconds.
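The vocoder configuration above implies a receptive field and an upsampling factor that can be checked with a few lines of arithmetic. The sketch assumes a convolution kernel size of 2 and dilations that double from 1 up to the maximum and then restart, which is the usual WaveNet layout but is not stated explicitly in the text.

```python
def wavenet_receptive_field(n_layers=20, max_dilation=512, kernel_size=2):
    """Return (receptive field in samples, list of per-layer dilations)."""
    dilations = []
    d = 1
    for _ in range(n_layers):
        dilations.append(d)
        d = 1 if d >= max_dilation else d * 2  # restart the cycle after the maximum
    return 1 + (kernel_size - 1) * sum(dilations), dilations


rf, dilations = wavenet_receptive_field()
upsampling = 8 * 16  # product of the two transposed-convolution strides

print(rf)          # 2047 samples
print(upsampling)  # 128, equal to the Mel spectrogram step size in samples
```

Under these assumptions the receptive field is about 128 ms at 16 kHz, and the two upsampling layers together restore one conditioning frame per 128 audio samples.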
While it is ideal to use recorded audio of real instruments as the training dataset, the largest multi-instrument polyphonic datasets available, such as MusicNet [24], are highly skewed, contain a limited variety of solo instrument recordings, and are expected to have a certain degree of labeling errors. We therefore resorted to using synthesized audio for training and collected MIDI files from www.piano-midi.de, which are also used in the MAPS Database [25]; these MIDI files are recorded from actual performances and contain expressive timing and velocity information. We have selected 10 General MIDI instruments shown in Figure 4 covering a wide variety of timbres, and 334 piano tracks are synthesized for each instrument using FluidSynth with the default SoundFont from MuseScore 3. The 334 tracks are randomly split into 320 for training and 14 for validation. The total size of the synthesized dataset is 3,340 tracks and 221 hours.
For later experiments, we also generate a similar dataset using 100 manually selected instrument classes using a high-quality collection of SoundFonts, which contains a wide variety of timbres.
3.2 Ablation Study on Model Design
In this series of experiments, we examine how slight variations in the model architecture affect the performance and show that the proposed model achieves the best performance in accurately predicting Mel spectrograms. The first two variations use either the frame data or the onset data only as the input. The next three omit an architectural component: one of the two FiLM layers or the backward LSTM. The last four increase the network’s capacity by adding the ReLU nonlinearity after the first convolution, using kernel sizes of 3 or 5 time steps in convolutions, or adding another LSTM layer.
| Variation | Train loss | Validation loss |
| --- | --- | --- |
| Proposed | 4.09 ± 0.30 | 4.75 ± 0.05 |
| Frame input only | 5.58 ± 0.32 | 5.92 ± 0.08 |
| Onset input only | 5.88 ± 0.37 | 6.97 ± 0.06 |
| First FiLM only | 4.55 ± 0.32 | 4.99 ± 0.06 |
| Second FiLM only | 7.65 ± 0.34 | 8.76 ± 0.08 |
| Forward LSTM only | 5.70 ± 0.43 | 5.56 ± 0.09 |
| ReLU activation | 3.97 ± 0.35 | 5.04 ± 0.06 |
| 3x1 convolutions | 3.66 ± 0.28 | 5.12 ± 0.08 |
| 5x1 convolutions | 3.49 ± 0.30 | 5.06 ± 0.08 |
| 2-layer LSTM | 2.98 ± 0.20 | 4.96 ± 0.12 |
The train and validation losses as defined in Equation 3 are shown in the table above for each variation; the means and standard deviations over the model checkpoints in 90k-100k iterations are reported, to minimize the variability due to SGD. Using both onsets and frames is indeed more effective than using only one of them in the input. The first FiLM layer plays a more crucial role than the second, because only the first can help learn a timbre-dependent recurrent layer. As expected, removing the backward LSTM also hurts the performance.
On the other hand, every variation that increases the model capacity lowers the training loss but raises the validation loss; that is, the model overfits and fails to generalize to validation data. This implies that the proposed model has the best architecture among the tested variations, and more specifically, having the nonlinearity only in the single recurrent layer helps the model better generalize in predicting Mel spectrograms from unseen note sequences. A possible interpretation is that the increased capacity is being used for memorizing the note sequences in the training dataset, as opposed to learning to model the timbral features independent of specific notes.
3.3 Synthesis Quality
3.3.1 Numerical Analysis of Audio Degradation
The model goes through several stages of prediction, and each stage incurs a degradation of audio quality. There necessarily exists some degradation caused by the μ-law quantization, and WaveNet adds further degradation due to its limited model capacity. The generated audio is further degraded when imperfect Mel spectrogram predictions are given. As an objective measure of audio quality degradation at each stage and for each instrument, we plot the Pearson correlations between the synthesized and original audio in Figure 2. To calculate and visualize the correlations with respect to evenly spaced octaves, we use 84-bin log-magnitude constant-Q transforms with 12 bins per octave starting from C1 (32.70 Hz) and 512-sample steps. For ideal synthesis, the Pearson correlation should be close to 1, and lower correlations indicate larger degradation from the original.
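A minimal sketch of this measure is given below, assuming that the correlation is computed separately for each frequency bin across time frames and that the two log-magnitude constant-Q transforms are already available as arrays. The function names are hypothetical, and the CQT computation itself is left out.

```python
import numpy as np


def cqt_bin_frequencies(n_bins=84, bins_per_octave=12, fmin=32.70):
    """Center frequencies in Hz of a CQT starting at C1."""
    return fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)


def per_bin_pearson(reference, synthesized):
    """Pearson correlation per frequency bin.

    Both inputs are (n_bins, n_frames) log-magnitude CQT arrays.
    """
    ref = reference - reference.mean(axis=1, keepdims=True)
    syn = synthesized - synthesized.mean(axis=1, keepdims=True)
    numerator = (ref * syn).sum(axis=1)
    denominator = np.sqrt((ref ** 2).sum(axis=1) * (syn ** 2).sum(axis=1))
    return numerator / np.maximum(denominator, 1e-12)


rng = np.random.default_rng(0)
reference = rng.normal(size=(84, 200))
noisy = reference + 0.5 * rng.normal(size=reference.shape)
print(per_bin_pearson(reference, noisy).shape)  # (84,)
```

With 84 bins at 12 bins per octave, the analysis spans seven octaves above C1, so each octave occupies the same width on the frequency axis of the plot.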
Figure 2a shows the correlations for each stage of degradation and for different loss functions used for training the model. The degradations are more severe in low frequencies in general, where the WaveNet model sees fewer periods of a note within its fixed receptive field length. The orange curve showing the correlations for WaveNet synthesis using ground-truth Mel spectrograms already exhibits a significant drop from the top curve; this defines an upper bound of Mel2Mel’s synthesis quality. The lower three curves correspond to the loss functions in Equations 1-3, among which the abs MSE loss clearly performs worse than the other two, which have almost identical Pearson correlation curves, indicating that the MSE loss is more effective in the log-magnitude scale.
Figure 2b shows the breakdown of the curve corresponding to Equation 3 into each of the 10 instruments. There are rather drastic differences among instruments, and most instruments have low Pearson correlations in low pitches except pizzicato strings. The reasons and implications of these trends are discussed in the following subsection, in comparison with the subjective audio quality test.
3.3.2 Subjective Audio Quality Test
We performed a crowd-sourced test asking the listeners to rate the quality of 20-second audio segments using a 5-point mean opinion score (MOS) scale with 0.5-point steps. MOS allows a simple interface that is more suitable for non-expert listeners in a crowdsourcing setup than e.g. MUSHRA [26] and is used as a standard approach for assessing the perceptual quality of WaveNet syntheses [16, 27]. For each of the six configurations, i.e. those corresponding to the curves in Figure 2a plus the original audio, the first 20 seconds of the 140 validation tracks (14 tracks for each of the 10 instruments) are evaluated. The listeners are provided with one randomly selected segment at a time, and each segment is evaluated by three listeners, comprising 420 samples in each configuration.
| Condition | Mean opinion score |
| --- | --- |
| Original audio | 4.301 ± 0.080 |
| μ-law encoded and decoded audio | 3.876 ± 0.097 |
| WaveNet: ground-truth Mel | 3.383 ± 0.100 |
| WaveNet: tanh-log-abs MSE | 3.183 ± 0.106 |
| WaveNet: log-abs MSE | 3.019 ± 0.109 |
| WaveNet: abs MSE | 2.751 ± 0.110 |
This table shows the mean opinion scores (MOS) and the 95% confidence intervals, which generally follow a tendency similar to Figure 2. Using the soft-thresholding loss in Equation 3 results in the best subjective quality among the three loss functions compared, by a wider margin than in the numerical comparison in Figure 2.
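For reference, a mean opinion score with a 95% confidence interval can be computed as sketched below. The text does not state how the intervals were obtained, so the normal approximation used here (1.96 standard errors) is an assumption, and `mos_with_ci` is a hypothetical helper name.

```python
import numpy as np


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of the confidence interval) for a set of ratings."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(ratings.size)
    return mean, half_width


# Illustrative ratings on the 1-to-5 scale with 0.5-point steps (made-up values).
example = [4.0, 3.5, 4.5, 3.0, 4.0, 5.0, 3.5, 4.0]
mean, half_width = mos_with_ci(example)
print(f"{mean:.3f} ± {half_width:.3f}")
```

Because the half-width shrinks with the square root of the number of ratings, the 420 ratings per configuration are what keep the reported intervals near ±0.1.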
In Figure 3, we compare the numerically and perceptually evaluated quality for each instrument. The horizontal coordinates are the average values for each curve in Figure 2b. Pearson correlations between CQTs are not necessarily indicative of the subjective synthesis quality, because a large contrast in the temporal envelope can contribute to high Pearson correlations, notwithstanding a low perceptual quality. Reflecting this, more transient instruments in the lower right, such as pizzicato strings, achieve higher Pearson correlations relative to their MOS, while more sustained instruments on the left side have relatively lower Pearson correlations but higher perceptual quality.
3.4 The Timbre Embedding Space
To make sense of how the learned embedding space conveys timbre information, we construct a 320-by-320 grid that encloses all instrument embeddings and predict the Mel spectrogram conditioned on every pixel in the grid. The spectral centroid and the mean energy corresponding to each pixel are plotted in Figure 4, which are indicative of the two main aspects of instrumental timbres: the spectral and temporal envelopes. A higher spectral centroid signifies stronger harmonic partials in high frequency, while a lower spectral centroid indicates that it is closer to a pure sine tone. Similarly, higher mean energy implies a more sustained tone, and low mean energy means that the note is transient and decays rather quickly. The points corresponding to the 10 instruments are annotated with instrument icons (the icons are made by Freepik and licensed under CC 3.0 BY). These plots show that the learned embedding space forms a continuous span over the timbres expressed by all instruments in the training data. This allows us to use the timbre embedding as a flexible control space for the synthesizer, and timbre morphing is possible by interpolating along curves within the embedding space.
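The analysis above can be sketched in NumPy as follows. This is only an illustration under assumptions: the grid margin, the straight-line interpolation path, and all function names are choices made here, and the Mel spectrogram passed to the descriptors would come from the trained model, which is not reproduced.

```python
import numpy as np


def embedding_grid(embeddings, size=320, margin=0.1):
    """Return a (size, size, 2) grid of points enclosing all 2-D embeddings."""
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    pad = margin * (hi - lo)
    xs = np.linspace(lo[0] - pad[0], hi[0] + pad[0], size)
    ys = np.linspace(lo[1] - pad[1], hi[1] + pad[1], size)
    return np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)


def spectral_centroid(mel, bin_centers):
    """Energy-weighted mean frequency of a (n_mels, n_frames) linear-scale spectrogram."""
    energy = mel.sum(axis=1)
    return float((bin_centers * energy).sum() / max(energy.sum(), 1e-12))


def mean_energy(mel):
    return float(mel.mean())


def morph(t_a, t_b, n_points=5):
    """Embeddings along a straight line from t_a to t_b (the simplest morphing path)."""
    w = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1 - w) * t_a + w * t_b


embeddings = np.array([[0.0, 0.0], [1.0, 2.0], [-0.5, 1.0]])  # made-up instruments
grid = embedding_grid(embeddings)
path = morph(embeddings[0], embeddings[1])
print(grid.shape, path.shape)  # (320, 320, 2) (5, 2)
```

Each grid point or path point would be passed to the model as the timbre embedding, and the two descriptors summarize the resulting spectrogram as one value per point.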
To illustrate how the model scales with more diverse timbres, we train the Mel2Mel model with 100 instruments using a 10-dimensional embedding, and we refer the readers to the web demo for an interactive t-SNE visualization [28] of the embedding space. The 10-dimensional embedding space also contains a locally continuous timbre distribution of instruments, as in Figure 4, implying that the Mel2Mel model is capable of scaling to hundreds of instruments and to a higher-dimensional embedding space.
In addition to the audio samples used in the experiments, our interactive web demo (https://neural-music-synthesis.github.io) showcases the capability of flexible timbre control, where the Mel2Mel model runs in the browser to convert preloaded or user-provided MIDI files into Mel spectrograms using a user-selected point in the embedding space.
4 Conclusions and Future Directions
We showed that it is possible to build a music synthesis model by combining a recurrent neural network and FiLM conditioning layers, followed by a WaveNet vocoder. It successfully learns to synthesize musical notes according to the given note sequence and timbre embedding in a continuous timbre space, providing the ability of flexible timbre control for music synthesizers.
The capacity of the WaveNet, such as the number of residual channels and the number of layers, is limited due to the memory requirements of the nv-wavenet implementation, and the degradation from μ-law quantization is also apparent in the experiments. These limitations can be overcome by Parallel WaveNet [27], which does not require a special CUDA kernel for fast synthesis and uses a continuous probability distribution for generation, thereby avoiding the quantization noise. Our earlier experiments on continuous emission failed to stably perform autoregressive sampling due to teacher forcing, and future work includes investigating this phenomenon in comparison with [21], which used a mixture of logistic distributions to produce high-quality piano sounds.
A notable observation is that the WaveNet vocoder is able to synthesize polyphonic music from Mel spectrograms containing only 80 frequency bins, which are not even aligned to the tuning of the audio files. While more information available from additional bins should help synthesize more accurate audio, predicting the higher-dimensional representation becomes more compute-intensive and less accurate, making 80 bins a sweet spot for use with WaveNet. Introducing an adversarial loss function, as used for predicting high-resolution images [29], can be a viable direction for predicting more accurate and realistic Mel spectrograms for conditioning WaveNet.
Overall, we have demonstrated that a MIDI-to-audio synthesizer can be learned directly from audio, and that this learning allows for flexible timbre control. Once extended with an improved vocoder and trained on real audio data, we believe the model can result in a powerful and quite realistic music synthesis model.
- [1] Andrea Pejrolo and Scott B Metcalfe, Creating Sounds from Scratch: A Practical Guide to Music Synthesis for Producers and Composers, Oxford University Press, 2017.
- [2] Marcelo Freitas Caetano and Xavier Rodet, “Automatic Timbral Morphing of Musical Instrument Sounds by High-Level Descriptors,” in Proceedings of the International Computer Music Conference, 2010.
- [3] Naotoshi Osaka, “Timbre Interpolation of Sounds Using a Sinusoidal Model,” in Proceedings of the International Computer Music Conference, 1995.
- [4] Federico Boccardi and Carlo Drioli, “Sound Morphing with Gaussian Mixture Models,” in Proceedings of DAFx, 2001.
- [5] Malcolm Slaney, Michele Covell, and Bud Lassiter, “Automatic Audio Morphing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, vol. 2.
- [6] Tony Ezzat, Ethan Meyers, James Glass, and Tomaso Poggio, “Morphing Spectral Envelopes using Audio Flow,” in European Conference on Speech Communication and Technology, 2005.
- [7] Marcelo Caetano and Xavier Rodet, “Musical Instrument Sound Morphing Guided by Perceptually Motivated Features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, 2013.
- [8] Geoffroy Peeters, Bruno L Giordano, Patrick Susini, Nicolas Misdariis, and Stephen McAdams, “The Timbre Toolbox: Extracting Audio Descriptors from Musical Signals,” The Journal of the Acoustical Society of America, vol. 130, no. 5, 2011.
- [9] John M Grey, “Multidimensional Perceptual Scaling of Musical Timbres,” The Journal of the Acoustical Society of America, vol. 61, no. 5, 1977.
- [10] David L Wessel, “Timbre Space as a Musical Control Structure,” Computer Music Journal, 1979.
- [11] Beth Logan, “Mel Frequency Cepstral Coefficients for Music Modeling,” in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2000.
- [12] Giulio Agostini, Maurizio Longari, and Emanuele Pollastri, “Musical Instrument Timbres Classification with Spectral Features,” EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 1, 2003.
- [13] Eric J Humphrey, Aron P Glennon, and Juan Pablo Bello, “Non-Linear Semantic Embedding for Organizing Large Instrument Proceedings of the International Conference on Machine Learning Applications (ICMLA), 2011, vol. 2.
- [14] Philippe Esling, Axel Chemla–Romeu-Santos, and Adrien Bitton, “Generative Timbre Spaces with Variational Audio Synthesis,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), 2018.
- [15] Diederik P. Kingma and Max Welling, “Auto-Encoding Variational Bayes,” in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- [16] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv:1609.03499, 2016.
- [17] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
- [18] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- [19] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders,” in Proceedings of the International Conference on Machine Learning (ICML), 2017, vol. 70.
- [20] Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman, “A Universal Music Translation Network,” arXiv:1805.07848, 2018.
- [21] Anonymous, “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,” submitted to the International Conference on Learning Representations (ICLR), 2019, under review.
- [22] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck, “Onsets and Frames: Dual-Objective Piano Transcription,” in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2018.
- [23] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron “FiLM: Visual Reasoning with a General Conditioning Layer,” Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference, 2018.
- [24] John Thickstun, Zaid Harchaoui, and Sham Kakade, “Learning Features of Music from Scratch,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- [25] Valentin Emiya, Roland Badeau, and Bertrand David, “Multipitch Estimation of Piano Sounds using a New Probabilistic Spectral Smoothness Principle,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, 2010.
- [26] “Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems,” in ITU-R Recommendation BS.1534-1, 2001.
- [27] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al., “Parallel WaveNet: Fast High-Fidelity Speech Synthesis,” arXiv:1711.10433, 2017.
- [28] Laurens van der Maaten and Geoffrey Hinton, “Visualizing Data Using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, 2008.
- [29] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” in , 2017, vol. 2.