Conditioning Autoencoder Latent Spaces for Real-Time Timbre Interpolation and Synthesis

We compare the performance of standard autoencoder topologies for timbre generation. We demonstrate how different activation functions used in the autoencoder's bottleneck distribute a training corpus's embedding. We show that the choice of sigmoid activation in the bottleneck produces a more bounded and uniformly distributed embedding than a leaky rectified linear unit activation. We propose a one-hot encoded chroma feature vector for use in both input augmentation and latent space conditioning. We measure the performance of these networks and characterize the latent embeddings that arise from the use of this chroma conditioning vector. An open-source, real-time timbre synthesis algorithm in Python is outlined and shared.



I Introduction

Timbre refers to the perceptual qualities of a musical sound distinct from its amplitude and pitch. It is timbre that allows a listener to distinguish between a guitar and a violin both producing a concert C note. Moreover, a musician’s ability to create, control, and exploit new timbres has led, in part, to the surge in popularity of pop, electronic, and hip hop music.

New algorithms for timbre generation and sound synthesis have accompanied the rise to prominence of artificial neural networks. GANSynth [gansynth], trained on the NSynth dataset [nsynth2017], uses generative adversarial networks to output high-fidelity, locally coherent audio. Furthermore, GANSynth's global latent conditioning allows for interpolations between two timbres. Other works have found success using Variational Autoencoders (VAEs) [universal] [bijective] [assisted], which combine autoencoders and probabilistic inference to generate new audio. Most recently, differentiable digital signal processing has shown promise by casting common signal processing modules into a differentiable form [ddsp], where they can be trained jointly with neural networks using stochastic optimization.

The complexity of these models requires powerful computing hardware to train, hardware often out of reach for musicians and creatives. When designing neural networks for creative purposes, one must strike a three-way balance between the expressivity of the system, the freedom given to a user to train and interface with the network, and the computational overhead needed for sound synthesis. One successful example in the field of music composition is MidiMe [midime], which allows a composer to train a VAE with their own scores on a subspace of a larger, more powerful model. Moreover, these training computations take place in the end user's browser.

Our previous work has tried to strike this three-way balance as well [colonel] [colonel2], by utilizing feed-forward neural network autoencoder architectures trained on Short-Time Fourier Transform (STFT) magnitude frames. This work demonstrated how the choice of activation functions, corpora, and augmentations to the autoencoder's input could improve performance for timbre generation. However, we found upon testing that the autoencoder's latent space proved difficult to control and characterize. Also, we found that our use of a five-octave MicroKORG corpus encouraged the autoencoder to produce high-pitched, often uncomfortable tones.

This paper introduces a chroma-based input augmentation and skip connection to help improve our autoencoder’s reconstruction performance with little additional training time. A one-octave MicroKORG corpus as well as a violin-based corpus are used to train and compare various architectural tweaks. Moreover, we show how this skip connection conditions the autoencoder’s latent space so that a musician can shape a timbre around a desired note class. A full characterization of the autoencoder’s latent space is provided by sampling from meshes that span the latent space. Finally, a real-time, responsive implementation of this architecture is outlined and made available in Python.

II Autoencoding Neural Networks

An autoencoding neural network (i.e. autoencoder) is a machine learning algorithm that is typically used for unsupervised learning of an encoding scheme for a given input domain, and is comprised of an encoder and a decoder. For the purposes of this work, the encoder is forced to shrink the dimension of an input into a latent space using a discrete number of values, or "neurons." The decoder then expands the dimension of the latent space to that of the input, in a manner that reconstructs the original input.

In a single layer model, the encoder maps an input vector $x \in \mathbb{R}^d$ to the hidden layer $y \in \mathbb{R}^e$, where $e < d$. Then, the decoder maps $y$ to $\hat{x} \in \mathbb{R}^d$. In this formulation, the encoder maps $x \mapsto y$ via

$$y = f(Wx + b) \qquad (1)$$

where $W \in \mathbb{R}^{e \times d}$, $b \in \mathbb{R}^e$, and $f(\cdot)$ is an activation function that imposes a non-linearity in the neural network. The decoder has a similar formulation:

$$\hat{x} = f(W_{\mathrm{out}}\,y + b_{\mathrm{out}}) \qquad (2)$$

with $W_{\mathrm{out}} \in \mathbb{R}^{d \times e}$, $b_{\mathrm{out}} \in \mathbb{R}^d$.

A multi-layer autoencoder acts in much the same way as a single-layer autoencoder. The encoder contains $n$ layers and the decoder contains $n$ layers. Using Equation (1) for each mapping, the encoder maps $x \to y_1 \to \cdots \to y_n$. Treating $y_n$ as the input in Equation (2), the decoder maps $y_n \to y_{n+1} \to \cdots \to y_{2n} = \hat{x}$.

The autoencoder trains the weights of the $W$'s and $b$'s to minimize some cost function. This cost function should minimize the distance between input and output values. The choice of activation functions and cost functions depends on the domain of a given task.
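As a concrete illustration, the single-layer mapping above can be sketched in NumPy. The dimensions and random initialization here are purely illustrative, not a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, e = 2049, 8                              # input and bottleneck dimensions, e < d

W, b = 0.01 * rng.normal(size=(e, d)), np.zeros(e)          # encoder parameters
W_out, b_out = 0.01 * rng.normal(size=(d, e)), np.zeros(d)  # decoder parameters

x = rng.random(d)                           # an input vector x
y = sigmoid(W @ x + b)                      # Equation (1): encode to the latent space
x_hat = sigmoid(W_out @ y + b_out)          # Equation (2): decode back to input space

mse = np.mean((x - x_hat) ** 2)             # the reconstruction cost to minimize
```

Training would adjust `W`, `b`, `W_out`, and `b_out` by gradient descent on `mse`.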

III Network Design and Topology

We build on previous work to present our current network architecture.

III-A Activations


The sigmoid activation

$$f(x) = \frac{1}{1 + e^{-x}}$$

and rectified linear unit (ReLU)

$$f(x) = \max(0, x)$$

are often used to impose the nonlinearities in an autoencoding neural network. A hybrid autoencoder topology combining both sigmoid and ReLU activations was shown to outperform all-sigmoid and all-ReLU models in a timbre encoding task [colonel]. However, this hybrid model often would not converge for a deeper autoencoder [colonel2].

More recently, the leaky rectified linear unit (LReLU) [LReLU]

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}$$

has been shown to avoid both the vanishing gradient problem introduced by using the sigmoid activation [vangrad] and the dying neuron problem introduced by using the ReLU activation [dying]. The hyperparameter $\alpha$ is typically small, and in this work is fixed at a single small value.
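A minimal sketch of the LReLU in NumPy follows; the default value of alpha here is illustrative only, not the paper's fixed setting:

```python
import numpy as np

def lrelu(x, alpha=0.1):
    """Leaky ReLU: passes positive values through, scales negatives by alpha.
    The alpha value used here is an illustrative assumption."""
    return np.where(x > 0.0, x, alpha * x)
```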

III-B Chroma Based Input Augmentation

The work presented in [colonel2] showed how appending descriptive features to the input of an autoencoder can improve reconstruction performance, at the cost of increased training time. More specifically, appending the first-order difference of the training example to the input was shown to give the best reconstruction performance, at the cost of doubling the training time. Here, we suggest a basic chroma-based feature augmentation that improves reconstruction performance at a much lower computational cost.

Chroma-based features capture the harmonic information of an input sound by projecting the input's frequency content onto a set of chroma values [chroma]. Assuming a twelve-interval equal temperament Western music scale, these chroma values form the set {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}. A chromagram can be calculated by decomposing an input sound into 88 frequency bands corresponding to the musical notes A0 to C8. Summing the short-time mean-square power across frames for all sub-bands belonging to each note class (e.g. A0-A7 for the chroma value A) yields the chromagram.

In this work, a one-hot encoded chroma representation is calculated for each training example by taking its chromagram, setting the maximum chroma value to 1, and setting every other chroma value to 0. While this reduces to note conditioning in the case of single-note audio, it generalizes to the dominant frequency of a chord or polyphonic mixture. Furthermore, this feature can be calculated on an arbitrary corpus, which eliminates the tedious process of annotating note labels by hand.
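The one-hot chroma computation can be sketched directly from the chromagram procedure described above. This is a sketch for a single magnitude STFT frame; the quartertone band edges around each note's center frequency are an assumption of this sketch:

```python
import numpy as np

def one_hot_chroma(frame, sr=44100, n_fft=4096):
    """One-hot chroma vector from a single magnitude STFT frame."""
    freqs = np.arange(len(frame)) * sr / n_fft
    chroma = np.zeros(12)
    for midi in range(21, 109):              # A0 (MIDI 21) to C8 (MIDI 108)
        f0 = 440.0 * 2.0 ** ((midi - 69) / 12.0)
        # Quartertone band around the note's center frequency (assumption)
        band = (freqs >= f0 * 2 ** (-1 / 24)) & (freqs < f0 * 2 ** (1 / 24))
        chroma[midi % 12] += np.sum(frame[band] ** 2)   # power in the sub-band
    one_hot = np.zeros(12)
    one_hot[np.argmax(chroma)] = 1.0         # keep only the dominant note class
    return one_hot
```

Index 0 corresponds to C, so a frame dominated by A4 energy yields a 1 at index 9.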

III-C Hidden Layers and Bottleneck Skip Connection

This work uses a slight modification of the geometrically decreasing/increasing autoencoder topology proposed in [colonel2]. All layers aside from the bottleneck and output layers use the LReLU activation function. The output layer uses the ReLU, as all training examples in the corpus take strictly non-negative values. For the bottleneck layer, separate models are trained with LReLU and with sigmoid activations to compare how each activation constructs a latent space.

The 2048-point first-order difference input augmentation is replaced with the 12-point one-hot encoded chroma feature explained above. Furthermore, in this work three separate topologies are explored by varying the bottleneck layer’s width – one with two neurons, one with three neurons, and one with eight neurons.

Residual and skip connections are used in autoencoder design to help improve performance as network depth increases [residual]. In this work, the 12-point one-hot encoded chroma feature input augmentation is passed directly to the autoencoder's bottleneck layer. Models with and without this skip connection are trained to compare how the skip connection affects the autoencoder's latent space. Figure 1 depicts our architecture with the chroma skip connection and eight neuron latent space. Note that for our two neuron model, the eight neuron latent embedding would become a two neuron latent embedding, and similarly would become a three neuron embedding for our three neuron model.
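A forward-pass sketch of this topology in NumPy may help fix ideas. The layer widths and weight initialization are illustrative assumptions, not the paper's exact geometric topology, and the real model is trained rather than randomly initialized:

```python
import numpy as np

rng = np.random.default_rng(0)

def lrelu(x, alpha=0.1):                     # alpha value is illustrative
    return np.where(x > 0, x, alpha * x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer widths (assumptions):
enc_sizes = [2049 + 12, 512, 128, 8]         # chroma-augmented input -> bottleneck
dec_sizes = [8 + 12, 128, 512, 2049]         # chroma skip rejoins at the bottleneck

enc_W = [0.01 * rng.normal(size=(o, i)) for i, o in zip(enc_sizes, enc_sizes[1:])]
dec_W = [0.01 * rng.normal(size=(o, i)) for i, o in zip(dec_sizes, dec_sizes[1:])]

def forward(frame, chroma):
    h = np.concatenate([frame, chroma])      # chroma input augmentation
    for W in enc_W[:-1]:
        h = lrelu(W @ h)                     # hidden layers use LReLU
    z = sigmoid(enc_W[-1] @ h)               # sigmoid bottleneck, bounded in (0, 1)
    h = np.concatenate([z, chroma])          # chroma skip connection at bottleneck
    for W in dec_W[:-1]:
        h = lrelu(W @ h)
    return z, np.maximum(dec_W[-1] @ h, 0.0) # ReLU output: non-negative STFT frame
```

The key structural point is the second `np.concatenate`: the 12-point chroma vector bypasses the encoder entirely and is appended to the latent code before decoding.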

Fig. 1: Network diagram with eight neuron bottleneck and chroma skip connection

IV Corpora

In this work a multi-layer neural network autoencoder is trained to learn representations of musical timbre. The aim is to train the autoencoder to contain high level descriptive features in a low dimensional latent space that can be easily manipulated by a musician. As in the formulation above, dimension reduction is imposed at each layer of the encoder until the desired dimensionality is reached. All audio used to generate the corpora for this work is stored as a 16-bit PCM wav file with 44.1kHz sampling rate.

The various corpora used to train the autoencoding neural network are formed by taking the 2049 points of each frame of a 4096-point magnitude STFT $|X(n, k)|$ as the target, where $n$ denotes the frame index of the STFT and $k$ denotes the frequency index, with 75% frame overlap. The Hann window is used in all cases. Each frame is normalized to $[0, 1]$. This normalization tasks the autoencoder with encoding solely the timbre of an input observation, ignoring its loudness relative to other observations within the corpus. These corpora were not mixed for training; models were only trained on each corpus separately.
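The corpus generation step can be sketched as follows, assuming a 4096-point FFT with a 1024-sample hop (75% overlap) and per-frame peak normalization:

```python
import numpy as np

def corpus_frames(audio, n_fft=4096, hop=1024):  # hop = n_fft/4 -> 75% overlap
    """Magnitude STFT frames, each normalized to [0, 1], as described above."""
    window = np.hanning(n_fft)                    # Hann window
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        spectrum = np.fft.rfft(audio[start:start + n_fft] * window)
        mag = np.abs(spectrum)                    # 2049 = n_fft // 2 + 1 points
        peak = mag.max()
        if peak > 0:
            mag = mag / peak                      # per-frame normalization
        frames.append(mag)
    return np.stack(frames)
```

The per-frame normalization is what discards relative loudness, leaving only spectral shape (timbre) for the autoencoder to model.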

IV-A MicroKORG Dataset

Two corpora were created by recording C Major scales from a MicroKORG synthesizer/vocoder. In both cases, 70 patches make up the training set, 5 patches make up the validation set, and 5 patches make up the test set. These patches ensured that different timbres are present in the corpus. To ensure the integrity of the testing and validation sets, the dataset is split on the “clip” level. This means that the frames in each of the three sets are generated from distinct passages in the recording, which prevents duplicate or nearly duplicate frames from appearing across the three sets.

The first corpus is comprised of magnitude STFT frames computed from five octave C Major scales recorded from a MicroKORG synthesizer/vocoder across 80 patches, with additional frames held out for validation and testing.

The second corpus is a subset of the first, comprised of one octave C Major scales starting from concert C, again with additional frames held out for validation and testing.

By restricting the corpus to single notes played on a MicroKORG, the autoencoder needs only to learn higher level features of harmonic synthesizer content. These tones often have time variant timbres and effects, such as echo and overdrive. Thus the autoencoder is also tasked with learning high level representations of these effects.

IV-B TU-Note Violin Sample Library

A third corpus was created using a portion of the TU-Note Violin Sample Library [tu-note]. The dataset consists of recordings of a violin in an anechoic chamber playing single sounds, two-note sequences, and solo performances such as scales and compositions. The single notes were used to construct the training corpus, and the solo performances were cut into two parts to form the validation and test sets. These two parts were split on the "clip" level to ensure that no frames from the same passages appear across the validation and test sets. Here, the autoencoder is tasked with learning the differences in timbre one can hear when a violin is played at different dynamic levels, on different semitones, and with different stroke techniques.

IV-C Training Setup

All models were trained for 300 epochs using the ADAM method for stochastic optimization [ADAM]. Mean squared error was used as the cost function, with an L2 weight penalty [L2]. All training utilized one NVIDIA Quadro P2000 GPU, and all networks were implemented using Keras 2.2.4 with Tensorflow-GPU 1.9.0 [tensorflow] as a backend.

V Results

Input Augmentation Test Set MSE Training Time
No Append 35 minutes
Order Diff 44 minutes
One-Hot Chroma 37 minutes
TABLE I: Five Octave Dataset Autoencoder holdout set MSE loss and training time
Bottleneck Activation Skip? Test Set MSE
Sigmoid No
Sigmoid Yes
TABLE II: One Octave Dataset Autoencoder holdout set MSE loss, 2 neuron bottleneck
Bottleneck Activation Skip? Test Set MSE Training Time
LReLU No 41 minutes
LReLU Yes 58 minutes
Sigmoid No 43 minutes
Sigmoid Yes 43 minutes
TABLE III: TU-Note Violin Sample Library Dataset Autoencoder holdout set MSE loss and training time, two neuron bottleneck
Corpus Skip? Test Set MSE Training Time
One Octave No 8 minutes
One Octave Yes 8 minutes
Violin No 44 minutes
Violin Yes 45 minutes
TABLE IV: Sigmoid Bottleneck Autoencoder holdout set MSE loss and training time, three neuron bottleneck

Table I shows the performance of three autoencoders with an eight neuron bottleneck layer using LReLU activations trained on the five octave MicroKORG corpus. The model with the chroma augmentation outperforms both the first-order difference augmentation and no augmentation models. Moreover, the chroma augmentation only increases training time by two minutes. Therefore, the rest of the models in this work utilize the chroma input augmentation.

Table II shows the performance of four autoencoders with a two neuron bottleneck layer trained on the one octave MicroKORG corpus. Models used either the LReLU or sigmoid activation for the bottleneck, and either did or did not utilize a chroma skip connection. All models took eight minutes to train. Both sigmoid models outperformed both LReLU models, and the sigmoid model with no skip connection performed the best.

Table III shows the performance of four autoencoders with a two neuron bottleneck layer trained on the TU-Note Violin Sample Library corpus. Models used either the LReLU or sigmoid activation for the bottleneck, and either did or did not utilize a chroma skip connection. With this corpus, the chroma skip connection significantly improved the reconstruction error for both sigmoid and LReLU activations. Furthermore, the sigmoid activation with the chroma skip connection outperformed all other models.

With these results in mind, two models were trained on the one octave MicroKORG corpus using a three neuron bottleneck with sigmoid activations: one with the chroma skip connection, and one without. Two more models with corresponding topologies were trained on the TU-Note Violin Sample Library corpus. Table IV shows the reconstruction performance of each model. In this case, the models with the chroma skip connection outperformed the models without.

V-A Latent Embeddings

When designing an autoencoder for musicians to use in timbre synthesis, it is important not only to measure the network's reconstruction error, but also to characterize the latent space's embedding. The software synthesizer implemented in [colonel2] allows a musician to choose a point in the autoencoder's latent space and generate its corresponding timbre. By exploring the latent space, the musician can explore an embedding of timbre.

A clear understanding of the boundedness of an embedding ensures that a musician can fully explore the latent space of an arbitrary training corpus, and a clear understanding of the density of the latent embedding can help a musician avoid portions of the latent space that will generate unrealistic examples while interpolating between two encoded timbres [manifold] [sampling].

Recent work has attempted to encourage an autoencoder to interpolate in a “semantically continuous” manner [interpolate]. The authors sample from their autoencoder’s latent space along a line that connects two points to demonstrate this meaningful interpolation. The authors also characterize their latent space using a method proposed by [unsupervised], where an unsupervised clustering accuracy is measured to see how well ground truth labels are separated in the latent space. In the case of our work, however, we are less concerned with how clusters separate in the latent space and more concerned with how uniform samplings of the latent space produce note classes and timbres.

We begin with a visual inspection of the training set embeddings produced by the eight distinct autoencoders referred to in Tables II and III. Figure 2 shows the embeddings for the one octave MicroKORG corpus, and Figure 3 shows the embeddings for the TU-Note Violin Sample Library corpus. Models trained with the LReLU bottleneck activation are plotted in the top row, and models trained with the sigmoid bottleneck activation are plotted in the bottom row. Models trained without the chroma skip connection are plotted in the left column, and models trained with the chroma skip connection are plotted in the right column. Each note class is plotted as one color (i.e. C is dark blue, F is teal, B is yellow) using a perceptually uniform colormap.

In all cases, the chroma skip connection appears to encourage the embedding to be denser and contain fewer striations. Note that by definition, all models with sigmoid activations are bounded by $(0, 1)$. On the other hand, the models with LReLU activation vary their bounds greatly. Moreover, the first and second dimensions of the LReLU embeddings appear to be linearly correlated, rather than populating the latent space in a more uniform manner. As such, we move forward using only sigmoid activations at the bottleneck.

A full accounting of the two neuron sigmoid bottleneck autoencoder's latent space is shown in Figures 4 and 5. These graphs were created by setting the chroma conditioning vector to a given note class, and then sampling the autoencoder's latent space using a 350 point per dimension mesh grid. Each note class is plotted as one color (i.e. C is dark blue, F is teal, B is yellow) using a perceptually uniform colormap. We observe that the autoencoder is able to use the majority of the allotted two dimensions to produce timbres that match the conditioned chroma vector. We note that most mismatches occur near the boundaries of the latent space. We suspect this may be caused by the asymptotic behavior of the sigmoid function coupled with the L2 penalty encouraging the network to choose smaller weights, though a full characterization is outside the scope of this paper.

This mesh sampling procedure was repeated for the three neuron and eight neuron sigmoid bottleneck models. Due to computational constraints, the three neuron model used a 50 point per dimension mesh and the eight neuron model used a 5 point per dimension mesh. The accuracies of the model samplings are shown in Table V. We suspect that some of the decreases in prediction accuracy as the number of neurons in the bottleneck increases may be due in part to the coarser meshes over-weighting samplings near the boundaries of the latent space, though a full characterization is outside the scope of this paper.
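The mesh construction itself can be sketched as follows; each sampled latent point would then be concatenated with the conditioning chroma vector and passed through the trained decoder:

```python
import numpy as np
from itertools import product

def latent_mesh(n_neurons, points_per_dim):
    """Uniform mesh over the sigmoid-bounded latent space [0, 1]^n."""
    axis = np.linspace(0.0, 1.0, points_per_dim)
    return np.array(list(product(axis, repeat=n_neurons)))

mesh_2d = latent_mesh(2, 350)    # two neuron model: 350 points per dimension
mesh_3d = latent_mesh(3, 50)     # three neuron model: 50 points per dimension
```

The eight neuron model follows the same pattern with `repeat=8` and 5 points per dimension, yielding 5^8 samples per conditioning vector.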

Fig. 2: 2D embeddings of the One Octave MicroKORG Corpus

Fig. 3: 2D embeddings of the TU-Note Violin Sample Library Corpus
Model Mesh Length C C# D D# E F F# G G# A A# B
2D One Octave 350
3D One Octave 50
8D One Octave 5
2D Violin 350
3D Violin 50
8D Violin 5
TABLE V: Percent of sampled outputs matching conditioned chroma skip vector

VI Python Implementation

As outlined in [colonel2], a spectrogram with no phase information can be generated by bypassing the network's encoder and passing latent activations to the decoder. To generate the true phase of this spectrogram, the real-time phase gradient heap integration algorithm can be used [phase]. However, to decrease the computational overhead involved in our algorithm, we store the phase of a white noise audio signal and use it to invert the generated spectrogram.
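A sketch of this stored-noise-phase inversion for a single frame follows; overlap-add across consecutive frames is omitted for brevity, and the window and FFT parameters mirror the corpus settings:

```python
import numpy as np

n_fft = 4096
window = np.hanning(n_fft)

# Compute and store the phase of one white noise frame (done once, offline).
noise = np.random.default_rng(0).normal(size=n_fft)
noise_phase = np.angle(np.fft.rfft(noise * window))

def frame_to_audio(mag_frame):
    """Invert one generated 2049-point magnitude frame using the stored phase."""
    spectrum = mag_frame * np.exp(1j * noise_phase)
    return np.fft.irfft(spectrum, n=n_fft) * window
```

Reusing a fixed phase avoids running phase reconstruction per frame, trading some fidelity for real-time throughput.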

Our implementation is written purely in Python, using Tkinter as our GUI backend. Once a user selects a trained decoder to sample from, Keras loads the model into memory. The user is presented with sliders that correspond to each neuron in the model's bottleneck, and a twelve-value radio button is used to set the chroma conditioning vector. The PyAudio library provides Python bindings for PortAudio [portaudio] and handles the audio stream output.

Our implementation has been made available online, along with code to create a corpus from an audio file for training, code to train a model, and code to plot the samplings of a model's latent space. We have tested our implementation on a laptop with an Intel Core i7-8750H CPU @ 2.20GHz × 12 and 16GB of RAM.

We also provide code to train and sample from a Variational Autoencoder implementation (specifically a β-VAE [beta]), with a word of caution. We found that all of our trained models exhibited posterior collapse [partial_collapse], wherein the variational distribution would closely match the uninformative prior for a subset of latent variables, and the rest of the latent variables would output high mean, low variance samplings. Moreover, we did not find that the non-conditioned β-VAE disentangled the note class from timbre. We found that the note class would change when varying any one latent dimension while fixing the rest. Unfortunately, a full treatment of this behavior is outside the scope of this paper.

VII Conclusion

We present an improved neural network autoencoder topology and training regime for use in timbre synthesis and interpolation. By using a one-hot encoded chroma vector as both an augmentation to the autoencoder’s input and a conditioning vector at the autoencoder’s bottleneck, we improve the autoencoder’s reconstruction error at little additional computational cost. Moreover, we characterize how this conditioning vector shapes the autoencoder’s usage of its latent space. We provide an open source implementation of our architecture in Python, which can sample from its latent space in real-time without the need for powerful computing hardware.

Fig. 4: 2D embeddings of the One Octave Corpus

Fig. 5: 2D embeddings of the TU-Note Violin Sample Library Corpus