Universal audio synthesizer control with normalizing flows

07/01/2019 ∙ by Philippe Esling, et al. ∙ Università degli Studi di Milano 3

The ubiquity of sound synthesizers has reshaped music production and even entirely defined new music genres. However, the increasing complexity and number of parameters in modern synthesizers make them harder to master. Hence, methods that make it easy to create and explore sounds with synthesizers are a crucial need. Here, we introduce a novel formulation of audio synthesizer control. We formalize it as finding an organized latent audio space that represents the capabilities of a synthesizer, while constructing an invertible mapping to the space of its parameters. Using this formulation, we show that we can simultaneously address automatic parameter inference, macro-control learning and audio-based preset exploration within a single model. To solve this new formulation, we rely on Variational Auto-Encoders (VAE) and Normalizing Flows (NF) to organize and map the respective auditory and parameter spaces. We introduce disentangling flows, which perform the invertible mapping between separate latent spaces while steering the organization of some latent dimensions to match target variation factors by splitting the objective as partial density evaluation. We evaluate our proposal against a large set of baseline models and show its superiority in both parameter inference and audio reconstruction. We also show that the model disentangles the major factors of audio variation as latent dimensions, which can be directly used as macro-parameters, and that it is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters. Finally, we discuss the use of our model in creative applications and its real-time implementation in Ableton Live.




1 Introduction

Synthesizers are parametric systems able to generate audio signals ranging from musical instruments to entirely unheard-of sound textures. Since their commercial beginnings more than 50 years ago, synthesizers have revolutionized music production, while becoming increasingly accessible, even to neophytes with no background in signal processing.

While there exists a variety of sound synthesis types [1], they all require a priori knowledge to make the most of a synthesizer's possibilities. Hence, the main appeal of these systems (namely their versatility, provided by large sets of parameters) also entails their major drawback. Indeed, the sheer combinatorics of parameter settings makes exploring all possibilities to find an adequate sound a daunting and time-consuming task. Furthermore, there are highly non-linear relationships between the parameters and the resulting audio. Unfortunately, no synthesizer provides intuitive controls related to perceptual and semantic properties of the synthesis. Hence, a method allowing an intuitive and creative exploration of sound synthesizers has become a crucial need, especially for non-expert users.

Figure 1: Universal synthesizer control. (a) Previous methods perform direct inference from audio, which is limited by non-differentiable synthesis and lacks high-level control. (b) Our novel formulation learns an organized latent space of the synthesizer's audio capabilities, while mapping it to the space of its synthesis parameters.

A potential direction taken by synth manufacturers is to propose macro-controls that make it possible to quickly tune a sound by controlling multiple parameters through a single knob. However, these need to be programmed manually, which still requires expert knowledge. Furthermore, no method has ever tried to tackle this macro-control learning task, as this objective appears unclear and depends on a variety of unknown factors. An alternative to manual parameter setting would be to infer the set of parameters that could best reproduce a given target sound. This task of parameter inference has been studied in the past years using various techniques. In Cartwright et al. [2], parameters are iteratively refined based on audio descriptors similarity and relevance feedback provided by the user. However, this approach appears to be rather inaccurate and slow. Garcia et al. [3] proposed to use genetic programming to directly grow modular synthesizers to solve this problem. Although the approach is appealing and appears accurate, the optimization of a single target can take from 10 to 200 hours, which makes it unusable. Recently, Yee-king et al. [4] showed that a bi-directional LSTM with highway layers can produce accurate parameter approximations. However, this approach does not allow for any user interaction. All of these approaches share the same flaws: (i) though it is unlikely that a synthesizer can generate exactly any audio target, none explicitly models these limitations, and (ii) they do not account for the non-linear relationships that exist between parameters and the corresponding synthesized audio. Hence, no approach has succeeded in unveiling the true relationships between these auditory and parameter spaces. Here, we argue that it is mandatory to organize the parameters and audio capabilities of a given synthesizer in their respective spaces, while constructing an invertible mapping between these spaces in order to access a range of high-level interactions. This idea is depicted in Figure 1.

The recent rise of generative models might provide an elegant solution to these questions. Indeed, amongst these models, the Variational Auto-Encoder (VAE) [5] aims to uncover the underlying structure of the data, by explicitly learning a latent space [5]. This space can be seen as a high-level representation, which aims to disentangle underlying variation factors and reveal interesting structural properties of the data [5, 6]. VAEs address the limitations of control and analysis through this latent space, while being able to learn on small sets of examples. Furthermore, the recently proposed Normalizing Flows (NF) [7] allow to model highly complex distributions in the latent space. Although the use of VAEs for audio applications has only been scarcely investigated, Esling et al. [8] recently proposed a perceptually-regularized VAE that learns a space of audio signals aligned with perceptual ratings via a regularization loss. The resulting space exhibits an organization that is well aligned with perception. Hence, this model appears as a valid candidate to learn an organized audio space.

In this paper, we introduce a radically novel formulation of audio synthesizer control by formalizing it as the general question of finding an invertible mapping between two learned latent spaces. In our case, we aim to map the audio space of a synthesizer's capabilities to the space of its parameters. We provide a generic probabilistic formalization and show that it makes it possible to address the tasks of parameter inference, macro-control learning, audio-based preset exploration and semantic dimension discovery simultaneously within a single model. To elegantly solve this formulation, we introduce conditional regression flows, which map a latent space to any given target space, while steering the organization of some dimensions to match target distributions. Our complete model is depicted in Figure 2.

Based on this formulation, parameter inference simply consists of encoding the audio target to the latent audio space that is mapped to the parameter space. Interestingly, this bypasses the well-known blurriness issue in VAEs as we can generate directly with the synthesizer. We evaluate our proposal against a large set of baseline models and show its superiority in parameter inference and audio reconstruction. Furthermore, we show that our model is the first able to address the new task of automatic macro-control learning. As the latent dimensions are continuous and map to the parameter space, they provide a natural way to learn the perceptually most significant macro-parameters. We show that these controls map to smooth, yet non-linear parameters evolution, while remaining perceptually continuous. Furthermore, as our mapping is invertible, we can map synthesis parameters back to the audio space. This allows intuitive audio-based preset exploration, where exploring the neighborhood of a preset encoded in the audio space yields similarly sounding patches, yet with largely different parameters. Finally, we discuss creative applications of our model and real-time implementation in Ableton Live.

2 State of the art

2.1 Generative models and variational auto-encoders

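Generative models aim to understand a given set of data $\mathbf{x} \in \mathbb{R}^{d_x}$ by modeling the underlying probability distribution of the data $p(\mathbf{x})$. To do so, we consider latent variables defined in a lower-dimensional space $\mathbf{z} \in \mathbb{R}^{d_z}$ ($d_z \ll d_x$), a higher-level representation that could have led to generate a given example. The complete model is defined by the joint distribution $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \vert \mathbf{z}) p(\mathbf{z})$. Unfortunately, real-world data follow complex distributions, which cannot be found analytically. The idea of variational inference (VI) is to solve this problem through optimization by assuming a simpler approximate distribution $q_\phi(\mathbf{z} \vert \mathbf{x}) \in \mathcal{Q}$ from a family of approximate densities [9]. The goal of VI is to minimize the difference between this approximation and the real distribution, by minimizing the Kullback-Leibler (KL) divergence between these densities

$$q_\phi^*(\mathbf{z} \vert \mathbf{x}) = \operatorname{argmin}_{q_\phi(\mathbf{z} \vert \mathbf{x}) \in \mathcal{Q}} \mathcal{D}_{KL}\big[ q_\phi(\mathbf{z} \vert \mathbf{x}) \,\Vert\, p(\mathbf{z} \vert \mathbf{x}) \big]$$

By developing this KL divergence and re-arranging terms (the detailed development can be found in [5]), we obtain

$$\log p(\mathbf{x}) - \mathcal{D}_{KL}\big[ q_\phi(\mathbf{z} \vert \mathbf{x}) \,\Vert\, p(\mathbf{z} \vert \mathbf{x}) \big] = \mathbb{E}_{q_\phi(\mathbf{z} \vert \mathbf{x})}\big[ \log p(\mathbf{x} \vert \mathbf{z}) \big] - \mathcal{D}_{KL}\big[ q_\phi(\mathbf{z} \vert \mathbf{x}) \,\Vert\, p(\mathbf{z}) \big]$$

This formulation describes the quantity we want to model, $\log p(\mathbf{x})$, minus the error we make by using an approximate $q$ instead of the true $p$. Therefore, we can optimize this alternative objective, called the evidence lower bound (ELBO)

$$\mathcal{L}_{\theta, \phi} = \mathbb{E}_{q_\phi(\mathbf{z} \vert \mathbf{x})}\big[ \log p_\theta(\mathbf{x} \vert \mathbf{z}) \big] - \mathcal{D}_{KL}\big[ q_\phi(\mathbf{z} \vert \mathbf{x}) \,\Vert\, p_\theta(\mathbf{z}) \big]$$

The ELBO intuitively minimizes the reconstruction error through the likelihood of the data given a latent, $\log p_\theta(\mathbf{x} \vert \mathbf{z})$, while regularizing the distribution $q_\phi(\mathbf{z} \vert \mathbf{x})$ to follow a given prior distribution $p_\theta(\mathbf{z})$. We can see that this equation involves an encoder $q_\phi(\mathbf{z} \vert \mathbf{x})$, which encodes the data $\mathbf{x}$ into the latent representation $\mathbf{z}$, and a decoder $p_\theta(\mathbf{x} \vert \mathbf{z})$, which generates $\mathbf{x}$ given a $\mathbf{z}$. This structure defines the Variational Auto-Encoder (VAE), where we can use parametric neural networks to model the encoding ($q_\phi$) and decoding ($p_\theta$) distributions. VAEs are powerful representation learning frameworks, while remaining simple and fast to learn without requiring large sets of examples [10].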
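As a minimal sketch of the ELBO described above, the following Python snippet estimates it from a single latent sample. It assumes a diagonal Gaussian posterior, a standard normal prior and a unit-variance Gaussian likelihood; these choices, the `beta` weight, and the toy `toy_encode` and `toy_decode` functions (which stand in for neural networks) are illustrative assumptions, not the setup used in the paper.

```python
import math
import random


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def gaussian_log_likelihood(x, x_hat, sigma=1.0):
    """log N(x; x_hat, sigma^2 I), used here as the reconstruction term."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * (len(x) * math.log(2.0 * math.pi * sigma ** 2) + squared_error / sigma ** 2)


def elbo_single_sample(x, encode, decode, rng, beta=1.0):
    """One-sample ELBO estimate: reconstruction term minus weighted KL term."""
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var, rng)
    reconstruction = gaussian_log_likelihood(x, decode(z))
    return reconstruction - beta * kl_diag_gaussian_to_standard_normal(mu, log_var)


def toy_encode(x):
    """Hypothetical encoder: 1-D latent mean = average of x, fixed log-variance."""
    return [sum(x) / len(x)], [-2.0]


def toy_decode(z):
    """Hypothetical decoder: repeats the latent value over 4 output dimensions."""
    return [z[0]] * 4


rng = random.Random(0)
elbo_value = elbo_single_sample([0.1, 0.2, 0.3, 0.4], toy_encode, toy_decode, rng)
print("single-sample ELBO estimate:", elbo_value)
```

In practice the expectation would be averaged over mini-batches and the encoder and decoder would be trained jointly by gradient ascent on this quantity.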

However, the original formulation of the VAE entails several limitations. First, it has been shown that the KL divergence regularization can lead both to uninformative latent codes (also called posterior collapse) and variance over-estimation [11]. One way to alleviate this problem is to rely on the Maximum Mean Discrepancy (MMD) instead of the KL to regularize the latent space, leading to the Wasserstein Auto-Encoder (WAE) model [12]. Second, one of the key aspects of VI lies in the choice of the family of approximations. The simplest choice is the mean-field family, where latent variables are mutually independent and parametrized by distinct variational parameters, $q(\mathbf{z}) = \prod_{j} q_j(z_j)$. Although this provides an easy tool for analytical development, it might prove too simplistic when modeling complex data, as it assumes pairwise independence among every latent axis. Normalizing flows alleviate this issue by adding a sequence of invertible transformations to the latent variable, providing a more expressive inference process.

2.2 Normalizing flows

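In order to transform a probability distribution, we can rely on the change of variable theorem. As we deal with probability distributions, we need to scale the transformed density so that it still sums to one, and this scaling is measured by the determinant of the Jacobian of the transform. Formally, let $\mathbf{z} \in \mathbb{R}^{d}$ be a random variable with distribution $q(\mathbf{z})$ and $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ an invertible smooth mapping. We can use $f$ to transform $\mathbf{z} \sim q(\mathbf{z})$, so that the resulting random variable $\mathbf{z}' = f(\mathbf{z})$ has the following probability distribution

$$q(\mathbf{z}') = q(\mathbf{z}) \left\vert \det \frac{\partial f^{-1}}{\partial \mathbf{z}'} \right\vert = q(\mathbf{z}) \left\vert \det \frac{\partial f}{\partial \mathbf{z}} \right\vert^{-1}$$

where the last equality is obtained through the inverse function theorem. We can perform any number $K$ of transforms to obtain a final distribution $\mathbf{z}_K \sim q_K(\mathbf{z}_K)$ given by

$$q_K(\mathbf{z}_K) = q_0(\mathbf{z}_0) \prod_{k=1}^{K} \left\vert \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right\vert^{-1}$$

This series of transformations, called a normalizing flow [7], can turn a simple distribution into a complicated multimodal density. For practical use of these flows, we need transforms whose Jacobian determinants are easy to compute. Interestingly, Auto-Regressive (AR) transforms fit this requirement as they lead to a triangular Jacobian matrix. Hence, different AR flows were proposed, such as Inverse AR Flows (IAF) [13] and Masked AR Flows (MAF) [14].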
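To illustrate why auto-regressive transforms make the Jacobian determinant cheap, here is a small hedged sketch of a 3-D affine auto-regressive transform in plain Python. The hand-written `conditioner` function is a hypothetical stand-in for the masked neural networks used in IAF or MAF; it is only an assumption chosen so that output `i` depends on inputs `0..i`. The sketch compares the analytic log-determinant (the sum of the log-scales) against a finite-difference Jacobian.

```python
import math


def conditioner(i, z_prev):
    """Hypothetical (log_scale, shift) for dimension i, using only z_prev[:i]."""
    if i == 0:
        return 0.1, 0.0
    if i == 1:
        return 0.5 * math.tanh(z_prev[0]), z_prev[0] ** 2
    return 0.3 * math.tanh(z_prev[0] + z_prev[1]), z_prev[0] - z_prev[1]


def forward(z):
    """Apply z'_i = z_i * exp(s_i) + m_i and return (z', log|det J|)."""
    out, log_det = [], 0.0
    for i in range(3):
        s, m = conditioner(i, z)
        out.append(z[i] * math.exp(s) + m)
        log_det += s  # triangular Jacobian: determinant is the product of exp(s_i)
    return out, log_det


def inverse(y):
    """Invert the transform one dimension at a time."""
    z = []
    for i in range(3):
        s, m = conditioner(i, z)
        z.append((y[i] - m) * math.exp(-s))
    return z


def standard_normal_logpdf(z):
    return -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2.0 * math.pi)


def transformed_logpdf(y):
    """Change of variables: log q(z') = log q(z) - log|det df/dz|."""
    z = inverse(y)
    _, log_det = forward(z)
    return standard_normal_logpdf(z) - log_det


def numerical_jacobian(z, h=1e-6):
    """Central finite-difference Jacobian of forward(), for checking only."""
    jac = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        out_plus, _ = forward(z_plus)
        out_minus, _ = forward(z_minus)
        for i in range(3):
            jac[i][j] = (out_plus[i] - out_minus[i]) / (2.0 * h)
    return jac


def det3(a):
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


z0 = [0.3, -1.2, 0.7]
y0, analytic_log_det = forward(z0)
numeric_log_det = math.log(abs(det3(numerical_jacobian(z0))))
print(analytic_log_det, numeric_log_det)
```

Stacking several such transforms simply adds their log-determinants, which matches the product form of the final distribution.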

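Normalizing flows in VAEs. Normalizing flows address the simplicity of variational approximations by increasing the complexity of their posterior distribution [7]. In the case of VAEs, we parameterize the approximate posterior distribution with a flow of length $K$, $q_\phi(\mathbf{z} \vert \mathbf{x}) = q_K(\mathbf{z}_K)$, and the new optimization loss can be simply written as an expectation over the initial distribution $q_0(\mathbf{z}_0)$

$$\mathcal{L} = \mathbb{E}_{q_0}\big[ \log q_0(\mathbf{z}_0) \big] - \mathbb{E}_{q_0}\big[ \log p(\mathbf{x}, \mathbf{z}_K) \big] - \mathbb{E}_{q_0}\left[ \sum_{k=1}^{K} \log \left\vert \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right\vert \right]$$

The resulting objective can be easily optimized since $q_0$ is still a Gaussian from which we can easily sample. However, the final samples $\mathbf{z}_K$ used by the decoder are drawn from a more complex distribution.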

3 Our proposal

Figure 2: Universal synthesizer control. We learn an organized latent audio space of a synthesizer's capabilities with a VAE parameterized with NF. This space maps to the parameter space through our proposed regression flow and can be further organized with metadata targets $\mathbf{t}$ (semantic tags). This provides sampling and invertible mapping between different spaces.

3.1 Formalizing synthesizer control

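Considering a set of audio samples $\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ where each $\mathbf{x} \in \mathbb{R}^{d_x}$ follows an unknown distribution $p(\mathbf{x})$, we can define latent factors $\mathbf{z} \in \mathbb{R}^{d_z}$ to model the joint distribution $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x} \vert \mathbf{z}) p(\mathbf{z})$ as detailed in Section 2.1. In our case, some $\mathbf{x} \in \mathcal{D}_s \subset \mathcal{D}$ inside this set have been generated by a given synthesizer. This synthesizer defines a generative function $f_s(\mathbf{v}; p, i) = \mathbf{x}$ where $\mathbf{v} \in \mathbb{R}^{d_v}$ is a set of parameters that produce $\mathbf{x}$ at a given pitch $p$ and intensity $i$. However, in the general case, we know that if $\mathbf{x} \notin \mathcal{D}_s$, then $f_s(\mathbf{v}; p, i) = \hat{\mathbf{x}} = \mathbf{x} + \epsilon$ where $\epsilon$ models the error made when trying to reproduce any audio $\mathbf{x}$ with a given synthesizer. Finally, we consider that some audio examples are annotated with a set of categorical semantic tags $\mathbf{t}$, which define high-level perceptual properties that separate unknown latent factors $\mathbf{z}$ and target factors $\mathbf{t}$. Hence, the complete generative story of a synthesizer can be defined as

$$p(\mathbf{x}, \mathbf{v}, \mathbf{t}, \mathbf{z}) = p(\mathbf{x} \vert \mathbf{v}, \mathbf{t}, \mathbf{z}) \, p(\mathbf{v} \vert \mathbf{t}, \mathbf{z}) \, p(\mathbf{t} \vert \mathbf{z}) \, p(\mathbf{z})$$

This very general formulation entails our original idea that we should uncover the relationship between the latent audio and parameter spaces by modeling $p(\mathbf{v} \vert \mathbf{t}, \mathbf{z})$. The advantage of this formulation is that the reduced dimensionality of the latent $\mathbf{z}$ simplifies the problem of parameter inference, by relying on a more adequate and smaller input space. Furthermore, this formulation also provides a natural way of learning macro-controls by inferring $p(\mathbf{v} \vert \mathbf{z})$ in the general case, where separate dimensions of $\mathbf{z}$ are expected to produce smooth auditory transforms.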

Mapping latent spaces. In order to map the latent and parameter spaces, we can first consider that and semantic tags are both unknown latent factors where so that we address the reduced problem


This allows to separately model the variational approximation (Section 2.1), while solving the inference .

3.2 Mapping latent spaces with regression flows

In order to map the latent and parameter spaces, we first separate our formulation so that


This allows to separately model the variational approximation detailed in Section 2.1, while solving separately the inference problem . To address this inference, we need to find the optimal parameters of a transform so that , where models the inference error as a zero-mean additive Gaussian noise with covariance . Here, we assume that the covariance decomposes into , where are fixed basis functions and

are hyperparameters. Therefore, the full joint likelihood that we need to optimize is given by


If we know the optimal transform and parameters , the likelihood of the data can be easily computed as


However, the two posteriors and remain intractable in the general case. In order to solve this issue, we rely again on variational inference by defining an approximation (see Section 2.1) and assume that it factorizes as . Therefore, our final inference problem is


Hence, we can optimize our approximations through the KL divergence if we find a closed form. For the first factor of this approximation, we use a Gaussian distribution for both the prior and the posterior, which yields a simple analytical solution. However, the second part of the objective might be more tedious. Indeed, to perform an accurate inference, we need to rely on a complicated non-linear function, which cannot be assumed to be Gaussian. To address this issue, we introduce the idea of regression flows. We consider that the transform is a normalizing flow (see Section 2.2), which provides two different ways of optimizing the approximation.

Posterior parameterization. First, we follow a reasoning akin to the original formulation of normalizing flows by parameterizing the posterior with a flow . Hence, by developing the KL expression, we obtain


Hence, we can now safely rely on Gaussian priors for and . This formulation allows to consider as a transformed version of , while being easily invertible as . We denote this version as .

Conditional amortization. Here, we consider that the parameters of the flow are random variables that are optimized by decomposing the posterior KL objective as


As we rely on Gaussian priors for the parameters, this additional KL term can be computed easily. In this version, denoted , parameters of the flow are sampled from their distributions before computing the resulting transform.

3.3 Disentangling flows for semantic dimensions

We introduce semantic tags in the training by expanding latent factors with categorical . Hence, we define the generative process where and is the prior distribution of the tags. We define the inference model as and assume that it factorizes as . In order to handle the fact that tags are not always observed, we define a model similar to [kingma2014semi]. When is unknown, it is considered as a latent variable over which we can perform posterior inference


When tags are known, we take a rather unusual approach through the idea of disentangling flows. As we seek to obtain a latent dimension with continuous semantic control, we define a tag pair as a set of negative and positive samples. We define two target distributions and that model samples of a semantic pair as opposite sides of a latent dimension. Hence, we turn the treatment of tags into a density estimation problem, where we aim to match tagged samples densities to targets minimizing . To solve this, we consider that is parameterized by a normalizing flow applied to the latent , leading to our final objective


This formulation enforces a form of supervised disentanglement, where latent are transformed to provide dimensions with explicit target properties. The final bound is defined as the sum of both objectives and the complete model is obtained by integrating regression and disentangling flows together.

4 Experiments

4.1 Dataset

Synthesizer. We constructed a dataset of synthesizer sounds and parameters using Diva, an off-the-shelf commercial synthesizer developed by U-He (https://u-he.com/products/diva/). It should be noted that our model can work with any synthesizer, as long as we obtain pairs of (audio, parameters) as input. We selected Diva as (i) almost all its parameters can be MIDI-controlled, (ii) large banks of presets are available and (iii) presets include well-organized semantic tag pairs. The factory presets for Diva and additional presets from the internet were collected, leading to a total of roughly 11k files. We manually established the correspondence between synth and MIDI parameters, as well as the value ranges and distributions of the parameters. We only kept continuous parameters and normalized all their values to $[0, 1]$. All other parameters are set to their fixed default value. Finally, we performed PCA and manual screening to select increasing sets of the most used 16, 32 and 64 parameters. We use RenderMan (https://github.com/fedden/RenderMan) to batch-generate all the audio files by playing the note for 3 sec. and recording for 4 sec. to capture the release of the note. The files are saved at 22050 Hz in 16-bit floating point format.

Audio processing. For each sample, we compute a 128 bins Mel-spectrogram with a FFT of size 2048 with a hop of 1024 and frequency range of . We only keep the magnitude of the spectrogram and perform a log-amplitude transform. The dataset is randomly split between a training (80%), validation (10%) and test (10%) set before each training. We repeat the training times to perform -fold cross-validation. Finally, we perform a corpus-wide zero-mean unit-variance normalization based on the train set.
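The shapes and the normalization step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the frame count assumes centered, padded framing (one frame per hop plus one), the `eps` floor in the log transform is hypothetical, and the Mel filterbank itself is left to an audio library, so tiny hand-made matrices stand in for real spectrograms.

```python
import math

SAMPLE_RATE = 22050   # Hz, from the dataset description
DURATION_S = 4        # seconds recorded per note
HOP = 1024            # hop size in samples
N_MELS = 128          # Mel bins per frame


def n_frames(n_samples, hop=HOP):
    """Number of frames, assuming centered and padded framing."""
    return 1 + n_samples // hop


def log_amplitude(magnitudes, eps=1e-7):
    """Log-amplitude transform; eps is an assumed floor to avoid log(0)."""
    return [[math.log(v + eps) for v in row] for row in magnitudes]


def fit_normalizer(train_specs):
    """Corpus-wide mean and standard deviation, computed on the train set only."""
    values = [v for spec in train_specs for row in spec for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(spec, mean, std):
    """Zero-mean unit-variance normalization with train-set statistics."""
    return [[(v - mean) / std for v in row] for row in spec]


frames = n_frames(SAMPLE_RATE * DURATION_S)
print("spectrogram shape per sample:", (N_MELS, frames))

# Toy stand-ins for training spectrograms (2 samples, 2 x 2 each).
toy_train = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
toy_mean, toy_std = fit_normalizer(toy_train)
toy_normalized = [normalize(spec, toy_mean, toy_std) for spec in toy_train]
```

Computing the statistics on the train set only, then reusing them for validation and test data, avoids leaking information across the split.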

4.2 Models

Baseline models. In order to evaluate our proposal, we implemented several feed-forward deep models that take spectrograms as input and try to infer the corresponding parameters. All these models are trained with a loss on the parameter vector. First, we implement a 5-layer MLP with 2048 hidden units per layer, Exponential Linear Unit (ELU) activation, batch normalization and dropout. This model is applied on a flattened version of the input and the final layer is a sigmoid activation. We implement a convolutional model composed of 5 layers with 128 channels of strided dilated 2-D convolutions with kernel size 7, stride 2 and an exponential dilation factor, with batch normalization and ELU activation. The convolutions are followed by a 3-layer MLP identical to the previous model. Finally, we implemented a Residual Network, with parameter settings identical to the convolutional model. (Footnote 4: All remaining details on the models, along with the complete source code for full reproducibility, are available on the supporting webpage.)

Our models. We implemented various *AE architectures with the convolutional setup for encoders and decoders. However, we halve their number of parameters (by dividing the number of units and channels) to perform a fair comparison by obtaining roughly the same capacity as the baselines. All models are trained with a reconstruction loss on the spectrograms. First, we implement a simple deterministic AE without regularization. We implement the VAE by adding a KL regularization to the latent space and the WAE by replacing the KL by the MMD. Finally, we implement a flow-based VAE by adding a normalizing flow of 16 successive IAF transforms to the posterior. All AEs map to latent spaces of dimensionality equal to the number of synthesis parameters. We perform warmup [10] by linearly increasing the latent regularization weight from 0 to 1 for 100 epochs. All AE models are trained with a 2-layer MLP to predict the parameters based on the latent space. Then, we use regression flows by adding them to this flow-based VAE, with an IAF of length 16 without tags. Finally, we add the disentangling flows by adding our objective defined in Sec. 3.3.
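The warmup mentioned above can be written as a short schedule. This is a minimal sketch: the function names and the way the weight multiplies the regularization term are assumptions for illustration.

```python
def warmup_weight(epoch, warmup_epochs=100):
    """Linearly increase the latent regularization weight from 0 to 1."""
    return min(1.0, max(0.0, epoch / warmup_epochs))


def regularized_loss(reconstruction_loss, latent_regularization, epoch):
    """Total loss with the warmed-up weight on the latent regularization."""
    return reconstruction_loss + warmup_weight(epoch) * latent_regularization


for epoch in (0, 25, 50, 100, 500):
    print(epoch, warmup_weight(epoch))
```

Starting with a weight of zero lets the model first learn to reconstruct before the latent regularization is fully enforced.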

Optimization. We train all models for 500 epochs with the ADAM optimizer, an initial learning rate of 0.0002, Xavier initialization and a scheduler that halves the learning rate if the validation loss stalls for 20 epochs. With this setup, the most complex model with regression flows only needs about 5 hours to complete training on an NVIDIA Titan Xp GPU.
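The learning rate schedule described above can be sketched without any deep learning framework. The class below is a hypothetical illustration, and it assumes that "stalls" means no improvement over the best validation loss seen so far; deep learning libraries ship equivalent plateau-based schedulers.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation loss stalls."""

    def __init__(self, lr=2e-4, patience=20, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.stalled_epochs = 0

    def step(self, validation_loss):
        """Update the schedule with one epoch's validation loss; return the lr."""
        if validation_loss < self.best:
            self.best = validation_loss
            self.stalled_epochs = 0
        else:
            self.stalled_epochs += 1
            if self.stalled_epochs >= self.patience:
                self.lr *= self.factor
                self.stalled_epochs = 0
        return self.lr


scheduler = HalveOnPlateau()
scheduler.step(1.0)                 # first epoch sets the best loss
for _ in range(20):                 # 20 epochs without improvement
    current_lr = scheduler.step(1.0)
print("learning rate after plateau:", current_lr)
```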

              Test set - 16 parameters               Test set - 32 parameters               Out-of-domain (32 p.)
Model         Params MSE   Audio SC     Audio MSE     Params MSE   Audio SC     Audio MSE     Audio SC     Audio MSE
MLP           0.236 ± .44  6.226 ± .13  9.548 ± 3.1   0.218 ± .46  13.51 ± 3.1  36.48 ± 11.9  2.348 ± 2.1  37.99 ± 7.8
CNN           0.171 ± .45  1.372 ± .29  6.329 ± 1.9   0.159 ± .46  19.18 ± 4.7  33.40 ± 9.4   2.311 ± 2.2  29.22 ± 8.2
ResNet        0.191 ± .43  1.004 ± .35  6.422 ± 1.9   0.196 ± .49  10.37 ± 1.8  31.13 ± 9.8   2.322 ± 1.6  31.07 ± 9.5
AE            0.181 ± .40  0.893 ± .13  5.557 ± 1.7   0.169 ± .40  5.566 ± 1.2  17.71 ± 6.9   1.225 ± 2.2  27.37 ± 7.2
VAE           0.182 ± .32  0.810 ± .03  4.901 ± 1.4   0.153 ± .34  5.519 ± 1.4  16.85 ± 6.1   1.237 ± 1.3  27.06 ± 7.1
WAE           0.159 ± .37  0.787 ± .05  4.979 ± 1.5   0.147 ± .33  3.967 ± .88  16.64 ± 6.2   1.194 ± 1.5  26.10 ± 6.4
VAE + flow    0.199 ± .32  0.838 ± .02  4.975 ± 1.4   0.164 ± .34  1.418 ± .23  17.74 ± 6.8   1.193 ± 1.8  27.03 ± 6.4
Flow (reg.)   0.197 ± .31  0.752 ± .05  4.409 ± 1.6   0.193 ± .32  0.911 ± 1.4  16.61 ± 7.4   1.101 ± 1.2  26.07 ± 7.7
Flow (dis.)   0.199 ± .31  0.831 ± .04  5.103 ± 2.1   0.197 ± .42  1.481 ± 1.8  17.12 ± 7.9   1.209 ± 1.4  26.77 ± 7.3
Table 1: Comparison between baselines, *AEs and our flows on the test set with 16, 32 parameters and an out-of-domain set. We report across-folds mean and variance for parameters (MSE) and audio (SC and MSE) errors.

5 Results

Figure 3: Reconstruction analysis. Comparing parameters inference and resulting audio on the test set with 16 (a) or 32 (b) parameters, and on the out-of-domain (c) set.

5.1 Parameters inference

First, we compare the accuracy of all models on parameter inference by computing the magnitude-normalized Mean Square Error (MSE) between predicted and original parameter values. We average these results across folds and report the variance. We also evaluate the distance between the audio synthesized from the inferred parameters and the original audio with the Spectral Convergence (SC) distance (magnitude-normalized Frobenius norm) and the MSE. We provide evaluations for 16 and 32 parameters on the test set and on an out-of-domain dataset in Table 1.
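For reference, the two audio metrics can be sketched on small magnitude spectrograms stored as nested lists. This assumes the common definition of spectral convergence as the Frobenius norm of the difference divided by the Frobenius norm of the target, which matches the "magnitude-normalized Frobenius norm" description; the exact normalization used in the evaluation is not specified here.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def spectral_convergence(target, estimate):
    """||target - estimate||_F / ||target||_F on magnitude spectrograms."""
    difference = [[t - e for t, e in zip(row_t, row_e)]
                  for row_t, row_e in zip(target, estimate)]
    return frobenius_norm(difference) / frobenius_norm(target)


def mean_squared_error(target, estimate):
    """Average squared difference over all entries."""
    errors = [(t - e) ** 2
              for row_t, row_e in zip(target, estimate)
              for t, e in zip(row_t, row_e)]
    return sum(errors) / len(errors)


target_spec = [[3.0, 4.0]]
silent_spec = [[0.0, 0.0]]
print(spectral_convergence(target_spec, silent_spec))  # 5 / 5 = 1.0
print(mean_squared_error([[1.0, 2.0]], [[1.0, 4.0]]))  # (0 + 4) / 2 = 2.0
```

Because spectral convergence is normalized by the target magnitude, it emphasizes errors relative to the energy of the target sound rather than absolute differences.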

In low parameters settings, baseline models seem to perform an accurate approximation of parameters, with the providing the best inference. Based on this criterion solely, our formulation would appear to provide only a marginal improvement, with s even outperformed by baseline models and best results obtained by the . However, analysis of the corresponding audio accuracy tells an entirely different story. Indeed, AEs approaches strongly outperform baseline models in audio accuracy, with the best results obtained by our proposed (1-way ANOVA , ). These results show that, even though AE models do not provide an exact parameters approximation, they are able to account for the importance of these different parameters on the synthesized audio. This supports our original hypothesis that learning the latent space of synthesizer audio capabilities is a crucial component to understand its behavior. Finally, it appears that adding disentangling flows () slightly impairs the audio accuracy. However, the model still outperform most approaches, while providing the huge benefit of explicit semantic macro-controls.

Increasing parameters complexity. We evaluate the robustness of different models by increasing the number of parameters from 16 to 32. As we can see, the accuracy of baseline models is highly degraded, notably on audio reconstruction. Interestingly, the gap between parameter and audio accuracies is strongly increased. This seems logical as the relative importance of parameters in larger sets provoke stronger impacts on the resulting audio. Also, it should be noted that models now outperform baselines even on parameters accuracy. Although our proposal also suffers from larger sets of parameters, it appears as the most resilient and manages this higher complexity. While the gap between AE variants is more pronounced, the flows strongly outperform all methods (, ).

Out-of-domain generalization. We evaluate out-of-domain generalization with a set of audio samples produced by other synthesizers, orchestral instruments and voices. We rely on the same audio evaluation and provide results in Table 1 (Right). Here, the overall distribution of scores remains consistent with previous observations. However, it seems that the average error is quite high, indicating a potentially distant reconstruction of some examples. Upon closer listening, it seems that the models fail to reproduce natural sounds (voices, instruments) but perform well with sounds from other synthesizers. In both cases, our proposal accurately reproduces the temporal shape of target sounds, even if the timbre is somewhat distant.

5.2 Reconstructions and latent space

We provide an analysis of the relations between inferred parameters and corresponding audio by selecting samples from different sets and displaying the results in Figure 3.

As we can see, although the provides a close inference of the parameters, the synthesized approximation misses important structural aspects, even in simpler instances of 16 parameters. This observation is amplified for 32 parameters, which confirms that direct inference models are unable to assess the relative impact of parameters on the audio. Indeed, the error in all parameters is considered equivalently, even though the same error magnitude on two different parameters can lead to dramatic differences in the audio. Oppositely, while the parameters inferred by the VAE are quite far from the original, the corresponding audio is largely closer. This indicates that the latent space provides knowledge on audio-based neighborhoods, allowing to understand the impact of different parameters in a given region of the latent audio space. Regarding out-of-domain reconstruction, our proposal appears to provide an accurate rendition of the global temporal shape of the target audio, but seems to miss parts of the target timbre.

5.3 Reconstructions and latent space

We provide an in-depth analysis of the relations between inferred parameters and corresponding synthesized audio to support our previous claims. First, we selected two samples from the test set and compare the inferred parameters and synthesized audio in Figure 4.

Figure 4: Reconstruction analysis. Comparing parameters inference and corresponding synthesized audio on the test dataset between the best performing models.

As we can see, although the provides a close inference of the parameters, the synthesized approximation completely misses important structural aspects, even in simpler instances as the slow ascending attack in the second example. This confirms that direct inference models are unable to assess the relative impact of parameters on the audio. Indeed, the errors in all parameters are considered equivalently, even though the same error magnitude on two different parameters can lead to dramatic differences in the synthesized audio. Oppositely, even though the parameters inferred by the VAE are quite far from the original preset, the corresponding audio is largely closer. This indicates that the latent space provides knowledge on the audio-based neighborhoods of the synthesizer. Therefore, this allows to understand the impact of different parameters in a given region of the latent audio space.

To evaluate this hypothesis, we encode two distant presets in the latent audio space and perform random sampling around these points to evaluate how local neighborhoods are organized. We also analyze the latent interpolation between those examples. The results are displayed in Figure 5.


Figure 5: Latent neighborhoods. We select two examples from the test set that map to distant locations in the latent space and perform random sampling in their local neighborhood to observe the parameters and audio. We also display the latent interpolation between those points.

As we can see, our hypothesis seems to be confirmed by the fact that neighborhoods are highly similar in terms of audio but have a larger variance in terms of parameters. Interestingly, this leads to complex but smooth non-linear dynamics in the parameters control.

5.4 Macro-parameters learning

Our formulation is the first to provide a continuous mapping between the audio and parameter spaces of a synthesizer. As latent VAE dimensions have been shown to disentangle major data variations, we hypothesized that we could directly use them as macro-parameters defining the most interesting variations in a given synthesizer. Hence, we introduce the new task of macro-parameter learning by mapping latent audio dimensions to parameters through the learned mapping, which provides simplified control of the major audio variations for a given synthesizer. This is depicted in Figure 6.

Figure 6: Macro-parameters learning. We show two of the learned latent dimensions and compute the mapping when traversing these dimensions, while keeping all other fixed at to see how define smooth macro-parameters. We plot the evolution of the 5 parameters with highest variance (top), the corresponding synthesis (middle) and audio descriptors (bottom). (Left) seems to relate to a percussivity parameter. (Right) defines an harmonic densification parameter.

We show the two most informative latent dimensions based on their variance. We study the traversal of these dimensions while keeping all others fixed, to assess how each one defines a smooth macro-parameter through the mapping to the parameter space. We report the evolution of the 5 parameters with highest variance (top), the corresponding synthesis (middle) and audio descriptors (bottom).

First, we can see that the latent dimensions correspond to very smooth evolutions in terms of synthesized audio and descriptors. This is consistent with previous studies on the disentangling abilities of VAEs [6]. However, a very interesting property appears when we map to the parameter space. Although the parameter evolution is still smooth, it exhibits more non-linear relationships between different parameters. This agrees with the intuition that the parameters of a synthesizer interact in many complex ways. Our formulation alleviates this complexity by automatically providing the macro-parameters that are most relevant to the audio variations of a given synthesizer. Here, we can see that the latent dimension on the left seems to provide a percussivity parameter: low values produce a very slow attack and, as we move along this dimension, the attack becomes sharper and the amount of noise increases. Similarly, the dimension on the right seems to define a harmonic densification parameter, starting from a single peak frequency and increasingly adding harmonics and noise.
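The traversal procedure can be sketched as follows. This is only an illustration: `latent_to_params` is a hypothetical smooth non-linear map with random weights standing in for the trained mapping, and the base value of zero for the fixed dimensions, the traversal range and the number of reported parameters are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_params = 4, 6  # assumed sizes, for illustration only

# Random weights standing in for a trained latent-to-parameter mapping.
W = rng.normal(size=(latent_dim, n_params))

def latent_to_params(z):
    """Hypothetical smooth non-linear map from latent codes to parameters in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.tanh(z) @ W)))

def traverse(dim, values, base=0.0):
    """Vary one latent dimension while every other dimension stays at `base`."""
    z = np.full((len(values), latent_dim), base)
    z[:, dim] = values
    return latent_to_params(z)

values = np.linspace(-3.0, 3.0, 50)
params = traverse(dim=0, values=values)

# Rank parameters by how much they move along the traversal.
variances = params.var(axis=0)
top = np.argsort(variances)[::-1][:3]
print("parameters with highest variance:", top)
```

Plotting the columns of `params` selected by `top` against `values` gives curves of the kind shown in the top row of Figure 6: a single control value drives several parameters at once, each along its own non-linear trajectory.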

5.5 Semantic parameter discovery

Our proposed disentangling flows can steer the organization of selected latent dimensions so that they provide a separation of given tags. As this audio space is mapped to parameters through the learned mapping, this turns the selected dimensions into macro-parameters with a clearly defined semantic meaning. To evaluate this, we analyze the behavior of the corresponding latent dimensions, as depicted in Figure 7.

Figure 7: Semantic macro-parameters. Two latent dimensions learned through disentangling flows for different semantic pairs. We show the effect on the latent space (left) and the parameter mapping when traversing these dimensions, which define smooth macro-parameters. We plot the evolution of the 6 parameters with highest variance and the resulting synthesized audio.

First, we can see the effect of disentangling flows on the latent space (left), which provide a separation of semantic pairs. We study the traversal of semantic dimensions while keeping all others fixed, and infer parameters through the learned mapping. We display the 6 parameters with highest variance and the resulting synthesized audio. As we can see, the semantic latent dimensions provide a very smooth evolution in terms of both parameters and synthesized audio. Interestingly, while the parameter evolution is smooth, it exhibits non-linear relationships between different parameters. This agrees with the intuition that the parameters of a synthesizer interact in complex ways. Regarding the effect of different semantic dimensions, it appears that the [’Constant’, ’Moving’] pair provides a very intuitive result. Indeed, the synthesized sounds are mostly stationary at extreme negative values, but gradually incorporate clearly marked temporal modulations. Hence, our proposal appears successful in uncovering semantic macro-parameters for a given synthesizer. However, the corresponding parameters are considerably harder to interpret. The [’Calm’, ’Aggressive’] dimension also provides an intuitive control, starting from a sparse sound and increasingly adding modulation, resonance and noise. However, we note that the notion of ’Aggressive’ is highly subjective and requires finer analyses to be conclusive.

5.6 Creative applications

Our proposal allows a direct exploration of presets based on audio similarity. Indeed, as the flow is invertible, we can map parameters to the audio space for exploration, and then back to parameters to obtain a new preset. Furthermore, this can be combined with vocal sketch control, where users input vocal imitations of the sound they are looking for. This allows them to quickly produce an approximation of the intended sound and then explore the audio neighborhood of the sketch for intuitive refinement. We embedded our model inside a MaxMSP external called flow_synth~ by using the LibTorch API and further integrated it into Ableton Live by using the Max4Live interface.
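The round trip that makes this exploration possible can be sketched with a single affine coupling layer, a standard invertible building block of normalizing flows. This is a toy illustration, not the flow_synth~ implementation: the weights are random rather than trained, the dimensionality and the noise scale are arbitrary, and calling the parameter-to-latent direction `forward` is a naming assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6            # number of synthesizer parameters (assumed)
half = D // 2

# Random weights standing in for trained scale and shift networks.
W_s = rng.normal(scale=0.5, size=(half, half))
W_t = rng.normal(scale=0.5, size=(half, half))

def forward(v):
    """Parameters -> latent: the first half passes through, the second is scaled and shifted."""
    v1, v2 = v[..., :half], v[..., half:]
    s, t = np.tanh(v1 @ W_s), v1 @ W_t
    return np.concatenate([v1, v2 * np.exp(s) + t], axis=-1)

def inverse(z):
    """Latent -> parameters: exact inverse of `forward`."""
    z1, z2 = z[..., :half], z[..., half:]
    s, t = np.tanh(z1 @ W_s), z1 @ W_t
    return np.concatenate([z1, (z2 - t) * np.exp(-s)], axis=-1)

preset = rng.uniform(size=D)                           # an existing preset
z = forward(preset)                                    # map it to the latent space
neighbors = z + 0.05 * rng.normal(size=(4, D))         # explore around it
new_presets = inverse(neighbors)                       # map back to parameters

print(np.allclose(inverse(forward(preset)), preset))
print(new_presets.shape)
```

Because the inverse is exact, every explored latent point corresponds to one well-defined preset, which is what allows moving back and forth between the two spaces during exploration.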

6 Conclusion

In this paper, we introduced several novel ideas, including a reformulation of the problem of synthesizer control as matching two latent spaces, defined as the user perception space and the synthesizer parameter space. We showed that our approach outperforms all previous proposals on the seminal problem of parameter inference. Our formulation also naturally introduces the original tasks of macro-control learning, audio-based preset exploration and semantic parameter discovery. This proposal is the first able to address most synthesizer control issues simultaneously.

Altogether, we hope that this work will provide new means of exploring audio synthesis, sparking new leaps in musical creativity.

7 Acknowledgements

This work was supported by the MAKIMOno project (ANR:17-CE38-0015-01 and NSERC:STPG 507004-17) and the ACTOR Partnership (SSHRC:895-2018-1023).