The generation of musical audio has expanded continuously since the advent of digital systems. Audio synthesis methods relying on parametric models can be derived from physical considerations, spectral analysis (e.g. sinusoids plus noise models [sms]) or signal processing operations (e.g. frequency modulation). As an alternative to these signal generation techniques, samplers provide synthesis mechanisms by relying on stored waveforms and sets of audio transformations. However, these methods do not scale to large audio sample libraries and cannot aggregate a model over the whole data. Therefore, they cannot globally manipulate the audio features in the sound generation process. To this end, corpus-based synthesis has been introduced: sets of signals are sliced into shorter audio segments, which can be rearranged into new waveforms through a selection algorithm.
An instance of corpus-based synthesis, named granular sound synthesis, uses short waveform windows of a fixed length. These units (called grains) usually have a size ranging between 10 and 100 milliseconds. For a given corpus, the grains are extracted and can be analyzed through audio descriptors [adesc] in order to facilitate their manipulation. Such an analysis space provides a representation that reflects some form of local similarities across grains. The grain corpus is displayed as a cloud of points whose distances relate to some of their acoustic relationships. By relying on this space, resynthesis can be done with concatenative sound synthesis [catart]. To a certain extent, this process can emulate the spectro-temporal dynamics of a given signal. However, the perceptual quality of the audio similarities, assessed through predefined sets of acoustic descriptors, is inherently biased by their design. These descriptors only offer a limited consistency across many different sounds, within the corpus and with respect to other targets. Furthermore, the synthesis process can only use the original grains, which precludes continuously invertible interpolations in this grain space.
To enhance the expressivity of granular synthesis, grain sequences should be drawn in more flexible ways, by understanding the temporal dynamics of trajectories in the acoustic descriptor space. However, current methods are restricted to random or simple hand-drawn paths. Traversals across the space map to grain series that are ordered according to the corresponding feature. However, given that the grain space from current approaches is not invertible, these paths do not correspond to continuous audio synthesis, besides that of each of the scattered original grains. This could be alleviated by having a denser grain space (leading to smoother assembled waveforms), but it would require a correspondingly increasing amount of memory, quickly exceeding the gigabyte scale given the size of current sound sample libraries. In a real-time setting, this imposes further limitations on a traditional granular synthesis space. As current methods only account for local relationships, they cannot generate the structured temporal dynamics of musical notes or drum hits without having a strong inductive bias, such as a target signal. Finally, the audio descriptors and the slicing size of grains are critical parameters to choose for these methods. They model the perceptual relationships across elements and set a trade-off: shorter grains allow for a denser space and faster sound variations, at the expense of a limited estimate of the spectral features and the need to process longer series for a given signal duration.
In this paper, we show that we can address most of the aforementioned shortcomings by drawing parallels between granular sound synthesis and probabilistic latent variable models. We develop a new neural granular synthesis technique that refines granular synthesis and is efficiently solved by generative neural networks (Figure 1). Through the repeated observation of grains, our proposed technique adaptively learns analysis dimensions without supervision, structuring a latent grain space that is continuously invertible to the signal domain. This space embeds the training dataset, which is no longer required in memory for generation. It allows novel grains to be generated continuously at any interpolated latent position. In a second step, this space serves as the basis for higher-level temporal modeling, by training a sequential embedding over contiguous series of grain features. As a result, we can sample latent paths with a consistent temporal structure and, moreover, relieve some of the challenges of learning to generate raw waveforms. The architecture is suited to optimizing local spectro-temporal features that are essential for audio quality, as well as longer-term dependencies that are efficiently extracted from grain-level sequences rather than individual waveform samples. The trainable modules used are well-grounded in digital signal processing (DSP), thus interpretable and efficient for sound synthesis. With simple variations, the model can adapt to many audio domains as well as different user interactions. With this motivation, we report several experiments applying the creative potentials of granular synthesis to neural waveform modeling: continuous free-synthesis with variable step size, one-shot sample generation with controllable attributes, analysis/resynthesis for audio style transfer and high-level interpolation between audio samples.
2 State of the art
2.1 Generative neural networks
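Generative models aim to understand a given set $\mathbf{x} \in \mathbb{R}^{d_x}$ by modeling an underlying probability distribution $p(\mathbf{x})$ of the data. To do so, we consider latent variables $\mathbf{z} \in \mathbb{R}^{d_z}$ defined in a lower-dimensional space ($d_z \ll d_x$), as a higher-level representation generating any given example. The complete model is defined by $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})$. However, a real-world dataset follows a complex distribution that cannot be evaluated analytically. The idea of variational inference (VI) is to address this problem through optimization by assuming a simpler distribution $q_\phi(\mathbf{z}|\mathbf{x}) \in \mathcal{Q}$ from a family of approximate densities [vae]. The goal of VI is to minimize differences between the approximated and real distribution, by using their Kullback-Leibler (KL) divergence
$$q_\phi^*(\mathbf{z}|\mathbf{x}) = \underset{q_\phi \in \mathcal{Q}}{\mathrm{argmin}} \; D_{KL}\big[ q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}) \big] \tag{1}$$
By developing this divergence and re-arranging terms (detailed development can be found in [vae]), we obtain
$$\log p(\mathbf{x}) - D_{KL}\big[ q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}) \big] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[ \log p(\mathbf{x}|\mathbf{z}) \big] - D_{KL}\big[ q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}) \big] \tag{2}$$
This formulation of the Variational Auto-Encoder (VAE) relies on an encoder $q_\phi(\mathbf{z}|\mathbf{x})$, which aims at minimizing the distance to the unknown conditional latent distribution. Under this assumption, the Evidence Lower Bound Objective (ELBO) is optimized by minimization of a $\beta$-weighted KL regularization over the latent distribution added to the reconstruction cost of the decoder $p_\theta(\mathbf{x}|\mathbf{z})$
$$\mathcal{L}_{\theta,\phi} = -\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[ \log p_\theta(\mathbf{x}|\mathbf{z}) \big] + \beta \cdot D_{KL}\big[ q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}) \big] \tag{3}$$
The second term of this loss requires the definition of a prior distribution over the latent space, which for ease of sampling and back-propagation is chosen to be an isotropic Gaussian of unit variance, $p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. Accordingly, a forward pass of the VAE consists in encoding a given data point $\mathbf{x}$ to obtain a mean $\mu(\mathbf{x})$ and variance $\sigma^2(\mathbf{x})$. These allow us to obtain the latent $\mathbf{z}$ by sampling from the Gaussian, such that $\mathbf{z} \sim \mathcal{N}(\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$.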
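As an illustration of the forward pass and of the regularization term, the following minimal NumPy sketch draws a latent sample by reparameterization and evaluates the closed-form KL divergence between a diagonal Gaussian and the unit Gaussian prior. The function names, the log-variance parameterization and the toy dimensions are assumptions made for this example; they are not taken from the implementation described in this paper.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((4, 8))       # hypothetical batch of 4 points, 8 latent dimensions
log_var = np.zeros((4, 8))  # log(sigma^2) = 0, i.e. unit variance
z = sample_latent(mu, log_var, rng)
kl = kl_to_unit_gaussian(mu, log_var)  # zero when the posterior equals the prior
```

Predicting the log-variance rather than the variance is a common choice because it keeps the variance positive without an explicit constraint.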
The representation learned with a VAE has a smooth topology [higgins2016beta], since its encoder is regularized on a continuous density and intrinsically supports sampling within its unsupervised training process. Its latent dimensions can serve both for analysis, when encoding new samples, and as generative variables that can continuously be decoded back to the target data domain. Furthermore, it has been shown [esling2018generative] that it can be successfully applied to audio generation. Thus, it is the core of our neural model for granular synthesis of raw waveforms.
2.2 Neural waveform generation
Applications of generative neural networks to raw audio data must face the challenge of modeling time series with very high sampling rates. Hence, the models must account for both local features ensuring the generated audio quality, as well as longer-term relationships (consistent over tens of thousands of samples) in order to form meaningful signals. The first proposed approaches were based on auto-regressive models, which exploit the causal nature of audio. Given the whole waveform $\mathbf{x} = \{x_1, \dots, x_T\}$, these models decompose the joint distribution into a product of conditional distributions. Hence, each sample is generated conditionally on all previous ones
$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \,|\, x_1, \dots, x_{t-1}) \tag{4}$$
Amongst these models, WaveNet [wavenet] has been established as the reference solution for high-quality speech synthesis. It has also been successfully applied to musical audio with the NSynth dataset [nsynth]. However, generating a signal in an auto-regressive manner is inherently slow since it iterates one sample at a time. Moreover, a large convolutional structure is needed in order to infer even a limited context of 100 ms. This results in heavy models, only adapted to large databases and requiring long training times.
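To give a sense of scale for the 100 ms figure, the short sketch below computes the receptive field of stacked dilated causal convolutions and counts how many stacks are needed to cover that context. The configuration (kernel size 2, dilations doubling from 1 to 512, 16 kHz sampling rate) is an assumption chosen for illustration, not a setting reported in this paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

sample_rate = 16000                      # assumed sampling rate in Hz
one_stack = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512
target = sample_rate // 10               # 100 ms expressed in samples

# Repeat the dilation stack until the receptive field covers the target context.
stacks = 1
while receptive_field(2, one_stack * stacks) < target:
    stacks += 1

n_layers = stacks * len(one_stack)
context_ms = 1000.0 * receptive_field(2, one_stack * stacks) / sample_rate
```

Under these assumptions, a single stack of 10 layers sees 1024 samples, which is less than 100 ms, so a second stack is required.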
Specifically for musical audio generation, the Symbol-to-Instrument Neural Generator (SING) proposes an overlap-add convolutional architecture [sing] on top of which a sequential embedding $S$ is trained on frame steps $f$, by conditioning over instrument, pitch and velocity classes $(I, P, V)$. The model processes signal windows of 1024 points with a 75% overlap, thus reducing the temporal dimension by a factor of 256 before the forward pass of the up-sampling convolutional decoder $D$. Given an input signal $x$ with log-magnitude spectrogram $l(x)$, the decoder outputs a reconstruction $\hat{x}$, in order to optimize
$$\mathcal{L}_{stft,1} = \big\| l(x) - l(\hat{x}) \big\|_1 \tag{5}$$
for $l(x) = \log(\epsilon + |\mathrm{STFT}[x]|^2)$. This approach removes auto-regressive computation costs and offers meaningful controls, while achieving high-quality synthesis. However, given its specific architecture, it does not generalize to generative tasks other than sampling individual instrumental notes of fixed duration in pitched domains.
Recently, additional inductive biases arising from digital signal processing have made it possible to specify tighter constraints on model definitions, leading to high sound quality with lower training costs. In this spirit, the Neural Source-Filter (NSF) model [nsf] applies the idea of Spectral Modeling Synthesis (SMS) [sms] to speech synthesis. Its input module receives acoustic features and computes conditioning information for the source and temporal filtering modules. In order to render both voiced and unvoiced sounds, sinusoidal and Gaussian noise excitations are fed into separate filter modules. Estimation of noisy and harmonic components is further improved by relying on a multi-scale spectrogram reconstruction criterion.
Similar to NSF, but for pitched musical audio, the Differentiable Digital Signal Processing (DDSP) model [ddsp] has been proposed. Compared to NSF, this architecture features a harmonic additive synthesizer that is summed with a subtractive noise synthesizer. Envelopes for the fundamental frequency and loudness, as well as latent features, are extracted from a waveform and fed into a recurrent decoder which controls both synthesizers. An alternative filter design is proposed by learning frequency-domain transfer functions of time-varying Finite Impulse Response (FIR) filters. Furthermore, the summed output is fed into a reverberation module that refines the acoustic quality of the signal. Although this process offers very promising results, it is restricted in the nature of signals that can be generated.
3 Neural granular sound synthesis
In this paper, we propose a model that can learn both a local audio representation and modeling at multiple time scales, by introducing a neural version of granular sound synthesis [catart]. The audio quality of short-term signal windows is ensured by efficient DSP modules optimized with a spectro-temporal criterion suited to both periodic and stochastic components. We structure the relative acoustic relationships in a latent grain space, by explicitly reconstructing waveforms through an overlap-add mechanism across audio grain sequences. This synthesis operation can model any type of spectrogram, while remaining interpretable. Our proposal allows for analysis prior to data-driven resynthesis and also performs continuous free-synthesis trajectories of variable length. Taking advantage of this grain-level representation, we further train a higher-level sequence embedding to generate audio events with meaningful temporal structure. In its least restrictive definition, our model allows for unconditional sampling, but it can be trained with additional independent controls (such as pitch or user classes) for more explicit interactions in composition and sound transfer. The complete architecture is depicted in Figure 2.
3.1 Latent grain space
Formally, we consider a set of audio grains $\mathbf{x}_i \in \mathbb{R}^{d_x}$ extracted from audio waveforms in a given sound corpus, with fixed grain size $d_x$. This set of grains follows an underlying probability density $p(\mathbf{x})$ that we aim to approximate through a parametric distribution $p_\theta(\mathbf{x})$. This would allow us to synthesize consistent novel audio grains by sampling $\hat{\mathbf{x}} \sim p_\theta(\mathbf{x})$. Since this likelihood is usually intractable, we tackle this process by introducing a set of latent variables $\mathbf{z} \in \mathbb{R}^{d_z}$ ($d_z \ll d_x$). This low-dimensional space is expected to represent the most salient features of the data, which might have led to generate a given example. In our case, it will efficiently replace the use of acoustic descriptors, by optimizing continuous generative features. This latent grain space is based on an encoder network that models $q_\phi(\mathbf{z}|\mathbf{x})$, paired with a decoder network $p_\theta(\mathbf{x}|\mathbf{z})$ allowing to recover $\hat{\mathbf{x}}_i$ for every grain $\mathbf{x}_i$. We use the Variational Auto-Encoder [vae] with a mean-field family and Gaussian prior to learn a smooth latent distribution $p(\mathbf{z})$.
3.2 Latent path encoder
As we will perform overlap-add reconstruction, our model processes series of $n$ grains $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ extracted from a given waveform $w$. The down-sampling ratio between the waveform duration and number of grains is given by the hop size separating neighboring grains. Each of these grains is analyzed separately by the encoder in order to produce $q_\phi(\mathbf{z}_i|\mathbf{x}_i)$. Hence, the successive encoded grains form a corresponding series of latent coordinates such that
$$\mathbf{z}_{i=1 \dots n} \sim \mathcal{N}\big( \mu(\mathbf{x}_i), \sigma^2(\mathbf{x}_i) \big) \tag{6}$$
The layers of the encoder are first strided residual convolutions that successively down-sample the input grains through temporal 1-dimensional filters. The output of these layers is then fed into several fully-connected linear layers that map to Gaussian means and variances at the desired latent dimensionality.
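The slicing step that precedes the encoder can be sketched as follows. This is a minimal NumPy example, assuming a hypothetical grain size of 1024 samples and a 75% overlap; the function name and the toy waveform are introduced for illustration only.

```python
import numpy as np

def slice_grains(waveform, grain_size, overlap=0.75):
    """Cut a 1-D waveform into overlapping grains; returns (grains, hop)."""
    hop = int(grain_size * (1.0 - overlap))
    n_grains = 1 + (len(waveform) - grain_size) // hop
    starts = hop * np.arange(n_grains)
    # Index matrix of shape (n_grains, grain_size): one row of sample indices per grain.
    idx = starts[:, None] + np.arange(grain_size)[None, :]
    return waveform[idx], hop

waveform = np.arange(16000, dtype=float)  # stand-in for one second at 16 kHz
grains, hop = slice_grains(waveform, grain_size=1024)
```

With these values the hop size is 256 samples, so the series of grains is 256 times shorter than the waveform, up to boundary effects. Trailing samples that do not fill a complete grain are dropped in this sketch.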
3.3 Spectral filtering decoder
Given a latent series $\mathbf{z}_{i=1 \dots n}$, the decoder must first synthesize each grain prior to the overlap-add operation. To that end, we introduce a filtering model that adapts the design of [ddsp] to granular synthesis. Hence, each $\mathbf{z}_i$ is processed by a set of residual fully-connected layers that produces frequency domain coefficients $H_i$ of a filtering module that transforms uniform noise excitations into waveform grains. We replace the recurrence over envelope features proposed in [ddsp] by performing separate forward passes over overlapping grain features. Denoting the Discrete Fourier Transform DFT and its inverse iDFT, this amounts to computing
$$\hat{\mathbf{x}}_i = \mathrm{iDFT}\big( H_i \cdot \mathrm{DFT}(\mathrm{noise}) \big) \tag{7}$$
Since the DFT of a real-valued signal is Hermitian, symmetry implies that for an even grain size $d_x$, the network only filters the $d_x/2 + 1$ positive frequencies.
These grains are then used in an overlap-add mechanism that produces the waveform, which is passed through a final learnable post-processing inspired by [wavegan]. This module applies a multi-channel temporal convolution that learns a parallel set of time-invariant FIR filters and improves the audio quality of the assembled signal.
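The grain synthesis and overlap-add steps described above can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions: the filter coefficients are a fixed low-pass ramp standing in for the decoder outputs, the noise is drawn in [-1, 1], a Hann window is applied before summation, and the learnable post-processing is omitted. None of these choices is confirmed by the text.

```python
import numpy as np

def filter_noise_grain(coeffs, grain_size, rng):
    """Shape uniform noise with real gains applied to the positive frequencies."""
    noise = rng.uniform(-1.0, 1.0, size=grain_size)
    spectrum = np.fft.rfft(noise)  # grain_size // 2 + 1 bins for an even grain size
    return np.fft.irfft(coeffs * spectrum, n=grain_size)

def overlap_add(grains, hop):
    """Sum Hann-windowed grains placed at multiples of the hop size."""
    n_grains, grain_size = grains.shape
    window = np.hanning(grain_size)
    out = np.zeros(hop * (n_grains - 1) + grain_size)
    for i, grain in enumerate(grains):
        out[i * hop : i * hop + grain_size] += window * grain
    return out

rng = np.random.default_rng(0)
grain_size, hop, n_grains = 512, 128, 16  # hypothetical sizes, 75% overlap
n_bins = grain_size // 2 + 1
coeffs = np.linspace(1.0, 0.0, n_bins)    # stand-in for decoder outputs (low-pass)
grains = np.stack(
    [filter_noise_grain(coeffs, grain_size, rng) for _ in range(n_grains)]
)
waveform = overlap_add(grains, hop)
```

Using the real-input FFT pair makes the Hermitian symmetry implicit: only the positive-frequency gains are specified and the output grain is real by construction.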
3.4 Sequence trajectories embedding
As argued earlier, generative audio models need to sample audio events with a consistent long-term temporal structure. Our model provides this in an efficient manner, by learning a higher-level distribution of sequences that models temporal trajectories in the granular latent space. This makes it possible to exploit the down-sampling of an intermediate frame-level representation in order to learn longer-term relationships. This is achieved by training a temporal recurrent neural network on ordered sequences of grain features. This process can be applied equivalently to any type of audio signal. As a result, our proposal can also synthesize and transfer meaningful temporal paths inside the latent grain space. Generation starts by sampling a sequence embedding from the Gaussian prior, then sequentially decoding the corresponding latent series, and finally generating the grains and the overlap-add waveform with the spectral filtering decoder.
3.5 Multi-scale training objective
To optimize the waveform reconstruction, we rely on a multi-scale spectrogram loss [nsf, ddsp], where STFTs are computed with increasing hop and window sizes, so that the temporal scale is down-sampled while the spectral accuracy is refined. We use both linear and log-frequency STFT [nnaudio], on which we compare log-magnitudes with the L1 distance $\| \cdot \|_1$. In addition to fitting multiple resolutions of STFT, we can explicitly control the trade-off between low and high-energy components with the floor value $\epsilon$ [sing]. In order to optimize a latent grain space, KL regularization and sampling (6) are performed for each latent point $\mathbf{z}_i$, thus we extend the original VAE objective (3) as
$$\mathcal{L}_{\theta,\phi} = \sum_{s=1}^{S} \big\| l_s(w) - l_s(\hat{w}) \big\|_1 + \beta \cdot \sum_{i=1}^{n} D_{KL}\big[ q_\phi(\mathbf{z}_i|\mathbf{x}_i) \,\|\, p_\theta(\mathbf{z}) \big] \tag{9}$$
where $l_s$ denotes the log-magnitude spectrogram at scale $s$, $w$ and $\hat{w}$ are the input and reconstructed waveforms, $S$ is the number of scales in the spectrogram loss and $n$ is the number of grains processed in one sequence.
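The reconstruction part of this objective can be illustrated with a small NumPy sketch of a multi-scale log-magnitude spectrogram loss. Several points are assumptions made for this example: only linear-frequency STFTs are computed, magnitudes (rather than powers) are used inside the logarithm, the L1 distance is averaged over bins, and the window sizes and floor value are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def log_magnitude_stft(x, window_size, overlap=0.75, floor=1e-3):
    """Log-magnitude STFT with a Hann window; frames are rows."""
    hop = int(window_size * (1.0 - overlap))
    n_frames = 1 + (len(x) - window_size) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(window_size)[None, :]
    frames = x[idx] * np.hanning(window_size)
    return np.log(floor + np.abs(np.fft.rfft(frames, axis=-1)))

def multiscale_spectral_loss(x, x_hat, window_sizes, floor=1e-3):
    """Sum over scales of the mean absolute log-magnitude difference."""
    total = 0.0
    for w in window_sizes:
        target = log_magnitude_stft(x, w, floor=floor)
        estimate = log_magnitude_stft(x_hat, w, floor=floor)
        total += np.mean(np.abs(target - estimate))
    return float(total)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
scales = [64, 128, 256]  # placeholder window sizes
loss_same = multiscale_spectral_loss(x, x, scales)
loss_scaled = multiscale_spectral_loss(x, 0.5 * x, scales)
```

A larger floor value compresses low-energy bins toward a constant, which reduces their weight in the loss relative to high-energy components.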
In order to evaluate our model across a wide variety of sound domains, we train on the following datasets:
Studio-On-Line provides individual note recordings sampled at 22050 Hz with labels (pitch, instrument, playing technique) for 12 orchestral instruments. The tessitura of Alto-Saxophone, Bassoon, Clarinet, Flute, Oboe, English-Horn, French-Horn, Trombone, Trumpet, Cello, Violin and Piano is on average played in 10 different extended techniques. The full set amounts to around 15000 notes [sol].
8 Drums contains around 6000 one-shot recordings sampled at 16000 Hz in the Clap, Cowbell, Crash, Hat, Kick, Ride, Snare and Tom instrument classes (https://github.com/chrisdonahue/wavegan/tree/v1).
10 animals contains around 3 minutes of recordings sampled at 22050 Hz for each of the Cat, Chirping Birds, Cow, Crow, Dog, Frog, Hen, Pig, Rooster and Sheep classes of the ESC-50 dataset (https://github.com/karolpiczak/ESC-50).
For datasets sampled at 22050 Hz, we use a grain size , which subsequently sets the filter size , and compute spectral losses for STFT window sizes . For datasets sampled at 16000 Hz, and STFT window sizes range from 32 to 1024. Hop sizes for both grain series and STFTs are set with an overlap ratio of 75%. Log-magnitudes are computed with a floor value . Dimensions for latent features are and .
Since datasets provide some labels, we train both unconditional models and variants with decoder conditioning. For instance, Studio-On-Line can be trained with control over pitch and/or instrument classes when using multiple instrument subsets. Otherwise, for a single instrument, we can instead condition on its playing styles (such as Pizzicato or Tremolo for the violin). To do so, we concatenate one-hot encoded labels $\mathbf{y}$ to the latent vectors at the input of the decoder. During generation we can explicitly set these target conditions, which provide independent controls over the considered sound attributes
$$\hat{\mathbf{x}}_i \sim p_\theta\big( \mathbf{x} \,|\, \mathrm{concat}(\mathbf{z}_i, \mathbf{y}_{target}) \big) \tag{10}$$
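The concatenation of labels at the decoder input can be sketched as follows. The latent dimensionality, the number of classes and the helper names are hypothetical and serve only to show the shapes involved.

```python
import numpy as np

def one_hot(index, n_classes):
    """One-hot vector with a single 1 at the given class index."""
    vec = np.zeros(n_classes)
    vec[index] = 1.0
    return vec

def condition_latents(z_series, class_index, n_classes):
    """Append the same one-hot label to every latent vector of a series."""
    label = one_hot(class_index, n_classes)
    labels = np.tile(label, (z_series.shape[0], 1))
    return np.concatenate([z_series, labels], axis=-1)

z_series = np.zeros((32, 64))  # hypothetical: 32 grains, 64 latent dimensions
decoder_input = condition_latents(z_series, class_index=3, n_classes=12)
```

Because the label occupies separate input dimensions, it can be changed at generation time without modifying the latent path itself.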
The model is trained according to eq. 9. In the first epochs only the reconstruction is optimized, which amounts to $\beta = 0$. This regularization strength is then linearly increased to its target value during some warm-up epochs. The last epochs of training optimize the full objective at the target regularization strength, which is roughly fixed in order to balance the gradient magnitudes when individually back-propagating each term of the objective. The number of training iterations varies depending on the dataset; we use a minibatch size of 40 grain sequences, an initial learning rate of and the ADAM optimizer. In this setting, a model can be fitted within 10 hours on a single GPU, such as an Nvidia Titan V.
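The three training phases described above correspond to a simple schedule for the regularization strength. The sketch below is one possible implementation; the epoch counts and the target value are placeholders, since the text does not specify them.

```python
def beta_schedule(epoch, n_reconstruction, n_warmup, beta_target):
    """Regularization strength for a given epoch (0-indexed).

    Phase 1: reconstruction only (beta = 0).
    Phase 2: linear increase up to beta_target over n_warmup epochs.
    Phase 3: constant beta_target.
    """
    if epoch < n_reconstruction:
        return 0.0
    if epoch < n_reconstruction + n_warmup:
        return beta_target * (epoch - n_reconstruction + 1) / n_warmup
    return beta_target

# Placeholder settings: 10 reconstruction epochs, 20 warm-up epochs, target 0.5.
betas = [beta_schedule(e, 10, 20, 0.5) for e in range(40)]
```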
The model performance is first compared to some baseline auto-encoders in Table 1. To assess the generative qualities of the model, we provide audio samples of data reconstructions as well as examples of neural granular sound synthesis (https://anonymized124.github.io/neural_granular_synthesis/). These are generations based on its common processes as well as novel interactions enabled by our proposed neural architecture.
5.1 Baseline comparison
In the first place, the granular VAE could be implemented using a convolutional decoder that symmetrically reverts the latent mapping of the encoder we use. Strided down-sampling convolutions can be mirrored with transposed convolutions or with up-sampling followed by convolutions. We will refer to these baselines as and while our model with spectral filtering decoder is and with the added learnable post-processing is . We train these models on the Studio-On-Line dataset for the full orchestra in ordinario and the strings in all playing modes, as well as on the 8 Drums dataset, keeping all other hyper-parameters identical. We report their test set spectrogram reconstruction scores for the Root Mean Squared Error (RMSE) and Log-Spectral Distance (LSD), along with their average time per training iteration. Each model was trained for about 10 hours. Overall, our proposal outperforms the convolutional decoder baselines, while remaining fast to train and to generate from. The latency of our model to synthesize 1 second of audio is about 19.7 ms on GPU and 25.0 ms on CPU.
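For reference, the two reconstruction scores can be computed as in the sketch below. The exact definitions used for Table 1 are not given in the text, so this example assumes a common formulation: RMSE over spectrogram bins, and LSD as the mean over frames of the root mean squared difference between log-power spectra in decibels.

```python
import numpy as np

def rmse(spec, spec_hat):
    """Root mean squared error over all spectrogram bins."""
    return float(np.sqrt(np.mean((spec - spec_hat) ** 2)))

def log_spectral_distance(power, power_hat, eps=1e-10):
    """Mean over frames of the RMS log-power difference (in dB) across frequency."""
    diff_db = 10.0 * np.log10((power + eps) / (power_hat + eps))
    return float(np.mean(np.sqrt(np.mean(diff_db ** 2, axis=-1))))

power = np.ones((5, 9))  # toy power spectrogram: 5 frames, 9 frequency bins
lsd_same = log_spectral_distance(power, power)
lsd_gain = log_spectral_distance(power, 10.0 * power)  # uniform 10 dB offset
```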
5.2 Common granular synthesis processes
The audio quality of the models trained in different sound domains can be judged from data reconstructions, which give a sense of the model performance at auto-encoding various types of sounds. This extends to generating new sounds by sampling latent sequences rather than encoding features from input sounds. For structured one-shot samples, such as musical notes and drum hits, latent sequences are generated from the higher-level sequence embedding. For use in composition (e.g. MIDI score), this sampling can be done with conditioning over user classes such as pitch and target instrument (eq. 10). Since the VAE learns a continuously invertible grain space, it can also be explored with smooth interpolations that render free-synthesis trajectories. Multidimensional latent curves can be mapped to overlap-add grain sequences, including linear interpolations between random samples from the latent Gaussian prior, circular paths and spirals. When repeating forward and backward traversals of a linear interpolation or looping a circular curve, we can modulate the steps between latent points non-uniformly in order to bring additional expressivity to the synthesis. Free-synthesis can be performed at variable lengths (in multiples of the grain series length) by concatenating several contiguous latent paths.
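The free-synthesis trajectories mentioned above can be sketched as simple curves in the latent space. In this NumPy example, the latent dimensionality, the choice of plane for the circle and spiral, and the power-law warping used for non-uniform steps are assumptions introduced for illustration.

```python
import numpy as np

def linear_path(z_start, z_end, n_steps, warp=1.0):
    """Interpolate between two latent points; warp != 1 gives non-uniform steps."""
    t = np.linspace(0.0, 1.0, n_steps) ** warp
    return (1.0 - t)[:, None] * z_start + t[:, None] * z_end

def circular_path(center, radius, n_steps, dims=(0, 1)):
    """Closed circle in the plane spanned by two latent dimensions."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False)
    path = np.tile(np.asarray(center, dtype=float), (n_steps, 1))
    path[:, dims[0]] += radius * np.cos(angles)
    path[:, dims[1]] += radius * np.sin(angles)
    return path

def spiral_path(center, max_radius, n_steps, n_turns=2, dims=(0, 1)):
    """Spiral whose radius grows linearly from 0 to max_radius."""
    t = np.linspace(0.0, 1.0, n_steps)
    angles = 2.0 * np.pi * n_turns * t
    path = np.tile(np.asarray(center, dtype=float), (n_steps, 1))
    path[:, dims[0]] += max_radius * t * np.cos(angles)
    path[:, dims[1]] += max_radius * t * np.sin(angles)
    return path

rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal((2, 16))  # two samples from a unit Gaussian prior
line = linear_path(z_a, z_b, n_steps=64, warp=2.0)
circle = circular_path(np.zeros(16), radius=1.5, n_steps=64)
spiral = spiral_path(np.zeros(16), max_radius=1.5, n_steps=64)
```

Each row of a path is one latent point, so a path of 64 rows would be decoded into 64 grains before overlap-add.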
5.3 Audio style and temporal manipulations
To perform data-driven resynthesis, a target sample is analyzed by the encoder. Its corresponding latent features are then decoded, thus emulating the target sound in the style of the learned grain space. Conditioning over multiple timbres (e.g. instrument classes) allows for finer control over such audio transfer between multiple target styles. To perform resynthesis of audio samples longer than the grain series length $n$, we auto-encode several contiguous segments that are assembled with fade-out/fade-in overlaps. Since the model can also learn a continuous temporal embedding, by interpolating this higher-level space, we can generate successive latent series in the grain space that are decoded into signals with evolving temporal structures. We illustrate this feature in Figure 3.
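The assembly of contiguous decoded segments with fade-out/fade-in overlaps can be sketched as follows. The linear ramps and the fade length are assumptions made for this example; the text does not specify the shape of the fades.

```python
import numpy as np

def crossfade_concat(segments, fade_len):
    """Join segments, overlapping each boundary by fade_len samples."""
    fade_in = np.linspace(0.0, 1.0, fade_len)
    out = np.asarray(segments[0], dtype=float)
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        # The tail of the running output fades out while the next head fades in.
        overlap = out[-fade_len:] * (1.0 - fade_in) + seg[:fade_len] * fade_in
        out = np.concatenate([out[:-fade_len], overlap, seg[fade_len:]])
    return out

segments = [np.ones(1000) for _ in range(3)]  # stand-ins for decoded segments
assembled = crossfade_concat(segments, fade_len=100)
```

With complementary linear ramps, the two gains sum to one at every sample of the overlap, so a constant signal stays constant across segment boundaries.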
5.4 Real-time sound synthesis
With GPU support, for instance a sufficiently powerful dedicated laptop chip or external Thunderbolt hardware, the models can be run in real-time. In order to apply trained models to these different generative tasks, we are currently working on prototype interfaces based on a Python OSC (https://pypi.org/project/python-osc/) server controlled from a MaxMsp (https://cycling74.com) patch. One example is a neural drum machine 3 featuring a step-sequencer driving a model with sequential embedding and conditioning trained over the 8 Drums dataset classes.
We propose a novel method for raw waveform generation that integrates concepts from granular sound synthesis and digital signal processing into a Variational Auto-Encoder. It adapts to a variety of sound domains and supports neural audio modeling at multiple temporal scales. The architecture components are interpretable with respect to its spectral reconstruction power. Such a VAE addresses some limitations of traditional techniques by learning a continuously invertible grain latent space. Moreover, it enables multiple modes of generation derived from granular sound synthesis, as well as potential controls for composition purposes. By doing so, we hope to enrich the creative use of neural networks in the field of musical sound synthesis.