Sound textures represent a broad class of sounds that is often overlooked despite being omnipresent in our daily lives. The hubbub of a crowd or the noise of a busy highway can both be interpreted as sound textures, acting as a relatively uniform sonic background. The question of their definition was first thoroughly investigated in , and resulted in a rather strict definition that is summarized in  as “a superposition of small audio atoms overlapping randomly while following a higher level organization”. This definition is however narrow in that it completely excludes all textures containing salient events that are not part of an overarching organization. These events might be caused by a member of a crowd coughing loudly, or by a car honking, and are more often than not present in actual texture recordings. We therefore adopt a broader definition that tolerates such events as long as they are rare enough compared to the time scale of the texture organization.
Sound texture synthesis consists in creating a realistic-sounding texture, and is often based on re-synthesis: given an original texture recording, the goal is to create a texture whose style resembles that of the original, as if it had been recorded in the same conditions. Among existing synthesis methods, parametric methods represent a promising paradigm. This paradigm consists in extracting a set of parameters from the original texture, and then creating a sound that possesses the same parameters. The parameters are imposed onto a base sound using an iterative optimization process. If the parametrization is well chosen, the produced sound should have the same textural properties as the original without being a copy of it (which would defeat the purpose of the synthesis).
The synthesis method introduced in  by McDermott & Simoncelli is an example of such a synthesis, and uses as parameters a set of perceptually based statistics that aim at mimicking the processing of sound textures by the human auditory system. This method works convincingly well on a wide array of textures, but it does not correctly reproduce salient events and impact sounds (i.e. short-lived audio events that span most of the frequency axis and have a strong attack).
Using the similarities between visual textures and the time-frequency representation of sound textures, several methods have recently been developed by adapting a successful parametric synthesis method for visual textures presented in : in this method, the parameters are the cross-correlations between the feature maps of a trained 2D Convolutional Neural Network (CNN). The method presented in  by Ulyanov & Lebedev is such an adaptation, and uses the spectrogram of a sound as input to a 1D CNN, with the frequency dimension acting as input depth, to synthesize textures with moderate success. In , this adaptation is improved by Antognini & al. with the addition of several constraints aimed at better preserving rhythmic patterns and increasing the diversity of the results. Both this and the previous method eventually synthesize a spectrogram, which is then approximately inverted using the Griffin-Lim algorithm (introduced in ). Just like the method presented in , they also both have difficulty synthesizing impacts.
This difficulty encountered with impact synthesis is one of the main motivations of our work, combined with the aim of improving the overall realism of the results of parametric synthesis methods.
In the state-of-the-art parametric methods introduced by McDermott & Simoncelli in  and by Antognini & al. in , the presence of impacts in the original texture results in soft, watery artefacts in the synthesized texture. Our initial attempt at using a CNN-based parametrization for sound texture synthesis, presented in , also exhibits similar artefacts. Because fire textures mostly contain a low rumbling sound and crackling noises, the artefacts caused by the re-synthesis of these crackles are easily perceived in the synthesized texture: a comparison between the three aforementioned parametric methods and an original fire recording is available online (see http://recherche.ircam.fr/anasyn/caracalla/icassp20/fire.php).
Both our initial method and that of Antognini & al. use spectrograms as input to the CNN used for parametrization. Since both are inspired by the visual texture synthesis method of , the fact that the synthesized spectrograms are visually close to the original texture spectrograms implies that the flaws of both methods originate from an ill-adapted choice of sound representation. The most obvious downside of using spectrograms for sound synthesis lies in the fact that they completely disregard the phase of the signal. This phase is recovered using the Griffin-Lim algorithm in the case of Antognini & al., and implicitly created by performing the optimization directly on the time signal in the case of our initial method: however, neither method guarantees that the phases of the different frequency bins are correlated across the spectrum. Since this correlation is most important during sharp events, managing to re-create it should improve impact synthesis while potentially improving the overall realism of synthesized textures. The following section presents our sound texture synthesis method, which aims at rectifying this oversight.
In this first section, we detail the process of extracting a set of parameters from a sound signal.
Our aim is to use a time-frequency representation of sound that contains the phase information of the Short Time Fourier Transform (STFT) while using the paradigm of CNN-based parametric synthesis. Because phase and magnitude are strongly correlated, it is not conceivable to synthesize them separately. To bind the two, both 2D matrices can be used in the manner of color channels in the input to the CNN, as is done in . Given how much phase matrices resemble white noise images, we however prefer using a representation in which local correlations are more visible. We thus propose instead the use of the real and imaginary parts of the STFT, dubbed RI spectrograms in . This representation, while implicitly containing both the phase and magnitude of the STFT, has the advantage of being visibly locally correlated (as can be seen in Figure 1) while also resembling spectrograms enough for existing spectrogram-based texture synthesis methods to be adapted to it.
Following this reasoning, the representation that serves as input to the CNN consists of the compressed RI spectrograms, organized as color channels to form a 3D matrix. All sounds worked with and presented in this article are sampled at kHz and use a window length of samples with a hop-size of for the computation of the STFT. Given the STFT of a sound signal, we first normalize it by the maximum of its absolute value and then compute the RI representation as follows:
with the compressed real part of the STFT, its compressed imaginary part, and the sigmoid function. is a compression factor that we arbitrarily set to . Defined this way, both and are always comprised between and , while being centered around .
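As a rough illustration of the steps described above, the following Python sketch computes a compressed RI representation with NumPy. It is not the implementation used in this work: the Hann window, the window length (512), the hop size (256), the compression factor (`k = 10`) and the exact compression formula `sigmoid(k * x)` are all assumptions made for the example. Only the ingredients (normalization by the maximum absolute value, a sigmoid, a compression factor, two channels) come from the text.

```python
import numpy as np

def stft(signal, win_length=512, hop=256):
    """Hann-windowed STFT, returned as an array of shape (freq_bins, frames)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    frames = np.stack([signal[i * hop:i * hop + win_length] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ri_spectrogram(signal, k=10.0, win_length=512, hop=256):
    """Compressed real/imaginary channels stacked as (freq, time, 2).

    The formula sigmoid(k * x) and the value of k are assumptions.
    """
    spec = stft(signal, win_length, hop)
    spec = spec / np.max(np.abs(spec))       # normalize by the maximum magnitude
    real = sigmoid(k * spec.real)            # assumed compression of the real part
    imag = sigmoid(k * spec.imag)            # assumed compression of the imaginary part
    return np.stack([real, imag], axis=-1)   # the two parts act as color channels

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)               # one second of noise at an assumed 16 kHz
ri = ri_spectrogram(x)
print(ri.shape, float(ri.min()), float(ri.max()))
```

With this assumed formula both channels lie strictly between 0 and 1; an affine rescaling of the sigmoid would shift that range without changing the idea.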
2.2.2 Networks used
Similarly to Antognini & al. in , we use 8 distinct CNNs instead of one. As per  and our own findings presented in , given enough filters of various shapes the results obtained with trained and untrained CNNs are similar in quality. Each of the 8 CNNs comprises a single untrained convolutional layer, with filters of one unique size. These sizes are chosen following two criteria. Given that only patterns of a size similar to that of the filters can be described by the parametrization, the first is that the shapes of these filters need to match events commonly present in textures. Since using larger filters means that more of them have to be randomly drawn to get a representative sample of possible filters, the second is that filters should be as small as possible. As a consequence, we use either relatively small square filters or tall and thin filters aimed at describing impacts. The 8 filter shapes respectively used in the 8 CNNs are (101, 2), (53, 3), (11, 5), (3, 3), (5, 5), (11, 11), (19, 19), and (27, 27). All CNNs use a stride of (1, 1). No padding is applied, and all layers include a ReLU activation function. The weights of the filters are drawn from a uniform distribution between and , and no bias is applied.
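The layer described above can be sketched in a few lines of NumPy, which also makes the effect of "no padding, stride (1, 1)" on the feature map sizes explicit. This is only an illustration under stated assumptions: the number of filters (`n_filters = 4`) and the bound of the uniform distribution (`limit = 0.05`) are placeholders chosen for the example, and the filter shapes are read as (frequency, time), consistent with "tall and thin" filters spanning many frequency bins.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Filter shapes listed in the text, read here as (frequency, time).
FILTER_SHAPES = [(101, 2), (53, 3), (11, 5), (3, 3),
                 (5, 5), (11, 11), (19, 19), (27, 27)]

def random_conv_relu(x, filter_shape, n_filters, rng, limit=0.05):
    """Single untrained convolutional layer: no padding, stride 1, ReLU, no bias.

    x has shape (freq, time, channels); the result has shape
    (n_filters, freq - h + 1, time - w + 1).
    n_filters and limit are placeholder values, not taken from the text.
    """
    h, w = filter_shape
    channels = x.shape[-1]
    weights = rng.uniform(-limit, limit, size=(n_filters, h, w, channels))
    # windows has shape (freq - h + 1, time - w + 1, channels, h, w)
    windows = sliding_window_view(x, (h, w), axis=(0, 1))
    out = np.einsum('ftchw,nhwc->nft', windows, weights)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(128, 40, 2))   # toy two-channel input
output_shapes = {}
for shape in FILTER_SHAPES:
    features = random_conv_relu(x, shape, n_filters=4, rng=rng)
    output_shapes[shape] = features.shape
    print(shape, '->', features.shape)
```

Because no padding is used, each filter shrinks the feature maps by its own size minus one along both axes, so the 8 networks produce feature maps of different dimensions.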
We use the same parametrization as in , aimed at producing a time-invariant description of textures while not being frequency-invariant. The parameters, which are the cross-correlations between the feature maps of the same CNN, are stored inside a set of parameter matrices given by:
with the feature map at position of the th filter from the th network.
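One way to read "time-invariant but not frequency-invariant" is that pairs of feature maps are correlated along the time axis only, with one correlation matrix kept per frequency position. The sketch below illustrates that reading; the normalization by the number of frames and the layout of the output tensor are assumptions of the example, not a transcription of the formula used in this work.

```python
import numpy as np

def cross_correlations(features):
    """Cross-correlations between feature maps, summed over time only.

    features: (n_filters, freq, time) -> (freq, n_filters, n_filters).
    Dividing by the number of frames is an assumption of this sketch.
    """
    n_frames = features.shape[-1]
    return np.einsum('ift,jft->fij', features, features) / n_frames

rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(4, 30, 50))
params = cross_correlations(features)

# A circular shift along time leaves the parameters unchanged...
shifted_time = cross_correlations(np.roll(features, 7, axis=2))
# ...whereas a shift along frequency does not.
shifted_freq = cross_correlations(np.roll(features, 7, axis=1))
print(np.allclose(params, shifted_time), np.allclose(params, shifted_freq))
```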
Following the paradigm of parametric synthesis, a new sound possessing the same parameters as those extracted from the original texture is then created in order to obtain the new synthesized texture. This process is detailed in the following section.
2.3.1 Texture loss
In order to create a signal possessing the same parameters as the original texture, this process is interpreted as an optimization problem. A texture loss is defined so that it represents the distance between the parameters of a given sound and those of the original texture. A base sound is then iteratively optimized to minimize this loss. Similarly to , we use the following texture loss:
with the th parameter tensor with the base sound as input to the network while is the th parameter tensor with the original texture as input. denotes the Euclidean distance. Minimizing this loss is thus equivalent to imposing the parameters of the original texture onto the base sound.
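A minimal sketch of such a loss is given below: it sums the squared Euclidean distances between corresponding parameter tensors of the two sounds. The absence of any per-network weighting or normalization is an assumption of the example, and the function and variable names are hypothetical.

```python
import numpy as np

def texture_loss(params_base, params_original):
    """Sum over networks of the squared Euclidean distance between parameter tensors.

    Both arguments are lists with one parameter tensor per network.
    Any weighting or normalization of the terms is left out here (assumption).
    """
    return sum(float(np.sum((g - g_ref) ** 2))
               for g, g_ref in zip(params_base, params_original))

rng = np.random.default_rng(0)
params_original = [rng.standard_normal((30, 4, 4)) for _ in range(8)]
params_base = [g + 0.1 * rng.standard_normal(g.shape) for g in params_original]

loss_same = texture_loss(params_original, params_original)
loss_diff = texture_loss(params_base, params_original)
print(loss_same, loss_diff)
```

The loss is zero exactly when every parameter tensor of the base sound matches that of the original texture, which is what the iterative optimization drives towards.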
2.3.2 Parameters imposition
Given a synthesized RI spectrogram, it is possible to invert the corresponding STFT and produce a time signal with no need for a phase retrieval process. However, it is important to keep in mind that the complex matrix obtained by recombining the real and imaginary parts has no guarantee of being consistent. (Consistency in the sense of the STFT is the property of being the image by the STFT of a 1D signal. Not all complex matrices are consistent: inverting them is still possible, but the STFT of the resulting signal will not be identical to the initial complex matrix.) In order to avoid this issue, and using a method that can also be found in [3, 2, 12], we instead directly modify the base signal in the time domain. The resulting method is illustrated in Figure 2.
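The notion of consistency can be checked numerically. The sketch below, written with NumPy only, inverts a complex matrix with a least-squares overlap-add inverse STFT, recomputes the STFT of the result and measures how far it lands from the starting matrix. The window (Hann, 64 samples), hop size (32) and helper names are choices made for this example and are not taken from this work.

```python
import numpy as np

def stft(x, window, hop):
    n = len(window)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape (frames, bins)

def istft(spec, window, hop):
    """Least-squares inverse STFT by windowed overlap-add."""
    n = len(window)
    frames = np.fft.irfft(spec, n=n, axis=1) * window
    length = (len(frames) - 1) * hop + n
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n] += frame
        norm[i * hop:i * hop + n] += window ** 2
    return out / np.maximum(norm, 1e-12)

def inconsistency(spec, window, hop):
    """Relative distance between spec and the STFT of its inverse."""
    reprojected = stft(istft(spec, window, hop), window, hop)
    return np.linalg.norm(reprojected - spec) / np.linalg.norm(spec)

rng = np.random.default_rng(0)
window, hop = np.hanning(64), 32
signal = rng.standard_normal(672)

consistent = stft(signal, window, hop)            # image of a real 1D signal
arbitrary = (rng.standard_normal(consistent.shape)
             + 1j * rng.standard_normal(consistent.shape))

err_consistent = inconsistency(consistent, window, hop)
err_arbitrary = inconsistency(arbitrary, window, hop)
print(err_consistent, err_arbitrary)
```

The first matrix survives the round trip up to numerical precision, while the arbitrary one does not, which is why optimizing the time signal directly sidesteps the problem.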
In practice, we use the L-BFGS optimization algorithm and the TensorFlow library to perform the iterative imposition of the parameters. Synthesizing a 7-second texture takes 5000 optimization steps and roughly 7 minutes on a GeForce GTX 1080 Ti GPU.
Examples of sounds synthesized from a variety of original textures are available online (see http://recherche.ircam.fr/anasyn/caracalla/icassp20/results.php). As is clearly audible on the “fire” texture, our method succeeds in convincingly re-synthesizing impacts thanks to the combination of tall filters and the RI representation. In addition to this, it manages to synthesize noisy textures (such as the “bees” and “static” textures), pitched events (such as the “birds” texture) and salient, non-recurring events (such as the cutlery noises in “crowd”) in a very convincing way.
However, the “wind” texture shows a limit of this method: due to the size of the filters of the CNNs described in Section 2.2.2, the characteristic duration of the events it can reproduce is approximately 0.5 seconds. The “wind” texture, however, is the only one that contains an event (in this case, the howling) whose characteristic time is several seconds. As a result, the texture synthesized by our method does not manage to reproduce the slow evolution of the howling. This behavior could be changed by horizontally extending the filters of the CNNs, although doing so would mean reproducing longer patches of the original: we would thus risk creating a synthesized sound that resembles the original texture too closely.
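The order of magnitude of this limit can be recovered with a short calculation: a filter spanning a given number of STFT frames covers those frames' hops plus one analysis window. The values used below (16 kHz sample rate, 512-sample window, 256-sample hop) are assumptions made for the example, chosen because they are typical and lead to a figure compatible with the roughly 0.5 seconds quoted above.

```python
# Assumed analysis settings (not confirmed by the text).
SAMPLE_RATE = 16000   # Hz
WIN_LENGTH = 512      # samples
HOP = 256             # samples

# Time widths (in frames) of the 8 filter shapes listed in Section 2.2.2.
FILTER_WIDTHS = [2, 3, 5, 3, 5, 11, 19, 27]

def filter_time_span(width_frames):
    """Duration in seconds covered by a filter spanning width_frames STFT frames."""
    return ((width_frames - 1) * HOP + WIN_LENGTH) / SAMPLE_RATE

spans = [filter_time_span(w) for w in FILTER_WIDTHS]
longest = max(spans)
print(f"longest filter spans about {longest:.3f} s")
```

Under these assumptions the widest filter covers a little under half a second, far shorter than a howl lasting several seconds.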
3 Perceptual evaluation
In order to assess the realism of textures synthesized using our method, we performed an online perceptual evaluation comparing it to other state-of-the-art methods.
3.1 Experimental set-up
Our protocol was loosely based on MUSHRA (Multiple Stimuli with Hidden Reference and Anchor), defined in , in which an original sample is compared to several test samples. A hidden reference (a “perfect” sample) and an anchor (a “bad” sample) are also hidden among the test samples in order to act as references. Although the original texture could have been used as the hidden reference (referred to as hidden from now on), doing so would have blurred the distinction between identity and similarity: instead we decided to use its continuation as hidden reference. We created each anchor (anchor) by filtering white noise so that it had the same frequency spectrum as its corresponding original texture.
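A possible way to build such an anchor is sketched below: the magnitude spectrum of the original texture is combined with the phases of a white noise signal, which yields a noise whose long-term spectrum matches that of the original. This whole-signal FFT approach is an assumption of the example; the actual filtering procedure used for the evaluation is not detailed here, and `make_anchor` is a hypothetical helper.

```python
import numpy as np

def make_anchor(original, rng):
    """Noise with the magnitude spectrum of `original` and the phases of white noise."""
    noise = rng.standard_normal(len(original))
    target_magnitude = np.abs(np.fft.rfft(original))
    noise_phase = np.angle(np.fft.rfft(noise))
    shaped = target_magnitude * np.exp(1j * noise_phase)
    return np.fft.irfft(shaped, n=len(original))

rng = np.random.default_rng(0)
# Toy "texture": low-pass filtered noise, 1 second at an assumed 16 kHz.
original = np.convolve(rng.standard_normal(16000), np.ones(8) / 8, mode='same')
anchor = make_anchor(original, rng)

spectrum_error = np.max(np.abs(np.abs(np.fft.rfft(anchor))
                               - np.abs(np.fft.rfft(original))))
print(spectrum_error)
```

The anchor shares the spectral envelope of the original but none of its temporal structure, which is what makes it a deliberately poor test sample.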
With the kind consent of Dr. Joseph M. Antognini, we chose our original samples from his sound set, which is available online (see https://antognini-google.github.io/audio_textures/baselines.html). In addition to containing a wide array of textures, this set also contains the synthesized versions of each texture using the methods presented in  (antognini),  (ulyanov) and  (mcdermott). From this set we chose 10 original textures: this selection was made so as to cover as broad an array of textures as possible. We re-synthesized those textures using our RI-based method (RI), but also our previous spectrogram-based method presented in  (spec) for comparison’s sake.
All sounds were down-sampled when needed so that they all had a sample rate of 16 kHz. They were also cropped to a length of 4 seconds. From the 7-second long selected textures, the first 4 seconds were used as original textures for the test and the last 4 seconds as hidden references. Because uneven audio volumes may influence the perception of artefacts, we normalized all samples so that their energy (or variance) was identical. All sounds can be listened to online (see http://recherche.ircam.fr/anasyn/caracalla/icassp20/assets.php).
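The cropping and normalization steps described above can be sketched as follows. The target standard deviation (`target_std = 0.1`), the removal of the mean before scaling and the helper names are assumptions of the example; the text only states that all samples were given identical energy.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, as stated in the text

def split_texture(texture, sr=SAMPLE_RATE, duration_s=4.0):
    """Return the first and last `duration_s` seconds of a texture."""
    n = int(duration_s * sr)
    return texture[:n], texture[-n:]

def normalize_energy(x, target_std=0.1):
    """Scale a signal to a common variance (target value is an assumption)."""
    x = x - np.mean(x)
    return x * (target_std / np.std(x))

rng = np.random.default_rng(0)
texture = 3.0 * rng.standard_normal(7 * SAMPLE_RATE)   # stand-in for a 7-second recording

original, hidden = split_texture(texture)
original = normalize_energy(original)
hidden = normalize_energy(hidden)
print(len(original), len(hidden), np.std(original), np.std(hidden))
```

With 7-second recordings the two 4-second excerpts overlap by one second, so the hidden reference is a continuation of the original rather than a disjoint sample.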
For each of the selected textures, the participants were presented with the original sample followed by the 7 corresponding test samples (hidden and anchor, antognini, ulyanov, mcdermott, RI and spec). As in MUSHRA, they were asked to rate how similar each texture sounded to the original on a scale ranging from 0 (unrecognizable) to 100 (perfect): it was stressed in the instructions that the goal of these synthesis methods was not to reproduce an identical copy but a sample appearing to have been recorded moments later. The names and order of both textures and methods were anonymized so as to prevent the introduction of any bias.
A total of 64 valid and complete evaluations had been submitted at the time of writing. Due to the various rating strategies adopted by participants, we find that the rankings of the different methods for each texture are more telling and more stable than grades: as such, we use those as our metric. The rankings range from 1 (preferred) to 8 (rejected), and an average ranking is given when several methods have the same grade.
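The conversion from grades to rankings with averaged ties can be illustrated with a short helper. The function name and the grades in the example are made up for illustration and are not data from the evaluation.

```python
def grades_to_rankings(grades):
    """Map each method to its rank (1 = highest grade), averaging ranks over ties.

    grades: dict mapping a method name to a grade between 0 and 100.
    """
    ordered = sorted(grades, key=grades.get, reverse=True)
    rankings = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend the group while the next method has the same grade.
        while j + 1 < len(ordered) and grades[ordered[j + 1]] == grades[ordered[i]]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for method in ordered[i:j + 1]:
            rankings[method] = average_rank
        i = j + 1
    return rankings

# Hypothetical grades given by one participant for one texture.
example_grades = {"hidden": 90, "RI": 90, "spec": 40, "anchor": 10}
example_rankings = grades_to_rankings(example_grades)
print(example_rankings)
```

Here the two methods tied for first place share the average of ranks 1 and 2, which keeps the sum of ranks constant across participants regardless of how many ties they produce.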
The mean rankings for each texture are displayed in Table 1, while the global rankings of the different methods on all textures are shown in Figure 3. We use box plots as a way to display data without making any assumption regarding its statistical distribution.
The high rankings of the hidden reference, reflected in its mean ranking across all textures, are encouraging as they show that participants correctly understood the task given to them. The fact that the rankings of RI are close to those of the hidden reference is an extremely positive result regarding the realism of our method. The difference between these rankings and those of spec is also concrete evidence of the improvements brought by the changes in time-frequency representation and CNN architecture. Setting the hidden reference aside, our RI method ranks first on all textures except “birds”, for which it ranks slightly behind antognini, and “wind”, for which it ranks behind mcdermott and anchor. This behavior was expected given the discussion of our results presented in Section 2.4, although the fact that a simple filtered white noise is ranked this high compared to state-of-the-art methods was rather unexpected.
These results confirm that, in addition to succeeding in the synthesis of impacts (present in “fire” and “applause”), our method also manages to synthesize monotonous textures (such as “bees” or “static”) and textures that present salient events (such as “crowd”) with a realism that is comparable to that of an actual recording and surpasses current state-of-the-art methods in parametric sound texture synthesis.
We have demonstrated that the use of a more fitting time-frequency representation, coupled with a careful choice of filter shapes, allows for a more convincing CNN-based parametric sound texture synthesis. This improvement has been further assessed by an online perceptual evaluation which compared original texture samples with samples synthesized using both our algorithm and several state-of-the-art parametric synthesis methods: its results have been unequivocally in favor of our algorithm, showing that our samples were rated similarly to original texture samples. Despite giving less convincing results on a slowly evolving texture, we are confident that the architecture of the CNN used in our method can be investigated further to also work with this kind of texture.
Overall, this shows that our parametrization is suited to the description and synthesis of sound textures. From there, further investigations might be performed so as to test the influence of the manipulation of these parameters on the audio signal: attempts at texture control or at (textural) style transfer could for instance be made.
-  (2019) Audio texture synthesis with random neural networks: improving diversity and quality. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cited by: §1, §2.1, §2.2.2, §3.1.
-  (2018) Style transfer for musical audio using multiple time-frequency representations. Note: Available at https://openreview.net/forum?id=BybQ7zWCb Cited by: §2.3.2.
-  (2019) Sound texture synthesis using convolutional neural networks. In Digital Audio Effects (DAFx). Cited by: §2.1, §2.2.2, §2.2.3, §2.3.2, §3.1.
-  (2019) GANSynth: adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. Cited by: §2.2.1.
- Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP). Cited by: §2.2.1.
-  (2015) Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems. Cited by: §1, §2.1.
- Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing. Cited by: §1.
-  (2003) Method for the subjective assessment of intermediate quality levels of coding systems (ITU-R BS.1534-1). Cited by: §3.1.
-  (2011) Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron. Cited by: §1, §1, §2.1, §3.1.
-  (1995) Classification of sound textures. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §1.
-  (2011) State of the art in sound texture synthesis. In Digital Audio Effects (DAFx). Cited by: §1.
-  (2018) Audio style transfer with rhythmic constraints. In Digital Audio Effects (DAFx). Cited by: §2.3.2.
-  (2016) Audio texture synthesis and style transfer. Note: Available at https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/ Cited by: §1, §3.1.
-  (2016) Texture synthesis using shallow convolutional networks with random filters. arXiv preprint arXiv:1606.00021. Cited by: §2.2.2, §2.3.1.