
Adversarial Guitar Amplifier Modelling With Unpaired Data

11/02/2022
by Alec Wright, et al.

We propose an audio effects processing framework that learns to emulate a target electric guitar tone from a recording. We train a deep neural network using an adversarial approach, with the goal of transforming the timbre of a guitar into the timbre of another guitar after audio effects processing has been applied, for example, by a guitar amplifier. The model training requires no paired data, and the resulting model emulates the target timbre well whilst being capable of real-time processing on a modern personal computer. To verify our approach, we present two experiments: one which carries out unpaired training using paired data, allowing us to monitor training via objective metrics, and another that uses fully unpaired data, corresponding to a realistic scenario where a user wants to emulate a guitar timbre using only audio data from a recording. Our listening test results confirm that the models are perceptually convincing.


1 Introduction

When recording musical instruments for contemporary music production, the audio effects applied are an essential component of the resulting timbre. This is especially true of the electric guitar, where the timbral characteristics, or, as guitarists often call it, the tone, imparted by certain vacuum tube amplifiers are highly sought after. Digitally emulating analog audio devices such as guitar amplifiers is known as Virtual Analog (VA) modelling [1].

A related but more challenging problem is to emulate the timbre of an electric guitar directly from a recording. This corresponds to a practical and very exciting task: how to imitate the guitar tone heard on a commercial music recording using your own electric guitar?

The usual paradigm of VA modelling is the emulation of specific analog devices; however, this is insufficient for our purpose for two reasons. Firstly, the specific instrument and devices used to craft the guitar tone on a famous recording might not be known. Even if, for example, the exact amplifier used to make a recording is available, recreating the target timbre is no simple task, as it would require unearthing the combination of parameter settings and the recording setup. Secondly, the desired timbre includes both the musical instrument itself and any processing applied to it. This means that any emulation of a recorded guitar tone must account for differences between the timbre of your guitar and the timbre of the target guitar, before effects processing is applied.

The problem thus corresponds more closely to that of timbre transfer, but in the specific case where the input and target instrument are both guitars. The problem can be divided into two subproblems: the linear and the nonlinear modification. The linear problem is related to the difference of the spectral shapes of the two instruments. A similar problem has been solved earlier using a high-order linear filter, either to enhance the string instrument timbre taken from a piezo pickup [2, 3], or to process an electric guitar such that it sounds like an acoustic one [4]. In these cases, the filter behaves as an equalizer (EQ).

The nonlinear problem involves the emulation of the devices, other than the instrument itself, used to create the recording, such as the guitar amplifier, speaker cabinet, or other effects such as compression or distortion. Modelling these types of devices has been widely studied, and broadly falls into two approaches, “white-box”, where the physical properties of the system are studied to produce equations describing its behaviour [5, 6, 7], or “black-box”, where data collected from the system is used to fit a more generic model [8, 9, 10]. Recently, deep learning methods based on neural networks have become a popular choice for this purpose [11]. These include encoder-decoder models aimed at a wide range of effects [12], feedforward [13], and recurrent [14] models for guitar amplifiers and time-varying effects [15], as well as models based on traditional DSP components such as infinite impulse response filters [16, 17].

Whilst the aforementioned approaches are all viable for the task of imitating a target electric guitar timbre, they share a major limitation: they require us to know which devices were used during recording, or to have access to paired data containing both the target timbre and the unprocessed signal taken directly from the guitar used during the recording. When paired training data is unavailable or impossible to collect, one solution is to formulate the problem as domain-to-domain transfer. Great success has been achieved in Image Style Transfer, where the style of an input image is transformed to match a target style whilst retaining the input image content [18, 19].

A related problem in audio is Audio Style Transfer, which seeks to transfer the style of some input audio to a target style, whilst retaining the content of the input audio [20]. In the speech domain, an example of this is voice conversion, where an utterance spoken by one speaker is processed such that it sounds like it was spoken by a different speaker. This can be achieved, for example, using non-parallel speech data to train a Generative Adversarial Network (GAN) with a cycle-consistency loss [21]. For musical applications, the problem is more often thought of as altering the timbre of some input audio, to match a target timbre. Hence, in the musical domain style transfer is more frequently referred to as timbre transfer [22, 23] or tone transfer [24]. This can be achieved by applying image style transfer techniques to a time-frequency representation of the input audio, and then re-synthesising the audio in the waveform domain [22], or by learning a high level representation of the input and using it as input to a synthesizer [25]. Recent work has also proposed style transfer of audio effects, using a shared-weight encoder to impart the production style of one recording onto another [26].

Figure 1: (a) Supervised black-box modelling is based on paired audio data, where the target audio is obtained by processing the input audio with the target device. When paired data is unavailable, we propose to use (b) unpaired data, made up of examples of a source timbre and examples of a target timbre, where neither the content nor the timbre of the source examples matches those of the target examples.

In this paper, we propose to address this problem using an unsupervised approach, in which a black-box feedforward convolutional model is trained in an adversarial manner using unpaired data. The distinction between the supervised and unsupervised modelling task is shown in Fig. 1. We conceptualise this as a form of domain transfer, and focus on the case of processing a signal recorded directly from an electric guitar, or input timbre, with the objective of matching the timbre of a guitar from a recording, or target timbre.

The rest of this paper is structured as follows. Sec. 2 discusses the modelling approach, which uses a GAN, as well as the training objective and data. Sec. 3 describes two experiments conducted to validate the proposed approach, and Sec. 4 concludes the paper.

2 Method

The proposed method uses an adversarial training approach. We define domains in the space of guitar timbre, where a domain is defined by the sounds produced by a particular combination of a guitar, pickups, amplifier, and any other audio effects. Note that the domain is defined purely by the guitar timbre, and not by the content of the guitar playing. If we define two domains, $A$ and $B$, and we sample audio data $x_A$, where $x_A \sim p_A(x)$, and $x_B$, where $x_B \sim p_B(x)$, our objective is to learn a mapping $G$ that converts audio from domain $A$ to domain $B$:

$\hat{x}_B = G(x_A).$    (1)

The ultimate objective is to transform the tone such that $\hat{x}_B$ is perceptually indistinguishable from the target-domain audio $x_B$. To train $G$, we use a discriminator model $D$ trained to identify examples from the target timbre domain.

Previous work on adversarial domain-to-domain translation [19, 21] has applied a cycle-consistency criterion to avoid a specific form of mode collapse, which results in simply ignoring the content of the input and synthesising unrelated content from the target domain. However, in our present experiments the generator model $G$ did not exhibit this problem, likely due to the constrained expressive capability of the feedforward convolutional neural network used, which learns to directly apply a transformation to the input signal in the time domain.

2.1 Generator

The generator model used in this work is a feedforward variant of the WaveNet architecture [27]. A non-causal feedforward WaveNet variant was first proposed for speech denoising [28]. A causal version was later applied to guitar amplifier modelling [13], and this architecture is used as the generator throughout this work. The model has two main components, a stack of dilated causal convolutional layers, and a linear post-processor. The post-processor is a fully connected layer that takes the outputs of each of the convolutional layers as input.

All generator models used in this work consist of two stacks of nine dilated convolutions, with each stack starting at a dilation of one and increasing by a factor of two with each subsequent convolutional layer. Each convolutional layer has a kernel size of three and uses the same gated activation function as the original WaveNet [27]. The receptive field of this model is 2045 samples, or about 46.4 ms at a 44.1-kHz sample rate. It should also be noted that previous work has shown that a C++ implementation of this model is capable of running in real-time on a modern desktop computer [29].
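To make the generator architecture concrete, the following is a minimal PyTorch sketch of a feedforward WaveNet of this type: two stacks of nine dilated causal convolutions with kernel size three, gated activations, and a linear post-processor that mixes the outputs of all convolutional layers. The channel width (16 here) is a placeholder assumption, as it is not specified in the text above; this is an illustrative sketch, not the authors' exact implementation.

```python
# Sketch of a feedforward (non-autoregressive) WaveNet-style generator.
# Hyperparameters follow the description above; channel width is assumed.
import torch
import torch.nn as nn


class GatedDilatedConv(nn.Module):
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad only, for causality
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):
        y = self.conv(nn.functional.pad(x, (self.pad, 0)))
        a, b = y.chunk(2, dim=1)
        z = torch.tanh(a) * torch.sigmoid(b)   # WaveNet gated activation
        return x + z, z                        # residual output, skip output


class FeedforwardWaveNet(nn.Module):
    def __init__(self, channels=16, stacks=2, layers_per_stack=9):
        super().__init__()
        self.input = nn.Conv1d(1, channels, 1)
        dilations = [2 ** i for i in range(layers_per_stack)] * stacks
        self.layers = nn.ModuleList(GatedDilatedConv(channels, d) for d in dilations)
        # linear "post-processor": a 1x1 conv over the concatenated layer outputs
        self.output = nn.Conv1d(channels * len(dilations), 1, 1)

    def forward(self, x):                      # x: (batch, 1, samples)
        h = self.input(x)
        skips = []
        for layer in self.layers:
            h, skip = layer(h)
            skips.append(skip)
        return self.output(torch.cat(skips, dim=1))


g = FeedforwardWaveNet()
print(g(torch.randn(1, 1, 44100)).shape)       # torch.Size([1, 1, 44100])
```

With kernel size three, dilations 1 to 256 repeated over two stacks give a receptive field of 1 + 2 x 2 x (1 + 2 + ... + 256) = 2045 samples, matching the figure quoted above.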

2.2 Discriminator

The input to the proposed discriminator is a time-frequency representation of the audio, as proposed in [30]. The discriminator consists of a stack of 1D convolutional layers, with the frequency bins of the time-frequency representation provided as channels to the first layer. Subsequent layers use grouped convolutions to reduce computational cost, and the hyperparameters for each layer are shown in Table 1. All layers use weight normalization, and all layers except the final output layer are followed by a Leaky ReLU activation function with a negative slope of 0.2. Four different time-frequency representations were trialled: a magnitude spectrogram, a magnitude mel-spectrogram, a log magnitude spectrogram, and a log magnitude mel-spectrogram. For all mel-spectrograms, the number of mel bands was 160, the maximum frequency was set to the Nyquist frequency, and the minimum frequency was set to 0 Hz.

Additionally, a multi-scale version of the spectral domain discriminator was trialled, consisting of three sub-discriminators, each operating on a time-frequency representation obtained using a different window size. The single spectral-domain discriminator used one window size, whereas the multi-scale spectral discriminator used three different window sizes, each with a corresponding hop size.

Layer #        1    2    3     4     5     6     7
Kernel Size    10   21   21    21    21    5     3
Out Channels   32   128  512   1024  1024  1024  1
Groups         1    8    32    64    64    1     1

Table 1: Convolutional layer parameters for the proposed spectral domain discriminator.
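As an illustration, the following is a hedged PyTorch/torchaudio sketch of a single spectral-domain discriminator built from the layer parameters in Table 1, with weight normalization and Leaky ReLU (slope 0.2) after all but the last layer. The STFT window and hop sizes are placeholder assumptions, since the exact values are not legible in this copy of the text; a multi-scale variant would simply run three such discriminators with different window sizes.

```python
# Sketch of the spectral-domain discriminator; layer widths from Table 1,
# STFT settings assumed. Input: raw audio, output: per-frame realness scores.
import torch
import torch.nn as nn
import torchaudio


class SpectralDiscriminator(nn.Module):
    def __init__(self, n_fft=2048, hop=512, n_mels=160, sample_rate=44100, log=True):
        super().__init__()
        self.log = log
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop,
            n_mels=n_mels, f_min=0.0, f_max=sample_rate / 2)
        kernels  = [10, 21, 21, 21, 21, 5, 3]
        channels = [32, 128, 512, 1024, 1024, 1024, 1]
        groups   = [1, 8, 32, 64, 64, 1, 1]
        layers, in_ch = [], n_mels            # frequency bins enter as channels
        for i, (k, c, g) in enumerate(zip(kernels, channels, groups)):
            conv = nn.utils.weight_norm(nn.Conv1d(in_ch, c, k, groups=g, padding=k // 2))
            layers.append(conv)
            if i < len(kernels) - 1:          # no activation after the output layer
                layers.append(nn.LeakyReLU(0.2))
            in_ch = c
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, samples)
        spec = self.melspec(x)                # (batch, n_mels, frames)
        if self.log:
            spec = torch.log(spec + 1e-5)
        return self.net(spec)                 # (batch, 1, frames)
```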

2.3 Training Objective

The generator and discriminator of the GAN were trained using the hinge loss objective [31], identical to that used in MelGAN [32], as follows:

$\mathcal{L}_G = -\,\mathbb{E}_{x_A}\!\left[ D\!\left( G(x_A) \right) \right]$    (2)

and

$\mathcal{L}_D = \mathbb{E}_{x_B}\!\left[ \max\!\left(0,\, 1 - D(x_B)\right) \right] + \mathbb{E}_{x_A}\!\left[ \max\!\left(0,\, 1 + D\!\left( G(x_A) \right)\right) \right],$    (3)

respectively, where $x_A$ is a guitar audio waveform used as input to the generator and $x_B$ is a guitar audio waveform taken from the target set. The training scheme is shown in Fig. 2. During training, both the input and target guitar datasets are split into two-second segments before processing by $G$ or $D$. The models were trained using the Adam optimizer [33] with a batch size of 5.
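A minimal sketch of the hinge-loss objectives in Eqs. (2) and (3), assuming PyTorch; the names d_real and d_fake are ours and stand for discriminator outputs on target-domain audio x_B and on generated audio G(x_A), respectively.

```python
# Hinge-loss GAN objectives as in Eqs. (2)-(3); assumes PyTorch tensors.
import torch


def generator_loss(d_fake):
    # Eq. (2): the generator tries to maximise the discriminator score on G(x_A)
    return -d_fake.mean()


def discriminator_loss(d_real, d_fake):
    # Eq. (3): hinge loss pushes real scores above +1 and fake scores below -1
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
```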

Figure 2: Training setup for (a) the Generator and (b) the Discriminator. Generator inputs, taken from the input domain $A$, are processed to emulate the timbre, but not the content, of the target domain $B$.

2.4 Data

The data used throughout this work was taken from a guitar dataset (www.idmt.fraunhofer.de/en/business_units/m2d/smt/guitar.html) originally proposed for the task of automatic transcription [34]. We use the fourth subset of the dataset, which consists of 64 short musical pieces played at fast and slow tempi. The pieces were recorded on two different electric guitars, a Career SG and an Ibanez 2820. We processed the recordings to remove leading and trailing silence, as well as the two-bar count-in at the beginning of each piece. Additionally, clipping was noted in some of the samples in the Career SG dataset, so examples with excessive clipping were removed.

After the pre-processing, there was approximately 40 min of audio from the Ibanez 2820 guitar and 30 min from the Career SG. To create the datasets used during our experiments, the guitar audio was processed by a guitar amplifier plugin. To test the robustness of our modelling approach, a separate dataset was created for each of three different plugin settings, hereafter referred to as ‘Clean’, ‘Light Distortion’, and ‘Heavy Distortion’. The amount of harmonic distortion introduced increases from a relatively small amount in the ‘Clean’ setting to an extremely distorted guitar tone in the ‘Heavy Distortion’ case.

3 Experiments

Our proposed modelling approach was tested on two different problems. In the first scenario, which is a toy problem, the signal contained in both the input and target datasets is recorded from the same guitar. In this case, the specific instrument in each dataset is identical, and the modelling task is to recreate the effects processing applied to the input signal. Although not realistic, this scenario is relevant because it allows supervised metrics to be used for evaluating the unsupervised training method.

In the second (more realistic) scenario, the input dataset is the unprocessed audio recorded from one guitar, and the target dataset is audio that has been recorded from a different guitar, and has audio effects processing applied to it. In this case, the modelling task implicitly includes transforming the tone of one guitar into another, as well as recreating the effects processing applied to that guitar. The two experiments are depicted in Fig. 3. As a baseline, the time-domain discriminator from MelGAN [32] was used during both experiments.

Figure 3: (a) In the Single Guitar experiment, the input and output audio are produced using the same guitar, but with unpaired data, whereas (b) in the Mismatched Guitar experiment, the input and output audio are generated with different guitars.

3.1 Experiment 1: Single Guitar

The dataset is split to ensure that the input and target datasets do not contain any of the same guitar content. To achieve this, the dataset is divided into two-second segments, with subsequent segments being sent alternately to either the input or target training dataset. This ensures that the spectral content of the unprocessed guitar in both datasets is similar, but that the actual content is different.
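A small sketch of this alternating split, assuming the recordings are loaded as mono NumPy arrays at 44.1 kHz; the segment routing follows the description above, while the array handling details are illustrative.

```python
# Alternate consecutive two-second segments between the input and target sets,
# so both sets share spectral characteristics but not musical content.
import numpy as np


def alternating_split(audio, sample_rate=44100, segment_seconds=2.0):
    seg_len = int(segment_seconds * sample_rate)
    n_segs = len(audio) // seg_len
    segments = audio[: n_segs * seg_len].reshape(n_segs, seg_len)
    input_set  = segments[0::2]   # even-indexed segments -> input training set
    target_set = segments[1::2]   # odd-indexed segments -> target training set
    return input_set, target_set
```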

As the guitar is the same in both datasets in this case, as shown in Fig. 3(a), a ground truth, or reference, for how the input guitar should sound after the effects processing is applied is available. This allows a validation loss to be calculated over a held-out validation set that consists of paired input/output guitar audio. It also allows a baseline supervised model to be trained.

For validation loss metrics, we use both the linear and log-scaled multi-scale magnitude spectrogram loss described in [25], which we refer to as the MSS and log-MSS losses, respectively. In addition, we also present the L1 distance between the output and target mel magnitude spectrograms, again using both linear and log scaling, which we refer to as the Mel and log-Mel losses, respectively.
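For illustration, a hedged PyTorch/torchaudio sketch of these metrics is given below; the FFT sizes, mel settings, and log epsilon are assumptions rather than the exact configuration of [25] or of this paper.

```python
# Multi-scale magnitude spectrogram loss and mel-spectrogram L1 distance,
# with optional log scaling. Settings here are illustrative assumptions.
import torch
import torchaudio


def multiscale_spectrogram_loss(pred, target,
                                fft_sizes=(2048, 1024, 512, 256, 128, 64),
                                log=False):
    loss = 0.0
    for n_fft in fft_sizes:
        spec = torchaudio.transforms.Spectrogram(n_fft=n_fft,
                                                 hop_length=n_fft // 4, power=1)
        s_pred, s_target = spec(pred), spec(target)
        if log:
            s_pred = torch.log(s_pred + 1e-5)
            s_target = torch.log(s_target + 1e-5)
        loss = loss + torch.mean(torch.abs(s_pred - s_target))
    return loss


def mel_l1(pred, target, sample_rate=44100, n_fft=2048, hop=512,
           n_mels=160, log=False):
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                               n_fft=n_fft, hop_length=hop,
                                               n_mels=n_mels)
    m_pred, m_target = mel(pred), mel(target)
    if log:
        m_pred, m_target = torch.log(m_pred + 1e-5), torch.log(m_target + 1e-5)
    return torch.mean(torch.abs(m_pred - m_target))
```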

Additionally, a MUSHRA-style [35] listening test was carried out. Participants were presented with audio clips that were processed by the target plugin, as well as by the various neural network models. An anchor, created by processing the input with a tanh nonlinearity, was also included, as well as a hidden reference. Participants were asked to rate each test condition out of 100, based on perceived similarity to the reference. Twelve subjects took the listening test, and four were removed in post-screening because they rated the hidden reference below 90 in a substantial proportion of the trials.

The supervised baseline used the same generator model, but trained in a supervised manner using the error-to-signal ratio (ESR) loss function with high-pass filter pre-emphasis [13]. As an unsupervised baseline, the generator model was trained using the MelGAN discriminator [32]. Then, spectral domain discriminators were trained, with the various configurations introduced in Sec. 2.

The results are shown in Table 2. In each case, the training was run for 400k iterations, and the validation loss was used to select the best-performing model. For the spectral domain discriminators, the validation loss used to select the best-performing model was chosen to match the form of the input provided to the discriminator; for example, if the input to the discriminator was a log-mel spectrogram, then the training iteration with the lowest log-Mel validation loss was selected as the best-performing model. In each case, our experiments included both multi-scale and single-scale spectral domain discriminators; however, for brevity, only the results for the better performing of the two are included in Table 2.

The objective results for Experiment 1 (Table 2) show that the WaveNet model trained in a supervised fashion performs better than the unsupervised models on all the proposed metrics, across all the chosen targets. Generally, of the unsupervised models, those trained with the MelGAN discriminator tend to perform better on the objective loss metrics. However, the listening test results indicate that there is no clear best-performing model between the supervised and unsupervised training approaches, except in the case of spectral domain discriminators that receive linear-scaled spectrograms as input, which clearly underperform. For all target tones attempted, at least one of the unsupervised models achieves a score of 80 or higher, indicating a perceptual match somewhere between Good and Excellent on the MUSHRA scale.

Model          # Disc.   MSS    log-MSS   Mel    log-Mel   ESR     Listening Test

Target Tone: Clean
Supervised        –      5.12    0.76     0.57    0.12     0.003   81 ± 4.1
MelGAN            –      37.5    1.47     2.75    0.17     2.38    71 ± 4.8
Spectral domain discriminator, by input:
  Spect.          1      39.2    3.27     3.39    0.39     2.55    32 ± 4.7
  Mel             1      40.0    1.51     2.88    0.28     1.27    46 ± 4.4
  Log Spect.      3      44.1    0.81     3.76    0.18     2.71    82 ± 4.5
  Log Mel         3      46.9    0.93     4.07    0.19     1.04    83 ± 3.9

Target Tone: Light Distortion
Supervised        –      2.57    0.81     0.28    0.09     0.001   93 ± 3.0
MelGAN            –      25.2    2.18     1.32    0.18     2.51    73 ± 5.4
Spectral domain discriminator, by input:
  Spect.          1      32.5    4.26     2.39    0.45     1.49    35 ± 4.0
  Mel             1      34.4    4.12     2.57    0.48     2.43    34 ± 4.0
  Log Spect.      1      45.3    1.11     4.51    0.23     2.18    81 ± 4.8
  Log Mel         3      38.1    1.17     3.36    0.21     2.50    88.7 ± 3.9

Target Tone: Heavy Distortion
Supervised        –      6.33    2.53     0.60    0.19     0.03    57 ± 4.6
MelGAN            –      22.4    2.49     1.81    0.22     2.04    92 ± 2.8
Spectral domain discriminator, by input:
  Spect.          1      28.9    4.14     2.70    0.37     2.33    54 ± 5.7
  Mel             1      25.5    7.15     2.36    0.60     0.86    28 ± 3.4
  Log Spect.      1      32.1    2.52     3.25    0.29     3.17    81 ± 4.8
  Log Mel         3      24.5    2.55     2.21    0.23     2.37    85 ± 3.8

Table 2: Objective and subjective results for the Single Guitar experiment. The validation loss columns show the MSS, log-MSS, Mel, and log-Mel metrics defined above, together with the ESR; the Listening Test column shows the mean MUSHRA score with its confidence interval. For the validation losses, bold indicates the best-performing unsupervised model; for the listening test, bold indicates the best-performing of all models.

3.2 Experiment 2: Mismatched Guitar

For the scenario in Experiment 2, the guitar used to create each dataset is different, as shown in Fig. 3(b). This means that the objective metrics listed in Table 2 are unavailable. We therefore conducted a listening test, in which participants were presented with a reference consisting of a few seconds of guitar playing from the target timbre domain. The participants were then asked to rate a number of test conditions, which consisted of the next few seconds of the same piece of music, but performed on the guitar from the input timbre domain. The test conditions were all processed versions of the same guitar audio. It was impossible to include a hidden reference in the test, as no such reference exists.

Two baselines were created, both having access to some ground-truth information. The first baseline was created by processing the input guitar with the same effects plugin that was used to create the target guitar timbre; this baseline is referred to as the “plugin-only” timbre. It corresponds to the simplified solution in which the linear timbre transfer is not included, but the nonlinear mapping is perfect. The second baseline was created by applying linear EQ matching to the unprocessed input guitar tone, with the EQ target being the target guitar before effects processing was applied. This EQ-matched version of the input guitar was then processed by the effects plugin used to create the reference timbre. This second baseline is referred to as the “EQ+plugin” tone. Notice that this processing is impossible to achieve in a practical setting, but it is used here in lieu of an ideal reference. A low-quality anchor was also included in the listening test, which consisted of the input guitar processed by a tanh nonlinearity.
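As a rough illustration of the linear EQ-matching step in the “EQ+plugin” baseline, the sketch below estimates the long-term average spectra of the two unprocessed guitars and designs a linear-phase FIR matching filter from their ratio, using SciPy; the filter length, spectrum estimation, and overall procedure are assumptions, not the exact method used in the paper.

```python
# Illustrative EQ matching: shape the input guitar's average spectrum
# towards that of the target guitar with a linear-phase FIR filter.
import numpy as np
from scipy import signal


def eq_match(input_audio, target_audio, sample_rate=44100, n_fft=4096, n_taps=1025):
    # Long-term average power spectra (Welch averaging of frames)
    f, p_in = signal.welch(input_audio, fs=sample_rate, nperseg=n_fft)
    _, p_tg = signal.welch(target_audio, fs=sample_rate, nperseg=n_fft)
    gain = np.sqrt(p_tg / (p_in + 1e-12))          # per-bin magnitude correction
    fir = signal.firwin2(n_taps, f, gain, fs=sample_rate)  # matching filter
    return signal.lfilter(fir, [1.0], input_audio)
```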

The test conditions consisted of three unsupervised models, trained using either the MelGAN discriminator or the spectral domain discriminator with log spectrogram or log mel-spectrogram input. For both spectral discriminators, the multi-scale version was used. The models were trained for 250k iterations.

The results of the listening test are shown in Fig. 4. The results indicate that the unsupervised models are competitive with our proposed baselines. For the ‘Clean’ and ‘Light Distortion’ cases, the MelGAN model performs poorly. One possible explanation is that, during training in the first experiment, the MelGAN-trained model exhibited some oscillation and instability as training progressed, whereas the spectral discriminator models tended to plateau quickly and then remain stable. As no validation loss was available to monitor training in this mismatched guitar case, it was not possible to select the model parameters that produced the lowest validation loss once training was stopped.

Figure 4: MUSHRA scores with 95% confidence intervals for the (a) Clean, (b) Light Distortion and (c) Heavy Distortion guitar tone settings that were modelled in Experiment 2.

4 Conclusion

This work shows for the first time how the guitar timbre heard on a music recording can be imitated by another guitar, using an unsupervised method based on a GAN framework. We formulated the problem as domain transfer and proposed a spectral domain discriminator. We validated our method through two listening tests and showed that the resulting models are perceptually convincing. Audio samples are available at our demonstration page: https://ljuvela.github.io/adversarial-amp-modeling-demo/.

References

  • [1] V. Välimäki, S. Bilbao, J. O. Smith, J. S. Abel, J. Pakarinen, and D. Berners, “Virtual analog effects,” in DAFX: Digital Audio Effects, U. Zölzer, Ed., pp. 473–522. Wiley, Chichester, UK, second edition, 2011.
  • [2] M. Karjalainen, V. Välimäki, H. Penttinen, and H. Saastamoinen, “DSP equalization of electret film pickup for the acoustic guitar,” J. Audio Eng. Soc., vol. 48, no. 12, pp. 1183–1193, Dec. 2000.
  • [3] M. Rau, J. S. Abel, and J. O. Smith III, “Contact sensor processing for acoustic instrument recording using a modal architecture,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Aveiro, Portugal, Sep. 2018, pp. 304–308.
  • [4] M. Karjalainen, H. Penttinen, and V. Välimäki, “Acoustic sound from the electric guitar using DSP techniques,” in Proc. IEEE ICASSP, Istanbul, Turkey, June 2000, pp. 773–776.
  • [5] M. Karjalainen and J. Pakarinen, “Wave digital simulation of a vacuum-tube amplifier,” in Proc. IEEE ICASSP, Toulouse, France, May 2006, pp. 153–156.
  • [6] D. T. Yeh, “Automated physical modeling of nonlinear audio circuits for real-time audio effects—Part II: BJT and vacuum tube examples,” IEEE Trans. Speech Audio Process., vol. 20, no. 4, pp. 1207–1216, 2012.
  • [7] O. Kröning, K. Dempwolf, and U. Zölzer, “Analysis and simulation of an analog guitar compressor,” in Proc. Int. Conf. Digital Audio Effects (DAFX), 2011, pp. 205–208.
  • [8] A. Novak, L. Simon, P. Lotton, and J. Gilbert, “Chebyshev model and synchronized swept sine method in nonlinear audio effect modeling,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Graz, Austria, Sep. 2010, pp. 423–426.
  • [9] F. Eichas and U. Zölzer, “Black-box modeling of distortion circuits with block-oriented models,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Brno, Czech Republic, 2016, pp. 5–9.
  • [10] S. Orcioni, A. Terenzi, S. Cecchi, F. Piazza, and A. Carini, “Identification of Volterra models of tube audio devices using multiple-variance method,” J. Audio Eng. Soc., vol. 66, no. 10, pp. 823–838, Oct. 2018.
  • [11] T. Vanhatalo, P. Legrand, M. Desainte-Catherine, et al., “A review of neural network-based emulation of guitar amplifiers,” Appl. Sci., vol. 12, no. 12, pp. 5894, 2022.
  • [12] M. A. Martínez Ramírez and J. D. Reiss, “Modeling nonlinear audio effects with end-to-end deep neural networks,” in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 171–175.
  • [13] E.-P. Damskägg, L. Juvela, E. Thuillier, and V. Välimäki, “Deep learning for tube amplifier emulation,” in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 471–475.
  • [14] A. Wright, E.-P. Damskägg, and V. Välimäki, “Real-time black-box modelling with recurrent neural networks,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Birmingham, UK, Sep. 2019, pp. 173–180.
  • [15] A. Wright and V. Välimäki, “Neural modeling of phaser and flanging effects,” J. Audio Eng. Soc., vol. 69, no. 7/8, pp. 517–529, Jul. 2021.
  • [16] S. Nercessian, A. Sarroff, and K. J. Werner, “Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads,” in Proc. IEEE ICASSP, Toronto, Canada, June 2021, pp. 890–894.
  • [17] A. Wright and V. Välimäki, “Grey-box modelling of dynamic range compression,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Vienna, Austria, 2022, pp. 304–311.
  • [18] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song, “Neural style transfer: A review,” IEEE Trans. Vis. Comput. Graph., vol. 26, no. 11, pp. 3365–3385, Nov. 2020.
  • [19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2017, pp. 2223–2232.
  • [20] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, “Audio style transfer,” in Proc. IEEE ICASSP, Calgary, Canada, 2018, pp. 586–590.
  • [21] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in Proc. Eur. Signal Process. Conf., 2018, pp. 2100–2104.
  • [22] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, “TimbreTron: A WaveNet(CycleGAN(CQT(audio))) pipeline for musical timbre transfer,” in Proc. Int. Conf. Learning Representations (ICLR), New Orleans, LA, 2019.
  • [23] D. K. Jain, A. Kumar, L. Cai, S. Singhal, and V. Kumar, “ATT: Attention-based timbre transfer,” in Proc. Int. Joint Conf. Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1–6.
  • [24] M. Carney, C. Li, E. Toh, N. Zada, P. Yu, and J. Engel, “Tone transfer: In-browser interactive neural audio synthesis,” in Proc. ACM IUI Workshops, Apr. 2021.
  • [25] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” in Proc. Int. Conf. Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020.
  • [26] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” J. Audio Eng. Soc., vol. 70, no. 9, pp. 708–721, Sep. 2022.
  • [27] A. van den Oord, S. Dieleman, H. Zen, et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv: 1609.03499, Sep. 2016.
  • [28] D. Rethage, J. Pons, and X. Serra, “A Wavenet for speech denoising,” in Proc. IEEE ICASSP, Calgary, Canada, Apr. 2018, pp. 5069–5073.
  • [29] E.-P. Damskägg, L. Juvela, and V. Välimäki, “Real-time modeling of audio distortion circuits with deep learning,” in Proc. Int. Sound and Music Computing Conf. (SMC), Malaga, Spain, May 2019, pp. 332–339.
  • [30] L. Juvela, B. Bollepalli, X. Wang, et al., “Speech waveform synthesis from MFCC sequences with generative adversarial networks,” in Proc. IEEE ICASSP, Calgary, Canada, Apr. 2018, pp. 5679–5683.
  • [31] J. H. Lim and J. C. Ye, “Geometric GAN,” arXiv preprint arXiv:1705.02894, 2017.
  • [32] K. Kumar, R. Kumar, T. de Boissiere, et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” Advances in Neural Inform. Process. Syst., vol. 32, 2019.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), San Diego, CA, 2015.
  • [34] C. Kehling, J. Abeßer, C. Dittmar, and G. Schuller, “Automatic tablature transcription of electric guitar recordings by estimation of score- and instrument-related parameters,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Erlangen, Germany, Sep. 2014, pp. 219–226.
  • [35] ITU, “BS.1534: Method for the subjective assessment of intermediate quality levels of coding systems,” Recommendation ITU-R BS.1534-3, Geneva, Switzerland, 2015.