When recording musical instruments for contemporary music production, the audio effects applied are an essential component of the resulting timbre. This is especially true of the electric guitar, where the timbral characteristics imparted by certain vacuum tube amplifiers, often referred to by guitarists as the tone, are highly sought after. Digitally emulating analog audio devices such as guitar amplifiers is known as Virtual Analog (VA) modelling [1].
A related but more challenging problem is to emulate the timbre of an electric guitar directly from a recording. This corresponds to a practical and very exciting task: how to imitate the guitar tone heard on a commercial music recording using your own electric guitar?
The usual paradigm of VA modelling is the emulation of specific analog devices; however, this is insufficient for this purpose for two reasons. Firstly, the specific instrument and devices used to craft the guitar tone on a famous recording might not be known. Even if, for example, the exact amplifier used to make a recording is available, recreating the target timbre is no simple task, as it would require discovering the combination of parameter settings and recording setup. Secondly, the desired timbre includes both the musical instrument itself and any processing applied to it. This means that any emulation of a recorded guitar tone must account for differences between the timbre of your guitar and the timbre of the target guitar, before effects processing is applied.
The problem thus corresponds more closely to that of timbre transfer, but in the specific case where the input and target instrument are both guitars. The problem can be divided into two subproblems: the linear and the nonlinear modification. The linear problem is related to the difference between the spectral shapes of the two instruments. A similar problem has been solved earlier using a high-order linear filter, either to enhance the string instrument timbre taken from a piezo pickup [2, 3], or to process an electric guitar such that it sounds like an acoustic one [4]. In these cases, the filter behaves as an equalizer (EQ).
The nonlinear problem involves the emulation of the devices, other than the instrument itself, used to create the recording, such as the guitar amplifier, speaker cabinet, or other effects such as compression or distortion. Modelling these types of devices has been widely studied, and broadly falls into two approaches: “white-box”, where the physical properties of the system are studied to produce equations describing its behaviour [5, 6, 7], and “black-box”, where data collected from the system is used to fit a more generic model [8, 9, 10]. Recently, deep learning methods based on neural networks have become a popular choice for this purpose [11]. These include encoder-decoder models aimed at a wide range of effects [12], feedforward [13] and recurrent [14] models for guitar amplifiers and time-varying effects [15], as well as models based on traditional DSP components such as infinite impulse response filters [16, 17].
Whilst the aforementioned approaches are all viable for the task of imitating a target electric guitar timbre, they all have major limitations in that they require us to know what devices were used during recording, or to have access to paired data with both the target timbre and the unprocessed signal taken directly from the guitar used during the recording. In the case where paired training data is impossible, or unavailable, one solution is to formulate the problem as domain-to-domain transfer. Great success has been achieved in Image Style Transfer, where the style of an input image is transformed to match a target style, whilst retaining the input image content [18, 19].
A related problem in audio is Audio Style Transfer, which seeks to transfer the style of some input audio to a target style, whilst retaining the content of the input audio [20]. In the speech domain, an example of this is voice conversion, where an utterance spoken by one speaker is processed such that it sounds like it was spoken by a different speaker. This can be achieved, for example, using non-parallel speech data to train a Generative Adversarial Network (GAN) with a cycle-consistency loss [21]. For musical applications, the problem is more often thought of as altering the timbre of some input audio to match a target timbre. Hence, in the musical domain style transfer is more frequently referred to as timbre transfer [22, 23] or tone transfer [24]. This can be achieved by applying image style transfer techniques to a time-frequency representation of the input audio, and then re-synthesising the audio in the waveform domain [22], or by learning a high-level representation of the input and using it as input to a synthesizer [25]. Recent work has also proposed style transfer of audio effects, using a shared-weight encoder to impart the production style of one recording onto another [26].
In this paper, we propose to address this problem using an unsupervised approach, in which a black-box feedforward convolutional model is trained in an adversarial manner using unpaired data. The distinction between the supervised and unsupervised modelling task is shown in Fig. 1. We conceptualise this as a form of domain transfer, and focus on the case of processing a signal recorded directly from an electric guitar, or input timbre, with the objective of matching the timbre of a guitar from a recording, or target timbre.
The proposed method uses an adversarial training approach. We define domains in the space of guitar timbre, where a domain is defined by the sounds produced by a particular combination of a guitar, pickups, amplifier, and any other audio effects. Note that the domain is defined purely by the guitar timbre, and not the content of the guitar playing. If we define domains, $A$ and $B$, and we sample audio data, $x_A$, where $x_A \in A$, and $x_B$, where $x_B \in B$, our objective is to learn a mapping, $G$, that converts audio from domain $A$ to domain $B$:

$$\hat{x}_B = G(x_A).$$

The ultimate objective is to transform the tone such that $\hat{x}_B$ is perceptually indistinguishable from $x_B$. To train $G$ we use a discriminator model, $D$, trained to identify examples from the target timbre domain.
Previous work on adversarial domain-to-domain translation [19, 21] has applied a cycle-consistency criterion to avoid a specific form of mode collapse, which results in simply ignoring the content of the input and synthesising unrelated content from the target domain. However, in our present experiments the generator model did not exhibit this problem, likely due to the constrained expressive capability of the feedforward convolutional neural network used, which learns to directly apply a transformation to the input signal in the time domain.
The generator model used in this work is a feedforward variant of the WaveNet architecture [27]. A non-causal feedforward WaveNet variant was first proposed for speech denoising [28]. A causal version was later applied to guitar amplifier modelling [13], and this architecture is used as the generator throughout this work. The model has two main components: a stack of dilated causal convolutional layers and a linear post-processor. The post-processor is a fully connected layer that takes the outputs of each of the convolutional layers as input.
All generator models used in this work consist of two stacks of nine dilated convolutions, with each stack starting at a dilation of one and increasing by a factor of two with each subsequent convolutional layer. Each convolutional layer has a kernel size of three, and uses the same gated activation function as the original WaveNet. The receptive field of this model is 2045 samples, or about 46.4 ms at a 44.1 kHz sample rate. It should also be noted that previous work has shown that a C++ implementation of this model is capable of running in real time on a modern desktop computer [29].
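The receptive field figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that each layer with dilation d and kernel size k extends the receptive field by d(k - 1) samples; the function name is illustrative and not taken from the original implementation.

```python
def receptive_field(num_stacks: int, layers_per_stack: int, kernel_size: int) -> int:
    """Receptive field, in samples, of stacked dilated convolutions.

    Each stack uses dilations 1, 2, 4, ..., 2 ** (layers_per_stack - 1).
    A layer with dilation d and kernel size k adds d * (k - 1) samples.
    """
    dilations = [2 ** i for i in range(layers_per_stack)] * num_stacks
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


SAMPLE_RATE_HZ = 44_100

# Two stacks of nine layers with kernel size three, as described in the text.
samples = receptive_field(num_stacks=2, layers_per_stack=9, kernel_size=3)
duration_ms = 1000.0 * samples / SAMPLE_RATE_HZ

print(samples, round(duration_ms, 1))  # 2045 46.4
```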
The input to the proposed discriminator is a time-frequency representation of the audio, as proposed in [30]. The discriminator consists of a stack of 1D convolutional layers, with the frequency bins of the time-frequency representation being provided as channels to the first layer of the discriminator. Subsequent layers use grouped convolutions to reduce computational cost, and the hyperparameters for each layer are shown in Table 1. All layers use weight normalization, and all layers except the final output layer are followed by a Leaky ReLU activation function with a negative slope of 0.2. Four different time-frequency representations were trialled: a magnitude spectrogram, a magnitude mel-spectrogram, a log magnitude spectrogram, and a log magnitude mel-spectrogram. For all mel-spectrograms, the number of mel bands used was 160, the maximum frequency was set to Nyquist, and the minimum frequency was set to 0 Hz.
Additionally, a multi-scale version of the spectral domain discriminator was trialled which included three sub-discriminators, each operating on time-frequency representations obtained using different window sizes. In the case of the single spectral-domain discriminator, a window size of was used, as for the multi-scale spectral discriminator, window sizes of were used. In all cases, the hop size was set to .
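To illustrate how such discriminator inputs might be prepared, the sketch below computes log magnitude spectrograms at several window sizes, arranged so that frequency bins form the channel axis of a 1D convolution. The window sizes, hop size, Hann window, and log offset used here are placeholder assumptions chosen for the example; they are not the values used in the experiments, and the mel-scaled variants are omitted.

```python
import numpy as np


def magnitude_spectrogram(x, window_size, hop_size, log_scale=False, eps=1e-5):
    """Return an array of shape (freq_bins, frames).

    Frequency bins come first so they can act as input channels
    to a 1D convolution that runs along the time (frame) axis.
    """
    window = np.hanning(window_size)
    num_frames = 1 + (len(x) - window_size) // hop_size
    frames = np.stack([
        x[i * hop_size: i * hop_size + window_size] * window
        for i in range(num_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T
    return np.log(mag + eps) if log_scale else mag


# Placeholder settings, assumed for illustration only.
WINDOW_SIZES = (512, 1024, 2048)
HOP_SIZE = 256

rng = np.random.default_rng(0)
segment = rng.standard_normal(2 * 44_100)  # stand-in for a two-second segment

# One representation per sub-discriminator in a multi-scale setup.
multi_scale = [
    magnitude_spectrogram(segment, w, HOP_SIZE, log_scale=True)
    for w in WINDOW_SIZES
]
shapes = [s.shape for s in multi_scale]
print(shapes)
```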
2.3 Training Objective
In the training objective, $x_A$ is a guitar audio waveform used as input to the generator, $G$, and $x_B$ is a guitar audio waveform taken from the target set. The training scheme is shown in Fig. 2. During training, both the input and target guitar datasets are split into two-second segments before processing by $G$ or the discriminator, $D$. The models were trained using the Adam optimizer [33] with a batch size of 5.
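The loss equations themselves are not legible in this text. As an illustration only, the sketch below assumes a hinge-type adversarial objective of the kind introduced in the Geometric GAN paper listed in the references; whether this matches the exact formulation used in the experiments is an assumption. The function names are hypothetical, and the scores stand for discriminator outputs on target-domain audio (real) and on generator outputs (fake).

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Hinge loss for the discriminator.

    Pushes scores for target-domain audio above +1 and
    scores for generated audio below -1.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_generator_loss(fake_scores):
    """Hinge-style generator loss: raise the discriminator's score on generated audio."""
    return -sum(fake_scores) / len(fake_scores)


# Toy scores standing in for discriminator outputs on one batch.
d_loss = hinge_discriminator_loss(real_scores=[2.0, 0.5], fake_scores=[-2.0, 0.0])
g_loss = hinge_generator_loss(fake_scores=[-2.0, 0.0])
print(d_loss, g_loss)  # 0.75 1.0
```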
The data used throughout this work was taken from a guitar dataset (www.idmt.fraunhofer.de/en/business_units/m2d/smt/guitar.html) originally proposed for the task of automatic transcription [34]. We use the fourth subset of the dataset, which consists of 64 short musical pieces played at fast and slow tempi. The pieces were recorded on two different electric guitars, a Career SG and an Ibanez 2820. We processed the recordings to remove leading and trailing silence, as well as the two-bar count-in at the beginning of each piece. Additionally, it was noted that clipping is present in some of the samples in the Career SG dataset, so examples where excessive clipping was observed were removed.
After the pre-processing, there was approximately 40 min of audio from the Ibanez 2820 guitar, and 30 min from the Career SG. To create the datasets used during our experiments, the guitar audio was processed by a guitar amplifier plugin. To test the robustness of our modelling approach, a separate dataset was created for each of three different plugin settings, hereafter referred to as ‘Clean’, ‘Light Distortion’, and ‘Heavy Distortion’. The amount of harmonic distortion introduced increases from a relatively small amount in the ‘Clean’ setting to an extremely distorted guitar tone in the ‘Heavy Distortion’ case.
Our proposed modelling approach was tested on two different problems. In the first scenario, which is a toy problem, the signal contained in both the input and target datasets is recorded from the same guitar. In this case, the specific instrument in each dataset is identical, and the modelling task is to recreate the effects processing applied to the input signal. Although not realistic, this scenario is relevant because it allows supervised metrics to be used to evaluate unsupervised training methods.
In the second (more realistic) scenario, the input dataset is the unprocessed audio recorded from one guitar, and the target dataset is audio that has been recorded from a different guitar and has audio effects processing applied to it. In this case, the modelling task implicitly includes transforming the tone of one guitar into another, as well as recreating the effects processing applied to that guitar. The two experiments are depicted in Fig. 3. As a baseline, the time-domain discriminator from MelGAN [32] was used during both experiments.
3.1 Experiment 1: Single Guitar
The dataset is split to ensure that the input and target datasets do not contain any of the same guitar content. To achieve this, the dataset is divided into two-second segments, with subsequent segments being sent alternately to either the input or target training dataset. This ensures that the spectral content of the unprocessed guitar in both datasets is similar, but that the actual content is different.
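The alternating split described above can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and discarding a trailing partial segment is an assumption, since the text does not say how leftover audio is handled.

```python
def alternate_split(audio, segment_length):
    """Cut audio into fixed-length segments and send them alternately
    to the input and target sets, so the two sets share no content.

    A trailing segment shorter than segment_length is discarded.
    """
    last_start = len(audio) - segment_length
    segments = [
        audio[i:i + segment_length]
        for i in range(0, last_start + 1, segment_length)
    ]
    input_set = segments[0::2]   # segments 0, 2, 4, ...
    target_set = segments[1::2]  # segments 1, 3, 5, ...
    return input_set, target_set


# Toy example: ten "samples" cut into segments of two.
# At 44.1 kHz, a two-second segment would be 88200 samples.
input_set, target_set = alternate_split(list(range(10)), segment_length=2)
print(input_set)   # [[0, 1], [4, 5], [8, 9]]
print(target_set)  # [[2, 3], [6, 7]]
```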
As in this case the guitar is the same in both datasets, as shown in Fig. 3(a), a ground truth, or reference, for how the input guitar should sound after the effects processing is applied is available. This allows a validation loss to be calculated over a held-out validation set, that consists of paired input/output guitar audio. This also allows a baseline supervised model to be trained.
For validation loss metrics, we use both the linear and log-scaled variants of the multi-scale magnitude spectrogram loss described in [25]. In addition to this, we also present the L1 distance between the output and target mel magnitude spectrograms, again using both linear and log scaling. This gives four validation metrics in total.
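As a rough illustration of a multi-scale magnitude spectrogram metric, the sketch below averages the L1 distance between spectrograms computed at several window sizes, with an optional log scaling. The window sizes, the hop of one quarter window, the Hann window, and the log offset are assumptions made for the example, and the mel-scaled variants are not shown.

```python
import numpy as np


def stft_magnitude(x, window_size, hop_size):
    """Magnitude STFT of shape (frames, freq_bins) using a Hann window."""
    window = np.hanning(window_size)
    num_frames = 1 + (len(x) - window_size) // hop_size
    frames = np.stack([
        x[i * hop_size: i * hop_size + window_size] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_scale_spectral_loss(output, target, window_sizes=(256, 512, 1024),
                              log_scale=False, eps=1e-5):
    """Mean L1 distance between magnitude spectrograms, averaged over scales."""
    total = 0.0
    for w in window_sizes:
        s_out = stft_magnitude(output, w, w // 4)
        s_tgt = stft_magnitude(target, w, w // 4)
        if log_scale:
            s_out, s_tgt = np.log(s_out + eps), np.log(s_tgt + eps)
        total += float(np.mean(np.abs(s_out - s_tgt)))
    return total / len(window_sizes)


rng = np.random.default_rng(0)
reference = rng.standard_normal(8192)
estimate = 0.5 * reference  # a deliberately mismatched "model output"

lin_loss = multi_scale_spectral_loss(estimate, reference)
log_loss = multi_scale_spectral_loss(estimate, reference, log_scale=True)
print(lin_loss > 0.0, log_loss > 0.0)
```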
Additionally, a MUSHRA-style [35] listening test was carried out. Participants were presented with audio clips that were processed by the target plugin, as well as by the various neural network models. An anchor, created by processing the input with a tanh nonlinearity, and a hidden reference were also included. Participants were asked to rate each test condition out of 100, based on perceived similarity to the reference. Twelve subjects took the listening test, and four were removed in post-screening as they rated the hidden reference less than 90 in more than of the trials.
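The post-screening rule can be expressed as a simple filter over each participant's hidden-reference ratings. In the sketch below, the fraction of trials that triggers exclusion is not legible in the text, so `max_fraction` is a hypothetical parameter and its default of 0.15 is only a placeholder; the participant labels and ratings are made up for the example.

```python
def screen_participants(hidden_ref_ratings, min_score=90, max_fraction=0.15):
    """Return the participants who pass post-screening.

    A participant is excluded if they rated the hidden reference below
    min_score in more than max_fraction of their trials.
    """
    kept = []
    for participant, ratings in hidden_ref_ratings.items():
        low_count = sum(1 for r in ratings if r < min_score)
        if low_count / len(ratings) <= max_fraction:
            kept.append(participant)
    return kept


# Made-up ratings of the hidden reference, one list per participant.
ratings = {
    "p1": [100, 95, 100, 92],
    "p2": [100, 60, 70, 95],
    "p3": [100, 100, 85, 100, 100, 100, 100, 100, 100, 100],
}
kept = screen_participants(ratings)
print(kept)  # ['p1', 'p3']
```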
The supervised baseline used the same generator model, but trained in a supervised manner using the Error-to-Signal Ratio (ESR) loss function with high-pass filter pre-emphasis. As an unsupervised baseline, the generator model was trained using the MelGAN discriminator [32]. Then, spectral domain discriminators were trained with the various configurations introduced in Sec. 2.
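For readers unfamiliar with the supervised loss, the following is a minimal sketch of an error-to-signal ratio computed after first-order high-pass pre-emphasis. The filter form y[n] = x[n] - c x[n-1] and the coefficient 0.95 are assumptions made for this example; the text does not state the filter that was used.

```python
def pre_emphasis(x, coeff=0.95):
    """First-order high-pass pre-emphasis: y[n] = x[n] - coeff * x[n - 1]."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def error_to_signal_ratio(target, output, coeff=0.95):
    """Energy of the pre-emphasised error divided by the energy
    of the pre-emphasised target."""
    t = pre_emphasis(target, coeff)
    o = pre_emphasis(output, coeff)
    error_energy = sum((a - b) ** 2 for a, b in zip(t, o))
    target_energy = sum(a ** 2 for a in t)
    return error_energy / target_energy


target = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
silence = [0.0] * len(target)

print(error_to_signal_ratio(target, target))   # 0.0 for a perfect match
print(error_to_signal_ratio(target, silence))  # 1.0 when the output is silent
```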
The results are shown in Table 2. In each case the training was run for 400k iterations, and the validation loss was used to select the best performing model. For the spectral domain discriminators, the validation loss used to determine the best performing model was chosen depending on the form of the input provided to the discriminator. For example, if the input to the discriminator was a log-mel spectrogram, then the training iteration that achieved the lowest log-scaled mel-spectrogram validation loss was selected as the best performing model. In each case, our experiments included both multi-scale and single-scale spectral domain discriminators; however, for brevity, only the results for the better performing of the two are included in Table 2.
The objective results for Experiment 1 show that the WaveNet model trained in a supervised fashion performs better than the unsupervised models on all the proposed metrics, across all the targets chosen (see Table 2). Of the unsupervised models, those trained with the MelGAN discriminator generally perform better on the objective loss metrics. However, the results from the listening tests indicate that there is no clear best performing model between the supervised and unsupervised training approaches, except in the case of spectral domain discriminators that receive linear-scaled spectrograms as input. For all target tones attempted, at least one of the unsupervised models is able to achieve a score of 80 or higher, indicating a perceptual match somewhere between Good and Excellent on the MUSHRA scale.
Table 2: Results for Experiment 1, grouped by target tone: Clean, Light Distortion, and Heavy Distortion.
3.2 Experiment 2: Mismatched Guitar
For the scenario in Experiment 2, the guitar used to create each dataset is different, as shown in Fig. 3(b). This means that the objective metrics listed in Table 2 are unavailable. As such, we conducted a listening test, in which participants were presented with a reference, consisting of a few seconds of guitar playing from the target timbre domain. The participants were then asked to rate a number of test conditions, which consisted of the next few seconds of the same piece of music, but performed on the guitar from the input timbre domain. The test conditions all consisted of processed versions of the same guitar audio. It was impossible to include a hidden reference in the test, as it does not exist.
Two baselines were created, both having access to some ground truth information. The first baseline was created by processing the input guitar with the same effects plugin that was used to create the target guitar timbre. This baseline is referred to as the “plugin-only” timbre. It corresponds to the simplified solution in which the linear timbre transfer is not included, but the nonlinear mapping is perfect. The second baseline was created by applying linear EQ matching to the unprocessed input guitar tone, with the EQ target being the target guitar before effects processing was applied. This EQ-matched version of the input guitar was then processed by the effects plugin used to create the reference timbre. This second baseline is referred to as the “EQ+plugin” tone. Note that this processing is impossible to achieve in a practical setting, but is used here in lieu of an ideal reference. A low-quality anchor was also included in the listening test, which consisted of the input guitar processed by a tanh nonlinearity.
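The text does not specify how the EQ matching or the anchor were implemented. As one possible reading, the sketch below estimates per-frequency EQ gains as the ratio of long-term average magnitude spectra, and forms an anchor by driving the signal into a tanh. The function names, window settings, and drive value are all assumptions, and turning the gains into an actual filter is left out.

```python
import numpy as np


def average_spectrum(x, window_size=1024, hop_size=256):
    """Long-term average magnitude spectrum, shape (window_size // 2 + 1,)."""
    window = np.hanning(window_size)
    num_frames = 1 + (len(x) - window_size) // hop_size
    frames = np.stack([
        x[i * hop_size: i * hop_size + window_size] * window
        for i in range(num_frames)
    ])
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)), axis=0)


def eq_match_gains(source, target, eps=1e-8):
    """Per-bin linear gains that move the source's average spectrum
    towards the target's average spectrum."""
    return average_spectrum(target) / (average_spectrum(source) + eps)


def tanh_anchor(x, drive=5.0):
    """Low-quality anchor: static tanh waveshaping with an assumed drive."""
    return np.tanh(drive * x)


rng = np.random.default_rng(0)
source = rng.standard_normal(44_100)
target = 0.5 * source  # toy target: the same signal, 6 dB quieter

gains = eq_match_gains(source, target)
anchor = tanh_anchor(source)
print(gains.shape, float(gains.mean()))
```

In this toy case the gains come out close to 0.5 in every bin, since the target is simply a scaled copy of the source.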
The test conditions consisted of three unsupervised models, trained using the MelGAN discriminator, or the spectral domain discriminator with either log spectrogram or log mel-spectrogram input. In both spectral cases, the multi-scale version of the spectral domain discriminator was used. The models were trained for 250k iterations.
The results of the listening test are shown in Fig. 4. The results indicate that the unsupervised models are competitive with our proposed baselines. For the ‘Clean’ and ‘Light Distortion’ cases, the MelGAN model performs poorly. One possible explanation is that, during training for the first experiment, the MelGAN model was observed to produce some oscillation and instability as training went on, whereas the spectral discriminator models tended to quickly plateau and then remain stable. As no validation loss was available to monitor the training in this mismatched guitar case, it was not possible to select the model parameters that produced the lowest validation loss once training was stopped.
This work shows for the first time how the guitar timbre heard on a music recording can be imitated by another guitar, using an unsupervised method based on a GAN framework. We formulated the problem as domain transfer, and proposed a spectral domain discriminator. We validated our method through two listening tests and showed that the models produced are perceptually convincing. Audio samples are available at our demonstration page (https://ljuvela.github.io/adversarial-amp-modeling-demo/).
- [1] V. Välimäki, S. Bilbao, J. O. Smith, J. S. Abel, J. Pakarinen, and D. Berners, “Virtual analog effects,” in DAFX: Digital Audio Effects, U. Zölzer, Ed., pp. 473–522. Wiley, Chichester, UK, second edition, 2011.
- [2] M. Karjalainen, V. Välimäki, H. Penttinen, and H. Saastamoinen, “DSP equalization of electret film pickup for the acoustic guitar,” J. Audio Eng. Soc., vol. 48, no. 12, pp. 1183–1193, Dec. 2000.
- [3] M. Rau, J. S. Abel, and J. O. Smith III, “Contact sensor processing for acoustic instrument recording using a modal architecture,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Aveiro, Portugal, Sep. 2018, pp. 304–308.
- [4] M. Karjalainen, H. Penttinen, and V. Välimäki, “Acoustic sound from the electric guitar using DSP techniques,” in Proc. IEEE ICASSP, Istanbul, Turkey, June 2000, pp. 773–776.
- [5] M. Karjalainen and J. Pakarinen, “Wave digital simulation of a vacuum-tube amplifier,” in Proc. IEEE ICASSP, Toulouse, France, May 2006, pp. 153–156.
- [6] D. T. Yeh, “Automated physical modeling of nonlinear audio circuits for real-time audio effects - Part II: BJT and vacuum tube examples,” IEEE Trans. Speech Audio Process., vol. 20, no. 4, pp. 1207–1216, 2012.
- [7] O. Kröning, K. Dempwolf, and U. Zölzer, “Analysis and simulation of an analog guitar compressor,” in Proc. Int. Conf. Digital Audio Effects (DAFX), 2011, pp. 205–208.
- [8] A. Novak, L. Simon, P. Lotton, and J. Gilbert, “Chebyshev model and synchronized swept sine method in nonlinear audio effect modeling,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Graz, Austria, Sep. 2010, pp. 423–426.
- [9] F. Eichas and U. Zölzer, “Black-box modeling of distortion circuits with block-oriented models,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Brno, Czech Republic, 2016, pp. 5–9.
- [10] S. Orcioni, A. Terenzi, S. Cecchi, F. Piazza, and A. Carini, “Identification of Volterra models of tube audio devices using multiple-variance method,” J. Audio Eng. Soc., vol. 66, no. 10, pp. 823–838, Oct. 2018.
- [11] T. Vanhatalo, P. Legrand, M. Desainte-Catherine, et al., “A review of neural network-based emulation of guitar amplifiers,” Appl. Sci., vol. 12, no. 12, pp. 5894, 2022.
- [12] M. A. Martínez Ramírez and J. D. Reiss, “Modeling nonlinear audio effects with end-to-end deep neural networks,” in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 171–175.
- [13] E.-P. Damskägg, L. Juvela, E. Thuillier, and V. Välimäki, “Deep learning for tube amplifier emulation,” in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 471–475.
- [14] A. Wright, E.-P. Damskägg, and V. Välimäki, “Real-time black-box modelling with recurrent neural networks,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Birmingham, UK, Sep. 2019, pp. 173–180.
- [15] A. Wright and V. Välimäki, “Neural modeling of phaser and flanging effects,” J. Audio Eng. Soc., vol. 69, no. 7/8, pp. 517–529, Jul. 2021.
- [16] S. Nercessian, A. Sarroff, and K. J. Werner, “Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads,” in Proc. IEEE ICASSP, Toronto, Canada, June 2021, pp. 890–894.
- [17] A. Wright and V. Välimäki, “Grey-box modelling of dynamic range compression,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Vienna, Austria, 2022, pp. 304–311.
- [18] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song, “Neural style transfer: A review,” IEEE Trans. Vis. Comput. Graph., vol. 26, no. 11, pp. 3365–3385, Nov. 2020.
- [19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2017, pp. 2223–2232.
- [20] E. Grinstein, N. QK Duong, A. Ozerov, and P. Pérez, “Audio style transfer,” in Proc. IEEE ICASSP, Calgary, Canada, 2018, pp. 586–590.
- [21] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in Proc. Eur. Signal Process. Conf., 2018, pp. 2100–2104.
- [22] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” in Proc. Int. Conf. Learning Representations (ICLR), New Orleans, LA, 2019.
- [23] D. K. Jain, A. Kumar, L. Cai, S. Singhal, and V. Kumar, “ATT: Attention-based timbre transfer,” in Proc. Int. Joint Conf. Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1–6.
- [24] M. Carney, C. Li, E. Toh, N. Zada, P. Yu, and J. Engel, “Tone transfer: In-browser interactive neural audio synthesis,” in Proc. ACM IUI Workshops, Apr. 2021.
- [25] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” in Proc. Int. Conf. Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020.
- [26] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” J. Audio Eng. Soc., vol. 70, no. 9, pp. 708–721, Sep. 2022.
- [27] A. van den Oord, S. Dieleman, H. Zen, et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, Sep. 2016.
- [28] D. Rethage, J. Pons, and X. Serra, “A Wavenet for speech denoising,” in Proc. IEEE ICASSP, Calgary, Canada, Apr. 2018, pp. 5069–5073.
- [29] E.-P. Damskägg, L. Juvela, and V. Välimäki, “Real-time modeling of audio distortion circuits with deep learning,” in Proc. Int. Sound and Music Computing Conf. (SMC), Malaga, Spain, May 2019, pp. 332–339.
- [30] L. Juvela, B. Bollepalli, X. Wang, et al., “Speech waveform synthesis from MFCC sequences with generative adversarial networks,” in Proc. IEEE ICASSP, Calgary, Canada, Apr. 2018, pp. 5679–5683.
- [31] J. H. Lim and J. C. Ye, “Geometric GAN,” arXiv preprint arXiv:1705.02894, 2017.
- [32] K. Kumar, R. Kumar, T. de Boissiere, et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” Advances in Neural Inform. Process. Syst., vol. 32, 2019.
- [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), San Diego, CA, 2015.
- [34] C. Kehling, J. Abeßer, C. Dittmar, and G. Schuller, “Automatic tablature transcription of electric guitar recordings by estimation of score- and instrument-related parameters,” in Proc. Int. Conf. Digital Audio Effects (DAFX), Erlangen, Germany, Sep. 2014, pp. 219–226.
- [35] ITU, “BS.1534: Method for the subjective assessment of intermediate quality levels of coding systems,” Recommendation ITU-R BS.1534-3, Geneva, Switzerland, 2015.