1 Introduction
Modulation based or time-varying audio effects involve audio processors or effect units that include a modulator signal within their analog or digital implementation [1]. These modulator signals are in the low frequency range (usually below Hz). Their waveforms are based on common periodic signals such as sinusoidal, square-wave or sawtooth oscillators, and the modulator is often referred to as a Low Frequency Oscillator (LFO). The LFO periodically modulates certain parameters of the audio processors to alter the timbre, frequency, loudness or spatialization characteristics of the audio. This differs from time-invariant audio effects, which do not change their behavior over time. Based on how the LFO is employed and the underlying signal processing techniques used when designing the effect units, we can classify modulation based audio effects into time-varying filters such as phaser or wah-wah; delay-line based effects such as flanger or chorus; and amplitude modulation effects such as tremolo or ring modulator [2].
The phaser effect is a type of time-varying filter implemented through a cascade of notch or all-pass filters. The characteristic sweeping sound of this effect is obtained by modulating the center frequency of the filters, which creates phase cancellations or enhancements when combining the filter’s output with the input audio. Similarly, the wah-wah is based on a band-pass filter with a variable center frequency, usually controlled by a pedal. If the center frequency is modulated by an LFO or an envelope follower, the effect is commonly called auto-wah.
Delay-line based audio effects, as in the case of flanger and chorus, are based on the modulation of the length of the delay lines. A flanger is implemented via a modulated comb filter whose output is mixed with the input audio. Unlike the phaser, the notch and peak frequencies caused by the flanger’s sweeping comb filter are equally spaced in the spectrum, which causes the well-known metallic sound associated with this effect. A chorus occurs when mixing the input audio with delayed and pitch-modulated copies of the original signal. This is similar to various musical sources playing the same instrument but slightly shifted in time. Vibrato is digitally implemented as a delay-line based audio effect, where pitch shifting is achieved by periodically varying the delay time of the input waveform [3].
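As an illustration of the delay-line principle described above, the following is a minimal sketch of a delay line whose length is swept by a sinusoidal LFO. It is not the implementation of any effect unit discussed in this paper: the function name modulated_delay, its parameters, and the choice of linear interpolation are assumptions made for the example.

```python
import numpy as np

def modulated_delay(x, sr, rate_hz, base_ms, depth_ms, mix):
    """Read x through a delay line whose length is swept by a sine LFO.

    mix=1.0 returns only the delayed signal (vibrato-like);
    mix=0.5 blends dry and delayed signals equally (flanger-like).
    Assumes base_ms >= depth_ms so the delay never becomes negative.
    """
    n = np.arange(len(x))
    # Delay in samples, swept sinusoidally around base_ms.
    delay = (base_ms + depth_ms * np.sin(2 * np.pi * rate_hz * n / sr)) * sr / 1000.0
    pos = n - delay                      # fractional read position
    i0 = np.floor(pos).astype(int)
    frac = pos - i0
    # Linear interpolation between neighbouring samples;
    # reads that start before the beginning of the signal return zero.
    x_pad = np.concatenate([x, [0.0]])
    valid = i0 >= 0
    i0c = np.clip(i0, 0, len(x) - 1)
    wet = np.where(valid, (1 - frac) * x_pad[i0c] + frac * x_pad[i0c + 1], 0.0)
    return (1 - mix) * x + mix * wet
```

With a delay of a few milliseconds and mix=0.5 the dry and delayed signals form a swept comb filter; with mix=1.0 only the time-varying delay remains, which produces the periodic pitch shift associated with vibrato.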
Tremolo is an amplitude modulation effect where an LFO is used to directly vary the amplitude of the incoming audio, creating in this way a perceptual temporal fluctuation. A ring modulator is also based on amplitude modulation, but the modulation is achieved by having the input audio multiplied by a sinusoidal oscillator with higher carrier frequencies. In the analog domain, this effect is commonly implemented with a diode bridge, which adds a nonlinear behavior and a distinct sound to this effect unit. Another type of modulation based effect that combines amplitude, pitch and spatial modulation is the Leslie speaker, which is implemented by a rotating horn and a rotating woofer inside a wooden cabinet. This effect can be interpreted as a combination of tremolo, Doppler effect and reverberation [4].
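The two amplitude modulation effects can be sketched in a few lines. This is an idealised illustration only: the names tremolo and ring_mod and their parameters are hypothetical, and the ring modulator below is a plain multiplication without the diode-bridge nonlinearity of the analog circuit.

```python
import numpy as np

def tremolo(x, sr, rate_hz, depth):
    """Scale x by a low-frequency sine so the gain swings between 1 - depth and 1."""
    n = np.arange(len(x))
    lfo = 0.5 * (1.0 + np.sin(2 * np.pi * rate_hz * n / sr))  # in [0, 1]
    return x * (1.0 - depth * lfo)

def ring_mod(x, sr, carrier_hz):
    """Multiply x by a sinusoidal carrier (no diode-bridge nonlinearity)."""
    n = np.arange(len(x))
    return x * np.sin(2 * np.pi * carrier_hz * n / sr)
```

For a sinusoidal input, the ideal ring modulator replaces the input frequency with its sum and difference with the carrier frequency, whereas the tremolo leaves the pitch untouched and only changes the loudness over time.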
Most of these effects can be implemented directly in the digital domain through the use of digital filters and delay lines. Nevertheless, modeling specific effect units or analog circuits has been heavily researched and remains an active field. This is because hardware effect units are characterized by the nonlinearities introduced by certain circuit components. Musicians often prefer the analog counterparts because the digital implementations may lack this behavior, or because the digital simulations make certain assumptions when modeling specific nonlinearities.
Virtual analog methods for modeling such effect units mainly involve circuit modeling and optimization for specific analog components such as operational amplifiers or transistors. This often requires assumptions or models that are too specific for a certain circuit. Such models are also not easily transferable to different effect units, since expert knowledge of the type of circuit being modeled is required, i.e. its specific linear and nonlinear components.
Using end-to-end deep neural networks (DNNs) with convolutional and recurrent layers, we explore how a deep neural network can learn the long temporal dependencies which characterize these effect units, as well as the possibilities to match nonlinearities within the audio effects. We include Bidirectional Long Short-Term Memory (BiLSTM) neural networks and explore their capabilities when learning time-varying transformations. We explore linear and nonlinear time-varying emulation as a content-based transformation without explicitly obtaining the solution of the time-varying system. We show the model matching modulation based audio effects such as chorus, flanger, phaser, tremolo, vibrato, tremolo-wah, ring modulator and Leslie speaker. We investigate the capabilities of the model when adding further nonlinearities to the linear time-varying audio effects. Furthermore, we extend the applications of the model by including nonlinear time-invariant audio effects with long temporal dependencies such as auto-wah, compressor and multiband compressor. Finally, we measure the performance of the model using a metric based on the modulation spectrum.
The paper is structured as follows. In Section 2 we present the relevant literature related to virtual analog modeling of modulation based audio effects. Section 3 gives details of our model, the modulation based effect tasks and the proposed evaluation metric. Sections 4 and 5 show the analysis, obtained results, and the respective conclusion.
2 Background
2.1 Virtual analog modeling of timevarying audio effects
Virtual analog audio effects aim to simulate an effect unit and recreate the sound of an analog reference circuit. Much of the active research models nonlinear audio processors such as distortion effects, compressors, amplifiers or vacuum tubes [5, 6, 7]. With respect to modeling timevarying audio effects, most of the research has been applied to develop whitebox methods, i.e. in order to model the effect unit a complete study of the internal circuit is carried out. These methods use circuit simulation techniques to characterize various analog components such as diodes, transistors, operational amplifiers or integrated circuits.
In [8], phasers implemented via Junction Field Effect Transistors (JFET) and Operational Transconductance Amplifiers (OTA) were modeled using circuit simulation techniques that discretize the differential equations that describe these components. Using a similar circuit modeling procedure, delayline based effects are modeled, such as flanger and chorus as implemented with Bucket Brigade Delay (BBD) chips. BBD circuits have been widely used in analog delayline based effect units and several digital emulations have been investigated. [9] emulated BBD devices through circuit analysis and electrical measurements of the linear and nonlinear elements of the integrated circuit. [10] modeled BBDs as delaylines with fixed length but variable sample rate.
Based on BBD circuitry, a flanger effect was modeled in [11] via the nodal DK-method. This is a common method in virtual analog modeling [12] where nonlinear filters are derived from the differential equations that describe an electrical circuit. In [13], a wah-wah pedal is implemented using the nodal DK-method and the method is extended to model the temporal fluctuations introduced by the continuous change of the pedal. In [14], the MXR Phase 90 phaser effect is modeled via a thorough circuit analysis and the DK-method. This effect unit is based on JFETs, and voltage and current measurements were performed to obtain the nonlinear characteristics of the transistors.
Amplitude modulation effects such as an analog ring modulator were modeled in [15], where the diode bridge is emulated as a network of static nonlinearities. [16] modeled the rotating horn of the Leslie speaker via varying delay-lines, artificial reverberation and physical measurements from the rotating loudspeaker. [17] also modeled the Leslie speaker and achieved frequency modulation through time-varying spectral delay filters, and amplitude modulation using a modulator signal. In both Leslie speaker emulations, various physical characteristics of the effect are not taken into account, such as the frequency-dependent directivity of the loudspeaker and the effect of the wooden cabinet.
In [18], graybox modeling was proposed for linear timevarying audio effects. This differs from whitebox modeling, since the method was based on inputoutput measurements but the timevarying filters were based on knowledge of analog phasers. In this way, phaser emulation was achieved by multiple measurements of the impulse response of a cascade of allpass filters.
2.2 Endtoend deep neural networks
Endtoend deep learning is based on the idea that an entire problem can be taken as a single indivisible task which must be learned from endtoend. Deep learning architectures using this principle have recently been researched in the music information retrieval field [21, 22, 23], since the amount of required prior knowledge may be reduced and engineering effort minimized by learning directly from raw audio [24]. Recent work also demonstrated the feasibility of these architectures for audio synthesis and audio effects modeling. [25, 26] proposed models that synthesize audio waveforms and [27] obtained a model capable of performing singing voice synthesis.
End-to-end deep neural networks for audio effects modeling were implemented in [28], where Equalization (EQ) matching was achieved with convolutional neural networks (CNN). Also, [29] presented a deep learning architecture for modeling nonlinear processors such as distortion, overdrive and amplifier emulation. The DNN is capable of modeling an arbitrary combination of linear and nonlinear memoryless audio effects, but does not generalize to transformations with long temporal dependencies such as modulation based audio effects.
3 Methods
3.1 Model
The model operates entirely in the time domain, with raw audio as the input and processed audio as the output. It is divided into three parts: adaptive front-end, latent-space and synthesis back-end. A block diagram can be seen in Fig. 1 and its structure is described in detail in Table 1. We build on the architecture from [29]: we incorporate BiLSTMs into the latent-space and modify the structure of the synthesis back-end in order to allow the model to learn nonlinear time-varying transformations.
3.2 Adaptive frontend
The frontend performs timedomain convolutions with the incoming audio. It follows a filter bank architecture and is designed to learn a latent representation for each audio effect modeling task. It consists of a convolutional encoder which contains two CNN layers, one pooling layer and one residual connection. This residual connection is used by the backend to facilitate the synthesis of the waveform based on the specific timevarying transformation.
In order to allow the model to learn longterm memory dependencies, the input consists of the current audio frame concatenated with the previous and subsequent frames. These frames are of size and sampled with a hop size . The input is described by (1).
Table 1. Structure of the model.
Layer           Output shape     Units    Output
Input           (9, 4096, 1)     .
Conv1D          (9, 4096, 32)    32(64)
Residual        (4096, 32)       .
Abs             (9, 4096, 32)    .        .
Conv1DLocal     (9, 4096, 32)    32(128)  .
Softplus        (9, 4096, 32)    .
MaxPooling      (9, 64, 32)      .
BiLSTM          (64, 128)        64       .
BiLSTM          (64, 64)         32       .
BiLSTM          (64, 32)         16       .
SAAF            (64, 32)         25
Unpooling       (4096, 32)       .
Multiply        (4096, 32)       .
Dense           (4096, 32)       32       .
Dense           (4096, 16)       16       .
Dense           (4096, 16)       16       .
Dense           (4096, 32)       32       .
SAAF            (4096, 32)       25       .
Abs             (4096, 32)       .        .
Global Average  (1, 32)          .        .
Dense           (1, 512)         512      .
Dense           (1, 32)          32       .
Multiply        (4096, 32)       .
Add             (4096, 32)       .
deConv1D        (4096, 1)        .
(1) 
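The construction of the input from the current frame and its context frames can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the function name context_frames, the zero-filling at the signal boundaries, and the parameter names are assumptions. The input shape in Table 1, (9, 4096, 1), would correspond to frame_size=4096 and k=4 in this sketch; the hop size is left as a free parameter.

```python
import numpy as np

def context_frames(x, frame_size, hop, k):
    """Return an array of shape (n_frames, 2*k + 1, frame_size).

    Entry [j] stacks frame j with its k previous and k subsequent frames;
    positions beyond the signal boundaries are zero-filled (an assumption).
    """
    n_frames = 1 + max(0, len(x) - frame_size) // hop
    pad = k * hop
    xp = np.concatenate([np.zeros(pad), x, np.zeros(pad + frame_size)])
    out = np.empty((n_frames, 2 * k + 1, frame_size))
    for j in range(n_frames):
        for c in range(-k, k + 1):
            start = pad + (j + c) * hop
            out[j, c + k] = xp[start:start + frame_size]
    return out
```

The middle row of each entry (index k) is the current frame; the remaining rows give the model access to past and future context without enlarging the frame itself.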
The first convolutional layer has 32 one-dimensional filters of size 64 (Table 1) and is followed by the absolute value as nonlinear activation function. The operation performed by the first layer can be described by (2).
(2)
Where is the feature map after the input audio is convolved with the kernel matrix . is the corresponding row in for the frequency decomposition of the current input frame . The backend does not directly receive information from the past and subsequent context frames. The second layer has 32 filters of size 128 (Table 1) and each filter is locally connected. We follow a filter bank architecture since each filter is only applied to its corresponding row in and so we significantly decrease the number of trainable parameters. This layer is followed by the softplus nonlinearity [30], described by (3).
(3) 
Where is the second feature map obtained after the local convolution with , the kernel matrix of the second layer. The maxpooling operation is a moving window of size applied over , where the maximum value within each window corresponds to the output.
By using the absolute value as activation function of the first layer and by having larger filters , we expect the front-end to learn smoother representations of the incoming audio, such as envelopes [22]. All convolutions and pooling operations are time distributed, i.e. the same convolution or pooling operation is applied to each of the input frames.
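The front-end operations described in this section can be summarised for a single frame in the sketch below. It is a simplified NumPy illustration under several assumptions: biases are omitted, "same" padding is used so that the time dimension is preserved as in Table 1, the pooling windows do not overlap, and the names front_end, w1, w2 and pool are hypothetical. In the model the same operations are applied to every context frame.

```python
import numpy as np

def softplus(v):
    # Numerically stable log(1 + exp(v)).
    return np.logaddexp(0.0, v)

def front_end(frame, w1, w2, pool):
    """frame: (N,), w1: (C, K1), w2: (C, K2).

    Returns the residual of shape (N, C) and the pooled
    latent representation of shape (N // pool, C).
    """
    C = w1.shape[0]
    # First layer: every filter is applied to the raw frame; the result is the residual.
    # np.convolve flips the kernel, which a learned kernel would simply absorb.
    x1 = np.stack([np.convolve(frame, w1[c], mode="same") for c in range(C)], axis=1)
    # Second layer: filter c only sees channel c of |x1| (locally connected, filter bank style).
    x2 = np.stack(
        [np.convolve(np.abs(x1[:, c]), w2[c], mode="same") for c in range(C)], axis=1
    )
    x2 = softplus(x2)
    # Non-overlapping max pooling along the time axis.
    n_out = x2.shape[0] // pool
    z = x2[: n_out * pool].reshape(n_out, pool, C).max(axis=1)
    return x1, z
```

Because each second-layer filter is tied to one channel, the number of second-layer kernels grows linearly with the number of channels instead of quadratically, which is the parameter saving mentioned above.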
3.3 Bidirectional LSTMs
The latent-space consists of three BiLSTM layers of 64, 32 and 16 units respectively (Table 1). BiLSTMs are a type of recurrent neural network that can access long-term context from both backward and forward directions [31]. BiLSTMs are capable of learning long temporal dependencies when processing time series where the context of the input is needed [32].
The BiLSTMs process the latent-space representation . is learned by the front-end and contains information regarding the input frames. These recurrent layers are trained to reduce the dimension of , while also learning a nonlinear modulation . This new latent representation is fed into the synthesis back-end in order to reconstruct an audio signal that matches the time-varying task. Each BiLSTM has dropout and recurrent dropout values of and the first two layers have the hyperbolic tangent as activation function.
The performance of CNNs in regression tasks has improved by using adaptive activation functions [33]. We therefore add a Smooth Adaptive Activation Function (SAAF) as the nonlinearity for the last layer. SAAFs consist of piecewise second order polynomials which can approximate any continuous function and are regularized under a Lipschitz constant to ensure smoothness. As shown in [29], SAAFs can be used within deep neural networks to model nonlinearities in audio processing tasks.
3.4 Synthesis backend
The synthesis back-end accomplishes the reconstruction of the target audio by processing the current input frame and the nonlinear modulation . The back-end consists of an unpooling layer, a DNN block with SAAF and Squeeze-and-Excitation (SE) [34] layers (DNN-SAAF-SE) and a single CNN layer.
The DNN-SAAF-SE block consists of four fully connected (FC) layers of 32, 16, 16 and 32 hidden units respectively (Table 1). Each FC layer is followed by the hyperbolic tangent function except for the last one, which is followed by a SAAF layer. Overall, each SAAF layer is locally connected and each function consists of segments between to .
The SE blocks explicitly model interdependencies between channels by adaptively scaling the channel-wise information of feature maps [34]. The SE dynamically scales each of the 32 channels and its structure is as in [35]. It consists of a global average pooling operation followed by two FC layers of 512 and 32 hidden units respectively (Table 1). The FC layers are followed by rectified linear unit (ReLU) and sigmoid activation functions respectively. Since the feature maps of the model are based on time-domain waveforms, we incorporate an absolute value layer before the global average pooling operation.
The back-end matches the time-varying task by the following steps. First, a discrete approximation of () is obtained by an upsampling operation. Then the feature map is the result of the element-wise multiplication of the residual connection and . This can be seen as a frequency dependent amplitude modulation between the learned modulator and the frequency decomposition .
(4) 
The feature map is obtained when the nonlinear and channelwise scaled filters from the DNNSAAFSE block are applied to the modulated frequency decomposition . Then, is added back to , acting as a nonlinear delayline.
(5) 
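The back-end steps described above (upsampling, channel-wise modulation, the dense block with SE scaling, and the final addition) can be sketched as follows. This is a simplified illustration and not the authors' implementation: biases are omitted, the SAAF of the last dense layer is replaced by a hyperbolic tangent as a stand-in, upsampling is done by sample repetition, and all function and variable names are hypothetical.

```python
import numpy as np

def se_scale(h, w_a, w_b):
    """Squeeze-and-excitation: per-channel gains in (0, 1) from the mean of |h| over time."""
    s = np.abs(h).mean(axis=0)                 # squeeze: (C,)
    s = np.maximum(0.0, s @ w_a)               # FC + ReLU: (H,)
    g = 1.0 / (1.0 + np.exp(-(s @ w_b)))       # FC + sigmoid: (C,)
    return h * g                               # excite: gains broadcast over time

def back_end(x1, z, dense_weights, w_a, w_b):
    """x1: (N, C) residual, z: (N // pool, C) learned modulator; N must be a multiple of len(z)."""
    pool = x1.shape[0] // z.shape[0]
    z_up = np.repeat(z, pool, axis=0)          # upsampling by sample repetition
    x1_mod = x1 * z_up                         # channel-wise amplitude modulation
    h = x1_mod
    for w in dense_weights:                    # tanh stands in for the SAAF of the last layer
        h = np.tanh(h @ w)
    x2_mod = se_scale(h, w_a, w_b)
    return x1_mod + x2_mod                     # input to the transposed first-layer convolution
```

The additive path means that the dense block only has to learn a correction on top of the modulated frequency decomposition, rather than the whole transformation.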
The last layer corresponds to the deconvolution operation, which can be implemented by transposing the first layer transform. This layer is not trainable since its kernels are transposed versions of . In this way, the back-end reconstructs the audio waveform in the same manner that the front-end decomposed it. The complete waveform is synthesized using a Hann window and constant overlap-add gain.
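The synthesis of the complete waveform from overlapping output frames can be sketched as below. Note the assumption: instead of dividing by a constant overlap-add gain, this sketch divides by the summed window at every sample, which gives the same result wherever the summed window is constant and also handles the edges. The function name overlap_add is hypothetical.

```python
import numpy as np

def overlap_add(frames, hop):
    """Sum Hann-windowed frames at multiples of hop and divide by the summed window."""
    n_frames, size = frames.shape
    win = np.hanning(size)
    out = np.zeros((n_frames - 1) * hop + size)
    norm = np.zeros_like(out)
    for j, frame in enumerate(frames):
        out[j * hop:j * hop + size] += win * frame
        norm[j * hop:j * hop + size] += win
    # Guard against the zero-valued window sum at the very first and last samples.
    return out / np.maximum(norm, 1e-8)
```

If every frame is an unmodified slice of the same signal, this procedure returns that signal, so any difference at the output is due to the processing applied to the frames.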
All convolutions are along the time dimension and all strides are of unit value. The models have approximately trainable parameters, which, within a deep learning context, represents a model that is not very large or difficult to train.
3.5 Training
The training of the model is performed in two steps. The first step is to train only the convolutional layers for an unsupervised learning task, while the second step consists of an end-to-end supervised learning task based on a given time-varying target. During the first step only the weights of Conv1D and Conv1DLocal are trained and both the raw audio and wet audio are used as input and target functions. This means the model is being prepared to reconstruct the input and target data in order to have a better fitting when training for the time-varying task. Only during this step, the unpooling layer of the back-end uses the time positions of the maximum values recorded by the max-pooling operation.
Once the model is pretrained, the BiLSTM and DNN-SAAF-SE layers are incorporated into the model, and all the weights of the convolutional, recurrent, dense and activation layers are updated. Since small amplitude errors are as important as large ones, the loss function to be minimized is the mean absolute error between the target and output waveforms. We explore input size frames from to samples and we always use a hop size of . The batch size consisted of the total number of frames per audio sample.
Adam is used as optimizer and we perform the pretraining for epochs and the supervised training for epochs. During the second training step, we start with a learning rate of and we reduce it by every epochs. We select the model with the lowest error for the validation subset.
3.6 Dataset
Modulation based audio effects such as chorus, flanger, phaser, tremolo and vibrato were obtained from the IDMT-SMT-Audio-Effects dataset [36]. It corresponds to individual 2-second notes and covers the common pitch range of various 6-string electric guitars and 4-string bass guitars.
The recordings include the raw notes and their respective effected versions for different settings for each effect. For our experiments we use processed and unprocessed audio for bass guitar and the setting for each of the effects. In addition, processing the bass guitar raw audio, we implement an autowah with a peak filter whose center frequency ranges from Hz to kHz and is modulated by a Hz.
Since the previous audio effects are linear timevarying, we further test the capabilities of the model by adding a nonlinearity to each of these effects. Thus, using the bass guitar wet audio, we applied an overdrive (gaindB) after each modulation based effect.
We also use virtual analog implementations of a ring modulator and a Leslie speaker to process the electric guitar raw audio. The ring modulator implementation (https://github.com/nrlakin/robot_voice/blob/master/robot.py) is based on [15] and we use a modulator signal of Hz. The Leslie speaker implementation (https://ccrma.stanford.edu/software/snd/snd/leslie.cms) is based on [16] and we model each of the stereo channels.
Finally, we also explore the capabilities of the model with nonlinear time-invariant audio effects with long temporal dependencies, such as compressors and auto-wah. We use the compressor and multiband compressor from SoX (http://sox.sourceforge.net/) to process the electric guitar raw audio. The settings of the compressor are as follows: attack time 10 ms, release time 100 ms, knee 1 dB, ratio 4:1 and threshold 40 dB. The multiband compressor has 2 bands with a crossover frequency of Hz, attack time: ms and s, decay time: ms and ms, knee: dB and dB, ratio: : and : and threshold: dB and dB.
Similarly, we use an auto-wah implementation (https://github.com/lucieperrotta/ASP) with an envelope follower and a peak filter whose center frequency is modulated between Hz and kHz.
For each time-varying task we use raw and effected notes and the test and validation samples correspond to of this subset. The recordings were downsampled to kHz and amplitude normalization was applied, with the exception of the time-invariant audio effects.
3.7 Evaluation
Two metrics were used when testing the models with the various test subsets. Since the mean absolute error depends on the amplitude of the output and target waveforms, before calculating this error, we normalize the energy of the target and the output and define it as the energynormalized mean absolute error (mae).
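A minimal sketch of the energy-normalized error is given below. The exact normalization is not specified above, so scaling both waveforms to unit energy is an assumption, as are the function name and the small constant eps used to avoid division by zero.

```python
import numpy as np

def energy_normalized_mae(target, output, eps=1e-12):
    """Mean absolute error after scaling both waveforms to unit energy."""
    t = target / np.sqrt(np.sum(target ** 2) + eps)
    o = output / np.sqrt(np.sum(output ** 2) + eps)
    return np.mean(np.abs(t - o))
```

With this normalization, an output that differs from the target only by a positive gain factor yields an error close to zero, so the metric reflects differences in waveform shape and not in overall level.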
We also propose an objective metric which mimics human perception of amplitude and frequency modulation. The modulation spectrum uses timefrequency theory integrated with the psychoacoustics of modulation frequency perception, thus, providing longterm knowledge of temporal fluctuation patterns [37]. We propose the modulation spectrum euclidean distance (msed), which is based on the audio features from [38] and [39] and is defined as follows:

A Gammatone filter bank is applied to the target and output entire waveforms. In total we use filters, with center frequencies spaced logarithmically from Hz to Hz.

The envelope of each filter output is calculated via the magnitude of the Hilbert transform and downsampled to Hz.

A Modulation filter bank is applied to each envelope. In total we use filters, with center frequencies spaced logarithmically from Hz to Hz.

The Fast Fourier Transform (FFT) is calculated for each modulation filter output of each Gammatone filter. The energy is normalized by the DC value and summarized in the following bands:
 Hz,  Hz,  Hz and  Hz. 
The msed metric is the mean euclidean distance between the energy values at these bands.
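The envelope and modulation analysis steps listed above can be illustrated with a reduced example that skips the Gammatone and modulation filter banks and works on a single band. It computes the Hilbert envelope with the FFT and then locates the strongest component of the envelope spectrum. It is a simplified sketch and not the msed metric itself; the function names are hypothetical.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    # Zero the negative frequencies, double the positive ones, and take the magnitude.
    return np.abs(np.fft.ifft(spec * h))

def modulation_peak_hz(x, sr):
    """Frequency of the strongest non-DC component of the envelope spectrum."""
    env = hilbert_envelope(x)
    mag = np.abs(np.fft.rfft(env - env.mean()))
    return np.fft.rfftfreq(len(env), d=1.0 / sr)[np.argmax(mag)]
```

For a tone whose amplitude is modulated by a slow sinusoid, the peak of the envelope spectrum falls at the modulation rate, which is the kind of information the band energies of the modulation spectrum summarise.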
4 Results & Analysis
First, we explore the capabilities of BiLSTMs to learn long-term temporal dependencies. Fig. 2 shows the results of the test dataset for different input frame sizes and various linear time-varying tasks. The best results are obtained with an input size of samples, since shorter frame sizes yield a higher error and samples do not yield a significant improvement. Since the average modulation frequency in our tasks is Hz, for each input size we select a that covers one period of this modulator signal. Thus, for the rest of our experiments, we use an input size of samples and for the number of past and subsequent frames.
Fig. 4 visualizes the functioning of the model for the tremolo task. It shows how the model processes the input frame into the different frequency maps and , learns a modulator signal , and applies the respective amplitude modulation. This linear timevarying audio effect is easy to interpret. For more complex nonlinear timevarying effects, a more indepth analysis of the model is required.
The training procedures were performed for each type of time-varying and time-invariant audio effect and the audio results are available online (https://mchijmma.github.io/modelingtimevarying/). Fig. 3 shows the mae and msed for all the test subsets. To provide a reference, the mean mae and msed values between input and target waveforms are and respectively. It can be seen that the model performed well on each audio effect modeling task. Overall, the model achieved better results with amplitude modulation and time-varying filter audio effects, although delay-line based effects were also successfully modeled.
For selected linear and nonlinear timevarying tasks, Fig. 5 shows the input, target, and output waveforms together with their respective modulation spectrum. In the timedomain, it is evident that the model is matching the target waveform. From the modulation spectrum it is noticeable that the model introduces different modulation energies into the output which were not present in the input and which closely match those of the respective targets.
The task becomes more challenging when a nonlinearity is added to a linear time-varying transformation. Fig. 5(d) depicts results for the phaser-overdrive task. Given the large overdrive gain, the resulting audio has a lower-frequency modulation. It can be seen that the model introduces modulations as low as Hz. However, the waveform is not as smooth as the target, hence the larger values. Although the increased, the model does not significantly reduce performance and is able to match the combination of nonlinear and modulation based audio effects.
Much more complicated time-varying tasks, such as the ring modulator and Leslie speaker virtual analog implementations, were also successfully modeled. This represents a significant result, since these implementations include nonlinear modulation (ring modulator) or varying delay lines together with artificial reverberation and Doppler effect simulation (Leslie speaker).




Lastly, the model is also able to perform linear and nonlinear time-invariant modeling. The long temporal dependencies of an envelope-driven auto-wah, compressor and multiband compressor are successfully modeled. Furthermore, in the latter case, the crossover filters are also matched. The msed may not be relevant for these effects, but the low mae values indicate that the model also performs well here.
5 Conclusion
In this work, we introduced a generalpurpose deep learning architecture for modeling audio effects with long temporal dependencies. Using raw audio and a given timevarying task, we explored the capabilities of endtoend deep neural networks to learn lowfrequency modulations and to process the audio accordingly. The model was able to match linear and nonlinear timevarying audio effects, timevarying virtual analog implementations and timeinvariant audio effects with longterm memory.
Other white-box or gray-box modeling methods suitable for these time-varying tasks would require expert knowledge such as specific circuit analysis and discretization techniques. Moreover, these methods cannot easily be extended to other time-varying tasks, and assumptions are often made regarding the nonlinear behavior of certain components. To the best of our knowledge, this work represents the first black-box modeling method for linear and nonlinear, time-varying and time-invariant audio effects. It makes fewer assumptions about the audio processor target and represents an improvement of the state-of-the-art in audio effects modeling.
We showed the model matching chorus, flanger, phaser, tremolo, vibrato, auto-wah, ring modulator, Leslie speaker and compressors. We proposed an objective perceptual metric to measure the performance of the model. The metric is based on the Euclidean distance between the frequency bands of interest within the modulation spectrum. We demonstrated that the model processes the input audio by applying different modulations which closely match those of the time-varying target.
Perceptually, most output waveforms are indistinguishable from their target counterparts, although there are minor discrepancies at the highest frequencies and noise level. This could be improved by using more convolution filters, which means a higher resolution in the filter bank structures [29]. Moreover, as shown in [28], a cost function based on time and frequency can be used to improve this frequency related issue, though additional listening tests may be required.
The generalization can also be studied more thoroughly, since the model learns to apply the specific transformation to the audio of a specific musical instrument, such as the electric guitar or the bass guitar. In addition, since the model strives to learn long temporal dependencies with shorter input size frames, and also needs past and subsequent frames, more research is needed on how to adapt this architecture to realtime implementations.
Real-time applications would benefit significantly from the exploration of recurrent neural networks to model transformations that involve long-term memory without resorting to large input frame sizes and the need for past and future context frames. Although the model was able to match the artificial reverberation of the Leslie speaker implementation, a thorough exploration of reverberation modeling is needed, such as plate, spring or convolution reverberation. In addition, since the model is learning a static representation of the audio effect, ways of devising a parametric model could also be explored. Finally, applications beyond virtual analog can be investigated, for example, in the field of automatic mixing [40] the model could be trained to learn a generalization from mixing practices.
6 Acknowledgments
The Titan Xp GPU used for this research was donated by the NVIDIA Corporation. EB is supported by a RAEng Research Fellowship (RF/128).
References
 [1] Joshua D Reiss and Andrew McPherson, Audio effects: theory, implementation and application, CRC Press, 2014.
 [2] Udo Zölzer, DAFX: digital audio effects, John Wiley & Sons, 2011.
 [3] Julius Orion Smith, Physical audio signal processing: For virtual musical instruments and audio effects, W3K Publishing, 2010.
 [4] Clifford A Henricksen, “Unearthing the mysteries of the leslie cabinet,” Recording Engineer/Producer Magazine, 1981.
 [5] Jyri Pakarinen and David T Yeh, “A review of digital techniques for modeling vacuumtube guitar amplifiers,” Computer Music Journal, vol. 33, no. 2, pp. 85–100, 2009.
 [6] Dimitrios Giannoulis, Michael Massberg, and Joshua D Reiss, “Digital dynamic range compressor design  a tutorial and analysis,” Journal of the Audio Engineering Society, vol. 60, no. 6, pp. 399–408, 2012.
 [7] David T Yeh, Jonathan S Abel, and Julius O Smith, “Automated physical modeling of nonlinear audio circuits for realtime audio effects part I: Theoretical development,” IEEE transactions on audio, speech, and language processing, vol. 18, no. 4, pp. 728–737, 2010.
 [8] Antti Huovilainen, “Enhanced digital models for analog modulation effects,” in 8th International Conference on Digital Audio Effects (DAFx05), 2005.
 [9] Colin Raffel and Julius Smith, “Practical modeling of bucketbrigade device circuits,” in 13th International Conference on Digital Audio Effects (DAFx10), 2010.
 [10] Martin Holters and Julian D Parker, “A combined model for a bucket brigade device and its input and output filters,” in 21st International Conference on Digital Audio Effects (DAFx17), 2018.
 [11] Jaromír Mačák, “Simulation of analog flanger effect using BBD circuit,” in 19th International Conference on Digital Audio Effects (DAFx-16), 2016.
 [12] David T Yeh, “Automated physical modeling of nonlinear audio circuits for real-time audio effects part II: BJT and vacuum tube examples,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, 2012.
 [13] Martin Holters and Udo Zölzer, “Physical modelling of a wah-wah effect pedal as a case study for application of the nodal DK method to circuits with variable parts,” in 14th International Conference on Digital Audio Effects (DAFx-11), 2011.
 [14] Felix Eichas et al., “Physical modeling of the MXR Phase 90 guitar effect pedal,” in 17th International Conference on Digital Audio Effects (DAFx-14), 2014.
 [15] Julian Parker, “A simple digital model of the diode-based ring-modulator,” in 14th International Conference on Digital Audio Effects (DAFx-11), 2011.
 [16] Julius Smith et al., “Doppler simulation and the Leslie,” in 5th International Conference on Digital Audio Effects (DAFx-02), 2002.
 [17] Jussi Pekonen, Tapani Pihlajamäki, and Vesa Välimäki, “Computationally efficient Hammond organ synthesis,” in 14th International Conference on Digital Audio Effects (DAFx-11), 2011.
 [18] Roope Kiiski, Fabián Esqueda, and Vesa Välimäki, “Time-variant gray-box modeling of a phaser pedal,” in 19th International Conference on Digital Audio Effects (DAFx-16), 2016.
 [19] Kurt J Werner, W Ross Dunkel, and François G Germain, “A computational model of the Hammond organ vibrato/chorus using wave digital filters,” in 19th International Conference on Digital Audio Effects (DAFx-16), 2016.
 [20] Ólafur Bogason and Kurt James Werner, “Modeling circuits with operational transconductance amplifiers using wave digital filters,” in 20th International Conference on Digital Audio Effects (DAFx-17), 2017.
 [21] Jordi Pons et al., “End-to-end learning for music audio tagging at scale,” in 31st Conference on Neural Information Processing Systems (NIPS), 2017.
 [22] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, “Adaptive front-ends for end-to-end source separation,” in 31st Conference on Neural Information Processing Systems (NIPS), 2017.
 [23] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” in 19th International Society for Music Information Retrieval Conference, 2018.
 [24] Sander Dieleman and Benjamin Schrauwen, “End-to-end learning for music audio,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
 [25] Soroush Mehri et al., “SampleRNN: An unconditional end-to-end neural audio generation model,” in 5th International Conference on Learning Representations. ICLR, 2017.
 [26] Jesse Engel et al., “Neural audio synthesis of musical notes with WaveNet autoencoders,” in 34th International Conference on Machine Learning, 2017.
 [27] Merlijn Blaauw and Jordi Bonada, “A neural parametric singing synthesizer,” in Interspeech, 2017.
 [28] Marco A. Martínez Ramírez and Joshua D. Reiss, “End-to-end equalization with convolutional neural networks,” in 21st International Conference on Digital Audio Effects (DAFx-18), 2018.
 [29] Marco A. Martínez Ramírez and Joshua D. Reiss, “Modeling of nonlinear audio effects with end-to-end deep neural networks,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
 [30] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, “Deep sparse rectifier neural networks,” in 14th International Conference on Artificial Intelligence and Statistics, 2011.
 [31] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.
 [32] Alex Graves and Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, 2005.
 [33] Le Hou et al., “ConvNets with smooth adaptive activation functions for regression,” in Artificial Intelligence and Statistics, 2017, pp. 430–439.
 [34] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [35] Taejun Kim, Jongpil Lee, and Juhan Nam, “Sample-level CNN architectures for music auto-tagging using raw waveforms,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
 [36] Michael Stein et al., “Automatic detection of audio effects in guitar and bass recordings,” in 128th Audio Engineering Society Convention, 2010.
 [37] Somsak Sukittanon, Les E Atlas, and James W Pitton, “Modulation-scale analysis for content identification,” IEEE Transactions on Signal Processing, vol. 52, 2004.
 [38] Josh H McDermott and Eero P Simoncelli, “Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis,” Neuron, vol. 71, 2011.
 [39] Martin McKinney and Jeroen Breebaart, “Features for audio and music classification,” 2003.
 [40] Brecht De Man, Joshua D Reiss, and Ryan Stables, “Ten years of automatic mixing,” in Proceedings of the 3rd Workshop on Intelligent Music Production, 2017.