The ability to digitally model musical instruments and audio effects allows for multiple desirable properties,(Smith, 2011) among which are i) portability – virtual instruments and software effects require no space or weight; ii) flexibility – many such effects can be stored and accessed together and quickly modified; iii) signal to noise – often can be higher with digital effects; iv) centralized, automated control; v) repeatability – digital effects can be exactly the same, as opposed to physical systems which may require calibration; and vi) extension – the development of digital effects involves fewer constraints than their real-world counterparts.
The process of constructing such models has traditionally been performed using one of two main approaches. One approach is the physical simulation of the processes involved,(Smith, 2011) whether these be acoustical processes such as reverberation(Välimäki et al., 2016) or “virtual analog modeling” of circuit elements.(Yeh et al., 2008; Pakarinen et al., 2011; Eichas, Gerat, and Zölzer, 2017) The other main approach has been to emulate the requisite audio features via signal processing techniques which seek to capture the salient aspects of the sounds and transformations under consideration. Both of these approaches are typically performed with the goal of faithfully reproducing one particular effect, such as audio compressors.Stikvoort (1986); Kröning, Dempwolf, and Zölzer (2011); Floru (1999); Gerat, Eichas, and Zölzer (2017); Simmer, Schmidt, and Bitzer (2006)
Rather than modeling one particular effect, a different class of systems are those which can ‘profile’ and ‘learn’ to mimic the tonal effects of other units. One popular commercial example is the Kemper Profiler Amplifier,(GmbH, ) which can learn to emulate the sounds of amplifiers and speaker cabinets in the user’s possession, to enable them to store and easily transport a virtual array of analog gear. Another product in this category is the “ToneMatch” feature of Fractal Audio’s Axe-Fx,(axe, ) which supplies a large number of automatically-tunable pre-made effects units including reverberation, delay, equalization, and formant processing.
The present paper involves efforts toward the goal of profiling ‘general’ audio effects. For systems which are linear and time-invariant (LTI), one can develop finite impulse response (FIR) filters, e.g., for convolution reverb effects. But for systems which involve nonlinearity and/or time-dependence, more sophisticated approaches are required.
Deep learning has demonstrated great utility at such diverse audio signal processing tasks as classification,(Choi et al., 2017; Fonseca et al., 2017) onset detection,(Schlüter and Böck, 2014) source separation,(Huang et al., 2015) event detection,(Stowell et al., 2015) dereverberation,(Han et al., 2015; Arifianto and Farid, 2018) denoising,(Rethage, Pons, and Serra, 2018a)
formant estimation,(Dissen, Goldberger, and Keshet, 2019) remixing,(Pons et al., 2016) and synthesis,(Engel et al., 2017; Wang et al., 2017; Sotelo et al., 2017; Donahue, McAuley, and Puckette, 2018) as well as dynamic range compression to automate the mastering process.(Mimilakis et al., 2016) In the area of audio component modeling, deep learning has been used to model tube amplifiers(Damskägg et al., 2018) and most recently guitar distortion pedals.(Damskägg, Juvela, and Välimäki, 2019) Besides creating specific effects, efforts have been underway to explore how varied are the types of effects which can be learned from a single model,(Martínez and Reiss, 2018) to which this paper comprises a contribution.
A challenging goal in deep learning audio processing is to devise models that operate directly on the raw waveform signals, in the time domain, known as “end-to-end” models.(Dieleman and Schrauwen, 2014) Given that the raw waveform data exists in the time domain, there are questions as to whether an end-to-end formulation is most suitable,(Lluís, Pons, and Serra, 2018)
however it has been shown to be useful nevertheless. Our approach is end-to-end, however, we make use of a spectral representation within the autoencoders and for regularization.
Our efforts in this array have been focused on modeling dynamic range compressors for three reasons: 1. They constitute a ‘desirable’ problem to solve: Many audio production practitioners rely on the specific operational characteristics of certain compression units, and as such compression modeling is a key area of interest for practical application. 2. They constitute a ‘hard’ problem to solve: As noted earlier, existing methods are sufficient to implement a variety of linear and/or time-independent effects. Our own deep learning investigations (unpublished) demonstrated that effects such as echo or distortion could be modeled via Long Short-Term Memory (LSTM) cells, but compressors proved to be ‘unlearnable’ to our networks. This present work therefore describes one solution to this problem. 3. It is our contention that compressors represent a set of capabilities that would be required for modeling more general effects.
In this study we are not estimating compressor parameters,(Simmer, Schmidt, and Bitzer, 2006) although deep neural networks have recently shown proficiency at this task as well.(Sheng and Fazekas, 2019) Rather, being given parameters associated with input-output pairs of audio data, we synthesize audio by means of a network which learns to emulate given mappings subject to these parameters. The hope is that by performing well on the challenging problem of dynamic range compression, such a network could also prove useful for learning other audio effects as well.
It is a common occurrence for researchers in machine learning to associate memorable nomenclature to denote their models.(Wang et al., 2017; Van Den Oord et al., 2016) Given that the goal our system is successively approximate the audio signal chain through a process of training, we refer to the computer code as SignalTrain.111Source code and datasets accompanying this paper will be publicly released on GitHub.com pending acceptance for publication.
This paper proceeds as follows: In Section II, we describe the problem specification, the design of the neural network architecture, its training procedure, and the dataset. In Section III we relate results for two compressor models, one digital and one analog. Finally we offer some conclusions in section IV and outline some avenues for future work.
Ii System Design
ii.1 Problem Specification
The objective is to accurately model the input-output characteristics of a wide range of musical signal processing effects, and their parameterized controls, in a model-agnostic manner. That is to say, not to merely infer certain control parameters which are then used in conjunction with pre-made internal effect modules (e.g., as is done by Axe-FX.)(axe, ) We apply our method to the case of compressors in this paper, but we operate no internal compressor model – the system learns what a compressor is in the course of training using a large variety of training signals and control settings.
We conceive of the task as a supervised learning regression problem, performed in an end-to-end manner. While other approaches have made use of techniques such as
-law companding and one-hot encoding to formulate the task as a classification problem,(Van Den Oord et al., 2016) we have not yet done so. Rather than predicting one audio sample (i.e., time-series value) at a time, we map a range of inputs to a range of outputs, i.e., we window the audio into “windows.” This allows for both speed in computation as well as the potential for modeling non-causal behavior such as reverse-audio effects or time-alignment. 222The system could be modified to predict one sample at a time, however our experience with this model has found this practice to be neither necessary nor helpful.
ii.1.2 Compressor Effects Used
We trained against two software compressors, with similar controls but different time scales. The effect we designate “Comp-4C” which operates in a sequential manner (later samples explicitly depend on earlier samples) and has four controls for Threshold, Ratio, Attack and Release. The other formulation, “Comp-4C-Large,” allows for wider ranges of the control parameters. For an analog effect we used a Universal Audio LA-2A, output audio for a wide range of input audio as we varied the Peak Reduction knob and the Comp/Lim switch. (The input and output gain knobs were left fixed in the creation of the dataset.) These effects are summarized in Table 1.
|Effect Name||Type||Controls: Ranges|
|Comp-4C||Software||Threshold: -30–0 dB
Attack: 1–40 ms
Release: 1–40 ms
Attack: 1–1000 ms
Release: 1–1000 ms
|LA-2A||Analog||Comp/Lim Switch: 0/1
Peak Reduction: 0–100
ii.1.3 Data Specification and Error Estimates
Typically audio effects are applied to an entire “stream” of data from beginning to end, yet it is not uncommon for digital audio processors to be presented with only a smaller “window” (also referred to as a “chunk,” “frame”, “input buffer,” etc.) of the most recent audio, of a duration usually determined by computational requirements such as memory and/or latency. For time-dependent effects such as compressors, the size of the window can have repercussions as information preceding the window boundary will necessarily propagate into the window currently under consideration. This introduces a concern over “causality,” occurring over a timescale given by the exponential decay due to the compressor’s attack and release controls.
This suggests two different ways to approach training, and two different ways to specify the dataset of pairs of input audio and target output audio. The first we refer to as “streamed target” (ST) data, which is the usual method of applying the audio effect to the entire stream at once. The second we refer to “windowed target” (WT) data, in which the effect is applied sequentially to individual windows of input. WT data will necessarily contain transient errors (compared to ST data) occurring on a frequency of the inverse of the window duration. If however one adds a “lookback buffer,” i.e. making the length of the output shorter than that of the input, then this “lookback” can be chosen to be large enough that transient errors in the WT data decay (exponentially) below the “noise floor” before the output is generated. The goal of this study is to produce ST data as accurately as possible, as it corresponds to the normal application of audio effects, but WT data is in some sense “easier” to learn. Indeed, in our early attempts with the LA-2A compressor and ST data, the model was not able to learnat all, because the lookback buffer was not long enough.
The difference between ST and WT data constitutes a lower bound on the error produced by our neural network model: we do not expect the model to perform better than the “true” effect applied to WT data. The dependence of this error bound on the size of the lookback buffer can be estimated in a straightforward way, and can provide guidance on the size of buffer that should be used when training the model. Such estimates are shown in Figure 1. In order to allow for low enough error while not putting too great a strain on computational resources, we will choose model sizes with lookback windows sufficient to allow a lower bound on the error in the range of to .
ii.2 Model Specification
ii.2.1 Network architecture.
The architecture of the SignalTrain model consists of the front-end, the autoencoder-like module, and the back-end module. The proposed architecture shares some similarities with the U-NetRonneberger, Fischer, and Brox (2015) and TFNetYian Lim et al. (2018)
architectures. In more details, the front-end module is comprised by a set of two 1-D convolution operators that are responsible for producing a signal sub-space similar to a time-frequency decomposition, yielding magnitude and phase features. The autoencoder module consists of two deep neural networks for processing individually the magnitude and phase information of the front-end module. Each deep neural network in this autoencoder consists of 7 fully connected, feed-forward neural networks (FC). It should be denoted that the “bottleneck” latent space of each deep neural network is additionally conditioned on the control variables of the audio effect module that are represented as one-hot encoded vectors.
Figure 2 illustrates the neural network architecture for the SignalTrain model, which essentially learns a mapping function from the un-processed to the processed audio, by the audio effect to be profiled, and is conditioned on the vector of the effect’s controls (e.g., the “knobs”). In order to obtain the predicted output waveform, the back-end module uses another set of two 1-D transposed convolutional operators. Similarly to the analysis front-end, the initialization of the back-end is using the bases of the discrete Fourier transform. It should be stated that all the weights are subject to optimization and are expected to vary during the training of the model. The frame and hop sizes used for the convolutions are 1024 and 384 samples, respectively.
Unlike some other proposed architectures which use convolutional layers,Ronneberger, Fischer, and Brox (2015) we use fully-connected (FC) layers that have shared-weights with respect to the sub-space dimensionality (i.e., the frequencies). That is done for two reasons. The first reason is that the number of the parameters inside the model is dramatically reduced, and secondly we preserve the location of the magnitude and phase information of the original signal. Essentially, the operations carried by each deep neural network in the autoencoder module can be seen as non-linear affine transformations of the transpose time-frequency patches (spectrograms) of the input signal. Furthermore, we apply residual (additive) skip connections(He et al., 2016) inspired by U-Net(Ronneberger, Fischer, and Brox, 2015) and “skip filter” (multiplicative) connection for the magnitude only.Mimilakis et al. (2016) These skip connections dramatically improve the speed of training, and can be viewed in three complementary ways: allowing information to propagate further through the network, smoothing the loss surface,(Li et al., 2017) and/or allowing the network to compute formidable perturbations subject to the goal of profiling an audio effect.
In the middle of the autoencoder, we concatenate values of the effect controls (e.g., threshold, ratio, etc.) and “merge” these via an additional FC layer. The first layer of the autoencoder maps the number of time frames in the spectrograms to 64, with subsequent layers shrinking (or, on the output side, growing) this by factors of 2. The resulting model has approximately 4 million trainable parameters.
ii.3 Training Procedure
ii.3.1 Loss function.
We use a log-cosh loss function,(Chen, Chen, and Zhang, 2019) which has similar properties to the MAE (i.e., L1 norm divided by the number of elements) in the sense that it forces the predicted signal to closely follow the target signal at all times, however the roundness of the log-cosh function near zero allows for significantly better training at low loss values than does L1, which has a discontinuous derivative at zero. This is illustrated in Figure 3.
Furthermore we include an L1 regularization term with a small coefficient (e.g., 2e-5), consisting of the magnitude spectrogram from the output side of the autoencoder, weighted exponentially by frequency-bin number to help reduce high-frequency noise in the predicted output. Thus the equation for the loss function is given by
where and are the predicted and target outputs, respectively, and implies exponential weighting by frequency bin , and means no such weighting.
ii.3.2 Training Data.
Simply training on a large amount of musical audio files is not necessarily the most efficient way to train the network – depending on the type of effect being profiled, some signals may be more ‘instructive’ than others. A compressor requires numerous transients of significant size, whereas an echo (or ‘delay’) effect may train most efficiently on uncorrelated input signals (e.g., white or pink noise). Therefore, we augment a dataset of music recordings with randomly-generated sounds intended to provide both dynamic range variation and broadband frequency coverage. Examples of these are shown in Figure 4.
By virtue of the automation afforded by software effects such as Comp-4C, we can train indefinitely using randomly-synthesized signals which change during each iteration. But for the LA-2A, we created an large (20 GB) input dataset of public domain musical sounds and randomly-generated test sounds, concatenated these and divided the result into (unique) files of 15-minute duration, using a fixed increment of “5” on the LA-2A’s Peak Reduction knob between recordings, for both settings of the Comp/Lim switch. For prerecorded (i.e., non-synthesized) audio, windows from the input (and for ST data, target) data are copied from random locations in the audio files, along with the control settings used. Data augmentation is applied only in the form of randomly flipping the phase of inputs and targets.
To achieve the results in this paper, we trained for two days (see “Implementation,” below) on what corresponded to approximately 2000 hours of audio sampled at 44.1 kHz (or 130 GB if it were stored on disk). As a performance metric, we keep a separate, fixed “validation set” of approximately 12 minutes of audio; all results displayed in this paper are for validation data, i.e., on data which the network has not “seen” before.
The arrangement of this data is “maximally shuffled,” i.e., we find that training is significantly more smooth and stable when the control settings are randomly changed for each data window within each mini-batch. Trying to train using the same knob settings for large sequences of inputs – as one might expect to do by taking a lengthy pair of (input-output) audio clips obtained at one effect setting and breaking them up into a sequential mini-batch of training windows – results in unstable training in the sense that the (training and/or validation) loss varies much more erratically and, overall, decreases much more slowly than for the ‘maximally shuffled’ case in which one varies the knob settings with every window. This shuffling from window to window is another reason why our model is not autoregressive: because we wish to learn to model the controls with the effect.
When starting from scratch, weights are initialized randomly except for the weights connecting to the input and output layers, which are initialized according to a Discrete Fourier Transform (DFT), and its inverse transform, respectively. These are subsequently allowed to evolve as training proceeds.
For the different task of image classification on the ImageNet dataset,(Deng et al., 2009) the combination of Adam(Kingma and Ba, 2014) with weight decay(Krogh and Hertz, 1992) has been shown(Gugger and Howard, 2018) to be among the fastest training methods available when combined with learning rate scheduling. We also adopt this combination for our problem.
ii.3.5 Learning Rate Scheduling.
An important feature, found to decrease both final validation loss values and the number of epochs required to reach them, is the use of learning rate scheduling,i.e., adjusting the value of the learning rate dynamically during the course of gradient-based optimization, rather than keeping the learning rate static. We follow the “1-cycle” policy,(Smith, 2018) which incorporates cosine annealing,(Loshchilov and Hutter, 2016) in the manner popularized by the Fasti.ai team.Gugger and Howard (2019) Compared to using a static learning rate, the 1-cycle policy allowed us to reach roughly 1/10th the error in 1/5 the time.
code was written in Python using the PyTorch(Paszke et al., 2017) library along with Numba for speeding up certain subroutines. Development and training was primarily conducted on a desktop computer with two NVIDIA Titan X GPUs. Late the project we upgraded to two RTX 2080Ti GPUs, which, with the benefit of NVIDIA’s “Apex” mixed-precision (MP) training library,333https://github.com/NVIDIA/apex yielded speedup of 1.8x over the earlier runs. MP training involves computing losses at full precision while performing inference using “half-precision” (i.e., FP16) representations, which have a machine epsilon on the order of , and thus applications of MP are more commonly associated with classification tasks than regression tasks. While our results obtained from MP and those from full-precision calculations for our regression task are not strictly identical, they show no significant differences, even at loss values on the order of . For the GPUs used, we found by experimentation that a mini-batch size of 200 offered the best training performance per wall-clock execution time, and each “epoch” consisted of 1000 batches of randomly-sampled windows from the audio files (or synthesized on the fly).
iii.1 Software Compressor: “Comp-4C”
We ported MATLAB code to Python for a single-band, hard-knee compressor with four controls: threshold, ratio, attack and release times.Tarr (2018)
(This compressor implements no side-chaining, make-up gain or other features.) As it is a software compressor, the training data could be generated “on the fly,” choosing control (‘knob’) settings randomly according to some probability distribution (e.g.,
uniform, or a beta distribution to emphasize the endpoints444
Initially we chose control settings according to a symmetric beta distribution, with the idea that by mildly emphasizing the endpoints of the control ranges, the model would learn more efficiently, however experience showed no performance enhancement compared to choosing from a uniform distribution.). This synthesis allows for a virtually limitless size of the training dataset. Our early experiments used such a dataset, but given that intended goal of this system is to profile systems within a finite amount of time, and particularly analog effects which would typically require the creation of a finite set of recordings, we chose to emulate the intended use case for analog gear, namely a finite dataset in which the control knob settings are equally spaced, with 10 settings per control.
Figure 5 shows the performance of the model compared to the target audio, for the case of a step-response, a common diagnostic signal for compressor performance.(Simmer, Schmidt, and Bitzer, 2006; Giannoulis, Massberg, and Reiss, 2012; Eichas, Gerat, and Zölzer, 2017) The predicted values follow the target closely enough that we show their differences in Figure 6. Key differences occur at the discontinuities themselves (especially at low attack times), and we see that the predictions tend to “overshoot” slightly on at the release discontinuity (likely due to slight errors in the phase in the spectral decomposition in the model), but that in between and after the discontinuities the predictions and target match closely.
As noted in Section II.1.3, the size of the lookback window can have an effect on the error bounds. Figure 7 shows that the loss on the Validation set to be consistent with estimates obtained for the cases depicted in Figure 1. And yet listening to these examples (see Supplemental Materials(sup, )) one notices noise in the predicted results, suggesting that the lookback window size (or “causality noise”) is not the only source of error in the model.
Although step responses are a useful diagnostic, the neural network model approximates the input-output mappings it is trained on, and is ultimately intended for use with musical sounds which may typically lack such sharp discontinuities. Thus a comparison of compressor output for musical sounds is in order as well. Figure 8 shows a comparison of frequency spectrum for a full-band recording (i.e., drums, bass, guitar, vocals) in the testing dataset. It also shows that scaling the L1 regularization exponentially by frequency can yield a reduction in high-frequency noise, sacrificing a proportionally smaller amount of accuracy at low frequencies.
iii.2 Analog Compressor: LA-2A
A primary interest in the application our method is not for cases in which a software plugin already exists, but rather for the profiling of analog units. As an example, we choose the Universal Audio’s Teletronix LA-2A, an electro-optical compressor-limiter,(Lawrence, 1964) the controls for which consist of three knobs and one switch. Given that two of the knobs are only for input-output gain adjustment, for this study, we focus only on varying the “Peak Reduction” (PR) knob, and the “Compress/Limit” switch. The switch is treated like any other knob, with limits chosen arbitrarily to range from 0 for “Compress” to 1 for “Limit” (internally these are mapped to -0.5 and 0.5 to preserve zero-mean inputs to the neural network). 555Given the requirements of differentiability imposed by training via gradient descent optimization, one might expect the discontinuous nature of a switch to pose a problem for training, however we find this not to be the case. The dataset – consisting of subsets for Training, Validation and Testing – was created by assembling examples of music from sources with Creative Commons licenses, the authors’ own recordings, and synthetic waveforms such as those shown in Figure 4, all concatenated into a series of unique 15-minute WAV files at 44.1 kHz, and sent through the LA-2A at increments of 5 on the PR knob, for both settings of the Comp/Lim switch.666We will make the dataset publicly available pending review.
Figure 9 shows the loss on the validation set for the LA-2A for different lookback sizes. The dashed (black) line shows a model with an input size of 8192*2=16384 and took 15 hours to run, the solid (blue) line is a model with input size of 8192*27=221184 and took 72 hours. Both runs used output sizes of 8192 samples. In all our runs, a loss value of approximately 1e-4 is achieved, regardless of the size of the model – even for a lookback extending beyond the “5 seconds for complete release” typically associated with the LA-2A.(ua_, ) This indicates that the finite size of the lookback window (or “causality noise”) is not the primary source of error; this is consistent with the Comp-4C results (e.g., see Figure 7). The primary source of error remains an ongoing subject of investigation. Graphs of example audio waveforms from the Testing dataset are shown in Figure 10, where it is noteworthy that the model will at times over-compress the onset of an attack as compared to the true LA2A target response.
In pursuit of the goal of capturing and modeling generic audio effects by means of artificial neural networks, we have focused this study on dynamic range compressors as a representative problem set because their nonlinear, time-dependent nature makes them a challenging class of problems, and because they are a class of effects of high interest in the field of musical audio production. Rather than rely on domain-specific knowledge of audio compressors in constructing our end-to-end system, our model learns the effects the parameterized controls in the process of training on a large dataset consisting of input-output audio pairs and the control settings used.
The results capture the qualities of the compressors sampled, although the speed of execution and the residual noise in the neural network output suggest that practical implementations of this method may await improvements in computer implementation and refinements to the model. We are interested in trying a model based on WaveNet Van Den Oord et al. (2016); Rethage, Pons, and Serra (2018b) or WaveRNN(Kalchbrenner et al., 2018) for comparisons to our model regarding speed and accuracy.
As the intent of this effort are the modeling of effects in general, more work remains to probe the limits of our method toward a variety of other signal processing effects, both analog and digital, as well for the construction of new effects by solving “inverse problems” such as de-compression.(Gorlow and Reiss, 2013)
Acknowledgements.Scott H. Hawley wishes to thank Eric Tarr, William Hooper, and fast.ai (especially Jeremy Howard and Sylvain Gugger) for helpful discussions. Stylianos Ioannis Mimilakis is supported by the by the German Research Foundation (AB 675/2-1, MU 2686/11-1).
- Smith (2011) J. O. Smith, Physical Audio Signal Processing: For Virtual Musical Instruments and Audio Effects (W3K Publishing, 2011) oCLC: 774174525.
- Välimäki et al. (2016) V. Välimäki, J. Parker, L. Savioja, J. Smith, and J. S Abel, “More than 50 years of artificial reverberation,” (2016).
- Yeh et al. (2008) D. T. Yeh, J. S. Abel, A. Vladimirescu, and J. O. Smith, “Numerical Methods for Simulation of Guitar Distortion Circuits,” Computer Music Journal 32, 23–42 (2008).
- Pakarinen et al. (2011) J. Pakarinen, V. Välimäki, F. Fontana, V. Lazzarini, and J. S. Abel, “Recent Advances in Real-Time Musical Effects, Synthesis, and Virtual Analog Models,” EURASIP Journal on Advances in Signal Processing 2011, 940784 (2011).
- Eichas, Gerat, and Zölzer (2017) F. Eichas, E. Gerat, and U. Zölzer, “Virtual Analog Modeling of Dynamic Range Compression Systems,” in Audio Engineering Society Convention 142 (2017).
- Stikvoort (1986) E. F. Stikvoort, “Digital Dynamic Range Compressor for Audio,” J. Audio Eng. Soc 34, 3–9 (1986).
- Kröning, Dempwolf, and Zölzer (2011) O. Kröning, K. Dempwolf, and U. Zölzer, “ANALYSIS AND SIMULATION OF AN ANALOG GUITAR COMPRESSOR,” (2011).
- Floru (1999) F. Floru, “Attack and Release Time Constants in RMS-Based Feedback Compressors,” J. Audio Eng. Soc 47, 788–804 (1999).
- Gerat, Eichas, and Zölzer (2017) E. Gerat, F. Eichas, and U. Zölzer, “Virtual Analog Modeling of a UREI 1176LN Dynamic Range Control System,” in Audio Engineering Society Convention 143 (2017).
- Simmer, Schmidt, and Bitzer (2006) U. Simmer, D. Schmidt, and J. Bitzer, “Parameter Estimation of Dynamic Range Compressors: Models, Procedures and Test Signals,” in Audio Engineering Society Convention 120 (2006).
- (11) K. GmbH, “Kemper Amps,” [Online; accessed 01-March-2019].
- (12) “Axe-Fx II XL+,” [Online; accessed 01-March-2019].
Choi et al. (2017)
K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Convolutional recurrent neural networks for music classification,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2017) pp. 2392–2396.
- Fonseca et al. (2017)
- Schlüter and Böck (2014) J. Schlüter and S. Böck, “Improved musical onset detection with convolutional neural networks,” in 2014 ieee international conference on acoustics, speech and signal processing (icassp) (IEEE, 2014) pp. 6979–6983.
- Huang et al. (2015) P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 2136–2147 (2015).
- Stowell et al. (2015) D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia 17, 1733–1746 (2015).
- Han et al. (2015) K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, “Learning spectral mapping for speech dereverberation and denoising,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 982–992 (2015).
- Arifianto and Farid (2018) D. Arifianto and M. N. Farid, “Dereverberation binaural source separation using deep learning,” The Journal of the Acoustical Society of America 144, 1684–1684 (2018).
- Rethage, Pons, and Serra (2018a) D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2018) pp. 5069–5073.
- Dissen, Goldberger, and Keshet (2019) Y. Dissen, J. Goldberger, and J. Keshet, “Formant estimation and tracking: A deep learning approach,” The Journal of the Acoustical Society of America 145, 642–653 (2019).
- Pons et al. (2016) J. Pons, J. Janer, T. Rode, and W. Nogueira, “Remixing music using source separation algorithms to improve the musical experience of cochlear implant users,” The Journal of the Acoustical Society of America 140, 4338–4349 (2016).
- Engel et al. (2017) J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, “Neural audio synthesis of musical notes with wavenet autoencoders,” arXiv preprint arXiv:1704.01279 (2017).
- Wang et al. (2017) Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in INTERSPEECH (2017).
- Sotelo et al. (2017) J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” (ICLR 2017, 2017).
- Donahue, McAuley, and Puckette (2018) C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” arXiv preprint arXiv:1802.04208 (2018).
- Mimilakis et al. (2016) S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, “Deep neural networks for dynamic range compression in mastering applications,” in Audio Engineering Society Convention 140 (Audio Engineering Society, 2016).
- Damskägg et al. (2018) E.-P. Damskägg, L. Juvela, E. Thuillier, and V. Välimäki, “Deep learning for tube amplifier emulation,” arXiv preprint arXiv:1811.00334 (2018).
- Damskägg, Juvela, and Välimäki (2019) E.-P. Damskägg, L. Juvela, and V. Välimäki, “Black-box modeling of audio distortion circuits with deep learning,” (2019), submitted to SMC.
- Martínez and Reiss (2018) M. Martínez and J. D. Reiss, “Modeling of nonlinear audio effects with end-to-end deep neural networks,” (2018), arXiv:1810.06603 [cs, eess] .
- Dieleman and Schrauwen (2014) S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2014) pp. 6964–6968.
- Lluís, Pons, and Serra (2018) F. Lluís, J. Pons, and X. Serra, “End-to-end music source separation: is it possible in the waveform domain?” arXiv preprint arXiv:1810.12187 (2018).
- Sheng and Fazekas (2019) D. Sheng and G. Fazekas, “A feature learning siamese model for intelligent control of the dynamic range compressor,” (2019), arXiv:1905.01022 .
- Van Den Oord et al. (2016) A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio.” in SSW (2016) p. 125.
- (35) Source code and datasets accompanying this paper will be publicly released on GitHub.com pending acceptance for publication.
- (36) The system could be modified to predict one sample at a time, however our experience with this model has found this practice to be neither necessary nor helpful.
- (37) Supplementary Materials, including audio clips and an interactive demo, are available at a website created for this paper: http://www.signaltrain.ml.
- Ronneberger, Fischer, and Brox (2015) O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention (Springer, 2015) pp. 234–241.
Yian Lim et al. (2018)
T. Yian Lim, R. A. Yeh, Y. Xu, M. N. Do, and M. Hasegawa-Johnson, “Time-frequency networks for audio super-resolution,” in2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018) pp. 646–650.
- Clevert, Unterthiner, and Hochreiter (2016) D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” ICLR 2016 (2016).
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
- Li et al. (2017) H. Li, Z. Xu, G. Taylor, and T. Goldstein, “Visualizing the loss landscape of neural nets,” CoRR abs/1712.09913 (2017), arXiv:1712.09913 .
- Chen, Chen, and Zhang (2019) P. Chen, G. Chen, and S. Zhang, “Log hyperbolic cosine loss improves variational auto-encoder,” Submitted to ICML (2019).
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09 (2009).
- Kingma and Ba (2014) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
- Krogh and Hertz (1992) A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems (1992) pp. 950–957.
- Gugger and Howard (2018) S. Gugger and J. Howard, “AdamW and Super-convergence is now the fastest way to train neural nets,” (2018), last accessed 19 November 2018.
- Smith (2018) L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay,” CoRR abs/1803.09820 (2018), arXiv:1803.09820 .
Loshchilov and Hutter (2016)
I. Loshchilov and F. Hutter, “SGDR: stochastic gradient descent with restarts,”CoRR abs/1608.03983 (2016), arXiv:1608.03983 .
- Gugger and Howard (2019) S. Gugger and J. P. Howard, “Callbacks.one_cycle — fastai,” (2019), [Online; accessed 25-May-2019].
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS 2017 (2017).
- (52) Https://github.com/NVIDIA/apex.
- Tarr (2018) E. Tarr, Hack Audio (Audio Engineering Society Presents): An Introduction to Computer Programming and Digital Signal Processing in MATLAB (Routledge, 2018) p. 428.
- (54) Initially we chose control settings according to a symmetric beta distribution, with the idea that by mildly emphasizing the endpoints of the control ranges, the model would learn more efficiently, however experience showed no performance enhancement compared to choosing from a uniform distribution.
- Giannoulis, Massberg, and Reiss (2012) D. Giannoulis, M. Massberg, and J. D. Reiss, “Digital dynamic range compressor design—a tutorial and analysis,” J. Audio Eng. Soc 60, 399–408 (2012).
- Lawrence (1964) J. F. Lawrence, “An improved method of audio level control for broadcasting and recording,” Journal of the SMPTE 73, 661–663 (1964).
- (57) Given the requirements of differentiability imposed by training via gradient descent optimization, one might expect the discontinuous nature of a switch to pose a problem for training, however we find this not to be the case.
- (58) We will make the dataset publicly available pending review.
- (59) “Product catalog - universal audio,” [Online; accessed 01-March-2019].
- Rethage, Pons, and Serra (2018b) D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2018) pp. 5069–5073.
- Kalchbrenner et al. (2018) N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” CoRR abs/1802.08435 (2018), arXiv:1802.08435 .
- Gorlow and Reiss (2013) S. Gorlow and J. D. Reiss, “Model-based inversion of dynamic range compression,” IEEE Transactions on Audio, Speech, and Language Processing 21, 1434–1444 (2013).