1 Introduction
Audio source separation, a central task in computer audition, is the study of how to break down complex auditory scenes into their constituent parts (e.g., isolating the voice of a single speaker in a crowd). Robust audio source separation has many practical applications, such as video conferencing, hearing aids, speech enhancement, and home voice assistants. In recent years, deep neural networks have greatly advanced the state of the art in source separation across multiple audio domains, such as speech, music, and environmental sounds. However, optimizing deep networks for audio source separation remains a tricky endeavor, requiring careful tuning of hyperparameters and choice of optimizer, network architecture, and loss function.
The loss landscape on which a neural network is optimized is often nonsmooth and filled with local minima. This is especially true for recurrent neural networks, which are vulnerable to both exploding and vanishing gradients [1]. Gradient clipping [17, 16, 26, 19] attempts to resolve the former issue: exploding gradients. Gradient clipping has been found to be a highly effective and necessary ingredient for state-of-the-art performance in applications such as language modeling [16, 19]. It has also proved crucial to the optimization process of recent state-of-the-art source separation algorithms [15, 22, 24]. In gradient clipping, if the norm of the gradient vector (taken over all parameters) exceeds the clipping value, the gradient vector is scaled so that its norm does not exceed that value. The clipping value is typically set by hand.
The gradient clipping hyperparameter is very sensitive to the loss function, the scaling of the data, and the network architecture. As an illustrative example, consider two implementations of the same mask inference architecture [21]. Mask inference networks are typically applied to audio preprocessed with a short-time Fourier transform (STFT). Assume that, for one network, the STFT normalizes each window, and for the other, the STFT does not normalize. This can result in data scales that differ by orders of magnitude. One cannot specify a single gradient clipping value appropriate for both, otherwise identical, implementations.¹ This is in contrast to other facets of the optimization process, such as using an Adam optimizer [9], which adapts to the statistics of the gradients.

In this work, we propose AutoClip, a simple adaptive gradient clipping procedure which removes the need to hand-tune the clipping parameter, transfers easily across multiple loss functions, and is scale-invariant by design. In experiments, we show the impact of AutoClip on the optimization of source separation networks. Our experiments also provide empirical evidence for the need for gradient clipping when training source separation networks. An implementation is provided here: https://github.com/pseeth/autoclip.

¹ This situation has occurred in practice, resulting in degraded performance after changing STFT functions while keeping the same clipping threshold.
2 Gradient Clipping
Given a function $f$ computed on data $x$ and parameterized by $\theta$, and a learning rate $\lambda$, a gradient descent update at iteration $t$ from the current parameters $\theta^{(t)}$ to $\theta^{(t+1)}$ is defined as:

$\theta^{(t+1)} = \theta^{(t)} - \lambda \nabla f(\theta^{(t)})$   (1)

Note that a stochastic version of this update is more commonly used, with $\nabla f$ being computed on minibatches of $x$. Gradient clipping enforces an upper bound on the update of $\theta$, by placing a max on the norm of the gradient:

$h^{(t)} = \nabla f(\theta^{(t)})$   (2)

$\theta^{(t+1)} = \theta^{(t)} - \lambda \, \min\!\left(1, \frac{c}{\|h^{(t)}\|}\right) h^{(t)}$   (3)
Here, $c$ is a clipping value hyperparameter that needs to be carefully chosen by the end user. Note that this clipping scheme is the so-called clip-by-norm, not clip-by-value, in which individual entries of the gradient vector are clipped if they go beyond a preset value. In clip-by-norm, the entire gradient is scaled if the norm of the gradient exceeds the threshold. In stochastic gradient descent (SGD), this places a maximum on the step size that can be taken during training, preventing the optimization from going too far in the direction of a gradient with very large magnitude. If the gradient norm is below the threshold, the optimization is unaffected.
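As a concrete sketch, clip-by-norm as in Eqs. (2) and (3) can be written in a few lines. This is a minimal pure-Python illustration that treats the gradient as a single flat list of floats; real frameworks operate on per-parameter tensors but apply the same scaling.

```python
import math

def clip_by_norm(grad, clip_value):
    """Clip-by-norm: rescale the whole gradient vector when its L2 norm
    exceeds clip_value; leave it untouched otherwise.

    `grad` is a flat list of floats standing in for the concatenated
    gradients over all parameters (a simplification for illustration).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > clip_value:
        scale = clip_value / norm
        return [g * scale for g in grad]
    return grad  # below the threshold: the gradient is unaffected
```

Note that the direction of the gradient is preserved; only its magnitude is capped.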
The reason why gradient clipping stabilizes and accelerates training of neural networks (especially recurrent ones) is an active area of research, with both empirical work showing its efficacy [17] as well as theoretical work analyzing the dynamics of SGD with gradient clipping [26].
In gradient clipping, the selection of $c$ is very important. If it is set too high, the gradient norm will always be smaller than the threshold, and clipping is never applied. If it is set too low, the step size taken by the network may be too small. In practice, a heuristic proposed by Pascanu et al. [17] is often used for setting the clipping threshold. First, start training and observe the gradient norms of each batch for a sufficient number of updates. A set of reasonable values can then be selected to serve as adequate settings for $c$, and an optimal value determined by cross-validation; as a simpler alternative, a value anywhere between 5x and 10x the average observed gradient norm can be used. This ad-hoc procedure must be repeated for each network, loss function, and dataset. There is no "one-size-fits-all" number for the gradient clipping threshold.

3 AutoClip
AutoClip is an automated and dynamic approach to setting the clipping value hyperparameter based on the statistics of the history of the gradient norms observed while training. AutoClip sets a value $c^{(t)}$ at each training iteration $t$, where each iteration corresponds to the processing of one minibatch. AutoClip keeps track of the gradient norm of every batch seen during training. Given the history of the gradient norms up to iteration $t$, it computes the value $h_p^{(t)}$ at the $p$-th percentile of the current history. It then sets $c^{(t)} = h_p^{(t)}$. This is shown in Algorithm 1.
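A minimal sketch of the procedure, assuming the gradient is flattened into a single list of floats and using a linearly interpolated percentile (the default behavior of NumPy's `np.percentile`); a practical PyTorch version would instead compute the total norm over parameter tensors and clip with `torch.nn.utils.clip_grad_norm_`.

```python
import math

def percentile(values, p):
    """p-th percentile (0-100) of values, with linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return s[int(k)]
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

class AutoClip:
    """Clip each gradient to the p-th percentile of all norms seen so far."""

    def __init__(self, p=10.0):
        self.p = p
        self.history = []  # gradient norm of every minibatch seen so far

    def __call__(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        self.history.append(norm)
        c = percentile(self.history, self.p)  # adaptive clipping value c^(t)
        if norm > c:
            grad = [g * c / norm for g in grad]  # clip-by-norm with threshold c
        return grad
```

Here the clipping value $c^{(t)}$ is recomputed on every call from the full norm history, mirroring Algorithm 1.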
While AutoClip is agnostic to the optimizer and can be used with SGD, in this work we use gradient clipping in conjunction with the Adam optimizer. Adam has become the de facto optimizer for source separation networks [15, 11, 12, 24, 21] and is thus of particular interest here. The standard Adam equations [9] for updating the parameters of the network are used, with the only difference that the gradients are replaced by their (potentially) clipped versions before the Adam optimizer updates its estimates of the first ($m$) and second ($v$) moments of the gradients.
The dynamics of AutoClip lead to a setting of $c^{(t)}$ that is determined adaptively by the data, the network, and the loss, as opposed to one selected carefully by a user after observing the training dynamics. A user only has to specify what percentile $p$ to clip to. As minibatch stochastic optimization is prone to outliers, using robust statistics [6] via percentiles helps mitigate this issue. This setting can be transferred across multiple optimization scenarios, as it is defined relative to the data and loss, rather than as an absolute value that is sensitive to these factors as well as to implementation.

For higher settings of $p$, less clipping is applied to the gradients. If $p = 100$, then no clipping is applied during optimization. If $p = 0$, then every gradient is clipped to the minimum gradient norm seen during training until then. For low values of $p$, the clip value will have more "inertia," as it will not update significantly without a long sequence of high gradient norms. For higher values of $p$, $c^{(t)}$ will be much more responsive to high gradient norms. As a result, AutoClip will only raise the clipping value during optimization if a long-enough sequence of high gradient norms justifies it.
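The two extreme settings can be read off directly, since the 100th percentile of the history is its maximum and the 0th percentile its minimum; a tiny illustration with a hypothetical norm history:

```python
# Hypothetical history of gradient norms observed so far during training.
history = [2.0, 5.0, 3.0, 9.0]

# p = 100: the threshold is the largest norm in the history. Since the current
# norm is itself part of the history, it can never exceed the threshold,
# so no clipping is ever applied.
assert max(history) == 9.0

# p = 0: the threshold is the smallest norm seen so far ("min-clipping"),
# so every gradient whose norm exceeds it is scaled down to that norm.
assert min(history) == 2.0
```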
3.1 Relationship to other optimization approaches
Much recent work in optimization has centered around "denoising" the gradient during training. AdaGrad [2] achieves this by keeping track of the sum of the squared gradients for each parameter. One drawback of AdaGrad is that the learning rate decays to an infinitesimally small number as more iterations are taken. A refinement of AdaGrad, AdaDelta [25], resolves this issue by accumulating the sum of the squared gradients over a window. Adam [9] addresses outliers in the gradient by keeping track of the first and second moments of the gradient for each parameter using an exponential moving average. After correcting for bias, the optimizer step taken on each parameter is adjusted based on the observed variance of the gradient history. Adam can be seen as combining the best of AdaGrad with RMSProp [20]. AutoClip can be applied in conjunction with any existing optimization approach (including AdaGrad, Adam, SGD, etc.). AutoClip denoises the gradients via adaptive gradient clipping, which is done prior to the optimization step.

For $p = 0$, assuming that a small gradient norm appears early in training, all subsequent gradients will be normalized to that constant value, so all clipped gradients passed to the optimizer will have the same norm. For a scale-invariant optimizer such as Adam, such a setting is equivalent to using normalized gradients, which has been shown to improve optimization [23].
4 Experimental design
Our experiments are designed to investigate the impact of AutoClip on optimization. Our primary research question is whether a single setting of AutoClip can transfer easily across loss functions that have vastly different scales. When setting gradient clipping manually, the thresholds must be carefully chosen based on observing the gradient norms of training for every network and loss function individually. AutoClip scales the clipping threshold automatically, potentially enabling a "set-and-forget" approach to applying gradient clipping.
Our secondary research question stems from the fact that AutoClip is interpretable, because it is relative to the actual gradient norms. For example, for a given $p$, we know that roughly $(100 - p)\%$ of the gradients are typically being clipped during training. By varying $p$, we can draw conclusions about the impact of more or less aggressive clipping during optimization.
We apply AutoClip to the problem of separating individual speech streams from a mixture of concurrent speech. We use a standard dataset in the speech separation literature, WSJ0-2mix [4], which consists of 20,000 two-speaker mixtures for training, 5,000 for validation, and 3,000 for testing. The sample rate of all audio files is fixed at 8 kHz. We use 32 ms windows, an 8 ms hop length, and the square root of the Hann window as our window function for computing the STFT.
4.1 Loss functions
We investigate the interaction between AutoClip and five commonly used loss functions in the source separation literature. Because these loss functions have different scales, their gradients will also likely have different scales.² These gradients will thus have different norms, and therefore the optimal gradient clipping value will vary.

² $\nabla(\alpha \mathcal{L}) = \alpha \nabla \mathcal{L}$, where $\alpha$ is a constant.
We chose the classic deep clustering loss ($\mathcal{L}_{DC}$) [4], the whitened k-means loss ($\mathcal{L}_{WKM}$) [21], a mask inference loss based on the distance between the estimated source and the actual source ($\mathcal{L}_{MI}$) [21], a Chimera multitask loss function that combines $\mathcal{L}_{MI}$ and $\mathcal{L}_{WKM}$ ($\mathcal{L}_{CHI}$) [14, 21], and a waveform loss where the audio output of the network is optimized via the signal-to-noise ratio ($\mathcal{L}_{SNR}$) [7].

The mask inference loss uses the truncated phase-sensitive spectrum approximation (PSA) loss [3] to compare an estimated spectrogram $\hat{S}$ with the ground truth spectrogram $S$:
$\mathcal{L}_{MI} = \frac{1}{TF} \sum_{t,f} \left| \hat{S}_{t,f} - T_{0}^{|X_{t,f}|}\!\left( |S_{t,f}| \cos(\angle S_{t,f} - \angle X_{t,f}) \right) \right|$   (4)

where $TF$ denotes the number of time-frequency points in $X$, the input mixture, $|\cdot|$ and $\angle\cdot$ the magnitude and phase of a spectrogram, and $T_{0}^{|X_{t,f}|}$ denotes truncation to the range $[0, |X_{t,f}|]$. The values of $\mathcal{L}_{MI}$ are on the order of 1e-5. The deep clustering loss compares the affinity matrix of embeddings $V$ for all time-frequency (TF) points with that of ground truth assignments $Y$, introducing weights $W$ for every TF point:

$\mathcal{L}_{DC} = \left\| W^{1/2} \left( V V^{\top} - Y Y^{\top} \right) W^{1/2} \right\|_F^2$   (5)

The values of $\mathcal{L}_{DC}$ range between 0 and 1, when the loss is normalized for the number of TF points. The whitened k-means loss is a self-normalizing variant of $\mathcal{L}_{DC}$:
$\mathcal{L}_{WKM} = D - \mathrm{tr}\!\left( (V^{\top} V)^{-1} V^{\top} Y \, (Y^{\top} Y)^{-1} Y^{\top} V \right)$   (6)

where $D$ is the embedding size. The range of $\mathcal{L}_{WKM}$ is between $D - C$ and $D$, where $C$ is the number of sources and $D$ is typically around 20.
The multitask loss function, the Chimera loss in [14, 21], combines the mask inference loss with the whitened k-means loss, weighting each using a constant factor $\alpha$:

$\mathcal{L}_{CHI} = \alpha \mathcal{L}_{WKM} + (1 - \alpha) \mathcal{L}_{MI}$   (7)

In this work, $\alpha$ is chosen to be 0.5. Because of the large difference between the magnitudes of $\mathcal{L}_{WKM}$ and $\mathcal{L}_{MI}$ (in practice orders of magnitude apart), this results in $\mathcal{L}_{WKM}$ mostly dominating the optimization. The range of $\mathcal{L}_{CHI}$ is accordingly between $\alpha(D - C)$ and $\alpha D$, up to the comparatively negligible mask inference term.
The signal-to-noise ratio compares the time-domain audio estimate $\hat{s}$ and the time-domain ground truth source $s$ [7]:

$\mathcal{L}_{SNR} = -10 \log_{10} \frac{\|s\|^2}{\|s - \hat{s}\|^2}$   (8)

$\mathcal{L}_{SNR}$ is the negative SNR, so that it is minimized during optimization. It is unbounded in principle, but in practice typically stays within a few tens of dB of zero.
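A sketch of this loss on plain float lists; a practical implementation would add a small epsilon to guard against a zero denominator, which is omitted here for clarity:

```python
import math

def snr_loss(estimate, source):
    """Negative SNR in dB between a time-domain estimate and its ground
    truth source; minimizing this loss maximizes the SNR."""
    signal_power = sum(s * s for s in source)
    noise_power = sum((s - e) ** 2 for s, e in zip(source, estimate))
    return -10.0 * math.log10(signal_power / noise_power)
```

For instance, an estimate that leaves 1% of the source's energy as error has an SNR of about 20 dB, i.e. a loss of about −20.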
4.2 Network architectures
We train identical networks for each of these loss functions, with minor differences to account for each loss's particularities. At the core of each network is a stack of 4 bidirectional LSTM layers with 600 hidden units in each direction. The input to each network is the log-magnitude spectrogram of the mixture. The networks trained with $\mathcal{L}_{DC}$ and $\mathcal{L}_{WKM}$ output 20-dimensional embeddings for each time-frequency point. These embeddings are unit-norm with sigmoid activation. The mask inference network outputs two masks with sigmoid activation. The Chimera network trained with $\mathcal{L}_{CHI}$ outputs an embedding and two masks. The network trained with $\mathcal{L}_{SNR}$ uses an inverse STFT within the network: it outputs two masks which are applied to the magnitude spectrogram; the masked spectrogram is then inverted using the mixture phase to obtain the time-domain audio for each source. We use permutation-invariant versions of $\mathcal{L}_{MI}$, $\mathcal{L}_{CHI}$, and $\mathcal{L}_{SNR}$ [4, 10].
Table 1: SI-SDR (dB) on the WSJ0-2mix test set for each AutoClip percentile $p$ and loss function.

  $p$   $\mathcal{L}_{DC}$   $\mathcal{L}_{WKM}$   $\mathcal{L}_{MI}$   $\mathcal{L}_{CHI}$   $\mathcal{L}_{SNR}$
    0        10.7        11.1        10.0        11.2         9.9
    1        10.7        11.2        10.3        11.3        10.2
   10        10.8        11.0        10.2        11.3        10.4
   25        10.7        11.0         9.9        11.3        10.3
   50        10.7        11.0         9.2        11.2         9.9
   90        10.5        11.0         8.7        11.1         9.5
  100        10.2        10.8         8.5        10.9         8.3
4.3 Training and evaluation
All networks are trained with identical hyperparameters and are initialized with the same random seed. We use the Adam optimizer with an initial learning rate of 1e-3, and a sequence length of 400 frames (25536 samples for the network trained with $\mathcal{L}_{SNR}$), selected from random offsets within each utterance. Mixtures that are too short are padded with zeros to the required length. We use a batch size of 25 and train for 100 epochs. For each loss function, we experiment with the value of $p$, the percentile threshold in AutoClip, setting it to 0, 1, 10, 25, 50, 90, and 100. Setting $p$ to 0 corresponds to the most aggressive clipping strategy of "min-clipping," where every gradient is clipped to the minimum gradient norm seen so far during training. Setting it to 100 corresponds to no gradient clipping at all being applied during training. We evaluate the performance of each network on the WSJ0-2mix test set using the scale-invariant source-to-distortion ratio (SI-SDR) [12].

5 Results and discussion
The performance of each network we trained is shown in Table 1. As expected, the bottom row, corresponding to $p = 100$, where no gradient clipping is applied, gets consistently worse results across the board for every loss function. In the case of $\mathcal{L}_{MI}$ and $\mathcal{L}_{SNR}$, not applying clipping results in a performance drop of around 2 dB. To the best of our knowledge, this is the first formal reporting of this phenomenon for source separation. Setting $p = 1$ or $p = 10$ results in vastly improved performance across the board for each loss function. As a reminder, setting a static clipping threshold for each of these loss functions would require an individual hyperparameter search for each scenario. With AutoClip, one can simply set $p = 10$ to get greatly improved performance, showing that AutoClip can be a "set-and-forget" approach to gradient clipping. Even continuously updating the clipping value to be very low ($p = 1$) or even the minimum ($p = 0$) of the gradient norms seems to greatly aid optimization.
Prior work optimizing these loss functions used static clipping thresholds [21]. For each of $\mathcal{L}_{DC}$, $\mathcal{L}_{WKM}$, $\mathcal{L}_{MI}$, and $\mathcal{L}_{CHI}$, we obtain higher performance with AutoClip than reported in that prior work. AutoClip discovers effective clipping thresholds without requiring an exhaustive hyperparameter search.
The results in Table 1 hint at very different learning dynamics when training with and without AutoClip. To investigate this further, we observe the training dynamics of a smaller speaker separation network (as capturing detailed training behavior is computationally expensive). The smaller network has 2 BLSTM layers with 300 hidden units. The training recipe is identical to that in Section 4.3. We compare training the network with AutoClip ($p = 10$) and without ($p = 100$). Every 20 iterations, we record the training loss, the step size, the gradient norm, and the local smoothness. Given model parameters $\theta^{(t)}$ and $\theta^{(t+1)}$ at consecutive time steps, the step size is computed as $\|\theta^{(t+1)} - \theta^{(t)}\|$, the norm of the difference in the model parameters. Local smoothness is measured via the local gradient Lipschitz constant, as used in prior work [26, 18].
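These two diagnostics can be computed from consecutive snapshots of the parameters and gradients. The finite-difference ratio below is one common estimator of the local gradient Lipschitz constant; it is a sketch under that assumption, not the exact measurement code of the cited works.

```python
import math

def l2(v):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))

def step_size(theta_prev, theta_next):
    """Norm of the parameter change between consecutive iterates."""
    return l2([b - a for a, b in zip(theta_prev, theta_next)])

def local_smoothness(theta_prev, theta_next, grad_prev, grad_next):
    """Finite-difference estimate of the local gradient Lipschitz constant:
    ||change in gradient|| / ||change in parameters||."""
    grad_change = l2([b - a for a, b in zip(grad_prev, grad_next)])
    return grad_change / step_size(theta_prev, theta_next)
```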
Fig. 1 shows that the step size varies more smoothly when using AutoClip. AutoClip induces an almost constant step size, which decays slowly over the course of training. Initially, the gradient norms are small, leading to small step sizes. As the gradient norms get bigger, so does the clipping threshold. This manifests as an initial ramp in the step size, similar to learning rate warmup, another popular training trick for deep networks [13]. After this initial stage, the step size decays slowly with little variance. This corresponds to better optimization, as shown by the lower training loss and higher test performance.
In the two lower plots in Fig. 1, we show the relationship between the gradient norm and the local smoothness. When training with AutoClip, there is a better correlation between the two. With AutoClip, areas of low gradient norm (e.g. minima) are also smoother. Smoother minima are believed to result in better generalization performance [8, 5]. AutoClip does not clip the gradient in relatively smooth regions.
6 Conclusion
We have presented AutoClip, a simple method for adaptively choosing a threshold for gradient clipping based on the history of gradient norms observed during training. AutoClip obviates the need for a hand-tuned clipping threshold and generalizes across loss functions with different scales. Experiments show that AutoClip results in better test performance for source separation networks. We examined the training dynamics of a separation network trained with and without AutoClip, showing that AutoClip stabilizes optimization. It is simple to implement and can be integrated readily into a variety of applications across multiple domains. In future work, we will examine AutoClip's suitability for other tasks, such as image classification, language modeling, sound event detection, and more. We will also explore applying different clipping thresholds to each layer independently, similarly to the usage of block-normalized gradients in [23]. Finally, we plan to investigate using moving windows, rather than the entire gradient history, as a way to reduce AutoClip's memory usage as well as make it more sensitive to local rather than global training dynamics.
References
 [1] (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (2), pp. 157–166. Cited by: §1.
 [2] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12 (Jul), pp. 2121–2159. Cited by: §3.1.
 [3] (2015-04) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. IEEE ICASSP, pp. 708–712. Cited by: §4.1.
 [4] (2016-03) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. IEEE ICASSP, pp. 31–35. Cited by: §4.1, §4.2, §4.
 [5] (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §5.
 [6] (2004) Robust statistics. Vol. 523, John Wiley & Sons. Cited by: §3.
 [7] (2019) Universal sound separation. In Proc. IEEE WASPAA, pp. 175–179. Cited by: §4.1, §4.1.
 [8] (2017) On large-batch training for deep learning: generalization gap and sharp minima. In Proc. ICLR, Cited by: §5.
 [9] (2015) Adam: A Method for Stochastic Optimization. In Proc. ICLR, Cited by: §1, §3.1, §3.
 [10] (2017) Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. Audio, Speech, Language Process., pp. 1901–1913. Cited by: §4.2.
 [11] (2019) Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing. Cited by: §3.
 [12] (2019-05) SDR – half-baked or well done? In Proc. IEEE ICASSP, Cited by: §3, §4.3.
 [13] (2020-04) On the variance of the adaptive learning rate and beyond. In Proc. ICLR, Cited by: §5.
 [14] (2017-03) Deep clustering and conventional networks for music separation: stronger together. In Proc. IEEE ICASSP, pp. 61–65. Cited by: §4.1, §4.1.
 [15] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio, Speech, Language Process. 27 (8), pp. 1256–1266. Cited by: §1, §3.
 [16] (2012) Statistical language models based on neural networks. Ph.D. Thesis, Brno University of Technology. Cited by: §1.
 [17] (2013) On the difficulty of training recurrent neural networks. In Proc. ICML, pp. 1310–1318. Cited by: §1, §2, §2.
 [18] (2018) How does batch normalization help optimization? In Proc. NeurIPS, pp. 2483–2493. Cited by: §5.
 [19] (2012) LSTM neural networks for language modeling. In Proc. ISCA Interspeech, Cited by: §1.
 [20] (2012) Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning Cited by: §3.1.
 [21] (2018-04) Alternative objective functions for deep clustering. In Proc. IEEE ICASSP, Cited by: §1, §3, §4.1, §4.1, §5.
 [22] (2019-09) WHAM!: extending speech separation to noisy environments. In Proc. ISCA Interspeech, Cited by: §1.
 [23] (2018-04) Block-normalized gradient method: an empirical study for training deep neural network. arXiv preprint arXiv:1707.04822. Cited by: §3.1, §6.
 [24] (2020) Wavesplit: end-to-end speech separation by speaker clustering. arXiv preprint arXiv:2002.08933. Cited by: §1, §3.
 [25] (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §3.1.
 [26] (2019) Why gradient clipping accelerates training: a theoretical justification for adaptivity. In Proc. ICLR, Cited by: §1, §2, §5.