1 Introduction
Stochastic gradient descent (SGD) [37] has been the workhorse for solving large-scale machine learning (ML) problems. It gives rise to a family of algorithms that enables efficient training of many ML models, including deep neural nets (DNNs). SGD utilizes training data very efficiently at the beginning of the training phase, as it converges much faster than GD and L-BFGS during this period [8, 16]. Moreover, the variance of SGD can help gradient-based optimization algorithms circumvent local minima and saddle points and reach minima that generalize well [38, 18]. However, the variance of SGD also slows down the convergence after the first few training epochs. To account for the effect of SGD's variance and to ensure convergence, a decaying step size has to be applied, which is one of the major bottlenecks for the fast convergence of SGD [7, 41, 40]. Moreover, in training many ML models, a stagewise learning rate schedule is typically used in practice [39, 38]. In this scenario, the variance of SGD usually leads to a large optimality gap. A natural question arising from the above bottlenecks of SGD is: Can we improve SGD such that the variance of the stochastic gradient is reduced on-the-fly with negligible extra computational and memory overhead, and a larger step size is allowed in training ML models?
We answer the above question affirmatively by applying the discrete one-dimensional Laplacian smoothing (LS) operator to smooth the stochastic gradient vector on-the-fly. The LS operation can be performed efficiently using the fast Fourier transform (FFT). We show that LS reduces the variance of the stochastic gradient and allows taking a larger step size.
Another issue with standard GD and SGD is that when the Hessian of the objective function has a large condition number, gradient descent performs poorly. In this case, the derivative increases rapidly in one direction while growing slowly in another. As a byproduct, we will show numerically that LS can avoid oscillation along steep directions and help make progress in shallow directions effectively [25]. The implicit version of our proposed approach is linked to an unusual Hamilton-Jacobi partial differential equation (HJ-PDE) whose solution makes the original loss function more convex while retaining its flat (and global) minima; our approach essentially works on this surrogate function with a much better landscape. See [10] for earlier, related work.
1.1 Our contribution
In this paper, we propose a new modification to stochastic gradient-based algorithms, which at its core uses the LS operator to reduce the variance of the stochastic gradient vector on-the-fly. The (stochastic) gradient smoothing can be done by multiplying the gradient by the inverse of the following circulant convolution matrix

(1)   $A_\sigma = \begin{pmatrix} 1+2\sigma & -\sigma & 0 & \cdots & 0 & -\sigma \\ -\sigma & 1+2\sigma & -\sigma & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ -\sigma & 0 & 0 & \cdots & -\sigma & 1+2\sigma \end{pmatrix}$

for some positive constant $\sigma$. In fact, we can write $A_\sigma = I - \sigma L$, where $I$ is the identity matrix and $L$ is the discrete one-dimensional Laplacian, which acts on the indices. If we define the (periodic) forward finite difference matrix $D_+$ by $(D_+ v)_i = v_{i+1} - v_i$, then we have $L = -D_+^\top D_+ = D_- D_+$, where $D_- = -D_+^\top$ is the backward finite difference.
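These identities can be sanity-checked numerically for a small dimension; the dimension and σ below are arbitrary choices of ours:

```python
import numpy as np

n, sigma = 8, 1.0

# Periodic forward difference: (D_plus v)_i = v_{i+1} - v_i (indices mod n).
D_plus = -np.eye(n) + np.roll(np.eye(n), 1, axis=1)
D_minus = -D_plus.T                # backward difference: (D_minus v)_i = v_i - v_{i-1}

L = D_minus @ D_plus               # discrete periodic Laplacian, stencil [1, -2, 1]
A = np.eye(n) - sigma * L          # the circulant matrix of Eq. (1)

# A is circulant, with 1 + 2*sigma on the diagonal and -sigma on the (cyclic)
# neighbors, and all of its eigenvalues are at least 1.
```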
We summarize the benefits of this simple LS operation below:
- It reduces the variance of the stochastic gradient on-the-fly and reduces the optimality gap when a constant step size is used.
- It allows us to take a larger step size than standard (S)GD.
- It is applicable to training a large variety of ML models, including DNNs, with better generalization.
- It converges faster, numerically, for objective functions that have a large condition number.
- It avoids local sharp minima empirically.
Moreover, as a straightforward extension, we generalize LS to high-order smoothing operators, e.g., biharmonic smoothing.
1.2 Related work
There is an extensive volume of research over the past decades on designing algorithms to speed up convergence. These include using momentum and other heavy-ball methods, reducing the variance of the stochastic gradient, and adapting the learning rate. We discuss the related work from these three perspectives.
The first idea to accelerate the convergence of GD and SGD is to apply momentum. Around local optima, the surface can curve much more steeply in one dimension than in another [43], whence (S)GD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum was proposed to accelerate (S)GD in the relevant direction and dampen oscillations [34]. Nesterov accelerated gradient (NAG) was introduced to slow down progress before the surface slopes up, and it provably converges faster in specific scenarios [31]. There has been much recent progress in the development of momentum methods; a relatively complete survey can be found in [3].
Because the variance of the stochastic gradient is a bottleneck, a natural idea is to reduce it. There are several principles for developing variance reduction algorithms: dynamic sample size methods; gradient aggregation, where control variate techniques are widely used, with representative works including SAGA [11], SCSG [24], and SVRG [19]; and iterative averaging methods. A thorough survey can be found in [8].
Another category of work tries to speed up the convergence of GD and SGD by using an adaptive step size, which makes use of historical gradients to adapt the step size. RMSProp [44] and Adagrad [13] adapt the learning rate to the parameters, performing smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e., high learning rates) for parameters associated with infrequent features. Both RMSProp and Adagrad make the learning rate depend on historical gradients. Adadelta [48] extends the idea of RMSProp and Adagrad: instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size. Adam [21] and AdaMax [21] behave like a heavy ball with friction; they compute decaying averages of past gradients and past squared gradients to adapt the learning rate. AMSGrad [36] fixes an issue of Adam, which may fail to converge to an optimal solution. Adam can be viewed as a combination of RMSProp and momentum: RMSProp contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients. Since NAG is superior to vanilla momentum, Dozat [12] proposed NAdam, which combines the ideas of Adam and NAG.
1.3 Notations
Throughout this paper, we use boldface uppercase letters to denote matrices and boldface lowercase letters to denote vectors. We use $\|\cdot\|$ to denote the $\ell_2$ norm for vectors and the spectral norm for matrices, and we use $\lambda_{\max}$, $\lambda_{\min}$, and $\lambda_k$ to denote the largest, smallest, and $k$th largest eigenvalues, respectively. For a function $f$, we use $\nabla f$ and $\nabla^2 f$ to denote its gradient and Hessian, and $w^*$ to denote a local minimum of $f$. For a positive definite matrix $A$, we define the vector norm induced by $A$ as $\|v\|_A = \sqrt{v^\top A v}$.
1.4 Organization
We organize this paper as follows: In Section 2, we introduce the LS(S)GD algorithm and the FFT-based fast solver. In Section 3, we show that LS(S)GD allows a larger step size than (S)GD, based on estimates for the introduced discrete Laplacian operator. In Section 4, we show that LS reduces the variance of SGD both empirically and theoretically. We show that LSGD can avoid some local minima and speed up convergence numerically in Section 5. In Section 6, we show the benefits of LS in deep learning, including training LeNet [23], ResNet [17], Wasserstein generative adversarial nets (WGANs) [27], and a deep reinforcement learning (DRL) model. The convergence analysis for LS(S)GD is provided in Section 7. The connection to Hamilton-Jacobi partial differential equations (HJ-PDEs) and future directions are discussed in Section 8. Most of the technical proofs are provided in Section 9.
2 Laplacian Smoothing (Stochastic) Gradient Descent
We present our algorithm for SGD in the finite-sum setting. The GD and other settings follow straightforwardly. Consider the following finite-sum optimization problem
(2)   $\min_{w} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w),$

where $f_i(w)$ is the loss of a given ML model on the $i$th training datum. This finite-sum formalism is an abstraction of training many of the ML models mentioned above. To solve the optimization problem in Eq. (2), starting from an initial guess $w^0$, the $k$th iteration of SGD reads

(3)   $w^{k+1} = w^k - \eta_k \nabla f_{i_k}(w^k),$

where $\eta_k$ is the step size and $i_k$ is a random sample, drawn with replacement, from $\{1, 2, \dots, n\}$.
We propose to replace the stochastic gradient with its Laplacian-smoothed surrogate; we call the resulting algorithm LSSGD, which is written as

(4)   $w^{k+1} = w^k - \eta_k A_\sigma^{-1} \nabla f_{i_k}(w^k).$
Intuitively, compared to standard (S)GD, this scheme smooths the gradient on-the-fly by an elliptic smoothing operator while preserving the mean of the entries of the gradient. We adopt the fast Fourier transform (FFT) to compute $A_\sigma^{-1} g$, which is available in both PyTorch [33] and TensorFlow [2]. Given a vector $g$, the smoothed vector $d = A_\sigma^{-1} g$ can be obtained by solving $g = d - \sigma v * d$, where $v = (-2, 1, 0, \dots, 0, 1)^\top$ and $*$ is the (circular) convolution operator. Therefore,

$d = \mathrm{ifft}\left( \frac{\mathrm{fft}(g)}{\mathbf{1} - \sigma \cdot \mathrm{fft}(v)} \right),$

where we use component-wise division (here, fft and ifft are the FFT and inverse FFT, respectively). Hence, the gradient smoothing can be done in quasi-linear time. This additional time complexity is almost the same as that of performing a one-step update on the weight vector $w$. For many machine learning models, we may need to concatenate the parameters into a vector. This reshaping might lead to some ambiguity; nevertheless, based on our tests, both row- and column-major reshaping work for the LSGD algorithm. Moreover, in deep learning, the weights in different layers might have different physical meanings. For these cases, we perform layer-wise gradient smoothing instead. We summarize LSSGD for solving the finite-sum optimization problem in Eq. (2) in Algorithm 1.
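A minimal NumPy sketch of this FFT-based smoothing (the function name and defaults are ours, not from the paper's code):

```python
import numpy as np

def laplacian_smooth(g, sigma=1.0):
    """Compute d = A_sigma^{-1} g via FFT, where A_sigma = I - sigma * L."""
    n = g.shape[0]
    v = np.zeros(n)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0    # periodic Laplacian stencil
    denom = 1.0 - sigma * np.fft.fft(v)   # entries 1 + 4*sigma*sin^2(pi*k/n) >= 1
    return np.real(np.fft.ifft(np.fft.fft(g) / denom))
```

Since the $k = 0$ entry of the denominator equals one, the DC component of $g$ is untouched, which is exactly why the smoothing preserves the mean of the gradient entries.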
Remark 1.
In image processing and elsewhere, the Sobolev gradient [20] uses a multi-dimensional Laplacian operator that acts on the underlying spatial variables, and it differs from the one-dimensional discrete Laplacian employed in our LSGD scheme, which acts on the indices of the weights.
It is worth noting that LS is a complement to heavy-ball methods, e.g., Nesterov momentum, and to adaptive learning rate algorithms, e.g., Adam. It can be combined with these acceleration techniques to speed up convergence. We will show the performance of these combined algorithms in Section 6.
2.1 Generalized smoothing gradient descent
We can generalize LS to the $n$th-order discrete hyper-diffusion operator as follows:

$A_\sigma^n = I + (-1)^n \sigma L^n.$

Each row of the discrete Laplacian operator $L$ consists of an appropriate arrangement of the weights in the central finite difference approximation to the second-order derivative. Similarly, each row of $L^n$ is an arrangement of the weights in the central finite difference approximation to the $2n$th-order derivative.
Remark 2.
The $n$th-order smoothing operator can only be applied to problems whose dimension is at least the width of the corresponding finite difference stencil. Otherwise, we need to add dummy variables to the objective function.
Again, we apply FFT to compute the smoothed gradient vector. For a given gradient vector $g$, the smoothed surrogate $d = (A_\sigma^n)^{-1} g$ can be obtained by solving $g = d + (-1)^n \sigma\, v_n * d$, where $v_n$ is a coefficient vector of the same dimension as the gradient to be smoothed, and $v_n$ can be obtained recursively.
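A sketch of the higher-order variant, assuming the generalized operator takes the form $I + (-1)^n \sigma L^n$ (so that $n = 1$ recovers $I - \sigma L$); instead of building the coefficient vector explicitly, we use the Fourier symbol of $L$, which is $-4\sin^2(\pi k / d)$:

```python
import numpy as np

def hyper_smooth(g, sigma=1.0, order=1):
    """Smooth g by (I + (-1)^order * sigma * L^order)^{-1} in Fourier space."""
    d = g.shape[0]
    sym = -4.0 * np.sin(np.pi * np.arange(d) / d) ** 2   # eigenvalues of L
    denom = 1.0 + sigma * (-sym) ** order                # >= 1 at every frequency
    return np.real(np.fft.ifft(np.fft.fft(g) / denom))
```

All orders cost one forward and one inverse FFT, consistent with Remark 3.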
Remark 3.
The computational complexity is the same for every smoothing order when the FFT is utilized to compute the surrogate gradient.
3 The Choice of Step Size
In this section, we discuss the step size for LS(S)GD, with a theoretical focus on LSGD applied to Lipschitz functions.
Definition 1 (Lipschitz).
We say that the function $f$ is $L$-Lipschitz if, for any $x, y$, we have $|f(x) - f(y)| \le L\|x - y\|$.
Remark 4.
If the function $f$ is $L$-Lipschitz and differentiable, then for any $x$, we have $\|\nabla f(x)\| \le L$.
For Lipschitz functions, the largest suitable step size for GD is known [32]. In the following, we establish an estimate of the square root of the LS operator applied to an arbitrary vector. Based on this estimate, we show that LSGD can take a larger step size than GD.
To determine the largest suitable step size for LSGD, we first perform a change of variables in the LSGD iteration by letting $u = A_\sigma^{1/2} w$; then LSGD can be written as
(5) 
which is exactly GD for solving the following minimization problem

(6)   $\min_u g(u) := f(A_\sigma^{-1/2} u).$

Therefore, determining the largest suitable step size for LSGD is equivalent to determining the largest suitable step size of GD for $g$, and it thus suffices to find the Lipschitz constant of $g$, i.e., to find
Note that for , we have
To find the largest appropriate step size, we need to further estimate .
3.1 Estimates of the LS operator
Proposition 1.
Given any vector $v$, we have

(7)   $\|A_\sigma^{-1/2} v\| \le \|v\|.$
Proof.
Observe that . Therefore,
where we used for the second last equality. ∎
Proposition 1 shows that the Lipschitz constant of $g$ is not larger than that of $f$. Therefore, LSGD can take at least the same step size as GD. However, note that the gap $\|v\| - \|A_\sigma^{-1/2} v\|$ can be arbitrarily close to zero, so LSGD cannot always take a strictly larger step size than GD. Next, we establish a high-probability estimate showing that LSGD can typically take a larger step size.
Without any prior knowledge of the gradient, let us assume it is sampled uniformly from a ball in a high-dimensional space centered at the origin; without loss of generality, we assume the radius of this ball is one. Under this ansatz, we have the following result.
Theorem 1 (estimate).
Let , and
where , , are the roots of unity. Let
be uniformly distributed in the unit ball of the
dimensional space. Then
(8)
for any .
The proof of this theorem is provided in the appendix. For high-dimensional ML problems, e.g., training DNNs, the dimension can be as large as tens of millions, so the probability above will be almost one. The closed form of the limit is given in Lemma 1.
Lemma 1.
If denote the roots of unity, then
(9) 
as , where
The proof of the above lemma requires some tools from complex analysis and harmonic analysis and is provided in the appendix. Table 1 lists some typical values for different $\sigma$ and dimensions.
σ      1      2      3      4      5

0.447  0.333  0.277  0.243  0.218
0.447  0.333  0.277  0.243  0.218
0.447  0.333  0.277  0.243  0.218
Based on the estimate in Theorem 1, LSGD can take a larger step size than GD for high-dimensional Lipschitz functions with high probability. We will verify this result numerically in the following sections.
4 Variance Reduction
The variance of SGD is one of the major bottlenecks that slows down the theoretically guaranteed convergence rate in training ML models. Most existing variance reduction algorithms require either full-batch gradients or the storage of the stochastic gradient for each data point, which makes them difficult to use for training high-capacity DNNs. LS is an alternative approach that reduces the variance of the stochastic gradient with negligible extra computational time and memory cost. In this section, we rigorously show that LS reduces the variance of the stochastic gradient and reduces the optimality gap under a Gaussian noise assumption. Moreover, we numerically verify our theoretical results on both a quadratic function and a simple finite-sum optimization problem.
4.1 Gaussian noise assumption
The stochastic gradient $\nabla f_{i_k}(w)$ is, for any $w$, an unbiased estimate of the full-batch gradient $\nabla F(w)$. Many existing works model the deviation of the stochastic gradient from the full-batch gradient as Gaussian noise $n \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ is the covariance matrix [28]. Therefore, suppressing the variable $w$ for notational simplicity, we can relate the gradient and stochastic gradient vectors as

(10)   $\nabla f_{i_k} = \nabla F + n,$

where $n \sim \mathcal{N}(0, \Sigma)$. Thus, for the LS stochastic gradient, we have

(11)   $A_\sigma^{-1} \nabla f_{i_k} = A_\sigma^{-1} \nabla F + A_\sigma^{-1} n.$
The variances of the stochastic gradient and the LS stochastic gradient are essentially the variances of $n$ and $A_\sigma^{-1} n$, respectively. The following theorem quantifies the relation between the two.
Theorem 2.
Let $\kappa$ denote the condition number of $\Sigma$. Then, for a $d$-dimensional Gaussian random vector $n \sim \mathcal{N}(0, \Sigma)$, we have

(12)
The proof of Theorem 2 will be provided in the appendix.
Table 2 lists the ratio of the variance after LS to the variance before LS for a standard normal random vector, i.e., $n \sim \mathcal{N}(0, I)$. In practice, high-order smoothing reduces the variance even more significantly.
σ       1      2      3      4      5

n = 1  0.268  0.185  0.149  0.129  0.114
n = 2  0.279  0.231  0.207  0.192  0.181
n = 3  0.290  0.256  0.238  0.226  0.218
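The first row of the table above can be reproduced with a quick Monte Carlo sketch of our own: smooth a long standard normal vector with the first-order operator and measure the variance ratio for each σ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
noise = rng.standard_normal(n)

ratios = []
for sigma in (1, 2, 3, 4, 5):
    # Fourier symbol of A_sigma = I - sigma * L
    denom = 1.0 + 4.0 * sigma * np.sin(np.pi * np.arange(n) / n) ** 2
    smoothed = np.real(np.fft.ifft(np.fft.fft(noise) / denom))
    ratios.append(smoothed.var() / noise.var())

# ratios come out close to 0.268, 0.185, 0.149, 0.129, 0.114
```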
Moreover, for any vector, LS decreases the largest component and increases the smallest component (Proposition 2), and it preserves the mean (Proposition 3).
Proposition 2.
For any vector $v$ and $\sigma \ge 0$, let $\hat{v} = A_\sigma^{-1} v$. We have $\max_i \hat{v}_i \le \max_i v_i$ and $\min_i \hat{v}_i \ge \min_i v_i$.
Proof.
Since , it holds that
where periodicity of the subindices is used where necessary. Since the entries of each row of $A_\sigma^{-1}$ are nonnegative and sum to one, we have $\max_i \hat{v}_i \le \max_i v_i$. A similar argument shows that $\min_i \hat{v}_i \ge \min_i v_i$. ∎
Proposition 3.
The operator $A_\sigma^{-1}$ preserves the sum of components: for any $v$ and $\sigma \ge 0$, we have $\sum_i (A_\sigma^{-1} v)_i = \sum_i v_i$, or equivalently, $\mathbf{1}^\top A_\sigma^{-1} v = \mathbf{1}^\top v$.
Proof.
Since ,
where we used . ∎
4.2 Reducing the optimality gap
A direct benefit of variance reduction is a reduced optimality gap for SGD when a constant step size is applied. We state the corresponding result below.
Proposition 4.
Suppose is convex with the global minimizer , and . Consider the following iteration with constant learning rate
where the sampled gradient in the $k$th iteration at $w^k$ is unbiased. Denote the ergodic average of the iterates. Then the optimality gap is
Proof.
Since is convex, we have
(13) 
Furthermore,
where the last inequality is due to (13). We rearrange the terms and arrive at
Summing over $k$, averaging, and using the convexity of $f$, we have
Taking the limit as above establishes the result. ∎
Remark 5.
Since the variance term is smaller than the corresponding value without LS, the optimality gap is reduced when LS is used with a constant step size. In practice, this also holds for the stagewise step size, since the step size is constant within each stage of the training phase.
4.2.1 Optimization for quadratic function
In this part, we empirically show the advantages of LS(S)GD and its generalized schemes for convex optimization problems. Consider finding the minimum of the quadratic function defined in Eq. (14).
(14)
To simulate SGD, we add Gaussian noise to the gradient vector, i.e., at any given point, we use
where the scalar controls the noise level and the noise vector is Gaussian with zero mean and unit variance in each coordinate. The corresponding numerical schemes can be formulated as
(15)
where $\sigma$ is the smoothing parameter, selected to remove the intense noise. We take diminishing step sizes, with different initial values for SGD/smoothed SGD and for GD/smoothed GD, respectively. Without noise, the smoothing allows us to take larger step sizes: rounding to the first digit gives the largest suitable step sizes for GD and its smoothed version here. We study both a constant learning rate and an exponentially decaying learning rate, i.e., after every 1000 iterations the learning rate is divided by 10. We apply the schemes corresponding to the different smoothing orders in Eq. (15) to the problem in Eq. (14), starting from the same initial point.
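The experiment can be sketched as follows; since the exact quadratic of Eq. (14) and the parameter values were lost in extraction, we use $f(x) = \frac{1}{2}\|x\|^2$ and settings of our own choosing as a stand-in:

```python
import numpy as np

def run(order, sigma=10.0, eta=0.02, steps=2000, n=100, noise_level=0.5):
    """GD on f(x) = 0.5*||x||^2 with Gaussian gradient noise, smoothing the
    noisy gradient with the order-`order` operator (order=0 is plain SGD)."""
    rng = np.random.default_rng(0)
    if order == 0:
        denom = np.ones(n)
    else:
        denom = 1.0 + sigma * (4.0 * np.sin(np.pi * np.arange(n) / n) ** 2) ** order
    x = np.ones(n)
    for _ in range(steps):
        g = x + noise_level * rng.standard_normal(n)     # noisy gradient of f
        x -= eta * np.real(np.fft.ifft(np.fft.fft(g) / denom))
    return 0.5 * x @ x                                   # optimality gap
```

With these settings, the smoothed runs settle at a visibly smaller optimality gap than the plain run, mirroring the behavior reported in Fig. 1.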
Figure 1 shows iteration vs. optimality gap when a constant learning rate is used. In the noise-free case, all three schemes converge linearly. When there is noise, our smoothed gradient helps reduce the optimality gap and converges faster after the first few iterations.
The exponentially decaying learning rate helps our smoothed SGD reach a point with a smaller optimality gap, and higher-order smoothing further reduces the optimality gap, as shown in Fig. 2. This is due to the noise removal properties of the smoothing operators.
4.2.2 Find the center of multiple points
Consider finding the center of a given set of 5K random points. (We thank Professor Adam Oberman for suggesting this problem to us.) This problem can be formulated as the following finite-sum optimization problem
(16)
We solve this optimization problem by running either SGD or LSSGD for 20K iterations, starting from the same random initial point, with batch size 20. The initial step size is set to 1.0 and 1.2, respectively, for SGD and LSSGD, and decays by a factor of 1.1 after every 10 iterations. As the learning rate decays, the variance of the stochastic gradient decays [46]; thus we decay $\sigma$ by a factor of 10 after every 1K iterations. Figure 3 (a) plots a 2D cross section of the trajectories of SGD and LSSGD; it shows that the trajectory of SGD is noisier than that of LSSGD. Figure 3 (b) plots iteration vs. loss for both SGD and LSSGD, averaged over 3 independent runs. LSSGD converges faster than SGD and has a smaller optimality gap than SGD. This numerical result verifies our theoretical results on the optimality gap (Proposition 4).
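A self-contained sketch of this experiment with synthetic stand-in points and a constant step size (the paper's exact data, schedules, and parameters are simplified away):

```python
import numpy as np

points = np.random.default_rng(1).standard_normal((5000, 50))   # stand-in data
center = points.mean(axis=0)                                    # the minimizer

def run(smooth, sigma=1.0, eta=0.5, steps=3000, batch=20, seed=2):
    rng = np.random.default_rng(seed)
    n = points.shape[1]
    denom = 1.0 + 4.0 * sigma * np.sin(np.pi * np.arange(n) / n) ** 2
    x = np.zeros(n)
    tail = []
    for k in range(steps):
        batch_mean = points[rng.integers(0, len(points), batch)].mean(axis=0)
        g = x - batch_mean                    # stochastic gradient of the objective
        if smooth:
            g = np.real(np.fft.ifft(np.fft.fft(g) / denom))
        x -= eta * g
        if k >= steps - 500:                  # average the tail to reduce noise
            tail.append(np.linalg.norm(x - center))
    return float(np.mean(tail))
```

In this toy setting, the smoothed run ends up closer to the true center than the plain run, consistent with Proposition 4.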
4.2.3 Multiclass Logistic regression
Consider applying the proposed optimization schemes to train the multi-class Logistic regression model. We run 200 epochs of SGD and the different-order smoothing algorithms to maximize the likelihood of the multi-class Logistic regression with batch size 100, and we apply an exponentially decaying learning rate that decays by a factor of 10 after every 50 epochs. We train the model with only 10 randomly selected MNIST training images and test the trained model on the entire set of testing images. We further compare with SVRG under the same setting. Figure 4 shows the histograms of the generalization accuracy of the models trained by SGD (a), SVRG (b), LSSGD (order 1) (c), and LSSGD (order 2) (d). It is seen that SVRG somewhat improves the generalization, with higher averaged accuracy. However, the first- and second-order LSSGD-type algorithms lift the averaged generalization accuracy even more and remarkably reduce the variance of the generalization accuracy over 100 independent trials.
4.3 Iteration v.s. loss
In this part, we show the evolution of the loss in training the multi-class Logistic regression model by SGD, SVRG, and LSGD with first- and second-order smoothing, respectively, as illustrated in Fig. 5. At each iteration, among 100 independent experiments, SGD has the largest variance, and SGD with the first-order smoothed gradient significantly reduces the variance of the loss across experiments. The second-order smoothing further reduces the variance. The variance of the loss at each iteration among the 100 experiments is smallest when SVRG is used to train the model. However, the generalization performance of the model trained by SVRG is not as good as that of the models trained by LSSGD or higher-order smoothed gradient descent (Fig. 4 (b)).
(a) SGD  (b) SVRG 
(c) LSGD: Order 1  (d) LSGD: Order 2 
4.4 Variance reduction in stochastic gradient
We now verify the efficiency of the variance reduction numerically. We simplify the problem by applying the multi-class Logistic regression only to digits 1 and 2 of the MNIST training data. To compute the variance of the (LS-)stochastic gradients, we first compute the descent path of (LS)GD by applying full-batch (LS)GD, starting from the same random initialization, and we record the full-batch (LS-)gradient at each point along the descent path. Then we compute the (LS-)stochastic gradients at each point along the path, using different batch sizes and smoothing parameters $\sigma$; for each setting we run 100 independent experiments. We then compute the variance of the (LS-)stochastic gradient among these 100 experiments, regarding the full-batch (LS-)gradient as the mean, at each point along the full-batch (LS)GD descent path. For each pair of batch size and $\sigma$, we report the maximum variance over all coordinates of the gradient and all points along the descent path. We list the variance results in Table 3 (note that $\sigma = 0$ corresponds to SGD). These results show that, compared to SGD, LSGD reduces the maximum variance substantially for all batch sizes. It is worth noting that higher-order smoothing reduces more variance than lower-order smoothing; this might be due to the fact that the noise of SGD is not Gaussian.
Batch Size  2  5  10  20  50 

1.50E-1  5.49E-2  2.37E-2  1.01E-2  4.40E-3
3.40E-3  1.30E-3  5.45E-4  2.32E-4  9.02E-5
2.00E-3  7.17E-4  3.46E-4  1.57E-4  5.46E-5
1.40E-3  4.98E-4  2.56E-4  1.17E-4  3.97E-5
5 Numerical Results on Avoiding Local Minima and Speeding Up Convergence
We first show that LSGD can bypass sharp minima and reach the global minimum. We consider the following function, in which we 'drill' narrow holes in a smooth convex function,
(17)
where the summation is taken over a finite index set, and the parameters determining the location and narrowness of the local minima are fixed. We run GD and LSGD starting from a random point in a neighborhood of one of the narrow minima. Our experiments (Fig. 6) show that GD converges to a narrow local minimum, while LSGD converges to the wider global minimum.
Next, let us compare LSGD with some popular optimization methods on the benchmark 2D Rosenbrock function, which is nonconvex. Its global minimum lies inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. The function is defined by
(18)
and we fix its parameters in the following experiments.
Starting from a fixed initial point, we run 2K iterations of the following optimizers: GD, GD with Nesterov momentum [31], Adam [21], RMSProp [44], and LSGD. The same step size is used for all these methods. Figure 7 plots iteration vs. objective value; it shows that GD with Nesterov momentum converges faster than all the other algorithms. The second best algorithm is LSGD. Meanwhile, Nesterov momentum can be used to speed up LSGD as well, and we will show this numerically when training DNNs in Section 6.
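The GD-versus-LSGD part of this comparison can be sketched as below. The parameters $a = 1$, $b = 100$, the start point, the step size, and σ are our own choices (the paper's exact values were lost in extraction), and for this 2D case we apply the 2×2 periodic smoothing matrix directly:

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([-2 * (a - x) - 4 * b * x * (y - x ** 2),
                     2 * b * (y - x ** 2)])

sigma = 0.5
# For n = 2 the circulant smoothing matrix is [[1+2s, -2s], [-2s, 1+2s]].
A = np.array([[1 + 2 * sigma, -2 * sigma], [-2 * sigma, 1 + 2 * sigma]])

eta, start = 1e-4, np.array([-1.2, 1.0])
p_gd, p_ls = start.copy(), start.copy()
for _ in range(2000):
    p_gd = p_gd - eta * rosenbrock_grad(p_gd)                   # plain GD
    p_ls = p_ls - eta * np.linalg.solve(A, rosenbrock_grad(p_ls))  # LSGD
```

Both iterates descend toward the valley; Remark 2 suggests adding dummy variables for such low-dimensional problems, which we skip here for brevity.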
Figure 8 depicts snapshots (the 300th, 600th, 900th, and 1200th iterations, respectively) of the trajectories of the different optimization algorithms. These figures show that even though GD with momentum converges faster, it suffers from overshoots and detours on its way to the minimum. All the other algorithms follow a direct path to the minimum, and among them LSGD converges fastest.
Furthermore, LSGD can itself be accelerated with Nesterov momentum. As shown in Fig. 9, LSGD with Nesterov momentum converges much faster than GD with momentum, especially for the high-dimensional Rosenbrock function.
6 Application to Deep Learning
6.1 Train neural nets with small batch size
Many advanced artificial intelligence tasks make high demands on training neural nets with extremely small batch sizes. The milestone technique for this is group normalization [47]. In this section, we show that LSSGD successfully trains DNNs with an extremely small batch size. We consider LeNet-5 [23] for MNIST classification. Our network architecture is as follows: a 2D convolutional layer whose output channels are each the sum of channel-wise convolutions of the input with a learnable kernel, followed by a ReLU nonlinearity and max pooling; then an affine transformation that maps the input to a vector of dimension 512; finally, the tensors are activated by a multi-class Logistic function. The MNIST data is first passed to the convolutional layer and further processed by this hierarchical structure. We run both SGD and LSSGD, dividing the initial learning rate by 10 after 50 epochs, and we use weight decay and momentum. Figure 10 (a) plots the generalization accuracy on the test set for LeNet-5 trained with different batch sizes. For each batch size, LSSGD keeps the testing accuracy high, while SGD degrades the accuracy noticeably when batch size 4 is used, and the classification becomes just a random guess when the model is trained by SGD with batch size 2. A small batch size leads to large noise in the gradient, which may cause the noisy gradient to deviate from a descent direction; Laplacian smoothing rescues this by reducing the noise.
6.2 Improve generalization accuracy
The skip connections in ResNet smooth the landscape of the loss function of the classical CNN [17, 26]. This means that ResNet has fewer sharp minima. On CIFAR-10 [22], we compare the performance of LSSGD and SGD on ResNet, with the pre-activated ResNet-56 as an illustration. We use the same training strategy as in [17], except that the learning rate decays by a factor of 10 after every 40 epochs. For ResNet, instead of applying LSSGD for all epochs, we only use LSSGD in the first 40 epochs, and the remaining training is carried out by SGD (this saves the extra computational cost due to LS, and we noticed that the performance is similar to the case where LS is used for the whole training process). Figure 10 (b) depicts one path of the training and generalization accuracy of the neural nets trained by SGD and LSSGD, respectively. It is seen that, even though the training accuracy obtained by SGD is higher than that of LSSGD, the generalization is inferior to that of LSSGD. We conjecture that this is because SGD gets trapped in some sharp but deeper minimum, which fits better but generalizes worse. We carry out replicas of this experiment; the histograms of the corresponding accuracies are shown in Fig. 11.
6.3 Training Wasserstein GAN
Generative adversarial networks (GANs) [15] are notoriously delicate and unstable to train [4]. In [27], Wasserstein GANs (WGANs) were introduced to combat the instability in training GANs. In addition to being more robust to training parameters and network architecture, WGANs provide a reliable estimate of the Earth Mover (EM) metric, which correlates well with the quality of the generated samples. Nonetheless, WGAN training becomes unstable with a large learning rate or when used with a momentum-based optimizer [27]. In this section, we demonstrate that the gradient smoothing technique in this paper alleviates the instability in training and improves the quality of generated samples. Since WGANs with weight clipping are typically trained with RMSProp [44], we propose replacing the gradient by its smoothed version and also updating the running averages using the smoothed gradient. We name this algorithm LSRMSProp.
To accentuate the instability in training and demonstrate the effects of gradient smoothing, we deliberately use a large learning rate for training the generator. We compare regular RMSProp with LSRMSProp. The learning rate for the critic is kept small, and the critic is trained approximately to convergence so that the critic loss remains an effective approximation to the Wasserstein distance. To control the number of unknowns in the experiment and make a meaningful comparison using the critic loss, we use classical RMSProp for the critic and only apply LSRMSProp to the generator.
We train the WGANs on the MNIST dataset using DCGAN [35] architectures for both the critic and the generator. In Figure 12 (top), we observe that the loss for RMSProp trained with a large learning rate has multiple sharp spikes, indicating instability in the training process. The samples generated are also lower in quality, containing noisy spots, as shown in Figure 13 (a). In contrast, the training loss curve for LSRMSProp is smoother and exhibits fewer spikes, and the generated samples, shown in Fig. 13 (b), are of better quality, visibly less noisy, and more realistic than those in Fig. 13 (a). The effects are less pronounced with a small learning rate, but still result in a modest improvement in sample quality, as shown in Figure 13 (c) and (d). We also applied LSRMSProp to training the critic, but did not see a clear improvement in quality. This may be because the critic is already trained to near optimality during each iteration and does not benefit much from gradient smoothing.
6.4 Deep reinforcement learning
Deep reinforcement learning (DRL) has been applied to playing games, including Cartpole [9], Atari [30], and Go [42, 29]. A DNN plays a vital role in approximating the Q-function or the policy function. We apply the Laplacian-smoothed gradient to train a policy function to play the Cartpole game, following the standard policy gradient procedure [9], and we use a small network to approximate the policy function.
The network is trained by RMSProp and LSRMSProp, respectively. The learning rate and other related parameters are set to the defaults in PyTorch. Training is stopped once the average duration of 5 consecutive episodes exceeds 490 steps; in each training episode, we cap the number of steps at 500. The left and right panels of Fig. 14 depict training with RMSProp and LSRMSProp, respectively. We see that the Laplacian-smoothed gradient takes fewer episodes to reach the stopping criterion. Moreover, we run the above experiment 5 times independently and apply the trained models to play Cartpole. The game lasts more than 1000 steps for all 5 models trained by LSRMSProp, while only 3 of the models trained by vanilla RMSProp last more than 1000 steps.
7 Convergence Analysis
Note that the LS matrix $A_\sigma^{-1}$ is positive definite, and its largest and smallest eigenvalues are $1$ and $1/(1+4\sigma)$, respectively. It is straightforward to show that all the convergence results for (S)GD still hold for LS(S)GD. In this section, we show some additional convergence results for LS(S)GD, with a focus on LSGD; the corresponding results for LSSGD follow similarly.
Proposition 5.
Consider the iteration $w^{k+1} = w^k - \eta A_\sigma^{-1} \nabla f(w^k)$. Suppose $f$ is Lipschitz smooth and the step size $\eta$ is suitably small. Then $\nabla f(w^k) \to 0$. Moreover, if the Hessian of $f$ is continuous, $w^*$ is the minimizer of $f$, and $w^k \to w^*$, then the convergence is linear.
Proof.
By the Lipschitz continuity of and the descent lemma [5], we have
Summing the above inequality over $k$, we have
Therefore, the series of squared gradient norms converges, and thus $\nabla f(w^k) \to 0$.
For the second claim, we have
Therefore,
So if , the result follows. ∎
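The linear convergence claimed in Proposition 5 can be observed numerically; the strongly convex quadratic and all parameters below are our own illustrative choices:

```python
import numpy as np

n, sigma, eta = 64, 1.0, 0.1
c = np.linspace(1.0, 5.0, n)           # f(x) = 0.5 * sum(c_i * x_i^2), minimizer x* = 0
denom = 1.0 + 4.0 * sigma * np.sin(np.pi * np.arange(n) / n) ** 2

x = np.ones(n)
errs = []
for _ in range(300):
    g = c * x                          # gradient of f
    x = x - eta * np.real(np.fft.ifft(np.fft.fft(g) / denom))   # LSGD step
    errs.append(np.linalg.norm(x))     # distance to the minimizer
```

The error norm decays geometrically, consistent with linear convergence toward the minimizer.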
Remark 6.
The convergence result in Proposition 5 is also called