Laplacian Smoothing Gradient Descent

06/17/2018 · by Stanley Osher, et al.

We propose a very simple modification of gradient descent and stochastic gradient descent. We show that when applied to a variety of machine learning models, including softmax regression, convolutional neural nets, generative adversarial nets, and deep reinforcement learning, this very simple surrogate can dramatically reduce the variance and improve the generalization accuracy. The new algorithm, which depends on one nonnegative parameter, tends to avoid sharp local minima when applied to non-convex minimization; instead it seeks somewhat flatter local (and often global) minima. The method only involves preconditioning the gradient by the inverse of a tri-diagonal matrix that is positive definite. The motivation comes from the theory of Hamilton-Jacobi partial differential equations, which demonstrates that the new algorithm is almost the same as doing gradient descent on a new function which (a) has the same global minima as the original function and (b) is "more convex". The required programming effort is minimal in cost and complexity. We implement our algorithm in both the PyTorch and TensorFlow platforms, and the code will be made publicly available.


1 Introduction

Stochastic gradient descent (SGD) [37] has been the workhorse for solving large-scale machine learning (ML) problems. It gives rise to a family of algorithms that enable efficient training of many ML models, including deep neural nets (DNNs). SGD utilizes training data very efficiently at the beginning of the training phase, as it converges much faster than GD and L-BFGS during this period [8, 16]. Moreover, the variance of SGD can help gradient-based optimization algorithms circumvent local minima and saddle points and reach solutions that generalize well [38, 18]. However, the variance of SGD also slows down convergence after the first few training epochs. To account for the effect of SGD's variance and to ensure the convergence of SGD, a decaying step size has to be applied, which is one of the major bottlenecks for the fast convergence of SGD [7, 41, 40]. Moreover, in training many ML models, a stage-wise schedule of the learning rate is typically used in practice [39, 38]. In this scenario, the variance of SGD usually leads to a large optimality gap.

A natural question arising from the above bottlenecks of SGD is: Can we improve SGD such that the variance of the stochastic gradient is reduced on-the-fly with negligible extra computational and memory overhead, and a larger step size is allowed for training ML models?

We answer the above question affirmatively by applying the discrete one-dimensional Laplacian smoothing (LS) operator to smooth the stochastic gradient vector on-the-fly. The LS operation can be performed efficiently using the fast Fourier transform (FFT). We show that LS reduces the variance of the stochastic gradient and allows a larger step size.

Another issue of standard GD and SGD is that when the Hessian of the objective function has a large condition number, gradient descent performs poorly. In this case, the derivative increases rapidly in one direction while growing slowly in another. As a by-product, we will show numerically that LS can avoid oscillation along steep directions and help make progress in shallow directions effectively [25]. The implicit version of our proposed approach is linked to an unusual Hamilton-Jacobi partial differential equation (HJ-PDE) whose solution makes the original loss function more convex while retaining its flat (and global) minima; our method essentially works on this surrogate function with a much better landscape. See [10] for earlier, related work.

1.1 Our contribution

In this paper, we propose a new modification to stochastic gradient-based algorithms, which at its core uses the LS operator to reduce the variance of the stochastic gradient vector on-the-fly. The (stochastic) gradient smoothing can be done by multiplying the gradient by the inverse of the following circulant convolution matrix:

$$A_\sigma := \begin{bmatrix} 1+2\sigma & -\sigma & 0 & \cdots & 0 & -\sigma \\ -\sigma & 1+2\sigma & -\sigma & \cdots & 0 & 0 \\ 0 & -\sigma & 1+2\sigma & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -\sigma & 0 & 0 & \cdots & -\sigma & 1+2\sigma \end{bmatrix} \qquad (1)$$

for some positive constant $\sigma$. In fact, we can write $A_\sigma = I - \sigma L$, where $I$ is the identity matrix and $L$ is the discrete one-dimensional Laplacian which acts on indices. If we define the (periodic) forward finite difference matrix as

$$D_+ = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & 0 & \cdots & 0 & -1 \end{bmatrix},$$

then we have $L = D_- D_+$, where $D_- = -D_+^\top$ is the backward finite difference matrix.
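As a quick illustration, the following minimal NumPy sketch (our own, with hypothetical helper names, not the authors' released code) builds $A_\sigma = I - \sigma L$ explicitly and confirms that it is a symmetric positive definite matrix that is tri-diagonal up to the periodic corner entries:

```python
import numpy as np

def laplacian_matrix(d):
    """Periodic 1D discrete Laplacian: L[i, i] = -2 and L[i, (i +/- 1) mod d] = 1."""
    L = -2.0 * np.eye(d)
    idx = np.arange(d)
    L[idx, (idx + 1) % d] = 1.0
    L[idx, (idx - 1) % d] = 1.0
    return L

def smoothing_matrix(d, sigma):
    """A_sigma = I - sigma * L: circulant, tri-diagonal up to the corner entries."""
    return np.eye(d) - sigma * laplacian_matrix(d)

A = smoothing_matrix(6, sigma=1.0)
print(np.allclose(A, A.T), np.linalg.eigvalsh(A).min())  # symmetric; smallest eigenvalue is 1
```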

We summarize the benefits of this simple LS operation below:

  • It reduces the variance of the stochastic gradient on-the-fly, and reduces the optimality gap when a constant step size is used.

  • It allows us to take a larger step size than standard (S)GD.

  • It is applicable to training a large variety of ML models, including DNNs, with better generalization.

  • Numerically, it converges faster for objective functions that have a large condition number.

  • Empirically, it helps avoid sharp local minima.

Moreover, as a straightforward extension, we generalize the LS to high-order smoothing operators, e.g., biharmonic smoothing.

1.2 Related work

There has been an extensive volume of research over the past decades on designing algorithms to speed up convergence. These include using momentum and other heavy-ball methods, reducing the variance of the stochastic gradient, and adapting the learning rate. We will discuss the related work from these three perspectives.

The first idea to accelerate the convergence of GD and SGD is to apply momentum. Around local optima, the surface can curve much more steeply in one dimension than in another [43], whence (S)GD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum was proposed to accelerate (S)GD in the relevant direction and dampen oscillations [34]. Nesterov accelerated gradient (NAG) was also introduced to slow down progress before the surface curves up, and it provably converges faster in specific scenarios [31]. There has been a lot of recent progress in the development of momentum; a relatively complete survey can be found in [3].

Due to the bottleneck caused by the variance of the stochastic gradient, a natural idea is to reduce this variance. There are several principles for developing variance reduction algorithms, including dynamic sample size methods; gradient aggregation, where control-variate techniques are widely used and representative works include SAGA [11], SCSG [24], and SVRG [19]; and iterative averaging methods. A thorough survey can be found in [8].

Another category of work tries to speed up the convergence of GD and SGD by using an adaptive step size, which makes use of historical gradients to adapt the step size. RMSProp [44] and Adagrad [13] adapt the learning rate to the parameters, performing smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features, and larger updates (i.e., higher learning rates) for parameters associated with infrequent features. Both RMSProp and Adagrad make the learning rate depend on historical gradients. Adadelta [48] extends the ideas of RMSProp and Adagrad: instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to a fixed size. Adam [21] and AdaMax [21] behave like a heavy ball with friction; they compute decaying averages of past gradients and past squared gradients to adapt the learning rate. AMSGrad [36] fixes the issue that Adam may fail to converge to an optimal solution. Adam can be viewed as a combination of RMSProp and momentum: RMSProp contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients. Since NAG is superior to vanilla momentum, Dozat [12] proposed NAdam, which combines the ideas of Adam and NAG.

1.3 Notations

Throughout this paper, we use boldface upper-case letters to denote matrices and boldface lower-case letters to denote vectors. We use $\|\cdot\|$ to denote the $\ell_2$-norm for vectors and the spectral norm for matrices, respectively. We use $\lambda_{\max}$, $\lambda_{\min}$, and $\lambda_k$ to denote the largest, the smallest, and the $k$-th largest eigenvalues, respectively. For a function $f: \mathbb{R}^d \to \mathbb{R}$, we use $\nabla f$ and $\nabla^2 f$ to denote its gradient and Hessian, and $w^*$ to denote a local minimum of $f$. For a positive definite matrix $A$, we define the vector norm induced by $A$ as $\|w\|_A := \sqrt{w^\top A w}$. The list $\{1, 2, \ldots, n\}$ is denoted by $[n]$.

1.4 Organization

We organize this paper as follows: In section 2, we introduce the LS(S)GD algorithm and the FFT-based fast solver. In section 3, we show that LS(S)GD allows us to take a larger step size than (S)GD, based on norm estimates of the introduced LS operator. In section 4, we show that LS reduces the variance of SGD both empirically and theoretically. We show that LSGD can avoid some local minima and speed up convergence numerically in section 5. In section 6, we show the benefit of LS in deep learning, including training LeNet [23], ResNet [17], Wasserstein generative adversarial nets (WGANs) [27], and a deep reinforcement learning (DRL) model. The convergence analysis for LS(S)GD is provided in section 7. The connection to Hamilton-Jacobi partial differential equations (HJ-PDEs) and future directions are discussed in section 8. Most of the technical proofs are provided in section 9.

2 Laplacian Smoothing (Stochastic) Gradient Descent

We present our algorithm for SGD in the finite-sum setting. The GD and other settings follow straightforwardly. Consider the following finite-sum optimization

$$\min_{w}\; f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (2)$$

where $f_i(w)$ is the loss of a given ML model on the training datum $x_i$. This finite-sum formalism is an abstraction of training many of the ML models mentioned above. To solve the optimization problem in Eq. (2), starting from some initial guess $w^0$, the $k$-th iteration of SGD reads

$$w^{k+1} = w^k - \eta_k \nabla f_{i_k}(w^k), \qquad (3)$$

where $\eta_k$ is the step size and $i_k$ is a random sample drawn with replacement from $[n]$.

We propose to replace the stochastic gradient by the Laplacian smoothed surrogate, and we call the resulting algorithm LSSGD, which is written as

$$w^{k+1} = w^k - \eta_k A_\sigma^{-1} \nabla f_{i_k}(w^k). \qquad (4)$$

Intuitively, compared to standard (S)GD, this scheme smooths the gradient on-the-fly by an elliptic smoothing operator while preserving the mean of the entries of the gradient. We adopt the fast Fourier transform (FFT), available in both PyTorch [33] and TensorFlow [2], to compute $A_\sigma^{-1} g$ for a given (stochastic) gradient $g$. The smoothed vector $h = A_\sigma^{-1} g$ can be obtained by solving $A_\sigma h = g$, which is equivalent to $h - \sigma\, v * h = g$, where $v = (-2, 1, 0, \ldots, 0, 1)^\top$ and $*$ is the circular convolution operator. Therefore

$$h = \mathrm{ifft}\left(\frac{\mathrm{fft}(g)}{\mathbf{1} - \sigma\, \mathrm{fft}(v)}\right),$$

where we use component-wise division (here, $\mathrm{fft}$ and $\mathrm{ifft}$ are the FFT and inverse FFT, respectively). Hence, the gradient smoothing can be done in quasilinear time; this additional time complexity is almost the same as performing a one-step update on the weight vector $w$. For many machine learning models, we may need to concatenate the parameters into a vector. This reshaping might lead to some ambiguity; nevertheless, based on our tests, both row-major and column-major reshaping work for the LS-GD algorithm. Moreover, in deep learning, the weights in different layers might have different physical meanings, so in these cases we perform layer-wise gradient smoothing instead. We summarize LSSGD for solving the finite-sum optimization in Eq. (2) in Algorithm 1.
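For concreteness, the FFT-based smoothing step can be sketched as follows; this is our own minimal NumPy illustration of the formula above (function names are ours), not the released PyTorch/TensorFlow implementation:

```python
import numpy as np

def laplacian_smooth(g, sigma=1.0):
    """Return h = A_sigma^{-1} g in O(d log d) time via the FFT."""
    d = g.shape[0]
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0           # convolution kernel of the periodic Laplacian L
    denom = 1.0 - sigma * np.fft.fft(v)          # Fourier symbol (eigenvalues) of A_sigma
    return np.real(np.fft.ifft(np.fft.fft(g) / denom))

g = np.random.randn(1024)
h = laplacian_smooth(g, sigma=1.0)
print(np.allclose(h.sum(), g.sum()))             # smoothing preserves the sum (and mean) of entries
```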

Input: $f_i(w)$ for $i = 1, 2, \ldots, n$.
$w^0$: initial guess of $w$, $T$: the total number of iterations, and $\eta_k$, $k = 0, 1, \ldots, T$: the scheduled step size.
Output: The optimized weights $w^{\mathrm{opt}}$.
for $k = 0, 1, \ldots, T-1$ do
     $w^{k+1} = w^k - \eta_k A_\sigma^{-1} \nabla f_{i_k}(w^k)$. return $w^{\mathrm{opt}} = w^T$
Algorithm 1 LSSGD
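Below is a hedged PyTorch sketch of one LSSGD step with layer-wise gradient smoothing, in the spirit of Algorithm 1; it is our own minimal loop (the model, loss_fn, and the (x, y) batch are supplied by the caller), not the authors' released implementation:

```python
import torch

def ls_sgd_step(model, loss_fn, batch, lr=0.1, sigma=1.0):
    """One LSSGD step: smooth each layer's gradient by A_sigma^{-1} via FFT, then update."""
    x, y = batch
    loss = loss_fn(model(x), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():             # layer-wise smoothing, as discussed above
            if p.grad is None:
                continue
            g = p.grad.reshape(-1)
            d = g.numel()
            if d < 3:                            # tiny tensors: fall back to a plain SGD update
                p.add_(p.grad, alpha=-lr)
                continue
            v = torch.zeros(d, device=g.device)
            v[0], v[1], v[-1] = -2.0, 1.0, 1.0
            denom = 1.0 - sigma * torch.fft.fft(v)
            sg = torch.fft.ifft(torch.fft.fft(g) / denom).real
            p.add_(sg.reshape(p.shape), alpha=-lr)   # w <- w - lr * A_sigma^{-1} grad
    return loss.item()
```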
Remark 1.

In image processing and elsewhere, the Sobolev gradient [20] uses a multi-dimensional Laplacian operator that operates on the underlying spatial domain, and is different from the one-dimensional discrete Laplacian operator employed in our LS-GD scheme, which operates on indices.

It is worth noting that LS is complementary to heavy-ball approaches, e.g., Nesterov momentum, and to adaptive learning rate algorithms, e.g., Adam. It can be combined with these acceleration techniques to speed up convergence. We will show the performance of such combinations in Section 6.

2.1 Generalized smoothing gradient descent

We can generalize $A_\sigma = I - \sigma L$ to the $n$-th order discrete hyper-diffusion operator as follows:

$$A_\sigma^n := I + (-1)^n \sigma L^n.$$

Each row of the discrete Laplacian operator $L$ consists of an appropriate arrangement of the weights in the central finite difference approximation to the 2nd order derivative. Similarly, each row of $L^n$ is an arrangement of the weights in the central finite difference approximation to the $2n$-th order derivative.

Remark 2.

The $n$-th order smoothing operator can only be applied to problems whose dimension is large enough relative to the smoothing order $n$. Otherwise, we need to add dummy variables to the objective function.

Again, we apply the FFT to compute the smoothed gradient vector. For a given gradient vector $g$, the smoothed surrogate $h = (A_\sigma^n)^{-1} g$ can be obtained by solving $A_\sigma^n h = g$ via the FFT, using the convolution kernel of $L^n$, a vector of the same dimension as the gradient to be smoothed. This coefficient vector can be obtained recursively from the kernel of $L$ by repeated circular convolution.

Remark 3.

The computational complexities for different order smoothing schemes are the same when the FFT is utilized for computing the surrogate gradient.
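As an illustration of the higher-order variant, the sketch below assumes the $n$-th order operator $I + (-1)^n \sigma L^n$ written above, so that its Fourier symbol is $1 + (-1)^n \sigma\, \mathrm{fft}(v)^n$ with the same kernel $v$ as before (a NumPy sketch with our own function names):

```python
import numpy as np

def high_order_smooth(g, sigma=1.0, order=1):
    """Apply (I + (-1)^n sigma L^n)^{-1} to g via the FFT; order=1 recovers A_sigma^{-1}."""
    d = g.shape[0]
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0
    # the symbol is mathematically real and >= 1 (tiny imaginary round-off is discarded below)
    symbol = 1.0 + ((-1.0) ** order) * sigma * np.fft.fft(v) ** order
    return np.real(np.fft.ifft(np.fft.fft(g) / symbol))

g = np.random.randn(512)
for n in (1, 2, 3):
    # every order keeps the sum of entries and never increases the l2 norm
    print(n, np.allclose(high_order_smooth(g, order=n).sum(), g.sum()))
```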

3 The Choice of Step Size

In this section, we discuss the step size issue of LS(S)GD, with a theoretical focus on LSGD for $L$-Lipschitz functions.

Definition 1 ($L$-Lipschitz).

We say the function $f$ is $L$-Lipschitz if for any $w, u$, we have $\|\nabla f(w) - \nabla f(u)\| \le L\, \|w - u\|$.

Remark 4.

If the function $f$ is $L$-Lipschitz and twice differentiable, then for any $w$, we have $\|\nabla^2 f(w)\| \le L$.

For an $L$-Lipschitz function, it is known that the largest suitable step size for GD is $2/L$ [32]. In the following, we will establish an $\ell_2$ estimate of the square root of the LS operator when it is applied to an arbitrary vector. Based on these estimates, we will show that LSGD can take a larger step size than GD.

To determine the largest suitable step size for LSGD, we first perform a change of variable in the LSGD iteration (4) by letting $u = A_\sigma^{1/2} w$, so that LSGD can be written as

$$u^{k+1} = u^k - \eta_k\, \nabla_u f\big(A_\sigma^{-1/2} u^k\big), \qquad (5)$$

which is exactly GD for solving the following minimization problem:

$$\min_u\; f\big(A_\sigma^{-1/2} u\big). \qquad (6)$$

Therefore, determining the largest suitable step size for LSGD is equivalent to finding the largest appropriate step size of GD for the function $u \mapsto f(A_\sigma^{-1/2} u)$. It thus suffices to determine the Lipschitz constant of $\nabla_u f(A_\sigma^{-1/2} u)$, i.e., to find

$$\sup_{u \ne \tilde u}\; \frac{\big\| A_\sigma^{-1/2}\big(\nabla f(A_\sigma^{-1/2} u) - \nabla f(A_\sigma^{-1/2} \tilde u)\big) \big\|}{\|u - \tilde u\|}.$$

Note that for an $L$-Lipschitz function $f$, we have

$$\big\| A_\sigma^{-1/2}\big(\nabla f(A_\sigma^{-1/2} u) - \nabla f(A_\sigma^{-1/2} \tilde u)\big) \big\| \le L\, \big\| A_\sigma^{-1/2} \big\|\, \big\| A_\sigma^{-1/2}(u - \tilde u) \big\|.$$

To find the largest appropriate step size, we need to further estimate $\|A_\sigma^{-1/2} g\|$ for a generic vector $g$.

3.1 $\ell_2$ estimates of $\|A_\sigma^{-1/2} g\|$

Proposition 1.

Given any vector $g \in \mathbb{R}^d$, let $h = A_\sigma^{-1/2} g$. Then

$$\|h\| \le \|g\|. \qquad (7)$$

Proof.

Observe that the eigenvalues of $A_\sigma$ are $1 + 2\sigma - 2\sigma\cos(2\pi k/d) \ge 1$ for $k = 0, 1, \ldots, d-1$. Therefore,

$$\|h\|^2 = \frac{1}{d}\sum_{k=0}^{d-1} \frac{|\hat g_k|^2}{1 + 2\sigma - 2\sigma\cos(2\pi k/d)} \le \frac{1}{d}\sum_{k=0}^{d-1} |\hat g_k|^2 = \|g\|^2,$$

where $\hat g = \mathrm{fft}(g)$ and we used Parseval's identity. ∎

Proposition 1 shows that the Lipschitz constant of $\nabla_u f(A_\sigma^{-1/2} u)$ is not larger than that of $\nabla f$, since

$$\big\| A_\sigma^{-1/2}\big(\nabla f(A_\sigma^{-1/2} u) - \nabla f(A_\sigma^{-1/2} \tilde u)\big) \big\| \le L\, \|u - \tilde u\|.$$

Therefore, LSGD can take at least the same step size as GD. However, the gap between $\|A_\sigma^{-1/2} g\|$ and $\|g\|$ can be arbitrarily close to zero (e.g., when $g$ is a constant vector), so LSGD cannot always take a larger step size than GD. Next, we establish a high-probability estimate that justifies taking a larger step size when using LSGD.

Without any prior knowledge about $g$, let us assume it is sampled uniformly from a ball in $\mathbb{R}^d$ centered at the origin; without loss of generality, we assume the radius of this ball is one. Under this ansatz, we have the following result.

Theorem 1 ($\ell_2$-estimate).

Let $\sigma > 0$ and let $z_1, z_2, \ldots, z_d$ denote the $d$-th roots of unity. Let $g$ be uniformly distributed in the unit ball of the $d$-dimensional space. Then, for any $\epsilon > 0$, with probability tending to one as $d \to \infty$,

$$\big\| A_\sigma^{-1/2} g \big\|^2 \le \big(\beta_{\sigma, d} + \epsilon\big)\, \|g\|^2, \qquad (8)$$

where the constant $\beta_{\sigma, d} < 1$, defined through the roots of unity $z_1, \ldots, z_d$, is given in closed form in Lemma 1.

The proof of this theorem is provided in the appendix. For high-dimensional ML problems, e.g., training DNNs, $d$ can be as large as tens of millions, so the probability will be almost one. The closed form of $\beta_{\sigma, d}$ is given in Lemma 1.

Lemma 1.

If $z_1, z_2, \ldots, z_d$ denote the $d$-th roots of unity, then

$$\beta_{\sigma, d} \longrightarrow \frac{1}{\sqrt{1 + 4\sigma}} \qquad (9)$$

as $d \to \infty$.

The proof of the above lemma requires some tools from complex analysis and harmonic analysis, which are provided in the appendix. Table 1 lists some typical values of $\beta_{\sigma, d}$ for different $\sigma$ and dimensions $d$.

σ             1         2         3         4         5
              0.447     0.333     0.277     0.243     0.218
              0.447     0.333     0.277     0.243     0.218
              0.447     0.333     0.277     0.243     0.218
Table 1: The values of $\beta_{\sigma, d}$ corresponding to several $\sigma$ and $d$; each row corresponds to a different dimension $d$. $\beta_{\sigma, d}$ converges quickly to its limiting value as $d$ increases.

Based on the estimate in Theorem 1, LSGD can take a step size roughly as large as $2/(\beta_{\sigma, d} L)$ for a high-dimensional $L$-Lipschitz function with high probability, which is larger than the step size $2/L$ of GD. We will verify this result numerically in the following sections.

4 Variance Reduction

The variance of SGD is one of the major bottlenecks that slows down the theoretically guaranteed convergence rate in training ML models. Most existing variance reduction algorithms require either the full-batch gradient or the storage of the stochastic gradient for each data point, which makes them difficult to use for training high-capacity DNNs. LS is an alternative approach that reduces the variance of the stochastic gradient with negligible extra computational time and memory cost. In this section, we rigorously show that LS reduces the variance of the stochastic gradient and reduces the optimality gap under the Gaussian noise assumption. Moreover, we numerically verify our theoretical results on both a quadratic function and a simple finite-sum optimization problem.

4.1 Gaussian noise assumption

The stochastic gradient $\nabla f_{i_k}(w)$, for any $w$, is an unbiased estimate of $\nabla f(w)$. Many existing works model the difference between the stochastic gradient and the full-batch gradient as Gaussian noise $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, where $\Sigma$ is the covariance matrix [28]. Therefore, ignoring the variable $w$ for notational simplicity, we can write the relation between the gradient and the stochastic gradient vectors as

$$\nabla f_{i_k} = \nabla f + \mathbf{n}, \qquad (10)$$

where $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \Sigma)$. Thus, for the LS stochastic gradient, we have

$$A_\sigma^{-1} \nabla f_{i_k} = A_\sigma^{-1} \nabla f + A_\sigma^{-1} \mathbf{n}. \qquad (11)$$

The variances of the stochastic gradient and the LS stochastic gradient are essentially the variances of $\mathbf{n}$ and $A_\sigma^{-1}\mathbf{n}$, respectively. The following theorem compares the variances of $\mathbf{n}$ and $A_\sigma^{-1}\mathbf{n}$.

Theorem 2.

Let $\kappa$ denote the condition number of $\Sigma$. Then, for the $d$-dimensional Gaussian random vector $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, we have

$$\mathbb{E}\,\big\| A_\sigma^{-1} \mathbf{n} \big\|^2 \;\le\; c_{\sigma, d, \kappa}\; \mathbb{E}\,\|\mathbf{n}\|^2, \qquad (12)$$

where the factor $c_{\sigma, d, \kappa} < 1$ depends only on $\sigma$, $d$, and $\kappa$.

The proof of Theorem 2 will be provided in the appendix.

Table 2 lists the ratio of the variance after and before LS for a $d$-dimensional standard normal vector, i.e., $\Sigma = I$. In practice, high-order smoothing reduces the variance more significantly.

σ                  1         2         3         4         5
order $n = 1$      0.268     0.185     0.149     0.129     0.114
order $n = 2$      0.279     0.231     0.207     0.192     0.181
order $n = 3$      0.290     0.256     0.238     0.226     0.218
Table 2: Theoretical upper bound of the variance ratio after vs. before smoothing when $\mathbf{n}$ is a $d$-dimensional standard normal vector.
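The first-order, $\sigma = 1$ entry of Table 2 can be checked with a quick Monte Carlo experiment; the following sketch (our own, with $\Sigma = I$, $d = 1000$, and 200 trials) estimates the ratio of the squared norms before and after smoothing:

```python
import numpy as np

d, trials, sigma = 1000, 200, 1.0
v = np.zeros(d)
v[0], v[1], v[-1] = -2.0, 1.0, 1.0
symbol = 1.0 - sigma * np.fft.fft(v)            # Fourier symbol of A_sigma

ratios = []
for _ in range(trials):
    eps = np.random.randn(d)                    # standard normal noise
    smoothed = np.real(np.fft.ifft(np.fft.fft(eps) / symbol))
    ratios.append(np.sum(smoothed ** 2) / np.sum(eps ** 2))
print(np.mean(ratios))                          # close to the 0.268 entry in Table 2
```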

Moreover, for any vector, LS decreases the largest component and increases the smallest component (Proposition 2), and preserves the mean (Proposition 3).

Proposition 2.

For any vector $g \in \mathbb{R}^d$ and $\sigma \ge 0$, let $h = A_\sigma^{-1} g$. We have $\max_i h_i \le \max_i g_i$ and $\min_i h_i \ge \min_i g_i$.

Proof.

Since $g = A_\sigma h$, it holds that

$$g_i = (1 + 2\sigma)\, h_i - \sigma h_{i+1} - \sigma h_{i-1},$$

where periodicity of the subindex is used if necessary. Let $i^*$ be an index at which $h$ attains its maximum. Since $h_{i^*+1} \le h_{i^*}$ and $h_{i^*-1} \le h_{i^*}$, we have $g_{i^*} \ge (1 + 2\sigma) h_{i^*} - 2\sigma h_{i^*} = h_{i^*}$, and therefore $\max_i g_i \ge \max_i h_i$. A similar argument shows that $\min_i h_i \ge \min_i g_i$. ∎

Proposition 3.

The operator $A_\sigma^{-1}$ preserves the sum of components. For any $g \in \mathbb{R}^d$ and $\sigma \ge 0$, we have $\sum_i (A_\sigma^{-1} g)_i = \sum_i g_i$, or equivalently, $\mathbf{1}^\top A_\sigma^{-1} g = \mathbf{1}^\top g$.

Proof.

Since each row of $A_\sigma = I - \sigma L$ sums to one, we have $A_\sigma \mathbf{1} = \mathbf{1}$ and hence $A_\sigma^{-1} \mathbf{1} = \mathbf{1}$. Therefore,

$$\mathbf{1}^\top A_\sigma^{-1} g = \big(A_\sigma^{-1} \mathbf{1}\big)^\top g = \mathbf{1}^\top g,$$

where we used the symmetry of $A_\sigma^{-1}$. ∎

4.2 Reduce the optimality gap

A direct benefit of variance reduction is that it reduces the optimality gap of SGD when a constant step size is applied. We state the corresponding result in the following.

Proposition 4.

Suppose $f$ is convex with global minimizer $w^*$. Consider the following iteration with constant learning rate $\eta$:

$$w^{k+1} = w^k - \eta\, A_\sigma^{-1} g^k,$$

where $g^k$ is the sampled gradient in the $k$-th iteration at $w^k$, satisfying $\mathbb{E}\,[g^k \mid w^k] = \nabla f(w^k)$ and $\mathbb{E}\,\|g^k\|_{A_\sigma^{-1}}^2 \le G^2$. Denote by $\bar w^T = \frac{1}{T}\sum_{k=1}^{T} w^k$ the ergodic average of the iterates. Then the optimality gap satisfies

$$\limsup_{T \to \infty}\; \mathbb{E}\,\big[f(\bar w^T)\big] - f(w^*) \;\le\; \frac{\eta\, G^2}{2}.$$

Proof.

Since $f$ is convex, we have

$$f(w^k) - f(w^*) \le \big\langle \nabla f(w^k),\, w^k - w^* \big\rangle. \qquad (13)$$

Furthermore,

$$\mathbb{E}\,\|w^{k+1} - w^*\|_{A_\sigma}^2 = \mathbb{E}\,\|w^k - w^*\|_{A_\sigma}^2 - 2\eta\, \mathbb{E}\,\big\langle \nabla f(w^k),\, w^k - w^* \big\rangle + \eta^2\, \mathbb{E}\,\|g^k\|_{A_\sigma^{-1}}^2 \le \mathbb{E}\,\|w^k - w^*\|_{A_\sigma}^2 - 2\eta\, \mathbb{E}\,\big[f(w^k) - f(w^*)\big] + \eta^2 G^2,$$

where the last inequality is due to (13). We rearrange the terms and arrive at

$$\mathbb{E}\,\big[f(w^k) - f(w^*)\big] \le \frac{\mathbb{E}\,\|w^k - w^*\|_{A_\sigma}^2 - \mathbb{E}\,\|w^{k+1} - w^*\|_{A_\sigma}^2}{2\eta} + \frac{\eta G^2}{2}.$$

Summing over $k$ from $1$ to $T$, averaging, and using the convexity of $f$, we have

$$\mathbb{E}\,\big[f(\bar w^T)\big] - f(w^*) \le \frac{\mathbb{E}\,\|w^1 - w^*\|_{A_\sigma}^2}{2\eta T} + \frac{\eta G^2}{2}.$$

Taking the limit $T \to \infty$ as above establishes the result. ∎

Remark 5.

Since $G^2$ bounds $\mathbb{E}\,\|g^k\|_{A_\sigma^{-1}}^2$, which is smaller than the corresponding value $\mathbb{E}\,\|g^k\|^2$ without LS, the optimality gap is reduced when LS is used with a constant step size. In practice, this also holds for the stage-wise step size, since the step size is constant within each stage of the training phase.

4.2.1 Optimization for quadratic function

In this part, we empirically show the advantages of LS(S)GD and its generalized schemes for convex optimization problems. Consider searching for the minimum of the quadratic function defined in Eq. (14).

(14)

To simulate SGD, we add Gaussian noise to the gradient vector, i.e., at any given point $x$, we use the noisy gradient

$$\hat\nabla f(x) = \nabla f(x) + \delta\, \boldsymbol{\epsilon},$$

where the scalar $\delta$ controls the noise level and $\boldsymbol{\epsilon}$ is a Gaussian noise vector with zero mean and unit variance in each coordinate. The corresponding numerical schemes can be formulated as

$$x^{k+1} = x^k - \eta_k\, \big(A_\sigma^n\big)^{-1} \hat\nabla f(x^k), \qquad (15)$$

where $\sigma$ is the smoothing parameter, selected to remove the intense noise, and $n$ is the smoothing order (no smoothing recovers (S)GD). We take diminishing step sizes, with different initial values for SGD/smoothed SGD and for GD/smoothed GD. Without noise, the smoothing allows us to take larger step sizes than the largest suitable step size for GD. We study both a constant learning rate and an exponentially decaying learning rate, i.e., after every 1000 iterations the learning rate is divided by 10. We apply the schemes corresponding to no smoothing and to first- and second-order smoothing in Eq. (15) to the problem in Eq. (14), starting from the same initial point.
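As a stand-in for the quadratic objective of Eq. (14), the sketch below uses a generic ill-conditioned diagonal quadratic (our own choice of spectrum, noise level, step size, and smoothing parameter) to illustrate the scheme in Eq. (15) with and without first-order smoothing:

```python
import numpy as np

def smooth(g, sigma=1.0):
    d = g.shape[0]
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0
    return np.real(np.fft.ifft(np.fft.fft(g) / (1.0 - sigma * np.fft.fft(v))))

d, steps, lr, noise = 100, 5000, 0.05, 1.0
lam = np.linspace(0.1, 10.0, d)                 # hypothetical ill-conditioned spectrum
x_gd, x_ls = np.ones(d), np.ones(d)             # f(x) = 0.5 * sum(lam * x**2), minimum at 0
for _ in range(steps):
    x_gd = x_gd - lr * (lam * x_gd + noise * np.random.randn(d))
    x_ls = x_ls - lr * smooth(lam * x_ls + noise * np.random.randn(d), sigma=1.0)
print(0.5 * np.sum(lam * x_gd ** 2), 0.5 * np.sum(lam * x_ls ** 2))  # LS typically ends lower
```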

Figure 1 shows iteration vs. optimality gap when a constant learning rate is used. In the noise-free case, all three schemes converge linearly. When there is noise, our smoothed gradient helps reduce the optimality gap and converges faster after a few iterations.

Figure 1: Iterations vs. optimality gap for GD and smoothed GD with order 1 and order 2 smoothing for the problem in Eq. (14). A constant step size is used.

The exponentially decaying learning rate helps our smoothed SGD to reach a point with a smaller optimality gap, and the higher order smoothing further reduces the optimality gap, as shown in Fig. 2. This is due to the noise removal properties of the smoothing operators.

Figure 2: Iterations vs. optimality gap for GD and smoothed GD with order 1 and order 2 smoothing for the problem in Eq. (14). An exponentially decaying step size is utilized here.

4.2.2 Find the center of multiple points

Consider searching for the center of a given set of 5K random points. (We thank Professor Adam Oberman for suggesting this problem to us.) This problem can be formulated as the following finite-sum optimization:

(16)

We solve this optimization problem by running either SGD or LSSGD for 20K iterations, starting from the same random initial point, with batch size 20. The initial step size is set to 1.0 and 1.2 for SGD and LSSGD, respectively, and is decayed by a factor of 1.1 after every 10 iterations. As the learning rate decays, the variance of the stochastic gradient decays [46]; thus we decay $\sigma$ by a factor of 10 after every 1K iterations. Figure 3 (a) plots a 2D cross section of the trajectories of SGD and LSSGD, and it shows that the trajectory of SGD is noisier than that of LSSGD. Figure 3 (b) plots iteration vs. loss for both SGD and LSSGD, averaged over 3 independent runs. LSSGD converges faster than SGD and has a smaller optimality gap than SGD. This numerical result verifies our theoretical results on the optimality gap (Proposition 4).

Figure 3: Left: a cross section of the trajectories of SGD and LSSGD. Right: Iteration v.s. Loss for SGD and LS-SGD.

4.2.3 Multi-class Logistic regression

Consider applying the proposed optimization schemes to train the multi-class Logistic regression model. We run 200 epochs of SGD and of the different-order smoothing algorithms to maximize the likelihood of the multi-class Logistic regression with batch size 100, and we apply an exponentially decaying learning rate that is divided by 10 after every 50 epochs. We train the model with only 10 randomly selected MNIST training samples and test the trained model on the entire set of testing images. We further compare with SVRG under the same setting. Figure 4 shows the histograms of the generalization accuracy of the model trained by SGD (a), SVRG (b), LS-SGD (order 1) (c), and LS-SGD (order 2) (d). It is seen that SVRG somewhat improves the generalization with higher averaged accuracy. However, the first- and second-order LSSGD-type algorithms lift the averaged generalization accuracy further and remarkably reduce the variance of the generalization accuracy over 100 independent trials.

4.3 Iteration v.s. loss

In this part, we show the evolution of the loss in training the multi-class Logistic regression model by SGD, SVRG, and LSGD with first- and second-order smoothing, respectively. As illustrated in Fig. 5, at each iteration, among 100 independent experiments, SGD has the largest variance, and SGD with the first-order smoothed gradient significantly reduces the variance of the loss across experiments. The second-order smoothing further reduces the variance. The variance of the loss at each iteration among the 100 experiments is minimized when SVRG is used to train the multi-class Logistic model. However, the generalization performance of the model trained by SVRG is not as good as that of the models trained by LS-SGD or higher-order smoothed gradient descent (Fig. 4 (b)).

Figure 4: Histogram of testing accuracy over 100 independent experiments of the multi-class Logistic regression model trained on randomly selected MNIST data by different algorithms: (a) SGD; (b) SVRG; (c) LS-SGD, order 1; (d) LS-SGD, order 2.
Figure 5: Iterations vs. loss for (a) SGD, (b) SVRG, and LS-SGD with (c) order 1 and (d) order 2 gradient smoothing for training the multi-class Logistic regression model.

4.4 Variance reduction in stochastic gradient

We verify the efficiency of the variance reduction numerically in this part. We simplify the problem by applying the multi-class Logistic regression only to the digits 1 and 2 of the MNIST training data. In order to compute the variance of the (LS-)stochastic gradients, we first compute the descent path of (LS-)GD by applying full-batch (LS-)GD, starting from the same random initialization, and record the full-batch (LS-)gradient at each point along the descent path. Then we compute the (LS-)stochastic gradients at each point along the path using different batch sizes and smoothing parameters $\sigma$, running 100 independent experiments. We then compute the variance of the (LS-)stochastic gradient among these 100 experiments, treating the full-batch (LS-)gradient as the mean, at each point along the full-batch (LS-)GD descent path. For each pair of batch size and $\sigma$, we report the maximum variance over all coordinates of the gradient and all points along the descent path. We list the variance results in Table 3 (note that the case $\sigma = 0$ corresponds to SGD). These results show that, compared to SGD, LSSGD can reduce the maximum variance by more than an order of magnitude across different batch sizes. It is worth noting that the higher-order smoothing reduces the variance more than the lower-order smoothing; this might be due to the fact that the noise of SGD is not Gaussian.

Batch Size             2         5         10        20        50
$\sigma = 0$ (SGD)     1.50E-1   5.49E-2   2.37E-2   1.01E-2   4.40E-3
                       3.40E-3   1.30E-3   5.45E-4   2.32E-4   9.02E-5
                       2.00E-3   7.17E-4   3.46E-4   1.57E-4   5.46E-5
                       1.40E-3   4.98E-4   2.56E-4   1.17E-4   3.97E-5
Table 3: The maximum variance of the stochastic gradient generated by LS-SGD with different smoothing settings and batch sizes; the first row ($\sigma = 0$) recovers SGD, and subsequent rows use increasingly strong smoothing.

5 Numerical Results on Avoiding Local Minima and Speeding Up Convergence

We first show that LS-GD can bypass sharp minima and reach the global minimum. We consider the following function, in which we 'drill' narrow holes in a smooth convex function,

(17)

where the summation is taken over a finite index set, and the parameters that determine the location and narrowness of the local minima are fixed in our experiments. We run GD and LS-GD starting from a random point in a neighborhood of one of the narrow minima. Our experiments (Fig. 6) show that GD converges to a narrow local minimum, while LS-GD converges to the wider global minimum.

Figure 6: Demo of GD and LS-GD. Panel (a) depicts a slice of the function in Eq. (17); panel (b) shows the paths of GD (red) and LS-GD (black). We take the step size to be 0.02 for both GD and LS-GD, and a fixed $\sigma$ is utilized for LS-GD.

Next, let us compare LSGD with some popular optimization methods on the benchmark 2D Rosenbrock function, which is non-convex. The global minimum is inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. The function is defined by

$$f(x, y) = (a - x)^2 + b\,(y - x^2)^2; \qquad (18)$$

it has a global minimum at $(x, y) = (a, a^2)$, and we fix the values of $a$ and $b$ in the following experiments.

Starting from the same initial point, we run 2K iterations of the following optimizers: GD, GD with Nesterov momentum [31], Adam [21], RMSProp [44], and LSGD. The same step size is used for all these methods. Figure 7 plots iteration vs. objective value; it shows that GD with Nesterov momentum converges faster than all the other algorithms, and the second best algorithm is LSGD. Meanwhile, Nesterov momentum can be used to speed up LSGD, and we will show this numerically in training DNNs in Section 6.

Figure 7: Iteration vs. loss of different optimization algorithms in optimizing the Rosenbrock function.

Figure 8 depicts some snapshots (the 300th, 600th, 900th, and 1200th iterations, respectively) of the trajectories of the different optimization algorithms. These figures show that even though GD with momentum converges faster, it suffers from some overshoots and detours on its way to the minimum. All the other algorithms follow a more direct path to the minimum, and among them LSGD converges fastest.

Figure 8: Snapshots of the trajectories of different optimization algorithms on the Rosenbrock function at iterations 300, 600, 900, and 1200.

Furthermore, we show that LSGD can be further accelerated by using Nesterov momentum. As shown in Fig. 9, LSGD with Nesterov momentum converges much faster than GD with momentum, especially for the high-dimensional Rosenbrock function.

Figure 9: Iteration v.s. objective value for GD with Nesterov momentum and LSGD with Nesterov momentum.
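A hedged sketch of LSGD combined with Nesterov momentum on a $d$-dimensional extended Rosenbrock function is given below; the standard values $a = 1$, $b = 100$, the dimension, and all hyperparameters are our own assumptions, not the settings used for Fig. 9:

```python
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    return np.sum(b * (x[1:] - x[:-1] ** 2) ** 2 + (a - x[:-1]) ** 2)

def rosenbrock_grad(x, a=1.0, b=100.0):
    g = np.zeros_like(x)
    g[:-1] = -4.0 * b * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (a - x[:-1])
    g[1:] += 2.0 * b * (x[1:] - x[:-1] ** 2)
    return g

def smooth(g, sigma=1.0):
    d = g.shape[0]
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0
    return np.real(np.fft.ifft(np.fft.fft(g) / (1.0 - sigma * np.fft.fft(v))))

d, lr, mu = 10, 2e-4, 0.9
x, velocity = np.full(d, -1.0), np.zeros(d)
for k in range(30000):
    g = smooth(rosenbrock_grad(x + mu * velocity))   # Nesterov look-ahead + smoothed gradient
    velocity = mu * velocity - lr * g
    x = x + velocity
print(rosenbrock(x))        # the global minimum value is 0, attained at x = (1, ..., 1)
```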

6 Application to Deep Learning

6.1 Train neural nets with small batch size

Many advanced artificial intelligence tasks make high demands on training neural nets with extremely small batch sizes. The milestone technique for this is group normalization [47]. In this section, we show that LS-SGD successfully trains DNNs with extremely small batch sizes. We consider LeNet-5 [23] for MNIST classification. Our network architecture is as follows.

The notation above denotes a 2D convolutional layer with a given number of output channels, each of which is the sum of channel-wise convolutions of the input with a learnable kernel of the indicated size; each such layer is followed by a ReLU nonlinearity and max pooling with the indicated stride. The fully connected layer is an affine transformation that maps its input to a vector of dimension 512. Finally, the output is activated by a multi-class Logistic function. The MNIST data is first passed to the first convolutional layer and then processed by this hierarchical structure. We run both SGD and LS-SGD for the same number of epochs, with an initial learning rate that is divided by a fixed factor after 50 epochs, and we use weight decay and momentum. Figure 10 (a) plots the generalization accuracy on the test set for LeNet-5 trained with different batch sizes. For each batch size, LS-SGD keeps the testing accuracy high, while SGD's accuracy degrades noticeably when batch size 4 is used, and the classification becomes just a random guess when the model is trained by SGD with batch size 2. A small batch size leads to large noise in the gradient, which may cause the noisy gradient to deviate from a descent direction; however, Laplacian smoothing rescues this by decreasing the noise.

Figure 10: (a) Testing accuracy of LeNet-5 trained by SGD/LS-SGD on MNIST with various batch sizes. (b) The evolution of the pre-activated ResNet56's training and generalization accuracy under SGD and LS-SGD (starting from the 20th epoch).

6.2 Improve generalization accuracy

The skip connections in ResNet smooth the landscape of the loss function of the classical CNN [17, 26]. This means that ResNet has fewer sharp minima. On Cifar10 [22], we compare the performance of LS-SGD and SGD on ResNet, with the pre-activated ResNet56 as an illustration. We take the same training strategy as that used in [17], except that we decay the learning rate by a fixed factor after every 40 epochs. For ResNet, instead of applying LS-SGD for all epochs, we only use LS-SGD in the first 40 epochs, and the remaining training is carried out by SGD (this saves the extra computational cost due to LS, and we noticed that the performance is similar to the case when LS is used for the whole training process). The parameter $\sigma$ is fixed. Figure 10 (b) depicts one path of the training and generalization accuracy of the neural nets trained by SGD and LS-SGD, respectively. It is seen that, even though the training accuracy obtained by SGD is higher than that obtained by LS-SGD, its generalization is inferior to that of LS-SGD. We conjecture that this is because SGD gets trapped in a sharp but deeper minimum, which fits better but generalizes worse. We carry out 25 replicas of this experiment; the histograms of the corresponding accuracy are shown in Fig. 11.

Figure 11: The histogram of the generalization accuracy of the pre-activated ResNet56 on Cifar10 trained by SGD and LS-SGD over 25 independent experiments.

6.3 Training Wasserstein GAN

Generative Adversarial Networks (GANs) [15] are notoriously delicate and unstable to train [4]. In [27], Wasserstein GANs (WGANs) are introduced to combat the instability in training GANs. In addition to being more robust to training parameters and network architecture, WGANs provide a reliable estimate of the Earth Mover (EM) metric, which correlates well with the quality of the generated samples. Nonetheless, WGAN training becomes unstable with a large learning rate or when used with a momentum-based optimizer [27]. In this section, we demonstrate that the gradient smoothing technique in this paper alleviates the instability in training and improves the quality of the generated samples. Since WGANs with weight clipping are typically trained with RMSProp [44], we propose replacing the gradient $g$ by its smoothed version $A_\sigma^{-1} g$, and also updating the running averages using $A_\sigma^{-1} g$ instead of $g$. We name this algorithm LS-RMSProp.

To accentuate the instability in training and demonstrate the effects of gradient smoothing, we deliberately use a large learning rate for training the generator. We compare the regular RMSProp with the LS-RMSProp. The learning rate for the critic is kept small and trained approximately to convergence so that the critic loss is still an effective approximation to the Wasserstein distance. To control the number of unknowns in the experiment and make a meaningful comparison using the critic loss, we use the classical RMSProp for the critic, and only apply LS-RMSProp to the generator.
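A minimal sketch of one LS-RMSProp update as described above is given below: the raw gradient is replaced by its Laplacian-smoothed surrogate both in the squared-gradient running average and in the parameter update (NumPy, with hyperparameter values that are our own; the caller initializes state to zeros):

```python
import numpy as np

def ls_rmsprop_step(w, g, state, lr=5e-5, sigma=1.0, alpha=0.99, eps=1e-8):
    """One LS-RMSProp step on a flat parameter vector w with raw gradient g."""
    d = g.shape[0]
    v = np.zeros(d)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0
    sg = np.real(np.fft.ifft(np.fft.fft(g) / (1.0 - sigma * np.fft.fft(v))))  # A_sigma^{-1} g
    state = alpha * state + (1.0 - alpha) * sg ** 2    # running average uses the smoothed gradient
    w = w - lr * sg / (np.sqrt(state) + eps)
    return w, state
```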

Figure 12: Critic loss for RMSProp (top) and LS-RMSProp (bottom) with a large learning rate, trained for 20K iterations. We apply a mean filter of window size 13 for better visualization. The loss from LS-RMSProp is visibly less noisy.
Figure 13: Samples from WGANs trained with RMSProp (a, c) and LS-RMSProp (b, d). A larger learning rate is used for both RMSProp and LS-RMSProp in (a) and (b), and a smaller one in (c) and (d). The critic is trained for 5 iterations per step of the generator, and for 200 iterations per every 500 steps of the generator.

We train the WGANs on the MNIST dataset using the DCGAN [35] architecture for both the critic and the generator. In Figure 12 (top), we observe that the loss for RMSProp trained with a large learning rate has multiple sharp spikes, indicating instability in the training process. The samples generated are also lower in quality, containing noisy spots, as shown in Figure 13 (a). In contrast, the training-loss curve for LS-RMSProp is smoother and exhibits fewer spikes, and the generated samples shown in Fig. 13 (b) are of better quality and visibly less noisy. The generated characters in Fig. 13 (b) are more realistic compared to the ones in Fig. 13 (a). The effects are less pronounced with a small learning rate, but still result in a modest improvement in sample quality, as shown in Figure 13 (c) and (d). We also apply LS-RMSProp for training the critic, but do not see a clear improvement in quality. This may be because the critic is already trained to near optimality during each iteration and does not benefit much from gradient smoothing.

Figure 14: Durations of the Cartpole game during the training procedure. The left and right panels show training by RMSProp and LS-RMSProp, respectively.

6.4 Deep reinforcement learning

Deep reinforcement learning (DRL) has been applied to playing games, including Cartpole [9], Atari [30], and Go [42, 29]. DNNs play a vital role in approximating the Q-function or the policy function. We apply the Laplacian smoothed gradient to train the policy function to play the Cartpole game, following the standard policy gradient procedure [9] and using a small neural network to approximate the policy function.

The network is trained by RMSProp and by LS-RMSProp, respectively. The learning rate and other related parameters are set to the default ones in PyTorch. The training is stopped once the average duration of 5 consecutive episodes exceeds 490. In each training episode, we set the maximal number of steps to 500. The left and right panels of Fig. 14 depict a training procedure using RMSProp and LS-RMSProp, respectively. We see that the Laplacian smoothed gradient takes fewer episodes to reach the stopping criterion. Moreover, we run the above experiments 5 times independently and apply the trained models to play Cartpole. The game lasts more than 1000 steps for all 5 models trained by LS-RMSProp, while only 3 of the models trained by vanilla RMSProp last more than 1000 steps.

7 Convergence Analysis

Note that the LS matrix $A_\sigma^{-1}$ is positive definite, and its largest and smallest eigenvalues are $1$ and $1/(1 + 4\sigma)$, respectively. It is straightforward to show that all the convergence results for (S)GD still hold for LS(S)GD. In this section, we show some additional convergence results for LS(S)GD, with a focus on LSGD; the corresponding results for LSSGD follow in a similar way.

Proposition 5.

Consider the LSGD iteration $w^{k+1} = w^k - \eta A_\sigma^{-1} \nabla f(w^k)$. Suppose $f$ is $L$-Lipschitz smooth, bounded from below, and $0 < \eta < 2/L$. Then $\nabla f(w^k) \to 0$ as $k \to \infty$. Moreover, if the Hessian $\nabla^2 f$ is continuous with $w^*$ being the minimizer of $f$, $\nabla^2 f(w^*) \succ 0$, and $w^0$ is sufficiently close to $w^*$, then $w^k \to w^*$ as $k \to \infty$, and the convergence is linear.

Proof.

By the Lipschitz continuity of $\nabla f$ and the descent lemma [5], we have

$$f(w^{k+1}) \le f(w^k) - \Big(\eta - \frac{L \eta^2}{2}\Big)\, \big\| A_\sigma^{-1/2} \nabla f(w^k) \big\|^2.$$

Summing the above inequality over $k$, we have

$$\Big(\eta - \frac{L \eta^2}{2}\Big) \sum_{k} \big\| A_\sigma^{-1/2} \nabla f(w^k) \big\|^2 \le f(w^0) - \inf_w f(w) < \infty.$$

Therefore, $\|A_\sigma^{-1/2} \nabla f(w^k)\| \to 0$, and thus $\nabla f(w^k) \to 0$.

For the second claim, we have

$$w^{k+1} - w^* = \Big(I - \eta A_\sigma^{-1} \int_0^1 \nabla^2 f\big(w^* + t(w^k - w^*)\big)\, dt\Big)(w^k - w^*).$$

Therefore,

$$\|w^{k+1} - w^*\| \le \big\| I - \eta A_\sigma^{-1} H_k \big\|\, \|w^k - w^*\|,$$

where $H_k$ denotes the averaged Hessian above. So if $\eta$ is chosen such that $\|I - \eta A_\sigma^{-1} H\| < 1$ for Hessians $H$ in a neighborhood of $w^*$, the result follows. ∎

Remark 6.

The convergence result in Proposition 5 is also call