SPI-Optimizer: an integral-Separated PI Controller for Stochastic Optimization

12/29/2018 ∙ by Dan Wang, et al. ∙ Tsinghua University ∙ The Hong Kong University of Science and Technology

To overcome the oscillation problem in the classical momentum-based optimizer, recent work associates it with the proportional-integral (PI) controller and artificially adds a D term, producing a PID controller. It suppresses oscillation at the price of introducing an extra hyper-parameter. In this paper, we start by asking: why does the momentum-based method oscillate about the optimal point? We answer that the fluctuation problem relates to the lag effect of the integral (I) term. Inspired by the conditional integration idea in the classical control community, we propose SPI-Optimizer, an integral-Separated PI controller based optimizer WITHOUT introducing extra hyper-parameters. It separates the momentum term adaptively when the current and historical gradient directions are inconsistent. Extensive experiments demonstrate that SPI-Optimizer generalizes well on popular network architectures to eliminate the oscillation, and owns competitive performance with faster convergence speed (up to 40% epochs reduction ratio) and more accurate classification results on MNIST, CIFAR10, and CIFAR100 (up to 27.5% error reduction ratio) than the state-of-the-art methods.


1 Introduction

Serving as a fundamental tool for solving practical problems in both scientific and engineering domains, a proper optimizer plays a vital role. Taking the celebrated successes of deep learning [7, 8, 10, 30, 28, 6] as examples, stochastic gradient descent (SGD) serves as one of the most popular solvers, due to its good balance between efficiency and effectiveness. The demands of training very deep networks call for even more efficient optimizers such as SGD with momentum (MOM) [23]. However, MOM suffers from the oscillation problem [18], with non-negligible maximum overshoot and settling time. Such oscillation hinders the convergence of MOM, requiring more training time and resources. As a result, an efficient as well as effective optimizer is urgently demanded yet very challenging to design, owing to the highly non-convex nature of the optimization problems.

Recently, some researchers have investigated the conventional optimization problem by associating it with the Proportional-Integral-Derivative (PID) model that is widely used in feedback control systems. By linking the calculation of errors in a feedback control system with the calculation of gradients in network updating, [1] shows that MOM can be treated as a special case of the classical PID controller with only Proportional (P) and Integral (I) components. It further artificially adds the Derivative (D) component to form a PID based optimizer, which reduces the oscillation phenomenon at the cost of a troublesome hyper-parameter induced by the D component. In other words, the calculated coefficient of the derivative term can hardly adapt to the huge diversity of network architectures and the different modalities of the training dataset.

Instead of extending PI to PID directly, we explore “why does the momentum-based method oscillate about the optimal point?” via thorough analysis from the perspective of the inherent connection between MOM and the PI controller. The in-depth pre-analysis (Section 3.1) reveals that the fluctuation problem in the momentum-based method relates to the lag effect of the integral (I) term in the PI controller. Inspired by the conditional integration idea in the classical control community, we propose SPI-Optimizer, an integral-Separated PI controller based solver for stochastic optimization. SPI-Optimizer separates the momentum term adaptively when the current and historical gradient directions are inconsistent.

More specifically, the insight of SPI-Optimizer can be explained as follows (more discussion in Section 3.3). For Conditional Integration as used in the classical control community (denoted as CI-ε), the integral component is only considered as long as the magnitude of the feedback deviation (the gradient) is smaller than a threshold ε. That means SGD with only the proportional (P) term can be viewed as CI-0. Similarly, MOM never separates out the integral (I) part and can be denoted as CI-∞. While the oscillation phenomenon may be tuned by setting ε, the convergence speed of CI cannot be improved by trading off the parameter ε, and remains bounded by CI-0 and CI-∞. Our SPI-Optimizer examines the sign consistency between the residual and the integral term before enabling the integral component, thus easing the oscillation phenomenon WITHOUT introducing extra hyper-parameters. As a result, it can be theoretically shown that SPI-Optimizer outperforms both CI-0 and CI-∞, and generalizes better across different network structures on several popular datasets. We summarize the technical contributions as follows.

  • By associating MOM with the PI controller, we analytically show that the oscillation in the momentum-based method corresponds to the lag effect of the integral (I) term in the PI controller, which inspires us to deal with the I term instead of adding a D term, as the latter introduces an extra hyper-parameter.

  • A novel SPI-Optimizer based on the integral-Separated PI controller is proposed to separate the momentum term adaptively when the current and historical gradient directions are inconsistent. A detailed theoretical discussion on the convergence of SPI-Optimizer is provided.

  • SPI-Optimizer eliminates the oscillation phenomenon without introducing any extra hyper-parameter, and leads to considerably faster convergence speed and more accurate results on popular network architectures.

2 Related Work

Among the various optimization schemes, gradient based methods have served as the most popular optimizers for a tremendous number of optimization problems. Representative ones include gradient descent (GD) [4], stochastic gradient descent (SGD) [22], the heavy ball method (HB) [19], and Nesterov accelerated gradient (NAG) [17]. While GD is the simplest one, it is restricted by redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD improves on it by sampling a random subset of the overall dataset, yet it has difficulty traversing ravines [24]. HB accelerates the iteration by adding a fraction of the previous update, an idea further developed and named Momentum (MOM) [23]. NAG further uses the momentum term to update the parameters and corrects the gradient with some prescience. All of these classic optimization algorithms use fixed learning rates.

Lately, an increasing share of deep learning researchers train their models with adaptive learning rate methods [5, 29, 20, 12], driven by the need to speed up training [11]. These methods adapt the update to each individual parameter, performing larger or smaller updates depending on the parameter's importance.

Despite the successful use of adaptive methods in many applications, owing to their competitive performance and their ability to work well with minimal tuning, the recent findings of [27] show that hand-tuned SGD and MOM achieve better results at the same or even faster speed than adaptive methods. Furthermore, the authors also show that even for simple quadratic problems, adaptive methods find solutions that can be orders of magnitude worse at generalization than those found by SGD(M). They suggest that a possible explanation for the worse results of adaptive methods lies in the convergence to different local minima [9]. It is also noted that most state-of-the-art deep models such as ResNet [6] and DenseNet [8] are usually trained by momentum-based methods, as the adaptive methods generalize worse than SGD-Momentum, even when their solutions have better training performance.

On the other hand, some researchers investigate stochastic optimization by associating it with the classical Proportional-Integral-Derivative (PID) controller that is widely used in feedback control systems. The pioneering work [26] regarded the classical gradient descent algorithm as a PID controller that uses the Proportional (P) term merely, and added Integral (I) and Derivative (D) terms to achieve faster convergence. The latest work [1] interpreted momentum as a PI controller, and added a Derivative (D) term, the predicted gradient difference, to reduce oscillation and improve SGD-Momentum on large-scale datasets. Unfortunately, either introducing I and D terms to GD, or introducing a D term to SGD-Momentum, intensifies the task of hyper-parameter tuning (which will be further elaborated in our experiments).

3 SPI-Optimizer

In this section, we first conduct a thorough pre-analysis of the oscillation phenomenon of momentum-based algorithms in Section 3.1. Aided by the association with the PI controller, the oscillation can be explained by the lag effect of the integral (I) term in the PI controller. We then propose the novel SPI-Optimizer that separates the I term from the PI controller adaptively, which eases the oscillation problem WITHOUT introducing extra hyper-parameters. Subsequently, in-depth discussions to further evaluate SPI-Optimizer are provided in Section 3.3.

3.1 Pre-analysis of Oscillation

As introduced in [23], the Momentum algorithm (MOM) works by accumulating an exponentially decayed moving average of past gradients. Mathematically, with weights θ_t, momentum coefficient α, and learning rate r, the momentum-based optimizer can be defined as

v_{t+1} = α v_t − r ∇L(θ_t + μ v_t),  θ_{t+1} = θ_t + v_{t+1},   (1)

where μ = 0 and μ = α respectively define MOM and Nesterov Accelerated Gradient (NAG) [17].
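As a concrete sketch of Eqn. (1), the minimal implementation below unifies MOM and NAG through the look-ahead parameter mu (the symbols alpha, r, and mu follow the notation above; the quadratic test function and hyper-parameter values are our own illustrative choices, not from the paper):

```python
def momentum_step(theta, v, grad_fn, lr=0.1, alpha=0.9, mu=0.0):
    """One update of Eqn. (1): mu = 0 gives MOM, mu = alpha gives NAG."""
    g = grad_fn(theta + mu * v)     # gradient at the (possibly look-ahead) point
    v_new = alpha * v - lr * g      # exponentially decayed gradient accumulation
    return theta + v_new, v_new

# Illustrative run on f(theta) = 0.5 * theta**2, whose gradient is theta itself.
grad = lambda th: th
theta, v = 1.0, 0.0
for _ in range(100):
    theta, v = momentum_step(theta, v, grad, mu=0.0)   # plain MOM
```

With mu=alpha, the same function performs the Nesterov look-ahead variant.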

Figure 1: (a) MOM's trajectory: convergence path of MOM on a 2D toy convex function, with colored segments representing each weight update. (b) MOM, (c) GD, (d) NAG, (e) SPI: horizontal residual w.r.t. the optimal point versus time. The two components of the weight update are the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the blue curve.

Although the momentum component can accelerate convergence in the case of small and consistent gradients, it suffers from an oscillation phenomenon in which the convergence path fluctuates about the optimal point, as shown in Fig. 1(a). Such oscillation can be quantitatively described by two concepts: the settling time t_s, defined as the time required for the curve to reach and stay within a given threshold δ of the optimal point θ*, and the maximum overshoot σ, describing the difference between the maximum peak value and the optimal value:

t_s = min{ t : |θ_{t'} − θ*| ≤ δ for all t' ≥ t },  σ = |θ_{t_σ} − θ*|,   (2)

where t_σ denotes the time of the maximum peak after the curve first crosses θ*.
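Both quantities in Eqn. (2) are easy to measure on a recorded 1D trajectory. The helper below is a small sketch (the sample trajectory and tolerance are illustrative, not from the paper):

```python
import numpy as np

def settling_time(traj, optimum, tol):
    """First index after which the trajectory stays within tol of the optimum."""
    err = np.abs(np.asarray(traj) - optimum)
    outside = np.nonzero(err > tol)[0]
    return 0 if outside.size == 0 else int(outside[-1]) + 1

def max_overshoot(traj, start, optimum):
    """Largest excursion past the optimum, measured from the approach side."""
    traj = np.asarray(traj)
    sign = np.sign(start - optimum)        # side the curve approaches from
    excursion = sign * (optimum - traj)    # positive once the curve crosses the optimum
    return max(0.0, float(excursion.max()))

traj = [1.0, 0.4, -0.3, -0.1, 0.05, -0.02, 0.01]
ts = settling_time(traj, optimum=0.0, tol=0.05)   # settles at index 4
ov = max_overshoot(traj, start=1.0, optimum=0.0)  # overshoots by 0.3
```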

As defined in Eqn. (1), two components contribute to the weight update, i.e., the momentum term and the current gradient term. The integral (momentum) term can introduce non-negligible weight updates that are opposite to the gradient descent direction. In that case, the momentum lags the update of the weights even when the weights should follow the new gradient direction. Analogous to feedback control, such a lag effect leads to a more severe oscillation phenomenon, i.e., the convergence path fluctuates about the optimal point with larger maximum overshoot σ and longer settling time t_s.

We further take a commonly used 2D convex function for illustration. In Fig. 1(a), the convergence path is composed of multiple weight updates shown in different colors. Considering only the horizontal axis, Fig. 1(b) depicts the residual of the convergence path w.r.t. the optimal point as a blue curve, and the weight updates from the momentum and the current gradient as green and red arrows respectively. In the process of approaching the goal, we define several time stamps: t_c as the time when the curve first exceeds the optimal point, t_σ as the time when it reaches the maximum overshoot, and t_s as the settling time.

The weight updates (green and red arrows) start in the same direction (up) until t_c. For the duration [t_c, t_σ], because the weight exceeds the optimal point (the origin in this specific example), the gradient descent direction (red arrow) is reversed. But owing to the large accumulated gradient value (green arrow), the weight keeps its rising trend until t_σ, when the reversed gradients finally compensate the momentum. As a result, the momentum introduces a lag effect into the update of the weights in the period [t_c, t_σ] and leads to a severe oscillation effect with large maximum overshoot and long settling time.

Compared with MOM, gradient descent (GD) oscillates less due to the lack of accumulated gradients, as shown in Fig. 1(c). Even though the maximum overshoot is much smaller than that of MOM, the settling time is unacceptably long. Due to the lag effect of the momentum within the period [t_c, t_σ], the oscillation of NAG is as severe as that of MOM. In Fig. 1, SPI-Optimizer obtains a considerable reduction ratio in convergent epochs.

3.2 integral-Separated PI Controller

The latest work [1] points out that if the optimization process is treated as a dynamic system, the Stochastic Gradient Descent (SGD) optimizer can be interpreted as a proportional (P) controller whose gain is the learning rate. Then MOM and NAG can be represented as Proportional-Integral (PI) controllers. As discussed in the previous subsection, the momentum term can lead to severe system oscillation about the optimal point owing to the lag effect of the integral / momentum. To ease the fluctuation, [1] artificially adds the derivative (D) term to form a PID controller with a network-invariant and dataset-invariant derivative coefficient. However, it is questionable whether a universal (D) coefficient, rather than a model-based one, is applicable to diverse network structures. At the same time, the newly introduced hyper-parameter for the derivative term needs more effort for empirical tuning. In contrast, we propose the integral-Separated PI Controller based Optimizer (SPI-Optimizer) to deal with the integral term directly, WITHOUT introducing any extra hyper-parameter.

In a typical optimization problem, the loss function L(θ) is a metric measuring the distance between the desired output and the prediction given the weights θ. The gradient of the weights can be used to update the weights until the optimal solution with zero gradient is reached. Hence the gradient w.r.t. the weights can be associated with the “error” in feedback control. Consequently, rethinking the problem from the control perspective: although a PI controller responds faster than a P controller, it can easily lag and destabilize the dynamic system by accumulating large historical errors.

Inspired by the conditional integration [2] strategy in the control community, which stops the integral term from accumulating outside a pre-determined bound so as to suppress the lag effect, a simple conditional integration optimizer (CI-ε) can be formulated as follows:

v_{t+1}^(i) = α v_t^(i) − r g_t^(i)   if |g_t^(i)| < ε,
v_{t+1}^(i) = − r g_t^(i)   otherwise,
θ_{t+1} = θ_t + v_{t+1},   (3)

where g_t = ∇L(θ_t) and ε is the introduced threshold for each dimension i of the state vector. Unfortunately, such a naive adoption has two drawbacks: (1) it requires extra effort to empirically tune the hyper-parameter ε, and has weak generalization capability across different cost functions; (2) with a manually selected gradient threshold, the performance of CI-ε is almost bounded by SGD (CI-0) and MOM (CI-∞).
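Under the formulation of Eqn. (3), a per-dimension CI-ε step can be sketched as follows (a NumPy illustration in our notation, not the authors' code); setting eps=0 recovers SGD and eps=np.inf recovers MOM:

```python
import numpy as np

def ci_step(theta, v, g, lr=0.1, alpha=0.9, eps=1.0):
    """CI-eps update of Eqn. (3): keep the momentum term only on dimensions
    where the gradient magnitude stays below the threshold eps."""
    keep = np.abs(g) < eps
    v_new = np.where(keep, alpha * v, 0.0) - lr * g
    return theta + v_new, v_new

theta = np.zeros(2)
v = np.array([1.0, 1.0])
g = np.array([0.5, 2.0])            # second dimension exceeds eps = 1.0
theta, v = ci_step(theta, v, g)     # momentum kept only on dimension 0
```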

Recall that what we expect is an optimizer with short rising time and small maximum overshoot σ. As illustrated in Fig. 1, the momentum-based algorithms have a much shorter rising time than GD due to the accumulated gradients. However, the historical gradients lag the update of the weights in the period when the gradient direction gets reversed, and lead to severe oscillation about the optimal point. To ease the fluctuation, the proposed SPI-Optimizer isolates the integral component of the controller whenever the current and historical gradient directions are inconsistent, i.e., for each dimension i,

g_t^(i) · v_t^(i) > 0,   (4)

since v_t accumulates negated gradients, so that sign consistency corresponds to g_t^(i) v_t^(i) < 0. The SPI-Optimizer is then given by:

v_{t+1}^(i) = α v_t^(i) · 1[ g_t^(i) v_t^(i) < 0 ] − r g_t^(i),  θ_{t+1} = θ_t + v_{t+1}.   (5)

The key insight here is that the historical gradients will lag the update of the weights if the weights should not keep the previous direction, i.e., when g_t^(i) v_t^(i) > 0, leading to oscillation of the weights about the optimal point until the gradients compensate the momentum in the reversed direction. In this way, SPI-Optimizer can converge as fast as MOM and NAG yet with a much smaller maximum overshoot. On the other hand, we may interpret the SPI-Optimizer from the perspective of state delay.
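A minimal sketch of the separation rule (assuming, as reconstructed above, that inconsistency is detected by the per-dimension sign test g · v > 0; NumPy, illustrative only):

```python
import numpy as np

def spi_step(theta, v, g, lr=0.1, alpha=0.9):
    """SPI update sketch: since v accumulates -lr * g, the momentum still
    pushes downhill only where g * v < 0; elsewhere it is separated out."""
    consistent = g * v < 0
    v_new = np.where(consistent, alpha * v, 0.0) - lr * g
    return theta + v_new, v_new

theta = np.zeros(2)
v = np.array([-0.5, -0.5])
g = np.array([1.0, -1.0])           # dimension 1 has reversed its gradient
theta, v = spi_step(theta, v, g)    # momentum dropped on dimension 1 only
```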

State Delay: Recall that the objective of this feedback system is to let the gradient approach 0, yet we only have the observation of the state θ_t. This can be understood as a dynamic system with measurement latency or temporal delay: the larger the delay, the more likely severe oscillation or an unstable system occurs. Analogously, we define the state delay D_t in stochastic optimization as the distance between the new state and the state at which the gradient used in the update was evaluated:

D_t = ‖ θ_{t+1} − (θ_t + μ v_t) ‖.   (6)

Hypothesis: One hypothesis is that for momentum-based optimizers (PI controllers), the optimizer with smaller state delay is highly likely to exhibit less oscillation, which is harmful to system stability. As α increases in Eqn. (6) from GD to MOM, the state delay of MOM has a higher chance of being larger than that of GD, which explains why MOM usually oscillates more during the optimization process. Similarly, NAG can be understood as a more robust PI controller with smaller state delay, under the assumption that the gradients at θ_t and at θ_t + α v_t share the same probability distribution. For SPI-Optimizer, when oscillation is detected, we reduce the state delay by dropping the momentum term from the update; otherwise, it remains a PI controller to speed up convergence.

3.3 Discussion

Figure 2: Convergence comparison within 50 iterations among our SPI-Optimizer, Momentum (MOM) [23], Gradient Descent (GD) [4], Nesterov Accelerated Gradient (NAG) [17], and conditional integration (CI-ε) on a 2D quadratic function. Panels: (a) SPI, (b) GD, (c) MOM, (d) NAG, (e)-(f) CI-ε with two different thresholds.
Figure 3: Performance comparison within 100 iterations among our SPI-Optimizer, Momentum (MOM) [23], Gradient Descent (GD) [4], Nesterov Accelerated Gradient (NAG) [17], and conditional integration (CI-ε) on the 2D non-convex McCormick function. Panels: (a) SPI, (b) GD, (c) MOM, (d) NAG, (e)-(f) CI-ε with two different thresholds.

To make the hypothesis above more intuitive and rigorous, and to further quantify how much SPI-Optimizer improves over other optimizers, we take a very simple convex quadratic function and the McCormick function as examples to visualize the optimization procedure. The representative P or PI based optimizers used for comparison are Momentum (MOM) [23], Gradient Descent (GD) [4], Nesterov Accelerated Gradient (NAG) [17], and Conditional Integration (CI-ε) [2] with different threshold parameters ε.

The convergence path of each optimizer applied to the two functions is depicted in Fig. 2 and Fig. 3 (sub-figures with green background). The optimal point is located at the origin (the red point), and the convergence process starts from the blue point with a maximum of 100 iterations. The loss is defined by the respective test functions. Apparently, GD and SPI oscillate the least, and NAG tends to be more stable than MOM. This intuitively validates the previous hypothesis: for the same type of optimization controller, the one with smaller state delay is highly likely to exhibit less oscillation.

Additionally, the convergence speed of all the methods can be inferred from the top charts in Fig. 2 and Fig. 3, where the naive conditional integration inspired controllers with different thresholds are marked as CI-ε. From the definition of CI-ε, we can tell that the performance of CI-ε is almost bounded by GD (CI-0) and MOM (CI-∞), which can also be observed in both Fig. 2 and Fig. 3. It is worth noting that the hyper-parameter ε aggravates parameter tuning, since it should be determined by the characteristics of the loss function, which depends on the training data, the network structure, and the metric between the prediction and the ground truth. Even in the toy 2D example, the extra hyper-parameter introduced by CI-ε does not reliably yield a favorable result.

In contrast, the proposed SPI-Optimizer takes precautions against the oscillation that may lead to an unstable system and slow convergence, by preventing large state delays, so that the fluctuation of the convergence curve is eased. Meanwhile, the convergence rate of SPI is clearly superior to that of the others, not only in the initial stages where the error function value is large, but also in the later stage when the error function is close to the minimum. Quantitatively, SPI achieves a substantial epoch reduction ratio on both 2D functions when the L2 norm of the residual hits the threshold 1e-5.
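This behaviour can be reproduced numerically. The sketch below compares MOM with the SPI rule of Section 3.2 on the 1D quadratic f(θ) = 0.5 θ², counting iterations until the residual drops below 1e-5 (the test function, start point, and hyper-parameters are our own illustrative choices, not the paper's experiment):

```python
import numpy as np

def run(step, theta0=2.0, lr=0.1, alpha=0.9, tol=1e-5, max_iter=2000):
    """Iterate on f(theta) = 0.5 * theta^2 until |theta| < tol; return #iters."""
    theta, v = np.array([theta0]), np.zeros(1)
    for t in range(max_iter):
        if np.linalg.norm(theta) < tol:
            return t
        g = theta.copy()                 # gradient of 0.5 * theta^2 is theta
        theta, v = step(theta, v, g, lr, alpha)
    return max_iter

def mom(theta, v, g, lr, alpha):
    v = alpha * v - lr * g               # plain momentum, Eqn. (1) with mu = 0
    return theta + v, v

def spi(theta, v, g, lr, alpha):
    v = np.where(g * v < 0, alpha * v, 0.0) - lr * g   # Eqn. (5) sketch
    return theta + v, v

iters_mom, iters_spi = run(mom), run(spi)
```

On this toy problem, the separated momentum reaches the tolerance in far fewer iterations than plain MOM, mirroring the oscillation suppression discussed above.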

Convergence analysis: More importantly, we conduct a theoretical analysis of the convergence of SPI-Optimizer, and show that under the condition that the loss function is strongly convex and smooth [21], and that the learning rate and momentum parameter lie within a proper range, 1) the convergence of SPI-Optimizer is strictly guaranteed, and 2) the convergence rate of SPI-Optimizer is faster than that of MOM. Due to limited space, the detailed analysis is presented in the supplementary material.

4 Experiments

Following the discussion on the 2D demos with harsh fluctuation, this section studies the performance of SPI-Optimizer on convolutional neural networks (CNNs), consisting of convolution layers and non-linear units.

In the first subsection, we compare with the most relevant solutions for the oscillation problem of the integral component. One is PID [1], which adds a derivative (D) term to MOM. The other counterpart is the conditional integration optimizer (CI-ε), which introduces a hyper-parameter ε defining the bound beyond which the momentum term is disabled. In contrast, the proposed SPI-Optimizer does not introduce any extra hyper-parameter and outperforms both of them.

Subsequently, experiments are conducted to evaluate the P controller (SGD) and the PI controllers (MOM and NAG) under various training settings, showing that SPI-Optimizer is more robust over a large learning rate range, different learning rate decay schemes, and various network structures.

Finally, SPI-Optimizer is compared with the state-of-the-art optimization methods in different datasets across different network structures to illustrate its better generalization ability.

Note that all the reported charts and numbers are averaged over 3 runs.

Figure 4: Performance comparison between SPI-Optimizer and PID [1] with different derivative coefficients on the MNIST dataset. Compared with SPI-Optimizer, which has no extra hyper-parameter, PID requires considerable tuning effort.

4.1 Comparison with the Oscillation Compensators

Comparison with PID: Fig. 4 depicts the performance using CNNs (the architecture contains 2 alternating stages of 5x5 convolution filters and 2x2 max pooling with stride of 1, followed by a fully connected output layer with dropout) on the handwritten digits dataset MNIST [15], consisting of 60k training and 10k test 28x28 gray images. Even though PID [1] can ease the oscillation problem of MOM, its hyper-parameter requires much empirical tuning to obtain a relatively good result. Specifically, a large range of derivative coefficients is tested; nevertheless, SPI-Optimizer performs better than PID in terms of faster convergence speed and a clear error reduction ratio.

One may notice that one curve (blue dashed line) does not appear in Fig. 4: although its coefficient follows the initial-value selection rule of [1] with a learning rate of 0.12, it does not lead to convergence. A similar situation exists for other learning rates. This again indicates that the hyper-parameter requires considerable tuning effort. One conjecture is that it should depend on many factors, such as the training data, the network structure, and the network loss metric.

Additionally, a comparison of the generalization ability across various network structures and datasets is listed in Tab. 2, where SPI-Optimizer also consistently outperforms PID. More importantly, SPI-Optimizer does not introduce any extra hyper-parameter.

(a) CIFAR-10
(b) CIFAR-100
Figure 5: Comparison between Conditional Integration (CI-ε) and SPI-Optimizer on the CIFAR datasets using AlexNet. The performance of CI-ε is almost bounded by SGD (CI-0) and MOM (CI-∞). SPI-Optimizer outperforms all of them by a large margin without introducing a hyper-parameter.
Methods CIFAR-10 CIFAR-100
lr=0.05 lr=0.1 lr=0.18 lr=0.05 lr=0.1 lr=0.18
SGD 24.812% 24.757% 25.522% 60.089% 60.240% 60.079%
MOM 24.929% 27.766% NaN 59.158% 68.312% NaN
NAG 24.753% 26.811% NaN 58.945% 67.091% NaN
SPI 24.257% 24.245% 25.823% 58.188% 57.179% 57.223%
Table 1: Three learning rate values are evaluated on both the CIFAR-10 and CIFAR-100 datasets with AlexNet; test errors are listed. The symbol “NaN” indicates that the optimization procedure cannot converge with that specific setting. Compared with the other P/PI optimizers (SGD, MOM, and NAG), SPI-Optimizer is more robust to larger learning rates while retaining the best performance.

Comparison with CI-ε: As observed in the toy examples of Section 3.3, the performance of CI-ε is almost bounded by SGD (CI-0) and MOM (CI-∞). Similarly, Fig. 5 leads to the same conclusion: CI-ε can hardly outperform SGD and MOM over a large search range of ε. The comparison is conducted on the larger datasets CIFAR-10 [13] (60000 32x32 colour images in 10 classes; 50000 training and 10000 test images) and CIFAR-100 [13] (like CIFAR-10, except with 100 classes containing 600 images each) using AlexNet [14]. Quantitatively, without any extra hyper-parameter, the proposed SPI-Optimizer reaches higher accuracy and faster convergence than CI-ε over a wide range of ε.

4.2 Comparison with P/PI Optimizers

(a) Fixed learning rate
(b) Learning rate decay
Figure 6: Trained on the CIFAR-100 dataset using AlexNet, SPI-Optimizer achieves good performance under different learning rate decay schemes. The horizontal dotted line corresponds to the highest accuracy of SGD, and the convergence speedup ratio of SPI-Optimizer is considerable.
(a) CIFAR-10 ResNet
(b) CIFAR-100 ResNet
(c) CIFAR-100 Wide-ResNet
Figure 7: Trained on the CIFAR-10 and CIFAR-100 datasets using ResNet and Wide-ResNet. SPI-Optimizer consistently performs the best, with faster convergence and a larger error reduction ratio than the second-best method.

High Adaptability of Learning Rate: The comparison across learning rates is shown in Tab. 1, where three learning rate values are evaluated on both CIFAR-10 and CIFAR-100 with AlexNet. The symbol “NaN” indicates that the optimization procedure cannot converge with that specific setting. Interestingly, SPI-Optimizer is the only surviving PI controller (momentum related optimizer) in the case of the large learning rate (0.18). We can safely conclude that, compared with the other P/PI optimizers (SGD, MOM, and NAG), SPI-Optimizer is more robust to larger learning rates while retaining the best performance.

Learning Rate Decay: We investigate the influence of different learning rate update schemes on performance. Firstly, Fig. 6(a) depicts the result with a fixed learning rate, corresponding to the best performance of the other methods in Tab. 1. Even though we choose the setting with the smallest performance gap to the others, the convergence speedup compared with the second-best method (SGD) is still clear; it is calculated based on the (epoch, error percentage) key points of SPI-Optimizer and SGD.

Then, as a comparison, results with the learning rate decayed by a factor of 0.1 every 30 epochs are reported in Fig. 6(b). Even though MOM and NAG rise faster at the very beginning, SPI-Optimizer still attains a clear accuracy improvement over the others, so the proposed method performs well under different learning rate decay schemes.

Optimization Methods | MNIST | C10(AlexNet) | C10(WRN) | C100(AlexNet) | C100(WRN) | mean error reduction ratio
SGD [22] 1.111% 24.757% 5.252% 60.079% 23.392% 7.9%
MOM [23] 1.720% 24.929% 4.804% 59.158% 21.684% 11.5%
NAG [17] 1.338% 24.753% 4.780% 58.945% 22.414% 8.3%
Adam [12] 1.110% 27.031% 10.254% 63.397% 32.548% 23.5%
RMSprop [25] 1.097% 30.634% 11.377% 65.704% 34.182% 27.5%
PID [1] 1.3% 24.672% 5.055% 58.946% 21.93% 8.3%
Addsign [3] 1.237% 24.811% 7.6% 60.482% 25.344% 16.4%
SPI 1.070% 24.245% 4.320% 57.118% 20.890% -
Table 2: Test errors of the state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100 with different network structures. The mean error reduction ratio column (up to 27.5%) averages the error reduction ratio across the different networks and datasets (along each row) w.r.t. the proposed method.

Convergence Speed and Accuracy: We compare SPI-Optimizer with the other optimizers on the CIFAR-10 and CIFAR-100 datasets with ResNet-56 [6] and Wide ResNet (WRN) [28]. [6] trained ResNet-56 on CIFAR-10 and CIFAR-100 by dropping the learning rate by a factor of 0.1 twice during the training procedure and using a weight decay of 0.0001; we use the same setting for the experiments in Fig. 7. WRN-16-8 is selected, which consists of 16 layers with a widening factor of 8. Following the training method of [28], we also use a weight decay of 0.0005 and a minibatch size of 128. The learning rate is dropped by a factor of 0.2 at the 60th, 120th, and 160th epochs, with a total budget of 200 epochs. For each optimizer we report the best test accuracy out of 7 different learning rate settings ranging from 0.05 to 0.4. From Fig. 7 we can see that SPI-Optimizer consistently performs the best, with faster convergence and a larger error reduction ratio than the second-best method.

4.3 Comparison with the State-of-the-art

To further demonstrate the effectiveness and efficiency of SPI-Optimizer, we compare it with several state-of-the-art optimizers on the MNIST, CIFAR-10, and CIFAR-100 datasets using AlexNet and WRN, as shown in Tab. 2. The compared methods include the P controller (SGD), PI controllers (MOM and NAG), Adam [12], RMSprop [25], PID [1], and Addsign [3], whose test errors are reported in the table. Additionally, the average of the error reduction ratios across the different networks and datasets (along each row) w.r.t. the proposed method is listed in the last column. Similar conclusions to those of the previous subsections can be made: SPI-Optimizer outperforms the state-of-the-art optimizers by a large margin in terms of faster convergence speed (up to 40% epochs reduction ratio) and more accurate classification results (up to 27.5% mean error reduction ratio). Such performance gains also verify the generalization ability of SPI-Optimizer across different datasets and networks.

5 Conclusion

By analyzing the oscillation effect of momentum-based optimizers, we know that the lag effect of the accumulated gradients can lead to large maximum overshoot and long settling time. Inspired by the recent work associating stochastic optimization with classical PID control theory, we propose the novel SPI-Optimizer, which can be interpreted as a type of conditional-integration PI controller that disables the integral / momentum term by examining the sign consistency between the residual and the integral term. Such adaptability further supports the generalization of the optimizer to various networks and datasets. Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 using various popular networks fully support the superior performance of SPI-Optimizer, leading to considerably faster convergence speed and more accurate results than classical optimizers such as MOM, SGD, and NAG.

References

  • [1] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8522–8531, 2018.
  • [2] K. J. Astrom and L. Rundqwist. Integrator windup and how to avoid it. In American Control Conference, pages 1693–1698. IEEE, 1989.
  • [3] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468, 2017.
  • [4] A. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
  • [5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
  • [9] D. J. Im, M. Tao, and K. Branson. An empirical analysis of the optimization of deep network loss surfaces. arXiv preprint arXiv:1612.04010, 2016.
  • [10] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. Surfacenet: An end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2307–2315, 2017.
  • [11] A. Karpathy. A peek at trends in machine learning. 2017.
  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [16] N. Loizou and P. Richtárik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
  • [17] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  • [18] K. Ogata. Discrete-time control systems, volume 2. Prentice Hall Englewood Cliffs, NJ, 1995.
  • [19] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [20] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. 2018.
  • [21] P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.
  • [22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • [23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533, 1986.
  • [24] R. Sutton. Two problems with back propagation and other steepest descent learning procedures for networks. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, pages 823–832, 1986.
  • [25] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • [26] R. Vitthal, P. Sunthar, and C. D. Rao. The generalized proportional-integral-derivative (pid) gradient descent back propagation algorithm. Neural Networks, 8(4):563–569, 1995.
  • [27] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
  • [28] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [29] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • [30] H. Zheng, L. Fang, M. Ji, M. Strese, Y. Özer, and E. Steinbach. Deep learning for surface material classification using haptic and visual information. IEEE Transactions on Multimedia, 18(12):2407–2416, 2016.

1 More 2D Examples

Aiming for an intuitive illustration of the performance of the proposed SPI-Optimizer, we present more 2D examples on several functions well known in the optimization community: the Rosenbrock function (Eqn. 7), the Goldstein-Price function (Eqn. 8), and a non-convex trigonometric function (Eqn. 9).

Recall that the loss is defined as the L2 distance to the optimal point. The top charts of Fig. 8, Fig. 9, and Fig. 10 depict the loss in log scale over epochs. The left column of subfigures illustrates the convergence path of each algorithm, and the right column shows the change of the horizontal residual w.r.t. the optimal point over epochs. The weight update consists of the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the residual curve (blue).

The Rosenbrock function:

f(x, y) = (1 − x)^2 + 100 (y − x^2)^2    (7)

The convergence path depicted in Fig. 8 starts at the blue dot, with the optimal point (1, 1) at the red dot. The learning rate is .
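For reference, the standard form of the Rosenbrock function and its analytic gradient can be written down directly; the coefficients a = 1, b = 100 below are the conventional choice and are our assumption about the exact variant used in Eqn. 7.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    # f(x, y) = (a - x)^2 + b * (y - x^2)^2, global minimum at (a, a^2)
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    # Analytic gradient of the function above
    x, y = p
    return np.array([-2.0 * (a - x) - 4.0 * b * x * (y - x ** 2),
                     2.0 * b * (y - x ** 2)])
```

The narrow curved valley around the minimum is what makes this function a standard stress test for momentum-style optimizers.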

The Goldstein-Price function:

f(x, y) = [1 + (x + y + 1)^2 (19 − 14x + 3x^2 − 14y + 6xy + 3y^2)] [30 + (2x − 3y)^2 (18 − 32x + 12x^2 + 48y − 36xy + 27y^2)]    (8)

The convergence path depicted in Fig. 9 starts at the blue dot, with the optimal point (0, −1) at the red dot. The learning rate is .
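The standard Goldstein-Price function (our assumption about the variant used in Eqn. 8) can be coded directly; its global minimum is f(0, −1) = 3.

```python
def goldstein_price(x, y):
    """Standard Goldstein-Price function; global minimum f(0, -1) = 3."""
    t1 = 1 + (x + y + 1) ** 2 * (19 - 14 * x + 3 * x ** 2
                                 - 14 * y + 6 * x * y + 3 * y ** 2)
    t2 = 30 + (2 * x - 3 * y) ** 2 * (18 - 32 * x + 12 * x ** 2
                                      + 48 * y - 36 * x * y + 27 * y ** 2)
    return t1 * t2
```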

A 2D trigonometric function:

(9)

The convergence path depicted in Fig. 10 starts at the blue dot, with the optimal point located at the red dot. The learning rate is .

Clearly, GD and SPI oscillate the least, and NAG tends to be more stable than MOM. This intuitively validates the earlier hypothesis: an optimizer with smaller state delay is less likely to oscillate about the optimal point.
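The state-delay argument can be checked numerically on a toy 1D quadratic (our illustrative choice, not an example from the paper): with momentum, the linear system governing the iterate has complex eigenvalues, so the trajectory crosses the optimum and oscillates, while plain gradient descent decays monotonically.

```python
def run(steps=40, lr=0.1, beta=0.9, momentum=True):
    """Minimize f(w) = w^2 / 2 (gradient = w) starting from w = 5."""
    w, v = 5.0, 0.0
    path = [w]
    for _ in range(steps):
        g = w                              # gradient of w^2 / 2
        v = beta * v + g if momentum else g
        w -= lr * v
        path.append(w)
    return path

gd_path = run(momentum=False)   # monotone decay, never crosses 0
mom_path = run(momentum=True)   # overshoots: crosses 0 around step 5
```

With these coefficients the momentum update (w, v) evolves under a matrix with eigenvalues 0.9 ± 0.3i, whose nonzero imaginary part is exactly the oscillation seen in the figures.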

(a) SPI
(b) GD
(c) MOM
(d) NAG
Figure 8: Convergence comparison among the proposed SPI-Optimizer, Gradient Descent (GD) [4], Momentum (MOM) [23], and Nesterov Accelerated Gradient (NAG) [17] on the Rosenbrock function (Eqn. 7). The top chart depicts the L2 distance to the optimal point in log scale over epochs. The middle row of subfigures illustrates the convergence path of each algorithm. The bottom row of subfigures shows the change of the horizontal residual w.r.t. the optimal point over epochs. The two components of the weight update are the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the blue curve.
(a) SPI
(b) GD
(c) MOM
(d) NAG
Figure 9: Convergence comparison among the proposed SPI-Optimizer, Gradient Descent (GD) [4], Momentum (MOM) [23], and Nesterov Accelerated Gradient (NAG) [17] on the Goldstein-Price function (Eqn. 8). The top chart depicts the L2 distance to the optimal point in log scale over epochs. The middle row of subfigures illustrates the convergence path of each algorithm. The bottom row of subfigures shows the change of the horizontal residual w.r.t. the optimal point over epochs. The two components of the weight update are the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the blue curve.
(a) SPI
(b) GD
(c) MOM
(d) NAG
Figure 10: Convergence comparison among the proposed SPI-Optimizer, Gradient Descent (GD) [4], Momentum (MOM) [23], and Nesterov Accelerated Gradient (NAG) [17] on the non-convex function (Eqn. 9). The top chart depicts the L2 distance to the optimal point in log scale over epochs. The middle row of subfigures illustrates the convergence path of each algorithm. The bottom row of subfigures shows the change of the horizontal residual w.r.t. the optimal point over epochs. The two components of the weight update are the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the blue curve.

2 Convergence Proof of SPI-Optimizer

Given the mathematical representation of SPI-Optimizer as

(10)

we introduce a diagonal matrix to replace the indicator function for ease of derivation. Its diagonal elements are all 1 or 0, indicating whether the momentum term is discarded on each dimension. Then we have , and Eqn. 10 can be rewritten as

(11)
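In implementation terms, multiplying by this 0/1 diagonal matrix is the same as applying an elementwise mask to the momentum vector. A minimal numerical check (the vectors below are made-up example values):

```python
import numpy as np

v = np.array([0.8, -0.3, 0.5])   # accumulated momentum (example values)
g = np.array([1.0, 0.2, -0.4])   # current gradient (example values)

# 1 where momentum and gradient agree in sign, 0 otherwise
mask = (np.sign(v) == np.sign(g)).astype(float)

masked = np.diag(mask) @ v       # diagonal-matrix form used in the proof
assert np.allclose(masked, mask * v)   # equals the elementwise form
```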

By further denoting , and as the global minimum point satisfying , it can be shown that several inequalities hold as follows,

We now study whether the sequence generated by SPI converges to the optimal parameter point via the following decomposition. Note that the three inequalities above are combined during the derivation.

Consider the typical case where the loss is a convex function, usually assumed to be strongly convex and smooth [21]. According to convex optimization theory, the following inequalities hold,

When stochastic gradients of the loss function are adopted, denoting the stochastic gradient in epoch as , the smoothness property can be assumed to hold for all stochastic gradients in the experiments (or the smoothness constant can be increased to satisfy it). Hence we have

Taking expectations on both sides, we have

The last step above is based on the inequality . For the case of SGD-momentum, we have . As a result, the bound of our SPI-Optimizer is tighter than that of SGD-momentum.

Now we investigate whether this bound is sufficient for convergence. By denoting , we have

According to Lemma 9 in Loizou’s study [16], as long as , the convergence of the sequence is guaranteed, which implies that

Considering that and are positive, we have . Hence its solution can be given as

This result indicates that the convergence of SPI-Optimizer is guaranteed for certain values of and . It is worth noting that the inequality was used in our derivation, whereas for SGD-momentum none of the components is discarded. Consequently, the bound of our SPI-Optimizer is tighter than that of SGD-momentum; in other words, SPI-Optimizer tends to converge faster than SGD-momentum under certain parameters.
