1 Introduction
Serving as a fundamental tool for solving practical problems in both scientific and engineering domains, a proper optimizer plays a vital role. Taking the widely celebrated success stories of deep learning [7, 8, 10, 30, 28, 6] as examples, stochastic gradient descent (SGD) serves as one of the most popular solvers, due to its ability to maintain a good balance between efficiency and effectiveness. The demands of training very deep networks call for even more efficient optimizers such as SGD-Momentum (MOM) [23]. However, MOM suffers from the oscillation problem [18], with non-negligible maximum overshoot and settling time. Such an oscillation phenomenon hinders the convergence of MOM, requiring more training time and resources. As a result, an optimizer that is both efficient and effective is urgently demanded yet very challenging to design, owing to the highly non-convex nature of the optimization problems.
Recently, some researchers have investigated the conventional optimization problem by associating it with the Proportional-Integral-Derivative (PID) model that is widely used in feedback control systems. By linking the calculation of errors in a feedback control system with the calculation of gradients in network updating, [1] shows that MOM can be treated as a special case of the classical PID controller with only the Proportional (P) and Integral (I) components. It further adds the Derivative (D) component to form a PID-based optimizer, which reduces the oscillation phenomenon at the cost of a troublesome hyperparameter induced by the D component. In other words, the calculated coefficient of the derivative term can hardly adapt to the huge diversity of network architectures and the different modalities of the training datasets.
In contrast to extending PI to PID directly, we explore "why does the momentum-based method oscillate about the optimal point?" via a thorough analysis of the inherent connection between MOM and the PI controller. The in-depth pre-analysis (Section 3.1) reveals that the fluctuation problem in the momentum-based method relates to the lag effect of the integral (I) term in the PI controller. Inspired by the conditional integration idea in the classical control community, we propose SPI-Optimizer, an integral-Separated PI controller based solver for stochastic Optimization. SPI-Optimizer separates the momentum term adaptively when the current and historical gradient directions are inconsistent.
More specifically, the insight of SPI-Optimizer can be explained as follows (more discussion in Section 3.3). For Conditional Integration used in the classical control community (denoted as CI_ε), the integral component is only considered as long as the magnitude of the feedback deviation (the gradient) is smaller than a threshold ε. That means SGD with only the proportional (P) term can be viewed as CI_{ε=0}. Similarly, MOM never separates out the integral (I) part and can be denoted as CI_{ε=∞}. While the oscillation phenomenon may be tuned by setting ε, the convergence speed of CI_ε cannot be improved by trading off the parameter ε, as it remains bounded by CI_{ε=0} and CI_{ε=∞}. Our SPI-Optimizer examines the sign consistency between the residual and the integral term before enabling the integral component, thus easing the oscillation phenomenon WITHOUT introducing an extra hyperparameter. As a result, it can be theoretically shown that SPI-Optimizer outperforms both CI_{ε=0} and CI_{ε=∞}, owning more generalization ability for different network structures across several popular datasets. We summarize the technical contributions as follows.

By associating MOM with the PI controller, we analytically show that the oscillation in the momentum-based method corresponds to the lag effect of the integral (I) term in the PI controller, which inspires us to deal with the I term instead of adding a D term, as the latter introduces an extra hyperparameter.

A novel SPI-Optimizer based on the integral-Separated PI controller is proposed, which separates the momentum term adaptively when the current and historical gradient directions are inconsistent. A detailed theoretical discussion on the convergence of SPI-Optimizer is provided.

SPI-Optimizer eliminates the oscillation phenomenon without introducing any extra hyperparameter and leads to considerably faster convergence and more accurate results on popular network architectures.
2 Related Work
Among the various optimization schemes, gradient based methods have served as the most popular optimizers for a tremendous range of optimization problems. Representative ones include gradient descent (GD) [4], stochastic gradient descent (SGD) [22], heavy ball (HB) [19], Nesterov accelerated gradient (NAG) [17], etc. While GD is the simplest one, it is restricted by redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD improves on it by sampling a random subset of the overall dataset, yet it has difficulty traversing ravines [24]. HB accelerates the iteration by adding a fraction of the previous update, which is further developed and named Momentum (MOM) [23]. NAG further uses the momentum term to update the parameters and corrects the gradient with some prescience. All of these classic optimization algorithms use fixed learning rates.
Lately, an increasing share of deep learning researchers train their models with adaptive learning rate methods [5, 29, 20, 12], driven by the requirement of speeding up training [11]. These methods adapt the update to each individual parameter, performing larger or smaller updates depending on the parameter's importance.
Despite the successful usage of adaptive methods in many applications, owing to their competitive performance and their ability to work well with minimal tuning, the recent findings of [27] show that hand-tuned SGD and MOM achieve better results at the same or even faster speed than adaptive methods. Furthermore, the authors also show that even for simple quadratic problems, adaptive methods find solutions that can be orders of magnitude worse at generalization than those found by SGD(M). They suggest that a possible explanation for the worse results of adaptive methods lies in the convergence to different local minima [9]. It is also noted that most state-of-the-art deep models such as ResNet [6] and DenseNet [8] are usually trained by momentum-based methods, as the adaptive methods generalize worse than SGD-Momentum, even when their solutions have better training performance.
On the other hand, some researchers investigate stochastic optimization by associating it with the classical Proportional-Integral-Derivative (PID) controller that is widely used in feedback control systems. The pioneering work [26] regarded the classical gradient descent algorithm as a PID controller that uses the Proportional (P) term merely; it added Integral (I) and Derivative (D) terms to achieve faster convergence. The latest work [1] interpreted momentum as a PI controller and added a Derivative (D) term, the predicted gradient difference, to reduce oscillation and improve SGD-Momentum on large-scale datasets. Unfortunately, either introducing I and D terms to GD, or introducing a D term to SGD-Momentum, intensifies the burden of hyperparameter tuning (which will be further elaborated in our experiments).
3 SPI-Optimizer
In this section, we first conduct a thorough pre-analysis of the oscillation phenomenon of momentum-based algorithms in Section 3.1. Aided by the association with the PI controller, the oscillation can be explained by the lag effect of the integral (I) term in the PI controller. We then propose a novel SPI-Optimizer that separates the I term from the PI controller adaptively, which eases the oscillation problem WITHOUT introducing an extra hyperparameter. Subsequently, in-depth discussions that further evaluate SPI-Optimizer are provided in Section 3.3.
3.1 Pre-analysis of Oscillation
As introduced in [23], the Momentum algorithm (MOM) works by accumulating an exponentially decayed moving average of past gradients. Mathematically, the momentum-based optimizer can be defined as

(1)  $v^{t+1} = \alpha v^{t} - r\,\nabla_{\theta}L(\theta^{t} + \mu\alpha v^{t}), \qquad \theta^{t+1} = \theta^{t} + v^{t+1}$

where $\theta^t$ denotes the weights, $r$ the learning rate, $\alpha$ the momentum coefficient, and $\mu = 0$ and $\mu = 1$ respectively define MOM and Nesterov Accelerated Gradient (NAG) [17].
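As a concrete illustration, the momentum update above can be sketched in a few lines of Python (a minimal sketch on a 1D quadratic loss; the toy loss, learning rate, and momentum coefficient are illustrative choices, not the paper's settings):

```python
def momentum_step(theta, v, grad_fn, r=0.1, alpha=0.9, mu=0.0):
    """One update of Eqn. (1): mu = 0 gives MOM, mu = 1 gives NAG,
    i.e. the gradient is evaluated at the look-ahead point theta + mu*alpha*v."""
    g = grad_fn(theta + mu * alpha * v)  # proportional (P) part: current gradient
    v_new = alpha * v - r * g            # integral (I) part: accumulated history
    return theta + v_new, v_new

# toy quadratic loss L(theta) = 0.5 * theta^2, whose gradient is theta
grad = lambda th: th
theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad)  # MOM updates
print(abs(theta) < 1e-2)
```

Setting `mu=1.0` in the loop yields NAG, whose only difference is the look-ahead point at which the gradient is evaluated.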
Although the momentum component can accelerate the convergence in the case of small and consistent gradients, it suffers from the oscillation phenomenon that the convergence path fluctuates about the optimal point, as shown in Fig. 0(a). Such oscillation can be quantitatively described by two concepts: the settling time $t_s$, defined as the time required for the curve to reach and stay within a range of a certain threshold $\delta$ around the optimal point $\theta^*$, and the maximum overshoot $\sigma$, describing the difference between the maximum peak value and the optimal value:

(2)  $\sigma = |\theta^{t_\sigma} - \theta^*|, \qquad t_s = \min\{t : |\theta^{t'} - \theta^*| \le \delta,\ \forall\, t' \ge t\}$

where $t_\sigma$ denotes the peak time.
As defined in Eqn. (1), two components contribute to the weight update, i.e., the momentum $\alpha v^t$ and the current gradient term $-r g^t$ with $g^t = \nabla_\theta L(\theta^t)$. The integral (momentum) term can introduce some non-negligible weight updates that are opposite to the gradient descent direction. In that case, the momentum will lag the update of the weights even when the weights should change their update direction. Analogous to feedback control, such a lag effect leads to a more severe oscillation phenomenon, i.e., the convergence path fluctuates about the optimal point with a larger maximum overshoot $\sigma$ and a longer settling time $t_s$.
We further take a commonly used convex function for illustration. In Fig. 0(a), the convergence path is composed of multiple weight updates shown in different colors. Considering only the horizontal axis, Fig. 0(b) depicts the residual of the convergence path w.r.t. the optimal point as a blue curve, together with the weight updates from the momentum and from the current gradient, shown as green and red arrows respectively. In the process of approaching the goal, we define several time stamps: $t_c$ as the time when the curve first exceeds the optimal point, $t_\sigma$ as the time when it reaches the maximum overshoot, and $t_s$ as the settling time.
The weight updates (green and red arrows) start with the same direction (up) until $t_c$. For the duration $[t_c, t_\sigma]$, because the weight exceeds the optimal point (the origin in this specific example), the gradient descent direction (red arrow) gets reversed. But owing to the large accumulated gradient value (green arrow), the weight update does not deviate from the current rising trend until $t_\sigma$, when the reversed gradients finally compensate the momentum. As a result, the momentum introduces a lag effect to the update of the weights in the period $[t_c, t_\sigma]$ and leads to a severe oscillation effect with large maximum overshoot and long settling time.
Compared with MOM, gradient descent (GD) oscillates less due to the lack of accumulated gradients, as shown in Fig. 0(c). Even though its maximum overshoot is much smaller than that of MOM, its settling time is unacceptably long. Due to the lag effect of the momentum within the period $[t_c, t_\sigma]$, the oscillation phenomenon of NAG is as severe as that of MOM. In Fig. 1, SPI-Optimizer obtains a considerable reduction ratio of convergent epochs.
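The maximum overshoot and the settling time can be measured directly on a toy 1D quadratic, reproducing the qualitative picture above (GD: no overshoot but long settling; MOM: large overshoot, shorter settling). All constants here are illustrative and are not taken from the paper:

```python
def run(alpha, r=0.01, steps=600, theta0=1.0):
    """Trace theta^t under v^{t+1} = alpha*v^t - r*theta^t (loss 0.5*theta^2).
    alpha = 0 recovers GD; alpha = 0.9 recovers MOM."""
    theta, v, path = theta0, 0.0, []
    for _ in range(steps):
        v = alpha * v - r * theta
        theta += v
        path.append(theta)
    return path

def overshoot_and_settling(path, delta=0.02):
    """Maximum overshoot past the optimum (here 0) and settling time within delta."""
    sigma = max(0.0, -min(path))
    t_s = next(t for t in range(len(path))
               if all(abs(x) <= delta for x in path[t:]))
    return sigma, t_s

gd_sigma, gd_ts = overshoot_and_settling(run(alpha=0.0))
mom_sigma, mom_ts = overshoot_and_settling(run(alpha=0.9))
print(gd_sigma <= mom_sigma)  # GD overshoots less ...
print(gd_ts > mom_ts)         # ... but settles much later
```

On this toy problem the momentum run crosses the origin and peaks on the far side before settling, exactly the lag effect described in the text.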
3.2 Integral-Separated PI Controller
The latest work [1] points out that if the optimization process is treated as a dynamic system, the Stochastic Gradient Descent (SGD) optimizer can be interpreted as a proportional (P) controller with the learning rate $r$ as the proportional coefficient. Then, MOM and NAG can be represented as Proportional-Integral (PI) controllers. As discussed in the previous subsection, the momentum term can lead to severe system oscillation about the optimal point owing to the lag effect of the integral / momentum. To ease the fluctuation phenomenon, [1] adds the derivative (D) term to form a PID controller with a network-invariant and dataset-invariant coefficient. However, it is questionable whether a universal D coefficient, rather than a model-based one, is applicable to diverse network structures. At the same time, the newly introduced hyperparameter for the derivative term needs more effort for empirical tuning. In contrast, we propose the integral-Separated PI Controller based Optimizer (SPI-Optimizer) to deal with the integral term directly, WITHOUT introducing any extra hyperparameter.
In a typical optimization problem, the loss function $L(\theta)$ is a metric measuring the distance between the desired output and the prediction given the weights $\theta$. The gradient of the weights can be used to update the weights until the optimal solution with zero gradient is reached. Hence the gradient w.r.t. the weights can be associated with the "error" in feedback control. Consequently, rethinking the problem from the perspective of control: although the PI controller responds faster than the P controller, it can easily lag and destabilize the dynamic system by accumulating large historical errors. Inspired by the conditional integration [2] strategy in the control community, which activates the integral term only when the feedback error falls within a predetermined bound so as to suppress the lag effect, a simple conditional integration optimizer (CI_ε) is proposed as follows:

(3)  $v^{t+1}_i = \mathbb{1}(|g^t_i| \le \varepsilon)\,\alpha v^t_i - r g^t_i, \qquad \theta^{t+1} = \theta^t + v^{t+1}$

where $g^t = \nabla_\theta L(\theta^t)$ and $\varepsilon$ is the introduced threshold for each dimension of the state vectors. Unfortunately, such a naive adoption has some drawbacks: (1) it requires extra effort to empirically tune the hyperparameter $\varepsilon$, and it generalizes weakly across different cost functions; (2) with a manually selected gradient threshold, the performance of CI_ε is almost certainly bounded by SGD (CI_{ε=0}) and MOM (CI_{ε=∞}).
Recall that what we expect is an optimizer with short rising time and small maximum overshoot $\sigma$. As illustrated in Fig. 1 previously, the momentum-based algorithm has a much shorter rising time than GD due to the accumulated gradients. However, the historical gradients lag the update of the weights in the period when the gradient direction gets reversed, and lead to severe oscillation about the optimal point. To ease the fluctuation, the proposed SPI-Optimizer isolates the integral component of the controller when the current and historical gradient directions are inconsistent, i.e.,
(4)  $\operatorname{sign}(-g^t_i) \ne \operatorname{sign}(v^t_i), \quad \text{i.e.,} \quad g^t_i\, v^t_i > 0.$
The SPI-Optimizer is then given by:

(5)  $v^{t+1}_i = \mathbb{1}(g^t_i v^t_i \le 0)\,\alpha v^t_i - r g^t_i, \qquad \theta^{t+1} = \theta^t + v^{t+1}.$
The key insight here is that the historical gradients will lag the update of the weights if the weights should not keep their previous direction, i.e., when $g^t_i v^t_i > 0$, leading to oscillation about the optimal point until the gradients compensate the momentum in the reversed direction. In this way, SPI-Optimizer can converge as fast as MOM and NAG yet with a much smaller maximum overshoot. On the other hand, we may also interpret SPI-Optimizer from the perspective of state delay.
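The SPI update can be sketched per coordinate in NumPy-style Python (a minimal sketch; the quadratic toy loss and the constants are illustrative, not the paper's settings):

```python
import numpy as np

def spi_step(theta, v, grad_fn, r=0.1, alpha=0.9):
    """One SPI update: the momentum (I term) is kept on a coordinate only when
    it is sign-consistent with the descent direction -g, i.e. v_i * g_i <= 0;
    otherwise the momentum on that coordinate is separated (zeroed)."""
    g = grad_fn(theta)
    keep = (v * g <= 0).astype(float)  # per-coordinate consistency indicator
    v_new = alpha * keep * v - r * g
    return theta + v_new, v_new

# toy quadratic: L(theta) = 0.5 * ||theta||^2, gradient = theta
grad = lambda th: th
theta, v = np.array([1.0, -1.0]), np.zeros(2)
for _ in range(200):
    theta, v = spi_step(theta, v, grad)
print(np.linalg.norm(theta) < 1e-2)
```

Whenever a coordinate crosses the optimum, the sign test fires on that coordinate alone and the accumulated history is dropped there, which is what keeps the overshoot small.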
State Delay: Recall that the objective of this feedback system is to let the gradient approach 0, yet we only have the observation of the current state $\theta^t$. This can be understood as a dynamic system with measurement latency or temporal delay. The larger the delay is, the more likely severe oscillation or an unstable system occurs. Analogously, we define the state delay in stochastic optimization as

(6)  $SD^t = \|\theta^t - \hat{\theta}^t\|, \quad \text{where } \hat{\theta}^t \text{ denotes the virtual state satisfying } -r\,\nabla_\theta L(\hat{\theta}^t) = v^{t+1}.$
Hypothesis: One hypothesis is that, for momentum-based optimizers (PI controllers), the optimizer with smaller state delay is highly likely to exhibit less oscillation, which is harmful to system stability. As $\alpha$ increases in Eqn. (6) from GD ($\alpha = 0$) to MOM, the state delay of MOM has a higher chance to be larger than that of GD, which explains why MOM usually oscillates more during the optimization process. Similarly, NAG can be understood as a more robust PI controller with smaller state delay, under the assumption that the gradients at $\theta^t$ and at the look-ahead point $\theta^t + \alpha v^t$ share the same probabilistic distribution. For SPI-Optimizer, when oscillation is detected, we reduce the state delay by assigning $v^{t+1}_i = -r g^t_i$. Otherwise, it remains a PI controller to speed up convergence.
3.3 Discussion
To make the hypothesis above more intuitive and rigorous, and to further quantify how much SPI-Optimizer improves over other optimizers, we take a very simple convex function and the McCormick function as examples to visualize the optimization procedure. The representative P or PI based optimizers used for comparison are Momentum (MOM) [23], Gradient Descent (GD) [22], Nesterov Accelerated Gradient (NAG) [17], and Conditional Integration (CI_ε) [2] with different thresholding parameters $\varepsilon$.
The convergence paths of the optimizers applied to the two functions are depicted in Fig. 2 and Fig. 3 (subfigures with green background). The optimal point locates at the origin (the red point), and the convergence process starts from the blue point with a maximum of 100 iterations. The loss is defined by the two test functions, respectively. Apparently, GD and SPI oscillate the least, and NAG tends to be more stable than MOM. This intuitively validates the previous hypothesis: for the same type of optimization controller, the one with smaller state delay is highly likely to oscillate less.
Additionally, the convergence speed of all the methods can be inferred from the top charts in Fig. 2 and Fig. 3, where the naive conditional integration inspired controllers with different thresholds are marked as CI_ε. From the definition of CI_ε, we can tell that its performance is almost bounded by GD (CI_{ε=0}) and MOM (CI_{ε=∞}), which can also be observed in both Fig. 2 and Fig. 3. It is worth noting that the hyperparameter $\varepsilon$ aggravates the parameter tuning, since it should be determined by the characteristics of the loss function, which depends on the training data, the network structure, and the metric between the desired output and the prediction. Even in the toy 2D example, the extra hyperparameter introduced by CI_ε cannot reliably yield a favorable result.
In contrast, the proposed SPI-Optimizer takes precautions against the oscillation that may lead to an unstable system and slow convergence, by preventing large state delays, so that the fluctuation of the convergence curve is eased. Meanwhile, the convergence rate of SPI is clearly superior to that of the others, not only in the initial stage where the value of the error function is significant, but also in the later stage when the error function is close to the minimum. Quantitatively, SPI achieves a considerable epoch reduction ratio on the two 2D functions when the L2 norm of the residual hits the threshold 1e-5.
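The epoch-count comparison can be reproduced qualitatively on a simple anisotropic quadratic (a sketch only; the paper's actual 2D test functions, start points, and learning rates are not reproduced here):

```python
import numpy as np

H = np.array([1.0, 10.0])        # per-axis curvatures of L = 0.5 * sum(H * theta^2)
grad = lambda th: H * th

def iters_to(threshold, alpha, separate, r=0.02, max_it=2000):
    """Iterations until ||theta|| <= threshold.
    alpha=0: GD; alpha=0.9, separate=False: MOM; separate=True: SPI."""
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for t in range(max_it):
        g = grad(theta)
        keep = (v * g <= 0).astype(float) if separate else 1.0
        v = alpha * keep * v - r * g
        theta = theta + v
        if np.linalg.norm(theta) <= threshold:
            return t + 1
    return max_it

gd = iters_to(1e-5, alpha=0.0, separate=False)
mom = iters_to(1e-5, alpha=0.9, separate=False)
spi = iters_to(1e-5, alpha=0.9, separate=True)
print(mom < gd and spi < gd)  # momentum-type methods hit the threshold first
```

The exact counts depend on the chosen curvatures and step size, but the ordering (momentum-type controllers reaching the residual threshold in fewer iterations than plain GD) mirrors the trend reported in the text.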
Convergence analysis: More importantly, we conduct a theoretical analysis of the convergence of SPI-Optimizer and show that, under the condition that the loss function is strongly convex and smooth [21], and that the learning rate and momentum parameter lie within a proper range, 1) the convergence of SPI-Optimizer can be strictly guaranteed, and 2) the convergence rate of SPI-Optimizer is faster than that of MOM. Due to limited space, the detailed analysis is presented in the supplementary material.
4 Experiments
Following the discussion on the 2D demos with harsh fluctuation, this section studies the performance of SPI-Optimizer on convolutional neural networks (CNNs), consisting of convolution layers and nonlinear units.
In the first subsection, we compare with the most relevant solutions for dealing with the oscillation problem of the integral component. One method is PID [1], which adds a derivative (D) term to MOM. The other counterpart is the conditional integration (CI_ε) optimizer, which introduces a hyperparameter $\varepsilon$ to define a bound outside which the momentum term is suppressed. In contrast, the proposed SPI-Optimizer does not introduce an extra hyperparameter and outperforms both of them.
Subsequently, experiments are conducted to evaluate the P controller (SGD) and the PI controllers (MOM and NAG) under various training settings, showing that SPI-Optimizer is more robust across a large learning rate range, different learning rate decay schemes, and various network structures.
Finally, SPI-Optimizer is compared with state-of-the-art optimization methods on different datasets across different network structures to illustrate its better generalization ability.
Note that all reported charts and numbers are averaged over 3 runs.
4.1 Comparison with the Oscillation Compensators
Comparison with PID: Fig. 4 depicts the performance using CNNs¹ on the handwritten digit dataset MNIST [15], consisting of 60k training and 10k test 28x28 gray images. Even though PID [1] can ease the oscillation problem of MOM, its hyperparameter requires much empirical tuning effort to get a relatively better result. Specifically, a large range of the D coefficient is tested; nevertheless, SPI-Optimizer performs better than PID in terms of faster convergence speed and a clear error reduction ratio. One may notice that one of the settings (blue dashed line) does not appear in Fig. 4, since it cannot lead to convergence even though it follows the initial-value selection equation in [1] with the learning rate set to 0.12. It is worth pointing out that a similar situation exists for other learning rates. This can be explained by the fact that the hyperparameter requires great effort to tune. One conjecture is that it should depend on many factors, such as the training data, the network structure, and the network loss metric.
¹The CNN architecture contains 2 alternating stages of 5x5 convolution filters and 2x2 max pooling with stride of 1, followed by a fully connected output layer. Dropout noise is applied on the fully connected layer.
Additionally, the comparison of the generalization ability across various network structures and datasets is listed in Tab. 2, where SPIOptimizer also constantly outperforms PID. More importantly, SPIOptimizer does not introduce extra hyperparameter.
Tab. 1. Test error of P/PI optimizers on CIFAR10 and CIFAR100 (AlexNet) under different learning rates.

Methods | CIFAR10 lr=0.05 | CIFAR10 lr=0.1 | CIFAR10 lr=0.18 | CIFAR100 lr=0.05 | CIFAR100 lr=0.1 | CIFAR100 lr=0.18
SGD     | 24.812%         | 24.757%        | 25.522%         | 60.089%          | 60.240%         | 60.079%
MOM     | 24.929%         | 27.766%        | NaN             | 59.158%          | 68.312%         | NaN
NAG     | 24.753%         | 26.811%        | NaN             | 58.945%          | 67.091%         | NaN
SPI     | 24.257%         | 24.245%        | 25.823%         | 58.188%          | 57.179%         | 57.223%
Comparison with CI: As observed in the toy examples in Section 3.3, the performance of CI_ε is almost bounded by SGD (CI_{ε=0}) and MOM (CI_{ε=∞}). Similarly, from Fig. 5 we draw the same conclusion that CI_ε can hardly outperform SGD and MOM over a large searching range of $\varepsilon$. The comparison is conducted on the larger datasets CIFAR10 [13]² and CIFAR100 [13]³ using AlexNet [14]. Quantitatively, without any extra hyperparameter, the proposed SPI-Optimizer reaches higher accuracy and converges faster than CI_ε over the whole tested range of $\varepsilon$.
²The CIFAR10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class; 50000 are for training and 10000 for testing.
³CIFAR100 is just like CIFAR10, except that it has 100 classes containing 600 images each.
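For reference, the CI_ε baseline of Eqn. (3) differs from SPI only in its gating condition, so its bounding behaviour is easy to verify in a sketch (the quadratic toy loss and constants are illustrative):

```python
import numpy as np

def ci_run(eps, steps=50, r=0.1, alpha=0.9):
    """Trajectory of CI_eps on the toy quadratic L = 0.5*theta^2 (grad = theta)."""
    theta, v, path = np.array([1.0]), np.zeros(1), []
    for _ in range(steps):
        g = theta
        gate = (np.abs(g) <= eps).astype(float)  # momentum kept only inside the bound
        v = alpha * gate * v - r * g
        theta = theta + v
        path.append(theta[0])
    return path

sgd_like = ci_run(eps=0.0)     # gate always closed -> pure GD/SGD steps
mom_like = ci_run(eps=np.inf)  # gate always open   -> plain MOM
print(abs(sgd_like[-1] - 0.9 ** 50) < 1e-12)  # matches the GD closed form
```

The two extreme settings reproduce GD (monotone, never crossing the optimum) and MOM (oscillating past it), illustrating why tuning ε can only interpolate between the two.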
4.2 Comparison with P/PI Optimizers
High Adaptability of Learning Rate: The comparison for different learning rates is demonstrated in Tab. 1, where three learning rate values are evaluated on both the CIFAR10 and CIFAR100 datasets with AlexNet. Note that the symbol "NaN" indicates that the optimization procedure cannot converge with that specific setting. Interestingly, SPI-Optimizer is the only surviving PI controller (momentum-related optimizer) in the case with the large learning rate lr = 0.18. We can safely conclude that, compared with the other P/PI optimizers (SGD, MOM, and NAG), SPI-Optimizer is more robust to larger learning rates while retaining the best performance.
Learning Rate Decay: We investigate the influence of different learning rate update schemes on the performance. Firstly, Fig. 5(a) depicts the result with a constant learning rate, corresponding to the best performance of the other methods in Tab. 1. Even though we choose the setting with a small performance gap to the others, SPI-Optimizer still converges notably faster than the 2nd best method (SGD), measured by the (epoch, error percentage) key points of the two methods.
Then, as a comparison, results with the learning rate decayed by a factor of 0.1 every 30 epochs are reported in Fig. 5(b). Even though MOM and NAG rise faster at the very beginning, SPI-Optimizer still achieves a clear accuracy improvement over the others. Hence the proposed method performs well under different learning rate decay schemes.
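The step-decay schedule used in this comparison is simply the following (a sketch; the base learning rate value is only an example):

```python
def step_decay(base_lr, epoch, drop=0.1, every=30):
    """Decay the learning rate by `drop` every `every` epochs (Fig. 5(b) scheme)."""
    return base_lr * drop ** (epoch // every)

for e in (0, 29, 30, 60):
    print(e, step_decay(0.1, e))
```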
Tab. 2. Test error of different optimizers across datasets and networks, and the mean error reduction ratio w.r.t. the proposed SPI.

Optimization Methods | MNIST (CNN) | CIFAR10 (AlexNet) | CIFAR10 (WRN) | CIFAR100 (AlexNet) | CIFAR100 (WRN) | mean error reduction ratio
SGD [22]     | 1.111% | 24.757% | 5.252%  | 60.079% | 23.392% | 7.9%
MOM [23]     | 1.720% | 24.929% | 4.804%  | 59.158% | 21.684% | 11.5%
NAG [17]     | 1.338% | 24.753% | 4.780%  | 58.945% | 22.414% | 8.3%
Adam [12]    | 1.110% | 27.031% | 10.254% | 63.397% | 32.548% | 23.5%
RMSprop [25] | 1.097% | 30.634% | 11.377% | 65.704% | 34.182% | 27.5%
PID [1]      | 1.3%   | 24.672% | 5.055%  | 58.946% | 21.93%  | 8.3%
Addsign [3]  | 1.237% | 24.811% | 7.6%    | 60.482% | 25.344% | 16.4%
SPI          | 1.070% | 24.245% | 4.320%  | 57.118% | 20.890% | —
Convergence Speed and Accuracy: We compare SPI-Optimizer with other optimizers on the CIFAR10 and CIFAR100 datasets with ResNet56 [6] and Wide ResNet (WRN) [28]. [6] trained ResNet56 on CIFAR10/CIFAR100 by dropping the learning rate by 0.1 at fixed fractions of the training procedure and using a weight decay of 0.0001; we use the same setting for the experiments in Fig. 7. WRN-16-8 is selected, which consists of 16 layers with a widening factor of 8. Following the training method of [28], we also use a weight decay of 0.0005 and a minibatch size of 128. The learning rate is dropped by a factor of 0.2 at the 60th, 120th, and 160th epoch with a total budget of 200 epochs. For each optimizer we report the best test accuracy out of 7 different learning rate settings ranging from 0.05 to 0.4. From Fig. 7 we can see that SPI-Optimizer consistently performs the best, with faster convergence and a larger error reduction ratio than the second-best method.
4.3 Comparison with the State-of-the-art
To further demonstrate the effectiveness and efficiency of SPI-Optimizer, we conduct a comparison with several state-of-the-art optimizers on the MNIST, CIFAR10, and CIFAR100 datasets using AlexNet and WRN, as shown in Tab. 2. The compared methods include the P controller (SGD), PI controllers (MOM and NAG), Adam [12], RMSprop [25], PID [1], and Addsign [3], whose test errors are reported in the table. Additionally, the average of the error reduction ratio across different networks and datasets (along each row) w.r.t. the proposed method is listed in the last column. Similar conclusions as in the previous subsections can be drawn: SPI-Optimizer outperforms the state-of-the-art optimizers by a large margin in terms of both convergence speed and classification accuracy. Such a performance gain also verifies the generalization ability of SPI-Optimizer across different datasets and networks.
5 Conclusion
By analyzing the oscillation effect of momentum-based optimizers, we showed that the lag effect of the accumulated gradients can lead to large maximum overshoot and long settling time. Inspired by recent work associating stochastic optimization with classical PID control theory, we proposed a novel SPI-Optimizer that can be interpreted as a type of conditional-integration PI controller, which suppresses the integral / momentum term by examining the sign consistency between the residual and the integral term. Such adaptability further supports the generalization of the optimizer to various networks and datasets. Extensive experiments on MNIST, CIFAR10, and CIFAR100 using various popular networks fully support the superior performance of SPI-Optimizer, which converges considerably faster and reaches more accurate results than classical optimizers such as MOM, SGD, and NAG on different networks and datasets.
References

[1] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8522–8531, 2018.
[2] K. J. Astrom and L. Rundqwist. Integrator windup and how to avoid it. In American Control Conference, pages 1693–1698. IEEE, 1989.

[3] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468, 2017.
[4] A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25:536–538, 1847.
 [5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [9] D. J. Im, M. Tao, and K. Branson. An empirical analysis of the optimization of deep network loss surfaces. arXiv preprint arXiv:1612.04010, 2016.
 [10] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. Surfacenet: An endtoend 3d neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2307–2315, 2017.
[11] A. Karpathy. A peek at trends in machine learning. Blog post, 2017.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [16] N. Loizou and P. Richtárik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
[17] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
 [18] K. Ogata. Discretetime control systems, volume 2. Prentice Hall Englewood Cliffs, NJ, 1995.
 [19] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[20] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 [21] P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.
 [22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
 [24] R. Sutton. Two problems with back propagation and other steepest descent learning procedures for networks. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, pages 823–832, 1986.
[25] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
 [26] R. Vitthal, P. Sunthar, and C. D. Rao. The generalized proportionalintegralderivative (pid) gradient descent back propagation algorithm. Neural Networks, 8(4):563–569, 1995.
 [27] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
 [28] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [29] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [30] H. Zheng, L. Fang, M. Ji, M. Strese, Y. Özer, and E. Steinbach. Deep learning for surface material classification using haptic and visual information. IEEE Transactions on Multimedia, 18(12):2407–2416, 2016.
1 More 2D Examples
Aiming for an intuitive illustration of the performance of the proposed SPI-Optimizer, we present more 2D examples on several functions well known in the optimization community: the Rosenbrock function (Eqn. 7), the Goldstein-Price function (Eqn. 8), and a nonconvex trigonometric function (Eqn. 9).
Recall the definition of the loss. The top charts of Fig. 8, Fig. 9, and Fig. 10 depict the loss in log scale over epochs. The left column of subfigures illustrates the convergence path of each algorithm, and the right column shows the change of the horizontal residual w.r.t. the optimal point over epochs. The weight update consists of the current gradient (red arrow) and the momentum (green arrow), which can be interpreted as two forces dragging the residual curve (blue).
The Rosenbrock function:
f(x, y) = (1 - x)^2 + 100 (y - x^2)^2    (7)
The convergence path depicted in Fig. 8 starts at the point (the blue dot) with the optimal point located at (the red dot). The lr is .
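For concreteness, the path tracing behind these figures can be sketched in a few lines of Python. The sketch assumes the standard 2D Rosenbrock form f(x, y) = (1 - x)^2 + 100 (y - x^2)^2 with global minimum at (1, 1); the start point, learning rate, and epoch count below are illustrative, not the exact values used for Fig. 8.

```python
import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    """Standard 2D Rosenbrock function; global minimum at (a, a^2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock function."""
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])

def gd_path(start, lr=1e-3, epochs=2000):
    """Plain gradient descent; returns the visited points (the convergence path)."""
    p = np.array(start, dtype=float)
    path = [p.copy()]
    for _ in range(epochs):
        p -= lr * rosenbrock_grad(*p)
        path.append(p.copy())
    return np.array(path)

path = gd_path(start=(-1.0, 1.0))
print(rosenbrock(*path[-1]) < rosenbrock(*path[0]))  # loss decreased along the path
```

Feeding the recorded path into any 2D plotting routine reproduces the style of the convergence-path subfigures; the same loop with a momentum buffer gives the MOM path.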
The Goldstein-Price function:
f(x, y) = [1 + (x + y + 1)^2 (19 - 14x + 3x^2 - 14y + 6xy + 3y^2)] [30 + (2x - 3y)^2 (18 - 32x + 12x^2 + 48y - 36xy + 27y^2)]    (8)
The convergence path depicted in Fig. 9 starts at the point (the blue dot) with the optimal point located at (the red dot). The lr is .
A 2D trigonometric function:
(9) 
The convergence path depicted in Fig. 10 starts at the point (the blue dot) with the optimal point located at (the red dot). The lr is .
GD and SPI oscillate the least, and NAG tends to be more stable than MOM. This intuitively validates the earlier hypothesis: an optimizer with smaller state delay is less likely to oscillate about the optimal point.
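This state-delay argument can be checked with a toy experiment. The sketch below uses a hypothetical 1D quadratic f(x) = x^2 / 2 with illustrative lr and momentum values (not the paper's settings), and compares plain momentum against a "separated" variant that discards the momentum whenever it opposes the current descent direction, measuring the maximum overshoot past the optimum at x = 0.

```python
def run(lr=0.5, beta=0.9, steps=100, separate=False):
    """Momentum descent on f(x) = x^2 / 2 starting from x = 1.
    With separate=True, the momentum term is discarded whenever it points
    against the current descent direction (the separation idea in the text)."""
    x, v = 1.0, 0.0
    xs = []
    for _ in range(steps):
        g = x  # gradient of x^2 / 2
        if separate and v * (-g) < 0:  # momentum opposes the descent direction
            v = 0.0
        v = beta * v - lr * g
        x += v
        xs.append(x)
    return xs

mom = run(separate=False)
spi = run(separate=True)
# min(xs) is the deepest overshoot below the optimum x* = 0
print(min(spi) > min(mom))  # True: the separated variant overshoots far less
```

On this toy problem the separated variant both overshoots less and accumulates a smaller total residual than plain momentum, mirroring the ordering observed in the figures.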
2 Convergence Proof of SPI-Optimizer
Given the mathematical representation of SPI-Optimizer as
(10) 
we introduce a diagonal matrix to replace the indicator function for ease of derivation. The diagonal elements in are all 1 or 0, indicating whether the momentum term is discarded on each dimension. Then we have , and Eqn. 10 can be rewritten as
(11) 
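As a sketch of how the diagonal matrix acts, the update can be written with a per-dimension 0/1 mask. The separation condition used below (discard momentum on dimensions where it opposes the current descent direction) is our reading of the scheme described above, and the hyperparameter values are illustrative.

```python
import numpy as np

def spi_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SPI-style update. `mask` plays the role of the diagonal matrix:
    a diagonal entry of 1 keeps the momentum on that dimension, 0 discards it."""
    # keep momentum only where it does not oppose the descent direction -grad
    mask = (v * (-grad) >= 0).astype(float)
    v = beta * mask * v - lr * grad
    theta = theta + v
    return theta, v

theta = np.array([1.0, -1.0])
v = np.array([0.5, 0.5])
grad = np.array([1.0, -1.0])  # descent direction is (-1.0, 1.0)
theta, v = spi_step(theta, v, grad)
# dim 0: v = 0.5 opposes -grad = -1.0, so the momentum is discarded there
#        and the update reduces to a plain SGD step on that coordinate
# dim 1: v = 0.5 agrees with -grad = 1.0, so the momentum is kept
```

Multiplying the momentum by the 0/1 mask is exactly multiplication by a diagonal matrix whose diagonal entries are 1 or 0, which is what makes the matrix form in Eqn. 11 convenient for the derivation.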
By further denoting , and as the global minimum point satisfying , the following inequalities can be shown to hold:
We now study whether the sequence generated by SPI converges to the optimal parameter point by decomposing as follows. Note that the three inequalities above are combined during the derivation.
Consider the typical case where the loss is a convex function, usually assumed to be strongly convex and smooth [21]. According to convex optimization theory, the following inequalities hold:
When stochastic gradients of the loss function are adopted, denoting the stochastic gradient in epoch as , the smoothness property can be assumed to hold for all stochastic gradients in the experiments, or the value of can be increased to satisfy it. Hence we have
Taking expectations on both sides, we have
The last step above is based on the inequality . For the case of SGD-momentum, we have . As a result, the bound of our SPI-Optimizer is tighter than that of SGD-momentum.
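The intuition behind the tighter bound is that the diagonal mask can only shrink the momentum term. With $A_t$ denoting the diagonal matrix introduced above, whose diagonal entries $a_{ii}$ are all in $\{0, 1\}$, the following holds for any vector $v$:

$$\|A_t v\|^2 = \sum_i a_{ii}^2 v_i^2 \le \sum_i v_i^2 = \|v\|^2,$$

with equality when $A_t = I$, i.e., when no momentum component is discarded, which is exactly the SGD-momentum case.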
Now we investigate whether this bound is sufficient for convergence. By denoting , we have
According to Lemma 9 in Loizou’s study [16], as long as , the convergence of sequence is guaranteed, which implies that
Considering that and are positive, we have . Hence the solution can be given as
The result indicates that the convergence of our SPI-Optimizer is guaranteed under certain values of and . It is worth noting that the inequality is used in our derivation, while for the SGD-momentum algorithm none of the components is discarded. Consequently, the bound of our SPI-Optimizer is tighter than that of SGD-momentum; in other words, our SPI-Optimizer tends to converge faster than SGD-momentum under certain parameters.
References
 [1] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8522–8531, 2018.
 [2] K. J. Astrom and L. Rundqwist. Integrator windup and how to avoid it. In American Control Conference, 1989, pages 1693–1698. IEEE, 1989.
 [3] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning, pages 459–468, 2017.
 [4] A. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
 [5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [8] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [9] D. J. Im, M. Tao, and K. Branson. An empirical analysis of the optimization of deep network loss surfaces. arXiv preprint arXiv:1612.04010, 2016.
 [10] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2307–2315, 2017.
 [11] A. Karpathy. A peek at trends in machine learning. 2017.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [16] N. Loizou and P. Richtárik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
 [17] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
 [18] K. Ogata. Discrete-time control systems, volume 2. Prentice Hall Englewood Cliffs, NJ, 1995.
 [19] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [20] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 [21] P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.
 [22] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 [23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
 [24] R. Sutton. Two problems with back propagation and other steepest descent learning procedures for networks. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, pages 823–832, 1986.
 [25] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 [26] R. Vitthal, P. Sunthar, and C. D. Rao. The generalized proportional-integral-derivative (PID) gradient descent back propagation algorithm. Neural Networks, 8(4):563–569, 1995.
 [27] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
 [28] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [29] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 [30] H. Zheng, L. Fang, M. Ji, M. Strese, Y. Özer, and E. Steinbach. Deep learning for surface material classification using haptic and visual information. IEEE Transactions on Multimedia, 18(12):2407–2416, 2016.