Stochastic gradient-based optimization is widely used in many fields of science and engineering. In recent years, many scholars have compared SGD with adaptive learning rate optimization methods [2, 3]. Wilson et al. show that adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. Therefore, many excellent models [5, 6, 7] still use SGD for training. However, SGD is greedy for objective functions with many multi-scale local convex regions (cf. Figure 1 of  or Fig. 1, left), because the negative gradient may not point toward the minimum on the coarse scale. Thus, the learning rate of SGD is difficult to set and significantly affects model performance.
Unlike greedy methods, dynamic programming (DP)  converges faster by solving simple sub-problems that are decomposed from the original problem. Inspired by this, we propose the virtual gradient to construct a stochastic optimization method that combines the advantages of SGD and adaptive learning rate methods.
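To make the greedy behaviour concrete, the following minimal sketch runs plain gradient descent on a one-dimensional multi-scale objective. The function `f`, the starting point, and the step size are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math

def f(x):
    # Coarse-scale bowl centred near x = 0 plus fine-scale ripples (assumed toy objective).
    return x * x + 0.5 * math.sin(20.0 * x)

def grad_f(x):
    # Exact derivative of f.
    return 2.0 * x + 10.0 * math.cos(20.0 * x)

def gradient_descent(x0, lr, steps):
    # Plain (greedy) gradient descent with a fixed learning rate.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x0 = 2.0
# The coarse-scale minimum lies to the left of x0, yet the negative gradient points right.
points_away = -grad_f(x0) > 0
# With a small learning rate the iterate settles in the nearest fine-scale basin.
x_small = gradient_descent(x0, lr=0.001, steps=2000)
```

With this small step size the iterate stops in a ripple to the right of the starting point, far from the coarse-scale minimum, which is the sensitivity to the learning rate described above.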
Consider a general objective function with the following composite form:
where , functions and each component function of is first-order differentiable.
We note that:
In addition, when we minimize and with the same iterative method, the former should converge faster because the structure of is simpler than . Based on these facts, we construct sequences and that converge to and , respectively, with equations:
where is the learning rate, is an operator of mapping .
The difficulty in constructing the operator is ensuring that condition (3) holds. Let , where is an operator of mapping . We give the following iterations:
Since in Eqn.(6) occupies the same position as in the gradient descent method, we define as the virtual gradient of function for variable .
For Eqn.(6), it is easy to prove that condition (3) holds when is a linear mapping. If is a nonlinear mapping, let the second derivatives of be bounded and . Then, by (5), (6), and Taylor's formula, the following holds:
In this case, condition (3) holds approximately.
Because this form is generally not unique, it is inconvenient for our algorithm design. We begin by introducing the computational graph. It is a directed graph in which each node represents a variable, which may be a scalar, vector, matrix, tensor, or a variable of another type, and each edge uniquely corresponds to an operation that maps one node to another. We sometimes annotate the output node with the name of the operation applied. In particular, the computational graph corresponding to the objective function is a directed acyclic graph (DAG). For example, for the computational graph of the objective function shown in Fig. 2 (a), the corresponding composite form (1) is:
For a given general objective function, let correspond to a computational graph that maps the set of leaf values to the output value , where the set of hidden values is . Let . In this paper, the objective function in Eqn.(1) will be expressed as the following composite form:
where , , and
For example, Eqn.(8) can be expressed as:
In deep learning, the gradient of the objective function is usually calculated by automatic differentiation (AD) [12, 9]. The following example shows how to calculate the gradient of in Eqn.(8) using AD.
Find the Operation associated with output value and its input node , cf. Fig. 2 (b). Then, calculate the following gradients:
Perform the following steps by the partial order of :
Find the Operation and its input nodes associated with hidden value , cf. Fig. 2 (c). Let:
where ’’ denotes that it is treated as a constant during the calculation of gradients; this convention is not restated below. Calculate the following gradients:
Find the Operation and its input nodes associated with hidden value , cf. Fig. 2 (d). Let:
Calculate the following gradients:
Calculate the gradients of :
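The backward sweep described in these steps can be sketched with a small scalar reverse-mode AD routine. This is a minimal sketch under stated assumptions: the class `Var`, the helper functions, and the toy objective `tanh(w * x + b) ** 2` are hypothetical and are not the function of Eqn.(8).

```python
import math

class Var:
    """A scalar node that records its inputs and local partial derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (input Var, d(self)/d(input))
        self.grad = 0.0

def add(a, b):
    return Var(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))

def tanh(a):
    t = math.tanh(a.value)
    return Var(t, ((a, 1.0 - t * t),))

def backward(output):
    # Collect nodes so that every node appears after all of its inputs.
    order, seen = [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)
    visit(output)
    # Sweep from the output back to the leaves, accumulating the chain rule.
    output.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += v.grad * local

# Toy objective (assumed): loss = tanh(w * x + b) ** 2
w, x, b = Var(0.5), Var(2.0), Var(0.5)
y = tanh(add(mul(w, x), b))
loss = mul(y, y)
backward(loss)
```

Each node is visited only after every node that consumes it, which mirrors processing the hidden values in the partial order of the graph.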
According to the analysis above, the computational graph of can be shown as Fig. 3 (a). If is a broadcast-like operator, the computational graph of virtual gradients can be shown as Fig. 3 (b), where , and is defined by Eqn.(10).
According to the definition of virtual gradient, for any :
Obviously, where is an identity operator. The bprop operation is uniquely determined by .
Then, Eqn.(6) can be written as the following virtual gradient descent iteration:
We show experimentally that SVGD (Alg. 1) has advantages over SGD, RMSProp, and Adam in training speed and test accuracy on multiple network models and datasets.
2 Stochastic Virtual Gradients Descent Method
In this section, we use the accumulated squared gradient from RMSProp to construct the operator . According to Eqn.(7), Eqn.(3) holds when the mapping is linear. Based on this fact, we design the following SVGD algorithm. The functions and variables in the algorithm are given by Eqn.(9) and Eqn.(10).
SVGD works well in neural network training tasks (Fig. 9, 11, 12): it has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam.
For the linear operation Conv2D  and matrix multiplication MatMul as follows:
we have and . Thus, SVGD also has lower memory requirements than RMSProp and Adam for deep neural networks.
For the same stochastic objective function, the learning rate at timestep t in SVGD has the following relationship with the step size in SGD and RMSProp:
In this section, we introduce two methods to generate the computational graph of virtual gradients. We begin by assuming that the objective function is (cf. Fig. 4 (b)), the set (cf. Fig. 4 (a)), and the function used to construct the computational graph of gradients is "gradients", cf. Fig. 4 (c).
We aim to generate the computational graph of virtual gradients by using the function "gradients", cf. Fig. 4 (c).
3.1 Extend the API libraries
As shown in Fig. 4, we begin by replacing with , where is a copy of but corresponds to a new bprop operation. Then, we call the function "gradients" to generate the computational graph of virtual gradients.
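The idea of copying an operation and attaching a different bprop can be sketched in plain Python as follows. This is a minimal sketch, not the extension library of this paper: `Op`, `Node`, `apply_op`, and this `gradients` function are hypothetical stand-ins, and the replacement bprop is an arbitrary placeholder (it passes the upstream signal through unchanged), not the virtual gradient rule.

```python
class Op:
    def __init__(self, name, forward, bprop):
        self.name = name
        self.forward = forward  # forward(*input_values) -> output value
        self.bprop = bprop      # bprop(input_values, upstream) -> tuple of input grads

    def with_bprop(self, new_bprop):
        # Same forward computation, different backward rule.
        return Op(self.name + "_copy", self.forward, new_bprop)

class Node:
    def __init__(self, value, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, tuple(inputs)

def apply_op(op, *inputs):
    return Node(op.forward(*(n.value for n in inputs)), op, inputs)

def gradients(output, leaves):
    # Reverse sweep that calls whatever bprop is attached to each operation.
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.inputs:
            visit(p)
        order.append(n)
    visit(output)
    grad = {id(n): 0.0 for n in order}
    grad[id(output)] = 1.0
    for n in reversed(order):
        if n.op is None:
            continue
        vals = [p.value for p in n.inputs]
        for p, g in zip(n.inputs, n.op.bprop(vals, grad[id(n)])):
            grad[id(p)] += g
    return [grad[id(leaf)] for leaf in leaves]

scale = Op("scale", lambda x: 100.0 * x, lambda vals, up: (100.0 * up,))
square = Op("square", lambda x: x * x, lambda vals, up: (2.0 * vals[0] * up,))
scale_copy = scale.with_bprop(lambda vals, up: (up,))  # placeholder backward rule

x = Node(0.5)
standard = gradients(apply_op(square, apply_op(scale, x)), [x])[0]
modified = gradients(apply_op(square, apply_op(scale_copy, x)), [x])[0]
```

The forward values of both graphs are identical; only the quantity returned by `gradients` changes, because the sweep calls the bprop attached to each operation.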
3.2 Modify the topology of the calculation graph
4 Convergence Analysis
In this section, we will analyze the theoretical convergence of Eqn.(6) under some assumptions. Let be a random ()-matrix, be an i.i.d. variable from . Then
Let be the unit vector whose i-th component is 1, and let be bilinear. Then
Since is an i.i.d. variable from , the following holds true:
Fig. 7 corroborates our lemma.
For defined in Lemma 4, if , then:
be second-order differentiable functions with random variables in their expression, we set:
If each component of Jacobian matrix is an i.i.d. variable from , then, for and there exists a such that
Without loss of generality, we can assume . Then, the Maclaurin series for around the point is:
Let . According to corollary 4:
5 Related Work
First-order methods. For general first-order methods, the moving direction of the variables can be regarded as a function of the stochastic gradient :
Momentum: Let . Then:
RMSProp: Let . Then:
Adam: Let . Then:
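The three cases above can be sketched as functions that map a history of stochastic gradients to a moving direction. This is a minimal scalar sketch using the commonly published update rules; the function names and the default hyperparameters are assumptions, and the exact forms may differ from the expressions used in this paper.

```python
import math

def momentum_direction(grads, beta=0.9):
    # Heavy-ball accumulation of past gradients.
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m

def rmsprop_direction(grads, decay=0.9, eps=1e-8):
    # Latest gradient scaled by the root of the accumulated squared gradient.
    r = 0.0
    for g in grads:
        r = decay * r + (1.0 - decay) * g * g
    return grads[-1] / (math.sqrt(r) + eps)

def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    # Bias-corrected first moment divided by the root of the second moment.
    # Expects a non-empty gradient history.
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)
```

In each case the direction depends only on the observed gradients, which is the property that characterizes a first-order method in this discussion.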
However, in the SVGD method, cannot be written as a function of . Thus, SVGD is not, in essence, a first-order method.
Global minimum. A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although it has been shown that the variable can sometimes converge to a neighborhood of the global minimum when noise is added [16, 17, 18, 19, 20], the convergence rate is still a problem. Note that the DP method has some probability of escaping “appropriately shallow” local minima, because the moving direction of the variable is generated by solving several sub-problems instead of the original problem. We use the computational graph and automatic differentiation to generate the sub-problems in DP, as we did in the SVGD method.
In this section, we evaluate our method on two benchmark datasets using several different neural network architectures. We train the neural networks using RMSProp, Adam, SGD, and SVGD to minimize the cross-entropy objective function with weight decay on the parameters to prevent over-fitting. To make the comparison fair, the learning rate is tuned separately for each method on a given objective function. All extension libraries, algorithms, and experimental logs in this paper can be found at the URL: https://github.com/LizhengMathAi/svgd.
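The training objective described above can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the 1/2 factor on the squared-norm penalty is a common convention assumed here rather than a detail stated in this paper.

```python
import math

def softmax_cross_entropy(logits, label):
    # Numerically stable log-sum-exp minus the logit of the true class.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def objective(batch_logits, labels, weights, weight_decay):
    # Mean cross-entropy over the mini-batch plus an L2 penalty on the parameters.
    data_loss = sum(
        softmax_cross_entropy(logits, y) for logits, y in zip(batch_logits, labels)
    ) / len(labels)
    penalty = 0.5 * weight_decay * sum(w * w for w in weights)
    return data_loss + penalty

value = objective([[0.0, 0.0]], [0], [1.0, 2.0], weight_decay=0.1)
```

For two equal logits the cross-entropy term equals log 2, so `value` is log 2 plus the penalty 0.25.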
The following experiments show that SVGD has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam.
6.1 Multi-layer neural network
The model is trained with a mini-batch size of 32 and weight decay of . We decay at 1.6k and 3.6k iterations. Table 1 summarizes the optimal learning rates for RMSProp, Adam, SGD, and SVGD, found over hundreds of experiments.
| | RMSProp | Adam | SGD | SVGD |
|---|---|---|---|---|
| test top-1 error | 1.80% | 1.94% | 1.76% | 1.60% |
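The step-wise decay used in this experiment can be sketched as a piecewise-constant schedule. This is a minimal sketch that assumes the decayed quantity is the learning rate. The function name `piecewise_lr` and the decay factor 0.1 are hypothetical placeholders, since the actual factor is not stated here.

```python
def piecewise_lr(step, base_lr, boundaries=(1600, 3600), factor=0.1):
    # Multiply the base rate by `factor` once for every boundary already reached.
    drops = sum(1 for b in boundaries if step >= b)
    return base_lr * factor ** drops

schedule = [piecewise_lr(s, base_lr=1.0) for s in (0, 1599, 1600, 3600)]
```

The same function covers the later experiments by passing `boundaries=(12000, 24000)`.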
6.2 Convolutional neural network
The model is trained with a mini-batch size of 128 and weight decay of . We decay at 12k and 24k iterations. Table 2 summarizes the optimal learning rates for RMSProp, Adam, SGD, and SVGD, found over hundreds of experiments.
| | RMSProp | Adam | SGD | SVGD |
|---|---|---|---|---|
| test top-1 error | 17.78% | 18.02% | 17.32% | 17.07% |
6.3 Deep neural network
We use the same hyperparameters as  to train the ResNet-20 model (0.27M params) on the CIFAR-10 classification task. We decay at 12k and 24k iterations. Table 3 summarizes the optimal learning rates for RMSProp, Adam, SGD, and SVGD, found over hundreds of experiments.
| | RMSProp | Adam | SGD | SVGD |
|---|---|---|---|---|
| test top-1 error | 11.18% | 11.12% | 10.69% | 8.62% |
-  David Saad. Online algorithms and stochastic approximations. Online Learning, 5, 1998.
-  Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14, 2012.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
-  Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
-  Krishnaiyan Thulasiraman and Madisetti NS Swamy. Graphs: theory and algorithms. Wiley Online Library, 1992.
-  Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.
-  Yann LeCun et al. Generalization and network design strategies. In Connectionism in perspective, volume 19. Citeseer, 1989.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
-  Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
-  Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
-  Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
-  Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.
-  Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
-  Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney. A comprehensive study of deep bidirectional lstm rnns for acoustic modeling in speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2462–2466. IEEE, 2017.
-  Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.