1 Introduction
Stochastic gradient-based optimization is widely used in many fields of science and engineering. In recent years, many scholars have compared SGD [1] with adaptive learning rate optimization methods [2, 3]. Wilson et al. [4] show that adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. Therefore, many excellent models [5, 6, 7] still use SGD for training. However, SGD is greedy for objective functions with many multiscale local convex regions (cf. Figure 1 of [8] or Fig. 1, left), because the negative gradient may not point toward the minimum on the coarse scale. Thus, the learning rate of SGD is difficult to set and significantly affects model performance [9].
Unlike greedy methods, dynamic programming (DP) [10] converges faster by solving simple subproblems decomposed from the original problem. Inspired by this, we propose the virtual gradient to construct a stochastic optimization method that combines the advantages of SGD and adaptive learning rate methods.
Consider a general objective function with the following composite form:
(1) 
where , functions and each component function of is first-order differentiable.
We note that:
(2) 
In addition, when we minimize and with the same iterative method, the former should converge faster because the structure of is simpler than . Based on these facts, we construct sequences and that converge to and , respectively, with equations:
(3) 
Fig. 1 (right) shows the relationship between and . The sequence can be obtained by using first-order iterative methods (see Sec. 5 for details):
(4) 
where is the learning rate and is an operator mapping .
The difficulty in constructing the operator is how to make condition (3) hold. Let , and let be an operator mapping ; we give the following iterations:
(5) 
(6) 
Since in Eqn.(6) occupies the same position as in the gradient descent method, we define as the virtual gradient of function for variable .
For Eqn.(6), it is easy to prove that condition (3) holds when is a linear mapping. If is a nonlinear mapping, let the second derivatives of be bounded and ; then, by (5), (6), and Taylor's formula, the following holds:
(7) 
In this case, condition (3) holds approximately.
According to the analysis above, the sequence converges similarly to in Eqn.(6) and Eqn.(5), but faster than directly minimizing the function with the same first-order method.
Note that the iterative method (6) is derived from the composite form (1), and this form is generally not unique, which is inconvenient for algorithm design. We begin by introducing the computational graph. It is a directed graph in which each node indicates a variable that may be a scalar, vector, matrix, tensor, or even a variable of another type, and each edge uniquely corresponds to an operation that maps one node to another. We sometimes annotate the output node with the name of the operation applied. In particular, the computational graph corresponding to the objective function is a directed acyclic graph (DAG) [11]. For example, for the computational graph of the objective function shown in Fig. 2 (a), the corresponding composite form (1) is:
(8) 
For a given general objective function, let correspond to a computational graph that maps the set of leaf values to the output value , where the set of hidden values is . Let . In this paper, the objective function in Eqn.(1) will be expressed as the following composite form:
(9) 
where , , and
For example, Eqn.(8) can be expressed as:
where
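To make the graph view concrete, the following is a minimal sketch of evaluating a computational graph in topological order. The graph, node names (`x1`, `x2`, `h1`, `h2`, `y`), and operations are hypothetical and chosen only for illustration; they are not the function of Eqn.(8) or Fig. 2.

```python
import math
from graphlib import TopologicalSorter

# Hypothetical DAG: leaves x1, x2; hidden h1 = x1 * x2, h2 = sin(x1); output y = h1 + h2.
# Each non-leaf node stores the operation that produces it and its input nodes.
graph = {
    "h1": ("mul", ["x1", "x2"]),
    "h2": ("sin", ["x1"]),
    "y": ("add", ["h1", "h2"]),
}
ops = {
    "mul": lambda a, b: a * b,
    "sin": math.sin,
    "add": lambda a, b: a + b,
}


def evaluate(graph, leaves):
    """Compute every node value, visiting nodes so that inputs come first."""
    predecessors = {node: inputs for node, (_, inputs) in graph.items()}
    values = dict(leaves)
    for node in TopologicalSorter(predecessors).static_order():
        if node in values:  # leaf values are given, not computed
            continue
        op_name, inputs = graph[node]
        values[node] = ops[op_name](*(values[i] for i in inputs))
    return values


values = evaluate(graph, {"x1": 2.0, "x2": 3.0})
print(values["y"])
```

Because the graph is acyclic, a topological order always exists, so each operation is applied only after all of its inputs are available.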
In deep learning, the gradient of the objective function is usually calculated by the automatic differentiation (AD) technique [12, 9]. The following example shows how to calculate the gradient of in Eqn.(8) using AD.
Find the Operation associated with output value and its input node , cf. Fig. 2 (b). Then, calculate the following gradients:

Perform the following steps by the partial order of :

Find the operation and its input nodes associated with hidden value , cf. Fig. 2 (c). Let:
where ’’ denotes that the quantity is treated as a constant during the calculation of gradients; this convention will not be restated later. Calculate the following gradients:
where .

Find the operation and its input nodes associated with hidden value , cf. Fig. 2 (d). Let:
Calculate the following gradients:
where .


Calculate the gradients of :
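The backward sweep described in the steps above can be sketched as follows. This is a generic reverse-mode AD illustration on a hypothetical three-operation graph (the node names and operations are assumptions, not those of Eqn.(8)): the output gradient is seeded with 1, the operations are visited in reverse topological order, and each operation adds its contribution to the gradients of its inputs.

```python
import math

# Hypothetical graph, listed in topological order:
# (node, inputs, forward function, local partial derivatives w.r.t. each input)
nodes = [
    ("h1", ["x1", "x2"], lambda a, b: a * b, [lambda a, b: b, lambda a, b: a]),
    ("h2", ["x1"], math.sin, [math.cos]),
    ("y", ["h1", "h2"], lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0]),
]


def forward(leaves):
    """Evaluate all nodes from the leaf values."""
    values = dict(leaves)
    for name, inputs, fn, _ in nodes:
        values[name] = fn(*(values[i] for i in inputs))
    return values


def backward(values, output="y"):
    """Accumulate d(output)/d(node) for every node, output first."""
    grads = {name: 0.0 for name in values}
    grads[output] = 1.0
    for name, inputs, _, partials in reversed(nodes):
        args = [values[i] for i in inputs]  # forward values are held constant here
        for inp, partial in zip(inputs, partials):
            grads[inp] += grads[name] * partial(*args)
    return grads


values = forward({"x1": 2.0, "x2": 3.0})
grads = backward(values)
print(grads["x1"], grads["x2"])
```

Note that `x1` feeds two operations, so its gradient is the sum of two contributions; this accumulation is what the partial order over hidden values guarantees is complete before a node's gradient is used.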
According to the analysis above, the computational graph of can be shown as Fig. 3 (a). If is a broadcast-like operator, the computational graph of virtual gradients can be shown as Fig. 3 (b), where , and is defined by Eqn.(10).
According to the definition of virtual gradient, for any :
(10) 
Obviously, where is an identity operator. The bprop operation is uniquely determined by .
2 Stochastic Virtual Gradients Descent Method
In this section, we use the accumulated squared gradient of RMSProp to construct the operator . According to Eqn.(7), Eqn.(3) holds when the mapping is linear. Based on this fact, we design the following SVGD algorithm. The functions and variables in the algorithm are given by Eqn.(9) and Eqn.(10).
SVGD works well in neural network training tasks (Fig. 9, 11, 12): it has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam. For the linear operations Conv2D [13] and matrix multiplication MatMul as follows:
there are and . Thus, SVGD also has lower memory requirements than RMSProp and Adam for deep neural networks.
For the same stochastic objective function, the learning rate at timestep t in SVGD has the following relationship with the step size in SGD and RMSProp:
3 Encapsulation
In this section, we introduce two methods to generate the computational graph of the virtual gradient. We begin by assuming that the objective function is (cf. Fig. 4 (b)), the set (cf. Fig. 4 (a)), and the function used to construct the computational graph of gradients is "gradients", cf. Fig. 4 (c).
We want to generate the computational graph of virtual gradients by using the function "gradients", Fig. 4 (c).
3.1 Extend the API libraries
As shown in Fig. 4, we begin by replacing with , where is a copy of but corresponds to a new bprop operation. Then, we call the function "gradients" to generate the computational graph of virtual gradients.
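The idea of swapping an operation for a copy with a different bprop can be illustrated with a toy chain of scalar operations. This is a plain-Python sketch, not TensorFlow code: the `Op` class, the `gradients` helper, and the replacement bprop are all hypothetical, and the replacement rule is an arbitrary placeholder rather than the virtual-gradient rule of Eqn.(10).

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Op:
    name: str
    forward: Callable[[float], float]
    # bprop(op_input, upstream_gradient) -> gradient passed to the op's input
    bprop: Callable[[float, float], float]


def gradients(ops, x):
    """Run the chain forward, then call each op's bprop in reverse order."""
    inputs = []
    for op in ops:
        inputs.append(x)
        x = op.forward(x)
    grad = 1.0
    for op, op_input in zip(reversed(ops), reversed(inputs)):
        grad = op.bprop(op_input, grad)
    return grad


square = Op("square", lambda x: x * x, lambda x, g: 2.0 * x * g)
triple = Op("triple", lambda x: 3.0 * x, lambda x, g: 3.0 * g)

# Same forward pass as `triple`, but with a placeholder bprop that passes the
# upstream gradient through unchanged.
triple_copy = replace(triple, name="triple_copy", bprop=lambda x, g: g)

true_grad = gradients([square, triple], 2.0)
modified_grad = gradients([square, triple_copy], 2.0)
print(true_grad, modified_grad)
```

The same `gradients` routine is called in both cases; only the bprop attached to one operation differs, which is the property this subsection relies on. In TensorFlow, `tf.custom_gradient` is one public mechanism for attaching a custom backward function to an operation, whereas the approach described here extends the core libraries instead.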
To realize this idea in code, we need to extend the core libraries to customize the new operation of and its bprop operation. Fig. 5 shows that we need to extend 3 libraries in the layered architecture of TensorFlow [14].
3.2 Modify the topology of the computational graph
4 Convergence Analysis
In this section, we will analyze the theoretical convergence of Eqn.(6) under some assumptions. Let be a random ()matrix, be an i.i.d. variable from . Then
(12) 
Proof.
Let be the unit vector whose i-th component is 1. Since is bilinear, we have
(13) 
Since is an i.i.d. variable from , the following holds:
Thus:
(14) 
∎
Fig. 7 verifies our lemma.
For defined in Lemma 4, if , then:
(15) 
Let and be second-order differentiable functions with random variables in their expressions. We set:
If each component of Jacobian matrix is an i.i.d. variable from , then, for and there exists a such that
Proof.
Without loss of generality, we can assume . Then, the Maclaurin series for around the point is:
Let . According to corollary 4:
∎
Although our convergence analysis in Thm.4 only applies under the assumption of a uniform distribution, we empirically found that SVGD often outperforms other methods in general cases.
5 Related Work
First-order methods. For general first-order methods, the moving direction of the variables can be regarded as a function of the stochastic gradient :

SGD: .

Momentum [15]: Let . Then:

RMSProp: Let . Then:

Adam: Let . Then:
However, in the SVGD method, cannot be written as a function of . Thus, SVGD is essentially not a first-order method.
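For reference, the four first-order directions listed above can be sketched in their commonly used textbook forms. The hyperparameter values (`beta`, `rho`, `beta1`, `beta2`, `eps`), the scalar test function, and the helper names are assumptions made for this illustration; the exact variants used in this paper may differ in details such as bias correction.

```python
import math


def sgd(g, state):
    """Direction is the stochastic gradient itself."""
    return g


def momentum(g, state, beta=0.9):
    """Heavy-ball style running sum of past gradients."""
    state["m"] = beta * state.get("m", 0.0) + g
    return state["m"]


def rmsprop(g, state, rho=0.9, eps=1e-8):
    """Gradient scaled by the root of the accumulated squared gradient."""
    state["v"] = rho * state.get("v", 0.0) + (1.0 - rho) * g * g
    return g / (math.sqrt(state["v"]) + eps)


def adam(g, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected first moment divided by root of the second moment."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * g * g
    m_hat = state["m"] / (1.0 - beta1 ** t)
    v_hat = state["v"] / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


def minimize(direction, grad, x=5.0, lr=0.1, steps=200):
    """Iterate x <- x - lr * direction(gradient, state)."""
    state = {}
    for _ in range(steps):
        x -= lr * direction(grad(x), state)
    return x


def grad(x):
    return 2.0 * x  # gradient of the toy objective f(x) = x^2


results = {d.__name__: minimize(d, grad) for d in (sgd, momentum, rmsprop, adam)}
print(results)
```

In every case the direction depends only on the current and past stochastic gradients, which is the property that distinguishes these methods from SVGD in the discussion above.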
Global minimum. A central challenge of nonconvex optimization is avoiding suboptimal local minima. Although it has been shown that the variable can sometimes converge to a neighborhood of the global minimum when noise is added [16, 17, 18, 19, 20], the convergence rate is still a problem. Note that the DP method has some probability of escaping “appropriately shallow” local minima because the moving direction of the variable is generated by solving several subproblems instead of the original problem. We use the computational graph and automatic differentiation to generate the subproblems in DP, as we did in the SVGD method.
6 Experiments
In this section, we evaluate our method on two benchmark datasets using several different neural network architectures. We train the neural networks using RMSProp, Adam, SGD, and SVGD to minimize the cross-entropy objective function with weight decay on the parameters to prevent overfitting. To be fair, each method minimizes a given objective function with its own learning rates. All extension libraries, algorithms, and experimental logs for this paper can be found at https://github.com/LizhengMathAi/svgd.
The following experiments show that SVGD has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam.
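As a reminder of what the training objective looks like, the following is a minimal sketch of a softmax cross-entropy loss with an L2 weight-decay penalty. The function names, the factor 1/2 in the penalty, and the coefficient value used in the example are assumptions; the weight-decay coefficients actually used in the experiments are not reproduced here.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def objective(batch_logits, labels, weights, weight_decay):
    """Mean cross-entropy over the minibatch plus an L2 penalty on the weights."""
    data_loss = sum(
        softmax_cross_entropy(z, y) for z, y in zip(batch_logits, labels)
    ) / len(labels)
    penalty = 0.5 * weight_decay * sum(w * w for w in weights)
    return data_loss + penalty


# Hypothetical one-example minibatch with two classes and two weights.
loss = objective([[0.0, 0.0]], [0], [1.0, 2.0], weight_decay=0.1)
print(loss)
```

The penalty term grows with the squared norm of the parameters, which is how weight decay discourages overfitting in the objective minimized by all four optimizers.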
6.1 Multilayer neural network
In our first set of experiments, we train a 5-layer neural network (Fig. 8) on the MNIST [21] handwritten digit classification task.
The model is trained with a minibatch size of 32 and weight decay of . In Table 1, we decay the learning rate at 1.6k and 3.6k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD found over hundreds of experiments.
                      RMSProp   Adam      SGD     SVGD (s=0.1)
iter: 0 to 1.6k       0.001     0.001     0.1     0.01
iter: 1.6k to 3.6k    0.0005    0.00005   0.05    0.005
iter: after 3.6k      0.00005   0.00005   0.01    0.001
test top-1 error      1.80%     1.94%     1.76%   1.60%
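The step-wise decay used in Table 1 can be expressed as a piecewise-constant schedule. The sketch below is an illustration only: the helper name `piecewise_constant` is hypothetical, and it assumes that each new rate takes effect exactly at the boundary iteration and that the three learning-rate rows of the table correspond to the three phases separated by the decay points at 1.6k and 3.6k iterations.

```python
from bisect import bisect_right


def piecewise_constant(step, boundaries, values):
    """Return values[i] for the i-th interval delimited by `boundaries`."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("need exactly one more value than boundaries")
    return values[bisect_right(boundaries, step)]


# SVGD column of Table 1, with decay points at 1.6k and 3.6k iterations.
boundaries = [1600, 3600]
svgd_rates = [0.01, 0.005, 0.001]

for step in (0, 1599, 1600, 3599, 3600, 5000):
    print(step, piecewise_constant(step, boundaries, svgd_rates))
```

The same helper covers the other optimizers and tables by swapping in the corresponding column of rates and decay points.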
6.2 Convolutional neural network
We train a VGG model (Fig. 10) on the CIFAR-10 [22] classification task. We follow the simple data augmentation in [23, 24] for training and evaluate on the original images for testing.
The model is trained with a minibatch size of 128 and weight decay of . In Table 2, we decay the learning rate at 12k and 24k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD found over hundreds of experiments.
                      RMSProp   Adam      SGD      SVGD (s=0.001)
iter: 0 to 12k        0.02      0.02      2.0      2.0
iter: 12k to 24k      0.01      0.01      0.5      0.5
iter: after 24k       0.002     0.005     0.005    0.005
test top-1 error      17.78%    18.02%    17.32%   17.07%
6.3 Deep neural network
We use the same hyperparameters as [24] to train a ResNet-20 model (0.27M parameters) on the CIFAR-10 classification task. In Table 3, we decay the learning rate at 12k and 24k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD found over hundreds of experiments.
                      RMSProp   Adam      SGD      SVGD (s=0.01)
iter: 0 to 12k        0.001     0.001     0.1      0.5
iter: 12k to 24k      0.0001    0.0001    0.01     0.02
iter: after 24k       0.0001    0.00005   0.001    0.01
test top-1 error      11.18%    11.12%    10.69%   8.62%
References
 [1] David Saad. Online algorithms and stochastic approximations. Online Learning, 5, 1998.
 [2] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning lecture 6a overview of minibatch gradient descent. Cited on, 14, 2012.
 [3] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [4] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

 [5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
 [6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [7] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
 [8] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.
 [9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 [10] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
 [11] Krishnaiyan Thulasiraman and Madisetti NS Swamy. Graphs: theory and algorithms. Wiley Online Library, 1992.
 [12] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.
 [13] Yann LeCun et al. Generalization and network design strategies. In Connectionism in perspective, volume 19. Citeseer, 1989.
 [14] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [15] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [16] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
 [17] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
 [18] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.
 [19] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
 [20] Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney. A comprehensive study of deep bidirectional lstm rnns for acoustic modeling in speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2462–2466. IEEE, 2017.

 [21] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
 [22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [23] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570, 2015.
 [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [25] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.