 # SVGD: A Virtual Gradients Descent Method for Stochastic Optimization

Inspired by dynamic programming, we propose Stochastic Virtual Gradient Descent (SVGD) algorithm where the Virtual Gradient is defined by computational graph and automatic differentiation. The method is computationally efficient and has little memory requirements. We also analyze the theoretical convergence properties and implementation of the algorithm. Experimental results on multiple datasets and network models show that SVGD has advantages over other stochastic optimization methods.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Stochastic gradient-based optimization is most widely used in many fields of science and engineering. In recent years, many scholars have compared SGD with some adaptive learning rate optimization methods[2, 3].  shows that adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development/test set. Therefore, many excellent models [5, 6, 7] still use SGD for training. However, SGD is greedy for the objective function with many multi-scale local convex regions (cf. Figure 1 of  or Fig. 1, left) because the negative of the gradient may not point to the minimum point on coarse-scale. Thus, the learning rate of SGD is difficult to set and significantly affects model performance.

Unlike greedy methods, dynamic programming (DP)  converges faster by solving simple sub-problems that decomposed from the original problem. Inspired by this, we propose the virtual gradient to construct a stochastic optimization method that combines the advantages of SGD and adaptive learning rate methods.

Consider a general objective function with the following composite form:

 J=F(σ),σ=f(θ)∈Ωσ, (1)

where , functions and each component function of is first-order differentiable.

We note that:

 F(σ∗)=F(f(θ∗)),σ∗=argminσ∈Ωσ F(σ),θ∗=argminθ∈Ωθ F(f(θ)). (2)

In addition, when we minimize and with the same iterative method, the former should converge faster because the structure of is simpler than . Based on these facts, we construct sequences and that converge to and , respectively, with equations:

 σ(t)=f(θ(t)),t=0,1,⋯. (3)

Fig. 1 (right) shows the relationship between and . The sequence can be obtained by using first-order iterative methods (see Sec.5 for details):

 σ(t+1)=σ(t)−αT∗∇σJ∣∣σ=σ(t), (4)

where is the learning rate, is an operator of mappping .

The difficulty in constructing operator is how to make the condition (3) holds true. Let , is an operator of mapping , we give the following iterations:

 σ(t+1)=σ(t)−αMMTT∇σJ∣∣σ=σ(t), (5)
 θ(t+1)=θ(t)−αMTT∇σJ∣∣θ=θ(t). (6)

Since in Eqn.(6) is equivalent to the position of in gradient descent method, we define as the virtual gradient of function for variable .

For Eqn.(6), it is easy to prove that the condition (3) holds when is a linear mapping. If is a nonlinear mapping, let the second-derivatives of be bounded and , owing to (5) and (6) and Taylor formula, the following holds true:

 \autobreak||σ(t+1)−f(θ(t+1))||=||(σ(t)−αMMTT∇σJ∣∣σ=σ(t))−f(θ(t)−αMTT∇σJ∣∣θ=θ(t))||=O(||αMTT∇σJ∣∣θ=θ(t)||22)=O(α2), (7)

In this case, the condition (3) holds, approximately.

According to the analysis above, the sequence yields similar convergence as in Eqn.(6) and Eqn.(5), but faster than minimizing the function with the same first-order method, directly.

Note that the iterative method (6) is derived based on the composite form (1

) and this form is generally not unique, it is inconvenient for our algorithm design. We begin by introducing the computational graph. It is a directed graph, where each node indicates a variable that may be a scalar, vector, matrix, tensor, or even a variable of another type, and each edge unique corresponds to an operation which maps a node to another. We sometimes annotate the output node with the name of the operation applied. In particular, the computational graph corresponding to the objective function is a DAG(directed acyclic graphs)

. For example, the computational graph of the objective function shown in Fig. 2 (a), the corresponding composite form (1) is:

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩J=F(σ),σ=[~σ1~σ2]=f(θ)=[~f1(~θ1,~θ2)~f2(~θ2,~θ3;~σ1)],θ=[~θT1,~θT2,~θT3]T~f1:Rnθ1×Rnθ2→Rnσ1,~f2:Rnθ2×Rnθ3×Rnσ1→Rnσ2 . (8) Figure 2: , , are nodes associated with leaf values, hidden values and output value, and , , are edges associated with operations ~f1,~f2,F.

For a given general objective function, let correspond to a computational graph that maps the set of leaf values to the output value , where the set of hidden values is . Let . In this paper, the objective function in Eqn.(1) will be expressed as the following composite form:

 J=F(σ),σ=⎡⎢ ⎢⎣~σ′1⋮~σ′M′⎤⎥ ⎥⎦=~f(θ)=⎡⎢ ⎢ ⎢ ⎢⎣~f1(~θ1,1,⋯,~θ1,N1;~σ′1,1,⋯,~σ′1,N′1)⋮~fM′(~θM′,1,⋯,~θM′,NM′;~σ′M′,1,⋯,~σ′M′,N′M′)⎤⎥ ⎥ ⎥ ⎥⎦ (9)

where , , and

For example, Eqn.(8) can be expressed as:

 J=F(σ),σ=[~σ′1~σ′2]=~f(θ)=[~f1(~θ1,1,~θ1,2)~f2(~θ2,1,~θ2,2;~σ2,1)],

where

 {~σ′1=~σ1,~θ1,1=~θ1,~θ1,2=~θ2~σ′2=~σ2,~σ2,1=~σ1,~θ2,1=~θ2,~θ2,2=~θ3 .

In deeping learning, the gradient of the objective function is usually calculated by the Automatic Differentiation (AD) technique

[12, 9]. Our following example introduces how to calculate the gradient of in Eqn.(8) using AD technique.

1. Find the Operation associated with output value and its input node , cf. Fig. 2 (b). Then, calculate the following gradients:

 gJw:=gw→J(~σ′1,~σ′2)=∇wF,w=~σ′1,~σ′2.
2. Perform the following steps by the partial order of :

1. Find Operation and it’s input nodes which associated with hidden value , cf. Fig. 2 (c). Let:

 F2(~θ2,1,~θ2,2;~σ′2,1):=F(⋅,~f2(~θ2,1,~θ2,2;~σ′2,1)),

where ’’ denotes that it is treated as a constant during the calculation of gradients and will not be declared later. Calculate the following gradients:

 g~σ′2w:=gw→~σ′2(∇~σ′2J;~θ2,1,~θ2,2;~σ′2,1)=∇wF2=(∂~σ′2∂w)T∇~σ′2J,w=~θ2,1,~θ2,2,~σ′2,1,

where .

2. Find Operation and it’s input nodes which associated with hidden value , cf. Fig. 2 (d). Let:

 F1(~θ1,1,~θ1,2):=F(~f1(~θ1,1,~θ1,2),~f2(⋅,⋅,~f1(~θ1,1,~θ1,2))).

 g~σ′1w:=gw→~σ′1(∇~σ′1J;~θ1,1,~θ1,2)=∇wF1=(∂~σ′1∂w)T∇~σ′1J,w=~θ1,1,~θ1,2,

where .

3. Calculate the gradients of :

 ∇~θ1J=g~σ′1~θ1,1,∇~θ2J=g~σ′1~θ1,2+g~σ′2~θ2,1,∇~θ3J=g~σ′2~θ2,2.

According to the analysis above, the computational graph of can be shown as Fig. 3 (a). If is a broadcast-like operator, the computational graph of vitrual gradients can be shown as Fig. 3 (b), where , and is defined by the Eqn.(10). Figure 3: are edges associated with operation + and x y denotes Tx y.

According to the definition of virtual gradient, for any :

 ∇(G,T)~θkJ =∑~σ′i∈Vθ→σ(∂~σ′i∂~θk)TTi∇~σ′iJ =∑~σ′i∈Vθ→σ∑jδ(~θi,j,~θk)g~θi,j→~σ′i(Ti,j∇~σ′iJ;~θi,1,⋯,~θi,N1;~σ′i,1,⋯,~σ′i,N′i). (10)

Obviously, where is an identity operator. The bprop operation is uniquely determined by .

Then, the Eqn.(6) can be written as the following virtual gradient descent iteration:

 ~θ(t+1)k=~θ(t)k−α∇(G,T)~θkJ∣∣θ=θ(t),∀~θk∈VGθ (11)

We prove that the SVGD (Alg. 1

) has advantages over SGD, RMSProp and Adam in training speed and test accuracy by experiments on multiple network models and datasets.

In Sec.2 we describe the operator and the SVGD algorithm of stochastic optimization. Sec.3 introduce two methods to encapsulate SVGD, and Sec.4 provides a theoretical analysis of convergence. Sec.6 compares our method with other methods by experiments.

## 2 Stochastic Virtual Gradients Descent Method

In this section, we will use the accumulate squared gradient in the RMSProp to construct the operator . According to Eqn.(7), Eqn.(3) holds when the mapping is linear. Based on this fact, we designed the following SVGD algorithm. The functions and variables in the algorithm are given by Eqn.(9) and Eqn.(10).

SVGD works well in neural network training tasks (Fig.

9, 11, 12), it has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam.

For the linear operation Conv2D  and matrix multiplication MatMul as follows:

 {Conv2D:(RN×RHin×RWin×RCin,RHk×RWk×RCin×RCout)→RN×RHout×RWout×RCoutMatMul:(RN×RCin,RCin×RCout)→RN×RCout ,

there are and . Thus, SVGD also has less memory requirements than RMSProp and Adam for deep neural networks.

For the same stochastic objective function, the learning rate at timestep t in SVGD has the following relationship with the stepsize in the SGD and RMSProp:

 αSVGD(t)≈αSGD(t),s∗αSVGD(t)≈αRMSProp(t).

## 3 Encapsulation

In this section, we introduce two methods to generate the computational graph of virtual gradient. We begin by assumming that the objective function is (cf. Fig. 4 (b)), the set (cf. Fig. 4 (a)), and the function used to construct the computational graph of gradients is "gradients", cf. Fig. 4 (c).

We hope to generate the computational graph of virtual gradients by using the function "gradients", Fig. 4 (c).

### 3.1 Extend the API libraries

As shown in Fig. 4, We begin by replacing with , where is a copy of but corresponds to a new bprob operation. Then, call the function "gradients" to generate the computational graph of virtual gradients.

In order to achieve the idea above, in programming, we need to extending core libraries to customize new operations of and its bprop operation. Fig. 5

shows that we need to extend 3 libraries in the layered architecture of TensorFlow

.

### 3.2 Modify the topology of the calculation graph

According to Eqn.(10) and Fig. 3, the computational graph of the virtual gradients can be obtained by adding new nodes on the computational graph of the gradients and reroute the inputs and outputs of new nodes. cf. Fig. 6.

## 4 Convergence Analysis

In this section, we will analyze the theoretical convergence of Eqn.(6) under some assumptions. Let be a random ()-matrix, be an i.i.d. variable from . Then

 fλ(v,u):=E[vTMTMu∣∣ ||M||2=λ]∝vTu,∀λ>0,∀v,u∈Rn. (12)
###### Proof.

Let be the unit vector whose i-th component is 1, is bilinear, Then

 fλ(v,u)=n∑i=1n∑j=1viujfλ(ei,ej)=vTCu,Cij:=E[eTiMTMej∣∣ ||M||2=λ]. (13)

Since be an i.i.d. variable from , the following holds true:

 {E[eT1MTMe1| ||M||2=λ]=⋯=E[eTnMTMen| ||M||2=λ]:=c0>0E[eTiMTMej| ||M||2=λ]=0,i,j∈{1,⋯,n} .

Thus:

 fλ(v,u)=c0vTu. (14)

Fig. 7 proof our lemma. Figure 7: The relationship between vTuand the estimate of fλ(v,u). Each point corresponds to a pair of random vector (v,u)and a random matrix set {Mt|t=1,⋯,10000}.

For defined in Lemma 4, if , then:

 E[vTMTMu]>0. (15)

Let and

be second-order differentiable functions with random variables in their expression, we set:

 vTTv>0,i∈1,⋯,m, ∀v∈Rm∖{0}.

If each component of Jacobian matrix is an i.i.d. variable from , then, for and there exists a such that

 E[F(f(θ(t+1)))−F(f(θ(t)))]<0,
###### Proof.

Without loss of generality, we can assume . Then, the Maclaurin series for around the point is:

 F(f(θ(t+1)))−F(f(θ(t))) =−α(∇θJ|θ=θ(t))TMT∇σJ|θ=θ(t)+o(α) =−α(∇σJ|θ=θ(t))TMTMT∇σJ|θ=θ(t)+o(α).

Let . According to corollary 4:

 E[F(f(θ(t+1)))−F(f(θ(t)))]=−αE[vTMTMTv]+o(α)<0.

Although our convergence analysis in Thm.4

only applies to the assumption of uniform distribution, we empirically found that SVGD often outperforms other methods in general cases.

## 5 Related Work

First-order methods. For general first-order methods, The moving direction of the variables can be regarded as the function of the stochastic gradient :

• SGD:    .

• Momentum:    Let . Then:

 p(t)=Tg(t):=−m(t+1).
• RMSProp:    Let . Then:

 p(t)=Tg(t):=−g(t)√δ+r(t+1).

 p(t)=Tg(t):=−s(t+1)/(1−ρt1)√δ+r(t+1)/(1−ρt2).

However, in SVGD method, cannot be written as a function of . Thus, SVGD is not essentially a first-order method.

Global minimum. A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although it has been shown that the variable can sometimes converges to a neighborhood of the global minimum by adding noise[16, 17, 18, 19, 20]

, the convergence rate is still a problem. Note that the DP method has some probability to escape “appropriately shallow” local minima because the moving direction of the variable is generated by solving several sub-problems instead of the original problem. We use computational graph and automatic differentiation to generate the sub-problems in DP, such as what we did in the SVGD method.

## 6 Experiments

In this section, we evaluated our method on two benchmark datasets using several different neural network architectures. We train the neural networks using RMSProp, Adam, SGD, and SVGD to minimize the cross-entropy objective function with weight decay on the parameters to prevent over-fitting. To be fair, for different methods, a given objective function will be minimized with different learning rates. All extension libs, algorithm, and experimental logs in this paper can be found at the URL: https://github.com/LizhengMathAi/svgd.

The following experiments show that SVGD has a relatively faster convergence rate and better test accuracy than SGD, RMSProp, and Adam.

### 6.1 Multi-layer neural network

In our first set of experiments, we train a 5-layer neural network (Fig. 8) on the MNIST  handwritten digit classification task. Figure 8: MLP architecture for MNIST with 5 parameter layers (245482 params).

The model is trained with a mini-batch size of 32 and weight decay of . In Table 1, we decay at 1.6k and 3.6k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD by hundreds of experiments.

In Table 1 and Fig. 9 we compare the error rates and their descent process process on the CIFAR-10 test set, respectively. Figure 9: Comparison of first-order methods on MNISTdigit classification for 3.75 epochs.

### 6.2 Convolutional neural network

We train a VGG model (Fig. 10) on the CIFAR-10  classification task and follow the simple data augmentation in [23, 24] for training and evaluate the original image for testing. Figure 10: VGG model architecture for CIFAR-10 with 6 parameter layers (46126 params).

The model is trained with a mini-batch size of 128 and weight decay of . In Table 2, we decay at 12k and 24k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD by hundreds of experiments.

In Table 2 and Fig. 11 we compare the error rates and their descent process on the CIFAR-10 test set, respectively. Figure 11: Comparison of first-order methods on CIFAR-10 dataset for 90 epochs.

### 6.3 Deep neural network

We use the same hyperparameters with

 to train ResNet-20 model(0.27M params) on the CIFAR-10 classification task. In Table 3, we decay at 12k and 24k iterations and summarize the optimal learning rates for RMSProp, Adam, SGD, and SVGD by hundreds of experiments.

In Table 3 and Fig. 12 we compare the error rates and their descent process on the CIFAR-10 test set, respectively. The top-1 error fluctuations in experiments do not exceed 1%. See  for more information on the CIFAR-10 experimental record. Figure 12: Comparison of first-order methods on CIFAR-10 dataset for 125 epochs.